talk-data.com

Topic: Data Modelling

Tags: data_governance · data_quality · metadata_management (355 items tagged)

Activity Trend: peak of 18 activities per quarter, 2020-Q1 through 2026-Q1

Activities

355 activities · Newest first

Summary

The current trend in data management is to centralize the responsibility for storing and curating the organization’s information in a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.
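To make the idea concrete, here is a purely illustrative Python sketch (not from the episode) of how a domain team might describe a data product, with ownership, an output port, and discovery metadata as first-class attributes. Every name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A domain-owned data product: the unit of architecture in a data mesh."""
    name: str                      # e.g. "orders.daily_summary"
    owner_domain: str              # the business unit accountable for it
    output_ports: list = field(default_factory=list)  # where consumers read it
    schema: dict = field(default_factory=dict)        # published contract
    sla_freshness_hours: int = 24  # quality guarantee advertised to consumers


# A sales domain publishes a product that other teams discover and consume,
# instead of routing everything through a central data lake team.
orders_summary = DataProduct(
    name="orders.daily_summary",
    owner_domain="sales",
    output_ports=["s3://sales-data-products/orders/daily_summary/"],
    schema={"order_date": "date", "total_orders": "int", "revenue": "decimal"},
)
```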

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

To grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?

What are some of the organizational and industry trends that tend to lead to this solution?

You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?

In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?

What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?

One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?

A corollary to the issue of discovery is that of access.

Implementing CDISC Using SAS, 2nd Edition

For decades researchers and programmers have used SAS to analyze, summarize, and report clinical trial data. Now Chris Holland and Jack Shostak have updated their popular Implementing CDISC Using SAS, the first comprehensive book on applying clinical research data and metadata to the Clinical Data Interchange Standards Consortium (CDISC) standards. Implementing CDISC Using SAS: An End-to-End Guide, Revised Second Edition, is an all-inclusive guide on how to implement and analyze Study Data Tabulation Model (SDTM) and Analysis Data Model (ADaM) data and prepare clinical trial data for regulatory submission. Updated to reflect the 2017 FDA mandate for adherence to CDISC standards, this new edition covers creating and using metadata, developing conversion specifications, implementing and validating SDTM and ADaM data, determining solutions for legacy data conversions, and preparing data for regulatory submission. The book covers products such as Base SAS, SAS Clinical Data Integration, and the SAS Clinical Standards Toolkit, as well as JMP Clinical. Topics covered in this edition include an implementation of the Define-XML 2.0 standard, new SDTM domains, validation with Pinnacle 21 software, event narratives in JMP Clinical, SDTM and ADaM metadata spreadsheets, and of course new versions of SAS and JMP software. The second edition was revised to add the latest C-Codes from the most recent release, as well as to update the make_define macro that accompanies this book in order to add the capability to handle C-Codes. The metadata spreadsheets were updated accordingly. Any manager or user of clinical trial data in this day and age is likely to benefit from knowing how to either put data into a CDISC standard or analyze and find data once it is in a CDISC format. If you are one such person--a data manager, clinical and/or statistical programmer, biostatistician, or even a clinician--then this book is for you.

Summary

One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna, and in this episode he explains the unique capabilities of Fauna, compares its consensus and transaction algorithm to those used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality, which simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple-to-manage data layer that will scale with your business.
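For a sense of what that temporality looks like in practice, here is a minimal sketch using the classic FQL Python driver (faunadb); the posts collection and document ID are hypothetical, and the exact API may differ across driver versions.

```python
from faunadb import query as q
from faunadb.client import FaunaClient

client = FaunaClient(secret="your-secret")  # placeholder credential

# Read the current state of a (hypothetical) document.
current = client.query(q.get(q.ref(q.collection("posts"), "1")))

# Read the same document as it existed at a point in the past:
# at() evaluates the wrapped expression against a historical snapshot.
historical = client.query(
    q.at(
        q.time("2019-01-01T00:00:00Z"),
        q.get(q.ref(q.collection("posts"), "1")),
    )
)
```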

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.

Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by explaining what FaunaDB is and how it got started?

What are some of the main use cases that FaunaDB is targeting?

How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?

Can you describe the architecture of FaunaDB and how it has evolved?

The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?

What are some of the edge cases that users should be aware of?

How are conflicts managed in Fauna?

What is the underlying storage layer?

How is the query layer designed to allow for different query patterns and model representations?

How does data modeling in Fauna compare to that of relational or document databases?

Can you describe the query format?

What are some of the common difficulties or points of confusion around interacting with data in Fauna?

What are some application design patterns that are enabled by using Fauna as the storage layer?

Given the ability to replicate globally, how do you mitigate latency when interacting with the database?

What are some of the most interesting or unexpected ways that you have seen Fauna used?

When is it the wrong choice?

What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?

What do you have in store for the future of Fauna?

Contact Info

@evan on Twitter LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Fauna Ruby on Rails CNET GitHub Twitter NoSQL Cassandra InnoDB Redis Memcached Timeseries Spanner Paper DynamoDB Paper Percolator ACID Calvin Protocol Daniel Abadi LINQ LSM Tree (Log-structured Merge-tree) Scala Change Data Capture GraphQL

Podcast.__init__ Interview About Graphene

Fauna Query Language (FQL) CQL == Cassandra Query Language Object-Relational Databases LDAP == Lightweight Directory Access Protocol Auth0 OLAP == Online Analytical Processing Jepsen distributed systems safety research

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

R Statistics Cookbook

The "R Statistics Cookbook" offers a comprehensive guide to solving statistical problems using R 3.5. Through over 100 practical recipes, you'll learn to perform essential statistical analyses, such as t-tests and regression, while mastering techniques for data modeling, nonparametric methods, and machine learning. This resource is tailored for tackling statistics-centric challenges across industries. What this Book will help me do Confidently use R 3.5 to perform statistical analyses that meet your data needs. Apply various hypothesis testing methods, such as t-tests and ANOVA, effectively. Model and forecast data using time series analysis and mixed-effects modeling. Implement regression techniques, including Bayesian regression, for actionable insights. Leverage robust statistics and the caret package for machine learning applications in R. Author(s) None Juretig, a professional statistician and experienced educator, has an extensive background in applying statistical methods to real-world problems using R. Their writing combines deep technical knowledge with an approachable teaching style, making complex statistical concepts accessible to learners of varying levels. Who is it for? If you're a statistician, data scientist, researcher, or analyst with proficiency in R programming and foundational knowledge of linear algebra, this book is crafted for you. It caters to professionals looking to solidify their statistical knowledge while exploring practical, real-world applications. Whether seeking to apply advanced methods or refine your statistical approaches, this guide provides actionable insights.

Hands-On Business Intelligence with Qlik Sense

"Hands-On Business Intelligence with Qlik Sense" teaches you how to harness the powerful capabilities of Qlik Sense to build dynamic, interactive dashboards and analyze data effectively. This book provides comprehensive guidance, from data modeling to creating visualizations, geospatial analysis, forecasting, and sharing insights across your organization. What this Book will help me do Understand the core concepts of Qlik Sense for building business intelligence dashboards. Master the process of loading, reshaping, and modeling data for analysis and reporting. Create impactful visual representations of data using Qlik Sense visualization tools. Leverage advanced analytics techniques, including Python and R integration, for deeper insights. Utilize Qlik Sense GeoAnalytics to perform geospatial analysis and produce location-based insights. Author(s) The authors of "Hands-On Business Intelligence with Qlik Sense" are experts in Qlik Sense and data analysis. They collectively bring decades of experience in business intelligence development and implementation. Their practical approach ensures that readers not only learn the theory but can also apply the techniques in real-world scenarios. Who is it for? This book is designed for business intelligence developers, data analysts, and anyone interested in exploring Qlik Sense for their data analysis tasks. If you're aiming to start with Qlik Sense and want a practical and hands-on guide, this book is ideal. No prior experience with Qlik Sense is necessary, but familiarity with data analysis concepts is helpful.

Mastering Hadoop 3

"Mastering Hadoop 3" is your in-depth guide to understanding and mastering the advanced features of the Hadoop ecosystem. With a focus on distributed computing and data processing, this book covers essential tools such as YARN, MapReduce, and Apache Spark to help you build scalable, efficient data pipelines. What this Book will help me do Gain a comprehensive understanding of Hadoop Distributed File System (HDFS) and YARN for effective resource management. Master data processing with MapReduce and learn to integrate with real-time processing engines like Spark and Flink. Develop and secure enterprise-grade Hadoop-based data pipelines by implementing robust security and governance measures. Explore techniques for batch data processing, data modeling, and designing applications tailored for Hadoop environments. Understand best practices for optimizing and troubleshooting Hadoop clusters for enhanced performance and reliability. Author(s) The authors, including None Wong, None Singh, and None Kumar, bring together years of experience in big data engineering, distributed systems, and enterprise application development. They aim to provide a clear pathway to mastering Hadoop ecosystem tools. Who is it for? This book is ideal for budding big data professionals who have some familiarity with Java and basic Hadoop concepts and wish to elevate their expertise. If you're a Hadoop career practitioner keen to expand your understanding of the ecosystem's advanced capabilities or a professional looking to implement Hadoop in organizational workflows, this book is well-suited for you.

Mobile apps are not the same as the web, so why have we been measuring them as if they were? With the old GA Services SDK being turned down for some users, it's time to look into how to use Google Analytics for Firebase to measure and act on your mobile app's data. Krista will walk you through the benefits and power of the tool, explain the differences in data model and implementation best practices, and share tips for how to migrate.
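As a rough sketch of that event-based data model (not from the talk), the snippet below sends a single app event through the Measurement Protocol used by Google Analytics 4 / Firebase; all IDs and the secret are placeholders, and you should verify the payload shape against the current protocol documentation.

```python
import requests

# Placeholders: taken from your Firebase project and app instrumentation.
FIREBASE_APP_ID = "1:1234567890:android:abcdef"
API_SECRET = "your-api-secret"
APP_INSTANCE_ID = "app-instance-id-from-the-sdk"

payload = {
    "app_instance_id": APP_INSTANCE_ID,
    # Everything is an event with parameters; no pageview/session schema.
    "events": [
        {"name": "tutorial_begin", "params": {"source": "onboarding_screen"}}
    ],
}

resp = requests.post(
    "https://www.google-analytics.com/mp/collect",
    params={"firebase_app_id": FIREBASE_APP_ID, "api_secret": API_SECRET},
    json=payload,
    timeout=10,
)
resp.raise_for_status()
```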

Learning PostgreSQL 11 - Third Edition

Immerse yourself in the capabilities of PostgreSQL 11 with this comprehensive beginner's guide. Learning PostgreSQL 11 will take you through relational database fundamentals and advanced database functionality, empowering you to build efficient and scalable database solutions with confidence. By the end of this book, you'll have mastery over PostgreSQL's features to develop, manage, and optimize your own databases. What this Book will help me do Gain a solid understanding of relational database principles and the PostgreSQL ecosystem. Learn to install PostgreSQL, create a database, and design a data model effectively. Develop skills to create, manipulate, and optimize tables, views, and efficient indexes. Utilize server-side programming with PL/pgSQL and advanced data types like JSONB. Enhance database reliability and performance, and connect to your Python applications seamlessly. Author(s) Christopher Travers and Volkov bring their collective expertise and practical experience to this book. Christopher has a strong background in software development and database systems, with years of hands-on involvement with PostgreSQL. Volkov has contributed significantly to innovative database solutions, emphasizing clear and actionable instructions. Together, they aim to demystify PostgreSQL for learners of all backgrounds. Who is it for? This book is crafted for developers, database administrators, and tech enthusiasts who want to delve into PostgreSQL. Beginners with no prior database experience will find its approach accessible, while those aiming to enhance their skills with PostgreSQL's latest features will benefit immensely. It's ideal for anyone seeking to build solid database or data warehousing applications with modern capabilities and best practices.
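As a small illustration of two themes the blurb mentions, JSONB columns and connecting from Python, here is a hedged sketch using psycopg2 (not from the book); the connection parameters and the events table are hypothetical.

```python
import psycopg2
from psycopg2.extras import Json

# Connection parameters are placeholders for your own environment.
conn = psycopg2.connect(
    dbname="appdb", user="app", password="secret", host="localhost"
)

with conn, conn.cursor() as cur:
    # A JSONB column holds semi-structured attributes next to relational ones.
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id      serial PRIMARY KEY,
            kind    text NOT NULL,
            payload jsonb NOT NULL
        )
        """
    )
    cur.execute(
        "INSERT INTO events (kind, payload) VALUES (%s, %s)",
        ("signup", Json({"plan": "pro", "referrer": "newsletter"})),
    )
    # The ->> operator extracts a JSON field as text.
    cur.execute("SELECT payload->>'plan' FROM events WHERE kind = %s", ("signup",))
    print(cur.fetchone()[0])  # -> "pro"
```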

QlikView: Advanced Data Visualization

Build powerful data analytics applications with this business intelligence tool and overcome all your business challenges Key Features Master time-saving techniques and make your QlikView development more efficient Perform geographical analysis and sentiment analysis in your QlikView applications Explore advanced QlikView techniques, tips, and tricks to deliver complex business requirements Book Description QlikView is one of the most flexible and powerful business intelligence platforms around, and if you want to transform data into insights, it is one of the best options you have at hand. Use this Learning Path, to explore the many features of QlikView to realize the potential of your data and present it as impactful and engaging visualizations. Each chapter in this Learning Path starts with an understanding of a business requirement and its associated data model and then helps you create insightful analysis and data visualizations around it. You will look at problems that you might encounter while visualizing complex data insights using QlikView, and learn how to troubleshoot these and other not-so-common errors. This Learning Path contains real-world examples from a variety of business domains, such as sales, finance, marketing, and human resources. With all the knowledge that you gain from this Learning Path, you will have all the experience you need to implement your next QlikView project like a pro. This Learning Path includes content from the following Packt products: QlikView for Developers by Miguel Angel Garcia, Barry Harmsen Mastering QlikView by Stephen Redmond Mastering QlikView Data Visualization by Karl Pover What you will learn Deliver common business requirements using advanced techniques Load data from disparate sources to build associative data models Understand when to apply more advanced data visualization Utilize the built-in aggregation functions for complex calculations Build a data architecture that supports scalable QlikView deployments Troubleshoot common data visualization errors in QlikView Protect your QlikView applications and data Who this book is for This Learning Path is designed for developers who want to go beyond their technical knowledge of QlikView and understand how to create analysis and data visualizations that solve real business needs. To grasp the concepts explained in this Learning Path, you should have a basic understanding of the common QlikView functions and some hands-on experience with the tool.

Hands-On Big Data Modeling

This book, Hands-On Big Data Modeling, provides you with practical guidance on data modeling techniques, focusing particularly on the challenges of big data. You will learn the concepts behind various data models, explore tools and platforms for efficient data management, and gain hands-on experience with structured and unstructured data. What this Book will help me do Master the fundamental concepts of big data and its challenges. Explore advanced data modeling techniques using SQL, Python, and R. Design effective models for structured, semi-structured, and unstructured data types. Apply data modeling to real-world datasets like social media and sensor data. Optimize data models for performance and scalability in various big data platforms. Author(s) The authors of this book are experienced data architects and engineers with a strong background in developing scalable data solutions. They bring their collective expertise to simplify complex concepts in big data modeling, ensuring readers can effectively apply these techniques in their projects. Who is it for? This book is intended for data architects, business intelligence professionals, and any programmer interested in understanding and applying big data modeling concepts. If you are already familiar with basic data management principles and want to enhance your skills, this book is perfect for you. You will learn to tackle real-world datasets and create scalable models. Additionally, it is suitable for professionals transitioning to working with big data frameworks.

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumption of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
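To make the table-format idea concrete, here is a minimal PySpark sketch (not from the episode) that assumes a Spark session already configured with the Iceberg runtime and a catalog named demo; the table and column names are hypothetical. Declaring the partition as a transform on a column, rather than a Hive-style directory convention, is one of the features discussed here.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar and a catalog named "demo" are configured
# (e.g. spark.sql.catalog.demo = org.apache.iceberg.spark.SparkCatalog).
spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Hidden partitioning: the transform days(ts) is part of the table metadata,
# so readers and writers never hand-maintain partition columns.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id    BIGINT,
        ts    TIMESTAMP,
        level STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Filters on ts are pruned through Iceberg's file-level metadata instead of
# directory listings, which is what makes it friendly to object stores like S3.
spark.sql("""
    SELECT level, count(*)
    FROM demo.db.events
    WHERE ts >= TIMESTAMP '2019-01-01 00:00:00'
    GROUP BY level
""").show()
```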

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by explaining what Iceberg is and the motivation for creating it?

Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?

How has the use of Iceberg simplified your work at Netflix?

How is the reference implementation architected and how has it evolved since you first began work on it?

What is involved in deploying it to a user’s environment?

For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?

Is there a migration path for pre-existing tables into the Iceberg format?

How is schema evolution managed at the file level?

How do you handle files on disk that don’t contain all of the fields specified in a table definition?

One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?

What are the unique challenges posed by using S3 as the basis for a data lake?

What are the benefits that outweigh the difficulties?

What have been some of the most challenging or contentious details of the specification to define?

What are some things that you have explicitly left out of the specification?

What are your long-term goals for the Iceberg specification?

Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

rdblue on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Iceberg Reference Implementation Iceberg Table Specification Netflix Hadoop Cloudera Avro Parquet Spark S3 HDFS Hive ORC S3mper Git Metacat Presto Pig DDL (Data Definition Language) Cost-Based Optimization

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Applied Analytics through Case Studies Using SAS and R: Implementing Predictive Models and Machine Learning Techniques

Examine business problems and use a practical analytical approach to solve them by implementing predictive models and machine learning techniques using SAS and the R analytical language. This book is ideal for those who are well-versed in writing code and have a basic understanding of statistics, but have limited experience in implementing predictive models and machine learning techniques for analyzing real world data. The most challenging part of solving industrial business problems is the practical and hands-on knowledge of building and deploying advanced predictive models and machine learning algorithms. Applied Analytics through Case Studies Using SAS and R is your answer to solving these business problems by sharpening your analytical skills. What You'll Learn Understand analytics and basic data concepts Use an analytical approach to solve industrial business problems Build predictive models with machine learning techniques Create and apply analytical strategies Who This Book Is For Data scientists, developers, statisticians, engineers, and research students with a great theoretical understanding of data and statistics who would like to enhance their skills by getting practical exposure in data modeling.

Microsoft Power BI Quick Start Guide

Uncover the power of Microsoft Power BI with this accessible and practical guide. This book introduces you to the concepts of data modeling, transformation, and visualization, ensuring that you can build effective dashboards and gain valuable insights. You'll be empowered to productively utilize Power BI in your organization to achieve your analytics goals. What this Book will help me do Connect to various data sources and harness the capabilities of the Query Editor. Transform and clean data for analysis, learning to use languages like M and R. Build robust data models with relationships and powerful DAX expressions. Create impactful reports with efficient and custom visualizations in Power BI. Deploy and administer Power BI solutions both in the cloud and on-premise. Author(s) The authors, Devin Knight, Mitchell Pearson, and Manuel Quintana, are seasoned experts in Business Intelligence and Power BI. They bring years of experience simplifying complex data challenges. Their writing is approachable and hands-on, equipping readers with the skills to solve real-world problems. Who is it for? This book is perfectly suited for professionals in Business Intelligence roles, data analysts, or those aiming to adopt Power BI solutions. Whether you're new to Power BI or have basic BI knowledge, this guide will take you from fundamentals to advanced implementations. Ideal for anyone aiming to unlock actionable insights from their data.

Summary

Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports document, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steemann and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.
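To show what that multi-model surface looks like from an application, here is a brief sketch with the python-arango driver (not from the episode); the database name, credentials, and collection are hypothetical, and a local ArangoDB instance is assumed.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("example", username="root", password="passwd")  # placeholders

# Document model: schemaless JSON documents in a collection.
if not db.has_collection("users"):
    db.create_collection("users")
users = db.collection("users")
users.insert({"_key": "alice", "name": "Alice"})

# Key/value access is just a document lookup by _key in the same engine.
alice = users.get("alice")

# AQL queries (and graph traversals) run against the same stored data.
cursor = db.aql.execute(
    "FOR u IN users FILTER u.name == @name RETURN u",
    bind_vars={"name": "Alice"},
)
print(list(cursor))
```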

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.

Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steemann about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.

Interview

Introduction

How did you get involved in the area of data management?

Can you give a high level description of what ArangoDB is and the motivation for creating it?

What is the story behind the name?

How is ArangoDB constructed?

How does the underlying engine store the data to allow for the different ways of viewing it?

What are some of the benefits of multi-model data storage?

When does it become problematic?

For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?

How does it compare to OrientDB?

What are the options for scaling a running system?

What are the limitations in terms of network architecture or data volumes?

One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture?

What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?

What are some of the most interesting or surprising uses of this functionality that you have seen?

What are some of the most challenging technical and business aspects of building and promoting ArangoDB?

What do you have planned for the future of ArangoDB?

Contact Info

Jan Steemann

jsteemann on GitHub @steemann on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

ArangoDB Köln Multi-model Database Graph Algorithms Apache 2 C++ ArangoDB Foxx Raft Protocol Target Partners RocksDB AQL (ArangoDB Query Language) OrientDB PostgreSQL OrientDB Studio Google Spanner 3-Tier Architecture Thomson Reuters Arango Search Dell EMC Google S2 Index ArangoDB Geographic Functionality JSON Schema

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

Your host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.

Interview

Andy Eschbacher From Carto

What are the challenges associated with storing geospatial data?

What are some of the common misconceptions that people have about working with geospatial data?

Contact Info

andy-esch on GitHub @MrEPhysics on Twitter Website

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Carto Geospatial Analysis GeoJSON

Todd Blaschka From TigerGraph

What are graph databases and how do they differ from relational engines?

What are some of the common difficulties that people have when dealing with graph algorithms?

How does data modeling for graph databases differ from relational stores?

Contact Info

LinkedIn @toddblaschka on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

TigerGraph Graph Databases

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Beginning DAX with Power BI: The SQL Pro’s Guide to Better Business Intelligence

Attention all SQL Pros, DAX is not just for writing Excel-based formulas! Get hands-on learning and expert advice on how to use the vast capabilities of the DAX language to solve common data modeling challenges. Beginning DAX with Power BI teaches key concepts such as mapping techniques from SQL to DAX, filtering, grouping, joining, pivoting, and using temporary tables, all aimed at the SQL professional. Join author Philip Seamark as he guides you on a journey through typical business data transformation scenarios and challenges, and teaches you, step-by-step, how to resolve challenges using DAX. Tips, tricks, and shortcuts are included and explained, along with examples of the SQL equivalent, in order to accelerate learning. Examples in the book range from beginner to advanced, with plenty of detailed explanation when walking through each scenario. What You’ll Learn Turbocharge your Power BI model by adding advanced DAX programming techniques Know when to use calculated measures versus calculated columns Generate new tables on the fly from existing data Optimize, monitor, and tune Power BI to improve performance of your models Discover new ideas, tricks, and time-saving techniques for better models Who This Book Is For Business intelligence developers, business analysts, or any SQL user who wants to use Power BI as a reporting tool. A solid understanding of SQL is recommended, as examples throughout the book include the DAX equivalents to SQL problem/solution scenarios.

Mastering Microsoft Power BI

Dive right into the powerful world of Microsoft Power BI with this comprehensive guide. This book takes you through every step of mastering Power BI, from data modeling to creating actionable visualizations. You'll find clear explanations and practical steps to improve your data analytics and enhance business decision-making. What this Book will help me do Learn to connect and transform data using Power Query M Language to create clean, structured datasets. Understand how to design scalable and performance-optimized Power BI Data Models for effective analytics. Develop professional, visually appealing and interactive reports and dashboards to convey insights confidently. Implement best practices for managing Power BI solutions, including deployment, version control, and monitoring. Gain practical knowledge to administer Power BI across organizational structures, ensuring security and efficiency. Author(s) Brett Powell is a seasoned expert in business intelligence and a passionate educator in the field of data analytics. With extensive hands-on experience in Microsoft Power BI, he has supported many organizations in unlocking the potential of their data. The approachable writing style reflects a real-world yet proficient understanding of Power BI's capabilities. Who is it for? This book is ideal for business intelligence professionals looking to deepen their expertise in Microsoft Power BI. Readers already familiar with basic BI concepts and Power BI will gain significant technical depth. It suits professionals keen to enhance their data modeling, visualization, and analytics skills. If you're aiming to create impactful dashboards and benefit from advanced insights, this book is for you.

Expert Apache Cassandra Administration

Follow this handbook to build, configure, tune, and secure Apache Cassandra databases. Start with the installation of Cassandra and move on to the creation of a single instance, and then a cluster of Cassandra databases. Cassandra is increasingly a key player in many big data environments, and this book shows you how to use Cassandra with Apache Spark, a popular big data processing framework. Also covered are day-to-day topics of importance such as the backup and recovery of Cassandra databases, using the right compression and compaction strategies, and loading and unloading data. Expert Apache Cassandra Administration provides numerous step-by-step examples starting with the basics of a Cassandra database, and going all the way through backup and recovery, performance optimization, and monitoring and securing the data. The book serves as an authoritative and comprehensive guide to the building and management of simple to complex Cassandra databases. The book: Takes you through building a Cassandra database from installation of the software and creation of a single database, through to complex clusters and data centers Provides numerous examples of actual commands in a real-life Cassandra environment that show how to confidently configure, manage, troubleshoot, and tune Cassandra databases Shows how to use the Cassandra configuration properties to build a highly stable, available, and secure Cassandra database that always operates at peak efficiency What You'll Learn Install the Cassandra software and create your first database Understand the Cassandra data model, and the internal architecture of a Cassandra database Create your own Cassandra cluster, step-by-step Run a Cassandra cluster on Docker Work with Apache Spark by connecting to a Cassandra database Deploy Cassandra clusters in your data center, or on Amazon EC2 instances Back up and restore mission-critical Cassandra databases Monitor, troubleshoot, and tune production Cassandra databases, and cut your spending on resources such as memory, servers, and storage Who This Book Is For Database administrators, developers, and architects who are looking for an authoritative and comprehensive single volume for all their Cassandra administration needs. Also for administrators who are tasked with setting up and maintaining highly reliable and high-performing Cassandra databases. An excellent choice for big data administrators, database administrators, architects, and developers who use Cassandra as their key data store, to support high volume online transactions, or as a decentralized, elastic data store.
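For a flavor of the day-to-day work the blurb describes, here is a minimal sketch with the DataStax Python driver (cassandra-driver), not from the book; the contact point, keyspace, and table are placeholders for a local test node.

```python
import uuid

from cassandra.cluster import Cluster

# Contact points are placeholders for a local test node.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# A throwaway keyspace with minimal replication -- fine for a sketch.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")

session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id uuid PRIMARY KEY,
        name    text
    )
""")

# The driver binds positional values with %s placeholders.
session.execute(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    (uuid.uuid4(), "Alice"),
)
for row in session.execute("SELECT user_id, name FROM users LIMIT 10"):
    print(row.user_id, row.name)

cluster.shutdown()
```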

Pro Power BI Desktop

Deliver eye-catching Business Intelligence with Microsoft Power BI Desktop. This new edition has been updated to cover all the latest features, including combo charts, Cartesian charts, trend lines, use of gauges, and more. Also covered are Top-N features, the ability to bin data into groupings and chart the groupings, and new techniques for detecting and handling outlier data points. You can take data from virtually any source and use it to produce stunning dashboards and compelling reports that will seize your audience’s attention. Slice and dice the data with remarkable ease and then add metrics and KPIs to project the insights that create your competitive advantage. Make raw data into clear, accurate, and interactive information with Microsoft’s free self-service business intelligence tool. Pro Power BI Desktop shows you how to choose from a wide range of built-in and third-party visualization types so that your message is always enhanced. You’ll be able to deliver those results on the PC, tablets, and smartphones, as well as share results via the cloud. This book helps you save time by preparing the underlying data correctly without needing an IT department to prepare it for you. What You'll Learn Deliver attention-grabbing information, turning data into insight Mash up data from multiple sources into a cleansed and coherent data model Create dashboards that help in monitoring key performance indicators of your business Build interdependent charts, maps, and tables to deliver visually stunning information Share business intelligence in the cloud without involving IT Deliver visually stunning and interactive charts, maps, and tables Find new insights as you chop and tweak your data as never before Adapt delivery to mobile devices such as phones and tablets Who This Book Is For Everyone from CEOs and Business Intelligence developers to power users and IT managers

Learning Neo4j 3.x - Second Edition

"Learning Neo4j 3.x" provides a comprehensive introduction to the world of graph databases, focusing on practical usage of Neo4j. This book guides you through the fundamentals, from installation and modeling to advanced features including security and optimization. You'll gain the skills to harness Neo4j for effective data management and visualization. What this Book will help me do Understand the basics of graph databases and how to use them effectively in real-world scenarios. Master the Cypher query language to query and manipulate graph data powerfully and intuitively. Learn to implement and optimize advanced graph techniques using the APOC library. Develop the ability to extend Neo4j's core functionality using available plugins and advanced extensions. Acquire skills to design and deploy scalable, secure enterprise-grade graph database solutions. Author(s) Jerome Baton and None Van Bruggen are experienced Neo4j specialists who share a passion for making complex technical concepts accessible. Jerome brings years of real-world experience in graph database applications, while None contributes expertise in data modeling and visualization. Together, they provide clear, focused insights with practical examples and hands-on guidance. Who is it for? This book is tailored for developers looking to extend their knowledge with graph databases to take on modern connected data challenges. It is suitable for those new to Neo4j, including beginners with databases, and will serve as a valuable guide for professionals aiming to deepen their expertise in data storage and query optimization using Neo4j.