talk-data.com

Topic

Parquet

Apache Parquet

columnar_storage big_data compression file_format storage

21 tagged

Activity Trend

5 peak/qtr, 2020-Q1 to 2026-Q1

Activities

21 activities · Newest first

Summary In this episode of the Data Engineering Podcast Prashanth Rao, an AI engineer at KuzuDB, talks about their embeddable graph database. Prashanth explains how KuzuDB addresses performance shortcomings in existing solutions through columnar storage and novel join algorithms. He discusses the usability and scalability of KuzuDB, emphasizing its open-source nature and potential for various graph applications. The conversation explores the growing interest in graph databases due to their AI and data engineering applications, and Prashanth highlights KuzuDB's potential in edge computing, ephemeral workloads, and integration with other formats like Iceberg and Parquet.
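As a concrete, hedged illustration of the Parquet integration mentioned in the summary, the sketch below loads a Parquet file into an embedded Kuzu database from Python and queries it with Cypher. The file name, node table schema, and column names are hypothetical, and the exact Python API surface may differ between Kuzu releases.

```python
import kuzu

# Open (or create) an embedded Kuzu database on local disk.
db = kuzu.Database("./people_graph")
conn = kuzu.Connection(db)

# Define a node table, then bulk-load it from a Parquet file.
# "people.parquet" and its columns are hypothetical placeholders.
conn.execute("CREATE NODE TABLE Person(name STRING, age INT64, PRIMARY KEY (name))")
conn.execute("COPY Person FROM 'people.parquet'")

# Query with Cypher and pull the result back as a DataFrame.
result = conn.execute("MATCH (p:Person) WHERE p.age > 30 RETURN p.name, p.age")
print(result.get_as_df())
```

Because the database is embedded, this runs in-process with no separate server to operate, which is the property the episode emphasizes for edge and ephemeral workloads.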

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey and today I'm interviewing Prashanth Rao about KuzuDB, an embeddable graph database

Interview

Introduction How did you get involved in the area of data management? Can you describe what KuzuDB is and the story behind it? What are the core use cases that Kuzu is focused on addressing? What is explicitly out of scope? Graph engines have been available and in use for a long time, but generally for more niche use cases. How would you characterize the current state of the graph data ecosystem? You note scalability as a feature of Kuzu, which is a phrase with many potential interpretations. Typically horizontal scaling of graphs has been complicated, in what sense does Kuzu make that claim? Can you describe some of the typical architecture and integration patterns of Kuzu? What are some of the more interesting or esoteric means of architecting with Kuzu? For cases where Kuzu is rendering a graph across an external data repository (e.g. Iceberg, etc.), what are the patterns for balancing data freshness with network/compute efficiency? (e.g. read and create every time or persist the Kuzu state) Can you describe the internal architecture of Kuzu and key design factors? What are the benefits and tradeoffs of using a columnar store with adjacency lists vs. a more graph-native storage format? What are the most interesting, innovative, or unexpected ways that you have seen Kuzu used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kuzu? When is Kuzu the wrong choice? What do you have planned for the future of Kuzu?

Contact Info

Website LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

KuzuDB BERT Transformer Architecture DuckDB Podcast Episode MonetDB Umbra DB sqlite Cypher Query Language Property Graph Neo4J GraphRAG Context Engineering Write-Ahead Log Bauplan Iceberg DuckLake Lance LanceDB Arrow Polars Arrow DataFusion GQL ClickHouse Adjacency List Why Graph Databases Need New Join Algorithms KuzuDB WASM RAG == Retrieval Augmented Generation NetworkX

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today! Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storage

Interview

Introduction How did you get involved in the area of data management? Can you describe what Lance is and the story behind it? What are the core problems that Lance is designed to solve? What is explicitly out of scope? The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support? What formats does Lance replace or obviate? In terms of data modeling Lance obviously adds a vector type, what are the features and constraints that engineers should be aware of when modeling their embeddings or arbitrary vectors? Are there any practical or hard limitations on vector dimensionality? When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage? I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations? What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance? The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features? What are the other main integrations for Lance? What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes in e.g. Iceberg or Delta to enable its use in data lake environments? What are the most interesting, innovative, or unexpected ways that you have seen Lance used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format? When is Lance the wrong choice? What do you have planned for the future of Lance?

Contact Info

LinkedIn GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Lance Format LanceDB Substrait PyArrow FAISS Pinecone Podcast Episode Parquet Iceberg Podcast Episode Delta Lake Podcast Episode PyLance Hilbert Curves SIFT Vectors S3 Express Weka DataFusion Ray Data Torch Data Loader HNSW == Hierarchical Navigable Small Worlds vector index IVFPQ vector index GeoJSON Polars

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
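Since the interview notes above call out how straightforward it is to convert from Parquet to Lance, here is a minimal sketch of that round trip using the pyarrow and pylance packages; the file paths are hypothetical and write options are omitted.

```python
import lance
import pyarrow.parquet as pq

# Read an existing Parquet file into an Arrow table (path is hypothetical).
table = pq.read_table("embeddings.parquet")

# Write the same data as a Lance dataset; the Arrow schema carries over,
# including fixed-size list columns that typically hold embedding vectors.
lance.write_dataset(table, "embeddings.lance")

# Re-open the dataset and confirm the round trip.
ds = lance.dataset("embeddings.lance")
print(ds.schema)
print(ds.count_rows())
```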

Summary

Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise, Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data withou

Interview

Introduction How did you get involved in the area of data management? Can you describe what Microsoft Fabric is and the story behind it? Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend? Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?

What are the elements of Fabric that were engineered specifically for the service? What are the most interesting/complicated integration challenges?

How has your prior experience with Ahana and Presto informed your current work at Microsoft? AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?

What are the challenges in terms of safety and reliability?

What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically? When is Fabric the wrong choice? What do you have planned for the future of data lake analytics?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Microsoft Fabric Ahana episode DB2 Distributed Spark Presto Azure Data MAD Landscape

Podcast Episode ML Podcast Episode

Tableau dbt Medallion Architecture Microsoft Onelake ORC Parquet Avro Delta Lake Iceberg

Podcast Episode

Hudi

Podcast Episode

Hadoop PowerBI

Podcast Episode

Velox Gluten Apache XTable GraphQL Formula 1 McLaren

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Starburst

This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by T

Summary

Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse

Interview

Introduction How did you get involved in the area of data management? Can you describe what role Trino and Iceberg play in Stripe's data architecture?

What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?

What were the requirements and selection criteria that led to the selection of that combination of technologies?

What are the other systems that feed into and rely on the Trino/Iceberg service?

what kinds of questions are you answering with table metadata

what use case/team does that support

comparative utility of iceberg REST catalog What are the shortcomings of Trino and Iceberg? What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure? When is a lakehouse on Trino/Iceberg the wrong choice? What do you have planned for the future of Trino and Iceberg at Stripe?
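To make the table-metadata questions above more concrete, this is a hedged sketch of inspecting an Iceberg table's snapshots and data files through the hidden "$" metadata tables that Trino exposes, using the trino Python client. The host, catalog, schema, and table names are hypothetical placeholders.

```python
from trino.dbapi import connect

# Connection details are hypothetical placeholders.
conn = connect(host="trino.internal", port=8080, user="analytics",
               catalog="iceberg", schema="payments")
cur = conn.cursor()

# Recent snapshots show when the table changed and by what operation.
cur.execute(
    'SELECT snapshot_id, committed_at, operation '
    'FROM "charges$snapshots" ORDER BY committed_at DESC LIMIT 5'
)
for row in cur.fetchall():
    print(row)

# File-level metadata helps answer questions about layout and small-file problems.
cur.execute(
    'SELECT count(*) AS data_files, sum(file_size_in_bytes) AS total_bytes '
    'FROM "charges$files"'
)
print(cur.fetchone())
```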

Contact Info

Substack LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Trino Iceberg Stripe Spark Redshift Hive Metastore Python Iceberg Python Iceberg REST Catalog Trino Metadata Table Flink

Podcast Episode

Tabular

Podcast Episode

Delta Table

Podcast Episode

Databricks Unity Catalog Starburst AWS Athena Kevin Trinofest Presentation Alluxio

Podcast Episode

Parquet Hudi Trino Project Tardigrade Trino On Ice

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Starburst

This episode is brought to you by Starburst - an end-to-end data lakehouse platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, the query engine Apache Iceberg was designed for, Starburst is an open platform with support for all table formats including Apache Iceberg, Hive, and Delta Lake.

Trusted by the teams at Comcast and Doordash, Starburst del

Summary

Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design

Interview

Introduction How did you get involved in the area of data management? Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?

This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?

Each of the architectural components is well engineered for its particular scope. What is the engineering work that is involved in building a cohesive platform from those components? One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform? Can you describe the
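As a rough sketch of how the FDAP components (Arrow, DataFusion, Parquet) compose in practice, the example below builds a small Arrow table, persists it as Parquet, and queries it with the DataFusion Python bindings. The file and column names are hypothetical, and the datafusion API shown may differ slightly across releases.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from datafusion import SessionContext

# Build a small Arrow table and persist it as Parquet (names are hypothetical).
table = pa.table({"sensor": ["a", "a", "b"], "temp_c": [21.5, 22.0, 19.8]})
pq.write_table(table, "readings.parquet")

# DataFusion plans and executes SQL over the Parquet file, returning Arrow data.
ctx = SessionContext()
ctx.register_parquet("readings", "readings.parquet")
df = ctx.sql("SELECT sensor, avg(temp_c) AS avg_temp FROM readings GROUP BY sensor")
print(df.to_pandas())
```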

Summary

Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
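A hedged sketch of the versioned-data-in-S3 workflow described above, using the quilt3 client to package local files and push an immutable revision to an S3 registry. The package name, file paths, bucket, and message are hypothetical, and details may differ between quilt3 versions.

```python
import quilt3

# Assemble a package from local files; logical keys and paths are hypothetical.
pkg = quilt3.Package()
pkg.set("data/trips.parquet", "local/trips.parquet")
pkg.set("README.md", "local/README.md")

# Push a new revision of the package to an S3-backed registry.
pkg.push("analytics/trips", registry="s3://example-quilt-bucket",
         message="Add July trip data")

# Anyone with access can later browse that exact revision.
remote = quilt3.Package.browse("analytics/trips", registry="s3://example-quilt-bucket")
print(list(remote.keys()))
```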

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions! Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in

Interview

Introduction How did you get involved in the area of data management? Can you describe what Quilt is and the story behind it?

How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018?

What are the main problems that users are trying to solve when they find Quilt?

What are some of the alternative approaches/products that they are coming from?

How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.? Can you describe how Quilt is implemented? What are the types of tools and systems that Quilt gets integrated with?

How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities?

What is a typical workflow for a team that is using Quilt to manage their data? What are the most interesting, innovative, or unexpected ways that you have seen Quilt used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt? When is Quilt the wrong choice? What do you have planned for the future of Quilt?

Contact Info

LinkedIn @akarve on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Quilt Data

Podcast Episode

UW Madison Docker Swarm Kaggle open.quiltdata.com FinOS Perspective LakeFS

Podcast Episode

Pachyderm

Podcast Episode

Unstruk

Podcast Episode

Parquet Avro ORC Cloudformation Troposphere CDK == Cloud Development Kit Shadow IT

Podcast Episode

Delta Lake

Podcast Episode

Apache Iceberg

Podcast Episode

Datasette Frictionless DVC

Podcast.init Episode

The in

Summary Machine learning has become a meaningful target for data applications, bringing with it an increase in the complexity of orchestrating the entire data flow. Flyte is a project that was started at Lyft to address their internal needs for machine learning and integrated closely with Kubernetes as the execution manager. In this episode Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that has made Flyte successful.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder. Data lake architectures provide the best combination of massive scalability and cost reduction, but they aren’t always the most performant option. That’s why Kyligence has built on top of the leading open source OLAP engine for data lakes, Apache Kylin. With their AI augmented engine they detect patterns from your critical queries, automatically build data marts with optimized table structures, and provide a unified SQL interface across your lake, cubes, and indexes. Their cost-based query router will give you interactive speeds across petabyte scale data sets for BI dashboards and ad-hoc data exploration. Stop struggling to speed up your data lake. Get started with Kyligence today at dataengineeringpodcast.com/kyligence Your host is Tobias Macey and today I’m interviewing Ketan Umare and Haytham Abuelfutuh about Flyte, the open source and kubernetes-native orchestration engine for your data systems

Interview

Introduction How did you get involved in the area of data management? Can you describe what Flyte is and the story behind it? What was missing in the ecosystem of available tools that made it necessary/worthwhile to create Flyte? Workflow orchestrators have been around for several years and have gone through a number of generational shifts. How would you characterize Flyte’s position in the ecosystem?

What do you see as the closest alternatives? What are the core differentiators that might lead someone to choose Flyte over e.g. Airflow/Prefect/Dagster?

What are the core primitives that Flyte exposes for building up complex workflows?

Machine learning use cases have been a core focus since the project’s inception. What are some of the ways that that manifests in the design and feature set?

Can you describe the architecture of Flyte?

How have the design and goals of the platform changed/evolved since you first started working on it?

What are the changes in the data ecosystem that have had the most substantial impact on the Flyte project? (e.g. roadmap, integrations, pushing people toward adoption, etc.) What is the process for setting up a Flyte deployment? What are the user personas that you prioritize in the design and feature development for Flyte? What is the workflow for someone building a new pipeline in Flyte?

What are the patterns that you and the community have established to encourage discovery and reuse of granular task definitions? Beyond code reuse, how can teams scale usage of Flyte at the company/organization level?

What are the affordances that you have created to facilitate local development and testing of workflows while ensuring a smooth transition to production?

What are the patterns that are available for CI/CD of workflows using Flyte?

How have you approached the design of data contracts/type definitions to provide a consistent/portable API for defining inter-task dependencies across languages? What are the available interfaces for extending Flyte and building integrations with other components across the data ecosystem? Data orchestration engines are a natural point for generating and taking advantage of rich metadata. How do you manage creation and propagation of metadata within and across the framework boundaries? Last year you founded Union to offer a managed version of Flyte. What are the features that you are offering beyond what is available in the open source?

What are the opportunities that you see for the Flyte ecosystem with a corporate entity to invest in expanding adoption?

What are the most interesting, innovative, or unexpected ways that you have seen Flyte used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Flyte? When is Flyte the wrong choice? What do you have planned for the future of Flyte?
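As a minimal, hedged illustration of the core primitives discussed above (typed tasks composed into workflows), the sketch below uses flytekit; the task logic and names are hypothetical.

```python
from typing import List

from flytekit import task, workflow


@task
def extract(rows: int) -> List[int]:
    # Stand-in for reading records from a source system.
    return list(range(rows))


@task
def total(values: List[int]) -> int:
    return sum(values)


@workflow
def pipeline(rows: int = 10) -> int:
    # Flyte builds the execution graph from these typed call dependencies.
    return total(values=extract(rows=rows))


if __name__ == "__main__":
    # Workflows can be executed locally for development before registering them.
    print(pipeline(rows=5))
```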

Contact Info

Ketan Umare Haytham Abuelfutuh

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers

Links

Flyte

Slack Channel

Union.ai Kubeflow Airflow AWS Step Functions Protocol Buffers XGBoost MLFlow Dagster

Podcast Episode

Prefect

Podcast Episode

Arrow Parquet Metaflow Pytorch

Podcast.init Episode

dbt FastAPI

Podcast.init Interview

Python Type Annotations Modin

Podcast.init Interview

Monad Datahub

Podcast Episode

OpenMetadata

Podcast Episode

Hudi

Podcast Episode

Iceberg

Podcast Episode

Great Expectations

Podcast Episode

Pandera Union ML Weights and Biases Whylogs

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


Julien has a unique history of building open frameworks that make data platforms interoperable. He's contributed in various ways to Apache Arrow, Apache Iceberg, Apache Parquet, and Marquez, and is currently leading OpenLineage, an open framework for data lineage collection and analysis. In this episode, Tristan & Julia dive into how open source projects grow to become standards, and why data lineage in particular is in need of an open standard. They also cover some of the compelling use cases for this data lineage metadata, and where you might be able to deploy it in your work. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.

Summary Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.
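A hedged PySpark sketch of the incremental upsert pattern the summary describes; the table name, record key, precombine field, and storage paths are hypothetical, and the configuration is trimmed to the essentials.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Minimal Hudi write options; field names and table name are hypothetical.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

# Read a small batch of changed records and upsert them into the existing table.
updates = spark.read.parquet("s3://example-bucket/incoming/trips_delta/")
(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/trips/"))
```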

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underly everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Hudi is and the story behind it? What are the use cases that it is focused on supporting? There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.? Can you describe how Hudi is architected?

How have the goals and design of Hudi changed or evolved since you first began working on it? If you were to start the whole project over today, what would you do differently?

Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment? One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?

How does Hudi make that a tractable problem?

What are the data platform components that are needed to support an installation of Hudi? What is involved in migrating an existing data lake to use Hudi?

How would someone approach supporting heterogeneous table formats in their lake?

As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem? What are the most interesting, innovative, or unexpected ways that you have seen Hudi used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi? When is Hudi the wrong choice? What do you have planned for the future of Hudi?

Contact Info

Linkedin Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Hudi Docs Hudi Design & Architecture Incremental Processing CDC == Change Data Capture

Podcast Episodes

Oracle GoldenGate Voldemort Kafka Hadoop Spark HBase Parquet Iceberg Table Format

Data Engineering Episode

Hive ACID Apache Kudu

Podcast Episode

Vertica Delta Lake

Podcast Episode

Optimistic Concurrency Control MVCC == Multi-Version Concurrency Control Presto Flink

Podcast Episode

Trino

Podcast Episode

Gobblin LakeFS

Podcast Episode

Nessie

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines or spending time on integrating with a new system when you would prefer to be working on other projects then this is definitely a conversation worth listening to.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask. RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Airbyte is and the story behind it? Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space? How would you characterize your target users?

How have those personas instructed the priorities and design of Airbyte? What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?

what are the complex/challenging elements of data integration that make it such a slippery problem? motivation for creating open source ELT as a business Can you describe how the Airbyte platform is implemented?

What was your motivation for choosing Java as the primary language?

incidental complexity of forcing all connectors to be packaged as containers shortcomings of the Singer specification/motivation for creating a backwards incompatible interface perceived potential for community adoption of Airbyte specification tradeoffs of using JSON as interchange format vs. e.g. protobuf/gRPC/Avro/etc.

information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)

interfaces/extension points for integrating with other tools, e.g. Dagster abstraction layers for simplifying implementation of new connectors tradeoffs of storing all connectors in a monorepo with the Airbyte core

impact of community adoption/contributions

What is involved in setting up an Airbyte installation? What are the available axes for scaling an Airbyte deployment? challenges of setting up and maintaining CI environment for Airbyte How are you managing governance and long term sustainability of the project? What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used? What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte? When is Airbyte the wrong choice? What do you have planned for the future of the project?
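To ground the notes above about JSON as the interchange format, here is a hedged sketch of the shape of messages a source connector emits: newline-delimited records written to stdout for the platform to hand to a destination. The stream name and payload are hypothetical, and field details can vary across protocol versions.

```python
import json
import sys
import time


def emit_record(stream: str, data: dict) -> None:
    # Sources write one JSON message per line to stdout; the platform routes
    # them to the destination connector, which reads them from stdin.
    message = {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            "emitted_at": int(time.time() * 1000),
        },
    }
    sys.stdout.write(json.dumps(message) + "\n")


# Hypothetical rows pulled from a source API.
emit_record("customers", {"id": 42, "email": "jane@example.com", "plan": "pro"})
emit_record("customers", {"id": 43, "email": "sam@example.com", "plan": "free"})
```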

Contact Info

Michel

LinkedIn @MichelTricot on Twitter michel-tricot on GitHub

John

LinkedIn @JeanLafleur on Twitter johnlafleur on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Airbyte Liveramp Fivetran

Podcast Episode

Stitch Data Matillion DataCoral

Podcast Episode

Singer Meltano

Podcast Episode

Airflow

Podcast.init Episode

Kotlin Docker Monorepo Airbyte Specification Great Expectations

Podcast Episode

Dagster

Data Engineering Podcast Episode Podcast.init Episode

Prefect

Podcast Episode

DBT

Podcast Episode

Kubernetes Snowflake

Podcast Episode

Redshift Presto Spark Parquet

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog

Interview

Introduction How did you get involved in the area of data management? For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with? What are the main components of your platform for managing that information? How are the data teams at DataDog organized and what are your primary responsibilities in the organization? What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?

What are some of the strategies which have proven to be most useful in overcoming those challenges?

Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met? Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information? Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered? What are some of the upcoming projects that you have planned for the upcoming months and years? What are some of the technologies, patterns, or practices that you are hoping to adopt?

Contact Info

LinkedIn @databuryat on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

DataDog Hadoop Hive Yarn Chef SRE == Site Reliability Engineer Application Performance Management (APM) Apache Kafka RocksDB Cassandra Apache Parquet data serialization format SLA == Service Level Agreement WatchDog Apache Spark

Podcast Episode

Apache Pig Databricks JVM == Java Virtual Machine Kubernetes SSIS (SQL Server Integration Services) Pentaho JasperSoft Apache Airflow

Podcast.init Episode

Apache NiFi

Podcast Episode

Luigi Dagster

Podcast Episode

Prefect

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email [email protected] with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Marquez is?

What was missing in existing metadata management platforms that necessitated the creation of Marquez?

How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?

How does it compare to the Amundsen platform that Lyft recently released?

What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see? What are some of the capabilities that are unique to Marquez and how are you using them at WeWork? What are the primary resource types that you support in Marquez?

What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?

Can you explain how Marquez is architected and how the design has evolved since you first began working on it?

Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?

What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?

How is the metadata itself stored and managed in Marquez?

How much up-front data modeling is necessary and what types of schema representations are supported?

Can you talk through the overall workflow of someone using Marquez in their environment?

What is involved in registering and updating datasets? How do you define and track the health of a given dataset? What are some of the interesting questions that can be answered from the information stored in Marquez?

What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases? For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it? What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform? When is Marquez the wrong choice for a metadata repository? What do you have planned for the future of Marquez?

Contact Info

Julien Le Dem

@J_ on Twitter Email julienledem on GitHub

Willy

LinkedIn @wslulciuc on Twitter wslulciuc on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Marquez

DataEngConf Presentation

WeWork Canary Yahoo Dremio Hadoop Pig Parquet

Podcast Episode

Airflow Apache Atlas Amundsen

Podcast Episode

Uber DataBook LinkedIn DataHub Iceberg Table Format

Podcast Episode

Delta Lake

Podcast Episode

Great Expectations data pipeline unit testing framework

Podcast.init Episode

Redshift SnowflakeDB

Podcast Episode

Apache Kafka Schema Registry

Podcast Episode

Open Tracing Jaeger Zipkin DropWizard Java framework Marquez UI Cayley Graph Database Kubernetes Marquez Helm Chart Marquez Docker Container Dagster

Podcast Episode

Luigi DBT

Podcast Episode

Thrift Protocol Buffers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Summary Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
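
To make the ETL-to-ELT shift concrete, here is a minimal sketch using the snowflake-connector-python package. The account, warehouse, stage, and table names are placeholders invented for the example rather than anything discussed in the episode: it lands raw Parquet files from an external stage into a staging table and then runs the transformation inside the warehouse itself.

```python
# Minimal ELT sketch with snowflake-connector-python.
# Account, warehouse, stage, and table names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",     # hypothetical account identifier
    user="my_user",
    password="...",
    warehouse="LOAD_WH",      # compute is provisioned separately from storage
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

# Landing table for the raw data (columns assumed to match the Parquet files).
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_events (
        event_date DATE,
        event_type STRING,
        payload    VARIANT
    )
""")

# Extract/Load: copy raw Parquet from an external stage straight into the table,
# deferring any transformation until after the data has landed.
cur.execute("""
    COPY INTO raw_events
    FROM @my_s3_stage/events/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

# Transform: run the modeling step inside the warehouse (the "T" in ELT).
cur.execute("""
    CREATE OR REPLACE TABLE analytics.events_daily AS
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY event_date, event_type
""")

cur.close()
conn.close()
```

The point of the pattern is that the transform step runs on elastically sized virtual warehouses over data that is already loaded, so the same raw tables can be reshaped repeatedly without re-extracting anything from the source systems.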

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?

How does it compare to the other available platforms for data warehousing? How does it differ from traditional data warehouses?

How does the performance and flexibility affect the data modeling requirements?

Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces? Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?

What are some of the current limitations that you are struggling with?

For someone getting started with Snowflake what is involved with loading data into the platform?

What is their workflow for allocating and scaling compute capacity and running analyses?

One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen? What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about? When is SnowflakeDB the wrong choice? What are some of the plans for the future of SnowflakeDB?

Contact Info

LinkedIn Website @KentGraziano on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

SnowflakeDB

Free Trial Stack Overflow

Data Warehouse Oracle DB MPP == Massively Parallel Processing Shared Nothing Architecture Multi-Cluster Shared Data Architecture Google BigQuery AWS Redshift AWS Redshift Spectrum Presto

Podcast Episode

SnowflakeDB Semi-Structured Data Types Hive ACID == Atomicity, Consistency, Isolation, Durability 3rd Normal Form Data Vault Modeling Dimensional Modeling JSON AVRO Parquet SnowflakeDB Virtual Warehouses CRM == Customer Relationship Management Master Data Management

Podcast Episode

FoundationDB

Podcast Episode

Apache Spark

Podcast Episode

SSIS == SQL Server Integration Services Talend Informatica Fivetran

Podcast Episode

Matillion Apache Kafka Snowpipe Snowflake Data Exchange OLTP == Online Transaction Processing GeoJSON Snowflake Documentation SnowAlert Splunk Data Catalog

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what you mean by the term "Data Orchestration"?

How does it compare to the concept of "Data Virtualization"? What are some of the tools and platforms that fit under that umbrella?

What are some of the motivations for organizations to use the cloud for their data oriented workloads?

What are they giving up by using cloud resources in place of on-premises compute?

For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments? What are some of the common patterns for cloud migration projects and what challenges do they present?

Do you have advice on useful metrics to track for determining project completion or success criteria?

How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals? Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?

What are some of the common pain points that organizations encounter when working on hybrid implementations?

What are some of the missing pieces in the data orchestration landscape?

Are there any efforts that you are aware of that are aiming to fill those gaps?

Where is the data orchestration market heading, and what are some industry trends that are driving it?

What projects are you most interested in or excited by?

For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?

Contact Info

LinkedIn @dborkar on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Alluxio

Podcast Episode

UC San Diego Couchbase Presto

Podcast Episode

Spark SQL Data Orchestration Data Virtualization PyTorch

Podcast.init Episode

Rook storage orchestration PySpark MinIO

Podcast Episode

Kubernetes Openstack Hadoop HDFS Parquet Files

Podcast Episode

ORC Files Hive Metastore Iceberg Table Format

Podcast Episode

Data Orchestration Summit Star Schema Snowflake Schema Data Warehouse Data Lake Teradata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Summary

The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast-moving data. To fill this need the Kudu project was created with a column-oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.
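
As a rough illustration of that write path, the sketch below uses the kudu-python client. The master host, table, and column names are invented placeholders, and exact client options can differ between Kudu releases; it simply defines a schema with a primary key, hash-partitions the table across tablets, and applies inserts through a session.

```python
# Small sketch with the kudu-python client.
# Host, table, and column names are illustrative placeholders, not from the episode.
import kudu
from kudu.client import Partitioning

# Connect to a Kudu master.
client = kudu.connect(host="kudu-master.example.com", port=7051)

# Define a simple schema with a primary key column.
builder = kudu.schema_builder()
builder.add_column("ts").type(kudu.int64).nullable(False).primary_key()
builder.add_column("host").type(kudu.string)
builder.add_column("metric").type(kudu.string)
builder.add_column("value").type(kudu.int64)
schema = builder.build()

# Hash-partition on the key so writes are spread across tablets.
partitioning = Partitioning().add_hash_partitions(column_names=["ts"], num_buckets=4)
client.create_table("metrics", schema, partitioning)

# Apply inserts through a session; the same table's columnar layout keeps it
# cheap to scan from a SQL engine such as Impala afterwards.
table = client.table("metrics")
session = client.new_session()
session.apply(table.new_insert({"ts": 1, "host": "web-1", "metric": "requests", "value": 42}))
session.flush()
```

The combination of keyed, partitioned writes with a columnar on-disk layout is what lets the same table serve both the fast-ingest and fast-scan sides of the workload the episode describes.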

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Kudu is and the motivation for building it?

How does it fit into the Hadoop ecosystem? How does it compare to the work being done on the Iceberg table format?

What are some of the common application and system design patterns that Kudu supports? How is Kudu architected and how has it evolved over the life of the project? There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu? How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?

What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?

A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure? What was the motivation for using C++ as the language target for Kudu?

If you were to start the project over today what would you do differently?

What are some situations where you would advise against using Kudu? What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu? What are you most excited about for the future of Kudu?

Contact Info

Brock

LinkedIn @brocknoland on Twitter

Jordan

LinkedIn @jordanbirdsell jbirdsell on GitHub

PhData

Website phdata on GitHub @phdatainc on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Kudu PhData Getting Started with Apache Kudu Thomson Reuters Hadoop Oracle Exadata Slowly Changing Dimensions HDFS S3 Azure Blob Storage State Farm Stanley Black & Decker ETL (Extract, Transform, Load) Parquet

Podcast Episode

ORC HBase Spark

Podcast Episode

Summary

When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable and allow everyone in the business to gain value from the data the folks at Dremio built a self service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Tomer Shiran about Dremio, the open source data as a service platform

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Dremio is and how the project and business got started?

What was the motivation for keeping your primary product open source? What is the governance model for the project?

How does Dremio fit in the current landscape of data tools?

What are some use cases that Dremio is uniquely equipped to support? Do you think that Dremio obviates the need for a data warehouse or large scale data lake?

How is Dremio architected internally?

How has that architecture evolved from when it was first built?

There is a large array of components (e.g. governance, lineage, catalog) built into Dremio that are often found in dedicated products. What are some of the strategies that you have as a business and development team to manage and integrate the complexity of the product?

What are the benefits of integrating all of those capabilities into a single system? What are the drawbacks?

One of the useful features of Dremio is the granular access controls. Can you discuss how those are implemented and controlled? For someone who is interested in deploying Dremio to their environment what is involved in getting it installed?

What are the scaling factors?

What are some of the most exciting features that have been added in recent releases? When is Dremio the wrong choice? What have been some of the most challenging aspects of building, maintaining, and growing the technical and business platform of Dremio? What do you have planned for the future of Dremio?

Contact Info

Tomer

@tshiran on Twitter LinkedIn

Dremio

Website @dremio on Twitter dremio on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Dremio MapR Presto Business Intelligence Arrow Tableau Power BI Jupyter OLAP Cube Apache Foundation Hadoop Nikon DSLR Spark ETL (Extract, Transform, Load) Parquet Avro K8s Helm Yarn Gandiva Initiative for Apache Arrow LLVM TLS

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease

Interview

Introduction How did you get involved in the area of data management? Can you start by describing what Upsolver is and how it got started?

What are your goals for the platform?

There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?

What are the shortcomings of a data lake architecture?

How is Upsolver architected?

How has that architecture changed over time? How do you manage schema validation for incoming data? What would you do differently if you were to start over today?

What are the biggest challenges at each of the major stages of the data lake? What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake? When is Upsolver the wrong choice for an organization considering implementation of a data platform? Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house? What features or improvements do you have planned for the future of Upsolver?

Contact Info

Yoni

yoniiny on GitHub LinkedIn

Upsolver

Website @upsolver on Twitter LinkedIn Facebook

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Upsolver Data Lake Israeli Army Data Warehouse Data Engineering Podcast Episode About Data Curation Three Vs Kafka Spark Presto Drill Spot Instances Object Storage Cassandra Redis Latency Avro Parquet ORC Data Engineering Podcast Episode About Data Serialization Formats SSTables Run Length Encoding CSV (Comma Separated Values) Protocol Buffers Kinesis ETL DevOps Prometheus Cloudwatch DataDog InfluxDB SQL Pandas Confluent KSQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly different which increases the difficulty of integration across systems. The Hive format is also built with the assumptions of a local filesystem which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
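
As a toy illustration of that metadata-layer idea, and emphatically not the Iceberg specification itself, the sketch below writes immutable Parquet data files with pyarrow and records each commit as a snapshot file that lists exactly which data files make up the table, so readers plan their scans from the snapshot instead of listing object storage. All paths and manifest fields are invented for the example.

```python
# Toy illustration of a snapshot-based metadata layer over Parquet files.
# This is NOT the Iceberg format; the layout and manifest fields are invented
# purely to show the idea of committing immutable file lists.
import json, os, time, uuid
import pyarrow as pa
import pyarrow.parquet as pq

TABLE_DIR = "warehouse/events"          # hypothetical table location
os.makedirs(os.path.join(TABLE_DIR, "data"), exist_ok=True)
os.makedirs(os.path.join(TABLE_DIR, "metadata"), exist_ok=True)

def append(batch: pa.Table, previous_files: list) -> str:
    """Write a new immutable data file and commit a snapshot listing all files."""
    data_path = os.path.join(TABLE_DIR, "data", f"{uuid.uuid4()}.parquet")
    pq.write_table(batch, data_path)

    snapshot = {
        "snapshot_id": int(time.time() * 1000),
        "files": previous_files + [data_path],  # full listing; readers never list the bucket
    }
    snap_path = os.path.join(TABLE_DIR, "metadata", f"snap-{snapshot['snapshot_id']}.json")
    with open(snap_path, "w") as f:
        json.dump(snapshot, f)
    return snap_path

def read(snapshot_path: str) -> pa.Table:
    """Plan the scan from the snapshot's file list instead of a directory listing."""
    with open(snapshot_path) as f:
        files = json.load(f)["files"]
    return pa.concat_tables([pq.read_table(p) for p in files])

snap = append(pa.table({"id": [1, 2], "event": ["a", "b"]}), previous_files=[])
print(read(snap).num_rows)
```

Because every commit is an atomic swap to a new snapshot over immutable files, readers on eventually consistent object stores like S3 always see a complete, self-consistent view of the table, which is the property the episode explores in much greater depth.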

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what Iceberg is and the motivation for creating it?

Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?

How has the use of Iceberg simplified your work at Netflix? How is the reference implementation architected and how has it evolved since you first began work on it?

What is involved in deploying it to a user’s environment?

For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?

Is there a migration path for pre-existing tables into the Iceberg format?

How is schema evolution managed at the file level?

How do you handle files on disk that don’t contain all of the fields specified in a table definition?

One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard? What are the unique challenges posed by using S3 as the basis for a data lake?

What are the benefits that outweigh the difficulties?

What have been some of the most challenging or contentious details of the specification to define?

What are some things that you have explicitly left out of the specification?

What are your long-term goals for the Iceberg specification?

Do you anticipate the reference implementation continuing to be used and maintained?

Contact Info

rdblue on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Iceberg Reference Implementation Iceberg Table Specification Netflix Hadoop Cloudera Avro Parquet Spark S3 HDFS Hive ORC S3mper Git Metacat Presto Pig DDL (Data Definition Language) Cost-Based Optimization

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data

Interview

Introduction How did you get involved in the area of data management? What is the intended use case for Quilt and how did the project get started? Can you step through a typical workflow of someone using Quilt?

How does that change as you go from a single user to a team of data engineers and data scientists?

Can you describe the elements of what a data package consists of?

What was your criteria for the file formats that you chose?

How is Quilt architected and what have been the most significant changes or evolutions since you first started? How is the data registry implemented?

What are the limitations or edge cases that you have run into? What optimizations have you made to accelerate synchronization of the data to and from the repository?

What are the limitations in terms of data volume, format, or usage? What is your goal with the business that you have built around the project? What are your plans for the future of Quilt?

Contact Info

Email LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Quilt Data GitHub Jobs Reproducible Data Dependencies in Jupyter Reproducible Machine Learning with Jupyter and Quilt Allen Institute: Programmatic Data Access with Quilt Quilt Example: MissingNo Oracle Pandas Jupyter Ycombinator Data.World

Podcast Episode with CTO Bryon Jacob

Kaggle Parquet HDF5 Arrow PySpark Excel Scala Binder Merkle Tree Allen Institute for Cell Science Flask PostGreSQL Docker Airflow Quilt Teams Hive Hive Metastore PrestoDB

Podcast Episode

Netflix Iceberg Kubernetes Helm

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Summary

To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.
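
For a sense of how the registry is used in practice, here is a minimal sketch that registers an Avro schema over the registry’s REST interface using the requests library. The registry URL and subject name are placeholders, and the example assumes a registry listening on its default port.

```python
# Minimal sketch of registering an Avro schema with a Confluent Schema Registry
# over its REST API. The registry URL and subject name are placeholders.
import json
import requests

REGISTRY_URL = "http://localhost:8081"   # assumed local registry on the default port
SUBJECT = "user-events-value"            # convention: <topic>-value

avro_schema = {
    "type": "record",
    "name": "UserEvent",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "event_type", "type": "string"},
    ],
}

# Register (or look up) the schema under the subject; the registry returns a
# globally unique schema id that producers embed alongside each message.
resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(avro_schema)}),
)
resp.raise_for_status()
print("schema id:", resp.json()["id"])

# Consumers can fetch the latest registered version to decode messages.
latest = requests.get(f"{REGISTRY_URL}/subjects/{SUBJECT}/versions/latest").json()
print("latest version:", latest["version"])
```

Because producers and consumers both resolve schemas through the same registry, compatibility rules can be enforced centrally at registration time rather than discovered as deserialization failures downstream.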

Preamble

Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers Your host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry

Interview

Introduction How did you get involved in the area of data engineering? What is the schema registry and what was the motivating factor for building it? If you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas? How did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options? Conversely, what would be involved in using a storage backend other than Kafka? What are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure? What are some of the biggest challenges that you faced while designing and building the schema registry? What is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations? What are some of the features or enhancements that you have in mind for future work?

Contact Info

ewencp on GitHub Website @ewencp on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Kafka Confluent Schema Registry Second Life Eve Online Yes, Virginia, You Really Do Need a Schema Registry JSON-Schema Parquet Avro Thrift Protocol Buffers Zookeeper Kafka Connect

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast