talk-data.com talk-data.com

Description

Summary Data engineering systems are complex and interconnected with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. In order to turn this into a tractable problem one approach is to define and enforce contracts between producers and consumers of data. Ananth Packildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing the dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.

Your host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.

Interview

Introduction How did you get involved in the area of data management? Can you describe what Schemata is and the story behind it?

How does the garbage in/garbage out problem manifest in data warehouse/data lake environments?

What are the different places in a data system that schema definitions need to be established?

What are the different ways that schema management gets complicated across those various points of interaction?

Can you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?

How does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?

How is the Schemata utility implemented?

What are some of the design and scope questions that you had to work through while developing Schemata?

What is the broad vision that you have for Schemata and its impact on data practices? How