talk-data.com

Description

Summary

With all of the tools and services available for building a data platform, it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real-world use of a specific technology and how well it lives up to its promises.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often take hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature, which instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Today’s episode of the Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.

Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what the Shopify platform is?
What kinds of data sources are you working with?

Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?
How have you structured your data teams to be able to deliver those projects?

What are the systems that you have in place, technological or otherwise, to allow you to support the needs of