dbt

How to Version Control Your Metrics to Create a Single Source of Truth for Business Metrics

2020-12-10 · dbt Coalesce 2020 Watch

video

Analytics Git

What happens when two people come to a meeting to talk about business metrics but they have different values for the same metric?

That meeting ends up being spent discussing how the metric was calculated rather than how to impact it.

In this video, you'll learn how the Fishtown Analytics team uses dbt to version control business metrics and create a single source of truth.

You'll also get a framework for how to implement version control for metrics at your organization.

Learn more about dbt at: https://getdbt.com https://twitter.com/getdbt

Learn more about Fishtown Analytics at: https://fishtownanalytics.com https://twitter.com/fishtowndata https://www.linkedin.com/company/fishtown-analytics/

How to Scale Data Teams with Data Clinics and Balance Short-Term and Long-Term Projects

2020-12-09 · dbt Coalesce 2020 Watch

video

by Jacob Frackson

Analytics BI

You’re in a state of flow, building out dbt models and then you get the dreaded message — "Quick question about this data..."

As a data team, how do you balance the roadmap work against those "quick" questions?

How do you prioritize all the work you need to do in the short-term (backlog items) while also working on your long-term projects (roadmap items)?

There are advantages to both backlog and roadmap items. How can data teams get the advantages of both?

In this video, Jacob Frackson shows how Data Clinics, dedicated time set aside to work on these requests, can help your data team achieve this balance and empower self-serve along the way.

Data clinics have helped an organization:

- Deliver 80% of Sprint Points
- Answer up to 8 data questions per day
- 10x weekly self-serve users on BI tools

How JetBlue became a data-driven airline using dbt

2020-12-09 · dbt Coalesce 2020 Watch

video

by Ashley Van Name (JetBlue)

Analytics DWH

What does a data-driven airline look like? How does a data-driven airline behave and treat customers?

JetBlue believes a data-driven airline should:

- Offer personalized customer interactions
- Predict delays and other "irregular" operations
- Enable all analysts to easily access a variety of data sources
- Study and monitor operations in real-time to make smarter decisions

The big question is... how does an airline become more data driven?

In this video, Ashley Van Name shares how a small team of data engineers at JetBlue successfully migrated their entire data warehouse workload to dbt and shares tips for setting yourself up for success with dbt.

Fun fact about JetBlue's dbt project: it has 1,800 data models on top of 280 data sources and 8,500 defined tests, and the team built the entire project in six months!

How to Map the Customer Journey from a Product Perspective Using dbt

2020-12-09 · dbt Coalesce 2020 Watch

video

by Grant Winship (Fishtown Analytics) , Sanjana Sen (Fishtown Analytics)

Analytics Funnel Marketing

In this talk, you'll learn how the team at TULA Skincare took a product perspective on the customer journey to understand how customers progress from basic products to more advanced ones.

It's important to map out the customer journey to understand where they get stuck, where they need help, where the business can improve.

However, when folx talk about mapping a customer’s journey, it's typically only from a marketing perspective. Which channels brought a customer into the funnel? How did they end up converting?

This is important, but that only covers the beginning of the journey where they become a customer. What about the rest of the customer journey where they begin to use your product(s) then go on to buy from you again and again?

What does that customer journey look like?

In this video, Sanjana Sen and Grant Winship of Fishtown Analytics talk through how they approached this exercise while working with the TULA team.

Introduction to dbt (data build tool) from Fishtown Analytics

2020-12-09 · dbt Coalesce 2020 Watch

video

Analytics ETL/ELT SQL

In this introduction to dbt tutorial, you'll learn about the core concepts of dbt and how it's used.

You probably know that data is a huge part of how the world runs now, including how businesses report on metrics and how they operate.

One of the difficult parts of working with data is communicating enough context and information to everyone in the organization so they understand the data they're looking at and whether it answers their questions.

That's where dbt comes in. dbt is a data transformation and documentation tool that helps data analysts, data engineers, and business stakeholders collaborate on data.

This introduction to dbt will walk you through: a short history of ELT, what is dbt (data build tool), and dbt core concepts.

The core dbt concepts include:

- Expressing transforms with SQL select
- Automatically build the DAG with ref(s)
- Tests ensure model accuracy
- Documentation is accessible and easily updated
- Use macros to write reusable SQL
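To make the "build the DAG with ref(s)" idea concrete, here is a minimal Python sketch of how a build order can be inferred from ref() calls inside SQL select statements. It is an illustration only, not dbt's implementation: the model names, the regular expression, and the `build_order` helper are all assumptions made for this example.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_order(models):
    """Return model names ordered so each model comes after the models it refs."""
    # Map every model to the set of models it depends on.
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical models keyed by name; the SQL is only parsed, never executed.
models = {
    "customer_revenue": (
        "select c.name, sum(o.amount) as revenue "
        "from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id "
        "group by c.name"
    ),
    "stg_orders": "select id, customer_id, amount from raw.orders",
    "stg_customers": "select id, name from raw.customers",
}

order = build_order(models)
print(order)  # both staging models (in either order), then customer_revenue
```

Because dependencies are declared inside the SQL itself, nobody has to maintain a separate ordering file: adding a ref() to a model is enough to change where it sits in the graph.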

Learn more about dbt Labs (formerly Fishtown Analytics) at: https://www.getdbt.com/dbt-labs/about-us/ https://twitter.com/dbt_labs https://www.linkedin.com/company/dbtlabs

Streaming Data Integration Without The Code at Equalum

2020-11-30 · Data Engineering Podcast Listen

podcast_episode

by Ido Friedman (Equalum) , Tobias Macey

Airflow Analytics BI CI/CD Cloud Computing Data Analytics Data Engineering Data Governance Data Management Data Quality Datafold ETL/ELT +3 more

Summary

The first stage of every good pipeline is to perform data integration. With the increasing pace of change and the need for up to date analytics the need to integrate that data in near real time is growing. With the improvements and increased variety of options for streaming data engines and improved tools for change data capture it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unloc

Keeping A Bigeye On The Data Quality Market

2020-11-23 · Data Engineering Podcast Listen

podcast_episode

by Egor Gryaznov (Bigeye) , Tobias Macey

Airflow Analytics BI BigEye CI/CD Cloud Computing Data Analytics Data Engineering Data Governance Data Management Data Quality Datafold +3 more

Summary

One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid then this episode is definitely worth a listen.

Announcements

Your host is Tobias Macey and today I’m interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at B

Self Service Data Management From Ingest To Insights With Isima

2020-11-17 · Data Engineering Podcast Listen

podcast_episode

by Darshan Rawal (Isima) , Tobias Macey

AI/ML Airflow Analytics API BI CI/CD Cloud Computing Data Analytics Data Engineering Data Governance Data Management Data Quality +4 more

Summary

The core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or APIs on top of a cleaned and curated data set. Despite the rapid progression of impressive tools and products built to fulfill this mission, it is still an uphill battle to tie everything together into a cohesive and reliable platform. At Isima they decided to reimagine the entire ecosystem from the ground up and built a single unified platform to allow end-to-end self service workflows from data ingestion through to analysis. In this episode CEO and co-founder of Isima Darshan Rawal explains how the biOS platform is architected to enable ease of use, the challenges that were involved in building an entirely new system from scratch, and how it can integrate with the rest of your data platform to allow for incremental adoption. This was an interesting and contrarian take on the current state of the data management industry and is worth a listen to gain some additional perspective.

Building A Cost Effective Data Catalog With Tree Schema

2020-11-10 · Data Engineering Podcast Listen

podcast_episode

by Grant Seward (Tree Schema) , Tobias Macey

Airflow Analytics BI CI/CD Cloud Computing Data Analytics Data Engineering Data Governance Data Management Data Quality Datafold ETL/ELT +2 more

Summary

A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy to use and cost effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it autodiscovers the schemas and metadata for your source systems.

Announcements

Your host is Tobias Macey and today I’m interviewing Grant Seward about Tree Schema, a human friendly data catalog

Interview

Introduction

How did you get involved in the area of data management?

Can you start by giving an overview of what you have built at Tree Schema?

What was your motivation for creating it?

At what stage of maturity should a team or organization

Building reuseable and trustworthy ELT pipelines (A templated approach)

2020-07-01 · Airflow Summit 2020

session

by Nehil Jain (SnapTravel)

Airflow ETL/ELT Singer SQL

To improve automation of data pipelines, I propose a universal approach to ELT pipelines that optimizes for data integrity, extensibility, and speed of delivery. The workflow is built using open source tools and standards like Apache Airflow, Singer, Great Expectations, and dbt.

Templating ETLs is challenging! The creation and maintenance of data pipelines in production require hard work to manage bugs in code and bad data. I would like to propose a data pipeline pattern that can simplify building pipelines while optimizing for data integrity and observability.

Goals:

- Make ELT simple and fast to implement
- Validate your assumptions about the data before you make it available for use
- Allow analysts and data scientists to make pain-free contributions to ELT using SQL
- Generate data documentation and failure logs for quick recovery, and fix outages in your pipeline

Target Audience:

- Approachable to any level of developer
- Novice data personnel interested in starting an ELT workflow and learning about different tools of the ecosystem
- Intermediate+ developers interested in supercharging their pipeline with the Write Audit Publish pattern and reducing pipeline debt
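The Write Audit Publish pattern mentioned in the abstract can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the speaker's pipeline: it uses the standard library's sqlite3 module in place of a warehouse, plain SQL count queries in place of Great Expectations, and a hypothetical `bookings` table with made-up checks.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Stage rows, audit them, and publish only if every check passes.

    Returns a dict of failed checks (empty when the batch was published).
    """
    cur = conn.cursor()
    # Hypothetical production table that downstream consumers read.
    cur.execute("CREATE TABLE IF NOT EXISTS bookings (id INTEGER PRIMARY KEY, amount REAL)")

    # Write: land the batch in a staging table that consumers never read.
    cur.execute("DROP TABLE IF EXISTS bookings_staging")
    cur.execute("CREATE TABLE bookings_staging (id INTEGER, amount REAL)")
    cur.executemany("INSERT INTO bookings_staging VALUES (?, ?)", rows)
    conn.commit()

    # Audit: each query counts the rows that violate one assumption.
    checks = {
        "null_ids": "SELECT COUNT(*) FROM bookings_staging WHERE id IS NULL",
        "duplicate_ids": "SELECT COUNT(id) - COUNT(DISTINCT id) FROM bookings_staging",
        "negative_amounts": "SELECT COUNT(*) FROM bookings_staging WHERE amount < 0",
    }
    failures = {}
    for name, query in checks.items():
        bad_rows = cur.execute(query).fetchone()[0]
        if bad_rows:
            failures[name] = bad_rows
    if failures:
        return failures  # staging is kept for debugging; nothing is published

    # Publish: only audited rows reach the production table.
    cur.execute("INSERT INTO bookings SELECT id, amount FROM bookings_staging")
    conn.commit()
    return failures


conn = sqlite3.connect(":memory:")
print(write_audit_publish(conn, [(1, 120.0), (2, 80.5)]))  # {}
print(write_audit_publish(conn, [(3, -10.0), (3, 45.0)]))  # {'duplicate_ids': 1, 'negative_amounts': 1}
```

The point of the staging step is that a bad batch fails loudly before anyone can query it, and the rejected rows remain available for inspection.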

Solving Data Lineage Tracking And Data Discovery At WeWork

2019-12-16 · Data Engineering Podcast Listen

podcast_episode

by Willy Lulciuc (WeWork) , Julien Le Dem (Astronomer) , Tobias Macey

AI/ML Airflow Analytics Big Data Dagster Data Engineering Data Management Data Modelling Data Quality Google Dataform Delta Docker +18 more

Summary

Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email [email protected] with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing what Marquez is?

What was missing in existing metadata management platforms that necessitated the creation of Marquez?

How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?

How does it compare to the Amundsen platform that Lyft recently released?

What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?

What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?

What are the primary resource types that you support in Marquez?

What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?

Can you explain how Marquez is architected and how the design has evolved since you first began working on it?

Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?

What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?

How is the metadata itself stored and managed in Marquez?

How much up-front data modeling is necessary and what types of schema representations are supported?

Can you talk through the overall workflow of someone using Marquez in their environment?

What is involved in registering and updating datasets?

How do you define and track the health of a given dataset?

What are some of the interesting questions that can be answered from the information stored in Marquez?

What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?

For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?

What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?

When is Marquez the wrong choice for a metadata repository?

What do you have planned for the future of Marquez?

Contact Info

Julien Le Dem

@J_ on Twitter, Email, julienledem on GitHub

Willy

LinkedIn, @wslulciuc on Twitter, wslulciuc on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.init, to learn about the Python language, its community, and the innovative ways it is being used.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you’ve learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Marquez

DataEngConf Presentation

WeWork Canary Yahoo Dremio Hadoop Pig Parquet

Podcast Episode

Airflow Apache Atlas Amundsen

Podcast Episode

Uber DataBook LinkedIn DataHub Iceberg Table Format

Podcast Episode

Delta Lake

Podcast Episode

Great Expectations data pipeline unit testing framework

Podcast.init Episode

Redshift SnowflakeDB

Podcast Episode

Apache Kafka Schema Registry

Podcast Episode

Open Tracing Jaeger Zipkin DropWizard Java framework Marquez UI Cayley Graph Database Kubernetes Marquez Helm Chart Marquez Docker Container Dagster

Podcast Episode

Luigi DBT

Podcast Episode

Thrift Protocol Buffers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Build Your Data Analytics Like An Engineer With DBT

2019-05-20 · Data Engineering Podcast Listen

podcast_episode

by Drew Banin (Fishtown Analytics), Tobias Macey

AI/ML Airflow Analytics AWS Azure BI Big Data BigQuery CI/CD Data Analytics Data Engineering Data Lake

Summary

In recent years the traditional approach to building data warehouses has shifted from transforming records before loading to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle-tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.

To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications.

Interview

Introduction How did you get involved in the area of data management? Can you start by explaining what DBT is and your motivation for creating it? Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline? Can you talk through the workflow for someone using DBT? One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented? The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?

Are these packages driven by Fishtown Analytics or the dbt community?

What are the limitations of modeling everything as a SELECT statement? Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?

What are your thoughts on higher level approaches to SQL that compile down to the specific statements?

Can you explain how DBT is implemented and how the design has evolved since you first began working on it? What are some of the features of DBT that are often overlooked which you find particularly useful? What are some of the most interesting/unexpected/innovative ways that you have seen DBT used? What are the additional features that the commercial version of DBT provides? What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT? When is it the wrong choice? What do you have planned for the future of DBT?

Contact Info

Email @drebanin on Twitter drebanin on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

DBT Fishtown Analytics 8Tracks Internet Radio Redshift Magento Stitch Data Fivetran Airflow Business Intelligence Jinja template language BigQuery Snowflake Version Control Git Continuous Integration Test Driven Development Snowplow Analytics

Podcast Episode

dbt-utils We Can Do Better Than SQL blog post from EdgeDB EdgeDB Looker LookML

Podcast Interview

Presto DB

Podcast Interview

Spark SQL Hive Azure SQL Data Warehouse Data Warehouse Data Lake Data Council Conference Slowly Changing Dimensions dbt Archival Mode Analytics Periscope BI dbt docs dbt repository

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


An Evolving DAG for the LLM world - Julia Schottenstein of LangChain at Small Data SF

· Small Data SF 2024 Watch

video

AI/ML LLM RAG Vector DB

Directed Acyclic Graphs (DAGs) are the foundation of most orchestration frameworks. But what happens when you allow an LLM to act as the router? Acyclic graphs now become cyclic, which means you have to design for the challenges resulting from all this extra power. We'll cover the ins and outs of agentic applications and how to best use them in your work as a data practitioner or developer building today.

➡️ Follow Us LinkedIn: https://www.linkedin.com/company/small-data-sf/ X/Twitter: https://twitter.com/smalldatasf Website: https://www.smalldatasf.com/

Discover LangChain, the open-source framework for building powerful agentic systems. Learn how to augment LLMs with your private data, moving beyond their training cutoffs. We'll break down how LangChain uses "chains," which are essentially Directed Acyclic Graphs (DAGs) similar to data pipelines you might recognize from dbt. This structure is perfect for common patterns like Retrieval Augmented Generation (RAG), where you orchestrate steps to fetch context from a vector database and feed it to an LLM to generate an informed response, much like preparing data for analysis.
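As a rough illustration of the RAG pattern described above, the following sketch wires three fixed steps (retrieve, build prompt, generate) into a linear chain. It is not LangChain's API: every name here (DOCUMENTS, retrieve, build_prompt, stub_llm, rag_chain) is hypothetical, word overlap stands in for vector similarity, and the "LLM" is a stub that returns a placeholder string.

```python
import re

# Hypothetical in-memory stand-in for a vector database.
# A real system would store embeddings rather than raw strings.
DOCUMENTS = [
    "dbt models are SELECT statements that build tables and views.",
    "LangGraph lets an application graph contain cycles.",
    "DuckDB is an in-process analytical database.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Step 1: rank documents by word overlap with the question.

    Word overlap is an assumption made for this sketch; it replaces
    the embedding similarity search a vector database would perform.
    """
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context):
    """Step 2: place the retrieved context ahead of the question."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


def stub_llm(prompt):
    """Step 3: placeholder for a model call; it only reports prompt size."""
    return f"(answer generated from a {len(prompt)}-character prompt)"


def rag_chain(question):
    """Run the three steps in a fixed order: a small directed acyclic graph."""
    context = retrieve(question, DOCUMENTS)
    prompt = build_prompt(question, context)
    return stub_llm(prompt)
```

Calling rag_chain("What is DuckDB?") always runs the same three steps in the same order, which is what makes this shape a chain rather than an agent.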

Dive into the world of AI agents, where the LLM itself determines the application's control flow. Unlike a predefined DAG, this allows for dynamic, cyclic graphs where an agent can iterate and improve its response based on previous attempts. We'll explore the core challenges in building reliable agents: effective planning and reflection, managing shared memory across multiple agents in a cognitive architecture, and ensuring reliability against task ambiguity. Understand the critical trade-offs between the dependability of static chains and the flexibility of dynamic LLM agents.
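The difference between a fixed chain and a cyclic agent graph can be sketched in plain Python. This is a toy under stated assumptions, not the speaker's code or any framework's API: draft_node, reflect_node, route, and run_agent are hypothetical names, the "reflection" simply accepts the third draft, and a step cap stands in for the reliability safeguards a real agent would need.

```python
def draft_node(state):
    """Hypothetical 'generate' step: produce a new numbered draft."""
    state["revision"] += 1
    state["answer"] = f"{state['task']} (draft {state['revision']})"
    return state


def reflect_node(state):
    """Hypothetical 'reflect' step: a stand-in for LLM self-critique.

    Assumption for this sketch: the third draft is good enough.
    """
    state["accepted"] = state["revision"] >= 3
    return state


NODES = {"draft": draft_node, "reflect": reflect_node}


def route(current, state):
    """Stand-in for the LLM acting as router: pick the next node."""
    if current == "draft":
        return "reflect"
    # The edge from 'reflect' back to 'draft' is what makes the graph cyclic.
    return "end" if state["accepted"] else "draft"


def run_agent(task, max_steps=10):
    """Walk the graph until the router says 'end' or the step cap is hit."""
    state = {"task": task, "revision": 0, "accepted": False}
    current, trace = "draft", []
    while current != "end" and len(trace) < max_steps:
        state = NODES[current](state)
        trace.append(current)
        current = route(current, state)
    return state, trace
```

The max_steps cap matters because, unlike a DAG, a cyclic graph has no structural guarantee that it terminates; with max_steps=2 the loop stops after one draft and one reflection without an accepted answer.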

Introducing LangGraph, a framework designed to solve the agent reliability problem by balancing agent control with agency. Through a live demo in LangGraph Studio, see how to build complex AI applications using a cyclic graph. We'll demonstrate how a router agent can delegate tasks, execute a research plan with multiple steps, and use cycles to iterate on a problem. You'll also see how human-in-the-loop intervention can steer the agent for improved performance, a critical feature for building robust and observable agentic systems.

Explore some of the most exciting AI agents in production today. See how Roblox uses an AI assistant to generate virtual worlds from a prompt, how TripAdvisor’s agent acts as a personal travel concierge to create custom itineraries, and how Replit’s coding agent automates code generation and pull requests. These real-world examples showcase the practical power of moving from simple DAGs to dynamic, cyclic graphs for solving complex, agentic problems.

Avoiding dbt mess: From 1000+ model dbt mono-repository to multi-project

· Madrid dbt Meetup #7 (in-person)

talk

Data Contracts cookiecutter templates tdd

Behind the scenes of the journey to redefine data workflows and elevate data quality by breaking down a huge dbt repository into smaller projects using cookiecutter templates, enforcing software engineering best practices, TDD, and defining data contracts.

dbt in Snowflake - First experiences and comparisons to dbt Cloud

· NL dbt meetup: 14th Edition

talk

by Chris Verweij (Nimbus Intelligence)

Cloud Computing Git Snowflake

Taking a first look at running dbt projects natively in Snowflake Workspaces, the advantages and disadvantages compared to dbt Cloud, and whether moving right now fits your use case. With a hands-on demonstration and some extra tips on connecting Snowflake to your remote Git dbt repository.

DuckDBT: Not a database or a dbt adapter but a secret third thing – DuckCon #3 (San Francisco)

· DuckCon #3 San Francisco 2023 Watch

video

by Josh Wills

DuckDB

Speaker: Josh Wills Slides: https://blobs.duckdb.org/events/duckcon3/josh-wills-duckdbt.pdf

DuckLake + dbt: Time traveling with SQL, metadata, and a little AI

· NL dbt meetup: 14th Edition

talk

by Leonardo Vida (MotherDuck)

ducklake

We’ll take a hands-on look at using DuckLake behind dbt: how inserts land, how to time travel your data, and how to see what changed between runs. I’ll show some handy dbt macros plus a few tricks for skipping a heavy metrics layer and leaning on AI to suggest columns and comments instead.

How to enable self-service analytics through the (dbt) semantic layer

· Madrid dbt Meetup #7 (in-person)

talk

by Cris Navas (Astrafy)

Lightdash semantic layer

Discover how semantic layers (dbt/Lightdash) empower all kinds of users to navigate their data without technical knowledge.
