Airbyte

Daniel Herrera: Building AI analyst agents from scratch

2025-11-05 · Analytics Engineering and AI Native Amsterdam Co-Op Meetup

talk

by Daniel Herrera (Teradata)

Airflow dbt llms

Daniel will take a hands-on journey into building AI analyst agents from scratch. Using dbt metadata to provide large language models with the right context, he’ll show how to connect LLMs to your data effectively. Expect a deep dive into the challenges of query generation, practical frameworks for success, and lessons learned from real-world implementations.

One platform, two-way value: How data activation drives business impact

2025-10-14 · dbt Coalesce 2025 Watch

talk

by Charles Giardina (Airbyte)

dbt ETL/ELT

A leading life sciences company discovered the hidden cost of disconnected tools when it was locked into an expensive, vertically-focused data integration platform. Find out how the company consolidated its data activation workflow (ELT and reverse ETL) on a unified platform with Airbyte connected to dbt.

What Happens When You Ask an AI Agent to Refresh Your Data?

2025-09-25 · Big Data LDN 2025

Face To Face

by Davin Chia

AI/ML

It sounds simple: “Hey AI, refresh my Salesforce data.” But what really happens when that request travels through your stack? Using Airbyte’

What Happens When You Ask an AI Agent to Refresh Your Data?

2025-09-25 · Big Data LDN 2025

Face To Face

AI/ML LLM Cyber Security

It sounds simple: “Hey AI, refresh my Salesforce data.” But what really happens when that request travels through your stack?

Using Airbyte’s architecture as a model, this talk explores the complexity behind natural language data triggers - from spinning up connectors and handling credentials, to enforcing access controls and orchestrating safe, purpose-driven movement. We’ll introduce a unified framework for thinking about all types of data movement, from bulk ingestion to fine-grained activation - a model we’ve developed to bring clarity to a space crowded with overlapping terms and toolchains.

We’ll also explore how this foundation—and any modern data movement platform—must evolve for an AI-native world, where speed, locality, and security are non-negotiable. That includes new risks: leaking credentials into LLMs, or triggering unintended downstream effects from a single prompt.

We’ll close with a live demo: spinning up a local data plane and moving data via Airbyte—simply by chatting with a bot.

Driving Analytics with Open Source: Airbyte, dbt, Airflow & Metabase

2025-07-01 · Airflow Summit 2025

session

by Ayoade Adegbite

Airflow Analytics dbt Metabase postgresql

In this talk, I’ll walk through how we built an end-to-end analytics pipeline using open-source tools ( Airbyte, dbt, Airflow, and Metabase). At WirePick, we extract data from multiple sources using Airbyte OSS into PostgreSQL, transform it into business-specific data marts with dbt, and automate the entire workflow using Airflow. Our Metabase dashboards provide real-time insights, and we integrate Slack notifications to alert stakeholders when key business metrics change. This session will cover: Data extraction: Using Airbyte OSS to pull data from multiple sources Transformation & Modeling: How dbt helps create reusable data marts Automation & Orchestration: Managing the workflow with Airflow Data-driven decision-making: Delivering insights through Metabase & Slack alerts

Sponsored by: Airbyte | How Data Movement Powers GenAI

Let's Save Tons of Money With Cloud-Native Data Ingestion!

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Tyler Croy (Scribd, Inc.)

AWS Aurora Kinesis Azure Cloud Computing Databricks Delta GCP Kafka

Delta Lake is a fantastic technology for quickly querying massive data sets, but first you need those massive data sets! In this session we will dive into the cloud-native architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis Data Firehose and more. By using off-the-shelf open source tools like kafka-delta-ingest, oxbow and Airbyte, Scribd has redefined its ingestion architecture to be more event-driven, reliable, and most importantly: cheaper. No jobs needed! Attendees will learn how to use third-party tools in concert with a Databricks and Unity Catalog environment to provide a highly efficient and available data platform. This architecture will be presented in the context of AWS but can be adapted for Azure, Google Cloud Platform or even on-premise environments.

Coalesce 2024: Fueling product development and customer insights with dbt

2024-10-17 · Dbt Coalesce 2024 Watch

video

by Natalie Kwong (Airbyte)

Analytics Cloud Computing dbt GitHub

At Airbyte, we leverage dbt to power our roadmap - from user discovery to customer retention efforts.

We need to parse across many sources of data across our open-source and Cloud communities, including Gong transcripts, NPS surveys, and Github issues. I'll share examples of how dbt powers how we work - from discovering product gaps and their importance to deals, to building retention tools like custom notifications around customer pipelines.

Speaker: Natalie Kwong Product Airbyte

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Scaling Airbyte: Challenges and Milestones on the Road to 1.0

2024-09-23 · Data Engineering Podcast Listen

podcast_episode

by Michel Tricot (Airbyte) , Tobias Macey

AI/ML Cloud Computing Data Engineering Data Management GenAI Modern Data Stack Python

Summary Airbyte is one of the most prominent platforms for data movement. Over the past 4 years they have invested heavily in solutions for scaling the self-hosted and cloud operations, as well as the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementYour host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the projectInterview IntroductionHow did you get involved in the area of data management?Can you describe what Airbyte is and the story behind it?What are some of the notable milestones that you have traversed on your path to the 1.0 release?The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?What are some of the hard-won lessons that you have learned about the realities of data movement and integration?What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?What are the core architectural decisions that have proven to be effective?How has the architecture had to change as you progressed to the 1.0 release?A 1.0 version signals a degree of stability and commitment. Can you describe the decision process that you went through in committing to a 1.0 version?What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?When is Airbyte the wrong choice?What do you have planned for the future of Airbyte after the 1.0 launch?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links AirbytePodcast EpisodeAirbyte CloudAirbyte Connector BuilderSinger ProtocolAirbyte ProtocolAirbyte CDKModern Data StackELTVector DatabasedbtFivetranPodcast EpisodeMeltanoPodcast EpisodedltReverse ETLGraphRAGAI Engineering Podcast EpisodeThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The Hard Road to Seamless Data: Lessons from Building Airbyte

2024-09-18 · Big Data LDN 2024

Face To Face

by Davin Chia

AI/ML GenAI

In today’s data-driven world, whether you’re building your own data pipelines or relying on third-party vendors, understanding the fundamentals of great data movement systems is invaluable. It’s not just about making things work—it’s about ensuring your data operations are reliable, scalable, and cost-effective.

As an early employee and Airbyte’s Platform Architect, I’ve spent the last 3.5 years working through the challenges and intricacies of building a data movement platform. Along the way, I’ve learned some important lessons, often the hard way, that I believe could be helpful to others who are on a similar journey.

In this session, I’ll share these lessons in the hope that my experiences can offer some guidance, whether you’re just starting out or looking to refine what you’ve already built. I’ll also touch on how the rapid rise of generative AI is changing the landscape, and how we’re trying to adapt to these new challenges. My goal is to provide insights anyone can take back to their own projects, helping them avoid some of the pitfalls and navigate the complexities of modern data movement.

2 - 3 Main Actionable Takeaways:

• A general framework for designing a data movement system.

• Crucial fine print such as managing various destination memory types, the surprising need to re-import data and the shortcuts & pitfalls of artificial cursors.

• Adjusting data movement systems for an AI-first world.

Release Management For Data Platform Services And Logic

2024-05-12 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey

AI/ML Dagster Data Engineering Data Lake Data Lakehouse Data Management dbt Delta Hive Iceberg Python Snowflake +2 more

Summary

Building a data platform is a substrantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Commentst" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I want to talk about my experiences managing the QA and release management process of my data platform

Interview

Introduction As a team, our overall goal is to ensure that the production environment for our data platform is highly stable and reliable. This is the foundational element of establishing and maintaining trust with the consumers of our data. In order to support this effort, we need to ensure that only changes that have been tested and verified are promoted to production. Our current challenge is one that plagues all data teams. We want to have an environment that mirrors our production environment that is available for testing, but it’s not feasible to maintain a complete duplicate of all of the production data. Compounding that challenge is the fact that each of the components of our data platform interact with data in slightly different ways and need different processes for ensuring that changes are being promoted safely.

Contact Info

LinkedIn Website

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Data Platforms and Leaky Abstractions Episode Building A Data Platform From Scratch Airbyte

Podcast Episode

Trino dbt Starburst Galaxy Superset Dagster LakeFS

Podcast Episode

Nessie

Podcast Episode

Iceberg Snowflake LocalStack DSL == Domain Specific Language

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-S

Fundamentals of Analytics Engineering

2024-03-29 · O'Reilly Data Science Books O'Reilly Amazon

book

by Lasse Benninga (LaBenni Consulting) , Juan Manuel Perafan (Xebia) , Fanny Kassapian , Ricardo Angel Granados Lopez , Dumky de Wilde (MotherDuck) , Taís Laurindo Pereira , Jovan Gligorevic

Analytics Analytics Engineering BigQuery CI/CD Data Engineering Data Modelling Data Quality dbt Git business-intelligence data data-science +1 more

Master the art and science of analytics engineering with 'Fundamentals of Analytics Engineering.' This book takes you on a comprehensive journey from understanding foundational concepts to implementing end-to-end analytics solutions. You'll gain not just theoretical knowledge but practical expertise in building scalable, robust data platforms to meet organizational needs. What this Book will help me do Design and implement effective data pipelines leveraging modern tools like Airbyte, BigQuery, and dbt. Adopt best practices for data modeling and schema design to enhance system performance and develop clearer data structures. Learn advanced techniques for ensuring data quality, governance, and observability in your data solutions. Master collaborative coding practices, including version control with Git and strategies for maintaining well-documented codebases. Automate and manage data workflows efficiently using CI/CD pipelines and workflow orchestrators. Author(s) Dumky De Wilde, alongside six co-authors-experienced professionals from various facets of the analytics field-delivers a cohesive exploration of analytics engineering. The authors blend their expertise in software development, data analysis, and engineering to offer actionable advice and insights. Their approachable ethos makes complex concepts understandable, promoting educational learning. Who is it for? This book is a perfect fit for data analysts and engineers curious about transitioning into analytics engineering. Aspiring professionals as well as seasoned analytics engineers looking to deepen their understanding of modern practices will find guidance. It's tailored for individuals aiming to boost their career trajectory in data engineering roles, addressing fundamental to advanced topics.

The Entrepreneurship Journey: From Freelancing to Starting a Company - Adrian Brudaru

2023-12-19 · DataTalks.Club Listen

podcast_episode

by Adrian Brudaru (dlthub)

AI/ML Fivetran GitHub HTML

We talked about:

Adrian’s background The benefits of freelancing Having an agency vs freelancing What let Adrian switch over from freelancing The conception of DLT (Growth Full Stack) The investment required to start a company Growth through the provision of services Growth through teaching (product-market fit) Moving on to creating docs Adrian’s current role Strategic partnerships and community growth through DocDB Plans for the future of DLT DLT vs Airbyte vs Fivetran Adrian’s resource recommendations

Links:

Adrian's LinkedIn: https://www.linkedin.com/in/data-team/ Twitter: https://twitter.com/dlt_library Github: https://github.com/dlt-hub/dlt Website: https://dlthub.com/docs/intro

Free ML Engineering course: http://mlzoomcamp.com Join DataTalks.Club: https://datatalks.club/slack.html Our events: https://datatalks.club/events.html

New Girl, but Jess is a chatbot: AI joins the data team's loft - Coalesce 2023

2023-10-25 · dbt Coalesce 2023 Watch

video

by Alex Gronemeyer (Airbyte)

AI/ML Analytics

A data team looks to grow by 33% by making their biggest hire to date: an AI Powered Chatbot. The biggest problem for this two person team: they don’t know where to start.

Join Airbyte on the journey from first Google search through procurement to implementation as they try to figure out if they can make an AI chatbot work for answering all of our companies questions.

They're (most likely) just like you: they know that they needed to do something with AI but weren't sure where or how to start. In this presentation, the Airbyte team talks through the process of trying to figure out how to use AI to make their team more efficient and have a wider reach in a rapidly growing, fully remote organization.

Speaker: Alex Gronemeyer, Lead Analytics Engineer, Airbyte

Register for Coalesce at https://coalesce.getdbt.com

Domesticating a feral cat data stack - Coalesce 2023

2023-10-24 · dbt Coalesce 2023 Watch

video

by Lauren Benezra (dbt Labs)

Analytics BigQuery Cloud Computing dbt Google Sheets Modern Data Stack

Lauren Benezra has been volunteering with a local cat rescue since 2018. She recently took on the challenge of rebuilding their data stack from scratch, replacing a Jenga tower of incomprehensible Google Sheets with a more reliable system backed by the Modern Data Stack. By using Airtable, Airbyte, BigQuery, dbt Cloud and Census, her role as Foster Coordinator has transformed: instead of digging for buried information while wrangling cats, she now serves up accurate data with ease while... well... wrangling cats.

Viewers will learn that it's possible to run an extremely scalable and reliable stack on a shoestring budget, and will come away with actionable steps to put Lauren's hard-won lessons into practice in their own volunteering projects or as the first data hire in a tiny startup.

Speakers: Lauren Benezra, Senior Analytics Engineer, dbt Labs

Register for Coalesce at https://coalesce.getdbt.com/

Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

2023-10-15 · Data Engineering Podcast Listen

podcast_episode

by Eric Sammer (Decodable) , Tobias Macey

AI/ML Analytics Flink API Kinesis BI CI/CD Cloud Computing Data Engineering Data Management Data Quality Data Science +21 more

Summary

Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. . Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES. Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable

Interview

Introduction How did you get involved in the area of data management? Can you describe what Decodable is and the story behind it?

What are the notable changes to the Decodable platform since we last spoke? (October 2021) What are the industry shifts that have influenced the product direction?

What are the problems that customers are trying to solve when they come to Decodable? When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL? What are the developer experience challenges that are particular to working with streaming data?

How have you worked to address that in the Decodable platform and interfaces?

As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced? What are the most interesting, innovative, or unexpected ways that you have seen Decodable used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable? When is Decodable the wrong choice? What do you have planned for the future of Decodable?

Contact Info

esammer on GitHub LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

Decodable

Podcast Episode

Understanding the Apache Flink Journey Flink

Podcast Episode

Debezium

Podcast Episode

Kafka Redpanda

Podcast Episode

Kinesis PostgreSQL

Podcast Episode

Snowflake

Podcast Episode

Databricks Startree Pinot

Podcast Episode

Rockset

Podcast Episode

Druid InfluxDB Samza Storm Pulsar

Podcast Episode

ksqlDB

Podcast Episode

dbt GitHub Actions Airbyte Singer Splunk Outbox Pattern

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Neo4J: NODES Conference Logo

NODES 2023 is a free online conference focused on graph-driven innovations with content for all skill levels. Its 24 hours are packed with 90 interactive technical sessions from top developers and data scientists across the world covering a broad range of topics and use cases. The event tracks: - Intelligent Applications: APIs, Libraries, and Frameworks – Tools and best practices for creating graph-powered applications and APIs with any software stack and programming language, including Java, Python, and JavaScript - Machine Learning and AI – How graph technology provides context for your data and enhances the accuracy of your AI and ML projects (e.g.: graph neural networks, responsible AI) - Visualization: Tools, Techniques, and Best Practices – Techniques and tools for exploring hidden and unknown patterns in your data and presenting complex relationships (knowledge graphs, ethical data practices, and data representation)

Don’t miss your chance to hear about the latest graph-powered implementations and best practices for free on October 26 at NODES 2023. Go to Neo4j.com/NODES today to see the full agenda and register!Rudderstack:

Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstackMaterialize:

You shouldn't have to throw away the database to build with fast-changing data. Keep the familiar SQL, keep the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date.

That is Materialize, the only true SQL streaming database built from the ground up to meet the needs of modern data products: Fresh, Correct, Scalable — all in a familiar SQL UI. Built on Timely Dataflow and Differential Dataflow, open source frameworks created by cofounder Frank McSherry at Microsoft Research, Materialize is trusted by data and engineering teams at Ramp, Pluralsight, Onward and more to build real-time data products without the cost, complexity, and development time of stream processing.

Go to materialize.com today and get 2 weeks free!Datafold:

This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare…

Michel Tricot - The Impact of AI on the Modern Data Stack

2023-09-19 · The Joe Reis Show Listen

podcast_episode

by Michel Tricot (Airbyte) , Joe Reis (DeepLearning.AI)

AI/ML ETL/ELT LLM Modern Data Stack Pinecone

Michel Tricot (CEO of Airbyte) joins me to chat about the impact of AI on the modern data stack, ETL for AI, the challenges of moving from open source to a paid product, and much more.

Airbyte & Pinecone - https://airbyte.com/tutorials/chat-with-your-data-using-openai-pinecone-airbyte-and-langchain

Note from Joe - I had audio issues cuz he got a new computer and didn't use the correct mic :(

Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

2023-09-04 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey , Adrian Brudaru (dlthub)

Analytics BI CI/CD Cloud Computing Data Engineering Data Management Data Quality Datafold dbt ETL/ELT Fivetran Matillion +7 more

Summary

Cloud data warehouses and the introduction of the ELT paradigm has led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silo. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold Your host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source python library for data loading

Interview

Introduction How did you get involved in the area of data management? Can you describe what dlt is and the story behind it?

What is the problem you want to solve with dlt? Who is the target audience?

The obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt? Can you describe how dlt is implemented? What are the benefits of building it in Python? How have the design and goals of the project changed since you first started working on it? How does that language choice influence the performance and scaling characteristics? What problems do users solve with dlt? What are the interfaces available for extending/customizing/integrating with dlt? Can you talk through the process of adding a new source/destination? What is the workflow for someone building a pipeline with dlt? How does the experience scale when supporting multiple connections? Given the limited scope of extract and load, and the composable design of dlt it seems like a purpose built companion to dbt (down to th

What Happens When The Abstractions Leak On Your Data

2023-05-15 · Data Engineering Podcast Listen

podcast_episode

by Tobias Macey

AI/ML API AWS AWS Glue BigQuery CDP Dagster Data Engineering Data Lake Data Lakehouse Data Management Data Modelling +10 more

Summary

All of the advancements in our technology is based around the principles of abstraction. These are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observances on how to deal with that situation in a data platform architecture.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack Your host is Tobias Macey and today I'm sharing some thoughts and observances about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow

Interview

Introduction impact of community tech debt

hive metastore new work being done but not widely adopted

tensions between automation and correctness data type mapping

integer types complex types naming things (keys/column names from APIs to databases)

disaggregated databases - pros and cons

flexibility and cost control not as much tooling invested vs. Snowflake/BigQuery/Redshift

data modeling

dimensional modeling vs. answering today's questions

What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform? When is ELT the wrong choice? What do you have planned for the future of your data platform?

Contact Info

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected]) with your story. To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

dbt Airbyte

Podcast Episode

Dagster

Podcast Episode

Trino

Podcast Episode

ELT Data Lakehouse Snowflake BigQuery Redshift Technical Debt Hive Metastore AWS Glue

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Sponsored By: Rudderstack:

RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines.

RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team.

RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again.

Visit dataengineeringpodcast.com/rudderstack to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.Support Data Engineering Podcast

Building Leverage with dbt Labs at Airbyte

2022-10-25 · dbt Coalesce 2022 Watch

video

by Christophe Duong (Airbyte) , Charles Giardina (Airbyte)

Cloud Computing dbt DWH

Airbyte supports loading data into a wide array of databases and data warehouses. We must enforce the same structure and transformations in each of these tools, and writing different transformations for each would be prohibitive. Instead, we use dbt to write this code once and reuse it for every database and data warehouse that we support. In an effort to improve our support across all these tools, we are also introducing a dbt Cloud integration within Airbyte Cloud. This will allow Airbyte Cloud users to leverage the lessons we’ve learned and build their own custom transformations using dbt Cloud.

Check the slides here: https://docs.google.com/presentation/d/19asIBrCgs04dJ07zhb1cosYEQHQC0yqEVSMlLcUymZY/edit?usp=sharing

Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/.

talk-data.com

Activity Trend

Top Events

Top Speakers

Daniel Herrera: Building AI analyst agents from scratch

One platform, two-way value: How data activation drives business impact

What Happens When You Ask an AI Agent to Refresh Your Data?

What Happens When You Ask an AI Agent to Refresh Your Data?

Driving Analytics with Open Source: Airbyte, dbt, Airflow & Metabase

Sponsored by: Airbyte | How Data Movement Powers GenAI

Let's Save Tons of Money With Cloud-Native Data Ingestion!

Coalesce 2024: Fueling product development and customer insights with dbt

Scaling Airbyte: Challenges and Milestones on the Road to 1.0

The Hard Road to Seamless Data: Lessons from Building Airbyte

Release Management For Data Platform Services And Logic

Fundamentals of Analytics Engineering

The Entrepreneurship Journey: From Freelancing to Starting a Company - Adrian Brudaru

New Girl, but Jess is a chatbot: AI joins the data team's loft - Coalesce 2023

Domesticating a feral cat data stack - Coalesce 2023

Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

Michel Tricot - The Impact of AI on the Modern Data Stack

Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

What Happens When The Abstractions Leak On Your Data

Building Leverage with dbt Labs at Airbyte