talk-data.com

Topic: dbt (data build tool)

Tags: data_transformation, analytics_engineering, sql

8 tagged activities

Activity Trend (chart): peak of 134 activities per quarter; range 2020-Q1 to 2026-Q1

Activities

Showing results filtered by event: Airflow Summit 2024

OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase, facilitating lineage collection across providers such as Google, Amazon, and more. Atlan Data Catalog is a third-generation active metadata platform: a single source of trust unifying cataloging, data discovery, lineage, and governance. We will demonstrate what OpenLineage is and how, with a minimal and intuitive setup across Airflow and Atlan, it presents a unified workflow view and efficient cross-platform lineage collection, including column-level lineage, across technologies (Python, Spark, dbt, SQL, etc.) and clouds (AWS, Azure, GCP, etc.), all orchestrated by Airflow. This integration unlocks further use cases in automated metadata management by making operational pipelines dataset-aware for self-service exploration. We will also demonstrate real-world challenges, and their resolutions, for lineage consumers improving audit and compliance accuracy through column-level lineage traceability across the data estate. The talk will close with a brief overview of the most recent OpenLineage developments and planned future enhancements.
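Enabling lineage emission from Airflow can be as small as a configuration fragment. A hedged sketch, assuming the apache-airflow-providers-openlineage package is installed and an OpenLineage-compatible backend is reachable over HTTP; the namespace and URLs are placeholders:

```shell
# Placeholder values: point the transport at any OpenLineage-compatible
# backend (Marquez, Atlan, etc.). Supported operators then emit lineage
# events automatically, with no DAG code changes.
export AIRFLOW__OPENLINEAGE__NAMESPACE="my_airflow_deployment"
export AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "https://lineage.example.com", "endpoint": "api/v1/lineage"}'
```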

Airflow is often used for running data pipelines, which themselves connect with other services through the provider system. However, it is also increasingly used as an engine under the hood for other projects building on top of the DAG primitive. For example, Cosmos is a framework for automatically transforming dbt DAGs into Airflow DAGs, so that users can supplement the developer experience of dbt with the power of Airflow. This session dives into how a select group of these frameworks (Cosmos, Meltano, Chronon) use Airflow as an engine for orchestrating the complex workflows their systems depend on. In particular, we will discuss ways that we’ve increased Airflow performance to meet application-specific demands (high-task-count Cosmos DAGs, streaming jobs in Chronon), new Airflow features that will evolve how these frameworks use Airflow under the hood (DAG versioning, dataset integrations), and paths we see these projects taking over the next few years as Airflow grows. Airflow is not just a DAG platform; it’s an application platform!
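A minimal sketch of how Cosmos renders a dbt project as an Airflow DAG, assuming the astronomer-cosmos package; the project path, profile name, and profiles.yml location are placeholders, not a definitive setup:

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

# Cosmos introspects the dbt project and emits one Airflow task per dbt
# node, preserving the dbt DAG's dependency graph.
my_dbt_dag = DbtDag(
    dag_id="jaffle_shop",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="jaffle_shop",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/profiles.yml",
    ),
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
)
```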

Looking for a way to streamline your data workflows and master the art of orchestration? As modern data engineering grows more complex, dynamic workflows and intricate pipeline dependencies in Airflow are increasingly common. To empower data engineers to use Airflow as their main orchestrator, Airflow Datasets can be integrated directly into the data journey. This session will showcase dynamic workflow orchestration in Airflow and how to manage multi-DAG dependencies with multi-dataset listening. We’ll take you through a real-time data pipeline with Pub/Sub messaging integration and dbt in a Google Cloud environment, ensuring data transformations are triggered only upon new data ingestion, moving away from rigid time-based scheduling and from sensors and other legacy ways of triggering a DAG.
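A minimal sketch of the multi-dataset pattern, assuming Airflow 2.4+ with dataset-aware scheduling; the dataset URIs and task commands are illustrative only:

```python
from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.bash import BashOperator

# Hypothetical dataset URIs; Airflow treats them as opaque identifiers.
raw_orders = Dataset("gs://example-bucket/raw/orders")
raw_customers = Dataset("gs://example-bucket/raw/customers")

# Producer DAGs declare datasets as outlets of the tasks that update them.
with DAG("ingest_orders", start_date=datetime(2024, 1, 1), schedule="@hourly"):
    BashOperator(task_id="load_orders", bash_command="echo load",
                 outlets=[raw_orders])

# The consumer listens on BOTH datasets: it runs only once each has been
# updated since its last run; no time-based schedule, no sensors.
with DAG("dbt_transform", start_date=datetime(2024, 1, 1),
         schedule=[raw_orders, raw_customers]):
    BashOperator(task_id="dbt_run", bash_command="dbt run")
```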

Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: how can we quickly and easily productionise our projects? Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before. We built a single solution on top of Cosmos that allowed us to:
- Decouple the dbt project from the Airflow repository
- Have each dbt node run as a separate Airflow task
- Allow users to run dbt with little to no Airflow knowledge
- Enable users to have fine-grained control over how dbt is run and to combine it with other Airflow tasks
- Provide observability, monitoring, and alerting

dbt has become the de facto standard for data teams building reliable, trustworthy SQL code on a modern data stack architecture. The dbt logic needs to be orchestrated, and jobs scheduled, to meet business expectations. That’s where Airflow comes into play. In this quick introduction session, you’ll learn how to:
- Leverage dbt Core & Airflow to orchestrate pipelines
- Write DAGs in a Pythonic way
- Apply best practices to your jobs
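As a hedged sketch of the first two points, a dbt Core project can be driven from a Pythonic DAG using the TaskFlow @dag decorator and plain BashOperator tasks; the project directory is a placeholder:

```python
from datetime import datetime

from airflow.decorators import dag
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt/my_project"  # placeholder project path

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def dbt_core_pipeline():
    # Each dbt command becomes one Airflow task; a failure stops the chain.
    seed = BashOperator(task_id="dbt_seed",
                        bash_command=f"dbt seed --project-dir {DBT_DIR}")
    run = BashOperator(task_id="dbt_run",
                       bash_command=f"dbt run --project-dir {DBT_DIR}")
    test = BashOperator(task_id="dbt_test",
                        bash_command=f"dbt test --project-dir {DBT_DIR}")
    seed >> run >> test

dbt_core_pipeline()
```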

There are many Airflow tutorials. However, many don’t show the full process of sourcing, transforming, testing, alerting on, documenting, and finally supplying data. This talk will go over how to piece together an end-to-end Airflow project that transforms raw data into something consumable by the business. It will include how various technologies can all be orchestrated by Airflow to satisfy the needs of analysts, engineers, and business stakeholders. The talk will be divided into the following sections:
- Introduction: the business problem and how we came up with the solution design
- Data sourcing: fetching and storing API data using basic operators and hooks
- Transformation and testing: using dbt to build and test models based on the raw data
- Alerting: notifying the necessary parties via Slack when any part of the DAG fails
- Consumption: making dynamic data accessible to business stakeholders
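The alerting step above can be sketched as an Airflow on_failure_callback. A hedged sketch: the connection id is hypothetical, and the actual send (commented out) would use the apache-airflow-providers-slack package:

```python
def build_failure_message(dag_id: str, task_id: str, log_url: str) -> str:
    """Format the alert text posted to Slack when a task fails."""
    return (
        ":red_circle: Airflow task failed\n"
        f"DAG: {dag_id} | Task: {task_id}\n"
        f"Logs: {log_url}"
    )

def notify_slack(context):
    """Passed as on_failure_callback=notify_slack on the DAG or task;
    Airflow supplies the task context dict on failure."""
    ti = context["task_instance"]
    message = build_failure_message(ti.dag_id, ti.task_id, ti.log_url)
    # Actually sending requires the Slack provider, e.g.:
    # from airflow.providers.slack.hooks.slack_webhook import SlackWebhookHook
    # SlackWebhookHook(slack_webhook_conn_id="slack_alerts").send(text=message)
    print(message)
```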

The integration between dbt and Airflow is a popular topic in the community, discussed in previous editions of Airflow Summit, at Coalesce, and in the #airflow-dbt Slack channel. Astronomer Cosmos ( https://github.com/astronomer/astronomer-cosmos/ ) stands out as one of the libraries that strives to enhance this integration, with over 300k downloads per month. During its development, we’ve encountered various performance challenges in scheduling and task execution. While we’ve managed to address some, others remain to be resolved. This talk describes how Cosmos works, the improvements made over the last 1.5 years, and the roadmap ahead. It also aims to collect feedback from the community on how we can further improve the experience of running dbt in Airflow.

In this talk, we’ll discuss how Instacart leverages Apache Airflow to orchestrate a vast network of data pipelines, powering both our core infrastructure and dbt deployments. As a data-driven company, we rely on Airflow to execute large and intricate pipelines securely, compliantly, and at scale. We’ll delve into the following key areas:
a. High-throughput cluster management: how we manage and maintain our Airflow cluster, ensuring the efficient execution of over 2,000 DAGs across diverse use cases.
b. Centralized Airflow vision: our plans for establishing a company-wide, centralized Airflow cluster, consolidating all Airflow instances at Instacart.
c. Custom Airflow tooling: the custom tooling we’ve developed to manage YML-based DAGs, execute DAGs on external ECS workers, leverage Terraform for cluster deployment, and implement robust cluster monitoring at scale.
By sharing our extensive experience with Airflow, we aim to contribute valuable insights to the Airflow community.
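YML-defined DAGs like those mentioned in (c) are typically materialised by a small factory. An illustrative sketch, not Instacart's actual tooling: the SPEC dict below stands in for what yaml.safe_load (PyYAML assumed) would return from a DAG definition file, and the Airflow import is deferred so the dependency logic is testable on its own:

```python
# Stand-in for a parsed YAML DAG definition; all names are hypothetical.
SPEC = {
    "dag_id": "orders_etl",
    "schedule": "@daily",
    "tasks": [
        {"id": "extract", "command": "python extract.py"},
        {"id": "transform", "command": "dbt run", "depends_on": ["extract"]},
        {"id": "load", "command": "python load.py", "depends_on": ["transform"]},
    ],
}

def dependency_pairs(spec: dict) -> list:
    """Return (upstream, downstream) edges declared in the spec."""
    pairs = []
    for task in spec["tasks"]:
        for upstream in task.get("depends_on", []):
            pairs.append((upstream, task["id"]))
    return pairs

def build_dag(spec: dict):
    """Materialise the spec as an Airflow DAG (requires Airflow at runtime)."""
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(spec["dag_id"], schedule=spec["schedule"]) as dag:
        tasks = {
            t["id"]: BashOperator(task_id=t["id"], bash_command=t["command"])
            for t in spec["tasks"]
        }
        for up, down in dependency_pairs(spec):
            tasks[up] >> tasks[down]
    return dag
```

Keeping the edge computation separate from DAG construction lets the factory be unit-tested without a running Airflow installation.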