This session details practical strategies for introducing Apache Airflow in strict, compliance-heavy organizations. Learn how on-premise deployment and hybrid tooling can help modernize legacy workflows when public cloud solutions and container technologies are restricted. Discover how cross-platform engineering teams can collaborate securely using CI/CD bridges, and what it takes to meet rigorous security and governance standards. Key lessons address navigating resistance to change, achieving production sign-off, and avoiding common compliance pitfalls, making the session relevant to anyone automating workflows in public-sector settings.
As Data Engineers, our jobs regularly include scheduling or scaling workflows. But have you ever asked yourself: can I scale my scheduling? It turns out that you can! But doing so raises a number of issues that need to be addressed. In this talk we'll be:
- Recapping asset-aware scheduling in Apache Airflow
- Discussing diverse methods to scale up our scheduling
- Solving the issue of keeping our Airflow Assets synchronized between instances
- Comparing our own push-based solution with the built-in solution from AIP-82, and the pros and cons of each method.
I hope you will enjoy it!
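For readers new to asset-aware scheduling, here is a minimal illustrative sketch using the Airflow 2.x Dataset API (renamed Asset in Airflow 3); the dataset URI, DAG ids, and schedules are hypothetical, and the cross-instance synchronization discussed in the talk is not shown.

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset  # called "Asset" in Airflow 3
from airflow.operators.python import PythonOperator

# A hypothetical table that one DAG produces and another consumes.
orders = Dataset("warehouse://analytics/orders")

with DAG("produce_orders", schedule="@hourly",
         start_date=datetime(2024, 1, 1), catchup=False):
    PythonOperator(
        task_id="load_orders",
        python_callable=lambda: print("orders refreshed"),
        outlets=[orders],  # marks the dataset as updated when this task succeeds
    )

with DAG("consume_orders", schedule=[orders],  # runs whenever `orders` is updated
         start_date=datetime(2024, 1, 1), catchup=False):
    PythonOperator(
        task_id="build_report",
        python_callable=lambda: print("report rebuilt"),
    )
```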
At Yahoo, we built a secure, scalable, and cost-efficient batch processing platform using Amazon MWAA to orchestrate Apache Flink jobs on EKS, managed by the Flink Kubernetes Operator. This setup enables dynamic job orchestration while meeting strict enterprise compliance standards. In this session, we'll share how Airflow DAGs:
- Dynamically launch, monitor, and clean up isolated Flink clusters per batch job, improving resource efficiency.
- Securely fetch EKS kubeconfig, submit FlinkDeployment CRDs using FlinkKubernetesOperator, and poll job status using Airflow sensors.
- Integrate IAM for access control and meet Yahoo's security requirements, including mutual TLS (mTLS) with Athenz.
- Optimize for cost and resilience through automated cleanup of jobs and the operator, and handle job failures and retries.
Join us for practical strategies and lessons from Yahoo's production-scale Flink workflows in a Kubernetes environment.
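As a rough illustration of the submit-and-poll pattern (not Yahoo's actual code), the sketch below assumes the apache-flink provider's FlinkKubernetesOperator and FlinkKubernetesSensor; the manifest file, namespace, connection id, and application name are placeholders, and the kubeconfig-fetching and mTLS/Athenz pieces are omitted.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.flink.operators.flink_kubernetes import FlinkKubernetesOperator
from airflow.providers.apache.flink.sensors.flink_kubernetes import FlinkKubernetesSensor

with DAG("flink_batch_job", schedule="@daily",
         start_date=datetime(2024, 1, 1), catchup=False):
    # Submit a FlinkDeployment CRD for the Flink Kubernetes Operator to reconcile.
    submit = FlinkKubernetesOperator(
        task_id="submit_flink_job",
        application_file="flink_deployment.yaml",  # placeholder manifest
        namespace="batch-jobs",
        kubernetes_conn_id="eks_default",
    )

    # Poll the FlinkDeployment status until the job completes.
    wait = FlinkKubernetesSensor(
        task_id="wait_for_flink_job",
        application_name="example-batch-job",  # must match metadata.name in the manifest
        namespace="batch-jobs",
        kubernetes_conn_id="eks_default",
    )

    submit >> wait
```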
Our development workflows look dramatically different than they did a year ago. Code generation, automated testing, and AI-assisted documentation tools are now part of many developers' daily work. Yet as these tools reshape how we code, I've noticed something worth examining: while our toolbox is changing rapidly, the core of being a good developer hasn't. Problem-solving, collaborative debugging, and systems thinking remain as crucial as ever. In this keynote, I'll share observations about:
- Which parts of our workflow are genuinely enhanced by new tools.
- The development skills that continue to separate good code from great code.
- How teams can collaborate effectively when everyone's tools are evolving.
- What Airflow's journey teaches us about balancing innovation with stability.
No hype or grand pronouncements—just an honest look at incorporating new tools while preserving the craft that makes us developers in the first place.
Before Airflow, our BigQuery pipelines at Create Music Group operated like musicians without a conductor—each playing on its own schedule, regardless of whether upstream data was ready. As our data platform grew, this chaos led to spiralling costs and performance bottlenecks, and became utterly unsustainable. This talk tells the story of how Create Music Group brought harmony to its data workflows by adopting Apache Airflow and the Medallion architecture, ultimately slashing our data processing costs by 50%. We'll show how moving to event-driven scheduling with datasets helped eliminate stale data issues, dramatically improved performance, and unlocked faster iteration across teams. Discover how we replaced repetitive SQL with standardized dimension/fact tables, empowering analysts in a safer sandbox.
How a Complete Beginner in Data Engineering / Junior Computer Science Student Became an Apache Airflow Committer in Just 5 Months—With 70+ PRs and 300 Hours of Contributions
This talk is aimed at those who are still hesitant about contributing to Apache Airflow. I hope to inspire and encourage anyone to take the first step and start their journey in open-source—let's build together!
Apache Airflow® 3 is here, bringing major improvements to data orchestration. In this keynote, core Airflow contributors will walk through key enhancements that boost flexibility, efficiency, and user experience. Vikram Koka will kick things off with an overview of Airflow 3, followed by deep dives into DAG versioning (Jed Cunningham), enhanced backfilling (Daniel Standish), and a modernized UI (Brent Bovenzi & Pierre Jeambrun). Next, Ash Berlin-Taylor, Kaxil Naik, and Amogh Desai will introduce the Task Execution Interface and Task SDK, enabling tasks in any environment and language. Jens Scheffler will showcase the Edge Executor, while Tzu-ping Chung and Vincent Beck will demo event-driven scheduling and data assets. Finally, Buğra Öztürk will unveil CLI enhancements for automation and debugging. This keynote sets the stage for Airflow 3—don’t miss the chance to learn from the experts shaping the future of workflow orchestration!
Yes, you read that right — 200,000 pipelines, nearly 1 million task executions per day, all powered by a single Airflow instance. In this session, we’ll take you behind the scenes of one of the boldest orchestration projects ever attempted: how Uber’s data platform team is executing what might be the largest Apache Airflow migration in history — and doing it straight to Airflow 3. From scaling challenges and architectural choices to lessons learned in high-throughput orchestration, this is a deep dive into the tech, the chaos, and the strategy behind making data fly at unprecedented scale.
In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows. This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability. Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.
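As a generic illustration of the kind of pipeline the talk describes (not the presenter's implementation), here is a minimal TaskFlow ETL sketch for refreshing a knowledge base; the source system, chunking/embedding step, and vector-store upsert are hypothetical placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def refresh_knowledge_base():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull new or changed documents from a source system.
        return [{"id": "doc-1", "text": "Example content"}]

    @task
    def transform(docs: list[dict]) -> list[dict]:
        # Placeholder: chunk the documents and compute embeddings.
        return [{"id": d["id"], "chunks": [d["text"]]} for d in docs]

    @task
    def load(records: list[dict]) -> None:
        # Placeholder: upsert chunks and embeddings into a vector store.
        print(f"Upserting {len(records)} records into the knowledge base")

    load(transform(extract()))


refresh_knowledge_base()
```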
As organizations increasingly rely on data-driven applications, managing the diverse tools, data, and teams involved can create challenges. Amazon SageMaker Unified Studio addresses this by providing an integrated, governed platform to orchestrate end-to-end data and AI/ML workflows. In this workshop, we'll explore how to leverage Amazon SageMaker Unified Studio to build and deploy scalable Apache Airflow workflows that span the data and AI/ML lifecycle. We'll walk through real-world examples showcasing how this AWS service brings together familiar Airflow capabilities with SageMaker's data processing, model training, and inference features, all within a unified, collaborative workspace. Key topics covered:
- Authoring and scheduling Airflow DAGs in SageMaker Unified Studio
- Understanding how Apache Airflow powers workflow orchestration under the hood
- Leveraging SageMaker capabilities like Notebooks, Data Wrangler, and Models
- Implementing centralized governance and workflow monitoring
- Enhancing productivity through unified development environments
Join us to transform your ML workflow experience from complex and fragmented to streamlined and efficient.
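For a flavor of what a SageMaker-driven DAG can look like with the community Amazon provider (SageMaker Unified Studio specifics may differ), here is a hedged sketch; the job name, training image, role ARN, and S3 path are placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker import SageMakerTrainingOperator

# Placeholder config following the shape of the SageMaker CreateTrainingJob API.
TRAINING_CONFIG = {
    "TrainingJobName": "example-training-job",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/example-sagemaker-role",
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/output/"},
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.m5.xlarge",
        "VolumeSizeInGB": 30,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

with DAG("sagemaker_training_example", schedule=None,
         start_date=datetime(2024, 1, 1), catchup=False):
    SageMakerTrainingOperator(
        task_id="train_model",
        config=TRAINING_CONFIG,
        wait_for_completion=True,  # block until the training job finishes
    )
```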
As data workloads grow in complexity, teams need seamless orchestration to manage pipelines across batch, streaming, and AI/ML workflows. Apache Airflow provides a flexible and open-source way to orchestrate Databricks’ entire platform, from SQL analytics with Materialized Views (MVs) and Streaming Tables (STs) to AI/ML model training and deployment. In this session, we’ll showcase how Airflow can automate and optimize Databricks workflows, reducing costs and improving performance for large-scale data processing. We’ll highlight how MVs and STs eliminate manual incremental logic, enable real-time ingestion, and enhance query performance—all while maintaining governance and flexibility. Additionally, we’ll demonstrate how Airflow simplifies ML model lifecycle management by integrating Databricks’ AI/ML capabilities into end-to-end data pipelines. Whether you’re a dbt user seeking better performance, a data engineer managing streaming pipelines, or an ML practitioner scaling AI workloads, this session will provide actionable insights on using Airflow and Databricks together to build efficient, cost-effective, and future-proof data platforms.
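As one concrete (and deliberately simplified) way to drive Databricks from Airflow, the sketch below uses the Databricks provider's DatabricksSubmitRunOperator; the notebook path, cluster spec, and connection id are placeholders, and MV/ST-specific workflows are not shown.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG("databricks_daily_transform", schedule="@daily",
         start_date=datetime(2024, 1, 1), catchup=False):
    # Submit a one-time notebook run on an ephemeral job cluster.
    DatabricksSubmitRunOperator(
        task_id="run_transform_notebook",
        databricks_conn_id="databricks_default",
        new_cluster={
            "spark_version": "14.3.x-scala2.12",  # placeholder runtime version
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
        notebook_task={"notebook_path": "/Shared/transform_orders"},  # placeholder
    )
```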
Ensuring high-quality data is essential for building user trust and enabling data teams to work efficiently. In this talk, we'll explore how the Astronomer data team leverages Airflow to uphold data quality across complex pipelines, minimizing firefighting and maximizing confidence in reported metrics. Maintaining data quality requires a multi-faceted approach: safeguarding the integrity of source data, orchestrating pipelines reliably, writing robust code, and maintaining consistency in outputs. We've embedded data quality into the developer experience, so it's always at the forefront instead of in the backlog of tech debt. We'll share how we've operationalized:
- Implementing data contracts to define and enforce expectations
- Differentiating between critical (pipeline-blocking) and non-critical (soft) failures
- Exposing upstream data issues to domain owners
- Tracking metrics to measure our team's overall data quality
Join us to learn practical strategies for building scalable, trustworthy data systems powered by Airflow.
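The distinction between pipeline-blocking and soft failures can be expressed directly in a DAG; the sketch below is a generic illustration (not Astronomer's implementation), with the actual contract comparison and metric query left as placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def data_quality_example():
    @task
    def critical_contract_check():
        # Placeholder for a real schema/contract comparison against source data.
        missing_columns: list[str] = []
        if missing_columns:
            # Hard failure: block downstream tasks in the pipeline.
            raise AirflowFailException(f"Contract violation: missing {missing_columns}")

    @task
    def soft_quality_check():
        # Placeholder for a real metric query (e.g., null rate on a key column).
        null_rate = 0.002
        if null_rate > 0.01:
            # Soft failure: surface to domain owners but keep the pipeline moving.
            print(f"WARNING: null rate {null_rate:.2%} exceeds threshold")

    critical_contract_check() >> soft_quality_check()


data_quality_example()
```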
This talk explores EDB’s journey from siloed reporting to a unified data platform, powered by Airflow. We’ll delve into the architectural evolution, showcasing how Airflow orchestrates a diverse range of use cases, from Analytics Engineering to complex MLOps pipelines. Learn how EDB leverages Airflow and Cosmos to integrate dbt for robust data transformations, ensuring data quality and consistency. We’ll provide a detailed case study of our MLOps implementation, demonstrating how Airflow manages training, inference, and model monitoring pipelines for Azure Machine Learning models. Discover the design considerations driven by our internal data governance framework and gain insights into our future plans for AIOps integration with Airflow.
The journey from ML model development to production deployment and monitoring is often complex and fragmented. How can teams overcome the chaos of disparate tools and processes? This session dives into how Apache Airflow serves as a unifying force in MLOps. We’ll begin with a look at the broader MLOps trends observed by Google within the Airflow community, highlighting how Airflow is evolving to meet these challenges and showcasing diverse MLOps use cases – both current and future. Then, Priceline will present a deep-dive case study on their MLOps transformation. Learn how they leveraged Cloud Composer, Google Cloud’s managed Apache Airflow service, to orchestrate their entire ML pipeline end-to-end: ETL, data preprocessing, model building & training, Dockerization, Google Artifact Registry integration, deployment, model serving, and evaluation. Discover how using Cloud Composer on GCP enabled them to build a scalable, reliable, adaptable, and maintainable MLOps practice, moving decisively from chaos to coordination. Cloud Composer (Airflow) has served as a major backbone in transforming the whole ML experience at Priceline. Join us to learn how to harness Airflow, particularly within a managed environment like Cloud Composer, for robust MLOps workflows, drawing lessons from both industry trends and a concrete, successful implementation.
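As a simplified, generic illustration of chaining containerized ML stages (not Priceline's actual pipeline), here is a sketch using KubernetesPodOperator; the Artifact Registry images, commands, and schedule are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical Artifact Registry images, one per pipeline stage.
IMAGE_PREFIX = "us-docker.pkg.dev/example-project/ml"

with DAG("ml_pipeline_example", schedule="@weekly",
         start_date=datetime(2024, 1, 1), catchup=False):
    preprocess = KubernetesPodOperator(
        task_id="preprocess",
        name="preprocess",
        image=f"{IMAGE_PREFIX}/preprocess:latest",
        cmds=["python", "preprocess.py"],
    )
    train = KubernetesPodOperator(
        task_id="train",
        name="train",
        image=f"{IMAGE_PREFIX}/train:latest",
        cmds=["python", "train.py"],
    )
    evaluate = KubernetesPodOperator(
        task_id="evaluate",
        name="evaluate",
        image=f"{IMAGE_PREFIX}/evaluate:latest",
        cmds=["python", "evaluate.py"],
    )
    preprocess >> train >> evaluate
```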
Airflow powers thousands of data and ML pipelines—but in the enterprise, these pipelines often need to interact with business-critical systems like ERPs, CRMs, and core banking platforms. In this demo-driven session we will connect Airflow with Control-M from BMC and showcase how Airflow can participate in end-to-end workflows that span not just data platforms but also transactional business applications. Session highlights:
- Trigger Airflow DAGs based on business events (e.g., invoice approvals, trade settlements)
- Feed Airflow pipeline outputs into ERP systems (e.g., SAP) or CRMs (e.g., Salesforce)
- Orchestrate multi-platform workflows from cloud to mainframe with SLA enforcement, dependency management, and centralized control
- Provide unified monitoring and auditing across data and application layers
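Independent of Control-M's own integration mechanism (which the demo covers), one generic way an external system can trigger a DAG on a business event is Airflow's stable REST API; the base URL, DAG id, credentials, and conf payload below are hypothetical.

```python
import requests

# Hypothetical values for an Airflow 2.x deployment.
AIRFLOW_API = "https://airflow.example.com/api/v1"
DAG_ID = "invoice_settlement_pipeline"

# Trigger a DAG run, passing business context (e.g., from an ERP event) as conf.
response = requests.post(
    f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
    auth=("api_user", "api_password"),  # swap for your deployment's auth mechanism
    json={"conf": {"invoice_id": "INV-12345", "source": "erp"}},
    timeout=30,
)
response.raise_for_status()
print("Triggered run:", response.json()["dag_run_id"])
```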
The City of Pittsburgh utilizes Airflow (via Astronomer) for a wide variety of tasks. From employee-focused use cases, like time bank balancing and internal dashboards, to public-facing publication, the City’s data flows through our DAGs from many sources to many destinations. Airflow acts as a funnel point and is an essential tool for Pittsburgh’s Data Services team.
As a popular open-source library for analytics engineering, dbt is often combined with Airflow. Orchestrating and executing dbt models as DAGs adds a layer of control over tasks and observability, and provides a reliable, scalable environment to run dbt models. This workshop will cover a step-by-step guide to Cosmos, a popular open-source package from Astronomer that helps you quickly run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We'll walk through:
- Running and visualising your dbt transformations
- Managing dependency conflicts
- Defining database credentials (profiles)
- Configuring source and test nodes
- Using dbt selectors
- Customising arguments per model
- Addressing performance challenges
- Leveraging deferrable operators
- Visualising dbt docs in the Airflow UI
- An example of how to deploy to production
- Troubleshooting
We encourage participants to bring their dbt project to follow this step-by-step workshop.
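As a taste of the workshop material, here is a minimal sketch of rendering a dbt Core project as an Airflow DAG with Cosmos' DbtDag/ProjectConfig/ProfileConfig API; the project path, profile name, connection id, and schema are placeholders.

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import PostgresUserPasswordProfileMapping

# Map an Airflow connection to a dbt profile (placeholder names throughout).
profile_config = ProfileConfig(
    profile_name="jaffle_shop",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="postgres_default",
        profile_args={"schema": "analytics"},
    ),
)

# Render every model in the dbt project as an Airflow task.
dbt_models = DbtDag(
    dag_id="dbt_models",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/jaffle_shop"),
    profile_config=profile_config,
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
```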
Airflow’s traditional execution model often leads to wasted resources: worker nodes sitting idle, waiting on external systems. At Wix, we tackled this inefficiency head-on by refactoring our in-house operators to support Airflow’s deferrable execution model. Join us on a walk through Wix’s journey to a more efficient Airflow setup, from identifying bottlenecks to implementing deferrable operators and reaping their benefits. We’ll share the alternatives considered, the refactoring process, and how the team seamlessly integrated deferrable execution with no disruption to data engineers’ workflows. Attendees will get a practical introduction to deferrable operators: how they work, what they require, and how to implement them. We’ll also discuss the changes made to Wix’s Airflow environment, the process of prioritizing operators for modification, and lessons learned from testing and rollout. By the end of this talk, attendees will be ready to embrace more purple tasks in their Airflow UI, boosting efficiency, cutting costs, and making their workflows greener.
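For attendees new to the deferrable model, here is a minimal, generic sketch of a deferrable operator and its trigger (not Wix's operators); the job-status check and import path in serialize() are placeholders.

```python
import asyncio

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.base import BaseTrigger, TriggerEvent


class ExternalJobTrigger(BaseTrigger):
    """Polls a hypothetical external system asynchronously in the triggerer."""

    def __init__(self, job_id: str, poll_interval: float = 30.0):
        super().__init__()
        self.job_id = job_id
        self.poll_interval = poll_interval

    def serialize(self):
        # Triggers must be serializable so the triggerer can re-create them.
        return (
            "my_company.triggers.ExternalJobTrigger",  # hypothetical import path
            {"job_id": self.job_id, "poll_interval": self.poll_interval},
        )

    async def run(self):
        while True:
            # Placeholder: replace with a real async status call to the external system.
            job_done = True
            if job_done:
                yield TriggerEvent({"job_id": self.job_id, "status": "success"})
                return
            await asyncio.sleep(self.poll_interval)


class WaitForExternalJobOperator(BaseOperator):
    """Frees its worker slot while waiting for the external job to finish."""

    def __init__(self, job_id: str, **kwargs):
        super().__init__(**kwargs)
        self.job_id = job_id

    def execute(self, context):
        # Hand off to the triggerer instead of blocking a worker.
        self.defer(
            trigger=ExternalJobTrigger(job_id=self.job_id),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Resumes on a worker once the trigger fires.
        self.log.info("Job %s finished with status %s", self.job_id, event["status"])
```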