talk-data.com

Topic

Airflow

Apache Airflow

workflow_management data_orchestration etl

682 tagged

Activity Trend: 157 peak/qtr (2020-Q1 to 2026-Q1)

Activities

682 activities · Newest first

Summary: In this episode of the Data Engineering Podcast, Akshay Agrawal from Marimo discusses the innovative new Python notebook environment, which offers a reactive execution model, full Python integration, and built-in UI elements to enhance the interactive computing experience. He discusses the challenges of traditional Jupyter notebooks, such as hidden state and lack of interactivity, and how Marimo addresses these issues with features like reactive execution and a Python-native file format. Akshay also explores the broader landscape of programmatic notebooks, comparing Marimo to other tools like Jupyter, Streamlit, and Hex, highlighting its unique approach to creating data apps directly from notebooks and eliminating the need for separate app development. The conversation delves into the technical architecture of Marimo, its community-driven development, and future plans, including a commercial offering and enhanced AI integration, emphasizing Marimo's role in bridging the gap between data exploration and production-ready applications.
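As a rough illustration of the Python-native file format and built-in UI elements mentioned above, a marimo notebook is just a Python file along these lines. This is a minimal sketch: the exact generated boilerplate and cell-function names vary between marimo versions, and the slider/threshold example is made up.

```python
import marimo

app = marimo.App()


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _(mo):
    # A built-in UI element; the last expression in a cell is rendered as output.
    threshold = mo.ui.slider(0, 100, value=25, label="threshold")
    threshold
    return (threshold,)


@app.cell
def _(threshold):
    # This cell depends on `threshold`, so marimo reactively re-runs it on each change.
    message = f"Current threshold: {threshold.value}"
    message
    return (message,)


if __name__ == "__main__":
    app.run()
```

Because the file is plain Python, it can be versioned, diffed, and tested like any other module, which is the property the episode contrasts with Jupyter's JSON notebooks.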

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to 6x while guaranteeing accuracy? Datafold's Migration Agent is the only AI-powered solution that doesn't just translate your code; it validates every single data point to ensure perfect parity between your old and new systems. Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price. Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold to book a demo and see how they're turning months-long migration nightmares into week-long success stories.
Your host is Tobias Macey and today I'm interviewing Akshay Agrawal about Marimo, a reusable and reproducible Python notebook environment.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Marimo is and the story behind it?
What are the core problems and use cases that you are focused on addressing with Marimo?
What are you explicitly not trying to solve for with Marimo?
Programmatic notebooks have been around for decades now. Jupyter was largely responsible for making them popular outside of academia. How have the applications of notebooks changed in recent years?
What are the limitations that have been most challenging to address in production contexts?
Jupyter has long had support for multi-language notebooks/notebook kernels. What is your opinion on the utility of that feature as a core concern of the notebook system?
Beyond notebooks, Streamlit and Hex have become quite popular for publishing the results of notebook-style analysis. How would you characterize the feature set of Marimo for those use cases?
For a typical data team that is working across data pipelines, business analytics, ML/AI engineering, etc., how do you see Marimo applied within and across those contexts?
One of the common difficulties with notebooks is that they are largely a single-player experience. They may connect into a shared compute cluster for scaling up execution (e.g. Ray, Dask, etc.). How does Marimo address the situation where a data platform team wants to offer notebooks as a service to reduce the friction to getting started with analyzing data in a warehouse/lakehouse context?
How are you seeing teams integrate Marimo with orchestrators (e.g. Dagster, Airflow, Prefect)?
What are some of the most interesting or complex engineering challenges that you have had to address while building and evolving Marimo?
What are the most interesting, innovative, or unexpected ways that you have seen Marimo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Marimo?
When is Marimo the wrong choice?
What do you have planned for the future of Marimo?

Contact Info
LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used.
The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links: Marimo, Jupyter, IPython, Streamlit (Podcast.init Episode), Vector Embeddings, Dimensionality Reduction, Kaggle, Pytest, PEP 723 script dependency metadata, MatLab, Visicalc, Mathematica, RMarkdown, RShiny, Elixir Livebook, Databricks Notebooks, Papermill, Pluto (Julia notebook), Hex, Directed Acyclic Graph (DAG), Sumble (Kaggle founder Anthony Goldblum's startup), Ray, Dask, Jupytext, nbdev, DuckDB (Podcast Episode), Iceberg, Superset, jupyter-marimo-proxy, JupyterHub, Binder, Nix, AnyWidget, Jupyter Widgets, Matplotlib, Altair, Plotly, DataFusion, Polars, MotherDuck

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Observability in data workflows often stops at logs and metrics, leaving data lineage as a blind spot. At Booking, we set out to change that by treating lineage as a core observability layer. In this talk, I'll walk through how we integrated lineage tracking into our Airflow ecosystem, what metadata we capture, and how we surface it to users in a meaningful way. I'll also share how lineage data helps us debug failures, detect unexpected changes, and ensure compliance. You'll leave with a practical view of what it takes to make lineage not just visible, but actionable.
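For readers unfamiliar with where lineage metadata can come from in Airflow, here is a minimal, hypothetical sketch (not Booking's implementation): task-level inlets and outlets declared on assets are one common source that lineage backends, such as the OpenLineage provider, can surface as dataset-level edges. Asset URIs and task logic below are invented.

```python
# Hypothetical sketch of task-level lineage metadata in Airflow 3; not Booking's setup.
from airflow.sdk import dag, task, Asset

raw_bookings = Asset("s3://raw/bookings/")      # hypothetical upstream dataset
curated_bookings = Asset("curated_bookings")    # hypothetical downstream table


@dag(schedule=None)
def bookings_curation():

    @task(inlets=[raw_bookings], outlets=[curated_bookings])
    def curate():
        # Transformation goes here; the inlets/outlets above are the metadata a
        # lineage backend can collect and surface to users for debugging and compliance.
        ...

    curate()


bookings_curation()
```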

Running Airflow at scale for thousands of workflows across multiple teams introduces challenges around standardization, governance, and isolation. At Booking, we've built a multi-tenant Airflow platform that serves over 4,000 workflows using a custom DSL defined in workflow.yaml files. In this talk, I'll show how we use automated DAG generation to bring structure to complexity, how we achieved horizontal scalability by decoupling orchestration from execution, and how reusable step templates help us enforce governance without sacrificing workflow isolation. You'll leave with a blueprint for taming Airflow at scale.
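Booking's DSL itself is not public, so the following is only a hypothetical sketch of the general pattern the abstract describes: a workflow.yaml spec is parsed at DAG-parse time and turned into a DAG built from reusable step templates. Field names, the YAML layout, and the single "bash" template are assumptions for illustration.

```python
# Hypothetical sketch of YAML-driven DAG generation; the real DSL and templates differ.
import yaml
from airflow.sdk import DAG
from airflow.providers.standard.operators.bash import BashOperator

WORKFLOW_YAML = """
name: orders_daily
schedule: "0 6 * * *"
steps:
  - name: extract
    template: bash
    command: "python extract_orders.py"
  - name: load
    template: bash
    command: "python load_orders.py"
    depends_on: [extract]
"""

spec = yaml.safe_load(WORKFLOW_YAML)

with DAG(dag_id=spec["name"], schedule=spec["schedule"]) as dag:
    tasks = {}
    for step in spec["steps"]:
        # Only one reusable "template" is shown; a real platform would map template
        # names to vetted, governed operator factories owned by the platform team.
        tasks[step["name"]] = BashOperator(task_id=step["name"], bash_command=step["command"])
    for step in spec["steps"]:
        for upstream in step.get("depends_on", []):
            tasks[upstream] >> tasks[step["name"]]
```

Keeping the spec declarative is what lets the platform team enforce standards centrally while tenants only edit their own workflow.yaml files.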

Historically, Airflow was only capable of time-based scheduling, where a DAG would run at certain times. For data that arrives at varying times, such as an external party delivering data to an S3 bucket, that meant having to run a DAG and continuously poll for updates. Airflow 3 introduces event-driven scheduling that lets you trigger DAGs based on such updates. In this talk, I'll demonstrate how this changes your DAG's code and how it works internally in Airflow. Lastly, I'll demonstrate a practical use case that leverages Airflow 3's event-driven scheduling.
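As a rough illustration of the difference, here is a hedged sketch: the DAG below is scheduled on an Asset whose AssetWatcher listens to a message queue (for example, S3 event notifications delivered to SQS), so it runs only when a delivery actually arrives instead of polling with a sensor. The queue URL, asset URI, and the trigger's import path and parameters are assumptions to verify against the Airflow 3 documentation.

```python
# Hedged sketch of Airflow 3 event-driven scheduling; URLs and trigger details are assumed.
from airflow.sdk import DAG, Asset, AssetWatcher, task
from airflow.providers.common.messaging.triggers.msg_queue import MessageQueueTrigger

# The external party writes to S3; the bucket publishes notifications to an SQS queue.
partner_deliveries = Asset(
    "s3://partner-bucket/deliveries/",
    watchers=[
        AssetWatcher(
            name="partner_delivery_watcher",
            trigger=MessageQueueTrigger(
                queue="https://sqs.eu-west-1.amazonaws.com/123456789012/partner-deliveries",
            ),
        )
    ],
)

with DAG(dag_id="process_partner_delivery", schedule=[partner_deliveries]) as dag:

    @task
    def load_delivery():
        # Runs only when the watcher observes a new message; no polling sensor needed.
        ...

    load_delivery()
```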

We strive for our dbt project to be ready by 9am for our stakeholders. Should be easy, right? Except that our dbt project consists of around 450 dbt models and over 30 sources. Some of those sources are ready as early as midnight but some as late as 4am, and in total our project takes around 4 hours to run. Join us as we walk through the evolution of our dbt run setup, from one selector, to a set of parallel commands, to today's setup: a dynamic lineage in Airflow which runs models when and only when the upstream source is ready. It's finished when the Tableau datasource is refreshed and our stakeholders can start their day with the latest data.
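The team's actual dynamic-lineage setup is not shown in the abstract, but the underlying pattern can be sketched roughly as follows: each source's readiness updates an Asset, and a DAG scheduled on that Asset runs only the dbt models downstream of that source. Asset names, the selector, and the use of BashOperator are illustrative assumptions.

```python
# Illustrative sketch only; the actual Airflow/dbt setup described in the talk differs.
from airflow.sdk import DAG
from airflow.providers.standard.operators.bash import BashOperator
from airflow.sdk import Asset

# Another DAG (or an external event) marks this Asset as updated once the
# hypothetical "payments" source has landed.
payments_source_ready = Asset("source_payments_ready")

with DAG(dag_id="dbt_run_payments_downstream", schedule=[payments_source_ready]) as dag:
    # Run only the models downstream of that source, as soon as it is ready.
    BashOperator(
        task_id="dbt_run",
        bash_command='dbt run --select "source:payments+"',
    )
```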

Want to take your DAGs in Apache Airflow to the next level? Join this insightful session where we’ll uncover 5 transformative strategies to enhance your data workflows. Whether you’re a data engineering pro or just getting started, this presentation is packed with practical tips and actionable insights that you can apply right away. We’ll dive into the magic of using powerful libraries like Pandas, share techniques to trim down data volumes for faster processing, and highlight the importance of modularizing your code for easier maintenance. Plus, you’ll discover efficient ways to monitor and debug your DAGs, and how to make the most of Airflow’s built-in features. By the end of this session, you’ll have a toolkit of strategies to boost the efficiency and performance of your DAGs, making your data processing tasks smoother and more effective. Don’t miss out on this opportunity to elevate your Airflow DAGs!

There was a post on the data engineering subreddit recently that discussed how difficult it is to keep up with the data engineering world. Did you learn Hadoop? Great, we are on Snowflake, BigQuery, and Databricks now. Just learned Airflow? Well, now we have Airflow 3.0. And the list goes on. But what doesn’t change, and what have the lessons been over the past decade? That’s what I’ll be covering in this talk: real lessons and realities that come up time and time again, whether you’re working for a start-up or a large enterprise.

In today’s dynamic data environments, tables and schemas are constantly evolving, and keeping semantic layers up to date has become a critical operational challenge. Manual updates don’t scale, and delays can quickly lead to broken dashboards, failed pipelines, and lost trust. We’ll show how to harness Apache Airflow 3 and its new event-driven scheduling capabilities to automate the entire lifecycle: detecting table and schema changes in real time, parsing and interpreting those changes, and shifting the updating of semantic models left across dbt, Looker, or custom metadata layers. AI agents will add intelligence and automation that rationalize schema diffs, assess the impact of changes, and propose targeted updates to semantic layers, reducing manual work and minimizing the risk of errors. We’ll dive into strategies for efficient change detection, safe incremental updates, and orchestrating workflows where humans collaborate with AI agents to validate and deploy changes. By the end of the session, you’ll understand how to build resilient, self-healing semantic layers that minimize downtime, reduce manual intervention, and scale effortlessly across fast-changing data environments.

Curious how code truly flows inside Airflow? Join me for a unique visualisation journey into Airflow’s inner workings (the first of its kind): the code blocks and modules invoked when certain operations run. It is a walkthrough that unveils task execution, observability, and debugging like never before, and shows the scaling of Airflow in action with a performance comparison between Airflow 3 and Airflow 2. This session will demystify Airflow’s architecture, showcasing real-time task flows and the heartbeat of pipelines in action. Perfect for engineers looking to optimize workflows, troubleshoot efficiently, and gain a new perspective on Airflow’s powerful upgraded core. See Airflow running live with detailed insights and unlock the secrets to better pipeline management!

Are you looking to build slick, dynamic trigger forms for your DAGs? It all starts with mastering params. Params are the gold standard for adding execution options to your DAGs, allowing you to create dynamic, user-friendly trigger forms with descriptions, validation, and now, with Airflow 3, bidirectional support for conf data! In this talk, we’ll break down how to use params effectively, share best practices, and explore what’s new since the 2023 Airflow Summit talk (https://airflowsummit.org/sessions/2023/flexible-dag-trigger-forms-aip-50/). If you want to make DAG execution more flexible, intuitive, and powerful, this session is a must-attend!
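As a quick refresher on what params look like in a DAG definition, here is a minimal sketch. The parameter names and values are made up, and the import path shown is the Airflow 3 task SDK one; in Airflow 2, Param lives in airflow.models.param.

```python
# Minimal params sketch; parameter names and values are made up for illustration.
from airflow.sdk import dag, task, Param  # Airflow 2: from airflow.models.param import Param


@dag(
    params={
        "environment": Param(
            "staging",
            type="string",
            enum=["staging", "production"],
            description="Rendered as a dropdown in the trigger form",
        ),
        "row_limit": Param(
            1000,
            type="integer",
            minimum=1,
            maximum=100000,
            description="Validated against the JSON-schema bounds before the run starts",
        ),
    },
)
def parameterized_example():

    @task
    def report(params: dict | None = None):
        # `params` is injected from the run's conf/trigger form values.
        print(f"Running against {params['environment']} with limit {params['row_limit']}")

    report()


parameterized_example()
```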

In Airflow 2 there was a plugin mechanism to extend the UI with new functionality, as well as to add hooks and other features. Because Airflow 3 rewrote the UI, old plugins no longer work in all cases. Airflow 3.1 now provides a revamped option to extend the UI with a new plugin schema based on native React components and embedded iframes, following the AIP-68 definitions. In this session we will provide an overview of the capabilities and a short introduction to how you can roll your own.
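To give a flavor of the new mechanism, here is a heavily hedged sketch of an iframe-style UI plugin: the attribute name (external_views) and the dictionary keys are assumptions based on AIP-68 and should be checked against the Airflow 3.1 plugin documentation; the dashboard URL is hypothetical.

```python
# Hedged sketch of an Airflow 3.1 UI plugin (AIP-68); attribute and key names are assumptions.
from airflow.plugins_manager import AirflowPlugin


class CostDashboardPlugin(AirflowPlugin):
    name = "cost_dashboard_plugin"

    # Embed an existing page as an iframe entry in the Airflow UI navigation.
    external_views = [
        {
            "name": "Cost Dashboard",                               # menu label (assumed key)
            "href": "https://grafana.example.com/d/airflow-costs",  # hypothetical URL
            "destination": "nav",                                   # placement (assumed key)
            "url_route": "cost-dashboard",                          # route in the UI (assumed key)
        }
    ]
```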

Airflow has been used by many companies as a core part of their internal data platform. Would you be interested in finding out how Airflow can play a pivotal role in achieving data engineering excellence and efficiency with a modern data architecture? The best practices, tools, and setup to achieve a stable yet cost-effective way of running small or big workloads: let’s find out! In this workshop we will review how an organisation can set up a data platform architecture around Airflow and the necessary requirements:
Airflow and its role in the data platform
Different ways to organise the Airflow environment to enable scalability and stability
Useful open source libraries and custom plugins that improve efficiency
How to manage multi-tenancy and cost savings
Challenges and factors to keep in mind, using a success matrix!
This workshop should be suitable for any architect, data engineer, or DevOps engineer aiming to build or enhance their internal data platform. At the end of this workshop you will have a solid understanding of the initial setup and of ways to optimise further, getting the most out of the tool for your own organisation.

What if your Airflow tasks could understand natural language AND adapt to schema changes automatically, while maintaining the deterministic, observable workflows we rely on? This talk introduces practical patterns for AI-native orchestration that preserve Airflow’s strengths while adding intelligence where it matters most. Through a real-world example, we’ll demonstrate AI-powered tasks that detect schema drift across multi-cloud systems and perform context-aware data quality checks that go beyond simple validation: understanding business rules, detecting anomalies, and generating validation queries from prompts like “check data quality across regions.” All within static DAG structures you can test and debug normally. We’ll show how AI becomes a first-class citizen by combining Airflow features (assets for schema context, Human-in-the-Loop for approvals, and AssetWatchers for automated triggers) with engines such as Apache DataFusion for high-performance query execution and support for cross-cloud data processing with unified access to multiple storage formats. These patterns apply directly to schema validation and similar cases where natural language can simplify complex operations. This isn’t about bolting AI onto Airflow. It’s about evolving how we build workflows, from brittle rules to intelligent adaptation, while keeping everything testable, auditable, and production-ready.
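To make "AI-powered tasks inside a static DAG" concrete, here is a hypothetical sketch: the DAG shape stays fixed and testable, and only one task delegates query generation to a model. The generate_sql helper, the prompt, and the schema context are invented for illustration and are not from the talk.

```python
# Hypothetical sketch of an "AI-powered" data-quality task inside a static DAG.
# generate_sql() is invented for illustration; swap in your own model client.
from airflow.sdk import dag, task


def generate_sql(prompt: str, schema: dict) -> str:
    """Placeholder for an LLM call that turns a natural-language rule into SQL."""
    raise NotImplementedError("call your model provider here")


@dag(schedule=None)
def ai_quality_checks():

    @task
    def check_quality():
        schema = {"orders": ["order_id", "region", "amount", "created_at"]}  # made-up context
        sql = generate_sql("check data quality across regions", schema)
        # Execute `sql` against the warehouse and fail the task on violations.
        # The DAG structure itself never changes, so it stays testable and auditable.
        print(sql)

    check_quality()


ai_quality_checks()
```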

In this keynote, Peeyush Rai and Vikram Koka will walk through how Airflow is being used as part of an agentic AI platform serving insurance companies, which runs on all the major public clouds, leveraging models from OpenAI, Google (Gemini), and AWS (Claude and Bedrock). The talk walks through the details of the actual end-user business workflow, including gathering relevant financial data to make a decision, as well as the tricky challenge of handling AI hallucinations with new Airflow capabilities such as “Human in the loop”. This talk offers something for both business and technical audiences. Business users will get a clear view of what it takes to bring an AI application into production and how to align their operations and business teams with an AI-enabled workflow. Meanwhile, technical users will walk away with practical insights on how to orchestrate complex business processes, enabling seamless collaboration between Airflow, AI agents, and humans in the loop.

The workflow orchestration team at Zoox aims to build a solution for orchestrating heterogeneous workflows encompassing data, ML, and QA pipelines. We have encountered two primary challenges: first, the steep learning curve for new Airflow users and the need for a user-friendly yet scalable development process; second, integrating and migrating existing pipelines with established solutions. This presentation will detail our approach, as a small team at Zoox, to addressing these challenges. We will first present an exciting introduction to Zoox and what we do, then walk down memory lane through the past and present of Airflow use at Zoox. Furthermore, we will share our strategies for simplifying the Airflow DAG creation process and enhancing the user experience. Lastly, we will share a few of our thoughts on how to grow the team and Airflow’s presence at Zoox in the future.

Apache Bigtop is a time-proven open-source software stack for building data platforms, built around the Hadoop and Spark ecosystem since 2011. Its software composition has changed over that long period, and recently its job scheduler was removed, mainly due to the inactivity of its development. The speaker believes that Airflow fits this gap perfectly and proposes incorporating it into the Bigtop stack. This presentation will introduce how easily users can build a data platform with Bigtop including Airflow, and how Airflow can integrate that software through its wide range of providers and enterprise readiness, such as Kerberos support.

Join us to explore the DAG Upgrade Agent. Developed with the Google Agent Development Kit and powered by Gemini, the DAG Upgrade Agent uses a rules-based framework to analyze DAG code, identify compatibility issues between core Airflow and provider package versions, and generate precise upgrade recommendations and automated code conversions. Perfect for upcoming Airflow 3.0 migrations.
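The agent itself is not public, but the rules-based idea can be sketched in a few lines. The two import/argument rules below reflect real Airflow 2-to-3 changes; the structure of the checker is purely illustrative and does not represent the agent's actual rules, output format, or Gemini-powered code conversions.

```python
# Illustrative sketch of a rules-based DAG upgrade check; the real agent differs.
import re

UPGRADE_RULES = [
    (re.compile(r"\bschedule_interval\s*="),
     "Airflow 3 removed `schedule_interval`; use the `schedule` argument instead."),
    (re.compile(r"from airflow\.operators\.dummy import DummyOperator"),
     "DummyOperator is gone; use EmptyOperator instead."),
    (re.compile(r"\bexecution_date\b"),
     "The `execution_date` context key was removed in Airflow 3; use `logical_date`."),
]


def analyze_dag_source(source: str) -> list[str]:
    """Return human-readable upgrade recommendations for a DAG file's source code."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, recommendation in UPGRADE_RULES:
            if pattern.search(line):
                findings.append(f"line {lineno}: {recommendation}")
    return findings


if __name__ == "__main__":
    example = "dag = DAG('demo', schedule_interval='@daily')"
    print(analyze_dag_source(example))
```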