Airflow is a powerhouse for batch data pipelines—but can it be tuned for real-time workloads? In this session, we’ll share how we adapted Apache Airflow to orchestrate near-real-time data processing at scale. From leveraging event-driven triggers and external APIs to minimizing latency with smart DAG design, we’ll dive into real-world architectural patterns, challenges, and optimizations that helped us handle time-sensitive data workflows with confidence. This talk is ideal for teams seeking to expand beyond batch and explore hybrid or real-time orchestration using Airflow.
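One pattern consistent with this kind of near-real-time setup is event-driven scheduling with Airflow Datasets (Airflow 2.4+), where a downstream DAG runs as soon as an upstream task reports new data instead of waiting for a fixed batch interval. The sketch below is illustrative only; the dataset URI, DAG names, and task bodies are assumptions, not the speakers' implementation.

```python
# Minimal sketch of event-driven scheduling with Airflow Datasets.
# The dataset URI and task logic are placeholder assumptions.
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders_feed = Dataset("s3://incoming/orders/latest.json")  # placeholder URI


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def ingest_orders():
    @task(outlets=[orders_feed])
    def land_new_batch():
        # Pull the latest events from an upstream source (details omitted).
        ...

    land_new_batch()


@dag(schedule=[orders_feed], start_date=datetime(2024, 1, 1), catchup=False)
def process_orders():
    # Runs as soon as the producer marks the dataset as updated,
    # rather than on a fixed batch schedule.
    @task
    def transform():
        ...

    transform()


ingest_orders()
process_orders()
```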
Trino is incredibly effective at enabling users to extract insights quickly and effectively from large amounts of data located in dispersed and heterogeneous federated data systems. However, some business data problems are more complex than interactive analytics use cases, and are best broken down into a sequence of interdependent steps, a.k.a. a workflow. For these use cases, dedicated software is often required in order to schedule and manage these processes with a principled approach. In this session, we will look at how we can leverage Apache Airflow to orchestrate Trino queries into complex workflows that solve practical batch processing problems, all the while avoiding repetitive, redundant data movement.
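As a rough sketch of this pattern (not the speakers' code), the DAG below chains two dependent Trino queries from Airflow so the data never leaves the federated sources; the connection id, catalogs, and table names are placeholder assumptions.

```python
# Sketch: chaining dependent Trino queries from Airflow via the common SQL provider.
from datetime import datetime

from airflow.decorators import dag
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def trino_batch_workflow():
    # Step 1: stage the day's events from a federated source into an Iceberg table.
    stage = SQLExecuteQueryOperator(
        task_id="stage_events",
        conn_id="trino_default",  # assumed Trino connection
        sql="""
            INSERT INTO iceberg.staging.events
            SELECT * FROM kafka_catalog.raw.events
            WHERE event_date = DATE '{{ ds }}'
        """,
    )

    # Step 2: aggregate the staged data, running only after step 1 succeeds.
    aggregate = SQLExecuteQueryOperator(
        task_id="aggregate_events",
        conn_id="trino_default",
        sql="""
            INSERT INTO iceberg.marts.daily_event_counts
            SELECT event_date, event_type, count(*) AS n
            FROM iceberg.staging.events
            WHERE event_date = DATE '{{ ds }}'
            GROUP BY 1, 2
        """,
    )

    stage >> aggregate


trino_batch_workflow()
```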
Vayu is a conversational copilot for Apache Airflow, developed at Prevalent AI to help data engineers manage, troubleshoot, and fix pipelines using natural language. Deployments often fail silently due to misconfigurations, missing connections, or runtime issues that are impossible to catch in unit tests. Vayu tackles these with a troubleshooting agent that inspects logs, metrics, configs, and runtime state to find root causes and suggest fixes, saving engineers significant troubleshooting time. It can also apply approved fixes to DAG code and commit them to your version control system.
Key capabilities:
- Troubleshooting Agent: inspects logs, configs, variables, and connections to find root causes and suggest fixes.
- Pipeline Mechanic Agent: suggests code-level fixes (e.g., missing connections or bad imports) and, once approved, commits them to version control.
- DAG Manager Agent: understands DAG logic, suggests improvements, and can trigger DAGs conversationally.
Architecture: built with open-source tools, including Google ADK as the orchestration layer and a custom Airflow MCP server based on the FastMCP framework. LLMs never access Airflow directly. The full codebase will be open-sourced.
On March 13th, 2025, Amazon Web Services announced General Availability of Amazon SageMaker Unified Studio, bringing together AWS machine learning and analytics capabilities. At the heart of this next generation of Amazon SageMaker sits Apache Airflow. All SageMaker Unified Studio users have a personal, open-source Airflow deployment, running alongside their Jupyter notebook, enabling those users to easily develop Airflow DAGs that have unified access to all of their data. In this talk, I will go into detail about the motivations for choosing Airflow for this capability, the challenges of incorporating Airflow into such a large and diverse experience, the key role that open source plays, how we’re leveraging GenAI to make that open-source development experience better, and the goals for the future of Airflow in SageMaker Unified Studio. Attendees will leave with a better understanding of the considerations they need to make when choosing Airflow as a component of their enterprise project, and a greater appreciation of how Airflow can power advanced capabilities.
Datadog is a world-class data platform ingesting more than 100 trillion events a day, providing real-time insights. Before Airflow’s prominence, we built batch processing on Luigi, Spotify’s open-source orchestrator. As Airflow gained wide adoption, we evaluated adopting the major improvements of release 2.0, but opted to build our own orchestrator instead to realize our dataset-centric, event-driven vision. Meanwhile, the 3.0 release aligned Airflow with the same vision we pursued internally, as a modern asset-driven orchestrator. It showed how futile building our own was compared to the momentum of the community. We evaluated several orchestrators and decided to join forces with the Airflow project. This talk follows our journey from building a custom orchestrator to adopting and contributing to Airflow 3. We’ll share our thought process, our asset partitions use case, and how we’re working with the community to materialize the Data Awareness (AIP-73) vision. Partition-based incremental scheduling is core to our orchestration model, enabling scalable, observable pipelines, with Datadog’s Data Observability product providing visibility into pipeline health.
Ready to contribute to Apache Airflow? In this hands-on workshop, you’ll be expected to come prepared with your development environment already configured (having Breeze installed is strongly recommended, but Codespaces works if you can’t install Docker). We’ll dive straight into finding issues that match your skills and walk you through the entire contribution process, from creating your first pull request to receiving community feedback. Whether you’re writing code, enhancing documentation, or offering feedback, there’s a place for you. Let’s get started and see your name among Airflow contributors!
We face a paradox: we could use usage data to build better software, but collecting that data seems to contradict the very principles of user freedom that open source represents. Apache Airflow’s telemetry system, since removed, has become a battleground for this conflict, with some users voicing privacy concerns while maintainers struggle to make informed decisions without data. What can we do to strike the right balance?
How do you automate data inputs, retraining, or model monitoring?
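One common way to automate these steps, sketched here under assumptions about task boundaries and storage paths, is a scheduled Airflow DAG that chains ingestion, retraining, and monitoring:

```python
# A minimal, hypothetical sketch of automating data inputs, retraining, and
# model monitoring with an Airflow DAG. All task bodies are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@weekly", start_date=datetime(2024, 1, 1), catchup=False)
def model_retraining():
    @task
    def ingest_training_data() -> str:
        # Pull and validate the latest labeled data; return its location.
        return "s3://ml-data/training/latest"  # placeholder path

    @task
    def retrain_model(data_path: str) -> str:
        # Fit a new model version on the fresh data; return its identifier.
        return "model-v2"  # placeholder version

    @task
    def monitor_model(model_version: str) -> None:
        # Compare evaluation metrics against the current production model
        # and alert (or roll back) if quality regresses.
        ...

    monitor_model(retrain_model(ingest_training_data()))


model_retraining()
```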
This is an overview of migrating from Apache Airflow to Lakeflow Jobs for modern data orchestration. It covers key differences, best practices, and practical examples of transitioning from traditional Airflow DAGs orchestrating legacy systems to declarative, incremental ETL pipelines with Lakeflow. Attendees will gain actionable tips on how to improve efficiency, scalability, and maintainability in their workflows.
Airflow 3 is here, bringing a new era of flexibility, scalability, and security to data orchestration. This release makes building, running, and managing data pipelines easier than ever. In this session, we will cover the key benefits of Airflow 3, including:
1. Ease of Use: Airflow 3 rethinks the user experience, from an intuitive, upgraded UI to DAG Versioning and scheduler-integrated backfills that let teams manage pipelines more effectively than ever before.
2. Stronger Security: By decoupling task execution from direct database connections, Airflow 3 enforces task isolation and minimal-privilege access. This meets stringent compliance standards while reducing the risk of unauthorized data exposure.
3. Ultimate Flexibility: Run tasks anywhere, anytime with remote execution and event-driven scheduling. Airflow 3 is designed for global, heterogeneous modern data environments, with an architecture that supports everything from edge and hybrid-cloud to GPU-based deployments.
Ray is an open-source framework for scaling Python applications, particularly machine learning and AI workloads. It provides the layer for parallel processing and distributed computing. Many large language models (LLMs), including OpenAI's GPT models, are trained using Ray.
On the other hand, Apache Airflow is a well-established data orchestration framework, downloaded more than 20 million times a month.
This talk presents the Airflow Ray provider package, which lets users interact with Ray from an Airflow workflow. I'll show how to use the package to create Ray clusters and how Airflow can trigger Ray pipelines in those clusters.
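For orientation, here is a minimal sketch of the general idea of triggering Ray work from an Airflow task. It deliberately uses plain Ray APIs rather than the provider's own operators, and the cluster address is a placeholder assumption.

```python
# Sketch (not the provider's API): an Airflow task that connects to an existing
# Ray cluster and runs a small parallel job.
from datetime import datetime

import ray
from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def ray_from_airflow():
    @task
    def run_ray_job() -> list[int]:
        # Connect to a running Ray cluster; the address is a placeholder.
        # ray.init() with no address would start a local cluster instead.
        ray.init(address="ray://head-node:10001")

        @ray.remote
        def square(x: int) -> int:
            return x * x

        # Fan the work out across the cluster and gather the results.
        results = ray.get([square.remote(i) for i in range(10)])
        ray.shutdown()
        return results

    run_ray_job()


ray_from_airflow()
```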
GetYourGuide’s journey to scalable analytics using dbt and Airflow.
Summary In this episode of the Data Engineering Podcast Pete DeJoy, co-founder and product lead at Astronomer, talks about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3. Pete shares his journey into data engineering, discusses Astronomer's contributions to the Airflow project, and highlights the critical role of Airflow in powering operational data products. He covers the evolution of Airflow, its position in the data ecosystem, and the challenges faced by data engineers, including infrastructure management and observability. The conversation also touches on the upcoming Airflow 3 release, which introduces data awareness, architectural improvements, and multi-language support, and Astronomer's observability suite, Astro Observe, which provides insights and proactive recommendations for Airflow users.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Pete DeJoy about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3.
Interview
- Introduction
- Can you describe what Astronomer is and the story behind it?
- How would you characterize the relationship between Airflow and Astronomer?
- Astronomer just released your State of Airflow 2025 Report yesterday and it is the largest data engineering survey ever with over 5,000 respondents. Can you talk a bit about top level findings in the report?
- What about the overall growth of the Airflow project over time?
- How have the focus and features of Astronomer changed since it was last featured on the show in 2017?
- Astro Observe GA’d in early February, what does the addition of pipeline observability mean for your customers?
- What are other capabilities similar in scope to observability that Astronomer is looking at adding to the platform?
- Why is Airflow so critical in providing an elevated observability (or cataloging, or something similar) experience in a DataOps platform?
- What are the notable evolutions in the Airflow project and ecosystem in that time?
- What are the core improvements that are planned for Airflow 3.0?
- What are the most interesting, innovative, or unexpected ways that you have seen Astro used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airflow and Astro?
- What do you have planned for the future of Astro/Astronomer/Airflow?
Contact Info
- LinkedIn
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
- Astronomer
- Airflow
- Maxime Beauchemin
- MongoDB
- Databricks
- Confluent
- Spark
- Kafka
- Dagster (Podcast Episode)
- Prefect
- Airflow 3
- The Rise of the Data Engineer blog post
- dbt
- Jupyter Notebook
- Zapier
- cosmos library for dbt in Airflow
- Ruff
- Airflow Custom Operator
- Snowflake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.
About the speaker: Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted. As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.
0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like dbt in the ecosystem
🔗 CONNECT WITH ADRIAN BRUDARU Linkedin - / data-team Website - https://adrian.brudaru.com/ 🔗 CONNECT WITH DataTalksClub Join the community - https://datatalks.club/slack.html Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/... Check other upcoming events - https://lu.ma/dtc-events LinkedIn - /datatalks-club Twitter - /datatalksclub Website - https://datatalks.club/
🌟 Session Overview 🌟
Session Name: The State of Apache Airflow
Speaker: Ricardo Sueiras
Session Description: Apache Airflow is a popular workflow orchestration tool with a thriving and welcoming community that drives constant innovation and new releases. I will walk you through some of the new features you can expect to find in Apache Airflow, as well as cover use cases and some advanced topics to help you supercharge your use of Apache Airflow. There will be plenty of live demos of these features, and I hope they will encourage you all to try the new advanced features of Airflow yourself.
🚀 About Big Data and RPA 2024 🚀
Unlock the future of innovation and automation at Big Data & RPA Conference Europe 2024! 🌟 This unique event brings together the brightest minds in big data, machine learning, AI, and robotic process automation to explore cutting-edge solutions and trends shaping the tech landscape. Perfect for data engineers, analysts, RPA developers, and business leaders, the conference offers dual insights into the power of data-driven strategies and intelligent automation. 🚀 Gain practical knowledge on topics like hyperautomation, AI integration, advanced analytics, and workflow optimization while networking with global experts. Don’t miss this exclusive opportunity to expand your expertise and revolutionize your processes—all from the comfort of your home! 📊🤖✨
📅 Yearly Conferences: Curious about the evolution of QA? Check out our archive of past Big Data & RPA sessions. Watch the strategies and technologies evolve in our videos! 🚀 🔗 Find Other Years' Videos: 2023 Big Data Conference Europe https://www.youtube.com/playlist?list=PLqYhGsQ9iSEpb_oyAsg67PhpbrkCC59_g 2022 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEryAOjmvdiaXTfjCg5j3HhT 2021 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEqHwbQoWEXEJALFLKVDRXiP
💡 Stay Connected & Updated 💡
Don’t miss out on any updates or upcoming event information from Big Data & RPA Conference Europe. Follow us on our social media channels and visit our website to stay in the loop!
🌐 Website: https://bigdataconference.eu/, https://rpaconference.eu/ 👤 Facebook: https://www.facebook.com/bigdataconf, https://www.facebook.com/rpaeurope/ 🐦 Twitter: @BigDataConfEU, @europe_rpa 🔗 LinkedIn: https://www.linkedin.com/company/73234449/admin/dashboard/, https://www.linkedin.com/company/75464753/admin/dashboard/ 🎥 YouTube: http://www.youtube.com/@DATAMINERLT
Self-service analytics empowers users to independently access, explore, and analyze data to accelerate decision-making. It’s a foundational step for organizations to democratize the use of data for business growth with easy-to-use, simple-to-understand, and quick-to-deliver tools. Interactive querying, SQL analytics, data preparation, data transformation, data workflow orchestration and search analytics are some of the key day-to-day functions that data users need from self-service solutions. Learn how AWS analytics services like Amazon Athena, AWS Glue, Amazon Redshift, Amazon Managed Workflows for Apache Airflow, and Amazon OpenSearch Service enable data-driven decision-making through self-serve analytics.
Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP
Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4
About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.