Updates on the Apache Airflow project status, roadmap, and recent milestones.
talk-data.com
Topic: Apache Airflow (682 items tagged)
Top Events
Explore the latest advancements in AWS Analytics designed to transform your data processing landscape. This session unveils powerful new capabilities across key services, including Amazon EMR for scalable big data processing, AWS Glue for seamless data integration, Amazon Athena for optimized querying, and Amazon Managed Workflows for Apache Airflow (MWAA) for workflow orchestration. Discover how these innovations can supercharge performance, optimize costs, and streamline your data ecosystem. Whether you're looking to enhance scalability, improve data integration, accelerate queries, or refine workflow management, join us to gain actionable insights that will position your organization at the forefront of data processing innovation.
Summary: In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal's code-first programming model (workflows, activities, task queues, and replay) and how it eliminates hand-rolled retry, checkpoint, and error-handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.
Announcements: Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed: flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming: Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.

Your host is Tobias Macey, and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures.

Interview
• Introduction
• How did you get involved in the area of data management?
• Can you describe what durable execution is and how it impacts system architecture?
• With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work?
• One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.?
• What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects?
• Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data?
• AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications?
• What are some of the conceptual hurdles that teams experience when they are starting to adopt Temporal or other durable execution frameworks?
• What are the most interesting, innovative, or unexpected ways that you have seen Temporal/durable execution used for data/AI services?
• What are the most interesting, unexpected, or challenging lessons that you have learned while working on Temporal?
• When is Temporal/durable execution the wrong choice?
• What do you have planned for the future of Temporal for data and AI systems?

Contact Info
• LinkedIn

Parting Question
• From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
• Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
• Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
• If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
• Temporal
• Durable Execution
• Flink
• Machine Learning Epoch
• Spark Streaming
• Airflow
• Directed Acyclic Graph (DAG)
• Temporal Nexus
• TensorZero
• AI Engineering Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.
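To make the code-first model described above concrete, here is a minimal sketch using the open-source temporalio Python SDK. The workflow, activity, task-queue name, and retry settings are illustrative assumptions rather than anything discussed in the episode; the point is that retries, timeouts, and state recovery are declared, not hand-rolled.

```python
# Minimal sketch of Temporal's code-first model (workflow + activity + worker).
# Names, timeouts, and the task queue are illustrative assumptions.
import asyncio
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.common import RetryPolicy
from temporalio.worker import Worker


@activity.defn
async def enrich_record(record_id: str) -> str:
    # Ordinary code: call a database or API here; Temporal retries it on failure.
    return f"enriched-{record_id}"


@workflow.defn
class EnrichmentWorkflow:
    @workflow.run
    async def run(self, record_id: str) -> str:
        # Workflow state is preserved via event-sourced replay, so no
        # hand-rolled checkpointing or retry scaffolding is needed here.
        return await workflow.execute_activity(
            enrich_record,
            record_id,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=5),
        )


async def main() -> None:
    client = await Client.connect("localhost:7233")  # assumes a local dev server
    async with Worker(
        client,
        task_queue="enrichment-queue",
        workflows=[EnrichmentWorkflow],
        activities=[enrich_record],
    ):
        result = await client.execute_workflow(
            EnrichmentWorkflow.run,
            "record-42",
            id="enrich-record-42",
            task_queue="enrichment-queue",
        )
        print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

If the worker process crashes mid-run, a replacement worker replays the workflow history and resumes from the last completed activity, which is the behavior the episode contrasts with hand-written retry and checkpoint code.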
At Fresha, we were pioneers in putting StarRocks to the test in production for real-time analytical workloads. One of the first challenges we faced was getting all the data there reliably and efficiently. We had to handle historical data and real-time data, and orchestrate all of it so that we could move fast without breaking too many things. Our tools of choice: Airflow, StarRocks Pipes, and Apache Flink. In this talk, I'll share how we built our data pipelines using Apache Flink and Airflow, and what did and didn't work for us. Along the way, we'll explore how Flink helps ensure data consistency, handles failures gracefully, and keeps our real-time workloads running strong.
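As a rough illustration of the fault-tolerance point, the sketch below enables checkpointing in a tiny PyFlink job; the interval and toy data are placeholder assumptions, not Fresha's actual pipeline configuration.

```python
# Rough sketch: enabling checkpointing so a Flink job can recover consistently
# after failures. The interval and data are placeholders, not Fresha's setup.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # snapshot operator state every 60 seconds

# Toy pipeline standing in for a real source-to-warehouse flow.
env.from_collection(["order-1", "order-2", "order-3"]) \
   .map(lambda order: order.upper()) \
   .print()

env.execute("checkpointed-demo-job")
```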
Daniel will take us on a hands-on journey into building AI analyst agents from scratch. Using dbt metadata to provide large language models with the right context, he'll show how to connect LLMs to your data effectively. Expect a deep dive into the challenges of query generation, practical frameworks for success, and lessons learned from real-world implementations.
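One hedged sketch of the general idea, assuming a standard dbt manifest.json and an illustrative prompt shape (not Daniel's actual framework): extract model names, descriptions, and documented columns as context for the model.

```python
# Sketch: turn dbt metadata (manifest.json) into prompt context for an LLM.
# The manifest path and prompt wording are illustrative assumptions.
import json
from pathlib import Path


def build_model_context(manifest_path: str) -> str:
    """Summarize each dbt model's name, description, and documented columns."""
    manifest = json.loads(Path(manifest_path).read_text())
    lines = []
    for node in manifest["nodes"].values():
        if node.get("resource_type") != "model":
            continue
        cols = ", ".join(node.get("columns", {})) or "(no documented columns)"
        lines.append(f"- {node['name']}: {node.get('description', '')} [columns: {cols}]")
    return "\n".join(lines)


def build_prompt(question: str, context: str) -> str:
    # Hand this string to whichever LLM client you use (OpenAI, Anthropic, local).
    return (
        "You are an analyst agent. Use only these dbt models:\n"
        f"{context}\n\nWrite SQL to answer: {question}"
    )


if __name__ == "__main__":
    ctx = build_model_context("target/manifest.json")  # produced by `dbt compile`
    print(build_prompt("What was revenue by month last year?", ctx))
```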
In this twofold session, I'll cover how we've used dbt to bring order to heaps of SQL statements used to manage a data warehouse. I'd like to share how dbt made our team more efficient and our data warehouse more resilient. Secondly, I'll highlight how dbt enabled a way forward for supporting low-code applications: by leveraging our data warehouse as a backend. I'll dive into system design, application architecture, and data modelling. Tools/tech covered will be SQL, Trino, OutSystems, Git, Airflow, and of course dbt! Expect practical insights, architectural patterns, and lessons learned from a real-world implementation.
Managing dbt for 150 analytics engineers meant evolving from fragmented dbt Core projects to unified standards by migrating to dbt Cloud. We addressed security risks and inconsistent practices through standardization and centralized workflows, while maintaining our Airflow orchestration. Challenges remain in balancing governance with analyst autonomy at scale.
In this episode, we talk with Daniel, an astrophysicist turned machine learning engineer and AI ambassador. Daniel shares his journey bridging astronomy and data science, how he leveraged live courses and public knowledge sharing to grow his skills, and his experiences working on cutting-edge radio astronomy projects and AI deployments. He also discusses practical advice for beginners in data and astronomy, and insights on career growth through community and continuous learning.

TIMECODES
00:00 Lunar eclipse story and Daniel's astronomy career
04:12 Electromagnetic spectrum and MeerKAT data explained
10:39 Data analysis and positional cross-correlation challenges
15:25 Physics behind radio star detection and observation limits
16:35 Radio astronomy's advantage and machine learning potential
20:37 Radio astronomy progress and Daniel's ML journey
26:00 Python tools and experience with ZoomCamps
31:26 Intel internship and exploring LLMs
41:04 Sharing progress and course projects with orchestration tools
44:49 Setting up Airflow 3.0 and building data pipelines
47:39 AI startups, training resources, and NVIDIA courses
50:20 Student access to education, NVIDIA experience, and beginner astronomy programs
57:59 Skills, projects, and career advice for beginners
59:19 Starting with data science or engineering
1:00:07 Course sponsorship, data tools, and learning resources

Connect with Daniel: LinkedIn - / egbodaniel

Connect with DataTalks.Club:
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
GitHub - https://github.com/DataTalksClub
LinkedIn - / datatalks-club
Twitter - / datatalksclub
Website - https://datatalks.club/
At Vinted, Europe’s largest second-hand marketplace, over 20 decentralized data teams generate, transform, and build products on petabytes of data. Each team utilizes their own tools, workflows, and expertise. Coordinating data pipeline creation across such diverse teams presents significant challenges. These include complex inter-team dependencies, inconsistent scheduling solutions, and rapidly evolving requirements.
This talk is aimed at data engineers, platform engineers, and technical leads with experience in workflow orchestration and will demonstrate how we empower teams at Vinted to define data pipelines quickly and reliably. We will present our user-friendly abstraction layer built on top of Apache Airflow, enhanced by a Python code generator. This abstraction simplifies upgrades and migrations, removes scheduler complexity, and supports Vinted’s rapid growth. Attendees will learn how Python abstractions and code generation can standardize pipeline development across diverse teams, reduce operational complexity, and enable greater flexibility and control in large-scale data organizations. Through practical lessons and real-world examples of our abstraction interface, we will offer insights into designing scheduler-agnostic architectures for successful data pipeline orchestration.
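The exact interface is Vinted's own, but the general pattern of an abstraction layer plus code generation can be sketched as a small config-driven DAG factory. The pipeline specs, operator choice, and defaults below are hypothetical, and Airflow 2.x-style imports are assumed.

```python
# Hypothetical sketch of a config-driven abstraction over Airflow: teams supply
# a small spec and a factory emits the DAG, keeping scheduler details hidden.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

PIPELINE_SPECS = [
    # Each team declares what to run and when; the factory owns Airflow details.
    {"name": "orders_daily", "schedule": "@daily", "steps": ["extract", "load"]},
    {"name": "catalog_hourly", "schedule": "@hourly", "steps": ["sync"]},
]


def build_dag(spec: dict) -> DAG:
    dag = DAG(
        dag_id=spec["name"],
        schedule=spec["schedule"],          # Airflow 2.4+ style parameter
        start_date=datetime(2025, 1, 1),
        catchup=False,
        default_args={"retries": 2},        # platform-wide defaults live here
    )
    previous = None
    for step in spec["steps"]:
        task = BashOperator(
            task_id=step,
            bash_command=f"echo running {spec['name']}.{step}",
            dag=dag,
        )
        if previous:
            previous >> task
        previous = task
    return dag


# Register one DAG per spec in the module namespace so Airflow discovers them.
for _spec in PIPELINE_SPECS:
    globals()[_spec["name"]] = build_dag(_spec)
```

Because teams only touch the spec, upgrades and scheduler migrations can be handled once inside the factory rather than in every pipeline definition, which is the benefit the talk describes.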
In this talk, we will introduce Ordeq, a cutting-edge data pipeline development framework used by data engineers, scientists, and analysts across ING. Ordeq helps you modularise pipeline logic and abstract IO, elevating projects from proof of concept to maintainable production-level applications. We will demonstrate how Ordeq integrates seamlessly with popular data processing tools like Spark, Polars, Matplotlib, and DSPy, and with orchestration tools such as Airflow. Additionally, we will showcase how you can leverage Ordeq on public cloud offerings like GCP. Ordeq has zero dependencies and is available under the MIT license.
Following on from the Building consumable data products keynote, we will dive deeper into the interactions around the data product catalog to show how the network effect of explicit data-sharing relationships starts to pay dividends to the participants. For example:
For the product consumer:
• Searching for products and understanding content, costs, terms and conditions, licenses, quality certifications, etc.
• Inspecting sample data, choosing preferred data format, setting up a secure subscription, and seeing data provisioned into a database from the product catalog.
• Providing feedback and requesting help
• Reviewing own active subscriptions
• Understanding the lineage behind each product along with outstanding exceptions and future plans
For the product manager/owner:
• Setting up a new product, creating a new release of an existing product and issuing a data correction/restatement
• Reviewing a product’s active subscriptions and feedback/requests from consumers
• Interacting with the technical teams on pipeline implementations along with issues and proposed enhancements
For the data governance team:
• Viewing the network of dependencies between data products (the data mesh) to understand the data value chains and risk concentrations
• Reviewing a dashboard of metrics around the data products including popularity, errors/exceptions, subscriptions, interaction
• Showing traceability from a governance policy relating to, say, data sovereignty or data privacy, to the product implementations
• Building trust profiles for producers and consumers
The aim of the demonstrations and discussions is to explore the principles and patterns relating to data products, rather than push a particular implementation approach.
Having said that, all of the software used in the demonstrations is open source: principally Egeria, OpenLineage, and Unity Catalog from the Linux Foundation, plus Apache Airflow, Apache Kafka, and Apache Superset from the Apache Software Foundation.
Videos of the demonstrations will be available on YouTube after the conference and the complete demo software can be downloaded and run on a laptop so you can share your experiences with your teams after the event.
Apache Airflow is the go-to platform for data orchestration, while dbt is widely recognized for analytical data transformations. Using the astronomer-cosmos library, integrating dbt projects into Airflow becomes straightforward, allowing each dbt model to be turned into an individual task or task group equipped with Airflow features like retries and callbacks. However, organizing dbt models into separate Airflow DAGs based on domain or user-defined filters presents challenges in maintaining dependencies across these distinct DAGs. Ensuring that downstream dbt tasks only execute after the corresponding upstream tasks in different DAGs have successfully completed is crucial for data consistency, yet this functionality is not supported by default. Join GetYourGuide as we explore our method for dynamically creating inter-DAG sensors in Airflow using Astronomer Cosmos for dbt. We will show how we maintained dbt model dependencies across multiple DAGs, making our pipeline modular, scalable, and robust.
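A stripped-down sketch of the underlying idea, using Airflow's built-in ExternalTaskSensor with illustrative DAG and task names rather than GetYourGuide's actual Cosmos-generated identifiers; in practice the sensor targets would be derived dynamically from the dbt manifest.

```python
# Sketch: gate a downstream dbt DAG on an upstream dbt model that lives in a
# different DAG. DAG ids, task ids, and schedules are illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="marts_finance",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    # Wait for the upstream model's task in the staging DAG to succeed first.
    wait_for_stg_payments = ExternalTaskSensor(
        task_id="wait_for_stg_payments",
        external_dag_id="staging_core",
        external_task_id="stg_payments_run",
        execution_delta=timedelta(hours=1),  # reconciles differing schedules
        mode="reschedule",
        timeout=60 * 60,
    )

    # Stand-in for the Cosmos-generated task group that runs the finance models.
    run_fct_revenue = EmptyOperator(task_id="run_fct_revenue")

    wait_for_stg_payments >> run_fct_revenue
```

Generating one such sensor per cross-DAG model dependency is what keeps the split-by-domain DAGs consistent without manual wiring.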
dbt has become the de facto standard for transforming data in modern analytics stacks. But as projects grow, so does the question: where should dbt run in production, and how can we make it faster? In this talk, we'll compare the performance trade-offs between running dbt natively and orchestrating it through Airflow using Cosmos, with a focus on workflow efficiency at scale. Using a 200-model dbt project as a case study, we'll show how workflow execution time in Cosmos was reduced from 15 minutes to just 5 minutes. We'll also discuss opportunities to push performance further, ranging from better DAG optimization to warehouse-aware scheduling strategies. Whether you're a data engineer, analytics engineer, or platform owner, you'll leave with practical strategies to optimize dbt execution and inspiration for what's next in large-scale orchestration.
This presentation highlights practical validation techniques to prevent misconfigurations and enhance reliability in Apache Airflow environments. We cover two key safeguards: validating that sensors are correctly tied to their upstream tasks, and checking that critical DAGs have PagerDuty alerting enabled. Both validations are automated and integrated into our CI/CD pipeline using GitHub Actions, ensuring continuous enforcement and early detection of potential issues before deployment. In addition, we’ve implemented a solution to track Service Level Objectives (SLOs) for our DAGs, enabling better insight into reliability and performance over time. These checks form a practical defense against operational blind spots, promoting workflow reliability and robust incident response. Join us as we uncover practical strategies to streamline workflow monitoring and enhance system resilience using Apache Airflow's robust capabilities.
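As a hedged illustration of such CI checks (not the presenters' exact implementation), a pytest-style test can load the DagBag and assert both safeguards; the "critical" tag and the on_failure_callback convention for PagerDuty are assumptions introduced here.

```python
# Sketch of CI-time DAG validation, runnable via pytest in a GitHub Actions job.
# The "critical" tag and callback convention are assumptions, not necessarily
# how the talk's authors wired their PagerDuty alerting.
from airflow.models import DagBag
from airflow.sensors.external_task import ExternalTaskSensor

dag_bag = DagBag(include_examples=False)


def test_dags_import_cleanly():
    assert dag_bag.import_errors == {}, dag_bag.import_errors


def test_sensors_point_at_real_upstream_tasks():
    # Each ExternalTaskSensor must reference a DAG and task that actually exist.
    for dag in dag_bag.dags.values():
        for task in dag.tasks:
            if isinstance(task, ExternalTaskSensor):
                upstream = dag_bag.dags.get(task.external_dag_id)
                assert upstream is not None, (
                    f"{dag.dag_id}.{task.task_id} waits on missing DAG "
                    f"{task.external_dag_id}"
                )
                if task.external_task_id:
                    assert upstream.has_task(task.external_task_id), (
                        f"{dag.dag_id}.{task.task_id} waits on missing task "
                        f"{task.external_dag_id}.{task.external_task_id}"
                    )


def test_critical_dags_have_alerting():
    # Assumed convention: DAGs tagged 'critical' must define an
    # on_failure_callback (e.g. one that opens a PagerDuty incident).
    for dag in dag_bag.dags.values():
        if "critical" in (dag.tags or []):
            assert dag.on_failure_callback is not None, (
                f"{dag.dag_id} is tagged critical but has no on_failure_callback"
            )
```

Running this in the CI pipeline means a misconfigured sensor or an unalerted critical DAG fails the pull request before it ever reaches the scheduler.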
Airflow Town Hall – 2025-08-08
Town Hall update on the Apache Airflow project: an overview of the latest progress, milestones, and roadmap.