talk-data.com talk-data.com

Topic

Airflow

Apache Airflow

workflow_management data_orchestration etl

682

tagged

Activity Trend

157 peak/qtr
2020-Q1 2026-Q1

Activities

682 activities · Newest first

It’s time for another episode of the Data Engineering Central Podcast. In this episode we cover … * Apache Airflow vs Databricks Workflows * End-of-Year Engineering Planning for 2025 * 10 Billion Row Challenge with DuckDB vs Daft vs Polars * Raw Data Ingestion. As usual, the full episode is available to paid subscribers, and a shortened version to you free loaders out there, don’t worry, I still love you though.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

This meeting will focus on real-world use cases. HelloPrint, an online platform for printed products with over 10 million SKUs, leveraged a modern data stack—including DBT, Airflow, and OpenAI—to streamline its product catalog. Our team’s solution reduced manual tasks by 80%, showcasing the power of automation and data-driven processes.

Apache Airflow Best Practices

"Apache Airflow Best Practices" is your go-to guide for mastering data workflow orchestration using Apache Airflow. This book introduces you to core concepts and features of Airflow and helps you efficiently design, deploy, and manage workflows. With detailed examples and hands-on tutorials, you'll learn how to tackle real-world challenges in data engineering. What this Book will help me do Understand and utilize the features and updates introduced in Apache Airflow 2.x. Design and implement robust, scalable, and efficient data pipelines and workflows. Learn best practices for deploying Apache Airflow in cloud environments such as AWS and GCP. Extend Airflow's functionality with custom plugins and advanced configuration. Monitor, maintain, and scale your Airflow deployment effectively for high availability. Author(s) Dylan Intorf, Dylan Storey, and Kendrick van Doorn are seasoned professionals in data engineering, data strategy, and software development. Between them, they bring decades of experience working in diverse industries like finance, tech, and life sciences. They bring their expertise into this practical guide to help practitioners understand and master Apache Airflow. Who is it for? This book is tailored for data professionals such as data engineers, scientists, and system administrators, offering valuable insights for new learners and experienced users. If you're starting with workflow orchestration, seeking to optimize your current Airflow implementation, or scaling efforts, this book aligns with your goals. Readers should have a basic knowledge of Python programming and data engineering principles.

Coalesce 2024: Transitioning from dbt Core to dbt Cloud: A user story

Join us as we share our journey of migrating from dbt Core to dbt Cloud. We'll discuss why we made this shift – focusing on security, ownership, and standardization. Starting with separate team-based projects on dbt Core, we moved towards a unified structure, and eventually embraced dbt Cloud. Now, all teams follow a common structure and standardized requirements, ensuring better security and collaboration.

In our session, we'll explore how we improved our data analytics processes by migrating from dbt Core to dbt Cloud. Initially, each team had its way of working on dbt Core, leading to security risks and inconsistent practices. To address this, we transitioned to a more unified approach on dbt Core. This year we migrated dbt Cloud, which allowed us to centralize our data analytics workflows, enhancing security and promoting collaboration.

For scheduling we manage our own Airflow instance using AWS EKS. We use Datahub as data catalog.

Key points: Enhanced Security: dbt Cloud provided robust security features, helping us safeguard our data pipelines. Ownership and Collaboration: With dbt Cloud, teams took ownership of their projects while collaborating more effectively. Standardization: We enforced standardized requirements across all projects, ensuring consistency and efficiency, using dbt-project-evaluator.

Speakers: Alejandro Ivanez Platform Engineer DPG Media

Mathias Lavaert Principal Platform Engineer DPG Media

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: From Core to Cloud: Unlocking dbt at Warner Brothers Discovery (CNN)

Since the beginning of 2024, the Warner Brothers Discovery team supporting the CNN data platform has been undergoing an extensive migration project from dbt Core to dbt Cloud. Concurrently, the team is also segmenting their project into multi-project frameworks utilizing dbt Mesh. In this talk, Zachary will review how this transition has simplified data pipelines, improved pipeline performance and data quality, and made data collaboration at scale more seamless.

He'll discuss how dbt Cloud features like the Cloud IDE, automated testing, documentation, and code deployment have enabled the team to standardize on a single developer platform while also managing dependencies effectively. He'll share details on how the automation framework they built using Terraform streamlines dbt project deployments with dbt Cloud to a ""push-button"" process. By leveraging an infrastructure as code experience, they can orchestrate the creation of environment variables, dbt Cloud jobs, Airflow connections, and AWS secrets with a unified approach that ensures consistency and reliability across projects.

Speakers: Mamta Gupta Staff Analytics Engineer Warner Brothers Discovery

Zachary Lancaster Manager, Data Engineering Warner Brothers Discovery

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Introduction to the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve. This will include an overview of the architectural changes in Airflow to support emerging use cases and distributed data infrastructure models.

Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and CuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows. What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool. It is a career catalyst, and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world. What You Will Learn Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and CuDF at unprecedented speeds Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines and master the art of workflow orchestration to streamline your engineering projects Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure Who This Book Is For Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists

Big Data on Kubernetes

Big Data on Kubernetes is your comprehensive guide to leveraging Kubernetes for scalable and efficient big data solutions. You will learn key concepts of Kubernetes architecture and explore tools like Apache Spark, Airflow, and Kafka. Gain hands-on experience building complete data pipelines to tackle real-world data challenges. What this Book will help me do Understand Kubernetes architecture and learn to deploy and manage clusters. Build and orchestrate big data pipelines using Spark, Airflow, and Kafka. Develop scalable and resilient data solutions with Docker and Kubernetes. Integrate and optimize data tools for real-time ingestion and processing. Apply concepts to hands-on projects addressing actual big data scenarios. Author(s) Neylson Crepalde is an experienced data specialist with extensive knowledge of Kubernetes and big data solutions. With deep practical experience, Neylson brings real-world insights to his writing. His approach emphasizes actionable guidance and relatable problem-solving with a strong foundation in scalable architecture. Who is it for? This book is ideal for data engineers, BI analysts, data team leaders, and tech managers familiar with Python, SQL, and YAML. Targeted at professionals seeking to develop or expand their expertise in scalable big data solutions, it provides practical insights into Docker, Kubernetes, and prominent big data tools.

10 years after its creation, Airflow is stronger than ever: in last year’s Airflow survey, 81% of users said Airflow is important or very important to their business, 87% said their Airflow usage has grown over time, and 92% said they would recommend Airflow. In this panel discussion, we’ll celebrate a decade of Airflow and delve into how it became the highly recommended industry standard it is today, including history, pivotal moments, and the role of the community. Our panel of seasoned experts will also talk about where Airflow is going next, including future use cases like generative AI and the highly anticipated Airflow 3.0. Don’t miss this insightful exploration into one of the most influential tools in the data landscape.

OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase, facilitating lineage collection across providers like Google, Amazon, and more. Atlan Data Catalog is a 3rd generation active metadata platform that is a single source of trust unifying cataloging, data discovery, lineage, and governance experience. We will demonstrate what OpenLineage is and how, with minimal and intuitive setup across Airlfow and Atlan, it presents unified workflows view, efficient cross-platform lineage collection, including column level, in various technologies (Python, Spark, dbt, SQL etc.) and clouds (AWS, Azure, GCP, etc.) - all orchestrated by Airflow. This integration enables further use case unlocks on automated metadata management by making the operational pipelines dataset-aware for self-service exploration. It also will demonstrate real world challenges and resolutions for lineage consumers in improving audit and compliance accuracy through column-level lineage traceability across the data estate. The talk will also briefly overview the most recent OpenLineage developments and planned future enhancements.

At Vibrant Planet, we’re on a mission to make the world’s communities and ecosystems more resilient in the face of climate change. Our cloud-based platform is designed for collaborative scenario planning to tackle wildfires, climate threats, and ecosystem restoration on a massive scale. In this talk we will dive into how we are using Airflow. Particularly we will focus on how we’re making Airflow pipelines smarter and more resilient, especially when dealing with the task of processing large satellite imagery and other geospatial data. Self-Healing Pipelines: Discuss our self-healing pipelines which identify likely out-of-memory events and incrementally allocate more memory for task instance retries, ensuring robust and uninterrupted workflow execution. Initial Memory Recommendations: We’ll discuss how we set intelligent initial memory allocations for each task instance, enhancing resource efficiency from the outset.

Apache Airflow has a lot of configuration options. A change in some of these options can affect the performance of Airflow. If you are wondering why your Airflow instance is not running the number of tasks you expected it to run, after this talk, you will have a better understanding of the configuration options available for improving the number of tasks your Airflow instance can run. We will talk about the DAG parsing configuration options, options for scheduler scalability, etc., and the pros and cons of these options.

When developing Machine Learning (ML) models, the biggest challenges are often infrastructural. How do we deploy our model and expose an inference API? How can we retrain? Can we continuously evaluate performance and monitor model drift? In this talk, we will present how we are tackling these problems at the Philadelphia Phillies by developing a suite of tools that enable our software engineering and analytics teams to train, test, evaluate, and deploy ML models - that can be entirely orchestrated in Airflow. This framework abstracts away the infrastructural complexities that productionizing ML Pipelines presents and allows our analysts to focus on developing robust baseball research for baseball operations stakeholders across player evaluation, acquisition, and development. We’ll also look at how we use Airflow, MLflow, MLServer, cloud services, and GitHub Actions to architect a platform that supports our framework for all points of the ML Lifecycle.

Join us as we check in on the current status of AIP-63: DAG Versioning. This session will explore the motivations behind AIP-63, the challenges faced by Airflow users in understanding and managing DAG history, and how it aims to address them. From tracking TaskInstance history to improving DAG representation in the UI, we’ll examine what we’ve already done and what’s next. We’ll also touch upon the potential future steps outlined in AIP-66 regarding the execution of specific DAG versions.