Airflow

Data Engineering Central Podcast - 04

2024-11-20 · Data Engineering Central Podcast

podcast_episode

by Daniel Beach

Airflow Data Engineering Databricks DuckDB Polars

It’s time for another episode of the Data Engineering Central Podcast. In this episode we cover:

* Apache Airflow vs Databricks Workflows
* End-of-Year Engineering Planning for 2025
* 10 Billion Row Challenge with DuckDB vs Daft vs Polars
* Raw Data Ingestion

As usual, the full episode is available to paid subscribers, and a shortened version to you freeloaders out there. Don’t worry, I still love you though.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Using the modern stack to build a product catalog for scalability

2024-11-12 · Data November Talk Evening

talk

by Sara García Cabezalí (HelloPrint)

Airflow dbt openai

This meeting will focus on real-world use cases. HelloPrint, an online platform for printed products with over 10 million SKUs, leveraged a modern data stack, including dbt, Airflow, and OpenAI, to streamline its product catalog. Our team’s solution reduced manual tasks by 80%, showcasing the power of automation and data-driven processes.

Apache Airflow Best Practices

2024-10-31 · O'Reilly Data Engineering Books

book

by Dylan Intorf, Dylan Storey, Kendrick van Doorn

Airflow AWS Cloud Computing Data Engineering GCP Python apache-airflow data data-engineering

"Apache Airflow Best Practices" is your go-to guide for mastering data workflow orchestration using Apache Airflow. This book introduces you to core concepts and features of Airflow and helps you efficiently design, deploy, and manage workflows. With detailed examples and hands-on tutorials, you'll learn how to tackle real-world challenges in data engineering. What this Book will help me do Understand and utilize the features and updates introduced in Apache Airflow 2.x. Design and implement robust, scalable, and efficient data pipelines and workflows. Learn best practices for deploying Apache Airflow in cloud environments such as AWS and GCP. Extend Airflow's functionality with custom plugins and advanced configuration. Monitor, maintain, and scale your Airflow deployment effectively for high availability. Author(s) Dylan Intorf, Dylan Storey, and Kendrick van Doorn are seasoned professionals in data engineering, data strategy, and software development. Between them, they bring decades of experience working in diverse industries like finance, tech, and life sciences. They bring their expertise into this practical guide to help practitioners understand and master Apache Airflow. Who is it for? This book is tailored for data professionals such as data engineers, scientists, and system administrators, offering valuable insights for new learners and experienced users. If you're starting with workflow orchestration, seeking to optimize your current Airflow implementation, or scaling efforts, this book aligns with your goals. Readers should have a basic knowledge of Python programming and data engineering principles.

Airflow 3.0

2024-10-24 · Astronomer's Forum for Apache Airflow®

talk

Airflow airflow 3.0 distributed data infrastructure

Introduction to the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve. This will include an overview of the architectural changes in Airflow to support emerging use cases and distributed data infrastructure models.

Airflow for GenAI and Machine Learning

2024-10-24 · Astronomer's Forum for Apache Airflow®

talk

Airflow genai machine learning

Explore how data orchestration with Airflow is driving the development of robust, scalable AI solutions.

Observability on Airflow

2024-10-24 · Astronomer's Forum for Apache Airflow®

talk

Airflow observability

Best practices and strategies for maximizing visibility into your data pipelines, including attaching SLAs to workflows to ensure timely delivery of reliable data products.

Coalesce 2024: Transitioning from dbt Core to dbt Cloud: A user story

2024-10-16 · Dbt Coalesce 2024

video

by Mathias Lavaert (DPG Media), Alejandro Ivanez (DPG Media)

Airflow Analytics AWS Cloud Computing Data Analytics dbt Cyber Security

Join us as we share our journey of migrating from dbt Core to dbt Cloud. We'll discuss why we made this shift – focusing on security, ownership, and standardization. Starting with separate team-based projects on dbt Core, we moved towards a unified structure, and eventually embraced dbt Cloud. Now, all teams follow a common structure and standardized requirements, ensuring better security and collaboration.

In our session, we'll explore how we improved our data analytics processes by migrating from dbt Core to dbt Cloud. Initially, each team had its own way of working on dbt Core, leading to security risks and inconsistent practices. To address this, we transitioned to a more unified approach on dbt Core. This year we migrated to dbt Cloud, which allowed us to centralize our data analytics workflows, enhancing security and promoting collaboration.

For scheduling, we manage our own Airflow instance on AWS EKS. We use DataHub as our data catalog.

Key points:

* Enhanced Security: dbt Cloud provided robust security features, helping us safeguard our data pipelines.
* Ownership and Collaboration: With dbt Cloud, teams took ownership of their projects while collaborating more effectively.
* Standardization: We enforced standardized requirements across all projects, ensuring consistency and efficiency, using dbt-project-evaluator.

Speakers:

* Alejandro Ivanez, Platform Engineer, DPG Media
* Mathias Lavaert, Principal Platform Engineer, DPG Media

Read the blog to learn about the latest dbt Cloud features announced at Coalesce, designed to help organizations embrace analytics best practices at scale https://www.getdbt.com/blog/coalesce-2024-product-announcements

Coalesce 2024: From Core to Cloud: Unlocking dbt at Warner Brothers Discovery (CNN)

2024-10-16 · Dbt Coalesce 2024

video

by Mamta Gupta (Warner Brothers Discovery), Zachary Lancaster (Warner Brothers Discovery)

Airflow Analytics AWS Cloud Computing Data Engineering Data Quality dbt IaC Terraform

Since the beginning of 2024, the Warner Brothers Discovery team supporting the CNN data platform has been undergoing an extensive migration project from dbt Core to dbt Cloud. Concurrently, the team is also segmenting their project into multi-project frameworks utilizing dbt Mesh. In this talk, Zachary will review how this transition has simplified data pipelines, improved pipeline performance and data quality, and made data collaboration at scale more seamless.

He'll discuss how dbt Cloud features like the Cloud IDE, automated testing, documentation, and code deployment have enabled the team to standardize on a single developer platform while also managing dependencies effectively. He'll share details on how the automation framework they built using Terraform streamlines dbt project deployments with dbt Cloud to a "push-button" process. By leveraging an infrastructure as code experience, they can orchestrate the creation of environment variables, dbt Cloud jobs, Airflow connections, and AWS secrets with a unified approach that ensures consistency and reliability across projects.

Speakers:

* Mamta Gupta, Staff Analytics Engineer, Warner Brothers Discovery
* Zachary Lancaster, Manager, Data Engineering, Warner Brothers Discovery


Airflow 3.0

2024-10-03 · Astronomer's Airflow Forum

talk

Airflow

Introduction to the vision behind Airflow 3, including the emerging technology trends in the industry and how Airflow will evolve. This will include an overview of the architectural changes in Airflow to support emerging use cases and distributed data infrastructure models.

Airflow for GenAI and Machine Learning

2024-10-03 · Astronomer's Airflow Forum

talk

Airflow genai machine learning

Explore how data orchestration with Airflow is driving the development of robust, scalable AI solutions and hear directly from those at the forefront of innovation.

Observability on Airflow

2024-10-03 · Astronomer's Airflow Forum

talk

Airflow observability

Best practices and strategies for maximizing visibility into your data pipelines, including attaching SLAs to workflows to ensure timely delivery of reliable data products.

Data Engineering for Machine Learning Pipelines: From Python Libraries to ML Pipelines and Cloud Platforms

2024-09-27 · O'Reilly Data Engineering Books

book

by Pavan Kumar Narayanan

AI/ML Airflow Analytics API AWS Azure Cloud Computing Data Analytics Data Engineering Data Quality GCP Kafka

This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and CuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability.

The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.

What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool. It is a career catalyst, and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.

What You Will Learn

* Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and CuDF at unprecedented speeds
* Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines and master the art of workflow orchestration to streamline your engineering projects
* Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure

Who This Book Is For

Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists

Big Data on Kubernetes

2024-07-19 · O'Reilly Data Engineering Books

book

by Neylson Crepalde

Airflow BI Big Data Docker Kafka Kubernetes Python Spark SQL YAML data data-engineering

Big Data on Kubernetes is your comprehensive guide to leveraging Kubernetes for scalable and efficient big data solutions. You will learn key concepts of Kubernetes architecture and explore tools like Apache Spark, Airflow, and Kafka. Gain hands-on experience building complete data pipelines to tackle real-world data challenges.

What this Book will help me do

* Understand Kubernetes architecture and learn to deploy and manage clusters.
* Build and orchestrate big data pipelines using Spark, Airflow, and Kafka.
* Develop scalable and resilient data solutions with Docker and Kubernetes.
* Integrate and optimize data tools for real-time ingestion and processing.
* Apply concepts to hands-on projects addressing actual big data scenarios.

Author(s)

Neylson Crepalde is an experienced data specialist with extensive knowledge of Kubernetes and big data solutions. With deep practical experience, Neylson brings real-world insights to his writing. His approach emphasizes actionable guidance and relatable problem-solving with a strong foundation in scalable architecture.

Who is it for?

This book is ideal for data engineers, BI analysts, data team leaders, and tech managers familiar with Python, SQL, and YAML. Targeted at professionals seeking to develop or expand their expertise in scalable big data solutions, it provides practical insights into Docker, Kubernetes, and prominent big data tools.