
Event

Airflow Summit 2024

2024-07-01 Airflow Summit

Activities tracked

98

Airflow Summit 2024 program

Sessions & talks

Showing 26–50 of 98 · Newest first


Bronco: Managing Terraform at Scale with Airflow

2024-07-01
session

Airflow is not purpose-built solely for data applications; it is a job scheduler on steroids. This is exactly what a cloud platform team needs: a configurable and scalable automation tool that can handle thousands of administrative tasks. Come learn how one enterprise platform team used Airflow to support cloud infrastructure at unprecedented scale.
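As a rough illustration of the pattern described above (not the speakers' actual implementation), here is a minimal sketch of a DAG that drives Terraform through shell tasks; the repository path and schedule are illustrative assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Minimal sketch: orchestrate a Terraform init/plan/apply cycle from Airflow.
    # The working directory below is an illustrative placeholder.
    TF_DIR = "/opt/terraform/network"

    with DAG(
        dag_id="terraform_workspace_apply",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # triggered on demand
        catchup=False,
    ):
        init = BashOperator(
            task_id="terraform_init",
            bash_command=f"cd {TF_DIR} && terraform init -input=false",
        )
        plan = BashOperator(
            task_id="terraform_plan",
            bash_command=f"cd {TF_DIR} && terraform plan -out=tfplan -input=false",
        )
        apply = BashOperator(
            task_id="terraform_apply",
            bash_command=f"cd {TF_DIR} && terraform apply -input=false tfplan",
        )
        init >> plan >> apply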

Building in Resource Awareness and Event Dependency into Airflow

2024-07-01
session

In this talk, we will explore how adding custom dependency checks to Airflow’s scheduling system can elevate Airflow’s performance. We will specifically discuss how we added general upstream event dependency checking, as well as how to make Airflow aware of used and available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure. We’ll cover why the existing dependency checking in Airflow is not sufficient for our use case, why adding custom code to Airflow is needed, and the pros and cons of this approach.

Building on Cosmos: Making dbt on Airflow Easy

2024-07-01
session

Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: How can we quickly and easily productionise our projects? Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before. We built a single solution on top of Cosmos that allowed us to: decouple the dbt project from the Airflow repository; have each dbt node run as a separate Airflow task; allow users to run dbt with little to no Airflow knowledge; enable users to have fine-grained control over how dbt is run and to combine it with other Airflow tasks; and provide observability, monitoring, and alerting.
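For readers unfamiliar with Cosmos, a minimal sketch of the pattern it enables (not BAM’s actual code), assuming the astronomer-cosmos package, an illustrative dbt project path, and an Airflow Snowflake connection named snowflake_default:

    from datetime import datetime

    from cosmos import DbtDag, ProfileConfig, ProjectConfig
    from cosmos.profiles import SnowflakeUserPasswordProfileMapping

    # Cosmos renders each dbt model in the project as its own Airflow task.
    profile_config = ProfileConfig(
        profile_name="analytics",
        target_name="prod",
        profile_mapping=SnowflakeUserPasswordProfileMapping(
            conn_id="snowflake_default",  # assumed Airflow connection
            profile_args={"database": "ANALYTICS", "schema": "MARTS"},
        ),
    )

    dbt_analytics = DbtDag(
        dag_id="dbt_analytics",
        project_config=ProjectConfig("/usr/local/airflow/dbt/analytics"),  # illustrative path
        profile_config=profile_config,
        schedule="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )

Because the dbt project is referenced by path, it can live in its own repository and be mounted or baked into the Airflow image, which is one way to decouple the dbt code from the Airflow repository.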

Building Reliable Data Products

2024-07-01
session

Data engineers have shifted from delivering data for internal analytics applications to customer-facing data products. And with that shift comes a whole new level of operational rigor necessary to instill trust and confidence in the data. How do you hold data pipelines to the same standards as traditional software applications? Can you apply principles learned from the field of SRE to the world of data? In this talk, we’ll explore how we’ve seen this evolve in Astronomer’s customer base and highlight best practices learned from the most critical data product applications we’ve seen. We’ll hear from Astronomer’s own data team as they went through the transformation from analytics to data products. And we’ll showcase a new product we’re building to help data teams around the world solve exactly this problem!

Comparing Airflow Executors and Custom Environments

2024-07-01
session

With recent work in the direction of executor decoupling and growing interest in hybrid execution, we find it is still quite common for Airflow users to rely on old rules of thumb like “don’t use Airflow with LocalExecutor in production” or “if your scheduler lags, split your DAGs over two separate Airflow clusters”. In our talk, we will present a deep-dive comparison of the various execution models Airflow supports and hopefully update the understanding of their efficiency and limitations.

Connecting the Dots in Airflow: From User to Contributor

2024-07-01
session

“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.

Converting Legacy Schedulers to Airflow

2024-07-01
session
Fritz Davenport (Astronomer)

Having helped many customers migrate thousands of workloads, we will discuss the migration process and how we built an open-source framework that converts legacy scheduler workflows to Airflow projects via standard sets of patterns. This framework is easily extended to encompass schedulers such as Automic, AutoSys, Oozie, JAMS, SSIS, and others, and has turned a difficult process requiring months or years into a simple one taking days or weeks.

Customizing LLMs: Leveraging Technology to tailor GenAI using Airflow

2024-07-01
session

Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects. UTBMS Code Prediction: leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy; more details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction. Bill Creation and Narrative Generation: utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries. Additionally, we will discuss how we use Airflow for model management in these AI projects: daily model retraining (we retrain our models daily); model (re)deployment (our Airflow DAG evaluates model performance and redeploys the model if improvements are detected); and cost management (to avoid the high costs of frequently querying large language models, our DAG uses RAG to summarize each day’s activities into a billable timesheet at day’s end).
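The retrain-evaluate-redeploy loop described above could look roughly like the following TaskFlow sketch; the model identifiers, evaluation logic, and deployment step are hypothetical placeholders, not Laurel's implementation.

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def daily_model_management():

        @task
        def retrain():
            # Placeholder: train a new model version and return its identifier.
            return "model-candidate"

        @task.branch
        def evaluate(model_id: str):
            # Placeholder: compare the candidate against the production model.
            improved = True  # e.g. candidate_accuracy > production_accuracy
            return "deploy" if improved else "keep_current"

        @task
        def deploy(model_id: str):
            # Placeholder: promote the candidate model to production.
            print(f"deploying {model_id}")

        @task
        def keep_current():
            print("candidate did not improve; keeping the current deployment")

        model_id = retrain()
        evaluate(model_id) >> [deploy(model_id), keep_current()]

    daily_model_management()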

DAGify - Enterprise Scheduler Migration Accelerator for Airflow

2024-07-01
session

DAGify is a highly extensible, template-driven enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs. DAGify is an open-source tool under the Apache 2.0 license and is available on GitHub (https://github.com/GoogleCloudPlatform/dagify). In this session we will introduce DAGify and its use cases, and demo its functionality by converting Control-M XML files to Airflow DAGs. Additionally, we will highlight DAGify’s “no-code” extensibility by creating custom conversion templates that map Control-M functionality to Airflow operators.

Data Orchestration for Emerging Technology Analysis

2024-07-01
session

The Center for Security and Emerging Technology (CSET) is a think tank at Georgetown University that studies the security implications of emerging technologies, including data-driven analyses across bibliometric, patenting, and investment datasets. This talk will describe CSET’s data infrastructure, which uses Airflow to orchestrate data ingestion, model deployment, web scraping, and manual data curation pipelines. We’ll also discuss how outputs from these pipelines are integrated into public-facing web applications and written reports, and some lessons learned from building and maintaining data pipelines on a data team with a diverse skill set.

dbt-Core & Airflow 101: Building Data Pipelines Demystified

2024-07-01
session

dbt has become the de facto standard for data teams building reliable and trustworthy SQL code on a modern data stack architecture. The dbt logic needs to be orchestrated and jobs scheduled to meet business expectations. That’s where Airflow comes into play. In this quick introduction session, you will learn how to leverage dbt-Core and Airflow to orchestrate pipelines, write DAGs in a Pythonic way, and apply best practices to your jobs.
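A minimal, hedged example of the kind of pipeline such a session typically builds: a DAG that invokes dbt-Core through shell commands. The project path and profile location are illustrative assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    DBT_DIR = "/usr/local/airflow/dbt/my_project"  # illustrative project location

    with DAG(
        dag_id="dbt_core_101",
        schedule="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ):
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=f"cd {DBT_DIR} && dbt run --profiles-dir .",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command=f"cd {DBT_DIR} && dbt test --profiles-dir .",
        )
        dbt_run >> dbt_test  # test the models only after they have been built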

Elevating Machine Learning Deployment: Unleashing the Power of Airflow in Wix's ML Platform

2024-07-01
session

In his presentation, Elad will provide a novel take on Airflow, highlighting its versatility beyond conventional use for scheduled pipelines. He’ll discuss its application as an on-demand tool for initiating and halting jobs, mainly in Data Science use cases such as dataset enrichment and batch prediction via API calls, complete with real-time status tracking and alerts. The talk aims to encourage a fresh approach to Airflow utilization, but will also delve into the technical aspects of implementing DAG triggering and cancellation logic. The audience will learn about: a real-life use case of leveraging Airflow capabilities beyond traditional pipeline scheduling, with its innovative integration as the infrastructure for an ML platform; triggering on-demand DAGs through the API; cancelling running DAGs; a demonstration of an end-to-end ML pipeline utilizing AWS SageMaker for batch predictions; and some more Airflow best practices. Join us to learn from Wix’s experience and best practices!
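Triggering and stopping DAGs on demand can be done through Airflow's stable REST API. Below is a minimal sketch, assuming basic authentication is enabled; the host, credentials, DAG id, and conf payload are placeholders, and setting the run's state to failed is shown as one way to cancel it.

    import requests

    AIRFLOW_HOST = "http://airflow.example.com:8080"  # placeholder
    AUTH = ("api_user", "api_password")               # placeholder credentials

    # Trigger a DAG run on demand (Airflow 2.x stable REST API).
    resp = requests.post(
        f"{AIRFLOW_HOST}/api/v1/dags/batch_prediction/dagRuns",
        auth=AUTH,
        json={"conf": {"dataset": "customer_events"}},
    )
    resp.raise_for_status()
    run_id = resp.json()["dag_run_id"]

    # Stop the run by marking it as failed through the same API.
    requests.patch(
        f"{AIRFLOW_HOST}/api/v1/dags/batch_prediction/dagRuns/{run_id}",
        auth=AUTH,
        json={"state": "failed"},
    )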

Empowering Airflow Users: A framework for performance testing and transparent resource optimization

2024-07-01
session

Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations. Attendees will learn: The motivation behind developing a standardized performance testing approach. Key design considerations and challenges in measuring performance across diverse Airflow environments. How to leverage the framework to construct test suites for different use cases (e.g., version comparison). Practical tips for interpreting performance test results and making informed decisions about resource allocation. How this framework contributes to greater transparency in Airflow release notes, empowering users with performance data.

Empowering Business Analysts with DAG Authoring IDE Running 8000 Workflows

2024-07-01
session

At Wix, more often than not, business analysts build workflows themselves to avoid data engineers becoming a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer may be to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to create thousands of DAGs easily. To bridge this gap we built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the “schedule” button. During the talk we will go through the problems of building a reliable and extendable DAG-generating tool, why we preferred Airflow over Apache Oozie, and the tricks (sharding, HA mode, etc.) that allow Airflow to run 8,000 active DAGs on a single cluster in Kubernetes.

Empowering More Teams in your Organization to Self-service their Airflow Needs

2024-07-01
session

Does your organization feel like the responsibility to write Airflow DAGs, handle Airflow infrastructure administration, debug failing tasks, and keep up with new features and best practices is too much for too few people? Perhaps you only have one data team that owns all of that, or you have too many teams with too many permissions on other teams’ DAGs. The topic of this talk is how Rakuten Kobo enables self-service for various teams within its organization to build their own DAGs in Airflow. The talk will cover how we delineate the Airflow responsibilities of various teams, how we build guard rails for new Airflow developers, how different teams automatically get the permissions required for their “own” DAGs (but not others’), the unique responsibilities of the Operations and Data Engineering teams, and how it is all done in a scalable manner. Maybe you’ll be inspired to make changes in your own organization, or have some tips of your own to share! Depending on questions, we could discuss some of the technical details as well.

Event-driven Data Pipelines with Apache Airflow

2024-07-01
session

Airflow is all about schedules: we use cron strings and Timetables to define schedules, and the Airflow Scheduler component manages those timetables, and a lot more, to ensure that DAGs and tasks run on those schedules. But what do you do if your data isn’t available on a schedule? What if data is coming from many sources, at varying times, and your job is to make sure it’s all as up to date as possible? An event-driven data pipeline may be the answer. An event-driven architecture (EDA) is an architecture pattern that uses events to decouple an application’s components. It relies on external events, not an internal schedule, to create loosely coupled data pipelines that determine when to take action and what actions to take. In this session, we will discuss the design considerations when using Airflow in an EDA and the tools Airflow has to make this happen, including Datasets, the REST API, Dynamic Task Mapping, custom Timetables, Sensors, and queues.
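To illustrate one of those tools, a minimal sketch of Dataset-driven scheduling (available since Airflow 2.4): a producer DAG declares a Dataset as an outlet, and a consumer DAG is scheduled on it. The URI and task bodies are illustrative.

    from datetime import datetime

    from airflow import DAG, Dataset
    from airflow.operators.python import PythonOperator

    orders = Dataset("s3://data-lake/orders.parquet")  # illustrative URI

    with DAG(
        dag_id="ingest_orders",
        schedule="@hourly",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ):
        PythonOperator(
            task_id="load_orders",
            python_callable=lambda: print("orders loaded"),
            outlets=[orders],  # marks the Dataset as updated when the task succeeds
        )

    with DAG(
        dag_id="build_orders_report",
        schedule=[orders],  # runs whenever the Dataset is updated, not on a clock
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ):
        PythonOperator(
            task_id="refresh_report",
            python_callable=lambda: print("report refreshed"),
        )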

Evolution of Airflow at Uber

2024-07-01
session

Up until a few years ago, teams at Uber used multiple data workflow systems, some based on open-source projects such as Apache Oozie, Apache Airflow, and Jenkins, while others were custom-built solutions written in Python and Clojure. Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system required additional maintenance and operational burden to keep it running, troubleshoot issues, fix bugs, and educate users. After evaluating these systems, and with the goal of converging on a single workflow system capable of supporting Uber’s scale, we settled on an Airflow-based system. The Airflow-based DSL provided the best trade-off of flexibility, expressiveness, and ease of use while being accessible to our broad range of users, which includes data scientists, developers, machine learning experts, and operations employees. This talk will focus on scaling Airflow to Uber’s scale and providing a no-code, seamless user experience.

Evolution of Orchestration at GoDaddy: A Journey from On-prem to Cloud-based Single Pane Model

2024-07-01
session

Explore the evolutionary journey of orchestration within GoDaddy, tracing its transformation from initial on-premise deployment to a robust cloud-based Apache Airflow orchestration model. This session will detail the pivotal shifts in design, organizational decisions, and governance that have streamlined GoDaddy’s Data Platform and enhanced overall governance. Attendees will gain insights valuable for optimizing Airflow deployments and simplifying complex orchestration processes, including a recap of the transformation journey and its impact on GoDaddy’s data operations, as well as future directions and ongoing improvements in orchestration at GoDaddy. This session will benefit attendees by providing a comprehensive case study on optimizing orchestration in a complex enterprise environment, emphasizing practical insights and scalable solutions.

Exploring DAG Design Patterns in Apache Airflow

2024-07-01
session

This talk delves into advanced Directed Acyclic Graph (DAG) design patterns that are pivotal for optimizing data pipeline management and boosting efficiency. We’ll cover dynamic DAG generation, which allows for flexible, scalable workflow creation based on real-time data and configurations. Learn about task grouping and SubDAGs to enhance readability and maintainability of complex workflows. We’ll also explore parameterized DAGs for injecting runtime parameters into tasks, enabling versatile and adaptable pipeline configurations. Additionally, the session will address branching and conditional execution to manage workflow paths dynamically based on data conditions or external triggers. Lastly, understand how to leverage parallelism and concurrency to maximize resource utilization and reduce execution times. This session is designed for intermediate to advanced users who are familiar with the basics of Airflow and looking to deepen their understanding of its more sophisticated capabilities. This session is crafted to be compelling by focusing on practical, high-impact design patterns that can significantly improve the performance and scalability of Airflow deployments.
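As a brief illustration of one of the patterns named above, dynamic task mapping, here is a hedged sketch in which the work to be done is discovered at runtime; the partition names and task bodies are placeholders.

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def dynamic_mapping_demo():

        @task
        def list_partitions():
            # Placeholder: discover work at runtime; this drives the mapping below.
            return ["2024-07-01", "2024-07-02", "2024-07-03"]

        @task
        def process(partition: str):
            return f"processed {partition}"

        @task
        def summarize(results):
            # Receives the output of every mapped `process` instance.
            print(list(results))

        # One `process` task instance is created per partition at run time.
        summarize(process.expand(partition=list_partitions()))

    dynamic_mapping_demo()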

From Oops to Ops: Smart Task Failure Diagnosis with OpenAI

2024-07-01
session

This session reveals an experimental venture integrating OpenAI’s AI technologies with Airflow, aimed at advancing error diagnosis. Through the application of AI, our objective is to deepen the understanding of issues, provide comprehensive insights into task failures, and suggest actionable solutions, thereby augmenting the resolution process. This method seeks to not only enhance diagnostic efficiency but also to equip data engineers with AI-informed recommendations. Participants will be guided through the integration journey, illustrating how AI can refine error analysis and potentially simplify troubleshooting workflows.
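A simplified, hedged sketch of the general idea: an on_failure_callback that sends the failing task's context to an OpenAI model and logs the suggestion. The prompt, model name, and client usage are illustrative assumptions, not the speakers' implementation.

    import logging

    from openai import OpenAI  # assumes the openai>=1.0 client and OPENAI_API_KEY set

    def diagnose_failure(context):
        """Airflow on_failure_callback: ask an LLM to explain a failed task."""
        ti = context["task_instance"]
        prompt = (
            f"Airflow task {ti.dag_id}.{ti.task_id} failed. "
            f"Exception: {context.get('exception')}. "
            "Suggest likely causes and concrete fixes."
        )
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        logging.getLogger(__name__).error(
            "AI diagnosis: %s", response.choices[0].message.content
        )

    # Attach it to a DAG, e.g. DAG(..., default_args={"on_failure_callback": diagnose_failure})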

From Tech Specs to Business Impact: How to Design A Truly End-to-End Airflow Project

2024-07-01
session

There are many Airflow tutorials. However, many don’t show the full process of sourcing, transforming, testing, alerting, documenting, and finally supplying data. This talk will go over how to piece together an end-to-end Airflow project that transforms raw data to be consumable by the business. It will include how various technologies can all be orchestrated by Airflow to satisfy the needs of analysts, engineers, and business stakeholders. The talk will be divided into the following sections: Introduction: introducing the business problem and how we came up with the solution design; Data sourcing: fetching and storing API data using basic operators and hooks; Transformation and testing: how to use dbt to build and test models based on the raw data; Alerting: notifying the necessary parties via Slack when any part of the DAG fails; Consumption: how to make dynamic data accessible to business stakeholders.
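For the alerting section, a minimal hedged sketch of a failure notification posted to a Slack incoming webhook from an on_failure_callback; the webhook URL is a placeholder, and the plain HTTP call stands in for whichever Slack integration the talk actually uses.

    import requests

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

    def notify_slack_on_failure(context):
        """Airflow on_failure_callback: post a short failure summary to Slack."""
        ti = context["task_instance"]
        message = (
            f":red_circle: Task {ti.dag_id}.{ti.task_id} failed.\n"
            f"Logs: {ti.log_url}"
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

    # Example: DAG(..., default_args={"on_failure_callback": notify_slack_on_failure})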

Gen AI using Airflow 3: A vision for Airflow RAGs

2024-07-01
session

Gen AI has taken the computing world by storm. As enterprises and startups have started to experiment with LLM applications, it has become clear that providing the right context to these LLM applications is critical. This process, known as retrieval-augmented generation (RAG), relies on adding custom data to the large language model so that the efficacy of the response can be improved. Processing custom data and integrating with enterprise applications is a strength of Apache Airflow. This talk goes into detail about a vision to enhance Apache Airflow to more intuitively support RAG, with additional capabilities and patterns. Specifically, these include: support for unstructured data sources such as text, extending to image, audio, video, and custom sensor data; LLM model invocation, including both external model services through APIs and local models using container invocation; automatic index refreshing, with a focus on unstructured data lifecycle management to avoid cumbersome and expensive index creation on vector databases; and templates for hallucination reduction via testing and scoping strategies.

Growing with Apache Airflow: A Providers Journey

2024-07-01
session

It has been nearly four years since the launch of Managed Workflows for Apache Airflow (MWAA) by AWS. It has gone through the trials and tribulations that come with any new idea: working with customers to better understand its shortcomings, building dedicated teams focused on scaling and growth, and, at its core, preserving the integrity and functionality of Apache Airflow. Initially launched with Airflow 1.10, MWAA is now available globally in multiple AWS regions, supporting the latest version of Airflow along with a multitude of features. In this talk, we will cover a bit of that history along with debunking a few myths surrounding the critical needs of users today. From compliance requirements and larger environments to observability and pricing, we will discuss how MWAA has evolved and continues to grow through its focus on customer value and, more importantly, its dedication to the Apache Airflow community.

Hello Quality: Building CIs to run Providers Packages System Tests

2024-07-01
session

Airflow operators are a core feature of Apache Airflow, and it is extremely important that we maintain high operator quality and prevent regressions. Automated test results also help developers double-check that introduced changes do not cause regressions or backward-incompatible changes, and they give Airflow release managers the information needed to decide whether a given provider version should be released yet. Recently, a new approach to assuring production quality was implemented for AWS-, Google-, and Astronomer-provided operators: standalone Continuous Integration processes were configured for them, and test-results dashboards show the results of the latest test runs. What has been working well for these providers might be a pattern for others to follow. During this presentation, AWS, Google, and Astronomer engineers will share information about the internals of the test dashboards implemented for AWS-, Google-, and Astronomer-provided operators. This approach might be a ‘blueprint’ for other providers to follow.

How Panasonic Leverages Airflow

2024-07-01
session

Panasonic uses various Airflow operators to perform daily routines and integrates Airflow with a range of technologies. Redis acts as a caching mechanism to optimize data retrieval and processing speed, enhancing overall pipeline performance. MySQL is used for storing metadata and managing task state information in Airflow’s backend database. Tableau integrates with Airflow to generate interactive visualizations and dashboards, providing valuable insights into the processed data. Amazon Redshift is leveraged for scalable data warehousing, seamlessly integrated with Airflow for data loading and analytics. Foundry is integrated with Airflow to access and process data stored within Foundry’s data platform, ensuring data consistency and reliability. Plotly dashboards are employed for creating custom, interactive web-based dashboards to visualize and analyze data processed through Airflow pipelines. GitLab CI/CD pipelines are used for version control and continuous integration/continuous deployment of Airflow DAGs (Directed Acyclic Graphs), ensuring efficient development and deployment of workflows.