Airflow Summit 2023 program
Sessions & talks
Showing 51–75 of 90 · Newest first
Flexible DAG Trigger Forms (AIP-50)
As Airflow users, we often use DagRun.conf attributes to control the content and flow of a DAG run. Previously, the Airflow UI only allowed a run to be launched by entering JSON. This was technically feasible but not user friendly: a user had to model, check, and understand the JSON and enter parameters manually, with no option to validate them before triggering. Much like Jenkins or GitHub/Azure pipelines, we want to be able to trigger a run from a UI form by specifying parameters. With Airflow 2.6.0, DAG.params are now used to render an entry form, and with a few options a user-friendly trigger UI can be implemented. This session shows how the new feature works and provides some examples of how to use it for your own purposes.
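A minimal sketch (not from the session) of the mechanism described above: DAG.params declared with Param objects are rendered by Airflow 2.6+ as a validated trigger form. The parameter names and defaults are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.models.param import Param


@dag(
    start_date=datetime(2023, 1, 1),
    schedule=None,  # triggered manually, via the UI form
    params={
        # Each Param becomes a validated field in the trigger form.
        "environment": Param("staging", type="string", enum=["staging", "prod"]),
        "batch_size": Param(100, type="integer", minimum=1, maximum=1000),
    },
)
def trigger_form_demo():
    @task
    def report(**context):
        # Values chosen in the form arrive through the params context variable.
        print(context["params"]["environment"], context["params"]["batch_size"])

    report()


trigger_form_demo()
```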
Forging the Future: Five years of fabricating with Airflow
As a data engineer, I’ve used Airflow extensively over the last 5 years: across 3 jobs, several different roles; for side projects, for critical infrastructure; for manually triggered jobs, for automated workflows; for IT (Ookla/Speedtest.net), for science (Allen Institute for Cell Science), for the commons (Openverse), for liberation (Orca Collective). Authoring a DAG has changed dramatically since 2018, thanks to improvements in both Airflow and the Python language. In this session, we’ll take a trip back in time to see how DAGs looked several years ago, and what the same DAGs might look like now. We’ll appreciate the many improvements that have been made towards simplifying workflow construction. I’ll also discuss the significant advancements that have been made around deploying Airflow. Lastly, I’ll give a brief overview of different use cases and ways I’ve seen Airflow leveraged.
How to submit an Issue for the community to fix
To ensure a quality product, the Airflow community relies on bug reports from Airflow users. Oftentimes bug reports are incomplete or fail to include the steps needed to reproduce the observed bug. This workshop will present an example of the bug-to-issue process: how to rule out non-Airflow issues and, once an Airflow issue is suspected, how to submit an issue for the community to see. It will also cover how the community picks up an issue and eventually fixes it in a future release. Issue reporting is key to improving Airflow, and the community will benefit from an easily digestible guide on how best to do so.
From Pain Points to Best Practices: Enhancing Airflow migrations and local development at Wix.com
Are you tired of spending hours on Airflow migrations and wondering how to make them more accessible? Would you like to be able to test your code on different Airflow versions? Or are you struggling to set up a reliable local development environment? These are some of the top pain points for data engineers working with Airflow. But fear not – Wix Data Engineering has some best practices to share that will make your life easier. What the audience will learn: how Wix Data Engineering makes Airflow migrations easier and less painful; how to ensure data engineers' code is forward-compatible with the latest Airflow version; how to test code on different Airflow versions; how to maintain a stable local environment for data engineers while speeding up their dev velocity; and a few more must-know best practices from the framework team.
We are continuing to modernize the Airflow UI to make it easier to manage all aspects of your DAGs. See a demo of the latest updates and improve your workflows with new tips and tricks, then get a preview of what else is coming soon. We'll follow up with a Q&A where people can field their own use cases and explore new ideas for improving the user experience.
New to Airflow or haven't followed any of the recent DAG authoring enhancements? This talk is for you! We will go through various DAG authoring features such as setup/teardown tasks (~2.7), Datasets (2.4), dynamic task mapping (2.3), and deferrable "async" tasks (2.2). You won't be an expert after this short talk; however, you'll have a head start when you write your next DAG, no hacks required.
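As a taste of one of the features listed above, here is a minimal sketch (not from the talk) of dynamic task mapping, available since Airflow 2.3; the file names are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule=None)
def dynamic_mapping_demo():
    @task
    def list_files():
        # In a real pipeline this might list objects in a bucket.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(path: str):
        print(f"processing {path}")

    # One mapped task instance is created per element, at runtime.
    process.expand(path=list_files())


dynamic_mapping_demo()
```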
How to Build a System Test Dashboard and Why You Should Do It
System tests are executable DAGs used for example and testing purposes. With a simple pytest command, you can run an entire DAG. From a provider's point of view, they can be viewed as integration tests for all of that provider's operators and sensors. Running these system tests frequently and monitoring the results allows us to enforce stability, among many other benefits. In this presentation we will explore how AWS built their system test environment, from the GitHub fork to the health dashboard that exists today… but more importantly, why you should do it as well!
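For orientation, here is a minimal sketch of what such a system test can look like, following the AIP-47 conventions used in the apache/airflow source tree (the tests.system.utils module exists only inside that repository; the operator and IDs below are illustrative).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_my_provider_system_test",
    start_date=datetime(2023, 1, 1),
    schedule="@once",
    catchup=False,
) as dag:
    # In a real provider system test this would exercise the provider's operators.
    exercise_operator = BashOperator(task_id="exercise_operator", bash_command="echo hello")

# This hook is what makes `pytest path/to/this_file.py` execute the whole DAG.
from tests.system.utils import get_test_run  # noqa: E402

test_run = get_test_run(dag)
```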
How to use Data Contracts for Data Quality in your Airflow Ecosystem
Data contracts have been much discussed in the community of late, with a lot of curiosity around how to approach the concept in practice. We believe data contracts need a harmonizing layer to manage data quality in a uniform manner across a fragmented stack. We are calling this harmonizing layer the Control Plane for Data - powered by the common thread across these systems: metadata. For teams already orchestrating pipelines with Airflow, data contracts can be an effective way to process data that meets preset quality standards. With a control plane as a connecting layer, producers can build data contracts that consumers can rely on, ensuring DAGs only run when a contract is valid. Producers can govern how workflows should behave, and consumers receive the tooling they need to opt into only high-quality data. Learn how to use data contracts and DataHub to make your Airflow pipelines more reliable - as well as other use cases that can help build a simpler, more flexible data stack.
Introducing airflowctl: A CLI to streamline getting started with Airflow
New users starting with Airflow frequently encounter several challenges, ranging from the complexities of containers and virtual environments to Python dependency hell. Moreover, their familiarity with tools such as Docker, docker-compose, and Helm might be somewhat limited - and those tools can even be overkill. Seasoned Airflow users encounter their own problems, including configuration conflicts with ongoing Airflow projects, intricacies stemming from Docker and docker-compose configurations, and a lack of visibility into all of their projects. With airflowctl, users can install and set up Airflow using a single command. Existing users can use it to manage multiple Airflow projects with different Airflow versions on the same machine, which makes creating and debugging DAGs in an IDE seamless. Agenda for the talk: Why airflowctl? Goals. Current functionality & demo. Vision / roadmap.
Lightning talks and event wrap-up
We will close Airflow Summit with lightning talks (5 minutes each). You will be able to sign up during the event. We will only have space for 10 talks.
Manifest destiny: Orchestrating dbt using Airflow
Airflow is a popular choice for organizations looking to integrate open-source dbt within their existing data infrastructure. This talk will explore two primary methods of running dbt in Airflow: job-level and model-level. We’ll discuss the tradeoffs associated with each approach, highlighting the simplicity and efficiency of job-level orchestration, contrasted with the enhanced observability and control provided by model-level orchestration. We’ll also explain how the balance has shifted in recent years, with improvements to dbt core making model-level more efficient and innovative Airflow extensions like Cosmos making it easier to implement. Finally, we’ll provide benchmarks to help you determine which paradigm is the best fit for your needs.
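To make the job-level end of that tradeoff concrete, here is a minimal sketch (not from the talk) that runs an entire dbt project as a single Airflow task; paths are illustrative. Model-level orchestration (for example with Cosmos) instead renders one Airflow task per dbt model, trading some overhead for per-model retries and observability.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_job_level",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Simple and efficient, but a failure anywhere means re-running everything,
    # and Airflow sees only one opaque task rather than one task per model.
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt/my_project --profiles-dir /opt/dbt",
    )
```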
Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are a difficult subject, and Airflow has its own, custom ways of managing them. Airflow has a rather complex system for managing dependencies in its CI system, but this talk is not about that. This talk is directed at users of Airflow who want to keep their dependencies updated, describing the ways they can do it. The presentation will explain how to effectively manage and handle custom dependencies in Airflow. Jarek will guide you through practical solutions and best practices to make your Airflow experience with dependencies - yes, you guessed it - a breeze.
Micropipelines: A microservice approach for DAG authoring using datasets
Introduced in Airflow 2.4, Datasets are a foundational feature for authoring modular data pipelines. As DAGs grow to cover a larger number of data sources and multiple data transformation steps, they typically become less predictable in the timeliness of their execution and less efficient. This talk focuses on leveraging Datasets to enable predictable and more efficient DAGs by borrowing patterns from microservice architectures. Just as large monolithic applications were decomposed into microservices to deliver more efficient scalability and faster development cycles, micropipelines have the same potential to radically transform data pipeline efficiency and development velocity. Using a simple financial analysis pipeline example, with data aggregation done in Snowflake and prediction analysis in Spark, this talk outlines how to retain the timeliness of data pipelines while the datasets involved keep growing.
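A minimal sketch (not from the talk) of the micropipeline idea: a producer DAG updates a Dataset and a consumer DAG is scheduled off it, so each piece stays small and independently schedulable. The URI and task bodies are illustrative.

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

daily_aggregates = Dataset("snowflake://analytics/daily_aggregates")


@dag(start_date=datetime(2023, 1, 1), schedule="@daily")
def produce_aggregates():
    @task(outlets=[daily_aggregates])
    def aggregate():
        ...  # e.g. run the Snowflake aggregation

    aggregate()


@dag(start_date=datetime(2023, 1, 1), schedule=[daily_aggregates])
def run_predictions():
    @task
    def predict():
        ...  # e.g. submit the Spark prediction job

    predict()


produce_aggregates()
run_predictions()
```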
Migrate Apache Oozie Workflows to Airflow and Run with Amazon EMR
Learn how to convert Oozie workflows into Airflow DAGs and run them on Amazon EMR. The utility supports Airflow 2.4.3 and is built on top of https://github.com/GoogleCloudPlatform/oozie-to-airflow
Migrating from Enterprise Schedulers like Autosys, TIDAL, and Stonebranch to Airflow
How we migrated from Autosys - with thousands of jobs, 800+ dependencies, and SLAs to be met every hour - at a prominent Canadian bank. The case for migrating from an enterprise scheduler: the money spent on every license and renewal; SLA, monitoring, auditing, and DevOps integration; vendor lock-in; and integration with multiple providers.
This session is about the current state of implementation of Airflow's multi-tenancy feature. This is a long-term effort that involves multiple changes and separate AIPs to implement, with the long-term vision of a single Airflow instance supporting multiple, independent teams - either from the same company or as part of an Airflow-as-a-Service implementation.
My Journey to Committer Status: What I learned and how it can help you
Apache Airflow is one of the largest Apache projects by many metrics, but it ranks particularly high in the number of contributors involved in the project. This leads to hundreds of GitHub issues, pull requests, and discussions being submitted to the project every month, so it is critical to have an ample number of committers to support the community. In this talk I will summarize my personal experience working towards, and ultimately achieving, committer status in Apache Airflow. I'll cover the lessons I learned along the way as well as provide some advice and best practices to help others achieve committer status themselves.
Nurturing an Open Source Community is Like Tending a Garden
Nurturing a healthy open source community is hard work and requires discipline. There’s a lot of inherent friction in a distributed community that makes it difficult for the many participants to collaborate. It is challenging to contribute something when you’re new to a community. It is overwhelming for maintainers to give enough attention to contributors when a project becomes popular. In this talk, I’ll go over the common pitfalls of open source communities and will go over practices that contribute to keeping your community healthy.
OpenLineage in Airflow: A Comprehensive Guide
With native support for OpenLineage in Airflow, users can now observe and manage their data pipelines with ease. This talk will cover the benefits of using OpenLineage, how it is implemented in Airflow, practical examples of how to take advantage of it, and what's on our roadmap. Whether you're an Airflow user or a provider maintainer, this session will give you the knowledge to make the most of this tool.
We’ve heard a lot in the last few years about insecurity in the open source software ecosystem, whether it be vulnerabilities, supply chain attacks or malware. Has open source become suddenly fraught with security problems? Or is it maybe, possibly… actually doing great? Let’s delve into the collaborative nature of our open-source ecosystems, and explore how transparency, peer review, and community have created a robust security posture. We’ll examine real-world examples, dispel myths, and reveal the inherent strengths of open source in fostering a secure and resilient software ecosystem.
Operators form the core of the language of Airflow. In this talk I will argue that while they have served their purpose, they are holding back the development of Airflow, and that if Airflow wants to stay relevant in the world of the 'new' data stack (hint: it isn't currently considered to be part of it) and the self-service data mesh, it needs to kill its darling.
Open source doc edits provide a low-stakes way for new users to make their first contribution. Ideally, new users find opportunities and feel welcome to fix docs as they learn, engaging with the community from the start. But I found that contributing docs to Airflow had some surprising obstacles. In this talk, I'll share my first docs contribution journey, including problems and fixes. For example, you must understand how Airflow uses Sphinx and know when to edit in the GitHub UI versus locally. It wasn't documented that GitHub renders previews only for Markdown; since Sphinx uses its own markup, you must build the docs locally to check formatting - an opportunity for me to add to the Contributor Guide for docs. In addition to examples of reducing obstacles, this talk covers the importance of docs for community and the resources available to start writing. If you already contribute and want to create opportunities for others, I'll also share the characteristics of good first issues and docs projects.
Platform for Genomic Processing Using Airflow and ECS
High-scale orchestration of genomic algorithms using Airflow workflows, AWS Elastic Container Service (ECS), and Docker. Genomic algorithms are highly demanding of CPU, RAM, and storage, and our data science team requires a platform that facilitates the development and validation of proprietary algorithms. The data engineering team develops a research data platform that enables data scientists to publish Docker images to AWS ECR and run them using Airflow DAGs that provision ECS compute on EC2 and Fargate. We will describe a research platform that allows our data science team to check their algorithms on ~1000 cases in parallel using the Airflow UI and dynamic DAG generation, utilizing EC2 machines, auto-scaling groups, and ECS clusters across multiple AWS regions.
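A minimal sketch (not from the talk) of the pattern: an Airflow task that launches a containerized algorithm on ECS via the Amazon provider's EcsRunTaskOperator. Cluster, task definition, and container names are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="genomic_case_processing",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_algorithm = EcsRunTaskOperator(
        task_id="run_algorithm",
        cluster="research-ecs-cluster",
        task_definition="genomic-algorithm:1",
        launch_type="EC2",
        overrides={
            "containerOverrides": [
                {
                    "name": "algorithm",
                    # The case to process could come from dag_run.conf or dynamic task mapping.
                    "command": ["--case-id", "{{ dag_run.conf.get('case_id', 'demo') }}"],
                }
            ]
        },
    )
```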
Reducing Costs by Maximizing Airflow and DAG Performance
Airflow DAGs are Python code (which can do pretty much anything you want) and Airflow has hundreds of configuration options (which can dramatically change Airflow's behavior). Those two facts lead to endless combinations that can run the same workloads, but only a precious few are efficient; the rest result in failed tasks and excessive compute usage, costing time and money. This talk will demonstrate how small changes can yield big dividends, and reveal some code improvements and Airflow configurations that can reduce costs and maximize performance.
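For a flavor of the kind of small, cheap changes involved, here is a minimal sketch (not from the talk) of DAG-level settings that commonly affect cost and throughput; the values are illustrative, not recommendations.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="tuned_dag",
    start_date=datetime(2023, 1, 1),
    schedule="@hourly",
    catchup=False,       # avoid surprise backfills (and their compute bill)
    max_active_runs=1,   # don't let runs pile up and compete for slots
    max_active_tasks=8,  # cap this DAG's concurrency
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    EmptyOperator(task_id="placeholder")
```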