Keynote: Airflow then and now
Bolke and Maxime tell us about the past and present of Apache Airflow.
Airflow Summit 2020 program
Sessions & talks
A team of core committers explains what is coming in Airflow 2.0.
In this talk, colleagues from Airbnb, Twitter and Lyft share details about how they are using Apache Airflow to power their data pipelines. Slides: Tao Feng (Lyft), Dan Davydov (Twitter)
Gris Cuevas shares some statistics about the state of D&I at the Apache Software Foundation, as well as the initiatives the foundation is taking to make projects more diverse and inclusive. Then, Aizhamal shares her own journey of becoming an open source contributor and dives into project-specific initiatives that help Apache Airflow be one of the most sustainable projects in open source.
As the field of data science grows in popularity, companies find themselves in need of a single common language that can connect their data science teams and data infrastructure teams. Data scientists want rapid iteration, infrastructure engineers want monitoring and security controls, and product owners want their solutions deployed in time for quarterly reports. This talk discusses how to build an Airflow-based data platform that can take advantage of popular ML tools (Jupyter, Tensorflow, Spark) while creating an easy-to-manage and easy-to-monitor ecosystem for data infrastructure and support teams. In this talk, we will take an idea from a single-machine Jupyter Notebook to a cross-service Spark + Tensorflow pipeline, to a canary-tested, production-ready model served on Google Cloud Functions. We will show how Apache Airflow can connect all layers of a data team to deliver rapid results.
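As a rough sketch of the kind of pipeline this talk describes, a DAG could chain notebook execution, model training and a canary deployment. The task names and plain PythonOperator callables below are placeholders rather than the tooling actually used in the talk:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


# Hypothetical callables standing in for the real notebook, training,
# and deployment logic discussed in the talk.
def run_notebook(**_):
    print("execute a parameterized Jupyter notebook")


def train_model(**_):
    print("launch a Spark + Tensorflow training job")


def canary_deploy(**_):
    print("deploy the model to Google Cloud Functions behind a canary")


with DAG(
    dag_id="ml_platform_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    notebook = PythonOperator(task_id="run_notebook", python_callable=run_notebook)
    training = PythonOperator(task_id="train_model", python_callable=train_model)
    deploy = PythonOperator(task_id="canary_deploy", python_callable=canary_deploy)

    notebook >> training >> deploy
```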
At Nielsen Identity Engine, we use Spark to process tens of terabytes of data. Our ETLs, orchestrated by Airflow, spin up AWS EMR clusters with thousands of nodes per day. In this talk, we'll guide you through migrating Spark workloads to Kubernetes with minimal changes to Airflow DAGs, using the open-sourced GCP Spark-on-K8s operator and the native integration we recently contributed to the Airflow project.
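For context, a minimal sketch of that native integration as it is exposed in the cncf.kubernetes provider package: a task submits a SparkApplication manifest for the GCP Spark-on-K8s operator running in the cluster, and a sensor waits for it to complete. The namespace and manifest file name here are placeholders, not details from the talk:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import (
    SparkKubernetesSensor,
)

with DAG(
    dag_id="spark_on_k8s_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    # Submits a SparkApplication custom resource defined in a YAML manifest.
    submit = SparkKubernetesOperator(
        task_id="spark_submit",
        namespace="spark-jobs",             # placeholder namespace
        application_file="spark_etl.yaml",  # placeholder SparkApplication manifest
        do_xcom_push=True,
    )

    # Waits until the SparkApplication reaches a terminal state.
    monitor = SparkKubernetesSensor(
        task_id="spark_monitor",
        namespace="spark-jobs",
        application_name="{{ task_instance.xcom_pull(task_ids='spark_submit')['metadata']['name'] }}",
    )

    submit >> monitor
```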
In this talk, Anita showcases how to use the newly released Airflow Backport Providers. Some of the topics covered are: how to install them in Airflow 1.10.x, how to install them in Composer, how to migrate one or more DAGs from the legacy to the new providers, and known bugs and fixes.
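As a quick illustration of the migration the talk describes (the Google backport package is used as the example; the DAG, task and query are placeholders), installing a backport package on Airflow 1.10.x lets a DAG switch from the legacy contrib import path to the Airflow 2.0-style providers path:

```python
# On an Airflow 1.10.x environment (or via Cloud Composer's PyPI packages list):
#   pip install apache-airflow-backport-providers-google

from datetime import datetime

from airflow import DAG
# Legacy import, shown only as a comment for comparison:
#   from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryExecuteQueryOperator,
)

with DAG(
    dag_id="backport_providers_example",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    # Same task as before, now using the provider operator.
    run_query = BigQueryExecuteQueryOperator(
        task_id="run_query",
        sql="SELECT 1",  # placeholder query
        use_legacy_sql=False,
    )
```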
How do you create fast and painless delivery of new DAGs into production? When running Airflow at scale, it becomes a big challenge to manage the full lifecycle around your pipelines; making sure that DAGs are easy to develop, test, and ship into prod. In this talk, we will cover our suggested approach to building a proper CI/CD cycle that ensures the quality and fast delivery of production pipelines. CI/CD is the practice of delivering software from dev to prod, optimized for fast iteration and quality control. In the data engineering context, DAGs are just another piece of software that require some form of lifecycle management. Traditionally, DAGs have been thought of as relatively static, but the new wave of analytics and machine learning efforts require more agile DAG development, in line with how agile software engineering teams build and ship code. In this session, we will dive into the challenges of building CI/CD cycles for Airflow DAGs. We will focus on a pipeline that involves Apache Spark as an extra dimension of real-world complexity, walking through a typical flow of DAG authoring, debugging, and testing, from local to staging to prod environments. We will offer best practices and discuss open-source tools you can use to easily build your own smooth cycle for Airflow CI/CD.
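One common building block for such a CI/CD cycle is a DAG integrity test that runs in CI before DAGs are shipped. A minimal sketch, assuming the DAGs live in a dags/ folder and pytest is the test runner (both assumptions, not details from the talk):

```python
# test_dag_integrity.py (hypothetical file name) - fails the CI build if any
# DAG in the repository cannot be imported.
from airflow.models import DagBag


def test_dags_load_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    # Syntax errors, missing imports and broken dependencies show up here.
    assert dag_bag.import_errors == {}, f"DAG import failures: {dag_bag.import_errors}"


def test_every_dag_has_an_owner():
    # Example of an additional policy check a team might enforce in CI.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} has no owner set"
```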
This talk will guide you through the internals of the official Production Docker Image of Airflow. It will show you the foreseen use cases for it and how to use it in conjunction with the Official Helm Chart to make your own deployments.
In the contemporary world, security is more important than ever - and Airflow installations are no exception. Google Cloud Platform and Cloud Composer offer useful security options for running your DAGs and tasks in a way that lets you effectively manage the risk of data exfiltration and limit access to the system. This is a sponsored talk, presented by Google Cloud.
In this talk, we share the lessons learned while building a scheduler-as-a-service leveraging Apache Airflow to achieve improved stability and security for one of the largest gaming companies. The platform integrates with different data sources and meets varied SLAs across workflows owned by multiple game studios. In particular, we present a comprehensive self-serve Airflow architecture with multi-tenancy, auto-DAG generation, and SSO integration, with improved ease of deployment. Within Electronic Arts, to provide scheduler-as-a-service and to support hundreds of thousands of execution workflows, each team requires an isolated environment with access to a central data lake containing several petabytes of anonymized player and game metrics. Leveraging Airflow, each team is provided a private code repository and namespace with which they can deploy their DAGs at their own pace. To support agile development cycles, a private testing sandbox and auto-deployment to an isolated multi-tenant Airflow platform have been made available to game studios. In production, a single Dockerized Airflow deployment on Kubernetes is used to ensure high availability and single-step deployment. Custom SSO integration and RBAC-based operator and sensor whitelisting allow for secure logical isolation. In addition, dynamic DAG instantiation capability helps address varied SLAs during game launch seasons that are staggered throughout the financial year.
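The auto-DAG generation mentioned above is commonly implemented with a DAG factory registered into the module namespace; a minimal sketch, with hypothetical per-studio configuration and a placeholder task (not the actual Electronic Arts implementation), could look like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical per-studio configuration; in a real setup this would likely be
# loaded from a config store or each team's own repository.
STUDIO_CONFIGS = {
    "studio_a": {"schedule": "@hourly"},
    "studio_b": {"schedule": "@daily"},
}


def create_dag(studio, config):
    dag = DAG(
        dag_id=f"{studio}_pipeline",
        start_date=datetime(2020, 1, 1),
        schedule_interval=config["schedule"],
        catchup=False,
    )
    with dag:
        DummyOperator(task_id="placeholder_task")
    return dag


# Register each generated DAG in the module namespace so the scheduler's
# DAG parser picks it up.
for studio, config in STUDIO_CONFIGS.items():
    globals()[f"{studio}_pipeline"] = create_dag(studio, config)
```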
Scribd is migrating its data pipeline from an in-house system to Airflow. It is one giant data pipeline consisting of more than 1,500 tasks. In this talk, I would like to share a couple of best practices on setting up a cloud-native Airflow deployment in AWS. For those who are interested in migrating a non-trivial data pipeline to Airflow, I will also share how Scribd plans and executes the migration. Here are some of the topics that will be covered: how to set up a highly available Airflow cluster in AWS using both ECS and EKS with Terraform; how to manage Airflow DAGs across multiple git repositories; how we manage Airflow variables using a custom Airflow Terraform provider; best practices on monitoring multiple Airflow clusters with Datadog and PagerDuty; how we extended Airflow to reach feature parity with Scribd's in-house orchestration system; how to plan and execute non-trivial data pipeline migrations (we transcompiled our internal DSL to Airflow DAGs to simulate what a real run would look like and surface performance issues early in the process); and how we fixed an Airflow performance bottleneck so our giant DAG can be properly rendered in the web UI. For detailed deep dives on some of the topics mentioned above, please check out our blog post series at https://tech.scribd.com/tag/airflow-series/ Slides: https://docs.google.com/presentation/d/e/2PACX-1vRb-iH5NX2d7m-rQ7WGc6XlRvRCADwXq2hdjRjRuJ5h7e9ybfoUA13ytxpHgx7JG815fIKEE-QKuRUV/pub?start=false&loop=false&delayms=3000
How do you ensure your workflows work before deploying to production? In this talk I'll go over various ways to ensure your code works as intended - both on a task and a DAG level. I cover: how to test and debug tasks locally, how to test with and without task instance context, how to test against external systems (e.g. how to test a PostgresOperator), and how to test the integration of multiple tasks to ensure they work nicely together.
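As a small example of the task-level testing this talk covers, an operator can be executed directly in a pytest test without a scheduler or metadata database; the BashOperator and file-writing command below are only an illustration, not taken from the talk:

```python
from airflow.operators.bash_operator import BashOperator


def test_bash_operator_writes_file(tmp_path):
    """Task-level test: run a single operator's execute() in isolation."""
    out_file = tmp_path / "out.txt"
    task = BashOperator(
        task_id="write_file",
        bash_command=f"echo hello > {out_file}",
    )
    # Operators that don't rely on the task instance context can be
    # executed with an empty context dict.
    task.execute(context={})
    assert out_file.read_text().strip() == "hello"
```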
In this talk we review how Airflow helped create a tool to detect data anomalies. Leveraging Airflow for process management, database interoperability, and authentication created an easy path forward to achieve scale, decrease development time, and pass security audits. While Airflow is generally looked at as a solution to manage data pipelines, integrating tools with Airflow can also speed up development of those tools. The Data Anomaly Detector was created at One Medical to scan thousands of metrics per day for data anomalies. It's a complicated tool, and much of that complexity was outsourced to Airflow. Because the data infrastructure at One Medical was already built around Airflow, and Airflow had many desirable features, it made sense to build the tool to integrate closely with Airflow. The end result was that more time could be spent on building features for statistical analysis, and less effort had to be spent on database authentication, interoperability, or process management. It's an interesting example of how Airflow can be leveraged to build data-intensive tools.