
Event

Airflow Summit 2020

2020-07-01 · Airflow Summit

Activities tracked

42

Airflow Summit 2020 program

Sessions & talks

Showing 1–25 of 42 · Newest first


Achieving Airflow Observability

2020-07-01
session

Identify issues in a fraction of the time and streamline root cause analysis for your DAGs. Airflow is the leading orchestration platform for data engineers. But when running Airflow at production scale, many teams have bigger needs for monitoring jobs, creating the right level of alerting, tracking problems in data, and finding the root cause of errors. In this talk we will cover our suggested approach to gaining Airflow observability so that you have the visibility you need to be productive. What is observability? It is the capability of monitoring and analyzing event logs, along with KPIs and other data, in a way that yields actionable insights. In the data engineering context, observability is crucial for finding problems in jobs and data before those problems impact data consumers downstream. It’s a particularly difficult challenge because of the different platforms data engineers use (Airflow, Spark, Kubernetes, etc.) and the complicated life cycle of data pipeline CI/CD. In the session, we will do a deep dive into the visibility gaps your team might face running production-scale Airflow. We will walk through a typical day in the life of finding errors in DAGs, offer best practices, and discuss open source tools you can use to extend Airflow for observability and robust monitoring. We will use standard Airflow DAG examples to guide the presentation.
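
As general background for this topic (not the speakers’ own tooling), two Airflow building blocks commonly used for this kind of alerting are task failure callbacks and SLAs. A minimal sketch, with the notification function left as a placeholder:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator


def notify_on_failure(context):
    # Placeholder: push the failing task and execution date to your alerting channel.
    task_id = context["task_instance"].task_id
    print(f"Task failed: {task_id} at {context['execution_date']}")


default_args = {
    "owner": "data-eng",                       # assumed owner
    "on_failure_callback": notify_on_failure,  # called whenever a task fails
    "sla": timedelta(hours=1),                 # flag tasks that miss their SLA
}

with DAG(
    dag_id="observable_dag",
    default_args=default_args,
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
```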

Achieving Airflow observability with Databand

2020-07-01
session
Josh Benamram (Databand.ai)

While Airflow is a central product for data engineering teams, it’s usually one piece of a bigger puzzle. The vast majority of teams use Airflow in combination with other tools like Spark, Snowflake, and BigQuery. Making sure pipelines are reliable, detecting issues that lead to SLA misses, and identifying data quality problems requires deep visibility into DAGs and data flows. Join this session to learn how Databand’s observability system makes it easy to monitor your end-to-end pipeline health and quickly remediate issues. This is a sponsored talk, presented by Databand.

Adding an executor to Airflow: A contributor overflow exception

2020-07-01
session
Vanessa Sochat (Stanford University Research Computing Center)

Engaging with a new community is a common experience in OSS development. There are usually expectations held by the project about the contributor’s exposure to the community, and by the contributor about interactions with the community. When these expectations are misaligned, the process is strained. In this talk Vanessa discusses a real-life experience that required communication, persistence, and patience to ultimately lead to a positive outcome. Slides

Advanced Apache Superset for Data Engineers

2020-07-01
session

Superset is the leading open source data exploration and visualization platform. In this talk, we’ll be presenting Superset with a focus on advanced topics that are most relevant to Data Engineers. The presentation will be largely a live demo of the product, with a deeper dive into advanced topics for Data Engineers. This is a sponsored talk, presented by Preset.

AIP-31: Airflow functional DAG definition

2020-07-01
session
Gerard Casas Saez (Twitter)

Airflow does not currently have an explicit way to declare messages passed between tasks in a DAG. XComs are available, but they are hidden inside the operators’ execution functions. AIP-31 proposes a way to make this message passing explicit in the DAG file and to make it easier to reason about your DAG’s behaviour. In this talk, we will explore what other DSLs do for message passing and how that has influenced AIP-31. We will explore the motivations behind explicit message passing, as well as further proposals that can be built on top of it. In addition, we will explore a new way to define custom Python transformations using the proposed task decorator, and how this change may improve the extensibility of Airflow for more experimental ETL use cases.
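
As a rough illustration of the functional style AIP-31 describes, here is a minimal sketch using the task decorator as it later shipped in Airflow 2.0; the DAG and task names are invented for the example:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2020, 7, 1), catchup=False)
def functional_etl():
    @task
    def extract():
        # The return value travels downstream via XCom, but the hand-off is
        # declared right here in the DAG file instead of inside execute().
        return {"rows": 42}

    @task
    def load(payload: dict):
        print(f"loading {payload['rows']} rows")

    load(extract())


example_dag = functional_etl()
```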

Airflow: A beast character in the gaming world

2020-07-01
session
Naresh Yegireddi (PlayStation), Patricio Garza (PlayStation)

A pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, more than 100 million PS4 consoles sold, and thousands of game development partners across the globe, big-data problems are inevitable. This presentation talks about how we scaled Airflow horizontally, which helped us build a stable, scalable, and optimal data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. Driven by the demand for processing large volumes of data and the organization’s growing data analytics needs, the data team at PlayStation took the initiative to build an open source big data processing infrastructure with Apache Spark in Python as the core ETL engine. Apache Airflow is the core workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance supporting a parallelism of 16 with 1 scheduler and 1 worker, and eventually scaled it to a bigger scheduler along with 4 workers to support a parallelism of 96, a DAG concurrency of 96, and a worker task concurrency of 24. We containerized all the services on AWS ECS, which gave us the ability to scale Airflow horizontally.

Airflow as an elastic ETL tool

2020-07-01
session
Vicente Rubén Del Pino Ruiz (UnitedHealth Group), Hendrik Kleine (Optum)

In search of a better, modern, simple method of managing ETL processes and merging them with various AI and ML tasks, we landed on Airflow. We envisioned a new user-friendly interface that can leverage dynamic DAGs and reusable components to build an ETL tool that requires virtually no training. We built several template DAGs and connectors for Airflow to typical data sources, like SQL Server, then proceeded to build a modern interface on top that brings ETL build, scheduling, and execution capabilities. Acknowledging that Airflow is designed for task orchestration, we expanded our infrastructure to use Kubernetes and Docker for elastic computing. Key to our solution is the ability to create ETLs using only open source tools, while executing on par with or faster than commercial solutions, with an interface so simple that ETLs can be created in seconds.

Airflow as the next gen of workflow system at Pinterest

2020-07-01
session
Dinghang Yu (Pinterest), Yulei Li (Pinterest), Ace Haidrey (Pinterest)

At Pinterest, our current workflow system, called Pinball, has served the data pipeline orchestration demands well for years. However, with rapidly increasing execution demand, the system started to expose scalability and performance issues. We therefore decided to look for a new solution to better address these issues and serve the workflow scheduling demand, and we chose Airflow as our next-generation workflow system. In this talk we discuss how we made the decision to onboard to Apache Airflow and, beyond the out-of-the-box features and experience, what improvements we made to better support the business needs at Pinterest.

Airflow at Société Générale: An open source orchestration solution in a banking environment

2020-07-01
session
Mohammed Marragh (Société Générale), Alaeddine Maaoui

This talk covers an overview of Airflow as well as lessons learned from its implementation in a banking production environment at Société Générale. It is the summary of a two-year experience: the story of an adventure within Société Générale to offer an internal cloud solution based on Airflow (AirflowaaS). I will cover the following points: why the study we carried out while searching for an open source replacement for a proprietary orchestration suite led us to choose Apache Airflow; the different implementation models (HA, scalability, etc.); and how to manage Airflow platforms in production with 45,000 runs per month.

Airflow CI/CD: Github to Cloud Composer (safely)

2020-07-01
session
Jacob Ferriero (Google)

Deploying bad DAGs to your Airflow environment can wreak havoc. This talk provides an opinionated take on a mono-repo structure for GCP data pipelines leveraging BigQuery and Dataflow, and a series of CI tests for validating your Airflow DAGs before deploying them to Cloud Composer. Composer makes deploying Airflow infrastructure easy and makes deploying DAGs “just dropping files in a GCS bucket”. However, this opens the opportunity for many organizations to shoot themselves in the foot by not following a strong CI/CD process. Pushing bad DAGs to Composer can manifest in a really sad Airflow webserver and many wasted DAG parsing cycles in the scheduler, disrupting other teams using the same environment. This talk will outline a series of recommended continuous integration tests to validate PRs that update or deploy new Airflow DAGs before pushing them to your GCP environment, along with a small “DAGs deployer” application that manages deploying DAGs following some best practices. The talk will walk through automating these tests with Cloud Build, though the approach could easily be ported to your favorite CI/CD tool.
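
One common building block for such CI checks, shown here as a generic sketch rather than the speaker’s exact test suite, is a DagBag import test that fails a pull request if any DAG file cannot be parsed:

```python
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    # Parse the repository's DAG folder without Airflow's bundled example DAGs.
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    # Any syntax error or failed import shows up here and fails the PR check.
    assert dag_bag.import_errors == {}


def test_every_dag_has_an_owner(dag_bag):
    # Example of an organization-specific policy layered on top of import validation.
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} is missing an owner"
```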

Airflow in Airbnb

2020-07-01
session
Yingbo Wang (Airbnb), Conor Camp (Airbnb), Ping Zhang, Cong Zhu (Airbnb), Kevin Yang (Airbnb | Airflow PMC member)

A look at the yesterday, today, and tomorrow of Airflow at Airbnb, sharing our learnings and vision for Airflow core and the ecosystem around it. Starting with the history of Airflow at Airbnb, we briefly describe how Airflow is used and give a high-level overview of Airflow at Airbnb. We then go into our current setup, short-term plans, learnings, and best practices, and finally talk about our roadmap and vision for Airflow at Airbnb. Beyond Airflow core, we would also love to talk about what we’ve done inside the Airflow ecosystem, including frameworks built on top of Airflow, workflow development tools, and more. The recording for this session is not available.

Airflow on Kubernetes: Containerizing your workflows

2020-07-01
session
Michael Hewitt (Nielsen)

At Nielsen Digital we have been moving our ETLs to containerized environments managed by Kubernetes, and we have successfully moved some of them into this environment in production. To do this we used the following technologies: Helm to easily deploy Airflow onto Kubernetes; Airflow’s Kubernetes Executor to take full advantage of Kubernetes features; and Airflow’s KubernetesPodOperator to execute our containerized tasks within our DAGs. To automate much of the deployment process we also used Terraform. Lastly, Kubernetes features were used to gain much finer-grained control of Airflow’s infrastructure. Join me in this talk for an in-depth look at how we used these technologies, why we used them, and the results of using them so far. I will also briefly go over some features coming in Airflow 2.0 that we are considering using in our workflows.
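
As a generic illustration (not Nielsen’s code) of running a containerized task with the KubernetesPodOperator, assuming the Airflow 1.10 contrib import path and a hypothetical image name:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="containerized_etl",
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
) as dag:
    transform = KubernetesPodOperator(
        task_id="transform",
        name="transform",
        namespace="airflow",                      # assumed namespace
        image="registry.example.com/etl:latest",  # hypothetical image
        cmds=["python", "transform.py"],
        get_logs=True,
        is_delete_operator_pod=True,              # clean up the pod after it finishes
    )
```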

Airflow, the perfect match in our analytics pipeline

2020-07-01
session

For three years we at LOVOO, a market-leading dating app, have been using the Google Cloud managed version of Airflow, a product we’ve been familiar with since its alpha release. We took a calculated risk and integrated the alpha into our product, and, luckily, it was a match. Since then, we have been leveraging this software to build out not only our data pipeline, but also to boost the way we do analytics and BI. The speaker will present an overview of the software’s usability for pipeline error alerting through BashOperators that communicate with Slack, and will touch upon how we built our analytics pipeline (deployment and growth) and currently batch large amounts of data from different sources effectively using Airflow. We will also showcase our PythonOperator-driven Redshift to BigQuery data migration process, as well as offer a guide for creating fully dynamic tasks inside a DAG.
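
A minimal, generic sketch of creating fully dynamic tasks inside a DAG; the source list and script path below are invented for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

SOURCES = ["ads", "payments", "events"]  # hypothetical source list

with DAG(
    dag_id="dynamic_batch",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
) as dag:
    for source in SOURCES:
        # One task is generated per source when the DAG file is parsed.
        BashOperator(
            task_id=f"load_{source}",
            bash_command=f"python load.py --source {source}",  # hypothetical script
        )
```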

Ask me anything with Airflow members

2020-07-01
session

Ask me Anything with a group of Airflow committers & PMC members.

Autonomous driving with Airflow

2020-07-01
session

This talk describes how Airflow is utilized in an autonomous driving project originating from Munich, Germany. We describe the Airflow setup, the challenges we encountered, and how we maneuvered to achieve a distributed and highly scalable Airflow setup. One of the biggest automotive manufacturers elected to go with Airflow as an orchestration tool in the pursuit of producing their first Level-3 autonomous driving vehicle in Germany. In this talk, we will describe the journey of deploying Airflow on top of OpenShift using a PostgreSQL database and RabbitMQ. We will describe how we achieve high availability for the different Airflow components. We will tackle issues related to database performance and failover recovery for the different Airflow components in our setup. In addition, we will present the bottlenecks we encountered with (1) the Airflow scheduler (especially with complex DAGs) and (2) the SparkSubmitOperator, and describe how we mitigated them. We will also describe how we leverage OpenShift to dynamically scale our Airflow deployment based on the running workloads. The talk will conclude with a brief overview of future requirements and features we believe will be helpful for the community.

Building reusable and trustworthy ELT pipelines (A templated approach)

2020-07-01
session
Nehil Jain (SnapTravel)

To improve automation of data pipelines, I propose a universal approach to ELT pipelines that optimizes for data integrity, extensibility, and speed of delivery. The workflow is built using open source tools and standards like Apache Airflow, Singer, Great Expectations, and dbt. Templating ETLs is challenging! Creating and maintaining data pipelines in production requires hard work to manage bugs in code and bad data. I propose a data pipeline pattern that can simplify building pipelines while optimizing for data integrity and observability. Goals: make ELT simple and fast to implement; validate your assumptions about the data before you make it available for use; allow analysts and data scientists to add pain-free contributions to ELT using SQL; and generate data documentation and failure logs for quick recovery and fixing outages in your pipeline. Target audience: approachable to any level of developer; novice data professionals interested in starting an ELT workflow and learning about the different tools of the ecosystem; intermediate+ developers interested in supercharging their pipeline with the Write-Audit-Publish pattern and reducing pipeline debt.
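
A generic sketch of the Write-Audit-Publish sequencing in an Airflow DAG; the audit logic and task bodies are placeholders, not the speaker’s implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def audit_staging():
    # Placeholder audit: in practice this could run Great Expectations checks.
    row_count = 100  # pretend we queried the staging table
    if row_count == 0:
        raise ValueError("Audit failed: staging table is empty")


with DAG(
    dag_id="write_audit_publish",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
) as dag:
    write = PythonOperator(task_id="write_to_staging", python_callable=lambda: None)
    audit = PythonOperator(task_id="audit_staging", python_callable=audit_staging)
    publish = PythonOperator(task_id="publish_to_production", python_callable=lambda: None)

    # The publish step only runs if the audit passes.
    write >> audit >> publish
```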

Data DAGs with lineage for fun and for profit

2020-07-01
session

Let’s be honest about it: many of us don’t consider data lineage to be cool. But what if lineage would allow you to write less boilerplate and less code, while at the same time making your data scientists, your auditors, your management, and, well, everyone happier? What if you could write DAGs that mix task-based and data-based definitions? Lineage support has been incubating in Airflow for a while. It was buggy and not very easy to use. Still, for a lot of reasons it is really cool to have data lineage available. One of those reasons is that it can make writing DAGs a lot easier. Recently a lot of development has gone into improved lineage support, to make it much easier or even transparent to use. In this talk I will focus on what we have in mind, evangelize data lineage, and also gather feedback from the audience on where we should take it next.

Data engineering hierarchy of needs

2020-07-01
session
Angel Daz (Ocelot Data)

Data infrastructures look different at small, mid-sized, and large companies. Yet most content out there is about large and sophisticated systems, and almost none of it is about migrating legacy, on-prem databases over to the cloud. In order to better explain the evolving needs of data engineering organizations, we will review the hierarchy of needs for data engineering.

Data flow with Airflow @ PayPal

2020-07-01
session

At PayPal we decided to move away from two of our enterprise schedulers, Control-M and UC4, to Airflow. As we started the journey, the first and most important step we wanted to take was to build all the mandatory APIs on top of Airflow so that we could integrate with our self-service tools. In this talk we share the challenges that we ran into while building APIs on top of Airflow and how we overcame them.

Democratised data workflows at scale

2020-07-01
session
Mihail Petkov (Financial Times), Emil Todorov (Financial Times)

Financial Times is increasing its digital revenue by allowing business people to make data-driven decisions. Providing an Airflow-based platform where data engineers, data scientists, BI experts, and others can run language-agnostic jobs was a huge swing. One of the most successful steps in the platform’s development was building our own execution environment, allowing stakeholders to self-deploy jobs without cross-team dependencies on top of the unlimited scale of Kubernetes. In this talk we share how we have integrated and extended Airflow at Financial Times. The main topics we will cover include: providing team-level security isolation; removing cross-team dependencies; creating an execution environment for independently creating and deploying R, Python, Java, Spark, and other jobs; reducing latency when sharing data between task instances; and integrating all these features on top of Kubernetes.

Demo: Reducing the lines, a visual DAG editor

2020-07-01
session
Traey Hatch (Onica)

In this talk I will introduce a DAG authoring and editing tool for Airflow that we have built. Installed as a plugin, this tool allows users to author DAGs by composing existing operators and hooks with virtually no Python experience. We walk through a demo of DAG authorship and deployment, and spend time reviewing the underlying open-source standards used and the general approach taken to develop the code. In addition to allowing DAGs to be created in a visual editor, the underlying tech enables Airflow DAGs to be described programmatically in YAML or JSON. DAGs described this way can be saved in backing databases instead of Python files.
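
The tool itself is the speakers’ own; purely as a hypothetical sketch of the general idea of describing a DAG declaratively and building it programmatically (the spec format and field names below are invented):

```python
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical declarative spec; a real tool would load this from a file or database.
SPEC = yaml.safe_load(
    """
dag_id: declarative_example
tasks:
  - id: extract
    command: "echo extract"
  - id: load
    command: "echo load"
    upstream: [extract]
"""
)

with DAG(
    dag_id=SPEC["dag_id"],
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
) as dag:
    tasks = {
        spec["id"]: BashOperator(task_id=spec["id"], bash_command=spec["command"])
        for spec in SPEC["tasks"]
    }
    # Wire up the dependencies declared in the spec.
    for spec in SPEC["tasks"]:
        for upstream_id in spec.get("upstream", []):
            tasks[upstream_id] >> tasks[spec["id"]]
```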

Effective Cross-DAG dependency

2020-07-01
session
Lucas Fonseca (QuintoAndar), Rafael Ribaldo (QuintoAndar)

Cross-DAG dependencies may reduce cohesion in data pipelines and, without an explicit solution in Airflow or in a third-party plugin, those pipelines tend to become complex to handle. That is the reason we at QuintoAndar have created an intermediate DAG, called Mediator, to handle relationships across data pipelines, so that they remain scalable and maintainable by any team. At QuintoAndar we seek automation and modularization in our data pipelines and believe that breaking them into many responsibility modules (DAGs) enhances maintainability, reusability, and understanding when moving data from one point to another. However, extending interconnections between DAGs tends to reduce those benefits and make the pipelines complex, and, above all, there is no explicit built-in solution in Airflow for them. That is why we created a Mediator DAG. The Mediator DAG in Airflow has the responsibility of looking for successfully finished DAG executions that may represent the previous step of another. That is, if a DAG depends on another, the Mediator will take care of checking and triggering the necessary objects for the data flow to continue. In conclusion, it is sometimes not practical to combine multiple DAGs into one. Hence, our proposal is to define a Mediator DAG to handle dependencies and bring cohesion to a data pipeline without losing its purpose. View presentation (Prezi)
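
As a rough, generic sketch of the mediator idea (not QuintoAndar’s implementation), using Airflow 1.10-era operators to wait for an upstream DAG’s final task and then trigger a downstream DAG; all DAG and task IDs are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="mediator",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
) as dag:
    # Wait until the upstream DAG's final task has succeeded for this execution date.
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_ingestion",
        external_dag_id="ingestion_dag",  # placeholder upstream DAG
        external_task_id="finish",        # placeholder final task in that DAG
    )

    # Then kick off the dependent DAG.
    trigger_transformation = TriggerDagRunOperator(
        task_id="trigger_transformation",
        trigger_dag_id="transformation_dag",  # placeholder downstream DAG
    )

    wait_for_ingestion >> trigger_transformation
```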

From cron to Airflow on Kubernetes: A startup story

2020-07-01
session
Adam Boscarino (Devoted Health)

Learn how Devoted Health went from cron jobs to an Airflow deployment on Kubernetes using a combination of open source and internal tooling. Devoted Health, a Medicare Advantage startup, made the move from cron jobs to Airflow on Kubernetes in a short period of time. This journey is a common one, but it still has a steep learning curve for new Airflow users. This talk will give you a blueprint to follow by covering the tools we use, best practices, and lessons learned. We’ll share Devoted’s approach to managing our deployment, monitoring the platform, and developing, testing, and deploying DAGs. This includes internal tooling we’ve written that allows data scientists to work with Airflow without worrying about Airflow itself.

From S3 to BigQuery - How a first-time Airflow user successfully implemented a data pipeline

2020-07-01
session

BigQuery is GCP’s serverless, highly scalable, and cost-effective cloud data warehouse that can analyze petabytes of data at very high speed. Amazon S3 is one of the oldest and most popular cloud storage offerings. Folks with data in S3 often want to use BigQuery to gain insights into their data. Using Apache Airflow, they can build pipelines to seamlessly orchestrate that connection. In this talk, Leah walks through how she and a colleague created an easily configurable pipeline to extract data. When a team at work mentioned wanting to set up a repeatable process for migrating data stored in S3 to BigQuery, Leah knew using Cloud Composer (GCP-hosted Airflow) was the right tool for the job, but she didn’t have much experience with the proprietary file type the data used. Luckily, one of her colleagues did have experience with that proprietary file type, though they hadn’t worked with Airflow. Leah and her colleague teamed up to build a reusable, easily configurable solution for the team. She will walk you through their problem, the solution, and the process they took to arrive at that solution, highlighting resources that were especially useful to a first-time Airflow user.
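
A generic sketch of one way to wire an S3-to-BigQuery pipeline with Airflow 1.10 contrib operators; bucket names, the table, the schema, and the CSV assumption are placeholders, and the speakers’ actual pipeline handled a proprietary file format:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.s3_to_gcs_operator import S3ToGoogleCloudStorageOperator

with DAG(
    dag_id="s3_to_bigquery",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
) as dag:
    # Copy objects from S3 into a GCS staging bucket.
    s3_to_gcs = S3ToGoogleCloudStorageOperator(
        task_id="s3_to_gcs",
        bucket="source-s3-bucket",            # placeholder S3 bucket
        prefix="exports/",
        dest_gcs="gs://staging-gcs-bucket/",  # placeholder GCS bucket
    )

    # Load the staged files into a BigQuery table.
    gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="staging-gcs-bucket",
        source_objects=["exports/*.csv"],     # assumes CSV purely for illustration
        destination_project_dataset_table="analytics.events",  # placeholder table
        schema_fields=[{"name": "id", "type": "STRING", "mode": "NULLABLE"}],  # placeholder schema
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    s3_to_gcs >> gcs_to_bq
```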

From Zero to Airflow: bootstrapping a ML platform

2020-07-01
session
Noam Elfanbaum (Bluevine)

At Bluevine we use Airflow to drive our ML platform. In this talk, Noam presents the challenges and gains we had at transitioning from a single server running Python scripts with cron to a full blown Airflow setup. This includes: supporting multiple Python versions, event driven DAGs, performance issues and more! Some of the points that I’ll cover are: Supporting multiple Python versions Event driven DAGs Airflow Performance issues and how we circumvented them Building Airflow plugins to enhance observability Monitoring Airflow using Grafana CI for Airflow DAGs (super useful!) Patching Airflow scheduler Slides