
Event

Airflow Summit 2021

2021-07-01 · Airflow Summit

Activities tracked

56

Airflow Summit 2021 program

Sessions & talks

Showing 1–25 of 56


Advanced Superset for Engineers (APIs, Version Controlled Dashboards, & more)

2021-07-01
session

Apache Superset is a modern, open-source data exploration & visualization platform originally created by Maxime Beauchemin. In this talk, I will showcase advanced technical Superset features such as the rich Superset API, version-controlling dashboards with GitHub, embedding Superset charts in other applications, and more. This talk will be technical and hands-on, and I will share all code examples I use so you can play with them yourself afterwards!
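
As a taste of the API portion, a minimal sketch (not the speaker's actual code) of authenticating against Superset's v1 REST API and listing dashboards; the URL and credentials are placeholders:

```python
import requests

SUPERSET_URL = "http://localhost:8088"  # hypothetical local instance

# Authenticate and obtain a JWT access token.
resp = requests.post(
    f"{SUPERSET_URL}/api/v1/security/login",
    json={"username": "admin", "password": "admin", "provider": "db", "refresh": True},
)
resp.raise_for_status()
token = resp.json()["access_token"]

# List dashboards; the returned metadata is what you would export and commit
# to version control (e.g. GitHub) to track dashboard changes.
headers = {"Authorization": f"Bearer {token}"}
dashboards = requests.get(f"{SUPERSET_URL}/api/v1/dashboard/", headers=headers).json()
for d in dashboards.get("result", []):
    print(d["id"], d["dashboard_title"])
```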

Airflow and Analytics Engineering - Dos and don'ts

2021-07-01
session

The role of Analytics Engineer has emerged within data and analytics teams over the last few years, so it is important to highlight what an Analytics Engineer does and how, from my perspective, a set of dos and don'ts can help a team boost its day-to-day work with the help of Airflow.

Airflow as the Foundation of a Multi-Faceted Data Platform

2021-07-01
session
Jay Sen, Ry Walker (Astronomer)

A discussion with Jay Sen, Data Platform Architect at PayPal, and Ry Walker, Founder/CTO of Astronomer, about the central role Airflow plays within PayPal's data platform, and the opportunity to build stronger integrations between Airflow and the other tools that surround it.

Airflow Extensions for Streamlined ETL Backfilling

2021-07-01
session

Using Airflow as our scheduling framework, we ETL data generated by tens of millions of transactions every day to build the backbone for our reports, dashboards, and training data for our machine learning models. There are over 500 (and growing) such ingested and aggregated tables owned by multiple teams that contain intricate dependencies between one another. Given this level of complexity, it can become extremely cumbersome to coordinate backfills for any given table, when also taking into account all its downstream dependencies, aggregation intervals, and data availability. This talk will focus on how we customized and extended Airflow at Adyen to streamline our backfilling operations. This allows us to prevent mistakes and enable our product teams to keep launching fast and iterating.
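
For context on what a backfill touches, a hedged sketch (not Adyen's actual extension) of programmatically clearing a DAG's task instances over a date range so the scheduler re-runs them; the DAG id is hypothetical:

```python
from datetime import datetime

from airflow.models import DagBag

# Load the DAG from the configured DAGs folder.
dag = DagBag().get_dag("daily_aggregates")  # hypothetical DAG id

# DAG.clear resets every task instance of the DAG in this window so the
# scheduler re-runs them; extensions can layer dependency-aware coordination
# on top of primitives like this.
dag.clear(
    start_date=datetime(2021, 6, 1),
    end_date=datetime(2021, 6, 7),
)
```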

Airflow Journey @SG

2021-07-01
session

This talk will cover the adoption journey (technical challenges & team organization) of Apache Airflow (1.8 to 2.0) at Societe Generale. Timeline of events: a POC with v1.8 to convince our management; a shared infrastructure with v1.10.2; multiple infrastructures with v1.10.12; an on-demand service offer with v2.0 (challenges & REX).

Airflow loves Kubernetes

2021-07-01
session

In this talk Jarek and Kaxil will talk about official, community support for running Airflow in the Kubernetes environment. Full support for Kubernetes deployments was in development by the community for quite a while, and in the past Airflow users had to rely on third-party images and Helm charts to run Airflow on Kubernetes. Over the last year, community members made an enormous effort to provide robust, simple, and versatile support for those deployments that would serve all kinds of Airflow users: starting with the official container image, through the quick-start docker-compose configuration, and culminating in April with the release of the official Helm chart for Airflow. This talk is aimed at Airflow users who would like to make use of all this effort. Attendees will learn how to: extend or customize the official Airflow Docker image to adapt it to their needs; run the quick-start docker-compose environment where they can quickly verify their images; and configure and deploy Airflow on Kubernetes using the official Airflow Helm chart.

Airflow: The Power of Stitching Services Together

2021-07-01
session

Apache Airflow is known to be a great orchestration tool that enables use cases that would not be possible otherwise. One of Airflow's great features is the ability to “glue” together totally separate services to build bigger functionalities. In this talk you will learn about various uses of Airflow that let its users automate critical company processes and even establish businesses. The examples provided are based on Airflow used in the context of Cloud Composer, a managed service for provisioning and managing Airflow instances.
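
A minimal sketch of the “glue” idea, assuming the Google provider package and placeholder bucket/table names: one DAG that waits for a file in GCS and then loads it into BigQuery:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("glue_gcs_to_bq", start_date=datetime(2021, 7, 1), schedule_interval="@daily") as dag:
    # Service 1: wait for an upstream system to drop a file into GCS.
    wait_for_file = GCSObjectExistenceSensor(
        task_id="wait_for_file",
        bucket="example-landing-zone",
        object="exports/{{ ds }}/data.csv",
    )
    # Service 2: load that file into a BigQuery table.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="example-landing-zone",
        source_objects=["exports/{{ ds }}/data.csv"],
        destination_project_dataset_table="example_project.analytics.daily_data",
        write_disposition="WRITE_TRUNCATE",
    )
    wait_for_file >> load_to_bq
```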

An On-Demand Airflow Service for Internet Scale Gameplay Pipelines

2021-07-01
session

EA Games has very dynamic and federated needs for its data processing pipelines. Many individual studios within EA build and manage the data pipelines for their games, iterating rapidly through game development cycles. Developer productivity around orchestrating these pipelines is as critical as providing a robust, production-quality orchestration service. With this in mind, we re-engineered our Airflow service from the ground up to cater to our large internal user base (thousands of users) and internet-scale data processing systems (petabytes of data). This session details the evolution of Airflow at EA Digital Platform from a monolithic multi-tenant instance to an “on-demand” system where teams and studios create their own dedicated Airflow instance, with all the necessary bells and whistles, at the click of a button, letting them immediately get their data pipelines running. We also elaborate on how Airflow is interwoven into a “self-serve” model for ETL pipelines within our teams, with the objective of truly democratizing data across our games.

Apache Airflow 2.0 on Amazon MWAA

2021-07-01
session

In this session we will discuss Amazon Managed Workflows for Apache Airflow (MWAA), how Apache Airflow (and specifically version 2.0) is implemented in the service, best practices for deployment and operations, and the Amazon MWAA team’s commitment to open source usage and contributions.

Apache Airflow and Ray: Orchestrating ML at Scale

2021-07-01
session

As the Apache Airflow project grows, we seek both ways to incorporate rising technologies and novel ways to expose them to our users. Ray is one of the fastest-growing distributed computation systems on the market today. In this talk, we will introduce the Ray decorator and Ray backend. These features, built with the help of the Ray maintainers at Anyscale, allow data scientists to natively integrate their distributed pandas, XGBoost, and TensorFlow jobs into their Airflow pipelines with a single decorator. By merging the orchestration of Airflow with the distributed computation of Ray, this combination of technologies opens Airflow users to a whole host of new possibilities when designing their pipelines.
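
The Ray decorator and backend demoed here ship in a provider package; as a hedged stand-in, this sketch shows a plain TaskFlow task handing work to Ray directly (the cluster address and workload are illustrative):

```python
from datetime import datetime

import ray
from airflow.decorators import dag, task


@dag(start_date=datetime(2021, 7, 1), schedule_interval=None)
def ray_example():
    @task
    def train():
        # Connect to an already-running Ray cluster.
        ray.init(address="auto")

        @ray.remote
        def square(x: int) -> int:
            return x * x

        # Fan the work out across the cluster and gather results.
        results = ray.get([square.remote(i) for i in range(10)])
        ray.shutdown()
        return sum(results)

    train()


dag_obj = ray_example()
```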

Apache Airflow at Apple - Multi-tenant Airflow and Custom Operators

2021-07-01
session

Running a platform where different business units at Apple can run their workloads in isolation and share operators.

Apache Airflow at Wise

2021-07-01
session

Wise (previously TransferWise) is a London-based fintech company. We build a better way of sending money internationally. At Wise we make great use of Airflow: more than 100 data scientists, analysts, and engineers use it every day to generate reports, prepare data, (re)train machine learning models, and monitor services. My name is Alexandra, and I'm a Machine Learning Engineer at Wise. Our team is responsible for building and maintaining Wise's Airflow instances. In this presentation I would like to talk about three main things: our current setup, our challenges, and our future plans with Airflow. We are currently transitioning from a single centralised Airflow instance to many segregated instances to increase reliability and limit access. We've learned a lot throughout this journey and are looking to share these learnings with a wider audience.

Autoscaling in Airflow - Lessons learned

2021-07-01
session

Autoscaling in Airflow: what we learned from the Cloud Composer case. We would like to present how we approach the autoscaling problem for Airflow running on Kubernetes in Cloud Composer: how we calculate our autoscaling metric, what problems we had when scaling down, and how we solved them. We also share ideas on what we could improve in the current solution and how.

Building an Elastic Platform Using Airflow Uniquely as an Orchestrator

2021-07-01
session
Lucas Fonseca (QuintoAndar), Rafael Ribaldo (QuintoAndar)

At QuintoAndar we seek automation and scalability in our data pipelines and believe that Airflow is the right tool to give us exactly what we need. However, having all concerns mapped and tooling defined doesn't necessarily mean success. For months we struggled with the misconception that Airflow should act as both orchestrator and executor within a monolithic strategy. That could not be further from the truth, as it gave rise to scalability and performance issues, infrastructure and maintainability costs, and multi-directional impact throughout development teams. Employing Airflow as an orchestration-only solution, though, may help teams deliver value to end users in a more efficient, reliable, and performant manner, where data pipelines can be executed anywhere with proper resources and optimizations. Those are the reasons we shifted from an orchestrate-execute strategy to an orchestrate-only one, in order to leverage the full power of data pipeline management in Airflow. Straightaway, the separation of data processing from pipeline coordination brought not only finer resource tuning and better maintainability, but also tremendous scalability on both ends.
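
A hedged sketch of the orchestrate-only pattern: Airflow schedules the work, while execution happens in a Kubernetes pod (the image, namespace, and registry are placeholders, not QuintoAndar's setup):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("orchestrate_only", start_date=datetime(2021, 7, 1), schedule_interval="@daily") as dag:
    # Airflow only coordinates; the heavy data processing runs in its own pod
    # with its own resources, image, and dependencies.
    run_job = KubernetesPodOperator(
        task_id="run_daily_aggregation",
        name="daily-aggregation",
        namespace="data-pipelines",
        image="example.registry/jobs/daily-aggregation:latest",
        arguments=["--date", "{{ ds }}"],
        get_logs=True,
        is_delete_operator_pod=True,  # clean up the pod after it finishes
    )
```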

Building a robust data pipeline with the dAG stack: dbt, Airflow, Great Expectations

2021-07-01
session

Data quality has become a much-discussed topic in the fields of data engineering and data science, and it has become clear that data validation is absolutely crucial to ensuring the reliability of any data products and insights produced by an organization's data pipelines. This session will outline patterns for combining three popular open-source tools in the data ecosystem - dbt, Airflow, and Great Expectations - and use them to build a robust data pipeline with data validation at each critical step.
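
A minimal sketch of how the three tools might be wired together in one DAG (checkpoint names, paths, and CLI flags are illustrative assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("dag_stack", start_date=datetime(2021, 7, 1), schedule_interval="@daily") as dag:
    # Validate the raw inputs before spending compute on transformations.
    validate_source = BashOperator(
        task_id="validate_source",
        bash_command="great_expectations checkpoint run source_data",
    )
    # Transform with dbt.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt",
    )
    # Validate the transformed outputs before downstream consumers see them.
    validate_output = BashOperator(
        task_id="validate_output",
        bash_command="great_expectations checkpoint run transformed_data",
    )
    validate_source >> dbt_run >> validate_output
```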

Building a Scalable & Isolated Architecture for Preprocessing Medical Records

2021-07-01
session

After performing several experiments with Airflow, we arrived at the best architectural design for processing text medical records at scale. Our hybrid solution uses Kubernetes, Apache Airflow, Apache Livy, and Apache cTAKES. Using Kubernetes containers has the benefit of a consistent, portable, and isolated environment for each component of the pipeline. With Apache Livy, you can run tasks on a Spark cluster at scale. Additionally, Apache cTAKES helps extract information from the clinical free text of electronic medical records, using natural language processing techniques to identify codable entities, temporal events, properties, and relations.
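
A hedged sketch of the Livy piece, assuming the Apache Livy provider package; the jar, class, and connection id are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG("ctakes_preprocessing", start_date=datetime(2021, 7, 1), schedule_interval="@daily") as dag:
    # Submit a Spark batch through Livy's REST API rather than running Spark
    # inside the Airflow worker, keeping the components isolated.
    preprocess = LivyOperator(
        task_id="preprocess_records",
        file="hdfs:///jobs/preprocess-medical-records.jar",
        class_name="com.example.PreprocessRecords",
        args=["--date", "{{ ds }}"],
        livy_conn_id="livy_default",
        polling_interval=30,  # poll the Livy batch every 30s for completion
    )
```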

Building Providers & DAGs in the Airflow Ecosystem

2021-07-01
session

Learn how to use Airflow’s robust ecosystem of providers to construct secure, high-quality DAGs.

Building the AirflowEventStream

2021-07-01
session
Jelle Munk (Adyen)

Or: how to keep our traditional Java application up to date on everything big data. At Adyen we process tens of millions of transactions a day, a number that rises every day. This means that generating reports, training machine learning models, or any other operation that requires a bird's-eye view over weeks or months of data requires the use of big data technologies. We recently migrated to Airflow for scheduling all batch operations on our on-premise big data cluster. Some of these operations require input from our merchants or our support team: merchants can, for instance, subscribe to reports, choose their preferred time zone, and even specify which columns they want included. Once generated, these reports need to become available in our customer portal. So how do we keep track in our Customer Area of which reports have been generated in Airflow? How do we launch ad-hoc backfills when one of our merchants subscribes to a new report? How do we integrate all of this into our existing monitoring pipeline? This talk will focus on how we successfully integrated our big data platform with our existing Java web applications, and how Airflow (with some simple add-ons) played a crucial role in achieving this.
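
A hedged illustration (not Adyen's implementation) of one simple add-on in this spirit: a success callback that publishes a "report ready" event to an external endpoint a Java application could consume; the URL and command are hypothetical:

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator


def publish_event(context):
    """Notify an external system that this task's report is ready."""
    requests.post(
        "https://portal.example.com/internal/airflow-events",  # hypothetical
        json={
            "dag_id": context["dag"].dag_id,
            "task_id": context["task_instance"].task_id,
            "execution_date": context["ds"],
            "status": "success",
        },
        timeout=10,
    )


with DAG("merchant_reports", start_date=datetime(2021, 7, 1), schedule_interval="@daily") as dag:
    generate = BashOperator(
        task_id="generate_report",
        bash_command="generate_report --date {{ ds }}",  # placeholder command
        on_success_callback=publish_event,
    )
```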

Building the Data Science Platform with Airflow @Near

2021-07-01
session

At Near we work on terabytes of location data with close-to-real-time modelling to generate key consumer insights and estimates for our clients across the globe. We have hundreds of country-specific models deployed and managed through Airflow to achieve this goal. Some of the workflows we have deployed are schedule-based, some are dynamic, and some are trigger-based. In this session I will discuss some of the workflows that are scheduled and monitored using Airflow, the key benefits, and the challenges we have faced in our production systems.
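
A minimal sketch of the trigger-based flavor, with illustrative DAG ids: an upstream ingestion DAG kicking off a country-specific model DAG with runtime configuration:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG("ingest_location_data", start_date=datetime(2021, 7, 1), schedule_interval="@hourly") as dag:
    # Rather than running on its own schedule, the model DAG is triggered
    # whenever fresh data lands.
    trigger_model = TriggerDagRunOperator(
        task_id="trigger_country_model",
        trigger_dag_id="country_model_us",
        conf={"run_date": "{{ ds }}"},  # passed to the downstream DAG run
    )
```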

Clearing Airflow obstructions

2021-07-01
session

Apache Airflow aims to speed up the development of workflows, but developers are always ready to add bugs here and there. This talk illustrates a few pitfalls faced while developing workflows at the BBC to build machine learning models. The objective is to share some lessons learned and, hopefully, save others time. Some of the topics covered, with code examples: tasks unsuitable to be run from within Airflow executors; plugin misuse; inconsistency while using an operator; (mis)configuration; what to avoid during a workflow deployment; and the consequences of non-idempotent tasks.
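
As an illustration of the last pitfall, a hedged sketch of making a task idempotent by keying deletes and inserts on the logical date (table and connection names are hypothetical):

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(start_date=datetime(2021, 7, 1), schedule_interval="@daily")
def idempotent_example():
    @task
    def load_daily_metrics(ds=None):
        # Delete-then-insert scoped to the logical date: reruns and backfills
        # produce the same rows as the first run instead of duplicating them.
        hook = PostgresHook(postgres_conn_id="warehouse")  # hypothetical conn
        hook.run("DELETE FROM metrics_daily WHERE day = %s", parameters=(ds,))
        hook.run(
            "INSERT INTO metrics_daily "
            "SELECT %s, count(*) FROM events WHERE day = %s",
            parameters=(ds, ds),
        )

    load_daily_metrics()


dag_obj = idempotent_example()
```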

Contributing to Apache Airflow: First Steps

2021-07-01
session

Learn to contribute to the Apache Airflow ecosystem both with and without code. Post an article to the Airflow blog, improve documentation, or dive head-first into Airflow's free and open-source software community.

Contributing to Apache Airflow | Journey to becoming Airflow's leading contributor

2021-07-01
session

From not knowing Python (let alone Airflow), and from submitting a first PR that fixed a typo, to becoming an Airflow Committer, PMC Member, Release Manager, and the #1 committer this year, this talk walks through Kaxil's journey in the Airflow world. The second part of the talk explains: how you can start your own OSS journey by contributing to Airflow; how to expand familiarity with different parts of the Airflow codebase; how to keep committing regularly and steadily to become an Airflow Committer (including the current guidelines for becoming one); and the different mediums of communication (dev list, users list, Slack channel, GitHub Discussions, etc.).

Create Your Custom Secrets Backend for Apache Airflow - A guided tour into Airflow codebase

2021-07-01
session

This talk aims to share how Airflow's secrets backend works and how users can create custom secrets backends for their specific use cases and technology stacks.
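
As a rough outline of the mechanism, a hedged sketch of a custom secrets backend; the file-based store and its path are purely illustrative:

```python
import json

from airflow.secrets import BaseSecretsBackend


class JsonFileSecretsBackend(BaseSecretsBackend):
    """Reads connection URIs and variables from a local JSON file."""

    def __init__(self, path="/etc/airflow/secrets.json", **kwargs):
        super().__init__(**kwargs)
        self.path = path

    def _load(self):
        with open(self.path) as f:
            return json.load(f)

    def get_conn_uri(self, conn_id):
        # Returning None lets Airflow fall through to the next backend.
        return self._load().get("connections", {}).get(conn_id)

    def get_variable(self, key):
        return self._load().get("variables", {}).get(key)
```

Airflow is then pointed at the class via the backend option in the [secrets] section of airflow.cfg, with backend_kwargs supplying the constructor arguments.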

Creating Data Pipelines with Elyra, a visual DAG composer and Apache Airflow

2021-07-01
session

This presentation will detail how Elyra creates Jupyter Notebook, Python, and R script-based pipelines without having to leave your web browser. The goal of Elyra is to help construct data pipelines by surfacing concepts and patterns common in pipeline construction in a familiar, easy-to-navigate interface, so that data scientists and engineers can create pipelines on their own. In Elyra's Pipeline Editor UI, portions of Apache Airflow's domain language are surfaced to the user and made transparent or understandable through tooltips and helpful notes in the proper context during pipeline construction. With these features, Elyra lets you rapidly prototype data workflows without the need to know or write any pipeline code. Lastly, we will look at the features planned on our roadmap for Airflow, including more robust Kubernetes integration and support for runtime-specific components/operators. Project home: https://github.com/elyra-ai/elyra

Customizing Xcom to enhance data sharing between tasks

2021-07-01
session
Vikram Koka (Astronomer), Ephraim Anierobi

In Apache Airflow, XCom is the default mechanism for passing data between tasks in a DAG. In practice, this has been restricted to small data elements, since XCom data is persisted in the Airflow metadatabase and is constrained by database and performance limitations. With the new TaskFlow API introduced in Airflow 2.0, passing data between tasks is seamless and the use of XCom is invisible. However, the ability to pass data is restricted to a relatively small set of data types which can be natively converted to JSON. This tutorial describes how to go beyond these limitations by developing and deploying a custom XCom backend within Airflow to enable the sharing of large and varied data elements, such as pandas data frames, between tasks in a data pipeline, using cloud storage such as Google Cloud Storage or Amazon S3.
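
A hedged sketch along the lines described, offloading pandas DataFrames to S3 and keeping only a reference in the metadatabase (bucket name and file layout are illustrative):

```python
import uuid

import pandas as pd
from airflow.models.xcom import BaseXCom
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


class S3XComBackend(BaseXCom):
    PREFIX = "xcom-s3://"
    BUCKET = "example-xcom-bucket"  # hypothetical bucket

    @staticmethod
    def serialize_value(value):
        # Offload DataFrames to S3 and store only a reference string.
        if isinstance(value, pd.DataFrame):
            key = f"xcom/{uuid.uuid4()}.csv"
            path = f"/tmp/{uuid.uuid4()}.csv"
            value.to_csv(path, index=False)
            S3Hook().load_file(
                filename=path, key=key, bucket_name=S3XComBackend.BUCKET, replace=True
            )
            value = S3XComBackend.PREFIX + key
        return BaseXCom.serialize_value(value)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        # Rehydrate DataFrames that were offloaded to S3.
        if isinstance(value, str) and value.startswith(S3XComBackend.PREFIX):
            key = value[len(S3XComBackend.PREFIX):]
            path = S3Hook().download_file(key=key, bucket_name=S3XComBackend.BUCKET)
            value = pd.read_csv(path)
        return value
```

The backend is enabled by setting the xcom_backend option in the [core] section of airflow.cfg to the dotted path of this class.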