talk-data.com

Event

Airflow Summit 2021

2021-07-01 · Airflow Summit

Activities tracked

6

Airflow Summit 2021 program

Filtering by: ETL/ELT

Sessions & talks

Showing 1–6 of 6 · Newest first


Airflow Extensions for Streamlined ETL Backfilling

2021-07-01
session

Using Airflow as our scheduling framework, we ETL data generated by tens of millions of transactions every day to build the backbone for our reports, dashboards, and machine learning training data. There are over 500 (and growing) such ingested and aggregated tables, owned by multiple teams, with intricate dependencies between them. Given this level of complexity, coordinating a backfill for any given table becomes extremely cumbersome once you also account for all of its downstream dependencies, aggregation intervals, and data availability. This talk will focus on how we customized and extended Airflow at Adyen to streamline our backfilling operations, preventing mistakes and enabling our product teams to keep launching and iterating quickly.
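For reference, a minimal sketch of the baseline mechanism the talk builds on: a daily aggregation DAG whose past runs can be re-executed via Airflow's standard backfill/catchup. The DAG and table names below are hypothetical; Adyen's actual extensions are not shown.

```python
# Minimal sketch (not Adyen's actual code): a daily aggregation DAG whose
# runs can be backfilled over a date range. Names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def aggregate_transactions(ds, **_):
    # Re-computable for any execution date, which is what makes
    # backfilling a date range safe to repeat.
    print(f"aggregating transactions for {ds}")


with DAG(
    dag_id="transaction_aggregation",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # allows past dates to be (re-)run via backfill
) as dag:
    PythonOperator(
        task_id="aggregate",
        python_callable=aggregate_transactions,
    )
```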

An On-Demand Airflow Service for Internet Scale Gameplay Pipelines

2021-07-01
session

EA Games has very dynamic and federated needs for its data processing pipelines. Many individual studios within EA build and manage the data pipelines for their games, iterating rapidly through game development cycles. Developer productivity around orchestrating these pipelines is as critical as providing a robust, production-quality orchestration service. With this in mind, we re-engineered our Airflow service from the ground up to cater to our large internal user base (thousands of users) and internet-scale data processing systems (petabytes of data). This session details the evolution of Airflow at EA Digital Platform from a monolithic multi-tenant instance to an “On-Demand” system where teams and studios create their own dedicated Airflow instance, with all the necessary bells and whistles, at the click of a button, allowing them to get their data pipelines running immediately. We also elaborate on how Airflow is interwoven into a “Self Serve” model for ETL pipelines within our teams, with the objective of truly democratizing data across our games.

Dataclasses as Pipeline Definitions in Airflow

2021-07-01
session
Madison Swain-Bowden (Automattic)

We will describe how we built a system in Airflow for MySQL-to-Redshift ETL pipelines defined in pure Python using dataclasses. These dataclasses are then used to dynamically generate DAGs depending on the pipeline type. This setup allows us to implement robust testing, validation, alerts, and documentation for our pipelines. We will also describe the performance improvements we achieved by upgrading to Airflow 2.0.
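An illustrative sketch of the general pattern the abstract describes, with dataclasses driving dynamic DAG generation. All names are hypothetical; this is not Automattic's actual implementation.

```python
# Illustrative sketch of the dataclass-as-pipeline-definition pattern
# (hypothetical names, not Automattic's actual code).
from dataclasses import dataclass
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


@dataclass
class TablePipeline:
    source_table: str          # MySQL table to extract
    target_table: str          # Redshift table to load
    schedule: str = "@daily"


PIPELINES = [
    TablePipeline("orders", "analytics.orders"),
    TablePipeline("customers", "analytics.customers", schedule="@hourly"),
]

for pipeline in PIPELINES:
    dag_id = f"etl_{pipeline.source_table}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 1, 1),
        schedule_interval=pipeline.schedule,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="extract_load",
            # Placeholder for the actual extract/load logic.
            python_callable=lambda p=pipeline: print(
                f"copy {p.source_table} -> {p.target_table}"
            ),
        )
    # Register each generated DAG in the module namespace so Airflow discovers it.
    globals()[dag_id] = dag
```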

Lessons Learned while Migrating Data Pipelines from Enterprise Schedulers to Airflow

2021-07-01
session
Hari Nair (Unravel), Shivnath Babu (Unravel)

Digital transformation, application modernization, and data platform migration to the cloud are key initiatives in most enterprises today. These initiatives are stressing the scheduling and automation tools in these enterprises to the point that many users are looking for better solutions. A survey revealed that 88% of users believe their business would benefit from an improved automation strategy across technology and business. Airflow has an excellent opportunity to capture mindshare and emerge as the leading solution here. At Unravel, we are seeing many of our enterprise customers at various stages of migrating to Airflow from enterprise schedulers or ETL/ELT orchestration tools such as AutoSys, Informatica, Oozie, Pentaho, and Tidal. In this talk, we will share lessons learned and best practices from the entire pipeline migration life cycle, including: (i) the evaluation process that led to picking Airflow, including certain aspects where Airflow can do better; (ii) the challenges in discovering and understanding all components and dependencies that need to be considered in the migration; (iii) the challenges arising during the pipeline code and data migration, especially in getting single-pane-of-glass and apples-to-apples views to track the progress of the migration; and (iv) the challenges in ensuring that the pipelines migrated to Airflow perform and scale on par with, or better than, what existed previously.

Orchestrating ELT with Fivetran and Airflow

2021-07-01
session

At Fivetran, we are seeing many organizations adopt the Modern Data Stack to suit the breadth of their data needs. However, as incoming data sources begin to scale, it can be hard to manage and maintain the environment, with more time spent repairing and re-engineering old data pipelines than building new ones. This talk will introduce a number of new Airflow providers, including airflow-provider-fivetran, and discuss some of the benefits and considerations we are seeing data engineers, data analysts, and data scientists experience as they adopt them.
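A hedged example of what orchestrating a Fivetran sync from Airflow can look like with the airflow-provider-fivetran package: trigger the managed connector, then wait for the sync to finish before downstream transformations run. The connector id is a placeholder, and the exact import paths and parameters should be checked against the provider's documentation.

```python
# Sketch of triggering and waiting on a Fivetran connector sync from Airflow;
# connector_id and connection ids are placeholders.
from datetime import datetime

from airflow import DAG
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

with DAG(
    dag_id="fivetran_elt",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the managed Fivetran connector sync...
    trigger_sync = FivetranOperator(
        task_id="trigger_fivetran_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="my_connector_id",  # placeholder
    )
    # ...then block until Fivetran reports the sync as complete.
    wait_for_sync = FivetranSensor(
        task_id="wait_for_fivetran_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="my_connector_id",  # placeholder
        poke_interval=60,
    )
    trigger_sync >> wait_for_sync
```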

Reverse ETL on Airflow

2021-07-01
session

At Snowflake, as you can imagine, we run a lot of data pipelines and tables curating metrics for all parts of the business. These are the lifeline of Snowflake’s business decisions. We also have a lot of source systems that display these metrics and make them accessible to end users. So what happens when your data model does not match your system? For example, your bookings numbers in Salesforce do not match the data model that curates bookings metrics. At Snowflake we ran into this problem over and over again, so we set out to build an infrastructure that allows users to effortlessly sync the results of their data pipelines with any downstream or upstream system, giving us a central source of truth in our warehouse. This infrastructure was built on Snowflake using Airflow, and it lets a user begin syncing data by providing a few details, such as the model and the system to update. In this presentation we will show how, using Airflow and Snowflake, we are able to use our data pipelines as the source of truth for all systems involved in the business. With this infrastructure we can use Snowflake models as a central source of truth for all applications used throughout the company, ensuring that any number synced this way is always the same for every user who sees it.
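A minimal reverse-ETL sketch under assumed names (not Snowflake's internal infrastructure): read a curated metrics model from the warehouse and push the rows to a downstream system, keeping the warehouse as the source of truth.

```python
# Hypothetical reverse-ETL DAG: pull a curated bookings model from Snowflake
# and push each row to a downstream system (e.g. a CRM). All names and the
# downstream call are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def push_to_downstream_system(account_id, bookings):
    # Placeholder for the downstream system's API call (e.g. Salesforce update).
    print(f"would update {account_id} with bookings={bookings}")


def sync_bookings(**_):
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    rows = hook.get_records(
        "SELECT account_id, bookings FROM analytics.bookings_model"
    )
    for account_id, bookings in rows:
        # Sync each record so the downstream system matches the warehouse model.
        push_to_downstream_system(account_id, bookings)


with DAG(
    dag_id="reverse_etl_bookings",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="sync_bookings", python_callable=sync_bookings)
```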