We will describe how we built a system in Airflow for MySQL-to-Redshift ETL pipelines defined in pure Python using dataclasses. These dataclasses are then used to dynamically generate DAGs depending on pipeline type. This setup allows us to implement robust testing, validation, alerts, and documentation for our pipelines. We will also describe the performance improvements we achieved by upgrading to Airflow 2.0.
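For readers unfamiliar with the pattern, a minimal sketch of what dataclass-driven DAG generation can look like is shown below; the pipeline fields, table names, and copy logic are hypothetical placeholders, not the speakers' actual code.

```python
# A minimal sketch (not the speakers' code) of the dataclass-driven pattern:
# pipeline definitions are plain dataclasses, and a loop turns each one into a DAG.
from dataclasses import dataclass
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


@dataclass
class MySQLToRedshiftPipeline:  # hypothetical pipeline definition
    name: str
    source_table: str
    target_table: str
    schedule: str = "@daily"


PIPELINES = [
    MySQLToRedshiftPipeline("orders", "shop.orders", "analytics.orders"),
    MySQLToRedshiftPipeline("users", "shop.users", "analytics.users", schedule="@hourly"),
]


def copy_table(source_table: str, target_table: str) -> None:
    # Placeholder for the actual MySQL -> Redshift copy logic.
    print(f"Copying {source_table} to {target_table}")


for pipeline in PIPELINES:
    with DAG(
        dag_id=f"mysql_to_redshift__{pipeline.name}",
        start_date=datetime(2021, 1, 1),
        schedule_interval=pipeline.schedule,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="copy_table",
            python_callable=copy_table,
            op_kwargs={
                "source_table": pipeline.source_table,
                "target_table": pipeline.target_table,
            },
        )
    # Register the generated DAG in the module namespace so Airflow discovers it.
    globals()[dag.dag_id] = dag
```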
Airflow Summit 2021 program
Sessions & talks
Data Lineage with Apache Airflow using OpenLineage
If you manage a lot of data, and you’re attending this summit, you likely rely on Apache Airflow to do a lot of the heavy lifting. Like any powerful tool, Apache Airflow allows you to accomplish what you couldn’t before… but also creates new challenges. As DAGs pile up, complexity layers on top of complexity and it becomes hard to grasp how a failed or delayed DAG will affect everything downstream. In this session we will provide a crash course on OpenLineage, an open platform for metadata management and data lineage analysis. We’ll show how capturing metadata with OpenLineage can help you maintain inter-DAG dependencies, capture data on historical runs, and minimize data quality issues.
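For context, wiring Airflow to OpenLineage is largely a configuration exercise. The sketch below assumes the openlineage-airflow integration of that era and a Marquez-style metadata service; the URL and namespace values are placeholders, and the variable names and backend class path reflect how the integration was typically configured at the time rather than anything stated in this talk.

```python
# A minimal configuration sketch, e.g. set in a container entrypoint or exported
# in the deployment environment before Airflow starts. All values are hypothetical.
import os

# Point the integration at a metadata service (e.g. Marquez) and give this Airflow
# instance a namespace so lineage from multiple deployments stays separate.
os.environ["OPENLINEAGE_URL"] = "http://marquez:5000"        # hypothetical endpoint
os.environ["OPENLINEAGE_NAMESPACE"] = "my-airflow-instance"  # hypothetical namespace

# Register the lineage backend (assumed class path from the openlineage-airflow
# integration of that era) so task metadata is emitted on each run.
os.environ["AIRFLOW__LINEAGE__BACKEND"] = "openlineage.lineage_backend.OpenLineageBackend"
```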
Data Pipeline HealthCheck for Correctness, Performance, and Cost Efficiency
We are witnessing a rapid growth in the number of mission-critical data pipelines that leaders of data products are responsible for. “Are your data pipelines healthy?” This question was posed to more than 200 leaders of data products from various industries. The answers ranged from “unfortunately, no” to “they are mostly fine, but I am always afraid that something or the other will cause a pipeline to break”. This talk presents the concept of Pipeline HealthCheck (PHC) which enables leaders of data products to have high confidence in the correctness, performance, and cost efficiency of their data pipelines. More importantly, PHC enables leaders of data products as well as their development and operations teams to have high confidence in their ability to quickly detect, troubleshoot, and fix problems that make data pipelines unhealthy. The talk also includes a demo of how PHC helps handle common problems in data pipelines like incorrect results, missing SLAs, and overshooting cost budgets.
The scheduler is the core of Airflow, and it’s a complex beast. In this session we will go through the scheduler in some detail: how it works, what the communication paths are, and what processing is done where.
Discussion panel: Keep your Airflow secure
You might have heard recent news about ransomware attacks on many companies. Quite recently the U.S. Department of Justice elevated the priority of ransomware investigations to the same level as terrorism. Security aspects of running software and so-called “supply-chain attacks” have certainly made the press recently. You might also have read about a security researcher who made USD 13,000 in bounties by finding and contacting companies that were running old, un-patched versions of Airflow - even though the ASF security process worked well and the Airflow PMC had fixed those issues long ago. If any of this rings a bell, then this session is for you. In this session Dolev (a security expert and researcher who recently submitted security issues to Airflow), Ash and Jarek (Airflow PMC members) will discuss the state of security, best practices for keeping your Airflow secure, and why it is important. The discussion will be moderated by Tomasz Urbaszek, Airflow PMC member. You can get a glimpse of what they will talk about through this blog post.
Drift Bio: The Future of Microbial Genomics with Apache Airflow
In recent years, the bioinformatics world has seen an explosion in genomic analysis as gene sequencing technologies have become exponentially cheaper. Tests that previously would have cost tens of thousands of dollars will soon run at pennies per sequence. This glut of data has exposed a notable bottleneck in the current suite of technologies available to bioinformaticians. At Drift Biotechnologies, we use Apache Airflow to transition traditionally on-premise large scale data and deep learning workflows for bioinformatics to the cloud, with an emphasis on workflows and data from next generation sequencing technologies.
Dynamic Security Roles in Airflow for Multi-Tenancy
Multi-tenant Airflow instances can help save costs for an organization. This talk will walk through how we dynamically assigned roles to users based on groups in Active Directory, so that teams would have access in the UI to the DAGs they created on our multi-tenant Airflow instance. To achieve this, we created our own custom AirflowSecurityManager class that ultimately ties LDAP and RBAC together.
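The talk’s custom security manager isn’t public, so the sketch below shows the closest declarative equivalent in a webserver_config.py: standard Flask-AppBuilder LDAP settings that map directory groups to Airflow RBAC roles. Server addresses, search bases, and group DNs are hypothetical.

```python
# webserver_config.py sketch (assumptions, not the talk's implementation).
from flask_appbuilder.security.manager import AUTH_LDAP

AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldap://ad.example.com"        # hypothetical AD endpoint
AUTH_LDAP_SEARCH = "ou=users,dc=example,dc=com"   # hypothetical search base
AUTH_LDAP_GROUP_FIELD = "memberOf"

# Create users on first login and re-sync their roles from AD group membership
# every time they log in.
AUTH_USER_REGISTRATION = True
AUTH_ROLES_SYNC_AT_LOGIN = True

# Map AD groups to Airflow roles; a custom AirflowSecurityManager subclass could
# instead compute these assignments dynamically per team, as the talk describes.
AUTH_ROLES_MAPPING = {
    "cn=team-a,ou=groups,dc=example,dc=com": ["TeamA"],
    "cn=team-b,ou=groups,dc=example,dc=com": ["TeamB"],
    "cn=airflow-admins,ou=groups,dc=example,dc=com": ["Admin"],
}

# To plug in a custom subclass instead, point the webserver at it, e.g.:
# SECURITY_MANAGER_CLASS = MyTeamAwareSecurityManager  # hypothetical AirflowSecurityManager subclass
```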
The Airflow scheduler uses DAG definitions to monitor the state of tasks in the metadata database, and triggers the task instances whose dependencies have been met. In other words, scheduling is based on the state of dependencies. The idea of event-based scheduling is to let operators send events to the scheduler to trigger a scheduling action, such as starting, stopping, or restarting jobs. Event-based scheduling allows potential support for richer scheduling semantics, such as periodic execution and manual triggering at per-operator granularity.
Guaranteeing pipeline SLAs and data quality standards with Databand
We’ve all heard the phrase “data is the new oil.” But really imagine a world where this analogy is more literal - a world where problems in the flow of data (delays, low quality, high volatility) could bring down whole economies. When data is the new oil, with people and businesses similarly reliant on it, how do you avoid the fires, spills, and crises? As data products become central to companies’ bottom line, data engineering teams need to create higher standards for the availability, completeness, and fidelity of their data. In this session we’ll demonstrate how Databand helps organizations guarantee the health of their Airflow pipelines. Databand is a data pipeline observability system that monitors SLAs and data quality issues, and proactively alerts users on problems to avoid data downtime. The session will be led by Josh Benamram, CEO and Cofounder of Databand.ai. Josh will be joined by Vinoo Ganesh, an experienced software engineer, system architect, and current CTO of Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Join to see how Databand.ai can help you create stable, reliable pipelines that your business can depend on!
Introducing Viewflow: a framework for writing data models without writing Airflow code
In this talk, we present Viewflow, an open-source Airflow-based framework that allows data scientists to create materialized views in SQL, R, and Python without writing Airflow code. We will start by explaining what problem Viewflow solves: writing and maintaining complex Airflow code instead of focusing on data science. Then we will see how Viewflow solves that problem. We will continue by showing how to use Viewflow with several real-world examples. Finally, we will see what the upcoming features of Viewflow are! Resources: Announcement blog post: https://medium.com/datacamp-engineering/viewflow-fe07353fa068 GitHub repo: https://github.com/datacamp/viewflow
Lessons Learned while Migrating Data Pipelines from Enterprise Schedulers to Airflow
Digital transformation, application modernization, and data platform migration to the cloud are key initiatives in most enterprises today. These initiatives are stressing the scheduling and automation tools in these enterprises to the point that many users are looking for better solutions. A survey revealed that 88% of users believe that their business will benefit from an improved automation strategy across technology and business. Airflow has an excellent opportunity to capture mindshare and emerge as the leading solution here. At Unravel, we are seeing the trend where many of our enterprise customers are at various stages of migrating to Airflow from their enterprise schedulers or ETL/ELT orchestration tools like Autosys, Informatica, Oozie, Pentaho, and Tidal. In this talk, we will share lessons learned and best practices found in the entire pipeline migration life-cycle, which includes:
(i) the evaluation process which led to picking Airflow, including certain aspects where Airflow can do better;
(ii) the challenges in discovering and understanding all components and dependencies that need to be considered in the migration;
(iii) the challenges arising during the pipeline code and data migration, especially in getting single-pane-of-glass and apples-to-apples views to track the progress of the migration;
(iv) the challenges in ensuring that the pipelines that have been migrated to Airflow are able to perform and scale on par with or better than what existed previously.
Looking ahead: What comes after Airflow 2.0?
Modernize a decade old pipeline with Airflow 2.0
As a follow-up to https://airflowsummit.org/sessions/teaching-old-dag-new-tricks/ , in this talk we would like to share a happy-ending story of how Scribd fully migrated its data platform to the cloud and Airflow 2.0. We will talk about the data validation tools and task trigger customizations the team built to smooth out the transition. We will share how we completed the Airflow 2.0 migration, which started from an unsupported MySQL version, along with metrics that show why everyone should perform the upgrade. Lastly, we will discuss how large-scale backfills (10 years' worth of runs) are managed and automated at Scribd.
An informal and fun chat about the journey that we took and the decisions that we made in building Amazon Managed Workflows for Apache Airflow. We will talk about:
- Our first tryst with understanding Airflow
- Talking to Amazon Data Engineers and how they ran workflows at scale
- Key design decisions and the reasons behind them
- Road ahead, and what we dream about for the future of Apache Airflow
- Open-Source tenets and commitment from the team
We will leave time at the end for a short AMA/Questions.
Next-Gen Astronomer Cloud
Astronomer founders Ry Walker and Greg Neiheisel will preview the upcoming next-gen Astronomer Cloud product offering.
Operating contexts: patterns around defining how a DAG should behave in dev, staging, prod & beyond
As people define and publish a DAG, it can be really useful to make it clear how the DAG should behave under different “operating contexts”. Common operating contexts may match your different environments (dev / staging / prod) and/or match your operating needs (quick run, full backfill, test run, …). Over the years, patterns have emerged among workflow authors, teams, and organizations, yet little has been shared about how to approach this. In this talk, we’ll describe what an “operating context” is, why it’s useful, and common patterns and best practices around this topic.
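As a concrete illustration of the pattern (not code from the talk), the sketch below derives an operating context from an environment variable and uses it to adjust a DAG’s schedule, retries, and catchup behaviour; the variable name and per-context settings are hypothetical.

```python
# Minimal sketch: one DAG definition whose behaviour changes per operating context.
import os
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical environment variable set differently in dev, staging, and prod.
CONTEXT = os.environ.get("OPERATING_CONTEXT", "dev")

# Per-context knobs: fail fast in dev, retry and backfill history only in prod.
SETTINGS = {
    "dev":     {"schedule": None,     "retries": 0, "catchup": False},
    "staging": {"schedule": "@daily", "retries": 1, "catchup": False},
    "prod":    {"schedule": "@daily", "retries": 3, "catchup": True},
}[CONTEXT]


def run_job() -> None:
    print(f"Running in the {CONTEXT!r} operating context")


with DAG(
    dag_id="context_aware_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=SETTINGS["schedule"],
    catchup=SETTINGS["catchup"],
    default_args={"retries": SETTINGS["retries"], "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="run_job", python_callable=run_job)
```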
Orchestrating ELT with Fivetran and Airflow
At Fivetran, we are seeing many organizations adopt the Modern Data Stack to suit the breadth of their data needs. However, as incoming data sources begin to scale, it can be hard to manage and maintain the environment, with more time spent repairing and reengineering old data pipelines than building new ones. This talk will introduce a number of new Airflow providers, including airflow-provider-fivetran, and discuss some of the benefits and considerations we are seeing data engineers, data analysts, and data scientists experience in adopting them.
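As a rough sketch of what orchestrating a Fivetran sync from Airflow can look like with this provider: the import paths and parameter names below are assumptions based on early releases of airflow-provider-fivetran, and the connector and connection IDs are placeholders.

```python
# Hedged sketch: trigger a Fivetran connector sync, then wait for it to finish.
from datetime import datetime

from airflow import DAG
from fivetran_provider.operators.fivetran import FivetranOperator  # assumed module path
from fivetran_provider.sensors.fivetran import FivetranSensor      # assumed module path

with DAG(
    dag_id="fivetran_elt",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the sync for one connector...
    trigger_sync = FivetranOperator(
        task_id="trigger_fivetran_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="my_connector_id",  # hypothetical
    )

    # ...then wait for Fivetran to report the sync as complete before any
    # downstream transformation runs.
    wait_for_sync = FivetranSensor(
        task_id="wait_for_fivetran_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="my_connector_id",  # hypothetical
        poke_interval=60,
    )

    trigger_sync >> wait_for_sync
```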
Pinterest’s Migration Journey
Last year, we shared why we selected Airflow as our next-generation workflow system. This year, we will dive into the journey of migrating 3000+ workflows and 45000+ tasks to Airflow. We will discuss the infrastructure additions to support such loads, the partitioning and prioritization of different workflow tiers defined in house, the migration tooling we built to onboard users, the translation layers between our old DSLs and the new one, our internal k8s executor that leverages Pinterest’s Kubernetes fleet, and more. We want to share the technical and usability challenges of carrying out such a large migration over the course of a year, and how we overcame them to successfully migrate 100% of the workflows to our in-house workflow platform, branded Spinner.
Productionizing ML Pipelines with Airflow, Kedro, and Great Expectations
Machine Learning models can add value and insight to many projects, but they can be challenging to put into production due to problems like lack of reproducibility, difficulty maintaining integrations, and sneaky data quality issues. Kedro, a framework for creating reproducible, maintainable, and modular data science code, and Great Expectations, a framework for data validations, are two great open-source Python tools that can address some of these problems. Both integrate seamlessly with Airflow for flexible and powerful ML pipeline orchestration. In this talk we’ll discuss how you can leverage existing Airflow provider packages to integrate these tools to create sustainable, production-ready ML models.
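As a small illustration of the kind of integration the talk covers (not code from the speakers), the sketch below runs a Great Expectations check as an Airflow task before any model training would happen; it uses the legacy pandas-dataset API, and the file path and column name are hypothetical.

```python
# Minimal sketch: fail the DAG early if the training data violates an expectation.
from datetime import datetime

import great_expectations as ge
from airflow import DAG
from airflow.operators.python import PythonOperator


def validate_training_data() -> None:
    # Load the raw file as a Great Expectations dataset so expectations can be
    # evaluated directly on it.
    df = ge.read_csv("/data/training/features.csv")  # hypothetical path
    result = df.expect_column_values_to_not_be_null("customer_id")  # hypothetical column
    if not result.success:
        # Failing the task stops the DAG before a model is trained on bad data.
        raise ValueError(f"Data quality check failed: {result}")


with DAG(
    dag_id="ml_pipeline_with_validation",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="validate_training_data", python_callable=validate_training_data)
```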
Provision as a Service: Automating data center operations with Airflow at Cloudflare
Cloudflare’s network keeps growing, and that growth doesn’t just come from building new data centers in new cities. We’re also upgrading the capacity of existing data centers by adding newer generations of servers — a process that makes our network safer, faster, and more reliable for our users. In this talk, I’ll share how we’re leveraging Apache Airflow to build our own Provision-as-a-Service (PraaS) platform and cut by 90% the amount of time our team spent on mundane operational tasks.
At Snowflake, as you can imagine, we run a lot of data pipelines and tables curating metrics for all parts of the business. These are the lifeline of Snowflake’s business decisions. We also have a lot of source systems that display these metrics and make them accessible to end users. So what happens when your data model does not match your system? For example, your bookings numbers in Salesforce do not match the data model that curates bookings metrics. At Snowflake we ran into this problem over and over again. Facing this problem, we set out to build an infrastructure that lets users effortlessly sync the results of their data pipelines with any downstream or upstream system, giving us a central source of truth in our warehouse. This infrastructure was built on Snowflake using Airflow, and it allows a user to start syncing data by providing a few details, such as the model and the system to update. In this presentation we will show you how, using Airflow and Snowflake, we are able to use our data pipelines as the source of truth for all systems involved in the business. With this infrastructure we are able to use Snowflake models as a central source of truth for all applications used throughout the company. This ensures that any number synced in this way is always the same no matter which user sees it.
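A hedged sketch of the config-driven idea described above (not Snowflake’s actual infrastructure): a user declares which warehouse model to sync and which system to update, and an Airflow task reads that model via the Snowflake provider and pushes it downstream. The SyncConfig fields, the push step, and all connection and table names are hypothetical.

```python
# Minimal sketch of a warehouse-to-systems sync driven by a small config object.
from dataclasses import dataclass
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


@dataclass
class SyncConfig:  # hypothetical user-supplied details
    model: str          # warehouse table/view that is the source of truth
    target_system: str  # downstream/upstream system to update, e.g. "salesforce"


SYNCS = [SyncConfig(model="analytics.bookings_metrics", target_system="salesforce")]


def sync_model(model: str, target_system: str) -> None:
    hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
    rows = hook.get_pandas_df(f"SELECT * FROM {model}")
    # Placeholder for the call that writes the rows into the target system's API.
    print(f"Would push {len(rows)} rows from {model} to {target_system}")


with DAG(
    dag_id="warehouse_to_systems_sync",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    for cfg in SYNCS:
        PythonOperator(
            task_id=f"sync_{cfg.target_system}",
            python_callable=sync_model,
            op_kwargs={"model": cfg.model, "target_system": cfg.target_system},
        )
```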
Robots are your friends - using automation to keep your Airflow operators up to date
As part of my role at Google, maintaining samples for Cloud Composer, hosted managed Airflow, is crucial. It’s not feasible for me to try out every sample every day to check that it’s working. So, how do I do it? Automation! While I won’t let the robots touch everything, they let me know when it’s time to pay attention. Here’s how:
Step 0: An update for the operators is released
Step 1: A GitHub bot called Renovate Bot opens up a PR to a special requirements file to make this update
Step 2: Cloud Build runs unit tests to make sure none of my DAGs immediately break (see the sketch after this list)
Step 3: The PR is approved and merged to main
Step 4: Cloud Build updates my dev environment
Step 5: I look at my DAGs in dev to make sure all is well. If there is a problem, I resolve it manually and revert my requirements file.
Step 6: I manually update my prod PyPI packages
I’ll discuss which automation tools I choose to use and why, and the places where I intentionally leave manual steps to ensure proper oversight.
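A minimal sketch of the kind of unit test Step 2 alludes to: load every DAG with Airflow’s DagBag and fail the build if any file no longer imports after a dependency bump. The dag_folder path is a hypothetical repo layout.

```python
# Sketch of a CI test that catches DAGs broken by an operator/package update.
from airflow.models import DagBag


def test_dags_import_without_errors():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)  # hypothetical path
    # import_errors maps file paths to the exception raised while parsing them.
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"
    assert len(dag_bag.dags) > 0, "No DAGs were loaded; check the dag_folder path."
```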
Running Big Data Applications in production with Airflow + Firebolt
In this talk we’ll see some real world examples from Firebolt customers demonstrating how Airflow is used to orchestrate operational data analytics applications with large data volumes, while keeping query latency low.
SciDAP: Airflow and CWL-powered bioinformatics platform
Reproducibility is a fundamental principle of scientific research. This also applies to the computational workflows that are used to process research data. Common Workflow Language (CWL) is a highly formalized way to describe pipelines that was developed to achieve reproducibility and portability of computational analysis. However, only a few workflow execution platforms could run CWL pipelines. Here, we present CWL-Airflow – an extension for Airflow to execute CWL pipelines. CWL-Airflow serves as the processing engine for the Scientific Data Analysis Platform (SciDAP) – a data analysis platform that makes complex computational workflows both user-friendly and reproducible. In our presentation we are going to explain why we see Airflow as the perfect backend for running scientific workflows, what problems we encountered in extending Airflow to run CWL pipelines, and how we solved them. We will also discuss the pros and cons of limiting our platform to CWL pipelines, and potential applications of CWL-Airflow outside the realm of biology.
Airflow has a lot of moving parts, and it can be a little overwhelming for a new user - as I was not too long ago. Join me as we go through Airflow’s architecture at a high level, explore how DAGs work and run, and look at some of the good, the bad, and the unexpected things lurking inside.