“Why is my data missing?” “Why didn’t my Airflow job run?” “What happened to this report?” If you’ve been on the receiving end of any of these questions, you’re not alone. As data pipelines become increasingly complex and companies ingest more and more data, data engineers are on the hook for troubleshooting where, why, and how data quality issues occur, and most importantly, fixing them so systems can get up and running again. In this talk, Francisco Alberini, Monte Carlo’s first product hire, discusses the three primary factors that contribute to data quality issues and how data teams can leverage Airflow, dbt, and other solutions in their arsenal to conduct root cause analysis on their data pipelines.
Numeric results with bulletproof confidence: this is what companies actually sell when promoting their machine learning product. Yet this seems out of reach when the product is both generic and complex, with many of the inner calculations hidden from the end user. So how can code improvements or changes in core component performance be tested at scale? Implementing API and load tests is time-consuming but thorough: defining parameters, building infrastructure, and debugging. The bugs may be real, but they can also be the result of poor infrastructure implementation (who is testing the testers?). In this session we will discuss how Airflow can help scale up testing in a stable and sustainable way.
This talk is all about how we at Jagex manage DAGs at scale, focusing on the following challenges we faced and how we resolved them: keeping track of Airflow state, keeping track of each DAG’s state, DAGs as git submodules, updating Airflow with new DAGs, seamlessly automating Airflow deployment, and avoiding package dependency conflicts.
In this session we’ll be discussing the considerations and challenges when running Apache Airflow at scale. We’ll start by defining what it means to run Airflow at scale. Then we’ll dive deep into understanding limitations of the Airflow architecture, Scheduler processes, and configuration options. We’ll then define scaling workloads via containers and leveraging pools and priority, followed by scaling DAGs via dynamic DAGs/DAG factories, CI/CD, and DAG access control. Finally we’ll get into managing multiple Airflow environments, how to split up workloads, and how to provide central governance for Airflow environment creation and monitoring, with an example of distributing workloads across environments.
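On the DAG-factory point, a minimal sketch of the pattern (the team configs and commands here are illustrative placeholders, not from the talk) could look like this: one function stamps out similar DAGs from a list of configs, and each generated DAG is registered in the module globals so the scheduler’s parser can discover it.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical per-team configs; in practice these might come from YAML or a database.
TEAM_CONFIGS = [
    {"team": "growth", "schedule": "@hourly"},
    {"team": "finance", "schedule": "@daily"},
]


def build_dag(team: str, schedule: str) -> DAG:
    """Build one ingestion DAG for a given team from its config."""
    with DAG(
        dag_id=f"{team}_ingest",
        start_date=datetime(2022, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    ) as dag:
        BashOperator(task_id="extract", bash_command=f"echo extracting for {team}")
    return dag


# Register every generated DAG in the module namespace so the DAG parser finds it.
for cfg in TEAM_CONFIGS:
    globals()[f"{cfg['team']}_ingest"] = build_dag(**cfg)
```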
For most ML-based SaaS companies, the need to fulfill each customer’s KPI will usually be addressed by matching a dedicated model. Along with the benefits of optimizing the model’s performance, a model-per-customer solution carries heavy production complexity with it. In particular, incorporating up-to-date data as well as new features and capabilities as part of a model’s retraining process can become a major production bottleneck. In this talk, we will see how Riskified scaled up modeling operations based on MLOps ideas, and focus on how we used Airflow as our ML pipeline orchestrator. We will dive into how we wrap Airflow as an internal service, the goals we started with, the obstacles along the way and, finally, how we solved them. You will receive tools for setting up your own Airflow-based continuous training ML pipeline, and see how we adjusted it so that ML engineers and data scientists can collaborate and work in parallel using the same pipeline.
At Astronomer we have been longtime supporters of and contributors to open source Apache Airflow. In this session we will present Astronomer’s latest journey, Astro, our cloud-native managed service that simplifies data orchestration and reduces operational overhead. We will also discuss the increasing importance of data orchestration in modern enterprise data platforms, industry trends, and practical problems that arise in ever-expanding heterogeneous environments.
This session is about the state and future plans of the multi-tenancy feature of Airflow. Airflow has traditionally been a single-tenant product. Multiple instances could be bound together to provide a multi-tenant implementation, and when using a modern infrastructure - Kubernetes - you could even reuse resources between them, but it was not a true “multi-tenant” solution. But Airflow is becoming more of a platform now, and multi-tenancy as a feature of the platform is highly expected by a number of users. In 2022 we’ve started to add multi-tenant features and we are aiming to make Airflow multi-tenant in the near* future. This talk is about the state of multi-tenancy now and the future plans we have for Airflow becoming a fully multi-tenant platform.
In this talk we want to present how Airbnb extends the REST API to support on-demand workloads. A DAG object is created from a local environment like a Jupyter notebook, serialized into binary, and transported to the API. The API persists the DAG object into the meta DB, and the Airflow scheduler and worker are extended to process this new kind of DAG.
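A purely illustrative sketch of the client side of such a flow is below; the /api/v1/on-demand-dags endpoint is hypothetical (it stands in for Airbnb’s internal extension, not stock Airflow), and plain pickle stands in for whatever binary serialization is actually used.

```python
import pickle
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

# Build a small DAG locally, e.g. from a Jupyter notebook session.
with DAG(dag_id="notebook_adhoc", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    BashOperator(task_id="hello", bash_command="echo hello from a notebook")

# Serialize the DAG object into binary form.
payload = pickle.dumps(dag)

# Ship it to the extended (non-standard) REST endpoint, which persists it
# into the metadata DB for the scheduler and workers to pick up.
resp = requests.post(
    "https://airflow.example.com/api/v1/on-demand-dags",  # hypothetical endpoint
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
)
resp.raise_for_status()
```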
OpenLineage is an open standard for metadata and lineage collection designed to instrument jobs as they are running. The standard has become remarkably adept at understanding the lifecycle of data within an organization. Additionally, Airflow lets you make use of OpenLineage with a convenient integration. Gathering data lineage has never been easier. In this talk, we’ll provide an up-to-date report on OpenLineage features and the Airflow integration – essential information for data governance architects & engineers.
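For reference, a minimal configuration sketch for the integration might look like the following, assuming the openlineage-airflow package and an OpenLineage-compatible collector such as Marquez at the example URL; these values would normally be set as environment variables on the Airflow deployment and are shown in Python form only for brevity.

```python
import os

# Where OpenLineage events should be sent (e.g. a Marquez instance).
os.environ["OPENLINEAGE_URL"] = "http://marquez.example.com:5000"
# Logical namespace used to group jobs in the lineage graph.
os.environ["OPENLINEAGE_NAMESPACE"] = "analytics"

# On Airflow versions before 2.3, the lineage backend also needs to be pointed
# at OpenLineage explicitly; newer versions pick up the integration via a plugin.
os.environ["AIRFLOW__LINEAGE__BACKEND"] = "openlineage.lineage_backend.OpenLineageBackend"
```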
Recently there has been much discussion around data monitoring, particularly with regard to reducing the time to mitigate data quality problems once they’ve been detected. The problem with reactive or periodic monitoring as the de facto standard for maintaining data quality is that it’s expensive. By the time a data problem has been identified, its effects may have been amplified across a myriad of downstream consumers, leaving you (a data engineer) with a big mess to clean up. In this talk, we will present an approach for proactively addressing data quality problems using orchestration based on a central metadata graph. Specifically, we will walk through use cases highlighting how the open source metadata platform DataHub can enable proactive pipeline circuit-breaking by serving as the source of truth for both technical & semantic health status for a pipeline’s data dependencies. We’ll share practical recipes for how three powerful open source projects can be combined to build reliable data pipelines: Great Expectations for generating technical health signals in the form of assertion results on datasets, DataHub for providing a semantic identity for a dataset, including ownership, compliance, & lineage, and Airflow for orchestration.
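As an illustration of the circuit-breaking idea (not DataHub’s actual operators; the health check and endpoint below are hypothetical), a short-circuit task can ask DataHub whether the upstream dataset’s latest assertions passed before any downstream transform runs.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator

DATAHUB_GMS = "http://datahub.example.com:8080"  # assumed DataHub endpoint
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.orders,PROD)"


def upstream_is_healthy() -> bool:
    """Return True only if the upstream dataset's latest assertions passed.

    Hypothetical endpoint; a real implementation would query DataHub's GraphQL
    API (or its circuit-breaker helpers) for assertion run results.
    """
    resp = requests.get(f"{DATAHUB_GMS}/health", params={"urn": DATASET_URN}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("passing", False)


with DAG(
    dag_id="orders_transform",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # If the upstream data is unhealthy, the circuit breaker skips everything downstream.
    check_upstream = ShortCircuitOperator(task_id="check_upstream_health", python_callable=upstream_is_healthy)
    transform = BashOperator(task_id="run_transform", bash_command="dbt run --select orders")
    check_upstream >> transform
```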
This talk will cover the challenges we can face managing a large number of Airflow instances in a private environment: monitoring and metrics layers for the production environment, collecting and customizing logs, resource consumption and green IT, providing support for users and shared responsibility, and pain points.
In Apple, we are building a self-serve data platform based on Airflow. Self-serve means users can create, deploy, and run their DAGs freely. With the provided logs and metrics, users are able to test or troubleshoot DAGs on their own. Today, a common use case is that users want to test one or a few tasks in their DAG. However, when they trigger the DAG, all tasks run instead of just the ones they are interested in. To save time and resources, lots of users choose to manually mark each task they want to skip as complete. Can we do better than that? Is there an easy-peasy way to skip tasks? In this lightning talk, we would like to share the challenges we had, the solution we came up with, and the lessons we learned.
TensorFlow Extended (TFX) can run machine learning pipelines on Airflow, but by default all the steps run in the same workers where the Airflow DAG is running. This can lead to excessive usage of resources and breaks the assumption that Airflow is a scheduler; it becomes the data processing platform as well. In this session, we will see how to use TFX with third-party services on top of Google Cloud Platform. The data processing steps can be run in Dataflow, Spark, Flink and other runners (parallelizing the processing of data and scaling up to petabytes), and the training steps can be run in Vertex or other external services. After this workshop, you will have learnt how to externalize any TFX heavyweight computing outside Airflow, while maintaining Airflow as the orchestrator for your machine learning pipelines.
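A condensed sketch of the idea using the standard TFX Airflow runner (bucket names and the project ID are placeholders): the Beam-based steps are pushed out to Dataflow via beam_pipeline_args, so Airflow remains purely the orchestrator.

```python
from datetime import datetime

from tfx.components import CsvExampleGen
from tfx.orchestration import pipeline
from tfx.orchestration.airflow.airflow_dag_runner import AirflowDagRunner, AirflowPipelineConfig

# Beam-based steps run on Dataflow instead of the Airflow workers.
beam_pipeline_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",          # placeholder GCP project
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
]

example_gen = CsvExampleGen(input_base="gs://my-bucket/data")

tfx_pipeline = pipeline.Pipeline(
    pipeline_name="taxi_pipeline",
    pipeline_root="gs://my-bucket/tfx-root",
    components=[example_gen],
    beam_pipeline_args=beam_pipeline_args,
)

# The runner turns the TFX pipeline into a regular Airflow DAG; Airflow only schedules steps.
DAG = AirflowDagRunner(
    AirflowPipelineConfig({"schedule_interval": None, "start_date": datetime(2022, 1, 1)})
).run(tfx_pipeline)
```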
Airflow has a built-in SLA alert mechanism: when the scheduler sees an SLA miss for some task, it sends an alert by email. The problem is that while this email is nice, we can’t really know when each task eventually succeeds. Moreover, even if there were such an email upon success following an SLA miss, it would not give us a good view of the current status at any given time. In order to solve this, we developed SLAyer, an application that gets information about SLA misses from Airflow’s database and reports the current status to Prometheus, providing metrics per DAG, task, and execution date currently in violation of its SLA.
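An illustrative sketch of this idea (the names and export logic are assumptions, not the actual SLAyer code): read SLA misses from Airflow’s metadata database and expose them as a Prometheus gauge.

```python
import time

from airflow.models import SlaMiss
from airflow.utils.session import create_session
from prometheus_client import Gauge, start_http_server

SLA_VIOLATION = Gauge(
    "airflow_sla_violation",
    "Task runs that have missed their SLA",
    ["dag_id", "task_id", "execution_date"],
)


def report_sla_misses() -> None:
    """Publish one gauge sample per recorded SLA miss.

    A fuller implementation would also check the corresponding task instance's
    state and clear the gauge once the task eventually succeeds.
    """
    with create_session() as session:
        for miss in session.query(SlaMiss).all():
            SLA_VIOLATION.labels(miss.dag_id, miss.task_id, miss.execution_date.isoformat()).set(1)


if __name__ == "__main__":
    start_http_server(9102)  # Prometheus scrapes this port
    while True:
        report_sla_misses()
        time.sleep(60)
```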
This talk tells the story of how we have approached data and analytics as a startup at Preset and how the need for a data orchestrator grew over time. Our stack is (loosely) Fivetran/Segment/dbt/BigQuery/Hightouch, and we finally got to a place where we suffer quite a bit from not having an orchestrator and are bringing in Airflow to address our orchestration needs. This talk is about how startups approach solving data challenges, the shifting role of the orchestrator in the modern data stack, and the growing need for an orchestrator as your data platform becomes more complex.
According to analysts, 87 percent of enterprises have already adopted hybrid cloud strategies (https://www.flexera.com/blog/industry-trends/trend-of-cloud-computing-2020/). Customers have many reasons why they need to support hybrid environments, from maximising the value from heritage systems to meeting local compliance and data processing regulations. As they build their data pipelines, they increasingly need to be able to orchestrate those across on-premises and cloud environments. In this session, I will share how you can leverage Apache Airflow to orchestrate a workflow using data sources inside and outside the cloud.
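As one possible shape of such a workflow (hosts, buckets, and queries are placeholders, and the SSH and Amazon provider packages are assumed to be installed), an on-premises extraction step can hand off to a cloud-side transformation step within a single DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.athena import AthenaOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="hybrid_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the export on an on-premises edge node and push the result to cloud storage.
    extract_on_prem = SSHOperator(
        task_id="extract_on_prem",
        ssh_conn_id="onprem_edge_node",  # Airflow connection to the on-premises host
        command=(
            "/opt/etl/export_orders.sh {{ ds }} && "
            "aws s3 cp /tmp/orders_{{ ds }}.csv s3://landing/orders/{{ ds }}.csv"
        ),
    )
    # Transform the landed data using a cloud-native service.
    transform_in_cloud = AthenaOperator(
        task_id="transform_in_cloud",
        query="INSERT INTO analytics.orders SELECT * FROM landing.orders WHERE dt = '{{ ds }}'",
        database="landing",
        output_location="s3://athena-results/hybrid_pipeline/",
    )
    extract_on_prem >> transform_in_cloud
```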
Fivetran’s Airflow provider allows Recharge to manage our connector syncs alongside the other DAGs orchestrating related components of our core data pipelines. The provider has enabled increased flexibility in sync schedules, custom alerting, and quicker response times to failures.
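A brief sketch of what that looks like with the community Fivetran provider (the airflow-provider-fivetran package; the connector_id is a placeholder): one task triggers the connector sync and a sensor waits for it to finish before downstream work runs.

```python
from datetime import datetime

from airflow import DAG
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

with DAG(
    dag_id="fivetran_orders_sync",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Kick off the Fivetran connector sync.
    trigger_sync = FivetranOperator(
        task_id="trigger_orders_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="replace_me",  # placeholder Fivetran connector id
    )
    # Block until the sync completes, so downstream tasks see fresh data.
    wait_for_sync = FivetranSensor(
        task_id="wait_for_orders_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="replace_me",
        poke_interval=60,
    )
    trigger_sync >> wait_for_sync
```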
At Credit Karma, we enable financial progress for more than 100 million members by recommending personalized financial products when they interact with our application. In this talk we introduce our machine learning platform for building interactive and production model-building workflows to serve relevant financial products to Credit Karma users. Vega, Credit Karma’s machine learning platform, has three major components: 1) QueryProcessor for feature and training data generation, backed by Google BigQuery; 2) PipelineProcessor for feature transformations, offline scoring, and model analysis, backed by Apache Beam; and 3) ModelProcessor for running Tensorflow and Scikit models, backed by Google AI Platform, which gives data scientists the flexibility to explore different kinds of machine learning or deep learning models, ranging from gradient boosted trees to neural networks with complex structures. Vega exposes a unified Python API for feature generation, modeling ETL, model training, and model analysis. Vega supports writing interactive notebooks and Python scripts to run these components in local mode with sampled data and in cloud mode for large-scale distributed computing. Vega provides the ability to chain the processors provided by data scientists through Python code to define the entire workflow. It then automatically generates the execution plan for deploying the workflow on Apache Airflow to run offline model experiments and refreshes. Overall, with the unified Python API and automated Airflow DAG generation, Vega has improved the efficiency of ML engineering. Using Airflow, we deploy more than 20K features and 100 models daily.
Resilient systems have the capability to recover when stressed by load, bugs in the workflow, and failure of any task. Reliability of the infrastructure or platform is not sufficient to run workflows reliably. It is critical to bring in resiliency practices during the design and build phase of the workflow to improve reliability, performance, and operational aspects of the workflow. In this session, we will go through the architecture of Airflow through the lens of reliability, idempotency, designing for failures, applying back pressure, and best practices. What we do not cover: infrastructure/platform/product reliability.
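To make the idempotency point concrete (the paths and job script are placeholders, not from the talk), a task can write to a partition derived from the logical date and overwrite it on every run, so retries and backfills never duplicate data.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="idempotent_load",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_partition = BashOperator(
        task_id="load_partition",
        retries=3,  # design for transient failures
        bash_command=(
            "spark-submit load_job.py "
            "--input s3://raw/events/{{ ds }}/ "
            "--output s3://warehouse/events/dt={{ ds }}/ "
            "--mode overwrite"  # overwrite, never append, so reruns stay idempotent
        ),
    )
```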
This session will talk about the awesome new features the community has built that will be part of Airflow 2.3. Highlights: dynamic task mapping, DB downgrades, pruning old DB records, connections using JSON, and UI improvements.
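Of those, dynamic task mapping is the headline feature; a minimal sketch (Airflow 2.3+) looks like the following, where the number of mapped task instances is decided at runtime from an upstream task’s output.

```python
from datetime import datetime
from typing import List

from airflow.decorators import dag, task


@dag(start_date=datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def dynamic_mapping_demo():
    @task
    def list_files() -> List[str]:
        # Placeholder: in practice this might list objects in a bucket.
        return ["a.csv", "b.csv", "c.csv"]

    @task
    def process(file_name: str) -> str:
        return f"processed {file_name}"

    # One mapped task instance is created per element returned by list_files().
    process.expand(file_name=list_files())


dynamic_mapping_demo()
```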