This talk explores the implementation of Apache Airflow for blockchain ETL orchestration and indexing, and the adoption of GitOps at Circle. It will cover CI/CD tips, architectural choices for managing blockchain data at scale, engineering practices that enable data scientists, and lessons learned from production.
DAG integrity is critical, and so are coding conventions and consistent standards across the team. In this talk, we will share lessons learned from testing and verifying our DAGs in our GitHub workflows: testing as part of the pull request process, and automated deployment (eventually to production) once merged. We will dig into how we have unlocked additional efficiencies, how we catch errors before they are deployed, and why we are better off for having both Airflow and plenty of checks in CI before we merge and deploy.
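As a hedged illustration of the kind of checks such a workflow can run, the pytest sketch below loads the project's DagBag and enforces a couple of conventions; the `dags/` path, the owner requirement, and the tag requirement are assumptions for the example, not the speakers' actual rules.

```python
# A minimal sketch of DAG integrity checks that could run in CI on every pull
# request; paths and conventions here are illustrative assumptions.
from airflow.models import DagBag


def test_dagbag_has_no_import_errors():
    # Parse every DAG file in the project without executing any tasks.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}, f"Import errors: {dag_bag.import_errors}"


def test_every_dag_declares_an_owner_and_tags():
    # Example coding-convention check enforced before merge.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get("owner"), f"{dag_id} is missing an owner"
        assert dag.tags, f"{dag_id} has no tags"
```

Running tests like these as a required status check keeps broken or non-conforming DAGs from ever reaching the deployment branch.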
Behaviour Driven Development can, in the simplest of terms, be described as Test Driven Development, only readable. It is of course more than that, but that is not the aim of this talk. This talk aims to show: how to write tests before you write a single line of Airflow code; how to create reusable and readable steps for setting up tests in a given-when-then manner; how to test rendering and execution of your DAG’s tasks; and real-world examples from a monorepo containing multiple Airflow projects. All of it written only with pytest, and some code I stole from smart people in github.com/apache/airflow/tests.
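To make the given-when-then idea concrete, here is a small pytest sketch under assumed names; the fixture and the example DAG are hypothetical, and the rendering check avoids touching any Airflow metadata database.

```python
# A sketch of a given-when-then style test written with plain pytest; the
# fixture name and example DAG are hypothetical.
import pendulum
import pytest
from airflow import DAG
from airflow.operators.bash import BashOperator


@pytest.fixture
def given_a_dag_with_a_templated_task():
    # given: a DAG whose task uses a Jinja template
    with DAG(dag_id="example", start_date=pendulum.datetime(2024, 1, 1)) as dag:
        BashOperator(task_id="say_date", bash_command="echo {{ ds }}")
    return dag


def test_bash_command_renders_the_logical_date(given_a_dag_with_a_templated_task):
    task = given_a_dag_with_a_templated_task.get_task("say_date")
    # when: templates are rendered against a minimal, hand-built context
    rendered = task.render_template(task.bash_command, {"ds": "2024-01-02"})
    # then: the macro is expanded as expected
    assert rendered == "echo 2024-01-02"
```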
This talk is presented by Broadcom. Airflow’s “workflow as code” approach has many benefits, including dynamic pipeline generation and flexibility and extensibility in a seamless development environment. However, what challenges do you face as you expand your Airflow footprint across your organization? What if you could enhance Airflow’s monitoring capabilities, forecast DAG and task executions, obtain predictive alerting, visualize trends, and get more robust logging? Broadcom’s Automation Analytics & Intelligence (AAI) offers advanced analytics for workload automation across cloud and on-premises environments. It connects easily with Airflow to offer improved visibility into dependencies between tasks in Airflow DAGs, along with the workload’s critical path, dynamic SLA management, and more. Join our presentation to hear how AAI can help you improve service delivery. We will also lead a workshop that dives deeper into how easy it is to install our Airflow Connector and get started visualizing your Airflow DAGs to optimize your workload and identify issues before they impact your business.
Airflow is not just purpose-built for data applications. It is a job scheduler on steroids. This is exactly what a cloud platform team needs: a configurable and scalable automation tool that can handle thousands of administrative tasks. Come learn how one enterprise platform team used Airflow to support cloud infrastructure at unprecedented scale.
In this talk, we will explore how adding custom dependency checks into Airflow’s scheduling system can elevate Airflow’s performance. We will specifically discuss how we added general upstream-event dependency checking, as well as how to make Airflow aware of used and available compute resources so that the system can better decide when and where to run a given task on Kubernetes infrastructure. We’ll cover why the existing dependency checking in Airflow is not sufficient for our use case, why adding custom code to Airflow is needed, and the pros and cons of this approach.
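As a rough sketch of the mechanism (not the speakers' implementation), Airflow's internal `BaseTIDep` hook lets a dependency yield passing or failing statuses that the scheduler consults before queuing a task; the capacity check below is a hypothetical placeholder, and attaching custom deps this way relies on Airflow 2.x internals that can change between versions.

```python
# A heavily simplified sketch of a custom task-instance dependency. BaseTIDep
# is an Airflow-internal hook; cluster_has_capacity is a hypothetical placeholder.
from airflow.models.baseoperator import BaseOperator
from airflow.ti_deps.deps.base_ti_dep import BaseTIDep


def cluster_has_capacity(ti) -> bool:
    # Placeholder: query your Kubernetes cluster or metrics backend here.
    return True


class ComputeCapacityDep(BaseTIDep):
    NAME = "Compute capacity available"
    IGNORABLE = True

    def _get_dep_statuses(self, ti, session, dep_context):
        if cluster_has_capacity(ti):
            yield self._passing_status(reason="Cluster has free capacity.")
        else:
            yield self._failing_status(reason="Cluster is saturated; deferring task.")


class CapacityAwareOperator(BaseOperator):
    # Extend the default dependency set (Airflow 2.x internals; may vary by version).
    deps = BaseOperator.deps | {ComputeCapacityDep()}

    def execute(self, context):
        self.log.info("Running with the capacity check satisfied.")
```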
Balyasny Asset Management (BAM) is a diversified global investment firm founded in 2001 with over $20 billion in assets under management. As dbt took hold at BAM, we had multiple teams building dbt projects against Snowflake, Redshift, and SQL Server. The common question was: how can we quickly and easily productionise our projects? Airflow is the orchestrator of choice at BAM, but our dbt users ranged from Airflow power users to people who’d never heard of Airflow before. We built a single solution on top of Cosmos that allowed us to: decouple the dbt project from the Airflow repository; have each dbt node run as a separate Airflow task; allow users to run dbt with little to no Airflow knowledge; enable users to have fine-grained control over how dbt is run and to combine it with other Airflow tasks; and provide observability, monitoring, and alerting.
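A minimal sketch of this pattern with the astronomer-cosmos package might look like the following; the project path, connection id, and Snowflake settings are illustrative assumptions, not BAM's configuration.

```python
# A hedged sketch of a Cosmos-backed dbt DAG; paths and connection names are
# assumptions for illustration.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

dbt_snowflake_daily = DbtDag(
    dag_id="dbt_snowflake_daily",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    # Cosmos expands the dbt project so each model/test runs as its own Airflow task.
    project_config=ProjectConfig("/opt/dbt/my_project"),
    profile_config=ProfileConfig(
        profile_name="my_project",
        target_name="prod",
        profile_mapping=SnowflakeUserPasswordProfileMapping(
            conn_id="snowflake_default",
            profile_args={"database": "ANALYTICS", "schema": "MARTS"},
        ),
    ),
)
```

Because the dbt project lives at its own path, it can be versioned and deployed separately from the Airflow repository.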
With recent work on executor decoupling and growing interest in hybrid execution, we find it is still quite common for Airflow users to rely on old rules of thumb like “Don’t use Airflow with LocalExecutor in production” or “If your scheduler lags, split your DAGs over two separate Airflow clusters.” In our talk, we will present a deep-dive comparison of the various execution models Airflow supports and, hopefully, update the understanding of their efficiency and limitations.
“Connecting the Dots in Airflow: From User to Contributor” explores the journey of transitioning from an Airflow user to an active project contributor. This talk will cover essential steps, resources, and best practices to effectively engage with the Airflow community and make meaningful contributions. Attendees will gain insights into the collaborative nature of open-source projects and how their involvement can drive both personal growth and project innovation.
Having helped many customers migrate thousands of workloads, we will discuss the migration process and how we built an open-source framework that migrates legacy scheduler workflows to Airflow projects via standard sets of patterns. This framework is easily extended to cover schedulers such as Automic, AutoSys, Oozie, JAMS, SSIS, and others, and has turned a difficult process requiring months or years into a simple one taking days or weeks.
Laurel provides an AI-driven timekeeping solution tailored for accounting and legal firms, automating timesheet creation by capturing digital work activities. This session highlights two notable AI projects. UTBMS Code Prediction: leveraging small language models, this system builds new embeddings to predict work codes for legal bills with high accuracy; more details are available in our case study: https://www.laurel.ai/resources-post/enhancing-legal-and-accounting-workflows-with-ai-insights-into-work-code-prediction. Bill Creation and Narrative Generation: utilizing Retrieval-Augmented Generation (RAG), this approach transforms users’ digital activities into fully billable entries. Additionally, we will discuss how we use Airflow for model management in these AI projects: daily model retraining (we retrain our models every day); model (re)deployment (our Airflow DAG evaluates model performance and redeploys the model if improvements are detected); and cost management (to avoid the high costs of querying large language models frequently, our DAG uses RAG to summarize daily activities into a billable timesheet at day’s end).
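The retrain/evaluate/redeploy loop described above could be sketched as a branching DAG like the one below; the task bodies, task ids, and the improvement check are placeholders rather than Laurel's code.

```python
# A hedged sketch of daily retraining with conditional redeployment; all task
# bodies and the metric comparison are placeholders.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_model_refresh():
    @task
    def retrain_model() -> str:
        # Train on the latest day of activity data and return a model URI.
        return "s3://models/candidate/latest"

    @task.branch
    def evaluate(model_uri: str) -> str:
        # Compare the candidate against the currently deployed model.
        improved = True  # placeholder for a real metric comparison
        return "deploy_model" if improved else "keep_current_model"

    @task
    def deploy_model():
        ...

    @task
    def keep_current_model():
        ...

    decision = evaluate(retrain_model())
    decision >> [deploy_model(), keep_current_model()]


daily_model_refresh()
```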
DAGify is a highly extensible, template-driven enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs. DAGify is an open-source tool under the Apache 2.0 license and is available on GitHub (https://github.com/GoogleCloudPlatform/dagify). In this session we will introduce DAGify and its use cases, and demo its functionality by converting Control-M XML files to Airflow DAGs. Additionally, we will highlight DAGify’s “no-code” extensibility by creating custom conversion templates that map Control-M functionality to Airflow operators.
The Center for Security and Emerging Technology is a think tank at Georgetown University that studies security implications of emerging technologies, including data-driven analyses across bibliometric, patenting, and investment datasets. This talk will describe CSET’s data infrastructure which uses Airflow to orchestrate data ingestion, model deployment, webscraping, and manual data curation pipelines. We’ll also discuss how outputs from these pipelines are integrated into public-facing web applications and written reports, and some lessons learned from building and maintaining data pipelines on a data team with a diverse skill set.
dbt has become the de facto standard for data teams building reliable and trustworthy SQL code on a modern data stack. That dbt logic needs to be orchestrated, with jobs scheduled to meet business expectations, and that’s where Airflow comes into play. In this quick introductory session, you will learn how to leverage dbt Core and Airflow to orchestrate pipelines, write DAGs in a Pythonic way, and apply best practices to your jobs.
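For readers who want a starting point, a minimal (assumed) DAG that shells out to dbt Core might look like this; the project and profiles paths are placeholders.

```python
# An illustrative dbt Core + Airflow DAG; paths are assumptions for the example.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_core_daily",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/my_project --profiles-dir /opt/dbt",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt/my_project --profiles-dir /opt/dbt",
    )
    # Run models first, then their tests.
    dbt_run >> dbt_test
```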
In his presentation, Elad will provide a novel take on Airflow, highlighting its versatility beyond conventional use for scheduled pipelines. He’ll discuss its application as an on-demand tool for initiating and halting jobs, mainly in data science use cases such as dataset enrichment and batch prediction via API calls, complete with real-time status tracking and alerts. The talk aims to encourage a fresh approach to Airflow utilization, but will also delve into the technical aspects of implementing DAG triggering and cancellation logic. What the audience will learn: a real-life use case of leveraging Airflow beyond traditional pipeline scheduling, with an innovative integration as the infrastructure for an ML platform; triggering on-demand DAGs through the API; cancelling running DAGs; a demonstration of an end-to-end ML pipeline using AWS SageMaker for batch predictions; and some more Airflow best practices. Join us to learn from Wix’s experience and best practices!
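As a hedged illustration of the trigger-and-cancel pattern, the snippet below uses Airflow's stable REST API; the host, credentials, and `batch_prediction` DAG id are placeholders for the example.

```python
# A sketch of triggering and cancelling DAG runs via Airflow's stable REST API;
# host, auth, and dag_id are placeholders.
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"
AUTH = ("admin", "admin")  # replace with your real auth mechanism


def trigger_batch_prediction(run_id: str, dataset_uri: str) -> dict:
    # Start an on-demand DAG run, passing parameters through conf.
    resp = requests.post(
        f"{AIRFLOW_API}/dags/batch_prediction/dagRuns",
        json={"dag_run_id": run_id, "conf": {"dataset_uri": dataset_uri}},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()


def cancel_run(run_id: str) -> dict:
    # Marking a running DAG run as failed stops its remaining tasks.
    resp = requests.patch(
        f"{AIRFLOW_API}/dags/batch_prediction/dagRuns/{run_id}",
        json={"state": "failed"},
        auth=AUTH,
    )
    resp.raise_for_status()
    return resp.json()
```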
Apache Airflow is the backbone of countless data pipelines, but optimizing performance and resource utilization can be a challenge. This talk introduces a novel performance testing framework designed to measure, monitor, and improve the efficiency of Airflow deployments. I’ll delve into the framework’s modular architecture, showcasing how it can be tailored to various Airflow setups (Docker, Kubernetes, cloud providers). By measuring key metrics across schedulers, workers, triggers, and databases, this framework provides actionable insights to identify bottlenecks and compare performance across different versions or configurations. Attendees will learn: The motivation behind developing a standardized performance testing approach. Key design considerations and challenges in measuring performance across diverse Airflow environments. How to leverage the framework to construct test suites for different use cases (e.g., version comparison). Practical tips for interpreting performance test results and making informed decisions about resource allocation. How this framework contributes to greater transparency in Airflow release notes, empowering users with performance data.
At Wix, more often than not, business analysts build workflows themselves so that data engineers do not become a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and that send emails or refresh Tableau reports when the work is done? One simple answer is to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to create thousands of DAGs easily. To bridge this gap we built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the “Schedule” button. During the talk we will go through the challenges of building a reliable and extensible DAG-generating tool, why we preferred Airflow over Apache Oozie, and the tricks (sharding, HA mode, etc.) that allow Airflow to run 8,000 active DAGs on a single cluster in Kubernetes.
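The general technique behind such a tool is config-driven DAG generation; the sketch below is a simplified, assumed example (the workflow definitions and connection id are invented), not Quix's actual generator.

```python
# A simplified sketch of config-driven DAG generation; the workflow configs and
# the trino_default connection are invented for illustration.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

WORKFLOWS = [
    {"name": "daily_revenue", "schedule": "@daily", "sql": "SELECT 1"},
    {"name": "hourly_events", "schedule": "@hourly", "sql": "SELECT 2"},
]

for wf in WORKFLOWS:
    with DAG(
        dag_id=f"quix_{wf['name']}",
        schedule=wf["schedule"],
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        SQLExecuteQueryOperator(
            task_id="run_sql",
            conn_id="trino_default",
            sql=wf["sql"],
        )
    # Register each generated DAG at module level so Airflow discovers it.
    globals()[dag.dag_id] = dag
```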
Does your organization feel like the responsibility to write Airflow DAGs, handle Airflow infrastructure administration, debug failing tasks, and keep up with new features and best practices is too much for too few people? Perhaps you have only one data team that owns all of that, or too many teams with too many permissions on other teams’ DAGs. The topic of this talk is how Rakuten Kobo enables self-service for various teams within its organization to build their own DAGs in Airflow. The talk will cover how we delineate the Airflow responsibilities of various teams, how we build guard rails for new Airflow developers, how different teams automatically get the permissions required for their “own” DAGs (but not others’), the distinct responsibilities of the Operations and Data Engineering teams, and how all of this is done in a scalable manner. Maybe you’ll be inspired to make changes in your own organization, or have some tips of your own to share! Depending on questions, we could discuss some of the technical details as well.
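One building block for this kind of isolation is Airflow's DAG-level `access_control` argument, sketched below with an assumed role name; this is an illustration of the mechanism, not Kobo's setup.

```python
# A small sketch of per-team DAG permissions; the "analytics" role is an
# assumed example.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="team_analytics_ingest",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    # Only the "analytics" role can view and edit this DAG in the UI.
    access_control={"analytics": {"can_read", "can_edit"}},
) as dag:
    EmptyOperator(task_id="start")
```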
Airflow is all about schedules: we use cron strings and Timetables to define them, and the Airflow scheduler component manages those timetables, and a lot more, to ensure that DAGs and tasks run on schedule. But what do you do if your data isn’t available on a schedule? What if data is coming from many sources, at varying times, and your job is to make sure it’s all as up-to-date as possible? An event-driven data pipeline may be the answer. An event-driven architecture (EDA) is an architecture pattern that uses events to decouple an application’s components. It relies on external events, not an internal schedule, to create loosely coupled data pipelines that determine when to take action and what actions to take. In this session, we will discuss the design considerations when using Airflow in an EDA and the tools Airflow has to make this happen, including Datasets, the REST API, Dynamic Task Mapping, custom Timetables, Sensors, and queues.
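A minimal sketch of the Dataset-based piece of that toolbox is shown below; the URI and DAG ids are illustrative.

```python
# A minimal sketch of data-aware scheduling with Airflow Datasets; URIs and
# DAG ids are illustrative.
from datetime import datetime

from airflow import DAG, Dataset
from airflow.operators.empty import EmptyOperator

orders = Dataset("s3://bucket/orders.parquet")

# Producer: declares that it updates the dataset whenever it runs.
with DAG(dag_id="ingest_orders", schedule=None, start_date=datetime(2024, 1, 1)) as producer:
    EmptyOperator(task_id="load", outlets=[orders])

# Consumer: has no time-based schedule; it runs whenever the dataset is updated.
with DAG(dag_id="transform_orders", schedule=[orders], start_date=datetime(2024, 1, 1)) as consumer:
    EmptyOperator(task_id="transform")
```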
Up until a few years ago, teams at Uber used multiple data workflow systems, some based on open-source projects such as Apache Oozie, Apache Airflow, and Jenkins, while others were custom-built solutions written in Python and Clojure. Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system imposed additional maintenance and operational burdens to keep it running, troubleshoot issues, fix bugs, and educate users. After evaluating these systems, and with the goal of converging on a single workflow system capable of supporting Uber’s scale, we settled on an Airflow-based system. The Airflow-based DSL provided the best trade-off of flexibility, expressiveness, and ease of use while remaining accessible to our broad range of users, which includes data scientists, developers, machine learning experts, and operations employees. This talk will focus on scaling Airflow to Uber’s scale and providing a seamless, no-code user experience.