The session will cover the data lineage capabilities of Apache Airflow, how to use them, and the motivation behind them. It will present the technical know-how for integrating data lineage solutions with Apache Airflow and for supplying DAG metadata to power lineage functionality in a way that is transparent to the user and limits setup friction. It will also cover Google's Cloud Composer lineage integration, implemented on Airflow's current data lineage architecture, and our approach to the lineage evolution strategy.
talk-data.com
Topic
Cloud Composer
Google Cloud Composer
2
tagged
Reliability is a complex and important topic. I will focus on both the definition of reliability and best practices. I will begin by reviewing the Apache Airflow components that affect reliability, then examine each of them, showing the single points of failure, mitigations, and tradeoffs. The journey starts with the scheduling process: I will focus on the aspects of Scheduler infrastructure and configuration that improve reliability. The Scheduler doesn't run in a vacuum, so I'll also share my observations on the reliability of the infrastructure around it. We recommend that tasks be idempotent, but that is not always possible. I will share the challenges of running users' code in the distributed architecture of Cloud Composer, and I will discuss the volatility of some cloud resources and mitigation methods in various scenarios. Deferrability plays an important part in reliability, but there are other elements we shouldn't ignore.