OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase, facilitating lineage collection across providers like Google, Amazon, and more. Atlan Data Catalog is a third-generation active metadata platform that serves as a single source of trust, unifying cataloging, data discovery, lineage, and governance. We will demonstrate what OpenLineage is and how, with minimal and intuitive setup across Airflow and Atlan, it provides a unified view of workflows and efficient cross-platform lineage collection, including column-level lineage, across various technologies (Python, Spark, dbt, SQL, etc.) and clouds (AWS, Azure, GCP, etc.), all orchestrated by Airflow. This integration unlocks further use cases in automated metadata management by making operational pipelines dataset-aware for self-service exploration. We will also demonstrate real-world challenges and resolutions for lineage consumers in improving audit and compliance accuracy through column-level lineage traceability across the data estate. The talk will also briefly review the most recent OpenLineage developments and planned future enhancements.
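To make "column-level lineage" concrete, here is a minimal sketch of what an OpenLineage-style run event with a column lineage facet might look like. It is a simplified illustration, not the output of the Airflow or Atlan integration: the namespace, producer URL, table names, and the `build_run_event` helper are all hypothetical, and metadata keys that the specification requires on real events and facets (such as schema URLs) are omitted for brevity.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, input_table, output_table, column_map):
    """Build a simplified OpenLineage-style run event as a plain dict.

    column_map maps each output column to the list of input columns it
    was derived from. All names here are hypothetical examples.
    """
    namespace = "example-namespace"  # assumption: a single shared namespace
    # One entry per output column, pointing back at its input columns.
    fields = {
        out_col: {
            "inputFields": [
                {"namespace": namespace, "name": input_table, "field": in_col}
                for in_col in in_cols
            ]
        }
        for out_col, in_cols in column_map.items()
    }
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": input_table}],
        "outputs": [
            {
                "namespace": namespace,
                "name": output_table,
                "facets": {"columnLineage": {"fields": fields}},
            }
        ],
    }


event = build_run_event(
    job_name="build_customer_view",
    input_table="raw.customers",
    output_table="analytics.customer_view",
    column_map={"full_name": ["first_name", "last_name"]},
)
print(json.dumps(event, indent=2))
```

A lineage consumer can walk `outputs[].facets.columnLineage.fields` to answer questions such as "which source columns feed `full_name`?", which is the kind of traceability the abstract refers to for audit and compliance.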
talk-data.com
Topic: Amazon Web Services (AWS), 6 tagged
In his presentation, Elad will provide a novel take on Airflow, highlighting its versatility beyond conventional use for scheduled pipelines. He’ll discuss its application as an on-demand tool for starting and stopping jobs, mainly in data science, for tasks such as dataset enrichment and batch prediction via API calls, complete with real-time status tracking and alerts. The talk aims to encourage a fresh approach to Airflow utilization, but will also delve into the technical aspects of implementing DAG triggering and cancellation logic. What the audience will learn: a real-life use case of leveraging Airflow capabilities beyond traditional pipeline scheduling, with Airflow serving as the infrastructure for an ML platform; how to trigger on-demand DAGs through the API; how to cancel running DAGs; a demonstration of an end-to-end ML pipeline using AWS SageMaker for batch predictions; and some more Airflow best practices. Join us to learn from Wix’s experience and best practices!
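As a rough illustration of triggering and cancelling DAG runs through an API, the sketch below builds (but does not send) requests against the Airflow 2.x stable REST API: a POST to create a DAG run, and a PATCH that sets a run's state to `failed` to stop it. The abstract does not describe the speaker's actual implementation, so this is only one possible approach; the base URL, DAG id, run id, and helper names are assumptions, and authentication headers are left out.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Airflow webserver exposing the stable REST API.
BASE_URL = "http://localhost:8080/api/v1"
JSON_HEADERS = {"Content-Type": "application/json"}


def build_trigger_request(dag_id, conf, base_url=BASE_URL):
    """Build a POST request that starts a new run of dag_id with run-level conf."""
    body = json.dumps({"conf": conf}).encode("utf-8")
    url = f"{base_url}/dags/{urllib.parse.quote(dag_id, safe='')}/dagRuns"
    return urllib.request.Request(url, data=body, headers=JSON_HEADERS, method="POST")


def build_cancel_request(dag_id, dag_run_id, base_url=BASE_URL):
    """Build a PATCH request that marks a running DAG run as failed."""
    body = json.dumps({"state": "failed"}).encode("utf-8")
    url = (
        f"{base_url}/dags/{urllib.parse.quote(dag_id, safe='')}"
        f"/dagRuns/{urllib.parse.quote(dag_run_id, safe='')}"
    )
    return urllib.request.Request(url, data=body, headers=JSON_HEADERS, method="PATCH")


# Hypothetical DAG and run identifiers, for illustration only.
trigger = build_trigger_request("batch_prediction", {"dataset": "s3://example-bucket/input"})
cancel = build_cancel_request("batch_prediction", "run-1")
print(trigger.get_method(), trigger.full_url)
print(cancel.get_method(), cancel.full_url)
```

To actually send a request, add whatever credentials your deployment requires and pass the object to `urllib.request.urlopen`. Building the request separately from sending it keeps the URL and payload logic easy to test without a running Airflow instance.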
It has been nearly 4 years since AWS launched Managed Workflows for Apache Airflow (MWAA). As with any new idea, it has gone through its trials and tribulations: working with customers to better understand its shortcomings, building dedicated teams focused on scaling and growth, and, at its core, preserving the integrity and functionality of Apache Airflow. Initially launched with Airflow 1.10, MWAA is now available globally in multiple AWS regions, supporting the latest version of Airflow along with a multitude of features. In this talk, we will cover a bit of that history and debunk a few myths surrounding the critical needs of users today. From compliance requirements and larger environments to observability and pricing, we will discuss how MWAA has evolved and continues to grow through its focus on customer value and, more importantly, its dedication to the Apache Airflow community.
Airflow operators are a core feature of Apache Airflow, so it is extremely important to maintain their quality and prevent regressions. Automated test results help developers double-check that their changes do not introduce regressions or backward-incompatible changes, and they tell Airflow release managers whether a given version of a provider is ready to be released. Recently, a new approach to assuring production quality was implemented for AWS, Google and Astronomer-provided operators: standalone Continuous Integration processes were configured for them, and test dashboards show the results of the latest test runs. What has been working well for these operator providers might be a pattern for others to follow. During this presentation, AWS, Google and Astronomer engineers will share the internals of the Test Dashboards implemented for their operators. This approach might be a ‘blueprint’ for other providers to follow.
AI workloads are becoming increasingly complex, with unique requirements around data management, compute scalability, and model lifecycle management. In this session, we will explore the real-world challenges users face when operating AI at scale. Through real-world examples, we will uncover common pitfalls in areas like data versioning, reproducibility, model deployment, and monitoring. Our practical guide will highlight strategies for building robust and scalable AI platforms leveraging Airflow as the orchestration layer and AWS for its extensive AI/ML capabilities. We will showcase how users have tackled these challenges, streamlined their AI workflows, and unlocked new levels of productivity and innovation.
Before Airflow 2.9, user management was part of core Airflow, so modifying or customizing it to fit user needs was not an easy process. Authentication and authorization managers (auth managers) are a new concept introduced in Airflow 2.9. They were introduced as extensible user management (AIP-56), giving Airflow users a flexible way to integrate with their organization’s identity services. Organizations want a single place to manage permissions, and FAB (Flask AppBuilder) made that difficult to achieve. In this talk, after explaining the concept of auth managers and why we built them, we will show how you can leverage the new auth manager interface to build an authorization service for Airflow based on your existing identity provider. We will see that auth managers can considerably change how users and their permissions are managed in an Airflow environment. Finally, we will dive deep into the AWS auth manager as an alternative auth manager and look at some different usages as examples.
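The idea of delegating authorization decisions to an existing identity provider can be sketched in plain Python. The example below is deliberately not the Airflow auth manager interface: the `User` and `GroupBasedAuthorizer` classes, the group names, and the permission tuples are all hypothetical, and a real auth manager would implement the methods defined by Airflow itself. It only shows the underlying pattern of mapping identity-provider groups to allowed actions in one place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """A user as described by a (hypothetical) identity provider."""
    name: str
    groups: frozenset


# Hypothetical mapping from identity-provider groups to allowed
# (method, resource) pairs; in practice this would live in the
# organization's identity or policy service.
GROUP_PERMISSIONS = {
    "data-engineers": {("GET", "dag"), ("PUT", "dag")},
    "analysts": {("GET", "dag")},
}


class GroupBasedAuthorizer:
    """Answers authorization questions from group membership alone."""

    def __init__(self, group_permissions):
        self._group_permissions = group_permissions

    def is_authorized(self, user, method, resource):
        # A user is authorized if any of their groups grants the action.
        return any(
            (method, resource) in self._group_permissions.get(group, set())
            for group in user.groups
        )


authorizer = GroupBasedAuthorizer(GROUP_PERMISSIONS)
analyst = User(name="ana", groups=frozenset({"analysts"}))
engineer = User(name="eli", groups=frozenset({"data-engineers"}))

print(authorizer.is_authorized(analyst, "GET", "dag"))
print(authorizer.is_authorized(analyst, "PUT", "dag"))
print(authorizer.is_authorized(engineer, "PUT", "dag"))
```

Keeping the permission mapping outside the orchestrator is what lets an organization manage access in a single place, which is the motivation the abstract gives for auth managers.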