talk-data.com

Topic: Spark (Apache Spark)

Tags: big_data, distributed_computing, analytics

6 tagged

Activity Trend: 71 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: Airflow Summit 2024

OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase and facilitating lineage collection across providers such as Google, Amazon, and more. Atlan Data Catalog is a third-generation active metadata platform that serves as a single source of trust, unifying cataloging, data discovery, lineage, and governance. We will demonstrate what OpenLineage is and how, with a minimal and intuitive setup across Airflow and Atlan, it provides a unified view of workflows and efficient cross-platform lineage collection, including column-level lineage, across technologies (Python, Spark, dbt, SQL, etc.) and clouds (AWS, Azure, GCP, etc.), all orchestrated by Airflow. This integration unlocks further use cases in automated metadata management by making operational pipelines dataset-aware for self-service exploration. The talk will also demonstrate real-world challenges and resolutions for lineage consumers in improving audit and compliance accuracy through column-level lineage traceability across the data estate, and will briefly overview the most recent OpenLineage developments and planned future enhancements.
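
A minimal sketch of the kind of setup the abstract refers to (not the presenters' exact configuration): with the apache-airflow-providers-openlineage package installed and a transport configured, lineage events are emitted automatically for supported operators. The collection endpoint, connection id, and SQL below are illustrative assumptions.

```python
# Assumed environment configuration for the OpenLineage provider
# (endpoint URL is hypothetical; an Atlan or Marquez-compatible collector would go here):
#   export AIRFLOW__OPENLINEAGE__NAMESPACE="example_namespace"
#   export AIRFLOW__OPENLINEAGE__TRANSPORT='{"type": "http", "url": "https://lineage.example.com", "endpoint": "api/v1/lineage"}'
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="openlineage_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # SQL operators are among those the OpenLineage provider can extract
    # input/output datasets (and, for parseable SQL, column-level lineage) from.
    load_orders = SQLExecuteQueryOperator(
        task_id="load_orders",
        conn_id="warehouse",  # hypothetical connection
        sql="INSERT INTO analytics.orders SELECT * FROM staging.orders",
    )
```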

This talk will explore ASAPP’s use of Apache Airflow to streamline and optimize our machine learning operations (MLOps). Key highlights include:
- Integrating with our custom Spark solution to achieve speedup, efficiency, and cost gains for generative AI transcription, summarization, and intent categorization pipelines.
- Different design patterns for integrating with efficient LLM servers such as TGI, vLLM, and TensorRT for summarization pipelines, with and without Spark.
- An overview of batched LLM inference using Airflow, as opposed to real-time inference outside of it.
- [Tentative] A possible extension of this scaffolding to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) for fine-tuning LLMs, using Airflow as the orchestrator.
Additionally, the talk will cover ASAPP’s MLOps journey with Airflow over the past few years, including an overview of our cloud infrastructure, various data backends, and sources. The primary focus will be on the machine learning workflows at ASAPP, rather than the data workflows, providing a detailed look at how Airflow enhances our MLOps processes.
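
A minimal sketch of the batched-inference pattern mentioned above (not ASAPP's implementation): a DAG that submits a nightly Spark batch-inference job instead of calling an LLM server in real time. The script path, connection id, endpoint, and Spark settings are illustrative assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="batched_llm_inference",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Hypothetical PySpark job that reads the day's transcripts and calls an
    # LLM serving endpoint in batches for summarization.
    summarize = SparkSubmitOperator(
        task_id="batch_summarize_transcripts",
        conn_id="spark_default",
        application="jobs/batch_summarize.py",
        application_args=["--date", "{{ ds }}", "--endpoint", "http://llm-server:8000"],
        conf={"spark.executor.instances": "8"},
    )
```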

Imagine a world where writing Airflow tasks in languages like Go, R, Julia, or maybe even Rust is not just a dream but a native capability. Say goodbye to BashOperators; welcome to the future of Airflow task execution. Here’s what you can expect to learn from this session:
- Multilingual tasks: explore how we empower DAG authors to write tasks in any language while retaining seamless access to Airflow Variables and Connections.
- Simplified development and testing: discover how a standardized interface for task execution promises to streamline development efforts and elevate code maintainability.
- Enhanced scalability and remote workers: learn how enabling tasks to run on remote workers opens up possibilities for seamless deployment on diverse platforms, including Windows and remote Spark or Ray clusters.
Experience the convenience of effortless deployments as we unlock new avenues for Airflow usage. Join us as we embark on an exploratory journey to shape the future of Airflow task execution. Your insights and contributions are invaluable as we refine this vision together. Let’s chart a course towards a more versatile, efficient, and accessible Airflow ecosystem.
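
For context, a rough sketch of how non-Python work is typically wrapped today, the pattern this session proposes to move beyond: a BashOperator invoking a hypothetical compiled Go binary, with access to Airflow Variables plumbed through by hand.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="go_task_via_bash",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_go_task = BashOperator(
        task_id="run_go_task",
        bash_command="./bin/my_task --api-key $API_KEY",   # hypothetical Go binary
        env={"API_KEY": "{{ var.value.my_api_key }}"},      # Variable access passed explicitly via env
    )
```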

This use case shows how we deal with data of different varieties from different sources. Each source sends data with a different layout, timing, structure, location pattern, and size. The goal is to process the files within SLA and send them out. This is a complex, multi-step processing pipeline that involves multiple Spark jobs, API-based integrations with microservices, resolving unique IDs, deduplication, and filtering. Note that this is an event-driven system, but not a streaming data system. The files are of gigabyte scale, and each day the data being processed is of terabyte scale. We will be talking about how to make DAG creation and business logic building a “low-code no-code process” so that non-technical analysts can write business logic and light developers can deploy DAGs without much manual effort. Every aspect is driven by either source-specific or source-agnostic configuration. Airflow was chosen to enable easy DAG building, scaling, monitoring, troubleshooting, and rerunning.
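
A minimal sketch of the configuration-driven idea described above (not the speakers' actual framework): one source-agnostic factory builds a DAG per source from a config mapping, so analysts edit configuration rather than DAG code. Source names, schedules, SLAs, and the processing callable are illustrative assumptions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical per-source configuration; in practice this could live in YAML or a database.
SOURCES = {
    "source_a": {"schedule": "@hourly", "sla_minutes": 60},
    "source_b": {"schedule": "@daily", "sla_minutes": 240},
}


def process_files(source: str, **_):
    # Placeholder for the multi-step pipeline (Spark jobs, ID resolution, dedup, filtering, ...)
    print(f"processing files for {source}")


for source, cfg in SOURCES.items():
    with DAG(
        dag_id=f"ingest_{source}",
        start_date=datetime(2024, 1, 1),
        schedule=cfg["schedule"],
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="process_files",
            python_callable=process_files,
            op_kwargs={"source": source},
            sla=timedelta(minutes=cfg["sla_minutes"]),
        )
    # Register each generated DAG so the scheduler can discover it.
    globals()[dag.dag_id] = dag
```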

In today’s data-driven era, ensuring data reliability and enhancing our testing and development capabilities are paramount. Local unit testing has its merits but falls short when dealing with the volume of big data. One major challenge is running Spark jobs pre-deployment to ensure they produce expected results and handle production-level data volumes. In this talk, we will discuss how Autodesk leveraged Astronomer to improve pipeline development. We’ll explore how it addresses challenges with sensitive and large data sets that cannot be transferred to local machines or non-production environments. Additionally, we’ll cover how this approach supports over 10 engineers working simultaneously on different feature branches within the same repo. We will highlight the benefits, such as conflict-free development and testing, and eliminating concerns about data corruption when running DAGs on production Airflow servers. Join me to discover how solutions like Astronomer empower developers to work with increased efficiency and reliability. This talk is perfect for those interested in big data, cloud solutions, and innovative development practices.
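
As a small complement to the testing theme above (a generic sketch, not Autodesk's setup): a DAG-integrity test that any feature branch can run before its DAGs reach a shared Airflow deployment. The dags/ path is an assumption.

```python
from airflow.models import DagBag


def test_dags_import_without_errors():
    # Fails fast if any DAG file in the branch has syntax or import errors.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, f"DAG import errors: {dag_bag.import_errors}"


def test_every_dag_has_tasks():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} defines no tasks"
```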

Jupyter Notebooks are widely used by data scientists and engineers to prototype and experiment with data. However, these engineers often need to work with other data or platform engineers to productionize these experiments, due to the complexity of navigating infrastructure and systems. In this talk, we will deep dive into this PR https://github.com/apache/airflow/pull/34840 and share how Airflow can be leveraged as a platform to execute notebook pipelines (Python, Scala, or Spark) in dynamic environments like Kubernetes for various heterogeneous use cases. We will demonstrate how data scientists can use a Jupyter extension to easily build and manage such pipelines, which are executed using Airflow, streamlining data science workflow development and supercharging productivity.
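
A minimal sketch of running a parameterized notebook from Airflow using the community Papermill provider (not necessarily the mechanism of the linked PR); the notebook paths and parameters are illustrative assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="notebook_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_notebook = PapermillOperator(
        task_id="run_feature_notebook",
        input_nb="notebooks/feature_engineering.ipynb",                    # hypothetical notebook
        output_nb="s3://bucket/runs/{{ ds }}/feature_engineering.ipynb",   # hypothetical output location
        parameters={"run_date": "{{ ds }}"},
    )
```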