OpenLineage has simplified collecting lineage metadata across the data ecosystem by standardizing its representation in an extensible model. It enabled a whole ecosystem improving data pipeline reliability and ease of troubleshooting in production environments. In this talk, we’ll briefly introduce the OpenLineage model and explore how this metadata is collected from Airflow, Spark, dbt, and Flink. We’ll demonstrate how to extract valuable insights and outline practical benefits and common challenges when building ingestion, processing and storage for OpenLineage data. We will also briefly show how OpenLineage events can be used to observe data pipelines exhastively and the benefits that brings.
talk-data.com
Speaker
Julien Le Dem
2
talks
Frequent Collaborators
Filter by Event / Source
Talks & appearances
Showing 2 of 14 activities
Datadog is a world-class data platform ingesting more than a 100 trillion events a day, providing real-time insights. Before Airflow’s prominence, we built batch processing on Luigi, Spotify’s open-source orchestrator. As Airflow gained wide adoption, we evaluated adopting the major improvements of release 2.0, but opted for building our own orchestrator instead to realize our dataset-centric, event-driven vision. Meanwhile, the 3.0 release aligned Airflow with the same vision we pursued internally, as a modern asset-driven orchestrator. It showed how futile it is to build our own compared to the momentum of the community. We evaluated several orchestrators and decided to join forces with the Airflow project. This talk follows our journey from building a custom orchestrator to adopting and contributing to Airflow 3. We’ll share our thought process, our asset partitions use case, and how we’re working with the community to materialize the Data Awareness (AIP-73) vision. Partition-based incremental scheduling is core to our orchestration model, enabling scalable, observable pipelines thanks to Datadog’s Data Observability product providing visibility into pipeline health.