talk-data.com

Oscar Ligthart

Speaker · Senior Data Engineer at Vinted

2 talks


Talks & appearances

2 activities · Newest first


At Vinted, Europe’s largest second-hand marketplace, over 20 decentralized data teams generate, transform, and build products on petabytes of data. Each team uses its own tools, workflows, and expertise. Coordinating data pipeline creation across such diverse teams presents significant challenges, including complex inter-team dependencies, inconsistent scheduling solutions, and rapidly evolving requirements.

This talk is aimed at data engineers, platform engineers, and technical leads with experience in workflow orchestration. It will demonstrate how we empower teams at Vinted to define data pipelines quickly and reliably. We will present our user-friendly abstraction layer built on top of Apache Airflow, enhanced by a Python code generator. This abstraction simplifies upgrades and migrations, removes scheduler complexity, and supports Vinted’s rapid growth. Attendees will learn how Python abstractions and code generation can standardize pipeline development across diverse teams, reduce operational complexity, and enable greater flexibility and control in large-scale data organizations. Through practical lessons and real-world examples of our abstraction interface, we will offer insights into designing scheduler-agnostic architectures for successful data pipeline orchestration.
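The abstract does not show the actual interface, but the idea of a scheduler-agnostic abstraction plus a code generator can be sketched as follows. This is a minimal illustration, not Vinted’s real code: the `PipelineSpec` dataclass and `generate_dag_source` function are hypothetical names, and the emitted Airflow snippet assumes Airflow 2.x conventions.

```python
from dataclasses import dataclass

# Hypothetical declarative spec -- illustrative only, not Vinted's interface.
# Teams describe *what* runs and in what order; the generator decides *how*
# that maps onto the scheduler, so the scheduler can be swapped later.
@dataclass
class PipelineSpec:
    name: str
    schedule: str
    tasks: dict[str, list[str]]  # task_id -> upstream task_ids

def generate_dag_source(spec: PipelineSpec) -> str:
    """Render Airflow DAG source code from a scheduler-agnostic spec."""
    lines = [
        "from airflow import DAG",
        "from airflow.operators.empty import EmptyOperator",
        "import pendulum",
        "",
        f"with DAG(dag_id={spec.name!r}, schedule={spec.schedule!r},",
        "         start_date=pendulum.datetime(2024, 1, 1)) as dag:",
    ]
    # Declare one operator per task.
    for task_id in spec.tasks:
        lines.append(f"    {task_id} = EmptyOperator(task_id={task_id!r})")
    # Wire up dependencies from the declarative edge list.
    for task_id, upstreams in spec.tasks.items():
        for up in upstreams:
            lines.append(f"    {up} >> {task_id}")
    return "\n".join(lines)

spec = PipelineSpec(
    name="dbt_daily",
    schedule="@daily",
    tasks={"extract": [], "transform": ["extract"], "load": ["transform"]},
)
print(generate_dag_source(spec))
```

Because teams only edit specs, an Airflow upgrade or a migration to another scheduler becomes a change to the generator rather than to dozens of team-owned DAG files.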

Vinted is the biggest second-hand marketplace in Europe, with multiple business verticals. Our data ecosystem has over 20 decentralized teams responsible for generating, transforming, and building data products from petabytes of data. This creates a demanding environment where inter-team dependencies, varied expertise with scheduling tools, and diverse use cases need to be managed efficiently. To tackle these challenges, we have centralized our approach by leveraging Apache Airflow to orchestrate data dependencies across teams.

In this session, we will present how we utilize a code generator to streamline the creation of Airflow code for numerous dbt repositories, dockerized jobs, and Vertex AI pipelines. With this approach, we simplify the complexity and offer our users the flexibility required to accommodate their use cases. We will share our sensor-callback strategy, which we developed to manage task dependencies, overcoming the limitations of traditional dataset triggers. This approach requires a data asset registry to monitor global dependencies and SLOs, and serves as a safeguard during CI processes for detecting potential breaking changes.
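The sensor-callback strategy and asset registry are not specified in detail in the abstract; the sketch below shows one plausible shape for the idea, under the assumption of a simple in-memory registry (a real system would back this with a shared service). The `AssetRegistry` class and its method names are hypothetical.

```python
# Hypothetical sensor-callback scheme: producers call mark_ready() from a
# post-task callback, consumers poll is_ready() from a sensor. Unlike plain
# dataset triggers, the registry holds a global view of cross-team
# dependencies, so CI can also query it to detect breaking changes.
class AssetRegistry:
    def __init__(self) -> None:
        self._ready: dict[str, set[str]] = {}  # asset -> ready partitions

    def mark_ready(self, asset: str, partition: str) -> None:
        """Producer callback: record that an asset partition is available."""
        self._ready.setdefault(asset, set()).add(partition)

    def is_ready(self, asset: str, partition: str) -> bool:
        """Consumer sensor: poll whether an upstream partition exists."""
        return partition in self._ready.get(asset, set())

    def consumers_of(self, asset: str) -> bool:
        """CI safeguard hook: does anything depend on this asset at all?"""
        return asset in self._ready


registry = AssetRegistry()
# Upstream team's task finishes and fires its on-success callback:
registry.mark_ready("orders_dbt_model", "2024-06-01")
# Downstream team's sensor polls before its pipeline starts:
assert registry.is_ready("orders_dbt_model", "2024-06-01")
assert not registry.is_ready("orders_dbt_model", "2024-06-02")
```

The design choice worth noting is the decoupling: producers and consumers never import each other's DAG code, they only agree on asset names, which is what makes SLO monitoring and CI-time breaking-change detection possible from one place.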