At QuintoAndar we seek automation and scalability in our data pipelines, and we believe Airflow is the right tool to give us exactly that. However, having every concern mapped and the tooling defined does not guarantee success. For months we struggled with the misconception that Airflow should act as both orchestrator and executor in a monolithic setup. That could not be further from the truth: it led to scalability and performance issues, higher infrastructure and maintenance costs, and impact rippling across development teams. Employing Airflow as an orchestration-only solution, on the other hand, helps teams deliver value to end users more efficiently, reliably and performantly, since data pipelines can be executed anywhere with the proper resources and optimizations. Those are the reasons we shifted from an orchestrate-and-execute strategy to an orchestrate-only one, in order to leverage the full power of Airflow's data pipeline management. The separation of data processing from pipeline coordination immediately brought not only finer resource tuning and better maintainability, but also tremendous scalability on both ends.
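To illustrate the idea, here is a minimal sketch of an orchestrate-only DAG in which Airflow merely coordinates the pipeline and the heavy processing is delegated to an external Kubernetes pod. This is not QuintoAndar's actual code: the DAG id, namespace, image and arguments are hypothetical, and the import path assumes Airflow 2.x with a recent cncf.kubernetes provider.

```python
# A minimal sketch of an orchestrate-only DAG (illustrative, not QuintoAndar's
# actual code). Airflow only coordinates; the processing itself runs in an
# external Kubernetes pod with its own resources.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="orders_ingestion",          # hypothetical pipeline
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The Airflow worker slot is only used to submit and watch the job; CPU and
    # memory for the actual processing are provisioned in the pod, not in Airflow.
    process_orders = KubernetesPodOperator(
        task_id="process_orders",
        name="process-orders",
        namespace="data-pipelines",                       # assumed namespace
        image="registry.example.com/spark-jobs:latest",   # hypothetical image
        arguments=["--table", "orders", "--date", "{{ ds }}"],
        get_logs=True,
    )
```

With this split, scaling the processing layer is a matter of sizing the pods (or the cluster they run on), while Airflow itself only needs enough capacity to schedule and monitor tasks.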
Speaker: Rafael Ribaldo (2 talks)
Talks & appearances
2 activities · Newest first
Cross-DAG dependencies can reduce cohesion in data pipelines and, without an explicit solution in Airflow or in a third-party plugin, those pipelines tend to become complex to handle. That is why we at QuintoAndar created an intermediate DAG, called Mediator, to handle relationships across data pipelines so that they remain scalable and maintainable by any team.

At QuintoAndar we seek automation and modularization in our data pipelines, and we believe that breaking them into modules (DAGs), each with its own responsibility, improves maintainability, reusability and the understanding of how data moves from one point to another. However, growing interconnections between DAGs tend to undo those gains and make the pipelines complex; above all, Airflow has no explicit built-in solution for them. That is why we created the Mediator DAG.

The Mediator DAG is responsible for looking for successfully finished DAG runs that represent the previous step of another DAG. That is, if a DAG depends on another, the Mediator takes care of checking for the upstream's completion and triggering whatever is needed for the data flow to continue.

In conclusion, it is sometimes not practical to combine multiple DAGs into one. Hence our proposal: define a Mediator DAG to handle dependencies and bring cohesion to a data pipeline without losing its purpose.

View presentation (Prezi)
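The abstract describes the Mediator's behavior but not its implementation, so the following is only a hedged sketch of the idea using standard Airflow 2.x operators: a periodically scheduled DAG whose tasks check whether an upstream DAG has finished successfully and, only then, trigger the DAG that depends on it. The DAG ids and the single dependency shown are hypothetical.

```python
# A hedged sketch of the Mediator idea (not QuintoAndar's actual code): a
# periodically scheduled DAG that checks whether an upstream DAG has finished
# successfully and, only then, triggers the DAG that depends on it.
from datetime import datetime

from airflow import DAG
from airflow.models import DagRun
from airflow.operators.python import ShortCircuitOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.utils.state import State


def upstream_succeeded(upstream_dag_id: str) -> bool:
    """Return True if the upstream DAG has at least one successful run."""
    # Deduplication of already-handled runs is omitted for brevity; a real
    # Mediator would track which upstream runs it has already acted on.
    return bool(DagRun.find(dag_id=upstream_dag_id, state=State.SUCCESS))


with DAG(
    dag_id="mediator",
    start_date=datetime(2021, 1, 1),
    schedule_interval="*/10 * * * *",  # poll for finished upstream runs
    catchup=False,
) as dag:
    # Hypothetical dependency: "orders_aggregation" should run only after
    # "raw_orders_ingestion" has succeeded.
    check_upstream = ShortCircuitOperator(
        task_id="check_raw_orders_ingestion",
        python_callable=upstream_succeeded,
        op_kwargs={"upstream_dag_id": "raw_orders_ingestion"},
    )

    trigger_downstream = TriggerDagRunOperator(
        task_id="trigger_orders_aggregation",
        trigger_dag_id="orders_aggregation",
    )

    # If the upstream check returns False, the trigger task is skipped.
    check_upstream >> trigger_downstream
```

Keeping the dependency map in one Mediator DAG means individual pipelines stay unaware of each other, which is the cohesion the talk argues for.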