In this presentation, we discuss how we built a fully managed workflow orchestration system at Salesforce using Apache Airflow to facilitate dependable data lake infrastructure on the public cloud. We touch upon how we utilized Kubernetes for increased scalability and resilience, as well as the most effective approaches for managing and scaling data pipelines. We will also talk about how we addressed data security and privacy, multitenancy, and interoperability with other internal systems. We discuss how we use this system to empower users to effortlessly build reliable pipelines with built-in failure detection, alerting, and monitoring for deep insights, removing the undifferentiated heavy lifting associated with running and managing their own orchestration engines. Lastly, we elaborate on how we integrated our in-house CI/CD pipelines to enable effective DAG and dependency management, further enhancing the system’s capabilities.
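The abstract does not include the platform's actual code, but a minimal sketch of what a user-facing DAG on such a managed service might look like (assuming Airflow 2.4+, with a hypothetical notify_on_failure hook standing in for the platform's alerting integration) helps illustrate the failure detection, retries, and alerting it refers to:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Placeholder alert hook; a real deployment might page on-call or post to chat.
    print(f"Task {context['task_instance'].task_id} failed in DAG {context['dag'].dag_id}")


def load_to_data_lake():
    print("extract, transform, and land data in the lake")


with DAG(
    dag_id="example_managed_pipeline",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                   # "schedule" is the Airflow 2.4+ parameter
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_on_failure,
    },
) as dag:
    PythonOperator(task_id="load_to_data_lake", python_callable=load_to_data_lake)
```

In this setup the user only declares retries, callbacks, and schedules, while the managed platform owns the Kubernetes-backed execution, monitoring, and upgrades underneath.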
Topic: Continuous Integration/Continuous Delivery (CI/CD)
Representing the Murex Reporting team at UniCredit, we would like to present our journey with Airflow and how, over the past two years, it has enabled us to automate and simplify our batch workflows. Compared to our previous rigid mainframe scheduling approach, we have created a robust and scalable framework complete with a CI/CD process, bringing our time to market for scheduling changes down from 3 days to 1. By basing our solution on DAG networks joined by ResumeDagRunOperators and an array of custom-built plugins (such as static time predecessors), we were able to replicate the scheduling of our overnight ETL processes (consisting of approx. 8,000 tasks with many-to-many dependencies) in Airflow, satisfying our bank reporting SLAs without performance regression and gaining massively improved process visibility and control. Our presentation will illustrate our journey and explore some of these customizations, which venture outside of Airflow’s core functionalities.
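The ResumeDagRunOperator and static-time-predecessor plugins mentioned above are custom to this setup and are not shown here. Core Airflow does offer rough analogues of the same pattern, and a small sketch with TriggerDagRunOperator (chaining one DAG to the next) and TimeSensor (holding a task until a fixed wall-clock time) conveys the general idea of a DAG network with time-based predecessors:

```python
from datetime import datetime, time

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.sensors.time_sensor import TimeSensor

with DAG(
    dag_id="etl_batch_a",            # hypothetical name for one DAG in the network
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Hold downstream work until 02:00, similar in spirit to a static time predecessor.
    wait_until_2am = TimeSensor(task_id="wait_until_2am", target_time=time(2, 0))

    # Hand control to the next DAG in the network once this batch is done.
    trigger_batch_b = TriggerDagRunOperator(
        task_id="trigger_etl_batch_b",
        trigger_dag_id="etl_batch_b",
        wait_for_completion=False,
    )

    wait_until_2am >> trigger_batch_b
```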
Change management in data teams can be challenging, to say the least. Not only do you have to evolve your data pipelines, data structures, and the datasets themselves across environments, you also have to keep data exploration and visualization tools in sync. In this talk, we’ll explore how to do this best across environments (i.e., dev, staging, and prod), how CI/CD can help, how to implement good data ops practices, and how to crank up the level of rigor where it matters. We’ll also talk about rigor-vs-speed tradeoffs, since clearly not all data pipelines are born equal, and how to think about raising the level of rigor over time in the places where it matters most.
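One common pattern for promoting the same pipeline code through dev, staging, and prod (not necessarily the speaker's approach) is to let CI/CD deploy identical DAG files everywhere and resolve environment-specific settings at parse time. The sketch below assumes a DEPLOY_ENV environment variable injected by the deployment pipeline:

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed variable name set per deployment by the CI/CD pipeline.
ENV = os.environ.get("DEPLOY_ENV", "dev")

CONFIG = {
    "dev": {"target_schema": "analytics_dev", "alerting": False},
    "staging": {"target_schema": "analytics_staging", "alerting": True},
    "prod": {"target_schema": "analytics", "alerting": True},
}[ENV]


def publish_dataset():
    print(f"writing to schema {CONFIG['target_schema']} (alerting={CONFIG['alerting']})")


with DAG(
    dag_id="publish_dataset",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="publish_dataset", python_callable=publish_dataset)
```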
A steady rise in users and business-critical workflows poses challenges to development and production workflows. The solution: enable multi-tenancy on our single Airflow instance. We needed to let teams manage their own Python requirements and to ensure DAGs were insulated from each other. To achieve this, we divided our monolithic setup into three parts: infrastructure (with common code packaging), workspace creation, and CI/CD to manage deployments. Backstage templates enable teams to create isolated development environments that resemble our production environment, ensuring consistency. Distributing common code via a private PyPI gives teams more control over what code their DAGs run. And a PythonOperator shim in production uses virtualenv to run Python code with each team’s defined requirements for their DAG. In doing these things we enable effective multi-tenancy and facilitate easier development and production workflows for Airflow.
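The shim described here is internal, but Airflow's built-in PythonVirtualenvOperator illustrates the underlying idea: run a task's callable inside a throwaway virtualenv built from that team's own requirements, isolated from other tenants. The team name and pinned versions below are hypothetical; in the setup described, they could come from a private PyPI index and each team's requirements file:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def team_a_task():
    # Imports happen inside the virtualenv, so team A's pinned pandas version is used.
    import pandas as pd
    print(f"running with pandas {pd.__version__}")


with DAG(
    dag_id="team_a_pipeline",          # hypothetical per-team DAG
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonVirtualenvOperator(
        task_id="team_a_task",
        python_callable=team_a_task,
        requirements=["pandas==2.1.4"],  # illustrative per-team pin
        system_site_packages=False,
    )
```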
Apache Airflow has over 650 Python dependencies. In case you did not know already, dependencies in Python are a difficult subject, and Airflow has its own custom ways of managing them. Airflow has a rather complex system for managing dependencies in its CI system, but this talk is not about that. This talk is directed at Airflow users who want to keep their dependencies updated, describing ways they can do it. This presentation will explain how to effectively manage and handle custom dependencies in Airflow. Jarek will guide you through practical solutions and best practices to make your Airflow experience with dependencies - yes, you guessed it - a breeze.