talk-data.com


Activities & events

Airflow Summit 2020 (2020-07-01)
Blaine Elliot – Sr. Data Engineer @ One Medical

In this talk we review how Airflow helped create a tool to detect data anomalies. Leveraging Airflow for process management, database interoperability, and authentication created an easy path forward to achieve scale, decrease development time, and pass security audits. While Airflow is generally looked at as a solution to manage data pipelines, integrating tools with Airflow can also speed up development of those tools. The Data Anomaly Detector was created at One Medical to scan thousands of metrics per day for data anomalies. It's a complicated tool, and much of that complexity was outsourced to Airflow. Because the data infrastructure at One Medical was already built around Airflow, and Airflow had many desirable features, it made sense to build the tool to integrate closely with Airflow. The end result was that more time could be spent on building features for statistical analysis, and less effort had to be spent on database authentication, interoperability, or process management. It's an interesting example of how Airflow can be leveraged to build data-intensive tools.
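The talk stays at the architecture level, but the kind of statistical check it describes can be sketched in plain Python. The z-score method, threshold, and sample metric below are illustrative assumptions, not One Medical's actual implementation:

```python
from statistics import mean, stdev

def detect_anomalies(values, threshold=2.0):
    """Return indices of points whose z-score exceeds `threshold`.

    A minimal stand-in for the statistical analysis the talk
    mentions; real detectors are far more sophisticated.
    """
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# A daily metric series with one obvious spike at index 5
metric = [100, 102, 98, 101, 99, 500, 100]
print(detect_anomalies(metric))  # → [5]
```

In the architecture the talk describes, a function like this would run inside an Airflow task, with Airflow supplying scheduling, database access, and authentication around it.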

Airflow Cyber Security
Tao Feng – Tech Lead Data @ Lyft | Airflow PMC member, Dan Davydov – Software Engineer @ Twitter | Airflow PMC member, Kevin Yang – Engineering Manager @ Airbnb | Airflow PMC member

In this talk, colleagues from Airbnb, Twitter and Lyft share details about how they are using Apache Airflow to power their data pipelines. Slides: Tao Feng (Lyft) Dan Davydov (Twitter)

AI/ML Airflow ETL/ELT
Vanessa Sochat – Research Software Engineer @ Stanford University Research Computing Center

Engaging with a new community is a common experience in OSS development. There are usually expectations held by the project about the contributor’s exposure to the community, and by the contributor about interactions with the community. When these expectations are misaligned, the process is strained. In this talk Vanessa discusses a real life experience that required communication, persistence, and patience to ultimately lead to a positive outcome. Slides

Airflow
Noam Elfanbaum – Data engineering lead @ Bluevine

At Bluevine we use Airflow to drive our ML platform. In this talk, Noam presents the challenges and gains we had transitioning from a single server running Python scripts with cron to a full-blown Airflow setup. Some of the points covered are:
- Supporting multiple Python versions
- Event-driven DAGs
- Airflow performance issues and how we circumvented them
- Building Airflow plugins to enhance observability
- Monitoring Airflow using Grafana
- CI for Airflow DAGs (super useful!)
- Patching the Airflow scheduler
Slides
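Event-driven DAGs in Airflow typically hinge on sensors that poll for an external condition (Airflow ships a FileSensor for the file-arrival case). As a self-contained sketch of the poke loop such a sensor runs, with made-up default intervals:

```python
import os
import time

def wait_for_file(path, poke_interval=60.0, timeout=3600.0):
    """Poll until `path` exists, mimicking a sensor's poke loop.

    Returns True once the file appears, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poke_interval)
    return False
```

In a real DAG the sensor task would gate downstream tasks, so they only run once the triggering event has occurred.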

AI/ML Airflow CI/CD Grafana Python
Kaxil Naik – Airflow PMC member & Committer | Senior Director of Engineering at Astronomer, Jarek Potiuk – Independent Open-Source Contributor and Advisor

Ask me Anything with a group of Airflow committers & PMC members.

Airflow
Michal Dura – Big Data Engineer @ DXC, Amr Noureldin – Solution Architect @ DXC

This talk describes how Airflow is utilized in an autonomous driving project originating from Munich, Germany. We describe the Airflow setup, the challenges we encountered, and how we maneuvered to achieve a distributed and highly scalable Airflow deployment. One of the biggest automotive manufacturers elected to go with Airflow as an orchestration tool in pursuit of producing their first Level-3 autonomous driving vehicle in Germany. In this talk, we will describe the journey of deploying Airflow on top of OpenShift using a PostgreSQL database and RabbitMQ. We will describe how we achieve high availability for the different Airflow components. We will tackle issues related to database performance and failover recovery for the different Airflow components in our setup. In addition, we will present the bottlenecks we encountered with (1) the Airflow scheduler (especially with complex DAGs) and (2) the SparkSubmitOperator, and describe how we mitigated both. We will also describe how we leverage OpenShift to dynamically scale our Airflow deployment based on the running workloads. The talk concludes with a brief overview of future requirements and beneficial features we believe will be helpful for the community.
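The wiring the speakers describe (CeleryExecutor backed by PostgreSQL, with RabbitMQ as the message broker) corresponds to a handful of settings in airflow.cfg. Hostnames and credentials below are placeholders, not the project's actual values:

```ini
[core]
executor = CeleryExecutor
# Metadata database (PostgreSQL)
sql_alchemy_conn = postgresql+psycopg2://airflow:PASSWORD@postgres-host:5432/airflow

[celery]
# RabbitMQ as the Celery message broker
broker_url = amqp://airflow:PASSWORD@rabbitmq-host:5672/
# Task results stored back in PostgreSQL
result_backend = db+postgresql://airflow:PASSWORD@postgres-host:5432/airflow
```

The high-availability and scaling work the talk covers sits on top of this baseline configuration.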

Airflow PostgreSQL
Jarek Potiuk – Independent Open-Source Contributor and Advisor

This talk will guide you through the internals of the official production Docker image of Airflow. It will show you the intended use cases for it and how to use it in conjunction with the Official Helm Chart to make your own deployments.

Airflow Docker
Sergio Camilo Fandiño Hernández – Senior Analytics Engineer at Trade Republic

For three years we at LOVOO, a market-leading dating app, have been using the Google Cloud managed version of Airflow, a product we've been familiar with since its alpha release. We took a calculated risk and integrated the alpha into our product, and, luckily, it was a match. Since then, we have been leveraging this software not only to build out our data pipeline, but also to boost the way we do analytics and BI. The speaker will present an overview of the software's usability for pipeline error alerting through BashOperators that communicate with Slack, and will touch upon how they built their analytics pipeline (deployment and growth) and currently batch large amounts of data from different sources effectively using Airflow. We will also showcase our PythonOperator-driven Redshift to BigQuery data migration process, as well as offer a guide for creating fully dynamic tasks inside a DAG.
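The "fully dynamic tasks" pattern usually means generating tasks in a loop over configuration rather than writing them by hand. A minimal, Airflow-free sketch of the idea, with invented table names (in a real DAG file each entry would typically become a PythonOperator):

```python
# Tables to migrate; in practice this list would come from config.
TABLES = ["users", "events", "payments"]

def make_migration_task(table):
    """Build one task callable per table (a closure over `table`)."""
    def task():
        # In the real pipeline this would copy `table`
        # from Redshift to BigQuery.
        return f"migrated {table}"
    task.__name__ = f"migrate_{table}"
    return task

# One uniquely named task per table, generated dynamically
tasks = {f"migrate_{t}": make_migration_task(t) for t in TABLES}
print(sorted(tasks))  # → ['migrate_events', 'migrate_payments', 'migrate_users']
```

The factory function matters: binding `table` through a closure (rather than referencing a loop variable directly) is what keeps each generated task pointing at its own table.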

Airflow Analytics BI BigQuery Cloud Computing GCP Redshift
Josh Benamram – CEO and co-founder @ Databand.ai

While Airflow is a central product for data engineering teams, it's usually one piece of a bigger puzzle. The vast majority of teams use Airflow in combination with other tools like Spark, Snowflake, and BigQuery. Making sure pipelines are reliable, detecting issues that lead to SLA misses, and identifying data quality problems requires deep visibility into DAGs and data flows. Join this session to learn how Databand's observability system makes it easy to monitor your end-to-end pipeline health and quickly remediate issues. This is a sponsored talk, presented by Databand.

Airflow BigQuery Data Engineering Data Quality Snowflake Spark
Kamil Bregula – Airflow committer & PMC member, Kaxil Naik – Airflow PMC member & Committer | Senior Director of Engineering at Astronomer, Daniel Imberman – Airflow PMC, Engineer at Astronomer, Lover of all things Airflow, Jarek Potiuk – Independent Open-Source Contributor and Advisor, Ash Berlin-Taylor – Airflow PMC member & Director Airflow Engineering at Astronomer, Tomasz Urbaszek – Senior Software Engineer at Snowflake | Apache Airflow PMC Member

A team of core committers explain what is coming to Airflow 2.0.

Airflow
Mihail Petkov – Sr. Software Engineer @ Financial Times, Emil Todorov – Software Engineer @ Financial Times

Financial Times is increasing its digital revenue by allowing business people to make data-driven decisions. Providing an Airflow-based platform where data engineers, data scientists, BI experts, and others can run language-agnostic jobs was a huge swing. One of the most successful steps in the platform's development was building our own execution environment, allowing stakeholders to self-deploy jobs without cross-team dependencies, on top of the unlimited scale of Kubernetes. In this talk we share how we have integrated and extended Airflow at Financial Times. The main topics we will cover include:
- Providing team-level security isolation
- Removing cross-team dependencies
- Creating an execution environment for independently creating and deploying R, Python, Java, Spark, etc. jobs
- Reducing latency when sharing data between task instances
- Integrating all these features on top of Kubernetes

Airflow BI Java Kubernetes Python Cyber Security Spark
Anita Fronczak – Software Engineer @ Google

In this talk Anita showcases how to use the newly released Airflow Backport Providers. Some of the topics covered are:
- How to install them in Airflow 1.10.x
- How to install them in Composer
- How to migrate one or more DAGs from the legacy providers to the new ones
- Known bugs and fixes
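As a setup sketch, installing a backport provider on Airflow 1.10.x is a per-provider pip install; the Google provider is shown here as one example:

```shell
# Each provider ships as its own backport package on Airflow 1.10.x
pip install apache-airflow-backport-providers-google

# DAGs then migrate from legacy import paths such as
#   airflow.contrib.operators.bigquery_operator
# to the new provider paths, e.g.
#   airflow.providers.google.cloud.operators.bigquery
```

The new import paths are the same ones used in Airflow 2.0, which is what makes the backport providers a migration stepping stone.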

Airflow
Leah Cole – Developer Relations Engineer

BigQuery is GCP's serverless, highly scalable and cost-effective cloud data warehouse that can analyze petabytes of data at super-fast speeds. Amazon S3 is one of the oldest and most popular cloud storage offerings. Folks with data in S3 often want to use BigQuery to gain insights into their data. Using Apache Airflow, they can build pipelines to seamlessly orchestrate that connection. In this talk, Leah walks through how they created an easily configurable pipeline to extract data. When a team at work mentioned wanting to set up a repeatable process for migrating data stored in S3 to BigQuery, Leah knew using Cloud Composer (GCP-hosted Airflow) was the right tool for the job, but she didn't have much experience with the proprietary file type the data used. Luckily, one of her colleagues did have experience with that proprietary file type, though they hadn't worked with Airflow. Leah and her colleague teamed up to build a reusable, easily configurable solution for the team. She will walk you through their problem, the solution, and the process they took to arrive at that solution, highlighting resources that were especially useful to a first-time Airflow user.

Airflow BigQuery Cloud Computing Cloud Storage DWH GCP Cloud Composer S3
Rafal Biegacz – Senior Engineering Manager (Cloud Composer, Google)

In the contemporary world, security is more important than ever, and Airflow installations are no exception. Google Cloud Platform and Cloud Composer offer useful security options for running your DAGs and tasks so that you can effectively manage the risk of data exfiltration and limit access to the system. This is a sponsored talk, presented by Google Cloud.

Airflow Cloud Computing GCP Cloud Composer Cyber Security
Mohammed Marragh – DevOps Lead @ Société Générale, Alaeddine Maaoui – Product Owner @ Société Générale

This talk covers an overview of Airflow as well as lessons learned from its implementation in a banking production environment at Société Générale. It is the summary of a two-year experience, the story of an adventure within Société Générale to offer an internal cloud solution based on Airflow (AirflowaaS). I will cover the following points:
- As part of the search for an open-source replacement for a proprietary orchestration suite, we carried out a study that led us to choose Apache Airflow, but why?
- The different implementation models (HA, scalability, ...)
- How to manage an Airflow platform in production with 45,000 runs per month

Airflow Cloud Computing
Adam Boscarino – Lead Data Infrastructure Engineer @ Devoted Health

Learn how Devoted Health went from cron jobs to an Airflow deployment on Kubernetes using a combination of open-source and internal tooling. Devoted Health, a Medicare Advantage startup, made this move in a short period of time. This journey is a common one, but it still has a steep learning curve for new Airflow users. This talk will give you a blueprint to follow by covering the tools we use, best practices, and lessons learned. We'll share Devoted's approach to managing our deployment, monitoring the platform, and developing, testing, and deploying DAGs. This includes internal tooling we've written that allows data scientists to work with Airflow without worrying about Airflow itself.
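One concrete piece of DAG-testing tooling (a common practice, not necessarily Devoted's exact approach) is a CI check that task dependencies contain no cycles, something Airflow itself enforces at parse time. A standalone sketch:

```python
def has_cycle(deps):
    """Detect cycles via depth-first search with node coloring.

    `deps` maps task_id -> list of downstream task_ids.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {t: WHITE for t in deps}

    def visit(t):
        color[t] = GRAY
        for d in deps.get(t, []):
            if color.get(d, WHITE) == GRAY:   # back edge: cycle
                return True
            if color.get(d, WHITE) == WHITE and visit(d):
                return True
        color[t] = BLACK
        return False

    return any(visit(t) for t in deps if color[t] == WHITE)

acyclic = {"extract": ["transform"], "transform": ["load"], "load": []}
cyclic = {"a": ["b"], "b": ["a"]}
print(has_cycle(acyclic), has_cycle(cyclic))  # → False True
```

Running such checks in CI catches broken DAG definitions before they ever reach the scheduler.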

Airflow Kubernetes
Ry Walker – CEO @ Astronomer, Maxime Beauchemin – Founder & CEO @ Preset, Viraj Parekh – Founding Team at Astronomer

Astronomer is focused on improving Airflow's user experience through the entire lifecycle — from authoring + testing DAGs, to building containers and deploying the DAGs, to running and monitoring both the DAGs and the infrastructure they operate within — with an eye towards increased security and governance as well. In this talk we walk you through some current UX challenges, give an overview of how the Astronomer platform addresses the major ones, and provide a sneak peek at the things we're working on in the coming months to improve Airflow's user experience. This is a sponsored talk, presented by Astronomer.

Airflow Astronomer Cyber Security
Karolina Rosol – Sr. Technical Program Manager @ Snowflake, Maciej Oczko – Sr. Engineering Manager @ Snowflake

This talk shares Polidea's journey from a mobile app development studio to an OSS-oriented business partner. We will tell you our story of growing into code leadership throughout the years. We are also going to share the challenges and practical insights of managing open-source projects in our company. After this talk, you will know how we approached combining open source, business, and team management without forgetting the human aspect. This is a sponsored talk, presented by Polidea.

Bolke de Bruin – Airflow Team Member

Let's be honest about it: many of us don't consider data lineage to be cool. But what if lineage allowed you to write less boilerplate and less code, while at the same time making your data scientists, your auditors, your management, and, well, everyone happier? What if you could write DAGs that mix task-based and data-based definitions? Lineage support has been incubating in Airflow for a while. It was buggy and not very easy to use. Still, for a lot of reasons it is really cool to have data lineage available. One of those reasons is that it can make writing DAGs a lot easier. Recently a lot of development has gone into improved lineage support, making it much easier or even transparent to use. In this talk I will focus on what we have in mind, evangelize data lineage, but also gather feedback from the audience on where we should take it next.
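To make the task-based vs. data-based idea concrete: Airflow's lineage API lets tasks declare inlets and outlets, and task ordering can then be inferred from shared datasets. A hedged, Airflow-free sketch of that inference, with invented task and dataset names:

```python
def lineage_edges(tasks):
    """Infer task-to-task edges from shared datasets.

    `tasks` maps task_id -> {"inlets": [...], "outlets": [...]};
    a producer of a dataset precedes every consumer of it.
    """
    producers = {}
    for tid, io in tasks.items():
        for ds in io.get("outlets", []):
            producers.setdefault(ds, []).append(tid)
    edges = set()
    for tid, io in tasks.items():
        for ds in io.get("inlets", []):
            for p in producers.get(ds, []):
                edges.add((p, tid))
    return sorted(edges)

tasks = {
    "extract": {"outlets": ["raw_orders"]},
    "transform": {"inlets": ["raw_orders"], "outlets": ["clean_orders"]},
    "report": {"inlets": ["clean_orders"]},
}
print(lineage_edges(tasks))  # → [('extract', 'transform'), ('transform', 'report')]
```

This is the sense in which lineage can reduce boilerplate: the data declarations imply the dependencies, so they need not be wired up by hand.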

Airflow
Naresh Yegireddi – Lead Data Engineer @ PlayStation, Patricio Garza – Principal Data Architect @ PlayStation

Having been a pioneer for the past 25 years, Sony PlayStation has played a vital role in the interactive gaming industry. With over 100 million monthly active users, 100+ million PS4 console sales, and thousands of game development partners across the globe, a big-data problem is inevitable. This presentation talks about how we scaled Airflow horizontally, which has helped us build a stable, scalable, and optimal data processing infrastructure powered by Apache Spark, AWS ECS, EC2, and Docker. Due to the demand for processing large volumes of data, and to meet the organization's growing data analytics and usage demands, the data team at PlayStation took the initiative to build an open-source big data processing infrastructure with Apache Spark in Python as the core ETL engine and Apache Airflow as the core workflow management tool for the entire ecosystem. We started with an Airflow application running on a single AWS EC2 instance to support a parallelism of 16 with 1 scheduler and 1 worker, and eventually scaled it to a bigger scheduler along with 4 workers to support a parallelism of 96, a DAG concurrency of 96, and a worker task concurrency of 24. Containerizing all the services on AWS ECS gave us the ability to scale Airflow horizontally.
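The concurrency figures quoted in the abstract map onto standard Airflow 1.10 settings; this fragment only echoes those numbers, and everything else about the deployment is omitted:

```ini
[core]
executor = CeleryExecutor
# Cluster-wide cap on concurrently running task instances
parallelism = 96
# Maximum running tasks per DAG
dag_concurrency = 96

[celery]
# Tasks each Celery worker runs at once (4 workers in the talk's setup)
worker_concurrency = 24
```

With 4 workers at a concurrency of 24 each, the worker fleet matches the cluster-wide parallelism of 96, which is why the scheduler rather than the workers became the next bottleneck to scale.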

Airflow Analytics AWS Amazon EC2 Big Data Data Analytics Docker ETL/ELT Python Spark