talk-data.com

Topic: Kubernetes

Tags: container_orchestration, devops, microservices

560 activities tagged

Activity Trend

Peak of 40 activities per quarter, 2020-Q1 to 2026-Q1

Activities

560 activities · Newest first

It should be no surprise to the Airflow community that the hype around generative large language models (LLMs) and their wildly inventive chat front ends has brought significant attention to growing these models and feeding them a steady diet of data. For many communities in the infrastructure, orchestration, and data landscape this is an opportunity to think big, help our users scale, and make the right foundational investments to sustain that growth over the long term. In this keynote I’ll talk about my own community, Kubernetes, and how we’re using the surge in AI/ML excitement to address long-standing gaps and unlock new capabilities, not just for the workloads using GPUs and the platform teams supporting them, but also for Airflow users and other key automators of workflows. We’re all in this together, and the future of orchestration is moving mountains of data at the speed of light!

As a bank, Monzo has seen exponential growth in active users, from 1.6 million in 2019 to 5.8 million in 2022. At the same time, the number of data users and analysts has expanded from an initial team of 4 to 132. Alongside this growth, our infrastructure and tooling have had to evolve to deliver the same value at a new scale. From an Airflow installation deployed on a single monolithic instance, we now deploy atop Kubernetes and have integrated our Airflow setup into the bank’s backend systems. This talk charts the story of that expansion and the growing pains we’ve faced, as well as looking to the future of our use of Airflow. We’ll first discuss how data at Monzo works, from event capture to arrival in our Data Warehouse, before assessing the challenges of our Airflow setup. We’ll then dive into the re-platforming that was required to meet our growing data needs, and some of the unique challenges that come with serving an ever-growing user base and need for analysis and insight.

We would love to speak about our experience upgrading our old Airflow 1 infrastructure to Airflow 2 on Kubernetes, and how we orchestrated the migration of approximately 1,500 DAGs owned by multiple teams in our organization. We hit some interesting challenges along the way and can speak about our solutions. Points we can talk about: our old Airflow 1 infrastructure and why we decided to move to Kubernetes for Airflow 2; the possible migration paths we considered and why we chose the route we did; things we did to make the migration easier, including DAG factories (a neat programmatic factory interface for our users, sketched below), a custom cross-Airflow-instance DAG dependency solution, and DAG audits (programmatically determining which DAGs were actually still being used, to reduce the migration load); and problems we faced, such as DAG ownership, backfilling in Airflow 2 on Kubernetes, and DAG dependencies.
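
The DAG factory idea mentioned above can be illustrated with a short sketch. This is a minimal, hypothetical example rather than the team's actual interface: `build_dag`, the spec fields, and the example specs are all illustrative, and it assumes Airflow 2.4+ for the `schedule` argument.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator


def build_dag(dag_id, schedule, commands):
    """Build a DAG from a small declarative spec (all field names are illustrative)."""
    with DAG(
        dag_id=dag_id,
        schedule=schedule,  # Airflow 2.4+; older 2.x releases use schedule_interval
        start_date=datetime(2023, 1, 1),
        catchup=False,
    ) as dag:
        for task_id, command in commands.items():
            BashOperator(task_id=task_id, bash_command=command)
    return dag


# One DAG per team-owned spec; assigning to globals() makes the scheduler pick them up.
SPECS = [
    {"dag_id": "team_a_daily", "schedule": "@daily",
     "commands": {"extract": "echo extract", "load": "echo load"}},
    {"dag_id": "team_b_hourly", "schedule": "@hourly",
     "commands": {"sync": "echo sync"}},
]
for spec in SPECS:
    globals()[spec["dag_id"]] = build_dag(spec["dag_id"], spec["schedule"], spec["commands"])
```

In practice the specs would typically come from per-team config files rather than being hard-coded, which is what makes the factory useful during a large multi-team migration.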

In this presentation, we discuss how we built a fully managed workflow orchestration system at Salesforce using Apache Airflow to facilitate dependable data lake infrastructure on the public cloud. We touch upon how we utilized Kubernetes for increased scalability and resilience, as well as the most effective approaches for managing and scaling data pipelines. We will also talk about how we addressed data security and privacy, multitenancy, and interoperability with other internal systems. We discuss how we use this system to empower users to effortlessly build reliable pipelines that incorporate failure detection, alerting, and monitoring for deep insights, removing the undifferentiated heavy lifting associated with running and managing their own orchestration engines. Lastly, we elaborate on how we integrated our in-house CI/CD pipelines to enable effective DAG and dependency management, further enhancing the system’s capabilities.

We will cover how Snap (parent company of Snapchat) has been using Airflow since 2016: how we built a secure deployment on GCP that integrates with internal tools for workload authorization, RBAC, and more; how we made DAG permissions easy for customers to use through k8s workload identity binding and tight UI integration; and how we are migrating 2,500+ DAGs from Airflow 1 on Python 2 to Airflow 2 on Python 3 using tools and automation. Code/DAG migration requires a significant time investment, so our team created several tools that can convert or rewrite DAGs in the new format. We will also cover other self-service tools that we built internally.

session
by Niko Oliveira (Amazon | Apache Airflow Committer)

Executors are a core concept in Apache Airflow and an essential piece of DAG execution. They have seen a lot of investment over the years, and there are many exciting advancements that will benefit both users and contributors. This talk will briefly discuss executors, how they work, and what they are responsible for. It will then describe Executor Decoupling (AIP-51) and how it has fully unlocked development of third-party executors. We’ll touch on the migration of “core” executors (such as Celery and Kubernetes) to their own packages, as well as the addition of new third-party executors from providers such as AWS. Finally, we’ll give a description and demo of Hybrid Executors, a proposed new feature that allows multiple executors to be used natively and seamlessly side by side within a single Airflow environment, which will be powerful in a future full of many new Airflow executors.

Airflow’s KubernetesExecutor has supported multi_namespace_mode for a long time. This feature is great at allowing Airflow jobs to run in different namespaces on the same Kubernetes cluster, for better isolation and easier management. However, it requires a cluster-scoped role for the Airflow scheduler, which can create security problems or be a blocker for some users. PR https://github.com/apache/airflow/pull/28047, which will become available in Airflow 2.6.0, resolves this issue by allowing Airflow users to specify multi_namespace_mode_namespace_list when using multi_namespace_mode, so that no cluster role is needed and users only need to ensure the scheduler has permissions on certain namespaces rather than all namespaces on the Kubernetes cluster. This talk aims to help you better understand the KubernetesExecutor and how to set it up in a more secure manner; a rough configuration sketch follows.
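
As a rough illustration of the setup described above, the namespace list can be supplied through Airflow's standard environment-variable configuration. This is a hedged sketch: the namespace names are made up, and depending on your Airflow version the config section is [kubernetes] or [kubernetes_executor], so the variable prefix may need adjusting.

```python
import os

# Hedged sketch: enable multi_namespace_mode restricted to an explicit namespace list,
# so the scheduler only needs namespace-scoped Roles instead of a ClusterRole.
# The config section is [kubernetes] or [kubernetes_executor] depending on the Airflow
# version; adjust the AIRFLOW__<SECTION>__ prefix accordingly. Namespace names are made up.
os.environ["AIRFLOW__KUBERNETES__MULTI_NAMESPACE_MODE"] = "True"
os.environ["AIRFLOW__KUBERNETES__MULTI_NAMESPACE_MODE_NAMESPACE_LIST"] = "team-a,team-b,team-c"
```

The same keys can of course be set in airflow.cfg or the Helm chart values instead of environment variables; the important part is that the scheduler's RBAC only has to cover the listed namespaces.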

Much of the world sees Airflow as a hammer and ETL tasks as nails, but in reality, Airflow is much more of a sophisticated multitool, capable of orchestrating a wide variety of complex workflows. Astronomer’s Customer Reliability Engineering (CRE) team is leveraging this potential in its development of Airline, a tool powered by Airflow that monitors Airflow deployments and sends alerts proactively when issues arise. In this talk, Ryan Hatter from Astronomer will give an overview of Airline. He’ll explain how it integrates with ZenDesk, Kubernetes, and other services to resolve customers’ problems more quickly, and in many cases, even before customers realize there’s an issue. Join us for a practical exploration of Airflow’s capabilities beyond ETL, and learn how proactive, automated monitoring can enhance your operations.

This talk will give a high-level overview of the architecture of a data product DAG, its benefits in a data mesh world, and how to implement it easily. Airflow is the de-facto orchestrator we use at Astrafy for all our data engineering projects. Over the years we have developed deep expertise in orchestrating data jobs, and recently we have adopted the “data mesh” paradigm of having one Airflow DAG per data product. Our standard data product DAGs contain the following stages: data contract (check the integrity of the data before transforming it), data transformation (apply dbt transformations via a KubernetesPodOperator), and data distribution (mainly informing downstream applications that new data is available to be consumed); a sketch of such a DAG follows. For use cases where different data products need to finish before another data product is triggered, we have a mechanism with an engine in between that keeps track of finished DAGs and triggers DAGs based on a mapping table containing data product dependencies.
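
A minimal sketch of such a three-stage data product DAG is shown below. All identifiers (the DAG id, namespace, image, and dbt selector) are assumptions made for illustration, not Astrafy's actual setup, and the KubernetesPodOperator import path varies with the cncf.kubernetes provider version.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
# Newer provider versions expose this under airflow.providers.cncf.kubernetes.operators.pod
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator


def check_data_contract(**_):
    """Placeholder: validate incoming data against the product's data contract."""


def notify_consumers(**_):
    """Placeholder: tell downstream applications that fresh data is available."""


with DAG(
    dag_id="data_product_orders",  # hypothetical data product
    schedule="@daily",             # Airflow 2.4+; older 2.x uses schedule_interval
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    data_contract = PythonOperator(task_id="data_contract", python_callable=check_data_contract)

    data_transformation = KubernetesPodOperator(
        task_id="data_transformation",
        name="dbt-run",
        namespace="data-products",           # assumed namespace
        image="ghcr.io/example/dbt:latest",  # assumed dbt image
        cmds=["dbt"],
        arguments=["run", "--select", "orders"],
    )

    data_distribution = PythonOperator(task_id="data_distribution", python_callable=notify_consumers)

    data_contract >> data_transformation >> data_distribution
```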

Productive cross-team collaboration between data engineers and analysts is the goal of all data teams; however, fulfilling that mission can be challenging given the diverse set of skills each group brings. In this talk we present an example of how one team tackled this topic by creating a flexible, dynamic, and extensible framework using Airflow and cloud services that allowed engineers and analysts to jointly create data-centric microservices to serve up projections and other robust analysis for use in the organization. The framework, which utilized dynamic DAG generation configured with YAML files, Kubernetes jobs, and dbt transformations (sketched below), abstracted away many of the details associated with workflow orchestration, allowing analysts to focus on their Python or R code and data processing logic while enabling data engineers to monitor the pipelines and ensure their scalability.
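
A sketch of what such YAML-driven DAG generation can look like follows. The config directory, schema fields, and images are hypothetical; the abstract does not describe the team's actual schema, so this only illustrates the general pattern.

```python
import glob
from datetime import datetime

import yaml
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

# Each YAML file describes one pipeline, e.g. (hypothetical schema):
#   name: churn_projection
#   schedule: "@daily"
#   steps:
#     - name: run_model
#       image: registry.example.com/analytics/churn:latest
for path in glob.glob("/opt/airflow/dags/configs/*.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)

    with DAG(
        dag_id=cfg["name"],
        schedule=cfg.get("schedule", "@daily"),
        start_date=datetime(2023, 1, 1),
        catchup=False,
    ) as dag:
        previous = None
        for step in cfg["steps"]:
            task = KubernetesPodOperator(
                task_id=step["name"],
                name=step["name"],
                image=step["image"],  # analyst-provided Python or R image
                arguments=step.get("args", []),
            )
            if previous is not None:
                previous >> task
            previous = task

    # Expose the generated DAG at module level so the scheduler picks it up.
    globals()[cfg["name"]] = dag
```

Analysts only edit the YAML and their container image; the engineering team owns the generator and the monitoring around it.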

Today, all major cloud service providers and third-party vendors include Apache Airflow as a managed service offering in their portfolios. While these cloud-based solutions handle the undifferentiated heavy lifting of environment management, some data teams still want to operate self-managed Airflow instances to satisfy specific differentiated capabilities. In this session, we will talk about: why you might need to run self-managed Airflow; the available deployment options (with emphasis on Airflow on Kubernetes); how to deploy Airflow on Kubernetes using automation (Helm charts and Terraform); the developer experience (syncing DAGs using automation); the operator experience (observability); and owned responsibilities and trade-offs. This will give you an end-to-end perspective on operating a highly available and scalable self-managed Airflow environment to meet your ever-growing workflow needs.

What if you didn’t have to worry about mounting cloud costs and could improve performance at the same time? Automation makes it possible, and we’ll show you how! During the session, you’ll discover practical techniques to lower your Kubernetes spend, eliminate waste, and boost performance. From VMs and workloads to architecture and network, we’ll outline how different factors impact cost behaviors so you get on top of your Kubernetes expenses efficiently.

What if you want to install Kubernetes, but your machines are not connected to the internet? Data protection or critical infrastructure requirements often rule out the standard, internet-connected installation path. In this talk I will discuss several options and suggest an enterprise architecture that gives you almost the same flexibility as a standard installation with internet access.

Modernize Applications with Apache Kafka

Application modernization has become increasingly important as older systems struggle to keep up with today's requirements. When you migrate legacy monolithic applications to microservices, easier maintenance and optimized resource utilization generally follow. But new challenges arise around communication within services and between applications. You can overcome many of these issues with the help of modern messaging technologies such as Apache Kafka. In this report, Jennifer Vargas and Richard Stroop from Red Hat explain how IT leaders and enterprise architects can use Kafka for microservices communication and then off-load operational needs through the use of Kubernetes and managed services. You'll also explore application modernization techniques that don't require you to break down your monolithic application. This report helps you: understand the importance of migrating your monolithic applications to microservices; examine the various challenges you may face during the modernization process; explore application modernization techniques and learn the benefits of using Apache Kafka during the development process; learn how Apache Kafka can support business outcomes; and understand how Kubernetes can help you overcome any difficulties you may encounter when using Kafka for application development.

Creating our Own Kubernetes & Docker to Run Our Data Infrastructure | Modal

ABOUT THE TALK: In this talk, Erik Bernhardsson will share how Modal starts 1000s of large containers in seconds, and what they had to do under the surface to build this. This includes a custom file system written in Rust, their own container runtime, and their own container image builder. This talk will give you an idea of how containers work, along with some of the low-level Linux details underneath. We'll also talk about how many infrastructure tools hold data teams back, and why data teams deserve faster and better tools.

ABOUT THE SPEAKER: Erik Bernhardsson is the founder and CEO of Modal, which is an infrastructure provider for data teams. Before Modal, Erik was the CTO at Better for six years, and previously spent seven years at Spotify, building the music recommendation system and running data teams.

ABOUT DATA COUNCIL: Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers.

Make sure to subscribe to our channel for the most up-to-date talks from technical professionals on data-related topics including data infrastructure, data engineering, ML systems, analytics, and AI from top startups and tech companies.

FOLLOW DATA COUNCIL: Twitter: https://twitter.com/DataCouncilAI LinkedIn: https://www.linkedin.com/company/datacouncil-ai/

SQL Server 2022 Administration Inside Out

Conquer SQL Server 2022 and Azure SQL administration from the inside out! Dive into SQL Server 2022 administration and grow your Microsoft SQL Server data platform skillset. This well-organized reference packs in timesaving solutions, tips, and workarounds: everything you need to plan, implement, deploy, provision, manage, and secure SQL Server 2022 in any environment, whether on-premises, cloud, or hybrid, including detailed, dedicated chapters on Azure SQL Database and Azure SQL Managed Instance. Nine experts thoroughly tour the DBA capabilities available in the SQL Server 2022 Database Engine, SQL Server Data Tools, SQL Server Management Studio, PowerShell, and much more. You'll find extensive new coverage of Azure SQL Database and Azure SQL Managed Instance, both as a cloud platform for SQL Server and in their new integrations with SQL Server 2022, information available in no other book. Discover how experts tackle today's essential tasks and challenge yourself to new levels of mastery. • Identify low-hanging fruit and practical, easy wins for improving SQL Server administration • Get started with modern SQL Server tools, including SQL Server Management Studio and Azure Data Studio • Upgrade your SQL Server administration skillset to the new features of SQL Server 2022, Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Linux • Design and implement modern on-premises database infrastructure, including Kubernetes • Leverage data virtualization of third-party or non-relational data sources • Monitor SQL instances for corruption, index activity, fragmentation, and extended events • Automate maintenance plans, database mail, jobs, alerts, proxies, and event forwarding • Protect data through encryption, privacy, and auditing • Provision, manage, scale, secure, and bidirectionally synchronize Microsoft's powerful Azure SQL Managed Instance • Understand and enable new Intelligent Query Processing features to increase query concurrency • Prepare a best-practice runbook for disaster recovery • Use SQL Server 2022 features to span infrastructure across hybrid environments ...

Everybody knows our yellow vans, trucks, and planes around the world. But do you know how data drives our business and how we leverage algorithms and technology in our core operations? We will share some “behind the scenes” insights on Deutsche Post DHL Group’s journey towards a Data-Driven Company. • Large-Scale Use Cases: challenging and high-impact use cases in all major areas of logistics, including Computer Vision and NLP • Fancy Algorithms: Deep Neural Networks, TSP solvers, and the standard toolkit of a Data Scientist • Modern Tooling: Cloud Platforms, Kubernetes, Kubeflow, AutoML • No rusty ways of working: small, self-organized, agile project teams combining state-of-the-art Machine Learning with MLOps best practices • A young, motivated, and international team: German skills are only “nice to have”. But we have more to offer than slides filled with buzzwords. We will demonstrate our passion for our work, deep dive into our largest use cases that impact your everyday life, and share our approach to a time-series forecasting library, combining data science, software engineering, and technology for efficient and easy-to-maintain machine learning projects.

Summary

Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increase, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production-ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high-throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!

Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.

Your host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what AlignAI is and the story behind it?

What are the core problems that you are focused on addressing?

What are the tactical ways that you are working to solve those problems?

What are some of the common and avoidable ways that analytics/AI projects go wrong?

What are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects?

What are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution?

Can you describe the design and implementation of the AlignAI platform?

How have the goals and implementation of the product changed since you