talk-data.com

Topic: Python
Tags: programming_language, data_science, web_development
Activity trend: 185 peak/qtr (2020-Q1 to 2026-Q1)

Activities (1446, newest first)

DuckDB in Action

Dive into DuckDB and start processing gigabytes of data with ease—all with no data warehouse. DuckDB is a cutting-edge SQL database that makes it incredibly easy to analyze big data sets right from your laptop. In DuckDB in Action you'll learn everything you need to know to get the most out of this awesome tool, keep your data secure on-prem, and save hundreds on your cloud bill. From data ingestion to advanced data pipelines, you'll learn everything you need to get the most out of DuckDB—all through hands-on examples.

Open up DuckDB in Action and learn how to:
- Read and process data from CSV, JSON, and Parquet sources, both local and remote
- Write analytical SQL queries, including aggregations, common table expressions, window functions, special types of joins, and pivot tables
- Use DuckDB from Python, both with SQL and its "Relational" API, interacting with databases as well as data frames
- Prepare, ingest, and query large datasets
- Build cloud data pipelines
- Extend DuckDB with custom functionality

Pragmatic and comprehensive, DuckDB in Action introduces the DuckDB database and shows you how to use it to solve common data workflow problems. You won't need to read through pages of documentation—you'll learn as you work. Get to grips with DuckDB's unique SQL dialect, learning to seamlessly load, prepare, and analyze data using SQL queries. Extend DuckDB with both Python and built-in tools such as MotherDuck, and gain practical insights into building robust and automated data pipelines.

About the Technology
DuckDB makes data analytics fast and fun! You don't need to set up Spark or run a cloud data warehouse just to process a few hundred gigabytes of data. DuckDB is easily embeddable in any data analytics application, runs on a laptop, and processes data from almost any source, including JSON, CSV, Parquet, SQLite, and Postgres.

About the Book
DuckDB in Action guides you example-by-example from setup, through your first SQL query, to advanced topics like building data pipelines and embedding DuckDB as a local data store for a Streamlit web app. You'll explore DuckDB's handy SQL extensions, get to grips with aggregation, analysis, and data without persistence, and use Python to customize DuckDB. A hands-on project accompanies each new topic, so you can see DuckDB in action.

What's Inside
- Prepare, ingest, and query large datasets
- Build cloud data pipelines
- Extend DuckDB with custom functionality
- Fast-paced SQL recap: from simple queries to advanced analytics

About the Reader
For data pros comfortable with Python and CLI tools.

About the Authors
Mark Needham is a blogger and video creator at @LearnDataWithMark. Michael Hunger leads product innovation for the Neo4j graph database. Michael Simons is a Java Champion, author, and engineer at Neo4j.

Quotes
"I use DuckDB every day, and I still learned a lot about how DuckDB makes things that are hard in most databases easy!" - Jordan Tigani, Founder, MotherDuck
"An excellent resource! Unlocks possibilities for storing, processing, analyzing, and summarizing data at the edge using DuckDB." - Pramod Sadalage, Director, Thoughtworks
"Clear and accessible. A comprehensive resource for harnessing the power of DuckDB for both novices and experienced professionals." - Qiusheng Wu, Associate Professor, University of Tennessee
"Excellent! The book all we ducklings have been waiting for!" - Gunnar Morling, Decodable
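To make the Python angle concrete, here is a minimal sketch of the kind of workflow the book covers, using DuckDB's documented Python API. The file name and column names are made up for illustration, so treat this as a sketch rather than an excerpt from the book.

```python
# Minimal sketch of querying a local CSV with DuckDB from Python.
# Assumes `pip install duckdb` and a hypothetical file named orders.csv.
import duckdb

# Plain SQL over a file path -- DuckDB infers the CSV schema automatically.
top_customers = duckdb.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM 'orders.csv'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 5
""").df()  # materialize the result as a pandas DataFrame

# The same aggregation expressed with the "relational" API.
rel = duckdb.read_csv("orders.csv")
top_rel = rel.aggregate("customer_id, sum(amount) AS total", "customer_id")
print(top_rel.order("total DESC").limit(5))
```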

Summary Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor- and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe the scope and purpose of data contracts in the context of this conversation?
- In what way(s) do they differ from data quality/data observability?
- Data contracts are also known as the API for data, can you elaborate on this?
- What are the types of guarantees and requirements that you can enforce with these data contracts?
- What are some examples of constraints or guarantees that cannot be represented in these contracts?
- Are data contracts related to the shift-left?
- The obvious application of data contracts is in the context of pipeline execution flows, to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
- How did you approach the design of the syntax and implementation for Soda's data contracts?
- Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt and Great Expectations?
- Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
- What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
- When are data contracts the wrong choice?
- What do you have planned for the future of data contracts?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening!
Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- Soda (Podcast Episode)
- JBoss
- Data Contract
- Airflow
- Unit Testing
- Integration Testing
- OpenAPI
- GraphQL
- Circuit Breaker Pattern
- SodaCL
- Soda Data Contracts
- Data Mesh
- Great Expectations
- dbt Unit Tests
- Open Data Contracts
- ODCS == Open Data Contract Standard
- ODPS == Open Data Product Specification

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
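As an illustration of the episode's core idea, a data contract can be thought of as a machine-checkable schema plus quality rules enforced before data propagates downstream. The sketch below is generic and hypothetical: it is not Soda's contract syntax or API, and the column names and rules are invented for the example.

```python
# Illustrative-only sketch of what a data contract check might enforce at the
# edge of a pipeline. NOT Soda's syntax -- a generic pandas-based example.
import pandas as pd

CONTRACT = {
    "required_columns": {"order_id": "int64", "email": "object", "amount": "float64"},
    "not_null": ["order_id", "amount"],
    "min_rows": 1,
}

def check_contract(df: pd.DataFrame, contract: dict) -> list:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for col, dtype in contract["required_columns"].items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            violations.append(f"wrong type for {col}: {df[col].dtype} != {dtype}")
    for col in contract["not_null"]:
        if col in df.columns and df[col].isnull().any():
            violations.append(f"null values in {col}")
    if len(df) < contract["min_rows"]:
        violations.append("dataset has too few rows")
    return violations

# In a pipeline, failing checks would block downstream propagation:
# violations = check_contract(extracted_df, CONTRACT)
# if violations:
#     raise ValueError(violations)
```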

Although Python is talked about a lot in the data world, if you are aiming for your first data analyst role, I don’t think you should learn it.

It takes too much time, it’s hard to learn, and it’s hard to use.

In this episode, I’ll dive into more of the specifics and what to focus on instead.

⁠📩 Get my weekly email with helpful data career tips

🧙‍♂️ Ace the Interview with Confidence

⁠📊 Come to my next free “How to Land Your First Data Job” training⁠

⁠🏫 Check out my 10-week data analytics bootcamp

▶️ Want to be a Data Analyst? Learn These Skills w/ Luke Barousse

Timestamps:

(01:00) Why You Shouldn't Learn Python
(04:11) Is Python in Demand?
(06:03) What To Know in Python

Connect with Avery:

📺 Subscribe on YouTube

🎙Listen to My Podcast

👔 Connect with me on LinkedIn

📸 Instagram

🎵 TikTok

Mentioned in this episode: Join the last cohort of 2025! The LAST cohort of The Data Analytics Accelerator for 2025 kicks off on Monday, December 8th and enrollment is officially open!

To celebrate the end of the year, we’re running a special End-of-Year Sale, where you’ll get: ✅ A discount on your enrollment 🎁 6 bonus gifts, including job listings, interview prep, AI tools + more

If your goal is to land a data job in 2026, this is your chance to get ahead of the competition and start strong.

👉 Join the December Cohort & Claim Your Bonuses: https://DataCareerJumpstart.com/daa

Part 2 of the workshop providing a hands-on introduction to FiftyOne: loading datasets from the FiftyOne Dataset Zoo, navigating the FiftyOne App, programmatically inspecting attributes, adding new samples and custom attributes, generating and evaluating model predictions, and saving insightful views.
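A rough outline of that workflow in code, assuming the open-source FiftyOne package and its documented quickstart zoo dataset; the added field name and the confidence threshold are illustrative choices, not part of the workshop materials.

```python
# Rough sketch of the workshop's workflow (`pip install fiftyone`).
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load a small dataset from the FiftyOne Dataset Zoo and open the App.
dataset = foz.load_zoo_dataset("quickstart")
session = fo.launch_app(dataset)

# Programmatically inspect attributes and add a custom attribute to a sample.
sample = dataset.first()
print(sample.filepath, sample.ground_truth)
sample["reviewed"] = True  # custom attribute; schema expands on save
sample.save()

# Build and keep an "insightful view": only high-confidence predictions.
high_conf = dataset.filter_labels("predictions", F("confidence") > 0.75)
session.view = high_conf
```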

Podcast episode by Jordan Goldmeier (Booz Allen Hamilton; The Perduco Group; EY; Excel TV; Wake Forest University; Anarchy Data) and Adel (DataFramed)

Excel often gets unfair criticism from data practitioners; many of us will remember a time when Excel was looked down upon—why would anyone use Excel when we have powerful tools like Python, R, SQL, or BI tools? However, like it or not, Excel is here to stay, and there's a meme, bordering on reality, that Excel is carrying a large chunk of the world's GDP. But when it really comes down to it, can you do data science in Excel?

Jordan Goldmeier is an entrepreneur, a consultant, a best-selling author of four books on data, and a digital nomad. He started his career as a data scientist in the defense industry for Booz Allen Hamilton and The Perduco Group, before moving into consultancy with EY, and then teaching people how to use data at Excel TV, Wake Forest University, and now Anarchy Data. He also has a newsletter called The Money Making Machine, and he's on a mission to create 100 entrepreneurs.

In the episode, Adel and Jordan explore Excel in data science, Excel's popularity, use cases for Excel in data science, the impact of GenAI on Excel, Power Query and data transformation, advanced Excel features, Excel for prototyping and generating buy-in, the limitations of Excel and what other tools might emerge in its place, and much more.

Links Mentioned in the Show:
- Data Smart: Using Data Science to Transform Information Into Insight by Jordan Goldmeier
- [Webinar] Developing a Data Mindset: How to Think, Speak, and Understand Data
- [Course] Data Analysis in Excel
- Related Episode: Do Spreadsheets Need a Rethink? With Hjalmar Gislason, CEO of GRID
- Rewatch sessions from RADAR: AI Edition
- New to DataCamp? Learn on the go using the DataCamp mobile app
- Empower your business with world-class data and AI skills with DataCamp for business

Summary
Generative AI has rapidly gained adoption for numerous use cases. To support those applications, organizational data platforms need to add new features and data teams have increased responsibility. In this episode Lior Gavish, co-founder of Monte Carlo, discusses the various ways that data teams are evolving to support AI powered features and how they are incorporating AI into their work.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Lior Gavish about the impact of AI on data engineers.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by clarifying what we are discussing when we say "AI"?
- Previous generations of machine learning (e.g. deep learning, reinforcement learning, etc.) required new features in the data platform. What new demands is the current generation of AI introducing?
- Generative AI also has the potential to be incorporated in the creation/execution of data pipelines. What are the risk/reward tradeoffs that you have seen in practice?
- What are the areas where LLMs have proven useful/effective in data engineering?
- Vector embeddings have rapidly become a ubiquitous data format as a result of the growth in retrieval augmented generation (RAG) for AI applications. What are the end-to-end operational requirements to support this use case effectively?
- As with all data, the reliability and quality of the vectors will impact the viability of the AI application. What are the different failure modes/quality metrics/error conditions that they are subject to?
- As much as vectors, vector databases, RAG, etc. seem exotic and new, it is all ultimately shades of the same work that we have been doing for years. What are the areas of overlap in the work required for running the current generation of AI, and what are the areas where it diverges?
- What new skills do data teams need to acquire to be effective in supporting AI applications?
- What are the most interesting, innovative, or unexpected ways that you have seen AI impact data engineering teams?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with the current generation of AI?
- When is AI the wrong choice?
- What are your predictions for the future impact of AI on data engineering teams?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- Monte Carlo (Podcast Episode)
- NLP == Natural Language Processing
- Large Language Models
- Generative AI
- MLOps
- ML Engineer
- Feature Store
- Retrieval Augmented Generation (RAG)
- Langchain

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
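For readers unfamiliar with the vector-embedding workflow discussed in this episode, the retrieval step of RAG boils down to ranking stored embeddings by similarity to a query embedding. A toy NumPy sketch with random placeholder vectors (real embeddings would come from an embedding model):

```python
# Toy illustration of the retrieval step behind RAG: rank documents by cosine
# similarity to a query. Embeddings here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 384))   # 1000 docs, 384-dim vectors
query_embedding = rng.normal(size=384)

# Normalize so the dot product equals cosine similarity.
docs_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_embedding / np.linalg.norm(query_embedding)

scores = docs_norm @ query_norm
top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 most similar documents
print(top_k, scores[top_k])
```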

Big Data on Kubernetes

Big Data on Kubernetes is your comprehensive guide to leveraging Kubernetes for scalable and efficient big data solutions. You will learn key concepts of Kubernetes architecture and explore tools like Apache Spark, Airflow, and Kafka. Gain hands-on experience building complete data pipelines to tackle real-world data challenges.

What this Book will help me do
- Understand Kubernetes architecture and learn to deploy and manage clusters.
- Build and orchestrate big data pipelines using Spark, Airflow, and Kafka.
- Develop scalable and resilient data solutions with Docker and Kubernetes.
- Integrate and optimize data tools for real-time ingestion and processing.
- Apply concepts to hands-on projects addressing actual big data scenarios.

Author(s)
Neylson Crepalde is an experienced data specialist with extensive knowledge of Kubernetes and big data solutions. With deep practical experience, Neylson brings real-world insights to his writing. His approach emphasizes actionable guidance and relatable problem-solving with a strong foundation in scalable architecture.

Who is it for?
This book is ideal for data engineers, BI analysts, data team leaders, and tech managers familiar with Python, SQL, and YAML. Targeted at professionals seeking to develop or expand their expertise in scalable big data solutions, it provides practical insights into Docker, Kubernetes, and prominent big data tools.
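As a flavor of this material, a minimal Airflow DAG that launches a containerized batch job on Kubernetes might look like the sketch below. The image, namespace, and command are placeholders, and the KubernetesPodOperator import path varies with the installed provider version, so treat it as an illustration rather than an excerpt from the book.

```python
# Minimal sketch: an Airflow DAG that runs a batch job as a Kubernetes pod.
# Placeholders: namespace, image, command. Import path depends on the
# apache-airflow-providers-cncf-kubernetes version installed.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="k8s_batch_job",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    process = KubernetesPodOperator(
        task_id="process_events",
        name="process-events",
        namespace="data-jobs",                 # placeholder namespace
        image="myrepo/spark-batch:latest",     # placeholder image
        cmds=["python", "/app/process.py"],    # placeholder command
        get_logs=True,
    )
```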

Summary In this episode Praveen Gujar, Director of Product at LinkedIn, talks about the intricacies of product management for data and analytical platforms. Praveen shares his journey from Amazon to Twitter and now LinkedIn, highlighting his extensive experience in building data products and platforms, digital advertising, AI, and cloud services. He discusses the evolving role of product managers in data-centric environments, emphasizing the importance of clean, reliable, and compliant data. Praveen also delves into the challenges of building scalable data platforms, the need for organizational and cultural alignment, and the critical role of product managers in bridging the gap between engineering and business teams. He provides insights into the complexities of platformization, the significance of long-term planning, and the necessity of having a strong relationship with engineering teams. The episode concludes with Praveen offering advice for aspiring product managers and discussing the future of data management in the context of AI and regulatory compliance.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Praveen Gujar about product management for data and analytical platforms.

Interview
- Introduction
- How did you get involved in the area of data management?
- Product management is typically thought of as being oriented toward customer facing functionality and features. What is involved in being a product manager for data systems?
- Many data-oriented products that are customer facing require substantial technical capacity to serve those use cases. How does that influence the process of determining what features to provide/create?
  - Investment in technical capacity/platforms
  - Identifying groupings of features that can be served by a common platform investment
  - Managing organizational pressures between engineering, product, business, finance, etc.
- What are the most interesting, innovative, or unexpected ways that you have seen "Data Products & Platforms @ Big-tech" used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on "Building Data Products & Platforms for Big-tech"?
- When is "Data Products & Platforms @ Big-tech" the wrong choice?
- What do you have planned for the future of "Data Products & Platforms @ Big-tech"?

Contact Info
- LinkedIn
- Website

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- DataHub (Podcast Episode)
- RAG == Retrieval Augmented Generation

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this episode, Conor and Bryce chat with Kevlin Henney about C++, Python and more!

- Link to Episode 190 on Website
- Discuss this episode, leave a comment, or ask a question (on GitHub)

Twitter
- ADSP: The Podcast
- Conor Hoekstra
- Bryce Adelstein Lelbach

About the Guest
Kevlin Henney is an independent consultant, speaker, writer and trainer. His software development interests are in programming, practice and people. He has been a columnist for various magazines and websites. He is the co-author of A Pattern Language for Distributed Computing and On Patterns and Pattern Languages, two volumes in the Pattern-Oriented Software Architecture series, and editor of 97 Things Every Programmer Should Know and co-editor of 97 Things Every Java Programmer Should Know.

Show Notes
Date Recorded: 2024-07-11
Date Released: 2024-07-12
- When zombies attack! Bristol city council ready for undead invasion
- ACCU Conference
- 97 Things Every Programmer Should Know (GitHub)
- 97 Things Every Programmer Should Know
- 97 Things Every Java Programmer Should Know
- C++Now 2018: Ben Deane "Easy to Use, Hard to Misuse: Declarative Style in C++"
- When to Use a List Comprehension in Python

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Summary
Postgres is one of the most widely respected and liked database engines ever. To make it even easier for developers to use, Nikita Shamgunov decided to make it serverless, so that it can scale from zero to infinity. In this episode he explains the engineering involved to make that possible, as well as the numerous details that he and his team are packing into the Neon service to make it even more attractive for anyone who wants to build on top of Postgres.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Nikita Shamgunov about his work on making Postgres a serverless database at Neon.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Neon is and the story behind it?
- The ecosystem around Postgres is large and varied. What are the pain points that you are trying to address with Neon?
- What does it mean for a database to be serverless?
- What kinds of products and services are unlocked by making Postgres a serverless database?
- How does your vision for Neon compare/contrast with what you know of PlanetScale?
- Postgres is known for having a large ecosystem of plugins that add a lot of interesting and useful features, but the storage layer has not been as easily extensible historically. How have architectural changes in recent Postgres releases enabled your work on Neon?
- What are the core pieces of engineering that you have had to complete to make Neon possible?
- How have the design and goals of the project evolved since you first started working on it?
- The separation of storage and compute is one of the most fundamental promises of the cloud. What new capabilities does that enable in Postgres?
- How does the branching functionality change the ways that development teams are able to deliver and debug features?
- Because the storage is now a networked system, what new performance/latency challenges does that introduce? How have you addressed them in Neon?
- Anyone who has ever operated a Postgres instance has had to tackle the upgrade process. How does Neon address that process for end users?
- The rampant growth of AI has touched almost every aspect of computing, and Postgres is no exception. How does the introduction of pgvector and semantic/similarity search functionality impact the adoption and usage patterns of Postgres/Neon?
- What new challenges does that introduce for you as an operator and business owner?
- What are the lessons that you learned from MemSQL/SingleStore that have been most helpful in your work at Neon?
- What are the most interesting, innovative, or unexpected ways that you have seen Neon used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Neon?
- When is Neon the wrong choice? Postgres?
- What do you have planned for the future of Neon?

Contact Info
- @nikitabase on Twitter
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- Neon
- PostgreSQL
- Neon Github
- PHP
- MySQL
- SQL Server
- SingleStore (Podcast Episode)
- AWS Aurora
- Khosla Ventures
- YugabyteDB (Podcast Episode)
- CockroachDB (Podcast Episode)
- PlanetScale (Podcast Episode)
- Clickhouse (Podcast Episode)
- DuckDB (Podcast Episode)
- WAL == Write-Ahead Log
- PgBouncer
- PureStorage
- Paxos
- HNSW Index
- IVF Flat Index
- RAG == Retrieval Augmented Generation
- AlloyDB
- Neon Serverless Driver
- Devin
- magic.dev

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
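For context on the pgvector discussion in this episode, here is a hedged sketch of what similarity search over a Postgres instance looks like from Python; Neon exposes a standard Postgres connection string, and the DSN, table, and vector dimension below are placeholders.

```python
# Hedged sketch of a pgvector nearest-neighbour query via psycopg2.
# Placeholders: connection string, table name, 3-dimensional vectors.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@host/dbname")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id serial PRIMARY KEY,
            body text,
            embedding vector(3)
        );
    """)
    cur.execute(
        "INSERT INTO documents (body, embedding) VALUES (%s, %s)",
        ("hello world", "[0.1, 0.2, 0.3]"),
    )
    # `<->` is pgvector's L2 distance operator; smaller means more similar.
    cur.execute(
        "SELECT body FROM documents ORDER BY embedding <-> %s LIMIT 5;",
        ("[0.1, 0.2, 0.3]",),
    )
    print(cur.fetchall())
```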

Send us a text

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode, we explore:
- LLMs Gaming the System: Uncover how LLMs are using political sycophancy and tool-using flattery to game the system. Dive deeper: paper, chain of thought prompting & post on X.
- Recording Industry Association of America (RIAA) Sue AI Music Generators: They are taking on Suno and Udio for using copyrighted music to train their models. Some AI-generated music that is very similar to existing songs: song 1, song 2, song 3. More on GenAI: Midjourney creating copyrighted images, and ChatGPT reciting email addresses.
- AI-Powered Olympic Recaps: NBC's personalized daily recaps with Al Michaels' voice offer a new way to catch up on the Olympics.
- Figma's AI Redesign: Discover Figma's new AI tools that speed up design and creativity. We debate the tool's value and its application in the design process.
- Rabbit R1 Security Flaws: Hackers exposed hardcoded API keys in Rabbit R1's source code, leading to major security issues. Find out more.
- Pyinstrument for Python: Meet Pyinstrument, the easy-to-use Python profiler that helps you find and fix performance bottlenecks. Explore it on GitHub.
- The Ultimate Font - Bart's dreams come true: Explore the groundbreaking integration of TrueType fonts with AI for dynamic text rendering. Discover more here.
- Hot Takes on AI Competition: Google claims no one has a moat in AI, sparking debate on open-source models' future. We also explore the Ladybird Browser Project, an independently funded browser project aiming to build a cutting-edge browser engine.
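Since Pyinstrument comes up in this episode, here is a quick example of its basic profiling API; the profiled function is just a stand-in workload.

```python
# Quick look at Pyinstrument (`pip install pyinstrument`): a statistical
# profiler that reports where wall-clock time is spent.
from pyinstrument import Profiler

def busy_work():
    return sum(i * i for i in range(1_000_000))

profiler = Profiler()
profiler.start()
busy_work()
profiler.stop()

# Prints a call tree annotated with time per frame.
print(profiler.output_text(unicode=True, color=False))
```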

OpenLineage is an open standard for lineage data collection, integrated into the Airflow codebase, facilitating lineage collection across providers like Google, Amazon, and more. Atlan Data Catalog is a third-generation active metadata platform that is a single source of trust unifying cataloging, data discovery, lineage, and governance experience. We will demonstrate what OpenLineage is and how, with minimal and intuitive setup across Airflow and Atlan, it presents a unified view of workflows and efficient cross-platform lineage collection, including column-level lineage, across technologies (Python, Spark, dbt, SQL, etc.) and clouds (AWS, Azure, GCP, etc.) - all orchestrated by Airflow. This integration enables further use-case unlocks in automated metadata management by making the operational pipelines dataset-aware for self-service exploration. It will also demonstrate real-world challenges and resolutions for lineage consumers in improving audit and compliance accuracy through column-level lineage traceability across the data estate. The talk will also briefly overview the most recent OpenLineage developments and planned future enhancements.

NCR Voyix Retail Analytics AI team offers ML products for retailers while embracing Airflow as its MLOps platform. As the team is small and has had twice as many data scientists as engineers, we encountered challenges in making Airflow accessible to the scientists:
- As they come from diverse programming backgrounds, we needed an architecture enabling them to develop production-ready ML workflows without prior knowledge of Airflow.
- Due to dynamic product demands, we had to implement a mechanism to interchange Airflow operators effortlessly.
- As workflows serve multiple customers, they should be easily configurable and simultaneously deployable.

We came up with the following architecture to deal with the above:
- Enabling our data scientists to formulate ML workflows as structured Python files.
- Seamlessly converting the workflows into Airflow DAGs while aggregating their steps to be executed on different Airflow operators.
- Deploying DAGs via CI/CD's UI to the DAGs folder for all customers while considering definitions for each in their configuration files.

In this session, we will cover Airflow's evolution in our team and review the concepts of our architecture.
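A rough, purely illustrative sketch of this pattern follows; the spec format, step names, and conversion helper are hypothetical inventions for the example, not NCR Voyix's actual framework.

```python
# Illustrative sketch: a declarative workflow spec written by a data scientist
# is turned into an Airflow DAG by a thin conversion layer. All names are
# hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# What a data scientist would write: steps and dependencies, no Airflow knowledge.
WORKFLOW_SPEC = {
    "name": "churn_model_training",
    "steps": [
        {"id": "extract", "callable": lambda: print("pull features"), "upstream": []},
        {"id": "train", "callable": lambda: print("fit model"), "upstream": ["extract"]},
        {"id": "publish", "callable": lambda: print("push artifact"), "upstream": ["train"]},
    ],
}

def build_dag(spec: dict) -> DAG:
    """Convert the declarative spec into an Airflow DAG of PythonOperators."""
    dag = DAG(dag_id=spec["name"], start_date=datetime(2024, 1, 1), schedule=None)
    tasks = {}
    for step in spec["steps"]:
        tasks[step["id"]] = PythonOperator(
            task_id=step["id"], python_callable=step["callable"], dag=dag
        )
    for step in spec["steps"]:
        for upstream_id in step["upstream"]:
            tasks[upstream_id] >> tasks[step["id"]]
    return dag

# Airflow discovers DAGs at module level, so the generated DAG is exported here.
globals()[WORKFLOW_SPEC["name"]] = build_dag(WORKFLOW_SPEC)
```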

DAGify is a highly extensible, template-driven, enterprise scheduler migration accelerator that helps organizations speed up their migration to Apache Airflow. While DAGify does not claim to migrate 100% of existing scheduler functionality, it aims to heavily reduce the manual effort it takes for developers to convert their enterprise scheduler formats into Python-native Airflow DAGs. DAGify is an open source tool under the Apache 2.0 license and is available on GitHub (https://github.com/GoogleCloudPlatform/dagify). In this session we will introduce DAGify and its use cases, and demo its functionality by converting Control-M XML files to Airflow DAGs. Additionally, we will highlight DAGify's "no-code" extensibility by creating custom conversion templates to map Control-M functionality to Airflow operators.

At Wix, more often than not, business analysts build workflows themselves to avoid data engineers becoming a bottleneck. But how do you enable them to create SQL ETLs that start when dependencies are ready and send emails or refresh Tableau reports when the work is done? One simple answer may be to use Airflow. The problem is that not every BA can be expected to know Python and Git well enough to create thousands of DAGs easily. To bridge this gap we have built a web-based IDE, called Quix, that allows simple notebook-like development of Trino SQL workflows and converts them to Airflow DAGs when a user hits the "schedule" button. During the talk we will go through the problems of building a reliable and extendable DAG-generating tool, why we preferred Airflow over Apache Oozie, and also the tricks (sharding, HA mode, etc.) that allow Airflow to run 8000 active DAGs on a single cluster in k8s.
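To illustrate the general shape of such a generator (not Quix itself), here is a hedged sketch that turns an ordered set of Trino SQL statements into a chain of Airflow tasks; the SQL, connection details, and helper function are placeholders.

```python
# Hedged sketch: a "notebook" of Trino SQL statements becomes a chain of
# Airflow tasks. Placeholders: SQL statements, Trino host, user.
from datetime import datetime

import trino
from airflow import DAG
from airflow.operators.python import PythonOperator

SQL_STEPS = {
    "stage_orders": "CREATE TABLE IF NOT EXISTS tmp_orders AS SELECT * FROM raw.orders",
    "daily_summary": "INSERT INTO marts.daily_orders SELECT order_date, count(*) FROM tmp_orders GROUP BY 1",
}

def run_trino(sql: str) -> None:
    conn = trino.dbapi.connect(host="trino.internal", port=8080, user="airflow")  # placeholders
    cur = conn.cursor()
    cur.execute(sql)
    cur.fetchall()  # Trino executes lazily; fetch to ensure the statement completes

with DAG("quix_style_sql_etl", start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False) as dag:
    previous = None
    for task_id, sql in SQL_STEPS.items():
        task = PythonOperator(task_id=task_id, python_callable=run_trino, op_args=[sql])
        if previous:
            previous >> task  # run steps sequentially, in declaration order
        previous = task
```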

Up until a few years ago, teams at Uber used multiple data workflow systems, with some based on open source projects such as Apache Oozie, Apache Airflow, and Jenkins, while others were custom-built solutions written in Python and Clojure. Every user who needed to move data around had to learn about and choose from these systems, depending on the specific task they needed to accomplish. Each system required additional maintenance and operational burdens to keep it running, troubleshoot issues, fix bugs, and educate users. After evaluating these systems, and with the goal in mind of converging on a single workflow system capable of supporting Uber's scale, we settled on an Airflow-based system. The Airflow-based DSL provided the best trade-off of flexibility, expressiveness, and ease of use while being accessible to our broad range of users, which includes data scientists, developers, machine learning experts, and operations employees. This talk will focus on scaling Airflow to Uber's scale and providing a no-code, seamless user experience.

Artificial Intelligence is reshaping the landscape of software development. In this talk, we’ll explore the latest AI breakthroughs improving LLM capabilities for software development use cases. We’ll discuss work and ideas in the field related to Airflow, particularly around model capabilities related to Python, DSLs, and low-resource languages.

DAG authors, while constructing DAGs, generally use native libraries provided by Airflow in conjunction with Python libraries available from public PyPI repositories. But sometimes, DAG authors need to construct DAGs using libraries that are either in-house or not available on public PyPI repositories. This poses a serious challenge for users who want to run their custom code with Airflow DAGs, particularly when Airflow is deployed in a cloud-native fashion. Traditionally, these packages are baked into Airflow Docker images. This won't work post-deployment and is super impractical if your library is under development. We propose a solution that creates a dedicated global Python environment for Airflow that dynamically generates the requirements, establishes a version-compatible pyenv adhering to Airflow's policies, and manages custom pip repository authentication seamlessly. Importantly, the service executes these steps in a fail-safe manner, without compromising core components. Join us as we discuss the solution to this common problem, touching upon the design and seeing the solution in action. We will also candidly discuss some challenges and the shortcomings of the proposed solution.
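For readers unfamiliar with the baseline this talk improves on, Airflow ships a PythonVirtualenvOperator that installs per-task requirements into a throwaway virtualenv. The sketch below uses a hypothetical in-house package name and omits the private-index authentication that the proposed solution handles.

```python
# Baseline approach: per-task virtualenv with its own requirements.
# Placeholder package name; private PyPI authentication is not shown.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def score_with_inhouse_lib():
    # Imports must live inside the callable -- it runs in the task's own venv.
    import inhouse_ml  # hypothetical in-house package
    print(inhouse_ml.__version__)

with DAG("custom_deps_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    score = PythonVirtualenvOperator(
        task_id="score",
        python_callable=score_with_inhouse_lib,
        requirements=["inhouse-ml==1.2.0"],   # placeholder package pin
        system_site_packages=False,
    )
```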

Airflow’s power comes from its vast ecosystem, but securing this intricate web requires a united front. This talk unveils a groundbreaking collaborative effort between the Python Software Foundation (PSF), the Apache Software Foundation (ASF), the Airflow Project Management Committee (PMC), and Alpha-Omega Fund - aimed at securing not only Airflow, but the whole ecosystem. We’ll explore this new project dedicated to improving security across the Airflow landscape.