We run a similar pattern of DAGs for different data quality dimensions such as accuracy, timeliness, and completeness. Building each of these by hand would mean duplicating code, and copy-pasting or rewriting the same logic invites human error. To solve this, we do a few things: we use DagFactory to dynamically generate DAGs from a short YAML spec describing the steps in our DQ checks, and we hide this behind a UI hooked into a GitHub PR-creation step, so the user just provides some inputs or selects from dropdowns and a YAML DAG is generated for them. This highlights DagFactory's potential to hide Airflow Python code from users, making it accessible to data analysts and business intelligence teams alongside software engineers while reducing human error. YAML is an ideal format for generating code and opening a PR, and DagFactory is a perfect fit for that workflow. All of this runs on GCP Cloud Composer.
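To make the pattern concrete, here is a minimal sketch of what a dag-factory loader might look like; the file paths, DAG and task names, and the check script are hypothetical, and the YAML schema follows the open-source dag-factory conventions rather than the team's actual spec.

```python
# Minimal sketch, assuming the open-source dag-factory package is installed
# in the Cloud Composer environment. The YAML below is hypothetical and
# might live at /home/airflow/gcs/dags/dq_checks.yml:
#
#   dq_completeness_dag:
#     default_args:
#       owner: "data-quality"
#       start_date: 2024-01-01
#     schedule_interval: "@daily"
#     tasks:
#       run_completeness_check:
#         operator: airflow.operators.bash.BashOperator
#         bash_command: "python run_dq_check.py --dimension completeness"
#
# This loader, dropped once into the dags/ folder, is the only Python
# anyone has to touch; every generated YAML file becomes a DAG.
import dagfactory

dag_factory = dagfactory.DagFactory("/home/airflow/gcs/dags/dq_checks.yml")
dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
```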
Apache Airflow is a powerful workflow orchestrator, but as workloads grow, its Python-based components can become performance bottlenecks. This talk explores how Rust, with its speed, safety, and concurrency advantages, can enhance Airflow's core components (e.g., the scheduler and DAG processor). We'll dive into the motivations behind using Rust, the architectural trade-offs, and the challenges of bridging the gap between Python and Rust. A proof of concept showcasing an Airflow scheduler rewritten in Rust will demonstrate the potential benefits of this approach.
In the rapidly evolving field of data engineering and data science, efficiency and ease of use are crucial. Our solution offers a user-friendly interface to manage and schedule custom PySpark, PySQL, Python, and SQL code, streamlining the process from development to production. Using Airflow as the backend, this tool eliminates the complexities of infrastructure management, version control, CI/CD processes, and workflow orchestration. The intuitive UI allows users to upload code, configure job parameters, and set schedules effortlessly, without any additional scripting or coding. Users also have the flexibility to bring their own artifact repository solution and run their code from it. In summary, our solution significantly enhances the orchestration and scheduling of custom code, breaking down traditional barriers and empowering organizations to maximize their data's potential and drive innovation efficiently. Whether you are an individual data scientist or part of a large data engineering team, this tool provides the resources needed to streamline your workflow and achieve your goals faster than ever before.
Summary
In this episode of the Data Engineering Podcast, Arun Joseph talks about developing and implementing agent platforms to empower businesses with agentic capabilities. From leading AI engineering at Deutsche Telekom to his current entrepreneurial venture focused on multi-agent systems, Arun shares insights on building agentic systems at an organizational scale, highlighting the importance of robust models, data connectivity, and orchestration loops. Listen in as he discusses the challenges of managing data context and cost in large-scale agent systems, the need for a unified context management platform to prevent data silos, and the potential for open-source projects like LMOS to provide a foundational substrate for agentic use cases that can transform enterprise architectures by enabling more efficient data management and decision-making processes.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
Your host is Tobias Macey and today I'm interviewing Arun Joseph about building an agent platform to empower the business to adopt agentic capabilities.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of how Deutsche Telekom has been approaching applications of generative AI?
- What are the key challenges that have slowed adoption/implementation?
- Enabling non-engineering teams to define and manage AI agents in production is a challenging goal. From a data engineering perspective, what does the abstraction layer for these teams look like? How do you manage the underlying data pipelines, versioning of agents, and monitoring of these user-defined agents?
- What was your process for developing the architecture and interfaces for what ultimately became the LMOS?
- How do the principles of operating systems help with managing the abstractions and composability of the framework?
- Can you describe the overall architecture of the LMOS?
- What does a typical workflow look like for someone who wants to build a new agent use case?
- How do you handle data discovery and embedding generation to avoid unnecessary duplication of processing?
- With your focus on openness and local control, how do you see your work complementing projects like Oumi?
- What are the most interesting, innovative, or unexpected ways that you have seen LMOS used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on LMOS?
- When is LMOS the wrong choice?
- What do you have planned for the future of LMOS and MASAIC?
Contact Info
- LinkedIn
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening!
Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
- LMOS
- Deutsche Telekom
- MASAIC
- OpenAI Agents SDK
- RAG == Retrieval Augmented Generation
- LangChain
- Marvin Minsky
- Vector Database
- MCP == Model Context Protocol
- A2A (Agent to Agent) Protocol
- Qdrant
- LlamaIndex
- DVC == Data Version Control
- Kubernetes
- Kotlin
- Istio
- Xerox PARC
- OODA (Observe, Orient, Decide, Act) Loop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Unlock the power of AI agents—even if you’re just starting out. In this hands-on, beginner-friendly workshop, you'll go from understanding how Large Language Models (LLMs) work to building a real AI agent using Python, LangChain, and LangGraph. Live Demo: Your First AI Agent — follow along as we build an AI agent that retrieves, reasons, and responds using LangChain and LangGraph.
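For a taste of what the live demo covers, here is a minimal sketch of a tool-using agent built with LangChain and LangGraph; the model name, the stubbed retrieval tool, and the question are assumptions for illustration, not the workshop's actual code.

```python
# Sketch of a retrieve-reason-respond agent, assuming langchain,
# langgraph, and langchain-openai are installed and OPENAI_API_KEY is set.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def lookup_docs(query: str) -> str:
    """Retrieve a (stubbed) documentation snippet for the query."""
    # A real agent would call a vector store or search API here.
    return f"Top result for '{query}': retries are configured per task."

model = ChatOpenAI(model="gpt-4o-mini")
agent = create_react_agent(model, tools=[lookup_docs])

# The agent decides to call the tool (retrieve), reasons over the
# result, and responds with a final message.
result = agent.invoke(
    {"messages": [("user", "What do the docs say about retries?")]}
)
print(result["messages"][-1].content)
```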
In this episode, Conor and Bryce chat with Jared Hoberock about the NVIDIA Thrust Parallel Algorithms Library, Rust vs C++, Python and more.
Link to Episode 240 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Socials
- ADSP: The Podcast: Twitter
- Conor Hoekstra: Twitter | BlueSky | Mastodon
- Bryce Adelstein Lelbach: Twitter
About the Guest
Jared Hoberock joined NVIDIA Research in October 2008. His interests include parallel programming models and physically-based rendering. Jared is the co-creator of Thrust, a high performance parallel algorithms library. While at NVIDIA, Jared has contributed to the DirectX graphics driver, Gelato, a final frame film renderer, and OptiX, a high-performance, programmable ray tracing engine. Jared received a Ph.D. in computer science from the University of Illinois at Urbana-Champaign. He is a two-time recipient of the NVIDIA Graduate Research Fellowship.
Show Notes
Date Generated: 2025-05-21
Date Released: 2025-06-27
- Thrust
- Thrust Docs
- iota Algorithm
- thrust::counting_iterator
- thrust::sequence
- MLIR
- NumPy
- Numba
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic Creative Commons — Attribution 3.0 Unported — CC BY 3.0 Free Download / Stream: http://bit.ly/l-miss-you Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
In this workshop, Oreoluwa will walk us through KdeGuiTest, an open-source tool for emulating user interaction with software applications. KdeGuiTest (previously called KdeEcoTest) is an automation and testing tool that lets you record and simulate user interactions with the GUI of an application. It is being developed as part of the KDE Eco initiative to create usage-scenario scripts for measuring the energy consumption of software.
The KDE community has been producing free software for almost 30 years. This includes not only the Plasma desktop environment but hundreds of applications. Virtually all of these applications are written in C++. As part of the "Streamlined Application Development Experience" initiative, we want to open the door to other programming languages as well. In this talk we will look at the recent work on enabling KDE application development in Python, how to do it, and how to get involved with improving the support for it.
We'll learn how to monitor and analyse energy data from sensors on a Raspberry Pi, simulating solar or wind power systems. Data is streamed into InfluxDB 3 Core (an open source time series database), where we use Python plugins to run Prophet ML for forecasting and River ML for real-time anomaly detection, directly inside the database. You'll see how you can collect and query data from edge devices to deliver smart insights for energy monitoring on the edge.
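As a flavor of the streaming ML piece, here is a minimal sketch of online anomaly detection with River; the feature name, model parameters, and readings are illustrative assumptions, and in the talk's setup this logic would run inside an InfluxDB 3 Python plugin rather than a standalone script.

```python
# Online anomaly detection on a stream of power readings (pip install river).
from river import anomaly

model = anomaly.HalfSpaceTrees(n_trees=10, height=8, window_size=250, seed=42)

def process_reading(watts: float) -> float:
    """Score one power reading, then update the model online."""
    x = {"watts": watts}
    score = model.score_one(x)  # anomaly score; higher means more anomalous
    model.learn_one(x)          # learn after scoring to avoid leakage
    return score

for watts in [180.0, 182.5, 179.9, 640.0]:  # the last value simulates a spike
    print(watts, round(process_reading(watts), 3))
```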
A gentle, practical introduction to linear programming: the mathematical technique used to allocate limited resources, optimise schedules, and maximise outputs under constraints, empowering you to solve real-world optimisation challenges. Hands-on with Python: learn how free Python libraries such as PuLP let you frame and solve sophisticated optimisation problems in just a few lines of code.
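To show how little code this takes, here is a hedged product-mix sketch with PuLP; the products, profit coefficients, and resource limits are invented for illustration.

```python
# Maximise profit from two products under two resource constraints
# (pip install pulp).
import pulp

# Decision variables: units of each product to make (non-negative).
a = pulp.LpVariable("product_a", lowBound=0)
b = pulp.LpVariable("product_b", lowBound=0)

prob = pulp.LpProblem("product_mix", pulp.LpMaximize)
prob += 20 * a + 30 * b        # objective: total profit
prob += 1 * a + 2 * b <= 100   # machine hours available
prob += 3 * a + 1 * b <= 90    # raw material available

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status],
      pulp.value(a), pulp.value(b), pulp.value(prob.objective))
```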
Jeroen Janssens and Thijs Nieuwdorp join me to chat about all things Polars. We discuss the evolution of the Polars library, its advantages over pandas, and their journey of writing 'Python Polars: The Definitive Guide.'
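For readers new to the library, here is a tiny illustration of the expression-based, lazy API that distinguishes Polars from pandas; the data is made up.

```python
# A lazy Polars query: the whole plan is optimized before execution
# (pip install polars).
import polars as pl

df = pl.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})

out = (
    df.lazy()
    .group_by("store")
    .agg(pl.col("sales").sum().alias("total_sales"))
    .sort("store")
    .collect()  # nothing runs until collect()
)
print(out)
```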
a short lesson to introduce writing functions
check if the snake has moved off the screen and end the game if so
improve the geometry to think of the snake as moving over squares on a board, rather than pixels. This makes the subsequent tasks easier.
make our snake move at constant speed in the direction of the latest arrow key pressed.
a recap and explanation of the current code (as a reminder and a brief introduction for anybody joining the course for the first time)
refactor the code and write a few functions: for example, to draw the snake, and to change the snake's direction
restrict the snake's movement so that it cannot double back on itself, as shown in the sketch below
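A minimal, framework-agnostic sketch of two of the steps above: the off-screen check and the no-double-back rule. The board size, direction encoding, and function names are assumptions, not the course's actual code.

```python
# Board measured in squares, not pixels (matching the geometry lesson above).
BOARD_W, BOARD_H = 20, 15

OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def change_direction(current: str, requested: str) -> str:
    """Ignore a key press that would make the snake double back."""
    return current if requested == OPPOSITE[current] else requested

def step(head: tuple[int, int], direction: str) -> tuple[int, int]:
    """Advance the snake's head one square in the current direction."""
    dx, dy = MOVES[direction]
    return head[0] + dx, head[1] + dy

def is_off_board(head: tuple[int, int]) -> bool:
    """True when the head has left the board, which ends the game."""
    x, y = head
    return not (0 <= x < BOARD_W and 0 <= y < BOARD_H)

head, direction = (10, 7), "right"
direction = change_direction(direction, "left")  # ignored: would double back
head = step(head, direction)
print(head, "game over" if is_off_board(head) else "still playing")
```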
In this episode, Conor recommends some articles on AI and LLMs.
Link to Episode 239 on Website
Discuss this episode, leave a comment, or ask a question (on GitHub)
Socials
- ADSP: The Podcast: Twitter
- Conor Hoekstra: Twitter | BlueSky | Mastodon
Show Notes
Date Generated: 2025-06-19
Date Released: 2025-06-20
- The Real Python Podcast Episode 253
- My AI Skeptic Friends Are All Nuts - Thomas Ptacek
- I Think I'm Done Thinking About genAI For Now - Glyph
- AI Changes Everything - Armin Ronacher
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic Creative Commons — Attribution 3.0 Unported — CC BY 3.0 Free Download / Stream: http://bit.ly/l-miss-you Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Dive into building applications that combine the power of Large Language Models (LLMs) with Neo4j knowledge graphs, Haystack, and Spring AI to deliver intelligent, data-driven recommendations and search outcomes. This book provides actionable insights and techniques to create scalable, robust solutions by leveraging best-in-class frameworks and a real-world, project-oriented approach.
What this book will help me do
- Understand how to use Neo4j to build knowledge graphs integrated with LLMs for enhanced data insights.
- Develop skills in creating intelligent search functionalities by combining Haystack and vector-based graph techniques.
- Learn to design and implement recommendation systems using LangChain4j and Spring AI frameworks.
- Acquire the ability to optimize graph data architectures for LLM-driven applications.
- Gain proficiency in deploying and managing applications on platforms like Google Cloud for scalability.
Author(s)
Ravindranatha Anthapu, a Principal Consultant at Neo4j, and Siddhant Agarwal, a Google Developer Expert in Generative AI, bring together their vast experience to offer practical implementations and cutting-edge techniques in this book. Their combined expertise in Neo4j, graph technology, and real-world AI applications makes them authoritative voices in the field.
Who is it for?
Designed for database developers and data scientists, this book caters to professionals aiming to leverage the transformational capabilities of knowledge graphs alongside LLMs. Readers should have a working knowledge of Python and Java as well as familiarity with Neo4j and the Cypher query language. If you're looking to enhance search or recommendation functionalities through state-of-the-art AI integrations, this book is for you.
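As a flavor of the graph queries such applications build on, here is a small, hedged sketch using the official Neo4j Python driver; the connection details, node labels, and relationship type are invented for illustration and are not taken from the book.

```python
# Query a (hypothetical) product knowledge graph with Cypher
# (pip install neo4j).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def similar_products(tx, name: str) -> list[str]:
    """Return names of products linked to `name` by SIMILAR_TO edges."""
    query = (
        "MATCH (p:Product {name: $name})-[:SIMILAR_TO]->(rec:Product) "
        "RETURN rec.name AS name"
    )
    return [record["name"] for record in tx.run(query, name=name)]

with driver.session() as session:
    print(session.execute_read(similar_products, "widget"))
driver.close()
```

A recommendation layer like the one the book describes would feed results such as these into an LLM prompt as grounded context, rather than letting the model invent candidates.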