PyData Amsterdam 2025

Real-Time Context Engineering for LLMs

2025-09-26 Watch

talk

Manu Joseph

AI/ML LLM Python React

Context engineering has replaced prompt engineering as the main challenge in building agents and LLM applications. Context engineering involves providing LLMs with relevant and timely context data from various data sources, which allows them to make context-aware decisions. The context data provided to the LLM must be produced in real time so that it can react intelligently at human-perceivable latencies (a second or two at most). If the application takes longer to react, users will perceive it as laggy and unintelligent. In this talk, we will introduce context engineering and motivate real-time context engineering for interactive applications. We will also demonstrate how to integrate real-time context data from applications inside Python agents using the Hopsworks feature store and corresponding application IDs. Application IDs are the key to unlocking application context data for agents and LLMs. We will walk through an example of an interactive application (a TikTok clone) that we make AI-enabled with Hopsworks.

Orchestrating success: How Vinted standardizes large-scale, decentralized data pipelines

2025-09-26

talk

Rodrigo Loredo, Oscar Ligthart

Airflow Python

At Vinted, Europe’s largest second-hand marketplace, over 20 decentralized data teams generate, transform, and build products on petabytes of data. Each team uses its own tools, workflows, and expertise. Coordinating data pipeline creation across such diverse teams presents significant challenges, including complex inter-team dependencies, inconsistent scheduling solutions, and rapidly evolving requirements.

This talk is aimed at data engineers, platform engineers, and technical leads with experience in workflow orchestration and will demonstrate how we empower teams at Vinted to define data pipelines quickly and reliably. We will present our user-friendly abstraction layer built on top of Apache Airflow, enhanced by a Python code generator. This abstraction simplifies upgrades and migrations, removes scheduler complexity, and supports Vinted’s rapid growth. Attendees will learn how Python abstractions and code generation can standardize pipeline development across diverse teams, reduce operational complexity, and enable greater flexibility and control in large-scale data organizations. Through practical lessons and real-world examples of our abstraction interface, we will offer insights into designing scheduler-agnostic architectures for successful data pipeline orchestration.

Declarative Feature Engineering: Bridging Spark and Flink with a Unified DSL

2025-09-26

talk

Miguel Leite, Vitalii Zhebrakovskyi

AI/ML Flink Python Spark

Building ML features at scale shouldn’t require every ML Scientist to become an expert in Spark or Flink. At Adyen, the Feature Platform team built a Python-based DSL that lets data scientists define features declaratively — while automatically generating the necessary batch or real-time pipelines behind the scenes.

Optimize the Right Thing: Cost-Sensitive Classification in Practice

2025-09-26 Watch

talk

Shimanto Rahman

AI/ML Python Scikit-learn

Not all mistakes in machine learning are equal: a false negative in fraud detection or medical diagnosis can be far costlier than a false positive. Cost-sensitive learning helps navigate these trade-offs by incorporating error costs into the training process, leading to smarter decision-making. This talk introduces Empulse, an open-source Python package that brings cost-sensitive learning into scikit-learn. Attendees will learn why standard models fall short in cost-sensitive scenarios and how to build better classifiers with scikit-learn and Empulse.
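To make the idea of unequal error costs concrete, here is a minimal sketch of the simplest cost-sensitive technique: moving the decision threshold. This is a decision-time adjustment, not the training-time approach the abstract describes, and it does not use Empulse's API. The function names and the cost values are assumptions chosen for illustration; the sketch also assumes calibrated probabilities and zero cost for correct predictions.

```python
def cost_optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimizes expected misclassification cost.

    Predicting positive costs (1 - p) * cost_fp in expectation, predicting
    negative costs p * cost_fn, so positive is cheaper when
    p >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def classify(prob_positive: float, cost_fp: float, cost_fn: float) -> int:
    """Return 1 when flagging is cheaper in expectation than not flagging."""
    return int(prob_positive >= cost_optimal_threshold(cost_fp, cost_fn))


# Hypothetical fraud costs: a missed fraud is 9x as costly as a false alarm.
threshold = cost_optimal_threshold(cost_fp=1.0, cost_fn=9.0)

# A transaction scored at 30% fraud probability:
decision_default = int(0.3 >= 0.5)                              # plain 0.5 cutoff
decision_cost_aware = classify(0.3, cost_fp=1.0, cost_fn=9.0)   # cost-aware cutoff
```

With these assumed costs the cutoff drops well below 0.5, so the same score that a standard classifier would wave through is flagged once costs are taken into account.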

Untitled13.ipynb

2025-09-26 Watch

talk

Vincent Warmerdam

LLM Python

For well over a decade, Python notebooks revolutionized our field. They gave us so much creative freedom and dramatically lowered the entry barrier for newcomers. Yet despite all this ... it has been a decade! And the notebook is still in roughly the same form factor.

So what if we allow ourselves to rethink notebooks ... really rethink them! What features might we come up with? Can we make the notebook understand data sources? What about LLMs? Can we generate widgets on the fly? What if we make changes to Python itself?

This presentation will be a stream of demos that help paint a picture of what the future might hold. I will share my latest work in the anywidget/marimo ecosystem as well as some new hardware integrations.

The main theme that I will work towards: if you want better notebooks, reactive Python might very well be the future.

The Gentle Monorepo: Ship Faster and Collaborate Better

2025-09-25 Watch

talk

Gerben Dekker

Python

Monorepos promise faster development and smoother cross-team collaboration, but they often seem intimidating, requiring major tooling, buy-in, and process changes. This talk shows how Dexter gradually introduced a Python monorepo by combining a few lightweight tools with a pragmatic, trust-based approach to adoption. The result is that we can effectively reuse components across our various energy forecasting and trade optimization products. We iterate quicker on bringing our research to production, which benefits our customers and supports the renewable energy transition. After this talk, you’ll walk away with a practical blueprint for introducing a monorepo in your context, without requiring heavy up-front work.

Microlog: Explain Your Python Applications with Logs, Graphs, and AI

2025-09-25

talk

Chris Laffra

AI/ML Python

Microlog is a lightweight continuous profiler and logger for Python that helps developers understand their applications through interactive visualizations and AI-powered insights. With extremely low overhead and a 100% Python stack, it makes it easy to trace performance issues, debug unexpected behavior, and gain visibility into production systems.

Context is King: Evaluating Long Context vs. RAG for Data Grounding

2025-09-25

talk

Bauke Brenninkmeijer

GenAI Python RAG

Grounding Large Language Models in your specific data is crucial, but notoriously challenging. Retrieval-Augmented Generation (RAG) is the common pattern, yet practical implementations are often brittle, suffering from poor retrieval, ineffective chunking, and context limitations, leading to inaccurate or irrelevant answers. The emergence of massive context windows (1M+ tokens) seems to offer a simpler path – just put all your data in the prompt! But does it truly solve the "needle in a haystack" problem, or introduce new challenges like prohibitive costs and information getting lost in the middle? This talk dives deep into the engineering realities. We'll dissect common RAG failure modes, explore techniques for building robust RAG systems (advanced retrieval, re-ranking, query transformations), and critically evaluate the practical viability, costs, and limitations of leveraging long context windows for complex data tasks in Python. Leave understanding the real trade-offs to make informed architectural decisions for building reliable, data-grounded GenAI applications.
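Chunking is one of the RAG failure points named above. As a rough illustration, the sketch below shows fixed-size chunking with overlap, a common baseline in which neighbouring chunks share some words so that a sentence cut at a boundary still appears whole in one of them. It is not necessarily a technique covered in the talk; the function name and the word-based splitting are assumptions (real systems usually count tokens rather than words).

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: eight one-letter "words", windows of four with two shared.
chunks = chunk_words("a b c d e f g h", chunk_size=4, overlap=2)
```

A larger overlap reduces the chance of splitting a relevant passage but increases index size and the amount of duplicated text sent to the model.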

Formula 1 goes Bayesian: Time Series Decomposition with PyMC

2025-09-25 Watch

talk

Wesley Boelrijk

AI/ML Python

Forecasting time series can be messy: data is often missing, noisy, or full of structural changes like holidays, outliers, or evolving patterns. This talk shows how to build interpretable time series decomposition models using PyMC, a modern probabilistic programming library.

We’ll break time series into trend, seasonality, and noise components using engineered time features (e.g., Fourier and Radial Basis Functions). You’ll also learn how to model correlated series using hierarchical priors, letting multiple time series "learn from each other." As a case study, we’ll analyze Formula 1 lap time data to compare drivers and explore performance consistency using Bayesian posteriors.

This is a hands-on, code-first talk for data scientists, ML engineers, and researchers curious about Bayesian modeling (or Formula 1). Familiarity with Python and basic statistics is helpful, but no deep knowledge of Bayes is required.
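For readers unfamiliar with the Fourier time features mentioned in the abstract, the following sketch shows how they are typically constructed: each harmonic k of a seasonal period contributes one sine and one cosine column, and a model then learns a weight per column. This is plain Python rather than PyMC, and the function name, the weekly period, and the number of harmonics are assumptions for illustration.

```python
import math


def fourier_features(t: float, period: float, order: int) -> list[float]:
    """Return [sin_1, cos_1, ..., sin_order, cos_order] for time index t.

    Harmonic k uses the angle 2 * pi * k * t / period.
    """
    features = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * t / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# Hypothetical weekly seasonality on daily data, using two harmonics.
row_day0 = fourier_features(t=0, period=7, order=2)
row_day7 = fourier_features(t=7, period=7, order=2)  # one full period later
```

Because the features repeat every period, days 0 and 7 map to the same values (up to floating-point error), which is what lets a linear combination of these columns represent a smooth seasonal pattern.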

Actionable Techniques for Finding Performance Regressions

2025-09-25

talk

Jeroen Janssens, Thijs Nieuwdorp (VodafoneZiggo)

Bash Data Science Git Parquet Polars Python

Ever been burned by a mysterious slowdown in your data pipeline? In this session, we'll reveal how a stealthy performance regression in the Polars DataFrame library was hunted down and squashed. Using git bisect, Bash scripting, and uv, we automated commit compilation and benchmarking across two repos to pinpoint a commit that degraded multi-file Parquet loading. This led to challenging assumptions and rethinking performance monitoring for the Python data science library Polars.
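The idea behind git bisect is a binary search over commit history: given a known-good and a known-bad commit, test the midpoint and halve the range until the first bad commit remains. The sketch below illustrates that search in plain Python. It is not the speakers' tooling; the commit names, timings, and the 1.5-second cutoff are hypothetical, and a real run would build and benchmark each commit instead of looking up a number.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first bad commit.

    Assumes commits are ordered oldest to newest, the first is good,
    the last is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical history with benchmark timings in seconds.
timings = {"a1": 1.0, "b2": 1.1, "c3": 1.0, "d4": 2.4, "e5": 2.5, "f6": 2.6}
history = list(timings)  # dicts preserve insertion order: oldest to newest
culprit = first_bad_commit(history, lambda sha: timings[sha] > 1.5)
```

The search needs only about log2(n) benchmark runs for n commits, which matters when each step involves compiling a library from source.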

Causal Inference Framework for incrementality: A Case Study at Booking to estimate incremental CLV due to App installs

2025-09-25

talk

Netesh, Nazlı Alagöz

GitHub Marketing Python

This talk dives into the challenge of measuring the causal impact of app installs on customer loyalty and value, a question at the heart of data-driven marketing. While randomized controlled trials are the gold standard, they’re rarely feasible in this context. Instead, we’ll explore how observational causal inference methods can be thoughtfully applied to estimate incremental value with careful consideration of confounding, selection, and measurement biases. This session is designed for data scientists, marketing analysts, and applied researchers with a working knowledge of statistics and causal inference concepts. We’ll keep the tone practical and informative, focusing on real-world challenges and solutions rather than heavy mathematical derivations.

Attendees will learn:

* How to design robust observational studies for business impact
* Strategies for covariate selection and bias mitigation
* The use of multiple statistical and design-based causal inference approaches
* Methods for validating and refuting causal claims in the absence of true randomization

We’ll share actionable insights, code snippets, and a GitHub repository with example workflows so you can apply these techniques in your own organization. By the end of the talk, you’ll be equipped to design more transparent and credible causal studies and make better decisions about where to invest your marketing dollars.

Requirements: A basic understanding of causal inference and Python is recommended. Materials and relevant links will be shared during the session.

Potato breeding using image analysis in a production setting

2025-09-25

talk

Rik Nuijten, Dick Abma

AWS DevOps Python Terraform

The scale-up company Solynta focuses on hybrid potato breeding, which helps achieve improvements in yield, disease resistance, and climate adaptation. Scientific innovation is part of our core business. Plant selections are highly data-driven, involving, for example, drone observations and genetic data. Minimal time-to-production for new ideas is essential, which is facilitated by our custom AWS DevOps platform. This platform focuses on automation and accessible data storage.

In this talk, we introduce how computer vision (YOLO and SAM modelling) enables monitoring traits of plants in the field, and how we operate these models. This further entails:

• Our experience from training and evaluating models on drone images
• Trade-offs selecting AWS services, Terraform modules and Python packages for automation and robustness
• Our team setup that allows IT specialists and biologists to work together effectively

The talk will provide practical insights for both data scientists and DevOps engineers. The main takeaways are that object detection and segmentation from drone maps, at scale, are achievable for a small team. Furthermore, with the right approach, you can standardise a DevOps platform to let operations and developers work together.

Large-Scale Video Intelligence

2025-09-25 Watch

talk

Antonino Ingargiola, Irene Donato

AI/ML Python Vector DB

The explosion of video data demands search beyond simple metadata. How do we find specific visual moments, actions, or faces within petabytes of footage? This talk dives into architecting a robust, scalable multi-modal video search system. We will explore an architecture combining efficient batch preprocessing for feature extraction (including person detection, face/CLIP-style embeddings) with optimized vector database indexing. Attendees will learn practical strategies for managing massive datasets, optimizing ML inference (e.g., lightweight models, specialized runtimes), and bridging pre-computed indexes with real-time analysis for deeper insights. This session is for data scientists, ML engineers, and architects looking to build sophisticated video understanding capabilities.

Audience: Data Scientists, Machine Learning Engineers, Data Engineers, System Architects.

Takeaway: Attendees will learn architectural patterns and practical techniques for building scalable multi-modal video search systems, including feature extraction, vector database utilization, and ML pipeline optimization.

Background Knowledge: Familiarity with Python, core machine learning concepts (e.g., embeddings, classification), and general data processing pipelines is beneficial. Experience with video processing or computer vision is a plus but not strictly required.
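As a small illustration of the embedding-based search described in this abstract, the sketch below ranks pre-computed frame embeddings by cosine similarity to a query embedding. It is a brute-force stand-in for what a vector database does with approximate indexes at scale. The frame identifiers and the three-dimensional vectors are made up; real CLIP-style embeddings have hundreds of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k embeddings most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [frame_id for frame_id, _ in ranked[:k]]


# Hypothetical index built offline by a batch feature-extraction job.
index = {
    "clip1_frame10": [1.0, 0.0, 0.0],
    "clip1_frame42": [0.9, 0.1, 0.0],
    "clip2_frame07": [0.0, 1.0, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], index, k=2)
```

The exhaustive scan here costs one similarity computation per stored frame, which is why large collections rely on dedicated vector indexes instead.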

talk-data.com
