PyData Paris 2025

Designing and Deploying a Secure RAG

2025-10-01

Face To Face

GenAI RAG

How to harness the full potential of GenAI while protecting a sensitive, critical document corpus.

Risk Hunter Demo

2025-10-01

Face To Face

AI/ML SaaS

Come and discover how our modular, AI-driven platform, available as SaaS and on-premise, lets you manage your GRC and cybersecurity.

How to do real TDD in data science? A journey from pandas to polars with pelage!

2025-10-01 Watch

talk

Alix Tiran-Cappello

Data Quality Data Science GitHub Pandas Polars Python

In the world of data, inconsistencies or inaccuracies often present a major challenge to extracting valuable insights. Yet the number of robust tools and practices to address those issues remains limited. In particular, TDD remains quite difficult to practice in data science, even though it is standard in classic software development, partly because of poorly adapted tools and frameworks.

To address this issue, we released Pelage, an open-source Python package for data exploration and testing that relies on Polars' intuitive syntax and speed. Pelage helps data scientists and analysts simplify data transformations, enhance data quality and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence in your data transformations.

See website: https://alixtc.github.io/pelage/

IA@Horse Technologies: how we doubled our quality analysis capacity with Altair RapidMiner

2025-10-01

Face To Face

AI/ML

Thanks to the low-code/no-code Altair RapidMiner solution, quality experts can adapt the algorithm without programming.

JSON Relational Duality: your document and relational models unified

2025-10-01

Face To Face

JSON

Generative AI in the enterprise: impact, risks, and levers for action

2025-10-01

Face To Face

AI/ML

In this talk, we present generative AI and its impact on jobs and organizations. We will cover the opportunities

Probabilistic regression models: let's compare different modeling strategies and discuss how to evaluate them

2025-10-01

talk

Olivier Grisel

AI/ML PyTorch Scikit-learn

Most common machine learning models (linear, tree-based or neural network-based) optimize for the least squares loss when trained for regression tasks. As a result, they output a point estimate of the conditional expected value of the target: E[y|X].

In this presentation, we will explore several ways to train and evaluate probabilistic regression models as a richer alternative to point estimates. Those models describe the full distribution of y|X and allow us to quantify the predictive uncertainty for individual predictions.

On the model training part, we will introduce the following options:

ensemble of quantile regressors for a grid of quantile levels (using linear models or gradient boosted trees in scikit-learn, XGBoost and PyTorch);
how to reduce probabilistic regression to multi-class classification plus a cumulative sum of the predict_proba output to recover a continuous conditional CDF:
  how to implement this approach as a generic scikit-learn meta-estimator;
  how this approach is used to pretrain foundational tabular models (e.g. TabPFNv2);
simple Bayesian models (e.g. Bayesian Ridge and Gaussian Processes);
more specialized approaches as implemented in XGBoostLSS.
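
As a rough illustration of the classification-based reduction listed above, here is a minimal standard-library sketch. It is not the speaker's implementation: the helper names (`bin_targets`, `cdf_from_proba`, `quantile_from_cdf`), the bin edges and the probability vector are all hypothetical, and the probabilities stand in for one row of a classifier's predict_proba output.

```python
from bisect import bisect_right
from itertools import accumulate


def bin_targets(y, edges):
    """Map each continuous target to the index of the bin containing it.

    `edges` holds n_bins + 1 increasing values; targets outside the range
    are clipped into the first or last bin.
    """
    n_bins = len(edges) - 1
    return [min(max(bisect_right(edges, v) - 1, 0), n_bins - 1) for v in y]


def cdf_from_proba(proba):
    """Cumulative sum of per-bin probabilities.

    Entry i is the estimated CDF at the right edge of bin i.
    """
    return list(accumulate(proba))


def quantile_from_cdf(edges, cdf, level):
    """Invert the piecewise-linear CDF to get the quantile at `level`."""
    prev = 0.0
    for i, c in enumerate(cdf):
        if level <= c:
            if c == prev:  # empty bin: no mass to interpolate over
                return edges[i]
            frac = (level - prev) / (c - prev)
            return edges[i] + frac * (edges[i + 1] - edges[i])
        prev = c
    return edges[-1]


# Hypothetical bin edges and class probabilities for a single sample.
edges = [0.0, 10.0, 20.0, 30.0, 40.0]
proba = [0.1, 0.4, 0.4, 0.1]

labels = bin_targets([5.0, 15.0, 39.9, 40.0, -1.0], edges)
cdf = cdf_from_proba(proba)
median = quantile_from_cdf(edges, cdf, 0.5)
```

The classifier would be trained on the labels returned by `bin_targets`; at prediction time, each row of class probabilities is turned into a conditional CDF and, if needed, into quantiles by linear interpolation within bins.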

We will also discuss how to evaluate probabilistic predictions via:

the pinball loss of quantile regressors,
other strictly proper scoring rules such as Continuous Ranked Probability Score (CRPS),
coverage measures and width of prediction intervals,
reliability diagrams for different quantile levels.
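
To make two of these evaluation tools concrete, the sketch below computes the mean pinball loss of a quantile prediction and the empirical coverage and mean width of prediction intervals. It is a plain-Python illustration with made-up numbers; the function names are hypothetical and are not taken from the talk.

```python
def pinball_loss(y_true, y_pred, level):
    """Mean pinball loss of predictions `y_pred` for quantile `level`.

    Under-predictions are weighted by `level`, over-predictions by
    `1 - level`.
    """
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += level * diff if diff >= 0 else (level - 1) * diff
    return total / len(y_true)


def coverage_and_width(y_true, lower, upper):
    """Fraction of targets inside [lower, upper] and mean interval width."""
    n = len(y_true)
    covered = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, mean_width


# Made-up observations and a constant prediction, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
constant_pred = [2.0] * 4

loss_median = pinball_loss(y_true, constant_pred, 0.5)
loss_q90 = pinball_loss(y_true, constant_pred, 0.9)
coverage, width = coverage_and_width(y_true, [0.0] * 4, [3.0] * 4)
```

A well-calibrated interval built from the 5% and 95% quantiles should have coverage close to 0.9 on held-out data; among intervals with similar coverage, narrower ones are more informative.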

We will illustrate those concepts with concrete examples and running code.

Finally, we will illustrate why some applications need such calibrated probabilistic predictions:

estimating uncertainty in trip times depending on traffic conditions to help a human decision maker choose among various travel plan options,
modeling value at risk for investment decisions,
assessing the impact of missing variables for an ML model trained to work in degraded mode,
Bayesian optimization for operational parameters of industrial machines from little/costly observations.

If time allows, we will also discuss the usage and limitations of Conformal Quantile Regressors as implemented in MAPIE and contrast the aleatoric and epistemic uncertainty captured by those models.

Agentic AI Demos with Thinkeo

2025-10-01

Face To Face

AI/ML

Discover how AI agents are transforming document creation: - Create complete presentations - Generate reports ...

Beyond Prototyping: Building Production-Level Apps with Streamlit

2025-10-01 Watch

talk

Arnaud Miribel, Johannes Rieke

LLM

Streamlit is a great tool for prototyping data apps, but is it also fit for complex, production-level apps? In this talk, the Streamlit team will showcase new features, LLM integrations, and deployment options that can help you effectively use Streamlit in your company, whether it’s a small startup or a large enterprise.

CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training

2025-10-01 Watch

talk

Simeon Carstens, Rania Talbi

AI/ML Big Data

Built on top of Software Heritage, the largest public archive of source code, the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance. In this presentation, we will present the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.

Enhancing Machine Learning Workflows with skore

2025-10-01 Watch

talk

Marie Sacksick

AI/ML Python

Discover how skore, a new open-source Python library, can elevate your machine learning projects by integrating recommended practices and avoiding common pitfalls. This talk will introduce skore's key features and demonstrate how it can streamline your model evaluation and diagnostics processes.

Red Bull Racing Simulator Activity

2025-10-01

Face To Face

Oracle

A demonstration of what Oracle technologies bring to the Oracle Red Bull Racing F1 team

Meta-Dashboards: Accelerating Geospatial Web Apps Creation with Voilà

2025-10-01 Watch

talk

Davide De Marchi

DataViz

The Joint Research Centre has cultivated significant expertise in developing Voilà dashboards for scientific data visualization, resulting in the design and deployment of many real-world web applications. This presentation will highlight our commitment to building a robust Voilà developer community through dedicated training and resource libraries. We will introduce and demonstrate our innovative meta-dashboards, which streamline the creation of complex, multi-page dashboards by automating framework and code generation. A live demonstration will illustrate the ease of building a geospatial application using this tool. We will conclude with a showcase of recently developed Voilà dashboards in areas such as agricultural/biodiversity surveys and air quality monitoring, demonstrating their effectiveness in data exploration and validation.

Move beyond academia: Introducing an industry-first tabular benchmark

2025-10-01 Watch

talk

Alexandre Abraham, Louis Le Dain

Discover a new benchmark designed for real-world impact. Built on authentic private-company data and carefully chosen public datasets that reflect real industry challenges, like product categorization, basket prediction, and personalized recommendations, it offers a realistic testing ground for both classic baselines (e.g., gradient boosting) and the latest models such as CARTE, TabICL, and TabPFN. By bridging the gap between academic research and industrial needs, this benchmark brings model evaluation closer to the decisions and constraints faced in practice.

This shift has tangible consequences: models are tested on problems that matter to businesses, using metrics that reflect real-world priorities (e.g., Precision@K, Recall@K, MAP@K). This enables more relevant model selection, exposes where lab-only academic approaches fall short, and fosters solutions that are not just novel but deployable, helping accelerate the journey from innovation to deployment.
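
For readers unfamiliar with the ranking metrics named above, here is a small plain-Python sketch of Precision@K, Recall@K and MAP@K. The function names and the toy data are hypothetical and not taken from the benchmark. Several normalizations of average precision exist; this sketch divides by min(number of relevant items, K) and assumes each recommendation list contains no duplicates.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def average_precision_at_k(recommended, relevant, k):
    """Average of precision values at each rank where a hit occurs.

    Normalized by min(len(relevant), k), one of several conventions.
    """
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(len(relevant), k)
    return score / denom if denom else 0.0


def map_at_k(all_recommended, all_relevant, k):
    """Mean of average_precision_at_k over all users or queries."""
    scores = [
        average_precision_at_k(rec, rel, k)
        for rec, rel in zip(all_recommended, all_relevant)
    ]
    return sum(scores) / len(scores)


# Toy example: one user, four ranked recommendations, three relevant items.
recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}

p3 = precision_at_k(recommended, relevant, 3)
r3 = recall_at_k(recommended, relevant, 3)
ap3 = average_precision_at_k(recommended, relevant, 3)
```

In this toy case, two of the top three recommendations are relevant, so precision and recall at 3 are both 2/3, while the average precision also rewards the fact that the first hit sits at rank 1.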

PyPI in the face: running jokes that PyPI download stats can play on you

2025-10-01 Watch

talk

Loïc Estève

Analytics GitHub NumPy Python Scikit-learn SciPy

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform some of our decisions, such as:

how do we increase user awareness of best practices (please use Pipeline and cross-validation)?
how do we advertise our recent improvements (use HistGradientBoosting rather than GradientBoosting, TunedThresholdClassifier, PCA and a few other models can run on GPU)?
do users care more about new features from recent releases or consolidation of what already exists?
how long should we support older versions of Python, NumPy or SciPy?

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard, but grasping the reality behind these metrics is often tricky.

Break

2025-10-01

talk

Break

2025-10-01

talk

Break

2025-10-01

talk

Come see QDA Miner & WordStat 2025 in action!

2025-10-01

Face To Face

AI/ML LLM

Stay in control of AI: choose your engine (OpenAI, Gemini, etc.) and customize the prompts for full transparency.

Balancing Privacy and Utility: Efficient PII Detection and Replacement in Textual Data

2025-10-01 Watch

talk

Elizaveta Clouet, Justine BEL-LETOILE

NLP

Anonymizing free-text data is harder than it seems. While structured databases have well-established anonymization techniques, textual data (such as invoices, resumes, or medical records) poses unique challenges. Personally identifiable information (PII) can appear anywhere and in unpredictable formats, so how can it be modified while preserving the dataset's usefulness?

Let's explore a practical, open-source 2-step approach to text anonymization: (1) detecting PII using NER models and (2) replacing it while preserving key dataset characteristics (e.g. document formatting, statistical distributions). We will demonstrate how to build a robust pipeline leveraging tools such as pre-trained PII detection models, gliner for fine-tuning, or Faker for generating meaningful replacements.
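
The two-step shape of such a pipeline can be sketched with the standard library alone. This is only an illustration under strong assumptions: simple regular expressions stand in for the NER-based detector, and a character-wise random substitution stands in for Faker-style replacements. The patterns, labels and function names are hypothetical.

```python
import random
import re
import string

# Stand-in detector: two simplistic patterns instead of a trained NER model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d .-]{7,}\d"),
}


def detect_pii(text):
    """Step 1: return (start, end, label) spans sorted by position."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def fake_like(value, rng):
    """Step 2 helper: swap letters and digits for random ones.

    Length, letter case and punctuation are kept, so the document layout
    is preserved.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)


def anonymize(text, seed=0):
    """Replace every detected span with a same-shaped random value."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    pieces, cursor = [], 0
    for start, end, _label in detect_pii(text):
        if start < cursor:  # skip spans overlapping an earlier match
            continue
        pieces.append(text[cursor:start])
        pieces.append(fake_like(text[start:end], rng))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Contact jane.doe@example.com or +33 6 12 34 56 78."
spans = detect_pii(text)
masked = anonymize(text)
```

Keeping detection and replacement as separate functions mirrors the two-step approach: the detector can be swapped for a stronger model without touching the replacement logic, and the replacement strategy can be tuned to preserve whichever dataset characteristics matter.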

Ideal for those with a basic understanding of NLP, this session offers practical insights for anyone working with sensitive textual data.

Sharing computational course material at larger scale: a French multi-tenant attempt

2025-10-01 Watch

talk

Nicolas M. Thiéry

Python

With the rise of computation and data as pillars of science, institutions are struggling to provide large-scale training to their students and staff. Often, this leads to redundant, fragmented efforts, with each organization producing its own bespoke training material. In this talk, we report on a collaborative multi-tenant initiative to produce a shared corpus of interactive training resources in the Python language, designed as a digital common that can be adapted to diverse contexts and formats in French higher education and beyond.

Skrub: machine learning for dataframes

2025-10-01 Watch

talk

Jérôme Dockès, Riccardo Cappuzzo, Guillaume Lemaitre (scikit-learn)

AI/ML DataOps Scikit-learn

Skrub is an open-source package that simplifies machine learning with dataframes by providing a variety of tools to explore, prepare and feature-engineer dataframes so they can be integrated into scikit-learn pipelines. Skrub DataOps allow you to build extensive, multi-table wrangling plans, explore hyperparameter spaces, and export the resulting objects for deployment. The talk showcases various use cases where skrub can simplify the job of a data scientist from data preparation to deployment, through code examples and demonstrations.

Room change

2025-10-01

talk

Building Data Science Tools for Sustainable Transformation

2025-10-01 Watch

talk

Anita Graser

AI/ML Data Science GenAI

The current AI hype, driven by generative AI and particularly large language models, is creating excitement, fear, and inflated expectations. In this keynote, we'll explore geographic & mobility data science tools (such as GeoPandas and MovingPandas) to transform this hype into sustainable and positive development that empowers users.

Forewords

2025-10-01

talk

