PyData Paris 2025

Reinventing cost allocation: when data quality meets generative AI in Alteryx

2025-10-01

Face To Face

AI/ML Alteryx

A Journey Through a Geospatial Data Pipeline: From Raw Coordinates to Actionable Insights

2025-10-01 Watch

talk

Gravin Florent

AI/ML Analytics Big Data Cloud Computing

Every dataset has a story — and when it comes to geospatial data, it’s a story deeply rooted in space and scale. But working with geospatial information is often a hidden challenge: massive file sizes, strange formats, projections, and pipelines that don't scale easily.

In this talk, we'll follow the life of a real-world geospatial dataset, from its raw collection in the field to its transformation into meaningful insights. Along the way, we’ll uncover the key steps of building a robust, scalable open-source geospatial pipeline.

Drawing on years of experience at Camptocamp, we’ll explore:

How raw spatial data is ingested and cleaned
How vector and raster data are efficiently stored and indexed (PostGIS, Cloud Optimized GeoTIFFs, Zarr)
How modern tools like Dask, GeoServer, and STAC (SpatioTemporal Asset Catalogs) help process and serve geospatial data
How to design pipelines that handle both "small data" (local shapefiles) and "big data" (terabytes of satellite imagery)
Common pitfalls and how to avoid them when moving from prototypes to production

This journey will show how the open-source ecosystem has matured to make geospatial big data accessible — and how spatial thinking can enrich almost any data project, whether you are building dashboards, doing analytics, or setting the stage for machine learning later on.

How is generative AI reshaping economic and legislative debates in Europe?

2025-10-01

Face To Face

AI/ML

How generative AI is reshaping economic and legislative debates in Europe

Risk Hunter demo

2025-10-01

Face To Face

AI/ML SaaS

Come and discover how our modular, AI-driven platform, available as SaaS and on-premises, lets you manage your GRC and cybersecurity.

AI@Horse Technologies: how did we double our quality analysis capacity with Altair RapidMiner?

2025-10-01

Face To Face

AI/ML

With the low-code/no-code Altair RapidMiner solution, quality experts can adapt the algorithm without programming.

Generative AI in the enterprise: impact, risks, and levers for action

2025-10-01

Face To Face

AI/ML

In this talk, we present generative AI and its impact on jobs and organizations. We will cover the opportunities ...

Probabilistic regression models: let's compare different modeling strategies and discuss how to evaluate them

2025-10-01

talk

Olivier Grisel

AI/ML PyTorch Scikit-learn

Most common machine learning models (linear, tree-based or neural network-based) optimize for the least squares loss when trained for regression tasks. As a result, they output a point estimate of the conditional expected value of the target: E[y|X].

In this presentation, we will explore several ways to train and evaluate probabilistic regression models as a richer alternative to point estimates. Those models predict a richer description of the full distribution of y|X and allow us to quantify the predictive uncertainty for individual predictions.

On the model training part, we will introduce the following options:

ensemble of quantile regressors for a grid of quantile levels (using linear models or gradient boosted trees in scikit-learn, XGBoost and PyTorch);
how to reduce probabilistic regression to multi-class classification + a cumulative sum of the predict_proba output to recover a continuous conditional CDF;
how to implement this approach as a generic scikit-learn meta-estimator;
how this approach is used to pretrain foundational tabular models (e.g. TabPFNv2);
simple Bayesian models (e.g. Bayesian Ridge and Gaussian Processes);
more specialized approaches as implemented in XGBoostLSS.
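
The classification-based reduction listed above can be sketched without any ML library. The snippet below is only an illustration under stated assumptions: the bin edges and the probability vector are made-up values standing in for one row of a classifier's predict_proba output, and the helper names (cdf_from_bin_probabilities, quantile_from_cdf) are hypothetical, not taken from the talk.

```python
from bisect import bisect_left
from itertools import accumulate


def cdf_from_bin_probabilities(proba):
    """Cumulative sum of per-bin probabilities: the CDF at each bin's right edge."""
    return list(accumulate(proba))


def quantile_from_cdf(bin_edges, cdf, level):
    """Invert the piecewise-linear CDF to read off the quantile at `level`."""
    # CDF knots: 0 at the left-most edge, then the cumulative values.
    knots = [0.0] + list(cdf)
    # First bin whose right-edge CDF value reaches the requested level.
    i = min(bisect_left(knots, level, lo=1), len(knots) - 1)
    left, right = bin_edges[i - 1], bin_edges[i]
    mass = knots[i] - knots[i - 1]
    if mass == 0:
        return left
    # Linear interpolation inside the bin.
    return left + (right - left) * (level - knots[i - 1]) / mass


# Hypothetical target bins and predicted class probabilities for one sample.
bin_edges = [0.0, 10.0, 20.0, 30.0, 40.0]
proba = [0.1, 0.4, 0.3, 0.2]

cdf = cdf_from_bin_probabilities(proba)
median = quantile_from_cdf(bin_edges, cdf, 0.5)
q90 = quantile_from_cdf(bin_edges, cdf, 0.9)
```

Interpolating linearly inside each bin is one simple way to make the recovered CDF continuous; finer bins trade smoother quantiles for a harder classification problem.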

We will also discuss how to evaluate probabilistic predictions via:

the pinball loss of quantile regressors,
other strictly proper scoring rules such as Continuous Ranked Probability Score (CRPS),
coverage measures and width of prediction intervals,
reliability diagrams for different quantile levels.
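
As a rough illustration of the first and third metrics in this list, the pinball loss and the interval coverage and width can be written in a few lines of plain Python. The data below is invented for the example and the function names are hypothetical; in practice scikit-learn's mean_pinball_loss covers the first metric.

```python
def pinball_loss(y_true, y_pred, level):
    """Mean pinball (quantile) loss of predictions for a given quantile level."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-predictions are weighted by `level`, over-predictions by 1 - level.
        total += level * diff if diff >= 0 else (level - 1) * diff
    return total / len(y_true)


def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


def mean_interval_width(lower, upper):
    """Average width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Made-up observations and predicted 10% / 90% quantiles.
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 2.5, 2.0, 3.0]
upper = [1.5, 3.5, 4.0, 5.0]

loss_q10 = pinball_loss(y_true, lower, 0.1)
coverage = interval_coverage(y_true, lower, upper)
width = mean_interval_width(lower, upper)
```

Coverage and width are best read together: an interval can reach its nominal coverage simply by being very wide, which is why both are reported.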

We will illustrate those concepts with concrete examples and running code.

Finally, we will illustrate why some applications need such calibrated probabilistic predictions:

estimating uncertainty in trip times depending on traffic conditions to help a human decision maker choose among various travel plan options,
modeling value at risk for investment decisions,
assessing the impact of missing variables for an ML model trained to work in degraded mode,
Bayesian optimization for operational parameters of industrial machines from little/costly observations.

If time allows, we will also discuss usage and limitations of Conformal Quantile Regressors as implemented in MAPIE and contrast aleatoric vs epistemic uncertainty captured by those models.

Agentic AI demos with Thinkeo

2025-10-01

Face To Face

AI/ML

Discover how AI agents are transforming document creation: - Create complete presentations - Generate reports ...

CodeCommons: Towards transparent, richer and sustainable datasets for code generation model training

2025-10-01 Watch

talk

Simeon Carstens, Rania Talbi

AI/ML Big Data

Built on top of Software Heritage - the largest public archive of source code - the CodeCommons collaboration is building a large-scale, metadata-rich source code dataset designed to make training AI models on code more transparent, sustainable, and fair. Code will be enriched with contextual information such as issues, pull request discussions, licensing data, and provenance. In this presentation, we will present the goals and structure of both the Software Heritage and CodeCommons projects, and discuss our particular contribution to CodeCommons' big data infrastructure.

Enhancing Machine Learning Workflows with skore

2025-10-01 Watch

talk

Marie Sacksick

AI/ML Python

Discover how skore, a new-born open-source Python library, can elevate your machine learning projects by integrating recommended practices and avoiding common pitfalls. This talk will introduce skore's key features and demonstrate how it can streamline your model evaluation and diagnostics processes.

Come see QDA Miner & WordStat 2025 in action!

2025-10-01

Face To Face

AI/ML LLM

Stay in control of AI: choose your engine (OpenAI, Gemini, etc.) and customize the prompts for full transparency.

Skrub: machine learning for dataframes

2025-10-01 Watch

talk

Jérôme Dockès, Riccardo Cappuzzo, Guillaume Lemaitre (scikit-learn)

AI/ML DataOps Scikit-learn

Skrub is an open source package that simplifies machine learning with dataframes by providing a variety of tools to explore, prepare and feature-engineer dataframes so they can be integrated into scikit-learn pipelines. Skrub DataOps let you build extensive, multi-table wrangling plans, explore hyperparameter spaces, and export the resulting objects for deployment. The talk showcases various use cases where skrub can simplify the job of a data scientist from data preparation to deployment, through code examples and demonstrations.

Building Data Science Tools for Sustainable Transformation

2025-10-01 Watch

talk

Anita Graser

AI/ML Data Science GenAI

The current AI hype, driven by generative AI and particularly large language models, is creating excitement, fear, and inflated expectations. In this keynote, we'll explore geographic & mobility data science tools (such as GeoPandas and MovingPandas) to transform this hype into sustainable and positive development that empowers users.

Big ideas shaping scientific Python: the quest for performance and usability

2025-09-30 Watch

talk

Ralf Gommers

AI/ML NumPy Python

Behind every technical leap in scientific Python lies a human ecosystem of volunteers, companies, and institutions working in tension and collaboration. This keynote explores how innovation actually happens in open source, through the lens of recent and ongoing initiatives that aim to move the needle on performance and usability - from the ideas that went into NumPy 2.0 and its relatively smooth rollout to the ongoing efforts to leverage the performance GPUs offer without sacrificing maintainability and usability.

Takeaways for the audience: Whether you’re an ML engineer tired of debugging GPU-CPU inconsistencies, a researcher pushing Python to its limits, or an open-source maintainer seeking sustainable funding, this keynote will equip you with both practical solutions and a clear vision of where scientific Python is headed next.

ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences

2025-09-30 Watch

talk

Emilien SCHULTZ, Paul Girard, Julien Boelaert

AI/ML API Computer Science GenAI GitHub LLM

The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.

To address this, we began developing ActiveTigger in 2023: a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad audience both within and outside the social sciences. ActiveTigger is already used by an active community in the social sciences, and the stable version is planned for early June 2025.

From a more technical perspective, the API is designed to manage the complete workflow: project creation, embedding computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained models (BERT-like), prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible both with a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.
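
To make the "human annotation with active learning" step more concrete, here is a minimal sketch of one common strategy, entropy-based uncertainty sampling. It is not ActiveTigger's implementation: the document identifiers, the probabilities and the function names are all hypothetical, and the probabilities stand in for the output of whatever model is currently trained on the annotated subset.

```python
import math


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, batch_size):
    """Return the ids of the `batch_size` texts the model is least sure about."""
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    return ranked[:batch_size]


# Hypothetical model outputs for four unlabeled texts (binary classification).
predictions = {
    "doc-1": [0.98, 0.02],
    "doc-2": [0.55, 0.45],
    "doc-3": [0.80, 0.20],
    "doc-4": [0.50, 0.50],
}

next_batch = select_for_annotation(predictions, batch_size=2)
```

The annotator labels the selected texts, the model is retrained, and the loop repeats, so human effort concentrates on the examples that are expected to be most informative.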

In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.

Project repository: https://github.com/emilienschultz/activetigger/

The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.

Modern Web Data Extraction: Techniques, Tools, Legal and Ethical Considerations

2025-09-30 Watch

talk

Domagoj Marić

AI/ML

To satisfy the need for data in generative and traditional AI, in a rapidly evolving environment, the ability to efficiently extract data from the web has become indispensable for businesses and developers. This presentation delves into the methodology and tools of web crawling and web scraping, with an overview of the ethical and legal side of the process, including best practices for crawling politely and efficiently and for using the data without violating privacy or intellectual property laws.
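
One small piece of "crawling politely" is honoring a site's robots.txt. The sketch below uses Python's standard urllib.robotparser on an inline, made-up robots.txt so that it runs offline; the user agent name, the URLs and the allowed helper are assumptions for illustration, and robots.txt compliance alone does not settle the legal questions the talk raises.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """True if the parsed robots.txt lets `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


public_ok = allowed("https://example.com/public/page")
private_ok = allowed("https://example.com/private/data")

# Seconds to wait between requests, if the site declares a crawl delay.
delay = parser.crawl_delay("example-bot")
```

A crawler would check allowed(url) before each request and sleep for at least delay seconds between requests to the same host.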

Optimal Transport in Python: A Practical Introduction with POT

2025-09-30 Watch

talk

Rémi Flamary

AI/ML Data Science GitHub Python

Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
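
For intuition about what an OT problem computes, the one-dimensional case has a simple closed form: with two samples of equal size and uniform weights, the optimal plan matches the sorted points. The sketch below relies on that special case in plain Python and does not use POT; the function name and the sample values are hypothetical.

```python
def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-size, uniformly weighted 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # In 1D the optimal transport plan pairs the i-th smallest x with the i-th smallest y.
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys))) / len(xs)
    return cost ** (1 / p)


# Two made-up samples: the second is the first shifted by 5 and shuffled.
source = [0.0, 1.0, 3.0]
target = [8.0, 5.0, 6.0]

w1 = wasserstein_1d(source, target, p=1)
w2 = wasserstein_1d(source, target, p=2)
```

In higher dimensions, or with unequal weights, no such shortcut exists and a dedicated solver is needed, which is the gap a library such as POT fills.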

Tackling Domain Shift with SKADA: A Hands-On Guide to Domain Adaptation

2025-09-30 Watch

talk

Théo Gnassounou, Antoine Collas

AI/ML Python PyTorch

Domain adaptation addresses the challenge of applying ML models to data that differs from the training distribution, a common issue in real-world applications. SKADA is a new Python library that brings domain adaptation tools to the scikit-learn and PyTorch ecosystems. This talk covers SKADA’s design, its integration with standard ML workflows, and how it helps practitioners build models that generalize better across domains.

Unlock the full predictive power of your multi-table data

2025-09-30 Watch

talk

Luc-Aurélien Gauthier, Alexis Bondu

AI/ML Python

While most machine learning tutorials and challenges focus on single-table datasets, real-world enterprise data is often distributed across multiple tables, such as customer logs, transaction records, or manufacturing logs. In this talk, we address the often-overlooked challenge of building predictive features directly from raw, multi-table data. You will learn how to automate feature engineering using a scalable, supervised, and overfit-resistant approach, grounded in information theory and available as a Python open-source library. The talk is aimed at data scientists and ML engineers working with structured data; basic machine learning knowledge is sufficient to follow.
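
To show what "building features from multi-table data" means in its simplest form, the sketch below flattens a secondary table into per-customer aggregates by hand. It is not the automated, supervised approach the talk presents: the tables, column names and the three aggregates are invented, and choosing such aggregates manually is precisely the step the talk proposes to automate.

```python
from collections import defaultdict

# Made-up main table (one row per customer) and secondary table (many rows per customer).
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 5.0},
]


def aggregate_transactions(customers, transactions):
    """Summarize each customer's transactions into one flat feature row."""
    amounts = defaultdict(list)
    for row in transactions:
        amounts[row["customer_id"]].append(row["amount"])

    features = []
    for customer in customers:
        values = amounts.get(customer["customer_id"], [])
        features.append({
            "customer_id": customer["customer_id"],
            "n_transactions": len(values),
            "total_amount": sum(values),
            "max_amount": max(values, default=0.0),
        })
    return features


features = aggregate_transactions(customers, transactions)
```

The resulting rows can be fed to any single-table model; customers without transactions simply get zero-valued aggregates.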

Browser-based AI workflows in Jupyter

2025-09-30 Watch

talk

Nicolas Brichet, Jeremy Tuloup

AI/ML Python

JupyterLite brings Python and other programming languages to the browser, removing the need for a server. In this talk, we show how to extend it for AI workflows: connecting to remote models, running smaller models locally in the browser, and leveraging lightweight interfaces like a chat to interact with them.

Navigating the security compliance maze of an ML service

2025-09-30 Watch

talk

Uwe L. Korn

AI/ML Cyber Security

While everyone is talking about the m(e/a)ss of bureaucracy, we want to show you hands-on what you may need to do to operate an ML service. We will give an overview of things like ISO-27001 certifications, the Cyber Resilience Act, and AIBOMs. We want to highlight their impact and intention and give advice on how to integrate them into your development workflow.

This talk is written from a practitioner's perspective and will help you set up your project to make your compliance department happy. It isn't meant as a deep-dive into the individual standards.

Booth demo

2025-09-01

Face To Face

AI/ML

Try our AI applications (Amplify, Campaign Companion, Score on the fly) at our booth C16 and talk with our Data & AI experts.

