talk-data.com

Topic: Python
Tags: programming_language, data_science, web_development
20 tagged activities

Activity Trend: peak of 185 activities per quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: PyData Paris 2025
CoSApp: an open-source library to design complex systems

CoSApp, for Collaborative System Approach, is a Python library dedicated to the simulation and design of multi-disciplinary systems. It is primarily intended for engineers and system architects during the early stages of industrial product design. The API of CoSApp focuses on simplicity and on the explicit declaration of design problems. Special attention is given to modularity: a flexible solver-assembly mechanism lets users construct complex, customized simulation workflows. This presentation covers the key features of the framework.

Documentation: https://cosapp.readthedocs.io
Repository: https://gitlab.com/cosapp/cosapp
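
As a flavor of the API, here is a minimal sketch of a single-component system, patterned on the examples in the CoSApp documentation (the Resistor component is our own invention, and exact signatures should be checked against the docs):

    # Minimal CoSApp-style system: declare inputs/outputs in setup(),
    # compute outputs in compute(). Patterned on the CoSApp docs.
    from cosapp.base import System

    class Resistor(System):
        def setup(self):
            self.add_inward("U", 1.0)    # voltage [V]
            self.add_inward("R", 100.0)  # resistance [Ohm]
            self.add_outward("I", 0.0)   # current [A]

        def compute(self):
            self.I = self.U / self.R

    r = Resistor("resistor")
    r.U = 12.0
    r.run_once()
    print(r.I)  # 0.12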

Parallel processing using CRDTs

Beyond embarrassingly parallel problems, data must be shared between workers for them to do something useful. This can be done by:

  • sharing memory between threads, which requires guarding access to the shared data to avoid race conditions;
  • copying memory to subprocesses, which requires synchronizing the data whenever it is mutated.

In Python, using threads is not an option today because the GIL (global interpreter lock) prevents true parallelism. This might change with the ongoing removal of the GIL, but the usual problems of multithreading will then appear, such as using locks and managing their complexity. Subprocesses don't suffer from the GIL, but they usually need a database to share data, which is often too slow. Algorithms such as HAMT (hash array mapped trie) have been used to efficiently and safely share data stored in immutable data structures, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data types) can be used for the same purpose.
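
To make the idea concrete, here is a minimal hand-rolled G-Counter, one of the simplest CRDTs (this illustrates the concept only and is not the library used in the talk): each worker increments its own slot, and merging takes an element-wise maximum, so merges are commutative, associative, and idempotent, and no locks are needed.

    # Hand-rolled G-Counter CRDT sketch (illustrative, not the talk's library).
    from collections import defaultdict

    class GCounter:
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = defaultdict(int)

        def increment(self, n=1):
            self.counts[self.replica_id] += n  # only ever touch our own slot

        def merge(self, other):
            # Element-wise max: safe to apply in any order, any number of times.
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts[rid], c)

        @property
        def value(self):
            return sum(self.counts.values())

    a, b = GCounter("worker-a"), GCounter("worker-b")
    a.increment(3); b.increment(5)
    a.merge(b); b.merge(a)
    assert a.value == b.value == 8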

How to do real TDD in data science? A journey from pandas to polars with pelage!

In the world of data, inconsistencies and inaccuracies often present a major obstacle to extracting valuable insights, yet robust tools and practices to address these issues remain limited. In particular, test-driven development (TDD), a standard practice in classic software development, remains difficult in data science, partly because of poorly adapted tools and frameworks.

To address this issue we released Pelage, an open-source Python package for data exploration and testing that builds on Polars' intuitive syntax and speed. Pelage helps data scientists and analysts streamline data transformations, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence in your data transformations.

See website: https://alixtc.github.io/pelage/
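
To illustrate the test-first workflow, here is a minimal sketch using plain Polars and pytest assertions (this shows the TDD idea, not Pelage's own check functions, for which see the documentation above):

    # Write the test first, then make the transformation pass it.
    import polars as pl

    def add_unit_price(df: pl.DataFrame) -> pl.DataFrame:
        return df.with_columns(
            (pl.col("total") / pl.col("quantity")).alias("unit_price")
        )

    def test_add_unit_price():
        df = pl.DataFrame({"total": [10.0, 9.0], "quantity": [2, 3]})
        out = add_unit_price(df)
        assert out["unit_price"].to_list() == [5.0, 3.0]
        assert out["unit_price"].null_count() == 0  # data-quality check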

PyPI in the face: running jokes that PyPI download stats can play on you

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform decisions such as:

  • How do we increase user awareness of best practices (please use Pipeline and cross-validation)?
  • How do we advertise recent improvements (use HistGradientBoosting rather than GradientBoosting; TunedThresholdClassifier; PCA and a few other models can run on GPU)?
  • Do users care more about new features from recent releases, or about consolidation of what already exists?
  • How long should we support older versions of Python, NumPy, or SciPy?

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard; grasping the reality behind these metrics often is.
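
For readers who want to poke at the raw numbers themselves, the pypistats.org service exposes a JSON API; the sketch below assumes the endpoint documented at https://pypistats.org/api/ and should be verified against it:

    # Fetch recent download counts from the public pypistats.org JSON API.
    import requests

    resp = requests.get("https://pypistats.org/api/packages/scikit-learn/recent")
    resp.raise_for_status()
    print(resp.json()["data"])  # e.g. {"last_day": ..., "last_week": ..., "last_month": ...}
    # Caveat from the talk: raw counts include CI jobs, mirrors, and caches,
    # so treat them as a noisy proxy, not a user head count.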

Sharing computational course material at larger scale: a French multi-tenant attempt

With the rise of computation and data as pillars of science, institutions are struggling to provide large-scale training to their students and staff. Often, this leads to redundant, fragmented efforts, with each organization producing its own bespoke training material. In this talk, we report on a collaborative multi-tenant initiative to produce a shared corpus of interactive training resources in the Python language, designed as a digital common that can be adapted to diverse contexts and formats in French higher education and beyond.

Big ideas shaping scientific Python: the quest for performance and usability

Behind every technical leap in scientific Python lies a human ecosystem of volunteers, companies, and institutions working in tension and collaboration. This keynote explores how innovation actually happens in open source, through the lens of recent and ongoing initiatives that aim to move the needle on performance and usability - from the ideas that went into NumPy 2.0 and its relatively smooth rollout to the ongoing efforts to leverage the performance GPUs offer without sacrificing maintainability and usability.

Takeaways for the audience: Whether you’re an ML engineer tired of debugging GPU-CPU inconsistencies, a researcher pushing Python to its limits, or an open-source maintainer seeking sustainable funding, this keynote will equip you with both practical solutions and a clear vision of where scientific Python is headed next.

ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences

The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.

To address this, in 2023 we initiated the development of ActiveTigger, a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad audience both within and outside the social sciences. The tool is already used by an active community in the social sciences, and the stable version is planned for early June 2025.

From a more technical perspective, the API is designed to manage the complete workflow: project creation, embedding computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained (BERT-like) models, prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach to hybrid manual/automatic labeling. Accessible through both a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.
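
As background for the active-learning step mentioned above, here is a generic uncertainty-sampling loop written with scikit-learn; it illustrates the idea ActiveTigger builds on and is not ActiveTigger's own API (the toy texts and labels are invented):

    # Uncertainty sampling: repeatedly label the example the model is least sure about.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["great talk", "boring session", "loved it", "waste of time",
             "insightful demo", "terrible slides", "brilliant speaker", "dull intro"]
    labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

    X = TfidfVectorizer().fit_transform(texts)
    labeled = [0, 1]  # indices already annotated by a human
    pool = [i for i in range(len(texts)) if i not in labeled]

    for _ in range(3):
        clf = LogisticRegression().fit(X[labeled], labels[labeled])
        proba = clf.predict_proba(X[pool])
        # Query the pool example whose probability is closest to 0.5.
        query = pool[int(np.argmin(np.abs(proba[:, 1] - 0.5)))]
        labeled.append(query)  # in practice, a human annotates this one
        pool.remove(query)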

In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.

The repository of the project: https://github.com/emilienschultz/activetigger/

The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.

Optimal Transport in Python: A Practical Introduction with POT

Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
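
As a preview of the hands-on part, a minimal example using POT's documented top-level functions (the point clouds are made up for illustration):

    # Exact optimal transport between two small point clouds with POT.
    import numpy as np
    import ot

    rng = np.random.default_rng(0)
    xs = rng.normal(0, 1, size=(50, 2))   # source samples
    xt = rng.normal(3, 1, size=(60, 2))   # target samples
    a = np.full(50, 1 / 50)               # uniform source weights
    b = np.full(60, 1 / 60)               # uniform target weights

    M = ot.dist(xs, xt)                   # pairwise squared Euclidean costs
    G = ot.emd(a, b, M)                   # exact OT plan (linear program)
    cost = np.sum(G * M)                  # total transport cost

    # Entropy-regularized variant, much faster on large problems:
    G_sinkhorn = ot.sinkhorn(a, b, M, reg=1e-1)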

Reproducible software provisioning for high performance computing (HPC) and research software engineering (RSE) using Spack

In this talk we focus on installing software (stacks) beyond just the Python ecosystem. In the first part of the talk we give an introduction to using the package manager Spack (https://spack.readthedocs.io). In the second part we explain how we use Spack at our institute to manage the software stack on the local HPC.
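
As a taste of the workflow, a hypothetical Spack environment session using the documented CLI (the package names are examples only):

    # Create and populate a reproducible Spack environment.
    $ spack env create myproject
    $ spack env activate myproject
    $ spack add python@3.12 py-numpy openmpi
    $ spack concretize   # resolve the full dependency graph
    $ spack install      # build and install the concretized stack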

How to make public data more accessible with "baked" data and DuckDB

Publicly available data is rarely analysis-ready, hampering researchers, organizations, and the public from easily accessing the information these datasets contain. One way to address this shortcoming is to "bake" the data into a structured format and ship it alongside code that can be used for analysis. For analytical work in particular, DuckDB provides a performant way to query the structured data in a variety of contexts.

This talk will explore the benefits and tradeoffs of this architectural pattern using the design of scipeds, an open-source Python package for analyzing higher-education data in the US, as a case study.

No DuckDB experience is required; beginner-level Python and programming experience is recommended. This talk is aimed at data practitioners, especially those who work with public datasets.
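
A minimal sketch of the pattern: query a database file shipped alongside the analysis code. The file, table, and column names below are invented for illustration; scipeds' actual schema may differ.

    # Query "baked" data with DuckDB, straight into a pandas DataFrame.
    import duckdb

    con = duckdb.connect("completions.duckdb", read_only=True)
    df = con.execute(
        """
        SELECT year, field, SUM(degrees) AS degrees
        FROM completions
        GROUP BY year, field
        ORDER BY year
        """
    ).df()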

Tackling Domain Shift with SKADA: A Hands-On Guide to Domain Adaptation

Domain adaptation addresses the challenge of applying ML models to data that differs from the training distribution, a common issue in real-world applications. SKADA is a new Python library that brings domain adaptation tools to the scikit-learn and PyTorch ecosystems. This talk covers SKADA's design, its integration with standard ML workflows, and how it helps practitioners build models that generalize better across domains.
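
For context, here is the classic importance-weighting baseline for covariate shift, written with plain scikit-learn; it illustrates the problem SKADA addresses and is not SKADA's own API:

    # Reweight source samples so training reflects the target distribution.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_source = rng.normal(0.0, 1.0, size=(200, 2))
    y_source = (X_source[:, 0] > 0).astype(int)
    X_target = rng.normal(1.0, 1.0, size=(200, 2))  # shifted, unlabeled domain

    # 1) Train a domain classifier to estimate p(target | x).
    X_dom = np.vstack([X_source, X_target])
    d = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]
    dom_clf = LogisticRegression().fit(X_dom, d)
    p_t = dom_clf.predict_proba(X_source)[:, 1]

    # 2) Importance weights: p(target | x) / p(source | x).
    w = p_t / (1 - p_t)

    # 3) Fit the task model on the reweighted source data.
    task_clf = LogisticRegression().fit(X_source, y_source, sample_weight=w)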

Expanding Programming Language Support in JupyterLite

JupyterLite is a web-based distribution of JupyterLab that runs entirely in the browser, leveraging WebAssembly builds of language kernels and interpreters.

In this talk, we introduce emscripten-forge, a conda-based software distribution tailored for WebAssembly and the web browser. Emscripten-forge powers several JupyterLite kernels, including:

  • xeus-Python for Python,
  • xeus-R for R,
  • xeus-Octave for GNU Octave.

These kernels cover some of the most popular languages in scientific computing.

Additionally, emscripten-forge includes builds for various terminal applications, utilized by the Cockle shell emulator to enable the JupyterLite terminal.
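
As an example of how the pieces fit together, here is a hypothetical environment file for the jupyterlite-xeus plugin pulling packages from emscripten-forge (the channel URL and fields follow the jupyterlite-xeus documentation; verify the details there):

    # environment.yml consumed by jupyterlite-xeus when building the site
    name: jupyterlite-env
    channels:
      - https://repo.mamba.pm/emscripten-forge
      - conda-forge
    dependencies:
      - xeus-python   # Python kernel compiled to WebAssembly
      - numpy

Running jupyter lite build in the same directory then bundles the kernel and its packages into the static JupyterLite site.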

Unlock the full predictive power of your multi-table data

While most machine learning tutorials and challenges focus on single-table datasets, real-world enterprise data is often distributed across multiple tables, such as customer logs, transaction records, or manufacturing logs. In this talk, we address the often-overlooked challenge of building predictive features directly from raw, multi-table data. You will learn how to automate feature engineering using a scalable, supervised, and overfit-resistant approach, grounded in information theory and available as a Python open-source library. The talk is aimed at data scientists and ML engineers working with structured data; basic machine learning knowledge is sufficient to follow.
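
To make the problem concrete, here is the manual baseline that such an approach automates: hand-written aggregates flattening a child table onto the main table (all names are invented for the example):

    # Manual multi-table feature engineering with pandas.
    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2], "churned": [0, 1]})
    transactions = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 25.0, 5.0, 5.0, 7.5],
    })

    # One aggregate per (column, function) pair: this space explodes
    # combinatorially, and picking aggregates by hand risks overfitting,
    # which is what the supervised, information-theoretic approach in the
    # talk is designed to control.
    features = (
        transactions.groupby("customer_id")["amount"]
        .agg(n_tx="count", total="sum", mean_amount="mean")
        .reset_index()
    )
    train = customers.merge(features, on="customer_id", how="left")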

A Hitchhiker's Guide to the Array API Standard Ecosystem

The array API standard is unifying the ecosystem of Python array computing, facilitating greater interoperability between code written for different array libraries, including NumPy, CuPy, PyTorch, JAX, and Dask.

But what are all of these "array-api-" libraries for? How can you use them to 'future-proof' your own libraries and provide support for GPU and distributed arrays to your users? Find out in this talk, where I'll guide you through every corner of the array API standard ecosystem, explaining how SciPy and scikit-learn are using these tools to adopt the standard. I'll also share progress updates from the past year, to give you a clear picture of where we are now and what the future holds.
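
A small sketch of what standard-based code looks like, using the array-api-compat helper to retrieve the namespace matching whichever array comes in:

    # Array-library-agnostic code via the array API standard.
    import array_api_compat
    import numpy as np

    def standardize(x):
        # Works for NumPy, CuPy, PyTorch, etc.: fetch the input's namespace
        # and use only functions defined by the array API standard.
        xp = array_api_compat.array_namespace(x)
        return (x - xp.mean(x)) / xp.std(x)

    print(standardize(np.asarray([1.0, 2.0, 3.0])))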

From Jupyter Notebook to Publish-Ready Report: Effortless Sharing with Quarto

See how Quarto can transform your Jupyter notebooks into stakeholder-ready web pages or PDFs, published online with just one command. This session features practical demonstrations of publishing with quarto publish, applying custom styles tailored to your organization thanks to brand.yml, and leveraging new features for reproducible research.

Designed for anyone looking to share their work, this talk requires only basic Python and notebook familiarity. You’ll walk away with the skills to elevate your reporting workflow and share insights professionally.
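
A hypothetical end-to-end flow using Quarto's documented commands (the file name is an example):

    $ quarto render report.ipynb --to html    # notebook to a styled web page
    $ quarto render report.ipynb --to pdf     # or a PDF hand-off
    $ quarto publish quarto-pub report.ipynb  # put it online in one command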