talk-data.com

Event

PyData Paris 2025

2025-09-01 – 2025-10-02 PyData

Top Speakers

Johan Mabille (2), Romain Clement (2), Tim Paine (1), Christophe Dervieux (1), David Brochart (1), Emanuele Fabbiani (1), Guillaume Lemaitre (1), Ian Thomas (1), Jeremy Tuloup (1), Justine BEL-LETOILE (1), Lex Avstreikh (1), Nicolas M. Thiéry (1)

Sessions & talks

Build a data studio in your notebook with jupyter-fs

2025-10-01

talk

Tim Paine

GitHub S3

jupyter-fs provides an interface between PyFilesystem and fsspec file systems, the JupyterLab user interface, and the Jupyter notebooks you run. Connect and browse your local filesystem, S3, Samba, WebDAV, and more, interacting with data seamlessly from both the JupyterLab UI and your notebook's kernel.

How to do real TDD in data science? A journey from pandas to polars with pelage!

2025-10-01

talk

Alix Tiran-Cappello

Data Quality Data Science GitHub Pandas Polars Python

In the world of data, inconsistencies and inaccuracies often present a major obstacle to extracting valuable insights. Yet the number of robust tools and practices that address these issues remains limited. In particular, test-driven development (TDD) remains difficult to practice in data science, although it is standard in classic software development, partly because of poorly adapted tools and frameworks.

To address this issue, we released Pelage, an open-source Python package that facilitates data exploration and testing and relies on Polars' intuitive syntax and speed. Pelage helps data scientists and analysts streamline data transformation, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence in your data transformations.

See website: https://alixtc.github.io/pelage/
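The test-first idea described above can be sketched in plain Python. This is a minimal illustration only: the helper names (`has_no_nulls`, `is_unique`), the `clean_orders` transformation, and the sample rows are all hypothetical, and none of them are Pelage's API. The test states the expected data properties first, and the transformation exists to satisfy them.

```python
# Hypothetical helpers written for illustration; they are NOT Pelage's API.
def has_no_nulls(rows, column):
    """Return rows unchanged, or raise if any row has None in `column`."""
    bad = [i for i, row in enumerate(rows) if row[column] is None]
    if bad:
        raise AssertionError(f"null values in {column!r} at rows {bad}")
    return rows


def is_unique(rows, column):
    """Return rows unchanged, or raise on the first repeated value in `column`."""
    seen = set()
    for i, row in enumerate(rows):
        value = row[column]
        if value in seen:
            raise AssertionError(f"duplicate {value!r} in {column!r} at row {i}")
        seen.add(value)
    return rows


def clean_orders(rows):
    """Transformation under test: drop rows without an id, keep the first of duplicates."""
    seen, out = set(), []
    for row in rows:
        if row["order_id"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        out.append(row)
    return out


def test_clean_orders():
    # Made-up input containing one duplicate id and one missing id.
    raw = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 1, "amount": 10.0},
        {"order_id": None, "amount": 5.0},
        {"order_id": 2, "amount": 7.5},
    ]
    result = clean_orders(raw)
    # The checks return their input, so they can be chained.
    is_unique(has_no_nulls(result, "order_id"), "order_id")
    assert [r["order_id"] for r in result] == [1, 2]


test_clean_orders()
```

Because each check returns its input, the same helpers could also be placed inside a transformation pipeline rather than only in tests.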

PyPI in the face: running jokes that PyPI download stats can play on you

2025-10-01

talk

Loïc Estève

Analytics GitHub NumPy Python Scikit-learn SciPy

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform some of our decisions, such as:

- How do we increase user awareness of best practices (please use Pipeline and cross-validation)?
- How do we advertise our recent improvements (use HistGradientBoosting rather than GradientBoosting; TunedThresholdClassifier; PCA and a few other models can run on GPU)?
- Do users care more about new features from recent releases or consolidation of what already exists?
- How long should we support older versions of Python, NumPy, or SciPy?

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard, but grasping the reality behind these metrics is often tricky.

ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences

2025-09-30

talk

Emilien SCHULTZ, Paul Girard, Julien Boelaert

AI/ML API Computer Science GenAI GitHub LLM

The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.

To address this, we have been developing ActiveTigger since 2023: a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad audience both within and outside the social sciences. Already used by a dynamic community in the social sciences, the stable version is planned for early June 2025.

From a more technical perspective, the API is designed to manage the complete workflow: project creation, embeddings computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained models (BERT-like), prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible through both a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.
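To make the "human annotation with active learning" step concrete, here is a minimal sketch of uncertainty sampling, one common active-learning strategy. The abstract does not say which selection strategy ActiveTigger uses, so this is an assumption for illustration only; the function name, document ids, and probabilities are hypothetical.

```python
def most_uncertain(probabilities, k):
    """Return the ids of the k items whose predicted probability is closest to 0.5.

    `probabilities` maps an item id to the model's predicted probability
    for the positive class (binary classification assumed).
    """
    ranked = sorted(probabilities, key=lambda item_id: abs(probabilities[item_id] - 0.5))
    return ranked[:k]


# Made-up model outputs for four unlabeled documents.
predicted = {"doc-a": 0.97, "doc-b": 0.52, "doc-c": 0.10, "doc-d": 0.45}

# The two documents the model is least sure about go to the human annotator next.
to_annotate = most_uncertain(predicted, k=2)
print(to_annotate)  # ['doc-b', 'doc-d']
```

Labeling the least certain items first tends to give the model more information per annotation than labeling items at random.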

In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.

Project repository: https://github.com/emilienschultz/activetigger/

The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.

Optimal Transport in Python: A Practical Introduction with POT

2025-09-30

talk

Rémi Flamary

AI/ML Data Science GitHub Python

Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
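As a small worked example of what an OT problem computes, the sketch below handles the simplest case without using POT: two one-dimensional samples of equal size with uniform weights and cost |x - y|. In that special case the optimal plan matches the sorted samples in order, so the transport cost is the mean absolute difference of the sorted values. The function name and the sample values are illustrative assumptions; general OT problems need a solver such as the ones POT provides.

```python
def wasserstein_1d(xs, ys):
    """Optimal transport cost between two equal-size 1D samples.

    Assumes uniform weights and cost |x - y|. In one dimension the optimal
    plan pairs the i-th smallest source point with the i-th smallest target.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


# Made-up samples: the target is the source shifted by 5.
source = [0.0, 1.0, 3.0]
target = [5.0, 6.0, 8.0]

cost = wasserstein_1d(source, target)
print(cost)  # 5.0, since every point moves a distance of 5
```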

From Jupyter Notebook to Publish-Ready Report: Effortless Sharing with Quarto

2025-09-30

talk

Christophe Dervieux

GitHub Python

See how Quarto can transform your Jupyter notebooks into stakeholder-ready web pages or PDFs, published online with just one command. This session features practical demonstrations of publishing with quarto publish, applying custom styles tailored to your organization thanks to brand.yml, and leveraging new features for reproducible research.

Designed for anyone looking to share their work, this talk requires only basic Python and notebook familiarity. You’ll walk away with the skills to elevate your reporting workflow and share insights professionally.
