PyData Paris 2025
Sessions & talks
Social Event
Lightning Talks
Big ideas shaping scientific Python: the quest for performance and usability
Behind every technical leap in scientific Python lies a human ecosystem of volunteers, companies, and institutions working in tension and collaboration. This keynote explores how innovation actually happens in open source, through the lens of recent and ongoing initiatives that aim to move the needle on performance and usability - from the ideas that went into NumPy 2.0 and its relatively smooth rollout to the ongoing efforts to leverage the performance GPUs offer without sacrificing maintainability and usability.
Takeaways for the audience: Whether you’re an ML engineer tired of debugging GPU-CPU inconsistencies, a researcher pushing Python to its limits, or an open-source maintainer seeking sustainable funding, this keynote will equip you with both practical solutions and a clear vision of where scientific Python is headed next.
Break
ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences
The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.
To address this, we began developing ActiveTigger in 2023: a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad audience both within and outside the social sciences. The tool is already used by an active community in the social sciences, and the stable version is planned for early June 2025.
From a more technical perspective, the API is designed to manage the complete workflow: project creation, embedding computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained (BERT-like) models, prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach for hybrid manual/automatic labeling. Accessible through both a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.
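As a rough illustration of the active-learning loop that ActiveTigger automates, here is a sketch built from generic scikit-learn components (this is the underlying pattern, not ActiveTigger's own client API):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; in practice this would be a large, mostly unlabeled dataset
texts = ["great movie", "terrible plot", "loved it", "boring", "awful acting", "fantastic"]
labels = np.array([1, 0, 1, 0, 0, 1])  # ground truth, revealed only when "annotated"
labeled = [0, 1]                        # indices annotated so far
X = TfidfVectorizer().fit_transform(texts)

for _ in range(3):
    clf = LogisticRegression().fit(X[labeled], labels[labeled])
    proba = clf.predict_proba(X)[:, 1]
    # Uncertainty sampling: ask the human about the least confident prediction
    candidates = [i for i in np.argsort(np.abs(proba - 0.5)) if i not in labeled]
    labeled.append(candidates[0])       # a human would annotate texts[candidates[0]] here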
In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.
Project repository: https://github.com/emilienschultz/activetigger/
The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.
Code as Data: A Practical Introduction to Python’s Abstract Syntax Tree
Peek under the hood of Python and unlock the power of its Abstract Syntax Tree! We'll demystify the AST and explore how it powers tools like pytest, linters, and refactoring - as well as some of your favorite data libraries.
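The standard-library ast module already gets you surprisingly far; a minimal sketch:

import ast

source = '''
def greet(name):
    return f"Hello, {name}!"
'''

tree = ast.parse(source)               # source code -> AST
for node in ast.walk(tree):            # visit every node in the tree
    if isinstance(node, ast.FunctionDef):
        print(node.name, [a.arg for a in node.args.args])
print(ast.dump(tree, indent=2))        # pretty-print the tree structure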
You Don’t Need Spark for That: Pythonic Data Lakehouse Workflows
Have you ever spun up a Spark cluster just to update three rows in a Delta table? In this talk, we’ll explore how modern Python libraries can power lightweight, production-grade Data Lakehouse workflows—helping you avoid over-engineering your data stack.
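As a taste of what this looks like, here is a hedged sketch using the deltalake package (the delta-rs Python bindings); the exact update API may differ between versions:

import pandas as pd
from deltalake import DeltaTable, write_deltalake

write_deltalake("./sales", pd.DataFrame({"id": [1, 2, 3], "qty": [10, 20, 30]}))
dt = DeltaTable("./sales")
# Update a handful of rows in place; no Spark cluster required
dt.update(predicate="id = 2", updates={"qty": "99"})
print(dt.to_pandas())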
Modern Web Data Extraction: Techniques, Tools, Legal and Ethical Considerations
To satisfy the need for data in generative and traditional AI, in a rapidly evolving environment, the ability to efficiently extract data from the web has become indispensable for businesses and developers. This presentation delves into the methodology and tools of web crawling and web scraping, with an overview of the ethical and legal side of the process, including the best practices on how to crawl politely and efficiently and use the data to not violate any privacy or intellectual property laws.
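As a minimal example of polite crawling with the standard library plus requests: consult robots.txt before fetching, identify yourself, and rate-limit (the user agent and URLs are illustrative):

import time
import urllib.robotparser
import requests

AGENT = "example-research-bot/0.1 (contact@example.org)"
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/data"
if rp.can_fetch(AGENT, url):
    resp = requests.get(url, headers={"User-Agent": AGENT}, timeout=10)
    time.sleep(1.0)  # throttle between requests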
Optimal Transport in Python: A Practical Introduction with POT
Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
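A first taste of the library (the point clouds and their sizes are arbitrary):

import numpy as np
import ot  # POT: Python Optimal Transport

xs = np.random.default_rng(0).normal(0, 1, (5, 2))  # source samples
xt = np.random.default_rng(1).normal(3, 1, (6, 2))  # target samples
a, b = ot.unif(5), ot.unif(6)   # uniform weights on each cloud
M = ot.dist(xs, xt)             # pairwise squared Euclidean cost matrix
G = ot.emd(a, b, M)             # exact optimal transport plan
print((G * M).sum())            # total transport cost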
Reproducible software provisioning for high performance computing (HPC) and research software engineering (RSE) using Spack
In this talk we focus on installing software (stacks) beyond just the Python ecosystem. In the first part of the talk we give an introduction to using the package manager Spack (https://spack.readthedocs.io). In the second part we explain how we use Spack at our institute to manage the software stack on the local HPC.
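A typical Spack workflow looks roughly like this (a sketch; the package specs are illustrative):

git clone --depth=1 https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh
spack env create myproject      # environment described by a spack.yaml file
spack env activate myproject
spack add hdf5 +mpi python@3.11 py-numpy
spack concretize                # lock the full dependency graph
spack install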
Fighting against instability: Debian Science at the synchrotron SOLEIL
The talk addresses the challenges of maintaining and preserving the sovereignty of data processing tools in synchrotron X-ray experiments. It emphasizes the use of stable packaging systems like Debian-based distributions and fostering collaboration within the scientific community to ensure independence from external services and long-term support for software.
How to make public data more accessible with "baked" data and DuckDB
Publicly available data is rarely analysis-ready, hampering researchers, organizations, and the public from easily accessing the information these datasets contain. One way to address this shortcoming is to "bake" the data into a structured format and ship it alongside code that can be used for analysis. For analytical work in particular, DuckDB provides a performant way to query the structured data in a variety of contexts.
This talk will explore the benefits and tradeoffs of this architectural pattern using the design of scipeds–an open source Python package for analyzing higher-education data in the US–as a case study.
No DuckDB experience required, beginner Python and programming experience recommended. This talk is aimed at data practitioners, especially those who work with public datasets.
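A minimal sketch of the pattern (the package and file names are hypothetical): the analysis-ready data ships inside the package, and DuckDB queries it in place.

from importlib import resources
import duckdb

# Locate the Parquet file "baked" into the installed package
data_path = resources.files("mypackage") / "data" / "enrollments.parquet"
df = duckdb.sql(
    f"SELECT year, COUNT(*) AS n FROM read_parquet('{data_path}') GROUP BY year"
).df()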
Tackling Domain Shift with SKADA: A Hands-On Guide to Domain Adaptation
Domain adaptation addresses the challenge of applying ML models to data that differs from the training distribution—a common issue in real-world applications. SKADA is a new Python library that brings domain adaptation tools to the scikit-learn and PyTorch ecosystem. This talk covers SKADA’s design, its integration with standard ML workflows, and how it helps practitioners build models that generalize better across domains.
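To make the idea concrete, here is a minimal NumPy sketch of CORAL, one of the classic adaptation methods such libraries implement (this shows the underlying algorithm, not SKADA's API): align source features to the target by whitening with the source covariance and re-coloring with the target covariance.

import numpy as np
from scipy.linalg import inv, sqrtm

def coral(Xs, Xt, eps=1e-5):
    d = Xs.shape[1]
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(d)  # source covariance
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(d)  # target covariance
    # Whiten source features, then re-color with target statistics
    return Xs @ np.real(inv(sqrtm(Cs)) @ sqrtm(Ct))

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(0, 1, (200, 3)), rng.normal(2, 3, (300, 3))
Xs_aligned = coral(Xs, Xt)  # train a classifier on Xs_aligned instead of Xs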
Lunch
Expanding Programming Language Support in JupyterLite
JupyterLite is a web-based distribution of JupyterLab that runs entirely in the browser, leveraging WebAssembly builds of language kernels and interpreters.
In this talk, we introduce emscripten-forge, a conda-based software distribution tailored for WebAssembly and the web browser. It powers several JupyterLite kernels, including:
- xeus-Python for Python,
- xeus-R for R,
- xeus-Octave for GNU Octave.
These kernels cover some of the most popular languages in scientific computing.
Additionally, emscripten-forge includes builds for various terminal applications, utilized by the Cockle shell emulator to enable the JupyterLite terminal.
Sparrow, Pirates of the Apache Arrow
Sparrow is a lightweight C++20 idiomatic implementation of the Apache Arrow memory specification. Designed for compatibility with the Arrow C data interface, Sparrow enables seamless data exchange with other libraries supporting the Arrow format. It also offers high-level APIs, ensuring interoperability with standard modern C++ algorithms.
Unlock the full predictive power of your multi-table data
While most machine learning tutorials and challenges focus on single-table datasets, real-world enterprise data is often distributed across multiple tables, such as customer logs, transaction records, or manufacturing logs. In this talk, we address the often-overlooked challenge of building predictive features directly from raw, multi-table data. You will learn how to automate feature engineering using a scalable, supervised, and overfit-resistant approach, grounded in information theory and available as a Python open-source library. The talk is aimed at data scientists and ML engineers working with structured data; basic machine learning knowledge is sufficient to follow.
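For context, here is the manual baseline that such automated approaches replace: hand-crafting aggregate features from a secondary table with pandas (tables and column names invented for illustration).

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "churned": [0, 1]})
transactions = pd.DataFrame(
    {"customer_id": [1, 1, 2], "amount": [10.0, 25.0, 5.0]}
)
# Flatten the one-to-many table into per-customer aggregate features
feats = transactions.groupby("customer_id")["amount"].agg(["count", "sum", "mean"])
train = customers.join(feats, on="customer_id")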
Advanced Polars: Lazy Queries and Streaming Mode
Do you find yourself struggling with Pandas' limitations when handling massive datasets or real-time data streams?
Discover Polars, the lightning-fast DataFrame library built in Rust. This talk presents two advanced features of the next-generation dataframe library: lazy queries and streaming mode.
Lazy evaluation in Polars allows you to build complex data pipelines without the performance bottlenecks of eager execution. By deferring computation, Polars optimises your queries using techniques like predicate and projection pushdown, reducing unnecessary computations and memory overhead. This leads to significant performance improvements, particularly with datasets larger than your system’s physical memory.
Polars' LazyFrames form the foundation of the library’s streaming mode, enabling efficient streaming pipelines, real-time transformations, and seamless integration with various data sinks.
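A small sketch of both features together (file and column names are invented; the streaming flag has changed names across Polars versions):

import polars as pl

lazy = (
    pl.scan_parquet("events.parquet")  # nothing is read yet
    .filter(pl.col("amount") > 0)      # pushed down to the scan
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
)
print(lazy.explain())                  # inspect the optimised query plan
df = lazy.collect(streaming=True)      # execute with the streaming engine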
This session will explore use cases and technical implementations of both lazy queries and streaming mode. We’ll also include live-coding demonstrations to introduce the tool, showcase best practices, and highlight common pitfalls.
Attendees will walk away with practical knowledge of lazy queries and streaming mode, ready to apply these tools in their daily work as data engineers or data scientists.
Browser-based AI workflows in Jupyter
JupyterLite brings Python and other programming languages to the browser, removing the need for a server. In this talk, we show how to extend it for AI workflows: connecting to remote models, running smaller models locally in the browser, and leveraging lightweight interfaces like a chat to interact with them.
Navigating the security compliance maze of an ML service
While everyone is talking about the m(e/a)ss of bureaucracy, we want to show you hands-on what you may need to do to operate an ML service. We will give an overview of things like ISO 27001 certification, the Cyber Resilience Act, and AIBOMs. We want to highlight their impact and intention and give advice on how to integrate them into your development workflow.
This talk is written from a practitioner's perspective and will help you set up your project to make your compliance department happy. It is not meant as a deep dive into the individual standards.