talk-data.com

Topic: Python
Tags: programming_language, data_science, web_development
20 tagged activities

Activity Trend: peak of 185 activities per quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: PyData Paris 2025
CoSApp: an open-source library to design complex systems

CoSApp, for Collaborative System Approach, is a Python library dedicated to the simulation and design of multi-disciplinary systems. It is primarily intended for engineers and system architects during the early stages of industrial product design. The API of CoSApp focuses on simplicity and on the explicit declaration of design problems. Special attention is given to modularity: a flexible solver-assembly mechanism lets users construct complex, customized simulation workflows. This presentation covers the key features of the framework.

Documentation: https://cosapp.readthedocs.io
Repository: https://gitlab.com/cosapp/cosapp
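
As a flavor of the API, here is a minimal sketch of a single-component system, patterned on the examples in the CoSApp documentation (the Resistor component is our own invention, and exact signatures should be checked against the docs):

    # Minimal CoSApp-style system: declare inputs/outputs in setup(),
    # compute outputs in compute(). Patterned on the CoSApp docs.
    from cosapp.base import System

    class Resistor(System):
        def setup(self):
            self.add_inward("U", 1.0)    # voltage [V]
            self.add_inward("R", 100.0)  # resistance [Ohm]
            self.add_outward("I", 0.0)   # current [A]

        def compute(self):
            self.I = self.U / self.R

    r = Resistor("resistor")
    r.U = 12.0
    r.run_once()
    print(r.I)  # 0.12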

Parallel processing using CRDTs

Beyond embarrassingly parallel problems, data must be shared between workers for them to do something useful. This can be done by:

  • sharing memory between threads, which requires guarding access to the shared data to avoid race conditions;
  • copying memory to subprocesses, which requires synchronizing the data whenever it is mutated.

In Python, using threads is not an option today because the GIL (global interpreter lock) prevents true parallelism. This might change with the ongoing removal of the GIL, but the usual problems of multithreading will then appear, such as using locks and managing their complexity. Subprocesses don't suffer from the GIL, but they usually need a database to share data, which is often too slow. Algorithms such as HAMT (hash array mapped trie) have been used to efficiently and safely share data stored in immutable data structures, removing the need for locks. In this talk we will show how CRDTs (conflict-free replicated data types) can be used for the same purpose.
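
To make the idea concrete, here is a minimal hand-rolled G-Counter, one of the simplest CRDTs (this illustrates the concept only and is not the library used in the talk): each worker increments its own slot, and merging takes an element-wise maximum, so merges are commutative, associative, and idempotent, and no locks are needed.

    # Hand-rolled G-Counter CRDT sketch (illustrative, not the talk's library).
    from collections import defaultdict

    class GCounter:
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = defaultdict(int)

        def increment(self, n=1):
            self.counts[self.replica_id] += n  # only ever touch our own slot

        def merge(self, other):
            # Element-wise max: safe to apply in any order, any number of times.
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts[rid], c)

        @property
        def value(self):
            return sum(self.counts.values())

    a, b = GCounter("worker-a"), GCounter("worker-b")
    a.increment(3); b.increment(5)
    a.merge(b); b.merge(a)
    assert a.value == b.value == 8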

How to do real TDD in data science? A journey from pandas to polars with pelage!

In the world of data, inconsistencies and inaccuracies often present a major obstacle to extracting valuable insights, yet robust tools and practices to address these issues remain limited. In particular, test-driven development (TDD), a standard practice in classic software development, remains difficult in data science, partly because of poorly adapted tools and frameworks.

To address this issue we released Pelage, an open-source Python package for data exploration and testing that builds on Polars' intuitive syntax and speed. Pelage helps data scientists and analysts streamline data transformations, enhance data quality, and improve code clarity.

We will demonstrate, in a test-first approach, how you can use this library in a meaningful data science workflow to gain greater confidence in your data transformations.

See website: https://alixtc.github.io/pelage/
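
To illustrate the test-first workflow, here is a minimal sketch using plain Polars and pytest assertions (this shows the TDD idea, not Pelage's own check functions, for which see the documentation above):

    # Write the test first, then make the transformation pass it.
    import polars as pl

    def add_unit_price(df: pl.DataFrame) -> pl.DataFrame:
        return df.with_columns(
            (pl.col("total") / pl.col("quantity")).alias("unit_price")
        )

    def test_add_unit_price():
        df = pl.DataFrame({"total": [10.0, 9.0], "quantity": [2, 3]})
        out = add_unit_price(df)
        assert out["unit_price"].to_list() == [5.0, 3.0]
        assert out["unit_price"].null_count() == 0  # data-quality check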

PyPI in the face: running jokes that PyPI download stats can play on you

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform decisions such as:

  • How do we increase user awareness of best practices (please use Pipeline and cross-validation)?
  • How do we advertise recent improvements (use HistGradientBoosting rather than GradientBoosting; TunedThresholdClassifier; PCA and a few other models can run on GPU)?
  • Do users care more about new features from recent releases, or about consolidation of what already exists?
  • How long should we support older versions of Python, NumPy, or SciPy?

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard; grasping the reality behind these metrics often is.
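
For readers who want to poke at the raw numbers themselves, the pypistats.org service exposes a JSON API; the sketch below assumes the endpoint documented at https://pypistats.org/api/ and should be verified against it:

    # Fetch recent download counts from the public pypistats.org JSON API.
    import requests

    resp = requests.get("https://pypistats.org/api/packages/scikit-learn/recent")
    resp.raise_for_status()
    print(resp.json()["data"])  # e.g. {"last_day": ..., "last_week": ..., "last_month": ...}
    # Caveat from the talk: raw counts include CI jobs, mirrors, and caches,
    # so treat them as a noisy proxy, not a user head count.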

Sharing computational course material at larger scale: a French multi-tenant attempt

With the rise of computation and data as pillars of science, institutions are struggling to provide large-scale training to their students and staff. Often, this leads to redundant, fragmented efforts, with each organization producing its own bespoke training material. In this talk, we report on a collaborative multi-tenant initiative to produce a shared corpus of interactive training resources in the Python language, designed as a digital common that can be adapted to diverse contexts and formats in French higher education and beyond.

Big ideas shaping scientific Python: the quest for performance and usability

Behind every technical leap in scientific Python lies a human ecosystem of volunteers, companies, and institutions working in tension and collaboration. This keynote explores how innovation actually happens in open source, through the lens of recent and ongoing initiatives that aim to move the needle on performance and usability - from the ideas that went into NumPy 2.0 and its relatively smooth rollout to the ongoing efforts to leverage the performance GPUs offer without sacrificing maintainability and usability.

Takeaways for the audience: Whether you’re an ML engineer tired of debugging GPU-CPU inconsistencies, a researcher pushing Python to its limits, or an open-source maintainer seeking sustainable funding, this keynote will equip you with both practical solutions and a clear vision of where scientific Python is headed next.

ActiveTigger: A Collaborative Text Annotation Research Tool for Computational Social Sciences

The exponential growth of textual data—ranging from social media posts and digital news archives to speech-to-text transcripts—has opened new frontiers for research in the social sciences. Tasks such as stance detection, topic classification, and information extraction have become increasingly common. At the same time, the rapid evolution of Natural Language Processing, especially pretrained language models and generative AI, has largely been led by the computer science community, often leaving a gap in accessibility for social scientists.

To address this, in 2023 we initiated the development of ActiveTigger, a lightweight, open-source Python application (with a web frontend in React) designed to accelerate the annotation process and manage large-scale datasets through the integration of fine-tuned models. It aims to support computational social science for a broad audience both within and outside the social sciences. The tool is already used by an active community in the social sciences, and the stable version is planned for early June 2025.

From a more technical perspective, the API is designed to manage the complete workflow: project creation, embedding computation, exploration of the text corpus, human annotation with active learning, fine-tuning of pre-trained (BERT-like) models, prediction on a larger corpus, and export. It also integrates LLM-as-a-service capabilities for prompt-based annotation and information extraction, offering a flexible approach to hybrid manual/automatic labeling. Accessible through both a web frontend and a Python client, ActiveTigger encourages customization and adaptation to specific research contexts and practices.
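
As background for the active-learning step mentioned above, here is a generic uncertainty-sampling loop written with scikit-learn; it illustrates the idea ActiveTigger builds on and is not ActiveTigger's own API (the toy texts and labels are invented):

    # Uncertainty sampling: repeatedly label the example the model is least sure about.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = ["great talk", "boring session", "loved it", "waste of time",
             "insightful demo", "terrible slides", "brilliant speaker", "dull intro"]
    labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

    X = TfidfVectorizer().fit_transform(texts)
    labeled = [0, 1]  # indices already annotated by a human
    pool = [i for i in range(len(texts)) if i not in labeled]

    for _ in range(3):
        clf = LogisticRegression().fit(X[labeled], labels[labeled])
        proba = clf.predict_proba(X[pool])
        # Query the pool example whose probability is closest to 0.5.
        query = pool[int(np.argmin(np.abs(proba[:, 1] - 0.5)))]
        labeled.append(query)  # in practice, a human annotates this one
        pool.remove(query)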

In this talk, we will delve into the motivations behind the creation of ActiveTigger, outline its technical architecture, and walk through its core functionalities. Drawing on several ongoing research projects within the Computational Social Science (CSS) group at CREST, we will illustrate concrete use cases where ActiveTigger has accelerated data annotation, enabled scalable workflows, and fostered collaborations. Beyond the technical demonstration, the talk will also open a broader reflection on the challenges and opportunities brought by generative AI in academic research—especially in terms of reliability, transparency, and methodological adaptation for qualitative and quantitative inquiries.

The repository of the project: https://github.com/emilienschultz/activetigger/

The development of this software is funded by the DRARI Ile-de-France and supported by Progédo.

Optimal Transport in Python: A Practical Introduction with POT

Optimal Transport (OT) is a powerful mathematical framework with applications in machine learning, statistics, and data science. This talk introduces the Python Optimal Transport toolbox (POT), an open-source library designed to efficiently solve OT problems. Attendees will learn the basics of OT, explore real-world use cases, and gain hands-on experience with POT (https://pythonot.github.io/).
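
As a preview of the hands-on part, a minimal example using POT's documented top-level functions (the point clouds are made up for illustration):

    # Exact optimal transport between two small point clouds with POT.
    import numpy as np
    import ot

    rng = np.random.default_rng(0)
    xs = rng.normal(0, 1, size=(50, 2))   # source samples
    xt = rng.normal(3, 1, size=(60, 2))   # target samples
    a = np.full(50, 1 / 50)               # uniform source weights
    b = np.full(60, 1 / 60)               # uniform target weights

    M = ot.dist(xs, xt)                   # pairwise squared Euclidean costs
    G = ot.emd(a, b, M)                   # exact OT plan (linear program)
    cost = np.sum(G * M)                  # total transport cost

    # Entropy-regularized variant, much faster on large problems:
    G_sinkhorn = ot.sinkhorn(a, b, M, reg=1e-1)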

Reproducible software provisioning for high performance computing (HPC) and research software engineering (RSE) using Spack

In this talk we focus on installing software (stacks) beyond just the Python ecosystem. In the first part of the talk we give an introduction to using the package manager Spack (https://spack.readthedocs.io). In the second part we explain how we use Spack at our institute to manage the software stack on the local HPC.
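
As a taste of the workflow, a hypothetical Spack environment session using the documented CLI (the package names are examples only):

    # Create and populate a reproducible Spack environment.
    $ spack env create myproject
    $ spack env activate myproject
    $ spack add python@3.12 py-numpy openmpi
    $ spack concretize   # resolve the full dependency graph
    $ spack install      # build and install the concretized stack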

How to make public data more accessible with "baked" data and DuckDB

Publicly available data is rarely analysis-ready, hampering researchers, organizations, and the public from easily accessing the information these datasets contain. One way to address this shortcoming is to "bake" the data into a structured format and ship it alongside code that can be used for analysis. For analytical work in particular, DuckDB provides a performant way to query the structured data in a variety of contexts.

This talk will explore the benefits and tradeoffs of this architectural pattern using the design of scipeds, an open-source Python package for analyzing higher-education data in the US, as a case study.

No DuckDB experience is required; beginner-level Python and programming experience is recommended. This talk is aimed at data practitioners, especially those who work with public datasets.
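
A minimal sketch of the pattern: query a database file shipped alongside the analysis code. The file, table, and column names below are invented for illustration; scipeds' actual schema may differ.

    # Query "baked" data with DuckDB, straight into a pandas DataFrame.
    import duckdb

    con = duckdb.connect("completions.duckdb", read_only=True)
    df = con.execute(
        """
        SELECT year, field, SUM(degrees) AS degrees
        FROM completions
        GROUP BY year, field
        ORDER BY year
        """
    ).df()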

Tackling Domain Shift with SKADA: A Hands-On Guide to Domain Adaptation

Domain adaptation addresses the challenge of applying ML models to data that differs from the training distribution, a common issue in real-world applications. SKADA is a new Python library that brings domain adaptation tools to the scikit-learn and PyTorch ecosystems. This talk covers SKADA's design, its integration with standard ML workflows, and how it helps practitioners build models that generalize better across domains.
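
For context, here is the classic importance-weighting baseline for covariate shift, written with plain scikit-learn; it illustrates the problem SKADA addresses and is not SKADA's own API:

    # Reweight source samples so training reflects the target distribution.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_source = rng.normal(0.0, 1.0, size=(200, 2))
    y_source = (X_source[:, 0] > 0).astype(int)
    X_target = rng.normal(1.0, 1.0, size=(200, 2))  # shifted, unlabeled domain

    # 1) Train a domain classifier to estimate p(target | x).
    X_dom = np.vstack([X_source, X_target])
    d = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]
    dom_clf = LogisticRegression().fit(X_dom, d)
    p_t = dom_clf.predict_proba(X_source)[:, 1]

    # 2) Importance weights: p(target | x) / p(source | x).
    w = p_t / (1 - p_t)

    # 3) Fit the task model on the reweighted source data.
    task_clf = LogisticRegression().fit(X_source, y_source, sample_weight=w)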

Expanding Programming Language Support in JupyterLite

JupyterLite is a web-based distribution of JupyterLab that runs entirely in the browser, leveraging WebAssembly builds of language kernels and interpreters.

In this talk, we introduce emscripten-forge, a conda-based software distribution tailored for WebAssembly and the web browser. Emscripten-forge powers several JupyterLite kernels, including:

  • xeus-Python for Python,
  • xeus-R for R,
  • xeus-Octave for GNU Octave.

These kernels cover some of the most popular languages in scientific computing.

Additionally, emscripten-forge includes builds for various terminal applications, utilized by the Cockle shell emulator to enable the JupyterLite terminal.
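
As an example of how the pieces fit together, here is a hypothetical environment file for the jupyterlite-xeus plugin pulling packages from emscripten-forge (the channel URL and fields follow the jupyterlite-xeus documentation; verify the details there):

    # environment.yml consumed by jupyterlite-xeus when building the site
    name: jupyterlite-env
    channels:
      - https://repo.mamba.pm/emscripten-forge
      - conda-forge
    dependencies:
      - xeus-python   # Python kernel compiled to WebAssembly
      - numpy

Running jupyter lite build in the same directory then bundles the kernel and its packages into the static JupyterLite site.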

Unlock the full predictive power of your multi-table data

While most machine learning tutorials and challenges focus on single-table datasets, real-world enterprise data is often distributed across multiple tables, such as customer logs, transaction records, or manufacturing logs. In this talk, we address the often-overlooked challenge of building predictive features directly from raw, multi-table data. You will learn how to automate feature engineering using a scalable, supervised, and overfit-resistant approach, grounded in information theory and available as a Python open-source library. The talk is aimed at data scientists and ML engineers working with structured data; basic machine learning knowledge is sufficient to follow.
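
To make the problem concrete, here is the manual baseline that such an approach automates: hand-written aggregates flattening a child table onto the main table (all names are invented for the example):

    # Manual multi-table feature engineering with pandas.
    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2], "churned": [0, 1]})
    transactions = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "amount": [10.0, 25.0, 5.0, 5.0, 7.5],
    })

    # One aggregate per (column, function) pair: this space explodes
    # combinatorially, and picking aggregates by hand risks overfitting,
    # which is what the supervised, information-theoretic approach in the
    # talk is designed to control.
    features = (
        transactions.groupby("customer_id")["amount"]
        .agg(n_tx="count", total="sum", mean_amount="mean")
        .reset_index()
    )
    train = customers.merge(features, on="customer_id", how="left")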

A Hitchhiker's Guide to the Array API Standard Ecosystem

The array API standard is unifying the ecosystem of Python array computing, facilitating greater interoperability between code written for different array libraries, including NumPy, CuPy, PyTorch, JAX, and Dask.

But what are all of these "array-api-" libraries for? How can you use them to 'future-proof' your own libraries and provide support for GPU and distributed arrays to your users? Find out in this talk, where I'll guide you through every corner of the array API standard ecosystem, explaining how SciPy and scikit-learn are using these tools to adopt the standard. I'll also share progress updates from the past year, to give you a clear picture of where we are now and what the future holds.
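
A small sketch of what standard-based code looks like, using the array-api-compat helper to retrieve the namespace matching whichever array comes in:

    # Array-library-agnostic code via the array API standard.
    import array_api_compat
    import numpy as np

    def standardize(x):
        # Works for NumPy, CuPy, PyTorch, etc.: fetch the input's namespace
        # and use only functions defined by the array API standard.
        xp = array_api_compat.array_namespace(x)
        return (x - xp.mean(x)) / xp.std(x)

    print(standardize(np.asarray([1.0, 2.0, 3.0])))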

From Jupyter Notebook to Publish-Ready Report: Effortless Sharing with Quarto

See how Quarto can transform your Jupyter notebooks into stakeholder-ready web pages or PDFs, published online with just one command. This session features practical demonstrations of publishing with quarto publish, applying custom styles tailored to your organization thanks to brand.yml, and leveraging new features for reproducible research.

Designed for anyone looking to share their work, this talk requires only basic Python and notebook familiarity. You’ll walk away with the skills to elevate your reporting workflow and share insights professionally.
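
A hypothetical end-to-end flow using Quarto's documented commands (the file name is an example):

    $ quarto render report.ipynb --to html    # notebook to a styled web page
    $ quarto render report.ipynb --to pdf     # or a PDF hand-off
    $ quarto publish quarto-pub report.ipynb  # put it online in one command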