
Event

PyData Paris 2024

2024-09-25 – 2024-09-27 PyData

Activities tracked

17

Filtering by: Python

Sessions & talks

Showing 1–17 of 17 · Newest first


MLOps at Renault Group: A Generic Pipeline for Scalable Deployment

2024-09-26
talk

Scaling machine learning at large organizations like Renault Group presents unique challenges in terms of scale, legal requirements, and diversity of use cases. Data scientists require streamlined workflows and automated processes to efficiently deploy models into production. We present an MLOps pipeline based on Python, Kubeflow, and the GCP Vertex AI API designed specifically for this purpose. It enables data scientists to focus on code development for pre-processing, training, evaluation, and prediction. This MLOps pipeline is a cornerstone of the AI@Scale program, which aims to roll out AI across the Group.

We chose a Python-first approach, allowing data scientists to focus purely on writing preprocessing or ML-oriented Python code, while also allowing data retrieval through SQL queries. The pipeline addresses key questions such as prediction type (batch or API), model versioning, resource allocation, drift monitoring, and alert generation. It favors faster time to market through automated deployment and infrastructure management. Although we encountered pitfalls and design difficulties, which we will discuss during the presentation, the pipeline integrates with a CI/CD process, ensuring efficient and automated model deployment and serving.

Finally, this MLOps solution empowers Renault data scientists to seamlessly move innovative models into production and smooths the development of scalable, impactful AI-driven solutions.
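
As a rough illustration of the componentized pipeline structure described above, here is a minimal Kubeflow Pipelines (kfp v2) sketch; the component names and logic are hypothetical placeholders, not Renault's actual pipeline.

```python
# Minimal sketch of a componentized ML pipeline with Kubeflow Pipelines (kfp v2).
# Component names and logic are hypothetical placeholders.
from kfp import compiler, dsl


@dsl.component
def preprocess(raw_table: str) -> str:
    # In a real pipeline this step would query SQL/BigQuery and emit features.
    return f"features derived from {raw_table}"


@dsl.component
def train(features: str) -> str:
    # Placeholder for model training; would return a model artifact URI.
    return f"model trained on: {features}"


@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(raw_table: str = "sales_history"):
    features = preprocess(raw_table=raw_table)
    train(features=features.output)


if __name__ == "__main__":
    # The compiled YAML can then be submitted to, e.g., Vertex AI Pipelines.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```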

Counting down for CRA - updates and expectations

2024-09-26
talk

The EU Commission is likely to vote on the Cyber Resilience Act (CRA) later this year. In this talk we will look at the timeline for the new legislation, any critical discussions happening around implementation, and, most importantly, the new responsibilities outlined by the CRA. We’ll also discuss what the PSF is doing for CPython and for PyPI, and what each of us in the Python ecosystem might want to do to get ready for a new era of increased certainty – and liability – around security.

Boosting AI Reliability: Uncertainty Quantification with MAPIE

2024-09-26
talk

MAPIE (Model Agnostic Prediction Interval Estimator) is your go-to solution for managing uncertainties and risks in machine learning models. This Python library, nestled within scikit-learn-contrib, offers a way to calculate prediction intervals with controlled coverage rates for regression, classification, and even time series analysis. But it doesn't stop there - MAPIE can also be used to handle more complex tasks like multi-label classification and semantic segmentation in computer vision, ensuring probabilistic guarantees on crucial metrics like recall and precision. MAPIE can be integrated with any model - whether it's scikit-learn, TensorFlow, or PyTorch. Join us as we delve into the world of conformal predictions and how to quickly manage your uncertainties using MAPIE.

Link to GitHub: https://github.com/scikit-learn-contrib/MAPIE
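
To make the conformal-prediction workflow concrete, here is a minimal regression sketch using MAPIE's MapieRegressor on toy data; the estimator, method, and coverage level are illustrative choices rather than the talk's own example.

```python
# Minimal sketch: prediction intervals with MAPIE on toy regression data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from mapie.regression import MapieRegressor

X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)

# Wrap any scikit-learn-compatible estimator to get conformal prediction intervals.
mapie = MapieRegressor(estimator=LinearRegression(), method="plus", cv=5)
mapie.fit(X, y)

# alpha=0.1 targets 90% coverage: y_pis holds lower/upper bounds per sample.
y_pred, y_pis = mapie.predict(X, alpha=0.1)
print(y_pred[:3], y_pis[:3, :, 0])
```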

Chainsail: facilitating sampling of multimodal probability distributions

2024-09-26
talk

Markov chain Monte Carlo (MCMC) methods, a class of iterative algorithms that allow sampling of almost arbitrary probability distributions, have become increasingly popular and accessible to statisticians and scientists. But they run into difficulties when applied to multimodal probability distributions. These occur, for example, in Bayesian data analysis, when multiple regions in the parameter space explain the data equally well or when some parameters are redundant. Inaccurate sampling then results in incomplete and misleading parameter estimates. In this talk, intended for data scientists and statisticians with basic knowledge of MCMC and probabilistic programming, I present Chainsail, an open-source web service written entirely in Python. It implements Replica Exchange, an advanced MCMC method designed specifically to improve sampling of multimodal distributions. Chainsail makes this algorithm easily accessible to users of probabilistic programming libraries by automatically tuning important parameters and by exploiting easy on-demand provisioning of the (increased) computing resources required to run Replica Exchange.
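
For readers unfamiliar with the underlying idea, here is a compact, generic Replica Exchange (parallel tempering) sketch for a bimodal one-dimensional target; it illustrates the algorithm only and is not Chainsail's implementation.

```python
# Generic Replica Exchange (parallel tempering) sketch for a bimodal 1D target.
# Illustration of the algorithm only; not Chainsail's implementation.
import numpy as np

rng = np.random.default_rng(0)


def log_prob(x):
    # Mixture of two well-separated Gaussians -> multimodal target.
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)


betas = [1.0, 0.5, 0.2, 0.05]          # inverse temperatures, 1.0 = target
states = np.zeros(len(betas))          # one chain ("replica") per temperature
samples = []

for step in range(20_000):
    # Local random-walk Metropolis move for each replica.
    for i, beta in enumerate(betas):
        proposal = states[i] + rng.normal(scale=1.0)
        if np.log(rng.random()) < beta * (log_prob(proposal) - log_prob(states[i])):
            states[i] = proposal

    # Attempt to swap a random pair of neighbouring replicas.
    i = rng.integers(len(betas) - 1)
    dlog = (betas[i] - betas[i + 1]) * (log_prob(states[i + 1]) - log_prob(states[i]))
    if np.log(rng.random()) < dlog:
        states[i], states[i + 1] = states[i + 1], states[i]

    samples.append(states[0])          # keep samples from the beta = 1 chain

samples = np.array(samples)
print("fraction of samples near each mode:", np.mean(samples > 0), np.mean(samples < 0))
```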

Dreadful Frailties in Propensity Score Matching and How to Fix Them.

2024-09-26
talk

In their seminal paper "Why propensity scores should not be used for matching," King and Nielsen (2019) highlighted the shortcomings of Propensity Score Matching (PSM). Despite these concerns, PSM remains prevalent in mitigating selection bias across numerous retrospective medical studies each year and continues to be endorsed by health authorities. Guidelines for mitigating these issues have been proposed, but many researchers encounter difficulties both in adhering to these guidelines and in thoroughly documenting the entire process.

In this presentation, I show the inherent variability in outcomes resulting from the commonly accepted validation condition of Standardized Mean Difference (SMD) below 10%. This variability can significantly impact treatment comparisons, potentially leading to misleading conclusions. To address this issue, I introduce A2A, a novel metric computed on a task specifically designed for the problem at hand. By integrating A2A with SMD, our approach substantially reduces the variability of predicted Average Treatment Effects (ATE) by up to 90% across validated matching techniques.

These findings collectively enhance the reliability of PSM outcomes and lay the groundwork for a comprehensive automated bias correction procedure. Additionally, to facilitate seamless adoption across programming languages, I have integrated these methods into "popmatch," a Python package that also offers a convenient Python interface to R's MatchIt methods.
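
As background for the SMD-below-10% criterion mentioned above, here is a minimal sketch of how a standardized mean difference is typically computed for a single covariate after matching; it uses the generic formula, not the popmatch or A2A implementation.

```python
# Minimal sketch: standardized mean difference (SMD) for one covariate,
# the balance diagnostic behind the "SMD below 10%" criterion.
# Generic formula; not the popmatch/A2A implementation.
import numpy as np


def standardized_mean_difference(treated: np.ndarray, control: np.ndarray) -> float:
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd


rng = np.random.default_rng(42)
age_treated = rng.normal(62, 10, size=300)   # hypothetical covariate in treated group
age_control = rng.normal(60, 10, size=300)   # hypothetical covariate in matched controls

smd = standardized_mean_difference(age_treated, age_control)
verdict = "balanced" if abs(smd) < 0.1 else "imbalanced"
print(f"SMD = {smd:.3f} -> {verdict}")
```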

Catering Causal Inference: An Introduction to 'metalearners', a Flexible MetaLearner Library in Python

2024-09-26
talk

Discover metalearners, a cutting-edge Python library designed for Causal Inference with particularly flexible and user-friendly MetaLearner implementations. metalearners leverages the power of conventional Machine Learning estimators and molds them into causal treatment effect estimators. This talk is targeted at data professionals with some Python and Machine Learning competence, guiding them towards optimizing interventions such as 'Which potential customers should receive a voucher to optimally allocate a voucher budget?' or 'Which patients should receive which medical treatment?' based on causal interpretations.
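
To illustrate the MetaLearner idea in general terms, here is a generic T-learner sketch built directly on scikit-learn; it is not the metalearners library's API, and the data and models are hypothetical.

```python
# Generic T-learner sketch built directly on scikit-learn, to illustrate how
# conventional ML estimators become treatment-effect estimators.
# This is NOT the metalearners library's API; data and models are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 3))                      # customer features
w = rng.integers(0, 2, size=n)                   # 1 = received a voucher
true_effect = 2.0 + X[:, 0]                      # heterogeneous treatment effect
y = X @ np.array([1.0, -0.5, 0.2]) + w * true_effect + rng.normal(size=n)

# T-learner: fit one outcome model per treatment arm ...
model_treated = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])
model_control = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])

# ... and estimate the conditional average treatment effect (CATE) as the difference.
cate = model_treated.predict(X) - model_control.predict(X)
print("mean estimated effect:", cate.mean(), "vs true mean:", true_effect.mean())
```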

DataLab: Bridging Scientific and Industrial Worlds for Advanced Signal and Image Processing

2024-09-26
talk

The video is available here: https://www.youtube.com/watch?v=yn1bR-BVfn8&list=PLGVZCDnMOq0pKya8gksd00ennKuyoH7v7&index=37

This talk introduces DataLab, a unique open-source platform for signal and image processing, seamlessly integrating scientific and industrial applications.

The main objective of this talk is to show how DataLab may be used as a complementary tool alongside Jupyter notebooks or an IDE (e.g., Spyder), and how it can be extended with custom Python scripts or applications.

sktime - python toolbox for time series: next-generation AI – deep learning and foundation models

2024-09-26
talk

sktime is a widely used, scikit-learn-compatible library for learning with time series. sktime is easily extensible by anyone and interoperable with the PyData/NumFOCUS stack.

This talk presents progress, challenges, and the newest features hot off the press in extending the sktime framework to deep learning and foundation models.

Recent progress in generative AI and deep learning is leading to an ever-exploding number of popular “next generation AI” models for time series tasks like forecasting, classification, segmentation.

Particular challenges of the new AI ecosystem are inconsistent formal interfaces, different deep learning backends, and vendor-specific APIs and architectures that do not match sklearn-like patterns well – every practitioner who has tried to use at least two such models at the same time (outside sktime) will have their own painful memories.

We show how sktime brings its unified interface architecture for time series modelling to the brave new AI frontier, using novel design patterns building on ideas from Hugging Face and scikit-learn, to provide modular, extensible building blocks with a simple specification language.
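
For readers new to sktime, here is a minimal example of its unified forecasting interface; the naive forecaster and built-in dataset are stand-ins, since the deep learning and foundation-model estimators discussed in the talk plug into the same fit/predict pattern.

```python
# Minimal sketch of sktime's unified forecasting interface.
# The naive forecaster stands in for any estimator, including deep learning
# or foundation-model backends that follow the same fit/predict pattern.
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()                                    # monthly airline passengers series
forecaster = NaiveForecaster(strategy="last", sp=12)  # seasonal naive baseline
forecaster.fit(y)
y_pred = forecaster.predict(fh=[1, 2, 3])             # forecast 1-3 steps ahead
print(y_pred)
```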

Is your marketing effective? Let Bayes decide!

2024-09-26
talk

Understanding the effectiveness of various marketing channels is crucial to maximising the return on investment (ROI). However, the limitations of third-party cookies and an ever-growing focus on privacy make it difficult to rely on basic analytics. This talk discusses a pioneering project where a Bayesian model was employed to assess the marketing media mix effectiveness of WeRoad, the fastest-growing Italian tour operator.

The Bayesian approach allows for the incorporation of prior knowledge, seamlessly updating it with new data to provide robust, actionable insights. This project leveraged a Bayesian model to unravel the complex interactions between marketing channels such as online ads, social media, and promotions. We'll dive deep into how the Bayesian model was designed, discussing how we provided the AI system with expert knowledge, and presenting how delays and saturation were modelled.

We will also tackle aspects of the technical implementation, discussing how Python, PyMC, and Streamlit provided us with all the tools we needed to develop an effective, efficient, and user-friendly system.
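
As a taste of what such a model can look like, here is a heavily simplified PyMC sketch of a media-mix regression with a saturation term; the channel count, priors, and saturation form are illustrative assumptions, adstock/delay effects are omitted for brevity, and this is not WeRoad's actual model.

```python
# Heavily simplified media-mix model sketch in PyMC.
# Channel count, priors, and saturation form are illustrative assumptions;
# adstock/delay effects are omitted. Not the actual WeRoad model.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
n_weeks, n_channels = 104, 2
spend = rng.random((n_weeks, n_channels))   # weekly spend per channel (toy data)
revenue = rng.random(n_weeks)               # observed weekly revenue (toy data)

with pm.Model() as mmm:
    saturation = pm.HalfNormal("saturation", 1.0, shape=n_channels)
    effect = pm.HalfNormal("effect", 1.0, shape=n_channels)  # prior: spend helps, not hurts
    intercept = pm.Normal("intercept", 0.0, 1.0)
    sigma = pm.HalfNormal("sigma", 1.0)

    # Diminishing returns: each extra unit of spend contributes less.
    saturated_spend = 1 - pm.math.exp(-saturation * spend)
    mu = intercept + pm.math.sum(effect * saturated_spend, axis=-1)

    pm.Normal("revenue", mu=mu, sigma=sigma, observed=revenue)
    idata = pm.sample(1000, tune=1000, chains=2)
```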

Attendees will walk away with:

  • A simple understanding of the Bayesian approach and why it matters.
  • Concrete examples of the transformative impact on WeRoad's marketing strategy.
  • A blueprint to harness predictive models in their business strategies.

Visualization of the sky in Notebooks: the ipyaladin widget extension

2024-09-26
talk

Aladin allows users to visualize images of the sky or planetary surfaces, much like an astronomical "OpenStreetMap" app. The view can be panned and explored interactively. In the ipyaladin widget – which brings Aladin into the Jupyter Notebook environment – these abilities are extended with a Python API. Users can send astronomical data in standard formats back and forth between the viewer and their Python code. Such data can be images of the sky at different wavelengths, but also tabular data, complex shapes that characterize telescope observation regions, or even special sky features (such as the probability region for the provenance of a gravitational event).

With these existing features, and the current work we are doing with the new development framework anywidget, ipyaladin is very close to a version 1.0.0 release. It is already used in its beta version in several experimental science platforms, for example in the ESCAPE European Science Cluster of Astronomy & Particle Physics project and in the experimental SKA (Square Kilometre Array, a telescope for radio astronomy) analysis platform.

In this presentation, we will share our feedback on developing a widget with anywidget compared to the bare ipywidgets framework, and we will demonstrate the widget's functionality through scientific use cases.
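
A minimal notebook sketch of the kind of interaction described above, assuming ipyaladin and astropy are installed; the constructor arguments and method names reflect the beta API as documented and should be treated as assumptions.

```python
# Minimal notebook sketch of driving the ipyaladin widget from Python.
# Constructor arguments and method names are assumptions based on the beta API.
from astropy.table import Table
from ipyaladin import Aladin

# Open an interactive sky view centred on the Andromeda galaxy.
aladin = Aladin(target="M31", fov=2.0)
aladin  # displaying the widget renders the pannable sky view in the notebook

# Send tabular data (e.g. a small source catalogue) from Python to the viewer.
sources = Table(
    rows=[(10.6847, 41.2687, "M31 core"), (10.08, 40.61, "M32")],
    names=("ra", "dec", "name"),
)
aladin.add_table(sources)
```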

Color-composite images from the James Webb Space Telescope

2024-09-26
talk

The astronomical community has built a good amount of software to visualize and analyze the images obtained with the James Webb Space Telescope (JWST). In this talk, I will present the open-source Python package Jdaviz. I will show you how to visualize publicly available JWST images and build the pretty color images that we have all seen in the media. Half the talk will be an introduction to JWST and Jdaviz, and half will be a hands-on session on a cloud platform (you will only need to create an account) or on your own machine (the package is available on PyPI).
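
Here is a minimal sketch of loading an image into Jdaviz's Imviz viewer from a notebook; the FITS filename is a placeholder, and the calls follow the documented Imviz interface as I understand it.

```python
# Minimal sketch: viewing a JWST image with Jdaviz's Imviz in a Jupyter notebook.
# The FITS filename is a placeholder; calls follow the documented Imviz interface.
from jdaviz import Imviz

imviz = Imviz()
imviz.load_data("jwst_image.fits", data_label="NIRCam F200W")  # hypothetical file
imviz.show()  # renders the interactive viewer in the notebook
```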

Fast NetworkX and How Accelerated Backends Are Changing Graph Analytics

2024-09-26
talk

NetworkX is arguably the most popular graph analytics library available today, but one of its greatest strengths – the pure-Python implementation – is also possibly its biggest weakness. Whether you're a seasoned data scientist or a new student of the fascinating field of graph analytics, you're probably familiar with NetworkX and interested in how to make this extremely easy-to-use library powerful enough to handle realistically large graph workflows that often exceed the limitations of its pure-Python implementation.

This talk will describe a relatively new capability of NetworkX – support for accelerated backends – and how it can benefit NetworkX users by allowing the library to finally be both easy to use and fast. Through the use of backends, NetworkX can also be incorporated into workflows that take advantage of similar accelerators, such as accelerated pandas (cudf.pandas), to finally make these easy-to-use solutions scale to larger problems.

Attend this talk to learn about how you can leverage the various backends available to NetworkX today to seamlessly run graph analytics on GPUs, use GraphBLAS implementations, and more, all without leaving the comfort and convenience of the most popular graph analytics library available.
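
A minimal sketch of what backend dispatch looks like in recent NetworkX releases, assuming a backend package such as nx-cugraph is installed; the backend name passed below is an assumption about your environment.

```python
# Minimal sketch of NetworkX backend dispatch (NetworkX >= 3.2).
# Requires a backend package (e.g. nx-cugraph) to be installed; the backend
# name used below is an assumption about your environment.
import networkx as nx

G = nx.karate_club_graph()

# Pure-Python reference implementation.
bc_python = nx.betweenness_centrality(G)

# Same call, dispatched to an accelerated backend (here: GPU via nx-cugraph).
bc_gpu = nx.betweenness_centrality(G, backend="cugraph")
```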

Unpack business metrics to explain their evolution

2024-09-26
talk

One of the more mundane tasks in the business analytics world is to measure KPIs: averages, sums, ratios, etc. Typically, these are measured period over period, to see how they trend. If you're a data analyst, you've likely been asked to debug/explain a metric, because a stakeholder wants to understand why a number has changed.

This topic isn't well grounded in theory, and the answers we come up with can be lacklustre. In this talk, we discuss solutions to this very common problem. We will look at a methodology we have developed at Carbonfact, and the open-source Python tool we are sharing.

On the structure and reproducibility of Python packages - data crunch

2024-09-25
talk

Did you know that all top PyPI packages declare their third-party dependencies? In contrast, only about 53% of scientific projects do the same. The question arises: How can we reproduce Python-based scientific experiments if we're unaware of the necessary libraries for our environment? In this talk, we delve into the Python packaging ecosystem and employ a data-driven approach to analyze the structure and reproducibility of packages. We compare two distinct groups of Python packages: the most popular ones on PyPI, which we anticipate to adhere more closely to best practices, and a selection from biomedical experiments. Through our analysis, we uncover common development patterns in Python projects and utilize our open-source library, FawltyDeps, to identify undeclared dependencies and assess the reproducibility of these projects. This discussion is especially valuable for enthusiasts of clean Python code, as well as for data scientists and engineers eager to adopt best practices and enhance reproducibility. Attendees will depart with actionable insights on enhancing the transparency and reliability of their Python projects, thereby advancing the cause of reproducible scientific research.

Polars Plugins: how you (yes, you!) can extend Polars

2024-09-25
talk

Polars is a dataframe library taking the world by storm. It is very runtime- and memory-efficient and comes with a clean, expressive API. Sometimes, however, the built-in API isn't enough – and that's where its killer feature comes in: plugins. You can extend Polars and solve practically any problem.

No prior Rust experience is required; intermediate Python and general programming experience is. By the end of the talk, you will know how to write your own Polars plugin! This talk is aimed at data practitioners.
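
The full plugin mechanism covered in the talk is written in Rust, but the sketch below shows Polars' pure-Python extension point – a custom expression namespace – which gives a feel for how extensions surface in user code; the namespace and method names are hypothetical.

```python
# Sketch of Polars' pure-Python extension point: a custom expression namespace.
# The Rust plugin mechanism the talk covers surfaces in user code in a similarly
# expression-centric way. Namespace and method names here are hypothetical.
import polars as pl


@pl.api.register_expr_namespace("units")
class UnitConversions:
    def __init__(self, expr: pl.Expr) -> None:
        self._expr = expr

    def celsius_to_fahrenheit(self) -> pl.Expr:
        return self._expr * 9 / 5 + 32


df = pl.DataFrame({"temp_c": [0.0, 21.5, 100.0]})
print(df.with_columns(temp_f=pl.col("temp_c").units.celsius_to_fahrenheit()))
```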

Solara: Pure Python web apps beyond prototypes and dashboards

2024-09-25
talk

Many Python frameworks are suitable for creating basic dashboards or prototypes but struggle with more complex ones. Taking lessons from the JavaScript community, the experts on building UIs, we created a new framework called Solara. Solara scales to much more complex apps and compute-intensive dashboards. Built on the Jupyter stack, Solara apps and their reusable components run in the Jupyter notebook and on Solara's own production-quality server based on Starlette/FastAPI.

Solara has a declarative API that is designed for dynamic and complex UIs yet is easy to write. Reactive variables power our state management, automatically triggering re-renders. Our component-centric architecture encourages code reusability, and hot reloading promotes efficient workflows. With our rich set of UI and data-focused components, Solara spans the entire spectrum from rapid prototyping to robust, complex dashboards.
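
A minimal example of the declarative, reactive style described above, assuming Solara is installed; the component itself is a trivial illustration.

```python
# Minimal Solara sketch: a reactive variable plus a declarative component.
# Run with `solara run app.py` or display in a Jupyter notebook.
import solara

clicks = solara.reactive(0)  # reactive state: changing it re-renders components


@solara.component
def Page():
    solara.Button(
        label=f"Clicked {clicks.value} times",
        on_click=lambda: clicks.set(clicks.value + 1),
    )
```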

Python 3.12's new monitoring and debugging API

2024-09-25
talk

Python 3.12 introduced a new low-impact monitoring API with PEP 669, which can be used to implement far faster debuggers than ever before. This talk covers the main advantages of this API and how you can use it to develop small tools.
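
To give a flavour of the API, here is a small sys.monitoring sketch (Python 3.12+) that reports each function entry once; the tool name and traced function are arbitrary examples.

```python
# Small sys.monitoring (PEP 669) sketch: report each function entry once.
# Tool name and traced function are arbitrary examples.
import sys

mon = sys.monitoring
TOOL_ID = mon.PROFILER_ID           # one of the predefined tool id slots

mon.use_tool_id(TOOL_ID, "tiny-tracer")


def on_py_start(code, instruction_offset):
    print(f"entering {code.co_qualname}")
    return mon.DISABLE              # stop reporting this code location after the first hit


mon.register_callback(TOOL_ID, mon.events.PY_START, on_py_start)
mon.set_events(TOOL_ID, mon.events.PY_START)


def work():
    return sum(range(10))


work()
work()                              # no second report: the location was disabled

mon.set_events(TOOL_ID, mon.events.NO_EVENTS)
mon.free_tool_id(TOOL_ID)
```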