talk-data.com
Event
PyData Paris 2024
Activities tracked
77
Sessions & talks
Farewell
Lightning Talks
Machine Learning practitioners build predictive models from "noisy" data, resulting in uncertain predictions. But what does "noise" mean in a machine learning context?
Coffee Break
Coffee Break
Coffee Break
An update on the latest scikit-learn features
In this talk, we provide an update on the latest scikit-learn features implemented in versions 1.4 and 1.5. In particular, we will discuss the following features (a brief usage sketch follows the list):
- the metadata routing API, which allows passing metadata around estimators;
- the TunedThresholdClassifierCV, which allows tuning operational decisions through a custom metric;
- better support for categorical features and missing values;
- interoperability of arrays and dataframes.
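As a brief illustration of the threshold-tuning feature, here is a minimal sketch using the public scikit-learn 1.5 API; the dataset and metric are arbitrary choices, not taken from the talk:

```python
# Tune the decision threshold of a classifier against a custom metric
# instead of using the default 0.5 cut-off (scikit-learn >= 1.5).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tuned = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1_000),
    scoring=make_scorer(f1_score),  # custom metric driving the threshold search
)
tuned.fit(X_train, y_train)
print(f"best threshold: {tuned.best_threshold_:.2f}")
print(f"test F1: {f1_score(y_test, tuned.predict(X_test)):.2f}")
```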
MLOps at Renault Group: A Generic Pipeline for Scalable Deployment
Scaling machine learning at large organizations like Renault Group presents unique challenges in terms of scale, legal requirements, and diversity of use cases. Data scientists require streamlined workflows and automated processes to efficiently deploy models into production. We present an MLOps pipeline based on Python, Kubeflow, and the GCP Vertex AI API designed specifically for this purpose. It enables data scientists to focus on code development for pre-processing, training, evaluation, and prediction. This MLOps pipeline is a cornerstone of the AI@Scale program, which aims to roll out AI across the Group.
We chose a Python-first approach, allowing data scientists to focus purely on writing preprocessing or ML-oriented Python code, while also allowing data retrieval through SQL queries. The pipeline addresses key questions such as prediction type (batch or API), model versioning, resource allocation, drift monitoring, and alert generation. It favors a faster time to market with automated deployment and infrastructure management. Although we encountered pitfalls and design difficulties, which we will discuss during the presentation, this pipeline integrates with a CI/CD process, ensuring efficient and automated model deployment and serving.
Finally, this MLOps solution empowers Renault data scientists to seamlessly translate innovative models into production and smooths the development of scalable and impactful AI-driven solutions.
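For readers unfamiliar with the building blocks mentioned above, here is an illustrative Kubeflow Pipelines v2 sketch with separate preprocessing and training steps. It is a generic example of the kind of pipeline that can be compiled for Vertex AI Pipelines, not Renault's actual pipeline, and all component and parameter names are hypothetical:

```python
# Generic Kubeflow Pipelines (KFP v2) sketch: two components chained
# into a pipeline that could be compiled and run on Vertex AI Pipelines.
from kfp import dsl
from kfp.dsl import Dataset, Input, Model, Output

@dsl.component(base_image="python:3.11")
def preprocess(raw_table: str, cleaned: Output[Dataset]):
    # Hypothetical preprocessing step; in practice this would pull data
    # via an SQL query and write a cleaned dataset artifact.
    with open(cleaned.path, "w") as f:
        f.write(f"cleaned data from {raw_table}")

@dsl.component(base_image="python:3.11")
def train(cleaned: Input[Dataset], model: Output[Model]):
    # Hypothetical training step producing a model artifact.
    with open(model.path, "w") as f:
        f.write("trained model")

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(raw_table: str = "project.dataset.table"):
    prep = preprocess(raw_table=raw_table)
    train(cleaned=prep.outputs["cleaned"])
```

The compiled pipeline definition is what a CI/CD process would then submit to the managed pipeline service.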
Rising concerns over IT's carbon footprint necessitate tools that gauge and mitigate these impacts. This session introduces CodeCarbon, an open-source tool that estimates computing's carbon emissions by measuring energy use across hardware components. Aimed at AI researchers and data scientists, CodeCarbon provides actionable insights into the environmental costs of computational projects, supporting efforts towards sustainability without requiring deep technical expertise.
This talk, from the main contributors of CodeCarbon, will cover the environmental impact of IT, the ways to estimate it, and a demo of CodeCarbon.
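As a taste of the demo, here is a minimal sketch assuming CodeCarbon's standard EmissionsTracker API; the project name and workload are placeholders:

```python
# Wrap a workload in an EmissionsTracker to estimate its carbon footprint.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="demo-training-job")
tracker.start()
try:
    # ... run the training or data-processing workload here ...
    sum(i * i for i in range(10_000_000))
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```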
Adaptive Prediction Intervals
Adaptive prediction intervals, which represent prediction uncertainty, are crucial for practitioners involved in decision-making. Having an adaptivity feature is challenging yet essential, as an uncertainty measure must reflect the model's confidence for each observation. Attendees will learn about state-of-the-art algorithms for constructing adaptive prediction intervals, which is an active area of research.
Counting down for CRA - updates and expectations
The EU Commission is likely to vote on the Cyber Resilience Act (CRA) later this year. In this talk, we will look at the timeline for the new legislation, any critical discussions happening around implementation, and, most importantly, the new responsibilities outlined by the CRA. We’ll also discuss what the PSF is doing for CPython and for PyPI, and what each of us in the Python ecosystem might want to do to get ready for a new era of increased certainty – and liability – around security.
Processing medical images at scale on the cloud
The MedTech industry is undergoing a revolutionary transformation, with continuous innovations promising greater precision, efficiency, and accessibility. In particular, oncology, the branch of medicine that focuses on cancer, will benefit immensely from these new technologies, which may enable clinicians to detect cancer earlier and increase chances of survival. Detecting cancerous cells in microscopic images of cells (Whole Slide Images, aka WSIs) is usually done with segmentation algorithms, which neural networks (NNs) are very good at. While using ML and NNs for image segmentation is a fairly standard task with established solutions, doing it on WSIs is a different kettle of fish. Most training pipelines and systems have been designed for analytics, meaning huge columns of small individual data points. In the case of WSIs, a single image is so huge that its file can be up to dozens of gigabytes. To allow innovation in medical imaging with AI, we need efficient and affordable ways to store and process these WSIs at scale.
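To make the scale problem concrete, here is an illustrative sketch, assuming the OpenSlide library, of reading a WSI tile by tile rather than loading the whole multi-gigabyte file at once; the file name and tile size are placeholders, and this is not the speakers' pipeline:

```python
# Stream a whole-slide image tile by tile so it never has to fit in memory.
import openslide

slide = openslide.OpenSlide("sample_slide.svs")  # hypothetical WSI file
level = 0                                        # full-resolution pyramid level
tile_size = 512
width, height = slide.level_dimensions[level]

for y in range(0, height, tile_size):
    for x in range(0, width, tile_size):
        tile = slide.read_region((x, y), level, (tile_size, tile_size))
        tile = tile.convert("RGB")  # PIL image, ready for a segmentation model
        # ... feed the tile to the neural network here ...

slide.close()
```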
Boosting AI Reliability: Uncertainty Quantification with MAPIE
MAPIE (Model Agnostic Prediction Interval Estimator) is your go-to solution for managing uncertainties and risks in machine learning models. This Python library, nestled within scikit-learn-contrib, offers a way to calculate prediction intervals with controlled coverage rates for regression, classification, and even time series analysis. But it doesn't stop there - MAPIE can also be used to handle more complex tasks like multi-label classification and semantic segmentation in computer vision, ensuring probabilistic guarantees on crucial metrics like recall and precision. MAPIE can be integrated with any model - whether it's scikit-learn, TensorFlow, or PyTorch. Join us as we delve into the world of conformal predictions and how to quickly manage your uncertainties using MAPIE.
Link to GitHub: https://github.com/scikit-learn-contrib/MAPIE
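A minimal sketch of the prediction-interval workflow described above, assuming MAPIE's classic regression API (as in the 0.8.x releases); the model and data are arbitrary:

```python
# Wrap any scikit-learn regressor with MAPIE to obtain prediction
# intervals with a target coverage of 90%.
from mapie.regression import MapieRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mapie = MapieRegressor(RandomForestRegressor(random_state=0), method="plus", cv=5)
mapie.fit(X_train, y_train)

# y_pis has shape (n_samples, 2, n_alpha): lower and upper bounds per alpha.
y_pred, y_pis = mapie.predict(X_test, alpha=0.1)
```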
Open Source Sustainability & Philanthropy: Building Contributor Communities
Open Source Software, the backbone of today’s digital infrastructure, must be sustainable for the long-term. Qureshi and Fang (2011) find that motivating, engaging, and retaining new contributors is what makes open source projects sustainable.
Yet, as Steinmacher et al. (2015) identify, first-time open source contributors often lack timely answers to questions, newcomer orientation, mentors, and clear documentation. Moreover, since the term was first coined in 1998, open source has lagged far behind other technical domains in participant diversity. Trinkenreich et al. (2022) report that only about 5% of projects were reported to have women as core developers, and women authored less than 5% of pull requests, yet had similar or even higher pull request acceptance rates than men. So, how can we achieve more diversity in open source communities and projects?
Bloomberg’s Women in Technology (BWIT) community, Open Source Program Office (OSPO), and Corporate Philanthropy team collaborated with NumFOCUS to develop a volunteer incentive model that aligns business value, philanthropic impact, and individual technical growth. Through it, participating Bloomberg engineers were given the opportunity to convert their hours spent contributing to the pandas open source project into a charitable donation to a non-profit of their choice.
The presenters will discuss how we wove together differing viewpoints: non-profit foundation and for-profit corporation, corporate philanthropy and engineers, first-time contributors and core devs. They will showcase why and how we converted technical contributions into charitable dollars, the difference this community-building model had in terms of creating a diverse and sustained group of new open source contributors, and the viability of extending this to other open source projects and corporate partners to contribute to the long-term sustainability of open source—thereby demonstrating the true convergence of tech and social impact.
NOTE: [1] Qureshi, I., and Fang, Y. "Socialization in open source software projects: A growth mixture modeling approach." 2011. [2] Steinmacher, I., et al. "Social barriers faced by newcomers placing their first contribution in open source software projects." 2015. [3] Trinkenreich, B., et al. "Women’s participation in open source software: A survey of the literature." 2022.
Towards a deeper understanding of retrieval and vector databases
Retrieval is the process of searching a large database for items (images, text, …) that are similar to one or more query items. A classical approach is to transform the database items and the query item into vectors (also called embeddings) with a trained model so that they can be compared via a distance metric. It has many applications in various fields, e.g. building a visual recommendation system like Google Lens or a RAG (Retrieval Augmented Generation) system, a technique used to inject specific knowledge into LLMs depending on the query. Vector databases ease the management, serving, and retrieval of the vectors in production and implement efficient indexes to rapidly search through millions of vectors. They have gained a lot of attention over the past year due to the rise of LLMs and RAG.
Although people working with LLMs are increasingly familiar with the basic principles of vector databases, the finer details and nuances often remain obscure. This lack of clarity hinders the ability to make optimal use of these systems.
In this talk, we will detail two examples of real-life projects (deduplication of real estate adverts using the image embedding model DinoV2, and RAG for a medical company using the text embedding model Ada-2) and deep dive into retrieval and vector databases to demystify the key aspects and highlight the limitations: the HNSW index, a comparison of providers, metadata filtering (the associated drop in performance when filtering out too many nodes and how indexing partially mitigates it), partitioning, reciprocal rank fusion, the performance and limitations of the representations created by SOTA image and text embedding models, …
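To make the embedding-plus-distance idea concrete, here is a toy brute-force retrieval sketch in NumPy; no particular vector database or embedding model is assumed, and a real vector database would replace the exhaustive dot product with an approximate index such as HNSW:

```python
# Rank database items by cosine similarity to a query embedding.
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(10_000, 384))   # stand-in for precomputed embeddings
query = rng.normal(size=384)                # stand-in for the query embedding

# Normalize so that a dot product equals cosine similarity.
database /= np.linalg.norm(database, axis=1, keepdims=True)
query /= np.linalg.norm(query)

scores = database @ query
top_k = np.argsort(-scores)[:5]             # indices of the 5 nearest items
print(top_k, scores[top_k])
```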
Lunch Break
Lunch Break
Lunch Break
Chainsail: facilitating sampling of multimodal probability distributions
Markov chain Monte Carlo (MCMC) methods, a class of iterative algorithms that allow sampling almost arbitrary probability distributions, have become increasingly popular and accessible to statisticians and scientists. But they run into difficulties when applied to multimodal probability distributions. These occur, for example, in Bayesian data analysis, when multiple regions in the parameter space explain the data equally well or when some parameters are redundant. Inaccurate sampling then results in incomplete and misleading parameter estimates. In this talk, intended for data scientists and statisticians with basic knowledge of MCMC and probabilistic programming, I present Chainsail, an open-source web service written entirely in Python. It implements Replica Exchange, an advanced MCMC method designed specifically to improve sampling of multimodal distributions. Chainsail makes this algorithm easily accessible to users of probabilistic programming libraries by automatically tuning important parameters and exploiting easy on-demand provisioning of the (increased) computing resources necessary for running Replica Exchange.
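To illustrate the kind of target Chainsail is built for, here is a generic bimodal log-density that defeats single-chain MCMC samplers; this is only an illustration, not Chainsail's actual interface:

```python
# A mixture of two well-separated Gaussians: a simple multimodal target.
import numpy as np

def log_prob(x, mu=4.0, sigma=0.5):
    # log of 0.5 * N(x; mu, sigma) + 0.5 * N(x; -mu, sigma)
    log_p1 = -0.5 * ((x - mu) / sigma) ** 2
    log_p2 = -0.5 * ((x + mu) / sigma) ** 2
    return np.logaddexp(log_p1, log_p2) - np.log(2 * sigma * np.sqrt(2 * np.pi))

# Replica Exchange runs several chains on "tempered" versions of this
# distribution (e.g. log_prob(x) / T for a ladder of temperatures T >= 1)
# and periodically swaps states between neighboring temperatures, so that
# chains can cross the low-probability region between the two modes.
```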
Dreadful Frailties in Propensity Score Matching and How to Fix Them.
In their seminal paper "Why propensity scores should not be used for matching," King and Nielsen (2019) highlighted the shortcomings of Propensity Score Matching (PSM). Despite these concerns, PSM remains prevalent in mitigating selection bias across numerous retrospective medical studies each year and continues to be endorsed by health authorities. Guidelines for mitigating these issues have been proposed, but many researchers encounter difficulties both in adhering to these guidelines and in thoroughly documenting the entire process.
In this presentation, I show the inherent variability in outcomes resulting from the commonly accepted validation condition of Standardized Mean Difference (SMD) below 10%. This variability can significantly impact treatment comparisons, potentially leading to misleading conclusions. To address this issue, I introduce A2A, a novel metric computed on a task specifically designed for the problem at hand. By integrating A2A with SMD, our approach substantially reduces the variability of predicted Average Treatment Effects (ATE) by up to 90% across validated matching techniques.
These findings collectively enhance the reliability of PSM outcomes and lay the groundwork for a comprehensive automated bias correction procedure. Additionally, to facilitate seamless adoption across programming languages, I have integrated these methods into "popmatch," a Python package that not only incorporates these techniques but also offers a convenient Python interface for R's MatchIt methods.
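For context, here is a sketch of the standardized mean difference balance check mentioned above; it is hand-rolled NumPy on simulated covariate values, not code from the talk or from popmatch:

```python
# Standardized mean difference (SMD) with the common "below 10%" rule of thumb.
import numpy as np

def standardized_mean_difference(treated, control):
    """SMD of one covariate between treated and control groups."""
    treated, control = np.asarray(treated, float), np.asarray(control, float)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

rng = np.random.default_rng(0)
age_treated = rng.normal(62, 8, size=200)   # simulated covariate, treated group
age_control = rng.normal(60, 9, size=200)   # simulated covariate, control group

smd = standardized_mean_difference(age_treated, age_control)
verdict = "balanced" if abs(smd) < 0.10 else "imbalanced"
print(f"SMD = {smd:.3f} -> {verdict}")
```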
Onyxia: A User-Centric Interface for Data Scientists in the Cloud Age
In this talk, we'll look into why Insee had to go beyond usual tools like JupyterHub. With data science growing, it has become important to have tools that are easy to use, can adapt as needed, and help people work together. The open-source software Onyxia brings a new answer by offering a user-friendly way to boost creativity in a data environment that makes heavy use of containerization and object storage.
Catering Causal Inference: An Introduction to 'metalearners', a Flexible MetaLearner Library in Python
Discover metalearners, a cutting-edge Python library designed for Causal Inference with particularly flexible and user-friendly MetaLearner implementations. metalearners leverages the power of conventional Machine Learning estimators and molds them into causal treatment effect estimators. This talk is targeted at data professionals with some Python and Machine Learning competence, guiding them toward optimizing interventions such as 'Which potential customers should receive a voucher to optimally allocate a voucher budget?' or 'Which patients should receive which medical treatment?' based on causal interpretations.
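To give an intuition of what a MetaLearner computes, here is a hand-rolled T-learner sketch built directly on scikit-learn; it illustrates the idea only, uses simulated data, and does not use the metalearners library's own API:

```python
# T-learner: estimate heterogeneous treatment effects by fitting one
# outcome model per treatment arm and differencing their predictions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2_000
X = rng.normal(size=(n, 5))
w = rng.integers(0, 2, size=n)                      # treatment assignment
true_effect = 1.0 + X[:, 0]                         # heterogeneous effect
y = X[:, 1] + w * true_effect + rng.normal(scale=0.5, size=n)

model_treated = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])
model_control = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])

cate = model_treated.predict(X) - model_control.predict(X)
print(f"estimated ATE: {cate.mean():.2f} (true ATE ~ {true_effect.mean():.2f})")
```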
DataLab: Bridging Scientific and Industrial Worlds for Advanced Signal and Image Processing
The video is available here: https://www.youtube.com/watch?v=yn1bR-BVfn8&list=PLGVZCDnMOq0pKya8gksd00ennKuyoH7v7&index=37
This talk introduces DataLab, a unique open-source platform for signal and image processing, seamlessly integrating scientific and industrial applications.
The main objective of this talk is to show how DataLab may be used as a complementary tool alongside Jupyter notebooks or an IDE (e.g., Spyder), and how it can be extended with custom Python scripts or applications.
sktime - python toolbox for time series: next-generation AI – deep learning and foundation models
sktime is a widely used, scikit-learn-compatible library for learning with time series. sktime is easily extensible by anyone and interoperable with the PyData/NumFOCUS stack.
This talk presents progress, challenges, and the newest features hot off the press in extending the sktime framework to deep learning and foundation models.
Recent progress in generative AI and deep learning is leading to an ever-exploding number of popular “next generation AI” models for time series tasks such as forecasting, classification, and segmentation.
Particular challenges of the new AI ecosystem are inconsistent formal interfaces, different deep learning backends, and vendor-specific APIs and architectures that do not match sklearn-like patterns well; every practitioner who has tried to use at least two such models at the same time (outside sktime) will have their own painful memories.
We show how sktime brings its unified interface architecture for time series modelling to the brave new AI frontier, using novel design patterns that build on ideas from Hugging Face and scikit-learn to provide modular, extensible building blocks with a simple specification language.
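For orientation, here is a minimal sketch of sktime's unified forecasting interface using the standard API; the NaiveForecaster is just a placeholder baseline, not one of the foundation models discussed in the talk:

```python
# Fit/predict with sktime's common forecasting interface.
import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()                                    # univariate monthly series
forecaster = NaiveForecaster(strategy="last", sp=12)  # seasonal naive baseline
forecaster.fit(y)
y_pred = forecaster.predict(fh=np.arange(1, 13))      # forecast 12 steps ahead
print(y_pred.head())
```

Any other sktime-compatible forecaster, including the deep learning and foundation model wrappers discussed in the talk, can be swapped in behind the same fit/predict interface.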