Machine Learning practitioners build predictive models from "noisy" data resulting in uncertain predictions. But what does "noise" mean in a machine learning context?
PyData Paris 2024
Sessions & talks
MLOps at Renault Group: A Generic Pipeline for Scalable Deployment
Scaling machine learning at large organizations like Renault Group presents unique challenges in terms of scale, legal requirements, and diversity of use cases. Data scientists require streamlined workflows and automated processes to efficiently deploy models into production. We present an MLOps pipeline based on Python, Kubeflow, and the GCP Vertex AI API designed specifically for this purpose. It enables data scientists to focus on code development for pre-processing, training, evaluation, and prediction. This MLOps pipeline is a cornerstone of the AI@Scale program, which aims to roll out AI across the Group.
We chose a Python-first approach, allowing data scientists to focus purely on writing preprocessing- or ML-oriented Python code, while also allowing data retrieval through SQL queries. The pipeline addresses key questions such as prediction type (batch or API), model versioning, resource allocation, drift monitoring, and alert generation. It favors faster time to market with automated deployment and infrastructure management. Although we encountered pitfalls and design difficulties, which we will discuss during the presentation, the pipeline integrates with a CI/CD process, ensuring efficient and automated model deployment and serving.
Finally, this MLOps solution empowers Renault data scientists to seamlessly move innovative models into production and smooths the development of scalable, impactful AI-driven solutions.
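As a rough, hypothetical illustration of the component-based pipeline style described above (not Renault's actual code; the component names and parameters are made up), a minimal Kubeflow Pipelines v2 definition can be compiled into a job spec that Vertex AI Pipelines accepts:

from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def preprocess(raw_table: str) -> str:
    # Hypothetical preprocessing step: a real component would pull data
    # via SQL (e.g. BigQuery) and write a prepared dataset, returning its URI.
    return f"gs://my-bucket/prepared/{raw_table}"


@dsl.component(base_image="python:3.11")
def train(dataset_uri: str, learning_rate: float) -> str:
    # Hypothetical training step returning a model artifact URI.
    return f"{dataset_uri}/model-lr-{learning_rate}"


@dsl.pipeline(name="demo-mlops-pipeline")
def demo_pipeline(raw_table: str = "sales", learning_rate: float = 0.1):
    prepared = preprocess(raw_table=raw_table)
    train(dataset_uri=prepared.output, learning_rate=learning_rate)


# Compile to a job spec that can be submitted to Vertex AI Pipelines.
compiler.Compiler().compile(demo_pipeline, "demo_pipeline.json")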
Rising concerns over IT's carbon footprint necessitate tools that gauge and mitigate these impacts. This session introduces CodeCarbon, an open-source tool that estimates computing's carbon emissions by measuring energy use across hardware components. Aimed at AI researchers and data scientists, CodeCarbon provides actionable insights into the environmental costs of computational projects, supporting efforts towards sustainability without requiring deep technical expertise.
This talk from the main contributors of CodeCarbon will cover the environmental impact of IT, ways to estimate it, and a demo of CodeCarbon.
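For readers who have not used it, here is a minimal sketch of how CodeCarbon is typically wrapped around a workload (the project name and the workload itself are placeholders):

from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="demo-training-run")
tracker.start()
try:
    # Placeholder for the actual workload (e.g. model training).
    total = sum(i * i for i in range(10_000_000))
finally:
    emissions_kg = tracker.stop()  # estimated emissions, in kg of CO2-equivalent

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")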
Processing medical images at scale on the cloud
The MedTech industry is undergoing a revolutionary transformation, with continuous innovations promising greater precision, efficiency, and accessibility. Oncology in particular, the branch of medicine that focuses on cancer, will benefit immensely from these new technologies, which may enable clinicians to detect cancer earlier and increase chances of survival. Detecting cancerous cells in microscopic photographs of cells (Whole Slide Images, aka WSIs) is usually done with segmentation algorithms, which neural networks (NNs) are very good at. While using ML and NNs for image segmentation is a fairly standard task with established solutions, doing it on WSIs is a different kettle of fish. Most training pipelines and systems have been designed for analytics, meaning huge columns of small individual data points. In the case of WSIs, a single image is so large that its file can reach dozens of gigabytes. To allow innovation in medical imaging with AI, we need efficient and affordable ways to store and process these WSIs at scale.
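To make the scale problem concrete, here is a small illustrative sketch using the openslide-python bindings (the file path and coordinates are hypothetical) that reads a single tile from a multi-gigabyte WSI without loading the whole image into memory:

import openslide

# A single WSI can be tens of gigabytes on disk; OpenSlide exposes the
# image pyramid so we can read small regions instead of the full image.
slide = openslide.OpenSlide("example_slide.svs")  # hypothetical path

print("Full resolution:", slide.dimensions)  # often on the order of 100k x 100k pixels
print("Pyramid levels:", slide.level_count)

# Read a 512x512 tile at full resolution, starting at (x=10000, y=20000).
tile = slide.read_region((10_000, 20_000), level=0, size=(512, 512))
tile = tile.convert("RGB")  # read_region returns an RGBA PIL image
slide.close()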
Boosting AI Reliability: Uncertainty Quantification with MAPIE
MAPIE (Model Agnostic Prediction Interval Estimator) is your go-to solution for managing uncertainties and risks in machine learning models. This Python library, nestled within scikit-learn-contrib, offers a way to calculate prediction intervals with controlled coverage rates for regression, classification, and even time series analysis. But it doesn't stop there - MAPIE can also be used to handle more complex tasks like multi-label classification and semantic segmentation in computer vision, ensuring probabilistic guarantees on crucial metrics like recall and precision. MAPIE can be integrated with any model - whether it's scikit-learn, TensorFlow, or PyTorch. Join us as we delve into the world of conformal predictions and how to quickly manage your uncertainties using MAPIE.
Link to Github: https://github.com/scikit-learn-contrib/MAPIE
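As a quick taste of the API (based on MAPIE's 0.x interface; the library has been evolving, so check the current documentation), any scikit-learn-compatible regressor can be wrapped to produce prediction intervals with a target coverage:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from mapie.regression import MapieRegressor

# Toy 1D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap any scikit-learn-compatible regressor to get conformal prediction intervals.
mapie = MapieRegressor(estimator=RandomForestRegressor(), cv=5)
mapie.fit(X_train, y_train)

# alpha=0.1 targets ~90% coverage; y_pis has shape (n_samples, 2, n_alpha).
y_pred, y_pis = mapie.predict(X_test, alpha=0.1)
coverage = np.mean((y_test >= y_pis[:, 0, 0]) & (y_test <= y_pis[:, 1, 0]))
print(f"Empirical coverage: {coverage:.2f}")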
Catering Causal Inference: An Introduction to 'metalearners', a Flexible MetaLearner Library in Python
Discover metalearners, a cutting-edge Python library designed for Causal Inference with particularly flexible and user-friendly MetaLearner implementations. metalearners leverages the power of conventional Machine Learning estimators and molds them into causal treatment effect estimators. This talk is targeted at data professionals with some Python and Machine Learning competence, guiding them toward optimizing interventions such as 'Which potential customers should receive a voucher to optimally allocate a voucher budget?' or 'Which patients should receive which medical treatment?' based on causal interpretations.
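To give a sense of what a MetaLearner does, here is a hand-rolled T-learner sketch built directly on scikit-learn (for illustration only; it is not the metalearners library's own API, which should be consulted for real use). One outcome model is fitted per treatment arm, and their difference estimates the individual treatment effect:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: covariates X, binary treatment w (e.g. voucher sent or not),
# and an outcome y with a heterogeneous treatment effect of 1 + X[:, 0].
rng = np.random.default_rng(0)
n = 2_000
X = rng.normal(size=(n, 3))
w = rng.integers(0, 2, size=n)
true_effect = 1.0 + X[:, 0]
y = X @ np.array([0.5, -0.3, 0.2]) + w * true_effect + rng.normal(0, 0.5, n)

# T-learner: one outcome model per treatment arm.
model_control = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])
model_treated = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])

# Conditional Average Treatment Effect (CATE) estimate per individual.
cate = model_treated.predict(X) - model_control.predict(X)
print("Mean estimated treatment effect:", cate.mean())  # should land near 1.0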
sktime - python toolbox for time series: next-generation AI – deep learning and foundation models
sktime is a widely used scikit-learn-compatible library for learning with time series. sktime is easily extensible by anyone and interoperable with the PyData/NumFOCUS stack.
This talk presents progress, challenges, and the newest features hot off the press in extending the sktime framework to deep learning and foundation models.
Recent progress in generative AI and deep learning is leading to an ever-exploding number of popular "next generation AI" models for time series tasks like forecasting, classification, and segmentation.
Particular challenges of the new AI ecosystem are inconsistent formal interfaces, different deep learning backends, and vendor-specific APIs and architectures that do not match sklearn-like patterns well; every practitioner who has tried to use at least two such models at the same time (outside sktime) will have their own painful memories.
We show how sktime brings its unified interface architecture for time series modelling to the brave new AI frontier, using novel design patterns that build on ideas from Hugging Face and scikit-learn to provide modular, extensible building blocks with a simple specification language.
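For context, sktime's existing unified interface looks like the following standard example from the library's classical forecasting API (a naive baseline, not one of the new deep learning or foundation model estimators, which adopt the same fit/predict pattern):

import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()  # monthly airline passengers series

# Every sktime forecaster follows the same fit/predict pattern.
forecaster = NaiveForecaster(strategy="last", sp=12)  # seasonal naive baseline
forecaster.fit(y)
y_pred = forecaster.predict(fh=np.arange(1, 13))  # forecast the next 12 months
print(y_pred.head())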
Understanding the effectiveness of various marketing channels is crucial to maximise the return on investment (ROI). However, limitations on third-party cookies and an ever-growing focus on privacy make it difficult to rely on basic analytics. This talk discusses a pioneering project where a Bayesian model was employed to assess the marketing media mix effectiveness of WeRoad, the fastest-growing Italian tour operator.
The Bayesian approach allows for the incorporation of prior knowledge, seamlessly updating it with new data to provide robust, actionable insights. This project leveraged a Bayesian model to unravel the complex interactions between marketing channels such as online ads, social media, and promotions. We'll dive deep into how the Bayesian model was designed, discussing how we provided the AI system with expert knowledge, and presenting how delays and saturation were modelled.
We will also tackle aspects of the technical implementation, discussing how Python, PyMC, and Streamlit provided us with all the tools we needed to develop an effective, efficient, and user-friendly system.
Attendees will walk away with:
- A simple understanding of the Bayesian approach and why it matters.
- Concrete examples of the transformative impact on WeRoad's marketing strategy.
- A blueprint to harness predictive models in their business strategies.
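To make the modelling idea concrete, here is a heavily simplified, hypothetical PyMC sketch of a media mix model with a saturation transform (synthetic data, toy priors, and no adstock/delay term; it is not WeRoad's actual model):

import numpy as np
import pymc as pm

# Hypothetical weekly spend for two channels and observed conversions.
rng = np.random.default_rng(0)
spend = rng.gamma(2.0, 100.0, size=(104, 2))
conversions = rng.poisson(50, size=104).astype(float)


def saturate(x, lam):
    # Diminishing-returns (saturation) transform: extra spend adds less and less.
    return 1 - pm.math.exp(-lam * x)


with pm.Model() as mmm:
    intercept = pm.Normal("intercept", 0, 10)
    beta = pm.HalfNormal("beta", 1.0, shape=2)  # channel effectiveness
    lam = pm.Gamma("lam", 2.0, 2.0, shape=2)    # saturation speed per channel
    sigma = pm.HalfNormal("sigma", 5.0)

    contribution = (beta * saturate(spend / spend.mean(axis=0), lam)).sum(axis=1)
    pm.Normal("obs", mu=intercept + contribution, sigma=sigma, observed=conversions)

    idata = pm.sample(1000, tune=1000, chains=2)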
In the rapidly evolving landscape of Artificial Intelligence (AI), open source and openness in AI have emerged as crucial factors in fostering innovation, transparency, and accountability. Mistral AI's release of the open-weight Mistral 7B model has sparked significant adoption and demand, highlighting the importance of open source and customization in building AI applications. This talk focuses on the Mistral AI model landscape, the benefits of open source and customization, and the opportunities for building AI applications using Mistral models.
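As one concrete example of what open weights enable (assuming the transformers and accelerate libraries, enough GPU memory, and acceptance of the model's terms on the Hugging Face Hub), the instruct variant of Mistral 7B can be run locally:

from transformers import pipeline

# Downloads the open-weight model from the Hugging Face Hub on first use.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",  # place layers on available GPU(s); requires accelerate
)

# Mistral's instruction format wraps the user prompt in [INST] ... [/INST].
prompt = "[INST] In two sentences, what are open-weight models? [/INST]"
output = generator(prompt, max_new_tokens=128, do_sample=False)
print(output[0]["generated_text"])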
Keynote: Open-source AI: why it matters and how to get started
In this talk, we will go through everything open-source AI: the state of open-source AI, why it matters, the future of it and how you can get started with it.
Jupylates: spaced repetition for teaching with Jupyter
Jupyter-based environments are getting a lot of traction for teaching computing, programming, and data science. The narrative structure of notebooks has indeed proven its value for guiding each student at their own pace to the discovery and understanding of new concepts or new idioms (e.g. how do I extract a column in pandas?). But these new pieces of knowledge tend to quickly fade out and be forgotten. Long-term acquisition of knowledge and skills takes reinforcement by repetition. This is the foundation of many online learning platforms like WeBWorK or WIMS that offer exercises with randomization and automatic feedback, and of popular "AI-powered" apps -- e.g. to learn foreign languages -- that use spaced repetition algorithms designed by the educational and neuro sciences to deliver just the right amount of repetition.
What if you could author such exercises as notebooks, to benefit from everything that Jupyter can offer (think rich narratives, computations, visualizations, interactions)? What if you could integrate such exercises right into your Jupyter-based course? What if a learner could get personalized exercise recommendations based on their past learning records, without having to give these sensitive pieces of information away?
That's Jupylates (work in progress). And thanks to the open source scientific stack, it's just a small Jupyter extension.
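As a toy illustration of the kind of scheduling logic spaced repetition relies on (a simple Leitner-style box system; Jupylates' actual recommendation algorithm may well differ), each correct answer moves an exercise to a box that is reviewed less often, and each mistake sends it back to box 1:

from datetime import date, timedelta

# Review intervals (in days) for Leitner boxes 1..5.
INTERVALS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}


def update_schedule(box: int, correct: bool, today: date) -> tuple[int, date]:
    """Return the exercise's new box and its next review date."""
    box = min(box + 1, 5) if correct else 1
    return box, today + timedelta(days=INTERVALS[box])


box, due = 1, date.today()
for answer in [True, True, False, True]:
    box, due = update_schedule(box, answer, date.today())
    print(f"answered {'right' if answer else 'wrong'} -> box {box}, next review {due}")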
The Jupyter stack has undergone a significant transformation in recent years with the integration of collaborative editing features: users can now modify a shared document and see each other's changes in real time, with a user experience akin to that of Google Docs. The underlying technology uses a special kind of data structure called Conflict-free Replicated Data Types (CRDTs), which automatically resolve conflicts when concurrent changes are made. This allows data to be distributed rather than centralized on a server, letting clients work as if data were local rather than remote. In this talk, we look at new possibilities that CRDTs can unlock, and how they are redefining Jupyter's architecture. Different use cases are presented: a suggestion system similar to Google Docs', a chat system allowing collaboration with an AI agent, an execution model allowing full notebook state recovery, and a collaborative widget model. We also look at the benefits of using CRDTs in JupyterLite, where users can interact without a server. This may be a great example of a distributed system where every user owns their data and shares it with their peers.
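To illustrate the conflict-free merge idea in miniature (a classic grow-only counter, not the Yjs-style document CRDTs that Jupyter actually builds on), each replica only increments its own slot and merging takes the element-wise maximum, so concurrent updates can never conflict:

class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Taking the max per replica makes merging commutative, associative,
        # and idempotent -- the properties that make conflicts impossible.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("alice"), GCounter("bob")
a.increment(3)  # concurrent, independent updates on two replicas
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5  # both replicas converge to the same state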
Would you rely on ChatGPT to dial 911? A talk on balancing determinism and probabilism in production machine learning systems
In the last year, there hasn't been a day that has passed without us hearing about a new generative AI innovation that will enhance some aspect of our lives. On a number of tasks, large probabilistic systems are now outperforming humans, or at least they do so "on average". "On average" means most of the time, but in many real-life scenarios "average" performance is not enough: we need correctness ALL of the time, for example when you ask the system to dial 911.
In this talk we will explore the synergy between deterministic and probabilistic models to enhance the robustness and controllability of machine learning systems. Tailored for ML engineers, data scientists, and researchers, the presentation delves into the necessity of using both deterministic algorithms and probabilistic model types across various ML systems, from straightforward classification to advanced Generative AI models.
You will learn about the unique advantages each paradigm offers and gain insights into how to most effectively combine them for optimal performance in real-world applications. I will walk you through my past and current experiences in working with simple and complex NLP models, and show you what kind of pitfalls, shortcuts, and tricks are possible to deliver models that are both competent and reliable.
The session will be structured into a brief introduction to both model types, followed by case studies in classification and generative AI, concluding with a Q&A segment.
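As one small, hypothetical illustration of combining the two paradigms (not an example from the talk): a deterministic validator gates the output of a probabilistic extractor, so a critical field such as an emergency number is either provably well-formed or refused outright:

import re

EMERGENCY_PATTERN = re.compile(r"^\d{3}$")  # e.g. "911" -- a deterministic rule


def extract_number_with_model(text: str) -> str:
    # Stand-in for a probabilistic model (e.g. an LLM or NER extractor);
    # here we simply grab the first run of digits for illustration.
    match = re.search(r"\d+", text)
    return match.group(0) if match else ""


def get_emergency_number(text: str) -> str:
    candidate = extract_number_with_model(text)
    if EMERGENCY_PATTERN.fullmatch(candidate):
        return candidate  # deterministic check passed; safe to act on
    raise ValueError(f"Refusing to dial unverified number: {candidate!r}")


print(get_emergency_number("Please call 911 immediately"))  # -> "911"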
Keynote: DIY Personalization: When, how and why to build your own models
With the increased accessibility of smaller "AI" models, better chips, and on-device learning, is it now possible to build and train your own models for your own use? In this keynote, we'll explore lessons from small-, medium-, and large-sized model personalization, driven by yourself and for yourself. A walk through what's possible, what's not, and what we should prioritize if we'd like AI & ML to be made for everyone.