Scikit-learn

Tracking Policy Evolution Through Clustering: A New Approach to Temporal Pattern Analysis in Multi-Dimensional Data

2025-12-10 · PyData Boston 2025

talk

by Sarthak Pattnaik

Matplotlib Pandas Python

Analyzing how patterns evolve over time in multi-dimensional datasets is challenging: traditional time-series methods often struggle with interpretability when comparing multiple entities across different scales. This talk introduces a clustering-based framework that transforms continuous data into categorical trajectories, enabling intuitive visualization and comparison of temporal patterns.

What & Why: The method combines quartile-based categorization with modified Hamming distance to create interpretable "trajectory fingerprints" for entities over time. This approach is particularly valuable for policy analysis, economic comparisons, and any domain requiring longitudinal pattern recognition.

Who: Data scientists and analysts working with temporal datasets, policy researchers, and anyone interested in comparative analysis across entities with different scales or distributions.

Type: Technical presentation with practical implementation examples using Python (pandas, scikit-learn, matplotlib). Moderate mathematical content balanced with intuitive visualizations.

Takeaway: Attendees will learn a novel approach to temporal pattern analysis that bridges the gap between complex statistical methods and accessible, policy-relevant insights. You'll see practical implementations analyzing 60+ years of fiscal policy data across 8 countries, with code available for adaptation to your own datasets.
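The quartile-plus-Hamming idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's implementation: the data are made up, the helper names `to_quartile_trajectory` and `hamming` are hypothetical, and a plain normalized Hamming distance stands in for the talk's modified variant, which the abstract does not specify.

```python
import numpy as np
import pandas as pd

# Made-up indicator values for three hypothetical entities over ten years.
years = list(range(2000, 2010))
data = pd.DataFrame(
    {
        "A": np.linspace(1.0, 10.0, len(years)),
        "B": np.linspace(100.0, 1000.0, len(years)),  # same shape as A, different scale
        "C": np.linspace(10.0, 1.0, len(years)),      # opposite trend
    },
    index=years,
)


def to_quartile_trajectory(series: pd.Series) -> pd.Series:
    """Map each value to its quartile (0..3) within the entity's own history."""
    return pd.qcut(series, q=4, labels=False)


def hamming(a: pd.Series, b: pd.Series) -> float:
    """Fraction of time steps at which two categorical trajectories differ."""
    return float((a.to_numpy() != b.to_numpy()).mean())


# One categorical trajectory ("fingerprint") per entity.
trajectories = data.apply(to_quartile_trajectory)

dist_ab = hamming(trajectories["A"], trajectories["B"])
dist_ac = hamming(trajectories["A"], trajectories["C"])
print(dist_ab, dist_ac)
```

Because each entity is binned against its own distribution, A and B end up with identical trajectories despite the hundredfold scale difference, while A and C differ at every time step.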

Hands-On Machine Learning with Scikit-Learn and PyTorch

2025-10-29 · O'Reilly AI & ML Books

book

by Aurélien Géron

AI/ML LLM PyTorch ai-ml data deep-learning machine-learning

The potential of machine learning today is extraordinary, yet many aspiring developers and tech professionals find themselves daunted by its complexity. Whether you're looking to enhance your skill set and apply machine learning to real-world projects or are simply curious about how AI systems function, this book is your jumping-off place. With an approachable yet deeply informative style, author Aurélien Géron delivers the ultimate introductory guide to machine learning and deep learning. Drawing on the Hugging Face ecosystem, with a focus on clear explanations and real-world examples, the book takes you through cutting-edge tools like Scikit-Learn and PyTorch, from basic regression techniques to advanced neural networks. Whether you're a student, professional, or hobbyist, you'll gain the skills to build intelligent systems.

Understand ML basics, including concepts like overfitting and hyperparameter tuning
Complete an end-to-end ML project using Scikit-Learn, covering everything from data exploration to model evaluation
Learn techniques for unsupervised learning, such as clustering and anomaly detection
Build advanced architectures like transformers and diffusion models with PyTorch
Harness the power of pretrained models, including LLMs, and learn to fine-tune them
Train autonomous agents using reinforcement learning

ML scoring models: how to make them immortal? Ownership and portability guaranteed with Scoring.AI

2025-10-01 · Big Data & AI Paris 2025

Face To Face

by Tanguy Le Nouvel (Scoring.AI by Micropole)

AI/ML NumPy Pandas Python

Are your predictive models aging badly? One update to your packages (pandas, scikit-learn, lightgbm…) and a production outage is guaranteed…

With Scoring.AI, take back full control and ensure their longevity. Our innovative tool builds high-performing scores and automatically translates their deployment into pure Python code, based solely on Pandas and NumPy.

The result?

Full portability: your models run in production independently of the packages and tools that were used to build them

Simplified maintenance: IT teams can update their technical stack without risk of breakage

Greater ownership and transparency: readable, auditable code that is easy to deploy, even in constrained environments

Through concrete cases and a live demo, explore how to free your models from software dependencies and ensure their long-term survival. Because a good model is a model that lasts!
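To make the portability idea concrete, here is a minimal sketch of exporting a fitted model into a scoring function that needs only NumPy at prediction time. It illustrates the general principle only and is not how Scoring.AI works internally; the names `COEF`, `INTERCEPT` and `score` are hypothetical, and a logistic regression is assumed for simplicity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Training side: scikit-learn is needed only here.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Export the fitted parameters as plain Python numbers.
COEF = model.coef_[0].tolist()
INTERCEPT = float(model.intercept_[0])


def score(features):
    """Probability of the positive class, computed with NumPy only."""
    z = np.asarray(features) @ np.asarray(COEF) + INTERCEPT
    return 1.0 / (1.0 + np.exp(-z))


# The standalone function reproduces the library's predictions.
same = bool(np.allclose(score(X), model.predict_proba(X)[:, 1]))
print(same)
```

Once `COEF` and `INTERCEPT` are written into a source file, the scoring code no longer depends on the scikit-learn version that produced them.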

Enterprise Data Science: €50 Billion Wasted -- And How to Get it Back!

2025-10-01 · Big Data & AI Paris 2025

Face To Face

by Stephen Bauer (Probabl)

AI/ML Data Science

How data science and the next wave of open-source innovation are closing the €50B efficiency gap in Enterprise AI.

Today, 75% of data science output is lost to fragmented data, scattered tooling, manual workflows, and poor reproducibility. Yet nearly every data scientist relies on scikit-learn — the backbone of modern AI/ML.

We’ll unpack the root causes of inefficiency in enterprise data science — and show how open-source tools are unlocking performance, reproducibility, and strategic autonomy at scale.

Probabilistic regression models: let's compare different modeling strategies and discuss how to evaluate them

2025-10-01 · PyData Paris 2025

talk

by Olivier Grisel

AI/ML PyTorch

Most common machine learning models (linear, tree-based or neural network-based) optimize for the least squares loss when trained for regression tasks. As a result, they output a point estimate of the conditional expected value of the target: E[y|X].

In this presentation, we will explore several ways to train and evaluate probabilistic regression models as a richer alternative to point estimates. Those models describe the full distribution of y|X and allow us to quantify the predictive uncertainty for individual predictions.

On the model training part, we will introduce the following options:

ensemble of quantile regressors for a grid of quantile levels (using linear models or gradient boosted trees in scikit-learn, XGBoost and PyTorch);
how to reduce probabilistic regression to multi-class classification + a cumulative sum of the predict_proba output to recover a continuous conditional CDF:
    how to implement this approach as a generic scikit-learn meta-estimator;
    how this approach is used to pretrain foundational tabular models (e.g. TabPFNv2);
simple Bayesian models (e.g. Bayesian Ridge and Gaussian Processes);
more specialized approaches as implemented in XGBoostLSS.
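The first option in the list above, an ensemble of quantile regressors, might look like the following sketch. The synthetic data and the three quantile levels are assumptions chosen for illustration; the talk's own examples may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data with noise that grows with X (heteroscedastic), made up for this sketch.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1 + 0.1 * X[:, 0])

# One gradient boosted model per quantile level, each trained with the pinball loss.
quantile_levels = [0.05, 0.5, 0.95]
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0).fit(X, y)
    for q in quantile_levels
}

# Each row holds the predicted 5%, 50% and 95% quantiles for one new input.
X_new = np.linspace(0.0, 10.0, 5).reshape(-1, 1)
predicted_quantiles = np.column_stack(
    [models[q].predict(X_new) for q in quantile_levels]
)

# Fraction of training targets inside the nominal 90% interval.
lower, upper = models[0.05].predict(X), models[0.95].predict(X)
train_coverage = float(np.mean((y >= lower) & (y <= upper)))
print(predicted_quantiles.round(2), train_coverage)
```

Independently trained quantile models can occasionally cross (a lower quantile predicted above a higher one), so coverage should be checked on held-out data rather than on the training set as done here.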

We will also discuss how to evaluate probabilistic predictions via:

the pinball loss of quantile regressors,
other strictly proper scoring rules such as Continuous Ranked Probability Score (CRPS),
coverage measures and width of prediction intervals,
reliability diagrams for different quantile levels.
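Two of the evaluation tools listed above, the pinball loss and interval coverage and width, are simple enough to write out with NumPy alone. The function names below are hypothetical helpers for this sketch; scikit-learn also ships a `mean_pinball_loss` metric.

```python
import numpy as np


def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss of predictions for quantile level alpha."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))


def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true = np.asarray(y_true)
    return float(np.mean((y_true >= lower) & (y_true <= upper)))


def mean_interval_width(lower, upper):
    """Average width of the prediction intervals."""
    return float(np.mean(np.asarray(upper) - np.asarray(lower)))


# Standard normal targets: the true 5% and 95% quantiles are about -1.645 and 1.645.
rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
lower = np.full_like(y, -1.645)
upper = np.full_like(y, 1.645)

coverage = interval_coverage(y, lower, upper)
width = mean_interval_width(lower, upper)
loss_at_true_quantile = pinball_loss(y, upper, alpha=0.95)
loss_at_median = pinball_loss(y, np.zeros_like(y), alpha=0.95)
print(coverage, width, loss_at_true_quantile, loss_at_median)
```

Predicting the true 95% quantile gives a lower pinball loss at level 0.95 than predicting the median, which is the property that makes the loss usable for evaluating quantile regressors.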

We will illustrate each of those concepts with concrete examples and running code.

Finally, we will illustrate why some applications need such calibrated probabilistic predictions:

estimating uncertainty in trip times depending on traffic conditions to help a human decision maker choose among various travel plan options,
modeling value at risk for investment decisions,
assessing the impact of missing variables for an ML model trained to work in degraded mode,
Bayesian optimization for operational parameters of industrial machines from little/costly observations.

If time allows, we will also discuss usage and limitations of Conformal Quantile Regressors as implemented in MAPIE and contrast aleatoric vs epistemic uncertainty captured by those models.

PyPI in the face: running jokes that PyPI download stats can play on you

2025-10-01 · PyData Paris 2025

talk

by Loïc Estève

Analytics GitHub NumPy Python SciPy

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform some of our decisions, like:

how do we increase user awareness of best practices (please use Pipeline and cross-validation)?
how do we advertise our recent improvements (use HistGradientBoosting rather than GradientBoosting, TunedThresholdClassifierCV, PCA and a few other models can run on GPU)?
do users care more about new features from recent releases or consolidation of what already exists?
how long should we support older versions of Python, NumPy or SciPy?
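For readers unfamiliar with the Pipeline and cross-validation advice mentioned above, a minimal sketch on made-up data could look like this. The choice of scaler and classifier is an assumption for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, generated only for this example.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bundling preprocessing and model means the scaler is re-fitted on each
# training fold, so no statistics leak from the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression())

# Five-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```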

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard, but grasping the reality behind these metrics is often tricky.

Skrub: machine learning for dataframes

2025-10-01 · PyData Paris 2025

talk

by Jérôme Dockès, Riccardo Cappuzzo, Guillaume Lemaitre (scikit-learn)

AI/ML DataOps

Skrub is an open source package that simplifies machine learning with dataframes by providing a variety of tools to explore, prepare and feature-engineer dataframes so they can be integrated into scikit-learn pipelines. Skrub DataOps allow users to build extensive, multi-table wrangling plans, explore hyperparameter spaces, and export the resulting objects for deployment. The talk showcases various use cases where skrub can simplify the job of a data scientist from data preparation to deployment, through code examples and demonstrations.

A Hitchhiker's Guide to the Array API Standard Ecosystem

2025-09-30 · PyData Paris 2025

talk

by Lucas Colley

API NumPy Python PyTorch SciPy

The array API standard is unifying the ecosystem of Python array computing, facilitating greater interoperability between code written for different array libraries, including NumPy, CuPy, PyTorch, JAX, and Dask.

But what are all of these "array-api-" libraries for? How can you use these libraries to 'future-proof' your libraries, and provide support for GPU and distributed arrays to your users? Find out in this talk, where I'll guide you through every corner of the array API standard ecosystem, explaining how SciPy and scikit-learn are using all of these tools to adopt the standard. I'll also be sharing progress updates from the past year, to give you a clear picture of where we are now, and what the future holds.
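As a rough illustration of what array-library-agnostic code looks like, the sketch below writes a function against the standard's namespace instead of against NumPy directly. The helper names `get_namespace` and `standardize` are hypothetical, and the NumPy fallback is an assumption added so the example also runs on NumPy versions older than 2.0; in practice, libraries that do not expose `__array_namespace__` themselves are typically handled through a compatibility layer such as array-api-compat.

```python
import numpy as np


def get_namespace(x):
    """Return the array API namespace of x, falling back to NumPy."""
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np  # assumption: treat anything else as a NumPy array


def standardize(x):
    """Center and scale each column using only array API standard functions."""
    xp = get_namespace(x)
    mean = xp.mean(x, axis=0, keepdims=True)
    centered = x - mean
    std = xp.sqrt(xp.mean(centered * centered, axis=0, keepdims=True))
    return centered / std


x = np.asarray([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
z = standardize(x)
print(z)
```

The function body never mentions NumPy, so the same code can run unchanged on any array object whose library implements the standard.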

Open-source Business

2025-09-30 · PyData Paris 2025

talk

by Yann LECHELLE (:PROBABL.), Alexander C. S. Hendorf (opotoc GmbH), Sylvain Corlay

Arrow

Challenges in economics and governance models for open-source scientific projects

In this presentation, the CEOs of two companies at the forefront of open-source scientific software development - Sylvain Corlay of QuantStack and Yann Lechelle of Probabl - examine the intricate challenges of open-source funding and governance and reflect on how these two aspects interconnect.

We start by reflecting on the origins of the open-source movement within the scientific community, and delve into the contemporary challenges of operating businesses and identifying sustainable economic models that both leverage and contribute to open-source software.

In particular, we highlight the unique approaches and experiences of QuantStack and Probabl, which primarily contribute to multi-stakeholder scientific projects such as scikit-learn, Jupyter, Apache Arrow, or conda-forge.

Optimize the Right Thing: Cost-Sensitive Classification in Practice

2025-09-26 · PyData Amsterdam 2025

talk

by Shimanto Rahman

AI/ML Python

Not all mistakes in machine learning are equal—a false negative in fraud detection or medical diagnosis can be far costlier than a false positive. Cost-sensitive learning helps navigate these trade-offs by incorporating error costs into the training process, leading to smarter decision-making. This talk introduces Empulse, an open-source Python package that brings cost-sensitive learning into scikit-learn. Attendees will learn why standard models fall short in cost-sensitive scenarios and how to build better classifiers with Scikit-Learn and Empulse.
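The core trade-off can be illustrated without any dedicated package. The sketch below picks the decision threshold that minimizes an average misclassification cost; it does not use the Empulse API, the helper names are hypothetical, and the scores and costs are made up.

```python
import numpy as np


def expected_cost(y_true, y_score, threshold, cost_fp, cost_fn):
    """Average cost per sample when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_score) >= threshold
    false_positives = np.sum(y_pred & (y_true == 0))
    false_negatives = np.sum(~y_pred & (y_true == 1))
    return float(cost_fp * false_positives + cost_fn * false_negatives) / len(y_true)


def best_threshold(y_true, y_score, cost_fp, cost_fn):
    """Grid-search the threshold with the lowest expected cost."""
    candidates = np.linspace(0.0, 1.0, 101)
    costs = [expected_cost(y_true, y_score, t, cost_fp, cost_fn) for t in candidates]
    return float(candidates[int(np.argmin(costs))])


# Made-up scores: positives are rare and tend to score higher than negatives.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.2).astype(int)
y_score = np.clip(0.25 + 0.3 * y_true + rng.normal(scale=0.15, size=1000), 0.0, 1.0)

threshold_equal_costs = best_threshold(y_true, y_score, cost_fp=1.0, cost_fn=1.0)
threshold_costly_misses = best_threshold(y_true, y_score, cost_fp=1.0, cost_fn=10.0)
print(threshold_equal_costs, threshold_costly_misses)
```

When a false negative costs more, the selected threshold never moves up: the classifier accepts more false alarms to avoid expensive misses.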

Advanced Machine Learning Techniques for Predicting Properties of Synthetic Aviation Fuels using Python

2025-07-10 · SciPy 2025

talk

by Ana Comesana

AI/ML NumPy Pandas Python

Synthetic aviation fuels (SAFs) offer a pathway to improving efficiency, but high cost and volume requirements hinder property testing and increase risk of developing low-performing fuels. To promote productive SAF research, we used Fourier Transform Infrared (FTIR) spectra to train accurate, interpretable fuel property models. In this presentation, we will discuss how we leveraged standard Python libraries – NumPy, pandas, and scikit-learn – and Non-negative Matrix Factorization to decompose FTIR spectra and develop predictive models. Specifically, we will review the pipeline developed for preprocessing FTIR data, the ensemble models used for property prediction, and how the features correlate with physicochemical properties.
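A pipeline of this general shape, Non-negative Matrix Factorization followed by an ensemble regressor, might be sketched as below. Everything here is assumed for illustration: the spectra are synthetic Gaussian peaks rather than real FTIR measurements, the target property is invented, and a random forest stands in for whichever ensemble models the authors used.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
wavenumbers = np.linspace(600.0, 4000.0, 200)


def peak(center, width):
    """A Gaussian absorption band on the wavenumber grid (synthetic)."""
    return np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)


# Three made-up "pure component" spectra mixed with random non-negative weights.
pure_components = np.vstack([peak(1000.0, 80.0), peak(1700.0, 60.0), peak(2900.0, 100.0)])
concentrations = rng.uniform(0.0, 1.0, size=(150, 3))
spectra = concentrations @ pure_components + rng.uniform(0.0, 0.01, size=(150, 200))
fuel_property = concentrations @ np.array([2.0, -1.0, 0.5])  # invented target

# Decompose training spectra into non-negative components and per-sample weights.
nmf = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
train_weights = nmf.fit_transform(spectra[:100])
test_weights = nmf.transform(spectra[100:])

# Use the NMF weights as interpretable features for the property model.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train_weights, fuel_property[:100])
predictions = model.predict(test_weights)
print(predictions[:5])
```

Because both the components and the weights are non-negative, each learned component can be read as a spectrum-like pattern, which is what makes the resulting features interpretable.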

GPUs & ML – Beyond Deep Learning

2025-07-10 · SciPy 2025

talk

by Simon Adorf

AI/ML API Data Science

This talk explores various methods to accelerate traditional machine learning pipelines using scikit-learn, UMAP, and HDBSCAN on GPUs. We will contrast the experimental Array API Standard support layer in scikit-learn with the cuML library from the NVIDIA RAPIDS Data Science stack, including its zero-code change acceleration capability. ML and data science practitioners will learn how to seamlessly accelerate machine learning workflows, see the resulting performance benefits, and receive practical guidance for different problem types and sizes. The talk also offers insights into minimizing cost and runtime by effectively mixing hardware for various tasks, and covers the current implementation status and future plans for these acceleration methods.

Building machine learning pipelines that scale: a case study using Ibis and IbisML

2025-07-07 · SciPy 2025

talk

by Anjali Datta, Deepyaman Datta

AI/ML Analytics Data Engineering Pandas Python SQL

Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.

Machine Learning Model Deployment

2025-06-10 · Data + AI Summit 2025

talk

AI/ML Data Lakehouse Databricks Delta Python

This course is designed to introduce three primary machine learning deployment strategies and illustrate the implementation of each strategy on Databricks. Following an exploration of the fundamentals of model deployment, the course delves into batch inference, offering hands-on demonstrations and labs for utilizing a model in batch inference scenarios, along with considerations for performance optimization. The second part of the course comprehensively covers pipeline deployment, while the final segment focuses on real-time deployment. Participants will engage in hands-on demonstrations and labs, deploying models with Model Serving and utilizing the serving endpoint for real-time inference. By mastering deployment strategies for a variety of use cases, learners will gain the practical skills needed to move machine learning models from experimentation to production. This course shows you how to operationalize AI solutions efficiently, whether it's automating decisions in real-time or integrating intelligent insights into data pipelines.

Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. common Python libraries for DS/ML like Scikit-Learn, awareness of model deployment strategies)

Labs: Yes

Certification Path: Databricks Certified Machine Learning Associate

Machine Learning Model Development

2025-06-09 · Data + AI Summit 2025

talk

AI/ML Data Lakehouse Databricks Delta Python

In this course, you’ll learn how to develop traditional machine learning models on Databricks. We’ll cover topics like using popular ML libraries, executing common tasks efficiently with AutoML and MLflow, harnessing Databricks' capabilities to track model training, leveraging feature stores for model development, and implementing hyperparameter tuning. Additionally, the course covers AutoML for rapid and low-code model training, ensuring that participants gain practical, real-world skills for streamlined and effective machine learning model development in the Databricks environment.

Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. common Python libraries for DS/ML like Scikit-Learn, fundamental ML algorithms like regression and classification, model evaluation with common metrics)

Labs: Yes

Certification Path: Databricks Certified Machine Learning Associate

Data Preparation for Machine Learning

2025-06-09 · Data + AI Summit 2025

talk

AI/ML DataViz Databricks Matplotlib Pandas PySpark Python

In this course, you’ll learn the fundamentals of preparing data for machine learning using Databricks. We’ll cover topics like exploring, cleaning, and organizing data tailored for traditional machine learning applications. We’ll also cover data visualization, feature engineering, and optimal feature storage strategies. By building a strong foundation in data preparation, this course equips you with the essential skills to create high-quality datasets that can power accurate and reliable machine learning and AI models. Whether you're developing predictive models or enabling downstream AI applications, these capabilities are critical for delivering impactful, data-driven solutions.

Pre-requisites: Familiarity with Databricks workspace, notebooks, and Unity Catalog; intermediate-level knowledge of Python (scikit-learn, Matplotlib), Pandas, and PySpark; and familiarity with the concepts of exploratory data analysis, feature engineering, standardization, and imputation methods.

Labs: Yes

Certification Path: Databricks Certified Machine Learning Associate

Speed up ML workflows up to 50x with NVIDIA RAPIDS in Google Colab

2025-04-09 · Google Cloud Next '25

session

AI/ML Cloud Computing Data Science GCP LLM Pandas Python

Learn how to speed up popular data science libraries such as pandas and scikit-learn by up to 50x in Google Colab using pre-installed NVIDIA RAPIDS Python libraries. Boost both speed and scale for your workflows by simply selecting a GPU runtime in Colab – no code changes required. In addition, Gemini helps Colab users incorporate GPUs and generate pandas code from simple natural language prompts.

This Session is hosted by a Google Cloud Next Sponsor.

Linguistics and Fairness - Tamara Atanasoska

2025-01-17 · DataTalks.Club

podcast_episode

by Tamara Atanasoska (:probabl.)

AI/ML Computer Science GitHub HTML NLP

In this podcast episode, we talked with Tamara Atanasoska about building fair AI systems.

About the Speaker: Tamara works on ML explainability, interpretability and fairness as Open Source Software Engineer at Probabl. She is a maintainer of fairlearn and a contributor to scikit-learn and skops. Tamara has both a computer science/software engineering and a computational linguistics (NLP) background.

During the event, the guest discussed their career journey from software engineering to open-source contributions, focusing on explainability in AI through Scikit-learn and Fairlearn. They explored fairness in AI, including challenges in credit loans, hiring, and decision-making, and emphasized the importance of tools, human judgment, and collaboration. The guest also shared their involvement with PyLadies and encouraged contributions to Fairlearn.

00:00 Introduction to the event and the community
01:51 Topic introduction: Linguistic fairness and socio-technical perspectives in AI
02:37 Guest introduction: Tamara’s background and career
03:18 Tamara’s career journey: Software engineering, music tech, and computational linguistics
09:53 Tamara’s background in language and computer science
14:52 Exploring fairness in AI and its impact on society
21:20 Fairness in AI models
26:21 Automating fairness analysis in models
32:32 Balancing technical and domain expertise in decision-making
37:13 The role of humans in the loop for fairness
40:02 Joining Probabl and working on open-source projects
46:20 skops library and its integration with Hugging Face
50:48 PyLadies and community involvement
55:41 The ethos of Scikit-learn and Fairlearn

🔗 CONNECT WITH TAMARA ATANASOSKA
LinkedIn - https://www.linkedin.com/in/tamaraatanasoska
GitHub - https://github.com/TamaraAtanasoska

🔗 CONNECT WITH DataTalksClub
Join DataTalks.Club: https://datatalks.club/slack.html
Our events: https://datatalks.club/events.html
Datalike Substack - https://datalike.substack.com/
LinkedIn: / datatalks-club

Numerical Python: Scientific Computing and Data Science Applications with Numpy, SciPy and Matplotlib

2024-09-27 · O'Reilly Data Science Books

book

by Robert Johansson

AI/ML Analytics Data Analytics Data Science Matplotlib NumPy Pandas Python SciPy data data-science data-science-tools

Learn how to leverage the scientific computing and data analysis capabilities of Python, its standard library, and popular open-source numerical Python packages like NumPy, SymPy, SciPy, matplotlib, and more. This book demonstrates how to work with mathematical modeling and solve problems with numerical, symbolic, and visualization techniques. It explores applications in science, engineering, data analytics, and more. Numerical Python, Third Edition, presents many case study examples of applications in fundamental scientific computing disciplines, as well as in data science and statistics. This fully revised edition, updated for each library's latest version, demonstrates Python's power for rapid development and exploratory computing due to its simple and high-level syntax and many powerful libraries and tools for computation and data analysis. After reading this book, readers will be familiar with many computing techniques, including array-based and symbolic computing, visualization and numerical file I/O, equation solving, optimization, interpolation and integration, and domain-specific computational problems, such as differential equation solving, data analysis, statistical modeling, and machine learning.

What You'll Learn

Work with vectors and matrices using NumPy
Review symbolic computing with SymPy
Plot and visualize data with Matplotlib
Perform data analysis tasks with Pandas and SciPy
Understand statistical modeling and machine learning with statsmodels and scikit-learn
Optimize Python code using Numba and Cython

Who This Book Is For

Developers who want to understand how to use Python and its ecosystem of libraries for scientific computing and data analysis.

An update on the latest scikit-learn features

2024-09-26 · PyData Paris 2024

talk

by Stefanie Sabine Senger, Guillaume Lemaitre (scikit-learn)

API

In this talk, we provide an update on the latest scikit-learn features implemented in versions 1.4 and 1.5. In particular, we will discuss the following features:

the metadata routing API, which allows metadata to be passed between estimators;
the TunedThresholdClassifierCV, which allows tuning the operational decision threshold through a custom metric;
better support for categorical features and missing values;
interoperability of arrays and dataframes.
