talk-data.com

Topic

Scikit-learn

machine_learning · data_science · data_analysis

7 tagged

Activity Trend: peak of 6 activities per quarter (2020-Q1 to 2026-Q1)

Activities

7 activities · Newest first

PyPI in the face: running jokes that PyPI download stats can play on you

We all love to tell stories with data and we all love to listen to them. Wouldn't it be great if we could also draw actionable insights from these nice stories?

As scikit-learn maintainers, we would love to use PyPI download stats and other proxy metrics (website analytics, GitHub repository statistics, etc.) to help inform some of our decisions, such as:
- How do we increase user awareness of best practices (please use Pipeline and cross-validation)?
- How do we advertise our recent improvements (use HistGradientBoosting rather than GradientBoosting; TunedThresholdClassifier; PCA and a few other models can run on GPU)?
- Do users care more about new features from recent releases or about consolidation of what already exists?
- How long should we support older versions of Python, NumPy, or SciPy?
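
The "best practices" mentioned above can be illustrated with a minimal sketch (not part of the talk; toy data only): a Pipeline evaluated with cross-validation, using HistGradientBoostingClassifier rather than GradientBoostingClassifier.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The Pipeline keeps preprocessing inside the cross-validation loop (avoiding
# leakage); the scaler is only there to illustrate that pattern, since
# gradient-boosted trees do not need feature scaling.
model = make_pipeline(StandardScaler(), HistGradientBoostingClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")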

In this talk we will highlight a number of lessons learned while trying to understand the complex reality behind these seemingly simple metrics.

Telling nice stories is not always hard, but trying to grasp the reality behind these metrics is often tricky.

Skrub: machine learning for dataframes

Skrub is an open-source package that simplifies machine learning with dataframes by providing a variety of tools to explore, prepare, and feature-engineer dataframes so they can be integrated into scikit-learn pipelines. Skrub DataOps make it possible to build extensive, multi-table wrangling plans, explore hyperparameter spaces, and export the resulting objects for deployment. The talk showcases, through code examples and demonstrations, various use cases where skrub can simplify a data scientist's job from data preparation to deployment.
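
As a minimal sketch of that idea (not from the talk; the tiny dataframe below is invented for illustration), skrub's TableVectorizer can turn a raw, mixed-type dataframe into features that feed directly into a scikit-learn estimator:

import pandas as pd
from skrub import TableVectorizer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

# Toy dataframe mixing categorical and datetime columns.
df = pd.DataFrame({
    "job_title": ["nurse", "engineer", "teacher", "engineer"],
    "department": ["health", "it", "education", "it"],
    "hire_date": pd.to_datetime(["2015-03-02", "2018-07-15", "2012-01-20", "2020-11-05"]),
})
salary = [45000, 72000, 40000, 68000]

# TableVectorizer encodes each column according to its type (one-hot for
# low-cardinality categories, date features for datetimes, etc.), so the whole
# dataframe can be fed to any scikit-learn model.
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
model.fit(df, salary)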

A Hitchhiker's Guide to the Array API Standard Ecosystem

The array API standard is unifying the ecosystem of Python array computing, facilitating greater interoperability between code written for different array libraries, including NumPy, CuPy, PyTorch, JAX, and Dask.

But what are all of these "array-api-" libraries for? How can you use them to 'future-proof' your own libraries and provide support for GPU and distributed arrays to your users? Find out in this talk, where I'll guide you through every corner of the array API standard ecosystem, explaining how SciPy and scikit-learn are using all of these tools to adopt the standard. I'll also be sharing progress updates from the past year, to give you a clear picture of where we are now and what the future holds.
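
For readers unfamiliar with what "array-API-agnostic" code looks like, here is a minimal sketch (not from the talk) using the array-api-compat helper: the function runs unchanged on NumPy, CuPy, or PyTorch arrays because it only calls functions from the standard namespace.

import numpy as np
from array_api_compat import array_namespace

def softmax(x):
    # Look up the array API namespace of whatever array was passed in,
    # so the same code works for NumPy, CuPy, PyTorch, etc.
    xp = array_namespace(x)
    shifted = x - xp.max(x, axis=-1, keepdims=True)
    exp = xp.exp(shifted)
    return exp / xp.sum(exp, axis=-1, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))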

Open-source Business

Challenges in economics and governance models for open-source scientific projects

In this presentation, the CEOs of two companies at the forefront of open-source scientific software development - Sylvain Corlay of QuantStack and Yann Lechelle of Probabl - examine the intricate challenges of open-source funding and governance and reflect on how these two aspects interconnect.

We start by reflecting on the origins of the open-source movement within the scientific community, and delve into the contemporary challenges of operating businesses and identifying sustainable economic models that both leverage and contribute to open-source software.

In particular, we highlight the unique approaches and experiences of QuantStack and Probabl, which primarily contribute to multi-stakeholder scientific projects such as scikit-learn, Jupyter, Apache Arrow, or conda-forge.

Optimize the Right Thing: Cost-Sensitive Classification in Practice

Not all mistakes in machine learning are equal: a false negative in fraud detection or medical diagnosis can be far costlier than a false positive. Cost-sensitive learning helps navigate these trade-offs by incorporating error costs into the training process, leading to smarter decision-making. This talk introduces Empulse, an open-source Python package that brings cost-sensitive learning into scikit-learn. Attendees will learn why standard models fall short in cost-sensitive scenarios and how to build better classifiers with scikit-learn and Empulse.
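
To illustrate the underlying idea with plain scikit-learn rather than Empulse's own API (the costs below are invented for the example), the decision threshold of a classifier can be tuned against an asymmetric cost function using TunedThresholdClassifierCV (scikit-learn >= 1.5):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV

# Hypothetical costs: a missed fraud case (false negative) costs 50 times
# as much as a needless investigation (false positive).
def negative_total_cost(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -(50 * fn + 1 * fp)  # negated because scorers are maximized

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

model = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1000),
    scoring=make_scorer(negative_total_cost),
)
model.fit(X, y)
print(f"threshold tuned to {model.best_threshold_:.2f} instead of the default 0.50")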

US Army Corps of Engineers Enhanced Commerce & National Security Through Data-Driven Geospatial Insight

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE with a nationally standardized remote monitoring and documentation system across multiple vessel types, with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, and the emergence of the cloud and data lakehouse architectures has empowered USACE to move successfully into the modern data era.

This session incorporates aspects of these topics:
- Data lakehouse architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, data lake, Apache Iceberg, Data Mesh
- GIS: H3, Mosaic, spatial analysis
- Data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark internals
- Data streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
- ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
- Data governance: security, compliance, RMF, NIST
- Data sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz


Predicting Repeat Admissions to Substance Abuse Treatment with Machine Learning

In our presentation, we will walk through a model created to predict repeat admissions to substance abuse treatment centers. The goal is to predict early who will be at high risk for relapse, so care can be tailored to put additional focus on these patients. We used the Treatment Episode Data Set (TEDS) Admissions data, which include every publicly funded substance abuse treatment admission in the US.

While longitudinal data is not available in the dataset, we were able to predict which admissions were first admissions and which were repeat admissions with 88% accuracy and an F-score of 0.85. Our solution used a scikit-learn random forest model and leveraged MLflow to track model metrics and choose the most effective model. Our pipeline tested over 100 models of different types, ranging from gradient-boosted trees to deep neural networks in TensorFlow.

To improve model interpretability, we used Shapley values to measure which variables were most important for predicting readmission. These model metrics, along with other valuable data, are visualized in an interactive Power BI dashboard designed to help practitioners understand whom to focus on during treatment. We are in discussions with companies and researchers who may be able to leverage this model in substance abuse treatment centers in the field.
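
As a rough, self-contained sketch of this kind of workflow (synthetic data stands in for TEDS; this is not the authors' actual pipeline), a random forest can be trained, its metrics tracked with MLflow, and its predictions explained with Shapley values:

import mlflow
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the TEDS admissions features (invented for this sketch).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Log the metrics used to compare candidate models across runs.
    mlflow.log_metric("accuracy", accuracy_score(y_test, pred))
    mlflow.log_metric("f1", f1_score(y_test, pred))

# Shapley values: which features drive the repeat-admission prediction?
shap_values = shap.TreeExplainer(model).shap_values(X_test)
# Depending on the shap version, a binary classifier yields either a list of
# two arrays (one per class) or a single (n_samples, n_features, 2) array.
values = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
mean_abs_shap = np.abs(values).mean(axis=0)
print("most influential feature index:", int(mean_abs_shap.argmax()))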
