
Event

SciPy 2025

2025-07-07 – 2025-07-13 · PyData

Activities tracked

67

Filtering by: Python

Sessions & talks

Showing 26–50 of 67 · Newest first


Unlocking the Missing 78%: Inclusive Communities for the Future of Scientific Python

2025-07-10
talk

Women remain critically underrepresented in data science and Python communities, comprising only 15–22% of professionals globally and less than 3% of contributors to Python open-source projects. This disparity not only limits diversity but also represents a missed opportunity for innovation and community growth. This talk explores actionable strategies to address these gaps, drawing from my leadership in Women in AI at IBM, TechWomen mentorship, and initiatives with NumFOCUS. Attendees will gain insights and practical steps to create inclusive environments, foster diverse collaboration, and ensure the scientific Python community thrives by unlocking its full potential.

Open Code, Open Science: What’s Getting in Your Way?

2025-07-10
talk

Collaborating on code and software is essential to open science—but it’s not always easy. Join this BoF for an interactive discussion on the real-world challenges of open source collaboration. We’ll explore common hurdles like Python packaging, contributing to existing codebases, and emerging issues around LLM-assisted development and AI-generated software contributions.

We’ll kick off with a brief overview of pyOpenSci—an inclusive community of Pythonistas, from novices to experts—working to make it easier to create, find, share, and contribute to reusable code. We’ll then facilitate small-group discussions and use an interactive Mentimeter survey to help you share your experiences and ideas.

Your feedback will directly shape pyOpenSci’s priorities for the coming year, as we build new programs and resources to support your work in the Python scientific ecosystem. Whether you’re just starting out or a seasoned developer, you’ll leave with clear ways to get involved and make an impact on the broader Python ecosystem in service of advancing scientific discovery.

Organizing Conferences in These Times

2025-07-10
talk

Conferences serve as a way to connect groups of humans around common topics of interest. In the open source community, they have played a critical role in knowledge sharing, advancing technology, and fostering a sense of community. This is especially true for the global Python community. Times are changing: the political climate both in the US and abroad has shifted drastically, making gathering in the real world much more complex. Advances in technology have also changed the calculus on what is considered quality participation. Join us in this BoF to discuss these challenges and how we can continue to come together as a community.

Python at the Speed of Light: Accelerating Science with CUDA Python

2025-07-10
talk

NVIDIA’s CUDA platform has long been the backbone of high-performance GPU computing, but its power has historically been gated behind C and C++ expertise. With the recent introduction of native Python support, CUDA is now accessible from the programming language you know and love, ushering in a new era for scientific computing, data science, and AI development.
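As a point of reference only (not the new native CUDA Python bindings the talk covers, whose API may differ), here is what writing a GPU kernel from Python has long looked like with Numba's CUDA target:

```python
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    # each GPU thread handles one element
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)
out = np.zeros_like(x)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# host arrays are copied to and from the GPU automatically
saxpy[blocks, threads_per_block](np.float32(2.0), x, y, out)
```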

Advanced Machine Learning Techniques for Predicting Properties of Synthetic Aviation Fuels using Python

2025-07-10
talk

Synthetic aviation fuels (SAFs) offer a pathway to improving efficiency, but high cost and volume requirements hinder property testing and increase the risk of developing low-performing fuels. To promote productive SAF research, we used Fourier Transform Infrared (FTIR) spectra to train accurate, interpretable fuel property models. In this presentation, we will discuss how we leveraged standard Python libraries – NumPy, pandas, and scikit-learn – and Non-negative Matrix Factorization to decompose FTIR spectra and develop predictive models. Specifically, we will review the pipeline developed for preprocessing FTIR data, the ensemble models used for property prediction, and how the features correlate with physicochemical properties.
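A rough sketch of this kind of pipeline, using synthetic stand-in data (the component count, property values, and choice of regressor below are illustrative assumptions, not the authors' exact setup):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import GradientBoostingRegressor

# stand-in for an (n_samples, n_wavenumbers) matrix of non-negative FTIR absorbances
rng = np.random.default_rng(0)
spectra = np.abs(rng.normal(size=(40, 500)))
density = rng.normal(loc=0.78, scale=0.02, size=40)   # hypothetical fuel property

nmf = NMF(n_components=6, init="nndsvd", max_iter=500)
weights = nmf.fit_transform(spectra)      # per-sample loadings on spectral components
components = nmf.components_              # interpretable spectral basis

model = GradientBoostingRegressor().fit(weights, density)
print(model.predict(weights[:3]))
```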

Can Scientific Python Tools Unlock the Secrets of Materials? The Electrons That Machine-Learning Can't Handle

2025-07-10
talk

Designing tomorrow's materials requires understanding how atoms behave – a challenge that's both fascinating and incredibly complex. While machine learning offers exciting speedups in materials simulation, it often falls short, missing vital electronic structure information needed to connect theory with experimental results. This work introduces a powerful solution: Density Functional Tight Binding (DFTB), which, combined with the versatile tools of Scientific Python, allows us to understand the electronic behavior of materials while maintaining computational efficiency. In this talk, I will present our findings demonstrating how DFTB, coupled with readily available Python packages, allows for direct comparison between theoretical predictions and experimental data, such as XPS measurements. I will also showcase our publicly available repository, containing DFTB parameters for a wide range of materials, making this powerful approach accessible to the broader research community.

RydIQule: A Package for Modelling Quantum Sensors

2025-07-10
talk

Rydberg atoms offer unique quantum properties that enable radio-frequency sensing capabilities distinct from any classical analogue; however, large parameter spaces and complex configurations make understanding and designing these quantum experiments challenging. Current solutions are often developed as in-house, closed-source software simulating a narrow range of problems. We present RydIQule, an open-source package leveraging the tools of computational Python in novel ways to model the behavior of these systems generally. We describe RydIQule’s approach to representing quantum systems using computational graphs and leveraging NumPy broadcasting to define complete experiments. In addition to discussing the computational challenges RydIQule helps overcome, we outline how collaboration between physics and computational research backgrounds has led to this impactful tool.
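The broadcasting idea can be illustrated with a toy two-level system; the formula and parameter ranges below are illustrative only and not RydIQule's actual model:

```python
import numpy as np

# scan detuning and Rabi frequency jointly; broadcasting builds the full parameter grid
detuning = np.linspace(-10, 10, 201)[:, np.newaxis]   # shape (201, 1)
rabi = np.linspace(0.1, 5.0, 50)[np.newaxis, :]       # shape (1, 50)

# toy steady-state excited-state population of a driven, damped two-level atom
population = (rabi**2 / 4) / (detuning**2 + rabi**2 / 2 + 0.25)

print(population.shape)  # (201, 50): every parameter combination computed at once
```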

My Dinner with Numeric, Numpy, and Scipy: A Retrospective from 2001 to 2025 with Comments and Anecdotes

2025-07-10
talk

This keynote will trace the personal journey of NumPy's development and the evolution of the SciPy community from 2001 to the present. Drawing on over two decades of involvement, I’ll reflect on how a small group of enthusiastic contributors grew into a vibrant, global ecosystem that now forms the foundation of scientific computing in Python. Through stories, milestones, and community moments, we’ll explore the challenges, breakthroughs, and collaborative spirit that shaped both NumPy and the SciPy conventions over the years.

KvikUproot - Reading and Deserializing High Energy Physics Data with KvikIO and CuPy

2025-07-09
talk

Computational needs in high energy physics applications are increasingly met by using GPUs as hardware accelerators, but achieving the highest throughput requires reading data directly into GPU memory. This has yet to be achieved for HEP’s standard domain-specific “ROOT” file formats. Using KvikIO’s Python bindings to cuFile and nvCOMP, KvikUproot is a prototype package that supports reading ROOT files on the GPU. On GPUDirect Storage (GDS) enabled systems, data bypasses the CPU and is loaded directly from storage to the GPU. We will discuss the methodology we developed to read ROOT files into GPUs via RDMA.
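A minimal sketch of the underlying idea, assuming a hypothetical file name and byte count (KvikUproot layers ROOT deserialization on top of reads like this):

```python
import cupy as cp
import kvikio

nbytes = 4 * 1024 * 1024
gpu_buf = cp.empty(nbytes, dtype=cp.uint8)   # destination buffer in GPU memory

f = kvikio.CuFile("events.root", "r")        # hypothetical ROOT file
f.read(gpu_buf)                              # on GDS systems this bypasses host memory
f.close()
# deserializing the ROOT structures would then proceed on-device with CuPy
```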

cuTile, the New/Old Kid on the Block: Python Programming Models for GPUs

2025-07-09
talk

Block-based programming divides inputs into local arrays that are processed concurrently by groups of threads. Users write sequential array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes. This approach aligns well with SciPy's array-centric ethos and has roots in older HPC libraries, such as NWChem’s TCE, BLIS, and ATLAS.

In recent years, many block-based Python programming models for GPUs have emerged, like Triton, JAX/Pallas, and Warp, aiming to make parallelism more accessible for scientists and increase portability.
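cuTile's own API is not reproduced here; as a flavor of the block-based style it is compared against, this is the canonical Triton vector-add kernel, where the user writes array-centric code per block and the framework handles threads and data movement:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                           # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                           # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```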

In this talk, we'll present cuTile and Tile IR, a new Pythonic tile-based programming model and compiler recently announced by NVIDIA. We'll explore cuTile examples from a variety of domains, including a new LLAMA3-based reference app and a port of miniWeather. You'll learn the best practices for writing and debugging block-based Python GPU code, gain insight into how such code performs, and learn how it differs from traditional SIMT programming.

By the end of the session, you'll understand how block-based GPU programming enables more intuitive, portable, and efficient development of high-performance, data-parallel Python applications for HPC, data science, and machine learning.

tobac: Tracking Atmospheric Phenomena on Multiscale, Multivariate Diverse Datasets

2025-07-09
talk

Tracking and Object-Based Analysis of Clouds (tobac) is a Python package that enables researchers to identify, track, and perform object-based analyses of phenomena in large atmospheric datasets. Over the past four years, tobac’s userbase has grown within atmospheric science, and the package has transitioned from its original life as a small, focused package with few maintainers to a larger package with more robust governance and structure. In this presentation, we will discuss the challenges and lessons learned during the transition to robust governance structures and the future of tobac as we incorporate new techniques for using multiple variables and scales to track the same system.

Escaping Proof-of-Concept Purgatory: Building Robust LLM-Powered Applications

2025-07-09
talk
Hugo Bowne-Anderson (Outerbounds)

Large language models (LLMs) enable powerful data-driven applications, but many projects get stuck in “proof-of-concept purgatory”—where flashy demos fail to translate into reliable, production-ready software. This talk introduces the LLM software development lifecycle (SDLC)—a structured approach to moving beyond early-stage prototypes. Using first principles from software engineering, observability, and iterative evaluation, we’ll cover common pitfalls, techniques for structured output extraction, and methods for improving reliability in real-world data applications. Attendees will leave with concrete strategies for integrating AI into scientific Python workflows—ensuring LLMs generate value beyond the prototype stage.
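One common technique for structured output extraction (shown here as an illustration, not necessarily the speaker's exact approach) is validating the model's reply against a schema, for example with Pydantic; the schema fields below are hypothetical:

```python
from pydantic import BaseModel, ValidationError

class Finding(BaseModel):
    compound: str
    melting_point_c: float
    source: str

def parse_reply(raw_json: str) -> Finding | None:
    """Validate an LLM's JSON reply against a schema instead of trusting it blindly."""
    try:
        return Finding.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller can re-prompt, retry, or log the failure

print(parse_reply('{"compound": "benzene", "melting_point_c": 5.5, "source": "CRC Handbook"}'))
```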

Packaging a Scientific Python Project

2025-07-09
talk

One of the most important aspects of developing scientific software is distributing it to others. The Scientific Python Development Guide was developed to provide up-to-date best practices for packaging, linting, and testing, along with a versatile template supporting multiple backends, and a WebAssembly-powered repo-review tool to check a repository directly in the guide. This talk, with the guide for reference, will cover key best practices for project setup, backend selection, packaging metadata, GitHub Actions for testing and deployment, and tools for validating code quality. We will even cover tools for packaging compiled components that are simple enough for anyone to use.

Python is all you need: an overview of the composable, Python-native data stack

2025-07-09
talk

For the past decade, SQL has reigned as king of the data transformation world, and tools like dbt have formed a cornerstone of the modern data stack. Until recently, Python-first alternatives couldn't compete with the scale and performance of modern SQL. Now Ibis can provide the same benefits of SQL execution with a flexible Python dataframe API.

In this talk, you will learn how Ibis supercharges existing open-source libraries like Kedro and Pandera and how you can combine these technologies (and a few more) to build and orchestrate scalable data engineering pipelines without sacrificing the comfort (and other advantages) of Python.
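A small taste of the dataframe API in question (the table and column names below are made up, and the default backend is assumed to be DuckDB):

```python
import ibis
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "north"], "sales": [10.0, 7.5, 12.0]})
t = ibis.memtable(df)                      # wrap an in-memory frame as an Ibis table

expr = (
    t.group_by("region")
     .aggregate(total=t.sales.sum(), avg=t.sales.mean())
     .order_by("region")
)
print(expr.execute())                      # lazily built, then executed by the backend
```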

Cubed: Scalable array processing with bounded-memory in Python

2025-07-09
talk

Cubed is a framework for distributed processing of large arrays without a cluster. Designed to respect memory constraints at all times, Cubed can express any NumPy-like array operation as a series of embarrassingly parallel, bounded-memory steps. By using Zarr as persistent storage between steps, Cubed can run in a serverless fashion both on a local machine and on a range of cloud platforms. After explaining Cubed’s model, we will show how Cubed has been integrated with Xarray and demonstrate its performance on various large array geoscience workloads.

CuPy: My Journey toward GPU-Accelerated Computing in Python

2025-07-09
talk

This talk walks all Pythonistas through recent CuPy feature development. Join me and hear my story of how an open-source novice started contributing to CuPy and helped it grow over the years into a full-fledged, reliable, GPU-accelerated array library that covers most of NumPy, SciPy, and Numba functionality.
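For readers new to CuPy, the drop-in flavor of the API looks roughly like this:

```python
import numpy as np
import cupy as cp

x_cpu = np.random.rand(1_000_000).astype(np.float32)
x_gpu = cp.asarray(x_cpu)            # host -> device copy
y_gpu = cp.fft.rfft(x_gpu * 2.0)     # NumPy/SciPy-like calls, executed on the GPU
y_cpu = cp.asnumpy(y_gpu)            # device -> host copy
print(y_cpu.shape)
```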

Unlocking AI Performance with NeMo Curator: Scalable Data Processing for LLMs

2025-07-09
talk

Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs efficiently. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction—improving data quality and training efficiency.

We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to 7% improvement in LLM downstream tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.

Breaking Out of the Loop: Refactoring Legacy Software with Polars

2025-07-09
talk

Data manipulation libraries like Polars allow us to analyze and process data much faster than with native Python, but that’s only true if you know how to use them properly. When the team working on NCEI's Global Summary of the Month first integrated Polars, they found it was actually slower than the original Java version. In this talk, we'll discuss how our team learned to think about computing problems like spreadsheet programmers, increasing our products’ processing speed by over 80%. We’ll share tips for rewriting legacy code to take advantage of parallel processing. We’ll also cover how we created custom, pre-compiled functions with Numba when the business requirements were too complex for native Polars expressions.
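The shift in mindset is from row-by-row Python loops to whole-column expressions; a toy sketch with invented column names:

```python
import polars as pl

df = pl.DataFrame({
    "station": ["A", "A", "B", "B"],
    "tmax": [30.1, 28.4, 25.0, 26.7],
})

# express the computation as Polars expressions so the engine can parallelize it,
# instead of iterating over rows in Python
summary = df.group_by("station").agg(
    pl.col("tmax").mean().alias("tmax_mean"),
    pl.col("tmax").max().alias("tmax_max"),
)
print(summary)
```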

Burning fuel for cheap! Transport-independent depletion in OpenMC

2025-07-09
talk

OpenMC is an open-source, community-developed Monte Carlo tool for neutron transport simulations, featuring a depletion module for fuel burnup calculations in nuclear reactors and a Python API. Depletion calculations can be expensive because they require solving the neutron transport and Bateman equations in each timestep to update the neutron flux and material composition, respectively. Material properties such as temperature and density govern material cross sections, which in turn govern reaction rates; the reaction rates can in turn affect the neutron population. In a scenario where there is no significant change in the material properties or composition, the transport simulation may only need to be run once, and the same cross sections can be used for the entire depletion calculation. We recently extended the depletion module in OpenMC to enable transport-independent depletion using multigroup cross sections and fluxes. This talk will focus on the technical details of this feature and its validation, and will briefly touch on areas where the feature has been used. Two recent use cases will be highlighted: the first calculates shutdown dose rates for fusion power applications, and the second performs depletion for fission reactor fuel cycle modeling.

GBNet: Gradient Boosting packages integrated into PyTorch

2025-07-09
talk

Gradient Boosting Machines (GBMs) are widely used for their predictive power and interpretability, while Neural Networks offer flexible architectures but can be opaque. GBNet is a Python package that integrates XGBoost and LightGBM with PyTorch. By leveraging PyTorch’s auto-differentiation, GBNet enables novel architectures for GBMs that were previously exclusive to pure Neural Networks. The result is a greatly expanded set of applications for GBMs, together with expressive architectures that remain easier to interpret because their building blocks are GBMs.
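GBNet's own API is not shown here, but the core idea (PyTorch auto-differentiation supplying gradients and Hessians to a gradient-boosting library's custom objective) can be sketched with plain XGBoost:

```python
import numpy as np
import torch
import xgboost as xgb

def autograd_objective(preds: np.ndarray, dtrain: xgb.DMatrix):
    """Custom objective whose per-element gradient and Hessian come from PyTorch autograd."""
    y = torch.as_tensor(dtrain.get_label(), dtype=torch.float64)
    p = torch.tensor(preds, dtype=torch.float64, requires_grad=True)
    loss = torch.nn.functional.mse_loss(p, y, reduction="sum")
    (grad,) = torch.autograd.grad(loss, p, create_graph=True)
    (hess,) = torch.autograd.grad(grad.sum(), p)   # diagonal Hessian of an element-wise loss
    return grad.detach().numpy(), hess.detach().numpy()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
booster = xgb.train({"max_depth": 3}, xgb.DMatrix(X, label=y),
                    num_boost_round=20, obj=autograd_objective)
```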

Accelerating Genomic Data Science and AI/ML with Composability

2025-07-09
talk

The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present two bridge libraries as short vignettes for composable bioinformatics. First, we present Anywidget, an architecture and toolkit based on modern web standards for sharing interactive widgets across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, VSCode, and more. Second, we present Oxbow, a Rust and Python-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Together, we demonstrate the composition of these libraries to build custom, connected genomic analysis and visualization environments. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.
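As a flavor of the anywidget model (this is the library's well-known counter pattern, not the genomics widgets themselves): front-end code ships as an ES module string, and state syncs through traitlets.

```python
import anywidget
import traitlets

class CounterWidget(anywidget.AnyWidget):
    _esm = """
    function render({ model, el }) {
      const btn = document.createElement("button");
      btn.textContent = `count is ${model.get("value")}`;
      btn.addEventListener("click", () => {
        model.set("value", model.get("value") + 1);
        model.save_changes();
      });
      model.on("change:value", () => {
        btn.textContent = `count is ${model.get("value")}`;
      });
      el.appendChild(btn);
    }
    export default { render };
    """
    value = traitlets.Int(0).tag(sync=True)

CounterWidget()   # displays in any Jupyter-compatible front end
```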

Dynamic Data with Matplotlib

2025-07-09
talk

Matplotlib is already a favorite plotting library for creating static data visualizations in Python. Here, we discuss the development of a new DataContainer interface and accompanying transformation pipeline which enable easier dynamic data visualization in Matplotlib. This improves the experience of plotting pure functions, automatically recomputing when you pan and zoom. Data containers can ingest data from a variety of sources, from structured data such as pandas DataFrames or xarray objects up to live-updating data from web services or databases. The flexible transformation pipeline allows for control over how your data is encoded into a plot.

Python for Climate Science: Using Intake to provide easy access to Climate Model data

2025-07-09
talk

Climate models generate a lot of data, and this can make it hard for researchers to efficiently access and use the data they need. The solutions of yesteryear include standardised file structures, SQLite databases, and just knowing where to look. All of these work, to varying degrees, but can leave new users scratching their heads. In this talk, I'll outline how ACCESS-NRI built tooling around Intake and Intake-ESM to make it easy for climate researchers to access available data, share their own, and avoid writing the same custom scripts over and over to work with the data their experiments generate.
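In practice the Intake-ESM workflow looks roughly like this; the catalog path and search facets below are hypothetical:

```python
import intake

cat = intake.open_esm_datastore("access_nri_catalog.json")   # hypothetical catalog file
subset = cat.search(variable="tas", frequency="mon")          # hypothetical facets
datasets = subset.to_dataset_dict()                           # dict of xarray.Dataset objects
```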

What We Maintain, We Defend

2025-07-09
talk

Scientific Python is not only at the heart of discovery and advancement; it is also infrastructure. This talk will provide a perspective on how open-source Python tools that are already powering real-world impact across the sciences also support public institutions and critical public data infrastructure. Drawing on her previous experience leading policy efforts in the Department of Energy as well as her experience in open-source scientific computing, Katy will highlight the indispensable role of transparency, reproducibility, and community in high-stakes domains. This talk invites the SciPy community to recognize its unique strengths and to amplify its impact by contributing to the public good through technically excellent, civic-minded development.

Bring Accelerated Computing to Data Science in Python

2025-07-08
talk

As data science continues to evolve, the ever-growing size of datasets poses significant computational challenges. Traditional CPU-based processing often struggles to keep pace with the demands of data science workflows. Accelerated computing with GPUs offers a solution by enabling massive parallelism and significantly reducing processing times for data-heavy tasks. In this session, we will explore GPU computing architecture, how it differs from CPUs, and why it is particularly well-suited for data science workloads. This hands-on lab will dive into the different approaches to GPU programming, from low-level CUDA coding to high-level Python libraries within RAPIDS such as, CuPy, cuDF, cuGraph, and cuML.