SciPy 2025

Real-world Impacts of Generative AI in the Research Software Engineer and Data Scientist Workplace

2025-07-11

talk

Steve Van Tuyl

AI/ML Data Science GenAI LLM

Recent breakthroughs in large language model-based artificial intelligence (AI) have captured the public’s interest in AI more broadly. With the growing adoption of these technologies in professional and educational settings, public dialog about their potential impacts on the workforce has been ubiquitous. It is, however, difficult to separate the public dialog about the potential impact of the technology from the experienced impact of the technology in the research software engineer and data science workplace. Likewise, it is challenging to separate the generalized anxiety about AI from its specific impacts on individuals working in specialized work settings.

As research software engineers (RSEs) and those in adjacent computational fields engage with AI in the workplace, the realities of the impacts of this technology are becoming clearer. However, much of the dialog has been limited to high-level discussion of general intra-institutional impacts and lacks the nuance needed to give helpful guidance specifically to RSE practitioners in research settings. Surprisingly, many RSEs are not involved in career discussions on what the rise of AI means for their professions.

During this BoF, we will hold a structured, interactive discussion session with the goal of identifying critical areas of engagement with AI in the workplace including: current use of AI, AI assistance and automation, AI skills and workforce development, AI and open science, and AI futures. This BoF will represent the first of a series of discussions held jointly by the Academic Data Science Alliance and the US Research Software Engineer Association over the coming year, with support from Schmidt Sciences. The insights gathered from these sessions will inform the development of guidance resources on these topic areas for the broader RSE and computational data practitioner communities.

Getting all your snakes in a grid: collaborating and teaching with Python in Excel and the Anaconda Toolbox

2025-07-10

talk

Sarah Kaiser

Data Science Google Sheets Python

Working with data in grids or spreadsheets is great for collaboration, as there are many different tools to view and edit the files. Data science workflows often include packages like openpyxl to create, load, edit, and export spreadsheets that are then shared with others, who can use tools like Excel, Google Sheets, or IDEs to view them. The new Python in Excel feature and the Anaconda Toolbox add-in provide the tools to run Python directly in the cells of a spreadsheet, making it easier for Pythonistas to access and collaborate on code. This talk will introduce how these features work, demo collaborating on Python code in a worksheet, and discuss some case studies where these tools have been used to teach and collaborate with Python.

Unlocking the Missing 78%: Inclusive Communities for the Future of Scientific Python

2025-07-10

talk

Noor Aftab

AI/ML Data Science IBM Python

Women remain critically underrepresented in data science and Python communities, comprising only 15–22% of professionals globally and less than 3% of contributors to Python open-source projects. This disparity not only limits diversity but also represents a missed opportunity for innovation and community growth. This talk explores actionable strategies to address these gaps, drawing from my leadership in Women in AI at IBM, TechWomen mentorship, and initiatives with NumFOCUS. Attendees will gain insights and practical steps to create inclusive environments, foster diverse collaboration, and ensure the scientific Python community thrives by unlocking its full potential.

Python at the Speed of Light: Accelerating Science with CUDA Python

2025-07-10

talk

Christopher Lamb, VP of Software for Compute Platforms at NVIDIA

AI/ML Data Science Python

NVIDIA’s CUDA platform has long been the backbone of high-performance GPU computing, but its power has historically been gated behind C and C++ expertise. With the recent introduction of native Python support, CUDA is now more accessible from the programming language you know and love, ushering in a new era for scientific computing, data science, and AI development.

GPUs & ML – Beyond Deep Learning

2025-07-10

talk

Simon Adorf

AI/ML API Data Science Scikit-learn

This talk explores various methods to accelerate traditional machine learning pipelines using scikit-learn, UMAP, and HDBSCAN on GPUs. We will contrast the experimental Array API Standard support layer in scikit-learn with the cuML library from the NVIDIA RAPIDS Data Science stack, including its zero-code-change acceleration capability. ML and data science practitioners will learn how to seamlessly accelerate machine learning workflows, see the performance benefits, and receive practical guidance for different problem types and sizes. The talk will also provide insights into minimizing cost and runtime by effectively mixing hardware for various tasks, as well as the current implementation status and future plans for these acceleration methods.

cuTile, the New/Old Kid on the Block: Python Programming Models for GPUs

2025-07-09

talk

Bryce Adelstein Lelbach (NVIDIA)

AI/ML C#/.NET Data Science GitHub HTML LLM

Block-based programming divides inputs into local arrays that are processed concurrently by groups of threads. Users write sequential array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes. This approach aligns well with SciPy's array-centric ethos and has roots in older HPC libraries, such as NWChem’s TCE, BLIS, and ATLAS.
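The block-based idea described above can be sketched in plain Python. This is only an illustration of the general pattern, not cuTile, Triton, or Pallas code: the tile size `TILE` and the helper names `load_tile`, `tile_matmul_accumulate`, and `tiled_matmul` are hypothetical, and the loop over output tiles runs sequentially here, whereas a GPU framework would assign each output tile to its own group of threads.

```python
# Pure-Python sketch of block-based (tiled) matrix multiplication.
# Not real cuTile/Triton/Pallas code; all names below are hypothetical.

TILE = 2  # assumed tile edge length; real frameworks tune this per hardware


def load_tile(m, row0, col0, size):
    """Copy a size x size tile out of matrix m, zero-padding past the edges."""
    rows, cols = len(m), len(m[0])
    return [
        [m[r][c] if r < rows and c < cols else 0.0
         for c in range(col0, col0 + size)]
        for r in range(row0, row0 + size)
    ]


def tile_matmul_accumulate(acc, a_tile, b_tile):
    """Sequential, array-centric work on one pair of tiles: acc += a_tile @ b_tile."""
    size = len(acc)
    for i in range(size):
        for j in range(size):
            acc[i][j] += sum(a_tile[i][k] * b_tile[k][j] for k in range(size))


def tiled_matmul(a, b, tile=TILE):
    """Multiply a (n x k) by b (k x m), producing one output tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    # Each (row0, col0) pair is an independent unit of work.
    for row0 in range(0, n, tile):
        for col0 in range(0, m, tile):
            acc = [[0.0] * tile for _ in range(tile)]
            for k0 in range(0, k, tile):
                a_tile = load_tile(a, row0, k0, tile)
                b_tile = load_tile(b, k0, col0, tile)
                tile_matmul_accumulate(acc, a_tile, b_tile)
            # Store the finished tile, skipping any zero padding.
            for i in range(min(tile, n - row0)):
                for j in range(min(tile, m - col0)):
                    out[row0 + i][col0 + j] = acc[i][j]
    return out


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = tiled_matmul(a, b)
```

The code inside `tile_matmul_accumulate` never mentions threads or synchronization; in this style, that part is what the user writes, and the outer scheduling is what the framework is expected to take over.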

In recent years, many block-based Python programming models for GPUs have emerged, like Triton, JAX/Pallas, and Warp, aiming to make parallelism more accessible for scientists and increase portability.

In this talk, we'll present cuTile and Tile IR, a new Pythonic tile-based programming model and compiler recently announced by NVIDIA. We'll explore cuTile examples from a variety of domains, including a new LLAMA3-based reference app and a port of miniWeather. You'll learn the best practices for writing and debugging block-based Python GPU code, gain insight into how such code performs, and learn how it differs from traditional SIMT programming.

By the end of the session, you'll understand how block-based GPU programming enables more intuitive, portable, and efficient development of high-performance, data-parallel Python applications for HPC, data science, and machine learning.

Keeping LLMs in Their Lane: Focused AI for Data Science and Research

2025-07-09

talk

Joe Cheng

AI/ML Data Science LLM

LLMs are powerful, flexible, easy-to-use... and often wrong. This is a dangerous combination, especially for data analysis and scientific research, where correctness and reproducibility are core requirements. Fortunately, it turns out that by carefully applying LLMs to narrower use cases, we can turn them into surprisingly reliable assistants that accelerate and enhance, rather than undermine, scientific work.

This is not just theory—I’ll showcase working examples of seamlessly integrating LLMs into analytic workflows, helping data scientists build interactive, intelligent applications without needing to be web developers. You’ll see firsthand how keeping LLMs focused lets us leverage their "intelligence" in a way that’s practical, rigorous, and reproducible.

Accelerating Genomic Data Science and AI/ML with Composability

2025-07-09

talk

Nezar Abdennur, Trevor Manz

AI/ML Analytics Arrow Data Analytics Data Science Python

The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present two bridge libraries as short vignettes for composable bioinformatics. First, we present Anywidget, an architecture and toolkit based on modern web standards for sharing interactive widgets across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, VSCode, and more. Second, we present Oxbow, a Rust and Python-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Together, we demonstrate the composition of these libraries to build custom, connected genomic analysis and visualization environments. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.
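To make the phrase "columnar representation" concrete, here is a minimal standard-library sketch of turning row-oriented records into columns. It does not use Apache Arrow or Oxbow and makes no claim about how either library works internally; the sample intervals and the `to_columns` helper are hypothetical and exist only for illustration.

```python
# Row records -> columnar layout, in pure Python.
# Not Apache Arrow or Oxbow code; data and names are hypothetical.
from array import array

# Hypothetical interval records (chromosome, start, end), one tuple per row,
# as a row-oriented file parser might yield them.
records = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr2", 50, 75),
]


def to_columns(rows):
    """Regroup row tuples into one contiguous sequence per field."""
    chrom = []
    start = array("q")  # typed 64-bit integer column
    end = array("q")
    for c, s, e in rows:
        chrom.append(c)
        start.append(s)
        end.append(e)
    return {"chrom": chrom, "start": start, "end": end}


columns = to_columns(records)

# Column-wise computation touches only the fields it needs.
lengths = [e - s for s, e in zip(columns["start"], columns["end"])]
```

Once the data sits in columns, a downstream tool that understands the layout can work on `start` and `end` without knowing anything about the file format the rows came from, which is the kind of decoupling the abstract describes.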

Bring Accelerated Computing to Data Science in Python

2025-07-08

talk

Kevin Lee

Data Science Python

As data science continues to evolve, the ever-growing size of datasets poses significant computational challenges. Traditional CPU-based processing often struggles to keep pace with the demands of data science workflows. Accelerated computing with GPUs offers a solution by enabling massive parallelism and significantly reducing processing times for data-heavy tasks. In this session, we will explore GPU computing architecture, how it differs from CPUs, and why it is particularly well-suited for data science workloads. This hands-on lab will dive into the different approaches to GPU programming, from low-level CUDA coding to high-level Python libraries within RAPIDS such as CuPy, cuDF, cuGraph, and cuML.

Geospatial data visualisation in Python

2025-07-08

talk

Adam Symington

Data Science DataViz Matplotlib Plotly Python

The rapid expansion of the geospatial industry and the accompanying increase in availability of geospatial data present unique opportunities and challenges in data science. As the need for skilled data scientists increases, the ability to manipulate and interpret this data becomes crucial. This workshop introduces the essentials of geospatial data manipulation and data visualisation, emphasizing hands-on techniques to transform, analyze and visualise diverse datasets effectively.

Throughout the workshop, attendees will explore the extensive ecosystem of geospatial Python libraries. Key tools include GeoPandas, Shapely and Cartopy for vector data, and GDAL, Rasterio and rioxarray for raster data. Participants will also learn to integrate these with popular plotting libraries such as Matplotlib, Bokeh, and Plotly for visualizations.

This tutorial will cover three primary topics: visualizing geospatial shapes, managing raster datasets, and synthesizing multiple data types into unified visual representations. Each section will incorporate data manipulation exercises to ensure attendees not only visualize but also deeply understand geospatial data.

Targeting both beginners and advanced practitioners, the workshop will employ real-world examples to guide participants through the necessary steps to produce striking and informative geospatial visualizations. By the end, attendees will be equipped with the knowledge to leverage advanced data science techniques in their geospatial projects, making them proficient in both the analysis and communication of spatial information.
