talk-data.com

Event

SciPy 2025

2025-07-07 – 2025-07-13 PyData

Activities tracked

13

Filtering by: LLM

Sessions & talks

Showing 1–13 of 13 · Newest first


Real-world Impacts of Generative AI in the Research Software Engineer and Data Scientist Workplace

2025-07-11
talk

Recent breakthroughs in large language model-based artificial intelligence (AI) have captured the public’s interest in AI more broadly. With the growing adoption of these technologies in professional and educational settings, public dialog about their potential impacts on the workforce has been ubiquitous. It is difficult, however, to separate public dialog about the technology’s potential impact from its experienced impact in the research software engineering and data science workplace. Likewise, it is challenging to separate generalized anxiety about AI from its specific impacts on individuals working in specialized settings.

As research software engineers (RSEs) and those in adjacent computational fields engage with AI in the workplace, the realities of the impacts of this technology are becoming clearer. However, much of the dialog has been limited to high-level discussion around general intra-institutional impacts, and lacks the nuance required to provide helpful guidance to RSE practitioners in research settings, specifically. Surprisingly, many RSEs are not involved in career discussions on what the rise of AI means for their professions.

During this BoF, we will hold a structured, interactive discussion session with the goal of identifying critical areas of engagement with AI in the workplace including: current use of AI, AI assistance and automation, AI skills and workforce development, AI and open science, and AI futures. This BoF will represent the first of a series of discussions held jointly by the Academic Data Science Alliance and the US Research Software Engineer Association over the coming year, with support from Schmidt Sciences. The insights gathered from these sessions will inform the development of guidance resources on these topic areas for the broader RSE and computational data practitioner communities.

Accelerating scientific data releases: Automated metadata generation with LLM agents

2025-07-11
talk

The rapid growth of scientific data repositories demands innovative solutions for efficient metadata creation. In this talk, we present our open-source project that leverages large language models to automate the generation of standard-compliant metadata files from raw scientific datasets. Our approach harnesses the capabilities of pre-trained open-source models, fine-tuned with domain-specific data and integrated with LangGraph to orchestrate a modular, end-to-end pipeline capable of ingesting heterogeneous raw data files and outputting metadata conforming to specific standards.

The methodology involves a multi-stage process where raw data is first parsed and analyzed by the LLM to extract relevant scientific and contextual information. This information is then structured into metadata templates that adhere strictly to recognized standards, thereby reducing human error and accelerating the data release cycle. We demonstrate the effectiveness of our approach using the USGS ScienceBase repository, where we have successfully generated metadata for a variety of scientific datasets, including images, time series, and text data.
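
As a rough illustration of that parse, extract, and template flow, here is a minimal LangGraph sketch. The node logic, state fields, and output format are invented for illustration and are not the project's actual code:

```python
# Hypothetical sketch of the multi-stage pipeline described above, using
# LangGraph's StateGraph API. Node bodies, prompts, and the metadata
# schema are illustrative assumptions, not the project's actual code.
from typing import TypedDict
from langgraph.graph import StateGraph, END


class PipelineState(TypedDict):
    raw_text: str    # parsed contents of a raw data file
    extracted: dict  # scientific/contextual facts pulled out by the LLM
    metadata: str    # final standard-compliant metadata document


def parse_raw_file(state: PipelineState) -> dict:
    # Real code would dispatch on file type (image, time series, text...).
    return {"raw_text": state["raw_text"].strip()}


def extract_with_llm(state: PipelineState) -> dict:
    # Placeholder for an LLM call that pulls title, keywords, units, etc.
    return {"extracted": {"title": "Example dataset", "keywords": ["demo"]}}


def fill_template(state: PipelineState) -> dict:
    # Slot extracted fields into a template for the target metadata standard.
    ex = state["extracted"]
    return {"metadata": f"<title>{ex['title']}</title>"}


graph = StateGraph(PipelineState)
graph.add_node("parse", parse_raw_file)
graph.add_node("extract", extract_with_llm)
graph.add_node("template", fill_template)
graph.set_entry_point("parse")
graph.add_edge("parse", "extract")
graph.add_edge("extract", "template")
graph.add_edge("template", END)

app = graph.compile()
result = app.invoke({"raw_text": " station=A1 temp=21.3 ", "extracted": {}, "metadata": ""})
print(result["metadata"])
```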

Beyond its immediate application to the USGS ScienceBase repository, our open-source framework is designed to be extensible, allowing adaptation to other data release processes across various scientific domains. We will discuss the technical challenges encountered, such as managing diverse data formats and ensuring metadata quality, and outline strategies for community-driven enhancements. This work not only streamlines the metadata creation workflow but also sets the stage for broader adoption of generative AI in scientific data management.

Additional Material:
- Project supported by USGS and ORNL
- Codebase will be available on GitHub after paper publication
- Fine-tuned LLM models will be available on Hugging Face after paper publication

Open Code, Open Science: What’s Getting in Your Way?

2025-07-10
talk

Collaborating on code and software is essential to open science—but it’s not always easy. Join this BoF for an interactive discussion on the real-world challenges of open source collaboration. We’ll explore common hurdles like Python packaging, contributing to existing codebases, and emerging issues around LLM-assisted development and AI-generated software contributions.

We’ll kick off with a brief overview of pyOpenSci—an inclusive community of Pythonistas, from novices to experts—working to make it easier to create, find, share, and contribute to reusable code. We’ll then facilitate small-group discussions and use an interactive Mentimeter survey to help you share your experiences and ideas.

Your feedback will directly shape pyOpenSci’s priorities for the coming year, as we build new programs and resources to support your work in the Python scientific ecosystem. Whether you’re just starting out or a seasoned developer, you’ll leave with clear ways to get involved and make an impact on the broader Python ecosystem in service of advancing scientific discovery.

Polyglot RAG: Building a Multimodal, Multilingual, and Agentic AI Assistant

2025-07-10
talk

AI assistants are evolving from simple Q&A bots to intelligent, multimodal, multilingual, and agentic systems capable of reasoning, retrieving, and autonomously acting. In this talk, we’ll showcase how to build a voice-enabled, multilingual, multimodal RAG (Retrieval-Augmented Generation) assistant using Gradio, OpenAI’s Whisper, LangChain, LangGraph, and FAISS. Our assistant will not only process voice and text inputs in multiple languages but also intelligently retrieve information from structured and unstructured data. We’ll demonstrate this with a flight search use case—leveraging a flight database for retrieval and, when necessary, autonomously searching external sources using LangGraph. You will gain practical insights into building scalable, adaptive AI assistants that move beyond static chatbots to autonomous agents that interact dynamically with users and the web.
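
To make the voice-to-retrieval path concrete, here is a minimal sketch using Whisper for multilingual transcription and FAISS for dense retrieval. The model choices and toy flight corpus are assumptions for illustration, not the talk's actual code:

```python
# Transcribe a (possibly non-English) spoken question with Whisper, embed it,
# and look up the nearest documents in a FAISS index.
import faiss
import numpy as np
import whisper
from sentence_transformers import SentenceTransformer

corpus = [
    "Flight LH401 departs Frankfurt at 10:15 and arrives in Newark at 13:05.",
    "Flight AF006 departs Paris CDG at 16:30 and arrives at JFK at 18:55.",
]

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine here
index.add(np.asarray(doc_vecs, dtype="float32"))

asr = whisper.load_model("base")              # multilingual by default
question = asr.transcribe("question.wav")["text"]

q_vec = embedder.encode([question], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q_vec, dtype="float32"), k=1)
print(question, "->", corpus[ids[0][0]])
# A full assistant would pass the retrieved passage plus the question to an
# LLM, and fall back to an external search tool when retrieval scores are low.
```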

cuTile, the New/Old Kid on the Block: Python Programming Models for GPUs

2025-07-09
talk

Block-based programming divides inputs into local arrays that are processed concurrently by groups of threads. Users write sequential array-centric code, and the framework handles parallelization, synchronization, and data movement behind the scenes. This approach aligns well with SciPy's array-centric ethos and has roots in older HPC libraries, such as NWChem’s TCE, BLIS, and ATLAS.

In recent years, many block-based Python programming models for GPUs have emerged, like Triton, JAX/Pallas, and Warp, aiming to make parallelism more accessible for scientists and increase portability.
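
To show what the block-based style looks like in practice, here is a minimal vector-add in Triton, one of the models just mentioned. This is Triton's API, shown only to illustrate the tile-per-program pattern, not cuTile's:

```python
# Each program instance owns one tile of the input; the framework handles
# thread-level parallelism, synchronization, and data movement.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # which tile am I?
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # indices of my tile
    mask = offs < n                           # guard the ragged last tile
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


x = torch.rand(10_000, device="cuda")
y = torch.rand(10_000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)        # one program per tile
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```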

In this talk, we'll present cuTile and Tile IR, a new Pythonic tile-based programming model and compiler recently announced by NVIDIA. We'll explore cuTile examples from a variety of domains, including a new Llama 3-based reference app and a port of miniWeather. You'll learn best practices for writing and debugging block-based Python GPU code, gain insight into how such code performs, and learn how it differs from traditional SIMT programming.

By the end of the session, you'll understand how block-based GPU programming enables more intuitive, portable, and efficient development of high-performance, data-parallel Python applications for HPC, data science, and machine learning.

Keeping LLMs in Their Lane: Focused AI for Data Science and Research

2025-07-09
talk

LLMs are powerful, flexible, easy-to-use... and often wrong. This is a dangerous combination, especially for data analysis and scientific research, where correctness and reproducibility are core requirements. Fortunately, it turns out that by carefully applying LLMs to narrower use cases, we can turn them into surprisingly reliable assistants that accelerate and enhance, rather than undermine, scientific work.

This is not just theory—I’ll showcase working examples of seamlessly integrating LLMs into analytic workflows, helping data scientists build interactive, intelligent applications without needing to be web developers. You’ll see firsthand how keeping LLMs focused lets us leverage their "intelligence" in a way that’s practical, rigorous, and reproducible.
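
As one illustration of "keeping an LLM in its lane" (my sketch, not the speaker's code): rather than letting the model answer free-form, restrict it to emitting a small, validated spec that deterministic code then executes. The schema and column whitelist below are invented for the example:

```python
# The LLM may only produce a JSON filter spec; we validate it with Pydantic
# before it ever touches the data, so malformed or off-task output is rejected.
import json
from typing import Literal

import pandas as pd
from pydantic import BaseModel


class FilterSpec(BaseModel):
    column: Literal["species", "island", "sex"]  # whitelist, nothing else
    equals: str


def apply_llm_filter(df: pd.DataFrame, llm_reply: str) -> pd.DataFrame:
    spec = FilterSpec.model_validate_json(llm_reply)  # reject bad output
    return df[df[spec.column] == spec.equals]


df = pd.DataFrame({"species": ["Adelie", "Gentoo"],
                   "island": ["Biscoe", "Dream"],
                   "sex": ["f", "m"]})
# Pretend the model replied with this when asked "show me the Gentoo penguins":
reply = json.dumps({"column": "species", "equals": "Gentoo"})
print(apply_llm_filter(df, reply))
```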

Escaping Proof-of-Concept Purgatory: Building Robust LLM-Powered Applications

2025-07-09
talk
Hugo Bowne-Anderson (Outerbounds)

Large language models (LLMs) enable powerful data-driven applications, but many projects get stuck in “proof-of-concept purgatory”—where flashy demos fail to translate into reliable, production-ready software. This talk introduces the LLM software development lifecycle (SDLC)—a structured approach to moving beyond early-stage prototypes. Using first principles from software engineering, observability, and iterative evaluation, we’ll cover common pitfalls, techniques for structured output extraction, and methods for improving reliability in real-world data applications. Attendees will leave with concrete strategies for integrating AI into scientific Python workflows—ensuring LLMs generate value beyond the prototype stage.
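
One reliability technique named above, sketched under assumptions (OpenAI's chat API; an invented citation schema and prompt): validate structured output against a schema and feed validation errors back to the model for a bounded number of retries.

```python
# Validate-and-retry loop for structured output extraction.
from openai import OpenAI
from pydantic import BaseModel, ValidationError


class Citation(BaseModel):
    title: str
    year: int


client = OpenAI()

def extract_citation(text: str, max_tries: int = 3) -> Citation:
    messages = [{"role": "user", "content":
                 f"Return ONLY JSON with keys title (str) and year (int) for: {text}"}]
    for _ in range(max_tries):
        reply = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages
        ).choices[0].message.content
        try:
            return Citation.model_validate_json(reply)  # schema check
        except ValidationError as err:
            # Feed the validation error back so the model can self-correct.
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "user", "content": f"Invalid JSON: {err}. Try again."})
    raise RuntimeError("model never produced valid output")
```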

Unlocking AI Performance with NeMo Curator: Scalable Data Processing for LLMs

2025-07-09
talk

Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs efficiently. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction—improving data quality and training efficiency.

We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to 7% improvement in LLM downstream tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
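
NeMo Curator's own API is not reproduced here; as a conceptual illustration of one pipeline stage, the sketch below does exact deduplication on GPU with RAPIDS cuDF, which NeMo Curator builds on. The file path and column names are assumptions:

```python
# Exact (byte-identical) deduplication of a JSONL shard on GPU with cuDF.
# Fuzzy and semantic deduplication would be subsequent pipeline stages.
import cudf

docs = cudf.read_json("crawl_shard.jsonl", lines=True)  # columns: id, text
before = len(docs)

docs = docs.drop_duplicates(subset="text")
print(f"kept {len(docs)}/{before} documents after exact dedup")

docs.to_json("deduped_shard.jsonl", orient="records", lines=True)
```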

DataMapPlot: Rich Tools for UMAP Visualizations

2025-07-09
talk

Many data scientists use UMAP to quickly visualize and explore complex datasets, whether exploring large unstructured datasets via neural embeddings or working on LLM explainability by mapping out sparse autoencoder features. Making those visualizations good enough, and compelling enough, to present to end users is much harder. Done right, however, a good UMAP plot can be a powerful communication tool or a rich interactive experience that draws users in. Attendees will come away with a sense of what is possible and an introduction to open-source tools that make it easy.
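
A minimal sketch of that workflow, with an invented toy corpus and assumed model choice: embed documents, reduce to 2-D with UMAP, and render a labeled map with DataMapPlot.

```python
# Embed -> reduce -> plot; the documents and labels are illustrative.
import datamapplot
import numpy as np
import umap
from sentence_transformers import SentenceTransformer

docs = ["neural networks for images", "convolutional vision models",
        "bayesian inference basics", "priors and posteriors explained"]
labels = np.array(["vision", "vision", "stats", "stats"])

vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
coords = umap.UMAP(n_neighbors=2, random_state=42).fit_transform(vecs)

# create_plot returns a figure/axes pair ready to style or save.
fig, ax = datamapplot.create_plot(coords, labels)
fig.savefig("data_map.png", dpi=200)
```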

Building LLM-Powered Applications for Data Scientists and Software Engineers

2025-07-08
talk

This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using multimodal AI models to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.

If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.
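
A bare-bones version of the PDF-querying pattern the workshop builds up to, shown as an illustration rather than the workshop's code (the input file and truncation limit are assumptions):

```python
# Extract text with pypdf, then ask an LLM a question grounded in that text.
from openai import OpenAI
from pypdf import PdfReader

reader = PdfReader("paper.pdf")  # hypothetical input file
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

client = OpenAI()
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer only from the provided document."},
        {"role": "user", "content": f"Document:\n{pdf_text[:8000]}\n\nQuestion: What is the main finding?"},
    ],
).choices[0].message.content
print(answer)
```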

Network Analysis Made Simple

2025-07-08
talk

Through the use of NetworkX's API, tutorial participants will learn the basics of graph theory and its use in applied network science. Starting with a computationally oriented definition of a graph and its associated methods, we will progress through the following concepts: path and structure finding, visualization, and graph storage on disk. We will also offer participants the option of one advanced-topic overview: the use of graphs alongside LLMs for knowledge retrieval, scalable alternatives to NetworkX such as cuGraph, or the translation of graph problems into linear algebra to speed up computations.
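
The flavor of those first steps, as a short sketch (the toy graph is invented for illustration):

```python
# Build a small graph with NetworkX, then ask structural questions of it.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("alice", "bob"), ("bob", "carol"),
                  ("carol", "dave"), ("alice", "carol")])

print(nx.shortest_path(G, "alice", "dave"))  # path finding
print(nx.degree_centrality(G))               # structure finding
nx.write_graphml(G, "friends.graphml")       # graph storage on disk
```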

Retrieval Augmented Generation (RAG) for LLMs

2025-07-07
talk

Large Language Models (LLMs) have revolutionized natural language processing, but they come with limitations such as hallucinations and outdated knowledge. Retrieval-Augmented Generation (RAG) is a practical approach to mitigating these issues by integrating external knowledge retrieval into the LLM generation process.

This tutorial will introduce the core concepts of RAG, walk through its key components, and provide a hands-on session for building a complete RAG pipeline. We will also cover advanced techniques such as hybrid search, re-ranking, ensemble retrieval, and benchmarking. By the end of this tutorial, participants will be equipped with both the theoretical understanding and practical skills needed to build robust RAG pipelines.
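
A compact end-to-end RAG loop of the kind such a pipeline builds on, shown as an illustration (the corpus, models, and prompt wording are assumptions):

```python
# Retrieve the most relevant document by embedding similarity, then generate
# an answer grounded in it.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

corpus = ["SciPy 2025 runs July 7-13.",
          "RAG augments LLMs with retrieved context."]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

def answer(question: str) -> str:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    best = corpus[int(np.argmax(doc_vecs @ q))]  # retrieve
    client = OpenAI()
    return client.chat.completions.create(       # generate
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Context: {best}\nQuestion: {question}"}],
    ).choices[0].message.content

print(answer("When is SciPy 2025?"))
```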

Building with LLMs Made Simple

2025-07-07
talk

In this tutorial, you will learn how to integrate Large Language Models (LLMs) directly into Python programs as thoughtfully designed core components rather than bolt-on additions. This hands-on session teaches design principles and practical techniques for incorporating LLM outputs into program control flow. We will use LlamaBot, an open-source Python interface to LLMs, focusing on local execution with efficient models.
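
A taste of that pattern, hedged as an assumption about LlamaBot's SimpleBot interface and the local model identifier; check the LlamaBot docs for the exact current API:

```python
# An LLM wrapped as a callable component whose output feeds normal control flow.
from llamabot import SimpleBot

summarizer = SimpleBot(
    "You summarize text in exactly one sentence.",  # system prompt fixes the bot's role
    model_name="ollama_chat/llama3.1",              # assumed local model identifier
)

summary = summarizer("LLMs can be embedded as components inside Python programs "
                     "rather than bolted on as external chat windows.")
print(summary)
```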