talk-data.com

Event

SciPy 2025

2025-07-07 – 2025-07-13 · PyData

Activities tracked

37

Filtering by: AI/ML

Sessions & talks

Showing 26–37 of 37 · Newest first

Keeping LLMs in Their Lane: Focused AI for Data Science and Research

2025-07-09
talk

LLMs are powerful, flexible, easy-to-use... and often wrong. This is a dangerous combination, especially for data analysis and scientific research, where correctness and reproducibility are core requirements. Fortunately, it turns out that by carefully applying LLMs to narrower use cases, we can turn them into surprisingly reliable assistants that accelerate and enhance, rather than undermine, scientific work.

This is not just theory—I’ll showcase working examples of seamlessly integrating LLMs into analytic workflows, helping data scientists build interactive, intelligent applications without needing to be web developers. You’ll see firsthand how keeping LLMs focused lets us leverage their "intelligence" in a way that’s practical, rigorous, and reproducible.
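
To give a flavor of the approach (not the speaker's actual code), here is a minimal sketch of keeping an LLM "in its lane": the model may only choose among a few vetted operations on a DataFrame, and its choice is validated before anything runs. `call_llm` is a hypothetical stand-in for any chat-completion client.

```python
import json
import pandas as pd

ALLOWED_OPS = {"mean", "median", "count"}

def answer(df: pd.DataFrame, question: str, call_llm) -> float:
    """call_llm is a hypothetical stand-in for any chat-completion client."""
    raw = call_llm(
        f"Columns: {list(df.columns)}. Question: {question}. "
        f'Reply only with JSON: {{"op": one of {sorted(ALLOWED_OPS)}, "column": <name>}}.'
    )
    spec = json.loads(raw)
    # Validate the model's choice instead of trusting it blindly.
    if spec["op"] not in ALLOWED_OPS or spec["column"] not in df.columns:
        raise ValueError(f"LLM proposed a disallowed operation: {spec}")
    return getattr(df[spec["column"]], spec["op"])()
```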

Escaping Proof-of-Concept Purgatory: Building Robust LLM-Powered Applications

2025-07-09
talk
Hugo Bowne-Anderson (Outerbounds)

Large language models (LLMs) enable powerful data-driven applications, but many projects get stuck in “proof-of-concept purgatory”—where flashy demos fail to translate into reliable, production-ready software. This talk introduces the LLM software development lifecycle (SDLC)—a structured approach to moving beyond early-stage prototypes. Using first principles from software engineering, observability, and iterative evaluation, we’ll cover common pitfalls, techniques for structured output extraction, and methods for improving reliability in real-world data applications. Attendees will leave with concrete strategies for integrating AI into scientific Python workflows—ensuring LLMs generate value beyond the prototype stage.
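
As one concrete illustration of structured output extraction, here is a minimal sketch using Pydantic validation with retries; `call_llm` and the `Finding` schema are hypothetical stand-ins, not the talk's actual code.

```python
import json
from pydantic import BaseModel, ValidationError

class Finding(BaseModel):          # illustrative schema
    metric: str
    value: float
    confidence: float              # 0-1, as requested in the prompt

def extract_finding(text: str, call_llm) -> Finding:
    """call_llm is a hypothetical chat-completion client."""
    prompt = (
        "Extract the key metric from the text below as JSON with keys "
        "metric (str), value (float), confidence (float in [0, 1]).\n\n" + text
    )
    for _ in range(3):             # retries: one way to tame non-determinism
        try:
            return Finding(**json.loads(call_llm(prompt)))
        except (json.JSONDecodeError, ValidationError):
            continue
    raise RuntimeError("model never produced valid structured output")
```

Validating at the boundary means a malformed generation fails loudly instead of silently corrupting downstream data.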

Unlocking AI Performance with NeMo Curator: Scalable Data Processing for LLMs

2025-07-09
talk

Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs efficiently. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction—improving data quality and training efficiency.

We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to 7% improvement in LLM downstream tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
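
The NeMo Curator API itself is not reproduced here; as a hedged stand-in, the following cuDF sketch shows the kind of GPU-accelerated step such a pipeline performs (exact deduplication plus a length filter). It assumes a CUDA GPU with RAPIDS installed, and the column names are illustrative.

```python
import cudf

docs = cudf.read_parquet("corpus.parquet")   # assumed columns: id, text
docs = docs.drop_duplicates(subset="text")   # exact dedup on the GPU
docs = docs[docs["text"].str.len() > 200]    # drop near-empty documents
docs.to_parquet("curated.parquet")
```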

The Myth of Artificial: Spotlighting Community Intelligence for Responsible Science

2025-07-09
talk

The widespread fascination with AI often fuels a "myth of the artificial", the belief that scientific and technological progress stems solely from algorithms and large tech breakthroughs. This talk challenges that notion, arguing that truly responsible and impactful science is fundamentally built upon and sustained by the resilient, collective intelligence of the scientific and research community.

Accelerating Genomic Data Science and AI/ML with Composability

2025-07-09
talk

The practice of data science in genomics and computational biology is fraught with friction, largely due to the tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, adopting emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems.

Here, we present two bridge libraries as short vignettes for composable bioinformatics. First, we present Anywidget, an architecture and toolkit based on modern web standards for sharing interactive widgets across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, VSCode, and more. Second, we present Oxbow, a Rust- and Python-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics.

Together, we demonstrate the composition of these libraries to build custom, connected genomic analysis and visualization environments. We propose that components such as these, which leverage domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows, as well as systems for exploratory data analysis and visualization.
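
To make the composability argument concrete, here is a hedged sketch of the Oxbow-to-Arrow path. It assumes oxbow exposes a region-query reader such as `read_bam` returning Arrow IPC bytes; treat the exact function and the `flag` column name as assumptions and consult the project docs.

```python
import io
import oxbow as ox
import pyarrow.ipc

# Assumed API: a region query returning Arrow IPC bytes.
ipc_bytes = ox.read_bam("alignments.bam", region="chr1:1-1000000")
table = pyarrow.ipc.open_file(io.BytesIO(ipc_bytes)).read_all()

# From Arrow, any columnar tool works: pandas, polars, duckdb, ...
df = table.to_pandas()
print(df.groupby("flag").size())  # "flag" column name is an assumption
```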

Building LLM-Powered Applications for Data Scientists and Software Engineers

2025-07-08
tutorial

This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using multimodal AI models to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.

If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.
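
A minimal sketch of the end goal, under two assumptions the workshop description does not fix: text-only extraction via pypdf and an OpenAI-compatible chat endpoint (the session itself uses multimodal models, and the model name here is illustrative).

```python
from pypdf import PdfReader
from openai import OpenAI

def ask_pdf(path: str, question: str) -> str:
    # Concatenate extracted text from every page; empty pages yield "".
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer only from the document."},
            {"role": "user", "content": f"Document:\n{text[:20000]}\n\nQ: {question}"},
        ],
    )
    return resp.choices[0].message.content
```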

Scaling up deep learning inference to large-scale bioimage data

2025-07-08
tutorial

Artificial intelligence has been successfully applied to bioimage understanding, achieving significant results over the last decade. Advances in imaging technologies have also allowed the acquisition of higher-resolution images, increasing not only the magnification at which images are captured but also the size of the acquired images. This poses a challenge for deep learning inference on large-scale images, since these methods are commonly applied to relatively small regions rather than whole images. This workshop presents techniques to scale up inference of deep learning models to large-scale image data with the help of Dask for parallelization in Python.
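
A minimal sketch of the core pattern, under the assumption of a tile-wise model: run inference chunk by chunk over an image too large for memory, with a halo of overlap so predictions do not break at tile borders.

```python
import dask.array as da
import numpy as np

# A large image, lazily chunked into tiles that fit in memory.
image = da.random.random((20_000, 20_000), chunks=(2048, 2048))

def infer(tile: np.ndarray) -> np.ndarray:
    # Stand-in for real model inference (e.g., a CNN forward pass per tile).
    return (tile > 0.5).astype(np.float32)

# depth=64 adds a margin of neighboring pixels to each tile before inference
# and trims it afterwards, avoiding seams at chunk borders.
labels = da.map_overlap(infer, image, depth=64, dtype=np.float32)
labels.to_zarr("predictions.zarr")  # streams results tile by tile (needs zarr)
```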

Building an AI Agent for Natural Language to SQL Query Execution on Live Databases

2025-07-08
tutorial

This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, validates and executes them on live databases, and returns accurate responses. Participants will build a system that intelligently routes between a specialized SQL agent and a ReAct chat agent, implementing RAG for query similarity matching, comprehensive safety validation, and human-in-the-loop confirmation. By the end of this 4-hour session, attendees will have created a powerful and extensible system they can adapt to their own data sources.
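
A minimal sketch of two of the safety layers described: read-only validation of generated SQL, and human-in-the-loop confirmation before execution. `generate_sql` is a hypothetical stand-in for the LLM agent, and the keyword check is deliberately naive.

```python
import sqlite3

FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "create")

def validate(sql: str) -> str:
    # Naive keyword screen; a production system would parse the SQL properly.
    lowered = sql.strip().lower()
    if not lowered.startswith("select") or any(w in lowered for w in FORBIDDEN):
        raise ValueError(f"refusing non-read-only query: {sql!r}")
    return sql

def run(question: str, conn: sqlite3.Connection, generate_sql):
    sql = validate(generate_sql(question))  # generate_sql: hypothetical LLM agent
    if input(f"Execute?\n  {sql}\n[y/N] ").strip().lower() != "y":
        return None                         # human in the loop
    return conn.execute(sql).fetchall()
```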

Building machine learning pipelines that scale: a case study using Ibis and IbisML

2025-07-07
tutorial

Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.

In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.
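
A minimal sketch of the Ibis idea: write dataframe-style code once and let a backend (DuckDB here) execute it, with the same expression running unchanged on larger engines. The file and column names are illustrative, not the tutorial's actual dataset.

```python
import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB; swap backends without rewriting
games = con.read_parquet("chess_games.parquet")  # assumed: game_id, move_no, score

features = games.group_by("game_id").agg(
    n_moves=games.move_no.max(),
    mean_score=games.score.mean(),
)
df = features.to_pandas()  # execution happens only at materialization
```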

Reproducible Machine Learning Workflows for Scientists with pixi

2025-07-07
tutorial

Scientific researchers need reproducible software environments for complex applications that can run across heterogeneous computing platforms. Modern open source tools like pixi provide automatic reproducibility for all dependencies while offering a high-level interface well suited to researchers.

This tutorial will provide a practical introduction to using pixi to easily create scientific and AI/ML environments that benefit from hardware acceleration, across multiple machines and platforms. The focus will be on applications using the PyTorch and JAX Python machine learning libraries with CUDA enabled, as well as deploying these environments to production settings in Linux container images.
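
For a sense of what this looks like, below is a hypothetical pixi.toml for a CUDA-enabled PyTorch environment; the package names, versions, and fields are illustrative, so check the pixi documentation for the exact schema.

```toml
# Hypothetical sketch; names and versions are illustrative.
[project]
name = "ml-env"
channels = ["conda-forge"]
platforms = ["linux-64"]

[system-requirements]
cuda = "12"

[dependencies]
python = "3.12.*"
pytorch-gpu = "*"

[tasks]
train = "python train.py"
```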

A Hands-on Tutorial towards building Explainable Machine Learning using SHAP, GINI, LIME, and Permutation Importance

2025-07-07
tutorial

The advancement of AI systems creates a pressing need for interpretability to address transparency, bias, risk, and regulatory compliance. This workshop teaches core interpretability techniques, including SHAP (game-theoretic feature attribution), Gini importance (decision tree impurity analysis), LIME (local surrogate models), and Permutation Importance (feature shuffling), which provide global and local explanations for model decisions. Through hands-on building of interpretability tools and visualization techniques, we explore how these methods enable bias detection, build clinical trust in healthcare diagnostics, and inform strategy in finance. These techniques are essential for building interpretable AI that addresses the challenges of black-box models.
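
A minimal sketch of three of the four techniques on a toy model (LIME follows the same local-surrogate pattern via the lime package and is omitted for brevity); assumes shap is installed.

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

gini = model.feature_importances_                      # global, impurity-based
perm = permutation_importance(model, X, y, n_repeats=5,
                              random_state=0).importances_mean
shap_values = shap.TreeExplainer(model).shap_values(X)  # local, per prediction
```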

Scaling Clustering for Big Data: Leveraging RAPIDS cuML

2025-07-07
tutorial

This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.

Participants will learn to leverage dimensionality reduction techniques (PCA, t-SNE, UMAP) for better data visualization and to apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications such as topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.
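
A minimal sketch of the pattern the tutorial builds on (requires a CUDA GPU with RAPIDS installed); cuML's API deliberately mirrors scikit-learn, with data living on the GPU via CuPy.

```python
import cupy as cp
from cuml.cluster import KMeans, HDBSCAN
from cuml.manifold import UMAP

# Synthetic data on the GPU; shapes are illustrative.
X = cp.random.random((100_000, 32)).astype(cp.float32)

labels_km = KMeans(n_clusters=10).fit_predict(X)         # GPU K-Means
embedding = UMAP(n_components=2).fit_transform(X)        # GPU UMAP for plotting
labels_hdb = HDBSCAN(min_cluster_size=50).fit_predict(X) # GPU HDBSCAN
```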