talk-data.com

Event

SciPy 2025

2025-07-07 – 2025-07-13 PyData

Activities tracked

142

Sessions & talks

Showing 101–125 of 142 · Newest first


GBNet: Gradient Boosting packages integrated into PyTorch

2025-07-09
talk

GBNet

Gradient Boosting Machines (GBMs) are widely used for their predictive power and interpretability, while Neural Networks offer flexible architectures but can be opaque. GBNet is a Python package that integrates XGBoost and LightGBM with PyTorch. By leveraging PyTorch’s auto-differentiation, GBNet enables novel architectures for GBMs that were previously exclusive to pure Neural Networks. The result is a greatly expanded set of applications for GBMs and an improved ability to interpret expressive architectures due to the use of GBMs.

ReSCU-Nets: recurrent U-Nets for segmentation of multidimensional microscopy data

2025-07-09
talk

Image analysis is a central tool in modern biology. Cell and developmental biologists generate multidimensional microscopy data, including imaging of cellular, subcellular and tissue structures, in three dimensions, over time, and with multiple molecular markers. Segmentation and tracking of multidimensional microscopy data requires high accuracy across many images (e.g. timepoints) and is a labour-intensive part of biological image processing pipelines. We present ReSCU-Nets, recurrent convolutional neural networks that use the segmentation results from the previous frame as a prompt to segment the current frame. We demonstrate that ReSCU-Nets outperform state-of-the-art segmentation models in different tasks on biological multidimensional microscopy sequences.

Accelerating Genomic Data Science and AI/ML with Composability

2025-07-09
talk

The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present two bridge libraries as short vignettes for composable bioinformatics. First, we present Anywidget, an architecture and toolkit based on modern web standards for sharing interactive widgets across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, VSCode, and more. Second, we present Oxbow, a Rust and Python-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Together, we demonstrate the composition of these libraries to build custom connected genomic analysis and visualization environments. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.

DataMapPlot: Rich Tools for UMAP Visualizations

2025-07-09
talk
LLM

Many data scientists use UMAP to quickly visualize and explore complex datasets. This could mean exploring large unstructured datasets via neural embeddings, or working on LLM explainability by mapping out Sparse Autoencoder features. Making the visualizations good enough, and compelling enough, to present to end users is much harder. However, if done right, a good UMAP plot can be a powerful communication tool or a rich interactive experience that draws users in. Attendees will come away with a sense of what is possible, and an introduction to open source tools that can make it easy.

Dynamic Data with Matplotlib

2025-07-09
talk

Matplotlib is already a favorite plotting library for creating static data visualizations in Python. Here, we discuss the development of a new DataContainer interface and accompanying transformation pipeline, which enable easier dynamic data visualization in Matplotlib. This improves the experience of plotting pure functions, which are automatically recomputed when you pan and zoom. Data containers can ingest data from a variety of sources, from structured data such as pandas DataFrames or xarray objects up to live-updating data from web services or databases. The flexible transformation pipeline allows for control over how your data is encoded into a plot.

Python for Climate Science: Using Intake to provide easy access to Climate Model data

2025-07-09
talk

Climate models generate a lot of data - and this can make it hard for researchers to efficiently access and use the data they need. The solutions of yesteryear include standardised file structures, sqlite databases, and just knowing where to look. All of these work - to varying degrees - but can leave new users scratching their heads. In this talk, I'll outline how ACCESS-NRI built tooling around Intake and Intake-ESM to make it easy for climate researchers to access available data, share their own, and avoid writing the custom scripts over and over to work with the data their experiments generate.

Break

2025-07-09
talk

SciPy Tools Plenary

2025-07-09
talk

What We Maintain, We Defend

2025-07-09
talk

Scientific Python is not only at the heart of discovery and advancement, but also infrastructure. This talk will provide a perspective on how open-source Python tools that are already powering real-world impact across the sciences are also supportive of public institutions and critical public data infrastructure. Drawing on her previous experience leading policy efforts in the Department of Energy as well as her experience in open-source scientific computing, Katy will highlight the indispensable role of transparency, reproducibility, and community in high-stakes domains. This talk invites the SciPy community to recognize its unique strengths and to amplify their impact by contributing to the public good through technically excellent, civic-minded development.

Opening Notes

2025-07-09
talk

Registration and Breakfast

2025-07-09
talk

Bring Accelerated Computing to Data Science in Python

2025-07-08
talk

As data science continues to evolve, the ever-growing size of datasets poses significant computational challenges. Traditional CPU-based processing often struggles to keep pace with the demands of data science workflows. Accelerated computing with GPUs offers a solution by enabling massive parallelism and significantly reducing processing times for data-heavy tasks. In this session, we will explore GPU computing architecture, how it differs from CPUs, and why it is particularly well-suited for data science workloads. This hands-on lab will dive into the different approaches to GPU programming, from low-level CUDA coding to high-level Python libraries within RAPIDS such as CuPy, cuDF, cuGraph, and cuML.

Building LLM-Powered Applications for Data Scientists and Software Engineers

2025-07-08
talk

This workshop is designed to equip software engineers with the skills to build and iterate on generative AI-powered applications. Participants will explore key components of the AI software development lifecycle through first principles thinking, including prompt engineering, monitoring, evaluations, and handling non-determinism. The session focuses on using multimodal AI models to build applications, such as querying PDFs, while providing insights into the engineering challenges unique to AI systems. By the end of the workshop, participants will know how to build a PDF-querying app, but all techniques learned will be generalizable for building a variety of generative AI applications.

If you're a data scientist, machine learning practitioner, or AI enthusiast, this workshop can also be valuable for learning about the software engineering aspects of AI applications, such as lifecycle management, iterative development, and monitoring, which are critical for production-level AI systems.

Hierarchical Data Analysis with Xarray DataTree & Zarr

2025-07-08
talk

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets. Many real-world datasets have a hierarchical or heterogeneous structure and are best organized as groups of related data arrays. Through xarray.DataTree, the xarray data model now supports opening datasets with a hierarchical structure of groups, such as HDF5 files and Zarr stores. This expanded data model is now general enough to manage data across different scientific disciplines, including geosciences and biosciences. This hands-on tutorial focuses on intermediate and advanced workflows using xarray to analyze real-world hierarchical data.

(Pre-)Commit to Better Code

2025-07-08
talk
Git

Maintaining code quality can be challenging, no matter the size of your project or number of contributors. Different team members may have different opinions on code styling and preferences for code structure, while solo contributors might find themselves spending a considerable amount of time making sure the code conforms to accepted conventions. However, manually inspecting and fixing issues in files is both tedious and error-prone. As such, computers are much more suited to this task than humans. Pre-commit hooks are a great way to have a computer handle this for you.

Pre-commit hooks are code checks that run whenever you attempt to commit your changes with Git. They can detect and, in some cases, automatically correct code-quality issues before they make it to your codebase. In this tutorial, you will learn how to install and configure pre-commit hooks for your repository to ensure that only code that passes your checks makes it into your code base. We will also explore how to build custom pre-commit hooks for novel use cases.
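As a rough illustration of what a custom hook could look like, the sketch below is a small Python script that flags files containing trailing whitespace. It relies on the pre-commit convention that a hook's entry point receives the staged file names as arguments and blocks the commit by exiting with a non-zero status. The function names and the specific check are hypothetical and are not taken from the tutorial.

```python
import sys
from pathlib import Path


def find_trailing_whitespace(text):
    """Return the 1-based line numbers that end in a space or tab."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


def main(argv=None):
    """Check each file name passed in; return 1 if any file fails, else 0."""
    paths = sys.argv[1:] if argv is None else argv
    failed = False
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for number in find_trailing_whitespace(text):
            print(f"{path}:{number}: trailing whitespace")
            failed = True
    return 1 if failed else 0


# In a standalone hook script, the return value of main() would be used
# as the process exit status, for example via sys.exit(main()).
```

Keeping the check in a pure function (`find_trailing_whitespace`) separate from the file handling makes the hook easy to unit-test without touching Git.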

Scaling-up deep learning inference to large-scale bioimage data

2025-07-08
talk

Artificial intelligence has been successfully applied to bioimage understanding and has achieved significant results in the last decade. Advances in imaging technologies have also allowed the acquisition of higher-resolution images. That has increased not only the magnification at which images are captured, but also the size of the acquired images. This poses a challenge for deep learning inference on large-scale images, since these methods are commonly applied to relatively small regions rather than whole images. This workshop presents techniques to scale up inference of deep learning models to large-scale image data with the help of Dask for parallelization in Python.
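The general idea of tiled inference can be sketched without any particular framework. The example below splits an image into fixed-size tiles, runs a stand-in "model" on each tile in parallel, and stitches the results back into place. It uses the standard library's thread pool in place of Dask and nested lists in place of arrays; `tile_bounds`, `fake_model`, `tiled_inference`, and the tile size are hypothetical names and values, not the workshop's code.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_bounds(height, width, tile):
    """Yield (row_start, row_stop, col_start, col_stop) covering the image."""
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            yield r, min(r + tile, height), c, min(c + tile, width)


def fake_model(patch):
    """Stand-in for a deep learning model: thresholds each pixel at 128."""
    return [[1 if value >= 128 else 0 for value in row] for row in patch]


def tiled_inference(image, tile=2, workers=4):
    """Apply fake_model tile by tile and stitch the outputs together."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    bounds = list(tile_bounds(height, width, tile))
    # Cut out one patch per tile; edge tiles may be smaller than `tile`.
    patches = [[row[c0:c1] for row in image[r0:r1]] for r0, r1, c0, c1 in bounds]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_model, patches))
    # Write each tile's prediction back to its position in the full image.
    for (r0, r1, c0, c1), result in zip(bounds, results):
        for offset, row in enumerate(result):
            output[r0 + offset][c0:c1] = row
    return output
```

Real segmentation models usually also need overlapping tiles so that predictions near tile edges see enough context; that step is omitted here for brevity.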

Shiny for Python: Building Production-Ready Dashboards in Python

2025-07-08
talk

Shiny is a framework for building web applications and data dashboards in Python. In this workshop, you will see how the basic building blocks of Shiny can be extended to create your own scalable, production-ready Python applications.

In particular, this workshop covers:

  • Overview of the basic building blocks of a Shiny for Python application
  • How to refactor applications into Shiny modules
  • How to write tests for your Shiny application
  • Deploy and share your application

At the end of this course you will be able to:

  • Build a Shiny app in Python
  • Refactor your reactive logic into Shiny Modules
  • Identify when to write Shiny modules
  • Write unit tests and end-to-end tests for your Shiny application
  • Deploy and share your application (for free!)

Lunch

2025-07-08
talk

Building an AI Agent for Natural Language to SQL Query Execution on Live Databases

2025-07-08
talk

This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, validates and executes them on live databases, and returns accurate responses. Participants will build a system that intelligently routes between a specialized SQL agent and a ReAct chat agent, implementing RAG for query similarity matching, comprehensive safety validation, and human-in-the-loop confirmation. By the end of this 4-hour session, attendees will have created a powerful and extensible system they can adapt to their own data sources.
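One small piece of such a pipeline, the safety check that runs before execution, might look like the following sketch. It accepts only a single read-only `SELECT` statement and runs it against an in-memory SQLite database. The helper names, the keyword list, and the sample table are assumptions made for illustration, and keyword matching is a deliberately crude stand-in for the more thorough validation the tutorial describes.

```python
import re
import sqlite3

# Keywords that modify data or schema; an illustrative, not exhaustive, list.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b",
    re.IGNORECASE,
)


def is_safe_select(sql):
    """Return True only for a single statement that starts with SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input or more than one statement
    if not re.match(r"select\b", statement, re.IGNORECASE):
        return False
    return FORBIDDEN.search(statement) is None


def run_query(connection, sql):
    """Execute a validated query and return all rows."""
    if not is_safe_select(sql):
        raise ValueError(f"rejected query: {sql!r}")
    return connection.execute(sql).fetchall()


def demo_database():
    """Build a tiny in-memory database to query (hypothetical sample data)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE talks (title TEXT, day TEXT)")
    connection.executemany(
        "INSERT INTO talks VALUES (?, ?)",
        [("GBNet", "2025-07-09"), ("Lunch", "2025-07-08")],
    )
    return connection
```

The check is intentionally conservative: it rejects some harmless queries (for example, one with a semicolon inside a string literal) in exchange for being simple to reason about. A human-in-the-loop confirmation step would sit between `is_safe_select` and `connection.execute`.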

Create custom image visualization and analysis tools with napari

2025-07-08
talk

With cameras in everything from microscopes to telescopes to satellites, scientists produce image data in countless formats, shapes, sizes, and dimensions. Python provides a rich ecosystem of libraries to make sense of them. napari is a Python library for multidimensional image visualization, but it does double duty as a standalone application that can be easily extended with GUI tools for analysis, visualization, and annotation. In this tutorial, we'll start with the basics of image visualization and analysis in Python, then show how to extend the napari user interface to make analysis workflows as easy as pushing a button, and finally show how to share these extensions as plugins, which can be easily installed by users and collaborators. If you work with images (particularly multidimensional images), and especially if you work with scientists who may not be comfortable with Python, this tutorial might be for you!

Create Your First Python Package: Make Your Python Code Easier to Share and Use

2025-07-08
talk

Python packaging can be overwhelming. However, a trusted, community-vetted workflow can make it easier. In this hands-on workshop, you’ll learn a tested approach developed by the pyOpenSci community and vetted by Python packaging maintainers. You’ll create an installable, maintainable, and citable package using a quickstart template. You’ll also receive step-by-step guidance on publishing to TestPyPI (and resources for conda-forge, and adding a DOI with Zenodo). If you can’t install software on your laptop, you can use GitHub Codespaces to participate in the workshop. Join us to package your Python code confidently and to access ongoing support in our community beyond the workshop.

Geospatial data visualisation in Python

2025-07-08
talk

The rapid expansion of the geospatial industry and the accompanying increase in the availability of geospatial data present unique opportunities and challenges in data science. As the need for skilled data scientists increases, the ability to manipulate and interpret this data becomes crucial. This workshop introduces the essentials of geospatial data manipulation and data visualisation, emphasizing hands-on techniques to transform, analyze and visualise diverse datasets effectively.

Throughout the workshop, attendees will explore the extensive ecosystem of geospatial Python libraries. Key tools include GeoPandas, Shapely and Cartopy for vector data, and GDAL, Rasterio and rioxarray for raster data. Participants will also learn to integrate these with popular plotting libraries such as Matplotlib, Bokeh, and Plotly for visualizations.

This tutorial will cover three primary topics: visualizing geospatial shapes, managing raster datasets, and synthesizing multiple data types into unified visual representations. Each section will incorporate data manipulation exercises to ensure attendees not only visualize but also deeply understand geospatial data.

Targeting both beginners and advanced practitioners, the workshop will employ real-world examples to guide participants through the necessary steps to produce striking and informative geospatial visualizations. By the end, attendees will be equipped with the knowledge to leverage advanced data science techniques in their geospatial projects, making them proficient in both the analysis and communication of spatial information.

Network Analysis Made Simple

2025-07-08
talk

Through the use of NetworkX's API, tutorial participants will learn about the basics of graph theory and its use in applied network science. Starting with a computationally-oriented definition of a graph and its associated methods, we will progress through the following concepts: path and structure finding, visualization, and graph storage on disk. We will also offer tutorial participants the option of one advanced topic overview, including the use of graphs alongside LLMs for knowledge retrieval, scalable alternatives to NetworkX including cuGraph, and the use of linear algebraic translation of graph problems to speed up computations.
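A computationally oriented definition of a graph can be as simple as a mapping from each node to its neighbours. The sketch below uses a plain dictionary rather than NetworkX's own classes and finds a shortest path with breadth-first search; the function name and the toy graph are illustrative only.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    previous = {source: None}  # maps each visited node to its predecessor
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Toy undirected graph stored as an adjacency mapping; "f" is isolated.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
    "f": [],
}
```

Breadth-first search visits nodes in order of their distance from the source, so the first time the target is dequeued the recorded path has the fewest possible edges. This holds for unweighted graphs only.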

Processing Cloud-optimized data in Python with Serverless Functions (Lithops, Dataplug)

2025-07-08
talk

Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data. They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. In this sense, cloud-optimized data is a good fit for data-parallel jobs on serverless platforms. Function-as-a-Service (FaaS) provides a data-driven, scalable, and cost-efficient experience with practically no management burden. Each serverless function reads and processes a small portion of the cloud-optimized dataset in parallel, directly from object storage, which significantly increases the speedup.
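The partition-then-process pattern described above can be sketched locally. In the example below, a byte buffer stands in for an object in cloud storage, each worker reads only its own byte range, and a thread pool stands in for a fleet of serverless functions. None of this uses the Lithops or Dataplug APIs, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def byte_ranges(total_size, chunk_size):
    """Split [0, total_size) into half-open (start, stop) ranges."""
    return [
        (start, min(start + chunk_size, total_size))
        for start in range(0, total_size, chunk_size)
    ]


def count_newlines(blob, byte_range):
    """Worker: read only the assigned range and count newline bytes in it."""
    start, stop = byte_range
    return blob[start:stop].count(b"\n")


def parallel_line_count(blob, chunk_size=16, workers=4):
    """Count newlines by summing per-range results computed in parallel."""
    ranges = byte_ranges(len(blob), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda r: count_newlines(blob, r), ranges)
        return sum(partials)
```

Counting single bytes is safe to split at arbitrary offsets. Record-oriented formats need partition boundaries that respect record structure, which is the kind of format-aware partitioning the abstract attributes to cloud-optimized data.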

In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit. Lithops is a serverless data processing toolkit that is specially designed to process data from cloud object storage using serverless functions. We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, and geospatial data. We will show different data processing pipelines in the cloud that demonstrate the benefits of cloud-optimized data management.

Show your work: Tutorial on building and hosting web applications

2025-07-08
talk

TL;DR: Learn how to turn your Python functions into interactive web applications using open-source tools. By the end, each of us will have deployed a portfolio (or store) with multiple web applications and learned how to reproduce it easily later on.

Tell me more: Work not shown is work lost. Many excellent scientists and engineers are not always adept at showcasing their work. As a result, many interesting scientific ideas are never brought to light.

However, using today's tools, one no longer has to leave the Python ecosystem to create classy, complete prototypes using modern data visualization and web development tools. With over five years of experience building and presenting data solutions at huge science companies, we show it doesn't have to be challenging. We provide a walkthrough of the primary web application frameworks and showcase Fast Dash, an open-source Python library that we built to address specific prototyping needs.

This tutorial is designed for all data professionals who value the ability to quickly convert their scientific code into web applications. Participants will learn about the leading frameworks, their strengths and limitations, and a decision flowchart for picking the best one for a given task. We will go through some day-to-day applications and hands-on Python coding throughout the session. Whether you bring your use-cases and datasets, or pick from our suggestions, you'll have a reproducible portfolio (app store) of deployed web applications by the end!