This talk walks all Pythonistas through recent CuPy feature development. Join me to hear how an open-source novice started contributing to CuPy and, over the years, helped it grow into a full-fledged, reliable, GPU-accelerated array library that covers most NumPy, SciPy, and Numba functionality.
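As a rough illustration of that NumPy compatibility (the array size and function below are chosen purely for illustration), a typical NumPy snippet often ports to CuPy by swapping the module:

```python
import numpy as np
import cupy as cp

# Same API, different device: the NumPy call runs on the CPU,
# the CuPy call runs on the GPU.
x_cpu = np.random.random((2048, 2048)).astype(np.float32)
x_gpu = cp.asarray(x_cpu)                  # copy the host array to the GPU

u_cpu, s_cpu, vt_cpu = np.linalg.svd(x_cpu, full_matrices=False)
u_gpu, s_gpu, vt_gpu = cp.linalg.svd(x_gpu, full_matrices=False)

s_host = cp.asnumpy(s_gpu)                 # copy a result back to the host
```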
Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs efficiently. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction—improving data quality and training efficiency.
We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to 7% improvement in LLM downstream tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
Data manipulation libraries like Polars allow us to analyze and process data much faster than with native Python, but that’s only true if you know how to use them properly. When the team working on NCEI's Global Summary of the Month first integrated Polars, they found it was actually slower than the original Java version. In this talk, we'll discuss how our team learned to think about computing problems like spreadsheet programmers, increasing our products’ processing speed by over 80%. We’ll share tips for rewriting legacy code to take advantage of parallel processing. We’ll also cover how we created custom, pre-compiled functions with Numba when the business requirements were too complex for native Polars expressions.
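To make that last idea concrete (the column name and validity rule below are illustrative, not the team's actual code), a pre-compiled Numba kernel can be applied to a Polars column when no built-in expression fits:

```python
import numpy as np
import polars as pl
from numba import guvectorize

# A pre-compiled element-wise kernel for logic too complex for native expressions.
@guvectorize(["void(float64[:], float64[:])"], "(n)->(n)", nopython=True)
def clamp_anomalies(values, out):
    for i in range(values.shape[0]):
        v = values[i]
        out[i] = v if -50.0 < v < 60.0 else np.nan

df = pl.DataFrame({"station": ["A", "A", "B"], "temp": [21.5, 999.9, 18.2]})

result = df.with_columns(
    pl.col("temp")
    .map_batches(lambda s: pl.Series(clamp_anomalies(s.to_numpy())))
    .alias("temp_clean")
)
```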
OpenMC is an open source, community-developed Monte Carlo tool for neutron transport simulations, featuring a depletion module for fuel burnup calculations in nuclear reactors and a Python API. Depletion calculations can be expensive, as they require solving the neutron transport and Bateman equations at each timestep to update the neutron flux and material composition, respectively. Material properties such as temperature and density govern material cross sections, which in turn govern reaction rates. The reaction rates can affect the neutron population. In a scenario where there is no significant change in the material properties or composition, the transport simulation may only need to be run once; the same cross sections are used for the entire depletion calculation. We recently extended the depletion module in OpenMC to enable transport-independent depletion using multigroup cross sections and fluxes. This talk will focus on the technical details of this feature and its validation, and briefly touch on areas where the feature has been used. Two recent use cases will be highlighted. The first use case calculates shutdown dose rates for fusion power applications, and the second performs depletion for fission reactor fuel cycle modeling.
GBNet
Gradient Boosting Machines (GBMs) are widely used for their predictive power and interpretability, while Neural Networks offer flexible architectures but can be opaque. GBNet is a Python package that integrates XGBoost and LightGBM with PyTorch. By leveraging PyTorch’s auto-differentiation, GBNet enables novel architectures for GBMs that were previously exclusive to pure Neural Networks. The result is a greatly expanded set of applications for GBMs and an improved ability to interpret expressive architectures due to the use of GBMs.
The practice of data science in genomics and computational biology is fraught with friction. This is largely due to a tight coupling of bioinformatic tools to file input/output. While omic data is specialized and the storage formats for high-throughput sequencing and related data are often standardized, the adoption of emerging open standards not tied to bioinformatics can help better integrate bioinformatic workflows into the wider data science, visualization, and AI/ML ecosystems. Here, we present two bridge libraries as short vignettes for composable bioinformatics. First, we present Anywidget, an architecture and toolkit based on modern web standards for sharing interactive widgets across all Jupyter-compatible runtimes, including JupyterLab, Google Colab, VSCode, and more. Second, we present Oxbow, a Rust and Python-based adapter library that unifies access to common genomic data formats by efficiently transforming queries into Apache Arrow, a standard in-memory columnar representation for tabular data analytics. Together, we demonstrate the composition of these libraries to build custom, connected genomic analysis and visualization environments. We propose that components such as these, which leverage scientific domain-agnostic standards to unbundle specialized file manipulation, analytics, and web interactivity, can serve as reusable building blocks for composing flexible genomic data analysis and machine learning workflows as well as systems for exploratory data analysis and visualization.
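For readers unfamiliar with the anywidget model, the canonical counter example (not one of the genomics widgets described above) gives a sense of how little code a widget needs:

```python
import anywidget
import traitlets

class CounterWidget(anywidget.AnyWidget):
    # Front-end code as an ES module; runs in any Jupyter-compatible front end.
    _esm = """
    export function render({ model, el }) {
      const button = document.createElement("button");
      button.textContent = `count: ${model.get("value")}`;
      button.addEventListener("click", () => {
        model.set("value", model.get("value") + 1);
        model.save_changes();
      });
      model.on("change:value", () => {
        button.textContent = `count: ${model.get("value")}`;
      });
      el.appendChild(button);
    }
    """
    # Python-side state, kept in sync with the front end.
    value = traitlets.Int(0).tag(sync=True)

CounterWidget()
```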
Matplotlib is already a favorite plotting library for creating static data visualizations in Python.
Here, we discuss the development of a new DataContainer interface and accompanying transformation pipeline which enable easier dynamic data visualization in Matplotlib.
This improves the experience of plotting pure functions, which are automatically recomputed when you pan and zoom.
Data containers can ingest data from a variety of sources, from structured data such as pandas DataFrames or xarray objects to live-updating data from web services or databases.
The flexible transformation pipeline allows for control over how your data is encoded into a plot.
Climate models generate a lot of data - and this can make it hard for researchers to efficiently access and use the data they need. The solutions of yesteryear include standardised file structures, SQLite databases, and just knowing where to look. All of these work - to varying degrees - but can leave new users scratching their heads. In this talk, I'll outline how ACCESS-NRI built tooling around Intake and Intake-ESM to make it easy for climate researchers to access available data, share their own, and avoid writing the same custom scripts over and over to work with the data their experiments generate.
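The access pattern this enables looks roughly like the following sketch (the catalogue path and search facets are placeholders, not the actual ACCESS-NRI catalogue):

```python
import intake  # with the intake-esm plugin installed

# Open a hypothetical ESM datastore describing an experiment's output files.
cat = intake.open_esm_datastore("experiment_datastore.json")

# Query the catalogue instead of hard-coding file paths.
subset = cat.search(variable="tas", frequency="mon")

# Load the matching files as a dictionary of xarray Datasets.
datasets = subset.to_dataset_dict()
```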
Scientific Python is not only at the heart of discovery and advancement, but also of infrastructure. This talk will provide a perspective on how open-source Python tools that are already powering real-world impact across the sciences also support public institutions and critical public data infrastructure. Drawing on her previous experience leading policy efforts in the Department of Energy as well as her experience in open-source scientific computing, Katy will highlight the indispensable role of transparency, reproducibility, and community in high-stakes domains. This talk invites the SciPy community to recognize its unique strengths and to amplify its impact by contributing to the public good through technically excellent, civic-minded development.
As data science continues to evolve, the ever-growing size of datasets poses significant computational challenges. Traditional CPU-based processing often struggles to keep pace with the demands of data science workflows. Accelerated computing with GPUs offers a solution by enabling massive parallelism and significantly reducing processing times for data-heavy tasks. In this session, we will explore GPU computing architecture, how it differs from CPUs, and why it is particularly well-suited for data science workloads. This hands-on lab will dive into the different approaches to GPU programming, from low-level CUDA coding to high-level Python libraries within RAPIDS such as CuPy, cuDF, cuGraph, and cuML.
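As a flavour of the high-level end of that spectrum (the CSV file and column names are placeholders), cuDF mirrors the pandas API while executing on the GPU:

```python
import cudf

# Reads and parses on the GPU; the resulting DataFrame lives in GPU memory.
df = cudf.read_csv("transactions.csv")

# Familiar pandas-style operations, executed with GPU parallelism.
summary = (
    df[df["amount"] > 0]
    .groupby("customer_id")["amount"]
    .agg(["count", "sum", "mean"])
)
print(summary.head())
```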
Artificial intelligence has been successfully applied to bioimage understanding, achieving significant results in the last decade. Advances in imaging technologies have also allowed the acquisition of higher-resolution images. That has increased not only the magnification at which images are captured, but also the size of the acquired images. This poses a challenge for deep learning inference on large-scale images, since these methods are commonly applied to relatively small regions rather than whole images. This workshop presents techniques to scale up inference of deep learning models to large-scale image data with the help of Dask for parallelization in Python.
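The core pattern is to apply the model tile by tile with some overlap so that predictions near tile borders see enough context; a minimal sketch (with a placeholder "model" and hypothetical Zarr paths) looks like this:

```python
import dask.array as da

def predict_tile(tile):
    # Placeholder for running a trained deep learning model on one tile.
    return (tile > tile.mean()).astype("uint8")

# A large chunked image stored in a (hypothetical) Zarr array.
image = da.from_zarr("large_image.zarr")

# map_overlap adds a margin around each chunk before calling the function,
# then trims it from the output, avoiding artifacts at tile borders.
prediction = image.map_overlap(predict_tile, depth=32, dtype="uint8")

prediction.to_zarr("prediction.zarr")
```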
Shiny is a framework for building web applications and data dashboards in Python. In this workshop, you will see how the basic building blocks of Shiny can be extended to create your own scalable, production-ready Python applications.
In particular, this workshop covers:
- Overview of the basic building blocks of a Shiny for Python application
- How to refactor applications into Shiny modules
- How to write tests for your Shiny application
- How to deploy and share your application
At the end of this course you will be able to:
- Build a Shiny app in Python
- Refactor your reactive logic into Shiny modules (see the module sketch after this list)
- Identify when to write Shiny modules
- Write unit tests and end-to-end tests for your Shiny application
- Deploy and share your application (for free!)
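A minimal sketch of what a Shiny module looks like (a hypothetical click counter, not one of the workshop's applications):

```python
from shiny import App, module, render, ui

# A reusable module: a namespaced UI fragment plus its server logic.
@module.ui
def counter_ui(label="Increment"):
    return ui.div(
        ui.input_action_button("button", label),
        ui.output_text("count"),
    )

@module.server
def counter_server(input, output, session):
    @render.text
    def count():
        return f"Clicked {input.button()} times"

# The application composes two independent instances of the module.
app_ui = ui.page_fluid(
    counter_ui("first", label="Counter 1"),
    counter_ui("second", label="Counter 2"),
)

def server(input, output, session):
    counter_server("first")
    counter_server("second")

app = App(app_ui, server)
```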
With cameras in everything from microscopes to telescopes to satellites, scientists produce image data in countless formats, shapes, sizes, and dimensions. Python provides a rich ecosystem of libraries to make sense of them. napari is a Python library for multidimensional image visualization, but it does double duty as a standalone application that can be easily extended with GUI tools for analysis, visualization, and annotation. In this tutorial, we'll start with the basics of image visualization and analysis in Python, then show how to extend the napari user interface to make analysis workflows as easy as pushing a button, and finally show how to share these extensions as plugins, which can be easily installed by users and collaborators. If you work with images (particularly multidimensional images), and especially if you work with scientists who may not be comfortable with Python, this tutorial might be for you!
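For orientation, opening a multidimensional image and bolting on a one-button analysis widget (a hypothetical Otsu-thresholding step) can be as brief as:

```python
import napari
from magicgui import magicgui
from napari.types import ImageData, LabelsData
from skimage import data, filters

viewer = napari.Viewer()
# A 3D, two-channel demo image shipped with scikit-image.
viewer.add_image(data.cells3d(), channel_axis=1, name=["membrane", "nuclei"])

# Wrap an analysis function as a GUI widget: one click runs the threshold.
@magicgui(call_button="Threshold")
def threshold(image: ImageData) -> LabelsData:
    return (image > filters.threshold_otsu(image)).astype(int)

viewer.window.add_dock_widget(threshold)
napari.run()
```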
Python packaging can be overwhelming. However, a trusted, community-vetted workflow can make it easier. In this hands-on workshop, you’ll learn a tested approach developed by the pyOpenSci community and vetted by Python packaging maintainers. You’ll create an installable, maintainable, and citable package using a quickstart template. You’ll also receive step-by-step guidance on publishing to TestPyPI, plus resources for publishing to conda-forge and adding a DOI with Zenodo. If you can’t install software on your laptop, you can use GitHub Codespaces to participate in the workshop. Join us to package your Python code confidently and to access ongoing support in our community beyond the workshop.
The rapid expansion of the geospatial industry and the accompanying increase in the availability of geospatial data present unique opportunities and challenges in data science. As the need for skilled data scientists increases, the ability to manipulate and interpret this data becomes crucial. This workshop introduces the essentials of geospatial data manipulation and visualization, emphasizing hands-on techniques to transform, analyze, and visualize diverse datasets effectively.
Throughout the workshop, attendees will explore the extensive ecosystem of geospatial Python libraries. Key tools include GeoPandas, Shapely, and Cartopy for vector data, and GDAL, Rasterio, and rioxarray for raster data; participants will also learn to integrate these with popular plotting libraries such as Matplotlib, Bokeh, and Plotly for visualizations.
This tutorial will cover three primary topics: visualizing geospatial shapes, managing raster datasets, and synthesizing multiple data types into unified visual representations. Each section will incorporate data manipulation exercises to ensure attendees not only visualize but also deeply understand geospatial data.
Targeting both beginners and advanced practitioners, the workshop will employ real-world examples to guide participants through the necessary steps to produce striking and informative geospatial visualizations. By the end, attendees will be equipped with the knowledge to leverage advanced data science techniques in their geospatial projects, making them proficient in both the analysis and communication of spatial information.
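As a taste of the vector-data portion (the file path and attribute column below are placeholders), a few lines of GeoPandas and Matplotlib already produce a choropleth map:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# A hypothetical vector dataset with a numeric attribute per polygon.
countries = gpd.read_file("countries.geojson")

fig, ax = plt.subplots(figsize=(10, 6))
countries.to_crs("EPSG:3857").plot(   # reproject, then colour by attribute
    column="population",
    cmap="viridis",
    legend=True,
    ax=ax,
)
ax.set_axis_off()
plt.show()
```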
Cloud-optimized (CO) data formats are designed to efficiently store and access data directly from cloud storage without needing to download the entire dataset. These formats enable faster data retrieval, scalability, and cost-effectiveness by allowing users to fetch only the necessary subsets of data. They also allow for efficient parallel data processing using on-the-fly partitioning, which can considerably accelerate data management operations. In this sense, cloud-optimized data is a natural fit for data-parallel jobs using serverless functions (FaaS). FaaS provides a data-driven, scalable, and cost-efficient experience with practically no management burden. Each serverless function reads and processes a small portion of the cloud-optimized dataset, in parallel and directly from object storage, yielding significant speedups.
In this talk, you will learn how to process cloud-optimized data formats in Python using the Lithops toolkit. Lithops is a serverless data processing toolkit specially designed to process data from cloud object storage using serverless functions. We will also demonstrate the Dataplug library, which enables cloud-optimized data management in scientific settings such as genomics, metabolomics, and geospatial data. We will show different data processing pipelines in the cloud that demonstrate the benefits of cloud-optimized data management.
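The programming model is intentionally simple; a sketch of the map-style pattern (the per-chunk function and byte ranges here are hypothetical) looks like this:

```python
import lithops

def process_chunk(byte_range):
    # In a real pipeline, each serverless function would read this byte range
    # of a cloud-optimized object directly from object storage and process it.
    start, end = byte_range
    return end - start

# Hypothetical partitions derived from a cloud-optimized layout.
byte_ranges = [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 3_000_000)]

fexec = lithops.FunctionExecutor()      # uses whichever backend is configured
fexec.map(process_chunk, byte_ranges)   # one function invocation per partition
results = fexec.get_result()            # gather the partial results
```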
TL;DR: Learn how to turn your Python functions into interactive web applications using open-source tools. By the end, each of us will have deployed a portfolio (or store) with multiple web applications and learned how to reproduce it easily later on.
Tell me more: Work not shown is work lost. Many excellent scientists and engineers are not always adept at showcasing their work, and as a result many interesting scientific ideas are never brought to light.
However, using today's tools, one no longer has to leave the Python ecosystem to create classy, complete prototypes using modern data visualization and web development tools. With over five years of experience building and presenting data solutions at huge science companies, we show it doesn't have to be challenging. We provide a walkthrough of the primary web application frameworks and showcase Fast Dash, an open-source Python library that we built to address specific prototyping needs.
This tutorial is designed for all data professionals who value the ability to quickly convert their scientific code into web applications. Participants will learn about the leading frameworks, their strengths and limitations, and a decision flowchart for picking the best one for a given task. We will go through some day-to-day applications and hands-on Python coding throughout the session. Whether you bring your use-cases and datasets, or pick from our suggestions, you'll have a reproducible portfolio (app store) of deployed web applications by the end!
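To give a sense of the decorator-first style, a sketch of a Fast Dash app (the function itself is just a toy example) can be as short as:

```python
from fast_dash import fastdash

# The decorator builds and serves a small web app around this function,
# inferring input and output components from the type hints.
@fastdash
def fahrenheit_to_celsius(fahrenheit: float) -> float:
    return round((fahrenheit - 32) * 5 / 9, 2)
```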
Pandas and scikit-learn have become staples in the machine learning toolkit for processing and modeling tabular data in Python. However, when data size scales up, these tools become slow or run out of memory. Ibis provides a unified, Pythonic, dataframe-like interface to 20+ execution backends, including dataframe libraries, databases, and analytics engines. Ibis enables users to leverage these powerful tools without rewriting their data engineering code (or learning SQL). IbisML extends the benefits of using Ibis to the ML workflow by letting users preprocess their data at scale on any Ibis-supported backend.
In this tutorial, you'll build an end-to-end machine learning project to predict the live win probability after each move during chess games.
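To preview the flavour of that workflow (the Parquet file and column names are placeholders for the chess data), Ibis expressions are built lazily and executed by the chosen backend:

```python
import ibis

con = ibis.duckdb.connect()                 # local DuckDB backend
games = con.read_parquet("games.parquet")   # hypothetical chess-games table

# Nothing executes until a result is requested.
by_opening = (
    games.group_by("opening")
    .aggregate(
        n_games=games.count(),
        white_win_rate=(games.winner == "white").cast("int").mean(),
    )
    .order_by(ibis.desc("n_games"))
)

print(by_opening.head(10).to_pandas())
```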
Spreadsheets are one of the most common ways to share and work with data, and they also work great in Python! In this tutorial, we will cover some of the basics and best practices of consuming and producing spreadsheets in Python, as well as take a deep dive into how to run Python directly in your spreadsheets. We will introduce and explore the new Python in Excel features as well as the Anaconda Toolbox for Excel add-in.
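On the consuming-and-producing side (file, sheet, and column names below are placeholders), pandas covers the common cases in a few lines:

```python
import pandas as pd

# Read one worksheet from a hypothetical workbook into a DataFrame.
sales = pd.read_excel("quarterly_report.xlsx", sheet_name="Sales")

# A small summary, written back out as a new workbook with two sheets.
summary = sales.groupby("region", as_index=False)["revenue"].sum()

with pd.ExcelWriter("summary.xlsx") as writer:
    sales.to_excel(writer, sheet_name="Raw", index=False)
    summary.to_excel(writer, sheet_name="Summary", index=False)
```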
Working with data can be challenging: it often doesn’t come in the best format for analysis, and understanding it well enough to extract insights requires both time and the skills to filter, aggregate, reshape, and visualize it. This session will equip you with the knowledge you need to effectively use pandas – a powerful library for data analysis in Python – to make this process easier.
Pandas makes it possible to work with tabular data and perform all parts of the analysis, from collection and manipulation through aggregation and visualization. While most of this session focuses on pandas, during our discussion of visualization we will also introduce, at a high level, Matplotlib (the library that pandas uses for its visualization features, which, when used directly, makes it possible to create custom layouts, add annotations, etc.) and Seaborn (another plotting library, which features additional plot types and the ability to visualize long-format data).
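As a small preview of that workflow (the dataset and column names are illustrative), a typical filter-aggregate-visualize chain in pandas looks like:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A hypothetical tidy dataset of daily measurements.
df = pd.read_csv("measurements.csv", parse_dates=["date"])

monthly = (
    df[df["value"] >= 0]                                     # filter invalid readings
    .assign(month=lambda d: d["date"].dt.to_period("M"))     # derive a month column
    .groupby("month")["value"]
    .agg(["mean", "max"])                                    # aggregate per month
)

monthly.plot(kind="line", title="Monthly mean and max")      # quick visualization
plt.show()
```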