As datasets continue to grow in both size and complexity, CPU-based visualization pipelines often become bottlenecks, slowing down exploratory data analysis and interactive dashboards. In this session, we’ll demonstrate how GPU acceleration can transform Python-based interactive visualization workflows, delivering speedups of up to 50x with minimal code changes. Using libraries such as hvPlot, Datashader, cuxfilter, and Plotly Dash, we’ll walk through real-world examples of visualizing both tabular and unstructured data and show how RAPIDS, a suite of open-source GPU-accelerated data science libraries from NVIDIA, accelerates these workflows. Attendees will learn best practices for accelerating preprocessing, building scalable dashboards, and profiling pipelines to identify and resolve bottlenecks. Whether you are an experienced data scientist or a developer, you’ll leave with practical techniques to instantly scale your interactive visualization workflows on GPUs.
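As a taste of the “minimal code changes” pattern, here is a rough sketch (not taken from the session materials): cudf.pandas transparently moves existing pandas code onto the GPU, and hvPlot’s rasterize option hands rendering to Datashader. The file name and column names are placeholders.

```python
# Minimal sketch, assuming a RAPIDS-capable GPU and a local Parquet file.
import cudf.pandas
cudf.pandas.install()    # enable GPU pandas; must run before importing pandas

import pandas as pd      # now transparently backed by cuDF where possible
import hvplot.pandas     # registers the .hvplot accessor on DataFrames

df = pd.read_parquet("taxi.parquet")   # hypothetical dataset

# GPU-accelerated preprocessing with unchanged pandas syntax.
df = df[(df["fare"] > 0) & (df["distance"] > 0)]

# rasterize=True delegates rendering to Datashader, so millions of
# points stay interactive instead of overwhelming the browser.
plot = df.hvplot.scatter(x="distance", y="fare", rasterize=True)
```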
Speaker: Allison Ding
Talks & appearances
Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs efficiently. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction—improving data quality and training efficiency.
We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to 7% improvement in LLM downstream tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
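To make the deduplication step concrete, here is a hand-rolled sketch of GPU exact deduplication written against cuDF directly; it illustrates the concept rather than NeMo Curator’s own module API, and the corpus path and column names are hypothetical.

```python
# Conceptual sketch of exact deduplication on the GPU with cuDF --
# not NeMo Curator's API. Path and column names are assumptions.
import cudf

docs = cudf.read_parquet("corpus.parquet")   # hypothetical corpus file
print(f"before dedup: {len(docs):,} documents")

# Hash every document body in parallel on the GPU, then keep the
# first row per hash -- exact duplicates collapse to a single copy.
docs["doc_hash"] = docs["text"].hash_values()
deduped = docs.drop_duplicates(subset="doc_hash", keep="first")
print(f"after dedup:  {len(deduped):,} documents")
```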
This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.
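For a sense of what this looks like in code, the following sketch runs two of those estimators on synthetic data (sizes and parameters are illustrative choices); cuML’s clustering classes mirror their scikit-learn counterparts but execute on the GPU.

```python
# Minimal sketch on synthetic data; sample counts and eps are arbitrary.
from cuml.cluster import DBSCAN, KMeans
from cuml.datasets import make_blobs

# 200k points, 16 features, 8 planted clusters.
X, _ = make_blobs(n_samples=200_000, n_features=16,
                  centers=8, random_state=42)

# Familiar fit_predict API, GPU execution underneath.
km_labels = KMeans(n_clusters=8, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=6.0, min_samples=50).fit_predict(X)
```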
Participants will learn to leverage dimensionality reduction techniques (PCA, t-SNE, UMAP) for better data visualization and apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications like topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.
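A rough sketch of such a tuning loop, again on synthetic data: Optuna searches the KMeans cluster count, each candidate is scored with cuML’s silhouette metric, and GPU UMAP projects the result to 2-D for inspection. The search range and trial count are arbitrary choices for illustration.

```python
# Sketch assuming cuml.metrics.cluster.silhouette_score is available;
# data, search range, and trial count are illustrative.
import optuna
from cuml.cluster import KMeans
from cuml.datasets import make_blobs
from cuml.manifold import UMAP
from cuml.metrics.cluster import silhouette_score

X, _ = make_blobs(n_samples=50_000, n_features=16,
                  centers=8, random_state=42)

def objective(trial):
    # Try a cluster count and score the resulting labels on the GPU.
    k = trial.suggest_int("n_clusters", 2, 16)
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("best n_clusters:", study.best_params["n_clusters"])

# GPU UMAP projection to 2-D for visual inspection of the clusters.
embedding = UMAP(n_components=2).fit_transform(X)
```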