Event

SciPy 2025

2025-07-07 – 2025-07-13 PyData

Activities tracked

2

Filtering by: Allison Ding

Top Speakers

Inessa Pawson (3), Justus Magin (3), Tetsuo Koyama (3), Tom Nicholas (3), Jacob Tomlinson (2), Katrina Riehl (2), Allison Ding (2), Deepyaman Datta (2), Eric Ma (2), Sarah Kaiser (2), Benoît Bovy (2), Deepak Cherian (2)

Sessions & talks

Unlocking AI Performance with NeMo Curator: Scalable Data Processing for LLMs

2025-07-09

talk

Allison Ding

AI/ML Data Quality LLM Python

Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework for curating high-quality LLM datasets. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction, improving data quality and training efficiency.

We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving an improvement of up to 7% on downstream LLM tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
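
To make the curation stages named in the abstract concrete (deduplication, filtering, PII redaction), here is a minimal CPU-only sketch in plain Python. It is not NeMo Curator's API and says nothing about how the framework implements these steps. The function names (`curate`, `passes_filter`, `redact_pii`), the five-word threshold, and the email-only notion of PII are all assumptions chosen for illustration.

```python
import hashlib
import re

# Assumption: "PII" is reduced to email addresses for this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_pii(text):
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


def passes_filter(text, min_words=5):
    """Keep documents with at least min_words whitespace-separated words."""
    return len(text.split()) >= min_words


def curate(docs, min_words=5):
    """Exact-deduplicate, filter, then redact, preserving first-seen order."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize case and whitespace so trivial variants hash identically.
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if not passes_filter(normalized, min_words):
            continue  # too short to be useful
        kept.append(redact_pii(doc))
    return kept


docs = [
    "Contact alice@example.com for the full training data report.",
    "contact  alice@example.com for the full training data report.",
    "Too short.",
    "GPU pipelines can process text much faster than CPU ones.",
]
curated = curate(docs)
for doc in curated:
    print(doc)
```

In this toy input the second document is dropped as a duplicate of the first and the third is dropped by the length filter. A production pipeline would typically add fuzzy or semantic deduplication and distribute the work across GPUs, which this sketch does not attempt.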

Scaling Clustering for Big Data: Leveraging RAPIDS cuML

2025-07-07

talk

Allison Ding

AI/ML Big Data DataViz NLP

This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.

Participants will learn to leverage dimensionality reduction techniques (PCA, t-SNE, UMAP) for better data visualization and apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications like topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.
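
For readers unfamiliar with the algorithms named above, the following is a minimal pure-Python sketch of K-Means (Lloyd's algorithm) with a hand-rolled comparison of two values of k. It runs on the CPU and uses neither cuML nor Optuna; the function names, the toy data, and the use of inertia as the comparison metric are assumptions made for illustration, not material from the tutorial.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    # Inertia: total squared distance from points to assigned centroids.
    inertia = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return labels, centroids, inertia


# Toy data: two well-separated square blobs.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]

labels, centroids, inertia = kmeans(points, k=2)

# A crude stand-in for hyperparameter tuning: compare inertia across k.
inertia_by_k = {k: kmeans(points, k)[2] for k in (1, 2)}
print(labels, inertia_by_k)
```

The inner loops here cost on the order of points times clusters per iteration, which is the part that GPU libraries parallelize. Inertia always favors larger k, so real tuning usually relies on a metric such as silhouette score instead.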
