Speaker

Allison Ding

Talks & appearances

Unlocking AI Performance with NeMo Curator: Scalable Data Processing for LLMs

2025-07-09 · SciPy 2025

talk

AI/ML Data Quality LLM Python

Training Large Language Models (LLMs) requires processing massive datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework for curating high-quality LLM datasets. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII redaction, improving both data quality and training efficiency.

We will showcase real-world examples demonstrating how multi-node, multi-GPU processing scales dataset preparation to 100+ TB of data, achieving up to a 7% improvement on downstream LLM tasks. Attendees will gain insights into configurable pipelines that enhance training workflows, with a focus on reproducibility, scalability, and open-source integration within Python's scientific computing ecosystem.
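
To make the deduplication and filtering stages named in the abstract more concrete, here is a minimal CPU-only sketch in plain Python. It does not use NeMo Curator's API. The function names (`normalize`, `exact_dedup`, `min_word_filter`, `curate`) and the three-word threshold are hypothetical, chosen only for illustration, and a production pipeline would run equivalent steps on GPUs across many nodes.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def min_word_filter(docs, min_words=3):
    """Drop documents with fewer than `min_words` whitespace-separated words."""
    return [doc for doc in docs if len(doc.split()) >= min_words]


def curate(docs, min_words=3):
    """Hypothetical two-stage pipeline: exact deduplication, then a length filter."""
    return min_word_filter(exact_dedup(docs), min_words)


if __name__ == "__main__":
    corpus = [
        "Hello  world again",
        "hello world again",          # duplicate after normalization
        "too short",                  # fails the length filter
        "A different document here",
    ]
    print(curate(corpus))
```

Hashing normalized text catches only exact duplicates. Near-duplicate detection, quality classification, and PII redaction need additional stages that this sketch does not attempt.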

Scaling Clustering for Big Data: Leveraging RAPIDS cuML

2025-07-07 · SciPy 2025

talk

AI/ML Big Data DataViz NLP

This tutorial will explore GPU-accelerated clustering techniques using RAPIDS cuML, optimizing algorithms like K-Means, DBSCAN, and HDBSCAN for large datasets. Traditional clustering methods struggle with scalability, but GPU acceleration significantly enhances performance and efficiency.

Participants will learn to leverage dimensionality reduction techniques (PCA, t-SNE, UMAP) for better data visualization and to apply hyperparameter tuning with Optuna and cuML. The session also includes real-world applications such as topic modeling in NLP and customer segmentation. By the end, attendees will be equipped to implement, optimize, and scale clustering algorithms effectively, unlocking faster and more powerful insights in machine learning workflows.
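
For readers unfamiliar with K-Means, one of the algorithms named in the abstract, the following is a minimal pure-Python sketch of the standard assign-and-update loop (Lloyd's algorithm). It is an illustration only and does not use cuML. The function names, the fixed iteration count, and the toy data are assumptions made for this example. GPU libraries parallelize these same two steps.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm with random initial centroids drawn from the data."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # Final assignment so labels match the returned centroids.
    return assign(points, centroids), centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids = kmeans(data, k=2)
    print(labels, centroids)
```

Each iteration costs time proportional to the number of points times the number of clusters, and every point's assignment is independent of the others, which is why this step maps well onto GPUs.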