talk-data.com

Speaker

Dane Corneil

Senior Staff Applied Scientist, NVIDIA

Dane is a Staff Applied Scientist at NVIDIA, where he researches methods to generate high-quality synthetic data at scale. His background is in computational neuroscience and reinforcement learning.

Bio from: Data + AI Summit 2025

Talks & appearances

Improve AI Training With the First Synthetic Personas Dataset Aligned to Real-World Distributions

A major challenge in LLM development and synthetic data generation is ensuring data quality and diversity. Data that incorporates varied perspectives and reasoning traces consistently improves model performance, yet procuring such data remains impossible for most enterprises. Human-annotated data struggles to scale, while purely LLM-based generation often suffers from distribution clipping and low entropy. In a novel compound AI approach, we combine LLMs with probabilistic graphical models and other tools to generate synthetic personas grounded in real demographic statistics. This approach lets us address the major limitations of existing methods: bias, licensing, and persona skew. We release the first open-source dataset aligned with real-world distributions and show how enterprises can use it with Gretel Data Designer (now part of NVIDIA) to bring diversity and quality to model training on the Databricks platform, while addressing model collapse and data provenance concerns head-on.
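As a rough illustration of the general idea of grounding personas in a probabilistic graphical model before handing them to an LLM, the sketch below samples attributes from a tiny two-node Bayesian network and turns each sample into a prompt. This is not the method or dataset described in the talk: the attribute names, the probability tables, and the helper functions are all hypothetical, the numbers are invented rather than taken from real demographic statistics, and no LLM is called.

```python
import random
from collections import Counter

# Hypothetical tables for a two-node network: age_group -> occupation.
# The numbers are made up for illustration, NOT real demographic statistics.
P_AGE = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}
P_OCCUPATION_GIVEN_AGE = {
    "18-34": {"student": 0.35, "employed": 0.60, "retired": 0.05},
    "35-64": {"student": 0.05, "employed": 0.85, "retired": 0.10},
    "65+": {"student": 0.01, "employed": 0.19, "retired": 0.80},
}


def sample_categorical(table, rng):
    """Draw one value from a {value: probability} table."""
    values = list(table)
    weights = [table[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_persona_attributes(rng):
    """Ancestral sampling: draw the parent first, then the child given it."""
    age = sample_categorical(P_AGE, rng)
    occupation = sample_categorical(P_OCCUPATION_GIVEN_AGE[age], rng)
    return {"age_group": age, "occupation": occupation}


def build_prompt(attrs):
    """Format sampled attributes as a prompt an LLM could expand into a persona."""
    return (
        "Write a short persona description for a person in the "
        f"{attrs['age_group']} age group whose occupation status is "
        f"'{attrs['occupation']}'."
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
personas = [sample_persona_attributes(rng) for _ in range(10_000)]

# Empirical share of each age group, to compare against P_AGE.
age_counts = Counter(p["age_group"] for p in personas)
age_share = {k: c / len(personas) for k, c in age_counts.items()}

prompt = build_prompt(personas[0])
```

Because the structured attributes are drawn from explicit tables rather than chosen by the LLM, the empirical shares in `age_share` track the target marginals in `P_AGE`, and the LLM would only be asked to fill in free-text detail around attributes that are already fixed.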