talk-data.com
Meetup
Presentation
2025-03-06 at 17:00
Introducing GneissWeb - a state-of-the-art LLM pre-training dataset
Description
An overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb and released with open recipes, results, and reproduction tools. We'll cover how it was created and the tools and techniques used, and provide code examples to try. Models trained on GneissWeb are reported to score ~2% higher on average across benchmarks than models trained on FineWeb.