talk-data.com

Meetup Presentation 2025-03-06 at 17:00

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset

Event: [AI Alliance] Introducing GneissWeb: A State-of-the-Art LLM Pre-training Dataset

Speakers

Shahrokh Daijavad

Research Scientist · IBM Almaden Research Center

Topics

llm, pre-training dataset, gneissweb, fineweb, huggingface datasets, data preparation kits

Description

Overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb and released with open recipes, results, and reproduction tools. We'll cover how it was created and the tools and techniques used, and provide code examples to try. Models trained on GneissWeb are reported to score about 2% higher on average across benchmarks than models trained on FineWeb.