Topic

llm pre-training dataset

Top Events

[AI Alliance] Introducing GneissWeb: A State-Of-The-Art LLM Pre-training Dataset 1

Top Speakers

Shahrokh Daijavad (IBM Almaden Research Center) 1

Activities

1 activity · Newest first

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset

2025-03-06 · [AI Alliance] Introducing GneissWeb: A State-Of-The-Art LLM Pre-training Dataset

Presentation

by Shahrokh Daijavad (IBM Almaden Research Center)

Tags: data preparation kits · fineweb · gneissweb · huggingface datasets

Overview of GneissWeb, a roughly 10-trillion-token LLM pre-training dataset derived from FineWeb and released with open recipes, results, and reproduction tools. The talk covers how the dataset was created and the tools and techniques used, and provides code examples to try. The reported average improvement in benchmark performance over FineWeb is about 2%.