Overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb and released with open recipes, evaluation results, and reproduction tools. We'll cover how the dataset was created, the tools and techniques used to build it, and provide code examples to try. The GneissWeb team reports a roughly 2% average improvement in benchmark performance over FineWeb.
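As a first taste of working with this kind of data, here is a minimal sketch that streams a FineWeb sample from Hugging Face and applies a toy quality filter. The dataset and config names ("HuggingFaceFW/fineweb", "sample-10BT") refer to the public FineWeb release; the filter heuristics and thresholds below are illustrative placeholders, not the actual GneissWeb recipe.

```python
# Sketch: stream a FineWeb sample and apply an illustrative quality filter.
# The thresholds here are placeholders, not the GneissWeb filtering recipe.
from datasets import load_dataset

def simple_quality_filter(example):
    """Keep documents that pass two toy heuristics."""
    words = example["text"].split()
    if len(words) < 50:                   # drop very short documents
        return False
    unique_ratio = len(set(words)) / len(words)
    return unique_ratio > 0.3             # drop highly repetitive text

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,                        # avoid downloading the full dump
)

filtered = fineweb.filter(simple_quality_filter)

# Inspect a few surviving documents
for i, doc in enumerate(filtered):
    print(doc["text"][:200].replace("\n", " "))
    if i >= 2:
        break
```

The real GneissWeb pipeline combines several stages (deduplication, model-based quality classifiers, and rule-based filters); the snippet above only shows the general pattern of streaming and filtering web-scale text.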