talk-data.com talk-data.com

Topic

llm pre-training dataset

1

tagged

Activity Trend

1 peak/qtr
2020-Q1 2026-Q1

Activities

1 activities · Newest first

Overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb, with open recipes, results, and reproduction tools. We'll cover how it was created, the tools and techniques used, and provide code examples to try. Reported ~2% average improvement in benchmark performance over FineWeb.