talk-data.com

Topic: pre-training dataset · 2 tagged

Activity Trend: 1 peak/qtr, 2020-Q1 to 2026-Q1

Activities

2 activities · Newest first

In this session we will go over how we created GneissWeb and discuss the tools and techniques we used. We will provide code examples that you can try at your leisure.

👉 > 2% average improvement in benchmark performance over FineWeb
👉 Hugging Face page
👉 Data Prep Kit detailed recipe
👉 Data Prep Kit Bloom filter for quick reproduction
👉 Recipe models for reproduction
👉 Announcement
👉 Paper

At IBM, responsible AI implies transparency in training data: introducing GneissWeb (pronounced "niceWeb"), a state-of-the-art LLM pre-training dataset of ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction! In this session we will go over how we created GneissWeb and discuss the tools and techniques we used. We will provide code examples that you can try at your leisure.
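
As a taste of the code examples mentioned above, here is a minimal sketch of the Bloom-filter route to reproduction: because GneissWeb is carved out of FineWeb, a membership filter over the retained document ids is enough to recover the subset from a FineWeb shard. The Bloom filter class, field names, and document ids below are hypothetical stand-ins, not the published Data Prep Kit artifacts linked above.

```python
# Illustrative sketch only: keep FineWeb-style records whose "id" is in a
# Bloom filter of retained document ids. The class, field names, and ids
# below are hypothetical stand-ins for the published artifacts.
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k salted SHA-256 hashes over a fixed-size bit array."""

    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted hashes of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    # Pretend these are the ids of FineWeb documents retained in GneissWeb.
    keep = BloomFilter()
    keep.add("doc-001")
    keep.add("doc-042")

    shard = [
        {"id": "doc-001", "text": "kept example"},
        {"id": "doc-007", "text": "filtered-out example"},
    ]
    retained = [rec for rec in shard if rec["id"] in keep]
    print([rec["id"] for rec in shard], "->", [rec["id"] for rec in retained])
```

For an actual reproduction, the Data Prep Kit recipe and Bloom filter linked in the activity above are the authoritative starting point.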