Overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb, with open recipes, results, and reproduction tools. We'll cover how it was created and the tools and techniques used, and provide code examples to try. A ~2% average improvement in benchmark performance over FineWeb is reported.
IBM Almaden Research Center
In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure.
- > 2% avg improvement in benchmark performance over FineWeb
- Huggingface page
- Data prep kit detailed recipe
- Data prep kit bloom filter for quick reproduction
- Recipe models for reproduction
- Announcement
- Paper
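The Data Prep Kit Bloom filter listed above lets you reproduce GneissWeb quickly by testing which FineWeb documents survived the filtering, without rerunning the full recipe. The source does not show that code, so here is a minimal, self-contained sketch of how a Bloom-filter membership test works in principle; the `BloomFilter` class and the `doc-*` identifiers are hypothetical illustrations, not the actual Data Prep Kit API.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical, not the Data Prep Kit class).

    A Bloom filter can report false positives but never false negatives,
    which is why it suits "is this document in the retained set?" checks.
    """

    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Standard sizing formulas for a target false-positive rate.
        self.m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Kirsch-Mitzenmacher double hashing: derive k bit positions
        # from two halves of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical usage: record which document IDs passed the filters,
# then probe membership for arbitrary IDs.
bf = BloomFilter(n_items=1000)
bf.add("doc-0001")
bf.add("doc-0002")
print("doc-0001" in bf)  # True
print("doc-9999" in bf)  # False (with high probability)
```

The published filter works the same way at scale: a compact bit array stands in for the multi-trillion-token retained set, trading a small, tunable false-positive rate for memory that fits on a single machine.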
At IBM, responsible AI implies transparency in training data. Introducing GneissWeb (pronounced "nice Web"), a state-of-the-art LLM pre-training dataset with ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction!
Discussion on security of foundation models.