Meetup
Presentation
2025-03-06 at 17:00
Introducing GneissWeb - a state-of-the-art LLM pre-training dataset
Description
In this session, we will go over how we created GneissWeb and discuss the tools and techniques we used. We will provide code examples that you can try at your leisure.
More than 2% average improvement in benchmark performance over FineWeb
Hugging Face page
Data prep kit detailed recipe
Data prep kit bloom filter for quick reproduction
Recipe models for reproduction
Announcement
Paper