In this session we will go over how we created GneissWeb and discuss the tools and techniques we used. We will provide code examples that you can try at your leisure.
- More than 2% average improvement in benchmark performance over FineWeb
- Hugging Face page
- Data Prep Kit detailed recipe
- Data Prep Kit bloom filter for quick reproduction
- Recipe models for reproduction
- Announcement
- Paper