In this session, we will go over how we created GneissWeb and discuss the tools and techniques used. We will provide code examples that you can try at your leisure.
👉 More than 2% average improvement in benchmark performance over FineWeb
👉 Hugging Face page
👉 Data Prep Kit detailed recipe
👉 Data Prep Kit bloom filter for quick reproduction
👉 Recipe models for reproduction
👉 Announcement
👉 Paper