At IBM, responsible AI implies transparency in training data. Introducing GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset of ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction. In this session we will go over how we created GneissWeb and discuss the tools and techniques used. We will provide code examples that you can try at your leisure.
[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset