talk-data.com

Meetup Presentation 2025-03-06 at 17:00

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset

Event: [AI Alliance] Introducing GneissWeb: A State-Of-The-Art LLM Pre-training Dataset

Speakers

Shahrokh Daijavad

Research Scientist · IBM Almaden Research Center

Topics

LLM · pre-training dataset · edge computing · Data Engineering · ai@edge

Description

At IBM, responsible AI implies transparency in training data. This session introduces GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset of ~10 trillion tokens derived from FineWeb, released with open recipes, results, and tools for reproduction. We will go over how we created GneissWeb, discuss the tools and techniques used, and provide code examples that you can try at your leisure.