talk-data.com

Topic: pre-training dataset · 2 tagged

Activity Trend: 1 peak/qtr, 2020-Q1 to 2026-Q1

Activities

2 activities · Newest first

In this session we will go over how we created GneissWeb and discuss the tools and techniques we used. We will provide code examples that you can try at your leisure.

👉 > 2% average improvement in benchmark performance over FineWeb
👉 Hugging Face page
👉 Data Prep Kit detailed recipe
👉 Data Prep Kit Bloom filter for quick reproduction
👉 Recipe models for reproduction
👉 Announcement
👉 Paper

At IBM, responsible AI implies transparency in training data: introducing GneissWeb (pronounced "niceWeb"), a state-of-the-art LLM pre-training dataset of ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction! In this session we will go over how we created GneissWeb and discuss the tools and techniques we used. We will provide code examples that you can try at your leisure.
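
As a taste of the code examples mentioned above, here is a minimal sketch of the Bloom-filter route to reproduction: because GneissWeb is carved out of FineWeb, a membership filter over the retained document ids is enough to recover the subset from a FineWeb shard. The Bloom filter class, field names, and document ids below are hypothetical stand-ins, not the published Data Prep Kit artifacts linked above.

```python
# Illustrative sketch only: keep FineWeb-style records whose "id" is in a
# Bloom filter of retained document ids. The class, field names, and ids
# below are hypothetical stand-ins for the published artifacts.
import hashlib


class BloomFilter:
    """Minimal Bloom filter: k salted SHA-256 hashes over a fixed-size bit array."""

    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted hashes of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    # Pretend these are the ids of FineWeb documents retained in GneissWeb.
    keep = BloomFilter()
    keep.add("doc-001")
    keep.add("doc-042")

    shard = [
        {"id": "doc-001", "text": "kept example"},
        {"id": "doc-007", "text": "filtered-out example"},
    ]
    retained = [rec for rec in shard if rec["id"] in keep]
    print([rec["id"] for rec in shard], "->", [rec["id"] for rec in retained])
```

For an actual reproduction, the Data Prep Kit recipe and Bloom filter linked in the activity above are the authoritative starting point.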