Topic

edge computing

Activities

1 tagged

Activity Trend

Peak: 1 per quarter (2020-Q1 to 2026-Q2)

Top Events

Tiny But Mighty: Unleashing the Power of Small Language Models 1
Networking 2.0: Cisco's Vision for the Future 1
Global AI Paris fait sa rentrée 😅 1
#15 - London - Webuild-AI 1
[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset 1
Tiny But Mighty: Unleashing the Power of Small Language Models 1

Top Speakers

Shahrokh Daijavad (IBM Almaden Research Center) 1
Andrii Kolesnyk (WeBuild-AI) 1

Activities

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset

2025-03-06 · [AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset

Presentation

by Shahrokh Daijavad (IBM Almaden Research Center)

Data Engineering, LLM, ai@edge, pre-training dataset

At IBM, responsible AI implies transparency in training data. GneissWeb (pronounced “niceWeb”) is a state-of-the-art LLM pre-training dataset of ~10 trillion tokens derived from FineWeb, released with open recipes, results, and tools for reproduction. In this session we will go over how we created GneissWeb and discuss the tools and techniques we used. We will also provide code examples that you can try at your leisure.