Overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb, with open recipes, results, and reproduction tools. We'll cover how it was created and the tools and techniques used, and provide code examples to try. A ~2% average improvement in benchmark performance over FineWeb is reported.
IBM Almaden Research Center
In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure.
- > 2% avg improvement in benchmark performance over FineWeb
- Huggingface page
- Data prep kit detailed recipe
- Data prep kit bloom filter for quick reproduction
- Recipe models for reproduction
- Announcement
- Paper
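The Data Prep Kit Bloom filter listed above lets you reproduce GneissWeb quickly by testing which FineWeb documents survived the filtering, without rerunning the full recipe. The source does not show that code, so here is a minimal, self-contained sketch of how a Bloom-filter membership test works in principle; the `BloomFilter` class and the `doc-*` identifiers are hypothetical illustrations, not the actual Data Prep Kit API.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical, not the Data Prep Kit class).

    A Bloom filter can report false positives but never false negatives,
    which is why it suits "is this document in the retained set?" checks.
    """

    def __init__(self, n_items: int, fp_rate: float = 0.01):
        # Standard sizing formulas for a target false-positive rate.
        self.m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Kirsch-Mitzenmacher double hashing: derive k bit positions
        # from two halves of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical usage: record which document IDs passed the filters,
# then probe membership for arbitrary IDs.
bf = BloomFilter(n_items=1000)
bf.add("doc-0001")
bf.add("doc-0002")
print("doc-0001" in bf)  # True
print("doc-9999" in bf)  # False (with high probability)
```

The published filter works the same way at scale: a compact bit array stands in for the multi-trillion-token retained set, trading a small, tunable false-positive rate for memory that fits on a single machine.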
At IBM, responsible AI implies transparency in training data. Introducing GneissWeb (pronounced "nice Web"), a state-of-the-art LLM pre-training dataset with ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction!
Discussion on security of foundation models.