Search – talk-data.com

Title & Speakers	Event
[AI Alliance] GneissWeb: Preparing High Quality Data for LLMs at Scale 2025-05-08 · 16:00 Details IBM recently released GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training Large Language Models. In this talk I will do a deep dive into the philosophy behind this dataset, where it stands relative to other datasets, how to recreate it with the tools IBM has open-sourced, and some performance figures for it. This talk is a follow-up to the talk given by Shahrokh Daijavad of IBM in March. Prerequisites This is a follow-up to our March 6, 2025 session “Introducing GneissWeb - a state-of-the-art LLM pre-training dataset”: Check the GitHub show notes Re-watch on YouTube About the presenter Bishwaranjan Bhattacharjee (LinkedIn), Senior Technical Staff Member and Master Inventor, IBM Research About the AI Alliance The AI Alliance is an international community of researchers, developers and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to develop and achieve safe and responsible AI that benefits society rather than a select few big players.	[AI Alliance] GneissWeb: Preparing High Quality Data for LLMs at Scale
Introducing GneissWeb - a state-of-the-art LLM pre-training dataset 2025-03-06 · 17:00 Shahrokh Daijavad – Research Scientist @ IBM Almaden Research Center Overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb, with open recipes, results, and reproduction tools. We'll cover how it was created, the tools and techniques used, and provide code examples to try. Reported ~2% average improvement in benchmark performance over FineWeb. llm pre-training dataset gneissweb fineweb huggingface datasets data preparation kits	[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset
Introducing GneissWeb - a state-of-the-art LLM pre-training dataset 2025-03-06 · 17:00 Shahrokh Daijavad – Research Scientist @ IBM Almaden Research Center In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure. 👉 > 2% avg improvement in benchmark performance over FineWeb 👉 Huggingface page 👉 Data prep kit detailed recipe 👉 Data prep kit bloom filter for quick reproduction 👉 Recipe models for reproduction 👉 announcement 👉 Paper LLM pre-training dataset ai huggingface	[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset
Introducing GneissWeb - a state-of-the-art LLM pre-training dataset 2025-03-06 · 17:00 Shahrokh Daijavad – Research Scientist @ IBM Almaden Research Center At IBM, responsible AI implies transparency in training data: Introducing GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset with ~10 Trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction! In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure. LLM pre-training dataset edge computing Data Engineering ai@edge	[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset
[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset 2025-03-06 · 17:00 Agenda Quick intro about AI Alliance (5 mins) GneissWeb presentation (40 mins) Q&A (10 mins) Wrap-up Session: Introducing GneissWeb - a state-of-the-art LLM pre-training dataset At IBM, responsible AI implies transparency in training data: Introducing GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset with ~10 Trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction! In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure. 👉 > 2% avg improvement in benchmark performance over FineWeb 👉 Huggingface page 👉 Data prep kit detailed recipe 👉 Data prep kit bloom filter for quick reproduction 👉 Recipe models for reproduction 👉 announcement 👉 Paper Session Type Presentation Audience LLM app developers, data scientists, data engineers Technical Level Beginner – Intermediate Prerequisites None Speaker: Shahrokh Daijavad, Research Scientist @ IBM Almaden Research Center Shahrokh Daijavad, a distinguished Research Scientist in the Watsonx Data Engineering group at IBM Almaden Research Center, has a rich background in Edge Computing and Data Engineering. He earned his B.Eng. and Ph.D. in electrical engineering from McMaster University and spent years at IBM T. J. Watson Research Center. His recent research focuses on AI@Edge and Data Engineering for IBM Watsonx AI offerings. About the AI Alliance The AI Alliance is an international community of researchers, developers and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to develop and achieve safe and responsible AI that benefits society rather than a select few big players.	[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset

[AI Alliance] GneissWeb: Preparing High Quality Data for LLMs at Scale 2025-05-08 · 16:00

Details

IBM recently released GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training Large Language Models. In this talk I will do a deep dive into the philosophy behind this dataset, where it stands relative to other datasets, how to recreate it with the tools IBM has open-sourced, and some performance figures for it. This talk is a follow-up to the talk given by Shahrokh Daijavad of IBM in March.

Prerequisites

This is a follow-up to our March 6, 2025 session “Introducing GneissWeb - a state-of-the-art LLM pre-training dataset”:

Check the GitHub show notes
Re-watch on YouTube

About the presenter

Bishwaranjan Bhattacharjee (LinkedIn), Senior Technical Staff Member and Master Inventor, IBM Research

About the AI Alliance

The AI Alliance is an international community of researchers, developers and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to develop and achieve safe and responsible AI that benefits society rather than a select few big players.

[AI Alliance] GneissWeb: Preparing High Quality Data for LLMs at Scale

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset 2025-03-06 · 17:00

Shahrokh Daijavad – Research Scientist @ IBM Almaden Research Center

Overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb, with open recipes, results, and reproduction tools. We'll cover how it was created, the tools and techniques used, and provide code examples to try. Reported ~2% average improvement in benchmark performance over FineWeb.

llm pre-training dataset gneissweb fineweb huggingface datasets data preparation kits

[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset 2025-03-06 · 17:00

Shahrokh Daijavad – Research Scientist @ IBM Almaden Research Center

In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure.

👉 > 2% avg improvement in benchmark performance over FineWeb
👉 Huggingface page
👉 Data prep kit detailed recipe
👉 Data prep kit bloom filter for quick reproduction
👉 Recipe models for reproduction
👉 announcement
👉 Paper

LLM pre-training dataset ai huggingface

[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset

Introducing GneissWeb - a state-of-the-art LLM pre-training dataset 2025-03-06 · 17:00

Shahrokh Daijavad – Research Scientist @ IBM Almaden Research Center

At IBM, responsible AI implies transparency in training data: Introducing GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset with ~10 Trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction! In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure.

LLM pre-training dataset edge computing Data Engineering ai@edge

[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset

[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset 2025-03-06 · 17:00

Agenda

Quick intro about AI Alliance (5 mins)
GneissWeb presentation (40 mins)
Q&A (10 mins)
Wrapup

Session: Introducing GneissWeb - a state-of-the-art LLM pre-training dataset

At IBM, responsible AI implies transparency in training data: Introducing GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset with ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction!

In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure.

👉 > 2% avg improvement in benchmark performance over FineWeb
👉 Huggingface page
👉 Data prep kit detailed recipe
👉 Data prep kit bloom filter for quick reproduction
👉 Recipe models for reproduction
👉 announcement
👉 Paper
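The "bloom filter for quick reproduction" item can be read as follows: rather than re-running a full filtering recipe, one checks each source document's identifier against a compact set of identifiers known to have been kept. The sketch below illustrates that general idea only. It is not the data prep kit's implementation, and the `BloomFilter` class, the `filter_documents` helper, and the `id` field name are hypothetical names chosen for this example.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        # Set every bit position derived from the key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # A key is "probably present" only if all of its bits are set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def filter_documents(docs: Iterable[dict], keep: BloomFilter) -> list[dict]:
    """Keep documents whose (hypothetical) 'id' field is probably in the filter."""
    return [doc for doc in docs if doc["id"] in keep]


# Hypothetical identifiers of documents that survived a filtering recipe.
kept_ids = BloomFilter()
for doc_id in ["doc-1", "doc-3"]:
    kept_ids.add(doc_id)

# Hypothetical stand-in for the source corpus.
corpus = [
    {"id": "doc-1", "text": "kept document"},
    {"id": "doc-2", "text": "dropped document"},
    {"id": "doc-3", "text": "another kept document"},
]
subset = filter_documents(corpus, kept_ids)
print([doc["id"] for doc in subset])
```

The trade-off is that a Bloom filter never misses an identifier that was added, but it may occasionally admit one that was not; larger `num_bits` lowers that false-positive rate at the cost of memory.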

Session Type Presentation

Audience LLM app developers, data scientists, data engineers

Technical Level Beginner – Intermediate

Prerequisites None

Speaker: Shahrokh Daijavad, Research Scientist @ IBM Almaden Research Center

Shahrokh Daijavad, a distinguished Research Scientist in the Watsonx Data Engineering group at IBM Almaden Research Center, has a rich background in Edge Computing and Data Engineering. He earned his B.Eng. and Ph.D. in electrical engineering from McMaster University and spent years at IBM T. J. Watson Research Center. His recent research focuses on AI@Edge and Data Engineering for IBM Watsonx AI offerings.

About the AI Alliance The AI Alliance is an international community of researchers, developers and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to develop and achieve safe and responsible AI that benefits society rather than a select few big players.

[AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset

Activities & events