An overview of GneissWeb, a ~10 trillion-token LLM pre-training dataset derived from FineWeb, with open recipes, results, and reproduction tools. The talk covers how the dataset was created and the tools and techniques used, and it provides code examples to try. It reports a ~2% average improvement in benchmark performance over FineWeb.
talk-data.com
Speaker
Shahrokh Daijavad
Research Scientist
IBM Almaden Research Center
Shahrokh Daijavad, a distinguished Research Scientist in the Watsonx Data Engineering group at IBM Almaden Research Center, has a rich background in Edge Computing and Data Engineering. He earned his B.Eng. and Ph.D. in electrical engineering from McMaster University and spent years at IBM T. J. Watson Research Center. His recent research focuses on AI@Edge and Data Engineering for IBM Watsonx AI offerings.
Bio from: [AI Alliance] Introducing Gneissweb: A State-Of-The-Art LLM Pre-training Dataset