talk-data.com

Description

A team from Aleph Alpha will talk about tokenizer-free language model inference. The approach eliminates the need for a conventional large-vocabulary tokenizer, relying instead on a core vocabulary of 256 byte values and a three-part architecture: a byte-level encoder/decoder, a latent transformer, and patch embeddings. The talk covers the architecture and the engineering challenges of building an efficient inference pipeline, including coordinating multiple models, CUDA graphs, and KV caches.
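
To make the three-part layout concrete, here is a minimal sketch of how such an architecture could be wired together. This is not Aleph Alpha's implementation; the patch size, model dimensions, layer counts, and class names are illustrative assumptions. The idea shown is the one the description names: a small byte-level encoder pools raw bytes into patch embeddings, a larger latent transformer models the patch sequence, and a byte-level decoder maps latent states back to byte logits.

```python
# Illustrative sketch only; all sizes and names are assumptions, not the talk's code.
import torch
import torch.nn as nn

BYTE_VOCAB = 256   # core vocabulary: raw byte values, no learned tokenizer
PATCH = 4          # assumed fixed number of bytes per latent position

class ByteEncoder(nn.Module):
    """Embeds raw bytes and pools each patch into one patch embedding."""
    def __init__(self, d_model=512):
        super().__init__()
        self.byte_emb = nn.Embedding(BYTE_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.local = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, byte_ids):               # (B, T) with T % PATCH == 0
        x = self.local(self.byte_emb(byte_ids))
        B, T, D = x.shape
        # Mean-pool every PATCH byte states into one patch embedding.
        return x.view(B, T // PATCH, PATCH, D).mean(dim=2)   # (B, T/PATCH, D)

class LatentTransformer(nn.Module):
    """Large autoregressive model operating on the patch-embedding sequence."""
    def __init__(self, d_model=512, layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, patches):                # (B, P, D)
        mask = nn.Transformer.generate_square_subsequent_mask(patches.size(1))
        return self.blocks(patches, mask=mask)

class ByteDecoder(nn.Module):
    """Maps each latent patch state to logits over the next PATCH bytes."""
    def __init__(self, d_model=512):
        super().__init__()
        self.proj = nn.Linear(d_model, PATCH * BYTE_VOCAB)

    def forward(self, latents):                # (B, P, D)
        B, P, _ = latents.shape
        return self.proj(latents).view(B, P, PATCH, BYTE_VOCAB)

if __name__ == "__main__":
    enc, core, dec = ByteEncoder(), LatentTransformer(), ByteDecoder()
    ids = torch.randint(0, BYTE_VOCAB, (1, 32))   # 32 raw bytes -> 8 patches
    logits = dec(core(enc(ids)))                  # (1, 8, PATCH, 256)
    print(logits.shape)
```

In an inference pipeline along the lines the talk describes, each of these components would typically need its own CUDA graph capture and KV cache, which is where the coordination challenges mentioned above arise.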