A team from Aleph Alpha will present tokenizer-free language model inference: an approach that replaces conventional large-vocabulary tokenizers with a core vocabulary of 256 byte values and a three-part architecture (a byte-level encoder/decoder, a latent transformer, and patch embeddings). The talk covers the architecture and the engineering challenges of building an efficient inference pipeline, including coordinating multiple models, CUDA graphs, and KV caches.
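For orientation, here is a minimal PyTorch sketch of how such a three-part byte-latent model can fit together: bytes are embedded and locally encoded, groups of bytes are pooled into patch embeddings for the latent transformer, and a byte-level decoder maps latent states back to logits over the 256 byte values. All module sizes, the fixed patch length, and the class name are hypothetical illustration, not Aleph Alpha's implementation; real designs may patch dynamically, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

PATCH = 8              # hypothetical fixed patch length
D_BYTE, D_LAT = 256, 512

class ByteLatentLM(nn.Module):
    def __init__(self):
        super().__init__()
        # 1) Byte-level encoder: embeds the 256 possible byte values
        #    and mixes them locally before pooling into patches.
        self.byte_embed = nn.Embedding(256, D_BYTE)
        enc = nn.TransformerEncoderLayer(D_BYTE, nhead=4, batch_first=True)
        self.byte_encoder = nn.TransformerEncoder(enc, num_layers=2)
        # Patch embeddings: project each group of PATCH byte states
        # into the latent model's width.
        self.patch_proj = nn.Linear(D_BYTE * PATCH, D_LAT)
        # 2) Latent transformer: the large model runs over patches,
        #    so its sequence length is len(bytes) / PATCH.
        lat = nn.TransformerEncoderLayer(D_LAT, nhead=8, batch_first=True)
        self.latent = nn.TransformerEncoder(lat, num_layers=6)
        # 3) Byte-level decoder: maps each latent patch state back to
        #    per-byte logits over the 256-value core vocabulary.
        self.byte_decoder = nn.Linear(D_LAT, PATCH * 256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        B, T = byte_ids.shape                    # T a multiple of PATCH here
        h = self.byte_encoder(self.byte_embed(byte_ids))
        patches = self.patch_proj(h.reshape(B, T // PATCH, PATCH * D_BYTE))
        z = self.latent(patches)
        return self.byte_decoder(z).reshape(B, T, 256)  # logits per byte

model = ByteLatentLM()
logits = model(torch.randint(0, 256, (1, 32)))  # raw bytes in, byte logits out
print(logits.shape)                             # torch.Size([1, 32, 256])
```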
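On the inference-engineering side, CUDA graphs remove per-step kernel-launch overhead by capturing one decode step once and replaying it against fixed tensor addresses. A minimal sketch using PyTorch's standard torch.cuda.CUDAGraph / torch.cuda.graph APIs; the tiny step module and shapes are hypothetical stand-ins, not the pipeline from the talk:

```python
import torch

device = "cuda"  # requires a CUDA device
step = torch.nn.Linear(512, 512).to(device).eval()

# Graph replay reuses fixed memory addresses, so inputs/outputs must be
# static tensors that we copy into before every replay.
static_in = torch.zeros(1, 512, device=device)

# Warm-up on a side stream so lazy initialization isn't captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        static_out = step(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into the graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    with torch.no_grad():
        static_out = step(static_in)

# Decode loop: copy new activations in, replay, read the result out.
for _ in range(4):
    static_in.copy_(torch.randn(1, 512, device=device))
    g.replay()                        # replays the captured kernels only
    token_state = static_out.clone()  # static_out is overwritten next replay
```

The fixed-address constraint is what ties the abstract's three concerns together: a KV cache used inside a captured graph must be pre-allocated at a static location, and each coordinated model needs its own captured graph per batch shape.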
Topic: cuda graphs