A team from Aleph Alpha will present tokenizer-free language model inference: an approach that replaces conventional large-vocabulary tokenizers with a core vocabulary of 256 byte values and a three-part architecture (a byte-level encoder/decoder, a latent transformer, and patch embeddings). The talk covers the architecture and the engineering challenges of building an efficient inference pipeline, including coordinating multiple models, CUDA graphs, and KV caches.
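For orientation, here is a minimal PyTorch sketch of how such a three-part byte-latent model can fit together: bytes are embedded and locally encoded, groups of bytes are pooled into patch embeddings for the latent transformer, and a byte-level decoder maps latent states back to logits over the 256 byte values. All module sizes, the fixed patch length, and the class name are hypothetical illustration, not Aleph Alpha's implementation; real designs may patch dynamically, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

PATCH = 8              # hypothetical fixed patch length
D_BYTE, D_LAT = 256, 512

class ByteLatentLM(nn.Module):
    def __init__(self):
        super().__init__()
        # 1) Byte-level encoder: embeds the 256 possible byte values
        #    and mixes them locally before pooling into patches.
        self.byte_embed = nn.Embedding(256, D_BYTE)
        enc = nn.TransformerEncoderLayer(D_BYTE, nhead=4, batch_first=True)
        self.byte_encoder = nn.TransformerEncoder(enc, num_layers=2)
        # Patch embeddings: project each group of PATCH byte states
        # into the latent model's width.
        self.patch_proj = nn.Linear(D_BYTE * PATCH, D_LAT)
        # 2) Latent transformer: the large model runs over patches,
        #    so its sequence length is len(bytes) / PATCH.
        lat = nn.TransformerEncoderLayer(D_LAT, nhead=8, batch_first=True)
        self.latent = nn.TransformerEncoder(lat, num_layers=6)
        # 3) Byte-level decoder: maps each latent patch state back to
        #    per-byte logits over the 256-value core vocabulary.
        self.byte_decoder = nn.Linear(D_LAT, PATCH * 256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        B, T = byte_ids.shape                    # T a multiple of PATCH here
        h = self.byte_encoder(self.byte_embed(byte_ids))
        patches = self.patch_proj(h.reshape(B, T // PATCH, PATCH * D_BYTE))
        z = self.latent(patches)
        return self.byte_decoder(z).reshape(B, T, 256)  # logits per byte

model = ByteLatentLM()
logits = model(torch.randint(0, 256, (1, 32)))  # raw bytes in, byte logits out
print(logits.shape)                             # torch.Size([1, 32, 256])
```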
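On the inference-engineering side, CUDA graphs remove per-step kernel-launch overhead by capturing one decode step once and replaying it against fixed tensor addresses. A minimal sketch using PyTorch's standard torch.cuda.CUDAGraph / torch.cuda.graph APIs; the tiny step module and shapes are hypothetical stand-ins, not the pipeline from the talk:

```python
import torch

device = "cuda"  # requires a CUDA device
step = torch.nn.Linear(512, 512).to(device).eval()

# Graph replay reuses fixed memory addresses, so inputs/outputs must be
# static tensors that we copy into before every replay.
static_in = torch.zeros(1, 512, device=device)

# Warm-up on a side stream so lazy initialization isn't captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        static_out = step(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode step into the graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    with torch.no_grad():
        static_out = step(static_in)

# Decode loop: copy new activations in, replay, read the result out.
for _ in range(4):
    static_in.copy_(torch.randn(1, 512, device=device))
    g.replay()                        # replays the captured kernels only
    token_state = static_out.clone()  # static_out is overwritten next replay
```

The fixed-address constraint is what ties the abstract's three concerns together: a KV cache used inside a captured graph must be pre-allocated at a static location, and each coordinated model needs its own captured graph per batch shape.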
Topic: cuda graphs