talk-data.com

Topic: quantization (4 tagged)

Activity Trend: peak 2 per quarter, 2020-Q1 to 2026-Q1

Activities: 4 activities · Newest first

Ever wondered what actually happens when you call an LLM API? This talk breaks down the inference pipeline from tokenization to text generation, explaining what's really going on under the hood. The speaker will walk through the key sampling strategies and their parameters: temperature, top-p, top-k, and beam search. We'll also cover performance tricks like quantization, KV caching, and prompt caching that can speed things up significantly. If time allows, we will also touch on some use-case-specific techniques like pass@k and majority voting.
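To make the sampling-strategy discussion concrete, here is a minimal, self-contained sketch (not code from the talk) of how temperature, top-k, and top-p interact when choosing the next token from a model's logits. The function name and structure are illustrative assumptions; real inference stacks implement these knobs on GPU over full vocabularies.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits using common sampling knobs.

    temperature scales the logits (low -> near-greedy, high -> near-uniform),
    top_k keeps only the k most likely tokens (0 disables),
    top_p keeps the smallest set of tokens whose cumulative mass >= p.
    """
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / max(temperature, 1e-8) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Pair each token id with its probability, most likely first.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)

    if top_k > 0:                 # keep only the k most likely tokens
        ranked = ranked[:top_k]
    if top_p < 1.0:               # nucleus: smallest prefix with mass >= top_p
        kept, mass = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            mass += p
            if mass >= top_p:
                break
        ranked = kept

    # Renormalize the survivors and draw one proportionally.
    mass = sum(p for _, p in ranked)
    r = rng.random() * mass
    for tok, p in ranked:
        r -= p
        if r <= 0:
            return tok
    return ranked[-1][0]
```

With a very low temperature or `top_k=1` this collapses to greedy decoding, which is a handy sanity check.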

Abstract: The talk introduces Any Compression via Iterative Pruning (ACIP), a novel approach designed to give users intuitive control over the compression-performance trade-off. ACIP uses a single gradient descent run of iterative pruning to establish a global parameter ranking, enabling immediate materialization of models of any target size. It demonstrates strong predictive performance on downstream tasks without costly fine-tuning and achieves state-of-the-art compression for open-weight LLMs, often complementing common quantization techniques.
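The "rank once, then materialize any size" idea can be sketched in a few lines. This toy version is not ACIP: a simple magnitude score stands in for ACIP's gradient-learned global ranking, and "pruning" just zeroes parameters, but it shows how a single global ordering lets you produce a model at any target size without re-running the procedure.

```python
def rank_parameters(weights):
    """Flatten all layers and rank parameters globally by a saliency score.

    Here |w| is a stand-in for ACIP's learned importance score (assumption).
    Returns (score, layer, index) triples, least important first.
    """
    scored = [(abs(w), layer, i)
              for layer, ws in weights.items()
              for i, w in enumerate(ws)]
    scored.sort()
    return scored

def materialize(weights, ranking, target_fraction):
    """Zero out the lowest-ranked parameters so that roughly
    target_fraction of all parameters survive. Any target size can be
    materialized from the same ranking without recomputing it."""
    total = sum(len(ws) for ws in weights.values())
    n_prune = total - int(round(target_fraction * total))
    pruned = {layer: list(ws) for layer, ws in weights.items()}
    for _, layer, i in ranking[:n_prune]:
        pruned[layer][i] = 0.0
    return pruned
```

For example, one ranking of a two-layer toy model can be materialized at 50%, 25%, or any other fraction by slicing deeper into the same ordered list.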

In this session, we'll discuss how data is stored, retrieved, augmented, and isolated for users, and how index types, quantization, multi-tenancy, sharding, and replication affect their behaviour and performance. We will also discuss vector databases' integration with AI models that can generate vectors, or use retrieved data to produce augmented or transformed outputs. When you emerge from this deep dive, you will have seen the inner workings of a vector database, and the key aspects that make them different to your grandma's database.
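To give a feel for the quantization aspect, here is a toy scalar quantizer and brute-force nearest-neighbour search. The int8-style scheme and function names are illustrative assumptions; production vector databases combine quantization with real index structures (HNSW, IVF, and the like) rather than a linear scan.

```python
import math

def quantize(vec, bits=8):
    """Scalar-quantize a float vector to small signed ints,
    a toy version of the int8 compression many vector databases offer."""
    scale = max(abs(x) for x in vec) / (2 ** (bits - 1) - 1) or 1.0
    return [round(x / scale) for x in vec], scale

def dot_q(codes_a, scale_a, codes_b, scale_b):
    # Approximate dot product reconstructed from quantized codes:
    # integer arithmetic inside, one float rescale at the end.
    return scale_a * scale_b * sum(a * b for a, b in zip(codes_a, codes_b))

def nearest(query, corpus):
    """Brute-force nearest neighbour (by dot-product similarity)
    over quantized vectors."""
    q_codes, q_scale = quantize(query)
    best, best_sim = None, -math.inf
    for idx, vec in enumerate(corpus):
        c_codes, c_scale = quantize(vec)
        sim = dot_q(q_codes, q_scale, c_codes, c_scale)
        if sim > best_sim:
            best, best_sim = idx, sim
    return best
```

The trade-off shown here is the one the session examines at scale: quantized codes cut memory and speed up scoring, at the cost of a small approximation error in the similarities.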