talk-data.com

Topic: quantization (4 tagged)

Activity Trend: peak 2 per quarter, 2020-Q1 to 2026-Q1

Activities: 4 activities · Newest first

Ever wondered what actually happens when you call an LLM API? This talk breaks down the inference pipeline from tokenization to text generation, explaining what's really going on under the hood. The speaker will walk through the key sampling strategies and their parameters: temperature, top-p, top-k, and beam search. We'll also cover performance tricks like quantization, KV caching, and prompt caching that can speed things up significantly. If time allows, we will also touch on some use-case-specific techniques like pass@k and majority voting.
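To make the sampling-strategy discussion concrete, here is a minimal, self-contained sketch (not code from the talk) of how temperature, top-k, and top-p interact when choosing the next token from a model's logits. The function name and structure are illustrative assumptions; real inference stacks implement these knobs on GPU over full vocabularies.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id from raw logits using common sampling knobs.

    temperature scales the logits (low -> near-greedy, high -> near-uniform),
    top_k keeps only the k most likely tokens (0 disables),
    top_p keeps the smallest set of tokens whose cumulative mass >= p.
    """
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / max(temperature, 1e-8) for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Pair each token id with its probability, most likely first.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)

    if top_k > 0:                 # keep only the k most likely tokens
        ranked = ranked[:top_k]
    if top_p < 1.0:               # nucleus: smallest prefix with mass >= top_p
        kept, mass = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            mass += p
            if mass >= top_p:
                break
        ranked = kept

    # Renormalize the survivors and draw one proportionally.
    mass = sum(p for _, p in ranked)
    r = rng.random() * mass
    for tok, p in ranked:
        r -= p
        if r <= 0:
            return tok
    return ranked[-1][0]
```

With a very low temperature or `top_k=1` this collapses to greedy decoding, which is a handy sanity check.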

Abstract: The talk introduces Any Compression via Iterative Pruning (ACIP), a novel approach designed to give users intuitive control over the compression-performance trade-off. ACIP uses a single gradient descent run of iterative pruning to establish a global parameter ranking, enabling immediate materialization of models of any target size. It demonstrates strong predictive performance on downstream tasks without costly fine-tuning and achieves state-of-the-art compression for open-weight LLMs, often complementing common quantization techniques.
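The "rank once, then materialize any size" idea can be sketched in a few lines. This toy version is not ACIP: a simple magnitude score stands in for ACIP's gradient-learned global ranking, and "pruning" just zeroes parameters, but it shows how a single global ordering lets you produce a model at any target size without re-running the procedure.

```python
def rank_parameters(weights):
    """Flatten all layers and rank parameters globally by a saliency score.

    Here |w| is a stand-in for ACIP's learned importance score (assumption).
    Returns (score, layer, index) triples, least important first.
    """
    scored = [(abs(w), layer, i)
              for layer, ws in weights.items()
              for i, w in enumerate(ws)]
    scored.sort()
    return scored

def materialize(weights, ranking, target_fraction):
    """Zero out the lowest-ranked parameters so that roughly
    target_fraction of all parameters survive. Any target size can be
    materialized from the same ranking without recomputing it."""
    total = sum(len(ws) for ws in weights.values())
    n_prune = total - int(round(target_fraction * total))
    pruned = {layer: list(ws) for layer, ws in weights.items()}
    for _, layer, i in ranking[:n_prune]:
        pruned[layer][i] = 0.0
    return pruned
```

For example, one ranking of a two-layer toy model can be materialized at 50%, 25%, or any other fraction by slicing deeper into the same ordered list.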

In this session, we'll discuss how data is stored, retrieved, augmented, and isolated for users, and how index types, quantization, multi-tenancy, sharding, and replication affect their behaviour and performance. We will also discuss vector databases' integration with AI models that can generate vectors, or use retrieved data to produce augmented or transformed outputs. When you emerge from this deep dive, you will have seen the inner workings of a vector database, and the key aspects that make them different to your grandma's database.
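To give a feel for the quantization aspect, here is a toy scalar quantizer and brute-force nearest-neighbour search. The int8-style scheme and function names are illustrative assumptions; production vector databases combine quantization with real index structures (HNSW, IVF, and the like) rather than a linear scan.

```python
import math

def quantize(vec, bits=8):
    """Scalar-quantize a float vector to small signed ints,
    a toy version of the int8 compression many vector databases offer."""
    scale = max(abs(x) for x in vec) / (2 ** (bits - 1) - 1) or 1.0
    return [round(x / scale) for x in vec], scale

def dot_q(codes_a, scale_a, codes_b, scale_b):
    # Approximate dot product reconstructed from quantized codes:
    # integer arithmetic inside, one float rescale at the end.
    return scale_a * scale_b * sum(a * b for a, b in zip(codes_a, codes_b))

def nearest(query, corpus):
    """Brute-force nearest neighbour (by dot-product similarity)
    over quantized vectors."""
    q_codes, q_scale = quantize(query)
    best, best_sim = None, -math.inf
    for idx, vec in enumerate(corpus):
        c_codes, c_scale = quantize(vec)
        sim = dot_q(q_codes, q_scale, c_codes, c_scale)
        if sim > best_sim:
            best, best_sim = idx, sim
    return best
```

The trade-off shown here is the one the session examines at scale: quantized codes cut memory and speed up scoring, at the cost of a small approximation error in the similarities.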