talk-data.com

Company

Aleph Alpha

Speakers

4

Activities

5

Speakers from Aleph Alpha

Talks & appearances

5 activities from Aleph Alpha speakers

The release of Kimi K2 has firmly established mixture-of-experts (MoE) models as the leading architecture for large language models (LLMs) at the intelligence frontier. Because of their massive size (1+ trillion parameters) and sparse computation pattern, which selectively activates a subset of parameters rather than the entire model for each token, MoE-style LLMs present significant challenges for inference workloads and significantly alter the underlying inference economics. With ever-growing consumer demand for AI models, as well as the internal need of AGI companies to generate trillions of tokens of synthetic data, the "cost per token" is becoming an even more important factor, determining both profit margins and the capital expenditure required for internal reinforcement learning (RL) training rollouts. In this talk we will go through the details of the cost structure of generating a "DeepSeek token," discuss the tradeoffs between latency/throughput and cost, and try to estimate the optimal setup for running it.

If you want to join this event, please sign up on our Luma page: https://lu.ma/2ae8czbn
⚠️ Registration is free, but required due to building security.

Speakers:

* Piotr Mazurek (https://x.com/tugot17), Senior AI Inference Engineer
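
As a rough illustration of the kind of arithmetic behind a "cost per token" estimate, here is a minimal Python sketch; all prices, GPU counts, and throughput figures are illustrative assumptions, not numbers from the talk.

```python
# Back-of-envelope sketch of cost per generated token.
# All numbers below are illustrative assumptions, not measurements.

GPU_HOURLY_COST_USD = 2.50             # assumed rental price per GPU-hour
GPUS_PER_REPLICA = 8                   # assumed GPUs needed to host one model replica
TOKENS_PER_SECOND_PER_REPLICA = 2_000  # assumed aggregate decode throughput

def cost_per_million_tokens() -> float:
    """Cost in USD to generate one million output tokens with one replica."""
    hourly_cost = GPU_HOURLY_COST_USD * GPUS_PER_REPLICA
    tokens_per_hour = TOKENS_PER_SECOND_PER_REPLICA * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

if __name__ == "__main__":
    print(f"~${cost_per_million_tokens():.2f} per 1M tokens")
```

With these assumed numbers the estimate comes out to roughly $2.78 per million tokens; in practice the tradeoff between latency and batch size moves the throughput term, which is exactly the tension the talk explores.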

Abstract: Ever notice how your AI interactions start strong but quickly deteriorate with complexity? We've all been there – carefully crafting detailed prompts for AI models, only to receive increasingly mediocre responses as our inputs grow longer. The conventional wisdom says more context equals better results, but real-world evidence suggests otherwise. In this session, I'll share discoveries from analyzing thousands of AI interactions across various domains that reveal a surprising truth: the relationship between prompt length and response quality isn't linear – it's parabolic. There's a sweet spot, and most of us are operating well beyond it.

Aziz (Aleph Alpha) will present "How to Build an On-Premise LLM Finetuning Platform," exploring different fine-tuning approaches, including LoRA, QLoRA, and full fine-tuning, and discussing when to use each. The talk will also show how to implement dynamic worker scheduling and automatic GPU resource allocation, helping you streamline training workflows and turbocharge your engineering teams while ensuring your data stays securely on your own infrastructure.
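
As a taste of what a LoRA setup looks like in practice, here is a minimal sketch using the Hugging Face transformers and peft libraries; the base model, rank, and target modules are illustrative assumptions, not the configuration of the platform discussed in the talk.

```python
# Minimal LoRA fine-tuning setup with Hugging Face transformers + peft.
# Model name and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # assumed base model for illustration

base = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

The appeal of LoRA and QLoRA in an on-premise setting is that only a small fraction of parameters is trained, which keeps GPU memory requirements low enough to schedule many fine-tuning jobs on shared hardware.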

In this talk, we're excited to show you how we built ScaleDown, a Chrome extension that makes your AI interactions more efficient and environmentally sustainable using prompt compression! As more people use AI tools such as ChatGPT, Claude, and Gemini instead of Google Search, few realize the massive carbon footprint each interaction generates. We will talk about our journey from recognizing this hidden environmental cost to creating a solution that is helping users reduce their AI-related emissions by up to 80%. Finally, we'll share how developers can contribute to our open-source Python package powering ScaleDown's prompt compression. Whether you're interested in improving our compression algorithms, enhancing our emissions calculation methodology, or expanding compatibility with additional AI models, we'll show you how you can get involved!
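
To make the idea of prompt compression concrete, here is a toy sketch that strips low-information filler words before a prompt is sent to a model; this is not ScaleDown's actual algorithm or API, just an illustration of the general principle.

```python
# Toy prompt compression: drop common filler words while keeping word order.
# NOT ScaleDown's algorithm or package API; purely illustrative.
FILLER = {"please", "kindly", "basically", "very", "really", "just",
          "actually", "that", "the", "a", "an"}

def compress_prompt(prompt: str) -> str:
    """Remove filler words to shorten the prompt without losing intent."""
    kept = [w for w in prompt.split() if w.lower() not in FILLER]
    return " ".join(kept)

original = "Could you please just give me a very short summary of the article?"
compressed = compress_prompt(original)
print(compressed)                            # "Could you give me short summary of article?"
print(1 - len(compressed) / len(original))   # fraction of characters saved
```

Fewer tokens per request means less compute per interaction, which is where the emissions savings the talk describes come from.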

Why speed is all about memory: an exploration of why optimizing memory access is the most important factor in writing performant code. Szymon Ożóg will share his extensive experience in optimizing GPU-based code, especially for large language models (LLMs). Agenda topics include an overview of the challenges faced, the organizational structure backing multiplatform development, shaping the tech stack to achieve the goal, and what's next.
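
To illustrate the point about memory access, here is a small Python/NumPy sketch (ours, not the speaker's) comparing contiguous row-wise traversal with strided column-wise traversal of the same array; exact timings are machine-dependent, but the contiguous pattern is consistently faster because it is friendly to caches and prefetching.

```python
# Same arithmetic, different memory access pattern, very different speed.
import time
import numpy as np

a = np.random.rand(4096, 4096)  # stored row-major (C order) by default

def time_it(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Contiguous access: each row is read sequentially from memory.
row_wise = time_it(lambda: sum(row.sum() for row in a))

# Strided access: each column element is 4096 * 8 bytes away from the next.
col_wise = time_it(lambda: sum(a[:, j].sum() for j in range(a.shape[1])))

print(f"row-wise {row_wise:.3f}s vs column-wise {col_wise:.3f}s")
```

The same principle scales up to GPUs, where uncoalesced memory accesses leave the arithmetic units idle waiting for data.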