talk-data.com

Event

AI Engineer World's Fair 2025

2025-06-03 YouTube

Activities tracked

55

Sessions & talks

Showing 26–50 of 55 · Newest first

The Knowledge Graph Mullet: Trimming GraphRAG Complexity - William Lyon

2025-06-03 Watch video

There are typically two approaches to working with graphs: property graphs and RDF. These systems are often thought of as different knowledge graph paradigms optimized for different workflows. This talk examines how combining property graph interfaces with RDF triple storage creates an optimal foundation for GraphRAG systems. We'll show how to build and use knowledge graphs with the Dgraph graph database, and how knowledge graphs serve as the foundation for building AI agents.

Resources:

  • Dgraph docs: https://docs.hypermode.com/dgraph/overview
  • Hypermode: https://hypermode.com
  • hyper-news GitHub repo: https://github.com/johnymontana/hyper-news
  • Hypermode Agents early access: https://hyp.foo/agents
The Robots are coming for your job, and that's okay - Elmer Thomas and Maria Bermudez

2025-06-03 Watch video

In a world where AI is revolutionizing API documentation, many wonder: “Why can’t we just use AI to write the docs?” At Twilio, we’ve explored this question deeply. Our Developer Education team found that while generative AI is powerful, it still carries too much risk to be used as an autonomous customer-facing agent. Instead, we use AI to amplify our small team’s impact by automating repetitive tasks, freeing us to focus on high-value, accuracy-critical work.

This talk shares our journey building and deploying AI agents to streamline documentation workflows, support over 100 product managers, and empower less-technical colleagues to contribute. Attendees will learn practical strategies for integrating agentic AI into documentation processes, how to balance automation with human oversight, and ideas for taking their own docs to the next level. This session is ideal for anyone interested in the intersection of AI, APIs, and documentation, especially those on short-staffed teams seeking scalable solutions.

The Voice-First AI Overlay: Designing Conversational Co-Pilots - Gregory Bruss

2025-06-03 Watch video

This talk introduces the concept of the 'Voice-First AI Overlay': an AI agent that assists conversations directly within the communication interface, either supporting a single participant or mediating between both.

I dive into the engineering and design of such a system. We'll cover how overlays fit into the broader agent orchestration landscape, UI principles, and the central voice-first UX problem: how to design AI overlays that genuinely assist without disrupting the primary human interaction.

See a live demo transforming messy, real-time captions into helpful conversational hints in the context of a language lesson.

From PM at Stripe to Building an AI startup, a recent founder's journey - Mounir Mouawad

2025-06-03 Watch video

I spent a bunch of time building products in Big Tech, most recently at Stripe but before that at Google and Amazon. In this short talk I am sharing the highs and lows of building a business in AI and how that differs from building products in Big Tech. May this be an inspiration to would-be founders or useful commiseration material for fellow founders :)

How agents broke app-level infrastructure - Evan Boyle

2025-06-03 Watch video

LLMs have completely broken our assumptions about app-level workloads. Compared to querying a database, LLMs are extremely flaky and slow. In web 2.0, p99 latency was just a few hundred milliseconds - anything higher and the on-call engineer is getting paged.

But today any API that uses LLMs has a p1 latency of a couple of seconds. Yet, the infrastructure we build on top of hasn't caught up with these new assumptions. There isn't a single serverless provider that supports running code for more than a few minutes!

In this session we'll talk about infrastructure patterns that used to be niche, but today require attention from anyone building on top of LLMs:

  • Durable execution
  • Long running workflows and APIs
  • Agent-scoped storage
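
Each of these patterns deserves its own treatment, but the core of durable execution can be sketched in a few lines: journal each step's result so a crashed or timed-out run resumes where it left off instead of repeating slow LLM calls. The `DurableRun` class and file-based journal below are illustrative assumptions for the example, not any particular provider's API.

```python
import json
import os

class DurableRun:
    """Journal each step's result to disk so a crashed or timed-out run
    can be resumed without re-executing slow, flaky LLM steps."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.cache = {}
        if os.path.exists(journal_path):
            with open(journal_path) as f:
                self.cache = json.load(f)

    def step(self, name, fn):
        # Completed steps replay from the journal instead of re-running.
        if name in self.cache:
            return self.cache[name]
        result = fn()
        self.cache[name] = result
        with open(self.journal_path, "w") as f:
            json.dump(self.cache, f)
        return result

# Usage: on a retry after a crash, already-journaled steps are skipped.
run = DurableRun("run-123.json")
outline = run.step("outline", lambda: "pretend this is a slow LLM call")
draft = run.step("draft", lambda: "expand: " + outline)
```

Real systems add retries, timeouts, and queueing on top, but the journal-and-replay idea is the part that makes multi-minute agent workflows survivable on short-lived compute.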
MCP Agent Fine tuning Workshop - Ronan McGovern

2025-06-03 Watch video

This is a hands-on workshop where students will run an agent with access to MCP servers (a Playwright browser, although others can be added), generate high-quality reasoning traces, and then train a Qwen3 model on those traces.

Students will learn:

  • How to generate high-quality MCP agent reasoning traces via an OpenAI-style endpoint
  • How to save tools and multi-turn traces
  • How to fine-tune a Qwen3 model on those traces with Unsloth
  • How to run the fine-tuned model
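
The trace-saving step can be made concrete with a small sketch: each JSONL line holds one full conversation plus the tool schemas the agent had available. The field names below follow a common chat-format convention, not necessarily the workshop's exact layout, and `browser_navigate` is a hypothetical tool name.

```python
import json

# One conversation per JSONL line, with its available tool schemas.
trace = {
    "tools": [{
        "name": "browser_navigate",
        "description": "Open a URL in the Playwright browser",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
        },
    }],
    "messages": [
        {"role": "user", "content": "What is on example.com?"},
        {"role": "assistant", "content": None,
         "tool_calls": [{"name": "browser_navigate",
                         "arguments": {"url": "https://example.com"}}]},
        {"role": "tool", "content": "<html>Example Domain</html>"},
        {"role": "assistant",
         "content": "The page is the Example Domain placeholder."},
    ],
}

# Append to the dataset; a fine-tuning run later reads this file back.
with open("traces.jsonl", "a") as f:
    f.write(json.dumps(trace) + "\n")
```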

My AI Thinks I'm Eating My Feelings (and Other Nutritional Insights) - Rami Alhamad

2025-06-03 Watch video

Eating well shouldn't require an advanced degree or hours decoding nutrition labels. Alma leverages cutting-edge AI to turn complex nutritional science into straightforward, personalized advice.

In this talk, we'll dive into how we're using large language models and real-world user data to reshape how people track, understand, and improve their diets. We'll share insights on building user-friendly AI experiences, practical lessons from Alma's journey, and how we're making nutrition advice smarter, simpler, and genuinely helpful—one meal at a time.

RAG Evaluation Is Broken! Here's Why (And How to Fix It) - Yuval Belfer and Niv Granot

2025-06-03 Watch video

Optimizing local benchmarks, chunking strategies, perfect retrieval scores. If you just nodded along, you're one of many developers building RAG systems optimized for metrics that don't matter in the real world.

But what if our entire approach to evaluating retrieval-augmented generation is fundamentally flawed? The uncomfortable truth is that current RAG benchmarks reward systems that fail spectacularly on realistic information retrieval tasks.

In this talk, we'll expose the critical gaps in how we evaluate RAG systems today, from the chunking catch-22 to the myth of perfectly contained information. Using examples like the "Seinfeld Test," we'll explore why high benchmark scores often lead to disappointed users.

You'll learn practical strategies for meaningful RAG evaluation that reflects how information actually works in the wild, helping you build systems that impress not just benchmark leaderboards, but actual humans.

To learn more, check out the full episode on RAG evaluation on YAAP: https://youtu.be/RsSkwpTmn8o?si=9gIR6EeIzPgbqY4O

The Demo I Wish I'd Had: OpenAI's Agents SDK... serverless! - Brook Riggio

2025-06-03 Watch video

Deploying and orchestrating significant AI workflows on serverless platforms like Vercel presents unique infrastructure challenges, like managing time limits and persistent state, handling task failures, and achieving reliable execution. As an engineer exploring OpenAI's powerful new Agents SDK on Vercel, I initially struggled with fitting sizeable jobs within the limits of our serverless platform.

In this highly practical, hands-on session, you'll experience the demo I wish I'd had—showing exactly how to use Inngest's native integration with Vercel to run OpenAI's Agents SDK for robust orchestration and execution of complex, long-running AI workflows. We'll cover practical solutions for retries, state preservation, and seamless orchestration within the constraints of Vercel's serverless platform.

You'll leave this talk equipped with clear, actionable strategies for implementing production-ready AI infrastructure on Vercel, including essential best practices for monitoring, observability, and robust error handling. Whether you're building your first AI system or enhancing existing workflows, this demo-driven talk provides the tools and insights needed for resilient, scalable AI deployments on Vercel.

Complete demo repo: https://github.com/brookr/serverless-agents

Please fork this, build your own examples, and send a PR to link to your work!

The End of Awkward AI Transcriptions - Travis Bartley and Myungjong Kim

2025-06-03 Watch video

NVIDIA is setting the new global standard for speech AI—with 6 top-ten models on the Hugging Face ASR leaderboard and blazing a trail with models like Parakeet2. In this talk, we’ll pull back the curtain on what it takes to build the world’s fastest, most accurate conversational AI, from open-source research to enterprise-ready NIM microservices that scale across any infrastructure.

We hear you, developers: Whether you’re building call center agents, video dubbing tools, or digital humans, NVIDIA’s ecosystem is designed for you. With Python-first frameworks, intuitive configurators, and a thriving open-source community, we’re making rapid iteration and seamless integration a reality—so you can launch faster, cut costs, and innovate boldly.

Real-world impact is already here. Enterprises are deploying multilingual, noise-robust, and highly customizable voice agents at scale, while our digital human blueprint lets you create interactive avatars. But the real story is the underlying conversational AI stack that’s transforming customer experience, accessibility, and global communication.

Join us to see why developers and industry leaders alike are calling NVIDIA’s speech AI “a game-changer”—and how you can be part of the next wave of conversational intelligence.

Will Agent evaluation via MCP Stabilize Agent Networks? - Ari Heljakka

2025-06-03 Watch video

Exposing complex AI evaluation frameworks to AI agents via MCP enables a new paradigm in which agents self-improve in a controllable manner. Unlike the often unstable straightforward self-criticism loops, MCP-accessible evaluation frameworks can provide the persistence layer that stabilizes and standardizes how agents measure progress towards plan fulfillment.

In this talk, we show how an MCP-enabled evaluation engine already allows agents to self-improve in a way that is independent of agent architectures and frameworks, and holds promise to become a cornerstone of rigorous agent development.

Real AI Agents Need Planning, Not Just Prompting - Yuval Belfer

2025-06-03 Watch video

AI agents that actually deserve the name - do they even exist? Despite the hype, most "agents" today are just LLMs with fancy prompt engineering tricks, lacking true agency capabilities.

Here's a deeper issue: it's 2025, and LLMs still struggle with basic instruction following. Weird when one of the first big models was literally called "InstructGPT," right? Benchmarks are saturated but meaningless, and without genuine planning abilities, these systems will keep hitting the same walls.

In this session we will go through:

  • Why conventional agent frameworks like ReAct miss the mark on true agency
  • How dynamic planning creates agents that actually follow complex instructions
  • Tips to improve instruction following in any AI system you build

Rust is the language of the AGI - Michael Yuan

2025-06-03 Watch video

In the Latent Space podcast, Bret Taylor argued that strongly and statically typed programming languages, such as Rust, could be especially well suited for AI coding, since the generated code can be validated by compilers for real-time feedback and reinforcement learning. However, unlike weakly or dynamically typed JavaScript or Python, there are few examples of Rust code in LLMs' training corpora, limiting LLMs' capability to generate Rust code.

In this talk, we will discuss the open-source Rust Coder project, which provides an integrated agentic framework based on the MCP protocol for generating complete and valid Rust projects. The Rust Coder framework enables the following functionalities for coding LLMs (e.g., Qwen Coder or Codestral):

  • Provides Rust example code, explanations, and tutorials relevant to the user’s request within the LLM query context.
  • Parses generated code artifacts into complete Rust Cargo projects.
  • Compiles and executes generated Rust Cargo projects.
  • Executes the compiled project against test cases.
  • Provides coding LLM feedback based on compiler and testing outputs.
  • Runs continuously until all issues are fixed.
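The compile-test-feedback loop in the bullets above can be sketched roughly as follows. This is an illustrative reconstruction, not the Rust Coder implementation (the real project wires these steps through MCP), and `revise` stands in for the call back into the coding LLM:

```python
import subprocess

def cargo(args, project_dir):
    """Run a cargo subcommand, returning (exit_code, combined output)."""
    proc = subprocess.run(["cargo", *args], cwd=project_dir,
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def fix_loop(project_dir, revise, run=cargo, max_rounds=3):
    """Compile the generated Cargo project, run its tests, and feed any
    compiler or test output back to the coding LLM via `revise`."""
    for _ in range(max_rounds):
        code, out = run(["build"], project_dir)
        if code == 0:
            code, out = run(["test"], project_dir)
            if code == 0:
                return True  # project compiles and all tests pass
        revise(out)  # hand the errors back to the LLM for a revised project
    return False
```

The key design point is that the compiler and test harness act as the reward signal: the model never needs to guess whether its Rust is valid, because `rustc` tells it.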

We will demonstrate how the Rust Coder project works, how to integrate it into your agents, and ways to contribute to the open-source effort. We will also discuss pilot results from a large Rust coding camp (1000+ college students) using the Rust Coder tool.

The Rust Coder is supported by two Linux Foundation Mentorship grants, as well as content provided by the Rust Foundation.

Invisible Users, Invisible Interfaces: Accelerating Design Iteration with AI Simulation - Alex Liss

2025-06-03 Watch video

The genAI explosion has flipped classic software design on its head. Instead of building invisible interfaces, experiences so intuitive they feel second nature, we’ve seen a flood of awkward chatbot overlays and bolt-on features that confuse more than they help. But what if AI could be part of the solution?

The path back to seamless design lies in using AI not as a feature, but as a tool for design itself. Through invisible users, like Intelligent Twins for AI-driven audience simulations, and computer-use agents for visual evaluation, designers can accelerate needfinding and test interface concepts at scale.

This session will explain how, by simulating the way diverse users experience new interactions, teams can anticipate user needs, reduce friction, and build great interfaces faster. Don’t bolt genAI features onto existing products and tell people it’s magic – use AI to design software that actually feels like magic.

Luminal - Search-Based Deep Learning Compilers - Joe Fioti

2025-06-03 Watch video

Luminal is a deep learning compiler for CPUs, GPUs, and ASICs that takes a search-first approach to discovering efficient kernels, such as flash attention, automatically.

Text-to-Speech Data Preparation and Fine-tuning Workshop - Ronan McGovern

2025-06-03 Watch video

By the end of this workshop, you'll have trained Sesame's CSM-1B text-to-speech model on a voice from a YouTube video. The workshop will cover data preparation, fine-tuning, and evaluation.

The Current State of Browser Agents - Jerry Wu and Wyatt Marshall

2025-06-03 Watch video

Browser agents are here. But beyond simple sample use cases (I'm looking at you, flight-booking demo), are they as good as advertised?

In this talk, we introduce Web Bench, a new benchmark we've developed that rigorously tests browser agents across 450+ websites on real-world action based objectives such as info extraction, login/auth, form filling, and others. We'll dive into the results, unpack some unexpected discoveries, and discuss broader implications for the future of general purpose agents.

You'll walk away with practical insights into:

  1. A data-driven understanding of the capabilities and limitations of state-of-the-art browser agents
  2. How to meaningfully evaluate browser agents
  3. Hard-won lessons on designing and launching a benchmark

Come through and see what browser agents can really do.

Resources

  • Leaderboard: https://webbench.ai/
  • Technical report: https://halluminate.ai/blog/benchmark
  • GitHub: https://github.com/Halluminate/WebBench
  • Hugging Face: https://huggingface.co/datasets/Halluminate/WebBench

The RAG Stack We Landed On After 37 Fails - Jonathan Fernandes

2025-06-03 Watch video

Retrieval returning irrelevant results? Can't deploy solutions in the cloud? If these questions keep you up at night, you're likely experiencing the common frustrations of building an effective RAG system. But what if we could systematically optimise each component of the pipeline?

In this talk, I'll share the insights gained from 37 failed attempts, demonstrating live, with documents from a knowledge base, how each optimisation impacts the end result. You'll walk away understanding how to diagnose the weaknesses in your RAG pipeline and apply targeted improvements that dramatically boost performance in real-world applications.

Buy Now, Maybe Pay Later: Dealing with Prompt-Tax While Staying at the Frontier - Andrew Thomspson

2025-06-03 Watch video

Frontier LLMs now drop at warp speed. Each upgrade hits you with a Prompt‑Tax: busted prompts, cranky domain experts, and evals that show up fashionably late.

In this talk I’ll share 18 months of bruises (and wins) from shipping an agentic product for real‑estate lawyers:

  • The challenge of an evolving prompt library that breaks every time the model jumps
  • The bare‑bones tactics that actually work for faster migrations
  • Our “betting on the model” mantra: ship the newest frontier model even when it’s rough around the edges, then race to close the gaps before anyone else does

Walk away with a playbook to stay frontier‑fresh without blowing up your roadmap or your team’s sanity.

Cognitive Shield Real Time Real Smart - Rachna Srivastava

2025-06-03 Watch video

This high-energy demonstration unveils Cognitive Shield, a revolutionary three-level defense system that harnesses AI to combat sophisticated financial fraud. Watch as we showcase real-time deepfake detection, graph intelligence for fraud ring visualization, and cross-channel correlation of threats – all integrated within a comprehensive platform that amplifies human expertise rather than replacing it.

Learn how the same AI powering today's most dangerous financial attacks can be turned into our strongest defense.

Stop Ordering AI Takeout: A Cookbook for Winning When You Build In House - Jan Siml

2025-06-03 Watch video

Forget the multi-agent buffet—this is the home-cooked GenAI playbook that actually drives revenue.

In this 10-minute lightning talk, Jan Siml shares how a small, in-house team skipped the hype playbook—no multi-agent pipelines with GraphRAG, no monster eval suites—and still turned an internal GenAI assistant into real business impact.

🔑 What you’ll learn

  • Go Deep on One Job-to-Be-Done – depth crushes breadth when you own the data and the user.

  • Trace Every Click to Dollars – offline evals don’t sign contracts; revenue funnels do.

  • Push, Don’t Wait – zero-click Slack/email nudges outperform shiny chat UIs.

  • Convert Time-Saved into Time-Well-Spent – guide the next action, not just the answer.

  • Data & UX vs Bigger Models – integrations and better flow move the needle; fancy LLMs mostly move the bill.

If you’re ready to trade Michelin-priced SaaS features for pragmatic, in-house wins—and you like your lessons straight from the kitchen rather than the brochure—hit play. Your AI roadmap (and budget) will thank you.

The Benchmarks Game: Why It's Rigged and How You Can (Really) Win - Darius Emrani

2025-06-03 Watch video

AI benchmarks control billions in investment and shape entire markets - but the game is rigged. In this talk, I'll expose the three "cheat codes" companies use to game benchmarks:

  • Cherry-picking comparisons (xAI's selective Grok-3 graphs)
  • Buying privileged access (OpenAI's FrontierMath funding)
  • Optimizing for style over substance (Meta's 27 Llama-4 variants on LM Arena)

When Andrej Karpathy says "I don't really know what metrics to look at right now," we have a crisis. I'll show you why Goodhart's Law guarantees benchmarks fail when billions are at stake, and more importantly, what to do about it.

You'll learn:

  • How to spot benchmark manipulation (with real examples)
  • Why 39% of score variance is just writing style
  • A 5-step framework to build evaluations that actually matter for YOUR use case
  • How pre-deployment evaluation loops separate reliable AI from constant firefighting

Drawing from my experience building evaluation systems at Waymo, Uber ATG, and SpaceX (where bad evals literally crash), I'll show you how to stop playing the rigged benchmarks game and start measuring what actually matters.

Unlocking Africa's Potential with AI — Thabang Ledwaba

2025-06-03 Watch video

As Africa stands at the crossroads of rapid population growth, urbanization, and digital transformation, Artificial Intelligence (AI) presents unprecedented opportunities to tackle some of the continent’s most pressing challenges. This presentation explores how AI can be harnessed as a tool for sustainable development—addressing issues in healthcare, agriculture, education, infrastructure, and governance.

We’ll delve into real-world applications of AI across African nations, highlight innovative local solutions, and discuss how ethical and inclusive AI development can empower communities, bridge data gaps, and foster economic growth. The session will also examine the importance of homegrown talent, policy frameworks, and cross-sector collaboration in shaping an AI-powered future tailored to Africa’s unique context.

It is time we reimagine the continent’s future through the lens of AI—one that is driven by innovation, equity, and resilience.

Analyzing 10,000 Sales Calls With AI In 2 Weeks — Charlie Guo

2025-06-03 Watch video

AKA: The Data Goldmine You’re Probably Ignoring

Most companies are sitting on mountains of customer data: sales calls, customer support tickets, product reviews, user feedback, and social media interactions. But the truth is that most of this valuable data remains untouched - or worse, unusable.

In this case study, I'll share how our team leveraged Claude to analyze 10,000 sales call transcripts in a handful of days, extracting deep customer insights at scale. We'll cover the AI engineering challenges we faced, including model selection tradeoffs, reducing hallucinations with retrieval-augmented generation (RAG), and optimizing prompt caching to dramatically cut costs and latency (by up to 90% in some cases).
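The prompt-caching optimisation mentioned above generally relies on keeping the long, static part of the prompt byte-identical across calls so the provider can cache that prefix. A hedged sketch of the structure, where `call_llm` and `RUBRIC` are stand-ins for illustration, not a real client API or the team's actual prompt:

```python
# The long, expensive part of the prompt (instructions, rubric, few-shot
# examples) stays byte-identical across every call; only the transcript
# varies. Identical prefixes are what make provider-side prompt caching
# effective.
RUBRIC = (
    "You extract customer insights from a sales-call transcript. "
    "Return JSON with keys: pain_points, objections, competitors_mentioned. "
    "...(imagine several thousand tokens of instructions and examples)..."
)

def build_messages(transcript):
    # Static system prompt first, variable content last.
    return [
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": transcript},
    ]

def analyze_all(transcripts, call_llm):
    return [call_llm(build_messages(t)) for t in transcripts]
```

Across thousands of transcripts, a cached shared prefix is where the large cost and latency savings come from, since the rubric dwarfs any single transcript.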

This isn't theoretical - it's a practical blueprint with concrete ROI metrics. Perfect for AI engineers, data scientists, and anyone sitting on mountains of unstructured customer data they can't analyze at scale.

Read more at https://www.ignorance.ai/

Letting AI Interface with your App with MCP — Kent C Dodds

2025-06-03 Watch video

We are entering a new era of user interaction. It's being built right before our very eyes and changing rapidly. As crazy as it sounds, soon each one of us will get our own Jarvis capable of performing actually useful tasks for us with a completely different user interaction mechanism than we're used to.

But someone's gotta give Jarvis the tools to perform these tasks, and that's where we come in.

In this talk, Kent will live code an MCP server and use it with an AI assistant to help us catch the vision of what this future could look like and our role in it.
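
To make "giving Jarvis the tools" concrete, here is a toy version of the pattern an MCP server embodies: the app registers named tools with descriptions, and a dispatch layer maps the model's tool calls onto application code. This is a simplified illustration, not the official MCP SDK, and `create_todo` is a hypothetical tool:

```python
# Toy sketch of the tool-registration pattern behind an MCP server.
TOOLS = {}

def tool(name, description):
    """Register a function as a callable tool."""
    def register(fn):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return register

@tool("create_todo", "Add a todo item for the current user")
def create_todo(title):
    # In a real app this would write to your database.
    return {"status": "created", "title": title}

def dispatch(name, arguments):
    """What an MCP server does at its core: route a tool call by name."""
    return TOOLS[name]["fn"](**arguments)
```

The real protocol adds schemas, transports, and capability negotiation on top, but the essential move is the same: exposing app functionality as named, described tools the assistant can invoke.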