talk-data.com

Event

PyData Seattle 2025

2025-11-07 – 2025-11-09 PyData

Activities tracked

19

Filtering by: AI/ML

Sessions & talks

Showing 1–19 of 19 · Newest first


The Problem of Address Matching: a Journey through NLP and AI

2025-11-09
talk

The problem of address matching arises when the address of one physical place is written in two or more different ways. This situation is very common in companies that receive customer records from different sources. The differences can be classified as syntactic or semantic. In the first type, the meaning is the same but the writing differs: for example, "Street" vs. "St". In the second type, the meaning is not exactly the same: for example, "Road" instead of "Street". To solve this problem and match addresses, we have two approaches. The first, and simpler, uses similarity metrics. The second uses natural language processing and transformers. This is a hands-on talk intended for data process analysts. We will walk through these solutions implemented in a Jupyter notebook using Python.
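As a rough illustration of the similarity-metrics approach the abstract mentions, here is a minimal stdlib sketch; the abbreviation table and function names are hypothetical, and real pipelines would use richer normalization and metrics.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table: expand common syntactic variants
# ("St" vs. "Street") before comparing.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

def normalize(address: str) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = address.lower().replace(".", "").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("123 Main St.", "123 Main Street"))  # 1.0 after normalization
```

Semantic differences ("Road" vs. "Street") are where plain string similarity falls short, which motivates the transformer-based approach covered in the talk.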

LLMs, Chatbots, and Dashboards: Visualize and Analyze Your Data with Natural Language

2025-11-09
talk

LLMs have a lot of hype around them these days. Let’s demystify how they work and see how we can put them in context for data science use. As data scientists, we want our results to be inspectable, reliable, reproducible, and replicable, and we already have many tools to help us on this front. LLMs, however, pose a new challenge: we may not always get the same results back from a query. This means working out the areas where LLMs excel and using those behaviors in our data science artifacts. This talk will introduce you to LLMs, the Chatlas package, and how they can be integrated into a Shiny app to create an AI-powered dashboard (using querychat). We’ll see how we can leverage the tasks LLMs are good at to improve our data science products.

Building Bazel Packages for AI/ML: SciPy, PyTorch, and Beyond

2025-11-09
talk

AI/ML workloads depend heavily on complex software stacks, including numerical computing libraries (SciPy, NumPy), deep learning frameworks (PyTorch, TensorFlow), and specialized toolchains (CUDA, cuDNN). However, integrating these dependencies into Bazel-based workflows remains challenging due to compatibility issues, dependency resolution, and performance optimization. This session explores the process of creating and maintaining Bazel packages for key AI/ML libraries, ensuring reproducibility, performance, and ease of use for researchers and engineers.

How to make datamap web-apps of embedding vectors via open source tooling

2025-11-09
talk

Datamaps are ML-powered visualizations of high-dimensional data, and in this talk the data is collections of embedding vectors. Interactive datamaps run in-browser as web-apps, potentially without any code running on the web server. Datamap tech can be used to visualize, say, the entire collection of chunks in a RAG vector database.

The best-of-breed tools of this new datamap technique are liberally licensed open source. This presentation is an introduction to building with those repos. The maths will be mentioned only in passing; the topic here is simply how-to with specific tools. Talk attendees will be learning about Python tools, which produce high-quality web UIs.

DataMapPlot is the premier tool for rendering a datamap as a web-app. Here is a live demo: https://connoiter.com/datamap/cff30bc1-0576-44f0-a07c-60456e131b7b

00-25: Intro to datamaps
25-45: Pipeline architecture
45-55: Demos touring such tools as UMAP, HDBSCAN, DataMapPlot, Toponomy, etc.
55-90: Group coding

A Google account is required to log in to Google Colab, where participants can run the workshop notebooks. A Hugging Face API key (token) is needed to download Gemma models.

There's no place like home: using AI agents in Jupyter notebooks

2025-11-09
talk

This talk explores how AI agents integrated directly into Jupyter notebooks can help with every part of your data science work. We'll cover the latest notebook-focused agentic features in VS Code, demonstrating how they automate tedious tasks like environment management or graph styling, elevate your "scratch notebook" into shareable code, and more generally streamline data science workflows directly in notebooks.

Beyond Just Prediction: Causal Thinking in Machine Learning

2025-11-09
talk

Most ML models excel at prediction, answering questions like "Who will buy our product?" or "Which customers are likely to churn?". But when it comes to making actionable decisions, prediction alone can be misleading. Correlation does not imply causation, and business decisions require understanding causal relationships to drive the right outcomes.

In this talk, we will explore how causal machine learning, specifically uplift modeling, can bridge the gap between prediction and decision making. Using a real-world use case, we will showcase how uplift modeling helps identify who will respond positively to interventions while avoiding those whom the intervention might deter.
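The core quantity in uplift modeling can be sketched in a few lines: the difference in conversion rate between treated and control groups. The counts below are made up for illustration, and a real uplift model would estimate this per customer, not per segment.

```python
def uplift(treated_conv, treated_n, control_conv, control_n):
    """Estimated uplift = P(convert | treated) - P(convert | control)."""
    return treated_conv / treated_n - control_conv / control_n

# Hypothetical segment counts: positive uplift suggests the intervention
# helps; negative uplift flags customers it might deter ("sleeping dogs").
persuadables = uplift(120, 1000, 60, 1000)   # ~0.06: worth targeting
sleeping_dogs = uplift(40, 1000, 90, 1000)   # negative: likely to backfire
print(persuadables, sleeping_dogs)
```

A plain churn or purchase model would rank both segments by predicted probability alone and miss this distinction, which is the gap between prediction and decision making the talk addresses.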

Building Inference Workflows with Tile Languages

2025-11-08
talk

The world of generative AI is expanding, with new models hitting the market daily. The field has bifurcated between model training and model inference. The need for fast inference has led to the development of numerous tile languages. These languages use concepts from linear algebra and borrow common NumPy APIs. In this talk we will show how tiling works and how to build inference models from scratch in pure Python with embedded tile languages. The goal is to provide attendees with a good overview that can be integrated into common data pipelines.
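To make "how tiling works" concrete, here is a hand-written tiled matrix multiply in pure Python. Tile languages express this loop nest declaratively and map tiles onto hardware; the explicit version below only illustrates the blocked access pattern, and `tiled_matmul` is a hypothetical name.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply A (m x k) by B (k x n) one tile at a time.

    Each (i0, j0) output tile is accumulated from pairs of input tiles,
    which is the access pattern tile languages optimize for locality."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate one output tile from one pair of input tiles.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        for kk in range(k0, min(k0 + tile, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tiled_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

On real hardware the payoff is that each tile fits in fast memory (registers or shared memory) while it is reused, which is why inference kernels are written this way.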

The Missing 78%: What We Learned When Our Community Doubled Overnight

2025-11-08
talk

Women make up only 22% of data and AI roles and contribute just 3% of Python commits, leaving a “missing 78%” of untapped talent and perspective. This talk shares what happened when our community doubled overnight, revealing hidden demand for inclusive spaces in scientific Python.

We’ll present the data behind this growth, examine systemic barriers, and introduce the VIM framework (Visibility–Invitation–Mechanism) — a research-backed model for building resilient, inclusive communities. Attendees will leave with practical, reproducible strategies to grow engagement, improve retention, and ensure that the future of AI and Python is shaped by all voices, not just the few.

Explainable AI for Biomedical Image Processing

2025-11-08
talk

Advancements in deep learning for biomedical image processing have led to the development of promising algorithms across multiple clinical domains, including radiology, digital pathology, ophthalmology, cardiology, and dermatology, among others. With robust AI models demonstrating commendable results, it is crucial to understand that their limited interpretability can impede the clinical translation of deep learning algorithms. The inference mechanism of these black-box models is not entirely understood by clinicians, patients, regulatory authorities, and even algorithm developers, thereby exacerbating safety concerns. In this interactive talk, we will explore some novel explainability techniques designed to interpret the decision-making process of robust deep learning algorithms for biomedical image processing. We will also discuss the impact and limitations of these techniques and analyze their potential to provide medically meaningful algorithmic explanations. Open-source resources for implementing these interpretability techniques using Python will be covered to provide a holistic understanding of explaining deep learning models for biomedical image processing.
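One simple member of the family of explainability techniques the abstract describes is occlusion sensitivity: mask a region of the input, measure how much the model's score drops, and treat large drops as important regions. The sketch below uses a toy scoring function, not a real biomedical model; all names are hypothetical.

```python
def occlusion_map(image, model, patch=2, baseline=0.0):
    """Slide a patch over the image, replace it with a baseline value, and
    record the score drop -- larger drops mark more influential regions."""
    h, w = len(image), len(image[0])
    base_score = model(image)
    heat = [[0.0] * w for _ in range(h)]
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            occluded = [row[:] for row in image]  # copy, then mask one patch
            for i in range(r, min(r + patch, h)):
                for j in range(c, min(c + patch, w)):
                    occluded[i][j] = baseline
            drop = base_score - model(occluded)
            for i in range(r, min(r + patch, h)):
                for j in range(c, min(c + patch, w)):
                    heat[i][j] = drop
    return heat

# Toy "model": its score is the mean of the top-left 2x2 corner, so only
# occluding that corner should produce a large drop in the heatmap.
model = lambda img: sum(img[i][j] for i in range(2) for j in range(2)) / 4
image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, model)
```

For deep networks the same loop runs over a forward pass per patch, which is expensive but model-agnostic; gradient-based methods like Grad-CAM trade that cost for architectural assumptions.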

This talk is distilled from a course that Ojas Ramwala designed, which received the best seminar award for the highest graduate student enrollment at the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle.

Supercharging Multimodal Feature Engineering with Lance and Ray

2025-11-08
talk
Jack Ye (AWS Open Data Analytics)

Efficient feature engineering is key to unlocking modern multimodal AI workloads. In this talk, we’ll dive deep into how Lance - an open-source format with built-in indexing, random access, and data evolution - works seamlessly with Ray’s distributed compute and UDF capabilities. We’ll walk through practical pipelines for preprocessing, embedding computation, and hybrid feature serving, highlighting concrete patterns attendees can take home to supercharge their own multimodal pipelines. See https://lancedb.github.io/lance/integrations/ray to learn more about this integration.

Why Models Break Your Pipelines (and How to Make Them First-Class Citizens)

2025-11-08
talk

Most AI pipelines still treat models like Python UDFs, just another function bolted onto Spark, Pandas, or Ray. But models aren’t functions: they’re expensive, stateful, and difficult to configure. In this talk, we’ll explore why this mental model breaks at scale and share practical patterns for treating models as first-class citizens in your pipelines.
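The "expensive, stateful" point can be sketched with a minimal actor-style wrapper: load the model once, keep it alive across calls, and batch inputs to amortize overhead. Everything here (`ModelActor`, the lambda "model") is an illustrative stand-in, not an API from Spark, Pandas, or Ray.

```python
class ModelActor:
    """Treat a model as a long-lived, stateful resource rather than a UDF:
    load weights once, reuse them across calls, and batch inputs."""

    def __init__(self, loader, batch_size=4):
        self._loader = loader
        self._model = None          # expensive state, created lazily
        self.batch_size = batch_size

    def _ensure_loaded(self):
        if self._model is None:
            self._model = self._loader()  # happens once, not per record

    def predict(self, records):
        self._ensure_loaded()
        out = []
        for i in range(0, len(records), self.batch_size):
            out.extend(self._model(records[i:i + self.batch_size]))
        return out

# Hypothetical "model": doubles each input; loading is the costly part.
actor = ModelActor(loader=lambda: (lambda batch: [x * 2 for x in batch]))
print(actor.predict([1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
```

A per-row UDF would instead pay the loader cost on every call (or rely on fragile global state), which is exactly the mental-model mismatch the talk examines.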

Building Agents with Agent Bricks and MCP

2025-11-08
talk
Denny Lee (Databricks)

Want to create AI agents that can do more than just generate text? Join us to explore how combining Databricks' Agent Bricks with the Model Context Protocol (MCP) unlocks powerful tool-calling capabilities. We'll show you how MCP provides a standardized way for AI agents to interact with external tools, data and APIs, solving the headache of fragmented integration approaches. Learn to build agents that can retrieve both structured and unstructured data, execute custom code and tackle real enterprise challenges.

Scaling Background Noise Filtration for AI Voice Agents

2025-11-08
talk

In the world of AI voice agents, especially in sensitive contexts like healthcare, audio clarity is everything. Background noise—a barking dog, a TV, street sounds—degrades transcription accuracy, leading to slower, clunkier, and less reliable AI responses. But how do you solve this in real-time without breaking the bank?

This talk chronicles our journey at a health-tech startup to ship background noise filtration at scale. We'll start with the core principles of noise reduction and our initial experiments with open-source models, then dive deep into the engineering architecture required to scale a compute-hungry ML service using Python and Kubernetes. You'll learn about the practical, operational considerations of deploying third-party models and, most importantly, how to measure their true impact on the product.
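As a toy illustration of the "core principles of noise reduction" the talk starts from, here is the simplest possible gate: suppress samples whose amplitude is below a threshold. This is not the production approach described (which uses ML models on frequency-domain features); it only shows the basic idea of attenuating low-energy background content.

```python
def noise_gate(samples, threshold=0.1, attenuation=0.0):
    """Attenuate time-domain samples whose magnitude is below threshold.

    Real noise suppression (spectral subtraction, learned filters) works on
    frequency-domain features; this toy gate is purely illustrative."""
    return [s if abs(s) >= threshold else s * attenuation for s in samples]

# Loud speech samples interleaved with low-level hiss (made-up values).
speech_with_hiss = [0.8, 0.02, -0.6, 0.05, 0.9, -0.03]
print(noise_gate(speech_with_hiss))
```

The gap between this and a deployable system (artifacts at gate boundaries, real-time latency budgets, GPU serving) is precisely the engineering journey the talk covers.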

Taming the Data Tsunami: An Open-Source Playbook to Get Ready for ML

2025-11-08
talk

Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late—after the data has already been moved into centralized stores or training clusters—creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?

In this talk, we’ll discuss an open-source playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.

Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size by 50–70%, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle—rather than patching them in afterward.

Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of open-source tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.
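The "declarative, policy-as-code" idea can be sketched as rules evaluated against each record at the source, before it is moved downstream. The rule schema, field names, and `apply_policy` helper below are all hypothetical, standing in for whatever open-source framework the talk presents.

```python
# Hypothetical policy-as-code rules: each declarative rule names a field,
# a predicate, and an action applied where the data is generated.
POLICY = [
    {"field": "confidence", "op": "lt", "value": 0.2, "action": "drop"},
    {"field": "user_email", "op": "exists", "value": True, "action": "redact"},
]

def apply_policy(record, policy=POLICY):
    """Apply drop/redact rules to one record before it leaves the source."""
    out = dict(record)
    for rule in policy:
        field = rule["field"]
        if rule["op"] == "lt" and out.get(field, float("inf")) < rule["value"]:
            return None  # discard noisy record upstream
        if rule["op"] == "exists" and field in out:
            out[field] = "<redacted>"  # governance applied before transport
    return out

records = [
    {"text": "good sample", "confidence": 0.9, "user_email": "a@b.c"},
    {"text": "noise", "confidence": 0.05},
]
kept = [r for r in (apply_policy(rec) for rec in records) if r is not None]
```

Because the rules are data rather than code, they can be versioned and reviewed like any other config, which is how reproducibility and compliance get embedded into the lifecycle rather than patched in afterward.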

There and back again... by ferry or I-5?

2025-11-08
talk

Living on Washington State’s peninsula offers endless beauty, nature, and commuting challenges. In this talk, I’ll share how I built an agentic AI system that creates and compares optimal routes to the mainland, factoring in ferry schedules, costs, driving distances, and live traffic. Originally a testbed for the Model Context Protocol (MCP) framework, this project now manages my travel schedule, generates expense estimates, and sends timely notifications for events. I’ll give a comprehensive overview of MCP, show how to quickly turn ideas into working agentic AI, and discuss practical integration with real-world APIs. Attendees will leave with actionable insights and a roadmap for building their own agentic AI solutions.

Generalized Additive Models: Explainability Strikes Back

2025-11-07
talk

Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) strike a rare balance: they combine the flexibility of complex models with the clarity of simple ones.

They often achieve performance comparable to black-box models, yet remain:
- Easy to interpret
- Computationally efficient
- Aligned with the growing demand for transparency in AI

With recent U.S. AI regulations (White House, 2022) and increasing pressure from decision-makers for explainable models, GAMs are emerging as a natural choice across industries.


Audience

This talk is for attendees with some background in Python and statistics, including:
- Data scientists
- Machine learning engineers
- Researchers


Takeaway

By the end, you’ll understand:
- The intuition behind GAMs
- How to build and apply them in practice
- How to interpret and explain GAM predictions and results in Python


Prerequisites

You should be comfortable with:
- Basic regression concepts
- Model regularization
- The bias–variance trade-off
- Python programming

Optimizing AI/ML Workloads: Resource Management and Cost Attribution

2025-11-07
talk

The proliferation of AI/ML workloads across commercial enterprises necessitates robust mechanisms to track, inspect, and analyze their use of on-prem/cloud infrastructure. Effective insights are crucial for optimizing cloud resource allocation as workload demand increases, while mitigating infrastructure costs and promoting operational stability.

This talk will outline an approach to systematically monitor, inspect, and analyze AI/ML workloads’ properties such as runtime, resource demand/utilization, and cost attribution tags. By implementing granular inspection across multiple teams and projects, organizations can gain actionable insights into resource bottlenecks, identify opportunities for cost savings, and enable AI/ML platform engineers to directly attribute infrastructure costs to specific workloads.

Cost attribution of infrastructure usage by AI/ML workloads focuses on key metrics such as compute node group information, CPU usage seconds, data transfer, GPU allocation, and memory and ephemeral storage utilization. It enables platform administrators to identify competing workloads that lead to diminishing ROI. It also becomes easier to answer questions from data scientists like "Why did my workload run for 6 hours today, when it took only 2 hours yesterday?" or "Why did my workload start 3 hours behind schedule?"

Through our work on Metaflow, we will showcase how we built a comprehensive framework for transparent usage reporting, cost attribution, performance optimization, and strategic planning for future AI/ML initiatives. Metaflow is a human-centric Python library that enables seamless scaling and management of AI/ML projects.

Ultimately, a well-defined usage tracking system empowers organizations to maximize the return on investment from their AI/ML endeavors while maintaining budgetary control and operational efficiency. Platform engineers and administrators will be able to gain insights into the following operational aspects of supporting a battle hardened ML Platform:

1. Optimize resource allocation: Understand consumption patterns to right-size clusters and allocate resources more efficiently, reducing idle time and preventing bottlenecks.

2. Proactively manage capacity: Forecast future resource needs based on historical usage trends, ensuring the infrastructure can scale effectively with increasing workload demand.

3. Facilitate strategic planning: Make informed decisions regarding future infrastructure investments and scaling strategies.

4. Diagnose workload execution delays: Identify resource contention, queuing issues, or insufficient capacity leading to delayed workload starts.

Data scientists, on the other hand, will gain clarity on the factors that influence workload performance. Tuning those factors can lead to efficiencies in runtime and associated cost profiles.
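The cost-attribution step the abstract describes reduces to rolling up metered usage by attribution tag. The records, team names, and unit rates below are made up for illustration; real rates would come from a cloud provider or internal chargeback model, not from Metaflow itself.

```python
from collections import defaultdict

# Hypothetical per-workload usage records, tagged for attribution.
USAGE = [
    {"team": "nlp", "cpu_core_seconds": 7200, "gpu_seconds": 3600},
    {"team": "nlp", "cpu_core_seconds": 1800, "gpu_seconds": 0},
    {"team": "vision", "cpu_core_seconds": 3600, "gpu_seconds": 7200},
]

# Illustrative unit rates (USD per second of each resource).
RATES = {"cpu_core_seconds": 0.00001, "gpu_seconds": 0.0005}

def attribute_costs(usage, rates):
    """Roll up infrastructure cost per attribution tag (here: team)."""
    totals = defaultdict(float)
    for rec in usage:
        totals[rec["team"]] += sum(rec[m] * r for m, r in rates.items())
    return dict(totals)

print(attribute_costs(USAGE, RATES))
```

Extending the tags (project, workload name, node group) is what lets platform engineers answer the "who is paying for what, and why was it slow" questions described above.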

Red Teaming AI: Getting Started with PyRIT for Safer Generative AI Systems

2025-11-07
talk

As generative AI systems become more powerful and widely deployed, ensuring safety and security is critical. This talk introduces AI red teaming—systematically probing AI systems to uncover potential risks—and demonstrates how to get started using PyRIT (Python Risk Identification Toolkit), an open-source framework for automated and semi-automated red teaming of generative AI systems. Attendees will leave with a practical understanding of how to identify and mitigate risks in AI applications, and how PyRIT can help along the way.

Keynote: Josh Starmer - Communicating Concepts, Clearly Explained!!! (Or, why I don’t worry about AI taking my job and sense of purpose away from me.)

2025-11-07
talk
Josh Starmer (StatQuest (YouTube))

In this talk I'll discuss 3 goals I have when I try to communicate complicated topics. I'll then illustrate how I used these goals to guide the development of my most popular video, PCA Step-by-Step, which has over 3.4 million views.