talk-data.com

Event

PyData Seattle 2025

2025-11-07 – 2025-11-09 PyData

Activities tracked

62

Sessions & talks

Showing 26–50 of 62 · Newest first


The Missing 78%: What We Learned When Our Community Doubled Overnight

2025-11-08
talk

Women make up only 22% of data and AI roles and contribute just 3% of Python commits, leaving a “missing 78%” of untapped talent and perspective. This talk shares what happened when our community doubled overnight, revealing hidden demand for inclusive spaces in scientific Python.

We’ll present the data behind this growth, examine systemic barriers, and introduce the VIM framework (Visibility–Invitation–Mechanism) — a research-backed model for building resilient, inclusive communities. Attendees will leave with practical, reproducible strategies to grow engagement, improve retention, and ensure that the future of AI and Python is shaped by all voices, not just the few.

Break

2025-11-08
talk

Lightning Talks

2025-11-08 Watch
talk

Sign up for a 5-minute lightning talk at the NumFOCUS booth on Friday.

Lunch

2025-11-08
talk

Explainable AI for Biomedical Image Processing

2025-11-08 Watch
talk

Advancements in deep learning for biomedical image processing have led to the development of promising algorithms across multiple clinical domains, including radiology, digital pathology, ophthalmology, cardiology, and dermatology, among others. With robust AI models demonstrating commendable results, it is crucial to understand that their limited interpretability can impede the clinical translation of deep learning algorithms. The inference mechanism of these black-box models is not entirely understood by clinicians, patients, regulatory authorities, and even algorithm developers, thereby exacerbating safety concerns. In this interactive talk, we will explore some novel explainability techniques designed to interpret the decision-making process of robust deep learning algorithms for biomedical image processing. We will also discuss the impact and limitations of these techniques and analyze their potential to provide medically meaningful algorithmic explanations. Open-source resources for implementing these interpretability techniques using Python will be covered to provide a holistic understanding of explaining deep learning models for biomedical image processing.

This talk is distilled from a course that Ojas Ramwala designed, which received the best seminar award for the highest graduate student enrollment at the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle.
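Occlusion sensitivity is one of the simpler interpretability techniques in this space. As a rough illustration of the general idea (not the specific methods from the talk), here is a minimal numpy sketch with a toy scoring function standing in for a trained model:

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4):
    """Slide a gray patch over the image and record how much the
    model's score drops at each position. Larger drops mark regions
    the model relies on."""
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

# Toy "model": scores the mean intensity of the top-left quadrant.
def toy_score(img):
    return img[:8, :8].mean()

img = np.zeros((16, 16))
img[:8, :8] = 1.0  # bright "lesion" in the top-left
heat = occlusion_map(img, toy_score)
```

The heatmap peaks where the bright region sits, which is exactly the kind of medically inspectable output these techniques aim for.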

Supercharging Multimodal Feature Engineering with Lance and Ray

2025-11-08 Watch
talk
Jack Ye (AWS Open Data Analytics)

Efficient feature engineering is key to unlocking modern multimodal AI workloads. In this talk, we’ll dive deep into how Lance - an open-source format with built-in indexing, random access, and data evolution - works seamlessly with Ray’s distributed compute and UDF capabilities. We’ll walk through practical pipelines for preprocessing, embedding computation, and hybrid feature serving, highlighting concrete patterns attendees can take home to supercharge their own multimodal pipelines. See https://lancedb.github.io/lance/integrations/ray to learn more about this integration.

Why Models Break Your Pipelines (and How to Make Them First-Class Citizens)

2025-11-08
talk

Most AI pipelines still treat models like Python UDFs, just another function bolted onto Spark, Pandas, or Ray. But models aren’t functions: they’re expensive, stateful, and difficult to configure. In this talk, we’ll explore why this mental model breaks at scale and share practical patterns for treating models as first-class citizens in your pipelines.
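As a toy illustration of the cost the abstract describes (all names are hypothetical), compare a plain UDF that pays the model-loading price on every record with a stateful runner that loads once and serves batches:

```python
LOAD_COUNT = {"n": 0}

def load_model():
    # Stand-in for an expensive checkpoint load (weights, GPU init, ...).
    LOAD_COUNT["n"] += 1
    return lambda x: x * 2

# Anti-pattern: a plain UDF reloads the model for every record.
def udf_predict(x):
    model = load_model()
    return model(x)

# First-class pattern: a stateful wrapper loads once, then serves batches.
class ModelRunner:
    def __init__(self):
        self.model = load_model()

    def predict_batch(self, batch):
        return [self.model(x) for x in batch]

data = list(range(100))
_ = [udf_predict(x) for x in data]   # 100 loads
runner = ModelRunner()
out = runner.predict_batch(data)     # 1 more load, 101 total
```

Distributed frameworks express the second pattern with stateful workers or actors; the point is that the model's lifecycle is managed explicitly rather than hidden inside a function call.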

Building Agents with Agent Bricks and MCP

2025-11-08 Watch
talk
Denny Lee (Databricks)

Want to create AI agents that can do more than just generate text? Join us to explore how combining Databricks' Agent Bricks with the Model Context Protocol (MCP) unlocks powerful tool-calling capabilities. We'll show you how MCP provides a standardized way for AI agents to interact with external tools, data and APIs, solving the headache of fragmented integration approaches. Learn to build agents that can retrieve both structured and unstructured data, execute custom code and tackle real enterprise challenges.
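As a rough sketch of the tool-calling idea (this is illustrative plain Python, not the MCP SDK or the Agent Bricks API): a server advertises tools with names and parameter descriptions, and the agent dispatches model-produced calls by name.

```python
import json

# Hypothetical minimal tool registry: each tool exposes a name, a
# schema-like parameter description, and a handler the agent can call.
TOOLS = {}

def tool(name, params):
    def register(fn):
        TOOLS[name] = {"params": params, "handler": fn}
        return fn
    return register

@tool("lookup_order", params={"order_id": "string"})
def lookup_order(order_id):
    return {"order_id": order_id, "status": "shipped"}  # stubbed data

def list_tools():
    """What the agent sees when it asks the server for its tools."""
    return [{"name": n, "params": t["params"]} for n, t in TOOLS.items()]

def call_tool(request_json):
    """Dispatch a model-produced tool call by name."""
    req = json.loads(request_json)
    return TOOLS[req["name"]]["handler"](**req["arguments"])

result = call_tool('{"name": "lookup_order", "arguments": {"order_id": "42"}}')
```

The value of a standard like MCP is that this discovery-and-dispatch contract is the same for every tool and every agent, instead of being reinvented per integration.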

Scaling Background Noise Filtration for AI Voice Agents

2025-11-08 Watch
talk

In the world of AI voice agents, especially in sensitive contexts like healthcare, audio clarity is everything. Background noise—a barking dog, a TV, street sounds—degrades transcription accuracy, leading to slower, clunkier, and less reliable AI responses. But how do you solve this in real-time without breaking the bank?

This talk chronicles our journey at a health-tech startup to ship background noise filtration at scale. We'll start with the core principles of noise reduction and our initial experiments with open-source models, then dive deep into the engineering architecture required to scale a compute-hungry ML service using Python and Kubernetes. You'll learn about the practical, operational considerations of deploying third-party models and, most importantly, how to measure their true impact on the product.
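One classic starting point for noise reduction is spectral subtraction: estimate the noise floor from frames assumed to be speech-free, then subtract it from every frame's magnitude spectrum. A minimal numpy sketch (a real-time production system would use streaming windows and a learned model, as the talk discusses):

```python
import numpy as np

def spectral_gate(signal, frame=256, noise_frames=4):
    """Subtract a noise-floor estimate (from the first few frames,
    assumed speech-free) from each frame's magnitude spectrum."""
    n = len(signal) // frame
    frames = signal[:n * frame].reshape(n, frame)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_floor = mag[:noise_frames].mean(axis=0)
    clean_mag = np.maximum(mag - noise_floor, 0.0)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)
    return clean.reshape(-1)

rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
noisy = 0.1 * rng.standard_normal(4096)
noisy[1024:] += np.sin(2 * np.pi * 440 * t[1024:])  # silence, then "speech"
denoised = spectral_gate(noisy)
```

The noise-only opening is strongly attenuated while the tone survives, which is the basic trade-off every fancier method refines.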

Taming the Data Tsunami: An Open-Source Playbook to Get Ready for ML

2025-11-08
talk

Machine learning teams today are drowning in massive volumes of raw, redundant data that inflate training costs, slow down experimentation, and degrade model quality. The core architectural flaw is that we apply control too late—after the data has already been moved into centralized stores or training clusters—creating waste, instability, and long iteration cycles. What if we could fix this problem right at the source?

In this talk, we’ll discuss an open-source playbook for shifting ML data filtering, transformation, and governance upstream, directly where data is generated. We’ll walk through a declarative, policy-as-code framework for building distributed pipelines that intelligently discard noise, balance datasets, and enrich signals before they ever reach your model training infrastructure.

Drawing from real-world ML workflows, we’ll show how this “upstream control” approach can reduce dataset size by 50–70%, cut model onboarding time in half, and embed reproducibility and compliance directly into the ML lifecycle—rather than patching them in afterward.

Attendees will leave with:
- A mental model for analyzing and optimizing the ML data supply chain.
- An understanding of open-source tools for declarative, source-level ML data controls.
- Actionable strategies to accelerate iteration, lower training costs, and improve model outcomes.
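As a toy sketch of the policy-as-code idea (field names and rules are hypothetical and not tied to any specific tool): declarative rules are evaluated where records are produced, so noise never reaches centralized storage.

```python
# Declarative policies, evaluated at the data source.
POLICIES = [
    {"field": "label", "rule": "not_null"},
    {"field": "confidence", "rule": "min", "value": 0.5},
    {"field": "text", "rule": "max_len", "value": 20},
]

def passes(record, policy):
    v = record.get(policy["field"])
    if policy["rule"] == "not_null":
        return v is not None
    if policy["rule"] == "min":
        return v is not None and v >= policy["value"]
    if policy["rule"] == "max_len":
        return v is not None and len(v) <= policy["value"]
    raise ValueError(f"unknown rule: {policy['rule']}")

def filter_at_source(records):
    """Drop records that violate any policy before they are shipped."""
    return [r for r in records if all(passes(r, p) for p in POLICIES)]

raw = [
    {"label": "cat", "confidence": 0.9, "text": "short caption"},
    {"label": None, "confidence": 0.9, "text": "missing label"},
    {"label": "dog", "confidence": 0.2, "text": "low confidence"},
]
kept = filter_at_source(raw)
```

Because the policies are data, not code, they can be versioned and audited, which is where the reproducibility and compliance claims come from.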

How to Optimize your Python Program for Slowness: Inspired by New Turing Machine Results

2025-11-08 Watch
talk

Many talks show how to make Python code faster. This one flips the script: what if we try to make our Python as slow as possible? By exploring deliberately inefficient programs — from infinite loops to Turing machines that halt only after an astronomically long time — we’ll discover surprising lessons about computation, large numbers, and the limits of programming languages. Inspired by new Turing machine results, this talk will connect Python experiments with deep questions in theoretical computer science.
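For a taste of the theme, here is a tiny Turing machine simulator running the 2-state busy beaver: the 2-state machine that runs longest (6 steps) before halting. The recent results the talk draws on concern exactly this game at larger sizes.

```python
# Transition table: (state, symbol) -> (write, move, next_state).
# "H" is the halt state; the tape is a dict defaulting to 0.
BB2 = {
    ("A", 0): (1, +1, "B"), ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"), ("B", 1): (1, +1, "H"),
}

def run(machine, state="A", limit=10_000):
    tape, pos, steps = {}, 0, 0
    while state != "H" and steps < limit:
        write, move, state = machine[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        steps += 1
    return steps, sum(tape.values())

steps, ones = run(BB2)  # 6 steps, 4 ones written
```

Six steps is harmless; the point of the talk is that adding just a few more states makes the longest-running halting machine astronomically slow, and Python is a fine playground for exploring that.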

Practical Quantization in Keras: Running Large Models on Small Devices

2025-11-08
talk

Large language models are often too large to run on personal machines, requiring specialized hardware with massive memory. Quantization provides a way to shrink models, speed them up, and reduce memory usage - all while retaining most of their accuracy.

This talk introduces the fundamentals of neural network quantization, key techniques, and demonstrates how to apply them using Keras’s extensible quantization framework.
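The core of post-training quantization fits in a few lines of numpy. This illustrates only the underlying math, not Keras's quantization API:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric int8 quantization: map floats to [-127, 127] with a
    single per-tensor scale, the basic scheme behind most PTQ tools."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()  # bounded by ~scale / 2
```

The int8 tensor uses a quarter of the float32 memory, and the reconstruction error is bounded by half a quantization step, which is why accuracy usually survives.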

There and back again... by ferry or I-5?

2025-11-08 Watch
talk

Living on Washington State’s peninsula offers endless beauty, nature, and commuting challenges. In this talk, I’ll share how I built an agentic AI system that creates and compares optimal routes to the mainland, factoring in ferry schedules, costs, driving distances, and live traffic. Originally a testbed for the Model Context Protocol (MCP) framework, this project now manages my travel schedule, generates expense estimates, and sends timely notifications for events. I’ll give a comprehensive overview of MCP, show how to quickly turn ideas into working agentic AI, and discuss practical integration with real-world APIs. Attendees will leave with actionable insights and a roadmap for building their own agentic AI solutions.
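At its core, the route comparison reduces to a scoring function the agent can call once per candidate route. A sketch (every number below is made up for illustration):

```python
def route_cost(drive_min, wait_min, fare_usd, fuel_usd, value_per_min=0.5):
    """Collapse time and money into one comparable dollar figure."""
    return (drive_min + wait_min) * value_per_min + fare_usd + fuel_usd

# Hypothetical inputs an agent might pull from schedules and traffic APIs.
ferry = route_cost(drive_min=45, wait_min=30, fare_usd=18.0, fuel_usd=6.0)
i5 = route_cost(drive_min=110, wait_min=0, fare_usd=0.0, fuel_usd=14.0)
best = "ferry" if ferry < i5 else "I-5"
```

The interesting engineering is in feeding this function live data via MCP tools; the decision itself stays this simple and auditable.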

Break

2025-11-08
talk

Keynote: Chang She - Never Send a Human to do an Agent's Search

2025-11-08 Watch
talk
Chang She (LanceDB)

Keynote by Chang She

Registration & Breakfast

2025-11-08
talk

Actually using GPs in practice with PyMC

2025-11-08 Watch
talk

This talk will be about the Gaussian process (GP) functionality in the open source Python package PyMC, and how to use GPs effectively for models in the real world. The goal will be to bridge the (wide!) gap between theory and practice, using an example from baseball. By the end of the talk you'll know what's possible in PyMC and how to avoid common pitfalls.
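Under the hood, the GP posterior mean is a few lines of linear algebra. A numpy sketch of the math (PyMC wraps this with priors, sampling, and the practical machinery the talk covers):

```python
import numpy as np

def rbf(a, b, length=1.0, amp=1.0):
    """Squared-exponential (RBF) covariance between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return amp ** 2 * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=0.1):
    """Exact GP regression posterior mean with an RBF kernel."""
    K = rbf(x_train, x_train) + noise ** 2 * np.eye(len(x_train))
    K_s = rbf(x_test, x_train)
    return K_s @ np.linalg.solve(K, y_train)

x = np.linspace(0, 2 * np.pi, 30)
y = np.sin(x)
mu = gp_posterior_mean(x, y, np.array([np.pi / 2]))  # near sin(pi/2) = 1
```

The common pitfalls the talk mentions (ill-conditioned covariance matrices, poorly chosen lengthscales) all live inside these few lines, which is why seeing them explicitly helps.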

We don't dataframe shame: A love letter to dataframes

2025-11-08 Watch
talk

This lighthearted educational talk explores the wild west of dataframes. We discuss where dataframes got their origin (it wasn't R), how dataframes have evolved over time, and why dataframe is such a confusing term (what even is a dataframe?). We will look at what makes dataframes special from both a theoretical computer science perspective (the math is brief, I promise!) and from a technology landscape perspective. This talk doesn't advocate for any specific tool or technology, but instead surveys the broad field of dataframes as a whole.

Generalized Additive Models: Explainability Strikes Back

2025-11-07 Watch
talk

Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) strike a rare balance: they combine the flexibility of complex models with the clarity of simple ones.

They often achieve performance comparable to black-box models, yet remain:
- Easy to interpret
- Computationally efficient
- Aligned with the growing demand for transparency in AI

With recent U.S. AI regulations (White House, 2022) and increasing pressure from decision-makers for explainable models, GAMs are emerging as a natural choice across industries.


Audience

This guide is for readers with some background in Python and statistics, including:
- Data scientists
- Machine learning engineers
- Researchers


Takeaway

By the end, you’ll understand:
- The intuition behind GAMs
- How to build and apply them in practice
- How to interpret and explain GAM predictions and results in Python


Prerequisites

You should be comfortable with:
- Basic regression concepts
- Model regularization
- The bias–variance trade-off
- Python programming
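To make the additive idea concrete, here is a minimal numpy sketch that fits y = f1(x1) + f2(x2) with a polynomial basis per feature (production GAMs use penalized splines, but the additive structure is the same):

```python
import numpy as np

def basis(x, degree=3):
    """Simple polynomial basis per feature (real GAMs use splines)."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
y = np.sin(2 * x1) + x2 ** 2 + 0.05 * rng.standard_normal(n)

# Fit y ~ intercept + f1(x1) + f2(x2) by least squares on stacked bases.
X = np.column_stack([np.ones(n), basis(x1), basis(x2)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Interpretability: each feature's contribution can be read off alone.
f1 = basis(x1) @ coef[1:4]
f2 = basis(x2) @ coef[4:7]
resid = y - (coef[0] + f1 + f2)
```

Because the model is a sum of per-feature functions, each f can be plotted against its feature on its own axis, which is exactly the explainability advantage over black-box models.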

Multi-Series Forecasting at Scale with StatsForecast

2025-11-07 Watch
talk
Khuyen Tran (Prefect) , Yibei Hu

Learn how to build fast and reliable retail demand forecasts using StatsForecast, an open-source Python library for scalable statistical forecasting. This session will cover techniques including rolling-origin cross-validation and conformal prediction, with practical retail demand examples.
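Rolling-origin cross-validation can be sketched in plain Python, with a naive last-value forecaster standing in for a StatsForecast model:

```python
def rolling_origin_cv(series, horizon=2, n_windows=3):
    """Repeatedly cut the series at a moving origin, forecast `horizon`
    steps ahead, and score against the held-out values."""
    errors = []
    for w in range(n_windows, 0, -1):
        cutoff = len(series) - w * horizon
        train, test = series[:cutoff], series[cutoff:cutoff + horizon]
        forecast = [train[-1]] * horizon       # naive forecaster
        errors += [abs(f - a) for f, a in zip(forecast, test)]
    return sum(errors) / len(errors)           # MAE across windows

demand = [10, 12, 11, 13, 14, 13, 15, 16, 15, 17]
mae = rolling_origin_cv(demand)
```

Evaluating over several moving origins, rather than one train/test split, is what makes the resulting error estimate trustworthy for demand data with trend and seasonality.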

Panel: Building Data-Driven Startups with User-Centric Design

2025-11-07 Watch
talk

Creating successful data products requires more than just powerful algorithms; it demands a deep understanding of user needs. In this panel, founders and leaders from innovative data-driven startups share their strategies for designing user-centric data products, including Python-based tools.

Know Your Data(Frame) with Paguro: Declarative and Composable Validation and Metadata using Polars

2025-11-07
talk

Modern data pipelines are fast and expressive, but ensuring data quality is often not as straightforward. This talk introduces Paguro, an open-source, feature-rich validation and metadata library designed on top of the Polars DataFrame library. Paguro enables users to validate both single Data(Lazy)Frames and collections of Data(Lazy)Frames together, and provides beautifully formatted terminal diagnostics that explain why and where validation failed. Attendees will learn how to integrate the lightweight, fast, and composable validation toolkit into their workflows, from exploration to production, using a familiar Polars-native syntax.
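To give a flavor of declarative, composable checks, here is a generic plain-Python sketch of the pattern such libraries use. This is NOT Paguro's actual API; the class and operator here are invented for illustration:

```python
class Check:
    """A named predicate over a column that composes with `&`."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def __and__(self, other):
        return Check(f"{self.name} & {other.name}",
                     lambda col: self.fn(col) and other.fn(col))

    def run(self, column):
        return (self.name, self.fn(column))

not_null = Check("not_null", lambda col: all(v is not None for v in col))
positive = Check("positive", lambda col: all(v is None or v > 0 for v in col))

# Declarative schema: validation rules live next to the column names.
schema = {"price": not_null & positive, "qty": positive}
frame = {"price": [9.5, 12.0, -1.0], "qty": [1, 2, 3]}
report = {col: schema[col].run(frame[col]) for col in schema}
```

A real library adds what this sketch lacks: lazy evaluation against Polars expressions and diagnostics that explain where and why a check failed.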

Polars on Spark: Unlocking Performance with Arrow Python UDFs

2025-11-07
talk
Shujing Yang , Allison Wang (Databricks)

PySpark’s Arrow-based Python UDFs open the door to dramatically faster data processing by avoiding expensive serialization overhead. At the same time, Polars, a high-performance DataFrame library built on Rust, offers zero-copy interoperability with Apache Arrow. This talk shows how combining these two technologies unlocks new performance gains: writing Arrow UDFs with Polars in PySpark can deliver substantial speedups over standard Python UDFs. Attendees will learn how Arrow UDFs work in PySpark, how they can be combined with other data processing libraries, and how to apply this approach to real-world Spark pipelines for faster, more efficient workloads.

Wrangling Internet-scale Image Datasets

2025-11-07 Watch
talk

Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million–image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible. In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WDS and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque into observable. Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
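The file-manifest idea from the talk can be sketched with the standard library (paths and field names here are illustrative): one JSONL row per file, built once, so later pipeline stages never have to re-list or re-hash millions of objects.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def build_manifest(root):
    """One JSONL row per file: relative path, size, content hash."""
    rows = []
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            rows.append({"path": str(p.relative_to(root)),
                         "bytes": p.stat().st_size,
                         "sha256": digest})
    return "\n".join(json.dumps(r) for r in rows)

# Demo on a throwaway directory.
root = tempfile.mkdtemp()
(Path(root) / "a.txt").write_text("hello")
(Path(root) / "b.txt").write_text("world")
manifest = build_manifest(root)
```

The content hashes double as integrity checks and deduplication keys, which is why manifests pay off at terabyte scale.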

Break

2025-11-07
talk