PyData Seattle 2025

Subgraph Isomorphism at Scale with data science tools

2025-11-09

talk

Esteban Ginez

Data Science Pandas Python

Traditional subgraph isomorphism algorithms like VF2 rely on sequential tree-search that can't leverage parallel computing. This talk introduces Δ-Motif, a data-centric approach that transforms graph matching into data operations using Python's data science stack. Δ-Motif decomposes graphs into small "motifs" to reconstruct matches. By representing graphs as tabular data with RAPIDS cuDF and Pandas, we achieve 10-595X speedups over VF2 without custom GPU kernels. I'll demonstrate practical applications from social networks to quantum computing, and show when GPU acceleration provides the biggest benefits for graph analysis problems. Perfect for data scientists working with network analysis, recommendation systems, or pattern matching at scale.
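To make the idea of "graph matching as data operations" concrete, here is a minimal sketch that finds triangles in a small graph using only table joins in Pandas. It is not the Δ-Motif algorithm itself, whose details the abstract does not give; the toy edge table and the column names `a`, `b`, `c` are assumptions made for illustration.

```python
import pandas as pd

# Toy undirected graph as an edge table, one row per edge with src < dst.
# (Illustrative data, not taken from the talk.)
edges = pd.DataFrame(
    {"src": [0, 0, 1, 1, 2, 3], "dst": [1, 2, 2, 3, 3, 4]}
)

# Step 1: join edges (a, b) with edges (b, c) to enumerate 2-edge paths a-b-c.
paths = edges.rename(columns={"src": "a", "dst": "b"}).merge(
    edges.rename(columns={"src": "b", "dst": "c"}), on="b"
)

# Step 2: keep only the paths whose closing edge (a, c) is also in the table.
closing = edges.rename(columns={"src": "a", "dst": "c"})
triangles = (
    paths.merge(closing, on=["a", "c"])[["a", "b", "c"]]
    .sort_values(["a", "b", "c"])
    .reset_index(drop=True)
)

print(triangles)
```

Because every step is a join on a DataFrame, the same code shape could in principle run on a GPU DataFrame library with a Pandas-like API, which is the kind of portability the abstract describes.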

The Problem of Address Matching: a Journey through NLP and AI

2025-11-09

talk

Ivan Perez Avellaneda

AI/ML NLP Python

The problem of address matching arises when the address of one physical place is written in two or more different ways. This situation is very common in companies that receive customer records from different sources. The differences can be classified as syntactic or semantic. In the first type, the meaning is the same but the spelling differs, for example "Street" vs "St". In the second type, the meaning is not exactly the same, for example "Road" instead of "Street". There are a couple of approaches to solving this problem and matching addresses. The first and simpler one uses similarity metrics. The second uses natural language processing and transformers. This hands-on talk is intended for data process analysts. We will go through these solutions implemented in a Jupyter notebook using Python.
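As a rough illustration of the similarity-metric approach, the sketch below normalizes common abbreviations and then scores two addresses with the standard library's `difflib.SequenceMatcher`. The abbreviation table and the function names `normalize` and `similarity` are hypothetical and are not taken from the talk's notebook.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def normalize(address: str) -> str:
    """Lowercase, drop commas and periods, and expand known abbreviations."""
    tokens = address.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a: str, b: str) -> float:
    """Return a character-level similarity ratio in [0, 1] after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Syntactic difference: identical once the abbreviation is expanded.
syntactic = similarity("123 Main St.", "123 main street")
# Semantic difference: "Road" vs "Street" stays different after normalization.
semantic = similarity("123 Main Street", "123 Main Road")
print(syntactic, semantic)
```

A string metric handles the syntactic case well but gives only a partial score for the semantic one, which is where an embedding-based approach could help.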

Break

2025-11-09

talk

GPU Accelerated Python

2025-11-09

talk

Andy Terrel

Python

Accelerating Python using the GPU is much easier than you might think. We will explore the powerful CUDA-enabled Python ecosystem in this tutorial through hands-on examples using some of the most popular accelerated scientific computing libraries.

LLMs, Chatbots, and Dashboards: Visualize and Analyze Your Data with Natural Language

2025-11-09

talk

Daniel Chen

AI/ML Dashboard Data Science LLM

LLMs have a lot of hype around them these days. Let’s demystify how they work and see how we can put them in context for data science use. As data scientists, we want to make sure our results are inspectable, reliable, reproducible, and replicable. We already have many tools to help us on this front. However, LLMs pose a new challenge: we may not always get the same results back from a query. This means working out the areas where LLMs excel and using those behaviors in our data science artifacts. This talk will introduce you to LLMs, the Chatlas package, and how they can be integrated into a Shiny app to create an AI-powered dashboard (using querychat). We’ll see how we can leverage the tasks LLMs are good at to better our data science products.

Newcomer Sprint!

2025-11-09

talk

Fangchen Li, Eloisa Elias T, Rachel Wagner-Kaiser, Joseph Holsten, C.A.M. Gerlach, Jake Stevens-Haas

Python

Looking to contribute to open source, but not sure where to start? Want to level up your skills in debugging, programming, collaboration and more? Curious about how to fix a bug or add a feature you’re missing in your favorite software project? Come to our special newcomer sprint to learn how and try it for yourself! Newcomers to Python or open source are welcome and encouraged, as are attendees with open source experience who can help guide them!

Lunch

2025-11-09

talk

Building a Deep Research Agentic Workflow

2025-11-09

talk

Ravi Kumar Yadav, Nidhin Pattaniyil

LLM

OpenAI and Gemini's Deep Research offerings are a great way to get a detailed research report on a topic.

In this beginner-friendly tutorial, we’ll walk through building a simple, lightweight agent workflow to perform deep research.

Building Bazel Packages for AI/ML: SciPy, PyTorch, and Beyond

2025-11-09

talk

Ramesh Oswal, Jiten Oswal

AI/ML NumPy PyTorch SciPy TensorFlow

AI/ML workloads depend heavily on complex software stacks, including numerical computing libraries (SciPy, NumPy), deep learning frameworks (PyTorch, TensorFlow), and specialized toolchains (CUDA, cuDNN). However, integrating these dependencies into Bazel-based workflows remains challenging due to compatibility issues, dependency resolution, and performance optimization. This session explores the process of creating and maintaining Bazel packages for key AI/ML libraries, ensuring reproducibility, performance, and ease of use for researchers and engineers.

Going From Notebooks to Production Code

2025-11-09

talk

Robert Masson, Catherine Nelson

Python

Do you need to move your code from notebooks into production? Or do you want to level up your software engineering skills? In this tutorial, we will show you how to turn a Jupyter notebook into a robust, reproducible Python script. You will learn how to use tools for converting notebooks into scripts, how to make your code modular, and how to write unit tests.

How to make datamap web-apps of embedding vectors via open source tooling

2025-11-09

talk

John Tigue

AI/ML API Python RAG Vector DB

Datamaps are ML-powered visualizations of high-dimensional data, and in this talk the data is collections of embedding vectors. Interactive datamaps run in-browser as web-apps, potentially without any code running on the web server. Datamap tech can be used to visualize, say, the entire collection of chunks in a RAG vector database.

The best-of-breed tools of this new datamap technique are liberally licensed open source. This presentation is an introduction to building with those repos. The maths will be mentioned only in passing; the topic here is simply how-to with specific tools. Talk attendees will be learning about Python tools, which produce high-quality web UIs.

DataMapPlot is the premier tool for rendering a datamap as a web-app. Here is a live demo thereof: https://connoiter.com/datamap/cff30bc1-0576-44f0-a07c-60456e131b7b

00-25: Intro to datamaps
25-45: Pipeline architecture
45-55: Demos touring such tools as UMAP, HDBSCAN, DataMapPlot, Toponymy, etc.
55-90: Group coding

A Google account is required to log in to Google Colab, where participants can run the workshop notebooks. A Hugging Face API key (token) is needed to download Gemma models.

Break

2025-11-09

talk

Building Intelligent DIY Robots: From Hardware to Vision Systems

2025-11-09

talk

FTC 18225 High Definition

In this talk, Ethan Lee, lead programmer of an FTC (FIRST Tech Challenge) high school robotics team, and Jake Poznanski, startup founder and software engineer, will show how software, hardware, and data converge to build intelligent robots. Ethan will discuss how FTC robots apply computer vision, including OpenCV and neural networks, to convert raw camera data into autonomous robot action. He will also examine the challenges of operating under strict computation constraints, such as latency, calibration, and synchronization. Jake will explore the process of creating a DIY robot, such as CAD design, electronics, and message passing.

Scaling Large-Scale Interactive Data Visualization with Accelerated Computing

2025-11-09

talk

Allison Ding

Data Science DataViz Plotly Python

As datasets continue to grow in both size and complexity, CPU-based visualization pipelines often become bottlenecks, slowing down exploratory data analysis and interactive dashboards. In this session, we’ll demonstrate how GPU acceleration can transform Python-based interactive visualization workflows, delivering speedups of up to 50x with minimal code changes. Using libraries such as hvPlot, Datashader, cuxfilter, and Plotly Dash, we’ll walk through real-world examples of visualizing both tabular and unstructured data and demonstrate how RAPIDS, a suite of open-source GPU-accelerated data science libraries from NVIDIA, accelerates these workflows. Attendees will learn best practices for accelerating preprocessing, building scalable dashboards, and profiling pipelines to identify and resolve bottlenecks. Whether you are an experienced data scientist or developer, you’ll leave with practical techniques to instantly scale your interactive visualization workflows on GPUs.

There's no place like home: using AI agents in Jupyter notebooks

2025-11-09

talk

Sarah Kaiser

AI/ML Data Science

This talk explores how AI agents integrated directly into Jupyter notebooks can help with every part of your data science work. We'll cover the latest notebook-focused agentic features in VS Code, demonstrating how they automate tedious tasks like environment management or graph styling, turn your "scratch notebook" into shareable code, and more generally streamline data science workflows directly in notebooks.

Registration & Breakfast

2025-11-09

talk

Conference Social

2025-11-09

talk

Join your fellow conference attendees and local meetup members at Bellevue Brewing Company - Spring District Brewpub, 12190 NE District Wy, Bellevue, WA 98005.

https://maps.app.goo.gl/3HSM4WvPXSfVWS3f7

Beyond Just Prediction: Causal Thinking in Machine Learning

2025-11-09

talk

Avik Basu

AI/ML

Most ML models excel at prediction, answering questions like "Who will buy our product?" or "Which customers are likely to churn?". But when it comes to making actionable decisions, prediction alone can be misleading. Correlation does not imply causation, and business decisions require understanding causal relationships to drive the right outcomes.

In this talk, we will explore how causal machine learning, specifically uplift modeling, can bridge the gap between prediction and decision making. Using a real-world use case, we will showcase how uplift modeling helps identify who will respond positively to interventions while avoiding those whom the interventions might deter.
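The core quantity behind uplift can be sketched without any modeling library: for each customer segment, compare the conversion rate of the treated group with that of the control group. The example below is a simplified stand-in rather than the speaker's method. It assumes treatment was randomly assigned, uses made-up toy records, and the function name `uplift_by_segment` is hypothetical. Real uplift modeling would typically fit models rather than compare raw segment rates.

```python
from collections import defaultdict

# Illustrative toy records: (segment, treated, converted).
records = [
    ("new", True, 1), ("new", True, 1), ("new", False, 0), ("new", False, 1),
    ("loyal", True, 0), ("loyal", True, 1), ("loyal", False, 1), ("loyal", False, 1),
]


def uplift_by_segment(rows):
    """Return treated-minus-control conversion rate for each segment."""
    # counts[segment][treated] holds [conversions, total].
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, converted in rows:
        cell = counts[segment][treated]
        cell[0] += converted
        cell[1] += 1
    return {
        segment: groups[True][0] / groups[True][1]
        - groups[False][0] / groups[False][1]
        for segment, groups in counts.items()
    }


uplift = uplift_by_segment(records)
print(uplift)
```

A positive value marks a segment that the intervention appears to help; a negative value marks one it appears to deter, which a pure conversion-prediction model would not reveal.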

Diversity Panel: Data for All: Empowering Underrepresented Voices in Data Science and Analytics

2025-11-09

talk

Oli Dinov, Anquida Adams, Micheleen Harris, Heejoon Ahn, Eloisa Elias T

Analytics Data Science

Data science has the power to shape industries and societies. This panel will focus on empowering underrepresented groups in data science through education, access to tools, and career opportunities. Panelists will share their journeys, discuss the importance of democratizing data skills, and explore how to make the field more accessible to diverse talent.

Unlocking Parallel PyTorch Inference (and More!) with Python Free-Threading

2025-11-09

talk

Trent Nelson

Python PyTorch

From the speaker who got kicked off the stage after 54 minutes of his 45-minute PyParallel talk at PyData NYC 2013, comes a new talk foaming about the virtues of Python's new free-threaded support!

Democratizing (Py)Data: Remote computing for all

2025-11-08

talk

C.A.M. Gerlach

Cloud Computing

PhD students, postdocs and independent researchers often struggle when trying to scale their code and data beyond their local machine, to an HPC cluster or the cloud. This is even more difficult if they don’t happen to have access to IT staff and resources to set up the necessary infrastructure, as is the case in many developing countries. We introduce a new open source, extensible remote development architecture, supported in version 6.1 of the Spyder scientific environment and IDE, that allows users to manage packages, browse files and run code remotely on a completely austere host from the comfort of their local machine.

Prompt Variation as a Diagnostic Tool: Exposing Contamination, Memorization, and True Capability in LLMs

2025-11-08

talk

Aziza Mirsaidova

LLM

Prompt variation isn't just an engineering nuisance; it's a window into fundamental LLM limitations. When a model's accuracy drops from 95% to 75% due to minor rephrasing, we're not just seeing brittleness; we're potentially exposing data contamination, spurious correlations, and shallow pattern matching. This talk explores prompt variation as a powerful diagnostic tool for understanding LLM reliability. We discuss how small changes in format, phrasing, or ordering can cause accuracy to collapse, revealing models that memorize benchmark patterns or learn superficial correlations rather than robust task representations. Drawing from academic and industry research, you will learn to distinguish between an LLM's true capability and memorization, identify when models are pattern-matching rather than reasoning, and build evaluation frameworks that expose these vulnerabilities before deployment.
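One way to picture the diagnostic is a small harness that scores the same examples under several prompt templates and reports the gap. The sketch below is an assumption-laden illustration, not a framework from the talk: `toy_model` is a deliberately brittle stand-in for a real LLM call, and the templates, examples, and function names are hypothetical.

```python
def accuracy_by_variant(model, variants, examples):
    """Score each prompt template on the same (question, expected) pairs."""
    scores = {}
    for name, template in variants.items():
        correct = sum(
            model(template.format(question=question)).strip() == expected
            for question, expected in examples
        )
        scores[name] = correct / len(examples)
    return scores


# Hypothetical stand-in for an LLM that has memorized one prompt format.
_MEMORIZED = {"2+2": "4", "3+3": "6"}


def toy_model(prompt: str) -> str:
    if not prompt.startswith("Q: "):
        return "unknown"
    question = prompt[3:].split("\n")[0]
    return _MEMORIZED.get(question, "unknown")


variants = {
    "original": "Q: {question}\nA:",
    "rephrased": "Please answer: {question}",
}
examples = [("2+2", "4"), ("3+3", "6")]

scores = accuracy_by_variant(toy_model, variants, examples)
gap = max(scores.values()) - min(scores.values())
print(scores, gap)
```

A large gap between templates that ask the same question is the signal of interest: it suggests the score depends on the surface format rather than on the task.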

Securing Retrieval-Augmented Generation: How to Defend Vector Databases Against 2025 Threats

2025-11-08

talk

Rajesh

LLM RAG Cyber Security Vector DB

Modern LLM applications rely heavily on embeddings and vector databases for retrieval-augmented generation (RAG). But in 2025, researchers and OWASP flagged vector databases as a new attack surface — from embedding inversion (recovering sensitive training text) to poisoned vectors that hijack prompts. This talk demystifies these threats for practitioners and shows how to secure your RAG pipeline with real-world techniques like encrypted stores, anomaly detection, and retrieval validation. Attendees will leave with a practical security checklist for keeping embeddings safe while still unlocking the power of retrieval.

Building Inference Workflows with Tile Languages

2025-11-08

talk

Andy Terrel

AI/ML API GenAI NumPy Python

The world of generative AI is expanding. New models are hitting the market daily. The field has bifurcated between model training and model inference. The need for fast inference has led to the development of numerous tile languages. These languages use concepts from linear algebra and borrow common NumPy APIs. In this talk we will show how tiling works and how to build inference models from scratch in pure Python with embedded tile languages. The goal is to provide attendees with a good overview that can be integrated into common data pipelines.
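The basic tiling idea can be sketched in plain NumPy, without any actual tile language: split a matrix multiplication into small blocks and accumulate each output tile from the matching input tiles. The function name `tiled_matmul` and the `tile` parameter are assumptions for illustration. A real tile language would map each tile to fast on-chip memory, which this sketch does not attempt.

```python
import numpy as np


def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 2) -> np.ndarray:
    """Multiply a (m x k) by b (k x n) one tile-sized block at a time."""
    m, k = a.shape
    k2, n = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):          # rows of the output tile
        for j in range(0, n, tile):      # columns of the output tile
            for p in range(0, k, tile):  # walk along the shared dimension
                # Slices past the edge are clipped, so ragged sizes work.
                out[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return out


rng = np.random.default_rng(0)
a = rng.standard_normal((5, 7))
b = rng.standard_normal((7, 3))
result = tiled_matmul(a, b, tile=2)
print(np.allclose(result, a @ b))
```

Each output tile depends only on one row band of `a` and one column band of `b`, which is what makes the tiles independent units of work.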

Evaluation is all you need

2025-11-08

talk

Sebastian Duerr

LLM RAG

LLM apps fail without reliable, reproducible evaluation. This talk maps the open‑source evaluation landscape, compares leading techniques (RAGAS, Evaluation Driven Development) and frameworks (DeepEval, Phoenix, LangFuse, and braintrust), and shows how to combine tests, RAG‑specific evals, and observability to ship higher‑quality systems. Attendees leave with a decision checklist, code patterns, and a production‑ready playbook.
