PyData Boston 2025

Surviving the Agentic Hype with Small Language Models

2025-12-10 Watch

talk

Serhii Sokolenko (Tower Dev)

AI/ML LLM Python

The AI landscape is abuzz with talk of "agentic intelligence" and "autonomous reasoning." But beneath the hype, a quieter revolution is underway: Small Language Models (SLMs) are starting to perform the core reasoning and orchestration tasks once thought to require massive LLMs. In this talk, we’ll demystify the current state of “AI agents,” show how compact models like Phi-2, xLAM 8B, and Nemotron-H 9B can plan, reason, and call tools effectively, and demonstrate how you can deploy them on consumer-grade hardware. Using Python and lightweight frameworks such as LangChain, we’ll show how anyone can quickly build and experiment with their own local agentic systems. Attendees will leave with a grounded understanding of agent architectures, SLM capabilities, and a roadmap for running useful agents without the GPU farm.

Data engineering with Python the right way: introducing the composable, Python-native data stack

2025-12-10

talk

Deepyaman Datta

API Data Engineering dbt Modern Data Stack Python SQL

For the past decade, SQL has reigned king of the data transformation world, and tools like dbt have formed a cornerstone of the modern data stack. Until recently, Python-first alternatives couldn't compete with the scale and performance of modern SQL. Now Ibis can provide the same benefits of SQL execution with a flexible Python dataframe API.

In this talk, you will learn how Ibis supercharges open-source libraries like Kedro, Pandera, and the Boring Semantic Layer and how you can combine these technologies (and a few more) to build and orchestrate scalable data engineering pipelines without sacrificing the comfort (and other advantages) of Python.

Evaluating AI Agents in production with Python

2025-12-10 Watch

talk

Susan Shu Chang

AI/ML LLM Python

This talk covers methods of evaluating AI Agents, with an example of how the speakers built a Python-based evaluation framework for a user-facing AI Agent system that has been in production for over a year. We share the tools and Python frameworks used (as well as tradeoffs and alternatives), and discuss methods such as LLM-as-Judge, rules-based evaluations, and ML metrics, along with the tradeoffs in selecting among them.
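As a rough illustration of what a rules-based evaluation can look like, the sketch below runs a few deterministic checks over an agent's text response. The rule names, the 200-character limit, and the sample responses are all hypothetical; this is not the speakers' framework, only a minimal standard-library example of the idea.

```python
import json
import re

def rule_valid_json(response: str) -> bool:
    """Pass if the response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def rule_no_email(response: str) -> bool:
    """Pass if the response does not appear to contain an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", response) is None

def rule_short_enough(response: str, limit: int = 200) -> bool:
    """Pass if the response stays under a (hypothetical) length limit."""
    return len(response) <= limit

# Hypothetical rule set; a real system would pick rules that fit its product.
RULES = {
    "valid_json": rule_valid_json,
    "no_email": rule_no_email,
    "short_enough": rule_short_enough,
}

def evaluate(response: str) -> dict[str, bool]:
    """Run every rule against one response and report pass/fail per rule."""
    return {name: rule(response) for name, rule in RULES.items()}

def pass_rate(responses: list[str]) -> float:
    """Fraction of responses that pass all rules."""
    passed = [all(evaluate(r).values()) for r in responses]
    return sum(passed) / len(passed)

if __name__ == "__main__":
    samples = ['{"answer": "ok"}', "contact me at a@b.com"]
    for sample in samples:
        print(sample, evaluate(sample))
    print("pass rate:", pass_rate(samples))
```

Checks like these are cheap and repeatable, which is why they are often paired with slower, more subjective methods such as LLM-as-Judge.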

Processing large JSON files without running out of memory

2025-12-10 Watch

talk

Itamar Turner-Trauring

JSON Python

If you need to process a large JSON file in Python, it’s very easy to run out of memory while loading the data, leading to a super-slow run time or out-of-memory crashes. In this talk you'll learn:

How to measure memory usage.
Why loading JSON takes a lot of memory.
Four different ways to reduce memory usage when loading large JSON files.
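To make the first two points concrete, here is a minimal sketch that measures peak Python memory with the standard library's `tracemalloc` and compares loading a whole JSON array against reading one record per line. The line-per-record (JSON Lines) layout is an assumption chosen for illustration and is not necessarily one of the four techniques the talk covers; the record shape and helper names are made up.

```python
import io
import json
import tracemalloc

def peak_memory(func, *args):
    """Return (result, peak bytes allocated by Python while func ran)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

def sum_scores_full_load(text):
    """Parse the whole document at once; every record is in memory together."""
    records = json.loads(text)
    return sum(record["score"] for record in records)

def sum_scores_streaming(lines):
    """Parse one record per line; only one record is alive at a time."""
    total = 0
    for line in lines:
        total += json.loads(line)["score"]
    return total

# Hypothetical data: 20,000 small records, serialized two ways.
records = [{"id": i, "name": f"user{i}", "score": i % 100} for i in range(20_000)]
as_array = json.dumps(records)
as_lines = "\n".join(json.dumps(record) for record in records)
del records

full_total, full_peak = peak_memory(sum_scores_full_load, as_array)
stream_total, stream_peak = peak_memory(sum_scores_streaming, io.StringIO(as_lines))

print(f"full load: {full_peak / 1e6:.2f} MB peak")
print(f"streaming: {stream_peak / 1e6:.2f} MB peak")
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so it is a rough guide rather than a full picture of process memory.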

Tracking Policy Evolution Through Clustering: A New Approach to Temporal Pattern Analysis in Multi-Dimensional Data

2025-12-10

talk

Sarthak Pattnaik

Matplotlib Pandas Python Scikit-learn

Analyzing how patterns evolve over time in multi-dimensional datasets is challenging: traditional time-series methods often struggle with interpretability when comparing multiple entities across different scales. This talk introduces a clustering-based framework that transforms continuous data into categorical trajectories, enabling intuitive visualization and comparison of temporal patterns.

What & Why: The method combines quartile-based categorization with modified Hamming distance to create interpretable "trajectory fingerprints" for entities over time. This approach is particularly valuable for policy analysis, economic comparisons, and any domain requiring longitudinal pattern recognition.

Who: Data scientists and analysts working with temporal datasets, policy researchers, and anyone interested in comparative analysis across entities with different scales or distributions.

Type: Technical presentation with practical implementation examples using Python (pandas, scikit-learn, matplotlib). Moderate mathematical content balanced with intuitive visualizations.

Takeaway: Attendees will learn a novel approach to temporal pattern analysis that bridges the gap between complex statistical methods and accessible, policy-relevant insights. You'll see practical implementations analyzing 60+ years of fiscal policy data across 8 countries, with code available for adaptation to your own datasets.
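The quartile-plus-Hamming idea can be sketched in a few lines of standard-library Python. The abstract does not say how the Hamming distance is modified, so this example uses the plain (unmodified) distance, and the three series are invented purely for illustration.

```python
from statistics import quantiles

def to_quartile_labels(values):
    """Map each value to its quartile (1-4) within its own series."""
    q1, q2, q3 = quantiles(values, n=4)

    def label(value):
        if value <= q1:
            return 1
        if value <= q2:
            return 2
        if value <= q3:
            return 3
        return 4

    return [label(value) for value in values]

def hamming(a, b):
    """Count the positions where two equal-length label sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical yearly indicators for three entities on different scales.
country_a = [1, 2, 3, 4, 5, 6, 7, 8]
country_b = [10, 20, 30, 40, 50, 60, 70, 80]   # same shape, ten times the scale
country_c = [8, 7, 6, 5, 4, 3, 2, 1]           # opposite trend

fingerprint_a = to_quartile_labels(country_a)
fingerprint_b = to_quartile_labels(country_b)
fingerprint_c = to_quartile_labels(country_c)

print(fingerprint_a)
print("a vs b:", hamming(fingerprint_a, fingerprint_b))
print("a vs c:", hamming(fingerprint_a, fingerprint_c))
```

Because each series is categorized against its own quartiles, entities with very different magnitudes become directly comparable, which is the property the abstract highlights.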

When Rivers Speak: Analyzing Massive Water Quality Datasets using USGS API and Remote SSH in Positron

2025-12-10 Watch

talk

Rodrigo Silva Ferreira

Analytics API Data Engineering DuckDB HTML Parquet

Rivers have long been storytellers of human history. From the Nile to the Yangtze, they have shaped trade, migration, settlement, and the rise of civilizations. They reveal the traces of human ambition... and the costs of it. Today, from the Charles to the Golden Gate, US rivers continue to tell stories, especially through data.

Over the past decades, extensive water quality monitoring efforts have generated vast public datasets: millions of measurements of pH, dissolved oxygen, temperature, and conductivity collected across the country. These records are more than environmental snapshots; they are archives of political priorities, regulatory choices, and ecological disruptions. Ultimately, they are evidence of how societies interact with their environments, often unevenly.

In this talk, I’ll explore how Python and modern data workflows can help us "listen" to these stories at scale. Using the United States Geological Survey (USGS) Water Data APIs and Remote SSH in Positron, I’ll process terabytes of sensor data spanning several years and regions. I’ll demonstrate that, while Parquet and DuckDB enable scalable exploration of historical records, Remote SSH is essential for running the analysis at this scale. Along the way, I hope to answer analytical questions that can surface patterns linked to industrial growth, regulatory shifts, and climate change.

By treating rivers as both ecological systems and social mirrors, we can begin to see how environmental data encodes histories of inequality, resilience, and transformation.

Whether your interest lies in data engineering, environmental analytics, or the human dimensions of climate and infrastructure, this talk sits at the intersection of environmental science and data analysis, and offers both technical methods and sociological lenses for understanding the stories rivers continue to tell.

Fun With Python and Emoji: What Might Adding Pictures to Text Programming Languages Look Like?

2025-12-10

talk

Ted Conway

Python SQL

We all mix pictures, emojis and text freely in our communications. So, why not in our code? This session takes a whimsical look at what mixing emoji with Python and SQL might look like (spoiler alert: a lot like those "rebus" stories in Highlights Magazine for Kids!). We'll discuss the benefits of doing so, challenges that emoji present, and demo a rudimentary Python preprocessor that intercepts Python and SQL code containing emojis submitted from Jupyter notebooks and translates it back into text-only code using an emoji-to-text dictionary before passing it on to Python for execution. This session is intended for all levels of programmers.

Uncertainty-Guided AI Red Teaming: Efficient Vulnerability Discovery in LLMs

2025-12-10

talk

Zvi Topol

AI/ML LLM Python Cyber Security

AI red teaming is crucial for identifying security and safety vulnerabilities (e.g., jailbreaks, prompt injection, harmful content generation) of Large Language Models. However, manual and brute-force adversarial testing is resource-intensive and often inefficiently consumes time and compute resources exploring low-risk regions of the input space. This talk introduces a practical, Python-based methodology for accelerating red teaming using model uncertainty quantification (UQ).
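One way to picture uncertainty-guided prioritization: sample a model several times per prompt, label each outcome, and spend the testing budget first on prompts whose outcomes disagree the most. The abstract does not specify which uncertainty measure the talk uses, so Shannon entropy over sampled outcome labels is an assumption here, and the prompts and labels are invented.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Entropy (in bits) of the empirical distribution of outcome labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def rank_by_uncertainty(samples):
    """Order prompt ids from most to least uncertain outcome distribution."""
    return sorted(samples, key=lambda prompt: shannon_entropy(samples[prompt]), reverse=True)

# Hypothetical outcomes from sampling a model eight times per prompt.
samples = {
    "prompt_a": ["refuse"] * 8,                     # model is consistent
    "prompt_b": ["refuse"] * 4 + ["comply"] * 4,    # model is split
    "prompt_c": ["refuse"] * 7 + ["comply"] * 1,    # occasional slip
}

ranking = rank_by_uncertainty(samples)
for prompt in ranking:
    print(prompt, round(shannon_entropy(samples[prompt]), 3))
```

Prompts where the model behaves consistently fall to the bottom of the queue, which is how this kind of ranking avoids spending compute on low-risk regions of the input space.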

Who is Python for? EVERYONE (and why that matters)

2025-12-10 Watch

talk

Deb Nicholson

Python

Python is controlled by its community, and its vast library of packages remains free for anyone to use and open for anyone to add to. That's no accident. Open communities that share and learn together are how we will build the kind of future we want to live in. If you've ever wondered who is in charge of Python, how it exists as a perennially free resource and why anyone would do that, this talk is for you!

Wrappers and Extenders: Companion Packages for Python Projects

2025-12-10 Watch

talk

Jules Walzer-Goldfeld

Python

Many Python users want features that don’t fit within the boundaries of their favorite libraries. Instead of forking or waiting on a pull request, you can build your own wrapper or extender package. This talk introduces the principles of designing companion packages that enhance existing libraries without changing their core code, using gt-extras as a case study. You’ll learn how to structure, document, and distribute your own add-ons to extend the tools you rely on.

The SAT math gap: gender difference or selection bias?

2025-12-09 Watch

talk

Allen Downey (Brilliant.org | Olin College)

Python

Why do male test takers consistently score about 30 points higher than female test takers on the mathematics section of the SAT? Does this reflect an actual difference in math ability, or is it an artifact of selection bias, which would arise if young men with low math ability are less likely to take the test than young women with the same ability?

This talk presents a Bayesian model that estimates how much of the observed difference can be explained by selection effects. We’ll walk through a complete Bayesian workflow, including prior elicitation with PreliZ, model building in PyMC, and validation with ArviZ, showing how Bayesian methods disentangle latent traits from observed outcomes and separate the signal from the noise.

No prior knowledge of Bayesian statistics is required; attendees should be familiar with Python and common probability distributions.
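The selection-bias hypothesis can be illustrated with a small simulation, which is a much simpler tool than the Bayesian model the talk builds. Both groups below draw ability from the same distribution, and the only difference is who opts in to the test. The distribution parameters and participation cutoffs are made-up numbers chosen to show the effect, not estimates from SAT data.

```python
import random
from statistics import mean

def simulate_gap(n=50_000, cutoff_a=450.0, cutoff_b=400.0, seed=0):
    """Mean score gap between test takers of two groups with identical ability.

    Each group draws latent scores from the same normal distribution; members
    below their group's (hypothetical) cutoff skip the test.
    """
    rng = random.Random(seed)
    group_a = [rng.gauss(500, 100) for _ in range(n)]
    group_b = [rng.gauss(500, 100) for _ in range(n)]
    takers_a = [score for score in group_a if score >= cutoff_a]
    takers_b = [score for score in group_b if score >= cutoff_b]
    return mean(takers_a) - mean(takers_b)

if __name__ == "__main__":
    print("gap with different participation:", round(simulate_gap(), 1))
    print("gap with equal participation:", round(simulate_gap(cutoff_a=400.0), 1))
```

When more low scorers in one group stay home, that group's observed average rises even though the underlying populations are identical. The question the talk addresses is how much of the real gap such an effect can account for.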

Generative Programming with Mellea: from Agentic Soup to Robust Software

2025-12-08 Watch

talk

Jake Lorocco, Nathan Fulton

AI/ML LLM Python

Agentic frameworks make it easy to build and deploy compelling demos. But building robust systems that use LLMs is difficult because of inherent environmental non-determinism. Each user is different, each request is different; the very flexibility that makes LLMs feel magical in-the-small also makes agents difficult to wrangle in-the-large.

Developers who have built large agentic-like systems know the pain. Exceptional cases multiply, prompt libraries grow, instructions are co-mingled with user input. After a few iterations, an elegant agent evolves into a big ball of mud.

This hands-on tutorial introduces participants to Mellea, an open-source Python library for writing structured generative programs. Mellea puts the developer back in control by providing the building blocks needed to circumscribe, control, and mediate essential non-determinism.

"Save your API Keys for someone else" -- Using the HuggingFace and Ollama ecosystems to run good-enough LLMs on your laptop

2025-12-08

talk

Ian Stokes-Rees

Analytics API GenAI LLM Python

In this 90-minute tutorial we'll get anyone with some basic Python and command-line skills up and running with their own set of LLMs running entirely on a laptop, and explain some successful patterns for leveraging LLMs in a data analysis environment. We'll also highlight pitfalls waiting to catch you out, and reassure you that your pre-GenAI analytics skills are still relevant today, and likely will be for the foreseeable future, by demonstrating the limits of LLMs for data analysis tasks.

Building LLM Agents Made Simple

2025-12-08 Watch

talk

Eric Ma

API GitHub LLM Python

Learn to build practical LLM agents using LlamaBot and Marimo notebooks. This hands-on tutorial teaches the most important lesson in agent development: start with workflows, not technology.

We'll build a complete back-office automation system through three agents: a receipt processor that extracts data from PDFs, an invoice writer that generates documents, and a coordinator that orchestrates both. This demonstrates the fundamental pattern for agent systems—map your boring workflows first, build focused agents for specific tasks, then compose them so agents can use other agents as tools.
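The composition pattern, where a coordinator treats other agents as tools, can be sketched without any LLM at all. In the example below, plain functions stand in for the receipt and invoice agents, and a hard-coded `choose_tool` method stands in for the model's decision step. None of this is LlamaBot's API; every name here is hypothetical.

```python
def receipt_processor(state: dict) -> dict:
    """Stand-in for an agent that extracts fields from receipt text."""
    vendor, amount = state["text"].split(",")
    return {"vendor": vendor.strip(), "amount": float(amount)}

def invoice_writer(state: dict) -> dict:
    """Stand-in for an agent that drafts an invoice from extracted fields."""
    return {"invoice": f"Invoice for {state['vendor']}: ${state['amount']:.2f}"}

class Coordinator:
    """Runs a decision loop, calling other agents as tools until done."""

    def __init__(self, tools: dict):
        self.tools = tools

    def choose_tool(self, state: dict):
        # A real agent would ask an LLM which tool to call next;
        # here the decision is hard-coded for illustration.
        if "amount" not in state:
            return "process_receipt"
        if "invoice" not in state:
            return "write_invoice"
        return None  # nothing left to do

    def run(self, state: dict, max_steps: int = 5) -> dict:
        state = dict(state)
        for _ in range(max_steps):
            tool_name = self.choose_tool(state)
            if tool_name is None:
                break
            state.update(self.tools[tool_name](state))
        return state

coordinator = Coordinator({
    "process_receipt": receipt_processor,
    "write_invoice": invoice_writer,
})

if __name__ == "__main__":
    print(coordinator.run({"text": "Acme Corp, 42.5"})["invoice"])
```

Keeping each agent focused on one task and capping the loop with `max_steps` makes the composed system easier to test than a single agent that tries to do everything.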

By the end, you'll understand how to identify workflows worth automating, build agents with decision-making loops, compose agents into larger systems, and integrate them into your own work. You'll leave with working code and confidence to automate repetitive tasks.

Prerequisites: Intermediate Python, familiarity with APIs, basic LLM understanding. Participants should have Ollama and models installed beforehand (setup instructions provided).

Materials: GitHub repository with Marimo notebooks. Setup uses Pixi for dependency management.

Learn to Unlock Document Intelligence with Open-Source AI

2025-12-08 Watch

talk

Mingxuan Zhao

AI/ML Analytics Python

Unlocking the full potential of AI starts with your data, but real-world documents come in countless formats and levels of complexity. This session will give you hands-on experience with Docling, an open-source Python library designed to convert complex documents into AI-ready formats. Learn how Docling simplifies document processing, enabling you to efficiently harness all your data for downstream AI and analytics applications.

Create your Health Research Agent

2025-12-08

talk

Leonardo Ferreira

AI/ML Docker Linux LLM Python

PubMed is a free search interface for biomedical literature, including citations and abstracts from many life science journals. It is maintained by the National Library of Medicine at the NIH. Yet most users only interact with it through simple keyword searches. In this hands-on tutorial, we will introduce PubMed as a data source for intelligent biomedical research assistants and build a Health Research AI Agent using modern agentic AI frameworks such as LangChain, LangGraph, and Model Context Protocol (MCP), with minimal hardware requirements and no API keys or tokens. To ensure compatibility, the agent will run in a Docker container that hosts all the necessary elements.

Participants will learn how to connect language models to structured biomedical knowledge, design context-aware queries, and containerize the entire system using Docker for maximum portability. By the end, attendees will have a working prototype that can read and reason over PubMed abstracts, summarize findings according to a semantic similarity engine, and assist with literature exploration — all running locally on modest hardware.

Expected Audience: Enthusiasts, researchers, and data scientists interested in AI agents, biomedical text mining, or practical LLM integration.

Prior Knowledge: Python and Docker familiarity; no biomedical background required.

Minimum Hardware Requirements: 8 GB RAM (16 GB or more recommended), 30 GB disk space, Docker pre-installed. macOS, Windows, or Linux.

Key Takeaway: How to build a lightweight, reproducible research agent that combines open biomedical data with modern agentic AI frameworks.

CUDA Python Kernel Authoring

2025-12-08 Watch

talk

Katrina Riehl

Python

We'll explore best practices for writing CUDA kernels using Python, empowering developers to harness the full potential of GPU acceleration. Gain a clear understanding of the structure and functionality of CUDA kernels, learning how to effectively implement them within Python applications.

From Notebook to Pipeline: Hands-On Data Engineering with Python

2025-12-08 Watch

talk

Gilberto Hernandez

Cloud Computing Data Engineering Python SQL

In this hands-on tutorial, you'll go from a blank notebook to a fully orchestrated data pipeline built entirely in Python, all in under 90 minutes. You'll learn how to design and deploy end-to-end data pipelines using familiar notebook environments, using Python for your data loading, data transformations, and insights delivery.

We'll dive into the Ingestion-Transformation-Delivery (ITD) framework for building data pipelines: ingest raw data from cloud object storage, transform the data using Python DataFrames, and deliver insights via a Streamlit application.
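The three stages can be sketched as plain functions chained together. This is a standard-library stand-in only: an in-memory CSV string replaces cloud object storage, a dictionary of totals replaces DataFrames, and a formatted string replaces the Streamlit app. The column names and data are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data; in the tutorial this would come from object storage.
RAW_CSV = "region,amount\nnorth,10\nsouth,5\nnorth,7\n"

def ingest(source) -> list[dict]:
    """Ingestion: read raw rows from a file-like object."""
    return list(csv.DictReader(source))

def transform(rows: list[dict]) -> dict[str, float]:
    """Transformation: total the amount column per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += float(row["amount"])
    return dict(totals)

def deliver(totals: dict[str, float]) -> str:
    """Delivery: render the result as a small text report."""
    return "\n".join(f"{region}: {total:.2f}" for region, total in sorted(totals.items()))

def run_pipeline(source) -> str:
    """Chain the three stages so each can be tested or swapped independently."""
    return deliver(transform(ingest(source)))

if __name__ == "__main__":
    print(run_pipeline(io.StringIO(RAW_CSV)))
```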

Basic familiarity with Python (and/or SQL) is helpful, but not required. By the end of the session, you'll understand practical data engineering patterns and leave with reusable code templates to help you build, orchestrate, and deploy data pipelines from notebook environments.
