talk-data.com talk-data.com

Event

PyData Boston 2025

2025-12-08 – 2025-12-10 PyData

Activities tracked

6

Filtering by: API ×

Sessions & talks

Showing 1–6 of 6 · Newest first

Search within this event →
No Cloud? No Problem. Local RAG with Embedding Gemma

No Cloud? No Problem. Local RAG with Embedding Gemma

2025-12-10 Watch
talk

Running Retrieval-Augmented Generation (RAG) pipelines often feels tied to expensive cloud APIs or large GPU clusters—but it doesn’t have to be. This session explores how Embedding Gemma, Google’s lightweight open embedding model, enables powerful RAG and text classification workflows entirely on a local machine. Using the Sentence Transformers framework with Hugging Face, high-quality embeddings can be generated efficiently for retrieval and classification tasks. Real-world examples involving call transcripts and agent remark classification illustrate how robust results can be achieved without the cloud—or the budget.

Data engineering with Python the right way: introducing the composable, Python-native data stack

2025-12-10
talk

For the past decade, SQL has reigned king of the data transformation world, and tools like dbt have formed a cornerstone of the modern data stack. Until recently, Python-first alternatives couldn't compete with the scale and performance of modern SQL. Now Ibis can provide the same benefits of SQL execution with a flexible Python dataframe API.

In this talk, you will learn how Ibis supercharges open-source libraries like Kedro, Pandera, and the Boring Semantic Layer and how you can combine these technologies (and a few more) to build and orchestrate scalable data engineering pipelines without sacrificing the comfort (and other advantages) of Python.

When Rivers Speak: Analyzing Massive Water Quality Datasets using USGS API and Remote SSH in Positron

When Rivers Speak: Analyzing Massive Water Quality Datasets using USGS API and Remote SSH in Positron

2025-12-10 Watch
talk

Rivers have long been storytellers of human history. From the Nile to the Yangtze, they have shaped trade, migration, settlement, and the rise of civilizations. They reveal the traces of human ambition... and the costs of it. Today, from the Charles to the Golden Gate, US rivers continue to tell stories, especially through data.

Over the past decades, extensive water quality monitoring efforts have generated vast public datasets: millions of measurements of pH, dissolved oxygen, temperature, and conductivity collected across the country. These records are more than environmental snapshots; they are archives of political priorities, regulatory choices, and ecological disruptions. Ultimately, they are evidence of how societies interact with their environments, often unevenly.

In this talk, I’ll explore how Python and modern data workflows can help us "listen" to these stories at scale. Using the United States Geological Survey (USGS) Water Data APIs and Remote SSH in Positron, I’ll process terabytes of sensor data spanning several years and regions. I’ll demonstrate that, while Parquet and DuckDB enable scalable exploration of historical records, using Remote SSH is paramount in order to enable large-scale data analysis. By doing so, I hope to answer some analytical questions that can surface patterns linked to industrial growth, regulatory shifts, and climate change.

By treating rivers as both ecological systems and social mirrors, we can begin to see how environmental data encodes histories of inequality, resilience, and transformation.

Whether your interest lies in data engineering, environmental analytics, or the human dimensions of climate and infrastructure, this talk will explore topics at the intersection of environmental science, will offer both technical methods and sociological lenses to understand the stories rivers continue to tell.

fastplotlib: driving scientific discovery through data visualization

fastplotlib: driving scientific discovery through data visualization

2025-12-10 Watch
talk

Fast interactive visualization remains a considerable barrier in analysis pipelines for large neuronal datasets. Here, we present fastplotlib, a scientific plotting library featuring an expressive API for very fast visualization of scientific data. Fastplotlib is built upon pygfx, which utilizes the GPU via WGPU, allowing it to interface with modern graphics APIs such as Vulkan for fast rendering of objects. Fastplotlib is non-blocking, allowing for interactivity with data after plot generation. Ultimately, fastplotlib is a general-purpose scientific plotting library that is useful for fast and live visualization and analysis of complex datasets.

"Save your API Keys for someone else" -- Using the HuggingFace and Ollama ecosystems to run good-enough LLMs on your laptop

2025-12-08
talk

In this 90 minute tutorial we'll get anyone with some basic Python and Command Line skills up and running with their own 100% laptop based set of LLMs, and explain some successful patterns for leveraging LLMs in a data analysis environment. We'll also highlight pit-falls waiting to catch you out, and encourage you that your pre-GenAI analytics skills are still relevant today and likely will be for the foreseeable future by demonstrating the limits of LLMs for data analysis tasks.

Building LLM Agents Made Simple

Building LLM Agents Made Simple

2025-12-08 Watch
talk

Learn to build practical LLM agents using LlamaBot and Marimo notebooks. This hands-on tutorial teaches the most important lesson in agent development: start with workflows, not technology.

We'll build a complete back-office automation system through three agents: a receipt processor that extracts data from PDFs, an invoice writer that generates documents, and a coordinator that orchestrates both. This demonstrates the fundamental pattern for agent systems—map your boring workflows first, build focused agents for specific tasks, then compose them so agents can use other agents as tools.

By the end, you'll understand how to identify workflows worth automating, build agents with decision-making loops, compose agents into larger systems, and integrate them into your own work. You'll leave with working code and confidence to automate repetitive tasks.

Prerequisites: Intermediate Python, familiarity with APIs, basic LLM understanding. Participants should have Ollama and models installed beforehand (setup instructions provided).

Materials: GitHub repository with Marimo notebooks. Setup uses Pixi for dependency management.