Event

PyData Seattle 2025

2025-11-07 – 2025-11-09 PyData

Activities tracked

2

Filtering by: Data Quality

Top Speakers

Eloisa Elias T (3), Andy Terrel (2), C.A.M. Gerlach (2), Carl Kadie (2), Allison Ding (1), Avik Basu (1), Sarah Kaiser (1), Chang She (1), Daniel Chen (1), John Carney (1), Noor Aftab (1), Allison Wang (1)

Sessions & talks

Know Your Data(Frame) with Paguro: Declarative and Composable Validation and Metadata using Polars

2025-11-07

talk

Bernardo Dionisi

Data Quality, Polars

Modern data pipelines are fast and expressive, but ensuring data quality is often not as straightforward. This talk introduces Paguro, an open-source, feature-rich validation and metadata library designed on top of the Polars DataFrame library. Paguro enables users to validate both single Data(Lazy)Frames and collections of Data(Lazy)Frames together, and provides beautifully formatted terminal diagnostics that explain why and where validation failed. Attendees will learn how to integrate the lightweight, fast, and composable validation toolkit into their workflows, from exploration to production, using a familiar Polars-native syntax.

Wrangling Internet-scale Image Datasets

2025-11-07

talk

Nicholas Merchant, Carlos Garcia Jurado Suarez

Data Quality, JSON, Parquet

Building and curating datasets at internet scale is both powerful and messy. At Irreverent Labs, we recently released Re-LAION-Caption19M, a 19-million-image dataset with improved captions, alongside a companion arXiv paper. Behind the scenes, the project involved wrangling terabytes of raw data and designing pipelines that could produce a research-quality dataset while remaining resilient, efficient, and reproducible. In this talk, we’ll share some of the practical lessons we learned while engineering data at this scale. Topics include: strategies for ensuring data quality through a mix of automated metrics and human inspection; why building file manifests pays off when dealing with millions of files; effective use of Parquet, WDS, and JSONL for metadata and intermediate results; pipeline patterns that favor parallel processing and fault tolerance; and how logging and dashboards can turn long-running jobs from opaque into observable. Whether you’re working with images, text, or any other massive dataset, these patterns and pitfalls may help you design pipelines that are more robust, maintainable, and researcher-friendly.
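
The file-manifest idea mentioned in the abstract can be sketched in a few lines of standard-library Python. This is only an illustration, not the speakers' pipeline: the function names (build_manifest, write_jsonl, read_jsonl, pending) and the record fields (path, bytes, sha256) are hypothetical choices made for this example. It records one JSONL line per file, then uses the manifest to work out which files an interrupted job still has to process.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(root: Path) -> list[dict]:
    """Walk `root` and record one entry per file: relative path, size, SHA-256."""
    entries = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Reads each file fully; fine for a sketch, chunked hashing suits large files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        entries.append(
            {
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": digest,
            }
        )
    return entries


def write_jsonl(entries: list[dict], out_path: Path) -> None:
    """Write one JSON object per line (JSONL)."""
    with out_path.open("w", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(entry, sort_keys=True) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def pending(manifest: list[dict], done_paths: set[str]) -> list[dict]:
    """Return manifest entries not yet processed, so a job can resume."""
    return [entry for entry in manifest if entry["path"] not in done_paths]
```

With a manifest on disk, a long-running job can list its remaining work by comparing against a set of completed paths rather than re-scanning millions of files, and the recorded sizes and hashes give a cheap way to detect truncated or changed inputs.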

talk-data.com
