
Event

Small Data SF 2025

2025-11-04 – 2025-11-06

Activities tracked

31

Sessions & talks



Networking Reception

2025-11-06
break

Lunch

2025-11-05
break

4 Lessons Learned the Hard Way and How Thinking Small Saved the Day

2025-11-05
talk
Scott Haines (Nike)

This session will go over four common data engineering problems, myth-busting the "ask" versus the "reality" with real-life examples. Part data engineering therapy, part solving problems that "really" matter, part how to think small and deliver big.

Better Data, Smaller Models, Bigger Impact

2025-11-05
talk
Shelby Heinecke, PhD (Salesforce)

Small models don’t need more parameters, they need better data. I’ll share how my team built the xLAM family of small action models that punch far above their weight, enabling fast and accurate AI agents deployable anywhere. We’ll explore why high-quality, task-specific data is the ultimate performance driver and how it turns small models into powerful, real-world solutions. You’ll leave with a practical playbook for creating small models that are fast, efficient, and ready to deploy from the edge to the enterprise.

Building Distributed DuckDB Processing for Lakes

2025-11-05
talk
George Fraser (Fivetran)

DuckDB is the best way to execute SQL on a single node, and its embedding-friendly nature also makes it an excellent foundation for building distributed systems. George Fraser, CEO of Fivetran, will tell us how Fivetran used DuckDB to power its Iceberg data lake writer—coordinating thousands of small, parallel tasks across a fleet of workers, each running DuckDB queries on bounded datasets. The result is a high-throughput, dual-format (Iceberg + Delta) data lake architecture where every write scales linearly, snapshots stay perfectly in sync, and performance rivals a commercial database while remaining open and portable.
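
The worker-side pattern is easy to sketch in miniature. Below is a hypothetical, self-contained Python illustration of the "bounded task" idea—each task opens its own in-process DuckDB and touches only one partition. The file names, schema, and filter are invented for illustration, not Fivetran's actual implementation:

```python
import duckdb

# Create a few toy Parquet partitions so the sketch is self-contained.
setup = duckdb.connect()
for i in range(4):
    setup.execute(
        f"COPY (SELECT range AS id, range % 2 = 0 AS keep "
        f"FROM range({i * 100}, {(i + 1) * 100})) "
        f"TO 'part-{i}.parquet' (FORMAT parquet)"
    )

def run_bounded_task(partition: str, output: str) -> None:
    """One small task: read a single bounded partition, filter, write Parquet."""
    con = duckdb.connect()  # in-process; workers share no state
    con.execute(
        f"COPY (SELECT id FROM read_parquet('{partition}') WHERE keep) "
        f"TO '{output}' (FORMAT parquet)"
    )
    con.close()

# A coordinator would fan these calls out across a fleet of workers; because
# each task sees only its own bounded partition, throughput scales with workers.
for i in range(4):
    run_bounded_task(f"part-{i}.parquet", f"out-{i}.parquet")
```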

Check-in

2025-11-05
break

Doing More with our Data at DoSomething.Org: a Nonprofit’s Small Data Story

2025-11-05
talk
Sahil Gupta (DoSomething)

Throughout our 31-year history, DoSomething has led the nonprofit sector in technological innovation to better serve our mission to fuel young people to change the world. In the last two years, we have rebuilt our digital platform from scratch, focusing on efficiency, practicality, and right-sizing for nonprofit realities. In this talk, we’ll discuss the motivations behind this shift, how nonprofits across the globe face similar challenges, and how we built our data systems to empower us and the members we serve.

Explore23: Web application for exploration of a large genomic research cohort

2025-11-05
talk
Teague Sterling (23andMe)

We introduce Explore23, a privacy-forward, secure, extensible, and easy-to-use web application for browsing the multimodal data collected as part of the 23andMe, Inc. Research cohort, built heavily on the DuckDB ecosystem.

While the 23andMe Research program has collected a large number of data types from the >11M customers who have consented to participate, there has not yet been a comprehensive tool enabling exploration and visualization of the cohort, which is invaluable for genomics-driven target discovery and validation. Any such tool also needed to be extensible to future data types and applications, scalable to large participant and variant cohorts, comprehensible to non-experts and external parties, and, most importantly, protective of research participant privacy.

Explore23 uses DuckDB and the DuckDB extension ecosystem extensively throughout the lifecycle of the data it showcases. A combination of pre-processing, backend result generation, and WASM-powered Mosaic integrations enables rapid search and visualization of the wide range of datasets collected, integrating data from the various stages of the 23andMe research pipeline: raw survey questions, curated condition-based cohorts, genetic variants, and GWAS results. Of particular interest are the variant browser, which enables rapid, in-browser visualization of the over 170 million imputed and genotyped genetic variants in the 23andMe genetic panels, and the phenotypic pedigree summaries, which merge columnar datasets and graph queries (via DuckPGQ) to rapidly identify related participants in the research cohort who share specific conditions.

For each feature there were challenges, both internal and external, in finding and contextualizing specific datasets for groups not already well acquainted with the data (e.g., even browsing surveys) and in managing data scale. The front end serves data that has been pre-processed through rigorous masking logic to protect participant privacy. In sum, Explore23 is an invaluable tool for research scientists exploring the immense complexity and diverse data of the whole 23andMe research cohort, and it highlights the versatility of the DuckDB ecosystem in unifying data access from raw result processing up through in-browser visualizations.
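
As a flavor of the pedigree feature's graph side, here is a rough, hypothetical sketch of a DuckPGQ-style query over invented tables—matching pairs of related participants who share a condition. The SQL/PGQ syntax is paraphrased from the DuckPGQ documentation and may differ by version; nothing here reflects 23andMe's actual schema or data:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL duckpgq FROM community; LOAD duckpgq;")

# Toy stand-ins for the real tables: participants with a condition label,
# plus an edge table of family relationships.
con.execute("""
    CREATE TABLE participant AS
        SELECT * FROM (VALUES (1, 'asthma'), (2, 'asthma'), (3, NULL))
        AS t(id, condition);
    CREATE TABLE related AS
        SELECT * FROM (VALUES (1, 2), (2, 3)) AS t(src, dst);
""")

con.execute("""
    CREATE PROPERTY GRAPH pedigree
      VERTEX TABLES (participant)
      EDGE TABLES (
        related SOURCE KEY (src) REFERENCES participant (id)
                DESTINATION KEY (dst) REFERENCES participant (id)
      );
""")

# Directly related participant pairs that share the same condition.
print(con.execute("""
    SELECT a_id, b_id, a_cond
    FROM GRAPH_TABLE (pedigree
      MATCH (a:participant)-[r:related]->(b:participant)
      COLUMNS (a.id AS a_id, b.id AS b_id,
               a.condition AS a_cond, b.condition AS b_cond)
    )
    WHERE a_cond = b_cond
""").fetchall())
```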

From 1-parameter models to 100-billion-parameter models: how progress on one benefits all

2025-11-05
talk
Ravin Kumar (Google DeepMind)

In the last couple of years we've seen rapid evolution in massive, frontier-scale models. At the same time, small models have been going through an evolution of their own, using technologies developed for those frontier-scale models. In this talk we'll show how tensor frameworks and autograd made their way into Bayesian models, how massive-model development is yielding smaller models, and how both are useful for small data and model developers and the organizations they support.
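
To make the crossover concrete, here is a minimal, invented sketch: a one-parameter Bayesian model (a coin's bias with a Beta(2, 2) prior) fit by gradient descent using the same autograd machinery (here JAX) that powers frontier-scale training. The data and learning rate are made up for illustration:

```python
import jax
import jax.numpy as jnp

data = jnp.array([1., 1., 0., 1., 0., 1., 1.])  # seven made-up coin flips

def neg_log_posterior(theta):
    # Bernoulli likelihood with a Beta(2, 2) prior, parameterized on the
    # logit scale so the optimization is unconstrained.
    p = jax.nn.sigmoid(theta)
    log_lik = jnp.sum(data * jnp.log(p) + (1 - data) * jnp.log(1 - p))
    log_prior = jnp.log(p) + jnp.log(1 - p)  # Beta(2, 2), up to a constant
    return -(log_lik + log_prior)

grad_fn = jax.grad(neg_log_posterior)  # autograd: no hand-derived gradient
theta = 0.0
for _ in range(200):                   # plain gradient descent
    theta -= 0.1 * grad_fn(theta)

# Closed form for comparison: the Beta(2+5, 2+2) posterior mode is 6/9 ≈ 0.667.
print("MAP estimate of p:", float(jax.nn.sigmoid(theta)))
```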

In the long run, everything is a fad

2025-11-05
talk
Benn Stancil (ThoughtSpot)

To be clear: I'm not saying that analytics and data engineering are a fad. I'm not saying that data teams are doomed to fade away, or that the old fundamentals of data modeling are wrong, or that the urge to quantify everything is a mistake. I'm saying that things seem pretty good, right now. But, you know. As Charles Schwab constantly says, past performance is no guarantee of future results. So someone else might say all of that in the future—because, as John Maynard Keynes said, in the long run, we are all dead.

Projection Pushdown vs Predicate Pushdown: Rethinking Query Efficiency

2025-11-05
talk
Adi Polak (Confluent)

We were told to scale compute. But what if the real problem was never about big data, but about bad data access? In this talk, we’ll unpack two powerful, often misunderstood techniques—projection pushdown and predicate pushdown—and why they matter more than ever in a world where we want lightweight, fast queries over large datasets. These optimizations aren’t just academic—they’re the difference between querying a terabyte in seconds vs. minutes. We’ll show how systems like Flink and DuckDB leverage these techniques, what limits them (hello, Protobuf), and how smart schema and storage design—especially in formats like Iceberg and Arrow—can unlock dramatic speed gains. Along the way, we’ll highlight the importance of landing data in queryable formats, and why indexing and query engines matter just as much as compute. This talk is for anyone who wants to stop fully scanning their data lakes just to read one field.
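
A hypothetical DuckDB example makes both ideas tangible—the file name and schema below are invented. Projection pushdown means the scan materializes only the referenced column; predicate pushdown means Parquet row groups whose min/max statistics exclude the filter value are skipped without being read:

```python
import duckdb

con = duckdb.connect()
# A wide-ish table: one million rows, several columns we will NOT read.
# 'region' is written in sorted order so row-group statistics are selective.
con.execute("""
    COPY (
        SELECT range AS id,
               range // 20000 AS region,
               random() AS a, random() AS b, random() AS c
        FROM range(1000000)
    ) TO 'events.parquet' (FORMAT parquet)
""")

# Only column 'a' is read from disk (projection pushdown), and row groups
# whose statistics rule out region = 7 are skipped (predicate pushdown).
print(con.execute(
    "SELECT avg(a) FROM 'events.parquet' WHERE region = 7"
).fetchone())

# The plan shows the filter and projection inside the Parquet scan itself,
# not applied after a full table read.
print(con.execute(
    "EXPLAIN SELECT avg(a) FROM 'events.parquet' WHERE region = 7"
).fetchall())
```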

Small data, big features: we are rewriting SQLite!

2025-11-05
talk
Glauber Costa (Turso)

SQLite is the most deployed database in the world, and a crucial player in the small data movement. It powers everything we touch, from small wearables to server-side applications. But as the world changes, is it ready for the challenges that modern infrastructure demands? We believe the answer is "no": from its lack of support for concurrent writes, to its inability to work with complex data like vector embeddings, SQLite needs a fundamental overhaul. In this talk we will explore why a complete rewrite is the most practical path forward to bring SQLite into the modern era. We'll dive deep into how Turso, our full rewrite of SQLite in Rust, tackles these challenges head-on—delivering true concurrency, native vector support, and dramatic performance improvements. Expect concrete benchmarks, implementation details, and a clear roadmap for SQLite's future.
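
The concurrent-write limitation is easy to reproduce with stock SQLite from Python's standard library—no extra dependencies; the file name is arbitrary:

```python
import sqlite3

# Two connections to the same database file, with manual transaction control
# (isolation_level=None) and a short lock timeout so the failure is immediate.
w1 = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)
w2 = sqlite3.connect("demo.db", timeout=0.1, isolation_level=None)
w1.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")

w1.execute("BEGIN IMMEDIATE")        # first writer takes the write lock
w1.execute("INSERT INTO t VALUES (1)")
try:
    w2.execute("BEGIN IMMEDIATE")    # second writer is locked out
except sqlite3.OperationalError as err:
    print("second writer blocked:", err)   # -> "database is locked"
w1.execute("COMMIT")
```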

The Great Data Engineering Reset: From Pipelines to Agents and Beyond

2025-11-05
talk
Joe Reis (DeepLearning.AI)

For years, data engineering was a story of predictable pipelines: move data from point A to point B. But AI just hit the reset button on our entire field. Now, we're all staring into the void, wondering what's next. The fundamentals haven't changed—the traditional areas of data governance, data management, and data modeling still present challenges—but everything else is up for grabs. This talk will cut through the noise and explore the future of data engineering in an AI-driven world. We'll examine how team structures will evolve, why agentic workflows and real-time systems are becoming non-negotiable, and how our focus must shift from building dashboards and analytics to architecting for automated action. The reset button has been pushed. It's time for us to invent the future of our industry.

The Unbearable Bigness of Small Data

2025-11-05
talk
Jordan Tigani (MotherDuck)

Uncharted: Building a Semantic Layer + MCP to Map 1.7M Songwriter Connections with Claude Code

2025-11-05
talk
Sam Alexander (Knapsack)

In 2018, the Music Modernization Act set up The Mechanical Licensing Collective (The MLC), which makes its database of musical works and songwriter data public. Using this unprecedented treasure trove of songwriting data and modern local data tools, I ask the question: can small data tools allow a single person to map the pop songwriter family tree—and connect the dots from Kanye West to Taylor Swift?
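
As a hint of what the "connect the dots" query could look like with small data tools, here is a hypothetical DuckDB sketch: breadth-first search over a co-writer edge table via a recursive CTE. The toy graph below is invented and stands in for the MLC-derived dataset:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE cowrote AS SELECT * FROM (VALUES
        ('Kanye West', 'Writer A'),
        ('Writer A',   'Writer B'),
        ('Writer B',   'Taylor Swift')
    ) AS t(a, b);
""")

# Walk outward from Kanye West, keeping the trail and avoiding cycles.
print(con.execute("""
    WITH RECURSIVE path(writer, hops, trail) AS (
        SELECT 'Kanye West', 0, ['Kanye West']
        UNION ALL
        SELECT c.b, p.hops + 1, list_append(p.trail, c.b)
        FROM path p JOIN cowrote c ON c.a = p.writer
        WHERE p.hops < 4 AND NOT list_contains(p.trail, c.b)
    )
    SELECT trail FROM path WHERE writer = 'Taylor Swift' LIMIT 1
""").fetchone())
```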

When not to use Spark?

2025-11-05
talk
Holden Karau (Fight Health Insurance)

In this talk, the somewhat biased Apache Spark PMC member Holden will explore the times when using Spark is more likely to lead to disappointment and pages than success and promotions. We'll, of course, look at places where Spark can excel, but also explore heuristics like: if it fits in Excel, double-check that you need Spark. By using Spark only when it's truly beneficial, you can demonstrate that elusive "thought leadership" that always seems to be required for the next level of promotion. We'll explore how some of Spark's largest disadvantages are changing, but also which ones are likely to stick around—allowing you to seem like you have a magic tech eight ball the next time someone asks you to design your analytics strategy. Come for a place to sit after lunch and stay for the OOM therapy.

Happy Hour & Networking

2025-11-05
break

Session 2

2025-11-04
break

Register for your top choice

Lunch

2025-11-04
break

Session 1

2025-11-04
break

Register for your top choice

Check-in

2025-11-04
break

Composable Data Workflows: Building Pipelines That Just Work

2025-11-04
workshop
Dennis Hume (Dagster)

Learn how to use Dagster to build composable pipelines that run reliably from your laptop to production. We’ll cover practical patterns for testing, modular design, and maintainability so your workflows just work, at any scale.
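
For a taste of the style (a minimal sketch, not the workshop's actual material—asset names and logic are invented), Dagster models each pipeline step as an asset and derives the dependency graph from function signatures:

```python
import dagster as dg

@dg.asset
def raw_numbers() -> list[int]:
    """Pretend ingestion step."""
    return [1, 2, 3, 4]

@dg.asset
def doubled(raw_numbers: list[int]) -> list[int]:
    """Depends on raw_numbers; Dagster wires the dependency by parameter name."""
    return [n * 2 for n in raw_numbers]

if __name__ == "__main__":
    # The same assets run on a laptop via materialize() or in production
    # via a deployed Dagster instance — composable at any scale.
    result = dg.materialize([raw_numbers, doubled])
    print(result.success)
```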

Duck, duck, deploy: Building an AI-ready app in 2 hours

2025-11-04
workshop
Russ Garner (Omni), Becca Bruggman (Omni)

Start with a dataset in MotherDuck and build a production-ready analytics app using Omni’s semantic model and APIs. We’ll cover practical data modeling techniques, share lessons learned from building AI features, and walk through how to give AI the context it needs to answer questions accurately. You’ll leave with a working app and the skills to build your next one.
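
As a minimal sketch of the starting point (assuming a MotherDuck account and token; the database name is invented, and Omni's product-specific APIs are not shown), DuckDB connects to MotherDuck via the "md:" prefix:

```python
import duckdb

# Requires the MOTHERDUCK_TOKEN environment variable to be set.
con = duckdb.connect("md:my_db")   # "md:" routes the connection to MotherDuck
print(con.execute("SHOW TABLES").fetchall())
```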