Event

Small Data SF 2025

2025-11-04 – 2025-11-06 Small Data SF

Activities tracked

2

Filtering by: Iceberg

Top Speakers

Benn Stancil (2)
George Fraser (2)
Joe Reis (2)
Jordan Tigani (2)
Shelby Heinecke, PhD (2)
Adi Polak (1)
Barr Moses (1)
Barry McCardel (1)
Becca Bruggman (1)
Brian Douglas (1)
Colin Zima (1)
Dennis Hume (1)

Sessions & talks

Building Distributed DuckDB Processing for Lakes

2025-11-05

talk

George Fraser (Fivetran)

Data Lake, Delta, DuckDB, Fivetran, Iceberg, SQL

DuckDB is the best way to execute SQL on a single node. But with its embedding-friendly nature, it makes an excellent foundation for building distributed systems. George Fraser, CEO of Fivetran, will tell us how Fivetran used DuckDB to power its Iceberg data lake writer—coordinating thousands of small, parallel tasks across a fleet of workers, each running DuckDB queries on bounded datasets. The result is a high-throughput, dual-format (Iceberg + Delta) data lake architecture where every write scales linearly, snapshots stay perfectly in sync, and performance rivals a commercial database while remaining open and portable.
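
The fan-out pattern described in this abstract (many small tasks, each running SQL over a bounded batch, with results merged afterwards) can be sketched roughly as follows. This is only an illustration and not Fivetran's implementation: the standard-library `sqlite3` module stands in for DuckDB as the embedded engine, a thread pool stands in for the fleet of workers, and the names `run_task`, `split`, and `process` as well as the `events` table are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def run_task(rows):
    """Aggregate one bounded batch in a private in-memory database.

    sqlite3 is a stand-in for an embedded engine such as DuckDB.
    """
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE events (user_id INTEGER, amount INTEGER)")
        con.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return con.execute(
            "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"
        ).fetchall()
    finally:
        con.close()


def split(rows, size):
    """Cut the input into batches of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def process(rows, batch_size=4, workers=3):
    """Run one task per batch in parallel, then merge the partial sums."""
    totals = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(run_task, split(rows, batch_size)):
            for user_id, amount in partial:
                totals[user_id] = totals.get(user_id, 0) + amount
    return totals


rows = [(i % 3, i) for i in range(10)]
totals = process(rows)
print(totals)
```

Because each task sees only its own batch, memory use per task stays bounded no matter how large the overall input is. The merge step works here because sums can be combined from partial results.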

Projection Pushdown vs Predicate "Pushdown": Rethinking Query Efficiency

2025-11-05

talk

Adi Polak (Confluent)

Flink, Arrow, Big Data, DuckDB, Iceberg, Protobuf

We were told to scale compute. But what if the real problem was never about big data, but about bad data access? In this talk, we’ll unpack two powerful, often misunderstood techniques, projection pushdown and predicate pushdown, and why they matter more than ever in a world where we want lightweight, fast queries over large datasets. These optimizations aren’t just academic: they’re the difference between querying a terabyte in seconds vs. minutes. We’ll show how systems like Flink and DuckDB leverage these techniques, what limits them (hello, Protobuf), and how smart schema and storage design, especially in formats like Iceberg and Arrow, can unlock dramatic speed gains. Along the way, we’ll highlight the importance of landing data in queryable formats, and why indexing and query engines matter just as much as compute. This talk is for anyone who wants to stop fully scanning their data lakes just to read one field.
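
To make the two techniques concrete, here is a minimal sketch over a toy columnar layout. It is not how Flink, DuckDB, or Iceberg implement them: the row-group structure, the per-column min/max statistics, and the names `make_row_group` and `scan` are all assumptions made for illustration. Projection pushdown means reading only the columns the query needs. Predicate pushdown means using the statistics to skip whole row groups that cannot match the filter.

```python
def make_row_group(rows):
    """Store a batch of rows column by column, with min/max stats per column."""
    columns = {name: [r[name] for r in rows] for name in rows[0]}
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}


def scan(row_groups, select, where_col, lo, hi):
    """Return `select` columns for rows with lo <= where_col <= hi.

    Also returns how many individual values were read, as a rough cost measure.
    """
    values_read = 0
    out = []
    # Projection pushdown: only the selected columns plus the filter column.
    needed = list(dict.fromkeys(list(select) + [where_col]))
    for rg in row_groups:
        gmin, gmax = rg["stats"][where_col]
        if gmax < lo or gmin > hi:
            continue  # Predicate pushdown: stats rule out this whole row group.
        cols = {name: rg["columns"][name] for name in needed}
        n = len(cols[where_col])
        values_read += n * len(needed)
        for i in range(n):
            if lo <= cols[where_col][i] <= hi:
                out.append(tuple(cols[name][i] for name in select))
    return out, values_read


rows = [
    {"id": i, "country": "US" if i % 2 else "DE", "payload": "x" * 100}
    for i in range(12)
]
row_groups = [make_row_group(rows[i:i + 4]) for i in range(0, 12, 4)]

result, values_read = scan(row_groups, select=["country"], where_col="id", lo=8, hi=9)
print(result, values_read)
```

In this toy table a full scan would touch 36 values (12 rows times 3 columns), while the scan above touches 8: two of the three row groups are skipped via their `id` statistics, and the wide `payload` column is never read.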

talk-data.com
