Speaker

Adi Polak

Activities: 1 talk

VP of Developer Experience, Treeverse

Adi Polak is an experienced software engineer and people manager focused on data, AI, and machine learning for operations and analytics. She has built algorithms and distributed data pipelines using Spark, Kafka, HDFS, and large-scale systems, and has led teams to deliver pioneering ML initiatives. An accomplished educator, she has taught thousands of students how to scale machine learning with Spark and is the author of Scaling Machine Learning with Spark and High Performance Spark (2nd Edition). Earlier this year, she began exploring data streaming with Flink and ML inference, focusing on high-performance, end-to-end systems.

Bio from: Databricks DATA + AI Summit 2023

Talks & appearances

Projection Pushdown vs Predicate "Pushdown": Rethinking Query Efficiency

2025-11-05 · Small Data SF 2025

talk

Tags: Flink, Arrow, Big Data, DuckDB, Iceberg, Protobuf

We were told to scale compute. But what if the real problem was never big data, but bad data access? In this talk, we’ll unpack two powerful, often misunderstood techniques, projection pushdown and predicate pushdown, and explain why they matter more than ever in a world where we want lightweight, fast queries over large datasets. These optimizations aren’t just academic: they’re the difference between querying a terabyte in seconds and querying it in minutes. We’ll show how systems like Flink and DuckDB leverage these techniques, what limits them (hello, Protobuf), and how smart schema and storage design, especially in formats like Iceberg and Arrow, can unlock dramatic speed gains. Along the way, we’ll highlight the importance of landing data in queryable formats, and why indexing and query engines matter just as much as compute. This talk is for anyone who wants to stop fully scanning their data lakes just to read one field.
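To make the two techniques in the abstract concrete, here is a minimal toy sketch in plain Python. It is not taken from the talk and does not reflect how Flink, DuckDB, Iceberg, or Arrow implement these optimizations. The names `RowGroup`, `full_scan`, and `scan_with_pushdown` are hypothetical, and "values read" is only a rough stand-in for I/O cost. The sketch assumes data stored column by column in row groups that carry precomputed min/max statistics.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """A chunk of rows stored column by column (hypothetical toy layout)."""
    columns: dict  # column name -> list of values
    stats: dict = field(init=False)

    def __post_init__(self):
        # Min/max per column, computed once at "write time",
        # loosely like the metadata a columnar file keeps.
        self.stats = {
            name: (min(values), max(values))
            for name, values in self.columns.items()
        }


def full_scan(row_groups, projection, filter_column, lower_bound):
    """Baseline: read every column of every row group, then filter."""
    rows, values_read = [], 0
    for group in row_groups:
        row_count = len(group.columns[filter_column])
        values_read += row_count * len(group.columns)
        for i in range(row_count):
            if group.columns[filter_column][i] >= lower_bound:
                rows.append(tuple(group.columns[name][i] for name in projection))
    return rows, values_read


def scan_with_pushdown(row_groups, projection, filter_column, lower_bound):
    """Same query, with both pushdowns applied at the storage layer."""
    rows, values_read = [], 0
    for group in row_groups:
        # Predicate pushdown: skip the group if its max cannot satisfy the filter.
        _, group_max = group.stats[filter_column]
        if group_max < lower_bound:
            continue
        # Projection pushdown: touch only the filter column and requested columns.
        needed = list(dict.fromkeys([filter_column, *projection]))
        data = {name: group.columns[name] for name in needed}
        values_read += sum(len(values) for values in data.values())
        for i, value in enumerate(data[filter_column]):
            if value >= lower_bound:
                rows.append(tuple(data[name][i] for name in projection))
    return rows, values_read


# Illustrative data: three row groups, three columns each.
groups = [
    RowGroup({"ts": [1, 2, 3], "user": ["a", "b", "c"], "payload": ["x", "y", "z"]}),
    RowGroup({"ts": [4, 5, 6], "user": ["d", "e", "f"], "payload": ["p", "q", "r"]}),
    RowGroup({"ts": [7, 8, 9], "user": ["g", "h", "i"], "payload": ["s", "t", "u"]}),
]
```

Calling `full_scan(groups, ["user"], "ts", 8)` and `scan_with_pushdown(groups, ["user"], "ts", 8)` returns the same rows. The baseline touches every value in all three groups. The pushdown version skips the first two groups using their statistics and reads only the `ts` and `user` columns of the third.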