talk-data.com

Adi Polak

Speaker · 10 talks

VP of Developer Experience, Treeverse

Adi Polak is an experienced software engineer and people manager focused on data, AI, and machine learning for operations and analytics. She has built algorithms and distributed data pipelines using Spark, Kafka, HDFS, and large-scale systems, and has led teams to deliver pioneering ML initiatives. An accomplished educator, she has taught thousands of students how to scale machine learning with Spark and is the author of Scaling Machine Learning with Spark and High Performance Spark (2nd Edition). Earlier this year, she began exploring data streaming with Flink and ML inference, focusing on high-performance, end-to-end systems.

Bio from: Databricks DATA + AI Summit 2023


Talks & appearances

10 activities

High Performance Spark, 2nd Edition

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau, Rachel Warren, and Anya Bida walk you through the secrets of the Spark code base, and demonstrate performance optimizations that will help your data pipelines run faster, scale to larger datasets, and avoid costly antipatterns. Ideal for data engineers, software engineers, data scientists, and system administrators, the second edition of High Performance Spark presents new use cases, code examples, and best practices for Spark 3.x and beyond. This book gives you a fresh perspective on this continually evolving framework and shows you how to work around bumps on your Spark and PySpark journey.

With this book, you'll learn how to:

- Accelerate your ML workflows with integrations including PyTorch
- Handle key skew and take advantage of Spark's new dynamic partitioning (see the sketch after this list)
- Make your code reliable with scalable testing and validation techniques
- Make Spark high performance
- Deploy Spark on Kubernetes and similar environments
- Take advantage of GPU acceleration with RAPIDS and resource profiles
- Get your Spark jobs to run faster
- Use Spark to productionize exploratory data science projects
- Handle even larger datasets with Spark
- Gain faster insights by reducing pipeline running times
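To make the key-skew bullet concrete, here is a minimal sketch of the classic salting technique in PySpark. This is a common community pattern, not an excerpt from the book, and the table and column names (orders, customers, customer_id) are invented:

# A hedged sketch of "salting" to mitigate key skew in a join.
# Hot keys are spread across SALT_BUCKETS synthetic sub-keys so the
# join parallelizes instead of funneling into one straggler task.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders")        # skewed on customer_id
customers = spark.read.parquet("s3://bucket/customers")  # smaller dimension table

SALT_BUCKETS = 16

# Append a random salt to every fact row.
salted_orders = orders.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Replicate the dimension once per salt value so every salted key finds a match.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_customers = customers.crossJoin(salts)

joined = salted_orders.join(
    salted_customers, on=["customer_id", "salt"]
).drop("salt")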

We were told to scale compute. But what if the real problem was never about big data, but about bad data access? In this talk, we’ll unpack two powerful, often misunderstood techniques—projection pushdown and predicate pushdown—and why they matter more than ever in a world where we want lightweight, fast queries over large datasets. These optimizations aren’t just academic—they’re the difference between querying a terabyte in seconds vs. minutes. We’ll show how systems like Flink and DuckDB leverage these techniques, what limits them (hello, Protobuf), and how smart schema and storage design—especially in formats like Iceberg and Arrow—can unlock dramatic speed gains. Along the way, we’ll highlight the importance of landing data in queryable formats, and why indexing and query engines matter just as much as compute. This talk is for anyone who wants to stop fully scanning their data lakes just to read one field.
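For a concrete feel of the two techniques, here is a minimal sketch using pyarrow.dataset over Parquet, one engine among several that applies both optimizations; the dataset path and column names are hypothetical:

# Projection pushdown: only the user_id column is read from storage.
# Predicate pushdown: row groups whose column statistics cannot match
# the filter are skipped entirely.
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("s3://bucket/events", format="parquet")

table = dataset.to_table(
    columns=["user_id"],            # projection pushdown
    filter=pc.field("year") >= 2024,  # predicate pushdown
)
print(table.num_rows)

Because Parquet stores per-column chunks with min/max statistics, the reader fetches only user_id and skips row groups whose year statistics cannot satisfy the filter, rather than scanning the full dataset.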

Face To Face
with Hugo Lu, Jon Cooke (Dataception), Parmar, Chris Freestone, David Richardson, Paul Rankin (Paul Rankin IT), Jesse Anderson (Big Data Institute), Taylor McGrath (Boomi), Karl Ivo Sokolov, Nick White, Chris Tabb (LEIT DATA), Kelsey Hammock, Jean-Georges Perrin (Actian), Mehdi Ouazza (MotherDuck), Adi Polak (Treeverse), Eevamaija Virtanen

https://www.bigdataldn.com/en-gb/conference/session-details.4500.251781.the-high-performance-data-and-ai-debate.html

Moving data between operational systems and analytics platforms is often a painful process. Traditional pipelines that transfer data in and out of warehouses tend to become complex, brittle, and expensive to maintain over time.

Much of this complexity, however, is avoidable. Data in motion and data at rest—Kafka Topics and Iceberg Tables—can be treated as two sides of the same coin. By establishing an equivalence between Topics and Tables, it’s possible to transparently map between them and rethink how pipelines are built.

This talk introduces a declarative approach to bridging streaming and table-based systems. By shifting complexity into the data layer, we can decompose complex, imperative pipelines into simpler, more reliable workflows.

We’ll explore the design principles behind this approach, including schema mapping and evolution between Kafka and Iceberg, and how to build a system that can continuously materialize and optimize hundreds of thousands of topics as Iceberg tables.
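As a rough, hypothetical illustration of the schema-mapping idea (not the system described in this talk), here is a sketch that maps a simplified Avro-style Kafka record schema onto an Iceberg schema with pyiceberg. The type coverage and field names are invented, and a real mapper must also handle nesting, unions, logical types, and evolution:

# A hedged sketch: convert a simplified Avro-style field list, as you
# might fetch from a schema registry, into an Iceberg schema.
from pyiceberg.schema import Schema
from pyiceberg.types import (
    NestedField, StringType, LongType, DoubleType, BooleanType,
)

AVRO_TO_ICEBERG = {
    "string": StringType(),
    "long": LongType(),
    "double": DoubleType(),
    "boolean": BooleanType(),
}

def topic_schema_to_iceberg(avro_fields):
    """avro_fields: list of {"name": ..., "type": ...} dicts."""
    nested = [
        NestedField(
            field_id=i + 1,
            name=f["name"],
            field_type=AVRO_TO_ICEBERG[f["type"]],
            required=False,
        )
        for i, f in enumerate(avro_fields)
    ]
    return Schema(*nested)

# Example: a hypothetical "clicks" topic becomes an Iceberg-ready schema.
schema = topic_schema_to_iceberg(
    [{"name": "user_id", "type": "long"},
     {"name": "url", "type": "string"}]
)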

Whether you're building new pipelines or modernizing legacy systems, this session will provide practical patterns and strategies for creating resilient, scalable, and future-proof data architectures.

Data streaming is a genuinely hard problem. Despite more than a decade of attempts to simplify it, teams building real-time data pipelines can spend up to 80% of their time optimizing pipelines or fixing downstream output by handling bad data in the lake. All we want is a service that is reliable, handles all kinds of data, connects with all kinds of systems, is easy to manage, and scales up and down as our systems change. Oh, and it should also have super low latency and produce good data. Is that too much to ask?

In this presentation, you’ll learn the basics of data streaming and architecture patterns such as the dead-letter queue (DLQ), used to tackle these challenges. We will then explore how to implement these patterns using Apache Flink and discuss the challenges that real-time AI applications bring to our infrastructure. Difficult problems are difficult, and we offer no silver bullets. Still, we will share pragmatic solutions that have helped many organizations build fast, scalable, and manageable data streaming pipelines.
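To show the DLQ pattern itself, here is a minimal, framework-agnostic sketch in plain Python rather than Flink; the record format, validation rule, and sinks are all hypothetical:

# Records that fail parsing or validation are routed to a separate
# dead-letter sink instead of crashing the pipeline.
import json

def process(raw_record, good_sink, dlq_sink):
    try:
        event = json.loads(raw_record)
        if "user_id" not in event:          # a stand-in validation rule
            raise ValueError("missing user_id")
        good_sink.append(event)
    except (json.JSONDecodeError, ValueError) as err:
        # Keep the original payload plus the error so the record can be
        # inspected and replayed later.
        dlq_sink.append({"raw": raw_record, "error": str(err)})

good, dlq = [], []
for raw in ['{"user_id": 1}', "not-json", '{"no_id": true}']:
    process(raw, good, dlq)

print(len(good), "good records,", len(dlq), "dead-lettered")

In Flink, the same routing is typically done with side outputs, with the dead-letter stream written to its own Kafka topic for inspection and replay.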

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way

Moving data between operational systems and analytics platforms is often painful. Traditional pipelines become complex, brittle, and expensive to maintain.

Take Kafka and Iceberg: batching on Kafka causes ingestion bottlenecks, while streaming-style writes to Iceberg create too many small Parquet files—cluttering metadata, degrading queries, and increasing maintenance overhead. Frequent updates further strain background table operations, causing retries—even before dealing with schema evolution.

But much of this complexity is avoidable. What if Kafka Topics and Iceberg Tables were treated as two sides of the same coin? By establishing a transparent equivalence, we can rethink pipeline design entirely.

This session introduces Tableflow—a new approach to bridging streaming and table-based systems. It shifts complexity away from pipelines and into a unified layer, enabling simpler, declarative workflows. We’ll cover schema evolution, compaction, topic-to-table mapping, and how to continuously materialize and optimize thousands of topics as Iceberg tables.

Whether modernizing or starting fresh, you’ll leave with practical insights for building resilient, scalable, and future-proof data architectures.
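As a rough illustration of the small-file problem the abstract describes (not Tableflow itself), here is a sketch that compacts many small Parquet files into fewer, larger ones with pyarrow; the paths and size targets are invented:

# A hedged sketch of small-file compaction: read a directory full of
# tiny Parquet files and rewrite them as fewer, larger files. Table
# formats like Iceberg run similar maintenance as background jobs.
import pyarrow.dataset as ds

small_files = ds.dataset("warehouse/clicks/small", format="parquet")

ds.write_dataset(
    small_files,                   # streams batches; doesn't load everything
    "warehouse/clicks/compacted",
    format="parquet",
    max_rows_per_file=1_000_000,   # fewer, larger files
    max_rows_per_group=250_000,    # healthier row groups for scans
    min_rows_per_group=100_000,
)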

Data Biases and Generative AI: A Practitioner Discussion

Join this panel discussion that unpacks the technical challenges surrounding biases found in data and poses potential solutions and strategies for the future, including Generative AI. The session showcases diverse perspectives in the data and AI industry.

Talk by: Adi Polak, Gavita Regunath, Christina Taylor, and Layla Yang


Live from the Lakehouse: Ethics in AI with Adi Polak & gaining from open source with Vini Jaiswal

Hear from two guests. First, Adi Polak (VP of Developer Experience, Treeverse, and author of the #1 new release Scaling ML with Spark) on how AI helps us be more productive. Second, Vini Jaiswal (Principal Developer Advocate, ByteDance) on gaining from the open source community, overcoming scalability challenges, and taking innovation to the next stage. Hosted by Pearl Ubaru (Sr. Technical Marketing Engineer, Databricks).

Scaling Machine Learning with Spark

Learn how to build end-to-end scalable machine learning solutions with Apache Spark. With this practical guide, author Adi Polak introduces data and ML practitioners to creative solutions that supersede today's traditional methods. You'll learn a more holistic approach that takes you beyond specific requirements and organizational goals, allowing data and ML practitioners to collaborate and understand each other better. Scaling Machine Learning with Spark examines several technologies for building end-to-end distributed ML workflows based on the Apache Spark ecosystem with Spark MLlib, MLflow, TensorFlow, and PyTorch. If you're a data scientist who works with machine learning, this book shows you when and why to use each technology.

You will:

- Explore machine learning, including distributed computing concepts and terminology
- Manage the ML lifecycle with MLflow
- Ingest data and perform basic preprocessing with Spark
- Explore feature engineering, and use Spark to extract features
- Train a model with MLlib and build a pipeline to reproduce it (sketched below)
- Build a data system to combine the power of Spark with deep learning
- Get a step-by-step example of working with distributed TensorFlow
- Use PyTorch to scale machine learning and understand its internal architecture
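For the MLlib bullet above, here is a minimal sketch of the standard Spark ML Pipeline pattern; this is the general MLlib API, not an excerpt from the book, and the input path and column names (f1, f2, label) are hypothetical:

# Assemble features, train a model, and capture both steps as one
# reproducible Pipeline.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-pipeline").getOrCreate()
df = spark.read.parquet("s3://bucket/training")  # columns: f1, f2, label

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fitting the Pipeline runs every stage in order and returns a single
# PipelineModel that can be saved, reloaded, and applied to new data.
model = Pipeline(stages=[assembler, lr]).fit(df)
predictions = model.transform(df)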