talk-data.com

Topic

Spark

Apache Spark

big_data distributed_computing analytics

Activity Trend: peak of 71 per quarter (2020-Q1 to 2026-Q1)

Top Events

- O'Reilly Data Engineering Books: 143
- Databricks DATA + AI Summit 2023: 120
- Data Engineering Podcast: 84
- Data + AI Summit 2025: 66
- O'Reilly Data Science Books: 20
- DATA MINER Big Data Europe Conference 2020: 8
- Microsoft Ignite 2025: 7
- Airflow Summit 2020: 7
- Google Cloud Next '24: 6
- Airflow Summit 2024: 6
- Big Data LDN 2025: 5
- Google Cloud Next '25: 5

Top Speakers

- Tobias Macey: 84
- Matei Zaharia (Databricks): 10
- Reynold Xin (Databricks): 8
- Jean-Georges Perrin (Actian): 5
- Holden Karau (Fight Health Insurance): 5
- Al Martin (IBM): 5
- Mark Brown (Microsoft): 5
- Martin Grund (Databricks): 4
- Richie (DataCamp): 4
- Denny Lee (Databricks): 4
- Ali Ghodsi (Databricks): 4
- Michael Armbrust (Databricks): 4

Activities

Filtering by: Holden Karau

High Performance Spark, 2nd Edition

2026-05-25 · O'Reilly Data Engineering Books

book

by Rachel Warren, Holden Karau (Fight Health Insurance), Adi Polak (Treeverse)

AI/ML Data Science Kubernetes PySpark PyTorch apache-spark data data-engineering

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau, Rachel Warren, and Anya Bida walk you through the secrets of the Spark code base, and demonstrate performance optimizations that will help your data pipelines run faster, scale to larger datasets, and avoid costly antipatterns.

Ideal for data engineers, software engineers, data scientists, and system administrators, the second edition of High Performance Spark presents new use cases, code examples, and best practices for Spark 3.x and beyond. This book gives you a fresh perspective on this continually evolving framework and shows you how to work around bumps on your Spark and PySpark journey.

With this book, you'll learn how to:

- Accelerate your ML workflows with integrations including PyTorch
- Handle key skew and take advantage of Spark's new dynamic partitioning
- Make your code reliable with scalable testing and validation techniques
- Make Spark high performance
- Deploy Spark on Kubernetes and similar environments
- Take advantage of GPU acceleration with RAPIDS and resource profiles
- Get your Spark jobs to run faster
- Use Spark to productionize exploratory data science projects
- Handle even larger datasets with Spark
- Gain faster insights by reducing pipeline running times
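The blurb mentions handling key skew without saying how. One widely used approach, which may or may not be the one the book teaches, is key salting with a two-stage aggregation. The sketch below illustrates the idea in plain Python rather than Spark so it runs anywhere; the dataset, the `salted_key` and `two_stage_sum` helpers, and the choice of 8 salts are all hypothetical.

```python
from collections import Counter, defaultdict


def salted_key(key, record_index, num_salts):
    """Attach a round-robin salt so one hot key becomes num_salts smaller keys."""
    return (key, record_index % num_salts)


def two_stage_sum(records, num_salts):
    """Sum values per key in two stages: per salted key, then per original key."""
    # Stage 1: partial sums per (key, salt). On a cluster, each salted key
    # could be handled by a different task instead of one overloaded task.
    partial = defaultdict(int)
    for i, (key, value) in enumerate(records):
        partial[salted_key(key, i, num_salts)] += value

    # Stage 2: drop the salt and combine the much smaller set of partial sums.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)


# Hypothetical skewed dataset: a single hot key dominates the input.
records = [("hot", 1)] * 1000 + [("cold_a", 1)] * 10 + [("cold_b", 1)] * 6

# Size of the largest group a single task would have to process.
unsalted_sizes = Counter(key for key, _ in records)
salted_sizes = Counter(
    salted_key(key, i, 8) for i, (key, _) in enumerate(records)
)
largest_unsalted = max(unsalted_sizes.values())
largest_salted = max(salted_sizes.values())

totals = two_stage_sum(records, num_salts=8)
print(largest_unsalted, largest_salted, totals)
```

The salt shrinks the largest group a single task must process while the second stage restores the per-key totals. The trade-off is an extra aggregation step, so salting tends to pay off only when a few keys dominate.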

When not to use Spark?

2025-11-05 · Small Data SF 2025

talk

by Holden Karau (Fight Health Insurance)

Analytics

In this talk, the somewhat biased Apache Spark PMC member Holden will explore the times when using Spark is more likely to lead to disappointment and pages than success and promotions. We'll, of course, look at places where Spark can excel, but also explore heuristics like "if it fits in Excel, double-check whether you need Spark." By using Spark only when it's truly beneficial, you can demonstrate that elusive "thought leadership" that always seems to be required for the next level of promotion. We'll explore how some of Spark's largest disadvantages are changing, but also which ones are likely to stick around -- allowing you to seem like you have a magic tech eight ball the next time someone asks you to design your analytics strategy. Come for a place to sit after lunch and stay for the OOM therapy.
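The Excel heuristic from the abstract can be written down as a rough size check. This is only a sketch of the idea, not something taken from the talk: the function names and return strings are hypothetical, and the only external facts used are Excel's worksheet limits of 1,048,576 rows and 16,384 columns.

```python
# Worksheet limits for modern Excel (.xlsx) workbooks.
EXCEL_MAX_ROWS = 1_048_576
EXCEL_MAX_COLS = 16_384


def fits_in_excel(num_rows, num_cols):
    """Return True if a table of this shape fits in one Excel worksheet."""
    return num_rows <= EXCEL_MAX_ROWS and num_cols <= EXCEL_MAX_COLS


def suggest_tool(num_rows, num_cols):
    """Hypothetical helper: apply the 'fits in Excel' heuristic to a table shape."""
    if fits_in_excel(num_rows, num_cols):
        return "double-check whether you need Spark"
    return "Spark may be worth evaluating"


print(suggest_tool(50_000, 20))
print(suggest_tool(5_000_000_000, 20))
```

Row and column counts are a crude proxy. Data volume in bytes, join complexity, and how often the job runs matter as well, so treat the result as a prompt to think rather than a decision.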

High Performance Spark

2017-05-25 · O'Reilly Data Engineering Books

book

by Rachel Warren, Holden Karau (Fight Health Insurance)

AI/ML Scala SQL Data Streaming apache-spark data data-engineering

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.

With this book, you’ll explore:

- How Spark SQL’s new interfaces improve performance over SQL’s RDD data structure
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark’s key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark’s Streaming components and external community packages

Fast Data Processing with Spark 2 - Third Edition

2016-10-24 · O'Reilly Data Engineering Books

book

by Holden Karau (Fight Health Insurance), Krishna Sankar

AI/ML Analytics API Big Data Cloud Computing Data Analytics Data Engineering Java Scala apache-spark data data-engineering

Fast Data Processing with Spark 2 takes you through the essentials of leveraging Spark for big data analysis. You will learn how to install and set up Spark, handle data using its APIs, and apply advanced functionality like machine learning and graph processing. By the end of the book, you will be well-equipped to use Spark in real-world data processing tasks.

What this book will help me do

- Install and configure Apache Spark for optimal performance.
- Interact with distributed datasets using the resilient distributed dataset (RDD) API.
- Leverage the flexibility of the DataFrame API for efficient big data analytics.
- Apply machine learning models using Spark MLlib to solve complex problems.
- Perform graph analysis using GraphX to uncover structural insights in data.

Author(s)

Krishna Sankar is an experienced data scientist and thought leader in big data technologies. With a deep understanding of machine learning, distributed systems, and Apache Spark, Krishna has guided numerous projects in data engineering and big data processing. Holden Karau, the co-author, is also widely recognized in the field of distributed systems and cloud computing, contributing to Apache Spark development.

Who is it for?

This book caters to software developers and data engineers with a foundational understanding of Scala or Java programming. A beginner-to-intermediate understanding of big data processing concepts is recommended. If you aspire to solve big data problems using scalable distributed computing frameworks, this book is a good fit. By the end, you will be confident in building Spark-powered applications and analyzing data efficiently.

Learning Spark

2015-02-17 · O'Reilly Data Engineering Books

book

by Andy Konwinski (Databricks), Holden Karau (Fight Health Insurance), Matei Zaharia (Databricks), Patrick Wendell (Databricks)

Analytics API Data Analytics Java Python Scala SQL Data Streaming apache-spark data data-engineering

Data in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.