Topic

ORC

Optimized Row Columnar (ORC)

columnar_storage big_data compression file_format storage

Activity Trend: peak of 1 per quarter, 2020-Q1 to 2026-Q2

Top Events

Data Engineering Podcast: 7; O'Reilly Data Engineering Books: 2; Databricks DATA + AI Summit 2023: 1

Top Speakers

Tobias Macey: 7; Dipti Borkar (Microsoft): 2; Brock Noland (PhData): 1; Toby Mao (SQLMesh): 1; Jordan Birdsell (PhData): 1; Ryan Blue (Tabular): 1; Brooke Wenig: 1; Jules S. Damji (Anyscale Inc): 1; Tanmay Deshpande: 1; Tathagata Das (Databricks): 1; Aneesh Karve (Quilt Data): 1; Yoni Iny (Upsolver): 1

Activities

2 activities · Newest first

Learning Spark, 2nd Edition

2020-07-16 · O'Reilly Data Engineering Books

book

by Denny Lee (Databricks) , Brooke Wenig , Jules S. Damji (Anyscale Inc) , Tathagata Das (Databricks)

AI/ML Analytics API Avro CSV Data Analytics Delta Hive Java JSON Kafka Parquet +9 more

Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms.

Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to: learn Python, SQL, Scala, or Java high-level Structured APIs; understand Spark operations and SQL Engine; inspect, tune, and debug Spark operations with Spark configurations and Spark UI; connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka; perform analytics on batch and streaming data using Structured Streaming; build reliable data pipelines with open source Delta Lake and Spark; develop machine learning pipelines with MLlib and productionize models using MLflow.

Hadoop Real-World Solutions Cookbook - Second Edition

2016-03-31 · O'Reilly Data Engineering Books

book

by Tanmay Deshpande

AI/ML Analytics Big Data Hadoop Hive Parquet Spark data data-engineering

Master the full potential of big data processing using Hadoop with this comprehensive guide. Featuring over 90 practical recipes, this book helps you streamline data workflows and implement machine learning models with tools like Spark, Hive, and Pig. By the end, you'll confidently handle complex data problems and optimize big data solutions effectively.

What this book will help you do: install and manage a Hadoop 2.x cluster efficiently to suit your data processing needs; explore and use advanced tools like Hive, Pig, and Flume for seamless big data analysis; master data import/export processes with Sqoop and workflow automation using Oozie; implement machine learning and analytics tasks using Mahout and Apache Spark; store and process data flexibly across formats like Parquet, ORC, RC, and more.

Author: Tanmay Deshpande is an expert in big data processing and analytics with years of hands-on experience in implementing Hadoop-based solutions for real-world problems. Known for a clear and pragmatic writing style, Deshpande brings actionable wisdom and best practices to the forefront, helping readers excel in managing and utilizing big data systems.

Who is it for? Designed for technical enthusiasts and professionals, this book is ideal for those familiar with basic big data concepts. If you are looking to expand your expertise in Hadoop's ecosystem and implement data-driven solutions, this book will guide you through essential skills and advanced techniques to efficiently manage complex big data projects.