Topic

ORC

Optimized Row Columnar (ORC)

columnar_storage big_data compression file_format storage

Activities

1

tagged


Top Events

Data Engineering Podcast: 7
O'Reilly Data Engineering Books: 2
Databricks DATA + AI Summit 2023: 1

Top Speakers

Tobias Macey: 7
Dipti Borkar (Microsoft): 2
Brock Noland (PhData): 1
Toby Mao (SQLMesh): 1
Jordan Birdsell (PhData): 1
Ryan Blue (Tabular): 1
Brooke Wenig: 1
Jules S. Damji (Anyscale Inc): 1
Tanmay Deshpande: 1
Tathagata Das (Databricks): 1
Aneesh Karve (Quilt Data): 1
Yoni Iny (Upsolver): 1

Activities


Apache Spark SQL Aggregate Improvement at Meta (Facebook)

2022-07-19 · Databricks DATA + AI Summit 2023

video

Databricks Spark SQL

Aggregate (group-by) is one of the most important SQL operations in data warehouses. It is required whenever we want aggregated insights from input datasets. Over the last year, we added a series of aggregate optimizations to our internal Spark SQL at Facebook, and we recently started contributing them back to Apache Spark.

(1) Sort aggregate (SPARK-32461): add code generation to improve query performance, replace hash aggregate with sort aggregate when the child is sorted, etc.
(2) Object hash aggregate (SPARK-34286): adaptive sort-based fallback based on JVM heap memory usage during query execution.
(3) Hash aggregate (SPARK-31973): adaptively bypass partial aggregate when the aggregate reduction ratio is low.
(4) Data source aggregate push down (SPARK-34960): aggregate push down to the ORC data source by utilizing column statistics.
(5) File statistics aggregate: aggregate output file (and all column) statistics distributively when writing query output.
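The idea behind item (4), answering an aggregate from column statistics instead of scanning rows, can be illustrated with a small sketch. This is not Spark's or ORC's implementation: `ColumnStats`, `stats_for`, and `aggregate_from_stats` are hypothetical names, and the statistics are modeled as one record per file, loosely in the spirit of the per-column count/min/max values a columnar file footer can carry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ColumnStats:
    """Hypothetical per-file statistics for a single integer column."""
    count: int               # number of non-null values
    minimum: Optional[int]   # None when the file holds no non-null values
    maximum: Optional[int]


def stats_for(values: List[Optional[int]]) -> ColumnStats:
    """Compute the statistics a writer would record when the file is written."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return ColumnStats(0, None, None)
    return ColumnStats(len(non_null), min(non_null), max(non_null))


def aggregate_from_stats(file_stats: List[ColumnStats]) -> Dict[str, Optional[int]]:
    """Answer COUNT(col), MIN(col), MAX(col) by merging statistics only.

    No row data is read here; each file contributes one small record.
    """
    mins = [s.minimum for s in file_stats if s.minimum is not None]
    maxs = [s.maximum for s in file_stats if s.maximum is not None]
    return {
        "count": sum(s.count for s in file_stats),
        "min": min(mins) if mins else None,
        "max": max(maxs) if maxs else None,
    }


# Three illustrative "files" of one nullable column (made-up data).
files = [[3, 1, None], [10, 7], [None]]
result = aggregate_from_stats([stats_for(f) for f in files])
```

The sketch only covers aggregates that merge cleanly from count/min/max (no filters, no group-by); anything else would still need a scan of the row data.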

We'll take a deep dive into the above features and the lessons learned.
