talk-data.com talk-data.com

Topic

Parquet

Apache Parquet

columnar_storage big_data compression file_format storage

8

tagged

Activity Trend

5 peak/qtr
2020-Q1 2026-Q1

Activities

Showing filtered results

Filtering by: Databricks DATA + AI Summit 2023 ×
Managing Data Encryption in Apache Spark™

Sensitive data sets can be encrypted directly by new Apache Spark™ versions (3.2 and higher). Setting several configuration parameters and DataFrame options will trigger the Apache Parquet modular encryption mechanism that protects select columns with column-specific keys. The upcoming Spark 3.4 version will also support uniform encryption, where all DataFrame columns are encrypted with the same key.

Spark data encryption is already leveraged by a number of companies to protect personal or business confidential data in their production environments. The main integration effort is focused on key access control and on building a Spark/Parquet plug-in code that can interact with company’s key management service (KMS).

In this session, we will briefly cover the basics of Spark/Parquet encryption usage, and dive into the details of encryption key management that will help in integrating this Spark data protection mechanism in your deployment. You will learn how to run a HelloWorld encryption sample, and how to extend it into a real world production code integrated with your organization’s KMS and access control policies. We will talk about the standard envelope encryption approach to big data protection, the performance-vs-security trade-offs between single and double envelope wrapping, internal and external key metadata storage. We will see a demo, and discuss the new features such as uniform encryption and two-tier management of encryption keys.

Talk by: Gidon Gershinsky

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Data Extraction and Sharing Via The Delta Sharing Protocol

The Delta Sharing open protocol for secure sharing and distribution of Lakehouse data is designed to reduce friction in getting data to users. Delivering custom data solutions from this protocol further leverages the technical investment committed to your Delta Lake infrastructure. There are key design and computational concepts unique to Delta Sharing to know when undertaking development. And there are pitfalls and hazards to avoid when delivering modern cloud data to traditional data platforms and users.

In this session, we introduce Delta Sharing Protocol development and examine our journey and the lessons learned while creating the Delta Sharing Excel Add-in. We will demonstrate scenarios of overfetching, underfetching, and interpretation of types. We will suggest methods to overcome these development challenges. The session will combine live demonstrations that exercise the Delta Sharing REST protocol with detailed analysis of the responses. The demonstrations will elaborate on optional capabilities of the protocol’s query mechanism, and how they are used and interpreted in real-life scenarios. As a reference baseline for data professionals, the Delta Sharing exercises will be framed relative to SQL counterparts. Specific attention will be paid to how they differ, and how Delta Sharing’s Change Data Feed (CDF) can power next-generation data architectures. The session will conclude with a survey of available integration solutions for getting the most out of your Delta Sharing environment, including frameworks, connectors, and managed services.

Attendees are encouraged to be familiar with REST, JSON, and modern programming concepts. A working knowledge of Delta Lake, the Parquet file format, and the Delta Sharing Protocol are advised.

Talk by: Roger Dunn

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

US Army Corp of Engineers Enhanced Commerce & National Sec Through Data-Driven Geospatial Insight

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE a nationally standardized remote monitoring and documentation system across multiple vessel types with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, and the emergence of the cloud and Data Lakehouse Architectures have empowered USACE to successfully move into the modern data era.

This session incorporates aspects of these topics: Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, Data Lake, Apache Iceberg, Data Mesh GIS: H3, MOSAIC, spatial analysis data engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals. Data Streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, and real-time applications, Delta Live Tables. ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems data governance: security, compliance, RMF, NIST data sharing: sharing and collaboration, delta sharing, data cleanliness, APIs.

Talk by: Jeff Mroz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Vector Data Lakes

Vector databases such as ElasticSearch and Pinecone offer fast ingestion and querying on vector embeddings with ANNs. However, they typically do not decouple compute and storage, making them hard to integrate in production data stacks. Because data storage in these databases is expensive and not easily accessible, data teams typically maintain ETL pipelines to offload historical embedding data to blob stores. When that data needs to be queried, they get loaded back into the vector database in another ETL process. This is reminiscent of loading data from OLTP database to cloud storage, then loading said data into an OLAP warehouse for offline analytics.

Recently, “lakehouse” offerings allow direct OLAP querying on cloud storage, removing the need for the second ETL step. The same could be done for embedding data. While embedding storage in blob stores cannot satisfy the high TPS requirements in online settings, we argue it’s sufficient for offline analytics use cases like slicing and dicing data based on embedding clusters. Instead of loading the embedding data back into the vector database for offline analytics, we propose direct processing on embeddings stored in Parquet files in Delta Lake. You will see that offline embedding workloads typically touch a large portion of the stored embeddings without the need for random access.

As a result, the workload is entirely bound by network throughput instead of latency, making it quite suitable for blob storage backends. On a test one billion vector dataset, ETL into cloud storage takes around one hour on a dedicated GPU instance, while batched nearest neighbor search can be done in under one minute with four CPU instances. We believe future “lakehouses” will ship with native support for these embedding workloads.

Talk by: Tony Wang and Chang She

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic QuadrantTM CDBMS: https://dbricks.co/3phw20d

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sound Data Engineering in Rust—From Bits to DataFrames

Spark applications often need to query external data sources such as file-based data sources or relational data sources. In order to do this, Spark provides Data Source APIs to access structured data through Spark SQL.

Data Source APIs have optimization rules such as filter push down and column pruning to reduce the amount of data that needs to be processed to improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down in both JDBC and parquet.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Spark Data Source V2 Performance Improvement: Aggregate Push Down

Spark applications often need to query external data sources such as file-based data sources or relational data sources. In order to do this, Spark provides Data Source APIs to access structured data through Spark SQL.

Data Source APIs have optimization rules such as filter push down and column pruning to reduce the amount of data that needs to be processed to improve query performance. As part of our ongoing project to provide generic Data Source V2 push down APIs, we have introduced partial aggregate push down, which significantly speeds up spark jobs by dramatically reducing the amount of data transferred between data sources and Spark. We have implemented aggregate push down in both JDBC and parquet.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Delta Lake, the Foundation of Your Lakehouse

Delta Lake is the open source storage layer that makes the Databricks Lakehouse Platform possible by adding reliability, performance, and scalability to your data, wherever it is located. Join this session for an inside look at what is under the hood of Databricks - see how Delta Lake, by adding ACID transactions and versioning to Parquet files together with the Photon engine, provides customers with huge performance gains and the ability to address new challenges. This session will include a demo and overview of customer use cases unlocked by Delta Lake, and the benefits of running Delta Lake workloads on Databricks.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Recent Parquet Improvements in Apache Spark

Apache Parquet is a very popular columnar file format supported by Apache Spark. In a typical Spark job, scanning Parquet files is sometimes one of the most time consuming steps, as it incurs high CPU and IO overhead. Therefore, optimizing Parquet scan performance is crucial to job latency and cost efficiency.

Spark currently have two Parquet reader implementations: a vectorized one and a non-vectorized one. The former was implemented from scratch and offers much better performance than the latter. However, it currently doesn’t support complex types (e.g., array, list, map) at the moment and will fallback to the latter when encountering them. In addition to the reader implementation, predicate pushdown is also crucial to Parquet scan performance as it enables Spark to skip those data that do not satisfy the predicates, before the scan. Currently, Spark constructs predicates itself and rely on Parquet-MR to do the heavy lifting, which does the filtering based on various information such as statistics, dictionary, bloom filter or column index.

This talk will go through two recent improvements for Parquet scan performance: 1) vectorized read support for complex types, which allows Spark to achieve 10x+ improvement when reading Parquet data of complex types, and 2) Parquet column index support, which enables Spark to leverage Parquet column index feature during predicate pushdown. Last but not least, Chao go over some future work items that can further enhance Parquet read performance.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/