talk-data.com

Topic: JSON (JavaScript Object Notation)

Tags: data_format, lightweight, web_development, file_format

5 tagged

Activity Trend: peak of 9 per quarter, 2020-Q1 to 2026-Q1

Activities

Showing filtered results. Filtering by: Databricks DATA + AI Summit 2023

Streaming Schema Drift Discovery and Controlled Mitigation

When creating streaming workloads with Databricks, it can sometimes be difficult to capture and understand the current structure of your source data. For example, what happens if you are ingesting JSON events from a vendor, and the keys are very sparsely populated, or contain dynamic content? Ideally, data engineers want to "lock in" a target schema in order to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don't cooperate with that vision? The first step is to quantify how far your current source data is drifting from your established Delta table. But how?
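As a hedged illustration of that first step (not necessarily the approach shown in the session), one simple way to quantify drift is to compare the keys arriving in raw JSON events against the columns of the target Delta table; the table and column names below are hypothetical.

```python
# Minimal drift check: which JSON keys are arriving that the target table does not capture?
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Columns "locked in" on the target Delta table (hypothetical name).
target_cols = set(spark.table("events_bronze").schema.fieldNames())

# Sample recent raw events, assumed to be stored as JSON strings in a `raw_json` column.
sample = spark.table("events_raw").select("raw_json").limit(10_000).collect()

seen_keys = set()
for row in sample:
    seen_keys.update(json.loads(row["raw_json"]).keys())

drifted = seen_keys - target_cols        # keys arriving but not yet promoted to columns
print(f"{len(drifted)} unseen keys, e.g. {sorted(drifted)[:20]}")
```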

This session will demonstrate a way to capture and visualize drift across all your streaming tables. The next question is, "Now that I see all of the data I'm missing, how do I selectively promote some of these keys into DataFrame columns?" The second half of this session will demonstrate precisely how to do a schema migration with minimal job downtime.
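A minimal sketch of what promoting a key can look like with plain Structured Streaming and Delta schema evolution; the key, table names, and checkpoint path are made up for illustration, and the session's own migration approach may differ.

```python
# Promote a newly discovered JSON key into a real column on the bronze table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

promoted = (
    spark.readStream.table("events_raw")
    # Pull the new key out of the raw JSON payload; returns NULL where the key is absent.
    .withColumn("customer_tier", F.get_json_object("raw_json", "$.customer_tier"))
)

(promoted.writeStream
    .option("checkpointLocation", "/chk/events_bronze")   # hypothetical path
    .option("mergeSchema", "true")                         # additive Delta schema change
    .toTable("events_bronze"))
```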

Talk by: Alexander Vanadio

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Data Extraction and Sharing Via The Delta Sharing Protocol

The Delta Sharing open protocol for secure sharing and distribution of Lakehouse data is designed to reduce friction in getting data to users. Delivering custom data solutions on top of this protocol further leverages the technical investment committed to your Delta Lake infrastructure. There are key design and computational concepts unique to Delta Sharing to understand before undertaking development, and there are pitfalls and hazards to avoid when delivering modern cloud data to traditional data platforms and users.

In this session, we introduce Delta Sharing Protocol development and examine our journey and the lessons learned while creating the Delta Sharing Excel Add-in. We will demonstrate scenarios of overfetching, underfetching, and interpretation of types. We will suggest methods to overcome these development challenges. The session will combine live demonstrations that exercise the Delta Sharing REST protocol with detailed analysis of the responses. The demonstrations will elaborate on optional capabilities of the protocol’s query mechanism, and how they are used and interpreted in real-life scenarios. As a reference baseline for data professionals, the Delta Sharing exercises will be framed relative to SQL counterparts. Specific attention will be paid to how they differ, and how Delta Sharing’s Change Data Feed (CDF) can power next-generation data architectures. The session will conclude with a survey of available integration solutions for getting the most out of your Delta Sharing environment, including frameworks, connectors, and managed services.
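For orientation, here is a minimal sketch of exercising the Delta Sharing REST query endpoint directly; the endpoint path and field names follow the open protocol specification, while the server URL, share/schema/table names, and token are placeholders, and this is not the Excel Add-in's code.

```python
# Query one shared table over the Delta Sharing REST protocol.
import json
import requests

endpoint = "https://sharing.example.com/delta-sharing"   # placeholder sharing server
token = "dapi-..."                                        # placeholder bearer token
headers = {"Authorization": f"Bearer {token}"}

# Hint at a predicate and row limit so the server can prune files; servers may
# ignore hints, so clients must still re-filter the data they receive.
body = {"predicateHints": ["date >= '2023-01-01'"], "limitHint": 1000}
resp = requests.post(
    f"{endpoint}/shares/sales/schemas/retail/tables/orders/query",
    headers=headers, json=body, timeout=60,
)
resp.raise_for_status()

# The response is newline-delimited JSON: a protocol line, a metadata line, then
# one line per Parquet file with a pre-signed URL to fetch.
for line in resp.text.splitlines():
    print(json.loads(line))
```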

Attendees are encouraged to be familiar with REST, JSON, and modern programming concepts. A working knowledge of Delta Lake, the Parquet file format, and the Delta Sharing Protocol is advised.

Talk by: Roger Dunn

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Powering Up the Business with a Lakehouse

Within Wehkamp, we needed a uniform way to provide reliable, on-time data to the business while keeping that access compliant with GDPR. Unlocking the data sources scattered across the company and democratizing data access was of the utmost importance, allowing us to empower the business with more, better, and faster data.

Focusing on open source technologies, we've built a data platform almost from the ground up around three levels of data curation - bronze, silver and gold - following the Lakehouse architecture. PII fields are pseudonymized during ingestion into bronze, which makes use of the data within the Delta lake compliant; since no user data is visible, everyone can use the entire Delta lake for exploration and new use cases. Naturally, specific teams are allowed to see the user data that is necessary for their use cases. Beyond the standard architecture, we've developed a library that lets us onboard a new data source by adding a JSON config file describing its characteristics. Combined with the ACID transactions that Delta provides and the efficient Structured Streaming ingestion that Auto Loader enables, this has allowed a small team to maintain 100+ streams with negligible downtime.
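As a rough sketch of what such config-driven ingestion can look like (the config layout, paths, and table names below are illustrative assumptions, not Wehkamp's actual library):

```python
# Config-driven bronze ingestion with Auto Loader and PII pseudonymization.
import json
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One JSON config per source describes where it lives and which fields are PII,
# e.g. {"source_path": "s3://landing/orders", "target": "bronze.orders",
#       "pii_fields": ["email", "customer_name"]}
with open("/configs/orders.json") as f:        # hypothetical config path
    cfg = json.load(f)

stream = (
    spark.readStream.format("cloudFiles")      # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/schemas/orders")
    .load(cfg["source_path"])
)

# Pseudonymize PII on the way into bronze so the Delta lake is safe to explore.
for field in cfg["pii_fields"]:
    stream = stream.withColumn(field, F.sha2(F.col(field).cast("string"), 256))

(stream.writeStream
    .option("checkpointLocation", f"/chk/{cfg['target']}")
    .toTable(cfg["target"]))
```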

Some other components of this platform are the following:
- Alerting to Slack
- Data quality checks
- CI/CD
- Stream processing with the Delta engine

The feedback so far has been encouraging, with more and more teams across the company starting to use the new platform and taking advantage of all its perks. It will still be a while before we can turn off some components of the old data platform, but it has come a long way.

Opening the Floodgates: Enabling Fast, Unmediated End User Access to Trillion-Row Datasets with SQL

Spreadsheets revolutionized IT by giving end users the ability to create their own analytics. Providing direct end user access to trillion-row datasets generated in financial markets or digital marketing is much harder. New SQL data warehouses like ClickHouse and Druid can provide fixed latency with constant cost on very large datasets, which opens up new possibilities.

Our talk walks through recent experience on analytic apps developed by ClickHouse users that enable end users like market traders to develop their own analytics directly off raw data. We’ll cover the following topics.

  1. Characteristics of new open source column databases and how they enable low-latency analytics at constant cost.

  2. Idiomatic ways to validate new apps by building MVPs that support a wide range of queries on source data, including storing source JSON, schema design, applying compression on columns, and building indexes for needle-in-a-haystack queries (see the sketch after this list).

  3. Incrementally identifying hotspots and applying easy optimizations to bring query performance into line with long-term latency and cost requirements.

  4. Methods of building accessible interfaces, including traditional dashboards, imitating existing APIs that users already know, and creating app-specific visualizations.
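Relating to item 2, here is a hedged, minimal sketch of what such an MVP table might look like, created through the open-source clickhouse-connect Python client; the table name, columns, codecs, and index settings are illustrative assumptions, not the schemas from the talk.

```python
# Create an MVP table that keeps the raw source JSON, applies column compression,
# and adds a skip index for needle-in-a-haystack lookups.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")   # placeholder host

client.command("""
CREATE TABLE IF NOT EXISTS trades_raw (
    ts       DateTime64(3),
    symbol   LowCardinality(String),
    trade_id String,
    price    Float64 CODEC(Gorilla, ZSTD),
    payload  String  CODEC(ZSTD(3)),   -- the original source JSON, kept verbatim
    INDEX idx_trade_id trade_id TYPE bloom_filter GRANULARITY 4
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (symbol, ts)
""")

# A needle-in-a-haystack probe the bloom filter index can accelerate.
result = client.query("SELECT * FROM trades_raw WHERE trade_id = 'T-123456' LIMIT 10")
print(result.result_rows)
```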

We’ll finish by summarizing a few of the benefits we’ve observed and also touch on ways that analytic infrastructure could be improved to make end user access even more productive. The lessons are as general as possible so that they can be applied across a wide range of analytic systems, not just ClickHouse.

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

The Spark history server is an essential tool for monitoring, analyzing, and optimizing Spark jobs.

The original history server is based on the Spark event log mechanism. A running Spark job continuously produces many kinds of events that describe the job's status. All the events are serialized into JSON and appended to a file, the event log. The history server has to replay the event log and rebuild the in-memory store needed for the UI. In a cluster, the history server also needs to periodically scan the event log directory and cache all the files' metadata in memory.
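To make the replay cost concrete, here is a toy sketch that reads a Spark event log (newline-delimited JSON, one event per line) and tallies event types; the log path is a placeholder, and this is far simpler than the real history server replay.

```python
# Count event types in a Spark event log file.
import json
from collections import Counter

counts = Counter()
with open("/tmp/spark-events/app-20230601-0001") as log:   # hypothetical log file
    for line in log:
        event = json.loads(line)
        counts[event["Event"]] += 1          # e.g. "SparkListenerTaskEnd"

# Task-level events usually dominate, which is why long-running jobs produce huge logs.
print(counts.most_common(5))
```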

In practice, an event log contains far more information than a history server needs. A long-running application can produce a huge event log that is costly to maintain and slow to replay. In large-scale production, the number of jobs can be large, placing a heavy burden on history servers, and building a scalable history server service requires additional development.

In this talk, we introduce a new history server based on UIMeta. UIMeta is a wrapper around the KVStore objects needed by a Spark UI. A job produces a UIMeta log by serializing UIMeta in stages. A UIMeta log is approximately 10x smaller and 10x faster to replay than the original event log file. Building on this performance, we developed a new stateless history server that requires no directory scan. UIMeta Service has now replaced the original history server and serves millions of jobs per day at ByteDance.
