talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

205

Filtering by: Data Lakehouse

Sessions & talks

Showing 26–50 of 205 · Newest first

Sponsored: Sisense-Developing Data Products: Infusion & Composability Are Changing Expectations

2023-07-27

Composable analytics is the next progression of business intelligence. We will discuss how current analytics rely on two key principles: composability and agility. Through modularizing our analytics capabilities, we can rapidly “compose” new data applications. An organization uses these building blocks to deliver customized analytics experiences at a customer level.

This session will orient business intelligence leaders to composable data and analytics.

  • How data teams can use composable analytics to decrease application development time.
  • How an organization can leverage existing and new tools to maximize value-based, data-driven insights.
    • Requirements for effectively deploying composable analytics.
    • Utilizing no-code, low-code, and high-code analytics capabilities.
    • Extracting full value from your customer data and metadata.
    • Leveraging analytics building blocks to create new products and revenue streams.

Talk by: Scott Castle

The Future is Open: Data Streaming in an Omni-Cloud Reality

2023-07-27

This session begins with data warehouse trivia and lessons learned from production implementations of multicloud data architecture. You will learn to design future-proof low latency data systems that focus on openness and interoperability. You will also gain a gentle introduction to Cloud FinOps principles that can help your organization reduce compute spend and increase efficiency. 

Most enterprises today are multicloud. While an assortment of low-code connectors boast the ability to make data available for analytics in real time, these connectors pose long-lasting challenges:

  • Inefficient EDW targets
  • Inability to evolve schema
  • Prohibitively expensive data exports due to cloud and vendor lock-in

The alternative is an open data lake that unifies batch and streaming workloads. Bronze landing zones in open formats eliminate the data extraction costs imposed by proprietary EDWs. Apache Spark™ Structured Streaming provides a unified ingestion interface. Streaming triggers allow us to switch back and forth between batch and stream with one-line code changes. Streaming aggregation enables us to incrementally compute over data that arrives close together in time.
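
As a rough illustration of the one-line trigger switch described above (paths and table names are invented placeholders, not from the talk), the same Structured Streaming writer can behave as an incremental batch job or a continuous stream depending only on the trigger setting:

```python
# Sketch only: a single trigger change flips a Structured Streaming job
# between incremental-batch and continuous execution. Paths/tables are
# hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("delta")
    .load("/lakehouse/bronze/events")   # hypothetical bronze landing zone
)

writer = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/lakehouse/_checkpoints/events_silver")
    .outputMode("append")
)

# "Batch" mode: process everything available now, then stop.
writer.trigger(availableNow=True).toTable("silver.events")

# "Streaming" mode: the same pipeline, changed on one line.
# writer.trigger(processingTime="1 minute").toTable("silver.events")
```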

Specific examples are given on how to use Auto Loader to discover newly arrived data and ensure exactly-once, incremental processing; how DLT can be configured to further simplify streaming jobs and accelerate the development cycle; and how to apply software engineering best practices to Workflows and integrate with popular Git providers, using either the Databricks Project or the Databricks Terraform provider.
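
A minimal Auto Loader sketch of the discovery-and-ingest idea, assuming a JSON landing path and checkpoint location that are purely illustrative (this is not code from the session):

```python
# Auto Loader ("cloudFiles") discovers newly arrived files incrementally;
# the checkpoint provides exactly-once, incremental processing on restart.
# All paths are hypothetical; assumes the ambient `spark` session
# (e.g., in a Databricks notebook).
raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/lakehouse/_schemas/raw_events")
    .load("/landing/raw_events")
)

(
    raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/lakehouse/_checkpoints/raw_events")
    .trigger(availableNow=True)
    .toTable("bronze.raw_events")
)
```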

Talk by: Christina Taylor

Optimizing Batch and Streaming Aggregations

2023-07-27

A client recently asked us to optimize their batch and streaming workloads. The workloads happened to be aggregations using the DataFrame.groupBy operation with a custom Scala UDAF over a data stream from Kafka. A single, simple-looking request turned into a months-long hunt for a more performant query execution plan than ObjectHashAggregateExec, which kept falling back to sort-based aggregation (i.e., the worst possible aggregation runtime performance). It quickly taught us that an aggregation using a custom Scala UDAF cannot be planned as anything other than ObjectHashAggregateExec, but at least tasks don't always have to fall back. And that's just batch workloads. When you throw in streaming semantics and think about the different output modes, windowing, and streaming watermarks, optimizing aggregations can take a long time to get right.
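
One way to observe the planning behavior described here (a generic sketch, not the speaker's code): compare the physical plans of a hash-friendly aggregation and one that requires ObjectHashAggregateExec, and look for a SortAggregate fallback in the output.

```python
# Generic sketch: inspect which aggregation operator Spark plans.
# Fixed-width aggregates (sum, count, ...) plan as HashAggregateExec;
# collect_list and custom typed/UDAF aggregates plan as
# ObjectHashAggregateExec, which can fall back to sort-based aggregation
# (see spark.sql.objectHashAggregate.sortBased.fallbackThreshold).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumn("key", F.col("id") % 1000)

# Planned as HashAggregateExec:
df.groupBy("key").agg(F.sum("id")).explain()

# Planned as ObjectHashAggregateExec (watch for SortAggregate after a
# fallback in the printed plan):
df.groupBy("key").agg(F.collect_list("id")).explain()
```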

Talk by: Jacek Laskowski

Map Your Lakehouse Content with DiscoverX

2023-07-26

An enterprise lakehouse contains many different datasets, coming from different sources and potentially belonging to different business units. These datasets can span hundreds of tables, each table has a different schema, and those schemas evolve over time. The cyber security domain is a good example where datasets come from many different source systems and land in the lakehouse. With such a complex dataset ecosystem, answering simple questions like “Have we ever detected this IP address?” or “Which columns contain IP addresses?” can become impractical and expensive.

DiscoverX can automate the discovery of all columns that might contain specific patterns (e.g., IP addresses, MAC addresses, fully qualified domain names) and automatically generate search and indexing queries that span multiple tables and columns.
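
The underlying idea can be sketched generically; the snippet below is not the DiscoverX API, and the table name and detection threshold are invented for illustration. It scans the string columns of one table and reports which ones appear to contain IPv4 addresses.

```python
# Generic sketch of pattern-based column discovery, NOT the DiscoverX API.
# For each string column of a hypothetical table, estimate the fraction of
# sampled values matching an IPv4 regex.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
IPV4 = r"^((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$"

df = spark.table("cyber.bronze.firewall_logs").limit(10_000)  # sample rows
string_cols = [f.name for f in df.schema.fields
               if f.dataType.simpleString() == "string"]

hits = df.select([
    F.avg(F.col(c).rlike(IPV4).cast("double")).alias(c) for c in string_cols
]).first()

for col in string_cols:
    if hits[col] and hits[col] > 0.5:          # naive threshold
        print(f"{col}: likely contains IP addresses ({hits[col]:.0%})")
```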

Talk by: Erni Durdevic and David Tempelmann

Unlocking Near Real Time Data Replication with CDC, Apache Spark™ Streaming, and Delta Lake

2023-07-26

Tune in to DoorDash's journey migrating from a flaky ETL system with 24-hour data delays to a standardized CDC streaming pattern across more than 150 databases, producing near real-time data in a scalable, configurable, and reliable manner.

During this journey, understand how we use Delta Lake to build a self-serve, read-optimized data lake with data latencies of 15, while reducing operational overhead. Furthermore, understand how certain tradeoffs, like conceding to a non-real-time system, allow for multiple optimizations while still permitting OLTP query use cases, and the benefits this provides.
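
A common way to express this kind of CDC replication with Spark Structured Streaming and Delta Lake is a foreachBatch MERGE. The sketch below uses hypothetical table, key, and operation-column names, not DoorDash's actual pipeline:

```python
# Hedged sketch of CDC upserts into Delta via foreachBatch + MERGE.
# Table names, keys, and the change-event columns are illustrative.
# A real pipeline would also dedupe each micro-batch to the latest
# change per key before merging.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def upsert_to_delta(microbatch_df, batch_id):
    target = DeltaTable.forName(spark, "silver.orders")
    (
        target.alias("t")
        .merge(microbatch_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedDelete(condition="s.op = 'DELETE'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.table("bronze.orders_cdc")   # CDC events landed upstream
    .writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", "/checkpoints/orders_cdc_merge")
    .start()
)
```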

Talk by: Ivan Peng and Phani Nalluri

Deploying the Lakehouse to Improve the Viewer Experience on Discovery+

2023-07-26

In this session, we will discuss how real-time data streaming can be used to gain insights into user behavior and preferences, and how this data is being used to provide personalized content and recommendations on Discovery+. We will examine techniques that enable faster decision-making and insights on accurate real-time data, including data masking and data validation. To enable a wide set of data consumers, from data engineers to data scientists to data analysts, we will discuss how Unity Catalog is leveraged for secure data access and sharing while still allowing teams flexibility.

Operating at this scale requires examining the value being created by the data being processed and optimizing along the way, and we will share some of our successes in this area.

Talk by: Deepa Paranjpe

Extending Lakehouse Architecture with Collaborative Identity

2023-07-26
Erin Boelkens (LiveRamp), Shawn Gilleran (LiveRamp)

Lakehouse architecture has become a valuable solution for unifying data processing for AI, but faces limitations in maximizing data’s full potential. Additional data infrastructure is helpful for strengthening data consolidation and data connectivity with third-party sources, which are necessary for building full data sets for accurate audience modeling. 

In this session, LiveRamp will demonstrate to data and analytics decision-makers how to build on the Lakehouse architecture with extensions for collaborative identity graph construction, including how to simplify and improve data enrichment, data activation, and data collaboration. LiveRamp will also introduce a complete data marketplace, which enables easy, pseudonymized data enhancements that widen the attribute set for better behavioral model construction.

With these techniques and technologies, enterprises across financial services, retail, media, travel, and more can safely unlock partner insights and ultimately produce more accurate inputs for personalization engines, and more engaging offers and recommendations for customers.

Talk by: Erin Boelkens and Shawn Gilleran

How Coinbase Built and Optimized SOON, a Streaming Ingestion Framework

2023-07-26

Data with low latency is important for real-time incident analysis and metrics. Though we have up-to-date data in OLTP databases, they cannot support those scenarios. Data needs to be replicated to a data warehouse to serve queries using GroupBy and Join across multiple tables from different systems. At Coinbase, we designed SOON (Spark cOntinuOus iNgestion) based on Kafka, Kafka Connect, and Apache Spark™ as an incremental table replication solution to replicate tables of any size from any database to Delta Lake in a timely manner. It also naturally supports Kafka event ingestion.

SOON incrementally ingests Kafka events as appends, updates, and deletes to an existing table on Delta Lake. The events are grouped into two categories: CDC (change data capture) events generated by Kafka Connect source connectors, and non-CDC events by the frontend or backend services. Both types can be appended or merged into the Delta Lake. Non-CDC events can be in any format, but CDC events must be in the standard SOON CDC schema. We implemented Kafka Connect SMTs to transform raw CDC events into this standardized format. SOON unifies all streaming ingestion scenarios such that users only need to learn one onboarding experience and the team only needs to maintain one framework.

We care about ingestion performance. The biggest append-only table onboarded has ingress traffic of hundreds of thousands of events per second; the biggest CDC-merge table onboarded has a snapshot size of a few TBs and CDC update traffic of hundreds of thousands of events per second. Many innovative ideas are incorporated into SOON to improve its performance, such as min-max range merge optimization, KMeans merge optimization, no-update merge for deduplication, and generated columns as partitions.
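
As a hedged illustration of two of the ideas mentioned (Kafka ingestion into Delta and generated columns as partitions), with made-up topic, schema, and table names rather than Coinbase's actual SOON code:

```python
# Sketch only: Kafka -> Delta append with a generated date column used as
# the partition. Names and schema are made up for illustration.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Target table with a generated partition column derived from event time.
(
    DeltaTable.createIfNotExists(spark)
    .tableName("bronze.app_events")
    .addColumn("event_id", "STRING")
    .addColumn("event_ts", "TIMESTAMP")
    .addColumn("payload", "STRING")
    .addColumn("event_date", "DATE", generatedAlwaysAs="CAST(event_ts AS DATE)")
    .partitionedBy("event_date")
    .execute()
)

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

(
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "app-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .writeStream
    .option("checkpointLocation", "/checkpoints/app_events")
    .toTable("bronze.app_events")
)
```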

Talk by: Chen Guo

Rapidly Implementing Major Retailer API at the Hershey Company

2023-07-26
Simon Whiteley (Advancing Analytics), Jordan Donmoyer

Accurate, reliable, and timely data is critical for CPG companies to stay ahead in highly competitive retailer relationships, and for a company like the Hershey Company, the commercial relationship with Walmart is one of the most important. The team at Hershey found themselves with a looming deadline for their legacy analytics services and targeted a migration to the brand new Walmart Luminate API. Working in partnership with Advancing Analytics, the Hershey Company leveraged a metadata-driven Lakehouse Architecture to rapidly onboard the new Luminate API, helping the category management teams to overhaul how they measure, predict, and plan their business operations.

In this session, we will discuss the impact Luminate has had on Hershey's business, covering key areas such as sales, supply chain, and retail field execution, along with the technical building blocks that can be used to rapidly provision business users with the data they need, when they need it. We will discuss how key technologies enable this rapid approach, with Databricks Autoloader ingesting and shaping our data, Delta Streaming processing the data through the lakehouse, and Databricks SQL providing a responsive serving layer. The session will include commentary on the business impact as well as the technical journey.

Talk by: Simon Whiteley and Jordan Donmoyer

Self-Service Data Analytics and Governance at Enterprise Scale with Unity Catalog

2023-07-26

This session focuses on one of the first Unity Catalog implementations for a large-scale enterprise. In this scenario, a cloud-scale analytics platform based on the lakehouse approach is used by 7,500 active users, with the potential for 1,500 further users who are subject to special governance rules. They consume more than 600 TB of data stored in Delta Lake, continuously growing at more than 1 TB per day, and this may grow further once local country data is added. The existing data platform must therefore be extended to enable users to combine global data with local data from their countries. A new data management approach was required, one that reflects the strict information security rules on a need-to-know basis. The core requirements: read-only access to global data, write access to local data, and the ability to share the results.

Due to very pronounced information security awareness and a lack of technological capabilities, it had previously been difficult or impossible to analyze and exchange data across disciplines. As a result, a lot of business potential and gains could not be identified or realized.

With the new developments in the technology used and on the basis of the lakehouse approach, thanks to Unity Catalog, we were able to develop a solution that meets high requirements for security and process, and enables globally secured, interdisciplinary data exchange and analysis at scale. This solution enables the democratization of data. The result is not only the ability to gain better insights for business management, but also to generate entirely new business cases or products that require a higher degree of data integration, encouraging the culture to change. We highlight technical challenges and solutions, present best practices, and point out the benefits of implementing Unity Catalog for enterprises.
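
A hedged sketch of how the "read global, write local" pattern might be expressed as Unity Catalog grants; the catalog, schema, and group names are invented for illustration and are not from the talk:

```python
# Illustrative Unity Catalog grants for a "read-only global, writable local"
# setup. Catalog/schema/group names are hypothetical.
# Assumes an active SparkSession `spark` (as provided in Databricks notebooks).

# Country analysts may browse and read the global catalog...
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG global_lakehouse TO `country_de_analysts`")

# ...and may create and modify tables only in their local country schema.
spark.sql("GRANT USE CATALOG, USE SCHEMA ON CATALOG local_de TO `country_de_analysts`")
spark.sql("GRANT CREATE TABLE, MODIFY, SELECT ON SCHEMA local_de.results TO `country_de_analysts`")
```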

Talk by: Artem Meshcheryakov and Pascal van Bellen

Sponsored by: ThoughtSpot | Drive Self-Service Adoption Through the Roof with Embedded Analytics

2023-07-26

When it comes to building stickier apps and products to grow your business, there's no greater opportunity than embedded analytics. Data apps that deliver superior user engagement and business value do analytics differently. They take a user-first approach and know how to deliver real-time, AI-powered insights - not just to internal employees - but to an organization’s customers and partners, as well.

Learn how ThoughtSpot Everywhere is helping companies like Emerald natively integrate analytics with other tools in their modern data stack to deliver a blazing-fast and instantly available analytics experience across all the data their users love. Join this session to learn how you can leverage embedded analytics to drive higher app engagement, get your app to market faster, and create new revenue streams.

Talk by: Krishti Bikal and Vika Smilansky

Using Cisco Spaces Firehose API as a Stream of Data for Real-Time Occupancy Modeling

2023-07-26

Honeywell manages the control of equipment for hundreds of thousands of buildings worldwide. Many of our outcomes relating to energy and comfort rely on knowing where people are in the building at any one time, so we can target health and comfort conditions more precisely to areas that are more densely populated. Many of these buildings have Cisco IT infrastructure in them. Using their Wi-Fi access points and the RSSI signal strength from people’s laptops and phones, Cisco can calculate the number of people in each area of the building. Cisco Spaces offers this data as a real-time streaming source. Honeywell HBT has utilized this stream of data by writing Delta Live Tables pipelines to consume this data source.

Honeywell buildings can now receive this firehose data from hundreds of concurrent customers and provide this occupancy data as a service to our vertical offerings in commercial, health, real estate and education. We will discuss the benefits of using DLT to handle this sort of incoming stream data, and illustrate the pain points we had and the resolutions we undertook in successfully receiving the stream of Cisco data. We will illustrate how our DLT pipeline was designed, and how it scaled to deal with huge quantities of real-time streaming data.
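
A minimal Delta Live Tables sketch of the ingestion pattern described; the source path, schema, and table names are placeholders, and a real pipeline would read from the Cisco Spaces firehose consumer, which is not shown here:

```python
# Hedged DLT sketch: bronze ingest of occupancy events plus a cleaned silver
# table with expectations. Source path, schema, and names are placeholders.
# Runs inside a Delta Live Tables pipeline, where `spark` is provided.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw occupancy events landed from the firehose consumer")
def occupancy_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/cisco_spaces/occupancy")
    )

@dlt.table(comment="Validated occupancy counts per building area")
@dlt.expect_or_drop("valid_count", "people_count >= 0")
@dlt.expect_or_drop("has_area", "area_id IS NOT NULL")
def occupancy_silver():
    return (
        dlt.read_stream("occupancy_bronze")
        .select("building_id", "area_id",
                F.col("people_count").cast("int").alias("people_count"),
                F.to_timestamp("event_time").alias("event_time"))
    )
```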

Talk by: Paul Mracek and Chris Inkpen

Data Democratization with Lakehouse: An Open Banking Application Case

2023-07-26

Banco Bradesco represents one of the largest companies in the financial sector in Latin America. They have more than 99 million customers, 79 years of history, and a legacy of data distributed in hundreds of on-premises systems. With the spread of data-driven approaches and the growth of cloud computing adoption, we needed to innovate and adapt to new trends and enable an analytical environment with democratized data.

We will show how more than eight business departments have already engaged in using the Lakehouse exploratory environment, with more than 190 use cases mapped and a multi-bank financial manager. Unlike with on-premises, the cost of each process can be isolated and managed in near real-time, allowing quick responses to cost and budget deviations, while increasing the deployment speed of new features 36 times compared to on-premises.

The data is now used and shared safely and easily between different areas and companies of the group. Dashboards within Databricks also allow panels to be efficiently "prototyped" with real data, enabling easy interaction between the business area and its real needs before a definitive view is created with all relevant points duly addressed.

Talk by: Pedro Boareto and Fabio Luis Correia da Silva

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Event Driven Real-Time Supply Chain Ecosystem Powered by Lakehouse

2023-07-26

As the backbone of Australia’s supply chain, the Australian Rail Track Corporation (ARTC) plays a vital role in the management and monitoring of goods transportation across 8,500 km of its rail network throughout Australia. ARTC provides weighbridges along its track that read train weights as trains pass at speeds of up to 60 kilometers an hour. This information is highly valuable and is required both by ARTC and its customers to provide accurate haulage weight details, analyze technical equipment, and help ensure wagons have been loaded correctly.

A total of 750 trains run across a network of 8500 km in a day and generate real-time data at approximately 50 sensor platforms. With the help of structured streaming and Delta Lake, ARTC was able to analyze and store:

  • Precise train location
  • Weight of the train in real-time
  • Train crossing times, down to the second
  • Train speed, temperature, sound frequency, and friction
  • Train schedule lookups

Once all the IoT data has been pulled together from an IoT event hub, it is processed in real time using Structured Streaming and stored in Delta Lake. To understand the train’s GPS location, API calls are then made per minute per train from the lakehouse. API calls are also made in real time to another scheduling system to look up customer info. Once the processed and enriched data is stored in Delta Lake, an API layer on top of it exposes this data to all consumers.

The outcomes: increased transparency on weight data, as it is now made available to customers; a digital data ecosystem that ARTC’s customers now use to meet their KPIs and planning needs; and the ability to determine temporary speed restrictions across the network to improve train scheduling accuracy and to schedule network maintenance based on train schedules and speed.

Talk by: Deepak Sekar and Harsh Mishra

Increasing Data Trust: Enabling Data Governance on Databricks Using Unity Catalog & ML-Driven MDM

2023-07-26

As part of Comcast Effectv’s transformation into a completely digital advertising agency, it was key to develop an approach to manage and remediate data quality issues related to customer data so that the sales organization is using reliable data to enable data-driven decision making. Like many organizations, Effectv's customer lifecycle processes are spread across many systems utilizing various integrations between them. This results in key challenges like duplicate and redundant customer data that requires rationalization and remediation. Data is at the core of Effectv’s modernization journey with the intended result of winning more business, accelerating order fulfillment, reducing make-goods and identifying revenue.

In partnership with Slalom Consulting, Comcast Effectv built a traditional lakehouse on Databricks to ingest data from all of these systems, but with a twist: they anchored every engineering decision in how it would enable their data governance program.

In this session, we will touch upon the data transformation journey at Effectv and dive deeper into the implementation of data governance leveraging Databricks solutions such as Delta Lake, Unity Catalog and DB SQL. Key focus areas include how we baked master data management into our pipelines by automating the matching and survivorship process, and bringing it all together for the data consumer via DBSQL to use our certified assets in bronze, silver and gold layers.

By making thoughtful decisions about structuring data in Unity Catalog and baking MDM into ETL pipelines, you can greatly increase the quality, reliability, and adoption of single-source-of-truth data so your business users can stop spending cycles on wrangling data and spend more time developing actionable insights for your business.

Talk by: Maggie Davis and Risha Ravindranath

Real-Time Reporting and Analytics for Construction Data Powered by Delta Lake and DBSQL

2023-07-26

Procore is a construction project management software that helps construction professionals efficiently manage their projects and collaborate with their teams. Our mission is to connect everyone in construction on a global platform.

Procore is the system of record for all construction projects. Our customers need to access the data in near real-time for construction insights. Enhanced reporting is a self-service operational reporting module that allows quick data access with consistency to thousands of tables and reports.

The Procore data platform team rebuilt the module (originally built on a relational database) using Databricks and Delta Lake. We used Apache Spark™ streaming to maintain consistent state on the ingestion side from Kafka, and plan to leverage the full capabilities of DBSQL, using serverless SQL warehouses to read the medallion models (built via dbt) in Delta Lake. In addition, the Unity Catalog and Delta Sharing features helped us share data across regions seamlessly. This design enabled us to improve p95 and p99 read times by xx% (queries that were initially timing out).

Attend this session to hear about the learnings and experience of building a Data Lakehouse architecture.

Talk by: Jay Yang and Hari Rajaram

Best Exploration of Columnar Shuffle Design

2023-07-26

To significantly improve the performance of Spark SQL, there has been a trend over the past several years to offload Spark SQL execution to highly optimized native libraries or accelerators, like Photon from Databricks, NVIDIA's RAPIDS plug-in, and the open source Gluten project initiated by Intel and Kyligence. Thanks to the multi-fold performance improvements from these solutions, more and more Apache Spark™ users have started to adopt the new technology. One characteristic of native libraries is that they all use a columnar data format as the basic data format, because the columnar format has an intrinsic affinity for vectorized data processing using SIMD instructions, while vanilla Spark's shuffle is based on Spark's internal row data format. The high overhead of columnar-to-row and row-to-columnar conversion during the shuffle makes reusing the current shuffle impractical. Given the importance of the shuffle service in Spark, we had to implement an efficient columnar shuffle, which brings a couple of new challenges, such as splitting columnar data and supporting dictionaries during shuffle.

In this session, we will share the exploration process of the columnar shuffle design during our Gazelle and Gluten development, and best practices for implementing a columnar shuffle service. We will also share what we learned from the development of vanilla Spark's shuffle, for example how to address the small-files issue, and then we will propose the new shuffle solution. We will show a performance comparison between the columnar shuffle and vanilla Spark's row-based shuffle. Finally, we will share how the new built-in accelerators in the latest Intel processors, such as QAT and IAA, are used in our columnar shuffle service to boost performance.

Talk by: Binwei Yang and Rong Ma

Best Practices for Running Efficient Apache Spark™ Workloads on Databricks

2023-07-26
Justin Breese (Databricks)

Every day, thousands of customers choose to run business-critical Spark workloads on the Databricks Lakehouse Platform, a platform built by the creators of Apache Spark™. These customers take advantage of platform capabilities such as fully managed compute resources, dynamic autoscaling, an integrated workflow orchestration tool, and Photon, the extremely fast vectorized execution engine. All of these make the Databricks Lakehouse Platform the best place to run Spark workloads, providing operational benefits as well as tremendous price/performance value.

This session, which includes live demos, will cover these and other platform capabilities that can help you build your next optimized Spark application.

Talk by: Justin Breese

Databricks Lakehouse: How BlackBerry is Revolutionizing Cybersecurity Services Worldwide

2023-07-26
Robert Lombardi, Justin Lai (Arctic Wolf)

Cybersecurity incidents are costly, and using an endpoint detection and response (EDR) solution enables the detection of cybersecurity incidents as quickly as possible. Effectively detecting cybersecurity incidents requires collecting millions of data points, and storing and querying endpoint data presents considerable engineering challenges. This includes quickly moving local data from endpoints to a single table in the cloud and enabling performant querying against it.

The need to avoid internal data siloing within BlackBerry was paramount as multiple teams required access to the data to deliver an effective EDR solution for the present and the future. Databricks tooling enabled us to break down our data silos and iteratively improve our EDR pipeline to ingest data faster and reduce querying latency by more than 20% while reducing costs by more than 30%.

In this session, we will share the journey, lessons learned, and the future for collecting, storing, governing, and sharing data from endpoints in Databricks. The result of building EDR using Databricks helped us accelerate the deployment of our data platform.

Talk by: Justin Lai and Robert Lombardi

Databricks SQL: Why the Best Serverless Data Warehouse is a Lakehouse

2023-07-26

Many organizations rely on complex cloud data architectures that create silos between applications, users and data. This fragmentation makes it difficult to access accurate, up-to-date information for analytics, often resulting in the use of outdated data. Enter the lakehouse, a modern data architecture that unifies data, AI, and analytics in a single location.

This session explores why the lakehouse is the best data warehouse, featuring success stories, use cases and best practices from industry experts. You'll discover how to unify and govern business-critical data at scale to build a curated data lake for data warehousing, SQL and BI. Additionally, you'll learn how Databricks SQL can help lower costs and get started in seconds with on-demand, elastic SQL serverless warehouses, and how to empower analytics engineers and analysts to quickly find and share new insights using their preferred BI and SQL tools such as Fivetran, dbt, Tableau, or Power BI.

Talk by: Miranda Luna and Cyrielle Simeone

Data Extraction and Sharing Via The Delta Sharing Protocol

2023-07-26

The Delta Sharing open protocol for secure sharing and distribution of Lakehouse data is designed to reduce friction in getting data to users. Delivering custom data solutions from this protocol further leverages the technical investment committed to your Delta Lake infrastructure. There are key design and computational concepts unique to Delta Sharing to know when undertaking development. And there are pitfalls and hazards to avoid when delivering modern cloud data to traditional data platforms and users.

In this session, we introduce Delta Sharing Protocol development and examine our journey and the lessons learned while creating the Delta Sharing Excel Add-in. We will demonstrate scenarios of overfetching, underfetching, and interpretation of types. We will suggest methods to overcome these development challenges. The session will combine live demonstrations that exercise the Delta Sharing REST protocol with detailed analysis of the responses. The demonstrations will elaborate on optional capabilities of the protocol’s query mechanism, and how they are used and interpreted in real-life scenarios. As a reference baseline for data professionals, the Delta Sharing exercises will be framed relative to SQL counterparts. Specific attention will be paid to how they differ, and how Delta Sharing’s Change Data Feed (CDF) can power next-generation data architectures. The session will conclude with a survey of available integration solutions for getting the most out of your Delta Sharing environment, including frameworks, connectors, and managed services.

Attendees are encouraged to be familiar with REST, JSON, and modern programming concepts. A working knowledge of Delta Lake, the Parquet file format, and the Delta Sharing Protocol is advised.
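
For orientation, here is a recipient-side sketch using the open source delta-sharing Python connector; the share profile file and the share/schema/table coordinates below are placeholders, not material from the session:

```python
# Hedged sketch of consuming a Delta Share as a recipient, using the
# open source `delta-sharing` connector (pip install delta-sharing).
# The profile file and share/schema/table names are placeholders.
import delta_sharing

profile = "/path/to/config.share"          # credentials file from the provider
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())            # discover shared tables

# Load a shared table; the URL format is <profile>#<share>.<schema>.<table>.
table_url = f"{profile}#demo_share.sales.orders"
df = delta_sharing.load_as_pandas(table_url, limit=1000)
print(df.head())

# Change Data Feed reads are also available, e.g. via
# delta_sharing.load_table_changes_as_pandas(table_url, starting_version=0).
```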

Talk by: Roger Dunn

Data Globalization at Conde Nast Using Delta Sharing

2023-07-26

Databricks has been an essential part of the Conde Nast architecture for the last few years. Prior to building our centralized data platform, “evergreen,” we had similar challenges as many other organizations; siloed data, duplicated efforts for engineers, and a lack of collaboration between data teams. These problems led to mistrust in data sets and made it difficult to scale to meet the strategic globalization plan we had for Conde Nast.

Over the last few years we have been extremely successful in building a centralized data platform on Databricks in AWS, fully embracing the lakehouse vision from end-to-end. Now, our analysts and marketers can derive the same insights from one dataset and data scientists can use the same datasets for use cases such as personalization, subscriber propensity models, churn models and on-site recommendations for our iconic brands.

In this session, we’ll discuss how we plan to incorporate Unity Catalog and Delta Sharing as the next phase of our globalization mission. The evergreen platform has become the global standard for data processing and analytics at Conde. In order to manage worldwide data and comply with GDPR requirements, we need to make sure data is processed in the appropriate region and PII data is handled appropriately. At the same time, we need a global view of the data to allow us to make business decisions at the global level. We’ll talk about how Delta Sharing gives us a simple, secure way to share de-identified datasets across regions in order to make these strategic business decisions while complying with security requirements. Additionally, we’ll discuss how Unity Catalog allows us to secure, govern, and audit these datasets in an easy and scalable manner.

Talk by: Zachary Bannor

Embrace First-Party Customer Data for Marketing and Advertising using Data Cleanrooms

2023-07-26
Jordan Peck (Snowplow)

The digital marketing and advertising industry is going through revolutionary change in 2023, with technical, organisational, cultural and regulatory overhaul. As a result, measuring digital advertising effectiveness or coordinating and running highly targeted and effective ad campaigns is becoming more challenging than ever.

First-party customer behavioral data provides organizations with a true competitive advantage and the ability to outperform their peers in the battle for customer attention and brand loyalty.

However, first party customer data is still used sparingly across the digital ad ecosystem, and there are few tools or frameworks to allow advertisers to unlock the value in what first party data they have.

This session will show you how Snowplow allows organizations to deeply understand their users' behavior and intent by creating the best quality behavioral data. It will also explain that when this is combined with the Databricks Lakehouse and data clean rooms, brands can now unlock insights that were previously unachievable, and activate their first party customer behavioral data into highly effective, personalized and creative ad campaigns.

In this session you will learn:

  • Why first party data can be the ultimate in competitive advantage for digital advertisers
  • How data clean rooms combined with Snowplow behavioral data enable better insights and more impactful ad targeting
  • What specific marketing and advertising use cases are possible when utilizing a data clean room on top of the Databricks Lakehouse

Talk by: Jordan Peck

Embracing the Future of Data Engineering: The Serverless, Real-Time Lakehouse in Action

2023-07-26
Frank Munz (Databricks)

As we venture into the future of data engineering, streaming and serverless technologies take center stage. In this fun, hands-on, in-depth and interactive session you can learn about the essence of future data engineering today.

We will tackle the challenge of processing streaming events continuously created by hundreds of sensors in the conference room from a serverless web app (bring your phone and be a part of the demo). The focus is on the system architecture, the involved products and the solution they provide. Which Databricks product, capability and settings will be most useful for our scenario? What does streaming really mean and why does it make our life easier? What are the exact benefits of serverless and how "serverless" is a particular solution?

Leveraging the power of the Databricks Lakehouse Platform, I will demonstrate how to create a streaming data pipeline with Delta Live Tables ingesting data from AWS Kinesis. Further, I’ll utilize advanced Databricks Workflows triggers for efficient orchestration and real-time alerts feeding into a real-time dashboard. And since I don’t want you to leave empty-handed, I will use Delta Sharing to share the results of the demo we built with every participant in the room. Join me in this hands-on exploration of cutting-edge data engineering techniques and witness the future in action.

Talk by: Frank Munz

Lineage System Table in Unity Catalog

2023-07-26

Unity Catalog provides fully automated data lineage for all workloads in SQL, R, Python, and Scala, and across all asset types at Databricks. The aggregated view has been available to end users through the data explorer and API. In this session, we are excited to share that lineage is now also available as a Delta table in the UC metastore. It stores the full history of recent lineage records, is near real-time, and can be queried through a standard SQL interface. With that, customers can gain significant operational insights about their workloads for impact analysis, troubleshooting, quality assurance, data discovery, and data governance.

Together with the system tables platform effort, which provides query history, job run operational data, audit logs, and more, the lineage table will be a critical piece in linking all data assets and entities together, providing better lakehouse observability and unification for customers.
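
As of current Databricks documentation, table-level lineage surfaces in the system tables; a hedged example query is shown below, where the filtered table name is an illustrative placeholder:

```python
# Hedged sketch: querying table-level lineage from the Unity Catalog system
# tables (system.access.table_lineage per current Databricks docs).
# Assumes an active SparkSession `spark`, e.g. in a Databricks notebook.
lineage = spark.sql("""
    SELECT source_table_full_name,
           target_table_full_name,
           entity_type,
           event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.sales.orders_gold'
    ORDER BY event_time DESC
    LIMIT 100
""")
lineage.show(truncate=False)
```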

Talk by: Menglei Sun
