talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

119

Filtering by: Delta

Sessions & talks

Showing 51–75 of 119 · Newest first

Apache Spark™ Streaming and Delta Live Tables Accelerates KPMG Clients For Real Time IoT Insights

2023-07-26 · Watch video

Unplanned downtime in manufacturing costs firms up to a trillion dollars annually. Time that materials spend sitting on a production line is lost revenue. Even just 15 hours of downtime a week adds up to over 800 hours of downtime yearly. Internet of Things (IoT) devices can cut this time down by providing detailed machine metrics. However, IoT predictive maintenance is challenged by the lack of effective, scalable infrastructure and machine learning solutions. IoT data can reach multiple terabytes per day and can come in a variety of formats. Furthermore, without insights and analysis, this data becomes just another table.

The KPMG Databricks IoT Accelerator is a comprehensive solution that gives manufacturing plant operators a bird’s-eye view of their machines’ health and empowers proactive machine maintenance across their portfolio of IoT devices. The accelerator ingests IoT streaming data at scale and implements the Databricks medallion architecture, leveraging Delta Live Tables to clean and process data. Real-time machine learning models are developed from IoT machine measurements and managed in MLflow. The AI predictions and IoT device readings are compiled in the gold table, powering downstream dashboards such as Tableau. Dashboards inform machine operators not only of machines’ ailments, but also of actions they can take to mitigate issues before they arise. Operators can see fault history to aid in understanding failure trends, and can filter dashboards by fault type, machine, or specific sensor reading.
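
The bronze-to-silver-to-gold flow described above can be sketched in miniature. This is a plain-Python stand-in, not the accelerator’s code: all field names and values are invented, and a real pipeline would express these steps as Delta Live Tables queries over streaming data.

```python
# Hypothetical sketch of the medallion flow described above, using plain
# Python stand-ins for streaming tables. Field names are invented.
raw_events = [  # "bronze": raw IoT readings, some malformed
    {"machine_id": "m1", "temp_c": 71.0, "ts": "2023-06-01T00:00:00"},
    {"machine_id": "m1", "temp_c": None, "ts": "2023-06-01T00:01:00"},
    {"machine_id": "m2", "temp_c": 88.5, "ts": "2023-06-01T00:00:30"},
]

# "silver": drop records that fail a basic quality expectation
silver = [e for e in raw_events if e["temp_c"] is not None]

# "gold": per-machine aggregates that a dashboard would read
gold = {}
for e in silver:
    agg = gold.setdefault(e["machine_id"], {"n": 0, "max_temp_c": float("-inf")})
    agg["n"] += 1
    agg["max_temp_c"] = max(agg["max_temp_c"], e["temp_c"])

print(gold)  # {'m1': {'n': 1, 'max_temp_c': 71.0}, 'm2': {'n': 1, 'max_temp_c': 88.5}}
```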

Talk by: MacGregor Winegard

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Sponsored: KPMG | Multicloud Enterprise Delta Sharing & Governance using Unity Catalog @ S&P Global

2023-07-26 · Watch video

Cloud technologies have revolutionized global data access across a number of industries. However, many enterprise organizations face challenges in adopting these technologies effectively, as comprehensive cloud data governance strategies and solutions are complex and evolving – particularly in hybrid or multicloud scenarios involving multiple third parties. KPMG and S&P Global have harnessed the power of Databricks Lakehouse to create a novel approach.

By integrating Unity Catalog, Delta Sharing, and the KPMG Modern Data Platform, S&P Global has enabled scalable, transformative cross-enterprise data sharing and governance. This demonstration highlights a collaboration between the S&P Global Sustainable1 (S1) ESG program and the KPMG ESG Analytics Accelerators to enable large-scale SFDR ESG portfolio analytics. Join us to discover our solution that drives transformative change, fosters data-driven decision-making, and bolsters sustainability efforts in a wide range of industries.

Talk by: Niels Hanson and Dennis Tally

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Vector Data Lakes

Vector databases such as Elasticsearch and Pinecone offer fast ingestion and querying on vector embeddings with approximate nearest neighbor (ANN) search. However, they typically do not decouple compute and storage, making them hard to integrate into production data stacks. Because data storage in these databases is expensive and not easily accessible, data teams typically maintain ETL pipelines to offload historical embedding data to blob stores. When that data needs to be queried, it is loaded back into the vector database in another ETL process. This is reminiscent of loading data from an OLTP database to cloud storage, then loading said data into an OLAP warehouse for offline analytics.

Recently, “lakehouse” offerings allow direct OLAP querying on cloud storage, removing the need for the second ETL step. The same could be done for embedding data. While embedding storage in blob stores cannot satisfy the high TPS requirements in online settings, we argue it’s sufficient for offline analytics use cases like slicing and dicing data based on embedding clusters. Instead of loading the embedding data back into the vector database for offline analytics, we propose direct processing on embeddings stored in Parquet files in Delta Lake. You will see that offline embedding workloads typically touch a large portion of the stored embeddings without the need for random access.

As a result, the workload is entirely bound by network throughput instead of latency, making it quite suitable for blob storage backends. On a test one billion vector dataset, ETL into cloud storage takes around one hour on a dedicated GPU instance, while batched nearest neighbor search can be done in under one minute with four CPU instances. We believe future “lakehouses” will ship with native support for these embedding workloads.
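
The batched nearest-neighbor workload the speakers describe can be sketched with a brute-force scan. This is an illustrative NumPy sketch, not the speakers’ code: it reads every stored embedding in sequential batches (as one would stream Parquet files from cloud storage) and maintains a running top-k, which is why the workload is throughput-bound rather than latency-bound. All shapes and names are invented.

```python
import numpy as np

# Illustrative sketch: offline batched nearest-neighbor search that scans
# every stored embedding in sequential batches, as one would over Parquet
# files pulled from a Delta table. All shapes/names here are invented.
rng = np.random.default_rng(0)
stored = rng.standard_normal((10_000, 64)).astype(np.float32)   # "lakehouse" embeddings
queries = rng.standard_normal((8, 64)).astype(np.float32)

def batched_topk(queries, stored, k=5, batch=4_096):
    """Brute-force cosine top-k, scanning `stored` in sequential batches."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    best_scores = np.full((len(q), k), -np.inf, dtype=np.float32)
    best_ids = np.zeros((len(q), k), dtype=np.int64)
    for start in range(0, len(stored), batch):
        chunk = stored[start:start + batch]
        chunk = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
        scores = q @ chunk.T                      # (n_queries, batch)
        # merge this batch's candidates with the running top-k
        all_scores = np.concatenate([best_scores, scores], axis=1)
        all_ids = np.concatenate(
            [best_ids, np.tile(np.arange(start, start + len(chunk)), (len(q), 1))], axis=1)
        order = np.argsort(-all_scores, axis=1)[:, :k]
        best_scores = np.take_along_axis(all_scores, order, axis=1)
        best_ids = np.take_along_axis(all_ids, order, axis=1)
    return best_ids, best_scores

ids, scores = batched_topk(queries, stored)
```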

Talk by: Tony Wang and Chang She

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic Quadrant™ CDBMS: https://dbricks.co/3phw20d

Delta Lake AMA

2023-07-26 · Watch video
Robert Pack (Databricks), Bart Samwel (Databricks), Allison Portis, Tathagata Das (Databricks)

Have questions about Delta Lake? Come by and ask the experts!

Talk by: Bart Samwel, Tathagata Das, Robert Pack, and Allison Portis

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

2023-07-26 · Watch video

Using Databricks, we built a “Unified Talent Solution” backed by a robust data and AI engine for analyzing the skills of a combined pool of permanent employees, contractors, part-time employees, and vendors. The engine infers skill gaps, future trends, and recommended priority areas to bridge talent gaps, which ultimately improved the operational efficiency, transparency, commercial model, and talent experience of our client. We leveraged a variety of ML algorithms, such as boosting, neural networks, and NLP transformers, to provide better AI-driven insights.

One inevitable part of developing these models within a typical data science workflow is iteration. MLflow, Databricks' end-to-end ML workflow service, helped streamline this process by organizing runs into experiments that tracked the data used for training and testing, model artifacts, lineage, and the corresponding results and metrics. MLflow's deployment and monitoring services were leveraged extensively to check the health of our models using drift detection, bias, and explainability techniques.
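
As a flavor of the drift detection mentioned above, here is a small self-contained sketch of the population stability index (PSI), a common drift statistic. This is illustrative only, not the speakers’ implementation; the bin fractions are invented.

```python
import math

# Illustrative sketch: population stability index (PSI), a common
# drift-detection statistic, computed over pre-binned fractions.
def psi(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as fractions summing to 1."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

train_bins = [0.25, 0.25, 0.25, 0.25]     # feature distribution at training time
live_bins  = [0.40, 0.30, 0.20, 0.10]     # distribution observed in production

score = psi(train_bins, live_bins)
# rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift
print(round(score, 3))
```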

Our solution, built on the Databricks platform, simplified ML by defining a data-centric workflow that unified best practices from DevOps, DataOps, and ModelOps. Databricks Feature Store allowed us to productionize our models and features jointly. Insights were delivered through visually appealing charts and graphs, built with Power BI, Plotly, and Matplotlib, that answer the business questions most relevant to clients. We also built our own advanced custom analytics platform on top of Delta Lake, as Delta’s ACID guarantees allow us to build a real-time reporting app that displays consistent and reliable data: React on the front end, and Structured Streaming ingesting data from Delta tables, with live query analytics on real-time ML predictions.

Talk by: Nitu Nivedita

Sponsored by: Avanade | Enabling Real-Time Analytics with Structured Streaming and Delta Live Tables

2023-07-26 · Watch video

Join the panel to hear how Avanade is helping clients enable real-time analytics and tackle the people and process problems that accompany technology, powered by Azure Databricks.

Talk by: Thomas Kim, Dael Williamson, and Zoé Durand

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Unity Catalog, Delta Sharing and Data Mesh on Databricks Lakehouse

2023-07-25 · Watch video

In this technical deep dive, we will detail how customers implemented data mesh on Databricks and how standardizing on the Delta format enabled Delta-to-Delta sharing with non-Databricks consumers.

  • Current state of the IT landscape
  • Data silos (problems with organizations not having connected data in the ecosystem)
  • A look back at why we moved away from data warehouses and chose the cloud in the first place
  • What caused the data chaos in the cloud (instrumentation and too much stitching together), e.g. the periodic-table-sized list of cloud services
  • How to strike the balance between autonomy and centralization
  • Why Databricks Unity Catalog puts you on the right path to implementing a data mesh strategy
  • The processes and features that enable an end-to-end implementation of a data strategy
  • How customers successfully implemented data mesh with out-of-the-box Unity Catalog and Delta Sharing without overwhelming their IT tool stack
  • Use cases
  • Delta-to-Delta data sharing
  • Delta-to-others data sharing
  • How to navigate when data is available across regions, across clouds, on-prem, and in external systems
  • Change data feed to share only “data that has changed”
  • Data stewardship
  • Why ABAC is important
  • How file-based access policies and governance play an important role
  • Future state and its pitfalls
  • Egress costs
  • Data compliance

Talk by: Surya Turaga and Thomas Roach

What’s New with Data Sharing and Collaboration on the Lakehouse: From Delta Sharing to Clean Rooms

2023-07-25 · Watch video

Get ready to accelerate your data and AI collaboration game with the Databricks product team. Join us as we build the next generation of secure data collaboration capabilities on the lakehouse. Whether you're just starting your data sharing journey or exploring advanced data collaboration features like data cleanrooms, this session is tailor-made for you.

In this demo-packed session, you’ll discover what’s new in Delta Sharing, including dynamic and materialized views for sharing, sharing of other assets such as notebooks and ML models, new Delta Sharing open source connectors for the tools of your choice, and updates to Databricks Clean Rooms. Learn how the lakehouse is the perfect solution for your data and AI collaboration requirements across clouds, regions, and platforms, without any vendor lock-in. Plus, you’ll get a peek into our upcoming roadmap. Ask any burning questions you have for our expert product team as they build a collaborative lakehouse for data, analytics and AI.

Talk by: Erika Ehrli, Kelly Albano, and Xiaotong Sun

Taking Control of Streaming Healthcare Data

2023-07-25 · Watch video

Chesapeake Regional Information System for our Patients (CRISP), a nonprofit healthcare information exchange (HIE), initially partnered with Slalom to build a Databricks data lakehouse architecture in response to the analytics demands of the COVID-19 pandemic. Since then, they have expanded the platform to additional use cases, and recently they have worked together to engineer streaming data pipelines that process healthcare messages, such as HL7, to help CRISP become vendor independent.

This session will focus on the improvements CRISP has made to their data lakehouse platform to support streaming use cases and the impact these changes have had on the organization. We will touch on using Databricks Auto Loader to efficiently ingest incoming files, ensuring data quality with Delta Live Tables, and sharing data internally with a SQL warehouse, as well as some of the work CRISP has done to parse and standardize HL7 messages from hundreds of sources. These efforts have allowed CRISP to stream over 4 million messages daily in near real time, with the scalability needed to keep onboarding new healthcare providers and continue to facilitate care and improve health outcomes.
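
As a miniature illustration of the HL7 standardization work mentioned above (not CRISP’s actual pipeline): an HL7 v2 message is pipe-delimited text that can be split into named segments and fields. The sample message below is invented.

```python
# Illustrative sketch: parsing a pipe-delimited HL7 v2 message into
# segments and fields, the kind of standardization step described above.
# The sample message is invented.
RAW_HL7 = "\r".join([
    "MSH|^~\\&|SENDING_APP|SENDING_FAC|RECEIVING_APP|RECEIVING_FAC|20230725||ADT^A01|MSG0001|P|2.5",
    "PID|1||123456^^^HOSP||DOE^JANE||19800101|F",
    "PV1|1|I",
])

def parse_hl7(raw: str) -> dict:
    """Group segment fields by segment name (MSH, PID, ...)."""
    segments = {}
    for line in raw.split("\r"):
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields[1:])
    return segments

msg = parse_hl7(RAW_HL7)
patient_name = msg["PID"][0][4]   # field PID-5 in this simplified layout
```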

Talk by: Andy Hanks and Chris Mantz

Databricks As Code: Effectively Automate a Secure Lakehouse Using Terraform for Resource Provisioning

2023-07-25 · Watch video

At Rivian, we have automated more than 95% of our Databricks resource provisioning workflows using an in-house Terraform module, affording us a lean admin team to manage over 750 users. In this session, we will cover the following elements of our approach and how others can benefit from improved team efficiency.

  • User and service principal management
  • Our permission model on Unity Catalog for data governance
  • Workspace and secrets resource management
  • Managing internal package dependencies using init scripts
  • Facilitating dashboards, SQL queries and their associated permissions
  • Scaling ingestion jobs and workflows for source-of-truth, petabyte-scale Delta Lake tables

Talk by: Jason Shiverick and Vadivel Selvaraj

MLOps at Gucci: From Zero to Hero

2023-07-25 · Watch video

Delta Lake is an open-source storage format well suited to storing large-scale datasets for both single-node and distributed training of deep learning models. The format gives deep learning practitioners unique data management capabilities for working with their datasets. The challenge is that, as of now, it’s not possible to train PyTorch models directly from Delta Lake.

The PyTorch community has recently introduced the torchdata library for efficient data loading. It supports many formats out of the box, but not Delta Lake. This talk will demonstrate single-node and distributed PyTorch training from the Delta Lake storage format using the torchdata framework and the standalone delta-rs implementation of Delta Lake.
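
The loading pattern at the heart of this approach can be sketched independently of the libraries: a reader yields sequential chunks (row groups) from storage, and a pipe re-chunks that stream into training batches. The pure-Python sketch below only illustrates that pattern; the talk’s real implementation uses delta-rs to read the Delta table and torchdata to build the pipeline, and all names here are invented.

```python
from itertools import islice

# Conceptual sketch only: the batching pattern an iterable data pipe applies
# over chunks read sequentially from storage (e.g. delta-rs yielding Parquet
# row groups into torchdata). Pure Python so it runs standalone.
def read_row_groups():
    """Stand-in for a reader yielding chunks of rows from a Delta table."""
    table = [{"x": i, "y": i % 2} for i in range(10)]
    for start in range(0, len(table), 4):       # pretend row groups of 4 rows
        yield table[start:start + 4]

def batches(row_groups, batch_size):
    """Re-chunk a stream of row groups into training batches of batch_size."""
    def rows():
        for group in row_groups:
            yield from group
    it = rows()
    while batch := list(islice(it, batch_size)):
        yield batch

sizes = [len(b) for b in batches(read_row_groups(), batch_size=3)]
```

Note that batch boundaries need not align with row-group boundaries, which is exactly what makes sequential, throughput-oriented reads possible.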

Talk by: Michael Shtelma

How to Build Metadata-Driven Data Pipelines with Delta Live Tables

2023-07-25 · Watch video

In this session, you will learn how you can use metaprogramming to automate the creation and management of Delta Live Tables pipelines at scale. The goal is to make it easy to use DLT for large-scale migrations, and other use cases that require ingesting and managing hundreds or thousands of tables, using generic code components and configuration-driven pipelines that can be dynamically reused across different projects or datasets.
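
The metaprogramming pattern can be sketched as follows. The real code would use the `dlt` module, which is only importable inside a Databricks pipeline, so this illustrative sketch simulates the decorator with a plain registry; the point is generating one table definition per config entry, binding each loop variable via a default argument. All config values are invented.

```python
# Sketch of the metaprogramming pattern described above. A plain registry
# stands in for dlt's pipeline graph so the pattern runs anywhere.
TABLES = {}  # stand-in for the pipeline's table graph

def table(name):
    """Stand-in for a @dlt.table(name=...) style decorator."""
    def register(fn):
        TABLES[name] = fn
        return fn
    return register

config = [
    {"name": "orders_bronze",    "source": "/raw/orders"},
    {"name": "customers_bronze", "source": "/raw/customers"},
]

def generate_tables(entries):
    for entry in entries:
        # bind entry via a default arg so each generated function keeps its own config
        @table(name=entry["name"])
        def build(entry=entry):
            return f"read {entry['source']}"   # real code: a streaming read of the source

generate_tables(config)
```

Adding a table then becomes a configuration change, not a code change, which is what makes the approach workable for hundreds or thousands of tables.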

Talk by: Mojgan Mazouchi and Ravi Gawai

Building & Managing a Data Platform for a Delta Lake Exceeding 13PB & 1000s of Users: AT&T's Story

2023-07-25 · Watch video

Data runs AT&T’s business, just like it runs most businesses these days. Data can lead to a greater understanding of a business and, when translated correctly into information, can provide human and business systems valuable insights to make better decisions. Unique to AT&T is the volume of data we support, how much of our work is driven by AI, and the scale at which data and AI drive value for our customers and stakeholders.

Our cloud migration journey includes making data and AI more accessible to employees throughout AT&T so they can use their deep business expertise to leverage data more easily and rapidly. We always had to balance this data democratization and desire for speed with keeping our data private and secure. We loved the open ecosystem model of the lakehouse, which enables data, BI, and ML tools to be seamlessly integrated in a single environment; it simplifies the architecture and reduces dependencies between technologies in the cloud. Being clear in our architecture guidelines and patterns was very important to our success.

We are seeing more interest from our business unit partners and continuing to build out AI as a service to support more citizen data scientists. To scale up our lakehouse journey, we built a Databricks center of excellence (CoE) function at AT&T, which today has approximately 1,400+ active members, further concentrating existing expertise and resources in the ML/AI discipline to collaborate on all things Databricks, including technical support, training, FAQs, and best practices, to attain and sustain world-class performance and drive business value for AT&T. Join us to learn more about how we process and manage over 10 petabytes in our network lakehouse with Delta Lake and Databricks.

Lakehouse Federation: Access and Governance of External Data Sources from Unity Catalog

2023-07-25 · Watch video
Can Efeoglu (Databricks), Todd Greenstein (Databricks)

Are you tired of spending time and money moving data across multiple sources and platforms to access the right data at the right time? Join our session and discover Databricks’ new Lakehouse Federation capability, which allows you to access, query, and govern your data in place without leaving the lakehouse. Our experts will demonstrate how you can leverage the latest enhancements in Unity Catalog, including query federation, the Hive interface, and Delta Sharing, to discover and govern all your data in one place, regardless of where it lives.

Talk by: Can Efeoglu and Todd Greenstein

Using DMS and DLT for Change Data Capture

2023-07-25 · Watch video

Bringing Relational Data Store (RDS) data into your data lake is a critical process that facilitates many use cases. By leveraging AWS Database Migration Service (DMS) and Databricks Delta Live Tables (DLT), we can simplify change data capture from your RDS. In this talk, we will break down this complex process by discussing the fundamentals and best practices, followed by a demo that brings it all together.
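
At its core, applying a change feed means replaying keyed insert/update/delete events in sequence order, which is the logic DLT automates on top of DMS output. The sketch below simulates that replay in plain Python; the event shape and values are invented.

```python
# Illustrative sketch: applying a DMS-style change feed to a keyed table.
# Event shape and values are invented.
changes = [
    {"op": "insert", "id": 1, "name": "Ada",    "seq": 1},
    {"op": "insert", "id": 2, "name": "Bob",    "seq": 2},
    {"op": "update", "id": 1, "name": "Ada L.", "seq": 3},
    {"op": "delete", "id": 2, "name": None,     "seq": 4},
]

def apply_changes(target: dict, events: list) -> dict:
    """Apply insert/update/delete events in sequence order (latest wins)."""
    for e in sorted(events, key=lambda e: e["seq"]):
        if e["op"] == "delete":
            target.pop(e["id"], None)
        else:
            target[e["id"]] = {"name": e["name"]}
    return target

state = apply_changes({}, changes)
```

Ordering by a sequence column is what keeps the result correct when events arrive out of order, which is why CDC pipelines carry one.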

Talk by: Neil Patel and Ganesh Chand

Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks

2023-07-25 · Watch video

In this session, we will introduce Databricks Asset Bundles, demonstrate how they work for a variety of data products, and show how to fit them into an overall CI/CD strategy for the well-architected lakehouse.

Data teams produce a variety of assets: datasets, reports and dashboards, ML models, and business applications. These assets depend upon code (notebooks, repos, queries, pipelines), infrastructure (clusters, SQL warehouses, serverless endpoints), and supporting services and resources like Unity Catalog, Databricks Workflows, and DBSQL dashboards. Today, each organization must figure out a deployment strategy for the variety of data products they build on Databricks, as there is no consistent way to describe the infrastructure and services associated with project code.

Databricks Asset Bundles is a new capability on Databricks that standardizes and unifies the deployment strategy for all data products developed on the platform. It allows developers to describe the infrastructure and resources of their project through a YAML configuration file, regardless of whether they are producing a report, dashboard, online ML model, or Delta Live Tables pipeline. Behind the scenes, these configuration files use Terraform to manage resources in a Databricks workspace, but knowledge of Terraform is not required to use Databricks Asset Bundles.
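
As a sketch of the configuration-driven approach described above, a bundle definition might look roughly like the following. This is illustrative only: the bundle name, job, paths, and host are invented, and the exact schema is defined by the Databricks CLI, which may have evolved since this talk.

```yaml
# Illustrative databricks.yml sketch; names, paths, and host are invented.
bundle:
  name: my_data_product

resources:
  jobs:
    nightly_refresh:
      name: nightly_refresh
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ./src/transform.py

targets:
  dev:
    default: true
  prod:
    workspace:
      host: https://example.cloud.databricks.com
```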

Talk by: Rafi Kurlansik and Pieter Noordhuis

Introduction to Data Engineering on the Lakehouse

2023-07-25 · Watch video

Data engineering is a requirement for any data, analytics or AI workload. With the increased complexity of data pipelines, the need to handle real-time streaming data and the challenges of orchestrating reliable pipelines, data engineers require the best tools to help them achieve their goals. The Databricks Lakehouse Platform offers a unified platform to ingest, transform and orchestrate data and simplifies the task of building reliable ETL pipelines.

This session will provide an introductory overview of the end-to-end data engineering capabilities of the platform, including Delta Live Tables and Databricks Workflows. We’ll see how these capabilities come together to provide a complete data engineering solution, and how they are used in the real world by organizations leveraging the lakehouse to turn raw data into insights.

Talk by: Jibreal Hamenoo and Ori Zohar

Delta Live Tables A to Z: Best Practices for Modern Data Pipelines

2023-07-25 · Watch video
Michael Armbrust (Databricks)

Join Databricks’ Distinguished Principal Engineer Michael Armbrust for a technical deep dive into how Delta Live Tables (DLT) reduces the complexity of data transformation and ETL. Learn what’s new, what’s coming, and how to easily master the ins and outs of DLT.

Michael will describe and demonstrate:

  • What’s new in Delta Live Tables (DLT) - Enzyme, Enhanced Autoscaling, and more
  • How to easily create and maintain your DLT pipelines
  • How to monitor pipeline operations
  • How to optimize data for analytics and ML
  • Sneak Peek into the DLT roadmap

Talk by: Michael Armbrust

Live from the Lakehouse: Lakehouse observability, and Delta Lake. With Michael Milirud and Denny Lee

2023-07-14 · Watch video
Denny Lee (Databricks), Michael Milirud (Databricks)

Hear from two guests. First, Michael Milirud (Sr Manager, Product Management, Databricks) on Lakehouse monitoring and observability. Second guest, Denny Lee (Sr Staff Developer Advocate, Databricks), discusses Delta Lake. Hosted by Holly Smith (Sr Resident Solutions Architect, Databricks) and Jimmy Obeyeni (Strategic Account Executive, Databricks).

Live from the Lakehouse: Machine Learning, LLM, Delta Lake, and data engineering

2023-07-14 · Watch video
Jason Pohl (Databricks), Caryl Yuhas (Databricks)

Hear from two guests. First, Caryl Yuhas (Global Practice Lead, Solutions Architect, Databricks) on Machine Learning & LLMs. Second guest, Jason Pohl (Sr. Director, Field Engineering), discusses Delta Lake and data engineering. Hosted by Holly Smith (Sr Resident Solutions Architect, Databricks) and Jimmy Obeyeni (Strategic Account Executive, Databricks).

Cutting the Edge in Fighting Cybercrime: Reverse-Engineering a Search Language to Cross-Compile

2022-07-22 · Watch video

Traditional cybersecurity Security Information and Event Management (SIEM) approaches do not scale well for data sources producing 30 TiB per day, which led HSBC to create a cybersecurity lakehouse with Delta and Spark. The platform overcomes several conventional technical constraints: traditional platforms limit the amount of data available for long-term analytics, and their query languages are difficult to scale and time-consuming to run. In this talk, we’ll learn how to implement (or actually reverse-engineer) a search language with Scala and translate it into what Apache Spark understands: the Catalyst engine. We’ll guide you through the technical journey of building equivalents of a query language in Spark, and learn how HSBC’s business benefited from this cutting-edge innovation: decreasing time and resources for cyber data processing migration, improving cyber threat incident response, and fast onboarding of HSBC cyber analysts onto Spark with the cybersecurity lakehouse platform.
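
The cross-compilation idea can be sketched in miniature: take a term-based search query and emit an equivalent SQL predicate. The talk’s implementation is in Scala against Spark’s Catalyst engine; the toy grammar and Python below are purely illustrative.

```python
import re

# Toy sketch of the idea: translate a tiny SIEM-style search language into
# a SQL predicate. The grammar below is invented for illustration.
def compile_search(query: str) -> str:
    """Compile `field=value field2=value2 ...` into an AND-ed SQL predicate."""
    clauses = []
    for term in query.split():
        m = re.fullmatch(r"(\w+)=(\S+)", term)
        if not m:
            raise ValueError(f"unsupported term: {term}")
        field, value = m.groups()
        clauses.append(f"{field} = '{value}'")
    return " AND ".join(clauses)

where = compile_search("src_ip=10.0.0.5 action=blocked")
```

A production cross-compiler would instead build expression trees the engine understands (Catalyst expressions, in the talk’s case) rather than SQL strings, but the translation step is the same in spirit.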

Connecting the Dots with DataHub: Lakehouse and Beyond

2022-07-19 · Watch video

You’ve successfully built your data lakehouse. Congratulations! But what happens when your operational data stores, streaming systems like Apache Kafka, or data ingestion systems produce bad data into the lakehouse? Can you be proactive when it comes to preventing bad data from affecting your business? How can you take advantage of automation to ensure that raw data assets become well-maintained data products (clear ownership, documentation, and sensitivity classification) without requiring people to do redundant work across operational, ingestion, and lakehouse systems? How do you get live and historical visibility into your entire data ecosystem (schemas, pipelines, data lineage, models, features, and dashboards) within and across your production services, ingestion pipelines, and data lakehouse? Data engineers struggle with data quality and data governance issues constantly interrupting their day and limiting their upside impact on the business.

In this talk, we will share how data engineers from our 3K+ strong DataHub community are using DataHub to track lineage, understand data quality, and prevent failures from impacting their important dashboards, ML models and features. The talk will include details of how DataHub extracts lineage automatically from Spark, schema and statistics from Delta Lake and shift-left strategies for developer-led governance.

Open Source Powers the Modern Data Stack

2022-07-19 · Watch video

Lakehouses like Databricks’ Delta Lake are becoming the central brain for all data systems. But lakehouses are only one component of the data stack. There are many building blocks required for tackling data needs, including data integration, data transformation, data quality, observability, orchestration, etc.

In this session, we will present how open source powers companies' approach to building a modern data stack. We will talk about technologies like Airbyte, Airflow, dbt, Preset, and how to connect them in order to build a customized and extensible data platform centered around Databricks.

Powering Up the Business with a Lakehouse

2022-07-19 · Watch video

At Wehkamp, we required a uniform way to provide reliable and on-time data to the business while making this access compliant with GDPR. Unlocking all the data sources scattered across the company and democratizing data access was of the utmost importance, allowing us to empower the business with more, better, and faster data.

Focusing on open source technologies, we’ve built a data platform almost from the ground up around three levels of data curation (bronze, silver, and gold), following the lakehouse architecture. Ingestion into bronze is where PII fields are pseudonymized, making use of the data within the Delta lake compliant; since no user data is visible, everyone can use the entire Delta lake for exploration and new use cases. Naturally, specific teams are allowed to see the user data necessary for their use cases. Beyond the standard architecture, we’ve developed a library that lets us ingest new data sources by adding a JSON config file describing their characteristics. This, combined with the ACID transactions that Delta provides and the efficient Structured Streaming provided through Auto Loader, has allowed a small team to maintain 100+ streams with negligible downtime.
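
The pseudonymization step described above can be sketched with a keyed hash: replacing each PII value with a stable HMAC digest keeps records joinable without exposing raw values. This is an illustrative sketch, not Wehkamp’s library; the key handling and field list are invented.

```python
import hashlib
import hmac

# Illustrative sketch: keyed pseudonymization of PII fields on bronze
# ingestion, so downstream layers never see raw values. Key and field
# list are invented.
SECRET_KEY = b"rotate-me-and-keep-out-of-source-control"
PII_FIELDS = {"email", "customer_name"}

def pseudonymize(record: dict) -> dict:
    """Replace PII field values with a stable keyed hash (HMAC-SHA256)."""
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()
        else:
            out[field] = value
    return out

row = {"order_id": 42, "email": "jane@example.com", "amount": 19.95}
safe = pseudonymize(row)
```

Because the hash is deterministic under a fixed key, the same customer maps to the same pseudonym across tables, preserving joins and aggregates.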

Some other components of this platform include:

  • Alerting to Slack
  • Data quality checks
  • CI/CD
  • Stream processing with the Delta engine

The feedback so far has been encouraging, as more and more teams across the company are starting to use the new platform and taking advantage of all its perks. It is still a long time until we get to turn off some of the components of the old data platform, but it has come a long way.

Turning Fan Data Into an Asset

2022-07-19 · Watch video

The sports industry is evolving, with organizations investing heavily into their tech stack. Simply connecting data, however, is not enough. Unexpected insights can come from anyone — but only if they have tools they can easily use.

Enter Pumpjack Dataworks, the world’s leading fan data refinery. This platform enables clubs, leagues, and sponsors to discover digital revenue channels by joining consumer data while assuring its protection. Powered by Databricks’ Delta Sharing protocol and governed by Immuta’s data access platform, this solution democratizes fan data, making it immediately accessible.

Join us to learn how automating access control provides scalable, rule-driven assurance that data is properly managed and analyzed.

You’ll discover:

  • How Pumpjack Dataworks uses attribute-based access control to meet privacy regulations and data sharing requirements
  • How Databricks and Immuta’s partnership enables robust governance
  • Why ABAC is critical for scale within modern data stacks
