talk-data.com

Topic: Databricks

Tags: big_data, analytics, spark

561 tagged activities

Activity Trend: peak of 515 activities per quarter (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: Databricks DATA + AI Summit 2023
DataSecOps and Unity Catalog: High Leverage Governance at Scale

Learn how to apply DataSecOps patterns powered by Terraform to Unity Catalog to scale your governance efforts and support your organizational data usage.
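
The talk drives this through Terraform; as a rough, hedged illustration of the kind of policy such automation codifies, the equivalent grants can be issued as Unity Catalog SQL from a notebook. The catalog, schema, and group names below are invented for illustration.

```python
# Hypothetical sketch of the grants a DataSecOps pipeline (e.g. Terraform-driven,
# as in the talk) would manage. Catalog, schema, and group names are made up.
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.reporting")

# Grant coarse, role-based access to groups rather than to individual users
spark.sql("GRANT USE CATALOG ON CATALOG finance TO `finance-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA finance.reporting TO `finance-analysts`")
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA finance.reporting TO `finance-engineers`")
```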

Talk by: Zeashan Pappa and Deepak Sekar

Data Sharing and Beyond with Delta Sharing

Stepping into this brave new digital world, we are certain that data will be a central product for many organizations. The way they convey their knowledge and their assets will be through data and analytics. Delta Sharing is the world's first open protocol for secure and scalable real-time data sharing. Through our customer conversations, there is a lot of anticipation of how Delta Sharing can be extended to non-tabular assets, such as machine learning experiments and models.

In this session, we will cover how we extended the Delta Sharing protocol to other sharing workflows, enabling sharing of ML models, arbitrary files and more. The development resulted in Arcuate, a Databricks Labs project with a data sharing flavor. The session will start with the high-level approach and how it can be extended to cover other similar use cases. It will then move to our implementation and how it integrates seamlessly with the Databricks-managed Delta Sharing server and notebooks. We finally conclude with lessons learned and our vision for the future of data sharing and beyond.
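
For context, the tabular baseline that the talk extends can already be consumed with the open-source delta-sharing Python connector; the profile path and share coordinates below are placeholders, not a real share.

```python
import delta_sharing

# A minimal sketch of reading a shared table over Delta Sharing.
profile = "/dbfs/FileStore/open-datasets.share"       # placeholder profile file
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())                       # discover tables in the share

table_url = profile + "#my_share.my_schema.my_table"  # placeholder coordinates
df = delta_sharing.load_as_pandas(table_url)          # load the shared table locally
print(df.head())
```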

Talk by: Vuong Nguyen and Milos Colic

Deploying the Lakehouse to Improve the Viewer Experience on Discovery+

In this session, we will discuss how real-time data streaming can be used to gain insights into user behavior and preferences, and how this data is being used to provide personalized content and recommendations on Discovery+. We will examine techniques that enable faster decision-making and insights on accurate, real-time data, including data masking and data validation. To enable a wide set of data consumers, from data engineers to data scientists to data analysts, we will discuss how Unity Catalog is leveraged for secure data access and sharing while still allowing teams flexibility.

Operating at this scale requires examining the value created by the data being processed and optimizing along the way; we will share some of our successes in this area.
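
A minimal sketch of the kind of validation and masking described, assuming hypothetical table and column names rather than the actual Discovery+ pipeline:

```python
from pyspark.sql import functions as F

# Read a stream of viewer events, keep rows that pass basic checks,
# and hash a sensitive field before publishing. All names are placeholders.
events = spark.readStream.table("main.viewership.raw_events")

validated = (events
             .filter(F.col("user_id").isNotNull() & (F.col("watch_seconds") >= 0))
             .withColumn("email_masked", F.sha2(F.col("email"), 256))
             .drop("email"))

(validated.writeStream
 .option("checkpointLocation", "/tmp/checkpoints/viewer_events")  # placeholder path
 .toTable("main.viewership.validated_events"))
```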

Talk by: Deepa Paranjpe

Enabling Data Governance at Enterprise Scale Using Unity Catalog

Amgen has invested in building modern, cloud-native enterprise data and analytics platforms over the past few years with a focus on tech rationalization, data democratization, overall user experience, increased reusability, and cost-effectiveness. One of these platforms is our Enterprise Data Fabric, which focuses on pulling in data across functions and providing capabilities to integrate and connect the data and govern access. For a while, we have been trying to set up robust data governance capabilities that are simple yet easy to manage through Databricks. There were a few tools in the market that solved a few immediate needs, but none solved the problem holistically. For use cases like maintaining governance on highly restricted data domains like Finance and HR, a long-term solution native to Databricks and addressing the below limitations was deemed important:

  • The way these tools were set up allowed some security policies to be overridden
  • Tools were not up to date with the latest DBR runtime
  • Complexity of implementing fine-grained security
  • Policy management split between AWS IAM and in-tool policies

To address these challenges, and for large-scale enterprise adoption of our governance capability, we started working on UC integration with our governance processes, with the aim of realizing the following tech benefits:

  • Independent of Databricks runtime
  • Easy fine-grained access control
  • Eliminated management of IAM roles
  • Dynamic access control using UC and dynamic views

Today, using UC, we have implemented fine-grained access control and governance for Amgen's restricted data. We are in the process of devising a realistic migration and change management strategy across the enterprise.
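
As one illustration of the dynamic-access-control point above, a Unity Catalog dynamic view can gate rows and mask columns by group membership; the table, column, and group names below are hypothetical, not Amgen's.

```python
# Hypothetical dynamic view: row-level gating plus column-level masking
# based on account group membership. Names are illustrative only.
spark.sql("""
CREATE OR REPLACE VIEW restricted.hr.employee_pay_v AS
SELECT
  employee_id,
  department,
  CASE WHEN is_account_group_member('hr-compensation')
       THEN salary ELSE NULL END AS salary          -- column-level masking
FROM restricted.hr.employee_pay
WHERE is_account_group_member('hr-all')             -- row-level gate
   OR department = 'PUBLIC'
""")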

Talk by: Lakhan Prajapati and Jaison Dominic

Extending Lakehouse Architecture with Collaborative Identity

Lakehouse architecture has become a valuable solution for unifying data processing for AI, but faces limitations in maximizing data’s full potential. Additional data infrastructure is helpful for strengthening data consolidation and data connectivity with third-party sources, which are necessary for building full data sets for accurate audience modeling. 

In this session, LiveRamp will demonstrate to data and analytics decision-makers how to build on the Lakehouse architecture with extensions for collaborative identity graph construction, including how to simplify and improve data enrichment, data activation, and data collaboration. LiveRamp will also introduce a complete data marketplace, which enables easy, pseudonymized data enhancements that widen the attribute set for better behavioral model construction.

With these techniques and technologies, enterprises across financial services, retail, media, travel, and more can safely unlock partner insights and ultimately produce more accurate inputs for personalization engines, and more engaging offers and recommendations for customers.

Talk by: Erin Boelkens and Shawn Gilleran

How Coinbase Built and Optimized SOON, a Streaming Ingestion Framework

Data with low latency is important for real-time incident analysis and metrics. Though we have up-to-date data in OLTP databases, they cannot support those scenarios. Data needs to be replicated to a data warehouse to serve queries using GroupBy and Join across multiple tables from different systems. At Coinbase, we designed SOON (Spark cOntinuOus iNgestion) based on Kafka, Kafka Connect, and Apache Spark™ as an incremental table replication solution to replicate tables of any size from any database to Delta Lake in a timely manner. It also supports Kafka events ingestion naturally.

SOON incrementally ingests Kafka events as appends, updates, and deletes to an existing table on Delta Lake. The events are grouped into two categories: CDC (change data capture) events generated by Kafka Connect source connectors, and non-CDC events by the frontend or backend services. Both types can be appended or merged into the Delta Lake. Non-CDC events can be in any format, but CDC events must be in the standard SOON CDC schema. We implemented Kafka Connect SMTs to transform raw CDC events into this standardized format. SOON unifies all streaming ingestion scenarios such that users only need to learn one onboarding experience and the team only needs to maintain one framework.
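
A hedged sketch of the general pattern this describes, reading CDC events from Kafka and merging each micro-batch into Delta Lake; the broker, topic, schema, and table names below are placeholders rather than Coinbase's actual SOON schema.

```python
from pyspark.sql import functions as F, Window
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType
from delta.tables import DeltaTable

# Placeholder CDC event schema ('c' = insert, 'u' = update, 'd' = delete).
cdc_schema = StructType([
    StructField("order_id", LongType()),
    StructField("op", StringType()),
    StructField("amount", LongType()),
    StructField("ts", TimestampType()),
])

cdc = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")    # placeholder broker
       .option("subscribe", "orders_cdc")                    # placeholder topic
       .load()
       .select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("e"))
       .select("e.*"))

def upsert(batch_df, batch_id):
    # Keep only the latest event per key so MERGE sees one source row per key.
    latest = (batch_df
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("order_id").orderBy(F.desc("ts"))))
              .filter("rn = 1").drop("rn"))
    target = DeltaTable.forName(spark, "main.raw.orders")    # placeholder table
    (target.alias("t")
     .merge(latest.alias("s"), "t.order_id = s.order_id")
     .whenMatchedDelete(condition="s.op = 'd'")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll(condition="s.op != 'd'")
     .execute())

(cdc.writeStream
 .foreachBatch(upsert)
 .option("checkpointLocation", "/tmp/checkpoints/orders_cdc")  # placeholder path
 .start())
```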

We care about ingestion performance. The biggest append-only table onboarded has ingress traffic of hundreds of thousands of events per second; the biggest CDC-merge table onboarded has a snapshot size of a few TBs and CDC update traffic of hundreds of thousands of events per second. A lot of innovative ideas are incorporated in SOON to improve its performance, such as min-max range merge optimization, KMeans merge optimization, no-update merge for deduplication, generated columns as partitions, etc.

Talk by: Chen Guo

Instacart on Why Engineers Shouldn't Write Data Governance Policies

Controlling permissions for accessing data assets can be messy, time-consuming, or, usually, both. The teams responsible for creating the business rules that govern who should have access to what data are usually different from the teams responsible for administering the grants to achieve that access. On the other side of the equation, the end user who needs access to a data asset may be left waiting for grants to be made as the decision is passed between teams. That is, if they even know the correct path to getting access in the first place.

Separating the concerns of managing data governance at a business level and implementing data governance at an engineering level is the best way to clarify data access permissions. In practice, this involves building systems to enable data governance enforcement based on business rules, with little to no understanding of the individual system where the data lives.

In practice, with a concrete business rule, such as “only users from the finance team should have access to critical financial data,” we want a system that deals only with those constituent concepts. For example, “the data is marked as critical financial” and “the user is a part of the finance team.” By abstracting away any source system components, such as “the tables in the finance schema” and “someone who’s a member of the finance Databricks group,” the access policies applied will then model the business rules as closely as possible.
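
A hypothetical, minimal illustration of that separation: the policy below reasons only about business-level attributes, while an enforcement layer (not shown) would translate the decision into engine-specific grants such as Unity Catalog GRANT statements. All names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    teams: set          # business-level team memberships, not Databricks groups

@dataclass
class Asset:
    name: str
    classifications: set  # business-level labels, not schemas or tables

def allowed(user: User, asset: Asset) -> bool:
    # Business rule: only the finance team may read critical financial data.
    if "critical-financial" in asset.classifications:
        return "finance" in user.teams
    return True

# Example evaluation; the source-system details stay out of the rule itself.
print(allowed(User("ana", {"finance"}), Asset("q3_ledger", {"critical-financial"})))  # True
print(allowed(User("bo", {"marketing"}), Asset("q3_ledger", {"critical-financial"})))  # False
```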

This session will focus on how to establish and align the processes, policies, and stakeholders involved in making this type of system work seamlessly. Sharing the experience and learnings of our team at Instacart, we will aim to help attendees streamline and simplify their data security and access strategies.

Talk by: Kieran Taylor and Andria Fuquen

JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

JetBlue has embarked over the past year on an AI and ML transformation. Databricks has been instrumental in this transformation due to the ability to integrate streaming pipelines, ML training using MLflow, ML API serving using the Model Registry, and more in one cohesive platform. Real-time streams of weather, aircraft sensors, FAA data feeds, JetBlue operations, and more feed the world's first AI and ML operating system orchestrating a digital twin, known as BlueSky, for efficient and safe operations. JetBlue has over 10 ML products (multiple models each) in production across multiple verticals including dynamic pricing, customer recommendation engines, supply chain optimization, customer sentiment NLP, and several more.

The core JetBlue data science and analytics team consists of Operations Data Science, Commercial Data Science, AI and ML Engineering, and Business Intelligence. To facilitate rapid growth and a faster go-to-market strategy, the team has built an internal Data Catalog + AutoML + AutoDeploy wrapper called BlueML using Databricks features to empower data scientists, including advanced analysts, with the ability to train and deploy ML models in less than five lines of code.
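
BlueML itself is internal to JetBlue; as a hedged sketch of the few-line train-and-register experience such a wrapper builds on, MLflow autologging on Databricks with made-up data and model names might look like this:

```python
import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Not BlueML; a minimal MLflow-based sketch of a few-line train-and-register flow.
mlflow.autolog()                                   # capture params, metrics, and the model
X, y = make_regression(n_samples=1_000, n_features=8, random_state=0)

with mlflow.start_run() as run:
    model = RandomForestRegressor(n_estimators=50).fit(X, y)

# Register the autologged model under a hypothetical name.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "demo_delay_model")
```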

Talk by: Derrick Olson and Rob Bajra

Large Language Models in Healthcare: Benchmarks, Applications, and Compliance

Large language models provide a leap in capabilities in understanding medical language and context, from passing the US medical licensing exam to summarizing clinical notes. They also suffer from a wide range of issues, such as hallucinations, robustness, privacy, and bias, that block many use cases. This session shares currently deployed software, lessons learned, and best practices that John Snow Labs has learned while enabling academic medical centers, pharmaceuticals, and health IT companies to build LLM-based solutions.

First, we cover benchmarks for new healthcare-specific large language models, showing how tuning LLMs specifically on medical data and tasks results in higher accuracy on use cases such as question answering, information extraction, and summarization, compared to general-purpose LLMs like GPT-4. Second, we share an architecture for medical chatbots that tackles issues of hallucinations, outdated content, privacy, and building a longitudinal view of patients. Third, we present a comprehensive solution for testing LLMs beyond accuracy, covering bias, fairness, representation, robustness, and toxicity, using the open-source nlptest library.

Talk by: David Talby

Rapidly Implementing Major Retailer API at the Hershey Company

Accurate, reliable, and timely data is critical for CPG companies to stay ahead in highly competitive retailer relationships, and for a company like the Hershey Company, the commercial relationship with Walmart is one of the most important. The team at Hershey found themselves with a looming deadline for their legacy analytics services and targeted a migration to the brand new Walmart Luminate API. Working in partnership with Advancing Analytics, the Hershey Company leveraged a metadata-driven Lakehouse Architecture to rapidly onboard the new Luminate API, helping the category management teams to overhaul how they measure, predict, and plan their business operations.

In this session, we will discuss the impact Luminate has had on Hershey's business, covering key areas such as sales, supply chain, and retail field execution, and the technical building blocks that can be used to rapidly provision business users with the data they need, when they need it. We will discuss how key technologies enable this rapid approach, with Databricks Autoloader ingesting and shaping our data, Delta streaming processing the data through the lakehouse, and Databricks SQL providing a responsive serving layer. The session will include commentary as well as covering the technical journey.
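
A minimal Auto Loader sketch of the ingestion pattern described; the landing path, file format, and target table are assumptions rather than Hershey's actual Luminate feed layout.

```python
# Incrementally ingest newly arriving files into a bronze Delta table.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/tmp/schemas/luminate")   # placeholder
       .load("/Volumes/landing/luminate/"))                            # placeholder path

(raw.writeStream
 .option("checkpointLocation", "/tmp/checkpoints/luminate_bronze")     # placeholder
 .toTable("main.luminate.bronze"))                                     # placeholder table
```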

Talk by: Simon Whiteley and Jordan Donmoyer

Self-Service Data Analytics and Governance at Enterprise Scale with Unity Catalog

This session focuses on one of the first Unity Catalog implementations for a large-scale enterprise. The scenario is a cloud-scale analytics platform based on the lakehouse approach with 7,500 active users, plus a potential 1,500 further users who are subject to special governance rules. They are consuming more than 600 TB of data stored in Delta Lake, continuously growing at more than 1 TB per day, and this may grow further with local country data. The existing data platform must therefore be extended to enable users to combine global data with local data from their countries. A new data management approach was required that reflects the strict information security rules on a need-to-know basis. Core requirements are: read-only access to global data, write access to local data, and the ability to share the results.

Due to very pronounced information security awareness and a lack of technological capabilities, it had so far been difficult or impossible to analyze and exchange data across disciplines, so a lot of business potential and gains could not be identified and realized.

With the new developments in the technology used and the lakehouse approach as a basis, thanks to Unity Catalog we were able to develop a solution that meets high requirements for security and process and enables globally secured, interdisciplinary data exchange and analysis at scale. This solution enables the democratization of data, which results not only in better insights for business management, but also in entirely new business cases or products that require a higher degree of data integration, and it encourages the culture to change. We highlight technical challenges and solutions, present best practices, and point out the benefits of implementing Unity Catalog for enterprises.

Talk by: Artem Meshcheryakov and Pascal van Bellen

Sponsored by: Avanade | Accelerating Adoption of Modern Analytics and Governance at Scale

To unlock all the competitive advantage Databricks offers your organization, you might need to update your strategy and methodology for the platform. With more than 1,000 Databricks projects completed globally in the last 18 months, we are going to share our insights on the best building blocks to target as you search for efficiency and competitive advantage.

These building blocks include enterprise metadata and data management services, a data management foundation, and data services and products that enable business units to fully use their data and analytics at scale.

In this session, Avanade data leaders will highlight how Databricks’ modern data stack fits into the Azure PaaS and SaaS ecosystem (such as Microsoft Fabric), how Unity Catalog metadata supports automated data operations scenarios, and how we are helping clients measure the business impact and value of modern analytics and governance.

Talk by: Alan Grogan and Timur Mehmedbasic

Sponsored by: ThoughtSpot | Drive Self-Service Adoption Through the Roof with Embedded Analytics

When it comes to building stickier apps and products to grow your business, there's no greater opportunity than embedded analytics. Data apps that deliver superior user engagement and business value do analytics differently. They take a user-first approach and know how to deliver real-time, AI-powered insights - not just to internal employees - but to an organization’s customers and partners, as well.

Learn how ThoughtSpot Everywhere is helping companies like Emerald natively integrate analytics with other tools in their modern data stack to deliver a blazing-fast and instantly available analytics experience across all the data their users love. Join this session to learn how you can leverage embedded analytics to:

  • Drive higher app engagement
  • Get your app to market faster
  • Create new revenue streams

Talk by: Krishti Bikal and Vika Smilansky

Streaming Schema Drift Discovery and Controlled Mitigation

When creating streaming workloads with Databricks, it can sometimes be difficult to capture and understand the current structure of your source data. For example, what happens if you are ingesting JSON events from a vendor, and the keys are very sparsely populated, or contain dynamic content? Ideally, data engineers want to "lock in" a target schema in order to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don't cooperate with that vision? The first step is to quantify how far your current source data is drifting from your established Delta table. But how?

This session will demonstrate a way to capture and visualize drift across all your streaming tables. The next question is, "Now that I see all of the data I'm missing, how do I selectively promote some of these keys into DataFrame columns?" The second half of this session will demonstrate precisely how to do a schema migration with minimal job downtime.
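
One way to quantify that drift, assuming the stream is ingested with Auto Loader so that unexpected keys land in the _rescued_data column; the table name below is a placeholder.

```python
from pyspark.sql import functions as F

# Count which unexpected JSON keys are drifting away from the locked-in schema.
bronze = spark.read.table("main.events.bronze")        # placeholder table

drift = (bronze
         .filter(F.col("_rescued_data").isNotNull())
         .select(F.explode(F.expr("json_object_keys(_rescued_data)"))
                 .alias("unexpected_key"))
         .groupBy("unexpected_key")
         .count()
         .orderBy(F.desc("count")))

drift.show(truncate=False)   # which keys are drifting, and how often
```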

Talk by: Alexander Vanadio

Using Cisco Spaces Firehose API as a Stream of Data for Real-Time Occupancy Modeling

Honeywell manages the control of equipment for hundreds of thousands of buildings worldwide. Many of our outcomes relating to energy and comfort rely on knowing where people are in the building at any one time, so that we can target health and comfort conditions more precisely to areas that are more densely populated. Many of these buildings have Cisco IT infrastructure in them. Using their Wi-Fi access points and the RSSI signal strength from people’s laptops and phones, Cisco can calculate the number of people in each area of the building. Cisco Spaces offers this data as a real-time streaming source, and Honeywell HBT has utilized this stream by writing Delta Live Tables pipelines to consume it.

Honeywell buildings can now receive this firehose data from hundreds of concurrent customers and provide this occupancy data as a service to our vertical offerings in commercial, health, real estate and education. We will discuss the benefits of using DLT to handle this sort of incoming stream data, and illustrate the pain points we had and the resolutions we undertook in successfully receiving the stream of Cisco data. We will illustrate how our DLT pipeline was designed, and how it scaled to deal with huge quantities of real-time streaming data.
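
A minimal Delta Live Tables sketch of this consumption pattern; the broker, topic, JSON fields, and expectations are placeholders rather than the actual Cisco Spaces feed.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw occupancy events from a firehose-style stream (placeholder source)")
def occupancy_raw():
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
            .option("subscribe", "spaces_firehose")              # placeholder topic
            .load()
            .select(F.col("value").cast("string").alias("payload"),
                    F.col("timestamp")))

@dlt.table(comment="Occupancy per zone in five-minute windows")
@dlt.expect_or_drop("has_zone", "zone_id IS NOT NULL")
def occupancy_by_zone():
    parsed = dlt.read_stream("occupancy_raw").select(
        F.get_json_object("payload", "$.zoneId").alias("zone_id"),
        F.get_json_object("payload", "$.count").cast("int").alias("people"),
        "timestamp")
    return (parsed
            .withWatermark("timestamp", "10 minutes")
            .groupBy(F.window("timestamp", "5 minutes"), "zone_id")
            .agg(F.max("people").alias("occupancy")))
```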

Talk by: Paul Mracek and Chris Inkpen

Using Databricks to Power Insights and Visualizations on the S&P Global Marketplace

In this session, we will explain the visualizations that serve to shorten the time to insight for our prospects and encourage potential buyers to take the next step and request more information from our commercial team. The S&P Global Marketplace is a discovery and exploration platform that enables prospective buyers and clients to easily search fundamental and alternative datasets from across S&P Global and curated third-party providers. It serves as a digital storefront that provides transparency into data coverage and use cases, reducing the time and effort for clients to find data for their needs. A key feature of Marketplace is our interactive data visualizations that provide insight into the coverage of a dataset and demonstrate how the dataset can be used to make more informed decisions.

The S&P Global Marketplace’s interactive visualizations are displayed in Tableau and are powered by Databricks. The Databricks platform allows for easy integration of S&P Global data and provides a collaborative environment where our team of product managers and data engineers can develop the code to generate each visualization. The team utilizes the web interface to develop the queries that perform the heavy lifting of data transformation instead of performing these tasks in Tableau. The final notebook output is saved into a custom data mart (“golden table”), which is the source for Tableau. We also developed an automated process that refreshes the whole pipeline to ensure Marketplace always has up-to-date visualizations.
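
A hedged sketch of that “golden table” flow with placeholder source and target names: the heavy transformation runs in a notebook, and Tableau reads only the final Delta table.

```python
# Aggregate coverage metrics in the notebook rather than in Tableau.
golden = spark.sql("""
  SELECT dataset_id,
         country,
         COUNT(DISTINCT entity_id) AS covered_entities,
         MAX(as_of_date)           AS latest_as_of
  FROM   marketplace.coverage_raw          -- placeholder source table
  GROUP  BY dataset_id, country
""")

(golden.write
 .mode("overwrite")
 .saveAsTable("marketplace.coverage_golden"))   # the table Tableau points at
```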

Talk by: Onik Kurktchian

Why a Major Japanese Financial Institution Chose Databricks To Accelerate its Data AI-Driven Journey

In this session, NTT DATA presents a case study involving one of the largest and most prominent financial institutions in Japan. The project involved migration from their largest data analysis platform to Databricks and required careful navigation of very strict security requirements while accommodating the needs of evolving technical solutions so they could support a wide variety of company structures. This session is for those who want to accelerate their business by effectively utilizing AI as well as BI.

NTT DATA is one of the largest system integrators in Japan, providing data analytics infrastructure to leading companies to help them effectively drive the democratization of data and AI as many in the Japanese market are now adding AI into their BI offering.

Talk by: Yuki Saito

Your LLM, Your Data, Your Infrastructure

Lamini, the most powerful LLM engine, is the platform for any and every software engineer to ship an LLM into production as rapidly and as easily as possible. In this session, learn how to train your LLM on your own data and infrastructure with a few lines of code using the Lamini library. Get early access to a playground to train any open-source LLM. With Lamini, your own LLM comes with better performance, better data privacy, lower cost, lower latency, and more.

Talk by: Sharon Zhou

ABN Story: Migrating to a Future-Proof Data Platform

ABN AMRO Bank is one of the top leading banks in the Netherlands. It is the third largest bank in the Netherlands by revenue and number of mortgages held within the Netherlands, and it has top management support for the objective of becoming a fully data-driven bank. ABN AMRO started its data journey almost seven years ago and built a data platform on premises with Hadoop technologies. This data platform has been used by more than 200 data providers and 150 data consumers, and holds more than 3,000 datasets.

Becoming a fully digital bank and addressing the limitations of the on-premises platform requires a future-proof data platform: DIAL (digital integration and access layer). ABN AMRO decided to build an Azure cloud-native data platform with the help of Microsoft and Databricks. Last year this cloud-native platform was ready for our data providers and data consumers. Six months ago we started the journey of migrating all the content from the on-premises data platform to the Azure data platform; this very large-scale migration was achieved in six months.

In this session, we will focus on three things:

  1. The migration strategy going from on-premises to a cloud-native platform
  2. Which Databricks solutions were used in the data platform
  3. How the Databricks team assisted in the overall migration

Talk by: Rakesh Singh and Marcel Kramer

Automating Sensitive Data (PII/PHI) Detection

Healthcare datasets contain both personally identifiable information (PII) and personal health information (PHI) that needs to be de-identified in order to protect patient confidentiality and ensure HIPAA compliance. This privacy data is easily detected when it is provided in columns labeled with names such as “SSN,” “First Name,” “Full Name,” and “DOB”; however, it is much harder to detect when it is hidden within columns labeled “Doctor Notes,” “Diagnoses,” or “Comments.” HealthVerity, a leader in the HIPAA-compliant exchange of real-world data (RWD) to uncover patient, payer and genomic insights and power innovation for the healthcare industry, ensures healthcare datasets are de-identified from PII and PHI using elaborate privacy procedures.

During this session, we will demonstrate how to use a low-code/no-code platform to simplify and automate data pipelines that leverage prebuilt ML models to scan data for PHI/PII leakage, quarantine those rows in Unity Catalog when leakage is identified, and move them to a Databricks clean room for analysis.
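
A hypothetical sketch of that quarantine flow; detect_pii below is a trivial stand-in for the prebuilt ML models mentioned, and all table names are invented.

```python
from pyspark.sql import functions as F, types as T

@F.udf(T.BooleanType())
def detect_pii(text):
    # Placeholder logic only; a real model would score names, SSNs, dates, etc.
    if text is None:
        return False
    return any(tok.isdigit() and len(tok) == 9 for tok in text.split())

notes = spark.read.table("healthcare.raw.doctor_notes")        # placeholder table
flagged = notes.withColumn("has_pii", detect_pii("note_text"))

# Rows with suspected leakage go to a quarantine table; the rest flow on.
flagged.filter("has_pii").write.mode("append") \
       .saveAsTable("healthcare.quarantine.doctor_notes")
flagged.filter("NOT has_pii").drop("has_pii").write.mode("append") \
       .saveAsTable("healthcare.clean.doctor_notes")
```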

Talk by: Pouya Barrach-Yousefi and Simon King
