talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

582

Sessions & talks

Showing 76–100 of 582 · Newest first

DataSecOps and Unity Catalog: High Leverage Governance at Scale

2023-07-26 · Watch video
Zeashan Pappa (Databricks), Deepak Sekar

Learn how to apply DataSecOps patterns powered by Terraform to Unity Catalog to scale your governance efforts and support your organizational data usage.

Talk by: Zeashan Pappa and Deepak Sekar


Data Sharing and Beyond with Delta Sharing

2023-07-26 · Watch video
Milos Colic (Databricks), Vuong Nguyen

Stepping into this brave new digital world, we are certain that data will be a central product for many organizations. The way they convey their knowledge and their assets will be through data and analytics. Delta Sharing is the world's first open protocol for secure and scalable real-time data sharing. Through our customer conversations, we see a lot of anticipation about how Delta Sharing can be extended to non-tabular assets, such as machine learning experiments and models.

In this session, we will cover how we extended the Delta Sharing protocol to other sharing workflows, enabling the sharing of ML models, arbitrary files, and more. The development resulted in Arcuate, a Databricks Labs project with a data sharing flavor. The session will start with the high-level approach and how it can be extended to cover other similar use cases. It will then move to our implementation and how it integrates seamlessly with the Databricks-managed Delta Sharing server and notebooks. We conclude with lessons learned and our vision for the future of data sharing and beyond.
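For readers unfamiliar with the protocol side, the open-source delta-sharing Python connector shows what the recipient experience looks like; the profile path and share/schema/table names below are placeholders, and the Arcuate extension for ML models discussed in the talk is not part of this snippet.

```python
# pip install delta-sharing
import delta_sharing

# A profile file holds the sharing server endpoint and the bearer token issued by the provider.
profile = "/dbfs/FileStore/open-datasets.share"  # placeholder path

# Discover the tables the provider has shared with this recipient.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table.share, table.schema, table.name)

# Load one shared table directly as a pandas DataFrame (placeholder share#schema.table).
df = delta_sharing.load_as_pandas(f"{profile}#sales_share.crm.accounts")
print(df.head())
```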

Talk by: Vuong Nguyen and Milos Colic


Deploying the Lakehouse to Improve the Viewer Experience on Discovery+

2023-07-26 · Watch video

In this session, we will discuss how real-time data streaming can be used to gain insights into user behavior and preferences, and how this data is being used to provide personalized content and recommendations on Discovery+. We will examine techniques that enable faster decision-making and insights on accurate real-time data, including data masking and data validation. To enable a wide set of data consumers, from data engineers to data scientists to data analysts, we will discuss how Unity Catalog is leveraged for secure data access and sharing while still allowing teams flexibility.

Operating at this scale requires examining the value created by the data being processed and optimizing along the way, and we will share some of our successes in this area.
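As a rough illustration of the masking-plus-validation pattern described above (not Discovery+'s actual pipeline; the table names and the hashed column are made up), a Structured Streaming job might apply these transformations before data lands in a Unity Catalog table:

```python
from pyspark.sql import functions as F

# Read a stream of raw playback events (placeholder source table).
events = spark.readStream.table("raw.playback_events")

cleaned = (
    events
    # Data validation: drop records that fail basic sanity checks.
    .filter(F.col("event_ts").isNotNull() & (F.col("watch_seconds") >= 0))
    # Data masking: replace the user identifier with a one-way hash.
    .withColumn("user_id", F.sha2(F.col("user_id").cast("string"), 256))
)

# Land the governed stream in a Unity Catalog table for downstream consumers.
(cleaned.writeStream
    .option("checkpointLocation", "/Volumes/analytics/chk/playback_events")  # placeholder path
    .toTable("analytics.playback.events_clean"))
```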

Talk by: Deepa Paranjpe



Enabling Data Governance at Enterprise Scale Using Unity Catalog

2023-07-26 · Watch video

Amgen has invested in building modern, cloud-native enterprise data and analytics platforms over the past few years with a focus on tech rationalization, data democratization, overall user experience, increased reusability, and cost-effectiveness. One of these platforms is our Enterprise Data Fabric, which focuses on pulling in data across functions and providing capabilities to integrate and connect the data and govern access. For a while, we have been trying to set up robust data governance capabilities that are simple yet easy to manage through Databricks. There were a few tools in the market that solved a few immediate needs, but none solved the problem holistically. For use cases like maintaining governance on highly restricted data domains such as Finance and HR, a long-term solution native to Databricks and addressing the limitations below was deemed important:

  • The way these tools were set up allowed a few security policies to be overridden
  • Tools were not up to date with the latest DBR runtime
  • Complexity of implementing fine-grained security
  • Policy management – AWS IAM + in-tool policies

To address these challenges, and for large-scale enterprise adoption of our governance capability, we started working on UC integration with our governance processes, with the aim of realizing the following technical benefits:

  • Independent of Databricks runtime
  • Easy fine-grained access control
  • Eliminated management of IAM roles
  • Dynamic access control using UC and dynamic views

Today, using UC, we have been able to implement fine-grained access control and governance for Amgen's restricted data. We are in the process of devising a realistic migration and change management strategy across the enterprise.
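A minimal sketch of dynamic access control with UC dynamic views, assuming a hypothetical hr.employees table with a sensitivity column and an hr_analysts account group; this illustrates the pattern rather than Amgen's actual policies:

```python
# Unity Catalog dynamic view: rows and columns are filtered per caller at query time.
spark.sql("""
CREATE OR REPLACE VIEW hr.employees_restricted AS
SELECT
  employee_id,
  department,
  -- Column masking: only members of the hr_analysts account group see salaries.
  CASE WHEN is_account_group_member('hr_analysts') THEN salary ELSE NULL END AS salary
FROM hr.employees
-- Row filtering: non-members only see rows tagged as non-sensitive.
WHERE is_account_group_member('hr_analysts') OR sensitivity = 'public'
""")

# Fine-grained grant: analysts may read the view, never the underlying table.
spark.sql("GRANT SELECT ON TABLE hr.employees_restricted TO `hr_analysts`")
```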

Talk by: Lakhan Prajapati and Jaison Dominic


Extending Lakehouse Architecture with Collaborative Identity

2023-07-26 · Watch video
Erin Boelkens (LiveRamp), Shawn Gilleran (LiveRamp)

Lakehouse architecture has become a valuable solution for unifying data processing for AI, but faces limitations in maximizing data’s full potential. Additional data infrastructure is helpful for strengthening data consolidation and data connectivity with third-party sources, which are necessary for building full data sets for accurate audience modeling. 

In this session, LiveRamp will demonstrate to data and analytics decision-makers how to build on the Lakehouse architecture with extensions for collaborative identity graph construction, including how to simplify and improve data enrichment, data activation, and data collaboration. LiveRamp will also introduce a complete data marketplace, which enables easy, pseudonymized data enhancements that widen the attribute set for better behavioral model construction.

With these techniques and technologies, enterprises across financial services, retail, media, travel, and more can safely unlock partner insights and ultimately produce more accurate inputs for personalization engines, and more engaging offers and recommendations for customers.

Talk by: Erin Boelkens and Shawn Gilleran



How Coinbase Built and Optimized SOON, a Streaming Ingestion Framework

2023-07-26 · Watch video

Data with low latency is important for real-time incident analysis and metrics. Though we have up-to-date data in OLTP databases, they cannot support those scenarios. Data needs to be replicated to a data warehouse to serve queries using GroupBy and Join across multiple tables from different systems. At Coinbase, we designed SOON (Spark cOntinuOus iNgestion), based on Kafka, Kafka Connect, and Apache Spark™, as an incremental table replication solution to replicate tables of any size from any database to Delta Lake in a timely manner. It also supports Kafka event ingestion naturally.

SOON incrementally ingests Kafka events as appends, updates, and deletes to an existing table on Delta Lake. The events are grouped into two categories: CDC (change data capture) events generated by Kafka Connect source connectors, and non-CDC events by the frontend or backend services. Both types can be appended or merged into the Delta Lake. Non-CDC events can be in any format, but CDC events must be in the standard SOON CDC schema. We implemented Kafka Connect SMTs to transform raw CDC events into this standardized format. SOON unifies all streaming ingestion scenarios such that users only need to learn one onboarding experience and the team only needs to maintain one framework.
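A bare-bones sketch of the CDC merge pattern this paragraph describes, using Structured Streaming's foreachBatch with a Delta MERGE; the envelope schema (id, op, name, balance) and all names are illustrative, not the actual SOON CDC schema:

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

def upsert_cdc_batch(microbatch, batch_id):
    """Apply one micro-batch of CDC events (id, op, name, balance) to the target Delta table."""
    target = DeltaTable.forName(spark, "warehouse.accounts")  # placeholder table
    (target.alias("t")
        .merge(microbatch.alias("s"), "t.id = s.id")
        .whenMatchedDelete(condition="s.op = 'delete'")
        .whenMatchedUpdate(condition="s.op <> 'delete'",
                           set={"name": "s.name", "balance": "s.balance"})
        .whenNotMatchedInsert(condition="s.op <> 'delete'",
                              values={"id": "s.id", "name": "s.name", "balance": "s.balance"})
        .execute())

# Kafka source -> parse the CDC envelope -> merge each micro-batch into Delta Lake.
cdc = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")    # placeholder
    .option("subscribe", "accounts-cdc")                  # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "id BIGINT, op STRING, name STRING, balance DOUBLE").alias("e"))
    .select("e.*"))

(cdc.writeStream
    .foreachBatch(upsert_cdc_batch)
    .option("checkpointLocation", "/chk/accounts-cdc")    # placeholder
    .start())
```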

We care about ingestion performance. The biggest append-only table onboarded has ingress traffic of hundreds of thousands of events per second; the biggest CDC-merge table onboarded has a snapshot size of a few TBs and CDC update traffic of hundreds of thousands of events per second. A lot of innovative ideas are incorporated in SOON to improve its performance, such as min-max range merge optimization, KMeans merge optimization, no-update merge for deduplication, generated columns as partitions, etc.

Talk by: Chen Guo



Instacart on Why Engineers Shouldn't Write Data Governance Policies

2023-07-26 · Watch video

Controlling permissions for accessing data assets can be messy, time-consuming, or, more often, both. The teams responsible for creating the business rules that govern who should have access to what data are usually different from the teams responsible for administering the grants to achieve that access. On the other side of the equation, the end user who needs access to a data asset may be left waiting for grants to be made as the decision is passed between teams. That is, if they even know the correct path to getting access in the first place.

Separating the concerns of managing data governance at a business level and implementing data governance at an engineering level is the best way to clarify data access permissions. In practice, this involves building systems to enable data governance enforcement based on business rules, with little to no understanding of the individual system where the data lives.

In practice, with a concrete business rule, such as “only users from the finance team should have access to critical financial data,” we want a system that deals only with those constituent concepts. For example, “the data is marked as critical financial” and “the user is a part of the finance team.” By abstracting away any source system components, such as “the tables in the finance schema” and “someone who’s a member of the finance Databricks group,” the access policies applied will then model the business rules as closely as possible.
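A toy sketch of this separation of concerns: the business rule mentions only a data classification and a team, and a small engine resolves it to grants on whichever tables carry that classification tag (assumed here to be discoverable through Unity Catalog's information_schema.table_tags view; all tag, group, and table names are hypothetical, and this is not Instacart's system):

```python
# Business-level rule: who may see which classification of data. No table names anywhere.
RULES = [
    {"classification": "critical_financial", "group": "finance-team"},
]

def tables_with_classification(classification: str):
    """Resolve a classification tag to concrete tables (system-level concern)."""
    rows = spark.sql(f"""
        SELECT catalog_name, schema_name, table_name
        FROM system.information_schema.table_tags
        WHERE tag_name = 'classification' AND tag_value = '{classification}'
    """).collect()
    return [f"{r.catalog_name}.{r.schema_name}.{r.table_name}" for r in rows]

def enforce(rules):
    """Translate business rules into grants; engineers never edit the rules themselves."""
    for rule in rules:
        for table in tables_with_classification(rule["classification"]):
            spark.sql(f"GRANT SELECT ON TABLE {table} TO `{rule['group']}`")

enforce(RULES)
```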

This session will focus on how to establish and align the processes, policies, and stakeholders involved in making this type of system work seamlessly. Sharing the experience and learnings of our team at Instacart, we will aim to help attendees streamline and simplify their data security and access strategies.

Talk by: Kieran Taylor and Andria Fuquen


JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

2023-07-26 · Watch video

Over the past year, JetBlue has embarked on an AI and ML transformation. Databricks has been instrumental in this transformation due to the ability to integrate streaming pipelines, ML training using MLflow, ML API serving using the Model Registry, and more in one cohesive platform. Real-time streams of weather, aircraft sensor, FAA data feed, and JetBlue operations data power the world's first AI and ML operating system orchestrating a digital twin, known as BlueSky, for efficient and safe operations. JetBlue has over 10 ML products (multiple models per product) in production across multiple verticals, including dynamic pricing, customer recommendation engines, supply chain optimization, customer sentiment NLP, and several more.

The core JetBlue data science and analytics team consists of Operations Data Science, Commercial Data Science, AI and ML Engineering, and Business Intelligence. To facilitate rapid growth and a faster go-to-market strategy, the team has built an internal Data Catalog + AutoML + AutoDeploy wrapper called BlueML using Databricks features, empowering data scientists, as well as advanced analysts, to train and deploy ML models in fewer than five lines of code.
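BlueML itself is an internal JetBlue wrapper, so its API is not public; as a generic illustration of what a "train and deploy in a few lines" flow can look like on Databricks, MLflow autologging plus the Model Registry gets close (the dataset path, feature names, and model name below are made up):

```python
import mlflow
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

mlflow.autolog()  # capture params, metrics, and the fitted model automatically

df = pd.read_parquet("/dbfs/tmp/delay_features.parquet")   # placeholder feature table
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["delay_minutes"]), df["delay_minutes"])

with mlflow.start_run() as run:
    model = GradientBoostingRegressor().fit(X_train, y_train)
    print("test R^2:", model.score(X_test, y_test))

# Promote the autologged model to the registry so it can be served behind an API.
mlflow.register_model(f"runs:/{run.info.run_id}/model", "blue_sky_delay_model")
```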

Talk by: Derrick Olson and Rob Bajra


Large Language Models in Healthcare: Benchmarks, Applications, and Compliance

2023-07-26 · Watch video
David Talby (John Snow Labs and Pacific AI)

Large language models provide a leap in capabilities for understanding medical language and context, from passing the US medical licensing exam to summarizing clinical notes. They also suffer from a wide range of issues, including hallucinations, robustness, privacy, and bias, that block many use cases. This session shares currently deployed software, lessons learned, and best practices that John Snow Labs has developed while enabling academic medical centers, pharmaceutical companies, and health IT companies to build LLM-based solutions.

First, we cover benchmarks for new healthcare-specific large language models, showing how tuning LLMs specifically on medical data and tasks results in higher accuracy on use cases such as question answering, information extraction, and summarization, compared to general-purpose LLMs like GPT-4. Second, we share an architecture for medical chatbots that tackles issues of hallucinations, outdated content, privacy, and building a longitudinal view of patients. Third, we present a comprehensive solution for testing LLMs beyond accuracy (for bias, fairness, representation, robustness, and toxicity) using the open-source nlptest library.
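The open-source nlptest library mentioned above exposes a Harness abstraction; the sketch below follows its quickstart pattern as we understand it (the model choice is an arbitrary placeholder, and the exact API may differ between releases; the project has since been renamed langtest):

```python
# pip install nlptest
from nlptest import Harness

# Wrap a model so the library can generate and run perturbed test cases against it.
harness = Harness(task="ner",
                  model="dslim/bert-base-NER",  # placeholder NER model
                  hub="huggingface")

harness.generate()        # create robustness / bias / representation / toxicity test cases
harness.run()             # evaluate the wrapped model on the generated cases
print(harness.report())   # pass/fail summary per test category
```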

Talk by: David Talby



Rapidly Implementing Major Retailer API at the Hershey Company

2023-07-26 · Watch video
Simon Whiteley (Advancing Analytics), Jordan Donmoyer

Accurate, reliable, and timely data is critical for CPG companies to stay ahead in highly competitive retailer relationships, and for a company like the Hershey Company, the commercial relationship with Walmart is one of the most important. The team at Hershey found themselves with a looming deadline for their legacy analytics services and targeted a migration to the brand-new Walmart Luminate API. Working in partnership with Advancing Analytics, the Hershey Company leveraged a metadata-driven lakehouse architecture to rapidly onboard the new Luminate API, helping the category management teams to overhaul how they measure, predict, and plan their business operations.

In this session, we will discuss the impact Luminate has had on Hershey's business, covering key areas such as sales, supply chain, and retail field execution, as well as the technical building blocks that can be used to rapidly provision business users with the data they need, when they need it. We will discuss how key technologies enable this rapid approach, with Databricks Auto Loader ingesting and shaping our data, Delta streaming processing the data through the lakehouse, and Databricks SQL providing a responsive serving layer. The session will include commentary as well as covering the technical journey.
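A minimal sketch of the ingestion step just described, using Databricks Auto Loader to incrementally pick up files delivered from an API extract and stream them into a Delta table; the paths, schema location, and table names are placeholders, not Hershey's metadata-driven framework:

```python
# Auto Loader incrementally discovers new files landed by the Luminate extract job.
raw = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/luminate/_schemas/sales")  # placeholder
    .load("/mnt/luminate/landing/sales/"))                                # placeholder

# Stream the shaped data into the bronze layer of the lakehouse.
(raw.writeStream
    .option("checkpointLocation", "/mnt/luminate/_chk/sales_bronze")      # placeholder
    .trigger(availableNow=True)          # run as an incremental batch on a schedule
    .toTable("luminate.bronze_sales"))   # placeholder table
```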

Talk by: Simon Whiteley and Jordan Donmoyer


Self-Service Data Analytics and Governance at Enterprise Scale with Unity Catalog

2023-07-26 · Watch video

This session focuses on one of the first Unity Catalog implementations for a large-scale enterprise. In this scenario, a cloud-scale analytics platform based on the lakehouse approach serves 7,500 active users, with a potential 1,500 further users who are subject to special governance rules. They consume more than 600 TB of data stored in Delta Lake, growing continuously at more than 1 TB per day, with further growth expected from country-local data. The existing data platform must therefore be extended to let users combine global data with local data from their countries. A new data management approach was required that reflects strict, need-to-know information security rules. The core requirements are: read-only access to global data, write access to local data, and the ability to share the results.

Due to very strict information security requirements and a lack of suitable technology, it had previously been difficult or impossible to analyze and exchange data across disciplines. As a result, a great deal of business potential went unidentified and unrealized.

With recent developments in the technology and the lakehouse approach as a basis, thanks to Unity Catalog we were able to develop a solution that meets high security and process requirements and enables globally secured, interdisciplinary data exchange and analysis at scale. This solution enables the democratization of data, resulting not only in better insights for business management but also in entirely new business cases and products that require a higher degree of data integration, and it encourages cultural change. We highlight technical challenges and solutions, present best practices, and point out the benefits of implementing Unity Catalog for enterprises.

Talk by: Artem Meshcheryakov and Pascal van Bellen



Sponsored by: Avanade | Accelerating Adoption of Modern Analytics and Governance at Scale

2023-07-26 · Watch video

To unlock all the competitive advantage Databricks offers your organization, you might need to update your strategy and methodology for the platform. With over 1,000 Databricks projects completed globally in the last 18 months, we are going to share our insights on the best building blocks to target as you search for efficiency and competitive advantage.

The building blocks supporting this include enterprise metadata and data management services, a data management foundation, and data services and products that enable business units to fully use their data and analytics at scale.

In this session, Avanade data leaders will highlight how Databricks’ modern data stack fits the Azure PaaS and SaaS ecosystem (such as Microsoft Fabric), how Unity Catalog metadata supports automated data operations scenarios, and how we are helping clients measure the business impact and value of modern analytics and governance.

Talk by: Alan Grogan and Timur Mehmedbasic



Sponsored by: ThoughtSpot | Drive Self-Service Adoption Through the Roof with Embedded Analytics

2023-07-26 · Watch video

When it comes to building stickier apps and products to grow your business, there's no greater opportunity than embedded analytics. Data apps that deliver superior user engagement and business value do analytics differently. They take a user-first approach and know how to deliver real-time, AI-powered insights not just to internal employees but to an organization’s customers and partners as well.

Learn how ThoughtSpot Everywhere is helping companies like Emerald natively integrate analytics with other tools in their modern data stack to deliver a blazing-fast and instantly available analytics experience across all the data their users love. Join this session to learn how you can leverage embedded analytics to drive higher app engagement, get your app to market faster, and create new revenue streams.

Talk by: Krishti Bikal and Vika Smilansky



Streaming Schema Drift Discovery and Controlled Mitigation

2023-07-26 · Watch video

When creating streaming workloads with Databricks, it can sometimes be difficult to capture and understand the current structure of your source data. For example, what happens if you are ingesting JSON events from a vendor, and the keys are very sparsely populated, or contain dynamic content? Ideally, data engineers want to "lock in" a target schema in order to minimize complexity and maximize performance for known access patterns. What do you do when your data sources just don't cooperate with that vision? The first step is to quantify how far your current source data is drifting from your established Delta table. But how?

This session will demonstrate a way to capture and visualize drift across all your streaming tables. The next question is, "Now that I see all of the data I'm missing, how do I selectively promote some of these keys into DataFrame columns?" The second half of this session will demonstrate precisely how to do a schema migration with minimal job downtime.
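One way to approximate what is described here (not necessarily the speaker's implementation): ingest with Auto Loader in rescue mode so unexpected JSON keys are captured in the _rescued_data column, measure how often that column is populated to quantify drift, and later promote a specific key to a real column. Paths, table, and key names are placeholders.

```python
from pyspark.sql import functions as F

# 1. Lock in a target schema; anything that doesn't fit is captured in _rescued_data, not dropped.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/vendor_events/schema")   # placeholder
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/landing/vendor_events/")                                     # placeholder
    .writeStream
    .option("checkpointLocation", "/chk/vendor_events/bronze")
    .trigger(availableNow=True)
    .toTable("bronze.vendor_events"))

# 2. Quantify drift: what fraction of rows carried keys outside the locked-in schema?
spark.sql("""
    SELECT avg(CASE WHEN _rescued_data IS NOT NULL THEN 1 ELSE 0 END) AS drift_ratio
    FROM bronze.vendor_events
""").show()

# 3. Selectively promote one rescued key into a first-class column.
promoted = spark.table("bronze.vendor_events").withColumn(
    "campaign_id", F.get_json_object("_rescued_data", "$.campaign_id"))  # placeholder key
```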

Talk by: Alexander Vanadio


Using Cisco Spaces Firehose API as a Stream of Data for Real-Time Occupancy Modeling

2023-07-26 · Watch video

Honeywell manages the control of equipment for hundreds of thousands of buildings worldwide. Many of our outcomes relating to energy and comfort rely on knowing where people are in the building at any one time, so that we can target health and comfort conditions to the areas that are more densely populated. Many of these buildings have Cisco IT infrastructure in them. Using their Wi-Fi access points and the RSSI signal strength from people's laptops and phones, Cisco can calculate the number of people in each area of the building. Cisco Spaces offers this data up as a real-time streaming source. Honeywell HBT has utilized this stream of data by writing Delta Live Tables pipelines to consume this data source.

Honeywell buildings can now receive this firehose data from hundreds of concurrent customers and provide this occupancy data as a service to our vertical offerings in commercial, health, real estate, and education. We will discuss the benefits of using DLT to handle this sort of incoming stream data, and illustrate the pain points we hit and the resolutions we applied to successfully receive the stream of Cisco data. We will illustrate how our DLT pipeline was designed and how it scaled to deal with huge quantities of real-time streaming data.
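A minimal sketch of what a DLT pipeline over such a firehose feed can look like, assuming the events have already been landed in cloud storage as JSON; the paths, columns, expectation, and table names are illustrative, not Honeywell's actual pipeline:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw Cisco Spaces firehose events landed as JSON")
@dlt.expect_or_drop("valid_count", "device_count >= 0")
def occupancy_raw():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/landing/cisco_spaces/"))               # placeholder landing path

@dlt.table(comment="Per-zone occupancy aggregated per minute")
def occupancy_by_zone():
    return (dlt.read_stream("occupancy_raw")
        .withColumn("event_time", F.col("event_time").cast("timestamp"))
        .withWatermark("event_time", "10 minutes")
        .groupBy("building_id", "zone_id", F.window("event_time", "1 minute"))
        .agg(F.max("device_count").alias("occupancy")))
```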

Talk by: Paul Mracek and Chris Inkpen



Using Databricks to Power Insights and Visualizations on the S&P Global Marketplace

2023-07-26 · Watch video

In this session, we will explain the visualizations that serve to shorten the time to insight for our prospects and encourage potential buyers to take the next step and request more information from our commercial team. The S&P Global Marketplace is a discovery and exploration platform that enables prospective buyers and clients to easily search fundamental and alternative datasets from across S&P Global and curated third-party providers. It serves as a digital storefront that provides transparency into data coverage and use cases, reducing the time and effort for clients to find data for their needs. A key feature of Marketplace is our interactive data visualizations that provide insight into the coverage of a dataset and demonstrate how the dataset can be used to make more informed decisions.

The S&P Global Marketplace’s interactive visualizations are displayed in Tableau and are powered by Databricks. The Databricks platform allows for easy integration of S&P Global data and provides a collaborative environment where our team of product managers and data engineers can develop the code to generate each visualization. The team uses the web interface to develop the queries that perform the heavy lifting of data transformation, instead of performing these tasks in Tableau. The final notebook output is saved into a custom data mart (a "golden table"), which is the source for Tableau. We also developed an automated process that refreshes the whole pipeline to ensure Marketplace has up-to-date visualizations.

Talk by: Onik Kurktchian


Why a Major Japanese Financial Institution Chose Databricks To Accelerate its Data AI-Driven Journey

2023-07-26 · Watch video
Yuki Saito (NTT DATA)

In this session, NTT DATA presents a case study involving one of the largest and most prominent financial institutions in Japan. The project involved migrating the institution's largest data analysis platform to Databricks, and required careful navigation of very strict security requirements while accommodating the needs of evolving technical solutions so they could support a wide variety of company structures. This session is for those who want to accelerate their business by effectively utilizing AI as well as BI.

NTT DATA is one of the largest system integrators in Japan, providing data analytics infrastructure to leading companies to help them effectively drive the democratization of data and AI as many in the Japanese market are now adding AI into their BI offering.

Talk by: Yuki Saito


Your LLM, Your Data, Your Infrastructure

2023-07-26 · Watch video

Lamini, the most powerful LLM engine, is the platform for any and every software engineer to ship an LLM into production as rapidly and as easily as possible. In this session, learn how to train your LLM on your own data and infrastructure with a few lines of code using the Lamini library. Get early access to a playground to train any open-source LLM. With Lamini, your own LLM comes with better performance, better data privacy, lower cost, lower latency, and more.

Talk by: Sharon Zhou



ABN Story: Migrating to a Future-Proof Data Platform

2023-07-26 · Watch video

ABN AMRO Bank is one of the leading banks in the Netherlands. It is the third-largest bank in the Netherlands by revenue and number of mortgages held, and has top-management support for the objective of becoming a fully data-driven bank. ABN AMRO started its data journey almost seven years ago and built an on-premises data platform with Hadoop technologies. This data platform has been used by more than 200 data providers and 150 data consumers, and holds more than 3,000 datasets.

Becoming a fully digital bank and addressing the limitations of the on-premises platform required a future-proof data platform, DIAL (digital integration and access layer). ABN AMRO decided to build an Azure cloud-native data platform with the help of Microsoft and Databricks. Last year this cloud-native platform was ready for our data providers and data consumers. Six months ago we started the journey of migrating all the content from the on-premises data platform to the Azure data platform; this very large-scale migration was completed in six months.

In this session, we will focus on three things:

  1. The migration strategy for going from on-premises to a cloud-native platform
  2. Which Databricks solutions were used in the data platform
  3. How the Databricks team assisted in the overall migration

Talk by: Rakesh Singh and Marcel Kramer


Automating Sensitive Data (PII/PHI) Detection

2023-07-26 · Watch video

Healthcare datasets contain both personally identifiable information (PII) and personal health information (PHI) that needs to be de-identified in order to protect patient confidentiality and ensure HIPAA compliance. This private data is easily detected when it is provided in columns labeled with names such as “SSN,” “First Name,” “Full Name,” and “DOB”; however, it is much harder to detect when it is hidden within columns labeled “Doctor Notes,” “Diagnoses,” or “Comments.” HealthVerity, a leader in the HIPAA-compliant exchange of real-world data (RWD) to uncover patient, payer, and genomic insights and power innovation for the healthcare industry, ensures healthcare datasets are de-identified from PII and PHI using elaborate privacy procedures.

During this session, we will demonstrate how to use a low-code/no-code platform to simplify and automate data pipelines that leverage prebuilt ML models to scan data for PHI/PII leakage, quarantine those rows in Unity Catalog when leakage is identified, and move them to a Databricks clean room for analysis.
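HealthVerity's prebuilt models are not public, but as a rough stand-in for the scan-and-quarantine step one can run an open-source PII analyzer such as Microsoft Presidio inside a pandas UDF and route flagged rows to a quarantine table; the column names, table names, and routing logic below are made up:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

@F.pandas_udf(BooleanType())
def has_pii(notes: pd.Series) -> pd.Series:
    # Stand-in detector: flag a row if the open-source analyzer finds any PII entity.
    from presidio_analyzer import AnalyzerEngine  # pip install presidio-analyzer
    analyzer = AnalyzerEngine()
    return notes.fillna("").map(lambda t: len(analyzer.analyze(text=t, language="en")) > 0)

df = spark.table("bronze.doctor_notes")                         # placeholder table
flagged = df.withColumn("pii_detected", has_pii("note_text"))   # placeholder free-text column

# Quarantine rows with suspected leakage; clean rows continue downstream.
flagged.filter("pii_detected").write.mode("append").saveAsTable("quarantine.doctor_notes")
(flagged.filter("NOT pii_detected").drop("pii_detected")
    .write.mode("append").saveAsTable("silver.doctor_notes"))
```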

Talk by: Pouya Barrach-Yousefi and Simon King


Databricks and Delta Lake: Lessons Learned from Building Akamai's Web Security Analytics Product

2023-07-26 · Watch video
Tomer Patel, Itai Yaffe (Nielsen Identity Engine)

Akamai is a leading content delivery network (CDN) and cybersecurity company operating hundreds of thousands of servers in more than 135 countries worldwide. In this session, we will share our experiences and lessons learned from building and maintaining the Web Security Analytics (WSA) product, an interactive analytics platform powered by Databricks and Delta Lake that enables customers to efficiently analyze and take informed action on a high volume of streaming security events.

The WSA platform must be able to serve hundreds of queries per minute, scanning hundreds of terabytes of data from a six-petabyte data lake, with most queries returning results within ten seconds, for both aggregation queries and needle-in-a-haystack queries. This session will cover how to use Databricks SQL warehouses and job clusters cost-effectively, and how to improve query performance using tools and techniques such as Delta Lake, Databricks Photon, and partitioning. This talk will be valuable for anyone looking to build and operate a high-performance analytics platform.

Talk by: Tomer Patel and Itai Yaffe


Data Democratization with Lakehouse: An Open Banking Application Case

2023-07-26 · Watch video

Banco Bradesco represents one of the largest companies in the financial sector in Latin America. They have more than 99 million customers, 79 years of history, and a legacy of data distributed in hundreds of on-premises systems. With the spread of data-driven approaches and the growth of cloud computing adoption, we needed to innovate and adapt to new trends and enable an analytical environment with democratized data.

We will show how more than eight business departments have already engaged in using the Lakehouse exploratory environment, with more than 190 use cases mapped and a multi-bank financial manager. Unlike with on-premises, the cost of each process can be isolated and managed in near real-time, allowing quick responses to cost and budget deviations, while increasing the deployment speed of new features 36 times compared to on-premises.

The data is now used and shared safely and easily between different areas and companies of the group. Also, dashboards within Databricks allow panels to be efficiently "prototyped" with real data, making it easy for the business area to iterate on its real needs and then create a definitive view with all relevant points properly emphasized.

Talk by: Pedro Boareto and Fabio Luis Correia da Silva


Disaster Recovery Strategies for Structured Streams

2023-07-26 · Watch video
Sachin Balgonda Patil (Databricks), Shasidhar Eranti (Databricks)

In recent years, many businesses have adopted real-time streaming applications to enable faster decision making, quicker predictions, and improved customer experiences. Some of these applications drive critical business use cases such as financial fraud detection, loan application processing, and personalized offers. These business-critical applications need robust disaster recovery strategies to recover from catastrophic events and reduce lost uptime. However, most organizations find it hard to set up disaster recovery for streaming applications because they involve continuous data flow, and streaming state and the temporal behavior of data add complexity to the DR strategy. A reliable disaster recovery strategy includes backup, failover, and failback approaches for the streaming application. Unlike batch applications, these steps involve many moving pieces and need a sophisticated approach to ensure that services fail over to the DR region and meet the set RTO and RPO requirements.

In this session, we will cover the following topics with a FINSERV use case demo (a minimal configuration sketch follows the list):

  • Backup strategy: backup of Delta tables, message bus services, and checkpoints including offsets
  • Failover strategy: disabling services in the primary region and starting services in the secondary region with minimum data loss
  • Failback strategy: restarting services in the primary region once all services are restored
  • Common challenges and best practices for backup
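As a hedged illustration of the failover idea (not the speakers' exact approach), the streaming job below takes its source, checkpoint, and target locations from a per-region configuration; failover then amounts to restarting the same code in the secondary region against replicated storage, and the achievable RPO depends on how recently the checkpoint and Delta tables were replicated. Region names, paths, and the environment variable are placeholders.

```python
import os

# Per-region configuration; the job itself is region-agnostic.
REGIONS = {
    "primary":   {"checkpoint": "abfss://chk-eastus@acct.dfs.core.windows.net/payments",
                  "target":     "prod_east.payments.transactions"},
    "secondary": {"checkpoint": "abfss://chk-westus@acct.dfs.core.windows.net/payments",
                  "target":     "prod_west.payments.transactions"},
}
cfg = REGIONS[os.environ.get("ACTIVE_REGION", "primary")]

query = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder; message bus also replicated per region
    .option("subscribe", "payments")
    .load()
    .writeStream
    # The checkpoint (including source offsets) is what gets backed up / replicated;
    # restarting in the secondary region resumes from the last replicated offsets.
    .option("checkpointLocation", cfg["checkpoint"])
    .toTable(cfg["target"]))
```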

Talk by: Shasidhar Eranti and Sachin Balgonda Patil


D-Lite: Integrating a Lightweight ChatGPT-Like Model Based on Dolly into Organizational Workflows

2023-07-26 · Watch video

DLite is a new instruction-following model developed by AI Squared by fine-tuning the smallest GPT-2 model on the Alpaca dataset. Despite having only 124 million parameters, DLite exhibits impressive ChatGPT-like interactivity and can be fine-tuned on a single T4 GPU for less than $15. Due to its relatively small size, DLite can be run locally on a wide variety of compute environments, including laptop CPUs, and can be used without sending data to any third-party API. This lightweight property of DLite makes it highly accessible for personal use, empowering users to integrate machine learning models and advanced analytics into their workflows quickly, securely, and cost-effectively.
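A compressed sketch of the kind of fine-tune the paragraph describes: the 124M-parameter GPT-2 checkpoint trained on instruction-following examples with the Hugging Face Trainer. The dataset identifier, prompt format, and hyperparameters are illustrative, not AI Squared's actual DLite recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # smallest GPT-2 checkpoint
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Alpaca-style instruction data (placeholder dataset id).
ds = load_dataset("tatsu-lab/alpaca", split="train")

def to_features(example):
    # Simple instruction/response prompt; the real formatting may differ.
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return tokenizer(prompt, truncation=True, max_length=512)

tokenized = ds.map(to_features, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dlite-style-gpt2",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           fp16=True),  # fits on a single T4 GPU
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```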

Leveraging DLite within AI Squared's platform can empower organizations to orchestrate the integration of Dolly/DLite into business workflows, creating personalized versions of Dolly/DLite, chaining models or analytics to contextualize Dolly/DLite responses and prompts, and curating new datasets leveraging real-time feedback.

Talk by: Jacob Renn and Ian Sotnek



Event Driven Real-Time Supply Chain Ecosystem Powered by Lakehouse

2023-07-26 · Watch video

As the backbone of Australia’s supply chain, the Australian Rail Track Corporation (ARTC) plays a vital role in the management and monitoring of goods transportation across 8,500 km of its rail network throughout Australia. ARTC provides weighbridges along its track that read train weights as trains pass at speeds of up to 60 kilometers an hour. This information is highly valuable and is required both by ARTC and its customers to provide accurate haulage weight details, analyze technical equipment, and help ensure wagons have been loaded correctly.

A total of 750 trains run across the 8,500 km network each day and generate real-time data at approximately 50 sensor platforms. With the help of Structured Streaming and Delta Lake, ARTC was able to analyze and store:

  • Precise train location
  • Weight of the train in real-time
  • Train crossing time to the second level
  • Train speed, temperature, sound frequency, and friction
  • Train schedule lookups

Once all the IoT data has been pulled together from an IoT event hub, it is processed in real time using Structured Streaming and stored in Delta Lake. To determine each train's GPS location, API calls are then made per minute per train from the lakehouse, and real-time calls are made to another scheduling system to look up customer info. Once the processed and enriched data is stored in Delta Lake, an API layer was created on top of it to expose this data to all consumers.
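A simplified sketch of the flow just described, reading the IoT feed through an Event Hubs Kafka-compatible endpoint, enriching it, and landing it in Delta; the namespace, topic, schema, and table names are placeholders, and the real pipeline's per-minute GPS and scheduling API calls are reduced to a static-table join here:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

schema = (StructType()
    .add("train_id", StringType()).add("site_id", StringType())
    .add("event_time", TimestampType())
    .add("axle_weight_t", DoubleType()).add("speed_kmh", DoubleType()))

readings = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "artc-iot.servicebus.windows.net:9093")  # placeholder namespace
    .option("subscribe", "weighbridge-events")                                   # placeholder topic
    .option("kafka.security.protocol", "SASL_SSL")
    # (SASL JAAS config with the Event Hubs connection string omitted here)
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

# Stand-in for the per-minute scheduling lookup: join against a reference schedule table.
schedule = spark.table("reference.train_schedule")             # placeholder
enriched = readings.join(schedule, "train_id", "left")

(enriched.writeStream
    .option("checkpointLocation", "/chk/weighbridge")           # placeholder
    .toTable("supplychain.weighbridge_readings"))               # placeholder
```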

The outcome: increased transparency on weight data, which is now made available to customers; a digital data ecosystem that ARTC’s customers now use to meet their KPIs and planning needs; and the ability to determine temporary speed restrictions across the network to improve train scheduling accuracy and to schedule network maintenance based on train schedules and speed.

Talk by: Deepak Sekar and Harsh Mishra
