Databricks DATA + AI Summit 2023

Using Databricks to Power Insights and Visualizations on the S&P Global Marketplace

2023-07-26

Onik Kurktchian

Databricks Tableau

In this session, we will explain the visualizations that serve to shorten the time to insight for our prospects and encourage potential buyers to take the next step and request more information from our commercial team. The S&P Global Marketplace is a discovery and exploration platform that enables prospective buyers and clients to easily search fundamental and alternative datasets from across S&P Global and curated third-party providers. It serves as a digital storefront that provides transparency into data coverage and use cases, reducing the time and effort for clients to find data for their needs. A key feature of Marketplace is our interactive data visualizations that provide insight into the coverage of a dataset and demonstrate how the dataset can be used to make more informed decisions.

The S&P Global Marketplace’s interactive visualizations are displayed in Tableau and are powered by Databricks. The Databricks platform allows for easy integration of S&P Global data and provides a collaborative environment where our team of product managers and data engineers can develop the code to generate each visualization. The team uses the web interface to develop the queries that perform the heavy lifting of data transformation, instead of performing these tasks in Tableau. The final notebook output is saved into a custom data mart (“golden table”), which is the source for Tableau. We also developed an automated process that reruns the whole pipeline to ensure Marketplace has up-to-date visualizations.

Talk by: Onik Kurktchian

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Why a Major Japanese Financial Institution Chose Databricks To Accelerate its Data AI-Driven Journey

2023-07-26

Yuki Saito (NTT DATA)

AI/ML Analytics BI Data Analytics Databricks Cyber Security

In this session, NTT DATA presents a case study involving one of the largest and most prominent financial institutions in Japan. The project involved migration from the largest data analysis platform to Databricks, which required careful navigation of very strict security requirements while accommodating the needs of evolving technical solutions so they could support a wide variety of company structures. This session is for those who want to accelerate their business by effectively utilizing AI as well as BI.

NTT DATA is one of the largest system integrators in Japan, providing data analytics infrastructure to leading companies to help them effectively drive the democratization of data and AI as many in the Japanese market are now adding AI into their BI offering.

Talk by: Yuki Saito

Your LLM, Your Data, Your Infrastructure

2023-07-26

Sharon Zhou

Databricks LLM MLOps

Lamini, the most powerful LLM engine, is the platform for any and every software engineer to ship an LLM into production as rapidly and as easily as possible. In this session, learn how to train your LLM on your own data and infrastructure with a few lines of code using the Lamini library. Get early access to a playground to train any open-source LLM. With Lamini, your own LLM comes with better performance, better data privacy, lower cost, lower latency, and more.

Talk by: Sharon Zhou

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

ABN Story: Migrating to Future Proof Data Platform

2023-07-26

Rakesh Singh , Marcel Kramer

Azure Cloud Computing Databricks Hadoop Microsoft

ABN AMRO Bank is one of the leading banks in the Netherlands. It is the third largest bank in the Netherlands by revenue and number of mortgages held within the Netherlands, and has top management support for the objective of becoming a fully data-driven bank. ABN AMRO started its data journey almost seven years ago and has built a data platform on-premises with Hadoop technologies. This data platform serves more than 200 data providers and 150 data consumers, and holds more than 3,000 datasets.

Becoming a fully digital bank and addressing the limitations of the on-premises platform requires a future-proof data platform, DIAL (digital integration and access layer). ABN AMRO decided to build an Azure cloud-native data platform with the help of Microsoft and Databricks. Last year this cloud-native platform was ready for our data providers and data consumers. Six months ago we started the journey of migrating all the content from the on-premises data platform to the Azure data platform. This was a very large-scale migration and was achieved in six months.

In this session, we will focus on three things:

1. The migration strategy going from on-premises to a cloud-native platform
2. Which Databricks solutions were used in the data platform
3. How the Databricks team assisted in the overall migration

Talk by: Rakesh Singh and Marcel Kramer

Automating Sensitive Data (PII/PHI) Detection

2023-07-26

Simon King , Pouya Barrach-Yousefi

AI/ML Databricks

Healthcare datasets contain both personally identifiable information (PII) and personal health information (PHI) that need to be de-identified in order to protect patient confidentiality and ensure HIPAA compliance. This sensitive data is easily detected when it’s provided in columns labeled with names such as “SSN,” “First Name,” “Full Name,” and “DOB”; however, it is much harder to detect when it is hidden within columns labeled “Doctor Notes,” “Diagnoses,” or “Comments.” HealthVerity, a leader in the HIPAA-compliant exchange of real-world data (RWD) to uncover patient, payer and genomic insights and power innovation for the healthcare industry, ensures healthcare datasets are de-identified from PII and PHI using elaborate privacy procedures.

During this session, we will demonstrate how to use a low-code/no-code platform to simplify and automate data pipelines that leverage prebuilt ML models to scan data for PHI/PII leakage, quarantine the affected rows in Unity Catalog when leakage is identified, and move them to a Databricks clean room for analysis.
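To make the scan-and-quarantine idea concrete, here is a minimal sketch in plain Python. It is not the platform or the ML models described in the session: it swaps in a few regular expressions for the prebuilt models, and the column name `doctor_notes`, the pattern labels, and the `pii_hits` field are all hypothetical.

```python
import re

# Illustrative patterns only (an assumption of this sketch); a production
# scanner would rely on trained models and far broader rules.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def detect_pii(text):
    """Return the sorted names of the patterns that match the text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))


def split_rows(rows, free_text_columns):
    """Split rows into (clean, quarantined) based on matches in free-text columns."""
    clean, quarantined = [], []
    for row in rows:
        hits = set()
        for col in free_text_columns:
            hits.update(detect_pii(row.get(col) or ""))
        if hits:
            # Record which patterns fired so the row can be reviewed later.
            quarantined.append({**row, "pii_hits": sorted(hits)})
        else:
            clean.append(row)
    return clean, quarantined


rows = [
    {"id": 1, "doctor_notes": "Patient reports mild headache."},
    {"id": 2, "doctor_notes": "Call patient at 555-867-5309, SSN 123-45-6789."},
]
clean, quarantined = split_rows(rows, ["doctor_notes"])
```

In this sketch, the quarantined list stands in for a separate quarantine table; only the clean rows would continue downstream.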

Talk by: Pouya Barrach-Yousefi and Simon King

Databricks and Delta Lake: Lessons Learned from Building Akamai's Web Security Analytics Product

2023-07-26

Tomer Patel , Itai Yaffe (Nielsen Identity Engine)

Analytics Data Lake Databricks Delta Cyber Security SQL

Akamai is a leading content delivery network (CDN) and cybersecurity company operating hundreds of thousands of servers in more than 135 countries worldwide. In this session, we will share our experiences and lessons learned from building and maintaining the Web Security Analytics (WSA) product, an interactive analytics platform powered by Databricks and Delta Lake that enables customers to efficiently analyze and take informed action on a high volume of streaming security events.

The WSA platform must be able to serve hundreds of queries per minute, scanning hundreds of terabytes of data from a six-petabyte data lake, with most queries returning results within ten seconds, for both aggregation queries and needle-in-a-haystack queries. This session will cover how to use Databricks SQL warehouses and job clusters cost-effectively, and how to improve query performance using tools and techniques such as Delta Lake, Databricks Photon, and partitioning. This talk will be valuable for anyone looking to build and operate a high-performance analytics platform.
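As a rough illustration of why partitioning helps needle-in-a-haystack queries, the sketch below shows file pruning in plain Python. It is not Akamai's implementation: the `DataFile` structure, the `event_date` partition column, and the timestamp statistics are assumptions made for the example. It loosely mirrors how a partition value and per-file min/max statistics let an engine skip files that cannot contain matching rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    event_date: str  # partition value (ISO date string), hypothetical column
    min_ts: int      # smallest event timestamp stored in the file
    max_ts: int      # largest event timestamp stored in the file


def files_to_scan(files, event_date, ts_from, ts_to):
    """Keep only files whose partition and timestamp range can contain matches."""
    return [
        f for f in files
        if f.event_date == event_date                   # partition pruning
        and f.max_ts >= ts_from and f.min_ts <= ts_to   # min/max file skipping
    ]


files = [
    DataFile("d=2023-06-01/part-0", "2023-06-01", 0, 999),
    DataFile("d=2023-06-02/part-0", "2023-06-02", 1000, 1999),
    DataFile("d=2023-06-02/part-1", "2023-06-02", 2000, 2999),
]
selected = files_to_scan(files, "2023-06-02", 2100, 2200)
```

Here only one of the three files needs to be read; at data lake scale, the same idea determines how much of the lake a query has to touch.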

Talk by: Tomer Patel and Itai Yaffe

Data Democratization with Lakehouse: An Open Banking Application Case

2023-07-26

Pedro Boareto , Fabio Luis Correia da Silva

Cloud Computing Data Lakehouse Databricks

Banco Bradesco represents one of the largest companies in the financial sector in Latin America. They have more than 99 million customers, 79 years of history, and a legacy of data distributed in hundreds of on-premises systems. With the spread of data-driven approaches and the growth of cloud computing adoption, we needed to innovate and adapt to new trends and enable an analytical environment with democratized data.

We will show how more than eight business departments have already engaged in using the Lakehouse exploratory environment, with more than 190 use cases mapped and a multi-bank financial manager. Unlike with on-premises, the cost of each process can be isolated and managed in near real-time, allowing quick responses to cost and budget deviations, while increasing the deployment speed of new features 36 times compared to on-premises.

The data is now used and shared safely and easily between different areas and companies of the group. Also, the view of dashboards within Databricks allows panels to be efficiently "prototyped" with real data, allowing an easy interaction of the business area with its real needs and then creating a definitive view with all relevant points duly stressed.

Talk by: Pedro Boareto and Fabio Luis Correia da Silva

Disaster Recovery Strategies for Structured Streams

2023-07-26

Sachin Balgonda Patil (Databricks) , Shasidhar Eranti (Databricks)

Databricks Delta Data Streaming

In recent years, many businesses have adopted real-time streaming applications to enable faster decision making, quicker predictions, and improved customer experiences. A few of these applications drive critical business use cases such as financial fraud detection, loan application processing, and personalized offers. These business-critical applications need robust disaster recovery (DR) strategies to recover from catastrophic events and reduce downtime. However, most organizations find it hard to set up disaster recovery for streaming applications because it involves continuous data flow. The streaming state and temporal behavior of the data add complexity to the DR strategy. A reliable disaster recovery strategy includes backup, failover, and failback approaches for the streaming application. Unlike in batch applications, these steps include many moving elements and need a sophisticated approach to ensure that the services fail over to the DR region and meet the set RTO and RPO requirements.

In this session, we will cover the following topics with a FINSERV use case demo:

Backup strategy: backup of Delta tables, message bus services, and checkpoints, including offsets
Failover strategy: disabling services in the primary region and starting the services in the secondary region with minimum data loss
Failback strategy: restarting the services in the primary region once all the services are restored
Common challenges and best practices for backup
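The RTO (recovery time objective) and RPO (recovery point objective) mentioned above reduce to simple arithmetic once a failover drill is timed. The sketch below is an illustration under assumed inputs, not material from the session: the function name `dr_report` and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def dr_report(last_backup, failure, service_restored, rpo, rto):
    """Compare one failover drill against RPO and RTO targets."""
    # Data written after the last backup may be lost: this is the RPO exposure.
    data_loss_window = failure - last_backup
    # Time from failure until the secondary region serves traffic: the RTO exposure.
    downtime = service_restored - failure
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "meets_rpo": data_loss_window <= rpo,
        "meets_rto": downtime <= rto,
    }


report = dr_report(
    last_backup=datetime(2023, 6, 1, 12, 0),
    failure=datetime(2023, 6, 1, 12, 4),
    service_restored=datetime(2023, 6, 1, 12, 30),
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=15),
)
```

In this drill the checkpoint backup was recent enough to satisfy the RPO, but bringing services up in the secondary region took longer than the RTO allows, which points at the failover step rather than the backup cadence.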

Talk by: Shasidhar Eranti and Sachin Balgonda Patil

D-Lite: Integrating a Lightweight ChatGPT-Like Model Based on Dolly into Organizational Workflows

2023-07-26

Ian Sotnek , Jacob Renn

AI/ML Analytics API Databricks LLM MLOps

DLite is a new instruction-following model developed by AI Squared by fine-tuning the smallest GPT-2 model on the Alpaca dataset. Despite having only 124 million parameters, DLite exhibits impressive ChatGPT-like interactivity and can be fine-tuned on a single T4 GPU for less than $15.00. Due to its small relative size, DLite can be run locally on a wide variety of compute environments, including laptop CPUs, and can be used without sending data to any third-party API. This lightweight property of DLite makes it highly accessible for personal use, empowering users to integrate machine learning models and advanced analytics into their workflows quickly, securely, and cost-effectively.

Leveraging DLite within AI Squared's platform can empower organizations to orchestrate the integration of Dolly/DLite into business workflows: creating personalized versions of Dolly/DLite, chaining models or analytics to contextualize Dolly/DLite responses and prompts, and curating new datasets from real-time feedback.

Talk by: Jacob Renn and Ian Sotnek

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Event Driven Real-Time Supply Chain Ecosystem Powered by Lakehouse

2023-07-26

Harsh Mishra , Deepak Sekar

API Data Lakehouse Databricks Delta IoT KPI

As the backbone of Australia’s supply chain, the Australian Rail Track Corporation (ARTC) plays a vital role in the management and monitoring of goods transportation across 8,500 km of its rail network throughout Australia. ARTC provides weighbridges along its track which read train weights as they pass at speeds of up to 60 kilometers an hour. This information is highly valuable and is required both by ARTC and its customers to provide accurate haulage weight details, analyze technical equipment, and help ensure wagons have been loaded correctly.

A total of 750 trains run across a network of 8,500 km each day and generate real-time data at approximately 50 sensor platforms. With the help of structured streaming and Delta Lake, ARTC was able to analyze and store:

Precise train location
Weight of the train in real-time
Train crossing time, to the second
Train speed, temperature, sound frequency, and friction
Train schedule lookups

Once all the IoT data has been pulled together from an IoT event hub, it is processed in real time using structured streaming and stored in Delta Lake. To determine each train's GPS location, API calls are made per minute per train from the Lakehouse. API calls are also made in real time to a separate scheduling system to look up customer information. Once the processed and enriched data is stored in Delta Lake, an API layer on top of it exposes this data to all consumers.

The outcome: increased transparency on weight data, which is now available to customers; a digital data ecosystem that ARTC’s customers now use to meet their KPIs and planning needs; and the ability to determine temporary speed restrictions across the network to improve train scheduling accuracy and to schedule network maintenance based on train schedules and speed.

Talk by: Deepak Sekar and Harsh Mishra

IFC's MALENA Provides Analytics for ESG Reviews in Emerging Markets Using NLP and LLMs

2023-07-26

Atiyah Curmally , Blaise Sandwidi

AI/ML Analytics Data Lake Databricks LLM MLOps

International Finance Corporation (IFC) is using data and AI to build machine learning solutions that create analytical capacity to support the review of ESG issues at scale. This includes natural language processing and requires entity recognition and other applications to support the work of IFC’s experts and other investors working in emerging markets. These algorithms are available via IFC’s Machine Learning ESG Analyst (MALENA) platform to enable rapid analysis, increase productivity, and build investor confidence. In this manner, IFC, a development finance institution with the mandate to address poverty in emerging markets, is making use of its historical datasets and open source AI solutions to build custom-AI applications that democratize access to ESG capacity to read and classify text.

In this session, you will learn the unique flexibility of the Apache Spark™ ecosystem from Databricks and how that has allowed IFC’s MALENA project to connect to scalable data lake storage, use different natural language processing models and seamlessly adopt MLOps.

Talk by: Atiyah Curmally and Blaise Sandwidi

Increasing Data Trust: Enabling Data Governance on Databricks Using Unity Catalog & ML-Driven MDM

2023-07-26

Risha Ravindranath , Maggie Davis

AI/ML Data Governance Data Lakehouse Data Management Data Quality Databricks

As part of Comcast Effectv’s transformation into a completely digital advertising agency, it was key to develop an approach to manage and remediate data quality issues related to customer data so that the sales organization is using reliable data to enable data-driven decision making. Like many organizations, Effectv's customer lifecycle processes are spread across many systems utilizing various integrations between them. This results in key challenges like duplicate and redundant customer data that requires rationalization and remediation. Data is at the core of Effectv’s modernization journey with the intended result of winning more business, accelerating order fulfillment, reducing make-goods and identifying revenue.

In partnership with Slalom Consulting, Comcast Effectv built a traditional lakehouse on Databricks to ingest data from all of these systems, but with a twist: they anchored every engineering decision in how it will enable their data governance program.

In this session, we will touch upon the data transformation journey at Effectv and dive deeper into the implementation of data governance leveraging Databricks solutions such as Delta Lake, Unity Catalog and DBSQL. Key focus areas include how we baked master data management into our pipelines by automating the matching and survivorship process, and how we bring it all together for the data consumer via DBSQL to use our certified assets in bronze, silver and gold layers.
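For readers new to master data management, the sketch below shows what "matching and survivorship" can look like in its simplest form. It is not Effectv's pipeline: it swaps in a rule-based match key for the ML-driven matching the session title mentions, and the field names (`name`, `postal_code`, `phone`, `updated_at`) and the "newest non-empty value wins" rule are assumptions made for the example.

```python
from collections import defaultdict


def match_key(record):
    """Crude match key: lowercase alphanumerics of the name plus postal code."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return (name, record.get("postal_code", ""))


def survive(records):
    """Build a golden record: for each field, keep the newest non-empty value."""
    golden = {}
    # Oldest first, so later (newer) non-empty values overwrite earlier ones.
    for rec in sorted(records, key=lambda r: r["updated_at"]):
        for field, value in rec.items():
            if value not in (None, ""):
                golden[field] = value
    return golden


def master(records):
    """Group records that share a match key and emit one golden record per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    return [survive(group) for group in groups.values()]


records = [
    {"source": "crm", "name": "Acme Corp.", "postal_code": "19103",
     "phone": "", "updated_at": "2023-01-10"},
    {"source": "billing", "name": "ACME Corp", "postal_code": "19103",
     "phone": "215-555-0100", "updated_at": "2022-11-02"},
    {"source": "crm", "name": "Globex", "postal_code": "60601",
     "phone": None, "updated_at": "2023-03-01"},
]
golden_records = master(records)
```

The two Acme records collapse into one golden record that keeps the newer name from the CRM system and the phone number that only the billing system had.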

By making thoughtful decisions about structuring data in Unity Catalog and baking MDM into ETL pipelines, you can greatly increase the quality, reliability, and adoption of single-source-of-truth data so your business users can stop spending cycles on wrangling data and spend more time developing actionable insights for your business.

Talk by: Maggie Davis and Risha Ravindranath

Real-Time Reporting and Analytics for Construction Data Powered by Delta Lake and DBSQL

2023-07-26

Hari Rajaram , Jay Yang

Analytics Data Lakehouse Databricks dbt Delta Kafka

Procore is construction project management software that helps construction professionals efficiently manage their projects and collaborate with their teams. Our mission is to connect everyone in construction on a global platform.

Procore is the system of record for all construction projects. Our customers need to access the data in near real-time for construction insights. Enhanced reporting is a self-service operational reporting module that allows quick data access with consistency to thousands of tables and reports.

The Procore data platform team rebuilt the module (originally built on a relational database) using Databricks and Delta Lake. We used Apache Spark™ streaming to maintain a consistent state on the ingestion side from Kafka, and we plan to leverage the full functionality of DBSQL, using the serverless SQL warehouse to read the medallion models (built via dbt) in Delta Lake. In addition, Unity Catalog and Delta Sharing helped us share the data across regions seamlessly. This design enabled us to improve the p95 and p99 read times by xx% (reads were initially timing out).
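For reference, p95 and p99 are the 95th and 99th percentile read times: the latency that 95% or 99% of requests stay at or below. A minimal sketch using the nearest-rank definition follows; the latency values are made up for illustration, and other percentile definitions (for example, interpolating ones) give slightly different numbers.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical read latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```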

Attend this session to hear about the learnings and experience of building a Data Lakehouse architecture.

Talk by: Jay Yang and Hari Rajaram

Taking Your Cloud Vendor to the Next Level: Solving Complex Challenges with Azure Databricks

2023-07-26

Azure Cloud Computing Databricks Kafka

Akamai's content delivery network (CDN) processes about 30% of the internet's daily traffic, resulting in a massive amount of data that presents engineering challenges, both internally and with cloud vendors. In this session, we will discuss the barriers faced while building a data infrastructure on Azure, Databricks, and Kafka to meet strict SLAs, hitting the limits of some of our cloud vendors’ services. We will describe the iterative process of re-architecting a massive scale data platform using the aforementioned technologies.

We will also delve into how Akamai is now able to quickly ingest terabytes of data and make them available to customers, as well as efficiently query petabytes of data and return results within 10 seconds for most queries. This discussion will provide valuable insights for attendees and organizations seeking to effectively process and analyze large amounts of data.

Evaluating LLM-based Applications

2023-07-26

Josh Tobin

Databricks LLM MLOps

Evaluating LLM-based applications can feel like more of an art than a science. In this workshop, we'll give a hands-on introduction to evaluating language models. You'll come away with knowledge and tools you can use to evaluate your own applications, and answers to questions like:

Where do I get evaluation data from, anyway?
Is it possible to evaluate generative models in an automated way?
What metrics can I use?
What's the role of human evaluation?
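As one concrete answer to the "what metrics can I use?" question, the sketch below implements two simple automated metrics often used for question-answering outputs: exact match and token-level F1. These are common choices rather than the workshop's own material, and the normalization rule (lowercasing and dropping punctuation) is an assumption of this example.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()


def exact_match(prediction, reference):
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, ignoring word order."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "The capital is Paris."
short_answer_f1 = token_f1("Paris", reference)
reordered_f1 = token_f1("Paris is the capital", reference)
reordered_em = exact_match("Paris is the capital", reference)
```

Metrics like these are cheap to automate but only reward surface overlap, which is one reason human evaluation still has a role.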

Talk by: Josh Tobin

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Best Exploration of Columnar Shuffle Design

2023-07-26

Binwei Yang , Rong Ma

Data Lakehouse Databricks DWH Spark SQL

To significantly improve the performance of Spark SQL, there has been a trend over the past several years to offload Spark SQL execution to highly optimized native libraries or accelerators, such as Photon from Databricks, Nvidia's RAPIDS plug-in, and the open source Gluten project initiated by Intel and Kyligence. Thanks to the multi-fold performance improvement from these solutions, more and more Apache Spark™ users have started to adopt the new technology. One characteristic of native libraries is that they all use a columnar data format as the basic data format, because the columnar format has an intrinsic affinity to vectorized data processing using SIMD instructions. Vanilla Spark's shuffle, however, is based on Spark's internal row data format. The high overhead of columnar-to-row and row-to-columnar conversion during the shuffle makes reusing the current shuffle impractical. Given the importance of the shuffle service in Spark, we have to implement an efficient columnar shuffle, which brings a couple of new challenges, such as splitting columnar data and supporting dictionaries during shuffle.

In this session, we will share the exploration process of the columnar shuffle design during our Gazelle and Gluten development, and best practices for implementing the columnar shuffle service. We will also share what we learned from the development of vanilla Spark's shuffle, for example, how to address the small files issue, and then propose the new shuffle solution. We will show the performance comparison between columnar shuffle and vanilla Spark's row-based shuffle. Finally, we will share how the new built-in accelerators such as QAT and IAA in the latest Intel processors are used in our columnar shuffle service to boost performance.
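The "split of columnar data" challenge can be pictured with a small sketch. This is a conceptual illustration in plain Python, not Gluten or Gazelle code: the batch is modeled as a dict of equal-length lists, the partitioning key is assumed to be an integer column, and partition assignment is a simple modulo rather than Spark's hash partitioner. The point is that rows are routed to shuffle partitions by gathering each column by index, without ever materializing row objects.

```python
def split_batch(batch, key, num_partitions):
    """Split a column batch (dict of equal-length lists) into per-partition batches."""
    # Decide the target partition for every row from the key column alone.
    indices = [[] for _ in range(num_partitions)]
    for row, value in enumerate(batch[key]):
        indices[value % num_partitions].append(row)
    # Gather each column by the selected row indices; data stays columnar throughout.
    return [
        {name: [column[i] for i in rows] for name, column in batch.items()}
        for rows in indices
    ]


batch = {"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]}
partitions = split_batch(batch, key="id", num_partitions=2)
```

A native implementation would do the same gather over contiguous buffers, which is where vectorized instructions and dictionary-encoded columns come into play.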

Talk by: Binwei Yang and Rong Ma

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Best Practices for Running Efficient Apache Spark™ Workloads on Databricks

2023-07-26

Justin Breese (Databricks)

Data Lakehouse Databricks Spark

Every day thousands of customers choose to run business-critical Spark workloads on the Databricks Lakehouse Platform, a platform built by the creators of Apache Spark™. These customers take advantage of platform capabilities such as fully managed compute resources, dynamic autoscaling, an integrated workflow orchestration tool, and Photon, the extremely fast vectorized execution engine. All of these make the Databricks Lakehouse Platform the best place to run Spark workloads, providing operational benefits as well as tremendous price/performance value.

This session, which includes live demos, will cover these and other platform capabilities that can help you build your next optimized Spark application.

Talk by: Justin Breese

Databricks Lakehouse: How BlackBerry is Revolutionizing Cybersecurity Services Worldwide

2023-07-26

Robert Lombardi , Justin Lai (Arctic Wolf)

Cloud Computing Data Lakehouse Databricks

Cybersecurity incidents are costly, and an endpoint detection and response (EDR) solution enables their detection as quickly as possible. Effectively detecting cybersecurity incidents requires the collection of millions of data points, and storing and querying endpoint data presents considerable engineering challenges. These include quickly moving local data from endpoints to a single table in the cloud and enabling performant querying against it.

The need to avoid internal data siloing within BlackBerry was paramount as multiple teams required access to the data to deliver an effective EDR solution for the present and the future. Databricks tooling enabled us to break down our data silos and iteratively improve our EDR pipeline to ingest data faster and reduce querying latency by more than 20% while reducing costs by more than 30%.

In this session, we will share the journey, lessons learned, and the future for collecting, storing, governing, and sharing data from endpoints in Databricks. Building EDR on Databricks helped us accelerate the deployment of our data platform.

Talk by: Justin Lai and Robert Lombardi

Databricks SQL: Why the Best Serverless Data Warehouse is a Lakehouse

2023-07-26

Cyrielle Simeone , Miranda Luna (Databricks)

AI/ML Analytics BI Cloud Computing Data Lake Data Lakehouse

Many organizations rely on complex cloud data architectures that create silos between applications, users and data. This fragmentation makes it difficult to access accurate, up-to-date information for analytics, often resulting in the use of outdated data. Enter the lakehouse, a modern data architecture that unifies data, AI, and analytics in a single location.

This session explores why the lakehouse is the best data warehouse, featuring success stories, use cases and best practices from industry experts. You'll discover how to unify and govern business-critical data at scale to build a curated data lake for data warehousing, SQL and BI. Additionally, you'll learn how Databricks SQL can help lower costs and get started in seconds with on-demand, elastic SQL serverless warehouses, and how to empower analytics engineers and analysts to quickly find and share new insights using their preferred BI and SQL tools such as Fivetran, dbt, Tableau, or Power BI.

Talk by: Miranda Luna and Cyrielle Simeone

Data Extraction and Sharing Via The Delta Sharing Protocol

2023-07-26

Roger Dunn

Cloud Computing Data Lakehouse Databricks Delta JSON Parquet

The Delta Sharing open protocol for secure sharing and distribution of Lakehouse data is designed to reduce friction in getting data to users. Delivering custom data solutions from this protocol further leverages the technical investment committed to your Delta Lake infrastructure. There are key design and computational concepts unique to Delta Sharing to know when undertaking development. And there are pitfalls and hazards to avoid when delivering modern cloud data to traditional data platforms and users.

In this session, we introduce Delta Sharing Protocol development and examine our journey and the lessons learned while creating the Delta Sharing Excel Add-in. We will demonstrate scenarios of overfetching, underfetching, and interpretation of types. We will suggest methods to overcome these development challenges. The session will combine live demonstrations that exercise the Delta Sharing REST protocol with detailed analysis of the responses. The demonstrations will elaborate on optional capabilities of the protocol’s query mechanism, and how they are used and interpreted in real-life scenarios. As a reference baseline for data professionals, the Delta Sharing exercises will be framed relative to SQL counterparts. Specific attention will be paid to how they differ, and how Delta Sharing’s Change Data Feed (CDF) can power next-generation data architectures. The session will conclude with a survey of available integration solutions for getting the most out of your Delta Sharing environment, including frameworks, connectors, and managed services.
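To give a feel for the protocol's query mechanism, the sketch below builds (but does not send) a table query request. The path layout and the `predicateHints` and `limitHint` body fields follow the published Delta Sharing protocol as best understood here and should be checked against the specification; the endpoint URL, token, and share, schema, and table names are placeholders. Because both hints are best-effort, a server may return more data than asked for (overfetching), so the client still has to apply its own filter and limit to the returned rows.

```python
import json
from urllib.request import Request


def build_query_request(endpoint, token, share, schema, table,
                        predicate_hints=None, limit_hint=None):
    """Build a POST request for a Delta Sharing table query without sending it."""
    body = {}
    if predicate_hints:
        body["predicateHints"] = predicate_hints  # best-effort filter hints
    if limit_hint is not None:
        body["limitHint"] = limit_hint            # best-effort row limit hint
    url = f"{endpoint.rstrip('/')}/shares/{share}/schemas/{schema}/tables/{table}/query"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder endpoint, token, and object names for illustration only.
request = build_query_request(
    "https://sharing.example.com/delta-sharing/",
    token="REPLACE_ME",
    share="sales_share",
    schema="retail",
    table="orders",
    predicate_hints=["order_date >= '2023-01-01'"],
    limit_hint=1000,
)
```

The rough SQL counterpart would be a `SELECT` with a `WHERE` clause and a `LIMIT`, except that in SQL the engine guarantees both, while here they only guide which files the server returns.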

Attendees are encouraged to be familiar with REST, JSON, and modern programming concepts. A working knowledge of Delta Lake, the Parquet file format, and the Delta Sharing Protocol is advised.

Talk by: Roger Dunn

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Data Globalization at Conde Nast Using Delta Sharing

2023-07-26 Watch

video

Zachary Bannor

Analytics AWS Data Lakehouse Databricks Delta GDPR/CCPA

Databricks has been an essential part of the Conde Nast architecture for the last few years. Prior to building our centralized data platform, “evergreen,” we faced the same challenges as many other organizations: siloed data, duplicated engineering effort, and a lack of collaboration between data teams. These problems led to mistrust in data sets and made it difficult to scale to meet the strategic globalization plan we had for Conde Nast.

Over the last few years we have been extremely successful in building a centralized data platform on Databricks in AWS, fully embracing the lakehouse vision from end-to-end. Now, our analysts and marketers can derive the same insights from one dataset and data scientists can use the same datasets for use cases such as personalization, subscriber propensity models, churn models and on-site recommendations for our iconic brands.

In this session, we’ll discuss how we plan to incorporate Unity Catalog and Delta Sharing as the next phase of our globalization mission. The evergreen platform has become the global standard for data processing and analytics at Conde. To manage worldwide data and comply with GDPR requirements, we need to make sure data is processed in the appropriate region and PII is handled appropriately. At the same time, we need a global view of the data so that we can make business decisions at the global level. We’ll talk about how Delta Sharing gives us a simple, secure way to share de-identified datasets across regions in order to make these strategic business decisions while complying with security requirements. Additionally, we’ll discuss how Unity Catalog allows us to secure, govern and audit these datasets in an easy and scalable manner.

Talk by: Zachary Bannor

Embrace First-Party Customer Data for Marketing and Advertising using Data Cleanrooms

2023-07-26 Watch

video

Jordan Peck (Snowplow)

Data Lakehouse Databricks Marketing Snowplow

The digital marketing and advertising industry is going through revolutionary change in 2023, with technical, organisational, cultural and regulatory overhaul. As a result, measuring digital advertising effectiveness or coordinating and running highly targeted and effective ad campaigns is becoming more challenging than ever.

First party customer behavioral data gives organizations a true competitive advantage and the ability to outperform their peers in the battle for customer attention and brand loyalty.

However, first party customer data is still used sparingly across the digital ad ecosystem, and there are few tools or frameworks to allow advertisers to unlock the value in what first party data they have.

This session will show you how Snowplow allows organizations to deeply understand their users' behavior and intent by creating the best quality behavioral data. It will also explain that when this is combined with the Databricks Lakehouse and data clean rooms, brands can now unlock insights that were previously unachievable, and activate their first party customer behavioral data into highly effective, personalized and creative ad campaigns.

In this session you will learn:
- Why first party data can be the ultimate in competitive advantage for digital advertisers
- How data clean rooms combined with Snowplow behavioral data enable better insights and more impactful ad targeting
- What specific marketing and advertising use cases are possible when utilizing a data clean room on top of the Databricks Lakehouse

Talk by: Jordan Peck

Embracing the Future of Data Engineering: The Serverless, Real-Time Lakehouse in Action

2023-07-26 Watch

video

Frank Munz (Databricks)

AWS Kinesis Dashboard Data Engineering Data Lakehouse Databricks

As we venture into the future of data engineering, streaming and serverless technologies take center stage. In this fun, hands-on, in-depth and interactive session you can learn about the essence of future data engineering today.

We will tackle the challenge of processing streaming events continuously created by hundreds of sensors in the conference room from a serverless web app (bring your phone and be a part of the demo). The focus is on the system architecture, the involved products and the solution they provide. Which Databricks product, capability and settings will be most useful for our scenario? What does streaming really mean and why does it make our life easier? What are the exact benefits of serverless and how "serverless" is a particular solution?

Leveraging the power of the Databricks Lakehouse Platform, I will demonstrate how to create a streaming data pipeline with Delta Live Tables ingesting data from AWS Kinesis. Further, I’ll use advanced Databricks Workflows triggers for efficient orchestration and real-time alerts feeding into a real-time dashboard. And since I don’t want you to leave empty-handed, I will use Delta Sharing to share the results of the demo we built with every participant in the room. Join me in this hands-on exploration of cutting-edge data engineering techniques and witness the future in action.

Talk by: Frank Munz

Essential Data Security Strategies for the Modern Enterprise Data Architecture

2023-07-26 Watch

video

Piet Loubser

AI/ML Analytics Cloud Computing Databricks Cyber Security

Balancing critical data requirements is a 24-7 task for enterprise-level organizations that must straddle the need to open specific gates to enable self-service data access while closing other access points to maintain internal and external compliance. Data breaches can cost U.S. businesses an average of $9.4 million per occurrence; ignoring this leaves organizations vulnerable to severe losses and crippling costs.

The 2022 Gartner Hype Cycle for Data Security reports that more and more enterprises are modernizing their data architecture with cloud and technology partners to help them collect, store and manage business data, a trend that does not appear to be letting up. According to Gartner®, “by 2025, 30% of enterprises will have adopted the Broad Data Security Platform (bDSP), up from less than 10% in 2021, due to the pent-up demand for higher levels of data security and the rapid increase in product capabilities.”

Moving to both a modern data architecture and data-driven culture sets enterprises on the right trajectory for growth, but it’s important to keep in mind individual public cloud platforms are not guaranteed to protect and secure data. To solve this, Privacera pioneered the industry’s first open-standards-based data security platform that integrates privacy and compliance across multiple cloud services.

During this presentation, we will discuss:
- Why today’s modern data architecture needs a DSP that works across the entire data ecosystem
- Essential DSP prescriptive measures and adoption strategies
- Why faster and more responsible access to data insights helps reduce cost, increase productivity, expedite decision making, and drive exponential growth

Talk by: Piet Loubser

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Generative AI at Scale Using GAN and Stable Diffusion

2023-07-26 Watch

video

Rodrigo Beceiro, Paula Martinez (Marvik)

AI/ML Databricks GenAI LLM MLOps

Generative AI is in the spotlight and has diverse applications, but there are many considerations when deploying a generative model at scale. This presentation will take a deep dive into multiple architectures and discuss optimization hacks for the sophisticated data pipelines that generative AI requires. The session will cover:
- How to create and prepare a dataset for training at scale in single-GPU and multi-GPU environments
- How to optimize your data pipeline for training and inference in production, considering the complex deep learning models that need to be run
- The tradeoff between higher-quality outputs and training time, resources, and processing time

Agenda:
- Basic concepts in Generative AI: GANs and Stable Diffusion
- Training and inference data pipelines
- Industry applications and use cases

Talk by: Paula Martinez and Rodrigo Beceiro

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz
