talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
561 tagged activities

Activity Trend: peak of 515 activities/quarter (2020-Q1 to 2026-Q1)

Activities (showing filtered results)
Filtered by: Databricks DATA + AI Summit 2023
Databricks and Delta Lake: Lessons Learned from Building Akamai's Web Security Analytics Product

Akamai is a leading content delivery network (CDN) and cybersecurity company operating hundreds of thousands of servers in more than 135 countries worldwide. In this session, we will share our experiences and lessons learned from building and maintaining the Web Security Analytics (WSA) product, an interactive analytics platform powered by Databricks and Delta Lake that enables customers to efficiently analyze and take informed action on a high volume of streaming security events.

The WSA platform must be able to serve hundreds of queries per minute, scanning hundreds of terabytes of data from a six-petabyte data lake, with most queries, both aggregations and needle-in-a-haystack lookups, returning results within ten seconds. This session will cover how to use Databricks SQL warehouses and job clusters cost-effectively, and how to improve query performance using tools and techniques such as Delta Lake, Databricks Photon, and partitioning. This talk will be valuable for anyone looking to build and operate a high-performance analytics platform.
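
As a rough illustration of the partitioning technique mentioned above, here is a minimal PySpark sketch; the table and column names are hypothetical, not Akamai's actual schema:

```python
from pyspark.sql import SparkSession

# Assumes a Databricks-like environment with Delta Lake available.
spark = SparkSession.builder.getOrCreate()

# Partition the events table by date so queries that filter on a time
# range only scan the matching partitions instead of the full data lake.
(spark.table("raw_security_events")          # hypothetical source table
    .write.format("delta")
    .partitionBy("event_date")               # hypothetical partition column
    .mode("overwrite")
    .saveAsTable("wsa.security_events"))

# A typical "needle in a haystack" query: partition pruning on event_date
# limits the scan, and an engine such as Photon vectorizes the filter.
result = spark.sql("""
    SELECT *
    FROM wsa.security_events
    WHERE event_date = '2023-06-01'
      AND client_ip = '203.0.113.7'          -- hypothetical column
""")
result.show()
```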

Talk by: Tomer Patel and Itai Yaffe

Data Democratization with Lakehouse: An Open Banking Application Case

Banco Bradesco is one of the largest financial-sector companies in Latin America, with more than 99 million customers, 79 years of history, and a legacy of data distributed across hundreds of on-premises systems. With the spread of data-driven approaches and the growth of cloud computing adoption, we needed to innovate, adapt to new trends, and enable an analytical environment with democratized data.

We will show how more than eight business departments have already engaged in using the Lakehouse exploratory environment, with more than 190 use cases mapped and a multi-bank financial manager. Unlike on-premises, the cost of each process can be isolated and managed in near real time, allowing quick responses to cost and budget deviations, while increasing the deployment speed of new features 36-fold compared to on-premises.

The data is now used and shared safely and easily between different areas and companies of the group. Dashboards within Databricks also allow panels to be prototyped efficiently with real data, making it easy for business areas to iterate on their real needs before creating a definitive view that addresses all the relevant points.

Talk by: Pedro Boareto and Fabio Luis Correia da Silva

Disaster Recovery Strategies for Structured Streams

In recent years, many businesses have adopted real-time streaming applications to enable faster decision making, quicker predictions, and improved customer experiences. Some of these applications drive critical business use cases like financial fraud detection, loan application processing, and personalized offers. These business-critical applications need robust disaster recovery (DR) strategies to recover from catastrophic events and minimize lost uptime. However, most organizations find it hard to set up disaster recovery for streaming applications, as it involves continuous data flow. Streaming state and the temporal behavior of data add complexity to the DR strategy. A reliable disaster recovery strategy includes backup, failover, and failback approaches for the streaming application. Unlike batch applications, these steps involve many moving parts and need a sophisticated approach to ensure that services fail over to the DR region and meet the set RTO and RPO requirements.

In this session, we will cover the following topics with a FINSERV use case demo (a minimal checkpoint sketch follows the list):

  • Backup strategy: backing up Delta tables, message bus services, and checkpoints including offsets
  • Failover strategy: disabling services in the primary region and starting them in the secondary region with minimal data loss
  • Failback strategy: restarting services in the primary region once all services are restored
  • Common challenges and best practices for backup
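
To make the checkpoint-and-offset portion of the backup strategy concrete, here is a minimal Structured Streaming sketch (broker, topic, and paths are hypothetical). The checkpoint location holds exactly the state, Kafka offsets included, that a DR backup must replicate to the secondary region:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a stream of events from a message bus (Kafka here, hypothetically).
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "primary-broker:9092")
    .option("subscribe", "payments")
    .load())

# Write to a Delta table. The checkpointLocation holds offsets and
# streaming state; backing it up alongside the Delta table is what lets
# the job resume in the secondary region with minimal data loss.
query = (events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/dr/checkpoints/payments")
    .toTable("payments_bronze"))
```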

Talk by: Shasidhar Eranti and Sachin Balgonda Patil

D-Lite: Integrating a Lightweight ChatGPT-Like Model Based on Dolly into Organizational Workflows

DLite is a new instruction-following model developed by AI Squared by fine-tuning the smallest GPT-2 model on the Alpaca dataset. Despite having only 124 million parameters, DLite exhibits impressive ChatGPT-like interactivity and can be fine-tuned on a single T4 GPU for less than $15.00. Due to its relatively small size, DLite can be run locally on a wide variety of compute environments, including laptop CPUs, and can be used without sending data to any third-party API. This lightweight property of DLite makes it highly accessible for personal use, empowering users to integrate machine learning models and advanced analytics into their workflows quickly, securely, and cost-effectively.
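
For orientation, fine-tuning the smallest GPT-2 on instruction data in the spirit of DLite might look roughly like the following Hugging Face sketch; the dataset identifier, prompt format, and hyperparameters are assumptions, not AI Squared's actual recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Smallest GPT-2 checkpoint: 124M parameters, as in DLite.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# An Alpaca-style instruction dataset (hypothetical choice of mirror).
data = load_dataset("tatsu-lab/alpaca", split="train")

def to_text(example):
    # Collapse instruction/input/output into one training string.
    return {"full_text": f"{example['instruction']}\n{example['input']}\n{example['output']}"}

data = data.map(to_text)

def tokenize(batch):
    return tokenizer(batch["full_text"], truncation=True, max_length=512)

tokenized = data.map(tokenize, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dlite-sketch",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           fp16=True),  # small enough to fit a single T4
    train_dataset=tokenized,
    # mlm=False -> causal LM: labels are the input ids, shifted internally.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```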

Leveraging DLite within AI Squared's platform can empower organizations to orchestrate the integration of Dolly/DLite into business workflows, create personalized versions of Dolly/DLite, chain models or analytics to contextualize Dolly/DLite responses and prompts, and curate new datasets leveraging real-time feedback.

Talk by: Jacob Renn and Ian Sotnek

Event Driven Real-Time Supply Chain Ecosystem Powered by Lakehouse

As the backbone of Australia’s supply chain, the Australian Rail Track Corporation (ARTC) plays a vital role in the management and monitoring of goods transportation across its 8,500 km rail network throughout Australia. ARTC provides weighbridges along its track which read train weights as trains pass at speeds of up to 60 kilometers an hour. This information is highly valuable and is required both by ARTC and its customers to provide accurate haulage weight details, analyze technical equipment, and help ensure wagons have been loaded correctly.

A total of 750 trains run across the 8,500 km network each day, generating real-time data at approximately 50 sensor platforms. With the help of Structured Streaming and Delta Lake, ARTC was able to analyze and store:

  • Precise train location
  • Weight of the train in real-time
  • Train crossing times, accurate to the second
  • Train speed, temperature, sound frequency, and friction
  • Train schedule lookups

Once all the IoT data has been pulled together from an IoT event hub, it is processed in real time using Structured Streaming and stored in Delta Lake. To track each train's GPS location, API calls are made from the Lakehouse once per minute per train, and real-time calls are made to a separate scheduling system to look up customer information. Once the processed and enriched data is stored in Delta Lake, an API layer on top of it exposes the data to all consumers.
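
A minimal sketch of the ingest pattern described above, assuming the event hub exposes a Kafka-compatible endpoint; the schema, endpoint, and table names are hypothetical, and authentication options are omitted:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.getOrCreate()

# Schema for the weighbridge sensor payload (hypothetical fields).
schema = StructType([
    StructField("train_id", StringType()),
    StructField("weight_kg", DoubleType()),
    StructField("speed_kmh", DoubleType()),
    StructField("crossing_time", TimestampType()),
])

# Read sensor events from the event hub via a Kafka-compatible endpoint.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "my-eventhub.example.net:9093")
    .option("subscribe", "weighbridge-events")
    .load())

# Parse the JSON payloads and persist them to Delta Lake for downstream
# API serving and analytics.
parsed = (raw
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

(parsed.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/artc/checkpoints/weighbridge")
    .toTable("artc.weighbridge_events"))
```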

The outcome: increased transparency, as weight data is now made available to customers; a digital data ecosystem that ARTC’s customers now use to meet their KPIs and planning; and the ability to determine temporary speed restrictions across the network, improving train scheduling accuracy and enabling network maintenance to be scheduled around train schedules and speeds.

Talk by: Deepak Sekar and Harsh Mishra

IFC's MALENA Provides Analytics for ESG Reviews in Emerging Markets Using NLP and LLMs

International Finance Corporation (IFC) is using data and AI to build machine learning solutions that create analytical capacity to support the review of ESG issues at scale. This includes natural language processing, entity recognition, and other applications to support the work of IFC’s experts and other investors working in emerging markets. These algorithms are available via IFC’s Machine Learning ESG Analyst (MALENA) platform to enable rapid analysis, increase productivity, and build investor confidence. In this manner, IFC, a development finance institution with a mandate to address poverty in emerging markets, is making use of its historical datasets and open source AI solutions to build custom AI applications that democratize access to ESG capacity to read and classify text.

In this session, you will learn about the unique flexibility of the Apache Spark™ ecosystem from Databricks and how it has allowed IFC’s MALENA project to connect to scalable data lake storage, use different natural language processing models, and seamlessly adopt MLOps.

Talk by: Atiyah Curmally and Blaise Sandwidi

Increasing Data Trust: Enabling Data Governance on Databricks Using Unity Catalog & ML-Driven MDM

As part of Comcast Effectv’s transformation into a completely digital advertising agency, it was key to develop an approach to manage and remediate data quality issues related to customer data so that the sales organization is using reliable data to enable data-driven decision making. Like many organizations, Effectv's customer lifecycle processes are spread across many systems utilizing various integrations between them. This results in key challenges like duplicate and redundant customer data that requires rationalization and remediation. Data is at the core of Effectv’s modernization journey with the intended result of winning more business, accelerating order fulfillment, reducing make-goods and identifying revenue.

In partnership with Slalom Consulting, Comcast Effectv built a traditional lakehouse on Databricks to ingest data from all of these systems, but with a twist: they anchored every engineering decision in how it would enable their data governance program.

In this session, we will touch upon the data transformation journey at Effectv and dive deeper into the implementation of data governance leveraging Databricks solutions such as Delta Lake, Unity Catalog, and Databricks SQL (DBSQL). Key focus areas include how we baked master data management into our pipelines by automating the matching and survivorship process, and how we brought it all together for data consumers via DBSQL, exposing certified assets in the bronze, silver, and gold layers.

By making thoughtful decisions about structuring data in Unity Catalog and baking MDM into ETL pipelines, you can greatly increase the quality, reliability, and adoption of single-source-of-truth data so your business users can stop spending cycles on wrangling data and spend more time developing actionable insights for your business.
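
To make the matching-and-survivorship idea concrete, here is a minimal PySpark sketch of one generic pattern: deterministic matching on a normalized key, with most-recent-record survivorship. Table and column names are hypothetical, not Effectv's actual rules:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, concat_ws, lower, row_number, trim

spark = SparkSession.builder.getOrCreate()
customers = spark.table("bronze.customers")  # hypothetical source table

# Matching: a simple deterministic key (normalized name + email).
# Real MDM pipelines often layer fuzzy or ML-driven matching on top.
matched = customers.withColumn(
    "match_key",
    concat_ws("|", lower(trim(col("name"))), lower(trim(col("email")))))

# Survivorship: keep the most recently updated record per match key.
w = Window.partitionBy("match_key").orderBy(col("updated_at").desc())
golden = (matched
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)
    .drop("rn"))

# Publish the mastered records to the silver layer for consumers.
(golden.write.format("delta")
    .mode("overwrite")
    .saveAsTable("silver.customers_mastered"))
```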

Talk by: Maggie Davis and Risha Ravindranath

Real-Time Reporting and Analytics for Construction Data Powered by Delta Lake and DBSQL

Procore makes construction project management software that helps construction professionals efficiently manage their projects and collaborate with their teams. Our mission is to connect everyone in construction on a global platform.

Procore is the system of record for all construction projects. Our customers need to access the data in near real time for construction insights. Enhanced Reporting is a self-service operational reporting module that provides quick, consistent data access to thousands of tables and reports.

Procore's data platform team rebuilt the module (originally built on a relational database) using Databricks and Delta Lake. We used Apache Spark™ Structured Streaming to maintain consistent state on the ingestion side from Kafka, and plan to leverage the full capabilities of DBSQL, using serverless SQL warehouses to read the medallion models (built via dbt) in Delta Lake. In addition, the Unity Catalog and Delta Sharing features helped us share data across regions seamlessly. This design enabled us to improve the p95 and p99 read times by xx% (queries which were initially timing out).
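
One common pattern for maintaining consistent state while ingesting from Kafka is an idempotent MERGE inside foreachBatch; below is a minimal PySpark sketch under assumed topic, table, and key names (not Procore's actual implementation):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical payload schema; assumes the target Delta table exists.
schema = StructType([
    StructField("record_id", StringType()),
    StructField("payload", StringType()),
    StructField("updated_at", TimestampType()),
])

source = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "construction-records")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

def upsert(batch_df, batch_id):
    # MERGE makes each micro-batch idempotent: replays after a failure
    # update rows in place instead of creating duplicates.
    target = DeltaTable.forName(spark, "bronze.records")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.record_id = s.record_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(source.writeStream
    .foreachBatch(upsert)
    .option("checkpointLocation", "/mnt/checkpoints/records")
    .start())
```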

Attend this session to hear about the learnings and experience of building a Data Lakehouse architecture.

Talk by: Jay Yang and Hari Rajaram

Taking Your Cloud Vendor to the Next Level: Solving Complex Challenges with Azure Databricks

Akamai's content delivery network (CDN) processes about 30% of the internet's daily traffic, resulting in a massive amount of data that presents engineering challenges, both internally and with cloud vendors. In this session, we will discuss the barriers faced while building a data infrastructure on Azure, Databricks, and Kafka to meet strict SLAs, hitting the limits of some of our cloud vendors’ services. We will describe the iterative process of re-architecting a massive scale data platform using the aforementioned technologies.

We will also delve into how, today, Akamai is able to quickly ingest terabytes of data and make it available to customers, as well as efficiently query petabytes of data and return results within 10 seconds for most queries. This discussion will provide valuable insights for attendees and organizations seeking to effectively process and analyze large amounts of data.

Evaluating LLM-based Applications

Evaluating LLM-based applications can feel like more of an art than a science. In this workshop, we'll give a hands-on introduction to evaluating language models. You'll come away with knowledge and tools you can use to evaluate your own applications (see the sketch after the list below), and answers to questions like:

  • Where do I get evaluation data from, anyway?
  • Is it possible to evaluate generative models in an automated way?
  • What metrics can I use?
  • What's the role of human evaluation?
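
As promised above, here is a minimal sketch of one automated-evaluation approach: scoring model outputs against references on a small eval set. The token-overlap F1 metric and the `generate` callable are placeholders, not the workshop's actual tooling:

```python
from collections import Counter

def f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a common metric for short-answer tasks."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(generate, eval_set):
    """Average score over eval_set: list of {'prompt', 'reference'} dicts."""
    scores = [f1(generate(ex["prompt"]), ex["reference"]) for ex in eval_set]
    return sum(scores) / len(scores)

# Tiny hand-built eval set and a trivial stand-in "model", to show the
# shape of the harness; real eval data would come from logs or labeling.
eval_set = [{"prompt": "Capital of France?", "reference": "Paris"}]
print(evaluate(lambda p: "Paris", eval_set))  # 1.0
```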

Talk by: Josh Tobin

Best Exploration of Columnar Shuffle Design

To significantly improve the performance of Spark SQL, there has been a trend over the past several years to offload Spark SQL execution to highly optimized native libraries or accelerators, like Photon from Databricks, NVIDIA's RAPIDS plug-in, and the open source Gluten project initiated by Intel and Kyligence. Driven by the multi-fold performance improvements of these solutions, more and more Apache Spark™ users have started to adopt the new technology. One characteristic of these native libraries is that they all use a columnar data format as the basic data format, because the columnar format has an intrinsic affinity for vectorized data processing using SIMD instructions. Vanilla Spark's shuffle, however, is based on Spark's internal row format, and the high overhead of columnar-to-row and row-to-columnar conversion during the shuffle makes reusing the current shuffle impractical. Given the importance of the shuffle service in Spark, we had to implement an efficient columnar shuffle, which brings a couple of new challenges, such as splitting columnar data and supporting dictionaries during shuffle.
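
To illustrate the row-to-columnar conversion cost being discussed, here is a small standalone PyArrow sketch (made-up data; PyArrow is used here only because it implements the Arrow columnar format that engines like Gluten build on):

```python
import pyarrow as pa

# Row-oriented records, conceptually like Spark's internal row format.
rows = [
    {"key": 1, "value": "a"},
    {"key": 2, "value": "b"},
    {"key": 1, "value": "c"},
]

# Columnar layout: each column becomes one contiguous array, which is
# what enables SIMD-vectorized processing in native engines.
batch = pa.RecordBatch.from_pylist(rows)
print(batch.column(0))  # the "key" column: [1, 2, 1] in one buffer

# Converting back to rows, as a row-based shuffle would force, touches
# every value individually; paying this columnar->row->columnar cost on
# both sides of the shuffle is the overhead a columnar shuffle removes.
back_to_rows = batch.to_pylist()
print(back_to_rows[0])
```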

In this session, we will share the exploration process of the columnar shuffle design during our Gazelle and Gluten development, along with best practices for implementing a columnar shuffle service. We will also share what we learned from the development of vanilla Spark's shuffle, for example, how to address the small-files issue, and then propose the new shuffle solution. We will show a performance comparison between the columnar shuffle and vanilla Spark's row-based shuffle. Finally, we will share how new built-in accelerators like QAT and IAA in the latest Intel processors are used in our columnar shuffle service to boost performance.

Talk by: Binwei Yang and Rong Ma

Best Practices for Running Efficient Apache Spark™ Workloads on Databricks

Every day, thousands of customers choose to run business-critical Spark workloads on the Databricks Lakehouse Platform, a platform built by the creators of Apache Spark™. These customers take advantage of platform capabilities such as fully managed compute resources, dynamic autoscaling, an integrated workflow orchestration tool, and Photon, the extremely fast vectorized execution engine. All of these make the Databricks Lakehouse Platform the best place to run Spark workloads, providing operational benefits as well as tremendous price/performance value.

This session, which includes live demos, will cover these and other platform capabilities that can help you build your next optimized Spark application.

Talk by: Justin Breese

Databricks Lakehouse: How BlackBerry is Revolutionizing Cybersecurity Services Worldwide

Cybersecurity incidents are costly, and using an endpoint detection and response (EDR) solution enables the detection of cybersecurity incidents as quickly as possible. Effectively detecting cybersecurity incidents requires the collection of millions of data points, and storing and querying endpoint data presents considerable engineering challenges. This includes quickly moving local data from endpoints to a single table in the cloud and enabling performant querying against it.

The need to avoid internal data siloing within BlackBerry was paramount as multiple teams required access to the data to deliver an effective EDR solution for the present and the future. Databricks tooling enabled us to break down our data silos and iteratively improve our EDR pipeline to ingest data faster and reduce querying latency by more than 20% while reducing costs by more than 30%.

In this session, we will share the journey, the lessons learned, and the future for collecting, storing, governing, and sharing data from endpoints in Databricks. Building EDR on Databricks helped us accelerate the deployment of our data platform.

Talk by: Justin Lai and Robert Lombardi

Databricks SQL: Why the Best Serverless Data Warehouse is a Lakehouse

Many organizations rely on complex cloud data architectures that create silos between applications, users and data. This fragmentation makes it difficult to access accurate, up-to-date information for analytics, often resulting in the use of outdated data. Enter the lakehouse, a modern data architecture that unifies data, AI, and analytics in a single location.

This session explores why the lakehouse is the best data warehouse, featuring success stories, use cases and best practices from industry experts. You'll discover how to unify and govern business-critical data at scale to build a curated data lake for data warehousing, SQL and BI. Additionally, you'll learn how Databricks SQL can help lower costs and get started in seconds with on-demand, elastic SQL serverless warehouses, and how to empower analytics engineers and analysts to quickly find and share new insights using their preferred BI and SQL tools such as Fivetran, dbt, Tableau, or Power BI.

Talk by: Miranda Luna and Cyrielle Simeone

Data Extraction and Sharing Via The Delta Sharing Protocol

The Delta Sharing open protocol for secure sharing and distribution of Lakehouse data is designed to reduce friction in getting data to users. Delivering custom data solutions from this protocol further leverages the technical investment committed to your Delta Lake infrastructure. There are key design and computational concepts unique to Delta Sharing to know when undertaking development. And there are pitfalls and hazards to avoid when delivering modern cloud data to traditional data platforms and users.

In this session, we introduce Delta Sharing Protocol development and examine our journey and the lessons learned while creating the Delta Sharing Excel Add-in. We will demonstrate scenarios of overfetching, underfetching, and interpretation of types. We will suggest methods to overcome these development challenges. The session will combine live demonstrations that exercise the Delta Sharing REST protocol with detailed analysis of the responses. The demonstrations will elaborate on optional capabilities of the protocol’s query mechanism, and how they are used and interpreted in real-life scenarios. As a reference baseline for data professionals, the Delta Sharing exercises will be framed relative to SQL counterparts. Specific attention will be paid to how they differ, and how Delta Sharing’s Change Data Feed (CDF) can power next-generation data architectures. The session will conclude with a survey of available integration solutions for getting the most out of your Delta Sharing environment, including frameworks, connectors, and managed services.

Attendees are encouraged to be familiar with REST, JSON, and modern programming concepts. A working knowledge of Delta Lake, the Parquet file format, and the Delta Sharing Protocol is advised.
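
For readers new to the protocol, consuming a share with the open source delta-sharing Python client looks roughly like this; the profile file path and the share/schema/table names are hypothetical:

```python
import delta_sharing

# A recipient only needs the share profile file issued by the provider.
profile = "config.share"  # hypothetical profile file path

# Discover what the provider has shared.
client = delta_sharing.SharingClient(profile)
for table in client.list_all_tables():
    print(table)

# Load one shared table; the URL format is "<profile>#<share>.<schema>.<table>".
# Under the hood this drives the Delta Sharing REST protocol (the same
# requests and responses the session examines) and reads the returned
# Parquet files into a local DataFrame.
df = delta_sharing.load_as_pandas(f"{profile}#my_share.my_schema.my_table")
print(df.head())
```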

Talk by: Roger Dunn

Data Globalization at Conde Nast Using Delta Sharing

Databricks has been an essential part of the Conde Nast architecture for the last few years. Prior to building our centralized data platform, “evergreen,” we had challenges similar to many other organizations: siloed data, duplicated effort for engineers, and a lack of collaboration between data teams. These problems led to mistrust in data sets and made it difficult to scale to meet the strategic globalization plan we had for Conde Nast.

Over the last few years we have been extremely successful in building a centralized data platform on Databricks in AWS, fully embracing the lakehouse vision from end-to-end. Now, our analysts and marketers can derive the same insights from one dataset and data scientists can use the same datasets for use cases such as personalization, subscriber propensity models, churn models and on-site recommendations for our iconic brands.

In this session, we’ll discuss how we plan to incorporate Unity Catalog and Delta Sharing as the next phase of our globalization mission. The evergreen platform has become the global standard for data processing and analytics at Conde. In order to manage the worldwide data and comply with GDPR requirements, we need to make sure data is processed in the appropriate region and PII data is handled appropriately. At the same time, we need a global view of the data to allow us to make business decisions at the global level. We’ll talk about how Delta Sharing gives us a simple, secure way to share de-identified datasets across regions in order to make these strategic business decisions, while complying with security requirements. Additionally, we’ll discuss how Unity Catalog allows us to secure, govern, and audit these datasets in an easy and scalable manner.

Talk by: Zachary Bannor

Embrace First-Party Customer Data for Marketing and Advertising using Data Cleanrooms

The digital marketing and advertising industry is going through revolutionary change in 2023, with technical, organizational, cultural, and regulatory overhaul. As a result, measuring digital advertising effectiveness or coordinating and running highly targeted and effective ad campaigns is becoming more challenging than ever.

First-party customer behavioral data provides organizations with a true competitive advantage and the ability to outperform their peers in the battle for customer attention and brand loyalty.

However, first-party customer data is still used sparingly across the digital ad ecosystem, and there are few tools or frameworks that allow advertisers to unlock the value of the first-party data they have.

This session will show you how Snowplow allows organizations to deeply understand their users' behavior and intent by creating the best-quality behavioral data. It will also explain how, when this is combined with the Databricks Lakehouse and data clean rooms, brands can unlock insights that were previously unachievable and activate their first-party customer behavioral data into highly effective, personalized, and creative ad campaigns.

In this session you will learn:

  • Why first-party data can be the ultimate competitive advantage for digital advertisers
  • How data clean rooms combined with Snowplow behavioral data enable better insights and more impactful ad targeting
  • What specific marketing and advertising use cases are possible when utilizing a data clean room on top of the Databricks Lakehouse

Talk by: Jordan Peck

Embracing the Future of Data Engineering: The Serverless, Real-Time Lakehouse in Action

As we venture into the future of data engineering, streaming and serverless technologies take center stage. In this fun, hands-on, in-depth and interactive session you can learn about the essence of future data engineering today.

We will tackle the challenge of processing streaming events continuously created by hundreds of sensors in the conference room from a serverless web app (bring your phone and be a part of the demo). The focus is on the system architecture, the products involved, and the solution they provide. Which Databricks products, capabilities, and settings will be most useful for our scenario? What does streaming really mean, and why does it make our lives easier? What are the exact benefits of serverless, and just how "serverless" is a particular solution?

Leveraging the power of the Databricks Lakehouse Platform, I will demonstrate how to create a streaming data pipeline with Delta Live Tables ingesting data from AWS Kinesis. Further, I’ll utilize advanced Databricks Workflows triggers for efficient orchestration and real-time alerts feeding into a real-time dashboard. And since I don’t want you to leave empty-handed, I will use Delta Sharing to share the results of the demo we built with every participant in the room. Join me in this hands-on exploration of cutting-edge data engineering techniques and witness the future in action.
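
A minimal Delta Live Tables sketch of the Kinesis ingestion described above; the stream name, region, and table logic are assumptions, not the exact demo code (in a DLT pipeline, spark is provided implicitly):

```python
import dlt
from pyspark.sql.functions import col

# Bronze: raw sensor events streamed from AWS Kinesis.
@dlt.table(comment="Raw sensor events from the in-room demo")
def sensor_events_raw():
    return (spark.readStream
        .format("kinesis")
        .option("streamName", "conference-sensors")  # hypothetical stream
        .option("region", "us-west-2")
        .option("initialPosition", "latest")
        .load())

# Silver: decoded payloads that the real-time dashboard reads. DLT keeps
# this table continuously up to date as new events arrive.
@dlt.table(comment="Decoded sensor readings")
def sensor_events_decoded():
    return (dlt.read_stream("sensor_events_raw")
        .select(col("partitionKey").alias("sensor_id"),
                col("data").cast("string").alias("json_payload"),
                col("approximateArrivalTimestamp").alias("arrived_at")))
```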

Talk by: Frank Munz

Essential Data Security Strategies for the Modern Enterprise Data Architecture

Balancing critical data requirements is a 24-7 task for enterprise-level organizations that must straddle the need to open specific gates to enable self-service data access while closing other access points to maintain internal and external compliance. Data breaches can cost U.S. businesses an average of $9.4 million per occurrence; ignoring this leaves organizations vulnerable to severe losses and crippling costs.

The 2022 Gartner Hype Cycle for Data Security reports that more and more enterprises are modernizing their data architecture with cloud and technology partners to help them collect, store and manage business data; a trend that does not appear to be letting up. According to Gartner®, “by 2025, 30% of enterprises will have adopted the Broad Data Security Platform (bDSP), up from less than 10% in 2021, due to the pent-up demand for higher levels of data security and the rapid increase in product capabilities."

Moving to both a modern data architecture and data-driven culture sets enterprises on the right trajectory for growth, but it’s important to keep in mind individual public cloud platforms are not guaranteed to protect and secure data. To solve this, Privacera pioneered the industry’s first open-standards-based data security platform that integrates privacy and compliance across multiple cloud services.

During this presentation, we will discuss:

  • Why today’s modern data architecture needs a DSP that works across the entire data ecosystem
  • Essential DSP prescriptive measures and adoption strategies
  • Why faster and more responsible access to data insights helps reduce cost, increase productivity, expedite decision making, and lead to exponential growth

Talk by: Piet Loubser

Generative AI at Scale Using GAN and Stable Diffusion

Generative AI is under the spotlight and has diverse applications, but there are also many considerations when deploying a generative model at scale. This presentation will take a deep dive into multiple architectures and discuss optimization hacks for the sophisticated data pipelines that generative AI requires. The session will cover:

  • How to create and prepare a dataset for training at scale in single-GPU and multi-GPU environments
  • How to optimize your data pipeline for training and inference in production, considering the complex deep learning models that need to be run (see the sketch after the agenda below)
  • The trade-off between higher-quality outputs versus training time, resources, and processing times

Agenda:

  • Basic concepts in generative AI: GAN networks and Stable Diffusion
  • Training and inference data pipelines
  • Industry applications and use cases
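
As a generic illustration of the data-pipeline knobs referenced above, here is a small PyTorch sketch of an image-loading pipeline for generator training; paths, layout, and settings are assumptions, not the presenters' code:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Typical preprocessing for training an image generator at a fixed
# resolution; normalizing to [-1, 1] is common for GANs and diffusion.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Assumes a data/train/<class>/*.jpg folder layout.
dataset = datasets.ImageFolder("data/train", transform=preprocess)

# The pipeline knobs: parallel workers keep the GPU fed, pinned memory
# speeds host-to-GPU copies, and batch size trades memory for throughput.
# For multi-GPU training this loader would be wrapped with a
# DistributedSampler.
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=8, pin_memory=True, drop_last=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for images, _ in loader:
    images = images.to(device, non_blocking=True)
    # ... forward/backward pass of the GAN or diffusion model ...
    break
```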

Talk by: Paula Martinez and Rodrigo Beceiro
