talk-data.com

Topic: Data Lakehouse
Tags: data_architecture, data_warehouse, data_lake
Tagged activities: 489

Activity Trend: 118 peak/qtr (2020-Q1 to 2026-Q1)

Activities: 489 · Newest first

How Comcast Effectv Drives Data Observability with Databricks and Monte Carlo

Comcast Effectv, the 2,000-employee advertising wing of Comcast, America’s largest telecommunications company, provides custom video ad solutions powered by aggregated viewership data. As part of a global technology and media company that connects millions of customers to personalized experiences and processes billions of transactions, Comcast Effectv was challenged with handling massive loads of data, monitoring hundreds of data pipelines, and coordinating across data teams in a timely manner.

In this session, we will discuss Comcast Effectv’s journey to building a more scalable, reliable lakehouse and driving data observability at scale with Monte Carlo. This has given Effectv a single-pane-of-glass view of its data environment, ensuring consumer data trust across AWS, Databricks, and Looker.

Talk by: Scott Lerner and Robinson Creighton

How to Create and Manage a High-Performance Analytics Team

Data science and analytics teams are unique. Large and small corporations want to build and manage analytics teams to convert their data and analytic assets into revenue and competitive advantage, but many are failing before they make their first hire. In this session, the audience will learn how to structure, hire, manage and grow an analytics team. Organizational structure, project and program portfolios, neurodiversity, developing talent, and more will be discussed.

Questions and discussion will be encouraged. The audience will leave with a deeper understanding of how to succeed in turning data and analytics into tangible results.

Talk by: John Thompson

Sponsored: Accenture | Databricks Enables Employee Data Domain to Align People w/ Business Outcomes

A global franchise retailer was struggling to understand the value of its employees and had not fostered a data-driven enterprise. During the journey to use facts as the basis for decision making, Databricks became the facilitator of a data mesh and created the pipelines, analytics, and source engine for a three-layer — bronze, silver, gold — lakehouse that supports the HR domain and drives the integration of multiple additional domains: sales, customer satisfaction, product quality, and more. In this talk, we will walk through:

  • The business rationale and drivers
  • The core data sources
  • The data products, analytics and pipelines
  • The adoption of Unity Catalog for data privacy compliance/adherence and data management
  • Data quality metrics

Join us to see the analytic product and the design behind this innovative view of employees and their business outcomes.

Talk by: Rebecca Bucnis

Unlocking the Power of Databricks SDKs: The Power to Integrate, Streamline, and Automate

In today's data-driven landscape, the demands placed upon data engineers are diverse and multifaceted. With the integration of Java, Python, or Go microservices, Databricks SDKs provide a powerful bridge between the established ecosystems and Databricks. They allow data engineers to unlock new levels of integration and collaboration, as well as integrate Unity Catalog into processes to create advanced workflows straight from notebooks.

In this session, learn best practices for when and how to use the SDKs, the command-line interface, or Terraform to integrate seamlessly with the Databricks Lakehouse. The session also covers using shell scripts to automate complex tasks and streamline operations for better scalability.
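
To make the SDK path concrete, here is a minimal sketch using the Databricks SDK for Python; it assumes the databricks-sdk package is installed and that credentials are supplied via environment variables or a configuration profile, and the operations shown (listing clusters and jobs) are illustrative rather than drawn from the talk.

```python
# Minimal sketch: connect to a workspace and list a few resources.
# Assumes databricks-sdk is installed and authentication is configured
# (e.g. DATABRICKS_HOST / DATABRICKS_TOKEN or a config profile).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Enumerate clusters -- a simple starting point for automation scripts.
for cluster in w.clusters.list():
    print(cluster.cluster_name, cluster.state)

# The same client exposes jobs and other workspace APIs.
for job in w.jobs.list(limit=10):
    print(job.job_id, job.settings.name)
```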

Talk by: Serge Smertin

Deep Dive Into Grammarly's Data Platform

Grammarly helps 30 million people and 50,000 teams to communicate more effectively. Using the Databricks Lakehouse Platform, we can rapidly ingest, transform, aggregate, and query complex data sets from an ecosystem of sources, all governed by Unity Catalog. This session will overview Grammarly’s data platform and the decisions that shaped the implementation. We will dive deep into some architectural challenges the Grammarly Data Platform team overcame as we developed a self-service framework for incremental event processing.

Our investment in the lakehouse and Unity Catalog has dramatically improved the speed of our data value chain: 5 billion events (ingested, aggregated, de-identified, and governed) are made available to stakeholders (data scientists, business analysts, sales, marketing) and downstream services (feature store, reporting/dashboards, customer support, operations) within 15 minutes. As a result, we have improved our query cost performance (110% faster at 10% of the cost) compared to our legacy system on AWS EMR.

I will share architecture diagrams, their implications at scale, code samples, and problems solved and to be solved in a technology-focused discussion about Grammarly’s iterative lakehouse data platform.

Talk by: Faraz Yasrobi and Christopher Locklin

Unity Catalog, Delta Sharing and Data Mesh on Databricks Lakehouse

In this technical deep dive, we will detail how customers implemented data mesh on Databricks and how standardizing on the Delta format enabled delta-to-delta sharing to non-Databricks consumers.

  • Current state of the IT landscape
  • Data silos (problems with organizations not having connected data in the ecosystem)
  • A look back on why we moved away from data warehouses and chose the cloud in the first place
  • What caused the data chaos in the cloud (instrumentation and too much stitching together of a periodic table’s worth of cloud services)
  • How to strike the balance between autonomy and centralization
  • Why Databricks Unity Catalog puts you on the right path to implementing a data mesh strategy
  • What processes and features enable an end-to-end implementation of a data strategy
  • How customers successfully implemented data mesh with out-of-the-box Unity Catalog and Delta Sharing without overwhelming their IT tool stack
  • Use cases
  • Delta-to-delta data sharing
  • Delta-to-others data sharing
  • How do you navigate when data today is available across regions, across clouds, on-prem and external systems
  • Change data feed to share only “data that has changed” (see the sketch after this list)
  • Data stewardship
  • Why ABAC is important
  • How file-based access policies and governance play an important role
  • Future state and its pitfalls
  • Egress costs
  • Data compliance
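
As a rough illustration of the change data feed point above (not code from the talk), the sketch below enables CDF on a Delta table and reads only the rows that changed; the table name, starting version, and columns are placeholders, and `spark` is assumed to be an active Databricks/PySpark session.

```python
# Enable change data feed on an existing Delta table (placeholder table name).
spark.sql("""
    ALTER TABLE sales.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the rows that changed since a given table version.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 42)        # or use startingTimestamp
    .table("sales.orders")
)

# Each row carries metadata columns describing the change.
changes.select("order_id", "_change_type", "_commit_version").show()
```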

Talk by: Surya Turaga and Thomas Roach

PII Detection at Scale on the Lakehouse

SEEK is Australia’s largest online employment marketplace and a market leader spanning ten countries across Asia Pacific and Latin America. SEEK provides employment opportunities for roughly 16 million monthly active users and processes 25 million candidate applications to listings. Processing millions of resumes involves handling and managing highly sensitive candidate information, usually supplied in a highly unstructured format. With recent high-profile data leaks in Australia, personally identifiable information (PII) protection has become a major focus area for large digital organizations.

The first step is detection, and SEEK has developed a custom framework built on Hugging Face transformers fine-tuned for the nuances of employment data. For example, “Software Engineer at Databricks” is not PII, but “CEO at Databricks” is. After identifying and anonymizing PII in streaming and batch data, SEEK uses Unity Catalog’s data lineage to track PII through reporting, ETL, and downstream ML use cases and to govern access control, achieving an organization-wide data management capability driven by deep learning and enforced with Databricks.
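
To illustrate the general approach (not SEEK’s actual fine-tuned framework), here is a minimal sketch that runs an off-the-shelf Hugging Face token-classification pipeline over free text and keeps only the entity types a policy flags as PII; the model name and the PII label set are illustrative placeholders.

```python
# Minimal PII-detection sketch using a generic public NER model as a stand-in
# for a fine-tuned, employment-aware model. Model name and label set are placeholders.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)

PII_LABELS = {"PER", "LOC"}  # entity types this placeholder policy treats as PII

def detect_pii(text: str) -> list[dict]:
    """Return detected entities considered PII under the placeholder policy."""
    return [e for e in ner(text) if e["entity_group"] in PII_LABELS]

print(detect_pii("Jane Doe, CEO at Databricks, lives in Sydney."))
```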

Talk by: Ajmal Aziz and Rachael Straiton

How Rec Room Processes Billions of Events Per Day with Databricks and RudderStack

Learn how Rec Room, a fast-growing augmented and virtual reality software startup, is saving 50% of their engineering team's time by using Databricks and RudderStack to power real-time analytics and insights for their 85 million gaming customers.

In this session, you will walk through a step-by-step explanation of how Rec Room set up efficient processes for ingestion into their data lakehouse, transformation, reverse-ETL and product analytics. You will also see how Rec Room is using incremental materialization of tables to save costs and establish an uptime of close to 100%.

Talk by: Albert Hu and Lewis Mbae

What’s New with Data Sharing and Collaboration on the Lakehouse: From Delta Sharing to Clean Rooms

Get ready to accelerate your data and AI collaboration game with the Databricks product team. Join us as we build the next generation of secure data collaboration capabilities on the lakehouse. Whether you're just starting your data sharing journey or exploring advanced data collaboration features like data cleanrooms, this session is tailor-made for you.

In this demo-packed session, you’ll discover what’s new in Delta Sharing, including dynamic and materialized views for sharing, sharing of other assets such as notebooks and ML models, new Delta Sharing open source connectors for the tools of your choice, and updates to Databricks Clean Rooms. Learn how the lakehouse is the perfect solution for your data and AI collaboration requirements across clouds, regions, and platforms, without any vendor lock-in. Plus, you’ll get a peek at our upcoming roadmap. Ask any burning questions you have for our expert product team as they build a collaborative lakehouse for data, analytics, and AI.

Talk by: Erika Ehrli, Kelly Albano, and Xiaotong Sun

Sponsored by: Microsoft | Next-Level Analytics with Power BI and Databricks

The widely-adopted combination of Power BI and Databricks has been a game-changer in providing a comprehensive solution for modern data analytics. In this session, you’ll learn how self-service analytics combined with the Databricks Lakehouse Platform can allow users to make better-informed decisions by unlocking insights hidden in complex data. We’ll provide practical examples of how organizations have leveraged these technologies together to drive digital transformation, lower total cost of ownership (TCO), and increase revenue. By the end of the presentation and demo, you’ll understand how Power BI and Databricks can help drive real-time insights at scale for organizations in any industry.

Talk by: Bob Zhang and Mahesh Prakriya

Taking Control of Streaming Healthcare Data

Chesapeake Regional Information System for our Patients (CRISP), a nonprofit healthcare information exchange (HIE), initially partnered with Slalom to build a Databricks data lakehouse architecture in response to the analytics demands of the COVID-19 pandemic; since then, they have expanded the platform to additional use cases. Recently, they have worked together to engineer streaming data pipelines that process healthcare messages, such as HL7, to help CRISP become vendor independent.

This session will focus on the improvements CRISP has made to their data lakehouse platform to support streaming use cases and the impact these changes have had on the organization. We will touch on using Databricks Auto Loader to efficiently ingest incoming files, ensuring data quality with Delta Live Tables, and sharing data internally with a SQL warehouse, as well as some of the work CRISP has done to parse and standardize HL7 messages from hundreds of sources. These efforts have allowed CRISP to stream over 4 million messages daily in near real time, with the scalability needed to onboard new healthcare providers and continue to facilitate care and improve health outcomes.
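
As a rough sketch of the Auto Loader pattern mentioned above (paths, formats, and table names are placeholders, not CRISP’s actual pipeline, and `spark` is assumed to be the active session in a Databricks notebook), incoming message files can be ingested incrementally into a bronze Delta table like this:

```python
# Minimal Auto Loader sketch: incrementally ingest raw HL7 message files into a
# bronze Delta table. All paths and table names below are illustrative placeholders.
raw_messages = (
    spark.readStream.format("cloudFiles")                 # Auto Loader source
    .option("cloudFiles.format", "text")                  # each HL7 message arrives as a text file
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/hl7/schema")
    .load("/mnt/landing/hl7/")
)

(raw_messages.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/hl7/bronze")
    .trigger(availableNow=True)                            # drain available files, then stop
    .toTable("bronze.hl7_messages"))
```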

Talk by: Andy Hanks and Chris Mantz

Databricks As Code: Effectively Automate a Secure Lakehouse Using Terraform for Resource Provisioning

At Rivian, we have automated more than 95% of our Databricks resource provisioning workflows using an in-house Terraform module, affording us a lean admin team to manage over 750 users. In this session, we will cover the following elements of our approach and how others can benefit from improved team efficiency.

  • User and service principal management
  • Our permission model on Unity Catalog for data governance
  • Workspace and secrets resource management
  • Managing internal package dependencies using init scripts
  • Facilitating dashboards, SQL queries and their associated permissions
  • Scaling source-of-truth, petabyte-scale Delta Lake table ingestion jobs and workflows

Talk by: Jason Shiverick and Vadivel Selvaraj

What's New in Databricks SQL -- With Live Demos

We’ve been pushing ahead to make the lakehouse even better for data warehousing across several pillars: a native serverless experience, best-in-class price/performance, intelligent workload management and observability, and enhanced connectivity and analyst and developer experiences. As we double down on that pace of innovation, we want to dive deep into everything that’s been keeping us busy.

In this session, we will share an update on key roadmap items. To bring things to life, you will see live demos of the most recent capabilities, spanning data ingestion, transformation, and consumption, using the modern data stack along with Databricks SQL.

Talk by: Can Efeoglu

Building & Managing a Data Platform for a Delta Lake Exceeding 13PB & 1000s of Users: AT&T's Story

Data runs AT&T’s business, just like it runs most businesses these days. Data can lead to a greater understanding of a business, and when translated correctly into information it can provide human and business systems with valuable insights to make better decisions. Unique to AT&T is the volume of data we support, how much of our work is driven by AI, and the scale at which data and AI drive value for our customers and stakeholders.

Our cloud migration journey includes making data and AI more accessible to employees throughout AT&T so they can use their deep business expertise to leverage data more easily and rapidly. We always had to balance this data democratization and desire for speed with keeping our data private and secure. We loved the open ecosystem model of the Lakehouse, which enables data, BI, and ML tools to be seamlessly integrated in a single pane of glass; it simplifies the architecture and reduces dependencies between technologies in the cloud. Being clear in our architecture guidelines and patterns was very important to our success.

We are seeing more interest from our business unit partners and are continuing to build out AI-as-a-service capabilities to support more citizen data scientists. To scale up our Lakehouse journey, we built a Databricks center of excellence (CoE) at AT&T, which today has more than 1,400 active members, concentrating existing ML/AI expertise and resources to collaborate on all things Databricks, including technical support, training, FAQs, and best practices, to attain and sustain world-class performance and drive business value for AT&T. Join us to learn more about how we process and manage over 10 petabytes of data in our network Lakehouse with Delta Lake and Databricks.

Build Your Data Lakehouse with a Modern Data Stack on Databricks

Are you looking for an introduction to the Lakehouse and what the related technology is all about? This session is for you. It explains the value that lakehouses bring, using examples of companies that are actively modernizing their data, with demos throughout. The data lakehouse is the future for modern data teams that want to simplify data workloads, ease collaboration, and maintain the flexibility and openness to stay agile as a company scales.

Come to this session and learn about the full stack, including data engineering, data warehousing in a lakehouse, data streaming, governance, and data science and AI. Learn how you can create modern data solutions of your own.

Talk by: Ari Kaplan and Pearl Ubaru

Lakehouse Federation: Access and Governance of External Data Sources from Unity Catalog

Are you tired of spending time and money moving data across multiple sources and platforms to access the right data at the right time? Join our session and discover Databricks’ new Lakehouse Federation feature, which allows you to access, query, and govern your data in place without leaving the Lakehouse. Our experts will demonstrate how you can leverage the latest enhancements in Unity Catalog, including query federation, the Hive interface, and Delta Sharing, to discover and govern all your data in one place, regardless of where it lives.

Talk by: Can Efeoglu and Todd Greenstein

Learn How to Reliably Monitor Your Data and Model Quality in the Lakehouse

Developing and maintaining production data engineering and machine learning pipelines is a challenging process for many data teams. Even more challenging is monitoring the quality of your data and models once they go into production. Building on untrustworthy data can cause many complications for data teams. Without a monitoring service, it is difficult to proactively discover when your ML models degrade over time and to identify the root causes. Furthermore, without lineage tracking, it is even more painful to debug errors in your models and data. Databricks Lakehouse Monitoring offers a unified service to monitor the quality of all your data and ML assets.

In this session, you’ll learn how to:

  • Use one unified tool to monitor the quality of any data product: data or AI 
  • Quickly diagnose errors in your data products with root cause analysis
  • Set up a monitor with low friction, requiring only a button click or a single API call to start and automatically generate out-of-the-box metrics
  • Enable self-serve experiences for data analysts by providing reliability status for every data asset

Talk by: Kasey Uhlenhuth and Alkis Polyzotis

Advancements in Open Source LLM Tooling, Including MLflow

MLflow is one of the most used open source machine learning frameworks, with over 13 million monthly downloads. With the recent advancements in generative AI, MLflow has been rapidly adding support for many of the most popular AI tools, such as Hugging Face, LangChain, and OpenAI. This means it’s becoming easier than ever to build AI pipelines with your data as the foundation while expanding your capabilities with the incredible advancements of the AI community.

Come to this session to learn how MLflow can help you:

  • Easily grab open source models from Hugging Face and use Transformers pipelines in MLflow (see the sketch after this list)
  • Integrate LangChain for more advanced services and to add context into your model pipelines
  • Bring in OpenAI APIs as part of your pipelines
  • Quickly track and deploy models on the lakehouse using MLflow
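
As a small illustration of the first bullet above (a minimal sketch assuming mlflow 2.3+ and transformers are installed; the model name is an arbitrary public example, not one from the talk):

```python
# Minimal sketch: log a Hugging Face Transformers pipeline with MLflow and reload it.
import mlflow
from transformers import pipeline

# Grab an open source model from Hugging Face as a Transformers pipeline.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# Track it with MLflow so it can be registered and deployed on the lakehouse.
with mlflow.start_run():
    model_info = mlflow.transformers.log_model(
        transformers_model=summarizer,
        artifact_path="summarizer",
    )

# Load the logged model back as a generic pyfunc and run inference.
loaded = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded.predict(["MLflow now logs Hugging Face pipelines natively."]))
```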

Talk by: Corey Zumar and Ben Wilson

What’s New in Databricks Workflows -- With Live Demos

Databricks Workflows provides unified orchestration for the Lakehouse. Since it was first announced last year, Workflows has been used by thousands of organizations to orchestrate lakehouse workloads such as ETL, BI dashboard refreshes, and ML model training.

In this session, the Workflows product team will cover and demo the latest features and capabilities of Databricks Workflows in the areas of workflow authoring, observability and more. This session will also include an outlook for future innovations you can expect to see in the coming months.

Talk by: Muhammad Bilal Aslam

Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks

In this session, we will introduce Databricks Asset Bundles, demonstrate how they work for a variety of data products, and show how to fit them into an overall CI/CD strategy for the well-architected Lakehouse.

Data teams produce a variety of assets: datasets, reports and dashboards, ML models, and business applications. These assets depend upon code (notebooks, repos, queries, pipelines), infrastructure (clusters, SQL warehouses, serverless endpoints), and supporting services and resources like Unity Catalog, Databricks Workflows, and DBSQL dashboards. Today, each organization must figure out its own deployment strategy for the data products it builds on Databricks, as there is no consistent way to describe the infrastructure and services associated with project code.

Databricks Asset Bundles is a new capability on Databricks that standardizes and unifies the deployment strategy for all data products developed on the platform. It allows developers to describe the infrastructure and resources of their project through a YAML configuration file, regardless of whether they are producing a report, dashboard, online ML model, or Delta Live Tables pipeline. Behind the scenes, these configuration files use Terraform to manage resources in a Databricks workspace, but knowledge of Terraform is not required to use Databricks Asset Bundles.

Talk by: Rafi Kurlansik and Pieter Noordhuis
