talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

561

Filtering by: Databricks ×

Sessions & talks

Showing 376–400 of 561 · Newest first

Securing Databricks on AWS Using Private Link

2022-07-19 · Watch video

Minimizing data transfers over the public internet is among the top priorities for organizations of any size, both for security and cost reasons. Modern cloud-native data analytics platforms need to support deployment architectures that meet this objective. For Databricks on AWS such an architecture is realized thanks to AWS PrivateLink, which allows computing resources deployed on different virtual private networks and different AWS accounts to communicate securely without ever crossing the public internet.

In this session, we provide a brief introduction to AWS PrivateLink and its main use cases in the context of a Databricks deployment: securing communications between the control and data planes and securely connecting to the Databricks web UI. We will then provide a step-by-step walkthrough of setting up PrivateLink connections with a Databricks deployment and demonstrate how to automate that process using AWS CloudFormation or Terraform templates.

In this presentation we will cover the following topics:

  • A brief introduction to AWS PrivateLink
  • How you can use PrivateLink to secure your AWS Databricks deployment
  • A step-by-step walkthrough of how to set up PrivateLink
  • How to automate and scale the setup using AWS CloudFormation or Terraform

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Security Best Practices for Lakehouse

2022-07-19 · Watch video

To learn more, visit the Databricks Security and Trust Center: https://www.databricks.com/trust

As you embark on a lakehouse project or evolve your existing data lake, you may want to improve your security posture and take advantage of new security features—there may even be a security team at your company that demands it! Databricks has worked with thousands of customers to securely deploy the Databricks Platform to meet their architecture and security requirements. While many organizations deploy security differently, we have found a common set of guidelines and features among organizations who require a high level of security. In this talk, we will detail the security features and architectural choices frequently used by these organizations and walk through a series of threat models for the risks that most concern security teams. While this session is great for people who already know Databricks, don’t worry, that knowledge isn’t required.

You will walk away with a full handbook detailing all of the concepts, configurations, and code from the session so that you can make immediate progress when you get back to the office. Security can be hard, but we’ve collected the hard work already done by some of the best in the industry, to make it easier. Come learn how.


Serverless Kafka and Apache Spark in a Multi-Cloud Data Lakehouse Architecture

2022-07-19 · Watch video

Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.

This session explores different architectures for building serverless Kafka and Spark multi-cloud deployments across regions and continents. We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern data lakehouse. Real-world use cases show the joint value and explore the benefits of the Delta Lake integration.


Simon Whiteley + Denny Lee Live Ask Me Anything

2022-07-19 · Watch video
Denny Lee (Databricks), Simon Whiteley (Advancing Analytics)

Simon and Denny Build A Thing is a live webshow, where Simon Whiteley (Advancing Analytics) and Denny Lee (Databricks) are building out a TV Ratings Analytics tool, working through the various challenges of building out a Data Lakehouse using Databricks. In this session, they'll be talking through their Lakehouse Platform, revisiting various pieces of functionality, and answering your questions, Live!

This is your chance to ask questions around structuring a lake for enterprise data analytics, the various ways we can use Delta Live Tables to simplify ETL or how to get started serving out data using Databricks SQL. We have a whole load of things to talk through, but we want to hear YOUR questions, which we can field from industry experience, community engagement and internal Databricks direction. There's also a chance we'll get distracted and talk about the Expanse for far too long.


Building Metadata and Lineage Driven Pipelines on Kubernetes

2022-07-19 · Watch video

Machine learning plays a critical role in every industry amid its widespread adoption, and composing ML pipelines at a rapid pace is essential for success. However, an ML pipeline consists of several components and requires effort from different teams, including data engineers, data scientists, and ML engineers. A typical cooperation strategy is to define a sequence of tasks, coordinate the integration, test, apply fixes and enhancements, and repeat. ML pipeline components produced by this task-driven approach lack reusability and demand ongoing maintenance effort. Kubeflow Pipelines, a platform that makes deployments of ML pipelines on Kubernetes straightforward and scalable, provides a metadata- and lineage-driven approach to developing platform-independent, portable ML pipelines. Data linkage and propagation become crystal clear within ML pipelines, and this approach also simplifies ML pipeline composition.
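The metadata- and lineage-driven approach can be illustrated with a small sketch, assuming a simplified model where each step declares its inputs and outputs (illustrative plain Python, not the Kubeflow Pipelines API):

```python
# A minimal sketch of metadata- and lineage-driven composition: each step
# declares its inputs and outputs, so data propagation between steps can be
# traced explicitly instead of being buried inside task code.

class Step:
    def __init__(self, name, inputs, outputs, fn):
        self.name, self.inputs, self.outputs, self.fn = name, inputs, outputs, fn

def run_pipeline(steps):
    artifacts, lineage = {}, {}   # artifact name -> value / producing step
    for step in steps:
        args = [artifacts[i] for i in step.inputs]   # resolve upstream artifacts
        results = step.fn(*args)
        for out, value in zip(step.outputs, results):
            artifacts[out] = value
            lineage[out] = (step.name, tuple(step.inputs))
    return artifacts, lineage

steps = [
    Step("extract", [], ["raw"], lambda: ([3, 1, 2],)),
    Step("clean", ["raw"], ["sorted"], lambda raw: (sorted(raw),)),
    Step("train", ["sorted"], ["model"], lambda s: (sum(s) / len(s),)),
]
artifacts, lineage = run_pipeline(steps)
print(lineage["model"])  # ('train', ('sorted',))
```

Because every artifact records which step produced it and from which inputs, data propagation through the pipeline can be traced without inspecting any task's internals.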


Building Production-Ready Recommender Systems with Feature Stores

2022-07-19 · Watch video

Recommender systems are highly prevalent in modern applications and services but are notoriously difficult to build and maintain. Organizations face challenges such as complex data dependencies, data leakage, and frequently changing data/models. These challenges are compounded when building, deploying, and maintaining ML pipelines spans data scientists and engineers. Feature stores help address many of the operational challenges associated with recommender systems.

In this talk, we explore:

  • Challenges of building recommender systems
  • Strategies for reducing latency, while balancing requirements for freshness
  • Challenges in mitigating data quality issues
  • Technical and organizational challenges feature stores solve
  • How to integrate Feast, an open-source feature store, into an existing recommender system to support production systems
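One of the leakage-related challenges above can be sketched in a few lines, assuming a simplified feature history of (timestamp, value) pairs (illustrative plain Python, not the Feast API):

```python
from bisect import bisect_right

# A minimal sketch of point-in-time feature retrieval: returning the latest
# feature value at or before each training event's timestamp prevents future
# information from leaking into training rows.

def point_in_time_lookup(history, event_ts):
    """history: list of (ts, value) sorted by ts; return the value as of event_ts."""
    idx = bisect_right([ts for ts, _ in history], event_ts)
    return history[idx - 1][1] if idx else None

clicks_7d = [(100, 3), (200, 5), (300, 9)]   # feature history for one user
print(point_in_time_lookup(clicks_7d, 250))  # 5  (value at ts=200, not future ts=300)
print(point_in_time_lookup(clicks_7d, 50))   # None (no value known yet)
```

A feature store performs this lookup consistently for both training and serving, which is what closes the train/serve skew gap described above.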


Customer-centric Innovation to Scale Data & AI Everywhere

2022-07-19 · Watch video

Imagine a world where you have the flexibility to infuse intelligence into every application, from edge to cloud. In this session, you will learn how Intel is enabling customer-centric innovation and delivering the simplicity, productivity, and performance developers need to scale their data and AI solutions everywhere. We will present an overview of Intel's end-to-end data analytics and AI technologies and developer tools, along with examples of customer use cases.


Streaming Data into Delta Lake with Rust and Kafka

2022-07-19 · Watch video

Scribd's data architecture was originally batch-oriented, but in the last couple years, we introduced streaming data ingestion to provide near-real-time ad hoc query capability, mitigate the need for more batch processing tasks, and set the foundation for building real-time data applications.

Kafka and Delta Lake are the two key components of our streaming ingestion pipeline. Various applications and services write messages to Kafka as events are happening. We were tasked with getting these messages into Delta Lake quickly and efficiently.

Our first solution was to deploy Spark Structured Streaming jobs. This got us off the ground quickly, but had some downsides.

Since Delta Lake and the Delta transaction protocol are open source, we kicked off a project to implement our own Rust ingestion daemon. We were confident we could deliver a Rust implementation since our ingestion jobs are append only. Rust offers high performance with a focus on code safety and modern syntax.

In this talk, I will describe Scribd's unique approach to ingesting messages from Kafka topics into Delta Lake tables. I will describe the architecture, deployment model, and performance of our solution, which leverages the kafka-delta-ingest Rust daemon and the delta-rs crate hosted in auto-scaling ECS services. I will discuss foundational design aspects for achieving data integrity, such as distributed locking with DynamoDB to overcome S3's lack of "PutIfAbsent" semantics, and avoiding duplicates or data loss when multiple concurrent tasks are handling the same stream. I'll highlight the reliability and performance characteristics we've observed so far. I'll also describe the Terraform deployment model we use to deliver our 70-and-growing production ingestion streams into AWS.
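The put-if-absent idea can be sketched with an in-memory stand-in for the conditional write that DynamoDB provides (key names and semantics here are illustrative, not Scribd's actual schema):

```python
import threading

# A minimal in-memory sketch of "put-if-absent" locking: S3 lacks conditional
# puts, so a table with an atomic conditional write arbitrates which of
# several concurrent writers may commit the next Delta table version.

class ConditionalStore:
    def __init__(self):
        self._items, self._mutex = {}, threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically write key only if it does not exist; return True on success."""
        with self._mutex:
            if key in self._items:
                return False
            self._items[key] = value
            return True

store = ConditionalStore()
# Two concurrent writers race to commit version 42; exactly one wins.
first = store.put_if_absent("table/_commit/42", "writer-A")
second = store.put_if_absent("table/_commit/42", "writer-B")
print(first, second)  # True False
```

The losing writer re-reads the table state and retries against the next version number, which is how duplicates and lost commits are avoided when multiple tasks handle the same stream.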


Tackling Challenges of Distributed Deep Learning with Open Source Solutions

2022-07-19 · Watch video

Deep learning has had an enormous impact in a variety of domains, however, with model and data size growing at a rapid pace, scaling out deep learning training has become essential for practical use.

In this talk, you will learn about the challenges and various solutions for distributed deep learning.

We will first cover some of the common patterns used to scale out deep learning training.

We will then describe some of the challenges with distributed deep learning in practice:

  • Infrastructure and hardware management: spending too much time managing clusters, resources, and the scheduling/placement of jobs or processes.
  • Developer iteration speed: too much overhead to go from small-scale local ML development to large-scale training; hard to run distributed training jobs in a notebook/interactive environment.
  • Difficulty integrating with open source software: scaling out training while still being able to leverage open source tools such as MLflow, PyTorch Lightning, and Hugging Face.
  • Managing large-scale training data: efficiently ingesting large amounts of training data into a distributed machine learning model.
  • Cloud compute costs: leveraging cheaper spot instances without having to restart training when a node is preempted, and easily switching between cloud providers to reduce costs without rewriting all your code.

Then, we will share the merits of the ML open source ecosystem for distributed deep learning. In particular, we will introduce Ray Train, an open source library built on the Ray distributed execution framework, and show how its integrations with other open source libraries (PyTorch, Hugging Face, MLflow, etc.) alleviate the pain points above.

We will conclude with a live demo showing large-scale distributed training using these open source tools.
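As a toy illustration of the data-parallel pattern the talk covers, the following sketch (plain Python, not the Ray Train API) shows workers computing gradients on separate shards and averaging them before each update:

```python
# A minimal sketch of data-parallel training: each "worker" computes the
# gradient on its own data shard, the gradients are averaged (the all-reduce
# step), and the shared model weight is updated with the average.

def grad(w, shard):
    # gradient of mean squared error for the model y = w * x on one shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    grads = [grad(w, s) for s in shards]   # computed independently per worker
    avg = sum(grads) / len(grads)          # all-reduce (average) step
    return w - lr * avg

data = [(x, 2.0 * x) for x in range(1, 9)]   # ground truth: w = 2
shards = [data[:4], data[4:]]                # two "workers"
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # 2.0 (recovers the true weight)
```

Frameworks like Ray Train take care of process placement, fault tolerance, and the actual collective communication; the arithmetic above is the essence of what they distribute.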


Building Scalable & Advanced AI based Language Solutions for R&D using Databricks

2022-07-19 · Watch video

AI is ubiquitous, and language-based AI even more so. Processing unstructured data, understanding language to extract information, and generating language to answer questions or write essays all have specific business applications. In pharma R&D, Deloitte has been investing in solutions across every part of the value chain, with AI/ML models embedded throughout. We have leveraged Databricks both as a development platform and as a data pipeline system, which has helped us accelerate and streamline the AI/ML model development required across the R&D value chain. Through a systematic and scalable approach to processing, understanding, and generating unstructured content, we have successfully delivered multiple use cases, achieving business value and proving out critical advanced AI capabilities. We will discuss these situational challenges and solutions during our session.


Building Spatial Applications with Apache Spark and CARTO

2022-07-19 · Watch video

CARTO’s Spatial Extension provides the fundamental building blocks for Location Intelligence in Databricks. Many of the largest organizations using CARTO leverage Databricks for their analytics. Customers very often build custom spatial applications that simplify a spatial analysis use case or provide a more direct interface to business intelligence or information. CARTO facilitates the creation of these apps with a complete set of development libraries and APIs. For visualization, CARTO makes use of the powerful deck.gl visualization library. You use CARTO Builder to design your maps and perform analytics with Spatial SQL (similar to PostGIS, but with the scalability of Apache Spark), then reference those maps in your code. CARTO handles visualizing large datasets, updating the maps, and everything in between. In this talk we will walk you through the process of building spatial applications with CARTO hosted in Apache Spark.


Challenges in Time Series Forecasting

2022-07-19 · Watch video

Accurate business forecasts are one of the most important aspects of corporate planning. Producing them is enormously challenging using only human intellect and rudimentary tools like spreadsheets, given the numerous factors that go into forecasting. Machine learning applied to time series data is a much more efficient and effective way to analyze the data, apply a forecasting algorithm, and derive accurate forecasts.
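As a minimal sketch of what "applying a forecasting algorithm" means in practice, the following baseline (with invented numbers) forecasts future points from a trailing moving average, the kind of simple model any ML forecaster should beat:

```python
# A minimal forecasting baseline: each future point is predicted from the
# trailing window, and multi-step forecasts feed predictions back in as
# pseudo-observations.

def moving_average_forecast(series, window, horizon):
    """Forecast `horizon` future points, each from the trailing `window` values."""
    history = list(series)
    forecasts = []
    for _ in range(horizon):
        pred = sum(history[-window:]) / window
        forecasts.append(pred)
        history.append(pred)   # feed the forecast back in for the next step
    return forecasts

sales = [100, 110, 105, 115, 120, 118]   # invented monthly figures
print(moving_average_forecast(sales, window=3, horizon=2))
```

An ML model earns its keep only when it beats baselines like this on held-out data, which is why evaluating against them is a standard first step.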


Chaos Engineering in the World of Large-Scale Complex Data Flow

2022-07-19 · Watch video

A complex data flow is a set of operations to extract data from multiple sources, write to multiple targets, and refine the results using extract, transform, join, filter, and sort. Chaos Engineering involves experimenting with a distributed system to test its ability to withstand turbulent conditions in production. But, what about data? How confident are we that the complex data system will be safe once it is in production? The key is to experiment in production and automate while minimizing customer pain and protecting data from getting corrupted or accidentally deleted. In this session, you will discover how chaos engineering principles apply to distributed data systems and the tools that enable us to make our data workloads more resilient. We will also show you how to leverage lakeFS to recover from deploying code that resulted in corrupted data, which can easily happen with many moving parts.
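The recovery pattern can be sketched with an in-memory stand-in for versioned storage (illustrative plain Python, not the lakeFS API):

```python
# A minimal in-memory sketch of the recovery pattern in this talk: commit
# changes as immutable snapshots, and if a deploy corrupts data, reset to
# the last good commit instead of repairing files by hand.

class VersionedStore:
    def __init__(self):
        self.commits = [{}]   # history of immutable snapshots

    def commit(self, changes):
        snapshot = {**self.commits[-1], **changes}
        self.commits.append(snapshot)
        return len(self.commits) - 1   # commit id

    def revert_to(self, commit_id):
        # reverting is itself a new commit, so history is never rewritten
        self.commits.append(dict(self.commits[commit_id]))

    def head(self):
        return self.commits[-1]

store = VersionedStore()
good = store.commit({"orders.parquet": "v1-checksum"})
store.commit({"orders.parquet": "corrupted!"})   # chaos experiment goes wrong
store.revert_to(good)
print(store.head()["orders.parquet"])  # v1-checksum
```

This is what makes chaos experiments on data safe to run: the blast radius of a bad write is bounded by the cost of a revert, not the cost of a restore from backup.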


Cleanlab: AI to Find and Fix Errors in ML Datasets

2022-07-19 · Watch video

Real-world datasets have a large fraction of errors, which negatively impacts model quality and benchmarking. This talk presents Cleanlab, an open-source tool that addresses these issues using the latest research in data-centric AI. Cleanlab has been used to improve datasets at a number of Fortune 500 companies.

Ontological issues, invalid data points, and label errors are pervasive in datasets. Even gold-standard ML datasets have on average 3.3% label errors (labelerrors.com). Data errors degrade model quality, and errors lead to incorrect conclusions about model performance and suboptimal models being deployed.

We present the cleanlab open-source package (github.com/cleanlab/cleanlab) for finding and fixing data errors. We will walk through using Cleanlab to fix errors in a real-world dataset, with an end-to-end demo of how Cleanlab improves data and model performance.
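The core idea behind flagging label errors can be sketched in a few lines (a simplified illustration of confident learning, not the cleanlab package's actual API):

```python
# A minimal sketch of the idea behind confident learning: flag examples
# whose given label receives low predicted probability from a
# (cross-validated) model, since those labels are candidates for review.

def find_label_issues(pred_probs, labels, threshold=0.2):
    """pred_probs: rows of class probabilities; labels: given class index per row."""
    return [
        i for i, (probs, label) in enumerate(zip(pred_probs, labels))
        if probs[label] < threshold
    ]

pred_probs = [
    [0.90, 0.10],   # confidently class 0, label agrees
    [0.10, 0.90],   # confidently class 1, label agrees
    [0.05, 0.95],   # given label 0, but the model says class 1 -> suspect
]
labels = [0, 1, 0]
print(find_label_issues(pred_probs, labels))  # [2]
```

The real method estimates per-class thresholds from the data rather than using a fixed cutoff, but the contrast between model confidence and the given label is the same signal.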

Finally, we will show Cleanlab Studio, which provides a web interface for human-in-the-loop data quality control.


Cloud Native Geospatial Analytics at JLL

2022-07-19 · Watch video
Yanqing Zeng (JLL), Luis Sanz (CARTO)

Luis Sanz, CEO of CARTO, and Yanqing Zeng, Lead Data Scientist at JLL, take us through how cloud native geospatial analytics can be unlocked on the Databricks Lakehouse platform with CARTO. Yanqing will showcase her work on large-scale spatial analytics projects addressing some of the most critical analysis use cases in real estate. Taking a geospatial perspective, Yanqing will share practical examples of how large-scale spatial data and analytics can be used for property portfolio mapping, AI-driven risk assessment, real estate valuation, and more.


Competitive advantage hinges on predictive insights generated from AI! Build powerful data-driven

2022-07-19 · Watch video

AI is central to unlocking competitive advantage. However, data science teams often lack access to the consistent, high-quality data required to build AI and ML data applications.

Instead, data scientists spend 80% of their time collecting, cleaning, and preparing data for analysis rather than building AI data applications.

During this talk, Snowplow introduces the concept of data creation: creating and deploying high-quality, predictive behavioral data to Databricks in real time.

Learn how AI-ready data in Databricks allows data science teams to focus on building AI data applications rather than data wrangling, dramatically accelerating the pace of data projects, improving model performance, and strengthening data governance. You will learn:

  • How to run more AI- and data-intensive applications in production using Databricks and Snowplow
  • How to deliver each AI- and data-intensive application faster thanks to pre-validated, predictive data
  • How data creation can solve for data governance


Complete Data Security and Governance Powered by Unity Catalog and Immuta

2022-07-19 · Watch video

Join Immuta and Databricks to learn how combining the Databricks Unity Catalog and Immuta’s industry-leading data access platform enables complete data governance with granular security. This new integration makes Immuta-orchestrated attribute-based access control (ABAC) policies even more powerful and non-invasive, taking the solution to new levels and empowering your data platform teams.

During this session, you’ll also learn:

  • Why ABAC is essential for modern data stacks
  • How customers use an ABAC model to orchestrate complex policies at scale
  • Details on the Unity primitives for row- and column-level security
  • How Immuta will scale Unity enforcement primitives through ABAC and abstractions
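The ABAC idea itself can be sketched in a few lines (illustrative plain Python; the attribute names are invented, and this is not the Immuta or Unity Catalog API):

```python
# A minimal sketch of attribute-based access control: a policy grants access
# by matching user attributes against requirements keyed by resource
# classification, instead of enumerating users per table.

def abac_allows(user_attrs, resource_attrs, policy):
    """policy maps a resource classification to required user attributes."""
    required = policy.get(resource_attrs.get("classification"), {})
    return all(user_attrs.get(k) == v for k, v in required.items())

policy = {
    "pii":    {"department": "compliance", "training": "privacy-certified"},
    "public": {},   # no requirements
}
analyst = {"department": "marketing"}
officer = {"department": "compliance", "training": "privacy-certified"}

print(abac_allows(analyst, {"classification": "pii"}, policy))     # False
print(abac_allows(officer, {"classification": "pii"}, policy))     # True
print(abac_allows(analyst, {"classification": "public"}, policy))  # True
```

Because the policy is a function of attributes rather than a list of users, onboarding a new user or table requires no policy change, which is what makes ABAC scale.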


Comprehensive Patient Data Self-Serve Environment and Executive Dashboards Leveraging Databricks

2022-07-19 · Watch video

In this talk, we will outline our data pipelines and demo dashboards developed on top of the resulting Elasticsearch index. This tool enables queries for terms or phrases in the raw documents to be executed together with any associated EMR patient data filters within 1–2 seconds for a dataset containing millions of records and documents. Finally, the dashboards are simple to use and enable Real World Evidence data stakeholders to gain real-time statistical insight into the comprehensive patient information available.
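A query combining free-text phrase search with structured EMR filters might look like the following sketch, where the index fields and example values are assumptions rather than the team's actual schema:

```python
# A sketch of the kind of Elasticsearch query such dashboards issue: a
# phrase match on raw document text combined with structured EMR filters.
# Field names ("document_text", "diagnosis_code", "age") are assumptions.

def build_query(phrase, diagnosis_code, min_age):
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"document_text": phrase}}],
                "filter": [
                    {"term": {"diagnosis_code": diagnosis_code}},
                    {"range": {"age": {"gte": min_age}}},
                ],
            }
        }
    }

q = build_query("shortness of breath", "J44.9", 50)
print(q["query"]["bool"]["filter"][1])  # {'range': {'age': {'gte': 50}}}
```

Putting the structured conditions in the `filter` clause (rather than `must`) lets Elasticsearch cache them and skip scoring, which is part of how sub-second latency over millions of documents is achievable.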


Constraints, Democratization, and the Modern Data Stack - Building a Data Platform At Red Ventures

2022-07-19 · Watch video

The time and attention of skilled engineers are some of the most constrained, valuable resources at Red Digital, a marketing agency embedded within Red Ventures. Acknowledging that constraint, the team at Red Digital has taken a deliberate, product-first approach to modernize and democratize their data platform. With the help of modern tools like Databricks, Fivetran, dbt, Monte Carlo, and Airflow, Red Digital has increased its development velocity and the size of the available talent pool to continue to grow the business.

This talk will walk through some of the key challenges, decisions, and solutions that the Red Digital team has made to build a suite of parallel data stacks capable of supporting its growing business.


Coral and Transport: Portable SQL and UDFs for the Interoperability of Spark and Other Engines

2022-07-19 · Watch video

In this talk, we present two open source projects, Coral and Transport, that enable deep SQL and UDF interoperability between Spark and other engines, such as Trino and Hive. Coral is a SQL analysis, rewrite, and translation engine that enables compute engines to interoperate and analyze different SQL dialects and plans, through the conversion to a common relational algebraic intermediate representation. Transport is a UDF framework that enables users to write UDFs against a single API but execute them as native UDFs of multiple engines, such as Spark, Trino, and Hive. Further, we discuss how LinkedIn leverages Coral and Transport, and present a production use case for accessing views of other engines in Spark as well as enhancing Spark DataFrame and Dataset view schema. We discuss other potential applications such as automatic data governance and data obfuscation, query optimization, materialized view selection, incremental compute, and data source SQL and UDF communication.
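The write-once, run-anywhere UDF idea can be sketched as follows (illustrative Python; Transport itself is a Java framework, and the wrapper functions here are hypothetical placeholders, not its real API):

```python
# A minimal sketch of the Transport idea: UDF logic is written once against
# a neutral interface, and thin per-engine wrappers adapt it to each
# engine's calling convention. The wrappers below are placeholders.

def initcap(s):
    """Engine-neutral UDF logic, written exactly once."""
    return " ".join(w.capitalize() for w in s.split())

def as_spark_udf(fn):
    # placeholder standing in for pyspark.sql.functions.udf(fn)
    return fn

def as_trino_stub(fn):
    # placeholder standing in for generating a Trino plugin wrapper
    return fn

spark_initcap = as_spark_udf(initcap)
trino_initcap = as_trino_stub(initcap)
print(spark_initcap("hello spark"), "|", trino_initcap("hello trino"))
```

The payoff is that the business logic is tested once, while each engine sees a native UDF in its own type system, which is what Transport generates for Spark, Trino, and Hive.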


Correlation Over Causation: Cracking the Relationship Between User Engagement and User Happiness

2022-07-19 · Watch video

As head of product on the Confluence team at Atlassian, I own the metrics associated with user happiness. This is a common area of ownership for heads of product, GMs, and CEOs. But how do you actually use data to move the needle on user happiness, and how do you convert user activity and engagement insights into clear actions that end up positively impacting user happiness? In this talk, I share the approach we developed jointly with our data analytics team to understand, operationalize, and report on our journey to make Confluence users happier. This talk will be useful for data analytics and data science practitioners, product executives, and anyone faced with the task of operationalizing improvement of a "fuzzy" metric like NPS or CSAT.


Data Boards: A Collaborative and Interactive Space for Data Science

2022-07-19 · Watch video

Databricks enables many organizations to harness the power of data; but while Databricks enables collaboration across data scientists and data engineers, there is still an opportunity to democratize access for domain experts. Successfully achieving this requires rethinking classic analytics user interfaces in favor of interactive systems with highly collaborative visual interfaces. Current visualization and workflow tools are ill-suited to bringing the full team together. I will present Northstar, a novel system for interactive data exploration we developed at MIT and Brown University, now commercialized by Einblick. I will explain why Northstar required us to completely rethink the analytics stack, from the interface to the “guts,” and highlight the techniques we developed to provide a truly novel user interface that enables code-optional analysis over Databricks, where all user personas can collaborate on very large datasets and use complex ML operations.


Data Lake for State Health Exchange Analytics using Databricks

2022-07-19 · Watch video

One of the largest state-based health exchanges in the country was looking to modernize its data warehouse (DWH) environment to support the vision that every decision to design, implement, and evaluate its state-based health exchange portal is informed by timely and rigorous evidence about its consumers’ experiences. The scope of the project was to replace the existing Oracle-based DWH with an analytics platform that could support a much broader range of requirements, including unified analytics capabilities such as machine learning. The modernized analytics platform comprises a cloud-native data lake and DWH solution using Databricks. The solution provides significantly higher performance and elastic scalability to better handle larger and varying data volumes, with a much lower cost of ownership compared to the existing solution. In this session, we will walk through the rationale behind tool selection, the solution architecture, the project timeline, and the expected benefits.


Data Lakehouse and Data Mesh—Two Sides of the Same Coin

2022-07-19 · Watch video

Over the last few years, two new approaches to data management have been developed in the data community: Data Mesh and Data Lakehouse. The latter is an open architecture that extends the technological advancements of a data lake by adding data management capabilities proven by a long history of data warehousing practice. Data Mesh, on the other hand, addresses data management challenges from an organizational angle, advocating decentralized ownership of domain data while applying product thinking and domain-driven design to analytics data. At first one might think that these two architectural approaches compete with each other; however, in this talk you will learn that the two are rather orthogonal and can go very well together.


Data Mesh Implementation Patterns

2022-07-19 · Watch video

Data mesh has caught the attention of practitioners with its promise of increased speed to insights and lower data management costs. There is sufficient literature available on the “what” of data mesh, but not enough design patterns on the “how.”

The presentation will focus on implementation patterns of various data products based on the nature of data sources, historical data needs, the serving application, and consumption requirements. Besides the implementation patterns at execution time, the presentation will also delve into access and search patterns.

The implementation patterns are built with a focus on domain data as a first-class concern, as opposed to the data pipeline as a first-class concern, with the following core tenets embedded in all the patterns:

  • Serving instead of ingesting
  • Discovering and using instead of extracting and loading
  • Publishing events as streams instead of data flowing around via pipelines
  • An ecosystem of data products instead of a centralized data platform
