talk-data.com

Topic: Cloud Computing

Tags: infrastructure, saas, iaas

99 tagged

Activity Trend: 471 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: Databricks DATA + AI Summit 2023

Essential Data Security Strategies for the Modern Enterprise Data Architecture

Balancing critical data requirements is a 24/7 task for enterprise-level organizations, which must straddle the need to open specific gates to enable self-service data access while closing other access points to maintain internal and external compliance. Data breaches cost U.S. businesses an average of $9.4 million per occurrence; ignoring this risk leaves organizations vulnerable to severe losses and crippling costs.

The 2022 Gartner Hype Cycle for Data Security reports that more and more enterprises are modernizing their data architecture with cloud and technology partners to help them collect, store and manage business data; a trend that does not appear to be letting up. According to Gartner®, “by 2025, 30% of enterprises will have adopted the Broad Data Security Platform (bDSP), up from less than 10% in 2021, due to the pent-up demand for higher levels of data security and the rapid increase in product capabilities.”

Moving to both a modern data architecture and a data-driven culture sets enterprises on the right trajectory for growth, but it’s important to keep in mind that individual public cloud platforms are not guaranteed to protect and secure data. To solve this, Privacera pioneered the industry’s first open-standards-based data security platform that integrates privacy and compliance across multiple cloud services.

During this presentation, we will discuss:

  • Why today’s modern data architecture needs a DSP that works across the entire data ecosystem.
  • Essential DSP prescriptive measures and adoption strategies.
  • Why faster and more responsible access to data insights helps reduce cost, increase productivity, expedite decision making, and drive exponential growth.

Talk by: Piet Loubser

Here’s more to explore: Data, Analytics, and AI Governance: https://dbricks.co/44gu3YU

Real-Time Streaming Solution for Call Center Analytics: Business Challenges and Technical Enablement

A large international client with a business footprint in North America, Europe, and Africa reached out to us with an interest in having a real-time streaming solution designed and implemented for its call center handling incoming and outgoing client calls. The client had a bad prior experience with another vendor, who overpromised and underdelivered on the latency of the streaming solution: the vendor delivered an overly complex streaming data pipeline, stitching together many services, in which data took over five minutes to reach the visualization layer.

Our immediate challenges involved gaining the client's trust and proving that our design and implementation quality would surpass the previous experience. To resolve the immediate challenge of the overly complicated pipeline design, we deployed a Databricks Lakehouse architecture with Azure Databricks at the center of the solution. Our reference architecture integrated Genesys Cloud → App Services → Event Hub → Databricks → Data Lake → Power BI.

The streaming solution proved to be low latency (seconds) during the POV stage, which led to subsequent productionalization of the pipeline: deployment of jobs and DLT pipelines, including a multi-notebook workflow, and the business and performance metrics dashboards the call center staff rely on for day-to-day performance monitoring and improvement.
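
To make the latency-critical leg of this flow concrete, here is a minimal sketch of reading call events from Azure Event Hubs (via its Kafka-compatible endpoint) into a Delta table with Spark Structured Streaming. The namespace, topic, table, and trigger settings are illustrative assumptions, not the client's actual configuration:

    # Minimal sketch: stream call-center events from Azure Event Hubs (Kafka
    # endpoint) into a Delta table for the BI layer. All names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("call-center-stream").getOrCreate()

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
        .option("subscribe", "call-events")             # hypothetical topic
        .option("kafka.security.protocol", "SASL_SSL")  # auth config omitted
        .load()
    )

    # Keep latency low: parse only what the dashboards need before writing.
    events = raw.select(col("value").cast("string").alias("payload"),
                        col("timestamp"))

    (
        events.writeStream.format("delta")
        .option("checkpointLocation", "/mnt/lake/checkpoints/call_events")
        .trigger(processingTime="10 seconds")  # seconds-level latency target
        .toTable("call_center.events_bronze")
    )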

Talk by: Natalia Demidova

Sponsored: Matillion - OurFamilyWizard Moves and Transforms Data for Databricks Delta Lake Easily

OurFamilyWizard helps families living separately thrive, empowering parents with needed tools after divorce or separation. Migrating to a modern data stack built on a Databricks Delta Lake seemed like the obvious choice for OurFamilyWizard to start integrating 20 years of on-prem Oracle data with event tracking and SaaS cloud data, but they needed tools to do it. OurFamilyWizard turned to Matillion, a powerful and intuitive solution, to quickly load, combine, and transform source data into reporting tables and data marts, empowering them to turn raw data into information the organization can use to make decisions.

In this session, Beth Mattson, OurFamilyWizard Senior Data Engineer, will detail how Matillion helped OurFamilyWizard migrate its data to Databricks quickly and provided end-to-end ETL capabilities. In addition, Jamie Baker, Matillion Director of Product Management, will give a brief demo and discuss the Matillion and Databricks partnership and what is on the horizon.

Talk by: Jamie Baker and Beth Mattson

Delta Sharing: The Key Data Mesh Enabler

Data Mesh is an emerging architecture pattern that challenges the centralized data platform approach by empowering different engineering teams to own the data products in a specific business domain. One of the keys to the success of any Data Mesh initiative is selecting the right protocol for Data Sharing between different business data domains that could potentially be implemented through different technologies and cloud providers.

In this session you will learn how the Delta Sharing protocol and the Delta table format have enabled the historically stuck-in-the-past energy and construction industries to be catapulted into the 21st century by way of a modern Data Mesh implementation based on Azure Databricks.
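
As a small illustration of why the protocol suits cross-platform domains: a consumer outside Databricks can read a shared Delta table with the open source delta-sharing Python client. The profile file and share/schema/table names below are hypothetical:

    # Minimal sketch: a non-Databricks consumer reads a shared Delta table
    # over the open Delta Sharing protocol. Names are illustrative.
    import delta_sharing

    # The profile file holds the sharing server endpoint and bearer token.
    table_url = "config.share#energy_share.operations.sensor_readings"

    # Load the shared table as a pandas DataFrame; no Databricks runtime needed.
    df = delta_sharing.load_as_pandas(table_url)
    print(df.head())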

Talk by: Francesco Pizzolon

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Sponsored: Kyvos | Analytics 100x Faster at the Lowest Cost w/ Kyvos & Databricks, Even on Trillions of Rows

Databricks and Kyvos together are helping organizations build their next-generation cloud analytics platform: a platform that can process and analyze massive amounts of data, even trillions of rows, and provide multidimensional insights instantly. Combining the power of Databricks with the speed, scale, and cost-optimization capabilities of the Kyvos Analytics Acceleration Platform, customers can push past the boundaries of their analytics. Join our session to learn how, and to hear about a real-world use case.

Talk by: Leo Duncan

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

How the Texas Rangers Revolutionized Baseball Analytics with a Modern Data Lakehouse

Don't miss this session, where we demonstrate how the Texas Rangers baseball team organized their predictive models using MLflow and the MLflow Model Registry inside Databricks. We started using Databricks as a simple way to centralize our development in the cloud, which lessened the issue of siloed development on our team and allowed us to leverage the benefits of distributed cloud computing.
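
For readers unfamiliar with the workflow, registering a model in the MLflow Model Registry looks roughly like this; the toy data and model name are illustrative, not the Rangers' actual code:

    # Minimal sketch: log a model and register it in the MLflow Model Registry.
    import mlflow
    import mlflow.sklearn
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Toy data stands in for real player-tracking features.
    X_train = np.random.rand(100, 4)
    y_train = np.random.rand(100)

    model = GradientBoostingRegressor().fit(X_train, y_train)

    # registered_model_name creates the registry entry (or a new version of it).
    with mlflow.start_run():
        mlflow.sklearn.log_model(model, "model",
                                 registered_model_name="pitch_outcome_model")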

But we quickly found that Databricks was also a perfect solution to another problem we faced in our data engineering stack: cost, complexity, and scalability issues had hampered our data architecture development for years, and we decided we needed to modernize our stack by migrating to a lakehouse. With ad hoc analytics, ETL operations, and MLOps all living within the Databricks Lakehouse, development at scale has never been easier for our team.

Going forward, we hope to fully eliminate the silos of development and remove the disconnect between our analytics and data engineering teams. From computer vision, pose analytics, and player tracking to pitch design, base-stealing likelihood, and more, come see how the Texas Rangers are using innovative cloud technologies to create action-driven reports from the current sea of big data.

Talk by: Alexander Booth and Oliver Dykstra

Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift

Delta Lake is an open source project that helps implement modern data lake architectures commonly built on cloud storages. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud.
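
For instance, Delta Lake's time travel can be exercised with a version-based read; the path and version number below are illustrative:

    # Minimal sketch: Delta Lake time travel, reading a table as of an
    # earlier version. Path and version number are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df_v5 = (spark.read.format("delta")
             .option("versionAsOf", 5)
             .load("s3://<bucket>/delta/events/"))
    df_v5.show()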

There are many use cases for Delta tables on AWS. AWS has invested a lot in this technology, and now Delta Lake is available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from multiple data sources, such as on-premises databases, Amazon RDS, DynamoDB, and MongoDB, into Delta Lake on Amazon S3, even without coding expertise.

This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue, and how to query them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.
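
As a hedged sketch of what that looks like in practice, a Glue 4.0 PySpark job (configured with --datalake-formats delta) can write a Delta table to S3 that Athena then queries; the bucket, paths, and table names are placeholders:

    # Minimal sketch: a Glue 4.0 PySpark job writing a Delta table to Amazon
    # S3. Assumes the job is created with --datalake-formats delta.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    df = spark.read.json("s3://<bucket>/raw/orders/")  # hypothetical source

    (df.write.format("delta")
       .mode("append")
       .save("s3://<bucket>/delta/orders/"))

    # Once registered in the Glue Data Catalog, the table is queryable from
    # Athena with plain SQL, e.g.:
    #   SELECT * FROM orders WHERE order_date = current_date;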

Talk by: Noritaka Sekiyama and Akira Ajisaka

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Scaling MLOps for Demand Forecasting Across Multiple Markets for a Large CPG

In this session, we look at how one of the world’s largest CPG companies set up a scalable MLOps pipeline for a demand forecasting use case that predicted demand at 100,000+ DFUs (demand forecasting units) on a weekly basis across more than 20 markets. This implementation resulted in significant cost savings in terms of improved productivity, reduced cloud usage, and faster time to value, among other benefits. You will leave this session with a clearer picture of the following:

  • Best practices in scaling MLOps with Databricks and Azure for a demand forecasting use case with a multi-market, multi-region roll-out (a minimal sketch of the parallel-training pattern follows this list).
  • Best practices for model refactoring and setting up standard CI/CD pipelines for MLOps.
  • Pitfalls to avoid in such scenarios.
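
As that minimal sketch of the parallel-training pattern (table, column, and model choices here are illustrative assumptions, not the presenters' pipeline), Spark's applyInPandas can fan one forecasting function out across all DFUs:

    # Minimal sketch: parallel per-DFU forecasting with applyInPandas.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sales = spark.table("demand.weekly_sales")  # hypothetical input table

    def forecast_one_dfu(pdf: pd.DataFrame) -> pd.DataFrame:
        # Stand-in model: a naive mean forecast. A real pipeline would fit a
        # proper time-series model per DFU here.
        return pd.DataFrame({
            "dfu_id": [pdf["dfu_id"].iloc[0]],
            "forecast": [pdf["units_sold"].mean()],
        })

    schema = "dfu_id string, forecast double"
    forecasts = sales.groupBy("dfu_id").applyInPandas(forecast_one_dfu, schema)
    forecasts.write.mode("overwrite").saveAsTable("demand.weekly_forecasts")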

Talk by: Sunil Ranganathan and Vinit Doshi

Sponsored by: Toptal | Enable Data Streaming within Multicloud Strategies

Join Toptal as we discuss how we can help organizations handle their data streaming needs in an environment utilizing multiple cloud providers. We will delve into the data scientist's and data engineer's perspectives on this challenge. Embracing an open format and utilizing open source technologies while managing the solution through code are the keys to success.

Talk by: Christina Taylor and Matt Kroon

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Unlock the Next Evolution of the Modern Data Stack With the Lakehouse Revolution -- with Live Demos

As the data landscape evolves, organizations are seeking innovative solutions that provide enhanced value and scalability without exploding costs. In this session, we will explore the exciting frontier of the Modern Data Stack on Databricks Lakehouse, a game-changing alternative to traditional Data Cloud offerings. Learn how Databricks Lakehouse empowers you to harness the full potential of Fivetran, dbt, and Tableau, while optimizing your data investments and delivering unmatched performance.

We will showcase real-world demos that highlight the seamless integration of these modern data tools on the Databricks Lakehouse platform, enabling you to unlock faster and more efficient insights. Witness firsthand how the synergy of Lakehouse and the Modern Data Stack outperforms traditional solutions, propelling your organization into the future of data-driven innovation. Don't miss this opportunity to revolutionize your data strategy and unleash unparalleled value with the lakehouse revolution.

Talk by: Kyle Hale and Roberto Salcido

US Army Corps of Engineers Enhanced Commerce & National Security Through Data-Driven Geospatial Insight

The US Army Corps of Engineers (USACE) is responsible for maintaining and improving nearly 12,000 miles of shallow-draft (9'-14') inland and intracoastal waterways, 13,000 miles of deep-draft (14' and greater) coastal channels, and 400 ports, harbors, and turning basins throughout the United States. Because these components of the national waterway network are considered assets to both US commerce and national security, they must be carefully managed to keep marine traffic operating safely and efficiently.

The National DQM Program is tasked with providing USACE with a nationally standardized remote monitoring and documentation system across multiple vessel types, with timely data access, reporting, dredge certifications, data quality control, and data management. Government systems have often lagged commercial systems in modernization efforts, but the emergence of the cloud and data lakehouse architectures has empowered USACE to move successfully into the modern data era.

This session incorporates aspects of these topics:

  • Data Lakehouse Architecture: Delta Lake, platform security and privacy, serverless, administration, data warehouse, data lake, Apache Iceberg, Data Mesh
  • GIS: H3, Mosaic, spatial analysis
  • Data Engineering: data pipelines, orchestration, CDC, medallion architecture, Databricks Workflows, data munging, ETL/ELT, lakehouses, data lakes, Parquet, Data Mesh, Apache Spark™ internals
  • Data Streaming: Apache Spark Structured Streaming, real-time ingestion, real-time ETL, real-time ML, real-time analytics, real-time applications, Delta Live Tables
  • ML: PyTorch, TensorFlow, Keras, scikit-learn, Python and R ecosystems
  • Data Governance: security, compliance, RMF, NIST
  • Data Sharing: sharing and collaboration, Delta Sharing, data cleanliness, APIs

Talk by: Jeff Mroz

High Volume Intelligent Streaming with Sub-Minute SLA for Near Real-Time Data Replication

Attend this session to learn about an innovative solution built around Databricks Structured Streaming and Delta Live Tables (DLT) to replicate thousands of tables from on-premises to cloud-based relational databases. Replicating on-premises data to cloud-based data lakes and data stores in near real time for consumption is a highly desirable pattern for enterprises across industries.

This powerful architecture can offload legacy platform workloads and accelerate the cloud journey. The intelligent, cost-efficient solution leverages thread pools, multi-task jobs, Kafka, Apache Spark™ Structured Streaming, and DLT, as sketched below. This session will go into detail about problems, solutions, lessons learned, and best practices.
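
A hedged sketch of that thread-pool pattern, with broker, topic, and table names as placeholders: one job fans out a bounded number of table-level streams, each writing to its own Delta table:

    # Minimal sketch: fan out one Kafka-to-Delta stream per replicated table
    # from a thread pool inside a single job. All names are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    TABLES = ["orders", "customers", "shipments"]  # thousands, in practice

    def replicate(table: str):
        (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "<broker>:9092")
            .option("subscribe", f"cdc.{table}")  # hypothetical CDC topic
            .load()
            .writeStream.format("delta")
            .option("checkpointLocation", f"/chk/{table}")
            .toTable(f"replica.{table}"))

    # Starting a stream returns immediately; the pool bounds concurrent startup.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(replicate, TABLES))

    spark.streams.awaitAnyTermination()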

Talk by: Suneel Konidala and Murali Madireddi

Planning and Executing a Snowflake Data Warehouse Migration to Databricks

Organizations are going through a critical phase of data infrastructure modernization, laying the foundation for the future and adapting to support growing data and AI needs. Organizations that embraced cloud data warehouses (CDWs) such as Snowflake have ended up trying to use a data warehousing tool for ETL pipelines and data science. This created unnecessary complexity and resulted in poor performance, since data warehouses are optimized for SQL-based analytics only.

Realizing the limitations and pain of cloud data warehouses, organizations are turning to a lakehouse-first architecture. Though a cloud-platform-to-cloud-platform migration should be relatively easy, the breadth of the Databricks platform provides flexibility and hence requires careful planning and execution. In this session, we present the migration methodology, technical approaches, automation tools, product/feature mapping, a technical demo, and best practices using real-world case studies for migrating data, ELT pipelines, and warehouses from Snowflake to Databricks.

Talk by: Satish Garla and Ramachandran Venkat

JoinBoost: In-Database Machine Learning for Tree Models

Data and machine learning (ML) are crucial for enterprise operations. Enterprises store data in databases for management and use ML to gain business insights. However, there is a mismatch between the way ML expects data to be organized (a single table) and the way data is organized in databases (a join graph of multiple tables), which leads to inefficiencies when joining and materializing tables.

In this talk, you will see how we successfully address this issue. We introduce JoinBoost, a lightweight Python library that trains tree models (such as random forests and gradient boosting) over join graphs in databases. JoinBoost acts as a query-rewriting layer that is compatible with cloud databases and eliminates the need for costly join materialization.
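
JoinBoost's own API is not shown here; instead, to make the mismatch concrete, this is the conventional workflow it avoids, where the join graph is materialized into one wide table before training (schema and data are illustrative):

    # The conventional workflow JoinBoost avoids: materialize the join graph
    # into a single wide table, then train. Schema and data are illustrative.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    orders = pd.DataFrame({"cust_id": [1, 2, 1], "amount": [10.0, 20.0, 15.0]})
    customers = pd.DataFrame({"cust_id": [1, 2], "region_score": [0.3, 0.7]})

    # The join duplicates customer attributes across every matching order row;
    # this materialization cost grows with the join's output size.
    wide = orders.merge(customers, on="cust_id")

    model = GradientBoostingRegressor().fit(wide[["region_score"]],
                                            wide["amount"])
    # JoinBoost instead rewrites the training computation to run over the base
    # tables inside the database, skipping the materialized table entirely.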

Talk by: Zachary Huang

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic Quadrant™ CDBMS: https://dbricks.co/3phw20d

Lakehouse Architecture to Advance Security Analytics at the Department of State

In 2023, the Department of State surged forward on implementing a lakehouse architecture to get faster, smarter, and more effective on cybersecurity log monitoring and incident response. In addition to getting us ahead of federal mandates, this approach promises to enable advanced analytics and machine learning across our highly federated global IT environment while minimizing costs associated with data retention and aggregation.

This talk will include a high-level overview of the technical and policy challenge and a technical deeper dive on the tactical implementation choices made. We’ll share lessons learned related to governance and securing organizational support, connecting between multiple cloud environments, and standardizing data to make it useful for analytics. And finally, we’ll discuss how the lakehouse leverages Databricks in multicloud environments to promote decentralized ownership of data while enabling strong, centralized data governance practices.

Talk by: Timothy Ahrens and Edward Moe

Sponsored: KPMG | Multicloud Enterprise Delta Sharing & Governance using Unity Catalog @ S&P Global

Cloud technologies have revolutionized global data access across a number of industries. However, many enterprise organizations face challenges in adopting these technologies effectively, as comprehensive cloud data governance strategies and solutions are complex and evolving – particularly in hybrid or multicloud scenarios involving multiple third parties. KPMG and S&P Global have harnessed the power of Databricks Lakehouse to create a novel approach.

By integrating Unity Catalog, Delta Sharing, and the KPMG Modern Data Platform, S&P Global has enabled scalable, transformative cross-enterprise data sharing and governance. This demonstration highlights a collaboration between the S&P Global Sustainable1 (S1) ESG program and the KPMG ESG Analytics Accelerators to enable large-scale SFDR ESG portfolio analytics. Join us to discover our solution that drives transformative change, fosters data-driven decision-making, and bolsters sustainability efforts in a wide range of industries.

Talk by: Niels Hanson and Dennis Tally

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Vector Data Lakes

Vector databases such as Elasticsearch and Pinecone offer fast ingestion and querying of vector embeddings with approximate nearest neighbor (ANN) search. However, they typically do not decouple compute and storage, making them hard to integrate into production data stacks. Because data storage in these databases is expensive and not easily accessible, data teams typically maintain ETL pipelines to offload historical embedding data to blob stores. When that data needs to be queried, it gets loaded back into the vector database in another ETL process. This is reminiscent of loading data from an OLTP database to cloud storage, then loading said data into an OLAP warehouse for offline analytics.

Recently, “lakehouse” offerings allow direct OLAP querying on cloud storage, removing the need for the second ETL step. The same could be done for embedding data. While embedding storage in blob stores cannot satisfy the high TPS requirements in online settings, we argue it’s sufficient for offline analytics use cases like slicing and dicing data based on embedding clusters. Instead of loading the embedding data back into the vector database for offline analytics, we propose direct processing on embeddings stored in Parquet files in Delta Lake. You will see that offline embedding workloads typically touch a large portion of the stored embeddings without the need for random access.

As a result, the workload is entirely bound by network throughput instead of latency, making it quite suitable for blob storage backends. On a test dataset of one billion vectors, ETL into cloud storage takes around one hour on a dedicated GPU instance, while batched nearest neighbor search can be done in under one minute with four CPU instances. We believe future “lakehouses” will ship with native support for these embedding workloads.
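
The batched search described above is essentially a throughput-bound brute-force scan. A toy NumPy version of the idea (not the authors' implementation; the file and column names are assumptions) shows why no random access is needed:

    # Toy sketch: batched (offline) nearest-neighbor search over embeddings
    # stored in Parquet; a sequential scan, no index, no random access.
    import numpy as np
    import pyarrow.parquet as pq

    # Assumes a Parquet file with "id" and list-typed "embedding" columns.
    table = pq.read_table("embeddings.parquet", columns=["id", "embedding"])
    ids = np.array(table["id"].to_pylist())
    emb = np.array(table["embedding"].to_pylist(), dtype=np.float32)

    queries = np.random.rand(32, emb.shape[1]).astype(np.float32)

    # One matmul scores every query against every stored vector at once.
    scores = queries @ emb.T
    top5 = ids[np.argsort(-scores, axis=1)[:, :5]]
    print(top5.shape)  # (32, 5)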

Talk by: Tony Wang and Chang She

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic Quadrant™ CDBMS: https://dbricks.co/3phw20d

Sponsored by: Fivetran | Fivetran and Catalyst Enable Businesses & Solve Critical Market Challenges

Fivetran helps enterprise and commercial companies improve the efficiency of their data movement, infrastructure, and analysis by providing a secure, scalable platform for high-volume data movement. In this fireside chat, we will dive into the pain points that drove Catalyst, a cloud-based platform that helps software companies grow revenue with advanced insights and workflows that strengthen customer adoption, retention, expansion, and advocacy, to seek a partner that could automate and simplify data management. We will also cover the pivotal success driven by the implementation of Fivetran and Databricks.

Discover how together Fivetran and Databricks:

  • Deliver scalable, real-time analytics to customers with minimal configuration and centralize customer data into customer success tools.
  • Improve Catalyst’s visibility into customer health, opportunities, and risks across all teams.
  • Turn data into revenue-driving insights around digital customer behavior with improved targeting and AI/machine learning.
  • Provide a robust and scalable data infrastructure that supports Catalyst’s growing data needs, with improvements in data availability, data quality, and overall efficiency in data operations.

Talk by: Edward Chiu and Lauren Schwartz

Taking Your Cloud Vendor to the Next Level: Solving Complex Challenges with Azure Databricks

Akamai's content delivery network (CDN) processes about 30% of the internet's daily traffic, resulting in a massive amount of data that presents engineering challenges, both internally and with cloud vendors. In this session, we will discuss the barriers faced while building a data infrastructure on Azure, Databricks, and Kafka to meet strict SLAs, hitting the limits of some of our cloud vendors’ services. We will describe the iterative process of re-architecting a massive scale data platform using the aforementioned technologies.

We will also delve into how, today, Akamai is able to quickly ingest terabytes of data and make them available to customers, as well as efficiently query petabytes of data and return results within 10 seconds for most queries. This discussion will provide valuable insights for attendees and organizations seeking to effectively process and analyze large amounts of data.

Talk by: Tomer Patel and Itai Yaffe

Unity Catalog, Delta Sharing and Data Mesh on Databricks Lakehouse

In this technical deep dive, we will detail how customers implemented data mesh on Databricks and how standardizing on the Delta format enabled delta-to-delta sharing to non-Databricks consumers. Topics include:

  • Current state of the IT landscape
  • Data silos (problems with organizations not having connected data in the ecosystem)
  • A look back on why we moved away from data warehouses and chose the cloud in the first place
  • What caused the data chaos in the cloud: instrumentation and too much stitching together of a periodic table's worth of cloud services
  • How to strike the balance between autonomy and centralization
  • Why Databricks Unity Catalog puts you on the right path to implementing a data mesh strategy
  • The processes and features that enable an end-to-end implementation of a data strategy
  • How customers successfully implemented data mesh with out-of-the-box Unity Catalog and Delta Sharing, without overwhelming their IT tool stack
  • Use cases: delta-to-delta data sharing and delta-to-others data sharing
  • How to navigate when data is available across regions, across clouds, on-prem, and in external systems
  • Change data feed to share only “data that has changed” (a minimal sketch follows this list)
  • Data stewardship
  • Why ABAC is important
  • How file-based access policies and governance play an important role
  • Future state and its pitfalls: egress costs and data compliance
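
As the minimal sketch for the change-data-feed bullet above (the catalog, table, and starting version are illustrative, and the source table must have delta.enableChangeDataFeed = true), Databricks SQL exposes changed rows through table_changes:

    # Minimal sketch: read only "data that has changed" via Delta Change Data
    # Feed. Table name and starting version are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    changes = spark.sql(
        "SELECT * FROM table_changes('main.sales.orders', 25)"  # since v25
    )

    # _change_type distinguishes inserts, update pre/post images, and deletes.
    changes.filter("_change_type != 'delete'").show()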

Talk by: Surya Turaga and Thomas Roach
