talk-data.com

Topic: Delta Lake

Tags: data_lake acid_transactions time_travel file_format storage

347 tagged activities

Activity trend: 2020-Q1 through 2026-Q1, peaking at 117 activities per quarter.

Activities

347 activities · Newest first

A Japanese Mega-Bank’s Journey to a Modern, GenAI-Powered, Governed Data Platform

SMBC, a major Japanese multinational financial services institution, has embarked on an initiative to build a GenAI-powered, modern and well-governed cloud data platform on Azure/Databricks. This initiative aims to build an enterprise data foundation encompassing loans, deposits, securities, derivatives and other data domains. Its primary goals are: to decommission legacy data platforms and reduce data sprawl by migrating 20+ core banking systems to a multi-tenant Azure Databricks architecture; to leverage Databricks’ Delta Sharing capabilities to address SMBC’s unique global footprint and data-sharing needs; to govern data by design using Unity Catalog; and to achieve global adoption of the frameworks, accelerators, architecture and tool stack to support similar implementations across EMEA. Deloitte and SMBC leveraged the Brickbuilder asset “Data as a Service for Banking” to accelerate this highly strategic transformation.

ClickHouse and Databricks for Real-Time Analytics

ClickHouse is a C++-based, column-oriented database built for real-time analytics. While it has its own internal storage format, the rise of open lakehouse architectures has created a growing need for seamless interoperability. In response, we have developed integrations with your favorite lakehouse ecosystem to enhance compatibility, performance and governance. From integrating with Unity Catalog to embedding the Delta Kernel into ClickHouse, this session will explore the key design considerations behind these integrations, their benefits to the community, the lessons learned and future opportunities for improved compatibility and seamless integration.
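To make the Delta integration concrete, here is a hedged sketch of querying a Delta table from ClickHouse via its deltaLake() table function, driven from Python with the clickhouse-connect client. The host, bucket URL and column names are placeholders, not anything specific to this session.

```python
# Hypothetical sketch: read a Delta table from ClickHouse using the
# deltaLake() table function, via the clickhouse-connect Python client.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# ClickHouse resolves the Delta transaction log to the current data files,
# then scans the underlying Parquet with its native readers.
result = client.query("""
    SELECT toDate(event_time) AS day, count() AS events
    FROM deltaLake('https://bucket.s3.amazonaws.com/events/')
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
""")
for row in result.result_rows:
    print(row)
```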

Scaling Data Engineering Pipelines: Preparing Credit Card Transactions Data for Machine Learning

We discuss two real-world use cases in big data engineering, focusing on constructing stable pipelines and managing storage at petabyte scale. The first use case highlights the implementation of Delta Lake to optimize data pipelines, resulting in an 80% reduction in query time and a 70% reduction in storage space. The second use case demonstrates the effectiveness of the Workflows ‘ForEach’ operator in executing compute-intensive pipelines across multiple clusters, significantly reducing processing time from months to days. This approach involves a reusable design pattern that isolates notebooks into units of work, enabling data scientists to develop and test them independently.
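As a rough illustration of the first use case (not the presenters’ actual code), the sketch below lands credit card transactions in a partitioned Delta table and then compacts and Z-orders it; paths and column names are invented.

```python
# Minimal PySpark sketch: land raw transactions in Delta, then OPTIMIZE/ZORDER.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("txn-pipeline").getOrCreate()

txns = spark.read.parquet("/raw/credit_card_txns/")

(txns.write
     .format("delta")
     .mode("overwrite")
     .partitionBy("txn_date")
     .save("/delta/credit_card_txns"))

# Compaction plus Z-ordering on common filter columns is typically what drives
# the kind of query-time and storage reductions the session reports.
spark.sql("OPTIMIZE delta.`/delta/credit_card_txns` ZORDER BY (card_id)")
```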

Sponsored by: Deloitte | Analyzing Geospatial Data at Scale in Databricks for Environment & Agriculture

Analyzing geospatial data has become a cornerstone of tackling many of today’s pressing challenges, from climate change to resource management. However, storing and processing such data can be complex and hard to scale using common GIS packages. This talk explores how Deloitte and Databricks enable horizontally scalable geospatial analysis using Delta Lake, H3 integration and support for geospatial vector and raster data. We demonstrate how we have leveraged these capabilities for real-world applications in environmental monitoring and agriculture. In doing so, we cover end-to-end processing from ingestion, transformation and analysis to the production of geospatial data products accessible to scientists and decision makers through standard GIS tools.
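As a hedged sketch of the H3 pattern mentioned above, the snippet below indexes point readings into H3 cells with Databricks’ built-in h3_longlatash3 SQL function and aggregates per cell; the sensor_readings table, its columns and resolution 7 are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Bucket each reading into an H3 cell and aggregate per cell.
cells = spark.sql("""
    SELECT h3_longlatash3(longitude, latitude, 7) AS h3_cell,
           avg(soil_moisture) AS avg_moisture
    FROM sensor_readings
    GROUP BY h3_cell
""")

cells.write.format("delta").mode("overwrite").saveAsTable("agri.moisture_by_cell")
```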

Crypto at Scale: Building a High-Performance Platform for Real-Time Blockchain Data

In today’s fast-evolving crypto landscape, organizations require fast, reliable intelligence to manage risk, investigate financial crime and stay ahead of evolving threats. In this session we will discover how Elliptic built a scalable, high-performance Data Intelligence Platform that delivers real-time, actionable blockchain insights to their customers. We’ll walk you through some of the key components of the Elliptic platform, including the Elliptic Entity Graph and our User-Facing Analytics. Our focus will be on the evolution of our User-Facing Analytics capabilities, and specifically how components from the Databricks ecosystem such as Structured Streaming, Delta Lake and SQL Warehouse have played a vital role. We’ll also share some of the optimizations we’ve made to our streaming jobs to maximize performance and ensure data completeness. Whether you’re looking to enhance your streaming capabilities, expand your knowledge of how crypto analytics works or simply discover novel approaches to data processing at scale, this session will provide concrete strategies and valuable lessons learned.
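For flavor, here is an illustrative Structured Streaming sketch (not Elliptic’s code) of the kind of incremental Delta-to-Delta job the session covers: reading new blockchain events and maintaining a windowed aggregate for user-facing analytics. Paths, columns and the watermark are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.readStream.format("delta").load("/delta/bronze/chain_events")

# Windowed aggregation; the watermark bounds state and lets windows close,
# which is what makes append output mode valid here.
per_entity = (events
    .withWatermark("block_time", "10 minutes")
    .groupBy("entity_id", F.window("block_time", "1 minute"))
    .agg(F.sum("value").alias("flow")))

(per_entity.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/chk/entity_flow")
    .start("/delta/gold/entity_flow"))
```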

Empowering the Warfighter With AI

The new Budget Execution Validation process has transformed how the Navy reviews unspent funds. Powered by Databricks Workflows, MLflow, Delta Lake and Apache Spark™, this data-driven model predicts which financial transactions are most likely to have errors, streamlining reviews and increasing accuracy. In FY24, it helped review $40 billion, freeing $1.1 billion for other priorities, including $260 million from active projects. By reducing reviews by 80%, cutting job runtime by over 50% and lowering costs by 60%, it saved 218,000 work hours and $6.7 million in labor costs. With automated workflows and robust data management, this system exemplifies how advanced tools can improve financial decision-making, save resources and ensure efficient use of taxpayer dollars.
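The abstract doesn’t show code, but the core pattern (train a classifier that scores transactions by error likelihood, tracked with MLflow) might look roughly like this; the synthetic dataset and model choice are purely illustrative.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: roughly 5% of transactions carry errors.
X, y = make_classification(n_samples=5000, n_features=12, weights=[0.95])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

with mlflow.start_run(run_name="budget-validation"):
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    mlflow.log_metric("holdout_accuracy", model.score(X_te, y_te))
    mlflow.sklearn.log_model(model, "model")

# Reviewers work the highest-risk transactions first.
error_risk = model.predict_proba(X_te)[:, 1]
```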

Scaling Identity Graph Ingestion to 1M Events/Sec with Spark Streaming & Delta Lake

Adobe’s Real-Time Customer Data Platform relies on the identity graph to connect over 70 billion identities and deliver personalized experiences. This session will showcase how the platform leverages Databricks, Spark Streaming and Delta Lake, along with 25+ Databricks deployments across multiple regions and clouds — Azure & AWS — to process terabytes of data daily and handle over a million records per second. The talk will highlight the platform’s ability to scale, demonstrating a 10x increase in ingestion pipeline capacity to accommodate peak traffic during events like the Super Bowl. Attendees will learn about the technical strategies employed, including migrating from Flink to Spark Streaming, optimizing data deduplication, and implementing robust monitoring and anomaly detection. Discover how these optimizations enable Adobe to deliver real-time identity resolution at scale while ensuring compliance and privacy.
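As a minimal sketch of the deduplication optimization mentioned above (the event_id key, ingest_time column and 30-minute horizon are assumptions, not Adobe specifics):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.readStream.format("delta").load("/delta/identity_events_raw")

# dropDuplicatesWithinWatermark (Spark 3.5+) bounds dedup state by the
# watermark, so the job can run indefinitely at high throughput.
deduped = (raw
    .withWatermark("ingest_time", "30 minutes")
    .dropDuplicatesWithinWatermark(["event_id"]))

(deduped.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/identity_dedup")
    .start("/delta/identity_events"))
```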

Smart Data, Smarter Vehicles: Building the Foundation for the Future of Transportation

Join industry pioneers Boeing and CARIAD (Volkswagen Group) as they showcase how advanced data platforms are revolutionizing mobility across air and ground transportation. Boeing's Jeppesen Smart NOTAMs system demonstrates the power of compound AI in aviation safety, processing over 4.5M critical flight notices annually and serving 75% of commercial aviation through an innovative combination of MLflow, GenAI, and Delta Sharing technologies. CARIAD follows with insights into their groundbreaking Unified Data Ecosystem (UDE), the singular data platform powering Volkswagen Group's global mobility transformation across all brands and markets. Together, these leaders illustrate how smart data architecture is building the foundation for the future of transportation, from the skies to the streets.

Unlock the Potential of Your Enterprise Data With Zero-Copy Data Sharing, featuring SAP and Salesforce

Tired of data silos and the constant need to move copies of your data across different systems? Imagine a world where all your enterprise data is readily available in Databricks without the cost and complexity of duplication and ingestion. Our vision is to break down these silos by enabling seamless, zero-copy data sharing across platforms, clouds and regions. This unlocks the true potential of your data for analytics and AI, empowering you to make faster, more informed decisions leveraging your most important enterprise data sets. In this session you will hear from Databricks, SAP and Salesforce product leaders on how zero-copy data sharing can unlock the value of enterprise data. Explore how Delta Sharing makes this vision a reality, providing secure, zero-copy data access for enterprises. SAP Business Data Cloud: see Delta Sharing in action to unlock operational reporting, supply chain optimization and financial planning. Salesforce Data Cloud: enable customer analytics, churn prediction and personalized marketing.
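On the consumer side, the open source delta-sharing Python connector shows how little is involved; the profile file and share/schema/table names below are placeholders a provider would supply.

```python
import delta_sharing

# "<profile-file>#<share>.<schema>.<table>"
table_url = "config.share#sap_share.finance.cost_centers"

# Reads the shared Delta table over the open protocol: no copy, no ETL.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```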

Master Schema Translations in the Era of Open Data Lake

Unity Catalog puts a variety of schemas into a centralized repository, and the developer community now wants more productivity and automation for schema inference, translation, evolution and optimization, especially for ingestion and reverse-ETL scenarios with more code generation. Coinbase Data Platform attempts to pave a path with “Schemaster,” which interacts with the data catalog through a (proposed) metadata model to make schema translation and evolution more manageable across some of the popular systems, such as Delta, Iceberg, Snowflake, Kafka, MongoDB, DynamoDB, Postgres... This Lightning Talk covers four areas: the complexity and caveats of schema differences among these systems; the proposed field-level metadata model and two translation patterns, point-to-point vs. hub-and-spoke; why data profiling should be augmented to enhance schema understanding and translation; and how to integrate it with ingestion and reverse-ETL in a Databricks-oriented ecosystem. Takeaway: standardize schema lineage and translation.
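Since Schemaster’s internals aren’t public, here is only a toy Python sketch of the hub-and-spoke idea: map each system’s types into a neutral field-level model once, then translate from the hub to any target. All names are invented.

```python
from dataclasses import dataclass

@dataclass
class HubField:
    name: str
    hub_type: str      # neutral type, e.g. "long", "string", "timestamp"
    nullable: bool = True

DELTA_TO_HUB = {"BIGINT": "long", "STRING": "string", "TIMESTAMP": "timestamp"}
HUB_TO_POSTGRES = {"long": "BIGINT", "string": "TEXT", "timestamp": "TIMESTAMPTZ"}

def delta_to_hub(name: str, delta_type: str) -> HubField:
    return HubField(name, DELTA_TO_HUB[delta_type.upper()])

def hub_to_postgres(field: HubField) -> str:
    null = "" if field.nullable else " NOT NULL"
    return f"{field.name} {HUB_TO_POSTGRES[field.hub_type]}{null}"

# One mapping per system into the hub replaces N*(N-1) point-to-point translators.
print(hub_to_postgres(delta_to_hub("event_ts", "timestamp")))  # event_ts TIMESTAMPTZ
```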

Somebody Set Up Us the Bomb: Identifying List Bombing of End Users in an Email Anti-Spam Context

Traditionally, spam emails are messages a user does not want, containing some kind of threat like phishing. Because of this, detection systems can focus on malicious content or sender behavior. List bombing upends this paradigm. By abusing public forms such as marketing signups, attackers can fill a user's inbox with high volumes of legitimate mail. These emails don't contain threats, and each sender is following best practices to confirm the recipient wants to be subscribed, but the net effect for an end user is their inbox being flooded with dozens of emails per minute. This talk covers the exploration and implementation of identifying this attack in our company's anti-spam telemetry: from reading and writing to Kafka, Delta table streaming for ETL workflows and multi-table liquid clustering design for efficient table joins, to curating gold tables to speed up critical queries and using Delta tables as an auditable integration point for interacting with external services.
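Two of the building blocks named above, Kafka ingestion and a liquid-clustered Delta table, might look like the following hedged sketch; the topic, columns and cluster keys are assumptions, not the presenters’ schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Liquid clustering on the join/filter keys used by downstream detection queries.
spark.sql("""
    CREATE TABLE IF NOT EXISTS antispam.mail_events (
        recipient STRING, raw_headers STRING, received_at TIMESTAMP)
    CLUSTER BY (recipient, received_at)
""")

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "mail-telemetry")
    .load())

parsed = stream.selectExpr(
    "CAST(key AS STRING)   AS recipient",
    "CAST(value AS STRING) AS raw_headers",
    "timestamp             AS received_at")

(parsed.writeStream
    .option("checkpointLocation", "/chk/mail_events")
    .toTable("antispam.mail_events"))
```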

Data Triggers and Advanced Control Flow With Lakeflow Jobs

Lakeflow Jobs is the production-ready, fully managed orchestrator for the entire Lakehouse, with 99.95% uptime. Join us for a dive into how you can orchestrate your enterprise data operations, from triggering your jobs only when your data is ready to advanced control flow with conditionals, looping and job modularity — with demos! Attendees will gain practical insights into optimizing their data operations by orchestrating with Lakeflow Jobs:
New task types: publish AI/BI Dashboards, push to Power BI or ingest with Lakeflow Connect
Advanced execution control: reference SQL task outputs, run partial DAGs and perform targeted backfills
Repair runs: re-run failed pipelines with surgical precision using task-level repair
Control flow upgrades: native for-each loops and conditional logic make DAGs more dynamic + expressive
Smarter triggers: kick off jobs based on file arrival or Delta table changes, enabling responsive workflows (a sketch follows this list)
Code-first approach to pipeline orchestration
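As a hedged sketch of the trigger capability, using the databricks-sdk Python package; the notebook path, volume URL and exact field names are placeholders and should be checked against your SDK version.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # auth resolved from the environment

# Create a job that fires whenever new files land in a Unity Catalog volume.
job = w.jobs.create(
    name="ingest-on-arrival",
    tasks=[jobs.Task(
        task_key="ingest",
        notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
    )],
    trigger=jobs.TriggerSettings(
        file_arrival=jobs.FileArrivalTriggerConfiguration(
            url="/Volumes/main/raw/landing/"),
        pause_status=jobs.PauseStatus.UNPAUSED,
    ),
)
print(job.job_id)
```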

Delta Sharing in Action: Architecture and Best Practices

Delta Sharing is revolutionizing how enterprises share live data and AI assets securely, openly and at scale. As the industry’s first open data-sharing protocol, it empowers organizations to collaborate seamlessly across platforms and with any partner, whether inside or outside the Databricks ecosystem. In this deep-dive session, you’ll learn best practices and real-world use cases that show how Delta Sharing helps accelerate collaboration and fuel AI-driven innovation. We’ll also unveil the latest advancements, including:
Managed network configurations for easier, secure setup
OIDC identity federation for trusted, open sharing
Expanded asset types, including dynamic views, materialized views, federated tables, read clones and more
Whether you’re a data engineer, architect or data leader, you’ll leave with practical strategies to future-proof your data-sharing architecture. Don’t miss the live demos, expert guidance and an exclusive look at what’s next in data collaboration.
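On the provider side, the basic flow can be sketched as SQL run through PySpark; the share, table and recipient names are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE SHARE IF NOT EXISTS quarterly_metrics")
spark.sql("ALTER SHARE quarterly_metrics ADD TABLE sales.gold.kpis")

# Open (non-Databricks) recipients authenticate with a token-based profile;
# Databricks-to-Databricks recipients are created from a sharing identifier.
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_co")
spark.sql("GRANT SELECT ON SHARE quarterly_metrics TO RECIPIENT partner_co")
```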

Delta Sharing Demystified: Options, Use Cases and How it Works

Data sharing doesn’t have to be complicated. In this session, we’ll take a practical look at Delta Sharing in Databricks — what it is, how it works and how it fits into your organization’s data ecosystem. The focus will be on giving an overview of the different ways to share data using Databricks, from direct sharing setups to broader distribution via the Databricks Marketplace and more collaborative approaches like Clean Rooms. This talk is meant for anyone curious about modern, secure data sharing — whether you're just getting started or looking to expand your use of Databricks. Attendees will walk away with a clearer picture of what’s possible, what’s required to get started and how to choose the right sharing method for the right scenario.

Sponsored by: Domo, Inc | Enabling AI-Powered Business Solutions w/Databricks & Domo

Domo's Databricks integration seamlessly connects business users to both Delta Lake data and AI/ML models, eliminating technical barriers while maximizing performance. Domo's Cloud Amplifier optimizes data processing through pushdown SQL, while the Domo AI Services layer enables anyone to leverage both traditional ML and large language models directly from Domo. During this session, we’ll explore an AI solution around fraud detection to demonstrate the power of leveraging Domo on Databricks.

Breaking Silos: Using SAP Business Data Cloud and Delta Sharing for Seamless Access to SAP Data in Databricks

We’re excited to share how SAP Business Data Cloud supports Delta Sharing to share SAP data securely and seamlessly with Databricks—no complex ETL or data duplication required. This enables organizations to securely share SAP data for analytics and AI in Databricks while also supporting bidirectional data sharing back to SAP. In this session, we’ll demonstrate the integration in action, followed by a discussion of how the global beauty group Natura will leverage this solution. Whether you’re looking to bring SAP data into Databricks for advanced analytics or build AI models on top of trusted SAP datasets, this session will show you how to get started — securely and efficiently.

Cross-Cloud Data Mesh with Delta Sharing and UniForm in Mercedes-Benz

In this presentation, we'll show how we achieved a unified development experience for teams working on Mercedes-Benz Data Platforms in AWS and Azure. We will demonstrate how we implemented Azure to AWS and AWS to Azure data product sharing (using Delta Sharing and Cloud Tokens), integration with AWS Glue Iceberg tables through UniForm and automation to drive everything using Azure DevOps Pipelines and DABs. We will also show how to monitor and track cloud egress costs and how we present a consolidated view of all the data products and relevant cost information. The end goal is to show how customers can offer the same user experience to their engineers and not have to worry about which cloud or region the Data Product lives in. Instead, they can enroll in the data product through self-service and have it available to them in minutes, regardless of where it originates.
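For reference, enabling UniForm so Iceberg clients (such as AWS Glue's Iceberg integration) can read a Delta table comes down to two table properties; the table below is a placeholder, not a Mercedes-Benz schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# UniForm generates Iceberg metadata alongside the Delta transaction log.
spark.sql("""
    CREATE TABLE IF NOT EXISTS platform.shared.vehicle_events (
        vin STRING, event_type STRING, event_ts TIMESTAMP)
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg')
""")
```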

Delta-Kernel-RS: Unparalleled Interoperability Across Query Engines

Join us as we introduce Delta-Kernel-RS, a new Rust implementation of the Delta Lake protocol designed for unparalleled interoperability across query engines. In this session, we will explore how maintaining a native implementation of the Delta specification — with native C and C++ FFI support — can deliver consistent benefits across diverse data processing systems, eliminating the need for repetitive, engine-specific reimplementations. We will dive deep into a real-world case study where a query engine harnessed Delta-Kernel-RS to unlock significant data skipping improvements — enhancements achieved “for free” by leveraging the kernel. Attendees will gain insights into the architectural decisions, interoperability strategies and the practical impact of this innovation on performance and development efficiency in modern data ecosystems.
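To ground the data skipping claim, here is a toy Python illustration of the idea (emphatically not delta-kernel-rs’s API): per-file min/max statistics from the Delta log let an engine drop files that cannot match a predicate before reading any Parquet.

```python
# Per-file stats as an engine might receive them from a kernel scan's metadata.
files = [
    {"path": "part-000.parquet", "min_ts": "2024-01-01", "max_ts": "2024-01-31"},
    {"path": "part-001.parquet", "min_ts": "2024-02-01", "max_ts": "2024-02-29"},
    {"path": "part-002.parquet", "min_ts": "2024-03-01", "max_ts": "2024-03-31"},
]

def prune(files, lo, hi):
    # Keep only files whose [min, max] range can overlap the query range.
    return [f for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

print(prune(files, "2024-02-10", "2024-02-20"))  # only part-001 survives
```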

How Blue Origin Accelerates Innovation With Databricks and AWS GovCloud

Blue Origin is revolutionizing space exploration with a mission-critical data strategy powered by Databricks on AWS GovCloud. Learn how they leverage Databricks to meet ITAR and FedRAMP High compliance, streamline manufacturing and accelerate their vision of a 24/7 factory. Key use cases include predictive maintenance, real-time IoT insights and AI-driven tools that transform CAD designs into factory instructions. Discover how Delta Lake, Structured Streaming and advanced Databricks functionalities like Unity Catalog enable real-time analytics and future-ready infrastructure, helping Blue Origin stay ahead in the race to adopt generative AI and serverless solutions.