Kafka

Sponsored by: Confluent | Turn SAP Data into AI-Powered Insights with Databricks

Creating a Custom PySpark Stream Reader with PySpark 4.0

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by Skyler Myers (Entrada)

Databricks Delta Java MySQL PySpark Spark Data Streaming

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC, Delta Lake, etc. However, some older systems, such as systems that use JMS protocol, are not supported by default and require considerable extra work for developers to read from them. One such example is ActiveMQ for streaming. Traditionally, users of ActiveMQ have to use a middle-man in order to read the stream with Spark (such as writing to a MySQL DB using Java code and reading that table with Spark JDBC). With PySpark 4.0’s custom data sources (supported in DBR 15.3+) we are able to cut out the middle-man processing using batch or Spark Streaming and consume the queues directly from PySpark, saving developers considerable time and complexity in getting source data into your Delta Lake and governed by Unity Catalog and orchestrated with Databricks Workflows.

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way

2025-06-11 · Data + AI Summit 2025 Watch

talk

by Adi Polak (Treeverse)

Analytics Iceberg Parquet Data Streaming

Moving data between operational systems and analytics platforms is often painful. Traditional pipelines become complex, brittle, and expensive to maintain.Take Kafka and Iceberg: batching on Kafka causes ingestion bottlenecks, while streaming-style writes to Iceberg create too many small Parquet files—cluttering metadata, degrading queries, and increasing maintenance overhead. Frequent updates further strain background table operations, causing retries—even before dealing with schema evolution. But much of this complexity is avoidable. What if Kafka Topics and Iceberg Tables were treated as two sides of the same coin? By establishing a transparent equivalence, we can rethink pipeline design entirely. This session introduces Tableflow—a new approach to bridging streaming and table-based systems. It shifts complexity away from pipelines and into a unified layer, enabling simpler, declarative workflows. We’ll cover schema evolution, compaction, topic-to-table mapping, and how to continuously materialize and optimize thousands of topics as Iceberg tables. Whether modernizing or starting fresh, you’ll leave with practical insights for building resilient, scalable, and future-proof data architectures.

Master Schema Translations in the Era of Open Data Lake

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by Eric Sun (Coinbase)

Data Lake Databricks Delta DynamoDB ETL/ELT Iceberg MongoDB Snowflake postgresql

Unity Catalog puts variety of schemas into a centralized repository, now the developer community wants more productivity and automation for schema inference, translation, evolution and optimization especially for the scenarios of ingestion and reverse-ETL with more code generations.Coinbase Data Platform attempts to pave a path with "Schemaster" to interact with data catalog with the (proposed) metadata model to make schema translation and evolution more manageable across some of the popular systems, such as Delta, Iceberg, Snowflake, Kafka, MongoDB, DynamoDB, Postgres...This Lighting Talk covers 4 areas: The complexity and caveats of schema differences among The proposed field-level metadata model, and 2 translation patterns: point-to-point vs hub-and-spoke Why Data Profiling be augmented to enhance schema understanding and translation Integrate it with Ingestion & Reverse-ETL in a Databricks-oriented eco system Takeaway: standardize schema lineage & translation

Somebody Set Up Us the Bomb: Identifying List Bombing of End Users in an Email Anti-Spam Context

2025-06-11 · Data + AI Summit 2025 Watch

lightning_talk

by Doug Sibley (Cisco Talos)

Delta ETL/ELT Marketing Data Streaming

Traditionally, spam emails are messages a user does not want, containing some kind of threat like phishing. Because of this, detection systems can focus on malicious content or sender behavior. List bombing upends this paradigm. By abusing public forms such as marketing signups, attackers can fill a user's inbox with high volumes of legitimate mail. These emails don't contain threats, and each sender is following best practices to confirm the recipient wants to be subscribed, but the net effect for an end user is their inbox being flooded with dozens of emails per minute. This talk covers the the exploration and implementation for identifying this attack in our company's anti-spam telemetry: from reading and writing to Kafka, Delta table streaming for ETL workflows, multi-table liquid clustering design for efficient table joins, curating gold tables to speed up critical queries and using Delta tables as an auditable integration point for interacting with external services.

Streaming Meets Governance: Building AI-Ready Tables With Confluent Tableflow and Unity Catalog

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Kasun Indrasiri Gamage (Confluent) , Victoria Bukta (Shopify)

AI/ML Analytics Data Governance Databricks Delta Data Streaming

Learn how Databricks and Confluent are simplifying the path from real-time data to governed, analytics- and AI-ready tables. This session will cover how Confluent Tableflow automatically materializes Kafka topics into Delta tables and registers them with Unity Catalog — eliminating the need for custom streaming pipelines. We’ll walk through how this integration helps data engineers reduce ingestion complexity, enforce data governance and make real-time data immediately usable for analytics and AI.

GenAI Observability in Customer Care

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Matteo Ciccozzi (EarnIn) , Willem Dhaeseleer (EarnIn)

AI/ML BI Databricks GenAI

Customer support is going through the GenAI revolution, but how can we use AI to foster deeper empathy with our end users?To enable this, Earnin has built its GenAI observability platform on Databricks, leveraging Lakeflow Declarative Pipeliness, Kafka and Databricks AI/BI.This session covers how we use Lakeflow Declarative Pipelines to monitor our customer care chatbot in near real-time and how we leverage Databricks to better anticipate our customers' needs.

Unifying Human-Curated Data Ingestion and Real-Time Updates with Databricks Lakeflow Declarative Pipelines, Protobuf and BSR

2025-06-10 · Data + AI Summit 2025

talk

by Dwight Whitlock (Clinician Nexus)

Data Governance Databricks Protobuf Data Streaming

Red Stapler is a streaming-native system on Databricks that merges file-based ingestion and real-time user edits into one Lakeflow Declarative Pipelines for near real-time feedback. Protobuf definitions, managed in the Buf Schema Registry (BSR), govern schema and data-quality rules, ensuring backward compatibility. All records — valid or not — are stored in an SCD Type 2 table, capturing every version for full history and immediate quarantine views of invalid data. This unified approach boosts data governance, simplifies auditing and streamlines error fixes.Running on Lakeflow Declarative Pipelines Serverless and the Kafka-compatible Bufstream keeps costs low by scaling down to zero when idle. Red Stapler’s configuration-driven Protobuf logic adapts easily to evolving survey definitions without risking production. The result is consistent validation, quick updates and a complete audit trail — all critical for trustworthy, flexible data pipelines.

Lakeflow Declarative Pipelines Integrations and Interoperability: Get Data From — and to — Anywhere

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Ryan Nienhuis (Databricks)

API Azure Cosmos Data Lakehouse Delta ETL/ELT MongoDB Python Spark

This session is repeated.In this session, you will learn how to integrate Lakeflow Declarative Pipelines with external systems in order to ingest and send data virtually anywhere. Lakeflow Declarative Pipelines is most often used in ingestion and ETL into the Lakehouse. New Lakeflow Declarative Pipelines capabilities like the Lakeflow Declarative Pipelines Sinks API and added support for Python Data Source and ForEachBatch have opened up Lakeflow Declarative Pipelines to support almost any integration. This includes popular Apache Spark™ integrations like JDBC, Kafka, External and managed Delta tables, Azure CosmosDB, MongoDB and more.

Let's Save Tons of Money With Cloud-Native Data Ingestion!

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Tyler Croy (Scribd, Inc.)

Airbyte AWS Aurora Kinesis Azure Cloud Computing Databricks Delta GCP

Delta Lake is a fantastic technology for quickly querying massive data sets, but first you need those massive data sets! In this session we will dive into the cloud-native architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis Data Firehose and more. By using off-the-shelf open source tools like kafka-delta-ingest, oxbow and Airbyte, Scribd has redefined its ingestion architecture to be more event-driven, reliable, and most importantly: cheaper. No jobs needed! Attendees will learn how to use third-party tools in concert with a Databricks and Unity Catalog environment to provide a highly efficient and available data platform. This architecture will be presented in the context of AWS but can be adapted for Azure, Google Cloud Platform or even on-premise environments.

Petabyte-Scale On-Chain Insights: Real-Time Intelligence for the Next-Gen Financial Backbone

2025-06-10 · Data + AI Summit 2025 Watch

lightning_talk

by Leo Liang (CipherOwl Inc)

AI/ML Analytics Arrow Blockchain Data Lakehouse Delta Spark SQL

We’ll explore how CipherOwl Inc. constructed a near real-time, multi-chain data lakehouse to power anti-money laundering (AML) monitoring at a petabyte scale. We will walk through the end-to-end architecture, which integrates cutting-edge open-source technologies and AI-driven analytics to handle massive on-chain data volumes seamlessly. Off-chain intelligence complements this to meet rigorous AML requirements. At the core of our solution is ChainStorage, an OSS started by Coinbase that provides robust blockchain data ingestion and block-level serving. We enhanced it with Apache Spark™ and Arrow™, coupled for high-throughput processing and efficient data serialization, backed by Delta Lake and Kafka. For the serving layer, we employ StarRocks to deliver lightning-fast SQL analytics over vast datasets. Finally, our system incorporates machine learning and AI agents for continuous data curation and near real-time insights, which are crucial for tackling on-chain AML challenges.

Kafka Forwarder: Simplifying Kafka Consumption at OpenAI

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Jigar Bhati (Open AI)

Databricks LLM Data Streaming

At OpenAI, Kafka fuels real-time data streaming at massive scale, but traditional consumers struggle under the burden of partition management, offset tracking, error handling, retries, Dead Letter Queues (DLQ), and dynamic scaling — all while racing to maintain ultra-high throughput. As deployments scale, complexity multiplies. Enter Kafka Forwarder — a game-changing Kafka Consumer Proxy that flips the script on traditional Kafka consumption. By offloading client-side complexity and pushing messages to consumers, it ensures at-least-once delivery, automated retries, and seamless DLQ management via Databricks. The result? Scalable, reliable and effortless Kafka consumption that lets teams focus on what truly matters. Curious how OpenAI simplified self-service, high-scale Kafka consumption? Join us as we walk through the motivation, architecture and challenges behind Kafka Forwarder, and share how we structured the pipeline to seamlessly route DLQ data into Databricks for analysis.

Building Real-Time Sport Model Insights with Spark Structured Streaming

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Aaron Hope (Draftkings) , Ethan Summers (Draftkings)

Databricks Spark Data Streaming

In the dynamic world of sports betting, precision and adaptability are key. Sports traders must navigate risk management, limitations of data feeds, and much more to prevent small model miscalculations from causing significant losses. To ensure accurate real-time pricing of hundreds of interdependent markets, traders provide key inputs such as player skill-level adjustments, whilst maintaining precise correlations. Black-box models aren’t enough— constant feedback loops drive informed, accurate decisions. Join DraftKings as we showcase how we expose real-time metrics from our simulation engine, to empower traders with deeper insights into how their inputs shape the model. Using Spark Structured Streaming, Kafka, and Databricks dashboards, we transform raw simulation outputs into actionable data. This transparency into our engines enables fine-grained control over pricing― leading to more accurate odds, a more efficient sportsbook, and an elevated customer experience.

How an Open, Scalable and Secure Data Platform is Powering Quick Commerce Swiggy's AI

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Vasan Vembu Srini (Databricks) , Akash Agarwal (Swiggy)

AI/ML Analytics Flink Data Lakehouse Databricks Delta GenAI Spark Data Streaming

Swiggy, India's leading quick commerce platform, serves ~13 million users across 653 cities, with 196,000 restaurant partners and 17,000 SKUs. To handle this scale, Swiggy developed a secure, scalable AI platform processing millions of predictions per second. The tech stack includes Apache Kafka for real-time streaming, Apache Spark on Databricks for analytics and ML, and Apache Flink for stream processing. The Lakehouse architecture on Delta ensures data reliability, while Unity Catalog enables centralized access control and auditing. These technologies power critical AI applications like demand forecasting, route optimization, personalized recommendations, predictive delivery SLAs, and generative AI use cases.Key Takeaway:This session explores building a data platform at scale, focusing on cost efficiency, simplicity, and speed, empowering Swiggy to seamlessly support millions of users and AI use cases.

Simplifying Data Pipelines With Lakeflow Declarative Pipelines: A Beginner’s Guide

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Matt Jones (Databricks) , Brad Turnbaugh (84.51)

AI/ML Analytics Data Engineering ETL/ELT SQL Data Streaming

As part of the new Lakeflow data engineering experience, Lakeflow Declarative Pipelines makes it easy to build and manage reliable data pipelines. It unifies batch and streaming, reduces operational complexity and ensures dependable data delivery at scale — from batch ETL to real-time processing.Lakeflow Declarative Pipelines excels at declarative change data capture, batch and streaming workloads, and efficient SQL-based pipelines. In this session, you’ll learn how we’ve reimagined data pipelining with Lakeflow Declarative Pipelines, including: A brand new pipeline editor that simplifies transformations Serverless compute modes to optimize for performance or cost Full Unity Catalog integration for governance and lineage Reading/writing data with Kafka and custom sources Monitoring and observability for operational excellence “Real-time Mode” for ultra-low-latency streaming Join us to see how Lakeflow Declarative Pipelines powers better analytics and AI with reliable, unified pipelines.

Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities

2025-06-10 · Data + AI Summit 2025 Watch

talk

by Gaurav Saxena (Automotive Industry) , Sijie Guo (StreamNative)

AI/ML API Data Lakehouse Delta GenAI Iceberg Data Streaming

As data architectures evolve to meet the demands of real-time GenAI applications, organizations increasingly need systems that unify streaming and batch processing while maintaining compatibility with existing tools. The Ursa Engine offers a Kafka-API-compatible data streaming engine built on Lakehouse (Iceberg and Delta Lake). Designed to seamlessly integrate with data lakehouse architectures, Ursa extends your lakehouse capabilities by enabling streaming ingestion, transformation and processing — using a Kafka-compatible interface. In this session, we will explore how Ursa Engine augments your existing lakehouses with Kafka-compatible capabilities. Attendees will gain insights into Ursa Engine architecture and real-world use cases of Ursa Engine. Whether you're modernizing legacy systems or building cutting-edge AI-driven applications, discover how Ursa can help you unlock the full potential of your data.

talk-data.com

Activity Trend

Top Events

Top Speakers

Sponsored by: Confluent | Turn SAP Data into AI-Powered Insights with Databricks

Creating a Custom PySpark Stream Reader with PySpark 4.0

No More Fragile Pipelines: Kafka and Iceberg the Declarative Way

Master Schema Translations in the Era of Open Data Lake

Somebody Set Up Us the Bomb: Identifying List Bombing of End Users in an Email Anti-Spam Context

Streaming Meets Governance: Building AI-Ready Tables With Confluent Tableflow and Unity Catalog

GenAI Observability in Customer Care

Unifying Human-Curated Data Ingestion and Real-Time Updates with Databricks Lakeflow Declarative Pipelines, Protobuf and BSR

Lakeflow Declarative Pipelines Integrations and Interoperability: Get Data From — and to — Anywhere

Let's Save Tons of Money With Cloud-Native Data Ingestion!

Petabyte-Scale On-Chain Insights: Real-Time Intelligence for the Next-Gen Financial Backbone

Kafka Forwarder: Simplifying Kafka Consumption at OpenAI

Building Real-Time Sport Model Insights with Spark Structured Streaming

How an Open, Scalable and Secure Data Platform is Powering Quick Commerce Swiggy's AI

Simplifying Data Pipelines With Lakeflow Declarative Pipelines: A Beginner’s Guide

Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities