talk-data.com

Topic: Kafka (Apache Kafka)

Tags: distributed_streaming, message_queue, event_streaming

Tagged activities: 240

Activity Trend: 20 peak/qtr, 2020-Q1 to 2026-Q1

Activities

240 activities · Newest first

Data leaders today face a familiar challenge: complex pipelines, duplicated systems, and spiraling infrastructure costs. Standardizing around Kafka for real-time and Iceberg for large-scale analytics has gone some way towards addressing this but still requires separate stacks, leaving teams to stitch them together at high expense and risk.

This talk will explore how Kafka and Iceberg together form a new foundation for data infrastructure: one that unifies streaming and analytics into a single, cost-efficient layer. By standardizing on these open technologies, organizations can reduce data duplication, simplify governance, and unlock both instant insights and long-term value from the same platform.

You will come away with a clear understanding of why this convergence is reshaping the industry, how it lowers operational risk, and the advantages it offers for building durable, future-proof data capabilities.

Moving data between operational systems and analytics platforms is often a painful process. Traditional pipelines that transfer data in and out of warehouses tend to become complex, brittle, and expensive to maintain over time.

Much of this complexity, however, is avoidable. Data in motion and data at rest—Kafka Topics and Iceberg Tables—can be treated as two sides of the same coin. By establishing an equivalence between Topics and Tables, it’s possible to transparently map between them and rethink how pipelines are built.

This talk introduces a declarative approach to bridging streaming and table-based systems. By shifting complexity into the data layer, we can decompose complex, imperative pipelines into simpler, more reliable workflows.

We’ll explore the design principles behind this approach, including schema mapping and evolution between Kafka and Iceberg, and how to build a system that can continuously materialize and optimize hundreds of thousands of topics as Iceberg tables.
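
To make the topic-to-table equivalence concrete, here is a minimal sketch (not taken from the talk) that continuously materializes a Kafka topic into an Iceberg table with Spark Structured Streaming; the topic, schema, broker address, and table names are assumptions for illustration.

```python
# Illustrative sketch (not from the talk): continuously materialize a Kafka topic
# as an Iceberg table with Spark Structured Streaming. Topic, schema, broker, and
# table names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("topic-to-table").getOrCreate()

event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", LongType()),
    StructField("ts", LongType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")            # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Each micro-batch is appended to the Iceberg table; schema evolution and file
# compaction are then handled by the table format and its maintenance jobs.
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("lake.db.orders")                # hypothetical Iceberg table
)
query.awaitTermination()
```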

Whether you're building new pipelines or modernizing legacy systems, this session will provide practical patterns and strategies for creating resilient, scalable, and future-proof data architectures.

Apache Kafka is the simplest possible reliable, horizontally scalable, low-latency storage system for commodity hardware. This is increasingly making it the backbone of analytic data collection stacks and event-bus-like architectures. Critical systems like this require very reliable operations. Because Kafka is both stateful and distributed, it has traditional sysadmin problems as well as problems that require deep expertise. We will discuss CPU and disk capacity management and defining availability SLOs for a distributed stateful system. We will also show, in a demo, some of the ways in which the Google Cloud Managed Service for Apache Kafka and lenses.io help solve these problems.
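
As a back-of-the-envelope illustration of the disk capacity question raised above (the numbers are invented for the example, not taken from the talk):

```python
# Back-of-the-envelope disk sizing for a Kafka cluster (illustrative numbers only).
write_mb_per_sec = 50          # average producer throughput into the cluster
retention_hours = 72           # topic retention
replication_factor = 3
brokers = 6
headroom = 0.4                 # keep ~40% free for rebalances, bursts, and compaction

retained_gb = write_mb_per_sec * 3600 * retention_hours * replication_factor / 1024
per_broker_gb = retained_gb / brokers / (1 - headroom)

print(f"total retained data: {retained_gb:,.0f} GB")
print(f"disk needed per broker: {per_broker_gb:,.0f} GB")
```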

By leveraging tools like Jaeger and New Relic, we will uncover how to gain a full view of your microservices, even in the face of Apache Kafka's asynchronous nature. Join us for a live demo with a simple Java Spring Boot app, where we will walk through both automatic and manual instrumentation to capture rich telemetry. We will also touch on infrastructure-level observability, pulling metrics and traces from Apache Kafka brokers and Apache Flink.
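
The demo in this session uses a Java Spring Boot app; purely as an illustration of the manual-instrumentation idea in another language, the hedged Python sketch below starts an OpenTelemetry span around a Kafka produce call and propagates the trace context in message headers. The service name, topic, and broker address are assumptions.

```python
# Illustrative sketch (the talk's demo is a Java Spring Boot app): manually create a
# span around a Kafka produce call and propagate trace context in message headers so
# a downstream consumer can continue the trace. Broker and topic are hypothetical.
from opentelemetry import trace
from opentelemetry.propagate import inject
from confluent_kafka import Producer

tracer = trace.get_tracer("checkout-service")
producer = Producer({"bootstrap.servers": "broker:9092"})

def publish_order(order_id: str, payload: bytes) -> None:
    with tracer.start_as_current_span("publish-order") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("order.id", order_id)

        headers: dict[str, str] = {}
        inject(headers)                        # writes the traceparent into the dict
        producer.produce(
            "orders",                          # hypothetical topic
            value=payload,
            headers=list(headers.items()),
        )
        producer.flush()
```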

Following on from the "Building consumable data products" keynote, we will dive deeper into the interactions around the data product catalog, to show how the network effect of explicit data sharing relationships starts to pay dividends to the participants. For example:

For the product consumer:

• Searching for products, understanding content, costs, terms and conditions, licenses, quality certifications, etc.

• Inspecting sample data, choosing preferred data format, setting up a secure subscription, and seeing data provisioned into a database from the product catalog.

• Providing feedback and requesting help

• Reviewing own active subscriptions

• Understanding the lineage behind each product along with outstanding exceptions and future plans

For the product manager/owner:

• Setting up a new product, creating a new release of an existing product and issuing a data correction/restatement

• Reviewing a product’s active subscriptions and feedback/requests from consumers

• Interacting with the technical teams on pipeline implementations along with issues and proposed enhancements

For the data governance team:

• Viewing the network of dependencies between data products (the data mesh) to understand the data value chains and risk concentrations

• Reviewing a dashboard of metrics around the data products including popularity, errors/exceptions, subscriptions, interaction

• Showing traceability from a governance policy relating to, say, data sovereignty or data privacy, to the product implementations

• Building trust profiles for producers and consumers

The aim of the demonstrations and discussions is to explore the principles and patterns relating to data products, rather than push a particular implementation approach.

Having said that, all of the software used in the demonstrations is open source. Principally, this is Egeria, OpenLineage and Unity Catalog from the Linux Foundation, plus Apache Airflow, Apache Kafka and Apache Superset from the Apache Software Foundation.

Videos of the demonstrations will be available on YouTube after the conference and the complete demo software can be downloaded and run on a laptop so you can share your experiences with your teams after the event.

Data Engineering for Cybersecurity

Security teams rely on telemetry—the continuous stream of logs, events, metrics, and signals that reveal what’s happening across systems, endpoints, and cloud services. But that data doesn’t organize itself. It has to be collected, normalized, enriched, and secured before it becomes useful. That’s where data engineering comes in.

In this hands-on guide, cybersecurity engineer James Bonifield teaches you how to design and build scalable, secure data pipelines using free, open source tools such as Filebeat, Logstash, Redis, Kafka, Elasticsearch, and more. You’ll learn how to collect telemetry from Windows (including Sysmon and PowerShell events), Linux files and syslog, and streaming data from network and security appliances. You’ll then transform it into structured formats, secure it in transit, and automate your deployments using Ansible.

You’ll also learn how to:

• Encrypt and secure data in transit using TLS and SSH

• Centrally manage code and configuration files using Git

• Transform messy logs into structured events

• Enrich data with threat intelligence using Redis and Memcached

• Stream and centralize data at scale with Kafka

• Automate with Ansible for repeatable deployments

Whether you’re building a pipeline on a tight budget or deploying an enterprise-scale system, this book shows you how to centralize your security data, support real-time detection, and lay the groundwork for incident response and long-term forensics.
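
As a flavour of the "messy log to structured event to Kafka" flow described above (this is not code from the book; the regex, topic, and broker address are invented for the example):

```python
# Illustrative sketch (not from the book): turn a raw syslog-style auth log line into
# a structured JSON event and stream it to Kafka. The pattern, topic, and broker
# address are assumptions for the example.
import json
import re
from confluent_kafka import Producer

FAILED_LOGIN = re.compile(
    r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .* Failed password for (?P<user>\S+) "
    r"from (?P<src_ip>\S+)"
)

producer = Producer({"bootstrap.servers": "broker:9092"})

def ship(line: str) -> None:
    match = FAILED_LOGIN.search(line)
    if not match:
        return
    event = {"event_type": "failed_login", **match.groupdict()}
    producer.produce("security.auth", value=json.dumps(event).encode())

ship("Oct  3 11:02:44 host sshd[911]: Failed password for admin from 203.0.113.7 port 52211 ssh2")
producer.flush()
```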

Abstract: Detecting problems as they happen is essential in today’s fast-moving, data-driven world. In this talk, you’ll learn how to build a flexible, real-time anomaly detection pipeline using Apache Kafka and Apache Flink, backed by statistical and machine learning models.

We’ll start by demystifying what anomaly really means: exploring the different types (point, contextual, and collective anomalies) and the difference between unintentional issues and intentional outliers like fraud or abuse. Then, we’ll look at how anomaly detection is solved in practice, from classical statistical models like ARIMA to deep learning models like LSTM. You’ll learn how ARIMA breaks a time series into AutoRegressive, Integrated, and Moving Average components, no math degree required (just a Python library). We’ll also uncover why forgetting is a feature, not a bug, when it comes to LSTMs, and how these models learn to detect complex patterns over time.

Throughout, we’ll show how Kafka handles high-throughput streaming data and how Flink enables low-latency, stateful processing to catch issues as they emerge. You’ll leave knowing not just how these systems work, but when to use each type of model depending on your data and goals. Whether you're monitoring system health, tracking IoT devices, or looking for fraud in transactions, this talk will give you the foundations and tools to detect the unexpected, before it becomes a problem.
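
As a rough illustration of the ARIMA-residual idea mentioned in the abstract (not the speaker's code; the series, model order, and threshold are invented for the example):

```python
# Illustrative sketch: flag anomalies as points whose ARIMA one-step-ahead residual
# exceeds a few standard deviations. Series, model order, and threshold are invented.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)
series[150] += 3.0                                   # injected point anomaly

result = ARIMA(series, order=(2, 1, 2)).fit()        # AR, I(d), MA components
residuals = result.resid                             # in-sample one-step-ahead errors

threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
print("anomalous indices:", anomalies)               # expect index 150 (and possibly 151)
```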

In this session, we’ll walk through how Apache Flink was used to enable near real-time operational insights using manufacturing IIoT data sets. The goal: deliver actionable KPIs to production teams with sub-30-second latency, using streaming data pipelines built with Kafka, Flink, and Grafana. We’ll cover the key architectural patterns that made this possible, including handling structured data joins, managing out-of-order events, and integrating with downstream systems like PostgreSQL and Grafana. We’ll also share real-world performance benchmarks, lessons learned from scaling tests, and practical considerations for deploying Flink in a production-grade, low-latency analytics pipeline. The session will also include a live demo.

If you're building Flink-based solutions for time-sensitive operations—whether in manufacturing, IoT, or other domains—this talk will provide proven insights from the field.
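
As a rough sketch of the pattern described above (not the speaker's pipeline): a PyFlink table over a Kafka topic with a watermark to tolerate out-of-order events, feeding a one-minute KPI aggregation that a dashboard such as Grafana could read from a sink table. The topic, fields, and the ten-second lateness bound are assumptions.

```python
# Illustrative PyFlink sketch (names and thresholds are assumptions): read sensor
# events from Kafka, tolerate out-of-order arrival with a watermark, and compute a
# per-minute KPI per machine.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE machine_events (
        machine_id STRING,
        cycle_time_ms DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'iiot.machine-events',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling window per machine; late events inside the watermark bound are
# still assigned to the correct window.
t_env.execute_sql("""
    SELECT
        machine_id,
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        AVG(cycle_time_ms) AS avg_cycle_time_ms,
        COUNT(*) AS cycles
    FROM machine_events
    GROUP BY machine_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""").print()
```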


According to Wikipedia, Infrastructure as Code is the process of managing and provisioning computer data center resources through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. The same idea applies to Kafka resources and reference data, connector plugins, connector configurations, and the stream processes that clean up the data.

In this talk, we are going to discuss the use cases based on the Network Rail Data Feeds, the scripts used to spin up the environment and cluster in Confluent Cloud, as well as the different components required for the ingress and processing of the data.

This particular environment is used as a teaching tool for Event Stream Processing for Kafka Streams, ksqlDB, and Flink. Some examples of further processing and visualisation will also be provided.
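
As an illustration of the topics-as-code idea (not the actual scripts from the talk, which target Confluent Cloud): a small Python sketch that reconciles a declarative list of topic definitions against a cluster with the Kafka AdminClient. The topic names, partition counts, and broker address are assumptions.

```python
# Illustrative sketch: topics-as-code. A declarative list of topic definitions is
# reconciled against the cluster using the Kafka AdminClient. Names, partition counts,
# and the broker address are assumptions, not the talk's actual setup.
from confluent_kafka.admin import AdminClient, NewTopic

DESIRED_TOPICS = [
    {"name": "networkrail.movements", "partitions": 6, "config": {"retention.ms": "604800000"}},
    {"name": "networkrail.reference", "partitions": 1, "config": {"cleanup.policy": "compact"}},
]

admin = AdminClient({"bootstrap.servers": "broker:9092"})
existing = set(admin.list_topics(timeout=10).topics)

new_topics = [
    NewTopic(t["name"], num_partitions=t["partitions"], replication_factor=3, config=t["config"])
    for t in DESIRED_TOPICS
    if t["name"] not in existing
]

if new_topics:
    for topic, future in admin.create_topics(new_topics).items():
        future.result()        # raises if the broker rejected the creation
        print(f"created {topic}")
```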

In this talk we will look into the details of how Kleinanzeigen, a leader in the classifieds business in Germany, built a data migration system using Apache Kafka and Debezium that migrated millions of users' data from a legacy platform to a new one and allowed bi-directional data sync between them in real time. We will also discover how the system allowed users' data to be updated on both platforms (partially, under certain conditions) while keeping the entire system in sync. Finally, we will learn how the system leveraged a logical clock to implement a custom synchronization algorithm that helped avoid infinite update loops between the platforms.
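
A much-simplified sketch of the loop-avoidance idea (illustrative only, not Kleinanzeigen's actual algorithm): each platform stamps outgoing updates with a logical version and ignores incoming updates that are not strictly newer, so echoes of its own writes are dropped.

```python
# Simplified loop-avoidance with a logical clock (illustrative, not the actual
# implementation): apply an incoming update only if its version is newer than what
# we already hold, so an echo of our own change is a no-op and the loop stops.
from dataclasses import dataclass

@dataclass
class UserRecord:
    user_id: str
    email: str
    version: int          # logical clock, incremented on every local write

local_store: dict[str, UserRecord] = {}

def apply_remote_update(user_id: str, email: str, version: int) -> bool:
    current = local_store.get(user_id)
    if current is not None and version <= current.version:
        return False      # stale or an echo of our own change: do not re-publish
    local_store[user_id] = UserRecord(user_id, email, version)
    return True           # genuinely new: apply (and forward if needed)

# A change that originated here comes back via the other platform with the same
# version; applying it changes nothing, which breaks the update loop.
local_store["u1"] = UserRecord("u1", "old@example.com", version=4)
assert apply_remote_update("u1", "old@example.com", version=4) is False
assert apply_remote_update("u1", "new@example.com", version=5) is True
```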

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or meet data sovereignty requirements. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimal boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider, an abstraction initially supporting Amazon SQS and expanding to Google PubSub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time. This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
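
As a minimal illustration of the Common-SQL abstraction mentioned above: the same operator and query can target a different database simply by pointing the connection id elsewhere. The DAG id, connection id, and query are placeholders; the Common Message Bus provider is not sketched here since its interface is still new.

```python
# Minimal sketch of the Common-SQL abstraction: one operator, one query, any supported
# database. Swap conn_id to run the same task against Postgres, Snowflake, SQLite, etc.
# DAG id, connection id, and query are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="common_sql_example",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    daily_rollup = SQLExecuteQueryOperator(
        task_id="daily_rollup",
        conn_id="analytics_db",          # point this at a different database to retarget
        sql="""
            SELECT order_date, COUNT(*) AS orders
            FROM orders
            GROUP BY order_date
        """,
    )
```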

Sponsored by: Confluent | Turn SAP Data into AI-Powered Insights with Databricks

Learn how Confluent simplifies real-time streaming of your SAP data into AI-ready Delta tables on Databricks. In this session, you'll see how Confluent’s fully managed data streaming platform—with unified Apache Kafka® and Apache Flink®—connects data from SAP S/4HANA, ECC, and 120+ other sources to enable easy development of trusted, real-time data products that fuel highly contextualized AI and analytics. With Tableflow, you can represent Kafka topics as Delta tables in just a few clicks—eliminating brittle batch jobs and custom pipelines. You’ll see a product demo showcasing how Confluent unites your SAP and Databricks environments to unlock ERP-fueled AI, all while reducing the total cost of ownership (TCO) for data streaming by up to 60%.

Creating a Custom PySpark Stream Reader with PySpark 4.0

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC, Delta Lake, etc. However, some older systems, such as systems that use the JMS protocol, are not supported by default and require considerable extra work for developers to read from them. One such example is ActiveMQ for streaming. Traditionally, users of ActiveMQ have had to use a middleman in order to read the stream with Spark (such as writing to a MySQL DB using Java code and reading that table with Spark JDBC). With PySpark 4.0’s custom data sources (supported in DBR 15.3+), we can cut out the middleman and consume the queues directly from PySpark, in batch or with Spark Streaming, saving developers considerable time and complexity in getting source data into your Delta Lake, governed by Unity Catalog, and orchestrated with Databricks Workflows.
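
A skeleton of what such a custom streaming source can look like with the PySpark 4.0 Python data source API; the ActiveMQ/JMS polling itself is stubbed out, and the class, option, and field names are illustrative.

```python
# Skeleton of a custom streaming data source using the PySpark 4.0 Python data source
# API (pyspark.sql.datasource). The ActiveMQ/JMS polling is stubbed out; class, option,
# and field names are illustrative.
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


class ActiveMQDataSource(DataSource):
    @classmethod
    def name(cls):
        return "activemq"                       # used as .format("activemq")

    def schema(self):
        return "message_id string, body string"

    def streamReader(self, schema):
        return ActiveMQStreamReader(self.options)


class ActiveMQStreamReader(DataSourceStreamReader):
    def __init__(self, options):
        self.queue = options.get("queue", "events")

    def initialOffset(self):
        return {"position": 0}

    def latestOffset(self):
        # A real reader would ask the broker how far the queue has advanced.
        return {"position": 100}

    def partitions(self, start, end):
        return [InputPartition((start["position"], end["position"]))]

    def read(self, partition):
        lo, hi = partition.value
        # Stub: a real implementation would drain messages from the JMS queue here.
        for i in range(lo, hi):
            yield (str(i), f"payload-{i}")


# Registration and use (inside a SparkSession):
# spark.dataSource.register(ActiveMQDataSource)
# df = spark.readStream.format("activemq").option("queue", "orders").load()
```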