talk-data.com

Topic: Kafka (Apache Kafka)

Tags: distributed_streaming, message_queue, event_streaming

Tagged activities: 240

Activity Trend: 20 peak/qtr, 2020-Q1 to 2026-Q1

Activities

240 activities · Newest first

Data leaders today face a familiar challenge: complex pipelines, duplicated systems, and spiraling infrastructure costs. Standardizing around Kafka for real-time and Iceberg for large-scale analytics has gone some way towards addressing this but still requires separate stacks, leaving teams to stitch them together at high expense and risk.

This talk will explore how Kafka and Iceberg together form a new foundation for data infrastructure: one that unifies streaming and analytics into a single, cost-efficient layer. By standardizing on these open technologies, organizations can reduce data duplication, simplify governance, and unlock both instant insights and long-term value from the same platform.

You will come away with a clear understanding of why this convergence is reshaping the industry, how it lowers operational risk, and the advantages it offers for building durable, future-proof data capabilities.

Moving data between operational systems and analytics platforms is often a painful process. Traditional pipelines that transfer data in and out of warehouses tend to become complex, brittle, and expensive to maintain over time.

Much of this complexity, however, is avoidable. Data in motion and data at rest—Kafka Topics and Iceberg Tables—can be treated as two sides of the same coin. By establishing an equivalence between Topics and Tables, it’s possible to transparently map between them and rethink how pipelines are built.

This talk introduces a declarative approach to bridging streaming and table-based systems. By shifting complexity into the data layer, we can decompose complex, imperative pipelines into simpler, more reliable workflows.

We’ll explore the design principles behind this approach, including schema mapping and evolution between Kafka and Iceberg, and how to build a system that can continuously materialize and optimize hundreds of thousands of topics as Iceberg tables.
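
To make the topic-to-table equivalence concrete, here is a minimal sketch (not taken from the talk) that continuously materializes a Kafka topic into an Iceberg table with Spark Structured Streaming; the topic, schema, broker address, and table names are assumptions for illustration.

```python
# Illustrative sketch (not from the talk): continuously materialize a Kafka topic
# as an Iceberg table with Spark Structured Streaming. Topic, schema, broker, and
# table names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("topic-to-table").getOrCreate()

event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", LongType()),
    StructField("ts", LongType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")            # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Each micro-batch is appended to the Iceberg table; schema evolution and file
# compaction are then handled by the table format and its maintenance jobs.
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("lake.db.orders")                # hypothetical Iceberg table
)
query.awaitTermination()
```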

Whether you're building new pipelines or modernizing legacy systems, this session will provide practical patterns and strategies for creating resilient, scalable, and future-proof data architectures.

Apache Kafka is the simplest possible reliable, horizontally scalable, low-latency storage system for commodity hardware. This is increasingly making it the backbone of analytic data collection stacks and event-bus-like architectures. Critical systems like this require very reliable operations. Because Kafka is both stateful and distributed, it has traditional sysadmin problems as well as problems that require deep expertise. We will discuss CPU and disk capacity management and defining availability SLOs for a distributed stateful system. We will also show, in a demo, some of the ways in which the Google Cloud Managed Service for Apache Kafka and lenses.io help solve these problems.
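
As a back-of-the-envelope illustration of the disk capacity question raised above (the numbers are invented for the example, not taken from the talk):

```python
# Back-of-the-envelope disk sizing for a Kafka cluster (illustrative numbers only).
write_mb_per_sec = 50          # average producer throughput into the cluster
retention_hours = 72           # topic retention
replication_factor = 3
brokers = 6
headroom = 0.4                 # keep ~40% free for rebalances, bursts, and compaction

retained_gb = write_mb_per_sec * 3600 * retention_hours * replication_factor / 1024
per_broker_gb = retained_gb / brokers / (1 - headroom)

print(f"total retained data: {retained_gb:,.0f} GB")
print(f"disk needed per broker: {per_broker_gb:,.0f} GB")
```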

By leveraging tools like Jaeger and New Relic, we will uncover how to gain a full view of your microservices, even in the face of Apache Kafka's asynchronous nature. Join us for a live demo with a simple Java Spring Boot app, where we will walk through both automatic and manual instrumentation to capture rich telemetry. We will also touch on infrastructure-level observability, pulling metrics and traces from Apache Kafka brokers and Apache Flink.
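
The demo in this session uses a Java Spring Boot app; purely as an illustration of the manual-instrumentation idea in another language, the hedged Python sketch below starts an OpenTelemetry span around a Kafka produce call and propagates the trace context in message headers. The service name, topic, and broker address are assumptions.

```python
# Illustrative sketch (the talk's demo is a Java Spring Boot app): manually create a
# span around a Kafka produce call and propagate trace context in message headers so
# a downstream consumer can continue the trace. Broker and topic are hypothetical.
from opentelemetry import trace
from opentelemetry.propagate import inject
from confluent_kafka import Producer

tracer = trace.get_tracer("checkout-service")
producer = Producer({"bootstrap.servers": "broker:9092"})

def publish_order(order_id: str, payload: bytes) -> None:
    with tracer.start_as_current_span("publish-order") as span:
        span.set_attribute("messaging.system", "kafka")
        span.set_attribute("order.id", order_id)

        headers: dict[str, str] = {}
        inject(headers)                        # writes the traceparent into the dict
        producer.produce(
            "orders",                          # hypothetical topic
            value=payload,
            headers=list(headers.items()),
        )
        producer.flush()
```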

Following on from the "Building consumable data products" keynote, we will dive deeper into the interactions around the data product catalog, to show how the network effect of explicit data sharing relationships starts to pay dividends to the participants. For example:

For the product consumer:

• Searching for products, understanding content, costs, terms and conditions, licenses, quality certifications, etc.

• Inspecting sample data, choosing preferred data format, setting up a secure subscription, and seeing data provisioned into a database from the product catalog.

• Providing feedback and requesting help

• Reviewing own active subscriptions

• Understanding the lineage behind each product along with outstanding exceptions and future plans

For the product manager/owner:

• Setting up a new product, creating a new release of an existing product and issuing a data correction/restatement

• Reviewing a product’s active subscriptions and feedback/requests from consumers

• Interacting with the technical teams on pipeline implementations along with issues and proposed enhancements

For the data governance team:

• Viewing the network of dependencies between data products (the data mesh) to understand the data value chains and risk concentrations

• Reviewing a dashboard of metrics around the data products including popularity, errors/exceptions, subscriptions, interaction

• Showing traceability from a governance policy relating to, say, data sovereignty or data privacy, to the product implementations

• Building trust profiles for producers and consumers

The aim of the demonstrations and discussions is to explore the principles and patterns relating to data products, rather than push a particular implementation approach.

Having said that, all of the software used in the demonstrations is open source. Principally, this is Egeria, OpenLineage and Unity Catalog from the Linux Foundation, plus Apache Airflow, Apache Kafka and Apache Superset from the Apache Software Foundation.

Videos of the demonstrations will be available on YouTube after the conference and the complete demo software can be downloaded and run on a laptop so you can share your experiences with your teams after the event.

Data Engineering for Cybersecurity

Security teams rely on telemetry—the continuous stream of logs, events, metrics, and signals that reveal what’s happening across systems, endpoints, and cloud services. But that data doesn’t organize itself. It has to be collected, normalized, enriched, and secured before it becomes useful. That’s where data engineering comes in.

In this hands-on guide, cybersecurity engineer James Bonifield teaches you how to design and build scalable, secure data pipelines using free, open source tools such as Filebeat, Logstash, Redis, Kafka, Elasticsearch, and more. You’ll learn how to collect telemetry from Windows (including Sysmon and PowerShell events), Linux files and syslog, and streaming data from network and security appliances. You’ll then transform it into structured formats, secure it in transit, and automate your deployments using Ansible.

You’ll also learn how to:

• Encrypt and secure data in transit using TLS and SSH

• Centrally manage code and configuration files using Git

• Transform messy logs into structured events

• Enrich data with threat intelligence using Redis and Memcached

• Stream and centralize data at scale with Kafka

• Automate with Ansible for repeatable deployments

Whether you’re building a pipeline on a tight budget or deploying an enterprise-scale system, this book shows you how to centralize your security data, support real-time detection, and lay the groundwork for incident response and long-term forensics.
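
As a flavour of the "messy log to structured event to Kafka" flow described above (this is not code from the book; the regex, topic, and broker address are invented for the example):

```python
# Illustrative sketch (not from the book): turn a raw syslog-style auth log line into
# a structured JSON event and stream it to Kafka. The pattern, topic, and broker
# address are assumptions for the example.
import json
import re
from confluent_kafka import Producer

FAILED_LOGIN = re.compile(
    r"(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .* Failed password for (?P<user>\S+) "
    r"from (?P<src_ip>\S+)"
)

producer = Producer({"bootstrap.servers": "broker:9092"})

def ship(line: str) -> None:
    match = FAILED_LOGIN.search(line)
    if not match:
        return
    event = {"event_type": "failed_login", **match.groupdict()}
    producer.produce("security.auth", value=json.dumps(event).encode())

ship("Oct  3 11:02:44 host sshd[911]: Failed password for admin from 203.0.113.7 port 52211 ssh2")
producer.flush()
```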

Abstract: Detecting problems as they happen is essential in today’s fast-moving, data-driven world. In this talk, you’ll learn how to build a flexible, real-time anomaly detection pipeline using Apache Kafka and Apache Flink, backed by statistical and machine learning models.

We’ll start by demystifying what anomaly really means: exploring the different types (point, contextual, and collective anomalies) and the difference between unintentional issues and intentional outliers like fraud or abuse. Then, we’ll look at how anomaly detection is solved in practice, from classical statistical models like ARIMA to deep learning models like LSTM. You’ll learn how ARIMA breaks a time series into AutoRegressive, Integrated, and Moving Average components, no math degree required (just a Python library). We’ll also uncover why forgetting is a feature, not a bug, when it comes to LSTMs, and how these models learn to detect complex patterns over time.

Throughout, we’ll show how Kafka handles high-throughput streaming data and how Flink enables low-latency, stateful processing to catch issues as they emerge. You’ll leave knowing not just how these systems work, but when to use each type of model depending on your data and goals. Whether you're monitoring system health, tracking IoT devices, or looking for fraud in transactions, this talk will give you the foundations and tools to detect the unexpected, before it becomes a problem.
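
As a rough illustration of the ARIMA-residual idea mentioned in the abstract (not the speaker's code; the series, model order, and threshold are invented for the example):

```python
# Illustrative sketch: flag anomalies as points whose ARIMA one-step-ahead residual
# exceeds a few standard deviations. Series, model order, and threshold are invented.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)
series[150] += 3.0                                   # injected point anomaly

result = ARIMA(series, order=(2, 1, 2)).fit()        # AR, I(d), MA components
residuals = result.resid                             # in-sample one-step-ahead errors

threshold = 3 * residuals.std()
anomalies = np.where(np.abs(residuals) > threshold)[0]
print("anomalous indices:", anomalies)               # expect index 150 (and possibly 151)
```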

In this session, we’ll walk through how Apache Flink was used to enable near real-time operational insights using manufacturing IIoT data sets. The goal: deliver actionable KPIs to production teams with sub-30-second latency, using streaming data pipelines built with Kafka, Flink, and Grafana. We’ll cover the key architectural patterns that made this possible, including handling structured data joins, managing out-of-order events, and integrating with downstream systems like PostgreSQL and Grafana. We’ll also share real-world performance benchmarks, lessons learned from scaling tests, and practical considerations for deploying Flink in a production-grade, low-latency analytics pipeline. The session will also include a live demo.

If you're building Flink-based solutions for time-sensitive operations—whether in manufacturing, IoT, or other domains—this talk will provide proven insights from the field.
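
As a rough sketch of the pattern described above (not the speaker's pipeline): a PyFlink table over a Kafka topic with a watermark to tolerate out-of-order events, feeding a one-minute KPI aggregation that a dashboard such as Grafana could read from a sink table. The topic, fields, and the ten-second lateness bound are assumptions.

```python
# Illustrative PyFlink sketch (names and thresholds are assumptions): read sensor
# events from Kafka, tolerate out-of-order arrival with a watermark, and compute a
# per-minute KPI per machine.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE machine_events (
        machine_id STRING,
        cycle_time_ms DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'iiot.machine-events',
        'properties.bootstrap.servers' = 'broker:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling window per machine; late events inside the watermark bound are
# still assigned to the correct window.
t_env.execute_sql("""
    SELECT
        machine_id,
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        AVG(cycle_time_ms) AS avg_cycle_time_ms,
        COUNT(*) AS cycles
    FROM machine_events
    GROUP BY machine_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""").print()
```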


According to Wikipedia, Infrastructure as Code is the process of managing and provisioning computer data center resources through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. The same idea applies to Kafka resources and reference data, connector plugins, connector configurations, and the stream processes that clean up the data.

In this talk, we are going to discuss the use cases based on the Network Rail Data Feeds, the scripts used to spin up the environment and cluster in Confluent Cloud, as well as the different components required for the ingress and processing of the data.

This particular environment is used as a teaching tool for Event Stream Processing for Kafka Streams, ksqlDB, and Flink. Some examples of further processing and visualisation will also be provided.
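
As an illustration of the topics-as-code idea (not the actual scripts from the talk, which target Confluent Cloud): a small Python sketch that reconciles a declarative list of topic definitions against a cluster with the Kafka AdminClient. The topic names, partition counts, and broker address are assumptions.

```python
# Illustrative sketch: topics-as-code. A declarative list of topic definitions is
# reconciled against the cluster using the Kafka AdminClient. Names, partition counts,
# and the broker address are assumptions, not the talk's actual setup.
from confluent_kafka.admin import AdminClient, NewTopic

DESIRED_TOPICS = [
    {"name": "networkrail.movements", "partitions": 6, "config": {"retention.ms": "604800000"}},
    {"name": "networkrail.reference", "partitions": 1, "config": {"cleanup.policy": "compact"}},
]

admin = AdminClient({"bootstrap.servers": "broker:9092"})
existing = set(admin.list_topics(timeout=10).topics)

new_topics = [
    NewTopic(t["name"], num_partitions=t["partitions"], replication_factor=3, config=t["config"])
    for t in DESIRED_TOPICS
    if t["name"] not in existing
]

if new_topics:
    for topic, future in admin.create_topics(new_topics).items():
        future.result()        # raises if the broker rejected the creation
        print(f"created {topic}")
```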

In this talk we will look into the details of how Kleinanzeigen, a leader in the classifieds business in Germany, built a data migration system using Apache Kafka and Debezium that migrated millions of users' data from a legacy platform to a new one and allowed bi-directional data sync between them in real time. We will also discover how the system allowed users' data to be updated on both platforms (partially, under certain conditions) while keeping the entire system in sync. Finally, we will learn how the system leveraged a logical clock to implement a custom synchronization algorithm that helped avoid infinite update loops between the platforms.
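
A much-simplified sketch of the loop-avoidance idea (illustrative only, not Kleinanzeigen's actual algorithm): each platform stamps outgoing updates with a logical version and ignores incoming updates that are not strictly newer, so echoes of its own writes are dropped.

```python
# Simplified loop-avoidance with a logical clock (illustrative, not the actual
# implementation): apply an incoming update only if its version is newer than what
# we already hold, so an echo of our own change is a no-op and the loop stops.
from dataclasses import dataclass

@dataclass
class UserRecord:
    user_id: str
    email: str
    version: int          # logical clock, incremented on every local write

local_store: dict[str, UserRecord] = {}

def apply_remote_update(user_id: str, email: str, version: int) -> bool:
    current = local_store.get(user_id)
    if current is not None and version <= current.version:
        return False      # stale or an echo of our own change: do not re-publish
    local_store[user_id] = UserRecord(user_id, email, version)
    return True           # genuinely new: apply (and forward if needed)

# A change that originated here comes back via the other platform with the same
# version; applying it changes nothing, which breaks the update loop.
local_store["u1"] = UserRecord("u1", "old@example.com", version=4)
assert apply_remote_update("u1", "old@example.com", version=4) is False
assert apply_remote_update("u1", "new@example.com", version=5) is True
```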

Enterprises want the flexibility to operate across multiple clouds, whether to optimize costs, improve resiliency, avoid vendor lock-in, or meet data sovereignty requirements. But for developers, that flexibility usually comes at the cost of extra complexity and redundant code. The goal here is simple: write once, run anywhere, with minimal boilerplate. In Apache Airflow, we’ve already begun tackling this problem with abstractions like Common-SQL, which lets you write database queries once and run them on 20+ databases, from Snowflake to Postgres to SQLite to SAP HANA. Similarly, Common-IO standardizes cloud blob storage interactions across all public clouds. With Airflow 3.0, we are pushing this further by introducing a Common Message Bus provider, an abstraction initially supporting Amazon SQS and expanding to Google PubSub and Apache Kafka soon after. We expect additional implementations such as Amazon Kinesis and Managed Kafka over time. This talk will dive into why these abstractions matter, how they reduce friction for developers while giving enterprises true multi-cloud optionality, and what’s next for Airflow’s evolving provider ecosystem.
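
As a minimal illustration of the Common-SQL abstraction mentioned above: the same operator and query can target a different database simply by pointing the connection id elsewhere. The DAG id, connection id, and query are placeholders; the Common Message Bus provider is not sketched here since its interface is still new.

```python
# Minimal sketch of the Common-SQL abstraction: one operator, one query, any supported
# database. Swap conn_id to run the same task against Postgres, Snowflake, SQLite, etc.
# DAG id, connection id, and query are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="common_sql_example",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    daily_rollup = SQLExecuteQueryOperator(
        task_id="daily_rollup",
        conn_id="analytics_db",          # point this at a different database to retarget
        sql="""
            SELECT order_date, COUNT(*) AS orders
            FROM orders
            GROUP BY order_date
        """,
    )
```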

Sponsored by: Confluent | Turn SAP Data into AI-Powered Insights with Databricks

Learn how Confluent simplifies real-time streaming of your SAP data into AI-ready Delta tables on Databricks. In this session, you'll see how Confluent’s fully managed data streaming platform—with unified Apache Kafka® and Apache Flink®—connects data from SAP S/4HANA, ECC, and 120+ other sources to enable easy development of trusted, real-time data products that fuel highly contextualized AI and analytics. With Tableflow, you can represent Kafka topics as Delta tables in just a few clicks—eliminating brittle batch jobs and custom pipelines. You’ll see a product demo showcasing how Confluent unites your SAP and Databricks environments to unlock ERP-fueled AI, all while reducing the total cost of ownership (TCO) for data streaming by up to 60%.

Creating a Custom PySpark Stream Reader with PySpark 4.0

PySpark supports many data sources out of the box, such as Apache Kafka, JDBC, ODBC, Delta Lake, etc. However, some older systems, such as systems that use the JMS protocol, are not supported by default and require considerable extra work for developers to read from them. One such example is ActiveMQ for streaming. Traditionally, users of ActiveMQ have had to use a middleman in order to read the stream with Spark (such as writing to a MySQL DB using Java code and reading that table with Spark JDBC). With PySpark 4.0’s custom data sources (supported in DBR 15.3+), we can cut out the middleman and consume the queues directly from PySpark, in batch or with Spark Streaming, saving developers considerable time and complexity in getting source data into your Delta Lake, governed by Unity Catalog, and orchestrated with Databricks Workflows.
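
A skeleton of what such a custom streaming source can look like with the PySpark 4.0 Python data source API; the ActiveMQ/JMS polling itself is stubbed out, and the class, option, and field names are illustrative.

```python
# Skeleton of a custom streaming data source using the PySpark 4.0 Python data source
# API (pyspark.sql.datasource). The ActiveMQ/JMS polling is stubbed out; class, option,
# and field names are illustrative.
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition


class ActiveMQDataSource(DataSource):
    @classmethod
    def name(cls):
        return "activemq"                       # used as .format("activemq")

    def schema(self):
        return "message_id string, body string"

    def streamReader(self, schema):
        return ActiveMQStreamReader(self.options)


class ActiveMQStreamReader(DataSourceStreamReader):
    def __init__(self, options):
        self.queue = options.get("queue", "events")

    def initialOffset(self):
        return {"position": 0}

    def latestOffset(self):
        # A real reader would ask the broker how far the queue has advanced.
        return {"position": 100}

    def partitions(self, start, end):
        return [InputPartition((start["position"], end["position"]))]

    def read(self, partition):
        lo, hi = partition.value
        # Stub: a real implementation would drain messages from the JMS queue here.
        for i in range(lo, hi):
            yield (str(i), f"payload-{i}")


# Registration and use (inside a SparkSession):
# spark.dataSource.register(ActiveMQDataSource)
# df = spark.readStream.format("activemq").option("queue", "orders").load()
```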