talk-data.com

Topic: Apache Kafka (Kafka)
Tags: distributed_streaming, message_queue, event_streaming
240 activities tagged

Activity Trend: peak of 20 activities per quarter, 2020-Q1 to 2026-Q1

Activities

240 activities · Newest first

Advanced SQL

SQL is no longer just a querying language for relational databases—it's a foundational tool for building scalable, modern data solutions across real-time analytics, machine learning workflows, and even generative AI applications. Advanced SQL shows data professionals how to move beyond conventional SELECT statements and tap into the full power of SQL as a programming interface for today's most advanced data platforms. Written by seasoned data experts Rui Pedro Machado, Hélder Russa, and Pedro Esmeriz, this practical guide explores the role of SQL in streaming architectures (like Apache Kafka and Flink), data lake ecosystems, cloud data warehouses, and ML pipelines. Geared toward data engineers, analysts, scientists, and analytics engineers, the book combines hands-on guidance with architectural best practices to help you extend your SQL skills into emerging workloads and real-world production systems.

- Use SQL to design and deploy modern, end-to-end data architectures
- Integrate SQL with data lakes, stream processing, and cloud platforms
- Apply SQL in feature engineering and ML model deployment
- Master pipe syntax and other advanced features for scalable, efficient queries
- Leverage SQL to build GenAI-ready data applications and pipelines

Practical Data Engineering with Apache Projects: Solving Everyday Data Challenges with Spark, Iceberg, Kafka, Flink, and More

This book is a comprehensive guide designed to equip you with the practical skills and knowledge necessary to tackle real-world data challenges using open source solutions. Focusing on 10 real-world data engineering projects, it caters specifically to data engineers at the early stages of their careers, providing a strong foundation in essential open source tools and techniques such as Apache Spark, Flink, Airflow, Kafka, and many more. Each chapter is dedicated to a single project, starting with a clear presentation of the problem it addresses. You will then be guided through a step-by-step process to solve the problem, leveraging widely-used open-source data tools. This hands-on approach ensures that you not only understand the theoretical aspects of data engineering but also gain valuable experience in applying these concepts to real-world scenarios.

At the end of each chapter, the book delves into common challenges that may arise during the implementation of the solution, offering practical advice on troubleshooting these issues effectively. Additionally, the book highlights best practices that data engineers should follow to ensure the robustness and efficiency of their solutions. A major focus of the book is using open-source projects and tools to solve problems encountered in data engineering.

In summary, this book is an indispensable resource for data engineers looking to build a strong foundation in the field. By offering practical, real-world projects and emphasizing problem-solving and best practices, it will prepare you to tackle the complex data challenges encountered throughout your career. Whether you are an aspiring data engineer or looking to enhance your existing skills, this book provides the knowledge and tools you need to succeed in the ever-evolving world of data engineering.

You Will Learn:
- The foundational concepts of data engineering and practical experience in solving real-world data engineering problems
- How to proficiently use open-source data tools like Apache Kafka, Flink, Spark, Airflow, and Trino
- 10 hands-on data engineering projects
- How to troubleshoot common challenges in data engineering projects

Who this book is for: Early-career data engineers and aspiring data engineers who are looking to build a strong foundation in the field; mid-career professionals looking to transition into data engineering roles; and technology enthusiasts interested in gaining insights into data engineering practices and tools.

AWS re:Invent 2025 - Powering your Agentic AI experience with AWS Streaming and Messaging (ANT310)

Organizations are accelerating innovation with generative AI and agentic AI use cases. This session explores how AWS streaming and messaging services such as Amazon Managed Streaming for Apache Kafka, Kinesis Data Streams, Amazon Managed Service for Apache Flink, and Amazon SQS help you build intelligent, responsive applications. Discover how streaming supports real-time data ingestion and processing, while messaging ensures reliable coordination between AI agents, orchestrates workflows, and delivers critical information at scale. Learn architectural patterns that show how a unified approach acts on data as quickly as needed, with the reliability and scale to support your next generation of AI applications.


AWS re:Invent 2025 - Operating Apache Kafka and Apache Flink at scale (ANT307)

Enterprises use Apache Kafka and Apache Flink for a growing number of mission-critical use cases: real-time analytics, application messaging, and machine learning. As this usage grows in size and scale, so do the criticality, scale, and cost of managing the Kafka and Flink clusters. Learn how customers achieve the same or higher availability and durability for their growing clusters, at lower unit cost and with greater operational simplicity, using Amazon MSK (Managed Streaming for Apache Kafka) and Amazon Managed Service for Apache Flink (MSF).


In this 2-hour hands-on workshop, you'll build an end-to-end streaming analytics pipeline that captures live cryptocurrency prices, processes them in real-time, and uses AI to forecast the future. Ingest live crypto data into Apache Kafka using Kafka Connect; tame that chaos with Apache Flink's stream processing; freeze streams into queryable Apache Iceberg tables using Tableflow; and forecast price trends with Flink AI.
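
To make the stream-processing step concrete, here is a minimal sketch of a Flink job reading a raw crypto-price topic from Kafka with the DataStream API. This is not the workshop's code: the broker address, topic, and group id are hypothetical, and the real pipeline continues on to Iceberg tables and Flink AI forecasting.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CryptoPriceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topic fed by a Kafka Connect source connector with live price ticks.
        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setTopics("crypto-prices-raw")
            .setGroupId("crypto-flink-demo")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        // In a fuller pipeline this stream would be parsed, windowed, and written out
        // to an Apache Iceberg table; here it is simply printed to show the wiring.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "crypto-prices")
           .print();

        env.execute("crypto-price-pipeline");
    }
}
```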

Grafana's magic comes from seeing what's happening right now and instantly comparing it to everything that has happened before. Real-time data lets teams spot anomalies the moment they emerge. Long-term data reveals whether those anomalies are new, seasonal, or the same gremlins haunting your system every quarter. But actually building this capability? That's where everything gets messy. Today's dashboards are cobbled together from two very different worlds: long-term data living in lakes and warehouses, and real-time streams blasting through Kafka or similar systems. These systems rarely fit together cleanly, which forces dashboard developers to wrestle with:
- Differing processing concepts: What does SQL even mean on a stream?
- Inconsistent governance: Tables vs. message schemas, different owners, different rules
- Incomplete history: Not everything is kept forever, and you never know what will vanish next
- Maintenance drift: As pipelines evolve, your ETL always falls behind
But what if there were no separation at all? Join us for a deep dive into a new, unified approach where real-time and historical data live together in a single, seamless dataset. Imagine dashboards powered by one source of truth that stretches from less than one second ago to five, ten, or even fifteen years into the past, without stitching, syncing, or duct-taping systems together. Using Apache Kafka, Apache Iceberg, and a modern architectural pattern that eliminates the old 'batch vs. stream' divide, we'll show how to:
- Build Grafana dashboards that just work, with consistent semantics end-to-end
- Keep every message forever without drowning in storage costs
- Query real-time and historical data with the same language, same governance, same everything
- Escape the ETL death spiral once and for all
If you've ever wished your dashboards were both lightning-fast and infinitely deep, this talk will show you how close that future really is.

Kafka and Flink tend to get lumped together as "data services" in the sense that they process data, but they differ dramatically from traditional databases in functionality and utility. In this talk, we'll run through the lifetime of a write in Postgres to establish a baseline, understanding all the different services that data hits on its way down to the disk. Then we'll walk through writing data to a Kafka topic, and what 'writing' (or really, streaming) data to a Flink workflow looks like from a similar systems perspective. Along the way, we'll understand the key differences between the services and why some are more suited to long-term data storage than others.
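
For reference, a client-side write to a Kafka topic looks roughly like the sketch below (broker address, topic, and payload are hypothetical). With acks=all, the send is only acknowledged once the record has been appended to the partition leader's log and replicated to the in-sync followers, which is exactly the path the talk traces.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class TopicWrite {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for the full in-sync replica set before acknowledging

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}");
            // The record is batched, appended to the leader's log segment, and replicated.
            RecordMetadata meta = producer.send(record).get(); // block until acknowledged
            System.out.printf("wrote to %s-%d at offset %d%n",
                meta.topic(), meta.partition(), meta.offset());
        }
    }
}
```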

While Kafka excels at streaming data, the real challenge lies in making that data analytically useful without sacrificing consistency or performance. This talk explores why Apache Iceberg has emerged as the ideal streaming destination, offering ACID transactions, schema evolution, and time travel capabilities that traditional data lakes can't match. Learn about some foundational tools that enable streaming pipelines and why they all converge on this next-generation table format built for flexibility and scalability.
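
As a rough illustration of those capabilities, the sketch below uses the Iceberg Java API against a hypothetical local catalog and an existing table: schema evolution is an atomically committed metadata change, and earlier snapshots stay queryable via time travel.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.data.IcebergGenerics;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.types.Types;

public class IcebergFeatures {
    public static void main(String[] args) throws Exception {
        // Hypothetical local warehouse; production setups usually sit on object storage
        // behind a REST, Glue, or Polaris catalog.
        HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "file:///tmp/warehouse");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // Schema evolution: adding a column is a metadata-only change committed atomically.
        table.updateSchema().addColumn("source", Types.StringType.get()).commit();

        // Time travel: read the table as of its first recorded snapshot.
        long firstSnapshot = table.history().get(0).snapshotId();
        try (CloseableIterable<Record> rows =
                 IcebergGenerics.read(table).useSnapshot(firstSnapshot).build()) {
            rows.forEach(System.out::println);
        }
    }
}
```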

In this session, see how meshIQ, a comprehensive management and observability platform for messaging and event streaming technologies like Kafka, RabbitMQ and IBM MQ, can be used to track application message flows, help identify bottlenecks and locate "missing" messages.

In this session, we’ll explore the real-world journey of implementing a scalable, secure, and resilient data streaming platform—from the ground up. Bridging DevOps and DataOps practices, we’ll cover how our team designed the architecture, selected the right tools (like Kafka and Kubernetes), automated deployments, and enforced data governance across environments. You'll learn how we tackled challenges like schema evolution, CI/CD for data pipelines, monitoring at scale, and team collaboration. Whether you're just starting or scaling your data platform, this talk offers practical takeaways and battle-tested lessons from the trenches of building streaming infrastructure in production.

The steel industry is characterized by complex production chains and demanding physico-chemical processes, which creates a major need for modeling to optimize productivity.

Our strategy of consolidating industrial information and making real-time data streams available lets us multiply and accelerate the deployment of both physical and AI-based models.

The main topics covered are:

• Freeing data and breaking down silos through a hybrid architecture based on Kafka
• Simplifying the integration of components into our industrial systems
• Giving teams more autonomy through self-service tools (data catalog, querying, modeling platforms)

The session closes with a concrete modeling case study illustrating the operational benefits delivered by this architecture, with a focus on real-time process control to keep product quality in check using hybrid solutions that couple physical modeling with data-driven models.

Kafka 4.1 promotes KIP-932, "Queues for Kafka," to preview status. Beyond the KIP's provocative title, it introduces "share groups" into Kafka, adopting some of the mechanisms found in queuing systems. There are, however, fundamental differences from JMS or MQ. We will therefore explore the KIP in detail: its APIs and the problems it solves, but also its limitations and potential future developments.
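
As a very rough sketch of what consuming through a share group might look like, the snippet below follows the client API described in KIP-932; since the feature is only in preview, the class and method names come from the KIP and may still change, and the broker address, topic, and group name are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaShareConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ShareGroupWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // hypothetical broker
        props.put("group.id", "billing-workers");          // interpreted as a share group, not a classic consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Per KIP-932, records are handed out to group members individually and
        // acknowledged per record, rather than partitions being owned by one consumer.
        try (KafkaShareConsumer<String, String> consumer = new KafkaShareConsumer<>(props)) {
            consumer.subscribe(List.of("billing-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
                consumer.commitSync(); // implicit mode: acknowledges everything delivered by the last poll
            }
        }
    }
}
```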

Kafka is a powerful tool, but integrating it can quickly feel like black magic: complex configuration, fine-grained handling of partitions, retries, schemas… This talk sets out to demystify Kafka by exploring concrete use cases (broadcast, cooperation between instances), its configuration, and above all how our in-house C# SDK simplifies its use.

Discussion: Apache Iceberg, Apache Kafka, and Apache Polaris (incubating) are in the same boat: populating Iceberg tables. After a quick introduction to Apache Iceberg, Apache Kafka, and Apache Polaris (incubating), we will look at the possible approaches for ingesting data into Iceberg tables with Kafka and the issues to consider: snapshot management, data file compaction, etc. We will then look at possible solutions and at how Apache Polaris (incubating) can help with table maintenance.

Kafka Internals I Wish I Knew Sooner: The Non-Boring Truths

Most of us start with Kafka by building a simple producer/consumer demo. It just works — until it doesn’t. Suddenly, disk space isn’t freed up after data “expires,” rebalances loop endlessly during deploys, and strange errors about missing leaders clog your logs. In the panic, we dive into Kafka’s ocean of config options — hoping something will stick. Sound familiar?

This talk is a collection of hard-won lessons — not flashy tricks, but the kind of insights you only gain after operating Kafka in production for years. You’ll walk away with mental models that make Kafka’s internal behavior more predictable and less surprising.

We’ll cover:
- Storage internals: Why expired data doesn’t always free space, and how Kafka actually reclaims disk
- Transactions & delivery semantics: What “exactly-once” really means, and when it silently downgrades
- Consumer group rebalancing: Why rebalances loop, and how the controller’s hidden behavior affects them

If you’ve used Kafka — or plan to — these insights will save you hours of frustration and debugging. A basic understanding of partitions, replication, and Kafka’s general architecture will help you get the most out of this session.
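
On the delivery-semantics point above, "exactly-once" on the producing side boils down to an idempotent, transactional producer, roughly as in the sketch below (broker address, topic, and transactional.id are hypothetical); note that the guarantee silently downgrades unless consumers also read with isolation.level=read_committed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalWrite {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");             // deduplicate broker-side retries
        props.put("transactional.id", "payments-writer-1");  // stable id so zombie producers get fenced

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payments", "p-1", "debit"));
                producer.send(new ProducerRecord<>("payments", "p-1", "credit"));
                producer.commitTransaction(); // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction();  // read_committed consumers never see aborted records
                throw e;
            }
        }
    }
}
```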

Data leaders today face a familiar challenge: complex pipelines, duplicated systems, and spiraling infrastructure costs. Standardizing around Kafka for real-time workloads and Iceberg for large-scale analytics has gone some way toward addressing this, but it still requires separate stacks, leaving teams to stitch them together at high expense and risk.

This talk will explore how Kafka and Iceberg together form a new foundation for data infrastructure. One that unifies streaming and analytics into a single, cost-efficient layer. By standardizing on these open technologies, organizations can reduce data duplication, simplify governance, and unlock both instant insights and long-term value from the same platform.

You will come away with a clear understanding of why this convergence is reshaping the industry, how it lowers operational risk, and the advantages it offers for building durable, future-proof data capabilities.