An organisation's data has traditionally been split between the operational estate, for daily business operations, and the analytical estate, for after-the-fact analysis and reporting. The journey from one side to the other is today a long and tortuous one. But does it have to be?

In the modern data stack, Apache Kafka is the de facto standard operational platform, and Apache Iceberg has emerged as the champion of table formats to power analytical applications. Can we leverage the best of Iceberg and Kafka to create a solution greater than the sum of its parts?

Yes, you can, and we did!

This isn't a typical story of connectors, ELT, and separate data stores. We've developed an advanced projection of Kafka data in an Iceberg-compatible format, allowing direct access from warehouses and analytical tools.

In this talk, we'll cover:
* How we presented Kafka data to Iceberg processors without moving or transforming data upfront: no hidden ETL!
* Integrating Kafka's ecosystem into Iceberg, leveraging Schema Registry, consumer groups, and more.
* Meeting Iceberg's performance and cost-reduction expectations while sourcing data directly from Kafka.

Expect a technical deep dive into the protocols, formats, and services we used, all while staying true to our core principles:
* Kafka as the single source of truth: no separate stores.
* Analytical processors shouldn't need Kafka-specific adjustments.
* Operational performance must remain uncompromised.
* Kafka's mature ecosystem features, like ACLs and quotas, should be reused, not reinvented.

Join us for a thrilling account of the highs and lows of merging two data giants, and stay tuned for the surprise twist at the end!
Leverage the flexibility of Cloud Run and its ease of use for your Apache Kafka workloads. In this session, we’ll introduce Cloud Run worker pools, a new resource specifically designed for non-request-based workloads, like Kafka consumers. Learn how worker pools, along with a self-hosted Kafka autoscaler, can enable fast and flexible scaling of your Kafka consumers by using Kafka queue metrics.
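To make the scaling signal concrete, below is a hedged Java sketch of how total consumer-group lag can be computed with Kafka's AdminClient; this is the kind of queue metric an autoscaler could act on. The broker address and group id are illustrative, and the session's own autoscaler may derive its metrics differently.

```java
// Hedged sketch: total consumer-group lag as a scaling signal.
// Broker address and group id are illustrative assumptions.
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the (hypothetical) consumer group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("orders-workers")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag = end offset minus committed offset, summed over partitions.
            long lag = committed.entrySet().stream()
                .mapToLong(e -> latest.get(e.getKey()).offset() - e.getValue().offset())
                .sum();
            System.out.println("total lag = " + lag);
        }
    }
}
```

An autoscaler would then map this lag to a desired instance count, for example one worker per fixed slice of outstanding records.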
Madhive built their ad analytics and bidding infrastructure using databases and batch pipelines. When the pipeline lag got too long to bid effectively, they rebuilt from scratch with Google Cloud’s Managed Service for Apache Kafka. Join this session to learn about Madhive’s journey and dive deep into how the service works, how it can help you build streaming systems quickly and securely, and what migration looks like. This session is relevant for Kafka administrators and architects building event-sourcing platforms or event-driven systems.
Google's Data Cloud is a unified platform for the entire data lifecycle, from streaming with Managed Kafka, to ML feature creation in BigQuery, to global deployment via Bigtable. In this talk, we’ll give you a behind the scenes look at how Spotify's recommendation engine team uses Google's Data Cloud for their feature pipelines. Plus, we will demonstrate BigQuery AI Query Engine and how it streamlines feature development and testing. Finally, we'll explore new Bigtable capabilities that simplify application deployment and monitoring.
Redpanda, a leading Kafka API-compatible streaming platform, now supports storing topics in Apache Iceberg, seamlessly fusing low-latency streaming with data lakehouses using BigQuery and BigLake on GCP. Iceberg Topics eliminate complex and inefficient ETL between streams and tables, making real-time data instantly accessible for analysis in BigQuery. This push-button integration eliminates the need for costly connectors or custom pipelines, enabling both simple and sophisticated SQL queries across streams and other datasets. By combining Redpanda and Iceberg, GCP customers gain a secure, scalable, and cost-effective solution that improves their agility while reducing infrastructure and human capital costs.
Kir Titievsky, Product Manager at Google Cloud with extensive experience in streaming and storage infrastructure, joined Yuliia and Dumky to talk about streaming. Drawing from his work with Apache Kafka, Cloud Pub/Sub, Dataflow, and Cloud Storage since 2015, Kir explains the fundamental differences between streaming and micro-batch processing. He challenges common misconceptions about streaming costs, explaining how streaming can be significantly less expensive than batch processing for many use cases. Kir shares insights on the "service bus architecture" revival, discussing how modern distributed messaging systems have solved historic bottlenecks while creating new opportunities for business and performance needs.
Kir's Medium - https://medium.com/@kir-gcp
Kir's LinkedIn page - https://www.linkedin.com/in/kir-titievsky-%F0%9F%87%BA%F0%9F%87%A6-7775052/
In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.
About the speaker: Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted. As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.
0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like dbt in the ecosystem
🔗 CONNECT WITH ADRIAN BRUDARU
LinkedIn - /data-team
Website - https://adrian.brudaru.com/
🔗 CONNECT WITH DataTalksClub
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
LinkedIn - /datatalks-club
Twitter - /datatalksclub
Website - https://datatalks.club/
🌟 Session Overview 🌟
Session Name: Supernovas, Black Holes, and Streaming Data: A Journey in Space with Apache Kafka data streams from NASA
Speaker: Frank Munz
Session Description: In this fun, hands-on, and in-depth how-to, we explore NASA's GCN project, which publishes various events in space as Kafka topics.
The focus of my talk is on end-to-end data engineering, from consuming the data and ELT-ing the stream, to using generative AI tools for analytics.
We will analyze GCN data in real time, specifically targeting the data stream from exploding supernovas. This data triggers dozens of terrestrial telescopes to potentially reposition and point toward the event.
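As a taste of what consuming these streams looks like, here is a minimal Java consumer sketch for a GCN-style notice stream. Real access to kafka.gcn.nasa.gov requires OIDC credentials from gcn.nasa.gov (the security configuration is omitted here), and the topic name is illustrative rather than necessarily the one used in the talk.

```java
// Minimal consumer sketch for a GCN-style notice stream.
// SASL/OIDC security settings required by GCN are omitted;
// the topic name is an illustrative assumption.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GcnListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.gcn.nasa.gov:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "supernova-watch");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("gcn.classic.text.ICECUBE_ASTROTRACK_GOLD"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println(rec.value()); // raw notice payload
                }
            }
        }
    }
}
```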
The speaker will kick off the session by contrasting various ways of ingesting and transforming the data, discussing their trade-offs: Should you use a declarative data pipeline, or can a data analyst manage with SQL only? Alternatively, when would it be better to follow the classic approach of orchestrating Spark notebooks to get the data ingested?
He will answer the question: Does a data engineer working with streaming data benefit from generative AI-based tools and assistants today? Is it worth it, or is it just hype?
The demo is easy to replicate at home, and Frank will share the notebooks in a GitHub repository so you can analyze real NASA data yourself!
This session is ideal for data engineers, data architects who enjoy some coding, generative AI enthusiasts, or anyone fascinated by technology and the sparkling stars in the night sky.
While the focus is clearly on tech, the demo will run on the open-source and open-standards-based Databricks Intelligence Platform (so inevitably, you'll get a high-level overview here too).
🚀 About Big Data and RPA 2024 🚀
Unlock the future of innovation and automation at Big Data & RPA Conference Europe 2024! 🌟 This unique event brings together the brightest minds in big data, machine learning, AI, and robotic process automation to explore cutting-edge solutions and trends shaping the tech landscape. Perfect for data engineers, analysts, RPA developers, and business leaders, the conference offers dual insights into the power of data-driven strategies and intelligent automation. 🚀 Gain practical knowledge on topics like hyperautomation, AI integration, advanced analytics, and workflow optimization while networking with global experts. Don’t miss this exclusive opportunity to expand your expertise and revolutionize your processes—all from the comfort of your home! 📊🤖✨
📅 Yearly Conferences: Curious about the evolution of big data and RPA? Check out our archive of past Big Data & RPA sessions. Watch the strategies and technologies evolve in our videos! 🚀 🔗 Find Other Years' Videos: 2023 Big Data Conference Europe https://www.youtube.com/playlist?list=PLqYhGsQ9iSEpb_oyAsg67PhpbrkCC59_g 2022 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEryAOjmvdiaXTfjCg5j3HhT 2021 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEqHwbQoWEXEJALFLKVDRXiP
💡 Stay Connected & Updated 💡
Don’t miss out on any updates or upcoming event information from Big Data & RPA Conference Europe. Follow us on our social media channels and visit our website to stay in the loop!
🌐 Website: https://bigdataconference.eu/, https://rpaconference.eu/ 👤 Facebook: https://www.facebook.com/bigdataconf, https://www.facebook.com/rpaeurope/ 🐦 Twitter: @BigDataConfEU, @europe_rpa 🔗 LinkedIn: https://www.linkedin.com/company/73234449/admin/dashboard/, https://www.linkedin.com/company/75464753/admin/dashboard/ 🎥 YouTube: http://www.youtube.com/@DATAMINERLT
🌟 Session Overview 🌟
Session Name: Apache Pulsar: Finally an Alternative to Kafka?
Speaker: Julien Jakubowski
Session Description: Today, when you think about building event-driven and real-time applications, the names that likely come to mind are RabbitMQ, ActiveMQ, or Kafka. These solutions dominate the landscape. But have you ever heard of Apache Pulsar?
After a brief presentation of the fundamental concepts of messaging, you'll discover the Apache Pulsar features that enable you to build amazing event-driven applications. You'll learn the following:
* How Apache Pulsar's architecture differs from other brokers.
* How it enables scaling processing power and data independently, quickly, and with no hassle.
* How it guarantees high durability of messages across nodes and different data centers.
* How it covers the use cases of both RabbitMQ and Kafka with a single broker (see the sketch after this list).
* How to integrate Pulsar with your existing application portfolio.
* And more.
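To make the queueing-versus-streaming point concrete, here is an illustrative Java sketch (service URL and topic are assumptions) showing how one Pulsar broker can serve both a work-queue consumer and an ordered, Kafka-style consumer simply by choosing the subscription type:

```java
// Illustrative sketch: one Pulsar broker covering both queueing
// (Shared subscription, RabbitMQ-style) and streaming (Failover
// subscription, Kafka-style). URL and topic are assumptions.
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionType;

public class PulsarModes {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
            .serviceUrl("pulsar://localhost:6650")
            .build();

        // Shared: messages load-balanced across consumers, like a work queue.
        Consumer<byte[]> queueConsumer = client.newConsumer()
            .topic("orders")
            .subscriptionName("workers")
            .subscriptionType(SubscriptionType.Shared)
            .subscribe();

        // Failover: strictly ordered consumption, like a Kafka consumer group.
        Consumer<byte[]> streamConsumer = client.newConsumer()
            .topic("orders")
            .subscriptionName("analytics")
            .subscriptionType(SubscriptionType.Failover)
            .subscribe();

        Message<byte[]> msg = queueConsumer.receive();
        queueConsumer.acknowledge(msg);
        streamConsumer.close();
        client.close();
    }
}
```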
🌟 Session Overview 🌟
Session Name: Sentiment Analysis in Action: Building Your Real-time Pipeline
Speaker: Olena Kutsenko
Session Description: Monitoring and interpreting the sentiment of data records is important for a variety of use cases. However, traditional human-based methods fall short in handling huge volumes of information with the required speed and efficiency. AI can address this challenge.
AI is only part of the solution. We need to build a data pipeline that ingests data from various channels, processes it using AI-driven sentiment analysis models to classify the sentiment of each individual record, and prepares it to be consumed by applications for aggregation and analysis.
In this session, we'll build a system using open-source technologies Apache Kafka and Apache Flink with AI models to obtain real-time sentiment from social media data. Apache Kafka's scalability ensures that no record is left behind, making it a reliable foundation for sentiment analysis. Apache Flink, with its adaptability to fluctuations in data volume and velocity, will enable the analysis of a continuous data stream using an AI model.
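The following Java sketch shows the shape of such a pipeline under stated assumptions: records are consumed from a social-media topic, scored, and published downstream. The scoreSentiment() method is a placeholder for a real AI model call, and in the session this step runs inside Apache Flink rather than a plain consumer loop; all names are illustrative.

```java
// Hedged sketch of the pipeline's shape: consume posts from Kafka,
// score each record, publish the labeled result downstream.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class SentimentLoop {
    // Placeholder: a real pipeline would invoke a sentiment model here.
    static String scoreSentiment(String text) {
        return text.toLowerCase().contains("great") ? "POSITIVE" : "NEGATIVE";
    }

    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "sentiment-scorers");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(List.of("social-posts"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    String label = scoreSentiment(rec.value());
                    producer.send(new ProducerRecord<>("scored-posts", rec.key(), label));
                }
            }
        }
    }
}
```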
Enterprises use Apache Kafka and Apache Flink for an increasing number of mission-critical use cases, real-time analytics, application messaging, and machine learning. As this usage grows in size and scale, so does the criticality, scale, and cost of managing the Kafka and Flink clusters. Learn how AWS customers can achieve the same or higher availability and durability of their growing clusters, both at lower unit costs and with operational simplicity, with Amazon MSK and Amazon Managed Service for Apache Flink.
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. In this episode, we're joined by a special guest: Alex Gallego, founder and CEO of Redpanda. Together, we dive deep into building data-intensive applications, the evolution of streaming technologies, and balancing high-throughput and low-latency demands. Key topics covered:
* What is Redpanda and why it matters: Redpanda's mission to redefine data streaming while being the fastest Kafka-compatible option on the market.
* Batch vs. streaming data: an accessible guide to the classic debate and how the tech landscape is shifting towards unified data frameworks.
* Scaling at speed: the challenges and innovations driving Redpanda's performance optimizations, from zero-copy architecture to storage engines.
* AI, ML, and streaming data integration: how Redpanda empowers real-time machine learning and AI-powered workloads with ease.
* Open source vs. enterprise models: navigating licensing challenges and balancing business goals in the hybrid cloud era.
* Leadership and career shifts: Alex's reflections on moving from technical lead to CEO, blending engineering know-how with company vision.
Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake, including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale. This book helps you:
* Understand key data reliability challenges and how Delta Lake solves them
* Explain the critical role of Delta transaction logs as a single source of truth
* Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
* Architect data lakehouses with the medallion architecture
* Optimize Delta Lake performance with features like deletion vectors and liquid clustering
In a world increasingly reliant on real-time data processing, we require efficient access to live, distributed states. This talk delves into Interactive Queries v2 in Kafka Streams — a powerful feature that allows direct, real-time access to the state of distributed Kafka Streams applications.
We will set the context of stream processing and explain why accessing the state of distributed applications is crucial for modern data-driven architectures. Through this lens, you will see how Interactive Queries bridge the gap between real-time analytics state and external consumption, enabling you to unlock new possibilities for monitoring, data exploration, and responsive application behavior.
Building on this foundation, we will discuss a running example illustrating how Interactive Queries can be leveraged to create a dynamic and responsive application. We will guide you through the different types of state stores and how to query them.
Attendees will leave with a solid understanding of how Interactive Queries work, the types of problems they solve, and how they can be applied effectively to enhance the value of Kafka-based streaming applications.
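For orientation, here is a minimal sketch of what an Interactive Queries v2 lookup can look like, assuming a running Kafka Streams application with a materialized KeyValue store; the store and key names are illustrative.

```java
// Minimal IQv2 sketch: query the live state of a running Kafka Streams
// app with a KeyQuery. Assumes a store materialized as "counts-store".
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.query.KeyQuery;
import org.apache.kafka.streams.query.StateQueryRequest;
import org.apache.kafka.streams.query.StateQueryResult;

public class CountLookup {
    static Long latestCount(KafkaStreams streams, String key) {
        KeyQuery<String, Long> query = KeyQuery.withKey(key);
        StateQueryRequest<Long> request =
            StateQueryRequest.inStore("counts-store").withQuery(query);
        StateQueryResult<Long> result = streams.query(request);
        // Partitions answer independently; for a key query, exactly one
        // partition holds the key.
        return result.getOnlyPartitionResult().getResult();
    }
}
```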
Dive into the world of real-time data streaming with this introduction to Apache Kafka. This talk is tailored for developers, data engineers, and IT professionals who want to gain a foundational understanding of Kafka, a powerful open-source platform used for building scalable, event-driven applications. You will learn about the following (a minimal code sketch follows the list):
* Kafka fundamentals: the core concepts of Kafka, including topics, partitions, producers, and consumers
* The Kafka ecosystem: brokers, clients, Schema Registry, and Kafka Connect
* Stream processing: Kafka Streams and Apache Flink
* Use cases: discover how data streaming with Kafka has transformed various industries
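To ground the fundamentals above, here is a minimal, self-contained Java sketch of a producer writing to a topic and a consumer group reading it back; the broker address and topic name are assumptions.

```java
// Minimal sketch of the core concepts: a producer writes records to a
// topic's partitions, and a consumer group reads them back.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class HelloKafka {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // The key determines the partition; same key means same partition,
            // which preserves per-key ordering.
            producer.send(new ProducerRecord<>("greetings", "user-1", "hello, kafka"));
        }

        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "greeters");
        c.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("greetings"));
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5)))
                System.out.printf("%s -> %s%n", rec.key(), rec.value());
        }
    }
}
```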
This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and cuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability.
The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.
What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool; it is a career catalyst and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.
What You Will Learn:
* Elevate your data-wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and cuDF at unprecedented speeds
* Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines, and master the art of workflow orchestration to streamline your engineering projects
* Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure
Who This Book Is For: Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists
This talk walks through the process of creating real-time data pipelines using Flink. It introduces how to connect Flink with various data sources (like Kafka or relational databases), focusing on transforming and enriching data streams. This talk is useful for understanding how Flink integrates with other components in a typical data processing pipeline.
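As a rough illustration of that wiring, the sketch below connects Flink to Kafka with the KafkaSource connector and applies a trivial transformation standing in for real enrichment logic; broker, topic, and group names are assumptions.

```java
// Hedged sketch: Flink reading from Kafka via the KafkaSource connector.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToFlink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setTopics("events")
            .setGroupId("flink-pipeline")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
           .map(String::toUpperCase)   // stand-in for real enrichment logic
           .print();

        env.execute("kafka-to-flink-sketch");
    }
}
```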
Stream Processing has evolved quickly in a short time: only a few years ago, stream processing was mostly simple real-time aggregations with limited throughput and consistency. Today, many stream processing applications have sophisticated business logic, strict correctness guarantees, high performance, low latency, and maintain terabytes of state without databases. Stream processing frameworks also abstract a lot of the low-level details away, such as routing the data streams, taking care of concurrent executions, and handling various failure scenarios while ensuring correctness.
This talk will give an introduction to Apache Flink, one of the most advanced open source stream processors, which powers applications at Netflix, Uber, and Alibaba, among others. In particular, we will go through the use cases that Flink was designed for, explain concepts like stateful and event-time stream processing, and discuss Flink's APIs and ecosystem.
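To illustrate the stateful, event-time concepts mentioned above, here is a hedged Flink sketch that counts events per key in one-minute event-time windows; the inline source and tuple layout are assumptions made for a self-contained example.

```java
// Illustrative sketch of stateful, event-time processing in Flink:
// count events per key in one-minute event-time windows.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source of (key, event timestamp) pairs; in practice
        // this stream would come from Kafka.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
            Tuple2.of("sensor-a", 1_000L),
            Tuple2.of("sensor-a", 30_000L),
            Tuple2.of("sensor-b", 45_000L));

        events
            // Event time comes from the record itself, not the wall clock.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                    .withTimestampAssigner((e, ts) -> e.f1))
            .map(e -> Tuple2.of(e.f0, 1L))      // one unit per event
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(e -> e.f0)                   // partition state by key
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum(1)                             // count per key per window
            .print();

        env.execute("event-time-counts");
    }
}
```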
During his talk, Alex will discuss the value of moving from a batch-based architecture to a real-time, event-driven architecture with Apache Kafka®. He will then explain how you can build valuable data products with Confluent that unlock high-volume performance, data governance, and AI use cases.