An organisation's data has traditionally been split between the operational estate, for daily business operations, and the analytical estate, for after-the-fact analysis and reporting. The journey from one side to the other is today a long and tortuous one. But does it have to be?

In the modern data stack, Apache Kafka is the de facto standard operational platform, and Apache Iceberg has emerged as the champion of table formats to power analytical applications. Can we leverage the best of Iceberg and Kafka to create a solution greater than the sum of its parts?

Yes, you can, and we did!

This isn't a typical story of connectors, ELT, and separate data stores. We've developed an advanced projection of Kafka data in an Iceberg-compatible format, allowing direct access from warehouses and analytical tools.

In this talk, we'll cover:
* How we presented Kafka data to Iceberg processors without moving or transforming data upfront: no hidden ETL!
* Integrating Kafka's ecosystem into Iceberg, leveraging Schema Registry, consumer groups, and more.
* Meeting Iceberg's performance and cost-reduction expectations while sourcing data directly from Kafka.

Expect a technical deep dive into the protocols, formats, and services we used, all while staying true to our core principles:
* Kafka as the single source of truth: no separate stores.
* Analytical processors shouldn't need Kafka-specific adjustments.
* Operational performance must remain uncompromised.
* Kafka's mature ecosystem features, like ACLs and quotas, should be reused, not reinvented.

Join us for a thrilling account of the highs and lows of merging two data giants, and stay tuned for the surprise twist at the end!
Leverage the flexibility of Cloud Run and its ease of use for your Apache Kafka workloads. In this session, we’ll introduce Cloud Run worker pools, a new resource specifically designed for non-request-based workloads, like Kafka consumers. Learn how worker pools, along with a self-hosted Kafka autoscaler, can enable fast and flexible scaling of your Kafka consumers by using Kafka queue metrics.
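To make the scaling signal concrete, below is a hedged Java sketch of how total consumer-group lag can be computed with Kafka's AdminClient; this is the kind of queue metric an autoscaler could act on. The broker address and group id are illustrative, and the session's own autoscaler may derive its metrics differently.

```java
// Hedged sketch: total consumer-group lag as a scaling signal.
// Broker address and group id are illustrative assumptions.
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the (hypothetical) consumer group.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("orders-workers")
                     .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag = end offset minus committed offset, summed over partitions.
            long lag = committed.entrySet().stream()
                .mapToLong(e -> latest.get(e.getKey()).offset() - e.getValue().offset())
                .sum();
            System.out.println("total lag = " + lag);
        }
    }
}
```

An autoscaler would then map this lag to a desired instance count, for example one worker per fixed slice of outstanding records.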
Madhive built their ad analytics and bidding infrastructure using databases and batch pipelines. When the pipeline lag got too long to bid effectively, they rebuilt from scratch with Google Cloud’s Managed Service for Apache Kafka. Join this session to learn about Madhive’s journey and dive deep into how the service works, how it can help you build streaming systems quickly and securely, and what migration looks like. This session is relevant for Kafka administrators and architects building event-sourcing platforms or event-driven systems.
Google's Data Cloud is a unified platform for the entire data lifecycle, from streaming with Managed Kafka, to ML feature creation in BigQuery, to global deployment via Bigtable. In this talk, we’ll give you a behind the scenes look at how Spotify's recommendation engine team uses Google's Data Cloud for their feature pipelines. Plus, we will demonstrate BigQuery AI Query Engine and how it streamlines feature development and testing. Finally, we'll explore new Bigtable capabilities that simplify application deployment and monitoring.
Redpanda, a leading Kafka API-compatible streaming platform, now supports storing topics in Apache Iceberg, seamlessly fusing low-latency streaming with data lakehouses using BigQuery and BigLake on GCP. Iceberg Topics eliminate complex and inefficient ETL between streams and tables, making real-time data instantly accessible for analysis in BigQuery. This push-button integration eliminates the need for costly connectors or custom pipelines, enabling both simple and sophisticated SQL queries across streams and other datasets. By combining Redpanda and Iceberg, GCP customers gain a secure, scalable, and cost-effective solution that improves their agility while reducing infrastructure and human capital costs.
Kir Titievsky, Product Manager at Google Cloud with extensive experience in streaming and storage infrastructure, joined Yuliia and Dumky to talk about streaming. Drawing from his work with Apache Kafka, Cloud Pub/Sub, Dataflow, and Cloud Storage since 2015, Kir explains the fundamental differences between streaming and micro-batch processing. He challenges common misconceptions about streaming costs, explaining how streaming can be significantly less expensive than batch processing for many use cases. Kir shares insights on the "service bus architecture" revival, discussing how modern distributed messaging systems have solved historic bottlenecks while creating new opportunities for business and performance needs.
Kir's Medium - https://medium.com/@kir-gcp
Kir's LinkedIn page - https://www.linkedin.com/in/kir-titievsky-%F0%9F%87%BA%F0%9F%87%A6-7775052/
In this podcast episode, we talked with Adrian Brudaru about the past, present and future of data engineering.
About the speaker: Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted. As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.
0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like dbt in the ecosystem
🔗 CONNECT WITH ADRIAN BRUDARU
LinkedIn - /data-team
Website - https://adrian.brudaru.com/
🔗 CONNECT WITH DataTalksClub
Join the community - https://datatalks.club/slack.html
Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/...
Check other upcoming events - https://lu.ma/dtc-events
LinkedIn - /datatalks-club
Twitter - /datatalksclub
Website - https://datatalks.club/
🌟 Session Overview 🌟
Session Name: Supernovas, Black Holes, and Streaming Data: A Journey in Space with Apache Kafka data streams from NASA
Speaker: Frank Munz
Session Description: In this fun, hands-on, and in-depth how-to, we explore NASA's GCN project, which publishes various events in space as Kafka topics.
The focus of my talk is on end-to-end data engineering, from consuming the data and ELT-ing the stream, to using generative AI tools for analytics.
We will analyze GCN data in real time, specifically targeting the data stream from exploding supernovas. This data triggers dozens of terrestrial telescopes to potentially reposition and point toward the event.
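As a taste of what consuming these streams looks like, here is a minimal Java consumer sketch for a GCN-style notice stream. Real access to kafka.gcn.nasa.gov requires OIDC credentials from gcn.nasa.gov (the security configuration is omitted here), and the topic name is illustrative rather than necessarily the one used in the talk.

```java
// Minimal consumer sketch for a GCN-style notice stream.
// SASL/OIDC security settings required by GCN are omitted;
// the topic name is an illustrative assumption.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GcnListener {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.gcn.nasa.gov:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "supernova-watch");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("gcn.classic.text.ICECUBE_ASTROTRACK_GOLD"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println(rec.value()); // raw notice payload
                }
            }
        }
    }
}
```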
The speaker will kick off the session by contrasting various ways of ingesting and transforming the data, discussing their trade-offs: Should you use a declarative data pipeline, or can a data analyst manage with SQL only? Alternatively, when would it be better to follow the classic approach of orchestrating Spark notebooks to get the data ingested?
He will answer the question: Does a data engineer working with streaming data benefit from generative AI-based tools and assistants today? Is it worth it, or is it just hype?
The demo is easy to replicate at home, and Frank will share the notebooks in a GitHub repository so you can analyze real NASA data yourself!
This session is ideal for data engineers, data architects who enjoy some coding, generative AI enthusiasts, or anyone fascinated by technology and the sparkling stars in the night sky.
While the focus is clearly on tech, the demo will run on the open-source and open-standards-based Databricks Intelligence Platform (so inevitably, you'll get a high-level overview here too).
🚀 About Big Data and RPA 2024 🚀
Unlock the future of innovation and automation at Big Data & RPA Conference Europe 2024! 🌟 This unique event brings together the brightest minds in big data, machine learning, AI, and robotic process automation to explore cutting-edge solutions and trends shaping the tech landscape. Perfect for data engineers, analysts, RPA developers, and business leaders, the conference offers dual insights into the power of data-driven strategies and intelligent automation. 🚀 Gain practical knowledge on topics like hyperautomation, AI integration, advanced analytics, and workflow optimization while networking with global experts. Don’t miss this exclusive opportunity to expand your expertise and revolutionize your processes—all from the comfort of your home! 📊🤖✨
📅 Yearly Conferences: Curious about the evolution of big data and RPA? Check out our archive of past Big Data & RPA sessions. Watch the strategies and technologies evolve in our videos! 🚀 🔗 Find Other Years' Videos: 2023 Big Data Conference Europe https://www.youtube.com/playlist?list=PLqYhGsQ9iSEpb_oyAsg67PhpbrkCC59_g 2022 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEryAOjmvdiaXTfjCg5j3HhT 2021 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEqHwbQoWEXEJALFLKVDRXiP
💡 Stay Connected & Updated 💡
Don’t miss out on any updates or upcoming event information from Big Data & RPA Conference Europe. Follow us on our social media channels and visit our website to stay in the loop!
🌐 Website: https://bigdataconference.eu/, https://rpaconference.eu/ 👤 Facebook: https://www.facebook.com/bigdataconf, https://www.facebook.com/rpaeurope/ 🐦 Twitter: @BigDataConfEU, @europe_rpa 🔗 LinkedIn: https://www.linkedin.com/company/73234449/admin/dashboard/, https://www.linkedin.com/company/75464753/admin/dashboard/ 🎥 YouTube: http://www.youtube.com/@DATAMINERLT
🌟 Session Overview 🌟
Session Name: Apache Pulsar: Finally an Alternative to Kafka?
Speaker: Julien Jakubowski
Session Description: Today, when you think about building event-driven and real-time applications, the names that likely come to mind are RabbitMQ, ActiveMQ, or Kafka. These solutions dominate the landscape. But have you ever heard of Apache Pulsar?
After a brief presentation of the fundamental concepts of messaging, you'll discover the Apache Pulsar features that enable you to build amazing event-driven applications. You'll learn the following:
* How Apache Pulsar's architecture differs from other brokers.
* How it enables scaling processing power and data independently, quickly, and with no hassle.
* How it guarantees high durability of messages across nodes and different data centers.
* How it covers the use cases of both RabbitMQ and Kafka with a single broker (see the sketch after this list).
* How to integrate Pulsar with your existing application portfolio.
* And more.
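To make the queueing-versus-streaming point concrete, here is an illustrative Java sketch (service URL and topic are assumptions) showing how one Pulsar broker can serve both a work-queue consumer and an ordered, Kafka-style consumer simply by choosing the subscription type:

```java
// Illustrative sketch: one Pulsar broker covering both queueing
// (Shared subscription, RabbitMQ-style) and streaming (Failover
// subscription, Kafka-style). URL and topic are assumptions.
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionType;

public class PulsarModes {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
            .serviceUrl("pulsar://localhost:6650")
            .build();

        // Shared: messages load-balanced across consumers, like a work queue.
        Consumer<byte[]> queueConsumer = client.newConsumer()
            .topic("orders")
            .subscriptionName("workers")
            .subscriptionType(SubscriptionType.Shared)
            .subscribe();

        // Failover: strictly ordered consumption, like a Kafka consumer group.
        Consumer<byte[]> streamConsumer = client.newConsumer()
            .topic("orders")
            .subscriptionName("analytics")
            .subscriptionType(SubscriptionType.Failover)
            .subscribe();

        Message<byte[]> msg = queueConsumer.receive();
        queueConsumer.acknowledge(msg);
        streamConsumer.close();
        client.close();
    }
}
```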
🌟 Session Overview 🌟
Session Name: Sentiment Analysis in Action: Building Your Real-time Pipeline
Speaker: Olena Kutsenko
Session Description: Monitoring and interpreting the sentiment of data records is important for a variety of use cases. However, traditional human-based methods fall short in handling huge volumes of information with the required speed and efficiency. AI can address this challenge.
AI is only part of the solution. We need to build a data pipeline that ingests data from various channels, processes it using AI-driven sentiment analysis models to classify the sentiment of each individual record, and prepares it to be consumed by applications for aggregation and analysis.
In this session, we'll build a system using open-source technologies Apache Kafka and Apache Flink with AI models to obtain real-time sentiment from social media data. Apache Kafka's scalability ensures that no record is left behind, making it a reliable foundation for sentiment analysis. Apache Flink, with its adaptability to fluctuations in data volume and velocity, will enable the analysis of a continuous data stream using an AI model.
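The following Java sketch shows the shape of such a pipeline under stated assumptions: records are consumed from a social-media topic, scored, and published downstream. The scoreSentiment() method is a placeholder for a real AI model call, and in the session this step runs inside Apache Flink rather than a plain consumer loop; all names are illustrative.

```java
// Hedged sketch of the pipeline's shape: consume posts from Kafka,
// score each record, publish the labeled result downstream.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class SentimentLoop {
    // Placeholder: a real pipeline would invoke a sentiment model here.
    static String scoreSentiment(String text) {
        return text.toLowerCase().contains("great") ? "POSITIVE" : "NEGATIVE";
    }

    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "sentiment-scorers");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(List.of("social-posts"));
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    String label = scoreSentiment(rec.value());
                    producer.send(new ProducerRecord<>("scored-posts", rec.key(), label));
                }
            }
        }
    }
}
```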
Enterprises use Apache Kafka and Apache Flink for an increasing number of mission-critical use cases, real-time analytics, application messaging, and machine learning. As this usage grows in size and scale, so does the criticality, scale, and cost of managing the Kafka and Flink clusters. Learn how AWS customers can achieve the same or higher availability and durability of their growing clusters, both at lower unit costs and with operational simplicity, with Amazon MSK and Amazon Managed Service for Apache Flink.
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society. In this episode, we're joined by a special guest: Alex Gallego, founder and CEO of Redpanda. Together, we dive deep into building data-intensive applications, the evolution of streaming technologies, and balancing high-throughput and low-latency demands. Key topics covered:
* What is Redpanda and why it matters: Redpanda's mission to redefine data streaming while being the fastest Kafka-compatible option on the market.
* Batch vs. streaming data: an accessible guide to the classic debate and how the tech landscape is shifting towards unified data frameworks.
* Scaling at speed: the challenges and innovations driving Redpanda's performance optimizations, from zero-copy architecture to storage engines.
* AI, ML, and streaming data integration: how Redpanda empowers real-time machine learning and AI-powered workloads with ease.
* Open source vs. enterprise models: navigating licensing challenges and balancing business goals in the hybrid cloud era.
* Leadership and career shifts: Alex's reflections on moving from technical lead to CEO, blending engineering know-how with company vision.
Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake, including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale. This book helps you:
* Understand key data reliability challenges and how Delta Lake solves them
* Explain the critical role of Delta transaction logs as a single source of truth
* Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
* Architect data lakehouses with the medallion architecture
* Optimize Delta Lake performance with features like deletion vectors and liquid clustering
In a world increasingly reliant on real-time data processing, we require efficient access to live, distributed states. This talk delves into Interactive Queries v2 in Kafka Streams — a powerful feature that allows direct, real-time access to the state of distributed Kafka Streams applications.
We will set the context of stream processing and explain why accessing the state of distributed applications is crucial for modern data-driven architectures. Through this lens, you will see how Interactive Queries bridge the gap between real-time analytics state and external consumption, enabling you to unlock new possibilities for monitoring, data exploration, and responsive application behavior.
Building on this foundation, we will discuss a running example illustrating how Interactive Queries can be leveraged to create a dynamic and responsive application. We will guide you through the different types of state stores and how to query them.
Attendees will leave with a solid understanding of how Interactive Queries work, the types of problems they solve, and how they can be applied effectively to enhance the value of Kafka-based streaming applications.
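For orientation, here is a minimal sketch of what an Interactive Queries v2 lookup can look like, assuming a running Kafka Streams application with a materialized KeyValue store; the store and key names are illustrative.

```java
// Minimal IQv2 sketch: query the live state of a running Kafka Streams
// app with a KeyQuery. Assumes a store materialized as "counts-store".
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.query.KeyQuery;
import org.apache.kafka.streams.query.StateQueryRequest;
import org.apache.kafka.streams.query.StateQueryResult;

public class CountLookup {
    static Long latestCount(KafkaStreams streams, String key) {
        KeyQuery<String, Long> query = KeyQuery.withKey(key);
        StateQueryRequest<Long> request =
            StateQueryRequest.inStore("counts-store").withQuery(query);
        StateQueryResult<Long> result = streams.query(request);
        // Partitions answer independently; for a key query, exactly one
        // partition holds the key.
        return result.getOnlyPartitionResult().getResult();
    }
}
```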
Dive into the world of real-time data streaming with this introduction to Apache Kafka. This talk is tailored for developers, data engineers, and IT professionals who want to gain a foundational understanding of Kafka, a powerful open-source platform used for building scalable, event-driven applications. You will learn about the following (a minimal code sketch follows the list):
* Kafka fundamentals: the core concepts of Kafka, including topics, partitions, producers, and consumers
* The Kafka ecosystem: brokers, clients, Schema Registry, and Kafka Connect
* Stream processing: Kafka Streams and Apache Flink
* Use cases: discover how data streaming with Kafka has transformed various industries
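To ground the fundamentals above, here is a minimal, self-contained Java sketch of a producer writing to a topic and a consumer group reading it back; the broker address and topic name are assumptions.

```java
// Minimal sketch of the core concepts: a producer writes records to a
// topic's partitions, and a consumer group reads them back.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class HelloKafka {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // The key determines the partition; same key means same partition,
            // which preserves per-key ordering.
            producer.send(new ProducerRecord<>("greetings", "user-1", "hello, kafka"));
        }

        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "greeters");
        c.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("greetings"));
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5)))
                System.out.printf("%s -> %s%n", rec.key(), rec.value());
        }
    }
}
```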
This book covers modern data engineering functions and important Python libraries, to help you develop state-of-the-art ML pipelines and integration code. The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and cuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability.
The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.
What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool; it is a career catalyst and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.
What You Will Learn:
* Elevate your data-wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and cuDF at unprecedented speeds
* Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines, and master the art of workflow orchestration to streamline your engineering projects
* Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure
Who This Book Is For: Data analysts, data engineers, data scientists, machine learning engineers, and MLOps specialists
This talk walks through the process of creating real-time data pipelines using Flink. It introduces how to connect Flink with various data sources (like Kafka or relational databases), focusing on transforming and enriching data streams. This talk is useful for understanding how Flink integrates with other components in a typical data processing pipeline.
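As a rough illustration of that wiring, the sketch below connects Flink to Kafka with the KafkaSource connector and applies a trivial transformation standing in for real enrichment logic; broker, topic, and group names are assumptions.

```java
// Hedged sketch: Flink reading from Kafka via the KafkaSource connector.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToFlink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setTopics("events")
            .setGroupId("flink-pipeline")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
           .map(String::toUpperCase)   // stand-in for real enrichment logic
           .print();

        env.execute("kafka-to-flink-sketch");
    }
}
```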
Stream Processing has evolved quickly in a short time: only a few years ago, stream processing was mostly simple real-time aggregations with limited throughput and consistency. Today, many stream processing applications have sophisticated business logic, strict correctness guarantees, high performance, low latency, and maintain terabytes of state without databases. Stream processing frameworks also abstract a lot of the low-level details away, such as routing the data streams, taking care of concurrent executions, and handling various failure scenarios while ensuring correctness.
This talk will give an introduction to Apache Flink, one of the most advanced open source stream processors, which powers applications at Netflix, Uber, and Alibaba, among others. In particular, we will go through the use cases that Flink was designed for, explain concepts like stateful and event-time stream processing, and discuss Flink's APIs and ecosystem.
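To illustrate the stateful, event-time concepts mentioned above, here is a hedged Flink sketch that counts events per key in one-minute event-time windows; the inline source and tuple layout are assumptions made for a self-contained example.

```java
// Illustrative sketch of stateful, event-time processing in Flink:
// count events per key in one-minute event-time windows.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeCounts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source of (key, event timestamp) pairs; in practice
        // this stream would come from Kafka.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
            Tuple2.of("sensor-a", 1_000L),
            Tuple2.of("sensor-a", 30_000L),
            Tuple2.of("sensor-b", 45_000L));

        events
            // Event time comes from the record itself, not the wall clock.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
                    .withTimestampAssigner((e, ts) -> e.f1))
            .map(e -> Tuple2.of(e.f0, 1L))      // one unit per event
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(e -> e.f0)                   // partition state by key
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum(1)                             // count per key per window
            .print();

        env.execute("event-time-counts");
    }
}
```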
During his talk, Alex will discuss the value of moving from a batch-based architecture to a real-time, event-driven architecture with Apache Kafka®. He will then explain how you can build valuable data products with Confluent that unlock high-volume performance, data governance, and AI use cases.