In this course, you’ll learn how to apply patterns to securely store and delete personal information for data governance and compliance on the Data Intelligence Platform. We’ll cover topics like storing sensitive data appropriately to simplify granting access and processing deletes, processing deletes to ensure compliance with the right to be forgotten, performing data masking, and configuring fine-grained access control to grant appropriate privileges to sensitive data. Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.), and beginner experience with Lakeflow Declarative Pipelines and streaming workloads. Labs: Yes. Certification Path: Databricks Certified Data Engineer Professional
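To make the masking and fine-grained access control patterns concrete, here is a minimal sketch using Unity Catalog column masks, row filters, and grants, run from a Databricks notebook via spark.sql. The catalog, schema, table, column, and group names are hypothetical.

```python
# Minimal sketch (hypothetical catalog/table/group names) of column masking,
# row filtering, and grants with Unity Catalog, run from a Databricks notebook.

# Mask the email column for anyone outside a privileged group.
spark.sql("""
CREATE OR REPLACE FUNCTION main.gov.mask_email(email STRING)
RETURN CASE WHEN is_account_group_member('pii_readers') THEN email
            ELSE '***REDACTED***' END
""")
spark.sql("ALTER TABLE main.gov.customers ALTER COLUMN email SET MASK main.gov.mask_email")

# Restrict which rows non-privileged users can see.
spark.sql("""
CREATE OR REPLACE FUNCTION main.gov.region_filter(region STRING)
RETURN IF(is_account_group_member('admins'), TRUE, region = 'EMEA')
""")
spark.sql("ALTER TABLE main.gov.customers SET ROW FILTER main.gov.region_filter ON (region)")

# Grant only the privileges consumers actually need on the governed table.
spark.sql("GRANT SELECT ON TABLE main.gov.customers TO `analysts`")
```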
In this course, you’ll learn how to optimize workloads and physical layout with Spark and Delta Lake and analyze the Spark UI to assess performance and debug applications. We’ll cover topics like streaming, liquid clustering, data skipping, caching, Photon, and more. Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.). Labs: Yes. Certification Path: Databricks Certified Data Engineer Professional
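As a rough illustration of the layout techniques the course names, here is a sketch of liquid clustering, OPTIMIZE, and caching from a Databricks notebook. Table and column names are assumptions, not part of the course material.

```python
# Minimal sketch (hypothetical table/columns) of liquid clustering, file
# compaction, and caching, run from a Databricks notebook via spark.sql.

# Create a Delta table with liquid clustering on the columns most often filtered.
spark.sql("""
CREATE TABLE IF NOT EXISTS main.perf.events (
  event_id BIGINT, event_date DATE, user_id BIGINT, payload STRING
) CLUSTER BY (event_date, user_id)
""")

# Incrementally (re)cluster and compact the files written so far.
spark.sql("OPTIMIZE main.perf.events")

# Inspect table layout details (file counts, size) that affect skipping and clustering.
spark.sql("DESCRIBE DETAIL main.perf.events").show(truncate=False)

# Cache a hot DataFrame to avoid recomputing it across downstream queries.
hot = spark.table("main.perf.events").where("event_date >= current_date() - 7")
hot.cache()
hot.count()  # materializes the cache
```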
In this course, you’ll learn how to incrementally process data to power analytic insights with Structured Streaming and Auto Loader, and how to apply design patterns for designing workloads to perform ETL on the Data Intelligence Platform with Lakeflow Declarative Pipelines. First, we’ll cover topics including ingesting raw streaming data, enforcing data quality, implementing CDC, and exploring and tuning state information. Then, we’ll cover options to perform a streaming read on a source, requirements for end-to-end fault tolerance, options to perform a streaming write to a sink, and creating an aggregation and watermark on a streaming dataset. Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.), beginner experience with streaming workloads, and familiarity with Lakeflow Declarative Pipelines. Labs: No. Certification Path: Databricks Certified Data Engineer Professional
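For orientation, here is a minimal sketch of that streaming read / watermarked aggregation / checkpointed write pattern with Auto Loader. The paths, table name, and the assumption that the data carries an event_time timestamp column are illustrative, not from the course.

```python
# Minimal sketch (assumed paths, schema, and table name) of Structured
# Streaming with Auto Loader: incremental read, watermarked aggregation,
# and a fault-tolerant write to a Delta sink.
from pyspark.sql import functions as F

raw = (
    spark.readStream.format("cloudFiles")            # Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/demo/_schema")
    .load("/tmp/demo/landing")                       # hypothetical landing path
)

# Late-data handling: aggregate per 10-minute window, dropping events more
# than 30 minutes late (assumes an event_time timestamp column).
counts = (
    raw.withWatermark("event_time", "30 minutes")
    .groupBy(F.window("event_time", "10 minutes"), "event_type")
    .count()
)

# End-to-end fault tolerance comes from the checkpoint plus an idempotent sink.
query = (
    counts.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/demo/_checkpoint")
    .trigger(availableNow=True)                      # incremental, batch-style run
    .toTable("main.demo.event_counts")
)
```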
In this course, you’ll learn how to perform efficient data ingestion with Lakeflow Connect and manage that data. Topics include ingestion with built-in connectors for SaaS applications, databases, and file sources, as well as ingestion from cloud object storage, and batch and streaming ingestion. We'll cover the new connector components, setting up the pipeline, validating the source, and mapping to the destination for each type of connector. We'll also cover how to ingest data into Delta tables with both batch and streaming ingestion, using the UI with Auto Loader, automating ETL with Lakeflow Declarative Pipelines, or using the API. This will prepare you to deliver the high-quality, timely data required for AI-driven applications by enabling scalable, reliable, and real-time data ingestion pipelines. Whether you're supporting ML model training or powering real-time AI insights, these ingestion workflows form a critical foundation for successful AI implementation. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, groupby, join, etc.), beginner programming experience with Python (syntax, conditions, loops, functions), and beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.). Labs: No. Certification Path: Databricks Certified Data Engineer Associate
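As a sketch of the declarative, automated-ETL style of ingestion mentioned above, here is a small pipeline using the Python `dlt` module (the interface used by Delta Live Tables, the basis of Lakeflow Declarative Pipelines) with Auto Loader as the source. Paths, table names, and the quality rule are assumptions.

```python
# Minimal sketch (assumed paths/table names) of a declarative ingestion
# pipeline: Auto Loader source, an expectation for data quality, and a
# cleaned downstream table.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested incrementally with Auto Loader")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/tmp/orders/_schema")
        .load("/tmp/orders/landing")                 # hypothetical cloud-storage path
    )

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")    # enforce a quality rule
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .withColumn("ingested_at", F.current_timestamp())
    )
```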
Explore how AI-powered Generative Agents can evolve in real time using live data streams. Inspired by Stanford's 'Generative Agents' paper, this session dives into building dynamic, AI-driven worlds with Apache Kafka, Flink, and Iceberg - plus LLMs, RAG, and Python. Demos and practical examples included!
A deep dive into how Monzo reduced the effort it takes to generate point-in-time correct features for model development and productionise them with real-time streaming using our event-driven architecture.
Apache Kafka, start to finish. Apache Kafka in Action: From basics to production guides you through the concepts and skills you’ll need to deploy and administer Kafka for data pipelines, event-driven applications, and other systems that process data streams from multiple sources. Authors Anatoly Zelenin and Alexander Kropp have spent years using Kafka in real-world production environments. In this guide, they reveal their hard-won expert insights to help you avoid common Kafka pitfalls and challenges.

Inside Apache Kafka in Action you’ll discover:
- Apache Kafka from the ground up
- Achieving reliability and performance
- Troubleshooting Kafka systems
- Operations, governance, and monitoring
- Kafka use cases, patterns, and anti-patterns

Clear, concise, and practical, Apache Kafka in Action is written for IT operators, software engineers, and IT architects working with Kafka every day. Chapter by chapter, it guides you through the skills you need to deliver and maintain reliable and fault-tolerant data-driven applications.

About the Technology: Apache Kafka is the gold standard streaming data platform for real-time analytics, event sourcing, and stream processing. Acting as a central hub for distributed data, it enables seamless flow between producers and consumers via a publish-subscribe model. Kafka easily handles millions of events per second, and its rock-solid design ensures high fault tolerance and smooth scalability.

About the Book: Apache Kafka in Action is a practical guide for IT professionals who are integrating Kafka into data-intensive applications and infrastructures. The book covers everything from Kafka fundamentals to advanced operations, with interesting visuals and real-world examples. Readers will learn to set up Kafka clusters, produce and consume messages, handle real-time streaming, and integrate Kafka into enterprise systems. This easy-to-follow book emphasizes building reliable Kafka applications and taking advantage of its distributed architecture for scalability and resilience.

What's Inside:
- Master Kafka’s distributed streaming capabilities
- Implement real-time data solutions
- Integrate Kafka into enterprise environments
- Build and manage Kafka applications
- Achieve fault tolerance and scalability

About the Reader: For IT operators, software architects and developers. No experience with Kafka required.

About the Authors: Anatoly Zelenin is a Kafka expert known for workshops across Europe, especially in banking and manufacturing. Alexander Kropp specializes in Kafka and Kubernetes, contributing to cloud platform design and monitoring.

Quotes:
- "A great introduction. Even experienced users will go back to it again and again." - Jakub Scholz, Red Hat
- "Approachable, practical, well-illustrated, and easy to follow. A must-read." - Olena Kutsenko, Confluent
- "A zero to hero journey to understanding and using Kafka!" - Anthony Nandaa, Microsoft
- "Thoughtfully explores a wide range of topics. A wealth of valuable information seamlessly presented and easily accessible." - Olena Babenko, Aiven Oy
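To make the produce/consume publish-subscribe model concrete, here is a minimal sketch using the confluent-kafka Python client. The broker address, topic name, and consumer group are assumptions for illustration and are not taken from the book.

```python
# Minimal produce/consume sketch of Kafka's publish-subscribe model using the
# confluent-kafka Python client; broker, topic, and group id are placeholders.
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"   # hypothetical local broker
TOPIC = "orders"            # hypothetical topic

# Producer: publish a few events to the topic.
producer = Producer({"bootstrap.servers": BROKER})
for i in range(3):
    producer.produce(TOPIC, key=str(i), value=f'{{"order_id": {i}}}')
producer.flush()            # block until deliveries are acknowledged

# Consumer: subscribe as part of a consumer group and read the events back.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "demo-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
try:
    for _ in range(3):
        msg = consumer.poll(5.0)
        if msg is None or msg.error():
            continue
        print(msg.key(), msg.value())
finally:
    consumer.close()
```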
Simplify real-time data analytics and build event-driven, AI-powered applications using BigQuery and Pub/Sub. Learn to ingest and process massive streaming data from users, devices, and microservices for immediate insights and rapid action. Explore BigQuery's continuous queries for real-time analytics and ML model training. Discover how Flipkart, India’s leading e-commerce platform, leverages Google Cloud to build scalable, efficient real-time data pipelines and AI/ML solutions, and gain insights on driving business value through real-time data.
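For context, here is a sketch of the ingest side of such a setup: publishing events to Pub/Sub, from which a BigQuery subscription or a Dataflow job could land rows for continuous queries. The project and topic IDs are placeholders, and credentials are assumed to be configured in the environment.

```python
# Minimal sketch of publishing a streaming event to Pub/Sub with the
# google-cloud-pubsub client; project and topic IDs are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")  # hypothetical IDs

event = {"user_id": "u123", "action": "add_to_cart", "ts": "2025-01-01T12:00:00Z"}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("Published message id:", future.result())  # blocks until the publish is acknowledged
```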
Madhive built their ad analytics and bidding infrastructure using databases and batch pipelines. When the pipeline lag got too long to bid effectively, they rebuilt from scratch with Google Cloud’s Managed Service for Apache Kafka. Join this session to learn about Madhive’s journey and dive deep into how the service works, how it can help you build streaming systems quickly and securely, and what migration looks like. This session is relevant for Kafka administrators and architects building event-sourcing platforms or event-driven systems.
Audiences around the world have almost limitless access to content that’s only a click, swipe, or voice command away. Companies are embracing cloud capabilities to evolve from traditional media companies into media-tech and media-AI companies. Join us to discover how the cloud is maximizing personalization and monetization to enable the next generation of AI-powered streaming experiences for audiences everywhere.
Overwhelmed by the complexities of building a robust and scalable data pipeline for algo trading with AlloyDB? This session provides the Google Cloud services, tools, recommendations, and best practices you need to succeed. We'll explore battle-tested strategies for implementing a low-latency, high-volume trading platform using AlloyDB and Spark Streaming on Dataproc.
Leverage Cloud Composer orchestration to create a scalable, efficient data pipeline that meets the demands of algo trading and handles increasing data volumes and trading activity by using the scalability of Google Cloud services.
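Since Cloud Composer runs Apache Airflow, orchestration like this is typically declared as a DAG. Below is a minimal sketch assuming Airflow 2.4+; the task names, schedule, and Python callables are illustrative assumptions rather than a reference pipeline.

```python
# Minimal sketch of an orchestration DAG for a trading data pipeline on
# Cloud Composer (Apache Airflow). Task names, schedule, and callables are
# illustrative placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_market_data(**context):
    print("pull latest market ticks into the landing zone")

def compute_signals(**context):
    print("run the Spark/SQL job that computes trading signals")

with DAG(
    dag_id="algo_trading_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="*/15 * * * *",   # every 15 minutes (Airflow 2.4+ syntax)
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_market_data", python_callable=ingest_market_data)
    signals = PythonOperator(task_id="compute_signals", python_callable=compute_signals)
    ingest >> signals          # ingest runs before signal computation
```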
The telecom industry has always been critical to advancing how we communicate, work, and play, whether through creation of our mobile world or streaming through high bandwidth connectivity. In this session we will explore how communication service providers from around the globe are leveraging AI agents across their workforce, customer experience, field operations, network operations, and more.
This blog defines the governance requirements that streaming data pipelines must meet to make artificial intelligence/machine learning (AI/ML) initiatives successful. Published at: https://www.eckerson.com/articles/streaming-data-governance-three-must-have-requirements-to-support-ai-ml-innovation
Enhance your data ingestion architecture's resilience with Google Cloud's serverless solutions. Gain end-to-end visibility into your data's lineage—track each data point's transformation journey, including timestamps, user actions, and process outcomes. Implement real-time streaming and daily batch processes for Vertex AI Retail Search to deliver near real-time search capabilities while maintaining a daily backup for contingencies. Adopt best practices for data management, lineage tracking, and forensic capabilities to streamline issue diagnosis. This talk presents a scalable and fault-tolerant design that optimizes data quality and search performance while ensuring forensic-level traceability for every data movement.
Join this Cloud Talk to explore how Large Language Models (LLMs) can revolutionize your data workflows. Learn to automate SQL query generation and stream results into Confluent using Vertex AI for real-time analytics and decision-making. Dive into integrating advanced AI into data pipelines, simplifying SQL creation, enhancing workflows, and leveraging Vertex AI for scalable machine learning. Discover how to optimize your data infrastructure and drive insights with Confluent’s Data Streaming Platform and cutting-edge AI technology.
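A rough sketch of that flow might look like the following: the Vertex AI SDK drafts a SQL statement from a natural-language request, the query runs against a warehouse, and each result row is produced to a Kafka/Confluent topic. The model name, Confluent connection settings, and the run_query helper are stand-ins, not a documented integration.

```python
# Minimal sketch: LLM-generated SQL whose results are streamed to a Kafka
# topic. Model name, broker config, and run_query() are illustrative stand-ins.
import json
import vertexai
from vertexai.generative_models import GenerativeModel
from confluent_kafka import Producer

vertexai.init(project="my-project", location="us-central1")   # hypothetical project
model = GenerativeModel("gemini-1.5-pro")

prompt = "Write a SQL query that returns daily order counts from sales.orders."
sql_text = model.generate_content(prompt).text

def run_query(sql):
    # Placeholder: execute against your warehouse (e.g. BigQuery) and yield dict rows.
    yield {"order_date": "2025-01-01", "order_count": 42}

producer = Producer({"bootstrap.servers": "pkc-xxxx.confluent.cloud:9092"})  # placeholder
for row in run_query(sql_text):
    producer.produce("llm-query-results", value=json.dumps(row))
producer.flush()
```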
Google's Data Cloud is a unified platform for the entire data lifecycle, from streaming with Managed Kafka, to ML feature creation in BigQuery, to global deployment via Bigtable. In this talk, we’ll give you a behind-the-scenes look at how Spotify's recommendation engine team uses Google's Data Cloud for their feature pipelines. Plus, we will demonstrate BigQuery AI Query Engine and how it streamlines feature development and testing. Finally, we'll explore new Bigtable capabilities that simplify application deployment and monitoring.
Unlock the potential of AI with high-performance, scalable lakehouses using BigQuery and Apache Iceberg. This session details how BigQuery leverages Google's infrastructure to supercharge Iceberg, delivering peak performance and resilience. Discover BigQuery's unified read/write path for rapid queries, superior storage management beyond simple compaction, and robust, high-throughput streaming pipelines. Learn how Spotify utilizes BigQuery's lakehouse architecture for a unified data source, driving analytics and AI innovation.
Redpanda, a leading Kafka API-compatible streaming platform, now supports storing topics in Apache Iceberg, seamlessly fusing low-latency streaming with data lakehouses using BigQuery and BigLake in GCP. Iceberg Topics eliminate complex and inefficient ETL between streams and tables, making real-time data instantly accessible for analysis in BigQuery. This push-button integration eliminates the need for costly connectors or custom pipelines, enabling both simple and sophisticated SQL queries across streams and other datasets. By combining Redpanda and Iceberg, GCP customers gain a secure, scalable, and cost-effective solution that transforms their agility while reducing infrastructure and human capital costs.
Kir Titievsky, Product Manager at Google Cloud with extensive experience in streaming and storage infrastructure, joined Yuliia and Dumky to talk about streaming. Drawing from his work with Apache Kafka, Cloud Pub/Sub, Dataflow, and Cloud Storage since 2015, Kir explains the fundamental differences between streaming and micro-batch processing. He challenges common misconceptions about streaming costs, explaining how streaming can be significantly less expensive than batch processing for many use cases. Kir shares insights on the "service bus architecture" revival, discussing how modern distributed messaging systems have solved historic bottlenecks while creating new opportunities for business and performance needs.
Kir's Medium: https://medium.com/@kir-gcp
Kir's LinkedIn page: https://www.linkedin.com/in/kir-titievsky-%F0%9F%87%BA%F0%9F%87%A6-7775052/