talk-data.com

Topic

Data Streaming

realtime event_processing data_flow

739 tagged

Activity Trend: peak of 70 per quarter, 2020-Q1 to 2026-Q1

Activities

739 activities · Newest first

The Data Engineer's Guide to Microsoft Fabric

Modern data engineering is evolving, and with Microsoft Fabric, the entire data platform experience is being redefined. This essential book offers a fresh, hands-on approach to navigating this shift. Rather than being an introduction to features, this guide explains how Fabric's key components (Lakehouse, Warehouse, and Real-Time Intelligence) work under the hood and how to put them to use in realistic workflows. Written by Christian Henrik Reich, a data engineering expert with experience that extends from Databricks to Fabric, this book is a blend of foundational theory and practical implementation of lakehouse solutions in Fabric. You'll explore how engines like Apache Spark and Fabric Warehouse collaborate with Fabric's Real-Time Intelligence solution in an integrated platform, and how to build ETL/ELT pipelines that deliver on speed, accuracy, and scale. Ideal for both new and practicing data engineers, this is your entry point into the fabric of the modern data platform.

- Acquire a working knowledge of lakehouses, warehouses, and streaming in Fabric
- Build resilient data pipelines across real-time and batch workloads
- Apply Python, Spark SQL, T-SQL, and KQL within a unified platform
- Gain insight into architectural decisions that scale with data needs
- Learn actionable best practices for engineering clean, efficient, governed solutions

Advanced SQL

SQL is no longer just a querying language for relational databases; it's a foundational tool for building scalable, modern data solutions across real-time analytics, machine learning workflows, and even generative AI applications. Advanced SQL shows data professionals how to move beyond conventional SELECT statements and tap into the full power of SQL as a programming interface for today's most advanced data platforms. Written by seasoned data experts Rui Pedro Machado, Hélder Russa, and Pedro Esmeriz, this practical guide explores the role of SQL in streaming architectures (like Apache Kafka and Flink), data lake ecosystems, cloud data warehouses, and ML pipelines. Geared toward data engineers, analysts, scientists, and analytics engineers, the book combines hands-on guidance with architectural best practices to help you extend your SQL skills into emerging workloads and real-world production systems.

- Use SQL to design and deploy modern, end-to-end data architectures
- Integrate SQL with data lakes, stream processing, and cloud platforms
- Apply SQL in feature engineering and ML model deployment
- Master pipe syntax and other advanced features for scalable, efficient queries
- Leverage SQL to build GenAI-ready data applications and pipelines

Data Engineering with Azure Databricks

Master end-to-end data engineering on Azure Databricks. From data ingestion and Delta Lake to CI/CD and real-time streaming, build secure, scalable, and performant data solutions with Spark, Unity Catalog, and ML tools.

Key Features

- Build scalable data pipelines using Apache Spark and Delta Lake
- Automate workflows and manage data governance with Unity Catalog
- Learn real-time processing and structured streaming with practical use cases
- Implement CI/CD, DevOps, and security for production-ready data solutions
- Explore Databricks-native ML, AutoML, and Generative AI integration

Book Description

"Data Engineering with Azure Databricks" is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing. Beginning with the foundational role of Azure Databricks in modern data engineering, you’ll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow. The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake’s ACID features for data reliability and schema evolution. You’ll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform. With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need.

What you will learn

- Set up a full-featured Azure Databricks environment
- Implement batch and streaming ingestion using Auto Loader
- Optimize Spark jobs with partitioning and caching
- Build real-time pipelines with structured streaming and DLT
- Manage data governance using Unity Catalog
- Orchestrate production workflows with jobs and ADF
- Apply CI/CD best practices with Azure DevOps and Git
- Secure data with RBAC, encryption, and compliance standards
- Use MLflow and Feature Store for ML pipelines
- Build generative AI applications in Databricks

Who this book is for

This book is for data engineers, solution architects, cloud professionals, and software engineers seeking to build robust and scalable data pipelines using Azure Databricks. Whether you're migrating legacy systems, implementing a modern lakehouse architecture, or optimizing data workflows for performance, this guide will help you leverage the full power of Databricks on Azure. A basic understanding of Python, Spark, and cloud infrastructure is recommended.

Notebooks struggle when data vastly exceeds RAM: pagination hacks, fragile sampling, and surprise OOMs. Buckaroo is a modern data table for notebooks built to quickly make sense of dataframes by providing search, summary stats, and scrolling with every view. This talk reviews how Buckaroo uses out‑of‑core design patterns, viewport streaming, lazy Polars pipelines, batched background stats, and a series cache to make interactive exploration fast and reliable on commodity laptops. We’ll walk through the lifecycle of opening a large Parquet/CSV file: detecting formats, avoiding full materialization, fetching only requested row/column ranges, and throttling UI updates for smoothness. We’ll show how column‑level hashing (via a lightweight Rust extension) enables stable cache keys so warm loads render the first viewport and stats in under a second. CSV specifics and a practical CSV→Parquet streaming path round out the approach. The ideas are tool‑agnostic and reproducible with the open‑source PyData stack; Buckaroo serves as a concrete reference implementation. You’ll leave with guidelines and snippets to bring these patterns to your own workflows.
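As a rough illustration of two ideas named in this abstract, fetching only a requested row/column range and deriving a stable cache key from column contents, here is a minimal standard-library sketch. It is not Buckaroo's code: the function names `fetch_viewport` and `column_cache_key` are hypothetical, `hashlib` stands in for the Rust hashing extension the talk mentions, and the sample data is invented.

```python
import csv
import hashlib
import io
from itertools import islice


def fetch_viewport(source, row_start, row_stop, columns):
    """Return rows [row_start, row_stop) restricted to the named columns.

    `source` is any iterable of CSV lines (an open file works). Rows before
    row_start are parsed and discarded; rows at or after row_stop are never read.
    """
    reader = csv.DictReader(source)
    window = islice(reader, row_start, row_stop)
    return [{name: row[name] for name in columns} for row in window]


def column_cache_key(values):
    """Hash a column's values into a key that is stable across runs."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(str(value).encode("utf-8"))
        digest.update(b"\x1f")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()


# Hypothetical in-memory data standing in for a large file on disk.
sample = io.StringIO("id,city,temp\n0,Oslo,3\n1,Lima,19\n2,Pune,31\n3,Rome,15\n")
view = fetch_viewport(sample, row_start=1, row_stop=3, columns=["id", "temp"])
print(view)
print(column_cache_key(row["temp"] for row in view))
```

Because the key depends only on the column's contents, a viewer could reuse cached summary stats whenever the same column is opened again, which is the kind of warm-load behavior the abstract describes.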

Fantasy basketball involves daily decisions: which players to start, who to pick up from free agency, and how to balance competing objectives across multiple statistical categories. This talk demonstrates how linear programming and integer programming can help solve these problems.

Using the Python library PuLP, we'll explore when to use linear programming versus integer programming, how to formulate constraints for roster decisions, and how to handle different league formats. Through practical examples, we'll build optimizers for start/sit decisions and free agency streaming.
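To make the start/sit formulation concrete, here is a small sketch of the underlying 0/1 decision: each player is either started or benched, each position has a fixed number of slots, and the objective is total projected points. The talk uses PuLP; this sketch deliberately does not. It solves the same tiny integer program by exhaustive search with the standard library, which only scales to small rosters. The roster, projections, and slot rules are invented for illustration.

```python
from itertools import combinations

# Hypothetical roster: (name, position, projected fantasy points).
ROSTER = [
    ("A", "G", 31.0),
    ("B", "G", 28.5),
    ("C", "F", 35.0),
    ("D", "F", 22.0),
    ("E", "C", 27.0),
    ("F", "C", 30.5),
]

# Hypothetical league rule: start exactly this many players per position.
SLOTS = {"G": 1, "F": 2, "C": 1}


def best_lineup(roster, slots):
    """Try every lineup of the right size and keep the best feasible one."""
    size = sum(slots.values())
    best, best_points = None, float("-inf")
    for lineup in combinations(roster, size):
        counts = {}
        for _, position, _ in lineup:
            counts[position] = counts.get(position, 0) + 1
        if counts != slots:  # position constraint violated
            continue
        points = sum(projected for _, _, projected in lineup)
        if points > best_points:
            best, best_points = lineup, points
    if best is None:  # no lineup satisfies the slot rules
        return [], 0.0
    return [name for name, _, _ in best], best_points


starters, total = best_lineup(ROSTER, SLOTS)
print(starters, total)
```

An integer programming solver expresses the same thing with one binary variable per player and one equality constraint per position, and it avoids enumerating every combination.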

Hands-On with LLM-Powered Recommenders: Hybrid Architectures for Next-Gen Personalization

Recommender systems power everything from e-commerce to media streaming, but most pipelines still rely on collaborative filtering or neural models that focus narrowly on user–item interactions. Large language models (LLMs), by contrast, excel at reasoning across unstructured text, contextual information, and explanations. This tutorial bridges the two worlds. Participants will build a hybrid recommender system that uses structured embeddings for retrieval and integrates an LLM layer for personalization and natural-language explanations. We’ll also discuss practical engineering constraints: scaling, latency, caching, distillation/quantization, and fairness. By the end, attendees will leave with a working hybrid recommender they can extend for their own data, along with a playbook for when and how to bring LLMs into recommender workflows responsibly.
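The two-stage shape described here, embedding-based retrieval followed by a language-model layer that explains the picks, can be sketched as below. This is an assumption-laden outline rather than the tutorial's code: the embeddings are invented two-dimensional vectors, and `template_explainer` is a hypothetical stand-in for an LLM call so the example runs without any model or network access.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(user_vec, item_vecs, k):
    """Stage 1: rank items by similarity to the user vector and keep the top k."""
    ranked = sorted(
        item_vecs,
        key=lambda name: cosine(user_vec, item_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


def recommend(user_vec, item_vecs, k, explain):
    """Stage 2: attach an explanation to each retrieved item.

    `explain` is any callable taking an item name; in a real system it
    would wrap an LLM request.
    """
    return [(item, explain(item)) for item in retrieve(user_vec, item_vecs, k)]


def template_explainer(item):
    # Hypothetical placeholder for the LLM layer.
    return f"Suggested because it is close to your recent activity: {item}"


# Hypothetical item and user embeddings.
ITEMS = {"thriller": [1.0, 0.0], "drama": [0.7, 0.7], "cartoon": [0.0, 1.0]}
USER = [0.9, 0.1]

for item, reason in recommend(USER, ITEMS, k=2, explain=template_explainer):
    print(item, "->", reason)
```

Keeping the explainer behind a plain callable means the expensive model only sees the few retrieved candidates, which is one way to address the latency and caching constraints the abstract raises.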

AWS re:Invent 2025 - A practitioner’s guide to data for agentic AI (DAT315)

In this session, gain the skills needed to deploy end-to-end agentic AI applications using your most valuable data. This session focuses on data management using processes like Model Context Protocol (MCP) and Retrieval Augmented Generation (RAG), and provides concepts that apply to other methods of customizing agentic AI applications. Discover best practice architectures using AWS database services like Amazon Aurora and OpenSearch Service, along with analytical, data processing and streaming experiences found in SageMaker Unified Studio. Learn data lake, governance, and data quality concepts, and how Amazon Bedrock AgentCore, Bedrock Knowledge Bases, and other features tie solution components together.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Autonomous agents powered by streaming data and Retrieval Augmented Generation

Unlock the potential of intelligent autonomous agents that combine real-time streaming data with Retrieval Augmented Generation (RAG) for dynamic decision-making. You will learn how to use streaming technologies like Amazon Kinesis, Amazon MSK, and Managed Service for Apache Flink to create a robust pipeline that transforms raw events into actionable insights. This session will show you how autonomous agents leverage these real-time insights with a RAG architecture powered by OpenSearch, enabling immediate, context-aware responses to changing conditions. This practical architecture drives real-world value in critical scenarios like predictive maintenance, automated incident response, and intelligent customer service automation, with improved accuracy and reduced latency.

AWS re:Invent 2025 - Powering your Agentic AI experience with AWS Streaming and Messaging (ANT310)

Organizations are accelerating innovation with generative AI and agentic AI use cases. This session explores how AWS streaming and messaging services such as Amazon Managed Streaming for Apache Kafka, Kinesis Data Streams, Amazon Managed Service for Apache Flink, and Amazon SQS help you build intelligent, responsive applications. Discover how streaming supports real-time data ingestion and processing, while messaging ensures reliable coordination between AI agents, orchestrates workflows, and delivers critical information at scale. Learn architectural patterns that show how a unified approach acts on data as fast as needed, providing the reliability and scale your next generation of AI applications will need.

AWS re:Invent 2025 - Using Strands Agents to build autonomous, self-improving AI agents (AIM426)

Explore the cutting edge of AI with Strands Agents—autonomous systems that evolve and learn continuously. We'll demonstrate advanced agents that can identify knowledge gaps, self-modify reasoning strategies, and dynamically build tools. These systems learn from interactions, improving decision-making without human intervention while communicating through multiple protocols and real-time, bi-directional streaming. Using Strands' model-driven approach, agents operate independently for extended periods, continuously enhancing effectiveness. Through real-world examples, see how self-improving agents have transformed business processes by adapting to changing requirements automatically. Join us to challenge conventional thinking about agent limitations and reshape your approach to building truly autonomous AI systems.

AWS re:Invent 2025 - Scaling foundation model inference on Amazon SageMaker AI (AIM424)

Learn how to optimize and deploy popular open-source models like Qwen3, GPT-OSS, and Llama4 using advanced inference engines such as vLLM on SageMaker. We'll explore key features including bidirectional streaming for audio and text applications, and share proven optimization techniques for inference. Through live demos, learn to boost performance with KV caching, intelligent routing, and autoscaling to maintain stability under varying loads. We'll demonstrate solutions for building agentic workflows with SageMaker AI, LangChain, and Amazon Bedrock AgentCore integration, and share best practices that help you move confidently from prototype to trusted AI experiences that delight users.

AWS re:Invent 2025 - Amazon Kinesis Data Streams under the hood (ANT423)

Discover how AWS is changing data streaming with Amazon Kinesis Data Streams for infrastructure and operations. This session will explore recent innovations in how Kinesis Data Streams enables you to build robust, scalable data streaming applications that can handle millions of events per second. Join this session to see how you can leverage Amazon Kinesis Data Streams to build scalable, resilient data streaming applications for faster insights and improved decision-making.

AWS re:Invent 2025 - Binge-worthy: Netflix's journey to Amazon Aurora at scale (DAT322)

In this session, learn how Netflix successfully orchestrated the migration of terabytes of mission-critical data across 100+ clusters to Amazon Aurora while ensuring continuous service for millions of global subscribers. Through a detailed examination of their innovative approach combining AWS Database Migration Service and Netflix's proprietary Data Streaming Platform, explore how they achieved near-zero downtime and maintained data integrity throughout this complex transition. Technical leaders will gain actionable insights into architecting similar migrations, managing risks, and leveraging AWS tools effectively. Join us to learn how Netflix's experience can inform your own database modernization strategy.

AWS re:Invent 2025 - Operating Apache Kafka and Apache Flink at scale (ANT307)

Enterprises use Apache Kafka and Apache Flink for an increasing number of mission-critical use cases, real-time analytics, application messaging, and machine learning. As this usage grows, so do the criticality, scale, and cost of managing the Kafka and Flink clusters. Learn how customers can achieve the same or higher availability and durability for their growing clusters, at lower unit costs and with operational simplicity, using Amazon MSK (Managed Streaming for Apache Kafka) and Amazon MSF (Managed Service for Apache Flink).

Summary

In this crossover episode, Max Beauchemin explores how multiplayer, multi‑agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just‑in‑time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard‑won practices from going “AI‑first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear (code review, QA, async coordination) when execution accelerates 2–10x. He also dives deep into Agor, his open‑source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub‑sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents.

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
- Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
- Your host is Tobias Macey and today I'm interviewing Maxime Beauchemin about the impact of multi-player multi-agent engineering on individual and team velocity for building better data systems

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the types of work that you are relying on AI development agents for?
- As you bring agents into the mix for software engineering, what are the bottlenecks that start to show up?
- In my own experience there are a finite number of agents that I can manage in parallel. How does Agor help to increase that limit?
- How does making multi-agent management a multi-player experience change the dynamics of how you apply agentic engineering workflows?

Contact Info

- LinkedIn

Links

- Agor
- Apache Airflow
- Apache Superset
- Preset
- Claude Code
- Codex
- Playwright MCP
- Tmux
- Git Worktrees
- Opencode.ai
- GitHub Codespaces
- Ona

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Harness the power of real-time analytics and digital twins to achieve critical operational tasks. In this lab, you'll learn how to transform physical systems into dynamic digital replicas, enhancing simulations and optimizing operations. Discover how to build end-to-end solutions for event-driven scenarios, streaming data, and data logs. These practical steps will empower you to drive smart decision-making and foster innovation within your organization.

Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.

Experience the power of the newly launched NVIDIA Blackwell GPUs, designed to accelerate a wide array of workloads including video streaming, graphics rendering, product design, digital twins, and AI development. Explore the latest GPU offerings with Azure Cloud and learn how to leverage Azure Local for cost-effective hybrid deployments of cutting-edge use cases. We will also provide sizing recommendations and deployment options for various workloads.