Data Streaming

From RAG to Relational: How Agentic Patterns Are Reshaping Data Architecture

2025-09-18 · Data Engineering Podcast Listen

podcast_episode

by Mark Brooker (AWS) , Tobias Macey

AI/ML AWS Aurora AWS Lambda Data Engineering Data Management Datafold ELK ETL/ELT LLM Prefect Python +3 more

Summary In this episode of the AI Engineering Podcast Mark Brooker, VP and Distinguished Engineer at AWS, talks about how agentic workflows are transforming database usage and infrastructure design. He discusses the evolving role of data in AI systems, from traditional models to more modern approaches like vectors, RAG, and relational databases. Mark explains why agents require serverless, elastic, and operationally simple databases, and how AWS solutions like Aurora and DSQL address these needs with features such as rapid provisioning, automated patching, geodistribution, and spiky usage. The conversation covers topics including tool calling, improved model capabilities, state in agents versus stateless LLM calls, and the role of Lambda and AgentCore for long-running, session-isolated agents. Mark also touches on the shift from local MCP tools to secure, remote endpoints, the rise of object storage as a durable backplane, and the need for better identity and authorization models. The episode highlights real-world patterns like agent-driven SQL fuzzing and plan analysis, while identifying gaps in simplifying data access, hardening ops for autonomous systems, and evolving serverless database ergonomics to keep pace with agentic development.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Marc Brooker about the impact of agentic workflows on database usage patterns and how they change the architectural requirements for databasesInterview IntroductionHow did you get involved in the area of data management?Can you describe what the role of the database is in agentic workflows?There are numerous types of databases, with relational being the most prevalent. How does the type and purpose of an agent inform the type of database that should be used?Anecdotally I have heard about how agentic workloads have become the predominant "customers" of services like Neon and Fly.io. How would you characterize the different patterns of scale for agentic AI applications? (e.g. proliferation of agents, monolithic agents, multi-agent, etc.)What are some of the most significant impacts on workload and access patterns for data storage and retrieval that agents introduce?What are the categorical differences in that behavior as compared to programmatic/automated systems?You have spent a substantial amount of time on Lambda at AWS. Given that LLMs are effectively stateless, how does the added ephemerality of serverless functions impact design and performance considerations around having to "re-hydrate" context when interacting with agents?What are the most interesting, innovative, or unexpected ways that you have seen serverless and database systems used for agentic workloads?What are the most interesting, unexpected, or challenging lessons that you have learned while working on technologies that are supporting agentic applications?Contact Info BlogLinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links AWS Aurora DSQLAWS LambdaThree Tier ArchitectureVector DatabaseGraph DatabaseRelational DatabaseVector EmbeddingRAG == Retrieval Augmented GenerationAI Engineering Podcast EpisodeGraphRAGAI Engineering Podcast EpisodeLLM Tool CallingMCP == Model Context ProtocolA2A == Agent 2 Agent ProtocolAWS Bedrock AgentCoreStrandsLangChainKiroThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Speaking From All Hands Day

2025-09-17 · Moody's Talks - Inside Economics Listen

podcast_episode

by Cris deRitis , Mark Zandi (Moody's Analytics) , Marisa DiNatale (Moody's Analytics)

Analytics

The Inside Economics team gets together in person at All Hands Day. It is a short podcast, with more than the typical amount of chit-chat (as we are in person). But it is an action-packed conversation on the Fed’s rate decision (see if we got it right), our proposal to unlock the housing market, and, of course, the statistics game! Explore the risks and realities shaping the economy in our new webinar, now streaming for free: U.S. Economic Outlook: Under Unprecedented Uncertainty Watch here: https://events.moodys.com/mc68453-wbn-2025-mau25777-us-macro-outlook-precipice-recession?mkt_tok=OT… Hosts: Mark Zandi – Chief Economist, Moody’s Analytics, Cris deRitis – Deputy Chief Economist, Moody’s Analytics, and Marisa DiNatale – Senior Director - Head of Global Forecasting, Moody’s Analytics Follow Mark Zandi on 'X' and BlueSky @MarkZandi, Cris deRitis on LinkedIn, and Marisa DiNatale on LinkedIn Questions or Comments, please email us at [email protected]. We would love to hear from you.

Questions or Comments, please email us at [email protected]. We would love to hear from you. To stay informed and follow the insights of Moody's Analytics economists, visit Economic View.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Achieving Sub-100 ms Real-Time Stream Processing with an S3-Native Architecture

2025-09-16 · Data Builders’ Evening: Architecture, Engineering & Beyond

talk

by Yingjun Wu (RisingWave Labs)

S3 hummock log-structured state engine object storage

Stream processing systems have traditionally relied on local storage engines such as RocksDB to achieve low latency. While effective in single-node setups, this model doesn't scale well in the cloud, where elasticity and separation of compute and storage are essential. In this talk, we'll explore how RisingWave rethinks the architecture by building directly on top of S3 while still delivering sub-100 ms latency. At the core is Hummock, a log-structured state engine designed for object storage. Hummock organizes state into a three-tier hierarchy: in-memory cache for the hottest keys, disk cache managed by Foyer for warm data, and S3 as the persistent cold tier. This approach ensures queries never directly hit S3, avoiding its variable performance. We'll also examine how remote compaction offloads heavy maintenance tasks from query nodes, eliminating interference between user queries and background operations. Combined with fine-grained caching policies and eviction strategies, this architecture enables both consistent query performance and cloud-native elasticity. Attendees will walk away with a deeper understanding of how to design streaming systems that balance durability, scalability, and low latency in an S3-based environment.

CPI and AI with Capital Group's Jared Franz

2025-09-12 · Moody's Talks - Inside Economics Listen

podcast_episode

by Matt Colyar (Moody's Analytics) , Cris deRitis , Jared Franz (Capital Group) , Mark Zandi (Moody's Analytics)

AI/ML Analytics

Mark and Cris are joined by Matt Colyar to break down the latest CPI inflation report, while Jared Franz from the Capital Group explores how artificial intelligence is reshaping the American economy and labor market. We examine the opportunities and challenges of the AI revolution and what it means for workers, businesses, and investors in this rapidly changing economic landscape. Jared Franz is an economist at Capital Group, responsible for covering the United States. He has 19 years of investment industry experience and has been with Capital Group for 10 years. Prior to joining Capital, Jared was head of international macroeconomic research at Hartford Investment Management Company. Before that, he was an international and U.S. economist at T. Rowe Price. He holds a PhD in economics from the University of Illinois at Chicago, a bachelor’s degree in mathematics from Northwestern University and attended the U.S. Naval Academy. He is also a member of the Forecasters Club of New York, an elected member of the Conference of Business Economists and a member of the Pacific Council. Jared is based in Los Angeles. Explore more insights from Capital Group’s Jared Franz in the articles below: 4 charts on why the U.S. economy could stay resilient | Capital Group Benjamin Button’s clues for the US economy Explore the risks and realities shaping the economy in our new webinar, now streaming for free. U.S. Economic Outlook: Under Unprecedented Uncertainty Watch here: https://events.moodys.com/mc68453-wbn-2025-mau25777-us-macro-outlook-precipice-recession?mkt_tok=OT… Hosts: Mark Zandi – Chief Economist, Moody’s Analytics, Cris deRitis – Deputy Chief Economist, Moody’s Analytics, and Marisa DiNatale – Senior Director - Head of Global Forecasting, Moody’s Analytics Follow Mark Zandi on 'X' and BlueSky @MarkZandi, Cris deRitis on LinkedIn, and Marisa DiNatale on LinkedIn

Questions or Comments, please email us at [email protected]. We would love to hear from you. To stay informed and follow the insights of Moody's Analytics economists, visit Economic View.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

Behind the Scenes: How RTL Uses AI to Power Video Workflows

2025-09-11 · Data Expo NL 2025

talk

by Prajakta Shouche (RTL Netherlands)

AI/ML

RTL processes a high volume of video content across platforms—from streaming dramas to breaking news. In this session, we’ll share how we use AI to automate repetitive tasks like subtitling, thumbnail selection, and audio separation, helping teams work more efficiently. We’ll also take a look at the modular, open-source infrastructure behind these workflows—and how it integrates AI into production with flexibility.

Duck Lake: Simplifying the Lakehouse Ecosystem

2025-09-10 · Data Engineering Podcast Listen

podcast_episode

by Mark Raasveldt (DuckDB) , Hannes Mühleisen (DuckDB Labs) , Tobias Macey

AI/ML Flink Data Engineering Data Lakehouse Data Management Datafold Delta DuckDB ETL/ELT Iceberg Lance Motherduck +4 more

Summary In this episode of the Data Engineering Podcast Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on Duck Lake, a new entrant in the open lakehouse ecosystem. They discuss how Duck Lake, is focused on simplicity, flexibility, and offers a unified catalog and table format compared to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how Duck Lake revolutionizes data architecture by enabling local-first data processing, simplifying deployment of lakehouse solutions, and offering benefits such as encryption features, data inlining, and integration with existing ecosystems.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystemInterview IntroductionHow did you get involved in the area of data management?Can you describe what DuckLake is and the story behind it?What are the particular problems that DuckLake is solving for?How does this compare to the capabilities of MotherDuck?Iceberg and Delta already have a well established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (why can't Iceberg REST catalog implementations just use DuckLake under the hood?)Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?Is it now possible to enforce PK/FK constraints, indexing on underlying data?Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?When is DuckLake the wrong choice?What do you have planned for the future of DuckLake?Contact Info HannesWebsiteMarkWebsiteParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links DuckDBPodcast EpisodeDuckLakeDuckDB LabsMySQLCWIMonetDBIcebergIceberg REST CatalogDeltaHudiLanceDuckDB Iceberg ConnectorACID == Atomicity, Consistency, Isolation, DurabilityMotherDuckMotherDuck Managed DuckLakeTrinoSparkPrestoSpark DuckLake DemoDelta KernelArrowdltS3 TablesAttribute Based Access Control (ABAC)ParquetArrow FlightHadoopHDFSDuckLake RoadmapThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

🎙️ Talking Data with Dima Spivak, Director of Product Management, StreamSets {Replay}

2025-09-03 · Making Data Simple Listen

podcast_episode

by Dima Spivak (StreamSets) , Al Martin (IBM)

AI/ML IBM Fabric

Send us a text Get ready for an insightful conversation on the future of data integration and real-time decision making with Dima Spivak, Director of Product Management at StreamSets. We cover everything from the “why StreamSets” story to its secret sauce, how it plays in regulated industries, and what makes it a powerful player in data fabric, AI, and streaming use cases. If you’re passionate about the future of data pipelines, governance, and AI-driven insights, this one’s for you! ⏱️ Episode Guide: 02:02 | Meet Dima Spivak04:19 | Why StreamSets?06:00 | What is StreamSets?09:48 | On-Demand Expense11:34 | Regulated Industries12:36 | The Secret Sauce14:41 | A Competitive View15:50 | Data Fabric + StreamSets18:25 | StreamSets + AI21:12 | Use Cases That Matter24:02 | The Future of Streaming25:48 | Quality + Testing31:19 | For Fun 🎉🔗 Connect with Dima: LinkedIn: linkedin.com/in/dmitryspivak Website: https://www.ibm.com/blog/announcement/ibm-acquires-streamSets/ Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Data Engineering for Cybersecurity

2025-08-26 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by James Bonifield

Ansible Cloud Computing Data Engineering ELK Git Kafka Linux Logstash PowerShell Redis Cyber Security data +1 more

Security teams rely on telemetry—the continuous stream of logs, events, metrics, and signals that reveal what’s happening across systems, endpoints, and cloud services. But that data doesn’t organize itself. It has to be collected, normalized, enriched, and secured before it becomes useful. That’s where data engineering comes in. In this hands-on guide, cybersecurity engineer James Bonifield teaches you how to design and build scalable, secure data pipelines using free, open source tools such as Filebeat, Logstash, Redis, Kafka, and Elasticsearch and more. You’ll learn how to collect telemetry from Windows including Sysmon and PowerShell events, Linux files and syslog, and streaming data from network and security appliances. You’ll then transform it into structured formats, secure it in transit, and automate your deployments using Ansible. You’ll also learn how to: Encrypt and secure data in transit using TLS and SSH Centrally manage code and configuration files using Git Transform messy logs into structured events Enrich data with threat intelligence using Redis and Memcached Stream and centralize data at scale with Kafka Automate with Ansible for repeatable deployments Whether you’re building a pipeline on a tight budget or deploying an enterprise-scale system, this book shows you how to centralize your security data, support real-time detection, and lay the groundwork for incident response and long-term forensics.

Mastering real-time anomaly detection

2025-08-20 · PyData Berlin 2025 August Meetup

talk

by Olena Kutsenko (Confluent)

AI/ML Flink IoT Kafka Python

Abstract: Detecting problems as they happen is essential in today’s fast-moving, data-driven world. In this talk, you’ll learn how to build a flexible, real-time anomaly detection pipeline using Apache Kafka and Apache Flink, backed by statistical and machine learning models. We’ll start by demystifying what anomaly really means - exploring the different types (point, contextual, and collective anomalies) and the difference between unintentional issues and intentional outliers like fraud or abuse. Then, we’ll look at how anomaly detection is solved in practice: from classical statistical models like ARIMA to deep learning models like LSTM. You’ll learn how ARIMA breaks time series into AutoRegressive, Integrated, and Moving Average components, no math degree required (just a Python library). We’ll also uncover why forgetting is a feature, not a bug, when it comes to LSTMs, and how these models learn to detect complex patterns over time. Throughout, we’ll show how Kafka handles high-throughput streaming data and how Flink enables low-latency, stateful processing to catch issues as they emerge. You’ll leave knowing not just how these systems work, but when to use each type of model depending on your data and goals. Whether you're monitoring system health, tracking IoT devices, or looking for fraud in transactions, this talk will give you the foundations and tools to detect the unexpected - before it becomes a problem.

Streaming the Future: Dr. Sean Falconer, AI Entrepreneur in Residence at Confluent on AI, LLMs, and the Contrarian View

2025-07-30 · Making Data Simple Listen

podcast_episode

by Al Martin (IBM) , Sean Falconer (Confluent)

AI/ML IBM LLM

Send us a text Deep Diving into the future of AI:

Join Dr. Sean Falconer — AI Entrepreneur in Residence at Confluent, software engineering leader, and developer relations expert — for a deep dive into the future of AI, data streaming, and what it really means to build at the edge of innovation. From managing multiple LLMs to testing autonomous agents and sharing his bold contrarian takes, Sean helps us simplify the complexity of today's tech. 📌 Timestamps 04:38 – Meet Sean Falconer 11:11 – Lifelong Learning 12:31 – AI Entrepreneur in Residence 16:28 – Multiple LLMs in Action 21:07 – The Tech Behind Confluent 25:51 – Why Sean Chose Confluent 28:40 – Invest or Short? 36:58 – Testing Agents IRL 40:51 – The Contrarian AI Take 42:27 – Looking Ahead: The Future of AI🔗 Connect with Sean: LinkedInSubstackMedium

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

Warehouse Native Incremental Data Processing With Dynamic Tables And Delayed View Semantics

2025-07-21 · Data Engineering Podcast Listen

podcast_episode

by Dan Sotolongo (Snowflake) , Tobias Macey

AI/ML Data Engineering Data Management Datafold Python Snowflake

Summary In this episode of the Data Engineering Podcast Dan Sotolongo from Snowflake talks about the complexities of incremental data processing in warehouse environments. Dan discusses the challenges of handling continuously evolving datasets and the importance of incremental data processing for optimized resource use and reduced latency. He explains how delayed view semantics can address these challenges by maintaining up-to-date results with minimal work, leveraging Snowflake's dynamic tables feature. The conversation also explores the broader landscape of data processing, comparing batch and streaming systems, and highlights the trade-offs between them. Dan emphasizes the need for a unified theoretical framework to discuss semantic guarantees in data pipelines and introduces the concept of delayed view semantics, touching on the limitations of current systems and the potential of dynamic tables to simplify complex data workflows.

Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data managementData migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.Your host is Tobias Macey and today I'm interviewing Dan Sotolongo about the challenges of incremental data processing in warehouse environments and how delayed view semantics help to address the problemInterview IntroductionHow did you get involved in the area of data management?Can you start by defining the scope of the term "incremental data processing"?What are some of the common solutions that data engineers build when creating workflows to implement that pattern?What are some common difficulties that they encounter in the pursuit of incremental data?Can you describe what delayed view semantics are and the story behind it?What are the problems that DVS explicitly doesn't address?How does the approach that you have taken in Dynamic View Semantics compare to systems like Materialize, Feldera, etc.Can you describe the technical architecture of the implementation of Dynamic Tables?What are the elements of the problem that are as-yet unsolved?How has the implementation changed/evolved as you learned more about the solution space?What would be involved in implementing the delayed view semantics pattern in other dbms engines?For someone who wants to use DVS/Dyamic Tables for managing their incremental data loads, what does the workflow look like?What are the options for being able to apply tests/validation logic to a dynamic table while it is operating?What are the most interesting, innovative, or unexpected ways that you have seen Dynamic Tables used?What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dynamic Tables/Delayed View Semantics?When are Dynamic Tables/DVS the wrong choice?What do you have planned for the future of Dynamic Tables?Contact Info LinkedInParting Question From your perspective, what is the biggest gap in the tooling or technology for data management today?Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.Links Delayed View Semantics: Presentation SlidesSnowflakeNumPyIPythonJupyterFlinkSpark StreamingKafkaSnowflake Dynamic TablesAirflowDagsterStreaming WatermarksMaterializeFelderaACIDCAP Theorem)LinearizabilitySerializable ConsistencySIGMODMaterialized ViewsdbtData VaultApache IcebergDatabricks DeltaHudiDead Letter Queuepg_ivmProperty Based TestingIceberg V3 Row LineagePrometheusThe intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Fabrizio Lazzaretti, Jens Muller, Patrick Muller: Streaming & Eventing for Top Management

2025-07-17 · DATA MINER Big Data Europe Conference 2020 Watch

video

by Jens Muller , Fabrizio Lazzaretti , Patrick Muller