The Data Engineer's Guide to Microsoft Fabric

2027-05-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Christian Henrik Reich (twoday Data & AI)

Data Engineering Data Lakehouse Databricks Microsoft Fabric Python Spark SQL Data Streaming analytics-platforms data data-science

Modern data engineering is evolving, and with Microsoft Fabric, the entire data platform experience is being redefined. This essential book offers a fresh, hands-on approach to navigating this shift. Rather than being an introduction to features, this guide explains how Fabric's key components (Lakehouse, Warehouse, and Real-Time Intelligence) work under the hood and how to put them to use in realistic workflows. Written by Christian Henrik Reich, a data engineering expert with experience that extends from Databricks to Fabric, this book is a blend of foundational theory and practical implementation of lakehouse solutions in Fabric. You'll explore how engines like Apache Spark and Fabric Warehouse collaborate with Fabric's Real-Time Intelligence solution in an integrated platform, and how to build ETL/ELT pipelines that deliver on speed, accuracy, and scale. Ideal for both new and practicing data engineers, this is your entry point into the fabric of the modern data platform.

Acquire a working knowledge of lakehouses, warehouses, and streaming in Fabric
Build resilient data pipelines across real-time and batch workloads
Apply Python, Spark SQL, T-SQL, and KQL within a unified platform
Gain insight into architectural decisions that scale with data needs
Learn actionable best practices for engineering clean, efficient, governed solutions

Data Engineering for Multimodal AI

2026-05-25 · O'Reilly Data Engineering Books O'Reilly Amazon

book

by Vasundra Srinivasan

AI/ML Cloud Computing Data Engineering Data Governance MLOps Cyber Security data data-engineering

A shift is underway in how organizations approach data infrastructure for AI-driven transformation. As multimodal AI systems and applications become increasingly sophisticated and data hungry, data systems must evolve to meet these complex demands. Data Engineering for Multimodal AI is one of the first practical guides for data engineers, machine learning engineers, and MLOps specialists looking to rapidly master the skills needed to build robust, scalable data infrastructures for multimodal AI systems and applications. You'll follow the entire lifecycle of AI-driven data engineering, from conceptualizing data architectures to implementing data pipelines optimized for multimodal learning in both cloud native and on-premises environments. And each chapter includes step-by-step guides and best practices for implementing key concepts.

Design and implement cloud native data architectures optimized for multimodal AI workloads
Build efficient and scalable ETL processes for preparing diverse AI training data
Implement real-time data processing pipelines for multimodal AI inference
Develop and manage feature stores that support multiple data modalities
Apply data governance and security practices specific to multimodal AI projects
Optimize data storage and retrieval for various types of multimodal ML models
Integrate data versioning and lineage tracking in multimodal AI workflows
Implement data-quality frameworks to ensure reliable outcomes across data types
Design data pipelines that support responsible AI practices in a multimodal context

The Lifecycle of a Jupyter Environment: From Exploration to Production-Grade Pipelines

2025-12-09 · PyData Boston 2025

talk

by Dawn Wages

AI/ML Data Science Spark

Most data science projects start with a simple notebook—a spark of curiosity, some exploration, and a handful of promising results. But what happens when that experiment needs to grow up and go into production?

This talk follows the story of a single machine learning exploration that matures into a full-fledged ETL pipeline. We’ll walk through the practical steps and real-world challenges that come up when moving from a Jupyter notebook to something robust enough for daily use.

We’ll cover how to:

Set clear objectives and document the process from the beginning
Break messy notebook logic into modular, reusable components
Choose the right tools (Papermill, nbconvert, shell scripts) based on your workflow—not just the hype
Track environments and dependencies to make sure your project runs tomorrow the way it did today
Handle data integrity, schema changes, and even evolving labels as your datasets shift over time

And as a bonus: bring your results to life with interactive visualizations using tools like PyScript, Voila, and Panel + HoloViz
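
As a rough illustration of the "schema changes" point in the list above, a pipeline step might compare each incoming batch against an expected schema before processing it. The sketch below is not taken from the talk; the `EXPECTED_SCHEMA` mapping, the `schema_drift` helper, and the sample rows are all hypothetical, and it uses only the Python standard library.

```python
from typing import Any

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA: dict[str, type] = {"user_id": int, "label": str, "score": float}


def schema_drift(
    rows: list[dict[str, Any]], expected: dict[str, type]
) -> dict[str, list[str]]:
    """Report columns that are missing, unexpected, or of the wrong type."""
    missing: set[str] = set()
    unexpected: set[str] = set()
    wrong_type: set[str] = set()
    for row in rows:
        # dict key views support set operations directly.
        missing |= expected.keys() - row.keys()
        unexpected |= row.keys() - expected.keys()
        for name, expected_type in expected.items():
            if name in row and not isinstance(row[name], expected_type):
                wrong_type.add(name)
    return {
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
        "wrong_type": sorted(wrong_type),
    }


batch = [
    {"user_id": 1, "label": "cat", "score": 0.9},
    {"user_id": "2", "label": "dog", "segment": "new"},  # drifted row
]
report = schema_drift(batch, EXPECTED_SCHEMA)
```

A scheduled job could fail fast, or route the batch to quarantine, whenever any of the three lists in `report` is non-empty.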

AWS re:Invent 2025 - Best practices for building Apache Iceberg based lakehouse architectures on AWS

2025-12-06 · AWS re:Invent 2025 Watch

video

Agile/Scrum Athena AWS Amazon EMR AWS Glue Cloud Computing Data Lakehouse Iceberg Redshift S3 Amazon SageMaker Spark

Discover advanced strategies for implementing Apache Iceberg on AWS, focusing on Amazon S3 Tables and integration of the Iceberg REST Catalog with the lakehouse in Amazon SageMaker. We'll cover performance optimization techniques for Amazon Athena and Amazon Redshift queries, real-time processing using Apache Spark, and integration with Amazon EMR, AWS Glue, and Trino. Explore practical implementations of zero-ETL, change data capture (CDC) patterns, and medallion architecture. Gain hands-on expertise in implementing enterprise-grade lakehouse solutions with Iceberg on AWS.
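
For readers unfamiliar with the change data capture (CDC) pattern mentioned above, the core idea is to apply an ordered stream of insert, update, and delete events to a keyed table. The sketch below shows that idea in plain Python only; it is not Iceberg or AWS code, and the event shape (`op`, `key`, `row`), the `apply_cdc` helper, and the sample data are assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def apply_cdc(table: dict[int, Row], events: list[dict[str, Any]]) -> dict[int, Row]:
    """Apply insert/update/delete change events to a keyed table, in order."""
    result = dict(table)  # shallow copy so the input mapping is not mutated
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            result.pop(key, None)  # deleting a missing key is a no-op
        elif event["op"] in ("insert", "update"):
            # Upsert: merge new column values over any existing row.
            result[key] = {**result.get(key, {}), **event["row"]}
        else:
            raise ValueError(f"unknown op: {event['op']}")
    return result


silver = {1: {"name": "Ada", "city": "London"}}
changes = [
    {"op": "update", "key": 1, "row": {"city": "Paris"}},
    {"op": "insert", "key": 2, "row": {"name": "Linus", "city": "Helsinki"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Grace", "city": "New York"}},
]
updated = apply_cdc(silver, changes)
```

In a lakehouse table format the same upsert-and-delete logic is typically expressed as a merge statement against the target table rather than as an in-memory loop.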

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - Deep dive into databases zero-ETL integrations (DAT445)

2025-12-06 · AWS re:Invent 2025 Watch

video

Agile/Scrum AI/ML Analytics AWS Aurora Amazon RDS Cloud Computing DynamoDB Redshift Amazon SageMaker

In this session, learn how AWS zero-ETL integrations remove the need to manage complex data movement pipelines across multiple source database engines and targets, so data engineers, architects, and DBAs can eliminate maintenance overhead while ensuring near real-time data availability for analytics and ML workloads. Examine the underlying architecture and how it works for the supported zero-ETL integrations from Amazon Aurora, Amazon DynamoDB, and Amazon RDS sources to Amazon Redshift, Amazon SageMaker, and Amazon OpenSearch Service targets, all without traditional ETL complexity. Dive into the data movement options, tunable settings, and how to monitor ongoing data movement.

AWS re:Invent 2025 - Fast-track to insights: AWS-SAP data strategy (ANT333)

2025-12-05 · AWS re:Invent 2025 Watch

video

Agile/Scrum AI/ML Analytics AWS Cloud Computing Amazon SageMaker SAP

This lightning talk showcases AWS and SAP's innovative solution for enterprise data integration challenges. Learn how to access data between SAP and AWS environments, eliminating complex ETL pipelines while maintaining business context. In this talk, we will demonstrate how to enable zero-ETL integration between SAP and Amazon SageMaker so that you can reduce time spent building data pipelines and focus on running unified analytics and AI/ML on all your data.

AWS re:Invent 2025 - Deep dive into Amazon Aurora and its innovations (DAT441)

2025-12-04 · AWS re:Invent 2025 Watch

video

Agile/Scrum AI/ML AWS Aurora Cloud Computing GenAI MySQL postgresql

With an innovative architecture that decouples compute from storage and advanced features like Amazon Aurora Global Database and low-latency read replicas, Aurora reimagines what it means to be a relational database. Aurora is a built-for-the-cloud, serverless relational database service that delivers unparalleled performance and availability at global scale for MySQL, PostgreSQL, and DSQL. In this session, dive deep into new features – and Aurora’s most popular offerings including serverless, I/O-Optimized, zero-ETL integrations, MCP integration, and generative AI support for vector search and storage. Also learn about the groundbreaking Aurora DSQL engine.

AWS re:Invent 2025 - Universal data connectivity with ETL and SQL queries (ANT209)

2025-12-03 · AWS re:Invent 2025 Watch

video

Agile/Scrum AI/ML Analytics AWS Cloud Computing GenAI SQL

Learn how AWS can help you with data integration and preparing data for analytics, machine learning (ML), and generative AI workloads. Explore new capabilities that enable your users to have controlled access to all relevant data, easily build and maintain scalable and resilient data pipelines, and enhance decision-making quality, all with exceptional price performance. See how zero-ETL and query federation can complement ETL and ELT data pipelines.

AWS re:Invent 2025 - Enterprise-scale ETL optimization for Apache Spark (ANT336)

2025-12-02 · AWS re:Invent 2025 Watch

video

Agile/Scrum AWS Amazon EMR AWS Glue Cloud Computing S3 Amazon SageMaker Spark

Apache Spark on AWS Glue, Amazon EMR, and Amazon SageMaker enhances the optimization of large-scale data processing workloads. These include faster read and write throughput, accelerated processing of common file formats, and expanded Amazon S3 support through the S3A protocol for greater flexibility in write operations. In this session, we'll explore recent enhancements in Spark for distributed computation and in-memory storage to enable efficient data aggregation and job optimization. We'll also demonstrate how these innovations, combined with Spark's native capabilities, strengthen governance and encryption to help you optimize performance while maintaining control and compliance. Join us to learn how to build unified, secure, and high-performance ETL pipelines on AWS using Spark.

Blurring Lines: Data, AI, and the New Playbook for Team Velocity

2025-11-24 · Data Engineering Podcast Listen

podcast_episode

by Maxime Beauchemin (Preset), Tobias Macey

AI/ML Cloud Computing Data Engineering Data Management Data Quality Datafold dbt Git Prefect Python SQL Data Streaming

Summary In this crossover episode, Max Beauchemin explores how multiplayer, multi‑agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just‑in‑time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard‑won practices from going “AI‑first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear — code review, QA, async coordination — when execution accelerates 2–10x. He also dives deep into Agor, his open‑source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub‑sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
Your host is Tobias Macey and today I'm interviewing Maxime Beauchemin about the impact of multi-player multi-agent engineering on individual and team velocity for building better data systems

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the types of work that you are relying on AI development agents for?
As you bring agents into the mix for software engineering, what are the bottlenecks that start to show up?
In my own experience there are a finite number of agents that I can manage in parallel. How does Agor help to increase that limit?
How does making multi-agent management a multi-player experience change the dynamics of how you apply agentic engineering workflows?

Contact Info

LinkedIn

Links

Agor
Apache Airflow
Apache Superset
Preset
Claude Code
Codex
Playwright MCP
Tmux
Git Worktrees
Opencode.ai
GitHub Codespaces
Ona

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Build real‑time analytics with Cosmos DB in Microsoft Fabric

2025-11-21 · Microsoft Ignite 2025

talk

by Mark Brown (Microsoft), Jasmine Greenaway (Microsoft)

Analytics Cosmos Microsoft Fabric Spark SQL

In this lab you'll help a coffee shop unify their operational and analytical workloads with Cosmos DB in Microsoft Fabric. You'll blend operational data with curated sources using cross-database SQL, stream and visualize real-time POS events, and create a gold layer for personalization. Finally, you'll implement reverse ETL to Cosmos for lightning-fast serving and train a lightweight Spark notebook model to deliver the right offer at the right time before your customer’s order is ready.

Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.

Build real‑time analytics with Cosmos DB in Microsoft Fabric

2025-11-20 · Microsoft Ignite 2025

talk

by Mark Brown (Microsoft), Jasmine Greenaway (Microsoft)

Analytics Cosmos Microsoft Fabric Spark SQL

In this lab you'll help a coffee shop unify their operational and analytical workloads with Cosmos DB in Microsoft Fabric. You'll blend operational data with curated sources using cross-database SQL, stream and visualize real-time POS events, and create a gold layer for personalization. Finally, you'll implement reverse ETL to Cosmos for lightning-fast serving and train a lightweight Spark notebook model to deliver the right offer at the right time before your customer’s order is ready.

Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.

Zero-copy data unification with Fabric Mirroring and Shortcuts

2025-11-20 · Microsoft Ignite 2025

theater

by Maraki Ketema (Microsoft), Matthew Hicks (Microsoft)

AI/ML BI Cloud Computing Microsoft Fabric

In this fast-paced theater session, discover how Microsoft Fabric simplifies data unification with Mirroring and Shortcuts. Instantly connect to any data source—cloud, on-prem, structured or unstructured—without duplication or ETL. Learn how OneLake and its catalog turn connected data into actionable insights for BI and AI at scale.

Connect to and manage any data, anywhere in Microsoft OneLake

2025-11-19 · Microsoft Ignite 2025 Watch

breakout

by Paul Purvis (Chevron), Adi Regev (Microsoft), Dipti Borkar (Microsoft), Clay Yeaman (Chevron), Joshua Caplan (Microsoft)

Microsoft Fabric Cyber Security

Frontier firms aiming to build custom agents and foster data-rich cultures need a single, unified access point for all their data that everyone can use. This session will explore how Microsoft OneLake can help you connect to any data, anywhere, without ETL or data duplication. We'll also show how Fabric users and admins alike can manage and secure that data using tools like the OneLake catalog and OneLake security.

Build real‑time analytics with Cosmos DB in Microsoft Fabric

2025-11-19 · Microsoft Ignite 2025

talk

by Mark Brown (Microsoft), Jasmine Greenaway (Microsoft)

Analytics Cosmos Microsoft Fabric Spark SQL

In this lab you'll help a coffee shop unify their operational and analytical workloads with Cosmos DB in Microsoft Fabric. You'll blend operational data with curated sources using cross-database SQL, stream and visualize real-time POS events, and create a gold layer for personalization. Finally, you'll implement reverse ETL to Cosmos for lightning-fast serving and train a lightweight Spark notebook model to deliver the right offer at the right time before your customer’s order is ready.

Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.

Build real‑time analytics with Cosmos DB in Microsoft Fabric

2025-11-18 · Microsoft Ignite 2025

talk

by Mark Brown (Microsoft), Jasmine Greenaway (Microsoft)

Analytics Cosmos Microsoft Fabric Spark SQL

In this lab you'll help a coffee shop unify their operational and analytical workloads with Cosmos DB in Microsoft Fabric. You'll blend operational data with curated sources using cross-database SQL, stream and visualize real-time POS events, and create a gold layer for personalization. Finally, you'll implement reverse ETL to Cosmos for lightning-fast serving and train a lightweight Spark notebook model to deliver the right offer at the right time before your customer’s order is ready.

Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.

State, Scale, and Signals: Rethinking Orchestration with Durable Execution

2025-11-16 · Data Engineering Podcast Listen

podcast_episode

by Preeti Somal (Temporal), Tobias Macey

AI/ML Airflow Cloud Computing Dagster Data Engineering Data Management Data Quality Datafold dbt Prefect Python RAG

Summary In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal’s code‑first programming model (workflows, activities, task queues, and replay) and how it eliminates hand‑rolled retry, checkpoint, and error‑handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.
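
To make the replay idea in the summary above more concrete, here is a toy sketch of durable execution: each completed step's result is appended to a history, and a restarted workflow replays recorded results instead of re-running those steps. This is not Temporal's API. The `DurableRun` class, the `workflow` function, and the in-memory `history` list are hypothetical stand-ins; a real system would persist the history durably and would require workflow code to be deterministic.

```python
from typing import Any, Callable, Optional


class DurableRun:
    """Toy durable execution: completed step results are recorded in a history
    and replayed instead of re-executed when the workflow is restarted."""

    def __init__(self, history: Optional[list[Any]] = None) -> None:
        self.history: list[Any] = history if history is not None else []
        self._cursor = 0

    def step(self, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.history):
            result = self.history[self._cursor]  # replay the recorded result
        else:
            result = fn()  # first execution: run the step and record it
            self.history.append(result)
        self._cursor += 1
        return result


calls: list[str] = []  # tracks which side-effecting steps actually ran


def extract() -> list[int]:
    calls.append("extract")
    return [1, 2, 3]


def workflow(run: DurableRun, crash_before_load: bool) -> str:
    rows = run.step(extract)
    total = run.step(lambda: sum(rows))
    if crash_before_load:
        raise RuntimeError("simulated worker crash")
    return run.step(lambda: f"loaded total={total}")


history: list[Any] = []  # stands in for a durable event store
try:
    workflow(DurableRun(history), crash_before_load=True)
except RuntimeError:
    pass  # the worker died; the recorded history survives

# The restarted run replays the first two steps and executes only the last one.
result = workflow(DurableRun(history), crash_before_load=False)
```

The point of the sketch is that `extract` runs once even though the workflow function runs twice, which is the kind of retry and checkpoint scaffolding the summary says no longer has to be hand-rolled.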

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
Your host is Tobias Macey and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what durable execution is and how it impacts system architecture?
With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work?
One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.?
What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects?
Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data?
AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications?
What are some of the conceptual hurdles that teams experience when they are starting to adopt Temporal or other durable execution frameworks?
What are the most interesting, innovative, or unexpected ways that you have seen Temporal/durable execution used for data/AI services?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Temporal?
When is Temporal/durable execution the wrong choice?
What do you have planned for the future of Temporal for data and AI systems?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Temporal
Durable Execution
Flink
Machine Learning Epoch
Spark Streaming
Airflow
Directed Acyclic Graph (DAG)
Temporal Nexus
TensorZero
AI Engineering Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

The AI Data Paradox: High Trust in Models, Low Trust in Data

2025-11-09 · Data Engineering Podcast Listen

podcast_episode

by Ariel Pohoryles (Rivery), Tobias Macey

AI/ML Analytics Cloud Computing Data Engineering Data Management Data Quality Datafold dbt GenAI Marketing Master Data Management Prefect

Summary In this episode of the Data Engineering Podcast Ariel Pohoryles, head of product marketing for Boomi's data management offerings, talks about a recent survey of 300 data leaders on how organizations are investing in data to scale AI. He shares a paradox uncovered in the research: while 77% of leaders trust the data feeding their AI systems, only 50% trust their organization's data overall. Ariel explains why truly productionizing AI demands broader, continuously refreshed data with stronger automation and governance, and highlights the challenges posed by unstructured data and vector stores. The conversation covers the need to shift from manual reviews to automated pipelines, the resurgence of metadata and master data management, and the importance of guardrails, traceability, and agent governance. Ariel also predicts a growing convergence between data teams and application integration teams and advises leaders to focus on high-value use cases, aggressive pipeline automation, and cataloging and governing the coming sprawl of AI agents, all while using AI to accelerate data engineering itself.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.

Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about data management investments that organizations are making to enable them to scale AI implementations.

Interview

Introduction

How did you get involved in the area of data management?

Can you start by describing the motivation and scope of your recent survey on data management investments for AI across your respondents?

What are the key takeaways that were most significant to you?

The survey reveals a fascinating paradox: 77% of leaders trust the data used by their AI systems, yet only half trust their organization's overall data quality. For our data engineering audience, what does this suggest about how companies are currently sourcing data for AI? Does it imply they are using narrow, manually-curated "golden datasets," and what are the technical challenges and risks of that approach as they try to scale?

The report highlights a heavy reliance on manual data quality processes, with one expert noting companies feel it's "not reliable to fully automate validation" for external or customer data. At the same time, maturity in "Automated tools for data integration and cleansing" is low, at only 42%. What specific technical hurdles or organizational inertia are preventing teams from adopting more automation in their data quality and integration pipelines?

There was a significant point made that with generative AI, "biases can scale much faster," making automated governance essential. From a data engineering perspective, how does the data management strategy need to evolve to support generative AI versus traditional ML models? What new types of data quality checks, lineage tracking, or monitoring for feedback loops are required when the model itself is generating new content based on its own outputs?

The report champions a "centralized data management platform" as the "connective tissue" for reliable AI. How do you see the scale and data maturity impacting the realities of that effort?

How do architectural patterns in the shape of cloud warehouses, lakehouses, data mesh, data products, etc. factor into that need for centralized/unified platforms?

A surprising finding was that a third of respondents have not fully grasped the risk of significant inaccuracies in their AI models if they fail to prioritize data management. In your experience, what are the biggest blind spots for data and analytics leaders?

Looking at the maturity charts, companies rate themselves highly on "Developing a data management strategy" (65%) but lag significantly in areas like "Automated tools for data integration and cleansing" (42%) and "Conducting bias-detection audits" (24%). If you were advising a data engineering team lead based on these findings, what would you tell them to prioritize in the next 6-12 months to bridge the gap between strategy and a truly scalable, trustworthy data foundation for AI?

The report states that 83% of companies expect to integrate more data sources for their AI in the next year. For a data engineer on the ground, what is the most important capability they need to build into their platform to handle this influx?

What are the most interesting, innovative, or unexpected ways that you have seen teams addressing the new and accelerated data needs for AI applications?

What are some of the noteworthy trends or predictions that you have for the near-term future of the impact that AI is having or will have on data teams and systems?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.

Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.

If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

Boomi

Data Management

Integration & Automation Demo

Agentstudio

Data Connector Agent Webinar

Survey Results

Data Governance

Shadow IT

Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Keep it Simple and "Scalable": pythonic Extract, Load, Transform (ELT) using dltHub

2025-11-04 · Small Data SF 2025

workshop

by Elvis Kahoro (Chalk), Brian Douglas (Continue), Thierry Jean (dltHub)

AI/ML Data Quality LLM Python

Get ready to ingest data and transform it into ready-to-use datasets using Python. We'll share a no-nonsense approach for developing and testing data connectors and transformations locally. Moving to production will be a matter of tweaking your configuration. In the end, you get a simple dataset interface to build dashboards and applications, train predictive models, or create agentic workflows. This workshop includes two guest speakers. Brian will teach how to leverage AI IDEs, MCP servers, and LLM scaffolding to create ingestion pipelines. Elvis will show how to interactively define transformations and data quality checks.

The 2025 MAD Landscape w/ Matt Turck

2025-11-04 · The Joe Reis Show

podcast_episode

by Matt Turck (FirstMark Capital), Joe Reis (DeepLearning.AI)

AI/ML CDP

Matt Turck (VC at FirstMark) joins the show to break down the most controversial MAD (Machine Learning, AI, and Data) Landscape yet. This year, the team "declared bankruptcy" and cut over 1,000 logos to better reflect the market reality: a "Cambrian explosion" of AI companies and a fierce "struggle and tension between the very large companies and the startups".

Matt discusses why incumbents are "absolutely not lazy", which categories have "largely just gone away" (like Customer Data Platforms and Reverse ETL), and what new categories (like AI Agents and Local AI) are emerging. We also cover his investment thesis in a world dominated by foundation models, the "very underestimated" European AI scene, and whether an AI could win a Nobel Prize by 2027.

https://www.mattturck.com/mad2025
