talk-data.com

Topic

Data Management

data_governance data_quality metadata_management

1097 tagged

Activity Trend

Peak of 88 activities per quarter, 2020-Q1 to 2026-Q1

Activities

1097 activities · Newest first

Engineering Lakehouses with Open Table Formats

Engineering Lakehouses with Open Table Formats introduces the architecture and capabilities of open table formats like Apache Iceberg, Apache Hudi, and Delta Lake. The book guides you through the design, implementation, and optimization of lakehouses that can handle modern data processing requirements effectively, with practical real-world insights.

What this book will help me do:
- Understand the fundamentals of open table formats and their benefits in lakehouse architecture.
- Learn how to implement performant data processing using tools like Apache Spark and Flink.
- Master advanced topics like indexing, partitioning, and interoperability between data formats.
- Explore data lifecycle management and integration with frameworks like Apache Airflow and dbt.
- Build secure lakehouses with regulatory compliance using best practices detailed in the book.

Author(s): Dipankar Mazumdar and Vinoth Govindarajan are seasoned professionals with extensive experience in big data processing and software architecture. They bring their expertise from working with data lakehouses and are known for their ability to explain complex technical concepts clearly. Their collaborative approach brings valuable insights into the latest trends in data management.

Who is it for? This book is ideal for data engineers, architects, and software professionals aiming to master modern lakehouse architectures. If you are familiar with data lakes or warehouses and wish to transition to an open data architectural design, this book is suited for you. Readers should have basic knowledge of databases, Python, and Apache Spark for the best experience.
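One of the topics listed above, partitioning, can be illustrated with a small sketch. This is a hedged, stdlib-only Python illustration of the hive-style partition layout that table formats build on (the column names and file names are invented for the example; real Iceberg, Hudi, and Delta tables add manifest and metadata files on top of the data layout):

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

def write_partitioned(rows, root, partition_key):
    """Write rows as CSV files under hive-style partition directories,
    e.g. root/event_date=2024-01-01/part-0.csv."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    for value, group in groups.items():
        part_dir = Path(root) / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0]))
            writer.writeheader()
            writer.writerows(group)

def read_partition(root, partition_key, value):
    """Partition pruning: only the matching directory is scanned,
    so a date-filtered query never touches other dates' files."""
    part_dir = Path(root) / f"{partition_key}={value}"
    out = []
    for path in sorted(part_dir.glob("*.csv")):
        with open(path, newline="") as f:
            out.extend(csv.DictReader(f))
    return out

rows = [
    {"event_date": "2024-01-01", "user": "a"},
    {"event_date": "2024-01-01", "user": "b"},
    {"event_date": "2024-01-02", "user": "c"},
]
root = tempfile.mkdtemp()
write_partitioned(rows, root, "event_date")
print([r["user"] for r in read_partition(root, "event_date", "2024-01-01")])  # ['a', 'b']
```

The same pruning idea is what makes partition columns a central design decision in lakehouse tables: a query engine can skip whole directories (or, in real formats, whole data files via metadata) without opening them.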

Building a Data and AI Platform with PostgreSQL

In a world where data sovereignty, scalability, and AI innovation are at the forefront of enterprise strategy, PostgreSQL is emerging as the key to unlocking transformative business value. This new guide serves as your beacon for navigating the convergence of AI, open source technologies, and intelligent data platforms. Authors Tom Taulli, Benjamin Anderson, and Jozef de Vries offer a strategic and practical approach to building AI and data platforms that balance innovation with governance, empowering organizations to take control of their data future. Whether you're designing frameworks for advanced AI applications, modernizing legacy infrastructures, or solving data challenges at scale, you can use this guide to bridge the gap between technical complexity and actionable strategy. Written for IT executives, data leaders, and practitioners alike, it will equip you with the tools and insights to harness PostgreSQL's unique capabilities—extensibility, unstructured data management, and hybrid workloads—for long-term success in an AI-driven world.

- Learn how to build an AI and data platform using PostgreSQL
- Overcome data challenges like modernization, integration, and governance
- Optimize AI performance with model fine-tuning and retrieval-augmented generation (RAG) best practices
- Discover use cases that align data strategy with business goals
- Take charge of your data and AI future with this comprehensive and accessible roadmap

Keynote by Lisa Amini - What’s Next in AI for Data and Data Management?

Advances in large language models (LLMs) have propelled a recent flurry of AI tools for data management and operations. For example, AI-powered code assistants leverage LLMs to generate code for dataflow pipelines. RAG pipelines enable LLMs to ground responses with relevant information from external data sources. Data agents leverage LLMs to turn natural language questions into data-driven answers and actions. While challenges remain, these advances are opening exciting new opportunities for data scientists and engineers. In this talk, we will examine recent advances, along with some still incubating in research labs, with the goal of understanding where this is all heading, and present our perspective on what’s next for AI in data management and data operations.
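The RAG pattern described above can be sketched minimally: retrieve the most relevant documents, then ground the model's prompt in them. This is a hedged, stdlib-only illustration (the toy corpus, the bag-of-words retriever, and the prompt template are all invented for the example; production RAG pipelines use learned embeddings and a vector store rather than word counts):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank documents by similarity to the query and return the top k."""
    q = Counter(query.lower().split())
    ranked = sorted(corpus, key=lambda d: cosine(q, Counter(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Ground the model's answer in retrieved context (template is illustrative)."""
    context = "\n".join(retrieve(query, corpus, k=2))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The orders table is partitioned by order_date.",
    "The marketing team owns the campaigns dashboard.",
    "Data quality checks run nightly against the orders table.",
]
print(build_prompt("how is the orders table partitioned", corpus))
```

The same shape scales up directly: swap the word-count similarity for embedding similarity and the list for an index, and the retrieve-then-prompt loop is unchanged.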

AWS re:Invent 2025 - Data Processing architectures for building AI solutions (ANT328)

Prepare to revolutionize your data infrastructure for the AI era with Amazon EMR, AWS Glue, and Amazon Athena. This session will guide you through leveraging these powerful AWS services to construct robust, scalable data architectures that empower AI solutions at scale. Gain insights into effective architectural strategies for data processing to build AI applications, optimizing for cost-efficiency and security. Explore architectural frameworks that underpin successful AI-driven data initiatives, and learn from field lessons on how to navigate modernization projects. Whether you’re starting your modernization journey or refining current setups, this session offers practical strategies to fast-track your organization towards achieving excellence in AI-powered data management.

Learn more: More AWS events: https://go.aws/3kss9CP

Subscribe: More AWS videos: http://bit.ly/2O3zS75 More AWS events videos: http://bit.ly/316g9t4

ABOUT AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.

#AWSreInvent #AWSreInvent2025 #AWS

AWS re:Invent 2025 - How Samsung optimized 1.2 PB on Amazon DynamoDB with zero downtime (DAT323)

Join Samsung Cloud to learn how they successfully implemented a massive-scale table optimization, handling 1.2 PB on Amazon DynamoDB with zero downtime. Follow the innovative journey behind Samsung Internet's tab sync feature, showcasing how the team developed an in-house solution to handle complex data transfers, achieving substantial cost savings without third-party dependencies. Learn how this implementation resulted in the creation of a new standard framework that now serves as Samsung's blueprint for future data management initiatives. Gain practical insights into building scalable solutions through deep service domain understanding and discover actionable strategies to tackle similar challenges in your organization.
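The abstract does not spell out Samsung's mechanism, but zero-downtime table migrations are commonly built on a dual-write-plus-backfill pattern. A hedged, generic Python sketch of that pattern (the `Table` class, key names, and record shapes are invented stand-ins, not DynamoDB API calls):

```python
class Table:
    """Toy stand-in for a key-value table such as a DynamoDB table."""
    def __init__(self):
        self.items = {}
    def put(self, key, value):
        self.items[key] = value
    def get(self, key):
        return self.items.get(key)

def migrate_with_dual_writes(old, new, live_writes):
    """Phase 1: dual-write every live mutation to both tables, so the new
    table never falls behind on fresh data.
    Phase 2: backfill historical items the new table is still missing.
    Reads can then cut over to the new table with no write downtime."""
    for key, value in live_writes:        # traffic arriving mid-migration
        old.put(key, value)
        new.put(key, value)
    for key, value in old.items.items():  # backfill anything not yet copied
        if new.get(key) is None:
            new.put(key, value)

old = Table()
old.put("tab:1", {"url": "a.example"})    # pre-existing data
old.put("tab:2", {"url": "b.example"})
new = Table()
migrate_with_dual_writes(old, new, live_writes=[("tab:3", {"url": "c.example"})])
print(sorted(new.items))  # all three keys now present in the new table
```

At real scale the backfill runs as a throttled parallel scan and the dual-write phase is feature-flagged, but the ordering (dual-write first, backfill second, read cutover last) is what makes the downtime zero.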


AWS re:Invent 2025 - A practitioner’s guide to data for agentic AI (DAT315)

In this session, gain the skills needed to deploy end-to-end agentic AI applications using your most valuable data. This session focuses on data management using processes like Model Context Protocol (MCP) and Retrieval Augmented Generation (RAG), and provides concepts that apply to other methods of customizing agentic AI applications. Discover best practice architectures using AWS database services like Amazon Aurora and OpenSearch Service, along with the analytical, data processing, and streaming experiences found in SageMaker Unified Studio. Learn data lake, governance, and data quality concepts, and how Amazon Bedrock AgentCore, Bedrock Knowledge Bases, and other features tie solution components together.


AWS re:Invent 2025 - What’s new in search, observability, and vector databases w/ OpenSearch (ANT201)

Discover the latest Amazon OpenSearch Service launches and capabilities that enable you to quickly deploy agentic AI applications and vector search operations. Learn how new integrations with Amazon Q enable intelligent data discovery and automated insights, while enhanced Amazon S3 connectivity streamlines data management. This session showcases how our latest vector database optimizations accelerate AI/ML workloads for efficient development of agentic AI, semantic search, and recommendation systems. We'll demonstrate new cost optimization features and performance enhancements across all OpenSearch use cases, including significant updates to Observability. Whether you're building next-generation AI applications or scaling your existing search infrastructure, join us for a comprehensive update on new launches and releases that can transform your search and analytics capabilities.


Most organisations don't struggle with change because of strategy or technology; they struggle because change is fundamentally human. In this episode of Hub & Spoken, Jason Foster, CEO & Founder of Cynozure, speaks with Sunil Kumar, Chief Transformation Officer, to explore why transformation so often stalls and what leaders can do to make it stick. Drawing on more than 26 years working across airlines, telecoms, finance and FMCG, Sunil explains why context, such as geopolitics, customer behaviour, industry shifts and internal culture, is the deciding factor in how change lands. When leaders ignore that context, resistance and fatigue follow.

Jason and Sunil discuss the human realities behind change, including:
- Why people naturally resist it
- How values and beliefs influence adoption
- Why narrative and excitement matter more than familiar project metrics

Sunil also shares his practical "push, pull, connect" model for building momentum and why adoption, not go-live, should be the true measure of success. 🎧 Listen to the full episode now

Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. It works with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and data leadership. The company was named one of The Sunday Times' fastest-growing private companies in both 2022 and 2023 and recognised as The Best Place to Work in Data by DataIQ in 2023 and 2024. Cynozure is a certified B Corporation.

Summary In this crossover episode, Max Beauchemin explores how multiplayer, multi‑agent engineering is transforming the way individuals and teams build data and AI systems. He digs into the shifting boundary between data and AI engineering, the rise of “context as code,” and how just‑in‑time retrieval via MCP and CLIs lets agents gather what they need without bloating context windows. Max shares hard‑won practices from going “AI‑first” for most tasks, where humans focus on orchestration and taste, and the new bottlenecks that appear — code review, QA, async coordination — when execution accelerates 2–10x. He also dives deep into Agor, his open‑source agent orchestration platform: a spatial, multiplayer workspace that manages Git worktrees and live dev environments, templatizes prompts by workflow zones, supports session forking and sub‑sessions, and exposes an internal MCP so agents can schedule, monitor, and even coordinate other agents.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed: flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI engineering, streaming: Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.

Your host is Tobias Macey and today I'm interviewing Maxime Beauchemin about the impact of multi-player multi-agent engineering on individual and team velocity for building better data systems.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the types of work that you are relying on AI development agents for?
- As you bring agents into the mix for software engineering, what are the bottlenecks that start to show up?
- In my own experience there are a finite number of agents that I can manage in parallel. How does Agor help to increase that limit?
- How does making multi-agent management a multi-player experience change the dynamics of how you apply agentic engineering workflows?

Contact Info
- LinkedIn

Links
- Agor
- Apache Airflow
- Apache Superset
- Preset
- Claude Code
- Codex
- Playwright MCP
- Tmux
- Git Worktrees
- Opencode.ai
- GitHub Codespaces
- Ona

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
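The "just-in-time retrieval" idea from the episode summary, where agents fetch what they need through tools instead of having everything preloaded into the context window, can be sketched generically. A hedged, stdlib-only Python illustration (the tool registry, tool names, and agent loop are invented for the example; real MCP servers expose tools over a JSON-RPC protocol):

```python
# Minimal tool registry: the agent calls tools on demand instead of
# having every document stuffed into its context window up front.
TOOLS = {}

def tool(name):
    """Decorator registering a function as a callable tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("list_tables")
def list_tables():
    return ["orders", "users"]

@tool("table_schema")
def table_schema(table):
    schemas = {"orders": ["id", "user_id", "total"], "users": ["id", "email"]}
    return schemas[table]

def agent_step(request):
    """One step of a toy agent: dispatch a tool call and record only
    the result, keeping the running context small."""
    name, args = request
    result = TOOLS[name](*args)
    return f"{name}{args!r} -> {result!r}"

context = []  # grows only with what was actually fetched
for request in [("list_tables", ()), ("table_schema", ("orders",))]:
    context.append(agent_step(request))
print(context)
```

The payoff is exactly the one Max describes: context holds two short tool results rather than every schema in the warehouse, so the window stays small no matter how large the underlying data estate is.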

Unexpected spikes in log volume left Yale New Haven Health, Connecticut's largest healthcare system, struggling with soaring SIEM costs and operational inefficiencies. In this session, hear how the team used Cribl to cut log volume by 40%, centralize data telemetry from 30k+ endpoints, and accelerate migration to Microsoft Sentinel in just two weeks. Learn how to control costs, streamline operations, and strengthen security posture with a modern telemetry data management strategy.

As security data grows, it is critical for organizations to balance business outcomes and budgets. In this session, discover how Microsoft Sentinel data lake simplifies data management, accelerates AI adoption and optimizes costs. Serving as the foundation of the Microsoft Sentinel platform, it unifies all security data for greater visibility, deeper analysis, and contextual awareness. We’ll share real-world examples, tools, and best practices to help you maximize Microsoft Sentinel value.

Masterclass by Amy Hodler (GraphGeeks.org)

Masterclasses led by Ben Gardner, Martin O’Hanlon, Paco Nathan and Amy Hodler covering ontology-based data management, multimodal GraphRAG and building high-quality knowledge graphs.

Discover how Pure partners with Microsoft to deliver a transformative, future-ready data management approach that empowers customers to innovate and efficiently scale from on-premises to cloud. This session will highlight Pure's latest capabilities and product offerings, specifically focusing on advancements in our SQL Server 2025 integration for FlashArray and Pure Storage Cloud Azure Native.

Summary In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal’s code-first programming model—workflows, activities, task queues, and replay—and how it eliminates hand-rolled retry, checkpoint, and error-handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what durable execution is and how it impacts system architecture?
- With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work?
- One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.?
- What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects?
- Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data?
- AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications?
- What are some of the conceptual hurdles that teams experience when they are starting to adopt Temporal or other durable execution frameworks?
- What are the most interesting, innovative, or unexpected ways that you have seen Temporal/durable execution used for data/AI services?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Temporal?
- When is Temporal/durable execution the wrong choice?
- What do you have planned for the future of Temporal for data and AI systems?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- Temporal
- Durable Execution
- Flink
- Machine Learning Epoch
- Spark Streaming
- Airflow
- Directed Acyclic Graph (DAG)
- Temporal Nexus
- TensorZero
- AI Engineering Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
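The replay model discussed in this episode, where a workflow's completed steps are recorded in an event history so they are not re-executed after a crash, can be sketched generically. A hedged, stdlib-only Python illustration (the event-log shape and activity names are invented; Temporal's real SDK persists history server-side and replays deterministic workflow code against it):

```python
completed = {}   # durable event log: activity name -> recorded result
calls = []       # which activities actually executed (for demonstration)

def activity(name, fn):
    """Run fn at most once, durably. On replay, return the recorded
    result instead of re-executing the side effect."""
    if name in completed:
        return completed[name]
    result = fn()
    calls.append(name)
    completed[name] = result   # persist the result before moving on
    return result

def workflow():
    """A tiny ETL-shaped workflow built from durable activities."""
    raw = activity("extract", lambda: [3, 1, 2])
    ordered = activity("transform", lambda: sorted(raw))
    return activity("load", lambda: f"loaded {len(ordered)} rows")

first = workflow()
second = workflow()   # simulated replay after a crash: no activity re-runs
print(first, second, calls)
```

This is the scaffolding durable execution removes from application code: the retry, checkpoint, and resume logic lives in the runtime, while the workflow reads as straight-line code.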

In this episode of Hub & Spoken, Jason Foster, CEO and Founder of Cynozure, speaks with Shachar Meir, a data advisor who has worked with organisations from startups to the likes of Meta and PayPal. Together, they explore why so many companies, even those with skilled data teams, solid platforms and plenty of data, still struggle to deliver real business value. Shachar's take is clear: the problem isn't technology - it's people, process, and culture. Too often, data teams focus on building sophisticated platforms instead of understanding the business problems they're meant to solve. His summary: why guess when you can know? This episode is a practical conversation for anyone looking to move their organisation from data chaos to data clarity. 🎧 Listen now to discover how clarity beats complexity in data strategy. Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. It works with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and data leadership. The company was named one of The Sunday Times' fastest-growing private companies in both 2022 and 2023 and recognised as The Best Place to Work in Data by DataIQ in 2023 and 2024. Cynozure is a certified B Corporation.

Summary In this episode of the Data Engineering Podcast Ariel Pohoryles, head of product marketing for Boomi's data management offerings, talks about a recent survey of 300 data leaders on how organizations are investing in data to scale AI. He shares a paradox uncovered in the research: while 77% of leaders trust the data feeding their AI systems, only 50% trust their organization's data overall. Ariel explains why truly productionizing AI demands broader, continuously refreshed data with stronger automation and governance, and highlights the challenges posed by unstructured data and vector stores. The conversation covers the need to shift from manual reviews to automated pipelines, the resurgence of metadata and master data management, and the importance of guardrails, traceability, and agent governance. Ariel also predicts a growing convergence between data teams and application integration teams and advises leaders to focus on high-value use cases, aggressive pipeline automation, and cataloging and governing the coming sprawl of AI agents, all while using AI to accelerate data engineering itself.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about data management investments that organizations are making to enable them to scale AI implementations.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing the motivation and scope of your recent survey on data management investments for AI across your respondents?
- What are the key takeaways that were most significant to you?
- The survey reveals a fascinating paradox: 77% of leaders trust the data used by their AI systems, yet only half trust their organization's overall data quality. For our data engineering audience, what does this suggest about how companies are currently sourcing data for AI? Does it imply they are using narrow, manually-curated "golden datasets," and what are the technical challenges and risks of that approach as they try to scale?
- The report highlights a heavy reliance on manual data quality processes, with one expert noting companies feel it's "not reliable to fully automate validation" for external or customer data. At the same time, maturity in "Automated tools for data integration and cleansing" is low, at only 42%. What specific technical hurdles or organizational inertia are preventing teams from adopting more automation in their data quality and integration pipelines?
- There was a significant point made that with generative AI, "biases can scale much faster," making automated governance essential. From a data engineering perspective, how does the data management strategy need to evolve to support generative AI versus traditional ML models? What new types of data quality checks, lineage tracking, or monitoring for feedback loops are required when the model itself is generating new content based on its own outputs?
- The report champions a "centralized data management platform" as the "connective tissue" for reliable AI. How do you see scale and data maturity impacting the realities of that effort?
- How do architectural patterns in the shape of cloud warehouses, lakehouses, data mesh, data products, etc. factor into that need for centralized/unified platforms?
- A surprising finding was that a third of respondents have not fully grasped the risk of significant inaccuracies in their AI models if they fail to prioritize data management. In your experience, what are the biggest blind spots for data and analytics leaders?
- Looking at the maturity charts, companies rate themselves highly on "Developing a data management strategy" (65%) but lag significantly in areas like "Automated tools for data integration and cleansing" (42%) and "Conducting bias-detection audits" (24%). If you were advising a data engineering team lead based on these findings, what would you tell them to prioritize in the next 6-12 months to bridge the gap between strategy and a truly scalable, trustworthy data foundation for AI?
- The report states that 83% of companies expect to integrate more data sources for their AI in the next year. For a data engineer on the ground, what is the most important capability they need to build into their platform to handle this influx?
- What are the most interesting, innovative, or unexpected ways that you have seen teams addressing the new and accelerated data needs for AI applications?
- What are some of the noteworthy trends or predictions that you have for the near-term future of the impact that AI is having or will have on data teams and systems?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- Boomi
- Data Management
- Integration & Automation Demo
- Agentstudio
- Data Connector Agent Webinar
- Survey Results
- Data Governance
- Shadow IT
- Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

For years, data engineering was a story of predictable "pipelines": move data from point A to point B. But AI just hit the reset button on our entire field. Now, we're all staring into the void, wondering what's next. The fundamentals haven't changed: data governance, data management, and data modeling still present challenges. Everything else is up for grabs. This talk will cut through the noise and explore the future of data engineering in an AI-driven world. We'll examine how team structures will evolve, why agentic workflows and real-time systems are becoming non-negotiable, and how our focus must shift from building dashboards and analytics to architecting for automated action. The reset button has been pushed. It's time for us to invent the future of our industry.

Summary In this episode of the Data Engineering Podcast Omri Lifshitz (CTO) and Ido Bronstein (CEO) of Upriver talk about the growing gap between AI's demand for high-quality data and organizations' current data practices. They discuss why AI accelerates both the supply and demand sides of data, highlighting that the bottleneck lies in the "middle layer" of curation, semantics, and serving. Omri and Ido outline a three-part framework for making data usable by LLMs and agents: collect, curate, and serve. They share the challenges of scaling from POCs to production, including compounding error rates and reliability concerns. They also explore organizational shifts, patterns for managing context windows, pragmatic views on schema choices, and Upriver's approach to building autonomous data workflows using determinism and LLMs at the right boundaries. The conversation concludes with a look ahead to AI-first data platforms where engineers supervise business semantics while automation stitches technical details end-to-end.
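The compounding-error concern mentioned above can be illustrated with a quick back-of-the-envelope calculation: if each step in a chained agent workflow succeeds independently with some probability, end-to-end reliability decays geometrically with the number of steps. A minimal sketch (the per-step accuracy and step counts here are illustrative assumptions, not figures from the episode):

```python
def end_to_end_reliability(p: float, n_steps: int) -> float:
    """Probability that a chain of n_steps succeeds end-to-end,
    assuming each step independently succeeds with probability p."""
    return p ** n_steps

# A 95%-accurate step looks acceptable in a single-step POC...
print(end_to_end_reliability(0.95, 1))   # 0.95
# ...but ten chained steps drop end-to-end reliability below 60%.
print(round(end_to_end_reliability(0.95, 10), 3))
```

This is why a quality level that passes a proof of concept can still be far too low for a production pipeline built from many dependent stages.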

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.

Your host is Tobias Macey and today I'm interviewing Omri Lifshitz and Ido Bronstein about the challenges of keeping up with the demand for data when supporting AI systems.

Interview
Introduction
How did you get involved in the area of data management?
We're here to talk about "The Growing Gap Between Data & AI". From your perspective, what is this gap, and why do you think it's widening so rapidly right now?
How does this gap relate to the founding story of Upriver? What problems were you and your co-founders experiencing that led you to build this?
The core premise of new AI tools, from RAG pipelines to LLM agents, is that they are only as good as the data they're given. How does this "garbage in, garbage out" problem change when the "in" is not a static file but a complex, high-velocity, and constantly changing data pipeline?
Upriver is described as an "intelligent agent system" and an "autonomous data engineer." This is a fascinating "AI to solve for AI" approach. Can you describe this agent-based architecture and how it specifically works to bridge that data-AI gap?
Your website mentions a "Data Context Layer" that turns "tribal knowledge" into a "machine-usable mode." This sounds critical for AI. How do you capture that context, and how does it make data "AI-ready" in a way that a traditional data catalog or quality tool doesn't?
What are the most innovative or unexpected ways you've seen companies trying to make their data "AI-ready"? And where are the biggest points of failure you observe?
What has been the most challenging or unexpected lesson you've learned while building an AI system (Upriver) that is designed to fix the data foundation for other AI systems?
When is an autonomous, agent-based approach not the right solution for a team's data quality problems? What organizational or technical maturity is required to even start closing this data-AI gap?
What do you have planned for the future of Upriver? And looking more broadly, how do you see this gap between data and AI evolving over the next few years?

Contact Info
Ido - LinkedIn
Omri - LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
Upriver
RAG == Retrieval Augmented Generation
AI Engineering Podcast Episode
AI Agent
Context Window
Model Finetuning

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this episode of Hub & Spoken, Jason Foster, CEO and Founder of Cynozure, speaks with Roberto Maranca, data & digital transformation expert and author of Data Excellence. They explore what it really means to build a 'data fit' organisation, one that treats data capability like physical fitness: understanding where you are, training for where you want to be, and making improvement a daily routine. Drawing from ancient philosophy and modern business, Roberto explains how concepts from Socrates and Aristotle can help leaders rethink culture, value and human responsibility in an AI-driven world.

Together, they discuss how organisations can:
Shift from seeing data as a tech issue to a leadership mindset
Build collective intelligence and cultural readiness
Stay human in the age of intelligent machines

Cynozure is a leading data, analytics and AI company that helps organisations to reach their data potential. It works with clients on data and AI strategy, data management, data architecture and engineering, analytics and AI, data culture and literacy, and data leadership. The company was named one of The Sunday Times' fastest-growing private companies in both 2022 and 2023 and recognised as The Best Place to Work in Data by DataIQ in 2023 and 2024. Cynozure is a certified B Corporation.

Mastering Snowflake DataOps with DataOps.live: An End-to-End Guide to Modern Data Management

This practical, in-depth guide shows you how to build modern, sophisticated data processes using the Snowflake platform and DataOps.live, the only platform that enables seamless DataOps integration with Snowflake. Designed for data engineers, architects, and technical leaders, it bridges the gap between DataOps theory and real-world implementation, helping you take control of your data pipelines to deliver more efficient, automated solutions.

You'll explore the core principles of DataOps and how they differ from traditional DevOps, while gaining a solid foundation in the tools and technologies that power modern data management, including Git, dbt, and Snowflake. Through hands-on examples and detailed walkthroughs, you'll learn how to implement your own DataOps strategy within Snowflake and maximize the power of DataOps.live to scale and refine your DataOps processes. Whether you're just starting with DataOps or looking to refine and scale your existing strategies, this book, complete with practical code examples and starter projects, provides the knowledge and tools you need to streamline data operations, integrate DataOps into your Snowflake infrastructure, and stay ahead of the curve in the rapidly evolving world of data management.

What You Will Learn
Explore the fundamentals of DataOps, its differences from DevOps, and its significance in modern data management
Understand Git's role in DataOps and how to use it effectively
Know why dbt is preferred for DataOps and how to apply it
Set up and manage DataOps.live within the Snowflake ecosystem
Apply advanced techniques to scale and evolve your DataOps strategy

Who This Book Is For
Snowflake practitioners, including data engineers, platform architects, and technical managers, who are ready to implement DataOps principles and streamline complex data workflows using DataOps.live.