talk-data.com

Topic: ETL/ELT

Tags: data_integration, data_transformation, data_loading

480 activities tagged

Activity Trend: peaking at 40 activities per quarter, 2020-Q1 through 2026-Q1

Activities

480 activities · Newest first

In this session, Paul Wilkinson, Principal Solutions Architect at Redpanda, will demonstrate Redpanda's native Iceberg capability: a game-changing addition that bridges the gap between real-time streaming and analytical workloads, eliminating the complexity of traditional data lake architectures while maintaining the performance and simplicity that Redpanda is known for.

In a follow-along demo, Paul will explore how this new capability enables organizations to seamlessly move streaming data into analytical formats without complex ETL pipelines or additional infrastructure overhead - allowing you to build your own streaming lakehouse and show it to your team!

Summary

In this episode of the AI Engineering Podcast Marc Brooker, VP and Distinguished Engineer at AWS, talks about how agentic workflows are transforming database usage and infrastructure design. He discusses the evolving role of data in AI systems, from traditional models to more modern approaches like vectors, RAG, and relational databases. Marc explains why agents require serverless, elastic, and operationally simple databases, and how AWS solutions like Aurora and DSQL address these needs with features such as rapid provisioning, automated patching, geodistribution, and support for spiky usage patterns. The conversation covers topics including tool calling, improved model capabilities, state in agents versus stateless LLM calls, and the role of Lambda and AgentCore for long-running, session-isolated agents. Marc also touches on the shift from local MCP tools to secure, remote endpoints, the rise of object storage as a durable backplane, and the need for better identity and authorization models. The episode highlights real-world patterns like agent-driven SQL fuzzing and plan analysis, while identifying gaps in simplifying data access, hardening ops for autonomous systems, and evolving serverless database ergonomics to keep pace with agentic development.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Marc Brooker about the impact of agentic workflows on database usage patterns and how they change the architectural requirements for databases.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the role of the database is in agentic workflows?
- There are numerous types of databases, with relational being the most prevalent. How does the type and purpose of an agent inform the type of database that should be used?
- Anecdotally I have heard about how agentic workloads have become the predominant "customers" of services like Neon and Fly.io. How would you characterize the different patterns of scale for agentic AI applications? (e.g. proliferation of agents, monolithic agents, multi-agent, etc.)
- What are some of the most significant impacts on workload and access patterns for data storage and retrieval that agents introduce?
- What are the categorical differences in that behavior as compared to programmatic/automated systems?
- You have spent a substantial amount of time on Lambda at AWS. Given that LLMs are effectively stateless, how does the added ephemerality of serverless functions impact design and performance considerations around having to "re-hydrate" context when interacting with agents?
- What are the most interesting, innovative, or unexpected ways that you have seen serverless and database systems used for agentic workloads?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on technologies that are supporting agentic applications?

Contact Info
- Blog
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- AWS Aurora DSQL
- AWS Lambda
- Three Tier Architecture
- Vector Database
- Graph Database
- Relational Database
- Vector Embedding
- RAG == Retrieval Augmented Generation
- AI Engineering Podcast Episode
- GraphRAG
- AI Engineering Podcast Episode
- LLM Tool Calling
- MCP == Model Context Protocol
- A2A == Agent 2 Agent Protocol
- AWS Bedrock AgentCore
- Strands
- LangChain
- Kiro

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

What if AI could tap into live operational data — without ETL or RAG? In this episode, Deepti Srivastava, founder of Snow Leopard, reveals how her company is transforming enterprise data access with intelligent data retrieval, semantic intelligence, and a governance-first approach. Tune in for a fresh perspective on the future of AI and the startup journey behind it.

We explore how companies are revolutionizing their data access and AI strategies. Deepti Srivastava, founder of Snow Leopard, shares her insights on bridging the gap between live operational data and generative AI — and how it’s changing the game for enterprises worldwide. We dive into Snow Leopard’s innovative approach to data retrieval, semantic intelligence, and governance-first architecture.

04:54 Meeting Deepti Srivastava
14:06 AI with No ETL, no RAG
17:11 Snow Leopard's Intelligent Data Fetching
19:00 Live Query Challenges
21:01 Snow Leopard's Secret Sauce
22:14 Latency
23:48 Schema Changes
25:02 Use Cases
26:06 Snow Leopard's Roadmap
29:16 Getting Started
33:30 The Startup Journey
34:12 A Woman in Technology
36:03 The Contrarian View

🔗 LinkedIn: https://www.linkedin.com/in/thedeepti/
🔗 Website: https://www.snowleopard.ai/

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

This session explores the building blocks of next-generation data platforms, with a focus on framing the right questions to unlock innovation. We’ll showcase how AWS Glue, ETL pipelines, crawlers, and data catalogs can transform raw data into analytics-ready insights. Drawing on hands-on experience, we’ll share forward-thinking strategies, lessons learned, and emerging best practices to help you architect a data foundation that is intelligent, adaptable, and future-proof.
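To make the Glue building blocks concrete, here is a minimal sketch of the crawl-then-transform flow described above, using boto3. It assumes the IAM role, bucket, crawler, and job names (all placeholders) already exist in your account; the real session may use different tooling.

```python
# Sketch: register raw S3 data in the Glue Data Catalog, then run an ETL job.
# Role ARN, bucket, crawler, and job names below are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawl the raw zone so the Data Catalog learns the schema.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",  # placeholder
    DatabaseName="raw_events",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")

# Kick off a pre-defined Glue ETL job that writes analytics-ready output.
run = glue.start_job_run(JobName="events-to-parquet")  # hypothetical job
print(run["JobRunId"])
```

In practice the crawler run is asynchronous, so a production pipeline would poll its state (or orchestrate it) before starting the downstream job.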

Many organizations struggle with a fragmented data landscape full of scripts and ETL tooling that only experts understand. Ploeger Logistics shows it can be done differently. Together with Infotopics, the logistics provider migrated its entire data logistics to the cloud - without any loss of continuity. The result: one scalable platform, with a better data foundation for the whole organization. In this session you will discover the choices, obstacles, and impact of this transformation.

Summary

In this episode of the Data Engineering Podcast Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on DuckLake, a new entrant in the open lakehouse ecosystem. They discuss how DuckLake focuses on simplicity and flexibility, offering a unified catalog and table format in contrast to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how DuckLake revolutionizes data architecture by enabling local-first data processing, simplifying deployment of lakehouse solutions, and offering benefits such as encryption features, data inlining, and integration with existing ecosystems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what DuckLake is and the story behind it?
- What are the particular problems that DuckLake is solving for?
- How does this compare to the capabilities of MotherDuck?
- Iceberg and Delta already have a well established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?
- One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?
- There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?
- What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (why can't Iceberg REST catalog implementations just use DuckLake under the hood?)
- Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?
- Is it now possible to enforce PK/FK constraints, indexing on underlying data?
- Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?
- How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?
- What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?
- What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?
- When is DuckLake the wrong choice?
- What do you have planned for the future of DuckLake?

Contact Info
- Hannes: Website
- Mark: Website

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- DuckDB
- Podcast Episode
- DuckLake
- DuckDB Labs
- MySQL
- CWI
- MonetDB
- Iceberg
- Iceberg REST Catalog
- Delta
- Hudi
- Lance
- DuckDB Iceberg Connector
- ACID == Atomicity, Consistency, Isolation, Durability
- MotherDuck
- MotherDuck Managed DuckLake
- Trino
- Spark
- Presto
- Spark DuckLake Demo
- Delta Kernel
- Arrow
- dlt
- S3 Tables
- Attribute Based Access Control (ABAC)
- Parquet
- Arrow Flight
- Hadoop
- HDFS
- DuckLake Roadmap

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB

Most Python developers reach for Pandas or Polars when working with tabular data—but DuckDB offers a powerful alternative that’s more than just another DataFrame library. In this tutorial, you’ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL—all without leaving Python. We’ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. You’ll leave with a solid mental model for using DuckDB effectively as the “SQLite for analytics.”
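As a taste of the tutorial's approach, here is a minimal sketch of DuckDB as an in-process pipeline engine; the file and column names are placeholders, but the APIs (duckdb.connect, read_csv_auto, COPY ... TO) are standard DuckDB.

```python
# Sketch: a tiny extract-transform-cache pipeline, all inside one process.
import duckdb

con = duckdb.connect("pipeline.duckdb")  # persistent local database file

# Extract + load: ingest a raw CSV into a table with the schema inferred.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")

# Transform: aggregate with plain SQL, no DataFrame round-trips.
daily = con.execute("""
    SELECT order_date, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY order_date
    ORDER BY order_date
""").df()  # hand back a Pandas DataFrame only at the edge

# Cache the cleaned data as Parquet for downstream consumers.
con.execute("COPY (SELECT * FROM raw_orders) TO 'orders.parquet' (FORMAT PARQUET)")
print(daily.head())
```

The design choice this illustrates is the "SQLite for analytics" model: state lives in one local file, heavy lifting happens in the database engine, and DataFrames appear only where Python needs them.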

The relationship between AI and data professionals is evolving rapidly, creating both opportunities and challenges. As companies embrace AI-first strategies and experiment with AI agents, the skills needed to thrive in data roles are fundamentally changing. Is coding knowledge still essential when AI can generate code for you? How important is domain expertise when automated tools can handle technical tasks? With data engineering and analytics engineering gaining prominence, the focus is shifting toward ensuring data quality and building reliable pipelines. But where does the human fit in this increasingly automated landscape, and how can you position yourself to thrive amid these transformations?

Megan Bowers is Senior Content Manager, Digital Customer Success at Alteryx, where she develops resources for the Maveryx Community. She writes technical blogs and hosts the Alter Everything podcast, spotlighting best practices from data professionals across the industry. Before joining Alteryx, Megan worked as a data analyst at Stanley Black & Decker, where she led ETL and dashboarding projects and trained teams on Alteryx and Power BI. Her transition into data began after earning a degree in Industrial Engineering and completing a data science bootcamp. Today, she focuses on creating accessible, high-impact content that helps data practitioners grow. Her favorite topics include switching career paths after college, building a professional brand on LinkedIn, writing technical blogs people actually want to read, and best practices in Alteryx, data visualization, and data storytelling.

Presented by Alteryx, Alter Everything serves as a podcast dedicated to the culture of data science and analytics, showcasing insights from industry specialists. Covering a range of subjects from the use of machine learning to various analytics career trajectories, and all that lies between, Alter Everything stands as a celebration of the critical role of data literacy in a data-driven world.

In the episode, Richie and Megan explore the impact of AI on job functions, the rise of AI agents in business, and the importance of domain knowledge and process analytics in data roles. They also discuss strategies for staying updated in the fast-paced world of AI and data science, and much more.

Links Mentioned in the Show:
- Alter Everything
- Connect with Megan
- Skill Track: Alteryx Fundamentals
- Related Episode: Scaling Enterprise Analytics with Libby Duane Adams, Chief Advocacy Officer and Co-Founder of Alteryx
- Rewatch RADAR AI

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics

This book is your guide to the modern market of data analytics platforms and the benefits of using Snowflake, the data warehouse built for the cloud. As organizations increasingly rely on modern cloud data platforms, the core of any analytics framework—the data warehouse—is more important than ever. This updated 2nd edition ensures you are ready to make the most of the industry’s leading data warehouse. This book will onboard you to Snowflake and present best practices for deploying and using the Snowflake data warehouse. The book also covers modern analytics architecture, integration with leading analytics software such as Matillion ETL, Tableau, and Databricks, and migration scenarios for on-premises legacy data warehouses. This new edition includes expanded coverage of Snowpark for developing complex data applications, an introduction to managing large datasets with Apache Iceberg tables, and instructions for creating interactive data applications using Streamlit, ensuring readers are equipped with the latest advancements in Snowflake's capabilities.

What You Will Learn
- Master key functionalities of Snowflake
- Set up security and access with clusters
- Bulk load data into Snowflake using the COPY command
- Migrate from a legacy data warehouse to Snowflake
- Integrate the Snowflake data platform with modern business intelligence (BI) and data integration tools
- Manage large datasets with Apache Iceberg tables
- Implement continuous data loading with Snowpipe and Dynamic Tables

Who This Book Is For
Data professionals, business analysts, IT administrators, and existing or potential Snowflake users
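For readers new to the COPY command mentioned above, here is a hedged sketch of a bulk load using snowflake-connector-python; the connection parameters, file path, and table name are placeholders, not values from the book.

```python
# Sketch: stage a local file in a table stage, then bulk load it with COPY INTO.
# All identifiers below are hypothetical; use a secrets manager for credentials.
import snowflake.connector

con = snowflake.connector.connect(
    account="myorg-myaccount",  # placeholder account locator
    user="LOADER",              # placeholder user
    password="...",             # placeholder; never hard-code in practice
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = con.cursor()

# PUT uploads the local file into the table stage for ORDERS (@%ORDERS).
cur.execute("PUT file:///tmp/orders.csv @%ORDERS")

# COPY INTO performs the parallel bulk load from the stage into the table.
cur.execute("""
    COPY INTO ORDERS
    FROM @%ORDERS
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
print(cur.fetchall())  # per-file load status returned by COPY
con.close()
```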

Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability. The platform now supports business analytics and has transitioned from ETL to ELT, markedly improving how data is made available for business decisions. The platform’s capabilities are being extended to power Digital Workers (AI agents) using large language models, encompassing model training, fine-tuning, and inference. Two Digital Workers are currently deployed, with more in development. This presentation will detail the rationale and background of this evolution, followed by an explanation of the architectural decisions made and the challenges encountered and resolved throughout the process of transforming into an AI-enabled data platform that powers Red Hat’s business.

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through:

- Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data.
- Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, SageMaker), integrating their outputs seamlessly into Airflow workflows.
- Real-World Impact: A case study on how INTRVL leveraged Airflow, BigQuery ML, and Cloud Run to analyze early voting data in near real-time, generating actionable insights on voter behavior across swing states.

This talk not only provides a deep dive into the Political Tech space but also serves as a reference architecture for building robust, repeatable ELT pipelines. Attendees will gain insights into modern serverless technologies from AWS and GCP that enhance Airflow’s capabilities, helping data engineers design scalable, cloud-agnostic workflows.
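As a rough illustration of the ingest-then-offload pattern described above, here is a minimal Airflow 2.x TaskFlow sketch that lands raw data in S3 and hands the heavy transform to a serverless function; the bucket, key, and Lambda function names are hypothetical.

```python
# Sketch: daily ELT DAG - ingest to the lake, then invoke a serverless transform.
import json
from datetime import datetime

import boto3
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def serverless_elt():
    @task
    def ingest_to_s3() -> str:
        # A real pipeline would pull from a third-party API here.
        key = "raw/events/latest.json"
        boto3.client("s3").put_object(
            Bucket="my-data-lake",  # placeholder bucket
            Key=key,
            Body=json.dumps([{"event": "demo"}]),
        )
        return key

    @task
    def transform_via_lambda(key: str):
        # Offload the intensive transform to a serverless function.
        resp = boto3.client("lambda").invoke(
            FunctionName="transform-events",  # hypothetical function
            Payload=json.dumps({"s3_key": key}),
        )
        print(resp["StatusCode"])

    transform_via_lambda(ingest_to_s3())


serverless_elt()
```

The same shape works for Cloud Run by swapping the Lambda invocation for an authenticated HTTP call to the service endpoint.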

As data platforms grow in complexity, so do the orchestration needs behind them. Time-based (cron) scheduling has long been the default in Airflow, but dataset-based scheduling promises a more data-aware, efficient alternative. In this session, I’ll share lessons learned from operating Airflow at scale—supporting thousands of DAGs across teams with varied use cases, from simple ETL to complex ML workflows. We’ll explore when dataset scheduling makes sense, the challenges it introduces, and how to evolve your DAG design and platform architecture to make the most of it. Whether you’re migrating legacy workflows or designing new ones, this talk will help you evaluate the right scheduling model for your needs.
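For context on what dataset-based scheduling looks like in practice, here is a minimal sketch using the Airflow 2.4+ Dataset API: the producer declares the dataset as an outlet, and the consumer runs whenever it is updated rather than on a cron schedule. The dataset URI is a placeholder.

```python
# Sketch: cron-scheduled producer, data-aware consumer (Airflow 2.4+).
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("s3://my-data-lake/curated/orders/")  # placeholder URI


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def orders_producer():
    @task(outlets=[orders])
    def refresh_orders():
        print("write curated orders partition")  # ETL work goes here

    refresh_orders()


@dag(schedule=[orders], start_date=datetime(2024, 1, 1), catchup=False)
def orders_consumer():
    @task
    def build_report():
        print("rebuild downstream extracts")

    build_report()


orders_producer()
orders_consumer()
```

Note the trade-off the talk examines: the consumer no longer has a predictable wall-clock cadence, which simplifies freshness guarantees but complicates SLAs and cross-team expectations.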

Airflow 3.0 is the most significant release in the project’s history, and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new react-based UI. Whether you’re writing traditional ELT/ETL pipelines or complex ML and GenAI workflows, you’ll learn how Airflow 3 will make your day-to-day work smoother and your pipelines even more flexible. This workshop is suitable for intermediate to advanced Airflow users. Beginning users should consider taking the Airflow fundamentals course on the Astronomer Academy before attending this workshop.
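For a flavor of the asset-centric authoring style covered in the workshop, here is a hedged sketch against Airflow 3's new public interface; the import paths reflect the 3.0 release's airflow.sdk module, and the asset URI is a placeholder, so treat the exact names as assumptions for your installed version.

```python
# Sketch: an asset-producing DAG using Airflow 3's airflow.sdk interface.
from datetime import datetime

from airflow.sdk import Asset, dag, task

daily_metrics = Asset("s3://warehouse/metrics/daily/")  # placeholder URI


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def metrics_pipeline():
    @task(outlets=[daily_metrics])
    def publish_metrics():
        print("compute and publish daily metrics")  # downstream DAGs can
        # schedule on this asset instead of a cron expression

    publish_metrics()


metrics_pipeline()
```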

At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infrastructure, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn's largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow. In this talk, we will share key lessons from migrating massive-scale pipelines with minimal production disruption. We will discuss:

- Overall Migration Strategy
- Custom Tooling: enhancements on testing, deployment, and observability
- Architectural Innovations: decoupling orchestration and compute
- GenAI-powered Migration: automating code rewrites
- Post-Migration Challenges & Airflow 3.0

Attendees will walk away with battle-tested strategies for large-scale Airflow adoption and practical insights into scaling Airflow in enterprise environments.

In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows. This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability. Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.
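To sketch the ETL shape this talk describes, here is a minimal Airflow DAG that fans out over multiple sources with dynamic task mapping; the source list is a placeholder, and the embed/upsert step is a hypothetical stand-in for your embedding model and vector store clients.

```python
# Sketch: daily knowledge-base refresh - extract, chunk, embed, load.
from datetime import datetime

from airflow.decorators import dag, task

SOURCES = ["https://docs.example.com", "s3://kb-bucket/policies/"]  # placeholders


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def refresh_knowledge_base():
    @task
    def extract(source: str) -> list[str]:
        # Real code would fetch and parse documents from the source.
        return [f"document pulled from {source}"]

    @task
    def transform(docs: list[str]) -> list[str]:
        # Naive sentence chunking; swap in a real chunker as needed.
        return [chunk for doc in docs for chunk in doc.split(". ")]

    @task
    def load(chunks: list[str]):
        for chunk in chunks:
            # Hypothetical step: call your embedding model + vector store here.
            print(f"embed + upsert: {chunk[:40]}")

    # Dynamic task mapping runs one extract/transform/load chain per source.
    load.expand(chunks=transform.expand(docs=extract.expand(source=SOURCES)))


refresh_knowledge_base()
```

Because each source maps to its own task instances, a failure in one source's refresh retries independently, which addresses the consistency and scalability concerns the session raises.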

The journey from ML model development to production deployment and monitoring is often complex and fragmented. How can teams overcome the chaos of disparate tools and processes? This session dives into how Apache Airflow serves as a unifying force in MLOps. We’ll begin with a look at the broader MLOps trends observed by Google within the Airflow community, highlighting how Airflow is evolving to meet these challenges and showcasing diverse MLOps use cases – both current and future. Then, Priceline will present a deep-dive case study on their MLOps transformation. Learn how they leveraged Cloud Composer, Google Cloud’s managed Apache Airflow service, to orchestrate their entire ML pipeline end-to-end: ETL, data preprocessing, model building & training, Dockerization, Google Artifact Registry integration, deployment, model serving, and evaluation. Discover how using Cloud Composer on GCP enabled them to build a scalable, reliable, adaptable, and maintainable MLOps practice, moving decisively from chaos to coordination. Cloud Composer (Airflow) has served as a major backbone in transforming the whole ML experience at Priceline. Join us to learn how to harness Airflow, particularly within a managed environment like Cloud Composer, for robust MLOps workflows, drawing lessons from both industry trends and a concrete, successful implementation.