talk-data.com

Topic: ETL/ELT

Tags: data_integration, data_transformation, data_loading

480 activities tagged

Activity Trend: peaking at 40 activities per quarter, 2020-Q1 through 2026-Q1

Activities

480 activities · Newest first

In this session, Paul Wilkinson, Principal Solutions Architect at Redpanda, will demonstrate Redpanda's native Iceberg capability: a game-changing addition that bridges the gap between real-time streaming and analytical workloads, eliminating the complexity of traditional data lake architectures while maintaining the performance and simplicity that Redpanda is known for.

In a follow-along demo, Paul will explore how this new capability enables organizations to seamlessly move streaming data into analytical formats without complex ETL pipelines or additional infrastructure overhead - allowing you to build your own streaming lakehouse and show it to your team!

Summary

In this episode of the AI Engineering Podcast Marc Brooker, VP and Distinguished Engineer at AWS, talks about how agentic workflows are transforming database usage and infrastructure design. He discusses the evolving role of data in AI systems, from traditional models to more modern approaches like vectors, RAG, and relational databases. Marc explains why agents require serverless, elastic, and operationally simple databases, and how AWS solutions like Aurora and DSQL address these needs with features such as rapid provisioning, automated patching, geodistribution, and support for spiky usage patterns. The conversation covers topics including tool calling, improved model capabilities, state in agents versus stateless LLM calls, and the role of Lambda and AgentCore for long-running, session-isolated agents. Marc also touches on the shift from local MCP tools to secure, remote endpoints, the rise of object storage as a durable backplane, and the need for better identity and authorization models. The episode highlights real-world patterns like agent-driven SQL fuzzing and plan analysis, while identifying gaps in simplifying data access, hardening ops for autonomous systems, and evolving serverless database ergonomics to keep pace with agentic development.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Marc Brooker about the impact of agentic workflows on database usage patterns and how they change the architectural requirements for databases.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what the role of the database is in agentic workflows?
- There are numerous types of databases, with relational being the most prevalent. How does the type and purpose of an agent inform the type of database that should be used?
- Anecdotally I have heard about how agentic workloads have become the predominant "customers" of services like Neon and Fly.io. How would you characterize the different patterns of scale for agentic AI applications? (e.g. proliferation of agents, monolithic agents, multi-agent, etc.)
- What are some of the most significant impacts on workload and access patterns for data storage and retrieval that agents introduce?
- What are the categorical differences in that behavior as compared to programmatic/automated systems?
- You have spent a substantial amount of time on Lambda at AWS. Given that LLMs are effectively stateless, how does the added ephemerality of serverless functions impact design and performance considerations around having to "re-hydrate" context when interacting with agents?
- What are the most interesting, innovative, or unexpected ways that you have seen serverless and database systems used for agentic workloads?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on technologies that are supporting agentic applications?

Contact Info
- Blog
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- AWS Aurora DSQL
- AWS Lambda
- Three Tier Architecture
- Vector Database
- Graph Database
- Relational Database
- Vector Embedding
- RAG == Retrieval Augmented Generation
- AI Engineering Podcast Episode
- GraphRAG
- AI Engineering Podcast Episode
- LLM Tool Calling
- MCP == Model Context Protocol
- A2A == Agent 2 Agent Protocol
- AWS Bedrock AgentCore
- Strands
- LangChain
- Kiro

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

What if AI could tap into live operational data — without ETL or RAG? In this episode, Deepti Srivastava, founder of Snow Leopard, reveals how her company is transforming enterprise data access with intelligent data retrieval, semantic intelligence, and a governance-first approach. Tune in for a fresh perspective on the future of AI and the startup journey behind it.

We explore how companies are revolutionizing their data access and AI strategies. Deepti Srivastava, founder of Snow Leopard, shares her insights on bridging the gap between live operational data and generative AI — and how it’s changing the game for enterprises worldwide. We dive into Snow Leopard’s innovative approach to data retrieval, semantic intelligence, and governance-first architecture.

04:54 Meeting Deepti Srivastava
14:06 AI with No ETL, no RAG
17:11 Snow Leopard's Intelligent Data Fetching
19:00 Live Query Challenges
21:01 Snow Leopard's Secret Sauce
22:14 Latency
23:48 Schema Changes
25:02 Use Cases
26:06 Snow Leopard's Roadmap
29:16 Getting Started
33:30 The Startup Journey
34:12 A Woman in Technology
36:03 The Contrarian View

🔗 LinkedIn: https://www.linkedin.com/in/thedeepti/
🔗 Website: https://www.snowleopard.ai/

Want to be featured as a guest on Making Data Simple? Reach out to us at [email protected] and tell us why you should be next. The Making Data Simple Podcast is hosted by Al Martin, WW VP Technical Sales, IBM, where we explore trending technologies, business innovation, and leadership ... while keeping it simple & fun.

This session explores the building blocks of next-generation data platforms, with a focus on framing the right questions to unlock innovation. We’ll showcase how AWS Glue, ETL pipelines, crawlers, and data catalogs can transform raw data into analytics-ready insights. Drawing on hands-on experience, we’ll share forward-thinking strategies, lessons learned, and emerging best practices to help you architect a data foundation that is intelligent, adaptable, and future-proof.
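To make the Glue building blocks concrete, here is a minimal sketch of the crawl-then-transform flow described above, using boto3. It assumes the IAM role, bucket, crawler, and job names (all placeholders) already exist in your account; the real session may use different tooling.

```python
# Sketch: register raw S3 data in the Glue Data Catalog, then run an ETL job.
# Role ARN, bucket, crawler, and job names below are hypothetical placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawl the raw zone so the Data Catalog learns the schema.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",  # placeholder
    DatabaseName="raw_events",
    Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")

# Kick off a pre-defined Glue ETL job that writes analytics-ready output.
run = glue.start_job_run(JobName="events-to-parquet")  # hypothetical job
print(run["JobRunId"])
```

In practice the crawler run is asynchronous, so a production pipeline would poll its state (or orchestrate it) before starting the downstream job.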

Many organizations struggle with a fragmented data landscape full of scripts and ETL tooling that only experts understand. Ploeger Logistics shows it can be done differently. Together with Infotopics, the logistics provider migrated its entire data logistics to the cloud - without any loss of continuity. The result: one scalable platform, with a better data foundation for the whole organization. In this session you will discover the choices, obstacles, and impact of this transformation.

Summary

In this episode of the Data Engineering Podcast Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on DuckLake, a new entrant in the open lakehouse ecosystem. They discuss how DuckLake focuses on simplicity and flexibility, offering a unified catalog and table format in contrast to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how DuckLake revolutionizes data architecture by enabling local-first data processing, simplifying deployment of lakehouse solutions, and offering benefits such as encryption features, data inlining, and integration with existing ecosystems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what DuckLake is and the story behind it?
- What are the particular problems that DuckLake is solving for?
- How does this compare to the capabilities of MotherDuck?
- Iceberg and Delta already have a well established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?
- One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?
- There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?
- What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (why can't Iceberg REST catalog implementations just use DuckLake under the hood?)
- Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?
- Is it now possible to enforce PK/FK constraints, indexing on underlying data?
- Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?
- How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?
- What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?
- What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?
- When is DuckLake the wrong choice?
- What do you have planned for the future of DuckLake?

Contact Info
- Hannes: Website
- Mark: Website

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- DuckDB
- Podcast Episode
- DuckLake
- DuckDB Labs
- MySQL
- CWI
- MonetDB
- Iceberg
- Iceberg REST Catalog
- Delta
- Hudi
- Lance
- DuckDB Iceberg Connector
- ACID == Atomicity, Consistency, Isolation, Durability
- MotherDuck
- MotherDuck Managed DuckLake
- Trino
- Spark
- Presto
- Spark DuckLake Demo
- Delta Kernel
- Arrow
- dlt
- S3 Tables
- Attribute Based Access Control (ABAC)
- Parquet
- Arrow Flight
- Hadoop
- HDFS
- DuckLake Roadmap

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

More than DataFrames: Data Pipelines with the Swiss Army Knife DuckDB

Most Python developers reach for Pandas or Polars when working with tabular data—but DuckDB offers a powerful alternative that’s more than just another DataFrame library. In this tutorial, you’ll learn how to use DuckDB as an in-process analytical database: building data pipelines, caching datasets, and running complex queries with SQL—all without leaving Python. We’ll cover common use cases like ETL, lightweight data orchestration, and interactive analytics workflows. You’ll leave with a solid mental model for using DuckDB effectively as the “SQLite for analytics.”
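As a taste of the tutorial's approach, here is a minimal sketch of DuckDB as an in-process pipeline engine; the file and column names are placeholders, but the APIs (duckdb.connect, read_csv_auto, COPY ... TO) are standard DuckDB.

```python
# Sketch: a tiny extract-transform-cache pipeline, all inside one process.
import duckdb

con = duckdb.connect("pipeline.duckdb")  # persistent local database file

# Extract + load: ingest a raw CSV into a table with the schema inferred.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders.csv')
""")

# Transform: aggregate with plain SQL, no DataFrame round-trips.
daily = con.execute("""
    SELECT order_date, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY order_date
    ORDER BY order_date
""").df()  # hand back a Pandas DataFrame only at the edge

# Cache the cleaned data as Parquet for downstream consumers.
con.execute("COPY (SELECT * FROM raw_orders) TO 'orders.parquet' (FORMAT PARQUET)")
print(daily.head())
```

The design choice this illustrates is the "SQLite for analytics" model: state lives in one local file, heavy lifting happens in the database engine, and DataFrames appear only where Python needs them.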

The relationship between AI and data professionals is evolving rapidly, creating both opportunities and challenges. As companies embrace AI-first strategies and experiment with AI agents, the skills needed to thrive in data roles are fundamentally changing. Is coding knowledge still essential when AI can generate code for you? How important is domain expertise when automated tools can handle technical tasks? With data engineering and analytics engineering gaining prominence, the focus is shifting toward ensuring data quality and building reliable pipelines. But where does the human fit in this increasingly automated landscape, and how can you position yourself to thrive amid these transformations?

Megan Bowers is Senior Content Manager, Digital Customer Success at Alteryx, where she develops resources for the Maveryx Community. She writes technical blogs and hosts the Alter Everything podcast, spotlighting best practices from data professionals across the industry. Before joining Alteryx, Megan worked as a data analyst at Stanley Black & Decker, where she led ETL and dashboarding projects and trained teams on Alteryx and Power BI. Her transition into data began after earning a degree in Industrial Engineering and completing a data science bootcamp. Today, she focuses on creating accessible, high-impact content that helps data practitioners grow. Her favorite topics include switching career paths after college, building a professional brand on LinkedIn, writing technical blogs people actually want to read, and best practices in Alteryx, data visualization, and data storytelling.

Presented by Alteryx, Alter Everything serves as a podcast dedicated to the culture of data science and analytics, showcasing insights from industry specialists. Covering a range of subjects from the use of machine learning to various analytics career trajectories, and all that lies between, Alter Everything stands as a celebration of the critical role of data literacy in a data-driven world.

In the episode, Richie and Megan explore the impact of AI on job functions, the rise of AI agents in business, and the importance of domain knowledge and process analytics in data roles. They also discuss strategies for staying updated in the fast-paced world of AI and data science, and much more.

Links Mentioned in the Show:
- Alter Everything
- Connect with Megan
- Skill Track: Alteryx Fundamentals
- Related Episode: Scaling Enterprise Analytics with Libby Duane Adams, Chief Advocacy Officer and Co-Founder of Alteryx
- Rewatch RADAR AI

New to DataCamp? Learn on the go using the DataCamp mobile app. Empower your business with world-class data and AI skills with DataCamp for business.

Jumpstart Snowflake: A Step-by-Step Guide to Modern Cloud Analytics

This book is your guide to the modern market of data analytics platforms and the benefits of using Snowflake, the data warehouse built for the cloud. As organizations increasingly rely on modern cloud data platforms, the core of any analytics framework—the data warehouse—is more important than ever. This updated 2nd edition ensures you are ready to make the most of the industry’s leading data warehouse. This book will onboard you to Snowflake and present best practices for deploying and using the Snowflake data warehouse. The book also covers modern analytics architecture, integration with leading analytics software such as Matillion ETL, Tableau, and Databricks, and migration scenarios for on-premises legacy data warehouses. This new edition includes expanded coverage of Snowpark for developing complex data applications, an introduction to managing large datasets with Apache Iceberg tables, and instructions for creating interactive data applications using Streamlit, ensuring readers are equipped with the latest advancements in Snowflake's capabilities.

What You Will Learn
- Master key functionalities of Snowflake
- Set up security and access with clusters
- Bulk load data into Snowflake using the COPY command
- Migrate from a legacy data warehouse to Snowflake
- Integrate the Snowflake data platform with modern business intelligence (BI) and data integration tools
- Manage large datasets with Apache Iceberg tables
- Implement continuous data loading with Snowpipe and Dynamic Tables

Who This Book Is For
Data professionals, business analysts, IT administrators, and existing or potential Snowflake users
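For readers new to the COPY command mentioned above, here is a hedged sketch of a bulk load using snowflake-connector-python; the connection parameters, file path, and table name are placeholders, not values from the book.

```python
# Sketch: stage a local file in a table stage, then bulk load it with COPY INTO.
# All identifiers below are hypothetical; use a secrets manager for credentials.
import snowflake.connector

con = snowflake.connector.connect(
    account="myorg-myaccount",  # placeholder account locator
    user="LOADER",              # placeholder user
    password="...",             # placeholder; never hard-code in practice
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = con.cursor()

# PUT uploads the local file into the table stage for ORDERS (@%ORDERS).
cur.execute("PUT file:///tmp/orders.csv @%ORDERS")

# COPY INTO performs the parallel bulk load from the stage into the table.
cur.execute("""
    COPY INTO ORDERS
    FROM @%ORDERS
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
print(cur.fetchall())  # per-file load status returned by COPY
con.close()
```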

Red Hat’s unified data and AI platform relies on Apache Airflow for orchestration, alongside Snowflake, Fivetran, and Atlan. The platform prioritizes building a dependable data foundation, recognizing that effective AI depends on quality data. Airflow was selected for its predictability, extensive connectivity, reliability, and scalability. The platform now supports business analytics and has transitioned from ETL to ELT, markedly improving how data is made available for business decisions. The platform’s capabilities are being extended to power Digital Workers (AI agents) using large language models, encompassing model training, fine-tuning, and inference. Two Digital Workers are currently deployed, with more in development. This presentation will detail the rationale and background of this evolution, followed by an explanation of the architectural decisions made and the challenges encountered and resolved throughout the process of transforming into an AI-enabled data platform that powers Red Hat’s business.

Discover how Apache Airflow powers scalable ELT pipelines, enabling seamless data ingestion, transformation, and machine learning-driven insights. This session will walk through:

- Automating Data Ingestion: Using Airflow to orchestrate raw data ingestion from third-party sources into your data lake (S3, GCP), ensuring a steady pipeline of high-quality training and prediction data.
- Optimizing Transformations with Serverless Computing: Offloading intensive transformations to serverless functions (GCP Cloud Run, AWS Lambda) and machine learning models (BigQuery ML, SageMaker), integrating their outputs seamlessly into Airflow workflows.
- Real-World Impact: A case study on how INTRVL leveraged Airflow, BigQuery ML, and Cloud Run to analyze early voting data in near real-time, generating actionable insights on voter behavior across swing states.

This talk not only provides a deep dive into the Political Tech space but also serves as a reference architecture for building robust, repeatable ELT pipelines. Attendees will gain insights into modern serverless technologies from AWS and GCP that enhance Airflow’s capabilities, helping data engineers design scalable, cloud-agnostic workflows.
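As a rough illustration of the ingest-then-offload pattern described above, here is a minimal Airflow 2.x TaskFlow sketch that lands raw data in S3 and hands the heavy transform to a serverless function; the bucket, key, and Lambda function names are hypothetical.

```python
# Sketch: daily ELT DAG - ingest to the lake, then invoke a serverless transform.
import json
from datetime import datetime

import boto3
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def serverless_elt():
    @task
    def ingest_to_s3() -> str:
        # A real pipeline would pull from a third-party API here.
        key = "raw/events/latest.json"
        boto3.client("s3").put_object(
            Bucket="my-data-lake",  # placeholder bucket
            Key=key,
            Body=json.dumps([{"event": "demo"}]),
        )
        return key

    @task
    def transform_via_lambda(key: str):
        # Offload the intensive transform to a serverless function.
        resp = boto3.client("lambda").invoke(
            FunctionName="transform-events",  # hypothetical function
            Payload=json.dumps({"s3_key": key}),
        )
        print(resp["StatusCode"])

    transform_via_lambda(ingest_to_s3())


serverless_elt()
```

The same shape works for Cloud Run by swapping the Lambda invocation for an authenticated HTTP call to the service endpoint.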

As data platforms grow in complexity, so do the orchestration needs behind them. Time-based (cron) scheduling has long been the default in Airflow, but dataset-based scheduling promises a more data-aware, efficient alternative. In this session, I’ll share lessons learned from operating Airflow at scale—supporting thousands of DAGs across teams with varied use cases, from simple ETL to complex ML workflows. We’ll explore when dataset scheduling makes sense, the challenges it introduces, and how to evolve your DAG design and platform architecture to make the most of it. Whether you’re migrating legacy workflows or designing new ones, this talk will help you evaluate the right scheduling model for your needs.
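For context on what dataset-based scheduling looks like in practice, here is a minimal sketch using the Airflow 2.4+ Dataset API: the producer declares the dataset as an outlet, and the consumer runs whenever it is updated rather than on a cron schedule. The dataset URI is a placeholder.

```python
# Sketch: cron-scheduled producer, data-aware consumer (Airflow 2.4+).
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

orders = Dataset("s3://my-data-lake/curated/orders/")  # placeholder URI


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def orders_producer():
    @task(outlets=[orders])
    def refresh_orders():
        print("write curated orders partition")  # ETL work goes here

    refresh_orders()


@dag(schedule=[orders], start_date=datetime(2024, 1, 1), catchup=False)
def orders_consumer():
    @task
    def build_report():
        print("rebuild downstream extracts")

    build_report()


orders_producer()
orders_consumer()
```

Note the trade-off the talk examines: the consumer no longer has a predictable wall-clock cadence, which simplifies freshness guarantees but complicates SLAs and cross-team expectations.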

Airflow 3.0 is the most significant release in the project’s history, and brings a better user experience, stronger security, and the ability to run tasks anywhere, at any time. In this workshop, you’ll get hands-on experience with the new release and learn how to leverage new features like DAG versioning, backfills, data assets, and a new react-based UI. Whether you’re writing traditional ELT/ETL pipelines or complex ML and GenAI workflows, you’ll learn how Airflow 3 will make your day-to-day work smoother and your pipelines even more flexible. This workshop is suitable for intermediate to advanced Airflow users. Beginning users should consider taking the Airflow fundamentals course on the Astronomer Academy before attending this workshop.
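For a flavor of the asset-centric authoring style covered in the workshop, here is a hedged sketch against Airflow 3's new public interface; the import paths reflect the 3.0 release's airflow.sdk module, and the asset URI is a placeholder, so treat the exact names as assumptions for your installed version.

```python
# Sketch: an asset-producing DAG using Airflow 3's airflow.sdk interface.
from datetime import datetime

from airflow.sdk import Asset, dag, task

daily_metrics = Asset("s3://warehouse/metrics/daily/")  # placeholder URI


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def metrics_pipeline():
    @task(outlets=[daily_metrics])
    def publish_metrics():
        print("compute and publish daily metrics")  # downstream DAGs can
        # schedule on this asset instead of a cron expression

    publish_metrics()


metrics_pipeline()
```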

At LinkedIn, our data pipelines process exabytes of data, with our offline infrastructure executing 300K ETL workflows daily and 10K concurrent executions. Historically, these workloads ran on our legacy system, Azkaban, which faced UX, scalability, and operational challenges. To modernize our infrastructure, we built a managed Airflow service, leveraging its enhanced developer & operator experience, rich feature set, and strong OSS community support. That initiated LinkedIn's largest-ever infrastructure migration—transitioning thousands of legacy workflows to Airflow. In this talk, we will share key lessons from migrating massive-scale pipelines with minimal production disruption. We will discuss:

- Overall Migration Strategy
- Custom Tooling: enhancements on testing, deployment, and observability
- Architectural Innovations: decoupling orchestration and compute
- GenAI-powered Migration: automating code rewrites
- Post-Migration Challenges & Airflow 3.0

Attendees will walk away with battle-tested strategies for large-scale Airflow adoption and practical insights into scaling Airflow in enterprise environments.

In the age of Generative AI, knowledge bases are the backbone of intelligent systems, enabling them to deliver accurate and context-aware responses. But how do you ensure that these knowledge bases remain up-to-date and relevant in a rapidly changing world? Enter Apache Airflow, a robust orchestration tool that streamlines the automation of data workflows. This talk will explore how Airflow can be leveraged to manage and update AI knowledge bases across multiple data sources. We’ll dive into the architecture, demonstrate how Airflow enables efficient data extraction, transformation, and loading (ETL), and share insights on tackling challenges like data consistency, scheduling, and scalability. Whether you’re building your own AI-driven systems or looking to optimize existing workflows, this session will provide practical takeaways to make the most of Apache Airflow in orchestrating intelligent solutions.
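To sketch the ETL shape this talk describes, here is a minimal Airflow DAG that fans out over multiple sources with dynamic task mapping; the source list is a placeholder, and the embed/upsert step is a hypothetical stand-in for your embedding model and vector store clients.

```python
# Sketch: daily knowledge-base refresh - extract, chunk, embed, load.
from datetime import datetime

from airflow.decorators import dag, task

SOURCES = ["https://docs.example.com", "s3://kb-bucket/policies/"]  # placeholders


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def refresh_knowledge_base():
    @task
    def extract(source: str) -> list[str]:
        # Real code would fetch and parse documents from the source.
        return [f"document pulled from {source}"]

    @task
    def transform(docs: list[str]) -> list[str]:
        # Naive sentence chunking; swap in a real chunker as needed.
        return [chunk for doc in docs for chunk in doc.split(". ")]

    @task
    def load(chunks: list[str]):
        for chunk in chunks:
            # Hypothetical step: call your embedding model + vector store here.
            print(f"embed + upsert: {chunk[:40]}")

    # Dynamic task mapping runs one extract/transform/load chain per source.
    load.expand(chunks=transform.expand(docs=extract.expand(source=SOURCES)))


refresh_knowledge_base()
```

Because each source maps to its own task instances, a failure in one source's refresh retries independently, which addresses the consistency and scalability concerns the session raises.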

The journey from ML model development to production deployment and monitoring is often complex and fragmented. How can teams overcome the chaos of disparate tools and processes? This session dives into how Apache Airflow serves as a unifying force in MLOps. We’ll begin with a look at the broader MLOps trends observed by Google within the Airflow community, highlighting how Airflow is evolving to meet these challenges and showcasing diverse MLOps use cases – both current and future. Then, Priceline will present a deep-dive case study on their MLOps transformation. Learn how they leveraged Cloud Composer, Google Cloud’s managed Apache Airflow service, to orchestrate their entire ML pipeline end-to-end: ETL, data preprocessing, model building & training, Dockerization, Google Artifact Registry integration, deployment, model serving, and evaluation. Discover how using Cloud Composer on GCP enabled them to build a scalable, reliable, adaptable, and maintainable MLOps practice, moving decisively from chaos to coordination. Cloud Composer (Airflow) has served as a major backbone in transforming the whole ML experience at Priceline. Join us to learn how to harness Airflow, particularly within a managed environment like Cloud Composer, for robust MLOps workflows, drawing lessons from both industry trends and a concrete, successful implementation.