talk-data.com

Topic: Data Engineering

Tags: etl, data_pipelines, big_data

1127 tagged activities

Activity Trend: peak of 127 activities per quarter, 2020-Q1 to 2026-Q1

Activities

1127 activities · Newest first

AI is only as good as the data it runs on. Yet Gartner predicts that in 2026, over 60% of AI projects will fail to deliver value because the underlying data isn’t truly AI-ready. MIT is even more concerned. “Good enough” data simply isn’t enough.

At this World Tour launch event, DataOps.live reveals Momentum, the next generation of its DataOps automation platform, designed to operationalize trusted AI at enterprise scale on Snowflake. Based on experience from building over 9,000 Data Products to date, Momentum introduces breakthrough capabilities including AI-Ready Data Scoring to ensure data is fit for AI use cases, Data Product Lineage for end-to-end visibility, and a Data Engineering Agent that accelerates building reusable data products. Combined with automated CI/CD, continuous observability, and governance enforcement, Momentum closes the AI-readiness gap by embedding collaboration, metadata, and automation across the entire data lifecycle. Backed by Snowflake Ventures and trusted by leading enterprises including AstraZeneca, Disney, and AT&T, DataOps.live is the proven catalyst for scaling AI-ready data. In this session, you’ll unpack what AI-ready data really means, learn essential practices, and discover a faster, easier, and more impactful way to make your AI initiatives succeed. Be the first to see Momentum in action - the future of AI-ready data.

Apache Iceberg™ provides an open storage standard that can democratize your data stored in separate data lakes by offering the freedom and interoperability to use various data processing engines. Join this session to explore Snowflake’s latest advancements for data engineering on Snowflake Iceberg tables. We’ll dive into recently launched features that enhance interoperability and bring Snowflake’s ease of use to your Iceberg data lakes.

Thomas in't Veld, founder of Tasman Analytics, joined Yuliia and Dumke to discuss why data projects fail: teams obsess over tooling while ignoring proper data modeling and business alignment. Drawing on experience building analytics for 70-80 companies, Thomas explains why the best data model never changes unless the business changes, and how his team acts as "data therapists" forcing marketing and sales to agree on fundamental definitions. He shares his controversial take that data modeling sits more in analysis than engineering. Another hot take: analytics engineering is merging back into data engineering. And showing off your DAG at meetups completely misses the point - business understanding is the critical differentiator, not your technology stack.

Summary: In this episode of the Data Engineering Podcast, Vijay Subramanian, founder and CEO of Trace, talks about metric trees - a new approach to data modeling that directly captures a company's business model. Vijay shares insights from his decade-long experience building data practices at Rent the Runway and explains how the modern data stack has led to a proliferation of dashboards without a coherent way for business consumers to reason about cause, effect, and action. He explores how metric trees differ from and interoperate with other data modeling approaches and serve as a backend for analytical workflows, and gives concrete examples like modeling Uber's revenue drivers and customer journeys. Vijay also discusses the potential of AI agents operating on metric trees to execute workflows, organizational patterns for defining inputs and outputs with business teams, and a vision for analytics that becomes invisible infrastructure embedded in everyday decisions.
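To make the metric-tree idea concrete, here is a minimal Python sketch of a metric tree as a data structure, with hypothetical metrics (revenue driven by orders and average order value); it illustrates the concept only and is not Trace's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MetricNode:
    """One node in a metric tree: a metric plus the input metrics that drive it."""
    name: str
    formula: Optional[Callable[..., float]] = None  # how children combine into this metric
    children: List["MetricNode"] = field(default_factory=list)

    def value(self, observed: Dict[str, float]) -> float:
        if not self.children:                       # leaf metric: read the observed value
            return observed[self.name]
        return self.formula(*(child.value(observed) for child in self.children))


# Toy tree: revenue = orders * average_order_value, orders = sessions * conversion_rate
revenue = MetricNode(
    "revenue",
    formula=lambda orders, aov: orders * aov,
    children=[
        MetricNode(
            "orders",
            formula=lambda sessions, cr: sessions * cr,
            children=[MetricNode("sessions"), MetricNode("conversion_rate")],
        ),
        MetricNode("average_order_value"),
    ],
)

print(revenue.value({"sessions": 10_000, "conversion_rate": 0.03, "average_order_value": 85.0}))
# 25500.0
```

Because the drivers are explicit, a change in the top-line metric can be attributed to the branch (sessions, conversion rate, or order value) that actually moved.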

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Vijay Subramanian about metric trees and how they empower more effective and adaptive analytics.

Interview

• Introduction
• How did you get involved in the area of data management?
• Can you describe what metric trees are and their purpose?
• How do metric trees relate to metric/semantic layers?
• What are the shortcomings of existing data modeling frameworks that prevent effective use of those assets?
• How do metric trees build on top of existing investments in dimensional data models?
• What are some strategies for engaging with the business to identify metrics and their relationships?
• What are your recommendations for storage, representation, and retrieval of metric trees?
• How do metric trees fit into the overall lifecycle of organizational data workflows?
• When creating any new data asset it introduces overhead of maintenance, monitoring, and evolution. How do metric trees fit into existing testing and validation frameworks that teams rely on for dimensional modeling?
• What are some of the key differences in useful evaluation/testing that teams need to develop for metric trees?
• How do metric trees assist in context engineering for AI-powered self-serve access to organizational data?
• What are the most interesting, innovative, or unexpected ways that you have seen metric trees used?
• What are the most interesting, unexpected, or challenging lessons that you have learned while working on metric trees and operationalizing them at Trace?
• When is a metric tree the wrong abstraction?
• What do you have planned for the future of Trace and applications of metric trees?

Contact Info

• LinkedIn

Parting Question

• From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

• Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
• Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
• If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

• Metric Tree
• Trace
• Modern Data Stack
• Hadoop
• Vertica
• Luigi
• dbt
• Ralph Kimball
• Bill Inmon
• Metric Layer
• Dimensional Data Warehouse
• Master Data Management
• Data Governance
• Financial P&L (Profit and Loss)
• EBITDA == Earnings before interest, taxes, depreciation and amortization

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Modern data engineering leverages Python to build robust, scalable, end-to-end workflows. In this talk, we will cover how Snowflake offers a flexible development environment for developing Python data pipelines, performing transformations at scale, and orchestrating and deploying your pipelines. Topics we’ll cover include:

– Ingest: data source APIs and file-based ingestion to read and load data of any format when files arrive, including sources outside Snowflake

– Develop: packaging (artifact repository), Python runtimes, IDEs (Notebooks, VS Code)

– Transform: Snowpark pandas, UDFs, UDAFs

– Deploy: Tasks, Notebook scheduling
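As a rough sketch of what such a pipeline can look like in Snowpark Python (the table names, column names, and connection values below are hypothetical placeholders, not material from the session):

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Hypothetical connection parameters; in practice these come from config or a secrets store.
connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Transform: aggregate raw orders into a per-customer summary with Snowpark DataFrames.
orders = session.table("RAW.ORDERS")
totals = (
    orders.filter(col("STATUS") == "COMPLETE")
          .group_by("CUSTOMER_ID")
          .agg(sum_("AMOUNT").alias("TOTAL_AMOUNT"))
)

# Deploy: persist the result; scheduling would typically be handled by a Snowflake Task.
totals.write.save_as_table("ANALYTICS.CUSTOMER_TOTALS", mode="overwrite")
```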

Apache Iceberg provides an open storage standard that can democratize your data stored in disparate data lakes by providing freedom and interoperability to use various data processing engines. Join this session to explore Snowflake’s latest advancements for data engineering on Snowflake Iceberg Tables. We’ll dive into newly launched features that enhance interoperability as well as bring Snowflake’s ease of use to your Iceberg data lakes.
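To illustrate the interoperability point, the sketch below reads a catalog-backed Iceberg table from outside Snowflake using PyIceberg; the catalog name, connection properties, and table identifier are assumptions for the example.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical catalog configuration; point this at your REST-, Glue-, or Snowflake-managed catalog.
catalog = load_catalog(
    "demo",
    **{"uri": "https://catalog.example.com", "token": "<token>"},
)

table = catalog.load_table("analytics.orders")

# Scan with a predicate and materialize as Arrow so another engine can pick up the result.
arrow_table = table.scan(row_filter="order_date >= '2025-01-01'").to_arrow()
print(arrow_table.num_rows)
```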

In this episode, I sit down with Saket Saurabh (CEO of Nexla) to discuss the fundamental shift happening in the AI landscape. The conversation is moving beyond the race to build the biggest foundational models and towards a new battleground: context. We explore what it means to be a "model company" versus a "context company" and how this changes everything for data strategy and enterprise AI.

Join us as we cover: Model vs. Context Companies: The emerging divide between companies building models (like OpenAI) and those whose advantage lies in their unique data and integrations. The Limits of Current Models: Why we might be hitting an asymptote with the current transformer architecture for solving complex, reliable business processes. "Context Engineering": What this term really means, from RAG to stitching together tools, data, and memory to feed AI systems. The Resurgence of Knowledge Graphs: Why graph databases are becoming critical for providing deterministic, reliable information to probabilistic AI models, moving beyond simple vector similarity. AI's Impact on Tooling: How tools like Lovable and Cursor are changing workflows for prototyping and coding, and the risk of creating the "-10x engineer." The Future of Data Engineering: How the field is expanding as AI becomes the primary consumer of data, requiring a new focus on architecture, semantics, and managing complexity at scale.

Summary: In this crossover episode of the AI Engineering Podcast, host Tobias Macey interviews Brijesh Tripathi, CEO of Flex AI, about revolutionizing AI engineering by removing DevOps burdens through "workload as a service". Brijesh shares his expertise from leading AI/HPC architecture at Intel and deploying supercomputers like Aurora, highlighting how access friction and idle infrastructure slow progress. Join them as they discuss Flex AI's innovative approach to simplifying heterogeneous compute, standardizing on consistent Kubernetes layers, and abstracting inference across various accelerators, allowing teams to iterate faster without wrestling with drivers, libraries, or cloud-by-cloud differences. Brijesh also shares insights into Flex AI's strategies for lifting utilization, protecting real-time workloads, and spanning the full lifecycle from fine-tuning to autoscaled inference, all while keeping complexity at bay.

Pre-amble: I hope you enjoy this cross-over episode of the AI Engineering Podcast, another show that I run to act as your guide to the fast-moving world of building scalable and maintainable AI systems. As generative AI models have grown more powerful and are being applied to a broader range of use cases, the lines between data and AI engineering are becoming increasingly blurry. The responsibilities of data teams are being extended into the realm of context engineering, as well as designing and supporting new infrastructure elements that serve the needs of agentic applications. This episode is an example of the types of work that are not easily categorized into one or the other camp.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey and today I'm interviewing Brijesh Tripathi about FlexAI, a platform offering a service-oriented abstraction for AI workloads.

Interview

• Introduction
• How did you get involved in machine learning?
• Can you describe what FlexAI is and the story behind it?
• What are some examples of the ways that infrastructure challenges contribute to friction in developing and operating AI applications?
• How do those challenges contribute to issues when scaling new applications/businesses that are founded on AI?
• There are numerous managed services and deployable operational elements for operationalizing AI systems. What are some of the main pitfalls that teams need to be aware of when determining how much of that infrastructure to own themselves?
• Orchestration is a key element of managing the data and model lifecycles of these applications. How does your approach of "workload as a service" help to mitigate some of the complexities in the overall maintenance of that workload?
• Can you describe the design and architecture of the FlexAI platform?
• How has the implementation evolved from when you first started working on it?
• For someone who is going to build on top of FlexAI, what are the primary interfaces and concepts that they need to be aware of?
• Can you describe the workflow of going from problem to deployment for an AI workload using FlexAI?
• One of the perennial challenges of making a well-integrated platform is that there are inevitably pre-existing workloads that don't map cleanly onto the assumptions of the vendor. What are the affordances and escape hatches that you have built in to allow partial/incremental adoption of your service?
• What are the elements of AI workloads and applications that you are explicitly not trying to solve for?
• What are the most interesting, innovative, or unexpected ways that you have seen FlexAI used?
• What are the most interesting, unexpected, or challenging lessons that you have learned while working on FlexAI?
• When is FlexAI the wrong choice?
• What do you have planned for the future of FlexAI?

Contact Info

• LinkedIn

Parting Question

• From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

Links

• Flex AI
• Aurora Super Computer
• CoreWeave
• Kubernetes
• CUDA
• ROCm
• Tensor Processing Unit (TPU)
• PyTorch
• Triton
• Trainium
• ASIC == Application Specific Integrated Circuit
• SOC == System On a Chip
• Lovable
• FlexAI Blueprints
• Tenstorrent

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

At PyData Berlin, community members and industry voices highlighted how AI and data tooling are evolving across knowledge graphs, MLOps, small-model fine-tuning, explainability, and developer advocacy.

  • Igor Kvachenok (Leuphana University / ProKube) combined knowledge graphs with LLMs for structured data extraction in the polymer industry, and noted how MLOps is shifting toward LLM-focused workflows.
  • Selim Nowicki (Distill Labs) introduced a platform that uses knowledge distillation to fine-tune smaller models efficiently, making model specialization faster and more accessible.
  • Gülsah Durmaz (Architect & Developer) shared her transition from architecture to coding, creating Python tools for design automation and volunteering with PyData through PyLadies.
  • Yashasvi Misra (Pure Storage) spoke on explainable AI, stressing accountability and compliance, and shared her perspective as both a data engineer and active Python community organizer.
  • Mehdi Ouazza (MotherDuck) reflected on developer advocacy through video, workshops, and branding, showing how creative communication boosts adoption of open-source tools like DuckDB.

Igor Kvachenok Master’s student in Data Science at Leuphana University of Lüneburg, writing a thesis on LLM-enhanced data extraction for the polymer industry. Builds RDF knowledge graphs from semi-structured documents and works at ProKube on MLOps platforms powered by Kubeflow and Kubernetes.

Connect: https://www.linkedin.com/in/igor-kvachenok/
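For readers unfamiliar with the approach, here is a minimal rdflib sketch of the kind of RDF triples an LLM-based extractor might emit from a polymer datasheet; the namespace and property names are hypothetical.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/polymer/")  # hypothetical namespace for the example
g = Graph()

# Triples that an extraction pipeline might emit from a semi-structured datasheet.
g.add((EX.Polymer_PE100, RDF.type, EX.Polymer))
g.add((EX.Polymer_PE100, EX.density, Literal(0.959)))
g.add((EX.Polymer_PE100, EX.meltFlowRate, Literal(0.25)))

print(g.serialize(format="turtle"))
```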

Selim Nowicki Founder of Distill Labs, a startup making small-model fine-tuning simple and fast with knowledge distillation. Previously led data teams at Berlin startups like Delivery Hero, Trade Republic, and Tier Mobility. Sees parallels between today’s ML tooling and dbt’s impact on analytics.

Connect: https://www.linkedin.com/in/selim-nowicki/
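As background on the technique Distill Labs builds on, knowledge distillation trains a small student model against a larger teacher's soft predictions. A minimal PyTorch-style loss sketch (a generic illustration, not Distill Labs' implementation) might look like this:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher) with ordinary hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale to keep gradient magnitudes comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```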

Gülsah Durmaz Architect turned developer, creating Python-based tools for architectural design automation with Rhino and Grasshopper. Active in PyLadies and a volunteer at PyData Berlin, she values the community for networking and learning, and aims to bring ML into architecture workflows.

Connect: https://www.linkedin.com/in/gulsah-durmaz/

Yashasvi (Yashi) Misra Data Engineer at Pure Storage, community organizer with PyLadies India, PyCon India, and Women Techmakers. Advocates for inclusive spaces in tech and speaks on explainable AI, bridging her day-to-day in data engineering with her passion for ethical ML.

Connect: https://www.linkedin.com/in/misrayashasvi/

Mehdi Ouazza Developer Advocate at MotherDuck, formerly a data engineer, now focused on building community and education around DuckDB. Runs popular YouTube channels ("mehdio DataTV" and "MotherDuck") and delivered a hands-on workshop at PyData Berlin. Blends technical clarity with creative storytelling.

Connect: https://www.linkedin.com/in/mehd-io/

Minus Three Tier: Data Architecture Turned Upside Down

Every data architecture diagram out there makes it abundantly clear who's in charge: at the bottom sits the analyst, above that an API server, and at the very top the mighty data warehouse. This pattern is so ingrained that we never question its necessity, despite its many issues: slow response times, scaling problems at every tier, and massive cost.

But there is another way: decoupling storage from compute lets query processing move closer to the people who use it, leading to much snappier responses, natural scaling through client-side query processing, and much lower cost.

This talk discusses how modern data engineering paradigms like decomposed storage, single-node query processing, and lakehouse formats enable a radical departure from the tired three-tier architecture. By inverting the architecture we can put users' needs first and rely on commoditised components like object storage to build fast, scalable, and cost-effective solutions.
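A minimal sketch of the inverted pattern: DuckDB runs the query on the client, directly against Parquet files in an object store (the bucket path is a hypothetical example, and appropriate S3 credentials are assumed to be configured).

```python
import duckdb

con = duckdb.connect()              # in-process, single-node engine on the client
con.execute("INSTALL httpfs;")      # enable reading directly from object storage
con.execute("LOAD httpfs;")

# The warehouse tier disappears: the lakehouse files are queried where the user sits.
df = con.execute("""
    SELECT customer_id, sum(amount) AS total_amount
    FROM read_parquet('s3://example-bucket/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_amount DESC
""").fetchdf()
print(df.head())
```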

This presentation provides an overview of how NVIDIA RAPIDS accelerates data science and data engineering workflows end-to-end. Key topics include leveraging RAPIDS for machine learning, large-scale graph analytics, real-time inference, hyperparameter optimization, and ETL processes. Case studies demonstrate significant performance improvements and cost savings across various industries using RAPIDS for Apache Spark, XGBoost, cuML, and other GPU-accelerated tools. The talk emphasizes the impact of accelerated computing on modern enterprise applications, including LLMs, recommenders, and complex data processing pipelines.
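As a small illustration of the drop-in style RAPIDS aims for, a cuDF snippet mirrors the pandas API while executing on the GPU (the file name is a hypothetical placeholder):

```python
import cudf

# Columnar read and groupby run on the GPU; the API mirrors pandas.
events = cudf.read_parquet("events.parquet")
daily = events.groupby("event_date")["value"].mean().reset_index()

# Move the (small) aggregated result back to host memory for downstream tools.
daily_pdf = daily.to_pandas()
print(daily_pdf.head())
```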

Face To Face
by Sian Rodway (Manuka AI), Sam Cremins (Kingsley Napley), Leanne Lynch (ISS UK&I)

Data remains one of the most valuable assets a company has to guide its decision-making. How that data is processed, used, and presented is changing rapidly, and with it the role and skills of data engineers.

In this fireside chat, Manuka will explore the future of data engineering and the ongoing challenge of reconciling legacy constraints and governance with the latest breakthroughs in AI.

Expect a grounded discussion on:

• What “AI-ready” really means for data engineers

• Engineering through legacy constraints in a highly regulated environment

• Designing ingestion, orchestration, and observability that scale

• Embedding governance and quality without slowing delivery

• What’s next for data engineering in the age of generative AI

Whether you’re building pipelines, managing platforms, or designing modern data infrastructure, this is a rare behind-the-scenes look at how data engineering is evolving to meet the AI moment.

Are AI code generators delivering SQL that "looks right but works wrong" for your data engineering challenges? Is your AI generating brilliant-sounding but functionally flawed results? 

The critical bottleneck isn't the AI's intelligence; it's the missing context.

In this talk, we will put things in context and reveal how providing AI with structured, deep understanding—from data semantics and lineage to user intent and external knowledge—is the true paradigm shift.

We'll explore how this context engineering powers the rise of dependable AI agents and leverages techniques like Retrieval-Augmented Generation (RAG) to move beyond mere text generation towards trustworthy, intelligent automation across all domains. 

This missing-context limitation highlights a broader challenge across AI applications: the need for systems to possess a deep understanding of all relevant signals, ranging from environmental cues and user history to explicit intent, in order to operate reliably and meaningfully.
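As a concrete, if simplified, picture of what supplying that context looks like, the sketch below wires retrieved metadata into the prompt; the retriever and llm objects are hypothetical placeholders for whatever vector store and model client you use.

```python
# Hypothetical interfaces: retriever.search() returns objects with a .text attribute,
# and llm.complete() sends a prompt to your model of choice.
def answer_with_context(question: str, retriever, llm) -> str:
    # 1. Retrieve structured context: table docs, lineage notes, semantic definitions.
    docs = retriever.search(question, top_k=5)
    context = "\n\n".join(doc.text for doc in docs)

    # 2. Ground the model in that context instead of letting it guess schema details.
    prompt = (
        "Answer using ONLY the context below, and cite the table or column you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)
```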

Join us for real-world, practical case studies directly from data engineers that demonstrate precisely how to unlock this transformative power and achieve truly reliable AI.

Join Amperity’s Marcus Owens, Lead Solution Consultant, to learn more about the rapid innovations in data architecture brought by the new wave of AI agents. This session will start with a quick overview of what makes a good AI Agent – and then focus on how Agentic strategies can accelerate two key needs in customer data: 

Make Customer Data Usable – How AI Agents accelerate customer data engineering with Amperity’s Stitch and Chuck Data – saving data engineering teams hundreds of hours of effort. 

Make Use of Customer Data – How AmpAI allows Marketers to build outcome-driven customer journeys, going from intent to results faster than ever before.

AI is only as good as the data it runs on. Yet Gartner predicts that in 2026, over 60% of AI projects will fail to deliver value because the underlying data isn’t truly AI-ready. “Good enough” data isn’t enough.

In this exclusive BDL launch session, DataOps.live reveals Momentum, the next generation of its DataOps automation platform, designed to operationalize trusted AI at enterprise scale.

Based on experience from building over 9,000 Data Products to date, Momentum introduces breakthrough capabilities including AI-Ready Data Scoring to ensure data is fit for AI use cases, Data Product Lineage for end-to-end visibility, and a Data Engineering Agent that accelerates building reusable data products. Combined with automated CI/CD, continuous observability, and governance enforcement, Momentum closes the AI-readiness gap by embedding collaboration, metadata, and automation across the entire data lifecycle.

Backed by Snowflake Ventures and trusted by leading enterprises including AstraZeneca, Disney, and AT&T, DataOps.live is the proven catalyst for scaling AI-ready data. In this session, you’ll unpack what AI-ready data really means, learn essential practices, and discover a faster, easier, and more impactful way to make your AI initiatives succeed.

Be the first to see Momentum in action - the future of AI-ready data.

This session will provide a Maia demo with roadmap teasers. The demo will showcase Maia’s core capabilities: authoring pipelines in business language, multiplying productivity by accelerating tasks, and enabling self-service. It demonstrates how Maia takes natural language prompts and translates them into YAML-based, human-readable Data Pipeline Language (DPL), generating graphical pipelines. Expect to see Maia interacting with Snowflake metadata to sample data and suggest transformations, as well as its ability to troubleshoot and debug pipelines in real time. The session will also cover how Maia can create custom connectors from REST API documentation in seconds, a task that traditionally takes days. Roadmap teasers will likely include the upcoming Semantic Layer, a Pipeline Reviewing Agent, and enhanced file type support for various legacy ETL tools and code conversions.

Data teams know the pain of moving from proof-of-concepts to production. We’ve all seen brittle scripts, one-off notebooks, and manual fixes turn into hidden risks. With large language models, the same story is playing out, unless we borrow the lessons of modern data engineering.

This talk introduces a declarative approach to LLM engineering using DSPy and Dagster. DSPy treats prompts, retrieval strategies, and evaluation metrics as first-class, composable building blocks. Instead of tweaking text by hand, you declare the behavior you want, and DSPy optimizes and tunes the pipeline for you. Dagster is built on a similar premise; with Dagster Components, you can build modular and declarative pipelines.
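As a small, hedged illustration of that declarative style (not code from the talk), a DSPy signature plus module might look like the following; the model name, field names, and schema string are assumptions.

```python
import dspy

# Assumed model identifier; configure whichever LM your stack actually uses.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class SQLFromQuestion(dspy.Signature):
    """Translate an analytics question into SQL against a documented schema."""
    question: str = dspy.InputField()
    schema: str = dspy.InputField(desc="DDL or column documentation for the relevant tables")
    sql: str = dspy.OutputField()

generate_sql = dspy.ChainOfThought(SQLFromQuestion)
result = generate_sql(
    question="Weekly active users by region?",
    schema="events(user_id, region, event_ts)",
)
print(result.sql)
```

From there, a DSPy optimizer can tune the prompt and few-shot examples against an evaluation metric, while an orchestrator such as Dagster schedules and observes the resulting pipeline.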

This approach means:

- Trust & auditability: Every LLM output can be traced back through a reproducible graph.

- Safety in production: Automated evaluation loops catch drift and regressions before they matter.

- Scalable experimentation: The same declarative spec can power quick tests or robust, HIPAA/GxP-grade pipelines.

By treating LLM workflows like data pipelines (declarative, observable, and orchestrated), we can avoid the prompt spaghetti trap and build AI systems that meet the same reliability bar as the rest of the stack.