talk-data.com

Topic: Data Engineering

Tags: etl, data_pipelines, big_data

1127 tagged activities

Activity Trend: peak of 127 activities per quarter, 2020-Q1 to 2026-Q1

Activities (1127 · newest first)

Metadata-Driven Streaming Ingestion Using Lakeflow Declarative Pipelines, Azure Event Hubs and a Schema Registry

At Plexure, we ingest hundreds of millions of customer activities and transactions into our data platform every day, fuelling our personalisation engine and providing insights into the effectiveness of marketing campaigns. We're on a journey to transition from infrequent batch ingestion to near real-time streaming using Azure Event Hubs and Lakeflow Declarative Pipelines. This transformation will allow us to react to customer behaviour as it happens, rather than hours or even days later. It also enables us to move faster in other ways. By leveraging a Schema Registry, we've created a metadata-driven framework that allows data producers to evolve schemas with confidence, ensuring downstream processes continue running smoothly, and to seamlessly publish new datasets into the data platform without requiring Data Engineering assistance. Join us to learn more about our journey and see how we're implementing this with Lakeflow Declarative Pipelines meta-programming, including a live demo of the end-to-end process!
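To give a flavour of what this meta-programming pattern can look like, here is a minimal, hedged sketch using the Python dlt API: a loop that generates one bronze streaming table per dataset described in metadata. The dataset names, topics, broker address and inlined schemas are placeholders, not Plexure's actual registry contents, and a real Event Hubs connection would also need authentication options.

```python
# Runs inside a Lakeflow Declarative (DLT) pipeline, where `spark` is provided by the runtime.
import dlt
from pyspark.sql import functions as F

# Hypothetical metadata; in the real framework this would come from a config store
# backed by the schema registry.
datasets = [
    {"name": "customer_activity", "topic": "activity-events",
     "schema": "customer_id STRING, event_type STRING, ts TIMESTAMP"},
    {"name": "transactions", "topic": "transaction-events",
     "schema": "customer_id STRING, amount DOUBLE, ts TIMESTAMP"},
]

def make_bronze_table(cfg):
    # Factory function: each call captures its own cfg, avoiding late-binding surprises
    @dlt.table(name=f"bronze_{cfg['name']}",
               comment=f"Auto-generated streaming ingestion for topic {cfg['topic']}")
    def _ingest():
        raw = (
            spark.readStream.format("kafka")  # Event Hubs exposes a Kafka-compatible endpoint
            .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")  # placeholder
            .option("subscribe", cfg["topic"])
            .load()
        )
        # Parse the payload using the schema published to the registry (inlined here as a DDL string)
        return (raw
                .select(F.from_json(F.col("value").cast("string"), cfg["schema"]).alias("payload"))
                .select("payload.*"))
    return _ingest

for cfg in datasets:
    make_bronze_table(cfg)
```

The factory function is the key trick: because each table definition closes over its own config entry, adding a new dataset is a metadata change rather than new pipeline code.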

Transforming Data Pipeline Management With a Targeted Proof of Concept

At Capital One, data-driven decision making is paramount to our success. This session explores how a focused proof of concept (POC) accelerated a shift in our data pipeline management strategy, resulting in operational improvements and expanded analytical capabilities. We'll cover the business challenges that motivated the POC, including data latency, the need for cost savings and scalability limitations, along with real-world results. We'll also examine the before-and-after architecture, highlighting the key technological levers. This session offers insights for data engineering and machine learning practitioners seeking to optimize their data pipelines for improved performance, scalability and business value.

Authoring Data Pipelines With the New Lakeflow Declarative Pipelines Editor

We’re introducing a new developer experience for Lakeflow Declarative Pipelines designed for data practitioners who prefer a code-first approach and expect robust developer tooling. The new multi-file editor brings an IDE-like environment to declarative pipeline development, making it easy to structure transformation logic, configure pipelines throughout the development lifecycle and iterate efficiently. Features like contextual data previews and selective table updates enable step-by-step development. UI-driven tools, such as DAG previews and DAG-based actions, enhance productivity for experienced users and provide a bridge for those transitioning to declarative workflows. In this session, we’ll showcase the new editor in action, highlighting how these enhancements simplify declarative coding and improve development for production-ready data pipelines. Whether you’re an experienced developer or new to declarative data engineering, join us to see how Lakeflow Declarative Pipelines can enhance your data practice.

Introducing Lakeflow: The Future of Data Engineering on Databricks

Join us to explore Lakeflow, Databricks' end-to-end solution for simplifying and unifying the most complex data engineering workflows. This session builds on the keynote announcements, offering an accessible introduction for newcomers while emphasizing the transformative value Lakeflow delivers. We’ll cover: What is Lakeflow? – a cohesive overview of its components: Lakeflow Connect, Lakeflow Declarative Pipelines, and Lakeflow Jobs. Core Capabilities in Action – live demos showcasing no-code data ingestion, code-optional declarative pipelines, and unified, end-to-end orchestration. Vision for the Future – a look at the roadmap, introducing no-code and open-source initiatives. Discover how Lakeflow equips data teams with a seamless experience for ingestion, transformation, and orchestration, reducing complexity and driving productivity. By unifying these capabilities, Lakeflow lays the groundwork for scalable, reliable, efficient data pipelines in a governed and high-performing environment.

Building a Self-Service Data Platform With a Small Data Team

Discover how Dodo Brands, a global pizza and coffee business with over 1,200 retail locations and 40,000 employees, revolutionized their analytics infrastructure by creating a self-service data platform. This session explores their approach to empowering analysts, data scientists and ML engineers to independently build analytical pipelines with minimal involvement from data engineers. By leveraging Databricks as the backbone of the platform, the team developed automated tools like a "job-generator" that uses Jinja templates to streamline the creation of data jobs. This approach minimized manual coding and enabled non-data engineers to create over 1,420 data jobs, 90% of which were auto-generated from user configurations, while supporting thousands of weekly active users via tools like Apache Superset. This session provides actionable insights for organizations seeking to scale their analytics capabilities efficiently without expanding their data engineering teams.
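As a rough illustration of the "job-generator" idea, here is a hedged sketch of rendering a job specification from a user-supplied config with a Jinja template. The template, field names and notebook path are hypothetical, not Dodo Brands' actual implementation, and a real spec would also carry compute settings.

```python
# Minimal sketch: analysts supply a small config, the generator renders a job spec.
import json
from jinja2 import Template

JOB_TEMPLATE = Template("""
{
  "name": "{{ job_name }}",
  "schedule": {"quartz_cron_expression": "{{ cron }}", "timezone_id": "UTC"},
  "tasks": [
    {
      "task_key": "run_{{ job_name }}",
      "notebook_task": {"notebook_path": "{{ notebook_path }}"}
    }
  ]
}
""")

def render_job(user_config: dict) -> dict:
    """Turn an analyst's declarative config into a job spec ready to submit via API or CLI."""
    return json.loads(JOB_TEMPLATE.render(**user_config))

if __name__ == "__main__":
    spec = render_job({
        "job_name": "daily_sales_report",
        "notebook_path": "/Repos/analytics/daily_sales_report",
        "cron": "0 0 6 * * ?",
    })
    print(json.dumps(spec, indent=2))
```

Because the rendered output is just a job specification, the same generator can feed whatever deployment path the platform team prefers, which is what lets non-engineers create jobs from configuration alone.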

This introductory workshop caters to data engineers seeking hands-on experience and data architects looking to deepen their knowledge. The workshop is structured to provide a solid understanding of the following data engineering and streaming concepts: an introduction to Lakeflow and the Data Intelligence Platform; getting started with Lakeflow Declarative Pipelines for declarative data pipelines in SQL using Streaming Tables and Materialized Views; mastering Databricks Workflows with advanced control flow and triggers; understanding serverless compute; data governance and lineage with Unity Catalog; and generative AI for data engineers with Genie and Databricks Assistant. We believe you can only become an expert if you work on real problems and gain hands-on experience. Therefore, we will equip you with your own lab environment in this workshop and guide you through practical exercises like using GitHub, ingesting data from various sources, creating batch and streaming data pipelines, and more.
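For readers new to the Streaming Table and Materialized View concepts the workshop covers, here is a minimal, hedged sketch in the Python dlt API (the SQL syntax taught in the workshop is equivalent); the landing path, file format and column names are assumptions for illustration only.

```python
# Runs inside a Lakeflow Declarative (DLT) pipeline, where `spark` is provided by the runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Streaming table: ingests new files incrementally via Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader
        .option("cloudFiles.format", "json")       # assumed file format
        .load("/Volumes/demo/raw/events/")         # hypothetical landing path
    )

@dlt.table(comment="Materialized view: aggregate kept up to date by the pipeline")
def daily_event_counts():
    return (
        dlt.read("raw_events")
        .groupBy(F.to_date("event_ts").alias("event_date"))
        .count()
    )
```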

Sponsored by: SAP | SAP Business Data Cloud: Fuel AI with SAP data products across ERP and lines-of-business

Unlock the power of your SAP data with SAP Business Data Cloud—a fully managed SaaS solution that unifies and governs all SAP data while seamlessly connecting it with third-party data. As part of SAP Business Data Cloud, SAP Databricks brings together trusted, semantically rich business data with industry-leading capabilities in AI, machine learning, and data engineering. Discover how to access curated SAP data products across critical business processes, enrich and harmonize your data without data copies using Delta Sharing, and leverage the results across your business data fabric. See it all in action with a demonstration.

Summary In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like, so you can stop fighting over column names and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today for a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week, which starts June 9th.

This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across the AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.

Your host is Tobias Macey, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
Methods such as tool use (exemplified by MCP) are a means of bolting AI models onto systems like Trino. What are some of the ways that is insufficient or cumbersome?
Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
What are the foundational architectural modifications that you had to make to enable those capabilities?
For the vector storage and indexing, what modifications did you have to make to Iceberg?
What was your reasoning for not using a format like Lance?
For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
When is Starburst/lakehouse the wrong choice for a given AI use case?
What do you have planned for the future of AI on Starburst?

Contact Info: LinkedIn

Parting Question: From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links: Starburst, Podcast Episode, AWS Athena, MCP (Model Context Protocol), LLM Tool Use, Vector Embeddings, RAG (Retrieval Augmented Generation), AI Engineering Podcast Episode, Starburst Data Products, Lance, LanceDB, Parquet, ORC, pgvector, Starburst Icehouse

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

A Practitioner’s Guide to Databricks Serverless

This session is repeated. Databricks Serverless revolutionizes data engineering and analytics by eliminating the complexities of infrastructure management. This talk will provide an overview of this powerful serverless compute option, highlighting how it enables practitioners to focus solely on building robust data pipelines. We'll explore the core benefits, including automatic scaling, cost optimization and seamless integration with the Databricks ecosystem. Learn how serverless workflows simplify the orchestration of various data tasks, from ingestion to dashboards, ultimately accelerating time-to-insight and boosting productivity. This session is ideal for data engineers, data scientists and analysts looking to leverage the agility and efficiency of serverless computing in their data workflows.

Simplify Data Ingest and Egress with the New Python Data Source API

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.
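To give a sense of what the API enables, here is a hedged, minimal sketch of a custom batch source registered through the Python Data Source API (available in recent Spark and Databricks runtimes); the source name, schema and returned rows are purely illustrative, and a real source would wrap a proprietary API or file format.

```python
from pyspark.sql.datasource import DataSource, DataSourceReader
from pyspark.sql.types import StructType

class GreetingsDataSource(DataSource):
    """Toy custom source; illustrative only."""

    @classmethod
    def name(cls):
        return "greetings"                      # used as the format name in spark.read

    def schema(self):
        return "name STRING, greeting STRING"   # DDL string (a StructType also works)

    def reader(self, schema: StructType):
        return GreetingsReader(self.options)

class GreetingsReader(DataSourceReader):
    def __init__(self, options):
        self.options = options                  # options passed via spark.read.option(...)

    def read(self, partition):
        # Yield tuples that match the declared schema
        for name in ("Ada", "Grace"):
            yield (name, f"Hello, {name}!")

# `spark` is the active SparkSession (Spark 4.0+ or a recent Databricks runtime).
# Register once per session, then use the source like any built-in format.
spark.dataSource.register(GreetingsDataSource)
df = spark.read.format("greetings").load()
df.show()
```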

Unlocking the Power of Iceberg: Our Journey to a Unified Lakehouse on Databricks

This session showcases our journey of adopting Apache Iceberg™ to build a modern lakehouse architecture and leveraging Databricks advanced Iceberg support to take it to the next level. We’ll dive into the key design principles behind our lakehouse, the operational challenges we tackled and how Databricks enabled us to unlock enhanced performance, scalability and streamlined data workflows. Whether you’re exploring Apache Iceberg™ or building a lakehouse on Databricks, this session offers actionable insights, lessons learned and best practices for modern data engineering.

Highways and Hexagons: Processing Large Geospatial Datasets With H3

The problem of matching GPS locations to roads and local government areas (LGAs) involves handling large datasets and a number of geospatial operations. In this deep dive, we will outline the challenges of developing scalable solutions for these tasks. We will discuss our multi-step approach, first focusing on the use of H3 indexing to isolate matches with single candidates, then explaining the use of different geospatial computational techniques to accurately match points with multiple candidates. From a technical perspective, the talk will showcase the use of broadcasting and partitioning techniques and their effect on autoscaling, memory usage and effective data parallelization. This session is for anyone interested in geospatial data, Spark performance optimization and the real-world challenges of large-scale data engineering.
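As a hedged sketch of the "single candidate" step described above: index each GPS ping to an H3 cell and join it against a precomputed cell-to-LGA lookup, leaving ambiguous cells for the exact point-in-polygon pass. The table names, columns and resolution below are assumptions, and h3-py v4 function names are used.

```python
import h3                                      # pip install "h3>=4"
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

RESOLUTION = 9                                 # assumed resolution; tune per boundary density

@F.udf(returnType=StringType())
def to_h3_cell(lat: float, lon: float) -> str:
    # Convert a point to its H3 cell id at the chosen resolution
    return h3.latlng_to_cell(lat, lon, RESOLUTION)

pings = spark.table("gps_pings").withColumn("h3_cell", to_h3_cell("lat", "lon"))

# Small lookup table computed offline: every cell covered by exactly one LGA polygon
cell_to_lga = spark.table("cell_to_lga")       # columns: h3_cell, lga_name

# Broadcasting the lookup avoids shuffling the much larger pings dataset
matched = pings.join(F.broadcast(cell_to_lga), on="h3_cell", how="left")
# Pings left unmatched here (cells straddling multiple LGAs) fall through to the
# exact geospatial matching step covered in the talk.
```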

Sponsored by: Astronomer | Scaling Data Teams for the Future

The role of data teams and data engineers is evolving. No longer just pipeline builders or dashboard creators, today’s data teams must evolve to drive business strategy, enable automation, and scale with growing demands. Best practices from the software engineering world and the DevOps movement (Agile development, CI/CD, and infrastructure-as-code) are gradually making their way into data engineering. We believe these changes have led to the rise of DataOps and a new wave of best practices that will transform the discipline of data engineering. But how do you transform a reactive team into a proactive force for innovation? We’ll explore the key principles for building a resilient, high-impact data team, from structuring for collaboration, testing and automation to leveraging modern orchestration tools. Whether you’re leading a team or looking to future-proof your career, you’ll walk away with actionable insights on how to stay ahead in the rapidly changing data landscape.

This course provides a comprehensive review of DevOps principles and their application to Databricks projects. It begins with an overview of core DevOps, DataOps, continuous integration (CI), continuous deployment (CD), and testing, and explores how these principles can be applied to data engineering pipelines. The course then focuses on continuous deployment within the CI/CD process, examining tools like the Databricks REST API, SDK, and CLI for project deployment. You will learn about Databricks Asset Bundles (DABs) and how they fit into the CI/CD process. You’ll dive into their key components, folder structure, and how they streamline deployment across various target environments in Databricks. You will also learn how to add variables, modify, validate, deploy, and execute Databricks Asset Bundles for multiple environments with different configurations using the Databricks CLI. Finally, the course introduces Visual Studio Code as an Integrated Development Environment (IDE) for building, testing, and deploying Databricks Asset Bundles locally, optimizing your development process. The course concludes with an introduction to automating deployment pipelines using GitHub Actions to enhance the CI/CD workflow with Databricks Asset Bundles. By the end of this course, you will be equipped to automate Databricks project deployments with Databricks Asset Bundles, improving efficiency through DevOps practices. Prerequisites: strong knowledge of the Databricks platform, including experience with Databricks Workspaces, Apache Spark, Delta Lake, the Medallion Architecture, Unity Catalog, Delta Live Tables, and Workflows; in particular, knowledge of leveraging Expectations with Lakeflow Declarative Pipelines. Labs: Yes. Certification Path: Databricks Certified Data Engineer Professional
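To make the bundle deployment step concrete, here is a hedged sketch of driving the Databricks CLI's bundle commands from a Python script, the kind of step a GitHub Actions workflow might invoke; the target names are assumptions about what a project's databricks.yml defines.

```python
# Minimal sketch: validate and deploy a Databricks Asset Bundle to several targets.
import subprocess

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)            # fail the pipeline if any step fails

def deploy_bundle(target: str) -> None:
    # Validate the bundle configuration (databricks.yml) before deploying
    run(["databricks", "bundle", "validate", "--target", target])
    # Deploy the jobs/pipelines defined in the bundle to the chosen environment
    run(["databricks", "bundle", "deploy", "--target", target])

if __name__ == "__main__":
    for env in ("dev", "staging", "prod"):     # hypothetical targets from databricks.yml
        deploy_bundle(env)
```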

In today's data-driven world, the ability to efficiently manage and transform data is crucial for any organization. This presentation will explore the process of converting a complex and messy Alteryx workflow into a clean and simple Lakeflow Declarative Pipeline at a large integrated health system, Intermountain Health. Alteryx is a powerful tool for data preparation and blending, but as workflows grow in complexity, they can become difficult to manage and maintain. Lakeflow Declarative Pipelines, on the other hand, offers a more democratized, streamlined and scalable approach to data engineering, leveraging the power of Apache Spark and Delta Lake. We will begin by examining a typical legacy workflow, identifying common pain points such as tangled logic, performance bottlenecks and maintenance challenges. Next, we will demonstrate how to translate this workflow into a Lakeflow Declarative Pipeline, highlighting key steps such as data transformation, validation and delivery.
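As a hedged sketch of what a translated step can look like in the Python dlt API, the example below pairs a transformation with a validation rule expressed as an expectation; the table and column names are hypothetical, not Intermountain Health's actual pipeline.

```python
# Runs inside a Lakeflow Declarative (DLT) pipeline, where `spark` is provided by the runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Cleaned encounters, replacing an Alteryx prep-and-blend step")
@dlt.expect_or_drop("valid_patient_id", "patient_id IS NOT NULL")  # validation: drop failing rows
def cleaned_encounters():
    return (
        spark.read.table("raw.encounters")                # hypothetical source table
        .withColumn("encounter_date", F.to_date("encounter_ts"))
        .dropDuplicates(["encounter_id"])
    )
```

Expressing validation as declarative expectations is what replaces the ad hoc filter and error-handling branches of the legacy workflow, since the pipeline records and enforces the rules itself.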

Sponsored by: SAP | SAP and Databricks Open a Bold New Era of Data and AI​

SAP and Databricks have formed a landmark partnership that brings together SAP's deep expertise in mission-critical business processes and semantically rich data with Databricks' industry-leading capabilities in AI, machine learning, and advanced data engineering. From curated, SAP-managed data products to zero-copy Delta Sharing integration, discover how SAP Business Data Cloud empowers data and AI professionals to build AI solutions that unlock unparalleled business insights using trusted business data.

Federated Data Analytics Platform

Are you struggling to keep up with rapid business changes that demand constant updates to your data pipelines? Is your data engineering team growing rapidly just to manage this complexity? Databricks was not immune to this challenge either. Managing our BI with contributions from hundreds of Product Engineering Teams across the company while maintaining central oversight and quality posed significant hurdles. Join us to learn how we developed a config-driven data pipeline framework using Metric Store and UC Metrics that helped us reduce engineering effort — achieving the work of 100 classical data engineers with just two platform engineers.

Sponsored by: Prophecy | Taming Industry-Specific Data Sets: How to Simplify Access and Collaboration to FHIR and Beyond

Industry-standard data formats can streamline data exchange, but they come with significant complexity and can seem ‘impossible’ to work with. One example is FHIR (Fast Healthcare Interoperability Resources), the healthcare data standard with a broad scope of 180 entities. With its widespread adoption, FHIR promises to speed the delivery of insights that improve patient experiences, optimize operations, and drive better outcomes. Yet working with FHIR data is no small task. Its complexity and extensions push it beyond the reach of most analysts. Instead, it lands on data engineers, who must load, transform, and keep up with changes. This creates bottlenecks, slows down insights, and places a heavy maintenance burden on already backlogged data engineering teams. Learn how to tame FHIR for analysts. If you want to speed time-to-insight, reduce engineering effort, and enable true cross-functional collaboration on industry-specific data, this is a session you don’t want to miss.

Simplifying Data Pipelines With Lakeflow Declarative Pipelines: A Beginner’s Guide

As part of the new Lakeflow data engineering experience, Lakeflow Declarative Pipelines makes it easy to build and manage reliable data pipelines. It unifies batch and streaming, reduces operational complexity and ensures dependable data delivery at scale — from batch ETL to real-time processing. Lakeflow Declarative Pipelines excels at declarative change data capture, batch and streaming workloads, and efficient SQL-based pipelines. In this session, you’ll learn how we’ve reimagined data pipelining with Lakeflow Declarative Pipelines, including: a brand new pipeline editor that simplifies transformations; serverless compute modes to optimize for performance or cost; full Unity Catalog integration for governance and lineage; reading/writing data with Kafka and custom sources; monitoring and observability for operational excellence; and “Real-time Mode” for ultra-low-latency streaming. Join us to see how Lakeflow Declarative Pipelines powers better analytics and AI with reliable, unified pipelines.
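As a small taste of the Kafka ingestion capability mentioned above, here is a hedged sketch of a streaming table fed by a Kafka topic using the Python dlt API; the broker address, topic name and payload schema are placeholders.

```python
# Runs inside a Lakeflow Declarative (DLT) pipeline, where `spark` is provided by the runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Streaming table fed by a Kafka topic")
def orders_bronze():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
        .option("subscribe", "orders")                       # placeholder topic
        .load()
        .select(
            F.col("key").cast("string"),
            F.from_json(
                F.col("value").cast("string"),
                "order_id STRING, amount DOUBLE, ts TIMESTAMP",  # assumed payload schema
            ).alias("v"),
        )
        .select("key", "v.*")
    )
```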