talk-data.com

Topic: Delta Lake
Tags: data_lake, acid_transactions, time_travel, file_format, storage
347 activities tagged
Activity Trend: 117 peak/qtr (2020-Q1 to 2026-Q1)

Activities
347 activities · Newest first

Lakeflow Connect: Smarter, Simpler File Ingestion With the Next Generation of Auto Loader

Auto Loader is the definitive tool for ingesting data from cloud storage into your lakehouse. In this session, we’ll unveil new features and best practices that simplify every aspect of cloud storage ingestion. We’ll demo out-of-the-box observability for pipeline health and data quality, walk through improvements for schema management, introduce a series of new data formats and unveil recent strides in Auto Loader performance. Along the way, we’ll provide examples and best practices for optimizing cost and performance. Finally, we’ll introduce a preview of what’s coming next — including a REST API for pushing files directly to Delta, a UI for creating cloud storage pipelines and more. Join us to help shape the future of file ingestion on Databricks.
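
For a rough sense of what Auto Loader ingestion looks like in practice, here is a minimal PySpark sketch (not taken from the session). It assumes a Databricks runtime, since the cloudFiles source is Databricks-specific, and the bucket, schema location, checkpoint path, and table name are hypothetical placeholders.

```python
# Minimal Auto Loader sketch (assumes a Databricks runtime; paths and the
# table name are hypothetical placeholders, not material from the session).
raw = (
    spark.readStream.format("cloudFiles")                        # Auto Loader source
    .option("cloudFiles.format", "json")                          # incoming file format
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")   # schema tracking and evolution
    .load("s3://example-bucket/landing/orders/")                  # cloud storage landing path
)

(
    raw.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")      # exactly-once bookkeeping
    .trigger(availableNow=True)                                   # process available files, then stop
    .toTable("bronze.orders")                                     # write into a Delta table
)
```

The availableNow trigger makes the same pipeline usable as a scheduled batch job while keeping the incremental file-discovery benefits of a streaming source.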

This course is designed to introduce three primary machine learning deployment strategies and illustrate the implementation of each strategy on Databricks. Following an exploration of the fundamentals of model deployment, the course delves into batch inference, offering hands-on demonstrations and labs for utilizing a model in batch inference scenarios, along with considerations for performance optimization. The second part of the course comprehensively covers pipeline deployment, while the final segment focuses on real-time deployment. Participants will engage in hands-on demonstrations and labs, deploying models with Model Serving and utilizing the serving endpoint for real-time inference. By mastering deployment strategies for a variety of use cases, learners will gain the practical skills needed to move machine learning models from experimentation to production. This course shows you how to operationalize AI solutions efficiently, whether it's automating decisions in real-time or integrating intelligent insights into data pipelines. Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. common Python libraries for DS/ML like Scikit-Learn, awareness of model deployment strategies) Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
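
As a hedged illustration of the batch-inference pattern the course covers, the sketch below scores a Spark table with a registered MLflow model via mlflow.pyfunc.spark_udf; the model URI, table names, and column names are hypothetical and this is not the course's lab code.

```python
import mlflow.pyfunc

# Hypothetical model URI and table names, for illustration only.
model_uri = "models:/churn_classifier/1"

# Wrap the registered model as a Spark UDF for distributed batch scoring.
predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

features = spark.table("ml.churn_features")
scored = features.withColumn(
    "prediction",
    predict(*[c for c in features.columns if c != "customer_id"]),
)
scored.write.mode("overwrite").saveAsTable("ml.churn_predictions")
```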

The Hitchhiker's Guide to Delta Lake Streaming in an Agentic Universe

As data engineering continues to evolve, the shift from batch-oriented to streaming-first has become standard across the enterprise. The reality is that these changes have been taking shape for the past decade; we also now happen to be standing on the precipice of true disruption through automation, the likes of which we could only dream about before. Yes, AI agents and LLMs are already a large part of our daily lives, but we (as data engineers) are ultimately on the front lines ensuring that the future of AI is powered by consistent, just-in-time data, and Delta Lake is critical to getting us there. This session will share best practices learned the hard way by one of the authors of Delta Lake: The Definitive Guide, including:
- Writing generic applications as components
- Workflow automation tips and tricks
- Tips and tricks for Delta clustering (liquid, z-order, and classic)
- Future facing: leveraging metadata for agentic pipelines and workflow automation
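
To make the clustering bullet concrete, here is a small sketch of the three layouts mentioned (classic partitioning, Z-ordering, and liquid clustering), expressed as Databricks SQL run through PySpark; the table and column names are placeholders, not material from the talk.

```python
# Clustering sketches (table and column names are placeholders).

# Classic: Hive-style partitioning chosen when the table is created.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_classic (
        event_id STRING, user_id STRING, event_date DATE
    ) PARTITIONED BY (event_date)
""")

# Z-order: co-locate related values while compacting files in an existing table.
spark.sql("OPTIMIZE events_classic ZORDER BY (user_id)")

# Liquid clustering: declare clustering columns up front and let OPTIMIZE
# maintain the layout incrementally (requires a recent Databricks/Delta runtime).
spark.sql("""
    CREATE TABLE IF NOT EXISTS events_liquid (
        event_id STRING, user_id STRING, event_ts TIMESTAMP
    ) CLUSTER BY (user_id)
""")
spark.sql("OPTIMIZE events_liquid")
```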

ThredUp’s Journey with Databricks: Modernizing Our Data Infrastructure

Building an AI-ready data platform requires strong governance, performance optimization, and seamless adoption of new technologies. At ThredUp, our Databricks journey began with a need for better data management and evolved into a full-scale transformation powering analytics, machine learning, and real-time decision-making. In this session, we’ll cover:
- Key inflection points: Moving from legacy systems to a modernized Delta Lake foundation
- Unity Catalog’s impact: Improving governance, access control, and data discovery
- Best practices for onboarding: Ensuring smooth adoption for engineering and analytics teams
- What’s next? Serverless SQL and conversational analytics with Genie
Whether you’re new to Databricks or scaling an existing platform, you’ll gain practical insights on navigating the transition, avoiding pitfalls, and maximizing AI and data intelligence.

Unlocking Industrial Intelligence with AVEVA and Agnico Eagle

Industrial data is the foundation for operational excellence, but sharing and leveraging this data across systems presents significant challenges. Fragmented approaches create delays in decision-making, increase maintenance costs, and erode trust in data quality. This session explores how the partnership between AVEVA and Databricks addresses these issues through CONNECT, which integrates directly with Databricks via Delta Sharing. By accelerating time to value, eliminating data wrangling, ensuring high data quality, and reducing maintenance costs, this solution drives faster, more confident decision-making and greater user adoption. We will showcase how Agnico Eagle Mines—the world’s third-largest gold producer with 10 mines across Canada, Australia, Mexico, and Finland—is leveraging this capability to overcome data intelligence barriers at scale. With this solution, Agnico Eagle is making insights more accessible and actionable across its entire organization.

Ursa: Augment Your Lakehouse With Kafka-Compatible Data Streaming Capabilities

As data architectures evolve to meet the demands of real-time GenAI applications, organizations increasingly need systems that unify streaming and batch processing while maintaining compatibility with existing tools. The Ursa Engine is a Kafka-API-compatible data streaming engine built on lakehouse table formats (Iceberg and Delta Lake). Designed to integrate seamlessly with data lakehouse architectures, Ursa extends your lakehouse by enabling streaming ingestion, transformation, and processing through a Kafka-compatible interface. In this session, we will explore how the Ursa Engine augments existing lakehouses with Kafka-compatible capabilities. Attendees will gain insights into the Ursa Engine architecture and real-world use cases. Whether you're modernizing legacy systems or building cutting-edge AI-driven applications, discover how Ursa can help you unlock the full potential of your data.
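
The abstract stays high-level, but the underlying pattern of reading a Kafka-compatible stream and landing it in a Delta table can be sketched with plain Structured Streaming. Nothing below is Ursa-specific: the broker address, topic, and table are hypothetical, and any Kafka-API-compatible endpoint could stand in for the bootstrap servers.

```python
from pyspark.sql.functions import col

# Generic Kafka-compatible ingestion sketch (broker, topic, and table names
# are hypothetical placeholders).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.com:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

(
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/clickstream")
    .toTable("bronze.clickstream")   # lands the stream in a Delta table
)
```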

In this course, you’ll learn how to apply patterns to securely store and delete personal information for data governance and compliance on the Data Intelligence Platform. We’ll cover topics like storing sensitive data appropriately to simplify granting access and processing deletes, processing deletes to ensure compliance with the right to be forgotten, performing data masking, and configuring fine-grained access control to grant appropriate privileges on sensitive data. Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.), and beginner experience with Lakeflow Declarative Pipelines and streaming workloads. Labs: Yes. Certification Path: Databricks Certified Data Engineer Professional
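
As a rough sketch of the patterns this course names, the snippet below pairs a right-to-be-forgotten delete with a Unity Catalog column mask, issued as Databricks SQL through PySpark; the catalog, table, function, and group names are invented placeholders rather than course material.

```python
# Right-to-be-forgotten: remove a subject's rows, then clean up old file
# versions so the data does not linger in the Delta table's history.
spark.sql("DELETE FROM crm.customers WHERE customer_id = 'C-1042'")
spark.sql("VACUUM crm.customers RETAIN 168 HOURS")   # default 7-day retention

# Column masking (Unity Catalog): only members of a privileged group see
# the raw value; everyone else gets a redacted string.
spark.sql("""
    CREATE OR REPLACE FUNCTION crm.mask_email(email STRING)
    RETURNS STRING
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***redacted***'
    END
""")
spark.sql("ALTER TABLE crm.customers ALTER COLUMN email SET MASK crm.mask_email")
```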

In this course, you’ll learn how to develop traditional machine learning models on Databricks. We’ll cover topics like using popular ML libraries, executing common tasks efficiently with AutoML and MLflow, harnessing Databricks' capabilities to track model training, leveraging feature stores for model development, and implementing hyperparameter tuning. Additionally, the course covers AutoML for rapid and low-code model training, ensuring that participants gain practical, real-world skills for streamlined and effective machine learning model development in the Databricks environment. Pre-requisites: Familiarity with Databricks workspace and notebooks, familiarity with Delta Lake and Lakehouse, intermediate level knowledge of Python (e.g. common Python libraries for DS/ML like Scikit-Learn, fundamental ML algorithms like regression and classification, model evaluation with common metrics) Labs: Yes Certification Path: Databricks Certified Machine Learning Associate
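
A minimal sketch of the experiment-tracking workflow described above, using MLflow autologging with scikit-learn; the dataset, model choice, and run name are arbitrary illustrations, not course content.

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Autologging captures parameters, metrics, and the fitted model automatically.
mlflow.sklearn.autolog()

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):   # run name is arbitrary
    model = RandomForestRegressor(n_estimators=100, max_depth=6)
    model.fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))
```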

In this course, you’ll learn how to optimize workloads and physical layout with Spark and Delta Lake and analyze the Spark UI to assess performance and debug applications. We’ll cover topics like streaming, liquid clustering, data skipping, caching, Photon, and more. Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.). Labs: Yes. Certification Path: Databricks Certified Data Engineer Professional
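
As a loose illustration of the tuning levers listed above, the sketch below compacts small files, caches a hot slice of a table, and inspects the physical plan to confirm that filters can be used for data skipping; the table and column names are placeholders, not course lab code.

```python
# Placeholder table and column names; a loose sketch of the tuning workflow.

# Compact small files into larger ones (bin-packing) to reduce file overhead.
spark.sql("OPTIMIZE sales.transactions")

# Cache a frequently reused slice of the table in memory.
hot = spark.table("sales.transactions").filter("store_id = 42")
hot.cache()
print(hot.count())   # first action materializes the cache

# Inspect the physical plan: pushed filters let Delta skip files using
# per-file column statistics (data skipping).
hot.explain(True)

# File-level layout details (number of files, size in bytes) for the table.
spark.sql("DESCRIBE DETAIL sales.transactions").show(truncate=False)
```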

In this course, you’ll learn how to incrementally process data to power analytic insights with Structured Streaming and Auto Loader, and how to apply design patterns for designing workloads to perform ETL on the Data Intelligence Platform with Lakeflow Declarative Pipelines. First, we’ll cover topics including ingesting raw streaming data, enforcing data quality, implementing CDC, and exploring and tuning state information. Then, we’ll cover options to perform a streaming read on a source, requirements for end-to-end fault tolerance, options to perform a streaming write to a sink, and creating an aggregation and watermark on a streaming dataset. Pre-requisites: Ability to perform basic code development tasks using the Databricks workspace (create clusters, run code in notebooks, use basic notebook operations, import repos from git, etc.), intermediate programming experience with SQL and PySpark (extract data from a variety of file formats and data sources, apply a number of common transformations to clean data, reshape and manipulate complex data using advanced built-in functions), intermediate programming experience with Delta Lake (create tables, perform complete and incremental updates, compact files, restore previous versions, etc.), and beginner experience with streaming workloads and familiarity with Lakeflow Declarative Pipelines. Labs: No. Certification Path: Databricks Certified Data Engineer Professional
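
To ground the streaming concepts mentioned (a streaming read, a watermark, an aggregation, and a fault-tolerant write), here is a compact Structured Streaming sketch; the source path, schema, and sink table are hypothetical and this is not the course's own code.

```python
from pyspark.sql.functions import window, col

# Hypothetical source and sink; illustrates watermark + windowed aggregation only.
events = (
    spark.readStream
    .schema("user_id STRING, event_ts TIMESTAMP")     # explicit schema for a file source
    .json("/tmp/landing/events/")                      # streaming read on a directory
)

counts = (
    events
    .withWatermark("event_ts", "10 minutes")           # bound state for late data
    .groupBy(window(col("event_ts"), "5 minutes"), "user_id")
    .count()
)

(
    counts.writeStream
    .outputMode("append")                               # emit finalized windows only
    .option("checkpointLocation", "/tmp/checkpoints/event_counts")  # end-to-end fault tolerance
    .toTable("silver.event_counts")
)
```

The checkpoint location plus a replayable source and an idempotent Delta sink is what gives the pipeline its end-to-end fault-tolerance guarantees.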

Data Ingestion with Lakeflow Connect

In this course, you’ll learn how to ingest data efficiently with Lakeflow Connect and manage that data. Topics include ingestion with built-in connectors for SaaS applications, databases, and file sources, as well as ingestion from cloud object storage, in both batch and streaming modes. We'll cover the new connector components, setting up the pipeline, validating the source, and mapping to the destination for each type of connector. We'll also cover how to ingest data, from batch to streaming, into Delta tables using the UI with Auto Loader, automating ETL with Lakeflow Declarative Pipelines, or using the API. This will prepare you to deliver the high-quality, timely data required for AI-driven applications by enabling scalable, reliable, and real-time data ingestion pipelines. Whether you're supporting ML model training or powering real-time AI insights, these ingestion workflows form a critical foundation for successful AI implementation. Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks), cloud computing concepts (virtual machines, object storage, etc.), production experience working with data warehouses and data lakes, intermediate experience with basic SQL concepts (select, filter, group by, join, etc.), beginner programming experience with Python (syntax, conditions, loops, functions), and beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.). Labs: No. Certification Path: Databricks Certified Data Engineer Associate
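
For a flavor of the declarative option mentioned above, here is a minimal sketch using the Delta Live Tables (Lakeflow Declarative Pipelines) Python API together with Auto Loader; it only runs inside a Databricks pipeline, and the landing path and table name are placeholders.

```python
import dlt  # available only inside a Databricks (Lakeflow) pipeline

@dlt.table(comment="Raw invoices ingested incrementally from cloud storage")
def raw_invoices():
    # Auto Loader incrementally discovers new files in the landing path.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("s3://example-bucket/landing/invoices/")   # placeholder path
    )
```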

This course offers a deep dive into designing data models within the Databricks Lakehouse environment and understanding the data product lifecycle. Participants will learn to align business requirements with data organization and model design, leveraging Delta Lake and Unity Catalog for defining data architectures, along with techniques for data integration and sharing. Prerequisites: Foundational knowledge equivalent to Databricks Certified Data Engineer Associate and familiarity with many topics covered in Databricks Certified Data Engineer Professional. Experience with: basic SQL queries and table creation on Databricks; Lakehouse architecture fundamentals (medallion layers); Unity Catalog concepts (high-level). [Optional] Familiarity with data warehousing concepts (dimensional modeling, 3NF, etc.) is beneficial but not mandatory. Labs: Yes

Summary In this episode of the Data Engineering Podcast Sida Shen, product manager at CelerData, talks about StarRocks, a high-performance analytical database. Sida discusses the inception of StarRocks, which was forked from Apache Doris in 2020 and evolved into a high-performance Lakehouse query engine. He explains the architectural design of StarRocks, highlighting its capabilities in handling high concurrency and low latency queries, and its integration with open table formats like Apache Iceberg, Delta Lake, and Apache Hudi. Sida also discusses how StarRocks differentiates itself from other query engines by supporting on-the-fly joins and eliminating the need for denormalization pipelines, and shares insights into its use cases, such as customer-facing analytics and real-time data processing, as well as future directions for the platform.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Sida Shen about StarRocks, a high-performance analytical database supporting shared-nothing and shared-data patterns.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what StarRocks is and the story behind it?
- There are numerous analytical databases on the market. What are the attributes of StarRocks that differentiate it from other options?
- Can you describe the architecture of StarRocks?
- What are the "-ilities" that are foundational to the design of the system?
- How have the design and focus of the project evolved since it was first created?
- What are the tradeoffs involved in separating the communication layer from the data layers?
- The tiered architecture enables the shared-nothing and shared-data behaviors, which allows for the implementation of lakehouse patterns. What are some of the patterns that are possible due to the single interface/dual pattern nature of StarRocks?
- The shared-data implementation has caching built in to accelerate interaction with datasets. What are some of the limitations/edge cases that operators and consumers should be aware of?
- StarRocks supports management of lakehouse tables (Iceberg, Delta, Hudi, etc.), which overlaps with use cases for Trino/Presto/Dremio/etc. What are the cases where StarRocks acts as a replacement for those systems vs. a supplement to them?
- The other major category of engines that StarRocks overlaps with is OLAP databases (e.g. ClickHouse, Firebolt, etc.). Why might someone use StarRocks in addition to or in place of those technologies?
- We would be remiss if we ignored the dominating trend of AI and the systems that support it. What is the role of StarRocks in the context of an AI application?
- What are the most interesting, innovative, or unexpected ways that you have seen StarRocks used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on StarRocks?
- When is StarRocks the wrong choice?
- What do you have planned for the future of StarRocks?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links
- StarRocks
- CelerData
- Apache Doris
- SIMD == Single Instruction Multiple Data
- Apache Iceberg
- ClickHouse (Podcast Episode)
- Druid
- Firebolt (Podcast Episode)
- Snowflake
- BigQuery
- Trino
- Databricks
- Dremio
- Data Lakehouse
- Delta Lake
- Apache Hive
- C++
- Cost-Based Optimizer
- Iceberg Summit Tencent Games Presentation
- Apache Paimon
- Lance (Podcast Episode)
- Delta UniForm
- Apache Arrow
- StarRocks Python UDF
- Debezium (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this podcast episode, we talked with Adrian Brudaru about ​the past, present and future of data engineering.

About the speaker: Adrian Brudaru studied economics in Romania but soon got bored with how creative the industry was, and chose to go instead for the more factual side. He ended up in Berlin at the age of 25 and started a role as a business analyst. At the age of 30, he had enough of startups and decided to join a corporation, but quickly found out that it did not provide the challenge he wanted. As going back to startups was not a desirable option either, he decided to postpone his decision by taking freelance work and has never looked back since. Five years later, he co-founded a company in the data space to try new things. This company is also looking to release open source tools to help democratize data engineering.

0:00 Introduction to DataTalks.Club
1:05 Discussing trends in data engineering with Adrian
2:03 Adrian's background and journey into data engineering
5:04 Growth and updates on Adrian's company, DLT Hub
9:05 Challenges and specialization in data engineering today
13:00 Opportunities for data engineers entering the field
15:00 The "Modern Data Stack" and its evolution
17:25 Emerging trends: AI integration and Iceberg technology
27:40 DuckDB and the emergence of portable, cost-effective data stacks
32:14 The rise and impact of dbt in data engineering
34:08 Alternatives to dbt: SQLMesh and others
35:25 Workflow orchestration tools: Airflow, Dagster, Prefect, and GitHub Actions
37:20 Audience questions: Career focus in data roles and AI engineering overlaps
39:00 The role of semantics in data and AI workflows
41:11 Focusing on learning concepts over tools when entering the field
45:15 Transitioning from backend to data engineering: challenges and opportunities
47:48 Current state of the data engineering job market in Europe and beyond
49:05 Introduction to Apache Iceberg, Delta, and Hudi file formats
50:40 Suitability of these formats for batch and streaming workloads
52:29 Tools for streaming: Kafka, SQS, and related trends
58:07 Building AI agents and enabling intelligent data applications
59:09 Closing discussion on the place of tools like dbt in the ecosystem

🔗 CONNECT WITH ADRIAN BRUDARU Linkedin -  / data-team   Website - https://adrian.brudaru.com/ 🔗 CONNECT WITH DataTalksClub Join the community - https://datatalks.club/slack.html Subscribe to our Google calendar to have all our events in your calendar - https://calendar.google.com/calendar/... Check other upcoming events - https://lu.ma/dtc-events LinkedIn -  /datatalks-club   Twitter -  /datatalksclub   Website - https://datatalks.club/

S1 Ep#33 Bridging Business and Data: The Art of Data Product Management The Data Product Management In Action podcast, brought to you by executive producer Scott Hirleman, is a platform for data product management practitioners to share insights and experiences. 

In Season 01, Episode 33, Amritha, our newest host, chats with Sagar Nikam, Head of Product at CK Delta. Sagar shares his journey from finance to data product management, highlighting the art of translating complex AI/ML models into actionable business strategies. He discusses the challenges of defining data products, the importance of clear communication, and why adoption often outweighs accuracy. Sagar also offers insights on handling uncertainty, setting success metrics, and the cross-industry applicability of data product management skills. Tune in for a deep dive into making data-driven decisions that drive real business impact.

About our host Amritha Arun Babu: Amritha is an accomplished Product Leader with over a decade of experience building and scaling products across AI platforms, supply chain systems, and enterprise workflows in industries such as e-commerce, AI/ML, and marketing automation. At Amazon, she led machine learning platform products powering recommendation and personalization engines, building tools for model experimentation, deployment, and monitoring that improved efficiency for 1,500+ ML scientists. At Wayfair, she managed international supply chain systems, overseeing contracts, billing, product catalogs, and vendor operations, delivering cost savings and optimizing large-scale workflows. At Klaviyo, she drives both AI infrastructure and customer-facing AI tools, including recommendation engines, content generation assistants, and workflow automation agents, enabling scalable and personalized marketing workflows. Earlier, she worked on enterprise systems and revenue operations workflows, focusing on cost optimization and process improvements in complex technical environments. Amritha excels at bridging technical depth with strategic clarity, leading cross-functional teams, and delivering measurable business outcomes across diverse domains. Connect with Amritha on LinkedIn.
About our guest Sagar Nikam: Sagar is the Head of Products at CKDelta, leading the development of AI-driven solutions, AI Agents, and intelligent applications that enhance decision-making and automation across industries. With experience in banking, utilities, and SaaS, he has successfully launched and scaled AI-powered products that drive real business impact. Sagar helps product teams seamlessly integrate AI, navigate challenges, and build solutions that are user-centric, explainable, and trusted. Connect with Sagar on LinkedIn.  

All views and opinions expressed are those of the individuals and do not necessarily reflect their employers or anyone else. 

Join the conversation on LinkedIn. 

Apply to be a guest or nominate someone that you know. 

Do you love what you're listening to? Please rate and review the podcast, and share it with fellow practitioners you know. Your support helps us reach more listeners and continue providing valuable insights! 

Databricks Certified Data Engineer Associate Study Guide

Data engineers proficient in Databricks are currently in high demand. As organizations gather more data than ever before, skilled data engineers on platforms like Databricks become critical to business success. The Databricks Data Engineer Associate certification is proof that you have a complete understanding of the Databricks platform and its capabilities, as well as the essential skills to effectively execute various data engineering tasks on the platform. In this comprehensive study guide, you will build a strong foundation in all topics covered on the certification exam, including the Databricks Lakehouse and its tools and benefits. You'll also learn to develop ETL pipelines in both batch and streaming modes. Moreover, you'll discover how to orchestrate data workflows and design dashboards while maintaining data governance. Finally, you'll dive into the finer points of exactly what's on the exam and learn to prepare for it with mock tests. Author Derar Alhussein not only teaches you the fundamental concepts but also provides hands-on exercises to reinforce your understanding. From setting up your Databricks workspace to deploying production pipelines, each chapter is carefully crafted to equip you with the skills needed to master the Databricks Platform. By the end of this book, you'll know everything you need to pass the Databricks Data Engineer Associate certification exam with flying colors and start your career as a Databricks-certified data engineer! You'll learn how to:
- Use the Databricks Platform and Delta Lake effectively
- Perform advanced ETL tasks using Apache Spark SQL
- Design multi-hop architecture to process data incrementally
- Build production pipelines using Delta Live Tables and Databricks Jobs
- Implement data governance using Databricks SQL and Unity Catalog
Derar Alhussein is a senior data engineer with a master's degree in data mining. He has over a decade of hands-on experience in software and data projects, including large-scale projects on Databricks. He currently holds eight certifications from Databricks, showcasing his proficiency in the field. Derar is also an experienced instructor with a proven track record of success in training thousands of data engineers, helping them develop their skills and obtain professional certifications.

It’s time for another episode of the Data Engineering Central Podcast. In this episode, we cover …
- AWS Lambda + DuckDB and Delta Lake (Polars, Daft, etc.)
- IaC - Long Live Terraform
- Databricks Data Quality with DQX
- Unity Catalog releases for DuckDB and Polars
- Bespoke vs. Managed Data Platforms
- Delta Lake vs. Iceberg and UniForm for a single table
Thanks for b…
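
For context on the DuckDB and Polars items, here is a tiny sketch of reading a Delta table from both engines; the table path is a placeholder, and it assumes DuckDB's delta extension and the deltalake package backing Polars' reader are available.

```python
import duckdb
import polars as pl

path = "/data/tables/orders"   # placeholder Delta table location

# DuckDB: the delta extension exposes Delta tables via delta_scan().
duckdb.sql("INSTALL delta")
duckdb.sql("LOAD delta")
print(duckdb.sql(f"SELECT count(*) FROM delta_scan('{path}')").fetchall())

# Polars: read_delta() uses the delta-rs (deltalake) reader under the hood.
df = pl.read_delta(path)
print(df.shape)
```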

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit dataengineeringcentral.substack.com/subscribe

Paul Andrew: An Evolution of Data Architectures - Lambda, Kappa, Delta, Mesh & Fabric

🌟 Session Overview 🌟

Session Name: An Evolution of Data Architectures - Lambda, Kappa, Delta, Mesh & Fabric
Speaker: Paul Andrew
Session Description: How have advancements in highly scalable cloud technology influenced the design principles we apply when building data platform solutions? Are we designing solely for speed and batch layers, or do we want more from our platforms? Who says these patterns must be delivered exclusively?

Let’s disrupt the theory and consider the practical application of everything Microsoft now has to offer, where concepts and patterns meet technology. Can we now utilize cloud technology to build architectures that cater to lambda, kappa, and Delta Lake concepts in a complete stack of services? Should we be considering a solution that offers all these principles in a nirvana of data insight perfection? How does the concept of Data Fabric align with Microsoft Fabric as a product?

In this session, we’ll explore the answers to these questions and more in a thought-provoking, argument-generating examination of the challenges every data platform engineer/architect faces.

🚀 About Big Data and RPA 2024 🚀

Unlock the future of innovation and automation at Big Data & RPA Conference Europe 2024! 🌟 This unique event brings together the brightest minds in big data, machine learning, AI, and robotic process automation to explore cutting-edge solutions and trends shaping the tech landscape. Perfect for data engineers, analysts, RPA developers, and business leaders, the conference offers dual insights into the power of data-driven strategies and intelligent automation. 🚀 Gain practical knowledge on topics like hyperautomation, AI integration, advanced analytics, and workflow optimization while networking with global experts. Don’t miss this exclusive opportunity to expand your expertise and revolutionize your processes—all from the comfort of your home! 📊🤖✨

📅 Yearly Conferences: Curious about the evolution of QA? Check out our archive of past Big Data & RPA sessions. Watch the strategies and technologies evolve in our videos! 🚀 🔗 Find Other Years' Videos: 2023 Big Data Conference Europe https://www.youtube.com/playlist?list=PLqYhGsQ9iSEpb_oyAsg67PhpbrkCC59_g 2022 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEryAOjmvdiaXTfjCg5j3HhT 2021 Big Data Conference Europe Online https://www.youtube.com/playlist?list=PLqYhGsQ9iSEqHwbQoWEXEJALFLKVDRXiP

💡 Stay Connected & Updated 💡

Don’t miss out on any updates or upcoming event information from Big Data & RPA Conference Europe. Follow us on our social media channels and visit our website to stay in the loop!

🌐 Website: https://bigdataconference.eu/, https://rpaconference.eu/ 👤 Facebook: https://www.facebook.com/bigdataconf, https://www.facebook.com/rpaeurope/ 🐦 Twitter: @BigDataConfEU, @europe_rpa 🔗 LinkedIn: https://www.linkedin.com/company/73234449/admin/dashboard/, https://www.linkedin.com/company/75464753/admin/dashboard/ 🎥 YouTube: http://www.youtube.com/@DATAMINERLT

Building Modern Data Applications Using Databricks Lakehouse

This book, "Building Modern Data Applications Using Databricks Lakehouse," provides a comprehensive guide for data professionals to master the Databricks platform. You'll learn to effectively build, deploy, and monitor robust data pipelines with Databricks' Delta Live Tables, empowering you to manage and optimize cloud-based data operations effortlessly. What this Book will help me do Understand the foundations and concepts of Delta Live Tables and its role in data pipeline development. Learn workflows to process and transform real-time and batch data efficiently using the Databricks lakehouse architecture. Master the implementation of Unity Catalog for governance and secure data access in modern data applications. Deploy and automate data pipeline changes using CI/CD, leveraging tools like Terraform and Databricks Asset Bundles. Gain advanced insights in monitoring data quality and performance, optimizing cloud costs, and managing DataOps tasks effectively. Author(s) Will Girten, the author, is a seasoned Solutions Architect at Databricks with over a decade of experience in data and AI systems. With a deep expertise in modern data architectures, Will is adept at simplifying complex topics and translating them into actionable knowledge. His books emphasize real-time application and offer clear, hands-on examples, making learning engaging and impactful. Who is it for? This book is geared towards data engineers, analysts, and DataOps professionals seeking efficient strategies to implement and maintain robust data pipelines. If you have a basic understanding of Python and Apache Spark and wish to delve deeper into the Databricks platform for streamlining workflows, this book is tailored for you.

Delta Lake: The Definitive Guide

Ready to simplify the process of building data lakehouses and data pipelines at scale? In this practical guide, learn how Delta Lake is helping data engineers, data scientists, and data analysts overcome key data reliability challenges with modern data engineering and management techniques. Authors Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (with contributions from Delta Lake maintainer R. Tyler Croy) share expert insights on all things Delta Lake, including how to run batch and streaming jobs concurrently and accelerate the usability of your data. You'll also uncover how ACID transactions bring reliability to data lakehouses at scale. This book helps you:
- Understand key data reliability challenges and how Delta Lake solves them
- Explain the critical role of Delta transaction logs as a single source of truth
- Learn the Delta Lake ecosystem with technologies like Apache Flink, Kafka, and Trino
- Architect data lakehouses with the medallion architecture
- Optimize Delta Lake performance with features like deletion vectors and liquid clustering
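
To illustrate the transaction-log and time-travel ideas the book covers, here is a short sketch using standard Delta Lake SQL through PySpark; the table name is a placeholder.

```python
# Placeholder table name; illustrates the Delta transaction log and time travel.

# Every write is recorded as a versioned commit in the Delta transaction log.
spark.sql("DESCRIBE HISTORY lakehouse.orders") \
    .select("version", "timestamp", "operation") \
    .show(truncate=False)

# Time travel: query the table as of an earlier version or timestamp.
v0 = spark.sql("SELECT * FROM lakehouse.orders VERSION AS OF 0")
old = spark.sql("SELECT * FROM lakehouse.orders TIMESTAMP AS OF '2024-01-01'")
print(v0.count(), old.count())
```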