talk-data.com

Topic: Delta Lake (Delta)

Tags: data_lake · acid_transactions · time_travel · file_format · storage

347 activities tagged

Activity Trend: peak of 117 activities per quarter, 2020-Q1 to 2026-Q1

Activities

347 activities · Newest first

Data Engineering with Azure Databricks

Master end-to-end data engineering on Azure Databricks. From data ingestion and Delta Lake to CI/CD and real-time streaming, build secure, scalable, and performant data solutions with Spark, Unity Catalog, and ML tools.

Key Features

• Build scalable data pipelines using Apache Spark and Delta Lake
• Automate workflows and manage data governance with Unity Catalog
• Learn real-time processing and structured streaming with practical use cases
• Implement CI/CD, DevOps, and security for production-ready data solutions
• Explore Databricks-native ML, AutoML, and Generative AI integration

Book Description

"Data Engineering with Azure Databricks" is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing.

Beginning with the foundational role of Azure Databricks in modern data engineering, you’ll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow. The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake’s ACID features for data reliability and schema evolution. You’ll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform.

With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need.

What you will learn

• Set up a full-featured Azure Databricks environment
• Implement batch and streaming ingestion using Auto Loader
• Optimize Spark jobs with partitioning and caching
• Build real-time pipelines with structured streaming and DLT
• Manage data governance using Unity Catalog
• Orchestrate production workflows with jobs and ADF
• Apply CI/CD best practices with Azure DevOps and Git
• Secure data with RBAC, encryption, and compliance standards
• Use MLflow and Feature Store for ML pipelines
• Build generative AI applications in Databricks

Who this book is for

This book is for data engineers, solution architects, cloud professionals, and software engineers seeking to build robust and scalable data pipelines using Azure Databricks. Whether you're migrating legacy systems, implementing a modern lakehouse architecture, or optimizing data workflows for performance, this guide will help you leverage the full power of Databricks on Azure. A basic understanding of Python, Spark, and cloud infrastructure is recommended.
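To make the Auto Loader and Delta Lake topics above concrete, here is a minimal PySpark sketch of streaming file ingestion into a Delta table. It is not taken from the book; the paths, schema location, and table name are illustrative placeholders, and the `cloudFiles` source is Databricks-specific.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader ("cloudFiles") incrementally discovers new files in the landing path.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                          # format of the incoming files
    .option("cloudFiles.schemaLocation", "/mnt/schemas/orders")   # where the inferred schema is tracked
    .load("/mnt/landing/orders")
)

# Land the stream as a Delta table; availableNow processes the backlog and then stops.
(
    raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")
)
```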

Engineering Lakehouses with Open Table Formats

Engineering Lakehouses with Open Table Formats introduces the architecture and capabilities of open table formats like Apache Iceberg, Apache Hudi, and Delta Lake. The book guides you through the design, implementation, and optimization of lakehouses that can handle modern data processing requirements effectively, with real-world practical insights.

What this Book will help me do

• Understand the fundamentals of open table formats and their benefits in lakehouse architecture.
• Learn how to implement performant data processing using tools like Apache Spark and Flink.
• Master advanced topics like indexing, partitioning, and interoperability between data formats.
• Explore data lifecycle management and integration with frameworks like Apache Airflow and dbt.
• Build secure lakehouses with regulatory compliance using best practices detailed in the book.

Author(s)

Dipankar Mazumdar and Vinoth Govindarajan are seasoned professionals with extensive experience in big data processing and software architecture. They bring their expertise from working with data lakehouses and are known for their ability to explain complex technical concepts clearly. Their collaborative approach brings valuable insights into the latest trends in data management.

Who is it for?

This book is ideal for data engineers, architects, and software professionals aiming to master modern lakehouse architectures. If you are familiar with data lakes or warehouses and wish to transition to an open data architectural design, this book is suited for you. Readers should have basic knowledge of databases, Python, and Apache Spark for the best experience.
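As a small illustration of the table-format work the book covers, the sketch below writes a partitioned Delta table with PySpark; Iceberg or Hudi follow the same pattern with the respective format name and catalog configuration. The table and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.range(1_000).withColumn("event_date", F.current_date())

# Write a partitioned table in an open table format (Delta here).
(
    events.write.format("delta")      # "iceberg" or "hudi" work similarly with the right runtime jars
    .partitionBy("event_date")        # partitioning is one of the optimization topics covered
    .mode("overwrite")
    .saveAsTable("lakehouse.events")
)
```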

From Data Lake Entanglement to Data Mesh Decoupling: Scaling a Self-Service Data Platform

Our data platform journey started with a classic data lake — easy to ingest, hard to evolve. As domains scaled, tight coupling across source systems, pipelines, and data products slowed everything down. In this talk, we share how we re-architected toward a domain-oriented data mesh using PySpark, Delta Lake and DQX to achieve true decoupling. Expect practical lessons on designing independent data products, managing lineage and governance, and scaling self-service without chaos.
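A minimal sketch of the "independent data product" idea, written in plain PySpark with Delta Lake. DQX itself offers declarative quality rules; here a single null check stands in for them so the example stays self-contained, and all table and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.table("bronze.orders")

# Split rows on a basic quality expectation instead of failing the whole pipeline.
valid = orders.filter(F.col("order_id").isNotNull())
invalid = orders.filter(F.col("order_id").isNull())

# Publish the domain-owned data product and a quarantine table alongside it.
valid.write.format("delta").mode("overwrite").saveAsTable("sales_domain.orders_product")
invalid.write.format("delta").mode("append").saveAsTable("sales_domain.orders_quarantine")
```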

Unleashing SAP Databricks on Azure: Modernize, analyze, and innovate

SAP Databricks on Azure integrates the Databricks Data Intelligence Platform with SAP Business Data Cloud, unifying SAP and external data for advanced analytics, AI, and ML. It enables building intelligent apps and actionable insights using trusted SAP and third-party business data. Available natively on Azure within SAP Business Data Cloud, it offers seamless access without data duplication via Delta Sharing. This session highlights automated forecasting, exploratory analysis, and BI use cases.
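For readers curious what "access without data duplication via Delta Sharing" looks like on the consumer side, here is a minimal sketch using the open-source `delta-sharing` Python client. The profile file and the share/schema/table names are placeholders.

```python
import delta_sharing

profile = "/path/to/config.share"                        # credentials file issued by the data provider
table_url = f"{profile}#sap_share.finance.sales_orders"  # <profile>#<share>.<schema>.<table>

# Pull the shared table into pandas without copying it into our own storage.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```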

DuckDB is the best way to execute SQL on a single node, but its embedding-friendly nature also makes it an excellent foundation for building distributed systems. George Fraser, CEO of Fivetran, will tell us how Fivetran used DuckDB to power its Iceberg data lake writer, coordinating thousands of small, parallel tasks across a fleet of workers, each running DuckDB queries on bounded datasets. The result is a high-throughput, dual-format (Iceberg + Delta) data lake architecture where every write scales linearly, snapshots stay perfectly in sync, and performance rivals a commercial database while remaining open and portable.
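A minimal sketch of the pattern described above: many small workers, each running a bounded DuckDB query. This is not Fivetran's code; the file paths and the ordering column are assumptions.

```python
import duckdb

def compact_batch(input_glob: str, output_path: str) -> None:
    """Run one bounded task: read a batch of Parquet files, sort them, and rewrite them."""
    con = duckdb.connect()  # an in-memory DuckDB per task keeps workers independent
    con.execute(
        f"""
        COPY (SELECT * FROM read_parquet('{input_glob}') ORDER BY event_time)
        TO '{output_path}' (FORMAT PARQUET)
        """
    )
    con.close()

# Each worker in the fleet handles one batch like this, in parallel.
compact_batch("/data/raw/batch_0042/*.parquet", "/data/staged/batch_0042.parquet")
```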

The Egis group designs and operates complex infrastructure worldwide: highways, airports, rail, buildings, mobility services, energy, urban development, and the environment. The diversity and volume of the data generated pose major challenges in terms of governance, industrialization, and scalability.

To address them, Egis has deployed a Data Mesh infrastructure on Azure, built and operated by a dedicated team. This team handles the design, governance, and provisioning of the architecture for all Business Lines. The infrastructure relies on:

• Distributed storage with ADLS Gen2,

• ETL and big data processing with Azure Data Factory and Databricks,

• Visualization and secure sharing via Power BI Service and Delta Sharing,

• Advanced governance mechanisms to guarantee interoperability and reliability.

This session will cover:

• The architecture choices and technical patterns for implementing a distributed Data Mesh at the scale of a large international group

• The role and organization of the dedicated team in provisioning the platform and supporting business projects

• Practical lessons learned from concrete use cases already in production

A deep dive into a real-world Data Mesh implementation, designed to turn data into an asset that is accessible, reliable, and usable at scale by business and technical teams.
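As a rough illustration (not Egis's actual code) of the ADLS Gen2 + Databricks + Delta pattern listed above, the sketch below reads raw files from an `abfss://` path and lands them as a governed Delta table; the storage account, container, and table names are placeholders, and credentials are assumed to be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read raw files from ADLS Gen2 (authentication is assumed to be configured on the cluster).
raw = spark.read.parquet("abfss://raw@<storage-account>.dfs.core.windows.net/mobility/traffic/")

# Publish a bronze Delta table that downstream teams reach via Power BI or Delta Sharing.
(
    raw.write.format("delta")
    .mode("overwrite")
    .saveAsTable("mobility.traffic_bronze")
)
```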

Face To Face
by Gavi Regunath (Advancing Analytics), Simon Whiteley (Advancing Analytics), Holly Smith (Databricks)

We’re excited to be back at Big Data LDN this year—huge thanks to the organisers for hosting Databricks London once more!

Join us for an evening of insights, networking, and community with the Databricks Team and Advancing Analytics!

🎤 Agenda:

6:00 PM – 6:10 PM | Kickoff & Warm Welcome

Grab a drink, say hi, and get the lowdown on what’s coming up. We’ll set the scene for an evening of learning and laughs.

6:10 PM – 6:50 PM | The Metadata Marathon: How three projects are racing forward – Holly Smith (Staff Developer Advocate, Databricks)

With the enormous amount of discussion about open storage formats between nerds and even not-nerds, it can be hard to keep track of who’s doing what and how this actually makes any impact on day to day data projects.

Holly will take a closer look at the three big projects in this space: Delta, Hudi and Iceberg. They’re all trying to solve similar data problems and have tackled the various challenges in different ways. Her talk will start with the very basics of how we got here and what the history is, before diving deep into the underlying tech, their roadmaps, and their impacts on the data landscape as a whole.

6:50 PM – 7:10 PM | What’s New in Databricks & Databricks AI – Simon Whiteley & Gavi Regunath

Hot off the press! Simon and Gavi will walk you through the latest and greatest from Databricks, including shiny new AI features and platform updates you’ll want to try ASAP.

7:10 PM onwards | Q&A Panel + Networking

Your chance to ask the experts anything—then stick around for drinks, snacks, and some good old-fashioned data geekery.

As organizations increasingly adopt data lake architectures, analytics databases face significant integration challenges beyond simple data ingestion. This talk explores the complex technical hurdles encountered when building robust connections between analytics engines and modern data lake formats.

We'll examine critical implementation challenges, including the absence of native library support for formats like Delta Lake, which necessitates expansion into new programming languages such as Rust to achieve optimal performance. The session explores the complexities of managing stateful systems, addressing caching inconsistencies, and reconciling state across distributed environments.
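As one concrete example of the "native library support" point, the Rust-based delta-rs project exposes Delta Lake to non-JVM engines; the sketch below uses its Python bindings (`deltalake`) to read a table's transaction log directly. The table path is a placeholder.

```python
from deltalake import DeltaTable

dt = DeltaTable("/data/lake/events")   # open the Delta transaction log without a Spark/JVM dependency

print(dt.version())                    # current snapshot version
print(dt.files()[:5])                  # a few of the data files referenced by that snapshot

df = dt.to_pandas()                    # materialize the table (sensible only for small tables)
```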

A key focus will be on integrating with external catalogs while maintaining data consistency and performance - a challenge that requires careful architectural decisions around metadata management and query optimization. We'll explore how these technical constraints impact system design and the trade-offs involved in different implementation approaches.

Attendees will gain a practical understanding of the engineering complexity behind seamless data lake integration and actionable approaches to common implementation obstacles.

How do you move data from thousands of SQL databases to a data lake with no impact on OLTP? We'll explore the challenges we faced while migrating legacy batch data flows to an event-based architecture. A key challenge for our data engineers was the multi-tenant architecture of our backend, meaning that we had to handle the same SQL schema across over 15k databases. We'll present the journey employing Debezium, Azure Event Hub, Delta Live Tables, and the extra tooling we had to put in place.
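A minimal sketch (not the production pipeline) of a Delta Live Tables table ingesting Debezium change events from Azure Event Hubs over its Kafka-compatible endpoint. The namespace, topic, and authentication details are placeholders, and `spark` is provided by the DLT runtime.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(name="orders_cdc_bronze", comment="Raw Debezium change events from Event Hubs")
def orders_cdc_bronze():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
        .option("subscribe", "orders-cdc")              # topic fanned in from the tenant databases
        .option("kafka.security.protocol", "SASL_SSL")  # SASL/JAAS credentials omitted here
        .load()
        .select(
            F.col("key").cast("string"),
            F.col("value").cast("string"),              # Debezium JSON payload, parsed downstream
            "timestamp",
        )
    )
```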

So you’ve heard of Databricks, but you're still not sure what the fuss is all about. Yes, you’ve heard it’s Spark, but then there’s this Delta thing that’s both a data lake and a data warehouse (isn’t that what Iceberg is?). And then there's Unity Catalog, which isn't just a catalog: it also does access management, plus surprising things like optimising your data and giving programmatic access to lineage and billing. But then serverless came out, and now you don’t even have to learn Spark? And of course there’s a bunch of AI stuff to use or create yourself. So why not spend 30 minutes learning the details of what Databricks does, and how it can turn you into a rockstar Data Engineer?

Summary

In this episode of the Data Engineering Podcast, Hannes Mühleisen and Mark Raasveldt, the creators of DuckDB, share their work on DuckLake, a new entrant in the open lakehouse ecosystem. They discuss how DuckLake focuses on simplicity and flexibility and offers a unified catalog and table format compared to other lakehouse formats like Iceberg and Delta. Hannes and Mark share insights into how DuckLake revolutionizes data architecture by enabling local-first data processing, simplifying deployment of lakehouse solutions, and offering benefits such as encryption features, data inlining, and integration with existing ecosystems.

Announcements

• Hello and welcome to the Data Engineering Podcast, the show about modern data management.

• Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed: flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming: Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.

• Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

• Your host is Tobias Macey and today I'm interviewing Hannes Mühleisen and Mark Raasveldt about DuckLake, the latest entrant into the open lakehouse ecosystem.

Interview

• Introduction
• How did you get involved in the area of data management?
• Can you describe what DuckLake is and the story behind it?
• What are the particular problems that DuckLake is solving for?
• How does this compare to the capabilities of MotherDuck?
• Iceberg and Delta already have a well-established ecosystem, but so does DuckDB. Who are the primary personas that you are trying to focus on in these early days of DuckLake?
• One of the major factors driving the adoption of formats like Iceberg is cost efficiency for large volumes of data. That brings with it challenges of large batch processing of data. How does DuckLake account for these axes of scale?
• There is also a substantial investment in the ecosystem of technologies that support Iceberg. The most notable ecosystem challenge for DuckDB and DuckLake is in the query layer. How are you thinking about the evolution and growth of that capability beyond DuckDB (e.g. support in Trino/Spark/Flink)?
• What are your opinions on the viability of a future where DuckLake and Iceberg become a unified standard and implementation? (Why can't Iceberg REST catalog implementations just use DuckLake under the hood?)
• Digging into the specifics of the specification and implementation, what are some of the capabilities that it offers above and beyond Iceberg?
• Is it now possible to enforce PK/FK constraints and indexing on underlying data?
• Given that DuckDB has a vector type, how do you think about the support for vector storage/indexing?
• How do the capabilities of DuckLake and the integration with DuckDB change the ways that data teams design their data architecture and access patterns?
• What are your thoughts on the impact of "data gravity" in today's data ecosystem, with engines like DuckDB, KuzuDB, LanceDB, etc. available for embedded and edge use cases?
• What are the most interesting, innovative, or unexpected ways that you have seen DuckLake used?
• What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckLake?
• When is DuckLake the wrong choice?
• What do you have planned for the future of DuckLake?

Contact Info

• Hannes: Website
• Mark: Website

Parting Question

• From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

• Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
• Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
• If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

• DuckDB
• Podcast Episode
• DuckLake
• DuckDB Labs
• MySQL
• CWI
• MonetDB
• Iceberg
• Iceberg REST Catalog
• Delta
• Hudi
• Lance
• DuckDB Iceberg Connector
• ACID == Atomicity, Consistency, Isolation, Durability
• MotherDuck
• MotherDuck Managed DuckLake
• Trino
• Spark
• Presto
• Spark DuckLake Demo
• Delta Kernel
• Arrow
• dlt
• S3 Tables
• Attribute Based Access Control (ABAC)
• Parquet
• Arrow Flight
• Hadoop
• HDFS
• DuckLake Roadmap

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
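For listeners who want to try the "local-first" angle, here is a sketch of attaching a DuckLake catalog from Python. The extension name and `ATTACH 'ducklake:...'` syntax follow the published DuckLake examples and may differ across DuckDB versions; the file name is a placeholder.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# The catalog (table metadata) lives in a local database file; data files sit alongside it.
con.execute("ATTACH 'ducklake:my_lake.ducklake' AS lake")
con.execute("CREATE TABLE lake.events AS SELECT 1 AS id, 'hello' AS payload")

print(con.execute("SELECT * FROM lake.events").fetchall())
```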

Capitalizing Alternatives Data on the Addepar Platform: Private Markets Benchmarking

Addepar possesses an enormous private investment data set, with 40% of the $7T in assets on the platform allocated to alternatives. Leveraging the Addepar Data Lakehouse (ADL), built on Databricks, we have built a scalable data pipeline that assesses millions of private fund investment cash flows and translates them into a private fund benchmarks data offering. Investors on the Addepar platform can leverage this data, seamlessly integrated against their portfolio investments, and obtain actionable investment insights. At a high level, this data offering consists of extensive data aggregation, filtering, and construction logic that dynamically updates for clients through Databricks job workflows. This derived dataset has gone through several iterations with investment strategists and academics who leveraged Delta shared tables. Irrespective of the data source, the data pipeline coalesces all relevant cash flow activity against a unique identifier before constructing the benchmarks.
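An illustrative PySpark aggregation (not Addepar's code) of the "coalesce cash flows against a unique identifier" step described above; the table and column names are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cash_flows = spark.read.table("adl.private_fund_cash_flows")

# Roll each fund's cash flow activity up by quarter before the benchmark construction logic runs.
benchmark_inputs = (
    cash_flows
    .groupBy("fund_id", F.date_trunc("quarter", "flow_date").alias("quarter"))
    .agg(
        F.sum("contributions").alias("contributions"),
        F.sum("distributions").alias("distributions"),
    )
)

benchmark_inputs.write.format("delta").mode("overwrite").saveAsTable("adl.fund_benchmark_inputs")
```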

Welcome Lakehouse, from a DWH transformation to a M&A data sharing

At DXC, we helped our customer Fastweb with their "Welcome Lakehouse" project, a data warehouse transformation from on-premises to Databricks on AWS. But the implementation became something more. Thanks to features such as Lakehouse Federation and Delta Sharing, from the first day of the Fastweb+Vodafone merger we have been able to connect two different platforms with ease and let the business focus on the value of data rather than on the IT integration. This session will feature our customer Alessandro Gattolin of Fastweb, who will share the experience.

Databricks + Apache Iceberg™: Managed and Foreign Tables in Unity Catalog

Unity Catalog support for Apache Iceberg™ brings open, interoperable table formats to the heart of the Databricks Lakehouse. In this session, we’ll introduce new capabilities that allow you to write Iceberg tables from any REST-compatible engine, apply fine-grained governance across all data, and unify access to external Iceberg catalogs like AWS Glue, Hive Metastore, and Snowflake Horizon. Learn how Databricks is eliminating data silos, simplifying performance with Predictive Optimization, and advancing a truly open lakehouse architecture with Delta and Iceberg side by side.
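As a hedged sketch of the "write Iceberg tables from any REST-compatible engine" idea, the snippet below connects an external client (PyIceberg) to an Iceberg REST catalog. The endpoint URI, `warehouse` value, token handling, and table name are assumptions; consult the Unity Catalog documentation for the exact values your workspace exposes.

```python
from pyiceberg.catalog import load_catalog

# Connect to an Iceberg REST catalog endpoint (URI and credentials here are assumptions).
catalog = load_catalog(
    "uc",
    **{
        "type": "rest",
        "uri": "https://<workspace-host>/api/2.1/unity-catalog/iceberg",  # assumed endpoint path
        "token": "<personal-access-token>",
        "warehouse": "main",                                              # assumed: the catalog name
    },
)

table = catalog.load_table("web.events")   # <schema>.<table> within that catalog
print(table.schema())
```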

Supercharging Sales Intelligence: Processing Billions of Events via Structured Streaming

DigiCert is a digital security company that provides digital certificates, encryption, and authentication services, serving 88% of the Fortune 500 and securing over 28 billion web connections daily. Our project aggregates and analyzes certificate transparency logs via public APIs to provide comprehensive market and competitive intelligence. Instead of relying on third-party providers with limited data, the project gives us full control, deeper insights, and automation. Databricks has helped us reliably poll public APIs at scale, fetching millions of events daily, and deduplicate and store them in our Delta tables. We specifically use Spark for parallel processing, structured streaming for real-time ingestion and deduplication, Delta tables for data reliability, and pools and jobs to keep our costs optimized. These technologies help us keep our data fresh, accurate, and cost effective. This data has helped our sales team with real-time intelligence, ensuring DigiCert's success.
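A minimal sketch (not DigiCert's pipeline) of the deduplicate-then-Delta pattern described above, using structured streaming with a watermark so duplicate-tracking state stays bounded. Table and column names are assumptions, and `dropDuplicatesWithinWatermark` requires Spark 3.5 or later.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read newly polled certificate-transparency events as a stream.
ct_events = (
    spark.readStream.table("bronze.ct_log_events")
    .withWatermark("observed_at", "1 hour")                            # bound the dedup state
    .dropDuplicatesWithinWatermark(["cert_fingerprint", "log_id"])     # one row per log entry
)

# Append the deduplicated events to a Delta table for downstream sales intelligence.
(
    ct_events.writeStream.format("delta")
    .option("checkpointLocation", "/checkpoints/ct_dedup")
    .outputMode("append")
    .toTable("silver.ct_log_events")
)
```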