Most organizations run complex cloud data architectures that silo applications, users and data. Join this interactive hands-on workshop to learn how Databricks SQL allows you to operate a multi-cloud lakehouse architecture that delivers data warehouse performance at data lake economics, with up to 12x better price/performance than traditional cloud data warehouses. Here’s what we’ll cover: how Databricks SQL fits in the Data Intelligence Platform, enabling you to operate a multi-cloud lakehouse architecture that delivers data warehouse performance at data lake economics; how to manage and monitor compute resources, data access and users across your lakehouse infrastructure; how to query directly on your data lake using your tools of choice or the built-in SQL editor and visualizations; and how to use AI to increase productivity when querying, completing code or building dashboards. Ask your questions during this hands-on lab, and the Databricks experts will guide you.
talk-data.com
Topic: Data Lake (311 tagged)
The DoorDash Data organization is actively adopting the lakehouse paradigm. This presentation describes a methodology for migrating classic data warehouse and data lake platforms to a unified lakehouse solution. The objectives of this effort include: eliminating excessive data movement; seamlessly integrating and consolidating the query engine layers, including Snowflake, Databricks, EMR and Trino; optimizing query performance; abstracting away the complexity of the underlying storage layers and table formats; and making a strategic, justified decision on the unified metadata catalog used across the various compute platforms.
This session explores the strategic migration from Snowflake to Databricks, focusing on the journey of transforming a data lake to leverage Databricks’ advanced capabilities. It outlines the assessment of key architectural differences, performance benchmarks, and cost implications driving the decision. Attendees will gain insights into planning and execution, including data ingestion pipelines, schema conversion and metadata migration. Challenges such as maintaining data quality, optimizing compute resources and minimizing downtime are discussed, alongside solutions implemented to ensure a seamless transition. The session highlights the benefits of unified analytics and enhanced scalability achieved through Databricks, delivering actionable takeaways for similar migrations.
Unity Catalog puts a variety of schemas into a centralized repository. Now the developer community wants more productivity and automation for schema inference, translation, evolution and optimization, especially for ingestion and reverse-ETL scenarios, with more code generation. Coinbase Data Platform attempts to pave a path with "Schemaster", which interacts with the data catalog through a (proposed) metadata model to make schema translation and evolution more manageable across some of the popular systems, such as Delta, Iceberg, Snowflake, Kafka, MongoDB, DynamoDB and Postgres. This Lightning Talk covers four areas: the complexity and caveats of schema differences among these systems; the proposed field-level metadata model and two translation patterns, point-to-point vs. hub-and-spoke; why data profiling should be augmented to enhance schema understanding and translation; and how to integrate it with ingestion and reverse-ETL in a Databricks-oriented ecosystem. Takeaway: standardize schema lineage and translation.
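To make the two translation patterns in the abstract above concrete, here is a minimal Python sketch of hub-and-spoke translation. It is not Schemaster's actual model: the canonical type names (`int64`, `string`, `timestamp`), the `translate` and `translator_count` helpers, and the deliberately tiny type maps are all assumptions made for illustration. The idea is that each system maps only to and from one canonical "hub" type, so adding a system means adding two maps rather than one translator per existing system.

```python
# Hypothetical, simplified type maps: native type -> canonical ("hub") type.
# Real systems have many more types and parameters (precision, scale, nesting).
TO_CANONICAL = {
    "postgres": {"bigint": "int64", "text": "string", "timestamptz": "timestamp"},
    "snowflake": {"NUMBER(38,0)": "int64", "VARCHAR": "string", "TIMESTAMP_TZ": "timestamp"},
    "iceberg": {"long": "int64", "string": "string", "timestamptz": "timestamp"},
}

# Reverse maps: canonical type -> native type, derived per system.
FROM_CANONICAL = {
    system: {canonical: native for native, canonical in mapping.items()}
    for system, mapping in TO_CANONICAL.items()
}


def translate(columns, source, target):
    """Translate (name, native_type) pairs from source to target via the hub."""
    translated = []
    for name, source_type in columns:
        canonical = TO_CANONICAL[source][source_type]  # spoke -> hub
        translated.append((name, FROM_CANONICAL[target][canonical]))  # hub -> spoke
    return translated


def translator_count(n_systems):
    """Number of directed translators each pattern needs for n systems."""
    return {
        "point_to_point": n_systems * (n_systems - 1),  # one per ordered pair
        "hub_and_spoke": 2 * n_systems,  # to-hub and from-hub per system
    }


if __name__ == "__main__":
    print(translate([("id", "bigint"), ("created_at", "timestamptz")], "postgres", "iceberg"))
    print(translator_count(7))
```

For the seven systems named in the abstract, point-to-point needs 42 directed translators while hub-and-spoke needs 14. The trade-off is that anything the canonical model cannot express is lost in translation, which is one reason a field-level metadata model matters.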
This session will explore Databricks Unity Catalog (UC) implementation by P&G to enhance data governance, reduce data redundancy and improve the developer experience through the enablement of a Lakehouse architecture. The presentation will cover: The distinction between data treated as a product and standard application data, highlighting how UC's structure maximizes the value of data in P&G's data lake. Real-life examples from two years of using Unity Catalog, demonstrating benefits such as improved governance, reduced waste and enhanced data discovery. Challenges related to disaster recovery and external data access, along with our collaboration with Databricks to address these issues. Sharing our experience can provide valuable insights for organizations planning to adopt Unity Catalog on an enterprise scale.
Unity Catalog (UC) enables governance and security for all data and AI assets within an enterprise’s data lake and is necessary to unlock the full potential of Databricks as a true Data Intelligence Platform. Unfortunately, UC migrations are non-trivial, especially for enterprises that have been using Databricks for more than five years, such as 7-Eleven. System Integrators (SIs) offer accelerators, guides and services to support UC migrations; however, cloud infrastructure changes, anti-patterns within code and data sprawl can significantly complicate them. There is no “shortcut” to success when planning and executing a complex UC migration. In this session, we will share how UCX by Databricks Labs, a UC Migration Assistant, allowed 7-Eleven to reorient their UC migration, using its assessments and workflows to assess, characterize and ultimately plan a tenable approach.
Deloitte is observing a growing trend among cybersecurity organizations to develop big data management and analytics solutions beyond traditional Security Information and Event Management (SIEM) systems. Leveraging Databricks to extend these SIEM capabilities, Deloitte can help clients lower the cost of cyber data management while enabling scalable, cloud-native architectures. Deloitte helps clients design and implement cybersecurity data meshes, using Databricks as a foundational data lake platform to unify and govern security data at scale. Additionally, Deloitte extends clients’ cybersecurity capabilities by integrating advanced AI and machine learning solutions on Databricks, driving more proactive and automated cybersecurity solutions. Attendees will gain insight into how Deloitte is utilizing Databricks to manage enterprise cyber risks and deliver performant and innovative analytics and AI insights that traditional security tools and data platforms aren’t able to deliver.
Dynamic policy enforcement is increasingly critical in today's landscape, where data compliance is a top priority for companies, individuals and regulators alike. In this talk, Walaa explores how LinkedIn has implemented a robust dynamic policy enforcement engine, ViewShift, and integrated it within its data lake. He will demystify LinkedIn's query engine stack by demonstrating how catalogs can automatically route table resolutions to compliance-enforcing SQL views. These SQL views possess several noteworthy properties: Auto-Generated: Created automatically from declarative data annotations. User-Centric: They honor user-level consent and preferences. Context-Aware: They apply different transformations tailored to specific use cases. Portable: Despite the SQL logic being implemented in a single dialect, it remains accessible across all engines. Join this session to learn how ViewShift helps ensure that compliance is seamlessly integrated into data processing workflows.
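As a rough illustration of the idea described in the abstract above (views generated from declarative annotations, with the catalog routing table lookups to them), here is a minimal Python sketch. It is not ViewShift's implementation: the annotation format, the policy names (`keep`, `hash`, `redact`), the `_compliant` view suffix and both helper functions are hypothetical, and the generated SQL assumes a Spark-like dialect with `SHA2` and `CAST(NULL AS STRING)`.

```python
# Hypothetical declarative annotations: table -> {column: policy}.
ANNOTATIONS = {
    "events": {
        "member_id": "hash",    # pseudonymize the identifier
        "email": "redact",      # never expose the raw value
        "country": "keep",      # pass through unchanged
    }
}


def build_view_sql(table, annotations):
    """Generate a compliance-enforcing view definition from column policies."""
    expressions = []
    for column, policy in annotations[table].items():
        if policy == "keep":
            expressions.append(column)
        elif policy == "hash":
            expressions.append(f"SHA2({column}, 256) AS {column}")
        elif policy == "redact":
            expressions.append(f"CAST(NULL AS STRING) AS {column}")
        else:
            raise ValueError(f"unknown policy {policy!r} for column {column!r}")
    select_list = ", ".join(expressions)
    return f"CREATE VIEW {table}_compliant AS SELECT {select_list} FROM {table}"


def resolve(table, annotations):
    """Catalog-style routing: annotated tables resolve to their compliant view."""
    return f"{table}_compliant" if table in annotations else table


if __name__ == "__main__":
    print(build_view_sql("events", ANNOTATIONS))
    print(resolve("events", ANNOTATIONS))
    print(resolve("page_views", ANNOTATIONS))
```

Because routing happens at name resolution, a query against `events` would read the generated view without the query author changing anything. User-level consent and per-use-case transformations, which the abstract also mentions, are left out of this sketch.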
Given that data is the new oil, it must be treated as such. Organizations that pursue greater insight into their businesses and their customers must manage, govern, protect and observe the use of the data that drives these insights in an efficient, cost-effective, compliant and auditable manner without degrading access to that data. Azure Data Lake Storage offers many features that allow customers to apply such controls and protections to their critical data assets. Understanding how these features behave, their granularity, cost and scale implications, and the degree of control or protection they apply is essential to implementing a data lake that reflects the value contained within. In this session, the various data protection, governance and management capabilities available now and upcoming in ADLS will be discussed. This will include how deep integration with Azure Databricks can provide more comprehensive, end-to-end coverage for these concerns, yielding a highly efficient and effective data governance solution.
Toyota, the world’s largest automaker, sought to accelerate time-to-data and empower business users with secure data collaboration for faster insights. Partnering with Cognizant, they established a Unified Data Lake that integrates SOX principles and Databricks Unity Catalog to ensure compliance and security. Additionally, they developed a Data Scanner solution to automatically detect non-sensitive data and accelerate data ingestion. Join this dynamic session to discover how they achieved it.
Join us for this insightful session to learn how you can leverage the power of the Microsoft ecosystem along with Azure Databricks to take your business to the next level. Azure Databricks is a fully integrated, native, first-party solution on Microsoft Azure. Databricks and Microsoft continue to actively collaborate on product development, ensuring tight integration, optimized performance and a streamlined support experience. Azure Databricks offers seamless integrations with Power BI, Azure OpenAI, Microsoft Purview, Azure Data Lake Storage (ADLS) and Foundry. In this session, you’ll learn how you can leverage deep integration between Azure Databricks and the Microsoft solutions to empower your organization to do more with your data estate. You’ll also get an exclusive sneak peek into the product roadmap.
NHS England is revolutionizing healthcare research by enabling secure, seamless access to de-identified patient data through the Federated Data Platform (FDP). Despite vast data resources spread across regional and national systems, analysts struggle with fragmented, inconsistent datasets. Enter Databricks: powering a unified, virtual data lake with Unity Catalog at its core — integrating diverse NHS systems while ensuring compliance and security. By bridging AWS and Azure environments with a private exchange and leveraging the Iceberg connector to interface with Palantir, analysts gain scalable, reliable and governed access to vital healthcare data. This talk explores how this innovative architecture is driving actionable insights, accelerating research and ultimately improving patient outcomes.
At Kaizen Gaming, data drives our decision-making, but rapid growth exposed inefficiencies in our legacy cloud setup — escalating costs, delayed insights and scalability limits. Operating in 18 countries with 350M daily transactions (1PB+), shared quotas and limited cost transparency hindered efficiency. To address this, we redesigned our cloud architecture with Data Landing Zones, a modular framework that decouples resources, enabling independent scaling and cost accountability. Automation streamlined infrastructure, reduced overhead and enhanced FinOps visibility, while Unity Catalog ensured governance and security. Migration challenges included maintaining stability, managing costs and minimizing latency. A phased approach, Delta Sharing, and DBx Asset Bundles simplified transitions. The result: faster insights, improved cost control and reduced onboarding time, fostering innovation and efficiency. We share our transformation, offering insights for modern cloud optimization.
Data lakehouses continue to be hyped, but do they replace or complement data lakes and data warehouses? Where do we stand from an architectural perspective? What is hype and what is real? What should be expected in the coming years?
Future-proof your data architecture: Learn how DoorDash built a data lakehouse powered by Starburst to achieve a 20-30% faster time to insights. Akshat Nair shares lessons learned about what drove DoorDash to move beyond Snowflake to embrace the lakehouse. He will share his rationale for selecting Trino as their lakehouse query engine and why his team chose Starburst over open source. Discover how DoorDash seamlessly queries diverse sources, including Snowflake, Postgres, and data lake table formats, achieving faster data-driven decision-making at scale with cost benefits.
Summary In this episode of the Data Engineering Podcast Viktor Kessler, co-founder of Vakmo, talks about the architectural patterns in the lake house enabled by a fast and feature-rich Iceberg catalog. Viktor shares his journey from data warehouses to developing the open-source project, Lakekeeper, an Apache Iceberg REST catalog written in Rust that facilitates building lake houses with essential components like storage, compute, and catalog management. He discusses the importance of metadata in making data actionable, the evolution of data catalogs, and the challenges and innovations in the space, including integration with OpenFGA for fine-grained access control and managing data across formats and compute engines.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey and today I'm interviewing Viktor Kessler about architectural patterns in the lakehouse that are unlocked by a fast and feature-rich Iceberg catalog.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what LakeKeeper is and the story behind it?
What is the core of the problem that you are addressing?
There has been a lot of activity in the catalog space recently. What are the driving forces that have highlighted the need for a better metadata catalog in the data lake/distributed data ecosystem?
How would you characterize the feature sets/problem spaces that different entrants are focused on addressing?
Iceberg as a table format has gained a lot of attention and adoption across the data ecosystem. The REST catalog format has opened the door for numerous implementations. What are the opportunities for innovation and improving user experience in that space?
What is the role of the catalog in managing security and governance? (AuthZ, auditing, etc.)
What are the channels for propagating identity and permissions to compute engines? (How do you avoid head-scratching about permission-denied situations?)
Can you describe how LakeKeeper is implemented?
How have the design and goals of the project changed since you first started working on it?
For someone who has an existing set of Iceberg tables and catalog, what does the migration process look like?
What new workflows or capabilities does LakeKeeper enable for data teams using Iceberg tables across one or more compute frameworks?
What are the most interesting, innovative, or unexpected ways that you have seen LakeKeeper used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on LakeKeeper?
When is LakeKeeper the wrong choice?
What do you have planned for the future of LakeKeeper?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.init covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.
Links
LakeKeeper, SAP, Microsoft Access, Microsoft Excel, Apache Iceberg, Iceberg REST Catalog, PyIceberg, Spark, Trino, Dremio, Hive Metastore, Hadoop, NATS, Polars, DuckDB, DataFusion, Atlan, Open Metadata, Apache Atlas, OpenFGA, Hudi, Delta Lake, Lance Table Format, Unity Catalog, Polaris Catalog, Apache Gravitino, Keycloak, Open Policy Agent (OPA), Apache Ranger, Apache NiFi
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Get the inside story of Yahoo’s data lake transformation. As a Hadoop pioneer, Yahoo’s move to Google Cloud is a significant shift in data strategy. Explore the business drivers behind this transformation, technical hurdles encountered, and strategic partnership with Google Cloud that enabled a seamless migration. We’ll uncover key lessons, best practices for data lake modernization, and how Yahoo is using BigQuery, Dataproc, Pub/Sub, and other services to drive business value, enhance operational efficiency, and fuel their AI initiatives.