talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark
1286 tagged activities

Activity trend: peak of 515 activities per quarter (2020-Q1 to 2026-Q1)

Activities (1286, newest first)

Delta and Databricks as a Performant Exabyte-Scale Application Backend

The Delta Lake architecture promises to provide a single, highly functional, high-scale copy of data that can be leveraged by a variety of tools to satisfy a broad range of use cases. To date, most use cases have focused on interactive data warehousing, ETL, model training, and streaming. Real-time access is generally delegated to costly and sometimes difficult-to-scale NoSQL, indexed-storage, and domain-specific specialty solutions, which provide limited functionality compared to Spark on Delta Lake. In this session, we will explore the Delta data-skipping and optimization model and discuss how Capital One leveraged it, along with Databricks Photon and Spark Connect, to implement a real-time web application backend. We’ll share how we built a highly functional, performant, and cost-effective security information and event management user experience (SIEM UX).
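
The abstract itself contains no code; as a rough illustration of the building blocks it names, here is a minimal Spark Connect sketch that clusters a Delta table for data skipping and runs a selective query against it. The host, table, and column names are placeholders, not Capital One's implementation.

```python
from databricks.connect import DatabricksSession

# Connect to a Photon-enabled cluster over Spark Connect.
# Host, token, and cluster ID are placeholders.
spark = DatabricksSession.builder.remote(
    host="https://<workspace-host>",
    token="<personal-access-token>",
    cluster_id="<cluster-id>",
).getOrCreate()

# Cluster the table by the columns the application filters on, so Delta file
# statistics can prune most files at query time (data skipping).
spark.sql("OPTIMIZE app.events ZORDER BY (tenant_id, event_time)")

# A selective, point-lookup-style query typical of an application backend.
df = (spark.table("app.events")
      .where("tenant_id = 'acme' AND event_time >= '2025-01-01'")
      .select("event_id", "event_time", "payload"))
df.show(10)
```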

Empowering Business Users With Databricks — Integrating AI/BI Genie With Microsoft Teams

In this session, we'll explore how Rooms To Go enhances organizational collaboration by integrating AI/BI Genie with Microsoft Teams. Genie enables warehouse employees and members of the sales team to interact with data using natural language, simplifying data exploration and analysis. By connecting Genie to Microsoft Teams, we bring real-time data insights directly to a user’s phone. We'll provide a comprehensive overview on setting up this integration as well as a demo of how the team uses it daily. Attendees will gain practical knowledge to implement this integration, empowering their teams to access and interact with data seamlessly within Microsoft Teams.
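
As a hedged illustration of the server-side glue such an integration needs, the sketch below forwards a user's question from a bot handler to a Genie space over the Genie Conversation REST API. The endpoint path, request body, and space ID are assumptions; consult the current Databricks REST API reference for the exact contract.

```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace-host>
TOKEN = os.environ["DATABRICKS_TOKEN"]
SPACE_ID = "<genie-space-id>"           # placeholder

def ask_genie(question: str) -> dict:
    """Start a Genie conversation with a natural-language question."""
    resp = requests.post(
        f"{HOST}/api/2.0/genie/spaces/{SPACE_ID}/start-conversation",  # assumed path
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"content": question},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# A Teams message handler (e.g. a Bot Framework webhook) would then call:
# answer = ask_genie("What were yesterday's warehouse shipments by region?")
```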

Enterprise Cost Management for Data Warehousing with Databricks SQL

This session shows you how to gain visibility into your Databricks SQL spend and ensure cost efficiency. Learn about the latest features that provide detailed insight into Databricks SQL expenses so you can easily monitor and control your costs. Find out how to enable attribution to internal projects, understand the total cost of ownership, set up proactive controls, and continually optimize your spend.
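
For illustration, here is a minimal cost-attribution query over the billing system tables, the kind of query these features build on. The 'project' tag key and the SKU filter are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Daily Databricks SQL consumption attributed to an (assumed) 'project' tag.
daily_dbsql_spend = spark.sql("""
    SELECT
      usage_date,
      custom_tags['project'] AS project,   -- assumes resources carry a 'project' tag
      SUM(usage_quantity)    AS dbus
    FROM system.billing.usage
    WHERE sku_name LIKE '%SQL%'            -- rough filter for Databricks SQL SKUs
      AND usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, custom_tags['project']
    ORDER BY usage_date
""")
daily_dbsql_spend.show()
```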

Enterprise Financial Crime Detection: A Lakehouse Framework for FATF, Basel III, and BSA Compliance

We will present a framework for financial crime (FinCrime) detection that leverages the Databricks lakehouse architecture, specifically showing how institutions can achieve both the data flexibility and the ACID transaction guarantees essential for FinCrime monitoring. The framework incorporates advanced ML models for anomaly detection, pattern recognition, and predictive analytics, while maintaining the clear data lineage and audit trails required by regulatory bodies. We will also discuss specific improvements in false-positive reduction, detection speed, and regulatory reporting turnaround, and delve into how the architecture addresses specific FATF recommendations, Basel III risk management requirements, and BSA compliance obligations, particularly in transaction monitoring and suspicious activity reporting (SAR). The ability to handle structured and unstructured data while maintaining data quality and governance makes this approach particularly valuable for large financial institutions dealing with complex, multi-jurisdictional compliance requirements.
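
The abstract describes the framework at a high level; purely to illustrate the lakehouse pattern it relies on (score governed Delta tables, write auditable flags back), here is a toy PySpark anomaly flag using a per-account z-score. The real framework uses far richer ML models, and the table names here are placeholders.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

txns = spark.table("fincrime.silver.transactions")   # placeholder table name

acct = Window.partitionBy("account_id")
scored = (txns
    .withColumn("amt_mean", F.mean("amount").over(acct))
    .withColumn("amt_std", F.stddev("amount").over(acct))
    .withColumn("z_score", (F.col("amount") - F.col("amt_mean")) / F.col("amt_std"))
    .withColumn("is_anomalous", F.abs(F.col("z_score")) > 4))

# Write flags back to a governed Delta table so lineage and audit trails apply.
scored.write.mode("append").saveAsTable("fincrime.gold.transaction_anomaly_flags")
```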

From Imperative to Declarative Paradigm: Rebuilding a CI/CD Infrastructure Using Hatch and DABs

Building and deploying PySpark pipelines to Databricks should be effortless. However, our team at FreeWheel has, for the longest time, struggled with a convoluted and hard-to-maintain CI/CD infrastructure. It followed an imperative paradigm, demanding that every project implement custom scripts to build artifacts and deploy resources, resulting in redundant boilerplate code and awkward interactions with the Databricks REST API. We set out to rebuild it from scratch, following a declarative paradigm instead. We will share how we eliminated thousands of lines of code from our repository, created a fully configuration-driven infrastructure onto which projects can be easily onboarded, and improved the quality of our codebase, using Hatch and Databricks Asset Bundles (DABs) as our tools of choice. In particular, DABs have made deploying across our three environments a breeze and have allowed us to quickly adopt new features as soon as Databricks releases them.

Maximize Retail Data Insights in Genie with Delta Sharing via Crisp’s Collaborative Commerce Platform

Crisp streamlines a brand’s data ingestion across 60+ retail sources to build a foundation of sales and inventory intelligence on Databricks. Data is normalized, analysis-ready, and integrates seamlessly with AI tools such as Databricks’ Genie and Blueprints. This session will provide an overview of the Crisp retail data platform and how our semantic layer and normalized, harmonized data sets can drive powerful insights for supply chain, BI/analytics, and data science teams.
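
As a hedged sketch of how a consumer reads such a share, the open-source delta-sharing Python client can load a shared table by name. The profile path and share/schema/table names below are placeholders, not Crisp's actual share layout.

```python
import delta_sharing

# Credential file issued by the data provider.
profile = "/path/to/crisp_share_profile.share"

# <share>.<schema>.<table>, placeholder names.
table_url = f"{profile}#retail_share.harmonized.daily_store_sales"

# Load into pandas for quick analysis (use load_as_spark on a cluster for scale).
sales = delta_sharing.load_as_pandas(table_url)
print(sales.head())
```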

Real-Time Analytics Pipeline for IoT Device Monitoring and Reporting

This session will show how we implemented a solution to support high-frequency data ingestion from smart meters. We built a robust API endpoint that interfaces directly with IoT devices and processes messages in real time from millions of distributed devices and meters across the network. The architecture leverages cloud storage as a landing zone for the raw data, followed by a streaming pipeline built on Lakeflow Declarative Pipelines. This pipeline implements a multi-layer medallion architecture to progressively clean, transform, and enrich the data, and operates continuously to maintain near real-time data freshness in our gold-layer tables. These datasets connect directly to Databricks Dashboards, providing stakeholders with immediate insights into their operational metrics. This solution demonstrates how modern data architecture can handle high-volume IoT data streams while maintaining data quality and providing accessible real-time analytics for business users.
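
A condensed sketch of the medallion flow described above, written against the Lakeflow Declarative Pipelines (DLT) Python API; paths, schemas, and window sizes are placeholders rather than the production pipeline. Inside a declarative pipeline, `spark` is provided by the runtime.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw smart-meter messages landed in cloud storage")
def meter_readings_bronze():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/iot/landing/meter_messages/"))   # placeholder path

@dlt.table(comment="Silver: parsed, typed, and deduplicated readings")
def meter_readings_silver():
    return (dlt.read_stream("meter_readings_bronze")
            .select(
                F.col("device_id"),
                F.to_timestamp("event_ts").alias("event_ts"),
                F.col("reading_kwh").cast("double").alias("reading_kwh"))
            .dropDuplicates(["device_id", "event_ts"]))

@dlt.table(comment="Gold: near real-time consumption metrics for dashboards")
def meter_consumption_gold():
    return (dlt.read_stream("meter_readings_silver")
            .withWatermark("event_ts", "10 minutes")
            .groupBy(F.window("event_ts", "15 minutes"), "device_id")
            .agg(F.sum("reading_kwh").alias("kwh_15min")))
```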

Scaling Trust in BI: How Bolt Manages Thousands of Metrics Across Databricks, dbt, and Looker

Managing metrics across teams can feel like everyone’s speaking a different language, which often leads to loss of trust in numbers. Based on a real-world use case, we’ll show you how to establish a governed source of truth for metrics that works at scale and builds a solid foundation for AI integration. You’ll explore how Bolt.eu’s data team governs consistent metrics for different data users and leverages Euno’s automations to navigate the overlap between Looker and dbt. We’ll cover best practices for deciding where your metrics belong and how to optimize engineering and maintenance workflows across Databricks, dbt and Looker. For curious analytics engineers, we’ll dive into thinking in dimensions & measures vs. tables & columns and determining when pre-aggregations make sense. The goal is to help you contribute to a self-serve experience with consistent metric definitions, so business teams and AI agents can access the right data at the right time without endless back-and-forth.

Securing Databricks Using Databricks as SIEM

This session showcases our approach to leveraging Databricks product capabilities to prevent and mitigate security risks for Databricks itself. It demonstrates how Databricks can serve as a powerful Security Information and Event Management (SIEM) platform, offering advanced capabilities for data collection and threat detection. The session explores data collection from diverse data sources and real-time threat detection.
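
As an illustration of the kind of detection logic such a setup runs over Databricks' own audit data, here is a minimal query against the system.access.audit table; the service/action names and threshold are illustrative, not the production rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Flag identities with repeated failed logins in a short window.
failed_logins = (spark.table("system.access.audit")
    .where(F.col("service_name") == "accounts")       # illustrative filter
    .where(F.col("action_name") == "login")
    .where(F.col("response.status_code") == 401)
    .groupBy("user_identity.email", F.window("event_time", "15 minutes"))
    .agg(F.count("*").alias("failures"))
    .where("failures >= 5"))                           # crude brute-force heuristic

failed_logins.show(truncate=False)
```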

Simplify Data Ingest and Egress with the New Python Data Source API

Data engineering teams are frequently tasked with building bespoke ingest and/or egress solutions for myriad custom, proprietary, or industry-specific data sources or sinks. Many teams find this work cumbersome and time-consuming. Recognizing these challenges, Databricks interviewed numerous companies across different industries to better understand their diverse data integration needs. This comprehensive feedback led us to develop the Python Data Source API for Apache Spark™.
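
For readers new to the API, here is a minimal custom batch source; the "tickets" source is a made-up example, not one from the session.

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class TicketDataSource(DataSource):
    """Registers under spark.read.format('tickets')."""
    @classmethod
    def name(cls):
        return "tickets"

    def schema(self):
        return "ticket_id string, status string, opened_at timestamp"

    def reader(self, schema):
        return TicketReader(self.options)

class TicketReader(DataSourceReader):
    def __init__(self, options):
        self.options = options

    def read(self, partition):
        # A real source would page through a proprietary API or file format;
        # here we simply yield rows matching the declared schema.
        import datetime
        yield ("T-1001", "open", datetime.datetime(2025, 1, 1, 9, 0))
        yield ("T-1002", "closed", datetime.datetime(2025, 1, 2, 14, 30))

# Register once per session, then use it like any built-in format:
# spark.dataSource.register(TicketDataSource)
# df = spark.read.format("tickets").load()
```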

Sponsored by: Google Cloud | Unleash the power of Gemini for Databricks

Elevate your AI initiatives on Databricks by harnessing the latest advancements in Google Cloud's Gemini models. Learn how to integrate Gemini's built-in reasoning and powerful development tools to build more dynamic and intelligent applications within your existing Databricks platform. We'll explore concrete ideas for agentic AI solutions, showcasing how Gemini can help you unlock new value from your data in Databricks.

Sponsored by: Hightouch | Unleashing AI at PetSmart: Using AI Decisioning Agents to Drive Revenue

With 75M+ Treats Rewards members, PetSmart knows how to build loyalty with pet parents. But recently, traditional email testing and personalization strategies weren’t delivering the engagement and growth they wanted—especially in the Salon business. This year, they replaced their email calendar and A/B testing with AI Decisioning, achieving a +22% incremental lift in bookings. Join Bradley Breuer, VP of Marketing – Loyalty, Personalization, CRM, and Customer Analytics, to learn how his team reimagined CRM using AI to personalize campaigns and dynamically optimize creative, offers, and timing for every unique pet parent. Learn:
How PetSmart blends human insight and creativity with AI to deliver campaigns that engage and convert.
How they moved beyond batch-and-blast calendars with AI Decisioning Agents to optimize sends—while keeping control over brand, messaging, and frequency.
How using Databricks as their source of truth led to surprising learnings and better outcomes.

Sponsored by: RowZero | Spreadsheets in the modern data stack: security, governance, AI, and self-serve analytics

Despite the proliferation of cloud data warehousing, BI tools, and AI, spreadsheets are still the most ubiquitous data tool. Business teams in finance, operations, sales, and marketing often need to analyze data in the cloud data warehouse but don't know SQL and don't want to learn BI tools. AI tools offer a new paradigm but still haven't broadly replaced the spreadsheet. With new AI tools and legacy BI tools providing business teams access to data inside Databricks, security and governance are put at risk. In this session, Row Zero CEO, Breck Fresen, will share examples and strategies data teams are using to support secure spreadsheet analysis at Fortune 500 companies and the future of spreadsheets in the world of AI. Breck is a former Principal Engineer from AWS S3 and was part of the team that wrote the S3 file system. He is an expert in storage, data infrastructure, cloud computing, and spreadsheets.

Sponsored by: Snowplow | Snowplow Signals: Powering Tomorrow’s Customer Experiences on Databricks

The web is on the verge of a major shift. Agentic applications will redefine how customers engage with digital experiences—delivering highly personalized, relevant interactions. In this talk, Snowplow CTO Yali Sassoon explores how Snowplow Signals enables agents to perceive users through short- and long-term memory, natively on the Databricks Data Intelligence Platform.

Transforming Data Governance for Multimodal Data at Amgen With Databricks

Amgen is advancing its Enterprise Data Fabric to securely manage sensitive multimodal data, such as imaging and research data, across formats. Databricks is already the de facto standard for governance of structured data, and Amgen seeks to extend it to unstructured multimodal data as well. This approach will also allow Amgen to standardize its GenAI projects on Databricks. Key priorities include:
Centralized data access: establishing a unified, secure access control system
Enhanced traceability: implementing detailed processes for transparency and accountability
Consistent access standards: ensuring a uniform data access privilege experience
User support: providing flexible access for diverse stakeholders
Comprehensive auditing: enabling thorough permission audits and data usage tracking
Learn strategies for implementing a comprehensive multimodal data governance framework using Databricks, as we share our experience standardizing data governance for GenAI use cases.
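
As a hedged sketch of the centralized-access piece of such a framework, unstructured files can live in Unity Catalog volumes and be governed with standard grants; the catalog, schema, volume, and group names below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A governed location for imaging and other multimodal files (placeholder names).
spark.sql("CREATE VOLUME IF NOT EXISTS research.imaging.raw_scans")

# Centralized, auditable access control through standard Unity Catalog grants.
spark.sql("GRANT READ VOLUME ON VOLUME research.imaging.raw_scans TO `imaging-analysts`")
spark.sql("GRANT WRITE VOLUME ON VOLUME research.imaging.raw_scans TO `ingest-service`")
```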

Unlocking Access: Simplifying Identity Management at Scale With Databricks

Effective Identity and Access Management (IAM) is essential for securing enterprise environments while enabling innovation and collaboration. As companies scale, ensuring users have the right access without adding administrative overhead is critical. In this session, we’ll explore how Databricks is simplifying identity management by integrating with customers’ Identity Providers (IdPs). Learn about Automatic Identity Management in Azure Databricks, which eliminates the need for SCIM provisioning for Entra ID users, and about scalable identity provisioning for other IdPs. We'll also cover externally managed groups, PIM integration and upcoming enhancements like a bring-your-own-IdP model for Google Cloud. Through a customer success story and live demo, see how Databricks is making IAM more scalable, secure and user-friendly.
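
As a small illustration of the administrative surface this simplifies, the Databricks SDK for Python can enumerate the groups that an IdP-backed setup keeps in sync; authentication here is assumed to come from a standard config profile or environment variables.

```python
from databricks.sdk import WorkspaceClient

# Credentials are resolved from the environment or a Databricks config profile.
w = WorkspaceClient()

# Enumerate workspace groups, e.g. those synced from the identity provider.
for group in w.groups.list():
    print(group.display_name)
```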

Building Trustworthy AI at Northwestern Mutual: Guardrail Technologies and Strategies

This intermediate-level presentation will explore the various methods we've leveraged within Databricks to deliver and evaluate guardrail models for AI safety, from prompt engineering with custom-built frameworks to hosting models served from the Marketplace and beyond. We've used GPU clusters to fine-tune and run large open-source models such as Llama Guard 3.1 at inference, and to generate synthetic datasets based on questions we've received in production.
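
As a hedged sketch of how a guardrail check might sit in front of an application, the snippet below calls a served model through MLflow's deployments client; the endpoint name and response parsing are assumptions that depend on how the guardrail model is deployed.

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

def is_unsafe(user_prompt: str) -> bool:
    """Ask a served guardrail model to classify a prompt before answering it."""
    response = client.predict(
        endpoint="llama-guard-endpoint",   # placeholder endpoint name
        inputs={"messages": [{"role": "user", "content": user_prompt}]},
    )
    # Llama Guard-style models typically answer with a safe/unsafe verdict as text;
    # the response shape below assumes a chat-format serving endpoint.
    verdict = response["choices"][0]["message"]["content"]
    return verdict.strip().lower().startswith("unsafe")

# if is_unsafe(question): return a refusal instead of calling the main model
```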

No Time for the Dad Bod: Automating Life with AI and Databricks

Life as a father, tech leader, and fitness enthusiast demands efficiency. To reclaim my time, I’ve built AI-driven solutions that automate everyday tasks—from research agents that prep for podcasts to multi-agent systems that plan meals—all powered by real-time data and automation. This session dives into the technical foundations of these solutions, focusing on event-driven agent design and scalable patterns for robust AI systems. You’ll discover how Databricks technologies like Delta Lake, for reliable and scalable data management, and DSPy, for streamlining the development of generative AI workflows, empower seamless decision-making and deliver actionable insights. Through detailed architecture diagrams and a live demo, I’ll showcase how to design systems that process data in motion to tackle complex, real-world problems. Whether you’re an engineer, architect, or data scientist, you’ll leave with practical strategies to integrate AI-driven automation into your workflows.
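
As a minimal sketch of the DSPy side of such a system, a declarative signature plus dspy.Predict covers a single agent step; the model identifier and field names here are placeholders.

```python
import dspy

# Point DSPy at whatever model endpoint you serve; this identifier is a placeholder.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class MealPlan(dspy.Signature):
    """Plan a day of meals that fits the given constraints."""
    constraints: str = dspy.InputField(desc="dietary goals, time budget, pantry items")
    plan: str = dspy.OutputField(desc="breakfast, lunch, dinner with prep times")

planner = dspy.Predict(MealPlan)
result = planner(constraints="high protein, 30 minutes max prep, no dairy")
print(result.plan)
```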

Zillow has well-established, comprehensive systems for defining and enforcing data quality contracts and detecting anomalies. In this session, we will share how we evaluated Databricks’ native data quality features and why we chose expectations in Lakeflow Declarative Pipelines for our declarative pipelines, along with a combination of enforced constraints and self-defined queries for other job types. Our evaluation considered factors such as performance overhead, cost, and scalability. We’ll highlight key improvements over our previous system and demonstrate how these choices have enabled Zillow to enforce scalable, production-grade data quality. Additionally, we are actively testing Databricks’ latest data quality innovations, including enhancements to lakehouse monitoring and the newly released DQX project from Databricks Labs. In summary, we will cover Zillow’s approach to data quality in the lakehouse, key lessons from our migration, and actionable takeaways.
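
For context on the option Zillow chose, expectations in Lakeflow Declarative Pipelines are declared directly on a table definition; the sketch below uses placeholder tables and rules, not Zillow's contracts.

```python
import dlt

@dlt.table(comment="Listings with declarative data quality rules")
@dlt.expect_or_drop("valid_listing_id", "listing_id IS NOT NULL")  # drop bad rows
@dlt.expect_or_fail("positive_price", "price > 0")                 # fail the update
@dlt.expect("has_zip", "zip_code IS NOT NULL")                     # track only
def listings_silver():
    return dlt.read_stream("listings_bronze")
```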