talk-data.com

Topic: Databricks

Tags: big_data, analytics, spark

1286 activities tagged

Activity Trend: 515 peak/qtr (2020-Q1 to 2026-Q1)

Activities

1286 activities · Newest first

Databricks, the Good, the Bad and the Ugly

Databricks is the bestest platform ever, where everything is perfect and nothing could ever make it any better, right? …right? You and I both know this is not true. Don’t get me wrong, there are features that I absolutely love, but there are also some that require powering through the papercuts. And then there are those that I pretend don’t exist. I’ll open up and give my honest take on three features from each category, explain why I do (or don’t) like them, and point you to the talks where you can find out more.

You shouldn’t have to sacrifice data governance just to leverage the tools your business needs. In this session, we will give practical tips on how you can cut through the data sprawl and get a unified view of your data estate in Unity Catalog without disrupting existing workloads. We will walk through how to set up federation with Glue, Hive Metastore, and other catalogs like Snowflake, and show you how powerful new tools help you adopt Databricks at your own pace with no downtime and full interoperability.
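
As a rough sketch of the kind of setup described above, the snippet below registers a Snowflake database as a foreign catalog in Unity Catalog using Lakehouse Federation SQL from a Databricks notebook (where `spark` and `display` are predefined). The connection name, host, warehouse, secret scope, and database are hypothetical placeholders; check the exact options against the Lakehouse Federation docs for your source.

```python
# Hypothetical Lakehouse Federation setup for a Snowflake source.
# All names, hosts, and secret scopes below are placeholders.

spark.sql("""
  CREATE CONNECTION IF NOT EXISTS snowflake_conn TYPE snowflake
  OPTIONS (
    host 'myaccount.snowflakecomputing.com',                   -- placeholder host
    port '443',
    sfWarehouse 'ANALYTICS_WH',
    user secret('federation_scope', 'snowflake_user'),         -- stored in a secret scope
    password secret('federation_scope', 'snowflake_password')
  )
""")

spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS snowflake_sales
  USING CONNECTION snowflake_conn
  OPTIONS (database 'SALES')
""")

# Federated tables can then be queried in place, governed by Unity Catalog:
display(spark.sql("SELECT * FROM snowflake_sales.public.orders LIMIT 10"))
```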

In this course, you'll learn concepts and perform labs that showcase workflows using Unity Catalog - Databricks' unified and open governance solution for data and AI. We'll start off with a brief introduction to Unity Catalog, discuss fundamental data governance concepts, and then dive into a variety of topics including using Unity Catalog for data access control, managing external storage and tables, data segregation, and more.
Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks); cloud computing concepts (virtual machines, object storage, etc.); production experience working with data warehouses and data lakes; intermediate experience with basic SQL concepts (select, filter, group by, join, etc.); beginner programming experience with Python (syntax, conditions, loops, functions); beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.)
Labs: Yes
Certification Path: Databricks Certified Data Engineer Associate
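
A minimal sketch of the access-control statements this kind of course covers, assuming a Databricks notebook with `spark` available and a caller who can grant on these objects; the catalog, schema, table, external location, and group names are hypothetical.

```python
# Hypothetical Unity Catalog grants; object and group names are placeholders.
statements = [
    "GRANT USE CATALOG ON CATALOG main TO `data_analysts`",
    "GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`",
    "GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`",
    # External locations underpin external tables and volumes
    "GRANT READ FILES ON EXTERNAL LOCATION sales_raw_loc TO `data_engineers`",
]

for stmt in statements:
    spark.sql(stmt)
```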

In this course, you’ll learn how to orchestrate data pipelines with Lakeflow Jobs (previously Databricks Workflows) and schedule dashboard updates to keep analytics up-to-date. We’ll cover topics like getting started with Lakeflow Jobs, how to use Databricks SQL for on-demand queries, and how to configure and schedule dashboards and alerts to reflect updates to production data pipelines.
Pre-requisites: Beginner familiarity with the Databricks Data Intelligence Platform (selecting clusters, navigating the Workspace, executing notebooks); cloud computing concepts (virtual machines, object storage, etc.); production experience working with data warehouses and data lakes; intermediate experience with basic SQL concepts (select, filter, group by, join, etc.); beginner programming experience with Python (syntax, conditions, loops, functions); beginner programming experience with the Spark DataFrame API (configure DataFrameReader and DataFrameWriter to read and write data, express query transformations using DataFrame methods and Column expressions, etc.)
Labs: No
Certification Path: Databricks Certified Data Engineer Associate
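
As a hedged illustration of programmatic orchestration (not course material), the sketch below creates a scheduled job with the Databricks SDK for Python. The job name, notebook path, and cron expression are hypothetical, and class and parameter names may vary slightly by SDK version.

```python
# Hypothetical scheduled job via the Databricks SDK for Python.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # resolves credentials from the environment or CLI profile

created = w.jobs.create(
    name="nightly_sales_refresh",  # placeholder
    tasks=[
        jobs.Task(
            task_key="refresh_gold_tables",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/pipelines/refresh_gold"  # placeholder
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # 02:00 daily
        timezone_id="UTC",
    ),
)
print(f"Created job {created.job_id}")
```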

Easy Ways to Optimize Your Databricks Costs

In this session, we will explore effective strategies for optimizing costs on the Databricks platform, a leading solution for handling large-scale data workloads. Databricks, known for its open and unified approach, offers several tools and methodologies to ensure users can maximize their return on investment (ROI) while managing expenses efficiently. Key points:
- Understanding usage with AI/BI tools
- Organizing costs with tagging
- Setting up budgets
- Leveraging System Tables
By the end of this session, you will have a comprehensive understanding of how to leverage Databricks' built-in tools for cost optimization, ensuring that your data and AI projects not only deliver value but do so in a cost-effective manner. This session is ideal for data engineers, financial analysts, and decision-makers looking to enhance their organization’s efficiency and financial performance through strategic cost management on Databricks.
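
For a concrete flavor of the "understanding usage" and "tagging" points, here is a hypothetical query over the system.billing.usage system table; the 'team' tag key and the exact column names should be verified against your workspace's system table schema.

```python
# Hypothetical cost breakdown by custom tag over the last 30 days.
usage_by_team = spark.sql("""
    SELECT
      usage_date,
      sku_name,
      custom_tags['team']  AS team,          -- assumes a 'team' tag is applied to compute
      SUM(usage_quantity)  AS dbus_consumed
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name, custom_tags['team']
    ORDER BY dbus_consumed DESC
""")
display(usage_by_team)
```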

Elevating Data Quality Standards With Databricks DQX

Join us for an introductory session on Databricks DQX, a Python-based framework designed to validate the quality of PySpark DataFrames. Discover how DQX can empower you to proactively tackle data quality challenges, enhance pipeline reliability and make more informed business decisions with confidence. Traditional data quality tools often fall short by providing limited, actionable insights, relying heavily on post-factum monitoring, and being restricted to batch processing. DQX overcomes these limitations by enabling real-time quality checks at the point of data entry, supporting both batch and streaming data validation and delivering granular insights at the row and column level. If you’re seeking a simple yet powerful data quality framework that integrates seamlessly with Databricks, this session is for you.
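
The abstract doesn't show DQX's own API, so the sketch below is plain PySpark in the same spirit: row-level checks that annotate failing rows (for quarantine) instead of silently dropping them. It is not the DQX interface, just an illustration of the pattern.

```python
# Generic row-level quality flagging in PySpark; NOT the DQX API.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "2024-01-05", 120.0), (2, None, -5.0), (3, "2024-01-07", 80.0)],
    ["order_id", "order_date", "amount"],
)

checks = {
    "missing_order_date": F.col("order_date").isNull(),
    "negative_amount": F.col("amount") < 0,
}

# Collect the names of all failed checks per row into an array column.
failed = F.filter(
    F.array(*[F.when(cond, F.lit(name)) for name, cond in checks.items()]),
    lambda x: x.isNotNull(),
)

annotated = orders.withColumn("failed_checks", failed)
good_rows = annotated.filter(F.size("failed_checks") == 0)
quarantined_rows = annotated.filter(F.size("failed_checks") > 0)
```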

This course introduces learners to evaluating and governing GenAI (generative artificial intelligence) systems. First, learners will explore the meaning behind and motivation for building evaluation and governance/security systems. Next, the course will connect evaluation and governance systems to the Databricks Data Intelligence Platform. Third, learners will be introduced to a variety of evaluation techniques for specific components and types of applications. Finally, the course will conclude with an analysis of evaluating entire AI systems with respect to performance and cost.
Pre-requisites: Familiarity with prompt engineering, and experience with the Databricks Data Intelligence Platform. Additionally, knowledge of retrieval-augmented generation (RAG) techniques including data preparation, embeddings, vectors, and vector databases.
Labs: Yes
Certification Path: Databricks Certified Generative AI Engineer Associate

Getting Started With Lakeflow Connect

Hundreds of customers are already ingesting data with Lakeflow Connect from SQL Server, Salesforce, ServiceNow, Google Analytics, SharePoint, PostgreSQL and more to unlock the full power of their data. Lakeflow Connect introduces built-in, no-code ingestion connectors from SaaS applications, databases and file sources to help unlock data intelligence. In this demo-packed session, you’ll learn how to ingest ready-to-use data for analytics and AI with a few clicks in the UI or a few lines of code. We’ll also demonstrate how Lakeflow Connect is fully integrated with the Databricks Data Intelligence Platform for built-in governance, observability, CI/CD, automated pipeline maintenance and more. Finally, we’ll explain how to use Lakeflow Connect in combination with downstream analytics and AI tools to tackle common business challenges and drive business impact.

Improve AI Training With the First Synthetic Personas Dataset Aligned to Real-World Distributions

A big challenge in LLM development and synthetic data generation is ensuring data quality and diversity. While data incorporating varied perspectives and reasoning traces consistently improves model performance, procuring such data remains impossible for most enterprises. Human-annotated data struggles to scale, while purely LLM-based generation often suffers from distribution clipping and low entropy. In a novel compound AI approach, we combine LLMs with probabilistic graphical models and other tools to generate synthetic personas grounded in real demographic statistics. The approach allows us to address major limitations in bias, licensing, and persona skew of existing methods. We release the first open-source dataset aligned with real-world distributions and show how enterprises can leverage it with Gretel Data Designer (now part of NVIDIA) to bring diversity and quality to model training on the Databricks platform, all while addressing model collapse and data provenance concerns head-on.

Lakeflow Connect: Smarter, Simpler File Ingestion With the Next Generation of Auto Loader

Auto Loader is the definitive tool for ingesting data from cloud storage into your lakehouse. In this session, we’ll unveil new features and best practices that simplify every aspect of cloud storage ingestion. We’ll demo out-of-the-box observability for pipeline health and data quality, walk through improvements for schema management, introduce a series of new data formats and unveil recent strides in Auto Loader performance. Along the way, we’ll provide examples and best practices for optimizing cost and performance. Finally, we’ll introduce a preview of what’s coming next — including a REST API for pushing files directly to Delta, a UI for creating cloud storage pipelines and more. Join us to help shape the future of file ingestion on Databricks.
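
For orientation, a minimal Auto Loader pipeline looks roughly like the sketch below; the source path, schema location, checkpoint location, and target table are hypothetical placeholders.

```python
# Minimal Auto Loader (cloudFiles) ingestion into a Delta table; paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")
    .load("/Volumes/main/raw/events/")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")
    .trigger(availableNow=True)   # process newly arrived files, then stop
    .toTable("main.raw.events"))
```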

This course is designed to introduce three primary machine learning deployment strategies and illustrate the implementation of each strategy on Databricks. Following an exploration of the fundamentals of model deployment, the course delves into batch inference, offering hands-on demonstrations and labs for utilizing a model in batch inference scenarios, along with considerations for performance optimization. The second part of the course comprehensively covers pipeline deployment, while the final segment focuses on real-time deployment. Participants will engage in hands-on demonstrations and labs, deploying models with Model Serving and utilizing the serving endpoint for real-time inference. By mastering deployment strategies for a variety of use cases, learners will gain the practical skills needed to move machine learning models from experimentation to production. This course shows you how to operationalize AI solutions efficiently, whether it's automating decisions in real-time or integrating intelligent insights into data pipelines.
Pre-requisites: Familiarity with Databricks workspace and notebooks; familiarity with Delta Lake and the Lakehouse; intermediate-level knowledge of Python (e.g., common Python libraries for DS/ML like Scikit-Learn); awareness of model deployment strategies.
Labs: Yes
Certification Path: Databricks Certified Machine Learning Associate
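
As a hedged example of the batch-inference pattern the course describes, the sketch below scores a feature table with a registered model wrapped as a Spark UDF via MLflow; the model URI, alias, and table names are hypothetical, and a Unity Catalog model registry is assumed.

```python
# Hypothetical batch inference with an MLflow-registered model.
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
mlflow.set_registry_uri("databricks-uc")  # assumes models are registered in Unity Catalog

model_uri = "models:/main.ml.churn_classifier@champion"  # placeholder model and alias
predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type="double")

features = spark.table("main.ml.customer_features")      # placeholder feature table
scored = features.withColumn(
    "churn_score",
    predict(*[c for c in features.columns if c != "customer_id"]),
)
scored.write.mode("overwrite").saveAsTable("main.ml.customer_churn_scores")
```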

In this session, attendees will learn how to leverage Databricks' system tables to measure user adoption and track key performance indicators (KPIs) for data products. The session will focus on how organizations can use system tables to analyze user behavior, assess engagement with data products and identify usage trends that can inform product development. By measuring KPIs such as user retention, frequency of use and data queries, organizations can optimize their data products for better performance and ROI.
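
As one hypothetical example of such a KPI, the query below estimates daily active users per service from the system.access.audit table; the table and column names (event_date, user_identity.email, service_name) should be verified against your workspace's audit log schema.

```python
# Hypothetical adoption KPI: daily active users by service over the last 90 days.
daily_active_users = spark.sql("""
    SELECT
      event_date,
      service_name,
      COUNT(DISTINCT user_identity.email) AS active_users
    FROM system.access.audit
    WHERE event_date >= date_sub(current_date(), 90)
    GROUP BY event_date, service_name
    ORDER BY event_date
""")
display(daily_active_users)
```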

Scaling Sales Excellence: How Databricks Uses Its Own Tech to Train GTM Teams

In this session, discover how Databricks leverages the power of Gen AI, MosaicML, Model Serving and Databricks Apps to revolutionize sales enablement. We’ll showcase how we built an advanced chatbot that equips our go-to-market team with the tools and knowledge needed to excel in customer-facing interactions. This AI-driven solution not only trains our salespeople but also enhances their confidence and effectiveness in demonstrating the transformative potential of Databricks to future customers. Attendees will gain insights into the architecture, development process and practical applications of this innovative approach. The session will conclude with an interactive demo, offering a firsthand look at the chatbot in action. Join us to explore how Databricks is using its own platform to drive sales excellence through cutting-edge AI solutions.

Self-Improving Agents and Agent Evaluation With Arize & Databricks MLflow

As autonomous agents become increasingly sophisticated and widely deployed, the ability for these agents to evaluate their own performance and continuously self-improve is essential. However, the growing complexity of these agents amplifies potential risks, including exposure to malicious inputs and generation of undesirable outputs. In this talk, we'll explore how to build resilient, self-improving agents. To drive self-improvement effectively, both the agent and the evaluation techniques must simultaneously improve with a continuously iterating feedback loop. Drawing from extensive real-world experiences across numerous productionized use cases, we will demonstrate practical strategies for combining tools from Arize, Databricks MLflow and Mosaic AI to evaluate and improve high-performing agents.

Sponsored by: Lovelytics | Predict and Mitigate Asset Risk: Unlock Geospatial Analytics with GenAI

Discover how Xcel Energy and Lovelytics leveraged the power of geospatial analytics and GenAI to tackle one of the energy sector’s most pressing challenges—wildfire prevention. Transitioning from manual processes to automated GenAI unlocked transformative business value, delivering over 3x greater data coverage, over 4x improved accuracy, and 64x faster processing of geospatial data. In this session, you'll learn how Databricks empowers data leaders to transform raw data, like location information and visual imagery, into actionable insights that save costs, mitigate risks, and enhance customer service. Walk away with strategies for scaling geospatial workloads efficiently, building GenAI-driven solutions, and driving innovation in energy and utilities.

Sponsored by: Qlik | Turning Data into Business Impact: How to Build AI-Ready, Trusted Data Products on Databricks

Explore how to build use case-specific data products designed to power everything from traditional BI dashboards to machine learning and LLM-enabled applications. Gain an understanding of what data products are and why they are essential for delivering AI-ready data that is integrated, timely, high-quality, secure, contextual, and easily consumable. Discover strategies for unlocking business data from source systems to enable analytics and AI use cases, with a deep dive into the three-tiered data product architecture: the Data Product Engineering Plane (where data engineers ingest, integrate, and transform data), the Data Product Management Plane (where teams manage the full lifecycle of data products), and the Data Product Marketplace Plane (where consumers search for and use data products). Discover how a flexible, composable data architecture can support organizations at any stage of their data journey and drive impactful business outcomes.

State Street Uses Databricks as a Cybersecurity Lakehouse for Threat Intelligence & Real-Time Alerts

Organizations face the challenge of managing vast amounts of data to combat emerging threats. The Databricks Data Intelligence Platform represents a paradigm shift in cybersecurity at State Street, providing a comprehensive solution for managing and analyzing diverse security data. Through its partnership with Databricks, State Street has created the capability to:
- Efficiently manage structured and unstructured data
- Scale up to analyze 50 petabytes of data in real time
- Ingest and parse data for critical security data streams
- Build advanced cybersecurity data products and use automation & orchestration to streamline cybersecurity operations
By leveraging these capabilities, State Street has positioned itself as a leader in the financial services industry when it comes to cybersecurity.

The Future of Real Time Insights with Databricks and SAP

Tired of waiting on SAP data? Join this session to see how Databricks and SAP make it easy to query business-ready data—no ETL. With Databricks SQL, you’ll get instant scale, automatic optimizations, and built-in governance across all your enterprise analytics data. Fast and AI-powered insights from SAP data are finally possible—and this is how.

The Lakeflow Effect
talk
by Bilal Aslam (Databricks), Josue Bogran (JosueBogran.com & zeb.co)

Lakeflow brings much excitement, simplicity and unification to Databricks’ engineering experience. Databricks’ Bilal Aslam (Sr. Director of Product Management) and Josue A. Bogran (Databricks MVP & content creator) provide an overview of the history of Lakeflow, its current value to your organization and where its capabilities are headed. The session covers:
- What is Lakeflow?
- Differences and similarities between Lakeflow Declarative Pipelines
- Overview of current Lakeflow Connect, Pipelines and Jobs capabilities
- How to get started
- What's next?
The session will also provide you with an opportunity to ask questions to the team behind Lakeflow.

ThredUp’s Journey with Databricks: Modernizing Our Data Infrastructure

Building an AI-ready data platform requires strong governance, performance optimization, and seamless adoption of new technologies. At ThredUp, our Databricks journey began with a need for better data management and evolved into a full-scale transformation powering analytics, machine learning, and real-time decision-making. In this session, we’ll cover:
- Key inflection points: moving from legacy systems to a modernized Delta Lake foundation
- Unity Catalog’s impact: improving governance, access control, and data discovery
- Best practices for onboarding: ensuring smooth adoption for engineering and analytics teams
- What’s next: Serverless SQL and conversational analytics with Genie
Whether you’re new to Databricks or scaling an existing platform, you’ll gain practical insights on navigating the transition, avoiding pitfalls, and maximizing AI and data intelligence.