talk-data.com

Topic: Databricks
Tags: big_data, analytics, spark

Activity trend: peak of 515 activities per quarter, 2020-Q1 through 2026-Q1

Activities

1286 activities · Newest first

Databricks As Code: Effectively Automate a Secure Lakehouse Using Terraform for Resource Provisioning

At Rivian, we have automated more than 95% of our Databricks resource provisioning workflows using an in-house Terraform module, allowing a lean admin team to manage over 750 users. In this session, we will cover the following elements of our approach and how others can adopt them to improve team efficiency.

  • User and service principal management
  • Our permission model on Unity Catalog for data governance
  • Workspace and secrets resource management
  • Managing internal package dependencies using init scripts
  • Facilitating dashboards, SQL queries and their associated permissions
  • Scaling source-of-truth, petabyte-scale Delta Lake table ingestion jobs and workflows
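
As a sketch of the metadata-driven provisioning idea above, the snippet below generates Terraform-JSON for user resources from a plain metadata list. It is a hypothetical illustration, not Rivian's module: the helper name and the metadata shape are invented, though the `databricks_user` resource and its `user_name`/`display_name` arguments follow the Databricks Terraform provider.

```python
import json

def render_users_tf(users):
    """Render a Terraform-JSON document declaring one databricks_user
    resource per entry in `users` (a list of metadata dicts)."""
    resources = {
        "databricks_user": {
            # Derive a Terraform-safe resource label from the email.
            u["name"].replace("@", "_").replace(".", "_"): {
                "user_name": u["name"],
                "display_name": u.get("display", u["name"]),
            }
            for u in users
        }
    }
    return json.dumps({"resource": resources}, indent=2, sort_keys=True)

team = [{"name": "ada@example.com", "display": "Ada"}]
print(render_users_tf(team))
```

Keeping user metadata in one place and rendering resources from it is what lets a small admin team scale: adding a user is a one-line change, reviewed and applied like any other code.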

Talk by: Jason Shiverick and Vadivel Selvaraj

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

MLOps at Gucci: From Zero to Hero

Delta Lake is an open-source storage format well suited to storing large-scale datasets for both single-node and distributed training of deep learning models. The Delta Lake storage format gives deep learning practitioners unique data management capabilities for working with their datasets. The challenge is that, as of now, it is not possible to train PyTorch models directly from Delta Lake.

The PyTorch community has recently introduced the TorchData library for efficient data loading. This library supports many formats out of the box, but not Delta Lake. This talk will demonstrate single-node and distributed PyTorch training from the Delta Lake storage format using the torchdata framework and the standalone delta-rs Delta Lake implementation.
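
One building block of distributed training from a Delta table is deciding which underlying data files each worker reads. delta-rs can list a table's Parquet files without Spark; the sketch below shows only the pure-Python sharding step, assuming `files` came from something like a `deltalake.DeltaTable(...).file_uris()` call (that call and the file names here are assumptions for illustration).

```python
def shard_files(files, rank, world_size):
    """Round-robin assignment of a Delta table's data files to one
    worker in a distributed training job."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Sort first so every worker computes the same global order.
    return [f for i, f in enumerate(sorted(files)) if i % world_size == rank]

files = [f"part-{i:05d}.parquet" for i in range(10)]
print(shard_files(files, rank=0, world_size=4))
```

Each worker would then wrap its shard in a torchdata datapipe that reads the Parquet files; the key property is that the shards are disjoint and together cover the whole table.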

Talk by: Michael Shtelma

What's New in Databricks SQL -- With Live Demos

We’ve been pushing ahead to make the lakehouse even better for data warehousing across several pillars: a native serverless experience, best-in-class price/performance, intelligent workload management and observability, and enhanced connectivity and analyst and developer experiences. As we double down on that pace of innovation, we want to take a deep dive into everything that’s been keeping us busy.

In this session we will share an update on key roadmap items. To bring things to life, you will see live demos of the most recent capabilities, spanning data ingestion, transformation, and consumption with the modern data stack alongside Databricks SQL.

Talk by: Can Efeoglu

How to Build Metadata-Driven Data Pipelines with Delta Live Tables

In this session, you will learn how you can use metaprogramming to automate the creation and management of Delta Live Tables pipelines at scale. The goal is to make it easy to use DLT for large-scale migrations, and other use cases that require ingesting and managing hundreds or thousands of tables, using generic code components and configuration-driven pipelines that can be dynamically reused across different projects or datasets.
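
The metaprogramming pattern described above can be sketched without any framework: one generic function template is stamped out once per entry in a configuration list. In a real DLT pipeline the inner function would be decorated with `@dlt.table` and return a Spark DataFrame; here a plain registry stands in so the pattern itself is visible (the config fields and table names are made up for illustration).

```python
pipeline_config = [
    {"table": "bronze_orders", "source": "/raw/orders"},
    {"table": "bronze_users", "source": "/raw/users"},
]

registry = {}

def make_table(table, source):
    # Capture the config through function arguments, not the loop
    # variable, so each generated function keeps its own values.
    def load():
        return f"read {source} into {table}"
    registry[table] = load

for cfg in pipeline_config:
    make_table(**cfg)

print(registry["bronze_orders"]())
```

The factory-function indirection matters: defining the inner function directly inside the `for` loop would close over the loop variable and every generated table would read the last config entry.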

Talk by: Mojgan Mazouchi and Ravi Gawai

Building & Managing a Data Platform for a Delta Lake Exceeding 13PB & 1000s of Users: AT&T's Story

Data runs AT&T’s business, just like it runs most businesses these days. Data can lead to a greater understanding of a business and when translated correctly into information can provide human and business systems valuable insights to make better decisions. Unique to AT&T is the volume of data we support, how much of our work that is driven by AI and the scale at which data and AI drive value for our customers and stakeholders.

Our cloud migration journey includes making data and AI more accessible to employees throughout AT&T so they can use their deep business expertise to leverage data more easily and rapidly. We have always had to balance this data democratization and desire for speed with keeping our data private and secure. We love the open ecosystem model of the Lakehouse, which enables data, BI and ML tools to be seamlessly integrated in a single environment; it simplifies the architecture and reduces dependencies between technologies in the cloud. Being clear in our architecture guidelines and patterns was very important to our success.

We are seeing more interest from our business unit partners and are continuing to build out AI as a service to support more citizen data scientists. To scale up our Lakehouse journey, we built a Databricks center of excellence (CoE) at AT&T, which today has more than 1,400 active members, concentrating our existing ML/AI expertise and resources to collaborate on all things Databricks, including technical support, training, FAQs and best practices, to attain and sustain world-class performance and drive business value for AT&T. Join us to learn more about how we process and manage over 10 petabytes in our network Lakehouse with Delta Lake and Databricks.

Build Your Data Lakehouse with a Modern Data Stack on Databricks

Are you looking for an introduction to the Lakehouse and what the related technology is all about? This session is for you. This session explains the value that lakehouses bring to the table using examples of companies that are actually modernizing their data, showing demos throughout. The data lakehouse is the future for modern data teams that want to simplify data workloads, ease collaboration, and maintain the flexibility and openness to stay agile as a company scales.

Come to this session and learn about the full stack, including data engineering, data warehousing in a lakehouse, data streaming, governance, and data science and AI. Learn how you can create modern data solutions of your own.

Talk by: Ari Kaplan and Pearl Ubaru

Lakehouse Federation: Access and Governance of External Data Sources from Unity Catalog

Are you tired of spending time and money moving data across multiple sources and platforms to access the right data at the right time? Join our session and discover Databricks new Lakehouse Federation feature, which allows you to access, query, and govern your data in place without leaving the Lakehouse. Our experts will demonstrate how you can leverage the latest enhancements in Unity Catalog, including query federation, Hive interface, and Delta Sharing, to discover and govern all your data in one place, regardless of where it lives.

Talk by: Can Efeoglu and Todd Greenstein

How to Build LLMs on Your Company’s Data While on a Budget

Large Language Models (LLMs) are taking AI mainstream across companies and individuals. However, public LLMs are trained on general-purpose data; they do not include your own corporate data, and how they were trained is a black box. Because terminology differs across healthcare, financial, retail, digital-native and other industries, companies today are looking for industry-specific LLMs that better capture the terminology, context and knowledge suited to their needs. In contrast to closed LLMs, open source-based models can be used commercially or customized to suit an enterprise’s needs on its own data. Learn how Databricks makes it easy for you to build, tune and use custom models, including a deep dive into Dolly, the first open source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

In this session, you will:

  • See a real-life demo of creating your own LLMs specific to your industry
  • Learn how to securely train on your own documents if needed
  • Learn how Databricks makes it quick, scalable and inexpensive
  • Deep dive into Dolly and its applications
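
One small but practical detail when using an instruction-following model like Dolly: prompts should match the format used during fine-tuning. The template below follows the one published in the databricks/dolly repository; treat the exact wording as an assumption and check the model card for your model version.

```python
# Prompt template for an instruction-tuned model (Dolly-style).
PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction):
    """Wrap a raw instruction in the model's expected prompt format."""
    return PROMPT.format(instruction=instruction.strip())

print(build_prompt("Summarize our Q3 claims data."))
```

The generated text that follows the `### Response:` marker is the model's answer; sending a bare instruction without the template typically degrades output quality for instruction-tuned models.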

Talk by: Sean Owen

Databricks Connect Powered by Spark Connect: Develop and Debug Spark From Any Developer Tool

Spark developers want to develop and debug their code using their tools of choice and development best practices, while ensuring high production fidelity on the target remote cluster. However, Spark's driver architecture is monolithic, with no built-in capability to connect directly to a remote Spark cluster from languages other than SQL. This makes it hard to enable such interactive developer experiences from a user’s local IDE of choice. Spark Connect’s decoupled client-server architecture introduces remote connectivity to Spark clusters and thereby enables an interactive development experience: Spark and its open ecosystem can be leveraged from everywhere.

In this session, we show how we leverage Spark Connect to build a completely redesigned version of Databricks Connect, a first-class IDE-based developer experience that offers interactive debugging from any IDE. We show how developers can easily ensure consistency between their local and remote environments. We walk the audience through real-life examples of how to locally debug code running on Databricks. We also show how Databricks Connect integrates into the Databricks Visual Studio Code extension for an even better developer experience.

Talk by: Martin Grund and Stefania Leone

Using DMS and DLT for Change Data Capture

Bringing Relational Data Store (RDS) data into your data lake is a critical process that enables many use cases. By leveraging AWS Database Migration Service (DMS) and Databricks Delta Live Tables (DLT), we can simplify change data capture from your RDS. In this talk, we break down this complex process by discussing the fundamentals and best practices, followed by a demo that brings it all together.
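
The core of CDC processing, which DLT exposes as APPLY CHANGES INTO, can be sketched in a few lines of pure Python: change events carry a key, an operation, and a sequence number; the latest event per key wins, and deletes drop the row. The field names below are illustrative, not the actual DMS record schema.

```python
def apply_changes(events):
    """Reduce an out-of-order CDC feed to the final table state."""
    latest = {}
    # Order by the sequencing column so later changes win.
    for e in sorted(events, key=lambda e: e["seq"]):
        latest[e["key"]] = e
    # Deletes remove the key from the materialized result.
    return {k: e["value"] for k, e in latest.items() if e["op"] != "delete"}

feed = [
    {"key": 1, "op": "insert", "value": "a", "seq": 1},
    {"key": 1, "op": "update", "value": "b", "seq": 2},
    {"key": 2, "op": "insert", "value": "c", "seq": 3},
    {"key": 2, "op": "delete", "value": None, "seq": 4},
]
print(apply_changes(feed))  # {1: 'b'}
```

Choosing a reliable sequencing column from the DMS feed is the part that deserves the most care in practice; without it, late-arriving updates can silently resurrect deleted or stale rows.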

Talk by: Neil Patel and Ganesh Chand

Photon for Dummies: How Does this New Execution Engine Actually Work?

Did you finish the Photon whitepaper and think, wait, what? I know I did; it’s my job to understand it, explain it, and then use it. If your role involves using Apache Spark™ on Databricks, then you need to know about Photon and where to use it. Join me, chief dummy, nay "supreme" dummy, as I break down this whitepaper into easy-to-understand explanations that don’t require a computer science degree. Together we will unravel mysteries such as:

  • Why is the Java Virtual Machine the current bottleneck for Spark enhancements?
  • What does vectorized even mean? And how was it done before?
  • Why is the relationship status between Spark and Photon "complicated?"

In this session, we’ll start with the basics of Apache Spark, the details we pretend to know, and where those performance cracks are starting to show through. Only then will we start to look at Photon, how it’s different, where the clever design choices are and how you can make the most of this in your own workloads. I’ve spent over 50 hours going over the paper in excruciating detail; every reference, and in some instances, the references of the references so that you don’t have to.
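
"Vectorized" is less mysterious than it sounds: it means processing a column of values per call instead of one row per call, which amortizes dispatch overhead. The toy comparison below illustrates the idea in plain Python; Photon does the columnar version in native code over Arrow-like batches, so this is an analogy for the concept, not how Photon is implemented.

```python
# A tiny "table" of 1000 rows with two integer columns.
rows = [{"price": p, "qty": q} for p, q in zip(range(1000), range(1000))]

def row_at_a_time(rows):
    """Row-oriented: one record access and dispatch per row."""
    total = 0
    for r in rows:
        total += r["price"] * r["qty"]
    return total

def vectorized(prices, qtys):
    """Column-oriented: one pass over contiguous columns."""
    return sum(p * q for p, q in zip(prices, qtys))

prices = [r["price"] for r in rows]
qtys = [r["qty"] for r in rows]
assert row_at_a_time(rows) == vectorized(prices, qtys)
print(vectorized(prices, qtys))
```

In a real engine the columnar layout additionally enables SIMD instructions and better cache behavior, which is where most of Photon's speedup comes from.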

Talk by: Holly Smith

Top Mistakes to Avoid in Streaming Applications

Are you a data engineer seeking to enhance the performance of your streaming applications? Join our session where we will share valuable insights and best practices gained from handling diverse customer streaming use cases using Apache Spark™ Structured Streaming.

In this session, we will delve into the common pitfalls that can hinder your streaming workflows. Learn practical tips and techniques to overcome these challenges during different stages of application development. By avoiding these errors, you can unlock faster performance, improved data reliability, and smoother data processing.
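
One classic pitfall of the kind this session covers: deduplicating a stream without a watermark makes state grow forever. The pure-Python sketch below shows bounded-state dedup, where seen keys are evicted once the watermark (max event time minus an allowed delay) passes them. Structured Streaming's `dropDuplicates` combined with `withWatermark` implements the same idea at scale; the event tuples here are invented for illustration.

```python
def dedup_stream(events, delay):
    """Deduplicate (event_time, key) pairs with bounded state."""
    seen = {}                      # key -> event time, bounded by watermark
    watermark = float("-inf")
    out = []
    for t, key in events:
        watermark = max(watermark, t - delay)
        # Evict state older than the watermark so it cannot grow forever.
        seen = {k: ts for k, ts in seen.items() if ts >= watermark}
        if t >= watermark and key not in seen:
            seen[key] = t
            out.append((t, key))
    return out

events = [(1, "a"), (2, "a"), (10, "b"), (11, "a")]
print(dedup_stream(events, delay=5))
```

Note the trade-off this exposes: key "a" is emitted again at time 11 because its earlier state was already evicted. Bounded state means dedup guarantees only hold within the watermark delay, which is exactly why choosing the delay is an application decision, not a tuning afterthought.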

Don't miss out on this opportunity to level up your streaming skills and excel in your data engineering journey. Join us to gain valuable knowledge and practical techniques that will empower you to optimize your streaming applications and drive exceptional results.

Talk by: Vikas Reddy Aravabhumi

The English SDK for Apache Spark™

In the fast-paced world of data science and AI, we will explore how large language models (LLMs) can elevate the development process of Apache Spark applications.

We'll demonstrate how LLMs can simplify SQL query creation, data ingestion, and DataFrame transformations, leading to faster development and clearer code that's easier to review and understand. We'll also show how LLMs can assist in creating visualizations and clarifying data insights, making complex data easy to understand.

Furthermore, we'll discuss how LLMs can be used to create user-defined data sources and functions, offering a higher level of adaptability in Apache Spark applications.

Our session, filled with practical examples, highlights the innovative role of LLMs in the realm of Apache Spark development. We invite you to join us in this exploration of how these advanced language models can drive innovation and boost efficiency in the sphere of data science and AI.

Talk by: Gengliang Wang and Allison Wang

Learn How to Reliably Monitor Your Data and Model Quality in the Lakehouse

Developing and maintaining production data engineering and machine learning pipelines is challenging for many data teams. Even more challenging is monitoring the quality of your data and models once they go into production. Building on untrustworthy data can cause many complications for data teams. Without a monitoring service, it is difficult to proactively discover when your ML models degrade over time, and the root causes behind it. Furthermore, without lineage tracking, it is even more painful to debug errors in your models and data. Databricks Lakehouse Monitoring offers a unified service to monitor the quality of all your data and ML assets.

In this session, you’ll learn how to:

  • Use one unified tool to monitor the quality of any data product: data or AI 
  • Quickly diagnose errors in your data products with root cause analysis
  • Set up a monitor with low friction, requiring only a button click or a single API call to start and automatically generate out-of-the-box metrics
  • Enable self-serve experiences for data analysts by providing reliability status for every data asset
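
To make "out-of-the-box metrics" concrete, here is one drift statistic that monitors of this kind commonly compute: the population stability index (PSI) between a baseline and a current distribution over the same bins. This is shown as a representative example, not as Lakehouse Monitoring's own metric definition; a common rule of thumb flags drift above roughly 0.2.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population stability index between two binned distributions."""
    e_tot, a_tot = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        # Clamp to eps so empty bins don't produce log(0).
        e_pct = max(e / e_tot, eps)
        a_pct = max(a / a_tot, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [50, 30, 20]
print(round(psi(baseline, baseline), 6))  # identical distributions -> 0.0
print(psi(baseline, [20, 30, 50]) > 0.2)  # shifted distribution -> True
```

A monitor runs a computation like this on a schedule for each tracked column or prediction distribution, then alerts when the score crosses a threshold, which is what turns passive dashboards into proactive quality checks.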

Talk by: Kasey Uhlenhuth and Alkis Polyzotis

Advancements in Open Source LLM Tooling, Including MLflow

MLflow is one of the most used open source machine learning frameworks with over 13 million monthly downloads. With the recent advancements in generative AI, MLflow has been rapidly integrating support for a lot of the popular AI tools being used such as Hugging Face, LangChain, and OpenAI. This means that it’s becoming easier than ever to build AI pipelines with your data as the foundation, yet expanding your capabilities with the incredible advancements of the AI community.

Come to this session to learn how MLflow can help you:

  • Easily grab open source models from Hugging Face and use Transformers pipelines in MLflow
  • Integrate LangChain for more advanced services and to add context into your model pipelines
  • Bring in OpenAI APIs as part of your pipelines
  • Quickly track and deploy models on the lakehouse using MLflow

Talk by: Corey Zumar and Ben Wilson

What’s New in Databricks Workflows -- With Live Demos

Databricks Workflows provides unified orchestration for the Lakehouse. Since it was first announced last year, thousands of organizations have been leveraging Workflows for orchestrating lakehouse workloads such as ETL, BI dashboard refresh and ML model training.

In this session, the Workflows product team will cover and demo the latest features and capabilities of Databricks Workflows in the areas of workflow authoring, observability and more. This session will also include an outlook for future innovations you can expect to see in the coming months.

Talk by: Muhammad Bilal Aslam

Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks

In this session, we will introduce Databricks Asset Bundles, demonstrate how they work for a variety of data products, and show how to fit them into an overall CI/CD strategy for the well-architected Lakehouse.

Data teams produce a variety of assets: datasets, reports and dashboards, ML models, and business applications. These assets depend upon code (notebooks, repos, queries, pipelines), infrastructure (clusters, SQL warehouses, serverless endpoints), and supporting services and resources like Unity Catalog, Databricks Workflows, and DBSQL dashboards. Today, each organization must figure out a deployment strategy for the variety of data products it builds on Databricks, as there is no consistent way to describe the infrastructure and services associated with project code.

Databricks Asset Bundles is a new capability on Databricks that standardizes and unifies the deployment strategy for all data products developed on the platform. It allows developers to describe the infrastructure and resources of their project through a YAML configuration file, regardless of whether they are producing a report, dashboard, online ML model, or Delta Live Tables pipeline. Behind the scenes, these configuration files use Terraform to manage resources in a Databricks workspace, but knowledge of Terraform is not required to use Databricks Asset Bundles.
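
A bundle's YAML configuration might look roughly like the sketch below. This is an illustrative fragment, not an authoritative schema: the project name, host, and paths are invented, and the exact layout should be checked against the Asset Bundles documentation.

```yaml
# Hypothetical databricks.yml sketch: one job deployed to a dev target.
bundle:
  name: my_data_product

targets:
  dev:
    workspace:
      host: https://example-workspace.cloud.databricks.com

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/etl_notebook
```

Because the whole deployment is described declaratively in one file, the same bundle can be validated in CI and deployed to dev, staging, and prod targets without hand-written glue scripts.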

Talk by: Rafi Kurlansik and Pieter Noordhuis

A Technical Deep Dive into Unity Catalog's Practitioner Playbook

Get ready to take a deep dive into Unity Catalog and explore how it can simplify data, analytics and AI governance across multiple clouds. In this session, the expert Databricks team will guide you through a hands-on demo showcasing the latest features and best practices for data governance. You'll learn how to master Unity Catalog and gain a practical understanding of how it can streamline your analytics and AI initiatives. Whether you're migrating from Hive Metastore or just looking to expand your knowledge of Unity Catalog, this session is for you, offering a practical path to seamless governance for data, analytics and AI.

Talk by: Zeashan Pappa and Ifigeneia Derekli

Introduction to Data Engineering on the Lakehouse

Data engineering is a requirement for any data, analytics or AI workload. With the increased complexity of data pipelines, the need to handle real-time streaming data and the challenges of orchestrating reliable pipelines, data engineers require the best tools to help them achieve their goals. The Databricks Lakehouse Platform offers a unified platform to ingest, transform and orchestrate data and simplifies the task of building reliable ETL pipelines.

This session will provide an introductory overview of the end-to-end data engineering capabilities of the platform, including Delta Live Tables and Databricks Workflows. We’ll see how these capabilities come together to provide a complete data engineering solution and how they are used in the real world by organizations leveraging the lakehouse to turn raw data into insights.

Talk by: Jibreal Hamenoo and Ori Zohar

Introduction to Data Streaming on the Lakehouse

Streaming is the future of all data pipelines and applications. It enables businesses to make data-driven decisions sooner and react faster, develop data-driven applications considered previously impossible, and deliver new and differentiated experiences to customers. However, many organizations have not realized the promise of streaming to its full potential because it requires them to completely redevelop their data pipelines and applications on new, complex, proprietary, and disjointed technology stacks.

The Databricks Lakehouse Platform is a simple, unified, and open platform that supports all streaming workloads, ranging from ingestion and ETL to event processing, event-driven applications, and ML inference. In this session, we will discuss the streaming capabilities of the Databricks Lakehouse Platform and demonstrate how easy it is to build end-to-end, scalable streaming pipelines and applications that fulfill the promise of streaming for your business.

Talk by: Zoe Durand and Yue Zhang
