talk-data.com talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 YouTube Visit website ↗

Activities tracked

561

Filtering by: Databricks ×

Sessions & talks

Showing 426–450 of 561 · Newest first

Search within this event →
Deliver Faster Decision Intelligence From Your Lakehouse

Deliver Faster Decision Intelligence From Your Lakehouse

2022-07-19 Watch
video

Accelerate the path from data to decisions with the the Tellius AI-driven Decision Intelligence platform powered by Databricks Delta Lake. Empower business users and data teams to analyze data residing in the Delta Lake to understand what is happening in their business, uncover the reasons why metrics change, and get recommendations on how to impact outcomes. Learn how organizations derive value from Delta Lakehouse with a modern analytics experience that unifies guided insights, natural language search, and automated machine learning to speed up data-driven decision making at cloud scale.

In this session, we will showcase how customers: - Discover changes in KPIs and investigate the reasons why metrics change with AI-powered automated analysis - Empower business users and data analysts to iteratively explore data to identify trend drivers, uncover new customer segments, and surface hidden patterns in data - Simplify and speed-up analysis from massive datasets on Databrick Delta lake

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Delta Lake, the Foundation of Your Lakehouse

Delta Lake, the Foundation of Your Lakehouse

2022-07-19 Watch
video

Delta Lake is the open source storage layer that makes the Databricks Lakehouse Platform possible by adding reliability, performance, and scalability to your data, wherever it is located. Join this session for an inside look at what is under the hood of Databricks - see how Delta Lake, by adding ACID transactions and versioning to Parquet files together with the Photon engine, provides customers with huge performance gains and the ability to address new challenges. This session will include a demo and overview of customer use cases unlocked by Delta Lake, and the benefits of running Delta Lake workloads on Databricks.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Delta Sharing - A New Paradigm for Secure Data Sharing and Data Collaboration on Lakehouse

Delta Sharing - A New Paradigm for Secure Data Sharing and Data Collaboration on Lakehouse

2022-07-19 Watch
video

Data sharing and data collaboration have become important in today's hyper connected digital economy. But to date, a lack of standards-based data sharing protocol has resulted in data sharing solutions tied to a single vendor or commercial product introducing vendor lock-in risks. What the industry deserves is an open approach to data sharing. Additionally, with stringent privacy regulations, data collaboration on sensitive data has become a challenge for organizations, resulting in fragmented, siloed, and incomplete insights. Join this session to learn how Databricks Lakehouse Platform simplifies secure data sharing and enables data collaboration across organizations in a privacy centric way.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Designing Better MLOps Systems

Designing Better MLOps Systems

2022-07-19 Watch
video

Real-world data problems are becoming increasingly daunting to solve, as data volume grows and computing tools proliferate. Since 2018, Gartner has predicted that 85% of ML projects will fail and this trend will likely continue through 2022 as well. Nevertheless, in most cases, ML practitioners have the opportunity to avoid their projects from failing in the early phases.

In this talk, the speaker will borrow from her consultancy and hands-on implementation experience with cross-functional clients to share her takeaways in designing better ML systems. The talk will walk through common pitfalls to watch out for, relevant best practices in software engineering for ML, and technical anchors that make a robust system. This talk aims to empower the audience – beginner and experienced practitioners alike – with confidence in their ML project designs and help provide the big-picture design thinking framework for successful projects.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Destination Lakehouse: All Your Data, Analytics and AI on One Platform

Destination Lakehouse: All Your Data, Analytics and AI on One Platform

2022-07-19 Watch
video

The data lakehouse is the future for modern data teams seeking to innovate with a data architecture that simplifies data workloads, eases collaboration, and maintains the flexibility and openness to stay agile as a company scales. The Databricks Lakehouse Platform realizes this idea by unifying analytics, data engineering, machine learning, and streaming workloads across clouds on one simple, open data platform. In this session, learn how the Databricks Lakehouse Platform can meet your needs for every data and analytics workload, with examples of real-customer applications, reference architectures, and demos to showcase how you can create modern data solutions of your own.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Distributed Machine Learning at Lyft

Distributed Machine Learning at Lyft

2022-07-19 Watch
video

Data collection, preprocessing, feature engineering are the fundamental steps in any Machine Learning Pipeline. After feature engineering, being able to parallelize training on multiple low cost machines helps to reduce cost and time both. And, then being able to train models in a distributed manner speeds up Hyperparameter Tuning. How can we unify these stages of ML Pipeline in one unified distributed training platform together? And that too on Kubernetes?

Our ML platform is completely based on Kubernetes because of its scalability and rapid bootstrapping time of resources. In this talk we will demonstrate how Lyft uses Spark on Kubernetes, Fugue (our home grown unifying compute abstraction layer) to design a holistic end to end ML Pipeline system for distributed feature engineering, training & prediction experience for our customers on our ML Platform on top of Spark on K8s. We will also do a deep dive to show how we are abstracting and hiding infrastructure complexities so that our Data Scientists and Research Scientist can focus only on the business logic for their models through simple pythonic APIs and SQL. We let the users focus on ''what to do'' and the platform takes care of ''how to do''. We will share our challenges, learning and the fun we had while implementing. Using Spark on K8s have helped us achieve large scale data processing with 90% less cost and at times bringing down processing time from 2 hours to less than 20 mins.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Dive Deeper into Data Engineering on Databricks

Dive Deeper into Data Engineering on Databricks

2022-07-19 Watch
video

To derive value from data, engineers need to collect, transform, and orchestrate data from various data types and source systems. However, today’s data engineering solutions support only a limited number of delivery styles, involve a significant amount of hand-coding, and have become resource-intensive. Modern data engineering requires more advanced data lifecycle for data ingestion, transformation, and processing. In this session, learn how the Databricks Lakehouse Platform provides an end-to-end data engineering solution — ingestion, processing and scheduling — that automates the complexity of building and maintaining pipelines and running ETL workloads directly on a data lake, so your team can focus on quality and reliability to drive valuable insights.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Diving into Delta Lake 2.0

Diving into Delta Lake 2.0

2022-07-19 Watch
video

The Delta ecosystem rapidly expanded with the release of Delta Lake 1.2 which included integrations with Apache Spark™, Apache Flink, Presto, Trino, features such as OPTIMIZE, data skipping using column statistics, restore APIs, S3 multi-cluster writes, and more.

Join this session to learn about how the wider Delta community collaborated together to bring these features and integrations together; as well as the current roadmap. This will be an interactive session so come prepared with your questions—we should have answers!

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Driving Real-Time Data Capture and Transformation in Delta Lake with Change Data Capture

Driving Real-Time Data Capture and Transformation in Delta Lake with Change Data Capture

2022-07-19 Watch
video

Change data capture (CDC) is an increasingly common technology used in real-time machine learning and AI data pipelines. When paired with Databricks Delta Lake, it provides organizations with a number of benefits including lower data processing costs and highly responsive analytics applications. This session will provide a detailed overview of Matillion’s new CDC capabilities and how the integration of these capabilities with Delta Lake on Databricks can help you manage dataset changes, making it easy to automate the capture, transformation, and enrichment of data in near real time. Attend this session and see the advantages of a Matillion’s CDC capabilities to simplify real time data capture and analytics in your Delta Lake.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Efficient and Multi-Tenant Scheduling of Big Data and AI Workloads

Efficient and Multi-Tenant Scheduling of Big Data and AI Workloads

2022-07-19 Watch
video

Many ML and big data teams in the open source community are looking to run their workloads in the cloud and they invariably face a common set of challenges such as multi-tenant cluster management, resource fairness and sharing, gang scheduling and cost-effective infrastructure operations. Kubernetes is the de-facto standard platform for running containerized applications in the cloud. However, the default resource scheduler in Kubernetes leaves more to be desired for AI scenarios when running ML/DL training workloads or large-scale data processing jobs for feature engineering.

In this talk, we will share how the community leverage and build upon Apache YuniKorn to address the unique resource scheduling needs for ML and big data teams.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Eliminating AI Risk—One Model Failure at a Time

Eliminating AI Risk—One Model Failure at a Time

2022-07-19 Watch
video

As organizations adopt AI they inherent AI risk. AI risk often manifests itself in AI models that produce erroneous predictions that go undetected and result in serious consequences for the organization and individuals affected by the decisions.

In this talk we will discuss root causes for AI models going haywire, and present a rigorous framework for eliminating risk from AI. We will show how this methodology can be used as building blocks for building an AI firewall that can prevent and model AI model failures.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Emerging Data Architectures & Approaches for Real-Time AI using Redis

Emerging Data Architectures & Approaches for Real-Time AI using Redis

2022-07-19 Watch
video

As more applications harness the power of real-time data, it’s important to architect and implement a data stack to meet the broad requirements of operational ML and be able to seamlessly integrate neural embeddings into applications.

Real-time ML requires more than just deploying ML models to production using MLOps tooling; it requires a fast and scalable operational database that easily integrates into the MLOps workflow. Milliseconds matter and can make the difference in delivering fast online predictions whether it’s personalized recommendations, detecting fraud, or figuring out the most optimal food delivery route.

Attend this session to explore how a modern data stack can be used for real-time operational ML and building AI-infused applications. The session will over the following topics:

Emerging architectural components for operational ML such as the online feature store for real-time serving.

Operational excellence in managing globally distributed ML data and feature pipelines

Foundational data types of Redis including the representation of data using vector embeddings.

Using Redis as a vector database to build vector similarity search applications.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating a DWH Develop

Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating a DWH Develop

2022-07-19 Watch
video

Traditional data warehouses typically struggle when it comes to handling large volumes of data and traffic, particularly when it comes to unstructured data. In contrast, data lakes overcome such issues and have become the central hub for storing data. We outline how we can enable BI Kimball data modelling in a Lakehouse environment.

We present how we built a Spark-based framework to modernize DWH development with data lake as central storage, assuring high data quality and scalability. The framework was implemented at over 15 enterprise data warehouses across Europe.

We present how one can tackle in Spark & with Delta Lake the data warehouse principles like surrogate, foreign and business keys, SCD type 1 and 2 etc. Additionally, we share our experiences on how such a unified data modelling framework can bridge BI with modern day use cases, such as machine learning and real time analytics. The session outlines the original challenges, the steps taken and the technical hurdles we faced.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Enabling Learning on Confidential Data

Enabling Learning on Confidential Data

2022-07-19 Watch
video
Rishabh Poddar (Opaque Systems)

Multiple organizations often wish to aggregate their confidential data and learn from it, but they cannot do so because they cannot share their data with each other. For example, banks wish to train models jointly over their aggregate transaction data to detect money launderers more efficiently because criminals hide their traces across different banks.

To address such problems, we developed MC^2 at UC Berkeley, an open-source framework for multi-party confidential computation, on top of Apache Spark. MC^2 enables organizations to share encrypted data and perform analytics and machine learning on the encrypted data without any organization or the cloud seeing the data. Our company Opaque brings the MC^2 technology in an easy-to-use form to organizations in the financial, medical, ad tech, and other sectors.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Ensuring Correct Distributed Writes to Delta Lake in Rust with Formal Verification

Ensuring Correct Distributed Writes to Delta Lake in Rust with Formal Verification

2022-07-19 Watch
video

Rust guarantees zero memory access bug once a program compiles. However, one can still introduce logical bugs in the implementation.

In this talk, I will first give a high level overview on common formal verification methods used in distributed system designs and implementations. Then I will talk about our experiences with using TLA+ and Stateright to formally model delta-rs' multi-writer S3 backend implementation. The end result of combining both Rust and formal verification is we end up with an efficient native Delta Lake implementation that is both memory safe and logical bug free!

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

FugueSQL—The Enhanced SQL Interface for Pandas and Spark DataFrames

FugueSQL—The Enhanced SQL Interface for Pandas and Spark DataFrames

2022-07-19 Watch
video

SQL users working with Pandas and Spark quickly realize SQL is a second-class interface, invoked between predominantly Python code.

We will introduce FugueSQL, an enhanced SQL interface that allows SQL lovers to express end-to-end workflows predominantly in SQL. With a Jupyter notebook extension, SQL commands can be used in Databricks notebooks for interactive handling of in-memory datasets. This allows heavy SQL users to fully leverage Spark in their preferred grammar.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Gazelle-Jni: A Middle Layer to Offload Spark SQL to Native Engines for Execution Acceleration

Gazelle-Jni: A Middle Layer to Offload Spark SQL to Native Engines for Execution Acceleration

2022-07-19 Watch
video

This session will introduce Gazelle-Jni, which was proposed to better integrate the various native SQL engines as Spark SQL’s backend. It implemented a shared JVM and JNI middle layer. With the help of Gazlle-Jni, Spark SQL execution can be offloaded to native engines by passing Substrait transformed physical plan.

Examples will be presented on how to integrate native engines with Spark SQL.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Goodbye Hell of Unions in Spark SQL

Goodbye Hell of Unions in Spark SQL

2022-07-19 Watch
video

It is known that applications, which heavily use Spark SQL union() operation, cause performance problems. The union() operation combines multiple rows into one table. When union() operation merges many Dataframes, the size of the generated Spark SQL planning tree will be huge while the Spark SQL code is small. The huge planning tree may lead to performance problems. This talk reviews performance problems from the Spark SQL planning perspective and explains how to avoid the performance issues with common practices.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Moving to the Lakehouse: Fast & Efficient Ingestion with Auto Loader

Moving to the Lakehouse: Fast & Efficient Ingestion with Auto Loader

2022-07-19 Watch
video

Auto loader, the most popular tool for incremental data ingestion from cloud storage to Databricks’ Lakehouse, is used in our biggest customers’ ingestion workflows. Auto Loader is our all-in-one solution for exactly-once processing offering efficient file discovery, schema inference and evolution, and fault tolerance.

In this talk, we want to delve into key features in Auto Loader, including: • Avro schema inference • Rescued column • Semi-structured data support • Incremental listing • Asynchronous backfilling • Native listing • File-level tracking and observability

Auto Loader is also used in other Databricks features such as Delta Live Tables. We will discuss the architecture, provide a demo, and feature an Auto Loader customer speaking about their experience migrating to Auto Loader.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

X-FIPE: eXtended Feature Impact for Prediction Explanation

X-FIPE: eXtended Feature Impact for Prediction Explanation

2022-07-19 Watch
video

Many enterprises have built their own machine learning platforms in the cloud using Databricks, e.g. Humana FlorenceAI. In order to effectively drive the adoption of predictive models in daily business operations, data scientists and business teams need to work closely to make sure they serve the consumer needs in compliance with regulatory rules. Model interpretability is key. In this talk, we would like to share an explainable AI algorithm developed at Humana, X-FIPE, eXtended Feature Impact for Prediction Explanation.

X-FIPE is a top-driver algorithm to calculate feature importance for any machine learning predictive models, whether it is Python or PySpark, at a local level. Instead of showing the feature importance on a population level, it can find the top drivers for each observation or member. These top drivers could differ widely from one member to another member in the population. it not only helps explain the predictive model, but also offer users actionable insights.

Compared with widely used algorithms, e.g. LIME, SHAP, and FIPE, X-FIPE improves the time complexity from linear O(n) to logarithmic O(log(n)), where n is the number of used model features. Also, we discovered the connection between X-FIPE value and Shapley value -- X-FIPE a first order approximation of Shapley value. Our observation shows that the most contribution of Shapley value of a feature comes from the marginal contribution when it is first added and when it is last removed from the full features. This is why the X-FIPE keeps enough accuracy and also reduces the computation.

Hopefully this talk will provide you a path forward to include explainable AI into your machine learning workflows, you are encouraged to try out and contribute to our open source Python package xfipe soon to come.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

You Have BI. Now What? Activate Your Data!

You Have BI. Now What? Activate Your Data!

2022-07-19 Watch
video

Analytics has long been the end goal for data teams— standing up dashboards and exporting reports for business teams. But what if data teams could extend their work directly into the tools business teams use?

The next evolution for data teams is Activation. Smart organizations use reverse ETL to extend the value of Databricks by syncing data directly into business platforms, making their lakehouse a Customer Data Platform (CDP). By making Databricks the single source of truth for your data, you can create business models in your lakehouse and serve them directly to your marketing tools, ad networks, CRMs, and more. This saves time and money, unlocks new use cases for your data and turns data team efforts into revenue generating activities.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Your fastest path to Lakehouse and beyond

Your fastest path to Lakehouse and beyond

2022-07-19 Watch
video

Azure Databricks is an easy, open, and collaborative service for data, analytics & AI use cases, enabled by Lakehouse architecture. Join this session to discover how you can get the most out of your Azure investments by combining the best of Azure Synapse Analytics, Azure Databricks and Power BI for building a complete analytics & AI solution based on Lakehouse architecture.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Auditing Your Data and Answering the Lifelong Question—Is It the End of the Day Yet?

Auditing Your Data and Answering the Lifelong Question—Is It the End of the Day Yet?

2022-07-19 Watch
video

Huge volumes of data flow through a robust Kafka architecture, into several ETLs, receiving, transforming and storing the data. We clearly understood our ETLs’ workflow and our data architecture, from source to destination.

But how much did we know about the way our data makes though our systems? And what about the life long question, is it the end of the day yet?

In this talk I’m going to present to you the design process behind our Data Auditing system, Life Line. From tracking and producing, to analyzing and storing auditing information, using technologies such as Kafka, Avro, Spark, Lambda functions and complex SQL queries. We’re going to cover: * AVRO Audit header * Auditing heart beat - designing your metadata * Designing and optimizing your auditing table - what does this data look like anyway? * Creating an alert based monitoring system * Answering the most important question of all - is it the end of the day yet?

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Azure Databricks Excitement 2022

Azure Databricks Excitement 2022

2022-07-19 Watch
video

Data + AI Summit 2022 was a great opportunity to check-in on the partnership between Microsoft Azure and Databricks!

Can ML Forecast Fashion Trends? What Should We Predict?

Can ML Forecast Fashion Trends? What Should We Predict?

2022-07-19 Watch
video

Can How to use existing data to feature ‘style’ in order to make relevant recommendation to customers 1. Is fashion style predictable? What can be forecast by machine in fashion world, what cannot? 2. What type of data we can get for machine to learn in fashion world? 3. Personal perspective of three dimensions to feature style in AI

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/