talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 YouTube

Activities tracked

582

Sessions & talks

Dive Deeper into Data Engineering on Databricks

2022-07-19

To derive value from data, engineers need to collect, transform, and orchestrate data from various data types and source systems. However, today’s data engineering solutions support only a limited number of delivery styles, involve a significant amount of hand-coding, and have become resource-intensive. Modern data engineering requires a more advanced data lifecycle for data ingestion, transformation, and processing. In this session, learn how the Databricks Lakehouse Platform provides an end-to-end data engineering solution (ingestion, processing, and scheduling) that automates the complexity of building and maintaining pipelines and running ETL workloads directly on a data lake, so your team can focus on quality and reliability to drive valuable insights.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Diving into Delta Lake 2.0

2022-07-19

The Delta ecosystem expanded rapidly with the release of Delta Lake 1.2, which included integrations with Apache Spark™, Apache Flink, Presto, and Trino, along with features such as OPTIMIZE, data skipping using column statistics, restore APIs, S3 multi-cluster writes, and more.

Join this session to learn how the wider Delta community collaborated to deliver these features and integrations, and to hear about the current roadmap. This will be an interactive session, so come prepared with your questions; we should have answers!

Driving Real-Time Data Capture and Transformation in Delta Lake with Change Data Capture

2022-07-19

Change data capture (CDC) is an increasingly common technology used in real-time machine learning and AI data pipelines. When paired with Databricks Delta Lake, it provides organizations with a number of benefits, including lower data processing costs and highly responsive analytics applications. This session will provide a detailed overview of Matillion’s new CDC capabilities and show how integrating them with Delta Lake on Databricks can help you manage dataset changes, making it easy to automate the capture, transformation, and enrichment of data in near real time. Attend this session to see how Matillion’s CDC capabilities simplify real-time data capture and analytics in your Delta Lake.
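
As a rough illustration of what applying captured changes to a target table involves, the sketch below replays a list of change events against an in-memory table keyed by primary key. The event shape (`op`, `key`, `row`) and the function name `apply_changes` are hypothetical; this is not Matillion's or Delta Lake's API, only a minimal model of the upsert-and-delete logic that a CDC pipeline has to perform.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def apply_changes(table: Dict[int, Row], events: List[Dict[str, Any]]) -> Dict[int, Row]:
    """Replay CDC events in order and return an updated copy of the table."""
    result = {key: dict(row) for key, row in table.items()}
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: merge the new column values over any existing row.
            result[key] = {**result.get(key, {}), **event["row"]}
        elif op == "delete":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


# Hypothetical sample data, for illustration only.
target = {1: {"name": "Ada", "city": "London"}}
changes = [
    {"op": "update", "key": 1, "row": {"city": "Paris"}},
    {"op": "insert", "key": 2, "row": {"name": "Linus", "city": "Helsinki"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Grace", "city": "NYC"}},
]
updated = apply_changes(target, changes)
```

Replaying events in order matters here: the insert and delete for key 2 cancel out only because the delete is applied after the insert.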

Efficient and Multi-Tenant Scheduling of Big Data and AI Workloads

2022-07-19

Many ML and big data teams in the open source community are looking to run their workloads in the cloud, and they invariably face a common set of challenges such as multi-tenant cluster management, resource fairness and sharing, gang scheduling, and cost-effective infrastructure operations. Kubernetes is the de facto standard platform for running containerized applications in the cloud. However, the default resource scheduler in Kubernetes leaves much to be desired for AI scenarios when running ML/DL training workloads or large-scale data processing jobs for feature engineering.

In this talk, we will share how the community leverages and builds upon Apache YuniKorn to address the unique resource scheduling needs of ML and big data teams.
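
Gang scheduling, one of the challenges named above, means placing all of a job's tasks together or none of them. The sketch below shows that all-or-nothing rule in its simplest form. It is not YuniKorn's algorithm; the function name `gang_schedule`, the node names, and the single CPU dimension are assumptions made for illustration.

```python
from typing import Dict, Optional


def gang_schedule(free_cpus: Dict[str, int], tasks: int, cpus_per_task: int) -> Optional[Dict[str, int]]:
    """Return a node -> task-count placement for the whole gang, or None."""
    if tasks == 0:
        return {}
    placement: Dict[str, int] = {}
    remaining = tasks
    for node, free in sorted(free_cpus.items()):
        # Place as many of the remaining tasks as this node can hold.
        fit = min(remaining, free // cpus_per_task)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    # Not every task fits, so the partial placement is discarded.
    return None


# Hypothetical cluster with two nodes.
nodes = {"node-a": 4, "node-b": 2}
fits = gang_schedule(nodes, tasks=3, cpus_per_task=2)
too_big = gang_schedule(nodes, tasks=4, cpus_per_task=2)
```

Returning `None` rather than a partial placement is the point of the exercise: a distributed training job that receives only some of its workers would hold resources without making progress.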

Eliminating AI Risk—One Model Failure at a Time

2022-07-19

As organizations adopt AI, they inherit AI risk. AI risk often manifests itself in AI models that produce erroneous predictions that go undetected and result in serious consequences for the organization and the individuals affected by the decisions.

In this talk, we will discuss root causes of AI models going haywire and present a rigorous framework for eliminating risk from AI. We will show how this methodology can provide the building blocks for an AI firewall that can prevent and model AI model failures.

Emerging Data Architectures & Approaches for Real-Time AI using Redis

2022-07-19

As more applications harness the power of real-time data, it’s important to architect and implement a data stack to meet the broad requirements of operational ML and be able to seamlessly integrate neural embeddings into applications.

Real-time ML requires more than just deploying ML models to production using MLOps tooling; it requires a fast and scalable operational database that easily integrates into the MLOps workflow. Milliseconds matter and can make the difference in delivering fast online predictions, whether that means personalizing recommendations, detecting fraud, or finding the optimal food delivery route.

Attend this session to explore how a modern data stack can be used for real-time operational ML and building AI-infused applications. The session will cover the following topics:

Emerging architectural components for operational ML such as the online feature store for real-time serving.

Operational excellence in managing globally distributed ML data and feature pipelines.

Foundational data types of Redis including the representation of data using vector embeddings.

Using Redis as a vector database to build vector similarity search applications.
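
To make the idea of vector similarity search concrete, here is a minimal in-memory sketch that ranks stored embeddings by cosine similarity to a query vector. It deliberately does not use any Redis API; the dictionary `index`, the key names, and the two-dimensional vectors are stand-ins chosen for illustration, and a real vector database would use an index structure rather than a full scan.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vector), key) for key, vector in index.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Hypothetical embeddings keyed by item id.
index = {
    "item:1": [1.0, 0.0],
    "item:2": [0.7, 0.7],
    "item:3": [0.0, 1.0],
}
nearest = top_k([0.9, 0.1], index, k=2)
```

The brute-force scan is linear in the number of stored vectors, which is why production systems trade exactness for approximate nearest-neighbor indexes.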

Enabling BI in a Lakehouse Environment: How Spark and Delta Can Help With Automating a DWH Develop

2022-07-19

Traditional data warehouses typically struggle when it comes to handling large volumes of data and traffic, particularly when it comes to unstructured data. In contrast, data lakes overcome such issues and have become the central hub for storing data. We outline how we can enable BI Kimball data modelling in a Lakehouse environment.

We present how we built a Spark-based framework to modernize DWH development with a data lake as central storage, assuring high data quality and scalability. The framework was implemented at over 15 enterprise data warehouses across Europe.

We show how data warehouse principles such as surrogate, foreign, and business keys and SCD types 1 and 2 can be implemented in Spark with Delta Lake. Additionally, we share our experiences of how such a unified data modelling framework can bridge BI with modern-day use cases such as machine learning and real-time analytics. The session outlines the original challenges, the steps taken, and the technical hurdles we faced.
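
For readers unfamiliar with SCD type 2, the sketch below shows the core bookkeeping on a plain Python list: a changed business key closes the current row and opens a new one with a fresh surrogate key. This is not the speakers' framework; the column names (`surrogate_key`, `business_key`, `valid_from`, `valid_to`, `is_current`) and the function `apply_scd2` are assumptions for illustration. On Delta Lake the same logic would usually be expressed as a MERGE.

```python
from datetime import date
from typing import Any, Dict, List


def apply_scd2(dimension: List[Dict[str, Any]], incoming: Dict[str, Dict[str, Any]],
               load_date: date) -> List[Dict[str, Any]]:
    """Apply one batch (business key -> attributes) to the dimension in place."""
    current = {row["business_key"]: row for row in dimension if row["is_current"]}
    next_key = max((row["surrogate_key"] for row in dimension), default=0) + 1
    for business_key, attrs in incoming.items():
        existing = current.get(business_key)
        if existing is not None and existing["attrs"] == attrs:
            continue  # Nothing changed, so no new version is needed.
        if existing is not None:
            # Close the old version instead of overwriting it (type 2).
            existing["valid_to"] = load_date
            existing["is_current"] = False
        dimension.append({
            "surrogate_key": next_key,
            "business_key": business_key,
            "attrs": dict(attrs),
            "valid_from": load_date,
            "valid_to": None,
            "is_current": True,
        })
        next_key += 1
    return dimension


# Hypothetical customer dimension loaded three times.
dim: List[Dict[str, Any]] = []
apply_scd2(dim, {"C1": {"city": "Berlin"}}, date(2022, 1, 1))
apply_scd2(dim, {"C1": {"city": "Munich"}}, date(2022, 6, 1))
apply_scd2(dim, {"C1": {"city": "Munich"}}, date(2022, 7, 1))
```

An SCD type 1 variant would simply overwrite `attrs` on the existing row, keeping no history.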

Enabling Learning on Confidential Data

2022-07-19
Rishabh Poddar (Opaque Systems)

Multiple organizations often wish to aggregate their confidential data and learn from it, but they cannot do so because they cannot share their data with each other. For example, banks wish to train models jointly over their aggregate transaction data to detect money launderers more efficiently because criminals hide their traces across different banks.

To address such problems, we developed MC^2 at UC Berkeley, an open-source framework for multi-party confidential computation, on top of Apache Spark. MC^2 enables organizations to share encrypted data and perform analytics and machine learning on the encrypted data without any organization or the cloud seeing the data. Our company Opaque brings the MC^2 technology in an easy-to-use form to organizations in the financial, medical, ad tech, and other sectors.

Ensuring Correct Distributed Writes to Delta Lake in Rust with Formal Verification

2022-07-19

Rust guarantees zero memory access bugs once a program compiles. However, one can still introduce logical bugs in the implementation.

In this talk, I will first give a high-level overview of common formal verification methods used in distributed system design and implementation. Then I will talk about our experience using TLA+ and Stateright to formally model delta-rs' multi-writer S3 backend implementation. By combining Rust and formal verification, we end up with an efficient native Delta Lake implementation that is both memory safe and free of logical bugs!
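
The kind of question a model checker answers can be shown with a toy explicit-state search. The sketch below is written in Python rather than TLA+ or Stateright, and it models a deliberately simplified commit protocol that is an assumption for illustration, not the actual delta-rs design: each writer reads the current log length and then writes that version number. Exploring every interleaving shows that a blind put can lose a commit, while an atomic put-if-absent with retry cannot.

```python
def explore(atomic_put, writers=("A", "B")):
    """Visit every interleaving of the writers and return the set of final logs.

    A log is a tuple of writer names; index i holds the author of version i.
    Each writer runs two steps: read the log length, then write that version.
    """
    initial = ((), tuple(("read", 0) for _ in writers))
    stack, seen, finals = [initial], {initial}, set()
    while stack:
        log, procs = stack.pop()
        if all(step == "done" for step, _ in procs):
            finals.add(log)
            continue
        for i, (step, version) in enumerate(procs):
            if step == "done":
                continue
            if step == "read":
                next_log, next_proc = log, ("write", len(log))
            elif version == len(log):
                # The target version is still free, so the commit succeeds.
                next_log, next_proc = log + (writers[i],), ("done", 0)
            elif atomic_put:
                # Put-if-absent fails because the version exists; retry.
                next_log, next_proc = log, ("read", 0)
            else:
                # A blind put silently overwrites another writer's commit.
                next_log = log[:version] + (writers[i],) + log[version + 1:]
                next_proc = ("done", 0)
            state = (next_log, procs[:i] + (next_proc,) + procs[i + 1:])
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return finals


def loses_a_commit(finals, writers=("A", "B")):
    """True if some reachable final log is missing a writer's commit."""
    return any(set(log) != set(writers) for log in finals)


safe = explore(atomic_put=True)
unsafe = explore(atomic_put=False)
```

Checking `loses_a_commit(unsafe)` against `loses_a_commit(safe)` is the whole verification here. Real tools add temporal properties, fairness, and far larger state spaces, but the principle of exhaustively enumerating interleavings is the same.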

FugueSQL—The Enhanced SQL Interface for Pandas and Spark DataFrames

2022-07-19

SQL users working with Pandas and Spark quickly realize SQL is a second-class interface, invoked between predominantly Python code.

We will introduce FugueSQL, an enhanced SQL interface that allows SQL lovers to express end-to-end workflows predominantly in SQL. With a Jupyter notebook extension, SQL commands can be used in Databricks notebooks for interactive handling of in-memory datasets. This allows heavy SQL users to fully leverage Spark in their preferred grammar.

Gazelle-Jni: A Middle Layer to Offload Spark SQL to Native Engines for Execution Acceleration

2022-07-19

This session will introduce Gazelle-Jni, which was proposed to better integrate various native SQL engines as Spark SQL’s backend. It implements a shared JVM and JNI middle layer. With the help of Gazelle-Jni, Spark SQL execution can be offloaded to native engines by passing them a Substrait-transformed physical plan.

Examples will be presented on how to integrate native engines with Spark SQL.

Goodbye Hell of Unions in Spark SQL

2022-07-19

Applications that make heavy use of the Spark SQL union() operation are known to suffer performance problems. The union() operation combines the rows of multiple DataFrames into one table. When union() merges many DataFrames, the generated Spark SQL planning tree can become huge even though the Spark SQL code is small. The huge planning tree may lead to performance problems. This talk reviews these performance problems from the Spark SQL planning perspective and explains how to avoid them with common practices.
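
The mismatch between small code and a large plan can be illustrated with a toy model. The sketch below does not use Spark; it assumes only that a plan is a tree in which a reused input appears once per reference, and the names `Plan`, `union`, and `node_count` are hypothetical. Under that assumption, a short loop that unions a plan with itself doubles the tree on every iteration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """A node in a toy logical plan: a table scan (leaf) or a binary union."""
    name: str
    left: Optional["Plan"] = None
    right: Optional["Plan"] = None


def union(left: Plan, right: Plan) -> Plan:
    """Build a union node over two child plans."""
    return Plan("Union", left, right)


def node_count(plan: Optional[Plan]) -> int:
    """Count nodes as a tree, so a shared child is counted once per reference."""
    if plan is None:
        return 0
    return 1 + node_count(plan.left) + node_count(plan.right)


# Three lines of "application code" ...
plan = Plan("Scan")
for _ in range(10):
    plan = union(plan, plan)

# ... produce a plan tree with thousands of nodes.
total = node_count(plan)
```

With k self-unions the tree holds 2^(k+1) - 1 nodes, so the ten iterations above yield 2047. How much of this growth a real planner can collapse depends on the engine and its optimizer rules, which this model does not attempt to capture.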

Moving to the Lakehouse: Fast & Efficient Ingestion with Auto Loader

2022-07-19

Auto Loader, the most popular tool for incremental data ingestion from cloud storage to Databricks’ Lakehouse, is used in our biggest customers’ ingestion workflows. Auto Loader is our all-in-one solution for exactly-once processing, offering efficient file discovery, schema inference and evolution, and fault tolerance.

In this talk, we want to delve into key features in Auto Loader, including:
• Avro schema inference
• Rescued column
• Semi-structured data support
• Incremental listing
• Asynchronous backfilling
• Native listing
• File-level tracking and observability
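
The idea behind file-level tracking can be sketched in a few lines: keep a checkpoint of files already ingested and process only the ones not yet seen. This is not Auto Loader's implementation or API; the function `ingest_new_files`, the in-memory `checkpoint` set, and the file names are assumptions for illustration. A real system would persist the checkpoint durably and commit it atomically with the sink to get exactly-once behavior.

```python
from typing import Callable, Iterable, List, Set


def ingest_new_files(listing: Iterable[str], checkpoint: Set[str],
                     process: Callable[[str], None]) -> List[str]:
    """Process files from the listing that the checkpoint has not seen yet."""
    new_files = sorted(path for path in listing if path not in checkpoint)
    for path in new_files:
        process(path)
        # Record the file only after it has been processed successfully.
        checkpoint.add(path)
    return new_files


# Hypothetical directory listings from two successive runs.
checkpoint: Set[str] = set()
loaded: List[str] = []
first_run = ingest_new_files(["a.json", "b.json"], checkpoint, loaded.append)
second_run = ingest_new_files(["a.json", "b.json", "c.json"], checkpoint, loaded.append)
```

The second run sees all three files in the listing but processes only the new one, which is the behavior incremental ingestion relies on.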

Auto Loader is also used in other Databricks features such as Delta Live Tables. We will discuss the architecture, provide a demo, and feature an Auto Loader customer speaking about their experience migrating to Auto Loader.

X-FIPE: eXtended Feature Impact for Prediction Explanation

2022-07-19

Many enterprises have built their own machine learning platforms in the cloud using Databricks, e.g. Humana FlorenceAI. In order to effectively drive the adoption of predictive models in daily business operations, data scientists and business teams need to work closely to make sure they serve the consumer needs in compliance with regulatory rules. Model interpretability is key. In this talk, we would like to share an explainable AI algorithm developed at Humana, X-FIPE, eXtended Feature Impact for Prediction Explanation.

X-FIPE is a top-driver algorithm that calculates feature importance at a local level for any machine learning predictive model, whether written in Python or PySpark. Instead of showing feature importance at the population level, it finds the top drivers for each observation or member. These top drivers can differ widely from one member of the population to another. It not only helps explain the predictive model but also offers users actionable insights.

Compared with widely used algorithms such as LIME, SHAP, and FIPE, X-FIPE improves the time complexity from linear O(n) to logarithmic O(log(n)), where n is the number of model features used. We also discovered a connection between the X-FIPE value and the Shapley value: X-FIPE is a first-order approximation of the Shapley value. Our observations show that most of a feature’s Shapley value comes from its marginal contribution when it is first added to, and when it is last removed from, the full feature set. This is why X-FIPE retains sufficient accuracy while reducing computation.
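
The observation about first-added and last-removed contributions can be checked on a small example. The sketch below computes exact Shapley values by enumerating permutations and compares them with the average of two marginal contributions: adding a feature to the empty set and removing it from the full set. This is one reading of the sentence above, not the X-FIPE algorithm itself, which is not specified here; the value function, feature names, and weights are invented for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence

ValueFn = Callable[[FrozenSet[str]], float]


def exact_shapley(features: Sequence[str], value: ValueFn) -> Dict[str, float]:
    """Average each feature's marginal contribution over all orderings."""
    totals = {feature: 0.0 for feature in features}
    count = 0
    for order in permutations(features):
        present: FrozenSet[str] = frozenset()
        for feature in order:
            totals[feature] += value(present | {feature}) - value(present)
            present = present | {feature}
        count += 1
    return {feature: total / count for feature, total in totals.items()}


def first_last_approximation(features: Sequence[str], value: ValueFn) -> Dict[str, float]:
    """Average of the first-added and last-removed marginal contributions."""
    full, empty = frozenset(features), frozenset()
    return {
        feature: 0.5 * ((value(frozenset({feature})) - value(empty))
                        + (value(full) - value(full - {feature})))
        for feature in features
    }


# Hypothetical model: additive weights plus one pairwise interaction.
WEIGHTS = {"age": 2.0, "bmi": 1.0, "visits": 3.0}


def toy_value(subset: FrozenSet[str]) -> float:
    bonus = 1.0 if {"age", "bmi"} <= subset else 0.0
    return sum(WEIGHTS[feature] for feature in subset) + bonus


names = list(WEIGHTS)
exact = exact_shapley(names, toy_value)
approx = first_last_approximation(names, toy_value)
```

For this toy value function, which has only a pairwise interaction, the two results coincide while the approximation needs four model evaluations per feature instead of a sum over all orderings. With higher-order interactions the two generally differ.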

We hope this talk gives you a path forward for including explainable AI in your machine learning workflows. You are encouraged to try out and contribute to our open source Python package, xfipe, which is coming soon.

You Have BI. Now What? Activate Your Data!

2022-07-19

Analytics has long been the end goal for data teams: standing up dashboards and exporting reports for business teams. But what if data teams could extend their work directly into the tools business teams use?

The next evolution for data teams is Activation. Smart organizations use reverse ETL to extend the value of Databricks by syncing data directly into business platforms, making their lakehouse a Customer Data Platform (CDP). By making Databricks the single source of truth for your data, you can create business models in your lakehouse and serve them directly to your marketing tools, ad networks, CRMs, and more. This saves time and money, unlocks new use cases for your data, and turns data team efforts into revenue-generating activities.

Your fastest path to Lakehouse and beyond

2022-07-19

Azure Databricks is an easy, open, and collaborative service for data, analytics & AI use cases, enabled by Lakehouse architecture. Join this session to discover how you can get the most out of your Azure investments by combining the best of Azure Synapse Analytics, Azure Databricks and Power BI for building a complete analytics & AI solution based on Lakehouse architecture.

Auditing Your Data and Answering the Lifelong Question—Is It the End of the Day Yet?

2022-07-19

Huge volumes of data flow through a robust Kafka architecture, into several ETLs, receiving, transforming and storing the data. We clearly understood our ETLs’ workflow and our data architecture, from source to destination.

But how much did we know about the way our data makes its way through our systems? And what about the lifelong question: is it the end of the day yet?

In this talk, I’m going to present the design process behind our data auditing system, Life Line: from tracking and producing to analyzing and storing auditing information, using technologies such as Kafka, Avro, Spark, Lambda functions, and complex SQL queries. We’re going to cover:
* Avro audit header
* Auditing heartbeat: designing your metadata
* Designing and optimizing your auditing table: what does this data look like anyway?
* Creating an alert-based monitoring system
* Answering the most important question of all: is it the end of the day yet?

Azure Databricks Excitement 2022

2022-07-19

Data + AI Summit 2022 was a great opportunity to check in on the partnership between Microsoft Azure and Databricks!

Can ML Forecast Fashion Trends? What Should We Predict?

2022-07-19

How to use existing data to feature ‘style’ in order to make relevant recommendations to customers:
1. Is fashion style predictable? What can be forecast by machines in the fashion world, and what cannot?
2. What types of data can we get for machines to learn from in the fashion world?
3. A personal perspective on three dimensions for featuring style in AI.

Data-Centric Principles for AI Engineering

2022-07-19

While some AI problems can be solved with end-to-end deep learning models that go from raw inputs to outputs, practitioners (including our customers!) find that such "mega models" are, on their own, not enough to build production-ready AI applications. In practice, it’s critical that AI engineers can inspect, test, and refactor the modular components of their applications, as they would with any piece of infrastructure or software.

In this talk, we’ll introduce a data-centric approach to AI engineering that highlights the advantages of modular components, fine-grained evaluation, and rapid iteration through programmatic labeling. We'll discuss the practical trade-offs of incrementally building and testing pipelines composed of models, preprocessing steps, and business logic. Along the way, we’ll share examples of these principles in practice through real-world case studies.

DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine

2022-07-19

Learn how Rust, the Apache Arrow project, and the DataFusion query engine are increasingly being used to accelerate the creation of modern data stacks.

Data Warehousing on the Lakehouse

2022-07-19

Most organizations routinely operate their business with complex cloud data architectures that silo applications, users, and data. As a result, there is no single source of truth for analytics, and most analysis is performed on stale data. To solve these challenges, the lakehouse has emerged as the new standard for data architecture, with the promise of unifying data, AI, and analytics workloads in one place. In this session, we will cover why the data lakehouse is the next best data warehouse. You will hear success stories, use cases, and best practices from experts in the field, and discover how the data lakehouse ingests, stores, and governs business-critical data at scale to build a curated data lake for data warehousing, SQL, and BI workloads. You will also learn how Databricks SQL can help you lower costs and get started in seconds with instant, elastic, serverless SQL compute, and how to empower every analytics engineer and analyst to quickly find and share new insights using their favorite BI and SQL tools, such as Fivetran, dbt, Tableau, or Power BI.

DBA Perspective—Optimizing Performance Table-by-Table

2022-07-19

As a DBA for your organization’s Lakehouse, it’s your job to stay on top of performance and cost optimization techniques. We will discuss how to use the available Delta Lake tools to tune your jobs and optimize your tables.

dbt and Databricks: Analytics Engineering on the Lakehouse

2022-07-19
Aaron Steichen (dbt Labs)

dbt's analytics engineering workflow has been adopted by 11,000+ teams and has quickly become an industry standard for data transformation. This is a great chance to see why.

dbt allows anyone who knows SQL to develop, document, test, and deploy models. With the native, SQL-first integration between Databricks and dbt Cloud, analytics teams can collaborate in the same workspace as data engineers and data scientists to build production-grade data transformation pipelines on the lakehouse.

In this live session, Aaron Steichen, Solutions Architect at dbt Labs, will walk you through dbt's workflow, how it works with Databricks, and what it makes possible.

dbt and Python—Better Together

2022-07-19
Drew Banin (Fishtown Analytics)

Drew Banin is the co-founder of dbt Labs and one of the maintainers of dbt Core, the open source standard in data modeling and transformation. In this talk, he will demonstrate an approach to unifying SQL and Python workloads under a single dbt execution graph, illustrating the powerful, flexible nature of dbt running on Databricks.
