talk-data.com

Topic

Databricks

big_data analytics spark

1286 tagged

Activity Trend

515 peak/qtr, 2020-Q1 to 2026-Q1

Activities

1286 activities · Newest first

Gazelle-Jni: A Middle Layer to Offload Spark SQL to Native Engines for Execution Acceleration

This session will introduce Gazelle-Jni, which was proposed to better integrate various native SQL engines as Spark SQL's backend. It implements a shared JVM and JNI middle layer. With the help of Gazelle-Jni, Spark SQL execution can be offloaded to native engines by passing a Substrait-transformed physical plan.

Examples will be presented on how to integrate native engines with Spark SQL.
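
To make the idea concrete, here is a hedged sketch of what enabling such an offload layer could look like from PySpark. The plugin class name below is an illustrative placeholder, not Gazelle-Jni's actual configuration; spark.plugins itself is a standard Spark 3.x setting.

from pyspark.sql import SparkSession

# Hypothetical sketch: replace the placeholder plugin class with the real
# class shipped in the middle layer's jar before running.
spark = (
    SparkSession.builder
    .appName("native-offload-demo")
    .config("spark.plugins", "com.example.NativeOffloadPlugin")  # placeholder class
    .getOrCreate()
)

# Queries run unchanged; a middle layer like Gazelle-Jni translates the
# physical plan to Substrait and hands it to a native engine for execution.
spark.sql("SELECT count(*) FROM range(1000000)").explain()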

Goodbye Hell of Unions in Spark SQL

It is known that applications which heavily use Spark SQL's union() operation can run into performance problems. The union() operation combines the rows of multiple DataFrames into one table. When union() merges many DataFrames, the generated Spark SQL plan tree becomes huge even though the Spark SQL code itself is small, and that huge plan tree may lead to performance problems. This talk reviews these problems from the Spark SQL planning perspective and explains how to avoid them with common practices.
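
As a rough illustration of the problem and one common mitigation (not necessarily the talk's exact recommendation), compare a left-deep chain of unions with a flattened union at the RDD level:

from functools import reduce
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()
parts = [spark.range(10) for _ in range(500)]  # 500 small DataFrames, same schema

# Anti-pattern: a left-deep chain of 500 unions yields a huge logical plan
# that the optimizer must analyze, inflating planning time.
slow = reduce(DataFrame.unionByName, parts)

# One common workaround: flatten the union at the RDD level, which avoids the
# deeply nested plan tree (at the cost of some Catalyst optimizations).
flat = spark.createDataFrame(spark.sparkContext.union([df.rdd for df in parts]),
                             parts[0].schema)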

Moving to the Lakehouse: Fast & Efficient Ingestion with Auto Loader

Auto Loader, the most popular tool for incremental data ingestion from cloud storage into the Databricks Lakehouse, is used in our biggest customers' ingestion workflows. It is our all-in-one solution for exactly-once processing, offering efficient file discovery, schema inference and evolution, and fault tolerance.

In this talk, we want to delve into key features in Auto Loader, including:
• Avro schema inference
• Rescued column
• Semi-structured data support
• Incremental listing
• Asynchronous backfilling
• Native listing
• File-level tracking and observability

Auto Loader is also used in other Databricks features such as Delta Live Tables. We will discuss the architecture, provide a demo, and feature a customer speaking about their experience migrating to Auto Loader.
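
For orientation, a minimal Auto Loader pipeline looks roughly like this (paths, formats, and table names are illustrative):

# Incrementally ingest JSON files from cloud storage with Auto Loader.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")  # schema inference & evolution
    .option("cloudFiles.useIncrementalListing", "auto")          # incremental listing
    .load("s3://my-bucket/raw/orders/")
)

(
    df.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/orders")  # exactly-once bookkeeping
    .trigger(availableNow=True)
    .toTable("bronze.orders")
)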

X-FIPE: eXtended Feature Impact for Prediction Explanation

Many enterprises have built their own machine learning platforms in the cloud using Databricks, e.g. Humana FlorenceAI. In order to effectively drive the adoption of predictive models in daily business operations, data scientists and business teams need to work closely to make sure they serve the consumer needs in compliance with regulatory rules. Model interpretability is key. In this talk, we would like to share an explainable AI algorithm developed at Humana, X-FIPE, eXtended Feature Impact for Prediction Explanation.

X-FIPE is a top-driver algorithm that calculates feature importance for any machine learning predictive model, whether implemented in Python or PySpark, at a local level. Instead of showing feature importance at the population level, it finds the top drivers for each observation or member; these top drivers can differ widely from one member to another. It not only helps explain the predictive model but also offers users actionable insights.

Compared with widely used algorithms such as LIME, SHAP, and FIPE, X-FIPE improves the time complexity from linear O(n) to logarithmic O(log(n)), where n is the number of model features used. We also discovered the connection between the X-FIPE value and the Shapley value: X-FIPE is a first-order approximation of the Shapley value. Our observation is that most of a feature's Shapley value comes from its marginal contribution when it is first added to, and when it is last removed from, the full feature set. This is why X-FIPE retains sufficient accuracy while reducing the computation.
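
To illustrate the first-order idea only (this naive sketch makes O(n) model calls and omits the smarter search that gives X-FIPE its logarithmic complexity; the function is hypothetical, not the xfipe package API):

import numpy as np

def first_order_scores(predict, x, baseline):
    """Average each feature's marginal effect when it is first added to the
    baseline and when it is last removed from the full input (hypothetical)."""
    full, base = predict(x), predict(baseline)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        first_added = baseline.copy(); first_added[i] = x[i]
        last_removed = x.copy(); last_removed[i] = baseline[i]
        scores[i] = 0.5 * ((predict(first_added) - base) + (full - predict(last_removed)))
    return scores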

Hopefully this talk will give you a path forward for including explainable AI in your machine learning workflows. You are encouraged to try out and contribute to our open-source Python package xfipe, coming soon.

You Have BI. Now What? Activate Your Data!

Analytics has long been the end goal for data teams: standing up dashboards and exporting reports for business teams. But what if data teams could extend their work directly into the tools business teams use?

The next evolution for data teams is Activation. Smart organizations use reverse ETL to extend the value of Databricks by syncing data directly into business platforms, making their lakehouse a Customer Data Platform (CDP). By making Databricks the single source of truth for your data, you can create business models in your lakehouse and serve them directly to your marketing tools, ad networks, CRMs, and more. This saves time and money, unlocks new use cases for your data, and turns data team efforts into revenue-generating activities.

Your fastest path to Lakehouse and beyond

Azure Databricks is an easy, open, and collaborative service for data, analytics & AI use cases, enabled by Lakehouse architecture. Join this session to discover how you can get the most out of your Azure investments by combining the best of Azure Synapse Analytics, Azure Databricks and Power BI for building a complete analytics & AI solution based on Lakehouse architecture.

Auditing Your Data and Answering the Lifelong Question—Is It the End of the Day Yet?

Huge volumes of data flow through a robust Kafka architecture into several ETLs that receive, transform, and store the data. We clearly understood our ETLs' workflow and our data architecture, from source to destination.

But how much did we know about the way our data makes its way through our systems? And what about the lifelong question: is it the end of the day yet?

In this talk I'm going to present the design process behind our data auditing system, Life Line: from tracking and producing auditing information to analyzing and storing it, using technologies such as Kafka, Avro, Spark, Lambda functions, and complex SQL queries. We're going to cover:
* Avro audit header
* Auditing heartbeat: designing your metadata
* Designing and optimizing your auditing table: what does this data look like anyway?
* Creating an alert-based monitoring system
* Answering the most important question of all: is it the end of the day yet?
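
As a hedged illustration, an Avro audit header along the lines discussed might look like this (field names are assumptions for the example, not the talk's actual schema):

import json

# Hypothetical audit header embedded in every record.
audit_header_schema = {
    "type": "record",
    "name": "AuditHeader",
    "fields": [
        {"name": "event_id", "type": "string"},       # unique per record
        {"name": "source_system", "type": "string"},  # producing service
        {"name": "produced_at",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "business_date",                     # drives "end of day" checks
         "type": {"type": "int", "logicalType": "date"}},
        {"name": "hop_count", "type": "int"},         # incremented at each ETL stage
    ],
}
print(json.dumps(audit_header_schema, indent=2))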

Can ML Forecast Fashion Trends? What Should We Predict?

How to use existing data to feature 'style' in order to make relevant recommendations to customers:
1. Is fashion style predictable? What can be forecast by machines in the fashion world, and what cannot?
2. What type of data can we get for machines to learn from in the fashion world?
3. A personal perspective on three dimensions to feature style in AI

Data-Centric Principles for AI Engineering

While some AI problems can be solved with end-to-end deep learning models that go from raw inputs to outputs, practitioners (including our customers!) find that such "mega models" are, on their own, not enough to build production-ready AI applications. In practice, it’s critical that AI engineers can inspect, test, and refactor the modular components of their applications, as they would with any piece of infrastructure or software.

In this talk, we’ll introduce a data-centric approach to AI engineering that highlights the advantages of modular components, fine-grained evaluation, and rapid iteration through programmatic labeling. We'll discuss the practical trade-offs of incrementally building and testing pipelines composed of models, preprocessing steps, and business logic. Along the way, we’ll share examples of these principles in practice through real-world case studies.
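
To make "programmatic labeling" concrete, here is a tiny framework-free sketch in the same spirit (the heuristics and labels are invented for the example):

# Labeling functions vote on each example; a majority aggregates the votes.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_greeting(text):
    return HAM if len(text.split()) < 4 and text.lower().startswith(("hi", "hey")) else ABSTAIN

def label(text, lfs=(lf_contains_link, lf_short_greeting)):
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN  # majority vote

print(label("hey there"))                    # -> 0 (HAM)
print(label("win money at https://x.test"))  # -> 1 (SPAM)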

DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine

Learn how Rust, the Apache Arrow project, and the DataFusion query engine are increasingly being used to accelerate the creation of modern data stacks.
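
A minimal sketch using the DataFusion Python bindings (assuming the datafusion package is installed; the file path and columns are illustrative):

# Query a CSV file with the Rust-based DataFusion engine from Python.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("events", "events.csv")  # register the file as a table
df = ctx.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")
df.show()  # executes on DataFusion, results flow back as Arrow batches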

Data Warehousing on the Lakehouse

Most organizations routinely operate their business with complex cloud data architectures that silo applications, users, and data. As a result, there is no single source of truth for analytics, and most analysis is performed on stale data. To solve these challenges, the lakehouse has emerged as the new standard for data architecture, with the promise of unifying data, AI, and analytics workloads in one place.

In this session, we will cover why the data lakehouse is the next best data warehouse. You will hear success stories, use cases, and best practices learned from the field, and discover how the data lakehouse ingests, stores, and governs business-critical data at scale to build a curated data lake for data warehousing, SQL, and BI workloads. You will also learn how Databricks SQL can help you lower costs and get started in seconds with instant, elastic serverless SQL compute, and how to empower every analytics engineer and analyst to quickly find and share new insights using their favorite BI and SQL tools, like Fivetran, dbt, Tableau, or Power BI.

DBA Perspective—Optimizing Performance Table-by-Table

As a DBA for your organization's lakehouse, it's your job to stay on top of performance and cost optimization techniques. We will discuss how to use the available Delta Lake tools to tune your jobs and optimize your tables.
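
For instance, routine Delta Lake table maintenance looks like this (table and column names are illustrative):

# Compact small files and co-locate a frequently filtered column.
spark.sql("OPTIMIZE sales.orders ZORDER BY (customer_id)")
# Remove files no longer referenced by the table (7-day retention).
spark.sql("VACUUM sales.orders RETAIN 168 HOURS")
# Refresh table statistics for the query optimizer.
spark.sql("ANALYZE TABLE sales.orders COMPUTE STATISTICS")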

dbt and Databricks: Analytics Engineering on the Lakehouse

dbt's analytics engineering workflow has been adopted by 11,000+ teams and has quickly become an industry standard for data transformation. This is a great chance to see why.

dbt allows anyone who knows SQL to develop, document, test, and deploy models. With the native, SQL-first integration between Databricks and dbt Cloud, analytics teams can collaborate in the same workspace as data engineers and data scientists to build production-grade data transformation pipelines on the lakehouse.

In this live session, Aaron Steichen, Solutions Architect at dbt Labs, will walk you through dbt's workflow, how it works with Databricks, and what it makes possible.

dbt and Python—Better Together

Drew Banin is the co-founder of dbt Labs and one of the maintainers of dbt Core, the open source standard in data modeling and transformation. In this talk, he will demonstrate an approach to unifying SQL and Python workloads under a single dbt execution graph, illustrating the powerful, flexible nature of dbt running on Databricks.
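
For a flavor of what this looks like, here is a minimal dbt Python model as it might run on Databricks (the upstream model name and columns are illustrative):

# models/orders_by_customer.py -- a dbt Python model
def model(dbt, session):
    dbt.config(materialized="table")
    orders = dbt.ref("stg_orders")  # upstream dbt model as a Spark DataFrame
    return orders.groupBy("customer_id").count()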

dbt + Machine Learning: What Makes a Great Baton Pass?

dbt has done a great job of building an elegant, common interface between data engineers and data analysts: uniting on SQL. As the data industry evolves, there's plenty of pain, and room to grow, in building that interface between data scientists and data analysts. There isn't a good answer for when things go wrong in the machine learning arena: should the data analyst own fine-tuning the pre-processing data (think: prepping transformed data even further so machine learning models can work with it better)? Should we increase the SQL surface area to build ML models, or should we leave that to non-SQL interfaces (Python/Scala/etc.)? Does this have to be an either/or future? Whatever the interface evolves into, it must center people, create a low bar and a high ceiling, and focus on outcomes rather than the mystique of features and tools behind a learning curve.

DELETE, UPDATE, MERGE Operations in Data Source

If you’ve ever had to delete a set of records for regulatory compliance, update a set of records to fix an issue in the ingestion pipeline, or apply changes in a transaction log to a fact table, you know that row-level operations are becoming critical for modern data lake workflows. This talk will focus on some of the upcoming features in Spark 3.3 that will enable execution of row-level operations and allow Spark to pass to connectors only which rows to delete, update, or insert. As a result, data sources won’t have to provide low-level SQL extensions for Spark and will be able to benefit from a scalable built-in implementation that works across all connectors. The presentation will be useful for data source developers as well as data engineers and analysts interested in performing DELETE, UPDATE, and MERGE operations in Spark.
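
For reference, the row-level operations in question look like this in Spark SQL (table names are illustrative; support depends on the connector, e.g. Delta Lake or Apache Iceberg):

# Row-level operations against a connector that supports them.
spark.sql("DELETE FROM warehouse.users WHERE erasure_requested = true")
spark.sql("UPDATE warehouse.orders SET status = 'fixed' WHERE batch = '2022-06-01'")
spark.sql("""
    MERGE INTO warehouse.facts AS t
    USING changes AS s
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")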

Delta Lake 2.0 Overview

After three years of hard work by the Delta community, we are proud to announce the release of Delta Lake 2.0. Completing the work to open-source all of Delta Lake while tens of thousands of organizations were running it in production was no small feat, and we have the ever-expanding Delta community to thank! Join this session to learn how the wider Delta community collaborated to bring together these features and integrations:
• Integrations with Apache Spark™, Apache Flink, Apache Pulsar, Presto, Trino, and more
• Features such as OPTIMIZE ZORDER, data skipping using column stats, S3 multi-cluster writes, Change Data Feed, and more
• Language APIs including Rust, Python, Ruby, GoLang, Scala, and Java
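
As a quick taste of two of those features (the table name is illustrative):

# Enable Change Data Feed on an existing Delta table, then read row-level changes.
spark.sql("ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)  # illustrative starting version
    .table("events")
)

# Compact files and co-locate data for skipping on a frequently filtered column.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")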

Delta Live Tables: Modern Software Engineering and Management for ETL

Data engineers have the difficult task of cleansing complex, diverse data and transforming it into a usable source to drive data analytics, data science, and machine learning. They need to know the data infrastructure platform in depth, build complex queries in various languages, and stitch them together for production. Join this talk to learn how Delta Live Tables (DLT) simplifies the complexity of data transformation and ETL. DLT is the first ETL framework to use modern software engineering practices to deliver reliable and trusted data pipelines at any scale. Discover how analysts and data engineers can innovate rapidly with simple pipeline development and maintenance; how to remove operational complexity by automating administrative tasks and gaining visibility into pipeline operations; how built-in quality controls and monitoring ensure accurate BI, data science, and ML; and how simplified batch and streaming can be implemented with self-optimizing and auto-scaling data pipelines.
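
A minimal DLT pipeline step might look like this (dataset names and the quality rule are illustrative):

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Cleaned orders")
@dlt.expect_or_drop("valid_amount", "amount > 0")  # built-in quality control
def orders_clean():
    return dlt.read_stream("orders_raw").where(col("order_id").isNotNull())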

Democratizing Metrics at Airbnb

Data democratization is the process of enabling self-service access to insights from data for anyone in an organization, at varying levels of data expertise. Without deliberate planning, this process often leads to a proliferation of data tools that makes it inherently challenging to ensure consistent insights. At Airbnb, we’ve created a centralized metrics platform named Minerva to guarantee data consistency at scale. You can read about the introduction of Minerva in a three-part series on the Airbnb Tech Blog. In this talk, we’ll share several architectural changes we’ve made to allow for unprecedented flexibility while maintaining consistency, and introduce our plan for open-sourcing Minerva.
