talk-data.com

Topic: Cloud Computing
Tags: infrastructure, saas, iaas
99 activities tagged
Activity Trend: 471 peak/qtr (2020-Q1 to 2026-Q1)

Activities

Showing filtered results

Filtering by: Databricks Data + AI Summit 2023
Deliver Faster Decision Intelligence From Your Lakehouse

Accelerate the path from data to decisions with the Tellius AI-driven Decision Intelligence platform powered by Databricks Delta Lake. Empower business users and data teams to analyze data residing in the Delta Lake to understand what is happening in their business, uncover the reasons why metrics change, and get recommendations on how to impact outcomes. Learn how organizations derive value from the Delta Lakehouse with a modern analytics experience that unifies guided insights, natural language search, and automated machine learning to speed up data-driven decision making at cloud scale.

In this session, we will showcase how customers:

  • Discover changes in KPIs and investigate the reasons why metrics change with AI-powered automated analysis
  • Empower business users and data analysts to iteratively explore data to identify trend drivers, uncover new customer segments, and surface hidden patterns in data
  • Simplify and speed up analysis of massive datasets on Databricks Delta Lake

Efficient and Multi-Tenant Scheduling of Big Data and AI Workloads

Many ML and big data teams in the open source community are looking to run their workloads in the cloud, and they invariably face a common set of challenges such as multi-tenant cluster management, resource fairness and sharing, gang scheduling, and cost-effective infrastructure operations. Kubernetes is the de-facto standard platform for running containerized applications in the cloud. However, the default resource scheduler in Kubernetes leaves much to be desired for AI scenarios when running ML/DL training workloads or large-scale data processing jobs for feature engineering.

In this talk, we will share how the community leverages and builds upon Apache YuniKorn to address the unique resource scheduling needs of ML and big data teams.
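
As a hedged illustration of handing a workload to YuniKorn instead of the default kube-scheduler (following YuniKorn's documented convention of a yunikorn schedulerName plus applicationId and queue labels), a minimal pod submission with the kubernetes Python client might look like the sketch below; the queue, image, and application names are placeholders.

    # Assumes the Apache YuniKorn scheduler is installed in the cluster and the
    # kubernetes Python client is configured (for example via ~/.kube/config).
    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name="trainer-0",
            labels={
                "applicationId": "ml-train-001",  # groups pods into one YuniKorn application
                "queue": "root.ml-team",          # placeholder queue name
            },
        ),
        spec=client.V1PodSpec(
            scheduler_name="yunikorn",            # hand the pod to YuniKorn, not kube-scheduler
            restart_policy="Never",
            containers=[client.V1Container(name="trainer", image="example/train:latest")],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)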

Enabling Learning on Confidential Data

Multiple organizations often wish to aggregate their confidential data and learn from it, but they cannot do so because they cannot share their data with each other. For example, banks wish to train models jointly over their aggregate transaction data to detect money launderers more efficiently because criminals hide their traces across different banks.

To address such problems, we developed MC^2 at UC Berkeley, an open-source framework for multi-party confidential computation, on top of Apache Spark. MC^2 enables organizations to share encrypted data and perform analytics and machine learning on the encrypted data without any organization or the cloud seeing the data. Our company Opaque brings the MC^2 technology in an easy-to-use form to organizations in the financial, medical, ad tech, and other sectors.
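
MC^2's own client API is not reproduced here. As a loose stand-in for the first step of that workflow, the sketch below uses the cryptography package to encrypt a dataset on the data owner's side before anything is shared, which is the property the framework builds on; the file names and key handling are purely illustrative.

    # NOT the MC^2 API -- only an illustration of encrypting data client-side,
    # so that neither other organizations nor the cloud ever see the plaintext.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # in a real deployment the key stays with the data owner
    fernet = Fernet(key)

    with open("transactions.csv", "rb") as f:        # hypothetical local file
        ciphertext = fernet.encrypt(f.read())

    with open("transactions.csv.enc", "wb") as f:    # this encrypted copy is what gets shared
        f.write(ciphertext)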

Moving to the Lakehouse: Fast & Efficient Ingestion with Auto Loader

Auto Loader, the most popular tool for incremental data ingestion from cloud storage to Databricks’ Lakehouse, is used in our biggest customers’ ingestion workflows. Auto Loader is our all-in-one solution for exactly-once processing, offering efficient file discovery, schema inference and evolution, and fault tolerance.

In this talk, we want to delve into key features in Auto Loader, including:

  • Avro schema inference
  • Rescued column
  • Semi-structured data support
  • Incremental listing
  • Asynchronous backfilling
  • Native listing
  • File-level tracking and observability

Auto Loader is also used in other Databricks features such as Delta Live Tables. We will discuss the architecture, provide a demo, and feature an Auto Loader customer speaking about their experience migrating to Auto Loader.
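
As a rough, hedged sketch of the happy path (Databricks-only, since cloudFiles is the Auto Loader source), a minimal incremental ingestion stream in PySpark might look like this; the storage paths and target table are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Incrementally discover and ingest new JSON files from cloud storage into a
    # Delta table, with schema inference/evolution tracked at schemaLocation.
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "s3://example-bucket/ingest/_schemas")  # placeholder
        .load("s3://example-bucket/raw/events/")                                     # placeholder
        .writeStream
        .option("checkpointLocation", "s3://example-bucket/ingest/_checkpoints")     # placeholder
        .trigger(availableNow=True)   # process everything currently available, then stop
        .toTable("bronze.events"))    # placeholder target table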

X-FIPE: eXtended Feature Impact for Prediction Explanation

Many enterprises have built their own machine learning platforms in the cloud using Databricks, e.g., Humana's FlorenceAI. To effectively drive the adoption of predictive models in daily business operations, data scientists and business teams need to work closely together to make sure they serve consumer needs in compliance with regulatory rules. Model interpretability is key. In this talk, we would like to share an explainable AI algorithm developed at Humana: X-FIPE, eXtended Feature Impact for Prediction Explanation.

X-FIPE is a top-driver algorithm that calculates feature importance for any machine learning predictive model, whether built in Python or PySpark, at a local level. Instead of showing feature importance at the population level, it finds the top drivers for each observation or member. These top drivers can differ widely from one member to another in the population. It not only helps explain the predictive model, but also offers users actionable insights.

Compared with widely used algorithms such as LIME, SHAP, and FIPE, X-FIPE improves the time complexity from linear O(n) to logarithmic O(log(n)), where n is the number of model features used. We also discovered the connection between the X-FIPE value and the Shapley value: X-FIPE is a first-order approximation of the Shapley value. Our observation is that most of a feature's Shapley value comes from its marginal contribution when it is first added to, and when it is last removed from, the full feature set. This is why X-FIPE retains sufficient accuracy while greatly reducing computation.
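
The xfipe package itself is not yet published, so the following is only a minimal sketch of the "first added / last removed" approximation described above, written against a generic predict function and a baseline of feature means; it does not reproduce the logarithmic-time strategy claimed for X-FIPE.

    import numpy as np

    def first_order_impacts(predict, x, baseline):
        """Approximate local feature impacts for one observation x.

        For each feature i, average two marginal contributions:
        adding feature i first to the baseline, and removing it last
        from the full observation. predict maps a 2-D array to scores.
        """
        x = np.asarray(x, dtype=float)
        baseline = np.asarray(baseline, dtype=float)
        base_score = predict(baseline[None, :])[0]
        full_score = predict(x[None, :])[0]

        impacts = np.zeros_like(x)
        for i in range(x.size):
            only_i = baseline.copy()
            only_i[i] = x[i]                  # feature i added first
            without_i = x.copy()
            without_i[i] = baseline[i]        # feature i removed last
            first_added = predict(only_i[None, :])[0] - base_score
            last_removed = full_score - predict(without_i[None, :])[0]
            impacts[i] = 0.5 * (first_added + last_removed)
        return impacts                        # rank by absolute value to get top drivers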

Hopefully this talk will provide you with a path forward to include explainable AI in your machine learning workflows. You are encouraged to try out and contribute to our open-source Python package, xfipe, which is coming soon.

Data Warehousing on the Lakehouse

Most organizations routinely operate their business with complex cloud data architectures that silo applications, users and data. As a result, there is no single source of truth for analytics, and most analysis is performed on stale data. To solve these challenges, the lakehouse has emerged as the new standard for data architecture, with the promise of unifying data, AI and analytics workloads in one place. In this session, we will cover why the data lakehouse is the next best data warehouse. You will hear success stories, use cases, and best practices learned from the field by the experts, and discover how the data lakehouse ingests, stores and governs business-critical data at scale to build a curated data lake for data warehousing, SQL and BI workloads. You will also learn how Databricks SQL can help you lower costs and get started in seconds with instant, elastic serverless SQL compute, and how to empower every analytics engineer and analyst to quickly find and share new insights using their favorite BI and SQL tools, like Fivetran, dbt, Tableau or Power BI.

dbt and Databricks: Analytics Engineering on the Lakehouse

dbt's analytics engineering workflow has been adopted by 11,000+ teams and has quickly become an industry standard for data transformation. This is a great chance to see why.

dbt allows anyone who knows SQL to develop, document, test, and deploy models. With the native, SQL-first integration between Databricks and dbt Cloud, analytics teams can collaborate in the same workspace as data engineers and data scientists to build production-grade data transformation pipelines on the lakehouse.

In this live session, Aaron Steichen, Solutions Architect at dbt Labs, will walk you through dbt's workflow, how it works with Databricks, and what it makes possible.
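
For readers who prefer to script the same develop-test-deploy loop, dbt-core 1.5+ also exposes a programmatic entry point (assuming a profiles.yml that targets a Databricks SQL warehouse via dbt-databricks); the model selector below is hypothetical.

    # Assumes dbt-core >= 1.5, dbt-databricks, and a working profiles.yml.
    from dbt.cli.main import dbtRunner

    dbt = dbtRunner()
    for args in (["run", "--select", "stg_orders"], ["test", "--select", "stg_orders"]):
        result = dbt.invoke(args)                     # same commands the CLI would run
        print(args[0], "ok" if result.success else "failed")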

Enabling Business Users to Perform Interactive Ad-Hoc Analysis over Delta Lake with No Code

In this talk, we'll first introduce Sigma Workbooks along with its technical design motivations and architectural details. Sigma Workbooks is an interactive visual data analytics system that enables business users to easily perform complex ad-hoc analysis over data in cloud data warehouses (CDWs). We'll then demonstrate the expressivity, scalability, and ease-of-use of Sigma Workbooks through real-life use cases over datasets stored in Delta Lake. We’ll conclude the talk by sharing the lessons that we have learned throughout the design and implementation iterations of Sigma Workbooks.

How AARP Services, Inc. automated SAS transformation to Databricks using LeapLogic

While SAS has been a standard in analytics and data science use cases, it is not cloud-native and does not scale well. Join us to learn how AARP automated the conversion of hundreds of complex data processing, model scoring, and campaign workloads to Databricks using LeapLogic, an intelligent code transformation accelerator that can transform any and all legacy ETL, analytics, data warehouse and Hadoop workloads to modern data platforms.

In this session, experts from AARP and Impetus will share how they collaborated with Databricks and how they were able to:

  • Automate modernization of SAS marketing analytics based on coding best practices
  • Establish a rich library of Spark and Python equivalent functions on Databricks with the same capabilities as SAS procedures, DATA step operations, macros, and functions
  • Leverage Databricks-native services like Delta Live Tables to implement waterfall techniques for campaign execution and simplify pipeline monitoring
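
To give a flavor of the kind of translation involved (this is not LeapLogic output, and the table and column names are invented), a simple SAS DATA step that filters rows and derives a column maps onto a PySpark expression like the one below.

    # Illustrative PySpark equivalent of a SAS DATA step that keeps active members
    # and derives an age band; table and column names are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    members = spark.table("marketing.members")

    campaign_base = (members
        .filter(F.col("status") == "ACTIVE")                           # SAS: IF status = 'ACTIVE';
        .withColumn("age_band",
                    F.when(F.col("age") < 65, "under_65").otherwise("65_plus")))

    campaign_base.write.mode("overwrite").saveAsTable("marketing.campaign_base")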

Big Data in the Age of Moneyball

Data and predictions have permeated sports and our conversations around them since the beginning. Who will win the big game this weekend? How many points will your favorite player score? How much money will be guaranteed in the next free-agent contract? One could argue that data-driven decision making in sports started with Moneyball in baseball in 2003. In the two decades since, data and technology have exploded onto the scene. The Texas Rangers are using modern cloud software, such as Databricks, to help make sense of this data and provide actionable information to create a World Series team on the field. From computer vision, pose analytics, and player tracking, to pitch design, base-stealing likelihood, and more, come see how the Texas Rangers are using innovative cloud technologies to create action-driven reports from the current sea of big data. Finally, this talk will demonstrate how the Texas Rangers use MLflow and the Model Registry inside Databricks to organize their predictive models.
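
As a small, hedged sketch of that last step (synthetic data, an illustrative model, and a made-up registry name), logging a model and registering it in the MLflow Model Registry looks roughly like this; registration assumes a registry-backed tracking server such as Databricks.

    import mlflow
    import mlflow.sklearn
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic stand-in data; real pipelines would read from Delta tables.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    y = rng.integers(0, 2, size=500)
    model = GradientBoostingClassifier().fit(X, y)

    with mlflow.start_run():
        mlflow.log_metric("train_accuracy", float(model.score(X, y)))
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="steal_likelihood_model",  # hypothetical registry entry
        )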

Cloud Fetch: High-bandwidth Connectivity With BI Tools

Business Intelligence (BI) tools such as Tableau and Microsoft Power BI are notoriously slow at extracting large query results from traditional data warehouses because they typically fetch the data in a single thread through a SQL endpoint that becomes a data transfer bottleneck. Data analysts can connect their BI tools to Databricks SQL endpoints to query data in tables through the ODBC/JDBC protocol integrated in our Simba drivers. With Cloud Fetch, which we released in Databricks Runtime 8.3 and the Simba ODBC 2.6.17 driver, we introduce a new mechanism for fetching data in parallel via cloud storage such as AWS S3 and Azure Data Lake Storage to bring data to BI tools faster. In our experiments using Cloud Fetch, we observed a 10x speed-up in extract performance thanks to this parallelism.
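
For orientation, the same endpoints can also be queried programmatically with the databricks-sql-connector package; with a recent driver, large result sets come back through cloud storage rather than a single-threaded fetch. The hostname, HTTP path, token, and query below are placeholders.

    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
        http_path="/sql/1.0/warehouses/abcdef1234567890",              # placeholder
        access_token="dapi-REDACTED",                                  # placeholder
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM samples.nyctaxi.trips LIMIT 100000")  # placeholder query
            rows = cursor.fetchall()
            print(len(rows))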

How the Largest County in the US is Transforming Hiring with a Modern Data Lakehouse

Los Angeles County’s Department of Human Resources (DHR) is responsible for attracting a diverse workforce for the 37 departments it supports. Each year, DHR processes upwards of 400,000 applications for job opportunities, making it one of the largest employers in the nation. Managing a hiring process of this scale is complex, with many factors such as background checks and skills examinations. These processes, if not managed properly, can create bottlenecks and a poor experience for both candidates and hiring managers.

In order to identify areas for improvement, DHR set out to build detailed operational metrics across each stage of the hiring process. DHR used to conduct high-level analysis manually using Excel and other disparate tools. The data itself was limited and difficult to obtain and analyze. In addition, it was taking analysts weeks to manually pull data from half a dozen siloed systems into Excel for cleansing and analysis. This process was labor-intensive, inefficient, and prone to human error.

To overcome these challenges, DHR, in partnership with the Internal Services Department (ISD), adopted a modern data architecture in the cloud. Powered by the Azure Databricks Lakehouse, DHR was able to bring together its diverse volumes of data into a single platform for data analytics. Manual ETL processes that took weeks can now be automated in 10 minutes or less. With this new architecture, DHR has built business intelligence dashboards that unpack the hiring process to get a clear picture of where the bottlenecks are and track the speed with which candidates move through the process. The dashboards allow County departments to innovate and make changes that enhance the experience of potential job seekers and improve the timeliness of securing highly qualified and diverse County personnel at all employment levels.

In this talk, we’ll discuss DHR’s journey towards building a data-driven hiring process, the architecture decisions that enabled this transformation, and the types of analytics that we’ve deployed to improve hiring efforts.

Financial Services Experience at Data + AI Summit 2022

The future of Financial Services is open, with data and AI at its core. Welcome, data teams and executives in Financial Services! This year’s Data + AI Summit is jam-packed with talks, demos and discussions on how Financial Services leaders are harnessing the power of data and analytics to digitally transform, minimize risk, accelerate time to market and drive sustainable value creation. To help you take full advantage of the Financial Services industry experience at Summit, we’ve curated all the programs in one place.

Highlights at this year’s Summit:

  • Financial Services Industry Forum: Our flagship event for Financial Services attendees at Summit, featuring keynotes and panel discussions with ADP, Northwestern Mutual, Point72 Asset Management, S&P Global and EY, followed by networking. More details in the agenda below.
  • Financial Services Lounge: Stop by our lounge located outside the Expo floor to meet with Databricks’ industry experts and see solutions from our partners including Accenture, Avanade, Deloitte and others.
  • Session Talks: Over 15 technical talks and demos on topics including hyper-personalization, AI-fueled forecasting, enterprise analytics in the cloud, scaling privacy and cybersecurity, MLOps in cryptocurrency, ethical credit scoring and more.

How Adobe migrated to a unified and open data Lakehouse to deliver personalization at scale

In this keynote talk, David Weinstein, VP of Engineering for Adobe Experience Cloud, will share Adobe’s journey from a simple data lake to a unified, open Lakehouse architecture with Databricks. Adobe can now deliver personalized experiences at scale to diverse customers with greater speed, operational efficiency and faster innovation across the Experience Cloud portfolio. Learn why they chose to migrate from Iceberg to Delta Lake to drive its open-standard development and accelerate innovation on their Lakehouse. They’ll also share how leveraging the Delta Lake table format has enabled techniques to support change data capture and significantly improve operational efficiency.
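
One such Delta Lake technique is the change data feed; a minimal, hedged PySpark read of row-level changes might look like the sketch below, assuming the table was created with delta.enableChangeDataFeed = true and using placeholder names and versions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read inserts, updates and deletes recorded since table version 12.
    changes = (spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 12)              # placeholder version
        .table("profiles.customer_events"))         # placeholder table

    changes.select("_change_type", "_commit_version", "_commit_timestamp").show()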

Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark

In recent years, new privacy laws and regulations have brought a fundamental shift in the protection of data and privacy, placing new challenges on data applications. To resolve these privacy and security challenges in the big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with Scone, Opaque, etc. However, to the best of our knowledge, none of them provides full protection for the data pipelines in Spark applications; an adversary may still obtain sensitive information from unprotected components and stages. Furthermore, some of them greatly narrow the range of supported applications, e.g., supporting only SparkSQL. In this presentation, we will present a new PPMLA (privacy preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum and Intel SGX. It ensures that all Spark components and pipelines are fully protected by Intel SGX, and existing Spark applications written in Scala, Java or Python can be migrated onto our platform without any code change. We will demonstrate how to build distributed end-to-end SparkML/SparkSQL workloads with our solution in an untrusted cloud environment and share real-world use cases for PPMLA.
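
To make the "no code change" point concrete, the job below is deliberately plain PySpark with nothing TEE-specific in it; in the setup described above, the same code would run inside SGX enclaves via BigDL PPML and Occlum. Paths and column names are placeholders.

    # Ordinary Spark job; nothing here is SGX- or PPML-specific.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ppml-demo").getOrCreate()

    txns = spark.read.parquet("/data/transactions")                  # placeholder path
    daily = (txns.groupBy("account_id", F.to_date("ts").alias("day"))
                 .agg(F.sum("amount").alias("total_amount")))
    daily.write.mode("overwrite").parquet("/data/daily_totals")      # placeholder path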

Scalable XGBoost on GPU Clusters

XGBoost is a popular open-source implementation of gradient boosting tree algorithms. In this talk, we walk through some of the new features in XGBoost that help us train better models, and explain how to scale up the pipeline to larger datasets with GPU clusters.

It is challenging to train gradient boosting models with the growing size and complexity of data. The latest XGBoost introduces categorical data support to help data scientists work with non-numerical data without the need for encoding. The new XGBoost can train multi-output models to handle datasets with non-exclusive class labels and multi-target regression. XGBoost has also introduced a new AUC implementation that supports more model types and features a robust approximation in distributed environments.

The latest XGBoost has significantly improved its built-in GPU support for scalability and performance. Data loading and processing have been improved for greater memory efficiency, enabling users to handle larger datasets. GPU-based model training is over 2x faster compared to past versions. The performance improvement also extends to model explanation: XGBoost added GPU-based SHAP value computation, obtaining more than 10x speedup compared to the traditional CPU-based method. On Spark GPU clusters, end-to-end pipelines can now be accelerated on GPU from feature engineering in ETL to model training/inference in XGBoost.
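
A condensed, hedged sketch of these features through the scikit-learn style API is shown below; the dataset and parameters are illustrative, and gpu_hist assumes a CUDA-capable node (swap in "hist" on CPU).

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    # Illustrative data: one numeric and one categorical feature.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "spend": rng.normal(size=1000),
        "region": pd.Categorical(rng.choice(["us", "eu", "apac"], size=1000)),
    })
    y = (X["spend"] > 0).astype(int)

    # Native categorical support requires a hist-based tree method; gpu_hist trains on GPU.
    clf = xgb.XGBClassifier(
        tree_method="gpu_hist",
        enable_categorical=True,
        n_estimators=200,
        eval_metric="auc",
    )
    clf.fit(X, y)

    # SHAP-style per-feature contributions (one column per feature plus a bias term).
    booster = clf.get_booster()
    contribs = booster.predict(xgb.DMatrix(X, enable_categorical=True), pred_contribs=True)
    print(contribs.shape)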

We will walk through these XGBoost improvements with the newly released XGBoost packages from DMLC. Benchmark results will be shared, and example applications and notebooks will be provided for the audience to learn these new features in the cloud.

UIMeta: A 10X Faster Cloud-Native Apache Spark History Server

The Spark history server is an essential tool for monitoring, analyzing and optimizing Spark jobs.

The original history server is based on Spark's event log mechanism. A running Spark job continuously produces many kinds of events that describe its status. All the events are serialized into JSON and appended to a file, the event log. The history server has to replay the event log and rebuild the in-memory store needed for the UI. In a cluster, the history server also needs to periodically scan the event log directory and cache all the files' metadata in memory.
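
For reference, that event-log mechanism is switched on with two standard Spark settings, as in the sketch below; the log directory is a placeholder, and a history server would point spark.history.fs.logDirectory at the same path.

    from pyspark.sql import SparkSession

    # Emit JSON events for this application so a history server can replay them later.
    spark = (SparkSession.builder
        .appName("event-log-demo")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "hdfs:///spark-history")   # placeholder directory
        .getOrCreate())

    spark.range(10_000_000).selectExpr("sum(id) AS total").show()
    spark.stop()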

In practice, an event log contains a lot of information that is redundant for the history server. A long-running application can produce a huge event log that is costly to maintain and takes a long time to replay. In large-scale production, the number of jobs can be very large, which places a heavy burden on history servers, and building a scalable history server service requires additional development.

In this talk, we want to introduce a new history server based on UIMeta. UIMeta is a wrapper around the KVStore objects needed by a Spark UI. A job produces a UIMeta log by serializing the UIMeta in stages. A UIMeta log is approximately 10x smaller and 10x faster to replay than the original event log file. Benefiting from this performance, we developed a new stateless history server that needs no directory scan. UIMeta Service has now taken the place of the original history server and serves millions of jobs per day at ByteDance.

Swedbank: Enterprise Analytics in Cloud

Swedbank is the largest bank in Sweden and the third largest in the Nordics. They have about 7-8 million customers across retail, mortgage, and investment (pensions). One of the key drivers for the bank was to look at data across all silos and build analytics to drive their ML models, but they couldn't. That's when Swedbank made a strategic decision to go to the cloud and make bets on Databricks, Immuta, and Azure.

Enterprise analytics in cloud is an initiative to move Swedbank's on-premises Hadoop-based data lake into the cloud to provide improved analytical capabilities at scale. The strategic goals of the “Analytics Data Lake” are:

  • Advanced analytics: Improve analytical capabilities in terms of functionality, reduce analytics time to market and enable better predictive modelling
  • A Catalyst for Sharing Data: Make data Visible, Accessible, Understandable, Linked, and Trusted
  • Technical advancements: Future-proof with the ability to add new tools/libraries and support for 3rd-party solutions for Deep Learning/AI

To achieve these goals, Swedbank had to migrate existing capabilities and application services to Azure Databricks and implement Immuta as its unified access control plane. A “data discovery” space was created for data scientists to be able to come and scan (new) data, and to develop, train and operationalise ML models. To meet these goals, Swedbank requires dynamic and granular data access controls that mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service data discovery and analytics. Protection of sensitive data is key to enabling Swedbank to support key financial services use cases.

The presentation will focus on this journey, calling out key technical challenges, learnings, and benefits observed.

US Air Force: Safeguarding Personnel Data at Enterprise Scale

The US Air Force VAULT platform is a cloud-native enterprise data platform designed to provide the Department of the Air Force (DAF) with a robust, interoperable, and secure data environment. The strategic goals of VAULT include:

  • Leading Data Culture - Increase data use and literacy to improve efficiency and effectiveness of decisions, readiness, mission operations, and cybersecurity.
  • A Catalyst for Sharing Data - Make data Visible, Accessible, Understandable, Linked, and Trusted (VAULT).
  • Driving Data Capabilities - Increase access to the right combination of state-of-the-art technologies needed to best utilize data.

To achieve these goals, the VAULT team created a self-service platform to onboard data, extract, transform and load it, perform data analytics, machine learning and visualization, and manage data governance. Supporting over 50 tenants across NIPR and SIPR adds complexity to maintaining data security while ensuring data can be shared and utilized for analytics. To meet these goals, VAULT requires dynamic and granular data access controls that mitigate data exposure (due to compromised accounts, attackers monitoring a network, and other threats) while empowering users via self-service analytics. Protection of sensitive data is key to enabling VAULT to support key use cases such as personnel readiness: optimally placing Airmen trainees to meet production goals, increase readiness, and match trainees to their preferences.
