talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

561

Filtering by: Databricks

Sessions & talks

Showing 351–375 of 561 · Newest first

Building a Lakehouse on AWS for Less with AWS Graviton and Photon

2022-07-19 · Watch video

AWS Graviton processors are custom-designed by AWS to enable the best price performance for workloads in Amazon EC2. In this session we will review benchmarks that demonstrate how AWS Graviton based instances run Databricks workloads at a lower price and better performance than x86-based instances on AWS, and when combined with Photon, the new Databricks engine, the price performance gains are even greater. Learn how you can optimize your Databricks workloads on AWS and save more.

Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Building an Analytics Lakehouse at Grab

2022-07-19 · Watch video

Grab shares the story of their Lakehouse journey, from the drivers behind their shift to this new paradigm, to lessons learned along the way. From a starting point of a siloed, data warehouse centric architecture that had inherent challenges with scalability, performance and data duplication, Grab has standardized upon Databricks to serve as an open and unified Lakehouse platform to deliver insights at scale, democratizing data through the rapid deployment of AI and BI use cases across their operations.


Building and Scaling Machine Learning-Based Products in the World's Largest Brewery

2022-07-19 · Watch video

In this session we will present how Anheuser-Busch InBev (Brazil) has been developing and growing an ML platform product to democratize and evolve AI usage across the full company. Our cutting-edge intelligence product offers a set of tools and processes to facilitate everything from exploratory data analysis to the development of state-of-the-art machine learning algorithms. We designed a simple, scalable, and performant product that covers the full data science/machine learning lifecycle, with process abstraction, a feature store, a fast path to production, and pipeline orchestration. Today we maintain and continually evolve a solution that is used by cross-functional teams in several countries; it helps data scientists create their solutions in a cooperative setting and supports data engineers in monitoring the model pipelines.


Building an Operational Machine Learning Organization from Zero and Leveraging ML for Crypto Security

2022-07-19 · Watch video

BlockFi is a cryptocurrency platform that allows its clients to grow wealth through various financial products, including loans, trading, and interest accounts. In this presentation, we will showcase our journey adopting Databricks to build an operational nerve center for analytics across the company. We will demonstrate how to build a cross-functional organization and solve key business problems to earn executive buy-in. We will showcase two of the early successes we've had using machine learning and data science to solve key business challenges in the domains of cybersecurity and IT operations. On the security side, we will show how we use graph analytics to analyze millions of blockchain transactions to identify dust attacks and account takeovers and to flag risky transactions. The IT operations use case will show how we use SARIMAX to forecast platform usage patterns, using hourly crypto prices and financial indicators, so we can scale our infrastructure.


Building Enterprise Scale Data and Analytics Platforms at Amgen

2022-07-19 · Watch video

Amgen has developed a suite of enterprise data & analytics platforms, powered by modern, cloud-native, and open-source technologies, that have played a vital role in building game-changing analytics capabilities within the organization. Our platforms include a mature data lake with extensive self-service capabilities, a data fabric with semantically connected data, a data marketplace for advanced cataloging, and an intelligent enterprise search, among others, to solve a range of high-value business problems. In this talk, we (Amgen and our partner ZS Associates) will share learnings from our journey so far and best practices for building enterprise-scale data & analytics platforms, and describe several business use cases and how we leverage modern technologies such as Databricks to enable our business teams. We will cover use cases related to Delta Lake, microservices, platform monitoring, fine-grained security, and more.


Practical Data Governance in a Large Scale Databricks Environment

2022-07-19 · Watch video

Learn from two governance and data practitioners what it takes to do data governance at enterprise scale. This matters because the power of data science lies in the ability to tap into any type of data source and turn it into pure value, yet that power is often at odds with its key enablers, scale and governance, and we must find new ways to bring the focus back to unlocking the insights inside the data. In this session, we will share new agile practices for rolling out governance policies that balance governance and scale. We will unpack how to deliver centralized, fine-grained governance for ML and data transformation workloads that actually empowers data scientists, in an enterprise Databricks environment that ensures privacy and compliance across hundreds of datasets. Since automation is key to scale, we will also explore how we successfully automated security and governance.


Predicting and Preventing Machine Downtime with AI and Expert Alerts

2022-07-19 · Watch video

John Deere's Expert Alerts is a proactive monitoring system that notifies dealers of potential machine issues. This allows technicians to diagnose issues remotely and fix them before they become a problem, thereby avoiding multiple trips by a repair technician and minimizing downtime. John Deere ingests petabytes of data every year from its connected machines across the globe. To improve the availability, uptime, and performance of John Deere machines globally, our data scientists perform machine data analysis on our data lake in an efficient and scalable manner. The result is dramatically improved mean time to repair, decreased downtime thanks to predictive alerts, improved cost efficiency, improved customer satisfaction, and great yields and results for John Deere's customers.

You will learn:

  • What Expert Alerts are at John Deere and what challenges they seek to solve
  • How John Deere migrated from a legacy alerting application to a flexible and scalable Lakehouse framework
  • Getting stakeholder buy-in and converting business logic to AI
  • Overcoming the scale problem: processing petabytes of data within SLAs
  • What is next for Expert Alerts

Other Resources:

  • Two Minute Overview of Expert Alerts: https://www.youtube.com/watch?v=yFnMhMhipXA
  • Expert Alerts: Dealer Execution - John Deere: https://www.youtube.com/watch?v=2FGz0lx4UiM
  • Ben Burgess FarmSight services - Expert Alerts from John Deere: https://www.youtube.com/watch?v=BrQhX4oCsSw
  • U.S. Farm Report Driving Technology: John Deere Expert Alerts: https://www.youtube.com/watch?v=h8IGtk61EDo


Predicting Repeat Admissions to Substance Abuse Treatment with Machine Learning

2022-07-19 · Watch video

In our presentation, we will walk through a model created to predict repeat admissions to substance abuse treatment centers. The goal is to predict early who will be at high risk for relapse so care can be tailored to put additional focus on these patients. We used the Treatment Episode Data Set (TEDS) Admissions data set, which includes every publicly funded substance abuse treatment admission in the US.

While longitudinal data is not available in the data set, we were able to predict with 88% accuracy and an F-score of 0.85 whether admissions were first or repeat admissions. Our solution used a scikit-learn Random Forest model and leveraged MLFlow to track model metrics and choose the most effective model. Our pipeline tested over 100 models of different types, ranging from gradient-boosted trees to deep neural networks in TensorFlow.
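A minimal sketch of this modeling setup, using synthetic stand-in data rather than the TEDS data set, and printing metrics instead of logging them to MLFlow:

```python
# Sketch: train a scikit-learn Random Forest classifier and report the
# accuracy and F-score, mirroring the evaluation described above.
# The data here is synthetic, not the TEDS admissions data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

acc = accuracy_score(y_te, pred)   # in the talk's pipeline, logged to MLFlow
f1 = f1_score(y_te, pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

In the pipeline described above, each candidate model's metrics would be logged to an MLFlow run so the most effective model can be compared and selected.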

To improve model interpretability, we used Shapley values to measure which variables were most important for predicting readmission. These model metrics along with other valuable data are visualized in an interactive Power BI dashboard designed to help practitioners understand who to focus on during treatment. We are in discussions with companies and researchers who may be able to leverage this model in substance abuse treatment centers in the field.


Presto On Spark: A Unified SQL Experience

2022-07-19 · Watch video

Presto was originally designed to run interactive queries against data warehouses, but it has since evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads. However, Presto doesn't scale to very large and complex batch pipelines. Presto Unlimited was designed to address these scalability challenges, but it didn't fully solve fault tolerance, isolation, and resource management.

Spark is the tool of choice across the industry for running large-scale, complex batch ETL pipelines. This motivated the development of Presto on Spark, which runs Presto as a library submitted with spark-submit to a Spark cluster. It leverages Spark for shuffle scaling, worker execution, and resource management, thereby eliminating any query conversion between interactive and batch use cases. This solution enables a performant and scalable platform with a seamless end-to-end experience for exploring and processing data.
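Based on the Presto project's documented launcher, an invocation might look roughly like this; the artifact names, paths, and option values below are illustrative placeholders, not details taken from the talk.

```shell
# Illustrative: launching a Presto query on a Spark cluster via spark-submit.
# Jar/tarball names, config paths, catalog, and schema are placeholders.
spark-submit \
  --master yarn \
  --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
  presto-spark-launcher-*.jar \
  --package presto-spark-package-*.tar.gz \
  --config ./config.properties \
  --catalogs ./catalogs \
  --catalog hive \
  --schema default \
  --file query.sql
```

The same SQL file that an analyst ran interactively against Presto can be submitted this way, which is what removes the Presto-to-Spark-SQL conversion step described below.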

Many analysts at Intuit use Presto to explore data in the data lake/S3 and Spark for batch processing. These analysts previously spent several hours converting exploration SQL written for Presto into Spark SQL in order to operationalize and schedule it as data pipelines. Presto on Spark is now used by analysts at Intuit to run thousands of critical jobs. Eliminating query conversion has improved analysts' productivity and empowered them to deliver insights at high speed.

Benefits from this session:

  • Attendees will learn about the Presto on Spark architecture
  • Attendees will learn when to use Spark's execution engine with Presto
  • Attendees will learn how Intuit runs thousands of Presto jobs daily on the Databricks platform, and how to apply this to their own work


Productionizing Ethical Credit Scoring Systems with Delta Lake, Feature Store and MLFlow

2022-07-19 · Watch video

Fairness, Ethics, Accountability and Transparency (FEAT) are must-haves for high-stakes machine learning models. In particular, models within the Financial Services industry such as those that assign credit scores can impact people’s access to housing and utilities and even influence their social standing. Hence, model developers have a moral responsibility to ensure that models do not systematically disadvantage any one group. Nevertheless, implementing such models in industrial settings remains challenging. A lack of concrete guidelines, common standards and technical templates make evaluating models from a FEAT perspective unfeasible. To address these implementation challenges, the Monetary Authority of Singapore (MAS) set up the Veritas Initiative to create a framework for operationalising the FEAT principles, so as to guide the responsible development of AIDA (Artificial Intelligence and Data Analytics) systems.

In January 2021, MAS announced the successful conclusion of Phase 1 of the Veritas Initiative. Deliverables included an assessment methodology for the Fairness principle and open source code for applying Fairness metrics to two use cases - customer marketing and credit scoring. In this talk, we demonstrate how these open-source examples, and their fairness metrics, might be put into production using open source tools such as Delta Lake and MLFlow. Although the Veritas Framework was developed in Singapore, the ethical framework is applicable across geographies.

By doing this, we illustrate how ethical principles can be operationalised, monitored and maintained in production, thus moving beyond only accuracy-based metrics of model performance and towards a more holistic and principled way of developing and productionizing machine learning systems.
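As a toy example of the kind of fairness check that can be operationalised and monitored in production, here is a generic demographic parity difference computation; this is an illustration of one common fairness metric, not the Veritas toolkit itself.

```python
# Minimal illustration of a common fairness metric: demographic parity
# difference, the gap in positive-decision rates between two groups.
# Data and group labels are made up for the example.
def positive_rate(preds, groups, group):
    """Fraction of positive predictions within one group."""
    selected = [p for p, g in zip(preds, groups) if g == group]
    return sum(selected) / len(selected)

preds  = [1, 0, 1, 1, 0, 1, 0, 0]   # model decisions (1 = credit approved)
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

dpd = positive_rate(preds, groups, "A") - positive_rate(preds, groups, "B")
print(f"demographic parity difference: {dpd:.2f}")  # 0.75 - 0.25 = 0.50
```

In production, a metric like this would be recomputed on each scoring batch and logged (e.g. to MLFlow) alongside accuracy, so drift in fairness is monitored just like drift in performance.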


Protecting PII/PHI Data in Data Lake via Column Level Encryption

2022-07-19 · Watch video

Data breaches are a concern for any company that collects data, including Northwestern Mutual. Every measure is taken to prevent identity theft and fraud against our customers; however, these measures are not sufficient if the security around them is not updated periodically. Multiple layers of encryption are the most common approach used to avoid breaches; however, unauthorized internal access to this sensitive data still poses a threat.

This presentation will walk you through the following steps:

  • Designing encryption at the column level
  • How to protect PII data that is used as a key for joins
  • Enabling authorized users to decrypt data at run time
  • Rotating the encryption keys if needed

At Northwestern Mutual, a combination of Fernet and AES encryption libraries, user-defined functions (UDFs), and Databricks secrets was used to develop a process to encrypt PII information. Access was provided only to those with a business need to decrypt it, which helps avoid the internal threat. This was also done without duplicating data or metadata (views/tables). Our goal is to help you understand how you can build a secure data lake for your organization that eliminates threats of data breach, both internal and external. Associated blog: https://databricks.com/blog/2020/11/20/enforcing-column-level-encryption-and-avoiding-data-duplication-with-pii.html
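A minimal sketch of the Fernet round trip behind such a column-level scheme. The key is generated inline here as an assumption for the example; in Databricks it would be fetched from a secret scope, and the two functions would be registered as UDFs and applied to the sensitive columns.

```python
# Sketch of column-level encryption with Fernet. In production the key
# would come from Databricks secrets (not be generated inline), and
# encrypt_col/decrypt_col would be registered as Spark UDFs.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # stand-in for a secret-scope lookup
f = Fernet(key)

def encrypt_col(value: str) -> str:
    """Encrypt one cell of a PII column."""
    return f.encrypt(value.encode()).decode()

def decrypt_col(token: str) -> str:
    """Decrypt one cell; only callable by users granted access to the key."""
    return f.decrypt(token.encode()).decode()

ssn = "123-45-6789"
token = encrypt_col(ssn)
print(token != ssn, decrypt_col(token) == ssn)
```

Note that Fernet output is non-deterministic (it embeds a timestamp and random IV), so a column encrypted this way cannot serve directly as a join key; handling PII join keys deterministically is one of the problems the steps above call out.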


Pushing the limits of scale/performance for enterprise-wide analytics: A fire-side chat with Akamai

2022-07-19 · Watch video

With the world's most distributed compute platform, from cloud to edge, Akamai makes it easy for businesses to develop and run applications while keeping experiences closer to users and threats farther away. So when its legacy Hadoop-like infrastructure was reaching its capacity limits and needed to scale, while keeping global operations running uninterrupted, Akamai partnered with Microsoft and Databricks to migrate to Azure Databricks.


PySpark in Apache Spark 3.3 and Beyond

2022-07-19 · Watch video

PySpark has rapidly evolved with the momentum of Project Zen, introduced in Apache Spark 3.0. We improved error messages, added type hints for autocompletion, implemented visualization, and more. Most importantly, the Pandas API on Spark was introduced in Apache Spark 3.2, exposing the pandas API on top of Apache Spark, and it has gained a lot of popularity.

In Apache Spark 3.3, the effort of Project Zen continued, and PySpark has many cool changes, such as broader API coverage and a faster default index in the Pandas API on Spark, datetime.timedelta support, a new PyArrow batch interface, better autocompletion, a Python and Pandas UDF profiler, and new error classification.

In this talk, we will introduce what is new in PySpark in Apache Spark 3.3, and what comes next beyond Apache Spark 3.3, covering the current effort and roadmap for PySpark.


Quick to Production with the Best of Both Apache Spark and Tensorflow on Databricks

2022-07-19 · Watch video

Using TensorFlow with big datasets has been an impediment to building deep learning models, due to the added complexity of running it in a distributed setting and the complicated MLOps code involved; recent advancements in TensorFlow 2 and some extension libraries for Spark have now simplified much of this. This talk focuses on how we can leverage the best of both Spark and TensorFlow to build machine learning and deep learning models with minimal MLOps code, letting Spark handle the grunt work and enabling us to focus more on feature engineering and building the model itself. This design also enables us to use any of the libraries in the TensorFlow ecosystem (like TensorFlow Recommenders) with the same boilerplate code.

For businesses like ours, fast prototyping and quick experimentation are key to building completely new experiences in an efficient and iterative way. It is always preferable to have tangible results before putting more resources into a project; this design gives us that capability and lets us spend more time on research, building models, testing quickly, and iterating rapidly. It also gives us the flexibility to use our framework of choice at any stage of the machine learning lifecycle. In this talk, we will go through some of the best and newest features of both Spark and TensorFlow: how to go from single-node training to distributed training with very few extra lines of code, how to leverage MLFlow as a central model store, and finally, how to use these models for batch and real-time inference.


Radical Speed on the Lakehouse: Photon Under the Hood

2022-07-19 · Watch video

Many organizations are standardizing on the lakehouse; however, this new architecture poses challenges for the underlying query execution engine that accesses structured and unstructured data. The execution engine needs to provide the performance of a data warehouse and the scalability of data lakes. To ensure optimal performance, the Databricks Lakehouse Platform offers Photon. This next-generation vectorized query execution engine outperforms existing data warehouses on SQL workloads and implements a more general execution framework for efficient data processing, with support for the Apache Spark™ API. With Photon, analytical queries see a 3 to 5x speed increase, with a 40% reduction in compute hours for ETL workloads. In this session, we will dive into Photon, describe its integration with the Databricks Platform and Apache Spark™ runtimes, talk through customer use cases, and show how your SQL and DataFrame workloads can benefit from the performance of Photon.


Realize the Promise of Streaming with the Databricks Lakehouse Platform

2022-07-19 · Watch video
Erica Lee (Upwork)

Streaming is the future of all data pipelines and applications. It enables businesses to make data-driven decisions sooner and react faster, develop data-driven applications previously considered impossible, and deliver new and differentiated experiences to customers. However, many organizations have not realized the full promise of streaming because it requires them to completely redevelop their data pipelines and applications on new, complex, proprietary, and disjointed technology stacks.

The Databricks Lakehouse Platform is a simple, unified, and open platform that supports all streaming workloads, from ingestion and ETL to event processing, event-driven applications, and ML inference. In this session, we will discuss the streaming capabilities of the Lakehouse Platform and demonstrate how easy it is to build end-to-end, scalable streaming pipelines and applications to fulfill the promise of streaming for your business. You will also hear from Erica Lee, VP of ML at Upwork, the world's largest work marketplace, who will share how the Upwork team uses Databricks to enable real-time predictions by computing ML features in a continuous streaming manner.


Real-Time Search and Recommendation at Scale Using Embeddings and Hopsworks

2022-07-19 · Watch video

The dominant paradigm today for real-time personalized recommendations and personalized search is the retrieval and ranking architecture based on embeddings. It is a fan-out architecture where a single query produces a storm of requests on the backend: a single query searches through millions of items to retrieve hundreds of candidates, which are then enriched by a feature store and ranked so that only a few recommended items are presented to the user. A search should return in much less than one second. Retrieval and ranking architectures need significant infrastructure, an embedding store and a feature store, to provide both the required scale and real-time performance. In this talk, we will introduce a scalable retrieval and ranking serving architecture built on open-source technology: Hopsworks Feature Store, OpenSearch, and KServe. We will describe how to build and operate personalized search and recommendation systems using a retrieval model based on a two-tower embedding model and a ranking model based on gradient-boosted trees. We will also show how you can train your embeddings and build your embedding store index using Hopsworks and Apache Spark.

Attend this session to learn:

  • how to build a scalable, real-time retrieval and ranking recommender system using open-source platforms;
  • how to train item/user embedding models and ranking models;
  • how to put all these pieces together in an end-to-end solution for training and operating a scalable recommender/search engine.
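The retrieval step of such a two-tower system can be sketched with plain dot products; the random embeddings below stand in for trained tower outputs, and a real deployment would query an OpenSearch/ANN index rather than brute-force scanning every item.

```python
# Toy retrieval step from a two-tower model: score one user embedding
# against an item embedding index and take the top-k candidates, which
# would then go to the feature store and the ranking model.
import numpy as np

rng = np.random.default_rng(42)
n_items, dim, k = 10_000, 64, 100

item_index = rng.normal(size=(n_items, dim))            # item tower outputs
item_index /= np.linalg.norm(item_index, axis=1, keepdims=True)

user = rng.normal(size=dim)                             # user tower output
user /= np.linalg.norm(user)

scores = item_index @ user                              # cosine similarity
candidates = np.argpartition(-scores, k)[:k]            # top-k (unordered)
candidates = candidates[np.argsort(-scores[candidates])]  # sort by score
print(candidates[:5])
```

The `argpartition` step is what an approximate-nearest-neighbour index replaces at scale: it avoids ranking all ten thousand (or ten million) items when only the top hundred are needed downstream.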


Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster

2022-07-19 · Watch video

This talk discusses “software-defined assets”, a declarative approach to orchestration and data management that makes it drastically easier to trust and evolve datasets and ML models. Dagster is an open source orchestrator built for maintaining software-defined assets.

In traditional data platforms, code and data are only loosely coupled. As a consequence, deploying changes to data feels dangerous, backfills are error-prone and irreversible, and it’s difficult to trust data, because you don’t know where it comes from or how it’s intended to be maintained. Each time you run a job that mutates a data asset, you add a new variable to account for when debugging problems.

Dagster proposes an alternative approach to data management that tightly couples data assets to code: each table or ML model corresponds to the function that's responsible for generating it. This results in a “Data as Code” approach that mimics the “Infrastructure as Code” approach that's central to modern DevOps. Your git repo becomes your source of truth on your data, so pushing data changes feels as safe as pushing code changes. Backfills become easy to reason about. You trust your data assets because you know how they're computed and can reproduce them at any time. The role of the orchestrator is to ensure that physical assets in the data warehouse match the logical assets that are defined in code, so each job run is a step towards order.

Software-defined assets is a natural approach to orchestration for the modern data stack, in part because dbt models are a type of software-defined asset.

Attendees of this session will learn how to build and maintain lakehouses of software-defined assets with Dagster.
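The reconciliation idea can be sketched in a few lines of plain Python; this is a conceptual toy to show assets-as-functions, not the Dagster API.

```python
# Conceptual sketch (plain Python, not Dagster): each asset is a function
# of its upstream assets; "reconciling" an asset means materializing its
# dependencies first, so storage converges to what the code defines.
assets = {}  # materialized values (stand-in for tables in a warehouse)

defs = {
    "raw":     (lambda: [1, 2, 3],                 []),
    "cleaned": (lambda raw: [x * 10 for x in raw], ["raw"]),
    "summary": (lambda cleaned: sum(cleaned),      ["cleaned"]),
}

def reconcile(name):
    fn, deps = defs[name]
    inputs = [reconcile(d) for d in deps]  # materialize upstream first
    assets[name] = fn(*inputs)
    return assets[name]

reconcile("summary")
print(assets)  # {'raw': [1, 2, 3], 'cleaned': [10, 20, 30], 'summary': 60}
```

Because each asset is defined purely as a function of its upstream assets, changing the `cleaned` function and re-reconciling regenerates everything downstream, which is why backfills become easy to reason about.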


Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

2022-07-19 · Watch video

Data is the key component of any analytics, AI, or ML platform. Organizations may not succeed without a platform that can source, transform, quality-check, and present data in a reportable format that can drive actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build data storage (Redshift) in a form that can be easily consumed by AI/ML programs, using AWS services in combination with open-source software (Spark) and the Enterprise Edition of Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running three types of pipelines for over six years, with over 400 nightly batch jobs, for $1000/mo: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.


Scaling AI Workloads with the Ray Ecosystem

2022-07-19 · Watch video

Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and require distributed execution. Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales Python applications and ML workloads from a laptop to a cluster, with an emphasis on the unique performance challenges of ML/AI systems. It is now used in many production deployments.

This talk will give an overview of Ray, its architecture, core concepts, and primitives such as remote tasks and actors; briefly discuss Ray's native libraries (Ray Tune, Ray Train, Ray Serve, Ray Datasets, RLlib); and cover Ray's growing ecosystem for scaling your Python or ML workloads.

Through a demo using XGBoost for classification, we will demonstrate how you can scale training, hyperparameter tuning, and inference from a single node to a cluster, with a tangible performance difference when using Ray.

The takeaways from this talk are:

  • Ray's architecture, core concepts, and primitives and patterns
  • Why distributed computing will be the norm, not an exception
  • How to scale your ML workloads with Ray libraries: training on a single node vs. a Ray cluster, using XGBoost with and without Ray
  • Hyperparameter search and tuning, using XGBoost with Ray and Ray Tune
  • Inference at scale, using XGBoost with and without Ray


Scaling Deep Learning on Databricks

2022-07-19 · Watch video

Training modern deep learning models in a timely fashion requires leveraging GPUs to accelerate the process. However, ensuring that this expensive hardware is properly utilised and scales efficiently is complex. All the steps, from data storage and loading through preprocessing and finally distributing the model training process, require careful thought.

To reduce the cost of training a model, we need to make the best use of our hardware resources. Typically, the GPUs we rely on are memory-constrained, with far smaller amounts of VRAM available relative to CPU RAM. As such, we need to leverage a variety of libraries to help ensure that we can keep our GPUs running.

Through the use of libraries like Petastorm to handle the data loading side, and PyTorch Lightning and Horovod to handle the model distribution side, we can leverage commodity Spark clusters to accelerate the training process for our deep learning models.
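The common idea behind those libraries is keeping the expensive accelerator busy by overlapping data loading with compute. A minimal pure-Python sketch of a prefetching loader, illustrative only; Petastorm and PyTorch's data loaders implement far more sophisticated versions of this pattern:

```python
import queue
import threading

def prefetching_loader(batches, buffer_size=2):
    """Yield batches while a background thread loads ahead,
    so compute on one batch overlaps with I/O for the next."""
    q = queue.Queue(maxsize=buffer_size)
    SENTINEL = object()  # marks end of the stream

    def producer():
        for batch in batches:
            q.put(batch)  # blocks when the buffer is full (backpressure)
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        yield item

# Usage: iterate as you would over any loader.
out = list(prefetching_loader(range(5)))
```

The bounded queue is the key design choice: it gives backpressure so the loader never buffers more batches than the (memory-constrained) device can afford.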

Scaling Salesforce In-Memory Streaming Analytics Platform for Trillion Events Per Day

Scaling Salesforce In-Memory Streaming Analytics Platform for Trillion Events Per Day

2022-07-19 Watch
video

In general, in-memory pipelines scale quite well in Spark if we apply the same processing logic to all records. For Salesforce, the major challenge is that we need to apply custom logic specific to each Log Record Type (LRT), including applying a different schema while processing each event. To perform such LRT-specific logic, we need a mechanism to collect LRT-specific data in memory so that we can apply custom logic to each collection.

We normally receive around 50K files in S3 every 5 minutes, containing around 4 billion log events. One approach is to create a DataFrame from the 50K files, group events by LRT, and apply per-LRT filters to create child DataFrames. A major challenge is that the LRT data distribution is very skewed, so we need an efficient in-memory partitioning strategy to distribute the data. Simply applying filters on the parent DataFrame leaves many child DataFrames with empty partitions due to the large skew, which creates too many empty tasks while processing the child DataFrames. We therefore need a partitioning scheme that distributes data and filters by log type without creating unnecessary empty partitions in child DataFrames, as well as a scheduling algorithm that processes all child DataFrames while utilizing the cluster efficiently.

We have implemented a custom Spark Streaming source that reads SQS notifications and then reads the new files in S3, designed to scale with ingestion volume. This talk will cover how we performed a Spark range partition based on the size distribution of the incoming data while applying schema-specific transformation logic, and will explain the optimizations at each stage of processing that let us meet our latency goal.
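The skew problem the abstract describes can be illustrated with a simple size-aware placement sketch in pure Python. This is a hypothetical greedy scheme for exposition only, not Salesforce's actual implementation, which uses Spark's range partitioning over the observed size distribution:

```python
def assign_partitions(record_counts, num_partitions):
    """Greedily assign log-record types (LRTs) to partitions by size,
    so heavily skewed types don't produce hot or empty partitions.

    record_counts: {lrt_name: event_count}
    Returns {lrt_name: partition_id}.
    """
    loads = [0] * num_partitions
    assignment = {}
    # Place the heaviest types first onto the least-loaded partition.
    for lrt, count in sorted(record_counts.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        assignment[lrt] = target
        loads[target] += count
    return assignment

# A skewed distribution: one LRT dominates the event volume.
counts = {"login": 900, "api": 500, "search": 100, "ui": 50}
plan = assign_partitions(counts, 2)
```

With a naive hash of the four LRTs, both heavy types could land on one partition; sizing-aware placement keeps the per-partition load roughly balanced (900 vs. 650 here) and produces no empty partitions.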

Scaling Up Machine Learning in Instacart Search for the 2020 Surge in Online Shopping

Scaling Up Machine Learning in Instacart Search for the 2020 Surge in Online Shopping

2022-07-19 Watch
video

As the online grocery business accelerated in 2020, Instacart search, which supports one of the largest catalogs of grocery items in the world, started facing new challenges. We experienced a sudden surge in the number of users, retailers, and traffic to our search engine. As a result, the scale of our data grew manifold, and the predictive performance of our models started degrading due to the lack of historical data for the many new retailers and users joining Instacart. New users searched for queries we had never seen before, and the new retailers on our platform were quite diverse, ranging from local grocery stores to office supplies, pharmacies, and Halloween stores: categories our models had never been trained on.

As our relatively small team of four engineers tried to build new models to address these issues, we faced a number of operational challenges. This talk will focus on the challenges we encountered in this new world, including drift in our data and cold-start issues. We will cover the architecture of our search engine and the issues we faced in training and serving our ML models as scale increased. We will talk about how we overcame these issues by using more sophisticated models trained and served on a more robust infrastructure and technical stack. We will also cover the iterations on our ML ranking models to adapt to this new world, through which we successfully improved the quality of search results and our revenue while operating in a robust production environment.

Scaling Your Workloads with Databricks Serverless

Scaling Your Workloads with Databricks Serverless

2022-07-19 Watch
video

Databricks SQL provides a first-class user experience for BI and SQL directly on the lakehouse platform. But you still need to administer and maintain clusters of virtual machines. What if you could focus on your Databricks SQL queries and never need to worry about the underlying compute infrastructure? Learn how Databricks Serverless, built into the Databricks Lakehouse Platform, eliminates cluster management, provides instant compute, and lowers total cost of ownership for Databricks SQL. In this session, you will see demos, hear from customers, learn how Databricks Serverless works under the hood, be equipped with everything you need to get started – and ultimately get the best out of Databricks Serverless.

Search and Aggregations Made Easy with OpenSearch and NodeJS

Search and Aggregations Made Easy with OpenSearch and NodeJS

2022-07-19 Watch
video

In this session the audience will get both theoretical and practical knowledge on what OpenSearch is and how they can work with it by using its NodeJS client. This is a hands-on session where the audience is invited to follow along.

There will be an accompanying GitHub repository to allow the audience to follow me during or after the lecture.

During the session we will:

- Overview the OpenSearch architecture.
- Set up the cluster and prepare the NodeJS project.
- Load sample data (we'll use a dataset with 20k recipes).
- Explore different types of search queries: term-level, full-text, and boolean.
- Explore different types of aggregations: metric, bucket, and pipeline.
- Build visualisations with the help of OpenSearch Dashboards.
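The session builds its queries with the NodeJS client, but the underlying query DSL is plain JSON and identical across clients. The query shapes listed above can be sketched language-agnostically; here they are as Python dicts, with illustrative field names (`cuisine`, `title`) that stand in for whatever the recipe dataset actually uses:

```python
# Term-level query: exact match on a keyword field (not analyzed).
term_query = {"query": {"term": {"cuisine": "italian"}}}

# Full-text query: analyzed match on a text field.
match_query = {"query": {"match": {"title": "chicken soup"}}}

# Boolean query: must (scored) clauses combined with filter (unscored) clauses.
bool_query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "chicken"}}],
            "filter": [{"term": {"cuisine": "italian"}}],
        }
    }
}

# Bucket aggregation: count documents per cuisine; size=0 skips the hits
# themselves and returns only the aggregation buckets.
agg_query = {
    "size": 0,
    "aggs": {"per_cuisine": {"terms": {"field": "cuisine"}}},
}
```

Any of these bodies can be passed as-is to the NodeJS client's `client.search({ index, body })` or to any other OpenSearch client.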
