
Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked: 287

Filtering by: AI/ML

Sessions & talks

Showing 176–200 of 287 · Newest first

Orchestration Made Easy with Databricks Workflows

2022-07-19 · Watch video

Orchestrating and managing end-to-end production pipelines has remained a bottleneck for many organizations. Data teams spend too much time stitching pipeline tasks together and manually managing and monitoring the orchestration process, with heavy reliance on external or cloud-specific orchestration solutions, all of which slows down the delivery of new data. In this session, we introduce you to Databricks Workflows: a fully managed orchestration service for all your data, analytics, and AI, built into the Databricks Lakehouse Platform. Join us as we dive deep into the new workflow capabilities and their integration with the underlying platform. You will learn how to create and run reliable production workflows, centrally manage and monitor them, and implement recovery actions such as repair and rerun, as well as other new features.
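
As a rough illustration of the kind of workflow the session describes, the sketch below creates a two-task job through the Databricks Jobs 2.1 REST API; the workspace URL, token, cluster ID, and notebook paths are placeholders, not details from the talk.

    import requests

    HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
    TOKEN = "<personal-access-token>"                        # placeholder access token

    # A two-task workflow: "transform" runs only after "ingest" succeeds.
    job_spec = {
        "name": "nightly_pipeline",
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": "/Repos/pipelines/ingest"},
                "existing_cluster_id": "<cluster-id>",
            },
            {
                "task_key": "transform",
                "depends_on": [{"task_key": "ingest"}],
                "notebook_task": {"notebook_path": "/Repos/pipelines/transform"},
                "existing_cluster_id": "<cluster-id>",
            },
        ],
    }

    resp = requests.post(
        f"{HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=job_spec,
    )
    print(resp.json())  # contains the new job_id on success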

Scaling ML at CashApp with Tecton

2022-07-19 · Watch video

This is a joint talk given by CashApp and Tecton. CashApp’s mobile payment product has stringent technical requirements: scale, reliability, speed. ML-based recommendations are at the core of this service and pose a significant engineering challenge. This talk describes CashApp’s journey through various generations of its core ML capabilities, covering the technical and organizational challenges associated with building large-scale production recommendation systems. The talk finishes with a look at the latest generation of CashApp’s ML platform and highlights how Tecton’s real-time Feature Platform helps CashApp deliver world-class recommendations.

Accidentally Building a Petabyte-Scale Cybersecurity Data Mesh in Azure With Delta Lake at HSBC

2022-07-19 · Watch video
Ryan Harris (HSBC)

Due to the unique cybersecurity challenges that HSBC faces daily - from high data volumes to untrustworthy sources to the privacy and security restrictions of a highly regulated industry - the resulting architecture was an unwieldy set of disparate data silos. So how do we build an advanced cybersecurity analytics environment that enriches and transforms these myriad data sources into a unified, well-documented, robust, resilient, repeatable, scalable, maintainable platform that will empower the cyber analysts of the future, one that at the same time remains cost-effective and serves everyone from less-technical junior reporting users to senior machine learning engineers?

In this session, Ryan Harris, Principal Cybersecurity Engineer at HSBC, dives into the infrastructure and architecture employed, ranging from the landing zone concepts, secure access workstations, data lake structure, and isolated data ingestion, to the enterprise integration layer. In the process of building the data pipelines and lakehouses, we ended up building a hybrid data mesh leveraging Delta Lake. The result is a flexible, secure, self-service environment that is unlocking the capabilities of our humans.
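
The abstract includes no code; as a loose illustration of the Delta Lake pipeline pattern it refers to, here is a minimal PySpark sketch that lands raw security events and appends them to a curated Delta table. The paths and columns are hypothetical, not HSBC's.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()   # on Databricks, `spark` is already provided

    # Read raw events from a hypothetical landing zone and keep only well-formed records.
    raw = spark.read.json("/mnt/landing/firewall_logs/")
    curated = (raw
               .where(F.col("event_time").isNotNull())
               .withColumn("ingest_date", F.to_date("event_time")))

    # Append to a curated Delta table, partitioned for downstream analytics.
    (curated.write
        .format("delta")
        .mode("append")
        .partitionBy("ingest_date")
        .save("/mnt/curated/firewall_logs"))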

Automating Model Lifecycle Orchestration with Jenkins

2022-07-19 · Watch video

A key part of the ML lifecycle involves bringing a model to production. In regular software systems, this is accomplished via a CI/CD pipeline such as one built with Jenkins. However, integrating Jenkins into a typical DS/ML workflow is not straightforward for a number of reasons. In this hands-on talk, I will cover what Jenkins and CI/CD practices can bring to your ML workflows, demonstrate a few of these workflows, and share some best practices on how a bit of Jenkins can level up your MLOps processes.
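
The talk supplies the Jenkins specifics; purely as a sketch of the kind of step a Jenkins promotion stage might run, the snippet below gates model registration on a tracked metric using the MLflow client API. The run ID, metric threshold, and model name are hypothetical.

    import mlflow
    from mlflow.tracking import MlflowClient

    RUN_ID = "abc123def456"   # hypothetical run produced by an earlier training stage
    MIN_F1 = 0.80             # hypothetical quality gate enforced by the CI job

    client = MlflowClient()
    run = client.get_run(RUN_ID)

    # Promote the model only if the tracked metric clears the gate.
    if run.data.metrics.get("f1", 0.0) >= MIN_F1:
        version = mlflow.register_model(f"runs:/{RUN_ID}/model", "demo_model")
        client.transition_model_version_stage(
            name="demo_model", version=version.version, stage="Staging"
        )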

Building an Analytics Lakehouse at Grab

2022-07-19 · Watch video

Grab shares the story of their Lakehouse journey, from the drivers behind their shift to this new paradigm to lessons learned along the way. Starting from a siloed, data-warehouse-centric architecture with inherent challenges around scalability, performance, and data duplication, Grab has standardized on Databricks as an open and unified Lakehouse platform to deliver insights at scale, democratizing data through the rapid deployment of AI and BI use cases across their operations.

Building and Scaling Machine Learning-Based Products in the World's Largest Brewery

2022-07-19 · Watch video

In this session we will present how Anheuser-Busch InBev (Brazil) has been developing and growing an ML platform product to democratize and evolve AI usage across the whole company. Our cutting-edge intelligence product offers a set of tools and processes to facilitate everything from exploratory data analysis to the development of state-of-the-art machine learning algorithms. We designed a simple, scalable, and performant product that covers the full data science/machine learning lifecycle, with process abstraction, a feature store, a fast path to production, and pipeline orchestration. Today we maintain and continually evolve a solution that is used by cross-functional teams in several countries, helping data scientists create their solutions in a cooperative setting and supporting data engineers in monitoring the model pipelines.

Building an Operational Machine Learning Organization from Zero and Leveraging ML for Crypto Security

2022-07-19 · Watch video

BlockFi is a cryptocurrency platform that allows its clients to grow wealth through various financial products, including loans, trading, and interest accounts. In this presentation, we will showcase our journey adopting Databricks to build an operational nerve center for analytics across the company. We will demonstrate how to build a cross-functional organization and solve key business problems to earn executive buy-in. We will walk through two of the early successes we've had using machine learning and data science to solve key business challenges in the domains of cybersecurity and IT operations. On the security side, we will show how we are using graph analytics to analyze millions of blockchain transactions to identify dust attacks and account takeovers and to flag risky transactions. The operational IT use case will show how we are using SARIMAX with hourly crypto prices and financial indicators to forecast platform usage patterns and scale our infrastructure.
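
For readers unfamiliar with SARIMAX, the sketch below shows the general shape of such a forecast in statsmodels, using synthetic hourly usage data and a synthetic price series as the exogenous regressor; none of the data or model orders come from the talk.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Synthetic stand-ins for hourly platform usage and an exogenous price signal.
    idx = pd.date_range("2022-01-01", periods=24 * 30, freq="H")
    usage = pd.Series(np.random.poisson(100, len(idx)), index=idx)
    price = pd.Series(np.random.normal(30000, 500, len(idx)), index=idx)

    # Seasonal ARIMA with an exogenous regressor; 24-hour seasonality for hourly data.
    model = SARIMAX(usage, exog=price, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
    result = model.fit(disp=False)

    # Forecasting requires future values of the exogenous series (here also synthetic).
    future_idx = pd.date_range(idx[-1] + pd.Timedelta(hours=1), periods=24, freq="H")
    future_price = pd.Series(np.random.normal(30000, 500, 24), index=future_idx)
    forecast = result.forecast(steps=24, exog=future_price)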

Practical Data Governance in a Large Scale Databricks Environment

2022-07-19 · Watch video

Learn from two governance and data practitioners what it takes to do data governance at enterprise scale. This is critical: the power of data science is the ability to tap into any type of data source and turn it into value, yet that power is often at odds with its key enablers, scale and governance, and we must continually find new ways to bring the focus back to unlocking the insights inside the data. In this session, we will share new agile practices for rolling out governance policies that balance governance and scale. We will unpack how to deliver centralized, fine-grained governance for ML and data transformation workloads that actually empowers data scientists in an enterprise Databricks environment while ensuring privacy and compliance across hundreds of datasets. With automation being key to scale, we will also explore how we successfully automated security and governance.

Predicting and Preventing Machine Downtime with AI and Expert Alerts

2022-07-19 · Watch video

John Deere’s Expert Alerts is a proactive monitoring system that notifies dealers of potential machine issues. This allows technicians to diagnose issues remotely and fix them before they become a problem, avoiding multiple trips by a repair technician and minimizing downtime. John Deere ingests petabytes of data every year from its connected machines across the globe. To improve the availability, uptime, and performance of John Deere machines globally, our data scientists perform machine data analysis on our data lake in an efficient and scalable manner. The result is dramatically improved mean time to repair, decreased downtime through predictive alerts, improved cost efficiency, improved customer satisfaction, and better yields and results for John Deere’s customers.

You will learn:

  • What Expert Alerts are at John Deere and what challenges they seek to solve
  • How John Deere migrated from a legacy alerting application to a flexible and scalable Lakehouse framework
  • Getting stakeholder buy-in and converting business logic to AI
  • Overcoming the scale problem: processing petabytes of data within SLAs
  • What is next for Expert Alerts

Other Resources:

  • Two Minute Overview of Expert Alerts: https://www.youtube.com/watch?v=yFnMhMhipXA
  • Expert Alerts: Dealer Execution - John Deere: https://www.youtube.com/watch?v=2FGz0lx4UiM
  • Ben Burgess FarmSight services - Expert Alerts from John Deere: https://www.youtube.com/watch?v=BrQhX4oCsSw
  • U.S. Farm Report Driving Technology: John Deere Expert Alerts: https://www.youtube.com/watch?v=h8IGtk61EDo

Predicting Repeat Admissions to Substance Abuse Treatment with Machine Learning

2022-07-19 · Watch video

In our presentation, we will walk through a model created to predict repeat admissions to substance abuse treatment centers. The goal is to predict early who will be at high risk for relapse so care can be tailored to put additional focus on these patients. We used the Treatment Episode Data Set (TEDS) Admissions data set, which includes every publicly funded substance abuse treatment admission in the US.

While longitudinal data is not available in the data set, we were able to predict with 88% accuracy and an f-score of 0.85 which admissions were first or repeat admissions. Our solution used a scikit-learn Random Forest model and leveraged MLFlow to track model metrics to choose the most effective model. Our pipeline tested over 100 models of different types ranging from Gradient Boosted Trees to Deep Neural Networks in Tensorflow.

To improve model interpretability, we used Shapley values to measure which variables were most important for predicting readmission. These model metrics along with other valuable data are visualized in an interactive Power BI dashboard designed to help practitioners understand who to focus on during treatment. We are in discussions with companies and researchers who may be able to leverage this model in substance abuse treatment centers in the field.
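
The abstract names the main building blocks: a scikit-learn random forest, MLflow metric tracking, and Shapley values. Below is a generic sketch of how those pieces fit together, on synthetic data rather than the TEDS dataset or the authors' pipeline.

    import mlflow
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the admissions features and first-vs-repeat labels.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=200, random_state=42)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        mlflow.log_metric("f1", f1_score(y_test, preds))        # compare candidates by tracked metrics
        mlflow.sklearn.log_model(model, artifact_path="model")

    # Shapley values for interpretability: which features drive the readmission prediction.
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)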

Productionizing Ethical Credit Scoring Systems with Delta Lake, Feature Store and MLFlow

2022-07-19 · Watch video

Fairness, Ethics, Accountability and Transparency (FEAT) are must-haves for high-stakes machine learning models. In particular, models within the Financial Services industry such as those that assign credit scores can impact people’s access to housing and utilities and even influence their social standing. Hence, model developers have a moral responsibility to ensure that models do not systematically disadvantage any one group. Nevertheless, implementing such models in industrial settings remains challenging. A lack of concrete guidelines, common standards and technical templates make evaluating models from a FEAT perspective unfeasible. To address these implementation challenges, the Monetary Authority of Singapore (MAS) set up the Veritas Initiative to create a framework for operationalising the FEAT principles, so as to guide the responsible development of AIDA (Artificial Intelligence and Data Analytics) systems.

In January 2021, MAS announced the successful conclusion of Phase 1 of the Veritas Initiative. Deliverables included an assessment methodology for the Fairness principle and open source code for applying Fairness metrics to two use cases - customer marketing and credit scoring. In this talk, we demonstrate how these open-source examples, and their fairness metrics, might be put into production using open source tools such as Delta Lake and MLFlow. Although the Veritas Framework was developed in Singapore, the ethical framework is applicable across geographies.

By doing this, we illustrate how ethical principles can be operationalised, monitored and maintained in production, thus moving beyond only accuracy-based metrics of model performance and towards a more holistic and principled way of developing and productionizing machine learning systems.
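
As a minimal illustration of the kind of fairness metric such a pipeline might monitor alongside accuracy (not code from the Veritas deliverables), the sketch below computes a demographic parity gap on toy approval data and logs it with MLflow.

    import mlflow
    import pandas as pd

    # Toy scored credit decisions with a hypothetical sensitive attribute "group".
    decisions = pd.DataFrame({
        "group":    ["A", "A", "A", "B", "B", "B", "B"],
        "approved": [1,   0,   1,   1,   0,   0,   1],
    })

    # Demographic parity gap: difference in approval rates across groups.
    rates = decisions.groupby("group")["approved"].mean()
    gap = float(rates.max() - rates.min())

    with mlflow.start_run():
        mlflow.log_metric("demographic_parity_gap", gap)   # tracked alongside accuracy metrics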

Quick to Production with the Best of Both Apache Spark and Tensorflow on Databricks

2022-07-19 · Watch video

Using TensorFlow with big datasets has long been an impediment to building deep learning models because of the added complexity of running it in a distributed setting and the complicated MLOps code involved. Recent advancements in TensorFlow 2, together with several extension libraries for Spark, have now simplified much of this. This talk focuses on how we can leverage the best of both Spark and TensorFlow to build machine learning and deep learning models with minimal MLOps code, letting Spark handle the grunt work so we can focus on feature engineering and building the model itself. This design also lets us use any library in the TensorFlow ecosystem (such as TensorFlow Recommenders) with the same boilerplate code.

For businesses like ours, fast prototyping and quick experimentation are key to building completely new experiences in an efficient, iterative way, and it is always preferable to have tangible results before putting more resources into a project. This design gives us that capability and lets us spend more time on research, building models, testing quickly, and iterating rapidly. It also gives us the flexibility to use our framework of choice at any stage of the machine learning lifecycle. In this talk, we will go through some of the best and newest features of both Spark and TensorFlow, how to go from single-node training to distributed training with very few extra lines of code, how to leverage MLflow as a central model store, and finally how to use these models for batch and real-time inference.
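
As a rough sketch of the "MLflow as a central model store" plus batch-inference pattern the abstract mentions, with a toy Keras model and with `spark`, `features_df`, and `feature_cols` assumed to exist in the environment:

    import mlflow
    import tensorflow as tf

    # Toy Keras model standing in for the real one (training omitted for brevity).
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")

    # Log the model to MLflow so it can be loaded and scored anywhere.
    with mlflow.start_run() as run:
        mlflow.tensorflow.log_model(model, artifact_path="model")

    # Batch inference with Spark: wrap the logged model as a UDF.
    # `spark`, `features_df`, and `feature_cols` are hypothetical, not from the talk.
    predict_udf = mlflow.pyfunc.spark_udf(spark, f"runs:/{run.info.run_id}/model")
    scored = features_df.withColumn("prediction", predict_udf(*feature_cols))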

Realize the Promise of Streaming with the Databricks Lakehouse Platform

2022-07-19 · Watch video
Erica Lee (Upwork)

Streaming is the future of all data pipelines and applications. It enables businesses to make data-driven decisions sooner and react faster, develop data-driven applications previously considered impossible, and deliver new and differentiated experiences to customers. However, many organizations have not realized the full promise of streaming because it requires them to completely redevelop their data pipelines and applications on new, complex, proprietary, and disjointed technology stacks.

The Databricks Lakehouse Platform is a simple, unified, and open platform that supports all streaming workloads, ranging from ingestion and ETL to event processing, event-driven applications, and ML inference. In this session, we will discuss the streaming capabilities of the Lakehouse Platform and demonstrate how easy it is to build end-to-end, scalable streaming pipelines and applications that fulfill the promise of streaming for your business. You will also hear Erica Lee, VP of ML at Upwork, the world's largest work marketplace, share how the Upwork team uses Databricks to enable real-time predictions by computing ML features in a continuous streaming manner.
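
As a minimal, generic example of the kind of streaming pipeline the session refers to (table names and the checkpoint path are hypothetical; on Databricks the `spark` session is already available):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read a Delta table as a stream, apply a light transformation, and write it back out.
    raw = spark.readStream.format("delta").table("raw_events")
    cleaned = (raw
               .where(F.col("event_type").isNotNull())
               .withColumn("ingested_at", F.current_timestamp()))

    (cleaned.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/tmp/checkpoints/cleaned_events")
        .toTable("cleaned_events"))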

Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster

2022-07-19 · Watch video

This talk discusses “software-defined assets”, a declarative approach to orchestration and data management that makes it drastically easier to trust and evolve datasets and ML models. Dagster is an open source orchestrator built for maintaining software-defined assets.

In traditional data platforms, code and data are only loosely coupled. As a consequence, deploying changes to data feels dangerous, backfills are error-prone and irreversible, and it’s difficult to trust data, because you don’t know where it comes from or how it’s intended to be maintained. Each time you run a job that mutates a data asset, you add a new variable to account for when debugging problems.

Dagster proposes an alternative approach to data management that tightly couples data assets to code - each table or ML model corresponds to the function that’s responsible for generating it. This results in a “Data as Code” approach that mimics the “Infrastructure as Code” approach that’s central to modern DevOps. Your git repo becomes your source of truth on your data, so pushing data changes feels as safe as pushing code changes. Backfills become easy to reason about. You trust your data assets because you know how they’re computed and can reproduce them at any time. The role of the orchestrator is to ensure that physical assets in the data warehouse match the logical assets that are defined in code, so each job run is a step towards order.
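
To make the idea concrete, here is a minimal sketch of two software-defined assets in Dagster (assuming a recent 1.x release; the asset names and logic are illustrative, not from the talk):

    import pandas as pd
    from dagster import Definitions, asset

    @asset
    def raw_orders() -> pd.DataFrame:
        # Source asset; the CSV path is a placeholder.
        return pd.read_csv("orders.csv")

    @asset
    def order_summary(raw_orders: pd.DataFrame) -> pd.DataFrame:
        # Downstream asset: the dependency is declared by naming the upstream asset.
        return raw_orders.groupby("customer_id", as_index=False)["amount"].sum()

    defs = Definitions(assets=[raw_orders, order_summary])

The orchestrator's job is then to keep the materialized tables in line with these definitions, which is the reconciliation framing described above.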

Software-defined assets are a natural approach to orchestration for the modern data stack, in part because dbt models are a type of software-defined asset.

Attendees of this session will learn how to build and maintain lakehouses of software-defined assets with Dagster.

Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core

2022-07-19 · Watch video

Data is the key component of any analytics, AI, or ML platform. Organizations may not be successful without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights.

This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that can source data, transform it at scale, and build the data storage (Redshift) in a form that can be easily consumed by AI/ML programs, using AWS services in combination with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend). This presentation mainly demonstrates the flexibility that Apache Spark provides for various types of ETL data pipelines when we code in Spark.

We have been running 3 types of pipelines for 6+ years, over 400+ nightly batch jobs, for $1000/mo: (1) Spark on EC2, (2) a UI-based ETL tool with a Spark backend (on the same EC2), and (3) Spark on EMR. We have a CI/CD pipeline that supports easy integration and code deployment in all non-prod and prod regions (it even supports automated unit testing). We will also demonstrate how this ecosystem can fail over to a different region in less than 15 minutes, making our application highly resilient.
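
None of Capital One's code is shown here; as a generic sketch of the Spark-coded ETL pattern the session describes (hypothetical S3 paths and columns, with the Redshift load step omitted):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hr-etl").getOrCreate()

    # Source: read a raw extract, deduplicate, and stamp the load time.
    src = spark.read.option("header", True).csv("s3://example-bucket/hr/raw/employees.csv")
    clean = (src
             .dropDuplicates(["employee_id"])
             .withColumn("load_ts", F.current_timestamp()))

    # Load: write a staging dataset that a downstream job would copy into Redshift.
    clean.write.mode("overwrite").parquet("s3://example-bucket/hr/staging/employees/")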

Scaling AI Workloads with the Ray Ecosystem

2022-07-19 · Watch video

Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and require distributed execution. Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales Python applications and ML workloads from a laptop to a cluster, with an emphasis on the unique performance challenges of ML/AI systems. It is now used in many production deployments.

This talk will give an overview of Ray's architecture, core concepts, and primitives, such as remote tasks and actors; briefly discuss Ray's native libraries (Ray Tune, Ray Train, Ray Serve, Ray Datasets, RLlib); and survey Ray's growing ecosystem for scaling your Python or ML workloads.

Through a demo using XGBoost for classification, we will demonstrate how you can scale training, hyperparameter tuning, and inference from a single node to a cluster, with a tangible performance difference when using Ray.

The takeaways from this talk are:

  • Learn Ray architecture, core concepts, and Ray primitives and patterns
  • Why distributed computing will be the norm, not an exception
  • How to scale your ML workloads with Ray libraries:
    • Training on a single node vs. a Ray cluster, using XGBoost with/without Ray
    • Hyperparameter search and tuning, using XGBoost with Ray and Ray Tune
    • Inferencing at scale, using XGBoost with/without Ray
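
For readers new to Ray's primitives, here is a minimal, self-contained example of the remote tasks and actors the talk introduces (illustrative code, not from the session):

    import ray

    ray.init()

    # A remote task: a stateless function executed asynchronously on the cluster.
    @ray.remote
    def square(x):
        return x * x

    # A remote actor: a stateful worker process.
    @ray.remote
    class Counter:
        def __init__(self):
            self.n = 0

        def increment(self):
            self.n += 1
            return self.n

    futures = [square.remote(i) for i in range(8)]
    print(ray.get(futures))                       # [0, 1, 4, 9, 16, 25, 36, 49]

    counter = Counter.remote()
    print(ray.get(counter.increment.remote()))    # 1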

Scaling Up Machine Learning in Instacart Search for the 2020 Surge in Online Shopping

2022-07-19 · Watch video

As the online grocery business accelerated in 2020, Instacart search, which supports one of the largest catalogs of grocery items in the world, started facing new challenges. We experienced a sudden surge in the number of users, retailers, and traffic to our search engine. As a result, the scale of our data grew manifold and the predictive performance of our models started degrading due to a lack of historical data for the many new retailers and users on Instacart. New users searched for queries we had never seen before, and the new retailers on our platform were quite diverse, ranging from local grocery stores to office supplies, pharmacies, and Halloween stores, categories our models had never been trained on. As our relatively small team of four engineers tried to build new models to address these issues, we faced a number of operational challenges.

This talk will focus on the challenges we encountered in this new world, including drift in our data and cold-start issues. We will cover the architecture of our search engine and the issues we faced in training and serving our ML models due to the increase in scale. We will talk about how we overcame these issues by using more sophisticated models trained and served on a more robust infrastructure and technical stack. We will also cover the iterations on our ML ranking models to adapt to this new world, and how we successfully improved the quality of search results and our revenue while operating in a robust production environment.

Building Metadata and Lineage Driven Pipelines on Kubernetes

2022-07-19 · Watch video

Machine learning plays a critical role in every industry amid its widespread adoption, and composing ML pipelines at a rapid pace is essential for success. However, an ML pipeline consists of several components and requires effort from different teams, including data engineers, data scientists, ML engineers, and others. A typical cooperation strategy is to define a sequence of tasks, coordinate the integration, test, apply fixes and enhancements, and repeat. ML pipeline components produced by this task-driven approach lack reusability and add maintenance effort. Kubeflow Pipelines, a platform that makes deployments of ML pipelines on Kubernetes straightforward and scalable, provides a metadata- and lineage-driven approach to developing platform-independent and portable ML pipelines. Data linkage and propagation become crystal clear within ML pipelines, and this also streamlines ML pipeline composition.
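
As a minimal illustration of how pipeline components are declared with the Kubeflow Pipelines SDK (KFP v2 API assumed; component names and logic are illustrative, not from the talk):

    from kfp import compiler, dsl

    @dsl.component
    def prepare(text: str) -> str:
        # A tiny, self-contained component; inputs and outputs are tracked as metadata.
        return text.strip().lower()

    @dsl.component
    def train(prepared: str) -> str:
        return f"model trained on: {prepared}"

    @dsl.pipeline(name="metadata-driven-demo")
    def pipeline(text: str = "  Hello Pipelines  "):
        prep_task = prepare(text=text)
        train(prepared=prep_task.output)

    # Compile to a pipeline spec that can be uploaded to a Kubeflow Pipelines deployment.
    compiler.Compiler().compile(pipeline, "pipeline.yaml")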

Building Production-Ready Recommender Systems with Feature Stores

2022-07-19 · Watch video

Recommender systems are highly prevalent in modern applications and services but are notoriously difficult to build and maintain. Organizations face challenges such as complex data dependencies, data leakage, and frequently changing data/models. These challenges are compounded when building, deploying, and maintaining ML pipelines spans data scientists and engineers. Feature stores help address many of the operational challenges associated with recommender systems.

In this talk, we explore:

  • Challenges of building recommender systems
  • Strategies for reducing latency, while balancing requirements for freshness
  • Challenges in mitigating data quality issues
  • Technical and organizational challenges feature stores solve
  • How to integrate Feast, an open-source feature store, into an existing recommender system to support production systems
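
To ground the last point, here is a minimal sketch of online feature retrieval with Feast; it assumes a feature repository already defines a driver_hourly_stats feature view keyed by driver_id (illustrative names, not from the talk):

    from feast import FeatureStore

    # Point at an existing feature repository (repo_path is a placeholder).
    store = FeatureStore(repo_path=".")

    # Fetch fresh feature values for one entity at request time, e.g. inside a ranking service.
    features = store.get_online_features(
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
        ],
        entity_rows=[{"driver_id": 1001}],
    ).to_dict()
    print(features)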

Customer-centric Innovation to Scale Data & AI Everywhere

2022-07-19 · Watch video

Imagine a world where you have the flexibility to infuse intelligence into every application, from edge to cloud. In this session, you will learn how Intel is enabling customer-centric innovation and delivering the simplicity, productivity, and performance that developers need to scale their data and AI solutions everywhere. We will present an overview of Intel's end-to-end data analytics and AI technologies and developer tools, as well as examples of customer use cases.

Tackling Challenges of Distributed Deep Learning with Open Source Solutions

2022-07-19 · Watch video

Deep learning has had an enormous impact in a variety of domains, however, with model and data size growing at a rapid pace, scaling out deep learning training has become essential for practical use.

In this talk, you will learn about the challenges and various solutions for distributed deep learning.

We will first cover some of the common patterns used to scale out deep learning training.

We will then describe some of the challenges with distributed deep learning in practice:

  • Infrastructure and hardware management: spending too much time managing clusters, resources, and the scheduling/placement of jobs or processes.
  • Developer iteration speed: too much overhead to go from small-scale local ML development to large-scale training; hard to run distributed training jobs in a notebook/interactive environment.
  • Difficulty integrating with open source software: scale out training while still being able to leverage open source tools such as MLflow, PyTorch Lightning, and Hugging Face.
  • Managing large-scale training data: efficiently ingest large amounts of training data into a distributed machine learning model.
  • Cloud compute costs: leverage cheaper spot instances without having to restart training in case of node pre-emption, and easily switch between cloud providers to reduce costs without rewriting all the code.

Then, we will share the merits of the ML open source ecosystem for distributed deep learning. In particular, we will introduce Ray Train, an open source library built on the Ray distributed execution framework, and show how its integrations with other open source libraries (PyTorch, Hugging Face, MLflow, etc.) alleviate the pain points above.
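
As a rough sketch of what a Ray Train entry point looks like (assuming a recent Ray 2.x release with the ray[train] extra and PyTorch installed; the training loop is a toy, not the demo from the talk):

    import torch
    import torch.nn as nn
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker():
        # A toy training loop; Ray Train runs one copy of this on each worker.
        model = nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(5):
            x = torch.randn(32, 10)
            loss = (model(x) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),   # scale out by raising num_workers
    )
    result = trainer.fit()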

We will conclude with a live demo showing large-scale distributed training using these open source tools.

Building Scalable & Advanced AI based Language Solutions for R&D using Databricks

2022-07-19 · Watch video

AI is ubiquitous, and language-based AI even more so. Processing unstructured data, understanding language to form information, and generating language to respond to questions or write essays all have specific business applications. In pharma R&D, Deloitte has been investing in solutions across every part of the value chain, with AI/ML models embedded throughout. We have leveraged Databricks both as a development platform and as a data pipeline system, which in turn has helped us accelerate and streamline the AI/ML model development required for the R&D value chain. Through a systematic and scalable approach to processing, understanding, and generating unstructured content, we have successfully delivered multiple use cases through which we have achieved business value and proved out critical advanced AI capabilities. We will discuss these situational challenges and solutions during our session.

Challenges in Time Series Forecasting

2022-07-19 · Watch video

Accurate business forecasts are one of the most important inputs to corporate planning, yet they are enormously challenging to produce using only human intellect and rudimentary tools like spreadsheets, given the numerous factors that go into forecasting. Machine learning applied to time series data is a much more efficient and effective way to analyze the data, apply a forecasting algorithm, and derive accurate forecasts.

Cleanlab: AI to Find and Fix Errors in ML Datasets

2022-07-19 · Watch video

Real-world datasets have a large fraction of errors, which negatively impacts model quality and benchmarking. This talk presents Cleanlab, an open-source tool that addresses these issues using the latest research in data-centric AI. Cleanlab has been used to improve datasets at a number of Fortune 500 companies.

Ontological issues, invalid data points, and label errors are pervasive in datasets. Even gold-standard ML datasets have on average 3.3% label errors (labelerrors.com). Data errors degrade model quality, and errors lead to incorrect conclusions about model performance and suboptimal models being deployed.

We present the cleanlab open-source package (github.com/cleanlab/cleanlab) for finding and fixing data errors. We will walk through using Cleanlab to fix errors in a real-world dataset, with an end-to-end demo of how Cleanlab improves data and model performance.
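
As a minimal, self-contained sketch of the cleanlab workflow the demo walks through (cleanlab 2.x API assumed, with synthetic data and deliberately injected label noise):

    import numpy as np
    from cleanlab.filter import find_label_issues
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    # Synthetic dataset with some labels deliberately corrupted.
    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)
    y_noisy = y.copy()
    y_noisy[:30] = np.random.randint(0, 3, 30)

    # Out-of-sample predicted probabilities from any classifier, via cross-validation.
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y_noisy, cv=5, method="predict_proba"
    )

    # Rank the examples most likely to be mislabeled so they can be reviewed and fixed.
    issues = find_label_issues(
        labels=y_noisy, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
    )
    print(f"{len(issues)} examples flagged for review")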

Finally, we will show Cleanlab Studio, which provides a web interface for human-in-the-loop data quality control.

Cloud Native Geospatial Analytics at JLL

2022-07-19 · Watch video
Yanqing Zeng (JLL), Luis Sanz (CARTO)

Luis Sanz, CEO of CARTO, and Yanqing Zeng, Lead Data Scientist at JLL, take us through how cloud native geospatial analytics can be unlocked on the Databricks Lakehouse platform with CARTO. Yanqing will showcase her work on large-scale spatial analytics projects that address some of the most critical analysis use cases in real estate. Taking a geospatial perspective, Yanqing will share practical examples of how large-scale spatial data and analytics can be used for property portfolio mapping, AI-driven risk assessment, real estate valuation, and more.
