NLP

Using Lakehouse to Fight Cancer:Ontada’s Journey to Establish a RWD Platform on Databricks Lakehouse

2023-07-28 · Databricks DATA + AI Summit 2023 Watch

video

by Donghwa Kim

Analytics BI Data Lakehouse Databricks DWH Oracle

Ontada, a McKesson business, is an oncology real-world data and evidence, clinical education and provider of technology business dedicated to transforming the fight against cancer. Core to Ontada’s mission is using real-world data (RWD) and evidence generation to improve patient health outcomes and to accelerate life science research.

To support its mission, Ontada embarked on a journey to migrate its enterprise data warehouse (EDW) from an on-premise Oracle database to Databricks Lakehouse. This move allows Ontada to now consume data from any source, including structured and unstructured data from its own EHR and genomics lab results, and realize faster time to insight. In addition, using the Lakehouse has helped Ontada eliminate data silos, enabling the organization to realize the full potential of RWD – from running traditional descriptive analytics to extracting biomarkers from unstructured data. The session will cover the following topics:

Oracle to Databricks: migration best practices and lessons learned
People, process, and tools: expediting innovation while protecting patient information using Unity Catalog
Getting the most out of the Databricks Lakehouse: from BI to genomics, running all analytics under one platform
Hyperscale biomarker abstraction: reducing the manual effort needed to extract biomarkers from large unstructured data (medical notes, scanned/faxed documents) using spaCY and John Snow Lab NLP libraries

Join this session to hear how Ontada is transforming RWD to deliver safe and effective cancer treatment.

Talk by: Donghwa Kim

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Rob Bajra , Derrick Olson

AI/ML Analytics API BI Data Science Databricks Data Streaming

JetBlue has embarked over the past year on an AI and ML transformation. Databricks has been instrumental in this transformation due to the ability to integrate streaming pipelines, ML training using MLflow, ML API serving using ML registry and more in one cohesive platform. Using real-time streams of weather, aircraft sensors, FAA data feeds, JetBlue operations and more are used for the world's first AI and ML operating system orchestrating a digital-twin, known as BlueSky for efficient and safe operations. JetBlue has over 10 ML products (multiple models each product) in production across multiple verticals including dynamic pricing, customer recommendation engines, supply chain optimization, customer sentiment NLP and several more.

The core JetBlue data science and analytics team consists of Operations Data Science, Commercial Data Science, AI and ML engineering and Business Intelligence. To facilitate the rapid growth and faster go-to-market strategy, the team has built an internal Data Catalog + AutoML + AutoDeploy wrapper called BlueML using Databricks features to empower data scientists including advanced analysts with the ability to train and deploy ML models in less than five lines of code.

Talk by: Derrick Olson and Rob Bajra

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

IFC's MALENA Provides Analytics for ESG Reviews in Emerging Markets Using NLP and LLMs

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Atiyah Curmally , Blaise Sandwidi

AI/ML Analytics Data Lake Databricks LLM MLOps Spark

International Finance Corporation (IFC) is using data and AI to build machine learning solutions that create analytical capacity to support the review of ESG issues at scale. This includes natural language processing and requires entity recognition and other applications to support the work of IFC’s experts and other investors working in emerging markets. These algorithms are available via IFC’s Machine Learning ESG Analyst (MALENA) platform to enable rapid analysis, increase productivity, and build investor confidence. In this manner, IFC, a development finance institution with the mandate to address poverty in emerging markets, is making use of its historical datasets and open source AI solutions to build custom-AI applications that democratize access to ESG capacity to read and classify text.

In this session, you will learn the unique flexibility of the Apache Spark™ ecosystem from Databricks and how that has allowed IFC’s MALENA project to connect to scalable data lake storage, use different natural language processing models and seamlessly adopt MLOps.

Talk by: Atiyah Curmally and Blaise Sandwidi

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Explainable Data Drift for NLP

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Noam Bressler

AI/ML Databricks LLM MLOps

Detecting data drift, although far from solved-for tabular data, has become a common approach to monitor ML models in production. For Natural Language Processing (NLP) on the other hand the question remains mostly open. In this session, we will present and compare two approaches. In the first approach, we will demonstrate how by extracting a wide range of explainable properties per document such as topics, language, sentiment, named entities, keywords and more we are able to explore potential sources of drift. We will show how these properties can be consistently tracked over time, how they can be used to detect meaningful data drift as soon as it occurs and how they can be used to explain and fix the root cause.

The second approach we will present is to detect drift by using the embeddings of common foundation models (such as GPT3 in the Open AI model family) and use them to identify areas in the embedding space in which significant drift has occurred. These areas in embedding space should then be characterized in a human-readable way to enable root cause analysis of the detected drift. We will compare the performance and explainability of these two methods and explore the pros and cons of each approach.

Talk by: Noam Bressler

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Scaling AI Applications with Databricks, HuggingFace and Pinecone

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Roie Schwaber-Cohen

AI/ML Databricks Pinecone Spark Vector DB

The production and management of large-scale vector embeddings can be a challenging problem. The integration of Databricks, Hugging Face and Pinecone offers a powerful solution. Vector embeddings have become an essential tool in the development of AI powered applications. Embeddings are representations of data learned by machine models. High quality embeddings are unlocking use cases like semantic search, recommendation engines, and anomaly detection. Databricks' Apache Spark™ ecosystem together with Hugging Face's Transformers library enable large-scale vector embeddings production using GPU processing, Pinecone's vector database provides ultra-low latency querying and upserting of billions of embeddings, allowing for high-quality embeddings at scale for real-time AI apps.

In this session, we will present a concrete use case of this integration in the context of a natural language processing application. We will demonstrate how Pinecone's vector database can be integrated with Databricks and Hugging Face to produce large-scale vector embeddings of text data and how these embeddings can be used to improve the performance of various AI applications. You will see the benefits of this integration in terms of speed, scalability, and cost efficiency. By leveraging the GPU processing capabilities of Databricks and the ultra low-latency querying capabilities of Pinecone, we can significantly improve the performance of NLP tasks while reducing the cost and complexity of managing large-scale vector embeddings. You will learn about the technical details of this integration and how it can be implemented in your own AI projects, and gain insights into the speed, scalability, and cost efficiency benefits of using this solution.

Talk by: Roie Schwaber-Cohen

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Keshav Santhanam

AI/ML Databricks

In this talk, you will learn about how retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge intensive tasks using frozen language models (LM) and retrieval models (RM). Existing work has combined these in simple “retrieve-then-read” pipelines in which the RM retrieves passages that are inserted into the LM prompt.

To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate–Search–Predict (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably.

We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37–125%, 8–40%, and 80–290% relative gains against vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.

Talk by: Keshav Santhanam

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic QuadrantTM CDBMS: https://dbricks.co/3phw20d

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Nitu Nivedita

AI/ML Analytics Databricks DataOps Delta DevOps LLM Matplotlib Plotly Power BI React Data Streaming

Using Databricks, we built a “Unified Talent Solution” backed by a robust data and AI engine for analyzing skills of a combined pool of permanent employees, contractors, part-time employees and vendors, inferring skill gaps, future trends and recommended priority areas to bridge talent gaps, which ultimately greatly improved operational efficiency, transparency, commercial model, and talent experience of our client. We leveraged a variety of ML algorithms such as boosting, neural networks and NLP transformers to provide better AI-driven insights.

One inevitable part of developing these models within a typical DS workflow is iteration. Databricks' end-to-end ML/DS workflow service, MLflow, helped streamline this process by organizing them into experiments that tracked the data used for training/testing, model artifacts, lineage and the corresponding results/metrics. For checking the health of our models using drift detection, bias and explainability techniques, MLflow's deploying, and monitoring services were leveraged extensively.

Our solution built on Databricks platform, simplified ML by defining a data-centric workflow that unified best practices from DevOps, DataOps, and ModelOps. Databricks Feature Store allowed us to productionize our models and features jointly. Insights were done with visually appealing charts and graphs using PowerBI, plotly, matplotlib, that answer business questions most relevant to clients. We built our own advanced custom analytics platform on top of delta lake as Delta’s ACID guarantees allows us to build a real-time reporting app that displays consistent and reliable data - React (for front-end), Structured Streaming for ingesting data from Delta table with live query analytics on real time data ML predictions based on analytics data.

Talk by: Nitu Nivedita

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Using NLP to Evaluate 100 Million Global Webpages Daily to Contextually Target Consumers

2023-07-26 · Databricks DATA + AI Summit 2023 Watch

video

by Xuefu Wang , Mark Lee (Databricks)

AI/ML Databricks Marketing Spark

This session will cover the challenges and the solution that The Trade Desk went through to scale their ML models for NLP for 100 million web pages per day.

TTD's contextual targeting team needs to analyze 100 million web pages per day. Fifty percent of the webpages are non-English. Half of the content was not being properly analyzed and targeted intelligently. TTD attempted to build a model using Spark NLP, however the package could not scale and was not cost-effective. GPU utilization was low and the solution was cost prohibitive. TTD engaged with Databricks in early 2022 to build an NLP model on Databricks. Our teams partnered closely together. We were able to build a solution using distributed inference (150-200 GPUs running at 80%+ utilization); Each day, Databricks translated two hundred times faster across 50 million web pages that are in for over 35 + languages and at a fraction of the cost. This solution enables TTD teams to standardize on English for contextual targeting ML models. TTD can now be a one-stop shop for their customers' global advertising needs.

The Trade Desk is headquartered in Ventura, California. It is the largest independent demand-side platform in the world, competing against Google, Facebook, and others. Unlike traditional marketing, programmatic marketing is operated by real-time, split-second decisions based on user identity, device information, and other data points. It enables highly personalized consumer experiences and improves return-on-investment for companies and advertisers.

Talk by: Xuefu Wang and Mark Lee

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Building a Lakehouse for Data Science at DoorDash

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Big Data Data Lakehouse Data Science Databricks DWH Data Streaming

DoorDash was using a data warehouse but found that they needed more data transparency, lower costs, and the ability to handle streaming data as well as batch data. With an engineering team rooted in big data backgrounds at Uber and LinkedIn, they moved to a Lakehouse architecture intuitively, without knowing about the term. In this session, learn more about how they arrived at that architecture, the process of making the move, and the results they have seen. While addressing both data analysts and data scientists from their lakehouse, this session will focus on their machine learning operations, and how their efficiencies are enabling them to tackle more advanced use cases such as NLP and image classification.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

State-of-the-Art Natural Language Processing with Apache Spark NLP

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

Databricks Python Spark

This session teaches how & why to use the open-source Spark NLP library. Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of recent research advances. Spark NLP is the most widely used NLP library in the enterprise today; provides thousands of current, supported, pre-trained models for 200+ languages out of the box; and is the only open-source NLP library that can natively scale to use any Apache Spark cluster.

We’ll walk through Python code running common NLP tasks like document classification, named entity recognition, sentiment analysis, spell checking, question answering, and translation. The discussion of each task includes the latest advances in deep learning and transfer learning used to tackle it. We’ll also cover new free tools for data annotation, no-code active learning & transfer learning, easily deploying NLP models as production-grade services, and sharing models you’ve trained.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Achieve Machine Learning Hyper-Productivity with Transformers and Hugging Face

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Databricks

According to the latest State of AI report, "transformers have emerged as a general-purpose architecture for ML. Not just for Natural Language Processing, but also Speech, Computer Vision or even protein structure prediction." Indeed, the Transformer architecture has proven very efficient on a wide variety of Machine Learning tasks. But how can we keep up with the frantic pace of innovation? Do we really need expert skills to leverage these state-of-the-art models? Or is there a shorter path to creating business value in less time? In this code-level talk, we'll gradually build and deploy a demo involving several Transformer models. Along the way, you'll learn about the portfolio of open source and commercial Hugging Face solutions, how they can help you become hyper-productive in order to deliver high-quality Machine Learning solutions faster than ever before.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Adversarial AI—The Nature of the Threat, Impacts, and Mitigation Strategies

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Databricks

Adversarial AI/ML is an emerging research area focused on the vulnerabilities of Artificial Intelligence (AI)/Machine Learning (ML) models to adversarial exploitation such as data poisoning, adversarial perturbations, inference and extraction attacks. This research area is of particular interest to domains where AI/ML models play an essential role in the mission-critical decision making processes. In this presentation, we will give a review of the four principal categories of Adversarial AI. We will discuss each one of these, supported by the relevant and interesting examples, and we will discuss the future implications. We will present in greater depth our research in Adversarial NLP, backed by the specific data poisoning and adversarial perturbation examples attacks on NLP classifiers. We will conclude the presentation by discussing the current mitigation approaches and methods, and offer some general recommendations for how to best address the Adversarial AI exploits.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

Health Care and Life Sciences Experience at Data + AI Summit 2022

2022-07-19 · Databricks DATA + AI Summit 2023 Watch

video

AI/ML Analytics Databricks

Welcome data teams and executives in the Healthcare and Life Sciences industry! This year’s Data + AI Summit is jam-packed with talks, demos and discussions on the biggest innovations in patient care and drug R&D. To help you take full advantage of the Healthcare and Life Sciences experience at Summit, we’ve curated all the programs in one place.

Highlights at this year’s Summit:

Healthcare and Life Sciences Industry Forum: Our capstone event for Healthcare and Life Sciences attendees at Summit featuring keynotes and panel discussions with Walgreens, Takeda, Optum, and Humana followed by networking. More details in the agenda below. Healthcare and Life Sciences Lounge: Stop by our industry lounge located outside the Expo floor to meet with Databricks’ industry experts and see solutions from our partners including ZS Associates, John Snow Labs and others. Session Talks: Over 10 technical talks on topics including healthcare NLP, knowledge graphs for R&D, commercial analytics, and predicting hospital readmissions.

Connect with us: Website: https://databricks.com Facebook: https://www.facebook.com/databricksinc Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/data... Instagram: https://www.instagram.com/databricksinc/

talk-data.com

Activity Trend

Top Events

Top Speakers

Using Lakehouse to Fight Cancer:Ontada’s Journey to Establish a RWD Platform on Databricks Lakehouse

JetBlue’s Real-Time AI & ML Digital Twin Journey Using Databricks

IFC's MALENA Provides Analytics for ESG Reviews in Emerging Markets Using NLP and LLMs

Explainable Data Drift for NLP

Scaling AI Applications with Databricks, HuggingFace and Pinecone

Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP

How We Made a Unified Talent Solution Using Databricks Machine Learning, Fine-Tuned LLM & Dolly 2.0

Using NLP to Evaluate 100 Million Global Webpages Daily to Contextually Target Consumers

Building a Lakehouse for Data Science at DoorDash

State-of-the-Art Natural Language Processing with Apache Spark NLP

Achieve Machine Learning Hyper-Productivity with Transformers and Hugging Face

Adversarial AI—The Nature of the Threat, Impacts, and Mitigation Strategies

Health Care and Life Sciences Experience at Data + AI Summit 2022