talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 · YouTube

Activities tracked

287

Filtering by: AI/ML

Sessions & talks

Showing 151–175 of 287 · Newest first

Building a Lakehouse for Data Science at DoorDash

2022-07-19 · Watch video

DoorDash was using a data warehouse but found that they needed more data transparency, lower costs, and the ability to handle streaming data as well as batch data. With an engineering team rooted in big data backgrounds at Uber and LinkedIn, they moved to a Lakehouse architecture intuitively, without knowing the term. In this session, learn more about how they arrived at that architecture, the process of making the move, and the results they have seen. While the lakehouse serves both data analysts and data scientists, this session will focus on their machine learning operations and how the resulting efficiencies are enabling them to tackle more advanced use cases such as NLP and image classification.

Cloud and Data Science Modernization of Veterans Affairs Financial Service Center with Azure Databricks

2022-07-19 · Watch video

The Department of Veterans Affairs (VA) is home to over 420,000 employees, provides health care for 9.16 million enrollees and manages the benefits of 5.75 million recipients. The VA also hosts an array of financial management, professional, and administrative services at their Financial Service Center (FSC), located in Austin, Texas. The FSC is divided into various service groups organized around revenue centers and product lines, including the Data Analytics Service (DAS). To support the VA mission, in 2021 FSC DAS continued to press forward with their cloud modernization efforts, successfully achieving four key accomplishments:

• Office of Community Care (OCC) Financial Time Series Forecast - Financial forecasting enhancements to predict claims
• CFO Dashboard - Productivity and capability enhancements for financial and audit analytics
• Datasets Migrated to the Cloud - Migration of on-prem datasets to the cloud for downstream analytics (includes a supply chain proof-of-concept)
• Data Science Hackathon - A hackathon to predict bad claims codes and demonstrate DAS abilities to accelerate a ML use case using Databricks AutoML

This talk discusses FSC DAS’ cloud and data science modernization accomplishments in 2021, lessons learned, and what’s ahead.

Connecting the Dots with DataHub: Lakehouse and Beyond

2022-07-19 · Watch video

You’ve successfully built your data lakehouse. Congratulations! But what happens when your operational data stores, streaming systems like Apache Kafka, or data ingestion systems push bad data into the lakehouse? Can you be proactive when it comes to preventing bad data from affecting your business? How can you take advantage of automation to ensure that raw data assets become well-maintained data products (clear ownership, documentation and sensitivity classification) without requiring people to do redundant work across operational, ingestion and lakehouse systems? How do you get live and historical visibility into your entire data ecosystem (schemas, pipelines, data lineage, models, features and dashboards) within and across your production services, ingestion pipelines and data lakehouse? Data engineers struggle with data quality and data governance issues that constantly interrupt their day and limit their upside impact on the business.

In this talk, we will share how data engineers from our 3K+ strong DataHub community are using DataHub to track lineage, understand data quality, and prevent failures from impacting their important dashboards, ML models and features. The talk will include details of how DataHub extracts lineage automatically from Spark, schema and statistics from Delta Lake and shift-left strategies for developer-led governance.

Mapping Data Quality Concerns to Data Lake Zones

2022-07-19 · Watch video

A common pattern in Data Lake and Lakehouse design is structuring data into zones, with Bronze, Silver and Gold being typical labels. Each zone is suitable for different workloads and different consumers: for instance, machine learning algorithms typically process against Bronze or Silver, while analytic dashboards often query Gold. This prompts the question: which layer is best suited for applying data quality rules and actions? Our answer: all of them.

In this session, we’ll expand on our answer by describing the purposes of the different zones, and mapping the categories of data quality relevant for each by assessing its qualitative requirements. We’ll describe Data Enrichment: the practice of making observed anomalies available as inputs to downstream data pipelines, and provide recommendations for when to merely alert, when to quarantine data, when to halt pipelines, and when to apply automated corrective actions.
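
The alert / quarantine / halt spectrum described here maps naturally onto Delta Live Tables expectations. Below is a minimal sketch of that mapping, with hypothetical table, column and path names; it illustrates the general pattern rather than the speakers' own implementation (expect only records a violation metric, expect_or_drop removes offending rows, expect_or_fail stops the pipeline).

```python
import dlt
from pyspark.sql import functions as F

# Bronze: land raw events as-is; only alert (track the violation metric, keep the rows).
@dlt.table(comment="Raw orders landed from the source system")
@dlt.expect("non_null_order_id", "order_id IS NOT NULL")      # alert only
def orders_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/landing/orders"))                          # hypothetical path

# Silver: drop (or route to a quarantine table) rows that fail the rule.
@dlt.table(comment="Cleaned orders for analysts and ML")
@dlt.expect_or_drop("valid_amount", "amount >= 0")             # quarantine-style handling
def orders_silver():
    return dlt.read_stream("orders_bronze").withColumn("amount", F.col("amount").cast("double"))

# Gold: halt the pipeline if an aggregate-level rule is violated.
@dlt.table(comment="Daily revenue per store")
@dlt.expect_or_fail("has_rows", "order_count > 0")             # halt
def revenue_gold():
    return (dlt.read("orders_silver")
            .groupBy("store_id", F.to_date("order_ts").alias("order_date"))
            .agg(F.count("*").alias("order_count"), F.sum("amount").alias("revenue")))
```

Rows removed by expect_or_drop can additionally be written to an explicit quarantine table when they need to be inspected or replayed later.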

Scaling Privacy: Practical Architectures and Experiences

2022-07-19 · Watch video

At Spark Data & AI 2021, we presented a use case around privacy in an insurance landscape using Privacera: Scaling Privacy in a Spark Ecosystem (https://www.youtube.com/watch?v=cjJEMlNcg5k). In one year, privacy and security have taken off as major needs to solve, and the ability to embed them into business processes to empower data democratization has become mandatory. The idea that data is a product is now commonplace, and the ability to rapidly innovate on those products hinges on balancing a dual mandate. One mandate: move fast. The other: manage privacy and security. How do we make this happen? Let's dig into the real details and experiences and show the blueprint for success.

Serving Near Real-Time Features at Scale

2022-07-19 · Watch video

This presentation will first introduce the use case, which generates price adjustments based on the network effect; the corresponding model relies on 108 near real-time features computed by Flink pipelines from the raw demand and supply events. Here is the simplified computation logic:

- The pipelines need to process the raw real-time events at a rate of 300k/s, including both demand and supply
- Each event needs to be computed on the geospatial, temporal and other dimensions
- Each event contributes to the computation on the original hexagon and the 1K+ neighbours due to the fan-out effect of K-ring smoothing
- Each event contributes to the aggregation on multiple window sizes up to 32 minutes, sliding by 1 minute, or 63 windows in total

Next the presentation will briefly go through the DAG of the Flink pipeline before optimization and the issues we faced: the pipeline could not run stably due to OOM and backpressure. The presentation will discuss how to optimize a streaming pipeline with the generic performance tuning framework, which focuses on three areas: Network, CPU and Memory, and five domains: Parallelism, Partition, Remote Call, Algorithm and Garbage Collector. The presentation will also show some example techniques being applied onto the pipelines by following the performance tuning framework.

Then the presentation will discuss one particular optimization technique: Customized Sliding Window.
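
As a rough illustration of why a customized sliding window helps: instead of keeping dozens of overlapping window states per key, events can be pre-aggregated into one-minute buckets and the windows composed from those buckets at emit time. The sketch below is plain Python pseudologic under assumed event shapes and window sizes, not the production Flink operator from the talk.

```python
from collections import defaultdict

SLIDE_MIN = 1
WINDOW_SIZES_MIN = [2, 4, 8, 16, 32]          # assumed set of window sizes up to 32 minutes

# One running aggregate per (hexagon, minute bucket) instead of one per (hexagon, window).
buckets = defaultdict(float)                  # (hex_id, minute) -> partial aggregate

def on_event(hex_id: str, event_minute: int, value: float) -> None:
    """Pre-aggregate each event into its single 1-minute bucket."""
    buckets[(hex_id, event_minute)] += value

def emit(hex_id: str, now_minute: int) -> dict:
    """Compose every window size for one key from the shared buckets at emit time."""
    out = {}
    for size in WINDOW_SIZES_MIN:
        total = sum(
            buckets.get((hex_id, m), 0.0)
            for m in range(now_minute - size + 1, now_minute + 1)
        )
        out[f"sum_{size}m"] = total
    return out

# Example: three demand events for one hexagon, then features emitted at minute 10.
on_event("hex_a", 8, 1.0)
on_event("hex_a", 9, 2.0)
on_event("hex_a", 10, 1.0)
print(emit("hex_a", 10))    # {'sum_2m': 3.0, 'sum_4m': 4.0, ...}
```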

Powering machine learning models with near real-time features can be quite challenging due to computation logic complexity, write throughput, serving SLAs, etc. In this talk, we introduce some of the problems we faced and our solutions to them, in the hope of aiding our peers with similar use cases.

The Databricks Notebook: Front Door of the Lakehouse

2022-07-19 · Watch video

One of the greatest data challenges organizations face is the sprawl of disparate toolchains, multiple vendors, and siloed teams. This can result in each team working on their own subset of data, preventing the delivery of cohesive and comprehensive insights and inhibiting the value that data can provide. This problem is not insurmountable, however; it can be fixed by a collaborative platform that enables users of all personas to discover and share data insights with each other. Whether you're a marketing analyst or a data scientist, the Databricks Notebook is that single platform that lets you tap into the awesome power of the Lakehouse. The Databricks Notebook supercharges data teams’ ability to collaborate, explore data, and create data assets like tables, pipelines, reports, dashboards, and ML models—all in the language of users’ choice. Join this session to discover how the Notebook can unleash the power of the Lakehouse. You will also learn about new data visualizations, the introduction of ipywidgets and bamboolib, workflow automation and orchestration, CI/CD, and integrations with MLflow and Databricks SQL.

Data Governance and Sharing on Lakehouse | Matei Zaharia | Keynote Data + AI Summit 2022

2022-07-19 · Watch video
Matei Zaharia (Databricks)

Data + AI Summit Keynote talk from Matei Zaharia on Data Governance and Sharing on Lakehouse

How unsupervised machine learning can scale data quality monitoring in Databricks

2022-07-19 · Watch video

Technologies like Databricks Delta Lake and Databricks SQL enable enterprises to store and query their data. But existing rules and metrics approaches to monitoring the quality of this data are tedious to set up and maintain, fail to catch unexpected issues, and generate false positive alerts that lead to alert fatigue.

In this talk, Jeremy will describe a set of fully unsupervised machine learning algorithms for monitoring data quality at scale in Databricks. He will cover how the algorithms work, their strengths and weaknesses, and how they are tested and calibrated.

Participants will leave this talk with an understanding of unsupervised data quality monitoring, its strengths and weaknesses, and how to begin monitoring data using it in Databricks.
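
As a generic illustration of the idea (not the speaker's algorithm), one unsupervised approach is to compute daily health metrics for a table, such as row counts and null rates, and flag days whose metric vector looks anomalous. Table and column names below are hypothetical, and the snippet assumes a Databricks notebook where `spark` is available.

```python
import pandas as pd
from pyspark.sql import functions as F
from sklearn.ensemble import IsolationForest

# Daily health metrics for a Delta table (hypothetical table and columns).
metrics = (
    spark.table("sales.orders")
    .groupBy(F.to_date("order_ts").alias("day"))
    .agg(
        F.count("*").alias("row_count"),
        F.avg(F.col("amount").isNull().cast("int")).alias("amount_null_rate"),
        F.approx_count_distinct("customer_id").alias("distinct_customers"),
    )
    .toPandas()
    .sort_values("day")
)

# Fit an unsupervised detector on the metric vectors; no hand-written rules or thresholds.
features = metrics[["row_count", "amount_null_rate", "distinct_customers"]]
detector = IsolationForest(contamination=0.02, random_state=0).fit(features)
metrics["anomaly"] = detector.predict(features) == -1   # True = flag this day for review

print(metrics[metrics["anomaly"]][["day", "row_count", "amount_null_rate"]])
```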

Achieve Machine Learning Hyper-Productivity with Transformers and Hugging Face

2022-07-19 · Watch video

According to the latest State of AI report, "transformers have emerged as a general-purpose architecture for ML. Not just for Natural Language Processing, but also Speech, Computer Vision or even protein structure prediction." Indeed, the Transformer architecture has proven very efficient on a wide variety of machine learning tasks. But how can we keep up with the frantic pace of innovation? Do we really need expert skills to leverage these state-of-the-art models? Or is there a shorter path to creating business value in less time? In this code-level talk, we'll gradually build and deploy a demo involving several Transformer models. Along the way, you'll learn about the portfolio of open source and commercial Hugging Face solutions, and how they can help you become hyper-productive and deliver high-quality machine learning solutions faster than ever before.
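
As a small taste of that shorter path, the transformers pipeline API wraps model download, tokenization and inference in a single call. This sketch is illustrative rather than the demo built in the session.

```python
from transformers import pipeline

# Zero-shot classification: route support tickets without training a model.
classifier = pipeline("zero-shot-classification")   # downloads a default pretrained checkpoint
result = classifier(
    "My package arrived two weeks late and the box was damaged.",
    candidate_labels=["shipping", "billing", "product quality"],
)
print(result["labels"][0], round(result["scores"][0], 3))

# Sentiment analysis with the same one-line pattern.
sentiment = pipeline("sentiment-analysis")
print(sentiment("The keynote demo was surprisingly smooth!"))
```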

AI powered Assortment Planning Solution

2022-07-19 · Watch video

For shop owners to maximize revenue, they need to ensure that the right products are available on the right shelf at the right time. So, how does one assort the right mix of products to maximize profit and reduce inventory pressure? Today, these decisions are led by human knowledge of trends and inputs from salespeople. This is error-prone and cannot scale with a growing product assortment and varying demand patterns. Mindtree has analyzed this problem and built a cloud-based AI/ML solution that provides contextual, real-time insights and optimizes inventory management. In this presentation, you will hear our solution approach to help a global CPG organization promote new products, increase demand across their product offerings, and drive impactful insights. You will also learn about the technical solution architecture, orchestration of product and KPI generation using Databricks, AI/ML models, heterogeneous cloud platform options for deployment and rollout, and scale-up and scale-out options.

A Practitioner's Guide to Unity Catalog—A Technical Deep Dive

2022-07-19 · Watch video

As a practitioner, managing and governing data assets and ML models in the data lakehouse is critical for your business initiatives to be successful. With Databricks Unity Catalog, you have a unified governance solution for all data and AI assets in your lakehouse, giving you much better performance, management and security on any cloud. When deploying Unity Catalog for your lakehouse, you must be prepared with best practices to ensure a smooth governance implementation. This session will cover key considerations for a successful implementation, such as:

• How to manage Unity Catalog's metastore and understand various usage patterns
• How to use identity federation to assign account principals to a Databricks workspace
• Best practices for leveraging cloud storage, managed tables and external tables with Unity Catalog
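
To make the three-level namespace and permission model concrete, here is a hedged sketch of the kind of SQL a practitioner might run once a metastore is attached to the workspace. Catalog, schema, group and storage names are hypothetical, and external tables additionally require an external location and storage credential to be set up.

```python
# Run from a notebook in a Unity Catalog-enabled workspace; names below are illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.claims")

# Managed table: Unity Catalog owns the storage location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS finance.claims.payments (
        claim_id STRING,
        amount   DOUBLE,
        paid_at  TIMESTAMP
    )
""")

# External table: data stays in cloud storage registered as an external location.
spark.sql("""
    CREATE TABLE IF NOT EXISTS finance.claims.payments_raw
    LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/claims/payments'
""")

# Grants to an account-level group (identity federation assigns account groups to workspaces).
spark.sql("GRANT USE CATALOG ON CATALOG finance TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA finance.claims TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE finance.claims.payments TO `data-analysts`")
```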

Accelerating the Pace of Autism Diagnosis with Machine Learning Models

2022-07-19 · Watch video

A formal autism diagnosis can be an inefficient and lengthy process. Families may wait months or longer before receiving a diagnosis for their child, despite evidence that earlier intervention leads to better treatment outcomes. Digital technologies that detect the presence of behaviors related to autism can scale access to pediatric diagnoses. This work aims to demonstrate the feasibility of deep learning technologies for detecting hand flapping from unstructured home videos, as a first step towards validating whether models and digital technologies can be leveraged to aid with autism diagnoses. We used the Self-Stimulatory Behavior Dataset (SSBD), which contains 75 videos of hand flapping, head banging, and spinning exhibited by children. From all the hand flapping videos, we extracted 100 positive and control videos of hand flapping, each between 2 and 5 seconds in duration. Utilizing both landmark-driven approaches and MobileNet V2's pretrained convolutional layers, our highest performing model achieved a testing F1 score of 84% (90% precision and 80% recall) on SSBD. This work provides a first step towards developing precise deep learning methods for activity detection of autism-related behaviors.
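
A minimal sketch of the MobileNet V2 transfer-learning half of such an approach (frozen per-frame feature extraction feeding a small classification head). The frame sampling, clip length and aggregation choices below are assumptions for illustration, not the authors' exact architecture.

```python
import tensorflow as tf

NUM_FRAMES, IMG_SIZE = 16, 224   # assumed: 16 sampled frames per 2-5 second clip

# Frozen MobileNetV2 backbone produces one embedding per frame.
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(IMG_SIZE, IMG_SIZE, 3), pooling="avg"
)
backbone.trainable = False

# Apply the backbone to every frame, average the frame embeddings, then classify the clip.
clip_in = tf.keras.Input(shape=(NUM_FRAMES, IMG_SIZE, IMG_SIZE, 3))
frame_embeddings = tf.keras.layers.TimeDistributed(backbone)(clip_in)
clip_embedding = tf.keras.layers.GlobalAveragePooling1D()(frame_embeddings)
hidden = tf.keras.layers.Dense(64, activation="relu")(clip_embedding)
out = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)   # hand flapping vs. control

model = tf.keras.Model(clip_in, out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
model.summary()
```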

Adversarial Drifts, Model Monitoring, and Feedback Loops: Building Human-in-the-Loop Machine Learning

2022-07-19 · Watch video

Protecting the user community and the platform from illegal or undesirable behavior is an important problem for most large online platforms. Content moderation (aka Integrity) systems aim to define, detect and take action on bad behavior/content at scale, usually accomplished with a combination of machine learning and human review.

Building hybrid human/ML systems for content moderation presents unique challenges, some of which we will discuss in this talk:

* Human review annotation guidelines and how they impact label quality for ML models
* Bootstrapping labels for new categories of content violation policies
* The role of adversarial drift in model performance degradation
* Best practices for monitoring model performance and ecosystem health
* Building adaptive machine learning models

The talk is a distillation of learnings from building such systems at Facebook, and from talking to other ML practitioners & researchers who’ve worked on similar systems elsewhere.

Amgen’s Journey To Building a Global 360 View of its Customers with the Lakehouse

2022-07-19 · Watch video

Serving patients in over 100 countries, Amgen is a leading global biotech company focused on developing therapies that have the power to save lives. Delivering on this mission requires our commercial teams to regularly meet with healthcare providers to discuss new treatments that can help patients in need. With the onset of the pandemic, where face-to-face interactions with doctors and other Healthcare Providers (HCPs) were severely impacted, Amgen had to rethink these interactions. With that in mind, the Amgen Commercial Data and Analytics team leveraged a modern data and AI architecture built on the Databricks Lakehouse to help accelerate its digital and data insights capabilities. This foundation enabled Amgen's teams to develop a comprehensive, customer-centric view to support flexible go-to-market models and provide personalized experiences to our customers.

In this presentation, we will share our recent journey of how we took an agile approach to bringing together over 2.2 petabytes of internally generated and externally sourced vendor data, onboarding it into our AWS Cloud and Databricks environments to enable standardized, scalable and robust capabilities that meet the business requirements of our fast-changing life sciences environment. We will share use cases of how we harmonized and managed our diverse sets of data to deliver efficiency, simplification, and performance outcomes for the business. We will cover the following aspects of our journey, along with best practices we learned over time:

• Our architecture to support Amgen's Commercial Data & Analytics constant processing around the globe
• Engineering best practices for building large-scale data lakes and analytics platforms, such as team organization, data ingestion and data quality frameworks, DevOps toolkits and maturity frameworks, and more
• Databricks capabilities adopted, such as Delta Lake, workspace policies, SQL workspace endpoints, and MLflow for model registry and deployment, plus various tools built for Databricks workspace administration
• Databricks capabilities being explored for the future, such as multi-task orchestration, container-based Apache Spark processing, Feature Store, Repos for Git integration, etc.
• The types of commercial analytics use cases we are building on the Databricks Lakehouse platform

Attendees building global and enterprise-scale data engineering solutions to meet diverse sets of business requirements will benefit from learning about our journey. Technologists will learn how we addressed specific business problems via reusable capabilities built to maximize value.

Implementing an End-to-End Demand Forecasting Solution Through Databricks and MLflow

2022-07-19 · Watch video

In retail, the right quantity at the right time is crucial for success. In this session we share how a demand forecasting solution helped some of our retailers to improve efficiencies and sharpen fresh product production and delivery planning.

With the setup in place, we train hundreds of models in parallel, training at various levels including store level, product level, and the combination of the two. By leveraging the distributed computation of Spark, we can do all of this in a scalable and fast way. Powered by Delta Lake, Feature Store and MLflow, this session clarifies how we built a highly reliable ML factory.
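
The "hundreds of models in parallel" pattern is commonly expressed with Spark's pandas function API, training one small model per group; each per-group model could also be logged to MLflow from inside the function. The snippet below is a simplified sketch with hypothetical table and column names, not the exact factory described in this session.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

result_schema = "store_id STRING, product_id STRING, train_r2 DOUBLE, next_week_units DOUBLE"

def train_one_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Train one demand model for a single (store, product) combination."""
    X, y = pdf[["week_of_year", "promo_flag", "price"]], pdf["units_sold"]
    model = LinearRegression().fit(X, y)
    next_week = pdf[["week_of_year", "promo_flag", "price"]].tail(1).assign(
        week_of_year=lambda d: d["week_of_year"] + 1
    )
    return pd.DataFrame([{
        "store_id": pdf["store_id"].iloc[0],
        "product_id": pdf["product_id"].iloc[0],
        "train_r2": model.score(X, y),
        "next_week_units": float(model.predict(next_week)[0]),
    }])

history = spark.table("retail.sales_features")   # hypothetical feature table
forecasts = (history
             .groupBy("store_id", "product_id")
             .applyInPandas(train_one_group, schema=result_schema))
forecasts.show()
```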

We show how this setup runs at various retailers and feeds accurate demand forecasts back to the ERP system, supporting the clients in their production planning and delivery. Through this session we want to inspire retailers and conference attendees to use data and AI not only to gain efficiency but also to decrease food waste.

Leveraging ML-Powered Analytics for Rapid Insights and Action (a demonstration)

2022-07-19 · Watch video

The modern data stack makes it possible to query high-volume data with extremely high granularity, dimensionality, and cardinality. Operationalized machine learning is a great way to address this complex data, focusing the scope of analyst inquiry and quickly exposing dimensions, groups, and sub-groups of data with the greatest impact on key metrics.

This session will discuss how to leverage operationalized AI/ML to automatically define millions of features and perform billions of simultaneous hypothesis tests across a wide dataset to identify key drivers of metric change. A technical demonstration will include an overview of leveraging the Databricks Lakehouse using Sisu’s AI/ML-powered decision intelligence platform: connecting to Databricks, defining metrics, automated AI/ML-powered analysis, and exposing actionable business insights.

Low-Code Machine Learning on Databricks with AutoML

2022-07-19 · Watch video

Teams across an organization should be able to use predictive analytics for their business. While there are data scientists and data engineers who can leverage code to build ML models, there are domain experts and analysts who can benefit from low-code tools to build ML solutions.

Join this session to learn how you can leverage Databricks AutoML and other low-code tools to build, train and deploy ML models into production. Additionally, Databricks takes a unique glass-box approach, so you can take the code behind the ML model and tweak it further to fine-tune performance and integrate it into production systems. See these capabilities in action and learn how Databricks empowers users of varying levels of expertise to build ML solutions.
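
A minimal sketch of the code-side entry point to Databricks AutoML (the same experiment can be launched from the UI). The dataset and column names are hypothetical, and the exact attributes on the returned summary may vary by runtime version.

```python
from databricks import automl

# Load a training table from the lakehouse (hypothetical table and target column).
df = spark.table("marketing.customer_churn")

# Kick off an AutoML classification experiment; Databricks generates editable
# "glass-box" notebooks for every trial so the best model's code can be tweaked further.
summary = automl.classify(
    dataset=df,
    target_col="churned",
    timeout_minutes=30,
)

print(summary.best_trial.model_path)       # MLflow model URI of the best trial
print(summary.best_trial.notebook_url)     # generated notebook to inspect and customize
```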

Migrate Your Existing DAGs to Databricks Workflows

2022-07-19 · Watch video

In this session, you will learn the benefits of orchestrating your business-critical ETL and ML workloads within the lakehouse, as well as how to migrate and consolidate your existing workflows to Databricks Workflows - a fully managed lakehouse orchestration service that allows you to run workflows on any cloud. We’ll walk you through different migration scenarios and share lessons learned and recommendations to help you reap the benefits of orchestration with Databricks Workflows.

ML on the Lakehouse: Bringing Data and ML Together to Accelerate AI Use Cases

2022-07-19 · Watch video

Discover the latest innovations from Databricks that can help you build and operationalize the next generation of machine learning solutions. This session will dive into Databricks Machine Learning, a data-centric AI platform that spans the full machine learning lifecycle - from data ingestion and model training to production MLOps. You'll learn about key capabilities that you can leverage in your ML use cases and see the product in action. You will also directly hear how Databricks ML is being used to maximize supply chain logistics and keep millions of Coca-Cola products on the shelf.

MLOps at DoorDash

2022-07-19 · Watch video

MLOps is one of the most widely discussed topics in the ML practitioner community. Streamlining ML development and productionizing ML are important ingredients in realizing the power of ML; however, they require a vast and complex infrastructure. The ROI of ML projects starts only when they are in production. The journey to implementing MLOps will be unique to each company. At DoorDash, we’ve been applying MLOps for a couple of years to support a diverse set of ML use cases and to perform large-scale predictions at low latency.

This session will share our approach to MLOps, as well as some of the learnings and challenges. In addition, it will share some details about the DoorDash ML stack, which consists of a mixture of homegrown solutions, open source solutions and vendor solutions like Databricks.

MLOps on Databricks: A How-To Guide

2022-07-19 · Watch video

As companies roll out ML pervasively, operational concerns become the primary source of complexity. Machine Learning Operations (MLOps) has emerged as a practice to manage this complexity. At Databricks, we see firsthand how customers develop their MLOps approaches across a huge variety of teams and businesses. In this session, we will show how your organization can build robust MLOps practices incrementally. We will unpack general principles which can guide your organization’s decisions for MLOps, presenting the most common target architectures we observe across customers. Combining our experiences designing and implementing MLOps solutions for Databricks customers, we will walk through our recommended approaches to deploying ML models and pipelines on Databricks. You will come away with a deeper understanding of how to scale deployment of ML models across your organization, as well as a practical, coded example illustrating how to implement an MLOps workflow on Databricks.
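
Deployment patterns of this kind typically center on the MLflow Model Registry. The following is a simplified, hedged sketch of one promotion flow (experiment and model names are hypothetical), not the exact reference architecture presented in the session.

```python
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 1. Train and log a candidate model as an MLflow run.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
with mlflow.start_run() as run:
    model = RandomForestClassifier(random_state=0).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")

# 2. Register the run's model in the Model Registry.
version = mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_classifier")

# 3. Promote it to Staging; a CI job or reviewer would later move it to Production.
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_classifier", version=version.version, stage="Staging"
)

# 4. Downstream jobs load by stage, so promotion requires no code changes.
staged_model = mlflow.pyfunc.load_model("models:/churn_classifier/Staging")
print(staged_model.predict(X[:5]))
```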

Monitoring and Quality Assurance of Complex ML Deployments via Assertions

2022-07-19 · Watch video

Machine Learning (ML) is increasingly being deployed in complex situations by teams. While much research effort has focused on the training and validation stages, other parts have been neglected by the research community.

In this talk, Daniel Kang will describe two abstractions (model assertions and learned observation assertions) that allow users to input domain knowledge to find errors at deployment time and in labeling pipelines. He will show real-world errors in labels and ML models deployed in autonomous vehicles, visual analytics, and ECG classification that these abstractions can find, and will further describe how they can be used to improve model quality by up to 2x at a fixed labeling budget. This work is being conducted jointly with researchers from Stanford University and Toyota Research Institute.
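
To make the abstraction concrete, here is a toy model assertion in the spirit of the work (not code from the actual libraries): a consistency check that flags frames where a detected object class disappears for a single frame and then reappears, which usually indicates a missed detection rather than a real event.

```python
from typing import Dict, List

Detections = Dict[int, List[str]]   # frame index -> object classes detected in that frame

def flicker_assertion(predictions: Detections) -> List[int]:
    """Flag frames where a class vanishes for exactly one frame and then reappears."""
    flagged = []
    frames = sorted(predictions)
    for prev, cur, nxt in zip(frames, frames[1:], frames[2:]):
        # Classes present before and after the current frame, but missing in it.
        vanished = (set(predictions[prev]) & set(predictions[nxt])) - set(predictions[cur])
        if vanished:
            flagged.append(cur)
    return flagged

# Example: a pedestrian detection drops out for a single frame.
preds = {0: ["car", "pedestrian"], 1: ["car"], 2: ["car", "pedestrian"]}
print(flicker_assertion(preds))   # [1] -> frame 1 is a candidate for re-labeling or review
```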

Multimodal Deep Learning Applied to E-commerce Big Data

2022-07-19 · Watch video

At Mirakl, we empower marketplaces with Artificial Intelligence solutions. Catalog data is an extremely rich source of e-commerce sellers' and marketplaces' products, including images, descriptions, brands, prices and attributes (for example, size, gender, material or color). Such big volumes of data are suitable for training multimodal deep learning models and present several technical machine learning and MLOps challenges to tackle.

We will dive deep into two key use cases: deduplication and categorization of products. For categorization, the creation of quality multimodal embeddings plays a crucial role and is achieved through experimentation with transfer learning techniques on state-of-the-art models. Finding very similar or almost identical products among millions and millions can be a very difficult problem, and that is where our deduplication algorithm comes in to provide a fast and computationally efficient solution.

Furthermore, we will show how we deal with big volumes of products using robust and efficient pipelines: Spark for distributed and parallel computing, TFRecords to stream and ingest data optimally on multiple machines while avoiding memory issues, and MLflow for tracking the experiments and metrics of our models.

Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing

2022-07-19 · Watch video

Microservices are an increasingly popular architecture much loved by application teams, because they allow services to be developed and scaled independently. Data teams, though, often need a centralized repository where all data from different services comes together to be joined and aggregated. The data platform can serve as a single source of company facts, enable near real-time analytics, and support secure sharing of massive data sets across clouds.

A viable microservices ingestion pattern is Change Data Capture (CDC), using AWS Database Migration Service or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it has several challenges for evolving services: frequent schema changes, complex or unsupported DDL during migration, and automated deployments are but a few. An event streaming architecture can address these challenges.

Confluent, for example, provides a schema registry service where all services can register their event schemas. Schema registration helps with verifying that the events are being published based on the agreed contracts between data producers and consumers. It also provides a separation between internal service logic and the data consumed downstream. The services write their events to Kafka using the registered schemas with a specific topic based on the type of the event.

Data teams can leverage Spark jobs to ingest Kafka topics into Bronze tables in the Delta Lake. On ingestion, the registered schema from the schema registry is used to validate the schema based on the provided version. A merge operation is sometimes needed to translate events into the final states of the records per business requirements.
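
A hedged sketch of that ingestion step with Structured Streaming: read the topic, parse events with the agreed schema, and collapse change events into current record state with a MERGE inside foreachBatch. Broker, topic, table and key names are hypothetical.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from delta.tables import DeltaTable

# Simplified schema for the topic, as agreed in the schema registry (hypothetical fields).
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical brokers
       .option("subscribe", "orders.events")                # hypothetical topic
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), order_schema).alias("e"))
          .select("e.*"))

def upsert_to_bronze(batch_df, batch_id):
    # Keep only the latest event per key in this micro-batch, then MERGE into the Bronze table.
    w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
    latest = batch_df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
    (DeltaTable.forName(spark, "bronze.orders")
        .alias("t")
        .merge(latest.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(events.writeStream
    .foreachBatch(upsert_to_bronze)
    .option("checkpointLocation", "/checkpoints/orders_bronze")
    .start())
```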

Data teams can take advantage of Delta Live Tables on streaming datasets to produce Silver and Gold tables in near real time. Each input data source also has a set of expectations to ensure data quality and business rules. The pipeline allows Engineering and Analytics to collaborate by mixing Python and SQL. The refined data sets are then fed into AutoML for discovery and baseline modeling.

To expose Gold tables to more consumers, especially non-Spark users across clouds, data teams can implement Delta Sharing. Recipients can access Silver tables from a different cloud and build their own analytics data sets. Analytics teams can also access Gold tables via the Delta Sharing pandas client and BI tools.
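
On the recipient side, the open Delta Sharing connector makes a shared table readable without a Spark cluster. A minimal, hedged sketch follows; the profile file and share/schema/table names are hypothetical.

```python
import delta_sharing

# The provider sends a profile file containing the sharing server endpoint and a bearer token.
profile = "/dbfs/FileStore/shares/marketing.share"

# Fully qualified name: <profile>#<share>.<schema>.<table>
table_url = f"{profile}#gold_share.sales.daily_revenue"

# List everything the recipient has been granted, then pull one table into pandas.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# Spark users on another cloud can read the same share with delta_sharing.load_as_spark(table_url).
```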
