talk-data.com talk-data.com

Event

Databricks DATA + AI Summit 2023

2026-01-11 YouTube Visit website ↗

Activities tracked

582

Sessions & talks

Showing 126–150 of 582 · Newest first

Search within this event →
Testing Generative AI Models: What You Need to Know

Testing Generative AI Models: What You Need to Know

2023-07-26 Watch
video

Generative AI shows incredible promise for enterprise applications. The explosion of generative AI can be attributed to the convergence of several factors. Most significant is that the barrier to entry has dropped for AI application developers through customizable prompts (few-shot learning), enabling laypeople to generate high-quality content. The flexibility of models like ChatGPT and DALLE-2 have sparked curiosity and creativity about new applications that they can support. The number of tools will continue to grow in a manner similar to how AWS fueled app development. But excitement must be tampered by concerns about new risks imposed to business and society. Increased capability and adoption also increase risk exposure. As organizations explore creative boundaries of generative models, measures to reduce risk must be put in place. However, the enormous size of the input space and inherent complexity make this task more challenging than traditional ML models.

In this session, we summarize the new risks introduced by the new class of generative foundation models through several examples, and compare how these risks relate to the risks of mainstream discriminative models. Steps can be taken to reduce the operational risk, bias and fairness issues, and privacy and security of systems that leverage LLM for automation. We’ll explore model hallucinations, output evaluation, output bias, prompt injection, data leakage, stochasticity, and more. We’ll discuss some of the larger issues common to LLMs and show how to test for them. A comprehensive, test-based approach to generative AI development will help instill model integrity by proactively mitigating failure and the associated business risk.

Talk by: Yaron Singer

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Unleashing the Magic of Large Language Modeling with Dolly 2.0

Unleashing the Magic of Large Language Modeling with Dolly 2.0

2023-07-26 Watch
video

As the field of artificial intelligence continues to advance at an unprecedented pace, LLMs are becoming increasingly powerful and transformative. LLMs use deep learning techniques to analyze vast amounts of text data, and can generate language that is like human language. These models have been used for a wide range of applications, including language translation, chatbots, text summarization, and more.

Dolly 2.0 is the first open-source, instruction-following LLM that has been fine-tuned on a human-generated instruction dataset – with zero chance of copyright implications. This makes it an ideal tool for research and commercial use, and opens up new possibilities for businesses looking to streamline their operations and enhance their customer service offerings.

In this session, we will provide an overview of Dolly 2.0, discuss its features and capabilities, and showcase its potential through a demo of Dolly in action. Attendees will gain insights into the LLMs, and learn how to maximize the impact of this cutting-edge technology in their organizations. By the end of the session, attendees will have a deep understanding of the capabilities of Dolly 2.0, and will be equipped with the knowledge they need to integrate LLMs into their own operations in order to achieve greater efficiency, productivity, and customer satisfaction.

Talk by: Gavita Regunath

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Weaving the Data Mesh in the Department of Defense

Weaving the Data Mesh in the Department of Defense

2023-07-26 Watch
video

The Chief Digital and AI Office (CDAO) was created to lead the strategy and policy on data, analytics, and AI adoption across the Department of Defense. To enable that vision, the Department must achieve new ways to scale and standardize delivery under a global strategy while enabling decentralized workflows that capture the wealth of data and domain expertise.

CDAO’s strategy and goals are aligned with data mesh principles. This alignment starts with providing enterprise-level infrastructure and services to advance the adoption of data, analytics, and AI, creating the self-service data infrastructure as a platform. And it continues through implementing policy for federated computational governance centered around decentralizing data ownership to become domain-oriented but enforcing the quality and trustworthiness of data. CDAO seeks to expand and make enterprise data more accessible through providing data as a product and leveraging a federated data catalog to designate authoritative data and common data models. This results in domain-oriented, decentralized data ownership to empower the business domains across the Department to increase mission and business impact that result in significant cost savings, saving lives, and data serving as a “public good.”

Please join us in our session as we discuss how the CDAO leverages modern, innovative implementations that accelerate the delivery of data and AI throughout one of the largest distributed organizations in the world; the Department of Defense. We will walk through how this enables delivery in various Department of Defense use cases.

Talk by: Brad Corwin and Cody Ferguson

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Delta Sharing: The Key Data Mesh Enabler

Delta Sharing: The Key Data Mesh Enabler

2023-07-26 Watch
video

Data Mesh is an emerging architecture pattern that challenges the centralized data platform approach by empowering different engineering teams to own the data products in a specific business domain. One of the keys to the success of any Data Mesh initiative is selecting the right protocol for Data Sharing between different business data domains that could potentially be implemented through different technologies and cloud providers.

In this session you will learn about how the Delta Sharing protocol and the Delta table format have enabled the historically stuck-in-the-past energy and construction industry to be catapulted to the 21st century by way of a modern Data Mesh implementation based on Azure Databricks.

Talk by: Francesco Pizzolon

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How Mars Achieved a People Analytics Transformation with a Modern Data Stack

How Mars Achieved a People Analytics Transformation with a Modern Data Stack

2023-07-26 Watch
video

People Analytics at Mars was formed two years ago as part of an ambitious journey to transform our HR analytics capabilities. To transform, we needed to build foundational services to provide our associates with helpful insights through fast results and resolving complex problems. Critical in that foundation are data governance and data enablement which is the responsibility of the Mars People Data Office team whose focus is to deliver high quality and reliable data that is reusable for current and future People Analytics use cases. Come learn how this team used Databricks in helping Mars achieve its People Analytics Transformation.

Talk by: Rachel Belino and Sreeharsha Alagani

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Simplifying Migrations to Lakehouse

Simplifying Migrations to Lakehouse

2023-07-26 Watch
video

This session will cover:

  • Challenges with legacy platforms
  • Perenti Databricks migration journey
  • Reimagining migrations the Databricks way
  • The Databricks migration methodology and approach

Talk by: Dan Smith

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Unlocking the Value of Data Sharing in Financial Services with Lakehouse

Unlocking the Value of Data Sharing in Financial Services with Lakehouse

2023-07-26 Watch
video
Spencer Cook (Databricks)

The emergence of secure data sharing is already having a tremendous economic impact, in large part due to the increasing ease and safety of sharing financial data. McKinsey predicts that the impact of open financial data will be 1-4.5% of GDP globally by 2030. This indicates there is a narrowing window on a massive opportunity for financial institutions and it is critical that they prioritize data sharing. This session will first address the ways in which Delta Sharing and Unity Catalog on a Databricks Lakehouse architecture provides a simple and open framework for building a Secure Data Sharing platform in the financial services industry. Next we will use a Databricks environment to walk through different use cases for open banking data and secure data sharing, demonstrating how they will be implemented using Delta Sharing, Unity Catalog, and other parts of the Lakehouse platform. The use cases will include examples of new product features such as Databricks to Databricks sharing, change data feed and streaming on Delta Sharing, table/column lineage, and the Delta Sharing Excel plugin to demonstrate state of the art sharing capabilities.

In this session, we will discuss secure data sharing on Databricks Lakehouse and will demonstrate architecture and code for common sharing use cases in the finance industry.

Talk by: Spencer Cook

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Feeding the World One Plant at a Time

Feeding the World One Plant at a Time

2023-07-26 Watch
video
Naveed Farooqui , Fahad Khan (Volt Active Data)

Join this session to learn how the CVML and Data Platform team at BlueRiver Technology utilized Databricks to maximize savings on herbicide usage and revolutionize Precision Agriculture.

Blue River Technology is an agricultural technology company that uses computer vision and machine learning (CVML) to revolutionize the way crops are grown and harvested. BRT’s See & Spray technology, which uses CVML to identify and precisely determine whether the plant is a weed or a crop so it can deliver a small, targeted dose of herbicide directly to the plant, while leaving the crop unharmed. By using this approach, Blue River significantly reduces the amount of herbicides used in agriculture by over 70% and has a positive impact on the environment and human health.

The technical challenges we seek to overcome are:  - Processing massive petabytes of proprietary data at scale and in real time. Equipment in the field can generate up to 40TBs of data per hour per machine. - Aggregating, curating and visualizing at scale data can often be convoluted, error-prone and complex.  - Streamlining pipelines runs from weeks to hours to ensure continuous delivery of data.  - Abstracting and automating  the infra, deployment and data management from each program. - Building downstream data products based on descriptive analysis, predictive analysis or prescriptive analysis to drive the machine behavior.

The business questions we seek to answer for any machine are:  - Are we getting the spray savings we anticipated? - Are we reducing the use of herbicide at the scale we expected? - Are spraying nozzles performing at the expected rate? - Finding the relevant data to troubleshoot new edge conditions.  - Providing a simple interface for data exploration to both technical and non-technical personas to help improve our model. - Identifying repetitive and new faults in our machines. - Filtering out data based on certain incidents. - Identifying anomalies for e.g. sudden drop in spray saving, like frequency of broad spray suddenly is too high.

How we are addressing and plan to address these challenges: - Designating Databricks as our purposeful DB for all data - using the bronze, silver and gold layer standards. - Processing new machine logs using a Delta Live table as a source both in batch and incremental manner. - Democratize access for data scientists, product managers, data engineers who are not proficient with the robotic software stack via notebooks for quick development as well as real time dashboards.

Talk by: Fahad Khan and Naveed Farooqui

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

Sponsored: Kyvos | Analytics 100x Faster Lowest Cost w/ Kyvos & Databricks, Even on Trillions Rows

Sponsored: Kyvos | Analytics 100x Faster Lowest Cost w/ Kyvos & Databricks, Even on Trillions Rows

2023-07-26 Watch
video

Databricks and Kyvos together are helping organizations build their next-generation cloud analytics platform. A platform that can process and analyze massive amounts of data, even trillions of rows, and provide multidimensional insights instantly. Combining the power of Databricks with the speed, scale and cost optimization capabilities of Kyvos Analytics Acceleration Platform, customers can go beyond the limit of their analytics boundaries. Join our session to know how and also learn about a real-world use case.

Talk by: Leo Duncan

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

Activate Your Lakehouse with Unity Catalog

Activate Your Lakehouse with Unity Catalog

2023-07-26 Watch
video

Building a lakehouse is straightforward today thanks to many open source technologies and Databricks. However, it can be taxing to extract value from lakehouses as they grow without robust data operations. Join us to learn how YipitData uses the Unity Catalog to streamline data operations and discover best practices to scale your own Lakehouse. At YipitData, our 15+ petabyte Lakehouse is a self-service data platform built with Databricks and AWS, supporting analytics for a data team of over 250. We will share how leveraging Unity Catalog accelerates our mission to help financial institutions and corporations leverage alternative data by:

  • Enabling clients to universally access our data through a spectrum of channels, including Sigma, Delta Sharing, and multiple clouds
  • Fostering collaboration across internal teams using a data mesh paradigm that yields rich insights
  • Strengthening the integrity and security of data assets through ACLs, data lineage, audit logs, and further isolation of AWS resources
  • Reducing the cost of large tables without downtime through automated data expiration and ETL optimizations on managed delta tables

Through our migration to Unity Catalog, we have gained tactics and philosophies to seamlessly flow our data assets internally and externally. Data platforms need to be value-generating, secure, and cost-effective in today's world. We are excited to share how Unity Catalog delivers on this and helps you get the most out of your lakehouse.

Talk by: Anup Segu

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

An API for Deep Learning Inferencing on Apache Spark™

An API for Deep Learning Inferencing on Apache Spark™

2023-07-26 Watch
video

Apache Spark is a popular distributed framework for big data processing. It is commonly used for ETL (extract, transform and load) across large datasets. Today, the transform stage can often include the application of deep learning models on the data. For example, common models can be used for classification of images, sentiment analysis of text, language translation, anomaly detection, and many other use cases. Applying these models within Spark can be done today with the combination of PySpark, Pandas_UDF, and a lot of glue code. Often, that glue code can be difficult to get right, because it requires expertise across multiple domains - deep learning frameworks, PySpark APIs, pandas_UDF internal behavior, and performance optimization.

In this session, we introduce a new, simplified API for deep learning inferencing on Spark, introduced in SPARK-40264 as a collaboration between NVIDIA and Databricks, which seeks to standardize and open source this glue code to make deep learning inference integrations easier for everyone. We discuss its design and demonstrate its usage across multiple deep learning frameworks and models.

Talk by: Lee Yang

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Best Data Warehouse is a Lakehouse: Databricks Achieves Ops Efficiency w/ Lakehouse Architecture

Best Data Warehouse is a Lakehouse: Databricks Achieves Ops Efficiency w/ Lakehouse Architecture

2023-07-26 Watch
video
Naveen Zutshi (Databricks) , Romit Jadhwani (Databricks)

At Databricks, we use the Lakehouse architecture to build an optimized data warehouse that drives better insights, increased operational efficiency, and reduces costs. In this session, Naveen Zutshi, CIO at Databricks and Romit Jadhwani, Senior Director Analytics and Integrations at Databricks will discuss the Databricks journey and provide technical and business insights into how these results were achieved.

The session will cover topics such as medallion architecture, building efficient third party integrations, how Databricks built various data products/services on the data warehouse, and how to use governance to break down data silos and achieve consistent sources of truth.

Talk by: Naveen Zutshi and Romit Jadhwani

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

Building AI-Powered Products with Foundation Models

Building AI-Powered Products with Foundation Models

2023-07-26 Watch
video

Foundation models make for fantastic demos, but in practice, they can be challenging to put into production. These models work well over datasets that match common training distributions (e.g., generating WEBTEXT or internet images), but may fail on domain-specific tasks or long-tail edge case; the settings that matter most to organizations building differentiated products. We propose a data-centric development approach that organizations can use to adapt foundation models to their own private/proprietary datasets.

We'll describe several techniques, including supervision "warmstarts" and interactive prompting (spoiler alert: no code needed). To make these techniques come to life, we'll walk through real case studies describing how we've seen data-centric development drive AI-powered products, from "AI assist" use cases (e.g., copywriting assistants) to "fully automated" solutions (e.g., loan processing engines).

Talk by: Vincent Chen

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Building Apps on the Lakehouse with Databricks SQL

Building Apps on the Lakehouse with Databricks SQL

2023-07-26 Watch
video

BI applications are undoubtedly one of the major consumers of a data warehouse. Nevertheless, the prospect of accessing data using standard SQL is appealing to many more stakeholders than just the data analysts. We’ve heard from customers that they experience an increasing demand to provide access to data in their lakehouse platforms from external applications beyond BI, such as e-commerce platforms, CRM systems, SaaS applications, or custom data applications developed in-house. These applications require an “always on” experience, which makes Databricks SQL Serverless a great fit.

In this session, we give an overview of the approaches available to application developers to connect to Databricks SQL and create modern data applications tailored to needs of users across an entire organization. We discuss when to choose one of the Databricks native client libraries for languages such as Python, Go, or node.js and when to use the SQL Statement Execution API, the newest addition to the toolset. We also explain when ODBC and JDBC might not be the best for the task and when they are your best friends. Live demos are included.

Talk by: Adriana Ispas and Chris Stevens

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Combining Privacy Solutions to Solve Data Access at Scale

Combining Privacy Solutions to Solve Data Access at Scale

2023-07-26 Watch
video

The trend that has made data easier to collect and analyze has only aggravated privacy risks. Luckily, a range of privacy technologies have emerged to enable private data management; differential privacy, synthetic data, confidential computing. In isolation, those technologies have had a limited impact because they did not always bring the 10x improvement expected by data leaders.

Combining these privacy technologies has been the real game changer. We will demonstrate that the right mix of technologies brings the optimal balance of privacy and flexibility at the scale of the data warehouse. We will illustrate this by real-life applications of Sarus in three domains:

  • Healthcare: how to make hospital data available for research at scale in full compliance
  • Finance: how to pool data between several banks to fight criminal transactions
  • Marketing: how to build insights on combined data from partners and distributors

The examples will be illustrated using data stored in Databricks and queried using Sarus differential privacy engine.

Talk by: Maxime Agostini

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Comparing Databricks and Snowflake for Machine Learning

Comparing Databricks and Snowflake for Machine Learning

2023-07-26 Watch
video
Michael Green , Don Scott (Microsoft)

Snowflake and Databricks both aim to provide data science toolkits for machine learning workflows, albeit with different approaches and resources. While developing ML models is technically possible using either platform, the Hitachi Solutions Empower team tested which solution will be easier, faster, and cheaper to work with in terms of both user experience and business outcomes for our customers. To do this, we designed and conducted a series of experiments with use cases from the TPCx-AI benchmark standard. We developed both single-node and multi-node versions of these experiments, which sometimes required us to set up separate compute infrastructure outside of the platform, in the case of Snowflake. We also built datasets of various sizes (1GB, 10GB, and 100GB), to assess how each platform/node setup handles scale.

Based on our findings, on the average, Databricks is faster, cheaper, and easier to use for developing machine learning models, and we use it exclusively for data science on the Empower platform. Snowflake’s reliance on third party resources for distributed training is a major drawback, and the need to use multiple compute environments to scale up training is complex and, in our view, an unnecessary complication to achieve best results.

Talk by: Michael Green and Don Scott

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Databricks SQL Serverless Under the Hood: How We Use ML to Get the Best Price/Performance

Databricks SQL Serverless Under the Hood: How We Use ML to Get the Best Price/Performance

2023-07-26 Watch
video
Gaurav Saraf (Databricks) , Mostafa Mokhtar (Databricks) , Jeremy Lewallen (Databricks)

Join this session to learn how Databricks SQL Serverless warehouses use ML to make large improvements in price-performance for both ETL and BI workloads. We will demonstrate how they can cater to an organization’s peak concurrency needs for BI and showcase the latest advancements in resource-based scheduling, autoscaling, and caching enhancements that allow for seamless performance and workload management. We will deep dive into new features such as Predictive I/O and Intelligent Workload Management, and show new price/performance benchmarks.

Talk by: Gaurav Saraf, Mostafa Mokhtar, and Jeremy Lewallen

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksin

De-Risking Language Models for Faster Adoption

De-Risking Language Models for Faster Adoption

2023-07-26 Watch
video

Language models are incredible engineering breakthroughs but require auditing and risk management before productization. These systems raise concerns about toxicity, transparency and reproducibility, intellectual property licensing and ownership, disinformation and misinformation, supply chains, and more. How can your organization leverage these new tools without taking on undue or unknown risks? While language models and associated risk management are in their infancy, a small number of best practices in governance and risk are starting to emerge. If you have a language model use case in mind, want to understand your risks, and do something about them, this presentation is for you! We'll be covering the following: 

  • Studying past incidents in the AI Incident Database and using this information to guide debugging.
  • Adhering to authoritative standards, like the NIST AI Risk Management Framework. 
  • Finding and fixing common data quality issues.
  • Applying general public tools and benchmarks as appropriate (e.g., BBQ, Winogender, TruthfulQA).
  • Binarizing specific tasks and debugging them using traditional model assessment and bias testing.
  • Engineering adversarial prompts with strategies like counterfactual reasoning, role-playing, and content exhaustion. 
  • Conducting random attacks: random sequences of attacks, prompts, or other tests that may evoke unexpected responses. 
  • Countering prompt injection attacks, auditing for backdoors and data poisoning, ensuring endpoints are protected with authentication and throttling, and analyzing third-party dependencies. 
  • Engaging stakeholders to help find problems system designers and developers cannot see. 
  • Everyone knows that generative AI is going to be huge. Don't let inadequate risk management ruin the party at your organization!

Talk by: Patrick Hall

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Determining When to Use GPU for Your ETL Pipelines at Scale

Determining When to Use GPU for Your ETL Pipelines at Scale

2023-07-26 Watch
video
Hao Zhu (Databricks) , Chris Vo (Databricks)

Assuming you have hundreds of jobs and/or clusters in your Databricks workspace, what is the best way to determine if those pipelines can take advantage of GPU for speed and/or cost saving? Join this session to learn about using the NVIDIA GPU Qualification Tool applied at scale to project potential cost saving for your entire workspace.

Talk by: Chris Vo and Hao Zhu

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Explainable Data Drift for NLP

Explainable Data Drift for NLP

2023-07-26 Watch
video

Detecting data drift, although far from solved-for tabular data, has become a common approach to monitor ML models in production. For Natural Language Processing (NLP) on the other hand the question remains mostly open. In this session, we will present and compare two approaches. In the first approach, we will demonstrate how by extracting a wide range of explainable properties per document such as topics, language, sentiment, named entities, keywords and more we are able to explore potential sources of drift. We will show how these properties can be consistently tracked over time, how they can be used to detect meaningful data drift as soon as it occurs and how they can be used to explain and fix the root cause.

The second approach we will present is to detect drift by using the embeddings of common foundation models (such as GPT3 in the Open AI model family) and use them to identify areas in the embedding space in which significant drift has occurred. These areas in embedding space should then be characterized in a human-readable way to enable root cause analysis of the detected drift. We will compare the performance and explainability of these two methods and explore the pros and cons of each approach.

Talk by: Noam Bressler

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

From Insights to Recommendations:How SkyWatch Predicts Demand for Satellite Imagery Using Databricks

From Insights to Recommendations:How SkyWatch Predicts Demand for Satellite Imagery Using Databricks

2023-07-26 Watch
video
Aayush Patel (SkyWatch)

SkyWatch is on a mission to democratize earth observation data and make it simple for anyone to use.

In this session, you will learn about how SkyWatch aggregates demand signals for the EO market and turns them into monetizable recommendations for satellite operators. Skywatch’s Data & Platform Engineer, Aayush will share how the team built a serverless architecture that synthesizes customer requests for satellite images and identifies geographic locations with high demand, helping satellite operators maximize revenue and satisfying a broad range of EO data hungry consumers.

This session will cover:

  • Challenges with Fulfillment in Earth Observation ecosystem
  • Processing large scale GeoSpatial Data with Databricks
  • Databricks in-built H3 functions
  • Delta Lake to efficiently store data leveraging optimization techniques like Z-Ordering
  • Data LakeHouse Architecture with Serverless SQL Endpoints and AWS Step Functions
  • Building Tasking Recommendations for Satellite Operators

Talk by: Aayush Patel

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

How the Texas Rangers Revolutionized Baseball Analytics with a Modern Data Lakehouse

How the Texas Rangers Revolutionized Baseball Analytics with a Modern Data Lakehouse

2023-07-26 Watch
video
Alexander Booth (Texas Rangers Baseball Club) , Oliver Dykstra (Texas Rangers)

Don't miss this session where we demonstrate how the Texas Rangers baseball team organized their predictive models by using MLflow and the MLRegistry inside Databricks. They started using Databricks as a simple solution to centralizing our development on the cloud. This helped lessen the issue of siloed development in our team, and allowed us to leverage the benefits of distributed cloud computing.

But we quickly found that Databricks was a perfect solution to another problem that we faced in our data engineering stack. Specifically, cost, complexity, and scalability issues hampered our data architecture development for years, and we decided we needed to modernize our stack by migrating to a lakehouse. With Databricks Lakehouse, ad-hoc-analytics, ETL operations, and MLOps all living within Databricks, development at scale has never been easier for our team.

Going forward, we hope to fully eliminate the silos of development, and remove the disconnect between our analytics and data engineering teams. From computer vision, pose analytics, and player tracking, to pitch design, base stealing likelihood, and more, come see how the Texas Rangers are using innovative cloud technologies to create action-driven reports from the current sea of big data.

Talk by: Alexander Booth and Oliver Dykstra

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Improve Apache Spark™ DS v2 Query Planning Using Column Stats

Improve Apache Spark™ DS v2 Query Planning Using Column Stats

2023-07-26 Watch
video

When doing the TPC-DS benchmark using external v2 data source, we have observed that for several of the queries, DS v1 has better join plans than Apache Spark. The main reason is that DS v1 uses column stats, especially number of distinct values (NDV) for query optimization. Currently, Spark™ DS v2 only has interfaces for data sources to report table statistics such as size in bytes and number of rows. In order to use column stats in DS v2, we have added new interfaces to allow external data sources to report column stats to Spark.

For a data source with huge data, it’s always challenging to get the column stats, especially the NDV. We plan to calculate NDV using Apache DataSketches Theta sketch and save the serialized compact sketch in the statistics file. The NDV and other column stats will be reported to Spark for query plan optimization.

Talk by: Huaxin Gao and Parth Chandra

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

Leveraging IoT Data at Scale to Mitigate Global Water Risks Using Apache Spark™ Streaming and Delta

2023-07-26 Watch
video

Every year, billions of dollars are lost due to water risks from storms, floods, and droughts. Water data scarcity and excess are issues that risk models cannot overcome, creating a world of uncertainty. Divirod is building a platform of water data by normalizing diverse data sources of varying velocity into one unified data asset. In addition to publicly available third-party datasets, we are rapidly deploying our own IoT sensors. These sensors ingest signals at a rate of about 100,000 messages per hour into preprocessing, signal-processing, analytics, and postprocessing workloads in one spark-streaming pipeline to enable critical real-time decision-making processes. By leveraging streaming architecture, we were able to reduce end-to-end latency from tens of minutes to just a few seconds.

We are leveraging Delta Lake to provide a single query interface across multiple tables of this continuously changing data. This enables data science and analytics workloads to always use the most current and comprehensive information available. In addition to the obvious schema transformations, we implement data quality metrics and datum conversions to provide a trustworthy unified dataset.

Talk by: Adam Wilson and Heiko Udluft

Here’s more to explore: Big Book of Data Engineering: 2nd Edition: https://dbricks.co/3XpPgNV The Data Team's Guide to the Databricks Lakehouse Platform: https://dbricks.co/46nuDpI

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Making the Shift to Application-Driven Intelligence

Making the Shift to Application-Driven Intelligence

2023-07-26 Watch
video

In the digital economy, application-driven intelligence delivered against live, real-time data will become a core capability of successful enterprises. It has the potential to improve the experience that you provide to your customers and deepen their engagement. But to make application-driven intelligence a reality, you can no longer rely only on copying live application data out of operational systems into analytics stores. Rather, it takes the unique real-time application-serving layer of a MongoDB database combined with the scale and real-time capabilities of a Databricks Lakehouse to automate and operationalize complex and AI-enhanced applications at scale.

In this session, we will show how it can be seamless for developers and data scientists to automate decisioning and actions on fresh application data and we'll deliver a practical demonstration on how operational data can be integrated in real time to run complex machine learning pipelines.

Talk by: Mat Keep and Ashwin Gangadhar

Here’s more to explore: Why the Data Lakehouse Is Your next Data Warehouse: https://dbricks.co/3Pt5unq Lakehouse Fundamentals Training: https://dbricks.co/44ancQs

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc