talk-data.com

Topic: Databricks (big_data, analytics, spark), 561 tagged activities
Activity trend: 515 peak per quarter, 2020-Q1 to 2026-Q1

Activities (filtered by: DATA + AI Summit 2023)

Monetizing Data Assets: Sharing Data, Models and Features

Data is an asset. Selling and sharing data has largely been solved, and hosted models exist (for example, ChatGPT), but moving sensitive data across the public internet or across clouds remains problematic. Sharing features (the results of feature engineering) opens up potential new revenue streams, and sharing models can likewise be monetized while avoiding the transfer of sensitive data.

This session will walk through a few examples of how to share models and features to generate new revenue streams using Delta Sharing, MLflow, and Databricks.
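
As a rough illustration of that pattern, here is a minimal sketch assuming a Unity Catalog-enabled Databricks workspace (where `spark` is predefined); the share, recipient, table, and model names are hypothetical, and the consumer side uses the open-source delta-sharing client rather than any specific partner tooling.

```python
import mlflow
import delta_sharing
from sklearn.linear_model import LogisticRegression

# Provider side: expose a feature table via Delta Sharing instead of copying raw data
spark.sql("CREATE SHARE IF NOT EXISTS customer_features_share")
spark.sql("ALTER SHARE customer_features_share ADD TABLE ml.features.customer_features")
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_org")
spark.sql("GRANT SELECT ON SHARE customer_features_share TO RECIPIENT partner_org")

# Provider side: train on those features and register the model with MLflow
pdf = spark.table("ml.features.customer_features").toPandas()
model = LogisticRegression().fit(pdf.drop(columns=["label"]), pdf["label"])
with mlflow.start_run():
    mlflow.sklearn.log_model(model, "model", registered_model_name="churn_scorer")

# Consumer side: read the shared features with the open Delta Sharing client
shared = delta_sharing.load_as_pandas(
    "partner_profile.share#customer_features_share.features.customer_features")
```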

Talk by: Keith Anderson and Avinash Sooriyarachchi

Planning and Executing a Snowflake Data Warehouse Migration to Databricks

Organizations are going through a critical phase of data infrastructure modernization, laying the foundation for the future, and adapting to support growing data and AI needs. Organizations that embraced cloud data warehouses (CDW) such as Snowflake have ended up trying to use a data warehousing tool for ETL pipelines and data science. This created unnecessary complexity and resulted in poor performance since data warehouses are optimized for SQL-based analytics only.

Realizing the limitations and pain of cloud data warehouses, organizations are turning to a lakehouse-first architecture. Though a cloud-platform-to-cloud-platform migration should be relatively easy, the breadth of the Databricks platform provides flexibility and hence requires careful planning and execution. In this session, we present the migration methodology, technical approaches, automation tools, product/feature mapping, a technical demo, and best practices, using real-world case studies, for migrating data, ELT pipelines, and warehouses from Snowflake to Databricks.
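
As a hedged sketch of one common early step in such a migration (not the presenters' tooling), the snippet below reads a Snowflake table with the Spark Snowflake connector bundled in Databricks runtimes and lands it as a Delta table; the account URL, secret scope, and table names are placeholders.

```python
# Hypothetical single-table migration step: Snowflake -> Delta on Databricks
sf_options = {
    "sfUrl": "myorg-myaccount.snowflakecomputing.com",        # placeholder account
    "sfUser": dbutils.secrets.get("snowflake", "user"),        # Databricks secret scope
    "sfPassword": dbutils.secrets.get("snowflake", "password"),
    "sfDatabase": "SALES",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MIGRATION_WH",
}

orders = (spark.read
          .format("snowflake")
          .options(**sf_options)
          .option("dbtable", "ORDERS")
          .load())

(orders.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("lakehouse.sales.orders"))   # target Unity Catalog table
```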

Talk by: Satish Garla and Ramachandran Venkat

Post-Merger: Implementing Unity Catalog Across Multiple Accounts

Warner Media and Discovery have recently merged to form Warner Bros Discovery. Owning two Databricks accounts and wanting to maintain their separation, our data governance team successfully implemented Unity Catalog as our data governance solution across both accounts, allowing our teams to use the data assets of the two organizations collaboratively and securely.

This session is aimed at sharing that success story, including initial challenges, our approach, our architecture, the actual implementation, and user success post-implementation.
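
For illustration, here is a minimal sketch of the kind of catalog-level grants a central governance team might issue once both organizations' users are represented in Unity Catalog; the catalog, schema, table, and group names are entirely hypothetical.

```python
# Hypothetical grants letting a group from one legacy organization query shared assets
spark.sql("GRANT USE CATALOG ON CATALOG wbd_shared TO `discovery-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA wbd_shared.marketing TO `discovery-analysts`")
spark.sql("GRANT SELECT ON TABLE wbd_shared.marketing.campaigns TO `discovery-analysts`")

# Audit the resulting permissions from Unity Catalog's information schema
spark.sql("""
    SELECT grantee, privilege_type, table_schema, table_name
    FROM wbd_shared.information_schema.table_privileges
""").show()
```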

Talk by: Ramprasad Koya and Susheel Lakshmipathi

Scaling AI Applications with Databricks, HuggingFace and Pinecone

The production and management of large-scale vector embeddings can be a challenging problem, and the integration of Databricks, Hugging Face, and Pinecone offers a powerful solution. Vector embeddings have become an essential tool in the development of AI-powered applications. Embeddings are representations of data learned by machine learning models, and high-quality embeddings are unlocking use cases like semantic search, recommendation engines, and anomaly detection. Databricks' Apache Spark™ ecosystem together with Hugging Face's Transformers library enables large-scale embedding production using GPU processing, while Pinecone's vector database provides ultra-low-latency querying and upserting of billions of embeddings, allowing for high-quality embeddings at scale for real-time AI apps.

In this session, we will present a concrete use case of this integration in the context of a natural language processing application. We will demonstrate how Pinecone's vector database can be integrated with Databricks and Hugging Face to produce large-scale vector embeddings of text data and how these embeddings can be used to improve the performance of various AI applications. You will see the benefits of this integration in terms of speed, scalability, and cost efficiency. By leveraging the GPU processing capabilities of Databricks and the ultra low-latency querying capabilities of Pinecone, we can significantly improve the performance of NLP tasks while reducing the cost and complexity of managing large-scale vector embeddings. You will learn about the technical details of this integration and how it can be implemented in your own AI projects, and gain insights into the speed, scalability, and cost efficiency benefits of using this solution.
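
A minimal sketch of that flow, with hypothetical table and index names: a Hugging Face sentence-transformer runs inside a pandas UDF on Databricks and the resulting vectors are upserted into Pinecone. The init/upsert calls follow the 2023-era pinecone-client; newer client versions use a Pinecone(...) object instead.

```python
import pandas as pd
import pinecone
from pyspark.sql.functions import pandas_udf
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

@pandas_udf("array<float>")
def embed(texts: pd.Series) -> pd.Series:
    # Runs on the workers; encode() batches on GPU when one is available
    return pd.Series(encoder.encode(texts.tolist()).tolist())

docs = spark.table("docs.raw.articles").withColumn("embedding", embed("body"))

pinecone.init(api_key="<PINECONE_API_KEY>", environment="us-east1-gcp")
index = pinecone.Index("articles")
for row in docs.select("id", "embedding").toLocalIterator():
    index.upsert(vectors=[(str(row["id"]), list(row["embedding"]))])
```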

Talk by: Roie Schwaber-Cohen

Sponsored: Ascend.io | Publish a Data Mesh Product in Under 10 Minutes w/ Delta Sharing & Ascend

Learn how to quickly ingest, transform and share data in Delta Lake with intelligent data pipelines on Ascend. Using live data, we'll cover everything you need to know to get your first data products up and running fast. We'll talk about first principles for building a scalable mesh and tips for reducing maintenance work as you grow. And you'll see how Ascend applies patented fingerprinting technology to manage change across your interconnected pipelines as you build out the mesh.

Talk by: Jon Osborn

Here’s more to explore: A New Approach to Data Sharing: https://dbricks.co/44eUnT1

Sponsored: AWS | Build Generative AI Solution on Open Source Databricks Dolly 2.0 on Amazon SageMaker

Create a custom chat-based solution to query and summarize your data within your VPC using Dolly 2.0 and Amazon SageMaker. In this talk, you will learn about Dolly 2.0, Databricks' state-of-the-art open source LLM available for commercial use, and Amazon SageMaker, AWS's premiere toolkit for ML builders. You will learn how to deploy and customize models to reference your data using retrieval augmented generation (RAG) and additional fine-tuning techniques…all using open-source components available today.
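
As a hedged sketch (not the presenters' exact setup), deploying a Dolly 2.0 checkpoint from the Hugging Face Hub to a SageMaker real-time endpoint can look roughly like this; the execution role, container versions, instance type, and the smaller dolly-v2-3b checkpoint are placeholder choices.

```python
from sagemaker.huggingface import HuggingFaceModel

role = "arn:aws:iam::<ACCOUNT_ID>:role/<SAGEMAKER_EXECUTION_ROLE>"  # placeholder

# Pull the model straight from the Hugging Face Hub into the HF inference container
dolly = HuggingFaceModel(
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={"HF_MODEL_ID": "databricks/dolly-v2-3b", "HF_TASK": "text-generation"},
)

predictor = dolly.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
print(predictor.predict({"inputs": "Summarize our Q2 sales notes in two sentences."}))
```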

Talk by: Venkat Viswanathan and Karl Albertsen

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Sponsored by: Immuta | Building an End-to-End MLOps Workflow with Automated Data Access Controls

WorldQuant Predictive’s customers rely on our predictions to understand how changing world and market conditions will impact decisions to be made. Speed is critical, and so are accuracy and resilience. To that end, our data team built a modern, automated MLOps data flow using Databricks as a key part of our data science tooling, and integrated with Immuta to provide automated data security and access control.

In this session, we will share details of how we used policy-as-code to support our globally distributed data science team with secure data sharing, testing, validation and other model quality requirements. We will also discuss our data science workflow that uses Databricks-hosted MLflow together with an Immuta-backed custom feature store to maximize speed and quality of model development through automation. Finally, we will discuss how we deploy the models into our customized serverless inference environment, and how that powers our industry solutions.

Talk by: Tyler Ditto

The C-Level Guide to Data Strategy Success with the Lakehouse

Join us for a practical session on implementing a data strategy that leverages people, process, and technology to meet the growing demands of your business stakeholders for faster innovation at lower cost. In this session we will share real-world examples of best practices and things to avoid as you drive your strategy from the board to the business units in your organization.

Talk by: Robin Sutara and Dael Williamson

The First Sports & Ent Data Market Powered by Pumpjack Dataworks, Revelate, Immuta & Databricks

Creating a secure and easily actionable marketplace is no simple task. Add to this governance requirements of privacy frameworks and responsibilities of protecting consumer data, and things get harder. With Pumpjack Dataworks partnering with Databricks, Immuta, and Revelate, we bring secure, privacy-focused data products directly to data consumers.

Talk by: Corey Zwart and Tom Tercek

Use Apache Spark™ from Anywhere: Remote Connectivity with Spark Connect

Over the past decade, developers, researchers, and the community at large have successfully built tens of thousands of data applications using Apache Spark™. Since then, the use cases and requirements of data applications have evolved. Today, every application, from web services that run in application servers and interactive environments such as notebooks and IDEs, to phones and edge devices such as smart home devices, wants to leverage the power of data. However, Spark's driver architecture is monolithic, running client applications on top of a scheduler, optimizer, and analyzer. This architecture makes it hard to address these new requirements, as there is no built-in capability to remotely connect to a Spark cluster from languages other than SQL.

Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere: it can be embedded in modern data applications, in IDEs, notebooks, and programming languages. This session highlights how simple it is to connect to Spark using Spark Connect from any data application or IDE. We will do a deep dive into the architecture of Spark Connect and provide an outlook on how the community can participate in extending Spark Connect to new programming languages and frameworks, bringing the power of Spark everywhere.
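
A minimal sketch of what that client looks like from Python (Spark 3.4+); the endpoint below is a placeholder, and the client can be installed with `pip install "pyspark[connect]"`.

```python
from pyspark.sql import SparkSession

# The thin client builds DataFrame operations locally and ships unresolved
# logical plans to the remote Spark Connect server (default port 15002).
spark = (SparkSession.builder
         .remote("sc://spark-connect.example.com:15002")   # hypothetical endpoint
         .getOrCreate())

df = spark.range(10).selectExpr("id", "id * id AS squared")
df.show()   # resolution, optimization, and execution happen on the remote cluster
```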

Talk by: Martin Grund and Stefania Leone

What’s New With Platform Security and Compliance in the Databricks Lakehouse Platform

At Databricks, we know that data is one of your most valuable assets and must always be protected. That's why security is built into every layer of the Databricks Lakehouse Platform. Databricks provides comprehensive security to protect your data and workloads, such as encryption, network controls, data governance, and auditing.

In this session, you will hear from Databricks product leaders on the platform security and compliance progress made over the past year, with demos on how administrators can start protecting workloads fast. You will also learn more about the roadmap that delivers on the Databricks commitment to you as the most trusted, compliant, and secure data and AI platform with the Databricks Lakehouse.

Talk by: Samrat Ray and David Veuve

Colossal-AI: Scaling AI Models in the Big Model Era

The proliferation of large models based on the Transformer architecture has outpaced advances in hardware, resulting in an urgent need for the ability to distribute enormous models across multiple GPUs. Despite this growing demand, best practices for choosing an optimal strategy are still lacking due to the breadth of knowledge required across HPC, DL, and distributed systems. These difficulties have stimulated both AI and HPC developers to explore the key questions: How can training and inference efficiency of large models be improved to reduce costs? How can larger AI models be accommodated even with limited resources? What can be done to enable more community members to easily access large models and large-scale applications?

In this session, we investigate efforts to answer these questions. First, diverse parallelization is an important tool to improve the efficiency of large model training and inference. Heterogeneous memory management can help enhance the model accommodation capacity of processors like GPUs. Furthermore, user-friendly DL systems for large models significantly reduce the specialized background knowledge users need, allowing more community members to get started with larger models more efficiently. We will provide participants with a system-level open-source solution, Colossal-AI. More information can be found at https://github.com/hpcaitech/ColossalAI.

Talk by: James Demmel and Yang You

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Delta Kernel: Simplifying Building Connectors for Delta

Since the release of Delta 2.0, the project has been growing at a breakneck speed. In this session, we will cover all the latest capabilities that make Delta Lake the best format for the lakehouse. Based on lessons learned from this past year, we will introduce Project Aqueduct and how it will simplify building Delta Lake APIs, from Rust and Go to Trino, Flink, and PySpark.

Talk by: Tathagata Das and Denny Lee

Enterprise Use of Generative AI Needs Guardrails: Here's How to Build Them

Large Language Models (LLMs) such as ChatGPT have revolutionized AI applications, offering unprecedented potential for complex real-world scenarios. However, fully harnessing this potential comes with unique challenges such as model brittleness and the need for consistent, accurate outputs. These hurdles become more pronounced when developing production-grade applications that utilize LLMs as a software abstraction layer.

In this session, we will tackle these challenges head-on. We introduce Guardrails AI, an open-source platform designed to mitigate risks and enhance the safety and efficiency of LLMs. We will delve into specific techniques and advanced control mechanisms that enable developers to optimize model performance effectively. Furthermore, we will explore how implementing these safeguards can significantly improve the development process of LLMs, ultimately leading to safer, more reliable, and more robust real-world AI applications.
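
To make the guardrail idea concrete, here is a library-agnostic sketch of the validate-and-re-ask loop such tooling automates; it is not the Guardrails AI API itself, and `call_llm` plus the schema are hypothetical stand-ins.

```python
import json
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):
    severity: int        # expected 1-5
    team: str
    summary: str

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client the application actually uses."""
    raise NotImplementedError

def guarded_triage(ticket_text: str, max_retries: int = 2) -> TicketTriage:
    prompt = f"Return JSON with severity, team, and summary for: {ticket_text}"
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)
        try:
            return TicketTriage.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Corrective re-ask: feed the validation error back to the model
            prompt = f"{prompt}\nYour last answer was invalid ({err}). Return only valid JSON."
    raise RuntimeError("LLM output failed validation after retries")
```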

Talk by: Shreya Rajpal

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Lakehouse / Spark AMA

Have some great questions about Apache Spark™ and Lakehouses?  Well, come by and ask the experts your questions!

Talk by: Martin Grund, Hyukjin Kwon, and Wenchen Fan

Navigating the Complexities of LLMs: Insights from Practitioners

Interested in diving deeper into the world of large language models (LLMs) and their real-life applications? In this session, we bring together our experienced team members and some of our esteemed customers to talk about their journey with LLMs. We'll delve into the complexities of getting these models to perform accurately and efficiently, the challenges, and the dynamic nature of LLM technology as it constantly evolves. This engaging conversation will offer you a broader perspective on how LLMs are being applied across different industries and how they’re revolutionizing our interaction with technology. Whether you're well-versed in AI or just beginning to explore, this session promises to enrich your understanding of the practical aspects of LLM implementation.

Talk by: Sai Ravuru, Eric Peter, Ankit Mathur, and Salman Mohammed

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz

Simplifying Lakehouse Observability: Databricks Key Design Goals and Strategies

In this session, we'll explore Databricks' vision for simplifying lakehouse observability, a critical component of any successful data, analytics, and machine learning initiative. By directly integrating observability solutions within the lakehouse, Databricks aims to provide users with the tools and insights needed to run a successful business on top of the lakehouse.

Our approach is designed to leverage existing expertise and simplify the process of monitoring and optimizing data and ML workflows, enabling teams to deliver sustainable and scalable data and AI applications. Join us to learn more about our key design goals and how Databricks is streamlining lakehouse observability to support the next generation of data-driven applications.

Talk by: Michael Milirud

Sponsored by: Dataiku | Have Your Cake and Eat it Too with Dataiku + Databricks

In this session, we will highlight all parts of the analytics lifecycle using Dataiku + Databricks. Explore, blend, and prepare source data, train a machine learning model and score new data, and visualize and publish results — all using only Dataiku's visual interface. Plus, we will use LLMs for everything from simple data prep to sophisticated development pipelines. Attend and learn how you can truly have it all with Dataiku + Databricks.

Talk by: Amanda Milberg

Sponsored by: Privacera | Applying Advanced Data Security Governance with Databricks Unity Catalog

This talk explores the application of advanced data security and access control integrated with Databricks Unity Catalog through Privacera. Learn about Databricks Unity Catalog and Privacera capabilities, see real-world use cases demonstrating data security and access control best practices, and find out how to successfully plan for and implement enterprise data security governance at scale across your entire Databricks Lakehouse.

Talk by: Don Bosco Durai

Using Databricks to Develop Stats & Math Models to Forecast Monkeypox Outbreak in Washington State

In the spring and summer of 2022, monkeypox was detected in the United States and quickly spread throughout the country. To contain and mitigate the spread of the disease in Washington State, the Washington Department of Health data science team used the Databricks platform to develop a modeling pipeline that employed statistical and mathematical techniques to forecast the course of the monkeypox outbreak throughout the state. These models provided actionable information that helped inform decision making and guide the public health response to the outbreak.

We used contact-tracing data, standard line-lists, and published parameters to train a variety of time-series forecasting models, including an ARIMA model, a Poisson regression, and an SEIR compartmental model. We also calculated the daily R-effective rate as an additional output. The compartmental model best fit the reported cases when tested out of sample, but the statistical models were quicker and easier to deploy and helped inform initial decision-making. The R-effective rate was particularly useful throughout the effort.
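
As an illustrative sketch only (synthetic data and assumed parameters, not the Washington DOH pipeline), the two families of models mentioned above might be prototyped like this: an ARIMA forecast on daily case counts and a simple SEIR compartmental simulation.

```python
import numpy as np
import pandas as pd
from scipy.integrate import odeint
from statsmodels.tsa.arima.model import ARIMA

# Statistical model: ARIMA forecast on (synthetic) daily reported cases
cases = pd.Series(np.random.poisson(20, 120),
                  index=pd.date_range("2022-06-01", periods=120, freq="D"))
arima_fit = ARIMA(cases, order=(2, 1, 1)).fit()
print(arima_fit.forecast(steps=14))             # two-week case forecast

# Mechanistic model: SEIR compartments with assumed transmission parameters
def seir(y, t, beta, sigma, gamma, n):
    s, e, i, r = y
    return [-beta * s * i / n,
            beta * s * i / n - sigma * e,
            sigma * e - gamma * i,
            gamma * i]

n = 7_700_000                                    # approximate WA population (assumption)
y0 = [n - 50, 25, 25, 0]                         # initial S, E, I, R
t = np.linspace(0, 180, 181)
traj = odeint(seir, y0, t, args=(0.35, 1 / 7, 1 / 10, n))
print("Peak infectious count:", int(traj[:, 2].max()))
```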

Overall, these efforts highlighted the importance of rapidly deployable and scalable infectious disease modeling pipelines. Public health data science is still a nascent field, however, so common best practices in other industries are oftentimes novel approaches in public health. The need for stable, generalizable pipelines is crucial. Using the Databricks platform has allowed us to more quickly scale and iteratively improve our modeling pipelines to include other infectious diseases, such as influenza and RSV. Further development of scalable and standardized approaches to disease forecasting at the state and local level is vital to better informing future public health response efforts.

Talk by: Matthew Doxey
