talk-data.com

Topic

Databricks

big_data analytics spark

1286 tagged

Activity Trend: 515 peak/qtr, 2020-Q1 to 2026-Q1

Activities

1286 activities · Newest first

What’s New With Platform Security and Compliance in the Databricks Lakehouse Platform

At Databricks, we know that data is one of your most valuable assets and must always be protected. That’s why security is built into every layer of the Databricks Lakehouse Platform. Databricks provides comprehensive security to protect your data and workloads, including encryption, network controls, data governance, and auditing.

In this session, you will hear from Databricks product leaders on the platform security and compliance progress made over the past year, with demos on how administrators can start protecting workloads fast. You will also learn more about the roadmap that delivers on the Databricks commitment to you as the most trusted, compliant, and secure data and AI platform with the Databricks Lakehouse.

Talk by: Samrat Ray and David Veuve

Connect with us: Website: https://databricks.com Twitter: https://twitter.com/databricks LinkedIn: https://www.linkedin.com/company/databricks Instagram: https://www.instagram.com/databricksinc Facebook: https://www.facebook.com/databricksinc

Colossal AI: Scaling AI Models in Big Model Era

The proliferation of large models based on Transformer has outpaced advances in hardware, resulting in an urgent need for the ability to distribute enormous models across multiple GPUs. Despite this growing demand, best practices for choosing an optimal strategy are still lacking due to the breadth of knowledge required across HPC, DL, and distributed systems. These difficulties have stimulated both AI and HPC developers to explore the key questions: How can training and inference efficiency of large models be improved to reduce costs? How can larger AI models be accommodated even with limited resources?

What can be done to enable more community members to easily access large models and large-scale applications? In this session, we investigate efforts to solve the questions mentioned above. Firstly, diverse parallelization is an important tool to improve the efficiency of large model training and inference. Heterogeneous memory management can help enhance the model accommodation capacity of processors like GPUs.
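As a concrete (and heavily simplified) illustration of the diverse parallelization discussed above, the sketch below shows 1D tensor parallelism: the weight matrix is split column-wise across "devices", each computes a partial product, and the partials are gathered back together. This is a toy, in-process model of what frameworks like Colossal-AI automate across real GPUs; none of the function names below come from the Colossal-AI API.

```python
def matmul(x, w):
    """Plain matrix multiply: x is (n, k), w is (k, m)."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, shards):
    """Split w column-wise into `shards` equal blocks (one per "device")."""
    step = len(w[0]) // shards
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(shards)]

def parallel_matmul(x, w, shards=2):
    # Each "device" multiplies x by its own column block of w ...
    partials = [matmul(x, ws) for ws in split_columns(w, shards)]
    # ... and the column blocks are concatenated back together.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
assert parallel_matmul(x, w, shards=2) == matmul(x, w)
```

On real hardware each shard would live on a different GPU, and the final concatenation would be an all-gather collective over the interconnect.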

Furthermore, user-friendly DL systems for large models significantly reduce the specialized background knowledge users need, allowing more community members to get started with larger models more efficiently. We will provide participants with a system-level open-source solution, Colossal-AI. More information can be found at https://github.com/hpcaitech/ColossalAI.

Talk by: James Demmel and Yang You

Here’s more to explore: LLM Compact Guide: https://dbricks.co/43WuQyb Big Book of MLOps: https://dbricks.co/3r0Pqiz


Delta Kernel: Simplifying Building Connectors for Delta

Since the release of Delta 2.0, the project has been growing at breakneck speed. In this session, we will cover all the latest capabilities that make Delta Lake the best format for the lakehouse. Based on lessons learned from this past year, we will introduce Project Aqueduct and how it will simplify building Delta Lake connectors, from Rust and Go to Trino, Flink, and PySpark.

Talk by: Tathagata Das and Denny Lee


Enterprise Use of Generative AI Needs Guardrails: Here's How to Build Them

Large Language Models (LLMs) such as ChatGPT have revolutionized AI applications, offering unprecedented potential for complex real-world scenarios. However, fully harnessing this potential comes with unique challenges such as model brittleness and the need for consistent, accurate outputs. These hurdles become more pronounced when developing production-grade applications that utilize LLMs as a software abstraction layer.

In this session, we will tackle these challenges head-on. We introduce Guardrails AI, an open-source platform designed to mitigate risks and enhance the safety and efficiency of LLMs. We will delve into specific techniques and advanced control mechanisms that enable developers to optimize model performance effectively. Furthermore, we will explore how implementing these safeguards can significantly improve the development process of LLMs, ultimately leading to safer, more reliable, and more robust real-world AI applications.
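The validate-and-re-ask loop at the heart of this kind of safeguard can be sketched in a few lines. The code below is a hypothetical illustration of the pattern only, not the actual Guardrails AI API; `validate` and `guarded_call` are made-up names:

```python
import json

def validate(raw, required_keys):
    """Return (ok, parsed_or_error) for a JSON answer with required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"
    missing = [k for k in required_keys if k not in data]
    if missing:
        return False, f"missing keys: {missing}"
    return True, data

def guarded_call(llm, prompt, required_keys, max_retries=2):
    """Call the model, validate its output, and re-ask on failure."""
    for _ in range(max_retries + 1):
        ok, result = validate(llm(prompt), required_keys)
        if ok:
            return result
        prompt += f"\nYour last answer was rejected ({result}). Reply with valid JSON."
    raise ValueError(f"no valid output after {max_retries + 1} attempts")

# A stand-in "LLM" that fails once, then complies (for demonstration only).
answers = iter(['not json at all', '{"city": "Paris", "country": "France"}'])
result = guarded_call(lambda p: next(answers), "Where is the Louvre?", ["city", "country"])
assert result["city"] == "Paris"
```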

Talk by: Shreya Rajpal



Lakehouse / Spark AMA

Have some great questions about Apache Spark™ and Lakehouses? Well, come by and ask the experts your questions!

Talk by: Martin Grund, Hyukjin Kwon, and Wenchen Fan


Navigating the Complexities of LLMs: Insights from Practitioners

Interested in diving deeper into the world of large language models (LLMs) and their real-life applications? In this session, we bring together our experienced team members and some of our esteemed customers to talk about their journey with LLMs. We'll delve into the complexities of getting these models to perform accurately and efficiently, the challenges, and the dynamic nature of LLM technology as it constantly evolves. This engaging conversation will offer you a broader perspective on how LLMs are being applied across different industries and how they’re revolutionizing our interaction with technology. Whether you're well-versed in AI or just beginning to explore, this session promises to enrich your understanding of the practical aspects of LLM implementation.

Talk by: Sai Ravuru, Eric Peter, Ankit Mathur, and Salman Mohammed



Simplifying Lakehouse Observability: Databricks Key Design Goals and Strategies

In this session, we'll explore Databricks' vision for simplifying lakehouse observability, a critical component of any successful data, analytics, and machine learning initiative. By directly integrating observability solutions within the lakehouse, Databricks aims to provide users with the tools and insights needed to run a successful business on top of the lakehouse.

Our approach is designed to leverage existing expertise and simplify the process of monitoring and optimizing data and ML workflows, enabling teams to deliver sustainable and scalable data and AI applications. Join us to learn more about our key design goals and how Databricks is streamlining lakehouse observability to support the next generation of data-driven applications.

Talk by: Michael Milirud


Sponsored by: Dataiku | Have Your Cake and Eat it Too with Dataiku + Databricks

In this session, we will highlight all parts of the analytics lifecycle using Dataiku + Databricks. Explore, blend, and prepare source data, train a machine learning model and score new data, and visualize and publish results — all using only Dataiku's visual interface. Plus, we will use LLMs for everything from simple data prep to sophisticated development pipelines. Attend and learn how you can truly have it all with Dataiku + Databricks.

Talk by: Amanda Milberg


Sponsored by: Privacera | Applying Advanced Data Security Governance with Databricks Unity Catalog

This talk explores advanced data security and access control integrated with Databricks Unity Catalog through Privacera. Learn about the capabilities of Databricks Unity Catalog and Privacera, see real-world use cases demonstrating data security and access control best practices, and find out how to successfully plan for and implement enterprise data security governance at scale across your entire Databricks Lakehouse.

Talk by: Don Bosco Durai


Using Databricks to Develop Stats & Math Models to Forecast Monkeypox Outbreak in Washington State

In the spring and summer of 2022, monkeypox was detected in the United States and quickly spread throughout the country. To contain and mitigate the spread of the disease in Washington State, the Washington Department of Health data science team used the Databricks platform to develop a modeling pipeline that employed statistical and mathematical techniques to forecast the course of the monkeypox outbreak throughout the state. These models provided actionable information that helped inform decision making and guide the public health response to the outbreak.

We used contact-tracing data, standard line-lists, and published parameters to train a variety of time-series forecasting models, including an ARIMA model, a Poisson regression, and an SEIR compartmental model. We also calculated the daily R-effective rate as an additional output. The compartmental model best fit the reported cases when tested out of sample, but the statistical models were quicker and easier to deploy and helped inform initial decision-making. The R-effective rate was particularly useful throughout the effort.
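An SEIR compartmental model of the kind described above can be sketched with a simple Euler integration over the four compartments. The parameter values below are purely illustrative, not those used by the Washington DOH team:

```python
def seir(s, e, i, r, beta=0.3, sigma=0.2, gamma=0.1, days=100, dt=1.0):
    """Simulate and return daily (S, E, I, R) fractions of the population."""
    n = s + e + i + r
    history = [(s, e, i, r)]
    for _ in range(int(days / dt)):
        new_exposed    = beta * s * i / n   # S -> E (infection)
        new_infectious = sigma * e          # E -> I (end of incubation)
        new_recovered  = gamma * i          # I -> R (recovery)
        s -= dt * new_exposed
        e += dt * (new_exposed - new_infectious)
        i += dt * (new_infectious - new_recovered)
        r += dt * new_recovered
        history.append((s, e, i, r))
    return history

history = seir(s=0.99, e=0.0, i=0.01, r=0.0)
# The four compartments always sum to the whole population.
assert all(abs(sum(state) - 1.0) < 1e-9 for state in history)
```

Fitting beta, sigma, and gamma to reported case counts (and deriving R-effective from them) is the modeling work; the integration itself stays this small, which is part of why compartmental models deploy well in a pipeline.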

Overall, these efforts highlighted the importance of rapidly deployable and scalable infectious disease modeling pipelines. Public health data science is still a nascent field, however, so common best practices in other industries are often-times novel approaches in public health. The need for stable, generalizable pipelines is crucial. Using the Databricks platform has allowed us to more quickly scale and iteratively improve our modeling pipelines to include other infectious diseases, such as influenza and RSV. Further development of scalable and standardized approaches to disease forecasting at the state and local level is vital to better informing future public health response efforts.

Talk by: Matthew Doxey


Writing Data-Sharing Apps Using Node.js and Delta Sharing

JavaScript remains the top programming language today, with most code repositories on GitHub written in it. However, JavaScript is evolving beyond web application development into a language built for tomorrow. Everyday tasks like data wrangling, data analysis, and predictive analytics are possible today directly from a web browser. For example, many popular data analytics libraries, such as TensorFlow.js, now ship JavaScript SDKs.

Another popular library, Danfo.js, makes it possible to wrangle data using familiar pandas-like operations, shortening the learning curve and arming the typical data engineer or data scientist with another data tool in their toolbox. In this presentation, we’ll explore using the Node.js connector for Delta Sharing to build a data analytics app that summarizes a Twitter dataset.

Talk by: Will Girten


Apache Spark™ Streaming and Delta Live Tables Accelerates KPMG Clients For Real Time IoT Insights

Unplanned downtime in manufacturing costs firms up to a trillion dollars annually. Time that materials spend sitting on a production line is lost revenue. Even just 15 hours of downtime a week adds up to nearly 800 hours of downtime yearly. The use of Internet of Things (IoT) devices can cut this time down by providing detailed machine metrics. However, IoT predictive maintenance is challenged by the lack of effective, scalable infrastructure and machine learning solutions. IoT data can amount to multiple terabytes per day and can come in a variety of formats. Furthermore, without any insights and analysis, this data becomes just another table.

The KPMG Databricks IoT Accelerator is a comprehensive solution that gives manufacturing plant operators a bird’s-eye view of their machines’ health and empowers proactive machine maintenance across their portfolio of IoT devices. The accelerator ingests IoT streaming data at scale and implements the Databricks medallion architecture, leveraging Delta Live Tables to clean and process data. Real-time machine learning models are developed from IoT machine measurements and managed in MLflow. The AI predictions and IoT device readings are compiled in the gold table, powering downstream dashboards such as Tableau. Dashboards inform machine operators not only of machines’ ailments, but also of actions they can take to mitigate issues before they arise. Operators can see fault history to aid in understanding failure trends, and can filter dashboards by fault type, machine, or specific sensor reading.
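A per-sensor health check of the kind such dashboards surface might look like the toy rule below. The window and threshold are hypothetical stand-ins; the actual accelerator uses MLflow-managed models over streaming data rather than this rolling-statistics rule:

```python
from collections import deque

def rolling_anomalies(readings, window=5, threshold=3.0):
    """Flag readings more than `threshold` std devs from the rolling mean."""
    recent, flags = deque(maxlen=window), []
    for x in readings:
        if len(recent) == window:
            mean = sum(recent) / window
            std = (sum((v - mean) ** 2 for v in recent) / window) ** 0.5
            flags.append(std > 0 and abs(x - mean) > threshold * std)
        else:
            flags.append(False)  # not enough history yet
        recent.append(x)
    return flags

temps = [70, 71, 70, 72, 71, 70, 71, 95, 70]  # one temperature spike at 95
flags = rolling_anomalies(temps)
assert flags[7] and sum(flags) == 1
```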

Talk by: MacGregor Winegard


Clean Room Primer: Using Databricks Clean Rooms to Use More & Better Data in your BI, ML, & Beyond

In this session, we will discuss the foundational changes in the ecosystem, the implications of data insights on marketing, analytics, and measurement, and how companies are coming together to collaborate through data clean rooms in new and exciting ways to power mutually beneficial value for their businesses while preserving privacy and governance.

We will delve into the concept and key features of clean room technology and how they can be used to access more and better data for business intelligence (BI), machine learning (ML), and other data-driven initiatives. By examining real-world use cases of clean rooms in action, attendees will gain a clear understanding of the benefits they can bring to industries like CPG, retail, and media and entertainment. In addition, we will unpack the advantages of using Databricks as a clean room platform, specifically showcasing how interoperable clean rooms can be leveraged to enhance BI, ML and other compute scenarios. By the end of this session, you will be equipped with the knowledge and inspiration to explore how clean rooms can unlock new collaboration opportunities that drive better outcomes for your business and improved experiences for consumers.

Talk by: Matthew Karasick and Anil Puliyeril


Data Biases and Generative AI: A Practitioner Discussion

Join this panel discussion that unpacks the technical challenges surrounding biases found in data, and poses potential solutions and strategies for the future including Generative AI. This session is a showcase highlighting diverse perspectives in the data and AI industry.

Talk by: Adi Polak, Gavita Regunath, Christina Taylor, and Layla Yang


Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP

In this talk, you will learn how retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LMs) and retrieval models (RMs). Existing work has combined these in simple “retrieve-then-read” pipelines in which the RM retrieves passages that are inserted into the LM prompt.
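A retrieve-then-read pipeline of this kind can be sketched with a toy word-overlap retriever and a stubbed-out LM. Both are stand-ins for a real RM and LM, and the passages are made up for illustration:

```python
import re

PASSAGES = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Apache Spark is a distributed data processing engine.",
    "The Louvre, in Paris, is the world's most-visited museum.",
]

def toks(text):
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, k=2):
    """Rank passages by word overlap with the question (stand-in for a real RM)."""
    return sorted(passages, key=lambda p: -len(toks(question) & toks(p)))[:k]

def retrieve_then_read(question, lm, passages=PASSAGES):
    """The simple pipeline: retrieved passages are inserted into the LM prompt."""
    context = "\n".join(retrieve(question, passages))
    return lm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# A stub LM that echoes its prompt, so we can inspect what the pipeline built.
prompt = retrieve_then_read("Where is the Louvre museum?", lm=lambda p: p)
assert "Louvre" in prompt and "Spark" not in prompt
```

DSP generalizes this fixed two-stage flow into programs that interleave many LM and RM calls, which is what enables the multi-hop and conversational settings described below.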

To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate–Search–Predict (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably.

We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37–125%, 8–40%, and 80–290% relative gains against vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.

Talk by: Keshav Santhanam

Here’s more to explore: State of Data + AI Report: https://dbricks.co/44i2HBp Databricks named a Leader in 2022 Gartner® Magic Quadrant™ CDBMS: https://dbricks.co/3phw20d


Future Data Access Control: Booz Allen Hamilton’s Way of Securing Databricks Lakehouse with Immuta

In this talk, I’ll review how we utilize Attribute-Based Access Control (ABAC) to enforce policy via Immuta. I’ll discuss the differences between the ABAC and legacy Role-Based Access Control (RBAC) approaches to control access and how the RBAC approach is not sufficient to keep up with today’s growing big data market. With so much data available, there also comes substantial risk. Data can contain many sensitive data elements, including PII and PHI. Industry leaders like Databricks are pushing the boundaries of data technology, which leads to constantly evolving data use cases. And that’s a good thing. However, the RBAC approach is struggling to keep up with those advancements.

So what is RBAC? It’s an approach to data access that permits system access based on the end user’s role. For legacy systems, it’s meant as a simple but effective approach to securing data. Are you a manager? Then you’ll get access to data meant for managers. This is great for small deployments with clearly defined roles, but it breaks down at scale. Here at Booz Allen, we invested in Databricks because we have an environment of over 30,000 users and billions of rows of data.

To mitigate this problem and align with our forward-thinking company standard, we introduced Immuta into our stack. Immuta uses ABAC to allow for dynamic data access control. Users are automatically assigned certain attributes, and access is based on those attributes instead of just their role. This allows for more flexibility and allows data access control to easily scale without the need to constantly map a user to their role. Using attributes, we can write policies in one place and have them applied across all our data platforms. This makes for a truly holistic data governance approach and provides immediate ROI and time savings for the company.
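The RBAC-to-ABAC shift described above can be sketched as follows. The attribute names and rules are hypothetical, and Immuta expresses such policies declaratively rather than in imperative code:

```python
def rbac_allows(user, resource):
    """Legacy RBAC: access hinges on a single role-to-resource mapping."""
    role_grants = {"manager": {"sales_db"}, "analyst": {"reporting_db"}}
    return resource in role_grants.get(user["role"], set())

def abac_allows(user, resource_tags):
    """ABAC: the decision combines several user attributes dynamically."""
    if "phi" in resource_tags and not user.get("hipaa_trained"):
        return False
    if "pii" in resource_tags and user.get("region") != "US":
        return False
    return True

analyst = {"role": "analyst", "region": "US", "hipaa_trained": False}
assert rbac_allows(analyst, "reporting_db")   # role check only
assert abac_allows(analyst, {"pii"})          # a US analyst may see PII ...
assert not abac_allows(analyst, {"phi"})      # ... but not PHI without training
```

Because the ABAC decision reads attributes at query time, adding a user or a dataset means assigning attributes, not minting and mapping a new role, which is what lets the policy layer scale.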

Talk by: Jeffrey Hess


How to Train Your Own Large Language Models

Given the success of OpenAI’s GPT-4 and Google’s PaLM, every company is now assessing its own use cases for Large Language Models (LLMs). Many companies will ultimately decide to train their own LLMs for a variety of reasons, ranging from data privacy to increased control over updates and improvements. One of the most common reasons will be to make use of proprietary internal data.

In this session, we’ll go over how to train your own LLMs, from raw data to deployment in a user-facing production environment. We’ll discuss the engineering challenges, and the vendors that make up the modern LLM stack: Databricks, Hugging Face, and MosaicML. We’ll also break down what it means to train an LLM using your own data, including the various approaches and their associated tradeoffs.

Topics covered in this session:
- How Replit trained a state-of-the-art LLM from scratch
- The different approaches to using LLMs with your internal data
- The differences between fine-tuning, instruction tuning, and RLHF

Talk by: Reza Shabani



JoinBoost: In-Database Machine Learning for Tree Models

Data and machine learning (ML) are crucial for enterprise operations. Enterprises store data in databases for management and use ML to gain business insights. However, there is a mismatch between the way ML expects data to be organized (a single table) and the way data is organized in databases (a join graph of multiple tables), which leads to inefficiencies when joining and materializing tables.

In this talk, you will see how we address this issue. We introduce JoinBoost, a lightweight Python library that trains tree models (such as random forests and gradient boosting) over join graphs in databases. JoinBoost acts as a query-rewriting layer that is compatible with cloud databases and eliminates the need for costly join materialization.
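The query-rewriting idea can be illustrated on a single aggregate: the same result is obtained by aggregating the fact table before the join, so the joined table is never materialized. The schema and data below are made up for illustration:

```python
sales = [  # fact table: (store_id, amount)
    (1, 10.0), (1, 20.0), (2, 5.0), (2, 15.0), (2, 30.0),
]
stores = [  # dimension table: (store_id, region)
    (1, "east"), (2, "west"),
]

def sum_by_region_materialized(sales, stores):
    """Naive plan: materialize the join first, then aggregate."""
    joined = [(region, amount)
              for sid, amount in sales
              for sid2, region in stores if sid == sid2]
    out = {}
    for region, amount in joined:
        out[region] = out.get(region, 0.0) + amount
    return out

def sum_by_region_factorized(sales, stores):
    """Rewritten plan: aggregate the fact table first, then join tiny results."""
    per_store = {}
    for sid, amount in sales:
        per_store[sid] = per_store.get(sid, 0.0) + amount
    out = {}
    for sid, region in stores:
        out[region] = out.get(region, 0.0) + per_store.get(sid, 0.0)
    return out

assert sum_by_region_factorized(sales, stores) == sum_by_region_materialized(sales, stores)
```

Tree learners need exactly such sums and counts of gradient statistics per candidate split, which is why pushing aggregation below the join pays off so directly for this model family.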

Talk by: Zachary Huang



Lakehouse Architecture to Advance Security Analytics at the Department of State

In 2023, the Department of State surged forward on implementing a lakehouse architecture to get faster, smarter, and more effective on cybersecurity log monitoring and incident response. In addition to getting us ahead of federal mandates, this approach promises to enable advanced analytics and machine learning across our highly federated global IT environment while minimizing costs associated with data retention and aggregation.

This talk will include a high-level overview of the technical and policy challenges and a deeper technical dive into the tactical implementation choices made. We’ll share lessons learned related to governance and securing organizational support, connecting between multiple cloud environments, and standardizing data to make it useful for analytics. And finally, we’ll discuss how the lakehouse leverages Databricks in multicloud environments to promote decentralized ownership of data while enabling strong, centralized data governance practices.

Talk by: Timothy Ahrens and Edward Moe


Leveraging Data Science for Game Growth and Matchmaking Optimization

For video games, data science solutions can be applied throughout the player lifecycle, from adtech, LTV forecasting, and in-game economic system monitoring to experimentation. Databricks serves as the data and computation foundation powering these solutions, enabling data scientists to easily develop and deploy them for different use cases.

In this session, we will share insights on how Databricks-powered data science solutions drive game growth and improve player experiences using different advanced analytics, modeling, experimentation, and causal inference methods. We will introduce the business use cases, data science techniques, as well as Databricks demos.

Talk by: Zhenyu Zhao and Shuo Chen
