talk-data.com

Topic: Data Collection (3 tagged)

Activity trend: 17 peak/qtr, 2020-Q1 to 2026-Q1

Activities (filtered by: Databricks DATA + AI Summit 2023)

Protecting PII/PHI Data in Data Lake via Column Level Encryption

Data breaches are a concern for any data collection company, including Northwestern Mutual. Every measure is taken to prevent identity theft and fraud against our customers; however, those measures are not sufficient if the security around the data is not updated periodically. Multiple layers of encryption are the most common approach used to avoid breaches, but unauthorized internal access to this sensitive data still poses a threat.

This presentation will walk you through the following steps:

- Designing and building encryption at the column level
- Protecting PII data that is used as a key for joins
- Allowing authorized users to decrypt data at run time
- Rotating the encryption keys when needed

At Northwestern Mutual, a combination of the Fernet and AES encryption libraries, user-defined functions (UDFs), and Databricks secrets was used to develop a process for encrypting PII information. Decryption access is granted only to those with a business need, which helps mitigate the internal threat. This is done without duplicating data or metadata (views/tables). Our goal is to help you understand how you can build a secure data lake for your organization that eliminates the threat of data breaches, both internal and external. Associated blog: https://databricks.com/blog/2020/11/20/enforcing-column-level-encryption-and-avoiding-data-duplication-with-pii.html
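As a rough sketch of this pattern (not Northwestern Mutual's exact implementation), the example below encrypts a PII column with a Fernet UDF in PySpark; the secret scope, key name, column names, and sample data are hypothetical.

```python
# Minimal sketch of column-level encryption with Fernet UDFs in PySpark.
# The secret scope/key, column names, and data below are illustrative only.
from cryptography.fernet import Fernet
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# On Databricks the key would typically come from a secret scope, e.g.:
#   key = dbutils.secrets.get(scope="pii", key="fernet-key").encode()
key = Fernet.generate_key()  # stand-in for local experimentation

def encrypt(value):
    return Fernet(key).encrypt(value.encode()).decode() if value is not None else None

def decrypt(token):
    return Fernet(key).decrypt(token.encode()).decode() if token is not None else None

encrypt_udf = udf(encrypt, StringType())
decrypt_udf = udf(decrypt, StringType())

df = spark.createDataFrame([("alice", "123-45-6789")], ["name", "ssn"])

# Persist only the encrypted form of the PII column.
encrypted = df.withColumn("ssn", encrypt_udf(col("ssn")))

# Users with access to the secret (a business need) can decrypt at run time.
decrypted = encrypted.withColumn("ssn", decrypt_udf(col("ssn")))
decrypted.show(truncate=False)
```

In this setup, rotating keys would amount to updating the secret and re-encrypting the affected columns.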


Unifying Data Science and Business: AI Augmentation/Integration in Production Business Applications

Why is it so hard to integrate machine learning into real business applications? In 2019, Gartner predicted that AI augmentation would solve this problem, creating $2.9 trillion of business value and 6.2 billion hours of worker productivity in 2021. A new realm of business science methods, encompassing AI-powered analytics that let people with domain expertise make smarter decisions faster and with more confidence, has also emerged as a solution. Dr. Harvey will demystify why integration challenges still account for $30.2 billion in annual global losses, and discuss what it takes to integrate AI/ML code or algorithms into real business applications and the effort required to make each component, including data collection, preparation, training, and serving, production-ready, enabling organizations to use the results of integrated models repeatedly with minimal user intervention. Finally, Dr. Harvey will discuss AISquared's integration with Databricks and MLFlow to accelerate the integration of AI by unifying data science with business. By adding five lines of code to your model, users can leverage AISquared's model integration API framework, which provides a quick and easy way to integrate models directly into live business applications.
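AISquared's own API is not reproduced here, but as a hedged sketch of the MLflow side of such an integration, the snippet below logs a trained model and loads it back by URI so an application can score new records with minimal user intervention; the model, data, and artifact path are illustrative assumptions.

```python
# Illustrative MLflow-only sketch: log a trained model, then reload it for use
# inside an application. This is not AISquared's integration API.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.sklearn.log_model(model, artifact_path="model")

# Later, a business application loads the model by URI and scores new records.
loaded = mlflow.pyfunc.load_model(f"runs:/{run.info.run_id}/model")
print(loaded.predict(X[:5]))
```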


Distributed Machine Learning at Lyft

Data collection, preprocessing, and feature engineering are the fundamental steps in any machine learning pipeline. After feature engineering, being able to parallelize training across multiple low-cost machines reduces both cost and time, and training models in a distributed manner speeds up hyperparameter tuning. How can we unify these stages of the ML pipeline in one distributed training platform, and do so on Kubernetes?
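As a toy illustration of this kind of parallelism (not Lyft's platform code), the sketch below fans a hypothetical hyperparameter grid out across a Spark cluster so that each configuration is trained and scored on a separate executor.

```python
# Toy sketch: evaluating a hyperparameter grid in parallel on a Spark cluster.
# The model, grid, and scoring below are illustrative, not Lyft's actual setup.
from pyspark.sql import SparkSession
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

spark = SparkSession.builder.getOrCreate()
X, y = load_iris(return_X_y=True)

param_grid = [{"n_estimators": n, "max_depth": d}
              for n in (50, 100, 200) for d in (3, 5, None)]

def evaluate(params):
    # Each task trains and cross-validates one configuration independently.
    score = cross_val_score(RandomForestClassifier(**params), X, y, cv=3).mean()
    return params, float(score)

results = (spark.sparkContext
           .parallelize(param_grid, numSlices=len(param_grid))
           .map(evaluate)
           .collect())
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, best_score)
```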

Our ML platform is built entirely on Kubernetes because of its scalability and rapid bootstrapping of resources. In this talk we will demonstrate how Lyft uses Spark on Kubernetes and Fugue (our home-grown unifying compute abstraction layer) to design a holistic, end-to-end ML pipeline system that gives our customers a distributed feature engineering, training, and prediction experience on our ML platform on top of Spark on K8s. We will also do a deep dive into how we abstract away infrastructure complexities so that our data scientists and research scientists can focus only on the business logic of their models through simple Pythonic APIs and SQL. We let the users focus on "what to do" while the platform takes care of "how to do it". We will share our challenges, our learnings, and the fun we had while implementing it. Using Spark on K8s has helped us achieve large-scale data processing at 90% lower cost, at times bringing processing time down from 2 hours to under 20 minutes.
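As a minimal sketch of the "simple Pythonic API" idea, the example below uses Fugue's public transform() function to run a plain pandas function on Spark by passing the Spark session as the execution engine; the data and feature logic are made up for illustration and are not Lyft's production code.

```python
# Minimal Fugue sketch: write logic against pandas, run it distributed on Spark
# by supplying the SparkSession as the engine. Data and logic are illustrative.
import pandas as pd
from fugue import transform
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rides = pd.DataFrame({"ride_id": [1, 2, 3], "distance_km": [1.2, 5.4, 3.3]})

def add_fare_feature(df: pd.DataFrame) -> pd.DataFrame:
    # Business logic stays plain pandas; the engine decides where it runs.
    df["est_fare"] = 2.5 + 1.8 * df["distance_km"]
    return df

# Same function, distributed execution: the result is a Spark DataFrame.
result = transform(rides, add_fare_feature, schema="*, est_fare:double", engine=spark)
result.show()
```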
