Scale your AI training and achieve peak performance with AI Hypercomputer. Gain actionable insights into optimizing your AI workloads for maximum goodput. Learn how to leverage our robust infrastructure for diverse models, including dense, Mixture of Experts, and diffusion architectures. Discover how to customize your workflows with custom kernels and developer tools that enable seamless interactive development. You'll learn firsthand how Pathways, developed by Google DeepMind, enables large-scale training resiliency and the flexibility to express novel model architectures.
Speaker
Vaibhav Singh
4 talks
Talks & appearances
4 activities · Newest first
This session is a deep dive into strategies for maximizing the performance and efficiency of generative AI model training using Vertex AI with Cloud TPUs (Tensor Processing Units) and GPUs. You'll learn how to harness the power of Cloud TPUs and GPUs for accelerated training. Join our experts to learn more about best practices for configuring compute resources, selecting the ideal hardware for your use cases, and streamlining the overall model development process with Ray, Persistent Cluster, and shared reservations.
Click the blue “Learn more” button above to tap into special offers designed to help you implement what you are learning at Google Cloud Next 25.
If left unmanaged, failures and infrastructure inefficiencies can account for as much as 45% of your compute resources and precious engineering time (according to a Stanford University study). In this session, we discuss how to measure and maximize machine learning (ML) productivity for large-scale training jobs, spanning tens of thousands of accelerators. We’ll demonstrate a canonical view of large-scale training infrastructure and patterns our customers are applying that are available to you today.
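The productivity measure discussed above is often called goodput: the fraction of scheduled accelerator time that produces useful training progress. As a minimal sketch (the function name and figures are illustrative, not from the session or the cited study), the 45% loss figure translates to goodput like this:

```python
# Hedged illustration: "goodput" here is taken as the fraction of wall-clock
# accelerator time spent making useful training progress, after subtracting
# time lost to failures, restarts, and checkpoint recovery. The helper name
# and the example numbers are hypothetical.

def training_goodput(total_hours: float, lost_hours: float) -> float:
    """Fraction of scheduled accelerator time that produced useful work."""
    return (total_hours - lost_hours) / total_hours

# Example: a week-long job where 45% of time is lost, per the figure above.
total = 168.0          # one week of wall-clock hours
lost = 0.45 * total    # failures + restarts + recovery overhead
print(f"goodput = {training_goodput(total, lost):.2f}")  # prints goodput = 0.55
```

At tens of thousands of accelerators, even a few points of goodput correspond to a large absolute amount of compute, which is why measuring it is the first step the session recommends.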
Training large AI models at scale requires high-performance, purpose-built infrastructure. This session will guide you through the key considerations for choosing tensor processing units (TPUs) and graphics processing units (GPUs) for your training needs. Explore the strengths of each accelerator for various workloads, like large language models and generative AI models. Discover best practices for optimizing your training workflow on Google Cloud using TPUs and GPUs. Understand the performance and cost implications, along with cost-optimization strategies at scale.