AWS re:Invent 2024 - High performance distributed model training with Amazon SageMaker (AIM380)
Description
Foundation models continue to grow in size with billions or even trillions of parameters, and they often won’t fit into a single accelerator device such as a GPU. Amazon SageMaker distributed training capabilities help you apply advanced parallelization techniques, communication optimizations, and efficient checkpointing strategies to distribute your training workload across hundreds or thousands of GPUs, reducing model training time and cost by up to 20%. Join this session for a deep dive into the infrastructure used to run distributed training at scale. Learn how to integrate Amazon SageMaker training capabilities to reduce the total cost of foundation model development.
Learn more: AWS re:Invent: https://go.aws/reinvent. More AWS events: https://go.aws/3kss9CP
Subscribe: More AWS videos: http://bit.ly/2O3zS75. More AWS events videos: http://bit.ly/316g9t4
About AWS: Amazon Web Services (AWS) hosts events, both online and in-person, bringing the cloud computing community together to connect, collaborate, and learn from AWS experts. AWS is the world's most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—are using AWS to lower costs, become more agile, and innovate faster.