Scaling AI/ML Workflows on HPC for Geoscientific Applications
Description
Scaling artificial intelligence (AI) and machine learning (ML) workflows on high-performance computing (HPC) systems presents unique challenges, particularly as models become more complex and data-intensive. This study explores strategies to optimize AI/ML workflows for enhanced performance and resource utilization on HPC platforms.
We investigate advanced parallelization techniques, such as Data Parallelism (DP), Distributed Data Parallel (DDP), and Fully Sharded Data Parallel (FSDP). Implementing memory-efficient strategies, including mixed-precision training and activation checkpointing, significantly reduces memory consumption without compromising model accuracy. Additionally, we examine several communication backends (NCCL, MPI, and Gloo) to improve inter-GPU and inter-node communication efficiency. Special attention is given to the complexities of configuring these backends in HPC environments, with practical guidance for correct setup and execution; the sketches below illustrate both aspects.
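
As a concrete illustration of how these pieces fit together, the following is a minimal PyTorch sketch, not the workflow used in this study: it wraps a placeholder model with FSDP, applies a bf16 mixed-precision policy, and recomputes activations via checkpointing. The model architecture, dimensions, and hyperparameters are illustrative assumptions.

"""Minimal sketch: an FSDP training step with mixed precision and
activation checkpointing. The model, data, and settings are
placeholders, not the geoscientific workload from this study."""
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.utils.checkpoint import checkpoint

def build_model():
    # Stand-in for a real geoscientific model.
    return nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 1024),
    )

class CheckpointedBlock(nn.Module):
    """Recomputes activations during backward instead of storing them."""
    def __init__(self, block):
        super().__init__()
        self.block = block
    def forward(self, x):
        return checkpoint(self.block, x, use_reentrant=False)

def main():
    # The launcher (torchrun or srun) sets RANK/LOCAL_RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = CheckpointedBlock(build_model()).cuda()
    # Keep parameters, gradients, and buffers in bf16 to cut memory and
    # communication volume; fp16 plus a gradient scaler is an alternative
    # on GPUs without bf16 support.
    mp_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=mp_policy, device_id=local_rank)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # stand-in training loop with synthetic data
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        optim.zero_grad()
        loss.backward()
        optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()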
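
Backend selection can likewise be sketched in a few lines. The snippet below picks NCCL when GPUs and an NCCL-enabled build are available and falls back to Gloo otherwise; an MPI backend additionally requires PyTorch compiled against MPI. The NCCL_SOCKET_IFNAME value ("hsn0") is a placeholder for whatever high-speed interface a given cluster exposes.

"""Minimal sketch: choosing a communication backend at runtime.
Interface names and launcher details are cluster-specific."""
import os
import torch
import torch.distributed as dist

def init_distributed():
    # MASTER_ADDR/MASTER_PORT typically come from the launcher and
    # point at the first allocated node.
    backend = "nccl" if torch.cuda.is_available() and dist.is_nccl_available() else "gloo"
    # Pinning NCCL to the high-speed fabric often matters on multi-homed
    # nodes; "hsn0" is a placeholder interface name.
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "hsn0")
    dist.init_process_group(backend=backend)
    return backend

if __name__ == "__main__":
    backend = init_distributed()
    if dist.get_rank() == 0:
        print(f"initialized {dist.get_world_size()} ranks with {backend}")
    dist.destroy_process_group()

Either sketch would typically be launched under a scheduler allocation with a distributed launcher, e.g. torchrun --nnodes=2 --nproc_per_node=4 train.py, which populates RANK, LOCAL_RANK, WORLD_SIZE, and the rendezvous variables the process group reads.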
Our findings demonstrate that these optimizations enable stable, scalable AI/ML model training and inference, with substantial improvements in training time and resource efficiency. This presentation will detail the technical challenges encountered and the solutions developed, offering insights into effectively scaling AI/ML workflows on HPC systems.