PyTorch Data Loader Tuning + GPU Cross-Architecture Optimizations: CUDA and AMD

Zoom link: https://us02web.zoom.us/j/82308186562

Talk #0: Introductions and Meetup Updates by Chris Fregly and Antje Barth

Talk #1: Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard by Chaim Rand, et al.

Based on this Medium post: https://medium.com/data-science/solving-bottlenecks-on-the-data-input-pipeline-with-pytorch-profiler-and-tensorboard-5dced134dbe9

Talk #2: How to Write Cross-Architecture Kernels: NVIDIA CUDA and AMD ROCm (a.k.a. "CUDA for AMD") by Quentin Anthony, Cross-Platform Kernel Engineer @ Zyphra

New models such as DeepSeek-R1 and Llama-4 are being deployed across AMD and NVIDIA GPUs, but how are cross-hardware kernels written? In this talk, we'll discuss considerations such as kernel sizing and cross-architecture optimization when writing kernels for different SIMD hardware.

Related Links

GitHub Repo: http://github.com/cfregly/ai-performance-engineering/

O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/

YouTube: https://www.youtube.com/@AIPerformanceEngineering

Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm