The original was posted on /r/machinelearning by /u/evilevidenz on 2024-09-06 06:52:34+00:00.


Usually people respond with "Because NVIDIA had more time and more money." But why can't AMD catch up? What exactly makes optimizing ROCm so hard?

It would be helpful if you could point to some resources, or make your answer as detailed as possible about the implementation of specific kernels and data structures, and about how CUDA calls are actually generated and optimized from Triton or XLA. Thanks :)
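
To make the question concrete, here is roughly the level I mean: a minimal element-wise add written in Triton (a sketch in the style of the standard Triton tutorials; the kernel name, wrapper, and block size are just illustrative). I'm asking about everything that happens below the `@triton.jit` line on each vendor's hardware.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide chunk of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Triton JIT-compiles the kernel on first launch; the backend lowers it
    # to GPU machine code (PTX on NVIDIA, AMDGCN on AMD via LLVM).
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

So the question is essentially: what happens between this Python-level code and the machine code it becomes, which parts of that lowering and tuning pipeline differ between CUDA and ROCm, and why does the NVIDIA path end up so much faster in practice?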