This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Kingandpawnendgame on 2025-06-11 03:03:39+00:00.


We introduce FlashDMoE, the first system to completely fuse the Distributed MoE forward pass into a single kernel—delivering up to 9x higher GPU utilization, 6x lower latency, and 4x improved weak-scaling efficiency.

Code: https://github.com/osayamenja/Kleos/blob/main/csrc/include/kleos/moe/README.MD

Paper: https://arxiv.org/abs/2506.04667

If you are a CUDA enthusiast, you would enjoy reading the code :) We write the fused layer from scratch in pure CUDA.