[R] FlashDMoE: Fast Distributed MoE in a single Kernel

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Kingandpawnendgame on 2025-06-11 03:03:39+00:00.

We introduce FlashDMoE, the first system to completely fuse the Distributed MoE forward pass into a single kernel—delivering up to 9x higher GPU utilization, 6x lower latency, and 4x improved weak-scaling efficiency.

Code: https://github.com/osayamenja/Kleos/blob/main/csrc/include/kleos/moe/README.MD

Paper: https://arxiv.org/abs/2506.04667

If you are a CUDA enthusiast, you will enjoy reading the code :) We wrote the fused layer from scratch in pure CUDA.
