This is an automated archive made by the Lemmit Bot.
The original was posted on /r/machinelearning by /u/bo_peng on 2024-10-21 14:18:19+00:00.
Hi everyone. RWKV-7 (a 100% RNN, attention-free architecture) can surpass the strong Modded-GPT baseline (the one using the Muon optimizer, currently trending on Twitter).
Training code & log: And it can reach loss 3.26xx if you use a larger head size.
My current implementation is quite inefficient, though. After optimization it might reach ~85% of Modded-GPT's speed at ctx 1k (or run faster than Modded-GPT at ctx 4k). Any help is welcome :)
The strong GPT baseline:
RWKV-7 moves away from the “linear attention” design to achieve greater performance :)
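For readers unfamiliar with the baseline being moved away from: a "linear attention" model replaces the T×T attention matrix with a fixed-size recurrent state updated once per token, which is what makes such models RNNs. Below is a minimal illustrative sketch of that generic linear-attention recurrence (with a scalar decay) — this is NOT the RWKV-7 update rule, just the family of designs the post says RWKV-7 departs from; the function name and the `decay` parameter are assumptions for illustration.

```python
import numpy as np

def linear_attention_rnn(q, k, v, decay=0.9):
    """Generic linear-attention recurrence (illustrative, not RWKV-7):
        S_t = decay * S_{t-1} + v_t k_t^T,   o_t = S_t q_t
    The state S has fixed size (d x d) regardless of sequence length,
    so inference is O(d^2) per token with no attention matrix."""
    T, d = q.shape
    S = np.zeros((d, d))               # recurrent state, constant size in T
    outputs = np.empty((T, d))
    for t in range(T):
        S = decay * S + np.outer(v[t], k[t])  # write key-value association
        outputs[t] = S @ q[t]                 # read out with the query
    return outputs

rng = np.random.default_rng(0)
T, d = 8, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out = linear_attention_rnn(q, k, v)
print(out.shape)  # (8, 4)
```

The key property is that the per-token cost and memory are independent of context length, unlike softmax attention; architectures like RWKV-7 keep this RNN form while replacing the simple decay-plus-outer-product update with a more expressive state transition.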