This is an automated archive made by the Lemmit Bot.
The original was posted on /r/machinelearning by /u/artificial_intelect on 2024-03-27 14:35:33.
Shill disclaimer: I was the pretraining lead for the project
DBRX deets:
- 16 Experts (12B params per single expert; top_k=4 routing)
- 36B active params (132B total params)
- trained on 12T tokens
- 32k sequence length training
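For anyone unfamiliar with MoE routing: with top_k=4 of 16 experts, the router scores every expert per token, keeps the 4 highest-scoring ones, and renormalizes their weights, so only a fraction of the total params are active per token. A minimal NumPy sketch of that top-k gating step (function and shapes are illustrative, not DBRX's actual router code):

```python
import numpy as np

def topk_route(router_logits, k=4):
    """Pick the top-k experts per token and softmax-normalize their weights.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    # indices of the k largest logits per token
    idx = np.argsort(router_logits, axis=-1)[..., -k:]
    chosen = np.take_along_axis(router_logits, idx, axis=-1)
    # softmax over only the chosen experts (stable via max-subtraction)
    w = np.exp(chosen - chosen.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return idx, w

# 8 tokens routed over 16 experts, 4 active per token
logits = np.random.randn(8, 16)
idx, w = topk_route(logits, k=4)
```

Each token's output is then the weighted sum of its 4 selected experts' outputs, which is why active params (36B) are much smaller than total params (132B).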