This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/artificial_intelect on 2024-03-27 14:35:33.


Shill disclaimer: I was the pretraining lead for the project.

DBRX deets:

  • 16 experts (12B params per single expert; top_k=4 routing)
  • 36B active params (132B total params)
  • trained on 12T tokens
  • trained with a 32k sequence length
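
The parameter counts above hang together arithmetically. Here is a rough sketch of that check, assuming (this split is not stated in the post) that the model consists of one block of shared params used by every token plus 16 equal-sized expert blocks, of which 4 are active per token. The function name `split_params` and the shared/per-expert split are illustrative only, and the inputs are the rounded figures from the list, so treat the outputs as ballpark numbers rather than an official breakdown.

```python
from math import comb

# Rounded figures from the post, in billions of params.
NUM_EXPERTS = 16
TOP_K = 4
TOTAL_B = 132.0
ACTIVE_B = 36.0


def split_params(total_b, active_b, num_experts, top_k):
    """Solve for shared and per-expert params under the ASSUMED model:

        total  = shared + num_experts * per_expert
        active = shared + top_k       * per_expert
    """
    per_expert = (total_b - active_b) / (num_experts - top_k)
    shared = active_b - top_k * per_expert
    return shared, per_expert


shared_b, per_expert_b = split_params(TOTAL_B, ACTIVE_B, NUM_EXPERTS, TOP_K)

# Size of a hypothetical model that keeps the shared params and one expert.
single_expert_b = shared_b + per_expert_b

# Fraction of all params touched per token, and number of distinct
# 4-of-16 expert subsets the router can choose from.
active_fraction = ACTIVE_B / TOTAL_B
expert_subsets = comb(NUM_EXPERTS, TOP_K)

print(f"shared: {shared_b}B, per expert: {per_expert_b}B")
print(f"shared + one expert: {single_expert_b}B")
print(f"active fraction: {active_fraction:.1%}")
print(f"possible expert subsets: {expert_subsets}")
```

Under that assumption, shared params plus a single expert come out to 12B, which lines up with the "12B params per single expert" figure, and roughly 27% of the total params are active for any given token.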