This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/sanobawitch on 2024-10-18 06:45:59+00:00.


Janus is based on DeepSeek-LLM-1.3b-base, which was trained on a corpus of approximately 500B text tokens. For multimodal understanding, it uses SigLIP-L as the vision encoder, which supports 384×384 image input. For image generation, Janus uses the VQ tokenizer from LlamaGen.
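
For anyone who wants to poke at it, here is a rough loading/understanding sketch based on the repo's README. Treat the checkpoint id (`deepseek-ai/Janus-1.3B`) and the `VLChatProcessor` / `MultiModalityCausalLM` class names as assumptions from that README and double-check against the current code:

```python
# Rough sketch of loading Janus for multimodal understanding.
# Assumptions (taken from the repo README, verify before use): checkpoint id
# "deepseek-ai/Janus-1.3B" and the VLChatProcessor / MultiModalityCausalLM
# classes shipped with the deepseek-ai/Janus package.
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

model_path = "deepseek-ai/Janus-1.3B"
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
model = model.to(torch.bfloat16).cuda().eval()

# Understanding path: SigLIP-L takes 384x384 inputs (the processor handles
# resizing), and the resulting features are fed into the 1.3B LLM.
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>\nDescribe the image.",
        "images": ["./example.jpg"],
    },
    {"role": "Assistant", "content": ""},
]
pil_images = load_pil_images(conversation)
inputs = processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(model.device)

inputs_embeds = model.prepare_inputs_embeds(**inputs)
outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    max_new_tokens=512,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```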

They have released the model weights under the DeepSeek Model License; the code is under the MIT license. Commercial usage is permitted.

Three-stage training procedure

Link to the generated images. More examples with longer prompts.

The image resolutions are 1024×1024, 512×512, and 384×384.

Inference code here; they transform the decoded image patches into the final RGB image in… numpy. Yay!
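
Roughly what that numpy post-processing tends to look like (illustrative only, not the exact code from the repo): the image decoder returns a float array in [-1, 1], which gets rescaled, clipped, and cast to uint8 for saving.

```python
# Illustrative sketch of the usual decoder-output -> RGB conversion in numpy.
# Assumes the decoder emits a (3, H, W) float array in [-1, 1]; the actual
# repo code may differ in layout and batching.
import numpy as np
from PIL import Image

def decoded_to_rgb(dec: np.ndarray) -> Image.Image:
    """dec: (3, H, W) float array in [-1, 1] from the image decoder."""
    img = (dec + 1.0) / 2.0 * 255.0          # [-1, 1] -> [0, 255]
    img = np.clip(img, 0, 255).astype(np.uint8)
    img = np.transpose(img, (1, 2, 0))       # CHW -> HWC for PIL
    return Image.fromarray(img)
```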

So far, we have Omni, Mei, Sana, and Janus.