u/aplewe mentioned the following in this post:
This seems kinda-sorta similar to what the next iteration of Stable Diffusion is supposed to do, where an LLM is used to encode the prompt, but much cheaper, because you only have to train LoRAs rather than train an entire diffusion model on all-new captions.
So, if it were laid out in steps:
1.) Grab an image gen model, like Stable Diffusion.
2.) Grab an LLM, like Llama.
3.) Train LoRAs for each model using their code. This code also trains an “adapter” model that’s meant to be used with the LoRAs, but this model is not large (see the training sketch after this list).
Then
4.) Use the language LoRA + the language model to encode the prompt, and
5.) Feed this into the adapter model, which spits out modified embeddings that then go into
6.) The Stable Diffusion LoRA + Stable Diffusion, which turn the adapter’s output into an image (see the inference sketch below).
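A minimal sketch of what step 3 could look like with the usual transformers/diffusers/peft stack. The model names, LoRA target modules, ranks, and adapter dimensions here are illustrative assumptions, not the actual Lavi-Bridge release:

```python
# Hypothetical training setup for step 3: LoRAs on both models plus a small
# bridge adapter, trained together while the base weights stay frozen.
import torch
import torch.nn as nn
from transformers import AutoModel
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

# Frozen base models (checkpoint names are placeholders)
llm = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)

# LoRA on the LLM's attention projections (target modules/rank are assumptions)
llm = get_peft_model(llm, LoraConfig(r=32, lora_alpha=32,
                                     target_modules=["q_proj", "v_proj"]))

# LoRA on the UNet's cross-attention projections via diffusers' PEFT integration
unet.add_adapter(LoraConfig(r=32, lora_alpha=32,
                            target_modules=["to_q", "to_k", "to_v", "to_out.0"]))

# Small "adapter" bridging LLM hidden size -> UNet cross-attention dim (sizes assumed)
adapter = nn.Sequential(nn.Linear(4096, 768), nn.GELU(), nn.Linear(768, 768))

# Only the two LoRAs and the adapter receive gradients; base weights are untouched
trainable = [p for p in llm.parameters() if p.requires_grad]
trainable += [p for p in unet.parameters() if p.requires_grad]
trainable += list(adapter.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Per batch: encode captions with the LoRA'd LLM, map them through the adapter,
# and train with the usual noise-prediction loss on the LoRA'd UNet.
```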
Pretty nifty, IMHO, because you only have to train LoRAs, not go whole-hog and train both models from scratch. Plus, the LoRAs don’t change the original model weights at all, so all the nice-ness of those models is preserved (such as their ability to be generally expressive).
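And a similarly rough sketch of the inference path in steps 4–6, again with assumed model names, paths, and dimensions; the freshly-initialized nn.Sequential below just stands in for the trained adapter weights:

```python
# Hypothetical inference path for steps 4-6.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from diffusers import UNet2DConditionModel

# 4) LLM + its language LoRA encode the prompt into hidden states
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llm = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")
llm = PeftModel.from_pretrained(llm, "path/to/llm_lora")    # hypothetical LoRA path

tokens = tokenizer("a red fox in a snowy forest", return_tensors="pt")
with torch.no_grad():
    hidden = llm(**tokens).last_hidden_state                # (1, seq_len, 4096)

# 5) The bridge adapter maps LLM features into the UNet's conditioning space
adapter = nn.Sequential(nn.Linear(4096, 768), nn.GELU(), nn.Linear(768, 768))
cond = adapter(hidden)                                      # (1, seq_len, 768)

# 6) Stable Diffusion's UNet + its LoRA consume the adapted features as
#    cross-attention conditioning (one denoising step shown; a real run loops
#    over a scheduler and decodes the final latent with the VAE)
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.load_attn_procs("path/to/unet_lora")                   # hypothetical LoRA weights
latents = torch.randn(1, 4, 64, 64)                         # dummy latent
noise_pred = unet(latents, torch.tensor([500]), encoder_hidden_states=cond).sample
```

The point in the quoted comment shows up directly here: both base checkpoints are loaded untouched, and only the two LoRAs plus the small adapter carry the new behaviour.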
I get this “feeling” that SD3 won’t come out for at least a couple of months. Even in its current state, some models show remarkable prompt adherence. Getting the equivalent of ELLA for SDXL via Lavi-Bridge might be epic, since ELLA’s developers seem to be failing to deliver on their promise of code and weights…
I was wondering: is the above something we can do with tools as-is, or are new code/ComfyUI nodes required?