This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/trakusmk on 2024-04-10 05:35:55.


Hey everyone, I’ve been following the development of text-to-image models and noticed something interesting. A lot of the new models and papers, like Stable Diffusion 3, PixArt Sigma, and ELLA, use the FLAN-T5 model for text encoding. Considering there are much larger models out there, like Mistral 7B or even Llama 70B, with far greater language understanding, I’m curious why researchers are sticking with a smaller, older model like FLAN-T5. Any thoughts on why this might be the case?
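One practical point worth noting: T5 is an encoder-decoder model, so its encoder can be used on its own to turn a prompt into a sequence of per-token embeddings, which is exactly the conditioning signal a diffusion model needs. Decoder-only LLMs like Mistral or Llama don't expose a standalone encoder in the same way. Here is a minimal sketch of that usage with Hugging Face `transformers`, assuming the small `google/flan-t5-small` checkpoint just for illustration (the papers above use much larger XL/XXL variants):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Load only the encoder half of FLAN-T5; the decoder is not needed
# for conditioning a diffusion model. (flan-t5-small is an assumption
# here purely to keep the example lightweight.)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-small")

prompt = "a cat wearing a wizard hat, oil painting"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # last_hidden_state has shape (batch, seq_len, d_model); the
    # diffusion U-Net/DiT cross-attends to this token sequence.
    embeddings = encoder(**inputs).last_hidden_state

print(embeddings.shape)
```

The diffusion backbone then cross-attends to `embeddings` at every denoising step, so the text encoder only has to run once per prompt, which is one reason a moderately sized, frozen encoder is attractive.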