The original was posted on /r/stablediffusion by /u/Anzhc on 2025-10-15 21:22:40+00:00.
A little side addendum on CLIPs, following up on this post: https://www.reddit.com/r/StableDiffusion/comments/1o1u2zm/text_encoders_in_noobai_are_dramatically_flawed_a/
I’ll keep it short this time.
While CLIPs are limited to 77 tokens, nothing *really* stops you from feeding them longer context. By default this doesn’t really work:
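For context, “feeding them longer context” here means the usual chunking trick that SD UIs do for long prompts: split the prompt into 77-token windows, run each window through the encoder separately, and concatenate the hidden states. A minimal sketch using the standard `transformers` CLIP API (the padding details are illustrative, not the exact code used here):

```python
# Minimal sketch of the usual >77-token workaround: encode the prompt in
# 77-token chunks and concatenate the per-chunk hidden states.
# Uses the standard transformers CLIP API; chunking/padding details are illustrative.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def encode_long_prompt(prompt: str, chunk_len: int = 77) -> torch.Tensor:
    # Tokenize without truncation, then split into windows of 75 "real" tokens
    # (each chunk gets its own BOS and EOS/padding, for 77 total).
    ids = tokenizer(prompt, truncation=False, add_special_tokens=False).input_ids
    bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id
    body = chunk_len - 2
    chunks = [ids[i:i + body] for i in range(0, max(len(ids), 1), body)]

    hidden = []
    with torch.no_grad():
        for chunk in chunks:
            padded = [bos] + chunk + [eos] * (chunk_len - 1 - len(chunk))
            out = text_encoder(input_ids=torch.tensor([padded]))
            hidden.append(out.last_hidden_state)   # (1, 77, 768)
    return torch.cat(hidden, dim=1)                # (1, 77 * n_chunks, 768)

emb = encode_long_prompt("1girl, solo, long hair, " * 40)
print(emb.shape)
```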
I tuned base CLIP L on ~10,000 text-image pairs filtered by token length. Every image in the dataset has 225+ tokens of tagging. Training was performed with up to 770 tokens.
The validation split is 5%, so ~500 images.
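If you want to reproduce the filtering step, a minimal sketch could look like this (the example pairs and file handling are placeholders; the 225-token threshold and the 5% split are the actual numbers):

```python
# Sketch of the dataset filter: keep only image-caption pairs whose tag string
# tokenizes to 225+ CLIP tokens, then hold out 5% for validation.
# The (image, caption) pairs below are placeholders for illustration.
import random
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def token_length(caption: str) -> int:
    # Count tokens without truncation so long tag strings are measured fully.
    return len(tokenizer(caption, truncation=False, add_special_tokens=False).input_ids)

pairs = [
    ("img_001.png", "1girl, solo, long hair, " * 40),  # placeholder examples
    ("img_002.png", "1boy, short hair"),
]
long_pairs = [(img, cap) for img, cap in pairs if token_length(cap) >= 225]

random.seed(0)
random.shuffle(long_pairs)
split = int(len(long_pairs) * 0.05)
val_set, train_set = long_pairs[:split], long_pairs[split:]
```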
In the length benchmark, each landmark point is the maximum allowed length at which I tested. Up to 77 tokens both CLIPs show fairly normal performance: the more tokens you give, the better they perform. Past 77, the performance of base CLIP L drops drastically (a new chunk has entered the picture, and at 80 tokens it’s mostly filled with nothing), while the tuned variation does not drop. Base CLIP L then recovers to its baseline, but it can’t make use of the additional information, and as more and more tokens are added into the mix it practically dies, as the signal is too overwhelming.
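To make the “landmark” idea concrete, an evaluation loop along these lines is roughly what such a benchmark looks like: truncate each caption to the landmark’s token budget, embed it in chunks, and check whether it picks the matching image out of a handful of candidates. The chunk-averaging of text features and the 4-candidate matching are assumptions on my part, not the exact benchmark code:

```python
# Rough sketch of a length benchmark: truncate each caption to the landmark's
# token budget, embed it in 77-token chunks, and check whether it picks the
# matching image out of 4 candidates (chance = 25%). Chunk-averaging and the
# candidate count are assumptions about the setup, not the exact code used.
import random
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def text_features_long(caption: str, max_tokens: int) -> torch.Tensor:
    tok = processor.tokenizer
    ids = tok(caption, truncation=False, add_special_tokens=False).input_ids[:max_tokens]
    chunks = [ids[i:i + 75] for i in range(0, max(len(ids), 1), 75)]
    feats = []
    for chunk in chunks:
        padded = [tok.bos_token_id] + chunk + [tok.eos_token_id] * (76 - len(chunk))
        feats.append(model.get_text_features(input_ids=torch.tensor([padded])))
    return torch.nn.functional.normalize(torch.stack(feats).mean(0), dim=-1)

@torch.no_grad()
def landmark_accuracy(images, captions, max_tokens: int, n_candidates: int = 4) -> float:
    # images: list of PIL images, captions: matching list of tag strings.
    pixels = processor(images=images, return_tensors="pt").pixel_values
    img_f = torch.nn.functional.normalize(model.get_image_features(pixel_values=pixels), dim=-1)
    hits = 0
    for i, caption in enumerate(captions):
        distractors = random.sample([j for j in range(len(images)) if j != i], n_candidates - 1)
        candidates = [i] + distractors
        sims = text_features_long(caption, max_tokens) @ img_f[candidates].T
        hits += int(candidates[sims.argmax().item()] == i)
    return hits / len(captions)

# Landmark values here are just examples taken from the post:
# for landmark in (77, 300, 770):
#     print(landmark, landmark_accuracy(val_images, val_captions, landmark))
```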
Tuned performance peaks at ~300 tokens (~75 tags). Why? Shouldn’t it be able to utilize even more tokens?
Yeah, and it is able to. What you see here is data saturation: beyond 300 tokens there are very few images that can actually keep extending the information. The majority of the dataset is exhausted, so there is no new data to discern, and performance flatlines.
There is, however, another chart I can show, which decouples performance from saturated data:
This chart removes images that are not able to saturate the tested landmark.
Important note: as images get removed, the benchmark becomes easier, since there are fewer samples to compare against, so if you want to judge performance, use the results of the first set of graphs.
But with that aside, let’s address this set.
It is basically the same image, but as the sample count decreases, base CLIP L has its performance “improved” by sheer chance: beyond 100 tags the data is too small, which lets the model guess by pure luck, so 1/4 correct gives 25% :D
In reality, I wouldn’t consider the data in this set very reliable beyond 300 tokens, as the further sets are done on fewer than 100 images and are likely much easier to solve.
But the conclusion that can be made is that a CLIP tuned with long captions is able to utilize the information in those captions to reliably (80% on full data is quite decent) discern anime images, while default CLIP L likely treats it as more or less noise.
And no, it is not usable out of the box.
But patterns are nice.
I will upload it to HF if you want to experiment or something.
And node graphs for those who are interested, of course, but without explanations this time. There is nothing concerning longer context in them, really.
Red - Tuned, Blue - Base
PCA:
t-SNE:
PaCMAP:
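If you want to make projections like these yourself, here is a rough sketch, assuming the graphs are 2D projections of text embeddings from both encoders over the same captions (the tuned-model path and the captions are placeholders, and the plotting specifics are guesses):

```python
# Sketch: embed the same captions with both text encoders, stack the features,
# and project them to 2D with PCA and t-SNE, colored by model.
# Which vectors the original graphs actually project is an assumption here.
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
base = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()
tuned = CLIPTextModelWithProjection.from_pretrained("path/to/tuned-clip-l").eval()  # placeholder path

@torch.no_grad()
def embed(model, captions):
    batch = tokenizer(captions, padding="max_length", truncation=True,
                      max_length=77, return_tensors="pt")
    return model(**batch).text_embeds.numpy()

captions = ["1girl, solo, long hair", "scenery, night sky, stars"]  # placeholder captions
base_e, tuned_e = embed(base, captions), embed(tuned, captions)
X = np.concatenate([base_e, tuned_e])

for name, reducer in [("PCA", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, perplexity=min(5, len(X) - 1)))]:
    proj = reducer.fit_transform(X)
    n = len(base_e)
    plt.figure()
    plt.scatter(proj[:n, 0], proj[:n, 1], c="blue", label="Base")
    plt.scatter(proj[n:, 0], proj[n:, 1], c="red", label="Tuned")
    plt.title(name)
    plt.legend()
plt.show()
```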
HF link: https://huggingface.co/Anzhc/SDXL-Text-Encoder-Longer-CLIP-L/tree/main
Probably don’t bother downloading if you’re not going to tune your model in some way to adjust to it.
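If you do want to poke at it anyway, here is a rough sketch of swapping it into an SDXL pipeline with diffusers. The filename and key layout in the repo are assumptions on my part, so check the file list and adjust the loading accordingly:

```python
# Sketch of swapping the downloaded encoder into an SDXL pipeline for experiments.
# The exact filename/format in the HF repo is an assumption; adjust to whatever
# the repo actually ships.
import torch
from diffusers import StableDiffusionXLPipeline
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Hypothetical filename; check the repo's file list for the real one.
weights_path = hf_hub_download(
    "Anzhc/SDXL-Text-Encoder-Longer-CLIP-L", "model.safetensors"
)
state_dict = load_file(weights_path)

# pipe.text_encoder is the CLIP-L encoder in SDXL; strict=False so mismatched
# keys just get reported instead of raising.
missing, unexpected = pipe.text_encoder.load_state_dict(state_dict, strict=False)
print("missing:", len(missing), "unexpected:", len(unexpected))
```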