The original was posted on /r/stablediffusion by /u/Anzhc on 2025-10-15 21:22:40+00:00.
A little side addendum on CLIPs, following up on this post: https://www.reddit.com/r/StableDiffusion/comments/1o1u2zm/text_encoders_in_noobai_are_dramatically_flawed_a/
I’ll keep it short this time.
While CLIPs are limited to 77 tokens, nothing *really* stops you from feeding them longer context. By default this doesn’t really work:
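For context, “feeding them longer context” here means the usual chunking trick that SD UIs do for long prompts: split the prompt into 77-token windows, run each window through the encoder separately, and concatenate the hidden states. A minimal sketch using the standard `transformers` CLIP API (the padding details are illustrative, not the exact code used here):

```python
# Minimal sketch of the usual >77-token workaround: encode the prompt in
# 77-token chunks and concatenate the per-chunk hidden states.
# Uses the standard transformers CLIP API; chunking/padding details are illustrative.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def encode_long_prompt(prompt: str, chunk_len: int = 77) -> torch.Tensor:
    # Tokenize without truncation, then split into windows of 75 "real" tokens
    # (each chunk gets its own BOS and EOS/padding, for 77 total).
    ids = tokenizer(prompt, truncation=False, add_special_tokens=False).input_ids
    bos, eos = tokenizer.bos_token_id, tokenizer.eos_token_id
    body = chunk_len - 2
    chunks = [ids[i:i + body] for i in range(0, max(len(ids), 1), body)]

    hidden = []
    with torch.no_grad():
        for chunk in chunks:
            padded = [bos] + chunk + [eos] * (chunk_len - 1 - len(chunk))
            out = text_encoder(input_ids=torch.tensor([padded]))
            hidden.append(out.last_hidden_state)   # (1, 77, 768)
    return torch.cat(hidden, dim=1)                # (1, 77 * n_chunks, 768)

emb = encode_long_prompt("1girl, solo, long hair, " * 40)
print(emb.shape)
```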
I tuned base CLIP L on ~10,000 text-image pairs filtered by token length. Every image in the dataset has 225+ tokens of tagging. Training was performed with up to 770 tokens.
The validation split is 5%, so ~500 images.
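If you want to reproduce the filtering step, a minimal sketch could look like this (the example pairs and file handling are placeholders; the 225-token threshold and the 5% split are the actual numbers):

```python
# Sketch of the dataset filter: keep only image-caption pairs whose tag string
# tokenizes to 225+ CLIP tokens, then hold out 5% for validation.
# The (image, caption) pairs below are placeholders for illustration.
import random
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

def token_length(caption: str) -> int:
    # Count tokens without truncation so long tag strings are measured fully.
    return len(tokenizer(caption, truncation=False, add_special_tokens=False).input_ids)

pairs = [
    ("img_001.png", "1girl, solo, long hair, " * 40),  # placeholder examples
    ("img_002.png", "1boy, short hair"),
]
long_pairs = [(img, cap) for img, cap in pairs if token_length(cap) >= 225]

random.seed(0)
random.shuffle(long_pairs)
split = int(len(long_pairs) * 0.05)
val_set, train_set = long_pairs[:split], long_pairs[split:]
```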
In the length benchmark, each landmark point is the maximum allowed length at which I tested. Up to 77 tokens both CLIPs show fairly normal performance: the more tokens you give, the better they perform. Past 77, the performance of base CLIP L drops drastically (a new chunk has entered the picture, and at 80 tokens it’s mostly filled with nothing), while the tuned variation does not drop. Base CLIP L then recovers to its baseline, but it can’t make use of the additional information, and as more and more tokens are added into the mix it practically dies, as the signal is too overwhelming.
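To make the “landmark” idea concrete, an evaluation loop along these lines is roughly what such a benchmark looks like: truncate each caption to the landmark’s token budget, embed it in chunks, and check whether it picks the matching image out of a handful of candidates. The chunk-averaging of text features and the 4-candidate matching are assumptions on my part, not the exact benchmark code:

```python
# Rough sketch of a length benchmark: truncate each caption to the landmark's
# token budget, embed it in 77-token chunks, and check whether it picks the
# matching image out of 4 candidates (chance = 25%). Chunk-averaging and the
# candidate count are assumptions about the setup, not the exact code used.
import random
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def text_features_long(caption: str, max_tokens: int) -> torch.Tensor:
    tok = processor.tokenizer
    ids = tok(caption, truncation=False, add_special_tokens=False).input_ids[:max_tokens]
    chunks = [ids[i:i + 75] for i in range(0, max(len(ids), 1), 75)]
    feats = []
    for chunk in chunks:
        padded = [tok.bos_token_id] + chunk + [tok.eos_token_id] * (76 - len(chunk))
        feats.append(model.get_text_features(input_ids=torch.tensor([padded])))
    return torch.nn.functional.normalize(torch.stack(feats).mean(0), dim=-1)

@torch.no_grad()
def landmark_accuracy(images, captions, max_tokens: int, n_candidates: int = 4) -> float:
    # images: list of PIL images, captions: matching list of tag strings.
    pixels = processor(images=images, return_tensors="pt").pixel_values
    img_f = torch.nn.functional.normalize(model.get_image_features(pixel_values=pixels), dim=-1)
    hits = 0
    for i, caption in enumerate(captions):
        distractors = random.sample([j for j in range(len(images)) if j != i], n_candidates - 1)
        candidates = [i] + distractors
        sims = text_features_long(caption, max_tokens) @ img_f[candidates].T
        hits += int(candidates[sims.argmax().item()] == i)
    return hits / len(captions)

# Landmark values here are just examples taken from the post:
# for landmark in (77, 300, 770):
#     print(landmark, landmark_accuracy(val_images, val_captions, landmark))
```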
Tuned performance peaks at ~300 tokens (~75 tags). Why? Shouldn’t it be able to utilize even more tokens?
Yeah, and it is able to. What you see here is data saturation: beyond 300 tokens there are very few images that can actually keep extending the information. The majority of the dataset is exhausted, so there is no new data to discern, and performance flatlines.
There is, however, another chart I can show, which decouples performance from saturated data:
This chart removes images that are not able to saturate the tested landmark.
Important note: as images get removed, the benchmark becomes easier, since there are fewer samples to compare against, so if you want to judge performance, use the results of the first set of graphs.
But with that aside, let’s address this set.
It is basically the same image, but as the sample count decreases, base CLIP L has its performance “improved” by sheer chance: beyond 100 tags the data is too small, which lets the model guess by pure luck, so 1/4 correct gives 25% :D
In reality, I wouldn’t consider the data in this set very reliable beyond 300 tokens, as the further sets are done on fewer than 100 images and are likely much easier to solve.
But the conclusion that can be made is that a CLIP tuned with long captions is able to utilize the information in those captions to reliably (80% on full data is quite decent) discern anime images, while default CLIP L likely treats it as more or less noise.
And no, it is not usable out of the box.
But patterns are nice.
I will upload it to HF if you want to experiment or something.
And node graphs for those who are interested, of course, but without explanations this time. There is nothing concerning longer context in them, really.
Red - Tuned, Blue - Base
PCA:
t-SNE:
PaCMAP:
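If you want to make projections like these yourself, here is a rough sketch, assuming the graphs are 2D projections of text embeddings from both encoders over the same captions (the tuned-model path and the captions are placeholders, and the plotting specifics are guesses):

```python
# Sketch: embed the same captions with both text encoders, stack the features,
# and project them to 2D with PCA and t-SNE, colored by model.
# Which vectors the original graphs actually project is an assumption here.
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
base = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()
tuned = CLIPTextModelWithProjection.from_pretrained("path/to/tuned-clip-l").eval()  # placeholder path

@torch.no_grad()
def embed(model, captions):
    batch = tokenizer(captions, padding="max_length", truncation=True,
                      max_length=77, return_tensors="pt")
    return model(**batch).text_embeds.numpy()

captions = ["1girl, solo, long hair", "scenery, night sky, stars"]  # placeholder captions
base_e, tuned_e = embed(base, captions), embed(tuned, captions)
X = np.concatenate([base_e, tuned_e])

for name, reducer in [("PCA", PCA(n_components=2)),
                      ("t-SNE", TSNE(n_components=2, perplexity=min(5, len(X) - 1)))]:
    proj = reducer.fit_transform(X)
    n = len(base_e)
    plt.figure()
    plt.scatter(proj[:n, 0], proj[:n, 1], c="blue", label="Base")
    plt.scatter(proj[n:, 0], proj[n:, 1], c="red", label="Tuned")
    plt.title(name)
    plt.legend()
plt.show()
```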
HF link: https://huggingface.co/Anzhc/SDXL-Text-Encoder-Longer-CLIP-L/tree/main
Probably don’t bother downloading if you’re not going to tune your model in some way to adjust to it.
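If you do want to poke at it anyway, here is a rough sketch of swapping it into an SDXL pipeline with diffusers. The filename and key layout in the repo are assumptions on my part, so check the file list and adjust the loading accordingly:

```python
# Sketch of swapping the downloaded encoder into an SDXL pipeline for experiments.
# The exact filename/format in the HF repo is an assumption; adjust to whatever
# the repo actually ships.
import torch
from diffusers import StableDiffusionXLPipeline
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Hypothetical filename; check the repo's file list for the real one.
weights_path = hf_hub_download(
    "Anzhc/SDXL-Text-Encoder-Longer-CLIP-L", "model.safetensors"
)
state_dict = load_file(weights_path)

# pipe.text_encoder is the CLIP-L encoder in SDXL; strict=False so mismatched
# keys just get reported instead of raising.
missing, unexpected = pipe.text_encoder.load_state_dict(state_dict, strict=False)
print("missing:", len(missing), "unexpected:", len(unexpected))
```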