This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/MarcS- on 2024-04-21 00:24:21.


Hi,

Since I had compiled a list of prompts for beta-testers to try and no one took my suggestions (boooo), I decided to give it a try myself now that it is available for testing. I used the API service, which is rumored not to be the latest version with all the bells and whistles of the most recent iteration, according to a Twitter post (yes, I know that’s a pretty low-quality source of information). But anyway, we must test what we’ve got until the release of the weights.

The list was designed to test several key points of SD3’s advertised strengths. I generated several attempts for each prompt, up to 10. If the model can’t produce something acceptable at least once in 10 attempts, I think the prompt should be marked as failed. I also compared against generations from Dall-E 3 (through ChatGPT), as this is currently the generally accepted best image-creation service in terms of prompt understanding (despite being closed source, barely editable, and so on). I also compared with Juggernaut, but for the sake of brevity I won’t comment on that here.
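For anyone who wants to reproduce this, here is a minimal sketch of the kind of loop I used to collect attempts. The endpoint URL, form fields, and the `STABILITY_API_KEY` variable are assumptions based on Stability AI’s public v2beta docs, not a guaranteed match for the exact service used here; adapt as needed:

```python
# Minimal sketch: generate up to 10 attempts per prompt via the SD3 API.
# The endpoint and form fields are assumed from Stability AI's public
# v2beta documentation and may differ from the exact service used here.
import os
import requests

API_URL = "https://api.stability.ai/v2beta/stable-image/generate/sd3"  # assumed endpoint
API_KEY = os.environ["STABILITY_API_KEY"]  # hypothetical env var holding your key

def generate_attempts(prompt: str, n: int = 10) -> None:
    for i in range(n):
        resp = requests.post(
            API_URL,
            headers={"authorization": f"Bearer {API_KEY}", "accept": "image/*"},
            files={"none": ""},  # forces multipart/form-data, which the API expects
            data={"prompt": prompt, "output_format": "png"},
        )
        resp.raise_for_status()  # non-200 responses carry a JSON error body
        with open(f"attempt_{i:02d}.png", "wb") as f:
            f.write(resp.content)

generate_attempts("A queue of people waiting in line to buy bread...")
```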

First: A queue of people waiting in line to buy bread in a Soviet-era bakery, with больше хлеба (“more bread”) written on a green neon sign on the door. The goal of this prompt was to see if the advertised text capabilities extend to other alphabets. I am not sure it’s proper Russian (I used Google Translate and asked for “no more bread”), but at least it has some Cyrillic characters. Other images shown during the beta test suggested that kanji wasn’t part of the training data, so I tried something closer to the Latin alphabet.

It kind of failed. While the snow captures the idea that the Russian climate is less than ideal, the green neon sign is there, and the crowd of people queueing is not bad, it failed on text adherence. Also, there is somehow plenty of bread in these shortage-stricken bakeries. Fortunately, the Dall-E attempts (best of 4) were all worse. That’s to be expected, I suppose, as the prompts in my list are hard.

This is the best version Dall-E could do. What I disliked most were the queue passing in front of the store, the added nonexistent characters, and the general additions to the prompt. I won’t run a tally on the images, but in my opinion (including aesthetics, which is highly subjective), SD3 did a slightly better job despite failing on key points.

The second prompt was a test of recognizing right and left. I asked for a samurai aiming his bow from horseback, galloping from the left to the right of the image. It also tested bows, which SDXL generally (i.e., nearly all the time) failed to depict correctly, seeming to think a bow is a close-combat clubbing weapon.

The reference from Dall-E was quite good, but it got the direction wrong 50% of the time. That’s strange, as it’s a basic positioning question an advanced model should nail nearly every time.

The results from SD3 were nice images, with close adherence to the direction, but they still failed totally on the bow. Here is the best result out of six:

I like the image, but galloping on a three-legged horse? Also, the bowstring is not attached, and a bow isn’t aimed that way. That’s sad. The best bow I got was this one:

It’s close. The Japanese apparently breed three-legged stallions, and the direction is wrong, but the bow is… less bad. I guess running 100 images with a turbo model could produce something good… but at the current API price, it’s not something I am willing to do. It will have to wait for a free self-hosted solution.

The second part of the challenge was an even more dynamic scene: A dynamic image of a samurai jumping and doing a backflip from horseback, while aiming his bow at a Komodo dragon.

OK, I intended the samurai to jump from horseback with his bow… Here it’s the HORSE that is trying to do a backflip, further complicated by the lack of a fourth leg… I like the aesthetic of the image, and curiously the bow is less awful than before.

Dall-E did worse. I got a nonsensical image with three dragons and a levitating samurai (available on demand if it interests anyone) and a nice image of a samurai on horseback fighting a very nice flying dragon, an ability the Komodo version isn’t known to use often. So the prompt failed on every system, but I still think SD3 is slightly superior. Though I may be biased, even if I generally like Dall-E 3’s output.

The third test was a pair of prompts about Rio de Janeiro. I asked for a wide view of Rio de Janeiro bay featuring Copacabana and the Christ statue on Corcovado mountain, and a painting of the Rio de Janeiro bay painted in 1408. The latter was to test how it would take the date into account, both for the composition (there were no buildings there at that date) and for the style, as I expected specifying the date to produce an early-15th-century painting style.

It was too smart for SD3.

The Christ the Redeemer statue is there, but the bay is nonsensical.

Enjoy 1408 skyscrapers and ships in early Renaissance style.

At least it tried, and I guess I should have emphasized the style more. For reference, here is Dall-E’s attempt:

One can’t blame an image generation AI for knowing when Rio was founded.

In the next post in the series (unfortunately in another thread, as I can’t post several images in the same reply), we’ll see these prompts: a trio of SS soldiers on the Eastern Front, looking sad; the Easter procession in Sevilla; a detailed picture of a sexy catgirl doing a handstand over a table; and finally an attempt at a yoga pose. Expect body horror!