We compared 126 keyword modifiers with the same prompt and initial image. These are the results.
We started by running the prompt "Scary skeleton astronaut in space" for 400 iterations at thumb resolution (400x400px). That gave us this base image.
Then, we evolved that creation 126 times, each time adding a different keyword modifier, and running for an additional 400 iterations. "Evolving" a creation uses the previous creation's output as the start image for the next, so every experiment started from the base image (i.e. NOT from scratch).
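The setup above can be sketched as a simple experiment plan. This is a minimal illustration, not the code we actually ran: the `Run` class, the `plan_experiment` function, the `base.png` file name and the space used to join the modifier to the prompt are all hypothetical, and only five of the 126 modifiers are listed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BASE_PROMPT = "Scary skeleton astronaut in space"
ITERATIONS = 400  # used for the base run and for every evolved run


@dataclass(frozen=True)
class Run:
    """One generation job (hypothetical structure, for illustration only)."""
    prompt: str
    iterations: int
    start_image: Optional[str]  # None means the run starts from scratch


def plan_experiment(
    modifiers: List[str],
    base_image: str = "base.png",  # assumed file name for the base run's output
    separator: str = " ",          # assumed way of joining prompt and modifier
) -> Tuple[Run, List[Run]]:
    """Return the base run plus one evolved run per modifier."""
    base = Run(BASE_PROMPT, ITERATIONS, None)
    evolved = [
        # Every evolved run starts from the SAME base image, not from the
        # output of the previous modifier's run.
        Run(f"{BASE_PROMPT}{separator}{modifier}", ITERATIONS, base_image)
        for modifier in modifiers
    ]
    return base, evolved


base, evolved = plan_experiment(
    ["Unreal Engine", "CryEngine", "VRay", "SketchUp", "futuristic"]
)
for run in evolved:
    print(run.prompt, "<-", run.start_image)
```

Because each evolved run shares the same start image and iteration count, any difference between two results can be attributed to the modifier rather than to a different starting point.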
Want to try your own modifiers on this base image? Click "Evolve It" below then add your modifier to the prompt.
Taken from our VQGAN+CLIP tutorial on Medium.
Modifiers are just keywords that have been found to have a strong influence on how the AI interprets your prompt. In most cases, using one or more modifiers in your prompt will dramatically improve the resulting image. Here’s an example using the text prompt “A dog on the beach”. The top-left image (without any modifiers) is noticeably worse than the others.
So why do modifiers have such a dramatic effect? It comes down to the data that the CLIP network was trained on: millions of image and caption pairs from the internet. Among those images, the ones that include the words “Thomas Kinkade” in the caption tend to be nicely textured paintings like those shown in the centre-left image. Likewise, the images that were paired with a caption containing the words “Unreal Engine” tend to look like scenes from a video game (because Unreal Engine is a video game rendering engine).
Thus, when you include modifiers like “Thomas Kinkade” or “Unreal Engine”, CLIP knows that the image should look a certain way. Note that in the examples above, it’s not so much the shapes that improve with modifiers as the finer textures.
The modifiers that are 3D rendering engines (pictured: Unreal Engine, CryEngine, VRay, SketchUp) really shine here. Interestingly, the rendering engines targeted at games (Unreal Engine, CryEngine) both ended up with a spaceship interior in the background.
Some modifiers like "futuristic", "mystical", "dream" and a few others didn't end up deviating far from the base image. Perhaps this means that CLIP doesn't have a strong concept of what these keywords should look like? Or maybe these modifiers are a bit too broad and therefore hard for CLIP to steer towards any particular look? It would be interesting to do more experiments with these modifiers to get a better idea of what's going on.
Scroll through the results and vote for your favourites by liking them.