The purpose of the experiment was to run one (slightly complicated) prompt through each model and see how each handled it.
This is not a rating of the models overall, as some specialise, but of how well each did on this specific prompt.
I have rated using the 9-point system below:
Delete, tweak, publish (if it was not a test) - 0.5 (work on the faces)
Immediately noticeable glitch - 1
Understood the assignment (hair) - 1
Understood the assignment (clothing) - 0.7 (mostly)
Understood the assignment (gender) - 1
Interesting background without prompt - 0
Do I think changes to the prompt may work better - 1 (yes)
Cost - 1
Would I use the model again - 1
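
As a rough sketch of how the scores above could be totalled: this assumes each of the nine criteria is worth at most 1 point, and that the values listed are the scores for one model on this prompt. The dictionary name and the short keys are made up for illustration.

```python
# Hypothetical tally of the rubric scores listed above.
# Assumption: each criterion is worth at most 1 point, so 9 criteria give a 9-point scale.
MAX_PER_CRITERION = 1.0

scores = {
    "delete_tweak_publish": 0.5,       # work on the faces
    "noticeable_glitch": 1.0,
    "understood_hair": 1.0,
    "understood_clothing": 0.7,        # mostly
    "understood_gender": 1.0,
    "interesting_background": 0.0,
    "prompt_changes_may_help": 1.0,    # yes
    "cost": 1.0,
    "would_use_again": 1.0,
}

# Round to one decimal place to hide floating-point noise in the sum.
total = round(sum(scores.values()), 1)
max_total = MAX_PER_CRITERION * len(scores)

print(f"{total} / {max_total:g}")  # prints "7.2 / 9"
```

Keeping the scores per criterion, not just the total, makes it easy to see where a model lost points (here the faces, the clothing and the background).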