Well, IIRC DALL-E's public release specifically tries to censor humans, but other models like Stable Diffusion and Midjourney do not have these restrictions. Still, both are quite terrible at human faces. This is always interesting, because generating faces on datasets like FFHQ or CelebA has always been considered an easy generative task (despite humans being quite adept at recognizing when something is wrong with a face). But creating whole scenes requires the model to cover a substantially more diverse range of images. A good network has to both memorize and create. There are some technical issues (like long-range features) that I am certain will be resolved simply through increased hardware power/memory. But alignment will still be quite difficult, as the entire problem is ill-defined to begin with.

But to give you some hints on how to use DALL-E better: there are magic keywords, which is why people suggest writing your prompt as if the thing already exists. Some of these magic words are: screenshot, unreal engine, photorealistic, Studio Ghibli. It does better with anime styles, probably due to the training data and how humans interpret the results. Try to write longer prompts too. For example, "flying otters" will give you otters in the water, but "a photorealistic unreal engine render of otters flying through a beautiful sunset" will give you something much closer to what you actually want.
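
If you want to try this systematically, here is a minimal sketch of that prompt-expansion idea in Python. It only builds the prompt string; it does not call DALL-E or any other service. The `build_prompt` helper and its parameters are names I made up for illustration, and the "a ... render of ..." template is just one phrasing that reproduces the otter example above.

```python
def build_prompt(subject, styles=(), setting=""):
    """Expand a bare subject into a longer, more descriptive prompt.

    Hypothetical helper for illustration only; not part of any DALL-E tooling.
    """
    # Put the style keywords up front and phrase the prompt as if
    # describing an image that already exists.
    style_part = " ".join(styles)
    prompt = f"a {style_part} render of {subject}" if style_part else subject
    # Optionally append a scene description to make the prompt longer.
    if setting:
        prompt += f" {setting}"
    return prompt


short_prompt = "flying otters"
long_prompt = build_prompt(
    "otters flying",
    styles=("photorealistic", "unreal engine"),
    setting="through a beautiful sunset",
)
print(long_prompt)
# a photorealistic unreal engine render of otters flying through a beautiful sunset
```

Swapping the `styles` tuple lets you compare how different keywords change the output for the same subject.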