DALL-E 2 is out. I ran 200 prompts and measured the results.
I generated 200 images across 10 categories and rated coherence, prompt adherence, and artifact frequency. DALL-E 2 is good, but 'good' means different things for different prompt types.
I've been on the DALL-E 2 waitlist since the announcement, and when access finally landed in my inbox, I did what any reasonable person would do.
I generated 200 images, built a scoring rubric, and spent three days rating every single one.
My friends are concerned about me. I think that's fair.
The methodology
I designed 200 prompts across 10 categories, 20 prompts each. Every resulting image was scored on three dimensions:
- Coherence (1-5): Does the image make visual sense? No melted faces, no impossible geometry?
- Prompt adherence (1-5): Did it actually generate what I asked for?
- Artifact frequency (1-5): How clean is the output? Scored so that higher is cleaner: 5 means no visible artifacts.
I scored everything myself, which introduces bias. I know that. But consistency of scorer matters more than objectivity here, and I'm very consistent (my partner would say "obsessive" but I prefer "methodical").
All images were generated at 1024x1024 using DALL-E 2's default settings through the API.
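For anyone who wants to keep a similar spreadsheet in code, here is a minimal sketch of how a rubric like this could be recorded and averaged per category. The `Rating` record, the `category_averages` helper, and the sample scores are all hypothetical, made up for illustration rather than taken from the actual scoring sheet.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

DIMENSIONS = ("coherence", "adherence", "artifact")


@dataclass(frozen=True)
class Rating:
    """One scored image (hypothetical record layout)."""
    category: str
    prompt: str
    coherence: int  # 1-5, higher is better
    adherence: int  # 1-5, higher is better
    artifact: int   # 1-5, 5 means no visible artifacts

    def __post_init__(self):
        # Reject scores outside the 1-5 rubric range.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be 1-5, got {value}")


def category_averages(ratings):
    """Average each dimension within each category, rounded to 2 places."""
    grouped = defaultdict(list)
    for rating in ratings:
        grouped[rating.category].append(rating)
    return {
        category: {
            dim: round(mean(getattr(r, dim) for r in rows), 2)
            for dim in DIMENSIONS
        }
        for category, rows in grouped.items()
    }


# Invented sample scores, for illustration only.
sample = [
    Rating("Simple objects", "A red ceramic mug on a wooden table", 5, 5, 4),
    Rating("Simple objects", "hypothetical second prompt", 4, 5, 5),
]
averages = category_averages(sample)
print(averages)
```

Validating the 1-5 range at entry time is a small design choice that keeps a typo like a 55 from quietly skewing a 20-image category average.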
The results, by category
| Category | Coherence (avg) | Prompt adherence (avg) | Artifact score (avg) | Overall (avg) |
|----------|-----------------|----------------------|---------------------|---------------|
| Simple objects | 4.7 | 4.8 | 4.5 | 4.67 |
| Animals | 4.5 | 4.6 | 4.3 | 4.47 |
| Landscapes | 4.6 | 4.4 | 4.4 | 4.47 |
| People (portraits) | 3.2 | 3.8 | 3.1 | 3.37 |
| People (full body) | 3.0 | 3.5 | 2.9 | 3.13 |
| Abstract concepts | 3.8 | 3.1 | 4.0 | 3.63 |
| Multi-object scenes | 3.4 | 3.0 | 3.5 | 3.30 |
| Text in images | 1.9 | 1.6 | 2.8 | 2.10 |
| Specific art styles | 4.3 | 4.1 | 4.2 | 4.20 |
| Spatial relationships | 2.8 | 2.4 | 3.6 | 2.93 |
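The Overall column is consistent with an unweighted mean of the three dimension averages. The sketch below assumes that is how it was computed (the post doesn't say so explicitly) and reproduces the column from the numbers in the table, then ranks the categories. The names `CATEGORY_SCORES`, `overall`, and `ranked` are illustrative.

```python
from statistics import mean

# (coherence, prompt adherence, artifact score) averages copied from the table.
CATEGORY_SCORES = {
    "Simple objects": (4.7, 4.8, 4.5),
    "Animals": (4.5, 4.6, 4.3),
    "Landscapes": (4.6, 4.4, 4.4),
    "People (portraits)": (3.2, 3.8, 3.1),
    "People (full body)": (3.0, 3.5, 2.9),
    "Abstract concepts": (3.8, 3.1, 4.0),
    "Multi-object scenes": (3.4, 3.0, 3.5),
    "Text in images": (1.9, 1.6, 2.8),
    "Specific art styles": (4.3, 4.1, 4.2),
    "Spatial relationships": (2.8, 2.4, 3.6),
}

# Assumption: Overall = unweighted mean of the three dimensions, 2 decimals.
overall = {
    category: round(mean(scores), 2)
    for category, scores in CATEGORY_SCORES.items()
}

# Rank from strongest to weakest category.
ranked = sorted(overall.items(), key=lambda item: item[1], reverse=True)
for category, score in ranked:
    print(f"{category:<24}{score:.2f}")
```

Sorting this way puts simple objects at the top and text in images at the bottom, matching the table.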
Let me walk through the highlights because the averages hide the interesting stuff.
Where DALL-E 2 genuinely impresses
Simple objects and animals are stunning. "A red ceramic mug on a wooden table" comes back looking like a photograph. "A golden retriever running through autumn leaves" is beautiful enough to frame.
The art style category surprised me the most. "A mountain village in the style of Hokusai" produced something I'd honestly hang on my wall. "A cityscape as a 1950s travel poster" nailed the color palette and composition. OpenAI's research page showcases similar examples, and my own testing confirmed they aren't cherry-picked flukes. The model has internalized visual style categories remarkably well.
Landscapes are also strong. I got maybe 2 out of 20 that looked off. The rest were genuinely good.
Where it falls apart
People. DALL-E 2 still struggles with people, especially hands (the classic problem) and faces at any distance beyond portrait-close. I counted fingers in every image that included human hands. Of the 40 people prompts, 27 had incorrect finger counts. That's a 67.5% error rate on hands.
But the real weakness is spatial relationships. "A cat sitting ON TOP OF a dog" frequently produced a cat next to a dog, or a strange merged animal. Left/right prompts like "A book to the LEFT of a lamp" put the objects in the right place only 40% of the time.
| Spatial prompt type | Correct placement rate |
|--------------------|----------------------|
| "X on top of Y" | 45% |
| "X next to Y" | 70% |
| "X behind Y" | 35% |
| "X inside Y" | 55% |
| "X to the left/right of Y" | 40% |
And text generation is basically broken. "A sign that reads OPEN" produced gibberish letters 85% of the time. The model clearly doesn't understand text as semantic content. It treats letters as visual patterns and gets them wrong almost always.
The training data question
DALL-E 2 was trained on image-text pairs, and OpenAI hasn't fully disclosed the dataset. But the model's strengths and weaknesses are readable as a training data fingerprint.
It's excellent at photographic styles, suggesting heavy representation of stock photography and natural images. It handles art styles well, which suggests that captioned artwork, of the kind found in large web-scraped image-text collections like LAION, was well represented in the training mix. It struggles with precise spatial relationships, which makes sense because image captions in training data rarely describe spatial layouts explicitly.
Compared to DALL-E 1
I don't have systematic data from DALL-E 1 (it was much harder to access), but judging from the published samples and my limited prior testing, the jump is enormous.
| Dimension | DALL-E 1 (rough estimate) | DALL-E 2 (my data) | Improvement |
|-----------|--------------------------|--------------------|-------------|
| Overall coherence | ~2.5/5 | ~3.6/5 | +44% |
| Resolution | 256x256 | 1024x1024 | 4x per side (16x the pixels) |
| Artifact score (higher is cleaner) | ~2.0/5 | ~3.7/5 | +85% |
| Realistic style capability | Low | High | Qualitative leap |
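The percentages in that table can be checked with a few lines of arithmetic. This sketch assumes the DALL-E 2 figures are the unweighted means of the coherence and artifact columns from the results table (which matches the ~3.6 and ~3.7 quoted), and it takes the DALL-E 1 figures as the rough estimates they are. The helper `relative_gain` is an illustrative name.

```python
from statistics import mean

# Per-category averages copied from the results table, in table order.
COHERENCE = [4.7, 4.5, 4.6, 3.2, 3.0, 3.8, 3.4, 1.9, 4.3, 2.8]
ARTIFACT = [4.5, 4.3, 4.4, 3.1, 2.9, 4.0, 3.5, 2.8, 4.2, 3.6]

# Rough DALL-E 1 estimates from the comparison table (not measured data).
DALLE1_COHERENCE = 2.5
DALLE1_ARTIFACT = 2.0


def relative_gain(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Assumption: the quoted DALL-E 2 figures are unweighted column means.
dalle2_coherence = round(mean(COHERENCE), 1)
dalle2_artifact = round(mean(ARTIFACT), 1)

coherence_gain = round(relative_gain(DALLE1_COHERENCE, dalle2_coherence))
artifact_gain = round(relative_gain(DALLE1_ARTIFACT, dalle2_artifact))

# Resolution: 4x along each side, which is 16x the total pixel count.
side_factor = 1024 // 256
pixel_factor = (1024 * 1024) // (256 * 256)

print(f"coherence: {dalle2_coherence} (+{coherence_gain}%)")
print(f"artifact:  {dalle2_artifact} (+{artifact_gain}%)")
print(f"resolution: {side_factor}x per side, {pixel_factor}x pixels")
```

Because the DALL-E 1 baselines are estimates, the percentages inherit that roughness; they indicate the size of the jump rather than a precise measurement.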
The 256x256 to 1024x1024 resolution jump alone changes what you can do with the outputs. DALL-E 1 images were thumbnails. DALL-E 2 images are usable.
My ikigai for the month
What I keep coming back to is the gap between the best and worst categories. A 4.67 average for simple objects and a 2.10 for text in images. Same model, wildly different capability depending on what you ask it to do.
I think people are going to talk about DALL-E 2 in very absolute terms ("it's amazing" or "it's not ready"). The data says something more specific: it's amazing at certain tasks and mediocre to bad at others. Knowing which is which matters if you're building anything on top of it.
I'll be running the same 200 prompts on every new image model that comes out this year. The spreadsheet has begun.
If you found this interesting, you might also like:
- Every AI benchmark from 2020, ranked by how much they actually tell you
- The GPT-3 API waitlist is 6 months long. Here's what the early data looks like.
- DALL-E's first images vs what people expected: a data comparison
- I counted every AI startup that raised money in Q1 2021. The numbers are strange.
- GPT-3 vs GPT-J: the first real open source challenger, in data
-- dataku