DALL-E 2 is out. I ran 200 prompts and measured the results.

I've been on the DALL-E 2 waitlist since the announcement, and when access finally landed in my inbox, I did what any reasonable person would do.

I generated 200 images, built a scoring rubric, and spent three days rating every single one.

My friends are concerned about me. I think that's fair.

The methodology

I designed 200 prompts across 10 categories, 20 prompts each. Every prompt was scored on three dimensions:

Coherence (1-5): Does the image make visual sense? No melted faces, no impossible geometry?
Prompt adherence (1-5): Did it actually generate what I asked for?
Artifact frequency (1-5, higher is cleaner): How clean is the output? 5 means no visible artifacts. This is the "Artifact score" column in the results table.
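Here is a minimal sketch of how the rubric rolls up into per-category numbers. It assumes the overall score is the plain, unweighted mean of the three dimensions, which is consistent with the category averages reported in this post. The `Score` class and `category_average` helper are hypothetical names for illustration, not the actual spreadsheet.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Score:
    """One rated image. Every dimension is an integer from 1 to 5, higher is better."""
    coherence: int
    adherence: int
    artifacts: int  # 5 means no visible artifacts

    def __post_init__(self):
        # Reject anything outside the 1-5 rubric range.
        for name in ("coherence", "adherence", "artifacts"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def overall(self) -> float:
        # Assumption: overall is the unweighted mean of the three dimensions.
        return mean((self.coherence, self.adherence, self.artifacts))


def category_average(scores: list[Score]) -> dict[str, float]:
    """Average each dimension across all rated images in one category."""
    return {
        "coherence": mean(s.coherence for s in scores),
        "adherence": mean(s.adherence for s in scores),
        "artifacts": mean(s.artifacts for s in scores),
        "overall": mean(s.overall for s in scores),
    }


# Made-up ratings for illustration only, not rows from the real data.
example = [Score(5, 5, 4), Score(4, 5, 5), Score(5, 4, 4)]
summary = category_average(example)
```

Because every image gets all three scores, averaging the per-image overall scores gives the same result as averaging the three dimension averages.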

I scored everything myself, which introduces bias. I know that. But consistency of scorer matters more than objectivity here, and I'm very consistent (my partner would say "obsessive" but I prefer "methodical").

All images were generated at 1024x1024 using DALL-E 2's default settings through the API.

The results, by category

| Category | Coherence (avg) | Prompt adherence (avg) | Artifact score (avg) | Overall (avg) |
|----------|-----------------|------------------------|----------------------|---------------|
| Simple objects | 4.7 | 4.8 | 4.5 | 4.67 |
| Animals | 4.5 | 4.6 | 4.3 | 4.47 |
| Landscapes | 4.6 | 4.4 | 4.4 | 4.47 |
| People (portraits) | 3.2 | 3.8 | 3.1 | 3.37 |
| People (full body) | 3.0 | 3.5 | 2.9 | 3.13 |
| Abstract concepts | 3.8 | 3.1 | 4.0 | 3.63 |
| Multi-object scenes | 3.4 | 3.0 | 3.5 | 3.30 |
| Text in images | 1.9 | 1.6 | 2.8 | 2.10 |
| Specific art styles | 4.3 | 4.1 | 4.2 | 4.20 |
| Spatial relationships | 2.8 | 2.4 | 3.6 | 2.93 |

Let me walk through the highlights because the averages hide the interesting stuff.

Where DALL-E 2 genuinely impresses

Simple objects and animals are stunning. "A red ceramic mug on a wooden table" comes back looking like a photograph. "A golden retriever running through autumn leaves" is beautiful enough to frame.

The art style category surprised me the most. "A mountain village in the style of Hokusai" produced something I'd honestly hang on my wall. "A cityscape as a 1950s travel poster" nailed the color palette and composition. OpenAI's research page showcases similar examples, but my own testing confirmed it. The model has internalized visual style categories remarkably well.

Landscapes are also strong. I got maybe 2 out of 20 that looked off. The rest were genuinely good.

Where it falls apart

People. DALL-E 2 still struggles with people, especially hands (the classic problem) and faces at any distance beyond portrait-close. I counted fingers in every image that included human hands. Of the 40 people prompts, 27 had incorrect finger counts. That's a 67.5% error rate on hands.
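Forty prompts is a small sample, so the 67.5% figure carries real uncertainty. The sketch below recomputes the rate and attaches a Wilson score interval to it. The interval is my addition, not part of the original scoring, and it assumes each prompt is an independent pass/fail trial, which is only roughly true.

```python
import math


def wilson_interval(failures: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if total <= 0 or not 0 <= failures <= total:
        raise ValueError("need total > 0 and 0 <= failures <= total")
    p = failures / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


bad_hands, people_prompts = 27, 40  # counts from the paragraph above
rate = bad_hands / people_prompts   # 27 / 40 = 0.675
low, high = wilson_interval(bad_hands, people_prompts)
print(f"error rate {rate:.1%}, plausible range {low:.0%} to {high:.0%}")
```

The interval is wide, running from just over half to roughly four in five, but even the low end means hands go wrong more often than not.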

But the real weakness is spatial relationships. "A cat sitting ON TOP OF a dog" frequently produced a cat next to a dog, or a strange merged animal. "A book to the LEFT of a lamp" got the spatial arrangement wrong more often than it got it right.

| Spatial prompt type | Correct placement rate |
|---------------------|------------------------|
| "X on top of Y" | 45% |
| "X next to Y" | 70% |
| "X behind Y" | 35% |
| "X inside Y" | 55% |
| "X to the left/right of Y" | 40% |

And text generation is basically broken. "A sign that reads OPEN" produced gibberish letters 85% of the time. The model clearly doesn't understand text as semantic content. It treats letters as visual patterns and gets them wrong almost always.

The training data question

DALL-E 2 was trained on image-text pairs, and OpenAI hasn't fully disclosed the dataset. But the model's strengths and weaknesses are readable as a training data fingerprint.

It's excellent at photographic styles, suggesting heavy representation of stock photography and natural images. It handles art styles well, which suggests plenty of captioned artwork in the training mix, the kind of material found in large web-scraped image-text collections (LAION is a public example of that kind of dataset, though OpenAI has not said it used it). It struggles with precise spatial relationships, which makes sense because image captions in training data rarely describe spatial layouts explicitly.

Compared to DALL-E 1

I didn't have systematic data from DALL-E 1 (it was much harder to access), but from the published samples and my limited prior testing, the jump is enormous.

| Dimension | DALL-E 1 (rough estimate) | DALL-E 2 (my data) | Improvement |
|-----------|---------------------------|--------------------|-------------|
| Overall coherence | ~2.5/5 | ~3.6/5 | +44% |
| Resolution | 256x256 | 1024x1024 | 4x per side (16x the pixels) |
| Artifact score (higher is cleaner) | ~2.0/5 | ~3.7/5 | +85% |
| Realistic style capability | Low | High | Qualitative leap |
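For anyone checking the arithmetic, here is a short sketch of where the comparison numbers come from. The DALL-E 2 figures are the averages of the ten category rows from the results table. The DALL-E 1 figures are the rough estimates above, not measurements, so the percentages are only as good as those guesses. `percent_improvement` is a hypothetical helper name.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Per-category averages, copied from the results table.
coherence = [4.7, 4.5, 4.6, 3.2, 3.0, 3.8, 3.4, 1.9, 4.3, 2.8]
artifacts = [4.5, 4.3, 4.4, 3.1, 2.9, 4.0, 3.5, 2.8, 4.2, 3.6]

dalle2_coherence = round(sum(coherence) / len(coherence), 1)  # 3.62 rounds to 3.6
dalle2_artifacts = round(sum(artifacts) / len(artifacts), 1)  # 3.73 rounds to 3.7

# DALL-E 1 baselines (2.5 and 2.0) are rough estimates, not measured scores.
coherence_gain = percent_improvement(2.5, dalle2_coherence)  # about +44%
artifact_gain = percent_improvement(2.0, dalle2_artifacts)   # about +85%

# Resolution: each side is 4x longer, so the pixel count is 16x larger.
side_ratio = 1024 // 256
pixel_ratio = (1024 * 1024) // (256 * 256)
```

Note that the resolution gain depends on what you count: 4x per side, 16x in total pixels.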

The 256x256 to 1024x1024 resolution jump alone changes what you can do with the outputs. DALL-E 1 images were thumbnails. DALL-E 2 images are usable.

My ikigai for the month

What I keep coming back to is the gap between the best and worst categories. A 4.67 average for simple objects and a 2.10 for text in images. Same model, wildly different capability depending on what you ask it to do.
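That gap is easy to pull out of the overall averages. This is a small sketch using the values from the results table. The 4.0 and 3.0 cutoffs for "strong" and "weak" are arbitrary thresholds I picked for illustration, not part of the rubric.

```python
# Overall averages by category, copied from the results table.
overall = {
    "Simple objects": 4.67,
    "Animals": 4.47,
    "Landscapes": 4.47,
    "People (portraits)": 3.37,
    "People (full body)": 3.13,
    "Abstract concepts": 3.63,
    "Multi-object scenes": 3.30,
    "Text in images": 2.10,
    "Specific art styles": 4.20,
    "Spatial relationships": 2.93,
}

# Rank categories from best to worst.
ranked = sorted(overall.items(), key=lambda item: item[1], reverse=True)
best, worst = ranked[0], ranked[-1]
gap = best[1] - worst[1]

# Assumption: 4.0 and 3.0 are illustrative cutoffs, not rubric definitions.
strong = [name for name, score in ranked if score >= 4.0]
weak = [name for name, score in ranked if score < 3.0]

print(f"best: {best[0]} ({best[1]:.2f}), worst: {worst[0]} ({worst[1]:.2f}), gap: {gap:.2f}")
```

With these cutoffs, four categories land in the strong group and two in the weak group, with the people, abstract, and multi-object categories in between.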

I think people are going to talk about DALL-E 2 in very absolute terms ("it's amazing" or "it's not ready"). The data says something more specific: it's amazing at certain tasks and mediocre to bad at others. Knowing which is which matters if you're building anything on top of it.

I'll be running the same 200 prompts on every new image model that comes out this year. The spreadsheet has begun.

-- dataku
