Midjourney v3 vs DALL-E 2: 100 prompts, head to head
Same 100 prompts, two models, blind rating by 5 people. Midjourney wins on 'aesthetic feel' 64% of the time. DALL-E 2 wins on 'prompt accuracy' 71% of the time. The data is fascinating.
This one took me three weeks. I regret nothing.
I generated the same 100 prompts on both Midjourney v3 and DALL-E 2, then had 5 people rate the outputs in a blind comparison. No rater knew which model produced which image. Each pair was shown side by side, randomly positioned left or right.
The results tell a story that's more interesting than "Model A is better than Model B."
The methodology
100 prompts across 10 categories (10 each):
- Simple objects, animals, landscapes, people (portraits), people (full body), abstract concepts, multi-object scenes, text in images, specific art styles, spatial relationships
5 raters: 2 professional designers, 1 photographer, 2 non-specialists. Mix of backgrounds was intentional. I wanted both expert and general-audience perspectives.
Rating dimensions (each rated 1-5):
- Aesthetic quality: Which image looks better, period?
- Prompt accuracy: Which image more faithfully represents the prompt?
- Technical quality: Which has fewer artifacts, better resolution, cleaner details?
Blind comparison: Raters saw paired images labeled "Image A" and "Image B" with no model attribution. Position (left/right) was randomized.
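For anyone who wants to set up a similar blind test, here's a rough sketch of the randomization step. It is not the exact script I used. The function name, the model labels, and the idea of keeping a separate answer key are all assumptions for illustration, and it assumes the A/B label is what gets shuffled per pair.

```python
import random

def build_blind_pairs(prompts, seed=0):
    """Return (sheet, key): what raters see, and the hidden answer key.

    Hypothetical helper: names and structure are illustrative only.
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    sheet, key = [], []
    for prompt_id, prompt in enumerate(prompts):
        models = ["midjourney_v3", "dalle_2"]
        rng.shuffle(models)  # random left/right placement for this pair
        left_model, right_model = models
        # Raters only ever see neutral labels, never model names.
        sheet.append({"prompt_id": prompt_id, "prompt": prompt,
                      "left": "Image A", "right": "Image B"})
        # The key stays with the organizer until rating is finished.
        key.append({"prompt_id": prompt_id,
                    "Image A": left_model, "Image B": right_model})
    return sheet, key

sheet, key = build_blind_pairs(["a red ball on a blue cube",
                                "a foggy mountain lake"])
```

Keeping the key in a separate structure makes it harder to leak model names into whatever the raters actually see.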
The headline numbers
| Dimension | Midjourney v3 wins | DALL-E 2 wins | Tie (within 0.5) |
|-----------|-------------------|--------------|-------------------|
| Aesthetic quality | 64% | 28% | 8% |
| Prompt accuracy | 22% | 71% | 7% |
| Technical quality | 38% | 51% | 11% |
Two completely different models with two completely different strengths.
Midjourney produces images that people find more beautiful. DALL-E 2 produces images that more accurately match what you asked for. And technical quality (artifacts, resolution) leans DALL-E 2, but it's closer than I expected.
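One way to turn 1-5 ratings into win/tie percentages like these is sketched below. Treat it as an assumption about the scoring, not a description of my spreadsheet: it averages each image's scores across raters and calls the pair a tie when the two means are within 0.5 of each other. The sample scores are made up for the example.

```python
from statistics import mean

TIE_THRESHOLD = 0.5  # matches the "Tie (within 0.5)" column

def pair_outcome(mj_scores, dalle_scores, threshold=TIE_THRESHOLD):
    """Classify one prompt pair from the raters' 1-5 scores (assumed scheme)."""
    diff = mean(mj_scores) - mean(dalle_scores)
    if abs(diff) <= threshold:
        return "tie"
    return "midjourney" if diff > 0 else "dalle2"

def win_rates(pairs):
    """Return the percentage of pairs won by each model, plus ties."""
    counts = {"midjourney": 0, "dalle2": 0, "tie": 0}
    for mj_scores, dalle_scores in pairs:
        counts[pair_outcome(mj_scores, dalle_scores)] += 1
    return {name: 100 * n / len(pairs) for name, n in counts.items()}

# Illustrative scores only (5 raters per image), not the real data.
pairs = [
    ([5, 4, 4, 5, 4], [3, 3, 4, 3, 3]),
    ([3, 3, 4, 3, 3], [3, 4, 3, 3, 3]),
    ([2, 3, 2, 3, 2], [4, 4, 5, 4, 4]),
    ([4, 4, 4, 4, 4], [3, 3, 3, 3, 3]),
]
print(win_rates(pairs))
```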
Category breakdown: aesthetic quality
Here's how the raters' aesthetic preferences broke down by category:
| Category | Midjourney v3 win rate | DALL-E 2 win rate |
|----------|----------------------|-------------------|
| Landscapes | 82% | 14% |
| Art styles | 78% | 18% |
| Abstract concepts | 74% | 20% |
| Animals | 68% | 26% |
| Simple objects | 52% | 40% |
| People (portraits) | 58% | 32% |
| People (full body) | 56% | 34% |
| Multi-object scenes | 60% | 32% |
| Spatial relationships | 48% | 42% |
| Text in images | 36% | 48% |
Midjourney dominates landscapes (82%) and art styles (78%). It has a distinctive visual "look" that raters consistently described as "painterly," "atmospheric," and "moody." Even the non-specialist raters picked up on this.
DALL-E 2 only wins aesthetics in one category: text in images. And that's not because DALL-E 2 is good at text (it's not). It's because Midjourney is even worse.
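If you want to check that claim yourself, a few lines of Python over the aesthetic win rates will do it. The numbers are copied from the category table. The "margin" (Midjourney win rate minus DALL-E 2 win rate) is my own shorthand for this sketch, not a rating dimension from the test.

```python
# (Midjourney v3 win %, DALL-E 2 win %) for aesthetic quality, per category
aesthetic = {
    "Landscapes": (82, 14),
    "Art styles": (78, 18),
    "Abstract concepts": (74, 20),
    "Animals": (68, 26),
    "Simple objects": (52, 40),
    "People (portraits)": (58, 32),
    "People (full body)": (56, 34),
    "Multi-object scenes": (60, 32),
    "Spatial relationships": (48, 42),
    "Text in images": (36, 48),
}

# Positive margin means Midjourney leads, negative means DALL-E 2 leads.
margins = {cat: mj - dalle for cat, (mj, dalle) in aesthetic.items()}
ranked = sorted(margins.items(), key=lambda item: item[1], reverse=True)
dalle_leads = [cat for cat, margin in ranked if margin < 0]

for cat, margin in ranked:
    print(f"{cat:<22} {margin:+d}")
print("DALL-E 2 leads in:", dalle_leads)
```

Sorting by margin also shows how close spatial relationships is: Midjourney leads there by only 6 points.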
Category breakdown: prompt accuracy
| Category | Midjourney v3 win rate | DALL-E 2 win rate |
|----------|----------------------|-------------------|
| Simple objects | 16% | 80% |
| Animals | 18% | 76% |
| Landscapes | 30% | 62% |
| People (portraits) | 22% | 68% |
| People (full body) | 20% | 72% |
| Abstract concepts | 36% | 52% |
| Multi-object scenes | 14% | 80% |
| Art styles | 28% | 62% |
| Spatial relationships | 18% | 74% |
| Text in images | 16% | 78% |
DALL-E 2 crushes prompt accuracy across the board. For simple objects and multi-object scenes, it wins 80% of the time. When you ask for "a red ball on top of a blue cube next to a green cylinder," DALL-E 2 is much more likely to give you exactly that.
Midjourney tends to take the prompt as a starting point and produce something adjacent. A rater wrote in their notes: "Midjourney's images are better pictures, but they're not always pictures of what I asked for."
I think that captures it perfectly.
The designer vs. non-specialist split
I found something interesting when I split the data by rater expertise.
| Rater type | Midjourney aesthetic win rate | DALL-E 2 prompt accuracy win rate |
|-----------|----------------------------|--------------------------------|
| Designers (n=2) | 72% | 74% |
| Photographer (n=1) | 68% | 68% |
| Non-specialists (n=2) | 56% | 70% |
The designers and photographer had a stronger preference for Midjourney's aesthetics than the non-specialists did. Trained visual professionals responded more strongly to Midjourney's composition and color choices. Non-specialists still preferred Midjourney aesthetically, but by a smaller margin.
Prompt accuracy ratings were more consistent across all raters. Everyone agrees on whether the image matches the prompt, regardless of background.
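A split like this is easy to reproduce once each rater's verdict is tagged with their background. The sketch below assumes one record per rater per pair, holding a group label and the model that rater preferred. That record format, the function name, and the sample votes are hypothetical, not the layout of my actual spreadsheet.

```python
from collections import defaultdict

def win_rate_by_group(votes):
    """votes: iterable of (rater_group, winner) tuples.

    winner is one of "midjourney", "dalle2", or "tie" (assumed labels).
    Returns per-group percentages, rounded to one decimal place.
    """
    tally = defaultdict(lambda: {"midjourney": 0, "dalle2": 0, "tie": 0})
    for group, winner in votes:
        tally[group][winner] += 1
    rates = {}
    for group, counts in tally.items():
        total = sum(counts.values())
        rates[group] = {name: round(100 * n / total, 1)
                        for name, n in counts.items()}
    return rates

# Made-up votes for illustration only.
votes = [
    ("designer", "midjourney"), ("designer", "midjourney"),
    ("designer", "dalle2"), ("designer", "midjourney"),
    ("non_specialist", "midjourney"), ("non_specialist", "dalle2"),
    ("non_specialist", "tie"), ("non_specialist", "midjourney"),
]
print(win_rate_by_group(votes))
```

With only one or two raters per group, differences like these are suggestive at best, so read the split as a hint, not a finding.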
The "style" question
Midjourney has a default aesthetic. If you don't specify a style, you get something that looks like a concept art piece or digital painting. It's beautiful, but it's a house style. Some of my prompts deliberately asked for "photorealistic" outputs. Here's how that played out:
| Style requested | Midjourney aesthetic win rate | DALL-E 2 aesthetic win rate |
|----------------|----------------------------|---------------------------|
| No style specified | 72% | 20% |
| "Photorealistic" | 34% | 58% |
| "Oil painting style" | 84% | 12% |
| "Minimalist" | 44% | 46% |
When you ask for photorealism, DALL-E 2 wins. Its outputs look more like photographs. Midjourney's "photorealistic" outputs still have that painterly quality.
When you ask for artistic styles, Midjourney destroys DALL-E 2. The 84% to 12% split on oil painting prompts was the most lopsided result in the entire test.
What this means for choosing a model
The data points to a clear recommendation that depends entirely on what you're doing.
If you're creating concept art, illustrations, mood boards, or anything where visual beauty matters more than literal accuracy: Midjourney.
If you're doing product mockups, specific image generation for content, or anything where "match the prompt exactly" matters: DALL-E 2.
If you're doing both, you need both. That's not a satisfying answer, but it's what the data says.
One more thing
Both models are bad at text in images. Both models struggle with spatial relationships. Both models produce better images at higher resolutions than anything available six months ago.
The competition between them is making both better. Midjourney v3 is noticeably improved over v2 in prompt accuracy (still behind DALL-E 2, but the gap narrowed). And I've heard DALL-E 2 is getting updates that improve its aesthetic quality.
I'll rerun this comparison when Midjourney v4 drops. The spreadsheet is ready.
-- dataku