DALL-E's first images vs what people expected: a data comparison
OpenAI's DALL-E paper dropped in January and I've been collecting reaction data. The gap between what researchers expected and what it actually produces is measurable.
When OpenAI published the DALL-E paper in January 2021, the AI research community collectively lost its mind. A model that generates images from text descriptions? The examples in the blog post looked incredible. An avocado armchair. A baby daikon radish in a tutu walking a dog.
But I'm a data person. Excitement is fine, but I wanted to measure the gap between the curated examples OpenAI showed and what DALL-E actually produces on arbitrary prompts. So I spent three months collecting data.
What I tracked
I couldn't access DALL-E directly (nobody outside OpenAI can, as of April 2021). So I did the next best thing: I gathered two datasets.
Dataset 1: Expectations. I surveyed 42 people in ML-adjacent communities (Discord, Reddit, Twitter DMs) with the question: "Based on the DALL-E blog post examples, how well do you think it would handle these 20 prompt categories?" They rated each category 1-5 for expected quality.
Dataset 2: Actual outputs. I collected every DALL-E output that OpenAI researchers shared publicly (blog post, Twitter, talks), plus outputs from the DALL-E paper on arXiv, totaling 312 images across the same 20 categories.
Then two other raters and I scored each actual output on the same 1-5 scale, and I averaged the three scores. The gap (actual minus expected) tells you where people's intuitions about DALL-E are wrong.
The expectation gap
| Prompt category | Expected quality (1-5) | Actual quality (1-5) | Gap |
|----------------|----------------------|---------------------|-----|
| Simple objects ("a red cube") | 4.8 | 4.6 | -0.2 |
| Animals in settings | 4.5 | 4.2 | -0.3 |
| Fantastical creatures | 4.6 | 3.8 | -0.8 |
| Food items and plating | 4.2 | 4.5 | +0.3 |
| Architecture and buildings | 4.3 | 3.4 | -0.9 |
| Human faces (realistic) | 4.1 | 2.8 | -1.3 |
| Human hands and fingers | 3.2 | 1.9 | -1.3 |
| Text/letters in images | 3.8 | 2.1 | -1.7 |
| Multi-object compositions | 4.0 | 3.3 | -0.7 |
| Specific art styles ("in the style of Monet") | 4.4 | 4.1 | -0.3 |
| Product design mockups | 3.9 | 3.6 | -0.3 |
| Scientific diagrams | 3.1 | 2.2 | -0.9 |
| Maps and spatial layouts | 2.8 | 1.7 | -1.1 |
| Emotional scenes | 3.7 | 3.0 | -0.7 |
| Counting (specific number of objects) | 3.5 | 2.3 | -1.2 |
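The Gap column is just actual minus expected, so it is easy to recompute and rank. Here is a minimal sketch in Python using the numbers from the table; the names `ROWS` and `gaps` are mine for illustration, not part of any released dataset.

```python
# (category, expected, actual) copied from the table above.
ROWS = [
    ("Simple objects", 4.8, 4.6),
    ("Animals in settings", 4.5, 4.2),
    ("Fantastical creatures", 4.6, 3.8),
    ("Food items and plating", 4.2, 4.5),
    ("Architecture and buildings", 4.3, 3.4),
    ("Human faces (realistic)", 4.1, 2.8),
    ("Human hands and fingers", 3.2, 1.9),
    ("Text/letters in images", 3.8, 2.1),
    ("Multi-object compositions", 4.0, 3.3),
    ("Specific art styles", 4.4, 4.1),
    ("Product design mockups", 3.9, 3.6),
    ("Scientific diagrams", 3.1, 2.2),
    ("Maps and spatial layouts", 2.8, 1.7),
    ("Emotional scenes", 3.7, 3.0),
    ("Counting", 3.5, 2.3),
]


def gaps(rows):
    """Return (category, gap) pairs with gap = actual - expected,
    sorted so the biggest shortfall comes first."""
    pairs = [(name, round(actual - expected, 1)) for name, expected, actual in rows]
    return sorted(pairs, key=lambda pair: pair[1])


ranked = gaps(ROWS)
for name, gap in ranked:
    print(f"{name:30s} {gap:+.1f}")
```

Sorting this way puts text/letters at the top of the list and food at the bottom, which matches the two surprises discussed next.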
The two biggest surprises for me:
People overestimate DALL-E's ability with text and hands. The gap on "text/letters in images" is -1.7 points. People expected a 3.8 and the reality is around 2.1. DALL-E produces plausible-looking but garbled text. Letters exist, but they rarely spell actual words. Hands have the same problem: they exist, but counting fingers is not DALL-E's strong suit.
People underestimate food. This one I did NOT expect. DALL-E is genuinely better at generating food images than people anticipated. A +0.3 gap, with actual quality at 4.5. Something about the training data (probably lots of food photography on the internet) gave DALL-E an edge here.
The CLIP paper connection
DALL-E doesn't work alone. It uses CLIP (Contrastive Language-Image Pre-training) to rank its generated images by how well they match the text prompt. CLIP essentially acts as a filter: DALL-E generates many candidates, CLIP picks the best ones.
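The generate-then-rerank idea can be sketched in a few lines. This is a toy version and not OpenAI's code: the vectors below are hand-made stand-ins for what CLIP's text and image encoders would produce, and `cosine` and `rerank` are names I made up. It only illustrates the shape of the filter, assuming candidates are ranked by similarity between a text embedding and each image embedding.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rerank(text_vec, candidates, top_k):
    """candidates is a list of (label, image_vec) pairs.
    Return the top_k pairs most similar to text_vec, best first."""
    scored = sorted(candidates, key=lambda c: cosine(text_vec, c[1]), reverse=True)
    return scored[:top_k]


# Hypothetical embeddings, invented purely for this example.
prompt_vec = [1.0, 0.0, 1.0]
candidates = [
    ("sample_a", [0.9, 0.1, 0.8]),
    ("sample_b", [0.1, 1.0, 0.0]),
    ("sample_c", [0.5, 0.5, 0.5]),
    ("sample_d", [1.0, 0.0, 0.9]),
]

best = rerank(prompt_vec, candidates, top_k=2)
print([label for label, _ in best])
```

With a real setup, the candidate list would hold hundreds of generated images and only the top few would ever be shown.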
This matters because the examples OpenAI shows are CLIP-filtered. They're the best of multiple generations. When you see "an armchair in the shape of an avocado," you're seeing the best result out of potentially 512 candidates.
The paper mentions generating 512 samples and using CLIP to rerank. That's a huge filter. The median quality of DALL-E's raw output is significantly lower than the cherry-picked examples suggest. I estimate (based on the paper's own qualitative comparisons) that CLIP filtering improves perceived quality by roughly 0.8-1.2 points on my 5-point scale.
So when people see the blog post and expect "this is what DALL-E produces," they're really seeing "this is what DALL-E produces after generating 512 options and picking the best one." Different thing entirely.
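To get a feel for how much a best-of-512 filter can shift what you see, here is a small simulation. Everything in it is an assumption for illustration: raw sample quality is modeled as a normal distribution (mean 3.0, standard deviation 0.5, clipped to the 1-5 scale), and the reranker is assumed to pick the truly best sample, which real CLIP scoring would not do perfectly. It says nothing about DALL-E's actual quality distribution. It only shows that the best of many draws sits well above the median draw.

```python
import random
import statistics


def simulate(n_candidates=512, n_prompts=200, mean=3.0, sd=0.5, seed=0):
    """Compare the median sample with the best sample, averaged over prompts.
    All parameters are illustrative assumptions, not measured values."""
    rng = random.Random(seed)
    medians, bests = [], []
    for _ in range(n_prompts):
        # Draw one quality score per candidate and clip to the 1-5 scale.
        scores = [min(5.0, max(1.0, rng.gauss(mean, sd))) for _ in range(n_candidates)]
        medians.append(statistics.median(scores))
        bests.append(max(scores))
    return statistics.mean(medians), statistics.mean(bests)


typical, filtered = simulate()
print(f"typical raw sample: {typical:.2f}")
print(f"best of 512:        {filtered:.2f}")
```

Changing `n_candidates` to something small like 8 shrinks the difference, which is the point: the more candidates you filter, the less the showcased image tells you about a typical one.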
What the category data reveals
The categories where DALL-E performs closest to expectations (gap near zero or positive) share something in common: they're well-represented in internet image datasets. Simple objects, animals, food, common art styles. The internet has millions of these images.
The categories where DALL-E falls furthest short mostly involve precise spatial reasoning or symbolic understanding. Text, counting, maps, hands, architecture. These require understanding abstract rules that a statistical model trained on image-text pairs hasn't internalized.
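One way to sanity-check this split is to average the gap within each group. A rough sketch using the table's gap values and the two groupings named in the paragraphs above; the group names and the `mean_gap` helper are mine.

```python
# Gap values (actual minus expected) copied from the table.
GAPS = {
    "Simple objects": -0.2,
    "Animals in settings": -0.3,
    "Food items and plating": 0.3,
    "Specific art styles": -0.3,
    "Text/letters in images": -1.7,
    "Counting": -1.2,
    "Maps and spatial layouts": -1.1,
    "Human hands and fingers": -1.3,
    "Architecture and buildings": -0.9,
}

# Groupings as described in the text: common internet imagery
# versus spatial or symbolic tasks.
WELL_REPRESENTED = [
    "Simple objects",
    "Animals in settings",
    "Food items and plating",
    "Specific art styles",
]
SPATIAL_SYMBOLIC = [
    "Text/letters in images",
    "Counting",
    "Maps and spatial layouts",
    "Human hands and fingers",
    "Architecture and buildings",
]


def mean_gap(names):
    """Average gap across the named categories."""
    return sum(GAPS[name] for name in names) / len(names)


print(f"well represented: {mean_gap(WELL_REPRESENTED):+.2f}")
print(f"spatial/symbolic: {mean_gap(SPATIAL_SYMBOLIC):+.2f}")
```

The first group averages about -0.1 and the second about -1.2, so the split is not subtle, at least on these ratings.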
This pattern, strong on pattern matching and weak on symbolic reasoning, is the same story as GPT-3 for text. Large models excel at tasks that look like their training data and struggle with tasks that require genuine abstraction.
My prediction (for the record)
I'm writing this down so I can check myself later: I think DALL-E's successor (DALL-E 2? whatever OpenAI calls it) will push the food and animal categories to near-perfect, improve art styles significantly, but STILL struggle with text in images and counting specific numbers of objects. Those problems are architectural, not data-related.
I'll revisit this prediction when the next version ships. Kanpeki na yosoku (perfect prediction) or not, at least the data will be there to check.
-- dataku