Data Stories | April 18, 2022 | 6 min read

I tracked AI image generation quality over 6 months. The improvement rate is scary.

I've been running the same 50 prompts on each new model as it is released. The quality jump from November 2021 to April 2022 is the steepest improvement curve I've ever plotted.

Six months ago I started something I thought would be a slow-burning side project.

I picked 50 image generation prompts spanning 10 categories (objects, animals, people, landscapes, abstract, text, spatial, style transfer, multi-object, photorealistic) and decided to run them on every new image generation model the moment it became available.

The idea was simple: same prompts, different models, track quality over time. I expected gradual improvement. Maybe a few percentage points per quarter.

What actually happened made me rethink my assumptions about the speed of progress in this field.

The scoring method

Each image gets rated on the same 1-5 scale I use for all my image generation testing:

  • Coherence: Does it look like a real image? No artifacts, no melted geometry.
  • Prompt fidelity: Did the model generate what was asked?
  • Aesthetic quality: Is it visually appealing, independent of accuracy?

The overall score is the average of all three. I score everything myself for consistency.
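As a rough sketch of how that scoring could be tabulated (the `ImageRating` class and `model_average` helper are hypothetical names, and averaging per-image scores into a per-model number is an assumption, not something spelled out above):

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ImageRating:
    """One image's ratings on the three 1-5 criteria (hypothetical structure)."""
    coherence: float
    prompt_fidelity: float
    aesthetic_quality: float

    def __post_init__(self):
        # Reject anything outside the 1-5 scale.
        for name in ("coherence", "prompt_fidelity", "aesthetic_quality"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def overall(self) -> float:
        # Overall score for one image: plain average of the three criteria.
        return (self.coherence + self.prompt_fidelity + self.aesthetic_quality) / 3


def model_average(ratings):
    # Assumption: a model's headline number is the mean of its per-image
    # overall scores, rounded to one decimal like the tables in this post.
    return round(mean(r.overall for r in ratings), 1)


# Made-up ratings for illustration only, not real data from the tracker.
sample = [ImageRating(4, 3, 5), ImageRating(2, 2, 2)]
print(model_average(sample))
```

Giving the three criteria equal weight keeps the metric easy to reproduce by hand, at the cost of treating an accurate but ugly image the same as a pretty but wrong one.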

The timeline

| Model | Date tested | Overall avg (1-5) | Best category | Worst category |
|-------|-------------|-------------------|---------------|----------------|
| DALL-E (original) | Nov 2021* | 2.3 | Simple objects (3.1) | Text in images (1.2) |
| DALL-E 2 (early access) | Jan 2022 | 3.6 | Simple objects (4.7) | Text in images (1.9) |
| Midjourney v2 | Feb 2022 | 3.2 | Art styles (4.3) | Spatial relations (2.0) |
| Midjourney v3 | Mar 2022 | 3.7 | Art styles (4.6) | Text in images (1.8) |
| Google Imagen (paper) | Apr 2022 | N/A** | N/A | N/A |

*DALL-E 1 scores are from limited access, fewer than 50 prompts completed.

**Imagen isn't publicly available. Google published samples in their research paper, and the FID scores look very strong, but I can't run my own test suite on it.

The improvement rate

From DALL-E 1 to DALL-E 2 in roughly two months: a 56.5% improvement in overall quality score. From DALL-E 1 to Midjourney v3 in about four months: a 60.9% improvement.

| Period | Improvement in overall score |
|--------|------------------------------|
| DALL-E 1 to DALL-E 2 (2 months) | +1.3 points (+56.5%) |
| DALL-E 2 to Midjourney v3 (2 months) | +0.1 points (+2.8%) |
| DALL-E 1 to Midjourney v3 (4 months) | +1.4 points (+60.9%) |
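The percentages are relative change against the earlier score. A minimal sketch of that arithmetic, using the overall averages from the timeline (the `improvement` helper is an illustrative name):

```python
# Overall averages taken from the timeline table.
overall = {"DALL-E 1": 2.3, "DALL-E 2": 3.6, "Midjourney v3": 3.7}


def improvement(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    delta = after - before
    return round(delta, 1), round(delta / before * 100, 1)


periods = [
    ("DALL-E 1", "DALL-E 2"),
    ("DALL-E 2", "Midjourney v3"),
    ("DALL-E 1", "Midjourney v3"),
]

for start, end in periods:
    points, pct = improvement(overall[start], overall[end])
    print(f"{start} to {end}: +{points} points (+{pct}%)")
```

Because the baseline sits in the denominator, the same +1.3 points would be a much smaller percentage if it started from 3.6 instead of 2.3.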

The first jump is the story. A 56.5% quality improvement in two months is unlike anything I've tracked in any other AI domain. For comparison, language model benchmark improvements tend to be 5-15% per model generation, and those generations are spaced 6-12 months apart.

Category-level improvement

The gains aren't uniform. Some categories improved dramatically. Others barely moved.

| Category | DALL-E 1 (Nov 2021) | Best current (Apr 2022) | Improvement |
|----------|---------------------|-------------------------|-------------|
| Simple objects | 3.1 | 4.7 (DALL-E 2) | +51.6% |
| Animals | 2.8 | 4.5 (DALL-E 2) | +60.7% |
| Landscapes | 2.6 | 4.6 (DALL-E 2) | +76.9% |
| Art styles | 2.5 | 4.6 (MJ v3) | +84.0% |
| Photorealistic | 1.8 | 4.2 (DALL-E 2) | +133.3% |
| People | 1.9 | 3.2 (DALL-E 2) | +68.4% |
| Multi-object | 2.1 | 3.4 (DALL-E 2) | +61.9% |
| Abstract | 2.4 | 3.8 (DALL-E 2) | +58.3% |
| Spatial relations | 1.7 | 2.8 (DALL-E 2) | +64.7% |
| Text in images | 1.2 | 1.9 (DALL-E 2) | +58.3% |

Photorealistic images improved the most: +133.3%. Six months ago, no image generation model could produce something that looked like a photograph. Now DALL-E 2 produces photorealistic outputs that require a second look.

Art style transfer had the next biggest jump at +84.0%. Midjourney v3 is particularly strong here. Its outputs in "oil painting" and "watercolor" styles are genuinely beautiful.

Text in images improved the least in absolute terms: +0.7 points, from terrible (1.2) to slightly less terrible (1.9). The models still can't spell.
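Relative and absolute gains rank the categories differently, which is easy to check with a few lines. This is a sketch over the category table's numbers; the variable and function names are illustrative:

```python
# (DALL-E 1 score, best current score) per category, from the category table.
category_scores = {
    "Simple objects": (3.1, 4.7),
    "Animals": (2.8, 4.5),
    "Landscapes": (2.6, 4.6),
    "Art styles": (2.5, 4.6),
    "Photorealistic": (1.8, 4.2),
    "People": (1.9, 3.2),
    "Multi-object": (2.1, 3.4),
    "Abstract": (2.4, 3.8),
    "Spatial relations": (1.7, 2.8),
    "Text in images": (1.2, 1.9),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    delta = after - before
    return delta, delta / before * 100


rows = [(name, *gains(b, a)) for name, (b, a) in category_scores.items()]

# Largest relative gain first; smallest absolute gain first.
by_relative = sorted(rows, key=lambda r: r[2], reverse=True)
by_absolute = sorted(rows, key=lambda r: r[1])

print("Biggest relative gain:", by_relative[0][0])
print("Smallest relative gain:", by_relative[-1][0])
print("Smallest absolute gain:", by_absolute[0][0])
```

By percentage, the smallest gain is actually simple objects (+51.6%), because it started from the highest baseline. Text in images is last only when measured in points.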

Why this matters beyond the numbers

I've been tracking AI benchmark improvements across language, vision, and code generation for about two years. I have spreadsheets (yes, plural) for each domain.

Image generation is improving faster than any other AI capability I measure. And it's not close.

| AI domain | Typical improvement per 6 months (2020-2022) |
|-----------|----------------------------------------------|
| Language (text generation quality) | 8-15% |
| Code generation (HumanEval pass rate) | 10-20% |
| Image classification (top-1 accuracy) | 2-5% |
| Image generation (my quality metric) | 50-60% |

The gap is enormous. Language models are improving steadily. Image generation is on a completely different curve.

Part of this is that image generation started from a lower baseline. DALL-E 1 was impressive for its time but objectively produced rough outputs. There was more room to improve. But even accounting for that, the pace of improvement is striking.

What's driving it

Three things, based on what I'm reading in the papers.

First, diffusion models. The shift from GANs and autoregressive approaches to diffusion-based architectures (DALL-E 2 uses a diffusion model, and so does Imagen) unlocked a big quality jump. Diffusion models produce cleaner, more coherent images at higher resolutions.

Second, CLIP. OpenAI's CLIP model (connecting text and images in a shared embedding space) is the backbone of DALL-E 2's prompt understanding. Better text-image alignment means better prompt fidelity.

Third, training data scale. LAION-400M, an open dataset of roughly 400 million image-text pairs, was released in August 2021, and its successor LAION-5B, with 5.85 billion pairs, followed in March 2022. The sheer volume of training data available for image models has increased dramatically.

The scary part

I called the improvement rate "scary" in the title and I mean it. Not in a fearmongering way. In a "the trajectory implies things about what these models will look like in 12 months" way.

If the current rate holds (and it won't, it'll slow down eventually, curves always do), by the end of 2022 my overall quality score should be above 4.5/5 for the best models. That's the range where generated images become hard to distinguish from real ones across most categories.
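One way to sanity-check that projection is a naive linear extrapolation. The sketch below is an illustration under loud assumptions, not the method behind the estimate above: it assumes linear growth, a hard cap at 5.0, and nine months between the Midjourney v3 test (Mar 2022) and December 2022.

```python
SCALE_MAX = 5.0          # top of the 1-5 scale
CURRENT_SCORE = 3.7      # Midjourney v3, tested Mar 2022
MONTHS_TO_DEC_2022 = 9   # assumption: Mar 2022 to Dec 2022


def project(current, points_per_month, months, cap=SCALE_MAX):
    """Linear extrapolation, clipped at the top of the scale."""
    return round(min(cap, current + points_per_month * months), 2)


# Rate over the whole tracked window: DALL-E 1 (2.3) to Midjourney v3 (3.7), 4 months.
full_window_rate = (3.7 - 2.3) / 4
# Rate over the most recent step only: DALL-E 2 (3.6) to Midjourney v3 (3.7), 2 months.
recent_rate = (3.7 - 3.6) / 2

print(project(CURRENT_SCORE, full_window_rate, MONTHS_TO_DEC_2022))
print(project(CURRENT_SCORE, recent_rate, MONTHS_TO_DEC_2022))
```

Under these assumptions, the full-window rate hits the ceiling of the scale well before December, while the most recent two-month rate lands short of 4.5. Which window counts as "the current rate" matters a lot.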

I don't think we'll get there by December. But I didn't think we'd be where we are now back in November.

I'll update this tracker after every new model release. The data tells its own story.



-- dataku
