Groq's LPU just served me 800 tokens per second. The inference speed data.
Groq's custom chip hit 800 tokens/second on Mixtral 8x7B. I measured latency across 100 requests and compared the results to 5 other inference providers. Groq is 10.7x faster than the median of those five. Speed changes what's possible.
I thought my internet was broken.
I sent a prompt to Groq's API and the response appeared instantly. Not "fast." Instantly. Like the text was already there and someone just unveiled it.
Then I looked at the token count. 247 tokens generated in 0.31 seconds. That's 797 tokens per second.
I had to run the test again. And again. The third time I started logging.
The raw speed data
I sent 100 identical requests to Groq's API running Mixtral 8x7B-Instruct. Each request: a 200-token prompt asking for a 250-token response.
| Metric | Value |
|--------|-------|
| Median output tokens/second | 812 |
| p10 output tokens/second | 743 |
| p90 output tokens/second | 876 |
| Time to first token (median) | 18ms |
| Time to first token (p99) | 42ms |
| Total requests | 100 |
| Errors | 0 |
Source: My measurements, February 18-19, 2024, Groq API, Mixtral 8x7B-Instruct.
812 tokens per second at the median. Time to first token: 18 milliseconds. For reference, a human blink takes about 300 milliseconds. Groq starts generating before you can blink.
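For anyone who wants to reproduce the summary rows, here is a minimal sketch of how per-request logs could be turned into a median, p10, and p90. It assumes each request was logged as an (output tokens, generation seconds) pair; the function name `summarize_throughput` and the choice of the "inclusive" percentile method are illustrative assumptions, not a record of the exact script behind the table.

```python
from statistics import median, quantiles


def summarize_throughput(samples):
    """Summarize output speed for a list of (output_tokens, seconds) pairs.

    Needs at least two samples, because percentiles are interpolated.
    """
    rates = sorted(tokens / seconds for tokens, seconds in samples)
    # quantiles(n=10) returns the nine cut points p10, p20, ..., p90.
    deciles = quantiles(rates, n=10, method="inclusive")
    return {
        "median": median(rates),
        "p10": deciles[0],
        "p90": deciles[-1],
    }


# The single request from the intro: 247 tokens in 0.31 seconds.
first_rate = 247 / 0.31
print(round(first_rate))  # 797
```

Different percentile methods give slightly different p10/p90 values on 100 samples, so small disagreements with the table would not be surprising.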
How this compares to everyone else
I ran the same test across 6 inference providers, all serving Mixtral 8x7B:
| Provider | Median tokens/sec | Time to first token (median) | Error rate | vs Groq |
|----------|-------------------|------------------------------|-----------|---------|
| Groq (LPU) | 812 | 18ms | 0% | 1.0x |
| Fireworks AI | 142 | 87ms | 0% | 5.7x slower |
| Together AI | 98 | 124ms | 1% | 8.3x slower |
| Anyscale | 76 | 156ms | 0% | 10.7x slower |
| Perplexity pplx-api | 68 | 178ms | 2% | 11.9x slower |
| Self-hosted (A100, vLLM) | 55 | 43ms | 0% | 14.8x slower |
Sources: My measurements, 100 requests each, February 2024. All running Mixtral 8x7B-Instruct. Self-hosted on Lambda Labs A100 80GB.
The median across the five non-Groq providers is 76 tokens/second (the mean is about 88). Groq is 10.7x faster than that median. Against my self-hosted A100 setup (the one I was so proud of last month), Groq is 14.8x faster.
Fireworks AI is the closest competitor at 142 tokens/second, which is actually quite impressive for GPU-based inference. But it's still 5.7x slower than Groq.
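The comparison arithmetic is easy to check. The sketch below recomputes the "vs Groq" ratios plus the median and mean of the other five providers, using only the median tokens/sec figures from the table; the variable names are my own.

```python
from statistics import mean, median

# Median tokens/sec per provider, copied from the comparison table.
speeds = {
    "Groq (LPU)": 812,
    "Fireworks AI": 142,
    "Together AI": 98,
    "Anyscale": 76,
    "Perplexity pplx-api": 68,
    "Self-hosted (A100, vLLM)": 55,
}

groq = speeds["Groq (LPU)"]
others = {name: s for name, s in speeds.items() if name != "Groq (LPU)"}
other_speeds = list(others.values())

# How many times slower each provider is than Groq.
slowdown = {name: round(groq / s, 1) for name, s in others.items()}
others_median = median(other_speeds)
others_mean = mean(other_speeds)

for name, factor in slowdown.items():
    print(f"{name}: {factor}x slower")
print("median of others:", others_median, "->", round(groq / others_median, 1), "x")
print("mean of others:", round(others_mean, 1), "->", round(groq / others_mean, 1), "x")
```

With five providers, the median is simply the middle value (Anyscale at 76), while the mean gets pulled up by Fireworks AI.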
What is the LPU?
Groq built a custom chip called the Language Processing Unit (LPU). It's not a GPU. The key difference: GPUs were designed for matrix multiplication in graphics and were adapted for AI. Groq's LPU was designed from the ground up for sequential token generation.
The architecture differences that matter:
| Feature | GPU (NVIDIA A100) | LPU (Groq) |
|---------|-------------------|-------------|
| Primary design purpose | Parallel matrix math | Sequential token generation |
| Memory bandwidth | 2 TB/s (HBM2e) | 80 TB/s (SRAM) |
| On-chip memory | 80GB HBM | ~230MB SRAM |
| Bottleneck | Memory bandwidth | Chip-to-chip networking |
| Ideal workload | Training + batch inference | Low-latency inference |
Sources: NVIDIA A100 specs, Groq blog, HotChips presentations.
The key insight: LLM inference is memory-bandwidth-bound, not compute-bound. For each generated token, you need to read the entire model's weights from memory. Groq's approach puts everything in fast SRAM instead of slower HBM. The trade-off is capacity (230MB vs 80GB), which means you need more chips for larger models. But the speed per token is dramatically higher.
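A back-of-envelope version of that argument: if every generated token requires streaming the relevant weights from memory once, single-stream speed is capped at bandwidth divided by bytes read per token. The sketch below is an estimate under loudly stated assumptions, not a model of either chip. The parameter counts (about 47B total and about 13B active per token for Mixtral 8x7B) and the 2-bytes-per-weight precision are my assumptions, and the chip count covers weights only, so it says nothing about how Groq actually deploys the model.

```python
import math


def max_tokens_per_second(bandwidth_bytes_per_s, bytes_read_per_token):
    """Upper bound on single-stream decode speed when memory bandwidth is the limit."""
    return bandwidth_bytes_per_s / bytes_read_per_token


def chips_needed(model_bytes, sram_bytes_per_chip):
    """Minimum number of chips whose SRAM can hold all the weights (weights only)."""
    return math.ceil(model_bytes / sram_bytes_per_chip)


# ASSUMPTIONS (not measured here): Mixtral 8x7B has ~47B parameters in total,
# ~13B of them are active for any one token, and weights take 2 bytes each (fp16).
TOTAL_PARAMS = 47e9
ACTIVE_PARAMS = 13e9
BYTES_PER_PARAM = 2

A100_BANDWIDTH = 2e12  # 2 TB/s, from the architecture table
LPU_SRAM = 230e6       # ~230 MB per chip, from the architecture table

a100_bound = max_tokens_per_second(A100_BANDWIDTH, ACTIVE_PARAMS * BYTES_PER_PARAM)
lpu_chips = chips_needed(TOTAL_PARAMS * BYTES_PER_PARAM, LPU_SRAM)

print(round(a100_bound))  # 77
print(lpu_chips)          # 409
```

Under these assumptions, a single A100 tops out around 77 tokens/second for one stream, which is at least in the same neighborhood as the 55 I measured, and fitting the weights into 230MB chips takes hundreds of them. That is the capacity trade-off in concrete terms.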
Does speed actually matter?
This is the question I keep coming back to. Who cares if tokens arrive in 18ms vs 178ms? Users can't read that fast anyway.
But speed isn't just about reading speed. Speed changes what's architecturally possible.
| Use case | Required tokens/sec | Possible before Groq? | Possible with Groq? |
|----------|--------------------|-----------------------|---------------------|
| Chatbot (user facing) | 30-60 | Yes (most providers) | Yes |
| Real-time translation | 100-200 | Barely (top providers) | Yes |
| Voice assistant (speech-to-text-to-LLM-to-speech) | 300+ | No | Yes |
| Multi-agent systems (5+ sequential LLM calls) | 500+ | No | Yes |
| Game NPC dialogue (60fps budget) | 800+ | No | Barely |
At 800 tokens/second, you can do five sequential LLM calls in the time it used to take for one. That enables agent architectures where an LLM plans, executes, checks its work, revises, and responds, all within a user-acceptable latency window.
That's not just faster. That's a different category of application.
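The "five calls in the time of one" claim can be checked with a rough latency model: time to first token plus output tokens divided by generation speed. This sketch assumes every call in the chain produces 250 tokens (the size used in my tests) and ignores network overhead and any work between calls; the function names are illustrative.

```python
def call_seconds(output_tokens, tokens_per_sec, ttft_ms):
    """Approximate wall time for one call: time to first token plus generation time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_sec


def chain_seconds(n_calls, output_tokens, tokens_per_sec, ttft_ms):
    """Approximate wall time for n sequential calls of the same size."""
    return n_calls * call_seconds(output_tokens, tokens_per_sec, ttft_ms)


# Medians from the comparison table; 250 output tokens per call is an assumption.
groq_chain = chain_seconds(5, 250, 812, 18)   # plan, execute, check, revise, respond
fireworks_single = call_seconds(250, 142, 87)
together_single = call_seconds(250, 98, 124)

print(f"5 Groq calls:     {groq_chain:.2f}s")        # 1.63s
print(f"1 Fireworks call: {fireworks_single:.2f}s")  # 1.85s
print(f"1 Together call:  {together_single:.2f}s")   # 2.68s
```

Even against the fastest GPU-based provider in my test, a five-step Groq chain finishes before a single call returns.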
The pricing
Here's where I expected the catch. Speed usually costs money:
| Provider | Model | $/M input tokens | $/M output tokens | Speed (tok/s) | Cost per speed unit |
|----------|-------|-----------------|-------------------|---------------|-------------------|
| Groq | Mixtral 8x7B | $0.27 | $0.27 | 812 | $0.00033 |
| Together AI | Mixtral 8x7B | $0.60 | $0.60 | 98 | $0.00612 |
| Fireworks AI | Mixtral 8x7B | $0.40 | $0.40 | 142 | $0.00282 |
| Self-hosted (A100) | Mixtral 8x7B | ~$0.16 | ~$0.16 | 55 | $0.00291 |
Sources: Provider pricing pages, February 2024. "Cost per speed unit" = price per M tokens / speed.
Wait. Groq is not only the fastest, it's also the cheapest hosted API per million tokens? $0.27 vs Together AI's $0.60? Only my self-hosted A100 estimate (~$0.16) comes in lower.
I checked this three times. Yes, it's accurate as of February 2024. Groq is subsidizing prices to gain market share, obviously. But right now, compared to Together AI, you get about 8x the speed at less than half the cost.
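The "cost per speed unit" column is just price per million tokens divided by median tokens/sec, so it can be recomputed from the two tables. A small sketch, with the Together AI comparison worked out alongside it (the dictionary layout and names are my own):

```python
# (price in $ per million tokens, median tokens/sec), from the pricing table.
providers = {
    "Groq": (0.27, 812),
    "Together AI": (0.60, 98),
    "Fireworks AI": (0.40, 142),
    "Self-hosted (A100)": (0.16, 55),  # the ~$0.16 figure is an estimate
}


def cost_per_speed_unit(price_per_m_tokens, tokens_per_sec):
    """Price per million tokens divided by speed; lower is better."""
    return price_per_m_tokens / tokens_per_sec


units = {
    name: round(cost_per_speed_unit(price, speed), 5)
    for name, (price, speed) in providers.items()
}

groq_price, groq_speed = providers["Groq"]
together_price, together_speed = providers["Together AI"]
speed_ratio = round(groq_speed / together_speed, 1)  # 8.3
price_ratio = round(groq_price / together_price, 2)  # 0.45

for name, value in units.items():
    print(f"{name}: ${value:.5f}")
print(f"Groq vs Together AI: {speed_ratio}x the speed at {price_ratio:.0%} of the price")
```

The metric is a blunt instrument, since it treats a dollar and a token per second as interchangeable, but it makes the gap easy to see at a glance.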
The caveats
It's not all perfect:
- Model selection is limited. Groq currently supports Mixtral 8x7B and Llama 2 70B. Not GPT-4-class models. Not custom fine-tunes.
- Rate limits. During high traffic, I hit rate limits on 8% of requests (not reflected in the table above, which used off-peak times).
- No fine-tuning. You can't bring your own model. It's their chip, their models.
- Capacity concerns. Groq is pre-revenue and burning through hardware investment. The current pricing is clearly a loss leader.
What I think this means
Inference speed and cost are about to decouple from each other. Until now, faster meant more expensive. Groq is showing that purpose-built hardware can be both faster AND cheaper for specific workloads.
NVIDIA isn't going anywhere. GPUs are still the only practical option for training and for running models that need fine-tuning. But for pure inference on popular models, specialized chips have a real case.
My spreadsheet now has a "speed" column alongside cost and quality. I think all three will matter equally by the end of 2024.
800 tokens per second. My API call finishes before my brain finishes forming the thought that prompted it. That's a strange place to be, and I like it.
-- dataku