I tested 10 local LLM runtimes. Ollama vs LM Studio vs llama.cpp vs...
Local inference has gotten shockingly good. I tested 10 runtimes on the same hardware (M3 Max, 64GB). Ollama wins on ease of use. llama.cpp wins on raw speed. The performance gap between local and cloud is narrowing.
Running LLMs on your own hardware went from "technically possible" to "genuinely useful" in about 12 months. I tested every major local runtime I could find.
Ten runtimes. Same hardware. Same model. Let me show you the numbers.
The test setup
| Spec | Value |
|------|-------|
| Hardware | MacBook Pro M3 Max, 40 GPU cores |
| RAM | 64GB unified memory |
| Model | Llama 3.1 8B (Q4_K_M quantized for all) |
| Test | Generate 500 tokens from a standard prompt, 10 runs each |
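For anyone who wants to reproduce a test like this, here is a minimal sketch of a timing harness. It assumes each runtime can be wrapped as an iterable that yields tokens as they are generated; `measure_run`, `benchmark`, and `fake_stream` are hypothetical names for illustration, not the harness behind the numbers in this post.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator


def measure_run(stream: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time one generation: time to first token and end-to-end tokens/sec."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = clock()  # first token has just arrived
        count += 1
    end = clock()
    if count == 0:
        raise ValueError("stream produced no tokens")
    return {
        "tokens": count,
        "ttft_ms": (first_token_at - start) * 1000,
        # End-to-end rate, so it includes the wait for the first token.
        "tokens_per_sec": count / (end - start),
    }


def benchmark(make_stream: Callable[[], Iterable[str]], runs: int = 10) -> dict:
    """Average throughput and TTFT over several fresh runs."""
    results = [measure_run(make_stream()) for _ in range(runs)]
    return {
        "avg_tokens_per_sec": statistics.mean([r["tokens_per_sec"] for r in results]),
        "avg_ttft_ms": statistics.mean([r["ttft_ms"] for r in results]),
    }


def fake_stream(n_tokens: int = 500, delay_s: float = 0.0) -> Iterator[str]:
    """Stand-in for a real runtime's streaming output (hypothetical)."""
    for i in range(n_tokens):
        if delay_s:
            time.sleep(delay_s)
        yield f"tok{i}"
```

To use it, swap `fake_stream` for a function that starts a real streaming generation, for example `benchmark(lambda: fake_stream(500, 0.001), runs=10)`. Whether throughput should include the first-token wait is a convention choice, so compare numbers only when they were measured the same way.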
Results
| Runtime | Avg tokens/sec | Time to first token | Memory usage | Setup difficulty |
|---------|---------------|-------------------|-------------|-----------------|
| llama.cpp (CLI) | 62.4 t/s | 180ms | 5.2 GB | Hard (compile from source) |
| MLX (Apple) | 58.7 t/s | 95ms | 4.8 GB | Medium (Python package) |
| Ollama | 51.3 t/s | 240ms | 5.6 GB | Easy (one command) |
| LM Studio | 48.9 t/s | 310ms | 5.8 GB | Very easy (GUI app) |
| LocalAI | 44.2 t/s | 350ms | 6.1 GB | Medium (Docker) |
| vLLM | 56.1 t/s | 125ms | 5.4 GB | Hard (server setup) |
| Jan | 46.8 t/s | 290ms | 5.9 GB | Easy (GUI app) |
| GPT4All | 38.4 t/s | 420ms | 6.3 GB | Very easy (installer) |
| Text Generation WebUI (Oobabooga) | 42.1 t/s | 380ms | 6.0 GB | Medium (Python + deps) |
| Kobold.cpp | 49.2 t/s | 260ms | 5.5 GB | Medium (compile or binary) |
Source: my own benchmarks, 10 runs per runtime, M3 Max 64GB, June 2025.
Speed ranking
llama.cpp wins raw speed at 62.4 tokens/sec. MLX is close behind at 58.7 t/s, which makes sense since Apple designed MLX specifically for Apple Silicon.
Ollama comes in fourth at 51.3 t/s, behind vLLM (56.1 t/s). It trails llama.cpp by about 18%, which I attribute to the overhead of Ollama's server architecture and REST API layer.
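The size of that gap depends on which runtime you treat as the baseline. A quick sketch of the arithmetic, using the two throughput figures from the results table (the function names are just for illustration):

```python
def pct_slower(fast: float, slow: float) -> float:
    """How much slower `slow` is, as a percentage of the faster runtime."""
    return (fast - slow) / fast * 100


def pct_faster(fast: float, slow: float) -> float:
    """How much faster `fast` is, as a percentage of the slower runtime."""
    return (fast - slow) / slow * 100


llama_cpp, ollama = 62.4, 51.3  # tokens/sec from the results table

print(f"Ollama is {pct_slower(llama_cpp, ollama):.1f}% slower than llama.cpp")  # 17.8%
print(f"llama.cpp is {pct_faster(llama_cpp, ollama):.1f}% faster than Ollama")  # 21.6%
```

Same two measurements, two different percentages: about 18% when llama.cpp is the baseline, about 22% when Ollama is.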
GPT4All is slowest at 38.4 t/s. It prioritizes accessibility over performance.
Time to first token (latency)
| Tier | Runtimes | TTFT range |
|------|----------|-----------|
| Fast (<150ms) | MLX, vLLM | 95-125ms |
| Medium (150-300ms) | llama.cpp, Ollama, Kobold.cpp, Jan | 180-290ms |
| Slow (>300ms) | LM Studio, LocalAI, Oobabooga, GPT4All | 310-420ms |
MLX has the best time-to-first-token at 95ms. Apple's runtime is optimized for their own chips, and it shows.
For interactive use (chatting), TTFT matters more than throughput. A 300ms+ wait before the first word feels sluggish. Under 150ms feels instant.
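The tiers above are just two thresholds applied to the measured TTFT values. A small sketch that reproduces the grouping from the results table (`ttft_tier` is a hypothetical helper name):

```python
def ttft_tier(ttft_ms: float) -> str:
    """Bucket a time-to-first-token value using the 150ms and 300ms cutoffs."""
    if ttft_ms < 150:
        return "Fast"
    if ttft_ms <= 300:
        return "Medium"
    return "Slow"


# TTFT in milliseconds, copied from the results table.
measured = {
    "llama.cpp": 180, "MLX": 95, "Ollama": 240, "LM Studio": 310,
    "LocalAI": 350, "vLLM": 125, "Jan": 290, "GPT4All": 420,
    "Oobabooga": 380, "Kobold.cpp": 260,
}

tiers: dict[str, list[str]] = {}
for runtime, ms in measured.items():
    tiers.setdefault(ttft_tier(ms), []).append(runtime)

for tier in ("Fast", "Medium", "Slow"):
    print(tier, sorted(tiers[tier]))
```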
Ease of use ranking
| Tier | Runtime | Installation |
|------|---------|-------------|
| One command | Ollama | `brew install ollama && ollama pull llama3.1` |
| Download and click | LM Studio, Jan, GPT4All | GUI installer, model browser |
| Some setup | MLX, LocalAI, Oobabooga, Kobold.cpp | Python/Docker, some config |
| Developer only | llama.cpp, vLLM | Compile from source, server config |
Ollama is the sweet spot. One command to install, one command to run a model. No Python environment, no Docker, no compilation.
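Once a model is pulled, Ollama's local REST API makes it easy to check throughput yourself. The sketch below assumes the server is running at its default address (`localhost:11434`) and uses the `eval_count` and `eval_duration` fields that `/api/generate` returns in a non-streaming response. The helper names are hypothetical, and this is not necessarily how the numbers in this post were measured.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def tokens_per_sec(response: dict) -> float:
    """Generation speed from Ollama's own counters.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "llama3.1", max_tokens: int = 500) -> dict:
    """Send one non-streaming generation request and return the parsed JSON."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the server running, `tokens_per_sec(generate("Explain unified memory in two sentences."))` returns the generation rate for that request. Because the figure comes from the server's own counters, it excludes HTTP overhead and prompt processing.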
LM Studio is the best for people who don't use terminals. Download the app, browse models, click "download," click "chat." Done.
Larger models (Llama 3.1 70B Q4)
I also tested runtimes that can handle 70B models on 64GB:
| Runtime | Tokens/sec (70B Q4) | Usable? |
|---------|---------------------|---------|
| Ollama | 8.2 t/s | Yes, but slow |
| llama.cpp | 10.1 t/s | Yes, slow but functional |
| MLX | 9.4 t/s | Yes |
| LM Studio | 7.8 t/s | Barely |
70B models at Q4 quantization need about 40GB of RAM. On a 64GB machine, that leaves little headroom. Throughput drops to 8-10 t/s, which is about 10x slower than cloud APIs.
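The 40GB figure follows from a back-of-the-envelope calculation. The sketch below assumes an effective 4.5 bits per weight for a Q4-style quantization. That is an assumed round number (real Q4 variants differ), and it counts only the weights, not the KV cache or runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


ASSUMED_BITS_PER_WEIGHT = 4.5  # assumption for Q4-style quantization
TOTAL_RAM_GB = 64

weights_70b = weight_memory_gb(70, ASSUMED_BITS_PER_WEIGHT)
weights_8b = weight_memory_gb(8, ASSUMED_BITS_PER_WEIGHT)

print(f"70B weights: ~{weights_70b:.1f} GB")  # ~39.4 GB
print(f"8B weights:  ~{weights_8b:.1f} GB")   # ~4.5 GB
print(f"Left over on 64GB with 70B loaded: ~{TOTAL_RAM_GB - weights_70b:.1f} GB")
```

The leftover memory has to cover the operating system, the KV cache, and every other running app, which is why a 70B model feels tight on this machine.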
At 8 t/s, you're waiting ~60 seconds for a 500-token response. Usable for non-interactive work (batch processing, code generation). Not great for chat.
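The wait is simple division: response length over throughput. A short sketch using throughput figures from this post and ignoring time to first token, which is small next to these totals:

```python
def response_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Seconds to generate `tokens` at a steady rate, ignoring first-token latency."""
    return tokens / tokens_per_sec


RESPONSE_TOKENS = 500

# Throughput figures (tokens/sec) taken from the tables in this post.
rates = {
    "70B at 8 t/s": 8.0,
    "70B, llama.cpp": 10.1,
    "8B, Ollama": 51.3,
    "Cloud API": 90.0,
}

for label, rate in rates.items():
    print(f"{label:>16}: {response_seconds(RESPONSE_TOKENS, rate):5.1f} s")
```

At 8 t/s that works out to 62.5 seconds for 500 tokens, against under 10 seconds for the 8B model on the same machine.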
Local vs cloud comparison
| Metric | Local (Ollama, 8B) | Cloud (Anthropic API) | Local vs cloud |
|--------|--------------------|--------------------|-------|
| Throughput | 51 t/s | 90 t/s | 0.57x |
| Latency (TTFT) | 240ms | 245ms | ~1x |
| Cost per 1M tokens | ~$0.00 (electricity only) | $3.00 (Sonnet) | ~0x |
| Quality (8B vs frontier) | Mid | Frontier | Lower |
| Privacy | Complete | Trust provider | Better |
The speed gap between local and cloud has closed dramatically. At 51 t/s, local inference on Apple Silicon is about 57% of cloud API speed. A year ago, it was closer to 20%.
The cost is effectively zero. My electricity costs for running local models: roughly $0.01 per hour.
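To put "effectively zero" on the same per-token footing as the cloud price, here is a rough sketch. It assumes the machine generates continuously at the 51 t/s measured for Ollama, and it takes the $0.01 per hour and $3.00 per 1M tokens figures from this post at face value. Hardware cost is not included.

```python
def local_cost_per_million(tokens_per_sec: float, dollars_per_hour: float) -> float:
    """Electricity cost to generate 1M tokens at a steady rate."""
    hours = 1_000_000 / tokens_per_sec / 3600
    return hours * dollars_per_hour


LOCAL_TPS = 51.0            # Ollama, 8B, from the comparison table
ELECTRICITY_PER_HOUR = 0.01  # rough figure from this post
CLOUD_PER_MILLION = 3.00     # cloud price from the comparison table

local = local_cost_per_million(LOCAL_TPS, ELECTRICITY_PER_HOUR)
print(f"Local: ~${local:.3f} per 1M tokens")  # about $0.054
print(f"Cloud is ~{CLOUD_PER_MILLION / local:.0f}x more per token")
```

Generating a million tokens takes about five and a half hours at this rate, so the electricity bill for it is a few cents.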
My recommendation
| Use case | Best runtime |
|----------|-------------|
| General use, easy setup | Ollama |
| Maximum speed | llama.cpp (compiled for your hardware) |
| Apple Silicon optimization | MLX |
| Non-technical users | LM Studio |
| Server deployment | vLLM |
| Experimentation, many models | LM Studio or Ollama |
Ollama is my default recommendation. It's fast enough, dead simple, and has the best model library. I use it daily.
For speed-critical work, llama.cpp compiled with Metal support on Apple Silicon is 22% faster than Ollama. That gap matters if you're batch-processing thousands of prompts.
My local inference setup has gone from "fun experiment" to "daily tool" in six months. The fact that I can run a 70B model on a laptop, offline, with zero API costs, still feels slightly unreal.
-- dataku