Pricing Watch | February 10, 2025 | 5 min read

The real cost of AI agents: I tracked token usage for 50 agentic tasks

AI agents sound cheap per token. But they loop. A lot. I measured the total token consumption for 50 real agent tasks across Claude, GPT-4o, and Gemini. The average task used 35K to 47K tokens, depending on the model. Some hit 200K+.

AI agents are the hot topic of early 2025. Everyone's building them. Very few people are talking about what they actually cost to run.

I tracked token consumption for 50 real agentic tasks across three models. The numbers surprised me.

The experiment

I built identical agent setups using LangChain for three models: Claude 3.5 Sonnet (Anthropic), GPT-4o (OpenAI), and Gemini 2.0 Flash (Google). Each agent had access to the same 5 tools: web search, code execution, file read/write, calculator, and a database query tool.

50 tasks. Each run three times (once per model). 150 total runs. I logged every token.
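A minimal sketch of what per-run token logging could look like, assuming each model call reports input and output token counts. `TokenLog`, `record`, and the phase labels are hypothetical names for illustration, not the harness used for these runs.

```python
from dataclasses import dataclass, field


@dataclass
class TokenLog:
    """Accumulates token usage for one agent run (hypothetical helper)."""
    task_id: str
    model: str
    calls: list = field(default_factory=list)

    def record(self, phase: str, input_tokens: int, output_tokens: int) -> None:
        # One entry per model call; `phase` is a free-form label such as "tool_call".
        self.calls.append((phase, input_tokens, output_tokens))

    @property
    def total_tokens(self) -> int:
        # Sum input and output tokens over every recorded call.
        return sum(i + o for _, i, o in self.calls)


# Made-up numbers, only to show the bookkeeping.
log = TokenLog(task_id="task-01", model="example-model")
log.record("initial_prompt", input_tokens=1200, output_tokens=150)
log.record("tool_call", input_tokens=2400, output_tokens=300)
print(log.total_tokens)  # 4050
```

Keeping input and output counts separate matters later, because the two are billed at different rates.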

Token consumption by task type

| Task type (10 tasks each) | Avg tokens (Claude) | Avg tokens (GPT-4o) | Avg tokens (Gemini Flash) |
|---------------------------|--------------------|--------------------|--------------------------|
| Research and summarize | 38,200 | 42,600 | 31,400 |
| Code generation + testing | 67,300 | 58,900 | 44,200 |
| Data analysis pipeline | 52,100 | 61,400 | 39,800 |
| Multi-step reasoning | 41,800 | 39,200 | 35,600 |
| Content creation with research | 29,400 | 33,100 | 25,300 |
| Overall average | 45,760 | 47,040 | 35,260 |

The average agent task consumed about 46K tokens on Claude, 47K on GPT-4o, and 35K on Gemini Flash.

For context, a single chatbot response is typically 200-500 tokens. An agent task uses 70-230x more tokens than a simple Q&A interaction.

The variance is the real story

Averages hide the distribution. Here's what it actually looks like:

| Token range | Percentage of tasks |
|-------------|-------------------|
| Under 10K | 8% |
| 10K to 25K | 22% |
| 25K to 50K | 34% |
| 50K to 100K | 24% |
| 100K to 200K | 10% |
| Over 200K | 2% |
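A breakdown like the one above can be produced from a list of per-task token totals. This is a rough sketch: the bucket edges come from the table, while `bucket_shares` and the sample totals are made up for illustration and are not the measured data.

```python
from bisect import bisect_right
from collections import Counter

# Upper edges of the buckets in the table; the last bucket is open-ended.
EDGES = [10_000, 25_000, 50_000, 100_000, 200_000]
LABELS = ["Under 10K", "10K to 25K", "25K to 50K",
          "50K to 100K", "100K to 200K", "Over 200K"]


def bucket_shares(token_counts):
    """Return {label: percentage of runs} for a list of per-task token totals."""
    counts = Counter(LABELS[bisect_right(EDGES, t)] for t in token_counts)
    n = len(token_counts)
    return {label: 100 * counts[label] / n for label in LABELS}


# Made-up totals for illustration only.
sample = [8_000, 18_000, 30_000, 45_000, 60_000,
          218_000, 12_000, 40_000, 75_000, 150_000]
shares = bucket_shares(sample)
for label, pct in shares.items():
    print(f"{label}: {pct:.0f}%")
```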

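One code generation task on GPT-4o consumed 218,000 tokens. The agent got stuck in a debugging loop, trying the same fix 6 times before switching approaches. At GPT-4o rates ($2.50 per M input, $10.00 per M output), that single task cost somewhere between $0.55 and $2.18, depending on the input/output split.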

The tasks under 10K tokens? Those were cases where the agent completed on the first try with no tool-use loops. Basically got lucky.
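The debugging loop described above, where the agent retried the same fix 6 times, is the kind of thing a simple guard can catch. Here is a minimal sketch, assuming each attempted action can be represented as a string. `RepeatGuard` and its `max_repeats` parameter are hypothetical and not part of LangChain or any agent framework.

```python
from collections import Counter


class RepeatGuard:
    """Flags an agent that keeps issuing the same action (hypothetical helper)."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def allow(self, action: str) -> bool:
        # Normalize lightly so whitespace or case differences don't hide a repeat.
        key = " ".join(action.lower().split())
        self.seen[key] += 1
        return self.seen[key] <= self.max_repeats


guard = RepeatGuard(max_repeats=2)
attempts = ["apply patch A"] * 6
allowed = [guard.allow(a) for a in attempts]
print(allowed)  # [True, True, False, False, False, False]
```

When `allow` returns `False`, the surrounding agent loop could force a different approach or stop the task instead of paying for attempts three through six.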

What this actually costs

| Model | Input price (per 1M tokens) | Output price (per 1M tokens) | Avg agent task cost | Monthly cost (100 tasks/day) |
|-------|-------------------|--------------------|--------------------|---------------------------|
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.52 | $1,560 |
| GPT-4o | $2.50 | $10.00 | $0.39 | $1,170 |
| Gemini 2.0 Flash | $0.075 | $0.30 | $0.013 | $39 |
| GPT-4o mini | $0.15 | $0.60 | $0.025 | $75 |

Sources: Anthropic, OpenAI, Google pricing pages.

At 100 agent tasks per day on Claude 3.5 Sonnet, you're looking at $1,560 per month. On Gemini 2.0 Flash, it's $39. A 40x cost difference for the same workflow.
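The arithmetic behind these figures can be sketched as follows. The monthly numbers assume 30 days at 100 tasks per day, which is what the table implies. The per-task function needs an input/output split, which isn't reported here, so the 30,000/10,000 split in the example is made up.

```python
def task_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one task, given token counts and per-million-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def monthly_cost(cost_per_task, tasks_per_day=100, days=30):
    """Dollar cost per month at a fixed daily task volume."""
    return cost_per_task * tasks_per_day * days


# Hypothetical split: 30,000 input + 10,000 output tokens at $3.00 / $15.00 per M.
example = task_cost(30_000, 10_000, 3.00, 15.00)
print(round(example, 2))  # 0.24

# Monthly totals from the per-task averages in the table.
print(round(monthly_cost(0.52), 2))   # 1560.0
print(round(monthly_cost(0.013), 2))  # 39.0
```

Because output tokens cost four to five times as much as input tokens at these rates, two tasks with the same total token count can differ a lot in price.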

The question is whether the cheaper model completes the tasks as reliably. In my tests, Claude had a 78% completion rate, GPT-4o had 74%, and Gemini Flash had 61%. Cheaper models loop more and fail more often, which raises their cost per successful task above the sticker price.

| Model | Completion rate | Avg cost per successful task |
|-------|----------------|------------------------------|
| Claude 3.5 Sonnet | 78% | $0.67 |
| GPT-4o | 74% | $0.53 |
| Gemini 2.0 Flash | 61% | $0.021 |

Even adjusted for failures, Gemini Flash is dramatically cheaper. But if your agent is customer-facing and failures are expensive, the calculus changes.
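The failure-adjusted figures in the table are consistent with dividing the average task cost by the completion rate. A short sketch of that calculation, assuming a failed run costs about the same as a successful one and is simply wasted spend:

```python
def cost_per_success(avg_cost_per_task, completion_rate):
    """Average spend needed to get one successful task."""
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    return avg_cost_per_task / completion_rate


# (average task cost in dollars, completion rate) from the tables above.
results = {
    "Claude 3.5 Sonnet": (0.52, 0.78),
    "GPT-4o": (0.39, 0.74),
    "Gemini 2.0 Flash": (0.013, 0.61),
}

adjusted = {model: cost_per_success(cost, rate) for model, (cost, rate) in results.items()}
for model, value in adjusted.items():
    print(f"{model}: ${value:.3f}")
```

This ignores any downstream cost of a failure, such as a human stepping in or a customer seeing an error, which is where the calculus changes for customer-facing agents.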

The loop problem

The biggest cost driver isn't the initial prompt or the final answer. It's the loops.

I categorized every token by phase:

| Phase | % of total tokens |
|-------|------------------|
| Initial prompt + context | 8% |
| Tool calls (first pass) | 22% |
| Tool results processing | 18% |
| Retry loops | 31% |
| Final synthesis | 12% |
| Error handling | 9% |

Retry loops account for 31% of all tokens. Nearly a third of the cost is the agent trying things that don't work.

The implication: improving agent reliability by just 20% could cut costs by more than 20%, because you're reducing the most expensive phase (loops) and also reducing failures that waste everything.
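One way to see how the two effects compound is a toy model, with assumptions that go beyond the data: "20% more reliable" is read as both a 20% cut in retry-loop tokens and a 20% relative increase in completion rate, and cost is treated as proportional to tokens. The function name and this reading are illustrative, not a measured result.

```python
def relative_cost_per_success(retry_share, retry_cut, rate_before, rate_after):
    """Cost per successful task after the change, relative to before (1.0 = unchanged)."""
    # Fewer retry tokens shrink the cost of every run...
    tokens_factor = 1 - retry_share * retry_cut
    # ...and a higher completion rate spreads that cost over more successes.
    return tokens_factor * rate_before / rate_after


# Retry loops are 31% of tokens; GPT-4o's 74% completion rate is the starting point.
ratio = relative_cost_per_success(
    retry_share=0.31, retry_cut=0.20, rate_before=0.74, rate_after=0.74 * 1.20
)
saving = 1 - ratio
print(f"{saving:.1%}")  # 21.8%
```

Under these assumptions the saving comes out a little above 20%, mostly from the completion-rate term, with the retry-token cut adding the rest.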

My takeaway

Agents are not chatbots with extra steps. The token economics are fundamentally different. A chatbot scales linearly: more users, proportionally more cost. An agent scales unpredictably: some tasks are cheap, and the most expensive ones use 20x the tokens of the cheapest (under 10K versus over 200K in my runs).

If you're building agent-based products, budget for 50K tokens per task (not 500), plan for 30% of your spend going to retry loops, and consider using cheaper models for the loop phases while routing to premium models for the final synthesis.
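The budgeting and routing advice above could be wired up roughly like this. It is a sketch under stated assumptions: the model names are placeholders rather than real API identifiers, the phase labels are invented, and `pick_model` and `TaskBudget` are hypothetical helpers.

```python
from dataclasses import dataclass

# Placeholder model names, not real API identifiers.
CHEAP_MODEL = "cheap-model"
PREMIUM_MODEL = "premium-model"

# Phases worth paying premium rates for; everything else goes to the cheap model.
PREMIUM_PHASES = {"final_synthesis"}


def pick_model(phase: str) -> str:
    return PREMIUM_MODEL if phase in PREMIUM_PHASES else CHEAP_MODEL


@dataclass
class TaskBudget:
    limit: int = 50_000  # per-task token budget suggested above
    used: int = 0

    def charge(self, tokens: int) -> bool:
        """Add tokens to the running total; return False once the budget is exceeded."""
        self.used += tokens
        return self.used <= self.limit


# Made-up phase sizes, only to show the control flow.
budget = TaskBudget()
plan = [("tool_call", 20_000), ("retry_loop", 25_000), ("final_synthesis", 8_000)]
for phase, tokens in plan:
    within = budget.charge(tokens)
    print(phase, pick_model(phase), "ok" if within else "over budget")
```

What to do when the budget is exceeded (abort, escalate to a human, or continue with a warning) is a product decision the sketch leaves open.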

My spreadsheet for agent costs has 150 rows now and I keep staring at the variance column. The standard deviation is larger than the mean. In any other domain, that would make me deeply uncomfortable. In AI agents, it's apparently normal.
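The "standard deviation larger than the mean" check is a coefficient of variation above 1. A quick sketch with Python's `statistics` module, using made-up totals (not the 150 measured rows) where one runaway task dominates:

```python
from statistics import mean, stdev

# Made-up per-task token totals with a single runaway task.
totals = [8_000, 12_000, 20_000, 30_000, 218_000]

avg = mean(totals)
spread = stdev(totals)  # sample standard deviation
cv = spread / avg       # coefficient of variation

print(f"mean={avg:,.0f} stdev={spread:,.0f} cv={cv:.2f}")
print("stdev exceeds mean" if cv > 1 else "stdev below mean")
```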

Today's ikigai: helping people understand that "per token" pricing and "per task" pricing are very different conversations.


-- dataku
