← Back to blog

Everything I Tried to Make My AI Coding Agent Cheaper, and What Actually Worked

June 22, 2026

by Brian Case

A measured look at Claude Code token costs, failed optimizations, and simple routing and retrieval changes that cut agent spend without hurting tests.

Like a lot of you, I've been shipping a lot of tickets with AI, but I was getting tired of getting hassled about session limits and weekly limits. I wanted to stretch my token budget farther. Apparently, I'm not the only one. Lately there have been wild stories about orgs blowing six figures on tokens in a single month.

So I decided it was time to get efficient. I knew I had to measure, not guess. As it turns out, almost every obvious idea I had turned out to be wrong. Clever isn't always effective. Over a long weekend I learned the difference.

The problem: how can I implement plans more efficiently?

I don't vibe cobe. I create plans (aka Specs), every single time. Plans are crucial, because otherwise you get what you don't want fast. Not really helpful.

I regularly generate plans with Bridge, but it doesn't handle what comes next. Implementation happens in Claude Code (or Cursor, Github Copilot, etc). How do I make implementating the plans more token efficient? That's what I needed to find out.

How I measured without lying to myself

Single absolute numbers lie.

The identical input, meaning the same prompt, same plan, and same model, burned almost 30% more tokens on one run than on the very next one. A huge swing on nothing useful, because agent runs are nondeterministic. Same ticket, same intent, different path through the maze.

So I stopped trusting one-off totals. I only trusted paired, within-run A/B deltas: A versus B back to back, repeated, then compare the difference instead of worshiping the level.

Tokens came from OTLP telemetry, which I treated as the source of truth, not the CLI summary line. Correctness was judged by whether the tests passed, not by a brittle output-equivalence gate. If the agent solved the ticket and the tests passed, it counted. If it produced a very similar-looking diff and broke the suite, it did not get a participation trophy.

The experiments, from worst to best

1. Plan slicing

The idea was simple: do not dump the whole plan on the agent. Feed it in slices so each turn carries less context. Easy win right?

Instead, the instructions to go fetch the next slice taught the agent to retrieve constantly. It over-fetched, re-fetched, grabbed material it already had, and generally behaved like a raccoon in a filing cabinet.

Per-turn context grew by about 40% instead of shrinking. My first clever idea was toast.

Verdict: backfired, no benefit.

2. Phasing

Next I tried chopping the plan into phases. Each phase would start with a fresh, short context. Surely a clean transcript beats one giant tail, right?

On raw tokens, it was basically a wash. On cost, it was about 25% worse.

The trap is that cost is class-weighted. Cache-reads are the cheapest tokens you can buy. Output and cache-creation are the expensive ones. Every fresh phase has to re-prime: re-read the codebase, rebuild cache, and reconstruct enough local understanding to act. I had traded cheap reads for expensive re-priming.

Worse, quality dropped. The scoped phase-2 agent could not see the whole picture and became a regression blind spot. It left 28 pre-existing tests failing that a single continuous session had noticed and fixed.

That was a nice moment. I paid more to learn less and break more. Very enterprise.

Verdict: more expensive and lower quality.

3. The fancy model routing

This was my favorite idea, which made it hurt when it failed.

The plan was to orchestrate a symphony of models inside a single plan: route mechanical steps to cheap models like Haiku or Sonnet, send hard steps to Opus, and use a coordinator to stitch it all together. Pay premium rates only for the hard thinking. Elegant. Architectural. Probably the sort of thing I would have put on a diagram with too many arrows.

Two problems killed it.

First, an interactive agent session's model is fixed at launch. You cannot swap models mid-session, so the whole different-model-per-step premise was dead on arrival in the CLI.

Second, even where you can delegate, you run into the review-of-delegated-diff trap. To delegate safely, you have to review what came back. Opus reading a delegated diff carefully enough to trust it costs about as much as Opus writing it itself, so you pay for the work twice. The review is the expensive half.

Measured on a realistic ticket, this was 46% more expensive. Even in a constructed best case, it was still 3.7% over.

The dumb version of this idea, two doors down, turned out to be the biggest win. Humbling, and frankly rude.

Verdict: more expensive.

4. Headroom

I heard about 'Headroom', a third-party compression proxy, with some huge claims behind it. The pitch was basically 90% context savings. I love a big number, especially when it tells me I do not have to think very hard, so I plugged it in and measured.

The context cost increased.

The advertised compression never materialized for iterative coding. The proxy could not safely drop the parts of the transcript the agent actually needed next turn. That is the annoying thing about coding agents: the old stuff is often still relevant, and the agent is not always great at predicting which old thing it will need three turns from now.

The lesson is older than AI: when a tool promises 90%, test it before you believe it.

Verdict: net-negative; the claim did not survive contact with a measurement.

5. Serena

Now we're getting into the nicer stuff. The experiments that worked.

Serena is an MCP server that does LSP and AST symbol retrieval. Instead of slurping an entire file, the agent can ask for the specific symbol it needs.

I dropped it in, and it worked out of the box. Across 4 of 4 paired reps, it cut tokens by 18% and cost by 16.7%. 18% may not sound amazing, but 18%, easily achievable for free, across an entire organization, forever? Yeah, I'll take that.

The consistency is what really sold me. A single lucky run is noise. Four paired wins in a row starts to look like a lever.

There was one nuance I learned the hard way: do not prompt-steer it. When I added instructions telling the agent how to use Serena, it over-retrieved and cost went up 13.8%. Same failure mode as plan slicing. I tried to be helpful and accidentally handed the agent a shovel.

Install it, say nothing, let it do its job.

Verdict: a real, banked win.

6. The dumb model routing

After the fancy per-step routing flopped, I tried the obvious thing I had skipped: route the whole plan to one cheaper model when the ticket is not hard, and only spend Opus on the genuinely difficult ones.

That is it. No coordinator. No mid-session juggling. No symphony. The elaborate idea could not even run because the model is locked at launch. The five-minute idea routed cleanly and saved the most money of anything I tried.

The implementation is not complicated, mostly because the hard part was already done: Bridge GPT already scores every ticket's difficulty from 1 to 10 as part of planning, so I just reused that signal. The CLI injects the model choice at spawn time, which is the only place it can, since the session's model is fixed at launch. Difficulty 1 to 3 goes to Haiku, 4 to 6 goes to Sonnet, and 7 to 10 goes to Opus.

Two gotchas mattered.

First, it fails safe, not cheap. On any uncertainty, like a failed difficulty eval, a credential problem, or a config hiccup, it defaults to premium and assumes the ticket is hard. Silently downgrading a hard ticket is how you ship broken code while feeling financially responsible.

Second, my difficulty scorer initially over-estimated everything. It rated every ticket a 6 to 9, which quietly made routing a near no-op. After recalibration, the mean error dropped from about 2.3 to about 0.8, and it finally started sending easy work to the cheap model.

This cut my costs hard. Now I only bring the power I need and pay what I have to. It's cut down my token burn by 34% overall.

Verdict: the big win; the simple idea beat the clever one.

7. The free quality win

Phasing had one genuine benefit: deeper validation. Fresh eyes catch more.

I captured that for zero tokens with one line in the implement prompt:

Validate against the acceptance-criteria intent; don't assume the diff is correct.

That nudges the agent to test against what the ticket actually wanted instead of rubber-stamping its own output. It gave me the useful part of phasing without the 25% cost penalty or the regression blind spot.

Verdict: easy win.

8. What I'll test next

The next thing on my list is an AST/PageRank repo-map MCP tool, Aider-style. The idea is to hand the agent a global structural map upfront so it stops spelunking through exploratory reads.

This should compose cleanly with Serena: use the map for orientation, then use Serena for precise symbol fetches. I have not shipped it yet, so this stays in the promising-but-unproven bucket. I am trying to develop the rare and powerful skill of not declaring victory before measuring things. Growth mindset, but with invoices.

What I actually learned about agent token costs

The real payoff from all this was not the individual tricks. It was learning how these platforms count tokens and why so many intuitive optimizations fail.

First, you are not mostly paying for what the agent writes. You are paying for it to re-read everything it has already done. Cache-read of the growing transcript is the bill, roughly 85% in my representative runs.

Second, not all tokens cost the same. Cache-reads are cheap. Output and cache-creation are dear. Any optimization that trades cheap cache-reads for expensive re-priming loses even when the raw token count looks flat. Phasing was the cleanest example: tokens looked fine, cost did not. Always look at cost by class, not just total tokens.

Third, big context windows are a trap as much as a feature. At 1M tokens, nothing auto-compacts, so the tail grows unbounded on exactly the long, hard runs where it hurts most.

Fourth, runs are nondeterministic. A single absolute number is noise wearing a lab coat. Measure paired deltas or do not bother.

Fifth, test the big claims yourself. Headroom promised 90% and made my context cost worse.

And the meta-lesson: the simple, slightly dumb idea beat the elaborate, clever one. Route the whole easy ticket to a cheaper model. Do not orchestrate a tiny distributed systems conference inside every coding session. I say this with love, and also because I tried.

The strategic fork

The biggest theoretical wins left go straight at the 85%: observation masking and tool-output sandboxing. The claims range from 60% to 98% tail reduction, because they prune what the transcript carries forward instead of fussing with the plan.

But you cannot really do that through a managed CLI. You need control of the transcript and tool layer, which means owning the agent loop with the Agent SDK or a custom harness.

That matters most precisely where I started: Opus-tier hard tickets at a 1M window, where nothing auto-compacts. So the fork is real. Keep the convenience of the managed CLI and accept the ceiling, or own the loop and go after the part of the bill that dominates.

The takeaway

The plan is not the cost; the work is. The wins were unglamorous: route whole plans to cheaper models by difficulty, drop in Serena, and add a one-line test-intent directive. Every clever, complicated idea I loved lost to a simpler one. Everything else was a good story and a worse bill.

About the author

Brian Case

Principal Salesforce Architect & AI Strategist

Brian Case is a Salesforce CTA and AI architect helping Salesforce orgs adopt LLMs, Data Cloud, and Agentforce.