
Controlling AI Coding Agent Costs: Budget Management for Long-Running Jobs

How to prevent runaway costs from scheduled Claude Code jobs. Budget planning, scope optimisation, and the pricing levers that actually matter.

OpenHelm Team · Product · 10 min read

There's a category of developer who's tried Claude Code automation exactly once: they kick off an overnight job, wake up, and find a £100+ bill in their Anthropic console for a task they expected to cost £5.

The job hung silently. Or the goal was ambiguous and Claude Code looped optimising the same code section over and over. Or the codebase was larger than expected and the input token cost inflated.

The result is the same: an expensive lesson about how quickly Claude Code token costs can spiral when you're not watching.

The good news is that this is entirely preventable. Cost control isn't about restricting Claude Code — it's about writing prompts that finish predictably and building safeguards that catch runaway sessions before they become expensive.

Where AI Coding Costs Actually Come From

Claude Code billing follows the Anthropic API pricing model: you pay per token, with output tokens costing roughly 5x more than input tokens.

For a scheduled job, the cost breakdown looks like this:

Input tokens (code and context Claude Code reads): The up-front cost of each run, driven by how much of the codebase Claude Code has to read.

Output tokens (Claude Code's responses): Accumulates with each iteration. A task that requires 10 iterations costs 10x more in output tokens than one that completes in 2 iterations.

Context accumulation: Each iteration adds to Claude Code's working context. A 10-iteration job has more context in iteration 10 than iteration 1, so later iterations cost more.

Practically: iteration cost is where expenses live. A well-scoped job that completes in 3 focused iterations costs a fraction of a vague goal that loops 20 times before giving up.
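The arithmetic behind that claim can be sketched in a few lines. The pricing figures here are assumptions (roughly Sonnet-class rates of $3 per million input tokens and $15 per million output tokens); check current Anthropic pricing before relying on them, and note the token counts are illustrative, not measured.

```shell
# Back-of-envelope cost estimate for a run, given total token counts.
# Assumed pricing: $3/M input tokens, $15/M output tokens (roughly
# Sonnet-class) - substitute current Anthropic rates.
estimate_cost() {
  in_tokens=$1    # input tokens summed across all iterations
  out_tokens=$2   # output tokens summed across all iterations
  awk -v i="$in_tokens" -v o="$out_tokens" \
    'BEGIN { printf "$%.2f\n", (i * 3 + o * 15) / 1000000 }'
}

estimate_cost 400000 80000     # a focused 3-iteration job -> $2.40
estimate_cost 3000000 900000   # a 20-iteration loop on the same repo -> $22.50
```

The second run is roughly 10x the first for the same underlying task, and almost all of the difference is iteration count, not codebase size.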

The Cost Scale

Here's what actual scheduled Claude Code jobs look like in practice:

| Task type | Typical duration | Estimated cost |
| --- | --- | --- |
| Simple lint pass | 5–15 mins | £1–2 |
| Dependency upgrade + test | 20–45 mins | £3–8 |
| Scoped refactor | 45–90 mins | £8–15 |
| Test writing (single module) | 30–60 mins | £4–10 |
| Documentation update | 20–40 mins | £2–5 |
| Large refactor (vague goal) | 2–6 hours | £30–80 |
| Runaway job (unscoped) | 4–8 hours | £80–150+ |

The last two rows are where cost control matters. The difference between a well-scoped refactor and a vague one can be a factor of 5 in expense.

Lever 1: Scope the Work Explicitly

The single biggest cost lever is codebase scope. A vague goal on a large codebase means Claude Code spends tokens reading and analysing files that don't matter.

Expensive:

Goal: Improve the codebase. Refactor things that could be cleaner.

Claude Code now has to decide what "improve" means. It reads broadly. It optimises things that weren't asked for. It explores possibilities. All of that generates tokens.

Cheap:

Goal: Refactor the auth module in src/auth/. Simplify the session handling logic without changing the public interface. Run npm test to verify nothing broke. If tests pass, commit. If tests fail, revert and report which test failed and why.

Claude Code now knows exactly which files matter. It reads less. It completes faster.

The cost difference between these two: £40–80 vs £3–5 for the same underlying work. The only change is scope precision.

How to apply this in practice:

  • Name the exact directory or file: "Refactor src/api/handlers.ts" not "improve the API"
  • Name constraints explicitly: "Don't change the external interface" or "Leave the database layer untouched"
  • Specify test boundaries: "Run npm test -- --testPathPattern=auth" not "make sure nothing broke"
  • Build in a completion check: "Done when jest --coverage shows ≥90% coverage"

Scope precision converts a search problem into a focused task. The cost difference is enormous.
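For the cron + script route, the scoped goal above might be wired up like this. A sketch only: the paths are placeholders, and the claude invocation mirrors the timeout wrapper used later in this article, so verify the flag names against your installed CLI version.

```shell
#!/bin/sh
# A scoped nightly job, sketched for the cron + script approach.
# Paths are placeholders; verify claude flags against your CLI version.
GOAL='Refactor the auth module in src/auth/. Simplify the session
handling logic without changing the public interface. Run "npm test
-- --testPathPattern=auth" to verify nothing broke. If tests pass,
commit. If tests fail, revert and report which test failed and why.'

timeout 3600 claude -p "$GOAL" --project /path/to/repo \
  >> ~/logs/auth-refactor.log 2>&1
```

Notice that the entire scoping discipline lives in the goal string: the script itself stays identical from job to job.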

Lever 2: Build Acceptance Criteria Into the Goal

Open-ended tasks generate expensive loops. Claude Code keeps iterating because "done" is ambiguous.

Adding a specific, verifiable completion condition stops the iteration:

Vague:

Improve test coverage.

How much improvement? When is it enough? Claude Code can loop indefinitely because there's no finish line.

Specific:

Add tests to src/api/users.ts until jest --coverage --testPathPattern=users reports ≥85% coverage. Stop once that threshold is met.

Claude Code can check the coverage number itself. It knows when it's done.

The cost impact: vague goals often run the full job timeout (2–8 hours). Specific goals usually complete in 30–60 minutes.
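One way to make that finish line machine-checkable in a script-driven setup is to parse jest's coverage summary. This sketch assumes jest runs with the json-summary coverage reporter enabled, which writes coverage/coverage-summary.json; check your jest config before relying on that path.

```shell
# Sketch: return 0 once line coverage meets a threshold, reading the
# coverage/coverage-summary.json that jest writes when the json-summary
# coverage reporter is enabled (an assumption - check your jest config).
coverage_met() {
  threshold=$1
  file=${2:-coverage/coverage-summary.json}
  # crude extraction: the first "pct" value in json-summary output is
  # total line coverage, since "total" is the first key in the file
  pct=$(tr ',' '\n' < "$file" 2>/dev/null \
        | sed -n 's/.*"pct":\([0-9.]*\).*/\1/p' | head -n 1)
  [ -n "$pct" ] || return 1   # no coverage data yet
  awk -v p="$pct" -v t="$threshold" 'BEGIN { exit !(p >= t) }'
}

# usage, after a coverage run:
#   coverage_met 85 && echo "threshold reached - stop iterating"
```

A check like this can go in the goal itself ("stop once coverage_met 85 succeeds") or in a wrapper script that decides whether another run is needed at all.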

Lever 3: Silence Detection and Timeouts

Even well-scoped goals can hit unexpected issues — an API call that stalls, a build process that never completes, a prompt Claude Code doesn't know how to handle.

Without a safety net, the job runs indefinitely. With silence detection or a hard timeout, it stops.

With OpenHelm: Built-in silence detection. If Claude Code produces no output for 10 minutes, the run is flagged and stopped. A typical overnight job might cost £5–10; an undetected hang would cost £50–100.

With cron + script: Add a timeout wrapper:

timeout 3600 claude -p "your goal" --project /path >> ~/logs/job.log 2>&1

This kills the process after 3600 seconds (1 hour), preventing a stalled run from consuming resources overnight. The trade-off is that legitimately long-running jobs get cut short; choose your timeout based on what you expect.

Without this, a hung job is a silent expense.
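For the DIY route, silence detection itself can be approximated with a small watchdog: run the job in the background and kill it once its log file stops growing for a full check interval. A sketch under assumptions; the claude flags in the usage note follow this article's wrapper and may differ in your CLI, and the 10-minute interval is a tunable choice.

```shell
#!/bin/sh
# Run a command, killing it if its log file stops growing for one
# full check interval. Returns the job's exit status (non-zero if
# the watchdog killed it).
run_with_watchdog() {
  log=$1; interval=$2; shift 2
  "$@" >> "$log" 2>&1 &
  job=$!
  (
    prev=-1
    while :; do
      size=$(wc -c < "$log" 2>/dev/null || echo 0)
      # no new bytes since the last check: assume a hang and stop it
      [ "$size" -eq "$prev" ] && { kill "$job" 2>/dev/null; exit; }
      prev=$size
      sleep "$interval"
    done
  ) &
  watcher=$!
  wait "$job"
  status=$?
  kill "$watcher" 2>/dev/null   # job finished; retire the watchdog
  return $status
}

# usage (flags per this article's wrapper; verify against your CLI):
#   run_with_watchdog ~/logs/job.log 600 \
#     timeout 28800 claude -p "your goal" --project /path
```

Compared with a bare timeout, this stops a hung job after 10 silent minutes rather than letting it sit until the hard limit, which is exactly the gap where a stalled overnight run gets expensive.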

Lever 4: Structured Logging and Monitoring

You can't control costs you don't see. Structured logging means you know what each run cost and can spot anomalies.

The Anthropic Console shows per-session costs: click a run, see tokens used, see the API breakdown. Build a weekly habit of checking this.

Watch for:

  • A job that always costs £3–5 suddenly costing £20. Something changed.
  • A job running longer than expected. Maybe the codebase grew and you need to scope more strictly now.
  • Consistently expensive runs. Maybe the goal needs rewriting.

The information is there. The cost control comes from noticing patterns and adjusting before they become problems.
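If you want a paper trail outside the console, a tiny run ledger is enough to make these patterns jump out. A sketch: the ledger path, CSV format, and the 2x threshold are arbitrary choices for illustration, not fixed conventions.

```shell
# Record each run's date, duration, and exit code, then flag runs
# that took more than twice the expected time. Ledger path, CSV
# layout, and the 2x threshold are illustrative choices.
LEDGER=${LEDGER:-$HOME/logs/job-durations.csv}

record_run() {
  start=$(date +%s)
  "$@"                       # the job itself, e.g. the claude wrapper
  rc=$?
  end=$(date +%s)
  echo "$(date -u +%F),$((end - start)),$rc" >> "$LEDGER"
  return $rc
}

flag_anomalies() {
  expected=$1                # expected duration in seconds
  awk -F, -v e="$expected" '$2 > 2 * e { print "anomaly: " $0 }' "$LEDGER"
}
```

Duration is only a proxy for cost, but for a fixed goal the two track each other closely: a run that takes 4x as long as usual almost always burned correspondingly more tokens.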

A Real Cost Optimisation Example

A development team was running a nightly automation job to check code quality and fix linting issues. The job cost £20–30 most nights, but occasionally spiked to £50–80.

Original goal:

Run linting across the entire codebase. Fix violations. Run the test suite.

Problems:

  • Entire codebase is ~50k lines
  • "Fix violations" is vague (sometimes Claude Code rewrites unnecessarily)
  • No clear finish line (when is linting "done"?)
  • No failure plan (what should happen when a test fails?)

Cost: £20–30 normal, £50–80 on spikes.

Revised goal:

Run npm run lint -- src/ --fix. Count remaining violations. Run npm test. If tests pass, commit the changes. If any test fails, revert linting changes and report which test failed and why.

Changes:

  • Scope to src/ only (not entire codebase)
  • Use linter's --fix (not Claude Code's rewrites)
  • Clear completion: either the tests pass and the changes are committed, or the failure is reported and the revert ends the run
  • Build in a revert mechanism if something goes wrong

Cost: Now consistently £3–5.

Savings: £15–25/night. Over a month, that's £300–500. Over a year, it's £3,600–6,000 — from the same job, just rewritten to be specific.
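Wired into cron, the revised goal might look like this. A sketch in which the schedule, paths, and claude flags are placeholders to adapt, not a prescribed setup.

```shell
#!/bin/sh
# nightly-lint.sh - the revised lint goal as a scheduled job.
# Paths are placeholders; verify claude flags against your CLI version.
GOAL='Run "npm run lint -- src/ --fix". Count remaining violations.
Run "npm test". If tests pass, commit the changes. If any test fails,
revert the linting changes and report which test failed and why.'

timeout 3600 claude -p "$GOAL" --project /home/dev/repo \
  >> /home/dev/logs/nightly-lint.log 2>&1

# crontab entry to run it at 02:00 every night:
#   0 2 * * * /home/dev/bin/nightly-lint.sh
```

The one-hour timeout is itself part of the cost control here: the job now consistently finishes well inside it, so the limit only fires when something has genuinely gone wrong.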

Per-Month Budget Planning

For a solo developer running nightly Claude Code jobs:

  • 1–2 simple jobs (lint, format, small updates): £5–15/month
  • 3–5 jobs with test suites included: £50–100/month
  • Regular refactoring + optimisation: £100–200/month
  • Multiple teams, multiple projects: £500–1,000+/month

These numbers assume well-scoped jobs. If your goals are vague, expect 2–3x higher costs.

Set a budget and monitor against it weekly. If you're trending over, the issue is usually goal clarity, not the tool itself.

The Self-Correction Fallacy

OpenHelm includes self-correction: if a job fails, it can automatically queue a retry with the failure output as context. This sounds like it costs more (you're running twice), and sometimes it does.

But in practice, a job that fails and gets retried often succeeds the second time because it has information about what went wrong. A vague goal that loops 10 times is more expensive than a failed attempt that retries once with context.

Self-correction is worth using because it trades a second attempt (usually £1–3 cost) for avoiding a long loop (usually £20–50 cost).
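For the script-based approach, a rough DIY equivalent of retry-with-context looks like this. The claude flags follow this article's wrapper and may differ in your CLI, and the 50-line tail is an arbitrary cut.

```shell
#!/bin/sh
# One retry with failure context folded into the goal - a rough DIY
# version of self-correction for the cron + script route.
GOAL='your goal'
LOG="$HOME/logs/job.log"

if ! timeout 3600 claude -p "$GOAL" --project /path > "$LOG" 2>&1; then
  FAILURE=$(tail -n 50 "$LOG")   # last 50 log lines as retry context
  timeout 3600 claude -p "$GOAL

The previous attempt failed. Its final output was:
$FAILURE

Address that failure, then complete the original goal." \
    --project /path >> "$LOG" 2>&1
fi
```

The key design choice is that the retry gets the failure output, not just a second blind attempt: without that context, the rerun tends to fail the same way and you pay twice for nothing.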

FAQ

Does using GPT-4 cost less than Claude?

Claude Code uses Claude, not GPT-4. If you're comparing to a different workflow (Cursor Pro with GPT-4, for instance), compare per-token prices: Claude and GPT-4 pricing is in the same order of magnitude, but the exact comparison depends on your specific model versions.

Can I set a hard limit on Claude Code API spend?

Not at the Claude Code level. The Anthropic API has account-level rate limits you can configure, but not per-job caps. Cost control comes from goal design, not API boundaries.

What's the cost of silence detection itself?

Silence detection doesn't add cost; it saves cost by stopping runaway jobs. The monitoring happens at the output stream level with no additional API calls.

How often should I check my Anthropic Console?

Weekly, especially when you're first automating jobs. Once you have a month of history and understand what your jobs cost, checking monthly is reasonable. Always check immediately after anomalies (jobs taking longer than expected, unexpected failures).

Is there a way to predict a job's cost before running it?

Not precisely, because iteration cost depends on how many attempts Claude Code needs. But you can estimate based on prior runs. A job that usually costs £3–5 shouldn't cost £30; if it does, something went wrong. Set a hard timeout and review the failure.
