Claude Programmatic Tool Calling: Cut Agent Token Costs — ContentBuffer guide

Kodetra Technologies · 10 min read · Intermediate

Summary

Let Claude write code that calls your tools in a loop — 20–40% fewer tokens, same accuracy.

If you run a tool-using agent on the Claude API, your biggest bill is usually not the answer. It is everything the model reads to get there: every raw tool result, every intermediate JSON blob, every round-trip where the model wakes up, looks at one result, and decides to call the next tool. Programmatic tool calling attacks exactly that waste. Instead of the model calling your tools one at a time and reading each result, Claude writes a short Python program that calls your tools in a loop, filters and aggregates the results in code, and only hands back the conclusion.

The feature jumped back into focus with the June 9, 2026 launch of Claude Fable 5, Anthropic's most capable widely released model and one built specifically for long-horizon agentic work. Fable 5 supports programmatic tool calling out of the box, but so do Opus 4.8, Opus 4.7/4.6/4.5, and Sonnet 4.6/4.5, so you can ship this today on a model that is generally available. (Fable 5 itself had access suspended on June 12 pending a government review; the code in this guide runs unchanged on claude-opus-4-8.)

By the end you will be able to mark a tool as code-callable with one field, run an agent loop that handles the programmatic protocol correctly, and reason about when this saves you money and when it quietly costs more. Anthropic reports savings of roughly 20–40% of billed input tokens on requests with 10–49 tool definitions, and about 38% on its 75-tool benchmark, with no measurable drop in task accuracy. We will also build a runnable local harness so you can watch five tool calls collapse from six model round-trips into two without spending a cent on API credits.

Prerequisites

  • Python 3.9+ and the official SDK: pip install anthropic (1.x).
  • An Anthropic API key in ANTHROPIC_API_KEY, on the Claude API, Claude Platform on AWS, or Microsoft Foundry (programmatic calling is not yet on Bedrock or Vertex).
  • A compatible model: claude-fable-5, claude-opus-4-8, claude-sonnet-4-6, or any Opus/Sonnet 4.5+.
  • Comfort with the basics of Claude tool use: tool definitions, tool_use blocks, and returning tool_result.

One hard requirement worth stating up front: programmatic tool calling requires the code execution tool to be enabled. Claude runs the Python it writes inside an Anthropic-managed sandbox, and your custom tools are exposed to that sandbox as async functions.

How programmatic tool calling works

In classic tool use, the loop is: model calls tool A, you return A's result, model reads it and calls tool B, you return B's result, and so on. Each step pays for an extra model sample and pushes the full raw result into the context window. With programmatic tool calling the shape changes:

  1. Claude writes Python that invokes your tool as a function, with whatever loops, conditionals, and post-processing the task needs.
  2. That code runs in the sandboxed code-execution container.
  3. When the code calls one of your tools, execution pauses and the API returns a tool_use block to you.
  4. You return the tool_result; the container resumes. Intermediate results are not loaded into Claude's context.
  5. When the code finishes, Claude sees only the final stdout and continues the task.

The payoff is in step 4: tool results consumed inside the script never enter the model's context, so they never get billed as input tokens. Calling ten tools directly costs roughly 10x the tokens of calling them in code and returning a one-line summary.
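To make the arithmetic behind that claim concrete, here is a rough back-of-the-envelope sketch. The function name and all token counts are invented for illustration, and the model is deliberately crude: it counts only tool-result tokens that enter the context once, and ignores the prompt, tool definitions, the generated script, and output tokens.

```python
def result_tokens_in_context(n_calls, tokens_per_result, summary_tokens, programmatic):
    """Tool-output tokens the model reads, under a deliberately crude model."""
    if programmatic:
        # Raw results stay in the sandbox; only the printed summary is read.
        return summary_tokens
    # Classic tool use: every raw result is loaded into the context window.
    return n_calls * tokens_per_result


# Invented numbers: ten calls, 400 tokens per result, a 400-token summary.
direct = result_tokens_in_context(10, 400, 400, programmatic=False)
scripted = result_tokens_in_context(10, 400, 400, programmatic=True)
print(direct, scripted, direct / scripted)  # 4000 400 10.0
```

With a summary about the size of one raw result, the ratio equals the number of calls. A shorter summary widens the gap; fixed overhead such as the script itself narrows it.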

Step 1: Mark a tool as code-callable

A single field controls who can call a tool: allowed_callers. Add it to any normal tool definition.

query_database = {
    "name": "query_database",
    "description": (
        "Execute a SQL query against the sales database. "
        "Returns a JSON array of rows, each an object with keys "
        "'region' (str) and 'revenue' (int)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "SQL query to execute"}
        },
        "required": ["sql"],
    },
    # The whole feature is this one line:
    "allowed_callers": ["code_execution_20260120"],
}

The values are: ["direct"] (the default — Claude calls it the classic way), ["code_execution_20260120"] (Claude is guided to call it only from code), or both. Anthropic recommends picking one per tool rather than enabling both, so Claude gets a clear signal about how to use it.

Notice the description spells out the exact return shape. Because Claude deserializes the result in code, the more precisely you document the JSON structure and field types, the more reliably Claude can sort, filter, and aggregate it. Treat the description as an API contract, not a hint.
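One way to keep that contract honest is to check your tool's output against the documented shape before returning it. The sketch below is an assumption about how you might do this on the client side; validate_rows is a hypothetical helper, not part of the SDK, and it enforces only the two keys named in the description above.

```python
def validate_rows(rows):
    """Check that rows match the shape promised in the tool description.

    Hypothetical helper: expects a list of objects with exactly the keys
    'region' (str) and 'revenue' (int).
    """
    if not isinstance(rows, list):
        raise TypeError("tool result must be a JSON array")
    for row in rows:
        if not isinstance(row, dict) or set(row) != {"region", "revenue"}:
            raise ValueError(f"unexpected row shape: {row!r}")
        if not isinstance(row["region"], str):
            raise ValueError(f"'region' must be a string: {row!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(row["revenue"], bool) or not isinstance(row["revenue"], int):
            raise ValueError(f"'revenue' must be an integer: {row!r}")
    return rows


print(validate_rows([{"region": "West", "revenue": 45000}]))
```

Failing loudly here is a design choice: a malformed row that reaches Claude's script tends to surface later as a confusing exception inside the sandbox.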

Security note: allowed_callers shapes how the tool is presented and is checked against tool_choice, but it is not a hard API-level block. Claude is strongly guided to respect it, yet your client should still be ready to handle a direct call for any tool it defines. Do not treat it as a security boundary.

Step 2: Make the request

Include the code execution tool alongside your custom tool, then send the user message. Nothing else about the request changes.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Query sales for West, East, Central, North and South, "
                   "then tell me which region had the highest revenue.",
    }],
    tools=[
        {"type": "code_execution_20260120", "name": "code_execution"},
        query_database,  # the dict from Step 1
    ],
)
print(response.stop_reason)  # -> "tool_use"

Claude responds with a server_tool_use block (the code it wrote) plus one or more tool_use blocks for your tool. The tell-tale sign of a programmatic call is the caller field:

{
  "type": "tool_use",
  "id": "toolu_def456",
  "name": "query_database",
  "input": {"sql": "<sql>"},
  "caller": {
    "type": "code_execution_20260120",
    "tool_id": "srvtoolu_abc123"
  }
}

A caller.type of "direct" means classic tool use; "code_execution_20260120" means the call came from inside Claude's script, and tool_id points back to the code-execution block that made it.
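If you want to log or route the two kinds of call differently, a small predicate over the block is enough. This is a minimal sketch that works on plain dicts shaped like the JSON above; is_programmatic is a hypothetical name, and treating a missing caller field as a direct call is an assumption. With SDK objects you would read attributes instead of dict keys.

```python
def is_programmatic(block):
    """Return True when a tool_use block was issued from inside Claude's script."""
    if block.get("type") != "tool_use":
        return False
    # ASSUMPTION: a block without a caller field is a classic direct call.
    caller = block.get("caller") or {}
    return caller.get("type", "direct").startswith("code_execution")


block = {
    "type": "tool_use",
    "id": "toolu_def456",
    "name": "query_database",
    "input": {"sql": "<sql>"},
    "caller": {"type": "code_execution_20260120", "tool_id": "srvtoolu_abc123"},
}
print(is_programmatic(block))  # True
```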

Step 3: Run the agent loop

Your job is the same as ordinary tool use with one extra rule. Execute every pending tool_use block, then reply — and when you are answering programmatic calls, the reply message must contain only tool_result blocks. No text, not even a trailing question. Here is a complete loop:

import json
import anthropic

client = anthropic.Anthropic()

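# Toy in-memory "database" so the example runs; swap in a real query layer.
DB = {"West": [{"region": "West", "revenue": 57000}]}

def execute(name, tool_input):
    """Run one tool call. This stub expects input of the form 'region:<name>',
    not real SQL; a production version would execute the query it receives."""
    if name == "query_database":
        # partition() tolerates input without a ':' instead of raising IndexError.
        region = tool_input["sql"].partition(":")[2].strip()
        return DB.get(region, [])
    raise ValueError(f"unknown tool: {name}")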

def run_agent(user_msg, tools):
    messages = [{"role": "user", "content": user_msg}]
    while True:
        resp = client.messages.create(
            model="claude-opus-4-8", max_tokens=4096,
            messages=messages, tools=tools,
        )
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")

        # Resolve EVERY pending tool_use, direct or programmatic.
        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            out = execute(block.name, block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(out),
            })

        # tool_result-only message when answering programmatic calls.
        messages.append({"role": "user", "content": results})

That is the whole integration. The container resumes after each batch of results and finishes the script; the loop returns once stop_reason is anything other than tool_use, normally end_turn. If you make several related requests, pass the container id back to reuse the same sandbox and keep its state.
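Container reuse can be sketched as a helper that builds the request arguments. Note the assumption: this guide does not show the exact field names, so the "container" request key used here (and reading the id from the previous response) should be checked against the current API reference before you rely on it. request_kwargs is a hypothetical helper.

```python
def request_kwargs(messages, tools, container_id=None, model="claude-opus-4-8"):
    """Build keyword arguments for a Messages API call, optionally reusing a sandbox."""
    kwargs = {
        "model": model,
        "max_tokens": 4096,
        "messages": messages,
        "tools": tools,
    }
    if container_id is not None:
        # ASSUMPTION: the request field is named "container"; verify in the docs.
        kwargs["container"] = container_id
    return kwargs


first = request_kwargs([{"role": "user", "content": "hi"}], tools=[])
again = request_kwargs([{"role": "user", "content": "hi"}], tools=[],
                       container_id="container_123")  # made-up id
print("container" in first, again["container"])
```

Keeping the id out of the first request and threading it through later ones means a cold start is paid once per workflow rather than once per request.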

A runnable example you can verify without an API key

To prove the loop logic, including the tool_result-only rule, you do not need credits. The harness below swaps the real API for a tiny fake responder that emits the message shapes described above for programmatic calling: turn one fans out into five query_database calls from a single code block; turn two finishes and summarizes. The client loop has the same structure as the one in Step 3, except that it reads plain dicts instead of SDK objects.

import json

DB = {"West":[{"rev":45000},{"rev":12000}], "East":[{"rev":38000},{"rev":9000}],
      "Central":[{"rev":24000}], "North":[{"rev":51000},{"rev":4000}],
      "South":[{"rev":17000}]}

def query_database(sql):
    return DB.get(sql.split(":",1)[1].strip(), [])

TOOLS = {"query_database": query_database}

class FakeAPI:               # mimics the Messages API for prog. tool calling
    def __init__(self): self.turn = 0
    def create(self, messages):
        self.turn += 1
        if self.turn == 1:   # model writes a loop -> 5 programmatic calls
            content = [{"type":"text","text":"I'll fan out across regions."},
                       {"type":"server_tool_use","id":"srvtoolu_1",
                        "name":"code_execution","input":{"code":"<loop>"}}]
            for i,r in enumerate(["West","East","Central","North","South"]):
                content.append({"type":"tool_use","id":f"toolu_{i}",
                    "name":"query_database","input":{"sql":f"region:{r}"},
                    "caller":{"type":"code_execution_20260120","tool_id":"srvtoolu_1"}})
            return {"role":"assistant","content":content,"stop_reason":"tool_use"}
        totals = {r:sum(x["rev"] for x in query_database(f"region:{r}"))
                  for r in ["West","East","Central","North","South"]}
        top = max(totals, key=totals.get)
        return {"role":"assistant","stop_reason":"end_turn","content":[
            {"type":"text","text":f"{top} led with ${totals[top]:,} in revenue."}]}

def run_agent(api, messages):
    rounds = 0
    while True:
        rounds += 1
        resp = api.create(messages)
        messages.append({"role":"assistant","content":resp["content"]})
        if resp["stop_reason"] != "tool_use":
            return "".join(b["text"] for b in resp["content"] if b["type"]=="text"), rounds
        results = []
        for b in resp["content"]:
            if b.get("type") != "tool_use": continue
            out = TOOLS[b["name"]](**b["input"])
            results.append({"type":"tool_result","tool_use_id":b["id"],
                            "content":json.dumps(out)})
        assert all(r["type"]=="tool_result" for r in results)  # the rule
        messages.append({"role":"user","content":results})

if __name__ == "__main__":
    answer, rounds = run_agent(FakeAPI(),
        [{"role":"user","content":"Which region had the highest revenue?"}])
    print("MODEL ROUND-TRIPS:", rounds)
    print("ANSWER:", answer)

Run it and you get:

MODEL ROUND-TRIPS: 2
ANSWER: West led with $57,000 in revenue.

Two round-trips, not six. With classic tool use those five region lookups would have been five separate model samples plus a sixth to summarize, and all five raw result arrays would have sat in the context window. Programmatically, the model is sampled twice and never reads a single raw row — it reads only the script's final line. West wins at $57,000 (45,000 + 12,000), edging out North's $55,000.

Patterns that pay off

Programmatic calling shines whenever the work is shaped like code rather than conversation. A few patterns Claude reaches for:

  • Fan-out / batch: loop over many items (check 50 endpoints, look up 20 records) and return one aggregate, turning N round-trips into 1.
  • Early termination: iterate until a success condition, then break — stop hitting an API the moment you find a healthy host.
  • Conditional selection: inspect one cheap result, then decide whether to call the expensive tool (read full file vs. read summary based on size).
  • Data filtering: fetch a 10,000-line log, keep only the lines containing ERROR, return the last 10. The other 9,990 lines never touch the context.

That last pattern is the headline: programmatic calling makes it practical to work over tool results larger than the context window itself, because the filtering happens in the sandbox before anything reaches the model.
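For a feel of what such a script might look like inside the sandbox, here is a sketch of the data-filtering and early-termination patterns. It is not code Claude actually produced: fetch_log and check_health are hypothetical stand-ins for custom tools (modeled as async functions, as the guide describes), and the log contents are synthetic.

```python
import asyncio

CALLS = []  # records which hosts were probed, to show the early exit


async def fetch_log(service):
    """Hypothetical tool: returns 10,000 synthetic log lines."""
    return [f"{i} {'ERROR' if i % 500 == 0 else 'INFO'} {service} event"
            for i in range(10_000)]


async def check_health(host):
    """Hypothetical tool: pretends only hosts ending in '-3' are healthy."""
    CALLS.append(host)
    return host.endswith("-3")


async def last_errors(service, keep=10):
    """Data filtering: keep only ERROR lines, return the last few."""
    lines = await fetch_log(service)
    errors = [line for line in lines if "ERROR" in line]
    return errors[-keep:]


async def first_healthy(hosts):
    """Early termination: stop probing as soon as one host is healthy."""
    for host in hosts:
        if await check_health(host):
            return host
    return None


summary = asyncio.run(last_errors("checkout"))
healthy = asyncio.run(first_healthy(["web-1", "web-2", "web-3", "web-4"]))
print(len(summary), healthy, CALLS)
```

Only the printed line would reach the model; the 10,000 log lines and the per-host probe results stay in the sandbox.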

Common pitfalls and gotchas

This feature has sharp edges. Most production bugs come from the list below.

  • The tool_result-only rule. When a message answers pending programmatic calls, it must contain only tool_result blocks. Adding a text block ("what next?") returns a 400. Text after results is allowed for ordinary client-side tool calls, just not programmatic ones.
  • It is not free for every workload. On strictly sequential single-call turns (τ²-bench style, one tool call per turn that the model must reason over), Anthropic measured scores unchanged and cost ~8% higher — container startup and script generation are fixed overhead. Use it for fan-out and big results, not one-shot lookups.
  • MCP-connector tools cannot be called programmatically. Nor can you force a programmatic call: tool_choice may not name a tool whose allowed_callers omits "direct", or you get an invalid_request_error.
  • Structured outputs are incompatible. Tools with strict: true are not supported with programmatic calling, and neither is disable_parallel_tool_use: true.
  • Containers expire. Sandboxes idle out after 4.5 minutes (30-day hard max). If your tool is slow, the code sees a TimeoutError in stderr and Claude may retry. Watch the expires_at field and keep tools fast.
  • Refusals are HTTP 200, and monitoring is blind to them. A safety classifier decline returns stop_reason: "refusal", not an error, so error-rate dashboards never see it. Emit an explicit event per refusal and per fallback.
  • Validate tool results before they are trusted. Results return as strings that Claude may deserialize and act on in code. If your tool relays external or user-supplied data, treat it as a code-injection surface.
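Two of these pitfalls are easy to guard against on the client. The sketch below is one possible approach, not an official pattern: build_reply and record_stop are hypothetical helpers, and the in-memory Counter stands in for whatever metrics system you use.

```python
from collections import Counter

EVENTS = Counter()  # stand-in for a real metrics client


def build_reply(results, trailing_text=None, programmatic=True):
    """Build the user message that answers pending tool calls."""
    if any(r.get("type") != "tool_result" for r in results):
        raise ValueError("results must all be tool_result blocks")
    if programmatic and trailing_text:
        # Catch the mistake locally instead of receiving a 400 from the API.
        raise ValueError("programmatic replies must contain only tool_result blocks")
    content = list(results)
    if trailing_text:
        content.append({"type": "text", "text": trailing_text})
    return {"role": "user", "content": content}


def record_stop(stop_reason):
    """Count refusals explicitly, since they arrive as successful responses."""
    if stop_reason == "refusal":
        EVENTS["refusal"] += 1
    return stop_reason


result = {"type": "tool_result", "tool_use_id": "toolu_0", "content": "[]"}
print(build_reply([result])["content"] == [result])
record_stop("refusal")
print(EVENTS["refusal"])
```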

Quick reference

Item                       Value / rule
Enable on a tool           "allowed_callers": ["code_execution_20260120"]
Required tool              code execution (code_execution_20260120)
Models                     Fable 5, Opus 4.8/4.7/4.6/4.5, Sonnet 4.6/4.5
Platforms                  Claude API, Claude Platform on AWS, Microsoft Foundry
Spot a programmatic call   caller.type == "code_execution_20260120"
Reply format               tool_result blocks ONLY (no text)
Container idle / max life  4.5 min idle / 30-day hard max
Typical token savings      20–40% with 10–49 tools
Incompatible with          strict outputs, MCP tools, forced tool_choice

On Anthropic's 75-tool project-management benchmark, enabling programmatic tool calling cut billed input tokens by about 38% with no change in task accuracy. Across production traffic, requests with 10–49 tool definitions saw 20–40% savings. The honest caveat: on workloads with a handful of small, sequential calls, the overhead can exceed the savings, so measure on a representative sample before turning it on everywhere.
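Measuring on your own sample can be as simple as comparing summed input tokens for the same requests run both ways. In this sketch the helper name and the token counts are invented; in practice the per-request numbers would come from the usage data on each response.

```python
def input_token_savings(baseline, programmatic):
    """Fraction of billed input tokens saved; negative means it cost more."""
    base_total = sum(baseline)
    if base_total <= 0:
        raise ValueError("baseline must contain billed tokens")
    return (base_total - sum(programmatic)) / base_total


# Invented per-request input token counts for the same two tasks.
baseline_run = [1200, 800]
programmatic_run = [900, 860]
print(f"{input_token_savings(baseline_run, programmatic_run):.0%}")  # 12%
```

The second request in this made-up sample got more expensive under programmatic calling, which is the small, sequential case described above; aggregating over a representative sample keeps one such request from hiding the overall picture, or dominating it.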

Next steps

  • Audit your agent: count tool definitions and result sizes. 10+ tools or large results are strong candidates.
  • Add allowed_callers to your fan-out and large-result tools, leave conversational ones direct.
  • Measure billed input tokens with and without the field on real traffic before a broad rollout.
  • Pair with container reuse for multi-request workflows, and with refusal fallback if you target Fable 5.
  • Read Anthropic's engineering write-up on advanced tool use for the cost model behind the feature.

Programmatic tool calling is the rare optimization that costs one line of config and pays back in tokens, latency, and the ability to handle results that would never have fit in context. Add the field where it fits, measure, and let Claude do the bookkeeping in code.
