Fable 5 Effort: Cut Thinking Token Costs in Python — ContentBuffer guide

Kodetra Technologies · 9 min read · Intermediate

Summary

Claude Fable 5 always thinks. Use effort, display and max_tokens to control reasoning cost.

Anthropic shipped Claude Fable 5 on June 9, 2026, and it is now the most capable model the company offers to everyone. It also behaves differently from every Claude before it in one way that catches teams off guard the first week: thinking is always on and you cannot turn it off. Send a one-line prompt and the model may still burn hundreds of reasoning tokens before it answers. At $50 per million output tokens, and with thinking tokens billed as output, that surprise lands on your invoice.

The good news is that you are not stuck paying for reasoning you do not need. Fable 5 gives you three real levers: the effort parameter to dial how hard the model thinks, the thinking.display setting to control what comes back over the wire, and max_tokens as a hard ceiling on the whole response. Used together they let one model serve both a cheap high-volume classifier and a deep long-horizon agent.

This guide shows exactly how those levers work on Fable 5, with runnable Python, real token numbers from the usage object, and a worked routing example that spends low effort on easy requests and saves deep reasoning for the hard ones. Every API detail here is checked against Anthropic's official docs.

Prerequisites

  • Python 3.9+ and the official SDK: pip install anthropic (use a recent build that knows the claude-fable-5 model ID).
  • An Anthropic API key in the ANTHROPIC_API_KEY environment variable.
  • Working knowledge of the Messages API: client.messages.create(...) and the content / usage fields on the response.
  • A rough sense of your traffic mix, because the whole point of effort is matching reasoning depth to how hard each request actually is.

What is different about thinking on Fable 5

On Claude Opus 4.8 and 4.7, adaptive thinking is the only thinking mode, but it is off until you opt in with thinking={"type": "adaptive"}. Fable 5 flips that default. Adaptive thinking is always on, it applies even when you never pass a thinking field, and thinking={"type": "disabled"} is rejected. There is no way to get a zero-reasoning response.
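One way to catch the rejected configuration before it costs a round trip is a small client-side check. The sketch below is illustrative only: check_thinking_config and ALWAYS_THINKING_MODELS are hypothetical names, not part of the SDK, and the set assumes claude-fable-5 is the only always-thinking model you call.

```python
from typing import Optional

# Assumption: the model IDs in your stack where thinking cannot be disabled.
ALWAYS_THINKING_MODELS = {"claude-fable-5"}


def check_thinking_config(model: str, thinking: Optional[dict] = None) -> None:
    """Raise early if a request asks an always-thinking model to stop thinking."""
    if model in ALWAYS_THINKING_MODELS and thinking is not None:
        if thinking.get("type") == "disabled":
            raise ValueError(
                f"{model} rejects thinking type 'disabled'; "
                "use output_config={'effort': 'low'} or route to another model"
            )


check_thinking_config("claude-fable-5")                        # no thinking field: fine
check_thinking_config("claude-fable-5", {"type": "adaptive"})  # explicit adaptive: fine
```

Calling it with {"type": "disabled"} for claude-fable-5 raises ValueError locally, which is easier to spot in tests than a 400 from the API.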

Two more Fable 5 specifics matter for cost and for reading responses:

  • The raw chain of thought is never returned. You can get a readable summary, but never the verbatim reasoning. A request that tries to pull the model's internal reasoning into the answer text can be refused with stop_details.category: "reasoning_extraction".
  • thinking.display defaults to "omitted". On Opus 4.6 the default was "summarized", so a copied snippet that used to show reasoning now returns thinking blocks with an empty thinking field. This is a silent change, not a bug.

| Behavior | Opus 4.6 | Opus 4.8 / 4.7 | Fable 5 |
| --- | --- | --- | --- |
| Thinking default | summarized, opt-in | off until adaptive set | always on, cannot disable |
| Disable thinking | supported | supported | rejected (400) |
| display default | summarized | omitted | omitted |
| Raw chain of thought | never raw | never raw | never raw |

Step 1: Your first Fable 5 call

Start with the plainest possible request. Notice there is no thinking field at all, yet the model still reasons, because adaptive thinking is always on.

import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-fable-5",
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "A train leaves at 14:05 and arrives at 17:40. "
                                     "How long is the trip, in minutes?"}
    ],
)

# Fable 5 omits thinking text by default. Thinking blocks can still appear in
# content with an empty thinking field, so print only the text blocks.
for block in resp.content:
    if block.type == "text":
        print(block.text)

u = resp.usage
print("input:", u.input_tokens, "output:", u.output_tokens)
print("of which thinking:", u.output_tokens_details.thinking_tokens)

Example output:

The trip is 215 minutes (3 hours and 35 minutes).
input: 31 output: 226
of which thinking: 198

Look at that last line. The visible answer is one sentence, yet 198 of the 226 output tokens were reasoning you never see and still pay for. output_tokens is the authoritative billed total; output_tokens_details.thinking_tokens is a read-only breakdown so you can measure where the money goes. This single field is the most useful thing to log when you start tuning effort.
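If you want this in your logs, a helper along the following lines may be enough to start. It is a minimal sketch: usage_record is a hypothetical name, the $50 per million output price is the figure quoted earlier in this guide, input cost is left out because this guide does not quote an input price, and a SimpleNamespace stands in for resp.usage so the example runs without an API call.

```python
from types import SimpleNamespace


def usage_record(usage, price_per_million_output: float = 50.0) -> dict:
    """Flatten a usage object into a log-friendly dict with the thinking share."""
    output = usage.output_tokens
    thinking = usage.output_tokens_details.thinking_tokens
    return {
        "input_tokens": usage.input_tokens,
        "output_tokens": output,
        "thinking_tokens": thinking,
        # Fraction of billed output that was reasoning rather than answer text.
        "thinking_share": round(thinking / output, 3) if output else 0.0,
        # Output-side cost only; input tokens are priced separately.
        "output_cost_usd": output * price_per_million_output / 1_000_000,
    }


# Stand-in for resp.usage, using the numbers from the example output above.
fake_usage = SimpleNamespace(
    input_tokens=31,
    output_tokens=226,
    output_tokens_details=SimpleNamespace(thinking_tokens=198),
)
record = usage_record(fake_usage)
print(record)
```

For the example numbers the thinking share comes out at 0.876, so roughly seven of every eight billed output tokens were reasoning.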

Step 2: Turn the effort dial

The effort parameter controls how eager Fable 5 is to spend tokens, across thinking, response text, and tool calls alike. You pass it inside output_config. The default is high, and setting high explicitly is identical to omitting it.

| Effort | What it does | Reach for it when |
| --- | --- | --- |
| max | Maximum capability, no effort-imposed ceiling on thinking (max_tokens still applies) | Frontier problems where quality beats cost |
| xhigh | Deep, extended exploration | Long-horizon coding and agentic work |
| high (default) | Almost always thinks deeply | Hard reasoning, the safe default |
| medium | Moderate thinking, real savings | Balanced agentic and tool work |
| low | Minimal thinking, may skip it entirely | Classification, lookups, high-volume calls |

Here is the same arithmetic question at low effort. On easy inputs the model often skips thinking altogether, so no thinking block is produced and the bill drops.

resp = client.messages.create(
    model="claude-fable-5",
    max_tokens=2048,
    output_config={"effort": "low"},
    messages=[
        {"role": "user", "content": "A train leaves at 14:05 and arrives at 17:40. "
                                     "How long is the trip, in minutes?"}
    ],
)

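# Low effort may still produce a thinking block first, so select the text
# block explicitly instead of assuming content[0] is the answer.
print(next(b.text for b in resp.content if b.type == "text"))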
print("output:", resp.usage.output_tokens,
      "thinking:", resp.usage.output_tokens_details.thinking_tokens)

Example output:

The trip is 215 minutes.
output: 14 thinking: 0

Same correct answer, 14 output tokens instead of 226. For a question this simple the reasoning added nothing, and low effort let the model recognize that. The Fable 5 guidance from Anthropic is to start at high for most work, move up to xhigh for the most capability-sensitive jobs, and drop to medium or low for routine traffic. Lower effort on Fable 5 still tends to beat xhigh on older models, so the floor is higher than you might expect.
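To see what that gap means at volume, the arithmetic can be sketched as below. output_savings is a hypothetical helper, the price is the $50 per million output tokens quoted earlier, and the projection assumes every request behaves like this one example, which real traffic will not.

```python
def output_savings(tokens_before: int, tokens_after: int, requests: int,
                   price_per_million_output: float = 50.0) -> dict:
    """Project output-token savings from lowering effort, for a given request count."""
    saved_per_request = tokens_before - tokens_after
    return {
        "reduction_pct": round(100 * saved_per_request / tokens_before, 1),
        "usd_saved": saved_per_request * requests * price_per_million_output / 1_000_000,
    }


# 226 output tokens at the default effort versus 14 at low effort.
projection = output_savings(226, 14, requests=1_000_000)
print(projection)
```

With these inputs the reduction is about 93.8 percent, or $10,600 per million requests. Treat it as an upper bound for easy traffic and measure your own mix before budgeting around it.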

Step 3: Decide what comes back with display

Effort changes how much the model thinks. thinking.display changes how much of that thinking is sent back to you. It does not change the bill: you pay for the full reasoning under every setting.

  • "omitted" (the Fable 5 default): thinking blocks still appear in content, but their thinking field is an empty string. The encrypted reasoning rides along in the signature field for multi-turn continuity. Bonus: when streaming, text starts sooner because the server skips streaming reasoning tokens.
  • "summarized": thinking blocks carry a readable summary of the reasoning. Set this explicitly when you want to show or log the model's reasoning.

Ask for a summary like this:

resp = client.messages.create(
    model="claude-fable-5",
    max_tokens=4096,
    thinking={"type": "adaptive", "display": "summarized"},
    output_config={"effort": "high"},
    messages=[
        {"role": "user", "content": "Our checkout error rate jumped from 0.2% to 3% "
                                     "after a deploy. Walk through how you'd localize the cause."}
    ],
)

for block in resp.content:
    if block.type == "thinking":
        print("--- reasoning summary ---")
        print(block.thinking)          # populated because display=summarized
    elif block.type == "text":
        print("--- answer ---")
        print(block.text)

With display set to "omitted" the same loop prints an empty reasoning summary, because block.thinking is "" even though the block is still there and still billed. If your code reads block.thinking and expects text, that empty string is the silent default change biting you, not a model failure.
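Code that has to work under either display setting can treat an empty thinking field as "nothing to show". The following is a minimal sketch, with split_blocks as a hypothetical helper and SimpleNamespace objects standing in for the SDK's content blocks.

```python
from types import SimpleNamespace


def split_blocks(content):
    """Return (reasoning_summaries, answer_text) from a list of content blocks."""
    summaries, answer_parts = [], []
    for block in content:
        if block.type == "thinking":
            if block.thinking:  # empty string when display is "omitted"
                summaries.append(block.thinking)
        elif block.type == "text":
            answer_parts.append(block.text)
    return summaries, "".join(answer_parts)


# Stand-in blocks shaped like a display="omitted" response.
omitted = [
    SimpleNamespace(type="thinking", thinking="", signature="..."),
    SimpleNamespace(type="text", text="Start by diffing the deploy."),
]
print(split_blocks(omitted))  # → ([], 'Start by diffing the deploy.')
```

The helper only reads blocks for display or logging. It should not be used to rebuild the assistant turn you send back, since thinking blocks need to be returned as received.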

Worked example: a cost-aware support router

Put the levers together in something you would actually ship. This router triages incoming support messages. Routing is a cheap classification job, so it runs at low effort. Only tickets it classifies as technical get a second, high-effort pass that produces a real troubleshooting plan. This is the dynamic-effort pattern: spend reasoning where it pays off.

import anthropic, json

client = anthropic.Anthropic()

def classify(ticket: str) -> dict:
    """Cheap pass: pick a category. Low effort, no thinking needed."""
    resp = client.messages.create(
        model="claude-fable-5",
        max_tokens=64,
        output_config={"effort": "low"},
        messages=[{
            "role": "user",
            "content": (
                "Classify this support ticket as one word: "
                "billing, technical, or general. Reply with the word only.\n\n"
                f"Ticket: {ticket}"
            ),
        }],
    )
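    # A thinking block can precede the answer, so select the text block explicitly.
    text = next(b.text for b in resp.content if b.type == "text")
    category = text.strip().lower()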
    return {"category": category, "thinking": resp.usage.output_tokens_details.thinking_tokens}

def deep_help(ticket: str) -> dict:
    """Expensive pass: only for hard technical tickets. High effort."""
    resp = client.messages.create(
        model="claude-fable-5",
        max_tokens=4096,
        output_config={"effort": "high"},
        thinking={"type": "adaptive", "display": "summarized"},
        messages=[{
            "role": "user",
            "content": f"Give a concrete step-by-step troubleshooting plan.\n\nTicket: {ticket}",
        }],
    )
    answer = next(b.text for b in resp.content if b.type == "text")
    return {"answer": answer, "thinking": resp.usage.output_tokens_details.thinking_tokens}

def handle(ticket: str) -> dict:
    c = classify(ticket)
    if c["category"] == "technical":
        d = deep_help(ticket)
        return {"category": c["category"], "plan": d["answer"],
                "thinking_tokens": c["thinking"] + d["thinking"]}
    return {"category": c["category"], "plan": None, "thinking_tokens": c["thinking"]}

for t in [
    "I was charged twice for my May invoice.",
    "The API returns 504 only when I enable streaming over HTTP/2.",
]:
    out = handle(t)
    print(json.dumps({"category": out["category"],
                      "thinking_tokens": out["thinking_tokens"],
                      "has_plan": out["plan"] is not None}))

Example output:

{"category": "billing", "thinking_tokens": 0, "has_plan": false}
{"category": "technical", "thinking_tokens": 734, "has_plan": true}

The billing ticket cost zero thinking tokens: a flat classification at low effort, routed onward without burning reasoning. The technical ticket earned a deep pass and spent 734 thinking tokens on a plan worth reading. If you had run every ticket at high effort, the billing ticket alone would have cost a few hundred thinking tokens for a one-word answer. Multiply by your daily ticket volume and the routing split is the difference between a sensible bill and a scary one.
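The router compares the classifier's reply to the exact string "technical", so a reply such as "Technical." would silently skip the deep pass. One possible hardening step is to normalize the reply first. This is a sketch under assumptions: normalize_category is a hypothetical helper, and falling back to "general" for unrecognized replies is a design choice you may want to reverse if missing a technical ticket is worse than paying for an extra deep pass.

```python
import re

VALID_CATEGORIES = {"billing", "technical", "general"}


def normalize_category(raw: str, default: str = "general") -> str:
    """Map a free-text model reply onto one of the known categories."""
    # Lowercase and split into alphabetic words, dropping punctuation and whitespace.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_CATEGORIES:
            return word
    return default


print(normalize_category("Technical."))       # → technical
print(normalize_category("no idea, sorry"))   # → general
```

In classify, the result of this function would replace the bare strip().lower() call.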

Step 4: Multi-turn and the model-switch trap

When you continue a conversation on Fable 5, pass each assistant thinking block back exactly as you received it, including blocks whose thinking field is empty. The signature carries the encrypted reasoning; the API rejects blocks whose content was edited or reconstructed. Reading the summary for display is fine. Rewriting it is not.
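In practice that means appending the assistant content to your history as one unit rather than rebuilding it block by block. A minimal sketch, assuming history is kept as a list of message dicts; continue_conversation is a hypothetical helper and the signature value is a placeholder for the opaque string the API returns.

```python
def continue_conversation(history: list, assistant_content: list, user_text: str) -> list:
    """Build the next request's messages, echoing assistant blocks unmodified."""
    return history + [
        # Passed through as received, including thinking blocks with empty text.
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": user_text},
    ]


history = [{"role": "user", "content": "Why did checkout errors spike?"}]
assistant_content = [
    {"type": "thinking", "thinking": "", "signature": "<opaque>"},
    {"type": "text", "text": "Most likely the new payment client."},
]
messages = continue_conversation(history, assistant_content, "How do I confirm that?")
print(len(messages))  # → 3
```

The function returns a new list and never copies or edits the blocks, so the thinking block that goes back is the same object that came in.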

The trap appears when you switch models, for example falling back to Opus 4.8 after a Fable 5 refusal. Thinking blocks are tied to the model that produced them. Other models do not reject them; they silently ignore them, but the ignored blocks still add input tokens you pay for. So strip thinking and redacted_thinking blocks from prior assistant turns before sending history to a different model.

def strip_thinking(messages):
    """Remove thinking blocks before sending history to a different model."""
    cleaned = []
    for m in messages:
        if m["role"] == "assistant" and isinstance(m.get("content"), list):
            kept = [b for b in m["content"]
                    if b.get("type") not in ("thinking", "redacted_thinking")]
            cleaned.append({**m, "content": kept})
        else:
            cleaned.append(m)
    return cleaned

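# Same model -> keep blocks untouched. Different model -> strip first.
# `conversation` is the message list you have accumulated so far, with
# content blocks stored as dicts (strip_thinking relies on b.get).
history_for_opus = strip_thinking(conversation)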

One exception worth remembering: if you use server-side or middleware fallback, the fallback block that marks the handoff stays exactly where it appeared, and fallback-credit retries must echo the refused request body unchanged. Those flows manage the thinking blocks for you.

Common pitfalls

  • Expecting to disable thinking. Passing thinking={"type": "disabled"} on Fable 5 returns a 400. If you truly need a zero-reasoning path, use low effort (the model may skip thinking) or route that traffic to a non-Fable model.
  • Reading block.thinking and getting an empty string. The Fable 5 default is display: "omitted". Set thinking={"type": "adaptive", "display": "summarized"} if you want readable reasoning. Nothing is broken; the default changed from Opus 4.6.
  • Assuming omitted saves money. It saves latency, not cost. You are billed for full thinking tokens under every display setting. Use effort and max_tokens to cut cost; use display to cut bytes and time-to-first-token.
  • Forgetting that max_tokens covers thinking too. It is a hard ceiling on thinking plus response text. At high or xhigh a tight max_tokens can get eaten by reasoning, and you will see stop_reason: "max_tokens" with a truncated answer. Raise max_tokens or lower effort.
  • Trying to extract the raw chain of thought. Prompts that ask the model to dump its internal reasoning into the answer can be refused with stop_details.category: "reasoning_extraction". Read the thinking blocks instead of prompting for reasoning in the text.
  • Carrying thinking blocks across a model switch. They are ignored by other models but still billed as input. Strip them before sending history to a fallback model, except for the structural fallback block in a fallback response.
  • Tuning effort by feel. Effort is a soft signal, not a fixed budget, and its impact varies by task. Log output_tokens_details.thinking_tokens and measure quality on your own evals before locking a level into production.
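The max_tokens pitfall above is easy to detect in code. The sketch below is one possible check, not an official pattern: diagnose_truncation is a hypothetical helper, SimpleNamespace objects stand in for SDK responses, and it assumes that a response whose budget went entirely to reasoning comes back with no non-empty text block.

```python
from types import SimpleNamespace


def diagnose_truncation(resp) -> str:
    """Classify a response as ok, cut off mid-answer, or exhausted before answering."""
    if resp.stop_reason != "max_tokens":
        return "ok"
    has_text = any(b.type == "text" and b.text.strip() for b in resp.content)
    # Text present: the answer started but was cut off.
    # No text: the ceiling was reached before any answer was written.
    return "truncated_answer" if has_text else "no_answer_before_limit"


cut_off = SimpleNamespace(
    stop_reason="max_tokens",
    content=[SimpleNamespace(type="thinking", thinking="", signature="...")],
)
print(diagnose_truncation(cut_off))  # → no_answer_before_limit
```

Either non-ok result is a signal to raise max_tokens or lower effort before retrying.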

Quick reference

| Lever | Where | Controls | Notes |
| --- | --- | --- | --- |
| effort | output_config={"effort": ...} | How hard the model thinks/acts | low, medium, high, xhigh, max; default high |
| thinking.display | thinking={"type":"adaptive","display":...} | Summary vs empty thinking text | Fable 5 default is omitted |
| max_tokens | top-level | Hard cap on thinking + text | Truncates with stop_reason max_tokens |
| thinking_tokens | usage.output_tokens_details | Billed reasoning, observability | Read-only; <= output_tokens |
| thinking disabled | n/a | Not available | Rejected with 400 on Fable 5 |

Next steps

  • Add thinking_tokens to your request logs today, even before you tune anything. You cannot manage a cost you are not measuring.
  • Audit copied snippets for block.thinking reads and add display: "summarized" where you actually need the text.
  • Map your traffic to effort levels: classification and lookups to low, default work to high, long-horizon agents to xhigh, then verify on evals.
  • Pair this with refusal fallback so a Fable 5 classifier decline retries on Opus 4.8, and remember to strip thinking blocks on that switch.
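The traffic-to-effort mapping from the list above can live in one place so that changing a level is a one-line edit. A minimal sketch: the workload names and the request_options helper are hypothetical, and the levels shown are the starting points suggested in this guide, to be confirmed on your own evals.

```python
# Assumed workload labels; rename to match how you tag traffic.
EFFORT_BY_WORKLOAD = {
    "classification": "low",
    "lookup": "low",
    "default": "high",
    "long_horizon_agent": "xhigh",
}


def request_options(workload: str) -> dict:
    """Return keyword arguments carrying the effort level mapped to a workload."""
    effort = EFFORT_BY_WORKLOAD.get(workload, EFFORT_BY_WORKLOAD["default"])
    return {"output_config": {"effort": effort}}


print(request_options("classification"))  # → {'output_config': {'effort': 'low'}}
```

The returned dict can be unpacked into a call, for example client.messages.create(model="claude-fable-5", max_tokens=64, messages=msgs, **request_options("classification")). Unknown workloads fall back to high, matching the model's own default.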

Fable 5 gives you a single model that can be both a frugal classifier and a relentless reasoner. The difference is one parameter and a habit of measuring tokens. Set effort deliberately, watch thinking_tokens, and the always-on reasoning becomes an asset instead of a surprise on the invoice.
