
Fable 5 Prompt Caching: Slash 1M-Token Codebase Costs
Summary
Reuse a huge codebase prefix across every Fable 5 call and pay up to ~90% less for it.
Anthropic shipped Claude Fable 5 on June 9, 2026 with a 1,000,000-token context window and up to 128k output tokens. Stripe reported it ran a codebase-wide migration across a 50-million-line Ruby repo in a day, work they estimated at two months by hand. The catch nobody mentions in the hype threads: feeding a model that much context is expensive. At $10 per million input tokens, dumping a 400k-token codebase into every request burns $4 of input on each call before the model writes a single line.
Prompt caching is the fix. You cache the big, unchanging part of your prompt once, then every follow-up request reads it back at roughly one tenth of the input price. This guide shows you exactly how to wire that up against Fable 5: load a real codebase as context, set a cache breakpoint, reuse it across a loop of questions, read the usage counters to prove the cache hit, and handle Fable 5's Opus 4.8 safety fallback. You will finish with a working codebase Q&A agent and a clear picture of what it costs.
Access note (read this first): on June 12, 2026 Anthropic posted that access to Fable 5 and Mythos 5 was suspended following a US government export-control directive. Check status.anthropic.com and the model's API page before you budget a project around it. The caching technique below is identical for any current Claude model (Opus 4.8, Sonnet 4.6, Haiku 4.5), so swap the model id and the code still runs today.
Prerequisites
- Python 3.10+ and an Anthropic API key in the ANTHROPIC_API_KEY environment variable.
- The official SDK: pip install anthropic (1.x).
- A folder of source files to use as context. Anything from a few thousand to a few hundred thousand tokens works.
- Basic familiarity with the Messages API (client.messages.create).
Step 1 - Understand what caching actually saves
Three prices matter with Fable 5 caching. Base input is $10 per million tokens. Writing tokens into the cache costs 25% more than base input ($12.50/M) because the model still has to process them the first time. Reading from the cache costs ~10% of base input ($1.00/M). Output is unchanged at $50/M. So caching only pays off when you reuse the same prefix more than once within the cache lifetime (5 minutes by default, refreshed on every hit; opt into 1 hour with a TTL).
Worked math: say your codebase prefix is 300,000 tokens and you ask 20 questions about it in a session.
- No cache: 20 x 300k x $10/M = $60.00 in prefix input alone.
- With cache: first call writes 300k at $12.50/M = $3.75, then 19 reads x 300k x $1/M = $5.70. Total $9.45.
- That is an 84% cut on the prefix, before counting the small per-question tokens and output.
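The arithmetic above generalizes to any prefix size and call count. Here is a minimal sketch using the prices quoted in this step. The function names `uncached_cost` and `cached_cost` are illustrative, and the sketch assumes every call after the first lands within the cache lifetime:

```python
# Prices in USD per million tokens, as quoted in this guide for Fable 5.
BASE_IN = 10.00
CACHE_WRITE = 12.50
CACHE_READ = 1.00


def uncached_cost(prefix_tokens: int, calls: int) -> float:
    """Prefix input cost when every call resends the prefix at base price."""
    return calls * prefix_tokens * BASE_IN / 1e6


def cached_cost(prefix_tokens: int, calls: int) -> float:
    """Prefix input cost with one cache write followed by cache reads.

    Assumes every call after the first hits the cache (no expiry).
    """
    if calls <= 0:
        return 0.0
    write = prefix_tokens * CACHE_WRITE / 1e6
    reads = (calls - 1) * prefix_tokens * CACHE_READ / 1e6
    return write + reads


for calls in (1, 2, 20):
    plain = uncached_cost(300_000, calls)
    cached = cached_cost(300_000, calls)
    print(f"{calls:>2} calls: uncached ${plain:.2f}, cached ${cached:.2f}")
```

With a single call the cached path is the more expensive one ($3.75 against $3.00). From the second call on it is cheaper ($4.05 against $6.00), and the gap widens with each reuse.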
Step 2 - Load a codebase into a single context block
First, concatenate your source files into one string with clear file markers so the model can reference paths. Keep it deterministic (sorted paths) so the cached prefix stays byte-identical between requests, which is what lets the cache match.
import glob
import os

def load_codebase(root, exts=(".py", ".js", ".ts", ".go")):
    """Concatenate matching source files under root into one string."""
    # Sort the paths so the output is byte-identical from run to run.
    paths = sorted(
        p for p in glob.glob(os.path.join(root, "**", "*"), recursive=True)
        if p.endswith(exts) and os.path.isfile(p)
    )
    parts = []
    for p in paths:
        rel = os.path.relpath(p, root)
        # Fixed encoding; undecodable bytes are dropped instead of raising.
        with open(p, "r", encoding="utf-8", errors="ignore") as f:
            parts.append(f"### FILE: {rel}\n{f.read()}")
    return "\n\n".join(parts)

codebase = load_codebase("./my_project")
print(f"{len(codebase):,} chars (~{len(codebase)//4:,} tokens)")
Rough rule: ~4 characters per token for English text and code. A 1.2M-character dump is around 300k tokens, well inside Fable 5's 1M window and far above the 1,024-token minimum a block needs to be cacheable.
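Because the cache only matches a byte-identical prefix, it can help to fingerprint the concatenated string and compare it between runs. Here is a minimal sketch using only the standard library. `prefix_fingerprint` is an illustrative helper, not part of the Anthropic SDK, and the two short strings stand in for real load_codebase output:

```python
import hashlib


def prefix_fingerprint(text: str) -> str:
    """Return a short SHA-256 fingerprint of the exact bytes of the prefix."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# Stand-in strings; in the guide's flow you would pass load_codebase(...) output.
stable = "### FILE: a.py\nprint('a')\n\n### FILE: b.py\nprint('b')"
reordered = "### FILE: b.py\nprint('b')\n\n### FILE: a.py\nprint('a')"

# Same bytes give the same fingerprint; a change in file order does not.
print(prefix_fingerprint(stable) == prefix_fingerprint(stable))
print(prefix_fingerprint(stable) == prefix_fingerprint(reordered))
```

If the fingerprint changes between two requests that you expected to share a cache, something volatile (file order, a timestamp, an edited file) has leaked into the prefix.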
Step 3 - Set a cache breakpoint with cache_control
You mark the end of the cacheable prefix by attaching cache_control to a content block. Put the stable codebase in the system array and tag its last block. Everything up to and including that block becomes the cached prefix; the per-question message stays outside it.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
MODEL = "claude-fable-5"  # swap to claude-opus-4-8 to run today

SYSTEM = [
    {"type": "text",
     "text": "You are a senior engineer answering questions about a codebase. "
             "Cite file paths. If unsure, say so."},
    {"type": "text",
     "text": codebase,
     # This breakpoint caches the system prompt + the whole codebase above it.
     "cache_control": {"type": "ephemeral"}},
]

def ask(question):
    return client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=SYSTEM,
        messages=[{"role": "user", "content": question}],
    )
The default cache lifetime is 5 minutes and refreshes on each hit, so an active session keeps it warm. For a longer-lived agent, request the extended TTL: "cache_control": {"type": "ephemeral", "ttl": "1h"}.
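To switch between the default and extended lifetime without editing the block by hand, the system array can be built by a small helper. This is a sketch, assuming the only TTL value you need is the "1h" mentioned above. `build_system` is a hypothetical name, and the placeholder strings stand in for the real instructions and codebase:

```python
from typing import Optional


def build_system(instructions: str, codebase: str, ttl: Optional[str] = None) -> list:
    """Build a system array whose last block carries the cache breakpoint.

    ttl=None keeps the default lifetime; ttl="1h" requests the extended one.
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl
    return [
        {"type": "text", "text": instructions},
        # Breakpoint goes on the final stable block so the whole prefix is cached.
        {"type": "text", "text": codebase, "cache_control": cache_control},
    ]


# Placeholder inputs for demonstration only.
system_default = build_system("Answer questions about this codebase.", "### FILE: a.py")
system_long = build_system("Answer questions about this codebase.", "### FILE: a.py", ttl="1h")
print(system_default[-1]["cache_control"])
print(system_long[-1]["cache_control"])
```

Keeping the breakpoint logic in one place also makes it harder to tag the wrong block by accident.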
Step 4 - Reuse the cache and read the usage counters
The proof that caching works is in response.usage. The first call reports cache_creation_input_tokens (you paid the write premium). Every later call with the identical prefix reports cache_read_input_tokens instead (the cheap read). Here is a loop that asks several questions and prints a live cost breakdown.
PRICES = {  # USD per token, Fable 5
    "in": 10/1e6, "cache_write": 12.5/1e6,
    "cache_read": 1/1e6, "out": 50/1e6,
}

def cost(u):
    """Price one response from its usage counters."""
    def n(field):
        # Counters can be missing or None; treat both as zero.
        return getattr(u, field, 0) or 0
    return (n("input_tokens") * PRICES["in"]
            + n("cache_creation_input_tokens") * PRICES["cache_write"]
            + n("cache_read_input_tokens") * PRICES["cache_read"]
            + n("output_tokens") * PRICES["out"])

questions = [
    "Where is request authentication handled?",
    "List every place we touch the database directly.",
    "Which functions lack error handling?",
]

total = 0.0
for q in questions:
    r = ask(q)
    u = r.usage
    total += cost(u)
    print(f"Q: {q}")
    print(f" write={u.cache_creation_input_tokens} "
          f"read={u.cache_read_input_tokens} out={u.output_tokens} "
          f"| ${cost(u):.4f}")
print(f"TOTAL: ${total:.4f}")
Example output (300k-token prefix, abbreviated):
Q: Where is request authentication handled?
write=300184 read=0 out=212 | $3.7659
Q: List every place we touch the database directly.
write=0 read=300184 out=389 | $0.3197
Q: Which functions lack error handling?
write=0 read=300184 out=433 | $0.3218
TOTAL: $4.4074
The first call writes the cache; the next two read the prefix at a tenth of the base input price. Without caching, the prefix input alone for those same three calls would cost roughly $9.01 (3 x 300,184 tokens x $10/M), or about $9.06 once output is included. The win grows with every additional question in the session.
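The same counters can be folded into a single hit rate for a session. Here is a minimal sketch. `cache_hit_rate` is an illustrative helper, and the `SimpleNamespace` objects stand in for `response.usage`, using the token counts from the example output above:

```python
from types import SimpleNamespace


def cache_hit_rate(usages) -> float:
    """Fraction of cacheable prefix tokens that were served from the cache.

    Counts only tokens written to or read from the cache; missing or None
    counters are treated as zero.
    """
    written = sum(getattr(u, "cache_creation_input_tokens", 0) or 0 for u in usages)
    read = sum(getattr(u, "cache_read_input_tokens", 0) or 0 for u in usages)
    total = written + read
    return read / total if total else 0.0


# Stand-ins for the three responses in the example session.
session = [
    SimpleNamespace(cache_creation_input_tokens=300184, cache_read_input_tokens=0),
    SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=300184),
    SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=300184),
]
print(f"hit rate: {cache_hit_rate(session):.0%}")
```

For one write followed by two reads this comes to two thirds. A rate that falls during a session suggests the prefix changed or the cache expired between calls.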
Step 5 - Detect the Opus 4.8 safety fallback
Fable 5 ships with classifiers for cybersecurity, biology/chemistry, and model distillation. When one triggers, your request is answered by Claude Opus 4.8 instead. Anthropic says fewer than 5% of sessions hit a fallback, but for a codebase agent it matters: a question about, say, a security exploit path can silently route to a different model. The reliable signal is the model field on the response, which reports the model that actually answered.
def ask_checked(question):
    r = ask(question)
    answered_by = r.model
    if not answered_by.startswith("claude-fable-5"):
        print(f"[fallback] answered by {answered_by} (safety classifier)")
    text = "".join(b.text for b in r.content if b.type == "text")
    return text, answered_by

answer, who = ask_checked("Explain the password reset flow.")
print(who, '->', answer[:120])
Note that a fallback to a different model also breaks your cache hit, because the cached prefix belongs to the model you wrote it under. A fallback call therefore pays full input price for that turn. Treat the model field as both a safety check and a cost-anomaly check.
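Both checks can be combined into one label per turn. This is a sketch only: `classify_turn` and `fake` are hypothetical helpers, the stand-in objects mimic the `model` and `usage` fields used above, and the exact model string a fallback reports is an assumption:

```python
from types import SimpleNamespace


def classify_turn(response, expected_prefix: str = "claude-fable-5") -> str:
    """Label a response as 'fallback', 'cache_hit', 'cache_write', or 'uncached'."""
    if not response.model.startswith(expected_prefix):
        return "fallback"
    usage = response.usage
    if (getattr(usage, "cache_read_input_tokens", 0) or 0) > 0:
        return "cache_hit"
    if (getattr(usage, "cache_creation_input_tokens", 0) or 0) > 0:
        return "cache_write"
    return "uncached"


def fake(model, write=0, read=0):
    """Build a stand-in response object for demonstration."""
    usage = SimpleNamespace(cache_creation_input_tokens=write, cache_read_input_tokens=read)
    return SimpleNamespace(model=model, usage=usage)


print(classify_turn(fake("claude-fable-5", write=300184)))
print(classify_turn(fake("claude-fable-5", read=300184)))
print(classify_turn(fake("claude-opus-4-8", write=300184)))
```

Logging this label next to the per-turn cost makes an unexpectedly expensive turn easy to attribute to either a fallback or a cold cache.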
Worked example: a codebase migration assistant
Tie it together into the kind of tool the Stripe story is about: load a repo once, then ask the model to plan and draft a migration across many files. The prefix is cached, so each step in the plan is cheap even though the model sees the whole codebase every time.
def migration_session(root, steps):
    global SYSTEM
    code = load_codebase(root)
    SYSTEM = [
        {"type": "text", "text": "You are migrating this codebase. "
         "Output a unified diff per file. Keep changes minimal and safe."},
        {"type": "text", "text": code,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},
    ]
    spend = 0.0
    for s in steps:
        r = ask(s)
        spend += cost(r.usage)
        diff = "".join(b.text for b in r.content if b.type == "text")
        print(f"\n=== {s} ===\n{diff[:600]}")
    print(f"\nSession spend: ${spend:.2f}")

migration_session("./legacy_app", [
    "Step 1: find every call to the deprecated requests.get and list the files.",
    "Step 2: rewrite those calls to use httpx with timeouts. Diff per file.",
    "Step 3: add a retry wrapper and apply it. Diff per file.",
])
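Because steps 2 and 3 reuse the cached repo, a three-step migration over a 300k-token codebase costs about $4.35 in prefix input ($3.75 to write, then 2 x $0.30 to read) instead of $9.00 uncached. Scale the step list up and the marginal prefix cost stays flat at roughly $0.30 per step: ten steps come to about $6.45 cached against $30 uncached.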
Common pitfalls
- Changing the prefix invalidates the cache. The cached portion is matched byte-for-byte. If you reorder files, inject a timestamp, or shuffle a dict into the system text, you pay the write premium again. Build the prefix deterministically (sorted paths, no volatile data).
- Putting the variable part before the cache breakpoint. Anything that changes per request (the user's question, the date) must come after the last cached block, otherwise the prefix differs every time and never hits.
- Blocks under 1,024 tokens are not cached. Small system prompts silently skip caching. Caching pays off on large, stable context, which is the whole point here.
- Letting the cache expire. Default TTL is 5 minutes from the last hit. A slow user or a long tool call can let it lapse, and the next call pays a fresh write. Use ttl: "1h" for bursty or long-running agents.
- Forgetting the write premium in your math. The first call is more expensive than an uncached call, not cheaper. Caching is a bet that you will reuse the prefix; for a true one-shot request it is a small net loss.
- Assuming Fable 5 answered. A safety fallback to Opus 4.8 changes both the answer quality profile and the cost (no cache hit). Always inspect response.model.
- Counting on availability. Fable 5 access was suspended on June 12, 2026. Wrap your client so it can swap to claude-opus-4-8 automatically if Fable returns an availability error.
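The last pitfall can be handled with a thin wrapper that retries on a second model. This is a sketch under stated assumptions: `create` is any callable with the keyword interface of client.messages.create, and this guide does not specify which exception signals unavailability, so the caller passes the exception types to treat that way (with the SDK, `anthropic.APIStatusError` would be a plausible candidate). `create_with_fallback`, `Unavailable`, and `stub_create` are hypothetical names used for the demonstration:

```python
def create_with_fallback(create, models, unavailable=(Exception,), **kwargs):
    """Try each model id in order; return (response, model_id) from the first success.

    `unavailable` lists the exception types that mean "try the next model".
    Any other exception propagates immediately.
    """
    last_error = None
    for model in models:
        try:
            return create(model=model, **kwargs), model
        except unavailable as exc:
            last_error = exc
    raise RuntimeError(f"no model available from {list(models)}") from last_error


# Demonstration with a stub instead of a live client.
class Unavailable(Exception):
    pass


def stub_create(model, **kwargs):
    """Pretend the first model is suspended and the second one answers."""
    if model == "claude-fable-5":
        raise Unavailable("access suspended")
    return {"model": model, "echo": kwargs["messages"]}


response, used = create_with_fallback(
    stub_create,
    ["claude-fable-5", "claude-opus-4-8"],
    unavailable=(Unavailable,),
    messages=[{"role": "user", "content": "ping"}],
)
print(used)
```

Since the cached prefix belongs to the model it was written under, the first call that lands on the second model pays a fresh cache write rather than a read.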
Quick reference
| Item | Value / Detail |
|---|---|
| Model id | claude-fable-5 |
| Context window | 1,000,000 input tokens |
| Max output | 128,000 tokens |
| Base input price | $10 / M tokens |
| Cache write price | $12.50 / M (base + 25%) |
| Cache read price | $1.00 / M (~10% of base) |
| Output price | $50 / M tokens |
| Cache marker | cache_control: {type: ephemeral} |
| Default TTL | 5 min, refreshed on hit |
| Extended TTL | ttl: "1h" |
| Min cacheable block | 1,024 tokens |
| Usage fields | cache_creation_input_tokens, cache_read_input_tokens |
| Safety fallback | Opus 4.8 (<5% of sessions); check response.model |
Next steps
- Add a second cache breakpoint for tool definitions if you give the agent tools. Tools sit ahead of the system prompt in the cached prefix, so the tool definitions stay cached even when the codebase changes.
- Stream responses with client.messages.stream for a snappier migration UI; usage counters still arrive at the end.
- Batch independent questions through the Message Batches API for another discount on non-interactive runs.
- Log cache_read_input_tokens over time to confirm your hit rate stays high in production.
- Build an automatic fallback wrapper: try claude-fable-5, catch availability errors, retry on claude-opus-4-8 with the same prefix. Caches are per model, so the first retry pays a fresh cache write.
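For the first item, here is a sketch of what a request with two breakpoints could look like. `build_request` and the `read_file` tool are hypothetical; the only assumption about the API is that a tool definition accepts the same cache_control marker as a system block:

```python
def build_request(tools: list, system: list, question: str) -> dict:
    """Return keyword arguments for client.messages.create with two breakpoints."""
    tools = [dict(t) for t in tools]  # shallow copies so the inputs are not mutated
    if tools:
        # Breakpoint 1: marks the end of the tool definitions.
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "tools": tools,
        "system": system,  # breakpoint 2 is expected on the last system block
        "messages": [{"role": "user", "content": question}],
    }


# Hypothetical tool definition for illustration.
read_file = {
    "name": "read_file",
    "description": "Return the contents of one file in the repository.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
system = [
    {"type": "text", "text": "### FILE: a.py", "cache_control": {"type": "ephemeral"}},
]

request = build_request([read_file], system, "Where is auth handled?")
print(request["tools"][-1]["cache_control"])
```

The returned dict would be merged with model and max_tokens and passed to client.messages.create in the same way as the ask helper does.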
Sources: Anthropic, "Claude Fable 5 and Claude Mythos 5" (Jun 9, 2026) and the Fable/Mythos access update (Jun 12, 2026); Anthropic Claude API prompt caching documentation.