
Claude Opus 4.8 Fast Mode: 2.5x Faster Output in Python
Summary
Use speed:"fast" on Claude Opus 4.8 for up to 2.5x faster output, with a safe rate-limit fallback.
Anthropic shipped Claude Opus 4.8 on May 28, 2026, and a few days later quietly turned on something a lot of agent builders had been waiting for: fast mode for the 4.8 tier. Set one field, speed: "fast", and you get up to 2.5x higher output tokens per second from the exact same model weights. No quality trade-off, no distilled mini-model, just faster generation.
Why does this matter right now? Opus-tier intelligence has always been the slow lane. When you run Opus inside an agent loop that emits long diffs, multi-step plans, or big structured JSON, the bottleneck is rarely the thinking; it is the time spent streaming hundreds or thousands of output tokens back to you. Fast mode attacks exactly that. For latency-sensitive products like live coding assistants, support copilots, and anything a human is watching token-by-token, this is the difference between "usable" and "painful."
This guide is hands-on. You will make your first fast-mode call, measure the real speedup yourself, build a production-safe fallback for when you hit the fast-mode rate limit, and combine fast mode with the effort parameter so cheap agent steps stay snappy. Every code sample is checked against the official Anthropic docs. There are real gotchas (fast mode does not reduce time-to-first-token, and it is a research preview behind a waitlist), and they are all covered below.
Prerequisites
- Python 3.9+ and the official Anthropic SDK: pip install -U anthropic (you want a recent build that knows about speed and betas).
- An Anthropic API key in the ANTHROPIC_API_KEY environment variable.
- Fast-mode access. It is a beta research preview with a dedicated waitlist at claude.com/fast-mode. Until your org is granted access, fast requests return an error, so the fallback pattern in this guide is not optional.
- Comfort with the Messages API. If you have called client.messages.create(...) before, you are ready.
Set your key once in the shell:
export ANTHROPIC_API_KEY="sk-ant-..."
pip install -U anthropic
What fast mode actually is (and is not)
Fast mode runs the same model with a faster inference configuration. The weights, the behavior, and the answers are identical to standard Opus 4.8. You are not switching to a smaller or dumber model. You are paying a premium to have Anthropic serve your request on a faster path.
The single most important thing to understand is what gets faster. The speedup is measured in output tokens per second (OTPS), not time to first token (TTFT). Claude still takes roughly the same time to start replying; once it starts, the tokens stream out up to 2.5x faster. So fast mode helps most on responses with lots of output: long code generations, big JSON payloads, detailed multi-step plans. It does almost nothing for a one-line answer.
- Same intelligence: identical weights and behavior, not a different model.
- Up to 2.5x OTPS: the benefit is throughput on output, not startup latency.
- Premium pricing: fast mode is billed at 6x standard Opus rates across the full context window (the published fast-mode rate is $30 / MTok input and $150 / MTok output; confirm the exact Opus 4.8 preview number on the pricing page).
- Research preview: beta, waitlisted, with its own separate rate limits.
- Not everywhere: not available on the Batch API, Priority Tier, or Claude Platform on AWS.
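To make the TTFT-versus-OTPS distinction concrete, here is a minimal back-of-the-envelope sketch. It assumes total response time is roughly TTFT plus output tokens divided by OTPS. The function names, the 1.5 s TTFT, and the 45 OTPS baseline are illustrative assumptions, not measured or published figures.

```python
def estimated_seconds(output_tokens: int, ttft_s: float, otps: float) -> float:
    """Rough latency model: startup time plus time spent generating tokens."""
    return ttft_s + output_tokens / otps


# Illustrative assumptions only: 1.5 s TTFT, 45 OTPS standard, 2.5x that in fast mode.
TTFT_S = 1.5
STANDARD_OTPS = 45.0
FAST_OTPS = STANDARD_OTPS * 2.5


def end_to_end_speedup(output_tokens: int) -> float:
    """How much sooner the whole response finishes, TTFT included."""
    standard = estimated_seconds(output_tokens, TTFT_S, STANDARD_OTPS)
    fast = estimated_seconds(output_tokens, TTFT_S, FAST_OTPS)
    return standard / fast


for tokens in (20, 200, 2000):
    print(f"{tokens:5d} output tokens -> {end_to_end_speedup(tokens):.2f}x end-to-end")
```

Under these assumptions a 20-token answer finishes only slightly sooner, while a 2,000-token generation gets close to the full 2.5x, which is why fast mode is worth paying for mainly on long outputs.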
Step 1: Your first fast-mode call
Fast mode is a beta feature, so you call it through client.beta.messages.create and pass the beta flag fast-mode-2026-02-01 plus speed="fast". Note that Opus 4.8 does not accept temperature, top_p, or top_k (setting any of them returns a 400), so we leave them out.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    speed="fast",
    betas=["fast-mode-2026-02-01"],
    messages=[
        {
            "role": "user",
            "content": "Refactor this Flask view to use dependency injection. "
                       "Return only the rewritten module.",
        }
    ],
)
print(response.content[0].text)
# Confirm which speed actually served the request:
print("served at:", response.usage.speed)  # "fast" or "standard"
The response object carries a new field, usage.speed, telling you which path actually served the request. Always read it. If your org is not yet granted fast access, or you fell back, this is how you know:
{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "usage": {
    "input_tokens": 412,
    "output_tokens": 1875,
    "speed": "fast"
  }
}
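Since usage.speed is the signal that tells you which path served a request, it is worth tallying it over time. The sketch below is one way to do that. FastHitTracker is a hypothetical helper, not part of the Anthropic SDK, and it assumes that a missing speed field means the request was served at standard speed.

```python
from collections import Counter


class FastHitTracker:
    """Hypothetical helper: tallies which speed actually served each request."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, usage) -> str:
        # Assumption: a usage object without a speed field was served at standard.
        served = getattr(usage, "speed", None) or "standard"
        self.counts[served] += 1
        return served

    def fast_hit_rate(self) -> float:
        """Fraction of recorded requests that were served on the fast path."""
        total = sum(self.counts.values())
        return self.counts["fast"] / total if total else 0.0
```

Call tracker.record(response.usage) after each request and log fast_hit_rate() periodically; a falling rate is an early sign that you are running into the fast-mode limit.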
Step 2: Measure the speedup yourself
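Do not take "2.5x" on faith; measure it on your own workload. The honest metric is OTPS: output tokens divided by wall-clock generation time. The script below runs the same long-output prompt twice, once standard and once fast, and prints OTPS for each. Because the calls are non-streaming, the elapsed time also includes time to first token, so the figures slightly understate pure generation speed.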
import time

import anthropic

client = anthropic.Anthropic()

PROMPT = ("Write a complete, well-commented Python module that implements "
          "an LRU cache with TTL expiry, thread safety, and unit tests.")

def run(speed):
    kwargs = dict(
        model="claude-opus-4-8",
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT}],
    )
    if speed == "fast":
        kwargs.update(speed="fast", betas=["fast-mode-2026-02-01"])
        create = client.beta.messages.create
    else:
        create = client.messages.create
    t0 = time.perf_counter()
    resp = create(**kwargs)
    elapsed = time.perf_counter() - t0
    out = resp.usage.output_tokens
    served = getattr(resp.usage, "speed", "standard")
    print(f"{speed:8s} -> served={served:8s} "
          f"{out} out tokens in {elapsed:5.1f}s = {out/elapsed:6.1f} OTPS")

run("standard")
run("fast")
A representative run on a long generation looks like this (your absolute numbers will vary with load and prompt size, but the ratio is the point):
standard -> served=standard 1840 out tokens in 41.0s = 44.9 OTPS
fast -> served=fast 1862 out tokens in 17.2s = 108.3 OTPS
That is roughly a 2.4x throughput gain on the part that actually hurts. Notice the token counts are nearly identical because the model is the same, only the serving speed changed.
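The ratio can be computed directly from the two runs rather than eyeballed. This small sketch reuses the token counts and timings from the representative run above; the otps helper name is an assumption for illustration.

```python
def otps(output_tokens: int, elapsed_s: float) -> float:
    """Output tokens per second over the whole wall-clock call."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return output_tokens / elapsed_s


# Numbers copied from the representative run above.
standard = otps(1840, 41.0)
fast = otps(1862, 17.2)
speedup = fast / standard

print(f"standard={standard:.1f} OTPS, fast={fast:.1f} OTPS, speedup={speedup:.2f}x")
# prints: standard=44.9 OTPS, fast=108.3 OTPS, speedup=2.41x
```

Run the benchmark several times and compare medians before drawing conclusions, since a single pair of calls is sensitive to server load.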
Step 3: A production-safe fast-then-standard fallback
Fast mode has its own dedicated rate limit, separate from standard Opus. When you exceed it, the API returns a 429 with a retry-after header. In production you usually do not want to block; you want to fall back to standard speed and keep moving. The pattern below tries fast first, and on a rate-limit error retries the same request without speed.
import anthropic

client = anthropic.Anthropic()

def create_with_fast_fallback(max_retries=None, max_attempts=3, **params):
    # max_retries is an SDK client option, not a request parameter, so apply
    # it with with_options() instead of passing it to create().
    api = client if max_retries is None else client.with_options(max_retries=max_retries)
    try:
        return api.beta.messages.create(**params)
    except anthropic.RateLimitError:
        # Fast capacity is full -> drop speed and serve at standard.
        if params.get("speed") == "fast":
            params.pop("speed", None)
            return create_with_fast_fallback(**params)
        raise
    except (anthropic.APIStatusError, anthropic.APIConnectionError) as error:
        # Retry only transient 5xx/connection errors, not 4xx.
        if isinstance(error, anthropic.APIStatusError) and error.status_code < 500:
            raise
        if max_attempts > 1:
            return create_with_fast_fallback(max_attempts=max_attempts - 1, **params)
        raise

message = create_with_fast_fallback(
    model="claude-opus-4-8",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this PR in 3 bullet points."}],
    betas=["fast-mode-2026-02-01"],
    speed="fast",
    max_retries=0,  # fail fast on the first 429 so we can fall back immediately
)
print(message.usage.speed)  # "fast" if it went through, "standard" if it fell back
Setting max_retries=0 on the first attempt is deliberate: it stops the SDK from silently waiting on the 429 so your code can decide to drop to standard speed right away. One caveat to plan for: falling back from fast to standard is a prompt-cache miss. Requests at different speeds do not share cached prefixes, so a fallback re-bills your input tokens at the standard (uncached) rate.
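Because every switch between speeds costs a cache miss, one option is to stay on standard for a short cooldown after a 429 instead of retrying fast on the very next request. The following is a sketch of that idea only. SpeedSelector, its cooldown_s default, and the injected clock are hypothetical, and the right cooldown depends on your traffic and on the retry-after values you actually observe.

```python
import time
from typing import Callable


class SpeedSelector:
    """Hypothetical helper: after a fast-mode 429, stay on standard for a cooldown."""

    def __init__(self, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._standard_until = float("-inf")  # no cooldown active at start

    def note_rate_limited(self, retry_after_s: float = 0.0) -> None:
        # Honor the server's retry-after if it is longer than our own cooldown.
        wait = max(self.cooldown_s, retry_after_s)
        self._standard_until = self._clock() + wait

    def current_speed(self) -> str:
        """Return "standard" while cooling down, otherwise "fast"."""
        return "standard" if self._clock() < self._standard_until else "fast"
```

Call note_rate_limited() from the RateLimitError handler and send speed="fast" only while current_speed() returns "fast". That keeps a burst of requests on one speed, so they can share cached prefixes with each other.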
Step 4: Combine fast mode with the effort parameter
Fast mode and the effort parameter solve different halves of the latency problem, and they stack. Fast mode makes each output token come out faster. Effort controls how many tokens Claude spends in the first place, across text, thinking, and tool calls. For a quick, low-stakes agent step, the fastest possible turn is low effort plus fast speed: fewer tokens, generated faster.
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    speed="fast",
    betas=["fast-mode-2026-02-01"],
    output_config={"effort": "low"},  # spend fewer tokens on this simple step
    messages=[
        {"role": "user", "content": "Classify this ticket: billing, bug, or feature?\n"
                                    "'I was charged twice this month.'"}
    ],
)
print(response.content[0].text)  # -> billing
print(response.usage.speed,          # fast
      response.usage.output_tokens)  # small, because effort is low
Opus 4.8 defaults to high effort everywhere, so for high-volume classification or routing you should set effort down explicitly. Reserve high, xhigh, and max for the steps that genuinely need deep reasoning, and pair them with fast mode only if a human is waiting on the output, since the premium price applies to every fast token.
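One way to apply that policy consistently is a small lookup that turns a step type into request parameters. This is a sketch under assumptions: the step names, the EFFORT_BY_STEP mapping, and the step_params helper are hypothetical, while speed, betas, and output_config mirror the calls shown earlier in this guide. The rule it encodes (fast for low-effort steps, fast for heavier steps only when a human is waiting) is one reading of the advice above, not an official recommendation.

```python
# Hypothetical mapping from agent step type to effort level.
EFFORT_BY_STEP = {
    "classify": "low",
    "route": "low",
    "review": "high",
    "plan": "xhigh",
}


def step_params(step: str, human_waiting: bool) -> dict:
    """Build the extra request parameters for one agent step."""
    # Unknown steps fall back to "high", matching the model's stated default.
    effort = EFFORT_BY_STEP.get(step, "high")
    params = {"output_config": {"effort": effort}}
    # Low-effort steps stay snappy on fast; heavier steps pay the premium
    # only when someone is watching the output arrive.
    if effort == "low" or human_waiting:
        params["speed"] = "fast"
        params["betas"] = ["fast-mode-2026-02-01"]
    return params
```

You would then spread the result into the call, for example client.beta.messages.create(model=..., max_tokens=..., messages=..., **step_params("classify", human_waiting=False)).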
Worked example: a snappy code-review turn
Here is the kind of place fast mode earns its premium. A reviewer bot reads a diff and streams back a long, structured review while the developer watches. The output is large (so OTPS dominates) and a human is waiting (so latency is visible). We stream the response and fall back to standard if fast capacity is gone.
import anthropic

client = anthropic.Anthropic()

with open("change.diff") as f:
    DIFF = f.read()

def stream_review(use_fast=True):
    params = dict(
        model="claude-opus-4-8",
        max_tokens=4096,
        output_config={"effort": "high"},  # reviews deserve real reasoning
        messages=[{
            "role": "user",
            "content": f"Review this diff. List concrete issues with file:line "
                       f"references, then a verdict.\n\n{DIFF}",
        }],
    )
    if use_fast:
        params.update(speed="fast", betas=["fast-mode-2026-02-01"])
        stream_fn = client.beta.messages.stream
    else:
        stream_fn = client.messages.stream
    try:
        with stream_fn(**params) as stream:
            for text in stream.text_stream:
                print(text, end="", flush=True)
            final = stream.get_final_message()
        # The non-beta fallback response may not carry usage.speed.
        served = getattr(final.usage, "speed", "standard")
        print(f"\n\n[served at {served}, "
              f"{final.usage.output_tokens} tokens]")
    except anthropic.RateLimitError:
        if use_fast:
            print("[fast full, falling back to standard]")
            return stream_review(use_fast=False)
        raise

stream_review()
Example tail of the output:
...
- auth/session.py:42 - token compared with == instead of hmac.compare_digest (timing leak)
- auth/session.py:88 - missing await on async refresh; the call is fire-and-forget
Verdict: request changes. Two correctness issues, one security-relevant.
[served at fast, 1623 tokens]
Common pitfalls and gotchas
- It does not lower TTFT. If your complaint is "Claude takes too long to start replying," fast mode will not fix it. The gain is output tokens per second, so it only pays off on long generations.
- Switching speed invalidates the prompt cache. Fast and standard requests do not share cached prefixes. Flip-flopping between them on an agent loop quietly destroys your cache-hit rate and inflates input cost.
- You must use the beta client and flag. Call client.beta.messages.create (or .stream) and include betas=["fast-mode-2026-02-01"]. Sending speed="fast" on the plain client, or to an unsupported model, errors out.
- Separate rate limits. Fast mode has its own pool. Watch the anthropic-fast-output-tokens-remaining and anthropic-fast-output-tokens-reset response headers so you can pace requests before you hit a 429.
- Premium price on every fast token. At 6x standard Opus rates, blindly flipping all traffic to fast is expensive. Use it where a human is actually waiting on long output.
- Not on Batch API, Priority Tier, or Claude Platform on AWS. Architect fast mode into your synchronous, latency-sensitive paths, not your bulk offline jobs.
- No sampling params on Opus 4.8. temperature, top_p, and top_k all 400 on 4.8 regardless of speed. Steer behavior with prompting and the effort parameter instead.
- It is a research preview. Access is waitlisted and capacity is limited while Anthropic gathers feedback, so treat fast as an optimization that can be unavailable, never as a hard dependency.
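The rate-limit headers mentioned in the list above can drive a simple pre-flight check. The sketch below takes a plain mapping of response headers. should_try_fast and min_tokens are hypothetical, and it assumes the remaining header holds an integer token count. How you obtain the headers depends on your client; in the Python SDK they are available through the raw-response interface.

```python
from typing import Mapping

REMAINING_HEADER = "anthropic-fast-output-tokens-remaining"


def should_try_fast(headers: Mapping[str, str], min_tokens: int = 4096) -> bool:
    """Return True if the last response suggests enough fast output budget remains."""
    raw = headers.get(REMAINING_HEADER)
    if raw is None:
        # No information yet (for example, the first request): try fast.
        return True
    try:
        remaining = int(raw)
    except ValueError:
        # Unexpected format: do not block fast mode on a parse failure.
        return True
    return remaining >= min_tokens
```

Set min_tokens to roughly the max_tokens of the request you are about to send. Note that a plain dict lookup is case-sensitive, so normalize header names to lowercase first if your HTTP layer does not do it for you.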
Quick reference
| Item | Value / Setting |
|---|---|
| Model ID | claude-opus-4-8 |
| Enable fast mode | speed="fast" |
| Beta flag | betas=["fast-mode-2026-02-01"] |
| Client method | client.beta.messages.create / .stream |
| Speedup | Up to 2.5x output tokens per second (OTPS) |
| What it does NOT speed up | Time to first token (TTFT) |
| Check served speed | response.usage.speed |
| Fast pricing (published) | $30 / MTok in, $150 / MTok out (6x Opus) |
| Rate-limit error | 429 with retry-after; fall back to standard |
| Cache when switching speed | Miss (no shared prefixes) |
| Not available on | Batch API, Priority Tier, Claude Platform on AWS |
| Pairs well with | effort="low" for cheap, snappy agent steps |
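
To put the pricing row in concrete terms, this sketch prices a single uncached request at the published fast-mode rates from the table. The standard figure is derived by dividing by the stated 6x multiplier, which is an inference from this guide rather than a quoted price, so check the pricing page before relying on it. The request_cost_usd helper is hypothetical.

```python
FAST_INPUT_PER_MTOK = 30.0    # USD, published fast-mode rate quoted in this guide
FAST_OUTPUT_PER_MTOK = 150.0  # USD, published fast-mode rate quoted in this guide
FAST_MULTIPLIER = 6.0         # "6x standard Opus rates", per this guide


def request_cost_usd(input_tokens: int, output_tokens: int, fast: bool) -> float:
    """Estimate the cost of one uncached request in USD."""
    fast_cost = (input_tokens * FAST_INPUT_PER_MTOK
                 + output_tokens * FAST_OUTPUT_PER_MTOK) / 1_000_000
    # Assumption: standard pricing is the fast price divided by the multiplier.
    return fast_cost if fast else fast_cost / FAST_MULTIPLIER


# Token counts from the sample response in Step 1.
print(f"fast:     ${request_cost_usd(412, 1875, fast=True):.4f}")
print(f"standard: ${request_cost_usd(412, 1875, fast=False):.4f}")
```

Multiply the per-request difference by your daily request volume to see what routing a given path through fast mode would add to the bill.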
Next steps
- Join the fast-mode waitlist at claude.com/fast-mode if your org is not yet enabled.
- Instrument usage.speed and the anthropic-fast-* headers in your logging so you can see your real fast-hit rate and pace requests.
- A/B the same agent with and without fast mode on your actual prompts; keep it only where the OTPS win is visible to a user.
- Layer in the effort parameter per step (low for routing, high/xhigh for the hard reasoning) to control token spend before you pay the fast premium.
- Read the official Fast mode and Effort docs on platform.claude.com for the latest pricing and supported-model details.
Sources: Anthropic Claude API docs, "What's new in Claude Opus 4.8", "Fast mode", and "Effort" (platform.claude.com), accessed June 2026.