Claude Opus 4.8 Fast Mode: 2.5x Faster Output in Python — ContentBuffer guide

Kodetra Technologies · 9 min read · Intermediate

Summary

Use speed:"fast" on Claude Opus 4.8 for up to 2.5x faster output, with a safe rate-limit fallback.

Anthropic shipped Claude Opus 4.8 on May 28, 2026, and a few days later quietly turned on something a lot of agent builders had been waiting for: fast mode for the 4.8 tier. Set one field, speed: "fast", and you get up to 2.5x higher output tokens per second from the exact same model weights. No quality trade-off, no distilled mini-model, just faster generation.

Why does this matter right now? Opus-tier intelligence has always been the slow lane. When you run Opus inside an agent loop that emits long diffs, multi-step plans, or big structured JSON, the bottleneck is rarely the thinking; it is the time spent streaming hundreds or thousands of output tokens back to you. Fast mode attacks exactly that. For latency-sensitive products like live coding assistants, support copilots, and anything a human is watching token-by-token, this is the difference between "usable" and "painful."

This guide is hands-on. You will make your first fast-mode call, measure the real speedup yourself, build a production-safe fallback for when you hit the fast-mode rate limit, and combine fast mode with the effort parameter so cheap agent steps stay snappy. Every code sample is checked against the official Anthropic docs. There are real gotchas (fast mode does not reduce time-to-first-token, and it is a research preview behind a waitlist), and they are all covered below.

Prerequisites

  • Python 3.9+ and the official Anthropic SDK: pip install -U anthropic (you want a recent build that knows about speed and betas).
  • An Anthropic API key in the ANTHROPIC_API_KEY environment variable.
  • Fast-mode access. It is a beta research preview with a dedicated waitlist at claude.com/fast-mode. Until your org is granted access, fast requests return an error, so the fallback pattern in this guide is not optional.
  • Comfort with the Messages API. If you have called client.messages.create(...) before, you are ready.

Set your key and install the SDK once in the shell:

export ANTHROPIC_API_KEY="sk-ant-..."
pip install -U anthropic

What fast mode actually is (and is not)

Fast mode runs the same model with a faster inference configuration. The weights, the behavior, and the answers are identical to standard Opus 4.8. You are not switching to a smaller or dumber model. You are paying a premium to have Anthropic serve your request on a faster path.

The single most important thing to understand is what gets faster. The speedup is measured in output tokens per second (OTPS), not time to first token (TTFT). Claude still takes roughly the same time to start replying; once it starts, the tokens stream out up to 2.5x faster. So fast mode helps most on responses with lots of output: long code generations, big JSON payloads, detailed multi-step plans. It does almost nothing for a one-line answer.
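
To make the TTFT versus OTPS distinction concrete, here is a back-of-the-envelope model, a sketch rather than a benchmark. It assumes total latency is roughly TTFT plus output tokens divided by OTPS. The 2.0 s TTFT and the 45 OTPS baseline are illustrative assumptions, not published figures; only the 2.5x factor comes from the text above.

```python
def total_latency(output_tokens: int, ttft_s: float, otps: float) -> float:
    """Rough wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / otps

TTFT_S = 2.0                      # assumed; fast mode leaves this unchanged
STANDARD_OTPS = 45.0              # assumed baseline throughput
FAST_OTPS = STANDARD_OTPS * 2.5   # best case of the "up to 2.5x" claim

def end_to_end_speedup(output_tokens: int) -> float:
    """Ratio of standard to fast total latency for one response."""
    slow = total_latency(output_tokens, TTFT_S, STANDARD_OTPS)
    fast = total_latency(output_tokens, TTFT_S, FAST_OTPS)
    return slow / fast

for n in (20, 200, 2000):
    print(f"{n:5d} output tokens -> {end_to_end_speedup(n):.2f}x end-to-end")
```

Under these assumptions a 20-token answer barely improves, while a 2,000-token generation approaches, but never quite reaches, the 2.5x ceiling, because the fixed TTFT is paid either way.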

  • Same intelligence: identical weights and behavior, not a different model.
  • Up to 2.5x OTPS: the benefit is throughput on output, not startup latency.
  • Premium pricing: fast mode is billed at 6x standard Opus rates across the full context window (the published fast-mode rate is $30 / MTok input and $150 / MTok output; confirm the exact Opus 4.8 preview number on the pricing page).
  • Research preview: beta, waitlisted, with its own separate rate limits.
  • Not everywhere: not available on the Batch API, Priority Tier, or Claude Platform on AWS.
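
To get a feel for the premium, the arithmetic can be sketched as below. It uses the fast-mode rates quoted in the list above and derives the standard rate by dividing by the stated 6x multiplier. The token counts are made up for illustration, and the function ignores prompt caching and any other discounts, so treat the result as a rough estimate and check the pricing page for real numbers.

```python
FAST_INPUT_PER_MTOK = 30.0     # USD, fast-mode input rate quoted above
FAST_OUTPUT_PER_MTOK = 150.0   # USD, fast-mode output rate quoted above
FAST_MULTIPLIER = 6.0          # "6x standard Opus rates", per the text

def request_cost(input_tokens: int, output_tokens: int, fast: bool) -> float:
    """Approximate USD cost of one request, ignoring caching and discounts."""
    fast_cost = (input_tokens * FAST_INPUT_PER_MTOK
                 + output_tokens * FAST_OUTPUT_PER_MTOK) / 1_000_000
    return fast_cost if fast else fast_cost / FAST_MULTIPLIER

# Illustrative request: 10k input tokens, 2k output tokens.
fast_cost = request_cost(10_000, 2_000, fast=True)
standard_cost = request_cost(10_000, 2_000, fast=False)
print(f"fast: ${fast_cost:.4f}  standard: ${standard_cost:.4f}")
```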

Step 1: Your first fast-mode call

Fast mode is a beta feature, so you call it through client.beta.messages.create and pass the beta flag fast-mode-2026-02-01 plus speed="fast". Note that Opus 4.8 does not accept temperature, top_p, or top_k (setting any of them returns a 400), so we leave them out.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    speed="fast",
    betas=["fast-mode-2026-02-01"],
    messages=[
        {
            "role": "user",
            "content": "Refactor this Flask view to use dependency injection. "
                       "Return only the rewritten module.",
        }
    ],
)

print(response.content[0].text)

# Confirm which speed actually served the request:
print("served at:", response.usage.speed)   # "fast" or "standard"

The response object carries a new field, usage.speed, telling you which path actually served the request. Always read it. If your org is not yet granted fast access, or you fell back, this is how you know:

{
  "id": "msg_01XFDUDYJgAACzvnptvVoYEL",
  "type": "message",
  "role": "assistant",
  "usage": {
    "input_tokens": 412,
    "output_tokens": 1875,
    "speed": "fast"
  }
}
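
One way to act on that field is to track how often requests are actually served fast. The helper below is a minimal sketch: served_speed and fast_hit_rate are hypothetical names, the SimpleNamespace objects stand in for real response.usage values, and treating a missing speed field as "standard" is an assumption.

```python
from collections import Counter
from types import SimpleNamespace

def served_speed(usage) -> str:
    """Speed reported on a usage object; a missing field counts as standard."""
    return getattr(usage, "speed", None) or "standard"

def fast_hit_rate(usages) -> float:
    """Fraction of requests that were served on the fast path."""
    counts = Counter(served_speed(u) for u in usages)
    total = sum(counts.values())
    return counts["fast"] / total if total else 0.0

# Stand-ins for response.usage objects collected from earlier calls.
sample = [
    SimpleNamespace(speed="fast", output_tokens=1875),
    SimpleNamespace(speed="standard", output_tokens=900),
    SimpleNamespace(output_tokens=120),  # no speed field at all
]
print(f"fast-hit rate: {fast_hit_rate(sample):.0%}")
```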

Step 2: Measure the speedup yourself

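Do not take "2.5x" on faith; measure it on your own workload. The honest metric is OTPS: output tokens divided by wall-clock generation time. The script below runs the same long-output prompt twice, once standard and once fast, and prints OTPS for each. Because it times the whole non-streaming request, the elapsed time includes TTFT, so the figures slightly understate pure generation throughput on both runs.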

import time, anthropic

client = anthropic.Anthropic()

PROMPT = ("Write a complete, well-commented Python module that implements "
          "an LRU cache with TTL expiry, thread safety, and unit tests.")

def run(speed):
    kwargs = dict(
        model="claude-opus-4-8",
        max_tokens=4096,
        messages=[{"role": "user", "content": PROMPT}],
    )
    if speed == "fast":
        kwargs.update(speed="fast", betas=["fast-mode-2026-02-01"])
        create = client.beta.messages.create
    else:
        create = client.messages.create

    t0 = time.perf_counter()
    resp = create(**kwargs)
    elapsed = time.perf_counter() - t0

    out = resp.usage.output_tokens
    served = getattr(resp.usage, "speed", "standard")
    print(f"{speed:8s} -> served={served:8s} "
          f"{out} out tokens in {elapsed:5.1f}s = {out/elapsed:6.1f} OTPS")

run("standard")
run("fast")

A representative run on a long generation looks like this (your absolute numbers will vary with load and prompt size, but the ratio is the point):

standard -> served=standard 1840 out tokens in  41.0s =   44.9 OTPS
fast     -> served=fast     1862 out tokens in  17.2s =  108.3 OTPS

That is roughly a 2.4x throughput gain on the part that actually hurts. Notice the token counts are nearly identical because the model is the same, only the serving speed changed.
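
If you want to separate TTFT from generation throughput, one option is to stream the response, record a timestamp when the first text chunk arrives, and compute OTPS over the streaming window only. The sketch below covers just the arithmetic; the Timing class is a hypothetical helper and the timestamps are made up, not a measurement.

```python
from dataclasses import dataclass

@dataclass
class Timing:
    t_start: float        # just before the request is sent
    t_first_token: float  # when the first streamed text chunk arrives
    t_end: float          # when the stream finishes
    output_tokens: int    # from the final usage.output_tokens

    @property
    def ttft(self) -> float:
        """Time to first token, in seconds."""
        return self.t_first_token - self.t_start

    @property
    def generation_otps(self) -> float:
        """Throughput over the streaming window only, excluding TTFT."""
        window = self.t_end - self.t_first_token
        return self.output_tokens / window if window > 0 else float("nan")

    @property
    def end_to_end_otps(self) -> float:
        """Throughput over the whole request, TTFT included."""
        return self.output_tokens / (self.t_end - self.t_start)

# Made-up timestamps (seconds) for illustration only.
t = Timing(t_start=0.0, t_first_token=2.0, t_end=19.2, output_tokens=1862)
print(f"TTFT {t.ttft:.1f}s, generation {t.generation_otps:.1f} OTPS, "
      f"end-to-end {t.end_to_end_otps:.1f} OTPS")
```

In a real run you would fill the three timestamps with time.perf_counter() readings taken around a streaming call.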

Step 3: A production-safe fast-then-standard fallback

Fast mode has its own dedicated rate limit, separate from standard Opus. When you exceed it, the API returns a 429 with a retry-after header. In production you usually do not want to block; you want to fall back to standard speed and keep moving. The pattern below tries fast first and, on a rate-limit error, retries the same request without speed.

import anthropic

client = anthropic.Anthropic()

def create_with_fast_fallback(max_retries=None, max_attempts=3, **params):
    # Apply per-request retries via with_options(), the SDK's documented
    # pattern. None keeps the client's default retry behavior.
    api = client if max_retries is None else client.with_options(max_retries=max_retries)
    try:
        return api.beta.messages.create(**params)
    except anthropic.RateLimitError:
        # Fast capacity is full -> drop speed and serve at standard,
        # with the SDK's default retries for the standard request.
        if params.get("speed") == "fast":
            params.pop("speed", None)
            return create_with_fast_fallback(max_attempts=max_attempts, **params)
        raise
    except (anthropic.APIStatusError, anthropic.APIConnectionError) as error:
        # Retry only transient 5xx/connection errors, not 4xx.
        if isinstance(error, anthropic.APIStatusError) and error.status_code < 500:
            raise
        if max_attempts > 1:
            return create_with_fast_fallback(
                max_retries=max_retries, max_attempts=max_attempts - 1, **params
            )
        raise

message = create_with_fast_fallback(
    model="claude-opus-4-8",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this PR in 3 bullet points."}],
    betas=["fast-mode-2026-02-01"],
    speed="fast",
    max_retries=0,   # fail fast on the first 429 so we can fall back immediately
)

print(message.usage.speed)  # "fast" if it went through, "standard" if it fell back

Setting max_retries=0 on the first attempt is deliberate: it stops the SDK from silently waiting on the 429 so your code can decide to drop to standard speed right away. One caveat to plan for: falling back from fast to standard is a prompt-cache miss. Requests at different speeds do not share cached prefixes, so a fallback re-bills your input tokens at the standard (uncached) rate.
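
Because every switch between speeds costs a cache miss, it can help to make the fallback sticky: after a fast-mode 429, stay on standard for a cooldown instead of retrying fast on the very next turn. The class below is a minimal sketch of that idea. FastModeGate, its 60-second default, and the injectable clock are all assumptions for illustration, not part of the Anthropic SDK.

```python
import time
from typing import Callable, Optional

class FastModeGate:
    """After a fast-mode 429, keep requests on standard speed for a cooldown."""

    def __init__(self, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.cooldown_s = cooldown_s
        self._clock = clock          # injectable so the gate can be tested
        self._blocked_until = 0.0

    def allow_fast(self) -> bool:
        """True when the cooldown (if any) has expired."""
        return self._clock() >= self._blocked_until

    def record_rate_limit(self, retry_after_s: Optional[float] = None) -> None:
        """Call from the RateLimitError handler; prefer the retry-after value."""
        wait = retry_after_s if retry_after_s is not None else self.cooldown_s
        self._blocked_until = self._clock() + wait

# Demo with a fake clock so nothing actually sleeps.
now = [0.0]
gate = FastModeGate(cooldown_s=30.0, clock=lambda: now[0])
print(gate.allow_fast())    # fast is allowed before any 429
gate.record_rate_limit()
print(gate.allow_fast())    # blocked during the cooldown
now[0] = 31.0
print(gate.allow_fast())    # allowed again once the cooldown has passed
```

In the fallback wrapper you would check gate.allow_fast() before adding speed="fast" and call gate.record_rate_limit() when the 429 arrives.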

Step 4: Combine fast mode with the effort parameter

Fast mode and the effort parameter solve different halves of the latency problem, and they stack. Fast mode makes each output token come out faster. Effort controls how many tokens Claude spends in the first place, across text, thinking, and tool calls. For a quick, low-stakes agent step, the fastest possible turn is low effort plus fast speed: fewer tokens, generated faster.

response = client.beta.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    speed="fast",
    betas=["fast-mode-2026-02-01"],
    output_config={"effort": "low"},   # spend fewer tokens on this simple step
    messages=[
        {"role": "user", "content": "Classify this ticket: billing, bug, or feature?\n"
                                     "'I was charged twice this month.'"}
    ],
)
print(response.content[0].text)        # -> billing
print(response.usage.speed,            # fast
      response.usage.output_tokens)    # small, because effort is low

Opus 4.8 defaults to high effort everywhere, so for high-volume classification or routing you should set effort down explicitly. Reserve high, xhigh, and max for the steps that genuinely need deep reasoning, and pair them with fast mode only if a human is waiting on the output, since the premium price applies to every fast token.
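
One way to apply this per step is a small lookup that returns the speed and effort settings for each kind of agent step. This is a sketch of one possible policy, not a recommendation from the docs: the step names and the request_options helper are hypothetical, and the policy (fast only when a human is waiting) is just one reading of the advice above.

```python
BETA_FLAG = "fast-mode-2026-02-01"

# Hypothetical step categories; adjust to your own agent.
EFFORT_BY_STEP = {
    "classify": "low",
    "route": "low",
    "review": "high",
    "plan": "xhigh",
}

def request_options(step: str, human_waiting: bool) -> dict:
    """Build the speed/effort kwargs to merge into a messages.create call."""
    effort = EFFORT_BY_STEP.get(step, "high")  # unknown steps keep the default
    options = {"output_config": {"effort": effort}}
    if human_waiting:
        # Pay the fast premium only when someone is watching the output.
        options.update(speed="fast", betas=[BETA_FLAG])
    return options

print(request_options("classify", human_waiting=False))
print(request_options("review", human_waiting=True))
```

The returned dict is meant to be unpacked into client.beta.messages.create(...) alongside model, max_tokens, and messages.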

Worked example: a snappy code-review turn

Here is the kind of place fast mode earns its premium. A reviewer bot reads a diff and streams back a long, structured review while the developer watches. The output is large (so OTPS dominates) and a human is waiting (so latency is visible). We stream the response and fall back to standard if fast capacity is gone.

from pathlib import Path

import anthropic

client = anthropic.Anthropic()

# read_text() opens and closes the file for us.
DIFF = Path("change.diff").read_text()

def stream_review(use_fast=True):
    params = dict(
        model="claude-opus-4-8",
        max_tokens=4096,
        output_config={"effort": "high"},  # reviews deserve real reasoning
        messages=[{
            "role": "user",
            "content": f"Review this diff. List concrete issues with file:line "
                       f"references, then a verdict.\n\n{DIFF}",
        }],
    )
    if use_fast:
        params.update(speed="fast", betas=["fast-mode-2026-02-01"])
        stream_fn = client.beta.messages.stream
    else:
        stream_fn = client.messages.stream

    try:
        with stream_fn(**params) as stream:
            for text in stream.text_stream:
                print(text, end="", flush=True)
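            final = stream.get_final_message()
        # The standard (non-beta) path may not report usage.speed, so default it.
        served = getattr(final.usage, "speed", "standard")
        print(f"\n\n[served at {served}, "
              f"{final.usage.output_tokens} tokens]")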
    except anthropic.RateLimitError:
        if use_fast:
            print("[fast full, falling back to standard]")
            return stream_review(use_fast=False)
        raise

stream_review()

Example tail of the output:

...
- auth/session.py:42 - token compared with == instead of hmac.compare_digest (timing leak)
- auth/session.py:88 - missing await on async refresh; the call is fire-and-forget
Verdict: request changes. Two correctness issues, one security-relevant.

[served at fast, 1623 tokens]

Common pitfalls and gotchas

  • It does not lower TTFT. If your complaint is "Claude takes too long to start replying," fast mode will not fix it. The gain is output tokens per second, so it only pays off on long generations.
  • Switching speed invalidates the prompt cache. Fast and standard requests do not share cached prefixes. Flip-flopping between them on an agent loop quietly destroys your cache-hit rate and inflates input cost.
  • You must use the beta client and flag. Call client.beta.messages.create (or .stream) and include betas=["fast-mode-2026-02-01"]. Sending speed="fast" on the plain client, or to an unsupported model, errors out.
  • Separate rate limits. Fast mode has its own pool. Watch the anthropic-fast-output-tokens-remaining and anthropic-fast-output-tokens-reset response headers so you can pace requests before you hit a 429.
  • Premium price on every fast token. At 6x standard Opus rates, blindly flipping all traffic to fast is expensive. Use it where a human is actually waiting on long output.
  • Not on Batch API, Priority Tier, or Claude Platform on AWS. Architect fast mode into your synchronous, latency-sensitive paths, not your bulk offline jobs.
  • No sampling params on Opus 4.8. temperature, top_p, and top_k all 400 on 4.8 regardless of speed. Steer behavior with prompting and the effort parameter instead.
  • It is a research preview. Access is waitlisted and capacity is limited while Anthropic gathers feedback, so treat fast as an optimization that can be unavailable, never as a hard dependency.
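
The header-based pacing mentioned in the list can be sketched as a small decision function. It is a sketch under assumptions: should_use_fast is a hypothetical helper, header keys are assumed to be lowercase in the mapping you pass in, and the reset value is assumed to be an RFC 3339 timestamp like the standard rate-limit reset headers, which the text above does not confirm.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional

REMAINING = "anthropic-fast-output-tokens-remaining"
RESET = "anthropic-fast-output-tokens-reset"

def should_use_fast(headers: Mapping[str, str], needed_tokens: int,
                    now: Optional[datetime] = None) -> bool:
    """Decide whether the next request should try fast mode."""
    remaining = headers.get(REMAINING)
    if remaining is None:
        return True   # no information yet, so just try fast
    if int(remaining) >= needed_tokens:
        return True   # enough fast budget left for this request
    reset = headers.get(RESET)
    if reset is None:
        return False  # budget too low and no idea when it refills
    now = now or datetime.now(timezone.utc)
    # Assumption: RFC 3339 timestamp; normalize a trailing "Z" for Python 3.9.
    reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
    return now >= reset_at

headers = {REMAINING: "500", RESET: "2026-06-10T12:00:30Z"}
before_reset = datetime(2026, 6, 10, 12, 0, 0, tzinfo=timezone.utc)
print(should_use_fast(headers, needed_tokens=4000, now=before_reset))
```

To obtain the headers from the SDK, one option is the raw-response wrapper (client.beta.messages.with_raw_response.create(...), then .headers and .parse()); pass needed_tokens as your max_tokens for a conservative check.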

Quick reference

Item                          Value / Setting
Model ID                      claude-opus-4-8
Enable fast mode              speed="fast"
Beta flag                     betas=["fast-mode-2026-02-01"]
Client method                 client.beta.messages.create / .stream
Speedup                       Up to 2.5x output tokens per second (OTPS)
What it does NOT speed up     Time to first token (TTFT)
Check served speed            response.usage.speed
Fast pricing (published)      $30 / MTok in, $150 / MTok out (6x Opus)
Rate-limit error              429 with retry-after; fall back to standard
Cache when switching speed    Miss (no shared prefixes)
Not available on              Batch API, Priority Tier, Claude Platform on AWS
Pairs well with               output_config={"effort": "low"} for cheap, snappy agent steps
Next steps

  • Join the fast-mode waitlist at claude.com/fast-mode if your org is not yet enabled.
  • Instrument usage.speed and the anthropic-fast-* headers in your logging so you can see your real fast-hit rate and pace requests.
  • A/B the same agent with and without fast mode on your actual prompts; keep it only where the OTPS win is visible to a user.
  • Layer in the effort parameter per step (low for routing, high/xhigh for the hard reasoning) to control token spend before you pay the fast premium.
  • Read the official Fast mode and Effort docs on platform.claude.com for the latest pricing and supported-model details.

Sources: Anthropic Claude API docs, "What's new in Claude Opus 4.8", "Fast mode", and "Effort" (platform.claude.com), accessed June 2026.
