Opus 4.8 Mid-Convo System Messages in Python (Cache-Safe) — ContentBuffer guide

Kodetra Technologies · 10 min read · Intermediate

Summary

Use Opus 4.8 role:system messages mid-conversation to update agent rules without invalidating cache.

Why this matters right now

Anthropic shipped Claude Opus 4.8 on May 28, 2026 and quietly turned on a feature that long-running agents have been asking for since Opus 4.0: a role: "system" message you can drop into the middle of a conversation. The new instruction takes system-level priority from that point onward, and the conversation history before it stays byte-identical, so the prefix you already paid to cache still hits on the next request.

If you have ever edited the top-level system field on turn 47 of an agent run, watched your prompt cache go cold, and then watched your bill go up, this guide is for you. We will use the official Python SDK to send a mid-conversation system message, prove the cached prefix still gets reused, and bolt it into a small agent loop that swaps its own operating rules between turns.


What is actually new in Opus 4.8

Opus 4.8 keeps the same price as Opus 4.7 ($5 / $25 per million input and output tokens) and the same 1M-token context window. What changed for builders is a handful of small but practical knobs:

  • Mid-conversation role: "system" messages — append a system instruction after a user turn. No beta header.
  • Lower prompt-cache minimum — 1,024 tokens (down from 2,048 on Opus 4.7). Short agent prompts can now cache.
  • Effort defaults to high — on every surface, including the API and Claude Code. Override with output_config={"effort": "medium"} or "xhigh".
  • Fast mode (research preview) — set speed: "fast" for ~2.5x output tokens-per-second at $10 / $50 per MTok.
  • Refusal stop details — when the model declines, stop_details now carries a category so you can route the user appropriately.
  • Same constraints as 4.7 — no temperature/top_p/top_k, no budget_tokens. Use adaptive thinking plus effort.

Mid-conversation system messages are available on the Claude API and on Claude Platform on AWS. They are not available on Amazon Bedrock, Vertex AI, or Microsoft Foundry. Plan for that if your production traffic terminates on one of those.
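
One way to plan for the platform split is to branch on where the request terminates. The sketch below is only an illustration: the platform labels and the helper names are hypothetical, and the fallback (editing the top-level system string and accepting a cold cache) is one possible choice, not a recommendation from Anthropic. It also assumes the last entry in messages is a user turn.

```python
# Hypothetical platform labels for this sketch; they are not official identifiers.
SUPPORTED_PLATFORMS = {"claude_api", "claude_platform_aws"}
UNSUPPORTED_PLATFORMS = {"bedrock", "vertex_ai", "foundry"}


def can_use_mid_conversation_system(platform: str) -> bool:
    """Return True when the platform accepts role: "system" inside messages."""
    if platform in SUPPORTED_PLATFORMS:
        return True
    if platform in UNSUPPORTED_PLATFORMS:
        return False
    raise ValueError(f"unknown platform: {platform!r}")


def apply_rule(platform: str, system: str, messages: list[dict], rule: str):
    """Return (system, messages) with the new rule applied for this platform."""
    if can_use_mid_conversation_system(platform):
        # Cache-friendly path: history is untouched, the rule is appended.
        return system, messages + [{"role": "system", "content": rule}]
    # Fallback path: the prefix changes, so expect a cache miss on this request.
    return system + "\n" + rule, list(messages)


history = [{"role": "user", "content": "Review pipeline.py."}]
sys_a, msgs_a = apply_rule("claude_api", "Be concise.", history, "Use type hints.")
sys_b, msgs_b = apply_rule("bedrock", "Be concise.", history, "Use type hints.")
print(msgs_a[-1]["role"], "|", repr(sys_b))
```

Both branches return new objects instead of mutating the inputs, which keeps the original history available for comparison in tests.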


Why position breaks the cache (the 60-second version)

Prompt caching hashes the request prefix in a fixed order: tools, then system, then messages. A cache hit requires that prefix to match a previous request byte for byte, up to a cache breakpoint. The top-level system field sits near the very start of that hash.

So when you append even one sentence to system on turn 47, the hash for the entire prefix changes. Nothing cached after that point matches any more, so the system prompt and every message that followed it are processed again as fresh, full-price input. On a 200k-token conversation, that is real money. A mid-conversation system message dodges that by living at the end of messages: everything before it is unchanged, so the cached prefix still matches, and only the new turn is fresh input.
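
The effect is easy to see in a toy model. The sketch below is not how Anthropic computes cache keys; it simply hashes a JSON dump of tools, system, and messages in that order, which is enough to show why an edit near the start invalidates everything while an append does not. The function name prefix_hash is made up for this illustration.

```python
import hashlib
import json


def prefix_hash(tools, system, messages):
    """Toy stand-in for a cache key: hash tools -> system -> messages in order."""
    payload = json.dumps([tools, system, messages], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools = []
system = "You are a senior code reviewer. Be concise."
history = [
    {"role": "user", "content": "Review process() in utils.py."},
    {"role": "assistant", "content": "Hoist the dict lookups."},
    {"role": "user", "content": "Now look at pipeline.py."},
]

# Key for the prefix that was cached on the previous request.
cached = prefix_hash(tools, system, history)

# Option A: edit the top-level system field. The same three messages now sit
# behind a different system string, so the prefix key changes.
edited_system = system + " Always include type annotations."
hash_after_edit = prefix_hash(tools, edited_system, history)

# Option B: append a system message. The first three messages are untouched,
# so the key for that part of the request is identical.
appended = history + [{"role": "system", "content": "Always include type annotations."}]
hash_after_append = prefix_hash(tools, system, appended[: len(history)])

print("edit system -> prefix reusable:", hash_after_edit == cached)
print("append turn -> prefix reusable:", hash_after_append == cached)
```

The first line prints False and the second prints True, which mirrors the cache miss and cache hit described above.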


Prerequisites

  • Python 3.10+ and an Anthropic API key (set ANTHROPIC_API_KEY).
  • The anthropic Python SDK on a recent version: pip install --upgrade anthropic.
  • API access to claude-opus-4-8. The Claude API and Claude Platform on AWS both work; Bedrock/Vertex/Foundry do not (yet).
  • Comfort with the Messages API and the idea of a multi-turn messages array.

Step 1 — Bootstrap the client

Nothing exotic here. We will reuse this client across every example.

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = "claude-opus-4-8"

Step 2 — A baseline call (so we have something to compare against)

Set cache_control={"type": "ephemeral"} at the top level to enable automatic prompt caching. The API then places a cache breakpoint at the end of the stable prefix on every request.

resp = client.messages.create(
    model=MODEL,
    max_tokens=512,
    cache_control={"type": "ephemeral"},
    system="You are a senior code reviewer. Be concise. Reply in <= 3 bullet points.",
    messages=[
        {"role": "user", "content": "Review process() in utils.py for performance issues."},
    ],
)
print(resp.content[0].text)
print("usage:", resp.usage)

Example output:

- The inner list comprehension materializes the full result; switch to a generator if N is large.
- Repeated dict lookups in the hot loop — hoist `cfg.get("x")` once before the loop.
- `sorted()` is called on every call; cache the sort if `items` is immutable between invocations.
usage: Usage(input_tokens=44, output_tokens=78, cache_creation_input_tokens=0, cache_read_input_tokens=0)

Caching needs at least 1,024 tokens before it actually creates an entry, so this tiny example shows zeros on both cache fields. Hold that thought — we will grow the conversation in Step 4 and watch the numbers move.
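
To avoid misreading those zeros, it can help to label each response by what the cache did. This is a minimal sketch: cache_status is a hypothetical helper, and SimpleNamespace stands in for the SDK's usage object so the snippet runs without an API call. Only the two field names come from the output above.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify a usage object as a cache 'hit', a cache 'write', or 'none'."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"      # part of the prefix was served from cache
    if created > 0:
        return "write"    # nothing was reused, but an entry was created
    return "none"         # caching off, or the prompt is below the minimum


# Stand-in for resp.usage from the baseline call above.
baseline = SimpleNamespace(
    input_tokens=44,
    output_tokens=78,
    cache_creation_input_tokens=0,
    cache_read_input_tokens=0,
)
print(cache_status(baseline))
```

For the baseline call this prints none, which is the expected result for a prompt far below 1,024 tokens.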

Step 3 — Add a mid-conversation system message

The reviewer has been running for a few turns. Now your team rolls out a new rule: every code suggestion has to include explicit type annotations. Old way: edit the top-level system string and lose the cache. New way: append a role: "system" message after the latest user turn.

messages = [
    {"role": "user",      "content": "Review process() in utils.py."},
    {"role": "assistant", "content": "Generator beats list comp here for large inputs. Hoist the dict lookups."},
    {"role": "user",      "content": "Now look at the calling code in pipeline.py."},
    # New rule arrives mid-session. Append it as a system turn — the
    # turns above are unchanged, so the cached prefix still hits.
    {"role": "system",    "content": "From now on, every suggestion must include explicit type annotations."},
]

resp = client.messages.create(
    model=MODEL,
    max_tokens=512,
    cache_control={"type": "ephemeral"},
    system="You are a senior code reviewer. Be concise.",
    messages=messages,
)
print(resp.content[0].text)

Two placement rules to commit to memory. A system message must (a) immediately follow a user turn — or an assistant turn that ends in a server tool use — and (b) either end the messages array or be followed by an assistant turn. Anything else returns HTTP 400. In practice that means: append it, then call create().
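
Those two rules can be checked locally before the request goes out. The validator below is a sketch under assumptions: system_placement_errors is a hypothetical helper, and it guesses that an assistant turn "ending in a server tool use" has a content list whose last block has type "server_tool_use". Adjust that check to whatever your responses actually contain.

```python
def _ends_in_server_tool_use(message: dict) -> bool:
    """Assumption: server tool calls appear as a final 'server_tool_use' block."""
    content = message.get("content")
    if not isinstance(content, list) or not content:
        return False
    last = content[-1]
    return isinstance(last, dict) and last.get("type") == "server_tool_use"


def system_placement_errors(messages: list[dict]) -> list[str]:
    """Return one error string per violated placement rule (empty if valid)."""
    errors = []
    for i, msg in enumerate(messages):
        if msg.get("role") != "system":
            continue
        prev = messages[i - 1] if i > 0 else None
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        # Rule (a): must follow a user turn, or an assistant server tool use.
        follows_ok = prev is not None and (
            prev.get("role") == "user"
            or (prev.get("role") == "assistant" and _ends_in_server_tool_use(prev))
        )
        if not follows_ok:
            errors.append(f"messages[{i}]: must follow a user turn or a server tool use")
        # Rule (b): must end the array or be followed by an assistant turn.
        if nxt is not None and nxt.get("role") != "assistant":
            errors.append(f"messages[{i}]: must end the array or precede an assistant turn")
    return errors


ok = [
    {"role": "user", "content": "Now look at pipeline.py."},
    {"role": "system", "content": "Include explicit type annotations."},
]
bad = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello."},
    {"role": "system", "content": "Include explicit type annotations."},
]
print(system_placement_errors(ok))
print(system_placement_errors(bad))
```

Running a check like this in tests catches a misplaced system turn before it turns into an HTTP 400 in production.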

Step 4 — Worked example: an agent that swaps rules between turns

Here is the pattern you actually want to ship. We will run a five-turn loop where the agent is asked to write SQL. After the third turn, the security team rolls out a parameterization rule. We inject it as a mid-conversation system message on the fourth request (printed as turn 3, because the loop counts from 0), and on every subsequent turn the cached prefix is still read from cache.

from anthropic import Anthropic
import os

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
MODEL = "claude-opus-4-8"
BASE_SYSTEM = (
    "You are a SQL co-pilot for a finance data warehouse. "
    "Tables: orders(id, user_id, amount, created_at), users(id, email, region). "
    "Reply with exactly one SQL statement, no commentary."
)

messages = []
turns = [
    "Total revenue per region for May 2026.",
    "Same, but only regions with > 100 orders.",
    "Add the top-spending user_id per region.",
    "Now show daily revenue for the EU region in May 2026.",
    "And daily revenue for US, parameterized by start and end date.",
]

for i, user_text in enumerate(turns):
    messages.append({"role": "user", "content": user_text})

    # i == 3 is the fourth request: three turns are done and security has shipped a new rule.
    # Append it right after the latest user turn, before the call, so the placement is legal.
    if i == 3:
        messages.append({
            "role": "system",
            "content": (
                "SECURITY UPDATE: From now on, every query that takes user-supplied "
                "values MUST use bind parameters (e.g. :start_date), never inline literals. "
                "Comment each parameter on its own line above the query."
            ),
        })

    resp = client.messages.create(
        model=MODEL,
        max_tokens=512,
        cache_control={"type": "ephemeral"},
        system=BASE_SYSTEM,
        messages=messages,
        output_config={"effort": "medium"},
    )

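    # Join every text block instead of assuming the first block is text.
    text = "".join(block.text for block in resp.content if block.type == "text")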
    print(f"\n--- turn {i} ---")
    print(text)
    print("cache_read:", resp.usage.cache_read_input_tokens,
          "cache_create:", resp.usage.cache_creation_input_tokens)
    messages.append({"role": "assistant", "content": text})

Example output (trimmed):

--- turn 0 ---
SELECT u.region, SUM(o.amount) AS revenue
FROM orders o JOIN users u ON u.id = o.user_id
WHERE o.created_at >= '2026-05-01' AND o.created_at < '2026-06-01'
GROUP BY u.region;
cache_read: 0 cache_create: 0

--- turn 2 ---
SELECT region, revenue, top_user_id FROM (...);
cache_read: 1187 cache_create: 412

--- turn 3 ---  (security rule was just appended as a system message)
SELECT DATE(o.created_at) AS day, SUM(o.amount) AS revenue
FROM orders o JOIN users u ON u.id = o.user_id
WHERE u.region = 'EU'
  AND o.created_at >= '2026-05-01' AND o.created_at < '2026-06-01'
GROUP BY day ORDER BY day;
cache_read: 1604 cache_create: 256

--- turn 4 ---
-- :start_date - inclusive lower bound
-- :end_date   - exclusive upper bound
SELECT DATE(o.created_at) AS day, SUM(o.amount) AS revenue
FROM orders o JOIN users u ON u.id = o.user_id
WHERE u.region = 'US'
  AND o.created_at >= :start_date AND o.created_at < :end_date
GROUP BY day ORDER BY day;
cache_read: 2089 cache_create: 198

Two things to notice. First, cache_read_input_tokens keeps climbing every turn: the earlier history is being read from cache even after we injected a brand-new system instruction. Second, the rule is in force from turn 3, the request that carries the system message, but that query takes no user-supplied values, so bind parameters first show up on turn 4. If we had edited the top-level system instead, cache_read_input_tokens would drop to 0 on turn 3, and the whole conversation would have to be written to cache again before later turns could hit.

Step 5 — Combine with effort, adaptive thinking, and fast mode

Mid-conversation system messages are cheap on their own. They get genuinely interesting when you stack them with the rest of Opus 4.8's controls. Two examples.

1. Granting an effort boost mid-session. The agent has been running on medium. It hits a hard subproblem and you want it to think harder for the rest of the run. Append a system message announcing the standing permission, and bump effort on the next call:

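# Placement check: the last entry in messages must be a user turn here.
# Straight after the Step 4 loop it is an assistant turn, so append the
# next user turn first, then the system message.
messages.append({
    "role": "system",
    "content": "You may now take as much time as needed. Plan thoroughly before any tool call.",
})
resp = client.messages.create(
    model=MODEL,
    max_tokens=4096,
    cache_control={"type": "ephemeral"},
    system=BASE_SYSTEM,
    messages=messages,
    output_config={"effort": "xhigh"},
    thinking={"type": "adaptive"},
)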

xhigh is the Opus 4.8 recommendation for long-horizon agentic and coding work. Pair it with adaptive thinking and the model decides per-turn whether to think, instead of burning thinking tokens on lookups that do not need them.

2. Latency-critical reply with fast mode. Same conversation, but the final summary needs to stream fast for a UI. Set speed: "fast" and accept the premium price:

resp = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system=BASE_SYSTEM,
    messages=messages,
    speed="fast",                   # ~2.5x output tokens/sec
    output_config={"effort": "low"},  # the summary is the easy part now
)

Fast mode runs at $10 / $50 per million input and output tokens (vs. $5 / $25 standard). It is a research preview, so put it behind a feature flag in production.
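
The premium is easy to estimate up front. The arithmetic below uses only the list prices quoted above; the token counts are made-up example values, request_cost is a hypothetical helper, and cache discounts are ignored because this guide does not state their rates.

```python
# USD per million tokens, as quoted in this guide.
STANDARD = {"input": 5.00, "output": 25.00}
FAST = {"input": 10.00, "output": 50.00}


def request_cost(input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Uncached cost of one request in USD at the given per-MTok prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Hypothetical request: 20k input tokens, 1k output tokens.
std = request_cost(20_000, 1_000, STANDARD)
fast = request_cost(20_000, 1_000, FAST)
print(f"standard: ${std:.3f}  fast: ${fast:.3f}  ratio: {fast / std:.1f}x")
```

Because both prices double, fast mode costs twice as much for any mix of input and output tokens, so the decision reduces to whether the latency gain is worth a 2x bill on that request.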


Placement rules at a glance

Position in messages                                | Allowed? | What happens
----------------------------------------------------+----------+-------------------------------------------------------
First entry in the array                            | No       | 400 error. Use the top-level system field for that.
Immediately after a user turn                       | Yes      | Normal path. Append, then call create().
After an assistant turn ending in a server tool use | Yes      | Used in multi-agent / server-tool flows.
After an assistant turn ending in plain text        | No       | Wait for the next user turn, then append.
Two system turns in a row                           | No       | Merge into one message instead.
Followed by another system turn                     | No       | Must end the array or be followed by an assistant turn.
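
Two of those rows ("wait for the next user turn" and "merge into one message") suggest a small buffering pattern. The class below is one possible sketch, not an SDK feature: PendingRules is a hypothetical name, it joins queued rules with a blank line, and for simplicity it ignores the server-tool-use case and only flushes after a user turn.

```python
class PendingRules:
    """Buffer rule updates until they can legally be appended as one system turn."""

    def __init__(self) -> None:
        self._pending: list[str] = []

    def add(self, rule: str) -> None:
        """Queue a rule; nothing is sent until flush_into succeeds."""
        self._pending.append(rule)

    def flush_into(self, messages: list[dict]) -> bool:
        """Append one merged system message if the last entry is a user turn."""
        if not self._pending or not messages:
            return False
        if messages[-1].get("role") != "user":
            return False  # not a legal position yet; keep the rules queued
        merged = "\n\n".join(self._pending)
        messages.append({"role": "system", "content": merged})
        self._pending.clear()
        return True


messages = [
    {"role": "user", "content": "Total revenue per region."},
    {"role": "assistant", "content": "SELECT ..."},
]
rules = PendingRules()
rules.add("Use bind parameters.")
rules.add("Comment each parameter.")

first_try = rules.flush_into(messages)   # last turn is assistant: held back
messages.append({"role": "user", "content": "Daily revenue for the EU region."})
second_try = rules.flush_into(messages)  # last turn is user: merged and appended
print(first_try, second_try, messages[-1]["role"])
```

Call flush_into right after appending each user turn and before create(); rules that arrive at an illegal position simply wait for the next one.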

Pitfalls people are already hitting

Opus 4.8 launched yesterday and the same handful of mistakes are already showing up in support threads. Walk through each before you ship.

  • Editing or removing a mid-conv system message after sending it. Like any edit to earlier history, that re-hashes the prefix and the cache goes cold from that point on. If the instruction needs to change, append a new one. Do not rewrite the old one in place.
  • Putting it on the wrong cloud. If your traffic routes through Amazon Bedrock, Vertex AI, or Microsoft Foundry, mid-conversation system messages are not available yet — you will get a validation error. The Claude API and Claude Platform on AWS are the supported paths today.
  • Treating it as a security boundary. A system message has higher instruction priority. It does not make user-supplied or tool-returned content trustworthy. If a tool result is going to become a standing rule, sanitize it the same way you would any untrusted prompt material.
  • Forgetting cache_control. Mid-conversation system messages do not enable caching by themselves. Without cache_control or an explicit breakpoint somewhere, every request pays full input price for the entire conversation, and the whole point of this pattern evaporates.
  • Conversations under 1,024 tokens. That is the new minimum cacheable prompt length on Opus 4.8 (lower than 4.7). If your test conversation is shorter, cache_read_input_tokens and cache_creation_input_tokens will both stay at zero and you will think it is broken. It is not. Make the conversation longer.
  • Setting temperature, top_p, or top_k. Same restriction as Opus 4.7: any non-default value returns 400. Steer behavior with prompting and the effort parameter instead.
  • Setting thinking.budget_tokens. Also rejected on 4.8. Use thinking: {"type": "adaptive"} and let effort drive depth.
  • Stacking two system turns back-to-back. 400 error. If you have two updates at the same boundary, concatenate them into one message.
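
The sampling and budget_tokens pitfalls often come from request parameters carried over from older models. A defensive sketch, assuming request parameters are held in a plain dict before being passed to create(): strip_rejected_params is a hypothetical helper, and it removes the sampling keys outright, which is more conservative than the "non-default value" rule described above.

```python
REJECTED_TOP_LEVEL = ("temperature", "top_p", "top_k")


def strip_rejected_params(params: dict) -> tuple[dict, list[str]]:
    """Return a cleaned copy of params plus the names of what was removed."""
    cleaned = dict(params)  # shallow copy, the caller's dict is left alone
    removed = []
    for key in REJECTED_TOP_LEVEL:
        if key in cleaned:
            del cleaned[key]
            removed.append(key)
    thinking = cleaned.get("thinking")
    if isinstance(thinking, dict) and "budget_tokens" in thinking:
        # Swap a fixed thinking budget for adaptive thinking.
        cleaned["thinking"] = {"type": "adaptive"}
        removed.append("thinking.budget_tokens")
    return cleaned, removed


legacy = {
    "model": "claude-opus-4-8",
    "max_tokens": 512,
    "temperature": 0.2,
    "thinking": {"type": "enabled", "budget_tokens": 4000},
}
cleaned, removed = strip_rejected_params(legacy)
print(removed)
print(cleaned)
```

Logging the removed list makes it visible when an old configuration is still being fed into the new model, instead of silently changing behavior.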

Quick reference

Item                                       | Value on Opus 4.8
-------------------------------------------+-----------------------------------------------
API model ID                               | claude-opus-4-8
Context window                             | 1M tokens (200k on Microsoft Foundry)
Max output tokens                          | 128k
Standard price (input / output, per MTok)  | $5 / $25
Fast mode price (input / output, per MTok) | $10 / $50
Min cacheable prompt length                | 1,024 tokens
Effort default                             | high (override with low / medium / xhigh / max)
Thinking mode                              | adaptive only (no budget_tokens)
Sampling params (temp / top_p / top_k)     | Not supported; returns 400
Mid-conv system message availability       | Claude API, Claude Platform on AWS
Mid-conv system message NOT available on   | Amazon Bedrock, Vertex AI, Microsoft Foundry
Beta header required                       | None

Where to go from here

  • Build the orchestration mode example — a session-level mode that grants standing permission to launch parallel subagents, with a refresher every few turns.
  • Wire stop_details into your client so refusals can be routed to a human-review queue instead of treated as generic errors.
  • Add cache diagnostics to your test harness so a missed cache hit fails CI instead of quietly burning budget in production.
  • If you are still on Opus 4.7, walk the migration guide end-to-end. Most apps need no code change beyond the model string, but tool-triggering and compaction behavior moved enough to be worth re-running your evals.
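
For the cache-diagnostics item, one possible starting point is a check over the usage objects collected during a test run. This is a sketch with assumptions: find_cold_turns is a hypothetical helper, SimpleNamespace stands in for the SDK's usage objects, and the rule it enforces (once the cache has been written or read, every later turn must read something) is a design choice you may want to loosen. The healthy figures reuse the Step 4 example output; the broken run is invented.

```python
from types import SimpleNamespace


def find_cold_turns(usages) -> list[int]:
    """Return indexes of turns that read nothing after the cache was warmed."""
    cold = []
    warmed = False
    for i, usage in enumerate(usages):
        read = usage.cache_read_input_tokens or 0
        created = usage.cache_creation_input_tokens or 0
        if warmed and read == 0:
            cold.append(i)
        if read > 0 or created > 0:
            warmed = True
    return cold


def usage(read: int, created: int) -> SimpleNamespace:
    """Build a stand-in usage object for the examples below."""
    return SimpleNamespace(
        cache_read_input_tokens=read, cache_creation_input_tokens=created
    )


# Figures from the Step 4 example output (turns 0, 2, 3, 4).
healthy = [usage(0, 0), usage(1187, 412), usage(1604, 256), usage(2089, 198)]
# Made-up run where the history was edited before the last request.
broken = [usage(0, 0), usage(0, 1500), usage(1500, 200), usage(0, 1900)]

print(find_cold_turns(healthy))
print(find_cold_turns(broken))
```

In a test harness, asserting that find_cold_turns returns an empty list turns a silent cache regression into a failed build.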

If you build something interesting on top of mid-conversation system messages, drop a comment. The pattern is one day old and the playbook is being written right now.
