GLM-5.2 Open Weights: Route Reasoning Effort by Task — ContentBuffer guide

Kodetra Technologies · 9 min read · Intermediate

Summary

Build a cost-aware GLM-5.2 agent that routes thinking effort per task and calls tools.

On June 17, 2026, Z.ai dropped the full MIT-licensed weights for GLM-5.2, and the dev timelines lit up. The reason is simple: it is a ~753B-parameter mixture-of-experts model that scores at the top of the open-source pack on long-horizon coding, lands second on Code Arena, and trails Claude Opus 4.8 by roughly a single point on independent multi-step coding evals. The kicker is price. VentureBeat reports it beats GPT-5.5 on long-horizon coding at about one-sixth of the cost, and the weights ship under MIT, so you can also self-host.

But raw capability is only half the story. GLM-5.2 is a reasoning model with a control most teams ignore: you can turn thinking off for cheap, instant replies, or push it to max effort when a task actually needs deep reasoning. Set that knob blindly and you either burn money on simple tasks or get shallow answers on hard ones.

This guide builds a cost-aware coding agent on top of GLM-5.2. It routes each task to the right reasoning effort, calls tools in a loop to verify its own work, and tracks exactly what every run costs. Everything here uses the OpenAI-compatible API, so the code drops into any stack that already speaks Chat Completions.

Prerequisites

  • Python 3.9+ and pip install openai (the standard OpenAI SDK; no Z.ai-specific client needed).
  • A Z.ai API key from the dashboard at z.ai. Store it in an environment variable, never in source.
  • Basic familiarity with the OpenAI Chat Completions request/response shape (messages, tool_calls, usage).
  • Optional: if you would rather self-host, the MIT weights live on Hugging Face as zai-org/GLM-5.2 and run on vLLM, SGLang, or Ollama (glm-5.2). The code below works unchanged against a local OpenAI-compatible server.

Why GLM-5.2 is the open-weights story of the week

GLM-5.2 matters because it closes the practical gap between open and closed models on the workload teams actually pay for: long-horizon, tool-using coding. The headline jump is Terminal-Bench 2.1, where Z.ai reports 81.0, up from 62.0 on GLM-5.1. A few verified numbers from Z.ai's published results put it in context:

| Benchmark | GLM-5.2 | Comparison |
|---|---|---|
| Terminal-Bench 2.1 | 81.0 | GLM-5.1 was 62.0 |
| SWE-bench Pro | 62.1 | GPT-5.5: 58.6 |
| MCP-Atlas (tool use) | 77.0 | near Claude Opus 4.8 |
| AIME 2026 (math) | 99.2 | frontier-tier |
| Code Arena rank | #2 | trails Opus 4.8 by ~1 pt |

Pair that with MIT weights, a 1M-token context, and output pricing of $4.40 per 1M tokens, and the economic pitch is hard to ignore: frontier-class coding at a fraction of closed-model cost, with a self-host escape hatch. The rest of this guide turns that into working code.

Step 1: Point the OpenAI SDK at GLM-5.2

GLM-5.2 speaks the OpenAI wire format. You only change the base URL and the model id. Set your key first:

export ZAI_API_KEY="your-glm-5.2-api-key"

Then create a client. The base URL is the pay-as-you-go endpoint, and the model id is simply glm-5.2 (use glm-5.2[1m] only when you actually need the full 1M-token context):

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],
    base_url="https://api.z.ai/api/paas/v4/",
)

resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {"role": "system", "content": "You are a concise backend engineer."},
        {"role": "user", "content": "Explain idempotency keys in 3 sentences."},
    ],
)
print(resp.choices[0].message.content)

That is the entire integration. Retries, logging, and helper code you already wrote for OpenAI carry over because the response carries the usual id, choices, and usage fields.

Step 2: The three reasoning modes (and what they cost)

GLM-5.2 exposes two knobs through the SDK's extra_body passthrough: thinking (enabled or disabled) and, when enabled, reasoning_effort (high or max). Z.ai recommends max for coding. That gives you three practical tiers:

# Tier 1 - thinking OFF: fast, cheapest. Good for classification, routing, short rewrites.
extra_body={"thinking": {"type": "disabled"}}

# Tier 2 - thinking ON, HIGH effort: balanced reasoning for everyday tasks.
extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"}

# Tier 3 - thinking ON, MAX effort: deepest reasoning for hard refactors and multi-step work.
extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "max"}

The cost trap: reasoning tokens are billed as output tokens. GLM-5.2 charges $1.40 per 1M input and $4.40 per 1M output, so a max-effort call can cost several times a thinking-disabled one for the same prompt. The whole point of routing is to pay for deep reasoning only when the task earns it.
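To make the trap concrete, here is a rough back-of-the-envelope estimate using the prices quoted above. The token counts are illustrative assumptions, not measurements, and `estimate_cost` is a hypothetical helper rather than part of any SDK:

```python
# Prices quoted in this guide (USD per token); confirm current pricing before relying on them.
PRICE_IN = 1.40 / 1_000_000
PRICE_OUT = 4.40 / 1_000_000

def estimate_cost(prompt_tokens: int, answer_tokens: int, reasoning_tokens: int = 0) -> float:
    """Estimate one call's cost; reasoning tokens are billed at the output rate."""
    output_tokens = answer_tokens + reasoning_tokens
    return prompt_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Assumed token counts for the same 500-token prompt and 200-token answer.
scenarios = {
    "thinking off": estimate_cost(500, 200),
    "high effort": estimate_cost(500, 200, reasoning_tokens=1_000),
    "max effort": estimate_cost(500, 200, reasoning_tokens=4_000),
}
for name, cost in scenarios.items():
    print(f"{name:>12}: ${cost:.6f}")
```

Under these assumed counts the max-effort call comes out at about twelve times the thinking-disabled one. Real reasoning lengths vary widely by task, so measure from the usage object before trusting any ratio.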

Step 3: Build the effort router

Rather than hard-coding a tier, let the model triage the task for you with a single cheap, thinking-disabled call, then map its verdict to an effort tier. This is the cost-aware core:

EFFORT = {
    "trivial": {"thinking": {"type": "disabled"}},
    "normal":  {"thinking": {"type": "enabled"}, "reasoning_effort": "high"},
    "hard":    {"thinking": {"type": "enabled"}, "reasoning_effort": "max"},
}

def classify(task: str) -> str:
    """Cheap triage: one thinking-disabled call returns a single word."""
    r = client.chat.completions.create(
        model="glm-5.2",
        messages=[
            {"role": "system", "content":
                "Classify the coding task difficulty. Reply with exactly one word: "
                "trivial, normal, or hard. Multi-file refactors, debugging, and "
                "algorithm design are hard. Renames and one-liners are trivial."},
            {"role": "user", "content": task},
        ],
        extra_body={"thinking": {"type": "disabled"}},
        max_tokens=4,
    )
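    # content can be None or carry trailing punctuation ("Hard."), so normalize defensively
    label = (r.choices[0].message.content or "").strip().lower().strip(".")
    return label if label in EFFORT else "normal"  # safe fallback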

The triage call itself is nearly free because thinking is off and you cap it at a handful of tokens. It pays for itself the first time it stops a trivial rename from triggering a max-effort reasoning run.
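One-word replies are not guaranteed: a model may answer "Hard." or wrap the label in a short sentence. A more tolerant parser could look like the sketch below. `parse_label` and `VALID_LABELS` are hypothetical names introduced here for illustration, and the fallback to "normal" mirrors the choice made in classify:

```python
import re

VALID_LABELS = ("trivial", "normal", "hard")

def parse_label(raw, default="normal"):
    """Map a raw classifier reply to a known tier, falling back to `default`."""
    if not raw:
        return default
    # Scan lowercase words in order and return the first one that is a valid label.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_LABELS:
            return word
    return default

print(parse_label("Hard."))               # hard
print(parse_label("I'd call it trivial"))  # trivial
print(parse_label(None))                  # normal
```

Falling back to the middle tier keeps a garbled reply from silently routing a hard task to the cheapest setting.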

Step 4: Add a tool-calling loop

A coding agent that cannot run code is just a chatbot. GLM-5.2 scores 77.0 on MCP-Atlas for tool use, close to Claude Opus 4.8, and it follows the standard OpenAI two-step: the model emits tool_calls, you execute them, you feed results back, and it continues until it answers. Here is a minimal Python tool (not yet sandboxed; see the pitfalls below) plus the loop:

import io, contextlib, json

def run_python(code: str) -> str:
    """Execute a short snippet and capture stdout. Sandbox properly in production."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue().strip() or "(no output)"
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string", "description": "Python source"}},
            "required": ["code"],
        },
    },
}]
DISPATCH = {"run_python": run_python}

Now the loop. Note the two details that trip people up with reasoning models: append the assistant message before the tool result, and keep that message intact so the model's reasoning stays in context across turns.

PRICE_IN, PRICE_OUT = 1.40 / 1e6, 4.40 / 1e6  # USD per token

def agent(task: str, max_turns: int = 6):
    tier = classify(task)
    messages = [
        {"role": "system", "content":
            "You are a coding agent. When useful, call run_python to verify "
            "your work before answering. Keep answers tight."},
        {"role": "user", "content": task},
    ]
    spend = 0.0
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="glm-5.2",
            messages=messages,
            tools=TOOLS,
            extra_body=EFFORT[tier],
        )
        u = resp.usage
        spend += u.prompt_tokens * PRICE_IN + u.completion_tokens * PRICE_OUT
        msg = resp.choices[0].message
        messages.append(msg.model_dump())          # keep reasoning + tool_calls in context
        if not msg.tool_calls:
            return {"answer": msg.content, "tier": tier, "cost_usd": round(spend, 6)}
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = DISPATCH[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    return {"answer": "max turns reached", "tier": tier, "cost_usd": round(spend, 6)}
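The loop above assumes the model always names a known tool and emits valid JSON arguments. If either assumption fails, `json.loads` or the `DISPATCH` lookup raises and the run dies. A defensive wrapper might look like this sketch, where `safe_dispatch` is a hypothetical helper that returns an error string so the model can see the failure and retry:

```python
import json

def safe_dispatch(name: str, raw_args: str, dispatch: dict) -> str:
    """Run a tool call defensively and always return a string for the tool message."""
    handler = dispatch.get(name)
    if handler is None:
        return f"ERROR: unknown tool {name!r}"
    try:
        args = json.loads(raw_args or "{}")
    except json.JSONDecodeError as e:
        return f"ERROR: invalid JSON arguments: {e}"
    if not isinstance(args, dict):
        return "ERROR: arguments must be a JSON object"
    try:
        return str(handler(**args))
    except TypeError as e:
        # Typically wrong or missing parameters; also catches TypeErrors raised by the tool itself.
        return f"ERROR: bad arguments: {e}"

# Stand-in tool table for demonstration only.
demo = {"echo": lambda text: text.upper()}
print(safe_dispatch("echo", '{"text": "ok"}', demo))  # OK
print(safe_dispatch("missing", "{}", demo))           # ERROR: unknown tool 'missing'
```

Inside the agent loop, the call would replace the two lines that parse arguments and invoke `DISPATCH` directly, passing `call.function.name`, `call.function.arguments`, and `DISPATCH`.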

Step 5: A worked example

Hand the agent a real debugging task. It triages to hard, writes a fix, runs it through run_python to confirm, then reports back:

task = (
    "This function should return the running maximum of a list, but it returns "
    "the wrong values. Fix it and prove the fix with a test case.\n\n"
    "def running_max(xs):\n"
    "    out = []\n"
    "    m = 0\n"            # bug: assumes non-negative inputs
    "    for x in xs:\n"
    "        m = max(m, x)\n"
    "        out.append(m)\n"
    "    return out\n"
)

result = agent(task)
print(result["tier"], "->", f'${result["cost_usd"]}')
print(result["answer"])

Example output (yours will vary slightly run to run):

hard -> $0.014213
The bug is initializing m = 0, which breaks on all-negative input.
Seed with the first element instead:

def running_max(xs):
    if not xs:
        return []
    out, m = [], xs[0]
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

Verified: running_max([-5, -2, -9, -1]) -> [-5, -2, -2, -1]

The agent caught the classic m = 0 seeding bug, fixed it, and the tool call confirmed the corrected output on a negative-only input. The whole run cost under two cents even at max effort, and routing keeps that level of spend reserved for tasks that need it.
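The fix is easy to check without the agent. This standalone snippet reproduces the bug from the task and the corrected version from the example output, then asserts the behavior on a few inputs (the extra test inputs are additions for illustration):

```python
def running_max_buggy(xs):
    out, m = [], 0  # original bug: seeding with 0 assumes non-negative inputs
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

def running_max(xs):
    if not xs:
        return []
    out, m = [], xs[0]  # seed with the first element instead
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

negatives = [-5, -2, -9, -1]
assert running_max_buggy(negatives) == [0, 0, 0, 0]   # the bug: 0 never gets replaced
assert running_max(negatives) == [-5, -2, -2, -1]
assert running_max([]) == []
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
print("all checks passed")
```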

Step 6: Track real cost from the usage object

Every non-streamed response carries a usage object. That is your billing source of truth, not an estimate. To sanity-check a single call:

r = client.chat.completions.create(
    model="glm-5.2",
    messages=[{"role": "user", "content": "Summarize REST vs gRPC in 4 bullets."}],
    extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"},
)
u = r.usage
cost = u.prompt_tokens * 1.40/1e6 + u.completion_tokens * 4.40/1e6
print(u.prompt_tokens, u.completion_tokens, f"${cost:.6f}")
# e.g. 24 612 $0.002726  (reasoning tokens are inside completion_tokens)

Because reasoning tokens land in completion_tokens, a max-effort call costs more than a thinking-disabled one even for the same prompt. That single fact is why the router exists.
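To see where the money goes across many runs, the per-call arithmetic can be accumulated by tier. The sketch below uses a hypothetical `CostLedger` class and a fake usage object built with `SimpleNamespace`; it only assumes the `prompt_tokens` and `completion_tokens` fields used elsewhere in this guide:

```python
from dataclasses import dataclass, field
from types import SimpleNamespace

@dataclass
class CostLedger:
    """Accumulate spend per tier from usage objects (prices are this guide's figures)."""
    price_in: float = 1.40 / 1_000_000
    price_out: float = 4.40 / 1_000_000
    by_tier: dict = field(default_factory=dict)

    def record(self, tier: str, usage) -> float:
        # Reasoning tokens are already included in completion_tokens.
        cost = usage.prompt_tokens * self.price_in + usage.completion_tokens * self.price_out
        self.by_tier[tier] = self.by_tier.get(tier, 0.0) + cost
        return cost

    def total(self) -> float:
        return sum(self.by_tier.values())

ledger = CostLedger()
fake_usage = SimpleNamespace(prompt_tokens=24, completion_tokens=612)  # stand-in for resp.usage
ledger.record("normal", fake_usage)
ledger.record("normal", fake_usage)
print(ledger.by_tier, f"${ledger.total():.6f}")
```

In the agent, `ledger.record(tier, resp.usage)` would sit where `spend` is updated, which makes it easy to spot a tier that is absorbing more budget than expected.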

Common pitfalls and gotchas

  • Forgetting extra_body. The OpenAI SDK does not know about thinking or reasoning_effort, so they must go inside extra_body in Python. In raw curl, put them at the top level of the JSON body next to model.
  • Reasoning tokens are output tokens. They are billed at $4.40 per 1M, not the input rate. A naive agent that runs every task at max effort can cost 3-5x what a routed one does for identical results.
  • Dropping the assistant message before the tool result. You must append the assistant message (with its tool_calls) to messages before appending the role: tool result. Reverse the order and the API rejects the turn.
  • Stripping reasoning between turns. GLM-5.2 is thinking-first. If your framework strips the model's reasoning content between tool calls, multi-step quality drops. Keep the full assistant message (model_dump() preserves it).
  • Assuming a huge output budget. The context window is 1M tokens, but max output is up to 128K per Z.ai docs (verify live). Long generations can truncate; check finish_reason.
  • Reaching for the [1m] model by default. glm-5.2[1m] unlocks the 1M window but you pay for what you send. Use plain glm-5.2 unless you genuinely feed a giant context.
  • Expecting vision. As of June 2026 there is no confirmed vision variant. The API is text in, text out. Do not send image inputs.
  • Leaking the key. A leaked key bills against your account at output prices. Keep it in an env var, out of git, and rotate if exposed.
  • Unsandboxed exec. The run_python tool here uses bare exec for clarity. In production, run tool code in a container or a restricted subprocess with timeouts.
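As a first step away from bare exec, tool code can run in a separate interpreter with a wall-clock timeout. The sketch below is one possible drop-in for run_python; `run_python_subprocess` is a hypothetical name, and process isolation with a timeout is not a security boundary on its own, so a container or similar is still needed for untrusted code:

```python
import subprocess
import sys

def run_python_subprocess(code: str, timeout_s: float = 5.0) -> str:
    """Run a snippet in a separate interpreter with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            # -I: isolated mode, ignores PYTHON* environment variables and the user site directory
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"ERROR: timed out after {timeout_s}s"
    if proc.returncode != 0:
        # The last stderr line usually holds the exception type and message.
        lines = proc.stderr.strip().splitlines()
        return "ERROR: " + (lines[-1] if lines else f"exit code {proc.returncode}")
    return proc.stdout.strip() or "(no output)"

print(run_python_subprocess("print(1 + 1)"))                     # 2
print(run_python_subprocess("while True: pass", timeout_s=0.5))  # ERROR: timed out after 0.5s
```

The return shape matches the in-process version (stdout on success, an "ERROR: ..." string on failure), so the tool schema and the loop do not need to change.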

Quick reference

| Setting | Value |
|---|---|
| Base URL (SDK) | https://api.z.ai/api/paas/v4/ |
| Model id | glm-5.2 (1M variant: glm-5.2[1m]) |
| Auth header | Authorization: Bearer $ZAI_API_KEY |
| Thinking off | extra_body={"thinking": {"type": "disabled"}} |
| Thinking on | extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"} (or "max") |
| Context window | 1M tokens (1,048,576) |
| Max output | up to 128K (verify live) |
| Pricing | $1.40 /1M input, $4.40 /1M output, ~$0.26 /1M cached |
| Weights / license | zai-org/GLM-5.2 on Hugging Face, MIT |
| OpenRouter / Ollama | z-ai/glm-5.2 / glm-5.2 |

Next steps

  • Swap the in-process run_python for a real sandbox (Docker, gVisor, or a subprocess with a timeout) and add a read_file / write_file pair to turn this into a repo agent.
  • Add a fourth tier that escalates from high to max automatically when a task fails its own test, so effort tracks difficulty observed at runtime, not just predicted up front.
  • Point Claude Code at GLM-5.2 by setting ANTHROPIC_BASE_URL to the Anthropic-compatible endpoint at https://api.z.ai/api/coding/paas/v4 and selecting the glm-5.2[1m] model.
  • Log tier, cost_usd, and acceptance per task, then tune the classifier prompt against your real workload to push more work into cheaper tiers without losing quality.
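The runtime-escalation idea from the list above can be sketched as a small retry wrapper. Everything here is hypothetical scaffolding: `attempt` stands in for a version of the agent that accepts an explicit tier (the agent in this guide picks its own tier via classify), and `passed` stands in for whatever acceptance check you use, such as the task's own test:

```python
# Next tier to try after a failure; None means there is nowhere left to escalate.
ESCALATION = {"trivial": "normal", "normal": "hard", "hard": None}

def run_with_escalation(task, start_tier, attempt, passed):
    """Retry `attempt(task, tier)` at increasing tiers until `passed(result)` is true."""
    tier, history = start_tier, []
    while tier is not None:
        result = attempt(task, tier)
        history.append(tier)
        if passed(result):
            return result, history
        tier = ESCALATION[tier]
    return None, history  # every tier failed the check

# Stub attempt that only "succeeds" at the hard tier, for demonstration.
stub_attempt = lambda task, tier: {"tier": tier, "ok": tier == "hard"}
result, history = run_with_escalation("fix the bug", "trivial", stub_attempt, lambda r: r["ok"])
print(result, history)
# {'tier': 'hard', 'ok': True} ['trivial', 'normal', 'hard']
```

Each failed attempt still costs tokens, so this pays off only when most tasks pass at the tier the classifier predicted.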

GLM-5.2 facts (model id, base URL, thinking parameters, pricing, benchmarks) verified against Z.ai's documentation and launch coverage as of June 18, 2026. Prices and limits change; confirm live before you ship.
