GLM-5.2 Open Weights: Route Reasoning Effort by Task

On June 17, 2026, Z.ai dropped the full MIT-licensed weights for GLM-5.2, and the dev timelines lit up. The reason is simple: it is a ~753B-parameter mixture-of-experts model that scores at the top of the open-source pack on long-horizon coding, lands second on Code Arena, and trails Claude Opus 4.8 by roughly a single point on independent multi-step coding evals. The kicker is price. VentureBeat reports it beats GPT-5.5 on long-horizon coding at about one-sixth of the cost, and the weights ship under MIT, so you can also self-host.

But raw capability is only half the story. GLM-5.2 is a reasoning model with a control most teams ignore: you can turn thinking off for cheap, instant replies, or push it to max effort when a task actually needs deep reasoning. Set that knob blindly and you either burn money on simple tasks or get shallow answers on hard ones.

This guide builds a cost-aware coding agent on top of GLM-5.2. It routes each task to the right reasoning effort, calls tools in a loop to verify its own work, and tracks exactly what every run costs. Everything here uses the OpenAI-compatible API, so the code drops into any stack that already speaks Chat Completions.

Prerequisites

Python 3.9+ and pip install openai (the standard OpenAI SDK; no Z.ai-specific client needed).
A Z.ai API key from the dashboard at z.ai. Store it in an environment variable, never in source.
Basic familiarity with the OpenAI Chat Completions request/response shape (messages, tool_calls, usage).
Optional: if you would rather self-host, the MIT weights live on Hugging Face as zai-org/GLM-5.2 and run on vLLM, SGLang, or Ollama (glm-5.2). The code below works unchanged against a local OpenAI-compatible server.

Why GLM-5.2 is the open-weights story of the week

GLM-5.2 matters because it closes the practical gap between open and closed models on the workload teams actually pay for: long-horizon, tool-using coding. The headline jump is Terminal-Bench 2.1, where Z.ai reports 81.0, up from 62.0 on GLM-5.1. A few verified numbers from Z.ai's published results put it in context:

Benchmark	GLM-5.2	Comparison
Terminal-Bench 2.1	81.0	GLM-5.1 was 62.0
SWE-bench Pro	62.1	GPT-5.5: 58.6
MCP-Atlas (tool use)	77.0	near Claude Opus 4.8
AIME 2026 (math)	99.2	frontier-tier
Code Arena rank	#2	trails Opus 4.8 by ~1 pt

Pair that with MIT weights, a 1M-token context, and output pricing of $4.40 per 1M tokens, and the economic pitch is hard to ignore: frontier-class coding at a fraction of closed-model cost, with a self-host escape hatch. The rest of this guide turns that into working code.

Step 1: Point the OpenAI SDK at GLM-5.2

GLM-5.2 speaks the OpenAI wire format. You only change the base URL and the model id. Set your key first:

export ZAI_API_KEY="your-glm-5.2-api-key"

Then create a client. The base URL is the pay-as-you-go endpoint, and the model id is simply glm-5.2 (use glm-5.2[1m] only when you actually need the full 1M-token context):

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],
    base_url="https://api.z.ai/api/paas/v4/",
)

resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {"role": "system", "content": "You are a concise backend engineer."},
        {"role": "user", "content": "Explain idempotency keys in 3 sentences."},
    ],
)
print(resp.choices[0].message.content)

That is the entire integration. Retries, logging, and helper code you already wrote for OpenAI carry over because the response carries the usual id, choices, and usage fields.

Step 2: The three reasoning modes (and what they cost)

GLM-5.2 exposes two knobs through the SDK's extra_body passthrough: thinking (enabled or disabled) and, when enabled, reasoning_effort (high or max). Z.ai recommends max for coding. That gives you three practical tiers:

# Tier 1 - thinking OFF: fast, cheapest. Good for classification, routing, short rewrites.
extra_body={"thinking": {"type": "disabled"}}

# Tier 2 - thinking ON, HIGH effort: balanced reasoning for everyday tasks.
extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"}

# Tier 3 - thinking ON, MAX effort: deepest reasoning for hard refactors and multi-step work.
extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "max"}

The cost trap: reasoning tokens are billed as output tokens. GLM-5.2 charges $1.40 per 1M input and $4.40 per 1M output, so a max-effort call can cost several times a thinking-disabled one for the same prompt. The whole point of routing is to pay for deep reasoning only when the task earns it.
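
To make the gap concrete, here is a small arithmetic sketch using the prices quoted above. The token counts are hypothetical, chosen only to illustrate how a reasoning trace inflates the output side of the bill; real counts depend on the prompt and the model.

```python
# Prices quoted in this guide (USD per 1M tokens); confirm live before relying on them.
PRICE_IN_PER_M = 1.40
PRICE_OUT_PER_M = 4.40

def call_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one call in USD; reasoning tokens count as completion tokens."""
    return (prompt_tokens * PRICE_IN_PER_M + completion_tokens * PRICE_OUT_PER_M) / 1e6

# Hypothetical token counts for the same 500-token prompt (illustrative only).
cheap = call_cost(500, 150)    # thinking disabled: short answer only
deep = call_cost(500, 2000)    # max effort: answer plus a long reasoning trace

print(f"disabled: ${cheap:.6f}")
print(f"max:      ${deep:.6f}")
print(f"ratio:    {deep / cheap:.1f}x")
```

The input cost is identical in both cases; the whole difference comes from completion tokens, which is where reasoning is billed.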

Step 3: Build the effort router

Rather than hard-coding a tier, let the model triage the task for you with a single cheap, thinking-disabled call, then map its verdict to an effort tier. This is the cost-aware core:

EFFORT = {
    "trivial": {"thinking": {"type": "disabled"}},
    "normal":  {"thinking": {"type": "enabled"}, "reasoning_effort": "high"},
    "hard":    {"thinking": {"type": "enabled"}, "reasoning_effort": "max"},
}

def classify(task: str) -> str:
    """Cheap triage: one thinking-disabled call returns a single word."""
    r = client.chat.completions.create(
        model="glm-5.2",
        messages=[
            {"role": "system", "content":
                "Classify the coding task difficulty. Reply with exactly one word: "
                "trivial, normal, or hard. Multi-file refactors, debugging, and "
                "algorithm design are hard. Renames and one-liners are trivial."},
            {"role": "user", "content": task},
        ],
        extra_body={"thinking": {"type": "disabled"}},
        max_tokens=4,
    )
    # content can be None, and the model may add a trailing period
    label = (r.choices[0].message.content or "").strip().lower().rstrip(".")
    return label if label in EFFORT else "normal"  # safe fallback

The triage call itself is nearly free because thinking is off and you cap it at a handful of tokens. It pays for itself the first time it stops a trivial rename from triggering a max-effort reasoning run.
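
Triage replies are not always a clean single word: models sometimes add punctuation, quotes, or a capital letter. The helper below is one possible way to normalize the reply before the lookup. The name normalize_label and the VALID_TIERS tuple are illustrative, not part of any Z.ai or OpenAI API.

```python
import string
from typing import Optional

VALID_TIERS = ("trivial", "normal", "hard")  # mirrors the EFFORT keys above

def normalize_label(raw: Optional[str], default: str = "normal") -> str:
    """Map a raw triage reply to a tier name, falling back to `default`."""
    if not raw:
        return default
    words = raw.strip().lower().split()
    if not words:
        return default
    # Only the first word is considered; surrounding punctuation is dropped.
    first = words[0].strip(string.punctuation)
    return first if first in VALID_TIERS else default

print(normalize_label("Hard."))      # punctuation and case are tolerated
print(normalize_label("no idea"))    # anything unrecognized falls back
```

Falling back to the middle tier keeps a confused classifier from silently sending hard tasks to the thinking-disabled path.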

Step 4: Add a tool-calling loop

A coding agent that cannot run code is just a chatbot. GLM-5.2 scores 77.0 on MCP-Atlas for tool use, close to Claude Opus 4.8, and it follows the standard OpenAI two-step: the model emits tool_calls, you execute them, you feed results back, and it continues until it answers. Here is a minimal Python tool plus the loop. The tool uses bare exec for clarity and is not sandboxed; isolate it before production use:

import io, contextlib, json

def run_python(code: str) -> str:
    """Execute a short snippet and capture stdout. Sandbox properly in production."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue().strip() or "(no output)"
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string", "description": "Python source"}},
            "required": ["code"],
        },
    },
}]
DISPATCH = {"run_python": run_python}

Now the loop. Note the two details that trip people up with reasoning models: append the assistant message before the tool result, and keep that message intact so the model's reasoning stays in context across turns.

PRICE_IN, PRICE_OUT = 1.40 / 1e6, 4.40 / 1e6  # USD per token

def agent(task: str, max_turns: int = 6):
    tier = classify(task)
    messages = [
        {"role": "system", "content":
            "You are a coding agent. When useful, call run_python to verify "
            "your work before answering. Keep answers tight."},
        {"role": "user", "content": task},
    ]
    spend = 0.0
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="glm-5.2",
            messages=messages,
            tools=TOOLS,
            extra_body=EFFORT[tier],
        )
        u = resp.usage
        spend += u.prompt_tokens * PRICE_IN + u.completion_tokens * PRICE_OUT
        msg = resp.choices[0].message
        messages.append(msg.model_dump())          # keep reasoning + tool_calls in context
        if not msg.tool_calls:
            return {"answer": msg.content, "tier": tier, "cost_usd": round(spend, 6)}
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = DISPATCH[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    return {"answer": "max turns reached", "tier": tier, "cost_usd": round(spend, 6)}
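
The loop above assumes the model always names a known tool and sends valid JSON arguments. When that assumption fails, json.loads or the DISPATCH lookup raises and the run dies. A defensive variant, sketched below with a hypothetical helper named safe_dispatch, returns the failure as a string so the model can see it and retry.

```python
import json
from typing import Callable, Dict

def safe_dispatch(name: str, raw_args: str, dispatch: Dict[str, Callable[..., str]]) -> str:
    """Run one tool call; return an error string instead of raising."""
    fn = dispatch.get(name)
    if fn is None:
        return f"ERROR: unknown tool {name!r}"
    try:
        args = json.loads(raw_args or "{}")
    except json.JSONDecodeError as e:
        return f"ERROR: arguments are not valid JSON: {e}"
    if not isinstance(args, dict):
        return "ERROR: arguments must be a JSON object"
    try:
        return str(fn(**args))
    except TypeError as e:  # typically wrong or missing keyword arguments
        return f"ERROR: bad arguments for {name}: {e}"
```

Inside the loop, the two lines that parse arguments and call DISPATCH would become result = safe_dispatch(call.function.name, call.function.arguments, DISPATCH).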

Step 5: A worked example

Hand the agent a real debugging task. It triages to hard, writes a fix, runs it through run_python to confirm, then reports back:

task = (
    "This function should return the running maximum of a list, but it returns "
    "the wrong values. Fix it and prove the fix with a test case.\n\n"
    "def running_max(xs):\n"
    "    out = []\n"
    "    m = 0\n"            # bug: assumes non-negative inputs
    "    for x in xs:\n"
    "        m = max(m, x)\n"
    "        out.append(m)\n"
    "    return out\n"
)

result = agent(task)
print(result["tier"], "->", f'${result["cost_usd"]}')
print(result["answer"])

Example output (yours will vary slightly run to run):

hard -> $0.014213
The bug is initializing m = 0, which breaks on all-negative input.
Seed with the first element instead:

def running_max(xs):
    if not xs:
        return []
    out, m = [], xs[0]
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

Verified: running_max([-5, -2, -9, -1]) -> [-5, -2, -2, -1]

The agent caught the classic m = 0 seeding bug, fixed it, and the tool call confirmed the corrected output on a negative-only input. The whole thing cost under two cents because max effort only kicked in for a task that needed it.
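
If you want to check the fix yourself without calling the API, the following standalone script compares the original function with the corrected one. The name running_max_buggy is introduced here only to keep both versions side by side.

```python
def running_max_buggy(xs):
    out, m = [], 0              # original seed: wrong for negative inputs
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

def running_max(xs):
    if not xs:
        return []
    out, m = [], xs[0]          # seed with the first element
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

negatives = [-5, -2, -9, -1]
assert running_max_buggy(negatives) == [0, 0, 0, 0]    # the bug
assert running_max(negatives) == [-5, -2, -2, -1]      # the fix
assert running_max([]) == []
assert running_max([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
print("all checks passed")
```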

Step 6: Track real cost from the usage object

Every non-streamed response carries a usage object. That is your billing source of truth, not an estimate. To sanity-check a single call:

r = client.chat.completions.create(
    model="glm-5.2",
    messages=[{"role": "user", "content": "Summarize REST vs gRPC in 4 bullets."}],
    extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"},
)
u = r.usage
cost = u.prompt_tokens * 1.40/1e6 + u.completion_tokens * 4.40/1e6
print(u.prompt_tokens, u.completion_tokens, f"${cost:.6f}")
# e.g. 24 612 $0.002726  (reasoning tokens are inside completion_tokens)

Because reasoning tokens land in completion_tokens, a max-effort call costs more than a thinking-disabled one even for the same prompt. That single fact is why the router exists.
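
To see where the money goes across many runs, it helps to aggregate usage by tier. The ledger below is a minimal sketch; CostLedger is a hypothetical helper, and the prices are the ones quoted in this guide. You would feed it the prompt_tokens and completion_tokens from each response's usage object.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict

PRICE_IN, PRICE_OUT = 1.40 / 1e6, 4.40 / 1e6  # USD per token, as quoted in this guide

@dataclass
class CostLedger:
    """Accumulates token counts and spend per effort tier."""
    tokens_in: Dict[str, int] = field(default_factory=lambda: defaultdict(int))
    tokens_out: Dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def record(self, tier: str, prompt_tokens: int, completion_tokens: int) -> None:
        self.tokens_in[tier] += prompt_tokens
        self.tokens_out[tier] += completion_tokens

    def cost(self, tier: str) -> float:
        return self.tokens_in[tier] * PRICE_IN + self.tokens_out[tier] * PRICE_OUT

    def total(self) -> float:
        tiers = set(self.tokens_in) | set(self.tokens_out)
        return sum(self.cost(t) for t in tiers)

ledger = CostLedger()
ledger.record("trivial", 24, 30)    # illustrative token counts
ledger.record("hard", 24, 612)
for tier in ("trivial", "hard"):
    print(tier, f"${ledger.cost(tier):.6f}")
print("total", f"${ledger.total():.6f}")
```

A per-tier breakdown makes it obvious when the classifier is sending too much work to the expensive tier.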

Common pitfalls and gotchas

Forgetting extra_body. The OpenAI SDK does not know about thinking or reasoning_effort, so they must go inside extra_body in Python. In raw curl, put them at the top level of the JSON body next to model.
Reasoning tokens are output tokens. They are billed at $4.40 per 1M, not the input rate. A naive agent that runs every task at max effort can cost several times what a routed one does, with no quality gain on the easy tasks.
Dropping the assistant message before the tool result. You must append the assistant message (with its tool_calls) to messages before appending the role: tool result. Reverse the order and the API rejects the turn.
Stripping reasoning between turns. GLM-5.2 is thinking-first. If your framework strips the model's reasoning content between tool calls, multi-step quality drops. Keep the full assistant message (model_dump() preserves it).
Assuming a huge output budget. The context window is 1M tokens, but max output is up to 128K per Z.ai docs (verify live). Long generations can truncate; check finish_reason.
Reaching for the [1m] model by default. glm-5.2[1m] unlocks the 1M window but you pay for what you send. Use plain glm-5.2 unless you genuinely feed a giant context.
Expecting vision. As of June 2026 there is no confirmed vision variant. The API is text in, text out. Do not send image inputs.
Leaking the key. Anyone holding a leaked key can run requests that bill to your account. Keep it in an env var, out of git, and rotate it if exposed.
Unsandboxed exec. The run_python tool here uses bare exec for clarity. In production, run tool code in a container or a restricted subprocess with timeouts.
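
As a first step away from bare exec, the sketch below runs the snippet in a separate interpreter with a wall-clock timeout. The function name run_python_isolated is hypothetical. Note the assumption: this gives process isolation and a time limit only. It is not a security sandbox, since the child process can still read files and open network connections, so a container is still the right choice for untrusted code.

```python
import subprocess
import sys

def run_python_isolated(code: str, timeout_s: float = 5.0) -> str:
    """Run a snippet in a separate interpreter with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode (ignores env vars, user site)
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"ERROR: timed out after {timeout_s}s"
    if proc.returncode != 0:
        # Report only the last stderr line, which names the exception.
        lines = proc.stderr.strip().splitlines()
        return "ERROR: " + (lines[-1] if lines else f"exit code {proc.returncode}")
    return proc.stdout.strip() or "(no output)"

print(run_python_isolated("print(2 + 2)"))
print(run_python_isolated("while True: pass", timeout_s=1))
```

Because it takes a code string and returns a string, it can replace run_python in DISPATCH without changing the tool schema.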

Quick reference

Setting	Value
Base URL (SDK)	https://api.z.ai/api/paas/v4/
Model id	glm-5.2 (1M variant: glm-5.2[1m])
Auth header	Authorization: Bearer $ZAI_API_KEY
Thinking off	extra_body={"thinking": {"type": "disabled"}}
Thinking on	extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"} (or "max")
Context window	1M tokens (1,048,576)
Max output	up to 128K (verify live)
Pricing	$1.40 /1M input, $4.40 /1M output, ~$0.26 /1M cached
Weights / license	zai-org/GLM-5.2 on Hugging Face, MIT
OpenRouter / Ollama	z-ai/glm-5.2 / glm-5.2

Next steps

Swap the in-process run_python for a real sandbox (Docker, gVisor, or a subprocess with a timeout) and add a read_file / write_file pair to turn this into a repo agent.
Add an escalation path that retries at max effort when a high-effort run fails its own test, so effort tracks difficulty observed at runtime, not just predicted up front.
Wire the same client into Claude Code via the Anthropic-compatible endpoint at https://api.z.ai/api/coding/paas/v4 with ANTHROPIC_BASE_URL and the glm-5.2[1m] model.
Log tier, cost_usd, and acceptance per task, then tune the classifier prompt against your real workload to push more work into cheaper tiers without losing quality.
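
The escalation idea can be sketched independently of the API. In the snippet below, attempt and passed are hypothetical callables you would supply: attempt(task, tier) would be a variant of the agent that accepts a tier instead of classifying internally, and passed(result) would be whatever acceptance check you trust, such as a test run.

```python
from typing import Callable, Dict, Optional, Sequence

def run_with_escalation(
    task: str,
    start_tier: str,
    attempt: Callable[[str, str], Dict],
    passed: Callable[[Dict], bool],
    ladder: Sequence[str] = ("trivial", "normal", "hard"),
) -> Optional[Dict]:
    """Try `start_tier` first; step up the ladder until `passed` accepts a result."""
    start = ladder.index(start_tier) if start_tier in ladder else 0
    result: Optional[Dict] = None
    for tier in ladder[start:]:
        result = attempt(task, tier)
        result["tier"] = tier        # record which tier produced this result
        if passed(result):
            break
    return result                    # last attempt, whether or not it passed

# Stand-in attempt function for demonstration: only the top tier "succeeds".
def fake_attempt(task: str, tier: str) -> Dict:
    return {"answer": f"{task} @ {tier}", "ok": tier == "hard"}

final = run_with_escalation("fix bug", "normal", fake_attempt, lambda r: r["ok"])
print(final["tier"])
```

Each failed attempt still costs money, so in practice you would sum cost_usd across attempts and log how often escalation fires.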

GLM-5.2 facts (model id, base URL, thinking parameters, pricing, benchmarks) verified against Z.ai's documentation and launch coverage as of June 18, 2026. Prices and limits change; confirm live before you ship.
