
GLM-5.2 Open Weights: Route Reasoning Effort by Task
Summary
Build a cost-aware GLM-5.2 agent that routes thinking effort per task and calls tools.
On June 17, 2026, Z.ai dropped the full MIT-licensed weights for GLM-5.2, and the dev timelines lit up. The reason is simple: it is a ~753B-parameter mixture-of-experts model that scores at the top of the open-source pack on long-horizon coding, lands second on Code Arena, and trails Claude Opus 4.8 by roughly a single point on independent multi-step coding evals. The kicker is price. VentureBeat reports it beats GPT-5.5 on long-horizon coding at about one-sixth of the cost, and the weights ship under MIT, so you can also self-host.
But raw capability is only half the story. GLM-5.2 is a reasoning model with a control most teams ignore: you can turn thinking off for cheap, instant replies, or push it to max effort when a task actually needs deep reasoning. Spend that knob blindly and you either burn money on simple tasks or get shallow answers on hard ones.
This guide builds a cost-aware coding agent on top of GLM-5.2. It routes each task to the right reasoning effort, calls tools in a loop to verify its own work, and tracks exactly what every run costs. Everything here uses the OpenAI-compatible API, so the code drops into any stack that already speaks Chat Completions.
Prerequisites
- Python 3.9+ and pip install openai (the standard OpenAI SDK; no Z.ai-specific client needed).
- A Z.ai API key from the dashboard at z.ai. Store it in an environment variable, never in source.
- Basic familiarity with the OpenAI Chat Completions request/response shape (messages, tool_calls, usage).
- Optional: if you would rather self-host, the MIT weights live on Hugging Face as zai-org/GLM-5.2 and run on vLLM, SGLang, or Ollama (glm-5.2). The code below works unchanged against a local OpenAI-compatible server.
Why GLM-5.2 is the open-weights story of the week
GLM-5.2 matters because it closes the practical gap between open and closed models on the workload teams actually pay for: long-horizon, tool-using coding. The headline jump is Terminal-Bench 2.1, where Z.ai reports 81.0, up from 62.0 on GLM-5.1. A few verified numbers from Z.ai's published results put it in context:
| Benchmark | GLM-5.2 | Comparison |
|---|---|---|
| Terminal-Bench 2.1 | 81.0 | GLM-5.1 was 62.0 |
| SWE-bench Pro | 62.1 | GPT-5.5: 58.6 |
| MCP-Atlas (tool use) | 77.0 | near Claude Opus 4.8 |
| AIME 2026 (math) | 99.2 | frontier-tier |
| Code Arena rank | #2 | trails Opus 4.8 by ~1 pt |
Pair that with MIT weights, a 1M-token context, and output pricing of $4.40 per 1M tokens, and the economic pitch is hard to ignore: frontier-class coding at a fraction of closed-model cost, with a self-host escape hatch. The rest of this guide turns that into working code.
Step 1: Point the OpenAI SDK at GLM-5.2
GLM-5.2 speaks the OpenAI wire format. You only change the base URL and the model id. Set your key first:
export ZAI_API_KEY="your-glm-5.2-api-key"
Then create a client. The base URL is the pay-as-you-go endpoint, and the model id is simply glm-5.2 (use glm-5.2[1m] only when you actually need the full 1M-token context):
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],
    base_url="https://api.z.ai/api/paas/v4/",
)

resp = client.chat.completions.create(
    model="glm-5.2",
    messages=[
        {"role": "system", "content": "You are a concise backend engineer."},
        {"role": "user", "content": "Explain idempotency keys in 3 sentences."},
    ],
)
print(resp.choices[0].message.content)
That is the entire integration. Retries, logging, and helper code you already wrote for OpenAI carry over because the response carries the usual id, choices, and usage fields.
Step 2: The three reasoning modes (and what they cost)
GLM-5.2 exposes two knobs through the SDK's extra_body passthrough: thinking (enabled or disabled) and, when enabled, reasoning_effort (high or max). Z.ai recommends max for coding. That gives you three practical tiers:
# Tier 1 - thinking OFF: fast, cheapest. Good for classification, routing, short rewrites.
extra_body={"thinking": {"type": "disabled"}}
# Tier 2 - thinking ON, HIGH effort: balanced reasoning for everyday tasks.
extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"}
# Tier 3 - thinking ON, MAX effort: deepest reasoning for hard refactors and multi-step work.
extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "max"}
The cost trap: reasoning tokens are billed as output tokens. GLM-5.2 charges $1.40 per 1M input and $4.40 per 1M output, so a max-effort call can cost several times a thinking-disabled one for the same prompt. The whole point of routing is to pay for deep reasoning only when the task earns it.
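To make the cost trap concrete, here is a small arithmetic sketch using the prices quoted above. The token counts are invented for illustration (real reasoning lengths vary by task and are not documented figures), and `call_cost` is a hypothetical helper, not part of any SDK.

```python
# Prices quoted in this guide, converted to USD per token. Confirm live pricing.
PRICE_IN = 1.40 / 1e6
PRICE_OUT = 4.40 / 1e6


def call_cost(prompt_tokens: int, answer_tokens: int, reasoning_tokens: int = 0) -> float:
    """Estimate one call's cost; reasoning tokens are billed at the output rate."""
    return prompt_tokens * PRICE_IN + (answer_tokens + reasoning_tokens) * PRICE_OUT


# Hypothetical token counts: same 500-token prompt and 300-token answer,
# with different (made-up) amounts of reasoning per tier.
scenarios = {
    "thinking off": call_cost(500, 300),
    "high effort": call_cost(500, 300, reasoning_tokens=600),
    "max effort": call_cost(500, 300, reasoning_tokens=2_000),
}

for name, usd in scenarios.items():
    print(f"{name:>12}: ${usd:.6f}")
```

Under these assumed counts the max-effort call costs roughly five times the thinking-disabled one, even though the prompt and the visible answer are identical.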
Step 3: Build the effort router
Rather than hard-coding a tier, let the model triage the task for you with a single cheap, thinking-disabled call, then map its verdict to an effort tier. This is the cost-aware core:
EFFORT = {
    "trivial": {"thinking": {"type": "disabled"}},
    "normal": {"thinking": {"type": "enabled"}, "reasoning_effort": "high"},
    "hard": {"thinking": {"type": "enabled"}, "reasoning_effort": "max"},
}

def classify(task: str) -> str:
    """Cheap triage: one thinking-disabled call returns a single word."""
    r = client.chat.completions.create(
        model="glm-5.2",
        messages=[
            {"role": "system", "content":
                "Classify the coding task difficulty. Reply with exactly one word: "
                "trivial, normal, or hard. Multi-file refactors, debugging, and "
                "algorithm design are hard. Renames and one-liners are trivial."},
            {"role": "user", "content": task},
        ],
        extra_body={"thinking": {"type": "disabled"}},
        max_tokens=4,
    )
    label = r.choices[0].message.content.strip().lower()
    return label if label in EFFORT else "normal"  # safe fallback
The triage call itself is nearly free because thinking is off and you cap it at a handful of tokens. It pays for itself the first time it stops a trivial rename from triggering a max-effort reasoning run.
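One practical wrinkle: a model asked for "exactly one word" may still reply with "Hard." or an empty message, and a strict string comparison would quietly route both to the fallback tier. The sketch below shows one way to tolerate that. `normalize_label` and `VALID_LABELS` are hypothetical names introduced here, and the first-match rule is a simplifying assumption.

```python
import re

VALID_LABELS = ("trivial", "normal", "hard")


def normalize_label(raw, default="normal"):
    """Map a raw model reply to one of the three tiers, or fall back to default."""
    if not raw:
        return default  # None or empty content
    # Split into lowercase alphabetic words so punctuation and whitespace are ignored.
    for word in re.findall(r"[a-z]+", raw.lower()):
        if word in VALID_LABELS:
            return word  # first recognised label wins
    return default


print(normalize_label("Hard."))          # punctuation stripped
print(normalize_label("  Trivial\n"))    # whitespace and case ignored
print(normalize_label(None))             # falls back to the default tier
```

Falling back to the middle tier keeps a garbled triage reply from either overspending on max effort or starving a hard task of reasoning.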
Step 4: Add a tool-calling loop
A coding agent that cannot run code is just a chatbot. GLM-5.2 scores 77.0 on MCP-Atlas for tool use, close to Claude Opus 4.8, and it follows the standard OpenAI tool-calling cycle: the model emits tool_calls, you execute them, you feed the results back, and it continues until it answers. Here is a minimal Python tool (bare exec, not a real sandbox) with its schema and dispatch table; the loop follows:
import contextlib
import io
import json

def run_python(code: str) -> str:
    """Execute a short snippet and capture stdout. Sandbox properly in production."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue().strip() or "(no output)"
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a short Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string", "description": "Python source"}},
            "required": ["code"],
        },
    },
}]

DISPATCH = {"run_python": run_python}
Now the loop. Note the two details that trip people up with reasoning models: append the assistant message before the tool result, and keep that message intact so the model's reasoning stays in context across turns.
PRICE_IN, PRICE_OUT = 1.40 / 1e6, 4.40 / 1e6  # USD per token

def agent(task: str, max_turns: int = 6):
    tier = classify(task)
    messages = [
        {"role": "system", "content":
            "You are a coding agent. When useful, call run_python to verify "
            "your work before answering. Keep answers tight."},
        {"role": "user", "content": task},
    ]
    spend = 0.0
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="glm-5.2",
            messages=messages,
            tools=TOOLS,
            extra_body=EFFORT[tier],
        )
        u = resp.usage
        spend += u.prompt_tokens * PRICE_IN + u.completion_tokens * PRICE_OUT
        msg = resp.choices[0].message
        messages.append(msg.model_dump())  # keep reasoning + tool_calls in context
        if not msg.tool_calls:
            return {"answer": msg.content, "tier": tier, "cost_usd": round(spend, 6)}
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = DISPATCH[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    return {"answer": "max turns reached", "tier": tier, "cost_usd": round(spend, 6)}
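The loop above trusts the model to name a known tool and to send well-formed JSON arguments. When it does not, an uncaught exception ends the run and the tokens already spent are wasted. A more forgiving approach is to turn those failures into a tool result the model can read and correct. The following is a minimal sketch of that idea; `safe_dispatch` is a hypothetical helper, not part of the OpenAI SDK or Z.ai's API.

```python
import json


def safe_dispatch(dispatch, name, raw_arguments):
    """Run one tool call and always return a string the model can read."""
    handler = dispatch.get(name)
    if handler is None:
        return f"ERROR: unknown tool {name!r}"
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as e:
        return f"ERROR: arguments were not valid JSON: {e}"
    if not isinstance(args, dict):
        return "ERROR: arguments must be a JSON object"
    try:
        return str(handler(**args))
    except TypeError as e:
        # Usually a missing or unexpected keyword argument.
        return f"ERROR: bad arguments for {name}: {e}"


# Toy dispatch table for illustration only.
demo_tools = {"shout": lambda text: text.upper()}

print(safe_dispatch(demo_tools, "shout", '{"text": "ok"}'))
print(safe_dispatch(demo_tools, "missing_tool", "{}"))
print(safe_dispatch(demo_tools, "shout", "{not json"))
```

In the agent loop, the tool result would then come from `safe_dispatch(DISPATCH, call.function.name, call.function.arguments)`, so a malformed call costs one extra turn instead of the whole run.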
Step 5: A worked example
Hand the agent a real debugging task. It triages to hard, writes a fix, runs it through run_python to confirm, then reports back:
task = (
    "This function should return the running maximum of a list, but it returns "
    "the wrong values. Fix it and prove the fix with a test case.\n\n"
    "def running_max(xs):\n"
    "    out = []\n"
    "    m = 0\n"  # bug: assumes non-negative inputs
    "    for x in xs:\n"
    "        m = max(m, x)\n"
    "        out.append(m)\n"
    "    return out\n"
)
result = agent(task)
print(result["tier"], "->", f'${result["cost_usd"]}')
print(result["answer"])
Example output (yours will vary slightly run to run):
hard -> $0.014213
The bug is initializing m = 0, which breaks on all-negative input.
Seed with the first element instead:
def running_max(xs):
    if not xs:
        return []
    out, m = [], xs[0]
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out
Verified: running_max([-5, -2, -9, -1]) -> [-5, -2, -2, -1]
The agent caught the classic m = 0 seeding bug, fixed it, and the tool call confirmed the corrected output on a negative-only input. The whole thing cost under two cents because max effort only kicked in for a task that needed it.
Step 6: Track real cost from the usage object
Every non-streamed response carries a usage object. That is your billing source of truth, not an estimate. To sanity-check a single call:
r = client.chat.completions.create(
    model="glm-5.2",
    messages=[{"role": "user", "content": "Summarize REST vs gRPC in 4 bullets."}],
    extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"},
)
u = r.usage
cost = u.prompt_tokens * 1.40/1e6 + u.completion_tokens * 4.40/1e6
print(u.prompt_tokens, u.completion_tokens, f"${cost:.6f}")
# e.g. 24 612 $0.002726 (reasoning tokens are inside completion_tokens)
Because reasoning tokens land in completion_tokens, a max-effort call costs more than a thinking-disabled one even for the same prompt. That single fact is why the router exists.
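If you want to see how much of a bill is reasoning rather than visible answer, some OpenAI-compatible servers report a breakdown under `usage.completion_tokens_details.reasoning_tokens`. Whether GLM-5.2's endpoint populates that field is an assumption here, not something this guide has confirmed, so the sketch below degrades to "all visible" when the field is missing. `split_completion_tokens` is a hypothetical helper and the usage object is a stand-in with made-up numbers.

```python
from types import SimpleNamespace


def split_completion_tokens(usage):
    """Return (reasoning_tokens, visible_tokens); reasoning is 0 if not reported."""
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None) or 0
    return reasoning, usage.completion_tokens - reasoning


# Stand-in for a response's usage object; the numbers are invented.
fake_usage = SimpleNamespace(
    prompt_tokens=24,
    completion_tokens=612,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=480),
)

reasoning, visible = split_completion_tokens(fake_usage)
print(f"reasoning={reasoning} visible={visible}")
```

Either way, bill from `completion_tokens` as a whole; the split is only useful for tuning which tasks deserve which tier.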
Common pitfalls and gotchas
- Forgetting extra_body. The OpenAI SDK does not know about thinking or reasoning_effort, so they must go inside extra_body in Python. In raw curl, put them at the top level of the JSON body next to model.
- Reasoning tokens are output tokens. They are billed at $4.40 per 1M, not the input rate. A naive agent that runs every task at max effort can cost 3-5x what a routed one does for identical results.
- Dropping the assistant message before the tool result. You must append the assistant message (with its tool_calls) to messages before appending the role: tool result. Reverse the order and the API rejects the turn.
- Stripping reasoning between turns. GLM-5.2 is thinking-first. If your framework strips the model's reasoning content between tool calls, multi-step quality drops. Keep the full assistant message (model_dump() preserves it).
- Assuming a huge output budget. The context window is 1M tokens, but max output is up to 128K per Z.ai docs (verify live). Long generations can truncate; check finish_reason.
- Reaching for the [1m] model by default. glm-5.2[1m] unlocks the 1M window but you pay for what you send. Use plain glm-5.2 unless you genuinely feed a giant context.
- Expecting vision. As of June 2026 there is no confirmed vision variant. The API is text in, text out. Do not send image inputs.
- Leaking the key. A leaked key bills against your account at output prices. Keep it in an env var, out of git, and rotate if exposed.
- Unsandboxed exec. The run_python tool here uses bare exec for clarity. In production, run tool code in a container or a restricted subprocess with timeouts.
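As a first step away from bare exec, the snippet can run in a separate interpreter process with a wall-clock timeout. This is a sketch using only the standard library; `run_python_subprocess` is a hypothetical replacement for the tool, and process isolation plus a timeout is not a full sandbox (the child can still touch the filesystem and network), so treat it as a stopgap before a container.

```python
import subprocess
import sys


def run_python_subprocess(code: str, timeout_s: float = 5.0) -> str:
    """Run a snippet in a separate interpreter with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            # -I runs Python in isolated mode (ignores env vars and user site-packages).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"ERROR: timed out after {timeout_s}s"
    if proc.returncode != 0:
        # The last stderr line is normally the exception type and message.
        lines = proc.stderr.strip().splitlines()
        detail = lines[-1] if lines else f"exit code {proc.returncode}"
        return f"ERROR: {detail}"
    return proc.stdout.strip() or "(no output)"


print(run_python_subprocess("print(sum(range(10)))"))
print(run_python_subprocess("raise ValueError('boom')"))
print(run_python_subprocess("while True: pass", timeout_s=0.5))
```

The return shape matches the in-process tool (stdout on success, an ERROR string otherwise), so it can be swapped into the dispatch table without touching the loop.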
Quick reference
| Setting | Value |
|---|---|
| Base URL (SDK) | https://api.z.ai/api/paas/v4/ |
| Model id | glm-5.2 (1M variant: glm-5.2[1m]) |
| Auth header | Authorization: Bearer $ZAI_API_KEY |
| Thinking off | extra_body={"thinking": {"type": "disabled"}} |
| Thinking on | extra_body={"thinking": {"type": "enabled"}, "reasoning_effort": "high"} (or "max") |
| Context window | 1M tokens (1,048,576) |
| Max output | up to 128K (verify live) |
| Pricing | $1.40 /1M input, $4.40 /1M output, ~$0.26 /1M cached |
| Weights / license | zai-org/GLM-5.2 on Hugging Face, MIT |
| OpenRouter / Ollama | z-ai/glm-5.2 / glm-5.2 |
Next steps
- Swap the in-process run_python for a real sandbox (Docker, gVisor, or a subprocess with a timeout) and add a read_file/write_file pair to turn this into a repo agent.
- Add a fourth tier that escalates from high to max automatically when a task fails its own test, so effort tracks difficulty observed at runtime, not just predicted up front.
- Wire the same client into Claude Code via the Anthropic-compatible endpoint at https://api.z.ai/api/coding/paas/v4 with ANTHROPIC_BASE_URL and the glm-5.2[1m] model.
- Log tier, cost_usd, and acceptance per task, then tune the classifier prompt against your real workload to push more work into cheaper tiers without losing quality.
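The escalation idea in the list above can be sketched without any API calls by treating the model run and the acceptance check as plain callables. Everything here is illustrative: `run_with_escalation`, `fake_attempt`, and the tier names (borrowed from the router's normal/hard labels) are assumptions, and a real `attempt` would wrap the agent loop.

```python
from typing import Callable, Dict, Sequence


def run_with_escalation(
    task: str,
    attempt: Callable[[str, str], str],
    passes: Callable[[str], bool],
    tiers: Sequence[str] = ("normal", "hard"),
) -> Dict[str, object]:
    """Try each tier in order and stop at the first answer that passes the check."""
    answer = ""
    used = tiers[0]
    for used in tiers:
        answer = attempt(task, used)
        if passes(answer):
            return {"answer": answer, "tier": used, "passed": True}
    # Every tier failed: report the last attempt so the caller can decide what to do.
    return {"answer": answer, "tier": used, "passed": False}


# Stub standing in for a model call: only the "hard" tier produces a passing answer.
def fake_attempt(task: str, tier: str) -> str:
    return "fixed" if tier == "hard" else "still broken"


result = run_with_escalation("fix the bug", fake_attempt, lambda a: a == "fixed")
print(result)
```

Because the cheaper tier runs first, escalation only pays the max-effort price for tasks that have already failed at high effort.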
GLM-5.2 facts (model id, base URL, thinking parameters, pricing, benchmarks) verified against Z.ai's documentation and launch coverage as of June 18, 2026. Prices and limits change; confirm live before you ship.