
GLM-4.7 in Python: Build a Cheap Coding Assistant
Summary
Use Zhipu's GLM-4.7 through the OpenAI SDK to build a tool-calling coding assistant for pennies.
Why everyone is suddenly switching to GLM-4.7
This week the AI dev feeds are full of one number: the price of a coding token. Z.ai (Zhipu) shipped GLM-4.7, an open-weight model that lands at 73.8% on SWE-bench Verified and 84.7 on the τ²-Bench tool-calling benchmark, putting its agentic coding roughly on par with Claude Sonnet 4.5. The twist is the cost. The weights are MIT-licensed and on Hugging Face, there is a completely free Flash tier on the API, and the paid coding plan starts at a few dollars a month. That combination is why GLM-4.7 is the model people are quietly swapping into Claude Code, Cline, and their own scripts right now.
The best part for builders: the API is OpenAI-compatible. If you have ever called chat.completions.create, you already know 90% of this. In this guide you'll go from a first request to a working command-line coding assistant that can read your files and call tools, while paying close to nothing per run.
Why does the OpenAI-compatible part matter so much? It means GLM-4.7 is not a new ecosystem you have to learn. The same client, the same message format, the same streaming and tool-calling shapes you use with GPT or any other provider work here by changing two strings: the base URL and the model name. That portability is exactly what is driving the swap. Teams that built on the OpenAI SDK a year ago can A/B a frontier model against GLM-4.7 in an afternoon, watch the bill drop, and keep their code.
Everything below is verified against the official Z.ai documentation. Code is written to run as-is once you drop in an API key.
What you'll build
A small Python coding assistant you can run in a terminal. Ask it a question about a project, and it decides on its own whether to list a directory or read a file before answering. Along the way you'll learn the three things that make GLM-4.7 different from a plain chat model: model tiers and pricing, thinking modes with streaming, and tool (function) calling.
Prerequisites
- Python 3.9 or newer.
- A free Z.ai account and an API key from the developer portal (z.ai/manage-apikey).
- Two packages:
  pip install --upgrade 'openai>=1.0' zai-sdk
- Basic comfort with the terminal and reading JSON.
Set your key once so the examples can pick it up automatically:
export ZAI_API_KEY="your-key-here" # macOS / Linux
# setx ZAI_API_KEY "your-key-here" # Windows PowerShell (takes effect in new terminal windows)
Step 1 - Your first call with the OpenAI SDK
GLM-4.7 speaks the OpenAI Chat Completions format. Point the OpenAI client at the Z.ai base URL and you are done. No new client library required.
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["ZAI_API_KEY"],
    base_url="https://api.z.ai/api/paas/v4/",  # note the trailing slash
)
resp = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "system", "content": "You are a concise senior Python engineer."},
        {"role": "user", "content": "Write a one-liner that flattens a list of lists."},
    ],
    temperature=0.7,
)
print(resp.choices[0].message.content)
Representative output:
flat = [x for sub in nested for x in sub]
That is the whole contract. The endpoint is https://api.z.ai/api/paas/v4/chat/completions, auth is a bearer token, and the response object matches what the OpenAI SDK already expects.
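To make that contract concrete, here is a minimal sketch that builds the same request using only the standard library. The helper name build_chat_request is hypothetical (it is not part of any SDK), and the request is constructed but not sent, so you can inspect it without an API key.

```python
import json
import urllib.request

BASE_URL = "https://api.z.ai/api/paas/v4/"  # base URL from Step 1


def build_chat_request(api_key, messages, model="glm-4.7"):
    """Build (but do not send) a Chat Completions request.

    Hypothetical helper for illustration; not part of any SDK.
    """
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # bearer-token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("your-key-here", [{"role": "user", "content": "Hi"}])
print(req.get_method(), req.full_url)
# To actually send it, pass req to urllib.request.urlopen and parse the JSON body.
```

Seeing the raw request is mostly useful for debugging proxies or for languages without an OpenAI SDK; for everyday Python work the SDK call in Step 1 is the shorter path.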
Step 2 - Pick the right tier and price it out
GLM-4.7 ships in three sizes. They all share a 200K context window and 128K max output, but they trade quality for speed and cost. For an assistant you run dozens of times a day, start on Flash (free), then graduate the hard requests to the full model.
| Model id | Positioning | Cost | Use it for |
|---|---|---|---|
| glm-4.7 | Flagship coding + reasoning | Paid (coding plan from ~$3/mo) | Hard refactors, agentic tasks |
| glm-4.7-flashx | Lightweight, high-speed | Low-cost | Latency-sensitive tools |
| glm-4.7-flash | Lightweight | Completely free | Drafts, prototyping, high volume |
If you prefer per-token billing or want provider fallback, the same model is on OpenRouter as z-ai/glm-4.7 (around $0.44 in / $1.74 out per million tokens through the cheapest provider). The economics are the headline: a coding session that costs dollars on a frontier API often costs cents here, and the free Flash tier costs nothing while you experiment.
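A quick back-of-the-envelope calculation shows what "cents" means in practice. The sketch below uses the OpenRouter prices quoted above as constants; treat them as a snapshot that will drift, and treat the session size (40 calls of 6,000 tokens in and 800 out) as a made-up example, not a measurement.

```python
# USD per million tokens, as quoted above for OpenRouter's cheapest provider.
# Assumption: a snapshot only; check current pricing before relying on it.
PRICE_IN_PER_M = 0.44
PRICE_OUT_PER_M = 1.74


def estimate_cost(input_tokens, output_tokens):
    """Return the estimated USD cost of one request at the quoted rates."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000


# Hypothetical agent session: 40 calls, each sending 6,000 tokens and receiving 800.
session = 40 * estimate_cost(6_000, 800)
print(f"${session:.4f}")  # $0.1613
```

Input tokens dominate in agent loops because the full message history is resent on every step, so trimming tool output (as read_file's max_chars does later) is usually the cheapest optimization.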
Step 3 - Stream tokens and watch it think
GLM-4.7 has a thinking mode. When enabled, the model reasons before it answers, and that reasoning streams back on a separate reasoning_content field, distinct from the user-facing content. The official zai-sdk exposes this cleanly. Enable thinking for complex tasks; disable it for simple ones to cut latency.
from zai import ZaiClient
import os
client = ZaiClient(api_key=os.environ["ZAI_API_KEY"])
stream = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user",
               "content": "A list has duplicates. Dedupe it but keep order. Explain briefly."}],
    thinking={"type": "enabled"},  # "enabled" | "disabled" (default enabled)
    stream=True,
    max_tokens=2048,
    temperature=0.6,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.reasoning_content:  # the model's private reasoning
        print(delta.reasoning_content, end="", flush=True)
    if delta.content:  # the final answer
        print(delta.content, end="", flush=True)
Representative output (reasoning shown first, then the answer):
[thinking] dict.fromkeys preserves insertion order in 3.7+, so list(dict.fromkeys(x))
keeps first occurrences and drops repeats in one pass...
deduped = list(dict.fromkeys(items))
dict.fromkeys keeps the first occurrence of each value and ignores later repeats,
and because dicts preserve insertion order since Python 3.7, the original order is maintained.
If you stick with the portable OpenAI SDK instead of zai-sdk, pass GLM-specific options through extra_body={"thinking": {"type": "enabled"}} and read the reasoning from the same delta field.
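The OpenAI-SDK variant might look like the sketch below. The function name stream_with_thinking is hypothetical, and `client` is assumed to be an openai.OpenAI instance configured with the Z.ai base URL from Step 1. Because the OpenAI SDK does not declare reasoning_content on its delta type, the sketch reads it defensively with getattr rather than assuming the attribute is always present.

```python
def stream_with_thinking(client, prompt, model="glm-4.7"):
    """Stream a reply through an OpenAI-style client; return (reasoning, answer).

    Hypothetical helper: `client` is assumed to be an openai.OpenAI instance
    pointed at the Z.ai base URL.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        extra_body={"thinking": {"type": "enabled"}},  # Z.ai extension, not an OpenAI param
    )
    reasoning, answer = [], []
    for chunk in stream:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        # The attribute may be missing or None depending on the chunk.
        if getattr(delta, "reasoning_content", None):
            reasoning.append(delta.reasoning_content)
        if getattr(delta, "content", None):
            answer.append(delta.content)
    return "".join(reasoning), "".join(answer)
```

Returning the two strings separately keeps the scratchpad out of the user-facing reply by default; a caller can log the reasoning, show it in a collapsed panel, or discard it.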
Step 4 - Give it tools with function calling
This is where a chat model becomes an assistant. You describe functions in JSON Schema, the model decides when to call one, and you run it and feed the result back. GLM-4.7 is tuned hard for this (its τ²-Bench score is open-source state of the art), and because the API is OpenAI-compatible, the tools and tool_calls shapes are exactly what you already know.
import os
import sys
from openai import OpenAI
client = OpenAI(api_key=os.environ["ZAI_API_KEY"],
                base_url="https://api.z.ai/api/paas/v4/")
def get_python_version():
    return sys.version.split()[0]
tools = [{
    "type": "function",
    "function": {
        "name": "get_python_version",
        "description": "Return the running Python interpreter version.",
        "parameters": {"type": "object", "properties": {}},
    },
}]
messages = [{"role": "user", "content": "What Python version is this machine running?"}]
resp = client.chat.completions.create(model="glm-4.7", messages=messages, tools=tools)
msg = resp.choices[0].message
if msg.tool_calls:
    messages.append(msg)  # keep the assistant's tool request
    for call in msg.tool_calls:
        result = get_python_version()  # you dispatch the real function
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })
    # second call: the model now sees the tool result and writes the answer
    final = client.chat.completions.create(model="glm-4.7", messages=messages, tools=tools)
    print(final.choices[0].message.content)
else:
    print(msg.content)  # the model answered without calling the tool
Representative output:
This machine is running Python 3.11.9.
The pattern is always the same loop: send messages, check for tool_calls, execute each, append a role: "tool" message with the matching tool_call_id, then call the model again so it can use the result. The model can also ask for several tool calls in a single turn; when that happens, run them all and append one tool message per call before you loop. Keep your tool functions small and deterministic - the model is choosing when to call them, but you remain in full control of what they actually do.
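The "one tool message per call" rule can be sketched as a small helper. The name run_tool_calls is hypothetical, and it assumes each call object has the OpenAI-style shape used above (call.id, call.function.name, call.function.arguments). Errors are handed back to the model as text rather than raised, which is one reasonable design choice, not the only one.

```python
import json


def run_tool_calls(tool_calls, dispatch):
    """Execute every tool call from one assistant turn.

    Returns one role="tool" message per call, in order, each carrying the
    tool_call_id of the call it answers. Failures are returned to the model
    as text instead of raising, so the agent loop can continue.
    """
    results = []
    for call in tool_calls:
        try:
            args = json.loads(call.function.arguments or "{}")
            output = dispatch[call.function.name](**args)
        except Exception as exc:  # unknown tool, bad JSON, or a failing tool
            output = f"ERROR: {exc!r}"
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": str(output),
        })
    return results
```

In the Step 4 script this would replace the inner for loop: append the assistant message, then `messages.extend(run_tool_calls(msg.tool_calls, {"get_python_version": get_python_version}))` before calling the model again.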
Step 5 - A repo-aware coding assistant
Now put it together into something useful: an assistant that can explore a project before it answers. It gets two safe, read-only tools - list a directory and read a file - and loops until it has enough context to respond. This is the same mechanism agents like Claude Code use, just stripped to its essentials.
import json
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["ZAI_API_KEY"],
                base_url="https://api.z.ai/api/paas/v4/")
# ---- the actual tools (read-only, confined to the current folder) ----
ROOT = os.path.realpath(".")
def _safe_path(path):
    """Resolve path under ROOT and refuse anything that escapes it (../, absolute paths, symlinks)."""
    full = os.path.realpath(os.path.join(ROOT, path))
    if os.path.commonpath([ROOT, full]) != ROOT:
        raise ValueError(f"path escapes project root: {path}")
    return full
def list_dir(path="."):
    return "\n".join(sorted(os.listdir(_safe_path(path)))) or "(empty)"
def read_file(path, max_chars=4000):
    with open(_safe_path(path), "r", encoding="utf-8", errors="replace") as f:
        return f.read()[:max_chars]
DISPATCH = {"list_dir": list_dir, "read_file": read_file}
TOOLS = [
    {"type": "function", "function": {
        "name": "list_dir",
        "description": "List files and folders at a path relative to the project root.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}}}}},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read up to max_chars of a UTF-8 text file.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "max_chars": {"type": "integer"}},
                       "required": ["path"]}}},
]
def ask(question, max_steps=6):
    messages = [
        {"role": "system", "content": "You are a coding assistant. Use the tools to "
                                      "inspect the project before answering. Cite the files you read."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        resp = client.chat.completions.create(
            model="glm-4.7", messages=messages, tools=TOOLS, temperature=0.3)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # no more tool requests: this is the final answer
        messages.append(msg)  # keep the assistant's tool request in the history
        for call in msg.tool_calls:
            try:
                # parse inside the try so malformed JSON is reported to the model too
                args = json.loads(call.function.arguments or "{}")
                out = DISPATCH[call.function.name](**args)
            except Exception as e:
                out = f"ERROR: {e}"
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(out)})
    return "Stopped: hit the step limit before finishing."
if __name__ == "__main__":
    print(ask("What does this project do and what is its entry point?"))
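Run it inside a small project and the model makes a multi-step series of tool calls before it answers. The script itself prints only the final answer; shown together with the tool calls behind it, a run looks like this: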
-> list_dir(".") => app.py, requirements.txt, README.md, utils/
-> read_file("README.md") => "FastAPI service that resizes images..."
-> read_file("app.py") => "from fastapi import FastAPI ... @app.post('/resize')"
This project is a FastAPI image-resizing service. The entry point is app.py,
which defines a POST /resize endpoint; helpers live in utils/. Start it with
`uvicorn app:app --reload`.
Notice the model chose the order itself: it listed the directory, read the README for intent, then opened app.py to confirm the entry point. You never hard-coded that sequence.
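If you want to watch that order unfold on your own project, one option is to wrap each tool so it logs its calls. The sketch below is an assumption about how you might do it: traced is a hypothetical helper, and the list_dir here is a stand-in so the block runs on its own.

```python
import functools


def traced(fn, preview=60):
    """Wrap a tool so each call prints '-> name(args) => result preview'."""
    @functools.wraps(fn)
    def wrapper(**kwargs):
        out = fn(**kwargs)
        args = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
        shown = str(out).replace("\n", ", ")[:preview]  # one-line, truncated preview
        print(f"-> {fn.__name__}({args}) => {shown}")
        return out
    return wrapper


# Stand-in tool so this sketch is self-contained; in the assistant you would
# wrap the real tools: DISPATCH = {name: traced(fn) for name, fn in DISPATCH.items()}
def list_dir(path="."):
    return "app.py\nREADME.md"


traced(list_dir)(path=".")  # prints: -> list_dir(path='.') => app.py, README.md
```

The wrapper accepts keyword arguments only, which matches how the agent loop invokes tools with `**args`.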
Common pitfalls and how to avoid them
- Dropping the trailing slash on the base URL. It must be https://api.z.ai/api/paas/v4/. Without the slash some HTTP clients build a wrong path and you get 404s.
- Forgetting to append the assistant's tool-call message. Before you add role: "tool" results, you must append the assistant message that requested them. Skip it and the model loses track of which call each result answers.
- Mismatched tool_call_id. Each tool result must carry the exact tool_call_id from the call it answers. With parallel tool calls in one turn, loop over every call and reply to each.
- Leaving thinking on for trivial calls. Thinking mode adds latency and tokens. Set thinking={"type": "disabled"} for simple formatting or classification, and enable it for multi-step reasoning.
- Reading reasoning_content as the answer. The streamed reasoning_content is the model's scratchpad, not the reply. Show it separately or hide it; the user-facing text is in content.
- Putting GLM-only params in the wrong place with the OpenAI SDK. thinking is a Z.ai extension. With the OpenAI client, pass it via extra_body, not as a top-level argument, or you'll get a TypeError.
- Unbounded tool loops. Always cap the agent loop (the max_steps above). A confused model can otherwise call tools forever and burn tokens.
- Letting tools touch the whole filesystem. Joining or normalizing a path does not by itself stop ../ escapes or absolute paths. Resolve every path a model can pass to a tool and reject anything outside the project root; in production, add a proper sandbox on top of that.
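The trailing-slash pitfall comes from how relative URLs are resolved. The sketch below uses the standard library's urljoin as a stand-in for "some HTTP clients"; whether a particular client resolves paths this way is an assumption, and some SDKs normalize the base URL for you.

```python
from urllib.parse import urljoin

with_slash = urljoin("https://api.z.ai/api/paas/v4/", "chat/completions")
without_slash = urljoin("https://api.z.ai/api/paas/v4", "chat/completions")

print(with_slash)     # https://api.z.ai/api/paas/v4/chat/completions
print(without_slash)  # https://api.z.ai/api/paas/chat/completions  (the v4 segment is replaced)
```

Without the slash, the last path segment is treated like a file name and replaced, which is why the request lands on a path that does not exist.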
Quick reference
| Thing | Value |
|---|---|
| Base URL | https://api.z.ai/api/paas/v4/ |
| Endpoint | POST /chat/completions |
| Auth header | Authorization: Bearer $ZAI_API_KEY |
| Models | glm-4.7, glm-4.7-flashx, glm-4.7-flash (free) |
| Context / output | 200K context, 128K max output |
| Thinking mode | thinking={"type": "enabled"} or thinking={"type": "disabled"} |
| Streaming fields | delta.reasoning_content, delta.content |
| Tool calling | OpenAI-style tools + tool_calls |
| OpenRouter id | z-ai/glm-4.7 |
| Weights / license | Hugging Face zai-org/GLM-4.7, MIT |
Next steps
You now have the three primitives - chat, streaming with thinking, and tool calling - that every agent is built on. From here: add a write_file tool (with a confirmation step) to let the assistant make edits, swap glm-4.7 for the free glm-4.7-flash on cheap requests and reserve the flagship for hard ones, or point Claude Code or Cline at the Z.ai endpoint to use GLM-4.7 inside an editor you already know. Because everything is OpenAI-compatible, you can also drop GLM-4.7 into LangChain, LlamaIndex, or the Vercel AI SDK by changing only the base URL and model name.
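As a starting point for the write_file idea, here is a minimal sketch under a few assumptions: the function name, its return strings, and the confirm parameter are all hypothetical, and the path check mirrors the "reject anything outside the project root" advice from the pitfalls list. It is not a complete safety story, just the confirmation step in its simplest form.

```python
import os


def write_file(path, content, root=".", confirm=input):
    """Write text to a file under `root`, but only after explicit approval.

    `confirm` is any callable that takes a prompt and returns the reply;
    it defaults to input() so a human sees every proposed write.
    """
    root = os.path.realpath(root)
    full = os.path.realpath(os.path.join(root, path))
    if os.path.commonpath([root, full]) != root:
        return f"ERROR: {path} is outside the project root"
    reply = confirm(f"Write {len(content)} chars to {path}? [y/N] ")
    if reply.strip().lower() != "y":
        return "Write declined by user."
    os.makedirs(os.path.dirname(full), exist_ok=True)  # create parent folders if needed
    with open(full, "w", encoding="utf-8") as f:
        f.write(content)
    return f"Wrote {len(content)} chars to {path}"
```

Returning a message instead of raising lets the model see that a write was declined and adjust its plan. Passing confirm as a parameter also makes the tool easy to test with a stub such as `lambda prompt: "y"`.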
The bigger lesson of this week's price war: capable coding models are becoming a commodity. Building on an OpenAI-compatible interface means you can chase the best price-to-quality ratio by changing two strings, not rewriting your app.
Primary sources: Z.ai GLM-4.7 docs (docs.z.ai/guides/llm/glm-4.7), Z.ai thinking-mode guide (docs.z.ai/guides/capabilities/thinking-mode), and the GLM-4.7 model card on Hugging Face (huggingface.co/zai-org/GLM-4.7).