
Qwen 3.7 Max: Drop-In Anthropic SDK Swap in Python
Summary
Point the Anthropic SDK at Qwen 3.7 Max with one base-URL change: 1M context, thinking, caching.
On May 19, 2026 Alibaba quietly flipped on the Qwen 3.7 Max API, then announced it the next day at the Alibaba Cloud Summit. Within hours it was the top thread across r/LocalLLaMA and the AI corners of Hacker News, for one reason: it is the first non-Anthropic model that speaks the Anthropic Messages protocol natively, at roughly half the input price of Claude Opus 4.7.
That means you can take an app already built on Claude, change a base URL and a model id, and route it to Qwen. No translation shim, no rewriting your tool-use schemas. People are pointing Claude Code at it and cutting coding-agent bills by about 3x on the same harness.
This guide shows you both ways to call it from Python: the OpenAI-compatible route for normal apps, and the Anthropic Messages route for anything that already targets Claude. You will turn on extended thinking, wire up prompt caching to slash repeated-context cost, and build a small tool-using agent. Every snippet is API-accurate against the official Model Studio docs.
What Qwen 3.7 Max actually is
Qwen 3.7 Max-Preview is Alibaba's agent-first flagship. The headline numbers that matter for a build decision:
- 1,000,000-token context with up to 65,536 output tokens. It posts 90.4 on MRCR-v2 128k retrieval, so the long window holds up instead of degrading past 200K like most '1M' models.
- Native Anthropic Messages support plus OpenAI-compatible chat/completions and responses endpoints.
- Extended thinking on by default, which lifts reasoning quality and benchmark scores (GPQA Diamond 92.4, HMMT Feb 2026 97.1, SWE-Pro 60.6).
- Pricing: $2.50 / 1M input, $7.50 / 1M output, and $0.25 / 1M cached input (a 90% discount on repeated context).
- Model ids: qwen3.7-max (stable alias) or qwen3.7-max-2026-05-20 (dated snapshot for reproducibility).
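To make the pricing bullet concrete, here is a rough per-request cost estimate. It is a sketch, not a billing tool: the three rates are the ones quoted above, while the Usage class and the estimate_cost_usd helper are hypothetical names introduced for illustration. It assumes cached input is billed at the cached rate and everything else at the list rates.

```python
from dataclasses import dataclass

# USD per 1M tokens, as quoted in this guide. Verify in your own console.
PRICE_INPUT = 2.50
PRICE_OUTPUT = 7.50
PRICE_CACHED_INPUT = 0.25


@dataclass
class Usage:
    """Hypothetical container for the token counts of one request."""
    input_tokens: int             # uncached input tokens
    output_tokens: int            # output tokens, including any thinking tokens
    cached_input_tokens: int = 0  # input tokens served from the prompt cache


def estimate_cost_usd(u: Usage) -> float:
    """Estimate the cost of one request in USD from its token counts."""
    total = (
        u.input_tokens * PRICE_INPUT
        + u.cached_input_tokens * PRICE_CACHED_INPUT
        + u.output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Same 40k-token context, first sent uncached and then read from cache.
cold = Usage(input_tokens=40_000, output_tokens=2_000)
warm = Usage(input_tokens=200, output_tokens=2_000, cached_input_tokens=40_000)
print(f"cold: ${estimate_cost_usd(cold):.4f}")  # cold: $0.1150
print(f"warm: ${estimate_cost_usd(warm):.4f}")  # warm: $0.0255
```

In the warm case most of the remaining cost is output, which is why capping output matters as much as caching input.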
Prerequisites
- Python 3.9+ and pip.
- An Alibaba Cloud Model Studio (DashScope) account and API key.
- pip install openai anthropic (the official Anthropic SDK works against the compatible endpoint).
- Basic familiarity with chat-completion or Messages-style API calls.
Step 1 - Set your key and pick a region
Qwen 3.7 Max is hosted across three DashScope regions, each with its own base URL. Pick the one closest to you. The Anthropic-compatible endpoint lives under a separate /apps/anthropic path.
# Singapore (intl) endpoint used throughout this guide.
# Swap the region prefix if you are in Beijing or US-Virginia.
export DASHSCOPE_API_KEY="sk-your-model-studio-key"

# OpenAI-compatible base URL (chat/completions, responses):
#   Beijing      https://dashscope.aliyuncs.com/compatible-mode/v1
#   Singapore    https://dashscope-intl.aliyuncs.com/compatible-mode/v1
#   US-Virginia  https://dashscope-us.aliyuncs.com/compatible-mode/v1
#
# Anthropic-compatible base URL (Messages API, Claude Code, Anthropic SDK):
#   https://dashscope-intl.aliyuncs.com/apps/anthropic
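If you would rather pick the endpoint in code than hard-code it, one option is a small lookup table. The sketch below uses only the URLs listed above; resolve_base_url is a hypothetical helper name. This guide only gives the Singapore Anthropic-compatible URL, so the helper refuses to guess that path for other regions.

```python
# OpenAI-compatible base URLs, copied from the region list above.
OPENAI_COMPAT = {
    "beijing": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "singapore": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    "us-virginia": "https://dashscope-us.aliyuncs.com/compatible-mode/v1",
}

# Only the Singapore (intl) Anthropic-compatible URL appears in this guide;
# other regions are deliberately left out rather than guessed.
ANTHROPIC_COMPAT = {
    "singapore": "https://dashscope-intl.aliyuncs.com/apps/anthropic",
}


def resolve_base_url(region: str, protocol: str = "openai") -> str:
    """Return the base URL for a region and protocol, or raise ValueError."""
    tables = {"openai": OPENAI_COMPAT, "anthropic": ANTHROPIC_COMPAT}
    if protocol not in tables:
        raise ValueError(f"unknown protocol: {protocol!r}")
    key = region.strip().lower()
    if key not in tables[protocol]:
        raise ValueError(f"no {protocol} base URL recorded for region {region!r}")
    return tables[protocol][key]


print(resolve_base_url("singapore"))
print(resolve_base_url("Singapore", "anthropic"))
```

Raising on an unknown region keeps a typo from silently falling back to a different region's endpoint.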
Step 2 - The OpenAI-compatible route (easiest migration)
If your code already uses the OpenAI SDK, switching is a one-line change: point base_url at DashScope and set the model to qwen3.7-max. Everything else stays the same.
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3.7-max",  # or qwen3.7-max-2026-05-20 to pin
    messages=[
        {"role": "user", "content": "Explain MoE routing in 3 sentences."}
    ],
    max_tokens=512,  # always cap output explicitly
)

print(resp.choices[0].message.content)
print("input/output tokens:",
      resp.usage.prompt_tokens, resp.usage.completion_tokens)
Running that prints the answer plus a token count. Note how few output tokens a capped, simple request uses:
Mixture-of-Experts (MoE) routing sends each token to a small subset of
specialized feed-forward networks ("experts") chosen by a learned gating
network, instead of running every token through one dense layer. The gate
scores all experts per token and keeps only the top-k (often 2), so compute
stays roughly constant even as total parameters grow. This lets the model
hold far more parameters than it activates on any single forward pass.
input/output tokens: 18 96
Step 3 - The Anthropic Messages route (drop-in for Claude Code)
This is the part everyone is talking about. Because Qwen accepts the Anthropic Messages format at the protocol level, anything that targets Claude works after changing four environment variables. For Claude Code:
# Make Claude Code (or any Anthropic-SDK harness) route to Qwen 3.7 Max.
export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
export ANTHROPIC_AUTH_TOKEN="$DASHSCOPE_API_KEY"
export ANTHROPIC_MODEL="qwen3.7-max"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.7-max"
claude # the Claude Code TUI now talks to Qwen, no harness changes
For your own Python, use the official Anthropic SDK and override base_url and auth_token. The key is sent as a Bearer token, which is what the ANTHROPIC_AUTH_TOKEN env var maps to:
from anthropic import Anthropic
import os
# Same SDK you already use for Claude. Only base_url + model change.
client = Anthropic(
    base_url="https://dashscope-intl.aliyuncs.com/apps/anthropic",
    auth_token=os.environ["DASHSCOPE_API_KEY"],  # sent as Bearer token
)

msg = client.messages.create(
    model="qwen3.7-max",
    max_tokens=512,
    system="You are a terse senior Python reviewer.",
    messages=[
        {"role": "user", "content": "Is `except:` ever acceptable? One line."}
    ],
)

print(msg.content[0].text)
print("usage:", msg.usage.input_tokens, msg.usage.output_tokens)
Output:
Only as `except Exception:` at a top-level boundary that logs and re-raises;
a bare `except:` also swallows KeyboardInterrupt and SystemExit, so avoid it.
usage: 29 41
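To see what the SDK is doing on the wire, it can help to build the equivalent HTTP request by hand. The sketch below only constructs the request and never sends it. It assumes the SDK appends /v1/messages to the base URL and sends an anthropic-version header, as it does against Anthropic's own API. Whether the compatible endpoint requires that header is an assumption to check against the Model Studio docs. build_messages_request is a hypothetical helper.

```python
import json
import urllib.request


def build_messages_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a Messages-style POST with Bearer auth."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages",  # assumed path, mirrors the SDK
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # what auth_token maps to
            "anthropic-version": "2023-06-01",   # assumed to be accepted
            "content-type": "application/json",
        },
    )


req = build_messages_request(
    "https://dashscope-intl.aliyuncs.com/apps/anthropic",
    "sk-your-model-studio-key",
    {
        "model": "qwen3.7-max",
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "ping"}],
    },
)
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```

Passing api_key=... to the SDK instead would send the key in an X-Api-Key header, not as a Bearer token, which is why the two arguments are not interchangeable here.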
Step 4 - Control extended thinking
Thinking is on by default and the deliberation tokens are billed as output. That is the single biggest cost surprise. On the OpenAI-compatible route you toggle it with extra_body; the reasoning trace comes back in a separate reasoning_content field so you can log or discard it.
# Extended thinking is ON by default and is billed as OUTPUT tokens.
# Through the OpenAI-compatible route you toggle it with extra_body.
resp = client.chat.completions.create(
    model="qwen3.7-max",
    messages=[{"role": "user", "content": "A train leaves... (hard puzzle)"}],
    max_tokens=2048,
    extra_body={
        "enable_thinking": True,      # default True; set False for chat UIs
        "preserve_thinking": False,   # keep prior-turn reasoning in context?
    },
)

# The deliberation shows up separately from the final answer.
choice = resp.choices[0].message
reasoning = getattr(choice, "reasoning_content", None)
print("THOUGHT:", (reasoning or "")[:200])
print("ANSWER :", choice.content)
For a chat UI where latency matters and you do not need deep reasoning, set enable_thinking to False. For multi-step agents, leave it on but cap max_tokens hard.
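One way to keep that advice from drifting across a codebase is to centralize it in a small preset table. This is a sketch only: completion_options is a hypothetical helper, and the token caps are illustrative values taken from the examples in this guide, not recommendations from the vendor.

```python
def completion_options(workload: str) -> dict:
    """Return keyword arguments for an OpenAI-compatible chat completion call.

    "chat" favors latency and turns thinking off; "agent" keeps thinking on
    but still caps output so deliberation cannot run up the bill.
    """
    presets = {
        "chat": {
            "max_tokens": 512,
            "extra_body": {"enable_thinking": False},
        },
        "agent": {
            "max_tokens": 4096,
            "extra_body": {"enable_thinking": True, "preserve_thinking": False},
        },
    }
    if workload not in presets:
        raise ValueError(f"unknown workload: {workload!r}")
    return presets[workload]


# Usage: client.chat.completions.create(model=..., messages=..., **opts)
opts = completion_options("chat")
print(opts)
```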
Step 5 - Prompt caching to cut the bill
If you reuse a long system prompt or a fixed codebase context across many calls, prompt caching takes that input from $2.50 to $0.25 per 1M tokens. On the Anthropic route, mark the stable prefix with cache_control. The first call writes the cache; later calls read it cheaply.
# Prompt caching drops repeated input from $2.50 to $0.25 per 1M tokens.
# On the Anthropic route, mark the stable block with cache_control.
LONG_SYSTEM = open("codebase_context.md").read() # e.g. 40k tokens, reused

msg = client.messages.create(
    model="qwen3.7-max",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM,
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[{"role": "user", "content": "Where is auth handled?"}],
)

u = msg.usage
print("cache_creation:", getattr(u, "cache_creation_input_tokens", 0))
print("cache_read :", getattr(u, "cache_read_input_tokens", 0))
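A cache only helps if the cached prefix stays identical between calls, so anything that changes per request should sit after the cached block. The sketch below shows one way to arrange that. build_system_blocks is a hypothetical helper, and it assumes the endpoint follows Anthropic's convention that cache_control marks the end of the cached prefix.

```python
def build_system_blocks(stable_context: str, per_request_note: str = "") -> list:
    """Build a system prompt with a cached stable prefix and an uncached tail."""
    blocks = [
        {
            "type": "text",
            "text": stable_context,
            "cache_control": {"type": "ephemeral"},  # everything up to here is cached
        }
    ]
    if per_request_note:
        # Volatile text goes last so it never invalidates the cached prefix.
        blocks.append({"type": "text", "text": per_request_note})
    return blocks


blocks = build_system_blocks(
    "<long codebase context>",
    "Today's ticket: investigate login failures.",
)
print(len(blocks), "cache_control" in blocks[0], "cache_control" in blocks[1])
```

Pass the result as system=blocks. If cache_read_input_tokens stays at 0 on repeat calls, the usual suspect is a prefix that differs slightly between requests, such as an embedded timestamp.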
Worked example: a tool-using coding agent
Here is the pattern that makes Qwen 3.7 Max worth using as an agent backend: the standard Anthropic tool-use loop. The model decides when to call a tool, you execute it, feed the result back, and repeat until it stops. This is identical to how you would write it against Claude, which is exactly the point.
import json, os
from anthropic import Anthropic

client = Anthropic(
    base_url="https://dashscope-intl.aliyuncs.com/apps/anthropic",
    auth_token=os.environ["DASHSCOPE_API_KEY"],
)

# 1. Define a tool the model can call.
TOOLS = [{
    "name": "run_sql",
    "description": "Run a read-only SQL query against the analytics DB.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def run_sql(query: str) -> str:
    # Stand-in for a real DB call.
    if "count" in query.lower():
        return json.dumps({"rows": [{"signups": 1423}]})
    return json.dumps({"rows": []})

messages = [{"role": "user",
             "content": "How many signups did we get? Use the tool."}]

# 2. Agent loop: call -> run tools -> feed results back -> repeat.
while True:
    resp = client.messages.create(
        model="qwen3.7-max",
        max_tokens=1024,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": resp.content})

    if resp.stop_reason != "tool_use":
        print(resp.content[-1].text)  # final natural-language answer
        break

    # 3. Execute every tool_use block and return tool_result blocks.
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            out = run_sql(**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": out,
            })
    messages.append({"role": "user", "content": results})
On the first turn the model returns a tool_use block instead of text; you run the tool, return a tool_result, and the second turn produces the final answer:
# stop_reason == "tool_use" -> model emitted: run_sql(query="SELECT COUNT(*) ...")
# tool returns: {"rows": [{"signups": 1423}]}
# second turn, stop_reason == "end_turn":
You had 1,423 signups. Want me to break that down by day or by channel?
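Once an agent has more than one tool, the tool-execution step is usually easier to maintain as a dispatch table that also reports failures back to the model instead of crashing the loop. The sketch below runs offline against stand-in blocks built with SimpleNamespace. TOOL_HANDLERS and execute_tool_calls are hypothetical names. The is_error field is part of Anthropic's tool_result format, and it is an assumption that the compatible endpoint honors it.

```python
import json
from types import SimpleNamespace


def run_sql(query: str) -> str:
    """Stand-in for a real read-only DB call."""
    if "count" in query.lower():
        return json.dumps({"rows": [{"signups": 1423}]})
    return json.dumps({"rows": []})


# Map tool names to the Python callables that implement them.
TOOL_HANDLERS = {"run_sql": run_sql}


def execute_tool_calls(content_blocks) -> list:
    """Run every tool_use block and return matching tool_result blocks."""
    results = []
    for block in content_blocks:
        if getattr(block, "type", None) != "tool_use":
            continue  # skip text (and any other) blocks
        try:
            handler = TOOL_HANDLERS.get(block.name)
            if handler is None:
                raise ValueError(f"unknown tool: {block.name}")
            output, is_error = handler(**block.input), False
        except Exception as exc:
            # Report the failure to the model so it can correct its call.
            output, is_error = f"{type(exc).__name__}: {exc}", True
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": output,
            "is_error": is_error,
        })
    return results


# Simulated model output: one valid call and one with a wrong argument name.
fake_blocks = [
    SimpleNamespace(type="text", text="Let me check."),
    SimpleNamespace(type="tool_use", id="t1", name="run_sql",
                    input={"query": "SELECT COUNT(*) FROM signups"}),
    SimpleNamespace(type="tool_use", id="t2", name="run_sql",
                    input={"sql": "SELECT 1"}),
]
for result in execute_tool_calls(fake_blocks):
    print(result["tool_use_id"], result["is_error"])
```

In a real loop you would also cap the number of iterations, so a model that keeps requesting tools cannot spin forever.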
Common pitfalls
- The verbosity tax. Independent testing measured roughly 4x the output tokens of comparable flagships on the same task, because thinking runs by default. The $7.50 output rate then lands you in Opus cost territory. Always set max_tokens (2048-4096 per agent turn is plenty); never rely on the 65K ceiling.
- You pay for thinking you throw away. The reasoning_content tokens are billed even if you drop them from the final message. If you do not need the trace, set enable_thinking=False rather than discarding it after the fact.
- Region vs endpoint mismatch. The OpenAI-compatible path ends in /compatible-mode/v1; the Anthropic path ends in /apps/anthropic. Mixing them up returns 404s that look like auth errors.
- auth_token, not api_key. The Anthropic-compatible endpoint expects a Bearer token. In the Anthropic Python SDK pass auth_token=... (or set ANTHROPIC_AUTH_TOKEN), not api_key=..., or you will get 401s.
- Long context is not free context. The window is a real 1M tokens, but stuffing it still costs input tokens every call unless you cache. Cache the stable part; only the user turn should be uncached.
- Pin the model for reproducible evals. Use qwen3.7-max-2026-05-20 when you are benchmarking, so a silent alias update does not move your numbers.
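Two of these pitfalls, the endpoint suffix and the auth argument, can be caught before any request is sent. The following is a minimal preflight sketch. preflight is a hypothetical helper, and the suffixes are the two paths named in this guide.

```python
# Path suffix each route is expected to end with, per this guide.
EXPECTED_SUFFIX = {
    "openai": "/compatible-mode/v1",
    "anthropic": "/apps/anthropic",
}


def preflight(protocol: str, base_url: str, client_kwargs: dict) -> list:
    """Return a list of human-readable configuration problems (empty if none)."""
    problems = []
    suffix = EXPECTED_SUFFIX[protocol]
    if not base_url.rstrip("/").endswith(suffix):
        problems.append(f"{protocol} route expects a base URL ending in {suffix}")
    if (protocol == "anthropic"
            and "api_key" in client_kwargs
            and "auth_token" not in client_kwargs):
        problems.append("pass auth_token=... instead of api_key=... on the Anthropic route")
    return problems


# A misconfigured Anthropic client: wrong path and wrong auth argument.
issues = preflight(
    "anthropic",
    "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    {"api_key": "sk-your-model-studio-key"},
)
for issue in issues:
    print("-", issue)
```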
Quick reference
| What | Value |
|---|---|
| Stable model id | qwen3.7-max |
| Pinned snapshot | qwen3.7-max-2026-05-20 |
| OpenAI base URL (intl) | https://dashscope-intl.aliyuncs.com/compatible-mode/v1 |
| Anthropic base URL (intl) | https://dashscope-intl.aliyuncs.com/apps/anthropic |
| Context / max output | 1,000,000 / 65,536 tokens |
| Price in / out / cached | $2.50 / $7.50 / $0.25 per 1M |
| Turn thinking off | extra_body={'enable_thinking': False} |
| Cache a prefix | cache_control: {'type': 'ephemeral'} |
| Auth header | Bearer token (auth_token / ANTHROPIC_AUTH_TOKEN) |
Next steps
- Point your existing Claude Code setup at Qwen for one real task and compare cost and quality side by side.
- Add hybrid routing: send long-context or hard turns to Qwen 3.7 Max, short turns to a cheaper Flash-tier model.
- Wrap the tool-use loop above into your own agent, logging tool calls and final answers separately.
- Pin the dated snapshot and run a small eval set before moving production traffic over.
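The hybrid-routing idea from the list above can start as a very small function. This is a sketch under stated assumptions: CHEAP_MODEL is a placeholder rather than a real model id, the four-characters-per-token estimate is a crude heuristic, and the 8,000-token threshold is arbitrary. Tune all three against your own traffic.

```python
PRIMARY_MODEL = "qwen3.7-max"
CHEAP_MODEL = "your-flash-tier-model-id"  # hypothetical placeholder, not a real id


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token."""
    return max(1, len(text) // 4)


def pick_model(prompt: str, needs_reasoning: bool = False,
               long_context_threshold: int = 8_000) -> str:
    """Send long or hard turns to the flagship and everything else to the cheap tier."""
    if needs_reasoning or estimate_tokens(prompt) >= long_context_threshold:
        return PRIMARY_MODEL
    return CHEAP_MODEL


print(pick_model("What's our refund policy?"))
print(pick_model("Plan the migration step by step.", needs_reasoning=True))
print(pick_model("x" * 40_000))
```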
Sources: Qwen Team 'Qwen3.7: The Agent Frontier'; Alibaba Cloud Model Studio docs; MarkTechPost launch coverage (May 21, 2026); OpenRouter pricing. Verify pricing and region availability in your own DashScope console before committing production traffic.