
Gemini 3.1 Pro: Tune Reasoning, Vision and Cost in Python
Summary
Control thinking_level, media_resolution and thought signatures in the Gemini 3.1 Pro API.
Gemini 3.1 Pro Preview is the model everyone is wiring into their stacks this week. The Apple keynote on June 8 put a Gemini model inside Siri, Microsoft shipped a wave of its own MAI models the same day, and across developer communities the question that keeps coming up is not how smart is it but how do I stop it from being slow and expensive. The good news: Gemini 3 added three concrete knobs that let you trade reasoning depth, vision fidelity, and latency against cost on a per-request basis.
This guide teaches those three controls end to end: thinking_level for how hard the model reasons, media_resolution for how many tokens each image or video frame burns, and thought signatures, the encrypted reasoning tokens you must echo back during tool calls or your agent breaks with a 400. By the end you will have a small, cost-aware multimodal agent that picks the right reasoning depth per task.
Every code block below uses the official google-genai SDK and matches the current Gemini 3 developer docs. The model ID is gemini-3.1-pro-preview: a 1M-token input window, 64k output, January 2025 knowledge cutoff, priced at $2 / $12 per million input / output tokens for prompts under 200k tokens and $4 / $18 above that.
Prerequisites
- Python 3.9+ and a Gemini API key from Google AI Studio (set it as the GEMINI_API_KEY environment variable).
- The official SDK: pip install -U google-genai (version 1.0 or newer).
- Basic familiarity with chat-style LLM calls and Python functions.
- Optional: a sample image on disk for the vision section.
pip install -U google-genai
export GEMINI_API_KEY="your-key-here"
Step 1: Your first Gemini 3.1 Pro call
The client reads GEMINI_API_KEY from the environment automatically. Start with a plain request so you have a baseline before you start tuning anything.
from google import genai
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Find the race condition in this snippet and explain the fix:\n"
             " counter += 1 # called from 8 threads, no lock",
)
print(response.text)
print("thinking tokens:", response.usage_metadata.thoughts_token_count)
print("output tokens: ", response.usage_metadata.candidates_token_count)
Example output (trimmed):
The increment `counter += 1` is not atomic: it compiles to load, add, store.
Two threads can read the same value and both write back n+1, losing an update.
Fix: guard it with a lock (threading.Lock) or use itertools.count / an atomic.
thinking tokens: 412
output tokens: 96
Notice the thoughts_token_count. By default Gemini 3.1 Pro reasons at high, so it spends hidden thinking tokens before answering. Those tokens are billed as output. That single line is why your bill can be larger than the visible answer suggests, and it is exactly what the next step controls.
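To put a dollar figure on that, here is a minimal sketch of a per-call cost estimate. It assumes the prices quoted earlier in this guide and that thinking tokens are billed at the output rate. `estimate_cost_usd` is a hypothetical helper, not part of the SDK. Choosing the price tier by prompt size and the 40-token prompt are both assumptions made for illustration.

```python
# Hypothetical helper (not part of google-genai): rough per-call cost in USD.
# Prices are the ones quoted in this guide; check the current pricing page.
LONG_CONTEXT_THRESHOLD = 200_000  # assumption: the tier is chosen by prompt size


def estimate_cost_usd(prompt_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate the cost of one call, billing thinking tokens as output."""
    if prompt_tokens <= LONG_CONTEXT_THRESHOLD:
        input_rate, output_rate = 2.00, 12.00  # USD per million tokens
    else:
        input_rate, output_rate = 4.00, 18.00
    billed_output = output_tokens + thinking_tokens
    return (prompt_tokens * input_rate + billed_output * output_rate) / 1_000_000


# Token counts from the example output above, with an assumed 40-token prompt.
visible_only = estimate_cost_usd(40, 96)
with_thinking = estimate_cost_usd(40, 96, thinking_tokens=412)
print(f"visible answer only: ${visible_only:.6f}")
print(f"with thinking:       ${with_thinking:.6f}")
```

With these numbers the hidden reasoning accounts for most of the cost of the call, which is why it is worth logging `thoughts_token_count` alongside the visible output.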
Step 2: thinking_level - dial reasoning up or down
thinking_level sets the maximum depth of the model's internal reasoning before it answers. Gemini treats the levels as relative allowances, not strict token budgets. If you do not set it, Gemini 3.1 Pro defaults to high.
| Level | Supported on 3.1 Pro? | Use it for |
|---|---|---|
| minimal | No (Flash / Flash-Lite only) | Near-zero latency chat on lighter models |
| low | Yes | Classification, extraction, simple instruction following |
| medium | Yes | Balanced everyday tasks |
| high | Yes (default, dynamic) | Hard math, multi-step reasoning, deep code review |
You set it through thinking_config on the generation config:
from google import genai
from google.genai import types
client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Classify this review as positive, negative, or neutral: "
             "'Shipping was slow but the product is fantastic.'",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low"),
    ),
)
print(response.text)
print("thinking tokens:", response.usage_metadata.thoughts_token_count)
Example output:
Positive (the core sentiment is about the product, which is praised).
thinking tokens: 38
Same model, same prompt class, but the hidden thinking dropped from hundreds of tokens to dozens. For a classifier running millions of times a day, that is the difference between a sustainable feature and a runaway bill. Reach for low on anything mechanical; keep high for genuinely hard reasoning where a wrong answer costs more than the tokens.
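One way to apply that rule consistently is a small lookup that maps a task kind to a level. The sketch below is illustrative only: the task-kind labels are hypothetical names taken from the "Use it for" column of the table above, not values the API knows about, and `pick_thinking_level` is not an SDK function.

```python
# Hypothetical routing table: task kinds are illustrative labels, not API values.
LEVEL_BY_TASK = {
    "classification": "low",
    "extraction": "low",
    "instruction_following": "low",
    "everyday": "medium",
    "math": "high",
    "multi_step_reasoning": "high",
    "code_review": "high",
}


def pick_thinking_level(task_kind: str) -> str:
    """Return a thinking_level for a task kind."""
    # Unknown task kinds fall back to "high", which matches the Pro default.
    return LEVEL_BY_TASK.get(task_kind, "high")


print(pick_thinking_level("classification"))  # low
print(pick_thinking_level("code_review"))     # high
print(pick_thinking_level("something_new"))   # high
```

The returned string can be passed straight to `types.ThinkingConfig(thinking_level=...)`. Falling back to high for unknown tasks trades cost for safety; a stricter router could raise an error instead.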
Step 3: media_resolution - control what vision costs
Images and video frames are tokenized too, and Gemini 3 lets you cap how many tokens each media part may use with media_resolution. Higher resolution reads fine text and small details better but costs more per image. One important gotcha up front: this parameter currently lives in the v1alpha API version, so you must construct the client accordingly.
from google import genai
from google.genai import types
# media_resolution is only exposed on the v1alpha API surface today.
client = genai.Client(http_options={"api_version": "v1alpha"})
with open("invoice.jpg", "rb") as f:
    image_bytes = f.read()
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        types.Content(parts=[
            types.Part(text="Extract the invoice total and due date."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/jpeg",
                    # Blob.data takes raw bytes; the SDK base64-encodes them for the request.
                    data=image_bytes,
                ),
                media_resolution={"level": "media_resolution_high"},
            ),
        ])
    ],
)
print(response.text)
The token cost per part is fixed by the level you pick. The table below comes straight from the docs - note that video is compressed far more aggressively than still images, and that low and medium collapse to the same 70 tokens per frame for video.
| Media type | Recommended level | Max tokens | Why |
|---|---|---|---|
| Image | media_resolution_high | 1120 | Best quality for most image analysis |
| PDF | media_resolution_medium | 560 | OCR quality saturates at medium |
| Video (general) | media_resolution_low | 70 / frame | Enough for action and scene description |
| Video (dense text) | media_resolution_high | 280 / frame | Only when reading small on-screen text |
The practical rule: default images to high, default PDFs to medium (pushing PDFs to high rarely improves OCR but doubles the cost), and keep video at low unless you are reading text off the frames. You can also set a level globally in generation_config, though global is not available for ultra_high.
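Because the cap per part is fixed, the vision cost of a request can be bounded before sending it. The sketch below encodes only the figures given in this section. `media_token_budget` is a hypothetical helper, and treating the PDF figure as a per-page cap is an assumption.

```python
# Per-part token caps taken from this section; other combinations are
# deliberately left out rather than guessed.
MAX_TOKENS = {
    ("image", "media_resolution_high"): 1120,
    ("pdf", "media_resolution_medium"): 560,   # assumption: applied per page
    ("video", "media_resolution_low"): 70,     # per frame
    ("video", "media_resolution_medium"): 70,  # per frame, same as low
    ("video", "media_resolution_high"): 280,   # per frame
}


def media_token_budget(kind: str, level: str, count: int = 1) -> int:
    """Upper bound on media tokens for `count` images, pages, or frames."""
    try:
        return MAX_TOKENS[(kind, level)] * count
    except KeyError:
        raise ValueError(f"no documented cap for {kind!r} at {level!r}") from None


# A 60-frame clip at low vs high resolution.
print(media_token_budget("video", "media_resolution_low", 60))
print(media_token_budget("video", "media_resolution_high", 60))
```

For the 60-frame clip the high setting costs four times as many tokens as low, so it is worth reserving for clips where on-screen text matters.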
Step 4: Thought signatures - the tool-calling trap
This is the one that breaks people. Gemini 3 keeps its reasoning coherent across turns using thought signatures: encrypted blobs attached to model parts that represent its internal chain of thought. During function calling the API enforces strict validation: if you replay the conversation history without returning the signatures exactly as received, you get a 400.
The rules worth memorizing:
- Single tool call: the functionCall part carries a signature; send it back unchanged.
- Parallel tool calls: only the first call in the batch has a signature, and parts must be returned in the exact order received.
- Sequential calls in one turn: every step gets its own signature and you must accumulate all of them in the history.
- Plain text / chat: validation is not strict, but echoing the signature still preserves answer quality.
The good news: if you use the official SDK and let it manage chat history, signatures are handled for you. The trap is hand-rolling history (a common pattern when you persist conversations to a database). Here is the safe SDK pattern using an automatic function-calling loop:
from google import genai
from google.genai import types
client = genai.Client()
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    fake = {"Paris": "15C, cloudy", "London": "12C, rain"}
    return fake.get(city, "unknown")
# Passing a Python function lets the SDK run the tool loop AND carry
# the thought signatures across turns automatically.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Compare the weather in Paris and London and tell me which is nicer.",
    config=types.GenerateContentConfig(tools=[get_weather]),
)
print(response.text)
Example output:
Paris is 15C and cloudy; London is 12C with rain. Paris is the nicer of the
two right now - warmer and dry rather than wet.
If you must store and rebuild history yourself, persist the full model parts including the thought_signature field on each functionCall, and replay them verbatim. Stripping them to 'clean up' the payload is the single most common cause of mysterious 400s with Gemini 3 agents.
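If you do persist history yourself, a cheap safeguard is to compare the parts you received with the parts you are about to replay. The sketch below is a hypothetical check, not an SDK feature. It assumes parts are stored as plain dicts with `functionCall` and `thoughtSignature` keys, and the signature value shown is a placeholder, since real signatures are opaque.

```python
# Hypothetical check for hand-rolled history. Parts are plain dicts with
# "functionCall" and "thoughtSignature" keys (an assumed storage shape).
def signature_problems(received: list[dict], replayed: list[dict]) -> list[str]:
    """Compare the model parts you received with the ones you plan to replay."""
    problems = []
    if len(received) != len(replayed):
        problems.append(f"part count changed: {len(received)} -> {len(replayed)}")
    for i, (orig, new) in enumerate(zip(received, replayed)):
        if orig.get("functionCall") != new.get("functionCall"):
            problems.append(f"part {i}: function call changed or reordered")
        if orig.get("thoughtSignature") != new.get("thoughtSignature"):
            problems.append(f"part {i}: thought signature missing or altered")
    return problems


# Parallel calls: only the first part carries a signature.
received = [
    {"functionCall": {"name": "get_weather", "args": {"city": "Paris"}},
     "thoughtSignature": "sig-A"},  # placeholder value
    {"functionCall": {"name": "get_weather", "args": {"city": "London"}}},
]
stripped = [{"functionCall": p["functionCall"]} for p in received]

print(signature_problems(received, received))  # nothing to report
print(signature_problems(received, stripped))  # part 0 lost its signature
```

Running a check like this just before each replay turns a confusing 400 from the API into a specific message about which part was modified.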
Step 5: Leave temperature alone
One quick but counterintuitive rule. With earlier models, lowering temperature was the standard move for more deterministic output. For Gemini 3 the docs strongly recommend keeping temperature at its default 1.0. The reasoning machinery is tuned for that value, and dropping below 1.0 can trigger looping or degraded performance on hard math and reasoning tasks. If you want less variance, constrain the output with structured schemas or a lower thinking_level, not the temperature.
Worked example: a cost-aware triage agent
Let us put the controls together. The agent below triages incoming support messages, using thinking_level as its main cost lever. Cheap, mechanical steps (intent tagging, severity rating) run at low; only genuinely tricky tickets escalate to high. If you add vision attachments, read them at the cheapest media_resolution level that fits the job. This split is a practical way to keep Gemini 3.1 Pro affordable at scale.
import re
from google import genai
from google.genai import types
client = genai.Client()
MODEL = "gemini-3.1-pro-preview"
def quick(prompt: str) -> str:
    """A fast, low-cost call for mechanical sub-tasks."""
    r = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="low"),
        ),
    )
    return r.text.strip()
def deep(prompt: str) -> str:
    """A careful, high-reasoning call for hard tickets."""
    r = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="high"),
        ),
    )
    return r.text.strip()
def triage(ticket: str) -> dict:
    intent = quick(f"One word intent (billing/bug/howto/other): {ticket}")
    severity = quick(f"Rate severity 1-5, number only: {ticket}")
    # The reply may contain extra text; use the first digit 1-5, else treat as low severity.
    match = re.search(r"[1-5]", severity)
    if match and int(match.group()) >= 4:
        plan = deep(
            "You are a senior support engineer. Draft a precise resolution "
            f"plan for this high-severity ticket:\n{ticket}"
        )
    else:
        plan = quick(f"Draft a one-line reply to: {ticket}")
    return {"intent": intent, "severity": severity, "plan": plan}
print(triage("I was charged twice this month and the app now crashes on launch."))
Example output (trimmed):
{
"intent": "billing",
"severity": "5",
"plan": "1) Confirm the duplicate charge in the billing system and issue a
refund for the second transaction. 2) For the crash, collect the
device/OS and app version, check crash logs for the launch path,
and roll the user back to the last stable build while we patch..."
}
Two of the three calls ran at low and finished fast and cheap; only the escalation paid for deep reasoning. Swap in a screenshot attachment and you would add a media_resolution_high part to the deep call so the model can read the error dialog text.
Common pitfalls and gotchas
- Mixing thinking_level with thinking_budget. They are mutually exclusive. Sending both in one request returns a 400. thinking_budget is the legacy Gemini 2.5 parameter; on Gemini 3 use thinking_level only.
- Expecting minimal on Pro. minimal is not supported on Gemini 3.1 Pro - it exists only on Flash and Flash-Lite. The lowest Pro setting is low.
- Calling media_resolution on the stable endpoint. It is only exposed on v1alpha today. Build the client with http_options={'api_version': 'v1alpha'} or the parameter is silently ignored.
- Dropping thought signatures in hand-rolled history. Strict validation during function calling means a missing signature is a hard 400, not a soft degradation. Persist and replay them verbatim.
- Reordering parallel function calls. Only the first parallel call carries a signature; if you sort or dedupe the parts before replay, validation fails.
- Lowering temperature for determinism. On Gemini 3 this can cause looping. Keep it at 1.0 and control variance through schemas or thinking_level instead.
- Forgetting thinking tokens are billed. thoughts_token_count counts toward output billing. Always measure it when you benchmark cost, not just the visible answer length.
- Pushing PDFs to high resolution. OCR quality saturates at medium; high roughly doubles the token cost for no real gain on standard documents.
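Several of these pitfalls can be caught before a request is sent. The sketch below is a hypothetical preflight check over a plain dict. The key names mirror the parameters discussed in this guide, but the dict layout and the `preflight` function are assumptions for illustration, not part of the SDK.

```python
# Hypothetical preflight check on a plain-dict request config. The dict layout
# is an assumption; adapt it to however your code builds its requests.
PRO_LEVELS = {"low", "medium", "high"}


def preflight(config: dict) -> list[str]:
    """Return human-readable warnings for settings this guide advises against."""
    warnings = []
    thinking = config.get("thinking_config", {})
    if "thinking_level" in thinking and "thinking_budget" in thinking:
        warnings.append("thinking_level and thinking_budget are mutually exclusive")
    level = thinking.get("thinking_level")
    if level is not None and level not in PRO_LEVELS:
        warnings.append(f"thinking_level {level!r} is not supported on Gemini 3.1 Pro")
    if config.get("temperature", 1.0) < 1.0:
        warnings.append("temperature below 1.0 can cause looping; keep the default")
    return warnings


print(preflight({"thinking_config": {"thinking_level": "low"}}))
print(preflight({
    "temperature": 0.2,
    "thinking_config": {"thinking_level": "minimal", "thinking_budget": 1024},
}))
```

The first call reports nothing; the second trips all three checks. Running this in tests or at startup is cheaper than discovering a 400 in production.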
Quick reference
| Knob | Where it goes | Default | When to change it |
|---|---|---|---|
| thinking_level | ThinkingConfig | high (Pro) | Drop to low for mechanical tasks |
| media_resolution | Part or generation_config (v1alpha) | auto | Set per media type to cap vision cost |
| thought signatures | Returned on model parts | SDK-managed | Replay verbatim in hand-rolled history |
| temperature | GenerateContentConfig | 1.0 | Leave it at 1.0 |
| model | model= argument | n/a | gemini-3.1-pro-preview |
Next steps
- Add structured output with a response schema so the triage agent returns typed JSON instead of free text.
- Wrap the deep/quick split into a router that estimates difficulty from the prompt before choosing a level.
- Benchmark low vs high on your own eval set and log thoughts_token_count to quantify the savings.
- Move multi-turn tool agents onto the SDK's chat session so thought signatures are handled for you end to end.
Sources: Gemini 3 Developer Guide (ai.google.dev/gemini-api/docs/gemini-3), the Gemini 3.1 Pro announcement on the Google blog, and the Media resolution and Thought signatures pages in the official Gemini API docs.