Gemini 3.1 Pro: Tune Reasoning, Vision and Cost in Python

Gemini 3.1 Pro Preview is the model everyone is wiring into their stacks this week. The Apple keynote on June 8 put a Gemini model inside Siri, Microsoft shipped a wave of its own MAI models the same day, and across developer communities the question that keeps coming up is not how smart is it but how do I stop it from being slow and expensive. The good news: Gemini 3 added three concrete knobs that let you trade reasoning depth, vision fidelity, and latency against cost on a per-request basis.

This guide teaches those three controls end to end: thinking_level for how hard the model reasons, media_resolution for how many tokens each image or video frame burns, and thought signatures, the encrypted reasoning tokens you must echo back during tool calls or your agent breaks with a 400. By the end you will have a small, cost-aware multimodal agent that picks the right reasoning depth per task.

Every code block below uses the official google-genai SDK and matches the current Gemini 3 developer docs. The model ID is gemini-3.1-pro-preview: a 1M-token input window, 64k output, January 2025 knowledge cutoff, priced at $2 / $12 per million tokens under 200k context and $4 / $18 above it.

Prerequisites

Python 3.9+ and a Gemini API key from Google AI Studio (set it as the GEMINI_API_KEY environment variable).
The official SDK: pip install -U google-genai (version 1.0 or newer).
Basic familiarity with chat-style LLM calls and Python functions.
Optional: a sample image on disk for the vision section.

pip install -U google-genai
export GEMINI_API_KEY="your-key-here"

Step 1: Your first Gemini 3.1 Pro call

The client reads GEMINI_API_KEY from the environment automatically. Start with a plain request so you have a baseline before you start tuning anything.

from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Find the race condition in this snippet and explain the fix:\n"
             "  counter += 1  # called from 8 threads, no lock",
)

print(response.text)
print("thinking tokens:", response.usage_metadata.thoughts_token_count)
print("output tokens:  ", response.usage_metadata.candidates_token_count)

Example output (trimmed):

The increment `counter += 1` is not atomic: it compiles to load, add, store.
Two threads can read the same value and both write back n+1, losing an update.
Fix: guard it with a lock (threading.Lock) or use itertools.count / an atomic.

thinking tokens: 412
output tokens:   96

Notice the thoughts_token_count. By default Gemini 3.1 Pro reasons at high, so it spends hidden thinking tokens before answering. Those tokens are billed as output. That single line is why your bill can be larger than the visible answer suggests, and it is exactly what the next step controls.

Step 2: thinking_level - dial reasoning up or down

thinking_level sets the maximum depth of the model's internal reasoning before it answers. Gemini treats the levels as relative allowances, not strict token budgets. If you do not set it, Gemini 3.1 Pro defaults to high.

Level	Supported on 3.1 Pro?	Use it for
minimal	No (Flash / Flash-Lite only)	Near-zero latency chat on lighter models
low	Yes	Classification, extraction, simple instruction following
medium	Yes	Balanced everyday tasks
high	Yes (default, dynamic)	Hard math, multi-step reasoning, deep code review

You set it through thinking_config on the generation config:

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Classify this review as positive, negative, or neutral: "
             "'Shipping was slow but the product is fantastic.'",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low"),
    ),
)

print(response.text)
print("thinking tokens:", response.usage_metadata.thoughts_token_count)

Example output:

Positive (the core sentiment is about the product, which is praised).

thinking tokens: 38

Same model, same prompt class, but the hidden thinking dropped from hundreds of tokens to dozens. For a classifier running millions of times a day, that is the difference between a sustainable feature and a runaway bill. Reach for low on anything mechanical; keep high for genuinely hard reasoning where a wrong answer costs more than the tokens.

Step 3: media_resolution - control what vision costs

Images and video frames are tokenized too, and Gemini 3 lets you cap how many tokens each media part may use with media_resolution. Higher resolution reads fine text and small details better but costs more per image. One important gotcha up front: this parameter currently lives in the v1alpha API version, so you must construct the client accordingly.

from google import genai
from google.genai import types
import base64

# media_resolution is only exposed on the v1alpha API surface today.
client = genai.Client(http_options={"api_version": "v1alpha"})

with open("invoice.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        types.Content(parts=[
            types.Part(text="Extract the invoice total and due date."),
            types.Part(
                inline_data=types.Blob(
                    mime_type="image/jpeg",
                    data=base64.b64encode(image_bytes).decode(),
                ),
                media_resolution={"level": "media_resolution_high"},
            ),
        ])
    ],
)

print(response.text)

The token cost per part is fixed by the level you pick. The table below comes straight from the docs - note that video is compressed far more aggressively than still images, and that low and medium collapse to the same 70 tokens per frame for video.

Media type	Recommended level	Max tokens	Why
Image	media_resolution_high	1120	Best quality for most image analysis
PDF	media_resolution_medium	560	OCR quality saturates at medium
Video (general)	media_resolution_low	70 / frame	Enough for action and scene description
Video (dense text)	media_resolution_high	280 / frame	Only when reading small on-screen text

The practical rule: default images to high, default PDFs to medium (pushing PDFs to high rarely improves OCR but doubles the cost), and keep video at low unless you are reading text off the frames. You can also set a level globally in generation_config, though global is not available for ultra_high.

Step 4: Thought signatures - the tool-calling trap

This is the one that breaks people. Gemini 3 keeps its reasoning coherent across turns using thought signatures: encrypted blobs attached to model parts that represent its internal chain of thought. During function calling the API enforces strict validation: if you replay the conversation history without returning the signatures exactly as received, you get a 400.

The rules worth memorizing:

Single tool call: the functionCall part carries a signature; send it back unchanged.
Parallel tool calls: only the first call in the batch has a signature, and parts must be returned in the exact order received.
Sequential calls in one turn: every step gets its own signature and you must accumulate all of them in the history.
Plain text / chat: validation is not strict, but echoing the signature still preserves answer quality.

The good news: if you use the official SDK and let it manage chat history, signatures are handled for you. The trap is hand-rolling history (a common pattern when you persist conversations to a database). Here is the safe SDK pattern using an automatic function-calling loop:

from google import genai
from google.genai import types

client = genai.Client()

def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    fake = {"Paris": "15C, cloudy", "London": "12C, rain"}
    return fake.get(city, "unknown")

# Passing a Python function lets the SDK run the tool loop AND carry
# the thought signatures across turns automatically.
response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents="Compare the weather in Paris and London and tell me which is nicer.",
    config=types.GenerateContentConfig(tools=[get_weather]),
)

print(response.text)

Example output:

Paris is 15C and cloudy; London is 12C with rain. Paris is the nicer of the
two right now - warmer and dry rather than wet.

If you must store and rebuild history yourself, persist the full model parts including the thought_signature field on each functionCall, and replay them verbatim. Stripping them to 'clean up' the payload is the single most common cause of mysterious 400s with Gemini 3 agents.

Step 5: Leave temperature alone

One quick but counterintuitive rule. With earlier models, lowering temperature was the standard move for more deterministic output. For Gemini 3 the docs strongly recommend keeping temperature at its default 1.0. The reasoning machinery is tuned for that value, and dropping below 1.0 can trigger looping or degraded performance on hard math and reasoning tasks. If you want less variance, constrain the output with structured schemas or a lower thinking_level, not the temperature.

Worked example: a cost-aware triage agent

Let us put all three controls together. The agent below triages incoming support messages. Cheap, mechanical steps (language detection, intent tagging) run at low; only genuinely tricky tickets escalate to high. Vision attachments are read at the cheapest level that fits the job. This is the pattern teams are actually deploying to keep Gemini 3.1 Pro affordable at scale.

from google import genai
from google.genai import types

client = genai.Client()
MODEL = "gemini-3.1-pro-preview"

def quick(prompt: str) -> str:
    """A fast, low-cost call for mechanical sub-tasks."""
    r = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="low"),
        ),
    )
    return r.text.strip()

def deep(prompt: str) -> str:
    """A careful, high-reasoning call for hard tickets."""
    r = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="high"),
        ),
    )
    return r.text.strip()

def triage(ticket: str) -> dict:
    intent = quick(f"One word intent (billing/bug/howto/other): {ticket}")
    severity = quick(f"Rate severity 1-5, number only: {ticket}")

    if int(severity[0]) >= 4:
        plan = deep(
            "You are a senior support engineer. Draft a precise resolution "
            f"plan for this high-severity ticket:\n{ticket}"
        )
    else:
        plan = quick(f"Draft a one-line reply to: {ticket}")

    return {"intent": intent, "severity": severity, "plan": plan}

print(triage("I was charged twice this month and the app now crashes on launch."))

Example output (trimmed):

{
  "intent": "billing",
  "severity": "5",
  "plan": "1) Confirm the duplicate charge in the billing system and issue a
           refund for the second transaction. 2) For the crash, collect the
           device/OS and app version, check crash logs for the launch path,
           and roll the user back to the last stable build while we patch..."
}

Two of the three calls ran at low and finished fast and cheap; only the escalation paid for deep reasoning. Swap in a screenshot attachment and you would add a media_resolution_high part to the deep call so the model can read the error dialog text.

Common pitfalls and gotchas

Mixing thinking_level with thinking_budget. They are mutually exclusive. Sending both in one request returns a 400. thinking_budget is the legacy 1.5/2.x parameter; on Gemini 3 use thinking_level only.
Expecting minimal on Pro. minimal is not supported on Gemini 3.1 Pro - it exists only on Flash and Flash-Lite. The lowest Pro setting is low.
Calling media_resolution on the stable endpoint. It is only exposed on v1alpha today. Build the client with http_options={'api_version': 'v1alpha'} or the parameter is silently ignored.
Dropping thought signatures in hand-rolled history. Strict validation during function calling means a missing signature is a hard 400, not a soft degradation. Persist and replay them verbatim.
Reordering parallel function calls. Only the first parallel call carries a signature; if you sort or dedupe the parts before replay, validation fails.
Lowering temperature for determinism. On Gemini 3 this can cause looping. Keep it at 1.0 and control variance through schemas or thinking_level instead.
Forgetting thinking tokens are billed. thoughts_token_count counts toward output billing. Always measure it when you benchmark cost, not just the visible answer length.
Pushing PDFs to high resolution. OCR quality saturates at medium; high roughly doubles the token cost for no real gain on standard documents.

Quick reference

Knob	Where it goes	Default	When to change it
thinking_level	ThinkingConfig	high (Pro)	Drop to low for mechanical tasks
media_resolution	Part or generation_config (v1alpha)	auto	Set per media type to cap vision cost
thought signatures	Returned on model parts	SDK-managed	Replay verbatim in hand-rolled history
temperature	GenerateContentConfig	1.0	Leave it at 1.0
model	model= argument	n/a	gemini-3.1-pro-preview

Next steps

Add structured output with a response schema so the triage agent returns typed JSON instead of free text.
Wrap the deep/quick split into a router that estimates difficulty from the prompt before choosing a level.
Benchmark low vs high on your own eval set and log thoughts_token_count to quantify the savings.
Move multi-turn tool agents onto the SDK's chat session so thought signatures are handled for you end to end.

Sources: Gemini 3 Developer Guide (ai.google.dev/gemini-api/docs/gemini-3), the Gemini 3.1 Pro announcement on the Google blog, and the Media resolution and Thought signatures pages in the official Gemini API docs.

Gemini 3.1 Pro: Tune Reasoning, Vision and Cost in Python

Gemini 3.1 Pro: Tune Reasoning, Vision and Cost in Python

Prerequisites

Step 1: Your first Gemini 3.1 Pro call

Step 2: thinking_level - dial reasoning up or down

Step 3: media_resolution - control what vision costs

Step 4: Thought signatures - the tool-calling trap

Step 5: Leave temperature alone

Worked example: a cost-aware triage agent

Common pitfalls and gotchas

Quick reference

Next steps

Comments

Found this useful?