
Gemini 3.5 Flash API: Control Thinking Levels in Python
Summary
Build with Gemini 3.5 Flash: thinking levels, streaming thoughts and function calling in Python.
Google made Gemini 3.5 Flash generally available on May 19, 2026, and developers spent the week comparing it against the flagship models because the numbers looked strange in the best way: near-frontier reasoning at Flash speed. The interesting part for anyone writing code is not the leaderboard. It is that Google quietly changed how you control the model's reasoning. The old numeric thinking_budget you may have wired into a Gemini 2.5 project no longer applies. Gemini 3.5 Flash reasons through a four-step thinking_level enum instead, and the default effort dropped from high to medium.
That single change breaks a lot of copy-pasted Gemini code, and it also gives you a much cleaner lever for trading latency against quality. This guide walks through the whole loop in Python with the google-genai SDK: a first call, setting and reasoning about thinking_level, streaming the model's thought summaries as they form, reading thinking token usage so you can predict cost, and wiring up function calling that actually returns answers instead of empty responses. Every snippet is built against the official Gemini API docs as of the GA release.
By the end you will have a small, runnable triage agent that picks a low thinking level for easy requests and escalates to high only when a task is genuinely hard, which is exactly how you keep a Flash-tier bill from quietly turning into a Pro-tier bill.
Prerequisites
- Python 3.9 or newer.
- A Gemini API key from Google AI Studio (the free tier is enough to run everything here).
- The google-genai SDK, version 2.0.0 or later. Gemini 3.5 Flash needs the modern from google import genai package, not the legacy google-generativeai one.
- Basic familiarity with calling an HTTP API and reading JSON responses.
pip install -U google-genai
export GEMINI_API_KEY="your_key_here"
The client reads GEMINI_API_KEY from the environment automatically, so you never pass the key in code.
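A fail-fast check can make a missing key easier to diagnose than an authentication error on the first request. The sketch below is an optional addition, not part of the SDK: find_api_key is a hypothetical helper that only inspects a mapping of environment variables and never contacts the API.

```python
import os
from typing import Mapping, Optional


def find_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Gemini API key from the environment or raise a clear error.

    Hypothetical helper for illustration; the SDK performs its own lookup.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set. Export it in your shell before running."
        )
    return key


# Demonstration with an explicit mapping, so the real environment is untouched.
find_api_key({"GEMINI_API_KEY": "your_key_here"})
print("key found")
```

Calling find_api_key() with no argument checks the real environment, which is a reasonable first line in a script's entry point.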
Step 1: Your first Gemini 3.5 Flash call
Start with the smallest possible program. The model ID is the stable string gemini-3.5-flash. There is no -preview suffix anymore because this is a GA model meant for production.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Explain how parallel agentic execution works in three sentences.",
)
print(response.text)
Example output (yours will vary slightly run to run):
Parallel agentic execution runs several sub-agents at the same time, each
handling an independent slice of the task, instead of marching through steps
one after another. A coordinator hands out the slices, waits for every branch
to return, and then merges the partial results into one answer. Because the
slow branches overlap instead of stacking, total wall-clock time drops to the
length of the longest single branch rather than the sum of all of them.
If that runs, your key and SDK are wired correctly and you are talking to the model. Notice you did not set any sampling parameters. For all Gemini 3.x models Google now recommends leaving temperature, top_p and top_k at their defaults; the reasoning stack is tuned for them, and overriding them tends to hurt quality.
Step 2: Set the thinking level (the big change)
Gemini 3.5 Flash decides how hard to think before it answers. You steer that with thinking_level, a string enum with four settings. This replaces the numeric thinking_budget from the Gemini 2.5 era. The budget parameter is still accepted for backward compatibility, but Google explicitly recommends against it for Gemini 3.x because it can produce unexpected behavior.
| thinking_level | When to reach for it |
|---|---|
| minimal | Chat replies, quick factual lookups, simple tool calls. Lowest latency. Note it does not fully disable thinking. |
| low | Code and agentic tasks with few steps, plus analysis and writing that needs a little reasoning. Strong quality, low cost. |
| medium (default) | Best all-round quality. Recommended for complex code and most agentic use cases. |
| high | Hard math, deep reasoning, the toughest coding and agent tasks. Allows extended thoughts and more tool calls. |
You pass the level inside a ThinkingConfig, which lives inside a GenerateContentConfig:
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Prove that the square root of 2 is irrational.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="high"),
    ),
)
print(response.text)
The rule of thumb from Google's own migration notes: start at medium, drop to low when you want a faster and cheaper response with still-strong quality, switch to high for genuinely hard problems, and use minimal when speed matters more than depth. One subtlety worth burning into memory: minimal is not an off switch. Gemini 3.5 Flash cannot fully disable thinking, so even at minimal the model may think a little on complex inputs.
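Because thinking_level is a plain string, a typo such as "hihg" would only surface once the request is sent. Here is a minimal sketch of a local guard, assuming the four values listed in the table above. The names THINKING_LEVELS, DEFAULT_LEVEL and normalize_level are hypothetical and not part of google-genai.

```python
THINKING_LEVELS = ("minimal", "low", "medium", "high")  # from the table above
DEFAULT_LEVEL = "medium"  # the documented default for Gemini 3.5 Flash


def normalize_level(level=None):
    """Return a valid thinking level string, or raise ValueError on a typo."""
    if level is None:
        return DEFAULT_LEVEL
    cleaned = str(level).strip().lower()
    if cleaned not in THINKING_LEVELS:
        raise ValueError(
            f"Unknown thinking_level {level!r}; expected one of {THINKING_LEVELS}"
        )
    return cleaned


print(normalize_level(" High "))  # high
print(normalize_level())          # medium
```

The returned string can then be passed as thinking_level when building the ThinkingConfig, so a bad value fails before any tokens are spent.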
Step 3: Stream the model's thought summaries
Thinking is normally invisible, but you can ask for thought summaries, which are condensed views of the model's internal reasoning. Set include_thoughts=True, then stream the response so the summary arrives incrementally while the model works. Each streamed part carries a thought boolean: when it is True the text is reasoning, when it is False it is the actual answer.
from google import genai
from google.genai import types

client = genai.Client()

prompt = """
Alice, Bob, and Carol each live in a different house: red, green, or blue.
The person in the red house owns a cat.
Bob does not live in the green house.
Carol owns a dog.
The green house is to the left of the red house.
Alice does not own a cat.
Who lives where, and what pet does each own?
"""

thoughts, answer = "", ""
for chunk in client.models.generate_content_stream(
    model="gemini-3.5-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    ),
):
    for part in chunk.candidates[0].content.parts:
        if not part.text:
            continue
        if part.thought:
            if not thoughts:
                print("--- THOUGHT SUMMARY ---")
            print(part.text, end="")
            thoughts += part.text
        else:
            if not answer:
                print("\n\n--- ANSWER ---")
            print(part.text, end="")
            answer += part.text
Trimmed example output:
--- THOUGHT SUMMARY ---
Green is left of red, so green is not the rightmost house. Carol owns a dog,
so Carol is not in the red house (cat). Alice does not own a cat, so Alice is
not in the red house either, which leaves Bob in the red house with the cat...
--- ANSWER ---
Bob lives in the red house and owns the cat. Alice lives in the green house and
owns the bird. Carol lives in the blue house and owns the dog.
Thought summaries are useful for two things: showing a live "thinking..." indicator in a UI, and debugging. When the model gives you a wrong answer, reading the summary usually shows you exactly which assumption it got wrong, which is far faster than blindly rewording your prompt.
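For debugging it can help to pull the thought/answer split out of the streaming loop so it can be tested without a network call. The sketch below assumes only the two attributes described above, text and thought. split_parts is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real SDK parts. It treats a missing or empty thought flag as answer text, which is an assumption rather than documented behavior.

```python
from types import SimpleNamespace


def split_parts(parts):
    """Separate thought-summary text from answer text across streamed parts."""
    thoughts, answer = [], []
    for part in parts:
        text = getattr(part, "text", None)
        if not text:
            continue  # skip parts that carry no text
        if getattr(part, "thought", False):
            thoughts.append(text)
        else:
            answer.append(text)
    return "".join(thoughts), "".join(answer)


# Stand-in objects that mimic only the two attributes used above.
fake_parts = [
    SimpleNamespace(text="Carol owns a dog, ", thought=True),
    SimpleNamespace(text="so Carol is not in the red house.", thought=True),
    SimpleNamespace(text=None, thought=None),
    SimpleNamespace(text="Bob lives in the red house.", thought=False),
]
thoughts, answer = split_parts(fake_parts)
print("THOUGHTS:", thoughts)
print("ANSWER:", answer)
```

Logging the two strings separately makes it easy to compare the reasoning against the final answer when a response looks wrong.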
Step 4: Measure thinking tokens so cost never surprises you
When thinking is on, you pay for the thinking tokens on top of the visible output tokens, even though only the summary is returned. The response carries a usage_metadata object that breaks this down, so you can log it and watch your spend per request.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="What is the sum of the first 50 prime numbers?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="medium"),
    ),
)

# usage_metadata breaks the bill down by token type
u = response.usage_metadata
print("Prompt tokens:  ", u.prompt_token_count)
print("Thinking tokens:", u.thoughts_token_count)
print("Output tokens:  ", u.candidates_token_count)
print("Total tokens:   ", u.total_token_count)
Example output:
Prompt tokens: 18
Thinking tokens: 511
Output tokens: 42
Total tokens: 571
The lesson lands immediately: a 42-token answer cost 511 thinking tokens at medium. Drop the same call to low and that thinking count usually falls by more than half. This is why the thinking_level choice is the single biggest cost lever you have on Gemini 3.5 Flash, far more than prompt length for short queries.
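To turn those counts into a budget figure, a rough estimator is enough. The sketch below is illustrative only. It assumes thinking tokens are billed at the same rate as output tokens, and the per-million-token prices are made-up placeholders, not real Gemini pricing. estimate_cost and thinking_share are hypothetical helper names.

```python
def estimate_cost(prompt_tokens, thinking_tokens, output_tokens,
                  input_price_per_m, output_price_per_m):
    """Estimate request cost in dollars.

    Assumes thinking tokens are billed at the output rate; prices are
    placeholders per one million tokens, not real Gemini pricing.
    """
    thinking_tokens = thinking_tokens or 0  # tolerate a missing count
    input_cost = prompt_tokens * input_price_per_m / 1_000_000
    output_cost = (thinking_tokens + output_tokens) * output_price_per_m / 1_000_000
    return input_cost + output_cost


def thinking_share(prompt_tokens, thinking_tokens, output_tokens):
    """Return the fraction of all tokens that were spent on thinking."""
    thinking_tokens = thinking_tokens or 0
    total = prompt_tokens + thinking_tokens + output_tokens
    return thinking_tokens / total if total else 0.0


# Token counts from the example output above; prices are placeholders.
cost = estimate_cost(18, 511, 42, input_price_per_m=1.0, output_price_per_m=5.0)
share = thinking_share(18, 511, 42)
print(f"estimated cost: ${cost:.6f}")
print(f"thinking share of total tokens: {share:.1%}")
```

With the example counts, thinking accounts for 511 of 571 tokens, or about 89.5% of the request, whatever the actual prices turn out to be.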
Step 5: Function calling that returns real answers
Most agentic work needs the model to call your code. The easy path with google-genai is automatic function calling: pass a plain Python function as a tool, and the SDK runs it, feeds the result back, and handles the encrypted thought signatures behind the scenes. A clear docstring and type hints are what the model uses to decide when and how to call it.
from google import genai
from google.genai import types


def get_weather(city: str) -> dict:
    """Return the current weather for a city.

    Args:
        city: The city name, for example "London".
    """
    fake_data = {
        "London": {"temp_c": 12, "sky": "light rain"},
        "Tokyo": {"temp_c": 24, "sky": "clear"},
    }
    return fake_data.get(city, {"temp_c": 20, "sky": "unknown"})


client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Should I take an umbrella in London today?",
    config=types.GenerateContentConfig(
        tools=[get_weather],
        thinking_config=types.ThinkingConfig(thinking_level="low"),
    ),
)
print(response.text)
Example output:
Yes, take an umbrella. It is currently 12 C with light rain in London, so you
will want it for the walk.
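To see what a docstring and type hints expose, the sketch below uses only the standard library to collect the pieces of get_weather that a tool description could be built from. How the SDK actually constructs its function declaration is not covered in this guide, so describe_tool is a hypothetical illustration rather than the SDK's mechanism.

```python
import inspect
from typing import get_type_hints


def get_weather(city: str) -> dict:
    """Return the current weather for a city.

    Args:
        city: The city name, for example "London".
    """
    return {"temp_c": 20, "sky": "unknown"}


def describe_tool(func):
    """Collect the name, summary line, and parameter types of a function."""
    hints = get_type_hints(func)
    doc = inspect.getdoc(func) or ""
    params = {
        name: getattr(hints.get(name), "__name__", "untyped")
        for name in inspect.signature(func).parameters
    }
    return {
        "name": func.__name__,
        "summary": doc.splitlines()[0] if doc else "",
        "parameters": params,
    }


print(describe_tool(get_weather))
```

A function with no docstring or no annotations yields an empty summary and "untyped" parameters here, which is a quick way to spot tools the model will have little to go on.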
If you instead manage conversation history by hand or call the REST API directly, Gemini 3.x adds a strict rule you must follow or you will get empty responses with finish_reason: STOP. Every function response you send back must include the id from the matching function call, the name must match, and you must return exactly one response per call.
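# Manual path: echo the call id and name back exactly.
# Assumes these names exist from the earlier turn: config (the same
# GenerateContentConfig with your tool declarations), previous_contents (the
# history so far), response (the model's function call turn), tool_call (the
# function call taken from that turn), and result (your function's return value).
final = client.models.generate_content(
    model="gemini-3.5-flash",
    config=config,
    contents=[
        *previous_contents,
        response.candidates[0].content,  # the model's function call turn
        types.Content(role="user", parts=[
            types.Part.from_function_response(
                name=tool_call.name,  # must match the call
                response={"result": result},
                id=tool_call.id,  # must echo the call id
            )
        ]),
    ],
)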
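Since a mismatch fails silently, a local check before sending the next turn can save debugging time. The sketch below encodes the three rules stated above (matching id, matching name, exactly one response per call) over plain stand-in records. Call, Response and check_responses are hypothetical names, not SDK types.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    id: str
    name: str


@dataclass(frozen=True)
class Response:
    id: str
    name: str


def check_responses(calls, responses):
    """Return a list of problems; an empty list means the pairing is valid."""
    problems = []
    expected = {c.id: c.name for c in calls}
    counts = Counter(r.id for r in responses)
    for r in responses:
        if r.id not in expected:
            problems.append(f"response id {r.id!r} matches no call")
        elif r.name != expected[r.id]:
            problems.append(
                f"response {r.id!r} has name {r.name!r}, expected {expected[r.id]!r}"
            )
    for call_id in expected:
        if counts[call_id] == 0:
            problems.append(f"call {call_id!r} has no response")
        elif counts[call_id] > 1:
            problems.append(f"call {call_id!r} has {counts[call_id]} responses")
    return problems


calls = [Call("c1", "get_weather")]
print(check_responses(calls, [Response("c1", "get_weather")]))  # []
print(check_responses(calls, [Response("c1", "weather")]))
```

In a real agent loop you would build the two lists from the function calls in the model's turn and the function responses you are about to send, then refuse to send if the list of problems is not empty.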
Worked example: a cost-aware triage agent
Here is the payoff, a small agent that reads each incoming request, decides how hard it is, and picks a thinking level to match. Cheap requests stay cheap; only the genuinely hard ones pay for high reasoning. This is the pattern that keeps a high-volume Flash deployment affordable.
from google import genai
from google.genai import types

client = genai.Client()


def choose_level(prompt: str) -> str:
    """Cheap heuristic router: pick a thinking level before spending tokens."""
    p = prompt.lower()
    hard = ("prove", "derive", "optimize", "debug", "algorithm", "complexity")
    easy = ("what is", "who is", "define", "list", "when did")
    if any(k in p for k in hard):
        return "high"
    if any(p.startswith(k) for k in easy):
        return "minimal"
    return "low"


def answer(prompt: str) -> str:
    level = choose_level(prompt)
    resp = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level),
        ),
    )
    u = resp.usage_metadata
    print(f"[level={level} thinking_tokens={u.thoughts_token_count}]")
    return resp.text


for q in [
    "What is the capital of Australia?",
    "Debug why this recursive function overflows the stack for n > 1000.",
]:
    print("Q:", q)
    print("A:", answer(q))
    print()
Example output:
Q: What is the capital of Australia?
[level=minimal thinking_tokens=0]
A: The capital of Australia is Canberra.
Q: Debug why this recursive function overflows the stack for n > 1000.
[level=high thinking_tokens=843]
A: Python caps recursion depth near 1000 by default, so a function that recurses
once per n hits the limit and raises RecursionError. Rewrite it iteratively or
raise the limit with sys.setrecursionlimit, and add a base case guard...
The router spent zero thinking tokens on the trivia question and 843 on the debugging task. A naive build that left every call at the medium default would have burned hundreds of tokens reasoning about a question whose answer is a single word.
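To check whether the router is paying off across many requests, it helps to aggregate thinking tokens per level rather than eyeballing individual log lines. Here is a minimal sketch, assuming you feed it the level and the thoughts_token_count from each response. ThinkingUsageLog is a hypothetical class, and it treats a missing count as zero, which is an assumption about how you want to handle absent values.

```python
from collections import defaultdict


class ThinkingUsageLog:
    """Accumulate thinking-token counts per thinking level."""

    def __init__(self):
        self._counts = defaultdict(list)

    def record(self, level, thoughts_token_count):
        # Treat a missing count as zero so sums never fail on None.
        self._counts[level].append(thoughts_token_count or 0)

    def summary(self):
        """Return request count, total, and mean thinking tokens per level."""
        return {
            level: {
                "requests": len(values),
                "total": sum(values),
                "mean": sum(values) / len(values),
            }
            for level, values in self._counts.items()
        }


log = ThinkingUsageLog()
log.record("minimal", 0)    # counts from the example output above
log.record("high", 843)
log.record("high", None)    # a response that reported no thinking count
for level, stats in log.summary().items():
    print(level, stats)
```

Inside answer(), a single log.record(level, u.thoughts_token_count) call after each request would be enough to populate it.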
Common pitfalls and gotchas
1. Using thinking_budget out of habit. If you copy Gemini 2.5 code, the numeric budget still parses but is no longer recommended and can cause unexpected behavior on Gemini 3.x. Replace thinking_budget=7500 with thinking_level="medium". There is no numeric equivalent; the enum is the supported control.
2. Setting temperature to be "creative." On every Gemini 3.x model, temperature, top_p and top_k are no longer recommended and should be removed from requests. The model is tuned for the defaults. If you need deterministic, rule-bound output, write a system instruction with explicit rules instead of lowering temperature.
3. Expecting minimal to disable thinking. It does not. Gemini 3.5 Flash cannot fully turn thinking off, so even at minimal you may see a few thinking tokens on harder inputs. Budget for that if you are doing high-volume, latency-critical work.
4. Forgetting the default dropped to medium. Gemini 3 Flash Preview defaulted to high; Gemini 3.5 Flash defaults to medium. If your output quality dipped after migrating, you are now getting less reasoning by default. Set high explicitly where you relied on the old behavior, and re-test speed and cost.
5. Mismatched function responses returning empty answers. When you manage history yourself, a missing or wrong id or name on a FunctionResponse makes the model return an empty response with finish_reason: STOP rather than erroring loudly. Always echo the call's id and name, and return one response per call.
6. Stripping thought signatures from history. Thought preservation is on by default now, which means reasoning context carries across turns and improves multi-step tasks like iterative debugging. To benefit, pass the full, unmodified conversation history including thought signatures; the SDK does this for you, so do not hand-edit history to remove them. The trade-off is higher input token counts over long conversations, so for simple one-off queries where cost matters, start a fresh conversation instead of carrying old history along.
7. Reaching for Computer Use. Gemini 3.5 Flash does not support Computer Use at GA. If you have a browser- or desktop-control workload, keep it on Gemini 3 Flash Preview for now.
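Pitfalls 1 and 2 lend themselves to a mechanical cleanup when migrating old code. The sketch below works on a plain dict of settings whose keys mirror the parameter names used in this guide. It drops the discouraged sampling parameters and swaps thinking_budget for a thinking_level that you choose, since there is no numeric equivalent. migrate_config and DISCOURAGED are hypothetical names, and turning the resulting dict into a GenerateContentConfig is left to your own code.

```python
DISCOURAGED = ("temperature", "top_p", "top_k")


def migrate_config(old: dict, level: str = "medium") -> dict:
    """Return a copy of a Gemini 2.5-style settings dict adapted for 3.x.

    The input dict is not modified.
    """
    new = {k: v for k, v in old.items() if k not in DISCOURAGED}
    thinking = dict(new.get("thinking_config") or {})
    if "thinking_budget" in thinking:
        # No numeric equivalent exists, so the caller picks the level.
        del thinking["thinking_budget"]
        thinking.setdefault("thinking_level", level)
    if thinking:
        new["thinking_config"] = thinking
    return new


old = {
    "temperature": 0.2,
    "top_p": 0.9,
    "thinking_config": {"thinking_budget": 7500, "include_thoughts": True},
}
print(migrate_config(old))
```

An existing thinking_level in the old settings is kept as is, so the level argument only applies where a budget was actually replaced.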
Quick reference
| Task | Code |
|---|---|
| Create client | client = genai.Client() |
| Basic call | client.models.generate_content(model="gemini-3.5-flash", contents=prompt) |
| Set reasoning | thinking_config=types.ThinkingConfig(thinking_level="high") |
| Stream thoughts | generate_content_stream(...) with include_thoughts=True |
| Read thinking cost | response.usage_metadata.thoughts_token_count |
| Add a tool | config=types.GenerateContentConfig(tools=[my_function]) |
| Thinking levels | minimal, low, medium (default), high |
| Discouraged params (still accepted) | thinking_budget, temperature, top_p, top_k |
Next steps
- Wrap the triage agent in a small FastAPI endpoint and log thoughts_token_count per request so you can see your real cost distribution.
- Combine tools: Gemini 3.5 Flash can use Google Search, URL context, code execution and your custom functions in a single request.
- Try the new Interactions API, which Google recommends for fresh agentic projects and which preserves thoughts automatically across turns.
- Add structured outputs (JSON mode) on top of function calling when you need machine-parseable results.
The core mental model to keep: on Gemini 3.5 Flash you are not buying a fixed model, you are buying a dial. thinking_level is that dial, and pointing it at the right setting per request is what turns frontier-grade reasoning into a Flash-grade bill.
Sources: Google AI for Developers, "What's new in Gemini 3.5 Flash" and "Gemini thinking" (ai.google.dev, updated May 19, 2026).