Gemini Code Execution: Build a Self-Fixing Analyst — ContentBuffer guide

Kodetra Technologies · 10 min read · Intermediate

Summary

Use Gemini 3.5 Flash code execution to write Python that fixes itself in a loop.

Last week at Google I/O 2026, the gemini-3.5-flash model went generally available as the new flagship for agentic and coding work. Buried in the release was a feature that quietly changes how you build AI data tools: the code execution tool can now write Python, run it, read the error, and rewrite it. Up to five times. Per request. With matplotlib included.

That write-run-fix cycle is doing a lot of work. It means you can stop building "LLM writes code, you run it, you paste back the traceback" loops by hand. The model owns the loop. You ship the prompt.

This guide walks through building a self-correcting data analyst with the Gemini code execution tool. You will load a real CSV, ask a question, and watch the model write Python, hit a column-name error, fix it, and finish the answer. Every code block here is runnable and verified against the official Gemini API docs as of May 27, 2026.

What you will build

A Python script that takes a question and a CSV file, gives them to Gemini, and prints (a) the model's reasoning, (b) every code attempt it ran, (c) the execution output, and (d) any matplotlib chart it generated. The model auto-retries on errors. You don't write the loop.

By the end you'll understand the three response part types Gemini returns, the 30-second runtime ceiling, the five-retry limit, and how to combine code execution with Google Search grounding so the model can fetch fresh data and compute on it in one call.

Prerequisites

  • Python 3.10+ installed.
  • A free Gemini API key from aistudio.google.com/apikey (free tier: 15 RPM, 1500 requests/day, no card required).
  • Basic comfort with the terminal and pip.

Install the official SDK (the package is google-genai, not the older google-generativeai, which is being deprecated):

pip install google-genai pandas matplotlib

Export your key so the SDK picks it up automatically:

export GEMINI_API_KEY="AIza...your-key-here..."

Step 1 - Your first code-execution call

Start with the smallest possible example: ask Gemini to compute something that's annoying to do without code, and let it write the code itself. Save this as hello_exec.py:

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from env

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=(
        "What is the sum of the first 50 prime numbers? "
        "Generate and run code for the calculation, and make sure you get all 50."
    ),
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution)]
    ),
)

for part in response.candidates[0].content.parts:
    if part.text is not None:
        print("TEXT:", part.text)
    if part.executable_code is not None:
        print("CODE:\n", part.executable_code.code)
    if part.code_execution_result is not None:
        print("OUTPUT:\n", part.code_execution_result.output)

Run it:

python hello_exec.py

Real output (truncated for readability):

CODE:
 def is_prime(n):
     if n <= 1: return False
     if n <= 3: return True
     if n % 2 == 0 or n % 3 == 0: return False
     i = 5
     while i * i <= n:
         if n % i == 0 or n % (i + 2) == 0: return False
         i += 6
     return True
 primes, num = [], 2
 while len(primes) < 50:
     if is_prime(num): primes.append(num)
     num += 1
 print(sum(primes))
OUTPUT:
 5117
TEXT: The sum of the first 50 prime numbers is 5117.

The model wrote Python, the Gemini backend executed it in a sandboxed Python environment, and the final answer (5117) is grounded in real execution rather than a hallucinated arithmetic guess. That distinction matters as soon as the numbers get bigger than what a transformer can do reliably in its head.


Step 2 - The three response part types

Every code-execution response interleaves three kinds of content parts. Understanding which is which is the only "trick" to working with this API:

Part | When it appears | What you do with it
--- | --- | ---
part.text | Plain reasoning and the final answer. | Show to the user.
part.executable_code | Each Python snippet the model decided to run. | Log, audit, or display as code.
part.code_execution_result | Stdout + an outcome like OUTCOME_OK or OUTCOME_FAILED. | Inspect, especially on failure.

On retries, Gemini emits a new executable_code part for each attempt, so iterating response.candidates[0].content.parts in order gives you a complete trace of what the model tried.
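One way to turn that ordering into a trace is to classify each part as you walk the list. The sketch below is a minimal illustration, not part of the SDK: trace_parts is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real response parts so the snippet runs without an API key. It assumes, as the scripts in this guide do, that the unused fields on a part are None.

```python
from types import SimpleNamespace


def trace_parts(parts):
    """Return (kind, payload) tuples in the order the parts were emitted."""
    trace = []
    for part in parts:
        # A part normally carries one of these fields; the others are None.
        if getattr(part, "text", None) is not None:
            trace.append(("text", part.text))
        if getattr(part, "executable_code", None) is not None:
            trace.append(("code", part.executable_code.code))
        if getattr(part, "code_execution_result", None) is not None:
            result = part.code_execution_result
            trace.append(("result", (result.outcome, result.output)))
    return trace


# Stand-in parts shaped like a code-execution response (hypothetical data).
fake_parts = [
    SimpleNamespace(text=None,
                    executable_code=SimpleNamespace(code="print(2 + 3)"),
                    code_execution_result=None),
    SimpleNamespace(text=None, executable_code=None,
                    code_execution_result=SimpleNamespace(
                        outcome="OUTCOME_OK", output="5\n")),
    SimpleNamespace(text="The answer is 5.", executable_code=None,
                    code_execution_result=None),
]

trace = trace_parts(fake_parts)
for kind, payload in trace:
    print(kind, "->", payload)
```

With a real response you would pass response.candidates[0].content.parts instead of fake_parts.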


Step 3 - Hand it a real CSV

Code execution gets interesting the moment you give the model data. The sandbox comes with pandas, numpy, matplotlib, scikit-learn, scipy, sympy, openpyxl, and python-docx pre-installed, among others (a longer list appears under Common pitfalls). You cannot pip-install your own packages, so plan around the bundled libraries.

Create a tiny sample dataset and save it as sales.csv:

date,region,product,units,revenue
2026-05-01,US,A,12,348.00
2026-05-01,EU,B,7,210.00
2026-05-02,US,A,9,261.00
2026-05-02,US,B,15,450.00
2026-05-03,EU,A,4,116.00
2026-05-03,APAC,B,22,660.00
2026-05-04,US,B,8,240.00
2026-05-04,APAC,A,11,319.00

Now feed both the CSV and a question to the model. The CSV goes in as an inline_data part with mime type text/csv:

from google import genai
from google.genai import types
from pathlib import Path

client = genai.Client()
csv_bytes = Path("sales.csv").read_bytes()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=[
        types.Part.from_bytes(data=csv_bytes, mime_type="text/csv"),
        "Using the attached CSV, which region had the highest total revenue, "
        "and what was the average units per order in that region? "
        "Compute it with pandas and show your code."
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution)]
    ),
)

for part in response.candidates[0].content.parts:
    if part.text: print("TEXT:", part.text)
    if part.executable_code: print("CODE:\n", part.executable_code.code)
    if part.code_execution_result:
        print("OUTPUT:", part.code_execution_result.output)
        print("OUTCOME:", part.code_execution_result.outcome)

Sample output:

CODE:
 import pandas as pd, io
 df = pd.read_csv(io.StringIO(open('sales.csv').read()))
 by_region = df.groupby('region')['revenue'].sum().sort_values(ascending=False)
 top = by_region.index[0]
 avg_units = df[df['region']==top]['units'].mean()
 print(by_region.to_string()); print('top:', top, 'avg_units:', avg_units)
OUTPUT:
 region
 US      1299.00
 APAC     979.00
 EU       326.00
 top: US avg_units: 11.0
 OUTCOME: OUTCOME_OK
TEXT: The US region had the highest total revenue ($1,299.00). The average units per order in the US was 11.0.
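If you want to cross-check the model's numbers without calling the API, the same aggregation is easy to reproduce locally. This is a minimal sketch using only the standard library on the sales.csv rows shown above; region_summary is a hypothetical helper name, not something the SDK provides.

```python
import csv
import io
from collections import defaultdict

# The same rows as sales.csv above, inlined so the snippet is self-contained.
SALES_CSV = """date,region,product,units,revenue
2026-05-01,US,A,12,348.00
2026-05-01,EU,B,7,210.00
2026-05-02,US,A,9,261.00
2026-05-02,US,B,15,450.00
2026-05-03,EU,A,4,116.00
2026-05-03,APAC,B,22,660.00
2026-05-04,US,B,8,240.00
2026-05-04,APAC,A,11,319.00
"""


def region_summary(csv_text):
    """Return (revenue per region, top region, average units in top region)."""
    revenue = defaultdict(float)
    units = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        revenue[row["region"]] += float(row["revenue"])
        units[row["region"]].append(int(row["units"]))
    top = max(revenue, key=revenue.get)
    avg_units = sum(units[top]) / len(units[top])
    return dict(revenue), top, avg_units


totals, top_region, avg_units = region_summary(SALES_CSV)
print(totals)
print("top:", top_region, "avg_units:", avg_units)
```

A check like this is cheap insurance when you first wire the tool into a pipeline: if the local figures and the model's figures disagree, the execution trace tells you which step diverged.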

Step 4 - Watch it self-correct

This is the part that's hard to appreciate until you see it happen. Ask the model a question that's just ambiguous enough that its first guess at the code will fail. The model will read the traceback, rewrite, and try again, up to five times per response, all inside one API call.

Try this prompt against the same CSV:

prompt = (
    "From the attached sales data, compute the median revenue per "
    "product-region combo, then show the top 3. Use the column "
    "'product_region' if present, otherwise build it yourself."
)

The model has no idea whether product_region exists. On a typical run you'll see something like:

CODE (attempt 1):
 df = pd.read_csv('sales.csv')
 print(df['product_region'].head())
OUTPUT (attempt 1):
 KeyError: 'product_region'
 OUTCOME: OUTCOME_FAILED

CODE (attempt 2):
 df = pd.read_csv('sales.csv')
 df['product_region'] = df['product'] + '_' + df['region']
 medians = df.groupby('product_region')['revenue'].median()
 print(medians.sort_values(ascending=False).head(3))
OUTPUT (attempt 2):
 product_region
 B_APAC    660.0
 B_US      345.0
 A_APAC    319.0
 OUTCOME: OUTCOME_OK
TEXT: After building product_region, the top 3 medians are B_APAC ($660), B_US ($345), A_APAC ($319).

Two things to notice. First, you wrote zero retry code. The retry happened server-side, inside the same generate_content call, and you only paid for the tokens. Second, the failed attempt is still in the response parts, so you have a full audit trail of what the model tried and why it changed course. That's a debugging dream compared to a black-box agent.
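To make that audit trail easier to scan, you can pair each code attempt with the result that followed it. The sketch below is one possible approach under stated assumptions: pair_attempts and fake_part are hypothetical helpers, the parts are SimpleNamespace stand-ins rather than real SDK objects, and outcomes are compared by name so the same logic should work whether the outcome arrives as a plain string or as an enum member.

```python
from types import SimpleNamespace


def fake_part(code=None, outcome=None, output=None, text=None):
    """Build a stand-in for a response part (hypothetical test data only)."""
    return SimpleNamespace(
        text=text,
        executable_code=SimpleNamespace(code=code) if code else None,
        code_execution_result=(
            SimpleNamespace(outcome=outcome, output=output) if outcome else None
        ),
    )


def pair_attempts(parts):
    """Group each executable_code part with the result that follows it."""
    attempts = []
    for part in parts:
        code = getattr(part, "executable_code", None)
        result = getattr(part, "code_execution_result", None)
        if code is not None:
            attempts.append({"code": code.code, "outcome": None, "output": None})
        if result is not None and attempts:
            # Enum members expose .name; plain strings fall through unchanged.
            attempts[-1]["outcome"] = getattr(result.outcome, "name", result.outcome)
            attempts[-1]["output"] = result.output
    return attempts


parts = [
    fake_part(code="print(df['product_region'].head())"),
    fake_part(outcome="OUTCOME_FAILED", output="KeyError: 'product_region'"),
    fake_part(code="df['product_region'] = df['product'] + '_' + df['region']"),
    fake_part(outcome="OUTCOME_OK", output="(top 3 medians)"),
    fake_part(text="After building product_region, here are the top 3."),
]

attempts = pair_attempts(parts)
failed = [a for a in attempts if a["outcome"] != "OUTCOME_OK"]
print(f"{len(attempts)} attempts, {len(failed)} failed")
```

Logging the failed attempts alongside the successful one is often enough to tell whether a prompt needs tightening.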


Step 5 - Get charts back as inline images

Matplotlib is the only graphing library supported in the sandbox, and any figures the model generates come back as inline images on part.inline_data. Save them with a couple of lines of glue code:

from google import genai
from google.genai import types
from pathlib import Path

client = genai.Client()
csv_bytes = Path("sales.csv").read_bytes()

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=[
        types.Part.from_bytes(data=csv_bytes, mime_type="text/csv"),
        "Plot total revenue by region as a bar chart with labels. "
        "Use matplotlib. Return the chart."
    ],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution)]
    ),
)

img_count = 0
for part in response.candidates[0].content.parts:
    if part.text: print(part.text)
    if part.inline_data and part.inline_data.mime_type.startswith("image/"):
        img_count += 1
        ext = part.inline_data.mime_type.split("/")[-1]
        Path(f"chart_{img_count}.{ext}").write_bytes(part.inline_data.data)
        print(f"saved chart_{img_count}.{ext}")

You'll get back a PNG with bars for US, APAC, and EU and proper axis labels. The model picked the chart type, the color, and the labels - you just asked for "a bar chart with labels."
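The split-on-slash trick for the file extension works for image/png, but a lookup through the standard library's mimetypes module is a little more forgiving if another image type ever shows up. The sketch below is an optional variation, not the guide's required approach: save_images is a hypothetical helper, and the part objects are stand-ins so it runs without an API call.

```python
import mimetypes
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_images(parts, out_dir):
    """Write every image part to out_dir and return the saved paths."""
    saved = []
    for part in parts:
        blob = getattr(part, "inline_data", None)
        if blob is None or not (blob.mime_type or "").startswith("image/"):
            continue
        # Fall back to .bin if the MIME type is not in the stdlib table.
        ext = mimetypes.guess_extension(blob.mime_type) or ".bin"
        path = Path(out_dir) / f"chart_{len(saved) + 1}{ext}"
        path.write_bytes(blob.data)
        saved.append(path)
    return saved


# Stand-in parts: one text part and one fake "PNG" (signature bytes only).
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
fake_parts = [
    SimpleNamespace(text="Here is the chart.", inline_data=None),
    SimpleNamespace(text=None, inline_data=SimpleNamespace(
        mime_type="image/png", data=PNG_SIGNATURE)),
]

with tempfile.TemporaryDirectory() as tmp:
    for path in save_images(fake_parts, tmp):
        print("saved", path.name)
```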


Step 6 - Combine code execution with Google Search

Gemini 3 models can mix built-in tools in a single request. The killer combo is code_execution + google_search: the model can grab fresh facts off the web and then compute on them in the same loop.

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=(
        "Find the closing price of NVDA and AMD on the most recent "
        "trading day, then compute the ratio NVDA/AMD and tell me whether "
        "it's higher or lower than 2.5. Show your work in code."
    ),
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(google_search=types.GoogleSearch()),
            types.Tool(code_execution=types.ToolCodeExecution),
        ]
    ),
)

Gemini will issue a grounded search to pull the latest closes, then write Python that computes the ratio against the literal numbers it found. You get citations from the search part and a verified arithmetic answer from the code part.


Worked example: a self-fixing CSV explainer

Putting it together: a tiny CLI that takes a CSV path and a question and prints both the reasoning trace and the final answer. Save as analyst.py:

import sys
from pathlib import Path
from google import genai
from google.genai import types

def analyze(csv_path: str, question: str) -> None:
    client = genai.Client()
    csv_bytes = Path(csv_path).read_bytes()

    response = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=[
            types.Part.from_bytes(data=csv_bytes, mime_type="text/csv"),
            question + " Use pandas. Always print your final answer.",
        ],
        config=types.GenerateContentConfig(
            tools=[types.Tool(code_execution=types.ToolCodeExecution)],
        ),
    )

    step = 0
    for part in response.candidates[0].content.parts:
        if part.executable_code:
            step += 1
            print(f"\n--- attempt {step} ---")
            print(part.executable_code.code)
        if part.code_execution_result:
            outcome = part.code_execution_result.outcome
            print(f"[{outcome}]")
            print(part.code_execution_result.output)
        if part.text:
            print("\nANSWER:", part.text.strip())

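if __name__ == "__main__":
    # Require a CSV path and at least one word of question text.
    if len(sys.argv) < 3:
        sys.exit('usage: python analyst.py <csv_path> "<question>"')
    analyze(sys.argv[1], " ".join(sys.argv[2:]))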

Use it like this:

python analyst.py sales.csv "Which product has the lowest unit count in APAC, and how does that compare to the global median?"

The script prints every attempt, the outcome of each, and the model's natural-language conclusion. If you wrap this in a Flask handler or a Slack command, you have a Gemini Spark-style background analyst that anyone on your team can query, without you maintaining a brittle prompt-then-exec loop.


Common pitfalls

  • Wrong SDK. Install google-genai, not google-generativeai. The older package is being phased out and the import path (from google import genai) is different.
  • Wrong model ID. The stable GA string is gemini-3.5-flash. The preview-era gemini-3-flash-preview still appears in older blog posts but you should use the stable ID for production.
  • 30-second sandbox ceiling. The code environment kills the run after 30 seconds. Big merges, long matplotlib renders, and ML training will time out. Use it for analysis, not training.
  • No outbound network. The sandbox has no internet. You cannot requests.get() a URL from inside the executed code. Use the google_search tool or pre-fetch the data and pass it in as bytes.
  • Five-retry cap. Gemini will only auto-retry up to five times per response. If your prompt is so vague that the model burns through five attempts, you'll get back partial work and a failed outcome on the last attempt. Inspect those parts in your code rather than assuming success.
  • You can't pip install. Only the bundled libraries work. The full list includes pandas, numpy, matplotlib, scipy, scikit-learn, sympy, openpyxl, PyPDF2, python-docx, python-pptx, opencv-python, tensorflow, seaborn, and a few others - if you need something exotic, do that work outside the sandbox.
  • Intermediate tokens are billable. The generated code and execution results count as intermediate tokens (billed as input on the next turn). Long iteration loops will be more expensive than a single-shot call, even though they're cheaper than a manual agent loop in dev time.
  • Chat needs id + thought_signature. If you're constructing multi-turn history by hand (or via REST), you must pass back the id and thought_signature fields on each part. The Python and JS SDKs handle this automatically. Most bugs in this space come from people who tried to roll their own history management.
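For the five-retry pitfall in particular, a small guard can stop a failed run from being treated as an answer. The following is a minimal sketch, assuming the last code_execution_result in the parts list reflects the final attempt; last_outcome and require_success are hypothetical helpers, and the parts here are stand-ins rather than real SDK objects.

```python
from types import SimpleNamespace


def last_outcome(parts):
    """Return the outcome name of the final code execution result, or None."""
    outcome = None
    for part in parts:
        result = getattr(part, "code_execution_result", None)
        if result is not None:
            # Enum members expose .name; plain strings fall through unchanged.
            outcome = getattr(result.outcome, "name", result.outcome)
    return outcome


def require_success(parts):
    """Raise if code ran but the final attempt did not finish with OUTCOME_OK."""
    outcome = last_outcome(parts)
    if outcome is not None and outcome != "OUTCOME_OK":
        raise RuntimeError(f"code execution ended with {outcome}")
    return outcome


def result_part(outcome):
    """Build a stand-in result part (hypothetical test data only)."""
    return SimpleNamespace(
        code_execution_result=SimpleNamespace(outcome=outcome, output=""))


recovered = [result_part("OUTCOME_FAILED"), result_part("OUTCOME_OK")]
exhausted = [result_part("OUTCOME_FAILED")] * 5

print(require_success(recovered))
try:
    require_success(exhausted)
except RuntimeError as err:
    print("guard tripped:", err)
```

In analyst.py you could call require_success(response.candidates[0].content.parts) before printing the answer, so a run that exhausted its retries fails loudly instead of quietly.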

Quick reference

Topic | Value
--- | ---
Stable model ID | gemini-3.5-flash
Python SDK | google-genai
Enable tool | types.Tool(code_execution=types.ToolCodeExecution)
Max retries on error | 5 per response
Max execution time | 30 seconds per snippet
Max CSV input | ~2 MB / 1M tokens
Supported chart lib | matplotlib (only)
Output image part | part.inline_data (PNG bytes)
Free-tier limits | 15 RPM, 1500 req/day
Pricing (May 2026) | From $1.50 / M input tokens
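Since the CSV input limit is approximate, it can be worth failing fast on oversized files before spending a request. This is a rough sketch only: check_csv_size is a hypothetical helper, and the 2 MB threshold is an assumption taken from the "~2 MB" figure in the table above, so treat it as a tunable default rather than an exact API limit.

```python
import tempfile
from pathlib import Path

# Assumption: mirrors the "~2 MB" figure from the quick reference table.
MAX_CSV_BYTES = 2 * 1024 * 1024


def check_csv_size(path, limit=MAX_CSV_BYTES):
    """Return the file size in bytes, or raise ValueError if it exceeds limit."""
    size = Path(path).stat().st_size
    if size > limit:
        raise ValueError(
            f"{path} is {size} bytes, over the {limit}-byte limit; "
            "consider sampling or aggregating it first"
        )
    return size


# Demo with a throwaway file so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "small.csv"
    small.write_text("region,revenue\nUS,348.00\n")
    print("ok:", check_csv_size(small), "bytes")
```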

Where to take this next

  • Wrap analyze() in a FastAPI endpoint and accept CSV uploads - you've built a one-file data-analyst microservice.
  • Combine the code execution tool with function_calling so the model can also hit your internal APIs (revenue, inventory, user lookups) and then compute on the results.
  • Pipe it into a Slack or Discord bot using that platform's bot API - the response trace gives you natural "show your work" output that's great for technical teams.
  • If you need longer-running work than the 30-second sandbox allows, look at the Antigravity Agent (also new at I/O 2026) which runs in a managed Linux VM with no 30-second ceiling.

The bigger shift the code execution tool represents is that "agents" don't have to mean LangChain, three frameworks, and a vector database. For a huge class of "answer a numerical question over my data" problems, a single generate_content call with code execution turned on is the entire agent. That's worth keeping in mind the next time you reach for something heavier.
