Gemma 4 Tool Calling: Build a Local AI Agent

Kodetra Technologies · 10 min read · Intermediate

Summary

Run Google's open Gemma 4 locally with Ollama and wire up real function calling for an agent.

Google's Gemma 4 is the rare open model that is both small enough to run on your own laptop and capable enough to drive a real agent. The weights are Apache 2.0, the family ships in sizes from a 2B-effective edge model up to a 31B dense workstation model, and on Ollama alone it has already crossed 12.5 million downloads. What makes it interesting for builders is not the chat quality. It is the native function-calling support baked into the model and its chat template.

Function calling (also called tool calling) is the mechanism that turns a text generator into an agent. Instead of hallucinating today's weather or guessing the result of a calculation, the model emits a structured request: call get_weather with city="Tokyo". Your code runs that function, hands the result back, and the model writes a grounded answer. Everything stays on your machine: no API keys, no per-token billing, no data leaving your network.

This guide walks through the full loop end to end with Ollama and Python. You will define real tools, watch Gemma 4 decide when to call them, execute them, feed the results back, and let the model finish the answer. By the end you will have a small but genuine multi-tool agent running entirely offline, plus the gotchas that bite people the first time they wire this up.

Prerequisites

  • Ollama installed (v0.5+ recommended). Download from ollama.com/download.
  • Python 3.9+ and pip.
  • About 8 GB of free RAM for the 12B model, or 6 GB for the E4B edge model. No GPU required, though one helps.
  • Basic comfort with Python functions, type hints, and docstrings (Gemma 4 reads your docstrings to understand each tool).

We will use the gemma4 model tag. The default pull is the E4B edge model (~9.6 GB). For sharper tool decisions on a workstation, use gemma4:12b (256K context) or gemma4:31b.

Step 1: Pull Gemma 4 and confirm it runs

Start the Ollama service if it is not already running, then pull the model. The gemma4 family on Ollama is tagged tools, which means the runtime knows how to format and parse function calls for you.

# Pull a model. gemma4:12b is a strong default for tool use.
ollama pull gemma4:12b

# Smoke test
ollama run gemma4:12b "Reply with the single word: ready"

Now install the Python client. The official library will turn plain Python functions into the JSON schema Gemma 4 expects, and parse the model's tool calls back into objects.

pip install ollama

Step 2: Define tools as plain Python functions

A tool is just a function with a clear name, typed arguments, and a docstring. Gemma 4 never sees your function body. It only sees the name, the parameter types, and the docstring, so write the docstring as if it were API documentation for the model. Vague docstrings are the number-one cause of wrong or missing tool calls.

import json

def get_weather(city: str, unit: str = "celsius") -> str:
    """Get the current weather for a city.

    Args:
        city: The city name, e.g. "Tokyo" or "San Francisco".
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    # A real implementation would call a weather API. We stub it
    # so the example runs offline and deterministically.
    fake = {"tokyo": 21, "london": 14, "san francisco": 18}
    c = fake.get(city.lower(), 20)
    temp = c if unit == "celsius" else round(c * 9 / 5 + 32)
    return json.dumps({"city": city, "temp": temp, "unit": unit, "sky": "clear"})


def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result.

    Args:
        expression: A math expression using + - * / and parentheses,
            e.g. "(21 - 14) * 3".
    """
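    allowed = set("0123456789+-*/(). ")
    # "**" passes the character check but is not in the documented grammar,
    # and an input like "9**9**9**9" can hang the process, so reject it too.
    if not set(expression) <= allowed or "**" in expression:
        return json.dumps({"error": "unsupported characters"})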
    try:
        return json.dumps({"result": eval(expression, {"__builtins__": {}})})
    except Exception as e:
        return json.dumps({"error": str(e)})


# A registry maps tool names to implementations. Never call a tool
# by name without checking it against an allow-list like this one.
TOOLS = {"get_weather": get_weather, "calculator": calculator}

Notice the TOOLS registry. Gemma 4 returns a tool name as a string, and you must map that string to a real function yourself. Looking the name up in an explicit dictionary, rather than calling globals()[name], is what keeps a model-chosen string from ever reaching arbitrary code.
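
The calculator above filters characters and then hands the string to eval. A stricter alternative, offered here as a sketch rather than as part of the original tool set, is to parse the expression with Python's ast module and evaluate only an explicit set of node types. The names safe_eval and calculator_ast are hypothetical.

```python
import ast
import json
import operator

# Only these operators are evaluated; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str):
    """Evaluate + - * / and parentheses by walking the parsed syntax tree."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # type() check (not isinstance) so that True/False are rejected.
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))


def calculator_ast(expression: str) -> str:
    """Hypothetical drop-in for calculator(): same JSON shape, no eval."""
    try:
        return json.dumps({"result": safe_eval(expression)})
    except (ValueError, SyntaxError, ZeroDivisionError) as e:
        return json.dumps({"error": str(e)})


print(calculator_ast("(21 - 14) * 3"))
print(calculator_ast("2 ** 10"))
```

Because names, calls, and exponentiation never match a permitted node type, they are rejected before anything runs. Very deeply nested input could still exhaust the recursion limit, so a length cap on the expression remains a sensible extra guard.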

Step 3: Make the first tool call

Pass the Python functions straight to ollama.chat via the tools argument. The client introspects each function's signature and docstring and builds the JSON schema for you. When Gemma 4 decides a tool is needed, it returns empty content and a populated tool_calls list instead of a text answer.

import ollama

MODEL = "gemma4:12b"

messages = [
    {"role": "system", "content": "You are a helpful assistant. Use tools when they help."},
    {"role": "user", "content": "What's the weather in Tokyo right now?"},
]

response = ollama.chat(model=MODEL, messages=messages, tools=[get_weather, calculator])

print("content:", repr(response.message.content))
for call in response.message.tool_calls or []:
    print("tool:", call.function.name, "args:", call.function.arguments)

Example output:

content: ''
tool: get_weather args: {'city': 'Tokyo'}

Gemma 4 read the question, saw that get_weather matched, and produced a structured call with the argument filled in. It did not invent a temperature. That is the whole point: the model decides what to do, your code decides how.
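
To make the introspection step less opaque, here is a rough sketch of how a function's signature and docstring could be turned into a tool schema. It is an illustration only, not the Ollama client's actual implementation. The helper build_schema and its type mapping are hypothetical, and the output follows the common {"type": "function", "function": {...}} layout.

```python
import inspect

# Hypothetical mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def build_schema(fn) -> dict:
    """Derive a tool schema from a function's signature and docstring."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the model must supply it
    doc = inspect.getdoc(fn) or ""
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": doc.split("\n\n")[0],  # summary paragraph only
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def get_weather(city: str, unit: str = "celsius") -> str:
    """Get the current weather for a city.

    Args:
        city: The city name, e.g. "Tokyo".
        unit: Temperature unit, either "celsius" or "fahrenheit".
    """
    return "{}"


schema = build_schema(get_weather)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
```

The sketch shows why the function body is irrelevant to the model: only the name, the parameter list, and the docstring summary end up in the schema. A real client may also extract per-argument descriptions from the Args section, which this version skips.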

Step 4: Execute the tool and feed the result back

A tool call is only half a turn. You now run the function, append the result to the conversation with the special tool role, and call the model again so it can turn raw data into a sentence. The tool_name field tells Gemma 4 which call this result answers.

# 1. Append the model's tool-call turn to history.
messages.append(response.message)

# 2. Execute each requested call against the allow-list.
for call in response.message.tool_calls or []:
    fn = TOOLS.get(call.function.name)
    if fn is None:
        result = json.dumps({"error": f"unknown tool {call.function.name}"})
    else:
        result = fn(**call.function.arguments)
    messages.append({
        "role": "tool",
        "tool_name": call.function.name,
        "content": result,
    })

# 3. Ask the model again, now that it has the data.
final = ollama.chat(model=MODEL, messages=messages, tools=[get_weather, calculator])
print(final.message.content)

Example output:

It's currently 21°C and clear in Tokyo — pleasant weather right now.
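
The dispatch code in this step assumes the model always supplies valid arguments. A model can also invent a parameter name or omit a required one, which would raise TypeError and end the turn. The sketch below, built around a hypothetical helper named execute_tool_call, checks the arguments against the function signature and reports any failure as a JSON error string, so the model sees the problem and can try again.

```python
import inspect
import json


def execute_tool_call(registry: dict, name: str, arguments: dict) -> str:
    """Run one model-requested call and always return a string result."""
    fn = registry.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool {name}"})
    try:
        # bind() raises TypeError on missing or unexpected arguments
        # without executing the tool itself.
        inspect.signature(fn).bind(**arguments)
    except TypeError as e:
        return json.dumps({"error": f"bad arguments: {e}"})
    try:
        return fn(**arguments)
    except Exception as e:  # report tool failures instead of crashing the loop
        return json.dumps({"error": f"{type(e).__name__}: {e}"})


# Stand-in tool so the sketch runs on its own.
def echo(text: str) -> str:
    return json.dumps({"echo": text})


registry = {"echo": echo}
print(execute_tool_call(registry, "echo", {"text": "hi"}))
print(execute_tool_call(registry, "echo", {"txt": "hi"}))
print(execute_tool_call(registry, "rm_rf", {}))
```

Returning errors as tool results, rather than raising, keeps the conversation going and gives the model a chance to correct its own call on the next turn.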

Step 5: Wrap it in a real agent loop

One tool call rarely finishes the job. A good question might need a lookup and a calculation, or one tool's output might decide the next. The fix is a loop: keep calling the model, executing whatever tools it asks for, and feeding results back until it returns a plain text answer with no more tool calls. Always cap the loop so a confused model can't spin forever.

import json
import ollama

MODEL = "gemma4:12b"
TOOL_SPECS = [get_weather, calculator]

def run_agent(user_prompt: str, max_steps: int = 6) -> str:
    messages = [
        {"role": "system", "content": "You are a precise assistant. "
                                       "Use tools for facts and math instead of guessing."},
        {"role": "user", "content": user_prompt},
    ]
    for step in range(max_steps):
        resp = ollama.chat(model=MODEL, messages=messages, tools=TOOL_SPECS)
        msg = resp.message
        messages.append(msg)

        if not msg.tool_calls:
            return msg.content  # model is done

        for call in msg.tool_calls:
            fn = TOOLS.get(call.function.name)
            out = (fn(**call.function.arguments) if fn
                   else json.dumps({"error": "unknown tool"}))
            print(f"  [step {step}] {call.function.name}({call.function.arguments}) -> {out}")
            messages.append({"role": "tool",
                             "tool_name": call.function.name,
                             "content": out})
    return "Stopped: reached max steps without a final answer."


print(run_agent(
    "How much warmer is Tokyo than London right now, in Celsius?"
))

Example run (tool trace plus final answer):

  [step 0] get_weather({'city': 'Tokyo'}) -> {"city": "Tokyo", "temp": 21, "unit": "celsius", "sky": "clear"}
  [step 0] get_weather({'city': 'London'}) -> {"city": "London", "temp": 14, "unit": "celsius", "sky": "clear"}
  [step 1] calculator({'expression': '21 - 14'}) -> {"result": 7}
Tokyo is 7°C warmer than London right now (21°C versus 14°C).

This is a genuine multi-step agent. Gemma 4 fetched two cities (the model can request several tools in one turn), then on the next turn chained the numbers into the calculator, then wrote the answer. You never hard-coded that plan; the model assembled it from your tool descriptions.
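
Since the loop's control flow is independent of the model, it can be exercised without Ollama running. The sketch below repeats the same loop with the chat function passed in as a parameter, then drives it with canned replies. Everything here (run_agent_with, fake_reply, scripted_chat, and the add tool) is hypothetical scaffolding, and the fake replies only imitate the attributes this guide reads from a response.

```python
import json
from types import SimpleNamespace


def run_agent_with(chat, tools: dict, prompt: str, max_steps: int = 6) -> str:
    """Same loop as above, but the chat function is injected."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        msg = chat(messages).message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for call in msg.tool_calls:
            fn = tools.get(call.function.name)
            out = (fn(**call.function.arguments) if fn
                   else json.dumps({"error": "unknown tool"}))
            messages.append({"role": "tool",
                             "tool_name": call.function.name,
                             "content": out})
    return "Stopped: reached max steps without a final answer."


def fake_reply(content="", calls=()):
    """Build an object shaped like the responses used in this guide."""
    tool_calls = [SimpleNamespace(function=SimpleNamespace(name=n, arguments=a))
                  for n, a in calls]
    return SimpleNamespace(message=SimpleNamespace(content=content,
                                                   tool_calls=tool_calls or None))


def scripted_chat(replies):
    """Return a chat function that plays back canned replies in order."""
    it = iter(replies)
    return lambda messages: next(it)


def add(a: int, b: int) -> str:
    return json.dumps({"result": a + b})


def always_calls(messages):
    """A 'confused model' that never stops requesting tools."""
    return fake_reply(calls=[("add", {"a": 1, "b": 1})])


answer = run_agent_with(
    scripted_chat([fake_reply(calls=[("add", {"a": 2, "b": 3})]),
                   fake_reply(content="2 + 3 = 5")]),
    {"add": add},
    "What is 2 + 3?",
)
print(answer)
print(run_agent_with(always_calls, {"add": add}, "loop forever", max_steps=3))
```

The second call demonstrates the loop cap: a model that keeps asking for tools is cut off after max_steps and the caller gets the explicit "Stopped" message instead of a hang.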

Step 6: Turn on thinking for harder decisions

Gemma 4 has a configurable thinking mode that runs an internal reasoning pass before it acts. For ambiguous prompts ("is it good for running in Seoul?" really means "check the weather first"), thinking can improve which tool gets picked and how arguments are filled, at the cost of extra tokens and latency. With Ollama you enable it per request with the think flag.

resp = ollama.chat(
    model=MODEL,
    messages=messages,
    tools=TOOL_SPECS,
    think=True,          # run an internal reasoning pass first
    options={"temperature": 1.0, "top_p": 0.95, "top_k": 64},
)

# The reasoning is returned separately from the answer.
print("THINKING:", resp.message.thinking)
print("ANSWER  :", resp.message.content)

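Those sampling values (temperature=1.0, top_p=0.95, top_k=64) are Google's recommended defaults for Gemma 4. One important rule from the model card: do not store thinking text in conversation history. When you append a turn for the next round, keep the final content and any tool_calls, but never the thinking field, or accuracy degrades on later turns. Note that appending resp.message unchanged, as the Step 5 loop does, can carry the thinking field along whenever think=True is set, so strip it before saving the turn.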
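
One way to follow the no-thinking-in-history rule is to copy only the fields you want into a plain dict before appending. The sketch below uses a hypothetical helper, to_history, and a stand-in message object. The dict layout for tool_calls is an assumption about what the client accepts, modelled on the dict-style messages used elsewhere in this guide.

```python
from types import SimpleNamespace


def to_history(message) -> dict:
    """Copy role, content, and tool calls into a plain dict; drop thinking."""
    entry = {"role": message.role, "content": message.content or ""}
    if getattr(message, "tool_calls", None):
        entry["tool_calls"] = [
            {"function": {"name": c.function.name,
                          "arguments": dict(c.function.arguments)}}
            for c in message.tool_calls
        ]
    return entry


# Stand-in for resp.message so the sketch runs without a model.
msg = SimpleNamespace(
    role="assistant",
    content="",
    thinking="The user wants weather, so call get_weather.",
    tool_calls=[SimpleNamespace(function=SimpleNamespace(
        name="get_weather", arguments={"city": "Seoul"}))],
)
entry = to_history(msg)
print(entry)
```

In an agent loop, messages.append(to_history(resp.message)) would then replace messages.append(resp.message), keeping the tool-call turn intact while leaving the reasoning text behind.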

Worked example: a tiny offline research assistant

Here is everything tied together into one runnable file. It exposes three tools — weather, a calculator, and a stub "knowledge" lookup — and answers a compound question that needs more than one of them. Save it as agent.py and run python agent.py with Ollama running.

import json
import ollama

MODEL = "gemma4:12b"

def get_weather(city: str, unit: str = "celsius") -> str:
    """Get current weather for a city. Args: city, unit (celsius|fahrenheit)."""
    data = {"tokyo": 21, "paris": 16, "cairo": 33}
    c = data.get(city.lower(), 20)
    t = c if unit == "celsius" else round(c * 9 / 5 + 32)
    return json.dumps({"city": city, "temp": t, "unit": unit})

def calculator(expression: str) -> str:
    """Evaluate basic arithmetic. Args: expression like '(33-16)/2'."""
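    # Also reject "**": it passes the character check but can hang on huge powers.
    if not set(expression) <= set("0123456789+-*/(). ") or "**" in expression:
        return json.dumps({"error": "bad input"})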
    try:
        return json.dumps({"result": eval(expression, {"__builtins__": {}})})
    except Exception as e:
        return json.dumps({"error": str(e)})

def city_facts(city: str) -> str:
    """Return a one-line fact about a city. Args: city name."""
    facts = {"cairo": "Cairo sits on the Nile and is Africa's largest city.",
             "paris": "Paris is divided into 20 arrondissements."}
    return json.dumps({"fact": facts.get(city.lower(), "No fact on file.")})

TOOLS = {"get_weather": get_weather, "calculator": calculator, "city_facts": city_facts}
SPECS = [get_weather, calculator, city_facts]

def run_agent(prompt, max_steps=6):
    messages = [
        {"role": "system", "content": "Use tools for facts and math. Never guess numbers."},
        {"role": "user", "content": prompt},
    ]
    for _ in range(max_steps):
        resp = ollama.chat(model=MODEL, messages=messages, tools=SPECS)
        messages.append(resp.message)
        if not resp.message.tool_calls:
            return resp.message.content
        for call in resp.message.tool_calls:
            fn = TOOLS.get(call.function.name)
            out = fn(**call.function.arguments) if fn else json.dumps({"error": "unknown"})
            messages.append({"role": "tool", "tool_name": call.function.name, "content": out})
    return "No final answer (hit max steps)."

if __name__ == "__main__":
    print(run_agent(
        "Tell me one fact about Cairo, and how much hotter it is than Paris in Fahrenheit."
    ))

Expected answer:

Cairo sits on the Nile and is Africa's largest city. It's currently 91°F there versus 61°F in Paris, so Cairo is about 30°F hotter right now.

Three tools, one compound question, fully offline. Swap the stub functions for real APIs — a database query, an internal search endpoint, a shell command — and you have the skeleton of a production agent.
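
The numbers in the expected answer can be checked by hand against the stub data. This short sketch repeats the conversion and rounding that get_weather applies; c_to_f is a hypothetical helper name.

```python
def c_to_f(celsius: float) -> int:
    """Same conversion and rounding as get_weather in agent.py."""
    return round(celsius * 9 / 5 + 32)


cairo_f = c_to_f(33)  # stub value for Cairo in Celsius
paris_f = c_to_f(16)  # stub value for Paris in Celsius
print(cairo_f, paris_f, cairo_f - paris_f)  # 91 61 30
```

The exact wording of the model's answer will vary from run to run. If the model instead converts the Celsius difference directly (17 × 9/5 = 30.6), it may report roughly 31°F, which is why the expected answer says "about".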

Common pitfalls and how to avoid them

  • Weak docstrings. The model picks tools and fills arguments from your docstrings alone. "Gets data" is useless; spell out what each argument means and give an example value. This single change fixes most missing or wrong calls.
  • Calling tools by name unsafely. Never do globals()[name](**args) on a model-chosen string. Use an explicit {name: function} allow-list, and validate arguments before executing anything with side effects.
  • Forgetting to append the result. If you run the tool but don't add a {"role": "tool", ...} message before the next chat call, the model never sees the answer and will loop or hallucinate. Append both the assistant tool-call turn and the tool result.
  • Storing thinking text in history. Keep only the final content from each model turn. Carrying old thinking blocks forward degrades multi-turn tool accuracy.
  • No loop cap. A confused model can request tools indefinitely. Always set a max_steps ceiling and return a clear message when you hit it.
  • Picking a model that's too small. The E2B/E4B edge models are great for latency but weaker at tool selection (Tau2 agentic scores climb sharply at 12B/26B/31B). If tool calls look random, move up a size before blaming your prompt.
  • Non-JSON tool output. Return strings (ideally JSON) from tools, not raw Python objects. The conversation is text; serialize with json.dumps so the model reads clean, unambiguous data.
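
For the last pitfall, a small wrapper can guarantee that whatever a tool returns reaches the model as a string. This is a minimal sketch with a hypothetical helper, to_tool_content. It assumes that falling back to str() for values JSON cannot represent (dates, sets, custom objects) is acceptable for your tools.

```python
import json
from datetime import date


def to_tool_content(value) -> str:
    """Turn any tool return value into a string for a tool message."""
    if isinstance(value, str):
        return value  # already text; pass through unchanged
    # default=str converts anything json cannot encode to its string form.
    return json.dumps(value, default=str)


print(to_tool_content({"temp": 21, "sky": "clear"}))
print(to_tool_content({"checked_on": date(2024, 1, 2)}))
```

Wrapping each tool result this way before building the {"role": "tool", ...} message means a tool author can return a dict or list without breaking the conversation.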

Quick reference

| Piece | What it is | Key detail |
| --- | --- | --- |
| tools=[fn, ...] | List of Python functions or JSON schemas | Client auto-builds schema from type hints + docstring |
| response.message.tool_calls | List of requested calls | Empty/None means the model gave a final answer |
| call.function.name / .arguments | Tool name and parsed args | Look the name up in an allow-list before executing |
| {'role': 'tool', ...} | How you return a result | Include tool_name and a string (JSON) content |
| think=True | Enable internal reasoning | Better tool decisions; don't save thinking to history |
| Sampling | Recommended defaults | temperature 1.0, top_p 0.95, top_k 64 |
| Model sizes | e2b / e4b / 12b / 26b / 31b | Use 12B+ for reliable agentic tool use |

Next steps

You now have a working local tool-calling agent on a fully open model. From here you can: replace the stub tools with real database, HTTP, or shell functions; expose the same tools over the Model Context Protocol so any MCP client can use them; add streaming so token output appears live; or fine-tune Gemma 4 with LoRA on your own tool-use traces to sharpen domain-specific calls. Because the weights are Apache 2.0 and everything runs on your hardware, you can ship this in environments where sending data to a hosted API was never an option.

For the canonical reference, see Google's Function calling with Gemma 4 guide and the Ollama gemma4 model card, both linked in the sources below.

Sources: Google AI for Developers — Function calling with Gemma 4 (ai.google.dev); Ollama gemma4 model card (ollama.com/library/gemma4); Google Keyword blog — Gemma 4 announcement.
