Kimi K2 Thinking: Agentic Tool Loops in Python — ContentBuffer guide

Kodetra Technologies · 8 min read · Intermediate

Summary

Build a multi-step tool-calling agent on Moonshot's open-weight Kimi K2.6 model.

Moonshot's Kimi K2.6 shipped as open weights in late April and did something open models had not done before: it tied or beat the closed frontier on real coding work. On SWE-Bench Pro it scored 58.6, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (53.4), at a list price of $0.60 / $2.50 per million input/output tokens. That price-to-capability ratio is why it spread across developer communities within hours.

But the screenshots people keep sharing are not the benchmark bars. They are the worklogs: a single K2.6 agent that ran for twelve hours straight, made over four thousand tool calls, and optimized a model's inference loop from 15 to 193 tokens per second. That kind of endurance comes from one design decision you can use directly: K2.6 was trained to interleave its private chain-of-thought with tool calls, and the API hands that reasoning back to you in a field called reasoning_content. Feed it back in, and the model keeps its train of thought across dozens of steps instead of starting cold each turn.

This guide builds a working tool-calling agent on Kimi K2.6 with nothing but the openai Python package. Moonshot exposes an OpenAI-compatible endpoint, so there is no new SDK to learn. You will wire two real local tools, run a loop where the model decides when to call them, read its reasoning, and learn the three configuration rules that separate a reliable thinking agent from one that silently truncates.

What you'll need

  • Python 3.9+ and pip install openai (v1.x).
  • A Moonshot API key from platform.moonshot.ai. New accounts get trial credit; this whole tutorial costs a few cents.
  • Comfort with JSON and basic Python functions. No agent framework, no vector store, no extra services.
  • Optional: the same code runs against any OpenAI-compatible host of K2.6 (OpenRouter, Fireworks, a local vLLM/SGLang server) by swapping base_url and model.

Step 1 — Point the OpenAI SDK at Moonshot

The only difference from calling OpenAI is the base_url and the model name. Set your key as an environment variable so it never lands in source control.

import os
import openai

client = openai.Client(
    base_url="https://api.moonshot.ai/v1",   # Moonshot's OpenAI-compatible endpoint
    api_key=os.environ["MOONSHOT_API_KEY"],
)

MODEL = "kimi-k2.6"   # thinking is ON by default for this model

Export the key first: export MOONSHOT_API_KEY=sk-.... Everything below reuses this single client.

Step 2 — Read the model's reasoning (the reasoning_content gotcha)

With thinking enabled, K2.6 returns two separate streams: its reasoning in reasoning_content and the user-facing answer in content. There is a catch the docs are explicit about: the OpenAI SDK's typed objects do not declare a reasoning_content attribute, so accessing .reasoning_content directly can raise an AttributeError when the field is absent. Guard every access with hasattr / getattr.

stream = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are Kimi."},
        {"role": "user", "content": "Why is 1 + 1 = 2? Answer in one sentence."},
    ],
    max_tokens=1024 * 32,   # reasoning_content + content share this budget
    stream=True,            # recommended: thinking responses are large
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content always streams BEFORE content
    if hasattr(delta, "reasoning_content") and getattr(delta, "reasoning_content"):
        print(getattr(delta, "reasoning_content"), end="")   # the "thinking"
    elif delta.content:
        print(delta.content, end="")                          # the answer

Two facts to internalize from the official docs: in streaming mode reasoning_content always arrives before content, so the first content token is your signal that thinking is done; and tokens in reasoning_content count against max_tokens, which is exactly why thin budgets cause truncated answers.
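
If you want the two streams as separate strings rather than printed output, the same guard works inside a small collector. The sketch below is an illustration, not part of Moonshot's docs: split_stream and fake_chunk are hypothetical names, and the fake chunks are SimpleNamespace stand-ins that only mimic the chunk.choices[0].delta shape so the logic can run offline.

```python
from types import SimpleNamespace

def split_stream(chunks):
    """Collect reasoning text and answer text from an iterable of streamed chunks."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:          # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        # getattr with a default never raises, even if the attribute is undeclared
        thought = getattr(delta, "reasoning_content", None)
        if thought:
            reasoning.append(thought)
        text = getattr(delta, "content", None)
        if text:
            answer.append(text)
    return "".join(reasoning), "".join(answer)

def fake_chunk(**delta_fields):
    """Hypothetical stand-in for one streamed chunk, for offline testing only."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

demo = [
    fake_chunk(reasoning_content="1 is the successor of 0; "),
    fake_chunk(reasoning_content="2 is the successor of 1."),
    SimpleNamespace(choices=[]),       # an empty chunk, as the loop above allows for
    fake_chunk(content="Because 2 is defined as 1 + 1."),
]
thinking, final = split_stream(demo)
print(final)
```

Against the real API you would pass the stream object from client.chat.completions.create(..., stream=True) in place of demo.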

Step 3 — Define real tools

Tools are plain Python functions plus a JSON schema that tells the model when and how to call them. We will give the agent a sandboxed calculator (no eval) and a directory lister. Keep the descriptions concrete — the model reads them to decide which tool fits.

import json, ast, operator, os

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
        ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

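def calculate(expression: str) -> str:
    """Safe arithmetic: parse to an AST, evaluate only number ops."""
    def ev(n):
        # Accept numeric literals only; bool is excluded because it subclasses int.
        if (isinstance(n, ast.Constant) and isinstance(n.value, (int, float))
                and not isinstance(n.value, bool)):
            return n.value
        # Operators outside _OPS (%, //, unary +, ...) fall through to the ValueError.
        if isinstance(n, ast.BinOp) and type(n.op) in _OPS:
            return _OPS[type(n.op)](ev(n.left), ev(n.right))
        if isinstance(n, ast.UnaryOp) and type(n.op) in _OPS:
            return _OPS[type(n.op)](ev(n.operand))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))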

def list_files(directory: str = ".") -> str:
    return "\n".join(sorted(os.listdir(directory))) or "(empty)"

TOOL_IMPLS = {"calculate": calculate, "list_files": list_files}

TOOLS = [
    {"type": "function", "function": {
        "name": "calculate",
        "description": "Evaluate an arithmetic expression and return the result.",
        "parameters": {"type": "object",
            "properties": {"expression": {"type": "string", "description": "e.g. '12 * (3 + 4)'"}},
            "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "list_files",
        "description": "List file names in a directory.",
        "parameters": {"type": "object",
            "properties": {"directory": {"type": "string", "default": "."}},
            "required": []}}},
]

Using a real AST-based calculator instead of eval() is the difference between a demo and something you would let an autonomous agent run. The model never executes code on your machine; it only proposes a tool name and arguments, and you decide what runs.
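
The "you decide what runs" part can be made explicit with a small dispatcher that sits between the model's proposal and your functions. This is a minimal sketch under assumptions: dispatch and the toy shout tool are hypothetical names introduced for illustration, and the guide's own loop does the same job inline.

```python
import json

def dispatch(name, raw_arguments, impls):
    """Run one proposed tool call and always return a string for the tool message."""
    if name not in impls:                      # allowlist: unknown names never run
        return f"ERROR: unknown tool {name!r}"
    try:
        args = json.loads(raw_arguments or "{}")
        if not isinstance(args, dict):
            return "ERROR: arguments must be a JSON object"
        return str(impls[name](**args))
    except Exception as exc:                   # bad JSON, wrong kwargs, tool failure
        return f"ERROR: {exc}"

# Hypothetical toy tool, used only to exercise the dispatcher.
def shout(text: str) -> str:
    return text.upper()

impls = {"shout": shout}
print(dispatch("shout", '{"text": "ok"}', impls))    # OK
print(dispatch("rm_rf", '{"path": "/"}', impls))     # ERROR: unknown tool 'rm_rf'
print(dispatch("shout", '{"text": ', impls))         # an ERROR string, not a crash
```

Returning the error as text instead of raising gives the model a chance to read the failure and correct its next call.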

Step 4 — The agentic loop

Here is the whole pattern. Call the model; if it returns tool_calls, run each one, append the results, and call again; stop when it returns a plain answer. The critical detail for K2.6 is in the comments.

def run_agent(user_prompt: str, max_steps: int = 8) -> str:
    messages = [
        {"role": "system", "content": "You are a precise engineering assistant. "
         "Use tools when they make the answer more reliable."},
        {"role": "user", "content": user_prompt},
    ]

    for step in range(max_steps):
        completion = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=TOOLS,
            max_tokens=32000,   # >= 16000 is the documented minimum for thinking mode
            temperature=1.0,    # K2.6 uses a FIXED temperature of 1.0
        )
        msg = completion.choices[0].message

        if hasattr(msg, "reasoning_content") and getattr(msg, "reasoning_content"):
            print(f"[think {step+1}] {getattr(msg, 'reasoning_content')[:120]}...")

        # Append the assistant message AS-IS. This preserves reasoning_content and
        # the tool_calls so the model keeps its chain of thought across steps.
        messages.append(msg)

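        if not msg.tool_calls:        # no tools requested -> this is the final answer
            return msg.content or ""  # content can be None; keep the str return type

        for tc in msg.tool_calls:
            args = tc.function.arguments   # raw string, replaced by the parsed dict below
            try:
                # Parse inside the try so malformed JSON is reported back to the
                # model as a tool error instead of crashing the loop.
                args = json.loads(tc.function.arguments or "{}")
                result = TOOL_IMPLS[tc.function.name](**args)
            except Exception as e:
                result = f"ERROR: {e}"
            print(f"[tool] {tc.function.name}({args}) -> {str(result)[:60]}")
            # A tool message MUST carry the matching tool_call_id.
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "name": tc.function.name, "content": str(result)})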

    return "Stopped: hit max_steps before a final answer."

Three rules from Moonshot's docs are baked into this loop, and skipping any of them is the usual cause of broken thinking agents:

  • Keep max_tokens at 16,000 or higher so reasoning plus the answer fit.
  • Leave temperature at 1.0 (the model is tuned for it).
  • Append the assistant message object unchanged rather than rebuilding a clean {"role": "assistant", "content": ...} dict. That is what carries reasoning_content forward so each step builds on the last.
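
One way to notice a blown token budget early is to inspect finish_reason on each choice. In the OpenAI chat completions format a value of "length" means generation stopped at the token limit; the sketch below assumes Moonshot's compatible endpoint reports it the same way. classify and fake_choice are hypothetical helpers, and the fake choices are stand-ins so the example runs without a network call.

```python
from types import SimpleNamespace

def classify(choice):
    """Return 'truncated', 'tools', or 'final' for one completion choice."""
    if choice.finish_reason == "length":       # hit max_tokens before finishing
        return "truncated"
    if getattr(choice.message, "tool_calls", None):
        return "tools"                         # the model wants tools run
    return "final"                             # plain answer, the loop can stop

def fake_choice(finish_reason, content=None, tool_calls=None):
    """Hypothetical stand-in for completion.choices[0], for offline use only."""
    message = SimpleNamespace(content=content, tool_calls=tool_calls)
    return SimpleNamespace(finish_reason=finish_reason, message=message)

print(classify(fake_choice("length", content="The answer is")))      # truncated
print(classify(fake_choice("tool_calls", tool_calls=[object()])))    # tools
print(classify(fake_choice("stop", content="27")))                   # final
```

In the loop you could call classify(completion.choices[0]) right after the request and raise or retry with a larger max_tokens on "truncated" instead of returning a half-finished answer.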

Worked example: a two-tool reasoning run

Run a prompt that forces the model to chain tools — first discover a fact about the environment, then compute on it:

print(run_agent("How many files are in this folder, and what is that count times 9?"))

A representative run (your file list will differ) looks like this:

[think 1] The user wants a count of files and then that count times 9. I should
list the directory first, then multiply...
[tool] list_files({'directory': '.'}) -> README.md
agent.py
data.csv
[think 2] Three files: README.md, agent.py, data.csv. Now 3 * 9 via the calculator...
[tool] calculate({'expression': '3 * 9'}) -> 27
[think 3] I have both pieces. Final answer.
There are 3 files in this folder, and 3 x 9 = 27.

Notice the model never guessed the multiplication — it routed arithmetic to the tool even though the math is trivial, because the system prompt told it tools make answers more reliable. That same instinct is what lets K2.6 stay correct across hundreds of steps: it offloads ground truth to tools and uses reasoning_content to plan the next one.

Step 5 — Carry reasoning across turns with Preserved Thinking

Inside a single run_agent call the loop already preserves reasoning because we append each assistant message. Across separate user turns in a longer conversation, you opt in explicitly with the thinking.keep parameter. The default (null) drops historical reasoning to save tokens; "all" keeps it so the model can resume a prior line of thought.

# prior_assistant_message is the assistant message object returned by the
# earlier turn's call; it is not defined in this snippet.
messages = [
    {"role": "system", "content": "You are Kimi."},
    {"role": "user", "content": "Start analyzing this dataset's schema."},
    prior_assistant_message,                 # contains reasoning_content + content
    {"role": "user", "content": "Now derive the next transformation step."},
]

response = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    max_tokens=32000,   # same budget rule as the loop: reasoning + answer share it
    extra_body={"thinking": {"type": "enabled", "keep": "all"}},  # Preserved Thinking
)

Use keep: "all" when later turns genuinely depend on earlier reasoning, such as a multi-turn debugging or research session. It is not free: preserved reasoning_content stays in the context window and is billed on every subsequent call, so reach for it deliberately rather than by default.
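
To get a feel for what "billed on every subsequent call" adds up to, here is a back-of-the-envelope sketch. The turn count and the 2,000 reasoning tokens per turn are made-up illustrative numbers, not measurements; only the $0.60 input price comes from this guide, and the model ignores any caching discounts a host might apply.

```python
def extra_input_tokens(turns: int, reasoning_per_turn: int) -> int:
    """Extra input tokens resent when every turn's reasoning is kept."""
    # Turn k resends the reasoning produced by the k - 1 earlier turns.
    return sum(reasoning_per_turn * (k - 1) for k in range(1, turns + 1))

INPUT_PRICE_PER_MTOK = 0.60   # list price quoted in this guide

# Hypothetical session: 10 turns, 2,000 reasoning tokens produced per turn.
extra = extra_input_tokens(turns=10, reasoning_per_turn=2000)
cost = extra / 1_000_000 * INPUT_PRICE_PER_MTOK
print(extra, f"${cost:.3f}")   # 90000 $0.054
```

The overhead grows with the square of the number of turns, so it stays small for a short debugging session and becomes noticeable in very long conversations.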

Common pitfalls

  • Accessing .reasoning_content directly. The SDK's ChoiceDelta and ChatCompletionMessage types do not declare it. Always use hasattr then getattr, or read raw JSON where it sits beside content.
  • max_tokens too low. Reasoning and the final answer share the budget. Under ~16,000 you get answers that cut off mid-sentence or tool loops that never reach a conclusion. The docs set 16,000 as the floor for thinking mode.
  • Overriding temperature. K2.6 runs at a fixed temperature of 1.0. Passing 0.2 to 'make it deterministic' degrades quality instead of helping.
  • Rebuilding the assistant message. If you replace the returned object with a hand-made dict, you strip reasoning_content and the model loses continuity between tool steps. Append the object as-is.
  • Missing or mismatched tool_call_id. Every tool message must echo the exact id from the call it answers. Drop it and the API rejects the next request.
  • Returning non-strings from tools. The content of a tool message must be a string. Wrap dicts/lists in json.dumps (or str) before appending.
  • No step cap. An agent that can call tools can loop forever. Always bound it with a max_steps counter, as the loop above does.
  • Unbounded context on long runs. For very long agentic sessions, drop or summarize old tool results once the window fills — Moonshot's own benchmark setups keep only the most recent round of tool messages past a threshold.
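
The last pitfall can be sketched as a small history filter. This is one possible approach, not Moonshot's benchmark code: trim_tool_results is a hypothetical helper, character count is used as a crude stand-in for tokens, and the threshold and stub text are arbitrary. It keeps each tool message and its tool_call_id in place and only replaces the content of older results.

```python
def trim_tool_results(messages, max_chars=200_000, keep_last=1,
                      stub="(older tool output omitted)"):
    """Stub out old tool outputs once the history grows past max_chars."""
    def size(m):
        # Only dict messages are measured; SDK message objects pass through.
        return len(m.get("content") or "") if isinstance(m, dict) else 0

    if sum(size(m) for m in messages) <= max_chars:
        return messages                        # under the threshold: unchanged

    tool_idx = [i for i, m in enumerate(messages)
                if isinstance(m, dict) and m.get("role") == "tool"]
    old = set(tool_idx[:-keep_last] if keep_last > 0 else tool_idx)
    # Copy trimmed messages so the caller's original history is not mutated.
    return [dict(m, content=stub) if i in old else m
            for i, m in enumerate(messages)]

history = [
    {"role": "user", "content": "go"},
    {"role": "tool", "tool_call_id": "a", "content": "x" * 100},
    {"role": "tool", "tool_call_id": "b", "content": "y" * 100},
    {"role": "tool", "tool_call_id": "c", "content": "z" * 100},
]
trimmed = trim_tool_results(history, max_chars=150, keep_last=1)
print([m["content"][:12] for m in trimmed])
```

Keeping the stubbed messages rather than deleting them means every tool_call_id in the assistant messages still has a matching tool message.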

Quick reference

| Setting | Value / rule | Why |
|---|---|---|
| base_url | https://api.moonshot.ai/v1 | OpenAI-compatible endpoint |
| model | kimi-k2.6 | Thinking enabled by default |
| max_tokens | ≥ 16000 (use 32000) | Reasoning + answer share the budget |
| temperature | 1.0 (fixed) | Model is tuned for it; don't lower |
| stream | True (recommended) | Large responses; avoids timeouts |
| reasoning_content | hasattr / getattr | Untyped in the SDK; streams before content |
| tool message | role=tool + tool_call_id + str content | Required to thread results back |
| thinking.keep | "all" to persist across turns | Preserved Thinking (billed each call) |
| List price | $0.60 / $2.50 per Mtok in/out | Open-weight, cheap for agent loops |

Where to take it next

  • Add a web_search or HTTP tool so the agent can pull live facts, then watch how reasoning_content plans multi-source lookups.
  • Swap in a code-execution tool (sandboxed) to reproduce the long-horizon coding runs K2.6 is known for.
  • Persist messages to disk between sessions to give the agent durable memory.
  • Run the same code against a local K2.6 via vLLM or SGLang when you need full control or offline use — only base_url changes.
  • Compare costs: at $0.60/$2.50 a 50-step tool loop is typically cents, which is what makes always-on agents practical.
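
For the cost comparison in the last item, a rough estimator is enough to sanity-check the "cents" claim. Everything except the two list prices is an assumption chosen for illustration: the 1,500-token starting prompt, the 600 tokens of context added per step, and the 400 output tokens per step are hypothetical, and real runs vary widely.

```python
INPUT_PRICE, OUTPUT_PRICE = 0.60, 2.50   # USD per million tokens, from this guide

def loop_cost(steps, base_prompt, growth_per_step, output_per_step):
    """Estimate the cost of a tool loop whose context grows every step."""
    # Step k resends the base prompt plus everything the k - 1 earlier steps added.
    input_tokens = sum(base_prompt + growth_per_step * k for k in range(steps))
    output_tokens = steps * output_per_step
    dollars = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
    return input_tokens, output_tokens, dollars

# Hypothetical 50-step run.
tokens_in, tokens_out, usd = loop_cost(steps=50, base_prompt=1500,
                                       growth_per_step=600, output_per_step=400)
print(tokens_in, tokens_out, f"${usd:.2f}")   # 810000 20000 $0.54
```

Under these assumptions, resent input accounts for most of the bill, which is why trimming old tool results matters more than shaving output tokens.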

Benchmark figures and all API behavior (reasoning_content, the 16k token floor, fixed temperature, thinking.keep) are from Moonshot AI's official Kimi K2.6 tech blog and platform documentation, verified June 2026.
