MAI-Thinking-1 Hands-On: Build a Reasoning Agent — ContentBuffer guide

Kodetra Technologies · 10 min read · Intermediate

Summary

Deploy Microsoft's new reasoning model and build a tool-calling triage agent.

Microsoft just shipped its first homegrown reasoning model. MAI-Thinking-1 landed at Build 2026 on June 2, and the spec sheet explains why developer forums haven't stopped talking about it: a sparse Mixture-of-Experts design with roughly 1 trillion total parameters but only ~35B active per token, a 256K context window, and Microsoft's claim that it goes toe-to-toe with Claude Opus 4.6 on SWE-Bench Pro while being preferred over Sonnet 4.6 in blind human evaluations across 1,276 tasks.

Two things make this more than a model-card announcement. First, Microsoft trained it from scratch with zero distillation from third-party models, on commercially licensed data with AI-generated content excluded from pre-training. For enterprises worried about data provenance, that is a real differentiator. Second, it speaks the standard Chat Completions API with function calling and developer instructions, so the integration cost is close to zero if you already use the OpenAI SDK.

In this guide you'll deploy MAI-Thinking-1 from the Microsoft Foundry catalog, call it with plain Python, and then build something useful with it: a test-failure triage agent that uses function calling to read a stack trace, inspect the offending file, and propose a fix. By the end you'll know exactly where this model fits in your stack and where the preview's sharp edges are.
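
To make the "~35B active of ~1T total" figure concrete, here is a toy sketch of top-k expert routing, the general mechanism behind sparse MoE layers. It is a generic illustration, not Microsoft's architecture: the expert count, the value of k, the router scores, and the function names (top_k_route, moe_forward) are all assumptions chosen for readability.

```python
import math

def top_k_route(gate_scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    peak = max(gate_scores[i] for i in top)
    exps = [math.exp(gate_scores[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the routed experts; the others are never evaluated for this token."""
    routed = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    return output, [i for i, _ in routed]

# Eight toy "experts", each just a scalar function (hypothetical stand-ins).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up router logits

output, used = moe_forward(10.0, experts, gate_scores, k=2)
active_fraction = 35e9 / 1e12  # parameter counts quoted in the spec above
print(used, active_fraction)
```

Only two of the eight toy experts do any work for this input. The same idea, at scale, is why roughly 3.5% of the parameters are touched per token.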

Prerequisites

  • Python 3.10+ with openai>=1.40 installed (pip install openai)
  • An Azure subscription with access to Microsoft Foundry (ai.azure.com). MAI-Thinking-1 entered preview at Build 2026 — if you don't see it in the model catalog yet, request access from the model page or use a partner host (Baseten and Fireworks are distributing MAI models)
  • A Foundry project with a deployed MAI-Thinking-1 endpoint (we create this in Step 1)
  • Basic familiarity with Chat Completions-style APIs and JSON tool schemas

Step 1 - Deploy MAI-Thinking-1 from the Foundry catalog

Open the Foundry model catalog at ai.azure.com/catalog and search for MAI-Thinking-1. It sits alongside the other new first-party MAI models Microsoft shipped at Build (MAI-Image-2.5, MAI-Transcribe-2, MAI-Voice-2). Click Deploy, pick your project, and accept the default deployment name mai-thinking-1 — or note whatever name you choose, because that name (not the model family name) is what goes in the model field of every API call.

Once deployment finishes, grab two values from the deployment page: the endpoint URL (it looks like https://<your-resource>.services.ai.azure.com) and an API key. Export them:

export FOUNDRY_ENDPOINT="https://<your-resource>.services.ai.azure.com"
export FOUNDRY_API_KEY="<your-key>"

Smoke-test with curl before writing any Python. MAI-Thinking-1 is Chat Completions-compatible, so the request shape is the one you already know:

curl -s "$FOUNDRY_ENDPOINT/openai/v1/chat/completions" \
  -H "Authorization: Bearer $FOUNDRY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mai-thinking-1",
    "messages": [{"role": "user", "content": "In one sentence: why is sparse MoE cheaper to serve than a dense model of the same total size?"}],
    "max_tokens": 200
  }' | python3 -m json.tool

Example output (trimmed):

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "A sparse MoE only activates a small subset of experts (~35B of ~1T parameters here) per token, so each forward pass costs a fraction of what running every parameter in a dense model would."
    },
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 31, "completion_tokens": 47, "total_tokens": 78}
}

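If you get a 404, the most likely cause is that the model field doesn't match your deployment name. If you get a 401, check that the key belongs to the same resource as the endpoint and regenerate it if needed; Foundry keys are scoped per resource, not per project.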

Step 2 - Call it from Python with the OpenAI SDK

Because the model is Chat Completions-compatible, the OpenAI SDK works unmodified — you just point base_url at your Foundry endpoint. This is the same drop-in pattern Microsoft used for MAI-Code-1-Flash, and it means every framework that speaks OpenAI (LangChain, LlamaIndex, Pydantic AI, the Vercel AI SDK) inherits MAI-Thinking-1 support for free.

import os
from openai import OpenAI

client = OpenAI(
    base_url=f"{os.environ['FOUNDRY_ENDPOINT']}/openai/v1",
    api_key=os.environ["FOUNDRY_API_KEY"],
)

resp = client.chat.completions.create(
    model="mai-thinking-1",  # your DEPLOYMENT name
    messages=[
        # MAI-Thinking-1 is trained for layered instructions. Only a system
        # message is used here; Step 5 adds a developer layer for task framing.
        {"role": "system", "content": "You are a precise senior engineer. Answer with reasoning first, conclusion last."},
        {"role": "user", "content": (
            "Two services share a Postgres table. Service A wraps updates in "
            "SERIALIZABLE transactions; service B uses READ COMMITTED with "
            "optimistic version checks. Under heavy contention, which one "
            "retries more, and why?"
        )},
    ],
    max_tokens=1024,
    temperature=0.2,
)
print(resp.choices[0].message.content)

Example output (abridged): the model walks through serialization failure semantics, notes that SERIALIZABLE aborts on dangerous dependency cycles while optimistic checks abort only on actual row-version conflicts, and concludes that service A retries more under contention because predicate-level conflicts trigger aborts even when no row was actually overwritten. That step-by-step structure is the reasoning training showing through — you get the chain of logic in the visible output, not just a verdict.

Keep temperature low (0-0.3) for reasoning workloads. Reasoning models buy their accuracy with deliberate step-by-step decomposition, and high temperature degrades exactly the multi-step chains you're paying for.

Step 3 - Use the 256K window for whole-codebase context

The 256K context window fits roughly a 600-page document — or, more usefully for engineers, an entire mid-sized service plus its test suite. The pattern below loads a repo snapshot into one request. No RAG pipeline, no chunking, one call:

from pathlib import Path

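def repo_snapshot(root: str, exts=(".py", ".toml", ".md"), max_bytes=700_000):
    """Concatenate matching files under root and report any left out for size."""
    parts, skipped, total = [], [], 0
    for p in sorted(Path(root).rglob("*")):
        if p.suffix in exts and p.is_file() and ".venv" not in p.parts:
            body = p.read_text(encoding="utf-8", errors="ignore")
            size = len(body.encode("utf-8"))  # count bytes, not characters
            if total + size > max_bytes:
                skipped.append(str(p))  # surface omissions instead of truncating silently
                continue
            total += size
            parts.append(f"===== {p} =====\n{body}")
    if skipped:
        print(f"WARNING: {len(skipped)} file(s) left out of the snapshot: {skipped}")
    return "\n\n".join(parts)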

snapshot = repo_snapshot("./my-service")  # ~700KB ≈ 180K tokens, fits in 256K

resp = client.chat.completions.create(
    model="mai-thinking-1",
    messages=[
        {"role": "system", "content": "You audit Python services for concurrency bugs."},
        {"role": "user", "content": snapshot + "\n\nList every place where shared state is mutated without a lock or transaction, ranked by blast radius."},
    ],
    max_tokens=2048,
)
print(resp.choices[0].message.content)

One practical note: 256K is generous, but it is not the 1M window Gemini advertises. Budget ~3.8 characters per token for code and leave headroom for the answer. If your repo exceeds ~220K tokens of input, fall back to retrieval for the long tail instead of truncating silently. Silent truncation is a common source of "the model ignored my file" bug reports.
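
The budgeting rule above can be sketched as a small planner that decides which files go inline and which are left for retrieval. This is a rough sketch under stated assumptions: the 3.8 characters-per-token figure and the ~220K input budget come from the text, while the function names (estimate_tokens, plan_context) and the sample files are hypothetical. A real tokenizer will give different counts.

```python
import math

CHARS_PER_TOKEN = 3.8          # rough figure for code, from the text above
INPUT_BUDGET_TOKENS = 220_000  # leaves headroom for the answer inside 256K

def estimate_tokens(text: str) -> int:
    """Crude character-based estimate; a real tokenizer will differ."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def plan_context(files: dict[str, str], budget: int = INPUT_BUDGET_TOKENS):
    """Split files into those sent inline and those left for retrieval."""
    inline, overflow, used = [], [], 0
    for path, body in files.items():
        cost = estimate_tokens(body)
        if used + cost <= budget:
            inline.append(path)
            used += cost
        else:
            overflow.append(path)  # reported explicitly, never dropped silently
    return inline, overflow, used

# Three synthetic files of 380,000 characters each (roughly 100K tokens apiece).
files = {"a.py": "x" * 380_000, "b.py": "y" * 380_000, "c.py": "z" * 380_000}
inline, overflow, used = plan_context(files)
print(inline, overflow, used)
```

The point of returning the overflow list is that the caller always knows what the model did not see, and can route those paths to a retrieval step.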

Step 4 - The worked example: a test-failure triage agent

Now the real build. CI fails, someone has to read the traceback, find the file, and figure out whether it's the test or the code. That's a reasoning-plus-tools loop, which is exactly what MAI-Thinking-1's function calling is for. We give the model two tools — read_file and run_pytest — and let it drive.

import json, subprocess
from pathlib import Path

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a source or test file from the repo.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string", "description": "Repo-relative path"}},
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_pytest",
            "description": "Run pytest on one test node and return the output.",
            "parameters": {
                "type": "object",
                "properties": {"node": {"type": "string", "description": "e.g. tests/test_cart.py::test_discount_stacking"}},
                "required": ["node"],
            },
        },
    },
]

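REPO_ROOT = Path.cwd().resolve()

def read_file(path: str) -> str:
    # Resolve against the repo root and refuse anything that escapes it,
    # including absolute paths and symlinks that point outside the repo.
    p = (REPO_ROOT / path).resolve()
    if not p.is_relative_to(REPO_ROOT) or not p.is_file():
        return f"ERROR: {path} not found"
    return p.read_text(errors="ignore")[:40_000]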

def run_pytest(node: str) -> str:
    try:
        out = subprocess.run(["python", "-m", "pytest", node, "-x", "--tb=short", "-q"],
                             capture_output=True, text=True, timeout=120)
    except subprocess.TimeoutExpired:
        return f"ERROR: pytest timed out after 120s on {node}"
    return (out.stdout + out.stderr)[-8_000:]

IMPL = {"read_file": read_file, "run_pytest": run_pytest}

def triage(traceback_text: str, max_turns: int = 8) -> str:
    messages = [
        {"role": "system", "content": (
            "You are a CI triage agent. Investigate the failure using the tools. "
            "Re-run the failing test to confirm, read only the files you need, "
            "then output a verdict: BUG_IN_CODE or BUG_IN_TEST, the root cause, "
            "and a minimal patch as a unified diff."
        )},
        {"role": "user", "content": f"CI failure:\n{traceback_text}"},
    ]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="mai-thinking-1", messages=messages,
            tools=TOOLS, tool_choice="auto", max_tokens=2048,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final verdict
        messages.append(msg)
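        for tc in msg.tool_calls:
            try:
                args = json.loads(tc.function.arguments)
                result = IMPL[tc.function.name](**args)
            except (json.JSONDecodeError, KeyError, TypeError) as exc:
                # Report a malformed or unknown tool call back to the model instead of crashing.
                result = f"ERROR: bad tool call {tc.function.name}: {exc!r}"
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})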
    return "Gave up after max_turns — escalate to a human."

print(triage("""
FAILED tests/test_cart.py::test_discount_stacking
E  assert cart.total() == 80.0
E  actual: 79.2
"""))

Example run against a small shopping-cart repo with a deliberately seeded bug (percentage discounts applied multiplicatively instead of being summed before application):

Turn 1 -> run_pytest("tests/test_cart.py::test_discount_stacking")  # confirms 79.2 != 80.0
Turn 2 -> read_file("src/cart.py")
Turn 3 -> read_file("tests/test_cart.py")
Turn 4 -> final answer:

VERDICT: BUG_IN_CODE
Root cause: Cart.total() applies each discount to the running total
(100 * 0.9 * 0.88 = 79.2). The documented contract and the test both
expect stacked discounts to sum first (10% + 12% = 22% -> 78.0)...
wait — the test expects 80.0, which matches summing only the two
10%-capped promo discounts per the PROMO_CAP rule in pricing.md.
The cap is never enforced in total().

--- a/src/cart.py
+++ b/src/cart.py
@@ -41,7 +41,8 @@ class Cart:
-        for d in self.discounts:
-            total *= (1 - d.pct)
+        applied = min(sum(d.pct for d in self.discounts), PROMO_CAP)
+        total *= (1 - applied)
         return round(total, 2)
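
The three totals in the transcript above can be checked by hand. The sketch below uses Decimal to avoid float noise. The price of 100 and the 10% and 12% discounts come from the transcript; PROMO_CAP = 0.20 is an assumption inferred from the expected total of 80.0, since the transcript never states its value.

```python
from decimal import Decimal

price = Decimal("100")
discounts = [Decimal("0.10"), Decimal("0.12")]
PROMO_CAP = Decimal("0.20")  # assumed: inferred from the expected 80.0, not stated in the transcript

# Buggy behavior: each discount compounds on the running total.
compounded = price
for pct in discounts:
    compounded *= (1 - pct)

# The model's first hypothesis: sum the discounts, no cap.
summed = price * (1 - sum(discounts))

# Patched behavior: sum the discounts, then cap the combined rate.
capped = price * (1 - min(sum(discounts), PROMO_CAP))

print(compounded, summed, capped)
```

The compounded total matches the failing 79.2, the uncapped sum gives 78.0, and only the capped version reaches the 80.0 the test expects.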

The interesting part of that transcript is the mid-reasoning self-correction: the model initially computed the summed discount as 22%, noticed that the expected value of 80.0 implied a cap, tied that to the PROMO_CAP rule, and revised its diagnosis. That recover-from-intermediate-mistakes behavior is what Microsoft says its verified agentic training environments (deterministic, executable environments graded by real test suites) were built to teach, and it's visible in practice.
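
If you'd rather exercise the loop's control flow before spending tokens, one option is to dry-run it against a scripted stub. The sketch below is self-contained and makes several assumptions: ScriptedClient, tool_call, and run_loop are hypothetical names, the stub only mimics the attributes the loop reads (choices, message, tool_calls, content), and run_loop is a simplified copy of the triage loop, not the loop itself.

```python
import json
from types import SimpleNamespace as NS

class ScriptedClient:
    """Hypothetical stand-in for the SDK client: replays canned assistant messages."""
    def __init__(self, script):
        self._script = iter(script)
        self.chat = NS(completions=NS(create=self._create))

    def _create(self, **kwargs):
        return NS(choices=[NS(message=next(self._script))])

def tool_call(call_id, name, **args):
    """Build an object shaped like one entry of message.tool_calls."""
    return NS(id=call_id, function=NS(name=name, arguments=json.dumps(args)))

def run_loop(client, impl, messages, max_turns=8):
    """Same control flow as the triage loop, minus the real model and tools."""
    for _ in range(max_turns):
        resp = client.chat.completions.create(model="stub", messages=messages)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for tc in msg.tool_calls:
            result = impl[tc.function.name](**json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
    return "Gave up after max_turns"

script = [
    NS(content=None, tool_calls=[tool_call("c1", "read_file", path="src/cart.py")]),
    NS(content="VERDICT: BUG_IN_CODE", tool_calls=None),
]
history = [{"role": "user", "content": "CI failure: (traceback here)"}]
fake_tools = {"read_file": lambda path: f"<contents of {path}>"}
verdict = run_loop(ScriptedClient(script), fake_tools, history)
print(verdict, len(history))
```

A stub like this won't tell you anything about model quality, but it does catch wiring mistakes (missing tool results, an unbounded loop) without a network call.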

Step 5 - Layered instructions for production behavior

MAI-Thinking-1 was explicitly post-trained to follow multiple layers of instructions, which Microsoft positions as an enterprise feature: platform policy in the system message, per-task framing from the developer, and the end-user request, each respected in priority order. Use it — it's the difference between an agent that obeys your output contract and one that freelances:

messages = [
    {"role": "system", "content": (
        "Platform policy: never output secrets or customer PII. "
        "All code suggestions must be valid unified diffs."
    )},
    {"role": "developer", "content": (
        "Task framing: you are reviewing payment-service PRs. "
        "Respond as JSON: {risk: low|medium|high, findings: [...], diff: string|null}"
    )},
    {"role": "user", "content": pr_contents},
]

In testing, the JSON contract held across multi-turn tool loops without a JSON mode flag, but don't rely on luck in production: validate with Pydantic and re-prompt on parse failure. One retry with the parse error appended fixes the overwhelming majority of contract breaks on any model, this one included.
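
The validate-and-retry pattern can be sketched without any dependencies. This version uses only the standard library as a stand-in for the Pydantic model the text recommends, and it assumes a hypothetical call_model callable that takes the message list and returns the assistant's raw text. The field rules mirror the contract in the developer message above.

```python
import json

ALLOWED_RISK = ("low", "medium", "high")

def parse_review(raw: str) -> dict:
    """Validate the JSON contract; raise ValueError on any breach."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if data.get("risk") not in ALLOWED_RISK:
        raise ValueError(f"risk must be one of {ALLOWED_RISK}")
    if not isinstance(data.get("findings"), list):
        raise ValueError("findings must be a list")
    if "diff" not in data or not isinstance(data["diff"], (str, type(None))):
        raise ValueError("diff must be a string or null")
    return data

def review_with_retry(call_model, messages, retries=1):
    """call_model is a hypothetical callable: messages -> raw assistant text."""
    messages = list(messages)  # don't mutate the caller's list
    for attempt in range(retries + 1):
        raw = call_model(messages)
        try:
            return parse_review(raw)
        except ValueError as exc:
            if attempt == retries:
                raise
            # Feed the parse error back so the model can repair its own output.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Invalid response ({exc}). Reply with JSON only."})

# Demo with a fake model that breaks the contract once, then complies.
replies = iter(["Sure! Here is my review.", '{"risk": "low", "findings": [], "diff": null}'])
result = review_with_retry(lambda msgs: next(replies), [{"role": "user", "content": "review this PR"}])
print(result)
```

Swapping parse_review for a Pydantic model keeps the same shape: catch the validation error, append it to the conversation, and try once more before giving up.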

Common pitfalls

  • Preview access is the gate. MAI-Thinking-1 launched in private preview on Microsoft Foundry (June 2), with public preview in the Foundry catalog rolling out and the MAI Playground coming after. If the catalog tile shows request-access, that's not a bug. Baseten and Fireworks distribution gives you a second path — Baseten even lets you tune the weights, a first for a Microsoft frontier model.
  • The model field is your deployment name, not the family name. If you deployed as mai-thinking-prod, then model="mai-thinking-1" returns 404. This bites everyone migrating from openai.com, where model IDs are global.
  • Benchmarks are vendor-reported. The SWE-Bench Pro parity with Opus 4.6, the 97.0% AIME 2025 / 94.5% AIME 2026 scores, and the Sonnet 4.6 preference win all come from Microsoft's own evals (competitor numbers from official model cards). Independent numbers hadn't landed at launch. Run your own task-level eval before switching production traffic.
  • A preference win is not benchmark dominance. 'Preferred over Sonnet 4.6' means raters favored its answers more often in aggregate across 1,276 mixed tasks, not that it wins every category. Expect benchmark-by-benchmark variation.
  • Reasoning costs output tokens. Step-by-step decomposition means longer completions than a non-reasoning model for the same question. Set max_tokens generously (we used 2048 in the agent loop) or you'll truncate mid-chain and get a verdict with no diff.
  • 256K is not infinite. At ~3.8 chars/token for code, 700KB of source is already ~180K tokens. Past ~220K of input, switch to retrieval rather than silently truncating.
  • Tool loops need a turn budget. Any function-calling agent can ping-pong. The max_turns=8 guard plus an explicit escalate-to-human fallback is the minimum viable safety net before you wire this into CI.
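
For the truncation pitfall in particular, a cheap guard is to check finish_reason before trusting a verdict. The sketch below works on the raw JSON shape shown in Step 1, where "length" is the standard Chat Completions value for a completion cut off by max_tokens. The function name check_complete is hypothetical; with the SDK you would read the same fields as attributes.

```python
def check_complete(response: dict) -> str:
    """Return the assistant text, or raise if the completion hit max_tokens."""
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        used = response.get("usage", {}).get("completion_tokens")
        raise RuntimeError(f"completion truncated after {used} tokens; raise max_tokens and retry")
    return choice["message"]["content"]

ok = {
    "choices": [{"message": {"role": "assistant", "content": "VERDICT: BUG_IN_CODE"},
                 "finish_reason": "stop"}],
    "usage": {"completion_tokens": 47},
}
print(check_complete(ok))
```

Failing loudly here is a design choice: a truncated verdict with no diff is easy to mistake for a complete answer once it lands in a PR comment.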

Quick reference

Spec               Value
Architecture       Sparse MoE, ~35B active / ~1T total parameters
Context window     256K tokens (~600-page document)
API                Chat Completions-compatible; function calling; layered (system/developer) instructions
SWE-Bench Pro      Matches Claude Opus 4.6 (vendor-reported)
AIME 2025 / 2026   97.0% / 94.5% (vendor-reported)
Human preference   Preferred vs Sonnet 4.6, 1,276 blind tasks (Surge raters)
Training           No third-party distillation; licensed data; AI-generated content excluded from pre-training
Access             Microsoft Foundry preview; MAI Playground soon; Baseten / Fireworks distribution
Hardware           Runs on Microsoft Maia 200 accelerators in Azure

Next steps

  • Wire the triage agent into CI: trigger on pytest failure, post the verdict + diff as a PR comment, and keep a human on the merge button.
  • Benchmark against your incumbent: same prompts, same tools, MAI-Thinking-1 vs whatever you run today. Microsoft's pitch is mid-weight price for near-frontier reasoning, and pricing is the variable that decides this.
  • Read Microsoft's MAI-Thinking-1 paper (linked from the announcement at microsoft.ai/news/introducing-mai-thinking-1) for the Hill-Climbing Machine training pipeline details — the no-distillation claim is worth understanding before you cite it in a vendor review.
  • Watch for MAI Playground public preview if you want to evaluate without an Azure subscription.
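
For the CI wiring in the first bullet, one piece that is easy to sketch is pulling the verdict and diff out of the agent's final message before posting it. This assumes the output format seen in the Step 4 transcript (a BUG_IN_CODE or BUG_IN_TEST token, then a unified diff starting with "--- a/"). The name parse_triage is hypothetical, and real model output may need looser matching.

```python
import re

VERDICT_RE = re.compile(r"\b(BUG_IN_CODE|BUG_IN_TEST)\b")
DIFF_START_RE = re.compile(r"^--- a/", re.MULTILINE)  # assumed unified-diff header

def parse_triage(answer: str):
    """Extract (verdict, diff) from the agent's final message; either may be None."""
    verdict_match = VERDICT_RE.search(answer)
    verdict = verdict_match.group(1) if verdict_match else None
    diff_match = DIFF_START_RE.search(answer)
    diff = answer[diff_match.start():].strip() if diff_match else None
    return verdict, diff

sample = (
    "VERDICT: BUG_IN_CODE\n"
    "Root cause: the cap is never enforced in total().\n\n"
    "--- a/src/cart.py\n+++ b/src/cart.py\n@@ -1 +1 @@\n-old\n+new\n"
)
verdict, diff = parse_triage(sample)
print(verdict)
print(diff)
```

Returning None for a missing piece lets the CI job decide what to do, for example escalating to a human when there is a verdict but no diff.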

Microsoft spent two years as the company that resells other labs' models. MAI-Thinking-1 — trained on its own data, on its own Maia silicon, with its own RL stack — is the clearest sign yet that era is ending. Whether it holds up under independent benchmarks is an open question, but the integration cost is low enough that finding out yourself takes an afternoon.
