
MiniMax M3: Master 1M-Token Long Context With MSA
Summary
MiniMax M3 hands-on: MSA sparse attention plus real 1M-token long context, with runnable Python.
On June 1, 2026 MiniMax shipped M3 and committed to publishing, within roughly ten days, the part developers actually care about: the technical report and the open weights. M3 is positioned as the first open-weight model to combine frontier coding, native multimodal input, and a genuine 1-million-token context window in a single checkpoint. The thing making it spike across r/LocalLLaMA, Hacker News, and dev Twitter right now is not another benchmark bar chart. It is MSA, MiniMax Sparse Attention, the architecture that makes that 1M window cheap enough to use in production.
Most models that advertise a million-token window are bluffing. They are text-only at that length, or quality falls off a cliff past 256K, or the bill is so large you would never send a real million tokens. MSA attacks that head-on: at 1M context, M3's per-token compute is about 1/20th of the previous generation, with roughly 9.7x faster prefill and 15.6x faster decoding versus MiniMax M2. That is the difference between long context as a spec-sheet number and long context as something you ship.
This guide teaches you to actually use it. You will make your first M3 API call, understand what MSA is doing under the hood, toggle its thinking mode, pack an entire codebase or long document into one request without blowing your token budget, and run a real whole-repository question-answering example. Everything is concrete, the code is API-accurate against the official MiniMax docs, and you will see real example output.
Prerequisites
- Python 3.9+ and pip.
- The OpenAI SDK: pip install openai (M3's API is OpenAI-compatible at the chat-completions level, so you reuse the same client).
- A MiniMax API key from platform.minimax.io, or an OpenRouter key if you prefer one bill across providers.
- Basic comfort with chat-completions style messages (system / user / assistant roles).
- Optional: a folder of source files or a long document to test real 1M-token context.
Set your key as an environment variable so it never lands in source control:
export MINIMAX_API_KEY="your_key_here"
# or, if you go through OpenRouter:
export OPENROUTER_API_KEY="your_key_here"
Step 1 — Your first M3 call in 10 lines
The MiniMax endpoint speaks the OpenAI chat-completions dialect. Point the OpenAI client at MiniMax's base URL, set the model to MiniMax-M3, and you are done. No new SDK to learn.
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["MINIMAX_API_KEY"],
    base_url="https://api.minimax.io/v1",  # OpenAI-compatible endpoint
)
resp = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[
        {"role": "system", "content": "You are a senior backend engineer."},
        {"role": "user", "content": "In one sentence, what makes MSA different from full attention?"},
    ],
)
print(resp.choices[0].message.content)
Example output:
MSA only computes attention over a small set of selected key-value blocks instead of
every past token, so cost grows far slower than the quadratic blow-up of full attention
while keeping near-identical quality.
If you would rather route through OpenRouter (handy for failover and a single invoice across providers), the only things that change are the base URL, the API key, and the model slug:
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax/minimax-m3",
    "messages": [{"role": "user", "content": "Write a unit test for a fib(n) function."}]
  }'
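Because the two routes differ only in base URL, key, and model slug, one way to keep that switch in a single place is a small lookup table. This is a minimal sketch: the `PROVIDERS` table and the `provider_settings` helper are hypothetical names, and the values are simply the ones quoted in this guide.

```python
import os

# Values copied from this guide; the table and helper names are illustrative.
PROVIDERS = {
    "minimax": {
        "base_url": "https://api.minimax.io/v1",
        "model": "MiniMax-M3",
        "key_env": "MINIMAX_API_KEY",
    },
    "openrouter": {
        "base_url": "https://openrouter.ai/api/v1",
        "model": "minimax/minimax-m3",
        "key_env": "OPENROUTER_API_KEY",
    },
}


def provider_settings(name: str, env=None) -> dict:
    """Return base_url, model, and api_key for the chosen route."""
    env = os.environ if env is None else env
    if name not in PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    cfg = PROVIDERS[name]
    key = env.get(cfg["key_env"])
    if not key:
        raise RuntimeError(f"set {cfg['key_env']} before calling the API")
    return {"base_url": cfg["base_url"], "model": cfg["model"], "api_key": key}
```

With this in place you would build the client as `OpenAI(api_key=s["api_key"], base_url=s["base_url"])` and pass `s["model"]` to the completion call, where `s = provider_settings("openrouter")`.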
Step 2 — What MSA is actually doing
You do not need to re-implement attention to use M3, but understanding MSA tells you when to reach for it and what its limits are.
Full attention compares every token with every other token. For a sequence of length N that is N-squared work, which is why a 1M-token prompt is ruinous on a vanilla transformer. Sparse attention schemes add a pre-filtering stage so each query only attends to a subset of the past. The trick is choosing that subset well.
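To make the N-squared point concrete, here is a back-of-the-envelope count of query-key comparisons. The block size and blocks-per-query values are assumptions chosen for illustration, not M3's real configuration, and the sketch ignores the cost of the selection stage itself, so its ratio is not comparable to MiniMax's reported figures.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key comparisons when every token attends to every token."""
    return n_tokens * n_tokens


def block_sparse_pairs(n_tokens: int, block_size: int, blocks_per_query: int) -> int:
    """Comparisons when each query attends to a fixed number of KV blocks."""
    attended = min(n_tokens, block_size * blocks_per_query)
    return n_tokens * attended


for n in (128_000, 1_000_000):
    full = full_attention_pairs(n)
    # block_size and blocks_per_query are assumed values, not M3's settings.
    sparse = block_sparse_pairs(n, block_size=64, blocks_per_query=32)
    print(f"{n:>9,} tokens: full={full:.2e}  sparse={sparse:.2e}  ratio={full / sparse:.0f}x")
```

The point is the shape of the curve: the full-attention count grows with the square of the length, while the sparse count grows linearly once the per-query budget is fixed.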
MSA keeps a Grouped-Query Attention (GQA) backbone and layers block-level selection on top. It splits the key-value cache into blocks and, for each query, selects the most relevant blocks to attend to, operating on real, uncompressed key-values. That last detail is the differentiator. DeepSeek's Multi-head Latent Attention (MLA) compresses the KV state to save memory and trades away some long-context precision. MSA does not compress; it selects, so it sidesteps the compression-versus-precision tradeoff.
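The select-then-attend idea can be sketched in a few lines of plain Python. This is a toy for intuition only: scoring a block by its mean key is an assumed heuristic, not MiniMax's published selection rule, and `select_blocks` and `sparse_attention` are hypothetical names.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query, keys, block_size, top_k):
    """Score each KV block by its mean key (an assumed heuristic) and keep the top_k."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    best = sorted(scores, reverse=True)[:top_k]
    return sorted(start for _, start in best)


def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the selected blocks, on uncompressed K/V."""
    idx = [i for start in select_blocks(query, keys, block_size, top_k)
           for i in range(start, min(start + block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total for d in range(dim)]
```

Note that the keys and values used inside `sparse_attention` are the original vectors; only the set of positions is reduced. Setting `top_k` to the total number of blocks recovers ordinary full attention for that query.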
MiniMax also tuned the GPU kernel with a 'KV outer, gather Q' loop: KV blocks are the outer loop, the queries that hit a block are gathered to it, each block is read once, and memory access stays contiguous. The reported result is more than 4x faster than open-source Flash-Sparse-Attention and flash-moba, and the headline practical numbers below.
| Metric at 1M context | MSA (M3) vs previous gen |
|---|---|
| Per-token compute | ~1/20th (about 20x less) |
| Prefill speed | ~9.7x faster |
| Decoding speed | ~15.6x faster |
| Quality vs full attention | Matched on the vast majority of ablations |
The takeaway: MSA is what lets you treat context length as a dimension you can scale, the way you already scale parameters or data, instead of a hard wall you bump into at 128K.
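The 'KV outer, gather Q' loop order is easier to see as a scheduling problem than as kernel code. The sketch below is plain Python, not the GPU kernel: it only inverts a per-query block selection into a per-block query list so each KV block is visited once. The function names and the sample selection are hypothetical.

```python
from collections import defaultdict


def gather_queries_by_block(selected):
    """Invert {query_id: [block_id, ...]} into {block_id: [query_id, ...]}."""
    by_block = defaultdict(list)
    for q_id, blocks in selected.items():
        for b_id in blocks:
            by_block[b_id].append(q_id)
    return dict(by_block)


def kv_outer_schedule(selected):
    """Yield (block_id, query_ids) so each KV block is loaded exactly once."""
    by_block = gather_queries_by_block(selected)
    for b_id in sorted(by_block):           # outer loop over KV blocks, in memory order
        yield b_id, sorted(by_block[b_id])  # queries gathered to this block


# Hypothetical selection: query 0 picked blocks 0 and 3, and so on.
selected = {0: [0, 3], 1: [3], 2: [0, 1, 3]}
for block_id, query_ids in kv_outer_schedule(selected):
    print(f"load KV block {block_id} once -> serve queries {query_ids}")
# load KV block 0 once -> serve queries [0, 2]
# load KV block 1 once -> serve queries [2]
# load KV block 3 once -> serve queries [0, 1, 2]
```

A query-outer loop over the same selection would touch block 3 three times; the inverted order reads it once and keeps block reads sequential.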
Step 3 — Toggle thinking mode for the job at hand
M3 ships with a switchable reasoning ('thinking') mode. With thinking on, it is suited to complex reasoning, agentic tasks, and long-horizon collaboration. With it off, it answers faster, which is what you want for chat and code completion. Crucially, both modes share the same price, so the choice is purely about latency versus depth.
On the OpenRouter route the reasoning control is the unified reasoning field, which is the cleanest documented way to flip it:
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)
resp = client.chat.completions.create(
    model="minimax/minimax-m3",
    messages=[{"role": "user", "content": "Plan the migration of a monolith to 3 services."}],
    extra_body={"reasoning": {"enabled": True}},  # turn thinking ON for hard planning
)
print(resp.choices[0].message.content)
Rule of thumb: thinking on for multi-step planning, debugging, and agent loops where a wrong step compounds; thinking off for autocomplete, classification, extraction, and high-volume latency-sensitive endpoints. Always confirm the exact field names against the current MiniMax API reference, since reasoning controls are the parameter most likely to change between provider versions.
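If you want that rule of thumb enforced in code rather than remembered, a tiny policy function can build the payload for you. This is a sketch under assumptions: the task labels are hypothetical, and `{"enabled": False}` is assumed to be the off switch because it mirrors the `True` form shown above, so check it against the provider reference before relying on it.

```python
# Task labels are hypothetical; the split follows the rule of thumb above.
THINKING_ON = {"planning", "debugging", "agent_loop"}
THINKING_OFF = {"autocomplete", "classification", "extraction"}


def reasoning_extra_body(task: str) -> dict:
    """Build the extra_body payload for the OpenRouter `reasoning` field."""
    if task in THINKING_ON:
        enabled = True
    elif task in THINKING_OFF:
        enabled = False
    else:
        # Fail loudly so a new task type gets an explicit decision.
        raise ValueError(f"no thinking policy for task {task!r}")
    return {"reasoning": {"enabled": enabled}}
```

You would pass the result as `extra_body=reasoning_extra_body("planning")` in the completion call. Raising on unknown tasks is deliberate: a silent default is how a latency-sensitive endpoint ends up with thinking left on.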
Step 4 — Pack a codebase into one request (without surprise bills)
The reason to use M3 is to put something genuinely large in front of the model: a whole repository, a 400-page contract, a quarter of server logs. But M3's pricing is tiered by input length. Calls with 512K input tokens or fewer are billed at the standard rate, which covers the vast majority of work; anything above 512K moves to a higher long-context rate. So before you send a million tokens, budget them.
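One way to turn the tier rule into a guard is a pre-flight check that refuses to cross 512K unless you explicitly opt in. A minimal sketch: the `preflight` name and its `allow_long_context` flag are hypothetical, and the two thresholds are the ones stated in this guide (no prices are modeled).

```python
STANDARD_TIER_MAX = 512_000   # inputs at or below this are billed at the standard rate
CONTEXT_MAX = 1_000_000       # stated maximum context window


def preflight(input_tokens: int, allow_long_context: bool = False) -> str:
    """Return the billing tier for a projected input, or raise if it should not be sent."""
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    if input_tokens > CONTEXT_MAX:
        raise ValueError("input exceeds the 1M-token context window")
    if input_tokens <= STANDARD_TIER_MAX:
        return "standard"
    if not allow_long_context:
        raise RuntimeError("input crosses 512K; pass allow_long_context=True to opt in")
    return "long-context"
```

Call it with your token estimate right before the API request, so an accidental long-context call fails in your code instead of showing up on the invoice.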
Here is a small, dependency-free packer. It estimates tokens (roughly 4 characters per token for code and English), keeps you under whichever tier you target, and reserves room for the model's output. This code runs as-is.
import glob
def est_tokens(text: str) -> int:
    # ~4 chars/token heuristic. Replace with a real tokenizer for billing-grade math.
    return (len(text) + 3) // 4
STD_TIER_LIMIT = 512_000  # <=512K input billed at the standard rate
HARD_LIMIT = 1_000_000  # M3 absolute max context
def pack_repo(paths, budget=STD_TIER_LIMIT, reserve_for_output=8000):
    """Concatenate files into one long-context message, staying under budget."""
    budget -= reserve_for_output
    used, parts = 0, []
    for p in paths:
        try:
            # The context manager closes the file handle even if the read fails.
            with open(p, encoding="utf-8", errors="ignore") as fh:
                src = fh.read()
        except OSError:
            continue  # unreadable file: skip it rather than abort the whole pack
        t = est_tokens(src) + 20  # +20 for the file header
        if used + t > budget:
            parts.append(f"\n[...truncated: {p} skipped, budget reached...]")
            break
        parts.append(f"### FILE: {p}\n```\n{src}\n```")
        used += t
    return "\n\n".join(parts), used
files = sorted(glob.glob("src/**/*.py", recursive=True))
blob, used = pack_repo(files)
print("estimated input tokens:", used)
print("fits standard tier (<=512K)?", used <= STD_TIER_LIMIT)
Real output from running it against a small project folder:
estimated input tokens: 388
fits standard tier (<=512K)? True
Two habits this enforces. First, you always know which billing tier you are in before you pay. Second, by reserving output tokens up front you avoid the classic failure where a maxed-out input leaves no room for the answer and the call errors or truncates.
Step 5 — Worked example: whole-repository Q&A
Now stitch it together into something useful: ask a question about an entire codebase in a single call, instead of building a retrieval pipeline. With a real 1M-token window you can often skip RAG for medium projects and just send everything.
import os, glob
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["MINIMAX_API_KEY"],
    base_url="https://api.minimax.io/v1",
)
# 1) Pack the repo (reuse pack_repo from Step 4)
files = sorted(glob.glob("src/**/*.py", recursive=True))
blob, used = pack_repo(files)
print(f"Sending ~{used} tokens across {len(files)} files")
# 2) Ask one question grounded in the whole project
question = (
    "Trace how a request flows from the HTTP handler to the database. "
    "Name the exact functions and files, in order, and flag any layer "
    "that talks to the DB without going through the repository abstraction."
)
resp = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[
        {"role": "system",
         "content": "You are a staff engineer reviewing an unfamiliar codebase. "
                    "Cite file paths and function names. Say 'not found' if unsure."},
        {"role": "user",
         "content": f"{question}\n\nHere is the full source:\n\n{blob}"},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
Representative output:
Sending ~41230 tokens across 17 files
Request flow:
1. src/api/routes.py -> handle_order() # parses + validates the HTTP body
2. src/services/order_service.py -> create() # business rules, idempotency check
3. src/repos/order_repo.py -> insert() # the repository abstraction
4. src/db/session.py -> execute() # actual SQL
Layering issue:
- src/api/routes.py -> export_csv() calls src/db/session.py:execute() DIRECTLY,
bypassing the repository layer. That is the one place a handler touches the DB
without going through src/repos/. Recommend routing it through order_repo.
Because M3 applies automatic prompt caching by default, re-sending that same large system prompt and source blob across follow-up questions is materially cheaper than the first call. That makes interactive 'chat with my repo' sessions practical rather than wallet-melting.
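To get the most out of caching in a follow-up session, it helps to keep the expensive part of the prompt byte-identical from call to call and append only the new turns after it. The sketch below assumes caching keys on an unchanged leading prefix, which is how prompt caches commonly work but is not spelled out here for M3; `build_messages` and `record_turn` are hypothetical helpers.

```python
SYSTEM_PROMPT = (
    "You are a staff engineer reviewing an unfamiliar codebase. "
    "Cite file paths and function names. Say 'not found' if unsure."
)


def build_messages(blob: str, history: list, question: str) -> list:
    """Keep instructions + source as one unchanging system message; append turns after it."""
    prefix = {
        "role": "system",
        "content": f"{SYSTEM_PROMPT}\n\nHere is the full source:\n\n{blob}",
    }
    return [prefix] + list(history) + [{"role": "user", "content": question}]


def record_turn(history: list, question: str, answer: str) -> list:
    """Return a new history with the latest question/answer pair appended."""
    return history + [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
```

Each follow-up call sends `build_messages(blob, history, next_question)`; after the reply arrives, update `history` with `record_turn`. Putting the question after the source, rather than before it as in the single-shot example, is the design choice that keeps the large prefix stable.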
Step 6 — Add an image (native multimodal)
M3 was trained with mixed-modality data from step zero, not with vision bolted on afterward, so it accepts text, images, and video as input (output is text only). The message shape is the same structured-content list OpenAI's vision API uses, so it slots into the same client.
resp = client.chat.completions.create(
    model="MiniMax-M3",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This is our architecture diagram. List the services and the "
                     "queues between them, then name the single point of failure."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/architecture.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
The endpoint accepts both image URLs and base64-encoded inputs. Combined with the 1M window, this is the genuinely new capability: you can hand M3 a long design doc and the diagrams it references in the same request and ask questions that span both.
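For local files, the base64 route usually means wrapping the bytes in a data URL. The helper below is a sketch that follows the data-URL convention used with OpenAI-style `image_url` parts; that M3 accepts exactly this shape is an assumption, and `image_part` is a hypothetical name.

```python
import base64


def image_part(data: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content part using a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}
```

Read the file in binary mode (`open(path, "rb")`) and drop the returned dict into the `content` list in place of the URL-based part. Keep in mind that base64 inflates the payload by about a third, so large diagrams add noticeably to request size.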
Common pitfalls and gotchas
This is where teams lose time. Read it before you ship.
- The 512K billing cliff. Input at or below 512K tokens is standard rate; above 512K jumps to a higher long-context rate. Always estimate tokens before sending, and design retrieval so routine calls stay under 512K. Reserve the full-1M tier for genuine whole-corpus tasks.
- Benchmarks are vendor-reported. M3's headline numbers (around 59% on SWE-Bench Pro, 83.5 on BrowseComp) come from MiniMax's own launch blog. Independent reruns are still thin and some real-world agentic reviews are more mixed. Treat the scores as a starting hypothesis, not gospel, and benchmark on your tasks.
- It is behind on pure code reasoning. Claude Opus 4.8 re-benchmarked at roughly 69% on SWE-Bench Pro versus M3's ~59%. If hard code reasoning is your bottleneck, that gap is real. M3 wins on cost and long-context multimodality, not on topping every leaderboard.
- Abstract reasoning is a weak spot. On ARC-AGI-2, M3 scores below 12%, in line with other Chinese frontier models. For novel-pattern puzzle tasks, do not expect frontier-US-lab behavior.
- Open weights: verify before you plan a self-host. MiniMax committed to publishing the report and weights within ~10 days of June 1. That window is closing now, but confirm the M3 checkpoint is actually live on the MiniMaxAI Hugging Face org before architecting an air-gapped deployment. Until you see it listed, treat M3 as API-only.
- Output is text-only. M3 reads images and video but does not generate them. If you need image output, this is the wrong model.
- Long context is not free accuracy. A 1M window means the tokens fit; it does not guarantee the model weighs a needle buried at position 800,000 as heavily as your question. For precision-critical retrieval, still rank and place the most relevant material near the top of the prompt.
- Token estimates are approximations. The 4-chars-per-token heuristic is fine for staying clear of a tier boundary, but for exact billing use a real tokenizer and leave a safety margin.
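For the advice about placing the most relevant material near the top, even a crude lexical ranking beats alphabetical file order. The sketch below counts question-word hits per document; it is a deliberately simple stand-in for a real retriever, and both function names are hypothetical.

```python
import re


def relevance(question: str, text: str) -> int:
    """Count words in text that also appear in the question (3+ characters)."""
    terms = set(re.findall(r"[a-z_]{3,}", question.lower()))
    words = re.findall(r"[a-z_]{3,}", text.lower())
    return sum(1 for w in words if w in terms)


def order_by_relevance(question: str, docs: dict) -> list:
    """Return document names, most relevant first; ties fall back to name order."""
    return sorted(docs, key=lambda name: (-relevance(question, docs[name]), name))
```

Feed the resulting order into your packer instead of a sorted glob, so that if the budget runs out it is the least relevant files that get dropped.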
Quick reference
| Item | Value |
|---|---|
| Model name (API) | MiniMax-M3 |
| MiniMax base URL | https://api.minimax.io/v1 |
| OpenRouter slug | minimax/minimax-m3 |
| Context window | Up to 1,000,000 tokens (>=512K guaranteed) |
| Billing cliff | <=512K standard rate; >512K long-context rate |
| Architecture | GQA backbone + MSA block-sparse attention |
| 1M-context compute | ~1/20th per-token vs prior gen |
| Speedups at 1M | ~9.7x prefill, ~15.6x decode |
| Modalities in | Text, image, video (text-only out) |
| Thinking mode | Toggle on/off, same price |
| Prompt caching | Automatic, on by default |
| Open weights | Releasing ~10 days post-June-1; verify on HF MiniMaxAI |
Next steps
- Swap the char-count heuristic for a real tokenizer and wire a hard pre-flight check that refuses any call projected to cross 512K unless you opt in.
- Build a 'chat with my repo' loop: pack once, then ask follow-up questions, leaning on automatic prompt caching to keep the recurring cost low.
- A/B M3 against Claude Opus 4.8 and an open-weight peer (DeepSeek V4, Kimi K2.7) on your own agentic tasks, measuring cost-per-solved-task, not just benchmark scores.
- Watch the MiniMaxAI Hugging Face org for the M3 checkpoint, then trial a self-host with vLLM for regulated or air-gapped workloads.
- Stress-test long-context recall with your own needle-in-a-haystack probes at 256K, 512K, and 1M before trusting it in production.
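For the recall stress test in the last item, you need haystacks of a known size with a fact planted at a known depth. Here is a minimal probe builder, sized in characters; `build_probe` is a hypothetical helper, and the needle and filler text are yours to choose.

```python
def build_probe(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Build a haystack of about total_chars with the needle at a fractional depth."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    if not filler:
        raise ValueError("filler must be non-empty")
    body_len = max(total_chars - len(needle), 0)
    repeats = body_len // len(filler) + 1
    body = (filler * repeats)[:body_len]   # repeat filler, then trim to size
    cut = int(body_len * depth)            # 0.0 = start of prompt, 1.0 = end
    return body[:cut] + needle + body[cut:]
```

Using the same rough 4-characters-per-token estimate as the packer, `total_chars` of about 1,024,000, 2,048,000, and 4,000,000 approximate the 256K, 512K, and 1M probes. Sweep `depth` across several values at each size and check whether the model can quote the needle back.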
Long context stopped being a marketing number with M3. MSA makes a million tokens affordable enough to actually send, and the open-weight release means you may soon run it yourself. Start with the 512K tier on real work, measure on your own tasks, and reach for the full 1M window when the job truly needs the whole corpus at once.