Mercury 2 dLLM: Reasoning at 1000 Tokens Per Second

Kodetra Technologies · 8 min read · Intermediate

Summary

Build real-time agents on the first reasoning diffusion LLM: OpenAI-compatible, 1000 tok/s.

Almost every production LLM you have ever called shares one habit: it writes left to right, one token at a time. That sequential decoding is the quiet tax on every agent loop, every retrieval pipeline, every voice turn. Inception's Mercury 2 breaks that habit. It is the first reasoning diffusion LLM (a dLLM), and instead of decoding one token after another it refines many tokens in parallel, converging on the answer in a handful of denoising steps.

The headline number: 1,009 tokens per second on NVIDIA Blackwell GPUs, roughly 5x faster than speed-optimized autoregressive models like Claude Haiku 4.5 or GPT-5 mini, at $0.25 per million input tokens and $0.75 per million output tokens. It keeps reasoning quality while fitting inside real-time latency budgets, which is exactly the constraint that has been killing voice agents and multi-step tool loops.

The best part for builders: Mercury 2 speaks the OpenAI API. You change a base URL and a model string, and your existing agent code runs unchanged. This guide walks you from your first request to a real-time tool-using agent loop, with runnable code, real latency numbers, and the gotchas nobody warns you about.


Prerequisites

  • Python 3.9+ and pip (the examples are Python; TypeScript works the same way).
  • An Inception Platform account. Every new account is granted 10 million free tokens, so you can finish this entire guide without paying anything.
  • An API key from the Inception dashboard, exported as an environment variable.
  • Basic familiarity with chat-completion APIs (messages, roles, tool calls). If you have ever called OpenAI or Anthropic, you are ready.

Why diffusion decoding is fast (the 60-second version)

An autoregressive model produces token N, feeds it back in, then produces token N+1. The wall-clock cost of a response scales with its length because the steps are strictly serial. A diffusion LLM starts from a rough, masked draft of the whole answer and runs a small number of parallel refinement passes over it, sharpening many positions at once. Think less typewriter, more an editor revising a full draft in a few sweeps.
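As a rough illustration of that contrast, the toy model below counts sequential forward passes only. It assumes one pass per token for autoregressive decoding and a fixed number of refinement passes for diffusion. The value of 8 passes is a made-up placeholder, not a Mercury 2 internal, and real systems may refine long outputs block by block, so treat this as a sketch of the scaling argument rather than a performance model.

```python
def autoregressive_passes(num_tokens: int) -> int:
    """One sequential forward pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens: int, refinement_steps: int = 8) -> int:
    """A fixed number of parallel refinement passes over the whole draft.

    refinement_steps=8 is a hypothetical value chosen for illustration.
    """
    return refinement_steps if num_tokens > 0 else 0


for n in (50, 500, 5000):
    print(f"{n:>5} tokens: AR={autoregressive_passes(n):>5} passes, "
          f"diffusion={diffusion_passes(n)} passes")
```

In this simplified view the serial step count grows with output length for the autoregressive case and stays flat for the diffusion case, which is the intuition behind the throughput claim.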

Two practical consequences follow. First, throughput is high and, just as important, more stable under concurrency, because you are not paying per-token serial latency. Second, the usual intelligence-versus-latency trade-off softens: you can dial up reasoning effort and still land inside a real-time budget. Mercury 2 exposes that dial directly through a reasoning_effort parameter, which you will use constantly.


Step 1: Install and set your key

Create an account, generate a key in the dashboard, then export it. Mercury 2 ships a first-party SDK, but it is also OpenAI-compatible, so you have options.

# Set your key (macOS / Linux)
export INCEPTION_API_KEY="your_api_key_here"

# Install the first-party SDK...
pip install inceptionai

# ...or just reuse the OpenAI client you already have
pip install openai

Step 2: Your first request

All requests go to https://api.inceptionlabs.ai/v1. The model string is mercury-2. Here it is with the native SDK:

from inceptionai import Inception

client = Inception()  # reads INCEPTION_API_KEY from the environment

completion = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain a diffusion LLM in two sentences."}],
    reasoning_effort="medium",
    temperature=0.75,
    max_tokens=8192,
)
print(completion.choices[0].message.content)

Example output:

A diffusion LLM generates text by starting with a noisy, masked draft of the
entire response and iteratively refining all positions in parallel over a few
steps, rather than predicting one token at a time. This lets it produce
coherent output far faster than autoregressive models while keeping quality high.

Prefer not to add a dependency? Here is the equivalent call through the OpenAI client, which is how you drop Mercury 2 into an existing codebase. It omits reasoning_effort and temperature, so the defaults apply:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",  # <- the only real change
)

resp = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Explain a diffusion LLM in two sentences."}],
    max_tokens=1000,
)
print(resp.choices[0].message.content)
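If several modules build a client, one option is to centralize the two settings so the base URL cannot be forgotten. The helper below is a minimal sketch: the function name and the fail-fast behaviour are choices made for this example, not part of the OpenAI or Inception SDKs. It only assembles keyword arguments, so it runs without network access.

```python
import os

INCEPTION_BASE_URL = "https://api.inceptionlabs.ai/v1"  # base URL from this guide


def inception_client_kwargs(env=None):
    """Return keyword arguments for OpenAI(...), or raise if the key is unset.

    Hypothetical helper; `env` defaults to os.environ and exists so the
    function can be exercised with a plain dict.
    """
    env = os.environ if env is None else env
    key = env.get("INCEPTION_API_KEY", "").strip()
    if not key:
        raise RuntimeError("INCEPTION_API_KEY is not set")
    return {"api_key": key, "base_url": INCEPTION_BASE_URL}


# Usage (requires `pip install openai`):
#   from openai import OpenAI
#   client = OpenAI(**inception_client_kwargs())
```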

Step 3: Tune reasoning_effort for latency

This is the parameter that makes Mercury 2 different from a plain speed model. reasoning_effort takes four values: instant, low, medium (the default), and high. Lower settings spend fewer refinement steps and return faster; high buys deeper thinking for hard problems. Because the floor is so fast, even high usually stays inside an interactive budget.

Measure it yourself instead of trusting a benchmark. This loop times each tier on the same prompt:

import os, time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

prompt = "A train leaves at 3:15pm going 60mph. A second leaves 20 min later \
at 75mph on the same track. When does the second catch the first? Show steps."

for effort in ["instant", "low", "medium", "high"]:
    t0 = time.perf_counter()  # monotonic clock; safer for intervals than time.time()
    r = client.chat.completions.create(
        model="mercury-2",
        messages=[{"role": "user", "content": prompt}],
        reasoning_effort=effort,
        max_tokens=2048,
    )
    dt = time.perf_counter() - t0
    out_tok = r.usage.completion_tokens
    print(f"{effort:<8} {dt:5.2f}s  {out_tok:>4} out tok  "
          f"{out_tok/dt:6.0f} tok/s")

Representative result (your numbers vary with load and output length):

instant   0.31s    142 out tok    458 tok/s
low       0.44s    229 out tok    520 tok/s
medium    0.71s    498 out tok    701 tok/s
high      1.36s   1041 out tok    765 tok/s

Notice the shape: even the deepest setting answers a multi-step word problem in well under two seconds. The rule of thumb that works in practice is to default to medium, drop to instant or low for autocomplete and classification, and reserve high for genuinely hard reasoning or code.
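That rule of thumb can be written down as a small lookup so every route picks its effort the same way. The task labels and the exact split between instant and low below are assumptions made for this sketch; adjust them after measuring your own routes.

```python
# Hypothetical task labels; the mapping follows the rule of thumb above.
EFFORT_BY_TASK = {
    "autocomplete": "instant",
    "classification": "low",
    "chat": "medium",
    "hard_reasoning": "high",
    "code": "high",
}


def pick_effort(task: str) -> str:
    """Return a reasoning_effort value, falling back to medium for unknown tasks."""
    return EFFORT_BY_TASK.get(task, "medium")


print(pick_effort("classification"))
print(pick_effort("something_else"))
```

The result can be passed straight through as reasoning_effort=pick_effort(task) in the calls shown earlier.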


Step 4: Stream tokens as they refine

Streaming works exactly like the OpenAI API: pass stream=True and iterate over chunks. With Mercury 2 the stream arrives in bursts because tokens are refined in parallel, which feels different from the steady left-to-right drip you are used to.

stream = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content": "Write a haiku about parallel decoding."}],
    reasoning_effort="low",
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # some OpenAI-compatible servers send chunks with no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
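If a UI expects small, evenly sized updates, one way to adapt to bursty deltas is to re-slice them before rendering. The generator below is a sketch under that assumption: rechunk is a hypothetical helper, the sample bursts are invented stand-ins for delta.content values, and any pacing between pieces is left to the renderer.

```python
from typing import Iterable, Iterator


def rechunk(deltas: Iterable[str], size: int = 4) -> Iterator[str]:
    """Split bursty text deltas into pieces of at most `size` characters."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for delta in deltas:
        for start in range(0, len(delta), size):
            yield delta[start:start + size]


# Simulated bursts standing in for chunk.choices[0].delta.content values.
bursts = ["Many tokens ", "land at once,", " then a pause."]
pieces = list(rechunk(bursts, size=5))
print(pieces)
```

The text is unchanged when the pieces are joined back together; only the granularity of the updates differs.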

Step 5 (worked example): a real-time tool-using agent loop

Mercury 2 supports native tool use, so you can build the kind of agent loop where latency normally compounds painfully. Each turn the model either calls a tool or answers; you run the tool, feed the result back, and continue. Because every hop is sub-second, you can afford more hops. Here is a complete, self-contained loop with two tools.

import os, json
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["INCEPTION_API_KEY"],
    base_url="https://api.inceptionlabs.ai/v1",
)

# --- the actual functions the agent can run ---
def get_stock_price(ticker):
    fake = {"NVDA": 174.2, "AAPL": 228.5, "TSLA": 410.9}
    return {"ticker": ticker, "price": fake.get(ticker.upper(), 100.0)}

def convert_currency(amount, rate):
    return {"converted": round(amount * rate, 2)}

TOOLS = [
    {"type": "function", "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a stock ticker.",
        "parameters": {"type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"]}}},
    {"type": "function", "function": {
        "name": "convert_currency",
        "description": "Convert an amount by a given FX rate.",
        "parameters": {"type": "object",
            "properties": {"amount": {"type": "number"},
                           "rate": {"type": "number"}},
            "required": ["amount", "rate"]}}},
]

DISPATCH = {"get_stock_price": get_stock_price,
            "convert_currency": convert_currency}

messages = [{"role": "user", "content":
    "What is 10 shares of NVDA worth in euros if 1 USD = 0.92 EUR?"}]

for step in range(6):  # safety cap on the loop
    r = client.chat.completions.create(
        model="mercury-2",
        messages=messages,
        tools=TOOLS,
        reasoning_effort="medium",
    )
    msg = r.choices[0].message
    messages.append(msg)

    if not msg.tool_calls:
        print("FINAL:", msg.content)
        break

    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = DISPATCH[call.function.name](**args)
        print(f"  -> {call.function.name}({args}) = {result}")
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
else:
    # Runs only if the loop never hit `break`, i.e. the step cap was reached.
    print("Stopped: step cap reached without a final answer.")

Example run:

  -> get_stock_price({'ticker': 'NVDA'}) = {'ticker': 'NVDA', 'price': 174.2}
  -> convert_currency({'amount': 1742.0, 'rate': 0.92}) = {'converted': 1602.64}
FINAL: 10 shares of NVDA at $174.20 is $1,742.00, which converts to about EUR 1,602.64 at a rate of 0.92 EUR per USD.

Two tool calls and a final synthesis add up to three model round trips. With each hop sub-second, the whole exchange can finish in about the time one slower autoregressive call might take. That is the unlock: the agent can take more steps without the user feeling the wait.
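The loop above trusts the model to name a known tool and to send valid arguments. A more defensive dispatch step might look like the sketch below, which reports failures back as a JSON error so the model can retry instead of the loop crashing. The run_tool_call name and the {"error": ...} shape are assumptions of this example, not a Mercury 2 or OpenAI convention.

```python
import json


def run_tool_call(name, arguments_json, dispatch):
    """Run one tool call and always return a JSON string for the tool message."""
    func = dispatch.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        if not isinstance(args, dict):
            raise TypeError("arguments must be a JSON object")
        return json.dumps(func(**args))
    except (TypeError, ValueError) as exc:
        # ValueError covers malformed JSON; TypeError covers bad or missing arguments.
        return json.dumps({"error": str(exc)})


def convert_currency(amount, rate):
    return {"converted": round(amount * rate, 2)}


dispatch = {"convert_currency": convert_currency}
print(run_tool_call("convert_currency", '{"amount": 1742.0, "rate": 0.92}', dispatch))
print(run_tool_call("get_weather", "{}", dispatch))
print(run_tool_call("convert_currency", '{"amount": 1742.0}', dispatch))
```

Inside the agent loop, the returned string would become the "content" of the tool message for that call.id.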


Step 6: Schema-aligned JSON output

When the output feeds another system, ask for JSON that matches a schema rather than parsing prose. Mercury 2 supports response_format with a JSON schema, the same shape as the OpenAI structured-outputs API.

import json

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "topics": {"type": "array", "items": {"type": "string"}},
        "urgency": {"type": "integer"},
    },
    "required": ["sentiment", "topics", "urgency"],
    "additionalProperties": False,
}

r = client.chat.completions.create(
    model="mercury-2",
    messages=[{"role": "user", "content":
        "Triage this ticket: 'Checkout has been throwing 500s for an hour, \
we are losing sales right now.'"}],
    reasoning_effort="low",
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "triage", "schema": schema, "strict": True},
    },
)
print(json.loads(r.choices[0].message.content))
# {'sentiment': 'negative', 'topics': ['checkout', 'outage', 'revenue'], 'urgency': 5}

Because the call runs at low effort and sub-second latency, you can put this on the hot path of a support queue and classify thousands of tickets without building a batch job.
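On a hot path it can still be worth checking the parsed object before acting on it, in case a response is truncated or a schema drifts. The validator below is a standard-library sketch that mirrors the triage schema by hand; validate_triage is a hypothetical helper, and a full JSON Schema library would be the more general choice.

```python
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}
EXPECTED_KEYS = {"sentiment", "topics", "urgency"}


def validate_triage(obj):
    """Return a list of problems; an empty list means the object looks valid."""
    if not isinstance(obj, dict):
        return ["not a JSON object"]
    problems = []
    if set(obj) != EXPECTED_KEYS:
        problems.append(f"missing or unexpected keys: {sorted(set(obj) ^ EXPECTED_KEYS)}")
    if obj.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append("sentiment outside the enum")
    topics = obj.get("topics")
    if not (isinstance(topics, list) and all(isinstance(t, str) for t in topics)):
        problems.append("topics must be a list of strings")
    urgency = obj.get("urgency")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(urgency, int) or isinstance(urgency, bool):
        problems.append("urgency must be an integer")
    return problems


sample = {"sentiment": "negative", "topics": ["checkout", "outage", "revenue"], "urgency": 5}
print(validate_triage(sample))
```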


Common pitfalls and gotchas

  • Treating reasoning_effort like a temperature knob. It controls the number of refinement steps, not randomness. Raise it for hard reasoning, not to make output 'more creative' — use temperature for that.
  • Forgetting the base_url with the OpenAI client. If you only change the model string and leave the default OpenAI endpoint, you will hit OpenAI's API, not Inception's. The base URL https://api.inceptionlabs.ai/v1 is the switch.
  • Assuming streaming is smooth. Diffusion output arrives in parallel bursts, not a steady drip. If your UI assumes one-token-at-a-time pacing, the cadence will look different. It is a rendering assumption, not a bug.
  • Over-budgeting max_tokens. The model is fast enough that a huge max_tokens rarely hurts latency, but it does raise your cost ceiling. The default of 8192 is fine for most chat; trim it for classification.
  • Skipping the tool-call loop cap. Any agent loop can in principle keep calling tools. Keep a hard step cap (the example uses 6) so a confused run cannot spin forever.
  • Expecting frontier-max benchmark scores. Mercury 2 targets the speed-optimized tier (competitive with fast models), not the absolute top of a reasoning leaderboard. Pick it when latency and cost per call matter most.
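To put a number on the max_tokens pitfall, the sketch below computes a per-call cost ceiling from the prices quoted in this guide. It assumes billed output never exceeds max_tokens, and the 500-token prompt is an arbitrary example size.

```python
INPUT_USD_PER_MTOK = 0.25   # pricing quoted in this guide
OUTPUT_USD_PER_MTOK = 0.75


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call at the quoted per-million-token prices."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Worst case for one call: a 500-token prompt that uses the full max_tokens.
for max_tokens in (256, 8192):
    print(f"max_tokens={max_tokens:>5}: up to ${cost_usd(500, max_tokens):.6f}")
```

Multiplying the ceiling by expected call volume gives a quick upper bound for a route before it ships.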

Quick reference

Item              Value
Base URL          https://api.inceptionlabs.ai/v1
Model string      mercury-2
Auth header       Authorization: Bearer $INCEPTION_API_KEY
reasoning_effort  instant | low | medium (default) | high
Defaults          temperature 0.75, max_tokens 8192
Context window    128K tokens (max output 50K)
Pricing           $0.25 / 1M input, $0.75 / 1M output
Speed             ~1,009 tok/s on Blackwell; 5x+ faster than fast AR models
Compatibility     OpenAI API; works with LangChain, LiteLLM, Vercel AI SDK
Free tier         10M tokens on signup
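For environments without any SDK, the reference values above are enough to assemble a raw HTTP request. The sketch below builds the request with the standard library but does not send it. The /chat/completions path is assumed from the OpenAI API convention rather than confirmed by this guide, and build_chat_request is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "https://api.inceptionlabs.ai/v1"


def build_chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    body = {
        "model": "mercury-2",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": "low",
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",  # path assumed from the OpenAI convention
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("your_api_key_here", "Say hello.")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req), then json.load() the response.
```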

Next steps

  • Swap Mercury 2 into an existing LangGraph or CrewAI agent by pointing the OpenAI-compatible model at the Inception base URL, and measure the end-to-end latency drop across a full task.
  • Build a voice loop: pair Mercury 2 at reasoning_effort=instant with a streaming TTS to stay inside natural speech cadence.
  • Push the tool loop further — add retrieval as a tool and let the agent take more hops now that each one is cheap.
  • A/B two effort levels behind the same endpoint and log latency plus a quality eval to find the right setting per route.
  • Read Inception's posts on real-time subagents to see how sub-second inference changes agent architecture, not just speed.
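For the A/B step above, a small summary function is often enough to compare two effort levels once latencies are logged. The sketch below reports median and 95th-percentile latency with the standard library. The sample values are synthetic placeholders for illustration, not measurements of Mercury 2.

```python
import statistics


def summarize(latencies_s):
    """Return p50 and p95 latency in seconds from a list of samples."""
    if len(latencies_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_s), "p95": cuts[94]}


# Synthetic placeholder samples, one list per effort level under test.
logged = {
    "low": [0.40, 0.42, 0.45, 0.47, 0.90],
    "medium": [0.65, 0.70, 0.72, 0.75, 1.30],
}
for effort, samples in logged.items():
    stats = summarize(samples)
    print(f"{effort:<8} p50={stats['p50']:.2f}s  p95={stats['p95']:.2f}s")
```

Pairing these numbers with a quality score per route gives both sides of the trade-off in one log line.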
