DiffusionGemma: Google's 857 tok/s Diffusion LLM

ContentBuffer

daily-hour-news·Jun 13, 2026


TL;DR

Simon Willison dug into DiffusionGemma, an experimental diffusion-based Gemini model that Google briefly released and that generated 857 tokens per second. His June 10 writeup shows why diffusion text models could reset latency expectations for local inference.

Key Points

1. A diffusion-based text model, rather than an autoregressive one, clocked 857 tokens/second
2. Google released access briefly, then pulled it; Willison captured the benchmark
3. Diffusion decoding generates tokens in parallel rather than strictly left to right
4. Writeup published June 10, 2026 on simonwillison.net
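To make the "parallel rather than strictly left to right" point concrete, here is a toy sketch that counts model calls under each decoding style. It is not DiffusionGemma's sampler, which the writeup summarized here does not describe. The function names, the `per_step` parameter, and the trick of copying from a known target sequence instead of calling a real model are all assumptions made for illustration.

```python
import random

MASK = "_"  # placeholder for a position that has not been filled in yet


def autoregressive_decode(target):
    """Left to right: each simulated model call produces exactly one token."""
    out, calls = [], 0
    for tok in target:
        out.append(tok)  # stand-in for one forward pass predicting the next token
        calls += 1
    return out, calls


def parallel_unmask_decode(target, per_step, seed=0):
    """Toy diffusion-style loop: start fully masked, then fill up to
    `per_step` positions per simulated model call, in arbitrary order."""
    rng = random.Random(seed)
    out = [MASK] * len(target)
    masked = list(range(len(target)))
    calls = 0
    while masked:
        rng.shuffle(masked)  # positions are not resolved left to right
        chosen, masked = masked[:per_step], masked[per_step:]
        for i in chosen:
            out[i] = target[i]  # stand-in for the model committing a token
        calls += 1
    return out, calls


tokens = "the quick brown fox jumps over the lazy dog".split()
ar_out, ar_calls = autoregressive_decode(tokens)
par_out, par_calls = parallel_unmask_decode(tokens, per_step=3)
print(ar_calls, par_calls)  # prints: 9 3
```

Both loops end with the same nine tokens, but the parallel loop gets there in three simulated calls instead of nine. Real diffusion models also have to decide which positions to commit at each step and may revise earlier guesses, which this sketch leaves out.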

Why It Matters

If diffusion LLMs hold quality at these speeds, the latency math for on-device assistants and tight agent loops changes overnight.
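The latency math can be worked through with simple division. The sketch below uses the 857 tokens/second figure from the article. The 50 tokens/second baseline, the response lengths, and the ten-step agent loop are hypothetical numbers chosen only for comparison, and the calculation ignores prompt processing, network time, and tool-call latency.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady decode rate (generation only)."""
    return num_tokens / tokens_per_second


DIFFUSION_TPS = 857.0  # figure reported in the article
BASELINE_TPS = 50.0    # hypothetical autoregressive rate, for comparison only

# A single 500-token reply.
reply_fast = generation_seconds(500, DIFFUSION_TPS)
reply_slow = generation_seconds(500, BASELINE_TPS)

# A hypothetical agent loop: 10 steps of 200 generated tokens each.
loop_tokens = 10 * 200
loop_fast = generation_seconds(loop_tokens, DIFFUSION_TPS)
loop_slow = generation_seconds(loop_tokens, BASELINE_TPS)

print(f"500-token reply: {reply_fast:.2f}s vs {reply_slow:.2f}s")
print(f"10-step agent loop: {loop_fast:.2f}s vs {loop_slow:.2f}s")
```

Under these assumptions a 500-token reply drops from 10 seconds to roughly 0.58 seconds, and the ten-step loop from 40 seconds to roughly 2.33 seconds of generation time.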

Quick Facts

DiffusionGemma, Google, diffusion models, inference, Simon Willison, local LLM

