🛠️ DiffusionGemma: Google's 857 tok/s Diffusion LLM
TL;DR
Simon Willison dug into DiffusionGemma, an experimental diffusion-based Gemini model that Google briefly released and that generated 857 tokens per second in his test. His June 10 writeup shows why diffusion text models could reset latency expectations for local inference.
Key Points
A diffusion-based text model, rather than an autoregressive one, clocked 857 tokens per second
Google released access briefly, then pulled it; Willison captured the benchmark
Diffusion decoding generates tokens in parallel rather than strictly left to right
Writeup published June 10, 2026 on simonwillison.net
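To make the parallel-versus-sequential point concrete, here is a toy sketch in Python. It is not Google's decoder or any real model: both functions are hypothetical stand-ins that simply copy a known target, and the only thing they illustrate is how many model passes each style would need. The names `autoregressive_decode`, `parallel_unmask_decode`, and the `steps` parameter are assumptions made for this example.

```python
import math

MASK = "_"  # placeholder for a position that has not been decoded yet


def autoregressive_decode(target):
    """Toy stand-in: emit one token per model pass, strictly left to right."""
    out, passes = [], 0
    for tok in target:
        out.append(tok)  # one pass yields exactly one new token
        passes += 1
    return out, passes


def parallel_unmask_decode(target, steps):
    """Toy stand-in: start fully masked and commit a batch of positions per pass.

    Positions are committed in index order here only for simplicity; the order
    in a real diffusion decoder is not something this sketch models.
    """
    out = [MASK] * len(target)
    per_pass = math.ceil(len(target) / steps)
    masked = list(range(len(target)))
    passes = 0
    while masked:
        batch, masked = masked[:per_pass], masked[per_pass:]
        for i in batch:  # every position in the batch is filled in the same pass
            out[i] = target[i]
        passes += 1
    return out, passes


tokens = "diffusion models refine many positions per step".split()
ar_out, ar_passes = autoregressive_decode(tokens)
df_out, df_passes = parallel_unmask_decode(tokens, steps=3)

print(f"autoregressive passes: {ar_passes}")   # 7, one per token
print(f"parallel-unmask passes: {df_passes}")  # 3, fixed by the step count
```

The sequential version needs a pass for every token, while the parallel version needs only as many passes as it has refinement steps, which is the intuition behind the high tokens-per-second figure. Whether quality holds up at a low step count is a separate question this sketch does not address.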
Why It Matters
If diffusion LLMs hold quality at these speeds, the latency math for on-device assistants and tight agent loops changes overnight.
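A rough back-of-the-envelope calculation shows what that latency math could look like. Only the 857 tokens per second figure comes from the writeup; the baseline rate, response length, and number of agent steps below are hypothetical values chosen for illustration, and the sketch ignores prompt processing, network time, and time to first token.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


DIFFUSION_TPS = 857.0   # rate reported in the writeup
BASELINE_TPS = 50.0     # hypothetical autoregressive rate, for comparison only
RESPONSE_TOKENS = 500   # hypothetical response length
AGENT_STEPS = 10        # hypothetical number of sequential model calls in a loop

fast = generation_seconds(RESPONSE_TOKENS, DIFFUSION_TPS)
slow = generation_seconds(RESPONSE_TOKENS, BASELINE_TPS)

print(f"per response: {fast:.2f} s vs {slow:.2f} s")
print(f"per {AGENT_STEPS}-step loop: {fast * AGENT_STEPS:.1f} s vs {slow * AGENT_STEPS:.1f} s")
```

Under these assumptions a 500-token response drops from 10 seconds to roughly 0.58 seconds, and a ten-step agent loop from 100 seconds to under 6. The gap compounds in agent loops because each step waits on the previous one.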