daily-hour-news

🛠️ DiffusionGemma: Google's 857 tok/s Diffusion LLM

TL;DR

Simon Willison dug into DiffusionGemma, an experimental diffusion-based Gemini model Google briefly released that generated 857 tokens per second. His June 10 writeup shows why diffusion text models could reset latency expectations for local inference.

Key Points

1. The model is diffusion-based rather than autoregressive, and it clocked 857 tokens per second.

2. Google released access briefly, then pulled it; Willison captured the benchmark.

3. Diffusion decoding generates tokens in parallel rather than strictly left to right.

4. The writeup was published June 10, 2026 on simonwillison.net.
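
To make the "parallel rather than strictly left to right" point concrete, here is a toy sketch that counts model calls under the two decoding styles. It is not DiffusionGemma's actual algorithm, which the article does not describe. `toy_model`, the confidence scores, and the `per_step` parameter are all hypothetical stand-ins chosen for illustration.

```python
import random

MASK = None  # marks a position that has not been decoded yet


def toy_model(tokens, target, rng):
    """Hypothetical stand-in for a network (assumption, not a real API).

    One call returns a (token, confidence) guess for every masked position.
    It cheats by reading the target, so only the call count is meaningful.
    """
    return {i: (target[i], rng.random()) for i, t in enumerate(tokens) if t is MASK}


def autoregressive_decode(target, rng):
    """Commit exactly one token per model call, strictly left to right."""
    tokens = [MASK] * len(target)
    passes = 0
    for i in range(len(target)):
        guesses = toy_model(tokens, target, rng)
        passes += 1
        tokens[i] = guesses[i][0]
    return tokens, passes


def parallel_unmask_decode(target, per_step, rng):
    """Commit the `per_step` most confident masked positions per model call."""
    tokens = [MASK] * len(target)
    passes = 0
    while any(t is MASK for t in tokens):
        guesses = toy_model(tokens, target, rng)
        passes += 1
        best = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:per_step]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens, passes


target = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
ar_tokens, ar_passes = autoregressive_decode(target, rng)
par_tokens, par_passes = parallel_unmask_decode(target, per_step=3, rng=rng)
print(f"autoregressive passes: {ar_passes}, parallel passes: {par_passes}")
```

With nine tokens and three positions committed per step, the parallel loop needs three model calls where the left-to-right loop needs nine. A real diffusion model pays more per call and has to earn its quality, so the call count alone does not predict wall-clock speed.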

Why It Matters

If diffusion LLMs hold quality at these speeds, the latency math for on-device assistants and tight agent loops changes overnight.
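
The latency math is simple division, sketched below. Only the 857 tokens/second figure comes from the article. The 50 tokens/second baseline and the reply sizes are assumptions picked for comparison, and the calculation ignores prompt processing and time to first token.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady decode rate (ignores prompt processing)."""
    return num_tokens / tokens_per_second


DIFFUSION_TPS = 857.0  # figure reported in the article
BASELINE_TPS = 50.0    # assumption: an illustrative autoregressive rate, not a measurement

# Assumed workloads, for illustration only.
workloads = [
    ("short reply", 100),
    ("long reply", 500),
    ("10-step agent loop, 200 tokens per step", 2000),
]

for label, num_tokens in workloads:
    fast = generation_seconds(num_tokens, DIFFUSION_TPS)
    slow = generation_seconds(num_tokens, BASELINE_TPS)
    print(f"{label}: {fast:.2f}s at 857 tok/s vs {slow:.2f}s at 50 tok/s")
```

Under these assumptions a 2,000-token agent loop drops from 40 seconds of generation to a little over 2, which is the kind of shift the claim above refers to.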

Quick Facts

DiffusionGemma · Google · diffusion models · inference · Simon Willison · local LLM
