Why The World’s AI Will Run on Diffusion Models

It's commonly accepted that there isn't enough data center compute to supply AI to the world in a way that feels equally distributed and not hoarded by those who can afford the high prices.

And with the increasing shenanigans AI Labs are pulling on customers, even flirting with sabotage and prohibiting most of us from using the best models, it's more important than ever that edge AI, the AI running on our personal devices, is great.

I now believe diffusion models are a great answer and will play a vital role in the future because they offer something our current mainstream AIs can't at sizes small enough to run on our own devices.

The Promise of a Personal AI

Currently, the vast majority of the AIs you use are located hundreds of miles away from you, or, if you're not from the US, continents apart, in ugly-looking buildings on the East Coast of the US or in States like Arizona or Texas.

Alternatively, if you don't mind a little bit of sneak peeking by the Chinese Communist Party, you can run your AIs using Chinese servers.

For what it's worth, I don't necessarily trust US AI Labs with my data either, especially when they make decisions like prohibiting zero data retention for certain models in case they want to run a "safety investigation".

The point is, there's a clear incentive not to send your requests and data to external servers or the open Internet, but to keep them on your personal devices.

And this stems not only from healthy skepticism about these companies, but also from the fact that every single piece of data leaving your device and entering the open Internet could be stolen by third-party bad actors.

Put another way, for equivalent performance, 100% of us would choose the local model option simply for privacy reasons.

Besides obvious privacy advantages, another advantage of local models is that you don't have to "fight" to get your request given priority to be processed in a data center, which, at demand peaks, usually means that responses become way slower.

Worse, responses can also drop in quality, because these Labs are very compute-constrained and may decide to do things like:

Switch the model that actually handles your request to a weaker one without telling you.
Reduce the thinking budget for your request to deliver it faster, but at a lower overall quality.

If you're accessing models programmatically, meaning you're not using the application (e.g., ChatGPT) but the API, you have SLAs (Service Level Agreements) that protect you from such performance variances to a certain degree, but when it comes to app subscriptions, it's literally the Wild West, and you never know what shenanigans they are pulling behind the scenes.

In fact, we're just a few days removed from one of the biggest controversies in the history of this industry. AI Labs feel so "untouchable" that Anthropic even suggested purposely sabotaging requests it didn't like, without telling you, which generated such a backlash that Anthropic had to backtrack on its unprecedented decision.

With local models, nothing about this matters. You're the sole user and have full priority.

No waiting times and, as strong as it sounds, no sabotaging!

All things considered, local models are extremely appealing. However, there's an issue: they are mostly terrible currently.

The performance gap

The hard truth is that, so far, the performance gap between frontier and mid-sized models, which can only be run at scale in the cloud, and local models has been too wide.

In fact, if you look at the Artificial Analysis Intelligence Index, only two models in the graph can be run locally, and you would still need a quite beefy laptop with at the very least 48GB of RAM, something 99.99% of personal devices don't have.

Models need to get better and smaller. And the other big issue is that even those models aren't ideal because they are simply too slow.

For example, if I run Gemma 4 31B on my high-end laptop, I can only get responses at 17 tokens/second, about three times slower than the generation speed you typically get from applications like ChatGPT, which is largely unacceptable to most people.

Therefore, despite their obvious advantages, local models are difficult to adopt because they are too weak and slow.

But things might finally be changing.

Why Small Diffusion Models are the Future

The open model release of DiffusionGemma by Google is particularly interesting for me because it shows us what I believe is the future of local models:

They are good
They are fast

Although the gap with the frontier is large and will persist, here's a thought: most tasks we need AIs for don't need frontier-level intelligence and prices, and the pace of progress suggests that, a year or two from now, we'll have Mythos-level models running on local computers.

In fact, HRM Text, a model I recently talked about that is so small it can run on your smartphone with ease, performs better than GPT-3.5, the model ChatGPT launched with in November 2022.

Despite having been trained on a budget of just $1,500 and being on the order of a million times smaller than the current model frontier, it's as good as a model that was state-of-the-art just a few years ago, was 175 billion parameters in size (175 times larger), and required an investment of millions of dollars to train.

That's an incredible rate of progress in intelligence-per-size, and it suggests that small models a year from now may not be Mythos-level, but they will very likely be at the level of models like Minimax M3, which is more than enough for most tasks you use frontier models for today.

For reference, on the aforementioned Artificial Analysis benchmark, Opus 4.8 scores around 6 points higher than Minimax M3 but costs roughly $4,600 to run, whereas the Chinese model needs only about $400.

The difficulty of those extra six points is not linear, of course, but I find it hard to justify a price difference of more than 10x.

Once we have models of such capacity running locally, Frontier Labs will struggle to give us reasons to buy their tokens.

However, to me, the biggest setback of using local models is speed. And this takes us to today's model, DiffusionGemma.

Small and fast

The key point about DiffusionGemma is in the name: it's a diffusion model.

In other words, unlike standard Large Language Models (LLMs), which generate one token (e.g., a word) per round to build an 'autoregressive' sequence, diffusion models take a more "immediate" approach.

They start from a noisy overall picture and iteratively "denoise" the canvas to "unearth" the response, churning out several words simultaneously, not just one.

I've always thought of this intuitively as sculpting: you take a huge block of marble and "unearth" the "hidden" statue by chiseling away the excess. As the great artist Michelangelo once said:

"The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material."

Diffusion is the same idea: you start from a noisy canvas and "unearth" the whole result at once rather than word by word. This means diffusion models progressively transform what's essentially noise into an actual result by performing several 'denoising' updates.
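The iterative idea can be sketched with a toy example. This is only an illustration under loud assumptions: the "model" is a stand-in that already knows the target sentence, and the reveal schedule (a fixed number of positions per step) is hypothetical, not a description of how DiffusionGemma actually denoises.

```python
import random

MASK = "_"  # placeholder for a position that is still "noise"

def toy_denoise(target, steps, seed=0):
    """Reveal a fully masked canvas over a fixed number of steps.

    Assumption: a real diffusion LLM predicts tokens with a neural network;
    here we simply copy them from `target` to show the control flow.
    """
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)  # positions are revealed in arbitrary order, not left to right
    per_step = -(-len(target) // steps)  # ceiling division
    history = []
    for _ in range(steps):
        # One "denoising update": several positions are filled in at once.
        for pos in hidden[:per_step]:
            canvas[pos] = target[pos]
        hidden = hidden[per_step:]
        history.append(" ".join(canvas))
    return history

if __name__ == "__main__":
    words = "the statue is already inside the marble block".split()
    for step, snapshot in enumerate(toy_denoise(words, steps=4), start=1):
        print(step, snapshot)
```

Each printed line is the canvas after one update: words appear all over the sentence at once, and the full sentence is there after the last step.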

Diffusion models introduce an unequivocal trade-off: they are much faster because they generate results globally, not sequentially, but in most cases, this comes at the expense of output quality.

Google itself mentions this: "For applications demanding maximum quality, we recommend deploying standard Gemma4".

But DiffusionGemma and other diffusion models do introduce one vital aspect that makes me particularly bullish about them: they massively reduce the memory bottleneck, one of the biggest issues in AI today.

But what is it?

Computers work by moving data from memory into a computer processor, which performs a series of operations, and most of the results and the input data are sent back to memory, creating a back-and-forth between the memory and the computer processor.

This means that the performance of your hardware depends on both the processing speed and the speed (or, dare I say, bandwidth) at which data moves in and out of memory.

As both are needed, the slowest of the two is the bottleneck. And in AI, especially inference, memory bandwidth is almost always the bottleneck, which is why AI inference is famously described as "memory-bound."

In practice, this means that, on average, processors are somewhat idle, or "waiting" for data to arrive. I always like to think of this as a factory process where one worker supplies the work and materials for the other to execute.

If the supplier is faster than the one executing the work, the executor's speed becomes the bottleneck.
If, instead, the one executing the work is much faster than the supplier, the supplier is the bottleneck.

And in AI, especially on inference, the sheer amount of work to be provided makes the supplier worker, the memory, the bottleneck.

For companies that make money by producing tokens, that idle time literally translates into lost revenue: while the executor waits for work to arrive, the workers are still in the factory and still getting paid, but nothing is coming out.

In other words, what you really want is for the compute worker to be the bottleneck. In that case, our hardware is generating the maximum revenue it can per second.

However, that's never the case. But why?

The culprit is the "autoregressive" nature of modern LLMs. This is easy to see if you think about how ChatGPT responds when you ask a question: it's a sequential generation of tokens, word by word.

Behind the scenes, here's what is going on: You push the model into the processor; the processor decides the next word, which is appended to the sequence, and the process repeats.

However, there's an issue: the model doesn't fit in the processor's on-chip memory as a whole and has to be pushed in chunks. Therefore, to load the next chunk, the previous chunk must be evicted so the next one fits.

In practice, this means that to generate each new token, the entire model is moved in and out of the processor, which is time-consuming.
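A rough back-of-the-envelope sketch of why this matters: if every weight has to cross the memory bus once per token, memory bandwidth alone sets a ceiling on generation speed. The function name, the 4-bit quantization, and the 300 GB/s bandwidth figure below are illustrative assumptions, not measurements of any specific laptop.

```python
def max_tokens_per_second(params_billions, bytes_per_param, bandwidth_gb_s):
    """Upper bound on autoregressive decoding speed when memory-bound.

    Assumes every weight is read once per generated token and ignores
    everything else (KV cache, activations, compute time).
    """
    # Billions of parameters times bytes per parameter gives gigabytes.
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb

if __name__ == "__main__":
    # Hypothetical setup: a 31B-parameter model quantized to 4 bits (0.5 bytes)
    # on a machine with 300 GB/s of memory bandwidth.
    print(round(max_tokens_per_second(31, 0.5, 300), 1))
```

Under these made-up numbers the ceiling is roughly 19 tokens per second, the same ballpark as the 17 tokens/second mentioned earlier, no matter how fast the processor itself is.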

I know this isn't particularly intuitive unless you're living and breathing AI all day, as I do, so think of this as a factory analogy again.

In reality, the previous factory analogy was idealized: the worker doesn't complete the job in one go and needs six different machines to do so.

These machines are heavy and big, so the supplier worker, the one who gives the executor the right machine every time, can only give them one every turn (here the analogy is that each machine is a chunk of the model).

So the process is as follows: The supplier brings the first machine to the executor. Once the executor is finished, the supplier takes that machine away and brings in the next, and so on.

And the problem with autoregressive LLMs like the ones you use all the time is that this expensive back-and-forth between machines only produces a single token each time; a single word (or screw, in the analogy).

All that machinery back and forth to produce a single screw.

But what if this process, instead of producing a single screw every time the executor went through the six machines, produced 256?

And that is the intuition that makes diffusion appealing.

Making the most out of every turn

AI chips are called 'accelerators' for a reason: they are designed to maximize the number of operations one can do per second.

To do so, they use parallelization, the ability to take workloads with parts that can be processed independently and do so in parallel.

The perfect example is matrix multiplication, which can be broken into tiles that are computed separately.
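To make the tiling idea concrete, here is a minimal pure-Python sketch. The function names and the tile size are my own choices for illustration. Real accelerators do this in hardware and in parallel, whereas this version only shows that the work splits into independent blocks and still gives the right answer.

```python
def matmul_naive(a, b):
    """Reference matrix product using plain nested loops."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)] for i in range(n)]

def matmul_tiled(a, b, tile=2):
    """Same product, computed one output tile at a time.

    Each (i0, j0) output tile depends only on a band of rows of `a` and a
    band of columns of `b`, so on real hardware different tiles could be
    handed to different compute units. Here they run one after another.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # This block of work is independent of every other (i0, j0) tile.
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for p in range(p0, min(p0 + tile, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c

if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
    b = [[1, 0], [0, 1], [2, 2]]
    print(matmul_tiled(a, b) == matmul_naive(a, b))
```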

We've learned that moving data into the GPU's compute units has a "cost" in time, and that the rate at which data can be sent into the chips, the memory bandwidth, is limited. So, to get the most out of the accelerator's parallelization potential, you have to design your workload so that you get the most out of every byte of data you push into the processors.

For this reason, the most important metric for accelerators and, by extension, for AI hardware, is arithmetic intensity, which is precisely that: the number of operations the hardware performs for every byte of data moved.

That is, if a GPU can nominally perform 100 operations per byte of data the processor sees and your workload has an arithmetic intensity of 10, you're leaving unforgivable amounts of compute on the table.

In other words, the goal is to keep the workload's arithmetic intensity as close to the hardware's ratio as possible. This way, we guarantee that processors are running at a good pace.
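As a quick sanity check on those numbers, here is a minimal sketch using a simplified roofline-style model. The function name is hypothetical, and the model assumes that below the hardware's ratio, achievable compute scales linearly with arithmetic intensity.

```python
def compute_utilization(workload_intensity, hardware_intensity):
    """Fraction of peak compute a workload can reach (simple roofline model).

    Both arguments are in operations per byte moved. Below the hardware's
    balance point the workload is memory-bound and utilization scales
    linearly; above it, compute is the limit and utilization caps at 1.0.
    """
    return min(1.0, workload_intensity / hardware_intensity)

if __name__ == "__main__":
    # The numbers from the text: hardware balanced at 100 ops/byte, workload at 10.
    print(compute_utilization(10, 100))   # 0.1, i.e. 90% of the compute sits idle
    print(compute_utilization(250, 100))  # 1.0, compute-bound
```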

Sadly, however, AI inference makes it almost impossible to achieve good arithmetic intensities given the hardware's limits (unless you're using SRAM-only chips like Cerebras, but let's not get into that).

And the reason is none other than the autoregressive nature of models. While GPUs are designed to get the most out of every byte of data, autoregressive LLMs are telling the GPU, "nah, I just want to produce one new token each time."

On the other hand, diffusion, by denoising an entire canvas of tokens simultaneously, enables you to produce hundreds every turn (in the case of DiffusionGemma, it's 256).
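A rough way to see the effect on arithmetic intensity, as a sketch under simplifying assumptions: a single user with no batching, 16-bit weights (2 bytes per parameter), and only the traffic from loading model weights counted. The function name is mine, not part of any library.

```python
def weight_arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """Approximate operations per byte of model weights loaded.

    Each parameter takes part in one multiply and one add (2 operations)
    for every token processed in the pass, but its bytes only need to be
    loaded once per pass. Activations and caches are ignored.
    """
    return 2.0 * tokens_per_pass / bytes_per_param

if __name__ == "__main__":
    print(weight_arithmetic_intensity(1))    # one token per pass
    print(weight_arithmetic_intensity(256))  # a 256-token canvas per pass
```

With these assumptions, one token per pass yields about 1 operation per byte, while a 256-token canvas yields about 256, because the same weights are reused for every position on the canvas.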

Careful, I'm not saying that every prediction pass churns out 256 finished tokens; each pass is one step in the denoising process. However, generation is still extremely fast because once the denoising updates finish, you get all 256 tokens at once.

The speed comparison is clear. For example, for a 30-step denoising process, meaning the diffusion model takes 30 steps to unearth the result from the noisy canvas, after those 30 steps, DiffusionGemma generates 256 tokens, while an identical autoregressive model would have generated just 30 tokens (one for each turn).
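The arithmetic behind that comparison, as a small sketch. It assumes one denoising step costs the same as one autoregressive step, which is a simplification, and the function name is purely illustrative.

```python
def tokens_after_passes(passes, canvas_size=256, denoising_steps=30):
    """Tokens produced by each approach after a number of forward passes.

    Assumes, as in the example above, that one denoising step costs the
    same as one autoregressive step, which is a simplification.
    """
    autoregressive = passes  # one new token per pass
    # A diffusion canvas only counts once all of its denoising steps are done.
    diffusion = (passes // denoising_steps) * canvas_size
    return autoregressive, diffusion

if __name__ == "__main__":
    ar, diff = tokens_after_passes(30)
    print(ar, diff, round(diff / ar, 1))
```

For 30 passes this gives 30 tokens versus 256, roughly an 8.5x difference in tokens per pass under these assumptions.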

So, diffusion is a trade-off, but a valuable one; you lose some of the performance in exchange for absurdly fast inference.

The Future is Diffused

While it's true that diffusion models usually represent a performance downgrade, they are reaching levels that warrant use.

Combined with their speed, I believe they will account for a large portion of our daily AI use, running locally and quickly.

Because, honestly, this industry desperately needs access to more compute. For example, of the 16.2 GW expected to be deployed this year for AI data centers, as of Q1, only 5 GW were under active construction, and only a tiny portion was already operational.

And of the 5.8 GW we should have seen in 2025, several GW were delayed for various reasons, including not only supply chain constraints but also social blockers (people resisting construction).

In my humble opinion, the writing is on the wall, and this industry is going to need a lot of edge computing, meaning our smartphones, laptops, and other personal devices, to make all this make sense.

AI Labs are already mistreating users to save on compute and sabotaging them for ideological reasons, knowing customers have little choice among the four or five companies playing this game.

Worse, I believe the problem will only get worse as prices rise, driven by the incredibly poor margins these AI companies "enjoy".

And once OpenAI and Anthropic are under the thumb of public investors, the pressure to improve those margins will only make things more challenging.

Diffusion LLMs might find the sweet spot this industry urgently needs: models that are small, good, and run fast on compute that is already in the hands and laps of billions of people.

Who said sculpting wouldn't be back in vogue?
