Training AIs for a Millionth of the Cost

We've all assumed AI has to be expensive. That's what everyone keeps telling us.

But what you probably don't know is that so little changes in this industry because everyone is doing the exact same thing: building AIs by following the same playbook.

Now, a company has trained a model from scratch for a reported $1,500. Despite the tiny upfront cost, it's remarkably competitive for its size; we've literally never seen anything so good for so little.

And it sends a message:

We might need trillions of dollars to create superintelligence, but nobody said we needed superintelligence to make AI useful.

When Training Became Cheap

HRM-Text is a new 1B-parameter language model from Sapient Intelligence that tries to make a very specific point: language models may not need internet-scale data and giant training runs to reach useful general performance.

A smart choice

Instead of training a conventional Transformer on trillions of raw tokens, HRM-Text (a Transformer nonetheless, but recurrent, as we'll see in a minute) is trained from scratch on only 40 billion structured instruction-response tokens, with Sapient claiming a training budget of roughly $1,500.

HRM-Text reportedly reaches competitive scores against open models in the 2B-to-7B parameter range while using far less data and compute.

The paper reports 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH, all of which are quite popular AI benchmarks.

And while those numbers do not put it in frontier-model territory (not even close), they are notable for a model of this size and training budget.

Sapient's claim is that HRM-Text uses roughly 100–900x fewer training tokens and 96–432x less estimated compute than standard baselines for small models (the comparison with frontier models is much, much bigger).
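To make those ratios concrete, here is a quick sketch of what they imply about the baselines. It only multiplies the two figures quoted above (40 billion tokens and the 100–900x range); the baseline sizes it prints are implied by that arithmetic, not taken from the paper.

```python
# Implied training-set size of the baselines, derived only from the
# figures quoted in the article (40B tokens, 100-900x fewer tokens).

HRM_TEXT_TOKENS = 40e9            # HRM-Text training tokens
TOKEN_RATIO_RANGE = (100, 900)    # claimed "x fewer tokens" range


def implied_baseline_tokens(own_tokens: float, ratio: float) -> float:
    """If a model used `ratio` times fewer tokens, the baseline used this many."""
    return own_tokens * ratio


low = implied_baseline_tokens(HRM_TEXT_TOKENS, TOKEN_RATIO_RANGE[0])
high = implied_baseline_tokens(HRM_TEXT_TOKENS, TOKEN_RATIO_RANGE[1])

print(f"Implied baseline training sets: {low / 1e12:.0f}T to {high / 1e12:.0f}T tokens")
```

In other words, the claim amounts to comparing against small models trained on trillions to tens of trillions of tokens.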

Those are very bold claims that, if true, completely reframe how we perceive the real costs of AI.

But is the difference really that dramatic? Well, actually, against frontier models the gap is far larger.

Comparing to the frontier

While we don't have exact measurements of the size of current frontier pretraining runs, we can estimate them reasonably well.

Current frontier models are thought to require about 100 times as much training compute as models from the GPT-4 era.

That era had models with pretraining budgets on the order of 2×10²⁵ FLOPs (floating-point operations, i.e., the basic arithmetic a model performs), roughly the training compute estimated for GPT-4 and Llama 3, so we're talking about roughly 2×10²⁷ FLOPs today.

Take Claude Mythos, probably the largest model ever trained, with an estimated size of 10 trillion parameters. Using the compute formula from Kaplan et al. (OpenAI), that budget implies a training dataset of roughly 667 trillion tokens.

Compared to HRM-Text? Well, as mentioned, it used only 40 billion training tokens, so we're talking about roughly 16,700 times fewer training tokens here.

And get this: around 8.3 million times fewer FLOPs overall.

To get to those numbers, we use the following formula: Budget = 6 * N * D, where:

The budget is, as mentioned, 2×10²⁷ FLOPs.

The 6 comes from the 2 operations per parameter needed for a forward pass and the 4 for the backward pass, per training token.

'N' is the number of activated parameters per model. Mythos, just like all models today, is a mixture-of-experts. Frontier models today are so large because they are extremely sparse, so we can assume that 5% of the parameters are active, meaning "only" 500 billion of the ten trillion are used for each prediction.

And 'D' is the training dataset size, the number we are looking for.

Plugging HRM-Text's 1 billion parameters and 40 billion tokens into the same formula gives about 2.4×10²⁰ FLOPs, which is where the 8.3-million-fold gap comes from.
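If you want to check the arithmetic yourself, here is a minimal sketch. Every input is one of the estimates quoted above, not a measured value, and it treats HRM-Text as a plain 1B-parameter model, which ignores any extra compute its recurrent loops may add.

```python
# Back-of-the-envelope check using the rule of thumb Budget = 6 * N * D.
# All inputs are the article's estimates, not measured values.

FRONTIER_BUDGET_FLOPS = 2e27     # estimated frontier training budget
FRONTIER_TOTAL_PARAMS = 10e12    # estimated total parameters
ACTIVE_FRACTION = 0.05           # assumed share of parameters active per token

HRM_PARAMS = 1e9                 # HRM-Text parameter count
HRM_TOKENS = 40e9                # HRM-Text training tokens


def training_flops(active_params: float, tokens: float) -> float:
    """Budget = 6 * N * D."""
    return 6 * active_params * tokens


def tokens_for_budget(budget_flops: float, active_params: float) -> float:
    """Solve the same formula for D: D = Budget / (6 * N)."""
    return budget_flops / (6 * active_params)


frontier_active = FRONTIER_TOTAL_PARAMS * ACTIVE_FRACTION
frontier_tokens = tokens_for_budget(FRONTIER_BUDGET_FLOPS, frontier_active)
hrm_flops = training_flops(HRM_PARAMS, HRM_TOKENS)

token_ratio = frontier_tokens / HRM_TOKENS
flops_ratio = FRONTIER_BUDGET_FLOPS / hrm_flops

print(f"Frontier active parameters: {frontier_active:.2e}")
print(f"Implied frontier tokens:    {frontier_tokens:.3e}")
print(f"HRM-Text training FLOPs:    {hrm_flops:.2e}")
print(f"Token ratio:                {token_ratio:,.0f}x")
print(f"FLOPs ratio:                {flops_ratio:,.0f}x")
```

Change ACTIVE_FRACTION and the token estimate moves with it: the sparser you assume the frontier model to be, the more tokens the same budget implies.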

To be clear, these larger-than-life values only make sense once you factor in multimodality and reinforcement learning. AI labs don't have 667 trillion tokens of text to train on, so the total has to include other modalities (images, video, audio, etc.).

Moreover, many of those FLOPs are spent on Reinforcement Learning rollouts, which don't need data to imitate; the model just tries to solve a problem, and only successful tries are learned from, with the rest discarded. Yet they still count as training FLOPs.

If the sheer size of these values impresses you, well, that's why frontier model training runs cost billions of dollars.

Therefore, even though this model isn't remotely close to the level of "intelligence" of these models, it shows that we are getting dramatically better at getting "good results" for a fraction of the cost.

Because while a frontier model requires millions of times more compute than this model needed, it's not millions of times better, which raises the question:

Can we find a sweet spot where costs don't spiral out of control, as we've seen recently, and models are still useful for economically valuable tasks?

The answer is yes, but for that, we need to understand how these guys did it.

Enjoying this piece? Then you'll love my newsletter, TheWhiteBox by Nacho de Gregorio, which explains AI from first principles and in words you can understand, for those allergic to hype but hungry for knowledge.

The Importance of Representations

Yann LeCun, one of the Godfathers of AI and a notorious Large Language Model (LLM) skeptic, always says that representations are the most important thing in AI.

And they are certainly at the center of this innovation. But what is a representation?

Why AIs are at a disadvantage to humans by default

Just as humans see things not for what they are but for what our experiences tell us they are, AIs also build representations of what they think something really is, even if they can't touch it, smell it, or see it.

However, as they lack the ability to perceive and experience the real world, they must develop this representation from a representation of it, such as text. In layman's terms, they learn from input like text, which is not the real world; it's a representation of it.

This already puts them at a learning disadvantage relative to humans. Researchers "solve" this by giving models every single data point we can find during training; if they can't experience what a tree is, they will at least see every aspect and insight there is to know about trees, across millions of data points, so that their "understanding" is "good enough".

To be clear, they don't build tacit knowledge, but "relative knowledge"; they understand what a 'tree' is by how similar or dissimilar it is to other concepts. 'Dog' and 'cat' are closer to each other than 'dog' and 'furnace', so they build an internal "world model" based solely on semantic similarity.
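Here is a small sketch of what "closer" means in practice. The three vectors are made up by hand for illustration (real models learn vectors with hundreds or thousands of dimensions), and cosine similarity is one common way to compare them, not necessarily the measure any particular model uses internally.

```python
import math

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
EMBEDDINGS = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "furnace": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


dog_cat = cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["cat"])
dog_furnace = cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["furnace"])

print(f"dog vs cat:     {dog_cat:.2f}")
print(f"dog vs furnace: {dog_furnace:.2f}")
```

With these toy vectors, 'dog' scores far higher against 'cat' than against 'furnace', which is all the model "knows": positions relative to other concepts.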

What I'm alluding to is one of the biggest issues in AI today: sample efficiency.

AIs need several orders of magnitude more data than humans to reach the same level of understanding. It's not that we want to "feed them the world's data" and spend a billion dollars training a single model just because; we're forced to. And the quality of AI's representations is the culprit.

Think about the broken telephone game, where people sit in a line, and each passes along the message given by the previous person. By the end of the line, the message has been completely distorted.

We, humans, are the first in line; we get to experience the real thing and then write it down, take a photo, or whatever. This is fine, but it already simplifies the real thing; it's not the same to see and interact with a husky as it is to see one in a photo or to have it described by another human.

AIs instead receive the simplified thing, the text, or the photo, so obviously, they are going to need a lot more data to better understand what an actual husky is, while a human only needs a single physical interaction.

Therefore, for AI researchers, the best way to improve representations is just more data, brute-forcing learning. More and more data, as much data as you can find.

But what if there were another way to improve representations? Of course, there is; it's called algorithmic improvements, something the industry has largely ignored for almost 10 years since the Transformer was first introduced.

And an algorithmic improvement is precisely what Sapient Intelligence has done here.

From sequence to loop

A standard transformer, the type of AI model underneath ChatGPT or Claude, is a sequence of transformations (hence the name).

The model receives an input sequence and applies several transformations to a representation, trying to figure out "what comes next." You can think of this as molding a clay figure; the model progressively transforms the representation into the next word.

Here, there's only a single sequence of transformations. This is why AIs need to verbalize their thinking; the amount of computation per prediction is fixed, so their way of "thinking for longer" is generating more tokens.
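A toy sketch makes the point about fixed computation. The "layers" below are stand-in arithmetic functions I made up, not real attention or MLP blocks; what matters is the counting.

```python
from typing import Callable, List, Tuple

Layer = Callable[[float], float]


def make_layers(depth: int) -> List[Layer]:
    # Stand-in transformations; a real layer would be attention plus an MLP.
    return [lambda x, k=k: 0.5 * x + k for k in range(depth)]


def predict_next(representation: float, layers: List[Layer]) -> Tuple[float, int]:
    """One pass through the fixed stack; returns the output and the layers applied."""
    steps = 0
    for layer in layers:
        representation = layer(representation)
        steps += 1
    return representation, steps


def generate(start: float, layers: List[Layer], num_tokens: int) -> Tuple[float, int]:
    """Emit `num_tokens` predictions, feeding each output back in as the next input."""
    total_steps = 0
    x = start
    for _ in range(num_tokens):
        x, steps = predict_next(x, layers)
        total_steps += steps
    return x, total_steps


layers = make_layers(4)
_, one_token_steps = generate(0.0, layers, 1)
_, ten_token_steps = generate(0.0, layers, 10)

print(f"Layer applications for 1 token:   {one_token_steps}")
print(f"Layer applications for 10 tokens: {ten_token_steps}")
```

Each prediction always costs the same number of layer applications, so the only knob for "more thinking" in this setup is the number of tokens generated.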

But what if they could do what humans do all the time, which is shut their mouths and continue thinking?

We, humans, might develop certain thoughts and ideas, but they might not be good enough to be said aloud. Thus, to avoid saying something wrong, we continue the thinking process, looping back to the idea again and again until we figure out what we want to say.

And this recurrence is precisely what HRM proposes. The crucial idea is that an HRM model works at "two speeds", one slower than the other.

This means the model has to work with two different state variables at different frequencies (i.e., the model is handling two distinct "thoughts" simultaneously):

One is faster and lower-level, adapting to sudden changes.
The other is slower and higher-level, updating fewer times per prediction.

And what's the point? Simple: this allows the model to capture patterns of the world at different speeds. Let me explain.
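The two-speed idea can be sketched as a nested loop. To be loud about it: this is not Sapient's implementation. The update rules, the weights, and the loop counts are all invented for illustration, and the states are single numbers instead of the Transformer blocks a real model would use.

```python
import math


def hrm_style_step(x: float, high_cycles: int = 2, low_steps: int = 4):
    """Toy two-speed recurrence for ONE prediction.

    z_low  : fast state, updated on every inner step.
    z_high : slow state, updated once per outer cycle.
    All coefficients are arbitrary illustration values.
    """
    z_low, z_high = 0.0, 0.0
    low_updates, high_updates = 0, 0

    for _ in range(high_cycles):
        for _ in range(low_steps):
            # Fast state sees the input, its own previous value, and the slow state.
            z_low = math.tanh(0.5 * z_low + 0.3 * z_high + x)
            low_updates += 1
        # Slow state updates only after the fast state has had time to settle.
        z_high = math.tanh(0.7 * z_high + 0.6 * z_low)
        high_updates += 1

    return z_high, low_updates, high_updates


z_high, low_updates, high_updates = hrm_style_step(x=0.5)

print(f"Fast-state updates: {low_updates}")
print(f"Slow-state updates: {high_updates}")
print(f"Final slow state:   {z_high:.3f}")
```

Notice that the loop runs entirely inside one prediction: the model "keeps thinking" without emitting a single extra token.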

If you have an AI model that moves at a single speed, like modern frontier models, it captures patterns that change at the speed at which it updates its internal state.

This makes most models you know by name excellent at identifying subtle changes in your wording and capturing local patterns you might have completely missed, such as a spelling mistake or a very specific fact you mentioned 50,000 words earlier.

This leads to exceptional performance, as you experience daily.

But this is also a trap for models, which, especially as context length rises, lose the plot completely. You may have realized that they sometimes struggle to see the big picture, getting lost in the nitty-gritty of the conversation.

And the point I'm trying to make is that this is expected because of how they are designed. Therefore, we can derive the following:

A model's ability to pick up a pattern, which is nothing but a repetition, is determined by whether that pattern repeats at the same frequency at which the model updates its representation of what is happening.

Imagine someone reads a 600-page novel while stopping after every sentence to update their opinion of what the story is about. This person becomes extremely sensitive to local details.

They notice a strange adjective, a contradiction in a character's wording, a tiny clue hidden in a sentence, or a callback to something said earlier.

At the sentence level, they are excellent, but now imagine that this is the only speed at which they think. Every sentence forces a new update.

Their attention keeps being pulled by the latest local detail, so their understanding of the whole story becomes unstable. A sarcastic line may make them overreact. A minor subplot may feel central. A temporary mood shift may appear to be a permanent change in the character.

Simply put, they are reading too "fast" for the pattern they are trying to identify. So what does the HRM architecture propose?

Simple: some patterns require slower updates to be picked up.

Just like you don't update your entire perception about a book's plot based on every new sentence, and instead carry both a general representation of 'what the plot is about', as well as a local representation that lets you be 'in the moment' regarding what is happening as you read, HRM introduces that same bias into AI models.
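The reader analogy can be sketched in a few lines. The "sentiment scores" below are invented: a stable plot at 1.0 with a local swing of 0.8 up or down on alternating sentences. The fast view updates on every sentence; the slow view updates once per chapter with the chapter's average. This is a toy for the intuition, not how HRM computes anything.

```python
def read_story(sentence_scores, chapter_length):
    """Track a fast opinion (every sentence) and a slow one (every chapter)."""
    fast_opinions, slow_opinions = [], []
    slow_opinion = 0.0
    chapter = []
    for score in sentence_scores:
        fast_opinions.append(score)          # fast view: overwritten every sentence
        chapter.append(score)
        if len(chapter) == chapter_length:   # slow view: updated once per chapter
            slow_opinion = sum(chapter) / len(chapter)
            chapter = []
        slow_opinions.append(slow_opinion)
    return fast_opinions, slow_opinions


def swing(values):
    """How far an opinion moves between its extremes."""
    return max(values) - min(values)


CHAPTER_LENGTH = 6
# Invented data: stable plot (1.0) plus alternating local swings (+0.8 / -0.8).
scores = [1.0 + 0.8 * (-1) ** t for t in range(24)]

fast, slow = read_story(scores, CHAPTER_LENGTH)
fast_swing = swing(fast)
# Ignore the first chapter, before the slow view has formed any opinion.
slow_swing = swing(slow[CHAPTER_LENGTH - 1:])

print(f"Fast opinion swing: {fast_swing:.2f}")
print(f"Slow opinion swing: {slow_swing:.2f}")
```

The fast view is whipped around by every sentence, while the slow view sits steady on what the plot is actually about. A model that carries both gets the local detail and the big picture.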

The result? A model that learns at different speeds and therefore learns more efficiently, requiring a lot less data and a much smaller size.

This is the big insight from this model, though not the only one. They also apply other very interesting ideas, such as a new type of attention masking that considerably reduces training time. I won't go into it today for the sake of brevity, but I highly recommend you take a look if interested.

And what's the takeaway? Simple: there's still a lot to improve, and AI is "not solved."

There's a lot yet to be learned

You may be wondering: why has this not been adopted earlier? The answer is pretty simple, actually: US Labs are too rich to care.

I've been discussing for a while the fundamental misalignment between AI Lab incentives and what the world is actually asking for. "Show me the incentives, and I'll show you the outcome," as the late Charlie Munger would say.

For US Labs, with profitability being a pipe dream, no matter how hard Anthropic tries to convince you otherwise, their goal is to survive, and survival means raising more money from equity and fixed-income investors alike.

And to get that money, especially at the size of the funding rounds these companies need, they have to sell investors the moon. Curing cancer, eliminating the need for human jobs; you name it.

Getting models that are borderline intelligent but useful enough to be adopted, what enterprises are begging for, is simply not a well-thought-out fundraising strategy.

For them, it's just crossing fingers and expecting adoption to emerge.

Adoption is happening for coding, yes, but beyond that, it's modest at best, because outside iterative work like coding, PowerPoint, and Excel sheets, enterprises need 'reliability at cost'.

Take a look at China. They are being starved of computing resources by export controls.

And thus, they are forced to innovate (necessity is the mother of invention, as they say), and they are doing so by creating models like DeepSeek v4 Pro, which requires an order of magnitude less cache memory than the average American model.

It's not a coincidence that Sapient Intelligence, today's protagonist, is from Singapore, within China's sphere of influence.

This is just the perfect example of what the two great powers are optimizing for: one to lure investors, the other to put up a fight with orders-of-magnitude fewer resources.

But will it last? For now, excess liquidity has allowed the US Labs to ignore all these needs for optimization and cost-effectiveness; money feels infinite.

But hey, it's not!

AI Labs need to get their act together and start optimizing for cost. The positive thing is that frontier models are well beyond the "intelligence" levels enterprises need for their processes.

Instead, they just need models that are sufficiently smart, reliable, and cheap; that's all enterprises ask for.

Current models can solve Erdős problems, but are as reliable as your chain-smoker friend telling you, "This is it; I'm going to stop smoking". I'm sorry, solving complicated maths problems is great and all, but its economic impact is modest.

Don't get me wrong, I'm all for it, but the AI industry is still treating its goals as purely academic when it should be striving to make money.

Thus, reliability is an afterthought, so enterprises have to design reliability into the mix.

One of the best ways to do so is redundancy design, which I discuss in my latest free blog in The Imperative, which I recommend you read.

But in the meantime, I hope this article helps you understand three things:

The importance of representations and how there's still a lot of low-hanging fruit in terms of algorithmic improvements
How fundamentally misaligned the AI industry is with regard to what the world is asking them for

And the third, by far the most important one: thinking that AI has to be expensive by default is wrong.

I share similar thoughts in a more comprehensive and simplified manner on my LinkedIn (don't worry, no AI-generated content there either). As a reminder, you can also subscribe to my newsletter.