On the platform where I write this (medium.com) I'm forever typing snippets of code only to have the system guess the wrong language (and use the wrong syntax highlighting). The same is true on other platforms, and I've always been puzzled by this. I mean, how hard can it be?
A frontier LLM can detect the language easily, but what about smaller models? CodeBERTa-language-id is built for this exact task, and is 99.9% accurate, but it's also 330 MB. So from the application-developer's perspective it's destined to live forever on a server.
At the other end of the scale, Highlight.js runs in the browser and can detect programming languages. But detection is not its main focus, and its accuracy is quite low.
Do we really have to choose between good-but-huge, and small-but-bad? Or can we capture some of the intelligence found in LLMs in a very small model that can be shipped over the network?
In this article I'll explore the general idea of shrinking intelligence down into a tiny model, using language detection as an example.
Let's take a look at some results first…
Results
Easy mode
In 'easy mode' I'm using the CodeSearchNet dataset which contains code in six programming languages. The CodeBERTa-language-id model gets an F1 of 99% on this dataset, and as a reminder, that model is 330 MB.
(If you don't know what F1 is, just think of it as accuracy, but smarter.)
My model gets 99% too. The size — including the model parameters and the inference engine written in JavaScript, with zero dependencies, as shipped to the browser — is 2 KB. Yes, two.
Now, 99% is great, but this dataset is too easy. Each item is a neat little function in one of only six programming languages, which makes for a classic academic benchmark (well defined but disconnected from reality).
So let's amble over into the real world…
Hard mode
For a more realistic dataset, I'm using a subset of The Stack V2. I've filtered for 31 languages based on those detected in the medium.com editor. This is a mix of regular programming languages and things like SQL, Markdown, YAML, CSS, and Shell scripts. I've split the original code files from the dataset into 10-line snippets. This is more representative of what a user might actually type in a blog post or GitHub comment, and a harder task.
On this dataset I get an F1 score of 86% with the model now at 14 KB.
Highlight.js scores 36%.
We can't really see how the CodeBERTa-language-id model does on this task because it only knows six languages. But if we filter the dataset for just those six languages, it gets 89% (and then mine gets 95% while shrinking to 2.5 KB).
Now, if you're thinking I must have done something particularly clever to achieve this, you are entirely incorrect, as you will soon see.
I should note that the scores I'm mentioning here likely undersell the model (trust me bro). When you split code files into 10-line snippets, some of them will look like junk (lists of numbers, no way to know what language) and often you get one language nestled in a different language file (e.g. a shell script in a YAML file). I've filtered out some of these, but many of the mistakes the model makes are quite forgivable.
The process
Below I'll describe how I created this model, but I'll keep it high level. My goal here is partly to show off something I think is quite cool, but also to map out a general process for solving a given problem with a very small ML model.
If you want more details, you can find the code that generates the model in this very messy repo.
My original plan was to assess the latest and greatest model-shrinking techniques. I started with a dead-simple model just to get a baseline score, but as it turned out, the baseline model was pretty great.
The first step in any ML workflow is working out how to turn your inputs into a numerical representation that an ML model can work with. LLM-style tokenization is one approach, but it only works with certain model architectures, so it wasn't an option. Instead, I asked Codex to list out all the reserved words in all of the 31 languages I wanted to detect, as well as symbols commonly found in those languages. It came up with a list of 746 tokens, like "import", "function", and "::".
So I could transform any code snippet into a list of 746 booleans by asking 746 questions, like "does the snippet contain import", "does the snippet contain function" …
Yes, this throws away a lot of signal, but it aligns nicely with my problem-solving mantra: "do it dumb first".
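Here's a minimal sketch of that transformation in Python (the token list and function name are mine, standing in for the real 746 tokens):

```python
# A stand-in for the real list of 746 reserved words and symbols.
TOKENS = ["import", "function", "def", "::", "=>", "#include", "SELECT"]

def extract_features(snippet):
    """One 0/1 feature per token: does the snippet contain it anywhere?"""
    return [1 if token in snippet else 0 for token in TOKENS]

print(extract_features("import numpy as np"))  # [1, 0, 0, 0, 0, 0, 0]
```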
With my code snippet text turned into lists of values (the 'features' of the snippet), I then trained a LogisticRegression model on those features. If you're not up to date on the latest machine learning tech, you may not be aware that the 'logistic regression' model is just about the simplest possible ML model, and has been around for almost a century.
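If you want to see what that looks like in practice, here's a minimal scikit-learn sketch. The toy data is made up (the real dataset is thousands of labelled snippets), and it reuses the extract_features helper sketched above:

```python
from sklearn.linear_model import LogisticRegression

# Toy labelled data; the real thing is thousands of snippets across 31 languages.
snippets = [
    "import numpy as np",
    "function add(a, b) { return a + b; }",
    "SELECT id FROM users;",
    "def add(a, b): return a + b",
]
languages = ["Python", "JavaScript", "SQL", "Python"]

# Turn each snippet into its list of 0/1 presence features, then fit.
X = [extract_features(s) for s in snippets]
model = LogisticRegression(max_iter=1000)
model.fit(X, languages)

print(model.predict([extract_features("SELECT name FROM people;")]))  # likely ['SQL']
```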
And this model has three properties that make it particularly well suited to this type of problem:
- Firstly, it doesn't suffer from the same context window limitations as transformer models. So while the CodeBERTa-language-id model can only look at the first 512 tokens of a code snippet, my little model will 'see' the entire snippet.
- Secondly, inference requires a tiny amount of code. It takes just ~25 lines of basic JavaScript to use the model to predict the language of a given snippet of code.
- Thirdly, it's blazing fast. It takes 3 seconds to train on 10,000 examples (on a CPU). So iteration and experimentation are fast, and once deployed, making a prediction takes well under one millisecond (much faster than Highlight.js).
It's also worth noting that when a model is small enough to send over the network, it removes a lot of the difficulties that come with handling large models. You don't need to worry about versioning, deploying, serving, or scaling, because the model is just a static asset in your codebase, and inference happens on the user's hardware.
If we take a step back, we can see 'intelligence' in two different parts of this system:
- In extracting features from a code snippet, we have the 'intelligence' from a frontier LLM with a deep understanding of all 31 languages, encoded as a list of features (you could achieve the same thing with a human + Google + half a day).
- Then there's the 'intelligence' of the logistic regression model that learns a relationship between those features and a programming language by looking at a few thousand examples.
An aside: peeking into the model internals
If you're new to machine learning, this particular model might be a nice way to build an intuition for how ML models work.
If you hop on over to langid.dgapps.io, you can see every one of the model's parameters and how they're used to go from a text input (code snippet) through the model weights to a prediction (programming language).

The columns of values (with headers import, ;\n, // etc.) represent 'features'. The Snippet features row indicates if the feature is in the provided code snippet. The numerical values below are the model's weights (faded out for features that aren't present). To make a prediction, the black numbers for each row are added up and that gives each language a score. In the screenshot, CSS gets the highest score with 13, so the prediction is CSS.
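If you prefer code to screenshots, the whole prediction step boils down to that weighted sum plus an argmax. Here's a toy Python version with made-up weights and tokens (the shipped inference code does the same thing in JavaScript):

```python
# A toy model: two languages, three token features, made-up weights.
tokens = ["def", "function", "SELECT"]
weights = {"Python": [2.1, -1.3, -0.8], "JavaScript": [-1.5, 2.4, -0.6]}
biases = {"Python": 0.1, "JavaScript": -0.2}

def predict(snippet):
    # Which features are present in the snippet?
    present = [1 if t in snippet else 0 for t in tokens]
    # Each language's score is its bias plus the weights of the present features.
    scores = {
        lang: biases[lang] + sum(w * x for w, x in zip(weights[lang], present))
        for lang in weights
    }
    return max(scores, key=scores.get)  # highest score wins

print(predict("def add(a, b): return a + b"))  # Python
```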
This is basically how ChatGPT works, except it has about a trillion more parameters, an entirely different tokenization strategy, and an architecture that looks nothing like this.
And in case you were wondering: yes, this model is definitely showing signs of consciousness.
Shrinking the model
OK back to the story…
At this point, the baseline model was looking pretty promising. It was smaller than any other model, and better performing than just about everything else.
When saved as JSON, the trained model was 630 KB on disk, but with a bit of effort that'll come down to under 10 KB (I kindly request that you be surprised by this).
The shrinking happens in 4.5 stages…
Stage 1: compression
This is the easy one. 630 KB gzipped is only 200 KB. All size figures from here on include compression.
Stage 2: model surgery
A nice perk of simple models is that it's trivial to perform minor surgery on them. In my case, I rearranged the model's weight matrix so that earlier columns had stronger predictive power (remember, each column represents a feature, which is some keyword or symbol). This change is mathematically inert, but it means you can opt for a smaller model with slightly worse performance simply by taking the first n columns. (A kind of pauper's Matryoshka Representation Learning.)
For the below chart I created 100 variants of the model each with a different number of features, resulting in models with various sizes and F1 scores.

A practical approach is to have a post-training step that removes features (columns from the right side of the model's weight matrix) until the F1 score drops below the best score by some small delta. So in effect, the model shrinks to only be as big as it needs to be (hence smaller models for easier tasks).
Using only 400 out of 746 features reduces the model size from 200 KB down to 113 KB, without meaningfully affecting F1.
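In scikit-learn terms, the reorder-and-truncate step looks roughly like this sketch (using each column's largest absolute weight as a rough proxy for predictive power):

```python
import numpy as np

def reorder_and_truncate(model, tokens, n_features):
    """Put the most predictive feature columns first, then keep only the first n."""
    # Rank each column by its largest absolute weight across all languages.
    importance = np.abs(model.coef_).max(axis=0)
    order = np.argsort(-importance)

    kept = order[:n_features]
    small_coef = model.coef_[:, kept]          # truncated weight matrix
    small_tokens = [tokens[i] for i in kept]   # token list reordered to match
    return small_coef, model.intercept_, small_tokens
```

In practice you'd sweep n_features downwards and stop just before the F1 score drops by more than your chosen delta.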
Stage 3: rounding
Since the model is saved as JSON, a very easy way to trade performance for size is simply by rounding the model's weights to fewer decimal places, resulting in fewer actual digits in the JSON file (a kind of pauper's quantization). Rounding down to 1 decimal place barely moved the F1 score, but reduced the model size from 113 KB down to 14 KB.
Hey, hi there, do you like arcane discussions of marginal gains? You're going to love the next paragraph…
Rounding combines nicely with the gzip algorithm. When rounded to 1 decimal place, the model's 12,400 parameters contain just 45 unique values, so gzip can refer to the vast majority of parameters with just a few bits (referencing the place where it last saw that same value). Dropping from 2 decimal places to 1 only reduces the JSON length by 15%, but the gzipped size drops by 50%, because of all those repeated values. Neato!
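If you want to reproduce the effect, the experiment fits in a few lines (random weights standing in here for the real 31 x 400 weight matrix):

```python
import gzip, json
import numpy as np

def gzipped_size(weights, decimals):
    """Round the weights, serialise to JSON, and measure the gzipped size."""
    rounded = np.round(weights, decimals).tolist()
    return len(gzip.compress(json.dumps(rounded).encode("utf-8")))

weights = np.random.randn(31, 400)  # stand-in for the real 31 x 400 matrix
for decimals in (4, 2, 1):
    print(decimals, "decimal places:", gzipped_size(weights, decimals), "bytes gzipped")
```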
You might be thinking: OK this is getting a bit extreme, this guy's lost sight of reality and is fretting over individual kilobytes. In which case, you're going to hate the next bit…
Stage 4: quantization, raw bytes, and beyond
We can go smaller. Sending the model parameters as text in a JSON file is easy but naive. Converting the parameters to 8-bit ints and sending raw bytes reduces the size from 14 KB down to 10 KB. That drop is not huge, because the aforementioned gzip magic makes the JSON form very efficient. And since it's fiddly to quantize, and to pack and unpack the model to raw bytes, I wouldn't bother with this for such a small model, but for bigger models it would be worth the effort.
Oh and one final marginal gain: using Brotli instead of Gzip to compress quantized, raw bytes drops the size down to 9 KB. So now I can say "under 10 KB" in the blog post title.
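For the curious, here's roughly what the int8-plus-Brotli path looks like (a sketch, not the exact packing format I used; in real use you also need to ship the scale so inference can undo the quantization):

```python
import json
import numpy as np
import brotli  # third-party package: pip install brotli

def quantize_int8(weights):
    """Map float weights linearly onto int8, returning the raw bytes plus the
    scale needed to approximately recover them at inference time."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q.tobytes(), scale

weights = np.random.randn(31, 400)   # stand-in for the real weight matrix
raw_bytes, scale = quantize_int8(weights)

json_form = json.dumps(np.round(weights, 1).tolist()).encode("utf-8")
print("rounded JSON + brotli:", len(brotli.compress(json_form)), "bytes")
print("int8 bytes + brotli:", len(brotli.compress(raw_bytes)), "bytes")
```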
Guidance
This is all well and good, but how are you supposed to know when a tiny ML model might be the right tool for the job?
I think of programming problems as existing on a spectrum, from "definitely needs AI" to "easily solved with regular code". Think about the realm of visual understanding: at one end of the scale, you might want to detect if an image contains a person having a good time; for that you need serious AI. At the other end of the scale, you might want to detect if an image is black and white; you can easily do that in regular code. But what about the space in between? What if you want to detect whether an image is a photo or a line drawing? Maybe (with the help of a frontier LLM) you could write a 'feature extractor' that turned the input image into a set of numerical features, and train a very small model to map those features to a prediction of 'photo' or 'line drawing'.
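To make that hypothetical a bit more concrete, here's the kind of sketch I mean (the features are guesses, and in practice you'd scale them and check which ones actually separate the two classes):

```python
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression

def image_features(path):
    """A few hand-picked numbers that might separate photos from line drawings."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=float) / 255.0
    gray = img.mean(axis=2)
    return [
        len(np.unique((img * 15).astype(int).reshape(-1, 3), axis=0)),  # rough colour count
        float(np.abs(np.diff(gray, axis=0)).mean()),                    # edge-iness
        float(img.std()),                                               # overall contrast
    ]

# paths / labels are placeholders for a small labelled dataset:
# X = [image_features(p) for p in paths]
# model = LogisticRegression(max_iter=1000).fit(X, labels)
```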
If you have a problem where you need to classify some input or make a prediction, and can imagine a finite set of features in the inputs that drive the outputs, then I would suggest exploring small ML models as a solution.
Hopefully I've managed to adjust your expectations for how small 'AI' can get while still doing useful work. If you build something cool using this approach I'd love to hear about it.
Thanks for reading!