June 15, 2026
Jetbrains’ Mellum2 is the best low resources local coding agent LLM.
Model speed for local AI: what does really matter? Testing different models and understanding how to pick the best one for your tasks.
Fabio Matricardi
12 min read
Not all models are born equal. And not all Generative AI LLM are fit for a task.
In the past months everyone is talking about AI agents and the harnesses (me too), and so I decided to give you a honest review of what models are best for running locally according to the tasks.
I Built an AI Second Brain to cure my information overload. And here is how. How an oil and gas engineer used the free Opencode coding agent to automate note-taking, kill the folder chaos, and…
You can use AI agents for coding or for Knowledge Base processing (like an LLM wiki style project). But the same LLM cannot be automatically good at both.
TL;DR:
Local AI agent workflows demand fast Prompt Ingest (Pre-fill) speeds over raw generation, as agents spend most of their time reading dense codebases. If you cannot afford to run Gemma-4–26B-A4B or Qwen3.6–35B-A3B these below are the things to consider:
- For Coding Agents: JetBrains Mellum2 Instruct (Q3_K_S) is the undisputed champion. Its 64-expert sparse MoE architecture packs 12B intelligence into a light 6GB RAM footprint, delivering an unparalleled 152.26 t/s ingest and 15.11 t/s generation on integrated graphics.
- For Knowledge Bases: Lightweight 4B effective systems like Gemma-4 E4B IT QAT or Qwen3.5–4B offer maximum semantic parsing efficiency for local Wikis and graphs without choking memory.
- The Math Tax: Standard quantization formats (like Q4_K_XL) outperform Importance Quantization (IQ) layouts by avoiding on-the-fly iGPU dequantization penalties.
Table Of Contents
1. Introduction: The Modern Agent Formula (Model + Harness)
2. Speed Metrics: Why Prompt Ingest Rules the Agent Era
3. Hardware Benchmarks: Evaluating Dense vs. Sparse Architectures
4. Deep Dive: JetBrains Mellum2 (64 Experts, 131K Context)
5. Deployment Guide: CPU and iGPU Vulkan llama.cpp Setups
6. Beyond Code: Optimized 4B Models for Wikis and Chat
7. Final Blueprint: Picking the Right Tool for Your Stack
8. Conclusion: The Sovereign AI Enthusiast’s Stack1. Introduction: The Modern Agent Formula (Model + Harness)
2. Speed Metrics: Why Prompt Ingest Rules the Agent Era
3. Hardware Benchmarks: Evaluating Dense vs. Sparse Architectures
4. Deep Dive: JetBrains Mellum2 (64 Experts, 131K Context)
5. Deployment Guide: CPU and iGPU Vulkan llama.cpp Setups
6. Beyond Code: Optimized 4B Models for Wikis and Chat
7. Final Blueprint: Picking the Right Tool for Your Stack
8. Conclusion: The Sovereign AI Enthusiast’s StackLet's start understanding what the hell is a harness.
Agent Harness
Before we go further, let's make sure we're on the same page. You keep hearing "harness" thrown around, but what does it actually mean?
Think of it this way: the model (Claude, GPT-4, Llama, whatever) is the brain. It can think, reason, write code. But it can't do anything on its own. It can't read your files. It can't run terminal commands. It can't remember what happened five minutes ago.
The harness is the body. It's the scaffolding that gives the model:
- File access ➡️ reading and writing where you need it
- Terminal execution ➡️ running commands, tests, builds
- Context management ➡️ what to include in the prompt, what to ignore
- Memory ➡️ retaining state between sessions
- Safety ➡️ permissions that stop it from doing something dumb (or malicious)
The formula everyone's using in 2026 is simple:
Agent = Model + Harness
This is a real shift. It means you're not locked into one provider's "complete solution." You can swap the brain while keeping the body. You can run a local model for privacy, then switch to an API for power, without learning a new tool.
OpenCode is one of those harnesses: you can read more here below:
OpenCode is the "Linux of Agents": and that's the entire point The Context sovereign: why AGENTS.md is the new LLM secret
Best LLM for coding with Opencode
This section is valid for every harness, not only for Opencode. I wrote about harness engineering in my previous articles.
The Agent's Operating System: why Harness Engineering is your next step to a reliable AI Think of the model as the CPU and the harness as the Operating System. And here I show you how to do it on your PC.
If you use AI agents for coding you need an AI that is good at following instructions, call sub-agents, call tools and understand the plan. And prompt processing speed is more relevant than the generation speed: in fact code tokens are expensive, and there are a lot of them.
Before proceeding, here a set of golden rules:
- plan the Agent repo (AGENTS.md) with the very same model you are going to use
- If you plan to use an open-source model of one family(Qwen, Gemma…) you can upgrade a higher tier model for the troubleshooting (for example you plan to use Gemma-4-E4B, you can troubleshoot with Gemma-4–12b or Gemma-4–26B-A4B)
- plan the entire project as simple as possible. If there are deterministic actions (like conversion from PDF to markdown, write a good HOWTO.md and do it manually)
I tried the Qwen family (from 2B to the MoE Qwen3.6–35B-A3B), Prism-ML Bonsai 8B, the new Gemma-4 model family, Liquid AI LFM2.5–8B-A1B and the latest JetBrains Mellum2–12B-A2.5B-Instruct.
Test results
If we look at raw throughput efficiency on consumer hardware, Mellum 2 punches way above its weight class.
- On pure CPU (my miniPC), Mellum 2 managed a comfortable 4.17 t/s generation speed, outperforming the smaller Gemma-4 configurations.
- Once offloaded to the AMD graphics (
-ngl 99), it absolutely flew: 152.26 t/s ingest and 15.11 t/s generation. For a local setup, breaking past that 15 t/s floor on a highly capable model makes it an incredibly smooth interactive experience.
The Qwen MoE model can do great with Multi Token Prediction enabled, but the Memory requirements are too big for an average Consumer Hardware.
The Gemma-4 family becomes useful for coding generation only starting from the latest Gemma-4–12b. The MoE is good but you can imagine how much RAM you need to start using a 26B model, even though with only 4B active the performance are not bad at all.
Qwen-3.5 for coding is acceptable with AI agents only starting from the 9B model. Its dense architecture slow down the processing speed considerably. Qwen3.6–35B-A3B requires too much staring memory, even if with 3B active the processing speed is good.
Prism-ML Bonsai 8B and Liquid AI LFM2.5–8B-A1B break during tool calls and are not fit, in my opinion, for AI coding tasks.
The King of Local Code: JetBrains Mellum2
If you want a local model designed from the ground up for software engineering, JetBrains Mellum2–12B-A2.5B-Instruct is the undisputed champion for consumer hardware.
JetBrains Mellum2 MoE Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ Total Matrix: 12B Parameters (Fits comfortably within ~6GB VRAM/RAM) │
├────────────────────────────────────────────────────────────────────────┤
│ Active Path: 2.5B Parameters per token (64 Total Experts / 8 Active) │
└───────────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ Ingest: 152.26 t/s ⚡ Generation: 15.11 t/s (Vulkan) │
└─────────────────────────────────────────────────────────┘ JetBrains Mellum2 MoE Architecture
┌────────────────────────────────────────────────────────────────────────┐
│ Total Matrix: 12B Parameters (Fits comfortably within ~6GB VRAM/RAM) │
├────────────────────────────────────────────────────────────────────────┤
│ Active Path: 2.5B Parameters per token (64 Total Experts / 8 Active) │
└───────────────────────────────────┬────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ Ingest: 152.26 t/s ⚡ Generation: 15.11 t/s (Vulkan) │
└─────────────────────────────────────────────────────────┘Mellum2 utilizes a highly specialized Sparse Mixture-of-Experts layout featuring 64 separate experts, activating exactly 8 experts per token. This architecture allows the entire 12B model to be packed into a light Q3_K_S GGUF that sits comfortably inside 6GB of RAM, leaving plenty of headroom for its massive 131,072 token context window.
Because it only computes 2.5B active parameters per execution loop, prompt pre-fill and generation speeds are blisteringly fast, even on GPU-limited hardware. Post-trained via reinforcement learning with verifiable rewards (RLVR) specifically on executable coding, tool usage, and instruction following, it communicates directly without requiring an explicit, slow, externalized chain-of-thought block.
On an integrated AMD Radeon 780M (-ngl 99), Mellum2 completely outclasses its weight class, delivering an unparalleled 152.26 t/s prompt ingest and breaking cleanly past the interactive comfort floor at 15.11 t/s generation.
The Quantization Cliff: Why Agents Pay a Heavier Q4 Tax
When optimizing a local stack, the general consensus for standard chat models is that Q4_K_M is the ultimate sweet spot, retaining roughly 97% to 99% of FP16 perplexity while drastically cutting the memory footprint.
But perplexity is a generic metric for LLM who chats with you; it measures how smoothly a model predicts the next word in a sentence. An AI agent is not a chat expert, but a high-precision data analyst.
An agent's core job is emitting flawlessly structured syntax (like valid JSON payloads) and strictly following instructions, turn after turn, under a rapidly expanding context window. When you look at local models through the lens of tool-calling and structured-output discipline, the math completely changes.
The Instruction-Following Breakdown
Recent per-task retention studies show that aggressive quantization does not degrade all capabilities equally. The sharpest performance drops are concentrated precisely in the areas coding agents rely on most:
- Instruction Adherence (IFEval): Low-bit quantization introduces up to a 20% degradation when crossing the cliff from Q5 down to Q4.
- Mathematical Reasoning & Planning: Coding accuracy drops by 5% to 15% at Q4, introducing frequent trailing commas, unquoted keys, or hallucinated arguments that instantly break strict parsers.
Structured Output Accuracy (JSON / Tool-Calls)
▲
100% | ┌─────────────── Lossless Zone (Q8 / FP16)
| │
80% | ┌────────┘ ◄─── Q5_K_M: The Agent Sweet Spot
| │
60% | ┌─────┘ ◄─── THE QUANTIZATION CLIFF (Massive IFEval drops)
| │
40% |────┘ ◄─── Q3 / Q4 Standard Quantization (Parser breaks frequently)
└─────────────────────────────────────────────────────────────► PrecisionStructured Output Accuracy (JSON / Tool-Calls)
▲
100% | ┌─────────────── Lossless Zone (Q8 / FP16)
| │
80% | ┌────────┘ ◄─── Q5_K_M: The Agent Sweet Spot
| │
60% | ┌─────┘ ◄─── THE QUANTIZATION CLIFF (Massive IFEval drops)
| │
40% |────┘ ◄─── Q3 / Q4 Standard Quantization (Parser breaks frequently)
└─────────────────────────────────────────────────────────────► PrecisionThe Sweet Spot on Consumer Hardware
This creates a fascinating architectural tension when running local workflows on an integrated GPU (iGPU) like the Radeon 780M:
- For the Ultra-Light Stack: my baseline choice of Mellum2 Instruct at Q3_K_S remains an incredible engineering marvel. Because it was trained from the ground up via Reinforcement Learning with Verifiable Rewards (RLVR) specifically on executable code and tool utilization, its native "form-filling" discipline is baked into the weights, defying the traditional quantization penalty.
- For Scaling Up Precision: but if we plan expanding our harness to handle complex, nested tools, or if we are scaling up to beefier models like Qwen3 32B, we cannot afford to default to Q4.
The New Golden Rule for Sovereign Agents: Do not compress your model weights to Q4 just to squeeze a massive model into RAM. Instead, look to your serving layer: quantize your KV-Cache to
q4_0using the-ctkand-ctvflags inllama.cpp.
We can save memory on the context cache instead of the model weights, and in this way we can free up the necessary VRAM headroom to deploy Q5_K_M or Q6_K weights. That 10% extra memory spend protects the 20% instruction-following gap, ensuring our agent loop doesn't choke on a missing bracket on turn three.
Step-by-Step Local Deployment Guide
To serve Mellum2 as a local drop-in replacement for your coding agent harness, you can use the latest llama.cpp binaries (llama-b9553 or newer with Vulkan support). We will expose the server endpoint on port 11434 to natively interface with agent harnesses expecting an Ollama-style API hook.
Llama.cpp binaries:
You can download the quantized model here (I suggest you the Q3_K_S):
To run the model ready to be recognized by Opencode, we will fake the Ollama endpoint port like this.
Configuration A: Pure CPU Execution (For low-power Mini-PCs)
.\llama-server.exe -m Mellum2-12B-A2.5B-Instruct.i1-Q3_K_S.gguf \
-ctk q4_0 -ctv q4_0 --jinja --mmap -ngl 0 -np 1 \
-t 4 -fa on --port 11434 -a Mellum2 \
--temp 0.6 --top-p 0.95 --top-k 20
--ctx-size 131072.\llama-server.exe -m Mellum2-12B-A2.5B-Instruct.i1-Q3_K_S.gguf \
-ctk q4_0 -ctv q4_0 --jinja --mmap -ngl 0 -np 1 \
-t 4 -fa on --port 11434 -a Mellum2 \
--temp 0.6 --top-p 0.95 --top-k 20
--ctx-size 131072Configuration B: Hardware Accelerated (Full iGPU Vulkan Offloading)
.\llama-server.exe -m Mellum2-12B-A2.5B-Instruct.i1-Q3_K_S.gguf \
-ctk q4_0 -ctv q4_0 --jinja --mmap -ngl 99 -np 1 \
-t 4 -fa on --port 11434 -a Mellum2 \
--temp 0.6 --top-p 0.95 --top-k 20
--ctx-size 131072.\llama-server.exe -m Mellum2-12B-A2.5B-Instruct.i1-Q3_K_S.gguf \
-ctk q4_0 -ctv q4_0 --jinja --mmap -ngl 99 -np 1 \
-t 4 -fa on --port 11434 -a Mellum2 \
--temp 0.6 --top-p 0.95 --top-k 20
--ctx-size 131072If you want to try the Q5_K_M and verify the increase precision in tool calls:
- download the model GGUF file Mellum2-12B-A2.5B-Instruct.Q5_K_M.gguf
In the terminal try:
.\llama-server.exe -m Mellum2-12B-A2.5B-Instruct.Q5_K_M.gguf \
-ctk q4_0 -ctv q4_0 --jinja --mmap -ngl 99 -np 1 \
-t 4 -fa on --port 11434 -a Mellum2 \
--temp 0.6 --top-p 0.95 --top-k 20
--ctx-size 131072.\llama-server.exe -m Mellum2-12B-A2.5B-Instruct.Q5_K_M.gguf \
-ctk q4_0 -ctv q4_0 --jinja --mmap -ngl 99 -np 1 \
-t 4 -fa on --port 11434 -a Mellum2 \
--temp 0.6 --top-p 0.95 --top-k 20
--ctx-size 131072Check if the number of retries changes: it can happen that you don't see any action taken by the agent after the turn is completed.
LLM with agents for non-coding tasks
For LLM-wiki style projects, 4B models are good to go.
The king is gemma-4-E4B-it-qat, a model with a total 8B but 4B effective used parameters during inference. The model is good at understanding instructions and process text. Is good to produce knowledge graphs.
In the same range Qwen3.5–4B is also a good model, and it has a smaller memory footprint than the Google model.
LLM for Chat
If you are not constrained by agents and tools, many of the mentioned above models are a good pick. I usually run the llama.cpp server and directly chat with them with accurate results. This apply also to:
- Prism-ML Bonsai 8B
- Liquid AI LFM2.5–8B-A1B
- IBM granite-4.1–3b
- Huihui-MoE-5B-A1.7B
You can run with a command like this
.\llama-server.exe -m .\Bonsai-8B-Q1_0.gguf --mmap -ngl 0 \
-t 4 -c 32000 --port 11434 -fa on \
-ctk q4_0 -ctv q4_0 -a Bonsai8b --reasoning on --jinja \
--temperature 0.5 --top-p 0.9 --top-k 20 --repeat-penalty 1.0.\llama-server.exe -m .\Bonsai-8B-Q1_0.gguf --mmap -ngl 0 \
-t 4 -c 32000 --port 11434 -fa on \
-ctk q4_0 -ctv q4_0 -a Bonsai8b --reasoning on --jinja \
--temperature 0.5 --top-p 0.9 --top-k 20 --repeat-penalty 1.0Then open the browser at http://localhost:11434
Final Blueprint: Picking the Right Tool
When managing your sovereign local hardware stack, match your underlying model architecture directly to the operational profile of your project:
- For Coding Agents & Tool Integration: Run JetBrains Mellum2 Instruct (Q3_K_S). It offers the absolute best balance of lightning-fast token pre-fill speed and rock-solid tool-calling accuracy available for local consumer-grade setups.
- For Local Knowledge Bases (LLM-Wiki / Graphs): Lean on the Gemma-4 E4B IT QAT or Qwen3.5–4B models. These 4B effective parameter systems require incredibly low memory footprints while providing exceptional semantic parsing for unstructured text documents.
- For Direct Interactive Chat: If you don't need complex agent tool-calling frameworks, look to sparse alternatives like LFM2.5–8B (clocking in at an effortless 24.7 t/s generation) or run dense options like Bonsai-8B with
--reasoning onto maximize direct conversational intelligence.
Conclusion
The paradigm shift we are witnessing in 2026 is involving both models getting bigger and architectures getting smarter. For too long, running a local AI coding agent felt like a compromise: a trade-off between the ironclad privacy of your own hardware and the fluent, rapid execution of cloud-hosted APIs.
If we split the brain from the body using modular frameworks like OpenCode, we have the freedom to choose specialized engines for distinct tasks. The data makes the choice clear: for everyday knowledge mapping and document parsing, lightweight 4B effective systems like Gemma-4 E4B QAT deliver incredible efficiency. But when it comes to the heavy lifting of active software engineering (where tool-calling precision, massive context handling, and instant prompt pre-fill speeds dictate success ) JetBrains Mellum2 Instruct changes the game.
Its unique 64-expert sparse layout proves that local consumer hardware, like a standard AMD or mini-PC, can deliver a premium, near-instantaneous development experience without breaking the bank or leaking proprietary code. The era of the "GPU-poor" developer stalling on local builds is officially over.
When you optimize your workspace we are not chasing the highest total parameter count anymore: we want to match the right active computational footprint to our specific task. Deploy Mellum2, hook it into your agent harness, and take full control of your digital environment.
What are your thoughts on sparse architectures like Mellum2 for local workflows? Have you experimented with Multi-Token Prediction or QAT on your own rigs? Let me know in the comments below, and don't forget to follow the Artificial Intelligence Playground for more hands-on local AI benchmarks and tutorials!
I hope you enjoyed the article. If this story provided value and you wish to show a little support, you could:
- Clap a lot of times for this story
- Highlight the parts more relevant to be remembered (it will be easier for you to find them later and for me to write better articles)
- Write with me on this Publication: there is no better way to learn than writing about it!
- Follow my publication https://medium.com/artificial-intel-ligence-playground
If you want to read more, here are some ideas:
The smallest 8B model ever created is not really an advancement Bonsai is the 1bit in the real world for the first time: but the claims are far from good. You still need a GPU or you…
The Agent's Operating System: why Harness Engineering is your next step to a reliable AI Think of the model as the CPU and the harness as the Operating System. And here I show you how to do it on your PC.
You don't need an AI agent for every single thing! Your AI your Rules: Open Web UI and llama.cpp are all you need to work with your documents in full privacy and…
The future of generative AI is vectorless And why this is not what you think: it is a tragedy!
AI Frankenstein is alive — part 5 The first modern RNN Language Model: RWKV
OpenCode Series
The OpenCode revolution: more than just another chatbot AI Agency under your control and data that never leaves your PC. You don't need anymore Codex or Claude Code. Part 1 of…
Beginner to expert: your first 60 seconds with OpenCode Installation and first setup in a blink: how to start using opencode for your own projects. Part 2 of the series.
Referenced sources
How to Use Graphify: Turn Any Folder Into a Knowledge Graph A step-by-step guide to using Graphify, the open-source tool that builds a queryable knowledge graph
How to Use lat.md: Turn Any Folder Into a Validated Knowledge Graph A deep dive into lat.md—the open-source lattice that gives Claude, Cursor, and Aider a perfect map of your codebase.
Your Agent's Best Model Isn't Your Best Chat Model. It's a Form-Filler. An agent's job is emitting valid JSON, turn after turn, under a growing context. Open-weight leaders now beat the…