Llama.cpp has now a website. And running a local AI will never be so easy!

Something fundamental shifted in the world of artificial intelligence last night. If you have been following the AI boom, you know that the true holy grail for enthusiasts, creators, and privacy advocates is running artificial intelligence locally.

Local AI means having a powerful brain living entirely on your own laptop or desktop computer. It means no subscriptions, no corporate data tracking, no internet connection required, and complete digital sovereignty.

And the absolute king of this revolution is still llama.cpp

If you have ever read my articles, you know my stand: llama.cpp has been the real innovation after Attention is all you need. The Transformers architecture makes it possible to get GPT models, llama.cpp makes generative AI models accessible to everyone.

For years, the undisputed king of this underground local AI movement has been an open-source project named llama.cpp.

Created by developer Georgi Gerganov, it is the engine that allows regular consumer hardware to run massive language models. Yet, despite its power, llama.cpp always had a reputation for being intimidating. To use it, you had to navigate text-heavy GitHub repositories, understand confusing terminal commands, manually compile code, and hunt through deep web repositories to download massive files with strange extensions like GGUF. It was a tool built by developers, for developers.

That era is officially over.

The team behind llama.cpp just launched an official, beautifully minimalist website: llama.app.

Along with this website comes a complete restructuring of the software. They have taken the raw, fragmented engine of local AI and packaged it into a single, unified application that anyone can install with one click.

Open the terminal and run

irm https://llama.app/install.ps1 | iex

irm https://llama.app/install.ps1 | iex

This command works on Windows OS: but there are also all other methods:

Winget (Windows)

winget install llama.cpp

winget install llama.cpp

The package is automatically updated with new llama.cpp releases. More info: #8188

Homebrew (Mac and Linux)

brew install llama.cpp

brew install llama.cpp

The formula is automatically updated with new llama.cpp releases. More info: #7668

MacPorts (Mac)

sudo port install llama.cpp

sudo port install llama.cpp

Nix (Mac and Linux)

nix profile install nixpkgs#llama-cpp

nix profile install nixpkgs#llama-cpp

I will explain later on what happens when you run the installer (and how to update llama.app )

Back to the additional news.

The llama.cpp team has deeply integrated llama.app with autonomous coding agents like Pi, meaning you can now have a private AI assistant build software directly inside your computer folders without typing a single API key or spending a single penny.

If you have ever wanted to break free from cloud subscriptions and experience the freedom of a personal AI companion, things will never be simpler than they are right now.

Let us explore what this change means, why it matters, and exactly how you can set it up on your machine today, whether you use macOS, Linux, or Windows.

The Problem with the Cloud

To understand why the launch of llama.app is such a milestone, we have to look at the current state of mainstream AI.

Right now, most people interact with artificial intelligence through web browsers or phone apps connected to massive corporate clouds. When you type a prompt into ChatGPT, Claude, or Gemini, your words travel across the internet to giant data centers filled with thousands of expensive graphics cards.

While these services are undeniably brilliant, they come with significant hidden costs.

The first cost is privacy. When you use a cloud service, you are sending your thoughts, your personal writing, your business plans, or your proprietary code to a third-party server.

Even if companies promise they will not use your data for training, many professionals and individuals feel deeply uncomfortable letting an external entity look over their digital shoulder.

The second cost is economic. Cloud platforms operate on subscription models or pay-as-you-go API credits. If you use them lightly, the costs are manageable. But if you begin integrating AI deeply into your daily work, or if you use autonomous AI agents that constantly talk to the model in loops, those API bills can skyrocket into hundreds of dollars a month.

The third cost is dependence. If the company changes its pricing, alters its terms of service, modifies the behavior of the model to make it less useful, or suffers a server outage, your workflow instantly grinds to a halt.

Ownership is also another issue: they are the owner of the system prompt, and this means that they are the owner of your values and bias.

Local AI solves every single one of these issues. When a model runs on your local machine, your data never leaves your hard drive. There is zero telemetry tracking you. The cost is exactly zero dollars, no matter how many millions of words you generate. And because it runs offline, it will keep working perfectly even if the entire world loses internet access.

You can set the system prompt, you can set your bouderies and your values:

your AI your rules

Until now, the barrier was usability. The launch of llama.app changes the playing field entirely.

What is Happening Under the Hood?

The new strategy behind llama.app can be summarized in two words: radical simplicity.

Historically, downloading llama.cpp left you with a folder full of scattered, highly technical tools. If you wanted to chat with a model via text, you had to run one command called llama-cli. If you wanted to spin up a local server to connect the AI to another app, you had to run a completely separate tool called llama-server. If you were a non-technical user, just keeping track of these different pieces was a headache.

The developers have solved this by taking a page from the playbook of modern professional software like Git. They have bundled everything into a single, unified program called simply llama.

Now, you have one central point of contact. If you want to start a local server to feed AI into other applications, you simply type llama serve. If you want to chat directly in your terminal, you type llama cli. The underlying engine remains just as powerful and highly optimized as before, but the human interface has been streamlined into something clean and logical.

The installation process

The launch of the unified llama.app architecture sweeps all of that friction away. When you paste the installation string into your Windows PowerShell terminal, you are launching an intelligent deployment script that acts as an automated digital concierge for your specific computer hardware.

Let us open the hood and look at exactly what this script is doing behind the scenes.

1️⃣ The Hardware Fingerprint

The moment the script initializes, its primary objective is discovery. It begins by querying your operating system to figure out your core system architecture. It handles traditional Intel and AMD systems (x86_64), but it also explicitly checks for modern, power-efficient Qualcomm Snapdragon chips (ARM64).

Once the base system architecture is identified, the script contacts its storage hub on Hugging Face to check a live file named latest. This ensures that even if you are using an old tutorial link, the script will always locate and deploy the absolute newest release of the software without forcing you to hunt for version numbers.

2️⃣ The Hardware Probes

Running local AI models is an incredibly intense mathematical exercise. To give you the fastest processing speeds possible, the script needs to understand the exact mathematical capabilities of your processor (CPU) and your graphics card (GPU).

First, it drops a tiny utility called vulkan-probe.exe into your temporary folder. Vulkan is a modern, universal graphics framework that allows software to communicate with graphics chips made by Nvidia, AMD, and Intel. If your computer has a compatible graphics card, this probe wakes it up.

The Vulkan advantage: modern Intel AI hardware changes the quantization game When your CPU has a sidekick (and IQ formats still misbehave) — Not All Quantizations Are Born Equal part 5

Next, it launches a clever feature-detection tool called featcode.exe. This utility scans your hardware and generates a highly specific "feature code"—essentially a unique hardware fingerprint. If a capable GPU is found, the script uses that fingerprint to pull down a version of llama.exe compiled specifically to unleash your graphics card.

If your machine doesn't have a dedicated graphics card, the script gracefully pivots to your CPU. It runs featcode.exe again to check for advanced, modern CPU mathematical instructions (like AVX2 or AVX512). It then pulls down a version of the application perfectly optimized for your specific processor chip. To save your internet bandwidth, all of these files are downloaded in a highly compressed format (.zst), which the script decompresses on the fly before wiping out its temporary workspace.

3️⃣ Seamless Workspace Integration

Once the script has safely acquired your custom-tailored llama.exe file, it has to put it somewhere your computer can find it. Traditionally, this required digging into advanced Windows settings to manually edit your system environment paths—a step that frequently trips up non-technical users.

The installer elegantly bypasses this entire headache by moving the file directly into a specific, hidden Windows directory:

%LOCALAPPDATA%\Microsoft\WindowsApps

%LOCALAPPDATA%\Microsoft\WindowsApps

Windows naturally monitors this folder. Because the script drops llama.exe directly into this directory, your operating system registers the command globally and instantly. The moment the installer finishes, you can open any command prompt or PowerShell window anywhere on your machine, type llama, and it will jump to life.⃣

4️⃣ Built For The Future: Seamless Upgrades

What happens when the development team releases a new version with better performance optimizations? This installer handles upgrades flawlessly.

Because Windows prevents you from deleting or overwriting any software binary that is actively running or being monitored, the installer uses a clever non-destructive swap routine. When you re-run the script to update your system, it looks into your WindowsApps folder. If it finds your older version, it doesn't try to delete it; instead, it safely renames it to llama.exe.old and pushes it aside. It then drops the brand-new version into place and cleanly deletes the temporary leftovers.

How to use it, practically

There is one single command now, called llama.

Either you want an interactive CLI chat environment or a compatible OpenAI API server with integrated out-of-the-box Web chat UI.

In the first scenario run the following command to see all the parameters required

llama cli --help

llama cli --help

In the second one run the following command to see all the parameters required

llama serve --help

llama serve --help

The thing is: the parameters are the same of the well known so far llama.cpp, but it is handled gracefully by an unified command.

For example, I run as a server (with integrated Web UI) using the latest Liquid AI MoE model LFM2.5–8B-A1B as follows:

llama.exe serve -m C:\Fabio-AI\Models_big\LFM2.5-8B-A1B-UD-Q4_K_XL.gguf -ngl 99 --mmap -t 4 -ctv q4_0 -ctk q4_0 --reasoning on -fa on --jinja -a lfm258b1a --port 11434 -c 98000

llama.exe serve -m C:\Fabio-AI\Models_big\LFM2.5-8B-A1B-UD-Q4_K_XL.gguf -ngl 99 --mmap -t 4 -ctv q4_0 -ctk q4_0 --reasoning on -fa on --jinja -a lfm258b1a --port 11434 -c 98000

where, to briefly explain the parameters, here a quick overview:

--mmap            enable memory map: load the entire model from the start
-ngl 99           all GPU layers, no CPU. use ngl 0 for CPU only
-t 4              4 threads of CPU
-c 98000          number of tokens for the context window = 92000
--port 11434      port for the llama server. 11434 emulate Ollama port
-fa on            Flash Attention active
-ctk q4_0         K-cache quantized in q4
-ctv q4_0         V-cache quantized in q4
-a lfm258b1a      alias for the model, to be called and explored by others
--reasoning on    Activate model reasoning (think-/think)
--jinja           expose full chat template with function tool calls

--mmap            enable memory map: load the entire model from the start
-ngl 99           all GPU layers, no CPU. use ngl 0 for CPU only
-t 4              4 threads of CPU
-c 98000          number of tokens for the context window = 92000
--port 11434      port for the llama server. 11434 emulate Ollama port
-fa on            Flash Attention active
-ctk q4_0         K-cache quantized in q4
-ctv q4_0         V-cache quantized in q4
-a lfm258b1a      alias for the model, to be called and explored by others
--reasoning on    Activate model reasoning (think-/think)
--jinja           expose full chat template with function tool calls

And that is all.

Your local AI is ready to go!

The Power of Local Agents

Making it easy to run a model is a massive achievement, but the team did not stop there. The new llama.app ecosystem is explicitly built to support the next frontier of artificial intelligence: autonomous agents.

A standard chatbot is reactive. You give it a prompt, it gives you an answer, and then it stops. An AI agent is proactive. When you give an agent a goal, it writes a plan, creates its own prompts, reviews its own output, catches its own mistakes, and loops repeatedly until the goal is achieved.

One of the most exciting examples of this technology is a terminal-based open-source coding agent named Pi, developed under the repository https://github.com/earendil-works/pi.

Pi is your coding assistant that lives inside your project folders. When you launch Pi inside a workspace, it reads your code files, creates new features, modifies existing logic, runs terminal tests to see if the changes worked, and debugs its own errors until the software compiles perfectly.

Running an agent like Pi on commercial cloud models is incredibly expensive. Because the agent must read your entire codebase and talk to the AI dozens of times to solve a single bug, a single automated coding session can easily drain a massive amount of cloud credits. Furthermore, letting an internet-connected cloud agent read your entire private codebase is an absolute dealbreaker for many security-conscious developers.

But if you combine the new llama serve stack with Pi, you get the ultimate developer setup. You get an autonomous coding assistant that works completely offline, costs absolutely nothing to run for thousands of hours, and keeps your proprietary source code safe inside your machine.

To make this even more magical, the Hugging Face team released a dedicated plugin called pi-llama. Thanks to this integration, you no longer have to spend hours editing configuration files or mapping complex network ports. You simply start your local AI server, open your agent, and Pi will automatically discover the model running on your computer. It is a zero-configuration, fully automated local laboratory.

More information here:

Why This Changes Everything

It is easy to get caught up in the technical novelty of running open-source code, but the real impact of llama.app is cultural and philosophical.

For the last several years, the narrative around artificial intelligence has been dominated by a sense of inevitable centralization.

We were told that AI models are too massive, too complex, and too resource-hungry to ever belong to the individual. The dominant belief was that citizens and creators would always have to rent intelligence from a handful of multi-billion-dollar cloud monopolies.

The launch of llama.app shatters that narrative completely. It proves that the open-source community will not rest until world-class technology is accessible, user-friendly, and completely decentralized.

Now, with a clean, unified command structure, the llama.cpp maintainers have transformed local AI from a complex engineering hobby into a mainstream utility. You just need a modern computer, a single installation command, and a curiosity to explore what is possible when your tools belong entirely to you.

The barrier to entry has vanished.

Local AI is officially ready for prime time, and it lives right on your computer.

I hope you enjoyed the article. If this story provided value and you wish to show a little support, you could:

Clap a lot of times for this story
Highlight the parts more relevant to be remembered (it will be easier for you to find them later and for me to write better articles)
Join my totally free weekly Substack newsletter here
Follow me on Medium
Follow my publication https://medium.com/artificial-intel-ligence-playground

If you want to read more, here are some ideas:

llama.CPP restyle is the workshop for your Local AI From LLM routing to full working Chat application: all in one ZIP file. And here is how.

You don't need an AI agent for every single thing! Your AI your Rules: Open Web UI and llama.cpp are all you need to work with your documents in full privacy and…

LLM-wiki local & locall LLM: part 2 How to implement LLM-Wiki with opencode and llama.cpp, all tricks included

Are you too a Poor-GPU-guy? Here's how to run 400B parameter Models for free A complete guide to NVIDIA NIM's free tier: get hundreds of API calls, access frontier models like Llama 3.3 and…

Contents