May 10, 2026
Run Qwen3.6-35B-A3B on 6GB VRAM Using Llama.cpp (~30 tps)
In 2026, the latest release of the Qwen3.6–35B-A3B AI model, combined with recent updates to Llama.cpp, marks a significant improvement…
Minyang Chen
7 min read
In 2026, the latest release of the Qwen3.6–35B-A3B AI model, combined with recent updates to Llama.cpp, marks a significant improvement over previous versions for both local and production inference.
It's hard to believe these advancements make it possible to run larger models efficiently on low-end consumer desktops and laptops equipped with older CPUs and limited VRAM, achieving speeds close to 30 tokens per second (TPS). so I thought was worth to share my findings.
Objective
Typically, low-end desktops and virtual machines can only handle smaller quantized models (e.g., 2B, 4B, or 7B parameters) with reasonable performance. However, these smaller models are limited to context windows of only 32K, 48K, or 128K tokens. Meanwhile, larger models usually require substantial VRAM, making them costly and impractical to run locally.
Therefore, the objective of this article is to demonstrate an efficient approach for running the 35B-parameter model with a 256K-token context window using Llama.cpp. We will show how to optimize it for low-end consumer desktops and laptops with just 6 GB of VRAM and 32 GB of RAM, while still achieving performance comparable to that of smaller models.
The Model
Released by Alibaba in April 2026, Qwen3.6 35B A3B is a sparse Mixture-of-Experts (MoE) model. It is designed to deliver advanced coding and reasoning performance while maintaining the efficiency typically found in much smaller models.
Why is Qwen3.6–35B-A3B worth running?
This model contains 35 billion total parameters but only activates 3 billion per token. This sparse architecture enables it to run at speeds comparable to a 3B model while delivering the reasoning capabilities of much larger dense models, such as Gemma 4–31B.
In my view, this release marks a significant leap forward compared to previous versions, delivering noticeably better responses. Here are four key advantages:
• Agentic Coding: The "6" in Qwen 3.6 highlights its evolution into a "coding agent." It features major upgrades in frontend workflows and repository-level reasoning, making it highly capable of handling complex, multi-step development tasks.
• Thinking Preservation: The new preserve_thinking feature allows the model to retain reasoning context across conversational turns. This reduces computational overhead during iterative development and helps the model maintain logical consistency throughout long sessions.
• Multimodal Reasoning: Unlike many coding-focused models, it is natively multimodal, supporting text, image, and video inputs. This enables it to "see" code screenshots or UI designs to assist with debugging and frontend generation.
• Massive Context Window: With a native context length of 262,144 tokens (expandable to 1 million tokens using YaRN), it can analyze entire codebases in a single prompt. This extensive window supports highly complex, multi-step prompt processing.
🔗 Model Link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
Impressive benchmark
The Inference Engine
llama.cpp is widely regarded as the "gold standard" for local Large Language Model (LLM) inference. It addresses the two most common challenges of running AI locally: complex dependency management and hardware limitations. Additionally, it benefits from continuous, community-driven improvements.
In December 2025, llama.cpp introduced a major automation feature that dynamically manages GPU layer offloading and memory allocation. In recent updates, the engine can automatically determine how to split models that exceed available GPU memory (VRAM). Instead of manually estimating how many layers will fit, you can now let the system handle the allocation:
- Automatic Memory Testing: The engine runs "virtual test allocations" to assess memory requirements. If the model doesn't fit, it iteratively optimizes memory usage — first by reducing context size, then by offloading tensors to system RAM — until the model fits across your available hardware.
- Partial Offloading: The system prioritizes keeping performance-critical components, such as dense tensors and attention weights, on the GPU. If the model exceeds GPU capacity, the remaining layers are automatically offloaded to system RAM.
- Command-Line Options: For Mixture-of-Experts (MoE) models, use flags like — n-cpu-moe or — cpu-moe to offload heavy expert weights to the CPU while keeping critical attention layers on the GPU. Additionally, the — fit flag allows the engine to automatically calculate the optimal layer and tensor distribution based on your hardware.
Github url: https://github.com/ggml-org/llama.cpp.git
Before diving into the lab setup details, let's review the typical hardware requirements for running large models locally.
Typical Hardware Requirements for Large Language Models
Qwen3.6–35B-A3B VRAM Requirements — MoE Hardware & GPU Guide (April 2026) Source: https://willitrunai.com/blog/qwen-3-6-vram-requirements
- Qwen3.6–35B-A3B VRAM (Q4_K_M): ~21 GB — fits comfortably on a 24 GB GPU.
- Qwen3.6–35B-A3B VRAM (Q8_0): ~37 GB — requires a 48 GB GPU or a Mac M4 Max with 64 GB+ unified memory.\
- Release Date: Open-source weights were released on April 16, 2026, via Hugging Face and ModelScope.
- Recommended Hardware: 24 GB VRAM or 32 GB unified memory for Q4_K_M quantization; 48 GB VRAM for Q8_0 with long-context support.
- Qwen3.6–27B (Dense Model): Requires ~16.8 GB VRAM at Q4_K_M quantization. See the dedicated guide for details.
For systems with limited VRAM, less than 24 GB GPU do offloading layers to the CPU.
Alternatively, you can run the model entirely on the CPU.
Experimental Demo Lab Setup
For this experimental demo, I used a few-year-old Acer Aspire desktop bought from Best buy running Ubuntu 22.04. The system is equipped with an Intel Core i5–11400 (6-core CPU), 32 GB of RAM, and a low-profile NVIDIA RTX A2000 GPU (6 GB VRAM, 70W TDP).
Setup Steps
- Install build dependency libraries and tools.
- Download and install the NVIDIA CUDA Toolkit 13.x.
- Download and compile the latest version of llama.cpp.
- Download a quantized version of the Qwen3.6–35B model.
- Run the llama.cpp inference server in API/router mode.
- Launch the llama.cpp web chat UI to perform testing.
STEP 1. Install build dependency libraries and tools
llama.cpp dependencies
$ sudo apt install build-essential gcc g++ make freeglut3-dev libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev$ sudo apt install build-essential gcc g++ make freeglut3-dev libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-devcuda toolkits dependencies
$ sudo apt update && sudo apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev -y$ sudo apt update && sudo apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev -ySTEP 2. Download and Install the Nvidia Cuda Toolkit 13.x
Run installation:
“wget https://developer.download.nvidia.com/compute/cuda/13.0.1/local_installers/cuda_13.0.1_580.82.07_linux.run
chmod +x cuda_13.0.1_580.82.07_linux.run
sudo ./cuda_13.0.1_580.82.07_linux.run“wget https://developer.download.nvidia.com/compute/cuda/13.0.1/local_installers/cuda_13.0.1_580.82.07_linux.run
chmod +x cuda_13.0.1_580.82.07_linux.run
sudo ./cuda_13.0.1_580.82.07_linux.runDuring the CUDA Toolkit installation, select only the CUDA Toolkit component. Skip the NVIDIA GPU driver installation if you already have one installed.
Once the installation is complete, configure the following environment variables:
export CUDA_HOME=/usr/local/cuda-13.0/
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATHexport CUDA_HOME=/usr/local/cuda-13.0/
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATHSTEP 3. Download and compile the latest version of llama.cpp
git clone https://github.com/ggml-org/llama.cpp.gitgit clone https://github.com/ggml-org/llama.cpp.gitCompile and build
## build for cuda
export CUDA_HOME=/usr/local/cuda-13.0/
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
## build for cuda with flash on and quantization of cache
cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \
-DGGML_CCACHE=OFF \
-DLLAMA_CURL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake - build build - config Release -j 6 --clean-first## build for cuda
export CUDA_HOME=/usr/local/cuda-13.0/
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
## build for cuda with flash on and quantization of cache
cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc \
-DGGML_CCACHE=OFF \
-DLLAMA_CURL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake - build build - config Release -j 6 --clean-firstSTEP 3. Download a quantized version of the Qwen3.6–35B model
- Unsloth Q4 model (22.4 GB) GGUF: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main
$ wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf”$ wget https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf”STEP 5. Run the llama.cpp inference server in API/router mode
cd /datadisk/llama_cpp_server
MODEL_DIR="models/router"
LLAMA_SERVER="llama.cpp/build/bin/llama-server"
## Run
$LLAMA_SERVER \
--host 0.0.0.0 --port 8080 \
--models-dir $MODEL_DIR \
--models-max 2 \
--log-verbosity 3 \
--direct-io \
--ctx-size 131072 \
--parallel 2 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.0 \
--batch-size 2048 --ubatch-size 512 \
--n-predict 66560 \
--fit on \
--threads 6 \
--mlock --no-mmap \
--cont-batching \
--jinja \
--flash-attn auto \
--embeddings --pooling mean \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers -1 \
--chat-template-kwargs "{\"preserve_thinking\":true}" --chat-template-kwargs "{\"enable_thinking\":true}" \
--reasoning-format auto --reasoning auto --chat-template-kwargs "{\"reasoning_effort\":\"high\"}" \
--image-min-tokens 512 --image-max-tokens 32786 \cd /datadisk/llama_cpp_server
MODEL_DIR="models/router"
LLAMA_SERVER="llama.cpp/build/bin/llama-server"
## Run
$LLAMA_SERVER \
--host 0.0.0.0 --port 8080 \
--models-dir $MODEL_DIR \
--models-max 2 \
--log-verbosity 3 \
--direct-io \
--ctx-size 131072 \
--parallel 2 \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.0 \
--batch-size 2048 --ubatch-size 512 \
--n-predict 66560 \
--fit on \
--threads 6 \
--mlock --no-mmap \
--cont-batching \
--jinja \
--flash-attn auto \
--embeddings --pooling mean \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers -1 \
--chat-template-kwargs "{\"preserve_thinking\":true}" --chat-template-kwargs "{\"enable_thinking\":true}" \
--reasoning-format auto --reasoning auto --chat-template-kwargs "{\"reasoning_effort\":\"high\"}" \
--image-min-tokens 512 --image-max-tokens 32786 \In this sample setup, two parallel instances each support a 65K context window. The latest llama.cpp server offers many customizable parameters, but for this experimental lab test, I only used the ones that proved most effective.
For example, reducing the context size improves speed but may limit long-context inference performance. For coding tasks, it's best to set ctx-size to 262144 and parallel to 1.
after server startup and loaded Qwen3.6–35B model,m check vram and CPU ram usage.
STEP 6. Launch the llama.cpp web chat UI to perform testing.
on browser open url http://localhost:8080 the load model qwen3.6–35b-a3b, give a minute on first time loading, on completion you see should a green pop-up message say model loaded.
It's worth noting that this model supports advanced reasoning and thinking capabilities while operating with a large context window. It can also break up complex questions into sub-question for the models to perform inference on each question and final self-reflected answer.
Additionally, upgrading to a GPU with at least 8GB of VRAM will significantly improve performance and processing speed.
Final Thoughts
Is this experiment worth it? Absolutely. The results clearly show that running the Qwen3.6 model in llama.cpp is efficient, powerful, and versatile. More importantly, it demonstrates a practical, cost-effective way to repurpose older hardware for running AI models locally or on virtual machines.
Kudos to the llama.cpp community for continuously releasing impressive tools, and to the Qwen team for delivering such advanced models.
About the Model The Qwen3.6 35B-A3B is a sparse Mixture-of-Experts (MoE) model with 35 billion total parameters and only 3 billion active parameters. It is released under the Apache 2.0 license. Key capabilities include:
- Agentic coding performance comparable to models with 10× its active parameter count
- Strong multimodal perception and reasoning abilities
- Dual operational modes: "thinking" mode for complex reasoning and "non-thinking" mode for faster, direct responses
One notable improvement in Qwen3.6 is its refined reasoning process, which effectively prevents the infinite thinking loops that plagued earlier Qwen3.5 versions (including the 2B, 4B, 9B, and 30B quantized models).
New llama.cpp Features That Made This Possible
- Automatic Layer Distribution: Instead of manual configuration, the system uses "virtual test allocations" to iteratively find the optimal way to distribute model layers across available GPUs.
- Intelligent Memory Scaling: When a model exceeds available VRAM, llama.cpp automatically:
- Reduces the context window to free up memory first.
- Offloads remaining tensors from VRAM to system RAM as a fallback.
- Priority-Based Offloading (MoE Optimization): For Mixture-of-Experts models, dense tensors are prioritized for VRAM to maximize performance.
- Layer "Overflow" Support: To avoid wasting VRAM on smaller GPUs, individual model layers can now be split ("overflow") between GPU memory and system RAM.
This combination of hardware efficiency, smart memory management, and advanced model architecture makes local AI deployment more accessible than ever.
Thanks for reading!
Have a good day!
/MC
REFERENCES:
Llama.cpp https://github.com/ggml-org/llama.cpp
Qwen Blog:https://qwen.ai/blog?id=qwen3.6-35b-a3b
Qwen HuggingFace:https://huggingface.co/Qwen/Qwen3.6-35B-A3B
Qwen ModelScope:https://modelscope.cn/models/Qwen/Qwen3.6-35B-A3B