Your own LLM on a CPU VPS: Ollama + llama.cpp with 7B-13B models

May 08, 2026 · 8 min read
Valebyte Team
To run Ollama and 7B-13B models on a CPU VPS, it is optimal to use a server with 32 GB RAM and 8 vCPU cores, which provides a generation speed of 5-15 tokens per second at a rental cost starting from $30-40 per month. This approach allows you to deploy a full-fledged ChatGPT alternative for private use, API testing, or task automation without the need to rent expensive GPU instances.

Choosing the Hardware Configuration for Ollama VPS

The effective performance of an ollama vps on ordinary processors depends less on clock speed than on the amount of RAM and the processor's support for modern instruction sets (AVX2, AVX-512). When choosing a server for local llm hosting, it is critical to understand how the model interacts with the hardware. Unlike GPUs, where video memory (VRAM) bandwidth is the deciding factor, on a CPU the main load falls on the system memory bus and the number of threads.

For comfortable operation of models like Mistral 7B or Llama 3 8B, a minimum of 16 GB of RAM is required, but 32 GB is the "gold standard": it allows loading models with less aggressive quantization (e.g., Q8_0 instead of Q4_K_M), which directly improves the quality of responses. If your goal is your own GPT working with large context windows (32k tokens and above), RAM capacity becomes the main limiting factor.

Minimum and Recommended Server Specifications

| Specification | Minimum (7B models) | Optimal (7B-13B models) | High-end (30B+ models) |
|---|---|---|---|
| Processor (vCPU) | 4 cores (AVX2) | 8 cores (high frequency) | 16-32 cores |
| RAM | 16 GB | 32 GB | 64-128 GB |
| Disk type | NVMe (required) | NVMe Gen4 | NVMe RAID |
| OS | Ubuntu 22.04 LTS | Ubuntu 24.04 LTS | Debian 12 |
| Expected speed | 3-5 tokens/sec | 8-15 tokens/sec | 1-3 tokens/sec |
When planning your budget, keep in mind that Valebyte, a managed hosting alternative roughly 3 times cheaper than Cloudways, can help you save on infrastructure and free up funds for a more powerful processor. NVMe drives are critical for the initial loading of model weights into memory: with standard SSDs you may wait 2-3 minutes every time you restart the service or switch models.

llama.cpp Technology and the Magic of Quantization

The foundation of most modern solutions for running neural networks on CPUs is llama.cpp. This C++ project implements efficient matrix-multiplication algorithms adapted for x86 and ARM architectures, and it is thanks to llama.cpp that running heavy models on standard server hardware became possible at all. The key concept here is quantization. Original models from Meta or Mistral AI are supplied in FP16 format (16 bits per weight); a 7B model in this form takes up about 14 GB. Quantization compresses the weights to 4 or 8 bits. The GGUF format used by Ollama stores the model in a single file where the weights are already optimized for the CPU.
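The memory arithmetic behind this can be sketched in a few lines. The bits-per-weight figures and the fixed overhead below are rough approximations for illustration, not exact GGUF file sizes:

```python
# Rough RAM estimate for a quantized model:
# parameters * bits-per-weight / 8, plus a small fixed allowance
# for context / KV cache (all figures are approximate).
def model_ram_gb(params_billions: float, bits_per_weight: float,
                 overhead_gb: float = 1.0) -> float:
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return round(weights_gb + overhead_gb, 1)

# A 7B model: FP16 (16 bits) vs. Q4_K_M (~4.5 bits/weight on average).
fp16 = model_ram_gb(7, 16)   # weights ~14 GB plus overhead
q4   = model_ram_gb(7, 4.5)  # fits comfortably in a 16 GB instance
```

This is why the same 7B model that needs a 16 GB FP16 footprint drops to roughly 4-5 GB after 4-bit quantization.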

Why the GGUF Format is Ideal for VPS

  • Memory Savings: A mistral 7b vps model in Q4_K_M quantization takes up only 4.1 GB of RAM instead of 14 GB.
  • Inference Speed: The fewer bits used per weight, the faster the processor can perform calculations, although model accuracy decreases slightly.
  • Versatility: The same file works on Linux, macOS, and Windows via the llama.cpp wrapper.
For those planning a migration from Hetzner to Valebyte, it is important to ensure that the new instances support the CPU flags necessary for accelerating mathematical operations. You can check this with the command lscpu | grep Flags.
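If you want to script that check, for example as part of a provisioning routine, a small helper like this can parse a flags line. This is an illustrative sketch, not part of Ollama; on a real server you would feed it the `flags` line from /proc/cpuinfo:

```python
# Report which of the SIMD extensions llama.cpp benefits from are
# absent in a /proc/cpuinfo-style flags line.
def missing_simd_flags(flags_line: str,
                       wanted=("avx2", "avx512f")) -> list[str]:
    present = set(flags_line.lower().split())
    return [f for f in wanted if f not in present]

sample = "fpu vme sse sse2 avx avx2 fma"
missing_simd_flags(sample)  # avx2 is present, avx512f is not
```

An empty result means the instance exposes everything on your wanted list; a non-empty one tells you exactly which acceleration path is unavailable.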

Looking for a reliable server for your projects?

VPS from $10/mo and dedicated servers from $9/mo with NVMe, DDoS protection, and 24/7 support.

View offers →

Step-by-Step Ollama Installation on Linux VPS

The Ollama installation process is as simple as possible: the developers provide a script that automatically detects the system architecture and installs the necessary dependencies. We recommend a clean Ubuntu 22.04 or 24.04 install.
curl -fsSL https://ollama.com/install.sh | sh
Once the installation is complete, the service will automatically start in the background. You can check the status with the command systemctl status ollama. By default, Ollama listens on port 11434 on localhost. If you plan to access the API externally, you will need to configure environment variables.

Configuring Remote API Access

By default, Ollama blocks external connections for security reasons. To allow access, edit the service configuration:
sudo systemctl edit ollama.service
Add the following lines to the [Service] section:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
Then restart the daemon and the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Now your ollama vps is ready to accept requests. This is useful if you are building a VPS for a VPN business and want to integrate an AI chatbot for customer support directly into the control panel.
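With the port open, the API can be called with a plain HTTP POST. A minimal client sketch using only the Python standard library; the host and model name are placeholders you would substitute for your own:

```python
import json
import urllib.request

# Build the request body for Ollama's /api/generate endpoint.
# stream=False asks for one JSON response instead of a token stream.
def build_generate_request(model: str, prompt: str) -> dict:
    return {"model": model, "prompt": prompt, "stream": False}

def generate(host: str, model: str, prompt: str) -> str:
    body = json.dumps(build_generate_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"http://{host}:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming reply carries the text in the "response" field.
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama instance with the model pulled):
# print(generate("127.0.0.1", "llama3", "Why is the sky blue?"))
```

The same endpoint works identically from curl or any other HTTP client, which is what makes integrating the server into existing tooling straightforward.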

Running Mistral 7B and Llama 3 8B Models

After installation, you can proceed to download the models. For CPU servers with 32 GB of RAM, the best choices are the Llama 3 (8B) and Mistral (7B) families. They offer an excellent balance between reasoning quality and text generation speed.

Commands to Run Popular Models

  • Llama 3 8B: ollama run llama3 — the industry standard for general tasks.
  • Mistral 7B: ollama run mistral — better at summarization and coding.
  • Mistral NeMo 12B: ollama run mistral-nemo — a new model with an increased context window, requires about 12-14 GB of RAM.
  • Phi-3 Mini: ollama run phi3 — an ultra-fast model from Microsoft, delivering 20+ tokens/sec even on weak CPUs.
On the first run, Ollama downloads the model weights (about 4-8 GB). Thanks to NVMe on Valebyte servers, verification and loading into memory take just a few seconds. If you previously used foreign clouds and ran into payment issues, our comparison of VLESS-Reality vs WireGuard will help you ensure stable access to your server from anywhere in the world.
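As a rule of thumb, the model you can run is bounded by free RAM. This illustrative helper encodes the sizes listed above as rough thresholds; they are guidelines, not official requirements:

```python
# Map available RAM to a reasonable Q4-quantized model choice.
# Thresholds are rough rules of thumb based on the sizes above.
def pick_model(free_ram_gb: float) -> str:
    if free_ram_gb >= 12:
        return "mistral-nemo"  # 12B, needs about 12-14 GB
    if free_ram_gb >= 6:
        return "llama3"        # 8B, roughly 5-6 GB in Q4
    if free_ram_gb >= 3:
        return "phi3"          # 3.8B, fits in a few GB
    return "none: add RAM or use stronger quantization"
```

A 32 GB instance clears every threshold, which is exactly why it is the recommended tier for 7B-13B models.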

Deploying OpenWebUI: A Graphical Interface for Your GPT

Working with AI through a terminal is not always convenient. To get an interface identical to ChatGPT, we will install OpenWebUI (formerly known as Ollama WebUI). This is a powerful web application that supports user authentication, chat history, document uploads (RAG), and custom prompt creation.

Installation via Docker Compose

The easiest way to deploy is using Docker. Create a docker-compose.yml file:
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - open-webui:/app/backend/data
    restart: always

volumes:
  open-webui:
Start the container with the command docker compose up -d. Your personal AI assistant is now available at http://your-server-ip:3000. The first account registered becomes the administrator. Inside, you can select the models installed in Ollama and start chatting.

Performance Optimization: Squeezing the Most Out of Your CPU

Running an LLM on a processor requires fine-tuning the operating system. By default, Linux may try to save energy or incorrectly distribute resources between cores, leading to "stuttering" in text output.

System Tuning Recommendations

  1. CPU Governor: Set the mode to maximum performance.
    echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  2. Disable SWAP: If the model doesn't fit in RAM, using a swap file on the disk will slow down performance by 100x. It's better to use a smaller model than to allow swapping.
  3. Numa Nodes: If you have a powerful dedicated server with two processors, use numactl to bind the Ollama process to one group of cores and their local memory.
  4. Thread Count: Ollama detects the number of cores automatically, but manually lowering the thread count (the num_thread model option) or limiting concurrent requests with OLLAMA_NUM_PARALLEL can help avoid overheating and throttling.
It is important to remember that local llm hosting consumes 100% of the selected vCPU core resources during generation. This is normal. However, if you are hosting other heavy services in parallel, such as a Rust server on a VPS, resource conflicts may occur, leading to in-game lag and slower AI responses.
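CPU inference is typically memory-bandwidth bound: generating each token requires reading roughly the entire set of model weights from RAM, so tokens/sec is bounded by memory bandwidth divided by model size. A back-of-the-envelope sketch; the bandwidth figure is a placeholder you would measure on your own hardware:

```python
# Rough upper bound for memory-bandwidth-bound generation:
# every token touches (approximately) all model weights once, so
# speed <= effective bandwidth / model size in memory.
def max_tokens_per_sec(mem_bandwidth_gb_s: float,
                       model_size_gb: float) -> float:
    return round(mem_bandwidth_gb_s / model_size_gb, 1)

# e.g. a VPS with ~50 GB/s effective bandwidth and a 4.1 GB Q4 model:
max_tokens_per_sec(50, 4.1)  # lands in the 8-15 t/s range quoted above
```

This also explains the table above: a 30B+ model is several times larger in memory, so the same bandwidth yields only 1-3 tokens/sec no matter how many cores you add.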

Cost Comparison: CPU VPS vs. GPU Cloud

Many beginners believe that a video card like the NVIDIA A100 or H100 is mandatory for AI. That is true for training, but for inference (actually running) 7B-13B models, a CPU VPS is much more cost-effective.
| Hosting type | Approximate monthly price | Pros | Cons |
|---|---|---|---|
| GPU Cloud (A10) | $150 - $300 | Very high speed (50+ t/s) | Expensive, pay for idle time |
| Valebyte CPU VPS (32 GB) | $35 - $50 | Fixed price, plenty of RAM | Moderate speed (~10 t/s) |
| Serverless AI API | ~$0.50 per 1M tokens | No setup required | Lack of privacy, censorship |
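Whether the fixed-price VPS beats a pay-per-token API depends entirely on volume. A quick break-even sketch using the illustrative prices from the comparison above:

```python
# Monthly token volume (in millions) at which a fixed-price VPS
# becomes cheaper than a pay-per-token API.
# Prices are illustrative, taken from the comparison above.
def breakeven_tokens_millions(vps_monthly_usd: float,
                              api_usd_per_million: float) -> float:
    return round(vps_monthly_usd / api_usd_per_million, 1)

breakeven_tokens_millions(40, 0.50)  # 80.0 -> ~80M tokens/month
```

Below that volume the API is cheaper on paper; above it, or whenever privacy matters, the VPS wins.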
Using your own server ensures complete data privacy. Your prompts and RAG documents do not go to OpenAI or Anthropic. This is critical for the corporate sector or developers working with confidential code.

Ollama Security and Monitoring

Deploying an ollama vps requires attention to security, especially if the API is reachable from the internet. We recommend closing port 11434 with ufw and allowing access only from your own IP or through a VPN tunnel.

To monitor the load, use htop or btop. During a request you will see all vCPU cores hit 100% while memory consumption stays flat; that is the nature of llama.cpp inference on a CPU. If the Ollama process exits with an "Out of Memory" error, the selected model is too large for your RAM. In that case, try a version with stronger quantization (e.g., Q3_K_S).
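A minimal firewall sketch for that setup. The 203.0.113.10 address is a placeholder for your own trusted IP; note that the specific allow rule is added before the broad deny, since ufw evaluates rules in order:

```shell
# Keep SSH open, permit the Ollama port only from one trusted address,
# block it for everyone else, then enable the firewall.
sudo ufw allow OpenSSH
sudo ufw allow from 203.0.113.10 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
sudo ufw enable
```

Run sudo ufw status numbered afterwards to confirm the rule order before exposing the server.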

Conclusions

To run Ollama with 7B-13B models, it is optimal to use a VPS with 32 GB RAM and 8 vCPU cores, which will provide a stable 10 tokens per second. This is sufficient for most tasks: from writing code to analyzing documents, while the cost of the solution is 5-6 times lower than renting a GPU server.

Ready to choose a server?

VPS and dedicated servers in 72+ countries with instant activation and full root access.

Start now →
