Self-hosted ChatGPT alternative: OpenWebUI + Ollama + RAG in 30 minutes

May 08, 2026 · 7 min read
Valebyte Team
To launch your own ChatGPT-style service on a VPS, with RAG support and document uploads, you will need a server with at least 16-32 GB RAM and 8 vCPUs. A combination of Ollama and OpenWebUI lets you process corporate data locally for around $90/mo without transferring information to third-party companies. This approach removes your dependency on the OpenAI and Anthropic APIs and keeps confidential information under your full control.

Which server to choose for your ChatGPT VPS?

The effective performance of a local large language model (LLM) directly depends on the amount of RAM and processor speed if you are not using expensive GPUs. For comfortable work for 1-5 users with models like Llama 3.1 8B or Mistral 7B, it is optimal to choose VPS-L level plans or entry-level dedicated servers.

Hardware Technical Requirements

The main load during text generation falls on the CPU and RAM. Unlike training, model inference can be performed on the processor if you use quantized models (GGUF format). RAM is critical: an 8B model in 4-bit quantization takes up about 5 GB, but RAG (Retrieval-Augmented Generation) and context caching require extra overhead.
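As a back-of-the-envelope check, the RAM budget for an 8B model can be sketched like this (the figures are rough assumptions in the spirit of the numbers above, not benchmarks):

```shell
# Rough RAM budget for an 8B model at 4-bit quantization.
# All figures are approximate assumptions, not measurements.
MODEL_GB=5      # quantized weights (Q4_K_M, ~5 GB as noted above)
KV_CACHE_GB=2   # context cache for a few thousand tokens
RAG_GB=2        # embedding model + vector store
OS_GB=2         # OS, Docker, and OpenWebUI itself
echo "Total: $((MODEL_GB + KV_CACHE_GB + RAG_GB + OS_GB)) GB"  # → Total: 11 GB
```

Eleven gigabytes already exceeds the 8 GB "Minimum" tier below, which is why 16-32 GB is the comfortable recommendation once RAG is in play.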
| Parameter | Minimum (Slow) | Recommended (Fast) | Enterprise Standard |
|---|---|---|---|
| vCPU Cores | 4 Cores | 8-12 Cores | 16+ Cores |
| RAM | 8 GB | 16-32 GB | 64 GB+ |
| Disk (NVMe) | 40 GB | 100 GB | 500 GB+ |
| Estimated Price | $20-30/mo | $60-90/mo | $150+/mo |
If you are planning to migrate from complex cloud platforms, we recommend studying moving from AWS Lightsail/EC2 to dedicated, which can save up to $2000 per month when running heavy models.

CPU vs GPU on VPS

For most small business tasks, renting a server with a GPU (e.g., NVIDIA A100 or RTX 4090) is overkill in terms of price. Modern processor instructions (AVX2, AVX-512) allow Ollama to deliver speeds of 10-15 tokens per second on a standard VPS, which is enough to read the text in real time as it is generated. The key factors are then core clock speed and L3 cache size.
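A quick sanity check on that "real-time" claim (the token-to-word ratio is an approximation; one token is roughly 0.75 English words, scaled by 75/100 here to stay in integer arithmetic):

```shell
# Is 12 tokens/s really real-time? Convert to words per minute.
TOKENS_PER_SEC=12
WPM=$((TOKENS_PER_SEC * 75 * 60 / 100))
echo "$WPM words/min"   # → 540 words/min, about 2x a typical reading speed of ~250 wpm
```

Even the conservative end of the 10-15 tokens/s range outpaces how fast most people read.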

Step-by-step OpenWebUI setup: from Docker to the first model

OpenWebUI is one of the most feature-complete interfaces for working with LLMs, visually mimicking ChatGPT but running entirely on your server. It supports multi-user mode, model management, and a built-in engine for RAG.

Installing Docker and the base environment

To start working on a clean Ubuntu 22.04/24.04, you need to install Docker Engine. We recommend using containerization to isolate system components.
sudo apt update && sudo apt upgrade -y
sudo apt install curl git -y
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
After installing Docker, you can proceed to deploy the Ollama + OpenWebUI stack. The easiest way is to use a ready-made Docker Compose file or a single launch command that combines the interface and the backend.
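Such a Docker Compose file might look like the sketch below, which runs Ollama and OpenWebUI as separate containers on a shared network (image tags, the port mapping, and service names are assumptions to adapt to your environment):

```shell
# Write a hypothetical docker-compose.yml for the Ollama + OpenWebUI stack
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama:/root/.ollama
    restart: always
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
    restart: always
volumes:
  ollama:
  open-webui:
EOF
```

With the file in place, `docker compose up -d` brings up both services; OpenWebUI reaches the Ollama backend over the internal network via OLLAMA_BASE_URL.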

Running OpenWebUI with Ollama support

To implement a private GPT on your VPS, we use the bundled image, which already ships with Ollama and all the dependencies needed for working with vector databases.
docker run -d -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama
After executing this command, the interface will be available at http://your_server_IP:3000. On your first login, you will be prompted to create an administrator account. All user data and chat history will be stored locally in a Docker volume. Details on backend configuration can be found in the guide about your own LLM on CPU VPS: Ollama + llama.cpp.


Configuring RAG for local ChatGPT: working with PDFs and knowledge bases

The main advantage of a self-hosted GPT over public services is the ability to "feed" the neural network internal company documents (NDAs, technical specifications, regulations) without the risk of them entering the training sets of global models.

How RAG works in OpenWebUI

RAG (Retrieval-Augmented Generation) works according to the following algorithm:
  1. You upload a file (PDF, DOCX, TXT) to the interface.
  2. The system breaks the text into chunks.
  3. A special embedding model (e.g., nomic-embed-text) turns the text into vectors.
  4. When a user asks a question, the system searches for the most similar fragments in the local knowledge base.
  5. The found context is passed to the main model along with your question.
In OpenWebUI, RAG configuration happens in the "Documents" section. You can upload an entire folder of documentation or a project's codebase. For correct operation, ensure that an embedding model is selected in the settings. By default, the CPU version is used, which is perfect for our VPS.
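Step 4 — finding the most similar fragments — can be illustrated with a toy example. Real embeddings have hundreds of dimensions (nomic-embed-text produces 768), but the principle is the same; the chunks and the 3-dimensional vectors below are made up:

```shell
# Toy nearest-chunk search by dot-product similarity.
# Columns: chunk text, then its (fake) 3-dimensional embedding.
cat > chunks.csv <<'EOF'
refund policy,0.9,0.1,0.0
office hours,0.1,0.8,0.2
vpn setup,0.0,0.2,0.9
EOF
QUERY="0.85 0.15 0.05"   # pretend embedding of the user's question
awk -v q="$QUERY" -F',' '
  BEGIN { split(q, qv, " ") }
  { dot = $2*qv[1] + $3*qv[2] + $4*qv[3]
    if (dot > best) { best = dot; text = $1 } }
  END { print "best chunk:", text }' chunks.csv
# → best chunk: refund policy
```

The winning fragment is what gets prepended to your question in step 5, so the model answers from your documents rather than from its training data alone.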

Uploading codebase and PDFs

To make your local ChatGPT an expert in your project, use the collections feature. You can create a collection called "Project_Alpha" and upload all .py or .js files there. When chatting with the model, simply mention the collection using the # symbol, and the neural network will use your code as context for its answers. This turns a regular chat into a full-fledged tool on the level of GitHub Copilot, but with private data storage.

Self-hosted GPT security and corporate isolation

When deploying a corporate chat based on an OpenWebUI setup, attention must be paid to perimeter protection. An open port 3000 is a direct security threat.

Configuring HTTPS and Nginx Reverse Proxy

Never use HTTP to transmit corporate data. Install Nginx and obtain a free Let's Encrypt SSL certificate. This will encrypt traffic between your browser and the VPS.
sudo apt install nginx certbot python3-certbot-nginx -y
# Nginx configuration example (the WebSocket headers are required by OpenWebUI)
server {
    listen 80;
    server_name chat.yourcompany.com;
    location / {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
# Issue the certificate (certbot rewrites the config for HTTPS)
sudo certbot --nginx -d chat.yourcompany.com
If you are moving from other hosts, for example, planning a migration from Hetzner to Valebyte, don't forget to update DNS records and reissue certificates.

Restricting access via VPN

For maximum security, it is recommended to close access to port 80/443 to the outside world and allow it only through an internal network. You can set up your own VPN on the same or an adjacent server. A great option would be using the 3x-ui panel for Reality configuration, which will provide hidden and fast access for employees to the corporate AI.
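If you keep Nginx in front, the same restriction can also be expressed at the proxy level, so that only clients inside the tunnel reach the interface (the 10.8.0.0/24 subnet below is an assumed VPN range — substitute your own):

```nginx
# Allow only clients coming from the VPN subnet (example range)
location / {
    allow 10.8.0.0/24;
    deny all;
    proxy_pass http://localhost:3000;
}
```

Combined with a firewall rule blocking port 3000 from the outside, this leaves the VPN as the only path to the corporate AI.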

Model comparison for a private GPT VPS: Llama 3.1 vs Mistral

The choice of model determines the quality of answers and operating speed. On a VPS without a video card, we are limited to models with up to 14-20 billion parameters.
| Model | Size (4-bit) | Specialization | Speed on 8 vCPUs |
|---|---|---|---|
| Llama 3.1 8B | 4.7 GB | Universal, logic | 12-15 tokens/sec |
| Mistral Nemo 12B | 7.5 GB | Long context (128k) | 8-10 tokens/sec |
| Qwen 2.5 7B | 4.4 GB | Coding and math | 14-16 tokens/sec |
| Phi-3 Mini | 2.3 GB | Fast simple tasks | 25+ tokens/sec |
For most office tasks (writing emails, summarizing meetings), Llama 3.1 8B is the gold standard. If you need to analyze huge logs or long legal contracts, Mistral Nemo with its expanded context window would be preferable.

Optimization and performance tuning on CPU

To ensure your own ChatGPT VPS doesn't "lag" when multiple employees are working simultaneously, you need to correctly configure Ollama parameters.

Thread Management

By default, Ollama tries to use all available cores, which can make the entire system unresponsive under load. You can limit the number of threads per request through the num_thread model option (set in a Modelfile or in a request's options object). A sensible value is num_thread = total_cores − 1, leaving one core for the OS and OpenWebUI.
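That rule of thumb can be derived on any server like this (num_thread is a standard Ollama model parameter; the lower-bound guard is just defensive for single-core machines):

```shell
# Leave one core free for the OS and OpenWebUI
THREADS=$(( $(nproc) - 1 ))
[ "$THREADS" -lt 1 ] && THREADS=1
echo "PARAMETER num_thread $THREADS"   # a line you could place in an Ollama Modelfile
```

On the recommended 8 vCPU plan this prints `PARAMETER num_thread 7`.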

Quantization and GGUF format

Using models in FP16 format on a CPU is impractical due to the colossal memory requirements. Always choose Q4_K_M or Q5_K_M quantization. The quality loss compared to the full-precision model is typically small (on the order of 1-2% on benchmarks), while RAM requirements drop roughly 4x. If you previously used DigitalOcean and encountered resource shortages, check the guide on how to migrate from DigitalOcean to more powerful Valebyte configurations.
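The arithmetic behind that 4x figure, for an 8-billion-parameter model (approximate; real files carry some per-layer overhead):

```shell
# FP16 stores 2 bytes per parameter; Q4_K_M averages ~0.5 bytes per parameter
PARAMS_B=8
echo "FP16:   $((PARAMS_B * 2)) GB"   # → FP16:   16 GB
echo "Q4_K_M: ~$((PARAMS_B / 2)) GB"  # → Q4_K_M: ~4 GB
```

The small overhead on top of the ~4 GB is why the comparison table above lists Llama 3.1 8B at 4.7 GB.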

Integration and API: how to use your ChatGPT in workflows

OpenWebUI provides an API that is fully compatible with the OpenAI API. This means you can connect your local server to any third-party applications (IDEs, CRMs, messengers) simply by replacing the base_url.
  • For developers: Connect VS Code via the Continue.dev extension to your VPS. You will get private code autocompletion.
  • For analysts: Use Python scripts for bulk processing of documents via your server's API.
  • For HR: Set up automatic initial screening of resumes by uploading them to the RAG folder.
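A sketch of such an API call (the domain, model tag, and key are placeholders; OpenWebUI exposes an OpenAI-compatible endpoint at /api/chat/completions, and API keys are generated in the WebUI's account settings):

```shell
# Query your self-hosted server exactly like the OpenAI API.
# chat.yourcompany.com and OPENWEBUI_API_KEY are placeholders for your setup.
curl -s --max-time 10 https://chat.yourcompany.com/api/chat/completions \
  -H "Authorization: Bearer $OPENWEBUI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Hello"}]}' \
  || echo "request failed: check the URL, key, and network path"
```

Any OpenAI SDK works the same way: point its base_url at your server and pass the OpenWebUI key instead of an OpenAI one.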
The cost of ownership for such a system is fixed. Unlike OpenAI, where the bill grows proportionally to the number of tokens, for your own ChatGPT VPS you pay a fixed server rent, regardless of usage intensity.

Conclusions

To create a secure corporate analog of ChatGPT, it is sufficient to rent a VPS with 16-32 GB RAM and deploy the OpenWebUI + Ollama stack, which will ensure full data privacy for around $90/mo. It is recommended to use the Llama 3.1 8B model for everyday tasks and be sure to configure access via VPN or Reverse Proxy with SSL to protect corporate information.

