Dedicated server for AI inference: choosing hardware

March 16, 2026 · 8 min read
Valebyte Team

For efficient AI inference on a dedicated server without a GPU, three things are critical: a powerful multi-core CPU, at least 64 GB of RAM, and a fast NVMe drive. Together they let you serve demanding models through runtimes such as ONNX Runtime and llama.cpp with high throughput and low latency.

Why is CPU Inference Relevant for AI Models?

In the world of artificial intelligence, Graphics Processing Units (GPUs) dominate, especially for training large models. However, for the inference phase—that is, applying an already trained model to obtain predictions—CPU inference often proves to be more than sufficient, and sometimes even the preferred solution. This is particularly relevant for models that do not require the massive parallelization inherent to GPUs, or when the budget for GPUs is limited.

The advantages of CPU inference include:

  • Cost-effectiveness: A dedicated server with a powerful CPU is typically significantly cheaper than a comparable server with high-performance GPUs.
  • Availability: GPU servers are often in short supply or have higher rental costs. CPU servers are much more common.
  • Flexibility: Many frameworks and libraries (such as ONNX Runtime, llama.cpp) are optimized for efficient operation on CPUs, allowing for the use of a wide range of hardware.
  • Energy Efficiency: In some cases, especially for "lighter" models or under low load, CPU servers consume less energy.

Projects like llama.cpp have demonstrated that even large language models (LLMs) can run efficiently on CPUs using optimized quantization and computation algorithms. Similarly, ONNX Runtime allows models from various frameworks (PyTorch, TensorFlow) to be deployed on CPUs with excellent performance.

Which Processor is Needed for an AI Inference Server?

Choosing a processor is a key aspect for an AI inference server without a GPU. Not only the number of cores but also their clock speed and cache size are important here.

  • Number of Cores: For simultaneous processing of multiple requests or performing multi-threaded inference operations, as many cores as possible are required. Modern frameworks can efficiently distribute the load. Look for processors with at least 8-12 physical cores, and preferably 16-32 or more.
  • Clock Speed: A high clock speed is important for single-thread performance, which can be critical for latency-sensitive applications where each request is processed sequentially.
  • Cache Memory (L3 Cache): A large cache significantly speeds up access to frequently used model data, reducing latency when accessing RAM.
  • Instruction Set Support: AVX2 and FMA instructions, present on all modern Xeon and EPYC chips, significantly accelerate the matrix math behind neural networks. AVX-512 and VNNI (available on recent Intel Xeon Scalable and AMD EPYC 9004 "Genoa" processors) accelerate it further, especially for quantized models.

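Before committing to a server, it is worth verifying which of these instruction sets the CPU actually exposes. A minimal sketch for Linux (it parses `/proc/cpuinfo`, so the flag names are whatever the kernel reports):

```python
# Sketch: list which SIMD instruction sets a Linux server's CPU exposes.
# Only works on Linux, since it reads the kernel-provided /proc/cpuinfo.

def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU feature flags reported for the first core."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for isa in ("avx2", "fma", "avx512f", "avx512_vnni"):
    print(f"{isa:12s} {'yes' if isa in flags else 'no'}")
```

The same information is available from `lscpu`; the script above is just convenient to drop into a provisioning check.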
Recommended Processor Series:

  • Intel Xeon E/W: A good balance of price and performance for small to medium tasks. For example, Xeon E-2388G (8 cores/16 threads, 5.10 GHz Turbo).
  • Intel Xeon Scalable (Silver, Gold, Platinum): An excellent choice for a high-performance dedicated server for AI. They offer a large number of cores (up to 56 per socket), high frequency, and a large cache.
  • AMD EPYC (7002, 7003, 9004 series): Leaders in core count (up to 64 cores per socket in the 7002/7003 series, up to 128 in the 9004 series), cache size, and supported RAM capacity. Ideal for large-scale ML inference hosting.

Example of an optimal CPU choice: AMD EPYC 7302P (16 cores/32 threads, up to 3.3 GHz boost) or Intel Xeon Gold 6248R (24 cores/48 threads, up to 4.0 GHz turbo). These processors provide sufficient computational power for most CPU inference tasks.

Looking for a reliable server for your projects?

Valebyte offers VPS and dedicated servers with guaranteed resources and fast activation.

View offers →

Random Access Memory (RAM): A Critical Resource for Neural Network Servers

For a neural network server, especially with CPU inference, the amount and speed of RAM play no less important a role than the processor. Machine learning models, especially large language models (LLMs), can occupy tens or even hundreds of gigabytes in RAM.

  • RAM Capacity: This is the primary factor. For most inference tasks, a minimum of 64GB RAM is a starting point. For large LLMs (e.g., Llama 2 70B in quantized form), 128GB, 256GB, or even 512GB RAM may be required. Ensure that the chosen server can accommodate the necessary capacity.
  • RAM Speed: The faster the RAM (DDR4-3200, DDR5-4800, and higher), the quicker the processor can access model data and intermediate computation results. This directly impacts inference latency.
  • ECC RAM: For commercial and mission-critical systems, Error-Correcting Code (ECC) RAM is highly recommended. It detects and corrects data errors on the fly, increasing system stability and reliability, preventing crashes caused by random memory errors.

Insufficient RAM forces constant swapping of model data to disk, which slows inference dramatically. It is better to provision RAM with headroom than to hit a memory bottleneck later.
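A quick back-of-envelope check helps size RAM before ordering a server. The sketch below uses a common rule of thumb: weight memory is roughly parameter count times bits per weight divided by 8, plus a flat overhead factor for the KV cache and runtime buffers. The 4.5 bits-per-weight figure (typical for Q4_K_M quantization) and the 1.2× overhead are illustrative assumptions, not exact numbers:

```python
# Rough sketch: estimate RAM needed to hold a quantized LLM in memory.
# bits_per_weight ~4.5 approximates Q4_K_M; overhead covers KV cache and
# runtime buffers. Both values are illustrative assumptions.

def estimate_ram_gb(params_billion, bits_per_weight=4.5, overhead=1.2):
    """Approximate resident memory in GB for a quantized model."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for params in (7, 13, 70):
    print(f"Llama 2 {params}B @ ~Q4: ~{estimate_ram_gb(params):.0f} GB")
```

By this estimate a quantized 70B model needs on the order of 50 GB just for itself, which is why 128 GB+ configurations are recommended for large LLMs.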

Data Storage: Why NVMe SSD is Indispensable for ML Inference Hosting?

The speed of the disk subsystem is critically important for ML inference hosting, especially when loading large models and datasets. Traditional HDDs or even SATA SSDs can become a serious bottleneck.

  • NVMe SSD: This is the de facto standard for high-performance servers. NVMe drives utilize the PCIe bus, providing significantly higher sequential read/write speeds (up to 7000 MB/s and higher) and, more importantly, a tremendous number of input/output operations per second (IOPS) compared to SATA SSDs.
  • Model Loading: Large AI models can weigh tens of gigabytes. Fast loading of a model from an NVMe drive into RAM reduces the inference service startup time and accelerates initialization.
  • Data Processing: If your inference involves pre-processing large volumes of data stored on disk or logging results, a high-speed NVMe will ensure minimal latency.
  • Capacity: For most inference tasks, a capacity of 500GB to 2TB NVMe SSD is sufficient. Larger models or logs may require more.

Using an NVMe SSD ensures that the disk subsystem will not be a bottleneck, allowing the processor and RAM to operate at full capacity.
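To confirm that storage will not bottleneck model loading, you can simply time a sequential read of the model file. A minimal sketch (the model path is a placeholder; note that the OS page cache inflates repeated runs, so the first cold read is the meaningful one):

```python
# Sketch: measure sequential read throughput for a large file, e.g. a model
# checkpoint, to gauge whether disk I/O limits service startup time.

import time

def measure_read(path, chunk_mb=64):
    """Read a file sequentially; return (size_gb, seconds, gb_per_sec)."""
    chunk = chunk_mb * 1024 * 1024
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while data := f.read(chunk):
            total += len(data)
    elapsed = time.perf_counter() - start
    return total / 1e9, elapsed, (total / 1e9) / elapsed

# Usage (path is a placeholder):
# gb, secs, gbps = measure_read("/opt/models/llama-2-7b-chat.Q4_K_M.gguf")
# print(f"{gb:.1f} GB in {secs:.1f} s -> {gbps:.2f} GB/s")
```

On a healthy NVMe drive a cold sequential read should reach several GB/s; a result in the hundreds of MB/s suggests a SATA SSD or a saturated disk.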

Network Infrastructure and Bandwidth

While network bandwidth might seem less critical than CPU or RAM, it plays an important role for an AI inference server, especially in the following scenarios:

  • High-Load API: If your inference service handles a large number of requests from users or other systems, sufficient bandwidth is required for fast data exchange.
  • Stream Processing: For inference of video streams, large images, or real-time audio data, a 10 Gbps network interface becomes a necessity.
  • Distributed Inference: If you plan to scale your service horizontally using multiple servers, a fast network between them will ensure efficient interaction.
  • Model and Data Upload/Download: Initial loading of large models onto the server, as well as regular updates or result uploads, can significantly benefit from a high-speed connection.

For most inference tasks, a 1 Gbps port will be sufficient, but for high-load or latency-sensitive applications, consider options with a 10 Gbps connection.
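Whether 1 Gbps is enough is simple arithmetic on your expected request rate and payload size. A sketch with illustrative numbers (the 2× headroom factor for traffic bursts is an assumption):

```python
# Back-of-envelope sketch: bandwidth an inference API needs.
# Request rate, payload size, and the burst headroom factor are
# illustrative assumptions, not measured values.

def required_mbps(req_per_sec, payload_kb, headroom=2.0):
    """Peak bandwidth in Mbit/s, with a headroom factor for bursts."""
    return req_per_sec * payload_kb * 8 / 1000 * headroom

# e.g. 200 req/s with 50 KB combined request+response payloads:
print(f"~{required_mbps(200, 50):.0f} Mbit/s")  # ~160 Mbit/s
```

Even at 200 requests per second this workload stays well inside a 1 Gbps port; it is large payloads (video frames, audio streams) that push you toward 10 Gbps.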

Optimal Valebyte Configurations for a Dedicated Server for AI

Valebyte offers a wide selection of dedicated servers for AI that are ideally suited for CPU inference, providing a balance of power, flexibility, and cost. We focus on processors with a large number of cores, sufficient RAM, and fast NVMe drives.

Table: Recommended Valebyte Configurations for AI Inference (CPU-based)

| Plan / Configuration | Processor | RAM | Disk (NVMe) | Network Port | Approximate Cost (from) |
| --- | --- | --- | --- | --- | --- |
| AI Inference Start | Intel Xeon E-2388G (8C/16T, up to 5.1 GHz) | 64 GB DDR4 ECC | 1 TB NVMe SSD | 1 Gbps | $99/month |
| AI Inference Pro | AMD EPYC 7302P (16C/32T, up to 3.3 GHz) | 128 GB DDR4 ECC | 2 TB NVMe SSD | 1 Gbps | $189/month |
| AI Inference Max | Intel Xeon Gold 6248R (24C/48T, up to 4.0 GHz) | 256 GB DDR4 ECC | 2 x 2 TB NVMe SSD (RAID1) | 10 Gbps | $349/month |
| AI Inference EPYC Power | AMD EPYC 7502P (32C/64T, up to 3.35 GHz) | 512 GB DDR4 ECC | 2 x 3.84 TB NVMe SSD (RAID1) | 10 Gbps | $599/month |

Prices are approximate and may vary depending on region, availability, and special offers. Current prices and exact specifications are available on our website Valebyte.com.

Usage Examples and Software

On a dedicated Valebyte server, you can easily deploy environments for CPU inference. Here are a few examples:

1. Running Llama 2 7B on llama.cpp:

After building `llama.cpp` and downloading a quantized model (e.g., `llama-2-7b-chat.Q4_K_M.gguf`), you can run inference (in recent llama.cpp builds the `main` binary has been renamed `llama-cli`):

./main -m models/llama-2-7b-chat.Q4_K_M.gguf -p "Tell me about Valebyte.com" -n 128 --temp 0.7 --top-k 40 --top-p 0.9 --threads 16

Here, `--threads 16` indicates the use of 16 CPU threads, which effectively utilizes the multi-core processor.

2. Using ONNX Runtime for Inference:

Installing ONNX Runtime in Python:

pip install onnxruntime

Example code for inference:

import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession("path/to/your/model.onnx")

# Prepare input data (dynamic/symbolic dimensions in the shape are replaced with 1)
input_name = session.get_inputs()[0].name
input_shape = [d if isinstance(d, int) else 1 for d in session.get_inputs()[0].shape]
input_data = np.random.rand(*input_shape).astype(np.float32)

# Perform inference
output = session.run(None, {input_name: input_data})

print("Inference result:", output[0])

ONNX Runtime automatically optimizes execution on available CPU cores.

Recommendations for Selection and Scaling

Choosing the right server for neural networks is an investment. Consider the following recommendations:

  1. Assess your model's requirements: Determine in advance the amount of RAM needed to load the model and the CPU computational power required for the desired inference latency.
  2. Start with a buffer: Always get slightly more RAM and cores than seems necessary at first glance. This will give you room for scaling without immediate server replacement.
  3. Test performance: After deployment, conduct load testing to ensure the server can handle the expected load and latencies.
  4. Consider redundancy: For mission-critical inference services, consider setting up multiple servers to ensure high availability and load balancing.
  5. Pay attention to support: Valebyte provides 24/7 technical support for all dedicated servers, which is critically important for the stable operation of your AI services.
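For recommendation 3, even a tiny script that measures per-request latency percentiles gives a useful first signal before reaching for a full load-testing tool. A minimal sketch, where `run_inference` is a stand-in for your real model call:

```python
# Minimal load-test sketch: time repeated calls to an inference function
# and report median and tail latency. `run_inference` is a placeholder.

import time

def benchmark(fn, n=100):
    """Call fn() n times; return (p50_ms, p95_ms) latencies."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)  # ms
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[max(0, int(n * 0.95) - 1)]
    return p50, p95

def run_inference():
    time.sleep(0.001)  # stand-in: replace with a real model call

p50, p95 = benchmark(run_inference)
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms")
```

Watch the gap between p50 and p95: a wide gap under load usually points to CPU contention or swapping, both of which the hardware recommendations above are meant to prevent.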

Conclusion

Choosing a dedicated server for CPU inference of AI models requires a careful approach to hardware specifications, where a powerful multi-core processor, sufficient capacity (64GB+) and high-speed RAM, as well as a fast NVMe drive, are key. Valebyte.com offers optimal configurations capable of efficiently handling AI inference server tasks, ensuring reliability and performance for your projects.

Ready to choose a server?

VPS and dedicated servers in 72+ countries with instant setup and full root access.

Get started now →
