Hosting and Optimizing Local LLMs on a 32GB RAM VPS
An engineering guide to hosting Llama and Gemma models locally using Ollama on a 32GB RAM VPS to eliminate API dependency and ensure absolute data privacy.
When building enterprise autonomous agent applications, development teams frequently run into two obstacles: cloud API pricing and proprietary data security. Commercial third-party APIs are a convenient starting point, but agentic workflows, which loop through tool calls, self-correction, and multiple reflection cycles, can make per-token pricing financially unsustainable at scale. Furthermore, regulatory frameworks such as GDPR or HIPAA restrict sending sensitive enterprise or user data to external third-party endpoints without appropriate safeguards and agreements.
Running open-weights Large Language Models (LLMs) locally on private cloud compute instances addresses both concerns. A 32GB RAM Virtual Private Server (VPS) is a practical cost-to-performance sweet spot for hosting local models. It is large enough to load 8-billion to 9-billion parameter models at 4-bit to 8-bit quantization while leaving memory headroom for long context windows and parallel requests.
This systems guide demonstrates how to deploy, configure, and optimize local models using Ollama on a 32GB RAM VPS operating in a CPU-only or hybrid environment. We will cover bare-metal systemd configurations, resource-constrained containerization, low-level OS kernel optimization, and advanced real-world production failure modes.
1. Cloud API vs. Local VPS: A Financial and Architectural Breakdown
For autonomous agents running continuous query-eval-response loops (e.g., background document parsers, security scanners, or automated customer support swarms), token consumption grows faster than linearly: every loop re-sends the accumulated history, so cumulative input tokens scale roughly with the square of the loop count.
The Cost Volatility of Commercial APIs
Consider an agent utilizing a standard ReAct (Reasoning and Acting) framework. A single agentic task requiring 10 iterative loops to gather facts, call APIs, and reason about the output might start with a 4,000-token prompt that grows as context history mounts, and generate 500 output tokens on each iteration.
Let us calculate the cost of a single agent execution using commercial pricing scales:
- Input tokens: the first loop sends 4,000 input tokens, and each later loop re-sends the history plus the previous 500-token output. Over 10 iterations the prompts sum to 4,000 + 4,500 + ... + 8,500 = 62,500 input tokens total.
- Output tokens: 500 tokens * 10 iterations = 5,000 output tokens total.
- Commercial API cost (e.g., a standard high-performance model at $5.00/M input and $15.00/M output):
  - Input cost: 62,500 * ($5.00 / 1,000,000) = $0.3125
  - Output cost: 5,000 * ($15.00 / 1,000,000) = $0.075
  - Total cost per task execution: $0.3875
If your enterprise swarm executes 25,000 tasks per day, your monthly operational expenses reach:
25,000 tasks/day * $0.3875/task * 30 days = $290,625 / Month
Even using a lower-tier “mini” model (e.g., $0.15/M input and $0.60/M output), the costs remain variable and grow directly with scale:
Input: 62,500 * ($0.15 / 1,000,000) = $0.009375
Output: 5,000 * ($0.60 / 1,000,000) = $0.003
Total: $0.012375 per task
25,000 tasks/day * $0.012375/task * 30 days = $9,281.25 / Month
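The arithmetic above can be reproduced with a short script. This is a minimal sketch: the function names are illustrative, and it assumes, as the figures in this section imply, that each loop re-sends the previous prompt plus the 500 tokens generated in the prior iteration.

```python
def cumulative_input_tokens(initial_context: int, output_per_loop: int, loops: int) -> int:
    """Sum the prompt sizes when every loop re-sends the history plus the prior output."""
    return sum(initial_context + i * output_per_loop for i in range(loops))


def cost_per_task(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one task, given prices per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


if __name__ == "__main__":
    input_total = cumulative_input_tokens(4_000, 500, 10)
    output_total = 500 * 10
    # Price pairs are the example tiers used in this section, not quotes from a provider.
    for label, price_in, price_out in [("premium", 5.00, 15.00), ("mini", 0.15, 0.60)]:
        per_task = cost_per_task(input_total, output_total, price_in, price_out)
        monthly = per_task * 25_000 * 30
        print(f"{label}: ${per_task:.6f} per task, ${monthly:,.2f} per month")
```

Changing the loop count shows how quickly the input side dominates: doubling the loops more than doubles the cumulative prompt tokens.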
The Predictable Flat-Rate VPS Solution
In contrast, hosting a quantized model like Llama-3.1-8B-Instruct (Q4_K_M) on a dedicated CPU-Optimized 32GB RAM VPS costs a flat monthly fee, up to roughly $80 depending on the provider (the table below assumes $60). The marginal cost per token drops to zero, and throughput is bounded by the hardware rather than by a usage bill:
| Monthly Task Volume | API Cost (Premium) | API Cost (Budget/Mini) | Private 32GB VPS Cost | Net Savings (vs. Budget) |
|---|---|---|---|---|
| 10,000 | $3,875.00 | $123.75 | $60.00 | $63.75 (51.5%) |
| 50,000 | $19,375.00 | $618.75 | $60.00 | $558.75 (90.3%) |
| 250,000 | $96,875.00 | $3,093.75 | $60.00 | $3,033.75 (98.0%) |
| 1,000,000 | $387,500.00 | $12,375.00 | $60.00 | $12,315.00 (99.5%) |
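The table implies a break-even volume below which the API is still cheaper. A minimal sketch of that calculation follows; it assumes the $60 flat VPS price used in the table and the per-task API costs derived earlier, and it ignores whether a single VPS can actually serve the volume in question.

```python
import math


def break_even_tasks(vps_monthly_cost: float, api_cost_per_task: float) -> int:
    """Smallest monthly task count at which the API bill reaches the flat VPS price."""
    return math.ceil(vps_monthly_cost / api_cost_per_task)


def monthly_savings(tasks: int, api_cost_per_task: float, vps_monthly_cost: float) -> float:
    """API bill minus the flat VPS price for a given monthly volume."""
    return tasks * api_cost_per_task - vps_monthly_cost


if __name__ == "__main__":
    print(break_even_tasks(60.0, 0.012375))   # budget/mini tier
    print(break_even_tasks(60.0, 0.3875))     # premium tier
    print(round(monthly_savings(10_000, 0.012375, 60.0), 2))
```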
Architectural Autonomy
Beyond raw cost savings, private VPS hosting eliminates external technical constraints:
- Zero Rate Limits: No `HTTP 429 Too Many Requests` response codes interrupting automated scripts.
- Latency Optimization: Network traffic between your agent logic and the local model travels over loopback interfaces (`127.0.0.1`), removing the 100ms to 500ms network round-trip overhead of reaching public API gateways.
- Data Residency: No third-party system stores your prompts, logs, or sensitive proprietary documents, which simplifies compliance audits.
2. Deployment Architecture: Bare-Metal Systemd vs. Containerized Docker Compose
When architecting a VPS for LLM inference, you must choose between a bare-metal systemd deployment and a containerized Docker deployment. On a CPU-only system where every percentage of overhead affects execution speeds, this architectural choice carries significant weight.
Why Systemd is the Gold Standard for CPU Inference
Docker is excellent for portability, and because containers share the host kernel the raw CPU overhead for matrix math is small. The default bridge network does add a NAT hop, however, and cgroup CPU quotas can make thread placement less predictable. NUMA-aware placement also requires explicit settings (such as `--cpuset-cpus` and `--cpuset-mems`), which are easy to overlook and can reduce CPU-memory throughput when missing.
Running Ollama as a native Systemd Service directly on the host operating system gives you:
- Direct Memory Mapping (`mmap`): The kernel maps the model files from your SSD directly into the host virtual memory tables, with no container storage layer in between.
- Simplified NUMA Bindings: You can invoke utilities like `numactl` directly in the service start commands.
- OOM Score Control: Systemd exposes the kernel’s OOM score adjustment per service (`OOMScoreAdjust`), so you can make Ollama the least preferred victim when the host runs out of memory.
Deep-Dive Systemd Configuration File
Create the dedicated user and configure the systemd unit file at /etc/systemd/system/ollama.service:
[Unit]
Description=Ollama Service (Optimized for 32GB VPS CPU Inference)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=ollama
Group=ollama
WorkingDirectory=/var/lib/ollama
ExecStart=/usr/local/bin/ollama serve
# --- Core Thread & Memory Performance Parameters ---
# Restrict Ollama to load only 1 model at a time to protect 32GB RAM limits
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Limit parallel requests; each parallel slot allocates its own KV cache
Environment="OLLAMA_NUM_PARALLEL=2"
# Quantize the KV cache to 8-bit (the default is f16); requires flash attention
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# Enable flash attention to lower memory consumption per token
Environment="OLLAMA_FLASH_ATTENTION=1"
# Bind to loopback only: Ollama has no built-in authentication, so expose it
# through the reverse proxy (or a private VPC address) rather than 0.0.0.0
Environment="OLLAMA_HOST=127.0.0.1:11434"
# Keep model files under a path that stays writable with ProtectSystem=full
Environment="OLLAMA_MODELS=/var/lib/ollama/models"
# --- Host Kernel Resource Restrictions & OOM Tuning ---
# Exempt Ollama from selection by the kernel's global Out-Of-Memory killer
OOMScoreAdjust=-1000
# Throttle the service at 28G and hard-cap it at 30G so runaway memory cannot take down the VPS
MemoryHigh=28G
MemoryMax=30G
# Allow the process to lock model pages in RAM (limits.conf does not apply to systemd services)
LimitMEMLOCK=infinity
# Standard file descriptor limits for high concurrency web connections
LimitNOFILE=65535
# Process sandbox execution protections
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
Production-Grade Docker Compose Alternative
If you require container isolation or deployment through a larger GitOps pipeline (such as Portainer or Docker Swarm), use this docker-compose.yml file. It enforces CPU and memory limits, reserves a memory floor, and lowers the container’s OOM score so the kernel prefers other processes as OOM victims.
version: '3.8'
services:
  ollama:
    # Pin an exact version to prevent breaking CLI changes. Confirm that the pinned
    # release supports OLLAMA_KV_CACHE_TYPE and your target model; 0.1.48 may predate both.
    image: ollama/ollama:0.1.48
    container_name: ollama-production
    restart: always
    ports:
      - "127.0.0.1:11434:11434" # Exposed to local loopback; use Nginx for external routing
    environment:
      # Inside the container Ollama must listen on all interfaces; the port mapping above limits exposure
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KV_CACHE_TYPE=q8_0
    volumes:
      - /opt/ollama/data:/root/.ollama
    # Make the container the least preferred victim of the host OOM killer
    oom_score_adj: -1000
    deploy:
      resources:
        limits:
          # Leave 1 vCPU free for OS scheduling and networking interrupts (assumes an 8-vCPU host)
          cpus: '7.0'
          memory: 28G
        reservations:
          memory: 16G
    logging:
      driver: "json-file"
      options:
        max-size: "20m"
        max-file: "5"
3. Deep Systems Optimization for CPU Inference
Unlike GPUs that feature ultra-wide, high-bandwidth VRAM buses (e.g., 512-bit to 2048-bit buses pushing up to 2 TB/s of bandwidth), CPUs access system RAM via narrow dual-channel or quad-channel memory controllers operating at standard DDR4/DDR5 speeds (typically 40 GB/s to 80 GB/s). Because of this memory bandwidth bottleneck, CPU inference performance is almost entirely determined by RAM access speeds and CPU cache efficiency, rather than raw CPU processing frequencies.
Thread Pinning and Core Affinity: SMT vs. Physical Cores
A common mistake in VPS configuration is setting the active inference thread count to match the number of virtual CPUs (vCPUs) reported by the operating system. If your 32GB VPS has 8 vCPUs, the system is likely hyperthreaded, meaning there are 4 physical cores and 8 logical execution units (SMT, Simultaneous Multithreading).
In CPU-bound matrix math (such as the ggml or llama.cpp runtimes under Ollama), running one thread per logical CPU can degrade performance, by as much as 20% to 30%. SMT threads share the same physical execution units (vector engines like AVX-512) and the same L1/L2 data caches. When two threads execute matrix operations on the same physical core simultaneously, they compete for those vector units and for cache capacity, causing cache evictions and CPU pipeline stalls.
To choose the number of inference threads (T), use the physical core count (C_physical):
T = C_physical
If your VPS has hyperthreading active, determine your physical cores by executing:
lscpu | grep -E '^Core\(s\) per socket|^Socket\(s\)'
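If your physical core count is 4, configure your execution environment to run with 4 threads. You can set this per model with a `PARAMETER num_thread 4` line in an Ollama Modelfile, per request with the `num_thread` option, or by launching Ollama with a CPU affinity mask. Logical CPU numbering differs between hosts, so confirm which logical CPUs are SMT siblings (for example with `lscpu -e`) before choosing the list:
# Pin Ollama to one logical CPU per physical core (0, 2, 4, 6 assumes siblings are paired as 0-1, 2-3, ...)
taskset -c 0,2,4,6 ollama serve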
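Because sibling numbering varies (some hosts pair logical CPUs as 0-1, others as 0,4), the affinity list is safer to derive than to guess. The sketch below reads the Linux sysfs file `thread_siblings_list` for each CPU and keeps the lowest-numbered logical CPU of every sibling group. The helper names are illustrative, and on a VPS the topology the guest sees may not match the physical host.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as '0-1' or '0,4' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(siblings_by_cpu: dict[int, str]) -> list[int]:
    """Pick the lowest-numbered logical CPU from each group of SMT siblings."""
    return sorted({parse_cpu_list(siblings)[0] for siblings in siblings_by_cpu.values()})


def read_siblings() -> dict[int, str]:
    """Read thread_siblings_list for every CPU exposed under sysfs (Linux only)."""
    result = {}
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path("/sys/devices/system/cpu").glob(pattern):
        cpu_id = int(path.parent.parent.name[3:])  # directory name is 'cpuN'
        result[cpu_id] = path.read_text()
    return result


if __name__ == "__main__":
    primaries = one_cpu_per_core(read_siblings())
    # The joined list can be passed to: taskset -c <list> ollama serve
    print(",".join(str(cpu) for cpu in primaries))
```

The length of the printed list is also a reasonable value for `num_thread`.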
NUMA (Non-Uniform Memory Access) Node Tuning
High-core-count VPS nodes are often carved out of large, multi-socket enterprise servers (e.g., dual AMD EPYC processors). On these hosts, the memory bus is segmented. A virtual machine accessing memory managed by a CPU socket different from the one it is currently running on experiences “remote memory access” latency, which can noticeably cut inference speeds.
To even out remote memory latency, configure memory interleaving across all NUMA nodes before launching Ollama. This only helps when the guest actually sees more than one node, which `numactl --hardware` reports. First, install the NUMA utility library:
sudo apt-get install -y numactl
Then, prepend your execution with numactl --interleave=all. This tells the Linux kernel to distribute memory allocations evenly across all NUMA nodes, so no single memory controller becomes the bottleneck:
# Example execution inside custom scripts or systemd units
ExecStart=/usr/bin/numactl --interleave=all /usr/local/bin/ollama serve
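Many VPS guests expose only a single NUMA node, in which case interleaving changes nothing. As a rough sketch, the helper below parses the first line of `numactl --hardware` output (which has the form `available: N nodes (...)`) to decide whether the wrapper is worth adding. The function names are illustrative.

```python
import re


def numa_node_count(numactl_hardware_output: str) -> int:
    """Extract N from the 'available: N nodes (...)' line printed by numactl --hardware."""
    match = re.search(r"^available:\s+(\d+)\s+nodes?", numactl_hardware_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'available: N nodes' line found")
    return int(match.group(1))


def should_interleave(numactl_hardware_output: str) -> bool:
    """Interleaving only matters when the guest sees more than one node."""
    return numa_node_count(numactl_hardware_output) > 1


if __name__ == "__main__":
    sample = "available: 2 nodes (0-1)\nnode 0 cpus: 0 1 2 3\nnode 1 cpus: 4 5 6 7\n"
    print(numa_node_count(sample), should_interleave(sample))
```

In practice the input would come from something like `subprocess.run(["numactl", "--hardware"], capture_output=True, text=True).stdout`.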
Memory Locking (mlock) and Address Space Pinning
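Ollama uses the mmap system call to load model weights on demand. While this is highly memory-efficient, the mapped weights are file-backed pages: under memory pressure the Linux virtual memory manager may evict rarely accessed sections, and the next access has to read them back from the model file. When this paging occurs, a single token generation step must wait on reads from the NVMe SSD, which can turn a steady rate such as 15 tokens/sec into less than 1 token/sec.
To keep the kernel from evicting Ollama’s active memory, the weights must be locked with mlock, and the process must be allowed to lock that much memory.
For a systemd service, the limit comes from the unit file (LimitMEMLOCK=infinity in the [Service] section), because /etc/security/limits.conf is applied by PAM to login sessions only. If you also run Ollama from an interactive shell as the ollama user, add the matching entries to /etc/security/limits.conf. Depending on the Ollama version, locking may additionally need to be requested through the use_mlock model option:
ollama soft memlock unlimited
ollama hard memlock unlimited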
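Whichever mechanism sets the limit, it is worth confirming what the running process actually received. On Linux, `/proc/<pid>/limits` lists the effective limits, including a `Max locked memory` row. The sketch below parses that row; the function name is illustrative, and the example reads the limits of the Python process itself, so substitute the Ollama PID to inspect the service.

```python
from pathlib import Path


def locked_memory_limits(limits_text: str) -> tuple[str, str]:
    """Return the (soft, hard) 'Max locked memory' values from /proc/<pid>/limits text."""
    prefix = "Max locked memory"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            fields = line[len(prefix):].split()
            return fields[0], fields[1]
    raise ValueError("'Max locked memory' row not found")


if __name__ == "__main__":
    # /proc/self/limits describes this Python process; use /proc/<ollama pid>/limits
    # to check the running service instead.
    limits_path = Path("/proc/self/limits")
    if limits_path.exists():
        soft, hard = locked_memory_limits(limits_path.read_text())
        print(f"soft={soft} hard={hard}")
```

A value of `unlimited` in both columns indicates the memlock setting took effect; a small byte count means the limit was not applied to that process.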
Custom Swap Space and Swappiness Configurations
While we want to protect Ollama from swapping, we do need swap space enabled on the VPS to serve as a pressure valve for non-critical system daemons (like systemd-journald, sshd, or nginx). This frees up all physical RAM for Ollama’s active model context and dynamic KV-cache allocations.
Use the following commands to configure a high-performance 8GB swap file on your fast NVMe storage:
# Allocate a zeroed file of 8GB
sudo dd if=/dev/zero of=/swapfile bs=1M count=8192
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist the configuration in the filesystem tables
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
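Now, adjust the kernel’s swappiness factor. By default, Ubuntu uses a swappiness of 60, which lets the kernel swap out anonymous process memory fairly readily. Lower this parameter to 10 so the kernel strongly prefers keeping process memory in RAM and uses swap only under real pressure. Note that swappiness only affects anonymous memory; the mmap’d model weights are file-backed, so keeping them resident depends on mlock rather than on this setting: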
# Set runtime swappiness
sudo sysctl vm.swappiness=10
# Persist the configuration across system reboots
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
4. Advanced Production Failure Modes, Edge Cases, and Mitigation
Deploying LLMs for production environments exposes systems to edge cases that do not appear during standard development testing. On a CPU-constrained 32GB RAM VPS, understanding these failure modes is the difference between a highly resilient runtime and a constantly crashing server.
Failure Mode 1: CPU Thermal Locks and Dynamic Frequency Scaling (DVFS) Throttling
When running large matrix multiplication operations on CPU, every available execution thread runs at a 100% duty cycle. In shared VPS hosting environments or servers with limited cooling capacity, sustained high-load workloads generate significant heat.
To prevent physical damage, the host processor triggers Dynamic Voltage and Frequency Scaling (DVFS), dropping the physical CPU clock frequency from its peak boost (e.g., 3.6 GHz) down to a safe base frequency (e.g., 1.8 GHz) or lower.
[Figure: CPU frequency (GHz) over time. The clock starts at 3.6 GHz when inference begins, drops to about 2.4 GHz under sustained 100% core load, and settles near 1.8 GHz once thermal throttling is triggered.]
The Production Symptoms
After 2 to 3 minutes of continuous text generation (e.g., bulk document indexing), the output generation speed drops by 50% or more, and system latency spikes. The system remains stable, but processing throughput degrades severely.
The Technical Mitigation
- Apply CPU Core Cap Offsets: Do not run your systemd service or Docker containers with access to 100% of the VPS cores. If the VPS has 8 vCPUs, cap Ollama at 6 cores using Docker `cpus: '6.0'` or the systemd directive `CPUQuota=600%`. Leaving 2 cores idle lowers the sustained load on the physical CPU socket and keeps the host responsive.
- Install Monitoring Hooks: Set up a background cron job to check for throttling. Guests often cannot read the host’s real clock frequency, so also track generation throughput (tokens per second). If frequency or throughput drops well below its nominal level, pause or slow your client request queue for a short period to let the host cores cool down.
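One way to implement the monitoring hook from the client side is to treat throughput as the throttling signal. The sketch below is an assumption-laden illustration rather than a tested policy: it derives tokens per second from the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns with each response, and scales a pause linearly once throughput falls below a chosen fraction of the baseline. The function names, the 0.7 threshold, and the 5-second ceiling are all hypothetical.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


def throttle_delay(baseline_tps: float, recent_tps: float,
                   threshold: float = 0.7, max_delay_s: float = 5.0) -> float:
    """Seconds to pause before the next request, based on the throughput drop."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    ratio = max(recent_tps, 0.0) / baseline_tps
    if ratio >= threshold:
        return 0.0
    # Scale linearly from 0 at the threshold up to max_delay_s at zero throughput.
    return max_delay_s * (threshold - ratio) / threshold


if __name__ == "__main__":
    recent = tokens_per_second(eval_count=70, eval_duration_ns=10_000_000_000)
    print(recent, throttle_delay(baseline_tps=14.0, recent_tps=recent))
```

A client loop would call `time.sleep(throttle_delay(...))` between requests, with the baseline taken from a benchmark run on a cool, idle host.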
Failure Mode 2: Thread Cache Thrashing and Cache Line Bouncing
On multi-core hosts, the cores within a socket (or core complex) share access to the same L3 cache, while each core keeps private L1/L2 caches. When the ggml inference library splits matrix math across multiple threads, those threads must frequently share vector weights and intermediate calculations.
If threads are continuously rescheduled across different physical cores by the Linux scheduler, a thread on Core 1 will attempt to read data currently cached in the L1/L2 caches of Core 2. This causes a cache miss, forcing a slow lookup to the shared L3 cache or the host system memory.
The Production Symptoms
Inference speeds drop sharply, even though CPU usage metrics show a constant 100% load across all threads. Running diagnostics with profiling tools reveals high CPU instruction stall rates (cycle_activity.stalls_mem_any).
The Technical Mitigation
Restrict Ollama to a fixed set of cores with a CPU affinity mask. This stops the kernel scheduler from migrating inference threads onto other cores (threads can still move within the allowed set), which keeps their working data in nearby caches:
# Run the Ollama server restricted to logical CPUs 0-3 (adjust the list to match your physical core layout)
taskset --cpu-list 0-3 /usr/local/bin/ollama serve
Failure Mode 3: Linux Out-Of-Memory (OOM) Killer Interventions
A quantized 8B model has a fixed static footprint (e.g., ~4.7 GB for Q4 precision). However, the overall memory footprint of an active Ollama instance is highly dynamic. The runtime must allocate memory for the Key-Value (KV) Cache, which holds the attention key and value vectors for all active tokens in your context window.
The memory consumed by the KV Cache scales linearly with the context length, the batch size (number of parallel streams), and the layer configuration of the target model. We can calculate this allocation mathematically:
M_session = 2 * N_layers * N_heads * D_head * L_context * B_precision
And the total KV Cache footprint across all parallel streams is:
M_kv = M_session * N_parallel
Where:
- N_layers: The number of attention layers in the model (e.g., 32 layers for Llama 3.1 8B).
- N_heads: The number of Key-Value attention heads (e.g., 8 heads under Grouped-Query Attention for Llama 3.1 8B).
- D_head: The attention head dimension size (e.g., 128).
- L_context: The target active context length (e.g., 16,384 tokens).
- B_precision: Bytes per cache parameter. Standard FP16 cache uses 2 bytes. Quantized 8-bit cache (OLLAMA_KV_CACHE_TYPE=q8_0) uses approximately 1.06 bytes.
- N_parallel: Number of active concurrent streams (e.g., 2).
Let’s calculate the memory required for Llama 3.1 8B Q4 with a context window of 16,384 tokens and FP16 precision across 2 parallel requests:
M_session = 2 * 32 * 8 * 128 * 16384 * 2 = 2,147,483,648 bytes (~2.0 GB per session)
M_kv = 2.0 GB * 2 parallel requests = 4.0 GB total KV Cache allocation
If your configuration increases the context length to 32,768 tokens, this KV Cache footprint doubles:
M_kv = 4.0 GB * 2 = 8.0 GB total KV Cache allocation
When you combine the model’s static weight footprint (4.7 GB), the active KV Cache (8.0 GB), system processes, and web server buffers, memory consumption can quickly spike, especially with longer contexts or more parallel streams. If memory usage exhausts physical RAM and swap, the Linux kernel’s OOM killer will typically pick Ollama, as the largest memory consumer without an adjusted OOM score, and terminate the process instantly with a SIGKILL signal.
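The KV Cache formula is easy to script so that context length, parallelism, and cache precision can be explored before changing the service configuration. This is a minimal sketch of the formula as stated in this section; the function name is illustrative, and 1.0625 bytes per value for q8_0 is the unrounded form of the "approximately 1.06 bytes" figure.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: float,
                   n_parallel: int = 1) -> float:
    """M_kv = 2 * N_layers * N_heads * D_head * L_context * B_precision * N_parallel."""
    per_session = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return per_session * n_parallel


if __name__ == "__main__":
    gib = 1024 ** 3
    # Llama 3.1 8B figures used in this section: 32 layers, 8 KV heads, head dim 128.
    for label, ctx, precision in [("fp16, 16k", 16_384, 2.0),
                                  ("fp16, 32k", 32_768, 2.0),
                                  ("q8_0, 32k", 32_768, 1.0625)]:
        total = kv_cache_bytes(32, 8, 128, ctx, precision, n_parallel=2)
        print(f"{label}: {total / gib:.2f} GiB across 2 parallel streams")
```

Adding the result to the static weight footprint gives a first estimate of how much of the 32GB is committed before system processes are counted.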
The Production Symptoms
Clients receive sudden 502 Bad Gateway or 500 Internal Server Error responses. Checking system kernel logs using dmesg -T | grep -i oom reveals that the ollama process was terminated:
[Tue May 26 19:10:45 2026] Out of memory: Killed process 10245 (ollama) total-vm:32504832kB, anon-rss:29394821kB, file-rss:0kB, shmem-rss:0kB
The Technical Mitigation
- Quantize the KV Cache: Set `OLLAMA_KV_CACHE_TYPE=q8_0` or `q4_0` in your systemd environment variables. This reduces the cache footprint by roughly 47% (q8_0) to 72% (q4_0) relative to FP16, usually with little loss in reasoning accuracy at q8_0 and a more noticeable loss at q4_0.
- Limit Overcommit Behavior: Set the Linux virtual memory subsystem to reject memory allocation requests that exceed a commit limit derived from your physical RAM and swap space. Oversized allocations then fail at request time instead of triggering the OOM killer later. Strict overcommit applies to every process on the host, so test it under load before relying on it:
sudo sysctl vm.overcommit_memory=2
sudo sysctl vm.overcommit_ratio=80
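With `vm.overcommit_memory=2`, the kernel computes its commit limit as the swap size plus `overcommit_ratio` percent of physical RAM (excluding huge pages). A short sketch of that calculation for this guide's 32GB RAM and 8GB swap layout follows; the function name is illustrative, and the live value on a host is reported as `CommitLimit` in `/proc/meminfo`.

```python
def commit_limit_gib(ram_gib: float, swap_gib: float,
                     overcommit_ratio: int, hugepages_gib: float = 0.0) -> float:
    """CommitLimit = (RAM - huge pages) * overcommit_ratio / 100 + swap."""
    return (ram_gib - hugepages_gib) * overcommit_ratio / 100 + swap_gib


if __name__ == "__main__":
    limit = commit_limit_gib(ram_gib=32, swap_gib=8, overcommit_ratio=80)
    print(f"Commit limit: {limit:.1f} GiB")
```

If the sum of committed memory across all processes approaches this figure, new allocations start failing, so the ratio should leave room for everything else running on the VPS.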
Failure Mode 4: Network Socket Exhaustion and Streaming Timeouts
When clients request streaming responses (newline-delimited JSON from Ollama’s native API, or Server-Sent Events from its OpenAI-compatible endpoint), the HTTP connection remains open for the whole generation. If a high volume of requests is routed to your VPS, the server can quickly exhaust its available networking socket resources.
Furthermore, if a slow or high-latency client connection stalls mid-stream, it keeps a worker thread occupied. This can lead to a backup of requests in the server’s input queue, ultimately resulting in timeout failures across the system.
The Production Symptoms
New connection attempts are rejected with connection refused or timeout errors. Profiling socket metrics via netstat or ss -s shows thousands of connections stuck in TIME_WAIT or ESTABLISHED states.
The Technical Mitigation
Deploy a highly optimized Nginx Reverse Proxy directly in front of the local Ollama daemon. Nginx is designed to handle connection queuing, buffer streams efficiently, and close inactive or slow client connections gracefully.
Below is a production-ready Nginx configuration file (/etc/nginx/sites-available/ollama) tailored specifically for streaming LLM outputs:
# Configure connection limit tracking zones
limit_conn_zone $binary_remote_addr zone=addr_limit:10m;
limit_req_zone $binary_remote_addr zone=req_limit:10m rate=10r/s;

upstream ollama_backend {
    server 127.0.0.1:11434;
    # Keep up to 32 idle connections open to Ollama to prevent socket regeneration churn
    keepalive 32;
}

server {
    listen 80;
    server_name ollama.private-vps.internal;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name ollama.private-vps.internal;

    # SSL hardening variables (Adjust paths to match your certificate provider)
    ssl_certificate /etc/letsencrypt/live/ollama.private-vps.internal/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.private-vps.internal/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    # Apply strict connection and request rate limits
    limit_conn addr_limit 5;
    limit_req zone=req_limit burst=10 nodelay;
    client_max_body_size 50M;

    location / {
        proxy_pass http://ollama_backend;

        # Enforce HTTP/1.1 to enable persistent keepalive connections
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Standard proxy routing headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # --- Crucial Streaming Configurations ---
        # Disable proxy buffering so tokens are streamed to the client in real-time
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;

        # Disable Nginx response holding buffers
        chunked_transfer_encoding on;
        tcp_nopush off;
        tcp_nodelay on;
    }
}
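To check that the proxy really passes tokens through as they are generated, it helps to have a client that consumes the stream. Ollama's native `/api/generate` endpoint streams one JSON object per line, each carrying a `response` fragment, with a final object marked `"done": true` that includes the timing statistics. The sketch below assembles such a stream; the function name is illustrative, and the sample lines are hand-written stand-ins rather than captured output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> tuple[str, dict]:
    """Join the 'response' fragments and return them with the final stats object."""
    parts = []
    final: dict = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk
            break
    return "".join(parts), final


if __name__ == "__main__":
    sample = [
        b'{"response": "Hello", "done": false}\n',
        b'{"response": ", world", "done": false}\n',
        b'{"response": "", "done": true, "eval_count": 3}\n',
    ]
    text, stats = collect_stream(sample)
    print(text, stats.get("eval_count"))
```

With a live server, the object returned by `urllib.request.urlopen(...)` can be passed directly as `lines`, since iterating an HTTP response yields it line by line. If fragments arrive in one burst at the end, buffering is still active somewhere in the path.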
5. Step-by-Step Enterprise Implementation Blueprint
Follow this step-by-step technical guide to configure your 32GB VPS, deploy Ollama under systemd, and implement performance benchmarks.
Step 1: Operating System & Kernel Optimization Script
Save the following code block as host-prep.sh and run it as root to configure your Ubuntu system parameters for optimal performance:
#!/usr/bin/env bash
set -euo pipefail
# Ensure script is run with root permissions
if [[ "${EUID}" -ne 0 ]]; then
echo "This script must be run as root. Exiting." >&2
exit 1
fi
echo "=== Beginning Enterprise VPS Kernel Preparation ==="
# 1. Update and install core dependencies
apt-get update && apt-get install -y \
curl \
numactl \
cpufrequtils \
procps
# 2. Configure high-performance NVMe swap
if [ ! -f /swapfile ]; then
echo "Creating 8GB high-performance NVMe swap allocation..."
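# dd writes a fully allocated file; fallocate-created swap files are rejected on some filesystems
dd if=/dev/zero of=/swapfile bs=1M count=8192 status=none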
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
fi
# 3. Configure virtual memory sub-system values
echo "Configuring kernel sysctl optimizations..."
cat << 'EOF' > /etc/sysctl.d/99-ollama-inference.conf
# Force kernel to hold physical pages in RAM, avoiding NVMe swapping
vm.swappiness=10
# Limit overcommit bounds to prevent runtime OOM system panics
vm.overcommit_memory=2
vm.overcommit_ratio=80
# Increase local port allocation ranges for high-concurrency API processing
net.ipv4.ip_local_port_range=10240 65535
# Increase TCP queue parameters to handle connection spikes
net.core.somaxconn=1024
EOF
sysctl --system
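# 4. Configure maximum system memory locks
echo "Configuring process security limits..."
cat << 'EOF' > /etc/security/limits.d/99-ollama-memlock.conf
ollama soft memlock unlimited
ollama hard memlock unlimited
EOF
# systemd services ignore PAM limits, so set the same limit in a unit drop-in
mkdir -p /etc/systemd/system/ollama.service.d
cat << 'EOF' > /etc/systemd/system/ollama.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity
EOF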
echo "=== System kernel preparation complete. Please reboot for all settings to take effect ==="
Step 2: Systemd Daemon Setup & Model Installation
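Execute these commands to download the official Ollama binary, configure your service files, and pull your target models:
# Install Ollama binaries (the installer may also create an ollama user and a default unit file)
curl -fsSL https://ollama.com/install.sh | sh
# Add the dedicated system user (if not already created)
sudo id -u ollama &>/dev/null || sudo useradd -r -s /bin/false -U -m -d /var/lib/ollama ollama
# Make sure the service working directory exists and is owned by the service user
sudo install -d -o ollama -g ollama /var/lib/ollama
# Place the unit file from Section 2 at /etc/systemd/system/ollama.service, then register and start the daemon
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl restart ollama
# Pull the model once the server is running (ollama pull talks to the running daemon)
ollama pull llama3.1:8b-instruct-q4_K_M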
Step 3: Local Performance Benchmarking Script
To verify that your VPS optimizations are working correctly and measure your model’s actual token generation speed, use this Python test script (benchmark.py):
#!/usr/bin/env python3
import json
import time
import urllib.request
import urllib.error

API_URL = "http://127.0.0.1:11434/api/generate"
MODEL_NAME = "llama3.1:8b-instruct-q4_K_M"
TEST_PROMPT = "Explain the difference between a systemd service and a docker container, focusing on host kernel resource utilization."

payload = {
    "model": MODEL_NAME,
    "prompt": TEST_PROMPT,
    "stream": False,
    "options": {
        "num_thread": 4,  # Target optimal physical core count
        "temperature": 0.2
    }
}
headers = {"Content-Type": "application/json"}
req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers=headers,
    method="POST"
)

print("=== Triggering Local Inference Performance Test ===")
print(f"Model: {MODEL_NAME}")
print(f"Prompt Length: {len(TEST_PROMPT)} characters")
print("Sending request... (please stand by)\n")

start_time = time.time()
try:
    # Generous timeout: CPU inference on a cold model can take minutes
    with urllib.request.urlopen(req, timeout=600) as response:
        response_bytes = response.read()
    duration = time.time() - start_time
    result = json.loads(response_bytes.decode("utf-8"))
    generated_text = result.get("response", "")

    # Ollama performance stats (durations are returned in nanoseconds)
    eval_count = result.get("eval_count", 0)
    eval_duration = result.get("eval_duration", 1) / 1e9  # Convert to seconds
    prompt_eval_count = result.get("prompt_eval_count", 0)
    prompt_duration = result.get("prompt_eval_duration", 1) / 1e9
    tokens_per_second = eval_count / eval_duration

    print("--- Execution Metrics ---")
    print(f"Overall Wall-Clock Duration : {duration:.2f} seconds")
    print(f"Prompt Tokens Evaluated     : {prompt_eval_count} tokens")
    print(f"Prompt Evaluation Duration  : {prompt_duration:.2f} seconds")
    print(f"Response Tokens Generated   : {eval_count} tokens")
    print(f"Response Duration           : {eval_duration:.2f} seconds")
    print(f"Token Generation Speed      : {tokens_per_second:.2f} tokens/second")
    print("\n--- Response Preview ---")
    print(generated_text[:300] + "...\n")
    print("========================================")
except urllib.error.URLError as e:
    print(f"Inference request failed: {e.reason}")
    print("Please verify the Ollama service is active and listening on port 11434.")
Run this benchmark utility using standard python tools:
python3 benchmark.py
Under optimized CPU execution parameters, this guide targets 12 to 16 tokens per second for Llama 3.1 (8B) at Q4_K_M. Actual throughput depends on the host’s memory bandwidth, physical core count, and neighboring tenants, so treat the benchmark output as the number to plan around. Throughput in this range is well suited to background agents, task processing queues, and autonomous database workflows.
Conclusion & System Deployment Checklist
By deploying local LLMs on a 32GB RAM VPS, you establish an independent, predictable, and highly secure runtime environment for your enterprise autonomous agents. Taking the time to properly configure your systems for CPU-based inference ensures your local deployment can run efficiently without suffering performance drops or crashes.
Before launching your optimized Ollama instance in a production environment, complete this final deployment checklist:
- Core Thread Configuration: Is your thread limit mapped to the physical core count (T = C_physical), bypassing hyperthreaded logical cores?
- Swap File Active: Is an 8GB NVMe swap file enabled with `/proc/sys/vm/swappiness` set to `10`?
- Memory Protection: Does the `ollama` service have an unlimited `memlock` limit (`LimitMEMLOCK=infinity` in the systemd unit, plus `/etc/security/limits.conf` entries for login sessions)?
- NUMA Interleaving: Is `numactl --interleave=all` enabled on startup to ensure balanced memory channel allocation?
- OOM Score Adjusted: Is `OOMScoreAdjust` set to `-1000` inside your systemd unit (or `oom_score_adj` in your Docker deployment) to prevent process crashes?
- Reverse Proxy Configured: Is Nginx or Caddy deployed in front of Ollama with connection pooling and `proxy_buffering off` active for streaming?
- Thermal Limits Offset: Have you set aside at least 1 or 2 CPU cores to handle networking and OS kernel interrupts, keeping CPU temperatures stable?