
Serving LLMs with vLLM on RunPod: A Complete Guide

Feb 3rd, 2026 · 12 min read


What Are We Building?

When you want to run an LLM for inference, you have two options: use a cloud API (OpenAI, Anthropic) or host your own. Self-hosting gives you control over costs, latency, and model choice. In this post, I'll break down exactly what happens when you deploy a model on RunPod using vLLM.

The Stack:

┌─────────────────────────────────────────────────┐
│ Your Application (API calls)                    │
└─────────────────────┬───────────────────────────┘
                      │ HTTPS
                      ▼
┌─────────────────────────────────────────────────┐
│ RunPod Proxy (yixnlsxbw3md1q-8000.proxy...)     │
└─────────────────────┬───────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│ Docker Container (vllm/vllm-openai:latest)      │
│  ┌─────────────────────────────────────────┐    │
│  │ vLLM Server (OpenAI-compatible API)     │    │
│  │  - /v1/chat/completions                 │    │
│  │  - /v1/completions                      │    │
│  │  - /v1/models                           │    │
│  └─────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────┐    │
│  │ Model Weights (Qwen2.5-7B-Instruct)     │    │
│  │ Loaded in GPU VRAM (~14GB)              │    │
│  └─────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│ NVIDIA A6000 Ada GPU (48GB VRAM)                │
└─────────────────────────────────────────────────┘

Understanding RunPod

RunPod is a cloud GPU provider. Unlike AWS or GCP where you rent full VMs, RunPod specializes in GPU containers. You pay only for GPU time, often at 50-70% lower cost than major cloud providers.

What RunPod Provides

RunPod Pod = GPU + Container Runtime + Networking

Components:
1. GPU Hardware → NVIDIA A6000, A100, H100, etc.
2. Container    → Docker image runs your application
3. Storage      → Volume (persistent) + Container disk (ephemeral)
4. Networking   → Proxy URL for HTTP/HTTPS access
5. SSH Access   → Direct terminal access to the container

Templates: Pre-configured Recipes

A RunPod template is a pre-configured deployment recipe. It specifies everything needed to run a specific application:

Template Components:
──────────────────────────────────────────────────────
Component        │ Example Value
──────────────────────────────────────────────────────
Docker Image     │ vllm/vllm-openai:latest
GPU Type         │ NVIDIA A6000 Ada (48GB)
Volume Disk      │ 40GB (model weights stored here)
Container Disk   │ 10GB (temporary runtime files)
Environment Vars │ HF_TOKEN, HF_HOME, etc.
Start Command    │ python -m vllm.entrypoints...
Exposed Ports    │ 8000 (HTTP), 22 (SSH)
──────────────────────────────────────────────────────

Understanding vLLM

vLLM is a high-performance inference engine for LLMs. It's not a model—it's the software that loads models and serves them efficiently.

Why vLLM Over Plain PyTorch?

Plain PyTorch Inference:
- Load model into GPU
- Process one request at a time
- Recompute KV cache for each token
- Result: ~20 tokens/sec

vLLM Inference:
- PagedAttention: efficient memory management
- Continuous Batching: process multiple requests simultaneously
- KV Cache Optimization: reuse computed attention states
- Result: ~50-100+ tokens/sec
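To see the engine without the HTTP layer, here's a minimal offline-inference sketch using vLLM's Python API. The model name matches the one deployed in this post; the sampling values are illustrative, not tuned:

from vllm import LLM, SamplingParams

# Load the model once; vLLM allocates weights and KV-cache blocks in VRAM.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Illustrative sampling settings (not values used elsewhere in this post).
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

# Passing several prompts at once lets vLLM batch them internally.
prompts = [
    "What is ML?",
    "Explain PagedAttention in one sentence.",
]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)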

Key vLLM Optimizations

1. PagedAttention

Traditional attention stores KV cache in contiguous memory blocks. vLLM uses "paged" memory (like OS virtual memory), allowing dynamic allocation and preventing memory fragmentation.

2. Continuous Batching

Instead of waiting for a batch to complete, vLLM continuously adds new requests and removes completed ones. This maximizes GPU utilization.

Traditional Batching:
Request 1: [████████████████████]
Request 2: [████████████████████]
Request 3: [████████████████████]
           ↑ Wait for all to finish before next batch

Continuous Batching:
Request 1: [████████]
Request 2: [████████████████]
Request 3: [████████████]
Request 4: [████████████████]
           ↑ New requests added as slots free up

The Deployment Flow

Here's exactly what happens when you deploy:

Step 1: Pod Creation

RunPod allocates:
- 1x NVIDIA A6000 Ada GPU (48GB VRAM)
- 32GB System RAM
- 40GB Volume Disk (mounted at /workspace)
- 10GB Container Disk

Time: ~30 seconds

Step 2: Container Startup

Docker pulls: vllm/vllm-openai:latest

Container contains:
- Python 3.10+
- PyTorch with CUDA support
- vLLM library
- Transformers library
- FastAPI server

Time: ~1-2 minutes (if image not cached)

Step 3: Model Download

vLLM downloads from HuggingFace:

Model: Qwen/Qwen2.5-7B-Instruct
Files:
- model.safetensors.index.json
- model-00001-of-00004.safetensors (4GB each)
- tokenizer.json
- config.json

Total Size: ~14GB
Saved to:   /workspace/hf_home/hub/models--Qwen--Qwen2.5-7B-Instruct/
Time:       5-10 minutes (first time)
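If you want the weights on the persistent volume before vLLM starts (so a pod restart doesn't re-download ~14GB), you can prefetch them with huggingface_hub. A sketch, assuming the cache lives under /workspace/hf_home as in the template above:

import os
from huggingface_hub import snapshot_download

# Download all model files (safetensors shards, tokenizer, config) into the
# hub cache on the persistent volume, the same path vLLM reads from above.
path = snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    cache_dir="/workspace/hf_home/hub",
    token=os.environ.get("HF_TOKEN"),  # only needed for gated/private repos
)
print(f"Model cached at: {path}")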

Step 4: Model Loading

vLLM loads model into GPU:
1. Read safetensors files from disk
2. Convert to appropriate dtype (bfloat16/float16)
3. Transfer weights to GPU VRAM
4. Initialize KV cache blocks
5. Compile CUDA graphs (optional)

Memory Layout:
┌─────────────────────────────────────────┐
│ A6000 GPU (48GB VRAM)                   │
├─────────────────────────────────────────┤
│ Model Weights  │ ~14GB                  │
│ KV Cache       │ ~20GB (dynamic)        │
│ Activations    │ ~4GB                   │
│ Free           │ ~10GB                  │
└─────────────────────────────────────────┘

Time: 1-2 minutes
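The layout above roughly follows from simple arithmetic. Here's a back-of-the-envelope sketch; the parameter count is the nominal 7B figure and the 0.9 utilization fraction is vLLM's default (tunable via --gpu-memory-utilization), so the numbers won't match the observed table exactly:

# Rough VRAM budget for a 7B model served in bfloat16 on a 48GB GPU.
params_billions = 7.0          # nominal parameter count (assumption)
bytes_per_param = 2            # bfloat16 / float16
vram_gb = 48
gpu_memory_utilization = 0.9   # vLLM's default share of VRAM

weights_gb = params_billions * bytes_per_param   # ~14 GB of weights
budget_gb = vram_gb * gpu_memory_utilization     # ~43 GB vLLM may claim
remainder_gb = budget_gb - weights_gb            # split between KV cache and activations

print(f"Weights:             ~{weights_gb:.0f} GB")
print(f"vLLM memory budget:  ~{budget_gb:.0f} GB")
print(f"KV cache + overhead: ~{remainder_gb:.0f} GB")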

Step 5: API Server Ready

vLLM starts FastAPI server:
INFO: Uvicorn running on http://0.0.0.0:8000

Available Endpoints:
──────────────────────────────────────────────────────────
Endpoint             │ Method │ Purpose
──────────────────────────────────────────────────────────
/v1/models           │ GET    │ List loaded models
/v1/chat/completions │ POST   │ Chat API (OpenAI format)
/v1/completions      │ POST   │ Text completion
/health              │ GET    │ Health check
──────────────────────────────────────────────────────────

RunPod creates proxy:
https://yixnlsxbw3md1q-8000.proxy.runpod.net → localhost:8000
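A quick way to confirm the server came up is to hit /health and /v1/models through the proxy. A minimal check, assuming the proxy URL and API key shown in this post (swap in your own pod ID):

import requests

BASE_URL = "https://yixnlsxbw3md1q-8000.proxy.runpod.net"  # your pod's proxy URL
HEADERS = {"Authorization": "Bearer sk-1234"}              # the key vLLM was started with

# /health returns 200 once the model is loaded and the server accepts requests.
health = requests.get(f"{BASE_URL}/health", timeout=10)
print("Health:", health.status_code)

# /v1/models lists the model(s) this server is serving.
models = requests.get(f"{BASE_URL}/v1/models", headers=HEADERS, timeout=10)
print("Models:", [m["id"] for m in models.json()["data"]])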

Making Requests

vLLM exposes an OpenAI-compatible API. This means you can use the same code you'd use for OpenAI, just change the base URL:

# OpenAI API call
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer sk-xxx" \
  -d '{"model": "gpt-4", "messages": [...]}'

# vLLM API call (same format!)
curl https://your-pod-8000.proxy.runpod.net/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [...]}'
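Because the API is OpenAI-compatible, the official openai Python client works as well; you only point base_url at your pod's proxy. The URL and key below mirror the curl example and are placeholders:

from openai import OpenAI

# Same client you'd use for OpenAI, just with a different base_url and API key.
client = OpenAI(
    base_url="https://your-pod-8000.proxy.runpod.net/v1",
    api_key="sk-1234",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is ML?"}],
    max_tokens=100,
)

print(response.choices[0].message.content)
print(response.usage)  # prompt/completion token counts come back too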

Request Flow

1. Request arrives at vLLM server
2. Tokenizer converts text → token IDs
   "What is ML?" → [1724, 374, 14946, 30]
3. Tokens added to scheduling queue
4. Scheduler batches with other requests
5. Forward pass through model:
   - Embedding lookup
   - 28 transformer layers
   - Final linear → logits
6. Sampling (temperature, top_p)
7. New token generated
8. Repeat until stop condition
9. Detokenize → text response
10. Return JSON with usage stats
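You can inspect step 2 yourself with the model's tokenizer. The exact IDs depend on the tokenizer, so treat the ones above as illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

text = "What is ML?"
token_ids = tokenizer.encode(text)
print(token_ids)                                   # list of integer token IDs
print(tokenizer.convert_ids_to_tokens(token_ids))  # the subword pieces themselves
print(tokenizer.decode(token_ids))                 # round-trips back to the original text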

Benchmark Results

I ran benchmarks at different concurrency levels to understand throughput scaling:

Configuration:
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- GPU: NVIDIA A6000 Ada (48GB)
- Cost: $0.44/hour
- Max tokens per request: 100

Results:
─────────────────────────────────────────────────────────────────
Concurrency │ Avg Latency │ P50 Latency │ P95 Latency │ Tokens/s
─────────────────────────────────────────────────────────────────
1           │ 2068ms      │ 2018ms      │ 2276ms      │ 48.49
4           │ 2060ms      │ 2070ms      │ 2251ms      │ 48.68
8           │ 2156ms      │ 2145ms      │ 2376ms      │ 46.55
─────────────────────────────────────────────────────────────────

Throughput Scaling:
- 1 concurrent: 0.48 req/s
- 4 concurrent: 1.91 req/s (4x improvement)
- 8 concurrent: 3.64 req/s (7.6x improvement)
Key Observation: Latency stays relatively flat even as concurrency increases. This is continuous batching in action—vLLM efficiently processes multiple requests without proportionally increasing per-request latency.
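For reference, a benchmark harness for numbers like these can be as simple as an asyncio script that fires waves of concurrent chat requests and records latency. This is a simplified sketch using httpx, not the exact script behind the table above; the URL and key are placeholders:

import asyncio
import statistics
import time

import httpx

BASE_URL = "https://your-pod-8000.proxy.runpod.net/v1"   # your pod's proxy URL
HEADERS = {"Authorization": "Bearer sk-1234"}
PAYLOAD = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "What is ML?"}],
    "max_tokens": 100,
}

async def one_request(client: httpx.AsyncClient) -> float:
    """Send one chat completion and return its latency in milliseconds."""
    start = time.perf_counter()
    resp = await client.post(f"{BASE_URL}/chat/completions",
                             json=PAYLOAD, headers=HEADERS, timeout=120)
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000

async def benchmark(concurrency: int, total_requests: int = 32) -> None:
    async with httpx.AsyncClient() as client:
        latencies: list[float] = []
        start = time.perf_counter()
        # Fire requests in waves of `concurrency`; a real harness would keep the
        # pipeline continuously full, but waves are close enough for a sketch.
        for _ in range(total_requests // concurrency):
            wave = [one_request(client) for _ in range(concurrency)]
            latencies += await asyncio.gather(*wave)
        elapsed = time.perf_counter() - start
        p95 = statistics.quantiles(latencies, n=20)[-1]
        print(f"concurrency={concurrency}  avg={statistics.mean(latencies):.0f}ms  "
              f"p95={p95:.0f}ms  throughput={len(latencies) / elapsed:.2f} req/s")

if __name__ == "__main__":
    asyncio.run(benchmark(concurrency=4))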

Cost Analysis

Self-hosting only makes sense if it's cheaper than API providers. Let's do the math:

Calculating Cost Per 1M Output Tokens:

Given:
- GPU cost: $0.44/hour
- Average tokens/sec: 47.91
- Tokens per hour: 47.91 × 3600 = 172,476

Cost per token     = $0.44 / 172,476 = $0.00000255
Cost per 1M tokens = $2.55

Comparison:
────────────────────────────────────────────────────────
Provider          │ Cost/1M tokens │ vs Self-Host
────────────────────────────────────────────────────────
GPT-4o            │ $15.00         │ 5.9x more expensive
GPT-4o-mini       │ $0.60          │ 4.2x cheaper
Claude 3.5 Sonnet │ $15.00         │ 5.9x more expensive
Self-hosted vLLM  │ $2.55          │ baseline
────────────────────────────────────────────────────────
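The same arithmetic in code, so you can plug in your own GPU price and measured throughput:

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate 1M output tokens on a GPU billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    cost_per_token = gpu_cost_per_hour / tokens_per_hour
    return cost_per_token * 1_000_000

# Numbers from the benchmark above: $0.44/hour and ~47.91 tokens/sec on average.
print(f"${cost_per_million_tokens(0.44, 47.91):.2f} per 1M output tokens")  # ≈ $2.55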
Trade-offs: the $2.55 figure assumes the GPU stays busy. A self-hosted pod bills $0.44/hour even when it sits idle, while API providers charge only for the tokens you actually use.

When to Self-Host vs Use APIs

Self-Host When:
✓ High volume (>1M tokens/day)
✓ Need low latency (<500ms)
✓ Privacy requirements (data can't leave your infra)
✓ Need fine-tuned/custom models
✓ Predictable, steady traffic

Use APIs When:
✓ Low/variable volume
✓ Need frontier model quality (GPT-4, Claude)
✓ Don't want to manage infrastructure
✓ Need automatic scaling
✓ Experimenting/prototyping

Key Takeaways

What We Learned:
- RunPod rents GPU containers rather than full VMs; a template bundles the Docker image, GPU type, storage, environment variables, and start command.
- vLLM is the serving engine, not the model. PagedAttention and continuous batching are what lift throughput from ~20 tokens/sec (plain PyTorch) to ~50-100+ tokens/sec.
- In the benchmarks, latency stayed nearly flat from 1 to 8 concurrent requests while throughput scaled ~7.6x.
- At $0.44/hour and ~48 tokens/sec, output costs roughly $2.55 per 1M tokens: far cheaper than GPT-4o or Claude 3.5 Sonnet, but more than GPT-4o-mini.
- Self-hosting pays off with high, steady volume and privacy or latency requirements; APIs win for low volume, frontier quality, and zero ops.

What's Next?

In the next post, I'll test function calling capabilities with this setup. We'll compare zero-shot function calling accuracy across different open-source models and see how they stack up against GPT-4.