Serving LLMs with vLLM on RunPod: A Complete Guide
Feb 3rd, 2026 · 12 min read
What Are We Building?
When you want to run an LLM for inference, you have two options: use a cloud API (OpenAI, Anthropic)
or host your own. Self-hosting gives you control over costs, latency, and model choice. In this post,
I'll break down exactly what happens when you deploy a model on RunPod using vLLM.
The Stack:
┌─────────────────────────────────────────────────┐
│  Your Application (API calls)                   │
└─────────────────────┬───────────────────────────┘
                      │ HTTPS
                      ▼
┌─────────────────────────────────────────────────┐
│  RunPod Proxy (yixnlsxbw3md1q-8000.proxy...)    │
└─────────────────────┬───────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│  Docker Container (vllm/vllm-openai:latest)     │
│  ┌─────────────────────────────────────────┐    │
│  │  vLLM Server (OpenAI-compatible API)    │    │
│  │  - /v1/chat/completions                 │    │
│  │  - /v1/completions                      │    │
│  │  - /v1/models                           │    │
│  └─────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────┐    │
│  │  Model Weights (Qwen2.5-7B-Instruct)    │    │
│  │  Loaded in GPU VRAM (~14GB)             │    │
│  └─────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│  NVIDIA A6000 Ada GPU (48GB VRAM)               │
└─────────────────────────────────────────────────┘
Understanding RunPod
RunPod is a cloud GPU provider. Unlike AWS or GCP, where you typically rent full VMs, RunPod specializes in
GPU containers. You pay only for GPU time, often at 50-70% lower cost than equivalent GPU instances on the major clouds.
What RunPod Provides
RunPod Pod = GPU + Container Runtime + Networking
Components:
1. GPU Hardware → NVIDIA A6000, A100, H100, etc.
2. Container → Docker image that runs your application
3. Storage → Volume (persistent) + Container disk (ephemeral)
4. Networking → Proxy URL for HTTP/HTTPS access
5. SSH Access → Direct terminal access to the container
Templates: Pre-configured Recipes
A RunPod template is a pre-configured deployment recipe. It specifies everything needed
to run a specific application:
Template Components:
────────────────────────────────────────────────────
Component         │ Example Value
────────────────────────────────────────────────────
Docker Image      │ vllm/vllm-openai:latest
GPU Type          │ NVIDIA A6000 Ada (48GB)
Volume Disk       │ 40GB (model weights stored here)
Container Disk    │ 10GB (temporary runtime files)
Environment Vars  │ HF_TOKEN, HF_HOME, etc.
Start Command     │ python -m vllm.entrypoints...
Exposed Ports     │ 8000 (HTTP), 22 (SSH)
────────────────────────────────────────────────────
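If you prefer to create the pod programmatically, the same template fields map onto RunPod's Python SDK. The sketch below is illustrative only: the parameter names, the GPU type string, and the docker_args behaviour are assumptions from memory of the runpod SDK, so verify them against the current SDK docs before relying on it.

import runpod

# Sketch only: kwargs and the GPU type ID string are assumptions; check the runpod SDK docs.
runpod.api_key = "YOUR_RUNPOD_API_KEY"

pod = runpod.create_pod(
    name="vllm-qwen25-7b",
    image_name="vllm/vllm-openai:latest",
    gpu_type_id="NVIDIA RTX 6000 Ada Generation",     # assumed ID string for the A6000 Ada
    volume_in_gb=40,
    container_disk_in_gb=10,
    volume_mount_path="/workspace",
    ports="8000/http,22/tcp",
    env={"HF_TOKEN": "hf_...", "HF_HOME": "/workspace/hf_home"},
    docker_args="--model Qwen/Qwen2.5-7B-Instruct",   # appended to the image's vLLM entrypoint
)
print(pod)   # pod metadata, including its ID and proxy hostname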
Understanding vLLM
vLLM is a high-performance inference engine for LLMs. It's not a model—it's the
software that loads models and serves them efficiently.
Why vLLM Over Plain PyTorch?
Plain PyTorch Inference:
- Load model into GPU memory
- Process one request at a time
- Recompute attention over the full sequence for each new token (no KV cache reuse)
- Result: ~20 tokens/sec
vLLM Inference:
- PagedAttention: Efficient memory management
- Continuous Batching: Process multiple requests simultaneously
- KV Cache Optimization: Reuse computed attention states
- Result: ~50-100+ tokens/sec
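You don't need a server to see these optimizations in action; vLLM can be driven directly from Python. A minimal sketch (the prompts are made up, the model is the one used throughout this post): submit several prompts at once and let vLLM batch them on the GPU.

from vllm import LLM, SamplingParams

# Load the model once; vLLM manages GPU memory and the paged KV cache.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Many prompts submitted together -> vLLM batches them internally.
prompts = [
    "Explain PagedAttention in one sentence.",
    "What is continuous batching?",
    "Name three uses of a KV cache.",
]
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)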
Key vLLM Optimizations
1. PagedAttention
Traditional attention stores KV cache in contiguous memory blocks. vLLM uses
"paged" memory (like OS virtual memory), allowing dynamic allocation and
preventing memory fragmentation.
2. Continuous Batching
Instead of waiting for a batch to complete, vLLM continuously adds new requests
and removes completed ones. This maximizes GPU utilization.
Traditional Batching:
Request 1: [████████████████████]
Request 2: [████████████████████]
Request 3: [████████████████████]
           ↑ Wait for all to finish before next batch
Continuous Batching:
Request 1: [████████]
Request 2: [████████████████]
Request 3: [████████████]
Request 4: [████████████████]
           ↑ New requests added as slots free up
The Deployment Flow
Here's exactly what happens when you deploy:
Step 1: Pod Creation
RunPod allocates:
- 1x NVIDIA A6000 Ada GPU (48GB VRAM)
- 32GB System RAM
- 40GB Volume Disk (mounted at /workspace)
- 10GB Container Disk
Time: ~30 seconds
Step 2: Container Startup
Docker pulls: vllm/vllm-openai:latest
Container contains:
- Python 3.10+
- PyTorch with CUDA support
- vLLM library
- Transformers library
- FastAPI server
Time: ~1-2 minutes (if image not cached)
Step 3: Model Download
vLLM downloads from HuggingFace:
Model: Qwen/Qwen2.5-7B-Instruct
Files:
- model.safetensors.index.json
- model-00001-of-00004.safetensors (4GB each)
- tokenizer.json
- config.json
Total Size: ~14GB
Saved to: /workspace/hf_home/hub/models--Qwen--Qwen2.5-7B-Instruct/
Time: 5-10 minutes (first time)
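If you want to warm the volume before starting vLLM (so restarts skip the download entirely), you can pre-fetch the weights with huggingface_hub. A minimal sketch, assuming HF_HOME points at /workspace/hf_home as in the template above:

from huggingface_hub import snapshot_download

# Download (or reuse) the weights on the persistent volume so they survive
# pod restarts; vLLM will find them in the same cache directory.
path = snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    cache_dir="/workspace/hf_home/hub",   # matches the path shown above
)
print("Weights cached at:", path)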
Step 4: Model Loading
vLLM loads model into GPU:
1. Read safetensors files from disk
2. Convert to appropriate dtype (bfloat16/float16)
3. Transfer weights to GPU VRAM
4. Initialize KV cache blocks
5. Compile CUDA graphs (optional)
Memory Layout:
┌─────────────────────────────────────────┐
│  A6000 GPU (48GB VRAM)                  │
├─────────────────────────────────────────┤
│  Model Weights │ ~14GB                  │
│  KV Cache      │ ~20GB (dynamic)        │
│  Activations   │ ~4GB                   │
│  Free          │ ~10GB                  │
└─────────────────────────────────────────┘
Time: 1-2 minutes
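A rough sanity check on that layout: at bfloat16, weights cost two bytes per parameter, and each generated token adds a fixed slice of KV cache per layer. The numbers below assume Qwen2.5-7B's config (28 layers, plus grouped-query attention with 4 KV heads of head dim 128, which are assumptions on my part), so treat this as back-of-the-envelope rather than vLLM's exact accounting.

# Back-of-the-envelope GPU memory budget (bfloat16 = 2 bytes per value).
params = 7.6e9                             # ~7.6B parameters
weights_gb = params * 2 / 1e9              # ≈ 15 GB, same ballpark as the ~14GB above

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 28, 4, 128    # assumed Qwen2.5-7B config
kv_per_token = 2 * layers * kv_heads * head_dim * 2      # bytes
tokens_in_20gb = 20e9 / kv_per_token

print(f"weights ≈ {weights_gb:.1f} GB")
print(f"KV cache ≈ {kv_per_token / 1024:.0f} KiB per token")
print(f"≈ {tokens_in_20gb:,.0f} tokens fit in a 20 GB KV cache")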
Step 5: API Server Ready
vLLM starts FastAPI server:
INFO: Uvicorn running on http://0.0.0.0:8000
Available Endpoints:
────────────────────────────────────────────────────────
Endpoint             │ Method │ Purpose
────────────────────────────────────────────────────────
/v1/models           │ GET    │ List loaded models
/v1/chat/completions │ POST   │ Chat API (OpenAI format)
/v1/completions      │ POST   │ Text completion
/health              │ GET    │ Health check
────────────────────────────────────────────────────────
RunPod creates proxy:
https://yixnlsxbw3md1q-8000.proxy.runpod.net → localhost:8000
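A quick readiness check is to hit /health and /v1/models through the proxy. The pod ID below is a placeholder, and the bearer token is only needed if you started vLLM with an API key:

import requests

BASE_URL = "https://<your-pod-id>-8000.proxy.runpod.net"  # placeholder pod ID
HEADERS = {"Authorization": "Bearer sk-1234"}             # only if vLLM was given an API key

# /health returns 200 once the model is loaded; /v1/models shows what is being served.
print(requests.get(f"{BASE_URL}/health", headers=HEADERS).status_code)
print(requests.get(f"{BASE_URL}/v1/models", headers=HEADERS).json())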
Making Requests
vLLM exposes an OpenAI-compatible API. This means you can reuse the same code
you'd write for OpenAI; just change the base URL and API key:
# OpenAI API call
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer sk-xxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [...]}'

# vLLM API call (same format!)
curl https://your-pod-8000.proxy.runpod.net/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [...]}'
Request Flow
1. Request arrives at vLLM server
2. Tokenizer converts text → token IDs
"What is ML?" → [1724, 374, 14946, 30]
3. Tokens added to scheduling queue
4. Scheduler batches with other requests
5. Forward pass through model:
- Embedding lookup
- 28 transformer layers
- Final linear → logits
6. Sampling (temperature, top_p)
7. New token generated
8. Repeat until stop condition
9. Detokenize → text response
10. Return JSON with usage stats
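Steps 2 and 9 are easy to reproduce locally with the model's tokenizer; the exact IDs depend on the tokenizer, so they may differ from the example above:

from transformers import AutoTokenizer

# Step 2 in miniature: the tokenizer maps text to the token IDs the model sees.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
ids = tok.encode("What is ML?")
print(ids)                 # token IDs (exact values depend on the tokenizer)
print(tok.decode(ids))     # step 9 in reverse: IDs back to text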
Benchmark Results
I ran benchmarks at different concurrency levels to understand throughput scaling:
Configuration:
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- GPU: NVIDIA A6000 Ada (48GB)
- Cost: $0.44/hour
- Max tokens per request: 100
Results:
────────────────────────────────────────────────────────────────────
Concurrency │ Avg Latency │ P50 Latency │ P95 Latency │ Tokens/s/req
────────────────────────────────────────────────────────────────────
1           │ 2068ms      │ 2018ms      │ 2276ms      │ 48.49
4           │ 2060ms      │ 2070ms      │ 2251ms      │ 48.68
8           │ 2156ms      │ 2145ms      │ 2376ms      │ 46.55
────────────────────────────────────────────────────────────────────
Throughput Scaling:
- 1 concurrent: 0.48 req/s
- 4 concurrent: 1.91 req/s (4x improvement)
- 8 concurrent: 3.64 req/s (7.6x improvement)
Key Observation: Latency stays relatively flat even as concurrency increases.
This is continuous batching in action—vLLM efficiently processes multiple requests
without proportionally increasing per-request latency.
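For reference, the numbers above came from a simple load generator. This is not the exact script, just a minimal sketch of the same idea: fire a batch of requests at a fixed concurrency and time each one (URL and key are placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-pod-id>-8000.proxy.runpod.net/v1",  # placeholder pod URL
    api_key="sk-1234",
)

def one_request(_):
    # Time a single chat completion and record how many tokens it generated.
    start = time.time()
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "Explain overfitting briefly."}],
        max_tokens=100,
    )
    return time.time() - start, resp.usage.completion_tokens

# 32 requests at a concurrency of 8, mirroring the table above.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(one_request, range(32)))

latencies = [lat for lat, _ in results]
tokens = sum(toks for _, toks in results)
print(f"avg latency: {sum(latencies) / len(latencies) * 1000:.0f} ms")
print(f"per-request speed: {tokens / sum(latencies):.1f} tokens/s")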
Cost Analysis
Self-hosting only makes sense if it's cheaper than API providers. Let's do the math:
Calculating Cost Per 1M Output Tokens:
Given:
- GPU cost: $0.44/hour
- Average tokens/sec: 47.91 (mean per-request generation speed from the benchmark above)
- Tokens per hour: 47.91 × 3600 = 172,476
Cost per token = $0.44 / 172,476 = $0.00000255
Cost per 1M tokens = $2.55
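The same arithmetic as a small helper. Note that 47.91 tokens/sec is the single-stream speed; if each of the 8 concurrent requests really generates the full 100 tokens, the effective cost per token drops further:

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_sec: float) -> float:
    """Dollar cost to generate 1M output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(0.44, 47.91))       # ≈ 2.55, single request stream
print(cost_per_million_tokens(0.44, 3.64 * 100))  # ≈ 0.34, 8 concurrent x ~100 tokens each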
Comparison:
───────────────────────────────────────────────────────────────
Provider          │ Cost/1M output tokens │ vs Self-Host
───────────────────────────────────────────────────────────────
GPT-4o            │ $15.00                │ 5.9x more expensive
GPT-4o-mini       │ $0.60                 │ 4.2x cheaper
Claude 3.5 Sonnet │ $15.00                │ 5.9x more expensive
Self-hosted vLLM  │ $2.55                 │ baseline
───────────────────────────────────────────────────────────────
Trade-offs:
- Self-hosting beats frontier models (GPT-4o, Claude) on cost
- GPT-4o-mini is still cheaper for simple tasks
- Self-hosting: no rate limits, full control, privacy
- Self-hosting: you manage infrastructure, no automatic scaling
When to Self-Host vs Use APIs
Self-Host When:
✓ High volume (>1M tokens/day)
✓ Need low latency (<500ms)
✓ Privacy requirements (data can't leave your infra)
✓ Need fine-tuned/custom models
✓ Predictable, steady traffic
Use APIs When:
✓ Low/variable volume
✓ Need frontier model quality (GPT-4, Claude)
✓ Don't want to manage infrastructure
✓ Need automatic scaling
✓ Experimenting/prototyping
Key Takeaways
What We Learned:
- RunPod provides GPU containers with simple deployment via templates
- vLLM is an inference engine that makes LLM serving 2-5x faster than naive PyTorch
- PagedAttention and continuous batching are the key optimizations
- Self-hosting a 7B model costs ~$2.55 per 1M output tokens on an A6000
- Throughput scales well with concurrency (7.6x at 8 concurrent)
- OpenAI-compatible API means easy integration with existing code
What's Next?
In the next post, I'll test function calling capabilities with this setup. We'll
compare zero-shot function calling accuracy across different open-source models
and see how they stack up against GPT-4.