Serving LLMs with vLLM on RunPod: A Complete Guide
Feb 3rd, 2026 · 12 min read
What Are We Building?
When you want to run an LLM for inference, you have two options: use a cloud API (OpenAI, Anthropic)
or host your own. Self-hosting gives you control over costs, latency, and model choice. In this post,
I'll break down exactly what happens when you deploy a model on RunPod using vLLM.
The Stack:
┌─────────────────────────────────────────────────┐
│  Your Application (API calls)                   │
└─────────────────────┬───────────────────────────┘
                      │ HTTPS
                      ▼
┌─────────────────────────────────────────────────┐
│  RunPod Proxy (yixnlsxbw3md1q-8000.proxy...)    │
└─────────────────────┬───────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│  Docker Container (vllm/vllm-openai:latest)     │
│  ┌─────────────────────────────────────────┐    │
│  │  vLLM Server (OpenAI-compatible API)    │    │
│  │  - /v1/chat/completions                 │    │
│  │  - /v1/completions                      │    │
│  │  - /v1/models                           │    │
│  └─────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────┐    │
│  │  Model Weights (Qwen2.5-7B-Instruct)    │    │
│  │  Loaded in GPU VRAM (~14GB)             │    │
│  └─────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────┐
│  NVIDIA A6000 Ada GPU (48GB VRAM)               │
└─────────────────────────────────────────────────┘
Understanding RunPod
RunPod is a cloud GPU provider. Unlike AWS or GCP, where you typically rent full VMs, RunPod specializes in
GPU containers. You pay only for GPU time, often at 50-70% lower cost than equivalent GPU instances on the major clouds.
What RunPod Provides
RunPod Pod = GPU + Container Runtime + Networking
Components:
1. GPU Hardware → NVIDIA A6000, A100, H100, etc.
2. Container → Docker image that runs your application
3. Storage → Volume (persistent) + Container disk (ephemeral)
4. Networking → Proxy URL for HTTP/HTTPS access
5. SSH Access → Direct terminal access to the container
Templates: Pre-configured Recipes
A RunPod template is a pre-configured deployment recipe. It specifies everything needed
to run a specific application:
Template Components:
────────────────────────────────────────────────────
Component         │ Example Value
────────────────────────────────────────────────────
Docker Image      │ vllm/vllm-openai:latest
GPU Type          │ NVIDIA A6000 Ada (48GB)
Volume Disk       │ 40GB (model weights stored here)
Container Disk    │ 10GB (temporary runtime files)
Environment Vars  │ HF_TOKEN, HF_HOME, etc.
Start Command     │ python -m vllm.entrypoints...
Exposed Ports     │ 8000 (HTTP), 22 (SSH)
────────────────────────────────────────────────────
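If you prefer to create the pod programmatically, the same template fields map onto RunPod's Python SDK. The sketch below is illustrative only: the parameter names, the GPU type string, and the docker_args behaviour are assumptions from memory of the runpod SDK, so verify them against the current SDK docs before relying on it.

import runpod

# Sketch only: kwargs and the GPU type ID string are assumptions; check the runpod SDK docs.
runpod.api_key = "YOUR_RUNPOD_API_KEY"

pod = runpod.create_pod(
    name="vllm-qwen25-7b",
    image_name="vllm/vllm-openai:latest",
    gpu_type_id="NVIDIA RTX 6000 Ada Generation",     # assumed ID string for the A6000 Ada
    volume_in_gb=40,
    container_disk_in_gb=10,
    volume_mount_path="/workspace",
    ports="8000/http,22/tcp",
    env={"HF_TOKEN": "hf_...", "HF_HOME": "/workspace/hf_home"},
    docker_args="--model Qwen/Qwen2.5-7B-Instruct",   # appended to the image's vLLM entrypoint
)
print(pod)   # pod metadata, including its ID and proxy hostname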
Understanding vLLM
vLLM is a high-performance inference engine for LLMs. It's not a model—it's the
software that loads models and serves them efficiently.
Why vLLM Over Plain PyTorch?
Plain PyTorch Inference:
- Load model into GPU memory
- Process one request at a time
- Recompute attention over the full sequence for each new token (no KV cache reuse)
- Result: ~20 tokens/sec
vLLM Inference:
- PagedAttention: Efficient memory management
- Continuous Batching: Process multiple requests simultaneously
- KV Cache Optimization: Reuse computed attention states
- Result: ~50-100+ tokens/sec
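You don't need a server to see these optimizations in action; vLLM can be driven directly from Python. A minimal sketch (the prompts are made up, the model is the one used throughout this post): submit several prompts at once and let vLLM batch them on the GPU.

from vllm import LLM, SamplingParams

# Load the model once; vLLM manages GPU memory and the paged KV cache.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Many prompts submitted together -> vLLM batches them internally.
prompts = [
    "Explain PagedAttention in one sentence.",
    "What is continuous batching?",
    "Name three uses of a KV cache.",
]
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)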
Key vLLM Optimizations
1. PagedAttention
Traditional attention stores KV cache in contiguous memory blocks. vLLM uses
"paged" memory (like OS virtual memory), allowing dynamic allocation and
preventing memory fragmentation.
2. Continuous Batching
Instead of waiting for a batch to complete, vLLM continuously adds new requests
and removes completed ones. This maximizes GPU utilization.
Traditional Batching:
Request 1: [████████████████████]
Request 2: [████████████████████]
Request 3: [████████████████████]
           ↑ Wait for all to finish before next batch
Continuous Batching:
Request 1: [████████]
Request 2: [████████████████]
Request 3: [████████████]
Request 4: [████████████████]
           ↑ New requests added as slots free up
The Deployment Flow
Here's exactly what happens when you deploy:
Step 1: Pod Creation
RunPod allocates:
- 1x NVIDIA A6000 Ada GPU (48GB VRAM)
- 32GB System RAM
- 40GB Volume Disk (mounted at /workspace)
- 10GB Container Disk
Time: ~30 seconds
Step 2: Container Startup
Docker pulls: vllm/vllm-openai:latest
Container contains:
- Python 3.10+
- PyTorch with CUDA support
- vLLM library
- Transformers library
- FastAPI server
Time: ~1-2 minutes (if image not cached)
Step 3: Model Download
vLLM downloads from HuggingFace:
Model: Qwen/Qwen2.5-7B-Instruct
Files:
- model.safetensors.index.json
- model-00001-of-00004.safetensors (4GB each)
- tokenizer.json
- config.json
Total Size: ~14GB
Saved to: /workspace/hf_home/hub/models--Qwen--Qwen2.5-7B-Instruct/
Time: 5-10 minutes (first time)
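If you want to warm the volume before starting vLLM (so restarts skip the download entirely), you can pre-fetch the weights with huggingface_hub. A minimal sketch, assuming HF_HOME points at /workspace/hf_home as in the template above:

from huggingface_hub import snapshot_download

# Download (or reuse) the weights on the persistent volume so they survive
# pod restarts; vLLM will find them in the same cache directory.
path = snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    cache_dir="/workspace/hf_home/hub",   # matches the path shown above
)
print("Weights cached at:", path)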
Step 4: Model Loading
vLLM loads model into GPU:
1. Read safetensors files from disk
2. Convert to appropriate dtype (bfloat16/float16)
3. Transfer weights to GPU VRAM
4. Initialize KV cache blocks
5. Compile CUDA graphs (optional)
Memory Layout:
┌─────────────────────────────────────────┐
│  A6000 GPU (48GB VRAM)                  │
├─────────────────────────────────────────┤
│  Model Weights │ ~14GB                  │
│  KV Cache      │ ~20GB (dynamic)        │
│  Activations   │ ~4GB                   │
│  Free          │ ~10GB                  │
└─────────────────────────────────────────┘
Time: 1-2 minutes
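A rough sanity check on that layout: at bfloat16, weights cost two bytes per parameter, and each generated token adds a fixed slice of KV cache per layer. The numbers below assume Qwen2.5-7B's config (28 layers, plus grouped-query attention with 4 KV heads of head dim 128, which are assumptions on my part), so treat this as back-of-the-envelope rather than vLLM's exact accounting.

# Back-of-the-envelope GPU memory budget (bfloat16 = 2 bytes per value).
params = 7.6e9                             # ~7.6B parameters
weights_gb = params * 2 / 1e9              # ≈ 15 GB, same ballpark as the ~14GB above

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 28, 4, 128    # assumed Qwen2.5-7B config
kv_per_token = 2 * layers * kv_heads * head_dim * 2      # bytes
tokens_in_20gb = 20e9 / kv_per_token

print(f"weights ≈ {weights_gb:.1f} GB")
print(f"KV cache ≈ {kv_per_token / 1024:.0f} KiB per token")
print(f"≈ {tokens_in_20gb:,.0f} tokens fit in a 20 GB KV cache")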
Step 5: API Server Ready
vLLM starts FastAPI server:
INFO: Uvicorn running on http://0.0.0.0:8000
Available Endpoints:
────────────────────────────────────────────────────────
Endpoint             │ Method │ Purpose
────────────────────────────────────────────────────────
/v1/models           │ GET    │ List loaded models
/v1/chat/completions │ POST   │ Chat API (OpenAI format)
/v1/completions      │ POST   │ Text completion
/health              │ GET    │ Health check
────────────────────────────────────────────────────────
RunPod creates proxy:
https://yixnlsxbw3md1q-8000.proxy.runpod.net → localhost:8000
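A quick readiness check is to hit /health and /v1/models through the proxy. The pod ID below is a placeholder, and the bearer token is only needed if you started vLLM with an API key:

import requests

BASE_URL = "https://<your-pod-id>-8000.proxy.runpod.net"  # placeholder pod ID
HEADERS = {"Authorization": "Bearer sk-1234"}             # only if vLLM was given an API key

# /health returns 200 once the model is loaded; /v1/models shows what is being served.
print(requests.get(f"{BASE_URL}/health", headers=HEADERS).status_code)
print(requests.get(f"{BASE_URL}/v1/models", headers=HEADERS).json())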
Making Requests
vLLM exposes an OpenAI-compatible API. This means you can reuse the same code
you'd write for OpenAI; just change the base URL and API key:
# OpenAI API call
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer sk-xxx" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [...]}'

# vLLM API call (same format!)
curl https://your-pod-8000.proxy.runpod.net/v1/chat/completions \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [...]}'
Request Flow
1. Request arrives at vLLM server
2. Tokenizer converts text → token IDs
"What is ML?" → [1724, 374, 14946, 30]
3. Tokens added to scheduling queue
4. Scheduler batches with other requests
5. Forward pass through model:
- Embedding lookup
- 28 transformer layers
- Final linear → logits
6. Sampling (temperature, top_p)
7. New token generated
8. Repeat until stop condition
9. Detokenize → text response
10. Return JSON with usage stats
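Steps 2 and 9 are easy to reproduce locally with the model's tokenizer; the exact IDs depend on the tokenizer, so they may differ from the example above:

from transformers import AutoTokenizer

# Step 2 in miniature: the tokenizer maps text to the token IDs the model sees.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
ids = tok.encode("What is ML?")
print(ids)                 # token IDs (exact values depend on the tokenizer)
print(tok.decode(ids))     # step 9 in reverse: IDs back to text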
Benchmark Results
I ran benchmarks at different concurrency levels to understand throughput scaling:
Configuration:
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- GPU: NVIDIA A6000 Ada (48GB)
- Cost: $0.44/hour
- Max tokens per request: 100
Results:
────────────────────────────────────────────────────────────────────
Concurrency │ Avg Latency │ P50 Latency │ P95 Latency │ Tokens/s/req
────────────────────────────────────────────────────────────────────
1           │ 2068ms      │ 2018ms      │ 2276ms      │ 48.49
4           │ 2060ms      │ 2070ms      │ 2251ms      │ 48.68
8           │ 2156ms      │ 2145ms      │ 2376ms      │ 46.55
────────────────────────────────────────────────────────────────────
Throughput Scaling:
- 1 concurrent: 0.48 req/s
- 4 concurrent: 1.91 req/s (4x improvement)
- 8 concurrent: 3.64 req/s (7.6x improvement)
Key Observation: Latency stays relatively flat even as concurrency increases.
This is continuous batching in action—vLLM efficiently processes multiple requests
without proportionally increasing per-request latency.
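For reference, the numbers above came from a simple load generator. This is not the exact script, just a minimal sketch of the same idea: fire a batch of requests at a fixed concurrency and time each one (URL and key are placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-pod-id>-8000.proxy.runpod.net/v1",  # placeholder pod URL
    api_key="sk-1234",
)

def one_request(_):
    # Time a single chat completion and record how many tokens it generated.
    start = time.time()
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "Explain overfitting briefly."}],
        max_tokens=100,
    )
    return time.time() - start, resp.usage.completion_tokens

# 32 requests at a concurrency of 8, mirroring the table above.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(one_request, range(32)))

latencies = [lat for lat, _ in results]
tokens = sum(toks for _, toks in results)
print(f"avg latency: {sum(latencies) / len(latencies) * 1000:.0f} ms")
print(f"per-request speed: {tokens / sum(latencies):.1f} tokens/s")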
Cost Analysis
Self-hosting only makes sense if it's cheaper than API providers. Let's do the math:
Calculating Cost Per 1M Output Tokens:
Given:
- GPU cost: $0.44/hour
- Average tokens/sec: 47.91 (mean per-request generation speed from the benchmark above)
- Tokens per hour: 47.91 × 3600 = 172,476
Cost per token = $0.44 / 172,476 = $0.00000255
Cost per 1M tokens = $2.55
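The same arithmetic as a small helper. Note that 47.91 tokens/sec is the single-stream speed; if each of the 8 concurrent requests really generates the full 100 tokens, the effective cost per token drops further:

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_sec: float) -> float:
    """Dollar cost to generate 1M output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(0.44, 47.91))       # ≈ 2.55, single request stream
print(cost_per_million_tokens(0.44, 3.64 * 100))  # ≈ 0.34, 8 concurrent x ~100 tokens each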
Comparison:
───────────────────────────────────────────────────────────────
Provider          │ Cost/1M output tokens │ vs Self-Host
───────────────────────────────────────────────────────────────
GPT-4o            │ $15.00                │ 5.9x more expensive
GPT-4o-mini       │ $0.60                 │ 4.2x cheaper
Claude 3.5 Sonnet │ $15.00                │ 5.9x more expensive
Self-hosted vLLM  │ $2.55                 │ baseline
───────────────────────────────────────────────────────────────
Trade-offs:
- Self-hosting beats frontier models (GPT-4o, Claude) on cost
- GPT-4o-mini is still cheaper for simple tasks
- Self-hosting: no rate limits, full control, privacy
- Self-hosting: you manage infrastructure, no automatic scaling
When to Self-Host vs Use APIs
Self-Host When:
✓ High volume (>1M tokens/day)
✓ Need low latency (<500ms)
✓ Privacy requirements (data can't leave your infra)
✓ Need fine-tuned/custom models
✓ Predictable, steady traffic
Use APIs When:
✓ Low/variable volume
✓ Need frontier model quality (GPT-4, Claude)
✓ Don't want to manage infrastructure
✓ Need automatic scaling
✓ Experimenting/prototyping
Key Takeaways
What We Learned:
- RunPod provides GPU containers with simple deployment via templates
- vLLM is an inference engine that makes LLM serving 2-5x faster than naive PyTorch
- PagedAttention and continuous batching are the key optimizations
- Self-hosting a 7B model costs ~$2.55 per 1M output tokens on an A6000
- Throughput scales well with concurrency (7.6x at 8 concurrent)
- OpenAI-compatible API means easy integration with existing code
What's Next?
In the next post, I'll test function calling capabilities with this setup. We'll
compare zero-shot function calling accuracy across different open-source models
and see how they stack up against GPT-4.