I had written about KV cache, PagedAttention, and continuous batching before. I could explain them in a post. I still did not trust my own mental model until I tried to rebuild the pieces myself.

So I built mini-vllm: a small inference loop around Llama-2-7B on a single GPU. Scheduler, block allocator, paged KV, generation loop. Not a production server, just enough to see how the ideas connect. Code is on GitHub if you want the implementation.

What an inference engine actually is

From the outside it looks like one model call. Inside it is three boring layers plus weights:

  • Per-request state on the CPU: prompt, tokens generated so far, when to stop, which memory blocks you own.
  • Memory management: a fixed pool of KV blocks on the GPU, handed out like parking spots.
  • Scheduling: who runs this step, who waits, who just finished and must give blocks back.

The transformer math is the easy part to read about. Most of my confusion was in those three layers talking to each other.
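To make the first layer concrete, here is a minimal sketch of per-request state. The class and field names are hypothetical and are not taken from the mini-vllm repo; it only shows that a request is a small CPU-side record, with its memory represented as a list of integers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sequence:
    """Per-request state kept on the CPU (hypothetical field names)."""
    seq_id: int
    prompt_ids: List[int]
    max_new_tokens: int                  # stop rule 1: length budget
    eos_id: Optional[int] = None         # stop rule 2: end-of-sequence token
    output_ids: List[int] = field(default_factory=list)
    block_table: List[int] = field(default_factory=list)  # ids into the GPU KV pool

    @property
    def num_tokens(self) -> int:
        # Tokens whose K/V must be resident: prompt plus everything generated.
        return len(self.prompt_ids) + len(self.output_ids)

    def is_finished(self) -> bool:
        if len(self.output_ids) >= self.max_new_tokens:
            return True
        return bool(self.output_ids) and self.output_ids[-1] == self.eos_id


seq = Sequence(seq_id=0, prompt_ids=[1, 15, 27], max_new_tokens=4, eos_id=2)
seq.output_ids.append(9)
print(seq.num_tokens, seq.is_finished())   # 4 False
seq.output_ids.append(2)                   # the model emitted EOS
print(seq.is_finished())                   # True
```

The scheduler and the allocator would only ever read or mutate records like this one; the tensors themselves stay on the GPU.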

PagedAttention, in one picture

Naive serving often reserves a long KV buffer per request. Short prompts waste most of it, so you fit fewer concurrent users.
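A quick back-of-the-envelope calculation shows how large the waste can be. The shape numbers are the published Llama-2-7B configuration (32 layers, 32 KV heads, head dimension 128, 4096-token context); the fp16 cache and the 200-token request are assumptions for illustration.

```python
# Llama-2-7B attention shape.
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
MAX_CONTEXT = 4096
BYTES_PER_VALUE = 2      # assumption: KV cache stored in fp16


def kv_bytes(num_tokens: int) -> int:
    # The leading 2 covers keys and values.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * num_tokens


reserved = kv_bytes(MAX_CONTEXT)   # naive: reserve the full context per request
used = kv_bytes(200)               # hypothetical short request

print(kv_bytes(1) // 1024, "KiB per token")                      # 512 KiB per token
print(reserved / 2**30, "GiB reserved per request")              # 2.0 GiB reserved per request
print(f"{1 - used / reserved:.0%} of the reservation is unused")  # 95% of the reservation is unused
```

Under these assumptions, a handful of full-context reservations would consume whatever memory the weights leave free on a single GPU, regardless of how short the actual prompts are.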

PagedAttention is the opposite idea: allocate one big KV tensor up front, then give each request a short list of block ids. Token 47 does not mean "grow a tensor." With 16-token blocks it means "entry 2 of this request's block list, slot 15," and that entry names a block in the shared buffer.
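The token-to-slot arithmetic can be sketched in a few lines. The block size of 16 is what the "token 47 is block 2, slot 15" example implies; the function name and the physical block ids are made up for illustration.

```python
BLOCK_SIZE = 16  # tokens per block, implied by 47 == 2 * 16 + 15


def locate(block_table, token_index, block_size=BLOCK_SIZE):
    """Map a logical token index to (physical block id, slot within block)."""
    logical_block, slot = divmod(token_index, block_size)
    physical_block = block_table[logical_block]   # the table is only a map
    return physical_block, slot


block_table = [7, 3, 12]            # hypothetical physical ids for one request
print(divmod(47, BLOCK_SIZE))       # (2, 15): logical block 2, slot 15
print(locate(block_table, 47))      # (12, 15): physical block 12, slot 15

# If the pool is viewed as one flat array of token slots, this is the row index.
physical_block, slot = locate(block_table, 47)
print(physical_block * BLOCK_SIZE + slot)   # 207
```

Nothing here touches a tensor. The lookup returns integers, and a kernel or an indexing operation would use them to read or write the shared buffer.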

The allocator is just bookkeeping on those ids. The GPU tensor never changes shape during inference. That was the click for me: all the "paging" drama is CPU-side lists and integers, not new CUDA allocations every token.
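As a rough illustration of "bookkeeping on ids", here is a free-list allocator sketch. The class and method names are hypothetical and this is not the mini-vllm implementation; the point is that the allocator holds only integers and never sees the GPU tensor.

```python
from collections import deque


class BlockAllocator:
    """Hands out ids of fixed-size blocks in a preallocated KV pool."""

    def __init__(self, num_blocks: int):
        self.free_ids = deque(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_ids:
            raise RuntimeError("out of KV blocks")
        return self.free_ids.popleft()

    def free(self, block_id: int) -> None:
        self.free_ids.append(block_id)

    @property
    def num_free(self) -> int:
        return len(self.free_ids)


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    # Ceiling division: a partially filled block still occupies a whole block.
    return -(-num_tokens // block_size)


allocator = BlockAllocator(num_blocks=4)
table = [allocator.allocate() for _ in range(blocks_needed(48))]
print(table, allocator.num_free)    # [0, 1, 2] 1

for block_id in table:              # request finished: hand the blocks back
    allocator.free(block_id)
print(allocator.num_free)           # 4
```

A real engine would also decide what to do when the pool runs dry (wait, preempt, or swap); this sketch simply raises.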

Things I had backwards at first

I thought the block table stored vectors. It stores handles into the pool. The vectors live in one place; the table is a map.

I mixed up fleet metrics with request state. Before writing code I listed SLOs and counts as if they were per-request fields. They describe the whole server. Step six of generation only needs the request's identity, its prompt, its full output history, and its stop rules.

I got the scheduler order wrong. If you grow block tables before you drop finished sequences, you can allocate a block for a request that already ended, only to free it in the same pass. Small bug, very vLLM-shaped lesson: lifecycle order matters as much as the algorithm on paper.
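The ordering can be sketched as a single scheduling pass. Everything below (the `Seq` class, `schedule_step`, a plain list as the free pool) is a hypothetical simplification, not the mini-vllm scheduler. The example uses a deliberately tight pool, where the wrong order would not just waste a block but fail to allocate at all.

```python
BLOCK_SIZE = 16


class Seq:
    def __init__(self, seq_id, num_tokens, finished=False):
        self.seq_id = seq_id
        self.num_tokens = num_tokens   # tokens already held in the KV cache
        self.finished = finished
        self.block_table = []


def schedule_step(running, free_blocks):
    """One pass: release finished sequences first, then grow the survivors."""
    # 1. Drop finished sequences and return their blocks to the pool.
    still_running = []
    for seq in running:
        if seq.finished:
            free_blocks.extend(seq.block_table)
            seq.block_table = []
        else:
            still_running.append(seq)

    # 2. Only now make room for the token each survivor is about to write.
    for seq in still_running:
        needed = -(-(seq.num_tokens + 1) // BLOCK_SIZE)   # ceiling division
        while len(seq.block_table) < needed:
            if not free_blocks:
                raise RuntimeError("out of KV blocks")
            seq.block_table.append(free_blocks.pop())
    return still_running


# Pool of two blocks, both in use. "a" has finished; "b" needs a second block.
a = Seq("a", num_tokens=16, finished=True)
a.block_table = [0]
b = Seq("b", num_tokens=16)
b.block_table = [1]
free_blocks = []

running = schedule_step([a, b], free_blocks)
print([s.seq_id for s in running])   # ['b']
print(b.block_table)                 # [1, 0]: reuses the block "a" gave back
```

Swapping steps 1 and 2 in this example would raise "out of KV blocks", because "a" would still be holding block 0 when "b" asks for one.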

I treated prefill and decode as different models. Same weights. Different input shapes. Decode is one new token with a position index in the full sequence, not "position 0 in this one-row tensor." I lost time on RoPE before I believed that.
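A small sketch of the position bookkeeping, in plain Python so it runs without a GPU. The helper names are hypothetical. The angle formula is the standard RoPE definition with base 10000; the head dimension of 128 matches Llama-2-7B.

```python
def prefill_positions(prompt_len):
    # Prefill processes the whole prompt at once: positions 0 .. prompt_len - 1.
    return list(range(prompt_len))


def decode_position(num_tokens_in_cache):
    # Decode feeds one token, but its position is its index in the full
    # sequence, which equals the number of tokens already cached.
    return [num_tokens_in_cache]


def rope_angles(position, head_dim=128, base=10000.0):
    # Standard RoPE: angle_i = position * base^(-2i / head_dim).
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


print(prefill_positions(5))     # [0, 1, 2, 3, 4]
print(decode_position(5))       # [5]: first generated token after a 5-token prompt

# Position 0 means "no rotation", so passing 0 on every decode step runs
# without error and quietly produces wrong attention scores.
print(all(angle == 0.0 for angle in rope_angles(0)))   # True
print(rope_angles(5)[0])                                # 5.0
```

The failure mode is silent, which is consistent with losing time on it: the shapes are all correct, only the rotation is wrong.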

I underestimated memory layout vs math. Growing KV with concat-every-step means allocate, copy, free in the hot loop. Writing into a fixed slot is why serving engines feel so much faster in practice. The gap is often memory traffic, not a smarter softmax.
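To put a rough number on the memory traffic, the sketch below counts how many token rows get written under each strategy. It assumes each concatenation allocates a fresh buffer and copies every old row into it, and it ignores allocator overhead entirely, so treat it as a lower bound on the gap rather than a benchmark.

```python
def rows_written_with_concat(num_steps):
    """Rows copied when the cache is rebuilt by concatenation every step."""
    written = 0
    cache_len = 0
    for _ in range(num_steps):
        written += cache_len + 1   # copy all old rows, then append the new one
        cache_len += 1
    return written


def rows_written_with_slots(num_steps):
    """Rows written when each new token goes into a preallocated slot."""
    return num_steps


steps = 1000
print(rows_written_with_concat(steps))   # 500500
print(rows_written_with_slots(steps))    # 1000
```

The concat version grows quadratically with sequence length, and that cost is paid per layer, for both keys and values.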

What surprised me

Papers teach the what. Building taught the why behind small choices.

  • Refcounts on blocks only make sense once you imagine two requests sharing a prefix later.
  • Most bugs were state machines and indexing, not attention.
  • File layout followed dependencies: engine owns everything, nothing imports the engine. I did not design that upfront; it fell out of "who creates whom at startup."
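The refcount point from the first bullet can be sketched like this. The class and method names are hypothetical and the prefix-matching logic is left out entirely; the sketch only shows why a block needs a counter once two block tables can point at it.

```python
class RefCountedAllocator:
    """Free list plus a reference count per block that is in use."""

    def __init__(self, num_blocks):
        self.free_ids = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free_ids:
            raise RuntimeError("out of KV blocks")
        block_id = self.free_ids.pop()
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id):
        # A second request reuses an existing block (e.g. a common prefix).
        self.refcount[block_id] += 1
        return block_id

    def release(self, block_id):
        self.refcount[block_id] -= 1
        if self.refcount[block_id] == 0:      # last owner gone: really free it
            del self.refcount[block_id]
            self.free_ids.append(block_id)


alloc = RefCountedAllocator(num_blocks=4)
a_table = [alloc.allocate(), alloc.allocate()]           # request A: two blocks
b_table = [alloc.share(a_table[0]), alloc.allocate()]    # request B shares A's first

for block_id in a_table:          # A finishes first
    alloc.release(block_id)
print(sorted(alloc.free_ids))     # [0, 2]: the shared block is still held by B

for block_id in b_table:          # B finishes
    alloc.release(block_id)
print(sorted(alloc.free_ids))     # [0, 1, 2, 3]
```

Without the counter, A finishing would hand the shared block back to the pool while B was still reading from it.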

If you want to try this yourself

Start from the PagedAttention paper with a sketch of logical tokens vs physical blocks. Get one sequence generating before you turn on batching. Then read Inside vLLM or peek at nano-vllm once your own loop works. Comparisons land better when you have something to compare against.

mini-vllm is my version of that exercise. The repo has benchmarks and the full wiring; this post is only the flavour I wish I had before I opened the editor.