GPU Fundamentals & LLM Inference Mental Models
Research Engineer interviews at leading AI labs test your ability to reason about inference performance from first principles. You will be expected to estimate whether a model fits on a given GPU, explain why autoregressive decoding is slow, identify bottlenecks (memory bandwidth vs. compute) for a given workload, and propose optimizations that target the actual bottleneck.
This article builds that foundation. We cover GPU architecture, the roofline model, memory estimation, arithmetic intensity analysis, and latency estimation: everything you need to develop strong intuition about what makes LLM inference fast or slow.
GPU Architecture Fundamentals
NVIDIA A100 Architecture
The A100 is the workhorse GPU for LLM inference and training. Understanding its architecture is essential for reasoning about performance.
The A100-80GB packs 108 SMs, each with 64 CUDA cores and 4 Tensor Cores, connected via a 40 MB L2 cache to 80 GB of HBM2e at 2 TB/s. Each SM also has 192 KB of fast on-chip SRAM (L1/shared memory); the size and speed gap between that SRAM and HBM is what makes memory bandwidth the bottleneck.
Key Numbers to Memorize
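The handful of A100-80GB figures used throughout this article can be kept in one place; here they are as Python constants for the later estimates. Treat them as round numbers for mental math, not exact datasheet values.

```python
# Approximate A100-80GB figures used for the estimates in this article.
HBM_CAPACITY  = 80e9      # 80 GB of HBM2e
HBM_BANDWIDTH = 2e12      # ~2 TB/s memory bandwidth
PEAK_FP16     = 312e12    # ~312 TFLOPS dense FP16 on Tensor Cores
NUM_SMS       = 108       # streaming multiprocessors
L2_CACHE      = 40e6      # 40 MB shared L2 cache
SRAM_PER_SM   = 192e3     # 192 KB L1/shared memory per SM

TOTAL_SRAM = NUM_SMS * SRAM_PER_SM   # ~20.7 MB of fast on-chip SRAM
```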
Execution Model: Warps and Thread Blocks
- Kernel launch → a grid of thread blocks is scheduled onto SMs.
- Each block has up to 1024 threads, split into warps.
- Each warp = 32 threads executing in lockstep (SIMT); they all run the same instruction at once.
The warp is the fundamental unit of execution. All 32 threads in a warp execute the same instruction at the same time. If threads diverge (take different if branches), both paths must be executed serially; this is called warp divergence, and it wastes cycles.
SRAM vs. HBM: The Crucial Ratio
- Total SRAM: ~20 MB (fast, on-chip, ~30 cycle latency)
- Total HBM: 80 GB (slow, off-chip, ~400 cycle latency)
- Ratio: HBM is ~4000x larger but ~13x slower
This mismatch is the fundamental reason why memory bandwidth is the bottleneck for most LLM inference workloads. The model weights live in HBM, and for each token generated during decode, we must read ALL weights from HBM through a limited bandwidth pipe.
Memory Hierarchy Latency
Why This Matters for LLM Inference
During autoregressive decode, each generated token requires reading the full model weights from HBM:
- Llama-7B at FP16: 14 GB of weights
- At 2 TB/s bandwidth: Takes 14 GB / 2 TB/s = 7 ms just to read the weights
- Computation: ~14 GFLOP per token, which at 312 TFLOPS takes only 0.045 ms
The weight-loading time is 150x larger than the compute time. This is why decode is memory-bandwidth-bound, and why all the major inference optimizations (quantization, KV caching, speculative decoding, batching) ultimately aim to reduce the bytes-per-useful-FLOP ratio.
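Here is that arithmetic as a short sketch, assuming Llama-7B at FP16 (7B parameters, 2 bytes per weight, ~2 FLOPs per parameter per token) on an A100:

```python
# Back-of-the-envelope: one decode step for Llama-7B (FP16) on an A100.
params = 7e9
bytes_per_weight = 2                 # FP16
hbm_bandwidth = 2e12                 # ~2 TB/s
peak_fp16_flops = 312e12             # ~312 TFLOPS

weight_bytes = params * bytes_per_weight           # 14 GB read per generated token
flops_per_token = 2 * params                       # ~14 GFLOP (one multiply-add per weight)

memory_time = weight_bytes / hbm_bandwidth         # ~7 ms just to stream the weights
compute_time = flops_per_token / peak_fp16_flops   # ~0.045 ms of actual math

print(f"memory-bound time : {memory_time * 1e3:.2f} ms")
print(f"compute-bound time: {compute_time * 1e3:.3f} ms")
print(f"ratio             : {memory_time / compute_time:.0f}x")   # ~150x
```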
The Roofline Model
The roofline model is the single most important mental model for understanding inference performance. It tells you whether a workload is limited by:
- Memory bandwidth (loading data from HBM): the left side of the plot
- Compute throughput (doing arithmetic): the right side / top of the plot
Arithmetic Intensity
Arithmetic intensity (AI) measures how much work a kernel does per byte of data it moves: FLOPs performed divided by bytes read from and written to HBM.
Ridge Point
The ridge point is the arithmetic intensity at which the compute and memory roofs intersect: peak FLOPs divided by peak memory bandwidth. For the A100 at FP16, that is 312 TFLOPS / 2 TB/s = 156 FLOP/byte. Below the ridge point a kernel is memory-bound; above it, compute-bound.
Attainable Performance
For a given arithmetic intensity, attainable performance is min(peak compute throughput, AI × memory bandwidth): you hit whichever roof is lower.
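A minimal sketch of that relationship, using the A100 figures from earlier (312 TFLOPS FP16, 2 TB/s); the function name and the sample AI values are just illustrative:

```python
def attainable_flops(arithmetic_intensity: float,
                     peak_flops: float = 312e12,
                     bandwidth: float = 2e12) -> float:
    """Roofline model: performance is capped either by the memory roof
    (AI * bandwidth) or by the compute roof (peak_flops), whichever is lower."""
    return min(peak_flops, arithmetic_intensity * bandwidth)

ridge_point = 312e12 / 2e12   # 156 FLOP/byte: where the two roofs meet

for ai in [1, 16, 156, 1024]:
    frac = attainable_flops(ai) / 312e12
    print(f"AI = {ai:5d} FLOP/byte -> {frac:5.1%} of peak")
# AI = 1 (decode at batch=1) reaches well under 1% of peak;
# AI >= 156 saturates the Tensor Cores.
```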
The Key Insight: Prefill is Compute-Bound, Decode is Memory-Bound
This single insight explains nearly every optimization in modern LLM serving.
Prefill
Decode
Why is decode so memory-inefficient? During decode, we generate ONE token. This means we do a matrix-vector product: the weight matrix has d_model × d_model elements, but the input vector has only d_model elements. Each weight is loaded from HBM but used for only one multiply-add (2 FLOPs / 2 bytes at FP16 = an AI of 1).
Why is prefill efficient? During prefill, the input is a matrix of shape (seq_len × d_model). The same weight matrix (loaded once from HBM) is multiplied against seq_len vectors, so the weight bytes are amortized across seq_len tokens, giving AI ~ seq_len.
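A quick way to see that AI tracks the number of tokens: count FLOPs and weight bytes for a single d_model × d_model linear layer. The helper below is illustrative and ignores the comparatively small activation traffic.

```python
def linear_layer_ai(d_model: int, n_tokens: int, bytes_per_param: int = 2) -> float:
    """Approximate arithmetic intensity of one d_model x d_model linear layer
    applied to n_tokens token vectors (weight traffic only; activation
    reads/writes are small by comparison and ignored here)."""
    flops = 2 * d_model * d_model * n_tokens             # one multiply-add per weight per token
    weight_bytes = bytes_per_param * d_model * d_model   # weights are loaded once from HBM
    return flops / weight_bytes                          # = n_tokens at FP16

print(linear_layer_ai(4096, 1))      # decode at batch=1        -> AI ~ 1
print(linear_layer_ai(4096, 512))    # prefill, 512-token prompt -> AI ~ 512
```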
Optimization Implications
Memory Estimation
Understanding where GPU memory goes during inference is critical for capacity planning. Let's break down the memory components for Llama-7B.
VRAM Breakdown: Llama-7B at FP16
- At batch=1, model weights dominate total memory (~90%+)
- KV cache is small at batch=1 but grows linearly with batch size and sequence length
- Activations during inference are negligible (only one layer active at a time)
- CUDA overhead is a fixed ~0.5 GB cost
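Putting the numbers above together, here is a rough sketch for Llama-7B at FP16 and batch=1, using the published Llama-7B configuration (32 layers, 32 attention heads, head dimension 128, full multi-head attention) and an assumed context of 2048 cached tokens:

```python
# Rough VRAM breakdown for Llama-7B at FP16, batch=1.
params     = 7e9
n_layers   = 32
n_kv_heads = 32          # Llama-7B uses full multi-head attention (no GQA)
head_dim   = 128
bytes_fp16 = 2

weights = params * bytes_fp16                                            # 14 GB

# KV cache: a K and a V vector per layer, per KV head, per cached token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16   # ~0.5 MB per token
seq_len = 2048                                                           # assumed context length
kv_cache = kv_bytes_per_token * seq_len                                  # ~1 GB at batch=1

cuda_overhead = 0.5e9                                                    # fixed CUDA context cost

total = weights + kv_cache + cuda_overhead
print(f"weights : {weights / 1e9:.1f} GB")
print(f"KV cache: {kv_cache / 1e9:.2f} GB ({kv_bytes_per_token / 1e6:.2f} MB per token)")
print(f"overhead: {cuda_overhead / 1e9:.1f} GB")
print(f"total   : {total / 1e9:.1f} GB (weights ~{weights / total:.0%})")
```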
Does Llama-70B Fit on A100-80GB?
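The question in the heading comes down to one line of arithmetic per precision; a sketch, taking 2, 1, and 0.5 bytes per parameter for FP16, INT8, and INT4:

```python
# Can Llama-70B's weights fit in 80 GB of HBM?
params = 70e9
for precision, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    fits = "fits" if weight_gb < 80 else "does NOT fit"
    print(f"{precision}: {weight_gb:.0f} GB of weights -> {fits} on one A100-80GB")
# FP16: 140 GB (needs 2+ GPUs).  INT8: 70 GB (fits, but leaves little room for
# KV cache).  INT4: 35 GB (fits comfortably, with headroom for KV cache).
```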
When Does KV Cache Dominate?
At batch=1, weights dominate. But at production batch sizes, KV cache quickly overtakes weights. This plot shows the crossover point for Llama-7B.
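As a rough sketch of where that crossover lands, reusing the ~0.5 MB-per-token KV figure from above and assuming 2048 cached tokens per sequence:

```python
# At what batch size does Llama-7B's KV cache match its 14 GB of weights?
weights_bytes = 14e9
kv_bytes_per_token = 0.5e6      # ~0.5 MB/token for Llama-7B at FP16 (see above)
seq_len = 2048                  # assumed tokens cached per sequence

kv_per_sequence = kv_bytes_per_token * seq_len             # ~1 GB per sequence
crossover_batch = weights_bytes / kv_per_sequence
print(f"KV cache per sequence: {kv_per_sequence / 1e9:.1f} GB")
print(f"KV cache exceeds weights beyond batch ~{crossover_batch:.0f}")
# Longer contexts pull the crossover to even smaller batch sizes.
```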
Memory Across Model Sizes and Precisions
Key observations:
- Weights scale linearly with parameter count and with bytes per parameter, so lower-precision quantization shrinks them proportionally.
- INT4 gives a 4x reduction in weight memory vs FP16, making 70B feasible on one GPU.
- KV cache for 70B is smaller than expected thanks to GQA (8 KV heads vs 64 query heads); see the sketch after this list.
- At batch=1, weights dominate. At production batch sizes (32+), KV cache dominates.
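To make the GQA point concrete, here is the per-token KV cache for Llama-7B next to Llama-70B, assuming the published configurations (7B: 32 layers, 32 KV heads; 70B: 80 layers, 8 KV heads; head dimension 128 for both):

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache written per token: a K and a V vector for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

llama_7b  = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)   # MHA
llama_70b = kv_bytes_per_token(n_layers=80, n_kv_heads=8,  head_dim=128)   # GQA

print(f"Llama-7B : {llama_7b / 1e6:.2f} MB per token")    # ~0.52 MB
print(f"Llama-70B: {llama_70b / 1e6:.2f} MB per token")   # ~0.33 MB, despite 10x the parameters
```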
Arithmetic Intensity Analysis
Let's see how arithmetic intensity changes with different operating conditions, and how that determines whether we're memory-bound or compute-bound.
Decode AI vs Batch Size
Decode at batch=1 has an arithmetic intensity of just 1, only 0.6% of the ridge point. Even at batch=64, we're still deeply memory-bound. It takes a batch size of ~156 to fully saturate the A100's compute capability during decode.
Prefill AI vs Sequence Length
Prefill crosses the ridge point at relatively short sequence lengths (~300-400 tokens). For typical prompt lengths (1K+ tokens), prefill is solidly compute-bound: the GPU cores become the bottleneck, not memory bandwidth.
Time Estimates
Using the roofline model, we can estimate latencies for prefill (TTFT = time to first token) and decode (per-token generation time). These back-of-the-envelope calculations are exactly the kind of reasoning expected in interviews.
Prefill Latency (TTFT)
Prefill latency scales roughly linearly with prompt length. For Llama-7B on an A100, a 2048-token prompt works out to roughly 90 ms of compute time (~14 GFLOP per token × 2048 tokens at 312 TFLOPS), keeping time to first token comfortably low.
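A sketch of the TTFT estimate, using the article's ~14 GFLOP-per-token figure for Llama-7B and assuming prefill runs near peak Tensor Core throughput (so these are lower bounds):

```python
# TTFT estimate for Llama-7B prefill on an A100.
flops_per_token = 2 * 7e9          # ~14 GFLOP per token (2 FLOPs per parameter)
weight_bytes    = 14e9             # FP16 weights must be streamed at least once
peak_fp16_flops = 312e12
hbm_bandwidth   = 2e12

for prompt_len in [128, 512, 2048, 8192]:
    compute_time = flops_per_token * prompt_len / peak_fp16_flops
    memory_time  = weight_bytes / hbm_bandwidth        # ~7 ms floor for short prompts
    ttft = max(compute_time, memory_time)              # roofline: whichever roof binds
    print(f"{prompt_len:5d} tokens -> TTFT ~ {ttft * 1e3:6.1f} ms")
# A 2048-token prompt comes out around 90 ms at peak throughput; real systems
# land somewhat higher because they don't sustain 100% of peak FLOPS.
```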
Decode Latency and Throughput
Decode time stays ~7 ms per step while memory-bound (batch < ~164); total tokens/sec scales with batch. Past the ridge, latency rises and throughput plateaus (~22K tokens/sec). See the diagram below.
The key insights:
- Decode time per step stays nearly constant as batch size increases (while memory-bound)
- Total throughput (tokens/sec) scales linearly with batch size, essentially for free (see the sketch below)
- This is the fundamental insight behind continuous batching (vLLM, TGI)
- Per-request latency stays the same: everyone gets the same speed, but the server handles more requests
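The same roofline logic reproduces this behavior; a sketch for Llama-7B at FP16, ignoring KV-cache traffic (which in practice nudges the saturation batch slightly above 156):

```python
# Decode step time and throughput vs. batch size for Llama-7B (FP16) on an A100.
params          = 7e9
weight_bytes    = params * 2           # 14 GB streamed from HBM per decode step
peak_fp16_flops = 312e12
hbm_bandwidth   = 2e12

for batch in [1, 8, 32, 128, 256, 512]:
    memory_time  = weight_bytes / hbm_bandwidth                # same regardless of batch
    compute_time = 2 * params * batch / peak_fp16_flops         # grows with batch
    step_time    = max(memory_time, compute_time)               # roofline bound per step
    throughput   = batch / step_time                            # total tokens/sec
    print(f"batch {batch:4d}: {step_time * 1e3:5.1f} ms/step, {throughput / 1e3:6.1f}K tok/s")
# Step time sits at ~7 ms until compute catches up (around batch ~156 in this
# simplified model), after which throughput plateaus around ~22K tokens/sec.
```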
Key Takeaways
5 Things You Should Know
- "Autoregressive decode is memory-bandwidth-bound, not compute-bound." Each decode step requires reading all model weights from HBM to generate a single token. The arithmetic intensity is ~1 FLOP/byte (at batch=1, FP16), far below the A100's ridge point of 156. This means the Tensor Cores sit idle ~99% of the time during decode.
- "Prefill is compute-bound for reasonable sequence lengths." During prefill, the same weights are reused across all tokens in the prompt, giving arithmetic intensity that scales with seq_len. For seq_len > ~300 on A100, prefill becomes compute-bound.
- "Batching is the primary lever for decode throughput." Since decode is memory-bound, adding more sequences to a batch reuses the same weight data already being streamed from HBM. The per-step latency barely changes, but throughput scales linearly. This is why continuous batching (vLLM, TGI) is transformative.
- "KV cache memory grows as O(batch_size × seq_len × num_layers × d_head × num_kv_heads)." At production batch sizes (32+), KV cache can dominate total GPU memory. This is why PagedAttention, GQA (grouped-query attention), and KV cache quantization are critical.
- "Quantization helps decode more than prefill." Since decode is memory-bound, reducing the number of bytes per weight (FP16 → INT4 = 4x fewer bytes) directly translates to ~4x faster decode. For prefill (compute-bound), quantization helps less unless it also reduces the compute cost.
Quick-Reference Formulas
Keep these formulas in your head for back-of-the-envelope calculations:
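The sketch below restates, in one place, the estimates used throughout this article; the function names are just for illustration.

```python
# Back-of-the-envelope formulas used throughout this article.

def weight_memory(params, bytes_per_param):
    """Model weight footprint in bytes (e.g. 7e9 params * 2 bytes = 14 GB at FP16)."""
    return params * bytes_per_param

def kv_cache_memory(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache footprint: a K and a V vector per layer, per KV head, per cached token."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth                      # A100 FP16: 312e12 / 2e12 = 156

def decode_step_time(params, bytes_per_param, bandwidth):
    """Memory-bound decode: time to stream the weights once per step."""
    return params * bytes_per_param / bandwidth        # Llama-7B FP16: ~7 ms

def prefill_time(params, prompt_len, peak_flops):
    """Compute-bound prefill: ~2 FLOPs per parameter per prompt token."""
    return 2 * params * prompt_len / peak_flops        # Llama-7B, 2048 tokens: ~90 ms

def max_decode_throughput(params, peak_flops):
    """Tokens/sec once batching saturates compute."""
    return peak_flops / (2 * params)                   # Llama-7B on A100: ~22K tok/s
```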