Context in LLMs: What Determines It, What It Costs, and What Actually Works
1. What Determines Context Length
When people say a model "supports 128K context," that sounds like a single setting you can crank up. It's not. Context length is actually limited by three things at once: how the model tracks token position, how expensive attention gets as sequences grow, and how much memory you need to store past tokens during inference. Each one puts a ceiling on how far you can go.
Constraint 1: Positional Encoding Scheme
Transformers have no built-in sense of order. Self-attention is permutation-invariant, meaning it treats "the cat sat" the same as "sat cat the" unless you explicitly tell it which token came first. Positional encodings inject that order. The scheme you pick dictates the ceiling.
| Scheme | How It Works | Limit |
|---|---|---|
| Learned Absolute (original GPT) | Learns an embedding vector for positions 0..N-1 | Hard ceiling at N. Position 4097 literally doesn't exist. |
| Sinusoidal (original Transformer) | Fixed sin/cos at different frequencies | Theoretically infinite, but untested positions degrade |
| RoPE (LLaMA, Qwen, most modern LLMs) | Encodes position as rotation angle in embedding space | Theoretically extrapolable, but rotations at unseen angles are out-of-distribution |
| ALiBi (BLOOM) | Adds linear bias to attention scores based on distance | Better extrapolation, but penalizes long-range attention by design |
RoPE: The Dominant Approach (and Its Limitation)
Most modern models use Rotary Position Embedding (RoPE). For each dimension pair (i), the rotation angle at position m is:
The problem: If you train on positions 0 to 4096, the model has only ever seen rotation angles in that range. Position 8000 produces rotations the model has never encountered. It's like teaching someone a 12-hour clock then asking them to read a 24-hour clock. The mechanism is the same but the values are foreign. This is a distribution shift, and the model's output degrades unpredictably.
Constraint 2: Attention's Quadratic Cost
Self-attention computes pairwise scores between every token pair:
| From → To | Token Increase | Attention Cost Increase |
|---|---|---|
| 4K → 8K | 2x | 4x |
| 4K → 32K | 8x | 64x |
| 4K → 128K | 32x | 1,024x |
| 4K → 1M | 250x | 62,500x |
This is the fundamental reason you can't "just make context longer." The cost grows with the square of the length.
Constraint 3: KV Cache at Inference
At inference, you store Key and Value vectors for every past token, every layer. For a 70B model with GQA (8 KV heads):
Why You Can't "Just Extend" the Context
Problem 1: Training Distribution Mismatch (the deepest reason)
The model learns attention patterns from training data. If most training documents are 2-8K tokens:
- It learns "the answer is usually within a few thousand tokens of the question"
- It learns attention heads that specialize in local patterns
- It never practices retrieving a fact from 50K tokens away
Even if you architecturally support 128K, the model hasn't learned when and how to attend across that distance. This is a learned behavior problem, not an architecture problem.
Problem 2: Lost in the Middle (Liu et al., 2023)
Even models with "long context" show a U-shaped retrieval curve:
Softmax attention distributes probability mass. Over long sequences, the middle tokens get starved because they don't benefit from primacy or recency bias.
Problem 3: Effective vs. Claimed Context
A model "supporting" 128K and effectively using 128K are different things:
The spec sheet number is the architectural limit. The effective limit is much lower.
The Fundamental Trilemma
Pick at most two:
- Long + Quality = massive compute (full attention, long training)
- Long + Efficient = sparse/approximate attention (loses quality)
- Quality + Efficient = short context (what most models do best)
2. Strategies for Extending and Managing Context
There are two fundamentally different approaches: make the window bigger (architectural) or make the model smarter within the window it has (behavioral). Both have their place.
Architectural Strategies: Making the Window Bigger
| Technique | How It Works | Trade-off |
|---|---|---|
| Position Interpolation | Scale positions down: pos' = pos × (L_train / L_target). So position 8000 → 4000 (within training range). | Reduces positional resolution. Nearby tokens look more similar. Fine distinctions blur. |
| NTK-Aware Scaling | Scale the RoPE frequency base instead of positions: base' = base × scale_factor | Better than naive interpolation. Preserves local resolution. Still needs fine-tuning on longer data. |
| YaRN | Different scaling for different frequency dimensions. High-frequency (local patterns): minimal scaling. Low-frequency (global patterns): aggressive scaling. Plus temperature scaling on attention logits. | Best extrapolation quality. Most complex to implement. Still needs some continued training. |
| ALiBi | Linear distance bias on attention scores. No learned positions at all. | Good extrapolation, but linearly penalizes distance, limiting long-range attention by design. |
| Sliding Window (Mistral) | Each token only attends to last W tokens (e.g., W=4096). | O(n×W) instead of O(n²). But literally cannot retrieve beyond window. Stacking layers creates indirect receptive field, but it's lossy. |
| Ring Attention | Split sequence across GPUs, pass KV in a ring. | Solves memory. Does NOT solve the learning problem. Model still must learn long-range patterns. |
| Sparse Attention (Longformer, BigBird) | Local + global + random attention patterns. O(n) cost. | Some token pairs never directly attend to each other, so information is lost. Replaced in practice by Flash Attention. |
| Flash Attention | Exact same math as standard attention, but tiled computation fits in GPU SRAM. O(n) memory, O(n²) compute. | No quality loss. Solves the memory problem. Does NOT reduce compute. This is what everyone actually uses. |
| GQA (Grouped Query Attention) | Share KV heads across query heads (e.g., 64 query heads → 8 KV heads). | Reduces KV cache by 8x. Minimal quality loss. Standard in all modern models (Llama 3, Qwen, Mistral). |
Behavioral Strategies: Making the Model Smarter Within Its Window
Instead of making the window bigger, teach the model to use its existing window more intelligently. These are agent-level strategies.
Strategy 1: Surgical Retrieval. Read functions, not files
The agent already knows (from the stack trace or grep) which function matters. Reading the whole file dumps 1800 lines of noise into context that competes for attention with the 30 lines that matter.
Strategy 2: Context Eviction. Proactively discard stale information
After you've acted on information, you rarely need it verbatim. A summary preserves the decision-relevant facts while freeing context for new information.
Strategy 3: Structured Memory. Notes over verbatim copies
Strategy 4: Plan Before Reading. Avoid wrong leads entirely
200 tokens of planning saves 8K tokens of unnecessary reads.
Strategy 5: Compaction as a Tool
Compaction doesn't have to be an external system decision. You can give the agent a
compact() tool and let it learn when to use it through
reinforcement learning:
The model learns through trial and error: compact too early and you lose critical details. Compact too late and context fills up with noise. Extract key facts before compacting and you get the best of both worlds. This is trainable with RLVR.
The Core Principle
- Higher signal-to-noise ratio → attention works better on what's there
- Less "lost in the middle" because there's less middle
- Faster per-turn → more iterations possible → better refinement
- Cheaper per turn → can afford longer sessions
3. At Which Stage Can You Do This?
LLM development has four stages. Each one offers different knobs for improving context, at very different cost levels. Understanding which stage to operate at is critical.
Stage 1: Pretraining
What you can do:
- Choose the positional encoding scheme (RoPE, ALiBi)
- Choose the attention architecture (full, sparse, sliding window)
- Train on progressively longer sequences (4K → 16K → 64K → 128K)
- Curate long-context training data (full codebases, books, legal documents)
- Build in GQA for KV cache efficiency
Who does this: Only frontier labs (OpenAI, Anthropic, Google, Meta, DeepSeek). This is where the fundamental context capacity is set. Everything downstream is constrained by what happens here.
What it costs: Tens of millions to billions of dollars. Training on 128K sequences costs 1,024x more per sample than 4K (quadratic attention). You need thousands of GPUs for weeks.
What you get: A model that can physically attend to 128K-200K tokens and has actually practiced using long-range attention patterns during training. This is what makes Claude, GPT-4, and Gemini good at long context. They spent enormous compute on long-context pretraining.
Stage 2: Mid-Training (Context Extension)
What you can do:
- Take a model pretrained at 4K-8K and extend to 32K-128K
- Adjust RoPE parameters (YaRN, NTK scaling, change base frequency)
- Continue training on 10-100B tokens of long-context data
Real examples:
What it costs: Still expensive. You need long-context training data (books, repos, long docs) and significant GPU time. Context extension works well up to ~4x the original length. 32K → 128K is fine. 32K → 1M is sketchy.
| Extension | Quality | Estimated Cost |
|---|---|---|
| 4K → 16K (4x) | Good | ~$10K-50K (days on a small cluster) |
| 4K → 32K (8x) | Good | ~$50K-200K |
| 4K → 128K (32x) | Acceptable | ~$200K-1M |
| 4K → 1M (250x) | Degraded | ~$1M+ (and quality is questionable) |
Stage 3: Post-Training (SFT and RL)
What you can do:
- Fine-tune on long-context tasks (needle-in-a-haystack, multi-doc QA)
- SFT on full multi-turn agent trajectories with tool use
- RLVR to teach the model when to compact, extract, and manage context
- Train with compaction/memory tools as available actions
This is where you teach the model to be smart within its window. The context capacity is already set by pretraining. Post-training teaches the model behavioral strategies:
RLVR is particularly powerful here because optimal context management is task-dependent. Sometimes you should compact aggressively. Sometimes you need every detail. The model learns the strategy through trial and error, not imitation.
What it costs:
Stage 4: Agent Scaffolding
What you can do:
- Automatic context compaction (summarize old turns when approaching limits)
- Prompt caching (don't reprocess the same system prompt every turn)
- Selective file reading (tools load specific functions, not entire files)
- Conversation summarization (compress old turns into key decisions)
- KV cache management across turns
This is what Claude Code, Cursor, and Codex actually do. The model itself doesn't change. The system around the model manages context intelligently.
What it costs: Engineering time only. No GPU costs beyond normal inference.
Summary: The Four Stages
| Stage | What It Does | Cost | Who Does It |
|---|---|---|---|
| Pretraining | Sets context capacity, trains long-range attention | $10M-1B+ | Frontier labs only |
| Mid-Training | Extends context window with RoPE scaling | $10K-1M | Labs, well-funded startups |
| Post-Training (SFT/RL) | Teaches smart context management behaviors | $500-1200 | Anyone with a few GPUs |
| Agent Scaffolding | System-level context management | $0 (engineering time) | Anyone building agents |
4. Honest Cost Analysis
Everyone talks about long context. Nobody talks about what it actually costs. Here's the full picture.
Cost of Training with Long Context
Attention is O(n²). This means training costs scale quadratically with context length:
| Context Length | Cost per Training Sample (relative to 4K) | What You Need |
|---|---|---|
| 4K | 1x (baseline) | Standard GPU setup |
| 32K | 64x | Flash Attention, multi-GPU |
| 128K | 1,024x | Multi-node, Ring Attention |
| 1M | 62,500x | Frontier lab compute budget |
Cost of Inference with Long Context
Using Claude/GPT-4 API pricing as reference:
| Context Used | Cost per Turn (approximate) | 30-Turn Session Cost |
|---|---|---|
| 4K | ~$0.01 | ~$0.30 |
| 32K | ~$0.08 | ~$2.40 |
| 100K | ~$0.25 | ~$7.50 |
| 200K | ~$0.50 | ~$15.00 |
| 1M | ~$2.50 | ~$75.00 |
At 1M context, a 30-turn coding session costs $75. Most of those tokens are noise the model barely attends to. A smart agent at 32K would cost $2.40 and probably produce better results.
Cost of Self-Hosted Inference
The Cost-Performance Sweet Spot
5. What Reasoning Over Long Context Actually Means
With the mechanics and costs established, let's address what everyone is actually chasing: making models reason across long context, not just hold it.
Long-Horizon Tasks: The Motivation
Long-horizon tasks require many sequential steps, decisions, or intermediate sub-goals before reaching a final outcome. "Reasoning over" them means the model must plan, track state, and make coherent decisions across that entire chain.
| Domain | Short-Horizon | Long-Horizon |
|---|---|---|
| Coding | "Fix this typo" | "Build a REST API with auth, DB, tests, and deploy it" |
| Math | "What is 2+3?" | "Prove this theorem using 15 intermediate lemmas" |
| Agents | "Search the web" | "Research a topic, synthesize findings, write a report, iterate on feedback" |
Why It's Hard for LLMs
- Error compounding: A small mistake in step 3 of 20 derails everything downstream.
- State tracking: The model must remember what it has done, what remains, and what intermediate results it produced.
- Planning under uncertainty: Early decisions constrain later options.
- Credit assignment: When the final answer is wrong, it's hard to identify which step failed.
The Difference Between "Holds" and "Reasons Across"
This is the crux. Many models support 128K tokens. The question is what happens to information at various positions:
What "Reasoning Across" Looks Like in Code
Cross-File Causal Tracing
# File A (in context at position ~2K)
def get_user(id):
return db.query(User).filter(User.user_id == id).first() # returns None if not found
# File B (in context at position ~15K)
def login(request):
user = get_user(request.id)
token = generate_token(user.email) # ← crashes: NoneType has no attribute 'email'
The model must connect a None return in one file to an unguarded attribute access
in another file. These two code fragments might be 13,000 tokens apart in the
context. The model's attention mechanism must literally assign high attention weight from the
user.email token to the return ... .first() token across that gap.
This is what "lost in the middle" kills. A weaker model sees File B, generates
a fix like if user is None: return error, which is correct but shallow. A
model reasoning across full context also notices that get_user should probably raise
UserNotFoundError instead of returning None, because 6 other call sites
(also in context from earlier grep results) all assume it returns a valid user.
Pattern Recognition Across the Codebase
After reading 8-10 files, the model should recognize:
- "This codebase uses the repository pattern"
- "Errors are custom exceptions caught by middleware, not return codes"
- "Every endpoint has a corresponding Pydantic schema in
schemas/" - "Tests use factory_boy fixtures, not raw object creation"
This is not stored in any single file. It emerges from reasoning across the aggregate context. A model that can't do this generates code that is locally correct but stylistically alien to the codebase.
Multi-Step Plan Coherence
By step 5, the model is writing test code that must be simultaneously consistent with
decisions made in steps 1-4. If the model "forgets" that it used account_id
in step 1 and writes user_id in the step 5 test, the entire chain breaks.
Why This Is So Hard (The Non-Obvious Part)
The real difficulty isn't memory. It's relevance detection at scale.
At 100K tokens of context, there might be 500 function definitions, 200 class attributes, 50 config values, and 100 test assertions. When generating one line of code, maybe 3-4 of those are relevant. The model must:
- Not attend to the 796 irrelevant items (noise suppression)
- Strongly attend to the 4 relevant items (signal detection)
- Do this for every token it generates
How Models Address Long-Horizon Reasoning
- Chain-of-thought (CoT): Explicit step-by-step reasoning reduces the per-step difficulty.
- Tree/graph search (ToT, GoT): Explore multiple reasoning paths, backtrack when stuck.
- Reinforcement learning (GRPO, PPO): Train models to get reward for correct final answers, forcing them to learn robust multi-step strategies. This is what DeepSeek-R1 and OpenAI's o1/o3 do.
- Process reward models (PRMs): Give reward at each step, not just the final answer, so the model learns which intermediate steps are good.
- Decomposition: Break the long task into sub-tasks, solve each, then compose.
How Agentic Tools Put It All Together
No vector database. No knowledge graph. No external memory. Just a flat conversation where the model is simultaneously the retriever, the planner, and the reasoner.
At message 11, the model sees ALL of messages 1-10. "Long context reasoning" = at this point, attention reaches back across every previous message simultaneously to produce a coherent fix.
The Model IS the Retriever
No embedding approximation. No similarity threshold. The model reasons about code dependencies and retrieves based on understanding, not pattern matching. And the same model choosing what to retrieve will use what it retrieves.
Iterative Context Building
Each turn refines understanding. By turn 5, the model has a complete causal chain in context AND has been building its mental model incrementally. This is fundamentally better than dumping 5 files cold. The order of discovery helps the model organize information.
Automatic Context Compression
This mirrors what attention does naturally (recent = high attention, old = low attention), but makes it explicit and honest instead of pretending the model attends well to 180K tokens.
6. Where the Field Is Heading
The endgame is not "2M context." It's an agent that navigates a million-file codebase using a 50K context window, because it knows exactly where to look and what to remember. The next breakthrough won't be bigger windows. It'll be same quality at 64K with 10x less cost and 5x less latency.
Key Concepts Reference
| Concept | One-Line Summary |
|---|---|
| RoPE (Rotary Position Embedding) | Encodes position as rotation angle; works within training range, degrades outside it |
| YaRN / NTK scaling | Scale RoPE frequency base instead of positions; better local resolution, still needs fine-tuning |
| O(n²) attention cost | Why doubling context = 4x compute, making 1M context 25x costlier than 200K |
| KV cache | Stored Key/Value vectors per token per layer; 327 KB/token for 70B model, main memory bottleneck |
| GQA (Grouped Query Attention) | Share KV heads across query heads to reduce KV cache size (e.g., 64 query → 8 KV heads) |
| Flash Attention | Exact full attention with O(n) memory via tiled computation. No quality loss. Not sparse attention. |
| Sparse Attention | Local + global + random patterns. O(n) cost but drops some connections. Quality loss. |
| Lost in the middle | Models attend to start/end of context but lose information in the middle |
| Effective vs. claimed context | Architectural support ≠ actual retrieval/reasoning ability at that length |
| Training distribution mismatch | Model never practiced attending to position 50K if trained on 4K docs |
| Context trilemma | Long + Quality + Efficient: pick at most two |
| Long-horizon reasoning | Planning and executing across many dependent steps without error compounding |
| Cross-file causal tracing | Connecting cause in file A to effect in file B across large token distances |
| Holds vs. reasons across | Storing tokens ≠ actively using them during generation |
| Relevance detection at scale | Picking the 4 relevant items out of 800 at every generation step |
| Process reward models | Rewarding each reasoning step, not just the final answer |
| Surgical retrieval | Reading only the relevant function/lines instead of entire files |
| Context eviction | Proactively discarding stale information to keep context tight |
| Structured memory | Storing extracted facts + line pointers instead of verbatim file content |
| Plan-then-retrieve | Spending tokens thinking about what to read before reading anything |
| Model-as-retriever | The generating model also decides what to retrieve, replacing separate embedding-based search |
| Automatic context compression | Summarize old turns when approaching limits; honest version of attention decay |
| Prompt caching | Cache repeated prefixes server-side to avoid reprocessing the same system prompt every turn |