← Back to Blog

Context in LLMs: What Determines It, What It Costs, and What Actually Works

Feb 6th, 2026 · 30 min read


1. What Determines Context Length

When people say a model "supports 128K context," that sounds like a single setting you can crank up. It's not. Context length is actually limited by three things at once: how the model tracks token position, how expensive attention gets as sequences grow, and how much memory you need to store past tokens during inference. Each one puts a ceiling on how far you can go.

Constraint 1: Positional Encoding Scheme

Transformers have no built-in sense of order. Self-attention is permutation-invariant, meaning it treats "the cat sat" the same as "sat cat the" unless you explicitly tell it which token came first. Positional encodings inject that order. The scheme you pick dictates the ceiling.

Scheme How It Works Limit
Learned Absolute (original GPT) Learns an embedding vector for positions 0..N-1 Hard ceiling at N. Position 4097 literally doesn't exist.
Sinusoidal (original Transformer) Fixed sin/cos at different frequencies Theoretically infinite, but untested positions degrade
RoPE (LLaMA, Qwen, most modern LLMs) Encodes position as rotation angle in embedding space Theoretically extrapolable, but rotations at unseen angles are out-of-distribution
ALiBi (BLOOM) Adds linear bias to attention scores based on distance Better extrapolation, but penalizes long-range attention by design

RoPE: The Dominant Approach (and Its Limitation)

Most modern models use Rotary Position Embedding (RoPE). For each dimension pair (i), the rotation angle at position m is:

θ_i(m) = m × base^(-2i/d) where base = 10000 (typically), d = head dimension

The problem: If you train on positions 0 to 4096, the model has only ever seen rotation angles in that range. Position 8000 produces rotations the model has never encountered. It's like teaching someone a 12-hour clock then asking them to read a 24-hour clock. The mechanism is the same but the values are foreign. This is a distribution shift, and the model's output degrades unpredictably.

Constraint 2: Attention's Quadratic Cost

Self-attention computes pairwise scores between every token pair:

Attention(Q, K, V) = softmax(QK^T / √d_k) V QK^T matrix is n × n where n = sequence length Memory: O(n²) Compute: O(n² × d)
From → To Token Increase Attention Cost Increase
4K → 8K 2x 4x
4K → 32K 8x 64x
4K → 128K 32x 1,024x
4K → 1M 250x 62,500x

This is the fundamental reason you can't "just make context longer." The cost grows with the square of the length.

Constraint 3: KV Cache at Inference

At inference, you store Key and Value vectors for every past token, every layer. For a 70B model with GQA (8 KV heads):

Per token KV cost: = num_layers × 2(K,V) × num_kv_heads × head_dim × 2 bytes(fp16) = 80 × 2 × 8 × 128 × 2 ≈ 327 KB per token At different context lengths: 4K context: → 1.3 GB (manageable) 32K context: → 10.5 GB (one GPU) 128K context: → 42 GB (needs multiple GPUs JUST for KV cache) 1M context: → 327 GB (impossible on current hardware for one request)
Important: This is per request. A serving system handling 100 concurrent users at 128K context needs 4.2 TB just for KV cache, before any model weights.

Why You Can't "Just Extend" the Context

Problem 1: Training Distribution Mismatch (the deepest reason)

The model learns attention patterns from training data. If most training documents are 2-8K tokens:

Even if you architecturally support 128K, the model hasn't learned when and how to attend across that distance. This is a learned behavior problem, not an architecture problem.

Problem 2: Lost in the Middle (Liu et al., 2023)

Even models with "long context" show a U-shaped retrieval curve:

Retrieval accuracy by position of relevant info: Beginning: ████████████████████ 95% Middle: ████████ 40% ← catastrophic drop End: ███████████████████ 90%

Softmax attention distributes probability mass. Over long sequences, the middle tokens get starved because they don't benefit from primacy or recency bias.

Problem 3: Effective vs. Claimed Context

A model "supporting" 128K and effectively using 128K are different things:

Claimed: 128K context Needle retrieval at 128K: ~60% Needle retrieval at 4K: ~99%

The spec sheet number is the architectural limit. The effective limit is much lower.

The Fundamental Trilemma

Long Context /\ / \ / \ / \ Quality -------- Efficiency

Pick at most two:


2. Strategies for Extending and Managing Context

There are two fundamentally different approaches: make the window bigger (architectural) or make the model smarter within the window it has (behavioral). Both have their place.

Architectural Strategies: Making the Window Bigger

Technique How It Works Trade-off
Position Interpolation Scale positions down: pos' = pos × (L_train / L_target). So position 8000 → 4000 (within training range). Reduces positional resolution. Nearby tokens look more similar. Fine distinctions blur.
NTK-Aware Scaling Scale the RoPE frequency base instead of positions: base' = base × scale_factor Better than naive interpolation. Preserves local resolution. Still needs fine-tuning on longer data.
YaRN Different scaling for different frequency dimensions. High-frequency (local patterns): minimal scaling. Low-frequency (global patterns): aggressive scaling. Plus temperature scaling on attention logits. Best extrapolation quality. Most complex to implement. Still needs some continued training.
ALiBi Linear distance bias on attention scores. No learned positions at all. Good extrapolation, but linearly penalizes distance, limiting long-range attention by design.
Sliding Window (Mistral) Each token only attends to last W tokens (e.g., W=4096). O(n×W) instead of O(n²). But literally cannot retrieve beyond window. Stacking layers creates indirect receptive field, but it's lossy.
Ring Attention Split sequence across GPUs, pass KV in a ring. Solves memory. Does NOT solve the learning problem. Model still must learn long-range patterns.
Sparse Attention (Longformer, BigBird) Local + global + random attention patterns. O(n) cost. Some token pairs never directly attend to each other, so information is lost. Replaced in practice by Flash Attention.
Flash Attention Exact same math as standard attention, but tiled computation fits in GPU SRAM. O(n) memory, O(n²) compute. No quality loss. Solves the memory problem. Does NOT reduce compute. This is what everyone actually uses.
GQA (Grouped Query Attention) Share KV heads across query heads (e.g., 64 query heads → 8 KV heads). Reduces KV cache by 8x. Minimal quality loss. Standard in all modern models (Llama 3, Qwen, Mistral).
Key distinction: Flash Attention is NOT sparse attention. Sparse attention drops connections (quality loss). Flash Attention computes exact full attention but tiles the computation for memory efficiency (no quality loss). Don't confuse them.

Behavioral Strategies: Making the Model Smarter Within Its Window

Instead of making the window bigger, teach the model to use its existing window more intelligently. These are agent-level strategies.

Strategy 1: Surgical Retrieval. Read functions, not files

Dumb: Read all of user_service.py (2000 lines) → ~8K tokens Smart: Read lines 145-178 (the one relevant function) → ~200 tokens 40x reduction

The agent already knows (from the stack trace or grep) which function matters. Reading the whole file dumps 1800 lines of noise into context that competes for attention with the 30 lines that matter.

Strategy 2: Context Eviction. Proactively discard stale information

Turn 3: Read models/user.py to understand schema → 800 tokens in context Turn 5: Already used that info to write the fix Current: Keep the full file verbatim forever → 800 tokens wasted Smart: Compress to "User model: id, email_address, created_at" → 20 tokens 40x reduction

After you've acted on information, you rarely need it verbatim. A summary preserves the decision-relevant facts while freeing context for new information.

Strategy 3: Structured Memory. Notes over verbatim copies

Current (raw context): Full file stored in conversation history → 2000 tokens per file Degrades as it drifts further from attention window Optimized (structured scratchpad): { "file": "models/user.py", "facts": ["email_address: str", "uses SQLAlchemy", "has created_at"], "re_read_lines": "145-150" ← pointer to re-read if needed } → 30 tokens, and exact lines are one tool call away

Strategy 4: Plan Before Reading. Avoid wrong leads entirely

Dumb: "auth broken" → reads auth.py, user.py, session.py, middleware.py, config.py, database.py → discovers only auth.py + user.py mattered Cost: 6 file reads × ~2K = 12K tokens, 4 were wasted Smart: "auth broken" → reads stack trace (200 tokens) → thinks "crash is in login() calling get_user(), need those two files only" → reads 2 files Cost: 2 file reads × ~2K = 4K tokens, 0 wasted

200 tokens of planning saves 8K tokens of unnecessary reads.

Strategy 5: Compaction as a Tool

Compaction doesn't have to be an external system decision. You can give the agent a compact() tool and let it learn when to use it through reinforcement learning:

Available tools: search(query) → search the web calculator(expr) → compute math execute_code(code) → run Python compact(context) → summarize working memory, free up context space extract(context, key) → pull specific info from long text remember(key, value) → save a fact to persistent memory

The model learns through trial and error: compact too early and you lose critical details. Compact too late and context fills up with noise. Extract key facts before compacting and you get the best of both worlds. This is trainable with RLVR.

The Core Principle

Unlimited context → lazy agent → dump everything → hope attention finds signal (it often doesn't) Tight context → disciplined agent → precise retrieval → every token earns its place (attention concentrated on signal)
Key Insight: A 100K agent that's selective will outperform a 1M agent that's sloppy because:

3. At Which Stage Can You Do This?

LLM development has four stages. Each one offers different knobs for improving context, at very different cost levels. Understanding which stage to operate at is critical.

Stage 1: Pretraining → set context capacity, learn attention patterns Stage 2: Mid-Training → extend context with RoPE scaling Stage 3: Post-Training (SFT/RL) → teach smart context management behaviors Stage 4: Agent Scaffolding → system-level context management

Stage 1: Pretraining

What you can do:

Who does this: Only frontier labs (OpenAI, Anthropic, Google, Meta, DeepSeek). This is where the fundamental context capacity is set. Everything downstream is constrained by what happens here.

What it costs: Tens of millions to billions of dollars. Training on 128K sequences costs 1,024x more per sample than 4K (quadratic attention). You need thousands of GPUs for weeks.

What you get: A model that can physically attend to 128K-200K tokens and has actually practiced using long-range attention patterns during training. This is what makes Claude, GPT-4, and Gemini good at long context. They spent enormous compute on long-context pretraining.

Realistic for individuals? No. You use someone else's pretrained model.

Stage 2: Mid-Training (Context Extension)

What you can do:

Real examples:

Llama 2 (4K) → Code Llama (100K) via RoPE scaling + continued training Mistral (8K) → Mistral-128K via YaRN + fine-tuning Qwen (32K) → extended versions via NTK-aware scaling

What it costs: Still expensive. You need long-context training data (books, repos, long docs) and significant GPU time. Context extension works well up to ~4x the original length. 32K → 128K is fine. 32K → 1M is sketchy.

Extension Quality Estimated Cost
4K → 16K (4x) Good ~$10K-50K (days on a small cluster)
4K → 32K (8x) Good ~$50K-200K
4K → 128K (32x) Acceptable ~$200K-1M
4K → 1M (250x) Degraded ~$1M+ (and quality is questionable)
Realistic for individuals? Only at the small end (3B model, 4x extension). Most people use models that were already extended by the original lab.

Stage 3: Post-Training (SFT and RL)

What you can do:

This is where you teach the model to be smart within its window. The context capacity is already set by pretraining. Post-training teaches the model behavioral strategies:

What SFT teaches: "Here's what a good trajectory looks like when context gets long. See how the agent compacts at step 5? Copy that." What RLVR teaches: "Did you solve the task? No? You ran out of context because you didn't compact. Another rollout where you compacted early succeeded. Do more of that."

RLVR is particularly powerful here because optimal context management is task-dependent. Sometimes you should compact aggressively. Sometimes you need every detail. The model learns the strategy through trial and error, not imitation.

What it costs:

SFT on long-context trajectories: ~$100-500 (few hours on 4x A100) RLVR for context management: ~$300-700 (part of the full RLVR budget) Total: ~$500-1200
Realistic for individuals? Yes. This is the sweet spot. You take an existing model with a 32K window and teach it to be intelligent about using that window. No architecture changes, no massive compute, just better behavior.

Stage 4: Agent Scaffolding

What you can do:

This is what Claude Code, Cursor, and Codex actually do. The model itself doesn't change. The system around the model manages context intelligently.

What Claude Code does behind the scenes: 1. Prompt caching: System prompt (2K tokens) cached server-side. Not reprocessed every turn. 2. Selective reads: Agent uses Grep/Glob to find relevant files, then reads only those. Not the whole codebase. 3. Auto compaction: When context approaches 200K, older turns get summarized into ~3K token summaries. 4. KV caching: Cached prefixes avoid recomputing attention for the stable part of the conversation.

What it costs: Engineering time only. No GPU costs beyond normal inference.

Realistic for individuals? Absolutely. Anyone building an agent can implement these strategies. This is the cheapest and most accessible approach.

Summary: The Four Stages

Stage What It Does Cost Who Does It
Pretraining Sets context capacity, trains long-range attention $10M-1B+ Frontier labs only
Mid-Training Extends context window with RoPE scaling $10K-1M Labs, well-funded startups
Post-Training (SFT/RL) Teaches smart context management behaviors $500-1200 Anyone with a few GPUs
Agent Scaffolding System-level context management $0 (engineering time) Anyone building agents

4. Honest Cost Analysis

Everyone talks about long context. Nobody talks about what it actually costs. Here's the full picture.

Cost of Training with Long Context

Attention is O(n²). This means training costs scale quadratically with context length:

Context Length Cost per Training Sample (relative to 4K) What You Need
4K 1x (baseline) Standard GPU setup
32K 64x Flash Attention, multi-GPU
128K 1,024x Multi-node, Ring Attention
1M 62,500x Frontier lab compute budget

Cost of Inference with Long Context

Using Claude/GPT-4 API pricing as reference:

Context Used Cost per Turn (approximate) 30-Turn Session Cost
4K ~$0.01 ~$0.30
32K ~$0.08 ~$2.40
100K ~$0.25 ~$7.50
200K ~$0.50 ~$15.00
1M ~$2.50 ~$75.00

At 1M context, a 30-turn coding session costs $75. Most of those tokens are noise the model barely attends to. A smart agent at 32K would cost $2.40 and probably produce better results.

Cost of Self-Hosted Inference

Self-hosted Qwen 3B on A100 ($1.10/hr): 4K context: ~2700 tasks/hour → $0.0004/task 32K context: ~500 tasks/hour → $0.002/task Self-hosted 70B on 4x A100 ($4.40/hr): 4K context: ~200 tasks/hour → $0.022/task 32K context: ~50 tasks/hour → $0.088/task 128K context: ~10 tasks/hour → $0.440/task

The Cost-Performance Sweet Spot

Approach Cost Quality Bigger model + bigger context (brute) $$$$$ Diminishing returns past 100K Same model + smarter agent behavior $$ Often better than brute force Small model + RLVR + smart context mgmt $ Best value for production
The honest conclusion: Past 100K tokens, you're paying exponentially more for linearly diminishing returns. The economically rational approach is to invest in smarter context management (post-training + agent scaffolding) rather than bigger context windows. A 32K agent that compacts well beats a 200K agent that wastes context.

5. What Reasoning Over Long Context Actually Means

With the mechanics and costs established, let's address what everyone is actually chasing: making models reason across long context, not just hold it.

Long-Horizon Tasks: The Motivation

Long-horizon tasks require many sequential steps, decisions, or intermediate sub-goals before reaching a final outcome. "Reasoning over" them means the model must plan, track state, and make coherent decisions across that entire chain.

Domain Short-Horizon Long-Horizon
Coding "Fix this typo" "Build a REST API with auth, DB, tests, and deploy it"
Math "What is 2+3?" "Prove this theorem using 15 intermediate lemmas"
Agents "Search the web" "Research a topic, synthesize findings, write a report, iterate on feedback"

Why It's Hard for LLMs

  1. Error compounding: A small mistake in step 3 of 20 derails everything downstream.
  2. State tracking: The model must remember what it has done, what remains, and what intermediate results it produced.
  3. Planning under uncertainty: Early decisions constrain later options.
  4. Credit assignment: When the final answer is wrong, it's hard to identify which step failed.

The Difference Between "Holds" and "Reasons Across"

This is the crux. Many models support 128K tokens. The question is what happens to information at various positions:

Model that HOLDS long context: ✓ Can regurgitate text from position 60K if asked "what was in file X?" ✗ Does NOT spontaneously use info from position 60K when generating code at position 120K Model that REASONS ACROSS long context: ✓ While generating code at position 120K, attention heads actively pull information from position 60K because it's relevant ✓ Does this without being explicitly told "look at file X" ✓ Does this even when the connection is implicit (same variable name, compatible type signature, related business logic)

What "Reasoning Across" Looks Like in Code

Cross-File Causal Tracing

# File A (in context at position ~2K)
def get_user(id):
    return db.query(User).filter(User.user_id == id).first()  # returns None if not found

# File B (in context at position ~15K)
def login(request):
    user = get_user(request.id)
    token = generate_token(user.email)  # ← crashes: NoneType has no attribute 'email'

The model must connect a None return in one file to an unguarded attribute access in another file. These two code fragments might be 13,000 tokens apart in the context. The model's attention mechanism must literally assign high attention weight from the user.email token to the return ... .first() token across that gap.

This is what "lost in the middle" kills. A weaker model sees File B, generates a fix like if user is None: return error, which is correct but shallow. A model reasoning across full context also notices that get_user should probably raise UserNotFoundError instead of returning None, because 6 other call sites (also in context from earlier grep results) all assume it returns a valid user.

Pattern Recognition Across the Codebase

After reading 8-10 files, the model should recognize:

This is not stored in any single file. It emerges from reasoning across the aggregate context. A model that can't do this generates code that is locally correct but stylistically alien to the codebase.

Multi-Step Plan Coherence

Step 1: Add new column to User model → produces migration Step 2: Update UserSchema to include field → must match the column type from step 1 Step 3: Update create_user service → must use the schema from step 2 Step 4: Update API endpoint → must match the service signature from step 3 Step 5: Add test → must test the endpoint from step 4 with the schema from step 2 using the model from step 1

By step 5, the model is writing test code that must be simultaneously consistent with decisions made in steps 1-4. If the model "forgets" that it used account_id in step 1 and writes user_id in the step 5 test, the entire chain breaks.

Why This Is So Hard (The Non-Obvious Part)

The real difficulty isn't memory. It's relevance detection at scale.

At 100K tokens of context, there might be 500 function definitions, 200 class attributes, 50 config values, and 100 test assertions. When generating one line of code, maybe 3-4 of those are relevant. The model must:

  1. Not attend to the 796 irrelevant items (noise suppression)
  2. Strongly attend to the 4 relevant items (signal detection)
  3. Do this for every token it generates
Generating: token = generate_token(user.?????) To decide this next token, the model must: - Recall User model definition (position 8K): field is called "email_address" not "email" - Recall generate_token signature (position 22K): first param is type str - Recall project convention (positions 5K, 12K, 31K): always uses model attributes directly - Ignore 99.5% of everything else in context → Output: "email_address"

How Models Address Long-Horizon Reasoning

Key Insight: Supervised fine-tuning alone struggles with long-horizon reasoning because it only imitates. RL lets the model discover better strategies through exploration. The gap between models that can chain 5 steps vs. 50 steps reliably is exactly the frontier of "reasoning models" right now.

How Agentic Tools Put It All Together

No vector database. No knowledge graph. No external memory. Just a flat conversation where the model is simultaneously the retriever, the planner, and the reasoner.

Msg 1: [system] System prompt + project instructions (~3K tokens) Msg 2: [user] "Fix the auth bug" (~50 tokens) Msg 3: [assistant] Thinking + tool_call: Grep("auth error") (~200 tokens) Msg 4: [tool] Grep results: routes/auth.py:47 (~500 tokens) Msg 5: [assistant] Thinking + tool_call: Read("routes/auth.py") Msg 6: [tool] File contents (~2K tokens) Msg 7: [assistant] Thinking + tool_call: Read("services/user.py") Msg 8: [tool] File contents (~1.5K tokens) Msg 9: [assistant] Thinking + tool_call: Edit(...) Msg 10: [tool] Edit successful Msg 11: [assistant] "Fixed. Here's what I changed..."

At message 11, the model sees ALL of messages 1-10. "Long context reasoning" = at this point, attention reaches back across every previous message simultaneously to produce a coherent fix.

The Model IS the Retriever

RAG retriever: "This text is 0.87 similar to the query" (statistical) Agentic retriever: "The crash is in login() calling get_user(), I need to read where get_user is defined" (causal reasoning)

No embedding approximation. No similarity threshold. The model reasons about code dependencies and retrieves based on understanding, not pattern matching. And the same model choosing what to retrieve will use what it retrieves.

Iterative Context Building

Turn 1: Read error → "login endpoint crashes with NoneType" Turn 2: Read auth.py → "user.email fails because user is None" Turn 3: Grep get_user → "get_user is in services/user.py, returns .first()" Turn 4: Read user.py → ".first() returns None when no match, no error raised" Turn 5: Read tests → "tests expect UserNotFoundError, but it's never raised"

Each turn refines understanding. By turn 5, the model has a complete causal chain in context AND has been building its mental model incrementally. This is fundamentally better than dumping 5 files cold. The order of discovery helps the model organize information.

Automatic Context Compression

Before (approaching limit): Messages 1-30: Full verbatim content (~180K tokens) After compression: Messages 1-15: Summarized by a fast model (~8K tokens) "Investigated auth bug. get_user() returns None instead of raising. Fixed routes/auth.py and services/user.py. User asked to update tests." Messages 16-30: Full verbatim content (~90K tokens) Total: ~98K, back under budget

This mirrors what attention does naturally (recent = high attention, old = low attention), but makes it explicit and honest instead of pretending the model attends well to 180K tokens.


6. Where the Field Is Heading

Phase 1 (2023): "Make context bigger" → brute force, GPT-4 128K, Gemini 1M Phase 2 (2024): "Make context smarter" → better attention training, YaRN, RoPE scaling Phase 3 (2025+): "Make agents leaner" → use less context more effectively ├─ Surgical retrieval (read functions, not files) ├─ Active eviction (discard stale context proactively) ├─ Structured memory (notes > verbatim copies) ├─ Plan-then-retrieve (think before reading) └─ Compaction as a learnable skill (RLVR trains when to compress)

The endgame is not "2M context." It's an agent that navigates a million-file codebase using a 50K context window, because it knows exactly where to look and what to remember. The next breakthrough won't be bigger windows. It'll be same quality at 64K with 10x less cost and 5x less latency.

Bottom Line: Context length is determined by architecture (positional encodings, attention cost, KV cache). It can be extended at different training stages, each at vastly different costs. But the most impactful and accessible approach is to make the model smarter within its existing window, through post-training (SFT/RLVR) and agent-level strategies. A disciplined 32K agent that compacts, extracts, and plans will outperform a sloppy 200K agent on real tasks, at a fraction of the cost.

Key Concepts Reference

Concept One-Line Summary
RoPE (Rotary Position Embedding) Encodes position as rotation angle; works within training range, degrades outside it
YaRN / NTK scaling Scale RoPE frequency base instead of positions; better local resolution, still needs fine-tuning
O(n²) attention cost Why doubling context = 4x compute, making 1M context 25x costlier than 200K
KV cache Stored Key/Value vectors per token per layer; 327 KB/token for 70B model, main memory bottleneck
GQA (Grouped Query Attention) Share KV heads across query heads to reduce KV cache size (e.g., 64 query → 8 KV heads)
Flash Attention Exact full attention with O(n) memory via tiled computation. No quality loss. Not sparse attention.
Sparse Attention Local + global + random patterns. O(n) cost but drops some connections. Quality loss.
Lost in the middle Models attend to start/end of context but lose information in the middle
Effective vs. claimed context Architectural support ≠ actual retrieval/reasoning ability at that length
Training distribution mismatch Model never practiced attending to position 50K if trained on 4K docs
Context trilemma Long + Quality + Efficient: pick at most two
Long-horizon reasoning Planning and executing across many dependent steps without error compounding
Cross-file causal tracing Connecting cause in file A to effect in file B across large token distances
Holds vs. reasons across Storing tokens ≠ actively using them during generation
Relevance detection at scale Picking the 4 relevant items out of 800 at every generation step
Process reward models Rewarding each reasoning step, not just the final answer
Surgical retrieval Reading only the relevant function/lines instead of entire files
Context eviction Proactively discarding stale information to keep context tight
Structured memory Storing extracted facts + line pointers instead of verbatim file content
Plan-then-retrieve Spending tokens thinking about what to read before reading anything
Model-as-retriever The generating model also decides what to retrieve, replacing separate embedding-based search
Automatic context compression Summarize old turns when approaching limits; honest version of attention decay
Prompt caching Cache repeated prefixes server-side to avoid reprocessing the same system prompt every turn