Context Is All You Need

TL;DR

  • The context window is the total working memory of an LLM session: system prompt, conversation history, your message, and the response all count toward it.
  • Self-Attention scales quadratically with sequence length: double the context, quadruple the cost. This is the core constraint.
  • Positional Encoding (RoPE, ALiBi) tells the model where each token sits. Extending it beyond training length requires techniques like YaRN.
  • The KV-Cache stores intermediate results during generation and grows linearly with context. At 200K tokens, it can consume ~61 GiB of VRAM for a single request.
  • Longer context windows don’t automatically mean better reasoning. Models struggle with information in the middle of long contexts (Lost-in-the-Middle effect).

“When context gets too long, clear it and start fresh.” Anyone who has worked with Claude Code knows this. As context grows, quality drops. Instructions get ignored, the agent starts missing the point, or stops following them altogether. Depending on the situation: clear it, compact it, start over.

I’ve had a general understanding of why for a while. Transformer architecture, Self-Attention, KV-Cache. But I never sat down and thought it through properly. What exactly limits context length? Why is extending it so hard? And what does it actually mean when a model claims to support “1 million tokens”?

This article is my attempt to work through those questions.

What Is Context Length?

The context window is the maximum number of tokens a model can process in a single call.

A token is the smallest unit a language model works with, roughly ¾ of a word in English.[1] Common words like “the” are a single token. Rarer or compound words take more. The passage you just read is about 40 tokens.
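The ¾ figure gives a quick way to estimate token counts. A back-of-the-envelope sketch only: real BPE tokenizers vary by model and language, so treat `estimate_tokens` as a hypothetical helper, not a real API.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text using the ~3/4-words-per-token
    heuristic. Real tokenizers will differ; this is only a ballpark."""
    words = len(text.split())
    return round(words / 0.75)  # ~1.33 tokens per word

print(estimate_tokens("The context window is the maximum number of tokens "
                      "a model can process in a single call."))  # -> 23
```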

Context includes everything in the session, not just the document you pasted in. System instructions, the full conversation history, your current message, the response being generated, and if you’re using tool use, the tool definitions and tool results as well. It’s the total working memory of the session.

What counts toward your context window:

  System Prompt
+ Conversation History
+ Your Message
+ Tool Definitions & Results
+ Model's Response
= Total Tokens Used
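The accounting above can be sketched as a tiny budget check. The component names mirror the list; all counts are invented for illustration and are not real API fields.

```python
# Hypothetical token budget for one call against a 200K-token model.
CONTEXT_LIMIT = 200_000

usage = {
    "system_prompt": 1_200,
    "conversation_history": 48_000,
    "your_message": 800,
    "tool_definitions_and_results": 6_500,
    "max_response_tokens": 8_000,  # reserve room for the reply, too
}

total = sum(usage.values())
print(f"{total:,} / {CONTEXT_LIMIT:,} tokens "
      f"({'fits' if total <= CONTEXT_LIMIT else 'over limit'})")
```

Note that the response budget counts as well: if the input already fills the window, there is no room left to generate.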

When you hit the limit, the behavior depends on where you’re using the model. The API returns an error; you have to truncate the input yourself before sending. Chat interfaces handle it differently: some warn you, some silently drop older messages, and Claude.ai now automatically compacts earlier parts of the conversation into a summary rather than discarding them entirely.

Current context windows, for reference:

Model             | Context Length
GPT-4o            | 128K tokens
Claude Sonnet 4.6 | 200K tokens (1M extended)
Claude Opus 4.6   | 200K tokens (1M extended)
Gemini 1.5 Pro    | 2M tokens
LLaMA 3.1         | 128K tokens

200K tokens is roughly one to two novels, depending on length.

Why Is It Hard to Make It Longer?

Every modern LLM is built on the Transformer architecture, published in 2017 primarily by researchers at Google Brain and Google Research.[2] The architecture came out of work on machine translation, specifically to overcome the limitations of the LSTM-based models that dominated at the time.[3] The mechanism that makes Transformers work, Self-Attention, is also the reason scaling context is expensive.

In Self-Attention, every token computes a relevance score against all tokens that came before it (including itself). That’s how the model figures out which words relate to which, even when they’re far apart.

Encoder-Decoder vs. Decoder-Only

The original 2017 Transformer had two parts: an encoder (bidirectional, every token sees every other token) and a decoder (causal, each token only sees preceding tokens). This was designed for translation, where the full input sentence needs to be understood before generating the output.

Modern LLMs (GPT, Claude, LLaMA) are decoder-only: they always attend left-to-right and never look ahead. When this article discusses Self-Attention, it refers to this causal variant. For more on the architectural differences, see The Illustrated Transformer by Jay Alammar.

The quadratic problem

Here’s the core issue. In Self-Attention, every token looks at all preceding tokens. Double the sequence length, and the amount of work doesn’t double. It quadruples. This is quadratic scaling, and the numbers get ugly fast:[4]

Sequence length | Comparisons
4,000 tokens    | 16,000,000
8,000 tokens    | 64,000,000
32,000 tokens   | 1,024,000,000
200,000 tokens  | ~40 billion
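The table is just n² at work; a two-line check makes the double-the-length, quadruple-the-work relationship concrete:

```python
# Pairwise comparisons grow with the square of the sequence length.
for n in (4_000, 8_000, 32_000, 200_000):
    print(f"{n:>7,} tokens -> {n * n:>16,} comparisons")

# Doubling the sequence quadruples the work:
assert (8_000 ** 2) / (4_000 ** 2) == 4
```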

Memory scales the same way. Context windows grew slowly for years: GPT-2 supported 1K tokens, GPT-3 had 2K, ChatGPT launched with 4K, and GPT-4 offered 8K (or 32K in its extended variant). Not because nobody wanted longer ones, but because the quadratic cost made it impractical with available hardware.

Technical deep dive: How Self-Attention actually works

Every token in the sequence asks three questions: “What am I looking for?” (Query), “What do I contain?” (Key), and “What do I offer if someone attends to me?” (Value).

For each token, the model computes a relevance score by comparing its Query against the Key of every preceding token. That score determines how much information flows between them. A token will score high against tokens that are semantically relevant to it, and low against everything else. Future tokens are masked out so the model can’t “look ahead.”

The total number of comparisons still grows with the square of the sequence length. For 200,000 tokens, that means up to 40 billion pairs. The intermediate results of all those comparisons have to be stored in GPU memory while the calculation runs.
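The loop described above can be sketched in plain Python. A toy version with tiny vectors and no batching; real implementations run this as matrix multiplications on a GPU.

```python
import math

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: lists of vectors (lists of floats), one per token.
    Each token attends only to itself and the tokens before it.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Compare this token's Query against the Keys of tokens 0..i.
        # Future tokens are simply never scored: that's the causal mask.
        scores = [sum(qx * kx for qx, kx in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Softmax turns scores into attention weights.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The output is the weighted sum of the attended Values.
        out.append([sum(w * V[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out
```

Token i does i + 1 comparisons, so the total is 1 + 2 + … + n = n(n + 1)/2, which is where the quadratic cost comes from.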

The position problem

There’s a second issue that’s easy to miss: Transformers have no built-in sense of order.

Without extra information, “The dog bit the man” and “The man bit the dog” are identical to the model. Word order doesn’t exist unless you explicitly tell each token where it sits in the sequence. This is called Positional Encoding.

The original 2017 approach used fixed sinusoidal patterns baked in at training time.[5] Fine for the sequences it saw during training. Beyond that maximum length, quality dropped off a cliff. The model had never seen positions past a certain point, so it had no idea what to do with them.

RoPE: how modern models handle it

Most models today use RoPE (Rotary Position Embedding).[6] Rather than adding a fixed position signal to each token upfront, RoPE encodes position as a rotation that gets applied when two tokens compare themselves to each other during attention.

The practical difference: instead of encoding “I am at position 4,217,” a token effectively encodes “I am 83 positions away from you.” Relative distance. That turns out to generalize to longer sequences much better, because the model doesn’t need to have seen position 4,217 during training. It just needs to understand relative gaps, which it has seen.

LLaMA, Mistral, Qwen, Gemma: nearly every major open model uses RoPE.

Technical deep dive: The rotation math behind RoPE

Each token in a Transformer is represented as a vector with d numbers (e.g. d = 128). RoPE groups these into d/2 pairs and treats each pair as a point on a 2D plane. Each pair gets rotated by an angle that depends on the token’s position:

\theta_i = \text{position} \times \text{base}^{-2i/d}

The index i determines how fast each pair rotates. Small i (high frequency) rotates fast and captures short-range relationships. Large i (low frequency) rotates slowly and captures long-range ones.

The key trick: when two tokens interact during attention, their absolute rotations cancel out and only the difference (m − n) remains, which is exactly their relative distance.
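The cancellation can be checked numerically. A minimal sketch with a single 2D pair; `rotate` and `score` are illustrative helpers, not a library API.

```python
import math

def rotate(vec, position, base=10000.0, i=0, d=2):
    """Rotate one 2D pair of a token vector by its position-dependent
    angle: theta_i = position * base**(-2*i/d)."""
    theta = position * base ** (-2 * i / d)
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def score(q, k):
    """Dot product of one query/key pair, as used in attention."""
    return q[0] * k[0] + q[1] * k[1]

q, k = (1.0, 0.5), (0.3, 0.8)

# Same relative distance (3 positions apart), different absolute positions:
a = score(rotate(q, 10), rotate(k, 7))
b = score(rotate(q, 4210), rotate(k, 4207))
assert abs(a - b) < 1e-9  # absolute rotations cancel; only m - n matters
```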

For a detailed walkthrough, see RoPE: A Detailed Guide.

ALiBi: the alternative approach

ALiBi (Attention with Linear Biases) takes a different route entirely.[7] Instead of adding position vectors, it directly penalizes the attention score based on distance:

\text{Attention Score} = Q \cdot K^T - m \cdot |pos_q - pos_k|

Each attention head gets its own slope m, so different heads penalize distance differently. The further apart two tokens are, the lower their attention score. This extrapolates well beyond training length because “further away = less relevant” is a useful default.
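A minimal sketch of the bias matrix for one head. The geometric slope sequence for 8 heads (1/2 down to 1/256) follows the ALiBi paper; the function itself is illustrative.

```python
def alibi_bias(n_tokens, slope):
    """Distance penalty added to attention scores for one head.
    bias[i][j] = -slope * (i - j) for j <= i; future positions are
    simply never scored (causal masking)."""
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(n_tokens)]

# With 8 heads the ALiBi paper uses slopes 1/2, 1/4, ..., 1/256:
slopes = [2 ** -(i + 1) for i in range(8)]

bias = alibi_bias(4, slopes[0])
# The further back a token is, the larger its penalty:
assert bias[3] == [-1.5, -1.0, -0.5, -0.0]
```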

ALiBi is used by BLOOM and MPT, but has largely fallen out of favor for new models. RoPE and ALiBi are conceptually incompatible: one uses rotation, the other score penalties. Combining them doesn’t yield clear benefits.

Comparing positional encoding approaches

                           | Sinusoidal                | RoPE                          | ALiBi
Encoding                   | Absolute                  | Relative (rotation)           | Score penalty (no embedding)
Extrapolation              | Poor, hard ceiling        | Medium (good with extensions) | Good by design
Fine-tuning for longer ctx | Retrain                   | Often sufficient              | Rarely needed
Used by                    | Original Transformer (2017), GPT-2/3 (learned variant) | LLaMA, Mistral, Gemma, Qwen | BLOOM, MPT

YaRN: extending context without starting over

Even with RoPE, you can’t just take a model trained on 32K tokens and tell it to handle 128K. It has learned attention patterns only for the distances it saw during training. Push past that and quality degrades.

YaRN (Yet another RoPE extensioN, 2023)[8] is how you extend a model’s context without retraining from scratch. The insight is that RoPE’s different frequency dimensions aren’t equally broken when you try to extend context. High-frequency dimensions (the ones that handle short-range relationships) are already fine. Low-frequency dimensions, which handle long-range relationships, are the ones that need to be stretched.

So YaRN stretches only those:

Frequency | Strategy                | Effect
High      | Unchanged (extrapolate) | Short-range stays sharp
Medium    | Blend                   | Smooth transition
Low       | Interpolate             | Long-range gets stretched

Think of a rubber band with markings: stretch it, and the markings spread apart but stay in the same order. Short distances stay readable, long distances just have more space between them.

The result: context extension at a fraction of the original training cost, with no quality loss on short contexts. The YaRN paper reports achieving this with roughly 10× fewer tokens and 2.5× fewer training steps compared to previous extension methods.
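The per-dimension treatment can be sketched as a scaling function over RoPE’s inverse frequencies. This follows the spirit of the YaRN paper’s “NTK-by-parts” interpolation, but the thresholds `alpha` and `beta` here are invented for illustration and the function is a sketch, not the paper’s exact formulation.

```python
import math

def yarn_scale_inv_freq(inv_freq, scale, orig_ctx, alpha=1.0, beta=32.0):
    """Blend interpolation and extrapolation per RoPE dimension.

    inv_freq: per-dimension rotation speeds (the theta_i values).
    scale:    context extension factor (e.g. 4 for 32K -> 128K).

    A dimension that completes many rotations over the original context
    (high frequency) is left unchanged; one that completes less than a
    full rotation (low frequency) is interpolated, i.e. slowed by
    `scale`; in between, the two are blended.
    """
    out = []
    for f in inv_freq:
        rotations = orig_ctx * f / (2 * math.pi)  # turns over trained context
        if rotations >= beta:        # high frequency: keep as-is
            gamma = 1.0
        elif rotations <= alpha:     # low frequency: fully interpolate
            gamma = 0.0
        else:                        # medium: smooth ramp between the two
            gamma = (rotations - alpha) / (beta - alpha)
        out.append(gamma * f + (1 - gamma) * f / scale)
    return out

inv_freq = [10000.0 ** (-2 * i / 128) for i in range(64)]
scaled = yarn_scale_inv_freq(inv_freq, scale=4, orig_ctx=32_768)
assert scaled[0] == inv_freq[0]                    # short-range untouched
assert abs(scaled[-1] - inv_freq[-1] / 4) < 1e-12  # long-range slowed 4x
```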

But here’s what’s important to understand: positional encoding only solves the “where am I” problem. The model also needs to have learned what to do with distant tokens. Real long-range dependencies only come through training on long sequences. A model that has only ever seen 8K token sequences doesn’t develop useful attention patterns for reasoning across 100K tokens, no matter how you adjust the position encoding.

This is why context extension isn’t just a positional encoding trick. It requires continued training on long sequences to teach the model meaningful long-range attention patterns. No new layers are added: the existing Query, Key, and Value matrices in every Transformer block get fine-tuned to handle distances they’ve never seen before.

Different models take different approaches. Most combine a positional encoding fix (adjusting RoPE frequencies) with progressive training: gradually increasing the sequence length during fine-tuning so the model learns long-range attention patterns step by step rather than all at once.

Model            | Base | Context Extension
LLaMA 3.1 (128K) | RoPE | Adjusted base frequency (θ = 500K) + progressive training
Qwen2.5 (1M)     | RoPE | DCA[9] + YaRN scaling + progressive training
CodeLlama (100K) | RoPE | Adjusted base frequency (θ = 1M)
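To see why raising the base frequency helps: the slowest RoPE dimension’s wavelength, meaning the number of positions before its rotation repeats, grows roughly in proportion to θ. A quick illustrative calculation (the helper is a sketch, assuming a 128-dimensional head):

```python
import math

def slowest_wavelength(base, d=128):
    """Positions needed for the slowest RoPE dimension to complete one
    full rotation: 2*pi / theta_(d/2 - 1), with theta_i = base**(-2i/d)."""
    slowest_inv_freq = base ** (-2 * (d // 2 - 1) / d)
    return 2 * math.pi / slowest_inv_freq

for base in (10_000, 500_000, 1_000_000):
    print(f"base {base:>9,} -> ~{slowest_wavelength(base):>12,.0f} positions")
```

With the default base of 10,000 the slowest dimension repeats after a few tens of thousands of positions; raising θ to 500K or 1M pushes that repeat point into the millions, so distant tokens remain distinguishable.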

The memory problem: KV-Cache

There’s a third issue, separate from attention and position: what happens during generation.

A model doesn’t produce a whole response at once. It generates one token at a time, each time running attention over the entire context so far. Without optimization, that means recomputing everything from scratch for every single new token.

Without KV-Cache:

Step    | Computation               | Note
Token 1 | Compute K,V for [1]       | —
Token 2 | Compute K,V for [1, 2]    | Token 1 recomputed
Token 3 | Compute K,V for [1, 2, 3] | Tokens 1+2 recomputed

The KV-Cache fixes this. The Key and Value vectors for each token (the things needed for attention) get computed once and stored. Every subsequent token just reads from the cache.

With KV-Cache:

Step    | Computation
Token 1 | Compute K,V for [1], store
Token 2 | Cache hit [1] + compute only K,V for [2]
Token 3 | Cache hit [1,2] + compute only K,V for [3]
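The difference in work can be counted in a few lines. A counting toy, not a real model:

```python
def kv_without_cache(n):
    """Recompute K,V for the whole prefix at every generation step."""
    work = 0
    for step in range(1, n + 1):
        work += step              # K,V computed for tokens 1..step
    return work

def kv_with_cache(n):
    """Compute each token's K,V exactly once, then reuse it."""
    return n                      # one K,V computation per new token

assert kv_without_cache(3) == 6   # 1 + 2 + 3, as in the tables above
assert kv_with_cache(3) == 3
```

Without the cache, the work is 1 + 2 + … + n = n(n + 1)/2, quadratic again; with it, generation does linear work at the cost of linear memory.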

The problem: this cache has to live in VRAM, the fast memory on a GPU. And it grows linearly with context length. For a large model handling 200K tokens, the cache alone can consume tens of gigabytes for a single request.[10]

This is the real reason long-context usage is expensive. It’s not just a training problem. It’s an infrastructure problem. Every active conversation at full context is sitting on a pile of GPU memory.

Technical deep dive: KV-Cache size & optimization

In practice, Transformers don’t run a single attention computation. They use Multi-Head Attention: multiple independent attention “heads” running in parallel, each learning to focus on different types of relationships. The n_heads term in the formula below refers to how many of these heads store their own Key and Value vectors.

The cache size follows a straightforward formula:

\text{KV-Cache} = 2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes}

For LLaMA 3 70B at float16 with 200K tokens and GQA (8 KV heads instead of 64): roughly 61 GiB for a single request. Without GQA, this would be ~488 GiB.
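Plugging the footnoted LLaMA 3 70B numbers into the formula:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """2 (K and V) x layers x KV heads x head dim x sequence length x bytes."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

GiB = 2 ** 30

# LLaMA 3 70B: 80 layers, 8 KV heads (GQA), head dim 128, float16 (2 bytes)
with_gqa = kv_cache_bytes(80, 8, 128, 200_000) / GiB
no_gqa = kv_cache_bytes(80, 64, 128, 200_000) / GiB  # 64 heads, no sharing

print(f"with GQA: {with_gqa:.0f} GiB, without: {no_gqa:.0f} GiB")
# -> with GQA: 61 GiB, without: 488 GiB
```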

Three main techniques reduce KV-Cache size:

  • Grouped Query Attention (GQA), used in LLaMA 3, shares Key and Value vectors across multiple query heads, cutting cache size by 4-8x. LLaMA 3 70B uses 64 query heads but only 8 KV heads (the 61 GiB figure above already reflects this).
  • Quantization (int8 instead of float16) halves memory further.
  • Paged Attention (used in vLLM) manages cache memory in pages like an OS, avoiding fragmentation and enabling efficient multi-request batching.

The Quality Gap

So context windows are getting larger. But does longer actually mean better?

Not really, no. A model extended to 1M tokens via fine-tuning may handle everyday conversations perfectly well, but struggle to reason coherently across a truly massive document. The window is available. Training a model to actually use it well is a separate, harder problem.

Lost in the Middle

There’s a well-documented effect at play here. Models tend to attend better to recent tokens (recency bias) and to the very beginning of a context (primacy bias). Information placed in the middle gets the least attention; this is known as the Lost-in-the-Middle effect.[11]

I noticed this before I understood why: instructions set at the beginning of a long session would gradually stop being followed. The model seemed to forget them. In practice, even the beginning degrades as context grows. What starts as primacy fades when early tokens become distant enough.

Needle in a Haystack

The standard way to test whether a model actually uses its full context is the needle-in-a-haystack benchmark: hide a specific fact at different positions in a long document, then ask the model to retrieve it.[12] Results typically show a clear gap between the advertised context length and the length at which recall actually stays reliable.

A model that supports 1M tokens doesn’t necessarily reason well across 1M tokens. The context window tells you what fits. It doesn’t tell you what the model can effectively use.

Where This Is Going

Context windows will keep growing. The engineering problems (quadratic attention cost, positional encoding limits, VRAM pressure) have each seen real solutions in the last few years: RoPE extensions like YaRN handle position, and GQA with Paged Attention tackle the cache.

But the harder problem remains: closing the gap between “context window is technically X” and “model actually reasons well across X.” That’s a training problem as much as an architecture problem. Today’s models can store a million tokens in context. Whether they can attend to all of them with equal quality, especially the ones buried in the middle, is a different question entirely.

What seems clear is that raw context length is becoming less of a differentiator. The real competition is shifting to effective context: how well a model retrieves, connects, and reasons across everything it’s been given. A 200K-token model that uses its full context reliably may be more useful than a 1M-token model that loses track of things past the halfway mark.

Practical Takeaways

I’ve actually disabled the 1M context window in my own Claude Code setup for exactly this reason: a shorter, more focused context produces better results in practice than a massive one where quality degrades. With this setting in place, the 1M model variants no longer appear in the model picker.

// ~/.claude/settings.json
{
  "env": {
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1"
  }
}

For anyone building on top of these models: treat context as a finite resource. Structure your prompts so critical information lands where the model attends best: at the beginning and end. Use retrieval-augmented approaches for anything that doesn’t need to live in the context window permanently. And test your specific use case at your actual context lengths, because the benchmark numbers won’t tell you the whole story.

The question is no longer how many tokens fit into a context window, but how many of them the model can actually use well. The engineering to make windows larger is progressing steadily. The harder challenge is making models reason reliably across all of that context, not just the parts at the edges.

Footnotes

  1. Tokenization varies by language and model. German and Finnish compound words often tokenize less efficiently than English. The ¾ figure is a rough average for English text with standard BPE tokenization.

  2. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. arxiv.org/abs/1706.03762. The team also included Aidan Gomez (University of Toronto) and Illia Polosukhin.

  3. LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network that processes tokens sequentially, one at a time. This made parallelization difficult and caused the vanishing gradient problem: information from early tokens faded over long sequences. The Transformer replaced this with parallel Self-Attention, at the cost of quadratic memory scaling.

  4. Formally: attention computation is O(n^2 \cdot d) in time and O(n^2) in memory, where n is sequence length and d is the model dimension. Vaswani et al. (2017), Section 3.5, Table 1.

  5. Sinusoidal position encoding represents each position as a vector of sine and cosine values at different frequencies. The intuition: each frequency oscillates at a different rate, so combinations of frequencies uniquely identify any position up to the maximum trained length, similar to how a clock uses multiple hands at different speeds to represent time. The original paper also experimented with learned positional embeddings and found nearly identical results, but chose the fixed sinusoidal version for its potential to extrapolate.

  6. Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arxiv.org/abs/2104.09864

  7. Press, O. et al. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. ICLR. arxiv.org/abs/2108.12409

  8. Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models. arxiv.org/abs/2309.00071

  9. DCA (Dual Chunk Attention) splits long sequences into fixed-size chunks and remaps positions within each chunk to smaller values. This lets the model handle sequences far beyond its training length without the positional encoding breaking down. An, C. et al. (2024). Training-Free Long-Context Scaling of Large Language Models. arxiv.org/abs/2402.17463

  10. Based on LLaMA 3 70B architecture (80 layers, 8 KV heads via GQA, head dimension 128) at float16: 2 × 80 × 8 × 128 × 200,000 × 2 bytes ≈ 61 GiB. Older models without Grouped Query Attention would be significantly higher.

  11. Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172

  12. The “needle in a haystack” evaluation was popularized by Greg Kamradt (2023) as a practical stress test for long-context models. See: github.com/gkamradt/LLMTest_NeedleInAHaystack


If you'd like to leave a comment, write me an email at comment+context-is-all-you-need@arangino.app.