Context Window

The maximum number of tokens a model can process in a single call, spanning all input (system prompt, conversation history, documents, tool definitions) plus the generated output. Everything outside the context window is invisible to the model.

Input vs. Output Limits

The output cap is a separate, lower limit, but generated tokens still count against the same context window. A model with a 1M-token context and a 64K output limit can read ~936K tokens of input when the full 64K is reserved for the response. Output limits are typically much lower than context limits.
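
The arithmetic above can be sketched as a small budgeting helper. This is an illustration only: the function name is hypothetical, and it assumes decimal units (1M = 1,000,000 and 64K = 64,000), which is what the ~936K figure implies. Real provider limits may be defined differently.

```python
def max_input_tokens(context_window: int, reserved_output: int) -> int:
    """Return the tokens left for input after reserving room for the response.

    Hypothetical helper: assumes input and output share one context window.
    """
    if context_window < 0 or reserved_output < 0:
        raise ValueError("token counts must be non-negative")
    if reserved_output > context_window:
        raise ValueError("reserved output exceeds the context window")
    return context_window - reserved_output


# 1M-token window with the full 64K output limit reserved.
print(max_input_tokens(1_000_000, 64_000))  # 936000
```

Reserving less than the full output limit frees the difference for input, which is why the input figure is a budget rather than a fixed ceiling.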

As of mid-2026:

Model                       Context window                Max output
Claude Sonnet 4.6           1M tokens                     64K tokens
Claude Opus 4 / Sonnet 4    200K tokens                   128K tokens
Gemini 3.1 Pro / Flash      1M tokens                     65K tokens
Llama 4 Maverick            1M tokens
GPT-4o / GPT-4.1            128K tokens
GPT-5.4                     272K (standard) / 1M (API)
DeepSeek R1                 128K tokens

How Context Has Grown

  • GPT-3 (2020): ~2K tokens
  • GPT-4 (2023): 8K / 32K variants
  • Claude 2 (2023): 100K tokens — first mass-market long-context model
  • Gemini 1.5 Pro (2024): 1M tokens
  • Claude Sonnet 4.6 (March 2026): 1M tokens at standard pricing

The jump from 8K to 1M is a roughly 125× increase in a three-year window, driven by improvements in attention efficiency (sparse attention, FlashAttention), positional encoding (RoPE, ALiBi), and infrastructure for KV cache management.

Practical Implications

  • Whole codebases: at 1M tokens (~750K words), many small and mid-sized repositories fit in a single context; very large monorepos usually still do not.
  • Long documents: a 50-page report (~25K words) fits comfortably in 200K; a book in 1M.
  • Multi-turn memory: conversation history can be retained across dozens of turns without summarization hacks.
  • RAG vs. long context: for some retrieval tasks, stuffing the full document into context now competes with RAG pipelines; it is simpler but more expensive per call.
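
The word counts in the list above can be turned into a rough fit check. This is a minimal sketch: the 0.75 words-per-token ratio is only the rule of thumb implied by "1M tokens (~750K words)", the function names are hypothetical, and a real tokenizer should be used when an exact count matters.

```python
import math

# Assumption: ~0.75 English words per token, taken from the
# "1M tokens (~750K words)" rule of thumb. Code and non-English
# text often tokenize less efficiently.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate from a word count (rounded up)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits_in_context(word_count: int, context_window: int,
                    reserved_output: int = 0) -> bool:
    """True if the estimated tokens fit after reserving output space."""
    return estimate_tokens(word_count) <= context_window - reserved_output


# A 50-page report (~25K words) against a 200K window with 64K reserved.
print(estimate_tokens(25_000))                   # 33334
print(fits_in_context(25_000, 200_000, 64_000))  # True
```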

Lost in the Middle

Larger context does not mean uniform comprehension across it. Research (Liu et al., 2024) documents a U-shaped performance curve: models use content near the start and end of the context more reliably, with accuracy dropping for information buried in the middle. On multi-document QA tasks, accuracy fell 30%+ when the relevant document moved from position 1 to position 10 out of 20. Positional encoding decay in RoPE-based architectures is one proposed explanation, but the cause is not settled.

Mitigation strategies: place critical information at the start or end of context; use RAG to retrieve only the relevant chunks rather than loading everything; evaluate retrieval accuracy specifically with middle-position test cases.
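
Two of these mitigations can be sketched in a few lines. The first function reorders chunks that are already sorted by relevance so the strongest ones sit at the edges of the context; the second builds one test case per insertion position so middle positions are evaluated explicitly. Both function names are hypothetical, and the alternating placement is just one reasonable ordering, not a method taken from the cited research.

```python
def order_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end.

    Input is sorted most-relevant first. Ranks alternate between the
    front and the back, so the least relevant chunks land in the middle.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def position_sweep(distractors: list[str], target: str) -> list[list[str]]:
    """Build one document ordering per possible position of the target."""
    cases = []
    for position in range(len(distractors) + 1):
        docs = list(distractors)
        docs.insert(position, target)
        cases.append(docs)
    return cases


print(order_for_edges(["a", "b", "c", "d", "e"]))  # ['a', 'c', 'e', 'd', 'b']
```

Running the same question against every ordering from position_sweep and plotting accuracy by position shows whether a given model exhibits the U-shape on a given task.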

Cost Implications

Input cost scales directly with the number of tokens sent. Sending a 500K-token document on every call is expensive. Prompt caching mitigates this for repeated calls: cache the document prefix and pay ~10% of the normal input cost on cache reads. The break-even is roughly 2 requests within the TTL window.
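
The break-even claim can be checked with a small cost model. This is a sketch under stated assumptions: the ~10% read rate comes from the text above, while the 1.25× cache-write premium is an assumed value that varies by provider, and every name here is hypothetical. Costs are expressed in units of "one uncached pass over the prefix", so no real prices are needed.

```python
from typing import Optional


def relative_cost_with_cache(n_requests: int,
                             write_multiplier: float = 1.25,  # assumption
                             read_multiplier: float = 0.10) -> float:
    """Cost of n requests with caching, in units of one uncached pass.

    The first request writes the cache; the rest read from it. Assumes
    all requests land inside the cache TTL.
    """
    if n_requests <= 0:
        return 0.0
    return write_multiplier + (n_requests - 1) * read_multiplier


def break_even_requests(write_multiplier: float = 1.25,
                        read_multiplier: float = 0.10,
                        max_requests: int = 1000) -> Optional[int]:
    """Smallest request count at which caching is cheaper than not caching."""
    for n in range(1, max_requests + 1):
        uncached = float(n)  # n full-price passes over the prefix
        cached = relative_cost_with_cache(n, write_multiplier, read_multiplier)
        if cached < uncached:
            return n
    return None


print(break_even_requests())  # 2
```

Under these assumed multipliers a single request costs more with caching (1.25 vs. 1.0), and two requests cost about 1.35 vs. 2.0, which matches the "roughly 2 requests" figure. A higher write premium pushes the break-even later.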

Output tokens are billed at a higher rate than input tokens across all major providers, so a 64K output ceiling also sets the worst-case output cost per call, which is worth budgeting for in generation-heavy workloads.

Related

prompt-caching · rag · extended-thinking · agentic-workflows

Sources