Context Window

The maximum number of tokens a model can process in a single call, spanning all input (system prompt, conversation history, documents, tool definitions) plus the generated output. Everything outside the context window is invisible to the model.

Input vs. Output Limits

Output has its own, lower ceiling, but generated tokens still draw on the same context window as the input. A model with a 1M-token context and a 64K output limit can read ~936K tokens of input when the full 64K is reserved for the response. Output limits are typically much lower than input limits.
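The arithmetic above can be sketched as a small budget helper. This is a minimal illustration, not any provider's API: the function name `max_input_tokens` and the `reserved_output` parameter are hypothetical, and it assumes output tokens count against the same window as input.

```python
from typing import Optional


def max_input_tokens(
    context_window: int,
    max_output: int,
    reserved_output: Optional[int] = None,
) -> int:
    """Return the tokens left for input after reserving room for the response.

    Assumes input and output share one context window (hypothetical helper).
    """
    # Reserve the full output ceiling unless the caller asks for less.
    reserve = max_output if reserved_output is None else min(reserved_output, max_output)
    if reserve > context_window:
        raise ValueError("output reservation exceeds the context window")
    return context_window - reserve


# 1M-token window with the full 64K output reserved
print(max_input_tokens(1_000_000, 64_000))  # 936000
```

Reserving less than the output ceiling (for example, 4K for a short answer) frees the difference for input.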

As of mid-2026:

Model	Context window	Max output
Claude Sonnet 4.6	1M tokens	64K tokens
Claude Opus 4 / Sonnet 4	200K tokens	128K tokens
Gemini 3.1 Pro / Flash	1M tokens	65K tokens
Llama 4 Maverick	1M tokens	—
GPT-4o	128K tokens	—
GPT-4.1	1M tokens	—
GPT-5.4	272K (standard) / 1M (API)	—
DeepSeek R1	128K tokens	—

How Context Has Grown

GPT-3 (2020): ~2K tokens
GPT-4 (2023): 8K / 32K variants
Claude 2 (2023): 100K tokens — first mass-market long-context model
Gemini 1.5 Pro (2024): 1M tokens
Claude Sonnet 4.6 (March 2026): 1M tokens at standard pricing

The jump from 8K (GPT-4, 2023) to 1M is a roughly 125× increase. It was first reached within about a year and became available at standard pricing within three, driven by improvements in attention efficiency (sparse attention, FlashAttention), positional encoding (RoPE, ALiBi), and infrastructure for KV cache management.

Practical Implications

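Whole codebases: at 1M tokens (~750K words of English prose), many sizable repositories fit in a single context. Code often tokenizes less compactly than prose, so measure with the model's tokenizer rather than a word count.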
Long documents: a 50-page report (~25K words) fits comfortably in 200K; a book in 1M.
Multi-turn memory: conversation history can be retained across dozens of turns without summarization hacks.
RAG vs. long context: for some retrieval tasks, placing the full document in context now competes with RAG pipelines. It is simpler but more expensive per call.
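A rough fit check for the sizes quoted above might look like the following. It is a sketch that assumes the ~0.75 words-per-token ratio implied by "1M tokens (~750K words)", which holds only loosely for English prose; `estimate_tokens` and `fits_in_context` are hypothetical helper names, and a real tokenizer should be used for anything near the limit.

```python
def estimate_tokens(word_count: int) -> int:
    """Estimate tokens from words using ~0.75 words per token (4 tokens per 3 words).

    Rule of thumb for English prose only; this is an assumption, not a tokenizer.
    """
    # Integer ceiling of word_count * 4 / 3, avoiding floating-point rounding.
    return -(-word_count * 4 // 3)


def fits_in_context(word_count: int, context_window: int, reserved_output: int = 0) -> bool:
    """Check whether the estimated input fits once output room is reserved."""
    return estimate_tokens(word_count) <= context_window - reserved_output


# A 50-page report of ~25K words against a 200K-token window
print(estimate_tokens(25_000))           # 33334
print(fits_in_context(25_000, 200_000))  # True
```

By the same estimate, a 750K-word text fills a 1M-token window exactly and leaves no room for output, so the "fits in 1M" cases need some headroom.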

Lost in the Middle

Larger context does not mean uniform comprehension across it. Research (Liu et al., 2024) documents a U-shaped performance curve: models use content near the start and end of the context more reliably, with accuracy dropping for information buried in the middle. On multi-document QA tasks, accuracy fell 30%+ when the relevant document moved from position 1 to position 10 out of 20. The cause is not settled; long-range decay in RoPE-style positional encoding is one proposed contributor.

Mitigation strategies: place critical information at the start or end of the context; use RAG to retrieve only the relevant chunks rather than loading everything; evaluate retrieval accuracy specifically with middle-position test cases.
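Two of these mitigations can be sketched in a few lines. The helpers below are hypothetical illustrations, not part of any library: `place_at_edges` assumes the documents arrive sorted from most to least relevant and pushes the weakest ones toward the middle, and `with_needle_at` builds a test context with the relevant document at a chosen position.

```python
from typing import List, Sequence


def place_at_edges(docs_by_relevance: Sequence[str]) -> List[str]:
    """Reorder documents so the most relevant sit at the start and end.

    Input must be sorted most relevant first (assumption of this sketch).
    """
    front: List[str] = []
    back: List[str] = []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]


def with_needle_at(distractors: Sequence[str], needle: str, position: int) -> List[str]:
    """Build an evaluation context with the relevant document at `position`."""
    docs = list(distractors)
    docs.insert(position, needle)
    return docs


print(place_at_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Sweeping `position` from 0 to `len(distractors)` and scoring the answers at each step gives a per-position accuracy curve, which makes a mid-context dip visible if one exists.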

Cost Implications

Input cost scales linearly with the number of tokens sent, so sending a 500K-token document on every call is expensive. Prompt caching mitigates this for repeated calls: cache the document prefix and pay ~10% of the normal input cost on cache reads. The break-even is roughly 2 requests within the TTL window.
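The break-even figure can be checked with a short calculation. This sketch expresses costs relative to one uncached send of the prefix. The 0.10 read multiplier comes from the ~10% figure above; the 1.25 cache-write multiplier is an assumption not stated in this page, so substitute your provider's actual rates. Function names are hypothetical.

```python
from typing import Optional


def relative_cost(
    n_requests: int,
    cached: bool,
    write_multiplier: float = 1.25,  # assumed surcharge for the first, cache-writing call
    read_multiplier: float = 0.10,   # ~10% of normal input cost on cache reads
) -> float:
    """Cost of sending the same prefix n times, in units of one uncached send."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    if not cached:
        return float(n_requests)
    # First call writes the cache; every later call within the TTL reads it.
    return write_multiplier + (n_requests - 1) * read_multiplier


def break_even_requests(
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
    limit: int = 1000,
) -> Optional[int]:
    """Smallest request count at which caching is cheaper, or None if never."""
    for n in range(1, limit + 1):
        if relative_cost(n, True, write_multiplier, read_multiplier) < relative_cost(n, False):
            return n
    return None


print(break_even_requests())  # 2
```

Under these assumed rates a single request costs more with caching (1.25 vs. 1.0), while two requests cost about 1.35 vs. 2.0, which is where the break-even of 2 comes from.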

Output tokens are billed at a higher rate than input tokens across all major providers, so in generation-heavy workloads a response that approaches the 64K output ceiling can dominate the cost of a call.
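To see how output can dominate, the sketch below splits the cost of one call into input and output shares. The prices are purely hypothetical placeholders (3 and 15 currency units per million tokens), chosen only to reflect the higher output rate; `call_cost` is not a real API.

```python
INPUT_PRICE_PER_MTOK = 3.0    # hypothetical price per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 15.0  # hypothetical price per 1M output tokens


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Total cost of one call under the hypothetical prices above."""
    input_cost = input_tokens * INPUT_PRICE_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return input_cost + output_cost


def output_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of the call's cost that comes from output tokens."""
    output_cost = output_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return output_cost / call_cost(input_tokens, output_tokens)


# 100K tokens in, a full 64K tokens out
print(round(call_cost(100_000, 64_000), 2))     # 1.26
print(round(output_share(100_000, 64_000), 2))  # 0.76
```

With these placeholder rates, a response that uses the full 64K output accounts for about three quarters of the bill even though the input is larger in tokens.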

Related

prompt-caching · rag · extended-thinking · agentic-workflows
