Prompt Caching
Prompt Caching
A mechanism for reusing static prompt prefixes across API calls, avoiding re-processing the same tokens on every request. Most valuable when system prompts, documents, or tool definitions are large and repeated across many calls.
How It Works
Mark a content block with cache_control: {"type": "ephemeral"} to designate it as a cache breakpoint. On the first request, those tokens are processed and stored. On subsequent requests within the TTL window, the model reads from cache instead of re-processing.
Two approaches:
- Automatic caching — add
cache_controlat the top level of the request; the system attaches the breakpoint to the last cacheable block and advances it as the conversation grows. - Explicit breakpoints — place
cache_controldirectly on individual content blocks for fine-grained control.
system = [
{
"type": "text",
"text": "[large document or system prompt]",
"cache_control": {"type": "ephemeral"}
}
]
Cache Lifetime
- Default TTL: 5 minutes. Resets on each cache hit at no cost.
- Extended TTL: 1 hour, available by passing
"ttl": "1h"in thecache_controlobject. Billed at a higher write rate.
Cost Model
Three pricing tiers (relative to base input token price):
| Token type | Cost multiplier |
|---|---|
| Normal input tokens | 1.0× |
| Cache write tokens (5 min) | 1.25× |
| Cache write tokens (1 hour) | 2.0× |
| Cache read tokens | 0.1× |
Cache reads cost 10% of normal input tokens — a 90% saving on the cached prefix. The break-even point is roughly 2 requests within the TTL window (the write premium is recovered on the first read).
Minimum Cacheable Length
Prompts shorter than the minimum threshold are not cached even if marked with cache_control. Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens in the response to verify cache activity.
- Claude Opus 4.8, Sonnet 4.6, Sonnet 4.5, Haiku 4.5: 1,024 tokens
- Claude Opus 4.7, 4.6, 4.5: 4,096 tokens
Supported Models
All active Claude models support prompt caching: Claude Opus 4.8 / 4.7 / 4.6 / 4.5 / 4.1, Sonnet 4.6 / 4.5, Haiku 4.5. Available on the Anthropic API, AWS Bedrock, Vertex AI, and Microsoft Foundry.
When to Use
- Long system prompts or detailed instructions repeated across calls
- Large documents or knowledge bases queried multiple times
- Extensive tool or function definitions in agentic-workflows
- Few-shot example sets (20+ examples) that don't change per request
- Multi-turn conversations where growing history is re-sent each turn
What Breaks the Cache
Any change to the content before the cache breakpoint invalidates it. Also invalidated by changes to: tools, system, web_search, citations, speed, or image content. Exact token-level match is required.
Related
agentic-workflows · tool-use · extended-thinking · context-window