Prompt Caching

2 min · concept, anthropic

Prompt Caching

A mechanism for reusing static prompt prefixes across API calls, avoiding re-processing the same tokens on every request. Most valuable when system prompts, documents, or tool definitions are large and repeated across many calls.

How It Works

Mark a content block with cache_control: {"type": "ephemeral"} to designate it as a cache breakpoint. On the first request, those tokens are processed and stored. On subsequent requests within the TTL window, the model reads from cache instead of re-processing.

Two approaches:

Automatic caching — add cache_control at the top level of the request; the system attaches the breakpoint to the last cacheable block and advances it as the conversation grows.
Explicit breakpoints — place cache_control directly on individual content blocks for fine-grained control.

system = [
    {
        "type": "text",
        "text": "[large document or system prompt]",
        "cache_control": {"type": "ephemeral"}
    }
]

Cache Lifetime

Default TTL: 5 minutes. Resets on each cache hit at no cost.
Extended TTL: 1 hour, available by passing "ttl": "1h" in the cache_control object. Billed at a higher write rate.

Cost Model

Three pricing tiers (relative to base input token price):

Token type	Cost multiplier
Normal input tokens	1.0×
Cache write tokens (5 min)	1.25×
Cache write tokens (1 hour)	2.0×
Cache read tokens	0.1×

Cache reads cost 10% of normal input tokens — a 90% saving on the cached prefix. The break-even point is roughly 2 requests within the TTL window (the write premium is recovered on the first read).

Minimum Cacheable Length

Prompts shorter than the minimum threshold are not cached even if marked with cache_control. Check usage.cache_creation_input_tokens and usage.cache_read_input_tokens in the response to verify cache activity.

Claude Opus 4.8, Sonnet 4.6, Sonnet 4.5, Haiku 4.5: 1,024 tokens
Claude Opus 4.7, 4.6, 4.5: 4,096 tokens

Supported Models

All active Claude models support prompt caching: Claude Opus 4.8 / 4.7 / 4.6 / 4.5 / 4.1, Sonnet 4.6 / 4.5, Haiku 4.5. Available on the Anthropic API, AWS Bedrock, Vertex AI, and Microsoft Foundry.

When to Use

Long system prompts or detailed instructions repeated across calls
Large documents or knowledge bases queried multiple times
Extensive tool or function definitions in agentic-workflows
Few-shot example sets (20+ examples) that don't change per request
Multi-turn conversations where growing history is re-sent each turn

What Breaks the Cache

Any change to the content before the cache breakpoint invalidates it. Also invalidated by changes to: tools, system, web_search, citations, speed, or image content. Exact token-level match is required.

agentic-workflows · tool-use · extended-thinking · context-window

Sources

Prompt caching — Anthropic documentation

linked from

Context Window GPU Clouds Tool Use

Prompt Caching

Prompt Caching

How It Works

Cache Lifetime

Cost Model

Minimum Cacheable Length

Supported Models

When to Use

What Breaks the Cache

Related

Sources