Prompt Caching for Claude: Cut Your API Bill 60% in Production

Prompt caching is the most underused cost-optimization tool for Claude API workloads in 2026. Implemented well, it reduces input token costs by 60-90% on typical production workloads. Implemented poorly, it does nothing — or silently makes latency worse.

This guide walks through how prompt caching actually works, the five patterns that dominate in production, and concrete numbers from the workloads we have tuned.

How Prompt Caching Actually Works

Claude's prompt caching lets you mark portions of your prompt as cacheable. The first call with that prefix pays full price to write the cache. Subsequent calls within a TTL window pay a cache-read price — about 10% of the normal input cost.

Two TTL options:

TTL	Cache write cost	Cache read cost	Use case
5 minutes	1.25x base input price	0.1x base input price	Short conversations, rapid iteration
1 hour	2x base input price	0.1x base input price	Long sessions, system prompts, RAG contexts

The economics work if you have enough reads to amortize the write premium. Rule of thumb: 3+ reads within the TTL for 5-minute cache, 5+ reads for 1-hour cache.

The cache is keyed on the exact bytes of the cached portion plus the model version. Any change — a single whitespace edit, a timestamp, a user name — invalidates the cache and forces a re-write.

The Five Patterns That Work

Pattern 1: Large System Prompts

You have a 4,000-token system prompt with persona, instructions, and examples. Without caching, every request pays for those 4,000 tokens at $3/million (Sonnet) or $15/million (Opus) — $0.012 per request at Sonnet rates.

With caching, after the first request the system prompt costs 0.1x — $0.0012 per request. At 10,000 requests per day, that is $108 saved per day on Sonnet. On Opus the savings are 5x larger.

Implementation:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral", "ttl": "1h"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)

This is the easiest win. If your system prompt is over 1,000 tokens and you make more than a few requests per hour, you should be caching it.

Pattern 2: RAG Context Caching

In a RAG application, retrieved context is often 5,000-30,000 tokens per request. Within a conversation, the context often does not change between turns.

Cache the retrieved context:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"<context>{retrieved_documents}</context>",
                "cache_control": {"type": "ephemeral", "ttl": "1h"}
            },
            {
                "type": "text",
                "text": user_query
            }
        ]
    }
]

Each follow-up question in the same conversation reads the cached context for 0.1x the cost. For a 10-turn conversation over a 30,000-token context:

Without caching: 300,000 input tokens × $3/M = $0.90
With caching (1 write + 9 reads): 30,000 + 9 × 3,000 = 57,000 effective = $0.175

A 5x cost reduction on a realistic workload.

Pattern 3: Conversation History Caching

In long conversations, the history grows. Without caching, you pay for the entire history on every turn. With caching, you cache the prefix of the conversation.

cached_messages = conversation_history[:-1]  # everything except the latest turn
cached_messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral", "ttl": "5m"}

messages = cached_messages + [current_user_turn]

For long conversational agents (coding assistants, customer support, writing collaborators) this halves the input cost compared to naïve implementations.

Pattern 4: Tool Definitions Caching

Agents with many tools spend significant tokens on tool definitions. A typical MCP-connected agent has 3,000-8,000 tokens of tool schemas on every request.

Cache them:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tool_definitions,  # tools are cacheable
    system=[
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
    ],
    messages=[...]
)

Claude caches tool definitions automatically when you include a cache_control block in the system. The cost savings for agent workloads are typically 40-70%.

Pattern 5: Few-Shot Examples Caching

If your prompt includes a large block of few-shot examples, cache it. Examples are the most cacheable content you have — they rarely change.

system=[
    {"type": "text", "text": short_instructions},
    {
        "type": "text",
        "text": large_few_shot_block,
        "cache_control": {"type": "ephemeral", "ttl": "1h"}
    }
]

The smart buy

Why pay $228/year when $69 works?

Lifetime Starter: one payment, no renewals. Covered by 30-day money-back guarantee.

See the math

The Production Numbers

Three real workloads we tuned, with before/after:

Workload A: Customer support chatbot (Sonnet 4.6)

50,000 requests/day
3,200-token system prompt + 15,000-token knowledge base context + 600-token user message
Before caching: $8,820/month
After caching: $3,105/month
Savings: 65%

Workload B: Code review agent (Opus 4.6)

800 PRs/month
1,800-token system prompt + 14,000-token PR diff + 5,000-token codebase context + 7,000 tokens of tool definitions
Before caching: $2,190/month
After caching: $642/month
Savings: 71%

Workload C: Research assistant (Sonnet 4.6)

10,000 sessions/month, 4.5 turns average
Accumulating conversation history reaching 40,000 tokens by end of session
Before caching: $4,140/month
After caching: $1,650/month
Savings: 60%

Across these three workloads the average savings was 65%. The effort to implement was roughly 2-4 hours per workload.

Anti-Patterns

Five things that kill cache hit rate without you noticing.

Anti-pattern 1: Timestamps in cached content.

"Current time: 2026-04-17T14:32:15Z" in your system prompt invalidates the cache on every request. Move timestamps out of the cached prefix or truncate to the day.

Anti-pattern 2: User-specific content in the prefix.

Putting "You are helping {user.name} who works at {user.company}" in the cached system prefix means every user gets a cache miss. Move user-specific content to the user message or split into a per-user cache with longer TTL.

Anti-pattern 3: Frequent whitespace changes.

Your prompt builder strips trailing whitespace inconsistently. Normalize aggressively or you will have 80% cache misses for no reason.

Anti-pattern 4: Model version migrations.

When Claude Opus 4.6 ships, your Opus 4.5 cache becomes useless. Cache writes cost 1.25-2x, so a bad migration plan creates a cost spike. Plan cache warmup alongside model upgrades.

Anti-pattern 5: Caching short prefixes.

Minimum cacheable prefix is 1,024 tokens (Sonnet) or 2,048 tokens (Opus). If your cache block is smaller, caching does nothing and you pay the overhead. Check the response for cache_creation_input_tokens — zero means you did not actually cache.

Measuring Cache Effectiveness

Every Claude API response includes usage metadata:

usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
print(f"Output tokens: {usage.output_tokens}")

Your cache hit rate is cache_read_input_tokens / (cache_read_input_tokens + input_tokens). Below 60% on a production workload means there is room to optimize.

A useful rule of thumb: if you are not reviewing cache hit rate in a dashboard at least monthly, you are almost certainly leaving money on the table.

Five-Minute vs One-Hour TTL

The five-minute TTL is cheaper to write (1.25x vs 2x) but expires faster. The one-hour TTL has higher write cost but amortizes better over long-lived content.

Guidelines:

User-interactive chat: 5-minute TTL, refreshed on each message
Long-lived system prompts: 1-hour TTL
RAG contexts in long sessions: 1-hour TTL
High-frequency agent loops: 5-minute TTL
Tool definitions that change rarely: 1-hour TTL

For most production workloads, one-hour TTL on the prefix plus five-minute TTL on the conversation history is the pattern that works.

The Bigger Opportunity

Prompt caching is one cost-optimization lever. Two others that compound:

Model routing. Route simple tasks to Haiku 4.5 ($0.25/M input), complex tasks to Opus. Same workload can cost 30x less depending on routing.
Response streaming with short max_tokens. Do not budget 4,000 output tokens for a task that needs 400. You are not charged for unused tokens, but budget affects scheduling and concurrency.

Stack all three levers — aggressive prompt caching, intelligent model routing, tight output budgets — and typical production AI workloads come in at 20-30% of the unoptimized cost. For a team spending $30K/month on Claude, that is $60K-100K/year of savings with a few weeks of optimization work.

Where to Start This Week

Three actions that are worth doing this week even if you cannot do a full optimization pass:

Add prompt caching to your largest system prompt. This is the easiest single win.
Instrument cache hit rate in your observability dashboard.
Audit your prompts for the timestamps/user-context/whitespace anti-patterns.

These three moves take under a day and typically recover 30-50% of your API cost on their own.

AI Magicx applies aggressive prompt caching across all workloads so your AI content costs scale sub-linearly with usage. Start free.

Prompt Caching for Claude: Cut Your API Bill 60% in Production

Prompt Caching for Claude: Cut Your API Bill 60% in Production

How Prompt Caching Actually Works

The Five Patterns That Work

Pattern 1: Large System Prompts

Pattern 2: RAG Context Caching

Pattern 3: Conversation History Caching

Pattern 4: Tool Definitions Caching

Pattern 5: Few-Shot Examples Caching

The Production Numbers

Anti-Patterns

Measuring Cache Effectiveness

Five-Minute vs One-Hour TTL

The Bigger Opportunity

Where to Start This Week

Why pay $228/year when $69 works?

Related Articles

Why Smart Businesses Are Moving AI Off the Cloud in 2026: The Privacy, Cost, and Speed Case for On-Device AI

The AI Data Center Power Crunch: What the Electricity Bottleneck Means for Model Pricing in 2026

The LLM Pricing Collapse of 2026: How to Build When Models Cost Almost Nothing