How to Implement Prompt Caching on Amazon Bedrock to Reduce Costs
By Nino, Senior Tech Editor
In the world of Generative AI, the biggest hurdle for scaling production applications isn't just model performance—it's the compounding cost of inference. If you are running a multi-turn support agent or a complex RAG (Retrieval-Augmented Generation) system, you are likely paying to process the same static instructions over and over again. This is where Amazon Bedrock's Prompt Caching comes in, offering a way to cut input token costs by up to 90%.
When building high-performance AI applications, developers often look for the most efficient API routes. While n1n.ai provides a streamlined way to access top-tier models with low latency, understanding how to optimize specific provider features like Bedrock's caching is essential for enterprise-grade deployments.
The Problem: Redundant Token Processing
Imagine a customer support bot. Every time a user sends a message, the API request includes:
- System Prompt: Your agent's persona and safety rules (~200 tokens).
- Product Documentation: The context required to answer questions (~2,000+ tokens).
- Conversation History: Previous turns between the user and AI.
In a standard setup, the model re-processes the entire system prompt and documentation on every single turn. If your context is 2,000 tokens and the conversation lasts 5 turns, you've paid for 10,000 tokens of static content that never changed. This 'context tax' makes long-context models like Claude 3.5 Sonnet or DeepSeek-V3 expensive to run at scale.
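The arithmetic behind this 'context tax' is simple to sketch. A minimal illustration, using the hypothetical token counts from the example above:

```python
def static_tokens_paid(static_tokens: int, turns: int) -> int:
    """Total input tokens billed for static context that is re-sent on every turn."""
    return static_tokens * turns

# 2,000 static tokens re-processed across a 5-turn conversation
print(static_tokens_paid(2_000, 5))  # 10000
```

Every one of those 10,000 tokens is billed at full input price, even though the content never changed after turn one.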
How Prompt Caching Works on Bedrock
Prompt Caching allows you to define a cachePoint within your request. Amazon Bedrock then stores the prefix (everything before that point) in a high-speed cache. On subsequent calls, if the prefix matches byte-for-byte, Bedrock reads from the cache instead of recalculating the hidden states.
Key Economics:
- Cache Read: Billed at ~10% of the standard input price.
- Cache Write: Billed at a ~25% premium over standard input (one-time setup cost).
- Minimums: Requires at least 1,024 tokens for Nova models and 2,048 for Claude Haiku.
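These rates make the break-even point easy to compute. A sketch using the approximate figures above (~25% write premium on turn one, ~10% read rate afterwards); actual prices vary by model:

```python
def relative_input_cost(turns: int, cached: bool,
                        write_premium: float = 1.25,
                        read_rate: float = 0.10) -> float:
    """Cost of re-sending the static prefix over `turns` calls,
    in units of one full-price uncached pass."""
    if not cached:
        return float(turns)  # full input price on every turn
    # One cache write, then cache reads for the remaining turns
    return write_premium + read_rate * (turns - 1)

print(relative_input_cost(5, cached=False))  # 5.0
print(relative_input_cost(5, cached=True))   # 1.65
```

Caching already pays for itself on the second turn (1.35 units vs 2.0), and the gap widens with every turn after that.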
Implementation Guide: Step-by-Step
1. Prerequisites
Ensure your environment is ready. You will need the latest boto3 library (version >= 1.35.76) and access to the US East (N. Virginia) region where Nova models are widely available.
```bash
pip install --upgrade boto3
```
2. The Baseline (No Caching)
Before optimizing, we establish a baseline using the converse API. This is the standard way to interact with models on Bedrock or through aggregators like n1n.ai.
```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "amazon.nova-pro-v1:0"

# A large static system prompt
SYSTEM_CONTENT = "You are an expert assistant... [2,000 tokens of docs]"

def ask_baseline(question, history):
    messages = history + [{"role": "user", "content": [{"text": question}]}]
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": SYSTEM_CONTENT}],
        messages=messages,
    )
    return response
```
3. Enabling Prompt Caching
To enable caching, we simply modify the system parameter to include a cachePoint. This tells Bedrock exactly where the static content ends.
```python
def ask_cached(question, history):
    messages = history + [{"role": "user", "content": [{"text": question}]}]
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": SYSTEM_CONTENT},
            {"cachePoint": {"type": "default"}},  # The magic line
        ],
        messages=messages,
    )
    # Check usage metadata
    usage = response["usage"]
    print(f"Cache Read: {usage.get('cacheReadInputTokens', 0)}")
    print(f"Cache Write: {usage.get('cacheWriteInputTokens', 0)}")
    return response
```
Benchmarking the Savings
I ran a 5-turn simulation comparing three Amazon Nova models. The results show a dramatic shift in cost-per-conversation.
| Model | Baseline Cost (30 Days) | Cached Cost (30 Days) | Savings |
|---|---|---|---|
| Nova Pro | $334.61 | $169.99 | 49% |
| Nova Lite | $30.33 | $18.41 | 39% |
| Nova Micro | $16.99 | $9.47 | 44% |
Note: Estimates based on 1,000 conversations per day with a 2,069-token system prompt.
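The savings column can be reproduced directly from the dollar figures in the table:

```python
def savings_pct(baseline: float, cached: float) -> int:
    """Percentage saved by caching, rounded to the nearest whole percent."""
    return round(100 * (baseline - cached) / baseline)

# Figures from the benchmark table above
for model, base, cached in [("Nova Pro", 334.61, 169.99),
                            ("Nova Lite", 30.33, 18.41),
                            ("Nova Micro", 16.99, 9.47)]:
    print(model, f"{savings_pct(base, cached)}%")
```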
Advanced Strategy: Multi-Point Caching
You aren't limited to just one cache point. In complex agentic workflows using LangChain or tools, you can place up to 4 cache points. A common pattern is:
- Point 1: After the system identity.
- Point 2: After the tool/function definitions.
- Point 3: After the core RAG context.
This ensures that even if you change your tools, the system prompt remains cached, and vice versa.
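The three-point pattern above can be sketched as a Converse request body. The tool name and block contents here are hypothetical placeholders; only the `cachePoint` placement follows the Converse API shape:

```python
CACHE_POINT = {"cachePoint": {"type": "default"}}

request = {
    "modelId": "amazon.nova-pro-v1:0",
    "system": [
        {"text": "You are a support agent for Acme Corp."},  # system identity
        CACHE_POINT,                                         # Point 1
        {"text": "[retrieved product docs]"},                # core RAG context
        CACHE_POINT,                                         # Point 3
    ],
    "toolConfig": {
        "tools": [
            {"toolSpec": {"name": "lookup_order",            # hypothetical tool
                          "inputSchema": {"json": {"type": "object"}}}},
            CACHE_POINT,                                     # Point 2
        ],
    },
}

print(request["system"].count(CACHE_POINT))            # 2
print(request["toolConfig"]["tools"].count(CACHE_POINT))  # 1
```

Because each segment is cached independently, editing the RAG context only invalidates the prefix from Point 1 onward; the system identity stays cached.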
Monitoring Cache Hits with CloudWatch
In production, you must monitor your cache hit rate. If your cached prefix changes by even one character (e.g., a dynamic timestamp injected into the system prompt), the cache will miss, and every call pays full input price plus the ~25% write premium instead of the ~10% read rate.
```python
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_metrics(usage):
    read_tokens = usage.get("cacheReadInputTokens", 0)
    write_tokens = usage.get("cacheWriteInputTokens", 0)
    # Push both counters to a custom CloudWatch namespace
    cloudwatch.put_metric_data(
        Namespace="Bedrock/PromptCache",
        MetricData=[{"MetricName": "CacheReadTokens", "Value": read_tokens, "Unit": "Count"},
                    {"MetricName": "CacheWriteTokens", "Value": write_tokens, "Unit": "Count"}])
```
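From those same usage records you can compute the hit rate itself. A minimal sketch over the `usage` dictionaries returned by `converse` (the 2,069-token figure matches the benchmark prompt above):

```python
def cache_hit_rate(usage_records) -> float:
    """Fraction of cache-eligible input tokens served from cache."""
    reads = sum(u.get("cacheReadInputTokens", 0) for u in usage_records)
    writes = sum(u.get("cacheWriteInputTokens", 0) for u in usage_records)
    total = reads + writes
    return reads / total if total else 0.0

# A healthy 5-turn conversation: one cache write, then four reads
records = [{"cacheWriteInputTokens": 2069}] + [{"cacheReadInputTokens": 2069}] * 4
print(f"{cache_hit_rate(records):.0%}")  # 80%
```

A hit rate that trends toward zero is the telltale sign that something dynamic has crept into your cached prefix.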
Why Model Selection Matters
While caching is powerful, the biggest savings come from combining caching with the right model tier. Switching from Nova Pro (no cache) to Nova Micro (with cache) resulted in a 97% cost reduction in my tests. For high-volume tasks like classification or simple extraction, the 'Micro' tier plus caching is an unbeatable combination.
For developers who need high-speed access to these models without managing complex AWS infrastructure, n1n.ai offers a unified API that simplifies the deployment of these optimized workflows.
Conclusion
Prompt caching is no longer optional for production-grade LLM applications. By adding a single cachePoint to your requests, you can cut input-token costs by 39-49% in the benchmarks above, savings you can keep as margin or pass on to your users.
Key Takeaways:
- Use caching for any conversation exceeding 2 turns.
- Ensure your cached prefix is 100% static.
- Monitor hit rates to avoid 'Cache Miss' penalties.
Get a free API key at n1n.ai.