By Satyam Namdeo
Large Language Models (LLMs) have become a foundational component for modern AI applications, from developer copilots and documentation assistants to advanced troubleshooting tools. As these applications scale, one challenge quickly becomes apparent: token costs can grow rapidly when large prompts are repeatedly sent to the model.
A common architecture for production AI systems includes long system instructions, tool schemas, retrieved knowledge base documents, and conversation history. These components can easily add thousands of tokens per request. When applications handle thousands or millions of requests per day, repeatedly processing the same static prompt content becomes expensive.
To address this problem, prompt caching has emerged as an essential optimization technique supported by major model providers such as Anthropic and OpenAI.
Prompt caching allows repeated prompt segments to be reused across requests, significantly reducing both latency and cost. In this article, we will explore:
We will also show how prompt caching can be implemented when using models via DigitalOcean.
Prompt caching is a mechanism where large portions of a prompt that remain identical across requests are stored and reused, instead of being reprocessed every time.
Since information like System instructions, tool schemas, guardrails, documentations ,etc rarely changes, repeatedly sending it wastes computation and increases token usage costs. Prompt caching solves this by:
This optimization is especially powerful in production systems where large static prompts are combined with small dynamic queries.
At a high level, prompt caching works by identifying prefix tokens that remain identical across multiple requests.
If a request begins with the same sequence of tokens as a previous request, the model provider can reuse the previously processed representation rather than recomputing it.
The workflow looks like this:
This approach reduces compute work significantly because LLM inference is most expensive when processing large prompts.
Prompt caching provides several important benefits for production AI systems.
Prompt caching can significantly reduce the cost of running LLM applications because tokens reused from earlier requests are billed at a much lower rate than newly processed tokens. For example, in GPT-5, standard input tokens cost about $1.25 per million tokens, while cached input tokens cost only $0.125 per million tokens, making cached tokens around 10× cheaper.
Since cached prompt segments do not need to be recomputed, the model can process requests faster. This improves user experience in interactive applications such as Chat Assistants, Coding Copilots and Documentation Search tools
Applications handling large traffic volumes benefit significantly because caching prevents redundant computation across thousands of requests.
This makes AI systems more economically viable at scale.
Prompt caching is most effective when large prompt segments remain identical across requests. Most AI apps that commonly use this include ChatGPT, Cursor, Perplexity AI, Notion AI
RAG systems retrieve documents and inject them into prompts. If the retrieved documents are reused frequently, caching can significantly reduce token costs.
Typical examples include Knowledge Base Assistants, Documentation search, Internal support chatbots ,etc
Enterprise support assistants often include system instructions, operational playbooks, and technical documentation.
These prompts can exceed several thousand tokens and are ideal for caching.
A common architecture used in production AI systems organizes prompts into static and dynamic sections.
The key idea is simple: Place all large, static prompt components at the beginning of the prompt. This creates a large prefix that can be cached.
The following prompt components typically remain identical across requests:
The following components change per request:

Consider a Kubernetes troubleshooting assistant. Example request structure:
{
"model": "gpt-5",
"input": [
{
"role": "system",
"content": "You are a senior Kubernetes networking engineer..."
},
{
"role": "system",
"content": "TOOLS AVAILABLE:\n1. search_k8s_docs(query)..."
},
{
"role": "system",
"content": "DOCUMENT: CoreDNS runs as a deployment in Kubernetes..."
},
{
"role": "user",
"content": "How does CoreDNS know which pod IPs belong to a service?"
}
],
"max_output_tokens": 200
}
| Component | Tokens | Cacheble |
|---|---|---|
| System instructions | 1,500 | Yes |
| Tool schema definitions | 1,000 | Yes |
| RAG documentation | 3,500 | Yes |
| Conversation history | 300 | No |
| User question | 50 | No |
Total input tokens: 6,350 (6000 cacheble) Model output tokens: 200
Every request processes the full prompt.
Input cost: 6,350 × $1.25 / 1,000,000 = $0.00794
Output cost: 200 × $10 / 1,000,000 = $0.002
Total cost per request: $0.00994
Cached tokens: 6,000 Non-cached tokens: 350
With caching enabled, the first request runs at full price to populate the cache. After that, repeated tokens are billed at the cached rate.
Cached token cost: 6,000 × $0.125 / 1,000,000 = $0.00075
Non-cached input tokens: 350 × $1.25 / 1,000,000 = $0.00044
Output tokens: 200 × $10 / 1,000,000 = $0.002
Total cost per cached request: $0.00319
Savings per request
Savings per request: $0.00675 (~68% reduction)
| Requests per Day | Monthly Cost (No Cache) | Monthly Cost (With Cache) | Monthly Savings |
|---|---|---|---|
| 10,000 | $2,981 | $956 | $2,025 |
| 100,000 | $29,813 | $9,563 | $20,250 |
| 1,000,000 | $298,125 | $95,625 | $202,500 |
This is why prompt caching is critical for production LLM systems.
Prompt caching with Anthropic models is explicitly controlled by the developer. Developers can mark sections of the prompt as cacheable using the cache_control parameter.
Example:
{
"role": "user",
"content": {
"type": "text",
"text": "This is cached for 1h.",
"cache_control": {
"type": "ephemeral",
"ttl": "1h"
}
}
}
{
"role": "developer",
"content": [
{
"type": "text",
"text": "Cache this segment for 5m.",
"cache_control": {
"type": "ephemeral",
"ttl": "5m"
}
},
{
"type": "text",
"text": "Do not cache this segment"
}
]
}
{
"role": "tool",
"tool_call_id": "tool_call_id",
"content": [
{
"type": "text",
"text": "Tool output cached for 5m.",
"cache_control": {
"type": "ephemeral",
"ttl": "5m"
}
}
]
}
Prompt caching can also be used via Anthropic Relay:
{
"max_tokens": 200,
"system": [
{
"type": "text",
"text": "You are an expert Kubernetes networking assistant.",
"cache_control": { "type": "ephemeral" }
}
],
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "Explain how CoreDNS works in Kubernetes." }
]
}
]
}
Important details:
Initial request:
"usage": {
"cache_created_input_tokens": 1043,
"cache_read_input_tokens": 0
}
Subsequent request:
"usage": {
"cache_created_input_tokens": 0,
"cache_read_input_tokens": 1043
}
This indicates the tokens were reused from the cache.
Key pricing characteristics:
Official pricing details here.
Prompt caching in OpenAI models works differently. Caching is implicit and automatic rather than explicitly controlled.
Key characteristics:
Developers can optionally provide two fields: prompt_cache_key and prompt_cache_retention
{
"model": "gpt-4.1",
"prompt_cache_key": "k8s-networking-docs-v1",
"prompt_cache_retention": "extended",
"messages": [
{
"role": "system",
"content": "You are a Kubernetes networking expert."
},
{
"role": "system",
"content": "DOCUMENT: CoreDNS resolves service names..."
},
{
"role": "user",
"content": "How does CoreDNS know which pod IPs belong to a service?"
}
],
"max_tokens": 200
}
{
"model": "gpt-4.1",
"prompt_cache_key": "k8s-networking-docs-v1",
"prompt_cache_retention": "extended",
"input": [
{
"role": "system",
"content": "You are a Kubernetes networking expert."
},
{
"role": "system",
"content": "DOCUMENT: CoreDNS resolves service names."
},
{
"role": "user",
"content": "How does kube-proxy load balance traffic to pods?"
}
],
"max_output_tokens": 200
}
{
"usage": {
"input_tokens": 1500,
"output_tokens": 180,
"input_tokens_details": {
"cached_tokens": 1400
}
}
}
Meaning:
These optional parameters help improve cache reuse.
A developer-defined identifier used to group related prompts. Example uses:
Example: k8s-networking-docs-v1, support-assistant-v2, product-docs-release-2025
When identical prompts share the same key, cache hits become more likely.
Controls how aggressively the system tries to keep the cached prefix. Typical options include default short retention and extended retention for longer reuse windows.
This is particularly useful for high traffic applications, AI assistants with stable prompt prefixes.
Official pricing details here.
DigitalOcean supports prompt caching when using models from providers like Anthropic and OpenAI. Billing follows the per-token pricing defined by the model provider.
Developers using DigitalOcean can:
This allows teams to implement cost-efficient AI systems while using DigitalOcean’s developer platform. Prompt caching has become one of the most powerful techniques for reducing LLM infrastructure costs.
By organizing prompts into static cached prefixes and dynamic request components, developers can reduce token costs by roughly 70–90% in many real-world applications.
At scale, these savings can amount to tens or hundreds of thousands of dollars per month. As AI applications continue to grow, architectures that incorporate prompt caching will become a core part of building efficient, scalable AI systems.
For teams building production AI applications on platforms like DigitalOcean, prompt caching is not just an optimization—it is a foundational design principle for cost-efficient LLM deployment.


