
DeepSeek V4 Pro: Cheap 1M-Token Context in Python
Summary
Use DeepSeek V4 Pro's auto KV cache to run huge-context jobs for cents.
On June 11, 2026 the AI price war stopped being a marketing skirmish and became structural. DeepSeek made its 75% discount on V4 Pro permanent, dropping the model to $0.435 per million cache-miss input tokens, $0.87 per million output tokens, and a startling $0.003625 per million tokens on cached input. That cached rate is roughly 120x cheaper than a cache miss, and the model ships with a 1M-token context window and 384K max output.
Here is the part most write-ups bury: the cache is automatic. You do not call a special endpoint or set a flag. If the beginning of your request matches a recent call, DeepSeek serves the overlapping prefix from its disk cache and bills it at the cheap rate. For long-context workloads where you ask many questions against the same big document, that single behavior cuts the input cost of every repeat read by roughly 120x.
This guide shows you how to wire the standard OpenAI Python SDK to DeepSeek, push a genuinely large document through V4 Pro, prove the cache is firing by reading the usage fields, and stack multi-turn calls so the savings compound. Every snippet is API-accurate against the official DeepSeek docs. By the end you will be able to ingest a 400K-token document once for under 20 cents and answer each follow-up for a fraction of a cent, instead of paying the full input price on every call.
What you'll build
A small Python script that loads a large text file, asks DeepSeek V4 Pro a question about it, then asks several follow-up questions that reuse the cached document prefix. You will instrument every call to print exact cache-hit and cache-miss token counts and compute the real dollar cost, so the savings are not theoretical.
Prerequisites
- Python 3.9 or newer.
- A DeepSeek API key from platform.deepseek.com (the key is passed as a normal bearer token).
- The OpenAI Python SDK: pip install openai. DeepSeek implements the OpenAI Chat Completions spec, so no DeepSeek-specific library is needed.
- A large-ish text file to experiment with. A long PDF exported to text, a book, or a few merged source files all work. Aim for tens of thousands of tokens so the cache effect is obvious.
Step 1 - Point the OpenAI SDK at DeepSeek
The only changes versus calling OpenAI are the base_url and the model name. DeepSeek's OpenAI-compatible endpoint is https://api.deepseek.com and the model id for the flagship is deepseek-v4-pro.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a precise document analyst."},
        {"role": "user", "content": "Say hello and confirm you are DeepSeek V4 Pro."},
    ],
)
print(resp.choices[0].message.content)
Example output:
Hello. Yes, I am DeepSeek V4 Pro, ready to analyze your documents.
If this returns text, your auth and routing are correct. Everything from here is about feeding it a lot of context cheaply.
Step 2 - Run a real long-context task
Load a large file into a single system message, then ask a question in the user turn. Keeping the big, stable content first (system prompt and document) is deliberate: the cache matches on the prefix of your message sequence, so anything you keep constant at the front becomes a cache hit on the next call.
def load_document(path: str) -> str:
    """Read the whole file into memory as one string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

document = load_document("contract.txt")  # e.g. ~400K tokens of text

def ask(question: str):
    """Send one question with the full document as a stable system prefix."""
    return client.chat.completions.create(
        model="deepseek-v4-pro",
        messages=[
            # Keep the large, stable content at the FRONT so it can be cached.
            {"role": "system", "content": "You are a precise contract analyst. "
                                          "Answer only from the document below.\n\n"
                                          f"DOCUMENT:\n{document}"},
            {"role": "user", "content": question},
        ],
    )

resp = ask("List every payment milestone and its due date.")
print(resp.choices[0].message.content)
Example output (truncated):
1. Signing fee - due on execution (Section 3.1)
2. Milestone A: design sign-off - net 30 after delivery (Section 3.2)
3. Milestone B: integration complete - net 30 after acceptance (Section 3.3)
4. Final payment: go-live + 15 business days (Section 3.5)
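Before sending a file this large, it can help to sanity-check that it is likely to fit in the context window. The sketch below is only an estimate: it assumes roughly 4 characters per token, a common rule of thumb for English text rather than DeepSeek's actual tokenizer. The helper names estimate_tokens and fits_in_context, and the 8,000-token output reserve, are illustrative choices that do not come from the DeepSeek docs.

```python
# Rough pre-flight size check. The 4 chars/token ratio is a rule of thumb,
# NOT DeepSeek's tokenizer, so treat the result as an estimate only.
CONTEXT_WINDOW = 1_000_000  # V4 Pro context size quoted in this guide
CHARS_PER_TOKEN = 4         # assumption: heuristic for English text


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def fits_in_context(text: str, reserve_for_output: int = 8_000) -> bool:
    """True if the estimated prompt plus an output budget fits the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOW


if __name__ == "__main__":
    sample = "x" * 1_600_000  # about 400K tokens under the heuristic
    print(estimate_tokens(sample), fits_in_context(sample))
```

The authoritative count is the one the API reports in the usage object after a call, so use this check only to avoid obviously oversized requests.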
Step 3 - Prove the cache is working
DeepSeek returns two extra fields on the usage object: prompt_cache_hit_tokens and prompt_cache_miss_tokens. On the very first call almost everything is a miss. Run the same prefix again and the document tokens flip to hits. Read them defensively, since they are DeepSeek extensions and may not be typed attributes on every SDK version.
def report(resp, in_price_miss=0.435, in_price_hit=0.003625, out_price=0.87):
    """Print cache-hit/miss token counts and the dollar cost of one response."""
    u = resp.usage
    # DeepSeek-specific usage fields; fall back to 0 if the SDK does not expose them.
    hit = getattr(u, "prompt_cache_hit_tokens", 0) or 0
    miss = getattr(u, "prompt_cache_miss_tokens", 0) or 0
    out = u.completion_tokens
    # Prices are in USD per 1M tokens.
    cost = (miss/1e6)*in_price_miss + (hit/1e6)*in_price_hit + (out/1e6)*out_price
    print(f"cache hit : {hit:>8,} tok")
    print(f"cache miss: {miss:>8,} tok")
    print(f"output : {out:>8,} tok")
    print(f"cost : ${cost:.6f}")

r1 = ask("List every payment milestone and its due date.")
report(r1)  # first call - mostly a miss
r2 = ask("Now list every termination clause.")
report(r2)  # same document prefix - mostly a hit
Example output:
# r1 (first time the document is seen)
cache hit :        0 tok
cache miss:  401,920 tok
output :    1,180 tok
cost : $0.175862
# r2 (same document prefix, new question)
cache hit :  401,856 tok
cache miss:       64 tok
output :      940 tok
cost : $0.002302
Same 400K-token document, second question: the input bill drops from about 17.5 cents to about 0.15 cents. The only tokens billed at the expensive rate are the handful that changed (your new question). That is the entire trick, and you got it without writing a single line of cache management.
Step 4 - Multi-turn, where the savings compound
Real analysis is rarely one question. The pattern below keeps the document pinned at the front and appends each new exchange, so every follow-up rides the cache. Note the ordering rule: append new turns at the end and never reorder earlier messages, or you break the prefix and force a full re-bill.
SYSTEM = {
    "role": "system",
    "content": "You are a precise contract analyst. Answer only from the "
               f"document below.\n\nDOCUMENT:\n{document}",
}
history = [SYSTEM]

def chat(question: str):
    """Append a question, call the model, and record the answer in history."""
    history.append({"role": "user", "content": question})
    resp = client.chat.completions.create(
        model="deepseek-v4-pro", messages=history,
    )
    answer = resp.choices[0].message.content
    # Append only; never rewrite earlier turns, or the cached prefix is lost.
    history.append({"role": "assistant", "content": answer})
    report(resp)
    return answer

chat("What is the total contract value?")
chat("Which party bears liability for data breaches?")
chat("Summarize the renewal terms in two sentences.")
After the first turn pays to ingest the document once, every subsequent turn bills the giant prefix at the cached rate. Ten follow-up questions against a 400K-token contract add roughly 1.5 cents of cached input in total (10 x 400K tokens at $0.003625 per million), plus the small cost of each new question and answer, rather than ten full re-reads at about 17 cents each.
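The append-only rule can be checked locally before each request. The following is a minimal sketch, assuming that comparing message dictionaries is a reasonable proxy for what the server sees; is_append_only is a hypothetical helper name, and the check does not query DeepSeek or guarantee a cache hit.

```python
def is_append_only(previous: list, current: list) -> bool:
    """True if `current` starts with every message of `previous`, unchanged.

    Local guard only: it compares the message dicts you are about to send
    and says nothing about whether the server still holds the cached prefix.
    """
    if len(current) < len(previous):
        return False  # a turn was removed
    return current[:len(previous)] == previous


if __name__ == "__main__":
    sent = [
        {"role": "system", "content": "DOCUMENT: (long text)"},
        {"role": "user", "content": "What is the total contract value?"},
    ]
    appended = sent + [{"role": "assistant", "content": "An answer."}]
    edited = [dict(sent[0], content="DOCUMENT: (edited text)")] + sent[1:]

    print(is_append_only(sent, appended))  # earlier turns untouched
    print(is_append_only(sent, edited))    # system message was changed
```

One way to use it is to keep a copy of the last message list you sent and assert is_append_only(last_sent, history) before each call, so an accidental edit to an earlier turn shows up in development instead of on the bill.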
Step 5 - Turn on thinking mode for hard reasoning
V4 Pro can run in a non-thinking mode (fast, cheap, great for extraction) or a thinking mode (slower, stronger multi-step reasoning). With the OpenAI SDK you pass DeepSeek's thinking switch through extra_body, and you can tune reasoning_effort. Reach for this when the question requires the model to reconcile clauses across the whole document, not just find a fact.
resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=history + [
        {"role": "user",
         "content": "Do any clauses contradict each other? Reason step by step."},
    ],
    extra_body={
        "thinking": {"type": "enabled"},
        "reasoning_effort": "high",
    },
)
print(resp.choices[0].message.content)
The cached prefix still applies in thinking mode, so you only pay the premium on the reasoning tokens the model generates, not on re-reading the document.
Worked example: a 200-page contract review for about 20 cents
Suppose a 200-page contract is about 400K tokens of text and you need 12 answers from it: payment terms, termination, liability, IP ownership, renewal, SLAs, and so on. With a naive provider that re-reads the document every call, you pay for 400K input tokens twelve times. With DeepSeek's automatic cache you pay full price once and the cached rate eleven times.
| Approach | Input tokens billed | Approx input cost |
|---|---|---|
| No cache (12 full re-reads) | 12 x 401,920 | $2.10 |
| V4 Pro auto cache (1 miss + 11 hits) | 401,920 miss + 11 x ~401,920 hit | $0.19 |
| Savings | - | about 11x on input |
Output tokens are billed normally in both cases, but for extraction-style answers the output is small relative to the input, so the input bill dominates and the cache is where the money is saved.
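The arithmetic behind the table can be reproduced in a few lines. This sketch uses the prices quoted in this guide and makes two simplifying assumptions: every call after the first is a complete cache hit on the document, and the handful of tokens in each question is ignored. The function names are illustrative.

```python
PRICE_MISS = 0.435    # USD per 1M cache-miss input tokens (from this guide)
PRICE_HIT = 0.003625  # USD per 1M cached input tokens (from this guide)


def input_cost_no_cache(doc_tokens: int, questions: int) -> float:
    """Every question re-reads the whole document at the cache-miss rate."""
    return questions * doc_tokens * PRICE_MISS / 1e6


def input_cost_with_cache(doc_tokens: int, questions: int) -> float:
    """First call is a full miss; later calls are assumed to be full hits."""
    if questions <= 0:
        return 0.0
    first = doc_tokens * PRICE_MISS / 1e6
    rest = (questions - 1) * doc_tokens * PRICE_HIT / 1e6
    return first + rest


if __name__ == "__main__":
    naive = input_cost_no_cache(401_920, 12)
    cached = input_cost_with_cache(401_920, 12)
    print(f"no cache  : ${naive:.2f}")
    print(f"auto cache: ${cached:.2f}")
    print(f"ratio     : {naive / cached:.1f}x")
```

With the 401,920-token document and 12 questions this prints $2.10, $0.19, and 11.0x, matching the table. The ratio grows with the number of questions, because the one-time miss is spread over more cheap hits.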
Common pitfalls and gotchas
- Putting the variable part first. The cache matches on the prefix of your message list. If you prepend a per-request timestamp, a random id, or the user question before the document, you invalidate the prefix and every call is a full miss. Keep the big, stable content (system prompt + document) at the front and let variable content trail.
- Reordering or editing earlier messages. In multi-turn chats, only append. Editing or removing an earlier turn changes the prefix from that point on and forces a re-bill of everything after it.
- Assuming the cache is permanent. It is a disk cache of recent traffic, not a durable store. After enough idle time or eviction, your prefix becomes a miss again. Treat cache hits as a cost optimization, never as a correctness guarantee.
- Expecting hits across different keys or accounts. Caching is scoped to your usage, not shared between unrelated users, so a prefix one client warmed up will not be cheap for a different account.
- Forgetting the deprecation alias. deepseek-chat and deepseek-reasoner are slated for deprecation on 2026-07-24 and map to V4 Flash modes. For the flagship, name deepseek-v4-pro explicitly rather than relying on an alias.
- Reading usage fields as guaranteed attributes. prompt_cache_hit_tokens and prompt_cache_miss_tokens are DeepSeek extensions. Use getattr(usage, name, 0) so your code does not crash on an SDK version that returns them only in the raw payload.
- Confusing the 1M context with infinite memory. 1M tokens is large but finite. A 200-page contract fits comfortably; an entire document warehouse does not. For corpora bigger than the window, retrieve the relevant slices first, then let the cache make repeated questioning of those slices cheap.
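The first pitfall is easy to see with a local illustration. The sketch below flattens two requests into strings and counts how many leading characters they share. This is only an analogy: the real cache works on DeepSeek's side with its own units and rules, and the helpers render, shared_prefix_chars, stable_first, and variable_first are hypothetical names made up for this example.

```python
def render(messages: list) -> str:
    """Flatten messages in order, as a stand-in for the prompt the model sees."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


def shared_prefix_chars(a: str, b: str) -> int:
    """Count the leading characters that two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


DOC = "DOCUMENT:\n" + "lorem ipsum " * 1000  # stand-in for a large document


def stable_first(question: str, stamp: str) -> list:
    # Document leads; the per-request stamp trails in the user turn.
    return [{"role": "system", "content": DOC},
            {"role": "user", "content": f"[{stamp}] {question}"}]


def variable_first(question: str, stamp: str) -> list:
    # Per-request stamp leads, so the requests diverge almost immediately.
    return [{"role": "system", "content": f"[{stamp}] {DOC}"},
            {"role": "user", "content": question}]


if __name__ == "__main__":
    good = shared_prefix_chars(render(stable_first("Q1", "10:00:01")),
                               render(stable_first("Q2", "10:00:02")))
    bad = shared_prefix_chars(render(variable_first("Q1", "10:00:01")),
                              render(variable_first("Q2", "10:00:02")))
    print(f"stable-first shares {good:,} chars; variable-first shares {bad:,}")
```

In the stable-first layout the two requests share the entire document and only diverge inside the user turn. In the variable-first layout they diverge within the first few characters, which is the situation that turns every call into a full miss.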
Quick reference
| Item | Value |
|---|---|
| OpenAI base_url | https://api.deepseek.com |
| Flagship model id | deepseek-v4-pro |
| Context / max output | 1M tokens / 384K tokens |
| Cache-miss input price | $0.435 / 1M tokens |
| Cached input price | $0.003625 / 1M tokens (~120x cheaper) |
| Output price | $0.87 / 1M tokens |
| Cache control | Automatic; matches on request prefix |
| Usage fields | prompt_cache_hit_tokens, prompt_cache_miss_tokens |
| Thinking mode | extra_body={"thinking": {"type": "enabled"}} |
| Deprecated aliases (2026-07-24) | deepseek-chat, deepseek-reasoner |
Next steps
- Wrap the report() helper into a logger and track your cache-hit ratio over a real workload; aim to keep stable context at the front so the ratio stays high.
- Combine the cache with retrieval: pin a curated set of source chunks as the system prefix, then fire many cheap questions against it.
- Benchmark V4 Pro non-thinking versus thinking mode on your own task to find where the extra reasoning cost actually buys better answers.
- Stream responses with stream=True for long outputs so your UI shows progress while the model writes.
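As a starting point for the first suggestion, here is a minimal sketch of a cache-hit tracker. CacheStats is a hypothetical class name, and the demo feeds it stand-in usage objects built from the Step 3 example numbers rather than real API responses. It reads the two DeepSeek usage fields defensively, in the same way as report().

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    """Accumulate cache-hit and cache-miss input tokens across calls."""
    hit: int = 0
    miss: int = 0
    calls: int = 0

    def record(self, usage) -> None:
        # DeepSeek extension fields; default to 0 when they are absent or None.
        self.hit += getattr(usage, "prompt_cache_hit_tokens", 0) or 0
        self.miss += getattr(usage, "prompt_cache_miss_tokens", 0) or 0
        self.calls += 1

    @property
    def hit_ratio(self) -> float:
        """Share of input tokens served from cache (0.0 if nothing recorded)."""
        total = self.hit + self.miss
        return self.hit / total if total else 0.0


if __name__ == "__main__":
    stats = CacheStats()
    # Stand-in usage objects mirroring the two calls in the Step 3 example.
    stats.record(SimpleNamespace(prompt_cache_hit_tokens=0,
                                 prompt_cache_miss_tokens=401_920))
    stats.record(SimpleNamespace(prompt_cache_hit_tokens=401_856,
                                 prompt_cache_miss_tokens=64))
    print(f"{stats.calls} calls, hit ratio {stats.hit_ratio:.1%}")
```

In a real script you would call stats.record(resp.usage) after each request. A ratio that stays low over many calls against the same document suggests that something variable has crept into the front of the prompt.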
Prices and model names verified against the official DeepSeek API docs (api-docs.deepseek.com) as of June 2026. Provider pricing changes often, so confirm current rates before relying on the cost figures.