Context engineering: managing 1M-token windows without context rot

1M-token context windows exist, but quality degrades long before that limit. Context engineering is the discipline of using context windows effectively — what to include, what to summarize, what to retrieve fresh, and the patterns that keep quality high as context grows.

What you should be able to do

Evaluate the implementation pattern, failure modes, and guardrails before building.

May 15, 2026

In 2026, 1M-token context windows are available. Several frontier models, including members of the Gemini, GPT, and Claude families, advertise windows at or near that size. Demos show models reading entire books in a single pass. The dream looks real at last: just put everything in the context and let the model figure it out.

The reality, as always, is more nuanced. 1M tokens is a technical capacity, not a performance guarantee. In practice, models tend to perform best with roughly 5-50K tokens of context. Past 100K, subtle quality issues start to appear. Past 500K, important information is frequently missed. Near 1M, quality often degrades sharply. The exact thresholds vary by model and task.

This phenomenon — "context rot" — is real, well-documented, and visible in evaluations. The implication: you can't just dump everything into context and call it done. You need context engineering: deliberate decisions about what to include, what to summarize, what to retrieve dynamically, and how to structure the resulting context.

This article covers the patterns and discipline of effective context engineering for serious production systems.

What context rot is

"Context rot" is the empirical observation that LLM performance degrades as the context grows, even within the technical limit.

Specific failure modes:

Lost-in-the-middle. Models attend more to content at the beginning and end of context. Information in the middle is less reliably used. A fact placed at position 50K out of 100K is more likely to be missed than the same fact at position 1K or 99K.

Recency bias. Models over-weight recent content. In conversation history, old context becomes effectively invisible.

Distractor sensitivity. Irrelevant content in context degrades performance even on tasks that don't need that content. The model has to filter; some filtering signal leaks.

Reasoning quality drops. Multi-step reasoning becomes less reliable as context grows. The model has more to track; tracking quality suffers.

Cost and latency. Independent of quality, large contexts are expensive (input is priced per token) and slow (prompt processing time grows at least linearly with token count).

These aren't theoretical concerns. Production deployments with naive large contexts consistently underperform deployments with curated contexts.

The principle: less is more

The core insight: context is a precious, degrading resource. Use it strategically.

A 30K-token context with carefully chosen content typically outperforms a 300K-token context with everything dumped in. Quality, cost, and latency all favor the smaller context.

This shifts the engineering work. Instead of "find a way to fit more in context," it's "decide what truly needs to be in context, and put it there well."

The context budget

Think of context as a budget you allocate.

A typical allocation for a customer support agent:

Total context budget: 30K tokens

- System prompt: 1500 tokens (5%)
- Tool descriptions: 1000 tokens (3%)
- User profile / context: 500 tokens (2%)
- Conversation history summary: 1000 tokens (3%)
- Recent conversation turns (full): 4000 tokens (13%)
- Retrieved relevant knowledge: 12000 tokens (40%)
- User's current message: 500 tokens (2%)
- Output token budget (response): 9500 tokens (32%)

Each component competes for space. As context grows, you make trade-offs.

The discipline: be explicit about the allocation. Don't let any component grow unboundedly.
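One way to make the allocation explicit is to encode the per-component caps and check them before every call. The following is a minimal sketch, not a definitive implementation: the component names, the `over_budget` helper, and the 4-characters-per-token estimate are all assumptions for illustration, and only the input-side caps from the allocation above are modeled.

```python
# Hypothetical input-side caps (in tokens), taken from the allocation above.
INPUT_CAPS = {
    "system_prompt": 1_500,
    "tool_descriptions": 1_000,
    "user_profile": 500,
    "history_summary": 1_000,
    "recent_turns": 4_000,
    "retrieved_knowledge": 12_000,
    "current_message": 500,
}


def estimate_tokens(text: str) -> int:
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def over_budget(components: dict[str, str]) -> dict[str, int]:
    """Return {component: tokens over its cap} for each component that exceeds it."""
    overruns = {}
    for name, text in components.items():
        used = estimate_tokens(text)
        if used > INPUT_CAPS[name]:
            overruns[name] = used - INPUT_CAPS[name]
    return overruns
```

A non-empty result is a signal to summarize, trim, or retrieve less for that component rather than silently letting it grow.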

Pattern 1: Tiered conversation memory

For multi-turn conversations, full history grows unbounded. Most production systems use tiered memory:

Tier 1: Recent turns in full. The last 5-10 exchanges, verbatim.

Tier 2: Summarized older turns. Earlier in the conversation, compressed into a short summary.

Tier 3: Extracted facts. Key information from the conversation (user's name, preferences, decisions made) stored as structured facts.

Implementation:

On each turn:
1. Take the conversation history.
2. The last 10 turns are kept verbatim.
3. Turns 11-30 are summarized into 200 words ("Earlier, the user discussed X and we agreed Y").
4. Turns 31+ are reduced to extracted facts ("User prefers Python. User is on enterprise tier.").
5. Total memory stays bounded (for example, ~2K tokens when turns are short) regardless of conversation length.

This pattern is fundamental for any long-running conversation. Without it, conversations degrade as they grow.
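The tiering steps can be sketched as a single function. This is an illustration under stated assumptions: `build_memory` is a hypothetical name, turns are plain strings ordered oldest first, and `summarize` and `extract_facts` are caller-supplied callables (in practice, LLM calls) rather than anything the article prescribes.

```python
from typing import Callable


def build_memory(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    extract_facts: Callable[[list[str]], list[str]],
    recent_n: int = 10,
    summary_n: int = 20,
) -> dict:
    """Split history (oldest first) into facts, a summary, and verbatim recent turns."""
    recent = turns[-recent_n:]          # Tier 1: most recent turns, verbatim
    older = turns[:-recent_n]           # everything before the recent window
    to_summarize = older[-summary_n:]   # Tier 2: the next-most-recent turns
    oldest = older[:-summary_n]         # Tier 3: whatever is older still
    return {
        "facts": extract_facts(oldest) if oldest else [],
        "summary": summarize(to_summarize) if to_summarize else "",
        "recent": recent,
    }
```

Because the tier sizes are fixed, the output stays bounded as long as the summary and the fact list are themselves capped.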

A nuance: the summarization should preserve information the agent will need. If the user mentioned an account number at turn 5 and the agent doesn't pull that into long-term memory, by turn 50 the account number is lost.

Run summarization with explicit instructions on what to preserve:

Summarize the conversation so far. Preserve:
- All facts about the user (name, role, preferences, account info).
- All decisions made.
- All open commitments or follow-ups.
- The current goal of the conversation.

Discard:
- Pleasantries.
- Repeated information.
- Detailed reasoning that's been resolved.

Pattern 2: Just-in-time retrieval

Instead of pre-loading context, retrieve what's relevant when it's relevant.

An anti-pattern: stuff all of a user's documents into context "in case the model needs them." Most queries need a tiny subset. Context is wasted; performance suffers.

Better: retrieve documents based on the current query. Different queries get different documents. Total context per call stays small; relevance stays high.

This is just RAG, applied with discipline. The key: don't give in to "just include everything because we can." Even experienced teams feel that pull when long-context models are available.
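A toy version of per-query retrieval makes the shape of the pattern concrete. The sketch below uses word overlap as a stand-in for a real relevance measure such as embedding similarity; the `score` and `retrieve` helpers and the sample documents are hypothetical.

```python
def score(query: str, doc: str) -> int:
    """Count query words that also appear in the document (toy relevance measure)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(query: str, docs: dict[str, str], top_k: int = 2) -> list[str]:
    """Return ids of up to top_k documents that share at least one word with the query."""
    ranked = sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)
    return [doc_id for doc_id in ranked[:top_k] if score(query, docs[doc_id]) > 0]


docs = {
    "billing": "how to update billing details and download an invoice",
    "sso": "configure single sign-on with your identity provider",
    "api": "api rate limits and authentication tokens",
}

print(retrieve("download my invoice", docs))  # ['billing']
```

Only the returned documents go into the prompt for that call; a different query produces a different, equally small context.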

Pattern 3: Compressed representations

For information that does need to persist in context, use compressed representations.

Original (verbose):

The user has been working in software engineering for 8 years. They started at a small startup called Acme Corp where they worked on backend systems. After 3 years they moved to a larger company called Beta Inc where they did frontend work. They're currently at Gamma LLC working on machine learning systems.

Compressed:

User: SWE, 8 years, currently ML at Gamma LLC. Prior: backend@Acme (3yr), frontend@Beta.

The compressed version preserves the relevant facts in fewer tokens. The model can use either equally well for most purposes.

Apply this technique to:

User profiles.
Document summaries.
Past conversation context.
Knowledge base entries (when full content isn't needed).

The trade-off: compression loses nuance. Use full text when nuance matters; compressed when it doesn't.

Pattern 4: Hierarchical retrieval

For very large knowledge bases, retrieve hierarchically.

Step 1: Retrieve broad categories or document summaries based on the query.

Step 2: Within the most relevant categories, retrieve specific chunks.

Step 3: Include only the chunks selected at step 2 in the final context.

This avoids "I have 10K documents; let me load them all into context." Instead, the funneling keeps context tight.

Variant: a small LLM call selects which chunks are most relevant, before they're included in the main call. This adds a small cost but significantly reduces context bloat.
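The three steps can be sketched as a two-stage funnel. This is an illustration only: the knowledge-base layout (`summary` plus `chunks` per category), the `hierarchical_retrieve` name, and the word-overlap scoring are assumptions, with overlap standing in for whatever ranking your retrieval stack provides.

```python
def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_retrieve(
    query: str,
    categories: dict[str, dict],
    top_categories: int = 1,
    top_chunks: int = 2,
) -> list[str]:
    """categories maps a name to {"summary": str, "chunks": [str, ...]}."""
    # Step 1: rank categories by their summaries.
    ranked = sorted(
        categories,
        key=lambda name: overlap(query, categories[name]["summary"]),
        reverse=True,
    )
    # Step 2: rank chunks only within the selected categories.
    candidates = [
        chunk for name in ranked[:top_categories] for chunk in categories[name]["chunks"]
    ]
    candidates.sort(key=lambda chunk: overlap(query, chunk), reverse=True)
    # Step 3: only these chunks reach the final context.
    return candidates[:top_chunks]
```

The LLM-selection variant would replace the second `sort` with a small model call that picks among the candidates.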

Pattern 5: Reflection on context

For agents in long-running tasks, periodically reflect on what's in context and what should be there.

Every 10 steps, the agent:

1. Reviews its current context.
2. Identifies what's relevant to ongoing work.
3. Summarizes or drops anything no longer needed.
4. Notes what additional context might help.
5. Replaces the old context with the curated version.

This is "garbage collection" for context. Without it, agents accumulate stale information that crowds out new, relevant information.

Implementation requires custom orchestration — most frameworks don't handle this natively. The pattern: between steps, the agent has a "memory consolidation" phase that adjusts what's in context.
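A consolidation phase might look like the sketch below. It is one possible shape, not a framework API: context items are plain dicts, and `is_relevant` and `summarize` are hypothetical caller-supplied callables (typically backed by an LLM call).

```python
from typing import Callable


def consolidate(
    context: list[dict],
    is_relevant: Callable[[dict], bool],
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Keep relevant items; collapse everything else into one summary item."""
    kept, stale = [], []
    for item in context:
        (kept if is_relevant(item) else stale).append(item)
    if stale:
        kept.insert(0, {"kind": "summary", "text": summarize(stale)})
    return kept


def run_with_consolidation(steps, is_relevant, summarize, every: int = 10) -> list[dict]:
    """Append each step's output to context, consolidating every `every` steps."""
    context: list[dict] = []
    for i, step_output in enumerate(steps, start=1):
        context.append(step_output)
        if i % every == 0:
            context = consolidate(context, is_relevant, summarize)
    return context
```

The relevance check should treat earlier summary items as relevant, or decide explicitly when to re-summarize them; otherwise summaries get folded into new summaries on every pass.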

Pattern 6: Dynamic context windows

Different parts of an agent's run might have different optimal context sizes.

Decision steps: Small context, focused on the immediate decision.
Synthesis steps: Larger context, including many sources.
Generation steps: Medium context, with style and format references.

A pattern: each step in the agent's workflow uses a different context shape. The orchestration manages which content goes into which step.

This requires breaking the agent's work into explicit steps rather than one big loop. The framework choice matters here (LangGraph handles this naturally; direct API requires manual work).
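When the steps are explicit, per-step context shapes can be plain configuration. The sketch below is hypothetical throughout: the section names, the token caps, and the `context_for` helper are illustrative, and truncating by characters is a crude stand-in for real token-aware trimming.

```python
# Hypothetical context shapes per step type; caps are illustrative, not recommendations.
STEP_SHAPES = {
    "decision": {"max_tokens": 4_000, "sections": ["goal", "options"]},
    "synthesis": {"max_tokens": 40_000, "sections": ["goal", "sources"]},
    "generation": {"max_tokens": 12_000, "sections": ["goal", "style_guide", "outline"]},
}


def context_for(step_type: str, store: dict[str, str]) -> str:
    """Assemble only the sections this step type needs, capped at its size limit."""
    shape = STEP_SHAPES[step_type]
    parts = [store[name] for name in shape["sections"] if name in store]
    text = "\n\n".join(parts)
    max_chars = shape["max_tokens"] * 4  # rough 4-characters-per-token assumption
    return text[:max_chars]
```

The orchestration layer calls `context_for` before each step, so a decision step never sees the full source material that a synthesis step needs.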

Pattern 7: Position-aware placement

Given that models attend more to the beginning and end of the context, place important content there.

Less effective: Critical instruction buried in the middle of a long system prompt.

More effective: Critical instruction at the very start AND restated near the end.

For RAG with multiple retrieved documents: the most relevant document at the start, the second most at the end, less relevant in the middle.

This is a tactical optimization, but it has a measurable effect on outputs.
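The RAG ordering rule is small enough to write down directly. A minimal sketch, assuming the input list is already sorted most relevant first; `place_by_position` is a hypothetical helper name.

```python
def place_by_position(docs_by_relevance: list[str]) -> list[str]:
    """Most relevant first, second most relevant last, the rest in the middle."""
    if len(docs_by_relevance) < 3:
        # With one or two documents, the ranked order already satisfies the rule.
        return list(docs_by_relevance)
    first, second, *rest = docs_by_relevance
    return [first, *rest, second]


print(place_by_position(["A", "B", "C", "D"]))  # ['A', 'C', 'D', 'B']
```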

Pattern 8: Selective summarization

Not all summarization is equal. Tailor summarization to what downstream tasks need.

Bad: generic summary that drops user preferences.

Better: summary explicitly preserving user preferences relevant to the downstream task.

Summarize this document focusing on:
- Technical decisions made.
- Stakeholders mentioned.
- Open questions or risks.

Drop:
- General context already known to the team.
- Repeated points.

The summarization prompt is engineered for the downstream use.

Pattern 9: Structured context

Plain text is one option. Structured context (JSON, XML, or specific markup) can be much denser.

Verbose prose:

The customer's name is John Smith. He's been a customer since March 2023. His current plan is Pro, billed monthly at $29. He has 3 active integrations: Slack, Notion, and Linear. His usage in the last 30 days has been moderate — 1,250 API calls.

Structured:

{
  "customer": {
    "name": "John Smith",
    "since": "2023-03",
    "plan": "Pro",
    "billing": "monthly $29",
    "integrations": ["Slack", "Notion", "Linear"],
    "usage_30d": {"api_calls": 1250, "tier": "moderate"}
  }
}

The structured version is shorter and (often) easier for the model to use. The model can quickly find specific facts.

Caveat: not all models are equally good with structured input. Test both formats for your use case.
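Serialization choices matter too: the same record can be emitted compactly or pretty-printed. The sketch below compares character counts using the standard `json` module. Character count is only a proxy, since tokenizers split JSON punctuation differently from prose, so measure with your model's tokenizer before drawing conclusions.

```python
import json

prose = (
    "The customer's name is John Smith. He's been a customer since March 2023. "
    "His current plan is Pro, billed monthly at $29. He has 3 active integrations: "
    "Slack, Notion, and Linear. His usage in the last 30 days has been moderate: "
    "1,250 API calls."
)

customer = {
    "customer": {
        "name": "John Smith",
        "since": "2023-03",
        "plan": "Pro",
        "billing": "monthly $29",
        "integrations": ["Slack", "Notion", "Linear"],
        "usage_30d": {"api_calls": 1250, "tier": "moderate"},
    }
}

# Compact separators drop the whitespace that pretty-printing adds.
compact = json.dumps(customer, separators=(",", ":"))
indented = json.dumps(customer, indent=2)

print(len(prose), len(indented), len(compact))
```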

Pattern 10: Context layering

Layer context by priority. High-priority always included; medium-priority included when relevant; low-priority retrieved on demand.

Always included:

System prompt (identity, behavior).
Current user context (essential facts).
Recent conversation.

When relevant:

Retrieved documents matching the query.
Tool outputs from recent steps.

On demand:

Specific data the agent requests via tool calls.
Historical context beyond the recent window.

The "on demand" pattern is critical for scaling: the agent retrieves what it needs, when it needs it, rather than pre-loading everything.
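The first two layers can be assembled mechanically, while the third is left to tool calls at run time. A minimal sketch, assuming each "when relevant" item arrives with a relevance score in [0, 1]; the `assemble_context` name and the 0.5 threshold are hypothetical.

```python
def assemble_context(
    always: list[str],
    when_relevant: list[tuple[float, str]],
    min_score: float = 0.5,
) -> str:
    """Always-on layers first, then relevance-gated items, highest score first."""
    gated = [
        text
        for score, text in sorted(when_relevant, reverse=True)
        if score >= min_score
    ]
    # On-demand content is deliberately absent: the agent fetches it via tools.
    return "\n\n".join([*always, *gated])
```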

Pattern 11: Eviction strategies

When context approaches limits, what gets evicted?

Recency-based: oldest content evicted first.
Relevance-based: content least related to current task evicted first.
Importance-based: content marked low-importance evicted first.

A practical pattern: tag context items with priority levels. When eviction is needed, drop in priority order.

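PRIORITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}

context_items = [
    {"content": "...", "priority": "critical"},  # Never evict
    {"content": "...", "priority": "high"},      # Evict last
    {"content": "...", "priority": "medium"},    # Evict if needed
    {"content": "...", "priority": "low"},       # Evict first
]

def total_tokens(items):
    # Rough estimate (~4 characters per token); swap in your tokenizer.
    return sum(len(item["content"]) // 4 for item in items)

def evict_to_fit(items, budget):
    kept = list(items)  # Don't mutate the caller's list
    # Eviction candidates, lowest priority first; critical items are never candidates.
    candidates = sorted(
        (item for item in kept if item["priority"] != "critical"),
        key=lambda item: PRIORITY_ORDER[item["priority"]],
    )
    while total_tokens(kept) > budget and candidates:
        kept.remove(candidates.pop(0))  # Remove lowest priority
    return kept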

Pattern 12: Caching for repeated context

Many calls reuse the same context — same system prompt, same tool descriptions, same user profile.

Most providers now support prompt caching:

Anthropic: explicit cache_control markers in messages.
OpenAI: automatic for prefix-matching requests.
Google: explicit cached content via API.

When the same prefix is reused, the cached portion is processed faster and billed at a discount (up to roughly 90% off cached input tokens, depending on the provider).

For agents that make many calls in a session, ensure the static parts (system prompt, tools, user context) are cacheable. Place dynamic content (current step, recent results) after the cacheable prefix.

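This is one of the highest-ROI optimizations. Take a 10K-token static prefix used across 50 calls in a session: the first call pays roughly full price (some providers add a surcharge for writing the cache), and subsequent calls pay the discounted cached rate, up to about 90% off. The savings are significant.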
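The arithmetic for the 10K-token prefix across 50 calls can be checked in a few lines. This is a simplified model under stated assumptions: prices are in arbitrary units, the discount is a flat 90% on cache hits, and cache-write surcharges and cache expiry are ignored. `prefix_cost` is a hypothetical helper.

```python
def prefix_cost(
    prefix_tokens: int,
    calls: int,
    price_per_token: float,
    cached_discount: float = 0.9,
) -> tuple[float, float]:
    """Return (cost without caching, cost with caching) for the static prefix only."""
    uncached = prefix_tokens * calls * price_per_token
    # First call pays full price; every later call pays the discounted rate.
    cached = prefix_tokens * price_per_token * (1 + (calls - 1) * (1 - cached_discount))
    return uncached, cached


uncached, cached = prefix_cost(10_000, 50, price_per_token=1.0)
savings = 1 - cached / uncached
print(f"{savings:.1%}")  # 88.2%
```

Under these assumptions the prefix costs the equivalent of about 59K tokens instead of 500K. Real savings depend on the provider's actual cache pricing and on how often the cache expires between calls.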

Pattern 13: Context-aware prompts

Prompts can encourage the model to use context well:

Reference the documents below to answer the user's question. Always cite the specific document and quote relevant passages.

If you cannot find the answer in the provided documents, say so explicitly. Do not invent information.

When information from multiple documents is relevant, synthesize them and note any disagreements.

This kind of prompting reduces hallucination on context-grounded tasks and improves citation quality.

When to embrace longer context

Despite context rot, longer context is genuinely better for some tasks:

Single-document analysis. If the task is "analyze this contract," fitting the whole contract in context is often better than chunked retrieval.

Comparison across many items. Comparing 10 contracts side-by-side benefits from all 10 in context.

Code editing in context. Modifying a function in a 5K-LOC file is easier with the file in context than with retrieved snippets.

Whole-conversation summarization. Producing a summary of a long conversation works better with full context (up to a point).

The pattern: when the task fundamentally requires understanding relationships across content, longer context helps. When the task can be done with a small slice, smaller context is better.

A practical heuristic: contexts up to 50K tokens are generally fine. 50-200K tokens work but quality dips. 200K+ tokens often underperform shorter contexts. Test empirically.

Eval discipline

How do you know your context engineering is working? Eval.

Specific evals:

Recall tests. Embed key facts at various positions in long contexts. Test whether the model uses them. Measure recall vs position.

Distractor tests. Compare performance on the same query with and without irrelevant context. Measure degradation.

Long-context vs RAG comparison. Same queries answered with full context vs with retrieved chunks. Compare quality.

Token-efficiency. Quality per dollar. As context grows, you pay more — does quality grow commensurately?

These evals reveal whether your context choices are actually helping. Without them, you're guessing.
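A recall-versus-position test needs little more than a haystack builder and a loop. The sketch below assumes a caller-supplied `ask(context, question)` function that wraps your model call; that wrapper, the helper names, and the substring-match scoring are all hypothetical simplifications.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, position: float) -> str:
    """Insert needle at a relative position (0.0 = start, 1.0 = end) among filler."""
    index = round(position * len(filler))
    return "\n\n".join(filler[:index] + [needle] + filler[index:])


def recall_by_position(
    ask: Callable[[str, str], str],
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    positions: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """For each position, check whether the model's answer contains the expected string."""
    results = {}
    for position in positions:
        context = build_haystack(filler, needle, position)
        answer = ask(context, question)
        results[position] = expected.lower() in answer.lower()
    return results
```

Running this over many needles and context lengths gives a recall curve per position. The distractor test follows the same shape, with the filler swapped between relevant and irrelevant material.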

A worked example: research assistant

A real-world example: an AI research assistant for a small team.

Task: Answer questions about a 200-document corpus (papers, internal docs, meeting notes).

Naive approach: Load all docs into context (300K tokens). Quality is okay; cost is high; latency is bad.

Engineered approach:

Context budget: 25K tokens

- System prompt: 1500 tokens (cached)
- Tool descriptions (search, fetch_doc, etc.): 800 tokens (cached)
- Conversation memory: 1500 tokens (last 10 turns)
- Retrieved chunks for current query: 18000 tokens (top 12 chunks via RAG)
- User's current question: 200 tokens
- Output budget: ~3000 tokens

The agent dynamically retrieves chunks based on the question. Conversation memory keeps recent context. Static elements are cached.

Outcome:

Latency: 2-3 seconds (vs 10-15 with 300K context).
Cost: ~€0.01/query (vs ~€0.10).
Quality: better, measured on eval set, because relevant content is properly attended to.

This is what disciplined context engineering looks like. Not a one-time decision but ongoing tuning.

Common mistakes

A few recurring ones:

Mistake 1: "More context = better." Reaching for longer context as the solution to quality problems. Often the opposite is true.

Mistake 2: No context budget. Components grow unboundedly. The user profile section becomes 5K tokens; the retrieved chunks section becomes 50K. No discipline.

Mistake 3: Ignoring positions. Critical instructions buried in the middle, hoping the model finds them. Sometimes it does; often it doesn't.

Mistake 4: Verbose prose for facts. Long sentences where structured data would suffice. Wastes tokens.

Mistake 5: No conversation summarization. Conversations grow until they exceed limits, then either degrade or break.

Mistake 6: No caching. Repeated static prefixes are billed at full price on every call. Easy money left on the table.

Mistake 7: No eval of context choices. Trusting that "better context engineering" is working without measuring it. Sometimes it isn't.

Mistake 8: Generic summarization. Summarizing without thinking about what downstream tasks need loses information that matters.

The takeaway

Long context windows are real, but they're not a license to ignore context engineering. Quality degrades long before technical limits. Cost and latency are real.

The discipline of context engineering:

Treat context as a budget.
Tier memory (recent verbatim, older summarized, oldest as facts).
Retrieve just-in-time, not preemptively.
Compress where compression preserves information.
Place critical content at high-attention positions.
Layer context by priority.
Cache static prefixes.
Eval continuously.

These patterns produce systems that work better, faster, and cheaper than naive "throw it all in the context" approaches.

For mature production systems, context engineering is one of the highest-leverage areas. The technical primitives are simple; the discipline to apply them rigorously is what separates "demo works" from "production reliable."

Invest in the patterns. Build the discipline. The result is AI systems that scale gracefully as they encounter more data, more conversations, and more complexity.
