Context Engineering
Strategies for curating the optimal set of tokens during LLM inference — a discipline that evolved from prompt engineering.
Context engineering is the set of strategies for curating and maintaining the optimal set of tokens during LLM inference. Where prompt engineering focuses on how you ask, context engineering focuses on everything that surrounds the ask — system prompts, tools, examples, retrieved data, conversation history, and structured memory.
The shift matters because, for many practical tasks, raw model capability is no longer the primary bottleneck. The constraint is increasingly context: what information is available when the model reasons, and how efficiently that information is encoded.
Think of the LLM as a CPU and the context window as RAM. You would not blame a CPU for poor performance if you loaded the wrong data into memory.
Context vs Prompt Engineering
Prompt engineering and context engineering are complementary, not competing:
| Aspect | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | How you phrase the instruction | What information surrounds the instruction |
| Scope | Single turn or message | Full session lifecycle |
| Optimises for | Instruction clarity | Token efficiency and relevance |
| Scales with | Task complexity | Session length and tool count |
A perfectly phrased prompt fails if the context is wrong. A mediocre prompt often succeeds if the right context is present.
The Anatomy of Effective Context
Three components make up the context a model sees at inference time. Each has distinct design constraints.
System Prompts
The system prompt sets the operating parameters. Two failure modes to avoid:
- Overly rigid — Hardcoded complex logic creates fragility and maintenance burden
- Overly vague — High-level guidance without concrete signals fails to direct behaviour
Best practices:
- Organise into distinct sections using XML tags or Markdown headers
- Aim for the minimal set of information that fully outlines expected behaviour
- Start with minimal prompts on capable models, then add instructions based on observed failure modes
- Write at the right altitude — general enough to handle variation, specific enough to prevent drift
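The sectioning advice above can be sketched concretely. This is a minimal, hypothetical example — the section names, the assistant's role, and the `section` helper are all illustrative, not a prescribed schema:

```python
# A minimal sketch of a system prompt organised into XML-tagged sections.
# The role, guidelines, and tag names are illustrative assumptions.
SYSTEM_PROMPT = """\
<role>
You are a code-review assistant for a Python monorepo.
</role>

<guidelines>
- Flag correctness issues before style issues.
- Cite the file and line for every finding.
</guidelines>

<output_format>
Return findings as a bulleted list, most severe first.
</output_format>
"""

def section(prompt: str, tag: str) -> str:
    """Extract one tagged section, e.g. to audit prompt length per section."""
    start = prompt.index(f"<{tag}>") + len(tag) + 2
    end = prompt.index(f"</{tag}>")
    return prompt[start:end].strip()
```

Distinct sections make it easy to measure which part of the prompt is growing and to trim instructions that observed failures show are unnecessary.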
Tools
Tools extend what the model can do. Effective tools are self-contained, robust to error, and extremely clear about their intended use.
Design principles:
- Minimal overlap — Avoid ambiguous decision points between similar tools
- Descriptive parameters — Unambiguous input names and descriptions
- Token-efficient returns — Return only what the model needs, not everything available
- No bloated tool sets — Each tool should do one thing well
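A sketch of what these principles look like in practice, using the JSON-schema style many LLM APIs accept for tool definitions. The tool name, fields, and in-memory ticket store are hypothetical:

```python
# Hypothetical tool definition: unambiguous description, descriptive
# parameters, and a handler that returns only what the model needs.
search_tickets_tool = {
    "name": "search_tickets",
    "description": (
        "Search the ticket tracker by keyword. Use this when the user asks "
        "about existing issues; do NOT use it to create tickets."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Keywords to match against ticket titles.",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum results (default 5); keep small to "
                               "limit tokens returned to the model.",
            },
        },
        "required": ["query"],
    },
}

# Stand-in data store for the sketch.
TICKETS = [
    {"id": 1, "title": "Login fails on Safari", "status": "open", "body": "…"},
    {"id": 2, "title": "Add dark mode", "status": "closed", "body": "…"},
]

def run_search_tickets(query: str, limit: int = 5) -> list[dict]:
    """Token-efficient return: id, title, status — not full ticket records."""
    hits = [t for t in TICKETS if query.lower() in t["title"].lower()]
    return [{"id": t["id"], "title": t["title"], "status": t["status"]}
            for t in hits[:limit]]
```

Note the description states both when to use the tool and when not to — removing an ambiguous decision point with any hypothetical `create_ticket` tool.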
Examples
Examples are worth a thousand words of instruction. Use diverse, canonical examples that portray expected behaviour — not exhaustive edge-case lists.
A few well-chosen examples teach patterns more effectively than pages of rules. Select examples that demonstrate the boundaries of acceptable behaviour, not just the happy path.
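One way to operationalise this: keep a small, deliberately diverse example set and render it into the prompt. The task (classifying support emails) and the labels are illustrative assumptions:

```python
# A sketch of a few-shot block built from diverse, canonical examples:
# one happy path plus two boundary cases, rather than an exhaustive list.
EXAMPLES = [
    # Happy path
    ("My invoice shows a double charge.", "billing"),
    # Boundary: mentions money but is really a product question
    ("Does the pro plan include SSO?", "product"),
    # Boundary: angry tone but a routine request
    ("STILL waiting on my password reset!!", "account"),
]

def few_shot_block(examples: list[tuple[str, str]]) -> str:
    lines = []
    for text, label in examples:
        lines.append(f"Email: {text}\nCategory: {label}\n")
    return "\n".join(lines)
```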
Runtime Context Strategies
How context enters the window during a session determines both cost and quality.
Just-in-Time Loading
Rather than preloading all potentially relevant data, maintain lightweight identifiers (file paths, URLs, queries) and retrieve information dynamically at runtime.
Benefits:
- Reduces immediate token burden
- Enables progressive discovery — agents incrementally find relevant context through exploration
- Metadata (folder hierarchies, naming conventions, timestamps) provides implicit guidance without consuming tokens
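The pattern reduces to two operations: a cheap listing of identifiers plus metadata, and an expensive fetch performed only on request. The in-memory document store below stands in for a filesystem or database, and the truncation limit is an illustrative choice:

```python
# Sketch of just-in-time loading: lightweight identifiers stay in context;
# full content is resolved only when the agent asks for it.
DOCS = {
    "docs/api.md": "The API exposes three endpoints …",
    "docs/setup.md": "Install the package, then configure credentials …",
}

def list_references() -> list[str]:
    """Cheap: identifiers plus sizes give implicit guidance for few tokens."""
    return [f"{path} ({len(text)} chars)" for path, text in sorted(DOCS.items())]

def load_reference(path: str, max_chars: int = 4_000) -> str:
    """Expensive: full content, fetched on demand and truncated so a single
    document cannot flood the context window."""
    return DOCS[path][:max_chars]
```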
Hybrid Approach
Retrieve some data upfront for speed while allowing autonomous exploration for the rest. This is the pattern Claude Code uses: CLAUDE.md files load automatically at session start, while glob and grep enable selective file discovery — effectively bypassing stale indexing and complex syntax trees.
Trade-off: runtime exploration is slower than retrieving pre-computed data. Design tools and guidance to minimise unnecessary exploration.
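A rough sketch of the hybrid pattern, assuming one eagerly loaded file and a glob-style discovery tool (file names and helpers are illustrative stand-ins, not Claude Code's actual implementation):

```python
# Hybrid loading: a small set of files loads eagerly at session start;
# everything else is reachable on demand through search tools.
from pathlib import Path

EAGER_FILES = ["CLAUDE.md"]  # always loaded, like a project briefing

def session_context(root: str) -> str:
    """Upfront: concatenate the eager files that exist under root."""
    parts = []
    for name in EAGER_FILES:
        p = Path(root) / name
        if p.exists():
            parts.append(p.read_text())
    return "\n\n".join(parts)

def glob_tool(root: str, pattern: str) -> list[str]:
    """On-demand discovery: return matching paths, not file contents."""
    return [str(p) for p in sorted(Path(root).rglob(pattern))]
```

Returning paths rather than contents from the discovery tool keeps the cost of a broad search low; the agent then loads only the files it decides are relevant.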
Progressive Disclosure
Load metadata at startup (descriptions, triggers — roughly 100 tokens per item) and full content only on invocation. This is the pattern behind Claude's skill system: skill descriptions are always available, but full skill instructions load only when the skill activates.
This scales to hundreds of capabilities without overwhelming the context window.
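The two-tier structure can be sketched as a registry where descriptions are always rendered and full instructions load only on activation. The skill names, fields, and registry shape are hypothetical, not the actual skill-system format:

```python
# Progressive disclosure: short descriptions always in context;
# full instructions enter the window only when a skill activates.
SKILLS = {
    "pdf-extract": {
        "description": "Extract text and tables from PDF files.",
        "instructions": "Step 1: … Step 2: …",  # placeholder body
    },
    "git-bisect": {
        "description": "Locate a regression with git bisect.",
        "instructions": "Step 1: … Step 2: …",
    },
}

def startup_context() -> str:
    """Roughly 100 tokens per skill: name plus trigger description."""
    return "\n".join(f"{name}: {s['description']}" for name, s in SKILLS.items())

def activate(name: str) -> str:
    """Full instructions, loaded only at invocation time."""
    return SKILLS[name]["instructions"]
```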
Choosing a Strategy
| Task Type | Recommended Approach | Rationale |
|---|---|---|
| Short, focused task | Upfront loading | Fast; context window has capacity |
| Multi-file codebase exploration | Hybrid (essentials + search tools) | Unknown scope; discovery needed |
| Long-running session (hours) | Just-in-time + compaction | Context window will fill; manage actively |
| Multi-agent orchestration | Progressive disclosure + sub-agents | Isolate concerns; minimise cross-contamination |
| Repetitive workflow | Skills with metadata + on-demand load | Amortise cost across invocations |
Long-Horizon Patterns
Sessions that run for extended periods face a fundamental challenge: the context window fills up, and performance degrades. Three patterns address this.
Compaction
When context approaches capacity, summarise the session state and continue in a fresh context seeded with the compressed summary. The compaction process should:
- Preserve architectural decisions, unresolved issues, and implementation details
- Discard redundant tool outputs and superseded information
- Enable continuation with minimal performance degradation
Start by maximising recall — ensure the compaction prompt captures every relevant piece of information from the trace. Then iterate to improve precision by removing noise.
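The control flow around compaction can be sketched as follows. The `summarise` function stands in for an LLM call with a compaction prompt; the word-count token proxy, the 80% trigger threshold, and the five verbatim recent turns are illustrative choices:

```python
# Sketch of a compaction loop: when the transcript nears the window
# limit, replace older messages with a model-written summary.
def count_tokens(messages: list[str]) -> int:
    # Crude proxy; a real implementation would use the model's tokenizer.
    return sum(len(m.split()) for m in messages)

def summarise(messages: list[str]) -> str:
    # Placeholder for an LLM call whose prompt preserves decisions and
    # open issues while dropping stale tool output.
    return f"[summary of {len(messages)} messages]"

def maybe_compact(messages: list[str], limit: int = 100_000) -> list[str]:
    if count_tokens(messages) < int(limit * 0.8):  # trigger before the hard limit
        return messages
    recent = messages[-5:]                         # recent turns stay verbatim
    return [summarise(messages[:-5])] + recent
</```

Triggering before the hard limit matters: compaction itself consumes context, so it must run while there is still headroom.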
Structured Note-Taking
Agents maintain persistent notes outside the context window, reading them back when relevant. Applications:
- Progress tracking across complex tasks
- Strategic information preserved for later reference
- Multi-hour sessions where prior notes survive context resets
The key is writing notes at the right granularity — too detailed wastes tokens when read back, too sparse loses critical information.
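A minimal sketch of a note store outside the context window, read back selectively by topic. The JSONL file and the topic/text schema are assumptions for illustration:

```python
# Structured note-taking: append notes to persistent storage; read back
# only the entries relevant to the current step.
import json
from pathlib import Path

NOTES = Path("agent_notes.jsonl")

def write_note(topic: str, text: str) -> None:
    with NOTES.open("a") as f:
        f.write(json.dumps({"topic": topic, "text": text}) + "\n")

def read_notes(topic: str) -> list[str]:
    """Recall costs tokens proportional to relevance, not to session length."""
    if not NOTES.exists():
        return []
    with NOTES.open() as f:
        return [n["text"] for line in f
                if (n := json.loads(line))["topic"] == topic]
```

Because the notes live on disk rather than in the window, they survive compaction and context resets.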
Sub-Agent Architectures
Specialised agents handle focused tasks with clean context windows. The coordination pattern:
- Main agent plans and coordinates at a high level
- Sub-agents perform deep work, potentially consuming tens of thousands of tokens
- Each sub-agent returns a condensed summary (typically 1,000–2,000 tokens)
- Main agent synthesises results without inheriting the sub-agents' full context
This achieves clear separation of concerns — detailed search context remains isolated within sub-agents, while the lead agent focuses on synthesis.
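The coordination pattern reduces to a simple shape in code. Here `run_subagent` stands in for spawning a model with a fresh context window; the tasks and summary format are illustrative:

```python
# Sub-agent orchestration sketch: deep work happens in isolated contexts;
# only condensed summaries cross back to the lead agent.
def run_subagent(task: str) -> str:
    # Imagine tens of thousands of tokens of search and tool use here;
    # the full trace stays private to this sub-agent.
    trace = f"[long exploration for: {task}]"
    return f"{task}: done ({len(trace)} chars of trace kept private)"

def lead_agent(plan: list[str]) -> str:
    # The lead agent sees one short summary per task, never full traces,
    # and synthesises from those alone.
    summaries = [run_subagent(task) for task in plan]
    return " | ".join(summaries)
```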
Context Rot
As the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases. This stems from the transformer architecture: with n tokens, attention must model n² pairwise relationships, and models trained predominantly on shorter sequences handle full context-width dependencies less reliably.
The degradation is a performance gradient, not a hard cliff. Practical implications:
- Treat context as a finite resource, not an unlimited buffer
- Front-load the most important information (system prompt, key constraints)
- Actively prune stale or redundant content during long sessions
- Prefer retrieval over retention — fetch what you need rather than keeping everything loaded
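Active pruning can be sketched as deduplicating superseded tool results, keeping only the most recent output per tool-call signature. The message shape (`role`, `tool`, `args`, `content`) is an assumption:

```python
# Prune stale context: drop older tool results that a newer call to the
# same tool with the same arguments has superseded.
def prune_stale(messages: list[dict]) -> list[dict]:
    seen: set[tuple] = set()
    kept = []
    for m in reversed(messages):                  # walk newest-first
        key = (m.get("tool"), str(m.get("args")))
        if m.get("role") == "tool":
            if key in seen:
                continue                          # older duplicate: drop it
            seen.add(key)
        kept.append(m)
    return list(reversed(kept))
```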
The Guiding Principle
Across all techniques, the goal is the same: find the smallest set of high-signal tokens that maximise the likelihood of the desired outcome. As models improve, they require less prescriptive engineering and can operate with more autonomy — but context remains a precious, finite resource that rewards careful curation.