Designing prompts for production: system, developer, and user layers

Production prompts are not 'tell the AI what you want'. They are a layered system — stable instructions, dynamic context, per-call variables — managed like code. The architecture, the patterns, and the discipline that distinguishes production from prototype.

Outcome: Separate system, developer, and user instructions and test production prompts as versioned system components.

Advanced12 min read

In a prototype, a prompt is a string you wrote one afternoon. In production, that approach falls apart in roughly the first month.

You'll want to change one part of the instructions but not others. You'll want different behavior for different customer tiers. You'll want to A/B test versions. You'll want to roll back when something breaks. You'll want to know when the prompt last changed, and why.

A production prompt system handles all of this. It's not "write a string." It's an architecture.

This article covers what that architecture looks like — the three layers, the templating discipline, the version control, the evaluation, and the operational practices that turn prompts from artifact into infrastructure.

Production prompt work has two separate artifacts: reusable templates and per-request data. Version templates like code. Treat rendered prompts and model responses as sensitive logs whenever they contain user, customer, or internal data.

The three layers

Production prompts have three distinct layers, each with different concerns:

System layer. Stable behavior, identity, constraints. Changes rarely. Owned by the team designing the AI behavior.

Developer layer. Per-feature instructions, tool descriptions, output format requirements. Changes when features change. Owned by the feature team.

User layer. The user's specific request, plus dynamic context (their data, conversation history, retrieved knowledge). Different every call.

Mixing these is the most common production prompt mistake. The system prompt grows to 5,000 words mixing identity, feature instructions, and dynamic context, and now changing one thing breaks others.

Separating them is foundational:

┌─────────────────────────────────────┐
│ System prompt (stable)              │  Identity, behavior, hard constraints
├─────────────────────────────────────┤
│ Developer prompt (per-feature)      │  Feature instructions, tools, format
├─────────────────────────────────────┤
│ User prompt (per-call)              │  User query, context, conversation
└─────────────────────────────────────┘

The model APIs explicitly support this:

OpenAI: system, developer, user roles.
Anthropic: system, then messages with user and assistant roles. Tool descriptions are a separate parameter.
Gemini: systemInstruction, then contents with roles.

Use these distinctions intentionally.

Layer 1: The system prompt

The system prompt defines who the AI is and how it behaves. It changes rarely.

A good system prompt covers:

Identity. Who the AI is. "You are an AI assistant for [Company], specializing in [domain]."

Voice and style. How it should sound. Specific traits, not vague descriptors.

Hard constraints. Things it must never do. Output certain content, make certain decisions, ignore certain instructions.

Behavioral patterns. How it handles common situations. Refusals, escalations, uncertainty.

Safety and compliance. Required disclosures, regulatory rules, content policies.

What it should NOT contain:

Feature-specific instructions ("for sales emails, do X").
Dynamic context ("the user's order history is...").
Tool descriptions (those go elsewhere).
Things that change frequently.

A good system prompt is 300-1000 words. Longer and it becomes hard to manage; shorter and you're under-specifying behavior.

A template that works:

You are [name], an AI assistant for [company / context].

## Your role
[2-3 sentences on what you do]

## Voice and style
- [Specific trait 1]
- [Specific trait 2]
- [Specific trait 3]
- Do not [anti-pattern 1]
- Do not [anti-pattern 2]

## Hard constraints
- Never [hard rule 1]
- Never [hard rule 2]
- Always [hard rule 3]

## How to handle uncertainty
- If you don't know something factual: say so explicitly.
- If a user asks for something outside scope: offer what you can help with.
- If a request might cause harm: refuse and explain why.

## Format expectations
- Plain text by default
- Use markdown when displaying code or structured data
- Be concise; do not pad responses with filler

This is the spine. Every interaction goes through it. Changes are deliberate and infrequent.

Layer 2: The developer prompt

The developer prompt is feature-specific. Different features have different developer prompts.

A summarization feature's developer prompt:

Task: produce a summary of the document below.

Requirements:
- 3-5 bullet points
- Each bullet is one complete sentence
- Focus on facts and concrete claims, not impressions
- If the document contains numbers, include the most important ones
- Do not include marketing language or speculation
- If the document is ambiguous about something important, note it

Format: plain markdown bullets, no preamble.

A code review feature's developer prompt:

Task: review the code diff below.

Output a JSON object with:
- summary: 1-2 sentence overview of the change
- concerns: array of specific issues (each: file, line, severity, description)
- suggestions: array of improvements (each: file, line, suggestion)
- approved: boolean (true if no blocking concerns)

Severity levels:
- "blocker": must be fixed before merge
- "warning": should be addressed but not blocking
- "nit": stylistic, optional

Focus on:
- Logic errors
- Security issues
- Performance issues
- Missing test coverage
- Unclear naming or structure

Skip:
- Formatting (handled by formatter)
- Subjective style preferences

Each feature has its own developer prompt. They're stored separately, versioned separately, evaluated separately.

Layer 3: The user prompt

The user layer is dynamic. It typically includes:

The user's actual request. "Summarise this document for me."

Context the system retrieved. Documents from RAG, customer history, conversation history.

Per-call variables. User name, timezone, language preference, account tier.

This layer is constructed programmatically at call time. The structure usually looks like:

{conversation_history_summary}

{retrieved_context}

User's request: {user_query}

Additional context:
- User name: {name}
- User timezone: {timezone}
- User tier: {tier}

The exact structure depends on the feature. The principle: data goes here, not in the system or developer prompts.

Templating discipline

Prompts in production are built from templates. Concatenating strings inline is the prototype approach; it doesn't scale.

A simple template system:

from string import Template

SUMMARIZE_TEMPLATE = Template("""
$conversation_summary

Document to summarize:
$document

User's specific instructions: $user_instructions
""")

prompt = SUMMARIZE_TEMPLATE.substitute(
    conversation_summary=summarize_conversation(history),
    document=document_text,
    user_instructions=user_query,
)

More sophisticated: a templating library (Jinja2, Handlebars) with conditionals and partials.

{% if user_tier == "enterprise" %}
You have access to advanced analysis features.
{% endif %}

{% if retrieved_context %}
Relevant context from your knowledge base:
{{ retrieved_context }}
{% endif %}

User's request: {{ user_query }}

Templating prevents prompt-injection through variables (escape user input where appropriate), enables conditional logic, and keeps prompt structure consistent.

Version control

Prompts are code. Store them in source control.

A pattern that works: a prompts/ directory in your repo, with one file per prompt:

prompts/
  system/
    main.txt
    customer-support.txt
    code-assistant.txt
  features/
    summarize.txt
    classify-ticket.txt
    generate-email.txt
  templates/
    base.j2

Each file is a separate prompt, with its own commit history. Changes are reviewed via PR. Production deploys reference specific versions.

Why this matters:

Diff visibility. When a prompt changes, the diff is in the PR. Reviewers can see exactly what changed.
Rollback. When a change breaks something, you can revert.
History. "When did we change the refund policy in the prompt?" "Why is this paragraph here?" — answerable via git blame.
Tooling. Linters, validators, eval suites all integrate with file-based prompts.

Avoid: prompts stored as strings in code (hard to find, hard to diff), prompts stored in a UI tool (versioning is the tool's, not yours), prompts pasted from people's chat windows (untracked, untestable).

Do not commit production conversations, customer records, support tickets, internal documents, or rendered prompts containing sensitive variables. Source control is for reusable templates, fixtures, and sanitized eval examples. Real traces belong in an observability store with retention, access control, and redaction.

Prompt as data: external storage

For prompts that change frequently — A/B tests, user-tier variations, locale-specific prompts — file-based source control is too slow.

Pattern: a database or service that stores prompt versions with metadata.

prompt = prompt_service.get(
    name="summarize",
    version="v3",
    locale="en",
    user_tier="enterprise",
)

The service maintains:

Current and historical versions of each prompt.
Metadata: when added, by whom, why.
Eval scores attached to each version.
Rollback capability.

Tools: PromptLayer, Helicone, in-house. For most teams, in-house with a simple DB schema works fine.

The interface is critical. Engineers and non-engineers (product, content) should be able to edit prompts. But changes go through review and pass evals before going live.

Eval-gated changes

Every prompt change goes through evals before deploying. This is non-negotiable for serious production systems.

The flow:

Engineer or non-engineer drafts a prompt change.
The change runs against the eval suite.
Eval results are reviewed alongside the change.
If evals pass (no regressions, ideally improvements), the change can be approved.
Approved changes deploy.
Post-deploy monitoring catches anything evals missed.

In practice, this means every prompt has an eval suite, and the suite runs in CI on prompt changes.

Without this gate, prompt changes break things in unpredictable ways. With it, you can move quickly and confidently.

A practical release checklist

Before a prompt version goes live, require a short checklist:

| Check | Requirement | | --- | --- | | Ownership | Prompt has a named owner and reviewer. | | Instruction layers | System, developer, and user/context data are separated. | | Schema | Structured outputs have a schema and failure path. | | Injection handling | User-provided content is clearly delimited and never treated as instruction. | | Evals | Candidate prompt passes the regression set and safety cases. | | Logs | Template version, model, latency, cost, and redacted inputs/outputs are observable. | | Rollback | Previous known-good version can be restored without code surgery. |

The companion checklist linked from this article turns these checks into a repeatable release review.

A/B testing in production

For new prompts, A/B testing them against the existing version on a small fraction of production traffic gives you real-world signal beyond evals.

Pattern:

95% of traffic uses production prompt v3.
5% gets new candidate v4.
Measure: user feedback, downstream metrics, eval scores on real traffic.
After sufficient data, decide: roll v4 to 100%, or keep v3.

Tools: feature flags (LaunchDarkly, in-house), prompt-versioning services (PromptLayer), custom routing.

Caveats:

A/B testing only catches signals you measure. If you don't have user feedback or downstream conversion metrics, A/B testing tells you little.
Statistical significance requires volume. For low-volume features, A/B is hard.
Don't run too many A/B tests at once; interactions get confusing.

Observability of prompts

Every production LLM call should log:

Which prompt template was used (name, version).
Which variables were substituted, using allowlisted names and redacted values where needed.
The final rendered prompt only when policy permits it; otherwise store a redacted, sampled, or hashed representation.
The model's response, redacted or sampled for sensitive workflows.
Latency, tokens, cost.
Downstream signals (user feedback, success metrics).

This is the data you need to debug "why did the model give a weird answer to this user?" Without it, you're guessing.

Storage: a database table or observability tool. Costs are real if you log everything (call volume × token count × storage). Privacy risk is also real if you log everything. Decide per workflow which fields are safe to store, redact secrets and personal data by default, and keep retention short unless there is a compliance reason to keep traces longer. Some teams sample.

Reviews: a regular practice (weekly) of reading a sample of real production prompts and responses. Catches issues evals don't.

Prompt anti-patterns

A few patterns to avoid:

Anti-pattern 1: The mega-prompt. A 10,000-word system prompt that tries to handle every situation. Hard to change, hard to debug, often gets ignored by the model on later instructions.

Fix: separate into layered, focused prompts. One per feature.

Anti-pattern 2: Inline string concatenation.

prompt = "You are helpful. " + (
    "The user is a paid customer. " if user.tier == "paid" else ""
) + f"Their name is {user.name}. " + ...

Fragile, hard to read, prone to injection.

Fix: templating system.

Anti-pattern 3: Same prompt for too many use cases.

A single "general assistant" prompt that's used for email drafting, code review, customer support, and research. Each is a different task; one prompt doesn't optimize for any of them.

Fix: feature-specific developer prompts on top of a shared system prompt.

Anti-pattern 4: Hard-coded prompts.

response = openai.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant..."},
        {"role": "user", "content": query}
    ]
)

The prompt is buried in the code. Can't be edited without a deploy. Can't be A/B tested. Can't be versioned independently.

Fix: extract to a prompt file or service.

Anti-pattern 5: No eval coverage.

A feature ships with a prompt that's never been tested systematically. Quality is "vibes." Drift is undetectable.

Fix: every prompt has an eval suite.

Anti-pattern 6: Mixing data into the system prompt.

You are an assistant for John, a premium customer who joined in 2023, lives in Tallinn, and has 47 open tickets.

Now the system prompt changes every call. Caching breaks. Confusion abounds.

Fix: dynamic data goes in the user/context layer, not the system prompt.

Anti-pattern 7: Instructions buried in the middle.

Help the user with their request. Be polite. Format output as JSON. Don't use markdown. The user is asking about pricing, so be careful about quoting numbers. Output should be 1-2 sentences. Now help them.

Important instructions get lost. The model might miss them.

Fix: structure with clear sections, place critical instructions at the start and end (recency effect helps).

Specific patterns for common features

A few feature-specific patterns:

Classification

Task: classify the following text into one of these categories:
- billing: payment, refund, subscription
- technical: bug, error, integration issue
- account: login, password, profile changes
- feature_request: new functionality requests
- complaint: general dissatisfaction without specific actionable issue

Output a JSON object: {"category": "<one of above>", "confidence": "<high|medium|low>", "reasoning": "<1 sentence>"}

Text to classify:
{text}

Patterns: enumerated categories with definitions, structured output, confidence and reasoning fields.

Extraction

Task: extract structured data from the document below.

Schema:
- vendor_name: company that issued the invoice
- invoice_number: as printed on the document
- date: ISO 8601 format
- line_items: array of {description, quantity, unit_price, total}
- subtotal, tax, total: numbers

Rules:
- If a field is not present, use null
- Numbers should be numeric, not strings
- For ambiguous cases, set "needs_review": true and explain

Document:
{document}

Patterns: explicit schema, type expectations, handling of missing data, escalation for ambiguity.

Generation with style

Task: write a {format} on the topic of {topic}, targeting {audience}.

Style:
- {Specific style trait 1}
- {Specific style trait 2}
- Avoid: {anti-pattern 1}, {anti-pattern 2}

Constraints:
- Length: {N} words
- Include: {required elements}
- Exclude: {forbidden elements}

Voice reference:
[Provide a sample of the desired voice]

Output: the {format} only, with no preamble or post-script.

Patterns: specific style traits (not generic), explicit constraints, voice anchored with reference sample.

Agent loop

You have access to the following tools:
{tool_descriptions}

For each turn:
1. Think about what you need to do.
2. Decide if you need a tool. If yes, call it.
3. After observing the result, decide if you need more tools or can answer.
4. When you have enough information, produce the final answer.

Constraints:
- Maximum 5 tool calls per request.
- If after 5 calls you can't complete, explain what's missing.
- Never invent tool names or arguments.
- Verify tool results before acting on them.

User request:
{user_query}

Patterns: explicit reasoning steps, tool budget, anti-hallucination, reflection.

The team aspect

Prompts in production usually involve multiple people:

Engineers wire the prompts into the system, maintain templates, manage deployments.
Product defines what the prompts should accomplish.
Content/marketing owns voice and style guidance.
Domain experts know what's right for specific use cases (legal language, medical terms, etc.).

A useful pattern: a "prompt review" process similar to code review, with the right reviewers for each domain. Voice changes get content's review. Logic changes get engineering's review. Domain-specific content gets the expert's review.

For sensitive use cases (legal, medical, financial), prompts may need formal review and sign-off. Build the process accordingly.

A 90-day prompt maturity plan

For teams moving from "prompts are strings in code" to "prompts are managed infrastructure":

Days 1-30: Foundation.

Extract all prompts to dedicated files in source control.
Establish the 3-layer pattern (system / developer / user).
Build a simple templating layer.
Set up basic logging of prompts and responses.

Days 31-60: Evaluation.

Build eval suites for the top 5 prompts.
Run evals on prompt changes (manually first).
Set up CI integration for evals (auto-run on PRs).

Days 61-90: Operations.

Implement prompt versioning (database or service).
Add A/B test capability for at least one critical prompt.
Build dashboards for production prompt quality.
Establish review process for prompt changes.

After 90 days, prompts are managed infrastructure. Changes are deliberate, testable, reviewable, rollback-able. Quality is measurable. Drift is detectable.

The takeaway

Production prompts aren't strings. They're a layered system with discipline around versioning, templating, evaluation, and observability.

The three-layer architecture (system / developer / user) separates concerns and keeps prompts maintainable. Templating prevents fragility. Source control or a prompt service provides version history. Evals gate changes. Observability catches what evals miss.

This isn't optional in serious production work. Teams that skip these steps end up with prompt chaos — strings scattered through code, no idea what version is in production, no measurement of quality, and constant unexplained behavioral changes.

Teams that invest in prompt infrastructure end up with AI behavior that's controllable, measurable, and improvable. That's the difference between a feature that ages well and one that becomes technical debt.

Start with the architecture. Everything else gets easier.