Structured outputs and function calling: the production patterns

Structured outputs and function calling are the bridge from 'LLM that generates text' to 'system that does work'. In production, the patterns that matter are about schemas, error handling, idempotency, and graceful degradation — not just JSON mode.

Advanced13 min read

The shift from "LLM chatbot" to "LLM-powered system that does work" happens at the structured output and function calling layer. This is where the LLM stops just producing prose and starts producing data, deciding actions, and integrating with the rest of your infrastructure.

In production, "structured output" doesn't mean "JSON mode worked once in my test." It means a robust pipeline that handles model variance, malformed outputs, partial failures, schema evolution, and the messy reality of LLMs that don't always follow instructions.

This article covers the patterns that actually hold up in production. We assume you know the basics (you've used OpenAI's tool_choice, Anthropic's tool use, JSON schema constraints). We're going deeper into what makes these systems reliable.

The two modes

Two related but distinct capabilities:

Structured output: the LLM produces output conforming to a schema (typically JSON). Used when you need the LLM's response in a programmatic format.

Function/tool calling: the LLM is given a set of functions it can call, decides which to call (if any), produces parameters for the call. The host system executes the function and returns results. The LLM can call further functions or produce a final response.

The model APIs typically expose these via:

A response_format (or equivalent) parameter taking a JSON schema, for plain structured outputs — OpenAI's "Structured Outputs", Google Gemini's responseSchema, etc.
A tools array describing available functions, plus a tool-call response when invoked. Anthropic's tool-use API doubles as the way you constrain output to a schema — you define a single tool with the schema you want and force the model to call it.

Both work; they're related. A "function call" is essentially a structured output where the schema is the function signature.

Pattern 1: Tight, explicit schemas

The single biggest reliability win is in your schemas.

A loose schema:

{
  "type": "object",
  "properties": {
    "category": { "type": "string" },
    "priority": { "type": "string" }
  }
}

A tight schema:

{
  "type": "object",
  "properties": {
    "category": {
      "type": "string",
      "enum": ["billing", "technical", "account", "feature_request", "complaint"],
      "description": "The ticket category. Use 'technical' for product bugs and 'account' for login/password issues."
    },
    "priority": {
      "type": "string",
      "enum": ["low", "medium", "high", "urgent"],
      "description": "Use 'urgent' only for outages or business-critical impact. 'High' for blocking issues on important customers. 'Medium' for standard impact. 'Low' for nice-to-haves."
    }
  },
  "required": ["category", "priority"],
  "additionalProperties": false
}

The tight version:

Restricts values to known enums (no free-form drift).
Includes descriptions that act as inline prompts (the model uses them).
Requires fields (so you don't get partial responses).
Forbids extra fields (no random hallucinated keys).

In production, every schema field should have a description. Every enum should be explicit. Every required field should be marked. This is "schema as prompt" — your schema is doing prompt engineering.

Pattern 2: Constrained generation

Major providers now support constrained generation — the model is constrained at the decoding level to produce only valid outputs.

OpenAI: response_format: { type: "json_schema", json_schema: { ..., strict: true } }
Anthropic: Tools with strict schemas.
Open-source: outlines, lm-format-enforcer, jsonformer, vLLM's grammar-based decoding.

Use these. Always. They eliminate an entire class of failures (malformed JSON, hallucinated fields, missing required fields). The performance overhead is negligible.

When constrained generation isn't available (some open models, some configurations), validation + retry is the fallback (see Pattern 4).

Pattern 3: Schema versioning

Schemas evolve. You add fields. You deprecate fields. You change enums.

A schema change is a code change. It should be:

Versioned. Tag each schema with a version number.
Tested. Eval suite runs against the new schema before deploy.
Communicated. Downstream consumers of the output know about the change.
Backward-compatible when possible. Add new optional fields; don't remove required ones.

A pattern we've seen work: store schemas as TypeScript types or Pydantic models, version them in source control, generate JSON Schema from them. The types serve both the model API and your application code.

class TicketClassificationV2(BaseModel):
    category: Literal["billing", "technical", "account", "feature_request", "complaint"]
    priority: Literal["low", "medium", "high", "urgent"]
    confidence: float = Field(ge=0, le=1, description="Confidence in this classification, 0-1")
    needs_human_review: bool = Field(description="True if any field has low confidence or unusual signal")
    reasoning: str = Field(description="Brief reasoning for the classification, especially for non-obvious cases")

A Pydantic model defines the schema, validates the output, and serves as the type in your Python code. One source of truth.

Pattern 4: Validation and retry

Even with constrained generation, validate the output before using it:

from pydantic import ValidationError

def call_with_validation(prompt, schema, max_retries=2):
    for attempt in range(max_retries + 1):
        response = llm_call(prompt, response_format=schema)
        try:
            parsed = schema.model_validate_json(response.content)
            return parsed
        except ValidationError as e:
            if attempt < max_retries:
                prompt = build_retry_prompt(prompt, response.content, e)
                continue
            raise

The retry prompt should include the original instructions, the model's previous output, and a specific description of what went wrong:

Your previous response had a validation error:
{error message}

Your previous output:
{previous output}

Please correct the issue and produce a valid response.

Retries work surprisingly well — usually one retry recovers from a model mistake.

Limits: don't retry forever (max 2-3). Don't retry on non-validation errors (rate limits, content filter, etc.). Log retries so you can monitor the rate (rising retry rate signals model drift or prompt issues).

Pattern 5: Result reflection

For high-stakes function calls, have the model reflect on tool results before using them.

A bare loop:

1. Call LLM with tools available.
2. Model decides to call tool X.
3. Execute X.
4. Pass result back to model.
5. Model produces final response.

A reflection loop:

1. Call LLM with tools available.
2. Model decides to call tool X.
3. Execute X.
4. Pass result back to model.
5. Model evaluates: does this result match what I expected? Should I act on it?
6. If yes, model produces final response. If no, model calls another tool or asks for clarification.

This catches cases like:

Tool returned 0 results when it should have returned data → model recognises the empty case.
Tool returned an error → model handles it explicitly rather than ignoring.
Tool returned unexpected data → model notices and adapts.

Implementation: prompt the model to evaluate tool results explicitly, possibly via a structured "evaluate then act" pattern.

This adds latency and tokens. For high-stakes actions (sending email, processing payment, modifying records) it's worth it. For low-stakes information retrieval, skip it.

Pattern 6: Idempotency

LLMs sometimes call the same tool twice, or retry calls that already succeeded. Without idempotency, you get duplicates: two refunds, two emails sent, two records created.

Patterns for idempotency:

Idempotency keys. Each tool call gets a unique key (generated client-side, included in the call). The downstream API or your tool wrapper uses the key to detect duplicates and return the existing result.

Get-or-create semantics. Tools that create records do a check-first lookup. "Create customer with email X" first checks if a customer with email X exists; if so, returns the existing one instead of creating a duplicate.

Operation logs. Tools log every operation. The wrapper checks the log before executing; if it's already done, return the cached result.

Conservative tool design. Tools that perform consequential actions are designed to require explicit confirmation or human approval. The LLM cannot trigger them in a tight loop accidentally.

For any tool that has side effects, design for idempotency. Skipping this is a major source of production bugs.

Pattern 7: Tool-call observability

You need to know what's happening with tool calls. For each call, log:

Timestamp.
Tool name and arguments.
Result (or error).
Duration.
The user/session it belongs to.
The chain of calls in this turn (was this tool call part of a longer chain?).

Build dashboards on this data. Common views:

Tool call volume by tool.
Error rate by tool.
Average duration by tool.
Patterns of tool sequences ("what tools tend to be called together?").
Hallucinated tool calls (LLM tried to call a tool that doesn't exist).

These reveal where your system is failing and where it's expensive.

Pattern 8: Hallucinated arguments

LLMs sometimes invent values for tool parameters. They'll call search_customers(email="...") with an email that doesn't match the user's actual question. Or call book_meeting(date="...") with a date that wasn't mentioned.

Mitigation:

Strict schema with descriptions. "The user_id must be one mentioned earlier in the conversation. Do not invent IDs."

Validation in the tool wrapper. If the value isn't plausible (e.g., user_id doesn't exist, date is in the past), the tool returns a structured error and the model reconsiders.

Reflection. "Before calling this tool, confirm the values you're using are grounded in the conversation."

Restricted tool descriptions. Tools that operate on specific entities only expose entity-IDs that were retrieved earlier in the conversation. Don't expose raw search.

Audit logs. Catch patterns of hallucinated arguments and adjust prompts/schemas.

Pattern 9: Graceful degradation

Tools fail. APIs go down. Rate limits hit. The right response is rarely "tell the user nothing works."

Patterns:

Cached or stale data. If the live data source is unavailable, return cached data with a note that it's stale.

Partial completion. If 3 of 5 sub-tasks succeed, report what was done and what wasn't.

Fallback paths. If the primary tool fails, the model knows about a fallback. E.g., if "search_documents" fails, fall back to "search_web" with appropriate caveats.

User-visible error states. If a tool truly cannot complete, the model produces a clear error message to the user — not a hallucinated success.

The model needs to know about these patterns. Document them in the system prompt:

If a tool returns an error:
- Try the alternate tool if one exists.
- Report partial results clearly if the user has already provided information.
- Never claim success when a tool returned an error.

Pattern 10: Streaming structured outputs

For UX, streaming partial structured output is great — the user sees results forming in real-time rather than waiting.

Implementation:

Most modern model APIs stream JSON output token by token.
Parse the partial JSON incrementally (libraries like partial-json-parser or write a small streaming parser).
Update the UI as fields arrive.

This works especially well for outputs with multiple sections. A long product description, an analysis with multiple insights, a code review with several findings — they feel much more responsive when streamed.

Caveat: don't make decisions on partial output. Stream for display; wait for completion before acting on the structured result.

Pattern 11: Function calling vs explicit "decide" calls

Native function calling is convenient — the model "decides" when to call a tool. But for some workflows, an explicit "decide" call is more reliable.

Example: a customer support workflow where the model needs to decide between several actions.

Native function calling approach: Give the model 5 tools (refund, send_article, escalate_to_human, ask_clarifying_question, close_ticket). Let it decide.

Explicit decide call approach: First call the model with a single tool decide_action that takes one parameter: which action. Then, based on the decision, call the model again with only the relevant tool.

The explicit approach is slower and more verbose, but more reliable. The model is more focused at each step. The host system has more control over the workflow.

For high-stakes workflows, the explicit approach often wins. For exploratory or simple workflows, native function calling is fine.

Pattern 12: Tool result formatting

How you return tool results matters. The model is reading the result; format matters.

Bad:

{"id": "cus_123", "n": "John", "p": "12345"}

Better:

{
  "customer_id": "cus_123",
  "name": "John Doe",
  "phone": "+1-555-0123",
  "tier": "premium",
  "open_tickets": 0
}

Best (for some cases):

Customer found:
- ID: cus_123
- Name: John Doe
- Tier: Premium
- Phone: +1-555-0123
- Open tickets: 0

This customer is in the premium tier and has no open tickets.

The "best" form is human-readable, includes context, and is easier for the model to use in subsequent generation. The "better" form is more structured and machine-readable. Use whichever format the model handles best for your downstream tasks (test it).

For some tools, returning both structured and narrative ("Here's the result: [narrative]. Raw data: [JSON]") works well.

Pattern 13: Schema-aware retries

Some validation errors are unrecoverable (the model fundamentally misunderstood the task). Others are easy fixes.

A useful pattern: classify the error and respond accordingly.

def handle_validation_error(error):
    if "missing required field" in str(error):
        return retry_with_message("You omitted required field X. Please include it.")
    elif "value not in enum" in str(error):
        return retry_with_message("Value X is not in the allowed set. Choose from: ...")
    elif "type mismatch" in str(error):
        return retry_with_message("Field X must be a number, not a string.")
    else:
        # Unknown error — single generic retry
        return retry_with_message("There was an error in your response. Please try again.")

Tailored retries succeed more often than generic retries.

Pattern 14: Compositionality

Tools should compose. Small, focused tools that do one thing can be combined by the model into complex workflows.

A monolithic process_customer_request(query) tool that does everything is a black box. The model can't observe or steer the internal logic.

A set of focused tools — search_customer(email), get_recent_orders(customer_id), check_subscription_status(customer_id), escalate_to_human(reason) — can be composed by the model into the right flow for each situation.

Design tools at the right granularity. Each tool does one thing. Tools compose into workflows.

Pattern 15: Schema for "I don't know"

A subtle pattern: explicitly model uncertainty in the schema.

class CustomerInfo(BaseModel):
    name: str
    name_confidence: Literal["high", "medium", "low"]
    needs_clarification: bool
    clarification_question: Optional[str] = None

The model can return "low confidence" with a clarification question rather than confabulating.

This is much better than a model that always confidently fills in fields, sometimes with hallucinated data.

A worked example: invoice processing

To pull this together, here's a production-quality invoice processing system.

Inputs: PDF invoice attached to email. Goal: Extract structured data, route to accounting system.

Schema:

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float
    confidence: Literal["high", "medium", "low"]

class Invoice(BaseModel):
    vendor_name: str
    vendor_id: Optional[str] = None  # null if not found in our records
    invoice_number: str
    invoice_date: date
    due_date: Optional[date] = None
    line_items: List[LineItem]
    subtotal: float
    tax: float
    total: float
    currency: str  # ISO 4217
    confidence: Literal["high", "medium", "low"]
    needs_review: bool
    review_reasons: List[str]  # Specific reasons review is needed

Workflow:

OCR step: Vision model extracts text from PDF.
Extract step: LLM call with the schema above, constrained generation enabled.
Validation step: Pydantic validates. Validation errors trigger one retry with error feedback.
Cross-check step: Tool call to look up vendor in our records. If vendor_name matches a known vendor, attach vendor_id. If not, set needs_review=true.
Math check step: Verify sum(line_items.total) ≈ subtotal and subtotal + tax ≈ total. If not, set needs_review=true.
Confidence check step: If confidence is low, or any line item has low confidence, set needs_review=true.
Routing: If needs_review=true, send to human review queue. Otherwise, route to accounting system.
Logging: Every step logged with input, output, duration, errors.

Failure modes handled:

Malformed JSON: constrained generation prevents; retry handles edge cases.
Hallucinated fields: schema is strict.
Math errors: validated.
Unknown vendors: flagged.
Low confidence: flagged.
Tool errors: explicit handling.

Production performance: ~95% of invoices straight-through; 5% flagged for review. Of the auto-processed, error rate is <0.5% (well within acceptable). Of the review queue, ~80% confirmed correct, 20% need fixes.

This is what production-grade structured output looks like. Not just "JSON mode worked once." A pipeline that handles real-world failure modes.

Common mistakes

A few patterns we see repeatedly:

Mistake 1: No validation. Pydantic or zod or whatever — just validate. Don't trust the model.

Mistake 2: Vague descriptions. "category: string" doesn't help the model. "category: one of billing, technical, account_access, where billing covers..." does.

Mistake 3: Too many tools. 30 tools available, the model picks the wrong ones. Curate to <10 relevant tools per call.

Mistake 4: No retry on validation failure. A single malformed output kills the whole flow. Retry once with feedback.

Mistake 5: No observability. When tool calls fail in production, you can't diagnose without traces.

Mistake 6: No idempotency on side-effect tools. Duplicate refunds, duplicate emails. Predictable bug.

Mistake 7: Trusting LLM-decided arguments without validation. Hallucinated user IDs, hallucinated dates. Validate tool arguments before executing.

Mistake 8: Skipping schema versioning. Schema changes break downstream consumers. Version it.

The takeaway

Structured outputs and function calling are the bridge from "LLM that talks" to "LLM that does work." Done well, they unlock production AI. Done poorly, they break in interesting and expensive ways.

The patterns that matter: tight schemas, constrained generation, validation with retry, reflection on tool results, idempotency, graceful degradation, schema-aware error handling, and end-to-end observability.

These aren't optional polish. They're the difference between a demo and a production system. Build them in from the start.