MCP from scratch: build a production-ready server in TypeScript

Building a production Model Context Protocol server requires more than wiring up a few tools. This guide covers the patterns for schema design, auth, error handling, streaming, and observability, along with the production realities that make MCP servers useful at scale.


By mid-2026, MCP (Model Context Protocol) is the de-facto standard for connecting LLMs to tools. Anthropic introduced it; OpenAI, Google, and the broader ecosystem have adopted it. Cursor, Claude Desktop, ChatGPT, custom agents — all speak MCP.

If you want LLM agents to interact with your service, an MCP server is how. And once you've built one or two, you'll see that the protocol itself is small. The interesting engineering is in everything around it: schema design, error handling, auth, streaming, performance, observability.

This article is a deep dive into building production-grade MCP servers in TypeScript. We'll cover the patterns that hold up under real agent use, not just the protocol mechanics.

What MCP is, briefly

MCP is a client-server protocol where:

Servers expose tools, resources, and prompts.
Clients are typically LLM agents that consume them.

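The protocol uses JSON-RPC 2.0. Transports include stdio (for local processes) and Streamable HTTP (for remote servers; it replaced the earlier HTTP+SSE transport, which some clients still use). The specification defines OAuth-based authorization for HTTP transports; in practice, deployments also use API keys and similar schemes.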
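To make the wire format concrete, here is a sketch of the JSON-RPC 2.0 messages behind a single tool call. The tools/call method name and the content result shape follow the MCP specification; the echo tool name and the id value are only illustrative, and the envelope types are trimmed to what this example needs.

```typescript
// JSON-RPC 2.0 envelope types, reduced to the fields a tool call uses.
type JsonRpcRequest = {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: Record<string, unknown>;
};

type JsonRpcResponse = {
  jsonrpc: "2.0";
  id: number | string;
  result?: unknown;
  error?: { code: number; message: string };
};

// What a client sends to invoke a tool named "echo" (illustrative name).
const callRequest: JsonRpcRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "echo", arguments: { text: "hello" } },
};

// What the server sends back: the same id, with the tool output in content blocks.
const callResponse: JsonRpcResponse = {
  jsonrpc: "2.0",
  id: callRequest.id,
  result: { content: [{ type: "text", text: "hello" }] },
};

console.log(JSON.stringify(callRequest));
console.log(JSON.stringify(callResponse));
```

The SDK builds and parses these envelopes for you; seeing them once mostly helps when reading logs or debugging a transport.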

The server's job: expose useful capabilities to LLMs in a way they can discover and use.

The basic structure

Using the official @modelcontextprotocol/sdk package, a minimal server with the high-level McpServer API looks like:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({
  name: "my-server",
  version: "1.0.0",
});

server.registerTool(
  "echo",
  {
    description: "Echo back the provided text.",
    inputSchema: { text: z.string() },
  },
  async ({ text }) => ({
    content: [{ type: "text", text }],
  })
);

const transport = new StdioServerTransport();
await server.connect(transport);

(The same SDK also exposes the lower-level Server class plus setRequestHandler for ListToolsRequestSchema / CallToolRequestSchema if you want full control of the request handlers — but for most servers McpServer.registerTool is shorter and harder to get wrong.)

That's the skeleton. What you put into the tool handlers — and how — is where the work is.

Pattern 1: Tool design philosophy

The first decision: what tools do you expose, at what granularity?

A common failure: exposing your underlying API as tools. If you have 200 REST endpoints, exposing 200 tools is a disaster. Models with too many tools perform worse; tool descriptions become unwieldy; the protocol becomes a maze.

Better: design tools for the way agents want to use them. Each tool does one well-defined thing, takes well-defined inputs, returns well-defined outputs.

A few principles:

One concept per tool. Don't have a manage_customer tool that does 12 different things. Have search_customers, get_customer, update_customer_email, archive_customer — each focused.

Right granularity. Too granular and the agent needs many calls; too coarse and it can't precisely do what's needed. Aim for "operations a human would name."

Action verbs. search_documents, not documents. Tools should be named by what they do.

Read-vs-write distinction. Read tools are safer; write tools have side effects. Distinguish in naming (list_x vs create_x) and treat differently (require explicit confirmation, idempotency keys, etc.).

Aggregate when useful. A get_customer_profile that returns customer + recent orders + support tickets in one call is often better than three separate calls. The agent gets context in one shot.

For a server exposing, say, a customer support system, a reasonable tool set might be 8-15 tools. More is usually too many.
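As an illustration of the "aggregate when useful" principle, the sketch below combines three lookups into one profile call. The Backend interface and its method names are hypothetical stand-ins for your own data layer, not part of MCP or the SDK.

```typescript
// Hypothetical backend interface; replace with your real data layer.
type Backend = {
  getCustomer(id: string): Promise<{ id: string; name: string } | null>;
  listRecentOrders(customerId: string, limit: number): Promise<{ id: string; total: number }[]>;
  listOpenTickets(customerId: string): Promise<{ id: string; subject: string }[]>;
};

// One call that returns the context an agent usually needs together.
async function getCustomerProfile(backend: Backend, customerId: string) {
  const customer = await backend.getCustomer(customerId);
  if (!customer) return null;

  // The two follow-up lookups are independent, so they run in parallel.
  const [recentOrders, openTickets] = await Promise.all([
    backend.listRecentOrders(customerId, 5),
    backend.listOpenTickets(customerId),
  ]);

  return { customer, recent_orders: recentOrders, open_tickets: openTickets };
}
```

The agent spends one tool call instead of three, and the server controls how much of each list is returned.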

Pattern 2: Schema design

Every tool has an input schema (parameters the LLM must provide) and an output (what your tool returns). Schemas are not just for validation; they're prompt engineering.

Using Zod for input schemas:

const searchCustomersSchema = z.object({
  query: z.string().describe(
    "Search term: name, email, or company. Be specific to avoid too many matches."
  ),
  limit: z.number().int().min(1).max(50).default(10).describe(
    "Maximum results to return. Default 10, max 50."
  ),
  filters: z.object({
    tier: z.enum(["free", "pro", "enterprise"]).optional().describe(
      "Filter to specific customer tier"
    ),
    status: z.enum(["active", "trial", "churned"]).optional().describe(
      "Filter by customer status"
    ),
  }).optional(),
});

Notice:

Every field has a .describe(). The description is what the LLM reads.
Enums are explicit. Free-form strings are restricted where possible.
Defaults are sensible.
Constraints (min/max, length) are explicit.
Optional vs required is clear.

The descriptions matter enormously. "search term" is unhelpful; "Search term: name, email, or company. Be specific to avoid too many matches" is useful guidance to the LLM.
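It can help to look at the schema the way the model does. The SDK converts Zod schemas to JSON Schema before listing tools; the object below is roughly what the search schema above turns into (the exact output depends on the converter version, so treat it as an approximation). The missingDescriptions helper is a hypothetical lint, not an SDK function.

```typescript
type JsonSchema = {
  type?: string;
  description?: string;
  properties?: Record<string, JsonSchema>;
  [key: string]: unknown;
};

// Approximate JSON Schema for the search_customers input.
const searchCustomersJsonSchema: JsonSchema = {
  type: "object",
  properties: {
    query: {
      type: "string",
      description: "Search term: name, email, or company. Be specific to avoid too many matches.",
    },
    limit: {
      type: "integer",
      minimum: 1,
      maximum: 50,
      default: 10,
      description: "Maximum results to return. Default 10, max 50.",
    },
    filters: {
      type: "object",
      properties: {
        tier: { type: "string", enum: ["free", "pro", "enterprise"], description: "Filter to specific customer tier" },
        status: { type: "string", enum: ["active", "trial", "churned"], description: "Filter by customer status" },
      },
    },
  },
  required: ["query"],
};

// Returns the dotted paths of properties that have no description.
function missingDescriptions(schema: JsonSchema, path = ""): string[] {
  const missing: string[] = [];
  for (const [name, prop] of Object.entries(schema.properties ?? {})) {
    const fullPath = path ? `${path}.${name}` : name;
    if (!prop.description) missing.push(fullPath);
    missing.push(...missingDescriptions(prop, fullPath));
  }
  return missing;
}

console.log(missingDescriptions(searchCustomersJsonSchema)); // → ["filters"]
```

The lint flags the filters object itself, which has no .describe() in the schema above. Running a check like this in CI is a cheap way to keep descriptions from going missing as tools evolve.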

Pattern 3: Output shape

Output is what the LLM sees and acts on. Good output design dramatically improves LLM behavior.

Structured outputs.

type SearchResult = {
  customers: Customer[];
  total_matches: number;
  truncated: boolean;
  next_page_cursor?: string;
};

With context.

{
  customers: [...],
  total_matches: 47,
  truncated: true,
  next_page_cursor: "abc",
  message: "Found 47 matches; showing first 10. Use next_page_cursor to get more."
}

The message field is human-readable guidance. LLMs use it.

With errors handled gracefully.

{
  error: "ambiguous_query",
  message: "Search term 'john' matched 247 customers. Please be more specific.",
  suggestion: "Try including a company name or email domain.",
  partial_results: [...]  // top 3 by relevance, optional
}

The error is structured (machine-readable), but also includes a message and a suggestion (LLM-readable). The LLM can adapt — either ask the user for clarification or refine the query.

Sized appropriately.

A tool that returns 10,000 records is unusable. The LLM's context can't fit it; even if it could, the LLM won't use it well. Always paginate, truncate, or summarize. Return enough for the LLM to make a decision, not everything that exists.
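A minimal pagination sketch, assuming results are already ranked and held in an array. The cursor here is just a stringified offset to keep the example short; a real server would more likely use an opaque or signed cursor. The paginate helper and its field names mirror the output shape shown earlier but are otherwise illustrative.

```typescript
type Page<T> = {
  items: T[];
  total_matches: number;
  truncated: boolean;
  next_page_cursor?: string;
  message: string;
};

// Invalid or missing cursors fall back to the first page.
function decodeCursor(cursor?: string): number {
  const offset = Number(cursor ?? "0");
  return Number.isInteger(offset) && offset >= 0 ? offset : 0;
}

function paginate<T>(all: T[], limit: number, cursor?: string): Page<T> {
  const start = decodeCursor(cursor);
  const items = all.slice(start, start + limit);
  const end = start + items.length;
  const truncated = end < all.length;
  const range = items.length > 0 ? `showing ${start + 1}-${end}` : "nothing to show";

  return {
    items,
    total_matches: all.length,
    truncated,
    // Only include a cursor when there is actually another page.
    ...(truncated ? { next_page_cursor: String(end) } : {}),
    message: truncated
      ? `Found ${all.length} matches; ${range}. Pass next_page_cursor to get more.`
      : `Found ${all.length} matches; ${range}.`,
  };
}

const firstPage = paginate(Array.from({ length: 47 }, (_, i) => `customer-${i}`), 10);
console.log(firstPage.message);
```

Keeping the message in sync with truncated and next_page_cursor matters: the model tends to follow the prose hint, so it should never contradict the structured fields.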

Pattern 4: Error semantics

Tools fail. How they communicate failure to the LLM determines whether the LLM recovers gracefully or compounds the error.

Categories of error.

type ToolError =
  | { type: "validation"; message: string; field?: string }
  | { type: "auth"; message: string }
  | { type: "not_found"; message: string; suggestion?: string }
  | { type: "conflict"; message: string; resolution?: string }
  | { type: "rate_limit"; message: string; retry_after_seconds: number }
  | { type: "service_unavailable"; message: string; retryable: boolean }
  | { type: "internal"; message: string; trace_id: string };

Each category has different semantics. The LLM should respond differently:

validation: fix the input and retry.
not_found: tell the user, or try a different search.
conflict: ask for resolution.
rate_limit: wait and retry.
service_unavailable: try fallback or notify user.
internal: give up, surface to user.

Documenting these in your server makes the LLM more capable.

Error formatting.

Return errors as structured data, with clear, actionable messages:

{
  error: {
    type: "validation",
    message: "The email address is not in a valid format.",
    field: "email",
    suggestion: "Provide a valid email address like 'name@example.com'."
  }
}

Avoid:

{
  error: "Invalid input"
}

The first lets the LLM recover. The second leaves it guessing.
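One way to apply this consistently is to route every failure through a single helper. The sketch below assumes the tool result shape used elsewhere in this article (an isError flag plus text content); errorResult and safely are hypothetical helper names, and the trace ID generator is injected so the example stays dependency-free.

```typescript
type ToolErrorType =
  | "validation" | "auth" | "not_found" | "conflict"
  | "rate_limit" | "service_unavailable" | "internal";

// Type and message are mandatory; other details (field, suggestion, ...) are free-form.
type ToolError = { type: ToolErrorType; message: string; [detail: string]: unknown };

type ToolResult = { isError?: boolean; content: { type: "text"; text: string }[] };

// Serializes a structured error into a tool result the model can read.
function errorResult(error: ToolError): ToolResult {
  return { isError: true, content: [{ type: "text", text: JSON.stringify({ error }) }] };
}

// Wraps a handler so that unexpected exceptions become an "internal" error
// with a trace ID instead of leaking a stack trace to the model.
function safely<A>(handler: (args: A) => Promise<ToolResult>, newTraceId: () => string) {
  return async (args: A): Promise<ToolResult> => {
    try {
      return await handler(args);
    } catch {
      return errorResult({
        type: "internal",
        message: "Unexpected server error. Retrying is unlikely to help.",
        trace_id: newTraceId(),
      });
    }
  };
}

const example = errorResult({
  type: "validation",
  message: "The email address is not in a valid format.",
  field: "email",
  suggestion: "Provide a valid email address like 'name@example.com'.",
});
console.log(example.content[0].text);
```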

Pattern 5: Auth and authorization

Production MCP servers need authentication. Without it, anyone who can reach the server can use the tools, and that's almost always a problem.

Authentication: who is calling?

Common patterns:

API key. Simple, common, works for service-to-service. Issue per consumer; rotate periodically.
OAuth. For multi-user systems where end-users authorize agents. More complex but the right answer for many use cases.
mTLS. For high-security environments. Mutual TLS certificates for both sides.

Implementation depends on transport. Over HTTP, you authenticate the request before it ever reaches the MCP handler (in your Express/Hono/Fastify middleware) and stash the caller on the request:

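// Express-style middleware in front of the MCP HTTP endpoint.
app.use("/mcp", async (req, res, next) => {
  const apiKey = req.header("x-api-key");
  const caller = apiKey ? await authenticate(apiKey) : null;
  if (!caller) return res.status(401).send("Unauthorized");
  // Stash the caller on req.auth, where the SDK's HTTP transport looks for it.
  (req as any).auth = {
    token: apiKey,
    clientId: caller.id,
    scopes: caller.tools,
    extra: { tenant_id: caller.tenant_id },
  };
  next();
});

Then inside each tool handler, pull the caller from the per-call context (extra), not from raw headers. In recent versions of the TypeScript SDK, the Streamable HTTP transport forwards req.auth to handlers as extra.authInfo; check the version you have installed, since the exact fields have changed over time. Over stdio there are no HTTP headers; auth typically comes from process env / config files instead.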

Authorization: what can they do?

Once authenticated, what tools can the caller use, and on what data?

function authorize(caller: Caller, tool: string, params: any): boolean {
  // Caller-level: can this caller use this tool at all?
  if (!caller.tools.includes(tool)) return false;
  
  // Data-level: is this caller authorized for this specific data?
  if (params.tenant_id && params.tenant_id !== caller.tenant_id) return false;
  
  return true;
}

Don't let the LLM make authorization decisions. The LLM might be tricked. Authorization is the server's job; the LLM only sees data it's authorized to see.

For multi-tenant systems: every tool call is scoped to a tenant. The tenant is determined by the auth, not by parameters the LLM provides.
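A small sketch of that rule: build the arguments for the data layer so that the tenant always comes from the authenticated caller. The Caller shape follows the authorize example above; scopeToTenant is a hypothetical helper name.

```typescript
type Caller = { id: string; tenant_id: string; tools: string[] };

// The tenant is taken from the authenticated caller. Any tenant_id the model
// supplied in params is overwritten, never trusted.
function scopeToTenant(
  caller: Caller,
  params: Record<string, unknown>,
): Record<string, unknown> & { tenant_id: string } {
  return { ...params, tenant_id: caller.tenant_id };
}

const caller: Caller = { id: "agent-7", tenant_id: "tenant-a", tools: ["search_customers"] };
const scoped = scopeToTenant(caller, { query: "alice", tenant_id: "tenant-b" });
console.log(scoped.tenant_id); // → tenant-a
```

Silently overwriting is one option; rejecting the call when the supplied tenant differs (as authorize does above) is stricter and makes probing attempts visible in logs.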

Pattern 6: Idempotency

For write operations, idempotency matters. The LLM might retry; it might call the same tool twice in different contexts. Without idempotency, you get duplicates.

Idempotency keys.

The tool accepts an idempotency_key parameter. The server checks: have we seen this key before? If yes, return the cached result. If no, execute and cache.

async function createInvoice(params: {
  amount: number;
  customer_id: string;
  idempotency_key: string;
}) {
  const cached = await idempotencyStore.get(params.idempotency_key);
  if (cached) return cached;
  
  const invoice = await actuallyCreateInvoice(params);
  await idempotencyStore.set(params.idempotency_key, invoice, { ttl: 86400 });
  return invoice;
}
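The check-then-set above leaves a gap: two concurrent calls with the same key can both miss the cache and both create an invoice. One way to close it, sketched here as an in-memory store, is to cache the in-flight promise rather than the finished result, and to scope keys per caller. IdempotencyStore is a hypothetical class; a multi-instance deployment would need a shared store with an atomic "set if absent" instead.

```typescript
// In-memory sketch; not shared across processes and not persistent.
class IdempotencyStore<T> {
  private entries = new Map<string, { promise: Promise<T>; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Runs fn at most once per (callerId, key) within the TTL. Concurrent and
  // later calls with the same key share the first call's promise.
  run(callerId: string, key: string, fn: () => Promise<T>): Promise<T> {
    const scopedKey = `${callerId}:${key}`;
    const existing = this.entries.get(scopedKey);
    if (existing && existing.expiresAt > this.now()) return existing.promise;

    const promise = fn();
    this.entries.set(scopedKey, { promise, expiresAt: this.now() + this.ttlMs });

    // Failures are not cached: a retry after an error should execute again.
    promise.catch(() => {
      if (this.entries.get(scopedKey)?.promise === promise) this.entries.delete(scopedKey);
    });
    return promise;
  }
}

// Usage: the second call reuses the first call's promise.
let invoicesCreated = 0;
const invoiceStore = new IdempotencyStore<string>(24 * 60 * 60 * 1000);
const createInvoiceOnce = (callerId: string, key: string) =>
  invoiceStore.run(callerId, key, async () => `inv_${++invoicesCreated}`);

const first = createInvoiceOnce("agent-7", "9f1c");
const retry = createInvoiceOnce("agent-7", "9f1c");
console.log(first === retry);
```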

For the LLM, hint at this in the tool description:

"For each unique invoice you create, generate a UUID and pass it as idempotency_key. If you need to retry the operation, use the same UUID to avoid duplicate creation."

Pattern 7: Streaming

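For tools that take a while, reporting progress as you go is better UX. MCP does not deliver a tool's result incrementally: the result arrives in a single response. What it does support is progress notifications, sent from inside a tool handler via the per-call extra argument when the client supplied a progress token: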

server.registerTool(
  "long_running_task",
  { description: "...", inputSchema: { ... } },
  async (input, extra) => {
    // The client opts in by sending a progressToken; without one, stay silent.
    const progressToken = extra._meta?.progressToken;
    const reportProgress = async (progress: number, message: string) => {
      if (progressToken === undefined) return;
      await extra.sendNotification({
        method: "notifications/progress",
        params: { progressToken, progress, total: steps.length, message },
      });
    };

    await reportProgress(0, "Starting...");

    // steps, doStep, and finalResult stand in for your own work.
    for (const [i, step] of steps.entries()) {
      await doStep(step);
      await reportProgress(i + 1, step.name);
    }

    return { content: [{ type: "text", text: JSON.stringify({ result: finalResult }) }] };
  }
);

Use streaming for:

Long-running operations (>5 seconds).
Large outputs that take a while to assemble (the client can show progress, though the result itself still arrives in one piece).
Operations with intermediate results worth showing.

Don't stream for fast, simple operations; it adds complexity without value.

Pattern 8: Caching

Many tool calls hit the same data repeatedly. Caching can dramatically improve performance and reduce backend load.

Local cache. In-process cache (e.g., LRU) for hot data.

Distributed cache. Redis or similar for shared cache across server instances.

Cache invalidation. When data changes, evict relevant entries. (This is the hard part.)

TTLs. Cached entries expire after a defined time. Tune per data type — customer profiles might cache for hours; pricing might cache for minutes.

For caching to help, the same calls must repeat. Many MCP servers see this — agents often refer to the same entities repeatedly within a session.

A pattern:

async function getCustomerCached(id: string) {
  const cached = await cache.get(`customer:${id}`);
  if (cached) {
    metrics.increment("cache.hit");
    return cached;
  }
  metrics.increment("cache.miss");
  const customer = await db.getCustomer(id);
  await cache.set(`customer:${id}`, customer, { ttl: 300 });
  return customer;
}
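The read path above is only half of the story; the write path has to evict. Here is a self-contained sketch of a TTL cache with eviction on update, using an in-memory Map as a stand-in for both the cache and the database. TtlCache, customerTable, and the function names are all illustrative, and the clock is injectable so expiry can be tested.

```typescript
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily drop expired entries
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, ttlSeconds: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  delete(key: string): void {
    this.entries.delete(key);
  }
}

type Customer = { id: string; email: string };

// Stand-in for the database.
const customerTable = new Map<string, Customer>([["c1", { id: "c1", email: "old@example.com" }]]);
const customerCache = new TtlCache<Customer>();

// Read-through: serve from cache, fall back to the table and populate.
function getCustomer(id: string): Customer | undefined {
  const key = `customer:${id}`;
  const cached = customerCache.get(key);
  if (cached) return cached;
  const row = customerTable.get(id);
  if (row) customerCache.set(key, { ...row }, 300);
  return row;
}

// Write path: update the table, then evict so the next read is fresh.
function updateCustomerEmail(id: string, email: string): void {
  const row = customerTable.get(id);
  if (!row) return;
  customerTable.set(id, { ...row, email });
  customerCache.delete(`customer:${id}`);
}
```

In a multi-tenant server the cache key should also include the tenant, as the complete example later in this article does.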

Pattern 9: Rate limiting

LLM agents can be surprisingly aggressive — looping, retrying, fanning out. A misbehaving agent can DoS your backend.

Rate limiting per caller is essential:

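import { CallToolRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// RateLimiter is a placeholder for your limiter of choice (in-memory, Redis-backed, ...).
const limiter = new RateLimiter({
  windowMs: 60_000,
  max: 100, // 100 calls/minute per caller
});

// Low-level Server API: one handler sees every tools/call request.
server.setRequestHandler(CallToolRequestSchema, async (request, extra) => {
  // authInfo is only present when the request was authenticated upstream.
  const callerId = extra.authInfo?.clientId ?? "anonymous";
  if (await limiter.exceeded(callerId)) {
    const error = { type: "rate_limit", message: "Too many requests", retry_after_seconds: 60 };
    return { isError: true, content: [{ type: "text", text: JSON.stringify({ error }) }] };
  }
  // ... dispatch to the tool implementation
});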

Beyond global rate limits, per-tool limits matter — some tools are expensive and should be limited tightly.

For consequential operations (creating records, sending messages), use stricter limits or require explicit confirmation flows.
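For reference, a fixed-window limiter with both a per-caller limit and optional tighter per-tool limits can be sketched in a few dozen lines. This is an in-memory illustration under the assumption of a single server instance; FixedWindowLimiter and its method names are hypothetical, and a multi-instance deployment would keep the counters in a shared store.

```typescript
type Limit = { windowMs: number; max: number };
type Decision = { allowed: true } | { allowed: false; retry_after_seconds: number };
type Window = { start: number; count: number };

class FixedWindowLimiter {
  private windows = new Map<string, Window>();
  private perCaller: Limit;
  private perTool: Record<string, Limit>;
  private now: () => number;

  constructor(perCaller: Limit, perTool: Record<string, Limit> = {}, now: () => number = Date.now) {
    this.perCaller = perCaller;
    this.perTool = perTool;
    this.now = now;
  }

  // Returns the current window for a key, starting a fresh one if it expired.
  private windowFor(key: string, limit: Limit, t: number): Window {
    let w = this.windows.get(key);
    if (!w || t - w.start >= limit.windowMs) {
      w = { start: t, count: 0 };
      this.windows.set(key, w);
    }
    return w;
  }

  check(callerId: string, tool: string): Decision {
    const t = this.now();
    const limits: [string, Limit][] = [[`caller:${callerId}`, this.perCaller]];
    const toolLimit = this.perTool[tool];
    if (toolLimit) limits.push([`tool:${callerId}:${tool}`, toolLimit]);

    const active = limits.map(([key, limit]) => ({ limit, w: this.windowFor(key, limit, t) }));

    // First verify every applicable limit, then count the call against all of them.
    for (const { limit, w } of active) {
      if (w.count >= limit.max) {
        return { allowed: false, retry_after_seconds: Math.ceil((w.start + limit.windowMs - t) / 1000) };
      }
    }
    for (const { w } of active) w.count += 1;
    return { allowed: true };
  }
}

const demoLimiter = new FixedWindowLimiter(
  { windowMs: 60_000, max: 100 },
  { send_message: { windowMs: 60_000, max: 5 } }, // expensive tool, tighter limit
);
console.log(demoLimiter.check("agent-7", "send_message"));
```

Returning retry_after_seconds lets the server fill in the rate_limit error from Pattern 4 so the model knows how long to wait.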

Pattern 10: Resources

MCP has "resources" — read-only data sources the LLM can browse and reference. Different from tools (which are called actively).

server.setRequestHandler(ListResourcesRequestSchema, async () => ({
  resources: [
    {
      uri: "doc://my-server/handbook",
      name: "Employee Handbook",
      mimeType: "text/markdown",
      description: "Company employee handbook"
    },
    // ...
  ]
}));

server.setRequestHandler(ReadResourceRequestSchema, async (request) => {
  const content = await loadResource(request.params.uri);
  return { contents: [{ uri: request.params.uri, mimeType: "text/markdown", text: content }] };
});

Resources are useful for:

Reference documents the LLM might want to browse.
Configuration or context data.
Lookup tables or schemas the LLM might need.

Resources are read; tools are actions. Use the right concept for each.

Pattern 11: Observability

Same patterns as elsewhere in production AI. For your MCP server, instrument:

Every tool call: timestamp, caller, tool, params, result, latency, status.
Per-tool metrics: call volume, p50/p95 latency, error rate.
Per-caller metrics: who's calling, how often.
Trace context: propagate trace IDs from the caller through to backend calls.

Structured logs:

logger.info("tool_call", {
  tool: request.params.name,
  caller_id: context.caller.id,
  trace_id: context.trace_id,
  params: redactPII(request.params.arguments),
  duration_ms: duration,
  status: "success"
});

Pipe to your observability platform.
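Rather than repeating that logging call in every handler, you can wrap handlers once. The sketch below is one possible shape: withLogging and the key-based redactPII are hypothetical helpers, and the list of PII keys is only an example that you would replace with your own policy.

```typescript
type LogFn = (event: string, fields: Record<string, unknown>) => void;

// Example policy: mask values stored under keys that commonly hold PII.
const PII_KEYS = new Set(["email", "new_email", "phone", "name", "address"]);

function redactPII(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redactPII);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, v]) => [
        key,
        PII_KEYS.has(key) ? "[REDACTED]" : redactPII(v),
      ]),
    );
  }
  return value;
}

// Wraps a tool handler so every call emits one structured log line,
// whether it succeeds or throws.
function withLogging<A, R>(tool: string, log: LogFn, handler: (args: A) => Promise<R>) {
  return async (args: A): Promise<R> => {
    const started = Date.now();
    try {
      const result = await handler(args);
      log("tool_call", { tool, params: redactPII(args), duration_ms: Date.now() - started, status: "success" });
      return result;
    } catch (err) {
      log("tool_call", {
        tool,
        params: redactPII(args),
        duration_ms: Date.now() - started,
        status: "error",
        error: err instanceof Error ? err.message : String(err),
      });
      throw err;
    }
  };
}
```

Caller and trace IDs can be added to the log fields the same way once your handler has access to them.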

Pattern 12: Versioning

Your MCP server will evolve. Tools will change, new tools will be added, and old ones will be deprecated.

Server versioning. The Server constructor takes a version. Bump it on changes. Clients can detect.

Tool versioning. When a tool's signature changes incompatibly, version it: search_customers_v2. Keep the old version available for a deprecation period.

Schema evolution. Add optional fields safely. Removing fields or changing types is breaking.

Deprecation. When deprecating a tool, mark it in its description: "DEPRECATED: use search_customers_v2 instead."

For production MCP servers used by multiple clients, versioning is essential. Internal-only servers can be more flexible.

Pattern 13: Testing

How do you test an MCP server?

Unit tests. Each tool's logic, with mocked dependencies. Standard TypeScript testing.

Schema tests. Schemas validate as expected. Edge cases (missing fields, wrong types) handled correctly.

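Integration tests. Spin up the server, send actual MCP requests, verify responses. The @modelcontextprotocol/sdk package includes an in-memory transport (InMemoryTransport.createLinkedPair) that lets you connect a Client to your server in the same process, without spawning anything.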

End-to-end with a real LLM. The hardest but most valuable. Have an LLM use your MCP server to perform realistic tasks. Verify the LLM uses the tools correctly. Find tool description issues.

An end-to-end test setup (pseudocode; the exact client wiring depends on which LLM client you use — Anthropic's TypeScript SDK, OpenAI's, or a framework that supports MCP):

// Start your MCP server as a child process or in-memory transport.
const server = await startTestServer();

// Drive an LLM with the MCP tools attached. The exact API depends on the client.
const result = await runAgent({
  mcpServer: server,
  systemPrompt: "You are a customer service agent...",
  userMessage: "Find the customer Alice and check her open tickets",
});

// Inspect the tool calls captured by the server during the run.
expect(server.callLog.map((c) => c.name)).toEqual([
  "search_customers",
  "list_tickets",
]);

End-to-end tests catch tool description issues that unit tests can't.

Pattern 14: Deployment

Where does your MCP server live?

Stdio (local). The server runs as a process; the client invokes it. Best for desktop apps (Claude Desktop, Cursor) and local tools.

Streamable HTTP (remote; formerly HTTP+SSE). The server is a network service. Best for hosted services, shared infrastructure, multi-client access.

For production servers:

A remote HTTP transport is usually the choice.
Deploy like any web service: containers, load balancing, auto-scaling.
TLS required.
Health checks for the deployment platform.
Graceful shutdown for in-flight requests.

Pattern 15: Security considerations

MCP servers expose capabilities to LLMs. LLMs can be manipulated. Security implications:

Prompt injection through tool inputs. A user's request might contain text that tries to trick the LLM into calling tools harmfully. Defenses:

Tool descriptions clear about expected use.
Authorization on the server side (independent of LLM-decided params).
Confirmations for consequential actions.

Data exfiltration. Tools that return data can be abused — the LLM might be tricked into returning sensitive data inappropriately. Defenses:

Authorization checks.
Logging of what data is accessed by whom.
Pattern detection for unusual access patterns.

Resource exhaustion. Tools that consume backend resources can be abused. Defenses:

Rate limiting.
Resource limits per tool call.
Circuit breakers when backend is degraded.

Injection into tool outputs. A tool's output might contain text that, when read by the LLM, manipulates it. Defenses:

Sanitize outputs where possible.
Be wary of tools that return user-generated content.

These are real attack surfaces. Treat MCP servers like any production API: defense in depth.
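Of the defenses above, the circuit breaker is the one teams most often leave as a to-do, so here is a minimal sketch. It is a generic three-state breaker (closed, open, half-open) with an injectable clock; CircuitBreaker and its thresholds are illustrative, not part of MCP or the SDK.

```typescript
type BreakerState = "closed" | "open" | "half_open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private threshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Call before hitting the backend. After the cooldown, probe requests
  // are let through (half-open) to test whether the backend recovered.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half_open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // Opens after `threshold` consecutive failures, or immediately if a
  // half-open probe fails.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half_open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

// Usage inside a tool handler (sketch):
//   if (!breaker.canRequest()) return a service_unavailable error with retryable: true
//   try { ...call backend...; breaker.recordSuccess(); } catch { breaker.recordFailure(); throw; }
const breaker = new CircuitBreaker(5, 30_000);
console.log(breaker.canRequest());
```

When the breaker is open, returning a service_unavailable error from Pattern 4 tells the model to back off instead of retrying into a degraded backend.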

A complete example: a small-but-real MCP server

To put it together, a server exposing a small CRM:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
import { z } from "zod";
import { db, cache, logger, authenticate } from "./infra.js";

const server = new McpServer({
  name: "crm-server",
  version: "1.0.0",
});

// === Tool: search_customers ===

server.registerTool(
  "search_customers",
  {
    description: "Search customers by name, email, or company.",
    inputSchema: {
      query: z.string().describe("Name, email, or company"),
      limit: z.number().int().min(1).max(50).default(10),
    },
  },
  async ({ query, limit }, extra) => {
    const auth = await authenticate(extra);
    const cacheKey = `search:${auth.tenant_id}:${query}:${limit}`;

    const cached = await cache.get(cacheKey);
    if (cached) return cached;

    const customers = await db.searchCustomers({
      tenant_id: auth.tenant_id,
      query,
      limit,
    });

    const result = {
      content: [{
        type: "text" as const,
        text: JSON.stringify({
          customers,
          total_matches: customers.length,
          truncated: customers.length === limit,
          message:
            customers.length === limit
              ? `Showing first ${limit}; there may be more matches.`
              : `Found ${customers.length} customer(s).`,
        }),
      }],
    };

    await cache.set(cacheKey, result, { ttl: 60 });
    logger.info("search_customers", { tenant: auth.tenant_id, query, results: customers.length });
    return result;
  }
);

// === Tool: get_customer ===

server.registerTool(
  "get_customer",
  {
    description: "Fetch a single customer by id.",
    inputSchema: { customer_id: z.string() },
  },
  async ({ customer_id }, extra) => {
    const auth = await authenticate(extra);
    const customer = await db.getCustomer(auth.tenant_id, customer_id);
    if (!customer) {
      return {
        isError: true,
        content: [{
          type: "text" as const,
          text: `Customer ${customer_id} not found. Use search_customers to find by name or email.`,
        }],
      };
    }
    return { content: [{ type: "text" as const, text: JSON.stringify({ customer }) }] };
  }
);

// === Tool: update_customer_email (with idempotency) ===

server.registerTool(
  "update_customer_email",
  {
    description: "Update a customer's email; pass the same idempotency_key on retry.",
    inputSchema: {
      customer_id: z.string(),
      new_email: z.string().email(),
      idempotency_key: z
        .string()
        .describe("UUID for this update; pass the same value on retry to prevent duplicates"),
    },
  },
  async (params, extra) => {
    const auth = await authenticate(extra);
    // ... idempotency check, validation, update
    return { content: [{ type: "text" as const, text: "ok" }] };
  }
);

// ... more tools ...

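// Create the remote transport (Streamable HTTP). connect() does not open a port by itself:
// route requests on your MCP path to transport.handleRequest(req, res, req.body)
// from the HTTP framework of your choice.
const transport = new StreamableHTTPServerTransport({ sessionIdGenerator: () => crypto.randomUUID() });
await server.connect(transport);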

This is a starting structure. Add observability, rate limiting, more tools, more careful schemas — but the bones are here.

The takeaway

MCP is a small protocol; building a production-grade server is real engineering work. The payoff: your service becomes usable by any LLM agent, with a standardized integration that doesn't require per-model coupling.

The patterns that matter: focused tool design, prompt-aware schemas, structured error semantics, robust auth, idempotency, observability, security. Skipping any of these creates an MCP server that fails in production.

Build them in. Test against real LLMs. Iterate on tool descriptions. The result is a service that an LLM can use as fluently as a human can, and one that scales with the rapidly growing universe of AI agents.