Stop Dumping Tools Into Context. It Doesn't Scale
Everyone loves MCP in demos. Nobody talks about what happens when you connect five servers and the agent forgets how to think.
Everyone's excited about MCP. Model Context Protocol — the open standard that lets AI agents talk to external tools. GitHub, Slack, databases, file systems. Plug in an MCP server, hand the model a tool menu, and watch the magic happen.
Except here's what nobody shows in the demos: what happens when that tool menu has ninety items on it.
I've been building agentic AI workflows for finance teams — the kind of multi-step automation where a missed parameter means a payment gets applied to the wrong account. So when I say MCP breaks at scale, it's not a theoretical concern. It's a daily one.
The Context Tax Nobody Talks About
MCP's core idea is elegant. You run MCP servers. Each server exposes tools with JSON schemas. The client loads those definitions into the model's context window. The model picks the right tool. Clean, typed, standardized.
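For a sense of what one item on that menu costs, here's roughly the shape of a single tool definition as it lands in context. The name/description/inputSchema structure follows the MCP spec; the specific tool and fields are illustrative:

```json
{
  "name": "create_issue",
  "description": "Create a new issue in a GitHub repository.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "owner": { "type": "string", "description": "Repository owner" },
      "repo": { "type": "string", "description": "Repository name" },
      "title": { "type": "string", "description": "Issue title" },
      "body": { "type": "string", "description": "Issue body in Markdown" },
      "labels": { "type": "array", "items": { "type": "string" } }
    },
    "required": ["owner", "repo", "title"]
  }
}
```

Now imagine ninety of those, each with its own parameter descriptions, loaded whether or not the task needs them.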
In practice, that clean menu becomes a wall of noise.
One developer recently audited their Claude Code setup and found their MCP tools were consuming over 66,000 tokens of context before they even started a conversation. That's a third of Claude's 200k context window — gone. Just on tool definitions nobody asked for.
And it's not just hobbyist setups. GitHub's official MCP server alone defines 93 tools and eats roughly 55,000 tokens. Vercel's independent measurement landed in the same range: roughly 50,000 tokens just to describe what GitHub's server can do. Now stack three or four more servers on top.
The model is trying to solve your actual problem with whatever context budget is left. It's like handing someone a 200-page restaurant menu and wondering why they ordered wrong.
The result: poor tool selection, hallucinated parameters, drift in multi-step workflows, and complete breakdown in long conversations.
Why Accuracy Multiplies Down
Here's the part that makes this an engineering problem, not just a UX annoyance.
Every tool call is a probabilistic decision. The model reads the available tools, picks one, and fills in the parameters. Sometimes it gets it right. Sometimes it doesn't. The question is: what happens when you chain those decisions together?
Say each individual tool call has 90% accuracy. That sounds solid. But in a five-step workflow, you need all five to land correctly. The math is simple multiplication:
0.9 × 0.9 × 0.9 × 0.9 × 0.9 = 0.59
Five steps. 59% reliability. And that 90% per-call number is generous — it drops fast when the model is distracted by dozens of irrelevant tool definitions competing for its attention.
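If you want to see how quickly this compounds, the arithmetic fits in a few lines. The per-call accuracies below are illustrative, not measured:

```python
# Illustrative only: end-to-end reliability when every step must succeed
# and each tool call succeeds independently with probability p.
def chain_reliability(p: float, steps: int) -> float:
    return p ** steps

for p in (0.95, 0.90, 0.85):
    print(f"per-call {p:.0%}: " + ", ".join(
        f"{n} steps -> {chain_reliability(p, n):.0%}" for n in (3, 5, 10)
    ))
# per-call 95%: 3 steps -> 86%, 5 steps -> 77%, 10 steps -> 60%
# per-call 90%: 3 steps -> 73%, 5 steps -> 59%, 10 steps -> 35%
# per-call 85%: 3 steps -> 61%, 5 steps -> 44%, 10 steps -> 20%
```

Drop per-call accuracy to 85% and stretch the workflow to ten steps, and you're flipping a heavily weighted coin against yourself.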
In a demo with two tools and one step? Flawless. In a production workflow where the agent fetches a record, validates it, cross-references another system, flags exceptions, and writes back the result? You're compounding uncertainty at every step.
Context bloat doesn't just waste tokens. It actively degrades the model's ability to reason about the task in front of it.
What Anthropic Actually Did About It
Anthropic didn't patch MCP. They didn't declare it dead. They didn't pretend context bloat was user error.
They quietly shipped something called Skills.
A Skill is deceptively simple: a folder containing a SKILL.md file with YAML frontmatter — name, description, metadata — followed by detailed instructions, optional reference docs, and optional executable scripts.
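Concretely, a skill might look something like this. The frontmatter fields (name, description) follow Anthropic's documented format; the skill itself is a made-up example from the finance-workflow world:

```markdown
---
name: reconcile-payments
description: Match incoming bank payments to open invoices and flag exceptions for review.
---

# Reconciling payments

1. Pull unmatched payments from the ERP via the payments MCP tool.
2. Match each payment on invoice number first, then on amount plus counterparty.
3. Anything matching more than one invoice goes to scripts/flag_exception.py,
   never to an automatic write-back.

See reference/matching-rules.md for the full matching logic.
```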
The key design decision isn't what's in the folder. It's when the model reads it.
At startup, the agent loads only minimal metadata for each skill. Name and a one-line description. Maybe a hundred tokens total. The full instructions, reference docs, and scripts stay on disk until the model decides it needs them.
When a user makes a request, the model scans skill names, identifies which one is relevant, opens just that SKILL.md, and loads additional files only on demand. Instead of dumping the entire tool universe into context upfront, the agent retrieves only what the current task requires.
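Here's a minimal sketch of that loading discipline, assuming skills live in a local skills/ directory with one SKILL.md per folder and nothing fancier than naive frontmatter parsing:

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # assumed layout: skills/<name>/SKILL.md

def load_skill_index() -> list[dict]:
    """Startup cost: only the frontmatter of each SKILL.md enters context."""
    index = []
    for skill_md in SKILLS_DIR.glob("*/SKILL.md"):
        front = skill_md.read_text().split("---")[1]  # naive YAML frontmatter grab
        meta = dict(
            line.split(":", 1) for line in front.strip().splitlines() if ":" in line
        )
        index.append({
            "name": meta.get("name", skill_md.parent.name).strip(),
            "description": meta.get("description", "").strip(),
            "path": skill_md,
        })
    return index  # roughly a sentence per skill, not the full instructions

def load_skill_body(entry: dict) -> str:
    """On-demand cost: full instructions load only once the model picks this skill."""
    return entry["path"].read_text()
```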
Anthropic's own term for this: progressive disclosure. I'd call it something more specific: RAG for tools.
The Architecture That Actually Works
Here's the mental model:
MCP is the execution layer. It connects your agent to external systems. That part works fine and doesn't need fixing.
Skills are the retrieval and orchestration layer. They sit in front of MCP and decide what the agent needs to know, when it needs to know it, and how to act on it.
The flow becomes:
Retrieve the right skill based on the user's request.
Load only relevant instructions into context.
Run deterministic code when possible — why let the model generate what a script can execute?
Call MCP tools only when necessary, with precise parameters from the skill's instructions.
The model isn't choosing from a hundred tools anymore. It's choosing from a handful of workflow wrappers — each of which already knows which tools to call and how to call them. That's a fundamentally easier decision, and when it lands, the downstream workflow executes with deterministic reliability instead of compounding probability.
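To make the shape of that loop concrete, here's a rough sketch of the orchestration layer. The names are hypothetical throughout: select_skill, run_script, summarize_result, and the mcp_client calls stand in for whatever your agent framework actually provides.

```python
def handle_request(request: str, skill_index: list[dict], mcp_client) -> str:
    # Step 1: retrieval. The model sees only skill names + one-line descriptions.
    skill = select_skill(request, skill_index)          # hypothetical LLM call

    # Step 2: load just that skill's instructions into context.
    instructions = load_skill_body(skill)

    # Step 3: deterministic work stays in code, not in the model.
    record = run_script(skill, "fetch_and_validate.py", request)  # hypothetical

    # Step 4: MCP remains the execution layer, but the call is narrow and the
    # parameters come from the skill's instructions, not a 90-item menu.
    if record.get("needs_writeback"):
        mcp_client.call_tool("update_record", record["payload"])

    return summarize_result(record, instructions)       # hypothetical LLM call
```

The model makes one retrieval decision and, at most, a handful of narrow tool calls; everything that can be a script is a script.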
This matters most in the environments where MCP brute-force falls apart: multi-tenant systems with different tool sets per customer, long-lived sessions where schemas and history fight for the same budget, compliance workflows where hallucinated parameters have real consequences, and multi-agent chains where context bloat compounds across handoffs.
The Quiet Part Out Loud
Anthropic didn't kill MCP. They made it usable by acknowledging three things most agent builders are still ignoring: context is scarce, static tool exposure doesn't scale, and the future isn't about giving agents more tools.
It's about giving them better ways to relate to tools.
