Perplexity Drops MCP Internally, Citing Context Window Waste

March 17, 2026 | Nevo

Key Takeaways

Perplexity CTO Denis Yarats announced the company is moving away from MCP internally, citing context window overhead: in one widely cited deployment, tool schemas consumed 72% of an agent's context window before it processed a single query
The 72% figure came from Apideck (3 MCP servers, ~40 tools eating 143K of 200K tokens) — a worst-case scenario, not Perplexity's own measurement
Three competing solutions emerged: Cloudflare Code Mode (99.9% token reduction), Anthropic's own code execution approach with on-demand tool loading (98.7% reduction), and CLI progressive disclosure
MCP still has 97M+ monthly SDK downloads, 5,800+ verified servers, and Linux Foundation governance — the protocol is not dying, but its naive "load everything upfront" pattern is
Claude Code v2.1.7 already shipped Tool Search for lazy loading of tool schemas; the fix is engineering discipline, not protocol abandonment

72% of Your Context Window, Gone Before Your Agent Writes a Single Word

On March 11, 2026, Perplexity CTO Denis Yarats took the stage at the Ask 2026 conference and said what a growing number of AI engineers have been thinking: the Model Context Protocol is too expensive for production agent systems. Perplexity is moving away from MCP internally, favoring their own Agent API — which reached general availability in February 2026 — and traditional APIs for the workloads that matter most.

The headline statistic driving the conversation is damning: 72% of an agent's context window consumed by tool schemas before the model processes a single user query. But here is the part most coverage gets wrong: that number did not come from Perplexity. It came from Apideck, an API integration company, in a deployment where three MCP servers (GitHub, Slack, and Sentry) loaded roughly 40 tools and consumed 143,000 of 200,000 available tokens (71.5%, rounded to 72%). The figure is real. The attribution is not. And the difference matters: "Perplexity proved MCP wastes 72% of context" reads like a death sentence, while "one company's worst-case deployment showed 72% waste" describes a known engineering problem with known solutions.

What Is MCP and Why Does Context Window Efficiency Matter?

The Model Context Protocol (MCP) is an open standard created by Anthropic that defines how AI models connect to external tools and data sources. MCP is to AI agents what USB was to peripherals — a universal interface that lets any model talk to any tool through a standardized schema. Since its release, MCP has accumulated over 97 million monthly SDK downloads, more than 5,800 verified servers, and backing from every major AI lab including OpenAI, Google, and Microsoft. In November 2025, it moved under Linux Foundation governance, giving it the kind of institutional momentum that individual companies cannot easily replicate.

The context window is an AI model's working memory — the total amount of text it can process at once. When MCP loads tool schemas into the context window, those schemas compete directly with the user's actual task for space. A model with a 200,000-token context window that spends 143,000 tokens on tool definitions has only 57,000 tokens left for the conversation, documents, code, and reasoning that constitute the actual work. This is not a theoretical concern. It translates directly to degraded output quality, truncated inputs, and higher API costs.

The efficiency question matters even more in 2026 because agent workloads are growing more complex. Agents that coordinate across multiple services — code repositories, communication platforms, monitoring tools, databases — need access to dozens or hundreds of tools simultaneously. If each tool's schema costs thousands of tokens, the math breaks down fast.
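To make "the math breaks down fast" concrete, here is a back-of-the-envelope budget in Python. It assumes every tool costs the same as the Apideck average (143,000 tokens across roughly 40 tools, or about 3,575 tokens each); real schemas vary widely, so treat the output as an illustration rather than a measurement.

```python
# Back-of-the-envelope context budget when every tool schema is loaded upfront.
# Assumption: each tool costs the Apideck average (143,000 tokens / 40 tools).
CONTEXT_WINDOW = 200_000
APIDECK_SCHEMA_TOKENS = 143_000
APIDECK_TOOL_COUNT = 40

avg_tokens_per_tool = APIDECK_SCHEMA_TOKENS / APIDECK_TOOL_COUNT  # 3575.0


def remaining_context(tool_count: int, tokens_per_tool: float = avg_tokens_per_tool,
                      window: int = CONTEXT_WINDOW) -> float:
    """Tokens left for the user's task after all schemas load (negative = overflow)."""
    return window - tool_count * tokens_per_tool


def max_tools_upfront(tokens_per_tool: float = avg_tokens_per_tool,
                      window: int = CONTEXT_WINDOW) -> int:
    """Largest tool count whose schemas fit in the window at all."""
    return int(window // tokens_per_tool)


for n in (10, 20, 40, 55):
    left = remaining_context(n)
    print(f"{n:>2} tools: {left:>9,.0f} tokens left ({left / CONTEXT_WINDOW:.1%})")
print("schemas alone overflow the window beyond", max_tools_upfront(), "tools")
```

Under these assumptions, 40 tools leave 57,000 tokens, matching the Apideck deployment, and a 56th tool would not fit in the window at all.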

The Data: How Bad Is It Really?

The 72% figure from Apideck represents a worst-case scenario, but it is not the only data point. Multiple independent benchmarks confirm that MCP's token overhead is a genuine structural problem, not an edge case.

Cloudflare's engineering team found that exposing their full Workers API through MCP would consume approximately 1.17 million tokens, far exceeding any model's context window. Their solution, Code Mode, reduces this to roughly 1,000 tokens by having the agent write code against a typed SDK instead of calling tools through MCP schemas. That is a 99.9% reduction. Scalekit ran 75 head-to-head comparisons between MCP and CLI approaches and found that MCP used 4x to 32x more tokens than the CLI equivalent in every test. And Anthropic, MCP's creator, published its own measurement showing 150,000 tokens reduced to 2,000 using a code execution approach, a 98.7% reduction.
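The reduction percentages in this paragraph follow from one ratio, `1 - after / before`. A quick Python check against the reported token counts (the counts come from the cited sources; the code only does the arithmetic):

```python
def reduction(before: int, after: int) -> float:
    """Fraction of tokens saved when `before` tokens shrink to `after`."""
    return 1 - after / before


# Token counts as reported by Cloudflare and Anthropic.
cloudflare_code_mode = reduction(1_170_000, 1_000)
anthropic_code_exec = reduction(150_000, 2_000)
# Share of a 200K window that Apideck's schemas consumed.
apideck_share = 143_000 / 200_000

print(f"Cloudflare Code Mode:     {cloudflare_code_mode:.1%} saved")  # 99.9% saved
print(f"Anthropic code execution: {anthropic_code_exec:.1%} saved")   # 98.7% saved
print(f"Apideck schema share:     {apideck_share:.1%} of the window")  # 71.5% of the window
```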

When the protocol's own creator publishes data showing a 98.7% efficiency improvement by working around the protocol's core mechanism, the problem is real. The question is whether it is a fatal flaw or an engineering challenge with tractable solutions.

Garry Tan, CEO of Y Combinator, did not mince words. "MCP sucks honestly," he posted publicly, claiming he built a CLI alternative in 30 minutes that outperformed MCP for his use case. When the CEO of the world's most influential startup accelerator calls your protocol out by name, it registers — regardless of whether the technical critique holds up under scrutiny.

Three Competing Solutions to the Context Waste Problem

The industry is not waiting for MCP to fix itself. Three distinct approaches have emerged, each with different trade-offs.

Cloudflare Code Mode takes the most radical position: stop sending tool schemas entirely. Instead, give the agent a typed SDK and let it write code to accomplish tasks. The agent receives a compact API description (around 1,000 tokens regardless of API surface area), writes JavaScript or Python against the SDK, and executes it. This approach scales to APIs with thousands of endpoints without any increase in context consumption. The trade-off is complexity — the agent must be capable of writing correct code against an API it has never seen, and error handling becomes a code-level concern rather than a protocol-level one.
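A minimal Python sketch of the Code Mode idea, assuming a hypothetical typed SDK (`HypotheticalSDK`, `list_zones`, and `purge_cache` are invented for illustration and are not Cloudflare's API). The model sees only a one-line-per-method summary generated from the SDK's signatures, writes a short program against it, and the host runs that program. This is not Cloudflare's implementation, and the `exec`-based runner below is not a sandbox.

```python
import inspect
from dataclasses import dataclass


@dataclass
class Zone:
    id: str
    name: str


class HypotheticalSDK:
    """Invented stand-in for a typed SDK; a real one would call an HTTP API."""

    def list_zones(self) -> list:
        """Return every zone on the account."""
        return [Zone("z1", "example.com"), Zone("z2", "example.org")]

    def purge_cache(self, zone_id: str) -> bool:
        """Purge the cache for one zone; returns True on success."""
        return zone_id.startswith("z")


def compact_description(sdk: object) -> str:
    """One line per public method: the only API text placed in the prompt."""
    lines = []
    for name, method in inspect.getmembers(sdk, inspect.ismethod):
        if not name.startswith("_"):
            lines.append(f"sdk.{name}{inspect.signature(method)}: {inspect.getdoc(method)}")
    return "\n".join(lines)


def run_agent_code(code: str, sdk: object) -> object:
    """Run model-written code with only `sdk` in scope and return its `result`.

    exec() with emptied builtins is NOT a security boundary; a production
    system would execute this code in an isolated runtime.
    """
    env = {"__builtins__": {}, "sdk": sdk}
    exec(code, env)
    return env.get("result")


sdk = HypotheticalSDK()
print(compact_description(sdk))

# What a model might write after reading the description above.
agent_code = """
zones = sdk.list_zones()
result = [z.name for z in zones if sdk.purge_cache(z.id)]
"""
print(run_agent_code(agent_code, sdk))
```

The prompt cost here is the description string, which grows by one short line per method instead of one full JSON schema per tool; adding endpoints to a real SDK would lengthen that summary only slightly, or not at all if the model is instead given a search entry point.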

Anthropic's Code Execution with MCP is the most politically interesting solution because it comes from MCP's creator. Rather than abandoning the protocol, Anthropic extends it: the agent uses a filesystem-based tool discovery mechanism, loading only the schemas it needs for the current task. Their benchmark shows 150,000 tokens compressed to 2,000 — a 98.7% reduction that preserves MCP's interoperability guarantees while eliminating the bloat. This approach keeps MCP as the standard but fundamentally changes how tools are loaded. It is an acknowledgment that the original "load everything upfront" pattern was a mistake, paired with a concrete fix.
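One way to picture filesystem-based discovery, as a Python sketch: each tool definition lives in its own file under a per-server directory, and the agent lists names first and reads a full definition only for the tool it actually needs. The directory layout, file contents, and helper names are hypothetical; Anthropic's write-up describes the approach, not this code.

```python
import tempfile
from pathlib import Path

# Hypothetical tool definitions, one file per tool, grouped by MCP server.
TOOL_FILES = {
    "github/create_issue.py": 'def create_issue(repo: str, title: str) -> dict:\n    """Open an issue in `repo`."""\n',
    "github/list_pull_requests.py": 'def list_pull_requests(repo: str) -> list:\n    """List open pull requests."""\n',
    "slack/post_message.py": 'def post_message(channel: str, text: str) -> dict:\n    """Post `text` to `channel`."""\n',
    "sentry/get_issue.py": 'def get_issue(issue_id: str) -> dict:\n    """Fetch one Sentry issue."""\n',
}


def build_tool_tree(root: Path) -> None:
    """Write each definition to root/<server>/<tool>.py."""
    for rel_path, source in TOOL_FILES.items():
        path = root / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(source)


def list_servers(root: Path) -> list[str]:
    """Step 1: server names only, a handful of tokens."""
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def list_tools(root: Path, server: str) -> list[str]:
    """Step 2: tool names for one server, still without any schemas."""
    return sorted(p.stem for p in (root / server).glob("*.py"))


def read_tool(root: Path, server: str, tool: str) -> str:
    """Step 3: the full definition, loaded only for the tool the task needs."""
    return (root / server / f"{tool}.py").read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_tool_tree(root)
    servers = list_servers(root)
    github_tools = list_tools(root, "github")
    definition = read_tool(root, "github", "create_issue")

loaded = len(definition)
available = sum(len(src) for src in TOOL_FILES.values())
print(servers, github_tools)
print(f"loaded {loaded} of {available} characters of tool definitions")
```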

CLI progressive disclosure, championed by Apideck and gaining traction among developer-tool builders, replaces MCP schemas with command-line interfaces. The agent starts with a minimal system prompt (roughly 80 tokens) and discovers tool capabilities on demand by running --help commands. This mirrors how human developers work — you do not memorize an entire API reference before writing code; you look up what you need when you need it. The trade-off is security: CLI tools often have broader system access than sandboxed MCP servers, and the guardrails are less standardized.
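Progressive disclosure can be illustrated with Python's standard `argparse`, since `--help` output is exactly what the agent reads. The `tickets` CLI, its commands, and the one-line system prompt below are hypothetical; the point is that top-level help names the commands without their flags, and the agent pays for a command's details only when it asks for them.

```python
import argparse

# Hypothetical minimal system prompt: the only tool text loaded upfront.
SYSTEM_PROMPT = "You can run the `tickets` CLI. Use `tickets --help` to discover commands."


def build_cli() -> tuple[argparse.ArgumentParser, dict[str, argparse.ArgumentParser]]:
    """Build a hypothetical `tickets` CLI and also return its subcommand parsers."""
    parser = argparse.ArgumentParser(prog="tickets", description="Manage support tickets.")
    sub = parser.add_subparsers(dest="command", title="commands")

    create = sub.add_parser("create", help="Open a new ticket.")
    create.add_argument("--title", required=True, help="One-line summary.")
    create.add_argument("--priority", choices=["low", "high"], default="low")

    close = sub.add_parser("close", help="Close a ticket by id.")
    close.add_argument("ticket_id", type=int, help="Numeric ticket id.")

    return parser, {"create": create, "close": close}


parser, commands = build_cli()

# Same text `tickets --help` prints: command names and one-liners only.
top_level_help = parser.format_help()
# Same text `tickets create --help` prints: flags for the one command needed.
create_help = commands["create"].format_help()

print(top_level_help)
print(create_help)
```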

Is MCP Actually in Trouble?

The "MCP is dead" narrative makes for compelling headlines, but the evidence does not support it. What the evidence does support is that MCP's naive implementation pattern — load all tool schemas into context upfront — is unsustainable for production workloads. That is a significant but solvable problem, and solutions are already shipping.

Claude Code version 2.1.7 introduced Tool Search, a lazy-loading mechanism that loads tool schemas on demand rather than all at once. Anthropic's code execution approach extends MCP rather than replacing it. The protocol's governance under the Linux Foundation means improvements flow to thousands of implementations simultaneously. And the raw adoption numbers — 97 million monthly SDK downloads, 5,800+ verified servers, institutional backing from every major AI lab — create a gravity that competing approaches will struggle to match.
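The lazy-loading idea behind Tool Search can be sketched as a search function over a registry: the prompt carries a search capability instead of every schema, and full schemas enter context only for tools that match the task. The registry contents and the keyword-overlap scoring below are assumptions for illustration, not how Claude Code actually ranks tools.

```python
import json
import re

STOPWORDS = {"a", "an", "the", "to", "in", "by", "of", "on"}

# Hypothetical full schemas. An upfront design would serialize all of them
# into every request; here they stay out of context until searched for.
REGISTRY = {
    "github_create_issue": {
        "description": "Create a GitHub issue in a repository.",
        "input_schema": {"type": "object",
                         "properties": {"repo": {"type": "string"}, "title": {"type": "string"}},
                         "required": ["repo", "title"]},
    },
    "slack_post_message": {
        "description": "Post a message to a Slack channel.",
        "input_schema": {"type": "object",
                         "properties": {"channel": {"type": "string"}, "text": {"type": "string"}},
                         "required": ["channel", "text"]},
    },
    "sentry_get_issue": {
        "description": "Fetch a Sentry issue by id.",
        "input_schema": {"type": "object",
                         "properties": {"issue_id": {"type": "string"}},
                         "required": ["issue_id"]},
    },
}


def _words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, minus a few stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def search_tools(query: str, limit: int = 2) -> list[dict]:
    """Return full schemas for the best keyword matches; the rest stay unloaded."""
    terms = _words(query)
    scored = []
    for name, spec in REGISTRY.items():
        overlap = len(terms & _words(f"{name} {spec['description']}"))
        if overlap:
            scored.append((overlap, name))
    scored.sort(reverse=True)
    return [{"name": name, **REGISTRY[name]} for _, name in scored[:limit]]


loaded = search_tools("open a github issue", limit=1)
print([tool["name"] for tool in loaded])
print(json.dumps(loaded[0], indent=2))
```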

There is also the matter of who is making the loudest criticism. Apideck sells API integration products that compete with MCP. Scalekit sells developer infrastructure. Perplexity is promoting its own Agent API. This does not invalidate their data — the token overhead numbers are independently verifiable — but it provides important context for why these specific companies are leading the charge against MCP while Anthropic, OpenAI, Google, and Microsoft continue investing in it.

Perplexity's own position is more nuanced than the headlines suggest. They are moving away from MCP internally for their production inference systems, where every token of overhead directly impacts latency and cost at scale. They still maintain an MCP Server for external developers — meaning they consider the protocol valuable enough to support for their ecosystem even as they optimize around it internally. This is not a rejection of MCP. It is an engineering decision about production performance at Perplexity's specific scale.

What This Means for Agent Builders

If you are building AI agent systems today, the practical takeaway is straightforward: MCP is not going anywhere, but how you use it needs to change.

The days of loading every tool schema into context on every request are over. Lazy loading, progressive disclosure, and code-based tool interaction are not optional optimizations — they are architectural requirements for any agent system that connects to more than a handful of tools. Anthropic's own engineering team has demonstrated this with their 98.7% token reduction. Cloudflare's Code Mode proves it at the extreme end. The pattern is clear: treat tool schemas like database queries, not like configuration files. Load what you need, when you need it.

The economics are also shifting in favor of larger context windows. Anthropic's decision to make the 1M context window generally available at standard pricing, eliminating the 2x input surcharge from the beta period, means that even when MCP schemas consume significant tokens, those tokens no longer carry a long-context premium. That is enough to change the cost calculus for many production deployments.

For teams evaluating alternatives to MCP, consider what you are giving up. MCP provides standardized security boundaries, a universal tool interface, and an ecosystem with nearly 6,000 pre-built integrations. CLI alternatives offer efficiency gains but shift security responsibility entirely to the implementer. Code-based approaches like Cloudflare's Code Mode deliver the best token efficiency but require models capable of writing reliable code against unfamiliar APIs. The right answer depends on your scale, your security requirements, and how many external tools your agents need to coordinate.

The broader signal from Perplexity's announcement is that the AI agent ecosystem is maturing past the "one protocol rules everything" phase. Just as web development settled on HTTP as the transport layer while building REST, GraphQL, and gRPC on top for different use cases, agent tooling will likely standardize on MCP as the interoperability layer while developing specialized patterns — lazy loading, code execution, progressive disclosure — for performance-critical paths. The companies building agent orchestration platforms like NVIDIA's NemoClaw and agent-class models like OpenAI's GPT-5.4 are already designing for this multi-pattern future.

MCP's context window problem is real. The solutions are already here. The protocol is not dying — it is being forced to grow up.

Frequently Asked Questions

What does it mean that MCP wastes 72% of the context window?

The 72% figure comes from an Apideck deployment where three MCP servers (GitHub, Slack, and Sentry) loaded approximately 40 tool schemas upfront, consuming 143,000 of 200,000 available context tokens. This means the AI model had only 28% of its working memory left for the actual user task. The figure represents a worst-case scenario with no lazy loading or schema pruning, but it reflects a real structural inefficiency in MCP's default implementation pattern.

Did Perplexity abandon MCP completely?

No. Perplexity's CTO Denis Yarats announced at Ask 2026 that Perplexity is moving away from MCP for its internal production systems, favoring the Perplexity Agent API (GA since February 2026) and traditional APIs. However, Perplexity still maintains an MCP Server for external developers. This is an internal performance optimization, not a universal rejection of the protocol.

What is the Perplexity Agent API?

The Perplexity Agent API is Perplexity's production interface for building AI agent applications, which reached general availability in February 2026. It provides access to Perplexity's search-augmented generation capabilities through a traditional API design that avoids MCP's upfront schema loading overhead. Perplexity positions it as a more efficient alternative to MCP for developers building on their platform.

What are the alternatives to MCP for AI agent tool integration?

Three main alternatives have emerged as of March 2026. Cloudflare Code Mode has agents write code against typed SDKs instead of calling tool schemas, achieving a 99.9% token reduction. Anthropic's Code Execution with MCP uses filesystem-based tool discovery for a 98.7% reduction while preserving MCP compatibility. CLI progressive disclosure, championed by Apideck, replaces MCP schemas with on-demand command-line help calls, starting from roughly 80 tokens instead of thousands. Each approach makes a different trade-off: Code Mode depends on the model writing correct code, Anthropic's approach stays within MCP, and CLI shifts security responsibility to the implementer.

Is MCP still worth adopting for new AI agent projects?

Yes, with caveats. MCP remains the most widely adopted tool integration standard for AI agents, with 97 million monthly SDK downloads, 5,800+ verified servers, Linux Foundation governance, and backing from Anthropic, OpenAI, Google, and Microsoft. However, new implementations should use lazy loading (available in Claude Code 2.1.7+) or code-based tool discovery rather than loading all schemas upfront. The protocol's interoperability benefits are significant, but the naive "load everything" pattern is not viable for production systems connecting to more than a few tools.

Sources: Apideck, Cloudflare Engineering Blog, Anthropic Engineering, Awesome Agents, ByteIota, Dev.to