MCP Servers Fail Silently. Your Dashboard Won't Tell You.
The Silent Failure Problem
Your MCP server returns 200 OK on every request. Your dashboard is green. Your users are frustrated.
HTTP status codes do not measure what matters here. The tool returned data. Was it the right tool? Was the response useful? Did the model actually use it?
A public Anthropic evaluation of agentic tool use (MCP Bench, late 2025) clocked tool-selection error rates between 28% and 46% across popular MCP servers with more than six tools. That pattern shows up in production too: servers log perfect uptime, sub-200ms latency, and zero 5xx errors while the model calls the wrong tool up to almost half the time.
The server does not know. The team does not know. The users just say the AI is bad.
That is the silent failure. An MCP server can pass every traditional health check and still answer the wrong question on every call.
What Working Actually Means
For a REST API, "working" means accepting valid requests and returning correct responses. Decades of tooling exist to verify that.
For an MCP server, working is a four-step chain. The model picks the right tool. It sends correct parameters. It receives a useful response. It incorporates the response into its answer. Each step fails independently. None of those failures show up in traditional monitoring.
Wrong tool? 200. Parameters that pass schema validation but miss the user's intent? 200. Response ignored? 200.
You can have a 100% success rate on the dashboard and a 40% success rate in production. That gap is where users say "the AI doesn't work" — and they are right, even though every SRE metric says otherwise.
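To see how a green dashboard and a roughly 40% real success rate can coexist, it helps to treat the four steps as stages that must all succeed. The sketch below uses made-up per-step rates (illustrative numbers, not measurements) and assumes the steps are statistically independent, which real traffic may not be.

```python
from math import prod

# Hypothetical per-step success rates; illustrative only, not measured data.
step_success = {
    "right_tool_selected": 0.70,
    "parameters_match_intent": 0.85,
    "response_useful": 0.90,
    "response_used_in_answer": 0.80,
}


def end_to_end_success(rates):
    """Probability that every step succeeds, assuming independent steps."""
    return prod(rates.values())


http_success_rate = 1.0  # every call returned 200 OK
real_success_rate = end_to_end_success(step_success)

print(f"dashboard: {http_success_rate:.0%}, end to end: {real_success_rate:.1%}")
```

No single step looks alarming on its own, yet the product lands well under half. The HTTP layer sees none of it.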
The Metrics That Matter
Once you accept that traditional metrics are insufficient, the question becomes what to measure instead. Five things.
**Tool selection accuracy.** If you have `search_invoices` and `list_documents`, and a user asks for "last month's invoices," which one gets called? Track the gap between intended and actual selection. It is the single most revealing metric in MCP observability.
**Parameter validity.** Not whether parameters passed schema validation — that is table stakes. Did the date range match what the user asked for? Did the search query capture their intent?
**Response utilization.** Did the model actually use the response? If you return 50 rows and the model references none of them in its answer, either the tool description is wrong, the format is wrong, or the data is irrelevant.
**Error recovery rate.** When a call fails, does the model retry with corrected parameters, switch tools, or give up? The recovery pattern tells you whether your error messages are useful.
**Cost per successful invocation.** Not cost per call. Cost per call that produced a useful answer. Wasted retries and wrong-tool calls typically push real cost to 3-5x the raw API spend.
These five tell you more about MCP server health than every traditional metric combined.
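As a rough illustration, the five metrics can be computed from a sample of labeled invocations. Everything below is an assumption made for the sketch: the `Invocation` record, its field names, and the rule that a call counts as successful only when the tool, parameters, and utilization were all right. The labels themselves would come from whatever ground truth you have.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Invocation:
    """One labeled tool call. Field names are hypothetical."""
    selected_tool: str
    intended_tool: str           # ground-truth label
    params_match_intent: bool    # beyond schema validation
    response_used: bool          # did the final answer draw on the response?
    failed: bool                 # the tool call returned an error
    recovered: Optional[bool]    # None when the call did not fail
    cost_usd: float


def is_successful(inv):
    return (
        inv.selected_tool == inv.intended_tool
        and inv.params_match_intent
        and inv.response_used
        and not inv.failed
    )


def summarize(invocations):
    """Compute the five metrics over all labeled calls."""
    n = len(invocations)
    if n == 0:
        raise ValueError("need at least one invocation")
    failures = [i for i in invocations if i.failed]
    successes = [i for i in invocations if is_successful(i)]
    total_cost = sum(i.cost_usd for i in invocations)
    return {
        "tool_selection_accuracy": sum(i.selected_tool == i.intended_tool for i in invocations) / n,
        "parameter_validity": sum(i.params_match_intent for i in invocations) / n,
        "response_utilization": sum(i.response_used for i in invocations) / n,
        "error_recovery_rate": sum(bool(i.recovered) for i in failures) / len(failures) if failures else None,
        "cost_per_call": total_cost / n,
        "cost_per_successful_invocation": total_cost / len(successes) if successes else None,
    }


# Toy sample: four calls at one cent each, only the first fully successful.
sample = [
    Invocation("search_invoices", "search_invoices", True, True, False, None, 0.01),
    Invocation("list_documents", "search_invoices", False, False, False, None, 0.01),
    Invocation("search_invoices", "search_invoices", True, False, True, True, 0.01),
    Invocation("search_invoices", "search_invoices", False, False, False, None, 0.01),
]

summary = summarize(sample)
for name, value in summary.items():
    print(name, value)
```

In this toy sample only one of four calls is fully successful, so cost per successful invocation comes out at about four times cost per call. That is the gap the raw API bill hides.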
Why Traditional APM Misses This
Datadog, New Relic, and Grafana are excellent at what they do. They track latency distributions, error rates, throughput, and resource utilization. That is the right shape for an API whose caller is a browser or a mobile app sending deterministic requests.
The caller of an MCP server is not deterministic. It is a model making a decision at inference time. The failure mode is not "request timed out." It is "the model decided this was the right tool and it wasn't." That is a decision-quality problem, and traditional APM is not built to measure decision quality.
Concretely: if your agent has both `get_customer` and `search_customers` and the model picks `get_customer` with a guessed ID every time, your p99 stays flat, your error rate stays flat, and every lookup that should have been a search ends in failure. The existing tooling has no signal to emit.
You need an observability layer between the model and the tools — one that can correlate user intent, tool selection, parameters, and response utilization. Traditional APM is the floor. For MCP servers, it is not even half the picture.
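One way to picture such a layer is a thin wrapper around each tool handler that records the call alongside the context that led to it. This is a minimal sketch in plain Python, not an MCP SDK API: `observed`, `call_log`, and the `context` dictionary are hypothetical names. It also assumes the wrapper runs somewhere the conversation is visible, such as the agent or client host, since a server on its own typically sees only the tool name and arguments.

```python
import json
import time

call_log = []  # in-memory stand-in for a real log sink


def observed(tool_name, handler):
    """Wrap a tool handler so every call is logged with its conversation context."""
    def wrapper(params, context):
        record = {
            "ts": time.time(),
            "tool": tool_name,
            "params": params,
            "user_message": context.get("user_message"),
            "system_prompt": context.get("system_prompt"),
        }
        try:
            result = handler(**params)
            record["ok"] = True
            record["response"] = result
            return result
        except Exception as exc:
            record["ok"] = False
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so no call goes unrecorded.
            call_log.append(record)
    return wrapper


def get_customer(customer_id):
    """Hypothetical tool: returns a customer by exact ID."""
    return {"id": customer_id, "name": "Example Customer"}


observed_get_customer = observed("get_customer", get_customer)

# The model guessed an ID although the user asked for a customer by name.
observed_get_customer(
    {"customer_id": "c_123"},
    {"user_message": "Find the customer named Dana",
     "system_prompt": "You are a support agent."},
)

print(json.dumps(call_log[-1], indent=2))
```

The call succeeds as far as the handler is concerned. Only the logged `user_message` next to the `get_customer` parameters makes the mismatch visible to a reviewer or a heuristic.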
Building an Observability Stack for AI Tools
Three concrete things to start with. You do not need to boil the ocean.
**1. Log every invocation with full prompt context.** Not just tool name and parameters. The conversation that led to the call, the system prompt, the user's last message. Without context, you cannot evaluate whether selection was correct.
**2. Track intended tool versus selected tool.** You need ground truth: manual labels on a sample, user thumbs-up/down signals, or heuristic rules (a user question matching `invoice|billing|payment` that ended in a `list_contacts` call is a defect). Even rough heuristics beat nothing.
**3. Measure response utilization.** Diff what the tool returned against what appeared in the model's final answer. A large gap means bad response format, irrelevant data, or a tool description that does not help the model use the result.
These three signals — invocation context, selection accuracy, response utilization — tell you whether your MCP server is actually helping users. Five-nines uptime dashboards tell you whether the process is alive. Those are different questions.
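The selection heuristic and the utilization diff can both start out very small. The sketch below is an assumption-heavy starting point rather than a recommended implementation: `INTENT_RULES`, `selection_defect`, and `response_utilization` are hypothetical names, the rule table holds a single example pattern, and utilization is approximated by checking whether returned values appear verbatim in the final answer.

```python
import re

# Hypothetical rules: when the user's message matches the pattern,
# the selected tool is expected to be one of the allowed tools.
INTENT_RULES = [
    (re.compile(r"invoice|billing|payment", re.IGNORECASE), {"search_invoices"}),
]


def selection_defect(user_message, selected_tool):
    """True when a rule matches the message but the selected tool is not allowed."""
    for pattern, allowed_tools in INTENT_RULES:
        if pattern.search(user_message) and selected_tool not in allowed_tools:
            return True
    return False


def response_utilization(returned_values, final_answer):
    """Fraction of returned values that appear verbatim in the final answer."""
    if not returned_values:
        return 0.0
    answer = final_answer.lower()
    used = sum(1 for value in returned_values if str(value).lower() in answer)
    return used / len(returned_values)


question = "Show me last month's invoices"
wrong_call_flagged = selection_defect(question, "list_contacts")
right_call_flagged = selection_defect(question, "search_invoices")

utilization = response_utilization(
    ["INV-1001", "INV-1002", "INV-1003", "INV-1004"],
    "You have two open invoices: INV-1001 and INV-1003.",
)

print(wrong_call_flagged, right_call_flagged, utilization)
```

Verbatim matching undercounts answers that paraphrase or aggregate the data, so treat a low score as a prompt to look at the transcript, not as proof of a defect.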