Your MCP Server Has a Secret Scoring Problem
The Demo That Always Works
Your MCP server works perfectly in demos. You know it does because you built the demo. You hand-picked the prompt, the tool fired, the result came back clean. Everyone nodded. Ship it.
Then production happens.
In production, the LLM picks the wrong tool 30% of the time. Users report weird results. The agent calls search when it should call fetch. It passes a date string where it needs an epoch timestamp. It ignores your most powerful tool entirely because it can't figure out when to use it.
What changed? Nothing in your code. The handlers are fine. The logic is fine. The problem was always there — you just never noticed because demos are scripted and production isn't. In a demo, you control the input. In production, the LLM has to figure out which of your fifteen tools matches a user's ambiguous request. And it's failing at that step far more often than you think.
This is the secret scoring problem. Your server doesn't have a bug. It has a quality gap that only shows up when a language model has to make real decisions about your tools.
Why LLMs Pick the Wrong Tool
Here's something most MCP developers don't internalize: the LLM never reads your code. It never sees your handler logic, your validation rules, your careful error handling. All it sees is a name, a description, and a schema.
That's it. That's the entire interface between your tool and the intelligence that decides whether to use it.
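To make that concrete, here is roughly what a model receives for one tool: three fields, nothing else. The tool itself is a made-up example, not from any real server.

```python
import json

# The entire interface the model sees for one tool: name, description, schema.
# This specific tool and its wording are illustrative.
tool_listing = {
    "name": "search_documents",
    "description": (
        "Full-text search across indexed documents. "
        "Returns the top 10 results ranked by relevance."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
        },
        "required": ["query"],
    },
}

# Note what is absent: no handler code, no validation logic, no error handling.
print(json.dumps(tool_listing, indent=2))
```

Everything your handler does carefully is invisible here. The three fields above are the whole pitch.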
If two tools have similar descriptions, the LLM guesses. It doesn't ask for clarification. It doesn't flag ambiguity. It picks one and moves on. If your description says "search for documents" and another tool says "find documents," the LLM is essentially flipping a coin.
If a description is vague, the LLM hallucinates intent. It fills in the gaps with assumptions. "Manage user settings" could mean read settings, write settings, delete settings, or export settings. The LLM will assume one meaning and commit to it without telling anyone.
Tool selection is a language problem, not a code problem. The engineering instinct is to debug the handler. The actual fix is almost always in the text — the twenty words that describe what your tool does and when to use it. Most teams never look there because they're too busy looking at stack traces.
Anatomy of a Quality Score
A quality score isn't one number. If it were, it would be useless — like giving a restaurant a single rating without distinguishing food from service from ambiance.
A real quality score is six dimensions, each independently measurable, each capable of tanking your tool's reliability on its own.
Description clarity: Does the description explain what the tool does, when to use it, and when not to? Does it specify the response format? A score of 40 here means "the LLM will misuse this tool regularly." A score of 90 means "the LLM almost always picks this tool correctly."
Schema completeness: Are parameters typed precisely? Are required fields actually required? Are enums used where values are constrained? Missing constraints mean the LLM has to guess valid inputs, and it guesses wrong often.
Naming consistency: Do your tools follow a predictable pattern? If you have create_user and delete_user but then fetch_all_documents, the inconsistency creates cognitive overhead for the model. It sounds trivial. It isn't.
Overlap detection: How many of your tools could plausibly match the same user request? High overlap is the single biggest predictor of wrong tool selection.
Error documentation: Does the tool describe what happens when things go wrong? LLMs that don't know a tool can fail will retry indefinitely or switch to the wrong tool after an unexpected response.
Example coverage: Are there usage examples in the description? Examples are worth more than paragraphs of explanation because they anchor the LLM's understanding in concrete patterns.
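One way to picture the aggregate score is as six independent numbers with a floor, not an average that hides weak spots. This sketch is illustrative: the dimension names come from this post, but the equal-weight aggregation and the class itself are assumptions, not a real scoring API.

```python
from dataclasses import dataclass, asdict

# Illustrative only: dimension names from the post; the aggregation
# scheme is an assumption, not a real scorer.
@dataclass
class QualityScore:
    description_clarity: int
    schema_completeness: int
    naming_consistency: int
    overlap: int
    error_documentation: int
    example_coverage: int

    def overall(self) -> float:
        # Plain average: one weak dimension drags the whole score down.
        values = asdict(self).values()
        return sum(values) / len(values)

    def weakest(self) -> tuple[str, int]:
        # The dimension most likely causing failures right now.
        return min(asdict(self).items(), key=lambda kv: kv[1])

score = QualityScore(
    description_clarity=90, schema_completeness=85, naming_consistency=90,
    overlap=40, error_documentation=80, example_coverage=75,
)
```

Here `score.weakest()` points straight at the overlap problem, which is the whole value of breaking the score apart: you get a place to look, not just a grade.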
The Six Dimensions That Matter
Let me make this concrete. Here's what a score of 40 looks like versus 90 across each dimension.
Description clarity at 40: "Searches the database." At 90: "Full-text search across all indexed documents. Returns top 10 results ranked by relevance. Use this when the user wants to find documents by content. Do not use for lookup by exact ID — use get_document instead. Returns title, snippet, and relevance score."
Schema completeness at 40: A single parameter called "query" typed as string, no constraints. At 90: "query" as a non-empty string with max length 500, plus optional "limit" as integer with min 1 max 100 default 10, plus optional "date_range" with start and end as ISO 8601 dates.
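The 90-level schema above, written out as JSON Schema. The field names and constraints match the text; the exact shape of "date_range" is one reasonable reading of it.

```python
# The 90-level schema from the text, expressed as JSON Schema.
# The nested structure of "date_range" is an assumption.
search_schema = {
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "minLength": 1,
            "maxLength": 500,
            "description": "Full-text search terms.",
        },
        "limit": {
            "type": "integer",
            "minimum": 1,
            "maximum": 100,
            "default": 10,
            "description": "Maximum number of results to return.",
        },
        "date_range": {
            "type": "object",
            "properties": {
                "start": {"type": "string", "format": "date"},
                "end": {"type": "string", "format": "date"},
            },
            "required": ["start", "end"],
            "description": "Restrict results to this ISO 8601 date range.",
        },
    },
    "required": ["query"],
}
```

Every constraint here is one fewer guess the model has to make.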
Naming at 40: search, getDoc, user_remove, ListAllItems. At 90: search_documents, get_document, remove_user, list_items. The pattern is verb_noun, consistently.
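That consistency is cheap to check automatically. The regex below encodes the snake_case verb_noun shape; note it only checks shape, so a name like user_remove (noun_verb) would still slip through.

```python
import re

# snake_case with at least two tokens: the verb_noun convention by shape.
# This catches casing and single-word names, not wrong word order.
VERB_NOUN = re.compile(r"^[a-z]+(_[a-z]+)+$")

def inconsistent_names(tool_names):
    """Return tool names that break the snake_case verb_noun shape."""
    return [name for name in tool_names if not VERB_NOUN.fullmatch(name)]

bad = inconsistent_names(["search", "getDoc", "user_remove", "ListAllItems"])
good = inconsistent_names(["search_documents", "get_document", "remove_user", "list_items"])
```

A check this small fits in CI, which is exactly where naming drift should be caught.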
Overlap at 40: search_documents, find_documents, and query_documents all exist and do slightly different things. At 90: search_documents does full-text search, get_document does ID lookup, list_documents does filtered listing. Each has a clear, non-overlapping purpose.
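A crude way to flag overlap is lexical similarity between descriptions. A real scorer would use embeddings; word-level Jaccard is a rough stand-in that still catches the worst cases. The descriptions below are invented for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

# Invented descriptions: the first two overlap badly, the third does not.
descriptions = {
    "search_documents": "search for documents by content",
    "find_documents": "find documents by content",
    "get_document": "fetch a single document by its exact ID",
}

def overlapping_pairs(descs, threshold=0.5):
    """Flag every pair of tools whose descriptions look too similar."""
    names = sorted(descs)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if jaccard(descs[a], descs[b]) >= threshold
    ]
```

On this set, only the search/find pair gets flagged: exactly the coin-flip situation described above.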
Error documentation at 40: tool returns an error string on failure with no structure. At 90: tool documents specific error codes, distinguishes rate limits from auth failures from not-found, and includes retry-after hints for transient errors.
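The 90-level shape might look like this in practice. The codes and field names here are illustrative, not defined by the MCP spec; the point is that each failure is distinguishable and tells the model whether retrying makes sense.

```python
# Illustrative structured errors: the codes and fields are made up,
# not part of any spec. Each one is distinguishable and carries
# a hint about what the caller should do next.
ERRORS = {
    "rate_limited": {
        "code": "RATE_LIMITED",
        "message": "Too many requests; retry after the given delay.",
        "retryable": True,
        "retry_after_seconds": 30,
    },
    "auth_failed": {
        "code": "AUTH_FAILED",
        "message": "API key is missing or invalid. Do not retry.",
        "retryable": False,
    },
    "not_found": {
        "code": "NOT_FOUND",
        "message": "No document exists with that ID. Do not retry.",
        "retryable": False,
    },
}
```

An LLM that sees RATE_LIMITED with a retry hint waits; one that sees an opaque error string retries forever or gives up on the tool entirely.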
Example coverage at 40: no examples. At 90: two or three concrete examples showing typical inputs and expected outputs.
The difference is always specificity. Vague tools fail. Precise tools work. And the gap between 40 and 90 is usually not a rewrite — it's an afternoon of careful editing.
Fixing Descriptions Is 10x Cheaper Than Fixing Code
Most teams debug the wrong layer. They see tool selection failures and immediately go to code. They rewrite handlers to be more forgiving. They add retry logic. They switch to a more expensive model. They build elaborate routing layers on top of MCP to compensate for ambiguity underneath.
All of that is expensive, slow, and addresses the symptom instead of the cause.
The actual fix is usually a 20-word edit to a tool description. Change "manages documents" to "creates a new document from the provided title and body. Returns the document ID. Use this only for creating new documents — to update existing documents, use update_document instead."
That edit took thirty seconds. It eliminates an entire category of tool selection errors. No code changed. No model upgrade needed. No retry logic required.
Quality scoring tells you exactly which 20 words to change. It points at the specific dimension that's failing, on the specific tool that's causing problems, with a specific suggestion for what to write instead. It turns a vague "the AI keeps getting confused" into a precise "your search_documents description doesn't mention the response format, and 34% of failures trace back to the LLM misinterpreting results."
This is why measurement matters. Not because the score itself is valuable, but because it converts an invisible problem into a visible, fixable one. The teams that figure this out stop rewriting handlers and start rewriting descriptions. They ship fixes in minutes instead of days. And their tools just work — not because the code is better, but because the language is.