Scoring Methodology

How we test and score every MCP server — transparently and reproducibly. Version 2.1 · Last updated March 2026.

Philosophy

  • Every score is backed by observable data — test results, GitHub metrics, or server metadata.
  • If we can't measure it, we don't score it. When a dimension cannot be tested, its weight is redistributed proportionally across the measured dimensions.
  • Scoring version tracked. All score outputs carry a version so historical comparisons account for methodology changes.
  • Transparency by default. Raw test observations are visible on every server detail page.

Score Taxonomy

Every server is scored across 6 dimensions. The overall score (0–100) is a weighted average.

  • Reliability (25%): protocol conformance, connection stability, schema validity, error handling
  • Security (25%): poisoning detection, dependency audit, secret scanning, authentication
  • Setup (15%): ease of getting started (README, setup guides, transport DX)
  • Documentation (15%): quality and completeness of descriptions, schemas, categories
  • Compatibility (10%): transport support, schema completeness, tool integration depth
  • Maintenance (10%): GitHub health signals, adjusted for project scale

Adaptive Weighting

When a dimension cannot be tested (e.g., protocol conformance for stdio servers), its weight is redistributed proportionally across the remaining measured dimensions. The overall score is always 0–100 regardless of how many dimensions are measured.

effective_weight[d] = base_weight[d] / sum(base_weights of measured dimensions)
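
A minimal sketch of the calculation, assuming per-dimension scores are already normalized to 0–100; the names and example inputs below are illustrative, not our production code:

  const BASE_WEIGHTS: Record<string, number> = {
    reliability: 0.25, security: 0.25, setup: 0.15,
    documentation: 0.15, compatibility: 0.10, maintenance: 0.10,
  };

  // "measured" maps each dimension we could actually test to its 0-100 score.
  function overallScore(measured: Record<string, number>): number {
    // Sum of base weights for the dimensions that were measured.
    const weightSum = Object.keys(measured)
      .reduce((sum, d) => sum + (BASE_WEIGHTS[d] ?? 0), 0);
    // effective_weight[d] = base_weight[d] / weightSum
    return Object.entries(measured)
      .reduce((total, [d, score]) => total + score * (BASE_WEIGHTS[d] ?? 0) / weightSum, 0);
  }

  // Example: reliability is not_testable, so its 25% is redistributed.
  // overallScore({ security: 80, setup: 90, documentation: 70, compatibility: 60, maintenance: 85 })
  // => ~78 on the 0-100 scale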

Reliability Status Classification

Not every server can be fully tested. Rather than showing a misleading 0/10 for servers we couldn't reach, we classify reliability into three transparent statuses:

  • measured: full protocol test completed; the score reflects actual connection stability, schema validity, and error handling.
  • partial: connected to the server but could not test individual tools; the score is shown with a caveat.
  • not_testable: could not complete testing due to OAuth requirements, sandbox restrictions, transport limitations, or write-only tools; shown as N/A, not a negative signal.

When reliability is not_testable, its weight is excluded from the overall score calculation — the server is not penalized for sandbox limitations.
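
As a rough sketch, a reliability result can be modeled as a tagged status whose score only feeds the overall calculation when one exists (the field names here are illustrative assumptions):

  type ReliabilityResult =
    | { status: 'measured'; score: number }                 // full protocol test completed
    | { status: 'partial'; score: number; caveat: string }  // score shown with a caveat
    | { status: 'not_testable'; reason: string };           // shown as N/A

  // null means "exclude reliability from the weighted average entirely".
  function reliabilityScoreOrNull(result: ReliabilityResult): number | null {
    return result.status === 'not_testable' ? null : result.score;
  }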

Evidence-Based Testing

We classify each tool by risk level and only probe safe operations during testing. Write operations and unclassified tools are skipped — we never mutate state on tested servers without explicit sandbox setup.

  • Safe probes only. Read operations are tested with appropriate fixture data. Write operations are tested in isolated sandbox environments when available.
  • Every observation recorded. Tool name, status, latency, and errors are captured per test run and visible on detail pages.
  • Structured evidence. Raw observations are normalized into tool results and failure patterns for consistent presentation.
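
A sketch of the probe decision described above, assuming each tool carries a risk label (the labels and field names are illustrative):

  type ToolRisk = 'read' | 'write' | 'unclassified';

  interface ToolObservation {
    tool: string;
    status: 'passed' | 'failed' | 'skipped';
    latencyMs?: number;  // captured per test run
    error?: string;      // captured when a probe fails
  }

  function shouldProbe(risk: ToolRisk, sandboxAvailable: boolean): boolean {
    if (risk === 'read') return true;               // safe probe with fixture data
    if (risk === 'write') return sandboxAvailable;  // only in an isolated sandbox
    return false;                                   // unclassified tools are skipped
  }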

Security Scoring

Security scores are based on deterministic evidence from vulnerability databases, repository security signals, and secret-scanning systems. Supplemental triage signals may inform prioritization but never serve as the sole basis for a negative security score.

  • Poisoning: tool description injection detection via pattern matching
  • Dependencies: lockfile presence, known vulnerabilities via OSV and GitHub Advisory
  • Secrets: hardcoded credentials and API keys in source code
  • Auth: authentication method appropriateness for the transport type
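
For the Poisoning category above, a simplified view of pattern-based detection over tool descriptions; the patterns shown are illustrative examples, not our full rule set:

  const POISONING_PATTERNS: RegExp[] = [
    /ignore (all )?previous instructions/i,
    /do not (tell|mention|reveal) (this )?to the user/i,
    /<(system|important)>/i,
  ];

  // Returns the patterns a tool description matches; any match is flagged for review.
  function scanDescription(description: string): string[] {
    return POISONING_PATTERNS.filter((p) => p.test(description)).map((p) => p.source);
  }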

Evidence Sources

  • OSV.dev: known vulnerabilities (CVE/GHSA); deterministic, scoreable
  • GitHub Dependabot: known vulnerabilities (GHSA); deterministic, scoreable
  • Regex scan (secrets): leaked secrets in source; deterministic, scoreable
  • Regex scan (poisoning): prompt injection patterns; deterministic, scoreable
  • Supplemental triage: additional review signals; non-deterministic, informational
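
One way to read this list in code: only deterministic, scoreable sources can move a security score, while informational sources feed prioritization only (the names below are illustrative):

  interface EvidenceSource {
    name: string;           // e.g. 'OSV.dev', 'Supplemental triage'
    deterministic: boolean;
    scoreable: boolean;     // informational sources never lower a score on their own
  }

  function canAffectSecurityScore(source: EvidenceSource): boolean {
    return source.deterministic && source.scoreable;
  }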

Quality Gate

Server pages are only indexed after passing automated quality checks on test coverage, data completeness, and scoring consistency. Pages that don't meet our quality bar remain hidden from search engines until sufficient evidence is available.
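
A minimal sketch of the gate, assuming each check resolves to pass/fail before a page is marked indexable; the check names mirror the criteria above, and the all-must-pass rule is an assumption:

  interface QualityChecks {
    testCoverage: boolean;
    dataCompleteness: boolean;
    scoringConsistency: boolean;
  }

  // Pages failing the gate stay hidden from search engines until evidence improves.
  function isIndexable(checks: QualityChecks): boolean {
    return checks.testCoverage && checks.dataCompleteness && checks.scoringConsistency;
  }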

Badge Criteria

Badges are awarded based on score thresholds across reliability, security, and overall quality.

  • Lab Tested: the server has been tested with sufficient coverage and meets minimum quality standards.
  • Vendor Verified: the server demonstrates high reliability and security scores from structured testing.
  • Security Scanned: the server has passed security scanning with satisfactory results.
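
A sketch of how thresholds could map to badges; the numeric cutoffs and the coverage metric below are hypothetical placeholders, not our published criteria:

  interface ServerScores {
    overall: number;       // 0-100
    reliability: number;   // 0-100
    security: number;      // 0-100
    testCoverage: number;  // fraction of tools probed, 0-1 (assumed metric)
  }

  function awardedBadges(s: ServerScores): string[] {
    const badges: string[] = [];
    if (s.testCoverage >= 0.8 && s.overall >= 60) badges.push('Lab Tested');       // hypothetical cutoffs
    if (s.reliability >= 80 && s.security >= 80) badges.push('Vendor Verified');   // hypothetical cutoffs
    if (s.security >= 70) badges.push('Security Scanned');                         // hypothetical cutoff
    return badges;
  }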

Vulnerability Disclosure Policy

We publicly list only already-disclosed vulnerabilities from authoritative databases (OSV, GitHub Advisory). Newly discovered findings from internal scans are kept private until responsibly disclosed to the vendor.

  • public_known: known public CVE/GHSA/advisory; listed immediately
  • private_triage: new finding under investigation; not shown publicly
  • private_disclosed: confirmed and reported to the vendor; awaiting a fix
  • public_disclosed: disclosure process complete; listed publicly
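
Read as a lifecycle, the statuses above imply the following transitions; this sketch is inferred from the descriptions, and the actual workflow may include additional states:

  type DisclosureStatus = 'public_known' | 'private_triage' | 'private_disclosed' | 'public_disclosed';

  const NEXT_STATUS: Record<DisclosureStatus, DisclosureStatus[]> = {
    public_known: [],                         // already public; listed immediately
    private_triage: ['private_disclosed'],    // confirmed and reported to the vendor
    private_disclosed: ['public_disclosed'],  // disclosure process completes
    public_disclosed: [],
  };

  function isPubliclyListed(status: DisclosureStatus): boolean {
    return status === 'public_known' || status === 'public_disclosed';
  }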

Transparency & Re-test Policy

  • Raw test logs are visible on every server detail page.
  • Maintainers can request a re-test by contacting us.
  • Methodology changes are tracked and versioned.
  • Paid features and untestable tools are excluded from scoring — servers are not penalized for limitations outside their control.