LLM Evaluation Scorecard
A structured scorecard for evaluating LLM outputs against product quality criteria. Use it when comparing models or prompt variants, or when assessing whether output quality meets the bar to ship. Free to copy, download, and use. No signup required.
# LLM Evaluation Scorecard
**Feature / use case:** [Name]
**Evaluator(s):** [Names]
**Date:** [Date]
**Models / prompts compared:** [List]
---
## 1. What we're evaluating
**Task description:**
[Plain description of what the LLM is being asked to do — e.g. "Extract structured pain points from a user interview transcript"]
**Input format:**
[Describe the input — transcript, document, structured data, etc.]
**Expected output format:**
[Describe the expected output — e.g. "JSON array of {type, content, quote} objects"]
**Sample inputs used in this evaluation:**
| ID | Input description | Source |
|---|---|---|
| S1 | [Brief description] | [e.g. Real customer transcript] |
| S2 | [Brief description] | |
| S3 | [Brief description] | |
---
## 2. Scoring rubric
Score each criterion 1–5 using the definitions below.
| Score | Meaning |
|---|---|
| 5 | Excellent — meets or exceeds expectations, no issues |
| 4 | Good — minor issues that don't affect usability |
| 3 | Acceptable — noticeable issues but output is still usable |
| 2 | Poor — significant issues, output needs substantial correction |
| 1 | Failing — output is incorrect, harmful, or unusable |
---
## 3. Evaluation criteria
### A. Accuracy & factual correctness
*Does the output contain only information that can be traced to the input? No hallucinated facts.*
| Sample | Model A | Model B | Notes |
|---|---|---|---|
| S1 | /5 | /5 | |
| S2 | /5 | /5 | |
| S3 | /5 | /5 | |
| **Average** | | | |
### B. Completeness
*Does the output capture all relevant information from the input? Nothing significant is missed.*
| Sample | Model A | Model B | Notes |
|---|---|---|---|
| S1 | /5 | /5 | |
| S2 | /5 | /5 | |
| S3 | /5 | /5 | |
| **Average** | | | |
### C. Format adherence
*Does the output match the required format exactly — correct JSON schema, required sections, length constraints?*
| Sample | Model A | Model B | Notes |
|---|---|---|---|
| S1 | /5 | /5 | |
| S2 | /5 | /5 | |
| S3 | /5 | /5 | |
| **Average** | | | |
### D. Relevance
*Is the output relevant to the task? No off-topic content, tangents, or unnecessary caveats.*
| Sample | Model A | Model B | Notes |
|---|---|---|---|
| S1 | /5 | /5 | |
| S2 | /5 | /5 | |
| S3 | /5 | /5 | |
| **Average** | | | |
### E. Tone & voice
*Does the output match the expected tone — professional, concise, consistent with product voice?*
| Sample | Model A | Model B | Notes |
|---|---|---|---|
| S1 | /5 | /5 | |
| S2 | /5 | /5 | |
| S3 | /5 | /5 | |
| **Average** | | | |
### F. Edge case handling
*How does the model behave on ambiguous, short, noisy, or out-of-distribution inputs?*
| Edge case | Model A | Model B | Notes |
|---|---|---|---|
| Empty / very short input | /5 | /5 | |
| Input in unexpected language | /5 | /5 | |
| Input with contradictory information | /5 | /5 | |
| Sensitive / potentially harmful input | /5 | /5 | |
| **Average** | | | |
---
## 4. Summary scorecard
| Criterion | Weight | Model A (raw avg) | Model A (weighted) | Model B (raw avg) | Model B (weighted) |
|---|---|---|---|---|---|
| Accuracy | 30% | | | | |
| Completeness | 25% | | | | |
| Format adherence | 20% | | | | |
| Relevance | 15% | | | | |
| Tone & voice | 5% | | | | |
| Edge case handling | 5% | | | | |
| **Total weighted score** | 100% | | **/5** | | **/5** |
---
## 5. Latency & cost
| Metric | Model A | Model B |
|---|---|---|
| Median latency (P50) | | |
| P95 latency | | |
| Avg tokens in | | |
| Avg tokens out | | |
| Cost per 1K requests (estimated) | | |
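
As a rough sketch of how the rows above might be filled in from raw measurements: the snippet below computes P50 and P95 with the nearest-rank method (one of several percentile definitions) and estimates cost per 1K requests from average token counts. The latency values and per-million-token prices are hypothetical placeholders, not real provider pricing.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def cost_per_1k_requests(avg_tokens_in, avg_tokens_out, price_in_per_1m, price_out_per_1m):
    """Estimated cost of 1,000 requests, given prices per one million tokens."""
    per_request = (avg_tokens_in * price_in_per_1m + avg_tokens_out * price_out_per_1m) / 1_000_000
    return per_request * 1000

# Hypothetical measurements for one model, in milliseconds.
latencies_ms = [420, 380, 510, 1900, 450, 400, 470, 390, 430, 600]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
# Hypothetical prices: $3.00 per 1M input tokens, $15.00 per 1M output tokens.
cost = cost_per_1k_requests(1200, 300, 3.00, 15.00)

print(p50, p95, round(cost, 2))  # 430 1900 8.1
```

Note how a single slow request dominates P95 in a sample this small; with only a handful of measurements the tail figure is mostly a reminder to collect more.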
---
## 6. Failure modes observed
**Model A failures:**
- [Describe specific failures observed — type, frequency, severity]
**Model B failures:**
- [Describe specific failures observed]
---
## 7. Decision
**Selected model / prompt:** [Name]
**Reason:** [1–2 sentences explaining why this choice won on the criteria that matter most]
**Minimum bar met?**
- [ ] Accuracy ≥ [target]/5
- [ ] P95 latency ≤ [target]ms
- [ ] Cost per 1K requests ≤ $[target]
- [ ] No critical failure modes observed
**Conditions / caveats for shipping:**
[Any known gaps or edge cases to monitor post-launch]
How to use this LLM Eval template
Weight criteria before you see the scores
The weights in Section 4 should be agreed before running the evaluation. If you see the scores first, you'll unconsciously weight the criteria that favour the model you prefer. For a PRD generation task, accuracy might be 40%; for a copywriting task, tone might be 30%. Set weights based on what matters to users.
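
The weighted total in Section 4 is a weighted sum of the raw averages. A minimal sketch, assuming the default weights from the template and hypothetical raw averages for one model (the dictionary keys are illustrative names, not part of the template):

```python
# Weights agreed before scoring; these mirror the defaults in Section 4.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "format_adherence": 0.20,
    "relevance": 0.15,
    "tone_voice": 0.05,
    "edge_cases": 0.05,
}

def weighted_total(raw_averages, weights=WEIGHTS):
    """Return the weighted score on the same 1-5 scale as the raw averages."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 100%")
    missing = set(weights) - set(raw_averages)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(weights[name] * raw_averages[name] for name in weights)

# Hypothetical raw averages for Model A.
model_a = {
    "accuracy": 4.0,
    "completeness": 4.0,
    "format_adherence": 5.0,
    "relevance": 4.0,
    "tone_voice": 3.0,
    "edge_cases": 2.0,
}

print(round(weighted_total(model_a), 2))  # 4.05
```

Keeping the weights in one place, fixed before any scores exist, makes it harder to adjust them afterwards to favour a preferred model.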
Use real customer data as evaluation samples
Synthetic inputs produce artificially clean results. Real customer transcripts, support tickets, or feedback have the noise, ambiguity, and off-topic content that models encounter in production. An LLM that scores 4.8/5 on clean samples and 2/5 on real inputs will fail in production.
Always include at least one adversarial / edge case input
Test the empty input, the 1-word input, the input in an unexpected language, and the input that could produce harmful output. Models that aren't tested on edge cases ship with silent failure modes — output that looks correct but isn't, or refusals that show a raw error to users.
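
One way to catch silent failures on edge cases is to validate every output mechanically before a human scores it. The sketch below assumes the example output format from Section 1 (a JSON array of objects with `type`, `content`, and `quote` keys); the function name and the sample outputs are hypothetical.

```python
import json

REQUIRED_KEYS = {"type", "content", "quote"}

def validate_output(raw):
    """Return a list of problems; an empty list means the output passed the format check."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON (possible refusal or free-text reply)"]
    if not isinstance(data, list):
        return ["top-level value is not an array"]
    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
        elif set(item) != REQUIRED_KEYS:
            problems.append(f"item {i} has keys {sorted(item)}")
    return problems

# Hypothetical model outputs collected from edge case inputs.
edge_case_outputs = {
    "empty input": "[]",
    "harmful input": "I'm sorry, I can't help with that.",
    "one-word input": '[{"type": "pain_point", "content": "Slow export"}]',
}

for name, raw in edge_case_outputs.items():
    print(name, validate_output(raw))
```

A check like this only covers format. An empty array for an empty input may be the right answer or a silent miss, which is still a judgement for the evaluator.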
Document failures, not just scores
An average score of 3.8 doesn't tell you whether the model fails consistently on one type of input or intermittently across many. Section 6 (failure modes) is often more valuable than the scorecard itself — it tells you what to monitor and what to include in the system prompt to mitigate.
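
To see whether a middling average hides a concentrated failure, group scores by input type before averaging. A small sketch with hypothetical scores, chosen so the overall average is 3.8:

```python
from collections import defaultdict
from statistics import mean

def scores_by_input_type(rows):
    """Group (input_type, score) pairs and return the average score per input type."""
    grouped = defaultdict(list)
    for input_type, score in rows:
        grouped[input_type].append(score)
    return {input_type: mean(scores) for input_type, scores in grouped.items()}

# Hypothetical accuracy scores for one model.
rows = [
    ("clean transcript", 5),
    ("clean transcript", 5),
    ("clean transcript", 5),
    ("noisy transcript", 2),
    ("noisy transcript", 2),
]

print(mean(score for _, score in rows))  # 3.8
print(scores_by_input_type(rows))
```

Here the overall 3.8 hides a model that is excellent on clean inputs and poor on noisy ones, which is the kind of pattern worth recording in Section 6.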
Frequently asked questions
How many samples do I need for a reliable evaluation?
For an initial model selection decision, 20–50 diverse samples are usually enough to identify clear quality differences. For a go/no-go shipping decision, 100+ samples with a mix of real and edge case inputs give higher confidence. For high-stakes applications (medical, legal, financial), consider 500+ samples with domain expert labelling.
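
To get a feel for why more samples give higher confidence, the sketch below estimates the margin of error around an average score using a normal approximation. This is a rough guide only: the approximation is weak for small samples and for a bounded 1–5 scale, and the scores are hypothetical.

```python
import math
from statistics import mean, stdev

def mean_with_margin(scores, z=1.96):
    """Return (mean, margin), where mean +/- margin is a rough 95% interval."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    margin = z * stdev(scores) / math.sqrt(len(scores))
    return mean(scores), margin

pilot = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4]  # hypothetical scores, n = 10
larger = pilot * 10                     # same spread of scores, n = 100

for label, scores in (("n=10", pilot), ("n=100", larger)):
    m, margin = mean_with_margin(scores)
    print(f"{label}: {m:.2f} +/- {margin:.2f}")
```

If two models' intervals overlap heavily, the difference between their averages may be noise, and more samples are needed before choosing on score alone.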
Should evaluations be done by PM, engineering, or a domain expert?
Ideally, use at least two independent evaluators to reduce bias. For domain-specific tasks (legal, medical, fintech), include a domain expert. PM evaluates against user needs; engineering evaluates format and reliability. Inter-rater agreement (how often your evaluators agree) is itself a useful signal.
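
Agreement between two evaluators can be measured in more than one way. The sketch below shows raw percent agreement and Cohen's kappa, a common statistic that corrects for agreement expected by chance. Choosing kappa here is an assumption, not something the template prescribes, and the scores are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of samples where both raters gave exactly the same score."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("raters must score the same non-empty set of samples")
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters, treating each score as an unordered category."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same score independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical score throughout
    return (observed - expected) / (1 - expected)

# Hypothetical scores from two evaluators on the same eight samples.
pm_scores = [5, 4, 4, 3, 2, 5, 4, 3]
eng_scores = [5, 4, 3, 3, 2, 4, 4, 3]

print(percent_agreement(pm_scores, eng_scores))       # 0.75
print(round(cohens_kappa(pm_scores, eng_scores), 2))  # 0.65
```

Plain kappa counts a 4 versus a 5 as a full disagreement, the same as a 1 versus a 5. For an ordinal 1–5 rubric, a weighted variant may be a better fit.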
How often should we re-run evaluations?
Re-evaluate when: (1) the model provider releases a new version, (2) you change the prompt significantly, (3) online monitoring signals a quality regression, or (4) the input distribution changes materially. Model quality is not static — a model that passed in January may behave differently after a provider update.
What if both models score similarly?
Break ties on cost and latency first. If those are similar too, default to the model with better edge case handling (Section 3F) — edge cases are where user trust erodes. If still tied, pick the model with the larger provider ecosystem for tooling, support, and longevity.
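
The tie-break order above can be written down as a small decision function. This is a sketch under stated assumptions: the thresholds for "similar" (0.1 points on score, 10% on cost and latency) are hypothetical, cost is checked before latency as one reading of "cost and latency first", and the field names and figures are illustrative.

```python
def similar(x, y, rel_tol):
    """True when two positive values differ by no more than rel_tol of the larger one."""
    return abs(x - y) <= rel_tol * max(x, y)

def pick_model(a, b, score_tol=0.1, rel_tol=0.10):
    """Return the preferred model's name, or None if the tie needs a human judgement."""
    # A clear difference in weighted score decides first.
    if abs(a["weighted_score"] - b["weighted_score"]) > score_tol:
        return max(a, b, key=lambda m: m["weighted_score"])["name"]
    # Otherwise the cheaper model wins, then the faster one.
    for metric in ("cost_per_1k", "p95_ms"):
        if not similar(a[metric], b[metric], rel_tol):
            return min(a, b, key=lambda m: m[metric])["name"]
    # Still tied: prefer the better edge case handling (Section 3F).
    if abs(a["edge_case_avg"] - b["edge_case_avg"]) > score_tol:
        return max(a, b, key=lambda m: m["edge_case_avg"])["name"]
    return None  # fall back to ecosystem, tooling, and support

# Hypothetical results for two models.
model_a = {"name": "Model A", "weighted_score": 4.05, "cost_per_1k": 8.1, "p95_ms": 1900, "edge_case_avg": 2.0}
model_b = {"name": "Model B", "weighted_score": 4.00, "cost_per_1k": 8.0, "p95_ms": 1200, "edge_case_avg": 3.5}

print(pick_model(model_a, model_b))  # Model B
```

In this example the scores and costs are close, so the lower P95 latency decides. A `None` result means the numbers cannot separate the models and the ecosystem comparison takes over.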