Prompt Design Template

A structured template for PMs and engineers designing LLM prompts for production features. Covers system prompt, user input handling, output format, edge cases, evaluation criteria, and version history. Free to copy, download, and use. No signup required.

Template
# Prompt Design Template
**Feature / use case:** [Name]
**Model:** [e.g. claude-sonnet-4-6, gpt-4o]
**PM:** [Name]
**Engineer:** [Name]
**Version:** v1.0
**Date:** [Date]

---

## 1. Purpose

**What is this prompt trying to accomplish?**
[1–2 sentences. Be specific about the output type — classification, generation, extraction, summarisation, etc.]

**Who is the end user?**
[Describe the user and their context — e.g. "A product manager reviewing customer feedback"]

**Where does this prompt run?**
[ ] User-triggered (on demand)  [ ] Background task (async)  [ ] Scheduled batch  [ ] Real-time (< 2s required)

---

## 2. Inputs

| Input | Type | Required | Source | Notes |
|---|---|---|---|---|
| [Input 1] | string / int / list | Yes / No | User / DB / API | [e.g. max 5,000 chars] |
| [Input 2] | | | | |
| [Input 3] | | | | |

**Input validation rules:**
- Minimum length: [e.g. 50 characters — reject if shorter]
- Maximum length: [e.g. 8,000 tokens — truncate or split if longer]
- Sanitisation: [e.g. strip HTML, remove PII before sending]
- Language handling: [English only / multilingual / detect and route]

---

## 3. System prompt

```
[Paste the full system prompt here]
```

**Prompt design notes:**
- Persona: [Who does the model play? e.g. "You are an expert product manager..."]
- Tone: [e.g. direct, concise, no filler phrases]
- Constraints: [e.g. "Never mention competitor products", "Always respond in the user's language"]
- Format instruction: [e.g. "Respond in JSON", "Use markdown headers", "Maximum 300 words"]

---

## 4. Expected output

**Output format:** [ ] Plain text  [ ] Markdown  [ ] JSON  [ ] Structured list  [ ] Other: ___

**Output schema (if JSON):**
```json
{
  "field_1": "string",
  "field_2": ["array", "of", "strings"],
  "field_3": {
    "nested_field": "string"
  }
}
```

**Output length target:** [e.g. 200–400 words, 5–10 bullet points, exactly 3 items]

**What a good output looks like:**
[Paste 1–2 example ideal outputs — this becomes your evaluation gold standard]

---

## 5. Edge cases & failure modes

| Scenario | Expected behaviour | Tested? |
|---|---|---|
| Input is too short to be meaningful | [e.g. Return error message, do not call model] | [ ] |
| Input is in an unsupported language | [e.g. Respond in English, flag language mismatch] | [ ] |
| Input contains PII (names, emails, phone numbers) | [e.g. Strip before sending, or reject with message] | [ ] |
| Input is adversarial / prompt injection attempt | [e.g. System prompt instructs model to ignore user overrides] | [ ] |
| Model returns malformed JSON | [e.g. Retry once, then return fallback response] | [ ] |
| Model is unavailable (timeout / rate limit) | [e.g. Queue and retry, or show user error] | [ ] |
| Output exceeds maximum length | [e.g. Truncate at sentence boundary, not mid-word] | [ ] |

---

## 6. Evaluation criteria

Define what "good" means before you start testing. Rate each dimension 1–5.

| Dimension | Definition of 5 (excellent) | Definition of 1 (failure) | Weight |
|---|---|---|---|
| Accuracy | Output is factually correct and grounded in input | Output contains hallucinations or invented facts | High |
| Completeness | All required elements are present | Key sections are missing | High |
| Format compliance | Output exactly matches the required format | Output ignores format instructions | Medium |
| Tone / voice | Output matches the defined persona and tone | Output is off-brand or inconsistent | Low |
| Conciseness | Output contains no unnecessary filler | Output is padded or repetitive | Medium |

**Minimum acceptable score to ship:** [e.g. Average ≥ 3.5 across all dimensions on 20 test cases]

**Evaluation method:**
[ ] Manual review by PM  [ ] Manual review by domain expert  [ ] Automated LLM-as-judge  [ ] A/B test with users

---

## 7. Cost & latency

| Metric | Estimate | Acceptable limit |
|---|---|---|
| Average input tokens | | |
| Average output tokens | | |
| Estimated cost per call | $0.00 | $0.00 |
| Estimated calls / day | | |
| Estimated monthly cost | $0.00 | $0.00 |
| P50 latency | | |
| P95 latency | | |

**Cost optimisation applied:**
- [ ] Prompt caching enabled (if provider supports it)
- [ ] Output length constrained in system prompt
- [ ] Batch processing used where real-time is not required
- [ ] Cheaper model tested and evaluated for this use case

---

## 8. Version history

| Version | Date | Author | Changes | Eval score |
|---|---|---|---|---|
| v1.0 | [Date] | [Name] | Initial version | |
| | | | | |

**Rollback plan:** [e.g. "v1.0 prompt is stored in DB and can be re-deployed without a code push"]

How to use this Prompt Design template

1. Write the expected output before writing the prompt

Most teams write a prompt, run it, then decide whether the output is good. Standards then drift toward whatever the model happens to produce. Instead, write 5–10 ideal outputs first, record the most representative one or two in Section 4, then write the prompt to produce them. The ideal outputs become your evaluation gold standard and make it obvious when a prompt change has regressed quality.
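
One way to make a regression obvious is to rate each gold-standard test case on the Section 6 dimensions and compare averages between prompt versions. The sketch below assumes the 1–5 ratings already exist (from manual review or a judge model). The numeric weights standing in for High / Medium / Low, the 3.5 threshold, and all function names are illustrative assumptions, not part of the template.

```python
from statistics import mean

# Hypothetical numeric stand-ins for the High / Medium / Low weights in Section 6.
WEIGHTS = {"accuracy": 3, "completeness": 3, "format": 2, "conciseness": 2, "tone": 1}


def weighted_score(ratings: dict[str, int]) -> float:
    """Weighted average of one test case's 1-5 ratings."""
    total_weight = sum(WEIGHTS[dim] for dim in ratings)
    return sum(WEIGHTS[dim] * score for dim, score in ratings.items()) / total_weight


def eval_summary(cases: list[dict[str, int]], threshold: float = 3.5) -> dict:
    """Average the per-case scores and compare against the ship threshold."""
    average = mean(weighted_score(case) for case in cases)
    return {"average": round(average, 2), "ship": average >= threshold}


def regressed(old: dict, new: dict, tolerance: float = 0.1) -> bool:
    """True if the new prompt version scores noticeably worse than the old one."""
    return new["average"] < old["average"] - tolerance
```

Running `eval_summary` on the same test cases for every prompt version gives the number that goes in the "Eval score" column of Section 8, and `regressed` flags a drop larger than the chosen tolerance.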

2. Fill in the edge cases table before the first demo

Edge cases like prompt injection, PII in input, and malformed JSON always come up in demos or with early users. Filling in Section 5 before you demo forces you to handle them in code, rather than scrambling to fix them after they have embarrassed someone publicly.
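
As a rough illustration of handling edge cases in code, the sketch below covers three rows of the Section 5 table: input that is too short, input that is too long, and malformed JSON from the model. The length limits, the fallback message, and the `call_model` callable (a stand-in for whatever client your provider offers) are all assumptions. Prompt injection and PII handling are not covered here.

```python
import json
from typing import Callable

MIN_CHARS = 50       # assumed limit, mirrors the example in Section 2
MAX_CHARS = 5_000    # assumed limit, mirrors the example in Section 2
FALLBACK = {"status": "error", "message": "Could not generate a response."}


def run_prompt(user_input: str, call_model: Callable[[str], str], max_attempts: int = 2) -> dict:
    """Validate input, call the model, and retry once on malformed JSON."""
    text = user_input.strip()
    if len(text) < MIN_CHARS:
        # Too short to be meaningful: return an error without calling the model.
        return {"status": "error", "message": "Input is too short."}
    text = text[:MAX_CHARS]  # simple truncation; splitting the input is the other option

    for _ in range(max_attempts):
        raw = call_model(text)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: try again
        if isinstance(parsed, dict):
            return {"status": "ok", "data": parsed}
    # Every attempt failed: return a copy of the fallback response.
    return dict(FALLBACK)
```

Passing the model call in as a function keeps the sketch provider-agnostic and lets each row of the table be tested with a stub before a real model is involved.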

3. Calculate cost before you show the feature to leadership

Section 7 exists because teams regularly ship AI features without knowing their per-call cost. A feature that costs $0.05/call at 10 users/day is fine. At one call per user per day, the same feature at 10,000 users/day costs $500/day, or about $15,000/month, which is a budget crisis. Fill this in during development, not after launch.
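
The arithmetic behind Section 7 is simple enough to script. The sketch below assumes the provider bills per million input and output tokens, and it makes calls per user an explicit parameter, since the daily total scales directly with it. The function and parameter names are illustrative, and any prices you pass in should come from your provider's current price list.

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


def daily_cost(per_call: float, users_per_day: int, calls_per_user: float = 1.0) -> float:
    """Dollar cost per day; calls_per_user is the assumption that is easy to forget."""
    return per_call * users_per_day * calls_per_user


def monthly_cost(per_day: float, days: int = 30) -> float:
    """Dollar cost per month, assuming a flat number of days."""
    return per_day * days
```

At $0.05 per call, `daily_cost(0.05, 10_000)` comes to $500 with one call per user, and passing `calls_per_user=3` triples it to $1,500. Stating that multiplier explicitly in the table avoids surprises later.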

4. Increment version number every time the system prompt changes

A prompt change is a code change. Version it like one. Teams that don't version prompts lose track of what changed when quality degrades. The version history table (Section 8) plus an eval score per version gives you a changelog that makes regression analysis tractable.
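
A minimal sketch of the Section 8 table as a data structure, with a fingerprint of the prompt text so an unversioned edit can be detected. The class names, fields, and the choice of a truncated SHA-256 digest are assumptions for illustration. In practice the history would be persisted in version control or a database rather than held in memory.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


def fingerprint(prompt: str) -> str:
    """Short SHA-256 fingerprint of a prompt's text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


@dataclass
class PromptVersion:
    version: str
    prompt: str
    author: str
    changes: str
    eval_score: Optional[float] = None


@dataclass
class PromptHistory:
    versions: list[PromptVersion] = field(default_factory=list)

    def add(self, entry: PromptVersion) -> None:
        """Append a version, rejecting reused version labels."""
        if any(v.version == entry.version for v in self.versions):
            raise ValueError(f"version {entry.version} already exists")
        self.versions.append(entry)

    def is_current(self, deployed_prompt: str) -> bool:
        """True if the deployed prompt text matches the latest recorded version."""
        if not self.versions:
            return False
        return fingerprint(deployed_prompt) == fingerprint(self.versions[-1].prompt)
```

A check like `is_current` can run in CI or at service start-up: if the deployed prompt no longer matches the latest recorded version, someone changed the prompt without incrementing the version.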


Frequently asked questions

Should this document live in code or in a product/PM tool?

The system prompt itself should live in code (or a config layer that is version-controlled). This design document should live wherever your PRDs and specs live — Notion, Confluence, or alongside the PRD in your PM tool. The document describes intent and evaluation criteria; the code contains the actual prompt. Both need to stay in sync.

How often should we re-evaluate prompts after launch?

Run a manual evaluation pass whenever: (a) the underlying model is updated by the provider, (b) the input distribution changes significantly (new user segment, new data source), (c) user complaints about AI quality spike, or (d) you change the system prompt. A lightweight eval (10–20 test cases) takes under an hour and catches most regressions before users report them.

What is 'LLM-as-judge' evaluation and when should we use it?

LLM-as-judge is when you use a second LLM call to evaluate the output of the first — e.g. asking Claude to rate whether a generated PRD section is accurate and complete on a 1–5 scale. It is useful for high-volume evaluation where manual review doesn't scale, but it introduces its own biases (models tend to prefer their own style). Use it alongside manual review for the first few evaluation rounds, not as a replacement.
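
A minimal sketch of the pattern, assuming the judge is asked to reply in a fixed `SCORE: <number>` format. The judge prompt wording, the reply format, and the `call_judge` callable (a stand-in for your provider's client) are all hypothetical. Replies that do not parse return `None`, so they can be routed to manual review rather than silently scored.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are grading the output of another model.
Rate the OUTPUT for accuracy and completeness against the INPUT on a 1-5 scale.
Reply with a single line: SCORE: <number>

INPUT:
{source}

OUTPUT:
{output}
"""


def parse_score(reply: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if it is missing."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(source: str, output: str, call_judge: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a score; None means 'send to manual review'."""
    prompt = JUDGE_TEMPLATE.format(source=source, output=output)
    return parse_score(call_judge(prompt))
```

Comparing these scores with human ratings on the same test cases for the first few rounds shows how far the judge can be trusted before it is used at volume.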