A/B Test Plan Template

An A/B test plan template for product and growth PMs. Covers hypothesis, variants, sample size calculation, success metrics, guardrail metrics, and a results analysis section. Free to copy, download, and use. No signup required.

Template
# A/B Test Plan

**Test Name:** [Short descriptive name — e.g. "Onboarding CTA Copy Test"]
**Author:** [PM Name]
**Date Created:** [Date]
**Target Launch Date:** [Date]
**Status:** Planned / Running / Complete / Stopped

---

## Hypothesis

> If we [change], then [metric] will [increase / decrease] by [X%] because [reason based on evidence].

**Full hypothesis:**
> If we **[describe the change]**, then **[primary metric]** will **[direction + magnitude]** because **[insight from research, prior test, or theory that explains the mechanism]**.

**Evidence that informed this hypothesis:**
- [User interview insight, heatmap data, prior test result, or competitor observation]

---

## Background & Motivation

[2–3 sentences: What problem are you solving? Why is this worth testing now? What do you expect to learn?]

---

## Variants

| Variant | Description | Visual / Link |
|---|---|---|
| **Control (A)** | Current experience — no changes | [Screenshot / Figma link] |
| **Treatment (B)** | [What changes — be precise about every element that differs] | [Screenshot / Figma link] |
| **Treatment (C)** *(if applicable)* | [Third variant] | [Link] |

> ⚠️ Change only one thing between control and treatment unless this is a multivariate test. Multiple simultaneous changes make it impossible to know what caused the result.

---

## Target Audience

**Who sees this test:**
- [User segment — e.g. "New signups in their first 7 days"]
- [Inclusion criteria — e.g. "Desktop only, English locale"]
- [Exclusion criteria — e.g. "Existing customers, internal team"]

**Traffic allocation:**
- Control: [50%]
- Treatment: [50%]

---

## Metrics

### Primary Metric (the one that determines winner)
- **Metric:** [e.g. "Onboarding completion rate"]
- **Current baseline:** [X%]
- **Minimum Detectable Effect (MDE):** [+Y% relative — e.g. "+10% relative = from 40% to 44%"]
- **Direction:** [Increase / Decrease]

### Secondary Metrics (context, not decision-makers)
- [e.g. "Time to complete onboarding"]
- [e.g. "Step 2 drop-off rate"]

### Guardrail Metrics (must not worsen)
- [e.g. "Day 7 retention — must not drop below X%"]
- [e.g. "Support ticket volume — must not increase by more than Y%"]

---

## Sample Size & Duration

| Input | Value |
|---|---|
| Baseline conversion rate | [X%] |
| Minimum Detectable Effect | [+Y% relative] |
| Statistical significance (confidence level) | 95% (α = 0.05) |
| Statistical power | 80% |
| **Required sample size per variant** | **[N users]** |
| Daily eligible users | [N] |
| **Estimated test duration** | **[X days]** |

> Use a sample size calculator (e.g. Evan Miller's) with the above inputs. Never end a test early because results look good; wait for the full planned duration.

---

## Implementation

| Task | Owner | Status |
|---|---|---|
| Variant built in code | [Engineer] | 🔲 |
| Feature flag configured | [Engineer] | 🔲 |
| Analytics events firing in all variants | [Engineer] | 🔲 |
| QA on staging | [Engineer + PM] | 🔲 |
| Pre-test data baseline captured | [Data] | 🔲 |

---

## Results (fill in after test)

**Test ran:** [Start date] → [End date]
**Total users in test:** [N control] vs [N treatment]
**Result:** [Winner / No significant difference / Inconclusive]

| Metric | Control | Treatment | Relative change | p-value | Significant? |
|---|---|---|---|---|---|
| [Primary metric] | [X%] | [Y%] | [+Z%] | [0.0X] | Yes / No |
| [Secondary metric] | | | | | |
| [Guardrail metric] | | | | | |

**Decision:** [Ship treatment / Keep control / Run follow-up test]

**Learnings:**
- [What this test confirmed or disproved]
- [What to test next]

How to use this A/B Test Plan template

1. Write the hypothesis before you write any code

A test without a hypothesis is just a coin flip. The hypothesis forces you to commit to a mechanism — why you believe the change will work. If you can't articulate the 'because' clause, you don't understand the problem well enough to run the test.

2. Calculate sample size before launching, not after

The biggest mistake in A/B testing is peeking at results early and calling a winner when they look good. Pre-calculating sample size tells you the minimum duration to run — commit to it before you see a single data point.
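As a rough illustration of what a sample size calculator does with the template's inputs, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions (two-sided test). The function name and defaults are illustrative assumptions, and a given calculator may use a slightly different formula, so treat the output as a sanity check rather than a replacement for your usual tool.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion test.

    baseline      -- current conversion rate, e.g. 0.40
    relative_mde  -- relative lift to detect, e.g. 0.10 for +10% (0.40 -> 0.44)
    alpha         -- significance level (0.05 corresponds to 95%)
    power         -- statistical power (0.80 corresponds to 80%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Template example: 40% baseline, +10% relative MDE, 95% significance, 80% power
print(sample_size_per_variant(0.40, 0.10))
```

Halving the MDE roughly quadruples the required sample, which is why the MDE should be agreed before launch rather than adjusted afterwards.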

3. Define guardrail metrics in advance

A test that lifts the primary metric by 10% but increases support volume by 30% is not a win. Define your guardrails before the test runs so you can't rationalise ignoring them after seeing the primary result.
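One way to make guardrails binding is to write the decision rule down as code before launch. The sketch below is a hypothetical rule, not a standard: it assumes each guardrail is expressed as a relative worsening (positive means worse) with a pre-agreed limit, and that the primary lift is measured in the desired direction.

```python
def evaluate_test(primary_significant, primary_lift, guardrail_worsening, guardrail_limits):
    """Return a decision string based on pre-registered guardrail limits.

    primary_significant  -- True if the primary metric result is significant
    primary_lift         -- relative lift in the desired direction, e.g. 0.10
    guardrail_worsening  -- {metric name: observed relative worsening}
    guardrail_limits     -- {metric name: maximum worsening agreed in advance}
    """
    breached = [
        name
        for name, worsening in guardrail_worsening.items()
        if worsening > guardrail_limits[name]
    ]
    if breached:
        # A guardrail breach overrides a primary-metric win.
        return "Keep control (guardrail breached: " + ", ".join(sorted(breached)) + ")"
    if primary_significant and primary_lift > 0:
        return "Ship treatment"
    return "Keep control"


# The example from the text: +10% primary lift, +30% support volume, 5% limit
print(evaluate_test(True, 0.10, {"support_tickets": 0.30}, {"support_tickets": 0.05}))
```

Because the limits are passed in as data, they can be copied straight from the Guardrail Metrics section of the plan.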

4. Document learnings even when the test loses

A failed test that disproves a hypothesis is as valuable as a winning one — it prevents you from running the same test again in 6 months. The Learnings section is mandatory, not optional.

Frequently asked questions

How long should an A/B test run?

Until you reach the required sample size, with a minimum of 1 full week to account for day-of-week effects (user behaviour differs Monday vs. Friday). As a rule of thumb, cap the test at about 4 weeks: the longer it runs, the more seasonal shifts and other outside changes add noise. Calculate the duration upfront and don't deviate.
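The duration arithmetic can be sketched as follows. The function name is hypothetical, and rounding up to whole weeks is an assumption made here so that every weekday is represented equally; the 7-day minimum and 28-day ceiling come from the guidance above.

```python
import math


def plan_duration_days(sample_size_per_variant, n_variants, daily_eligible_users,
                       min_days=7, max_days=28):
    """Return (planned days, whether the plan fits within max_days)."""
    total_users_needed = sample_size_per_variant * n_variants
    raw_days = math.ceil(total_users_needed / daily_eligible_users)
    # Assumption: round up to whole weeks to balance day-of-week effects.
    days = max(min_days, math.ceil(raw_days / 7) * 7)
    return days, days <= max_days


# 2,389 users per variant, 2 variants, 500 eligible users per day
print(plan_duration_days(2389, 2, 500))  # (14, True)
```

If the second value is `False`, the plan needs a larger MDE, a broader audience, or a different primary metric before launch.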

What's statistical significance and why does 95% matter?

Testing at 95% significance means that if the change truly has no effect, there is only a 5% chance of seeing a difference this large from random variation alone. At 90%, that rises to 10%: roughly 1 in 10 tests of a change that does nothing will still show a 'winner'. In a product with many tests running, false positives accumulate quickly. 95% is the common industry default.
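For conversion-style metrics, the p-value in the Results table is often computed with a two-proportion z-test. The sketch below is one minimal version using the pooled standard error; the function name is an assumption, and your experimentation platform may use a different test.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates.

    conv_a, conv_b -- number of converting users in control and treatment
    n_a, n_b       -- total users in control and treatment
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


p = two_proportion_p_value(conv_a=400, n_a=1000, conv_b=450, n_b=1000)
print(f"p-value: {p:.4f}, significant at 95%: {p < 0.05}")
```

A result counts as significant at 95% when the p-value is below 0.05, evaluated once at the planned end of the test.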

Can I run multiple A/B tests at the same time?

Yes, on different user segments or different parts of the product. Running two tests on the same users at the same time risks interaction effects: a user who sees the treatments from both tests may respond differently than one who sees only one change. Use mutual exclusion if your experimentation platform supports it.
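Mutual exclusion is commonly implemented by hashing each user into a fixed bucket and giving each test a non-overlapping bucket range. The sketch below shows the idea only; the layer name, experiment names, and bucket ranges are hypothetical, and real platforms handle this for you.

```python
import hashlib


def bucket(user_id, salt, buckets=100):
    """Deterministically map a user to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def assign(user_id, layer, experiments):
    """Return (experiment, variant) for a user, or (None, None) if in no test.

    experiments -- list of (name, start_bucket, end_bucket) with
                   non-overlapping ranges, so a user joins at most one test.
    """
    layer_bucket = bucket(user_id, layer)
    for name, start, end in experiments:
        if start <= layer_bucket < end:
            # A second hash, salted by experiment name, splits 50/50.
            variant = "treatment" if bucket(user_id, name, 2) == 1 else "control"
            return name, variant
    return None, None


# Hypothetical layer: two tests that must never share a user
layer = [("onboarding_cta", 0, 50), ("pricing_page", 50, 100)]
print(assign("user-123", "growth-layer", layer))
```

Hashing on a stable user ID keeps the assignment consistent across sessions without storing it.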

What do I do if the test shows no significant difference?

A null result is a real result. It means either the change doesn't matter to users, or the real effect is smaller than the MDE the test was sized to detect. Document it as 'no significant difference', decide whether to rerun the test with a larger sample or a bolder variant, and move on. Don't ship the treatment just because it didn't lose.
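When documenting a null result, a confidence interval for the difference shows which effect sizes the test can and cannot rule out. This is a minimal sketch using the normal approximation with an unpooled standard error; the function name is an assumption.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Confidence interval for (treatment rate - control rate), absolute."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


low, high = diff_confidence_interval(conv_a=400, n_a=1000, conv_b=410, n_b=1000)
print(f"Absolute difference is plausibly between {low:+.3f} and {high:+.3f}")
```

If the interval includes zero but also includes effects you would care about, the test was too small to settle the question, which argues for a larger sample or a bolder variant.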