A/B Test Plan Template

An A/B test plan template for product and growth PMs. Covers hypothesis, variants, sample size calculation, success metrics, guardrail metrics, and a results analysis section. Free to copy, download, and use. No signup required.

Template

# A/B Test Plan

**Test Name:** [Short descriptive name — e.g. "Onboarding CTA Copy Test"]
**Author:** [PM Name]
**Date Created:** [Date]
**Target Launch Date:** [Date]
**Status:** Planned / Running / Complete / Stopped

---

## Hypothesis

> If we [change], then [metric] will [increase / decrease] by [X%] because [reason based on evidence].

**Full hypothesis:**
> If we **[describe the change]**, then **[primary metric]** will **[direction + magnitude]** because **[insight from research, prior test, or theory that explains the mechanism]**.

**Evidence that informed this hypothesis:**
- [User interview insight, heatmap data, prior test result, or competitor observation]

---

## Background & Motivation

[2–3 sentences: What problem are you solving? Why is this worth testing now? What do you expect to learn?]

---

## Variants

| Variant | Description | Visual / Link |
|---|---|---|
| **Control (A)** | Current experience — no changes | [Screenshot / Figma link] |
| **Treatment (B)** | [What changes — be precise about every element that differs] | [Screenshot / Figma link] |
| **Treatment (C)** *(if applicable)* | [Third variant] | [Link] |

> ⚠️ Change only one thing between control and treatment unless this is a multivariate test. Multiple simultaneous changes make it impossible to know what caused the result.

---

## Target Audience

**Who sees this test:**
- [User segment — e.g. "New signups in their first 7 days"]
- [Inclusion criteria — e.g. "Desktop only, English locale"]
- [Exclusion criteria — e.g. "Existing customers, internal team"]

**Traffic allocation:**
- Control: [50%]
- Treatment: [50%]

---

## Metrics

### Primary Metric (the one that determines winner)
- **Metric:** [e.g. "Onboarding completion rate"]
- **Current baseline:** [X%]
- **Minimum Detectable Effect (MDE):** [+Y% relative — e.g. "+10% relative = from 40% to 44%"]
- **Direction:** [Increase / Decrease]

### Secondary Metrics (context, not decision-makers)
- [e.g. "Time to complete onboarding"]
- [e.g. "Step 2 drop-off rate"]

### Guardrail Metrics (must not worsen)
- [e.g. "Day 7 retention — must not drop below X%"]
- [e.g. "Support ticket volume — must not increase by more than Y%"]

---

## Sample Size & Duration

| Input | Value |
|---|---|
| Baseline conversion rate | [X%] |
| Minimum Detectable Effect | [+Y% relative] |
| Statistical significance | 95% |
| Statistical power | 80% |
| **Required sample size per variant** | **[N users]** |
| Daily eligible users | [N] |
| **Estimated test duration** | **[X days]** |

> Use a sample size calculator (e.g. Evan Miller's) with the above inputs. Never end a test early because results look good; wait for the full duration.

---

## Implementation

| Task | Owner | Status |
|---|---|---|
| Variant built in code | [Engineer] | 🔲 |
| Feature flag configured | [Engineer] | 🔲 |
| Analytics events firing in all variants | [Engineer] | 🔲 |
| QA on staging | [Engineer + PM] | 🔲 |
| Pre-test data baseline captured | [Data] | 🔲 |

---

## Results (fill in after test)

**Test ran:** [Start date] → [End date]
**Total users in test:** [N control] vs [N treatment]
**Result:** [Winner / No significant difference / Inconclusive]

| Metric | Control | Treatment | Relative change | p-value | Significant? |
|---|---|---|---|---|---|
| [Primary metric] | [X%] | [Y%] | [+Z%] | [0.0X] | Yes / No |
| [Secondary metric] | | | | | |
| [Guardrail metric] | | | | | |

**Decision:** [Ship treatment / Keep control / Run follow-up test]

**Learnings:**
- [What this test confirmed or disproved]
- [What to test next]

How to use this A/B Test Plan template

Write the hypothesis before you write any code

A test without a hypothesis is just a coin flip. The hypothesis forces you to commit to a mechanism — why you believe the change will work. If you can't articulate the 'because' clause, you don't understand the problem well enough to run the test.

Calculate sample size before launching, not after

The biggest mistake in A/B testing is peeking at results early and calling a winner when they look good. Pre-calculating sample size tells you the minimum duration to run — commit to it before you see a single data point.
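As a rough illustration of what a sample size calculator does with the template's inputs, here is a minimal sketch using the standard normal-approximation formula for a two-sided test on two proportions. The function name and defaults are assumptions for illustration, and online calculators may use slightly different formulas, so treat the output as a ballpark figure.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided two-proportion z-test).

    baseline      -- current conversion rate, e.g. 0.40
    relative_mde  -- minimum detectable effect, relative, e.g. 0.10 for +10%
    alpha         -- significance level (0.05 corresponds to 95% significance)
    power         -- probability of detecting a true effect of size MDE
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# The template's example: 40% baseline, +10% relative MDE (40% -> 44%)
print(sample_size_per_variant(baseline=0.40, relative_mde=0.10))
```

Halving the MDE roughly quadruples the required sample, which is why the MDE should be fixed before launch rather than tuned afterwards.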

Define guardrail metrics in advance

A test that increases the primary metric by 10% but increases support volume by 30% is not a win. Define your guardrails before the test runs so you can't rationalise ignoring them after seeing the primary result.
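One way to make guardrails binding is to encode the decision rule before the test starts. The sketch below is a hypothetical rule, not a standard one: the function name, the guardrail limits, and the convention that a positive change means "worse" are all assumptions chosen for illustration.

```python
def decide(primary_lift, primary_significant, guardrails):
    """Return (decision, breached_guardrails).

    primary_lift        -- relative change in the primary metric, e.g. 0.10
    primary_significant -- True if the primary result is statistically significant
    guardrails          -- dict: name -> (observed_worsening, max_allowed_worsening),
                           both as relative changes where positive means worse
    """
    breached = [
        name
        for name, (observed, limit) in guardrails.items()
        if observed > limit
    ]
    if breached:
        # A guardrail breach overrides a primary-metric win.
        return "Keep control", breached
    if primary_significant and primary_lift > 0:
        return "Ship treatment", []
    return "Run follow-up test", []


# The example from the text: +10% primary metric, +30% support volume.
# The 5% limit is an assumed value for illustration.
print(decide(0.10, True, {"support_ticket_volume": (0.30, 0.05)}))
```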

Document learnings even when the test loses

A failed test that disproves a hypothesis is as valuable as a winning one — it prevents you from running the same test again in 6 months. The Learnings section is mandatory, not optional.

Frequently asked questions

How long should an A/B test run?

Until you reach the required sample size. Run for at least 1 full week to account for day-of-week effects (user behaviour differs Monday vs. Friday). Cap the test at about 4 weeks: beyond that, seasonal factors introduce noise. Calculate the duration upfront and don't deviate.
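The duration rule can be sketched as a small calculation: divide the total required sample by daily eligible users, round up to whole weeks, and flag plans that exceed the 4-week cap. The function name and the example traffic numbers are assumptions for illustration.

```python
import math


def estimate_duration_days(n_per_variant, variants, daily_eligible_users,
                           min_days=7, max_days=28):
    """Return (days, within_cap) for a planned test.

    Rounds up to whole weeks so every weekday is covered equally,
    and enforces the one-week minimum.
    """
    total_users = n_per_variant * variants
    raw_days = math.ceil(total_users / daily_eligible_users)
    days = max(min_days, math.ceil(raw_days / 7) * 7)
    return days, days <= max_days


# Assumed inputs: 2,400 users per variant, 2 variants, 500 eligible users per day
print(estimate_duration_days(2400, 2, 500))
```

If `within_cap` comes back `False`, the usual options are a larger MDE, a broader audience, or a bolder variant, rather than simply running the test longer.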

What's statistical significance and why does 95% matter?

Statistical significance at 95% means that if the change truly had no effect, there would be only a 5% chance of seeing a difference this large through random variation alone. At 90%, that chance rises to 10%, meaning roughly 1 in 10 tests of changes that do nothing will still show a 'winner'. In a product with many tests running, false positives compound quickly. 95% is the industry standard.
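For the Results table, the p-value for a conversion-rate metric is commonly computed with a pooled two-proportion z-test. The sketch below assumes that test and uses made-up counts; your experimentation platform may use a different method.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_change, p_value) for control A vs treatment B.

    conv_* are conversion counts, n_* are users per variant.
    The p-value is two-sided.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_change = (p_b - p_a) / p_a
    return relative_change, p_value


# Illustrative counts: 40% vs 44% conversion with 2,400 users per variant
change, p = two_proportion_z_test(960, 2400, 1056, 2400)
print(f"relative change: {change:+.1%}, p-value: {p:.4f}, significant: {p < 0.05}")
```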

Can I run multiple A/B tests at the same time?

Yes, on different user segments or different parts of the product. Running two tests on the same users at the same time risks interaction effects: a user who sees the treatments from both tests may respond differently than one who sees only one change. Use mutual exclusion if your experimentation platform supports it.
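If your platform does not offer mutual exclusion, one common approach is deterministic hash bucketing: each user is hashed into a bucket within a shared "layer", and each experiment owns a non-overlapping bucket range. This is a minimal sketch under that assumption; the function names, layer name, and experiment names are hypothetical.

```python
import hashlib


def bucket(user_id, salt, buckets=100):
    """Deterministically map a user to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def assign(user_id, layer, experiments):
    """Assign a user to at most one experiment in the layer.

    experiments -- list of (name, start_bucket, end_bucket) with
                   non-overlapping half-open ranges [start, end)
    Returns (experiment_name, variant) or None if the user is in no test.
    """
    b = bucket(user_id, layer)
    for name, start, end in experiments:
        if start <= b < end:
            # A separate salt (the experiment name) keeps the control/treatment
            # split independent of the layer bucket.
            variant = "treatment" if bucket(user_id, name, 2) == 1 else "control"
            return name, variant
    return None


experiments = [("onboarding_cta_copy", 0, 50), ("pricing_page_layout", 50, 100)]
print(assign("user-42", "growth-layer", experiments))
```

Because the assignment depends only on the user ID and the salts, the same user always lands in the same experiment and variant across sessions.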

What do I do if the test shows no significant difference?

A null result is a real result. It means either the change doesn't matter to users, or the real effect is smaller than your MDE, so the test did not have enough users to detect it. Document it as 'no significant difference', decide whether to run the test with a larger sample or a bolder variant, and move on. Don't ship the treatment just because it didn't lose.
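A confidence interval on the difference can help tell those two explanations apart. The sketch below computes a normal-approximation (Wald) interval for the absolute difference in conversion rates; the function name and the counts are illustrative assumptions. If the whole interval sits below your MDE, a real effect as large as the MDE is unlikely.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (low, high) for the absolute difference p_b - p_a."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Illustrative null result: 960/2400 (control) vs 980/2400 (treatment)
low, high = diff_confidence_interval(960, 2400, 980, 2400)
mde_absolute = 0.04  # +10% relative on a 40% baseline
print(f"95% CI for the difference: [{low:+.3f}, {high:+.3f}]")
print("Interval includes zero:", low < 0 < high)
print("Effect as large as the MDE is unlikely:", high < mde_absolute)
```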
