Experiment Design Template

A rigorous experiment design template for PMs running A/B tests and product experiments. Covers hypothesis formulation, sample size calculation, metric selection, guardrail metrics, analysis plan, and a decision framework for shipping or reverting. Free to copy, download, and use. No signup required.

Template
# Experiment Design Template
**Experiment name:** [Descriptive name, e.g. "Onboarding checklist vs. free-form"]
**PM:** [Name]  **Engineer:** [Name]  **Analyst:** [Name]
**Date designed:** [Date]  **Planned launch:** [Date]  **Planned end:** [Date]

---

## 1. Hypothesis

**Problem statement:**
[What user behaviour or metric are you trying to improve, and why does it matter now?]

**Hypothesis (fill in the blanks):**
> We believe that **[changing X]** for **[user segment Y]** will cause **[metric Z]** to **[increase/decrease]** because **[mechanism/reasoning]**.

*Example: We believe that showing an onboarding checklist to new signups will cause Day-7 activation rate to increase because users who see explicit next steps are more likely to complete the core workflow.*

**Null hypothesis:**
> [X] has no effect on [Z].

---

## 2. Variants

| Variant | Description | % of traffic |
|---|---|---|
| Control (A) | [Current experience — describe exactly] | % |
| Treatment (B) | [New experience — describe exactly] | % |
| Treatment (C) | [Optional second treatment] | % |

**Randomisation unit:** [ ] User  [ ] Session  [ ] Account  [ ] Device
*Use user-level randomisation for most experiments. Session-level randomisation can show the same user different variants across sessions, creating an inconsistent experience.*

**Targeting / eligibility:**
- Include: [e.g. New signups in the last 7 days]
- Exclude: [e.g. Existing paying users, internal accounts, users in other active experiments]

---

## 3. Metrics

**Primary metric (the one this experiment is designed to move):**
| Metric | Definition | Current baseline | Direction | Minimum detectable effect (MDE) |
|---|---|---|---|---|
| [e.g. Day-7 activation rate] | % of signups who complete [aha moment] within 7 days | % | ↑ | +[X]% absolute |

*MDE: the smallest improvement worth shipping. Too small an MDE requires impractically large samples. If you can't name the improvement (for example, 2% absolute) that would justify shipping, your experiment isn't ready.*

**Secondary metrics (supporting signals):**
| Metric | Direction | Notes |
|---|---|---|
| [e.g. Day-30 retention] | ↑ | Confirm activation improvement leads to long-term retention |
| [e.g. Time to aha moment] | ↓ | Faster is better |
| [e.g. Support tickets in first 7 days] | ↓ | Ensure onboarding checklist reduces confusion |

**Guardrail metrics (must not degrade):**
| Metric | Maximum allowed degradation | Notes |
|---|---|---|
| [e.g. Paid conversion rate] | No degradation > 0.5% absolute | Activation improvement must not hurt revenue |
| [e.g. Signup completion rate] | No degradation > 1% absolute | New flow must not create drop-off |
| [e.g. Page load time] | No increase > 200ms P95 | Performance guardrail |

*If any guardrail metric breaches its threshold, the experiment should be stopped regardless of primary metric results.*

---

## 4. Sample size and duration

**Sample size calculation:**

| Input | Value |
|---|---|
| Baseline conversion rate | % |
| Minimum detectable effect (MDE) | % absolute |
| Significance level (α) | 0.05 (95% confidence) |
| Statistical power (1 − β) | 0.80 (80% power) |
| Number of variants | [2 / 3] |
| **Required sample per variant** | **[N] users** |
| **Total required sample** | **[N × variants] users** |

*Use a sample size calculator (e.g. Evan Miller's). Don't skip this — underpowered experiments produce inconclusive results that you'll run again.*

**Traffic estimate:**
| Eligible users per day | Days to reach required sample | Planned end date |
|---|---|---|
| | | [Date] |

**Minimum run time:** [N] days (never less than 1 full week — day-of-week effects are real)
**Maximum run time:** [N] days (after this, novelty effects and external events contaminate results)

---

## 5. Implementation checklist

- [ ] Experiment logged in experiment tracking system (Amplitude, Mixpanel, LaunchDarkly, etc.)
- [ ] Randomisation verified — no imbalanced split on pre-experiment metrics
- [ ] Tracking events instrumented for primary metric and all secondary/guardrail metrics
- [ ] QA passed for both control and treatment variants
- [ ] AA test passed (run control vs. control for 48h to verify randomisation)
- [ ] Monitoring dashboard set up
- [ ] Alert configured for guardrail metric breaches

---

## 6. Analysis plan

**When to analyse:**
- [ ] Fixed horizon (analyse at planned end date only, which avoids p-hacking)
- [ ] Sequential testing (analyse continuously with corrected p-values)

*Recommendation: fixed horizon unless you have a strong reason to stop early.*

**Statistical test:**
- [ ] Z-test / chi-squared (for proportions, e.g. conversion rate)
- [ ] T-test (for means, e.g. revenue per user)
- [ ] Mann-Whitney U (for non-normal distributions)

**Segmentation analysis (run after primary result):**
- [ ] By acquisition channel
- [ ] By device type (mobile vs desktop)
- [ ] By user tenure (new vs returning)
- [ ] By plan type (free vs paid)

*Segment analysis is exploratory — don't use it to rescue a failed experiment. Use it to generate hypotheses for future experiments.*

---

## 7. Decision framework

| Result | Condition | Decision |
|---|---|---|
| **Ship** | Primary metric improves ≥ MDE at p < 0.05 AND all guardrails hold | Roll out to 100% |
| **Iterate** | Primary metric improves but below MDE OR guardrail warning (not breach) | Refine treatment and re-run |
| **Revert** | Any guardrail breach, regardless of primary metric result | Stop immediately, revert to control |
| **Inconclusive** | Primary metric not significant after full run time | Do not ship; investigate why |
| **Negative** | Primary metric significantly worse | Revert; document learnings |

**Decision owner:** [Name]
**Decision deadline:** [Date — typically 3–5 business days after experiment ends]

---

## 8. Results (fill in after experiment)

| Metric | Control | Treatment | Absolute diff | Relative diff | p-value | Significant? |
|---|---|---|---|---|---|---|
| [Primary] | % | % | % | % | | Y / N |
| [Secondary 1] | | | | | | |
| [Guardrail 1] | | | | | | |

**Decision:** [ ] Ship  [ ] Iterate  [ ] Revert  [ ] Inconclusive
**Rationale:** [One paragraph explaining the decision based on data]
**Learnings for future experiments:** [What does this tell you about user behaviour?]

How to use this Experiment Design template

1. Calculate sample size before launching, not after

The most common experiment mistake is launching without knowing how much traffic is needed to detect a meaningful effect. Without a sample size calculation, you either: (a) stop too early and ship a false positive, or (b) run indefinitely waiting for significance that may never come. Calculate the required sample before launch. If reaching it would take more than 6 weeks at current traffic, either increase the MDE (accepting that the test will only detect larger effects) or don't run the experiment.
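
As a rough illustration of that pre-launch calculation, the sketch below uses the standard normal-approximation formula for a two-sided test comparing two proportions with equal allocation. The function names and the example inputs (20% baseline, 2-point MDE, 500 eligible users per day) are assumptions made for illustration, and online calculators may give slightly different numbers because they use slightly different variance formulas.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    baseline: control conversion rate, e.g. 0.20
    mde_abs:  minimum detectable effect as an absolute difference, e.g. 0.02
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two binomial variances
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def days_to_run(n_per_variant, variants, eligible_users_per_day):
    """Days needed to collect the total sample at the given daily traffic."""
    total = n_per_variant * variants
    return math.ceil(total / eligible_users_per_day)


# Hypothetical inputs: 20% baseline, 2-point absolute MDE, 2 variants, 500 users/day.
n = sample_size_per_variant(baseline=0.20, mde_abs=0.02)
print(n, days_to_run(n, variants=2, eligible_users_per_day=500))
```

If the resulting number of days exceeds the maximum run time you are willing to accept, that is the signal to revisit the MDE or drop the experiment before launch rather than after.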

2. Define guardrail metrics before the experiment runs, not after

Guardrail metrics defined after the experiment are just rationalisations. Before launch, list every metric you'd be embarrassed to see degrade — paid conversion, page performance, support volume. Set explicit thresholds. If a guardrail breaches during the experiment, stop it regardless of primary metric results. A 5% lift in activation that causes a 3% drop in paid conversion is not a win.
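
One way to make those thresholds explicit is to write them down as data before launch and check them mechanically. The following is a minimal sketch under stated assumptions: the `Guardrail` class, the metric names, and the example values are hypothetical, and the check compares point estimates only, so a production version would also need to account for statistical noise.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    name: str
    max_degradation: float        # largest tolerated move in the "bad" direction
    higher_is_better: bool = True


def breached_guardrails(guardrails, control, treatment):
    """Return the names of guardrails whose degradation exceeds the threshold."""
    breaches = []
    for g in guardrails:
        delta = treatment[g.name] - control[g.name]
        # A drop is bad when higher is better; a rise is bad otherwise.
        degradation = -delta if g.higher_is_better else delta
        if degradation > g.max_degradation:
            breaches.append(g.name)
    return breaches


# Hypothetical thresholds, written down before the experiment starts.
guardrails = [
    Guardrail("paid_conversion_rate", max_degradation=0.005),
    Guardrail("p95_page_load_ms", max_degradation=200, higher_is_better=False),
]
control = {"paid_conversion_rate": 0.050, "p95_page_load_ms": 1200}
treatment = {"paid_conversion_rate": 0.044, "p95_page_load_ms": 1350}

print(breached_guardrails(guardrails, control, treatment))
```

In this made-up example, paid conversion falls by 0.6 points against a 0.5-point limit, so it is flagged, while the 150 ms slowdown stays within its 200 ms budget.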

3. Never peek at results and make decisions mid-experiment

Checking p-values daily and stopping as soon as p < 0.05 is a form of p-hacking known as optional stopping, and it substantially inflates the false positive rate. If you're using a fixed-horizon design (recommended), set the analysis date before launch and don't touch it. If you have a business reason to monitor continuously, use a sequential testing method with appropriate corrections, such as a group sequential design with alpha spending or an always-valid test based on the mixture sequential probability ratio test. 'We stopped early because the result looked good' is not a valid analysis method.
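
The effect of peeking can be seen in a small simulation. The sketch below is illustrative only: it assumes experiments with no true effect, 20 equally sized batches of data, and a z-statistic recomputed after each batch. The function name, defaults, and seed are arbitrary choices made for this example.

```python
import math
import random


def false_positive_rates(n_sims=2000, looks=20, z_crit=1.96, seed=7):
    """Simulate no-effect experiments analysed at `looks` equally spaced checkpoints.

    Returns (rate when any look counts as a win, rate when only the final look counts).
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        crossed = False
        z = 0.0
        for k in range(1, looks + 1):
            # Under the null, each batch adds an independent standard-normal increment.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(k)  # z-statistic on all data collected so far
            if abs(z) > z_crit:
                crossed = True
        peeking_hits += int(crossed)
        fixed_hits += int(abs(z) > z_crit)  # z now holds the final-look statistic
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"single analysis at the end:     {fixed_rate:.1%}")
```

Under these assumptions the fixed-horizon rate should land near the nominal 5%, while the rate with peeking should come out well above it, even though no experiment in the simulation has a real effect.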

4. Run an AA test for 48 hours before every experiment

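An AA test runs your randomisation split on two identical versions of the control experience. If it shows a statistically significant difference between 'control' and 'control', suspect the randomisation or the tracking: users may not be assigned uniformly, or events may be logged differently between groups. Keep in mind that at α = 0.05 about 1 in 20 AA tests will look significant by chance alone, so re-run the test or inspect the assignment before concluding that something is broken. Fix any real problem before running the real experiment. Skipping the AA test means your results might be driven by selection bias, not treatment effect.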
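
A simple way to compare the two control groups on a conversion-style metric is a two-proportion z-test. The sketch below is a minimal version using the pooled conversion rate. The function name and the example counts are invented for illustration, and it assumes the pooled rate is strictly between 0 and 1 so that the standard error is non-zero.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates, using the pooled rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error of the difference under the null hypothesis of equal rates.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical AA counts: both groups received the same control experience.
z, p = two_proportion_z_test(conv_a=512, n_a=5000, conv_b=498, n_b=5000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

The same function applies to the real experiment's primary metric when that metric is a proportion, which also makes the AA run a useful rehearsal of the analysis pipeline.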

Frequently asked questions

What is the minimum detectable effect (MDE) and how do I set it?

The MDE is the smallest improvement that would be worth shipping — the threshold below which the change isn't commercially meaningful even if it's statistically significant. Set it by asking: 'If this experiment moves the metric by X%, would we ship it?' Work backwards from that answer. A typical MDE for activation experiments is 2–5% absolute. Setting the MDE too low (e.g. 0.5%) requires huge sample sizes and very long runtimes — most teams don't have that luxury.
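
The link between MDE and sample size can be made concrete with the usual normal-approximation formula for comparing two proportions with equal allocation. The symbols are introduced here for illustration: \(p_1\) is the baseline rate, \(\delta\) is the absolute MDE, and \(n\) is the required sample per variant.

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
          \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}{\delta^{2}},
\qquad p_2 = p_1 + \delta
```

Because \(\delta\) appears squared in the denominator, halving the MDE roughly quadruples the required sample. As an assumed example with a 20% baseline, α = 0.05, and 80% power, a 2-point MDE needs about 6,500 users per variant, while a 0.5-point MDE needs about 101,000.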

What do we do when an experiment is inconclusive?

An inconclusive result (no significant difference in either direction) is information, not failure. It tells you that any effect on this metric for this population was too small for this design to detect, which is not the same as proof of no effect. Document what you expected to happen and why it didn't, then decide: (a) was the hypothesis wrong, (b) was the treatment too weak, or (c) was the metric the wrong one to measure? Use inconclusive results to sharpen future hypotheses rather than re-running the same test with the same design.

Can we run multiple experiments at the same time?

Yes, with safeguards. Experiments on non-overlapping user populations or non-interacting parts of the product can run concurrently without issue. The risk is interaction effects — if Experiment A changes onboarding and Experiment B changes the first feature users encounter, the combined effect of both is not separable. Maintain a running list of active experiments and check for overlapping eligibility criteria before launching a new one.
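
The pre-launch overlap check can be as simple as comparing eligibility segments against the list of active experiments. The sketch below assumes a hypothetical registry in which each experiment is a dictionary with a `name` and a set of `segments`; the segment labels and experiment names are invented for this example.

```python
def overlapping_experiments(active, candidate):
    """Return {experiment name: shared segments} for active experiments
    whose eligibility overlaps the candidate's."""
    overlaps = {}
    for exp in active:
        shared = exp["segments"] & candidate["segments"]
        if shared:
            overlaps[exp["name"]] = shared
    return overlaps


# Hypothetical running list of active experiments.
active = [
    {"name": "onboarding-checklist", "segments": {"new_signups"}},
    {"name": "pricing-page-copy", "segments": {"free_plan", "trial"}},
]
candidate = {"name": "first-feature-tooltip", "segments": {"new_signups", "trial"}}

print(overlapping_experiments(active, candidate))
```

An overlap is a prompt for a conversation rather than an automatic block: two experiments can share users safely when they touch parts of the product that do not interact.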