Featured image for: A/B Testing for Product Managers: What to Test, What to Skip, and How to Read Results
Planning

A/B Testing for Product Managers: What to Test, What to Skip, and How to Read Results

A/B testing has become the default answer to every product question it's not equipped to answer.

"Should we change the onboarding flow?" Run an experiment. "Should we add this feature?" Run an experiment. "Why is retention dropping?" Run an experiment.

None of those questions are wrong. The problem is that most experiments run to answer them are designed badly, measured too early, declared significant too soon, or used to answer the wrong question entirely.

Here's an honest guide to A/B testing in product management — including the situations where you should skip it entirely.

When A/B Testing Actually Works

Experiments are the right tool when all four of these conditions are true:

  1. 1You have a clear metric. Not "engagement" — a specific, measurable event. "Users who click the export button within 7 days of signup."
  2. 2You have enough traffic. A test that needs 10,000 users per variant to reach significance is not useful if your feature gets 500 visits a week.
  3. 3The change is isolated. If you're testing a new onboarding flow at the same time as a pricing change, you can't attribute the result to either.
  4. 4The change has a directional hypothesis. "Version B will increase 7-day activation because it reduces the steps to the first PRD from 5 to 2" is a hypothesis. "Let's see which one does better" is a guess with extra steps.

If you're missing any of these, you're not running an experiment — you're running a coin flip with more paperwork.

What to Test

Good candidates for A/B testing:

  • Copy changes — subject lines, button text, onboarding headlines. Low-risk, fast to implement, clear metric (click rate, activation).
  • UI layout changes — placement of a primary action, step order in a flow. Easy to isolate, meaningful impact on conversion.
  • Pricing page variations — different plan structures, CTA phrasing. Direct revenue signal.
  • Email timing and content — which day of the week, which subject line format.

Bad candidates:

  • Major feature decisions. "Should we build the Ask tab?" is not an A/B test question — it's a product strategy question. Run user interviews, not an experiment.
  • Small-traffic features. If 200 people use a feature per month, you'll run the experiment for 8 months to reach significance. Just ship the better version.
  • Changes that affect trust or brand perception. You won't see the damage in a 30-day experiment window.
  • Anything users remember. If a user sees variant A, converts, and then sees variant B two weeks later, your data is corrupted.

How to Design a Good Experiment Plan

An ab test plan template has six parts:

1. Hypothesis

State the change, the expected outcome, and the reasoning.

"Reducing the number of onboarding steps from 5 to 2 before the first file upload will increase 7-day activation by at least 10%, because current drop-off data shows 60% of users leave at step 3 (the project name step)."

2. Primary metric

One number. Not three.

"7-day activation rate (defined as: completed at least one file upload within 7 days of signup)."

3. Secondary metrics (guardrail only)

What would tell you the test is causing harm even if the primary metric improves?

"Day-30 retention. If activation goes up but 30-day retention drops, the change is pulling forward weak activations, not creating engaged users."

4. Sample size and duration

Calculate the minimum sample size before you start. Online calculators need: baseline conversion rate, minimum detectable effect, significance level (use 95%), power (use 80%).

Don't end the test early just because it looks like one variant is winning. Peeking inflates false positive rates dramatically.

5. Segmentation plan

Will you analyze results for subgroups (new vs. returning, mobile vs. desktop, plan type)? Decide this before you run, not after — post-hoc segmentation that chases significance is p-hacking.

6. Decision criteria

Write down in advance: if primary metric improves by X% with p < 0.05, we ship. If it's flat or negative, we don't. If guardrail metrics worsen by more than Y%, we stop regardless of primary metric.

Reading Results Without Fooling Yourself

Statistical significance is not the same as practical significance. A test that reaches p < 0.05 with a 0.3% improvement in conversion tells you the effect is real — not that it matters.

Confidence intervals matter more than p-values. "Variant B improved conversion by 4.2% (CI: 0.8%–7.6%)" tells you a lot more than "p = 0.04." The confidence interval tells you the range of plausible effects. If the lower bound is essentially zero, you don't have a compelling result.

Novelty effects are real. Users often engage more with anything new. Run experiments long enough (at least one full week, preferably two) to capture normal usage cycles, not just the novelty spike.

Segment results by new vs. existing users. A change that helps new users activate faster may confuse existing users who have already learned the old flow. Aggregate results hide this.

When to Skip the Experiment Entirely

Some decisions shouldn't be A/B tested:

When you already know the right answer from other evidence. If 7 of 10 recent user interviews mentioned that the 5-step onboarding is too long, and your funnel data confirms 60% drop-off at step 3, you have enough evidence to act. Running an experiment adds 8 weeks and tells you what you already know.

When the stakes are low. Not everything merits experiment rigor. Fixing a confusing label, correcting a flow that's clearly broken, removing a feature nobody uses — these are judgment calls, not experiments.

When the ethical implications are significant. Experimenting with pricing, with customer communications, or with data visibility has ethical dimensions that pure conversion metrics won't capture.

Feature Experimentation at Scale

If your team ships multiple experiments per week, you need infrastructure: a feature flag system, an experimentation platform, and a clear owner for experiment analysis. Without this, experiments collide, learnings aren't shared, and different teams ship contradictory variants to the same users.

The other thing you need at scale: a log of what you've learned from past experiments. An experiment that tells you "short copy outperforms long copy in onboarding CTAs" is organizational knowledge — not just a data point. Capture it somewhere the next PM can find it.

Use the AB Test Plan Template to structure your next experiment before it starts, not after.

Want PRDs grounded in your actual customer data?

PMRead ingests your interviews, Slack threads, and feedback files — and generates PRDs backed by real evidence, not guesses.

Try PMRead free →