
Your A/B Tests Are Shipping Decisions — Not Evidence

A/B testing is a growth lever only when the hypothesis is real, the duration is honest, and the variables are isolated. Most teams fail one of those three tests and end up shipping decisions dressed as data. Here's how to run experiments that actually change what you know, not just what you did.

2024-07-19

A/B testing is one of the highest-leverage tools in a growth team's kit. It's also one of the most commonly misused. Most teams run tests without clear hypotheses, stop them early, or draw conclusions that don't survive a second look. The result is false confidence masquerading as data discipline.

The Margin Leak

A team running 40 poorly designed A/B tests a year typically lands 6 to 10 false positives. At the nominal 5 percent threshold you would expect roughly two; early stopping and vague hypotheses push the effective rate several times higher. Those false positives get shipped as real changes, degrade the product, and often cost more than not testing at all. Meanwhile, the real wins stay invisible because the experiment design couldn't detect them.

The cost compounds in organizational trust. When the executive team stops believing the experimentation reports, the whole program gets defunded or relegated to cosmetic testing. Companies that run experiments rigorously typically unlock 15 to 25 percent incremental growth on tested surfaces annually. Companies that run them sloppily often regress.

The Path Forward

1. Start with a hypothesis, not a test

The question before any A/B test is "what do we believe and why?" If the hypothesis is "we think moving the CTA color from blue to green will lift conversions because green has higher contrast on our beige background," that's a testable, actionable hypothesis. If the hypothesis is "let's see if green works better than blue," that's a random experiment. Write the hypothesis in one sentence. State what you expect to happen and why. If you can't, don't run the test.

2. Run tests to statistical significance, not to convenience

The single most common failure mode in B2B experimentation is stopping tests early. Week two looks good, the team commits, and the result doesn't replicate. Use a sample size calculator at the start: baseline conversion rate, minimum detectable effect, statistical power. Run until the sample size is hit. Don't peek at interim results and act. Peeking creates early-stopping bias and produces false positives at rates far above the nominal 5 percent threshold.
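
As a rough illustration, here is a minimal sketch of that upfront calculation in Python using statsmodels; the baseline rate, minimum detectable effect, and power figures are placeholder assumptions, not recommendations.

```python
# Minimal sample-size sketch for a two-proportion A/B test (assumed numbers).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.040   # assumed baseline conversion rate (4%)
mde = 0.008        # minimum detectable effect, absolute (+0.8 points)
alpha = 0.05       # significance level
power = 0.80       # probability of detecting a true effect of size mde

# Convert the two proportions into a standardized effect size (Cohen's h).
effect = proportion_effectsize(baseline + mde, baseline)

# Solve for the visitors required in EACH arm before the test may be read.
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Run each variant to at least {int(round(n_per_arm)):,} visitors before reading results.")
```

Dividing the per-arm requirement by the variant's expected weekly traffic turns it into an honest run time, fixed before the test begins.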

3. Test one variable at a time until the signal is clean

Multivariate tests that change headline, CTA, and image simultaneously can't tell you which change drove the result. That's fine if you just want a better page. It's not fine if you want transferable insight. Start with clean A/B tests isolating one variable. Once you understand which variables matter, move to multivariate for optimization. Inverting that order produces experiments that feel productive and teach nothing.

4. Control for external factors

Seasonality, paid campaign spikes, and competitor launches all distort test results. A test run during Black Friday reflects Black Friday, not steady-state behavior. Either pause testing during known distortions or segment results to isolate the effect. Document the external context of every test so six months later you can tell whether the lift was real or environmental.
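
A minimal sketch of that segmentation, assuming a flat event log with timestamp, variant, and converted columns (every name, file, and date below is hypothetical):

```python
import pandas as pd

# Assumed export: one row per visitor with timestamp, variant ("A"/"B"), converted (0/1).
events = pd.read_csv("experiment_events.csv", parse_dates=["timestamp"])

# Hypothetical known distortion: a promotional window that overlapped the test.
promo = (events["timestamp"] >= "2024-11-25") & (events["timestamp"] < "2024-12-03")

def absolute_lift(df: pd.DataFrame) -> float:
    """Conversion-rate difference, variant B minus variant A."""
    rates = df.groupby("variant")["converted"].mean()
    return rates["B"] - rates["A"]

print(f"Lift outside the promo window: {absolute_lift(events[~promo]):+.4f}")
print(f"Lift inside the promo window:  {absolute_lift(events[promo]):+.4f}")
# If the two numbers diverge materially, the headline result is environmental, not real.
```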

5. Distinguish statistical significance from practical significance

A result can hit 95 percent confidence and still be commercially trivial. A 0.3 percent lift on a funnel step that matters for one user segment rarely justifies the engineering cost to ship it. Define the minimum effect size that would actually change a business decision before the test starts. If the test hits significance below that threshold, the correct action is usually "no change, move on."
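
One way to make that check mechanical is to pre-register the minimum effect and compare the observed lift against it alongside the p-value. A sketch with illustrative numbers (every figure below is an assumption):

```python
import math
from scipy.stats import norm

# Assumed test results (placeholder numbers).
conv_a, n_a = 1_920, 48_000      # control: conversions, visitors
conv_b, n_b = 2_065, 48_000      # variant: conversions, visitors
min_practical_lift = 0.004       # pre-registered: smallest absolute lift worth shipping

p_a, p_b = conv_a / n_a, conv_b / n_b
lift = p_b - p_a
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

z = lift / se
p_value = 2 * (1 - norm.cdf(abs(z)))                    # two-sided test
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se    # 95% Wald interval

print(f"lift={lift:.4f}, p={p_value:.4f}, 95% CI=({ci_low:.4f}, {ci_high:.4f})")
if p_value < 0.05 and lift >= min_practical_lift:
    print("Statistically and practically significant: worth shipping.")
elif p_value < 0.05:
    print("Significant but commercially trivial: no change, move on.")
else:
    print("No reliable effect detected.")
```

With these placeholder numbers the test clears 95 percent confidence but misses the practical threshold, which is exactly the case where the correct action is to move on.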

Common Failure Modes

The most common failure mode is running tests as an activity rather than a learning process. The team celebrates test velocity, reports wins weekly, and nobody checks whether shipped tests held up at 90 days. Many don't. The program generates motion without learning.

The fix is a post-ship audit. Every test that ships gets a 90-day check: did the expected lift hold up in production data? Most teams find that 30 to 40 percent of "winning" tests don't replicate at scale. That's a signal to tighten the methodology, not to abandon testing.
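
The audit itself can be as simple as comparing the pre-registered expected lift against production data for the 90 days after rollout. A sketch, where every filename, date, and number is a placeholder:

```python
import pandas as pd

# Assumed daily production metrics: date, visitors, conversions.
prod = pd.read_csv("production_metrics.csv", parse_dates=["date"])

ship_date = pd.Timestamp("2024-03-01")   # hypothetical ship date
pre_ship_rate = 0.041                    # conversion rate over the pre-ship baseline period
expected_lift = 0.006                    # absolute lift the winning test claimed

audit = prod[(prod["date"] > ship_date) &
             (prod["date"] <= ship_date + pd.Timedelta(days=90))]
observed_rate = audit["conversions"].sum() / audit["visitors"].sum()
observed_lift = observed_rate - pre_ship_rate

# The "held up" bar is a judgment call; half the claimed lift is one defensible choice.
print(f"claimed lift {expected_lift:.4f}, observed {observed_lift:.4f}, "
      f"held up: {observed_lift >= 0.5 * expected_lift}")
```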

The second failure is treating AI-driven testing as a replacement rather than an addition. AI-driven multi-armed bandit algorithms are useful for continuous optimization of known variables, especially on high-traffic pages. They're not the right tool for strategic decisions like pricing or packaging, which need clean causal answers that classic A/B tests provide. Use both tools for what each does best.
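
For the continuous-optimization side, a multi-armed bandit can be as small as Thompson sampling over Beta-distributed conversion rates. A minimal sketch (not any particular vendor's implementation):

```python
import random

class ThompsonSamplingBandit:
    """Thompson sampling for binary conversions: each arm keeps a Beta posterior."""

    def __init__(self, n_arms: int) -> None:
        self.successes = [1] * n_arms   # Beta(1, 1) uniform prior per arm
        self.failures = [1] * n_arms

    def choose_arm(self) -> int:
        # Sample a plausible conversion rate from each arm's posterior,
        # then send this visitor to the arm with the highest sample.
        samples = [random.betavariate(s, f) for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def record(self, arm: int, converted: bool) -> None:
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

# Usage: traffic drifts toward better arms automatically, which is exactly why
# bandits are poor at producing the clean causal read a pricing decision needs.
bandit = ThompsonSamplingBandit(n_arms=3)
arm = bandit.choose_arm()
bandit.record(arm, converted=random.random() < 0.04)   # replace with a real conversion signal
```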

Actions to Take Now

  • Write one-sentence hypotheses for every active test and retire the ones without clear hypotheses
  • Set a sample size requirement before each new test and commit to running to that size
  • Add a 90-day post-ship audit to your experimentation process
  • Pilot a continuous testing tool on one high-traffic optimization surface
  • Build a shared documentation template capturing hypothesis, sample size, external context, and outcome for every test (sketched below)
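
One lightweight way to enforce that template is a typed record; the fields below mirror the list above, and the structure is only a suggestion:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class ExperimentRecord:
    """One row per test: everything needed to judge the result six months later."""
    name: str
    hypothesis: str                  # one sentence: what we expect to happen and why
    start_date: date
    planned_sample_per_arm: int      # fixed before launch, never revised mid-test
    min_practical_lift: float        # smallest absolute lift that changes a decision
    external_context: str            # seasonality, campaigns, competitor launches
    outcome: Optional[str] = None    # filled in at readout
    audit_90d_held_up: Optional[bool] = None   # filled in at the post-ship audit
    tags: list[str] = field(default_factory=list)

record = ExperimentRecord(
    name="cta-color-green-vs-blue",
    hypothesis="Green CTA lifts clicks because it has higher contrast on the beige background.",
    start_date=date(2024, 7, 1),
    planned_sample_per_arm=52_000,
    min_practical_lift=0.004,
    external_context="No major campaigns in flight; summer seasonality noted.",
)
```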

Experimentation compounds when done well. It generates noise when done poorly. The difference is discipline at the hypothesis, sample size, and post-ship audit stages.

Take the FintastIQ Marketing Diagnostic to benchmark your experimentation maturity.

Frequently Asked Questions

How do I know when a test has run long enough to commit to a result?
Set the sample size requirement before the test starts, using a calculator that accounts for baseline conversion rate, minimum detectable effect, and desired statistical power. Don't peek at interim results and stop the test the moment they look good; stopping early on a promising interim read introduces early-stopping bias and inflates false positives. Run to the planned sample size or to the planned duration, whichever comes later. For most B2B tests, that's 8 to 14 weeks. Patience is the single most underpriced discipline in experimentation.
Is AI-driven continuous testing replacing traditional A/B testing?
It's extending it, not replacing it. AI-driven multi-armed bandit algorithms are good for optimization problems where you want to maximize conversion on a high-volume page and the main cost of a classic A/B test is the traffic you keep sending to the losing variant. Traditional A/B testing is still better for strategic decisions where you need a clean causal answer: pricing, packaging, major UX changes. The rule: use bandits for continuous optimization of known variables. Use classic A/B tests for decisions that change strategy.
