
Your A/B Test is Significant, But Is It Real? Understanding False Positive Risk

A/B Testing

By Kuan-Hao (Wilson). Working at Google. Passionate about causal inference and A/B testing.

Picture this: Your A/B test just came back with a p-value of 0.02. You rush to your manager, excited to share the news: “Our test is 98% certain to be correct!”

Stop right there! This is one of the most dangerous misunderstandings I see in the industry, and it could be costing your company millions in wrong decisions.

Let’s talk about why this seemingly logical interpretation is completely wrong…

The Core Problem: P-value ≠ Trustworthiness

(Image source: JacqueLENS PhD)

When I first started in data science, I made this exact mistake. I thought a p-value of 0.05 meant my test had a 95% chance of being right. It felt so intuitive!

And let’s be honest - we WANT to believe this interpretation, don’t we? It’s like seeing a 95% on a test score. Our brains instantly go “Woohoo! A+!” Nobody needs to trick us into this misunderstanding. We practically run towards it with open arms because it’s just so beautifully simple.

But here’s the thing - that’s NOT what p-values tell us at all.

Think about it like a medical test. If you test positive for a rare disease, does that mean you’re 95% likely to have it? Not necessarily! It depends on how common the disease is in the first place. The same logic applies to A/B tests.
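To make the medical analogy concrete, here is a quick back-of-the-envelope check in Python. The prevalence, sensitivity, and false positive rate below are made-up numbers, chosen only to show how much the base rate matters:

```python
# Hypothetical rare disease screening (all numbers are made up for illustration).
prevalence = 0.01            # 1% of people actually have the disease
sensitivity = 0.95           # P(test positive | disease)
false_positive_rate = 0.05   # P(test positive | no disease)

# Bayes' theorem: P(disease | positive test)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(f"P(disease | positive test) = {p_disease_given_positive:.0%}")
# 16% -- nowhere near 95%, because the disease is rare to begin with.
```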

What p-values actually tell us:

  • The probability of seeing results at least as extreme as yours IF there’s actually no real effect
  • It’s about the data given the hypothesis, not the hypothesis given the data

What p-values DON’T tell us:

  • How likely it is that your “significant” result reflects a real improvement
  • The probability that your treatment actually works (i.e., the probability that your hypothesis is true)
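If you want to see the “data given the hypothesis” idea in action, a tiny simulation helps. This is just a sketch (it assumes NumPy and SciPy and uses a plain two-sample t-test on a made-up metric): it runs thousands of A/A tests where there is no real effect by construction, and roughly 5% of them still come back “significant” at α = 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n_users = 5_000, 1_000
false_alarms = 0

for _ in range(n_tests):
    # A/A test: both groups are drawn from the same distribution, so the null
    # hypothesis is true by construction and any "significant" result is noise.
    control = rng.normal(loc=10.0, scale=2.0, size=n_users)
    treatment = rng.normal(loc=10.0, scale=2.0, size=n_users)
    _, p_value = stats.ttest_ind(control, treatment)
    false_alarms += p_value < 0.05

print(f"Share of A/A tests that look 'significant': {false_alarms / n_tests:.1%}")
# ~5%, exactly what alpha = 0.05 promises, and none of them are real effects.
```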

What Business Decision Makers Really Want to Know

In my years of working with product teams, the question I hear most often isn’t “What’s the p-value?” It’s:

“Yeah, I know we have a significant result. So how confident can we be that this improvement is real?”

And honestly? The p-value does not answer that question. But here’s the funny thing: p-values are like the celebrity of statistics. They’re everywhere, everyone name-drops them in meetings, and knowing about them makes us feel like data wizards. So we desperately want to believe they answer everything, including this question.

They don’t. What decision makers are really asking for is the False Positive Risk (FPR): the probability that your “significant” result is actually just noise.

The Eye-Opening Reality: False Positive Risk

Here’s where things get interesting (and a bit scary).

False Positive Risk (FPR) is simply this:
when your A/B test shows a “winner,” what’s the chance you’re celebrating nothing but random noise?

It’s the probability that your exciting result is actually a false alarm, like a smoke detector going off because you burnt toast, not because there’s a fire.

And here’s the kicker. This risk depends on something most people never consider: your historical decision-making success rate.

\( \text{False Positive Risk} = \frac{\alpha \times \pi}{\alpha \times \pi + (1 - \beta) \times (1 - \pi)} \)

\( \text{where } \pi = \text{probability the null hypothesis is true (1 - success rate)}, \alpha = \text{significance level, and } 1 - \beta = \text{statistical power} \)

(One practical note: if you run a two-sided test at α = 0.05 but only ship winners in the favorable direction, only half of the false positives count against you, so plug in α/2 = 0.025. That’s the convention behind the numbers below.)

Let me blow your mind with some real numbers from top tech companies:

  • Airbnb: Only 8% of their A/B tests succeed
    • → When they see p < 0.05, there’s still a 26.4% chance it’s a false positive!
  • Microsoft: 33% success rate
    • → 5.9% FPR
  • Booking.com: 10% success rate
    • → 22% FPR

(Source of these data: Ron Kohavi)
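You can reproduce these numbers yourself. Here’s a minimal sketch of the formula in Python, assuming 80% power and following Kohavi’s convention of counting only the half of the false positives that land in the favorable direction of a two-sided test (so α/2 = 0.025 goes into the numerator):

```python
def false_positive_risk(success_rate, alpha=0.05, power=0.80):
    """P(no real effect | the test came back 'significant' in the favorable direction)."""
    alpha = alpha / 2                       # two-sided test: only favorable-direction false positives count
    pi_null = 1 - success_rate              # P(null hypothesis is true)
    false_positives = alpha * pi_null       # significant by pure chance
    true_positives = power * success_rate   # real effect, correctly detected
    return false_positives / (false_positives + true_positives)

for company, rate in [("Airbnb", 0.08), ("Microsoft", 1 / 3), ("Booking.com", 0.10)]:
    print(f"{company}: success rate {rate:.0%} -> FPR {false_positive_risk(rate):.1%}")

# Airbnb: success rate 8% -> FPR 26.4%
# Microsoft: success rate 33% -> FPR 5.9%
# Booking.com: success rate 10% -> FPR 22.0%
```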

See the pattern? The lower your success rate, the less you can trust your “significant” results.

A Real-World Case Study: Airbnb
#

Let’s make this concrete with Airbnb’s actual data. They’ve publicly shared that only 8% of their A/B tests succeed - meaning 92% of their experiments don’t move the needle. Pretty humbling, right?

Now, imagine you’re on Airbnb’s team and you just got p = 0.05 on your latest test.

Time to celebrate ( ˘•ω•˘ ) ?

Not so fast. Here’s what the math tells us:

  • Historical success rate: 8% (ouch!)
  • Your p-value: 0.05 (looks significant!)
  • Your actual False Positive Risk: 26.4%

Since the test is two-sided at α = 0.05 but we only celebrate wins in the favorable direction, we plug in α/2 = 0.025, along with 80% power:

\( \text{FPR} = \frac{0.025 \times 0.92}{0.025 \times 0.92 + 0.8 \times 0.08} = \frac{0.023}{0.023 + 0.064} \approx 0.264 \)

Let me translate this for decision makers: Even though your test shows “statistical significance,” there’s still a 26.4% chance you’re looking at pure noise. In other words, only about 3 out of 4 “significant” results at Airbnb are actually real improvements.

Imagine telling your PM: “Hey, our test is significant, but there’s a 1 in 4 chance we’re completely wrong.” Not quite the victory lap moment anymore, is it?

This is why companies like Airbnb have become incredibly disciplined about replication and validation. When your baseline success rate is low, even “significant” results need extra scrutiny.

Bayesian Thinking Changed My Perspective

The breakthrough moment for me was understanding this simple truth: The trustworthiness of your significant results depends on your track record.

It’s like being a weather forecaster. If you correctly predict rain 90% of the time, people trust your forecasts more than someone who’s only right 20% of the time. The same principle applies to A/B testing.

Teams that rigorously test good ideas and maintain high success rates can trust their p-values more. Teams that test everything and rarely see real improvements? They need to be much more skeptical.

(New to Bayesian thinking? This simple explanation on YouTube helped me grasp the concept)

Don’t Be Fooled by P-values

After learning about False Positive Risk, I changed how I approach A/B testing:

  • Track your team’s success rate: You can’t calculate FPR without it
  • Be extra skeptical of surprising results: If something seems too good to be true…
  • Consider replication for borderline results: When p is between 0.01 and 0.10, run it again
  • Use stricter significance thresholds: Maybe 0.01 instead of 0.05 for important decisions
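On that last point, a quick sketch (same formula and assumptions as before: 80% power, α halved for the two-sided test, and an Airbnb-like 8% success rate) shows how much a stricter threshold buys you:

```python
def false_positive_risk(success_rate, alpha, power=0.80):
    alpha = alpha / 2  # two-sided test: only favorable-direction false positives count
    pi_null = 1 - success_rate
    return (alpha * pi_null) / (alpha * pi_null + power * success_rate)

for alpha in (0.05, 0.01):
    fpr = false_positive_risk(success_rate=0.08, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> FPR = {fpr:.1%}")

# alpha = 0.05 -> FPR = 26.4%
# alpha = 0.01 -> FPR = 6.7%
```

With a low baseline success rate, tightening the threshold from 0.05 to 0.01 cuts the False Positive Risk from roughly one in four to under one in ten.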

I don’t want to overwhelm this post with every technique for reducing False Positive Risk in A/B tests. Here’s the bottom line:

Next time you see p-value < 0.05, don’t celebrate just yet. Ask yourself: “Given our historical success rate, how likely is this result to be real?”

This simple mindset shift has saved my teams from countless wrong decisions. It’s not about being pessimistic - it’s about being realistic with the math.


If this article opened your eyes to the hidden complexities of A/B testing, you’re going to love diving deeper into the topic. I highly recommend checking out:

  • “Trustworthy Online Controlled Experiments” by Ron Kohavi - This is THE definitive guide to A/B testing, written by the former head of experimentation at Microsoft. Get the book here (Amazon affiliate link)
  • The research paper “A/B Testing Intuition Busters” by Kohavi, Deng, and Vermeer - This paper goes deep into False Positive Risk and other counterintuitive aspects of experimentation. It’s where I first learned about these concepts! Find the paper here