
2nd Oct 2014


This is the second in a series of posts which aim to make clearer a few commonly used phrases in Conversion Optimisation statistics and debunk a few myths around what you can and can’t infer from your test statistics. For the first post in the series, read: Statistics In Testing Jargon Buster: Part One. This jargon buster series uses examples from the A/B testing tool Optimizely, but the explanations apply to statistics in any testing tool.

PART TWO: STATISTICAL SIGNIFICANCE

Probably the most used and most contested CRO term of them all: statistical significance. How many times have you quoted a test running to ‘95% statistical significance’? 95 times? 100 times? But what does it actually mean? Is it a good thing?

Firstly, before we delve into the glossary, a brief introduction. In a lot of CRO testing, we agree that a test is significant (i.e. the probability of the result being a fluke is low enough for us to accept it) when it reaches 95% statistical significance and 80% statistical power. This will make more sense later on.

So, let’s dive straight in. Statistical significance measures how often, when the variation is actually no different to the original, the test will correctly say so. So when you test to 95%, you are basically saying: if I ran this test 20 times on two versions that genuinely perform the same, 19 of those tests would show no difference.

Wait, what?!

Okay, let me put it another way. Say we have tested the original landing page of a company, which for the sake of argument we’ll call Frank’s Merchandise. We know that Frank’s homepage converts 3% of traffic into sales. Say we then test a variation of the homepage whose true conversion rate is also 3%. Despite the two pages being identical in performance, 5% of the time the test would show that one version was better than the other. This is also called a false positive result.

Great!

Or is it? One in 20? Nearly twice as likely as rolling a double six in Monopoly? Put another way, if we tested two versions of exactly the same webpage against each other, 1 out of every 20 tests would say that one variation outperformed the other. Is that good enough for us?
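
To see why, here’s a minimal sketch in Python of that Frank’s Merchandise scenario. It isn’t how any particular testing tool works under the bonnet; it simply simulates an A/A test (both pages truly converting at 3%) many times over, checks each one with a simple two-proportion z-test, and counts how often a ‘winner’ is declared at the 95% significance level. The traffic figures are made up purely for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# A/A test: both pages truly convert at 3%, so any "winner" is a fluke
true_rate = 0.03
visitors_per_arm = 10_000   # assumed traffic level, purely illustrative
n_tests = 2_000

false_positives = 0
for _ in range(n_tests):
    conv_original = rng.binomial(visitors_per_arm, true_rate)
    conv_variation = rng.binomial(visitors_per_arm, true_rate)
    p = two_proportion_p_value(conv_original, visitors_per_arm,
                               conv_variation, visitors_per_arm)
    if p < 0.05:            # "significant" at the 95% level
        false_positives += 1

print(f"Tests declaring a winner: {false_positives / n_tests:.1%}")
```

Run it and you should see roughly 5% of the tests declaring a winner that isn’t really there, which is exactly what testing to 95% significance signs you up for.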

But what’s statistical power, and what’s the difference between the two?

Now, statistical power is kind of the flip side of statistical significance. Statistical power measures how often, when the variation really is different to the original, the test will pick up on it. The industry standard is 80% statistical power, which basically means that if I ran a test where I knew categorically there was a difference in conversion between the original and the variation, the test would pick up on that difference 8 times out of 10. The other 2 times out of 10 the test would fail to detect the difference between the two versions, even though there is one. This is also called a false negative.
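
One thing the 80% figure glosses over is that power isn’t a property of the test alone: it depends on how much traffic you give the test and how big the true difference actually is. As a rough, hypothetical sketch (a made-up uplift from 3% to 3.6%, an assumed 10,000 visitors per arm, and statsmodels’ standard power calculation for two proportions rather than any particular tool’s engine), you can check what power a given amount of traffic actually buys you:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed scenario: the original truly converts at 3%, the variation at 3.6%
effect = proportion_effectsize(0.036, 0.03)   # standardised effect size (Cohen's h)

analysis = NormalIndPower()
power = analysis.solve_power(effect_size=effect,
                             nobs1=10_000,        # visitors per arm, illustrative
                             alpha=0.05,          # i.e. testing to 95% significance
                             ratio=1.0,
                             alternative='two-sided')

# 'power' is the chance the test flags this real uplift;
# 1 - power is the chance of a false negative at this traffic level.
print(f"Power with 10,000 visitors per arm: {power:.0%}")
```

If the number that comes out is below 80%, the test is more likely than the industry standard allows to miss a genuinely better variation; more traffic, or a bigger true uplift, pushes it up.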

But surely statistical significance and statistical power are set at the industry standards of 95% and 80% respectively for a reason?

Erm, actually no. There is no deep basis behind 95% for statistical significance, nor 80% for statistical power. They are simply the values most commonly used in statistics, particularly medical statistics, although a range of other values is also used. One suggested reason why statistical significance is set more strictly than statistical power is that a false positive is seen as the riskier mistake. Certainly in medical terms, it is far more damaging to roll out a new drug treatment that is actually less effective than the control than it is to miss out on a more effective treatment. But in CRO, which would you say is more damaging to a business:

Not implementing a variation that increases conversion (a false negative)

or

Implementing a variation that has no effect on conversion (a false positive)?

I hope that’s given you more of a grounding in what a lot of CRO hypothesis testing is based on, and maybe thrown up a few more questions. I’d like to leave you with two questions to ask about your business:

  • Which do you value more: never implementing a variation that makes no real difference, or never missing a variation that genuinely improves results?
  • With that in mind, would you change the levels of statistical significance and statistical power you test to? (The sketch below shows how that choice plays out in the traffic each test needs.)
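
If you are tempted to change those levels, the trade-off mostly shows up as traffic: the stricter you are on either number, the more visitors each test needs before it can conclude. Here’s a rough sketch of that, reusing the same hypothetical 3% vs 3.6% scenario and the same statsmodels calculation as above:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Same illustrative scenario: 3% baseline, a true uplift to 3.6% on the variation
effect = proportion_effectsize(0.036, 0.03)
analysis = NormalIndPower()

# How the required traffic per arm moves as you tighten or relax each threshold
for alpha in (0.10, 0.05, 0.01):               # 90%, 95%, 99% significance
    for power in (0.80, 0.90):
        n = analysis.solve_power(effect_size=effect, alpha=alpha,
                                 power=power, ratio=1.0,
                                 alternative='two-sided')
        print(f"significance {1 - alpha:.0%}, power {power:.0%}: "
              f"~{n:,.0f} visitors per arm")
```

Relaxing significance to 90%, or accepting less than 80% power, shortens tests, but at the cost of more false positives or more missed winners. Which way to lean comes back to the question above: which of those two mistakes costs your business more?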

Any questions? We’d be happy to answer them.