Exploring your data and asking questions of it can reveal opportunities for moving forward. Blindly trusting data can be dangerous: aggregate data can hide, or even reverse, trends that are observed within individual groups. By questioning your data you can figure out what went wrong, what went right, when to rely on aggregate data and when to explore segments.
Simpson’s paradox is a phenomenon in statistics in which a trend appears in several groups of data but disappears or reverses when the groups are combined. It is most commonly encountered when group-level data is given a causal interpretation. You can read more about Simpson’s paradox here.
One of the most commonly cited examples is the UC Berkeley gender bias case. The aggregate admissions data implied a significant bias towards accepting male applicants (44% of male applicants were accepted, compared with 35% of female applicants). On further investigation, it was found that several departments were actually significantly biased against male applicants.
The research paper by Bickel et al. (available here) concluded that a higher proportion of women applied to departments with lower admission rates, while men tended to apply to the less competitive departments. Simpson’s paradox is an example of omitted variable bias (OVB).
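The reversal is easy to see with a small worked example. The numbers below are illustrative, not the actual Berkeley figures: each department admits women at a *higher* rate, yet the aggregate rate favours men, because women mostly applied to the harder department.

```python
# Made-up admissions counts illustrating Simpson's paradox.
# Format: dept -> group -> (admitted, applied)
admissions = {
    "A": {"men": (80, 100), "women": (18, 20)},    # easier department
    "B": {"men": (10, 100), "women": (36, 180)},   # harder department
}

def rate(admitted, applied):
    return admitted / applied

# Within each department, women are admitted at a higher rate
for dept, groups in admissions.items():
    m, w = rate(*groups["men"]), rate(*groups["women"])
    print(f"Dept {dept}: men {m:.0%}, women {w:.0%}")

# Aggregating across departments reverses the trend
totals = {g: [0, 0] for g in ("men", "women")}
for groups in admissions.values():
    for g, (adm, app) in groups.items():
        totals[g][0] += adm
        totals[g][1] += app

agg_men, agg_women = rate(*totals["men"]), rate(*totals["women"])
print(f"Aggregate: men {agg_men:.0%}, women {agg_women:.0%}")  # men 45%, women 27%
```

Department A admits women at 90% vs 80% for men, and department B at 20% vs 10%, yet in aggregate men are admitted at 45% vs 27%, purely because of where each group applied.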
OVB occurs when a model is mis-specified by incorrectly leaving out one or more important variables. The model then compensates for the missing variable by over- or underestimating the effect of the variables that are included.
When analysing test results we should train ourselves to be sceptical and to challenge what we think we know. There is a famous quote, rumoured to have been said by Sir Winston Churchill, that serves as a helpful reminder of how data can be abused and why we should always question it: “I only believe in statistics that I doctored myself”.
We need to be suspicious of the data collected in A/B tests if we are to gain a genuine understanding of test results. In the UC Berkeley example above, segmentation revealed clear trends depending on the school to which the applicant applied. That variable (which school the person applied to) was key to understanding what was truly occurring.
Hidden variables such as the school of application heavily influenced the case study above, and they may equally be lurking in the results of your A/B tests. These effects commonly occur when sampling is poorly distributed. Unfortunately, the online environment is far from ideal for experimentation: a huge number of variables such as user intent, screen size, device type, new vs returning users and many, many more can influence tests, and it is impossible to control and randomise users by every possible variable when assigning them to a variation.
One potential solution is stratified sampling. Stratified sampling divides the population into mutually exclusive strata and then allocates users from each stratum equally across variations. However, to my knowledge no testing tool currently offers this feature; they all opt for random sampling instead. I believe stratified sampling forms part of the Netflix optimisation strategy, but for many teams it can be difficult to implement as it requires building a custom sampling tool.
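A custom stratified assignment could look something like the sketch below. The stratum keys (`device`, `user_type`) and the record shape are assumptions for illustration, not fields any particular testing tool provides; the point is that each stratum is split evenly across variations rather than randomising over the whole population at once.

```python
import random
from collections import defaultdict

# Hypothetical user records; device and user_type are illustrative strata.
random.seed(0)
users = [
    {"id": i,
     "device": random.choice(["mobile", "desktop"]),
     "user_type": random.choice(["new", "returning"])}
    for i in range(1000)
]

def stratified_assign(users, variants=("A", "B"), seed=42):
    """Assign users to variants so that each stratum is split
    evenly, instead of relying on global random sampling."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in users:
        strata[(u["device"], u["user_type"])].append(u)
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)                      # randomise within the stratum
        for i, u in enumerate(members):
            assignment[u["id"]] = variants[i % len(variants)]  # round-robin split
    return assignment

assignment = stratified_assign(users)
```

Within each stratum the variant counts differ by at most one, so no variation can end up with, say, a disproportionate share of new mobile users.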
To limit the chance of OVB during data analysis, data should be segmented and a model rationalised to form the explanation that best describes the data collected. The segments created may not always be large enough to draw robust conclusions. When this occurs, the test should either be repeated on that segment, or the inferred knowledge used as the basis for new follow-up tests.
It is this model, built from a better understanding of the data, that helps you decide what to do with it. For example, a homepage test that added a brand benefits section above the fold may show no significant impact on your KPI in the aggregate data. Segmenting the data by new and returning users, however, you might find evidence of a positive impact on new users that is being masked by the returning user data.
From this model, you could hypothesise that brand benefits on the homepage resonate with new users and should be implemented for them (or tested again on this segment if the original sample size was not large enough). You could also form further test hypotheses based on the knowledge that returning users have different requirements of the homepage. This is a good example of how being sceptical of results can improve optimisation output.
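The homepage example above can be sketched with some made-up conversion counts, flat in aggregate but with a positive effect on new users masked by a negative effect on returning users:

```python
# Illustrative (invented) conversion counts for the homepage test.
# Format: segment -> variant -> (conversions, visitors)
results = {
    "new":       {"control": (100, 1000), "variant": (140, 1000)},
    "returning": {"control": (200, 1000), "variant": (170, 1000)},
}

def cvr(conversions, visitors):
    return conversions / visitors

# Segment-level view: +4.0% for new users, -3.0% for returning users
for segment, arms in results.items():
    lift = cvr(*arms["variant"]) - cvr(*arms["control"])
    print(f"{segment:>9}: lift {lift:+.1%}")

# Aggregate view hides the segment-level story (+0.5% overall)
agg = {arm: [0, 0] for arm in ("control", "variant")}
for arms in results.values():
    for arm, (conv, n) in arms.items():
        agg[arm][0] += conv
        agg[arm][1] += n
agg_lift = cvr(*agg["variant"]) - cvr(*agg["control"])
print(f"aggregate: lift {agg_lift:+.1%}")
```

Looking only at the +0.5% aggregate lift you might declare the test flat, when the segment view suggests shipping the change to new users and rethinking it for returning ones.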
During the analysis and segmentation of data it is imperative that we start with a predefined goal or hypothesis. If a large number of variables are collected and exhaustively mined for combinations that show a correlation, misleading insights are likely to be generated; this is known as data dredging. Caution should also be exercised in the exploratory analysis of results, as correlation does not always imply causation. To see why, you only need to read this article demonstrating the correlation between the number of films Nicolas Cage has appeared in and the number of people who drowned by falling into swimming pools. Finding a correlation can still be helpful, as it may be the start of an interesting test hypothesis that is then validated through good A/B testing. You can read more about this here.
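To see why dredging is dangerous, it helps to remember that with enough comparisons, pure noise will produce "significant" correlations. The simulation below is a toy illustration: none of the invented metrics has any real relationship with the KPI, yet several clear the conventional significance cutoff by chance alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_metrics = 30, 200

# A "KPI" and 200 candidate metrics, all pure random noise:
# any correlation found between them is spurious by construction.
kpi = rng.normal(size=n_points)
metrics = rng.normal(size=(n_metrics, n_points))

corrs = np.array([np.corrcoef(kpi, m)[0, 1] for m in metrics])

# |r| > 0.36 is roughly the two-tailed p < 0.05 threshold for n = 30
strong = int(np.sum(np.abs(corrs) > 0.36))
print(f"{strong} of {n_metrics} noise metrics look 'significant'")
```

With 200 comparisons at the 5% level you would expect around 10 false positives; a predefined hypothesis, or a follow-up A/B test, is what separates these from real effects.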
This is a cautionary tale about looking at aggregated data, which can sometimes hide lurking variables. Data analysis is just one area where testing can fall down; here is an article about some of the other ways you may be getting it wrong.