Navigating the sea of A/B Test Significance Calculators
Let’s start with the basics: why use them?
Significance (also referred to as confidence) calculators help marketers determine whether the data set they are analyzing shows enough of a difference to reach statistical significance. In layman’s terms, significance helps marketers understand whether the results of a test are worth noting and potentially acting on. Significance, or confidence, is most often evaluated at 95% using simple t-test methods (one- or two-tailed), but some calculators use slightly different methods such as chi-squared tests, difference of observed proportions, and others. Each method has merits and trade-offs, and how well it fits your company depends largely on your risk appetite versus the need to make quick determinations.
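To make the one- versus two-tailed distinction concrete, here is a minimal sketch of a pooled two-proportion z-test, one common way to compare conversion rates. The function names and the visitor counts are made up for illustration, and this is not necessarily the exact method any calculator reviewed here uses.

```python
import math

def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled z-test for the difference between two conversion rates.

    Returns (z, one_tailed_p, two_tailed_p). The one-tailed p-value
    tests only whether B converts better than A.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    one_tailed = normal_sf(z)
    two_tailed = 2 * normal_sf(abs(z))
    return z, one_tailed, two_tailed

# Hypothetical data: control converts 200 of 4,000 visitors, treatment 235 of 4,000.
z, p_one, p_two = two_proportion_z_test(200, 4000, 235, 4000)
print(f"z = {z:.2f}, one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

With these made-up numbers the one-tailed p-value falls below 0.05 while the two-tailed p-value does not, so the same data can look "significant at 95%" in one calculator and not in another.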
Be careful to understand what these significance calculators are telling you
What often gets lost when using these calculators is that a result that has “reached significance” does not mean we are confident in the exact result of the test; it means only that the treatment is performing in a way that is unlikely to be the result of chance or error. Said differently, assume we have a test where a checkout payment page is performing 10% better than the control at 95% significance. That measure of significance is not saying that we should expect this page to perform 10% better post-test. Instead, all that significance shows is that the test has measured enough interactions to determine that the treatment is outperforming the control by a margin that would have less than a 5% chance of showing up through random chance alone. This is an important point that many conversion rate optimization folks fail to understand, and as a result they paint some dishonest pictures of test results.
To look at and forecast A/B test performance accurately, proper sample sizes and confidence intervals come into play (more on this in an upcoming post).
While significance calculators are an important tool in the marketer’s toolbox, they should not be the only thing considered when analyzing a test. Be careful to understand the statistics behind A/B testing and calculators to avoid poor test execution and analysis. Evangelizing faulty test results can lead to poor internal beliefs, which can be hard to dispel. We can go into more detail in later posts about proper process and analysis considerations.
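As a rough illustration of why a confidence interval says more than a bare "significant" verdict, the sketch below computes a simple Wald interval for the difference between two conversion rates. The function name and the traffic numbers are hypothetical, and the Wald interval is only one of several interval methods.

```python
import math

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Wald confidence interval for (rate_b - rate_a).

    z_crit=1.96 corresponds to roughly 95% confidence.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    diff = rate_b - rate_a
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical data: control 1,000 of 20,000; treatment 1,100 of 20,000,
# which is an observed 10% relative lift.
low, high = diff_confidence_interval(1000, 20000, 1100, 20000)
control_rate = 1000 / 20000

print(f"absolute lift: {low:.4f} to {high:.4f}")
# Dividing by the control rate gives a rough relative range; it ignores
# the uncertainty in the control rate itself.
print(f"relative lift: {low / control_rate:.1%} to {high / control_rate:.1%}")
```

In this made-up example the interval excludes zero, so the result counts as significant, yet the plausible relative lift runs from roughly 1% to 19%. Reporting "10% better" alone would hide that spread.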
Background & Criteria Used on A/B Test Significance Calculators
Test significance calculators help interpret the results of A/B tests by determining how statistically significant the data is. Each calculator computes values in a slightly different way, and some contain additional fields that will look deeper into the data for greater accuracy and insights. We’ve reviewed the A/B test significance calculators below based on the following criteria:
- Ability to add more than 2 test versions
- Ability to change confidence level
- Error or confidence interval data
- Evaluation Method
VWO’s A/B Split Test Significance Calculator
Visual Website Optimizer (VWO) has put together what one could argue is the most attractive of the calculators featured here. We are always fans of straightforward design, but feature-wise it has been kept a little too simple for our taste. With this tool you basically get a one-tailed read on how significant the test results are thus far, along with a simple reported p-value (the lower the p-value, the more confidence that the given group is actually different from the baseline). It is more of a quick directional check for significance than something you would do deep analysis with. Perhaps we are nit-picking, but the tool could use some better usability as well. For example, the p-value and significance fields appear as if they require the user to input a value, which could confuse some.
Pros: Attractive design, very simple to use, links to other tools such as testing duration estimator
Considerations: One-tailed validation method, no ability to add multiple test variations, no confidence interval information
Evan Miller’s Chi-Squared Test Calculator
Evan Miller’s excellent chi-squared calculator is a great way to dip your toes into deeper testing analysis, going beyond simple one-tailed calculations and even offering confidence intervals. What we like about this tool is its simple design, along with the many other statistical tools on the site for really putting your test results through their paces. Be careful to understand how your analysis might change if your test encompasses more than one variation, as this tool only allows a two-group comparison. As an extra bonus, even though it can be a bit hidden (usability hint), the link element on the top right of the calculator allows you to share calculation results with the whole team.
Pros: Simple design, many other test analysis tools available on Evan Miller’s site, shareable link which saves your test data, confidence interval data, adjustable p-value requirement for test success.
Considerations: No ability to add multiple test variations
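For readers curious what a chi-squared comparison of two groups looks like under the hood, here is a minimal sketch of a Pearson chi-squared test on a 2x2 table of conversions and non-conversions. It omits any continuity correction, the function name and data are hypothetical, and it may differ in detail from what the calculator above does.

```python
import math

def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table.

    Returns (chi2, p_value) with one degree of freedom.
    """
    observed = [
        [conv_a, n_a - conv_a],  # group A: converted, did not convert
        [conv_b, n_b - conv_b],  # group B: converted, did not convert
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With one degree of freedom, the chi-squared upper tail
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: group A converts 200 of 4,000 visitors, group B 235 of 4,000.
chi2, p = chi_squared_2x2(200, 4000, 235, 4000)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")
```

For a two-group comparison like this, the chi-squared statistic is the square of the pooled two-proportion z statistic, so the p-value matches a two-tailed z-test on the same data.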
Thumbtack’s Abba A/B Test Calculator
Abba computes a few useful results, including the estimated success rate confidence interval (important); a p-value, which gives a measure of how confident it is that the two groups truly have different chances of success; and the estimated improvement in success rate as a percentage relative to the baseline success rate. Abba also updates the URL fragment for each report you generate, so you can easily share reports by copying the URL and sending it to friends or linking to it on social media. As a major bonus, not only does this tool allow additional variations to be added for calculation, but it can also perform an additional layer of statistical calculation to account for multiple group correction (more on this in an upcoming blog post). Additionally, Thumbtack is the only provider we have reviewed that gives a detailed write-up on what statistical models it uses and why, which can be a creative educational tool for some to get their feet wet in the world of test analysis.
Pros: Decent design, simple to use, confidence interval and improvement ranges, ability to add multiple variations, multiple group correction ability, shareable results, and educational section
Considerations: Could be a bit overwhelming for those just beginning with testing.
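To give a feel for what a multiple group correction does, the sketch below applies a Bonferroni correction, one common and conservative approach. It is offered only as an illustration; it is not necessarily the correction Abba applies, and the function name and p-values are made up.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which comparisons stay significant after a Bonferroni correction.

    Each comparison is held to alpha divided by the number of comparisons,
    so the chance of any false positive across all of them stays near alpha.
    """
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]

# Hypothetical p-values for three variations, each compared with the same control.
p_values = [0.030, 0.012, 0.200]
threshold, flags = bonferroni_significant(p_values)

print(f"per-comparison threshold: {threshold:.4f}")
for p, significant in zip(p_values, flags):
    print(f"p = {p:.3f} -> significant after correction: {significant}")
```

Note that the first variation (p = 0.030) would pass an uncorrected 0.05 threshold but fails once the threshold is tightened for three comparisons, which is exactly the kind of false win a correction is meant to catch.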