P-Value Devoutness

Great article on why one shouldn't treat p-values as the gold standard for evaluating experiments and hypothesis tests: http://www.nature.com/news/scientific-method-statistical-errors-1.14700


It proposes reporting effect sizes and confidence intervals when reporting (publishing) the results of experiments and tests. Additionally, one should not only evaluate tests the "frequentist" way but also incorporate Bayesian evaluations. The following links are good tutorials on how to do this in R:
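As a rough illustration of what such a combined report could look like (sketched here in Python rather than R, with made-up conversion counts), one can compute the effect size, a confidence interval, the frequentist p-value, and a simple Bayesian posterior probability in a few lines:

```python
import math
import random

def ab_summary(conv_a, n_a, conv_b, n_b, seed=0, draws=100_000):
    """Summarise an A/B test: effect size, 95% CI, two-sided p-value,
    and the posterior probability that B beats A.
    Illustrative sketch only; the counts below are made up."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = p_b - p_a                          # absolute effect size
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci95 = (lift - 1.96 * se, lift + 1.96 * se)   # normal-approx 95% CI
    # frequentist view: two-proportion z-test with pooled standard error
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = lift / se_pool
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Bayesian view: Beta(1,1) priors, Monte Carlo over the two posteriors
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        > rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        for _ in range(draws)
    )
    return {"lift": lift, "ci95": ci95, "p_value": p_value,
            "p_b_beats_a": wins / draws}

# hypothetical counts: 200/2000 conversions vs. 240/2000
summary = ab_summary(conv_a=200, n_a=2000, conv_b=240, n_b=2000)
```

Note how the p-value alone would hide most of the picture: the confidence interval and the posterior probability tell the reader how large and how certain the lift actually is.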






The Nature article also talks about p-hacking, but doesn't go into detail about how exactly one can hack a p-value. A common mistake is to look at the test results on a daily basis and stop the experiment as soon as a p-value smaller than 0.05 has been reached. Multiple comparisons are another huge issue when looking at the p-values of a test.
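The effect of such daily peeking is easy to demonstrate with a small simulation: run many A/A experiments (where there is no real effect), check the p-value after every simulated day, and count how often the test would have been stopped as "significant". All parameters below are made up for illustration:

```python
import math
import random

def peeking_false_positive_rate(n_experiments=1000, peeks=20,
                                per_day=100, seed=1):
    """Simulate A/A tests (true conversion rate 0.5, no real effect).
    Returns (false-positive rate when peeking daily and stopping at the
    first p < 0.05, false-positive rate with a single look at the end)."""
    rng = random.Random(seed)
    early_stops = 0
    final_rejections = 0
    for _ in range(n_experiments):
        successes, n, stopped = 0, 0, False
        for _day in range(peeks):
            successes += sum(rng.random() < 0.5 for _ in range(per_day))
            n += per_day
            # one-sample z-statistic against the known rate 0.5
            z = (successes - 0.5 * n) / math.sqrt(0.25 * n)
            if not stopped and abs(z) > 1.96:   # "p < 0.05" at this peek
                stopped = True
        early_stops += stopped                   # stopped at any peek
        final_rejections += abs(z) > 1.96        # single pre-planned look
    return early_stops / n_experiments, final_rejections / n_experiments

peeked, fixed = peeking_false_positive_rate()
```

With no real effect at all, the single pre-planned look rejects about 5% of the time, as designed, while the daily-peeking rate is several times higher. Every peek is another comparison, which is exactly the multiple-comparisons problem in disguise.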


The best summary of all the fallacies one can encounter while doing A/B tests is this: http://www.evanmiller.org/how-not-to-run-an-ab-test.html


So, in summary, here is an aggregate dos-and-don'ts list on how to properly evaluate and report split-testing results:


  1. Report the test's power, confidence intervals and effect size, never only the p-value
  2. Use more than one perspective: evaluate the test on a p-value basis (frequentist approach) but also with a Bayesian approach
  3. Use intuition and talk to experts as a third perspective
  4. Don't test hypotheses that seem interesting but have very low odds of being true
  5. No p-hacking: determine before starting the test when you'll end and evaluate it
  6. No p-hacking: don't look at the test results before the pre-determined end of the test
  7. Avoid multiple-comparison problems by testing only one hypothesis at a time (no more than one test group)
  8. Last but definitely not least: run an A/A test to check how valid your randomization process, i.e. your test-/control-group assignment, is
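To make points 1 and 5 concrete: the pre-determined end of the test can be fixed before the first visitor is assigned, via a standard power calculation. Below is a minimal sketch using the usual normal-approximation sample-size formula for two proportions; the baseline rate and minimum detectable effect are hypothetical:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Visitors needed per group to detect an absolute lift of `mde`
    over a baseline conversion rate `p_base` (two-sided test, standard
    normal-approximation formula). Sketch with hypothetical inputs."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 for power=0.8
    p_b = p_base + mde
    variance = p_base * (1 - p_base) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# hypothetical: 10% baseline, want to detect a 2-percentage-point lift
n = sample_size_per_group(p_base=0.10, mde=0.02)
```

Once `n` is reached in both groups, the test ends and is evaluated exactly once; that single rule rules out the peeking-based p-hacking from points 5 and 6.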


