P-values: Can we agree to disagree?

The p-value debate, sparked by the American Statistical Association (I wrote about it here), has attracted a lot of attention in the scientific community. Many people have commented on it, and the more I read, the more confused I became about what the correct way to draw inferences from data should be.

It seems that Andrew Gelman has found some fellow campaigners against the inappropriate use of p-values and null hypothesis significance testing in science. The American Statistical Association has now published a press release with a link to a paper forthcoming in The American Statistician. Here is the take-home message:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
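
To make point 5 concrete, here is a minimal sketch rather than anything taken from the ASA paper. It assumes a two-sided z-test with a known standard deviation, which is a simplification, and the function name and all the numbers are made up for illustration. It shows that a tiny effect in a huge sample can produce a far smaller p-value than a sizable effect in a small sample.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(estimate, sd, n):
    """P-value of a two-sided z-test of 'true mean = 0' (hypothetical helper).

    Assumes the standard deviation `sd` is known, which is rarely true
    in practice; it keeps the arithmetic transparent.
    """
    z = estimate / (sd / sqrt(n))          # estimate measured in standard errors
    return 2 * NormalDist().cdf(-abs(z))   # probability mass in both tails


# A negligible effect measured on a million observations ...
tiny_effect_big_sample = two_sided_p_value(estimate=0.01, sd=1.0, n=1_000_000)
# ... versus a sizable effect measured on ten observations.
big_effect_small_sample = two_sided_p_value(estimate=0.5, sd=1.0, n=10)

print(f"effect 0.01, n = 1,000,000: p = {tiny_effect_big_sample:.2e}")
print(f"effect 0.50, n = 10:        p = {big_effect_small_sample:.3f}")
```

The first p-value is vanishingly small even though the effect is fifty times smaller than the second one, which is why the p-value alone says nothing about how large or important an effect is.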

It’s about time that an association with the leverage of the ASA makes a clear statement and takes the topic out of the blogosphere. The second point in particular touches on a widespread misconception, namely that an insignificant test means there is no effect. My feeling is that it will still take a while to finally get rid of it.

I always find it shocking how difficult it is to explain the concept of statistical uncertainty to people without statistical training. (Obviously I’m the one to blame if I can’t get my point across.) Recently I was working on a policy report and we had to conduct an econometric analysis on the basis of a rather small sample. In the end we found a bunch of insignificant results, which is not surprising. In the results section I therefore tried to formulate things defensively. I mentioned that, based on our analysis, we were unable to decide whether the effects are just very small (and therefore statistically compatible with a null hypothesis of zero) or whether the variation in the data, as a result of the low number of observations, is simply too large. Our clients, however, were not happy about this interpretation because they expected a clear statement from our side. Essentially, they wanted us to make a yes-or-no decision, which in some situations is very difficult on purely statistical grounds.
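
The dilemma can be sketched numerically. The snippet below is only an illustration under strong assumptions: a two-sided z-test with known standard deviation, and an invented effect of 0.3 with a sample of 20, none of which comes from the actual report. The helper names are hypothetical. It computes how often such a study would detect a real effect, and how wide the resulting confidence interval is.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(effect, sd, n, alpha=0.05):
    """Probability that a two-sided z-test rejects 'effect = 0'
    when the true effect is `effect` (known-sd simplification)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)        # critical value, about 1.96 for alpha = 0.05
    shift = effect / (sd / sqrt(n))        # true effect in standard-error units
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


def confidence_interval(estimate, sd, n, level=0.95):
    """Normal-theory confidence interval around an estimate."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = crit * sd / sqrt(n)
    return estimate - half_width, estimate + half_width


# Illustrative numbers only: a moderate true effect and a small sample.
print(f"power with n = 20:  {power_two_sided_z(0.3, 1.0, 20):.2f}")
print(f"power with n = 200: {power_two_sided_z(0.3, 1.0, 200):.2f}")

low, high = confidence_interval(estimate=0.3, sd=1.0, n=20)
print(f"95% interval with n = 20: ({low:.2f}, {high:.2f})")
```

With these made-up numbers the small study detects the effect only about a quarter of the time, and its interval covers both zero and effects well above the assumed one. An insignificant result in that setting cannot distinguish "no effect" from "too little data", which is exactly the point that was hard to get across.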

There is another very interesting suggestion in the paper. I wonder how long it will take for the following to become popular:

One need not formally carry out multiple statistical tests for this problem to arise: Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study…
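
Why the number of explored hypotheses matters can be shown with a back-of-the-envelope calculation. This is a sketch under the assumptions that all tests are independent and that every null hypothesis is true; the function name is hypothetical and not from the paper.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one 'significant' result among `num_tests`
    independent tests when no real effect exists anywhere."""
    return 1 - (1 - alpha) ** num_tests


for k in (1, 5, 20, 100):
    print(f"{k:>3} hypotheses explored: "
          f"P(at least one p < 0.05) = {familywise_error_rate(k):.2f}")
```

Under these assumptions, a researcher who explores 20 hypotheses and reports only the significant one has well over a 50% chance of having something to report even if nothing is going on. A reader can only discount the result accordingly if the number of explored hypotheses is disclosed.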