On the misuses of significance tests

Asking whether a difference is statistically significant is one of the most common and recurring questions in analysis in many fields, and in sociology in particular. Think of the discussions about whether the differences between the results of surveys are significant.

For several years now, various criticisms of the use of significance have been developing, along with the claim that a good part of analysts do not really have a clear idea of what a significance test means. In some sense, we use these wretched tests wrongly and, as a result, we generate poor knowledge from that use.

Recently, the American Statistical Association (ASA) decided to issue a formal statement (in this link) on the use of these tests. One of the passages at the beginning of the statement shows how much of our use and knowledge of them is merely ritual:

Q: Why do so many colleges and grad schools teach p = .05?

A: Because that’s still what the scientific community and journal editors use.

Q: Why do so many people still use p = 0.05?

A: Because that’s what they were taught in college or grad school.

The informal definition that the statement gives of the p-value is not particularly clear, but that seems to be a common feature of this wretched parameter:

Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.

Although its author is not a statistician, one of the best and clearest descriptions of what this informal statement involves is in a blog post on Crooked Timber (link here). It uses the example of tossing a coin and getting heads five times in a row. The probability of that is 1/32, which is below the standard threshold of p < .05. The idea, then, is that if the coin is ‘fair’, there is less than a 1 in 20 chance of getting a result equal to (or more extreme than) five heads in a row. The author of the post then draws the following equivalence between that statement and the wording of the ASA:

‘Under a specified statistical model’ = if this coin is ‘fair’
‘The probability that a statistical summary of the data would be equal to or more extreme than’ = there is less than a 1 in 20 chance of getting that result
‘Its observed value’ = the five heads in a row
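
The arithmetic behind the coin example can be checked with a short sketch. It assumes a fair coin (probability 1/2 of heads) and independent tosses; the function name is only illustrative. Since five heads is the most extreme outcome possible in five tosses, the one-sided p-value is simply the probability of that single outcome.

```python
from fractions import Fraction
from math import comb


def p_at_least_k_heads(n, k, p_heads=Fraction(1, 2)):
    """One-sided p-value: P(at least k heads in n tosses) under the model."""
    return sum(
        comb(n, i) * p_heads**i * (1 - p_heads) ** (n - i)
        for i in range(k, n + 1)
    )


p_value = p_at_least_k_heads(5, 5)
print(p_value)                    # 1/32
print(float(p_value))             # 0.03125
print(p_value < Fraction(1, 20))  # True: below the usual .05 threshold
```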

The example shows more clearly what statistical significance is, but it also shows its limits: from the significance of the result (the statement that a fair coin would produce it less than 1 time in 20), few of us would conclude that the coin is loaded. Following a ‘Bayesian’ idea, so to speak, we would say that getting a significant result like this one by chance is more probable than having a loaded coin in the first place. In other words, nothing follows from the rejection of the null hypothesis about the validity of the substantive hypothesis.
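
A minimal sketch of that ‘Bayesian’ reading, using Bayes' rule. The numbers are assumptions made up for illustration and do not come from the post or the ASA statement: a prior of 1 loaded coin in 1,000, and a loaded coin that always lands heads.

```python
def posterior_loaded(prior_loaded, p_data_if_loaded, p_data_if_fair):
    """Bayes' rule: P(loaded | data) from a prior and two likelihoods."""
    numerator = prior_loaded * p_data_if_loaded
    evidence = numerator + (1 - prior_loaded) * p_data_if_fair
    return numerator / evidence


# Hypothetical inputs, chosen only for illustration:
prior = 1 / 1000      # assumed share of loaded coins in circulation
p_if_loaded = 1.0     # assumed: a loaded coin always lands heads
p_if_fair = 1 / 32    # five heads in a row from a fair coin

posterior = posterior_loaded(prior, p_if_loaded, p_if_fair)
print(f"P(five heads | fair coin)   = {p_if_fair:.4f}")
print(f"P(loaded coin | five heads) = {posterior:.4f}")
```

With these assumed numbers the result is ‘significant’, yet the coin remains far more likely to be fair than loaded. A different prior would give a different posterior, which is the point: the p-value alone does not settle the question.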

We quote below some commentators who have emphasized this last point:

One of the most important messages is that the p-value cannot tell you if your hypothesis is correct. Instead, it’s the probability of your data given your hypothesis. That sounds tantalizingly similar to “the probability of your hypothesis given your data,” but they’re not the same thing (Christie Aschwanden in FiveThirtyEight.com, link here)

And now to Andrew Gelman, who has been one of the most important critics of the use of statistical significance in research practice:

Ultimately the problem is not with p-values but with null-hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B (link here)

Gelman goes on to emphasize the difference between statistical hypotheses and substantive scientific hypotheses. And, as we have already seen, the significance test says nothing about the truth of your hypothesis. The principles set out in the ASA statement are clear on this: p-values indicate only how incompatible your data are with a given statistical model. But they do not measure the probability that your substantive hypothesis is true, nor the probability that the process that generated the result is random (Principle 2).

All of the above has consequences for research: the ASA statement tells us that scientific conclusions (or policy recommendations) cannot be based only on the fact that a ‘significant result’ was obtained. And the ASA reminds us of something that is known but often forgotten: statistical significance is not substantive significance, and it says nothing about the importance of the relationship declared significant.

In the discussion that led to the ASA statement, one of the points raised was what could then replace p-values. However, I think Gelman's conclusion is right: we must abandon the idea that there is a single indicator that establishes, by itself, the validity of a result.

From this last point, I want to make some more specific comments on the uses of significance tests in sociology. The discussion we have covered so far applies to many disciplines: p-values and the p < .05 threshold appear in various contexts, and in all of them the misuse of these parameters comes up again. But I think that in our disciplines other elements are added that make this misuse even more critical.

The fact is that, given the way sociologists are taught statistics, the temptation to reduce everything to the single parameter of the significance value is very high. Since statistics is taught to us as a black box, as a set of procedures to follow that we do not really understand (how many sociologists would recognize the function that produces the normal distribution?), it is easier to reduce all the results of an analysis to the single parameter of the significance test. It comes with a ‘clear and definite’ standard: someone who does not understand much of what they are doing can still take any result from any procedure, check whether p is less than 0.05, and then conclude that there is a significant association between such-and-such.

Some of sociology's problems with significance tests concern aspects that are even more basic than the previous ones, and they make the misuse of this parameter even more painful in our discipline.

The first of these concerns the use of the p < .05 threshold as the marker of an important result. This is not unique to our disciplines: one of the topics in the broader debate, and one that was important in generating it, is the unreflective use of p < .05 as the threshold for publication and for important results. Since all research involves considerable freedom in its operations (in the form of analysis, in the specific model, in the procedure, in the variables included, etc.), ‘searching’ for a significant result is a strong temptation. Furthermore, we should remember that a relationship of the same strength may or may not be significant depending on the size of the sample. The pressure to publish creates this temptation to find something significant.
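
Both mechanisms in the previous paragraph can be illustrated with a rough sketch. The first function assumes a two-sample z-test with equal group sizes and a known, equal standard deviation; the second assumes the tests are independent and that no real effect exists. The effect size, sample sizes and function names are hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(effect_in_sd, n_per_group):
    """Two-sided p-value of a z-test for a difference in means.

    Assumes two equal-sized groups with a known, equal standard deviation;
    the effect is expressed in standard-deviation units.
    """
    z = effect_in_sd / sqrt(2 / n_per_group)
    return erfc(abs(z) / sqrt(2))


def chance_of_some_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) over independent tests with no real effect."""
    return 1 - (1 - alpha) ** n_tests


# Same (hypothetical) effect of 0.1 standard deviations, different samples.
for n in (100, 500, 2000):
    print(f"n per group = {n:4d}: p = {two_sided_p(0.1, n):.4f}")

# More analytic choices tried means more chances of a spurious 'finding'.
for k in (1, 5, 20):
    print(f"{k:2d} tests: {chance_of_some_false_positive(k):.2f}")
```

Under these assumptions the same effect is not significant with 100 or 500 cases per group but is with 2,000, and trying 20 independent specifications gives roughly a 64% chance of at least one ‘significant’ result even when nothing is there.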

Although this is a general matter, I believe that in sociology it is reinforced by the acceptance of very low standards for effect size. Think of the countless articles published with models that explain a low percentage of the variance: they are published because the effects are statistically significant, but in reality we remain as ignorant as before. The freedom in modeling has another effect as well: depending on how I model, the number of variables that turn out to be significantly associated with the outcome can be very high. Yet in interpretation and discussion, models that explain between 10% and 14% of the variance are treated as if they had uncovered highly important processes.
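
As a rough illustration of how a model can be highly significant while leaving most of the variance unexplained, the sketch below considers the simplest case of a single predictor. It uses the usual statistic t = r * sqrt((n - 2) / (1 - r^2)) and, as a simplifying assumption, treats it as standard normal, which is reasonable only for large samples. The sample size of 1,000 is hypothetical.

```python
from math import erfc, sqrt


def p_value_for_r_squared(r_squared, n):
    """Approximate two-sided p-value for a single-predictor model.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and treats t as standard normal,
    an approximation that is only reasonable for large n.
    """
    r = sqrt(r_squared)
    t = r * sqrt((n - 2) / (1 - r_squared))
    return erfc(abs(t) / sqrt(2))


n = 1000  # hypothetical sample size
for r2 in (0.01, 0.10, 0.14):
    p = p_value_for_r_squared(r2, n)
    print(f"R^2 = {r2:.2f}: unexplained = {1 - r2:.0%}, p = {p:.2g}")
```

In this setting every one of these models clears the p < .05 bar comfortably, including the one that leaves 99% of the variance unexplained.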

Another element that tends to affect how significance is read, and I have the impression that it is very common in our disciplines, is the confusion between ‘X is associated with Y’ and ‘the X are Y’. Suppose we find that a given group (men, people from the upper stratum, bank workers, etc.) differs significantly from another group on some dimension (say, they read more, or are more intolerant, etc.). From this we conclude, and act, as if men were characterized by readership or by intolerance, and reports are published presenting as a paradigmatic example the man who has that trait. But when one reviews the data, one finds that the significant difference is, for example, 3 or 4 points, and that all the groups have (roughly) the same distribution across the values of the variable. Able to observe only differences, we come to think as if the differences were the value itself.
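
To get a feel for how little a difference of a few points can say about the members of each group, here is a small sketch. All the numbers are assumptions for illustration: two normally distributed groups on a 0 to 100 scale, both with a standard deviation of 20, whose means differ by 4 points.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical groups: a 4-point mean difference on a scale with SD 20.
group_a = NormalDist(mu=54, sigma=20)
group_b = NormalDist(mu=50, sigma=20)

# Share of the two distributions that coincides (1.0 = identical).
overlap = group_a.overlap(group_b)

# Probability that a randomly chosen member of A scores above a randomly
# chosen member of B: the difference of two independent normals is normal.
difference = NormalDist(
    mu=group_a.mean - group_b.mean,
    sigma=sqrt(group_a.variance + group_b.variance),
)
p_a_above_b = 1 - difference.cdf(0)

print(f"overlap  = {overlap:.2f}")
print(f"P(A > B) = {p_a_above_b:.2f}")
```

Under these assumptions the two groups share about 92% of their distribution, and a random member of the ‘higher’ group outscores a random member of the other only about 56% of the time, barely better than a coin toss. With a large enough sample, that same difference would still be declared significant.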

In general, the problems we have mentioned, from the most subtle to the most gross, from those that cut across disciplines to those that are common, to our shame, in our own, stem from a wish to avoid the complexities involved in analyzing data. Reality does not let itself be grasped through a single instrument, much less through the crass and simplistic practice of reducing everything to the question of the significance threshold. To really do a statistical analysis, one has to know and learn about the tools being used, describe the procedures and decisions, and be clear that there is no magical value that simplifies all the complexities of looking at reality.