17 Other nonparametric tests

17.1 Introduction

In this chapter, we discuss different nonparametric tests. These tests can be used when the data is not measured on an interval level of measurement (see Chapter 4), or if the probability distribution of the data deviates from the normal distribution (see §10.5). The nonparametric tests do not make assumptions about the parameters of the probability distribution of the data.

Earlier, we saw that nonparametric correlation coefficients exist, namely Spearman’s rank correlation coefficient (§11.5) and the (nominal) Phi correlation coefficient (§11.6). In the previous chapter, we discussed a widely used nonparametric test, the \(\chi^2\)-test. Below, we look at some other frequently used nonparametric tests. We discuss these in two groups: first for paired observations, and then for unpaired observations from multiple samples. In each group, we first discuss the tests which use information at the nominal level (the sign test and related tests), and then the tests which use information at the ordinal level, i.e. which are based on the rank order of the observed values.

17.2 Paired observations, single sample

17.2.1 Sign test

A handy test for paired observations is the so-called sign test. This test can be viewed as a nonparametric, nominal counterpart of the t-test for paired observations (§13.7).

In this test, we look only at the sign (positive or negative) of the difference \(D\) between the two paired observations. Let us again take the example of an imaginary study on webpages with U (Dutch formal ‘you’) and je (Dutch informal ‘you’) as forms of address, with \(N=10\) respondents. In Table 13.1, we saw that all 10 respondents preferred je: the difference variable \(D\) was \(10\times\) negative and \(0\times\) positive, or put differently, all the outcomes of \(D\) were negative.

With the sign test, we look at how probable this distribution of positive and negative values of \(D\) would be, if H0 were correct. According to H0, we expect \(N/2\) positive and \(N/2\) negative differences; according to H0, the probability of a positive sign of \(D\) (the probability of a hit) is thus \(p=1/2\). We now determine the probability of the observed outcome (0 hits) given H0, and we use the binomial probability distribution for this (§10.2): \[\begin{equation} \tag{17.1} P(0\,\mbox{hits}) = {10 \choose 0} (0.5)^0 (1-0.5)^{10-0} = (1) (1) (0.000977) < 0.001 \end{equation}\] The probability of this outcome according to H0 is so small that, in light of this observed (and presumably valid) outcome, we decide to reject H0, and we report this as follows:

The \(N=10\) respondents unanimously give a lower judgement to the webpage with U as the form of address than to the comparable page with je as the form of address; this is a significant difference (sign test, \(p<.001\)).
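The calculation in equation (17.1) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `binomial_pmf` and the variable names are our own choices and do not come from any statistical package.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k hits in n trials, each with hit probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10          # number of respondents (paired observations)
positives = 0   # observed number of positive differences D

# Probability of the observed outcome (0 hits) under H0, as in equation (17.1)
p_exact = binomial_pmf(positives, n)

# Probability of the observed outcome or a more extreme one in the same direction;
# with 0 hits there is nothing more extreme, so both probabilities coincide
p_one_sided = sum(binomial_pmf(k, n) for k in range(0, positives + 1))

print(f"P(0 hits) = {p_exact:.6f}")
```

With a less extreme outcome (say 2 positive differences out of 10), `p_one_sided` would also add up the probabilities of 0 and 1 hits.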

17.2.2 Wilcoxon signed-ranks test

The Wilcoxon signed-ranks test can be viewed as a nonparametric, ordinal counterpart of the t-test for paired observations (§13.7).

This test makes use of the rank order of the difference \(D\) between the two paired observations. We will again use the example of the imaginary study on webpages with U or je as forms of address (Table 13.1), but will now rank the absolute differences \(|D|\) (equal differences from several participants share the mean of their rankings), and attach the sign (positive or negative) of the difference \(D\) to each ranking: \[-2, -2, -8, -5, -8, -5, -10, -8, -5, -2\] The sum of the positive rankings is \(W_+=0\) (there are no positive rankings) and the sum of the negative rankings is \(W_-=-55\), so that \(|W_-|=55\); the \(N=10\) rankings together must indeed add up to \(N(N+1)/2=55\). The smaller of these two sums (\(W_+\) or \(|W_-|\)) forms the test statistic; here, that is \(W_+=0\). We will not discuss the probability distribution of the test statistic but instead have the significance calculated by computer: \(p=.006\). The probability of this outcome according to H0 is so small that, in light of this observed (and presumably valid) outcome, we again decide to reject H0.

The (ordinal) Wilcoxon signed-ranks test makes use of more information than the (nominal) sign test, and the t-test in turn makes use of more information than the Wilcoxon signed-ranks test. As a rule of thumb, a test which uses more information is more sensitive. If an effect is significant according to the sign test, as is the case in this example, then it will usually also be significant according to the Wilcoxon signed-ranks test. If an effect is significant in the Wilcoxon signed-ranks test, then it will usually also be significant according to the t-test, provided that the assumptions of the t-test are met. This is a tendency and not a guarantee. It has to do with the level of measurement: the sign test considers only the (nominal) sign of the differences, the Wilcoxon signed-ranks test is based on the (ordinal) ranking of the differences, and the t-test is based on the (interval) size of the differences.

17.2.2.1 formulas

We not only calculate \(W_+\) (or \(|W_-|\)) in the aforementioned manner, but also the corresponding value of \(z\) (Ferguson and Takane 1989): \[\begin{equation} \tag{17.2} z = \frac{ W_+ - \frac{N(N+1)}{4} } { \sqrt{ \frac{N(N+1)(2N+1)}{24} } } \end{equation}\] With this, we can calculate the effect size, in the form of a correlation (Rosenthal 1991, Eq.2.18): \[\begin{equation} \tag{17.3} r = \frac {z} {\sqrt{N}} \end{equation}\] For the example above, we find \(z=-2.80\), and \(r=-.89\), which indicates an extremely large effect.
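Equations (17.2) and (17.3) can be checked with a short Python sketch. The function names `signed_rank_z` and `effect_size_r` are hypothetical helpers introduced here for illustration; the sketch applies the formulas as given, without a correction for ties or continuity.

```python
from math import sqrt

def signed_rank_z(w_plus, n):
    """z value for the sum of positive rankings W+, following equation (17.2)."""
    mean_w = n * (n + 1) / 4                       # expected W+ under H0
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)    # standard deviation of W+ under H0
    return (w_plus - mean_w) / sd_w

def effect_size_r(z, n):
    """Effect size r = z / sqrt(N), following equation (17.3)."""
    return z / sqrt(n)

# Example from the text: N = 10 pairs, no positive rankings (W+ = 0)
z = signed_rank_z(0, 10)
r = effect_size_r(z, 10)

print(f"z = {z:.2f}, r = {r:.2f}")
```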

17.3 Independent observations, multiple samples

17.3.1 Median test

The median test can be viewed as a nonparametric, nominal counterpart of the t-test for unpaired, independent observations. It is actually a sign test (see §17.2.1), in which we test whether the distribution of observations above/below their joint median (see §9.3.2 for explanation about the median) deviates from the expected distribution according to H0. The null hypothesis H0 is that the distributions of the two samples do not differ from each other, and that approximately half of the observations in both samples lie above the joint median, and the other half lies below it.
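The counting step of the median test can be illustrated with the judgements from Table 17.1 (see §17.3.2). The sketch below only builds the table of observations above and not above the joint median; it does not carry out the significance test itself. Counting observations equal to the joint median as "not above" is an assumption made here for illustration, and the helper `split_at` is a hypothetical name.

```python
from statistics import median

# Judgements from Table 17.1
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

# Median of both samples taken together
joint_median = median(original + rewritten)

def split_at(values, cut):
    """Return (number above cut, number not above cut) for one sample."""
    above = sum(v > cut for v in values)
    return above, len(values) - above

# Assumption: a value equal to the joint median is counted as 'not above'
table = {
    "original": split_at(original, joint_median),
    "rewritten": split_at(rewritten, joint_median),
}

for condition, (above, not_above) in table.items():
    print(f"{condition}: {above} above, {not_above} not above the joint median")
```

Under H0 we would expect roughly equal proportions above the joint median in both conditions; the observed counts are the input for the actual test.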

17.3.2 Wilcoxon rank sum test, or Mann-Whitney U test

The Wilcoxon rank sum test is equivalent to the Mann-Whitney U test. Both can be viewed as nonparametric, ordinal counterparts of the t-test for unpaired, independent observations (§13.6).

Let us say that we want to investigate whether certain text attributes have an influence on the subjective appreciation of the text. For this, a researcher selects a random sample of participants from the population (see §7.3), and assigns these participants in a random manner to two experimental conditions (randomisation, see §5.4, point 5). In the first condition, the participants give a judgement about the original version of a text. In the second condition, the participants give a judgement about the rewritten version of the same text. The higher the given score, the higher the valuation for the text. One of the participants unfortunately had to leave the study prematurely. The judgements of the remaining 19 participants are in Table 17.1. On the basis of the random sample and the random assignment of participants to conditions, the judgements can be seen as coming from two different random samples. The null hypothesis is that there is no difference in valuation between the two conditions.

Table 17.1: Judgements of \(N=19\) participants on the original and rewritten versions of a text.
Condition    Judgements
Original     10 17 35 2 19 4 18 28 24
Rewritten    15 22 8 48 29 25 27 39 31 36

The Wilcoxon rank sum test is based on the ranking of the observations. Each observation is replaced by the ranking of that observation, taken over the two conditions together. The lowest or smallest value gets ranking 1. We indicate the sum of the rankings of the smaller group (here: of the original condition) with \(W_1\). The probability distribution of \(W\) under H0 is known (exactly for small \(n_1\) and \(n_2\), and approximately for larger samples). With this, we can determine the probability of encountering the observed value of \(W_1\), or a more extreme value, if H0 is true.

Earlier, we saw that the t-test for unpaired observations (§13.6) investigates whether the means are different for two samples. Analogously, the Wilcoxon rank sum test (and the Mann-Whitney \(U\) test) investigates whether the medians are different for the two samples. The test is thus more robust against outliers: if we were to replace the highest judgement (48) with a much higher judgement (say 480), then that observation would keep the highest ranking, so the change would have no influence on the median of that group, nor on the test statistic or its significance.

For our example, we find that the lower rankings occur relatively frequently in the first condition (original version), i.e. that the text in this condition received lower judgements. The sum of the rankings for this smaller group is the test statistic \(W_1=67\). In some versions of the test [1], this raw sum is used to calculate the significance. In other test versions [2], this raw sum is firstly corrected for the minimal value of \(W_1\) (see the formulas below): the test statistic is then \(U=W_1 - \textrm{min}(W_1) = 67-45=22\). Afterwards, the significance of \(W_1=67\) or of \(U=22\) is calculated. We find that \(p=.07\). If we do a two-sided test (H0: judgements in condition 2 are no higher and no lower than those in condition 1) with \(\alpha=.05\), then there is no reason to reject H0 [3].
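The ranking procedure and the two versions of the test statistic can be reproduced with the following Python sketch, using the judgements from Table 17.1. The helper `midranks` is a hypothetical name; assigning tied values the mean of their rankings is a common convention that we assume here, although this example contains no ties.

```python
# Judgements from Table 17.1
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

def midranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their rankings."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1     # lowest ranking occupied by this value
        count = ordered.count(v)         # number of tied observations
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]

# Rank all observations over the two conditions together
ranks = midranks(original + rewritten)

n1 = len(original)                 # size of the smaller group
w1 = sum(ranks[:n1])               # sum of rankings in the original condition
w1_min = n1 * (n1 + 1) / 2         # minimal possible value of W1
u = w1 - w1_min                    # statistic corrected for the minimum

print(f"W1 = {w1:g}, min(W1) = {w1_min:g}, U = {u:g}")
```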

17.3.2.1 formulas

For the sums of the rankings, it is the case that \(W_1 + W_2 = (n_1+n_2) (n_1+n_2+1) / 2\).

If all the lowest rankings (i.e. all the lowest judgements) are in the smaller (first) condition, then \(W_1\) has the minimal value of \(n_1 (n_1+1) /2\). If all the highest rankings (i.e. all the highest judgements) are in this condition, then \(W_1\) has the maximum value of \(n_1 (n_1+2n_2+1) / 2\). For the example above, with \(n_1=9\) and \(n_2=10\), the minimum is 45 and the maximum is 135. In the absence of ties, \(W_1\) can only be an integer number; its minimum and maximum are always integer numbers.

It is useful to not only calculate \(W_1\) or \(U\), but also the corresponding value of \(z\) (Ferguson and Takane 1989): \[\begin{equation} \tag{17.4} \bar{W_1} = \frac{ n_1 (n_1+n_2+1) }{ 2 } \end{equation}\] \[\begin{equation} \tag{17.5} z = \frac{ |W_1-\bar{W_1}|-\frac{1}{2} }{ \sqrt{ \frac{n_1 n_2 (n_1+n_2+1)}{12} } } \end{equation}\]

With this, we again determine the effect size, using equation (17.3). For the above example, we find \(\bar{W_1}=90\), so that the numerator of equation (17.5) is \(|67-90|-\frac{1}{2}=22.5\); this yields \(z=1.84\) and \(r=.42\), which indicates a ‘medium’ effect. That this considerable effect still does not lead to a significant difference (with two-sided testing) is presumably a consequence of the (too) small size of the two groups.
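As a check on equations (17.4), (17.5) and (17.3), the example can be worked through in Python. The function name `rank_sum_z` is a hypothetical helper; the sketch follows the formulas as given above, including the continuity correction of \(\frac{1}{2}\), and ignores any correction for ties.

```python
from math import sqrt

def rank_sum_z(w1, n1, n2):
    """z value for the rank sum W1, following equations (17.4) and (17.5)."""
    mean_w1 = n1 * (n1 + n2 + 1) / 2              # expected W1 under H0, equation (17.4)
    sd_w1 = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # standard deviation of W1 under H0
    return (abs(w1 - mean_w1) - 0.5) / sd_w1      # with continuity correction of 1/2

# Example from the text: n1 = 9, n2 = 10, W1 = 67
n1, n2, w1 = 9, 10, 67
mean_w1 = n1 * (n1 + n2 + 1) / 2
z = rank_sum_z(w1, n1, n2)
r = z / sqrt(n1 + n2)                             # effect size, equation (17.3)

print(f"mean W1 = {mean_w1:g}, z = {z:.2f}, r = {r:.2f}")
```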

17.3.3 Kruskal-Wallis H test

The Kruskal-Wallis H test can be viewed as an expansion of the Wilcoxon rank sum test (see §17.3.2 above), for \(k \ge 3\) independent samples or groups. The test can also be used to compare \(k=2\) groups; in this case, the test is completely equivalent to the Wilcoxon rank sum test above. The Kruskal-Wallis H test can be viewed as the nonparametric, ordinal counterpart of a one-way analysis of variance (see §15.3.1). Put loosely: we carry out a kind of variance analysis, not on the observations themselves but on the rankings of the observations. We calculate \(H\) as the test statistic based on the rankings of the observations in the \(k\) different groups.

17.3.3.1 formula

\[\begin{equation} \tag{17.6} H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \left( \frac{R^2_j}{n_j} \right) - 3(N+1) \end{equation}\] where \(R_j\) refers to the sum of the rankings of the observations in group \(j\), \(n_j\) refers to the size of group \(j\), and \(N=\sum_j n_j\) is the total number of observations. (For convenience, we disregard ‘ties’, which are instances in which the same value, and hence the same ranking, occurs more than once.)
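Equation (17.6) can be sketched in Python as follows. Because the text gives no worked example with three or more groups, we apply the formula to the \(k=2\) groups of Table 17.1 purely as an illustration; the function name `kruskal_wallis_h` is hypothetical, and the sketch assumes that there are no ties, as in the formula above.

```python
def kruskal_wallis_h(groups):
    """Test statistic H of equation (17.6); assumes there are no tied values."""
    pooled = sorted(v for g in groups for v in g)
    if len(set(pooled)) != len(pooled):
        raise ValueError("tied values: equation (17.6) without tie correction does not apply")
    # Without ties, the ranking of a value is its position in the sorted pooled data
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}
    n_total = len(pooled)
    # Sum over groups of (squared rank sum R_j) / (group size n_j)
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

# Judgements from Table 17.1, used here as k = 2 groups
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

h = kruskal_wallis_h([original, rewritten])
print(f"H = {h:.3f}")
```

For these two groups, \(H\) equals the square of the \(z\) value of the Wilcoxon rank sum test when that \(z\) is computed without the continuity correction of \(\frac{1}{2}\), i.e. \((67-90)^2/150\), which illustrates the equivalence of the two tests for \(k=2\).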

The test statistic \(H\) has a probability distribution which approximates that of \(\chi^2\), with \(k-1\) degrees of freedom. The significance of the test statistic \(H\) is thus determined via the probability distribution of \(\chi^2\) (see Appendix D). This approximation via \(\chi^2\), however, is only adequate if \(k\ge3\) and \(n_j\ge5\) for the smallest group (Ferguson and Takane 1989). If \(k=2\) or \(n_j<5\), then the probability \(P(H)\) is calculated exactly.

References

Ferguson, George A., and Yoshio Takane. 1989. Statistical Analysis in Psychology and Education. 6th ed. New York: McGraw-Hill.
Rosenthal, Robert. 1991. Meta-Analytic Procedures for Social Research. 2nd ed. Newbury Park, CA: Sage.

  1. Wilcoxon rank sum test in SPSS.

  2. Mann-Whitney test in SPSS and in R, and Wilcoxon rank sum test in R.

  3. If we do a two-sided test with \(\alpha=.10\), then we could indeed reject H0. If we do a one-sided test (H0: judgements in condition 2 are not higher than in condition 1), then we may halve the calculated \(p\), since the calculated \(p\) assumes two-sided testing. We would then find \(p=.0653/2=.0326\), and, as this probability is smaller than \(\alpha=.05\), we would then indeed be able to reject H0.