# 17 Other nonparametric tests

## 17.1 Introduction

In this chapter, we discuss different nonparametric tests. These tests can be used when the data is not measured on an interval level of measurement (see Chapter 4), or if the probability distribution of the data deviates from the normal distribution (see §10.5). The nonparametric tests do not make assumptions about the parameters of the probability distribution of the data.

Earlier, we already saw that nonparametric correlation coefficients exist, namely Spearman’s rank correlation coefficient (§11.5) and the (nominal) Phi correlation coefficient (§11.6). In the previous chapter, we discussed a widely used nonparametric test, the \(\chi^2\)-test. Below, we will look at some other frequently used nonparametric tests. We discuss these in two groups: first for paired observations, and then for unpaired observations from multiple samples. In each subsection, we first discuss the tests which use information at the nominal level (sign tests and related tests) and then the tests which use information at the ordinal level, i.e. which are based on the rank order of the observed values.

## 17.2 Paired observations, single sample

### 17.2.1 Sign test

A handy test for paired observations is the so-called sign test.
This test can be viewed as a nonparametric, nominal counterpart of
the *t*-test for paired observations
(§13.7).

In this test, we look only at the *sign* (positive or negative)
of the *difference* \(D\) between the two paired observations. Let us again
take the example of an imaginary study on webpages with
*U* (Dutch formal ‘you’) and *je* (Dutch informal ‘you’) as forms of address, with
\(N=10\) respondents. In Table 13.1, we saw
that all 10 respondents preferred *je*: the difference variable \(D\) was
\(10\times\) negative and \(0\times\) positive, or put differently, all the outcomes
of \(D\) were negative.

With the sign test, we look at how probable this distribution of positive and negative values of \(D\) would be, if H0 were correct. According to H0, we expect \(N/2\) positive and \(N/2\) negative differences; according to H0, the probability of a positive sign of \(D\) (the probability of a hit) is thus \(p=1/2\). We now determine the probability of the observed outcome (0 hits) given H0, and we use the binomial probability distribution for this (§10.2):
\[\begin{equation}
\tag{17.1}
P(0\,\mbox{hits}) = {10 \choose 0} (0.5)^0 (1-0.5)^{10-0} = (1) (1) (0.000976) < 0.001
\end{equation}\]
The probability of this outcome according to H0 is so small that, in light of this observed (and presumably valid) outcome, we decide to reject H0, and we report this as follows:

The \(N=10\) respondents unanimously give a lower judgement to the webpage with *U* as the form of address than to the comparable page with *je* as the form of address; this is a significant difference (sign test, \(p<.001\)).
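The binomial probability in equation (17.1) can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `binomial_pmf` and the variable names are illustrative assumptions, not part of the original text.

```python
from math import comb

def binomial_pmf(hits: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly `hits` successes in `n` trials with success probability `p`."""
    return comb(n, hits) * p**hits * (1 - p) ** (n - hits)

n_pairs = 10      # number of paired observations (N in the text)
n_positive = 0    # number of positive differences D that were observed

# Probability of the observed outcome under H0 (p = 1/2), as in equation (17.1)
p_observed = binomial_pmf(n_positive, n_pairs)
print(f"P({n_positive} hits) = {p_observed:.6f}")
```

The same function can be summed over several values of `hits` if the probability of an outcome "at least as extreme" is wanted for a less lopsided result.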

### 17.2.2 Wilcoxon signed-ranks test

The Wilcoxon signed-ranks test can be viewed as a
nonparametric, ordinal counterpart of the *t*-test for paired
observations (§13.7).

This test makes use of the *rank order* of the difference \(D\) between
the two paired observations. We will again use the example of the imaginary
study on webpages with *U* or *je* as forms of
address
(Table 13.1), but will now look at the *rank order*
of the differences \(D\) (taking into account equal differences from several
participants), and indicate the sign (positive or
negative) of the difference \(D\):
\[-2, -2, -8, -5, -8, -5, -10, -8, -5, -2\]
The sum of the positive rankings is \(W_+=0\) (there are no positive rankings)
and the sum of the negative rankings \(W_-= -55\), and with that \(|W_-|=55\).
The smaller of
these two sums (\(W_+\) or \(|W_-|\)) forms the test statistic; here, that is
\(W_+=0\).
We will not discuss the probability distribution of the test statistic
but instead have the significance calculated by computer: \(p=.0055\). The
probability of this outcome according to H0 is so small that, in light of this
observed (and presumably valid)
outcome, we again decide to reject H0.
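The ranking step can be sketched in code as follows. The difference scores below are hypothetical values chosen only to show how tied differences share a mean rank; they are not the data of Table 13.1, and the helper name `midranks` is an assumption made for this illustration. Differences of exactly zero are not handled in this sketch.

```python
def midranks(values):
    """Rank values from small to large; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run for as long as the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1   # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

# Hypothetical difference scores D (illustration only)
d = [-1, 2, -3, -3, 5]

ranks = midranks([abs(x) for x in d])                  # rank the absolute differences
w_plus = sum(r for r, x in zip(ranks, d) if x > 0)     # sum of ranks of positive differences
w_minus = sum(r for r, x in zip(ranks, d) if x < 0)    # absolute sum of ranks of negative differences
print(ranks, w_plus, w_minus)
```

Whatever the data, `w_plus + w_minus` equals \(N(N+1)/2\) when there are no zero differences, which is a convenient check on the ranking.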

The (ordinal) Wilcoxon signed-ranks test makes use of more information
than the (nominal) sign test. If an effect is significant according to the sign
test, as is the case in this example, then it is usually also significant
according to the Wilcoxon signed-ranks test. If an effect is significant
in the Wilcoxon signed-ranks test, then it is usually also significant according
to the *t*-test. This has to do with the level of measurement: the sign test
considers only the (nominal) *sign* of the differences, the
Wilcoxon signed-ranks test is based on the (ordinal) *ranking* of the
differences, and the *t*-test is based on the (interval) *size* of the
differences.

#### 17.2.2.1 formulas

We not only calculate \(W_+\) (or \(|W_-|\)) in the aforementioned manner, but also the corresponding value of \(z\) (Ferguson and Takane 1989):
\[\begin{equation}
\tag{17.2}
z = \frac{ W_+ - \frac{N(N+1)}{4} } { \sqrt{ \frac{N(N+1)(2N+1)}{24} } }
\end{equation}\]
With this, we can calculate the effect size, in the form of a correlation (Rosenthal 1991, Eq. 2.18):
\[\begin{equation}
\tag{17.3}
r = \frac {z} {\sqrt{N}}
\end{equation}\]
For the example above, we find \(z=-2.803\), and \(r=-.89\), which indicates an extremely large effect.
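As a check on these two values, equations (17.2) and (17.3) can be evaluated directly for \(W_+=0\) and \(N=10\). This is a minimal sketch; the function name `signed_rank_z` is an assumption introduced here for illustration.

```python
from math import sqrt

def signed_rank_z(w_plus: float, n: int) -> float:
    """z value for the signed-ranks statistic W+, following equation (17.2)."""
    expected = n * (n + 1) / 4                    # expected value of W+ under H0
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)     # standard deviation of W+ under H0
    return (w_plus - expected) / sd

n = 10                       # number of paired observations
z = signed_rank_z(0, n)      # W+ = 0 in the example
r = z / sqrt(n)              # effect size, equation (17.3)
print(round(z, 3), round(r, 2))
```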

## 17.3 Independent observations, multiple samples

### 17.3.1 Median test

The median test can be viewed as a nonparametric, nominal
counterpart
of the *t*-test for unpaired, independent observations. It is actually
a sign test (see §17.2.1), in which we test whether the
distribution of observations above/below their *joint* median
(see §9.3.2 for explanation about the median) deviates from
the expected distribution according to H0.
The null hypothesis H0 is that the distributions of the two samples
do not differ from each other, and that approximately half of the observations
in both samples lie above the joint median, and the other half lies below
it.
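The counting step of the median test can be sketched as follows, using the judgements of Table 17.1 (§17.3.2) purely as illustration. Treating values equal to the joint median as "not above" is an assumption of this sketch, as are the names `split_at` and `counts`; the resulting 2×2 table of counts would still have to be tested, for instance with the \(\chi^2\)-test of the previous chapter.

```python
from statistics import median

# Judgements from Table 17.1
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

# Median of all observations, taken over both samples together
joint_median = median(original + rewritten)

def split_at(values, cut):
    """Return (number of values above cut, number of values not above cut)."""
    above = sum(v > cut for v in values)
    return above, len(values) - above

counts = {
    "original": split_at(original, joint_median),
    "rewritten": split_at(rewritten, joint_median),
}
print(joint_median, counts)
```

Under H0, roughly half of the observations in each sample would be expected above the joint median.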

### 17.3.2 Wilcoxon rank sum test, or Mann-Whitney U test

The Wilcoxon rank sum test is equivalent to the Mann-Whitney U test.
Both can be viewed as nonparametric, ordinal counterparts
of the *t*-test for unpaired, independent observations
(§13.6).

Let us say that we want to investigate whether certain text attributes
have an influence on the subjective appreciation of the text. For this,
a researcher selects a random sample of participants
from the population (see
§7.3), and assigns these participants in a random
manner to two experimental conditions (randomisation, see
§5.4, point 5).

In the first condition, the participant has to give a judgement about
the original version of a text. In the second condition, the participants
give a judgement about the rewritten version of the same text.
The higher the given score, the higher the valuation for the text.
One of the participants unfortunately had to leave the study
prematurely. The judgements of the remaining 19 participants
are in Table 17.1. On the basis of the random
sample and the random assignment of participants to conditions,
the judgements can be seen as coming from two different
random samples. The null hypothesis is that there is no difference
in valuation between the two conditions.

| Condition | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 10 | 17 | 35 | 2 | 19 | 4 | 18 | 28 | 24 | – |
| Rewritten | 15 | 22 | 8 | 48 | 29 | 25 | 27 | 39 | 31 | 36 |

The Wilcoxon rank sum test is based on the *ranking* of the
observations. Each observation is replaced by the ranking of
that observation, taken over the two conditions together. The lowest or
smallest value gets ranking 1. We indicate the sum of the rankings of the
smaller group (here: the original condition) with
\(W_1\). The probability distribution of \(W_1\) under H0 is known (exactly for small
\(n_1\) and \(n_2\), and approximately for larger samples). With this, we can
determine the probability of encountering the observed value of \(W_1\),
or a more extreme value, if H0 is true.

Earlier, we saw that the *t*-test for unpaired observations
(§13.6) investigates whether the *means*
are different for two samples. Analogously, the Wilcoxon
rank sum test (and the Mann-Whitney \(U\) test) investigates whether the
*medians* are different for the two samples. The test is thus more
robust against outliers: if we were to replace the highest judgement (48)
with a much higher judgement (say 480), then that would have no influence
on the median of that group, nor on the test statistic or its
significance.

For our example, we find that the lower rankings occur relatively frequently
in the first condition (original version), i.e. that the text in this condition
received lower judgements. The sum of the rankings for this
smaller group is the test statistic \(W_1=67\). In some versions of the
test^{42}, this raw sum is used to calculate the significance.
In other test versions^{43}, this raw sum is firstly corrected for the
minimal value of \(W_1\) (see the formulas below): the test statistic
is then \(U=W_1 - \textrm{min}(W_1) = 67-45=22\). Afterwards, the significance
of \(W_1=67\) or of \(U=22\) is calculated. We find that \(p=.0653\). If we do a
two-sided test (H0: judgements in condition 2 are no higher and no lower
than those in condition 1) with \(\alpha=.05\), then there is no reason to
reject H0^{44}.
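The two test statistics can be reproduced from the judgements in Table 17.1 with a short sketch. The helper `rank_sum` is an illustrative assumption and only works when there are no tied values, which is the case for these data. The sketch stops at the test statistics; the significance itself would come from a statistics package (for example `scipy.stats.mannwhitneyu` in Python).

```python
# Judgements from Table 17.1
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

def rank_sum(group, other):
    """Sum of the ranks of `group` within the pooled data (assumes no tied values)."""
    pooled = sorted(group + other)
    return sum(pooled.index(v) + 1 for v in group)   # rank 1 = smallest value

n1 = len(original)
w1 = rank_sum(original, rewritten)     # Wilcoxon rank sum of the smaller group
w1_min = n1 * (n1 + 1) // 2            # smallest possible value of W1
u = w1 - w1_min                        # Mann-Whitney U

# Robustness check: replacing the highest judgement (48) by 480 leaves the ranks unchanged
rewritten_outlier = [480 if v == 48 else v for v in rewritten]
w1_outlier = rank_sum(original, rewritten_outlier)
print(w1, u, w1_outlier)
```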

#### 17.3.2.1 formulas

For the sums of the rankings, it is the case that \(W_1 + W_2 = (n_1+n_2) (n_1+n_2+1) / 2\).

If all the lowest rankings (i.e. all lowest judgements) are in the smaller (first) group, then \(W_1\) has the minimal value of \(n_1 (n_1+1) /2\). If all the highest rankings (i.e. all the highest judgements) are in this group, then \(W_1\) has the maximum value of \(n_1 (n_1+2n_2+1) / 2\). In the absence of ties, \(W_1\) can only take integer values, as can its minimum and maximum.

It is useful to not only calculate \(W_1\) or \(U\), but also the corresponding value of \(z\) (Ferguson and Takane 1989):
\[\begin{equation}
\tag{17.4}
\bar{W_1} = \frac{ n_1 (n_1+n_2+1) }{ 2 }
\end{equation}\]
\[\begin{equation}
\tag{17.5}
z = \frac{ |W_1-\bar{W_1}|-\frac{1}{2} }{ \sqrt{ \frac{n_1 n_2 (n_1+n_2+1)}{12} } }
\end{equation}\]

With this, we again determine the effect size, using equation (17.3). For the above example, we find \(\bar{W_1}=90\), \(z=1.837\), and \(r=.42\), which indicates a ‘medium’ effect. That this considerable effect still does not lead to a significant difference (with two-sided testing) is presumably a consequence of the (too) small size of the two groups.
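These values can be checked by evaluating equations (17.4), (17.5) and (17.3) for \(W_1=67\), \(n_1=9\) and \(n_2=10\). A minimal sketch, with variable names chosen here for illustration:

```python
from math import sqrt

n1, n2 = 9, 10    # sizes of the original and rewritten groups
w1 = 67           # rank sum of the smaller (original) group

w1_expected = n1 * (n1 + n2 + 1) / 2          # equation (17.4): expected value of W1 under H0
sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)       # standard deviation of W1 under H0
z = (abs(w1 - w1_expected) - 0.5) / sd        # equation (17.5), with continuity correction
r = z / sqrt(n1 + n2)                         # equation (17.3), with N = n1 + n2
print(w1_expected, round(z, 3), round(r, 2))
```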

### 17.3.3 Kruskal-Wallis H test

The Kruskal-Wallis H test can be viewed as an expansion of the Wilcoxon rank sum test (see §17.3.2 above), for \(k \ge 2\) independent samples or groups. The test can also be used to compare \(k=2\) groups; in this case, the test is completely equivalent to the Wilcoxon rank sum test above. The Kruskal-Wallis H test can be viewed as the nonparametric, ordinal counterpart of a one-way analysis of variance (see §15.3.1). Put loosely: we carry out a kind of analysis of variance, not on the observations themselves but on the rankings of the observations. We calculate \(H\) as the test statistic based on the rankings of the observations in the \(k\) different groups.

#### 17.3.3.1 formula

\[\begin{equation}
\tag{17.6}
H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1)
\end{equation}\]
where \(R_j\) refers to the *sum* of the rankings of the observations
in group \(j\), \(n_j\) refers to the size of group \(j\), and
\(N = \sum_j n_j\) is the total number of observations.
(For convenience, we disregard ‘ties’,
which are instances in which the same value and ranking occurs in
multiple groups.)
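Equation (17.6) can be sketched in code as follows. The example applies it to the two groups of Table 17.1 (so \(k=2\)), simply because those are the only data given in this chapter; the function name `kruskal_wallis_h` is an assumption for this illustration, and the ranking used here is only valid when there are no tied values.

```python
# Judgements from Table 17.1, used here as k = 2 groups
original = [10, 17, 35, 2, 19, 4, 18, 28, 24]
rewritten = [15, 22, 8, 48, 29, 25, 27, 39, 31, 36]

def kruskal_wallis_h(groups):
    """Test statistic H of equation (17.6); assumes there are no tied values."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)                                    # N, total number of observations
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}       # rank 1 = smallest value
    # sum over groups of (rank sum R_j squared) divided by group size n_j
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

h = kruskal_wallis_h([original, rewritten])
print(round(h, 4))
```

The same function accepts any number of groups, so a third or fourth condition could be passed in the list without further changes.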

The test statistic \(H\) has a probability distribution which resembles that of \(\chi^2\), with \(k-1\) degrees of freedom. The significance of the test statistic \(H\) is thus determined via the probability distribution of \(\chi^2\) (see Appendix D). This approximation via \(\chi^2\) however only works if \(k\ge3\) and \(n_j\ge5\) for the smallest group (Ferguson and Takane 1989). If \(k=2\) or \(n_j<5\) then the probability \(P(H)\) is calculated exactly.

### References

Ferguson, George A., and Yoshio Takane. 1989. *Statistical Analysis in Psychology and Education*. 6th ed. New York: McGraw-Hill.

Rosenthal, Robert. 1991. *Meta-Analytic Procedures for Social Research*. 2nd ed. Newbury Park, CA: Sage.

42. Wilcoxon rank sum test in SPSS.

43. Mann-Whitney test in SPSS and in R, and Wilcoxon rank sum test in R.

44. If we do a two-sided test with \(\alpha=.10\), then we could indeed reject H0. If we do a one-sided test (H0: judgements in condition 2 are not higher than in condition 1), then we may halve the calculated \(p\), since the calculated \(p\) assumes two-sided testing. We would then find \(p=.0653/2=.0326\), and, as this probability is smaller than \(\alpha=.05\), we would then indeed be able to reject H0.