# 16 Chi-square tests

## 16.1 Introduction

Earlier, we already saw that we cannot always make use of a
parametric test such as the *t* test or analysis of variance, because
the collected data do not satisfy the assumptions. If the collected data
have not been measured on an interval level of measurement (see Chapter
4), or if the probability distribution
of the data is far from normal (see
§10.5), then a non-parametric test is to be
preferred over such a parametric test. If the collected data
do satisfy the assumptions for a parametric test, then a non-parametric
test is less sensitive (more conservative) than a parametric test, i.e. the
non-parametric test requires a larger effect and/or a larger sample, and generally
has less power than a parametric test when seeking out an effect
(see Chapter 14).

In this chapter, we discuss the most widely used non-parametric test: the so-called \(\chi^2\) test, pronounced as “chi-square test” (with the Greek letter “chi”).

## 16.2 \(\chi^2\) test for “goodness of fit” in single sample

Data of nominal level of measurement are often analysed with the \(\chi^2\) test. The number of dots on a die is an example of a dependent variable of nominal level of measurement: there is no physical ordering between the six sides, and each side of a die has an equally high probability of appearing on top. Imagine we throw a die \(60\times\), and find the following frequencies of the six possible outcomes: \(14, 9, 11, 10, 15, 1\). This can be considered to be a sample of \(n=60\) throws from an infinite population of possible throws, and the outcome frequencies reported here can be seen as a contingency table of 1 row and 6 columns (i.e. 6 cells). How high is the probability of this distribution of outcomes? Is the die indeed honest?

The \(\chi^2\) test is based on the differences between the expected
and observed frequencies. According to the null hypothesis (H0: the die is honest),
we expect 10 outcomes in each cell (\(60/6=10\)), i.e. the
expected frequency is identical for each cell (this is called
a *uniform* distribution).
The observed outcomes deviate from the expected frequencies of outcomes,
in particular because the outcome “six” barely occurs in this sample. This
might of course also have happened by chance. The \(\chi^2\) test indicates how high the probability is of this uneven distribution of outcomes (or an even more uneven distribution),
if H0 is true.
The expected outcomes are thus deduced from a distribution of the outcomes
according to H0, and we investigate how well the observed outcomes
fit the expected outcomes. This form of the \(\chi^2\) test is thus also
referred to as a test of the ‘goodness of fit.’

For this example, we find the test outcome \(\chi^2=12.4\)
with 5 degrees of freedom (see
§13.2.1 for explanation about
degrees of freedom), with \(p=.0297\). We usually use the computer to
calculate this probability value, but we can also estimate this probability
via a table with critical
\(\chi^2\)-values (see Appendix D, and footnote).^{39}
If H0 is true, then we have only 3% probability
of finding this outcome (or an even more uneven distribution of outcomes).
The significance \(p\) found is smaller than \(\alpha=.05\), and we thus
reject H0. We conclude that this die is not honest: the distribution
of outcomes found deviates significantly from the expected
distribution according to H0.
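The calculation described above can be retraced by hand in R. The following is a minimal sketch, assuming only the observed frequencies given in the text; the variable names (`observed`, `expected`, `chi2`, `critical`) are chosen here for illustration.

```r
# Observed frequencies of the six outcomes in n = 60 throws (from the text)
observed <- c(14, 9, 11, 10, 15, 1)

# Expected frequencies under H0 (honest die): the same value for all 6 cells
expected <- rep(sum(observed) / length(observed), length(observed))

# Test statistic: sum of (O - E)^2 / E over all cells
chi2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of cells minus 1
df <- length(observed) - 1

# Probability of this or a larger chi-square value if H0 is true
p <- pchisq(chi2, df = df, lower.tail = FALSE)

# Critical value for alpha = .05, as one would look it up in a table
critical <- qchisq(0.95, df = df)

print(c(chi2 = chi2, df = df, p = p, critical = critical))
```

Because `chi2` exceeds `critical`, the table look-up and the computed probability lead to the same decision about H0.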

## 16.3 \(\chi^2\) test for homogeneity of a variable in multiple samples

The \(\chi^2\) test can also be used for a research design with *one* nominal
variable which we have observed in two or more samples. The
question is then whether the distribution of the observations over the
categories is equal for the different samples. This test is comparable with
*t* tests for two independent samples
(§13.6). We usually then summarise the numbers of observations
with a contingency table with multiple rows for the different samples,
and multiple columns for the categories of the nominal dependent variable (see also
Table 11.3).

The \(\chi^2\)-test is again based on the differences between the expected and observed frequencies. According to the null hypothesis (there is no difference in distribution between the two samples), the distribution of observations across the columns should be approximately equal for all rows (and vice versa).

## 16.4 \(\chi^2\) test for association between two variables in single sample

Finally, the \(\chi^2\) test can equally well be used for a research design
with *two* nominal variables, which we have observed in a single
sample. The question then is whether the distribution of observations
over the second variable’s categories is equal for the different
categories of the first variable (and vice versa). We again summarise
the numbers of observations in a contingency table with multiple rows for
the categories of the first nominal variable, and multiple columns
for the categories of the second nominal variable.

Here too, the \(\chi^2\)-test is based on the differences between the expected and
observed frequencies. According to the null hypothesis (that there is no association
between the two nominal variables), the distribution of observations across
rows should be approximately equal for all columns, and vice versa. However, this
does *not* mean that we expect the same frequency for all cells.
This is illustrated in the following example.

Example 16.1: In the early morning of 15th April 1912, the *Titanic* sank in the Atlantic Ocean. Many of those on board lost their lives. Those on board could be divided into four classes (1st/2nd/3rd class passengers, and crew). Was the outcome of the disaster (whether the individual survived the disaster or not) approximately equal for persons of these four classes? The contingency table 16.1 provides the distributions of outcomes.

Class | Died | Survived | Total |
---|---|---|---|
1st | 122 | 203 | 325 |
2nd | 167 | 118 | 285 |
3rd | 528 | 178 | 706 |
Crew | 673 | 212 | 885 |
Total | 1490 | 711 | 2201 |

For the expected frequencies, we have to take into account the different numbers of those on board in the different classes, and the unequal distribution of outcomes (1490 non-survivors and 711 survivors). If there were no association between the class and the survival status, we would expect there to be 220 non-survivors amongst the first class passengers \([(1490/2201) \times 325 = (325 \times 1490) / 2201 = 220]\) and 105 survivors \([(711/2201) \times 325 = (325 \times 711) / 2201 = 105]\). In this way, we can determine the expected frequencies for each cell, taking into account the marginal totals. With the help of these expected frequencies, we then calculate \(\chi^2=190.4\), here with 3 d.f., \(p<.001\). The significance \(p\) found is smaller than \(\alpha=.001\), and we thus reject H0. We conclude that the outcome of the disaster (died or survived) was unevenly distributed for the four classes of those on board the *Titanic*.
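The expected frequencies in Example 16.1 follow from the marginal totals, which can be sketched in R as follows. This is only an illustration: the counts are copied from Table 16.1, and the object names (`observed`, `expected`) are assumptions made here.

```r
# Observed counts from Table 16.1 (rows: class; columns: outcome)
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178,
    673, 212),
  nrow = 4, byrow = TRUE,
  dimnames = list(class   = c("1st", "2nd", "3rd", "Crew"),
                  outcome = c("Died", "Survived"))
)

# Expected count per cell under H0: (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 1))

# Test statistic, degrees of freedom, and probability under H0
chi2 <- sum((observed - expected)^2 / expected)
df   <- (nrow(observed) - 1) * (ncol(observed) - 1)
p    <- pchisq(chi2, df = df, lower.tail = FALSE)
print(c(chi2 = chi2, df = df, p = p))
```

The first row of `expected` corresponds to the 220 and 105 persons mentioned in the example.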

For the analysis of contingency tables which consist of precisely \(2\times2\) cells, the Phi coefficient is an effective alternative (see §11.6).

Reread and remember the warnings about correlation and causality (§11.7) — these are also applicable here.

## 16.5 Assumptions

The \(\chi^2\)-test requires three assumptions which must be satisfied in order to use the test.

- The data have to be measured on a nominal level of measurement, or have to be simplified to nominal level (see Chapter 4).
- All observations have to be independent of each other, and based on (a) random sample(s) of the population(s) (see §7.3), or on random assignment of the elements from the sample to experimental conditions (randomisation, see §5.4, point 5). Each element of the sample can thus only contribute one observation to one cell^{40}.
- The sample has to be large enough so that the expected frequency (\(E\)) for each cell is at least 5. If the expected frequency in one or more cells is less than 5, then reduce the number of cells by merging adjacent cells, and determine the expected frequencies again.
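The third assumption can be checked directly, since `chisq.test()` stores the expected frequencies in its result. The sketch below uses the die frequencies from §16.2; the sample of 24 throws in the last step is a hypothetical case added only to show a violation.

```r
# Observed die frequencies from §16.2
observed <- c(14, 9, 11, 10, 15, 1)

# chisq.test() stores the expected frequencies under H0 in $expected
expected <- chisq.test(observed)$expected
print(expected)

# Assumption check: every expected frequency should be at least 5
print(all(expected >= 5))

# Hypothetical smaller sample of 24 throws: E = 24/6 = 4 per cell,
# so the assumption would be violated
print(all(rep(24 / 6, 6) >= 5))
```

Note that the check concerns the expected frequencies, not the observed ones: the observed frequency of 1 for the outcome “six” does not violate the assumption.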

## 16.6 Formulas

The test statistic \(\chi^2\) is defined as \[\begin{equation} \tag{16.1} \chi^2 = \sum \frac{(O-E)^2}{E} \end{equation}\] in which \(O\) and \(E\) indicate the observed and expected numbers of observations for each cell of the frequency table (Ferguson and Takane 1989). The expected numbers need not be whole numbers (e.g. \(45/6 = 7.5\) for each of the 6 possible outcomes of an honest die, if we throw \(45\times\)). The larger the difference \((O-E)\) in one or several cells, the larger \(\chi^2\) will be (see below). Due to squaring, the test statistic \(\chi^2\) is always zero or positive, and never negative (Ferguson and Takane 1989).

The probability distribution of the test statistic \(\chi^2\) is determined by the number of degrees of freedom (see §13.2.1 for explanation of this concept). For a \(\chi^2\)-test with one nominal variable (“goodness of fit”), the number of degrees of freedom is equal to the number of cells minus 1. For a \(\chi^2\)-test with multiple samples (homogeneity) and/or with two variables (association), with respectively \(k\) and \(m\) categories, the number of degrees of freedom is equal to \((k-1)\times(m-1)\).
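To illustrate how the degrees of freedom shape the distribution, the following sketch computes the degrees of freedom for the two examples in this chapter and looks up the corresponding critical values for \(\alpha=.05\). The names `df_die` and `df_titanic` are chosen here for illustration.

```r
# Degrees of freedom for the two designs discussed in this chapter
df_die     <- 6 - 1               # goodness of fit: 6 cells
df_titanic <- (4 - 1) * (2 - 1)   # contingency table: 4 classes x 2 outcomes

# Critical chi-square values for alpha = .05 grow with the degrees of freedom
critical <- qchisq(0.95, df = c(df_titanic, df_die))
print(round(critical, 2))
```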

For each cell of the frequency table, in row \(i\) and column \(j\), we can also compute the raw residual: \[\begin{equation} \tag{16.2} e_{ij} = \frac{(O_{ij}-E_{ij})}{\sqrt{E_{ij}}} \end{equation}\] If we square these raw residuals and then sum the squares, the result is the \(\chi^2\) test statistic given in Eq.(16.1) above.

It is more insightful to compute the *standardized* residual for each cell of the frequency table (Agresti 2007, 38). The standardization means that the standard error of the residuals is taken into account (by using row totals \(R_i\), column totals \(C_j\), and the grand total \(N\)):
\[\begin{equation}
\tag{16.3}
e_{ij} = \frac{(O_{ij}-E_{ij})}{\sqrt{E_{ij}\times(1-\frac{R_i}{N})\times(1-\frac{C_j}{N})}}
\end{equation}\]
These standardized residuals may be interpreted as standard normal \(Z\) scores, using the critical \(Z\) values given in Appendix B. Hence the adjusted standardized residuals provide insight into the source of a significant outcome of the \(\chi^2\) test, and they also allow us to assess the contribution of each cell to that outcome^{41}.
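Eq.(16.3) can be applied by hand, which is useful when software does not report these residuals. The sketch below does so for the counts of Table 16.1 and compares the result with the `stdres` component returned by `chisq.test()`; the object names (`row_tot`, `col_tot`, `stdres`) are assumptions made for this illustration.

```r
# Observed counts from Table 16.1 (rows: class; columns: outcome)
observed <- matrix(
  c(122, 203, 167, 118, 528, 178, 673, 212),
  nrow = 4, byrow = TRUE,
  dimnames = list(class   = c("1st", "2nd", "3rd", "Crew"),
                  outcome = c("Died", "Survived"))
)

N       <- sum(observed)       # grand total
row_tot <- rowSums(observed)   # R_i
col_tot <- colSums(observed)   # C_j
expected <- outer(row_tot, col_tot) / N

# Eq. (16.3): residual divided by its standard error
stdres <- (observed - expected) /
  sqrt(expected * outer(1 - row_tot / N, 1 - col_tot / N))
print(round(stdres, 2))

# Cross-check against the values computed by chisq.test()
print(all.equal(stdres, chisq.test(observed)$stdres, check.attributes = FALSE))
```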

For the example given in §16.2 we find the following six standardized residuals for the six possible outcomes of the die: \((1.39, -0.35, +0.35, 0.00, 1.73, -3.12)\). The first five of these outcomes are observed approximately as frequently as expected, but the sixth of these outcomes is observed significantly less often than expected (\(p=.003\)).

## 16.7 SPSS

### 16.7.1 goodness of fit: preparation

If we want to investigate a nominal variable, then it must of course be present as a column in the SPSS data file. Every observation forms a separate row in the data file, and the nominal variable is a column in the data file.

Sometimes, we do not have the separate observations (rows) but
do have the table of numbers of observations per category of the nominal
variable. We can work further with these. Let us say that we have two columns,
named `outcome` and `number`, as follows
(see §16.2):

```
Outcome Number
1 14
2 9
3 11
4 10
5 15
6 1
```

Next, each cell (row) has to get a weight that is as large as the
`number` of observations, which is given here in the second column: the
first cell (row) weighs \(14\times\), the second cell (row) weighs
\(9\times\) etc. Thanks to this trick, we do not have to fill in \(N=60\) rows
(a row for each observation), but only 6 rows (a row for each cell).

`Data > Weight Cases...`

Choose `Weight cases by...` and select the variable `number` in the
entry field. Confirm with `OK`.


### 16.7.2 goodness of fit: testing

`Analyze > Nonparametric tests > Legacy Dialogs > Chi-square...`

Select the variable `outcome` (in “Test variable list” panel) and
indicate that we expect *equal* numbers of observations in each cell.
(It is also possible to enter other expected frequencies here,
if other, unequal frequencies are expected according to H0.)
Confirm with `OK`.

### 16.7.3 contingency tables: preparation

If we want to investigate two nominal variables, then they must
both be present as columns in the SPSS data file. Each observation
forms a separate row in the data file, and the nominal variables
are columns in the data file. For Example 16.1 above, we then use a “long”
data file, consisting of \(N=2201\) rows, with a separate row for each person
on board, with at least two columns, for `class` and `survivor`.

Sometimes, we do not have the separate observations (rows) but
do have the contingency table of numbers of observations for each
combination of categories of the nominal variables. We can also
work further with these. Let us say that we have three columns, named
`class`, `survivor` and `number`, as follows:

```
Class Survivor Number
1st no 122
1st yes 203
2nd no 167
2nd yes 118
3rd no 528
3rd yes 178
crew no 673
crew yes 212
```

Next, each cell (row) has to get a weight which is as large as
the `number` of observations, which is given in the third column: the
first cell (row) weighs \(122\times\), the second cell (row) weighs
\(203\times\), etc. With this trick, we do not have to enter \(N=2201\) rows
(a row for each observation), but only 8 rows (a row
for each cell).

`Data > Weight Cases...`

Choose `Weight cases by...` and select the variable `number` in the
entry field. Confirm with `OK`.

### 16.7.4 contingency tables: testing

The testing proceeds in the same way as described in §11.6 for the association between two nominal variables.

`Analyze > Descriptive Statistics > Crosstabs...`

Select the variables `class` (in “Rows” panel) and `survivor` (in
“Columns” panel) for
contingency table 16.1.

Choose `Statistics…` and tick the option `Chi-square`. Confirm firstly with
`Continue` and afterwards again with `OK`.

## 16.8 JASP

### 16.8.1 goodness of fit: preparation

The nominal data to investigate are typically coded as a “long” column in the data file. Each observation typically forms a separate row in the data file, and the nominal independent variable is a column in the data file. However, for the “goodness of fit” \(\chi^2\) test in JASP, the data have to be entered not in this “long” fashion (with \(N\) rows), but in the form of a summary of numbers of observations (counts, frequency) per category of the nominal variable (with \(k\) rows, one row for each of \(k\) categories).

For the example in §16.2 these summary data would look like this:

```
outcome count
1 14
2 9
3 11
4 10
5 15
6 1
```

In order to enter these data in JASP, create a data file (using e.g. Excel or any text editor) with the contents as listed above, including the column headers. Save the file in CSV format (`.csv`, not `.xlsx`) and open it in JASP.
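Such a CSV file can also be written from R instead of Excel or a text editor. The following is a minimal sketch; the file name `die_counts.csv` and the use of a temporary directory are assumptions made for illustration, so replace the path with a location of your own choice.

```r
# Summary counts for the die example (§16.2): one row per category
die_counts <- data.frame(outcome = 1:6,
                         count   = c(14, 9, 11, 10, 15, 1))

# Hypothetical file name in a temporary directory; replace with your own path
csv_file <- file.path(tempdir(), "die_counts.csv")

# row.names = FALSE keeps only the two columns with their headers
write.csv(die_counts, file = csv_file, row.names = FALSE)

# Read the file back to confirm that headers and counts are intact
check <- read.csv(csv_file)
print(check)
```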

### 16.8.2 goodness of fit: testing

In the top menu bar, choose

`Frequencies > Classical: Multinomial Test`

Select the variable containing the categories of the nominal variable, here `outcome`, and place it in the entry field “Factor.”
Select the variable containing the counts (frequencies) of each category, and place it in the entry field “Count.”

Under “Test Values” there are two options.

If you choose `Equal proportions (multinomial test)`, a special version of the \(\chi^2\) test will be performed, testing for a uniform distribution (as explained above, this means that the expected frequency is equal for each outcome category). In this example, this H0 implies that the die is honest, which is exactly what we want to test here.

If you choose `Expected proportions (chi-square test)`, you may adjust the expected frequencies in each cell. Use this option if your H0 postulates a non-uniform (e.g. Gaussian) distribution. A table will appear, in which you must enter the expected frequencies according to *H0* for each category or cell. By default, the values in this table are all equal, so that the default is equivalent to the “equal proportions” or uniform H0 in the first option.

You may also check `Descriptives` and `Confidence interval` under the heading “Additional Statistics,” and check `Descriptives plot` under “Plots,” so as to gain better insight into the patterns in your data.

In JASP it is not possible to obtain the (adjusted) standardized residuals; however you can compute these manually from the observed and expected counts.

### 16.8.3 contingency tables: preparation

The nominal data to investigate are typically coded as two or more “long” columns in the data file. Each observation (e.g. each person on the Titanic, in Example 16.1) corresponds with a separate row in the data file, and the nominal variables are in columns in the data file (e.g. `class` and `outcome`). We can use such a “long” data file for creating a contingency table in JASP, and for performing a \(\chi^2\) test on that contingency table; see the end of the next subsection for further instructions.

However, for performing a \(\chi^2\) test on a contingency table in JASP, the data do not necessarily have to be entered in this “long” fashion (with \(N\) rows); the data may also be in the form of a summary of numbers of observations (counts, frequency) per category of the nominal variable (with \(k\) rows, one row for each of \(k\) cells or combinations of categories).

For example 16.1, the data would then look as follows:

```
class outcome count
1st died 122
1st survived 203
2nd died 167
2nd survived 118
3rd died 528
3rd survived 178
crew died 673
crew survived 212
```

In order to enter these data in JASP, create a data file (using e.g. Excel or any text editor) with the contents as listed above, including the column headers. Save the file in CSV format (`.csv`, not `.xlsx`) and open it in JASP.

### 16.8.4 contingency tables: testing

The \(\chi^2\) test on a contingency table proceeds in the same way as described in §11.6 for association between two nominal variables.

In the top menu bar, choose:

`Frequencies > Classical: Contingency Tables`

Select one nominal variable (`class`) in the “Rows” field, and the other nominal variable (`outcome`) in the “Columns” field, to set up the contingency table (Table 16.1).
Select the variable `count` into the “Counts” field; this specifies the numbers of observations for each cell.

Open the `Statistics` section bar, and check the option `Chi-square` (\(\chi^2\)).
Open the `Cells` section bar, and check the option `Expected counts`.

The resulting value of the \(\chi^2\) test statistic is reported in the output under **Chi-Squared Tests**.

If you have a “long” data sheet, with one observation per row, then you only need to select one nominal variable (`class`) in the “Rows” field, and the other nominal variable (`outcome`) in the “Columns” field, to set up the contingency table (Table 16.1).

Open the `Statistics` section bar, and check the option `Chi-square` (\(\chi^2\)).
Open the `Cells` section bar, and check the option `Expected counts`.

In JASP it is not possible to obtain the (adjusted) standardized residuals; however you can compute these manually from the observed and expected counts.

## 16.9 R

### 16.9.1 goodness of fit: testing

```
chisq.test( c( 14, 9, 11, 10, 15, 1 ) ) -> dobbel.chi2.htest # die §16.2
print(dobbel.chi2.htest)
```

```
##
## Chi-squared test for given probabilities
##
## data: c(14, 9, 11, 10, 15, 1)
## X-squared = 12.4, df = 5, p-value = 0.0297
```

`dobbel.chi2.htest$residuals # raw residuals`

`## [1] 1.2649111 -0.3162278 0.3162278 0.0000000 1.5811388 -2.8460499`

`sum( (dobbel.chi2.htest$residuals)^2 ) # chi2 = sum of sq of raw resid`

`## [1] 12.4`

`dobbel.chi2.htest$stdres # standardized residuals`

`## [1] 1.3856406 -0.3464102 0.3464102 0.0000000 1.7320508 -3.1176915`

### 16.9.2 contingency table: preparation and testing

In R, the dataset `Titanic` is provided as a multidimensional array. We sum
the observations and make a contingency table of the first dimension (class) and
the fourth dimension (outcome).

`apply( Titanic, c(1,4), sum ) -> Titanic.classoutcome`

Next, we use the contingency (frequency) table as the input for `chisq.test`.
The resulting object (of class `htest`) is saved within R in order to inspect its residuals.

```
chisq.test( Titanic.classoutcome ) -> Titanic.chisq.htest
print(Titanic.chisq.htest)
```

```
##
## Pearson's Chi-squared test
##
## data: Titanic.classoutcome
## X-squared = 190.4, df = 3, p-value < 2.2e-16
```

`Titanic.chisq.htest$stdres # standardized residuals`

```
## Survived
## Class No Yes
## 1st -12.593038 12.593038
## 2nd -3.521022 3.521022
## 3rd 4.888701 -4.888701
## Crew 6.868541 -6.868541
```

The adjusted standardized residuals show the remarkably high number of survivors among the first class passengers, and the remarkably low number of survivors among the ship’s crew.
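If the data are available only as a summary table of counts (as in the SPSS and JASP sections above), the same analysis can be sketched in R with `xtabs()`, which plays the role of the case weights. The data frame name `titanic_counts` is an assumption made here; the column names and counts follow the summary table in §16.7.3.

```r
# Summary table of counts, one row per cell (see §16.7.3)
titanic_counts <- data.frame(
  class    = rep(c("1st", "2nd", "3rd", "crew"), each = 2),
  survivor = rep(c("no", "yes"), times = 4),
  number   = c(122, 203, 167, 118, 528, 178, 673, 212)
)

# With the counts on the left-hand side of the formula,
# xtabs() weighs each row by its `number`
tab <- xtabs(number ~ class + survivor, data = titanic_counts)
print(tab)

# Chi-square test on the resulting contingency table
print(chisq.test(tab))

# Conversely: expand the summary to a "long" data frame, one row per person
long <- titanic_counts[rep(seq_len(nrow(titanic_counts)), titanic_counts$number),
                       c("class", "survivor")]
print(nrow(long))

# Tabulating the long data gives back the same contingency table
print(all(table(long$class, long$survivor) == tab))
```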

## 16.10 Effect size: odds ratio

When using the \(\chi^2\)-test, the effect size can be reported in the form of the so-called “odds ratio.” The ‘odds ratio’ is derived from the contingency table with frequencies per cell; the odds ratio is most commonly used with \(2 \times 2\) contingency tables. We will explain all these matters using the following example of a \(2 \times 2\) contingency table.

Example 16.2: Doll and Hill (1956) investigated the relation between smoking and lung cancer. They first surveyed all British doctors about their age and smoking behaviour. Next, the researchers kept up over the years with the death notices and cause of death of all those surveyed. The first outcomes, after more than four years, are summarised in Table 16.2.

Smoking | No lung cancer | Lung cancer | Total |
---|---|---|---|
No (0) | 3092 (A) | 1 (B) | 3093 (A+B) |
Yes (1) | 21178 (C) | 83 (D) | 21261 (C+D) |
Total | 24270 (A+C) | 84 (B+D) | 24354 (A+B+C+D) |

In the usual manner, we find \(\chi^2=10.35\), df=1, \(p<.01\). We conclude that there is an association between smoking behaviour and death from lung cancer.

For the effect size, we firstly calculate the ‘odds’ of death from lung cancer for the smokers: D/C = \(83/21178 = 0.00392\). Amongst the smokers, there are 83 deaths from lung cancer, compared with 21178 persons who did not die from lung cancer (the ‘odds’ of dying from lung cancer are 0.00392 to 1, or about 1 to 255). For the non-smokers: B/A = \(1/3092 = 0.00032\) (the ‘odds’ are 0.00032 to 1, or 1 to 3092).

We call the *ratio* of these two ‘odds’ for the two groups the
‘odds ratio’ (abbreviated OR). In this example, we find (D/C) / (B/A) =
AD/BC =
\((3092 \times 83) / (1 \times 21178) = (0.00392)/(0.00032) = 12.1\). The
‘odds’ of dying from lung cancer are thus more than \(12\times\)
as great for the smokers as for the non-smokers. We report this as follows:

Doll and Hill (1956) found a significant relation between smoking behaviour and death from lung cancer, \(\chi^2(1)=10.35, p<.01, \textrm{OR}=12.1\). The ‘odds’ of dying from lung cancer seemed to be more than \(12\times\) as great for smokers as for non-smokers.
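The odds and the odds ratio of Example 16.2 can be retraced with a few lines of R. This is a minimal sketch using the cell counts of Table 16.2; the names `n_A` to `n_D` are chosen here to match the cell labels (A) to (D) in that table.

```r
# Cell counts from Table 16.2 (Doll and Hill 1956)
n_A <- 3092    # non-smokers, no lung cancer
n_B <- 1       # non-smokers, lung cancer
n_C <- 21178   # smokers, no lung cancer
n_D <- 83      # smokers, lung cancer

# Odds of dying from lung cancer within each group
odds_smokers    <- n_D / n_C
odds_nonsmokers <- n_B / n_A

# Odds ratio: ratio of the two odds, which equals (A x D) / (B x C)
odds_ratio <- odds_smokers / odds_nonsmokers

print(signif(c(smokers = odds_smokers, nonsmokers = odds_nonsmokers), 3))
print(round(odds_ratio, 1))
```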

### References

Agresti (2007). *Introduction to Categorical Data Analysis*. 2nd ed. Hoboken, NJ: Wiley.

Doll and Hill (1956). *British Medical Journal*, 1071–81. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2035864/.

Ferguson and Takane (1989). *Statistical Analysis in Psychology and Education*. 6th ed. New York: McGraw-Hill.

Maxwell and Delaney (2004). *Designing Experiments and Analyzing Data: A Model Comparison Perspective*. 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

^{39} The value found \(\chi^2=12.4\) is slightly under the critical value for 5 d.f. and \(p=.025\) (there \((\chi^2)^*=12.83\)), thus the corresponding probability of this value or a larger value is slightly greater than \(0.025\).

^{40} If one variable’s observations are paired rather than independent (e.g. before/after treatment, passed/failed, etc.), then the McNemar test is a useful alternative.

^{41} If multiple comparisons are performed, then the critical value of \(\alpha\) should be adjusted accordingly, in order to prevent Type I errors somewhere among the comparisons. With \(k\) cells and \(k\) comparisons, a safe precaution is to use \(\alpha/k\) instead of \(\alpha\) for each comparison; this is called Bonferroni’s adjustment of the \(\alpha\) value, or Dunn’s procedure (Maxwell and Delaney 2004, 202). See also §15.3.5.