16 Chi-square-tests

16.1 Introduction

Earlier, we already saw that we cannot always make use of a parametric test such as the t test or analysis of variance, because the collected data do not satisfy the assumptions. If the collected data have not been measured on an interval level of measurement (see Chapter 4), or if the probability distribution of the data is far from normal (see §10.5), then a non-parametric test is to be preferred over such a parametric test. If the collected data do satisfy the assumptions for a parametric test, then a non-parametric test is less sensitive (more conservative) than a parametric test, i.e. the non-parametric test requires a larger effect and/or a larger sample, and generally has less power than a parametric test when seeking out an effect (see Chapter 14).

In this chapter, we discuss the most widely used non-parametric test: the so-called \(\chi^2\) test, pronounced as “chi-square test” (with the Greek letter “chi”).

16.2 \(\chi^2\) test for “goodness of fit” in a single sample

Data of nominal level of measurement are often analysed with the \(\chi^2\) test. The number of dots on a die is an example of a dependent variable of nominal level of measurement: there is no physical ordering between the six sides, and each side of a die has an equally high probability of appearing on the top. Imagine we throw a die \(60\times\), and find the following frequencies of the six possible outcomes: \(14, 9, 11, 10, 15, 1\). This can be considered to be a sample of \(n=60\) throws from an infinite population of possible throws, and the outcome frequencies reported here should be seen as a contingency table of 1 row and 6 columns (i.e. 6 cells). How high is the probability of this distribution of outcomes? Is the die indeed honest?

The \(\chi^2\) test is based on the differences between the expected and observed frequencies. According to the null hypothesis (H0: the die is honest), we expect 10 outcomes in each cell (\(60/6=10\)), i.e. the expected frequency is identical for each cell (this is called a uniform distribution). The observed outcomes deviate from the expected frequencies of outcomes, in particular because the outcome “six” barely occurs in this sample. This might of course also have happened by chance. The \(\chi^2\) test indicates how high the probability is of this uneven distribution of outcomes (or an even more uneven distribution), if H0 is true. The expected outcomes are thus deduced from a distribution of the outcomes according to H0, and we investigate how well the observed outcomes fit the expected outcomes. This form of the \(\chi^2\) test is thus also referred to as a test of the ‘goodness of fit’.

For this example, we find the test outcome \(\chi^2=12.40\) with 5 degrees of freedom (see §13.2.1 for an explanation of degrees of freedom), with \(p=.03\). (We usually use the computer to calculate this probability value, but we can also estimate it via a table with critical \(\chi^2\)-values; see Appendix D and footnote ³⁹.) If H0 is true, then we have only 3% probability of finding this outcome (or an even more uneven distribution of outcomes). The significance \(p\) found is smaller than \(\alpha=.05\), and we thus reject H0. We conclude that this die is not honest: the distribution of outcomes found deviates significantly from the expected distribution according to H0.
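The calculation for the die can be retraced by hand in R. The following is a minimal sketch, not the authors' own code; the variable names are chosen here for illustration only. It computes the test statistic from the observed and expected frequencies, looks up the corresponding upper-tail probability, and also reports the critical value that cuts off the upper 2.5% of the \(\chi^2\) distribution with 5 degrees of freedom.

```r
# Observed frequencies of the six outcomes in n = 60 throws (from the text).
observed <- c(14, 9, 11, 10, 15, 1)
n <- sum(observed)
k <- length(observed)

# Expected frequencies under H0 (honest die): a uniform distribution.
expected <- rep(n / k, k)

# Test statistic: sum over all cells of (O - E)^2 / E.
chi2 <- sum((observed - expected)^2 / expected)
df <- k - 1

# Probability of this value of chi2 (or a larger one) if H0 is true.
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

# Critical value cutting off the upper 2.5% of the distribution.
crit_025 <- qchisq(0.975, df = df)

print(round(c(chi2 = chi2, df = df, p = p_value, crit_025 = crit_025), 4))
```

The same test is carried out in one step with `chisq.test()` in §16.9.1.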

16.3 \(\chi^2\) test for homogeneity of a variable in multiple samples

The \(\chi^2\) test can also be used for a research design with one nominal variable which we have observed in two or more samples. The question is then whether the distribution of the observations over the categories is equal for the different samples. This test is comparable with t tests for two independent samples (§13.6). We usually then summarise the numbers of observations with a contingency table with multiple rows for the different samples, and multiple columns for the categories of the nominal dependent variable (see also Table 11.3).

The \(\chi^2\)-test is again based on the differences between the expected and observed frequencies. According to the null hypothesis (there is no difference in distribution between the samples), the distribution of observations across the columns should be approximately equal for all rows (and vice versa).

16.4 \(\chi^2\) test for association between two variables in a single sample

Finally, the \(\chi^2\) test can equally well be used for a research design with two nominal variables, which we have observed in a single sample. The question then is whether the distribution of observations over the second variable’s categories is equal for the different categories of the first variable (and vice versa). We again summarise the numbers of observations in a contingency table with multiple rows for the categories of the first nominal variable, and multiple columns for the categories of the second nominal variable.

Here too, the \(\chi^2\)-test is based on the differences between the expected and observed frequencies. According to the null hypothesis (that there is no association between the two nominal variables), the distribution of observations across rows should be approximately equal for all columns, and vice versa. However, this does not mean that we expect the same frequency for all cells. This is illustrated in the following example.

Example 16.1: In the early morning of 15th April 1912, the Titanic sank in the Atlantic Ocean. Many of those on board lost their lives. Those on board could be divided into four classes (1st/2nd/3rd class passengers, and crew). Was the outcome of the disaster (whether the individual survived the disaster or not) approximately equal for persons of these four classes? The contingency table 16.1 provides the distributions of outcomes.

Table 16.1: Distribution of those on board the *Titanic* (\(N=2201\)), according to passage and status (survived or not). Data taken from the dataset `Titanic` in R.
Class	Died	Survived	Total
1st	122	203	325
2nd	167	118	285
3rd	528	178	706
Crew	673	212	885
Total	1490	711	2201

For the expected frequencies, we have to take into account the different numbers of those on board in the different classes, and the unequal distribution of outcomes (1490 non-survivors and 711 survivors). If there were no association between the class and the survival status, we would expect there to be 220 non-survivors amongst the first class passengers \([(1490/2201) \times 325 = (325 \times 1490) / 2201 = 220]\) and 105 survivors \([(711/2201) \times 325 = (325 \times 711) / 2201 = 105]\). In this way, we can determine the expected frequencies for each cell, taking into account the marginal totals. With the help of these expected frequencies, we then calculate \(\chi^2=190.4\), here with 3 d.f., \(p<.001\). The significance \(p\) found is smaller than \(\alpha=.001\), and we thus reject H0. We conclude that the outcome of the disaster (died or survived) was unevenly distributed for the four classes of those on board the Titanic.
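The expected frequencies for all eight cells can be derived from the marginal totals in one step. The sketch below is an illustration under the assumption that the counts of Table 16.1 are typed in by hand; the object names are hypothetical. Each expected frequency is the row total times the column total, divided by the grand total.

```r
# Observed frequencies from Table 16.1 (rows: class; columns: outcome).
observed <- matrix(
  c(122, 203,
    167, 118,
    528, 178,
    673, 212),
  nrow = 4, byrow = TRUE,
  dimnames = list(class = c("1st", "2nd", "3rd", "Crew"),
                  outcome = c("Died", "Survived"))
)

N <- sum(observed)

# Expected frequency per cell: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / N

# Test statistic, degrees of freedom and upper-tail probability.
chi2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)

print(round(expected, 1))
print(round(c(chi2 = chi2, df = df), 1))
```

The first row of `expected` reproduces the values of about 220 and 105 derived above for the first class passengers.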

For the analysis of contingency tables which consist of precisely \(2\times2\) cells, the Phi coefficient is an effective alternative (see §11.6).

Reread and remember the warnings about correlation and causality (§11.7) — these are also applicable here.

16.5 Assumptions

The \(\chi^2\)-test requires three assumptions which must be satisfied in order to use the test.

The data have to be measured on a nominal level of measurement, or have to be simplified to nominal level (see Chapter 4).
All observations have to be independent of each other, and based on (a) random sample(s) of the population(s) (see §7.3), or on random assignment of the elements from the sample to experimental conditions (randomisation, see §5.4, point 5). Each element of the sample can thus only contribute one observation to one cell⁴⁰.
The sample has to be large enough so that the expected frequency (\(E\)) for each cell is at least 5. If the expected frequency or frequencies in one or more cells is/are less than 5, then reduce the number of cells by merging bordering cells, and determine the expected frequencies again.
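The third assumption is easy to check before testing. The following sketch assumes a die thrown 60 times (as in §16.2) and, purely as a made-up contrast, a die thrown only 24 times; the function name `expected_uniform` is hypothetical. It shows how merging neighbouring cells in pairs raises the expected frequencies above the minimum.

```r
# Expected frequency per cell for a die thrown n times, under a uniform H0.
expected_uniform <- function(n, k = 6) rep(n / k, k)

E60 <- expected_uniform(60)  # the sample from the die example: 10 per cell
E24 <- expected_uniform(24)  # hypothetical smaller sample: 4 per cell

ok60 <- all(E60 >= 5)
ok24 <- all(E24 >= 5)

# Merge neighbouring cells in pairs (1+2, 3+4, 5+6) and check again.
E24_merged <- as.vector(tapply(E24, rep(1:3, each = 2), sum))
ok24_merged <- all(E24_merged >= 5)

print(c(ok60 = ok60, ok24 = ok24, ok24_merged = ok24_merged))
```

Note that merging cells also changes the hypothesis being tested (three categories instead of six), and reduces the number of degrees of freedom accordingly. R's `chisq.test()` warns that the approximation may be incorrect when expected frequencies are below 5.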

16.6 Formulas

The test statistic \(\chi^2\) is defined as \[\begin{equation} \tag{16.1} \chi^2 = \sum \frac{(O-E)^2}{E} \end{equation}\] in which \(O\) and \(E\) indicate the observed and expected numbers of observations for each cell of the frequency table (Ferguson and Takane 1989). The expected numbers need not be whole numbers (e.g. \(45/6=7.5\) for each of the 6 possible outcomes of an honest die, if we throw \(45\times\)). The larger the difference \((O-E)\) in one or several cells, the larger \(\chi^2\) will be. Due to squaring, the test statistic \(\chi^2\) is always zero or positive, and never negative (Ferguson and Takane 1989).
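As a worked illustration of Eq.(16.1), take the die example of §16.2, in which every cell has \(E=10\). The arithmetic below is only a check by hand of the value that R reports in §16.9.1: \[ \chi^2 = \frac{(14-10)^2}{10} + \frac{(9-10)^2}{10} + \frac{(11-10)^2}{10} + \frac{(10-10)^2}{10} + \frac{(15-10)^2}{10} + \frac{(1-10)^2}{10} = \frac{16+1+1+0+25+81}{10} = 12.4 \] The last cell (outcome “six”) alone contributes \(81/10=8.1\) to this sum, which already suggests where the deviation from H0 originates.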

The probability distribution of the test statistic \(\chi^2\) is determined by the number of degrees of freedom (see §13.2.1 for explanation of this concept). For a \(\chi^2\)-test with one nominal variable (“goodness of fit”), the number of degrees of freedom is equal to the number of cells minus 1. For a \(\chi^2\)-test with multiple samples (homogeneity) and/or with two variables (association), with respectively \(k\) and \(m\) categories, the number of degrees of freedom is equal to \((k-1)\times(m-1)\).

For each cell of the frequency table, in row \(i\) and column \(j\), we can also compute the raw residual (elsewhere often called the Pearson residual): \[\begin{equation} \tag{16.2} e_{ij} = \frac{(O_{ij}-E_{ij})}{\sqrt{E_{ij}}} \end{equation}\] If we square these raw residuals and then sum the squares, the result is the \(\chi^2\) test statistic given in Eq.(16.1) above.

It is more insightful to compute the standardized residual for each cell of the frequency table (Agresti 2007, 38). The standardization means that the standard error of the residuals is taken into account (by using row totals \(R_i\), column totals \(C_j\), and the grand total \(N\)): \[\begin{equation} \tag{16.3} e_{ij} = \frac{(O_{ij}-E_{ij})}{\sqrt{E_{ij}\times(1-\frac{R_i}{N})\times(1-\frac{C_j}{N})}} \end{equation}\] These standardized residuals may be interpreted as standard normal \(Z\) scores, using the critical \(Z\) values given in Appendix B. Hence the (adjusted) standardized residuals provide insight into the source of a significant outcome of the \(\chi^2\) test, and they also allow us to assess the contribution of each cell to that outcome⁴¹.

For the example given in §16.2 we find the following six standardized residuals for the six possible outcomes of the die: \((+1.39, -0.35, +0.35, 0.00, +1.73, -3.12)\). The first five of these outcomes are observed approximately as frequently as expected, but the sixth of these outcomes is observed significantly less often than expected (\(p=.002\), two-sided).
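These residuals can be recomputed by hand. The sketch below rests on one assumption that should be stated clearly: for a table with a single row, the row term in Eq.(16.3) drops out, so that only the factor \((1-p_j)\) for the expected proportion \(p_j\) of each outcome remains. This is the variant that reproduces the `stdres` values reported by R in §16.9.1. The variable names are illustrative only.

```r
observed <- c(14, 9, 11, 10, 15, 1)
n <- sum(observed)
p0 <- rep(1 / 6, 6)      # H0: each outcome is equally likely
expected <- n * p0

# Raw (Pearson) residuals, Eq. (16.2).
pearson <- (observed - expected) / sqrt(expected)

# Standardized residuals for a one-row table: only the (1 - p0) term remains.
standardized <- (observed - expected) / sqrt(expected * (1 - p0))

# Two-sided probabilities, treating the residuals as standard normal Z scores.
p_two_sided <- 2 * pnorm(-abs(standardized))

print(round(standardized, 2))
print(round(p_two_sided, 3))
```

Only the sixth outcome yields a two-sided probability below .05; if all six cells are inspected in this way, the adjustment for multiple comparisons mentioned in footnote ⁴¹ applies.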

16.7 SPSS

16.7.1 goodness of fit: preparation

If we want to investigate a nominal variable, then it must of course be entered as a column in the SPSS data file. Every observation forms a separate row in the data file, and the nominal variable is a column in the data file.

Sometimes, we do not have the separate observations (rows) but do have the table of numbers of observations per category of the nominal variable. We can work further with these. Let us say that we have two columns, named outcome and number, as follows (see §16.2):

Outcome Number
 1        14
 2         9
 3        11
 4        10
 5        15
 6         1

Next, each cell (row) has to get a weight that is as large as the number of observations, which is named here in the second column: the first cell (row) weighs \(14\times\), the second cell (row) weighs \(9\times\) etc. Thanks to this trick, we do not have to fill in \(N=60\) rows (a row for each observation), but only 6 rows (a row for each cell).

Data > Weight Cases...

Choose Weight cases by... and select the variable number in the entry field. Confirm with OK.


16.7.2 goodness of fit: testing

Analyze > Nonparametric tests > Legacy Dialogs > Chi-square...

Select the variable outcome (in the “Test variable list” panel) and indicate that we expect equal numbers of observations in each cell. (It is also possible to enter other expected frequencies here, if other, unequal frequencies are expected according to H0.) Confirm with OK.

16.7.3 contingency tables: preparation

If we want to investigate two nominal variables, then they must both be entered as columns in the SPSS data file. Each observation forms a separate row in the data file, and the nominal variables are columns in the data file. For Example 16.1 above, we then use a “long” data file, consisting of \(N=2201\) rows, with a separate row for each person on board, with at least two columns, for class and survivor.

Sometimes, we do not have the separate observations (rows) but do have the contingency table of numbers of observations for each combination of categories of the nominal variables. We can also work further with these. Let us say that we have three columns, named class, survivor and number, as follows:

Class  Survivor   Number
1st     no         122
1st     yes        203
2nd     no         167
2nd     yes        118
3rd     no         528
3rd     yes        178
crew    no         673
crew    yes        212

Next, each cell (row) has to get a weight which is as large as the number of observations, which is named in the third column: the first cell (row) weighs \(122\times\), the second cell (row) weighs \(203\times\), etc. With this trick, we do not have to enter \(N=2201\) rows (a row for each observation), but only 8 rows (a row for each cell).

Data > Weight Cases...

Choose Weight cases by... and select the variable number in the entry field. Confirm with OK.

16.7.4 contingency tables: testing

The testing proceeds in the same way as described in §11.6 for the association between two nominal variables.

Analyze > Descriptive Statistics > Crosstabs...

Select the variables class (in “Rows” panel) and survivor (in “Columns” panel) for contingency table 16.1.
Choose Statistics… and tick the option Chi-square. Confirm firstly with Continue and afterwards again with OK.

16.8 JASP

16.8.1 goodness of fit: preparation

The nominal data to investigate are typically coded as a “long” column in the data file. Each observation typically forms a separate row in the data file, and the nominal variable is a column in the data file. However, for the “goodness of fit” \(\chi^2\) test in JASP, the data have to be entered not in this “long” fashion (with \(N\) rows), but in the form of a summary of numbers of observations (counts, frequency) per category of the nominal variable (with \(k\) rows, one row for each of \(k\) categories).

For the example in §16.2 these summary data would look like this:

outcome count
 1        14
 2         9
 3        11
 4        10
 5        15
 6         1

In order to enter these data in JASP, create a data file (using e.g. Excel or any text editor) with the contents as listed above, including the column headers. Save the file in CSV format (.csv, not .xlsx) and open it in JASP.
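If you already work in R, the summary file can also be written from there instead of typing it in a spreadsheet. This is a minimal sketch; the file name `die_counts.csv` and the use of a temporary directory are assumptions made for illustration, so adapt the path to your own project.

```r
# Summary data: one row per category, with its count.
die <- data.frame(outcome = 1:6,
                  count = c(14, 9, 11, 10, 15, 1))

# Hypothetical file location; replace by a folder of your own choice.
csv_path <- file.path(tempdir(), "die_counts.csv")

# Write a plain CSV file with column headers and without row names.
write.csv(die, csv_path, row.names = FALSE)

# Read the file back as a check on its contents.
check <- read.csv(csv_path)
print(check)
```

The resulting file has the two column headers `outcome` and `count`, followed by six rows, and can be opened directly in JASP.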

16.8.2 goodness of fit: testing

In the top menu bar, choose

Frequencies > Classical: Multinomial Test

Select the variable containing the categories of the nominal variable, here outcome, and place it in the entry field “Factor”. Select the variable containing the counts (frequencies) of each category, and place it in the entry field “Count”.

Under “Test Values” there are two options.
If you choose Equal proportions (multinomial test), a special version of the \(\chi^2\) test will be performed, testing for a uniform distribution (as explained above, this means that the expected frequency is equal for each outcome category). In this example, this H0 implies that the die is honest, which is exactly what we want to test here.
If you choose Expected proportions (chi-square test), you may adjust the expected frequencies in each cell. Use this option if your H0 postulates a non-uniform (e.g. Gaussian) distribution. A table will appear, in which you must enter the expected frequencies according to H0 for each category or cell. By default, the values in this table are all equal, so that the default is equivalent to the “equal proportions” or uniform H0 in the first option.

You may also check Descriptives and Confidence interval under the heading “Additional Statistics”, and check Descriptives plot under “Plots”, so as to gain better insight into the patterns in your data.

In JASP it is not possible to obtain the (adjusted) standardized residuals for this test; however you can compute these manually from the observed and expected counts.

16.8.3 contingency tables: preparation

The nominal data to investigate are typically coded as two or more “long” columns in the data file. Each observation (e.g. each person on the Titanic, in Example 16.1) corresponds with a separate row in the data file, and the nominal variables are in columns in the data file (e.g. class and outcome). We can use such a “long” data file for creating a contingency table in JASP, and for performing a \(\chi^2\) test on that contingency table — see the end of the next subsection for further instructions.

However, for performing a \(\chi^2\) test on a contingency table in JASP, the data do not necessarily have to be entered in this “long” fashion (with \(N\) rows); the data may also be in the form of a summary of numbers of observations (counts, frequency) per category of the nominal variable (with \(k\) rows, one row for each of \(k\) cells or combinations of categories).

For example 16.1, the data would then look as follows:

class   outcome    count
1st     died       122
1st     survived   203
2nd     died       167
2nd     survived   118
3rd     died       528
3rd     survived   178
crew    died       673
crew    survived   212
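The relation between the summary format and the “long” format can be made explicit in R. The sketch below is an illustration with hypothetical object names: it rebuilds the contingency table from the eight summary rows, and then expands the same summary into one row per person.

```r
# Summary format: one row per cell of the contingency table.
titanic_counts <- data.frame(
  class   = rep(c("1st", "2nd", "3rd", "crew"), each = 2),
  outcome = rep(c("died", "survived"), times = 4),
  count   = c(122, 203, 167, 118, 528, 178, 673, 212)
)

# Contingency table, using the counts as cell frequencies.
tab <- xtabs(count ~ class + outcome, data = titanic_counts)
print(addmargins(tab))

# "Long" format: repeat each summary row as often as its count indicates.
row_index <- rep(seq_len(nrow(titanic_counts)), times = titanic_counts$count)
long <- titanic_counts[row_index, c("class", "outcome")]

# Tabulating the long data gives back the same contingency table.
tab_long <- table(long$class, long$outcome)
print(nrow(long))
```

Both routes lead to the same frequencies as in Table 16.1, which is why either format can serve as input for the test.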

16.8.4 contingency tables: testing

The \(\chi^2\) test on a contingency table proceeds in the same way as described in §11.6 for association between two nominal variables.

In the top menu bar, choose:

Frequencies > Classical: Contingency Tables

Select one nominal variable (class) in the “Rows” field, and the other nominal variable (outcome) in the “Columns” field, to set up the contingency table (Table 16.1). Select the variable count into the “Counts” field; this specifies the numbers of observations for each cell.
Open the Statistics section bar, and check the option Chi-square (\(\chi^2\)). Open the Cells section bar, and check the option Expected counts.

The resulting value of the \(\chi^2\) test statistic is reported in the output under Chi-Squared Tests.

If you have a “long” data sheet, with one observation per row, then you only need to select one nominal variable (class) in the “Rows” field, and the other nominal variable (outcome) in the “Columns” field, to set up the contingency table (Table 16.1).
Open the Statistics section bar, and check the option Chi-square (\(\chi^2\)). Open the Cells section bar, and check the option Expected counts. Also check the Pearson residuals and Standardized (adjusted Pearson).

The number of survivors is significantly larger than expected under H0 for passengers in first and second class (positive standardized residuals for these cells, both \(p<.001\)) and significantly lower than expected under H0 for passengers in third class and for crew (negative standardized residuals, both \(p<.001\)).

16.9 R

16.9.1 goodness of fit: testing

chisq.test( c( 14, 9, 11, 10, 15, 1 ) ) -> dobbel.chi2.htest # die §16.2
print(dobbel.chi2.htest)

## 
##  Chi-squared test for given probabilities
## 
## data:  c(14, 9, 11, 10, 15, 1)
## X-squared = 12.4, df = 5, p-value = 0.0297

dobbel.chi2.htest$residuals # raw residuals

## [1]  1.2649111 -0.3162278  0.3162278  0.0000000  1.5811388 -2.8460499

sum( (dobbel.chi2.htest$residuals)^2 ) # chi2 = sum of sq of raw resid

## [1] 12.4

dobbel.chi2.htest$stdres # standardized residuals

## [1]  1.3856406 -0.3464102  0.3464102  0.0000000  1.7320508 -3.1176915

16.9.2 contingency table: preparation and testing

In R, the dataset Titanic is provided as a four-dimensional table (class, sex, age, and survival). We sum the observations over the other dimensions and make a contingency table of the first dimension (class) and the fourth dimension (outcome).

apply( Titanic, c(1,4), sum ) -> Titanic.classoutcome

Next, we use the contingency (frequency) table as the input for a chisq.test. The resulting chisq.htest object is saved within R in order to inspect its residuals.

chisq.test( Titanic.classoutcome ) -> Titanic.chisq.htest
print(Titanic.chisq.htest)

## 
##  Pearson's Chi-squared test
## 
## data:  Titanic.classoutcome
## X-squared = 190.4, df = 3, p-value < 2.2e-16

Titanic.chisq.htest$stdres # standardized residuals

##       Survived
## Class          No       Yes
##   1st  -12.593038 12.593038
##   2nd   -3.521022  3.521022
##   3rd    4.888701 -4.888701
##   Crew   6.868541 -6.868541

The adjusted standardized residuals show the remarkably high number of survivors among the first class passengers, and the remarkably low number of survivors among the ship’s crew. Note that R here reports the standardized (but not otherwise adjusted) Pearson residuals.
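To see how these residuals relate to Eq.(16.3), they can be recomputed from the marginal totals. The following is a sketch with illustrative object names, using the built-in `Titanic` dataset; it compares the hand-computed values with the `stdres` component returned by `chisq.test()`.

```r
# Contingency table of class (dimension 1) by survival (dimension 4).
tab <- apply(Titanic, c(1, 4), sum)

N <- sum(tab)
row_tot <- rowSums(tab)
col_tot <- colSums(tab)

# Expected frequencies from the marginal totals.
E <- outer(row_tot, col_tot) / N

# Standardized residuals according to Eq. (16.3).
stdres_manual <- (tab - E) / sqrt(E * outer(1 - row_tot / N, 1 - col_tot / N))

# The same residuals as reported by chisq.test().
stdres_r <- chisq.test(tab)$stdres

print(round(stdres_manual, 2))
print(all.equal(as.vector(stdres_manual), as.vector(stdres_r)))
```

In a table with only two columns, the two residuals in each row are equal in size and opposite in sign, as the output above also shows.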

16.10 Effect size: odds ratio

When using the \(\chi^2\)-test, the effect size can be reported in the form of the so-called “odds ratio”. The ‘odds ratio’ is derived from the contingency table with frequencies per cell; the odds ratio is most commonly used with \(2 \times 2\) contingency tables. We will explain all these matters using the following example of a \(2 \times 2\) contingency table.

Example 16.2: Doll and Hill (1956) investigated the relation between smoking and lung cancer. They first surveyed all British doctors about their age and smoking behaviour. Next, the researchers kept up over the years with the death notices and cause of death of all those surveyed. The first outcomes, after more than four years, are summarised in Table 16.2.

Table 16.2: Contingency table of \(N=24354\) British doctors of 35 years and older for the first survey, divided according to smoking behaviour (rows: (non-) smoker currently or previously) and according to death by lung cancer in the last 4 years (columns), with letter indication for the numbers of observations.
Smoking	No lung cancer		Lung cancer		Total
No (0)	3092	(A)	1	(B)	3093	(A+B)
Yes (1)	21178	(C)	83	(D)	21261	(C+D)
Total	24270	(A+C)	84	(B+D)	24354	(A+B+C+D)

In the usual manner, we find \(\chi^2=10.35\), df=1, \(p<.01\). We conclude that there is an association between smoking behaviour and death from lung cancer.

For the effect size, we firstly calculate the ‘odds’ of death from lung cancer for the smokers: D/C= \(83/21178 =0.00392\). Amongst the smokers, there are 83 deaths from lung cancer, compared with 21178 persons not dying from lung cancer (the ‘odds’ of dying from lung cancer are 0.00392 to 1, or about 1 to 255). For the non-smokers: B/A=\(1/3092 =0.00032\) (the ‘odds’ are 0.00032 to 1, or 1 to 3092).

We call the ratio of these two ‘odds’ for the two groups the ‘odds ratio’ (abbreviated OR). In this example, we find (D/C) / (B/A) = AD/BC = \((3092 \times 83) / (1 \times 21178) = (0.00392)/(0.00032) = 12.1\). The ‘odds’ of dying from lung cancer are thus more than \(12\times\) as great for the smokers as for the non-smokers. We report this as follows:

Doll and Hill (1956) found a significant relation between smoking behaviour and death from lung cancer, \(\chi^2(1)=10.35, p<.01, \textrm{OR}=12.1\). The ‘odds’ of dying from lung cancer seemed to be more than \(12\times\) as great for smokers as for non-smokers.
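The odds and the odds ratio follow directly from the four cell frequencies of Table 16.2. The sketch below is an illustration only; the names `cell_A` to `cell_D` mirror the letter indications in the table and are not part of any package.

```r
# Cell frequencies from Table 16.2.
cell_A <- 3092   # non-smokers, no lung cancer
cell_B <- 1      # non-smokers, lung cancer
cell_C <- 21178  # smokers, no lung cancer
cell_D <- 83     # smokers, lung cancer

# Odds of dying from lung cancer within each group.
odds_smokers <- cell_D / cell_C
odds_nonsmokers <- cell_B / cell_A

# Odds ratio: ratio of the two odds, equal to AD / BC.
odds_ratio <- odds_smokers / odds_nonsmokers

print(round(c(odds_smokers = odds_smokers,
              odds_nonsmokers = odds_nonsmokers), 5))
print(round(odds_ratio, 1))
```

Computing the ratio from the unrounded odds avoids the small rounding error that arises when the rounded values 0.00392 and 0.00032 are divided.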

References

Agresti, Alan. 2007. Introduction to Categorical Data Analysis. Hoboken, NJ: Wiley.

Doll, Richard, and A. Bradford Hill. 1956. “Lung Cancer and Other Causes of Death in Relation to Smoking: A Second Report on the Mortality of British Doctors.” British Medical Journal, 1071–81. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2035864/.

Ferguson, George A., and Yoshio Takane. 1989. Statistical Analysis in Psychology and Education. 6th ed. New York: McGraw-Hill.

Maxwell, Scott E., and Harold D. Delaney. 2004. Designing Experiments and Analyzing Data: A Model Comparison Perspective. 2nd ed. Mahwah, NJ: Lawrence Erlbaum Associates.

³⁹ The value found \(\chi^2=12.40\) is slightly under the critical value for 5 d.f. and \(p=.025\) (there \((\chi^2)^*=12.83\)), thus the corresponding probability of this value or a larger value is slightly greater than \(0.025\).
⁴⁰ If one variable’s observations are paired rather than independent (e.g. before/after treatment, passed/failed, etc.), then the McNemar test is a useful alternative.
⁴¹ If multiple comparisons are performed, then the critical value of \(\alpha\) should be adjusted accordingly, in order to prevent Type I errors somewhere among the comparisons. With \(k\) cells and \(k\) comparisons, a safe precaution is to use \(\alpha/k\) instead of \(\alpha\) for each comparison; this is called Bonferroni’s adjustment of the \(\alpha\) value, or Dunn’s procedure (Maxwell and Delaney 2004, 202). See also §15.3.5.