# 12 Reliability

## 12.1 Introduction

In Chapter 5, we talked about, amongst other matters,
construct validity, the distance between the intended (theoretical)
concept or construct on the one hand, and the independent or dependent variable
on the other hand. In this Chapter, we will look at another very important
aspect of the dependent variable, namely its *reliability*.
This reliability can be estimated based on the
association between observations of the same construct.
We will also look at the relations between reliability and construct validity.

Often validity and reliability are mentioned in the same breath, and discussed in consecutive chapters. There is something to be said for this, since both concepts are about how you define and operationalise your variables. Nevertheless, we have chosen a different ordering here. Reliability will only be discussed following our discussion of correlation (Chapter 11), since reliability is based on the relation or correlation between observations.

## 12.2 What is reliability?

A reliable person is stable and predictable: what he or she does
today is consistent with what he or she did last week, you can trust
this person — in contrast to an unreliable person, who
is unstable and behaves unpredictably. According to Collins English Dictionary, someone or something is reliable when they/it “…can be trusted to work well or to behave in the way that you want them to.”^{28}
Reliable measurements can form the basis for a
“justified true belief” (see
§2.4); conversely, it is not worth
giving credence to unreliable measurements.

Measurements always show some degree of fluctuation or variation or inconsistency. This variability can partially be attributed to the variation in the behaviour which is being measured. After all, even if we measure the same construct for the same person, we still see variance as a result of the momentary mental or physical state of the participant, which simply fluctuates. Moreover, there is variation in the measuring device (thermometer, questionnaire, sensor), and there are probably inconsistencies in the manner of measurement or evaluation. With the quantification of such consistencies and inconsistencies, we enter the realms of reliability analysis.

The term ‘reliability’ has two meanings in academic research, which we
will treat separately. Firstly, reliability signifies the *precision*
or *accuracy* of a measurement. This aspect concerns the question of the extent
to which the measurement is influenced by chance factors (through which the
measurement does not exclusively render the construct investigated).
If we do *not* measure
accurately then we also know what the measurements actually show —
perhaps they show the construct investigated but perhaps they also do not. If we do
measure accurately then we would expect, if we were to conduct the same measurement again,
that we would then measure the same outcome. The less precise a measurement is, the more variation or inconsistency there is between the first measurement and the repeated measurement,
and the measurements are thus less reliable.

Example 12.1:If we want to measure the reading ability of pupils in their final examinations, then we present them with a reading comprehension test with a number of accompanying questions. The degree to which the different questions measurethe sameconstruct, here the construct ‘reading ability,’ is called reliability, precision or homogeneity.

In what follows, to avoid confusion, we will refer to this form
of reliability with the term *homogeneity* (vs.
heterogeneity). With a heterogenous (non-homogenous) test, the total score
is difficult to interpret. With a perfectly homogenous test, people
who have the same total score have also answered the same questions correctly.
However, when we measure human (language) behaviour, such perfectly homogenous tests
never occur: respondents who do achieve the same total score, have not
always answered the same questions correctly (e.g. in the final examination
reading ability test, Example 12.1). This implies that the questions have not
measured exactly the same capacity. This is also the case: one question was about
paraphrasing a paragraph, whilst another was related to a relationship between
a referential expression and its antecedent.

As such, the questions or items were not perfectly homogenous!

Secondly, reliability signifies a measurement’s *stability*.
To measure your weight, you stand on a weighing scale. This
measurement is stable: five minutes later, the same weighing scale with the same
person under the same circumstances will also yield (almost)
the same measurement. Stability is often expressed in a so-called
correlation coefficient (a measurement for association, see Chapter
11). This correlation coefficient can assume all
values between \(+1\) and \(-1\). The more similar the first and second measurement,
the higher the correlation is, and the higher the association between the first and second
measurement. Conversely, the lower the association between the first and second
measurement is, the lower too the correlation is.

Stable measurements nevertheless rarely occur in research on (language) behaviour. If is a test is taken twice, then there is often a considerable difference in scores on the first measurement point and scores on the second measurement point.

Example 12.2:In the final examination for Dutch secondary school, pupils typically have to write an essay, which is assessed by two raters. The raters are stable if, after some time, they give the same judgements to the same essays. Thus: if rater A at first gave a grade 8 to an essay, and for the second evaluation sometime later, he/she also gave the same essay an 8, then this rater is (very) stable. If, however, the same rater A gave this same essay a grade 4 on the second evaluation, then this rater is not stable in his/her judgements.

Now, grading essays is a tricky task: criteria are not precisely described and there is a relatively large amount of room for interpretation differences. Accordingly, the stability of judgements is also low; previously, a stability coefficient of even \(0.40\) has been reported.

To calculate a test’s stability, the same test has to be
taken twice; the degree of association between the first and
second measurement is called the *test-retest-reliability*.
In practice, repeatedly sitting a test like this rarely takes place due to
the relatively high costs and relatively low benefits.

Example 12.3:Lata-Caneda et al. (2009) developed a Spanish-language questionnaire consisting of 39 questions, intended for aphasia patients to determine their quality of life. The quality of life is described as “the patient perception about, either the effects of a given disease, or the application of a certain treatment on different aspects of life, especially regarding its consequences on the physical, emotional and social welfare” (Lata-Caneda et al. 2009, 379). The new questionnaire was taken twice with a sample of 23 Spanish-language patients with aphasia as a result of cerebral haemorrhage. The reported test-retest stability for this questionnaire was \(0.95\).

Both homogeneity and stability are expressed as a
coefficient with a value between \(0\) and \(1\) (in practice, negative coefficients
do not occur). How should we interpret the reported
coefficients? Generally, it is of course the case that the higher the coefficient is,
the higher (better) the reliability. But how large should the reliability minimally be
before we can call a test “reliable?” There are no clear rules for this.
However, when considerations have to be made about people, then the test has to
have a reliability of at least \(0.90\) according to the *Nederlands
Instituut van Psychologen* ‘Dutch Institute for Psychologists’ (NIP).
This is, for instance, the case for tests which are used to determine whether or not a child is eligible for a so-called dyslexia declaration.

For research purposes, such a strict requirement for the reliability of a
test is not required. Often, \(0.70\) is used as the lower limit of the
reliability coefficient.

## 12.3 Test theory

Classical test theory refers to the measurement of variable \(x\)
for the \(i\)-th element of a sample consisting of random members of the
population. Test theory posits that each measurement \(x_i\) is composed
of two components, namely a true score \(t_i\) and an error score:

\[\begin{equation}
x_i = t_i + e_i
\tag{12.1}
\end{equation}\]
Imagine that you “actually” weigh \(t=72.0\) kg, and imagine also
that your measured weight is \(x=71.6\) kg, then the error score
is \(e=-0.4\) kg.

A first important assumption in classical test theory is that the deviations \(e_i\) neutralise or cancel each other out (i.e. are zero when averaged out, and thus do not deviate systematically from the true score \(t\)), and that larger deviations above or below occur less often than smaller deviations. This means that the deviations are normally distributed (see §10.3), with \(\mu_e=0\) as mean: \[\begin{equation} \tag{12.2} e_i \sim \mathcal{N}(0,s^2_e) \end{equation}\]

A second important assumption in classical test theory is that there is no relation between the true scores \(t_i\) and the error scores \(e_i\). Since the component \(e_i\) is completely determined by chance, and thus does not have any relation with \(x_i\), the correlation between the true score and the error score is null: \[\begin{equation} \tag{12.3} r_{(t,e)} = 0 \end{equation}\]

The total variance of \(x\) is thus^{29} equal to the sum of the variance
of the true scores and the variance of the error scores:
\[\begin{equation}
\tag{12.4}
s^2_x = s^2_t + s^2_e
\end{equation}\]

When the observed variance \(s^2_x\) proportionately contains much
error variance (i.e. variance of deviations), then the observed scores
have been determined for the most part by chance deviations. That is of course
undesirable. In a such instance, we say that the observed scores
are not reliable; there is much “noise” in the observed scores.

When the error variance in contrast is relatively small, then
the observed scores provide a good reflection of what the true scores are,
and then the observed differences are indeed reliable, i.e. they
are not much determined by chance differences.

In that case, we can also define reliability (symbol \(\rho\)) as the
proportion between true score variance and total variance:

\[\begin{equation}
\tag{12.5}
\rho_{xx} = \frac{s^2_t}{s^2_x}
\end{equation}\]

However, in practice, we cannot use this formula (12.5) to
establish reliability, since we do not know \(s^2_t\). We must thus firstly
estimate what the true score variance is — or what the error variance is, which, after all,
is the complement of the true score variance (see
formula (12.4))^{30}.

The second assumption (in formula (12.3)), that there is no relation between
true score and error score, is, in practice, not always justified.
To illustrate, let us look at the results of a test on

a scale from 1.0 to 10.0. Students with scores of 9 or 10 have a high
true score too (they master the material very well) and thus usually have a low
error score. The students with scores of 1 or 2 also have a low true score
(they master the material very badly) and thus also usually have a low
error score. For the students with scores of 5 or 6,
the situation is different: perhaps they master the material fairly well
but have just given a wrong answer, or perhaps they master the material poorly but have by chance given a good answer. For these students with an observed
score in the middle of the scale, the error scores are relatively larger
than for the students with a score at the ends of the scale. In other domains,
e.g. for reaction times, we see other relations, e.g. that the error score increases
more or less equally with the score itself; there is then a positive relationship
between the true score and the error score (\(\rho_{(t,e)}>0\)). Nevertheless,
the advantages of the classical test theory are so large that we use this
theory as a starting point.

From the formulas (12.4) and (12.5)
above, it also follows that the standard error of measurement is related to the
standard deviation and to the reliability:
\[\begin{equation}
\tag{12.6}
s_e = s_x \sqrt{1-r_{xx}}
\end{equation}\]

This standard error measurement can be understood
as the standard deviation of the error scores \(e_i\), assuming still that
the error scores are normally distributed (formula (12.2)).

Example 12.4:External inspectors doubt whether teachers mark their students’ final papers well. If a student got a 6, should the final paper have perhaps actually been judged as a fail?

Let us assume that the given assessment shows a standard deviation of \(s_x=0.75\), and let us equally assume that an analysis of reliability had shown that \(r_{xx}=0.9\). The standard error measurement is then \(s_e = 0.24\) points (rounded up). The probability that the true score \(t_i\) is smaller than or equal to 5.4 (the minimum for a fail), with an observed score of \(x_i=6.0\) and \(s_e=0.24\), is only \(p=0.006\) (for explanation, see §13.5 below). The final paper’s assessment as a pass is with high probability correct.

## 12.4 Interpretations

Before we look at the different ways of calculating reliability, it is a good idea to pause on the different interpretations of reliability estimations.

First, reliability can be interpreted as the proportion of
true score variance (see formula (12.5)), or
as the proportion of variance which is “systematic.”
This is different from the proportion of variance resulting from the concept-as-defined,
the “valid” variance (see Chapter 5).
The variance resulting from the concept-as-defined is part of the proportion of
true score variance.
However, many other factors may systematically influence
respondents’ scores, such as differences in test experience. If two students \(i\) and \(j\)
possess a concept (let us say: language proficiency) to the same degree, then one of the
students can still score more highly because he or she has done (and practiced) language proficiency
tests more often than the other student. Then, there is no difference
in the concept-as-defined (language proficiency \(T_i = T_j\)),

but there is in another factor (experience), and thus a difference arises
between the students in their ‘true’ scores (\(t_i \neq t_j\)) which
we measure with a valid and reliable language proficiency measurement. When measuring,
deviations and measurement errors appear (\(e_i\) and \(e_j\)), through which
the observed differences between students (\(x_i-x_j\)) can be larger or smaller than their
differences in ‘true’ score (\(t_i-t_j\)).
This is the reason why a reliability estimate always forms the upper limit of
the validity.

A second interpretation of reliability (formula (12.5)) is that of the theoretically expected correlation (see §11.2) between measurements, when measurements are replicated many times. For convenience, we assume that memory and fatigue effects have no effect at all on the second and later measurements. If we were to measure the same people with the same measurement three times, without memory or fatigue effects, then the scores from the first and second measurement, and from the first and third measurement, and from the second and third measurement would always show the same correlation \(\rho\). This correlation thus indicates the extent to which the repeated measurements are consistent, i.e. represent the same unknown true score.

In this interpretation, reliability thus expresses the expected association between scores from the same test taken repeated times. We then interpret the reliability coefficient \(\rho\) as the correlation between two measurements with the same instrument.

Thirdly, reliability can be interpreted as the loss of efficiency in the estimation of the mean score \(\overline{X}\) (Ferguson and Takane 1989, 474). Imagine that we want to establish the mean score of a group of \(n=50\) participants, and for this we use a measurement instrument with reliability \(\rho_{xx}=0.8\). In this case, there is uncertainty in the estimation, which come from the chance deviations \(e_i\) in the measurements. If the measurement instrument were perfectly reliable (\(\rho=1\)), we would only need \(\rho_{xx}\times n = 0.8\times50=40\) participants for the same accuracy in the estimation of \(\overline{X}\) (Ferguson and Takane 1989, 474). As such, we have, as it were, played away 10 participants to compensate for the unreliability of the measurement instrument.

Above, we spoke about measurements with the help of measurement instruments, and below we will talk about ratings done by raters. In these situations, the approach to the notion of ‘reliability’ is always the same. Reliability plays a role in all situations where elements from a sample are measured or assessed by multiple assessors or instruments. Non-final exams and questionnaires can also be such measurement instruments: a non-final exam or questionnaire can be thought of as a composite instrument with which we try to measure an abstract property or condition of the participants. Each question can then be considered as a “measurement instrument” or “assessor” of the respondent’s property or condition. For this, all of the above mentioned insights and interpretations concerning test theory, measurement error and reliability are equally applicable.

## 12.5 Methods for estimating reliability

A measurement’s reliability can be determined in different ways. The most important are:

The

*test-retest method*

We conduct all measurements twice, and then calculate the correlation between the first and the second measurement. The fewer measurement errors and deviations the measurements contain, the higher the correlation and thus also the reliability is. This method is time consuming but can also be applied to a small portion of the measurements. In speech research, this method is indeed used to establish the reliability of phonetic transcriptions: part of the speech recordings are transcribed by a second assessor, and then both transcriptions are compared.The

*parallel forms method*

We have a large collection of measurements which are readily comparable and measure the same construct. We conduct all measurements repeatedly, the first time by combining the measurements of several measurement instruments chosen at random (let’s say A and B and C) and the second time by using other random instruments (let’s say D and E and F). Since the measurement instruments are ‘parallel’ and the same measurement is considered to be measured, the correlation between the first and the second measurement is an indication of the measurement’s reliability.The

*split-half method*

This method is similar to the parallel forms method. The \(k\) questions or instruments are divided into two halves, after which the score is determined within each half. From the correlation \(r_{hh}\) between the scores from the two half tests, the reliability of the whole test can be deduced, \(r_{xx} = \frac{2r_{hh}}{1+r_{hh}}\).

## 12.6 Reliability between assessors

As an example, let us look at language proficiency measurements from students in a foreign language. This construct ‘language proficiency’ is measured in this example by means of two assessors who each, independently of the other, award a grade between 1 and 100 to the student (higher is better). However, when assessing, measurement errors also arise, through which the judgements not only reflect the underlying true score but also a deviation if it, with all the aforementioned assumptions. Let us firstly look at the judgements by the first and second rater (see Table 12.1). For the time being, the final judgement of a student is the mean of the judgements from the first and second rater.

Student | B1 | B2 | B3 |
---|---|---|---|

1 | 67 | 64 | 70 |

2 | 72 | 75 | 74 |

3 | 83 | 89 | 73 |

4 | 68 | 72 | 61 |

5 | 75 | 72 | 77 |

6 | 73 | 73 | 78 |

7 | 73 | 76 | 72 |

8 | 65 | 73 | 72 |

9 | 64 | 68 | 71 |

10 | 70 | 82 | 69 |

\(\overline{x_i}\) | 71.0 | 74.4 | 71.7 |

\(s_i\) | 5.6 | 7.0 | 4.7 |

The judgements of only the first and the second assessor show a
correlation of \(r_{1,2}=.75\). This means (according to the formula
(12.5)) that 75% of the total variance in the
judgements of these two raters can be attributed to differences
between the students rated, and thus 25% of measurement errors (after all,
we have assumed that there are no systematic differences between
raters). The proportion of measurement errors appears to be quite high.
However, we can draw hope from one of our earlier observations, namely that
the raters’ measurement errors are not correlated. The
*combination* of these two raters — the mean score per student
over the two raters — thus provides more reliable measurements
than each of the two raters can do separately. After all, the
measurement errors of the two raters tend to cancel each other out
(see formula (12.2)).
Reread the last two sentences carefully.

Reliability is often expressed as *Cronbach’s Alpha*
(Cortina 1993). This number is a measure for the consistency or homogeneity
of the measurements, and thus also indicates the degree to which the two
raters have rated the same construct. The simplest definition is
based on \(\overline{r}\), the mean correlation between
measurements of \(k\) different raters^{31}.
\[\begin{equation}
\tag{12.7}
\alpha = \frac{k \overline{r}} {1+(k-1)\overline{r}}
\end{equation}\]

Filling in \(k=2\) raters and \(\overline{r}=0.75\) provides \(\alpha=0.86\) (SPSS and R
use a somewhat more complex formula for this, and report
\(\alpha=0.84\)). This measurement for reliability is not only referred to
as Cronbach’s Alpha but also as the Spearman-Brown formula
or the Kuder-Richardson formula 20 (KR=20)^{32}.

The value of Cronbach’s Alpha found is a bit tricky to evaluate since it is also dependent on the number of instruments or raters or questions in the test (Cortina 1993; Gliner, Morgan, and Harmon 2001). For academic research, a lower limit of 0.75 or 0.80 is often used. If the result of the test or measurement is of great importance to the person concerned, as in the case of medical or psychological patient diagnosis, or when recruiting and selecting personnel, then an even higher reliability of \(\alpha=.9\) is recommended (Gliner, Morgan, and Harmon 2001).

If we want to increase reliability to \(\alpha=0.9\) or higher, then
we can achieve that in two ways. The first way is to expand the
number of raters. If we combine more raters in the total score,
then the measurement errors of these raters also better cancel out each other,
and then the total score is thus more reliable. Using
the formula (12.7), we can investigate how many raters
are needed to improve the reliability to \(\alpha=0.90\) or better. We fill in

\(\alpha=0.90\) and again \(\overline{r}=0.75\), and then find an outcome of
minimally \(k=3\) raters. The *increase* in reliability levels off, the more
raters there are already participating: if \(k=2\) then \(\alpha=.84\), if \(k=3\) then
\(\alpha=.84+.06=.90\), if \(k=6\) then \(\alpha=.90+.05=.95\), if \(k=9\) then
\(\alpha=.95+.01=.96\), etc. After all, if there are already 6 raters who are
already readily cancelling out each other’s measurement errors, then 3 extra
assessors add little to the reliability.

The second way of increasing reliability is by reducing the measurement error. We can try to do this, for example, by instructing the raters as well as possible about how they should rate the students’ language proficiency. An assessment protocol and/or instruction can make the deviations between and within raters smaller. Smaller deviations mean smaller measurement errors, and that again means higher correlations between the raters. For an \(\overline{r}=0.8\), we already almost achieve the desired reliability, with only \(k=2\) raters.

A third way of increasing reliability requires a closer analysis of
the separate raters. To explain this, we also now involve the third
rater in our considerations (see Table
12.1). However, the judgements of the third rater show
low correlations with those of the first and second
assessor: \(r_{1,3}=0.41\) and \(r_{2,3}=0.09\). As a consequence, the mean
correlation between assessors is now lower,
\(\overline{r}=0.42\). As a result of taking this third rater, the reliability
has not risen but instead actually lowered to
\(\alpha = \frac{3\times0.42}{1+2\times0.42} = 0.68\). We can thus
perhaps better ignore the measurements of the third rater.
Also if we investigate the reliability of a non-final exam or test or questionnaire
it can seem that the reliability of a whole test *increases* if some
“bad” questions are removed. Apparently, these “bad” questions measured
a construct which differed from what the remaining questions measured.

## 12.7 Reliability and construct validity

When a measurement is reliable, then “something” has been measured reliably.
But this still does not show *what* has been measured! There is a
relation between reliability (how measurements are made) and construct
validity (what is measured, see
Chapter 5), but these two terms are not identical.
Sufficient reliability is a requirement for validity, but
is not a sufficient condition for it. Put otherwise: a test which
is not reliable can also not be valid (since this test also measures noise),

but a test which is reliable does not have to be valid.
Perhaps, the test used does measure another construct other than what
was intended very reliably.

An instrument is construct valid if the concept measured matches the intended concept or construct. In Example 12.3, the questionnaire is valid if the score from the questionnaire matches the quality of life (whatever that actually is) of the aphasia patients. Only once it has been shown that an instrument is reliable, is it meaningful to speak about a measurement’s construct validity. Reliability is a necessary but not a sufficient condition for construct validity. An unreliable instrument can thus not be valid but a reliable instrument does not necessarily have to be valid.

To measure reading proficiency, we get the pupils to write
an essay. We count the number of letters *e* in each essay. This is
a very reliable measurement: different raters arrive at the same
number of *e*’s (raters are homogenous) and the same rater always also
delivers the same outcome (raters are stable). The great objection here is that
the number of *e*’s in an essay does not or does not necessarily match the concept
of reading proficiency. A pupil who has incorporated more *e*’s into his/her essay
is not necessarily a better writer.

Whilst researchers know that reliability is a necessary but not sufficient condition for validity, they do not always use these terms carefully. In many studies, it is tacitly assumed that if the reliability is sufficient, the validity is also then guaranteed. In Example 12.3 too, the difference is not made clear and the researchers do not discuss the construct validity of their new questionnaire explicitly.

## 12.8 SPSS

For a reliability analysis of the \(k=3\) judgements over
language proficiency in
Table 12.1:

`Analyze > Scale > Reliability Analysis...`

Select the variables which are considered to measure the same construct;
here that is three raters. We look at these \(k=3\)
assessors as “items” who measure the property “language proficiency” of 10
students. Drag these variables to the Variable(s) panel.

As Scale label, fill in an indication of the construct, e.g.
`language proficiency`

.

As a method, choose `Alpha`

for Cronbach’s Alpha (see formula
(12.7))

Choose `Statistics…`

, tick: Descriptives for
`Item, Scale, Scale if item deleted`

, Inter-Item `Correlations`

,
Summaries `Means, Variances`

, and confirm with `Continue`

and again
with `OK`

.

The output includes Cronbach’s Alpha, the desired inter-item correlations (particularly high between raters 1 and 2), and (in Table Item-Total Statistics) the reliability if we remove a certain rater. This last output teaches us that raters 1 and 2 are more important than rater 3. If we were to replace raters 1 or 2, then the reliability would collapse but if we were to remove rater 3 then the reliability would even increase (from 0.68 to 0.84). Presumably, this rater has rated a different concept to the others.

## 12.9 JASP

For a reliability analysis of \(k=3\) language proficiency
judgements in Table 12.1: in the top menu bar, choose

`Reliability > Classical: Single-Test Reliability Analysis`

Perhaps the button for `Reliability`

is not yet visible in the top menu bar; if so, you can add it yourself by clicking the huge blue **+** button at the righthand side of the top menu bar, and then check the `Reliability`

module.

Select those variables that presumably measure the same construct, in this example the three raters’ judgements, and put those in the field “Variables.” We regard the \(k=3\) raters or judges as *items* (test questions) that measure the construct of “speaking skill” of 10 students.

Open the menu bar `Single-Test Reliability`

, and under “Scale Statistics” check `Cronbach's a`

(see equation (12.7)). Aso check the options for `Average interitem correlation`

, `Mean`

and `Standard deviation`

to inspect these properties of the joint scale of speaking skill. (This joint scale is constructed by averaging the values of the three raters).

Under “Individual Item Statistics” check the options for `Cronbach's a (if item dropped)`

and `Item-rest correlation`

, as well as `Mean`

and `Standard deviation`

in order to obtain these values for the separate items (raters).

The output table titled **Frequentist Scale Reliability Statistics** reports Cronbach’s Alpha.
The output table titled **Frequentist Individual Item Reliability Statistics** contains the column “If item dropped” which reports the Cronbach’s Alpha reliability coefficient if we would drop or ignore a particular item or rater. In this example, the output shows that raters 1 and 2 are more important and relevant than rater 3. If we would remove either rater 1 or 2 then the reliability would collapse, but if we would remove rater 3 then the reliability would even improve (from 0.68 to 0.84). Presumably this rater 3 has measured some other construct than the other two raters.

The output does contain the average inter-item correlation (in the table **Frequentist Scale Reliability Statistics**), but not the correlations among all the individual items or raters.
We need to obtain these correlations explicitly (see §11.2), by going to the top menu bar and then choosing:

`Regression > Classical: Correlation`

In the field “Variables,” select the variables of which you want to know the correlations (the three raters, in this example).
Make sure that under “Sample Correlation Coefficient,” option `Pearson's r`

is checked.
Under “Additional Options,” check `Report significance`

, `Flag significant correlations`

and `Sample size`

. You may also check the option `Display pairwise`

to obtain simple table output. The output confirms that the inter-rater correlation between raters 1 and 2 is rather high (and significant), and that the inter-rater correlations involving rater 3 are rather low (and not significant).

## 12.10 R

For a reliability analysis of \(k=3\) language proficiency
judgements in Table 12.1:

```
<- read.table(file="data/beoordelaars.txt", header=TRUE)
raters if (require(psych)) { # for psych::alpha
alpha( raters[,2:4] ) # columns 2 to 4
}
```

`## Number of categories should be increased in order to count frequencies.`

```
##
## Reliability analysis
## Call: alpha(x = raters[, 2:4])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.68 0.68 0.74 0.41 2.1 0.17 72 4.6 0.41
##
## lower alpha upper 95% confidence boundaries
## 0.35 0.68 1.01
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## B1 0.15 0.16 0.088 0.088 0.19 0.497 NA 0.088
## B2 0.58 0.58 0.410 0.410 1.39 0.264 NA 0.410
## B3 0.84 0.85 0.745 0.745 5.84 0.095 NA 0.745
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## B1 10 0.93 0.92 0.91 0.81 71 5.6
## B2 10 0.84 0.78 0.72 0.53 74 7.0
## B3 10 0.56 0.64 0.38 0.25 72 4.7
```

This output includes Cronbach’s Alpha (`raw_alpha 0.68`

), and the
reliability if we would remove or drop a certain rater.
If we were to remove rater 3, then the reliability would even
improve (from 0.68 to 0.84). Over all three raters,
`average_r=0.41`

.

The output does contain the average inter-item correlation but not the correlations among all the individual items or raters. We need to obtain these correlations explicitly (see §11.2):

`cor( raters[ ,c("B1","B2","B3") ] ) # explicit column names`

```
## B1 B2 B3
## B1 1.0000000 0.74494845 0.40979738
## B2 0.7449484 1.00000000 0.08845909
## B3 0.4097974 0.08845909 1.00000000
```

The output confirms that the inter-rater correlation between raters 1 and 2 is rather high, and that the inter-rater correlations involving rater 3 are rather low.

### References

*Journal of Applied Psychology*78 (1): 98–104.

*Statistical Analysis in Psychology and Education*. 6e ed. New York: McGraw-Hill.

*Journal of the American Academy of Child and Adolescent Psychiatry*40 (4): 486–88.

*European Journal of Physical and Rehabilitation Medicine*45 (3): 379–84. https://www.ncbi.nlm.nih.gov/pubmed/19156021].

https://www.collinsdictionary.com/dictionary/english/reliable↩︎

\(s^2_{(t+e)} = s^2_t + s^2_e + 2 r_{(t,e)} s_t s_e\), with here \(r_{(t,e)}=0\) according to the formula(12.3).↩︎

An exception to this is a situation in which \(s^2_x=0\), and thus \(s^2_t=0\), thus reliability \(\rho=0\); the dependent variable \(x\) has then not been operationalised well.↩︎

In our example, there are only \(k=2\) raters, thus there is only one correlation, and \(\overline{r} = r_{1,2} = 0.75\).↩︎

The so-called ‘intra-class correlation coefficient’ (ICC) for \(k\) is likewise identical to the Cronbach’s Alpha.↩︎