2 Hypothesis testing research

2.1 Introduction

Many empirical studies pursue the goal of establishing connections between (supposed) causes and their (supposed) effects or consequences. The researcher would like to know whether one variable has an influence on another. Their research tests the hypothesis that there is a connection between the supposed cause and the supposed effect (see Table 2.1). The best way to establish such a connection, and, thus, to test this hypothesis, is an experiment. An experiment that has been set up properly and is well executed is the ‘gold standard’ in many academic disciplines, because it offers significant guarantees concerning the validity of the conclusions drawn from it (see Chapter 5). Put differently: the outcome of a good experiment forms the strongest possible evidence for a connection between the variables investigated. As we discussed in Chapter 1, there are also many other forms of research, and hypotheses can also be investigated in other ways and according to other paradigms, but we will limit ourselves here to experimental research.

Table 2.1: Possible causes and possible effects.
Domain     | Supposed cause                     | Supposed effect
trade      | outside temperature                | units of ice cream sold
healthcare | type of treatment                  | degree of recovery
education  | method of instruction              | performance on test
language   | age at which L2 learning starts    | degree of proficiency
education  | class size                         | general performance in school
healthcare | altitude                           | rate of malaria infection
language   | age                                | speaking rate (speech tempo)

In experimental research, the effect of a variable manipulated by the researcher on some other variable is investigated. The introduction already provided an example of an experimental study. A novel teaching method was tested by dividing students between two groups. One group was taught according to the novel method, while the other group was taught as usual. The researcher hoped and expected that her or his novel teaching method would have a beneficial effect, meaning that it would lead to better student performance.

In hypothesis testing research, it is examined whether the variables investigated are indeed connected to one another in the way expected by the researcher. Two terms play a central role in this definition: ‘variables’ and ‘in the way expected.’ Before we consider experimental research in more detail, we will first take a closer look at these terms.

2.2 Variables

What is a variable? Roughly speaking, a variable is a particular kind of property of objects or people: a property that may vary, i.e., take different values. Let us look at two properties of people: how many siblings they have, and whether their mother is a woman or a man. The first property may vary between individuals, and is thus a (between-subject or between-participants) variable. The second property may not vary: if there is a mother, she will always be a woman by definition [at least, traditionally]. Thus, the second property is not a variable, but a constant.

In our world, almost everything exists in varying quantities, in varying manners, or to various extents. Even a difficult-to-define property, like a person’s popularity within a certain group, may form a variable. This is because we can rank people in a group from most to least popular. There are ample examples of variables:

  • regarding individuals: their height, their weight, shoe size, speaking rate, number of siblings, number of children, political preference, income, sex, popularity within a group, etc.

  • regarding texts: the total number of words (‘tokens’), the number of unique words (‘types’), number of typos, number of sentences, number of punctuation marks, etc.

  • regarding words: their frequency of use, number of syllables, number of sounds, grammatical category, etc.

  • regarding objects such as cars, phones, etc.: their weight, number of components, energy use, price, etc.

  • regarding organizations: the number of their employees, their postal code, financial turnover, numbers of customers or patients or students, number of surgeries or transactions performed or number of degrees awarded, type of organization (corporation, non-profit, …), etc.

2.3 Independent and dependent variables

In hypothesis testing research, we distinguish two types of variables: dependent and independent variables. The independent variable is whatever is presumed to bring about the supposed effect. The independent variable is the aspect that a researcher will manipulate in a study. In our example where an experiment is conducted to evaluate the effects of a new teaching method, the teaching method is the independent variable. When we compare performance between the students that were taught using the new method and those whose writing instruction only followed the traditional method, we can see that the independent variable takes on two values. In this case, we can give these two values (also called levels) that the independent variable can take the names of “experimental” and “control,” or “new” and “old.” We might also express the independent variable’s values as a number: 1 and 0, respectively. These numbers do not have a numerical interpretation (for instance, we might as well give these values the names 17 and 23, respectively), but are used here solely as arbitrary labels to distinguish between groups. The manipulated variable is called ‘independent’ because the chosen (manipulated) values of this variable are not dependent on anything else in the study: the researcher is independent in their choice of this variable’s values. An independent variable is also called a factor or a predictor.

The second type of variable is the dependent variable. The dependent variable is the variable for which we expect the supposed effect to take place. This means that the independent variable possibly causes an effect on the dependent variable, or: it is presumed that the dependent variable’s value depends on the independent variable’s value, hence their names. An observed value for the dependent variable is also called a response or score; oftentimes, the dependent variable itself may also be given these names. In our example where an experiment is conducted to evaluate the effect a new teaching method has on students’ performance, the students’ performance is the dependent variable. Other examples of possible dependent variables include speaking rate, score on a questionnaire, or the rate at which a product is sold (see Table 2.1). In short, any variable could be used as the dependent variable, in principle. It is mainly the research question that determines which dependent variable is chosen, and how it is measured.

This being said, it must be stressed that independent and dependent variables themselves must not be interpreted as ‘cause’ and ‘effect,’ respectively. This is because the study has as its goal to convincingly demonstrate the existence of a (causal) connection between the independent and the dependent variable. However, Chapter 5 will show us how complex this can be.

The researcher varies the independent variable and observes whether this results in differences observed in the dependent variable. If the dependent variable’s values differ before and after manipulating the independent variable, we may assume that this is an effect that the manipulation has on the dependent variable. We may speak of a relationship between both variables. If the dependent variable’s value does not differ under the influence of the independent variable’s values, then there is no connection between the two variables.

Example 2.1: Quené, Semin, and Foroni (2012) investigated whether a smile or frown influences how listeners process spoken words. The words were ‘pronounced’ (synthesized) by a computer in various phonetic variants, specifically in such a way that these words sounded as if pronounced neutrally, with a smile, or with a frown. Listeners had to classify the words as ‘positive’ or ‘negative’ (in meaning) as quickly as possible. In this study, the phonetic variant (neutral, smile, frown) takes the place of the independent variable, and the speed with which listeners give their judgment is the dependent variable.

2.4 Falsification and null hypothesis

The goal of scientific research is to arrive at a coherent collection of “justified true beliefs” (Morton 2003). This means that a scientific belief must be properly motivated and justified (and must be coherent with other beliefs). How may we arrive at such a proper motivation and justification? For this, we will first refer back to the so-called induction problem discussed by Hume (1739). Hume found that it is logically impossible to generalize a statement from a number of specific cases (the observations in a study) to a general rule (all possible observations in the universe).

We will illustrate the problem inherent in this generalization or induction step with the belief that ‘all swans are white.’ If I had observed 10 swans that are all white, I might consider this as a motivation for this belief. However, this generalization might be unjustified: perhaps swans also exist in different colours, even if I might not have seen these. The same problem of induction remains even if I had seen 100 or 1000 white swans. However, what if I had seen a single black swan? In that case, I will know immediately and with certainty that the belief of all swans being white is false. This principle is also used in scientific research.

Let us return to our earlier example in which we presumed that a new teaching method will work better than an older teaching method; this belief is called H1. Let us now turn this reasoning on its head, and base ourselves on the complementary belief that the new method is not better than the old one; this belief is called the null hypothesis or H0. This belief that ‘all methods have an equal effect’ is analogous to the belief that ‘all swans are white’ from the example given in the previous paragraph. How can we then test whether the belief or hypothesis called H0 is true? For this, let us draw a representative sample of students (see Chapter 7) and randomly assign students to the new or old teaching method (values of the independent variable); we then observe all participating students’ performance (dependent variable), following the same protocol in all cases. For the time being, we presume that H0 is true. This means that we expect no difference between the student groups’ performance. If, despite this, the students taught by the new method turn out to perform much better than the students taught by the old method, then this observed difference forms the metaphorical black swan: the observed difference (which contradicts H0) makes it unlikely that H0 is true (provided that the study was valid; see Chapter 5 for more on this). Because H0 and H1 exclude each other, this means that it is very likely that H1 is indeed true. And because we based our motivation upon H0 and not H1, sceptics cannot accuse us of being biased: after all, we did try to show that there was indeed no difference between the performance exhibited by the students in each group.
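The random assignment step mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_groups`, the group labels, the sample of 20 students, and the fixed seed are all hypothetical and do not come from the study described here.

```python
import random


def assign_groups(participants, seed=None):
    """Randomly split participants into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seeded generator keeps the split reproducible
    shuffled = list(participants)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # The labels "new" and "old" are arbitrary names for the two levels
    # of the independent variable.
    return {"new": shuffled[:half], "old": shuffled[half:]}


# Hypothetical sample of 20 students, identified only by a label.
students = [f"student_{i:02d}" for i in range(1, 21)]
groups = assign_groups(students, seed=42)
print(len(groups["new"]), len(groups["old"]))
```

Because chance alone decides who ends up in which group, any systematic difference between the groups before the treatment is unlikely, which is what allows an observed difference afterwards to count against H0.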

The method just described is called falsification, because we gain knowledge by rejecting (falsifying) hypotheses, and not by accepting (verifying) hypotheses. This method was developed by philosopher of science Karl Popper (Popper 1935, 1959, 1963). The falsification method has interesting similarities to the theory of evolution. Through variation between individual organisms, some can successfully reproduce, while many others die prematurely and/or do not reproduce. Analogously, some tentative statements cannot be refuted, allowing them to ‘survive’ and ‘reproduce,’ while many other statements are indeed refuted, through which they ‘die.’ In the words of Popper (1963, p. 51, italics removed):

" … to explain (the world) … as far as possible, with the help of laws and explanatory theories …there is no more rational procedure than the method of trial and error — of conjecture and refutation: of boldly proposing theories; of trying our best to show that these are erroneous; and of accepting them tentatively if our critical efforts are unsuccessful."

Thus, a proper scientific statement or theory ought to be falsifiable or refutable or testable (Popper 1963). In other words, it must be possible to prove this statement or theory wrong. A testable statement’s scientific motivation, and, therefore, its plausibility increases with each time this statement proves to be immune to falsification, and with each new set of circumstances under which this happens. ‘Earth’s climate is warming up’ is a good example of a statement that is becoming increasingly immune to falsification, and, therefore, is becoming increasingly stronger.

Example 2.2: ‘All swans are white’ and ‘Earth’s climate is warming up’ are falsifiable, and therefore scientifically useful statements. What about the following statements?
a. Gold dissolves in water.
b. Salt dissolves in water.
c. Women talk more than men.
d. Coldplay’s music is better than U2’s.
e. Coldplay’s music sells better than U2’s.
f. If a patient rejects a psychoanalyst’s reading, then this is a consequence of their resistance to the fact that the psychoanalyst’s reading is correct.
g. Global warming is caused by human activity.

2.5 The empirical cycle

So far, we have provided a rather global introduction to experimental research. In this section, we will describe the course of an experimental study in a more systematic way. Throughout the years, various schemata have been devised that describe research in terms of phases. The best known of these schemata is probably the empirical cycle by De Groot (1961).

The empirical cycle distinguishes five phases of research: the observation phase, the induction phase, the deduction phase, the testing phase, and the evaluation phase. In this last phase, any shortcomings and alternative interpretations are formulated, which lead to potential new studies, each of which once again goes through the entire series of phases (hence the name, ‘cycle’). We will now look at each of these five phases of research one by one.

2.5.1 observation

In this phase, the researcher constructs a problem. This is to say, the researcher forms an idea of possible relationships between various (theoretical) concepts or constructs. These presumptions will later be worked out into more general hypotheses. Presumptions like these may come about in myriad ways, but all require the researcher to have sufficient curiosity. The researcher may notice an unusual phenomenon that needs an explanation, e.g., the phenomenon that the ability to hear absolute pitch occurs much more often in Chinese musicians than in American ones (Deutsch 2006). Systematic surveys of scientific publications may also lead to presumptions. Sometimes, it turns out that different studies’ results contradict each other, or that there is a clear gap in our knowledge.

Presumptions can also be based on case studies: these are studies in which one or several cases are studied in depth and extensively described. For instance, Piaget developed his theory of children’s mental development based on observing his own children during the time he was unemployed. These observations later (when Piaget already had his own laboratory) formed the impetus for many experiments that he used to sharpen and strengthen his theoretical insights.

It is important to realize that purely unbiased and objective observation is not possible. Any observation is influenced by theory or prior knowledge to a greater or smaller extent. If we do not know what to pay attention to, we also cannot observe properly. For instance, those that specialize in the formation of clouds can observe a far greater variety of cloud types than the uninitiated. This means that it is useful to first lay down an explicit theoretical framework, however rudimentary, before making any observations and analysing any facts.

A researcher is prompted by remarkable phenomena, case studies, studying the literature, etc. to arrive at certain presumptions. However, there are no methodological guidelines on how this process should come about: it is a creative process.

2.5.2 induction

During the induction phase, the presumption voiced in the observation phase is generalized. Having started from specific observations, the researcher now formulates a hypothesis that they suspect is valid in general. (Induction is the logical step in which a general claim or hypothesis is derived from specific cases: my children (have) learned to talk \(\rightarrow\) all children (can) learn to talk.)

For instance, from the observation made in their own social circle that women speak more than men do (more minutes per day, and more words per day), a researcher may induce a general hypothesis: H1: women talk more than men do (see Example 2.2; this hypothesis may be further restricted as to time and location).

In addition, the hypothesis’ empirical content must be clearly described, which is to say: the type or class of observations must be properly described. Are we talking about all women and men? Or just speakers of Dutch (or English)? And what about multilingual speakers? And children that are still acquiring their language? This clearly defined content is needed to test the hypothesis (see the subsection on testing below, and see Chapter 13).

Finally, a hypothesis also has to be logically coherent: the hypothesis has to be consistent with other theories or hypotheses. If a hypothesis is not logically coherent, it follows by definition that it cannot be unambiguously related to the empirical realm, which means that it is not properly testable. From this, we can conclude that a hypothesis may not have multiple interpretations: within an experiment, a hypothesis, by itself, must predict one single outcome, and no more than one. In general, three types of hypotheses are distinguished (De Groot 1961):

  • Universal-deterministic hypotheses.
    These take the general shape of all As are B. For example: all swans are white, all human beings can speak. If a researcher can show for one single A that it is not B, then the hypothesis has, in principle, been falsified. A universal-deterministic hypothesis can never be verified: a researcher can only make statements about the cases they have observed or measured. If we are talking about an infinite set, such as all birds, or all human beings, or all heaters, this may lead to problems. The researcher does not know whether such a set might include a single case for which ‘A is not B’, for instance, one bird that cannot fly. Consequently, no statement can be made about these remaining cases, which means that the universal validity of the hypothesis can never be fully ‘proven.’

  • Deterministic existential hypotheses.
    These take the general shape of there is some (at least one) A that is B. For example: there is some swan that is white, there is some human being that can speak, there is some heater that provides warmth. If a researcher can demonstrate that there exists one A that is B, the hypothesis has been verified. However, deterministic existential hypotheses can never be falsified. If we wanted to do that, it would be necessary to investigate all units or individuals in an infinite set for whether they are B, which is exactly what is excluded by the infinite nature of the set. At the same time, this makes it apparent that this type of hypothesis does not lead to generally valid statements, and that its scientific import is not as clear. One could also put it this way: a hypothesis of this type makes no clear predictions for any individual case of A; a given A might be the specific one that is also B, but it might also not be. In this sense, deterministic existential hypotheses do not conform to our criterion of falsifiability.

  • Probabilistic hypotheses.
    These take the general shape of there are relatively more As that are B compared to non-As that are B. In the behavioural sciences, this is by far the most frequently occurring type of hypothesis.
    For example: there are relatively more women that are talkative compared to men that are talkative. Or: there are relatively more highly performing students for the new teaching method compared to the old teaching method. Or: speech errors occur relatively more often at the beginning rather than at the end of the word. This does not entail that all women speak more than all men, nor does this entail that all students taught by the new method perform better than all students taught by the old method.
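One way to make the three shapes explicit is to write them in logical and probabilistic notation. The notation below is a sketch that is not part of the classification as cited: \(A(x)\) and \(B(x)\) are assumed to mean ‘x is an A’ and ‘x is B,’ and \(P(B \mid A)\) is assumed to denote the proportion of As that are B.

\[
\begin{aligned}
\textrm{universal-deterministic:} \quad & \forall x\, \bigl(A(x) \rightarrow B(x)\bigr) \\
\textrm{deterministic existential:} \quad & \exists x\, \bigl(A(x) \land B(x)\bigr) \\
\textrm{probabilistic:} \quad & P(B \mid A) > P(B \mid \lnot A)
\end{aligned}
\]

Read this way, a single counterexample refutes the first shape, a single positive case confirms the second, and the third can only be assessed by comparing proportions in samples.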

2.5.3 deduction

During this phase, specific predictions are deduced from the generally formulated hypothesis set up in the induction phase. (Deduction is the logical step whereby a specific statement or prediction is derived from a more general statement: all children learn to talk \(\rightarrow\) my children (will) learn to talk.)

If we presume (H1) that “women talk more than men,” we can make specific predictions for specific samples. For example, if we interviewed 40 female and 40 male school teachers of Dutch, without giving them a time limit, then we predict that the female teachers in this sample will say more than the male teachers in the sample (including the prediction that they will speak a greater number of syllables in the interview).

As explained above (§2.4), most scientific research does not test H1 itself, but its logical counterpart: H0. Therefore, for testing an H1 (in the next phase of the empirical cycle), we use the predictions derived from H0 (!), for instance: “women and men produce equal numbers of syllables in a comparable interview.”

In practice, the terms “hypothesis” and “prediction” are often used interchangeably, and we often speak of testing hypotheses. However, according to the above terminology, we do not test the hypotheses, but we test predictions that are derived from those hypotheses.

2.5.4 testing

During this phase, we collect empirical observations and compare these to the worked-out predictions made “under H0,” i.e., the predictions made if H0 were to be true. In Chapter 13, we will talk more about this type of testing. Here, we will merely introduce the general principle. (In addition to the conventional “frequentist” approach described here, we may also test hypotheses and compare models using a newer “Bayesian” approach; however, this latter method of testing is outside the scope of this textbook).

If the observations made are extremely unlikely under H0, there are two possibilities.

    (i) The observations are inadequate: we have observed incorrectly. But if the researcher has carried out rigorous checks on their work, and if they take themselves seriously, this is not likely to be true.
    (ii) The prediction was incorrect, meaning that H0 is possibly incorrect, and should be rejected in favour of H1.

In our example above, we derived from H0 (!) the prediction that, within a sample of 40 male and 40 female teachers, men and women will use the same number of syllables in a standardized interview. However, we find that men use 4210 syllables on average, while women use 3926 on average (Quené 2008, 1112). How likely is this difference if H0 were true, assuming that the observations are correct? This probability is so small that the researcher rejects H0 (see option (ii) above) and concludes that women and men do not speak equal numbers of syllables, at least in this study.

In the example above, the testing phase involves comparing two groups, in this case, men and women. One of these two groups is often a neutral or control group, as we saw in the example given earlier of the new and old teaching methods. Why do researchers often make use of a control group of this kind? Imagine that we had only looked at the group taught by the new method. In the testing phase, we measure students’ performance, which is a solid B on average (7 in the Dutch system). Does this mean that the new method is successful? Perhaps it is not: if the students might have gotten an A or A- (8 in the Dutch system) under the old method, the new method would actually be worse, and it would be better not to add this new method to the curriculum. In order to be able to draw a sensible conclusion about this, it is essential to compare the new and old methods with one another. This is the reason why many studies involve components like a neutral condition, null condition, control group, or placebo treatment.

Now that we know this, how can we determine the probability of the observations we made if H0 were to be true? This is often a somewhat complex question, but, for present purposes, we will give a simple example as an illustration: tossing a coin and observing heads or tails. We presume (H0): we are dealing with a fair coin, the probability of heads is \(1/2\) at each toss. We toss the same coin 10 times, and, miraculously, we observe the outcome of heads all 10 times. The chance of this happening, given that H0 is true, is \(P = (1/2)^{10} = 1/1024\). Thus, if H0 were to be true, this outcome would be highly unlikely (even though the outcome is not impossible, since \(P > 0\)); hence, we reject H0. Therefore, we conclude that the coin most likely is not a fair coin.
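The arithmetic of the coin example can be checked with a short Python sketch. The function name `prob_all_heads` is a hypothetical helper introduced only for this illustration; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def prob_all_heads(n_tosses, p_heads=Fraction(1, 2)):
    """Probability that n independent tosses all come up heads."""
    # Independent tosses: multiply the per-toss probability n times.
    return p_heads ** n_tosses


p = prob_all_heads(10)
print(p, float(p))  # prints: 1/1024 0.0009765625
```

The probability is greater than zero, so the outcome is possible under H0, but it is small enough to make H0 implausible.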

This leads us to an important point: when is an outcome unlikely enough for us to reject H0? Which criterion do we use for the probability of the observations made if H0 were to be true? This is the question of the level of significance, i.e., the level of probability at which we decide to reject H0. This level is signified as \(\alpha\). If a study uses a level of significance of \(\alpha = 0.05\), then H0 is rejected if the probability of finding these results under H0 is smaller than 5%. In this case, the outcome is so unlikely that we choose to reject H0 (option (ii) above), i.e., we conclude that H0 is most probably not true.

If we thus reject H0, there is a small chance that we are actually dealing with option (i): H0 is actually true, but the observations happen by chance to strongly diverge from the prediction under H0, and H0 is falsely rejected. This is called a Type I error. This type of error can be compared to unjustly sentencing an innocent person, or undeservedly classifying an innocent email message as ‘spam.’ Most of the time, \(\alpha = 0.05\) is used, but other levels of significance are also possible, and sometimes more prudent.

Note that significance is the probability of finding the extreme data that were observed (or data even more extreme than that) given that H0 is true: \[\textrm{significance} = P(\textrm{data}|\textrm{H0})\] Most importantly, significance is not the probability of H0 being true given these data, \(P(\textrm{H0}|\textrm{data})\), even though we do encounter this mistake quite often.
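The difference between these two conditional probabilities can be made explicit with Bayes’ rule, a standard result of probability theory that is not part of the frequentist procedure described here. The prior probability \(P(\textrm{H0})\) and the overall probability \(P(\textrm{data})\) are additional quantities that the significance calculation does not use:

\[P(\textrm{H0}|\textrm{data}) = \frac{P(\textrm{data}|\textrm{H0}) \, P(\textrm{H0})}{P(\textrm{data})}\]

The two conditional probabilities coincide only if \(P(\textrm{H0}) = P(\textrm{data})\), which there is normally no reason to assume.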

Each form of testing also involves the risk of making the opposite mistake, i.e., not rejecting H0 even though it should be rejected. This is called a Type II error: H0 is, in fact, false (meaning that H1 is true), but, nevertheless, H0 is not rejected. This type of mistake can be compared to unjustly acquitting a guilty person, or undeservedly letting through a spam email message (see Table 2.2).

Table 2.2: Possible outcomes of the decision procedure.
Reality                    | Decision: Reject H0       | Decision: Maintain H0
H0 is true (H1 false)      | Type I error (\(\alpha\)) | correct
H0 is false (H1 true)      | correct                   | Type II error (\(\beta\))
Reality                    | Convict defendant         | Acquit defendant
defendant is innocent (H0) | Type I error              | correct
defendant is guilty        | correct                   | Type II error
Reality                    | Discard message           | Allow message
message is OK (H0)         | Type I error              | correct
message is spam            | correct                   | Type II error

If we set the level of significance to a higher value, e.g., \(\alpha = .20\), this also means that the chance of rejecting H0 is much higher. In the testing phase, we would reject H0 if the probability of observing these data (or any more extreme data) were smaller than 20%. This would mean that 8 times heads within 10 coin tosses would be enough to reject H0 (i.e., judging the coin as unfair). Thus, more outcomes are possible that lead to rejecting H0. Consequently, this higher level of significance entails a greater risk of a Type I error, and, at the same time, a smaller risk of a Type II error. The balance between the two types of error depends on the exact circumstances under which the study is conducted, and on the consequences that each of the two types of error might have. Which type of error is worse: throwing away an innocent email, or letting a spam message through? The probability of making a Type I error (the level of significance) is controlled by the researcher themselves. The probability of a Type II error depends on three factors and is difficult to gauge. Chapter 14 will discuss this in more detail.
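The trade-off between the two types of error can be made concrete for the coin example. The sketch below rests on assumptions that are not in the text: it uses a two-sided test (outcomes at least as far from 5 heads in either direction count as ‘more extreme’), and it evaluates the Type II error for one hypothetical unfair coin with a probability of heads of 0.7. All function and variable names are introduced here for illustration only.

```python
from fractions import Fraction
from math import comb

N = 10  # number of tosses, as in the coin example


def prob_heads(k, p):
    """Binomial probability of exactly k heads in N tosses with P(heads) = p."""
    return comb(N, k) * p**k * (1 - p) ** (N - k)


def two_sided_p(k):
    """Probability under H0 (fair coin) of a count at least as far from N/2 as k."""
    half = Fraction(1, 2)
    return sum(
        prob_heads(i, half)
        for i in range(N + 1)
        if abs(2 * i - N) >= abs(2 * k - N)
    )


def rejection_region(alpha):
    """All head counts whose two-sided probability under H0 is below alpha."""
    return [k for k in range(N + 1) if two_sided_p(k) < alpha]


BIASED = Fraction(7, 10)  # hypothetical unfair coin, chosen only for illustration

for alpha in (Fraction(1, 20), Fraction(1, 5)):
    region = rejection_region(alpha)
    # Type I error: H0 is true (fair coin), yet the count falls in the region.
    type_i = sum(prob_heads(k, Fraction(1, 2)) for k in region)
    # Type II error: H0 is false (biased coin), yet the count falls outside it.
    type_ii = 1 - sum(prob_heads(k, BIASED) for k in region)
    print(
        f"alpha={float(alpha):.2f} reject at {region} "
        f"P(Type I)={float(type_i):.3f} P(Type II)={float(type_ii):.3f}"
    )
```

Under these assumptions, raising \(\alpha\) from .05 to .20 widens the set of rejecting outcomes from 0, 1, 9 or 10 heads to 0, 1, 2, 8, 9 or 10 heads. The probability of a Type I error rises from about 0.02 to about 0.11, while the probability of a Type II error for this particular unfair coin falls from about 0.85 to about 0.62.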

2.5.5 evaluation

At the end of their study, the researcher has to evaluate the results the study yielded: what do they amount to? The question posed here is not merely whether the results favour the theory that was tested. The goal is to provide a critical review of the way in which the data were collected, the steps of reasoning employed, questions of operationalization, any possible alternative explanations, as well as what the results themselves entail. The results must be put in a broader context and discussed. Perhaps the conclusions will also lead to recommendations, for example, recommendations for clinical applications or for educational practice. This is also the appropriate moment to suggest ideas for alternative or follow-up studies.

During this phase, the aim is primarily to interpret the results, a process in which the researcher plays an important and personal role as the one who provides the interpretation. Different researchers may interpret the same results in widely different ways. Finally, in some cases, results will contradict the outcome that was predicted or desired.

2.6 Making choices

Research consists of a sequence of choices: from the inspirational observations during the first phase, to the operational decisions involved in performing the actual study, to interpreting the results during the last stage. Rarely will a researcher be able to make the best decision for every choice point, but they must remain vigilant about the possibility of making a bad decision along the way. The entire study is only as strong as its weakest link: it is only as good as the worst choice in its sequence of choices. As an illustration, we will provide an overview of the choices a researcher has to make throughout the empirical cycle.

The first choice that has to be made concerns the formulation of the problem. Some relevant questions that the researcher has to answer at that moment include: how do I recognize a certain research question, is research the right choice in this situation, is it possible to research this idea? The best answers to such questions depend on various factors, such as the researcher’s view of humankind and society, any wishes their superiors or sponsors might have, financial and practical (im)possibilities, etc.

The research question does have to be answerable given the methods and means available. However, within this restriction, the research question may relate to any aspect of reality, regardless of whether this aspect is seen as irrelevant or important. There are many examples of research that was initially dismissed as irrelevant, but, nevertheless, did turn out to have scientific value, for instance, a study on the question: “is ‘Huh?’ a universal word?” (Dingemanse, Torreira, and Enfield 2013) (Example 1.1). In addition, some ideas that were initially dismissed as false later did turn out to be in accordance with reality. For instance, Galilei’s statement that Earth revolved around the Sun was once called unjustified. In short, research questions should not be rejected too soon for being ‘useless,’ ‘platitudes,’ ‘irrelevant,’ or ‘trivial.’

If the researcher decides to continue their study, the next step is usually studying the literature. Most research handbooks recommend doing a sizeable amount of reading, but how is an appropriate collection of literature found? Of course, the relevant research literature on the area of knowledge in question must be looked at. Fortunately, these days, there are various resources for finding relevant academic publications. For this, we recommend exploring the pointers and so-called “libguides” offered by the Utrecht University Library (see http://www.uu.nl/library and http://libguides.library.uu.nl/home_en). We would also like to warmly recommend the guide by Sanders (2011), which contains many extremely helpful tips to use when searching for relevant research literature.

During the next phase, the first methodological problems start appearing: the researcher has to formulate the problem more precisely. One important decision that has to be made at that point is whether the problem posed here is actually suited for research (§2.4). For instance, a question like “what is the effect of the age of onset of learning on fluency in a foreign language?” cannot be researched in this form. The question must be specified further. Crucial concepts must be (re)defined: what is the age of onset of learning? What is language fluency? What is an effect? And how do we define a foreign language? How is the population defined? The researcher is confronted with various questions regarding definitions and operationalization: Is the way concepts are defined theoretical, or empirical, or pragmatic in nature? Which instruments are used to measure the various constructs? But also: what degree of complexity should this study have? Practically speaking, would this allow for the entire study to be completed? In which way should data be collected? Would it be possible at all to collect the desired data, or might respondents never be able or willing to answer such questions? Is the proposed manipulation ethically sound? How great is the distance between the theoretical construct and the way in which it will be measured? If anything goes wrong during this phase, this will have a direct effect upon the rest of the study.

If a problem has been successfully formulated and operationalized, a further exploration of the literature follows. This second bout of literature study is much more focussed on the research question that has been worked out by this point, compared to the broad exploration of the literature mentioned earlier. On the grounds of earlier publications, the researcher might reconsider their original formulation of the problem. Not only does one have to look at the literature in terms of theoretical content, but one should also pay attention to examples of how core concepts are operationalized. Have these concepts been properly operationalized, and if there are different ways of operationalizing them, what is the reason behind these differences? In addition, would it be possible to operationalize the core concepts in such a way that the distance between the concept-as-intended and the concept-as-defined becomes (even) smaller (§??)? The pointers given above with regard to searching for academic literature are useful here, as well. After this, the researcher is to (once again) reflect upon the purpose of the study. Depending on the problem under consideration, questions such as the following should be asked: does the study contribute to our knowledge within a certain domain, does the study create solutions for known stumbling blocks or problems, or does the study contribute to the potential development of such solutions? Does the research question still cover the original problem (or question) identified by superiors or sponsors? Are the available facilities, funds, and practical circumstances sufficient to conduct the study?

During the next step, the researcher must specify how data will be collected. This is an essential step, which influences the rest of the study; for this reason, we will devote an entire chapter to it (Chapter ??). What constitutes the population: language users? Students? Bilingual infants? Speech errors involving consonants? Sentences? And what is the best way to draw a representative sample (or samples) from this population (or populations)? What sample size is best? In addition, this phase involves choosing a method of analysis. Moreover, it is advisable to design a plan of analysis at this stage. Which analyses will be performed, what ways of exploring the data are envisioned?
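As a rough illustration of the sample-size and sampling questions raised above, the following sketch uses Python's standard library only. It rests on assumptions that are not part of this chapter: two groups are compared on their means, the expected standardized effect size is 0.5, the test is two-sided with a significance level of 5% and a desired power of 80%, and the population is a hypothetical list of 2000 student identifiers. The normal-approximation formula used here is one common textbook shortcut; it tends to give a slightly smaller number than an exact calculation based on the t distribution.

```python
import math
import random
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate number of participants per group for a two-sided
    comparison of two group means (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


def draw_simple_random_sample(population, n, seed=None):
    """Draw n distinct units from the population, each equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, n)


# Assumed (hypothetical) effect size; not taken from any study in this text.
n_per_group = sample_size_per_group(effect_size=0.5)

# Hypothetical sampling frame: identifiers of 2000 students.
population = [f"student_{i:04d}" for i in range(1, 2001)]
sample = draw_simple_random_sample(population, 2 * n_per_group, seed=1)

print(n_per_group, len(sample))
```

Writing such a calculation into the plan of analysis before data collection forces the researcher to state the expected effect size explicitly, which is itself one of the choices discussed in this section.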

All the choices mentioned so far are not yet sufficient for finishing one’s preparations. One must also choose one’s instruments: which devices, recording tools, questionnaires, etc., will be used to make observations? Do suitable instruments already exist? If so, are these easily accessible and does the researcher have permission to use them? If not, instruments must be developed first (§??). However, in this latter case, the researcher must also take it upon themselves to test these instruments first: to check whether the data obtained with these instruments conform to the quality standards that are either set by the researcher or that may be generally expected of instruments used in scientific research (in terms of reliability and validity, see Chapters 5 and 12).
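One possible check on a newly developed questionnaire is its internal consistency, often summarized as Cronbach's alpha. The sketch below is only an illustration under assumptions that go beyond this chapter: the instrument is a questionnaire whose items are meant to measure the same construct, every respondent answered every item, and the pilot scores shown are invented. Alpha is just one of several reliability measures and says nothing about validity.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items, where each item is a list of
    scores (one score per respondent, same respondent order in every item)."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are needed")
    # Total score per respondent, summed across items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


# Invented pilot data: 3 questionnaire items answered by 6 respondents.
pilot_items = [
    [4, 5, 3, 2, 4, 1],
    [4, 4, 3, 2, 5, 2],
    [5, 5, 2, 3, 4, 1],
]
print(round(cronbach_alpha(pilot_items), 2))
```

A value close to 1 suggests that the items vary together across respondents; which threshold counts as acceptable is a further choice the researcher has to justify.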

It is only when the instruments, too, have been prepared that the actual empirical study begins: the selected type of data is collected within the selected sample in the selected manner using the selected instruments. During this phase, too, the researcher might encounter various, often practical, problems. An example from actual practice: three days after a researcher had sent out their questionnaire by mail, a nationwide mail workers’ strike was declared, which lasted for two weeks. Unfortunately, the researcher had given the respondents two weeks to respond by mail. This meant that, once the strike was over, the time frame the participants were given to respond had already passed. What was the researcher to do? Lacking any alternatives, our protagonist decided to approach each of the 1020 respondents by phone, asking them to fill out the questionnaire regardless and return it at their earliest convenience.

For the researcher who has invested in devising a plan of analysis in advance, now is the time of harvest. Finally, the analyses that were planned can be performed. Unfortunately, reality usually turns out to be much more stubborn than the researcher might have imagined beforehand. Participants might give unexpected responses or not follow instructions, presumed correlations turn out to be absent, and unexpected (and undesirable) correlations do turn out to be present to a high degree. Later chapters will be devoted to a deeper exploration of various methods of analysis and problems associated with them.

Finally, the researcher must also report on their study. Without an (adequate) research report, the data are not accessible, and the study might as well not have been performed. This is an essential step, which, among other things, involves the question of whether the study may be checked and replicated based on the way it is reported. Usually, research activity is reported in the form of a paper, a research report, or an article in an academic journal. Sometimes, a study is also reported on in a rather more popular journal or magazine, targeted towards an audience broader than just fellow researchers.

This concludes a brief overview of the choices researchers have to make when doing research. Each empirical study consists of a chain of problems, choices, and decisions. The most important choices have been made before the researcher starts collecting data.


De Groot, A. D. 1961. Methodologie: Grondslagen van Onderzoek En Denken in de Gedragswetenschappen. ’s-Gravenhage: Mouton. http://www.dbnl.org/tekst/groo004meth01_01/index.php.
Deutsch, Diana. 2006. “The Enigma of Absolute Pitch.” Acoustics Today 2: 11–19.
Dingemanse, Mark, Francisco Torreira, and N. J. Enfield. 2013. “Is ‘Huh?’ A Universal Word? Conversational Infrastructure and the Convergent Evolution of Linguistic Items.” PLOS One 8 (11): e78273. http://dx.doi.org/10.1371/journal.pone.0078273.
Hume, David. 1739. A Treatise of Human Nature.
Morton, Adam. 2003. A Guide Through the Theory of Knowledge. 3rd ed. Malden, MA: Blackwell.
Popper, Karl. 1935. Logik Der Forschung. Zur Erkenntnistheorie Der Modernen Naturwissenschaft. Wien: Julius Springer.
———. 1959. The Logic of Scientific Discovery. London: Routledge.
———. 1963. Conjectures and Refutations: The Growth of Scientific Knowledge. London: Routledge & Kegan Paul.
Quené, Hugo. 2008. “Multilevel Modeling of Between-Speaker and Within-Speaker Variation in Spontaneous Speech Tempo.” Journal of the Acoustical Society of America 123 (2): 1104–13.
Quené, Hugo, Gün R. Semin, and Francesco Foroni. 2012. “Audible Smiles and Frowns Affect Speech Comprehension.” Speech Communication 54 (7): 917–22.
Sanders, Ewoud. 2011. Eerste Hulp Bij e-Onderzoek Voor Studenten in de Geesteswetenschappen: Slimmer Zoeken, Slimmer Documenteren. Early Dutch Books Online. http://hdl.handle.net/1887/17774.

  1. Two beliefs are complementary when they mutually exclude each other, like H1 and H0 in this example.

  2. More accurately: if the probability of finding either these results or other results that differ even more from those predicted by H0 is smaller than 5%, then H0 is rejected.