Introduction

The goal of experimental research is to test a hypothesis. Hypotheses may also be tested in other, non-experimental research, but in what follows we will, for the sake of clarity, restrict ourselves to experimental research, i.e., research that uses the experiment as its methodology. In experimental research, we attempt to argue for the plausibility of a causal relationship between certain factors. If an experimental study has results that confirm the research hypothesis (i.e., the null hypothesis is rejected), it is plausible that a change in the independent variable is the cause of a change, or effect, in the dependent variable. In this manner, experimental research allows us to conclude with some degree of certainty that, for instance, a difference in medical treatment after a stroke is the cause, or an important cause, of a difference in patients’ language ability as observed 6 months post-stroke. The experiment has made it plausible that there is a causal relationship between the method of treatment (independent variable) and the resulting language ability (dependent variable).

Causality

A causal relationship between two variables is distinct from ‘just’ a relationship or correlation between two variables. If two phenomena correlate with one another, one does not have to be the cause of the other. One example can be seen in the correlation between individuals’ height and their weight: tall people are, in general, heavier than short people (and vice versa: short people are generally lighter than tall people). Does this mean that we can speak of a causal relationship between height and weight? Is one property (partially) caused by the other? No: in this example, there is indeed a correlation, but no causal relationship between the properties: both height and weight are “caused” by other variables, including genetic properties and diet. A second example is the correlation between motivation and language ability in second language learners: highly motivated students learn to speak a new second language better and more fluently than students with low motivation do, but here, too, it is unclear which is the cause and which is the effect.

A causal relationship is a specific type of correlation: a correlation between two phenomena or properties that also satisfies certain additional requirements (Shadish, Cook, and Campbell 2002). Firstly, the cause has to precede the effect (improvement occurs after treatment). Secondly, the effect should not occur if the cause is not there (no improvement without treatment); moreover, the effect should (at least in theory) always occur whenever the cause is present (treatment always results in improvement). Thirdly, there should be no plausible explanation for the effect’s occurrence other than the possible cause we are considering. When we know the causal mechanism (we understand why treatment causes improvement), we are better able to exclude other plausible explanations. Unfortunately, however, this is very rarely the case in the behavioural sciences, which include linguistics. We may well determine that a treatment results in improvement, but the theory that ties cause (treatment) and effect (improvement) together is rarely complete and usually has crucial gaps. This means that we must take appropriate precautionary measures in setting up our research methodology in order to exclude any possible alternative explanations of the effects we find.
Validity

A statement or conclusion is valid whenever it is true and justified. A true statement corresponds to reality: the statement that every child learns at least one language is true, because this statement appropriately represents reality. A justified statement derives its validity from the empirical evidence upon which it is based: every child observed by us or by others is learning or has learned a language (except for certain extraordinary cases, for which we need a separate explanation). A statement’s justification becomes stronger as the method of (direct or indirect) observation becomes stronger and more reliable. This also means that a statement’s validity is not a categorical property (valid/not valid) but a gradual property: a statement can be relatively more or less valid. Three aspects of a statement’s validity may be distinguished: internal validity, construct validity, and external validity.
These three forms of validity will be further illustrated in the following sections.

Internal validity

As you already know, it is our aim in an experimental study to exclude as many alternative explanations of our results as possible. After all, we must demonstrate that there is a causal relationship between two variables, X and Y, and this means keeping any confounding factors under control as much as possible. Let us take a look at example 5.1.
This question of a conclusion’s justification is a question about the study’s internal validity. Internal validity pertains to the relationships between variables that are measured or manipulated, and is not dependent on the (theoretical) constructs represented by the various variables (hence the name ‘internal validity’). In other words: the question of internal validity is one of possible alternative explanations of the research results that were found. Many of the possible alternative explanations can be pre-empted by the way in which the data are collected. In the following, we will discuss the most prominent threats to internal validity (Shadish, Cook, and Campbell 2002).
In a laboratory, ‘history’ is kept under control by isolating participants from outside influences (such as a heat wave), or by choosing dependent variables that could barely be influenced by external factors. In research performed outside of a laboratory, including field research, it is much more difficult and often even impossible to keep outside influences under control. The following example makes this clear.
In most cases, maturation occurs because participants perform the same task or answer the same questions many times in a row. However, maturation can also happen when participants are asked to provide their answers in a way they are not used to, e.g., because of an unexpected way of asking the question, or through an unusual type of multiple choice question. During the first few times a participant answers a question in such a study, the method of answering may interfere with the answer itself. Afterwards, we could compare, e.g., the first and the last quarter of a participant’s answers to check whether there might have been an effect of experience, i.e., maturation.
In many studies, observations are made both before and after a treatment. Identical tests could be used for this, but that might lead to a learning effect (see above). For this reason, researchers often use different tests for the pretest and the posttest. However, this might lead to an instrumentation effect. The researcher has to weigh the possible advantages and disadvantages of each option.
Table 5.1: Conditions in the study by Donald (1983).
So, what is wrong with this study? The answer to this question can be found in how students were selected. Readers were classified as ‘strong’ or ‘less strong’ based on a reading ability test, but performance on such a test is always influenced by random factors that have nothing to do with reading ability: Tom was not feeling well and did not perform well on the test, Sarah was not able to concentrate, Nick was having knee troubles, Julie was extraordinarily motivated and outdid herself. In other words: the assessment of reading ability was not entirely reliable. This means that (1) less strong readers who happened to have performed above their level were unjustly classified as strong readers instead of as less strong readers; and, conversely, (2) strong readers who happened to have performed below their level were unjustly deemed to be less strong readers. Thus, the group of less strong readers will always contain some readers that are not that bad at all, and the group of strong readers will always contain a few that are not so strong after all.

When the actually-strong readers that were unjustly classified as less strong readers are given a second reading test (after having studied a text with or without illustrations), they will typically go back to performing at their usual, higher level. This means that a higher score on the second test (the posttest) might be an artefact of the method of selection. The same is true, with the necessary changes, for the actually-less-strong readers that were unjustly selected as strong readers. When these students are given a second reading test, they, too, will typically go back to performing at their usual (lower) level. Thus, their score on the posttest will be lower than their score on the pretest.

For the study by Donald (1983) used as an example here, this means that the difference found between strong and less strong readers is partially due to randomness. Even if the independent variable has no effect, the group of ‘strong’ readers will, on average, perform worse on the posttest, while the group of ‘less strong’ readers will, on average, perform better. In other words: the difference between the two groups is smaller on the posttest than on the pretest, as a consequence of random variation: regression to the mean. As you may expect, research results may be muddled by this phenomenon. As we saw above, an experimental effect may be weakened or disappear as a consequence of regression to the mean; conversely, regression to the mean can be misidentified as an experimental effect (for extensive discussion, see Retraction Watch (2018)).

Generally speaking, regression to the mean may occur when a classification is made based on a pretest whose scores are correlated with the scores on the posttest (see Chapter 11). If there is no correlation at all between pretest and posttest, regression to the mean plays the leading role: in this case, any difference between pretest and posttest must be the consequence of regression to the mean. If there is a perfect correlation, regression to the mean does not play a role, but the pretest is then also not informative, since (after the fact) pretest and posttest turn out to be completely predictable from one another. Regression to the mean may offer an alternative explanation for an alleged substantial score increase between pretest and posttest for a lower performing group (e.g., less strong readers) compared to a smaller increase for a higher performing group (e.g., strong readers).
Conversely, it might also offer an alternative explanation for an alleged score decrease between pretest and posttest for a higher performing group (e.g., strong readers) compared to a lower performing group (e.g., less strong readers). It is better when groups are not composed on the basis of one of the measurements (pretest or posttest), but, instead, on the basis of some other, independent criterion. In the latter case, the participants in both groups will have a more or less average score on the pretest, which minimizes the effect of regression to the mean. Each group will then contain roughly equal numbers of participants whose scores happened to turn out too high and participants whose scores happened to turn out too low, both on the pretest and on the posttest.
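The following minimal simulation sketch (in Python; all numbers are invented purely for illustration) shows how regression to the mean can arise from measurement error alone: even when there is no treatment effect at all, a group selected for low pretest scores appears to 'improve' on the posttest, and a group selected for high pretest scores appears to 'decline'.

```python
import numpy as np

rng = np.random.default_rng(2002)

# Hypothetical illustration: every pupil has a stable 'true' reading ability,
# but pretest and posttest scores each contain independent random error.
n = 1000
true_ability = rng.normal(50, 10, n)
pretest = true_ability + rng.normal(0, 5, n)    # noisy measurement, before 'treatment'
posttest = true_ability + rng.normal(0, 5, n)   # noisy measurement, no treatment effect at all

# Classify readers as 'less strong' vs 'strong' on the basis of the noisy pretest.
less_strong = pretest < np.median(pretest)
strong = ~less_strong

print("less strong: pretest %.1f -> posttest %.1f"
      % (pretest[less_strong].mean(), posttest[less_strong].mean()))
print("strong:      pretest %.1f -> posttest %.1f"
      % (pretest[strong].mean(), posttest[strong].mean()))
# Although nothing happened between the two tests, the 'less strong' group improves
# and the 'strong' group declines on the posttest: regression to the mean.
```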
Research in education often does not provide the opportunity to randomly assign students in different classes to various conditions, because this may lead to insurmountable organizational problems. These organizational problems involve more than just (randomly) splitting classes, even though the latter may already be difficult to put into practice. The researcher also has to account for possible transfer effects between conditions: students will talk to one another, and might even teach each other the essential characteristics of the experimental condition(s). This is at least one way in which the absence of an effect could be explained. Because of the problems sketched out here, it often happens that entire classes are assigned to one of the conditions. However, classes consist of students that go to the same school. When students (and their parents) choose a school, self-selection takes place (in the Dutch education system), which leads to differences between conditions (that is, between classes within conditions) in terms of students’ initial performance. This means that any differences we find between conditions could also have been caused by students’ self-selection of schools.

In the above, we have already indicated the most straightforward way to give different conditions the same initial level: assign students to conditions by chance, or at random. This method is known as randomization (Shadish, Cook, and Campbell 2002, 294 ff). For instance, we might randomize participants’ assignment to conditions by giving each student a random number (see Appendix A), and then assigning ‘even’ students to one condition and ‘odd’ students to the other. If participants are randomly assigned to conditions, all differences between participants within the various conditions are based on chance, and are thus averaged out. In this way, it is most likely that there are no systematic differences between the groups or conditions distinguished. However, this is only true if the groups are sufficiently large. Randomization, or random assignment of participants to conditions, is to be distinguished from random sampling from a population (see §7.3 below). In the case of random sampling, we randomly select participants from the population of possible participants to be included in our sample; our goal in this case is that the sample(s) resemble the population from which they are drawn. In the case of randomization, we randomly assign the participants within the sample to the various conditions in the study; our goal in this case is that the samples resemble each other.

One alternative method to create two equal groups is matching. In the case of matching, participants are first measured on a number of relevant variables. After this, pairs are formed that have an equal score on these variables. Within each pair, one participant is assigned to one condition, and the other to the other condition. However, matching has various drawbacks. Firstly, regression to the mean might play a role. Secondly, matching is very labour-intensive when participants have to be matched on multiple variables, and it requires a sizeable group of potential participants. Thirdly, matching only reckons with variables that the researcher deems relevant, but not with other, unknown variables. Randomization randomizes not only these relevant variables, but also other properties that might play a role without the researcher realizing it.
In short, randomization, even though it is relatively simple, is far preferable to matching.
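As a minimal sketch of such random assignment (in Python, using only the standard library; the student names are invented), one might shuffle the list of participants and then assign alternating positions to the two conditions, analogous to assigning ‘even’ and ‘odd’ random numbers to conditions as described above.

```python
import random

# Hypothetical class list; the names are invented for illustration.
students = ["Tom", "Sarah", "Nick", "Julie", "Anna", "Paul", "Mia", "Omar"]

random.seed(5)            # fixed seed only to make this example reproducible
random.shuffle(students)  # a random order amounts to giving each student a random rank

# Alternating positions go to the two conditions, analogous to assigning
# 'even' random numbers to one condition and 'odd' random numbers to the other.
condition_a = students[0::2]
condition_b = students[1::2]
print("condition A:", condition_a)
print("condition B:", condition_b)
```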
In the preceding paragraphs, we discussed a number of frequently occurring problems that may threaten a study’s internal validity. However, this list is by no means exhaustive. Each type of research has problems of its own, and it is the researcher’s task to remain aware of possible threats to internal validity. To this end, always try to think of plausible explanations that might explain a possible effect to the same extent as, or maybe even better than, the cause you are planning to investigate. In this manner, the researcher must adopt the mindset of their own greatest sceptic, who is by no means convinced that the factor investigated is truly the cause of the effect found. Which possible alternative explanations might this sceptic come up with, and how would the researcher be able to eliminate these threats to validity through the way the study is set up? This requires an excellent insight into the logical relationships between the research questions, the variables that are investigated, the results, and the conclusion.

Construct validity

In an experimental study, an independent variable is manipulated. Depending on the research question, this may be done in many different ways. In the same manner, the way in which the dependent variable(s) is/are measured may take different shapes. The way in which the independent and dependent variables are given concrete form is called the operationalization of these variables. For instance, students’ reading ability may be operationalized as (a) their score on a reading comprehension test with open-ended questions; (b) their score on a reading comprehension test with multiple choice questions; (c) their score on a so-called cloze test (fill in the missing word); or (d) the degree to which students can execute written instructions. In most cases, there are quite a few ways to operationalize a variable, and it is rarely the case that a theory entails just one possible way in which the independent or dependent variables must be operationalized. Construct validity, or concept validity, refers to the degree to which the operationalization of both the dependent variable(s) and the independent variable(s) adequately mirrors the theoretical constructs that the study focuses on. In other words: are the independent and dependent variables properly related to the theoretical concepts the study focuses on?
Of course, the problems with construct validity mentioned above arise not only for written questions or multiple choice questions, but also for questions one might ask participants orally.
A construct that is notably difficult to operationalize is that of writing ability. What is a good or bad product of writing? And what exactly is writing ability? Can writing ability be measured by counting relevant content elements in a text, should we count sentences or words, or perhaps mainly connectives (therefore, because, since, although, etc.), should we collect multiple judgments that readers have about the written text (regarding goal-orientedness, audience-orientedness, and style), or should we collect a single judgment from readers regarding the text’s global quality, should we count spelling errors, etc? The operationalization problems arise from the lack of a theory of writing ability, from which we might derive a definition of the quality of writing products (Van den Bergh and Meuffels 1993). This makes it easy to criticize research into writing quality, but makes it difficult to formulate alternative operationalizations of the construct. Another difficult construct is the intelligibility of a spoken sentence. Intelligibility may be operationalized in various ways. The first option is that the researcher speak the words or sentences in question, and the participant repeat them out loud, with errors in reproduction being counted; one disadvantage of this is that there is hardly any control over the researcher’s model pronunciation. A second option is that the words or sentences be recorded in advance and the same procedure be followed for the rest; one disadvantage that remains is that responses are influenced by world knowledge, grammatical expectations, familiarity with the speaker or their use of language, etc. The most reliable method is that of the so-called ‘speech reception threshold’ (Plomp and Mimpen 1979), which is described in the next example. However, this method does have the disadvantages of being time-consuming, being difficult to administer automatically, and requiring a great amount of stimulus material (speech recordings) for a single measurement.
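Purely to illustrate the general idea behind such adaptive threshold measurements, the sketch below implements a simple one-up/one-down staircase in Python. It is not the Plomp and Mimpen (1979) procedure itself (which is described in the next example), and all parameter values, as well as the simulated listener, are invented for illustration only.

```python
import random

def adaptive_threshold(respond, start_level=0.0, step=2.0, n_trials=20):
    """Simple 1-up/1-down staircase; respond(level) -> True if the sentence was repeated correctly."""
    level = start_level                 # e.g., a (hypothetical) speech-to-noise ratio in dB
    levels = []
    for _ in range(n_trials):
        levels.append(level)
        if respond(level):
            level -= step               # correct: make the next sentence harder
        else:
            level += step               # incorrect: make the next sentence easier
    # The staircase oscillates around the level at which 50% of responses are correct;
    # averaging over the later trials estimates that threshold.
    later = levels[n_trials // 2:]
    return sum(later) / len(later)

# A simulated listener whose true 50% threshold lies at -5 (an invented value).
random.seed(1)
listener = lambda level: random.random() < 1 / (1 + 10 ** ((-5.0 - level) / 2))
print(round(adaptive_threshold(listener), 1))   # the estimate should lie near -5
```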
So far, we have only talked about problems around the construct validity of the dependent variables. However, the operationalization of the independent variable is also often questioned. After all, the researcher has had to make many choices while operationalizing their independent variable (see §2.6), and the choices made can often be contested. A study is not construct valid, or concept valid, if the operationalizations of the independent variables cannot withstand the test of criticism. A study is also not construct valid if the independent variable is not a valid operationalization of the theoretical-concept-as-intended. If this operationalization is not valid, we are, in fact, manipulating something different from what we intended. In this case, the relationship between the dependent variable and the independent-variable-as-intended that was manipulated is no longer unambiguous. Any observed differences in the dependent variable are not necessarily caused by the independent-variable-as-intended, but could also be influenced by other factors. One well-known effect of this kind is the so-called Hawthorne effect.
Thus, we observe the Hawthorne effect when a change in behaviour is not caused by the manipulation of the independent variable, but is instead the consequence of a psychological phenomenon: participants who know they are being observed are more eager to show (what they think is) the desired behaviour.
Just like for internal validity, we can also mention a number of validity-threatening factors for construct or concept validity.
However, it is not sufficient to demonstrate that tests meant to measure the same concept or construct are, indeed, convergently valid. After all, this does not show what construct they refer to, or whether the construct measured is actually the intended construct. Did we actually measure a speaker’s ‘fluency’ using multiple methods, or did we, in reality, measure the construct of ‘attention’ or ‘speaking rate’ each time? Did we actually measure ‘degree of reading comprehension’ using different converging methods, or did we, in reality, measure the construct of ‘performance anxiety’ each time? To ensure construct validity, we really have to demonstrate that the operationalizations are divergently valid compared to operationalizations that aim to measure some other aspect or some other (related) skill or ability. In short, the researcher must be able to show that performance on instruments (operationalizations) that represent a single skill or ability (construct) is highly correlated (is convergent), while performance on instruments that represent different skills or abilities is hardly correlated, if at all (is divergent).
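The following sketch (Python, with invented data) illustrates what such a pattern of convergent and divergent correlations might look like: two hypothetical instruments per construct, with high correlations within a construct and low correlations across constructs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: 40 speakers, two hypothetical instruments per construct.
n = 40
fluency = rng.normal(0, 1, n)          # 'true' fluency (unobserved)
comprehension = rng.normal(0, 1, n)    # 'true' reading comprehension (unobserved)

scores = np.column_stack([
    fluency + rng.normal(0, 0.4, n),         # fluency, instrument 1
    fluency + rng.normal(0, 0.4, n),         # fluency, instrument 2
    comprehension + rng.normal(0, 0.4, n),   # comprehension, instrument 1
    comprehension + rng.normal(0, 0.4, n),   # comprehension, instrument 2
])

# Correlation matrix over the four instruments: high within a construct
# (convergent validity), low across constructs (divergent validity).
print(np.round(np.corrcoef(scores, rowvar=False), 2))
```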
This famous example illustrates how subtle a researcher’s or experimenter’s influence on the object of study can be. It goes without saying that this influence threatens construct validity. For this reason, it is better when the researcher does not also function as the experimenter. Studies in which the researcher is also the one who administers the treatment, teaches the students, or judges performance may be criticized, because the researcher (and their expectations) may influence the outcome, which threatens the independent variable’s construct validity. Researchers may, however, guard against this ‘experimenter bias’. For instance, in the Head Turn Preference Paradigm (example 5.6), it is customary that the experimenter does not know which group a participant belongs to, and does not hear which sound file is being played (Johnson and Zamuner 2010, 74).
This effect was partially responsible for the overestimation of the number of Clinton votes and underestimation of the number of Trump votes in the opinion polls prior to the 2016 US presidential election.
Nevertheless, we still recommend facing such questions of generalizability. Are the conclusions reached also applicable to another population or language, and why (not)? Which other factors might influence this generalizability? Could it be that a favourable effect for one population or language turns into an unfavourable effect for some other population or language that was outside the scope of the study?

External validity

Based on the data collected, a researcher (barring any unexpected problems) may draw the conclusion: in this study, XYZ is true. However, it is rarely a researcher’s goal to draw conclusions that hold for just one study. A researcher would not just like to show that being bilingual has a positive influence on language development in the sample of children studied. A researcher would like to draw conclusions such as: being bilingual has a positive influence on language development in children. The researcher would like to generalize. The same holds for daily life: we might taste a single spoonful of soup from an entire pot, and then express a judgment on the entire pot of soup. We assume that our findings based on the one spoonful may be generalized to the entire pot, and that it is not necessary to eat the entire pot before we can form a judgment. The question of whether a researcher may generalize their results is the question of a study’s external validity (Shadish, Cook, and Campbell 2002). The aspects of a study that generalization pertains to include the population (of participants or of language material), the situation, and the time of the study.
For external validity, we distinguish between (1) generalization to a specific intended population, situation, and time, and (2) generalization over other populations, situations, and times. Generalizing to and generalizing over are two aspects of external validity that must be carefully separated. Generalizing to a population (of individuals, or, often, of language material) has to do with how representative the sample used is: to what extent does the sample properly mirror the population (of individuals, words, or relevant possible sentences)? Thus, ‘generalizing to’ is tied directly to the goals of the study; a study’s goals cannot be reached unless it is possible to generalize to the populations defined. Generalizing over populations has to do with the degree to which the conclusions we formulate also hold for sub-populations we may distinguish. Let us illustrate this with an example.
We may assume that this outcome can be generalized to the intended population, namely, all native listeners of American English. This generalization can be made despite the possibility that various listeners were perhaps influenced by the speaker’s foreign accent to different degrees. Perhaps a later analysis might show that there is a difference between female and male listeners. It is not impossible that women and men might differ in their sensitivity to the speaker’s accent. Such an (imagined) outcome would show that we may not generalize over sub-populations within our population, even though we may still generalize to the target population.

In (applied) linguistic research, researchers often attempt to simultaneously generalize to two populations of units, namely, individuals (or schools, or families) and stimuli (words, sentences, texts, etc.). We want to show that the results hold not just for the language users we studied, but for other language users as well. At the same time, we also want to show that the results hold not just for the stimuli we investigated, but also for other, similar language material in the population from which we drew our sample of stimuli. This simultaneous generalization requires studies to have a complex design, because we have repeated observations both within participants (multiple judgments from the same participant) and within stimuli (multiple judgments on the same stimulus). After the observations have been made, the stimuli, participants, and conditions are combined in a clever way to protect internal validity as well as possible. Naturally, generalization to other language material does require that the stimuli be randomly selected from the (sometimes infinitely large) population of all possible language material (see Chapter 7).

References

“Alex Foundation.” 2015. http://alexfoundation.org/.
Donald, D. R. 1983. “The Use and Value of Illustrations as Contextual Information for Readers at Different Progress and Developmental Levels.” British Journal of Educational Psychology 53 (2): 175–85.
Houtman, C. 1986. “Bevrijd Ons van Het Meerkeuze-Examen.” Levende Talen 412: 367–69.
Johnson, Elizabeth K., and Tania Zamuner. 2010. “Using Infant and Toddler Testing Methods in Language Acquisition Research.” In Experimental Methods in Language Acquisition Research, edited by Elma Blom and Sharon Unsworth, 73–93. Amsterdam: John Benjamins.
Karp, J. A., and D. Brockington. 2005. “Social Desirability and Response Validity: A Comparative Analysis of Overreporting Voter Turnout in Five Countries.” Journal of Politics 67 (3): 825–40.
Lev-Ari, Shiri, and Boaz Keysar. 2010. “Why Don’t We Believe Non-Native Speakers? The Influence of Accent on Credibility.” Journal of Experimental Social Psychology 46 (6): 1093–96.
Pfungst, Oskar. 1907. Das Pferd des Herrn von Osten (Der kluge Hans): Ein Beitrag zur experimentellen Tier- und Menschen-Psychologie. Leipzig: J. A. Barth. https://archive.org/details/daspferddesherr00stumgoog.
Plomp, R., and A. M. Mimpen. 1979. “Improving the Reliability of Testing the Speech Reception Threshold for Sentences.” International Journal of Audiology 18 (1): 43–52.
Richardson, Ellis, Barbara DiBenedetto, Adolph Christ, Mark Press, and Bertrand G. Winsberg. 1978. “An Assessment of Two Methods for Remediating Reading Deficiencies.” Reading Improvement 15 (2): 82.
Rijlaarsdam, G. 1986. “Effecten van Leerlingrespons op Aspecten van Stelvaardigheid.” PhD thesis.
Rosén, Erik, Helena Stigson, and Ulrich Sander. 2011. “Literature Review of Pedestrian Fatality Risk as a Function of Car Impact Speed.” Accident Analysis and Prevention 43 (1): 25–33. http://dx.doi.org/10.1016/j.aap.2010.04.003.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Belmont, CA: Wadsworth.
Shohamy, E. 1984. “Does the Testing Method Make a Difference? The Case of Reading Comprehension.” Language Testing 1 (2): 147–70.
SWOV. 2012. “De Relatie Tussen Snelheid en Ongevallen.” SWOV. http://www.swov.nl/rapport/Factsheets/NL/Factsheet_Snelheid.pdf.
Van den Bergh, Huub, and Bert Meuffels. 1993. “Schrijfvaardigheid.” In Taalbeheersing als Tekstwetenschap: Terreinen en Trends, edited by A. Braet and J. Van de Gein. Dordrecht: ICG.
Verhoeven, Jo, Guy De Pauw, and Hanne Kloots. 2004. “Speech Rate in a Pluricentric Language: A Comparison Between Dutch in Belgium and the Netherlands.” Language and Speech 47 (3): 297–308.
Watzlawick, Paul. 1977. Is “Werkelijk” Waar? Spraakverwarring, Zinsbegoocheling en Onvoorstelbare Werkelijkheid. Deventer: Van Loghum Slaterus.