What is a threat to internal validity that occurs in the sampling process when random assignment to research groups is not properly conducted?

Internal validity concerns the validity of the experiment itself: did the experimental treatment actually cause the observed outcome, or was it something else?

From: Encyclopedia of Social Measurement, 2005

Internal Validity

M.H. Clark, S.C. Middleton, in International Encyclopedia of Education (Third Edition), 2010

Internal validity addresses whether or not it is reasonable to make a causal inference from the observed covariation between two variables, a presumed cause and its effect. Donald Campbell and his colleagues identified several threats to internal validity to help evaluate the strength of a study's causal claims. These threats include: ambiguous temporal precedence, selection, history, maturation, regression, attrition, testing, instrumentation, and the additive and interactive effects of these threats. Often, these threats can be identified or accounted for by adding pretests and comparison groups to a research design, using specific designs, or making statistical adjustments.


URL: https://www.sciencedirect.com/science/article/pii/B978008044894700292X

Longitudinal Studies, Panel

Scott Menard, in Encyclopedia of Social Measurement, 2005

Unreliability of Measurement and Measurement Change

In longitudinal panel research, the usual issues of internal and external validity and measurement reliability arise. Measurement reliability may be assessed either by test–retest reliability, with a suitably short interval between test and retest, or by internal-consistency measures, such as Cronbach's alpha or factor-analytic techniques. The issue of longitudinal reliability arises from the possibility that, even when identical measurement instruments are used in different waves (and all the more when they are not), differences in administration or life-course changes may mean that what is measured at one wave is not really comparable to what is measured at another. Four more specific issues within the broader framework of longitudinal reliability illustrate the dilemmas that may arise: consistency of administration, consistency of observation, factorial invariance, and changes in measurement.
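To make the internal-consistency option concrete, here is a minimal Python sketch of Cronbach's alpha; the helper name cronbach_alpha and the toy data are invented for illustration, and the formula is the standard one rather than anything specific to this chapter.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the scale total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 5 respondents answering a 4-item scale at one wave.
wave1 = [[4, 5, 4, 4],
         [2, 2, 3, 2],
         [5, 4, 5, 5],
         [3, 3, 2, 3],
         [4, 4, 4, 5]]
print(round(cronbach_alpha(wave1), 3))
```

Computing alpha separately at each wave gives a first, crude check on longitudinal reliability: similar alphas across waves are necessary, though not sufficient, for comparability.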

Even if the same survey or psychometric tests are administered in each wave of a panel study, how the questions are administered may vary in more or less obvious ways. An example of this is the change from face-to-face to telephone interviews in the National Crime Victimization Survey and the Panel Study of Income Dynamics. More subtle changes may include changes in instructions to interviewers (or in the interviewers' actual behavior, regardless of instructions); one example is whether certain responses elicit probes from the interviewer. The setting of a psychometric test may change, for example, from the researcher's office to the respondent's home or school. All of these are inconsistencies in administration, and they raise the question of whether any apparent change (or absence of change) in responses from one wave to another represents real change (or its absence) in characteristics of the research participants, or merely change in how the participants respond to the administration of the survey or test. Observational inconsistency closely parallels inconsistency in administration. When the observer becomes the measurement instrument, as in ethnographic research, the question arises whether reported changes represent true change in the individuals being observed, or merely changes in the perceptions or perspective of the observer.

When the same sets of items on a scale have the same relationship to the scale (and thus to one another) in each wave of a panel study, the result is called factorial invariance. Strict factorial invariance requires that the numerical relationship be identical across waves; a weaker form requires only that the same items “belong” to the same scale at each wave (as indicated by some numerical criterion, without the numerical relationship being exactly the same). Factorial invariance is compromised when the relationship of one item to the scale changes. The first question that arises when factorial invariance is absent is whether we are measuring the same variable from one wave to the next, and hence whether it is possible to measure change in that variable, one of the core criteria of longitudinal panel research. If the variable being measured is not the same from one wave to the next, or even if it is the same variable but measured differently, a second question arises: can any change in the relationship of this variable to other variables in the study (or, again, an apparent absence of change) be attributed to real change (or its absence), or merely to a change in measurement?
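In standard factor-analytic notation (our gloss; the chapter itself stays verbal), the distinction can be written as constraints on a one-factor measurement model across waves:

```latex
% One-factor measurement model for item i, person t, at wave w:
x_{it}^{(w)} = \tau_i^{(w)} + \lambda_i^{(w)}\,\eta_t^{(w)} + \varepsilon_{it}^{(w)}
% Weaker (configural) invariance: the same items load on the same factor
% at every wave, i.e., \lambda_i^{(w)} \neq 0 for the same set of items i at all w.
% Strict invariance: the numerical relationship is identical across waves,
\lambda_i^{(1)} = \lambda_i^{(2)} = \cdots = \lambda_i^{(W)} \quad \text{for all items } i.
```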

In some instances, researchers may deliberately change the measurement instrument (including the observer, in ethnographic studies) from one wave of a panel to another. This may be done to incorporate a “new and improved” measure, because questions that were appropriate at an earlier age or stage of the life course (e.g., how well a student is doing in school at age 14) may no longer be appropriate, or because new questions that would have been inappropriate at an earlier age or stage (e.g., how well an individual is doing at work at age 34) are now appropriate. Of course, it is possible that, for some individuals, work is an important context at age 14 or school is an important context at age 34, but what about ages 9 and 59? In this example, introducing new questions without eliminating old ones, or eliminating old questions without adding new ones, would constitute a change in survey administration, adding to or subtracting from the length of the survey. Yet including the same set of questions at all waves, regardless of their appropriateness to the age or stage of the respondent, risks the respondent's unwillingness to participate in a survey that seems irrelevant. Unless the same questions are asked in the same order, in the same way, at every wave, the question of comparability of administration arises. At the same time, if the survey covers a substantial span of the life course, the same question at different ages or stages may not have the same meaning for the respondent, either subjectively or in its empirical relationship with other scale items or variables.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000074

Experiments, Criminology

David Weisburd, Anthony Petrosino, in Encyclopedia of Social Measurement, 2005

Barriers to Experimentation in Crime and Justice

Despite the theoretical benefits of experimental study in terms of internal validity, some scholars (e.g., Clarke and Cornish, in 1972, and Pawson and Tilley, in 1997) have argued that practical and ethical barriers limit the use of randomized experiments in real crime and justice contexts. Ordinarily, such concerns relate to the question of whether the random allocation of sanctions, programs, or treatments in criminal justice settings can be justified on the basis of the benefits accrued to society. Or conversely, the concern is whether the potential costs of not providing treatment to some offenders (either in terms of harm to them or to the principles of equity and justice in the criminal justice system) are outweighed by those benefits. Over the past two decades, criminal justice researchers have illustrated that randomized experiments can be carried out across a wide variety of settings and across a number of different types of criminal justice institutions. As described by Weisburd in 2000, researchers have overcome barriers to randomization of criminal justice innovations in a number of different ways. For example, it is common that there are not enough resources to provide treatment to all eligible subjects. In such cases, researchers have argued successfully that random allocation provides a fair method for choosing those who will gain treatment and those who will not. One objection often raised to experimentation in crime and justice is that it is unethical and sometimes illegal to create control conditions in which individuals receive no treatments or sanctions. In practice, most experiments in crime and justice involve comparison groups that receive either conventional treatment or some alternative type of treatment. The most serious barriers to experimental research have been encountered in studies of criminal justice sanctions such as arrest or imprisonment. However, even here, a number of studies have been developed. In general, fewer ethical objections are raised in studies in which the innovation proposed involves a less punitive sanction than that conventionally applied.

Practical barriers to experimentation have also hindered the development of randomized studies in criminal justice. It is generally assumed that it is more difficult to gain cooperation for randomized experimental approaches than for nonrandomized methods. This has led some scholars to argue that the “external validity” of experimental research is often lower than that of nonrandomized studies. External validity refers to the degree to which the findings or results from a study sample represent the characteristics of the population of interest. If experimental studies are to be carried out only in select criminal justice jurisdictions or institutions, or only among specific subjects, then the external validity of experimental studies can be questioned. Though the external validity of randomized studies, like nonrandomized studies, will vary from study to study, a number of recent trends have contributed to the expansion of the experimental model in criminal justice evaluation. First, the wide use of randomized experiments in medicine, and public exposure to such studies, have led to wide acceptance among the public and policymakers of the value of experimental methods. Second, there is a growing recognition of the importance of evaluating criminal justice treatments and programs as part of a more general movement referred to as “evidence-based policy,” which seeks to track the effectiveness and efficiency of government. Finally, in the United States, and more recently in Europe, public support for partnerships between criminal justice researchers and criminal justice professionals has grown. Such collaboration has led to a greater understanding of experimental methods among practitioners.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985004424

Empirical Research Methods in the Economics of Education

P.J. McEwan, in International Encyclopedia of Education (Third Edition), 2010

Combining Methods to Improve Causal Inference

Researchers often apply multiple methods in the same study to improve internal validity. Almost every study employs statistical controls for family and student characteristics that affect outcomes. DD methods are frequently combined with experiments (Skoufias, 2005), the RDD (Chay et al., 2005), and IV methods (Kuziemko, 2006). Finally, researchers often combine IV methods with randomized experiments and the RDD, especially to address imperfect compliance of students with random or cutoff-based assignment to policy treatments. In New York City’s voucher experiment, for example, students were randomly assigned to receive a voucher offer, but not all students accepted the offer and actually attended a private school (Krueger and Zhu, 2004). To recover an estimate of the treatment-on-the-treated (i.e., the effect of actually attending a private school), researchers used the voucher offer as an instrument for private school attendance. The resulting IV estimate provides a credible estimate of the effect of private school attendance on those induced to accept it by the voucher offer.
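To make the logic concrete, here is a minimal simulation sketch of that IV strategy in Python; the sample size, effect size, and compliance rate are invented, and the Wald ratio below is the simplest IV estimator (two-stage least squares generalizes it).

```python
import numpy as np

# Simulated data standing in for a voucher experiment with imperfect
# compliance; all numbers are illustrative, not from Krueger and Zhu.
rng = np.random.default_rng(0)
n = 10_000
offer = rng.integers(0, 2, n)        # randomized voucher offer (the instrument)
complier = rng.random(n) < 0.6       # 60% would attend private school if offered
attend = offer & complier            # actual private-school attendance
effect = 5.0                         # true effect of attendance on test scores
score = 50 + effect * attend + rng.normal(0, 10, n)

# Wald estimator: reduced-form effect of the offer divided by the
# offer's effect on attendance (the "first stage").
itt = score[offer == 1].mean() - score[offer == 0].mean()
first_stage = attend[offer == 1].mean() - attend[offer == 0].mean()
print("IV estimate of attendance effect:", round(itt / first_stage, 2))  # ~5
```

Because the control group cannot attend private school in this setup (one-sided noncompliance), the IV estimate coincides with the treatment-on-the-treated quantity the text describes.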


URL: https://www.sciencedirect.com/science/article/pii/B9780080448947012112

Experiments, Psychology

Peter Y. Chen, Autumn D. Krauss, in Encyclopedia of Social Measurement, 2005

Longitudinal Design

Compared to the prior designs, a longitudinal design provides stronger evidence for internal validity. Recall from the substance-abuse study that participants receiving the treatment reported using cocaine on significantly fewer days for the first 6 months, as compared to those who received the placebo treatment; however, the treatment effect diminished when the participants were retested 12 months after the interventions. Without the longitudinal design, the lack of a long-term benefit of the treatment would not have been known. The design consists of multiple pretests and posttests over a period of time. The numbers of pretests and posttests do not need to be the same, and generally there are more posttests than pretests. An example of the structure of the longitudinal design is as follows:

R  Opre  Opre  X  Opost  Opost  Opost  Opost
R  Opre  Opre      Opost  Opost  Opost  Opost

(R = random assignment, Opre = pretest observation, X = treatment, Opost = posttest observation; the second row is the control group.)

Practical constraints are often encountered when attempting to implement a study using a longitudinal design. For instance, attrition is common in longitudinal designs, so data for some participants are often incomplete. Furthermore, it is not clear from a theoretical viewpoint how long a longitudinal design should be conducted; as such, requiring participant involvement for an extended period of time may pose ethical concerns.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985003273

Validity, Data Sources

Michael P. McDonald, in Encyclopedia of Social Measurement, 2005

Internal and External Validity

The causal relationship of one concept to another is sometimes also discussed in terms of validity. Internal validity refers to the robustness of the relationship of one concept to another within the research question under study. Much of the discussion in the sections on threats to validity and tests for validity is pertinent to the internal validity of a measure vis-à-vis another concept with which it is theoretically correlated. External validity refers to the broader generalizability of the relationship between the two concepts under study: is the uncovered relationship applicable outside of the research study?

The relationship between one measure and another may be a true relationship, or it may be a spurious relationship produced by invalid measurement of one of the measures. That is, the two measures may appear related because of improper measurement, and not because they are truly correlated with one another. Similarly, two measures that are truly related may go undetected because invalid measurement prevents discovery of the correlation. By now, the reader should be aware that no measure is perfectly valid; the hope is that the error induced in projecting theory onto the real world is small and unbiased, so that relationships, be they findings that two measures are or are not correlated, are correctly determined.

All of the threats to validity apply to the strength of the internal validity of the relationship between two measures, as both measures must be valid in order for the true relationship between them, if any exists, to be determined. Much of the discussion of tests of content and convergent validity also applies to internal validity. In addition, researchers should consider the rules of inference in determining whether a relationship is real or spurious. Are there uncontrolled confounding factors driving the relationship? A classic example in time-series analysis involves trending series (the setting in which cointegration is studied): any two series that grow or shrink together over time, such as the size of the population and the size of the economy, will appear correlated. In the earlier example of voter turnout, the confounding influence of a growing ineligible population led researchers to incorrectly attribute a largely invalid measure of decreasing voter turnout to negative advertising, a decline of social capital, the rise of cable television, campaign financing, the death of the World War II generation, globalization, and a decline in voter-mobilization efforts by the political parties.
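A minimal numpy sketch of the trending-series trap just described (all data simulated): two independent random walks often show a sizable correlation in levels that collapses once the shared trends are removed by differencing.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500).cumsum()   # a trending series, e.g., population-like
y = rng.normal(size=500).cumsum()   # an independent trending series, e.g., economy-like
print("correlation of levels:     ", round(np.corrcoef(x, y)[0, 1], 2))
# Differencing removes the trends; the spurious correlation disappears.
print("correlation of differences:", round(np.corrcoef(np.diff(x), np.diff(y))[0, 1], 2))
```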

External validity refers to the generalizability of a relationship outside the setting of the study. Perhaps the characteristic that most distinguishes the social sciences from the hard sciences is that social scientists do not have the luxury of performing controlled experiments. One cannot go back in history and change events to determine hypothetical counterfactuals, while physicists may repeatedly bash particles together and observe how changing conditions alter outcomes. The closest the social sciences come to controlled experiments is in laboratory settings, where human subjects are observed responding to stimuli under controlled conditions. But are these laboratory experiments externally valid with respect to real situations?

In a classic psychology experiment, a subject seated in a chair is told that the button in front of them is connected to an electric probe attached to a second subject. When the button is pushed an increasing amount of voltage is delivered. Unknown to the subject, the button is only hooked to a speaker, simulating screams of pain. Under the right circumstances, subjects are coerced into delivering what would be fatal doses of voltage.

Such laboratory experiments raise the question of whether, in real situations, subjects would respond in a similar manner and deliver a fatal charge to another person, i.e., is the experiment externally valid? Psychologists, sociologists, political scientists, economists, cognitive theorists, and others who conduct social-science laboratory experiments painstakingly make the laboratory as close to the real world as possible, in order to control for the confounding influence that people may behave differently when they know they are being observed. For example, this may take the form of one-way windows for observing child behavior. Unfortunately, the laboratory atmosphere is sometimes impossible to remove, as with subjects engaged in computer simulations, and subjects are usually aware, prior to engaging in a laboratory experiment, that they are being observed.

External validity is also an issue in forecasting, where models based on observed relationships may fail to predict hypothetical or unobserved events. For example, economists often describe the stock market as a random walk. Despite analyst charts that graph levels of support and simple trend lines, no model exists to predict what the market will do in the future. For this reason, mutual funds come with the disclaimer, “past performance is no guarantee of future returns”; a successful mutual fund manager is likely to be no more successful than another in the next business quarter.

The stock market is perhaps the best example of a system that is highly reactive to external shocks, and unanticipated shocks are the bane of forecasting. As long as conditions remain constant, a model will be at least somewhat accurate, but if the world fundamentally changes, the model may fail. Similarly, forecasts of extreme values outside the scope of the research design may fail, and when the world acts within the margin of error of the forecast, predictions, such as the winner of the 2000 US presidential election, may be indeterminate.


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000463

Design of Experiments

C.A. Albers, T.R. Kratochwill, in International Encyclopedia of Education (Third Edition), 2010

Causal Studies

Causal studies are designed to determine whether one or more variables actually cause changes in one or more outcome variables. Internal validity, which is the extent to which we can accurately state that the independent variable caused the changes in the outcome variable(s), is a critical component within causal studies. To establish internal validity (and thus a causal relationship), three criteria need to be adequately addressed. These criteria consist of:

1. temporal precedence, which is establishing that the cause (i.e., independent variable) occurs before the effect (i.e., outcome);

2. establishing that the cause and effect are related and/or covary; and

3. establishing that there are no plausible alternative explanations.

The most appropriate way to address these criteria is through the use of various research designs; the stronger the research design, the more likely it is that the researcher can control sources of error in the methods and results. Specific procedures for controlling this variance include: (a) randomization, (b) building conditions/factors into the design as independent variables, (c) holding conditions/factors constant, and (d) statistical adjustments. Whereas (a), (b), and (c) are addressed when structuring the research design, (d) is conducted during the analysis stage.
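As a concrete illustration of procedure (a), here is a minimal Python sketch of random assignment; the helper name randomly_assign and the subject labels are invented for illustration.

```python
import random

def randomly_assign(subjects, conditions=("treatment", "control"), seed=42):
    """Randomly assign each subject to a condition.

    Shuffling guarantees that any pre-existing differences between the
    resulting groups are due to chance alone.
    """
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    # Deal subjects into conditions round-robin so group sizes stay balanced.
    return {c: shuffled[i::len(conditions)] for i, c in enumerate(conditions)}

groups = randomly_assign([f"S{i:02d}" for i in range(20)])
print(groups["treatment"])
print(groups["control"])
```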

A variety of experimental research designs can be used within causal studies; these designs can be classified as (1) randomized experimental, (2) quasi-experimental, or (3) single-case designs. The primary and critical difference is random assignment: randomized experimental designs assign units to conditions at random, whereas quasi-experimental designs substitute a control group or the use of multiple measures for random assignment. Although single-case designs are frequently classified as quasi-experimental designs, we include them as a separate classification within this article because they can serve as an alternative to large, aggregate group designs. The variations within each type of design are discussed in more detail below.


URL: https://www.sciencedirect.com/science/article/pii/B9780080448947013804

Experimentation in Economics

Francesco Guala, in Philosophy of Economics, 2012

4.1 External validity and representativeness

We have seen that the logic of the perfectly controlled experiment leads quite naturally to endorsing a severity approach to inductive inference. The method of the perfectly controlled experiment, however, is most useful for solving internal validity problems, that is, when the issue is to find out what is going on within a given experimental set-up or laboratory system. Since the method relies heavily on the control of background conditions (Ki) in order to obtain truly informative evidence, there is usually a trade-off between internal and external validity. A simple experiment that reproduces many of the idealisations of a theoretical model is usually easier to control in the laboratory; but it also constitutes a weaker starting point for extending the experimental knowledge thus obtained to other situations of interest (where such idealisations do not hold).

So how can we tackle the external validity problem constructively? And can we indicate a solution that is consistent with the logic of the severity approach? Ideally, we would like to have a unique inductive methodology that is able to capture both types of inferential moves.

Following an old tradition in experimental psychology, Robin Hogarth [2005] argues that the problem of external validity should be framed in terms of representativeness. There are, more precisely, at least two dimensions of representativeness in an economic experiment: the sample of subjects and the design. Whereas statistical techniques and random sampling can be used to tackle subject representativeness, the choice of design is rarely seen as a problem of the same kind. The designs of economic and psychology experiments are often highly idiosyncratic compared to real-world situations, and are certainly not randomly picked from the target population (e.g., the set of real-life choice situations or real market decisions). For this reason, the “representativeness” framework may be helpful in highlighting the nature of the problem, but it does not do much to point toward a solution, as far as problems of design are concerned.

Why is the method of random sampling from a set of real-life situations not followed by experimental economists? Random sampling makes sense only if you are trying to capture a central tendency in a population of individuals with varying traits. But there may be no such central tendency in a set of, say, market exchanges. Consider bargaining: economic theory suggests that different details of the bargaining situation can influence the outcome drastically. If this is true, then an average description of different bargaining outcomes is likely to be rather uninformative and to obscure all the interesting variations in the data. What we want is instead to be able to understand how different factors or causal mechanisms interact to generate different outcomes. This is why experimenters sometimes privilege simple designs or game-situations that capture the working of just one mechanism in isolation, where somewhat “extreme” results are vividly instantiated.


URL: https://www.sciencedirect.com/science/article/pii/B978044451676350021X

Methods for Approximating Random Assignment: Regression Discontinuity and Propensity Scores

L.M. Scheier, in International Encyclopedia of Education (Third Edition), 2010

Summary

Randomization affords researchers a clear method of ensuring that any differences between units or individuals assigned to experimental conditions are due to chance alone. It is an optimal design strategy for fending off certain threats to internal validity, and it provides the foundation for true experiments seeking to make valid causal inferences. When subjects have an equal probability of being selected and assigned to an experimental condition, the unique characteristics they bring to the laboratory or field experiment are equated across treatment conditions. Simply put, the goal of randomization is to equate subjects prior to their assignment to experimental conditions. When this occurs, researchers can make more confident assertions about whether a specific manipulation produced the anticipated treatment effect by design, as opposed to by chance alone.

In the realm of educational studies, it is not always possible to assign students (or teachers) to treatment conditions using random assignment methods. In many cases, well-known studies examining the role of school vouchers, private versus public school education, efficacy of school-based drug-prevention, grade retention, and evaluations of many remedial instructional modalities to improve learning and achievement take shape as observational, quasi-experimental studies lacking the precision afforded by random assignment. These studies tend to be more economical and less cumbersome than randomized trials; however, the absence of complete randomization hampers scientists’ abilities to make valid causal inferences. In recent years, several alternative approaches have been proposed to accommodate the necessity of controlling nuisance factors that may diminish the authority of causal attributions. With these tools in hand, a researcher is much closer to being able to state that a certain manipulation resulted in a specific effect and thus reinforce the cause–effect relationship that is the backbone of all scientific effort.

Although several viable alternatives to randomization exist, two in particular are covered in this article (instrumental variables and fixed-effect methods as possible remedies to randomization are discussed elsewhere in the encyclopedia). Propensity scores provide a parsimonious and efficient remedy to the problem of obtaining unbiased estimates of treatment effectiveness by adjusting or balancing treatment group differences based on a single composite characteristic. Any bias associated with treatment condition assignment is controlled statistically through the covariate adjustment, and subjects are balanced on their propensity for selection. Importantly, models using this approach are only as valid as the model selection process used to include covariates in the scalar function. Hidden or missing covariates can differentiate participants in ways not considered and alter the statistical outcomes or at the very least undermine confidence in any causal interpretation.
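Here is a minimal sketch of the propensity-score idea in Python; the data are simulated, inverse-propensity weighting is just one of several ways to use the scores (alongside matching and subclassification), and scikit-learn's logistic regression stands in for whatever selection model a study would justify.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=(n, 3))                            # observed covariates
p_true = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))  # selection depends on x
treated = rng.random(n) < p_true
y = 2.0 * treated + x[:, 0] + rng.normal(size=n)       # true effect = 2.0

naive = y[treated].mean() - y[~treated].mean()         # biased by selection

# Estimate each subject's propensity for treatment, then reweight.
ps = LogisticRegression().fit(x, treated).predict_proba(x)[:, 1]
w = np.where(treated, 1 / ps, 1 / (1 - ps))            # inverse-propensity weights
ate = (np.average(y[treated], weights=w[treated])
       - np.average(y[~treated], weights=w[~treated]))
print(f"naive difference: {naive:.2f}, IPW estimate: {ate:.2f}")  # IPW ~ 2.0
```

Note how the naive comparison is biased upward because the covariate that raises the outcome also raises the chance of treatment; the weighted contrast recovers something close to the true effect, but only because the selection model includes the right covariates, which is exactly the hidden-covariate caveat raised above.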

Another remedy discussed in this article involves dichotomizing samples in a way that permits investigators to make meaningful comparisons or treatment contrasts as if the subjects had been randomly assigned to discrete experimental groups. While these variants on randomization have less stringent requirements for balancing preexisting differences, they still afford a means of comparison not available with subclassification or other covariate-adjustment methods. The technique of regression discontinuity (RD) rests on the assumption that natural boundary conditions often mimic random assignment to treatment conditions. With RD methods, an investigator designates (assigns) experimental versus control groups based on students scoring just above (or below) a certain threshold on some benchmark performance criterion. As a special feature, the selection mechanism using the assignment-variable cutoff point is fully known (i.e., there is no measurement error), and there is no hidden bias that can influence the estimate of the treatment effect. The elegance of this approach is that it can be mixed with randomization techniques, can utilize multiple cutoff points, and can incorporate more than one treatment. When certain assumptions regarding internal validity are met, the resulting estimate of the treatment effect is virtually unbiased.
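A minimal sketch of a sharp RD estimate in Python (the data, cutoff, and bandwidth are all invented for illustration): fit separate local linear regressions on each side of the cutoff and compare their intercepts at the threshold.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4_000
score = rng.uniform(0, 100, n)            # assignment variable (e.g., a test score)
cutoff, jump = 50.0, 4.0
treated = score >= cutoff                 # deterministic assignment at the cutoff
y = 10 + 0.2 * score + jump * treated + rng.normal(0, 3, n)

h = 10.0                                  # bandwidth: keep only scores near the cutoff
left = (score >= cutoff - h) & (score < cutoff)
right = (score >= cutoff) & (score <= cutoff + h)

# np.polyfit returns [slope, intercept]; centering at the cutoff means the
# intercept is each side's predicted outcome exactly at the threshold.
b_l = np.polyfit(score[left] - cutoff, y[left], 1)
b_r = np.polyfit(score[right] - cutoff, y[right], 1)
print("RD estimate of the treatment effect:", round(b_r[1] - b_l[1], 2))  # ~4
```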

All told, despite differences between these two popular methods of approximating randomization, both are geared toward removing extraneous variance, or controlling for preexisting treatment-group differences, that might bias the estimation of treatment effects. Random assignment is a major canon of experimental design, but it is not the sine qua non that defines experimentation. Other important components include selection of the treatment, dosage, method of dispensing treatment, measuring and timing of effects, choosing participants, the nature of the comparison, and assignment protocols. Even with these additional requirements, the goal of any experimental design or sampling strategy is still to ensure that causal assertions are not mistaken, implausible, or simply false.


URL: https://www.sciencedirect.com/science/article/pii/B9780080448947017152

External Validity

G.E. Matt, ... M. Sklar, in International Encyclopedia of Education (Third Edition), 2010

Introduction

The term external validity was first introduced more than 50 years ago in a seminal paper by Campbell (1957) titled ‘Factors relevant to the validity of experiments in social settings.’ For Campbell, internal validity and external validity were the two major criteria for evaluating the validity of research designs examining causal propositions. In his definition, internal validity asked whether an experimental stimulus (e.g., treatment) made some significant difference in a specific instance. In contrast, external validity asked questions about representativeness or generalizability: To what populations, settings, and variables can an effect be generalized?

The distinction between internal and external validity evolved in subsequent writings by Campbell and Stanley (1963), Cook and Campbell (1979), Campbell (1986), and Shadish et al. (2002) into a four-part model comprising statistical conclusion validity, internal validity, construct validity (of treatments and effects), and external validity. The latter two are closely related, in that both involve inferences to more abstract constructs or universes from the manifest instances present in a particular study. The following sections briefly define key concepts, review traditional approaches for justifying generalized inferences, introduce Cook's pragmatic principles of generalized causal inference, and highlight the importance of meta-analysis in justifying the external validity of causal claims. The article concludes by outlining future directions for the theory and practice of causal generalization.


URL: https://www.sciencedirect.com/science/article/pii/B9780080448947017000

What is the main threat to internal validity?

- In within-subjects designs, the main threats to internal validity are order effects (addressed, where possible, by counterbalancing the order of conditions across subjects); the main trade-off is that within-subjects designs offer greater statistical power but weaker construct validity because of demand characteristics and reactivity.

Is random assignment a threat to internal validity?

Random assignment is not itself a threat; it is central to internal validity, because it is what allows the researcher to make causal claims about the effect of the treatment.

What are the 5 threats to internal validity?

History, maturation, selection, mortality (attrition), and the interaction of selection with the experimental variable are all threats to the internal validity of this design.

Is sampling bias a threat to internal validity?

The bias that occurs during participant selection is generally identified as a threat to external validity, whereas bias that occurs during assignment to groups is a threat to internal validity. If a significant number of participants withdraw without completing the study, selection bias can also arise over the course of the study itself.