Peter H. Schönemann
Professor Emeritus • Department of Psychological Sciences • Purdue University

Note: numbers in brackets refer to Publications list.

IQ Controversy

(a) Problem of defining "intelligence":

In his controversial revival of the eugenic traditions of the 20s, Arthur Jensen (1969) appealed explicitly to Spearman's factor model as a vehicle for defining "intelligence". However, in view of the factor indeterminacy problem (see above, factor analysis), these high hopes are doomed to failure [40, 47, 52, 57, 83]. Recourse to concrete IQ tests is equally unsatisfactory, because different tests are often quite poorly correlated. In fact, this was the reason why Spearman had postulated his factor model in the first place. From a purely pragmatic point of view one further finds that, contrary to what some authors who should know better have claimed, conventional IQ tests are surprisingly poor predictors of most criteria of practical interest, including scholastic achievement. For example, the SAT - a descendant of conventional "verbal" IQ tests such as the Army Alpha - consistently performs worse than easily available previous grades as a predictor of subsequent grades. This has been known, though not advertised, since the 20s. For long range criteria (such as graduation or GPA at graduation), the SAT usually accounts for less than 5% of the criterion variance (Humphreys, 1967; Donlon, 1984). As one might expect, the picture dims further for the GRE: In two recent, large scale, validity studies, Horn and Hofer (undated) and Sternberg (1998) found that the validities of the GRE for predicting successful completion of graduate training were effectively zero.

 

This means that no-one knows what "intelligence" is after 100 years of feverish "research". This is especially disconcerting if viewed against the historical background of the mental test movement which Jensen and his followers have tried to revive by linking untenable validity claims for IQ to equally specious "heritability" claims (see Quantitative Behavior Genetics, below)...

(b) Spearman's Hypothesis

In the early 80s, Jensen (in Bias in Mental Testing, 1980) revived a casual observation Spearman had made in 1927: He had reported that subtests most highly loaded on his general intelligence factor g showed the largest Black/White contrasts (Spearman's Hypothesis). Jensen, after substituting the largest principal component (PC1) for g, interpreted this as new, compelling evidence for the existence of g which seemed to corroborate his central claim that Blacks, on average, are deficient in g compared to Whites, and that these differences are primarily genetic, rather than cultural, in origin.

 

In [43] I drew attention to the fact that this result can be explained as an artifact which has nothing to do with Blacks or g. Rather, it arises with any data, including randomly generated data, if they exhibit a sufficiently large mean difference vector. William Shockley subsequently challenged this interpretation. He correctly pointed out that it was limited to a positive relation between the mean differences and the weights of the PC1 of the pooled group, while most of Jensen's data showed such positive correlations within each group. I therefore extended my results to this more general situation by invoking joint multinormality as an additional condition. I then showed mathematically, geometrically, empirically, and by random simulation, the following result:

 

If one splits a multinormal distribution of positively intercorrelated variables into a high and a low group, one  finds

 

(a) that the mean differences between both groups are monotonically related to the loadings on the largest principal component

 

(b)  if both groups are of equal size, then the cosine between both vectors will not just be large but 1, while,

 

(c)  if the groups are of different size,  the effect will be more pronounced for the larger group [68, 82, 83].
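The pooled-group part of this result can be illustrated with a small random simulation. What follows is only a minimal sketch, not the procedure used in [68, 82, 83]: it assumes a one-factor multinormal model with hypothetical loadings, assumes that the split into a high and a low group is made on the score of the pooled PC1, and checks only the cosine between the mean-difference vector and the pooled PC1 (not the within-group case). All function names are illustrative.

```python
import math
import random

def simulate(n=20000, loadings=(0.8, 0.7, 0.6, 0.5, 0.4, 0.3), seed=0):
    """Draw n cases from a one-factor multinormal model (all correlations positive).
    The loadings are hypothetical values chosen for illustration."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)  # common factor score
        rows.append([l * g + math.sqrt(1 - l * l) * rng.gauss(0.0, 1.0)
                     for l in loadings])
    return rows

def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows):
    """Sample covariance matrix of the pooled data."""
    n, means = len(rows), column_means(rows)
    p = len(means)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, means)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def first_pc(cov, iters=500):
    """Leading eigenvector (PC1 weights) by power iteration, scaled to unit length."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def split_cosine(rows, high_fraction=0.5):
    """Split on the pooled PC1 score (an assumption of this sketch) and return
    the cosine between the mean-difference vector and the PC1 weights."""
    pc1 = first_pc(covariance(rows))
    ranked = sorted(rows, key=lambda r: sum(w * x for w, x in zip(pc1, r)))
    cut = int(round(len(rows) * (1 - high_fraction)))
    low, high = ranked[:cut], ranked[cut:]
    diff = [h - l for h, l in zip(column_means(high), column_means(low))]
    return cosine(diff, pc1)

print(round(split_cosine(simulate()), 4))
```

The data contain no group structure other than the split itself, so a cosine close to 1 here reflects only the way the groups were formed.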

 

Thus, Spearman's Hypothesis does not warrant any of the far-reaching claims Jensen and some of his followers (e.g., Herrnstein and Murray) have attached to it. In particular, it does not validate the existence of a general ability g as Jensen has asserted. Nor does it have any bearing on the race question.

Publication [83] is a Target Article on this topic, followed by  numerous  commentaries. Most of them endorse the stringency of the above reasoning. For a chronicle of the incredibly protracted history of this paper, see [86].

(c) Hit-Rate Bias

 In  view of the severe implications  of a mistaken interpretation of discrepancies in IQ performance of various ethnic groups, much attention has been focused on the question whether these discrepancies might conceivably  be the result of  a bias  favoring some groups over others (perhaps also, males over females). A. Jensen (1980)  devoted a whole book to this issue, befittingly entitled Bias in Mental Testing. He concluded that such worries are unwarranted so far as the Black/White discrepancy is concerned because, if anything, conventional IQ tests overpredict Black criterion performance.

 

With this reasoning Jensen followed tradition in adopting an institutional point of view (e.g. that of universities) over that of the applicants, by focussing on  regression equations and  validity coefficients. From an institutional point of view, a test is useful if it improves the composition of the subgroup that is eventually hired or admitted on the basis of superior test performance.  From this point of view, one can  show that even a test with low validity has some merit, as long as the hiring institution employs a sufficiently stringent admission quota (by raising the test cut-off).

 

However, this narrow perspective ignores two important aspects of the bias problem:

 

(a)  the base-rate problem:

 

By focusing solely on the regression equation and correlations (predictive validities), the traditional approach to the bias problem ignores the fact that a valid test can be worse than useless if the base-rates (= proportion of qualified candidates) are sufficiently skewed. To illustrate this briefly, suppose the base-rate of a clinical syndrome (e.g., schizophrenia) is 1%. In this case we could achieve 99% correct predictions by simply predicting that everybody is "normal", regardless of test performance. For a test to achieve such a high degree of correct prediction, it would have to have an unrealistically high predictive validity.
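The arithmetic behind this illustration can be spelled out in a few lines. This is a minimal sketch: the 1% base-rate is taken from the text, while the sensitivity and specificity of the test (0.90 and 0.95) are hypothetical values chosen only to represent a test that would usually be regarded as quite good.

```python
def percent_correct(base_rate, sensitivity, specificity):
    """Overall proportion of correct decisions made by a test.

    sensitivity: P(test says "case"   | person is a case)
    specificity: P(test says "normal" | person is normal)
    """
    return base_rate * sensitivity + (1 - base_rate) * specificity

base_rate = 0.01                # from the text: 1% have the syndrome
always_normal = 1 - base_rate   # call everybody "normal", ignore the test

# Hypothetical test characteristics, assumed for illustration only.
with_test = percent_correct(base_rate, sensitivity=0.90, specificity=0.95)

print("ignoring the test:", always_normal)
print("using the test:   ", round(with_test, 4))
```

With these assumed values the test produces fewer correct decisions overall than the blanket prediction, because its false alarms among the 99% of normal cases outweigh its hits among the 1% of true cases.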

 

More generally, validity coefficients by themselves (in the absence of knowledge of base rate and quota) are meaningless as indicators of the practical utility of a test.

 

Though this was already known to Meehl and Rosen (1955), it has been conveniently ignored in the meantime.

(b) the interests of the testee (as opposed to that of the hiring or admitting institution):

 

Once the bias problem is cast into the language of prediction error frequencies (rather than validities and regression equations,  disregarding base-rates),  it becomes  apparent that, to the same extent an institution benefits from use of a low validity test  by tightening the (admission) quota, qualified applicants will suffer because an increasingly larger proportion of them is wrongly rejected as a result of imperfect test validity. This follows directly from Bayes' well-known theorem that  relates two types of conditional probabilities. In the present context, they have the following concrete interpretation:

 

The conditional probability that a candidate will be successful (e.g., graduate), given that he passes the test, is called the success ratio (SR) of the test. Following standard terminology of signal detection theory, let HR denote the hit-rate, which is the reverse conditional probability that a candidate passes the test if he is qualified. Finally, let Q denote the (admission) quota, and BR the base-rate (the proportion of qualified candidates in the unselected population).

 

Then Bayes' Theorem asserts:

                                                                 SR = HR x BR/Q,

 

which expresses the conventional institutional point of view: The smaller we make the quota (by raising the test cut-off), the larger will be the success ratio, because Q shows up in the denominator.

 

However, if we adopt the point of view of  qualified candidates, then we find (by  solving the above equation for the hit-rate):

                                                                HR = SR x Q /BR.

 

Now Q appears in the numerator. Hence, the tighter the admission quota, the smaller will be the hit-rate, i.e., the chance that a qualified student is admitted.
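The opposite effects of the quota on SR and HR can be made concrete numerically. The following is a minimal sketch under stated assumptions: test score and criterion are taken to be bivariate normal with correlation equal to the validity, the validity of 0.35 lies in the range quoted elsewhere on this page, and the base-rate of 0.5 is a hypothetical value. The function names are illustrative and are not taken from [78].

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def joint_pass_and_qualified(validity, quota, base_rate, steps=4000):
    """P(test score above its cut-off AND criterion above its cut-off),
    assuming a bivariate normal model, by trapezoidal integration."""
    x_cut = STD.inv_cdf(1 - quota)      # test cut-off that admits a fraction Q
    y_cut = STD.inv_cdf(1 - base_rate)  # criterion cut-off: a fraction BR qualifies
    resid_sd = math.sqrt(1 - validity ** 2)

    def f(x):
        # density of the test score times P(qualified | test score = x)
        return STD.pdf(x) * (1 - STD.cdf((y_cut - validity * x) / resid_sd))

    upper = 8.0  # the normal density is negligible beyond this point
    h = (upper - x_cut) / steps
    total = 0.5 * (f(x_cut) + f(upper))
    total += sum(f(x_cut + i * h) for i in range(1, steps))
    return total * h

def rates(validity, quota, base_rate):
    """Return (SR, HR) for the given validity, quota Q, and base-rate BR."""
    joint = joint_pass_and_qualified(validity, quota, base_rate)
    success_ratio = joint / quota      # P(qualified | passes test)
    hit_rate = joint / base_rate       # P(passes test | qualified)
    return success_ratio, hit_rate

for q in (0.5, 0.25, 0.1):
    sr, hr = rates(validity=0.35, quota=q, base_rate=0.5)
    print(f"Q = {q:4.2f}   SR = {sr:.3f}   HR = {hr:.3f}")
```

Under these assumptions the success ratio rises as the quota is tightened while the hit-rate falls, and the two remain linked by SR = HR x BR/Q at every quota.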

 

Although these simple relations have been known for a long time, they have been consistently ignored or downplayed in the mental test literature. In particular, so far as I know, few if any systematic investigations of actual hit-rates as a function of validity, base-rate, and quota seem to have been reported in the past. Nor have the test experts shown much interest in the question of whether the tests may be biased against certain groups, e.g., Blacks, in terms of hit-rates.

 

In [78]  we (with Thompsen) derive simple approximations for hit-rates and tabulate them as a function of validity, quota, and base-rate.  We also derive a  bound on hit-rates,

                                                                   HR < Q/BR,

 

which says that tightening the quota inevitably penalizes the qualified students by lowering the hit-rate.
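One way to read this bound, offered here as an interpretation rather than as the derivation given in [78], is that it follows from HR = SR x Q/BR together with the fact that SR, being a probability, cannot exceed 1. The short sketch below tabulates the resulting ceiling for a hypothetical base-rate of 0.5; the function name is illustrative.

```python
def hit_rate_ceiling(quota, base_rate):
    """Largest hit-rate any test could reach: HR = SR * Q / BR with SR at most 1.
    Capped at 1 because HR is itself a probability."""
    return min(1.0, quota / base_rate)

base_rate = 0.5  # hypothetical: half of all applicants are qualified
for quota in (0.5, 0.25, 0.1):
    print(f"Q = {quota:4.2f}   HR cannot exceed {hit_rate_ceiling(quota, base_rate):.2f}")
```

Even a test of perfect validity is subject to this ceiling: if only 10% of applicants are admitted while 50% are qualified, at most one in five qualified applicants can pass.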

Finally, we review a number of data sets to evaluate the hit-rates of the SAT and ACT for different ethnic groups. We also assess the hit-rate bias, i.e., the extent to which conventional tests favor or discriminate against subgroups in terms of the chances that a qualified student passes the test. We found that conventional admission tests discriminate against Blacks, and further, that this bias increases as the admission quotas are tightened. In [82] these results are further refined and extended to include formulae for estimating the minimum validity needed for given quota and base rate, so that use of the test improves the percentage of correct decisions over random admissions. The bottom line is that, in the realistic validity range (.3 - .4) for long-range criteria of practical interest, no test improves over random admission in terms of overall percent of correct decisions if one of the two base rates exceeds .7.

 

Evidently, these results impact directly on the ongoing affirmative action debate. In the past, such discussions have often been misguided by the fallacious notion that all one has to do to "raise standards" is to raise test cut-offs. The above formula underlines the need to clearly distinguish between predictor standards and criterion standards, especially if the test validities  are low, as they unfortunately usually are. In this case, raising test standards, instead of raising criterion standards as hoped, may simply further aggravate discrimination against minorities.