Peter H. Schonemann
Note: numbers in brackets refer to Publications list.
Early work on matrix derivatives [1, 45]. Since differentiation is a linear map, partial differentiation interfaces with matrix notation. This simplifies work in least squares and maximum likelihood estimation [3, 6, 7, 8, 10]. Some minor papers on the non-central F-distribution [27, 66] and the non-null distribution of intraclass correlations (which is central F). Intraclass correlations provide basic data for twin research (see Quantitative Behavior Genetics, below).
Early work on Procrustes methods (least squares maps T for given A, B in B = AT+E, to minimize the sum of squared residuals in E, usually under some constraint on T, such as, in this case, orthogonality [1, 3, 6, 8, 17, 21]) and methods for machine rotation to simple structure.
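The orthogonal case of B = AT+E can be illustrated with the standard SVD-based least squares solution. This is a minimal sketch using NumPy, not a reproduction of the cited papers; the function name, the simulated matrices, and the noise level are assumptions made for illustration.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal T that minimizes the sum of squared residuals in B - A T."""
    # With the SVD A'B = U S V', the least squares orthogonal map is T = U V'.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
# Hypothetical "true" orthogonal map, taken from a QR decomposition.
T_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
B = A @ T_true + 0.01 * rng.normal(size=(50, 3))   # B = AT + E with small E

T_hat = orthogonal_procrustes(A, B)
residual_ss = np.sum((B - A @ T_hat) ** 2)
print("T'T close to identity:", np.allclose(T_hat.T @ T_hat, np.eye(3)))
print("sum of squared residuals:", residual_ss)
```

Because the constraint is placed on T rather than on E, the recovered map stays orthogonal however noisy B is; only the size of the residuals changes.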
After finally having been made aware of it (by Heerman at a meeting of the Psychometric Society in 1964, after obtaining a Ph.D. in psychometrics at the UofI), intensive study of the implications and ramifications of factor indeterminacy: In the classical factor model, the number of factors (common plus unique) always exceeds the number of observed variables, so that no unique solution for "factor scores" exists [11, 12, 20, 24, 26]. This defect of the factor model vitiates any claims that factors provide an objective basis for defining "intelligence", which had been Spearman's declared objective [40, 47, 52, 57, 75, 76]. The same indeterminacy affects models of the LISREL type.
E.B. Wilson, an acknowledged scholar of first rank, first drew attention to this issue in 1928. It was subsequently discussed by numerous competent psychometricians and statisticians (see  for an unsanitized history of this problem). During the Thurstone era of classical psychometrics, this whole problem area faded into oblivion, until it was eventually revived again in the early 70s. It is still the subject of debate today [79, 80]. Most importantly, it bears directly on Jensen's specious claim that Spearman's g factor provides "an operational definition of intelligence".
Papers [26, 40, 52] highlight one of several peculiar consequences this indeterminacy implies for the classical factor model:
Regardless of the observed variables from which they are derived, the factors of the factor model can always be chosen in such a way that they predict any criterion whatever perfectly (in a multiple regression sense).
In paper  it is shown that the power of maximum likelihood factor analysis is poor, barely exceeding twice the alpha level for moderate sample sizes (100-200). Paper  presents a comprehensive discussion of an alternative to the factor model (Regression component analysis). This methodology is not afflicted by any indeterminacy problems, and in practice it gives very similar numerical results. However, in contrast to the factor model, in the absence of further constraints it does not pose as a falsifiable theory, but rather is a purely descriptive method for reducing the data at hand.
Some early work on Thurstonian scaling [5, 10], Guttman's simplex theory , and Coombsian metric multidimensional unfolding . An algebraic solution for Horan's subjective metrics model (which underlies "INDSCAL" ) was subsequently extended into a computationally efficient and robust scaling algorithm (COSPA, [25, 29]). Later  was extended (with Wang) into a multidimensional scaling model for preference data that combines the Bradley-Terry-Luce model with Coombs' unfolding model [14, 19]. All these metric MDS models share a common characteristic: they have exact algebraic solutions and are, at least in principle, testable [35, 39].
Later empirical work with similarity data on rectangles followed up on Krantz and Tversky's (1970) lead [36, 37]. On closer scrutiny we found that dissimilarity ratings often violate some basic assumptions required by the conventional metric models, notably the Archimedean axiom, which underlies all Minkowski metrics, in particular, the Euclidean and the city-block metric [38, 41, 42, 46, 48, 59, 73].
- More generally, such findings cast doubt on the research promise of prepackaged scaling programs that ignore the actual judgement behavior of the subjects. In hindsight, the few nontrivial insights produced during the MDS craze of the 70s seem to have been mostly artifacts [36, 38, 59].
Naively, one might think that both scaling and test theory ought to relate to measurement theory in some way, since all three profess to be concerned with the problem of assigning numbers to objects or subjects. Our earlier, still relatively upbeat thoughts on these issues are summarized in [33, with I. Borg]. However, as time went on, and the anticipated empirical support of axiomatic measurement theories never materialized, we found it increasingly hard to maintain our earlier optimism about the prospective utility of such abstract theories. Eventually, this scepticism extended to mathematics more generally as a tool for solving non-fictitious problems in psychology .
(a) Problem of defining "intelligence":
In his controversial revival of the eugenic traditions of the 20s, Arthur Jensen (1969) appealed explicitly to Spearman's factor model as a vehicle for defining "intelligence". However, in view of the factor indeterminacy problem (see above, factor analysis), these high hopes are doomed to failure [40, 47, 52, 57, 83]. Recourse to concrete IQ tests is equally unsatisfactory, because different tests are often quite poorly correlated. In fact, this was the reason why Spearman had postulated his factor model in the first place. From a purely pragmatic point of view one further finds that, contrary to what some authors who should know better have claimed, conventional IQ tests are surprisingly poor predictors of most criteria of practical interest, including scholastic achievement. For example, the SAT - a descendant of conventional "verbal" IQ tests such as the Army Alpha - consistently performs worse than easily available previous grades as a predictor of subsequent grades. This has been known, though not advertised, since the 20s. For long range criteria (such as graduation or GPA at graduation), the SAT usually accounts for less than 5% of the criterion variance (Humphreys, 1967; Donlon, 1984). As one might expect, the picture dims further for the GRE: In two recent, large scale, validity studies, Horn and Hofer (undated) and Sternberg (1998) found that the validities of the GRE for predicting successful completion of graduate training were effectively zero.
- This means that no-one knows what "intelligence" is after 100 years of feverish "research". This is especially disconcerting if viewed against the historical background of the mental test movement which Jensen and his followers have tried to revive by linking untenable validity claims for IQ to equally specious "heritability" claims (see Quantitative Behavior Genetics, below).
(b) Spearman's Hypothesis
In the early 80s, Jensen (in Bias in Mental Testing, 1980) revived a casual observation Spearman had made in 1927: He had reported that subtests most highly loaded on his general intelligence factor g showed the largest Black/White contrasts (Spearman's Hypothesis). Jensen, after substituting the largest principal component (PC1) for g, interpreted this as new, compelling evidence for the existence of g which seemed to corroborate his central claim that Blacks, on average, are deficient in g compared to Whites, and that these differences are primarily genetic, rather than cultural, in origin.
In  I drew attention to the fact that this result can be explained as an artifact which has nothing to do with Blacks or g. Rather, it arises with any data, including randomly generated data, if they exhibit a sufficiently large mean difference vector. William Shockley subsequently challenged this interpretation. He correctly pointed out that it was limited to a positive relation between the mean differences and the weights of the PC1 of the pooled group, while most of Jensen's data showed such positive correlations within each group. I, therefore, extended my results to this more general situation by invoking joint multinormality as an additional condition. I then showed mathematically, geometrically, empirically, and by random simulation, the following result:
If one splits a multinormal distribution of positively intercorrelated variables into a high and a low group, one finds
(a) that the mean differences between both groups are monotonically related to the loadings on the largest principal component
(b) if both groups are of equal size, then the cosine between both vectors will not just be large but 1, while,
(c) if the groups are of different size, the effect will be more pronounced for the larger group [68, 82, 83].
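Parts (a) and (b) of this result can be checked with a small random simulation. The sketch below is illustrative only: the correlation matrix, the sample size, and in particular the splitting rule (a median split on the PC1 score of the pooled sample) are assumptions, since the text does not spell out how the high and low groups are formed.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 6, 20000

# Hypothetical positive correlation matrix built from a one-factor pattern
# (the pattern values are arbitrary and serve only as an illustration).
pattern = np.array([0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
R = np.outer(pattern, pattern)
np.fill_diagonal(R, 1.0)

X = rng.multivariate_normal(np.zeros(p), R, size=n)

# PC1 of the pooled sample: eigenvector belonging to the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, -1]
pc1 = pc1 if pc1.sum() > 0 else -pc1      # resolve the arbitrary sign

# Assumed splitting rule: median split on the PC1 score (equal group sizes).
scores = X @ pc1
cut = np.median(scores)
high, low = X[scores > cut], X[scores <= cut]
mean_diff = high.mean(axis=0) - low.mean(axis=0)

# Cosine between the mean difference vector and the PC1 weights.
cosine = mean_diff @ pc1 / (np.linalg.norm(mean_diff) * np.linalg.norm(pc1))
print("cosine between mean differences and PC1 weights:", cosine)
```

No group labels of any kind enter the simulation, so a cosine near 1 here reflects only the geometry of splitting positively intercorrelated multinormal data.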
Thus, Spearman's Hypothesis does not warrant any of the far-reaching claims Jensen and some of his followers (e.g., Herrnstein and Murray) have attached to it. In particular, it does not validate the existence of a general ability g as Jensen has asserted. Nor does it have any bearing on the race question.
Publication  is a Target Article on this topic, followed by numerous commentaries. Most of them endorse the stringency of the above reasoning. For a chronicle of the incredibly protracted history of this paper, see .
(c) Hit-Rate Bias
In view of the severe implications of a mistaken interpretation of discrepancies in IQ performance of various ethnic groups, much attention has been focused on the question whether these discrepancies might conceivably be the result of a bias favoring some groups over others (perhaps also, males over females). A. Jensen (1980) devoted a whole book to this issue, befittingly entitled Bias in Mental Testing. He concluded that such worries are unwarranted so far as the Black/White discrepancy is concerned because, if anything, conventional IQ tests overpredict Black criterion performance.
With this reasoning Jensen followed tradition in adopting an institutional point of view (e.g. that of universities) over that of the applicants, by focussing on regression equations and validity coefficients. From an institutional point of view, a test is useful if it improves the composition of the subgroup that is eventually hired or admitted on the basis of superior test performance. From this point of view, one can show that even a test with low validity has some merit, as long as the hiring institution employs a sufficiently stringent admission quota (by raising the test cut-off).
However, this narrow perspective ignores two important aspects of the bias problem:
(a) the base-rate problem:
By solely focusing on the regression equation and correlations (predictive validities), the traditional approach to the bias problem ignores the fact that a valid test can be worse than useless if the base-rates (= proportion of qualified candidates) are sufficiently skewed. To illustrate this briefly, suppose the base-rate of a clinical syndrome (e.g., schizophrenia) is 1%. In this case we could achieve 99% correct predictions by simply predicting that everybody is "normal", regardless of test performance. For a test to achieve such a high degree of correct prediction, it would have to have an unrealistically high predictive validity. More generally, validity coefficients by themselves (in the absence of knowledge of base rate and quota) are meaningless as indicators of the practical utility of a test.
Though this was already known to Meehl and Rosen (1955), it has been conveniently ignored in the meantime.
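The 1% illustration can be worked through numerically. In this sketch the function and the 90% accuracy figures for the hypothetical test are assumptions introduced for illustration; "sensitivity" and "specificity" are standard terms for the proportion of cases and of normals classified correctly, not terms used in the text.

```python
def percent_correct(base_rate, sensitivity, specificity):
    """Overall proportion of correct decisions for a yes/no diagnostic rule."""
    return base_rate * sensitivity + (1 - base_rate) * specificity

base_rate = 0.01   # 1% of the population has the syndrome

# "Everybody is normal": no case is detected, no normal person is mislabeled.
trivial = percent_correct(base_rate, sensitivity=0.0, specificity=1.0)

# Hypothetical test that classifies 90% of cases and 90% of normals correctly.
test = percent_correct(base_rate, sensitivity=0.9, specificity=0.9)

print(f"trivial rule: {trivial:.1%}   hypothetical test: {test:.1%}")
```

Under these assumed figures the test is correct less often than the rule that ignores it, which is the sense in which a valid test can be worse than useless when base-rates are skewed.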
(b) the interests of the testee (as opposed to those of the hiring or admitting institution):
Once the bias problem is cast into the language of prediction error frequencies (rather than validities and regression equations, disregarding base-rates), it becomes apparent that, to the same extent an institution benefits from use of a low validity test by tightening the (admission) quota, qualified applicants will suffer because an increasingly larger proportion of them is wrongly rejected as a result of imperfect test validity. This follows directly from Bayes' well-known theorem that relates two types of conditional probabilities. In the present context, they have the following concrete interpretation:
The conditional probability that a candidate will be successful (e.g., graduate), given that he passes the test, is called the success ratio (SR) of the test. Following standard terminology of signal detection theory, let HR denote the hit-rate, which is the reverse conditional probability that a candidate passes the test if he is qualified. Finally, let Q denote the (admission) quota, and BR the base-rate (the proportion of qualified candidates in the unselected population).
Then Bayes' Theorem asserts:
    SR = HR x BR/Q,
which expresses the conventional institutional point of view: The smaller we make the quota (by raising the test cut-off), the larger will be the success ratio, because Q shows up in the denominator.
However, if we adopt the point of view of qualified candidates, then we find (by solving the above equation for the hit-rate):
    HR = SR x Q/BR.
Now Q appears in the numerator. Hence, the tighter the admission quota, the smaller will be the hit-rate, the chance of the qualified student to be admitted.
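The two forms of Bayes' Theorem can be verified on a simple fourfold table. The applicant counts below are made up for illustration; only the definitions of SR, HR, Q and BR come from the text.

```python
# Hypothetical pool of 1000 applicants (all counts are invented for illustration).
qualified_pass, qualified_fail = 240, 360        # 600 qualified applicants
unqualified_pass, unqualified_fail = 60, 340     # 400 unqualified applicants
n = qualified_pass + qualified_fail + unqualified_pass + unqualified_fail

BR = (qualified_pass + qualified_fail) / n                  # base-rate
Q = (qualified_pass + unqualified_pass) / n                 # quota
SR = qualified_pass / (qualified_pass + unqualified_pass)   # P(qualified | pass)
HR = qualified_pass / (qualified_pass + qualified_fail)     # P(pass | qualified)

print(f"BR={BR:.2f}  Q={Q:.2f}  SR={SR:.2f}  HR={HR:.2f}")
print("SR equals HR x BR/Q:", abs(SR - HR * BR / Q) < 1e-12)
print("HR equals SR x Q/BR:", abs(HR - SR * Q / BR) < 1e-12)
```

In this invented table the institution sees a success ratio of 80%, while a qualified applicant has only a 40% chance of passing the test.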
Although these simple relations have been known for a long time, they have been consistently ignored or downplayed in the mental test literature. In particular, so far as I know, few if any systematic investigations of actual hit-rates as a function of validity, base-rate, and quotas seem to have been reported in the past. Nor have the test experts shown much interest in the problem whether the tests may be biased against certain groups, e.g., Blacks, in terms of hit-rates.
In  we (with Thompsen) derive simple approximations for hit-rates and tabulate them as a function of validity, quota, and base-rate. We also derive a bound on hit-rates,
    HR < Q/BR,
which says that tightening the quota inevitably penalizes the qualified students by lowering the hit-rate.
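How the hit-rate depends on validity, quota, and base-rate can be explored by simulation. The sketch below does not reproduce the approximations derived in the paper; it assumes that test score and criterion are bivariate normal with correlation equal to the validity, and that "qualified" means a criterion score above the cut-off implied by the base-rate. The function name and parameter values are hypothetical.

```python
import random
from statistics import NormalDist

def simulated_hit_rate(validity, quota, base_rate, n=100_000, seed=0):
    """Estimate HR = P(pass | qualified) under an assumed bivariate normal model."""
    rng = random.Random(seed)
    test_cut = NormalDist().inv_cdf(1 - quota)       # top `quota` pass the test
    crit_cut = NormalDist().inv_cdf(1 - base_rate)   # top `base_rate` are qualified
    noise_sd = (1 - validity ** 2) ** 0.5
    qualified = hits = 0
    for _ in range(n):
        x = rng.gauss(0, 1)                              # test score
        y = validity * x + noise_sd * rng.gauss(0, 1)    # criterion, corr = validity
        if y > crit_cut:
            qualified += 1
            if x > test_cut:
                hits += 1
    return hits / qualified

for quota in (0.5, 0.3, 0.1):
    hr = simulated_hit_rate(validity=0.4, quota=quota, base_rate=0.6)
    print(f"quota={quota:.1f}  HR={hr:.3f}  bound Q/BR={quota / 0.6:.3f}")
```

Under these assumptions the estimated hit-rate falls as the quota is tightened and stays below Q/BR at each quota.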
Finally, we review a number of data sets to evaluate the hit-rates of the SAT and ACT for different ethnic groups. We also assess the hit-rate bias, i.e., the extent to which conventional tests favor or discriminate against subgroups in terms of the chances that a qualified student passes the test. We found that conventional admission tests discriminate against Blacks, and further, that this bias increases as the admission quotas are tightened. In  these results are further refined and extended to include formulae for estimating the minimum validity needed for given quota and base rate, so that use of the test improves the percentage of correct decisions over random admissions. The bottom line is that, in the realistic validity range (.3 - .4) for long-range criteria of practical interest, no test improves over random admission in terms of overall percent of correct decisions if one of the two base rates exceeds .7.
One reason for the astonishing persistence of the IQ myth in the face of overwhelming prior and posterior odds against it may be the unbroken chain of excessive "heritability" claims for "intelligence", which IQ tests are supposed to "measure". However, if "intelligence" is undefined, and Spearman's g is beset with numerous problems, not the least of which is universal (and by now tacitly though grudgingly acknowledged) rejection of Spearman's model by the data, then how can the heritability of "intelligence" exceed that of milk production of cows and egg production of hens?
These problems are addressed in a series of more recent publications [54, 60, 61, 62, 63, 70, 71, 72, 75, 81]. In  it is shown that a once widely used "heritability estimate" is mathematically unsound, because Holzinger had made a mistake in his derivations which had been overlooked for decades. Another such estimate, though mathematically valid, never fits any real data. This should have been obvious from the start because it typically produces an inordinate number of inadmissible estimates (e.g., proportions larger than 1). These absurd results nevertheless found their way into print without comment or challenge. The same estimate also produces excessive "heritabilities" for variables which plainly have nothing to do with genes. For example, the "heritability" of answers to the question: "Did you have your back rubbed last year?" turns out to be 92% for males and 21% for females .
- The main problem is that all such estimates rely on simplistic mathematical models which necessarily make some unrealistically stringent assumptions. Unfortunately, they were rarely tested. Once they are tested, one finds that they are usually violated by the data. A comprehensive review of these issues is attempted in , where further references to specific subproblems can be found.