Prepublication draft of:
Chinese Journal of Psychology, 1997, 39, 173-192.
Some New Results on Hit Rates and Base Rates in Mental Testing
Peter H. Schonemann
National Taiwan University *
Recent work on hit rates and base rates (Schonemann and Thompson, 1996) is extended:
A flawed premise in the derivation of an earlier hit rate approximation, HR1, is corrected, leading to a slightly more complicated approximation, HR2. However, over the targeted parameter region, the differences between HR1 and HR2 are small. After deriving exact hit rates for 2x2 contingency tables with binary criteria, they are compared with HR1 and HR2, and also with hit rates for continuous criteria inferred, via Bayes' Theorem, from Taylor and Russell's (1939) tables. Overall, the simpler approximation HR1 outperforms HR2.
Finally, a new approximation is derived for the minimum validity a test needs in order to improve over random admissions in terms of total percent of correct classifications (%c). The results confirm concerns Meehl and Rosen voiced more than four decades ago: Validity coefficients are not sufficient for gauging the practical merit of a test, because,
"... when the base rates of the criterion classification deviate greatly from a 50 percent split, use of a test sign having slight or moderate validity will result in an increase of erroneous clinical decisions." (Meehl and Rosen, 1955, p. 215. Emphasis in the original).
This paper extends previous work on hit rates and base rates by Schonemann and Thompson (1996). Our interest in these problems was aroused when we reanalyzed data sets the NCAA (Note 1) had collected in support of a projected upward revision of standards to qualify for athletic scholarships.
Inspection of these data revealed a disproportionate number of false negatives for Blacks, compared to Whites. False negatives are false classifications of qualified candidates as "unqualified", because they do not pass the admission test. For fallible tests, some such classification errors are virtually unavoidable, as are the complementary misclassifications of unqualified candidates as "qualified", because they pass the admissions test (false positives). However, what drew our attention was that the error rates, more specifically the
Miss Rate:= Proportion of qualified students failing the test
and also the
False Alarm Rate:= Proportion of unqualified students passing the test
differed systematically between Whites and Blacks: Black miss rates consistently exceeded White miss rates, and White false alarm rates consistently exceeded Black false alarm rates. Thus, unqualified Whites benefited from the test errors at the expense of qualified Blacks. Subsequent analyses described in more detail in (Schonemann and Thompson, 1996) showed that similar asymmetries characterize rich/poor comparisons.
In an early paper, Cole (1973) argued that this constitutes a form of bias against Blacks (in this case), who face steeper odds than Whites to acquire a decent education to begin with, and then face an additional hurdle at the college entrance stage because the admission tests systematically screen out higher proportions of qualified Blacks than Whites, and screen in a higher proportion of unqualified Whites than Blacks. As Hartigan and Wigdor (1989) put it:
"Fair test use would seem to require at the very least that the inadequacies of the technology should not fall more heavily on the social groups already burdened by the effects of past and present discrimination" (p. 260).
2. Terminology and Notation:
To render the discussion of these issues manageable, the notation laid out in Table 1 will be used throughout this paper.
Cf. Table 1 following text
On the left side of Table 1, a 2x2 joint probability table is laid out. The columns represent the actual qualifications of the applicants: A candidate is either qualified (Q) or unqualified (U). For entrance tests, qualified might mean graduation with a BA, unqualified failure to receive a BA. Criteria such as these with only two outcomes are called binary, in contrast to continuous criteria such as Freshman Gradepoint Average (FGPA).
To predict qualification at the admission stage, a test is given. In Table 1, the two test outcomes are represented by the rows of the 2x2 joint probability table: A candidate either passes (P) or fails (F) the admission test. As a result, the table contains four joint probabilities, two for correct decisions and two for false decisions. The correct decisions are called true positives (with joint probability tp) and true negatives (tn), the incorrect decisions, false positives (fp, unqualified candidates who pass the test) and false negatives (fn, qualified candidates who fail the test). All four cells sum to 1. The column sum tp+fn gives the proportion of qualified candidates in the unselected population. It is called a base rate (b). Similarly, the column sum fp+tn (= 1-b) is the complementary base rate, the proportion of unqualified subjects in the population. The row sum fp+tp is the proportion of candidates who pass the test, and thus, a function of the cut-off C defining "pass" on the test. It will be called the (admission-) quota (q). The quota is a characteristic of the test, and, thus, under the control of the tester, who can raise or lower q by adjusting the test cut-off C. The base rate is a property of the population and thus not under his control. One lesson to emerge - already stressed by Meehl and Rosen (1955) but conveniently ignored ever since - is that the practical merit of a test is not just a function of its predictive validity (test-criterion correlation), but also of the base rate and the quota.
On dividing out the base rates (i.e., on conditioning the joint table on its columns), one arrives at the table of conditional probabilities on the right in Table 1. Of particular interest here will be the
(1) hit rate: hr := tp/b.
As already noted, it represents the proportion of qualified candidates who pass the test. Ideally, hr should be close to 1, but for fallible tests it may fall far short of this ideal. On forming the ratio of hit rates for Whites versus hit rates for Blacks, HRB := hrW/hrB, one obtains a measure of hit rate bias against Blacks. On reanalyzing several data sets, we found that, for college entrance tests, this bias averages out to 1.7. This means, roughly, that qualified Whites are twice as likely to pass the test as qualified Blacks.
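The definitions from Table 1 can be made concrete with a small sketch. The class and function names below are illustrative, and the two joint tables are hypothetical numbers chosen only to show the arithmetic; they are not data from the paper or from the NCAA data sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JointTable:
    """2x2 joint probabilities (rows: test outcome, columns: criterion)."""
    tp: float  # pass and qualified
    fp: float  # pass and unqualified
    fn: float  # fail and qualified
    tn: float  # fail and unqualified

    @property
    def base_rate(self) -> float:
        # b = tp + fn: proportion of qualified candidates
        return self.tp + self.fn

    @property
    def quota(self) -> float:
        # q = tp + fp: proportion of candidates passing the test
        return self.tp + self.fp

    @property
    def hit_rate(self) -> float:
        # hr = tp / b: proportion of qualified candidates who pass
        return self.tp / self.base_rate

    @property
    def miss_rate(self) -> float:
        # mr = fn / b = 1 - hr: proportion of qualified candidates who fail
        return self.fn / self.base_rate

    @property
    def false_alarm_rate(self) -> float:
        # far = fp / (1 - b): proportion of unqualified candidates who pass
        return self.fp / (self.fp + self.tn)


def hit_rate_ratio(group_1: JointTable, group_2: JointTable) -> float:
    """Ratio of hit rates, in the spirit of HRB := hrW / hrB."""
    return group_1.hit_rate / group_2.hit_rate


# Hypothetical tables for two groups (illustrative only, not from the paper).
group_a = JointTable(tp=0.45, fp=0.15, fn=0.15, tn=0.25)
group_b = JointTable(tp=0.20, fp=0.05, fn=0.20, tn=0.55)

print(group_a.hit_rate, group_b.hit_rate, hit_rate_ratio(group_a, group_b))
```

In this made-up case the hit rates are .75 and .50, so the ratio is 1.5: qualified members of the first group would be one and a half times as likely to pass as qualified members of the second.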
3. Earlier Theoretical Results
In (Schonemann and Thompson, 1996) we derived, as a simple consequence of the definitions in Table 1, a Hit Rate Bound:
(2) $hr \le q/b$,
which means that simply raising a test cut-off ("raising standards") does not ensure an improvement in correct decisions: Especially if the base rate is large, it may just raise the proportion of misses (since mr = 1 - hr). Thus, depending on one's objective, one may wish to exercise control over hit rates. This is made difficult by the fact that the relation between hr, b, q, and r cannot be solved explicitly for hr as a function of the other three parameters. However, we were able to show that for binary criteria a simple and, over the relevant parameter region, quite reasonable hit rate approximation is given by
(3) $\mathrm{HR1} := q + r_{pb} \sqrt{(1-b)/b} \, / \, 3$,
where rpb denotes the point biserial correlation measuring the validity of the test.
We derived this explicit hit rate estimate HR1 by approximating the standard normal ogive by a straight line with slope 1/3 over the base rate interval ($.3 \le b \le .7$). Our rationale for this restriction was that, beyond this range, the misclassification rates quickly reach unacceptable levels for all tests except those with purely academic validities (e.g., Estes, 1992. Note 2). More generally, all approximations considered here are intended only for the practically relevant parameter region
(4) $(0 \le r \le .5)$, $(.3 \le b \le .7)$, $(.3 \le q \le .7)$.
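As a minimal sketch of how the bound (2), the approximation (3), and the region (4) fit together, the following may help. The function names are illustrative and not taken from the original paper.

```python
import math


def hr1(q: float, b: float, r_pb: float) -> float:
    """Approximation (3): HR1 = q + r_pb * sqrt((1 - b) / b) / 3."""
    return q + r_pb * math.sqrt((1.0 - b) / b) / 3.0


def hit_rate_bound(q: float, b: float) -> float:
    """Bound (2): hr <= q / b (a proportion also cannot exceed 1)."""
    return min(1.0, q / b)


def in_target_region(r: float, b: float, q: float) -> bool:
    """Region (4): 0 <= r <= .5, .3 <= b <= .7, .3 <= q <= .7."""
    return 0.0 <= r <= 0.5 and 0.3 <= b <= 0.7 and 0.3 <= q <= 0.7


# A point in the middle of the region and one at its edge.
for q, b, r in [(0.5, 0.5, 0.3), (0.3, 0.7, 0.5)]:
    print(q, b, r, in_target_region(r, b, q), hr1(q, b, r), hit_rate_bound(q, b))
```

For b = q = .5 and r = .3 the approximation gives .5 + .3/3 = .6. At the edge point (q = .3, b = .7, r = .5) the approximation still stays below the bound q/b.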
4. Derivation of Exact Hit Rates
Our derivation of HR1 is briefly reviewed in Appendix 1. As noted there, in loc. cit. we used the within group standard deviation (sw) to norm the mean distance between the qualified and the unqualified group in strict analogy with Signal Detection Theory (SDT). However, in the present context, the total standard deviation (st) is more appropriate. As shown in Appendix 1, this revision yields an "improved" estimate
(5) $\mathrm{HR2} := q + \frac{r}{\sqrt{1-r^2}} \sqrt{(1-b)/b} \, / \, 3$,
which differs from HR1 in the factor $k := r/\sqrt{1-r^2}$. For small r, the effect of k is negligible, so that one expects that the "improved" hit rate approximation HR2 will be close to HR1. To obtain "exact" hit rates (Note 3), the above correction was applied to the point biserial formula,
(6) $r_{pb} = d'' \sqrt{b(1-b)}$,
where $d'' := (m_Q - m_U)/s_t$ is now the correctly normed mean difference. As shown in Appendix 1, this leads to
(7) $r_{pb} = \sqrt{ r_{pb}'^{2} / (1 + r_{pb}'^{2}) }$,
where $r_{pb}' := d' \sqrt{b(1-b)}$ is the point biserial based on norming with sw. Again it is clear that the revision takes effect only for rpb's near the upper boundary .5 (Note 4).
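Appendix 1 is not reproduced in this section. As a sketch of one route from (6) to (7), assume the usual decomposition of the total variance of a two-group mixture into a within-group and a between-group part, with $s_w$, $s_t$, $m_Q$, $m_U$ as in the text:

```latex
\begin{align*}
s_t^2 &= s_w^2 + b(1-b)\,(m_Q - m_U)^2
  && \text{(total = within + between)} \\
\frac{s_t^2}{s_w^2} &= 1 + b(1-b)\,d'^{2} = 1 + r_{pb}'^{2}
  && \text{since } d' = (m_Q - m_U)/s_w \\
d'' &= d'\,\frac{s_w}{s_t} = \frac{d'}{\sqrt{1 + r_{pb}'^{2}}} \\
r_{pb} &= d''\sqrt{b(1-b)} = \frac{r_{pb}'}{\sqrt{1 + r_{pb}'^{2}}}
  \quad\Longleftrightarrow\quad
  r_{pb}' = \frac{r_{pb}}{\sqrt{1 - r_{pb}^{2}}} = k
\end{align*}
```

Under this reading, the factor k in (5) is the within-normed point biserial expressed in terms of the total-normed one, which is why the two approximations nearly coincide for small r.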
After programming (7), "exact" hr's were computed iteratively. For fixed b and q, hr was varied until a specified rpb was reproduced within a small tolerance (.002). The results are tabulated in Tables 2 and 3.
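The iterative computation just described can be sketched as follows. This is not the original program: it assumes the normal, equal-variance model of Note 4 with the within-group variance set to 1, it uses bisection in place of the unspecified search procedure, and the function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def rpb_from_hit_rate(b: float, q: float, hr: float) -> float:
    """Point biserial implied by (b, q, hr) under normality and equal variances."""
    # The quota is a mixture: q = b * hr + (1 - b) * far.
    far = (q - b * hr) / (1.0 - b)
    # Mean difference in within-group SD units (d' as in SDT).
    d_prime = _STD_NORMAL.inv_cdf(hr) - _STD_NORMAL.inv_cdf(far)
    r_within = d_prime * math.sqrt(b * (1.0 - b))       # rpb' (normed with sw)
    return r_within / math.sqrt(1.0 + r_within ** 2)    # eq. (7)


def exact_hit_rate(b: float, q: float, r_target: float, tol: float = 1e-9) -> float:
    """Find hr such that the implied rpb matches r_target, for fixed b and q."""
    lo = q                      # hr = q corresponds to a test with zero validity
    hi = min(1.0, q / b)        # hit rate bound (2)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rpb_from_hit_rate(b, q, mid) < r_target:
            lo = mid            # implied validity too small: hr must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


hr_exact = exact_hit_rate(b=0.5, q=0.5, r_target=0.3)
print(hr_exact, rpb_from_hit_rate(0.5, 0.5, hr_exact))
```

For b = q = .5 and rpb = .3 this sketch returns a hit rate of about .62, slightly above the HR1 value of .60 for the same parameters.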
These "exact" hit rates were then compared with (a) the simpler HR1 (Table 2), (b) the "improved" approximation HR2 (Table 3), and (c), also, the "exact" hit rates inferred via Bayes' Theorem from Taylor and Russell's (1939) tables of the success rates (sr := tp/q. Note 5) for continuous criteria, viz:
pr(pass|qualified) = pr(qualified|pass) pr(pass)/pr(qualified)
(8) $hr = sr \times q / b$.
Cf. Tables 2 and 3 following text
The results of these comparisons are summarized in Tables 4 and 5.
Cf. Tables 4 and 5 following text
As the columns of differences in these tables show, the simpler approximation HR1 - though strictly speaking derived from a flawed premise - outperforms the "improved" approximation HR2 both for binary and for continuous criteria over the targeted parameter ranges (eq. 4). For binary criteria, the largest discrepancy is .07, with the modal discrepancy near .03. Since only 3 out of 125 discrepancies over the targeted parameter range are negative, the approximation could be further improved by raising the multiplier (1/3) slightly. This seems hardly worthwhile since the "improved" HR2 is only slightly better than HR1 for binary criteria.
Moreover, for continuous criteria, HR1 overestimates the "exact" hr's inferred from the Taylor Russell tables, with a modal discrepancy near -.02, which increases to -.03 for HR2. Thus, the simple HR1 emerges as a superior compromise overall for approximating hr's for both binary and continuous criteria over the targeted parameter region (4). As the tables also show, beyond these ranges both approximations deteriorate quickly.
5. Base Rate Problems
In loc. cit. we also briefly addressed the base rate problem, which asks how much a valid test improves over random admissions from the unselected population of applicants in terms of the total percent of correct classifications,
(9) %c := tp + tn
(cf. Table 1). To be sure, this is by no means the only plausible optimality criterion for evaluating the merit of a test.
However, as Meehl and Rosen (1955) have stressed, it certainly warrants more attention than it has received in the past. Especially for populations with severely skewed base rates - as they arise, for example, in clinical psychology - use of a test may be worse than no test at all when its use raises rather than lowers the proportion of misclassifications overall.
As a concrete illustration, Meehl and Rosen (1955) present a joint probability table (p. 198). The base rate of one of the two outcomes ("lower back pain is of organic origin") is .90, the quota is .66, and the probability of true positives is .63. For these figures, the validity of the test is between .2 and .3, which is not atypical for clinical tests. For the hit rate, one finds hr = .63/.90 = .70, and for the success rate .63/.66 = .95, all of which, so far, looks quite innocuous.
However, in terms of total percent correct, one finds from the implied joint probability table that %c = .63 + .07 = .70, which falls short of the base rate b = .90 for organic. Thus, if we throw away the test and diagnose all cases as "organic", only 1 out of 10 diagnoses will be incorrect. If we base the decision on the (valid) test, 3 out of 10 diagnoses will be incorrect.
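The arithmetic of this illustration can be retraced in a few lines. The three input figures are the ones quoted in the text; the use of the phi coefficient as the validity index for the 2x2 table is an assumption made here for simplicity, since the text does not say which coefficient underlies the "between .2 and .3" statement.

```python
import math

# Figures quoted in the text from Meehl and Rosen (1955, p. 198).
b = 0.90    # base rate of "organic"
q = 0.66    # quota: proportion classified "organic" by the test
tp = 0.63   # joint probability of true positives

# The remaining three cells follow from the margins.
fn = b - tp                 # qualified (organic) but failing the test sign
fp = q - tp                 # not organic but passing the test sign
tn = 1.0 - tp - fn - fp     # not organic and failing the test sign

hit_rate = tp / b
success_rate = tp / q
pct_correct_with_test = tp + tn
pct_correct_without_test = max(b, 1.0 - b)   # always bet on the larger base rate

# Phi coefficient of the 2x2 table (assumed validity index, see lead-in).
phi = (tp * tn - fp * fn) / math.sqrt(b * (1.0 - b) * q * (1.0 - q))

print(f"fn={fn:.2f} fp={fp:.2f} tn={tn:.2f}")
print(f"hr={hit_rate:.2f} sr={success_rate:.2f} phi={phi:.2f}")
print(f"%c with test={pct_correct_with_test:.2f}, "
      f"without test={pct_correct_without_test:.2f}")
```

The computed phi falls between .2 and .3, and the total percent correct with the test (.70) is lower than what results from ignoring the test and always diagnosing "organic" (.90).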
This example, though contrived, is instructive since most predictive validities of commercial tests are actually in this range, at least for long-range criteria worth predicting. For example, for college graduation, the SAT validities are near .2 (Crouse and Trusheim, 1988, p. 48), as are those for 8th semester college GPA (Humphreys, 1968). Thus, one might think that Meehl and Rosen's thought-provoking discussion of the base rate problem would have stimulated much excitement in testing circles.
Actually, not much has changed since they wistfully observed that "Base-rates are virtually never reported" (Meehl and Rosen, 1955, p. 194). What did change is that it has become increasingly more difficult to locate predictive validities which have not been "corrected" upwards in some ingenious way (Note 6).
Table 6 lists the total percent of correct classifications (%c) for binary criteria as a function of b, q, and rpb. These values were obtained as a byproduct of the computations of the exact hit rates described earlier. As expected, they increase with rpb, but also with the
(10) degree of synchrony of b and q := (b-.5)(q-.5),
which is positive if b and q depart in the same direction from .5, and negative if they depart in opposite directions (so that they do not match).
The right hand portion of Table 6 compares %c with the probabilities of correct decisions based on the larger base rate alone. If this difference is positive, the test increases the percent of correct decisions overall. If it is negative, it leads to a deterioration by the amount stated in the table. Cursory inspection of Table 6 shows that, for the validity range considered of practical relevance ($r \le .5$), use of a test ensures an increase in correct decisions only for base rates near .5. The more they depart from a 50/50 split, the more the merits of using the test become problematic. For base rates outside the (.3, .7) range, no test near the modal validity .3 improves over random admissions, regardless of which quota is used. Short of this range, its benefit depends on the degree of synchrony, which can only be maximized if the base rates are known to locate the cut-off for the appropriate quota. But they are not known. As Meehl and Rosen noted 40 years ago:
"the chief reason for our ignorance of the base rates is nothing more subtle than our failure to compute them" (p. 213).
Analogous results for continuous criteria, derived from the success rates (tp/q) in the tables in (Taylor and Russell, 1939), are presented in Table 7. For continuous criteria, the validity r is measured by the tetrachoric correlation. The overall results are very similar to those just discussed for binary criteria.
Finally, in Appendix 2, approximate formulae are derived for estimating the validity cut-off beyond which use of a test improves over random admissions (and betting on the outcome with the modal base rate). The first part (Appendix 2A) follows Meehl and Rosen (1955) to deduce the critical values in terms of hr (A2.5). Depending on which base rate is larger, two critical values are derived. These results are then extended to cut-offs for validities by invoking the HR1 approximation. This leads to the (rough under-) estimates (A2.10) and (A2.11):
(11) $\%c \ge b \ge .5 \iff r \ge 6(1-q)(b-.5)$, $\%c \ge 1-b \ge .5 \iff r \ge 6q(.5-b)$.
These estimates should be taken with a grain of salt in view of the various approximations invoked along the way. In practice, inspection of the %c tables (Tables 6 and 7) should suffice to gauge the merit of a test in terms of %c.
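A sketch of how the bounds in (11) can be read as lower limits on the validity r follows. Appendix 2 is not reproduced in this section, so the intermediate step shown here is an assumption: it substitutes HR1 into %c = 1 - b - q + 2 b hr (which follows from tp = b hr and the margins of Table 1) and then replaces sqrt(b(1-b)) by its maximum .5. All function names are illustrative.

```python
import math


def hr1(q: float, b: float, r: float) -> float:
    """Approximation (3): HR1 = q + r * sqrt((1 - b) / b) / 3."""
    return q + r * math.sqrt((1.0 - b) / b) / 3.0


def percent_correct(b: float, q: float, hr: float) -> float:
    """%c = tp + tn, using tp = b * hr and tn = 1 - b - q + tp."""
    return 1.0 - b - q + 2.0 * b * hr


def min_validity_rough(b: float, q: float) -> float:
    """Rough under-estimate of the critical validity, read off (11)."""
    if b >= 0.5:
        return 6.0 * (1.0 - q) * (b - 0.5)
    return 6.0 * q * (0.5 - b)


def min_validity_hr1(b: float, q: float) -> float:
    """Assumed intermediate step: same bound before sqrt(b(1-b)) is set to .5."""
    return min_validity_rough(b, q) / (2.0 * math.sqrt(b * (1.0 - b)))


b, q = 0.6, 0.6
r_star = min_validity_hr1(b, q)
# At r_star the HR1-based %c should just equal the larger base rate.
print(min_validity_rough(b, q), r_star, percent_correct(b, q, hr1(q, b, r_star)))
```

For b = q = .6 the rough bound is .24, and at the slightly larger value from the intermediate step the HR1-based %c equals the base rate .6 exactly. For the Meehl and Rosen figures (b = .90, q = .66) the rough bound is about .82, far above the validity of the test in that illustration, although b = .90 lies outside region (4), so the value is only indicative.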
However, even approximate values are an improvement over the present practice of continuing to ignore the base rate problem altogether. To jar our collective memory, I close with one more quote from Meehl and Rosen's Psychological Bulletin paper which, regrettably, has lost none of its urgency more than 40 years after it was first published:
"From the above illustrations it can be seen that the psychologist in interpreting a test and in evaluating its effectiveness must be very much aware of the population and its subclasses and the base rates of the behavior or event with which he is dealing at any given time." (Meehl and Rosen, 1955, p.199).
Note 1: National Collegiate Athletic Association. A national watchdog organization monitoring sports activities, especially football and basketball, at US colleges and universities.
Note 2: Estes (1992, p. 278) believes that
"Intelligence tests ... are excellent predictors in many domains, ranging from school to a wide variety of occupations".
However, he wisely refrained from supplying any supporting evidence for this claim.
Note 3: "Exact" within rounding error, i.e., based on the appropriate model, in contrast to an approximation based on simplifying assumptions. However, cf. Note 4 below.
Note 4: As in SDT, this "exact" expression for rpb is still contingent on two assumptions before it can be computed from 2x2 contingency tables: (a) Normality is needed to compute the mean difference from the conditional probabilities, and (b) Homoscedasticity is needed to render r, b, q identifiable in a 2x2 table of joint probabilities with three degrees of freedom. While both assumptions are routinely made in SDT, neither can be taken for granted. If one is willing to make them, then the common within variance can be set to unity, because a change of scale cancels in the numerator and denominator of rpb.
Note 5: Taylor and Russell (1939) report success ratios only to two decimal places. In this case, the appropriate validity coefficient is the tetrachoric correlation.
Note 6: It has become widespread practice to "correct" validities for attenuation. While such upward corrections may occasionally be defensible in theoretical work, it is less clear what purpose they are supposed to serve in applied work, when the test at hand of course does contain measurement error which weakens its predictive accuracy. Similarly, restriction of range corrections can only be justified, if at all, when based on the standard deviation of the subset of people who actually took the test (which may already be restricted through self selection), not on the standard deviation of the unselected population. Downward corrections of validity coefficients, on the other hand, have become exceedingly rare. For example, it has become virtually unheard of to adjust multiple correlations (downward) through cross-validation (cf. Meehl and Rosen, 1955, p. 194).
The growing practice of self-serving validity "corrections" has probably contributed to the growing public uneasiness towards claims of the mental testers. It might help to dispel it if the original validities were always presented together with the "corrected" values. Then readers can draw their own conclusions.
Cole, N. S. (1973) Bias in selection. Journal of Educational Measurement, 10, 237-255.
Crouse, J. and Trusheim, D. (1988) The Case Against The SAT. Chicago Illinois: The University of Chicago Press.
Estes, W. K. E. (1992) Ability testing: Postscript on ability tests, testing, and public policy. Cognitive Science, 5, 278.
Freeman, H. (1963) Introduction to Statistical Inference. New York: Addison-Wesley.
Hartigan, J. A. and Wigdor, A. K. (1989) Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery. Washington D.C.: National Academy Press.
Humphreys, L. G. (1968) The fleeting nature of the prediction of college academic success. Journal of Educational Psychology, 59, 375-380.
Meehl, P. E. and Rosen, A. (1955) Antecedent probability and the efficiency of psychometric signs, patterns, or cutting scores. Psychological Bulletin, 52, 194-216.
Schonemann, P. H. and Thompson, W. W. (1996) Hit rate bias in mental testing. Cahiers de Psychologie Cognitive/Current Psychology of Cognition, 15, 3-28.
Taylor, H. C. and Russell, J. T. (1939) The relationship of validity coefficients to the practical effectiveness of tests in selection: Discussion and tables. Journal of Applied Psychology, 23, 565-578.
Mental tests, IQ tests, hit rate bias, base rate problem, predictive validities.