Law and Human Behavior
© American Psychology-Law Society/Division 41 of the American Psychological Association 2007

Original Article

On the Diagnosticity of Multiple-Witness Identifications

Steven E. Clark (1) and Gary L. Wells (2)

(1)  Psychology Department, University of California, Riverside, Riverside, CA 92521, USA
(2)  Iowa State University, Ames, IA, USA

Contact Information Steven E. Clark

Received: 9 March 2007  Accepted: 17 September 2007  Published online: 18 December 2007

Abstract  It is not uncommon for there to be multiple eyewitnesses to a crime, each of whom is later shown a lineup. How is the probative value, or diagnosticity, of such multiple-witness identifications to be evaluated? Previous treatments have focused on the diagnosticity of a single eyewitness’s response to a lineup (Wells and Lindsay, Psychol. Bull. 3 (1980) 776); however, the results of eyewitness identification experiments indicate that the responses of multiple independent witnesses may often be inconsistent. The present paper calculates response diagnosticity for multiple witnesses and shows how diagnostic probabilities change across various combinations of consistent and inconsistent witness responses. Multiple-witness diagnosticity is examined across variation in the conditions of observation, lineup composition, and lineup presentation. In general, the diagnostic probabilities of guilt were shown to increase with the addition of suspect identifications and decrease with the addition of nonidentifications. Foil identification results were more complicated: diagnostic of innocence in many cases, but nondiagnostic or diagnostic of guilt in biased lineups. These analyses illustrate the importance of securing clear records of all witness responses, rather than myopically focusing on the witness who identified the suspect while ignoring those witnesses who did not.

Keywords  Eyewitness identification - Legal decision making

It is clear, based on both archival analyses of actual crimes and laboratory simulations with staged crimes, that eyewitnesses make mistakes (Clark et al. 2007; Gross et al. 2005). Eyewitnesses sometimes misidentify the innocent, and they sometimes fail to identify the guilty.

Wells and Lindsay (1980) described the imperfect nature of eyewitness evidence as a Bayesian probability. The question raised by Wells and Lindsay, and later by others (Clark et al. 2007; Levi 1998; Wells and Olson 2002; Wells and Turtle 1986), was: What is the probability that the suspect S is guilty, given that the witness, when presented with the lineup, gave response R? We denote this conditional probability as p(S = G|R). Analyses using staged-crime experimental data have shown that this probability increases with a suspect identification and decreases with suspect nonidentifications (none-of-the-above responses) and foil identifications. In other words, suspect identifications are diagnostic of the suspect’s guilt, and nonidentifications and foil identifications are diagnostic of the suspect’s innocence.

Here we raise the question: What if the suspect is both identified and not identified by different witnesses? If an identification of the suspect is evidence of the suspect’s guilt and a non-identification is evidence of the suspect’s innocence, what are we to make of the joint outcome in which witness A identifies the suspect and witness B does not? Over 30 years ago, Brandon and Davies (1973) noted that “witnesses who fail to make an identification, or identify the wrong man, are not called [into court as witnesses]” (p. 30). Wells and Lindsay (1980) used Bayesian statistics to prove that it is mathematically impossible for nonidentifications of the suspect to be nondiagnostic (of innocence) under any conditions in which identifications of the suspect are diagnostic (of guilt). Wells and Lindsay lamented that “the criminal justice system has largely ignored nonidentifications by eyewitnesses” (p. 777) and described several reasons why this might be the case, all revolving around the well-known confirmation bias. Proof of this bias remains elusive, but here we offer a first formal treatment of a normative way to combine identifications and nonidentifications when there are multiple eyewitnesses.

These multiple-witness cases are not unusual. Two archival analyses reported that multiple-witness cases comprised 29% (Yuille and Tollestrup 1992), and 38% (Tollestrup et al. 1994) of 662 and 76 cases, respectively, and a third archival analysis reported that multiple witnesses comprised 58% (Wright and McDaid 1996) of lineup presentations involving 661 suspects.1 The reasons for the variation across studies are not apparent, but even the lowest estimate suggests that multiple-witness cases occur with considerable frequency. A recent study by Paterson and Kemp (2006) approached the multiple-witness question in a different way by asking 773 psychology students if they had ever witnessed a “serious” criminal event, and if so, whether there had been other witnesses to that event. Of those who reported being a witness to such an event 86% reported that there had been at least one other witness. These studies, taken together, show that multiple-witness criminal investigations are certainly not rare, and may in fact be very common.
Table 1 Example calculations for single- and multiple-witness diagnosticities

Response   TP    TA    Single witness          Two witnesses
S          .5    .1    .5/(.5 + .1) = .833     S + S  (.5)(.5)/[(.5)(.5) + (.1)(.1)] = .962
F          .2    .3    .2/(.2 + .3) = .400     S + F  (.5)(.2)/[(.5)(.2) + (.1)(.3)] = .769
N          .3    .6    .3/(.3 + .6) = .333     S + N  (.5)(.3)/[(.5)(.3) + (.1)(.6)] = .714

Note: Two-witness calculations are shown for two suspect identifications (S + S), one suspect and one foil identification (S + F), and one suspect and one nonidentification (S + N). TP = target-present and TA = target-absent; the TP and TA columns give the response probabilities used in each calculation
Fig. 1 Diagnosticity functions for long and short exposure conditions from Memon et al. (2003)

In addition, the responses of multiple witnesses are likely to be inconsistent. If suspect, foil, and nonidentification responses are given with probabilities of .50, .20, and .30, respectively, the likelihood that two independent witnesses selected randomly would give the same response2 is .5² + .2² + .3² = .38. Thus, crime investigators may often be faced with inconsistent evidence and jurors and judges may be faced with inconsistent testimony in cases that have multiple witnesses (see also Sanders and Warnick 1982).
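The agreement figure can be checked with a short calculation. The sketch below assumes two independent witnesses who share the same response distribution; the function name `agreement_probability` is illustrative and does not come from the article.

```python
def agreement_probability(response_probs):
    """Probability that two independent witnesses give the same response.

    response_probs: probabilities of the mutually exclusive responses
    (here suspect, foil, and nonidentification), which should sum to 1.
    """
    if abs(sum(response_probs) - 1.0) > 1e-9:
        raise ValueError("response probabilities must sum to 1")
    # Both witnesses give response r with probability p_r * p_r.
    return sum(p * p for p in response_probs)


p_same = agreement_probability([0.50, 0.20, 0.30])
print(round(p_same, 2))      # 0.38
print(round(1 - p_same, 2))  # 0.62
```

Under these illustrative rates, two such witnesses would disagree about 62% of the time.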

How should one consider identification evidence from multiple witnesses? Sanders and Warnick (1982) considered a very simple solution—counting. They took multiple-witness samples from several studies (Sanders and Warnick 1981; Warnick and Sanders 1980) and found that unanimity was a strong predictor of accuracy and, lacking unanimity, the most frequent response was at least a reasonable predictor of accuracy. In other words, they considered truth in numbers. As our analyses will show, there is considerable, although not complete, support for this view.

Our analyses differ from those of Sanders and Warnick in that they evaluated accuracy separately for target-present (TP) and target-absent (TA) lineups, whereas we consider TP and TA lineups together to evaluate diagnosticity, rather than accuracy. To clarify the distinction, diagnosticity refers to the probability that the suspect is guilty given a particular response, whereas accuracy refers to the probability of a particular response given that the suspect is guilty or innocent. From the standpoint of the trier of fact, diagnosticity is the key question, which can only be addressed by considering TP and TA lineups together.

In this article, we will first briefly review research on the diagnosticity of a single witness response, and extend that work to cases with two or three eyewitnesses. We then extend this general framework to several important aspects of eyewitness identification evidence, specifically the conditions of observation, the composition of the lineup, and the specific lineup procedures.

Diagnosticity of Responses from a Single Eyewitness
The diagnosticity of a witness’s response is a function of the likelihood of that response given that the suspect is guilty relative to the likelihood of that response given that the suspect is innocent.3 Thus, the probability that a suspect is guilty (S = G) given a response R can be calculated as

$$ p(S = G|R) = \frac{{p(R|S = G)p(S = G)}} {{p(R|S = G)p(S = G) + p(R|S \ne G)p(S \ne G)}} $$
where p(R|S = G) is the probability of response R given that the suspect is guilty and p(R|S ≠ G) is the probability of that same response given that the suspect is not guilty. The terms p(S = G) and p(S ≠ G) are the a priori probabilities that the suspect is guilty or innocent, prior to the witness’s response, and obviously must sum to 1.0.
Diagnosticity probabilities can be estimated from staged-crime experimental data for target-present (TP) and target-absent (TA) lineup conditions. The TP lineup condition simulates the case in which the person the police suspect of having committed the crime is indeed guilty, whereas the TA lineup condition simulates the case in which the person suspected by the police is actually innocent of the crime. Under the assumption that the lineup contains only one suspect (the remaining lineup members being foils), the term S = G is equated with the TP lineup and the term S ≠ G is equated with the TA lineup. Thus, the probability that the suspect is guilty (the lineup is a TP lineup) given a particular witness response (R) can be calculated as follows:

$$ p({\text{TP}}|R) = p(S = G|R) = \frac{{p(R|{\text{TP}})}} {{p(R|{\text{TP}}) + p(R|{\text{TA}})}} $$
Experimenters have complete control over the prior probability of guilt by simply adjusting the ratio of TP to TA lineups, which is commonly set at a 1:1 ratio (i.e., 50% of eyewitnesses view TP lineups and 50% view TA lineups). These prior probabilities affect the posterior probabilities of guilt given a response, but they do not affect diagnosticity and hence they are omitted from Eq. 2. In a later section of this article (titled Prior Probabilities), we illustrate what happens under other levels of the prior probability variable. An example for calculating p(S = G|R = S) is given in Table 1 using simple data to illustrate.
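The single-witness calculation can be sketched in a few lines. The function name `posterior_guilt` and the `rates` dictionary are illustrative assumptions; the response rates are the simple example values used for Table 1, not data from any study. With the default prior of .5 the function reduces to Eq. 2.

```python
def posterior_guilt(p_r_tp, p_r_ta, prior=0.5):
    """Eq. 1: p(S = G | R) from response rates in TP and TA lineups.

    p_r_tp: p(R | TP), the response rate when the suspect is guilty
    p_r_ta: p(R | TA), the response rate when the suspect is innocent
    prior:  p(S = G) before the witness responds; 0.5 reduces Eq. 1 to Eq. 2
    """
    numerator = p_r_tp * prior
    denominator = numerator + p_r_ta * (1.0 - prior)
    return numerator / denominator


# Illustrative response rates (TP, TA) matching the Table 1 example.
rates = {"S": (0.5, 0.1), "F": (0.2, 0.3), "N": (0.3, 0.6)}
for response, (tp, ta) in rates.items():
    print(response, round(posterior_guilt(tp, ta), 3))
# S 0.833
# F 0.4
# N 0.333
```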

Values greater than .5 are diagnostic indicators of the suspect’s guilt, whereas values less than .5 are diagnostic of the suspect’s innocence. Analyses by Wells and Lindsay (1980), Wells and Olson (2002), Wells and Turtle (1986), and Clark et al. (2007) showed that laboratory suspect identifications are indeed diagnostic of the suspect’s guilt, and that foil identifications and nonidentifications are diagnostic of the suspect’s innocence, although the diagnosticity of foil identifications may depend on how the foils were selected. The meta-analysis by Clark et al. (2007) based on 94 TP-TA comparisons showed these probabilities to be: p(S = G|S) = .77 for suspect identifications, p(S = G|F) = .37 for foil identifications, and p(S = G|N) = .38 for nonidentifications.

Diagnosticity of Responses from Multiple Witnesses

In this section, we consider diagnosticity based on the responses of multiple witnesses. In the calculation of response diagnosticity for multiple witness responses we make two assumptions. First, we assume independence of the eyewitnesses in the sense that one eyewitness’s response does not influence the other eyewitness’s response. We address this assumption later in a section titled Independence Assumptions and Violations. Second, unless otherwise noted, we assume equivalence among witnesses. For example, we assume that the two (or more) eyewitnesses had equally good opportunities to view the perpetrator, received the same pre-lineup instructions, viewed the same lineup, and so on.

Given those assumptions, the diagnosticity of responses of M multiple witnesses may be calculated as

$$ p({\text{TP}}|R_{1}, R_{2}, \ldots, R_{M}) = \frac{\prod_{i = 1}^{M} p(R_{i}|{\text{TP}})}{\prod_{i = 1}^{M} p(R_{i}|{\text{TP}}) + \prod_{i = 1}^{M} p(R_{i}|{\text{TA}})}. $$
The independence of witnesses is implied by the multiplication of response probabilities.

The calculations are quite simple, and are illustrated in Table 1 using the same response probabilities that were used to calculate single-witness diagnosticities. Thus, for example, the probability that the suspect is guilty given a suspect identification and a nonidentification is given as (.50)(.30)/[(.50)(.30) + (.10)(.60)] = .15/(.15 + .06) = .714, somewhat lower than the diagnosticity given only the single suspect identification (.833).
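A minimal sketch of Eq. 3 follows, assuming independent and equivalent witnesses and equal priors for TP and TA lineups. The function name `multi_witness_posterior` and the `rates` dictionary are illustrative; the rates are the example values from Table 1.

```python
from math import prod


def multi_witness_posterior(responses, rates):
    """Eq. 3: p(TP | R_1, ..., R_M) for independent, equivalent witnesses.

    responses: sequence of response labels, e.g. "SN" or ["S", "N"]
    rates:     maps each label to (p(R | TP), p(R | TA))
    Equal priors for TP and TA are assumed, as in Eq. 2.
    """
    like_tp = prod(rates[r][0] for r in responses)
    like_ta = prod(rates[r][1] for r in responses)
    return like_tp / (like_tp + like_ta)


rates = {"S": (0.5, 0.1), "F": (0.2, 0.3), "N": (0.3, 0.6)}
for combo in ("S", "SS", "SF", "SN"):
    print(combo, round(multi_witness_posterior(combo, rates), 3))
# S 0.833
# SS 0.962
# SF 0.769
# SN 0.714
```

Because the likelihoods are simple products, the order of the responses does not matter: SN and NS give the same value.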

The remainder of the article examines how multiple-witness diagnosticity changes as a function of the opportunity to observe, the composition of the lineup, and biases introduced by the lineup administrator before or during the identification procedure. We do not consider here all the possible combinations of witness responses because many of them seem unlikely to have prosecutorial value. For example, it is unlikely, without other evidence, that a case would go to trial based only on foil identifications or only on nonidentifications. Thus, each set of analyses takes as its starting point a single suspect identification (S), and adds foil (F) and nonidentification (N) responses. In the figures S denotes a single suspect identification, SS two suspect identifications, SSS three suspect identifications, SF a suspect and foil identification, SN a suspect and a nonidentification, and so on.

The analyses are presented as diagnosticity curves, plotting the probability, calculated from Eq. 3, that the suspect is guilty given the witness responses, denoted p(S = G|R). In each case, the response probabilities for TP and TA lineups were obtained from published studies. We chose our studies in order to examine these diagnosticities as a function of important eyewitness factors. The first analysis examines one aspect of the conditions of observation, the amount of time the witness has to observe the perpetrator. The second set of analyses examines lineup composition in terms of (a) good versus poor foils, (b) suspect-matched versus description-matched lineups, and (c) similarity between the innocent suspect and the perpetrator. The third set of analyses examines two types of administrator bias, in terms of (a) whether to make an identification, and (b) whom to identify.

Opportunity to Observe

Memon et al. (2003) presented witnesses with two versions of a staged crime, one in which the perpetrator was visible for 12 s, and one in which the perpetrator was visible for 45 s. The diagnosticity curves in Fig. 1 show different patterns for short and long exposure durations. A single suspect identification has much less probative value in the short exposure condition than in the long exposure condition. The probability that the suspect is the perpetrator, given a suspect identification, is .693 in the short and .922 in the long exposure condition. Given that .5 is the point of non-diagnosticity, the .693 probability suggests only modest probative value for the suspect identification. As additional suspect identifications are added, the suspect’s guilt is a near-certainty in the long-exposure condition, whereas there is still a small likelihood of a mistaken identification in the short-exposure condition.

Adding foil identifications to the single suspect identification produced similar results in the short and long exposure conditions. In both cases p(S = G|R) decreases. With two foil identifications added (SFF), the odds that the suspect is guilty are about even in the short-exposure condition, and there is a .682 likelihood that the suspect is innocent in the long-exposure condition.

The clearest difference between short and long exposure conditions is shown for nonidentifications, which are clear indicators of innocence for the long-exposure condition, and weak indicators of guilt in the short exposure condition. In the long exposure condition, two nonidentifications added to a single suspect identification reversed the direction of the eyewitness evidence, from near certainty of guilt to near certainty of innocence.

These results show that the information value of all witness responses (suspect identifications, foil identifications, and nonidentifications) is stronger when witnesses have more time to store information about the perpetrator in memory. The more the witness “knows” about the appearance of the perpetrator the better he or she will be at identifying matches as well as rejecting mismatches. This means, of course, that the diagnosticity of N responses depends on the witness’s conditions of observation. If the witness “got a good look” at the perpetrator, the N response is more meaningfully indicative of the suspect’s innocence, whereas if the witness did not get a very good look at the perpetrator, N responses are less informative. Thus, it makes sense that the diagnosticity of suspect and nonidentification responses would both increase in the long exposure condition. Less intuitive is why the diagnosticity of foil identifications should increase with additional exposure time. The greater diagnosticity of foil identifications arises because, with greater opportunity to store information, the likelihood of a foil identification in a TP lineup is very small.

The calculations that underlie Fig. 1 assume that multiple witnesses had the same conditions of observation. What if some witnesses had better conditions of observation than others? Specifically, one might assume that a suspect identification occurs when the witness “got a good look”, but that nonidentifications occur when the witness did not get a good look at the perpetrator. This assumption is reflected in the language of nonidentification responses, which are often viewed in the justice system as a “failure to identify” the suspect (for example, U.S. v. Telfaire 1972).

Indeed, if one assumes that the witness who identified the suspect got a good look (long-exposure condition) and the witness who made a nonidentification response did not get a good look (short-exposure condition), a combination of responses we will denote as S_L N_S (subscripts L and S marking long and short exposure), then the likelihood that the suspect is guilty is still very high (.952), despite the inconsistency among witnesses. In fact, the likelihood of a misidentification given the S_L N_S combination is lower than that for a single witness who identified the suspect (i.e., S_L or S_S). The very low risk of misidentification from the S_L N_S combination may be due in part to an unusual aspect of the Memon et al. (2003) results, namely that the probability of a nonidentification was lower in the TA lineup than in the TP lineup. Nonetheless, with that caveat, the calculations are consistent with the intuition: If the suspect-identifier got a good look and the nonidentifier did not, the nonidentification detracts little from the single identification. However, this relies on a rather big “if”, based on the assumption that suspect identifiers are reliable and nonidentifiers are not. This assumption has little support. In general, suspect identifications do have greater probative value than nonidentifications (Clark et al. 2007). However, the probative values for both suspect and nonidentification responses appear to increase as a function of exposure time, suggesting that exposure time is a more important factor to consider than the type of response. For example, based on the Memon et al. data, a nonidentification response made by a witness who got a longer look at the target has more probative value than a suspect identification made by a witness whose opportunity to observe was shorter.

What if no assumptions are made about the conditions of observation? Here the calculations are made by averaging across the short-exposure and long-exposure conditions. With no assumptions about the conditions of observation, the diagnosticity given a single suspect identification is .905. If a single nonidentifying witness is added, then the diagnosticity decreases to .685. Thus, if no assumptions are made regarding who got a good look and who did not, the diagnosticity function follows the pattern seen in Fig. 1; nonidentification responses added to a single suspect identification increase the likelihood that the suspect is innocent.
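The two cases just described, witnesses with known but different viewing conditions and witnesses whose conditions are unknown, can be sketched as follows. The response rates in `LONG` and `SHORT` are hypothetical values chosen only for illustration; they are not the Memon et al. (2003) data and do not reproduce the probabilities reported above. Averaging the rates with equal weight per condition is also an assumption of this sketch, since the exact averaging procedure is not spelled out in the text.

```python
# Hypothetical response rates (p(R | TP), p(R | TA)); NOT the Memon et al. data.
LONG = {"S": (0.80, 0.10), "F": (0.05, 0.30), "N": (0.15, 0.60)}
SHORT = {"S": (0.45, 0.20), "F": (0.25, 0.40), "N": (0.30, 0.40)}


def posterior(witnesses):
    """p(TP | responses) for independent witnesses with equal priors.

    witnesses: list of (response, rates) pairs, where rates is the table
    assumed to apply to that particular witness.
    """
    like_tp = like_ta = 1.0
    for response, rates in witnesses:
        tp, ta = rates[response]
        like_tp *= tp
        like_ta *= ta
    return like_tp / (like_tp + like_ta)


def average_rates(*tables):
    """Average rates across conditions when viewing conditions are unknown.

    Equal weighting of conditions is an assumption of this sketch.
    """
    n = len(tables)
    return {
        r: (sum(t[r][0] for t in tables) / n, sum(t[r][1] for t in tables) / n)
        for r in tables[0]
    }


UNKNOWN = average_rates(LONG, SHORT)

known_mix = posterior([("S", LONG), ("N", SHORT)])  # identifier saw more
same_view = posterior([("S", LONG), ("N", LONG)])   # both had a long look
no_assumption = posterior([("S", UNKNOWN), ("N", UNKNOWN)])
print(round(known_mix, 3), round(same_view, 3), round(no_assumption, 3))
```

With these hypothetical rates, the nonidentification costs the suspect identification less when the nonidentifier is assumed to have had the poorer view than when both witnesses are treated alike.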

Summarizing, these analyses show that suspect, foil, and nonidentification responses all have probative value that, for each response, increases as the opportunity to observe increases. Consequently, the weight given to each witness must be determined based on an assessment of the witness’s opportunity to observe, independent of his or her identification response. Foil and nonidentification responses should not be ignored or viewed as failures.

This may be a more difficult task than it appears. One way in which the witness’s opportunity to observe is assessed is by simply asking the witness. However, witnesses may be unable to assess their own opportunity to observe independently of their response. Also, feedback from the lineup administrator can distort the witness’s assessment of his or her opportunity to observe. If witnesses are led to believe they made an error, by picking a foil or by not making an identification, they may adjust their assessment of their opportunity to observe to be consistent with the outcome (Wells and Bradfield 1998). The reasoning is easy to follow: “If I failed (which I am told I did), then I must not have got a very good look at the guy.”

Lineup Composition: Good versus Poor Foils
Data comparing “good” and “poor” foils are provided by two studies, one by Lindsay and Wells (1980) and the other by Wells et al. (1993). In the Lindsay and Wells study, low-similarity foils were selected to create substantial mismatch to the target person, whereas in the Wells et al. study, foils were selected so as to mismatch the description given by the witness on at least one detail. In the two “good” foils lineup conditions, foils were selected to match either the true description of the target person (Lindsay and Wells) or a unique description of the target given by each witness (Wells et al.). We should note that the mismatch of foils in the Lindsay and Wells study was quite extreme, with mismatches on ethnicity, hair color, and the presence of facial hair, whereas the foil mismatch in the Wells et al. study was considerably less extreme. The diagnosticity functions for both studies are shown in Fig. 2.
Fig. 2 Diagnosticity functions for matching and mismatching foils conditions from (a) Lindsay and Wells (1980), and (b) Wells et al. (1993)

Both studies showed similar, although not identical, patterns of results. First, in both studies, responses were generally more diagnostic in the “good” foils condition than in the “poor” foils condition. This is shown by comparing the right-hand sides of Fig. 2a and 2b to the left-hand sides. The good-foils conditions (right-hand side) show the suspect diagnosticity to be higher and the foil and nonidentification functions to slope downward more steeply than in the poor-foils conditions (left-hand side). Thus, suspect and nonidentification responses are more meaningful when the foils have higher similarity (to either the suspect or to the description of the perpetrator). This aspect of the results was particularly evident for suspect identifications in the poor-foils condition of the Lindsay and Wells (1980) study. Results showed no diagnosticity of suspect identifications because the suspect identification rates were virtually equal in target-present and target-absent lineups. Consequently, even when one or two more suspect identifications were added, the likelihoods of guilt and innocence remained equal and near .50. This point bears repeating: The Lindsay and Wells results suggest that if the lineup is very biased in such a way that the suspect stands out, it matters little whether one, two, three, or ten (p(S = G|10 S) = .535) witnesses identify him. [The likelihood of guilt does start to pull away from .50 with 20 (p(S = G|20 S) = .570) to 50 (p(S = G|50 S) = .670) witnesses identifying the suspect.] This is an obvious exception to the truth-in-numbers rule; when the lineups are very biased, numbers do not convey much information.
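The slow growth with the number of identifying witnesses follows from Eq. 3: with equal priors and independent, equivalent witnesses, the posterior odds of guilt after n suspect identifications equal the single-witness likelihood ratio raised to the power n. The sketch below backs out the per-witness likelihood ratio implied by the reported ten-witness value of .535 and projects it forward. The function name is illustrative, and because .535 is a rounded figure the projections are approximate.

```python
def posterior_from_n_ids(likelihood_ratio, n):
    """p(S = G | n suspect identifications), equal priors, independent witnesses."""
    odds = likelihood_ratio ** n
    return odds / (1.0 + odds)


# Back out the per-witness likelihood ratio implied by the reported
# ten-witness value of .535 (rounded, so the result is approximate).
p10 = 0.535
lr = (p10 / (1.0 - p10)) ** (1.0 / 10)

for n in (1, 10, 20, 50):
    print(n, round(posterior_from_n_ids(lr, n), 3))
```

The implied likelihood ratio is only slightly above 1, and the projected values of roughly .57 for 20 witnesses and .67 for 50 witnesses are in line with the figures given above.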
Fig. 3 Diagnosticity functions for suspect-matched (SM) and description-matched (DM) lineups for (a) Wells et al. (1993), (b) Juslin et al. (1996), and (c) Tunnicliff and Clark (2000)

Although it was generally the case that witness responses were more diagnostic in lineups with more similar foils, the one exception was for foil identifications. Specifically, foil identifications in Lindsay and Wells’ low-similarity condition were strong indicators of the suspect’s guilt. In fact, for those low-similarity lineups, foil identifications were stronger indicators of the suspect’s guilt than were identifications of the suspect. This result arises from the fact that foil identifications were three times higher when the suspect was guilty (.12) than when he was innocent (.04). Although this combination of results—nondiagnostic suspect identifications and foil identifications that are diagnostic of guilt—may appear somewhat unusual, it is shown again in another analysis (Clark and Tunnicliff 2001) that examines suspect similarity (see Fig. 4).
Fig. 4 Diagnosticity functions from Clark and Tunnicliff (2001), utilizing a more-similar or less-similar innocent suspect

Lineup Composition: Suspect-Matched versus Description-Matched Lineups

The finding that poor foils lead to lower diagnosticity of witness responses is consistent with intuition: A poor testing procedure produces less-meaningful results. However, other aspects of foil selection are less intuitive. One question that arises for police when constructing lineups is whether to select foils based on their match to the suspect, or based on their match to the description of the perpetrator given by the witness. Although the two procedures may seem quite similar, they do not select equivalent sets of foils. A more complete analysis of the two methods for selecting foils can be found elsewhere (Clark 2003; Clark et al. 2007; Clark and Tunnicliff 2001; Juslin et al. 1996; Lindsay et al. 1994; Luus and Wells 1991; Wells et al. 1993). For present purposes, the important points are that the findings vary widely across studies, but one consistent finding is that foil identifications are diagnostic predictors of innocence in description-matched (DM) lineups, and are nondiagnostic predictors in suspect-matched (SM) lineups. In other words, in description-matched lineups foil identification rates are higher in TA than in TP lineups, but for suspect-matched lineups foil identification rates are about the same in TP and TA lineups.

Because of the variability across studies, data from three different studies were analyzed here. Diagnosticity functions for data from Wells et al. (1993), Juslin et al. (1996), and Tunnicliff and Clark (2000) are shown in Fig. 3.

The general diagnosticity analyses are consistent with results reported by Clark et al. (2007). First, there is no clear difference between SM and DM lineups in terms of the diagnosticity of suspect identifications or nonidentifications. The only difference between SM and DM lineups is for foil identifications, which are diagnostic of innocence in DM lineups and nondiagnostic in SM lineups. Consequently, for SM lineups, the addition of one or even two foil identifications to a suspect identification had almost no effect on the likelihoods of correct and mistaken identifications. This is shown by the flat foil functions for SM lineups in Fig. 3. The implication of these results is that foil identifications may be interpreted differently in SM and DM lineups. Consequently, the SFF combination of responses would be viewed as stronger evidence in favor of the suspect’s innocence if it were to arise from a DM lineup than a SM lineup. The pattern is strongest in the Wells et al. study where the SFF combination would be strong evidence of innocence in a DM lineup (p(S = G|SFF) = .23), but quite undiagnostic in an SM lineup (p(S = G|SFF) = .59).

Lineup Composition: Suspect Similarity

The diagnosticity of suspect identifications is also affected by the similarity of the innocent suspect to the actual perpetrator. A study by Clark and Tunnicliff (2001) manipulated this similarity directly, using two different innocent suspects. The innocent suspect who was more similar to the perpetrator had an average rating of 1.82 on a 1-to-4 scale, whereas the innocent suspect who was less similar to the perpetrator had an average rating of 1.10. It is important to note that, with an average similarity rating of 1.82, the more similar innocent suspect was by no means a dead ringer for the perpetrator.

The lineups were constructed with suspect-matched foils, and the average foil-to-suspect similarities were equated for TP and TA lineups. Thus, foils for the TP lineup were selected based on their similarity to the perpetrator, foils for the less-similar TA lineup were selected based on their similarity to the less-similar innocent suspect, and foils for the more-similar TA lineup were selected based on their similarity to the more-similar innocent suspect. Thus, the foils varied across the three lineup conditions.

The results are shown in Fig. 4. The results for the more-similar innocent suspect show a pattern that is nearly identical to the Lindsay and Wells’ (1980) results for low-similarity foils shown in Fig. 2a. Because suspect identification rates were slightly higher for the TA lineup than for the TP lineup, the diagnosticity function actually slopes slightly downward. The nondiagnosticity of suspect identifications is likely the outcome of two factors, the higher similarity of the innocent suspect and suspect-matched foil selection that by design creates biased lineups. Under these circumstances the truth-in-numbers heuristic again fails quite badly.

As shown in the right panel of Fig. 4, when the innocent suspect is less similar to the perpetrator, the diagnosticity of suspect identifications increases. An interesting aspect of the results, shown for the more- and less-similar innocent suspect lineups, is that foil identifications were diagnostic of guilt rather than innocence, a result also shown by Lindsay and Wells (1980). The pattern is likely to arise when, for the TA lineup, witnesses see a choice between identifying the suspect or identifying no one.

Lineup Administrator Bias: Whether to Identify
Clark (2005) and Steblay (1997) conducted meta-analytic reviews of 19 comparisons of unbiased and biased lineup instructions. Unbiased instructions explicitly state that the actual perpetrator may or may not be in the lineup, and that nonidentifications are acceptable responses. Biased instructions, on the other hand, imply or explicitly state that the perpetrator is in the lineup and that the witness’s task is to indicate which person he is. The clearest result of such biased instructions is that witnesses make more identifications and fewer nonidentifications in response to both target-present and target-absent lineups. The increase in the overall identification rate is distributed primarily across the foils, but also to correct (target-present) and false (target-absent) suspect identifications. The decrease in nonidentifications and the increase in foil identifications are very consistent across studies; however, the increase in correct identifications of the perpetrator is less consistent (see Clark 2005 for details). There is variation in terms of how the bias affects the probative value of suspect, foil, and nonidentification responses, as well. Averaging across the subset of studies for which diagnosticities can be calculated, biased instructions appear (in general) to produce a small decrease in the diagnosticity of suspect identifications, and negligible changes in the diagnosticity of foil identifications and nonidentifications. The results from Paley and Geiselman (1989), which are representative of the pattern of results described above, are shown in Fig. 5.
Fig. 5 Diagnosticity functions for biased and unbiased instructions conditions for Paley and Geiselman (1989)

Also consistent with the pattern of small diagnosticity differences, the diagnosticity functions are fairly similar for biased and unbiased lineup instructions. The diagnosticity curves in Fig. 5 show a slightly steeper curve for added nonidentifications for the biased instructions condition than for the unbiased instruction condition. This occurs even though the diagnosticity of nonidentification responses is not different due to biased lineup instructions. The difference is due to the lower diagnosticity of suspect identifications in the biased instructions condition, which combines multiplicatively with the diagnosticity of nonidentification responses to produce the steeper curve.

General bias in the instructions increases the likelihood that the witness will make an identification, but the bias does not aim the witness toward a particular person in the lineup, and consequently the increase in identification rate is distributed among the lineup members.4 This may contribute to the overall pattern of results that shows small differences in response diagnosticity. However, if the bias directs the witness toward a particular person in the lineup, the response diagnosticities may show large changes, as the identifications are likely not distributed but rather focused on a particular person. This is discussed next.

Lineup Administrator Bias: Whom to Identify
Police officers may direct the witness’s attention toward a particular individual, i.e., the suspect, introducing not simply a general bias toward making an identification, but rather a specific bias toward identifying the suspect (Nettles et al. 1996). This specific bias was investigated in a study by Haw and Fisher (2004). Lineup administrators were informed regarding the position of the suspect and were given incentives to obtain suspect identifications (without overtly directing the witness). In one condition there was close contact between the lineup administrator and witness, presumably making it easier for the lineup administrator to covertly guide the witness, whereas in the other condition there was less contact between administrator and witness and therefore presumably less covert communication. Their results reflected the effects of such covert communication. In particular, the false identification rate for the innocent suspect jumped from .033 in the low contact condition to .300 in the high contact condition. The diagnosticity functions are shown in Fig. 6.
Fig. 6 Diagnosticity functions for low and high contact conditions from Haw and Fisher (2004)

The results showed that the diagnosticity of suspect identifications was much higher in the low contact condition. This was due to the large jump in suspect identifications in the TA lineup condition combined with a negligible increase in suspect identifications in the TP lineup condition. Following the typical pattern, the diagnosticity changed from guilt toward innocence as foil and nonidentification responses were added. Interestingly, the change was much more dramatic when administrator-witness contact was high than when it was low. In fact, when two nonidentifications were added to the single suspect identification in the high contact condition, the calculation of diagnosticity showed that it was virtually certain that the suspect was innocent. Phrased somewhat differently, if the experimenter pushes the witness toward identifying the suspect, and the witness nonetheless rejects the lineup, it is almost certain that the suspect is innocent.

General Discussion

In this last section of the paper we review the major findings of our analyses, and return to two aspects of our analyses that we have set aside until now. First we consider the multiple-witness diagnosticities as a function of variance in the prior probabilities of guilt and innocence, and then we discuss the independence assumption underlying our calculations and the implications of violating that assumption. We then return to a discussion of the truth-in-numbers heuristic for evaluating multiple-witness identifications. Finally, we conclude the paper with a discussion of how the justice system broadly, and jurors specifically, view multiple-witness cases.

Summary of Findings

The analysis of results from Memon et al. (2003) showed that the diagnosticity of all witness responses increased when witnesses had more time to view and store information about the perpetrator. This makes sense; the information value of witness responses should increase when witnesses have more information in memory. One might expect this pattern to apply broadly to other factors that affect the quality or quantity of information in memory. However, this may not be the case. Clark and Godfrey (2007) have noted that some variables that affect information storage produce their effect primarily through decreases in correct identifications in TP lineups. For example, Morgan et al. (2004) showed that stress reduced the correct identification rate in TP lineups, but had no effect on the mistaken identification rate in TA lineups. Also, the response probabilities may vary not only due to the lack of information in memory, but also due to criterion shifts. For example, correct identification rates may remain constant as information is lost from memory over time, because witnesses adjust their criterion downward, thus increasing their overall identification rate (see Krafka and Penrod 1985 for such an example). These complexities may make it difficult to see a general pattern connecting diagnosticity to memory.

Our analyses showed that diagnosticity did vary due to lineup composition. In two very biased lineups (Clark and Tunnicliff 2001; Lindsay and Wells 1980) suspect identifications had no probative value whatsoever. Consequently, it did not matter whether one, two, or three witnesses identified the suspect; an uninformative response remains uninformative no matter how many witnesses give that response. In the less extreme case (the biased lineup conditions in Wells et al. 1993) diagnosticity of suspect identifications was still low, although not at zero.
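The point that a nondiagnostic response stays nondiagnostic however often it is repeated follows directly from Bayes' rule under the independence assumption. The short derivation below is a sketch only; the symbols $\pi$ (prior probability of guilt), $p$ (probability of a suspect identification), and $n$ (number of witnesses) are notation introduced here for illustration and do not appear in the original analyses. It assumes a suspect identification is equally likely, with probability $p > 0$, in TP and TA lineups.

```latex
\begin{align}
P(\text{guilty} \mid n \text{ suspect identifications})
  &= \frac{\pi\, p^{n}}{\pi\, p^{n} + (1-\pi)\, p^{n}} \\
  &= \frac{\pi}{\pi + (1-\pi)} = \pi .
\end{align}
```

The factor $p^{n}$ cancels for every $n$, so the posterior equals the prior whether one, two, or three witnesses identify the suspect.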

We also compared suspect-matched and description-matched lineups in three data sets from Juslin et al. (1996), Tunnicliff and Clark (2000) and Wells et al. (1993). The main result for present purposes is that all three studies showed that foil identifications had little or no diagnostic value for suspect-matched lineups, but were diagnostic of innocence for description-matched lineups. The possible mechanism underlying this result has been discussed in detail elsewhere (Clark et al. 2007), and we will not review it here. The important implication is that foil identifications may convey different information depending on how the foils were selected. Focusing on the issue of multiple identifications, the addition of one or two foil identifications to a single suspect identification may cause one to reconsider the likelihood of the suspect’s guilt if the foils were selected based on their match to the witness’s description of the perpetrator, whereas the foil identifications would have little or no probative value if the foils were selected based on their similarity to the suspect.

We also examined how diagnosticity patterns change depending on the similarity of the innocent suspect to the actual perpetrator. It is not surprising that the diagnosticity of suspect identifications should decrease with increased suspect similarity. However, there were two striking aspects of that analysis: First, the diagnosticity of suspect identifications did not merely decrease, but went to zero when the suspect was of moderate similarity and foil selection was suspect-matched. Second, foil identifications were diagnostic of guilt, rather than innocence.

The final analysis examined the effect of two kinds of bias, a general bias toward identifying someone and a specific bias toward identifying the suspect. The general bias showed only a small effect on the diagnosticity patterns, whereas the suspect-specific bias led to large changes in the diagnosticity patterns. In the Haw and Fisher (2004) study, suspect-specific bias, communicated through the close contact condition, led to a large decrease in the diagnosticity of the suspect identification responses and a large increase in the diagnosticity of nonidentifications (pointing strongly toward innocence).

Prior Probabilities

Our analyses implicitly assume that TP and TA lineups occur with equal probability. Surely, this assumption does not hold in actual criminal cases. One archival analysis (Kellstrand 2006), comparing cases with both DNA and eyewitness identification evidence obtained from the San Diego County District Attorney’s Office, reported that the person suspected of the crime was innocent in only 5% of their cases. It is not clear how representative this subset of cases is, but the results underscore the importance of considering that the prior probability of the suspect’s guilt may be quite different from .5.

We conducted our analyses, varying the prior probability of the suspect’s guilt, and one such analysis, based on the Memon et al. data, is shown as prior/posterior curves in Fig. 7. In prior/posterior curves, the prior probability of guilt is plotted on the x-axis and the posterior probability of guilt is plotted on the y-axis. Responses and response combinations that are diagnostic of guilt produce curves in the upper left-hand corner of the figure, whereas responses and response combinations that are diagnostic of innocence produce curves in the lower right-hand corner of the figure. The diagonal labeled “0” is a line of nondiagnosticity or equiprobability indicating the response or response combination has no diagnostic or predictive value.
Fig. 7 Prior/Posterior curves for long and short exposure conditions from Memon et al. (2003)

Several aspects of these curves are noteworthy. First, the general pattern and ordering for the diagnosticity of responses and response combinations does not change due to variation in the prior probabilities. Second, the increase in exposure time pulled the curves into the top-left (highly diagnostic of guilt) and bottom-right (highly diagnostic of innocence) corners of the prior/posterior curve. Third, the prior probabilities matter most when the responses are not very diagnostic, which makes sense of course; if the witness’s response is nondiagnostic, the posterior probability is determined entirely by the prior probabilities. Conversely, when witness responses are highly diagnostic the prior probabilities matter very little (note that for the long-exposure condition the SSS function is pressed into the corner unless the prior probabilities are near zero).
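The construction of a prior/posterior curve can be sketched in a few lines of Python. This is a minimal illustration of Bayes' rule, not the procedure used to draw Fig. 7; the function names and the likelihood-ratio argument (the probability of the response pattern in a TP lineup divided by its probability in a TA lineup) are assumptions made here for the example.

```python
def posterior_guilt(prior, likelihood_ratio):
    """Posterior probability of guilt for one prior and one likelihood ratio.

    likelihood_ratio is assumed to be
    P(response pattern | TP lineup) / P(response pattern | TA lineup).
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must lie between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    weighted_guilty = prior * likelihood_ratio
    denominator = weighted_guilty + (1.0 - prior)
    if denominator == 0.0:
        # Only happens when prior == 1 and likelihood_ratio == 0.
        raise ValueError("posterior is undefined for prior 1 and ratio 0")
    return weighted_guilty / denominator


def prior_posterior_curve(likelihood_ratio, n_points=11):
    """Return (prior, posterior) pairs over evenly spaced priors from 0 to 1."""
    priors = [i / (n_points - 1) for i in range(n_points)]
    return [(p, posterior_guilt(p, likelihood_ratio)) for p in priors]


if __name__ == "__main__":
    # A ratio of 1 reproduces the diagonal: posterior equals prior.
    for prior, posterior in prior_posterior_curve(1.0, n_points=5):
        print(prior, posterior)
```

A ratio above 1 bows the curve toward the upper left, a ratio below 1 bows it toward the lower right, and a ratio of exactly 1 traces the equiprobability diagonal.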

Prior/posterior curves can be produced for any of the analyses reported here. However, they can be hard to read when some of the responses are nondiagnostic. For example, the Lindsay and Wells study would produce prior/posterior functions where the S, SS, and SSS curves lie essentially on top of each other, and also on top of the equiprobability (0) line. Because many of our diagnosticity calculations would produce overlapping prior/posterior curves, we did not calculate them for our analyses.

Independence Assumptions and Violations

The analyses presented here have assumed that witnesses are in essence sampled randomly from a distribution of identification responses determined by the parameters of each experimental condition. Thus, it is assumed that witnesses saw the same crime under the same general conditions, and were presented with the same lineup using the same procedure.

Beyond those dependencies, however, the responses of multiple witnesses are otherwise assumed to be independent. That is, it is assumed in our analyses that witnesses do not share information, either about what they believed they saw, or the response that they made when presented with the lineup. In addition, it is assumed that the identification procedures remain constant as witnesses are presented with the lineup over time. There is evidence to suggest that both of these assumptions are violated in multiple-witness cases.

First, there is evidence that witnesses communicate with one another. In Patterson and Kemp’s (2006) survey, 86% of respondents indicated that they discussed the event with other witnesses. This percentage was the same for witnesses who did and did not indicate that they were interviewed by police. There are no solid data estimating how often co-witnesses actually talk to each other about whether they made an identification or which person they identified from a lineup. The evidence is clear, however, that if witnesses do share information, it can become “contagious” (Gabbert et al. 2003; Roediger et al. 2001; Wright et al. 2005). The sharing of information has the effect of increasing the correlation among witnesses, thus reducing the natural occurrence of inconsistent patterns.

Co-witnesses may be nonindependent, even if they do not communicate with each other directly. Douglass et al. (2005) showed that witness identifications could also be contagious through the lineup administrator. Their lineup administrators were initially blind as to the identity of a suspect in a lineup, but appeared to have acquired some beliefs as to the suspect’s identity through the confident but mistaken identification of a confederate witness. The lineup administrator’s acquired beliefs appear to have been passed on to subsequent witnesses as evidenced by their increased rate of mistaken identification for that previously chosen lineup member. Put another way, witness 1 confidently picks lineup member 3 from the lineup, leading the lineup administrator to believe lineup member 3 is the perpetrator, who then “infects” witness 2 with that belief, leading witness 2 to also identify lineup member 3. The effect again is to reduce the natural inconsistencies in witness responses, through their indirect co-witness communication via the lineup administrator. To the extent that agreement is taken as an index of accuracy, agreement that is inflated by direct or indirect co-witness communication may cause police, judges, attorneys, and jurors to overbelieve the witnesses.

Nonindependence among witnesses adds several layers of complexity to the analyses presented here. At the simplest level, nonindependence arises if witness 1 tells witness 2 whether he made an identification from the lineup, or whom he identified, and the response given by witness 2 changes as a result. Whether this nonindependence increases or decreases the diagnosticity of that response depends on a number of factors: (a) witness 1’s response, (b) what witness 2’s response would be without influence from witness 1, (c) whether witness 1 is more likely to leak information to witness 2 when witness 1 is correct or incorrect, and (d) whether witness 2 is more likely to follow witness 1’s lead when witness 1 is correct or incorrect.

Consider item (d) from the above list, following a suspect identification by witness 1. Will witness 2 be more likely to also identify the suspect when witness 1 is correct (suspect in a TP lineup), or when witness 1 is incorrect (suspect in a TA lineup)? Arguments can be made for either outcome. Witness 2 might be more likely to adopt witness 1’s response when a TA lineup is presented—assuming that witness 2 does not see a clear match in the lineup. This would produce an increase in TA suspect identifications, and thus a decrease in the diagnosticity of a suspect identification. Alternatively, witness 2 may be more likely to adopt witness 1’s response when witness 1 is correct, thus producing an increase in TP suspect identifications and in the diagnosticity of a suspect identification. Clark et al. (2000), using a recognition memory task, showed evidence for both of these possibilities. Participants in their study were more likely to switch from incorrect to correct responses (rather than vice versa), but were also more likely to switch from nonrecognition to recognition responses. The relationship between witness communication and response diagnosticity is one that is ripe for further investigation.

When the Truth-in-Numbers Heuristic Fails

The diagnosticity based on a set of multiple witness responses was calculated as a function of the diagnosticity of the individual witness responses. Thus, whether the truth-in-numbers heuristic works or fails for evaluating multiple responses depends on the relative diagnosticity of the individual responses. If the majority response is more diagnostic than the minority response, the truth-in-numbers heuristic will generally work; however, if the majority response is considerably less diagnostic than the minority response, the heuristic will fail. It is important that we clarify that our use of the word “works” means only that the diagnosticity calculation favors the majority response, for example that SSN responses favor guilt whereas SNN responses favor innocence. Thus, if the probability of guilt given an SSN response is .7, we would note that the heuristic “works”. Of course, a diagnosticity of .7 means that it “works” 70% of the time, hardly above the bar of reasonable doubt. With that caveat, we consider the truth-in-numbers heuristic for suspect, nonidentification, and foil responses.
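To make the combination rule concrete, the sketch below multiplies individual response probabilities under the independence assumption and converts the result to a probability of guilt. The response rates are invented for illustration and are not taken from any study discussed here; the function name, the dictionary layout, and the default prior of .5 (equal TP and TA base rates) are likewise assumptions of this example rather than a reproduction of the calculations behind the figures.

```python
from math import prod

# Hypothetical response probabilities, invented for illustration only.
# S = suspect identification, F = foil identification, N = nonidentification.
TP_RATES = {"S": 0.50, "F": 0.30, "N": 0.20}  # target-present: suspect is guilty
TA_RATES = {"S": 0.10, "F": 0.30, "N": 0.60}  # target-absent: suspect is innocent


def probability_of_guilt(responses, tp_rates=TP_RATES, ta_rates=TA_RATES, prior=0.5):
    """Probability of guilt given a string of independent responses, e.g. "SSN"."""
    # Under independence, the probability of the whole pattern is the product
    # of the individual response probabilities for each lineup type.
    p_pattern_if_guilty = prod(tp_rates[r] for r in responses)
    p_pattern_if_innocent = prod(ta_rates[r] for r in responses)
    weighted_guilty = prior * p_pattern_if_guilty
    weighted_innocent = (1.0 - prior) * p_pattern_if_innocent
    return weighted_guilty / (weighted_guilty + weighted_innocent)


if __name__ == "__main__":
    for pattern in ["S", "SS", "SN", "SSN", "SNN", "SF", "FFF"]:
        print(pattern, round(probability_of_guilt(pattern), 3))
```

With these made-up rates the heuristic "works" in the sense defined above: SSN stays above .5 and SNN falls below it, while the nondiagnostic foil response leaves the probability unchanged. If the hypothetical rates are changed so that nonidentifications are only weakly diagnostic (for example TP rates of .50, .20, .30 and TA rates of .10, .40, .50 for S, F, and N), the SNN pattern remains above .5 even though nonidentifications are the majority response.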

Suspect identifications in our calculations were generally diagnostic of guilt, when they appeared alone or when they outnumbered other responses (SSF or SSN). There were, however, two rather dramatic exceptions to that rule. When the lineups were very biased, or when the innocent suspect was more similar to the actual perpetrator (Clark and Tunnicliff 2001; Lindsay and Wells 1980), suspect identifications were nondiagnostic. If a response is nondiagnostic, little truth can emerge by simply having more of them.

It is important to repeat the point that the more-similar suspect in the Clark and Tunnicliff results was not a “dead-ringer” for the perpetrator. If an innocent suspect were a dead ringer for the perpetrator, the suspect-matched foils would likely also be very similar across TP and TA conditions, in which case all witness responses would be nondiagnostic. The possibility of a dead-ringer or highly similar innocent suspect presents a more difficult problem than does a biased lineup. Suspect similarity, unlike lineup fairness, cannot be assessed independently of the ultimate question of guilt or innocence. Thus, one can establish that a lineup is biased without making any assumptions about the suspect’s guilt or innocence. Lineup fairness evaluations using mock witnesses, for example, require no assumptions about guilt or innocence (see Malpass 1981; Wells et al. 1979). However, the issue of suspect similarity is inherently tied to guilt and innocence. One cannot establish that the suspect is a dead-ringer for the actual perpetrator without first concluding on the ultimate issue—that the suspect is innocent. In the real world, it is difficult, without other evidence, to distinguish between the false identification of a high-similarity innocent suspect versus the correct identification of the perpetrator. The patterns of identification data are likely to look very much the same.

How does the truth-in-numbers heuristic work for nonidentifications? Our analyses show inconsistent patterns. In some cases SNN responses were highly diagnostic of innocence (for example, Clark and Tunnicliff 2001, less-similar innocent suspect condition; Haw and Fisher 2004, high-contact condition; Memon et al. 2003, long exposure condition). However, in other cases the SNN response combination was highly diagnostic of guilt. This was shown in Memon et al.’s (2003) short exposure condition where nonidentifications were themselves slightly diagnostic of guilt. In other cases, nonidentifications were diagnostic of innocence, but suspect identifications were so much more diagnostic of guilt that even the SNN combination pointed toward guilt (for example, Haw and Fisher 2004, low-contact condition; Paley and Geiselman 1989, unbiased instructions). In many cases, when nonidentifications were added to suspect identifications, the added nonidentifications pulled the diagnosticity function from guilt to a region of nondiagnosticity (probabilities between .4 and .6).

The truth-in-numbers heuristic is particularly complicated when applied to foil identifications. This is because the diagnosticity of foil identifications varied considerably in our analyses. As noted before, one key may be lineup composition. Foil identifications are generally diagnostic of innocence in description-matched lineups, are relatively undiagnostic in suspect-matched lineups, and are diagnostic of guilt in very biased lineups.

Multiple Eyewitnesses in the Criminal Justice System

Wells and Lindsay (1980) argued that it is psychologically easy to dismiss the diagnostic value of nonidentifying witnesses through a variety of mechanisms related to the confirmation bias. Findlay and Scott (2006) have described how the confirmation bias (more commonly called “tunnel vision” in the legal literature) permeates the legal system. Indeed, the Honorable Nathan Sobel’s (1979) otherwise excellent treatise on eyewitness identification notes that the failure to identify the suspect, “is significant in determining that the witness has not retained the image of the perpetrator” (p. 136). At no point does Sobel note that the nonidentification of the suspect might be significant precisely because the witness has retained the image of the perpetrator, allowing him or her to correctly state that the perpetrator is not in the lineup.

Wells and Olson (2002) suggested that it is particularly easy for the casual observer to dismiss the diagnostic value of foil identifications because they are known immediately to be mistaken identifications; the eyewitness has made a mistake so the witness must have a bad memory. At one level this is true—the witness has made a mistake. But the nature of the mistake suggests that the eyewitness is saying “Your suspect in this lineup looks less like the perpetrator than does this foil,” which should logically reduce the fact-finder’s confidence that the suspect is the perpetrator. Our analyses are consistent with this logic when the lineups were fairly constructed (or assumed fair by dividing target-absent identifications by the lineup size). However, when the lineups were biased, for example by the routine police procedure of selecting foils based on their similarity to the suspect, foil identifications may become nondiagnostic or diagnostic of guilt.

If, indeed, nonidentifications and foil identifications are viewed as “failures” without information value, then one might expect that witnesses who make such responses would be overlooked or undervalued at trial. Specifically, one might expect that the initial witness list is longer than the list of witnesses who actually testify at trial, and the difference is due to the absence at trial of witnesses who “failed” to identify the suspect. There are no hard archival data on this point. However, there are relevant and compelling case-studies.

Specifically, there are well-documented cases in which juries erroneously convicted innocent defendants based on the testimony of witnesses who identified them at trial. However, in these cases there were other witnesses who, had they been called to testify, would almost certainly have stated that the defendants were not the perpetrators of the crime for which they were held to stand trial. In three such cases, the defendants, Ryan Matthews (New Orleans), Stephen Cowans (Boston), and Christopher Bennett (Ohio) were eventually exonerated based on DNA evidence. These cases present strong case-study arguments for the need to move beyond consideration of the single witness to examine response patterns of multiple witnesses.

The issues involving multiple witnesses are most dramatically illustrated in the cases of Larry Griffin and Gary Graham, who were both convicted of murder, based in large part on the identification by a single witness. Both were executed, Griffin in Missouri in 1995, and Graham in Texas in 2000. In both cases, there were other eyewitnesses who, had they been called to testify, would almost certainly have testified that neither Griffin nor Graham was the perpetrator of the crime for which he was convicted and executed (Gross and Thompson 2005; Rimer and Bonner 2000).

There is some empirical evidence suggesting that the outcomes of those cases would have been different had the jurors heard testimony from the nonidentifying eyewitnesses. Leippe (1985) had participant-jurors read case summaries in which there was a sole identifying eyewitness versus two identifying eyewitnesses, versus an identifying plus a non-identifying (“not the man”) eyewitness. In our notation, these would be S, SS, and SN. Guilty verdicts were 47% for S, and were nonsignificantly higher at 53% for SS, and dramatically reduced to 14% for SN.

A study by McAllister and Bregman (1986) presented mock jurors with two-witness combinations of identification evidence, in our notation, S, N, SS, NN, and SN. They also included a control condition presenting a single witness who neither identified nor rejected the defendant as the perpetrator of the crime. The results, given as ratings on a 1 (innocent) to 9 (guilty) scale, showed a statistically reliable increase in guilt ratings for the single identifying witness (M = 6.64) relative to the control (M = 5.14), and only a very slight, nonsignificant decrease in guilt ratings for the single nonidentifying witness condition (5.06) relative to the control. These two results, taken together, suggest that mock-jurors believed the single identification to be diagnostic of guilt, but the single nonidentifying eyewitness to have little probative value. For the multiple-witness conditions, the results showed a slight (although statistically nonsignificant) increase in guilt ratings for the SS condition (M = 7.11) relative to the S condition (M = 6.64), and a significant decrease for the NN condition (M = 3.42) relative to the N condition (M = 5.06). When a nonidentifying witness was added to a single identification (SN), the guilt ratings showed a slight (also nonsignificant) decrease (M = 6.17) relative to the single-witness identification (S) condition (M = 6.64).

These results suggest a discontinuity not shown in our calculations. Specifically, McAllister and Bregman’s mock-jurors saw little information value in a single nonidentification (as a single nonidentification did not differ from the control condition, and the SN condition did not differ), but did find two unopposed nonidentifications to be diagnostic of innocence. By contrast, in our calculations, the diagnosticity of multiple identifications is derived directly from the diagnosticity of single identifications. Thus, a response that has no diagnostic value by itself cannot gain diagnostic value by having more of them.

Final Remarks

We have applied a relatively simple set of mathematical rules for combining probabilities in multiple-eyewitness cases and we have used experimental data to observe how these joint probabilities behave as a function of witnessing conditions, lineup composition, and lineup administrator bias. Regardless of condition, it is clear that one must consider the response of each eyewitness, not just those who identify the suspect, in order to assess the likely guilt of the suspect. In fact, almost without exception, the probability of guilt associated with an identifying eyewitness (S) is reduced more by the addition of a nonidentifying eyewitness (SN) than it is increased by a second identifying eyewitness (SS) (see relative steepness of upward versus downward slopes in Figs. 1–6).

Psychological scientists have long suspected that the justice system has been unduly dismissive of nonidentifying eyewitnesses (Wells and Lindsay 1980). To the extent that there is a tendency to focus on identifying eyewitnesses and dismiss the relevance of nonidentifying eyewitnesses, some serious errors in judgment could result. We have several reasons to suspect that this type of error occurs. First, an archival study of 284 photographic lineups in Northern California showed that police failed to make any record that would distinguish foil identifications from nonidentifications, instead simply recording both as failures to identify the suspect (Behrman and Davey 2001). This same failure was noted by Tollestrup et al. (1994) in their analyses of lineups conducted by the Royal Canadian Mounted Police.5 Second, the current authors share the common experience of receiving calls from defense attorneys who want help assessing the reliability of an eyewitness identification of their client. When we ask whether there were other eyewitnesses to the crime who viewed the lineup, they often do not know, or fail to recognize the significance of a witness who is not being called to testify by the prosecution, and is thus a witness they do not need to worry about. A third reason we suspect the system is underutilizing nonidentifying eyewitnesses is the language used in long-standing judicial rulings (e.g., U.S. v. Telfaire 1972) and writings by legal scholars (e.g., Sobel 1979) that describe nonidentifications as “failures” rather than as evidence probative to the ultimate question of guilt. Fourth, the law governing the evaluation of the reliability and admissibility of eyewitness identification, articulated by the US Supreme Court in Neil v. Biggers (1972) and reaffirmed in Manson v. Brathwaite (1977), lists five factors for trial courts to consider, none of which refer to nonidentifying eyewitnesses.
A final reason that we suspect that nonidentifying witnesses are underutilized is that there is no reason to believe that the legal system is any more immune to confirmation bias than are scientists (who are themselves subject to dismissing the relevance of disconfirming data, see Greenwald 1975) or any other group of individuals (see Nickerson 1988).

Although we suspect, based on the reasons outlined above, that the legal system underutilizes eyewitness nonidentifications, there is not, to our knowledge, any archival analysis that systematically documents the utilization of nonidentifying witnesses. Such an analysis could address a number of important questions regarding who testifies at trial and who does not, and how the decisions are made. The present analyses provide a general framework for combining identifying and nonidentifying eyewitnesses to show that both affect the probabilities as to the ultimate fact relevant to judge and jury, namely whether the accused is innocent or guilty.


Brandon, R., & Davies, C. (1973). Wrongful imprisonment. London: Allen & Unwin.
Clark, S. E., & Godfrey, R. (2007). Why eyewitnesses make mistakes and jurors believe them. Paper presented at Off the Witness Stand: Using Psychology in the Practice of Justice, New York, NY.
Clark, S. E., Howell, R. T., & Davey, S. (2007). Regularities in eyewitness identification. Law & Human Behavior. Retrieved May 7, 2007 from
Clark, S. E., Hori, A., Putnam, A., & Martin, T. P. (2000). Group collaboration in recognition memory. Journal of Experimental Psychology: Learning, Memory, & Cognition, 26, 1578–1588.
Clark, S. E. (2003). A memory and decision model for eyewitness identification. Applied Cognitive Psychology, 17, 629–654.
Clark, S. E. (2005). A re-examination of the effects of biased lineup instructions in eyewitness identification. Law & Human Behavior, 29, 395–424.
Clark, S. E., & Tunnicliff, J. L. (2001). Selecting lineup foils in eyewitness identification experiments: Experimental control and real-world simulation. Law & Human Behavior, 25, 199–216.
Douglass, A. B., Smith, C., & Frasher-Thill, R. (2005). A problem with double-blind photospread procedures: Photospread administrators use one eyewitness’s confidence to influence the identification of another eyewitness. Law and Human Behavior, 29, 543–562.
Findlay, K. A., & Scott, M. S. (2006). The multiple dimensions of tunnel vision in criminal cases. Wisconsin Law Review, 2006, 291–398.
Gabbert, F., Memon, A., & Allan, K. (2003). Memory conformity: Can eyewitnesses influence each other’s memories for an event? Applied Cognitive Psychology, 17, 533–543.
Greenwald, A. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1–20.
Gross, S. R., Jacoby, K., Matheson, D. J., Montgomery, N., & Patil, S. (2005). Exonerations in the United States 1989 through 2003. Journal of Criminal Law and Criminology, 95, 523–560.
Gross, S. R., & Thompson, J. (2005). Memo to Saul Green, Attorney for Walter Moss. Re: The Murder of Quinton Moss on June 26, 1980, in the City of St. Louis, Missouri (dated June 10, 2005).
Haw, R. M., & Fisher, R. P. (2004). Effects of administrator–witness contact on eyewitness identification accuracy. Journal of Applied Psychology, 89, 1106–1112.
Juslin, P., Olsson, N., & Winman, A. (1996). Calibration and diagnosticity of confidence in eyewitness identification: Comments on what can be inferred from the low confidence–accuracy correlation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 22, 1304–1316.
Kellstrand, E. B. (2006). Eyewitness identification accuracy in cases accepted and rejected for prosecution: An archival analysis of criminal case files. Unpublished manuscript. San Diego: University of California.
Krafka, C., & Penrod, S. (1985). Reinstatement of context in a field experiment on eyewitness identification. Journal of Personality & Social Psychology, 49, 58–69.
Leippe, M. R. (1985). The influence of eyewitness nonidentifications on mock-jurors’ judgements of a court case. Journal of Applied Social Psychology, 15, 656–672.
Levi, A. M. (1998). Are defendants guilty if they were chosen in a lineup? Law and Human Behavior, 22, 389–407.
Lindsay, R. C. L., Martin, R., & Webber, L. (1994). Default values in eyewitness descriptions: A problem for the match-to-description lineup foil selection strategy. Law & Human Behavior, 18, 527–541.
Lindsay, R. C., & Wells, G. L. (1980). What price justice? Exploring the relationship of lineup fairness to identification accuracy. Law & Human Behavior, 4, 303–313.
Malpass, R. S. (1981). Effective size and defendant bias in eyewitness identification lineups. Law and Human Behavior, 5, 299–309.
Manson v. Brathwaite (1977). 432 U.S. 98.
McAllister, H. A., & Bregman, N. J. (1986). Juror underutilization of eyewitness nonidentifications: Theoretical and practical implications. Journal of Applied Psychology, 71, 168–170.
Memon, A., Hope, L., & Bull, R. (2003). Exposure duration: Effects on eyewitness accuracy and confidence. British Journal of Psychology, 94, 339–354.
Morgan, C. A., Hazlett, G., Doran, A., Garrett, S., Hoyt, G., Thomas, P., Baranoski, M., & Southwick, S. M. (2004). Accuracy of eyewitness memory for persons encountered during exposure to highly intense stress. International Journal of Law & Psychiatry, 3, 265–279.
Neil v. Biggers (1972). 409 U.S. 188.
Nettles, B., Nettles, Z. S., & Wells, G. L. (1996). I noticed you paused on number three. Biased testing in eyewitness identification. Champion, Nov. pp 10–12, 57–59.
Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2, 175–220.
Paley, B., & Geiselman, R. E. (1989). The effects of alternative photospread instructions on suspect identification performance. American Journal of Forensic Psychology, 7, 3–13.
Paterson, H. M., & Kemp, R. I. (2006). Co-witness talk: A survey of eyewitness discussion. Psychology, Crime, & Law, 12, 181–191.
Roediger, H. L., III, Meade, M. L., & Bergman, E. T. (2001). Social contagion of memory. Psychonomic Bulletin & Review, 8, 365–371.
Sanders, G. S., & Warnick, D. H. (1981). Truth and consequences: The effect of responsibility on eyewitness behavior. Basic and Applied Social Psychology, 2, 67–79.
Sanders, G. S., & Warnick, D. H. (1982). Evaluating identification evidence from multiple eyewitnesses. Journal of Applied Social Psychology, 12, 182–192.
Sobel, N. R. (1979). Eyewitness identification: Legal and practical problems (2nd ed.). New York: Boardman.
Steblay, N. M. (1997). Social influence in eyewitness recall: A meta-analytic review of lineup instruction effects. Law and Human Behavior, 21, 283–297.
Tollestrup, P. A., Turtle, J. W., & Yuille, J. C. (1994). Actual victims and witnesses to robbery and fraud: An archival analysis. In D. F. Ross, J. D. Read, & M. P. Toglia (Eds.), Adult eyewitness testimony: Current trends and developments (pp. 144–160). New York: Cambridge.
Tunnicliff, J. L., & Clark, S. E. (2000). Selecting foils for identification lineups: Matching suspects or descriptions? Law & Human Behavior, 24, 231–258.
U.S. v. Telfaire. (1972) 152 U.S. App. D.C. 146; 469 F.2d 552.
Warnick, D. H., & Sanders, G. S. (1980). Why do witnesses make so many mistakes? Journal of Personality and Social Psychology, 10, 362–366.
Wells, G. L. (1988). Eyewitness identification: A system handbook. Toronto: Carswell Legal Publications.
Wells, G. L., & Bradfield, A. L. (1998). “Good, you identified the suspect”: Feedback to eyewitnesses distorts their reports of the witnessing experience. Journal of Applied Psychology, 83, 360–376.
Wells, G. L., Leippe, M. R., & Ostrom, T. M. (1979). Guidelines for empirically assessing the fairness of a lineup. Law and Human Behavior, 3, 285–293.
Wells, G. L., & Lindsay, R. C. (1980). On estimating the diagnosticity of eyewitness nonidentifications. Psychological Bulletin, 3, 776–784.
Wells, G. L., & Olson, E. A. (2002). Eyewitness identification: Information gain from incriminating and exonerating behaviors. Journal of Experimental Psychology: Applied, 3, 155–167.
Wells, G. L., Rydell, S. M., & Seelau, E. P. (1993). The selection of distractors for eyewitness lineups. Journal of Applied Psychology, 78, 835–844.
Wells, G. L., Small, M., Penrod, S., Malpass, R. S., Fulero, S. M., & Brimacombe, C. A. E. (1998). Eyewitness identification procedures: Recommendations for lineups and photospreads. Law and Human Behavior, 22, 603–647.
Wells, G. L., & Turtle, J. W. (1986). Eyewitness identification: The importance of lineup models. Psychological Bulletin, 99, 320–329.
Wright, D. B., Mathews, S. A., & Skagerberg, E. M. (2005). Social recognition memory: The effect of other people’s responses for previously seen and unseen items. Journal of Experimental Psychology: Applied, 11, 200–209.
Wright, D. B., & McDaid, A. T. (1996). Comparing system and estimator variables using data from real line-ups. Applied Cognitive Psychology, 10, 75–84.
Yuille, J. C., & Tollestrup, P. A. (1992). A model of the diverse effects of emotion on eyewitness memory. In S.-A. Christianson (Ed.) The handbook of emotion and memory: Research and theory (pp. 201–215). Hillsdale, NJ: Erlbaum.


1 There is a potentially important distinction between cases and suspects. Yuille and Tollestrup (1992) and Tollestrup et al. (1994) reported results using cases as the unit of analysis, whereas Wright and McDaid (1996) used the suspect as the unit of analysis. Each case may involve multiple perpetrators and hence multiple suspects. The calculation of the Yuille–Tollestrup result is based on 622 cases rather than 626 shown in their Table 1 because in four homicide cases there were no witnesses. Calculations for Tollestrup et al. (1994) are based only on robbery cases, and one case that involved no witnesses is excluded for a total of 76. Calculations for Wright and McDaid are estimated from their Fig. 1 (p. 77).
2 The calculation considers response categories rather than individual responses. For example, two witnesses could each select a foil, yet select two different foils.
3 The original treatment of diagnosticity (developed by Wells and Lindsay 1980, and used in Wells and Turtle 1986 and in Wells and Olson 2002) used a ratio, p(R|S = G)/p(R|S ≠ G), representing the likelihood that the response would occur when the suspect was guilty versus innocent. With the likelihood ratio index, a value of 1.0 indicates no diagnosticity, 2.0 means that the response was twice as likely if the suspect was guilty as if the suspect was innocent, and so on. Here, we have chosen to use the index p(R|S = G)/[p(R|S = G) + p(R|S ≠ G)], which represents diagnosticity as a probability rather than as a ratio. A probability of .50 indicates no diagnosticity, values between .50 and 1.0 indicate degrees of diagnosticity of guilt, and values between 0.0 and .50 indicate degrees of diagnosticity of innocence. Expressing diagnosticity as a probability rather than as a ratio has two advantages. First, the likelihood ratio index required the use of the inverse ratio, p(R|S ≠ G)/p(R|S = G), for responses that were diagnostic of innocence, an unnecessary step when using a probability as the index of diagnosticity. Second, the likelihood ratio is mathematically awkward for combining responses from multiple witnesses because it requires the multiplication of ratios (which can become quite large), whereas the probabilistic index of diagnosticity requires the multiplication of probabilities (which are constrained between 0 and 1).
4 The actual distribution of identifications across lineup members is likely to vary as a function of the lineup composition. The analyses carried out here and elsewhere (Clark et al. 2007) have assumed target-absent lineups that were unbiased in their composition. This assumption arises from the fact that in the relevant studies there was no designated innocent suspect, and thus suspect identification rates in TA lineups were calculated by dividing the overall identification rate by the number of lineup members.
5 The idea of the double-blind lineup, first proposed by Wells (1988) and emphasized strongly in lineup reform recommendations (Wells et al. 1998), would effectively eliminate the problem of the lineup administrator failing to make a clear record of foil identifications. With a double-blind lineup, the lineup administrator does not know which lineup member is the suspect and which are the foils. Hence, the record made by the double-blind administrator would have to record the identification decision with equal veracity regardless of whether it was an identification of the suspect or a foil.
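
As a rough illustration of footnotes 3 and 4, the sketch below converts between the likelihood ratio index and the probability index, and estimates a target-absent suspect identification rate by dividing the overall identification rate by the lineup size. The function names and the example inputs are hypothetical; only the two formulas come from the footnotes.

```python
def ratio_to_probability(likelihood_ratio):
    """Convert p(R|S = G)/p(R|S != G) into
    p(R|S = G)/[p(R|S = G) + p(R|S != G)]."""
    return likelihood_ratio / (1.0 + likelihood_ratio)


def probability_to_ratio(probability):
    """Inverse conversion; undefined when probability equals 1."""
    return probability / (1.0 - probability)


def ta_suspect_id_rate(overall_id_rate, lineup_size):
    """Footnote 4 estimate: with no designated innocent suspect, spread the
    overall target-absent identification rate evenly over lineup members."""
    return overall_id_rate / lineup_size


if __name__ == "__main__":
    # A ratio of 1.0 (no diagnosticity) maps to a probability of .50.
    print(ratio_to_probability(1.0))
    # A ratio of 2.0 maps to a probability of two thirds.
    print(ratio_to_probability(2.0))
    # Hypothetical 60% overall identification rate in a six-person lineup.
    print(ta_suspect_id_rate(0.60, 6))
```

The conversion shows that the two indices carry the same information: a ratio above 1.0 always corresponds to a probability above .50, and a ratio below 1.0 to a probability below .50, so no separate inverse ratio is needed for responses diagnostic of innocence.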