General Laboratory

Discussion about Inconclusives

            In other articles on this website, I have briefly discussed inconclusive results in the field of firearm comparisons. This article will dive deeper into what this conclusion means and why it has become the object of intense discussion.

AFTE Range of Conclusions

            AFTE (The Association of Firearm and Tool Mark Examiners) allows examiners to report three different inconclusive results: Inconclusive A, Inconclusive B, and Inconclusive C.

Inconclusive A

            Inconclusive A would be chosen if there is some agreement of individual characteristics and agreement of all discernible class characteristics, but the agreement is insufficient for an identification. This conclusion means the examiner was leaning towards an identification but did not have enough information to conclude one.

Inconclusive B

            Inconclusive B would be chosen if there is agreement of all discernible class characteristics without agreement or disagreement of individual characteristics due to an absence, insufficiency, or lack of reproducibility. This conclusion is the middle ground, where the examiner is leaning towards neither an identification nor an elimination.

Inconclusive C

Inconclusive C would be chosen if there is agreement of all discernible class characteristics and disagreement of individual characteristics, but the disagreement is insufficient for an elimination. This conclusion means the examiner was leaning towards an elimination but there was just enough agreement to keep them from eliminating the two items.

Opinion on Reporting Inconclusive

            Although AFTE allows examiners to report three different types of inconclusives, I believe that examiners should report only “inconclusive” as their result, regardless of which way they are leaning. This not only prevents bias but also more accurately conveys the intention of the examiner. For example, if an examiner testified that they reached an Inconclusive A conclusion, they would have to tell the jury that there was not enough information to report the comparison as an identification or an elimination, but that the comparison was leaning towards an identification. I feel this explanation takes the inconclusive conclusion and adds a wink and a nod to the jury that the comparison could have been an identification. This confuses jury members and also skews the results of the examination.

            The examiner who reports “inconclusive” is more transparent, and the jury will have an easier time understanding the conclusion of the examination. The examiner would explain that during the comparison there was not enough information for an identification and just enough information to reject an elimination. If the markings were not sufficient for an identification, then the conclusion should be inconclusive rather than Inconclusive A, because no matter the circumstances there was never enough agreement for an identification.

The Rise of the “Problem”

Their Argument

            After the release of the PCAST report, the field went to work producing studies that would satisfy the PCAST recommendation for black box studies. The Ames I (Baldwin) study and the Ames II (Monson) study were conducted to satisfy that recommendation. These studies were very informative and contained a lot of data to be reviewed by anyone in the field. Although the recommendation was satisfied, people outside the field moved the goalposts and stated that the problem with the field and the studies was the inconclusive conclusions. This has now become the main focus of the field, and once again examiners are trying to satisfy this challenge.

            One of the first times inconclusive results were called problematic was by Dr. Scurich, who argued that inconclusives should be counted as false positives. He demonstrated this idea by taking the Baldwin study and recalculating its error rates under that treatment. With inconclusives counted as errors, the error rate went from 1.01% to roughly 35%.
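
            To make the arithmetic behind that recalculation concrete, here is a short Python sketch using the Baldwin-study counts summarized later in this article (22 false identifications and 735 inconclusives out of 2178 scored different-source comparisons). This is my own illustration of the two scoring schemes, not Dr. Scurich’s code, and the exact denominator he used is an assumption on my part.

```python
# Two ways of scoring the same different-source comparisons from the Baldwin
# (Ames I) data as reported later in this article. Counting inconclusives as
# false positives is the re-treatment attributed to Dr. Scurich.
false_ids = 22
inconclusives = 735
comparisons = 2178          # 2180 different-source comparisons minus 2 blanks

traditional_rate = false_ids / comparisons
recalculated_rate = (false_ids + inconclusives) / comparisons

print(f"traditional false-positive rate: {traditional_rate:.2%}")   # ~1.01%
print(f"inconclusives counted as errors: {recalculated_rate:.2%}")  # ~34.8%
```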

            Dr. Scurich also recommended that a majority approach be used when determining the “ground truth” for a particular comparison. For example, if a preliminary comparison were given to a certain number of examiners and the majority concluded that the comparison was inconclusive, then when the comparison is given to the actual participants in a study, a conclusion would only be scored as correct if it were inconclusive. Likewise, if the majority ruled that a comparison is an identification or an elimination, any participant who concluded inconclusive would be marked wrong. No matter the actual ground truth, the answer would always have to coincide with the majority’s ruling.
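
            The following short Python sketch contrasts grading against the true source with grading against a majority-derived consensus. The panel votes and participant answers are entirely hypothetical; the point is only to show how a correct conclusion can be scored as an error under majority rule.

```python
# A minimal, hypothetical sketch contrasting grading against the true source
# ("ground truth") with grading against a majority-derived consensus.
from collections import Counter

ground_truth = "identification"                  # the two items share a source
panel = ["inconclusive", "inconclusive", "inconclusive", "identification"]
consensus = Counter(panel).most_common(1)[0][0]  # majority answer: inconclusive

participants = ["identification", "inconclusive", "identification"]

for answer in participants:
    vs_truth = "correct" if answer == ground_truth else "error"
    vs_majority = "correct" if answer == consensus else "error"
    print(f"{answer:15s} -> vs ground truth: {vs_truth:7s} | vs majority: {vs_majority}")
# Under majority-rule grading the two correct identifications are scored as errors.
```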

My Argument

            We will first look at folding inconclusives into the error rate of a study. I believe that inconclusives should be treated separately from identifications and eliminations and should not be factored into the error rate. As discussed above, an inconclusive carries a lot of meaning: it tells the reader that the examiner did not have enough information for an identification, or had just enough information to keep them from concluding an elimination. Forcing an examiner to pick either an identification or an elimination would mean the examiner has to disregard what they see in the evidentiary material. Sometimes the evidence may be deformed, damaged, poorly marked, or changed from one firing to the next due to rust or other factors. These conditions increase the difficulty of the examination and may obscure markings that would otherwise drive an examiner to an identification or an elimination. No matter the experience level of the examiner, if the markings are not present or are insufficient, an inconclusive conclusion is appropriate. This is how other sections in crime laboratories, such as latent prints and DNA, use inconclusive results. Although the conclusion is used in other sections, it only appears to be considered problematic in the field of firearm comparisons.

            Lastly, the majority-rules solution would not work and would create an artificial error rate. As discussed in the previous section, if the majority chooses inconclusive, then to be scored correct the participants would have to report an inconclusive result. Let’s say, in this scenario, the two pieces of evidence were from the same firearm. A couple of participants had enough experience with this type of evidence that they were able to look at areas other examiners would normally miss, or to use lighting techniques that other examiners are not proficient in. By using their experience and techniques they were able to find markings that drew them to conclude an identification, which would be the correct answer. However, since the majority ruled that the conclusion would be inconclusive, these examiners would be marked wrong.

            A real-life example of the previous scenario would be to survey 100 people and ask them what the southernmost state of the United States is. Let’s assume that the majority of the survey participants picked either Texas or Florida; that answer would then become the “ground truth.” The researcher would then give the same question to the actual participants of the study and grade their answers against that majority-derived ground truth. Some participants may be more versed in geography and answer that Hawaii is the southernmost state. Technically, these participants would be right in the real world, but in the context of this study they would be marked wrong and would contribute to the error rate.

            Majority rules also causes test-taking bias. Examiners may feel forced to report a conclusion as an identification or an elimination. For example, if there were many markings drawing an examiner towards an identification, but the quality and quantity of the markings did not meet their threshold for an identification, they would normally conclude inconclusive. Knowing that the study is graded by majority rule, they may instead conclude an identification, on the assumption that the majority of participants would use those markings to make an identification.

            Before finishing this article I would like to add one more data point that was seen in multiple studies. It was found that examiners who used inconclusives more often were considered more trustworthy than examiners who did not. This can be seen in the AMES study, where one examiner contributed significantly to the error rate but did not report any inconclusive results.

Final Thought

            Examiners should start using one inconclusive result rather than three to become more transparent and to eliminate any bias presented to the jury. Outside organizations should accept inconclusive conclusions as valid, just as they do for other disciplines. It seems these outside organizations are using inconclusives to artificially raise error rates and create conflict within the field. People have to be able to recognize the true meaning of an inconclusive result and not treat the conclusion as an easy way out or as something used to easily pass a study. It is a valid and important conclusion that allows an examiner to appropriately speak for the evidence.

Literature Review

AMES II Study

Validation Study of the Accuracy, Repeatability, and Reproducibility of Firearm Comparisons

            This post is a summary of the second part of the AMES study. The AMES study was created as a direct response to the PCAST report’s call for black box studies that would validate the comparison of components from expended cartridges. The AMES II study added repeatability and reproducibility, which were missing from the AMES I study, and incorporated expended bullets along with harder-than-usual samples for comparison.

Materials and Methods

Participation

            Recruitment was done through the AFTE website, announcements by FBI personnel at forensic meetings, and emails through an email list. The participants were told that they would remain anonymous to protect them from any risk. Overall, (270) examiners responded, but it was later decided that FBI employees could not participate, to eliminate any bias, which brought the total to (256) participants. By the end of the study only (173) examiners returned their evaluations and were active in the study, and only (80) examiners returned all six mailings of test packets. The dropout was attributed to examiners reporting that they did not have adequate time to complete the study alongside their casework.

Sample Creation

            For expended casings: (10) Jimenez, (1) Bryco (which replaced a failed Jimenez), and (27) Beretta firearms were used. (23) of the Berettas were new and were selected in groups of 4 or 5 that were consecutively produced using the same broach at different periods in the life of the broach. All firearms had a break-in period and were cleaned throughout the testing. Steel Wolf Polyformance 9mm ammunition was used because of its poor reproducibility of individual characteristics, which would increase the difficulty of the study. The expended bullet samples were created using (11) Ruger and (27) Beretta firearms.

Test Set Creation

            Each test packet consisted of (30) comparison sample sets, made up of (15) comparisons of (2) knowns to (1) questioned expended cartridge case and (15) comparisons of (2) knowns to (1) questioned expended bullet. The expended cartridge case comparisons consisted of (5) sets of Jimenez and (10) sets of Beretta produced expended casings. The expended bullet comparison sets consisted of (5) sets of Ruger and (10) sets of Beretta expended bullets. The ratio of known same-source to known different-source sets was approximately 1:2 for both expended casings and bullets but varied among test packets. Each set was an independent examination that was not related to the other sets. The sets were open, because there was not a match for every questioned sample.

            The samples were designated so that the researchers would know whether they were fired early, middle, or late in the test-firing sequence. This was done to examine the effect of firing order on the error rate. Samples from different manufacturing intervals were also marked so that this effect could be examined as well.

            The packets were randomized when they were returned with results so that they could be redistributed. Each test packet had to be sent back to the same examiner to test repeatability and then sent out again to a different examiner to test reproducibility. Randomizing the same packet helped ensure that examiners would not be able to identify any trends when receiving it.

Results

Accuracy

            A total of (4320) expended bullet set examinations and (4320) expended casing set examinations were performed. For expended bullet comparisons, (20) were a false identification (ID), an error rate of 0.70%, and (41) were a false elimination, an error rate of 2.92%. For expended casings, (26) were a false ID, an error rate of 0.92%, and (25) were a false elimination, an error rate of 1.76%. Out of (173) examiners, (34) made a hard error in expended bullet comparisons and (36) made a hard error in expended casing comparisons. A chi-square test determined that the probabilities associated with each conclusion are not the same for each examiner. The point-estimate error rates, with 95% confidence, were calculated to be the following: expended casings, false positive 0.933% and false negative 1.87%; expended bullets, false positive 0.656% and false negative 2.87%.
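
            For readers unfamiliar with the test mentioned above, the sketch below shows the kind of chi-square test of homogeneity that can be used to ask whether conclusion probabilities differ across examiners. The examiner-by-conclusion counts are hypothetical placeholders, not the study’s data.

```python
# A minimal, hypothetical sketch of a chi-square test of homogeneity across
# examiners. Rows are examiners, columns are counts of each conclusion type.
from scipy.stats import chi2_contingency

# hypothetical counts: [identification, inconclusive, elimination] per examiner
table = [
    [12,  3, 15],   # examiner A
    [ 9, 12,  9],   # examiner B
    [14,  1, 15],   # examiner C
    [ 6, 16,  8],   # examiner D
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
# A small p-value suggests the conclusion probabilities are not the same
# for every examiner.
```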

Repeatability

            The results are laid out to show errors whenever an examiner changed his/her initial answer; changing from one inconclusive option to another, or between an inconclusive option and the ground truth, is counted as an error. Below I summarize the charts by neglecting those switches and including only hard errors, meaning switches from an elimination to an identification or vice versa. For expended bullet matching sets, (8) comparisons went from ground truth ID to false elimination and (8) comparisons went from false elimination to ground truth ID. For expended bullet nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (6) comparisons went from a false ID to ground truth elimination. For expended casing matching sets, (5) comparisons went from ground truth ID to a false elimination and (1) comparison went from a false elimination to a ground truth ID. For expended casing nonmatching sets, (2) comparisons went from ground truth elimination to a false ID and (2) comparisons went from a false identification to a ground truth elimination.

            The proportion of paired disagreements was calculated two ways: by pooling the inconclusives together, and by combining ID with Inconclusive A and elimination with Inconclusive C. The first percentage below reflects the former and the second the latter. For expended bullets, the matching sets had 16.6%/14.5% and the nonmatching sets 16.4%/28.7%. For expended casings, the matching sets had 19.1%/14.6% and the nonmatching sets 21.1%/27.5%. The authors also calculated a better-than-chance repeatability from the results.
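
            To illustrate how the two pooling schemes can change the paired-disagreement numbers, here is a small Python sketch with made-up first/second readings of the same sets. It is not the authors’ calculation, only a demonstration of the category mappings described above.

```python
# Hypothetical first/second readings of the same comparison sets, scored under
# the two pooling schemes described in the text.
pool_all = {"ID": "ID", "Inc-A": "Inc", "Inc-B": "Inc", "Inc-C": "Inc", "Elim": "Elim"}
pool_ends = {"ID": "ID", "Inc-A": "ID", "Inc-B": "Inc", "Inc-C": "Elim", "Elim": "Elim"}

# (first reading, second reading) pairs -- made up purely for illustration
pairs = [("ID", "Inc-A"), ("Inc-B", "Inc-C"), ("Elim", "Elim"),
         ("Inc-A", "ID"), ("Inc-C", "Elim")]

def disagreement(pairs, pooling):
    changed = sum(pooling[a] != pooling[b] for a, b in pairs)
    return changed / len(pairs)

print(f"inconclusives pooled together:  {disagreement(pairs, pool_all):.0%}")
print(f"Inc-A with ID, Inc-C with Elim: {disagreement(pairs, pool_ends):.0%}")
```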

Reproducibility

            The results are laid out the same way as in the repeatability portion of the study, and again I will list only the hard errors in this summary. For expended bullet matching sets, (12) comparisons went from ground truth ID to false elimination and (13) comparisons went from false elimination to ground truth ID. For expended bullet nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (1) comparison went from a false ID to ground truth elimination. For expended casing matching sets, (5) comparisons went from ground truth ID to a false elimination and (15) comparisons went from a false elimination to a ground truth ID. For expended casing nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (5) comparisons went from a false identification to a ground truth elimination.

            As before, the proportion of paired disagreements was calculated by pooling the inconclusives together and by combining ID with Inconclusive A and elimination with Inconclusive C, with the first percentage reflecting the former and the second the latter. For expended bullets, the matching sets had 27.6%/22.6% and the nonmatching sets 45.4%/51.0%. For expended casings, the matching sets had 29.7%/23.6% and the nonmatching sets 45.1%/40.5%. The authors also calculated a better-than-chance reproducibility from the results.

Other Examined Areas

            The paper also examines other areas of interest that may prove useful to some examiners, but the results would not be used in court testimony. I will briefly summarize them here to ensure that this post remains a complete summary of the study.

            The effects related to firearm type and wear were examined, and it was found that Beretta-produced expended bullet samples had a larger proportion of correct conclusions. Ruger firearms produced more inconclusive results when compared to the Beretta samples. For expended casings, Beretta firearms produced a larger number of correct conclusions compared to the Jimenez firearms. For the firing sequence in matching sets, the rate of correct conclusions relative to inconclusives was higher when the samples were part of the same third of the sequence. Even though a difference was observed, the chi-square test showed it was not significant for early-late and late-early comparisons.

            The proportion of unsuitable evaluations was also examined during the study. It was observed that fewer expended bullet sets produced with Berettas were recorded as unsuitable, while more expended casing sets produced with Beretta weapons were recorded as unsuitable. The effects associated with manufacturing were also examined, and the authors found strong support for a difference between conclusions for same-group and different-group examinations of expended casings. More eliminations were seen with expended casings from different production runs than with ones from the same production run. A chi-square test also found that tool wear over a production run was not significant.

            The study also asked examiners about the difficulty level of their comparisons, the time spent on their evaluations, whether they used consecutively matching striae (CMS), and the areas they relied on for their conclusions. Examiner experience was also examined by the authors. These results can be found in the study and will not be discussed here, although I would like to share one result from the CMS portion. The study showed that examiners who used CMS were more likely to reach false negative conclusions; this result was significant only for matching expended bullets and nonmatching expended casings.

Discussion

Accuracy

            The error rates found in the accuracy portion of this study were close to the ones found in Part I of the AMES study. The false positive rates match extremely well, but this study does show a higher false negative rate. The difference in false negative error rate can be attributed to the steel Wolf Polyformance cartridges, which proved to be more difficult for comparisons. This factor, combined with the poorly marking Jimenez firearms and the differences in firing order, can cause more false negatives. These factors would be considered the worst-case scenario for examiners. Some examiners were able to record comments on the study, and some had concerns that they were not able to look at the firearms to determine whether certain markings were subclass. In normal casework, if known samples are generated, the examiner would have access to the firearm to examine. Another complaint from examiners was the spacing of test fires: in casework, the test fires generated from a known source should be close in sequence to samples collected at a crime scene with the firearm. Another factor noted is that the errors were attributable to only a few examiners, which was also seen in Part I of this study. The article concludes this section by stating that, for both expended bullets and casings, the probability of a false positive is half that of a false negative, possibly because examiners are trained to lean towards the cautious side.

            It should also be noted that the (6) most error-prone examiners accounted for almost 30% of the total errors, and (13) examiners accounted for almost half of all the hard errors seen in the study. These results are consistent with the ones seen in Part I of the study. Considering that most of the errors come from a small group of examiners, it can be said that the error rates really apply to the examiner rather than to the science overall. If these examiners had been randomly swapped with other examiners during the selection process, the overall error rate could have decreased. Also, if the study had allowed the examiners to use their laboratory QA systems, some errors might have been prevented in the first place.

Repeatability/Reproducibility

            In the article the authors created multiple scatter plots of observed versus expected agreement for repeated examinations by the same examiner. These plots show that examiners score high in repeatability; in other words, their observed performance generally exceeds the statistically expected agreement by a wide margin. This holds whether the inconclusives are treated separately or combined. Some examiners stated that they would not be surprised if they concluded Inconclusive C in one round and elimination in the second round. Another examiner stated that they would not be surprised if their “flip-flops” were concentrated around the three inconclusive categories. As for reproducibility, it was found that the observed agreement generally matched the expected agreement, and the trends are not as dramatic as the ones seen in repeatability. This is because reproducibility involves multiple examiners rather than the single examiner involved in repeatability.

Inconclusives

            I feel that it is important not to include inconclusives in the error rate, because they should be used when an examiner cannot make an identification or elimination based on the evidence provided. Forcing an examiner into making an identification or an elimination would be asking the examiner to conclude against what they are observing. I also believe that the study should have allowed the examiner to choose only one inconclusive option; allowing three options creates difficulty when trying to determine error rates. For example, in repeatability and reproducibility, switches among the three inconclusives can inflate the disagreement rate even though the examiners choosing them are all concluding inconclusive. When an examiner chooses inconclusive they should not be biasing themselves towards an identification or an elimination; they should be stating that the markings are not sufficient for either. I find that the only use of the three inconclusives is in academic research, and as seen here they can still cause problems even there.

Final Thoughts

            In the conclusion of this study, the authors state that some comparison sets resulted in errors by more than one examiner, and one of the sets was marked as an error in all parts of the study. The authors state that these comparison sets would be evaluated by trained forensic examiners at the FBI to determine the cause behind the errors. Since publication, I have not been able to find any record of whether they followed up on this.

            As always, I recommend you read the full study because they examined a lot of variables not seen in Part I and they included a lot of statistics to back up their claims. Also, a better understanding of this study will help you combat its use in court. See my article on a Daubert hearing where the defense attorney used this study to manipulate the data for their own gain.

Literature Review

Part I: Ames Study

Finding the Article

            I was able to find a copy of a study titled “A Study of False-Positive and False-Negative Error Rates in Cartridge Case Comparisons,” written by David P. Baldwin, Stanley J. Bajic, Max Morris, and Daniel Zamzow. This is Part I of a two-part study done by the Ames Laboratory. Part I can still be found in obscure places, but Part II has been wiped from most sources. Defense attorneys and academic opponents heavily reference these studies, but when they do, the studies are often referenced quickly and sloppily and cherry-picked. I hope to share the main findings of these studies to help anyone in the field who encounters people using them. This post will focus on Part I of the study, and at a later date I will post a discussion of Part II.

Introduction/Experiment

            The study’s authors designed the study to better understand the error rates associated with the comparison of fired cartridge casings. They stated that the problem with previous studies was that they did not include independent sample sets that would allow an unbiased determination of the false-positive and false-negative rates. This study set out to resolve that issue.

            Two hundred and eighty-four (284) participants were given fifteen (15) test sets to examine. Twenty-five (25) Ruger SR9s were used to create the samples for the test sets, and each firearm fired 200 cartridges as a break-in before sample collection. Each handgun then fired 800 cartridges in total for the test sets. Within a single test packet, no source firearm was repeated except when a test set was meant to be a same-source comparison. Each set included 3 knowns to compare to a single questioned casing. For all participants, five (5) of the test sets were from known same-source firearms and ten (10) were from known different-source firearms. In addition to the results, the participants had to record the quality of the known samples, which allowed the authors to calculate a poor mark production rate. This rate was examined to address the criticism that test sets are made too easy by cherry-picking well-marked samples. The authors also asked the participants not to use their laboratory peer review process, which allowed the error rates to reflect the individual examiner.

Results

False Negative

            Out of the two hundred and eighty-four (284) participants, only two hundred and eighteen (218) returned completed responses, of which 3% came from self-employed individuals. In total, one thousand and ninety (1090) true same-source comparisons were made, of which only four (4) were labeled elimination and eleven (11) were labeled inconclusive. The false elimination rate was calculated to be 0.3670%, with a Clopper-Pearson exact 95% confidence interval of 0.1001%-0.9369%. Two (2) of the four (4) false eliminations were made by the same examiner, so 215 out of 218 examiners did not make a false elimination. When the inconclusives are counted with the false eliminations, the error rate increases to 1.376%, with a corresponding 95% confidence interval of 0.7722%-2.260%.
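
            For readers who want to check the interval above, the following Python sketch reproduces the reported Clopper-Pearson figures from the counts given in the study (4 false eliminations out of 1090 same-source comparisons). This is my reconstruction, assuming the standard exact binomial interval, not the authors’ code.

```python
# Exact (Clopper-Pearson) binomial confidence interval via the beta distribution.
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

k, n = 4, 1090                                   # false eliminations / same-source comparisons
lo, hi = clopper_pearson(k, n)
print(f"false-elimination rate: {k / n:.4%}")    # ~0.3670%
print(f"95% CI: {lo:.4%} - {hi:.4%}")            # ~0.10% - 0.94%, as reported
```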

            A number to take into consideration is the poor mark production rate discussed above. Two hundred and twenty-five (225) known samples out of nine thousand seven hundred and two (9702) knowns were considered poor quality and inappropriate for inclusion in the comparison, which works out to 2.319% of the samples, with a corresponding 95% confidence interval of 2.174%-2.827%. This percentage is greater than the false elimination rate, which means there is a high probability that some of the false eliminations can be attributed to the poor quality of the knowns used for comparison. Also, the four (4) false eliminations were made by examiners who did not use inconclusive for any response, which could be attributed to their agency requirements.

False Positive

            Out of the two thousand one hundred and eighty (2180) true different-source comparisons, twenty-two (22) were labeled identifications and seven hundred and thirty-five (735) were labeled inconclusive. The false identification error rate was calculated to be 1.010%. (Note: two (2) responses were left blank and were subtracted from the total number of responses.) All but two of the false identifications were made by five (5) of the two hundred and eighteen (218) examiners. Since a small number of examiners made most of the errors, it suggests that the error probability is not consistent across examiners, the idea stated at the beginning of this post. The beta-binomial model was therefore used to estimate the false identification probability, because it cannot be assumed that the probability is uniform across examiners. The probability was calculated to be 0.939%, with a likelihood-based 95% confidence interval of 0.360%-2.261%.
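
            The beta-binomial estimate cannot be reproduced without the per-examiner counts, but the sketch below shows what that kind of maximum-likelihood fit looks like. The per-examiner error counts are hypothetical (chosen only to be consistent with the reported total of 22 false identifications across 218 examiners), so the output will not match the study’s 0.939% exactly.

```python
# A minimal sketch of fitting a beta-binomial model when error probabilities
# vary across examiners. Hypothetical data, not the study's per-examiner counts.
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

# Hypothetical per-examiner false-identification counts: most examiners make no
# errors and a handful account for nearly all of them (22 total across 218
# examiners, each with 10 different-source sets).
k = np.array([0] * 211 + [1, 1, 2, 3, 4, 5, 6])
n = np.full(len(k), 10)

def neg_log_lik(params):
    # Beta-binomial log-likelihood, dropping the constant binomial coefficient.
    a, b = np.exp(params)          # work on the log scale so alpha, beta stay positive
    return -np.sum(betaln(k + a, n - k + b) - betaln(a, b))

res = minimize(neg_log_lik, x0=[0.0, 3.0], method="Nelder-Mead")
a, b = np.exp(res.x)
print(f"estimated overall false-positive probability: {a / (a + b):.3%}")
```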

            The inconclusive results also showed heterogeneity. Out of the two hundred and eighteen (218) examiners, ninety-six (96) labeled none of their comparisons inconclusive, forty-five (45) labeled all 10 of their different-source comparisons inconclusive, and the remaining seventy-seven (77) were spread between the extremes.

My Discussion

            The authors state that the false elimination error rate is in doubt because the poor-quality rate is higher than the false elimination rate, even with the inconclusive results factored in. I agree that the error rate should be questioned, because it can be affected by the poor quality of the samples, which can keep an examiner from reaching a positive conclusion. But there is another factor at play as well. Some laboratories do not allow their examiners to report inconclusive results and require that the conclusion be either an identification or an elimination, something the statistics community has been pushing for. This factor is hard to evaluate, however, because the authors did not require the participants to disclose their laboratory practices. It can be assumed that this might be the case here because all the false eliminations were made by examiners who did not report inconclusive in any of their comparisons.

            The false positive rate is a percentage that should be applied not to the science but to the examiner. This 1% error rate is more representative of the examiners participating in this specific study, as can be seen from the fact that most of the false identifications were produced by five (5) examiners out of the two hundred and eighteen (218) total participants. The study also disclosed in its design that the laboratory review process was deliberately excluded so that individual examiners could be evaluated. It is my belief that if the review process had been allowed in this experiment, the error rate would be smaller, or close to 0%. So, the error rate can be used to advocate for examiners to be well trained and to have a well-established QA system in place.

            The study also addresses the higher number of inconclusive responses received for the different-source comparisons. The seven hundred and thirty-five (735) inconclusive results, compared with the one thousand four hundred and twenty-one (1421) reported eliminations, are too many to be attributed to the poor-quality percentage. Just like the false elimination results, the inconclusives can be attributed to laboratory policy. A laboratory may require the examiner to report an inconclusive result if the class characteristics agree between the known and unknown samples. In this study, since the same model of firearm was used to create the known and unknown samples, the samples generated would share the same class characteristics. If the authors had included a section where the participants could disclose their laboratory policy, we would be able to better understand the number of inconclusive results seen in the study.

            Hopefully, this post helps bring the first part of the Ames study to light and provides more transparency about the error rates published in the paper. Please use this post as a reference or a quick summary, but seek out a copy of the original paper for a more in-depth look into the study design. The authors were very detailed, and it would be beneficial to read the paper for yourself. They go deeper into the design of the study and the creation of the samples than I have in this post, and they include a large discussion section that dives into the statistics they applied and why those statistics were selected to properly represent the data. In a future post, I will summarize and discuss the second part of the Ames study so that more examiners will have access to what some critics of the science use as a reference.