AMES II Study
Validation Study of the Accuracy, Repeatability, and Reproducibility of Firearm Comparisons
This post summarizes the second part of the AMES Study. The AMES study was created as a direct response to the PCAST report's call for a black box study that would validate the comparison of components from expended cartridges. The AMES II study added repeatability and reproducibility, which were missing from the AMES I study, and incorporated expended bullets along with harder-than-usual samples for comparison.
Materials and Methods
Participation
Recruitment was done through the AFTE website, announcements by FBI personnel at forensic meetings, and emails through an email list. The participants were told that they would remain anonymous to protect them from any risk. Overall, (270) responded, but it was later decided that FBI employees could not participate, to eliminate any bias, which brought the total to (256) participants. By the end of the study only (173) examiners returned their evaluations and remained active, and only (80) examiners returned all six mailings of test packets. The dropout rate was attributed to examiners reporting that they did not have enough time to complete the study alongside their casework.
Sample Creation
For expended casings: (10) Jimenez, (1) Bryco (which replaced a failed Jimenez), and (27) Beretta firearms were used. (23) of the Berettas were new and were selected in groups of 4 or 5 that had been consecutively produced using the same broach at different periods in the life of the broach. All firearms went through a break-in period and were cleaned throughout the testing. Steel-cased Wolf Polyformance 9mm ammunition was used because of its poor reproducibility of individual characteristics, which increased the difficulty of the study. The expended bullet samples were created using (11) Ruger and (27) Beretta firearms.
Test Set Creation
Each test packet consisted of (30) comparison sample sets that were made up of (15) comparisons of (2) knowns to (1) questioned expended cartridge case and (15) comparisons of (2) knowns to (1) questioned expended bullet. The expended cartridge case comparisons consisted of (5) sets of Jimenez and (10) sets of Beretta produced expended casings. The expended bullet comparison sets consisted of (5) sets of Ruger and (10) sets of Beretta expended bullets. The ratio of known same-source firearms to known different-source firearms was approximately 1:2 for both expended casing and bullet sets but varied among test packets. Each set was an independent examination unrelated to the other sets. The sets were open, meaning there was not necessarily a match for every questioned sample.
The samples were designated so that the researchers would know whether they were fired early, in the middle, or late in the test-firing sequence. This was done to allow the effect of firing order on the error rate to be examined. Samples from different manufacturing intervals were also marked so that this effect could be seen in the error rate as well.
The packets were re-randomized when they were received back with their results so that they could be redistributed. Each test packet had to be sent back to the same examiner to test repeatability and then sent out again to a different examiner to test reproducibility. Re-randomizing the same packet helped ensure that examiners could not identify any trends when receiving a test packet.
Results
Accuracy
A total of (4320) expended bullet set examinations and (4320) expended casing set examinations were performed. For expended bullet comparisons, (20) were a false identification (ID), an error rate of 0.70%, and (41) were a false elimination, an error rate of 2.92%. For expended casings, (26) were a false ID, an error rate of 0.92%, and (25) were a false elimination, an error rate of 1.76%. Out of (173) examiners, (34) made a hard error in expended bullet comparisons and (36) made a hard error in expended casing comparisons. A chi-square test determined that the probabilities associated with each conclusion are not the same for each examiner. The point estimates of the error rates, with 95% confidence intervals, were calculated to be the following: expended casings, false positive 0.933% and false negative 1.87%; expended bullets, false positive 0.656% and false negative 2.87%.
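To make the arithmetic concrete, here is a minimal sketch of how an observed error rate and a simple binomial confidence interval can be computed. This is only illustrative: the study's own point estimates come from a model that accounts for examiner variability, and the denominator below is inferred from the reported count and rate rather than stated in the text.

```python
# Illustrative only: a simple Clopper-Pearson (exact binomial) interval.
# The study's model accounts for examiner-to-examiner variability, so
# its published intervals are not computed this way.
from scipy.stats import beta

def error_rate_ci(errors, n, conf=0.95):
    """Observed error rate with an exact (Clopper-Pearson) interval."""
    alpha = 1.0 - conf
    rate = errors / n
    lower = beta.ppf(alpha / 2, errors, n - errors + 1) if errors > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, errors + 1, n - errors) if errors < n else 1.0
    return rate, lower, upper

# (20) false IDs at a reported 0.70% rate implies roughly 2,857 conclusive
# different-source bullet comparisons -- an inferred figure, not one
# stated directly in the study text above.
rate, lo, hi = error_rate_ci(20, 2857)
print(f"false ID rate: {rate:.3%} (95% CI {lo:.3%} to {hi:.3%})")
```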
Repeatability
The results are laid out to show every instance in which an examiner changed his or her initial answer: even a change from one inconclusive option to another, or between an inconclusive option and the ground truth, is marked as a disagreement. Below I will summarize the charts by setting aside those softer switches and including only hard errors, where a hard error means switching an answer from an elimination to an identification or vice versa. For expended bullet matching sets, (8) comparisons went from ground truth ID to false elimination and (8) comparisons went from false elimination to ground truth ID. For expended bullet nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (6) comparisons went from a false ID to ground truth elimination. For expended casing matching sets, (5) comparisons went from ground truth ID to a false elimination and (1) comparison went from a false elimination to a ground truth ID. For expended casing nonmatching sets, (2) comparisons went from ground truth elimination to a false ID and (2) comparisons went from a false identification to a ground truth elimination.
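For clarity, here is a minimal sketch of the hard-error rule used in this summary. The category labels are illustrative, not the study's exact ones.

```python
# A hard error is only a flip between the two definitive conclusions;
# moves into or between inconclusive options do not count.
DEFINITIVE = {"ID", "Elim"}

def is_hard_error(first: str, second: str) -> bool:
    """True only when an answer flips between identification and
    elimination across the two rounds."""
    return first in DEFINITIVE and second in DEFINITIVE and first != second

print(is_hard_error("ID", "Elim"))      # True: definitive flip
print(is_hard_error("ID", "Inc-C"))     # False: soft switch, not a hard error
print(is_hard_error("Inc-A", "Inc-C"))  # False: between inconclusive options
```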
The proportion of paired disagreements was calculated two ways: by pooling the inconclusives together, and by combining ID with Inconclusive-A and elimination with Inconclusive-C. The first percentage below reflects the former and the second the latter. For expended bullets, the matching sets showed 16.6%/14.5% and the nonmatching sets 16.4%/28.7%. For expended casings, the matching sets showed 19.1%/14.6% and the nonmatching sets 21.1%/27.5%. The authors also calculated a better-than-chance repeatability from the results.
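Here is a minimal sketch of the paired-disagreement calculation under the two pooling schemes, using hypothetical conclusion pairs and illustrative category labels; the study's own tabulation may differ in detail.

```python
# Scheme 1: pool all three inconclusive options into one category.
POOLED = {"ID": "ID", "Inc-A": "Inc", "Inc-B": "Inc", "Inc-C": "Inc",
          "Elim": "Elim"}
# Scheme 2: fold Inconclusive-A into ID and Inconclusive-C into Elim.
FOLDED = {"ID": "ID", "Inc-A": "ID", "Inc-B": "Inc", "Inc-C": "Elim",
          "Elim": "Elim"}

def disagreement(pairs, mapping):
    """Fraction of (first, second) conclusion pairs that disagree
    after mapping each conclusion into its pooled category."""
    diff = sum(mapping[a] != mapping[b] for a, b in pairs)
    return diff / len(pairs)

# Hypothetical first-round / second-round conclusions for one examiner.
pairs = [("ID", "Inc-A"), ("Elim", "Elim"), ("Inc-B", "Inc-C"), ("ID", "ID")]
print(disagreement(pairs, POOLED))  # 0.25: only the ID -> Inc-A pair differs
print(disagreement(pairs, FOLDED))  # 0.25: only the Inc-B -> Inc-C pair differs
```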
Reproducibility
The results are laid out the same way as in the repeatability portion of the study, and again I will list only the hard errors in this summary. For expended bullet matching sets, (12) comparisons went from ground truth ID to false elimination and (13) comparisons went from false elimination to ground truth ID. For expended bullet nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (1) comparison went from a false ID to ground truth elimination. For expended casing matching sets, (5) comparisons went from ground truth ID to a false elimination and (15) comparisons went from a false elimination to a ground truth ID. For expended casing nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (5) comparisons went from a false identification to a ground truth elimination.
As before, the proportion of paired disagreements was calculated by pooling the inconclusives together and by combining ID with Inconclusive-A and elimination with Inconclusive-C, with the first percentage reflecting the former and the second the latter. For expended bullets, the matching sets showed 27.6%/22.6% and the nonmatching sets 45.4%/51.0%. For expended casings, the matching sets showed 29.7%/23.6% and the nonmatching sets 45.1%/40.5%. The authors also calculated a better-than-chance reproducibility from the results.
Other Examined Areas
The paper also examines other areas of interest that may prove useful to some examiners, though these results would not typically be used in court testimony. I will briefly summarize them here so that this post remains a complete summary of the study.
The effects related to firearm type and wear were examined, and it was found that Beretta-produced expended bullet samples had a larger proportion of correct conclusions. Ruger firearms produced more inconclusive results when compared to the Beretta samples. For expended casings, Beretta firearms produced a larger number of correct conclusions compared to the Jimenez firearms. For firing sequence in matching sets, the rate of correct conclusions relative to inconclusives was higher when the samples were part of the same third of the sequence. Even though a difference was observed, a chi-square test showed it was not significant for Early-Late and Late-Early comparisons.
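As a sketch of the kind of chi-square test described above, the following compares correct versus inconclusive counts across firing-sequence groups. The counts are hypothetical placeholders, not figures from the study.

```python
# Hypothetical 2x2 contingency table: conclusion outcome by whether the
# compared samples came from the same third of the firing sequence.
from scipy.stats import chi2_contingency

#                correct  inconclusive
table = [[410, 90],    # same third of firing sequence   (hypothetical)
         [385, 115]]   # Early-Late / Late-Early pairing (hypothetical)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}, dof={dof}")
# A p-value above 0.05 would mirror the study's finding that the
# Early-Late and Late-Early differences were not significant.
```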
The proportion of unsuitable evaluations was also examined during the study. Fewer expended bullet sets produced with Berettas were recorded as unsuitable, while more expended casing sets produced with Beretta firearms were recorded as unsuitable. The effects associated with manufacturing were also examined, and strong support was found for a difference between conclusions for same-group and different-group examinations of expended casings. More eliminations were seen with expended casings from different production runs than with ones from the same production run. A chi-square test also found that tool wear within a production run was not significant.
The study also asked examiners to rate the difficulty of their comparisons, to record the time of their evaluations, whether they used consecutively matching striae (CMS), and which areas they used for their conclusions. Examiner experience was also examined by the authors. Those results can be found in the study and will not be discussed here, though I would like to share one result from the CMS portion: the study showed that examiners who used CMS were more likely to reach false negative conclusions. This result was significant only for matching expended bullets and nonmatching expended casings.
Discussion
Accuracy
The error rates found in the accuracy portion of this study were close to the ones found in Part I of the AMES study. The false positive rates match extremely well, but this study does show a higher false negative rate. The difference in false negative error rate can be attributed to the steel Wolf Polyformance cartridges, which proved more difficult for comparisons. This factor, combined with the poorly marking Jimenez firearms and the differences in firing order, can cause more false negatives; together these factors would be considered a worst-case scenario for examiners. Some examiners were able to record comments on the study, and some had concerns that they could not examine the firearms to determine whether certain markings were subclass. In normal casework, if known samples are generated, the examiner would have access to the firearm to examine. Another complaint from examiners was the spacing of test fires: in casework, the test fires generated from a known source should be close in sequence to samples collected at a crime scene with the firearm. It was also noted that the errors were attributable to only a few examiners, which was also seen in Part I of this study. The article concludes this section by stating that for both expended bullets and casings the probability of a false positive is half that of a false negative, possibly because examiners are trained to lean toward the cautious side.
It should also be noted that the (6) most error-prone examiners accounted for almost 30% of the total errors, and (13) examiners accounted for almost half of all the hard errors seen in the study. These results are consistent with the ones seen in Part I of the study. Considering that most of the errors came from a small group of examiners, it can be argued that the error rates really apply to individual examiners rather than to the discipline overall. If these examiners had been randomly swapped with other examiners during the selection process, the overall error rate could have decreased. Also, if the study had allowed the examiners to use their laboratory QA systems, some errors might have been prevented from occurring in the first place.
Repeatability/Reproducibility
In the article the authors created multiple scatter plots of observed versus expected agreement for repeated examinations by the same examiner. These plots show that examiners score high in repeatability; in other words, their observed performance generally exceeds the statistically expected agreement by a wide margin. This holds whether the inconclusives are treated separately or combined. Some examiners stated that they would not be surprised to conclude Inconclusive-C in the first round and elimination in the second. Another examiner stated that they would not be surprised if their "flip-flops" were concentrated around the three inconclusive categories. As for reproducibility, the observed agreement generally matched the expected agreement, and the trends are not as dramatic as those seen in repeatability. This is because reproducibility involves multiple examiners rather than the single examiner involved in repeatability.
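Here is a minimal sketch of the observed-versus-expected-agreement comparison behind those plots, assuming a kappa-style chance-agreement formula; whether the study used exactly this statistic is my assumption, and the counts below are hypothetical.

```python
# Observed agreement vs. agreement expected by chance, where chance
# agreement is the product of the two rounds' marginal proportions
# (the denominator used in Cohen's kappa).
from collections import Counter

def agreement_stats(pairs):
    """Observed agreement and chance-expected agreement for
    (first-round, second-round) conclusion pairs."""
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    expected = sum(first[c] / n * second[c] / n for c in first | second)
    return observed, expected

# Hypothetical examiner: 100 repeated comparisons, mostly stable answers.
pairs = [("ID", "ID")] * 70 + [("Elim", "Elim")] * 20 + [("ID", "Inc")] * 10
obs, exp = agreement_stats(pairs)
print(f"observed={obs:.2f}, expected-by-chance={exp:.2f}")
# observed 0.90 vs. 0.60 expected by chance: well above chance, the
# pattern the repeatability plots show for most examiners.
```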
Inconclusives
I feel that it is important not to include inconclusives in the error rate because they should be used when an examiner cannot make an identification or elimination based on the evidence provided. Forcing an examiner to make an identification or an elimination would be asking the examiner to reach a conclusion against what they are observing. I also believe the study should have allowed only one inconclusive option; having three creates difficulty when trying to determine error rates. For example, in repeatability and reproducibility, the three inconclusive options can inflate the disagreement rate even though the examiners choosing them are all concluding inconclusive. When an examiner chooses inconclusive they should not be biasing themselves toward an identification or an elimination; they should be stating that the markings are not sufficient for either. I find the only use for the three inconclusive options to be in academic research, but as seen here they can cause problems even there.
Final Thoughts
In the conclusion of the study, the authors state that some comparison sets resulted in errors by more than one examiner, and one set was marked in error in every part of the study. The authors state that these comparison sets would be evaluated by trained forensic examiners at the FBI to determine the cause of the errors. Since publication, I have not been able to find any follow-up on this.
As always, I recommend reading the full study, because the authors examined many variables not seen in Part I and included extensive statistics to back up their claims. A better understanding of this study will also help you counter its misuse in court. See my article on a Daubert hearing where the defense attorney used this study to manipulate the data for their own gain.