AMES II Study

Validation Study of the Accuracy, Repeatability, and Reproducibility of Firearm Comparisons

            This post summarizes the second part of the AMES Study. The AMES Study was created as a direct response to the PCAST report, which called for black box studies to validate the comparison of components from expended cartridges. The AMES II study added repeatability and reproducibility, which were missing from AMES I, and incorporated expended bullets along with harder-than-usual samples for comparison.

Materials and Methods

Participation

            Recruitment was done through the AFTE website, announcements by FBI personnel at forensic meetings, and emails through an email list. Participants were told they would remain anonymous to protect them from any risk. Overall, (270) examiners responded, but it was later decided that FBI employees could not participate, to eliminate any bias, which brought the total to (256) participants. By the end of the study only (173) examiners returned their evaluations and remained active, and only (80) returned all six mailings of test packets. Examiners who dropped out most often reported not having enough time to complete the study alongside their casework.

Sample Creation

            For expended casings: (10) Jimenez, (1) Bryco (replacing a failed Jimenez), and (27) Beretta firearms were used. (23) of the Berettas were new and were selected in groups of 4 or 5 that had been consecutively produced using the same broach at different periods in the life of the broach. All firearms had a break-in period and were cleaned throughout the testing. Steel-cased Wolf Polyformance 9mm ammunition was used because it reproduces individual characteristics poorly, which increased the difficulty of the study. The expended bullet samples were created using (11) Ruger and (27) Beretta firearms.

Test Set Creation

            Each test packet consisted of (30) comparison sample sets: (15) comparisons of (2) knowns to (1) questioned expended cartridge case and (15) comparisons of (2) knowns to (1) questioned expended bullet. The expended cartridge case comparisons consisted of (5) sets of Jimenez and (10) sets of Beretta expended casings. The expended bullet comparisons consisted of (5) sets of Ruger and (10) sets of Beretta expended bullets. The ratio of known same-source to known different-source sets was approximately 1:2 for both expended casings and bullets but varied among test packets. Each set was an independent examination unrelated to the other sets. The sets were open, meaning there was not necessarily a match for every questioned sample.

            The samples were labeled so the researchers would know whether they were fired early, middle, or late in the test-firing sequence, allowing the effect of firing order on the error rate to be examined. Samples from different manufacturing intervals were also marked so their effect on the error rate could be seen.

            When a packet was returned with its results, its sets were randomized before redistribution. Each test packet was sent back to the same examiner to test repeatability and then sent out again to a different examiner to test reproducibility. Randomizing the same packet helped ensure that examiners could not identify any trends when receiving it.

Results

Accuracy

            A total of (4320) expended bullet set examinations and (4320) expended casing set examinations were performed. For expended bullet comparisons, (20) were false identifications (IDs), an error rate of 0.70%, and (41) were false eliminations, an error rate of 2.92%. For expended casings, (26) were false IDs (0.92%) and (25) were false eliminations (1.76%). Of the (173) examiners, (34) made a hard error in expended bullet comparisons and (36) made a hard error in expended casing comparisons. A chi-square test determined that the probabilities associated with each conclusion are not the same for every examiner. The point estimates of the error rates, with 95% confidence intervals, were: expended casings, 0.933% false positive and 1.87% false negative; expended bullets, 0.656% false positive and 2.87% false negative.
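To picture the kind of heterogeneity test the authors describe, here is a minimal sketch in Python with scipy; the contingency table is invented for illustration and is not data from the study.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = examiners, columns = counts of
# (ID, Inconclusive, Elimination) conclusions. A test of this general
# form asks whether all examiners share the same conclusion probabilities.
table = np.array([
    [12,  2, 16],   # examiner 1
    [ 9,  8, 13],   # examiner 2
    [14,  1, 15],   # examiner 3
    [ 5, 15, 10],   # examiner 4
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
# A small p-value rejects the hypothesis that conclusion probabilities
# are the same for each examiner, which is what the AMES II authors found.
```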

Repeatability

            The results are laid out so that any change from an examiner's initial answer counts as an error, meaning that a switch from one inconclusive option to another, or between an inconclusive option and the ground truth, is marked as an error. Below I summarize the charts by setting aside those switches and including only hard errors, i.e., a change from an elimination to an identification or vice versa. For expended bullet matching sets, (8) comparisons went from ground truth ID to false elimination and (8) went from false elimination to ground truth ID. For expended bullet nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (6) went from a false ID to ground truth elimination. For expended casing matching sets, (5) comparisons went from ground truth ID to a false elimination and (1) went from a false elimination to ground truth ID. For expended casing nonmatching sets, (2) comparisons went from ground truth elimination to a false ID and (2) went from a false ID to ground truth elimination.

            The proportion of paired disagreements was calculated two ways: by pooling the inconclusives together, and by combining ID with Inconclusive-A and elimination with Inconclusive-C. The first percentage below reflects the former and the second the latter. For expended bullets, the matching sets showed 16.6%/14.5% and the nonmatching sets 16.4%/28.7%. For expended casings, the matching sets showed 19.1%/14.6% and the nonmatching sets 21.1%/27.5%. The authors also calculated better-than-chance repeatability from these results. A sketch of the two pooling schemes follows.
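To make the two pooling schemes concrete, here is a minimal sketch in Python; it is my own reconstruction of the idea, not the authors' code, and the example pairs are invented.

```python
# Two ways of pooling the AFTE-style scale before scoring whether a
# repeated pair of conclusions agrees. Scheme 1 pools all inconclusives;
# scheme 2 merges Inconclusive-A with ID and Inconclusive-C with Elim.
POOL_INCONCLUSIVES = {
    "ID": "ID", "Inc-A": "Inc", "Inc-B": "Inc", "Inc-C": "Inc", "Elim": "Elim",
}
MERGE_WITH_NEIGHBORS = {
    "ID": "ID", "Inc-A": "ID", "Inc-B": "Inc", "Inc-C": "Elim", "Elim": "Elim",
}

def disagreement_rate(pairs, pooling):
    """Fraction of first/second-round conclusion pairs that disagree after pooling."""
    disagree = sum(pooling[a] != pooling[b] for a, b in pairs)
    return disagree / len(pairs)

# Hypothetical repeated examinations: (first round, second round).
pairs = [("ID", "ID"), ("Inc-A", "ID"), ("Inc-C", "Elim"),
         ("Inc-B", "Inc-C"), ("Elim", "Inc-C"), ("ID", "Inc-A")]

print(disagreement_rate(pairs, POOL_INCONCLUSIVES))    # scheme 1: 4/6
print(disagreement_rate(pairs, MERGE_WITH_NEIGHBORS))  # scheme 2: 1/6
```

Note how the same six pairs produce very different disagreement rates under the two schemes, which is why the paper reports both percentages.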

Reproducibility

            The results are laid out the same way as in the repeatability portion of the study, and again I will list only the hard errors in this summary. For expended bullet matching sets, (12) comparisons went from ground truth ID to false elimination and (13) went from false elimination to ground truth ID. For expended bullet nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (1) went from a false ID to ground truth elimination. For expended casing matching sets, (5) comparisons went from ground truth ID to a false elimination and (15) went from a false elimination to ground truth ID. For expended casing nonmatching sets, (1) comparison went from ground truth elimination to a false ID and (5) went from a false identification to ground truth elimination.

            As before, the proportion of paired disagreements was calculated by pooling the inconclusives together and by combining ID with Inconclusive-A and elimination with Inconclusive-C; the first percentage reflects the former and the second the latter. For expended bullets, the matching sets showed 27.6%/22.6% and the nonmatching sets 45.4%/51.0%. For expended casings, the matching sets showed 29.7%/23.6% and the nonmatching sets 45.1%/40.5%. The authors also calculated better-than-chance reproducibility from these results.

Other Examined Areas

            The paper also examines other areas of interest that may prove useful to some examiners, though these results would not be used in court testimony. I will briefly summarize them here so that this post remains a complete summary of the study.

            The effects of firearm type and wear were examined, and it was found that Beretta-produced expended bullet samples had a larger proportion of correct conclusions; Ruger firearms produced more inconclusive results than the Beretta samples. For expended casings, Beretta firearms produced a larger proportion of correct conclusions than the Jimenez firearms. Regarding firing sequence in matching sets, the rate of correct conclusions relative to inconclusives was higher when the samples came from the same third of the sequence. Although a difference was observed, a chi-square test showed it was not significant for Early-Late and Late-Early comparisons.

            The proportion of unsuitable evaluations was also examined. Fewer expended bullet sets produced with Berettas were recorded as unsuitable, while more expended casing sets produced with Berettas were recorded as unsuitable. The authors also examined effects associated with manufacturing and found strong support for a difference in conclusions between same-group and different-group examinations of expended casings: more eliminations were seen with expended casings from different production runs than with ones from the same production run. A chi-square test also found that tool wear within a production run was not significant.

            The study also asked examiners to rate the difficulty of their comparisons, record the time of their evaluations, state whether they used consecutively matching striae (CMS), and note the areas they relied on for their conclusions. The authors also looked at examiner experience. These results can be found in the study and will not be discussed here, but I would like to share one result from the CMS portion: examiners who used CMS were more likely to reach false negative conclusions, a result that was significant only for matching expended bullets and nonmatching expended casings.

Discussion

Accuracy

            The error rates found in the accuracy portion of this study were close to those found in Part I of the AMES study. The false positive rates match extremely well, but this study shows a higher false negative rate. The difference in false negative error rate can be attributed to the steel-cased Wolf Polyformance cartridges, which proved more difficult for comparisons. That factor, combined with the poorly marking Jimenez firearms and the differences in firing order, can cause more false negatives; together these conditions approximate a worst-case scenario for examiners. Some examiners recorded comments on the study, and some were concerned that they could not examine the firearms to determine whether certain markings were subclass. In normal casework, if known samples are generated, the examiner has access to the firearm. Another complaint from examiners concerned the spacing of test fires: in casework, the test fires generated from a known source should be close in sequence to the samples collected at a crime scene with the firearm. It was also noted that the errors were attributable to only a few examiners, as was seen in Part I of this study. The article concludes this section by stating that, for both expended bullets and casings, the probability of a false positive is half that of a false negative, possibly because examiners are trained to lean toward the cautious side.

            It should also be noted that the (6) most error-prone examiners accounted for almost 30% of the total errors, and (13) examiners accounted for almost half of all the hard errors seen in the study. These results are consistent with those in Part I. Considering that most of the errors came from a small group of examiners, the error rates really apply to individual examiners rather than to the science overall. If those examiners had been randomly swapped with other examiners during the selection process, the overall error rate could have decreased. Also, if the study had allowed the examiners to use their laboratory QA systems, some errors might have been prevented in the first place.

Repeatability/Reproducibility

            In the article the authors created multiple scatter plots of observed versus expected agreement for repeated examinations by the same examiner. These plots show that examiners score high in repeatability; in other words, their observed performance generally exceeds the statistically expected agreement by a wide margin. This holds whether the inconclusives are treated separately or combined. Some examiners stated that they would not be surprised to conclude Inconclusive-C and then conclude elimination in the second round. Another examiner stated that they would not be surprised if their “flip-flops” were concentrated around the three inconclusive categories. As for reproducibility, the observed agreement generally matched the expected agreement, and the trends are not as dramatic as those seen in repeatability. This is because reproducibility involves multiple examiners, rather than the single examiner involved in repeatability.

Inconclusives

            I feel it is important not to include inconclusives in the error rate because they should be used when an examiner cannot make an identification or elimination based on the evidence provided. Forcing an examiner to choose an identification or an elimination would be asking the examiner to conclude against what they are observing. I also believe the study should have allowed only one inconclusive option; allowing three creates difficulty when trying to determine error rates. For example, in repeatability and reproducibility, the three inconclusives can produce a higher error rate even though the examiners choosing them are all concluding inconclusive. When an examiner chooses inconclusive, they should not be biasing themselves toward an identification or an elimination; they should be stating that the markings are not sufficient for either. The only use I find for the three inconclusives is in academic research, and as seen here they can still cause problems even there.

Final Thoughts

            In the conclusion of the study, the authors state that some comparison sets resulted in errors by more than one examiner, and one set was marked as an error in every part of the study. The authors state that these comparison sets would be evaluated by trained forensic examiners at the FBI to determine the cause of the errors. Since the publication, I have not been able to find any indication that they followed up on this.

            As always, I recommend reading the full study: it examines many variables not seen in Part I and includes extensive statistics to back up its claims. A better understanding of this study will also help you counter its misuse in court. See my article below on a Daubert hearing where the defense attorney manipulated this study's data for their own gain.

Daubert Hearing on Firearm Comparisons

Introduction

I recently watched a video of a Daubert hearing on firearm comparisons. The hearing took place in a Maryland appeals court, and the video can be found here. The video contains only the closing arguments from the defense and the prosecution, although past witnesses and research studies were mentioned during those arguments. Instead of giving a summary of the video, which would make this post longer than I would like, I will focus only on my response to it.

            Overall, the Defense brought up many points that were either exaggerated or misconstrued. Even so, the Defense seemed more believable and knowledgeable than the Prosecution because of their confidence, organization, and ability to use different studies to their advantage. The Prosecution fumbled through generic explanations of the science and failed to take guidance from the judges.

Error Rates

            The Defense states that the error rate of the science is 50%, but they fail to explain where they obtained that figure. In the published studies, the error rate is usually around 1%, and even that rate can only be applied to the examiners who took the test, not to the science as a whole. It cannot be applied to the science as a whole because, in some studies, the errors can be attributed to a few examiners out of the pool of participants. Also, many of these studies prevent the examiners from fully utilizing their Quality Assurance (QA) system, which would reduce the errors seen in the studies. In real casework examiners have full use of the QA system, which acts as a check and balance on their work.

            In another part of the closing argument, the Defense claims that the AMES II study shows a high error rate for comparisons. During this explanation we get to see how the Defense is analyzing the data from the study, which may point to how they came up with the 50% error rate from earlier. They picked the error rates in which the authors of the study combined the inconclusive results with either the identification or elimination results: Inconclusive-A was combined with identifications and Inconclusive-C with eliminations. Including the inconclusives this way inflates the error rate to roughly 10%, rather than the roughly 1% (excluding inconclusives) that the authors list in the conclusion section of the study. A toy calculation below shows the mechanics.
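The numbers in this sketch are invented purely to show the arithmetic of pooling; they are not the AMES II counts.

```python
# Hypothetical counts for nonmatching comparison sets, chosen only to
# show the mechanics of pooling; these are not the AMES II numbers.
false_ids      = 10    # nonmatching sets called identifications
eliminations   = 990   # correct definitive conclusions
inconclusive_a = 100   # inconclusives "leaning" toward identification

# Hard error rate: errors over definitive conclusions only.
hard_rate = false_ids / (false_ids + eliminations)

# Pooled rate: Inconclusive-A scored as a false ID on nonmatching sets.
pooled_rate = (false_ids + inconclusive_a) / (false_ids + eliminations + inconclusive_a)

print(f"hard error rate:   {hard_rate:.1%}")    # 1.0%
print(f"pooled error rate: {pooled_rate:.1%}")  # 10.0%
```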

            Combining inconclusives with identifications or eliminations does not properly reflect the error rate of the study, because examiners chose an inconclusive result for a reason. An inconclusive result is usually chosen either because the markings on the bullet/casings are not sufficient for an identification but are still present enough to prevent an elimination, or because some laboratories do not allow their examiners to conclude an elimination when the class characteristics of the bullets/casings match. Combining the inconclusives throws away this information and either forces the examiner to conclude against the evidence provided or marks an examiner as in error for simply following laboratory procedure. Inconclusive results are not a free ticket out of a hard examination, but a real description of the evidence.

Black Box Studies

The Defense also states that a true black box study has never been performed because all examiners are aware they are being tested. This awareness, they argue, makes examiners more cautious and leads to more inconclusive results, thus “taking the easy way out”. The AMES II study, which is considered a black box study, includes a section publishing some of the participating examiners’ comments, and these comments shed light on why inconclusive results are chosen in these studies. Some examiners stated that they reached inconclusive results because they did not have the firearm to examine. In real casework with test bullets, the firearm is accessible, which allows the examiner to assess for subclass characteristics; without being able to properly evaluate subclass characteristics, examiners will often feel that the best conclusion is inconclusive, especially with poorly marking bullets/casings. Another examiner commented that some of the samples provided for comparison may have been test fires taken further along in the firing sequence from the unknown. In actual casework the unknown sample will be close in sequence to the test fires created by the submitted firearm; greater separation in the firing sequence dilutes the markings, which can lead more examiners to an inconclusive result.

Consecutively Manufactured Studies

The Prosecution’s closing argument focused on the importance of consecutively manufactured studies, with the Prosecutor stating that closed consecutively manufactured studies can be more important than the black box studies the Defense heavily emphasized. One of the judges commented that consecutively manufactured studies would be less important, since examiners should already be looking for subclass characteristics, and consecutively manufactured samples share subclass characteristics, making them easier to identify; with the subclass markings more easily identifiable, the examiner can focus more on the individual markings within the samples. The judge then said that black box studies would be more beneficial, because examiners would have a harder time separating subclass characteristics from individual markings without a consecutively manufactured reference. In response, the Prosecutor fumbled and did not properly convey the importance of consecutively manufactured studies.

            Consecutively manufactured studies are just as important as black box studies. They help examiners establish their best-known non-match: because the samples come from consecutively manufactured tools, non-matches are more likely to share marks with one another, which sharpens the examiner's understanding of their threshold for identifications and eliminations. A consecutively manufactured study also forces subclass characteristics to appear, which helps examiners study their patterns; long continuous marks, gross marks, and rhythmic marks may all be subclass characteristics, and examiners can use these studies to learn the indications of subclass. Lastly, these studies are not created with the sole purpose of showing that consecutively manufactured firearms can create identifiable samples that can be linked back to the firearm. They are used to validate that different tools, whether cast, broached, double-broached, or hammer-forged, to name a few, create different markings from one part to the next. Knowing that these tools create identifiable marks allows an examiner to apply this knowledge to any firearm produced by the tools examined in the study. Therefore, consecutively manufactured studies add to the science, and black box studies establish an error rate for the examiners who participate in them.

Data Manipulation

Before finishing this post, I would like to highlight how the Defense manipulated the data of the AMES II study. As discussed previously, the Defense combined inconclusive results with either the identification or elimination conclusions to establish their error rates. But when discussing the error rates for the repeatability and reproducibility portions of the study, they chose the data that did not combine the inconclusive results with the identification or elimination conclusions. In that portion of the study, the error rates were higher when the inconclusives were not combined, because of the study's three-tier inconclusive scale: if an examiner switched an initial answer from Inconclusive-A to identification (the ground truth), or from Inconclusive-A to Inconclusive-B, it was counted as a changed answer even though both answers would still be marked correct. In this case, combining the inconclusive results more accurately represents examiners who changed their answers but remained right according to the ground truth. Since the combined results produced a lower error rate, the Defense decided not to use them for this part of their argument.

Concluding Thoughts

            Overall, I felt the Prosecutor should have come better prepared and used the studies on record to his advantage. Instead, the Prosecutor stumbled and failed to make solid arguments, which may hurt the science. Hopefully, in future Daubert/Frye hearings, the information I have provided can be used to better utilize existing studies and provide a stronger hearing to protect the science. It is important to become acquainted and knowledgeable with the studies that exist in our science so that they can be used effectively. For example, if the Prosecutor had known the AMES II study more thoroughly, he could have exposed the Defense's manipulation of the data in their argument.

Part I: Ames Study

Finding the Article

            I was able to find a copy of a study titled “A Study of False-Positive and False-Negative Error Rates in Cartridge Case Comparisons” written by David P. Baldwin, Stanley J. Bajic, Max Morris, and Daniel Zamzow. This is Part I of a two-part study done by the Ames Laboratory. Part I can still be found in obscure places, but Part II has been wiped from most sources. Defense attorneys and academic opponents heavily reference these studies, but when they do, the references tend to be quick, sloppy, and cherry-picked. I hope that sharing the main findings will help anyone in the field who encounters people using these studies. This post focuses on Part I of the study; at a later date I will post a discussion of Part II.

Introduction/Experiment

            The authors designed the study to better understand the error rates associated with the comparison of fired cartridge casings. They stated that the problem with previous studies was that they did not include independent sample sets allowing unbiased determination of the false-positive and/or false-negative rates, so this study set out to resolve that issue.

            Two hundred and eighty-four (284) participants were each given fifteen (15) test sets to examine. Twenty-five (25) Ruger SR9s were used to create the samples for the test sets, and each firearm fired 200 cartridges as a break-in before sample collection; each handgun fired 800 cartridges in total for the test sets. No source firearm was repeated within a single test packet except when a test set was meant to be a same-source comparison. Each set included 3 knowns to compare to a single questioned casing. For all participants, five (5) of the test sets were known same-source comparisons and ten (10) were known different-source comparisons. In addition to their results, the participants had to record the quality of the known samples, which allowed the authors to calculate a poor-mark production rate. This rate was examined to rebut the usual criticism that well-marked samples are cherry-picked to make test sets too easy. The authors also asked the participants not to use their laboratory peer review process, so the error rates would reflect the individual examiner.

Results

False Negative

            Out of the two hundred and eighty-four (284) participants, only two hundred and eighteen (218) returned completed responses; self-employed individuals accounted for 3% of the completed responses. In total, one thousand and ninety (1090) true same-source comparisons were made, of which only four (4) were labeled eliminations and eleven (11) inconclusive. The false elimination rate was calculated to be 0.3670%, with a Clopper-Pearson exact 95% confidence interval of 0.1001%-0.9369%. Two (2) of the four (4) false eliminations were made by the same examiner, so 215 out of 218 examiners made no false elimination. When inconclusives are counted with false eliminations, the error rate increases to 1.376%, with a corresponding 95% confidence interval of 0.7722%-2.260%. The sketch below reproduces the exact interval.
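For readers who want to verify the arithmetic, here is a minimal sketch in Python with scipy, assuming the standard Clopper-Pearson construction:

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact (Clopper-Pearson) two-sided binomial confidence interval."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# 4 false eliminations out of 1090 true same-source comparisons.
k, n = 4, 1090
lo, hi = clopper_pearson(k, n)
print(f"point estimate: {k / n:.4%}")   # 0.3670%
print(f"95% CI: {lo:.4%} - {hi:.4%}")   # ~0.1001% - 0.9369%
```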

            A number to take into consideration is the poor mark production rate discussed above. Two hundred and twenty-five (225) of the nine thousand seven hundred and two (9702) known samples were considered poor quality and inappropriate for inclusion in the comparisons, or 2.319% of the samples, with a corresponding 95% confidence interval of 2.174%-2.827%. This percentage is greater than the false elimination rate, so there is a high probability that some of the false eliminations can be attributed to the poor quality of the knowns used for comparison. Also, all four (4) false eliminations were made by examiners who did not use inconclusive for any response, which could be attributable to their agency requirements.

False Positive

            Out of the two thousand one hundred and eighty (2180) true different-source comparisons, twenty-two (22) were labeled identifications and seven hundred and thirty-five (735) were labeled inconclusive. The false identification error rate was calculated to be 1.010% (note: two (2) responses were left blank and were subtracted from the total number of responses). All but two of the false identifications were made by five (5) of the two hundred and eighteen (218) examiners. Since a small number of examiners made most of the errors, the error probability is likely not consistent across examiners, which is the idea stated at the beginning of this post. A beta-binomial model was therefore used to estimate the false identification probability, because the probability cannot be assumed uniform across examiners. The probability was calculated to be 0.939%, with a likelihood-based 95% confidence interval of 0.360%-2.261%.
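For the curious, here is a rough sketch of how a beta-binomial model can be fit by maximum likelihood; the per-examiner counts are invented, and this is my own illustration, not the study's data or code.

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

# Invented per-examiner data: k = false IDs, n = different-source
# comparisons. Most examiners make no errors while a few account for
# nearly all of them, the pattern a beta-binomial model accommodates.
k = np.array([0] * 50 + [1, 1, 2, 3, 4])
n = np.full(k.shape, 10)

def neg_log_lik(params):
    # Optimize on the log scale so the alpha, beta parameters stay positive.
    a, b = np.exp(params)
    # Beta-binomial log-pmf: log C(n,k) + log B(k+a, n-k+b) - log B(a,b).
    return -np.sum(
        gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
        + betaln(k + a, n - k + b) - betaln(a, b)
    )

res = minimize(neg_log_lik, x0=[0.0, 3.0], method="Nelder-Mead")
a, b = np.exp(res.x)
print(f"estimated mean error probability: {a / (a + b):.3%}")
```

The fitted mean plays the role of the study's point estimate; a likelihood-based interval like the one the authors report can be obtained by profiling the same likelihood.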

            The inconclusive responses also proved heterogeneous. Of the two hundred and eighteen (218) examiners, ninety-six (96) labeled none of the comparisons inconclusive, forty-five (45) labeled all ten (10) of them inconclusive, and seventy-seven (77) were spread between the extremes.

My Discussion

            The authors state that the false elimination error rate is in doubt because the poor-quality rate is higher than the false elimination rate, even with the inconclusive results factored in. I agree that the error rate should be questioned, because it can be affected by the poor quality of the samples, which can keep an examiner from concluding a positive comparison. But there is another factor in play as well. Some laboratories do not allow their examiners to report inconclusive results and require that the conclusion be either an identification or an elimination, something the statistics community has been pushing for. This factor is hard to assess because the authors did not require the participants to disclose their laboratory practices, but it plausibly applies here, since all the false eliminations were made by examiners who did not conclude inconclusive in any of their comparisons.

            The false positive rate is a percentage that should be applied not to the science but to the examiner. The roughly 1% error rate is representative of the examiners participating in this specific study, as most of the false identifications were produced by five (5) examiners out of the two hundred and eighteen (218) total participants. The study design also deliberately excluded the laboratory review process so that individual examiners could be examined. It is my belief that if the review process had been allowed in this experiment, the error rate would have been smaller, perhaps close to 0%. So the error rate can be used to advocate for examiners to be well trained and for laboratories to have a well-established QA system in place.

            The study also addresses the higher number of inconclusive responses received in the different-source comparisons. The seven hundred and thirty-five (735) inconclusive results, against the one thousand four hundred and twenty-one (1421) reported eliminations, are too many to be attributed to the poor-quality percentage. As with the false eliminations, the inconclusives can be attributed to laboratory policy: a laboratory may require the examiner to report an inconclusive result if the class characteristics are the same between the known and unknown samples. Since the same model of firearm was used to create the known and unknown samples in this study, the samples would share the same class characteristics. If the authors had included a section where the participants could disclose their laboratory policy, we would be able to better understand the number of inconclusive results seen in the study.

            Hopefully, my post helps bring attention to the first part of the Ames study and provides more transparency about the error rates published in the paper. Please use this post as a reference or quick summary, but seek out a copy of the original paper for a more in-depth look at the study design. The authors were very detailed, and reading the paper for yourself would prove beneficial: they go into greater depth on the design of the study and the creation of the samples than I have here, and their large discussion section dives deeper into the statistics they applied and why those methods properly represent the data. In a future post, I will summarize and discuss the second part of the Ames study so that more examiners will have access to what some critics of the science use as a reference.

Response: The Field of Firearms Forensics is Flawed

Introduction

            An article entitled “The Field of Firearms Forensics is Flawed” was published by David L. Faigman, Nicholas Scurich, and Thomas D. Albright. The authors open by referencing “Forensic Science: Oxymoron?” by Donald Kennedy, which argues that forensic science is an oxymoron, and they agree that the statement made in 2003 is still relevant today. They state, “Forensic experts continue to employ unproven techniques, and courts continue to accept their testimony largely unchecked,” and claim that the field of firearm examination is built on smoke and mirrors. I would like to respond to this article using my knowledge as a firearm examiner.

Quantity of Studies

            Their first argument is that only a few studies exist to validate the field, and that the ones that do exist indicate that examiners cannot reliably determine whether bullets or cartridges were fired by a particular gun. This statement is problematic because they offer no reference to the article(s) that supposedly show an examiner cannot reliably determine the origin of an expended component. During my training as an examiner, I read hundreds of articles supporting the field, all of which produced low error rates. For example, the Hamby and Brundage study examined bullets from ten consecutively manufactured Ruger pistol barrels, the Fadul study examined 10 consecutively manufactured Ruger slides, and the Cazes study examined 10 consecutively manufactured Hi-Point slides. The Hamby and Brundage test, which incorporated 502 examiners, had a 0% error rate. The Fadul study established an error rate of 0.000636% and a durability error rate of 0.0017699%, and both error rates were determined to not be significantly higher than zero. The durability portion of the Fadul study consisted of giving the participants casings fired later in the sequence than the casings they originally received.

The durability portion was created to see whether an examiner's conclusions would change based on the wear of the breech face markings caused by the intervening test fires. Studies like these focus on consecutively manufactured parts because they create the hardest scenario for examiners and ensure that examiners are using individual markings rather than class characteristics for their conclusions.

These studies also focus on the manufacturing method rather than the make/model of the firearm used, because a manufacturer can use only a few manufacturing methods to produce a firearm. If a method is proven to produce individual markings, that finding can be applied to all firearms produced with the same method. Many other foundational studies and their summaries can be found on the AFTE SWGGUN ARK.

Anti-Expert Experts?

            The authors suggest the need to create anti-expert experts to combat experts in court. These experts would consist of research scientists, which does not make sense, because the people researching in this field are publishing in the Journal of Forensic Sciences and the AFTE Journal, the very journals the authors just argued against. These journals are peer-reviewed, published for anyone to view, and allow anyone to retest the conclusions made. Since these articles are peer-reviewed and part of the scientific community, I am not sure who the anti-expert research scientists would be; they would simply be the people in the field already publishing the work.

Inconclusive Results

            As with many critiques of the science, they argue that inconclusive results should not be used in research studies, dismissing them as an “I don't know” answer. As explained in a previous post, the inconclusive conclusion is used to speak for the evidence, not to excuse the examiner from making a conclusion. Depending on the condition of the evidence and the quality of the toolmarks, the examiner may only have the option of reporting an inconclusive conclusion: the markings present can be enough to prevent the examiner from eliminating the expended evidence, while their poor quality and quantity prevent an identification. If the examiner were forced to conclude an identification or an elimination in this scenario, the basis for the conclusion would be weak because of the quality and quantity of those markings. So examiners must be able to use inconclusive to properly speak for that particular evidence.

Subjective vs Objective

            The authors argue that an examiner's subjective experience should not be treated as reliable and that a quantitative standard needs to be established. But they fail to address the large body of articles and scientific background supporting the validity of the field. Some of those studies were discussed above, and it is also heavily documented in the toolmark literature that tools leave unique markings on surfaces as they perform work, due in part to the crystalline structure of the material and other factors. These factors can be observed at the microscopic scale, where chip formation and its effects on toolmarks are visible. Backed by this foundational knowledge, the examiner is able to make their conclusion.

An analogy can further the point that subjectivity does not automatically discount the validity of the science. The house you live in is unique, whether by the way it was built, the area around it, or the personal touches you have added. Based on these features you can walk up to the house that belongs to you. That selection is subjective but is supported by the many factors just described. Show the homeowner a picture of another house of the same design alongside a picture of their own, and using those same features, they will still be able to subjectively select their house from the pictures.

AMES Study

            The authors also reference the AMES Part II study, in which participants from the first part were given the same evidence without their knowledge and asked to reach a conclusion. The authors claim that Part II showed that the same examiners looking at the same bullets reached the same conclusion one-third of the time, and that different examiners looking at the same bullets reached the same conclusion less than one-third of the time. That is all the information the authors provide, with no references. I tried looking up the article, could not find it published anywhere, and later discovered that the FBI had removed it from distribution. I contacted the laboratory that originally produced the article; they stated that the error rates were only around 1% and that they are frustrated the FBI took down Part II of the study. I am currently in the process of getting access to the second part of the study.

The reason I want to review the second study before submitting this portion of my response is that the authors' use of the data can be misleading. Their statement could mean that an examiner originally concluded inconclusive and then, in Part II, concluded the actual ground truth; alternatively, the examiner could have originally reported the ground truth and changed the answer to an inconclusive in Part II. Neither scenario is the kind of error that would destroy someone's life, as the authors suggest. These changes can stem from many factors, such as the quality and quantity of the markings found on the expended evidence, as explained above. An examiner who originally reported inconclusive may now find more markings, due to lighting angles or a small spot on the evidence not seen before, that provide enough information to meet their threshold for an identification/elimination conclusion. Alternatively, an examiner who originally reported an identification/elimination may now report inconclusive because they cannot find the small spots that originally supported their conclusion or cannot achieve the lighting angle they originally used to properly illuminate the markings.

Conclusion

            Overall, the authors' view of the science lacks support, and they provide few references for their claims. I have provided sources and explanations that counter their claims and show the foundation firearm examiners use for their conclusions. Had the authors provided sources for their argument, I would be able to understand their position better, dissect those sources, and provide additional sources if needed. For example, their use of the AMES study lacks both a reference and an explanation of the data, especially since the source itself cannot be viewed and analyzed by the reader. They would also need to better explain who would serve as an anti-expert expert, so the reader can understand where these experts would get their information and why they would make a difference. Additionally, the authors need a better understanding of the inconclusive conclusion and its use in the field of firearm examination before proposing to omit it from research studies.

Firearms and Toolmark Error Rates

Introduction

            On January 3, 2022, four statisticians issued a statement entitled “Firearms and Toolmark Error Rates”. The four statisticians were Alicia Carriquiry, Heike Hofmann, Kori Khan, and Susan Vanderplas. All of them, except Kori Khan, are part of the Center for Statistics and Applications in Forensic Evidence (CSAFE). The purpose of the statement is to offer the opinion that, for the firearm and toolmark discipline, “error rates established from studies with sampling flaws, methodological flaws, non-response and attrition bias, and inconclusive results are not sufficiently sound to be used in criminal proceedings.” I reject the statements made; I will summarize the statement in this article and provide my own opinion.

Participant Sampling

            They first argue that there is a sampling problem within the studies conducted for the discipline: having examiners volunteer for a study will bias it and create lower error rates, because examiners who volunteer are more involved in the discipline and tend to have more experience. The announcements for these studies are usually posted on the Association of Firearm and Toolmark Examiners (AFTE) forum, whose members derive most of their income from being firearm examiners, and the statisticians assume that examiners in this organization are more involved in the field and more experienced. I disagree, because a study has to be announced somewhere the relevant scientific community has the opportunity to volunteer. Examiners in AFTE span all experience levels, and it cannot be assumed that membership excludes examiners with only a few years of experience; in my case, I have only 2 years of experience in this field and I am an AFTE member with access to the AFTE forum. There are also plenty of published studies whose volunteers had only a couple of years of experience, including a consecutively manufactured Ruger slide study performed by the Miami-Dade Crime Laboratory. I also disagree that volunteering undermines the validity of the results, because it would be impossible for researchers to randomly select participants and have their laboratories present the study as actual casework: most laboratories' evidence intake makes this hard to accomplish, and it would be hard to replicate all the evidence and paperwork needed to make the study appear to be a real case. All other scientific disciplines, including the medical field, rely on volunteers for their studies, so this should not be used to exclusively invalidate firearm and toolmark studies.

Material Sampling

            The group then argues that the discipline has material sampling problems. Studies in the discipline tend to focus on consecutively manufactured parts, which the statisticians find problematic, stating that such studies lose the ability to make broad, sweeping claims about the discipline. Instead, they recommend a black box study with a large number of firearms and ammunition types so that the study encompasses more of what is found in actual casework. I disagree, because consecutively manufactured studies create the worst-case scenario for examiners and thus yield the highest theoretical error rate. Consecutive studies have been done on almost every part of the firearm (for example, barrels, extractors, ejectors, and breech faces) and on multiple machining methods (for example, double-broached rifling and hammer-forged rifling). Taken together, these studies isolate the different parts of a firearm and the different manufacturing methods. They focus on the machining method rather than a mass of firearms because there is only a limited number of machining methods manufacturers can use. Examining the machining method is therefore more beneficial for the examiner than examining random firearm makes and models. I also believe that creating a big study examining multiple firearms, as the statement suggests, would not be useful, because examiners would eliminate samples early in the study due to differences in class characteristics, which would prevent the individual characteristics from ever being examined.

Non-Response Bias

            They then turn to the problem of missing data and non-response bias, claiming that most studies never disclose their data or their dropout rate. Their suggestion is that the dropout rate should be factored into the error rate: a dropout rate of 20% should be enough to invalidate the study results, and a 5% rate should be sufficient to cause concern. When the dropout rate reaches these percentages, they recommend that the dropped participants' answers be included and counted as 100% incorrect, reasoning that participants can be assumed to have quit because of the difficulty of the study or their own poor time management, so their answers would have been largely incorrect. Applying this could raise low error rates to as much as 16.56%, which they offer as an upper bound for the error rate. This argument does not hold up well, because many people drop out of a study due to caseload at the laboratory or other responsibilities. A dropout should not automatically be read as the examiner finding the study too hard, especially since the statisticians' earlier assumption was that all volunteers were experienced. Also, assuming an error rate of 100% assumes complete incompetence of the examiner and ignores the scientific backing of the discipline and the quality assurance measures of the laboratory. Most laboratories require a second examiner to reach the same conclusion before it can be released, so this would assume the second examiner also had an error rate of 100%. A toy calculation below shows how this inflation works.
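The numbers in this sketch are hypothetical, chosen only to show the mechanics of counting dropouts as wrong; the 16.56% figure quoted above comes from the statisticians' own calculation, not from this arithmetic.

```python
# Hypothetical numbers showing how counting dropouts as 100% wrong
# inflates an error rate. These are not the statement's actual inputs.
completed_trials = 4000
observed_errors  = 40    # a ~1% observed error rate
dropout_trials   = 700   # comparisons assigned to examiners who quit

observed_rate = observed_errors / completed_trials
# Upper bound: every dropped comparison is scored as an error.
upper_bound = (observed_errors + dropout_trials) / (completed_trials + dropout_trials)

print(f"observed:    {observed_rate:.2%}")   # 1.00%
print(f"upper bound: {upper_bound:.2%}")     # ~15.74%
```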

Inconclusive

            Their next argument concerns the AFTE Theory of Identification's use of inconclusive. The AFTE Theory allows the examiner to conclude identification, inconclusive, or elimination, and it allows three levels of inconclusive ranging from close to an identification to close to an elimination, although these three levels are seldom used in laboratories. The statisticians believe the inconclusive conclusion is used when the decision is hard and the examiner wants to be right, so they want inconclusive conclusions counted as errors rather than omitted from error rates, as is common practice. Counting inconclusives as errors can bring the error rate up to around 50%, making the conclusion a “coin toss”. The field is seeing many “professionals” speak out against the inconclusive conclusion, but I disagree with their statements. Inconclusive is a valid conclusion because of the nature of the evidence normally received in the laboratory. For example, many expended bullets that come through the laboratory are damaged, which can cause foreshortening and damage to the underlying toolmarks; some areas become unusable, leaving the examiner with a limited number of markings. These markings may not meet the examiner's threshold for an identification, but their presence will prevent the examiner from excluding the bullet, so the only option left is to report an inconclusive result. Another situation is when the pressure inside a firearm prevents the head of the casing from making good contact with the breech face, so the primer takes only limited marks from it. This situation is similar to the bullet example, and in no way suggests that the examiner wants to take the easy way out; the examiner reports the conclusion that properly speaks for the evidence and prevents misleading anyone reading the report.

Conclusion

            Based on the arguments above, the group of statisticians conclude that they cannot support firearm and toolmark examination as evidence in criminal proceedings. They base most of their findings on the studies conducted in the field rather than on the specific examiners in it. They take a strict stand against the discipline but fail to recognize the complexity and uniqueness of this comparative science; their misunderstanding of inconclusive results and their importance is one example. Their recommendations are extreme and seem designed simply to raise the error rate of a study, for example by counting dropouts as 100% errors or treating inconclusive results as errors. The courts should not accept their statement, given their lack of understanding and their extreme views on how firearm-related studies should be conducted. They have little evidence to support their claims and provide very few references. This statement also prompted the FBI to post its own response on May 3, 2022, which I will review in another literature review post.