For example, if repeatability is the main problem, evaluators are confused or undecided about certain criteria. If reproducibility is the problem, evaluators hold strong opinions about certain conditions, but those opinions differ from person to person. If the problems appear across many evaluators, they are systemic or procedural; if they are confined to a few evaluators, they may simply require some individual attention. In either case, training or job aids can be targeted at specific individuals or at all evaluators, depending on how many people are assigning attributes imprecisely.

Beyond the sample-size issue, logistics can make it difficult to ensure that evaluators do not remember the attribute they first assigned to a scenario when they see it a second time. This can be mitigated somewhat by increasing the sample size and, better still, by waiting a while (perhaps one to two weeks) before presenting the scenarios to the evaluators a second time. Randomizing the order in which scenarios are presented can also help. In addition, evaluators tend to behave differently when they know they are being examined, so simply knowing that it is a test can skew the results. Disguising the test in some way might help, but it is almost impossible to achieve, and it arguably borders on the unethical. And beyond being at best marginally effective, these countermeasures add complexity and time to an already difficult study.

The audit should help identify the specific people and codes that are the main sources of problems, and the attribute agreement analysis should help determine the relative contributions of repeatability and reproducibility problems for those specific codes (and individuals).
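To make the repeatability/reproducibility distinction concrete, here is a minimal sketch in Python using entirely hypothetical data: two raters each code the same six bug scenarios twice. The rater names, category labels, and agreement measures (simple percent agreement, not a full attribute gage R&R) are illustrative assumptions, not part of any real study.

```python
# Hypothetical attribute agreement data: each rater codes the same
# 6 scenarios in two separate trials. Category labels are made up.
ratings = {
    "rater_A": [["UI", "logic", "UI", "data", "logic", "UI"],
                ["UI", "logic", "data", "data", "logic", "UI"]],
    "rater_B": [["UI", "data", "UI", "data", "logic", "logic"],
                ["UI", "data", "UI", "data", "logic", "logic"]],
}

def repeatability(trials):
    """Share of scenarios a single rater codes the same way in both trials."""
    t1, t2 = trials
    return sum(a == b for a, b in zip(t1, t2)) / len(t1)

def reproducibility(all_ratings):
    """Share of scenarios on which every rater's first-trial codes agree."""
    first_trials = [trials[0] for trials in all_ratings.values()]
    n = len(first_trials[0])
    return sum(len(set(codes)) == 1 for codes in zip(*first_trials)) / n

for rater, trials in ratings.items():
    print(rater, "repeatability:", round(repeatability(trials), 2))
print("reproducibility:", round(reproducibility(ratings), 2))
```

With this toy data, rater_A disagrees with themselves on one scenario (a repeatability problem), while the two raters disagree with each other on two scenarios (a reproducibility problem), which is exactly the kind of breakdown that tells you whether to coach individuals or fix the coding criteria.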
In addition, many bug databases have accuracy problems in recording where an error was created: the location where the error was detected is recorded, rather than where it originated. Once an error has been detected, there is often little information left to identify its cause, so the accuracy of the location assignment should also be an element of the audit. As with any measurement system, the accuracy and precision of the database must be understood before its information is used (or at least while it is being used) to make decisions.

At first glance, the obvious starting point would seem to be an attribute agreement analysis (or attribute gage R&R study). But that may not be such a good idea. Because executing an attribute agreement analysis can be time-consuming, expensive, and usually uncomfortable for everyone involved (the analysis is simple compared with the execution), it is best to pause and really understand what needs to be done and why. Attribute agreement analysis can be a powerful tool for uncovering sources of inaccuracy in a bug tracking system, but it should be used with great care and deliberation, and with minimal complexity, if it is used at all. The best approach is to audit the database first and then use the results of that audit to design a focused and streamlined repeatability and reproducibility assessment. In this example, a repeatability assessment is used to illustrate the idea; the same idea applies to reproducibility.
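One way the audit results can feed a focused assessment is to compare each evaluator's codes against the codes the audit established as correct. The sketch below uses hypothetical data and two common measures: raw percent agreement, and Cohen's kappa, which discounts agreement that would occur by chance. All names and values here are illustrative assumptions.

```python
# Hypothetical audit comparison: codes established by the database audit
# versus the codes one evaluator actually assigned to the same 8 bugs.
from collections import Counter

audit_codes = ["UI", "logic", "data", "UI", "logic", "data", "UI", "logic"]
rater_codes = ["UI", "logic", "UI",   "UI", "data",  "data", "UI", "logic"]

def percent_agreement(a, b):
    """Raw share of items on which the two code lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance, based on each list's code frequencies."""
    n = len(a)
    po = percent_agreement(a, b)                    # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)   # expected chance agreement
    return (po - pe) / (1 - pe)

print("accuracy:", percent_agreement(audit_codes, rater_codes))
print("kappa:", round(cohens_kappa(audit_codes, rater_codes), 2))
```

Raw accuracy alone can flatter an evaluator when a few categories dominate the database, which is why a chance-corrected measure such as kappa is worth reporting alongside it.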