Majority Agreement Among Multiple Observers


Either Pearson's r, Kendall's τ, or Spearman's ρ can be used to measure pairwise correlation among raters who use an ordered scale. Pearson's statistic assumes the rating scale is continuous; Kendall's and Spearman's statistics assume only that it is ordinal. If more than two raters are observed, an average level of agreement for the group can be calculated as the mean of the r, τ, or ρ values from every possible pair of raters.

The joint probability of agreement is the simplest and least robust measure. It is estimated as the percentage of the time the raters agree in a nominal or categorical rating system. It does not take into account the fact that agreement may occur solely by chance. There is some question whether it is necessary to “correct” for chance agreement at all; some suggest that any such adjustment should in any case be based on an explicit model of how chance and error affect raters' decisions.[3]

Kappa statistics measure agreement or reliability while correcting for how often ratings might agree by chance. Cohen's kappa,[5] which works for two raters, and Fleiss' kappa,[6] an adaptation that works for any fixed number of raters, improve on the joint probability by taking into account the amount of agreement that could be expected to occur through chance. The original versions suffered from the same problem as the joint probability in that they treat the data as nominal and assume the ratings have no natural ordering; if the data do have a rank (an ordinal level of measurement), that information is not fully taken into account.
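As an illustration of the pairwise correlation approach described above, the following Python sketch averages Pearson's r, Spearman's ρ, and Kendall's τ over every pair of raters. The ratings matrix and the mean_pairwise helper are hypothetical and included only to show the calculation; SciPy and NumPy are assumed to be available.

    from itertools import combinations

    import numpy as np
    from scipy.stats import kendalltau, pearsonr, spearmanr

    # Hypothetical data: rows are cases, columns are raters, ordinal scores 1-5.
    ratings = np.array([
        [4, 5, 4],
        [2, 2, 3],
        [5, 5, 5],
        [3, 4, 3],
        [1, 2, 1],
        [4, 4, 5],
    ])

    def mean_pairwise(scores, corr):
        """Average a correlation statistic over every pair of rater columns."""
        pairs = combinations(range(scores.shape[1]), 2)
        values = [corr(scores[:, i], scores[:, j])[0] for i, j in pairs]
        return float(np.mean(values))

    print("mean Pearson r:   ", mean_pairwise(ratings, pearsonr))
    print("mean Spearman rho:", mean_pairwise(ratings, spearmanr))
    print("mean Kendall tau: ", mean_pairwise(ratings, kendalltau))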

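To make the chance-corrected measures concrete, the sketch below compares the raw joint probability of agreement with Cohen's kappa (two raters, via scikit-learn's cohen_kappa_score) and Fleiss' kappa (three raters, via statsmodels). The yes/no diagnosis vectors are invented example data, not taken from any cited study.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Invented nominal ratings: 1 = diagnosis present, 0 = absent, ten cases.
    rater_a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
    rater_b = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])
    rater_c = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 1])

    # Joint probability of agreement: the share of cases where two raters match.
    joint = np.mean(rater_a == rater_b)

    # Cohen's kappa corrects that share for the agreement expected by chance.
    kappa_cohen = cohen_kappa_score(rater_a, rater_b)

    # Fleiss' kappa handles any fixed number of raters; the ratings are first
    # tabulated as counts of each category per case.
    table, _ = aggregate_raters(np.column_stack([rater_a, rater_b, rater_c]))
    kappa_fleiss = fleiss_kappa(table)

    print(f"joint agreement: {joint:.2f}")
    print(f"Cohen's kappa:   {kappa_cohen:.2f}")
    print(f"Fleiss' kappa:   {kappa_fleiss:.2f}")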
Another approach to agreement, useful when there are only two raters and the scale is continuous, is to calculate the differences between each pair of the two raters' observations. The mean of these differences is termed the bias, and the reference interval (mean ± 1.96 × standard deviation) is termed the limits of agreement. The limits of agreement give an indication of how much random variation may be influencing the ratings.

Later extensions of the approach included versions that could handle “partial credit” and ordinal scales.[7] These extensions converge with the family of intraclass correlations (ICC), so there are conceptually related ways of estimating reliability for each level of measurement, from nominal (kappa) through ordinal (ordinal kappa or ICC) and interval (ICC or ordinal kappa) to ratio (ICC). There are also variants that examine agreement by raters across a set of items (for example, do two interviewers agree about the depression scores for all of the items of the same semi-structured interview for one case?) as well as raters × cases (for example, how well do two or more raters agree about whether 30 cases have a depression diagnosis, a yes/no nominal variable?).
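A minimal sketch of the bias and limits-of-agreement calculation described above (two raters, continuous scale) follows; the measurement vectors are invented, and only NumPy is assumed.

    import numpy as np

    # Invented continuous scores from two raters for the same ten cases.
    rater_a = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 9.5, 12.4, 10.1, 11.8])
    rater_b = np.array([10.0, 11.9, 9.6, 12.5, 10.4, 11.2, 9.9, 12.0, 10.3, 11.5])

    differences = rater_a - rater_b
    bias = differences.mean()                  # mean difference (the "bias")
    sd = differences.std(ddof=1)               # sample standard deviation
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd  # limits of agreement

    print(f"bias: {bias:+.3f}")
    print(f"limits of agreement: [{lower:+.3f}, {upper:+.3f}]")

If the differences are approximately normally distributed, roughly 95% of them are expected to fall within these limits, which is what makes the interval a useful summary of how much random rater variation is present.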