Article

An Empirical Comparative Assessment of Inter-Rater Agreement of Binary Outcomes and Multiple Raters

Journal

SYMMETRY-BASEL
Volume 14, Issue 2, Pages: -

Publisher

MDPI
DOI: 10.3390/sym14020262

Keywords

inter-rater agreement; inter-rater reliability; observer agreement; Kappa; AC1; Kappa Paradox; meta-analysis; evidence synthesis


This study evaluated the performance of four commonly used inter-rater agreement statistics in the context of multiple raters. The expected values of all four statistics were equal when the outcome prevalence was symmetric, but only the expected values of the three Kappa statistics were equal when the outcome prevalence was asymmetric. Fleiss' Kappa yielded a higher variance in the symmetric case, while Gwet's AC1 yielded a lower variance in the asymmetric case. The authors suggest favoring Gwet's AC1 statistic when the population-level prevalence of outcomes is unknown, and conducting transformations between statistics for direct comparisons between inter-rater agreement measures.
Background: Many methods under the umbrella of inter-rater agreement (IRA) have been proposed to evaluate how well two or more medical experts agree on a set of outcomes. The objective of this work was to assess key IRA statistics in the context of multiple raters with binary outcomes.

Methods: We simulated the responses of several raters (2-5) with 20, 50, 300, and 500 observations. For each combination of raters and observations, we estimated the expected value and variance of four commonly used inter-rater agreement statistics (Fleiss' Kappa, Light's Kappa, Conger's Kappa, and Gwet's AC1).

Results: In the case of equal outcome prevalence (symmetric), the estimated expected values of all four statistics were equal. In the asymmetric case, only the estimated expected values of the three Kappa statistics were equal. In the symmetric case, Fleiss' Kappa yielded a higher estimated variance than the other three statistics. In the asymmetric case, Gwet's AC1 yielded a lower estimated variance than the three Kappa statistics for each scenario.

Conclusion: Since the population-level prevalence of a set of outcomes may not be known a priori, Gwet's AC1 statistic should be favored over the three Kappa statistics. For meaningful direct comparisons between IRA measures, transformations between statistics should be conducted.
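The statistics compared in the abstract share the same observed-agreement term and differ only in how they model chance agreement. As a minimal illustrative sketch (not the authors' simulation code), Fleiss' Kappa and Gwet's AC1 for multiple raters can be computed from a subjects-by-categories count matrix; the function name `agreement_stats` is a placeholder for this example:

```python
import numpy as np

def agreement_stats(counts):
    """Fleiss' Kappa and Gwet's AC1 from an (n_subjects x k_categories)
    count matrix, where counts[i, j] is the number of raters who assigned
    subject i to category j. Assumes every subject has the same number
    of raters."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.shape
    r = counts[0].sum()  # raters per subject (assumed constant)
    # Observed agreement: average pairwise agreement across subjects
    p_obs = ((counts ** 2).sum(axis=1) - r).mean() / (r * (r - 1))
    # Marginal category prevalences
    pi = counts.sum(axis=0) / (n * r)
    # Chance agreement under each statistic's model
    pe_kappa = (pi ** 2).sum()                    # Fleiss' Kappa
    pe_ac1 = (pi * (1 - pi)).sum() / (k - 1)      # Gwet's AC1
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1

# Binary outcomes, 3 raters, 4 subjects
kappa, ac1 = agreement_stats([[3, 0], [0, 3], [3, 0], [1, 2]])
```

The AC1 chance term shrinks toward zero as prevalence becomes asymmetric (pi near 0 or 1), which is why AC1 avoids the "Kappa paradox" of low Kappa values despite high observed agreement.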
