4.7 Article

Joint upper & expected value normalization for evaluation of retrieval systems: A case study with Learning-to-Rank methods

Journal

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2023.103404

Keywords

Information retrieval evaluation; Upper expected value; Normalization; Learning to Rank

This paper introduces a new approach for information retrieval evaluation metrics that combines upper bound normalization and expected value normalization. Two case studies demonstrate the advantages of this new approach compared to traditional methods. Experimental results show that the proposed expected value normalized metrics have better discriminatory power and consistency, suggesting that the IR community should seriously consider expected value normalization when computing nDCG and MAP.
While original IR evaluation metrics are normalized by their upper bounds, computed from an ideal ranked list, a corresponding expected value normalization has not yet been studied. We present a framework with both upper and expected value normalization, where the expected value is estimated from a randomized ranking of the corresponding documents present in the evaluation set. We then conducted two case studies by instantiating the new framework for two popular IR evaluation metrics (nDCG and MAP) and comparing them against the traditional metrics. Experiments on two Learning-to-Rank (LETOR) benchmark data sets, MSLR-WEB30K (30K queries and 3,771K documents) and MQ2007 (1,700 queries and 60K documents), with eight LETOR methods (pairwise and listwise), demonstrate the following properties of the new expected value normalized metrics: (1) Statistically significant differences between two methods in terms of the original metric no longer remain statistically significant in terms of the Upper Expected (UE) normalized version, and vice versa, especially for uninformative query sets. (2) Compared with the original metrics, the proposed UE-normalized metrics show an average increase in Discriminatory Power of 23% on MSLR-WEB30K and 19% on MQ2007. We found similar improvements in consistency as well; for example, UE-normalized MAP decreases the swap rate by 28% when comparing across different data sets and by 26% across different query sets within the same data set. These findings suggest that the IR community should seriously consider UE normalization when computing nDCG and MAP, and that a more in-depth study of UE normalization for general IR evaluation is warranted.
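To make the idea of joint upper and expected value normalization concrete, the following is a minimal sketch for DCG. The abstract does not state the exact formula, so the form used here, (DCG - E[DCG]) / (IDCG - E[DCG]), is an assumption chosen for illustration and may differ from the paper's definition. The function names, the 2^rel - 1 gain, and the choice to score queries whose ideal and expected values coincide as 0.0 are likewise hypothetical. The expected value is computed exactly for a uniformly random permutation rather than estimated by sampling.

```python
import math
from itertools import permutations
from typing import Sequence


def gain(rel: int) -> float:
    """Exponential gain commonly used with graded relevance labels."""
    return 2.0 ** rel - 1.0


def dcg(rels: Sequence[int], k: int) -> float:
    """DCG@k of relevance labels given in ranked order."""
    return sum(gain(r) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ideal_dcg(rels: Sequence[int], k: int) -> float:
    """Upper bound: DCG@k of the labels sorted from best to worst."""
    return dcg(sorted(rels, reverse=True), k)


def expected_dcg(rels: Sequence[int], k: int) -> float:
    """Exact E[DCG@k] when the same documents are ranked uniformly at random.

    Each position holds every document with equal probability, so by
    linearity of expectation each position contributes the mean gain.
    """
    if not rels:
        return 0.0
    mean_gain = sum(gain(r) for r in rels) / len(rels)
    depth = min(k, len(rels))
    return mean_gain * sum(1.0 / math.log2(i + 2) for i in range(depth))


def brute_force_expected_dcg(rels: Sequence[int], k: int) -> float:
    """Average DCG@k over all permutations (small lists only; sanity check)."""
    perms = list(permutations(rels))
    return sum(dcg(p, k) for p in perms) / len(perms)


def ndcg(rels: Sequence[int], k: int) -> float:
    """Traditional upper-bound-normalized DCG@k."""
    ideal = ideal_dcg(rels, k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0


def ue_ndcg(rels: Sequence[int], k: int) -> float:
    """ASSUMED joint form: (DCG - E[DCG]) / (IDCG - E[DCG]).

    1.0 means an ideal ranking, 0.0 means no better than random, and
    negative values mean worse than random. If the ideal and expected values
    coincide (all labels equal), ranking cannot matter and 0.0 is returned.
    """
    ideal = ideal_dcg(rels, k)
    expected = expected_dcg(rels, k)
    if ideal - expected <= 1e-12:
        return 0.0
    return (dcg(rels, k) - expected) / (ideal - expected)


if __name__ == "__main__":
    ranked = [0, 0, 1, 2]  # a ranking that puts the relevant documents last
    print("nDCG@4:   ", ndcg(ranked, 4))
    print("UE-nDCG@4:", ue_ndcg(ranked, 4))
```

Under this assumed form, a ranking that places the relevant documents last still receives a positive traditional nDCG but a negative UE-normalized score, and a query whose documents all share one label scores 0.0 regardless of the ordering. This is one way to see why results on uninformative query sets might change under expected value normalization.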
