Article

Joint upper & expected value normalization for evaluation of retrieval systems: A case study with Learning-to-Rank methods

Journal

Information Processing & Management
Volume 60, Issue 4

Publisher

Elsevier
DOI: 10.1016/j.ipm.2023.103404

Keywords

Information retrieval evaluation; Upper expected value; Normalization; Learning to Rank

Abstract

This paper introduces a new approach for information retrieval evaluation metrics that combines upper bound normalization and expected value normalization. Two case studies demonstrate the advantages of this new approach compared to traditional methods. Experimental results show that the proposed expected value normalized metrics have better discriminatory power and consistency, suggesting that the IR community should seriously consider expected value normalization when computing nDCG and MAP.
While original IR evaluation metrics are normalized in terms of their upper bounds based on an ideal ranked list, a corresponding expected value normalization for them has not yet been studied. We present a framework with both upper and expected value normalization, where the expected value is estimated from a randomized ranking of the documents present in the evaluation set. We then conduct two case studies by instantiating the new framework for two popular IR evaluation metrics (nDCG and MAP) and comparing them against the traditional metrics. Experiments on two Learning-to-Rank (LETOR) benchmark data sets, MSLR-WEB30K (30K queries and 3771K documents) and MQ2007 (1700 queries and 60K documents), with eight LETOR methods (pairwise and listwise), demonstrate the following properties of the new expected value normalized metrics: (1) Statistically significant differences between two methods in terms of the original metric no longer remain statistically significant in terms of the Upper-Expected (UE) normalized version, and vice versa, especially for uninformative query sets. (2) Compared against the original metrics, our proposed UE normalized metrics show average increases of 23% and 19% in Discriminatory Power on the MSLR-WEB30K and MQ2007 data sets, respectively. We found similar improvements in terms of consistency as well; for example, UE-normalized MAP decreases the swap rate by 28% when comparing across different data sets and by 26% across different query sets within the same data set. These findings suggest that the IR community should seriously consider UE normalization when computing nDCG and MAP, and that a more in-depth study of UE normalization for general IR evaluation is warranted.
