Proceedings Paper

Improving the Generalizability of the Dense Passage Retriever Using Generated Datasets

Journal

ADVANCES IN INFORMATION RETRIEVAL, ECIR 2023, PT II
Volume 13981, Pages 94-109

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG
DOI: 10.1007/978-3-031-28238-6_7



Summary

Dense retrieval methods have outperformed traditional sparse retrieval methods in open-domain retrieval. However, accuracy drops noticeably when these methods are applied to out-of-distribution and out-of-domain datasets. This may be due to the mismatch in the information available to the context encoder and the query encoder during training. By training on datasets with multiple queries per passage, we show that dense passage retriever models perform better on out-of-distribution and out-of-domain test datasets than models trained on datasets with a single query per passage.

Abstract

Dense retrieval methods have surpassed traditional sparse retrieval methods for open-domain retrieval. While these methods, such as the Dense Passage Retriever (DPR), work well on the datasets or domains they have been trained on, there is a noticeable loss in accuracy when they are tested on out-of-distribution and out-of-domain datasets. We hypothesize that this may be, in large part, due to the mismatch in the information available to the context encoder and the query encoder during training. Most datasets commonly used for training dense retrieval models consist overwhelmingly of passages for which only a single query is available. We hypothesize that this imbalance encourages dense retrieval models to overfit to a single potential query per passage, leading to worse performance on out-of-distribution and out-of-domain queries. To test this hypothesis, we focus on a prominent dense retrieval method, DPR; build generated datasets that have multiple queries for most passages; and compare DPR models trained on these datasets against models trained on single-query-per-passage datasets. Using the generated datasets, we show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets.
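The DPR-style training setup described in the abstract can be illustrated with a minimal sketch: a dual-encoder scores each query against all passages in a batch, and the model is trained with in-batch negatives so that each query's gold passage gets the highest score. The sketch below is a generic illustration of this loss, not the authors' code; the function names, the use of noisy passage copies as stand-ins for generated queries, and all dimensions are illustrative assumptions.

```python
import numpy as np

def in_batch_negative_loss(query_vecs, passage_vecs, positive_idx):
    """Mean negative log-likelihood of each query's gold passage,
    with every other passage in the batch acting as a negative."""
    # sim[i, j] = dot-product score between query i and passage j
    sim = query_vecs @ passage_vecs.T
    # numerically stable softmax over passages for each query
    logits = sim - sim.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    gold_probs = probs[np.arange(len(query_vecs)), positive_idx]
    return -np.mean(np.log(gold_probs))

rng = np.random.default_rng(0)
passages = rng.normal(size=(4, 8))  # 4 passage embeddings (illustrative)
# Two "generated" queries per passage, mimicked here as noisy copies;
# in the paper's setting these would come from a query-generation model.
queries = np.vstack([passages + 0.1 * rng.normal(size=passages.shape)
                     for _ in range(2)])
gold = np.tile(np.arange(4), 2)  # each query's source-passage index
loss = in_batch_negative_loss(queries, passages, gold)
```

With multiple queries per passage, each passage appears as the positive for several distinct queries in the batch, which is the imbalance-reducing effect the abstract attributes to the generated datasets.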


