Journal
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22)
Pages 2159-2165
Publisher
ASSOC COMPUTING MACHINERY
DOI: 10.1145/3477495.3531823
Keywords
Text retrieval; data augmentation; natural language understanding; hard negatives; intrinsic bias; RoBERTa
This study improves unsupervised contrastive learning for sentence embeddings by introducing switch-case augmentation and by sampling hard negatives with a pre-trained language model, achieving state-of-the-art results on STS benchmarks.
Following SimCSE, contrastive-learning-based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this gap to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence. This counteracts the intrinsic bias of pre-trained token embeddings toward word frequency, letter case, and subword segmentation. For negative samples, we sample hard negatives from the whole dataset using a pre-trained language model. Combining these two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
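As a concrete illustration of the switch-case augmentation described above, the following Python sketch flips the case of the first letter of randomly selected words; the per-word flip probability p is an assumed hyperparameter, not a value reported on this page.

import random

def switch_case_augment(sentence: str, p: float = 0.15) -> str:
    # Flip the case of the first letter of randomly selected words.
    # p is an assumed per-word flip probability (a tunable hyperparameter).
    augmented = []
    for word in sentence.split():
        if word and word[0].isalpha() and random.random() < p:
            word = word[0].swapcase() + word[1:]  # flip only the first character
        augmented.append(word)
    return " ".join(augmented)

# Two independently augmented views of the same sentence can serve as a
# positive pair in SimCSE-style contrastive training.
print(switch_case_augment("The quick brown Fox jumps over the lazy dog"))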
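For the hard-negative side, here is a minimal mining sketch under cosine similarity; the encoder checkpoint, pooling strategy, and number of neighbors k are assumptions, since the abstract only states that negatives are retrieved from the whole dataset with a pre-trained language model.

import numpy as np

def mine_hard_negatives(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    # embeddings: (N, d) sentence vectors from a frozen pre-trained LM
    # (which checkpoint and pooling to use is an assumption here).
    # L2-normalize so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sentence is never its own negative
    # Take the k most similar other sentences as hard negatives.
    return np.argsort(-sim, axis=1)[:, :k]

# Usage: hard_ids[i] indexes the sentences most similar to sentence i,
# which serve as hard negatives in the contrastive objective.
rng = np.random.default_rng(0)
hard_ids = mine_hard_negatives(rng.standard_normal((100, 768)), k=2)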