Journal
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22)
Pages 2159-2165
Publisher
ASSOC COMPUTING MACHINERY
DOI: 10.1145/3477495.3531823
Keywords
Text retrieval; data augmentation; natural language understanding; hard negatives; intrinsic bias; RoBERTa
This study improves unsupervised contrastive learning for sentence embeddings by introducing switch-case augmentation and by sampling hard negatives with a pre-trained language model, achieving state-of-the-art results on STS benchmarks.
Following SimCSE, contrastive-learning-based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this gap to the quality of positive and negative samples, and aim to improve both. Specifically, for positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence. This counteracts the intrinsic bias of pre-trained token embeddings toward word frequency, letter case, and subword segmentation. For negative samples, we sample hard negatives from the whole dataset using a pre-trained language model. Combining these two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
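As a concrete illustration of the switch-case augmentation described above, the following Python sketch flips the case of the first letter of randomly selected words; the per-word flip probability p is an assumed hyperparameter, not a value reported on this page.

import random

def switch_case_augment(sentence: str, p: float = 0.15) -> str:
    # Flip the case of the first letter of randomly selected words.
    # p is an assumed per-word flip probability (a tunable hyperparameter).
    augmented = []
    for word in sentence.split():
        if word and word[0].isalpha() and random.random() < p:
            word = word[0].swapcase() + word[1:]  # flip only the first character
        augmented.append(word)
    return " ".join(augmented)

# Two independently augmented views of the same sentence can serve as a
# positive pair in SimCSE-style contrastive training.
print(switch_case_augment("The quick brown Fox jumps over the lazy dog"))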
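For the hard-negative side, here is a minimal mining sketch under cosine similarity; the encoder checkpoint, pooling strategy, and number of neighbors k are assumptions, since the abstract only states that negatives are retrieved from the whole dataset with a pre-trained language model.

import numpy as np

def mine_hard_negatives(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    # embeddings: (N, d) sentence vectors from a frozen pre-trained LM
    # (which checkpoint and pooling to use is an assumption here).
    # L2-normalize so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a sentence is never its own negative
    # Take the k most similar other sentences as hard negatives.
    return np.argsort(-sim, axis=1)[:, :k]

# Usage: hard_ids[i] indexes the sentences most similar to sentence i,
# which serve as hard negatives in the contrastive objective.
rng = np.random.default_rng(0)
hard_ids = mine_hard_negatives(rng.standard_normal((100, 768)), k=2)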