Proceedings Paper

SAS: Self-Augmentation Strategy for Language Model Pre-training

THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2022)

Pages 11586-11594

Publisher

ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE

Automated Summary

The core of self-supervised learning for pre-training language models lies in the design of pre-training tasks as well as appropriate data augmentation. This paper proposes a self-augmentation strategy (SAS) that utilizes a single network for both regular pre-training and contextualized data augmentation, outperforming ELECTRA and other state-of-the-art models on the GLUE tasks with similar or lower computation cost.

Abstract

The core of self-supervised learning for pre-training language models includes pre-training task design as well as appropriate data augmentation. Most data augmentations in language model pre-training are context-independent. A seminal contextualized augmentation was recently proposed in ELECTRA, which achieved state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for training a main discrimination network (discriminator). This design, however, introduces the extra computation cost of the generator and the need to adjust the relative capability between the generator and the discriminator. In this paper, we propose a self-augmentation strategy (SAS) in which a single network is used both for regular pre-training and for generating contextualized data augmentation for training in later epochs. Essentially, this strategy eliminates the separate generator and uses the single network to jointly conduct two pre-training tasks with MLM (Masked Language Modeling) and RTD (Replaced Token Detection) heads. It avoids the challenge of searching for an appropriate generator size, which is critical to performance, as evidenced in ELECTRA and its subsequent variant models. In addition, SAS is a general strategy that can be seamlessly combined with many techniques that have emerged recently or will emerge in the future, such as the disentangled attention mechanism from DeBERTa. Our experiments show that SAS outperforms ELECTRA and other state-of-the-art models on the GLUE tasks with similar or lower computation cost.
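The data flow the abstract describes, where the network's own MLM predictions become the replaced tokens that its RTD head must later detect, can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the masking rate, and the `toy_mlm_head` stand-in (which simply samples a random vocabulary token instead of querying a trained network) are all hypothetical and chosen only to show how one pass could yield an MLM input plus an augmented sequence with RTD labels.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Randomly choose positions to mask; return the masked sequence and positions."""
    positions = {i for i in range(len(tokens)) if rng.random() < mask_prob}
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions


def toy_mlm_head(masked, position, vocab, rng):
    """Hypothetical stand-in for the MLM head.

    A real model would sample from its predicted distribution at `position`
    given the masked context; here we just sample uniformly from the vocab.
    """
    return rng.choice(vocab)


def self_augment(tokens, vocab, mask_prob=0.3, seed=0):
    """Produce an MLM input plus an augmented sequence and its RTD labels.

    Assumed flow: `masked` trains the MLM head now, while `augmented` and
    `rtd_labels` are stored as RTD training data for a later epoch.
    """
    rng = random.Random(seed)
    masked, positions = mask_tokens(tokens, mask_prob, rng)

    # Fill each masked slot with the (toy) MLM head's sampled token.
    augmented = list(tokens)
    for i in sorted(positions):
        augmented[i] = toy_mlm_head(masked, i, vocab, rng)

    # RTD label is 1 where the token was actually replaced, 0 where it is original.
    # A sampled token that happens to equal the original counts as original.
    rtd_labels = [int(a != t) for a, t in zip(augmented, tokens)]
    return masked, augmented, rtd_labels


if __name__ == "__main__":
    sentence = "the chef cooked the meal".split()
    vocabulary = ["the", "chef", "cooked", "meal", "ate", "dog"]
    masked, augmented, labels = self_augment(sentence, vocabulary, mask_prob=0.5, seed=1)
    print(masked)
    print(augmented)
    print(labels)
```

In this sketch no separate generator exists: the same (here simulated) network that is trained on `masked` also supplies the replacements in `augmented`, which is the property the abstract attributes to SAS. How the two losses are weighted and when the RTD head starts training are not stated in the abstract, so they are left out.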
