Journal
APPLIED SCIENCES-BASEL
Volume 11, Issue 4
Publisher
MDPI
DOI: 10.3390/app11041548
Keywords
multi-head attention; inter-head similarity; Transformer; machine translation; language modeling; Natural Language Processing; NLP
Funding
- Samsung Electronics
- BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University
- Automation and Systems Research Institute (ASRI), Seoul National University
This paper quantitatively analyzes the inter-head diversity of multi-head attention and proposes the hypothesis that controlling inter-head diversity can improve model performance. Empirical results show that controlling inter-head diversity yields better performance than the baselines.
Multi-head attention, a powerful strategy in the Transformer, is assumed to utilize information from diverse representation subspaces. However, measuring the diversity between heads' representations, or exploiting that diversity, has rarely been studied. In this paper, we quantitatively analyze the inter-head diversity of multi-head attention by applying two recently developed similarity measures between deep representations: Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA). In doing so, we empirically show that multi-head attention does diversify the representation subspaces of the heads as the number of heads increases. Based on our analysis, we hypothesize that there exists an optimal inter-head diversity at which a model achieves better performance. To examine this hypothesis, we closely inspect three techniques for controlling inter-head diversity: (1) a Hilbert-Schmidt Independence Criterion (HSIC) regularizer among representation subspaces, (2) an orthogonality regularizer, and (3) Drophead, which randomly zeroes out each head in every training step. In experiments on various machine translation and language modeling tasks, we show that controlling inter-head diversity leads to better performance than the baselines.
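To make two of the abstract's ingredients concrete, the sketch below shows, in PyTorch, (a) linear CKA between two heads' representations (its unnormalized numerator is a biased linear-kernel HSIC, the kind of quantity an HSIC-based regularizer would penalize) and (b) a Drophead-style mask that zeroes out whole heads at random during training. This is a minimal illustration under assumed tensor shapes, not the authors' implementation; the function names, shapes, and the dropout-style rescaling are assumptions for the example.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two heads' representations.

    x, y: (n_samples, dim) matrices, e.g. per-token outputs of two
    attention heads flattened over the batch (shapes assumed for this
    sketch). The numerator ||Y^T X||_F^2 is an unnormalized
    linear-kernel HSIC on the centered representations.
    """
    x = x - x.mean(dim=0, keepdim=True)  # center each column
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.matrix_norm(y.T @ x) ** 2
    return cross / (
        torch.linalg.matrix_norm(x.T @ x) * torch.linalg.matrix_norm(y.T @ y)
    )

def drophead(heads: torch.Tensor, p: float = 0.1, training: bool = True) -> torch.Tensor:
    """Zero out whole attention heads at random during training.

    heads: (batch, n_heads, seq_len, head_dim), assumed layout.
    At evaluation time the input passes through unchanged; during
    training each head is kept with probability 1 - p and survivors
    are rescaled by 1 / (1 - p), as in standard dropout.
    """
    if not training or p <= 0.0:
        return heads
    keep = torch.bernoulli(
        torch.full(heads.shape[:2], 1.0 - p, device=heads.device)
    )
    return heads * keep[:, :, None, None] / (1.0 - p)

# Example: CKA between two heads' per-token outputs.
h1 = torch.randn(512, 64)  # 512 tokens, head dim 64
h2 = torch.randn(512, 64)
print(linear_cka(h1, h2))  # near 0 for independent random features
```

CKA values near 1 indicate highly overlapping head subspaces and values near 0 indicate diverse ones, which is what makes it usable both as a diagnostic and, via its HSIC numerator, as a regularization target.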