Proceedings Paper

Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations

Journal

Publisher

ASSOC COMPUTING MACHINERY
DOI: 10.1145/3485447.3512034

Keywords

Topic Discovery; Pretrained Language Models; Clustering

Funding

  1. US DARPA KAIROS Program [FA8750-19-2-1004]
  2. INCAS Program [HR001121C0165]
  3. National Science Foundation [IIS-19-56151, IIS-17-41317, IIS 17-04532]
  4. Molecule Maker Lab Institute: An AI Research Institutes program - NSF [2019897]
  5. Google PhD Fellowship
  6. US DARPA SocialSim Program [W911NF-17-C-0099]


This paper proposes a topic discovery method based on pretrained language models (PLMs), which are used in a joint latent space learning and clustering framework. The model effectively utilizes the representation power of PLMs for topic discovery and generates more coherent and diverse topics compared to strong topic models.
Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations including the inability of modeling word ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of both accurate and efficient inference methods for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, there have not been standard approaches to deploy PLMs for topic discovery as better alternatives to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms and meanwhile serve as meaningful summaries of the documents. Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.
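To make the general idea concrete, here is a minimal, simplified sketch of clustering-based topic discovery over embeddings. It is not the paper's joint latent-space model: the random vectors stand in for PLM word and document embeddings, the clustering is plain k-means, and all names (`vocab`, `word_emb`, `doc_emb`) are hypothetical placeholders. The sketch only illustrates the pipeline the abstract describes: embed documents, cluster them into topics, and interpret each topic by its nearest words in a shared embedding space.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means over rows of X; returns (centers, cluster labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Hypothetical stand-ins for PLM embeddings (a real pipeline would encode
# words and documents with a pretrained language model instead).
rng = np.random.default_rng(42)
vocab = [f"word{i}" for i in range(200)]
word_emb = rng.normal(size=(200, 16))   # word embeddings
doc_emb = rng.normal(size=(50, 16))     # document embeddings

# Cluster documents into topics; doc_topic is the document-topic assignment.
centers, doc_topic = kmeans(doc_emb, k=5)

# Interpret each topic by its closest words in the shared space,
# a crude analogue of a topic-word distribution.
for j, c in enumerate(centers):
    nearest = np.argsort(((word_emb - c) ** 2).sum(axis=1))[:5]
    print(f"topic {j}:", [vocab[i] for i in nearest])
```

The paper's model instead learns the latent space and the clustering jointly, so that topic-word and document-topic structure are optimized together rather than obtained from fixed embeddings in a post-hoc step as above.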
