Article

Validation of scientific topic models using graph analysis and corpus metadata

SCIENTOMETRICS (2022)

Journal

SCIENTOMETRICS

Volume 127, Issue 9, Pages 5441-5458

Publisher

SPRINGER

DOI: 10.1007/s11192-022-04318-5

Keywords

Topic modeling; Latent Dirichlet Allocation; Graph analysis; Semantic similarity; Model validation

Categories

Computer Science, Interdisciplinary Applications; Information Science & Library Science

Funding

European Union [101004870]
FEDER/Spanish Ministry of Science, Innovation and Universities, State Agency of Research [TEC2017-83838-R]
CRUE-CSIC agreement
Springer Nature

Summary

Probabilistic topic modeling algorithms, such as Latent Dirichlet Allocation (LDA), have become powerful tools for analyzing large collections of documents. However, selecting the right hyperparameters for a specific application is not easy. This study proposes two graph metrics for selecting hyperparameters so as to optimize the document similarity measures derived from the topic model. Experimental results on various corpora related to science, technology, and innovation (STI) show that these metrics provide relevant indicators for selecting the number of topics and building persistent topic models consistent with the metadata. The approach can be extended beyond LDA and could facilitate the systematic adoption of similar techniques in STI policy analysis and design.

Abstract

Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which focus on the quality of the topics, are not a good fit for applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, our work aims to establish a new methodology for hyperparameter selection that is specifically oriented to optimizing the similarity metrics derived from the topic model. To this end, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second measures the alignment between the graph derived from the LDA model and another graph obtained from metadata available for the corresponding corpus. Experiments on various corpora related to STI show that the proposed metrics provide relevant indicators for selecting the number of topics and building persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of these techniques in STI policy analysis and design.
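The abstract does not give formulas for the two graph metrics, so the following is only a minimal sketch of the general idea, not the authors' definitions. It assumes that each LDA run yields a matrix of document-topic proportions, that documents are linked when their Bhattacharyya coefficient exceeds a threshold, and that graphs are compared by the Jaccard overlap of their edge sets. The function names, the threshold of 0.9, and the toy data are all hypothetical.

```python
import numpy as np


def similarity_graph(doc_topics, threshold=0.9):
    """Boolean adjacency matrix linking documents with similar topic mixes.

    Assumption: similarity is the Bhattacharyya coefficient sum_k sqrt(p_k * q_k)
    between the rows of a document-topic matrix (each row sums to 1).
    """
    root = np.sqrt(np.asarray(doc_topics, dtype=float))
    adjacency = (root @ root.T) >= threshold
    np.fill_diagonal(adjacency, False)  # no self-loops
    return adjacency


def metadata_graph(labels):
    """Link two documents when they share the same metadata label."""
    labels = np.asarray(labels)
    adjacency = labels[:, None] == labels[None, :]
    np.fill_diagonal(adjacency, False)
    return adjacency


def edge_jaccard(adj_a, adj_b):
    """Jaccard overlap of the edge sets of two undirected graphs on the same nodes."""
    upper = np.triu_indices_from(adj_a, k=1)  # count each undirected edge once
    a, b = adj_a[upper], adj_b[upper]
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty graphs are treated as identical
    return float(np.logical_and(a, b).sum() / union)


def variability(graphs):
    """One minus the mean pairwise edge overlap across runs (0 means identical graphs)."""
    overlaps = [
        edge_jaccard(graphs[i], graphs[j])
        for i in range(len(graphs))
        for j in range(i + 1, len(graphs))
    ]
    return 1.0 - float(np.mean(overlaps))


# Toy document-topic matrices for three hypothetical runs with the same hyperparameters.
run_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
run_b = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]  # same as run_a with topics swapped
run_c = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]  # a genuinely different grouping
labels = ["physics", "physics", "biology", "biology"]      # hypothetical corpus metadata

graphs = [similarity_graph(run) for run in (run_a, run_b, run_c)]
stability_score = variability(graphs)
alignment_score = edge_jaccard(graphs[0], metadata_graph(labels))
```

Comparing graphs rather than topics sidesteps the fact that topic indices are arbitrary and may be permuted between runs: `run_a` and `run_b` swap the two topics yet produce the same similarity graph. In a hyperparameter search, one would compute both scores for each candidate number of topics and prefer settings with low variability and high alignment.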
