Article

Authorship attribution based on a probabilistic topic model

INFORMATION PROCESSING & MANAGEMENT (2013)

Journal

INFORMATION PROCESSING & MANAGEMENT

Volume 49, Issue 1, Pages 341-354

Publisher

ELSEVIER SCI LTD

DOI: 10.1016/j.ipm.2012.06.003

Keywords

Authorship attribution; Text categorization; Machine learning; Lexical statistics

Categories

Computer Science, Information Systems; Information Science & Library Science

Abstract

This paper describes, evaluates and compares the use of latent Dirichlet allocation (LDA) as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topic distributions, with each topic specifying a distribution over words. Based on author profiles (the aggregation of all texts written by the same writer), we suggest computing the distance between each profile and a disputed text to determine its possible writer. This distance is based on the difference between the two topic distributions. To evaluate different attribution schemes, we carried out an experiment based on 5408 newspaper articles (Glasgow Herald) written by 20 distinct authors. To complement this experiment, we used 4326 articles extracted from the Italian newspaper La Stampa and written by 20 journalists. This research demonstrates that the LDA-based classification scheme tends to outperform the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms. Compared to the Kullback-Leibler divergence, the LDA-based scheme can provide better effectiveness when considering a larger number of terms. © 2012 Elsevier Ltd. All rights reserved.
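The attribution step described in the abstract (compare the topic distribution of a disputed text with that of each author profile, then pick the closest profile) can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the paper: the topic distributions are presumed to have already been inferred by an LDA model, the numbers are made up, the names `l1_distance` and `attribute` are hypothetical, and the L1 (absolute difference) distance is only one plausible reading of "the difference between the two topic distributions".

```python
from typing import Dict, List


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two topic distributions.

    Assumption: the abstract does not name the exact distance, so L1 is
    used here purely for illustration.
    """
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(disputed: List[float], profiles: Dict[str, List[float]]) -> str:
    """Return the author whose profile is closest to the disputed text."""
    return min(profiles, key=lambda author: l1_distance(disputed, profiles[author]))


# Hypothetical topic distributions over three topics (not data from the paper).
# Each profile stands for the aggregation of all texts by one author.
profiles = {
    "author_a": [0.70, 0.20, 0.10],
    "author_b": [0.10, 0.30, 0.60],
}
disputed = [0.60, 0.30, 0.10]

predicted = attribute(disputed, profiles)
print(predicted)
```

In this toy setting the disputed text is assigned to the profile with the smallest distance. A real experiment would first fit LDA on the corpus to obtain the distributions, and could replace `l1_distance` with another measure without changing the rest of the sketch.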
