Article

Machine Learning and Feature Selection for Authorship Attribution: The Case of Mill, Taylor Mill and Taylor, in the Nineteenth Century

IEEE ACCESS (2021)

Journal

IEEE ACCESS

Volume 9, Issue -, Pages 7143-7151

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC

DOI: 10.1109/ACCESS.2020.3047583

Keywords

Authorship attribution; text classification; machine learning; feature selection

Categories

Computer Science, Information Systems; Engineering, Electrical & Electronic; Telecommunications

Funding

University of Cyprus

AI Summary

This article revisits a divisive question about the authorship of John Stuart Mill's corpus. It reviews the disagreement among experts and presents the research team's methods and experimental results. The trained classifiers attribute the disputed texts to John Stuart Mill.

Abstract

In this article we revisit a divisive issue regarding the corpus of one of the most famous nineteenth-century philosophers: John Stuart Mill. He was the author of two iconic texts in the history of political philosophy: On Liberty and The Subjection of Women. However, Mill attributed the first to collaboration with Harriet Taylor Mill, his wife, and characterized the second as a work of three minds: his own, his wife's, and that of her daughter, Helen Taylor. Experts disagree on this issue. Most think Mill was too generous in sharing authorship credit. We use a training set consisting of manuscripts by the three above-mentioned authors to train classifiers on a four-class problem (three authors and joint productions). For every manuscript in the training set we extract a set of features that are widely used in text analytics and classification. Then, we apply pre-processing techniques to normalize the data and to reduce the number of features. Finally, we train three classifiers, namely k-nearest neighbours (k-NNs) with k = 1 and k = 2, support vector machines (SVMs), and decision trees (DTs), to attribute the texts of disputed authorship to one of the four classes. We repeat the experiments with a different feature set each time, in order to identify the combination of features that yields the best results on the test set. The best results are achieved with the SVMs, taking as input the bigram features and their principal components. The mean detection rate for all four classes is 100%. Similar results are achieved with the models built with the k-NNs (k = 1) and the DTs. The only classifier that consistently returns significantly lower results is the k-NN with k = 2. All of the instances in the test set are attributed to John Stuart Mill.
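To make the general shape of such a pipeline concrete, the following is a minimal sketch of one small part of it: extracting bigram features and attributing a text with a 1-nearest-neighbour rule. It is not the authors' implementation. It assumes character bigrams (the abstract does not say whether the bigrams are character-level or word-level), uses plain Euclidean distance, and omits the normalization, principal component, SVM, and decision tree steps. The function names `bigram_profile`, `distance`, and `nearest_neighbour`, the labels, and the toy texts are all hypothetical.

```python
from collections import Counter
import math


def bigram_profile(text):
    """Relative frequencies of character bigrams (assumed feature type)."""
    cleaned = " ".join(text.lower().split())  # lower-case, collapse whitespace
    counts = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def distance(p, q):
    """Euclidean distance between two sparse bigram profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def nearest_neighbour(train, text):
    """1-NN rule: return the label of the training text closest to `text`."""
    target = bigram_profile(text)
    closest = min(train, key=lambda item: distance(bigram_profile(item[0]), target))
    return closest[1]


# Made-up snippets standing in for labelled manuscripts (not real data).
train = [
    ("the the the the", "A"),
    ("zzz qqq zzz qqq", "B"),
]
print(nearest_neighbour(train, "the the"))
```

With k = 1 each disputed text simply inherits the label of its single closest training text. With k = 2 the two nearest neighbours can disagree, so the outcome depends on how ties are broken; that behaviour is left out of this sketch.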
