Article

Machine Learning and Feature Selection for Authorship Attribution: The Case of Mill, Taylor Mill and Taylor, in the Nineteenth Century

Journal

IEEE ACCESS
Volume 9, Pages 7143-7151

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2020.3047583

Keywords

Authorship attribution; text classification; machine learning; feature selection

Funding

  1. University of Cyprus

This article revisits a divisive question about the authorship of John Stuart Mill's corpus, reviewing the experts' differing opinions and presenting the research team's methods and experimental results. Classifiers trained on the authors' manuscripts attribute the disputed texts to John Stuart Mill.
In this article we revisit a divisive issue concerning the corpus of one of the most famous nineteenth-century philosophers: John Stuart Mill. He was the author of two iconic texts in the history of political philosophy: On Liberty and The Subjection of Women. However, Mill attributed the first to collaboration with Harriet Taylor Mill, his wife, and characterized the second as the work of three minds: his own, his wife's, and that of her daughter, Helen Taylor. Experts disagree on this issue; most think Mill was too generous in sharing authorship credit. We use a training set consisting of manuscripts by the three above-mentioned authors to train classifiers on a four-class problem (three authors and joint productions). For every manuscript in the training set we extract a set of features that are widely used in text analytics and classification. Then we apply pre-processing techniques to normalize the data and to reduce the number of features. Finally, we train three classifiers, namely k-nearest neighbours (k-NN) with k = 1 and k = 2, support vector machines (SVMs), and decision trees (DTs), to attribute the texts of disputed authorship to one of the four classes. We repeat the experiments with a different feature set each time in order to identify the combination of features that yields the best results on the test set. The best results are achieved with the SVMs, taking as input the bigram features and their principal components: the mean detection rate over all four classes is 100%. Similar results are achieved with the models built with k-NN (k = 1) and the DTs. The only classifier that consistently returns significantly lower results is k-NN with k = 2. All of the instances in the test set are attributed to John Stuart Mill.
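
The pipeline described in the abstract (extract features, normalize, classify) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes character bigrams (the abstract does not say whether the bigrams are over characters or words), assumes z-score scaling as the normalization step, and covers only the k-NN classifier with k = 1. The principal-component reduction, the SVMs, and the DTs are omitted. All function names, the class labels, and the synthetic placeholder strings are hypothetical and stand in for real manuscripts.

```python
from collections import Counter
import math


def bigram_profile(text):
    """Relative frequencies of character bigrams (an assumed feature choice)."""
    cleaned = " ".join(text.lower().split())
    counts = Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()}


def vectorize(profile, vocabulary):
    """Map a bigram profile onto a fixed feature order."""
    return [profile.get(bigram, 0.0) for bigram in vocabulary]


def fit_scaler(rows):
    """Per-feature mean and standard deviation for z-score normalization."""
    n = len(rows)
    means, stds = [], []
    for column in zip(*rows):
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        means.append(mean)
        stds.append(math.sqrt(variance) or 1.0)  # avoid dividing by zero
    return means, stds


def scale(row, means, stds):
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def attribute(train_texts, train_labels, disputed_texts):
    """Label each disputed text with the class of its nearest training text (1-NN)."""
    profiles = [bigram_profile(t) for t in train_texts]
    # The vocabulary comes from the training set only.
    vocabulary = sorted(set().union(*profiles))
    raw_rows = [vectorize(p, vocabulary) for p in profiles]
    means, stds = fit_scaler(raw_rows)
    train_rows = [scale(r, means, stds) for r in raw_rows]

    predictions = []
    for text in disputed_texts:
        row = scale(vectorize(bigram_profile(text), vocabulary), means, stds)
        distances = [math.dist(row, t) for t in train_rows]
        predictions.append(train_labels[distances.index(min(distances))])
    return predictions


# Synthetic placeholder strings, not real manuscript text.
train_texts = ["abab abab abab", "abab baba abab",
               "xyxy xyxy xyxy", "xyxy yxyx xyxy"]
train_labels = ["class_a", "class_a", "class_b", "class_b"]
predictions = attribute(train_texts, train_labels, ["baba abab", "yxyx xyxy"])
print(predictions)
```

With k = 1 the prediction is simply the label of the single closest training text, so no tie-breaking rule is needed. With k = 2 the two neighbours can disagree, and the result then depends on how ties are resolved.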
