Article

Exploring syntactic and semantic features for authorship attribution

Journal

APPLIED SOFT COMPUTING
Volume 111, Issue -, Pages -

Publisher

ELSEVIER
DOI: 10.1016/j.asoc.2021.107815

Keywords

Authorship attribution; Feature extraction; Multi-channel self-attention; Syntactic feature; Semantic feature

Funding

  1. National Key Research and Development Program of China [2019YFB1406300]
  2. National Natural Science Foundation of China [61972336, 62073284]
  3. Zhejiang Provincial Natural Science Foundation of China [LY19F030008]
  4. Project of Humanities and Social Sciences of Ministry of Education in China [17YJAZH056]
  5. Tsinghua University Humanities and Social Sciences Revitalization Project [2019THZWJC38]

This paper discusses the importance of authorship attribution and the limitations of existing methods, proposing a novel approach that combines features from multiple dimensions, with experimental results demonstrating its effectiveness compared to state-of-the-art models.
Authorship attribution is the task of identifying the authors of anonymous documents from extracted features. Many previous works on authorship attribution focus on statistical style features (e.g., sentence/word length) and content features (e.g., frequent words, n-grams). Modeling these features with regression or other transparent machine learning methods gives a portrait of the authors' writing style, but these methods do not capture syntactic (e.g., dependency relationship) or semantic (e.g., topics) information. In recent years, some researchers have modeled syntactic trees or latent semantic information with neural networks. However, few works consider both together. In this paper, we propose a novel Multi-Channel Self-Attention Network (MCSAN) that incorporates both inter-channel and inter-positional interaction to extract n-grams of characters, words, parts of speech (POS), phrase structures, dependency relationships, and topics from multiple dimensions (style, content, syntactic and semantic features) to distinguish different authors. We then combine these extracted features with logistic regression (LR) in our experiments, and the results show that the extracted features are effective. Our method improves on state-of-the-art models by 2.1% and 3.0% on CCAT10 and CCAT50, respectively. © 2021 Elsevier B.V. All rights reserved.
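
As a rough illustration of the character and word n-gram channels mentioned in the abstract, the following minimal sketch counts overlapping n-grams in a text. It is not the MCSAN model and not the authors' code: the function names (`char_ngrams`, `word_ngrams`, `ngram_profile`), the whitespace tokenizer, and the default n-gram sizes are all assumptions made for this example. The POS, phrase-structure, dependency, and topic channels are omitted because they would require a parser and a topic model.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return overlapping word n-grams, using a naive lowercase whitespace split."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_profile(text, n_char=3, n_word=2):
    """Count n-grams per channel; each key is a (channel, n-gram) pair."""
    profile = Counter()
    profile.update(("char", g) for g in char_ngrams(text, n_char))
    profile.update(("word", g) for g in word_ngrams(text, n_word))
    return profile


# Example: build a count profile for one short document.
profile = ngram_profile("the cat sat on the mat")
```

Count profiles of this kind could be vectorized and passed to a classifier such as logistic regression; the abstract states that the paper instead feeds features extracted by MCSAN to LR.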
