Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches

SOCIAL SCIENCE COMPUTER REVIEW (2022)

Journal

SOCIAL SCIENCE COMPUTER REVIEW

Volume 40, Issue 2, Pages 346-366

Publisher

SAGE PUBLICATIONS INC

DOI: 10.1177/0894439320907027

Keywords

text analysis; dictionary making; semisupervised learning; international relations; United Nations

Categories

Computer Science, Interdisciplinary Applications; Information Science & Library Science; Social Sciences, Interdisciplinary

AI Summary

There is growing interest in quantitative analysis of large corpora among international relations scholars. To address the challenge of using unsupervised machine learning models consistently with existing theoretical frameworks, researchers have proposed a set of techniques that use a semisupervised model for efficient document classification. The approach involves creating a dictionary and using an entropy-based diagnostic tool to improve classification accuracy. Experimental results favor the semisupervised models over a plain keyword dictionary, particularly when contextual information is taken into account.

Abstract

There is growing interest in quantitative analysis of large corpora among international relations (IR) scholars, but many of them find it difficult to perform analysis consistent with existing theoretical frameworks using unsupervised machine learning models to further develop the field. To solve this problem, we created a set of techniques that utilize a semisupervised model that allows researchers to classify documents into predefined categories efficiently. We propose a dictionary-making procedure that uses a new entropy-based diagnostic tool to avoid inclusion of words that are likely to confuse the model and degrade its classification accuracy. In our experiments, we classify sentences of the United Nations General Assembly speeches into six predefined categories using seeded latent Dirichlet allocation (LDA) and Newsmap, which were trained with a small seed word dictionary that we created following the procedure. The results show that, while a keyword dictionary can classify only 25% of sentences, Newsmap can classify over 60% of them correctly; its accuracy exceeds 70% when contextual information is taken into consideration by kernel smoothing of topic likelihoods. We argue that once seed word dictionaries are created by the international relations community, semisupervised models would become more useful than unsupervised models for theory-driven text analysis.
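
The abstract names two ideas that a small example may make more concrete: an entropy-based check on candidate seed words, and kernel smoothing of topic likelihoods across neighboring sentences. The sketch below is only an illustration under stated assumptions, not the authors' procedure. It assumes the diagnostic can be approximated by the Shannon entropy of a word's frequency distribution over categories, and that smoothing uses a Gaussian kernel over sentence positions. The function names, the bandwidth parameter, and the toy scores are all hypothetical.

```python
import math


def word_entropy(counts):
    """Shannon entropy (in bits) of a word's frequency distribution over categories.

    Assumption: low entropy means the word concentrates in one category and is a
    plausible seed word; high entropy means it is spread out and may confuse a model.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def smooth_likelihoods(scores, bandwidth=1.0):
    """Gaussian-kernel smoothing of per-sentence topic scores over sentence positions.

    `scores[i][t]` is the (hypothetical) likelihood score of topic t for sentence i.
    Each smoothed row is a weighted average of all rows, weighted by distance.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        norm = sum(weights)
        smoothed.append(
            [sum(w * scores[j][t] for j, w in enumerate(weights)) / norm for t in range(k)]
        )
    return smoothed


def classify(scores):
    """Pick the index of the highest-scoring topic for each sentence."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy data: three consecutive sentences, two topics. The middle sentence is
# ambiguous on its own and narrowly favors topic 1.
toy_scores = [[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]]
raw_labels = classify(toy_scores)
smoothed_labels = classify(smooth_likelihoods(toy_scores, bandwidth=1.0))
```

In this toy case the ambiguous middle sentence is pulled toward the topic of its neighbors after smoothing, which is the intuition behind using context. How the bandwidth should be chosen, and how the paper actually defines its diagnostic, cannot be determined from the abstract alone.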
