Article

Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

POLITICAL ANALYSIS (2018)

Journal

POLITICAL ANALYSIS

Volume 26, Issue 2, Pages 168-189

Publisher

CAMBRIDGE UNIV PRESS

DOI: 10.1017/pan.2017.44

Keywords

statistical analysis of texts; unsupervised learning; descriptive statistics

Category

Political Science

Funding

National Science Foundation under IGERT Grant [DGE-1144860]

Abstract

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
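The general idea of checking how sensitive results are to preprocessing can be illustrated with a small sketch. This is not the authors' procedure or software; it is a minimal, hypothetical illustration that assumes four binary preprocessing steps, a bag-of-words representation, and cosine distance between documents. The step names, the tiny stopword list, and the functions `preprocess`, `pairwise_distances`, and `sensitivity` are all invented for this example. For each of the 16 possible regimes it computes the pairwise document distances, then scores each regime by how far its distances sit, on average, from those of every other regime.

```python
import itertools
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a real linguistic resource.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

# Assumption: four binary preprocessing choices, chosen for illustration only.
STEPS = ("lowercase", "remove_punctuation", "remove_numbers", "remove_stopwords")


def preprocess(text, regime):
    """Tokenize `text` under `regime`, a dict mapping each step name to a bool."""
    if regime["lowercase"]:
        text = text.lower()
    if regime["remove_punctuation"]:
        text = re.sub(r"[^\w\s]", " ", text)
    if regime["remove_numbers"]:
        text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    if regime["remove_stopwords"]:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


def cosine_distance(a, b):
    """Cosine distance between two token-count Counters (1.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(docs, regime):
    """Distances for every unordered document pair, in a fixed order."""
    bags = [Counter(preprocess(d, regime)) for d in docs]
    pairs = itertools.combinations(range(len(bags)), 2)
    return [cosine_distance(bags[i], bags[j]) for i, j in pairs]


def all_regimes():
    """Yield every on/off combination of the preprocessing steps."""
    for flags in itertools.product([False, True], repeat=len(STEPS)):
        yield dict(zip(STEPS, flags))


def sensitivity(docs):
    """Score each regime by its mean absolute distance gap to all other regimes."""
    if len(docs) < 2:
        raise ValueError("need at least two documents to compare")
    regimes = list(all_regimes())
    dists = [pairwise_distances(docs, r) for r in regimes]
    scores = []
    for i, d_i in enumerate(dists):
        gaps = [
            sum(abs(x - y) for x, y in zip(d_i, d_j)) / len(d_i)
            for j, d_j in enumerate(dists)
            if j != i
        ]
        scores.append(sum(gaps) / len(gaps))
    return regimes, scores


if __name__ == "__main__":
    corpus = [
        "The Tax Bill of 2018 cuts taxes.",
        "Taxes, taxes, and the budget in 2017!",
        "A speech on immigration and the border.",
    ]
    for regime, score in zip(*sensitivity(corpus)):
        enabled = [s for s in STEPS if regime[s]] or ["none"]
        print(f"{score:.3f}  {', '.join(enabled)}")
```

A regime with a high score produces document distances that differ markedly from those of the alternatives, which under these assumptions is a signal that downstream unsupervised results may hinge on that particular combination of choices.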
