Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

POLITICAL ANALYSIS (2018)

Journal

POLITICAL ANALYSIS

Volume 26, Issue 2, Pages 168-189

Publisher

CAMBRIDGE UNIV PRESS

DOI: 10.1017/pan.2017.44

Keywords

statistical analysis of texts; unsupervised learning; descriptive statistics

Funding

National Science Foundation under IGERT Grant [DGE-1144860]

Abstract

Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examine the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by characterizing the variability that changes in preprocessing choices may induce when analyzing a particular dataset. By making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
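To make the idea of checking sensitivity under alternate preprocessing regimes concrete, here is a minimal Python sketch. It is not the authors' procedure or software. The three binary choices (lowercasing, punctuation removal, stopword removal), the toy stopword list, the cosine distance, and the function names are all assumptions chosen for illustration. The sketch builds a bag-of-words representation under every combination of choices and scores each regime by how far its pairwise document distances sit from the average across regimes.

```python
from collections import Counter
from itertools import combinations, product
import math
import string

# Toy stopword list: an assumption for illustration, not a recommended list.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}


def preprocess(text, lowercase, strip_punct, drop_stopwords):
    """Turn one document into a bag of words under one preprocessing regime."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return Counter(tokens)


def cosine_distance(a, b):
    """Cosine distance between two bags of words (1.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(docs, regime):
    """Distances between all document pairs under one regime."""
    bags = [preprocess(d, *regime) for d in docs]
    return [
        cosine_distance(bags[i], bags[j])
        for i, j in combinations(range(len(bags)), 2)
    ]


def sensitivity(docs):
    """Score each regime by its mean absolute deviation from the
    across-regime average of the pairwise document distances."""
    regimes = list(product([False, True], repeat=3))  # 2^3 = 8 regimes
    dists = {r: pairwise_distances(docs, r) for r in regimes}
    n_pairs = len(dists[regimes[0]])
    mean = [
        sum(d[k] for d in dists.values()) / len(regimes)
        for k in range(n_pairs)
    ]
    return {
        r: sum(abs(x - m) for x, m in zip(d, mean)) / n_pairs
        for r, d in dists.items()
    }
```

Calling `sensitivity(docs)` on a list of at least two strings returns one score per regime, keyed by the tuple `(lowercase, strip_punct, drop_stopwords)`. Scores near zero for every regime suggest that these particular choices barely change the document distances for that corpus, while a wide spread suggests that downstream results may depend on them.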
