Journal
SCIENCE ADVANCES
Volume 8, Issue 42
Publisher
American Association for the Advancement of Science
DOI: 10.1126/sciadv.abg2652
Funding
- Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health [P2CHD047879]
- National Science Foundation under the Resource Implementations for Data Intensive Research program [1738411, 1738288]
- National Science Foundation, Directorate for Social, Behavioral & Economic Sciences, Division of Social and Economic Sciences [1738288, 1738411]
Text as data techniques can be used to test social science theories with large collections of text. However, estimating the latent representation of the text from the same data used for inference can introduce an identification problem or overfitting. To address these risks, a split-sample workflow is introduced for rigorous causal inference.
Text as data techniques offer a great promise: the ability to inductively discover measures that are useful for testing social science theories with large collections of text. Nearly all text-based causal inferences depend on a latent representation of the text, but we show that estimating this latent representation from the data creates underacknowledged risks: we may introduce an identification problem or overfit. To address these risks, we introduce a split-sample workflow for making rigorous causal inferences with discovered measures as treatments or outcomes. We then apply it to estimate causal effects from an experiment on immigration attitudes and a study on bureaucratic responsiveness.
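The split-sample idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_sample`, `discover_measure`, and `estimate_effect` are hypothetical, and the word-frequency "measure" is a toy stand-in for whatever latent representation (for example, a topic model) a real analysis would discover. The point it shows is only the separation of roles: the measure is chosen on one random split, then frozen and applied to the held-out split to estimate the effect.

```python
import random
from collections import Counter


def split_sample(docs, frac=0.5, seed=0):
    """Randomly partition (text, treated) pairs into discovery and estimation sets."""
    rng = random.Random(seed)
    idx = list(range(len(docs)))
    rng.shuffle(idx)
    cut = int(len(idx) * frac)
    discovery = [docs[i] for i in idx[:cut]]
    estimation = [docs[i] for i in idx[cut:]]
    return discovery, estimation


def discover_measure(discovery):
    """Toy discovery step (an assumption, not the paper's method):
    pick the word whose per-document rate differs most between arms."""
    counts = {0: Counter(), 1: Counter()}
    sizes = {0: 0, 1: 0}
    for text, treated in discovery:
        sizes[treated] += 1
        counts[treated].update(set(text.lower().split()))
    vocab = sorted(set(counts[0]) | set(counts[1]))

    def gap(word):
        rate_treated = counts[1][word] / max(sizes[1], 1)
        rate_control = counts[0][word] / max(sizes[0], 1)
        return abs(rate_treated - rate_control)

    # Ties are broken alphabetically because vocab is sorted.
    return max(vocab, key=gap)


def estimate_effect(estimation, word):
    """Difference in means of the frozen measure (treated minus control),
    computed only on documents that played no role in discovery."""
    by_arm = {0: [], 1: []}
    for text, treated in estimation:
        by_arm[treated].append(1.0 if word in text.lower().split() else 0.0)
    if not by_arm[0] or not by_arm[1]:
        raise ValueError("estimation set must contain both arms")
    mean_treated = sum(by_arm[1]) / len(by_arm[1])
    mean_control = sum(by_arm[0]) / len(by_arm[0])
    return mean_treated - mean_control
```

A typical call sequence would be `discovery, estimation = split_sample(docs)`, then `word = discover_measure(discovery)`, then `estimate_effect(estimation, word)`. Because the estimation documents never influence which measure is chosen, the estimate is not tuned to noise in the data it is computed on.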