Article

Supervised Latent Dirichlet Allocation With Covariates: A Bayesian Structural and Measurement Model of Text and Covariates


Volume -, Issue -, Pages -


DOI: 10.1037/met0000541


text mining; supervised topic modeling; mixture modeling; Bayesian estimation; regression


This paper proposes a novel statistical model, SLDAX, which combines a latent variable measurement model of text with a structural regression model so that the topics in text data are estimated jointly with their use as predictors of an outcome. A simulation study and an empirical clinical application demonstrate the effectiveness of the SLDAX model in psychological research.
Text is a burgeoning data source for psychological researchers, but little methodological research has focused on adapting popular modeling approaches for text to the context of psychological research. One popular measurement model for text, topic modeling, uses a latent mixture model to represent topics underlying a body of documents. Recently, psychologists have studied relationships between these topics and other psychological measures by using estimates of the topics as regression predictors along with other manifest variables. While similar two-stage approaches involving estimated latent variables are known to yield biased estimates and incorrect standard errors, two-stage topic modeling approaches have received limited statistical study and, as we show, are subject to the same problems. To address these problems, we proposed a novel statistical model, supervised latent Dirichlet allocation with covariates (SLDAX), that jointly incorporates a latent variable measurement model of text and a structural regression model to allow the latent topics and other manifest variables to serve as predictors of an outcome. Using a simulation study with data characteristics consistent with psychological text data, we found that SLDAX estimates were generally more accurate and more efficient than those from the two-stage approach. To illustrate both methods, we provide an empirical clinical application that compares the two-stage and SLDAX approaches. Finally, we implemented the SLDAX model in an open-source R package to facilitate its use and further study.

Text data is an increasingly popular data source in psychological research that can be analyzed with a variety of models and algorithms. Topic models are popular measurement models that use latent variables to represent constructs underlying a set of documents (e.g., clinical interviews, open-ended survey responses, written or spoken educational assessments). Recent applications have used estimates of these topics as predictors of other variables in a regression model, but the statistical behavior of this approach has not been well studied. Similar approaches with other latent variable models are known to yield incorrect regression coefficient estimates and incorrect inferences. We showed that the use of topic estimates as regression predictors is also prone to these problems. As a solution, we proposed a model that jointly estimates the topic model and the regression model: supervised latent Dirichlet allocation with covariates (SLDAX). Using a simulation study under typical psychological text data conditions, we found that SLDAX estimates were generally more accurate and more precise than those from the two-stage approach. We illustrate the SLDAX and two-stage approaches in a clinical study of nonsuicidal self-injury and emotional dysregulation with participant interpersonal narratives. To allow researchers to apply the SLDAX model, we developed an open-source R software package.
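
As a rough illustration of what "jointly" means here, the sketch below simulates one document from a generic data-generating process of the kind the abstract describes: words arise from latent topics, and the outcome depends on the document's topic proportions plus a manifest covariate. It is a minimal sketch under stated assumptions, not the authors' specification or the R package's interface. The function names, the single covariate `x`, the normal outcome, and the omission of a separate intercept are all assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_document(topics, alpha, n_words, x, eta_topics, eta_x, sigma, rng):
    """Simulate one document's words and outcome (hypothetical helper).

    topics     : list of K word distributions, each of length V
    alpha      : Dirichlet prior over the K topic proportions
    x          : one manifest covariate value (assumed scalar for brevity)
    eta_topics : K regression coefficients for the topic proportions
    eta_x      : regression coefficient for the covariate
    sigma      : residual standard deviation of the outcome
    """
    n_topics = len(topics)
    vocab_size = len(topics[0])
    theta = sample_dirichlet(alpha, rng)  # latent topic proportions for this document
    counts = [0] * n_topics
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]        # topic of this word
        w = rng.choices(range(vocab_size), weights=topics[z])[0]  # word drawn from that topic
        counts[z] += 1
        words.append(w)
    zbar = [c / n_words for c in counts]  # empirical topic proportions
    # Assumption: no separate intercept, because zbar sums to 1 and an
    # intercept would be redundant with the K topic coefficients.
    mean = sum(e * p for e, p in zip(eta_topics, zbar)) + eta_x * x
    y = rng.gauss(mean, sigma)  # outcome depends on topics and covariate together
    return words, zbar, y


# Small demo with arbitrary, made-up settings (3 topics, 8-word vocabulary).
rng = random.Random(0)
demo_topics = [sample_dirichlet([0.5] * 8, rng) for _ in range(3)]
demo_words, demo_zbar, demo_y = simulate_document(
    demo_topics, [1.0, 1.0, 1.0], 50, x=0.2,
    eta_topics=[1.0, -1.0, 0.0], eta_x=0.5, sigma=0.3, rng=rng,
)
```

In a two-stage analysis, the topic proportions would first be estimated from `words` alone and then plugged into a regression for `y` as if they were observed. A joint model instead uses the words, the covariate, and the outcome together when estimating both the topics and the regression coefficients.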








