Article

A comparison of latent semantic analysis and correspondence analysis of document-term matrices

Journal

NATURAL LANGUAGE ENGINEERING

Publisher

CAMBRIDGE UNIV PRESS
DOI: 10.1017/S1351324923000244

Keywords

Text data mining; Text classification; Authorship attribution; Information retrieval; Statistical methods; Singular value decomposition


This article compares latent semantic analysis (LSA) and correspondence analysis (CA) for dimensionality reduction in the context of document-term matrices. The study shows that CA outperforms LSA by effectively eliminating margin effects and focusing on relationships among documents and terms. A unified framework is proposed, with CA and LSA as special cases. Empirical comparisons on text categorization and authorship attribution tasks demonstrate the superior performance of CA. Additionally, CA provides further evidence on the authorship of the Dutch national anthem Wilhelmus.
Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties compared to LSA. For instance, the effects of margins (the sums of row elements and column elements, which arise from differing document lengths and term frequencies) are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA-based methods on text categorization in English and on authorship attribution in historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that, among several contenders, it can be attributed to the author Datheen.
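
The margin effect described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes plain, unweighted LSA (a truncated SVD of the raw count matrix) and the standard textbook formulation of CA (an SVD of the standardized residuals of the correspondence matrix). The function names `lsa_doc_coords` and `ca_doc_coords` and the toy matrix `F` are hypothetical, chosen only for illustration. In the toy data, the second document is the first one doubled, so the two share the same term profile and differ only in length.

```python
import numpy as np


def lsa_doc_coords(F, k):
    """Document coordinates from a rank-k SVD of the raw counts (unweighted LSA)."""
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    return U[:, :k] * s[:k]


def ca_doc_coords(F, k):
    """Document (row) principal coordinates from correspondence analysis."""
    P = F / F.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row margins (relative document lengths)
    c = P.sum(axis=0)                    # column margins (relative term frequencies)
    E = np.outer(r, c)                   # expected proportions under independence
    S = (P - E) / np.sqrt(E)             # standardized residuals: margins removed
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]


# Toy document-term counts (assumed data): document 1 is document 0 doubled.
F = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0, 2.0],
])

lsa_docs = lsa_doc_coords(F, k=2)
ca_docs = ca_doc_coords(F, k=2)

# LSA separates the two documents by length alone.
print("LSA:", lsa_docs[0], lsa_docs[1])
# CA places documents with the same term profile at the same point.
print("CA: ", ca_docs[0], ca_docs[1])
```

In this sketch, the LSA coordinates of the doubled document are twice those of the original, whereas CA maps both to the same point because its row coordinates depend only on each document's term profile. The LSA-based methods compared in the article also include weighted variants, which this sketch does not cover.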
