☆ 4.6 Article

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

PLOS ONE (2022)

Journal

PLOS ONE

Volume 17, Issue 9, Pages -

Publisher

PUBLIC LIBRARY SCIENCE

DOI: 10.1371/journal.pone.0275472

Keywords

Funding

Japan Society for the Promotion of Science [19H05270, 20H04848, 20K12067]

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

Identifying differentially expressed genes is challenging due to limited sample size compared to the large number of genes. Statistical tests often have the problem of P-value dependence on sample size. This study explores the success of principal component analysis (PCA) and tensor decomposition (TD)-based unsupervised feature extraction (FE) by relating it to projection pursuit (PP). Additionally, empirical threshold-adjusted P-values are compared to assess the null hypothesis. The findings rationalize the success of PCA- and TD-based unsupervised FE for the first time.

Identifying differentially expressed genes is difficult because of the small number of available samples compared with the large number of genes. Conventional gene selection methods employing statistical tests have the critical problem of heavy dependence of P-values on sample size. Although the recently proposed principal component analysis (PCA) and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why they worked so well is unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP) that was proposed a long time ago to solve the problem of dimensions; we can relate the space spanned by singular value vectors with that spanned by the optimal cluster centroids obtained from K-means. Thus, the success of PCA- and TD-based unsupervised FE can be understood by this equivalence. In addition to this, empirical threshold adjusted P-values of 0.01 assuming the null hypothesis that singular value vectors attributed to genes obey the Gaussian distribution empirically corresponds to threshold-adjusted P-values of 0.1 when the null distribution is generated by gene order shuffling. For this purpose, we newly applied PP to the three data sets to which PCA and TD based unsupervised FE were previously applied; these data sets treated two topics, biomarker identification for kidney cancers (the first two) and the drug discovery for COVID-19 (the thrid one). Then we found the coincidence between PP and PCA or TD based unsupervised FE is pretty well. Shuffling procedures described above are also successfully applied to these three data sets. These findings thus rationalize the success of PCA- and TD-based unsupervised FE for the first time.

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Journal

PLOS ONE

Publisher

PUBLIC LIBRARY SCIENCE

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Journal

PLOS ONE

Publisher

PUBLIC LIBRARY SCIENCE

Keywords

Categories

Funding

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper