Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations

GENOME BIOLOGY (2020)

Journal

GENOME BIOLOGY

Volume 21, Issue 1

Publisher

BMC

DOI: 10.1186/s13059-020-02021-3

Keywords

Machine learning; Dimensionality reduction; Latent space; Gene expression; Autoencoders; Compression; Neural network interpretation

Funding

Gordon and Betty Moore Foundation [GBMF 4552]
National Institutes of Health's National Human Genome Research Institute [R01 HG010067]
National Institutes of Health's National Cancer Institute [R01 CA237170]
National Institutes of Health [T32 HG000046]
Alex's Lemonade Stand Foundation

Abstract

Background

Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses.

Results

We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensions. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals, such as pathway activity, are best identified in models trained with more latent dimensions.

Conclusions

There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
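The central idea of the abstract, fitting compression models at several latent dimensionalities and pooling the resulting features, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: non-negative matrix factorization (NMF) with multiplicative updates stands in for the compression algorithm (the abstract itself names denoising and variational autoencoders), the toy expression matrix is invented, and the function names `nmf` and `compress_across_dimensionalities` are hypothetical rather than taken from the paper.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative X (samples x genes) into W (samples x k) and
    H (k x genes) using multiplicative updates. Illustrative stand-in only."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update W: W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H


def compress_across_dimensionalities(X, ks, seed=0):
    """Fit one model per latent dimensionality k and concatenate the
    per-sample latent features column-wise."""
    combined = [[] for _ in X]
    labels = []
    for k in ks:
        W, _ = nmf(X, k, seed=seed)
        for row, w in zip(combined, W):
            row.extend(w)
        labels.extend(f"nmf_k{k}_z{j}" for j in range(k))
    return combined, labels


# Invented toy matrix: 6 samples (rows) x 5 genes (columns), non-negative.
X = [
    [5.0, 4.0, 0.0, 1.0, 0.5],
    [4.5, 5.0, 0.5, 0.0, 1.0],
    [0.0, 1.0, 6.0, 5.0, 0.5],
    [0.5, 0.0, 5.0, 6.0, 1.0],
    [2.0, 2.5, 2.0, 2.5, 4.0],
    [2.5, 2.0, 2.5, 2.0, 5.0],
]

features, labels = compress_across_dimensionalities(X, ks=[2, 3, 4])
print(len(features), len(features[0]))  # 6 samples, 2 + 3 + 4 = 9 features
```

Each column of `features` keeps a label recording which dimensionality it came from, so a downstream test (for example, association with a pathway gene set) can be traced back to the model that produced it. NMF was chosen for the sketch because its solutions generally change with `k`; a method with nested components, such as PCA, would contribute no new features when fit at a smaller `k`.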
