
Fast and accurate out-of-core PCA framework for large scale biobank data

GENOME RESEARCH (2023)

Journal

GENOME RESEARCH

Volume 33, Issue 9, Pages 1599-1608

Publisher

COLD SPRING HARBOR LAB PRESS, PUBLICATIONS DEPT

DOI: 10.1101/gr.277525.122

Automated Summary

Principal component analysis (PCA) is widely used for dimensionality reduction and uncovering latent structure in statistics, machine learning, and genomics. To address the challenges of ever-growing data, this paper proposes a novel algorithm called PCAone, which achieves fast and memory-efficient PCA and outperforms existing methods in comprehensive evaluations using multiple large-scale real-world datasets.

Abstract

Principal component analysis (PCA) is widely used in statistics, machine learning, and genomics for dimensionality reduction and uncovering low-dimensional latent structure. To address the challenges posed by ever-growing data size, fast and memory-efficient PCA methods have gained prominence. In this paper, we propose a novel randomized singular value decomposition (RSVD) algorithm implemented in PCAone, featuring a window-based optimization scheme that enables accelerated convergence while improving accuracy. Additionally, PCAone incorporates out-of-core and multithreaded implementations for the existing Implicitly Restarted Arnoldi Method (IRAM) and RSVD. Through comprehensive evaluations using multiple large-scale real-world data sets in different fields, we show the advantage of PCAone over existing methods. The new algorithm achieves significantly faster computation time while maintaining accuracy comparable to the slower IRAM method. Notably, our analyses of UK Biobank, comprising around 0.5 million individuals and 6.1 million common single nucleotide polymorphisms, show that PCAone accurately computes the top 40 principal components within 9 h. This analysis effectively captures population structure, signals of selection, structural variants, and low recombination regions, utilizing <20 GB of memory and 20 CPU threads. Furthermore, when applied to single-cell RNA sequencing data featuring 1.3 million cells, PCAone accurately captures the top 40 principal components in 49 min. This performance represents a 10-fold improvement over state-of-the-art tools.
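To make the RSVD idea in the abstract concrete, the following is a minimal sketch of a generic randomized SVD with power iterations that reads the data in row blocks, as an out-of-core implementation might. It is not PCAone's window-based algorithm, which the abstract does not specify in detail. The function names, parameters, and block size are hypothetical, NumPy is assumed, and an in-memory array that is already centered stands in for a file on disk.

```python
import numpy as np


def iter_row_blocks(A, block_size):
    """Yield (start_row, block) pairs; stands in for reading chunks from disk."""
    for start in range(0, A.shape[0], block_size):
        yield start, A[start:start + block_size]


def randomized_svd_blocks(A, k, oversample=10, n_power_iter=4,
                          block_size=256, seed=0):
    """Approximate the top-k SVD of a centered matrix A (samples x features).

    Generic randomized SVD with power iterations (illustrative only).
    Each pass touches A one row block at a time.
    """
    n, m = A.shape
    l = min(k + oversample, n, m)          # sketch width
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((m, l))    # random test matrix

    # Power iterations: omega <- orth(A^T A omega), one pass over A each.
    for _ in range(n_power_iter):
        H = np.zeros((m, l))
        for _, block in iter_row_blocks(A, block_size):
            H += block.T @ (block @ omega)
        omega, _ = np.linalg.qr(H)

    # Orthonormal basis Q for the approximate column space of A.
    G = np.empty((n, l))
    for start, block in iter_row_blocks(A, block_size):
        G[start:start + block.shape[0]] = block @ omega
    Q, _ = np.linalg.qr(G)

    # Project A onto that basis (B = Q^T A) and decompose the small matrix.
    B = np.zeros((l, m))
    for start, block in iter_row_blocks(A, block_size):
        B += Q[start:start + block.shape[0]].T @ block
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)

    U = Q @ U_small
    return U[:, :k], S[:k], Vt[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 60))
    X -= X.mean(axis=0)                    # PCA assumes centered features
    U, S, Vt = randomized_svd_blocks(X, k=4, block_size=128)
    print("approximate:", S)
    print("exact:      ", np.linalg.svd(X, compute_uv=False)[:4])
```

In this sketch only the small matrices of width `k + oversample` are held alongside a single row block, which is the property that keeps memory use low when the full matrix lives on disk.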
