Article

scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data

Journal

NATURE MACHINE INTELLIGENCE
Volume 4, Issue 10, Pages 852+

Publisher

NATURE PORTFOLIO
DOI: 10.1038/s42256-022-00534-z


Funding

  1. National Key R&D Program of China [2018YFC0910500]
  2. SJTU-Yale Collaborative Research Seed Fund
  3. Neil Shen's SJTU Medical Research
  4. Key-Area Research and Development Program of Guangdong Province [2021B0101420005]

Abstract

Annotating cell types on the basis of single-cell RNA-seq data is a prerequisite for research on disease progress and tumour microenvironments. Here we show that existing annotation methods typically suffer from a lack of curated marker gene lists, improper handling of batch effects and difficulty in leveraging the latent gene-gene interaction information, impairing their generalization and robustness. We developed a pretrained deep neural network-based model, single-cell bidirectional encoder representations from transformers (scBERT), to overcome the challenges. Following BERT's approach to pretraining and fine-tuning, scBERT attains a general understanding of gene-gene interactions by being pretrained on huge amounts of unlabelled scRNA-seq data; it is then transferred to the cell type annotation task of unseen and user-specific scRNA-seq data for supervised fine-tuning. Extensive and rigorous benchmark studies validated the superior performance of scBERT on cell type annotation, novel cell type discovery, robustness to batch effects and model interpretability.

Cell type annotation is a core task for single-cell RNA sequencing, but current bioinformatic tools struggle with some of the underlying challenges, including high dimensionality, data sparsity, batch effects and a lack of labels. In a self-supervised approach, a transformer model called scBERT is pretrained on millions of unlabelled public single-cell RNA-seq profiles and then fine-tuned with a small number of labelled samples for cell annotation tasks.
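The self-supervised recipe the abstract describes — treating each cell's genes as tokens, discretizing their expression values, masking a fraction of them and training the model to reconstruct the masked tokens — can be sketched as below. This is a minimal illustration, not the authors' implementation: the bin edges, masking fraction and function names are assumptions for demonstration (the actual scBERT pairs its expression binning with a Performer-based encoder).

```python
import numpy as np

def bin_expression(expr, n_bins=5):
    """Discretize per-cell expression values into integer token bins.
    Bin 0 is reserved for zero (dropout) entries; the binning scheme
    here is illustrative, not scBERT's exact embedding."""
    tokens = np.zeros_like(expr, dtype=int)
    nonzero = expr > 0
    if nonzero.any():
        edges = np.linspace(expr[nonzero].min(), expr[nonzero].max(), n_bins)
        tokens[nonzero] = np.digitize(expr[nonzero], edges)
    return tokens

def mask_tokens(tokens, mask_frac=0.15, mask_id=-1, rng=None):
    """Randomly mask a fraction of gene tokens, BERT-style, returning the
    masked sequence and the indices the model must reconstruct."""
    rng = rng or np.random.default_rng(0)
    n_mask = max(1, int(mask_frac * tokens.size))
    idx = rng.choice(tokens.size, size=n_mask, replace=False)
    masked = tokens.copy()
    masked[idx] = mask_id
    return masked, idx

# One toy cell: expression values for eight (hypothetical) genes.
expr = np.array([0.0, 0.2, 1.5, 3.1, 0.0, 2.2, 0.7, 4.0])
tokens = bin_expression(expr)          # gene tokens, 0 = not expressed
masked, idx = mask_tokens(tokens)      # pretraining input and target positions
```

During pretraining the encoder would predict `tokens[idx]` from `masked`; fine-tuning then replaces that reconstruction head with a cell-type classifier trained on the small labelled set.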
