Article

Matrix prior for data transfer between single cell data types in latent Dirichlet allocation

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 19, Issue 5

Publisher

PUBLIC LIBRARY OF SCIENCE
DOI: 10.1371/journal.pcbi.1011049

Single cell ATAC-seq (scATAC-seq) enables the mapping of regulatory elements in fine-grained cell types. In this study, the authors propose using latent Dirichlet allocation (LDA) with nonuniform matrix priors to improve the analysis of scATAC-seq data. They demonstrate the effectiveness of this method in capturing cell type information from small scATAC-seq datasets from C. elegans nematodes and mouse skin cells.
Single cell ATAC-seq (scATAC-seq) enables the mapping of regulatory elements in fine-grained cell types. Despite this advance, analysis of the resulting data is challenging, and large scale scATAC-seq data are difficult to obtain and expensive to generate. This motivates a method to leverage information from previously generated large scale scATAC-seq or scRNA-seq data to guide our analysis of new scATAC-seq datasets. We analyze scATAC-seq data using latent Dirichlet allocation (LDA), a Bayesian algorithm that was developed to model text corpora, summarizing documents as mixtures of topics defined based on the words that distinguish the documents. When applied to scATAC-seq, LDA treats cells as documents and their accessible sites as words, identifying topics based on the cell type-specific accessible sites in those cells. Previous work used uniform symmetric priors in LDA, but we hypothesized that nonuniform matrix priors generated from LDA models trained on existing data sets may enable improved detection of cell types in new data sets, especially if they have relatively few cells. In this work, we test this hypothesis in scATAC-seq data from whole C. elegans nematodes and SHARE-seq data from mouse skin cells. We show that nonsymmetric matrix priors for LDA improve our ability to capture cell type information from small scATAC-seq datasets.

Author summary

Identifying cell types based on genomics information is an important task but can present challenges because genomics information can be high-dimensional and contain many zeros. Previous work has used latent Dirichlet allocation (LDA), a method that automatically identifies topics within a dataset, and has used these topics to better understand the cell types within a population. LDA has been applied to single cell ATAC-seq datasets, which provide information about open chromatin regions within individual cells.
We focus on improving the LDA framework by enabling the incorporation of auxiliary forms of information. In particular, we present a method that uses data from large reference populations of cells to aid in the formation of topics for a smaller, target population of cells. We demonstrate first, through simulation, that our method can recover topics when the data follows the assumptions of our model. We then use a dataset of mouse skin cells and another with C. elegans cells to demonstrate that in a real data setting, our method improves the quality of topics recovered from the genomics data.
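
The idea of a matrix prior can be illustrated with a small sketch. The code below is not the authors' implementation; it is a minimal collapsed Gibbs sampler for LDA in plain Python, written under the assumption that the prior on each topic's site distribution is a topic-by-site matrix `eta[k][v]` rather than a single scalar. The helper `matrix_prior_from_reference`, its `scale` and `base` parameters, and the toy counts are all hypothetical and chosen only for illustration; how the paper actually builds its prior from a reference model is not stated in the text above.

```python
import random


def matrix_prior_from_reference(ref_topic_site_counts, scale=50.0, base=0.01):
    """Turn reference topic-by-site counts into a K x V prior matrix.

    Hypothetical construction: normalize each topic row, multiply by `scale`,
    and add a small `base` so every entry stays strictly positive.
    """
    prior = []
    for row in ref_topic_site_counts:
        total = sum(row) or 1.0
        prior.append([base + scale * count / total for count in row])
    return prior


def gibbs_lda(cells, n_topics, n_sites, eta, alpha=0.1, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA with a per-topic, per-site prior.

    cells: list of cells, each a list of accessible-site indices (the "words").
    eta:   K x V matrix; eta[k][v] is the prior weight of site v in topic k.
    """
    rng = random.Random(seed)
    eta_sum = [sum(row) for row in eta]
    cell_topic = [[0] * n_topics for _ in cells]
    topic_site = [[0] * n_sites for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every (cell, site) token.
    for d, cell in enumerate(cells):
        z_d = []
        for v in cell:
            k = rng.randrange(n_topics)
            z_d.append(k)
            cell_topic[d][k] += 1
            topic_site[k][v] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, cell in enumerate(cells):
            for i, v in enumerate(cell):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                cell_topic[d][k] -= 1
                topic_site[k][v] -= 1
                topic_total[k] -= 1
                # Conditional weights; the matrix prior enters via eta[t][v].
                weights = [
                    (cell_topic[d][t] + alpha)
                    * (topic_site[t][v] + eta[t][v])
                    / (topic_total[t] + eta_sum[t])
                    for t in range(n_topics)
                ]
                # Draw a new topic in proportion to the weights.
                r = rng.random() * sum(weights)
                k = 0
                while k < n_topics - 1 and r >= weights[k]:
                    r -= weights[k]
                    k += 1
                assign[d][i] = k
                cell_topic[d][k] += 1
                topic_site[k][v] += 1
                topic_total[k] += 1
    return cell_topic, topic_site


# Toy example: two reference topics over six sites, four small target cells.
reference_counts = [[30, 25, 20, 0, 0, 0], [0, 0, 0, 22, 28, 35]]
eta = matrix_prior_from_reference(reference_counts)
target_cells = [[0, 1, 2, 1], [3, 4, 5, 4], [0, 2, 1], [5, 3, 4]]
cell_topic, topic_site = gibbs_lda(target_cells, n_topics=2, n_sites=6, eta=eta)
```

With a uniform symmetric prior, every entry of `eta` would be the same constant, so the only information about which sites belong together would come from the target cells themselves. In this sketch the reference counts tilt each topic toward the sites it favored in the reference data, which is the kind of guidance that could matter most when the target dataset has few cells.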
