Article

Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering

Journal

FRONTIERS IN GENETICS
Volume 13

Publisher

FRONTIERS MEDIA SA
DOI: 10.3389/fgene.2022.954024

Keywords

single cell; sampling; k-medoids; R; antibody candidate selection

Funding

  1. National Institute of Allergy and Infectious Diseases, National Institutes of Health [2P01AI097092-06A1, U19AI109946, U19AI057266]
  2. NIAID Centers of Excellence for Influenza Research and Surveillance (CEIRS) [HHSN272201400005C]
  3. Bill and Melinda Gates Foundation [OPP1084518]
  4. Biological Sciences Division at the University of Chicago
  5. Institute for Translational Medicine/Clinical and Translational Award [NIH5UL1TR002389-02]
  6. University of Chicago Comprehensive Cancer Center Support Grant [NIH P30CA014599]


The Cookie toolkit efficiently selects representative samples from massive single-cell populations by quantifying relationships/similarities among samples and clustering them, showing higher efficacy and flexibility compared to conventional sampling methods.
Rapid growth of single-cell sequencing techniques enables researchers to investigate up to millions of cells with diverse properties in a single experiment. This scale also makes it challenging to select representative samples from massive single-cell populations for further experimental characterization, which requires robust and compact sampling that balances diverse properties of different priority levels. Conventional sampling methods fail to generate representative and generalizable subsets from a massive single-cell population or from more complicated ensembles. Here, we present a toolkit called Cookie, which efficiently selects the most representative samples from a massive single-cell population with diverse properties. The method vectorizes all given properties and quantifies the relationships/similarities among samples using their Manhattan distances. It then determines an appropriate sample size by evaluating the coverage of key properties across multiple candidate sizes, followed by k-medoids clustering, which groups the samples into clusters and selects the center of each cluster as its representative. Comparison of Cookie with conventional sampling methods using a single-cell atlas dataset, epidemiology surveillance data, and a simulated dataset shows the high efficacy, efficiency, and flexibility of Cookie. The Cookie toolkit is implemented in R and is freely available.
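
The steps named in the abstract (vectorize properties, compute Manhattan distances, compare candidate sample sizes by coverage, run k-medoids, keep the medoids) can be illustrated with a minimal Python sketch. This is not the Cookie R implementation or its API. The function names, the one-hot encoding, the farthest-first initialisation, the "smallest size that reaches the target coverage" rule, and the toy records are all assumptions made for illustration, and property priority levels are not modelled.

```python
def one_hot(records, keys):
    """Vectorize categorical properties: one 0/1 column per (property, value) pair."""
    columns = sorted({(k, r[k]) for r in records for k in keys})
    return [[1 if r[k] == v else 0 for k, v in columns] for r in records]


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medoids(dist, k, max_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix (needs k <= n)."""
    n = len(dist)
    # Assumed initialisation: the most central sample, then farthest-first picks.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(max(rest, key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c] or [medoids[c]]
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def coverage(records, selected, key):
    """Fraction of the distinct values of `key` present among the selected samples."""
    all_values = {r[key] for r in records}
    seen = {records[i][key] for i in selected}
    return len(seen) / len(all_values)


def select_representatives(records, keys, key_property, candidate_sizes, target=1.0):
    """Try candidate sizes in ascending order; stop at the first one that reaches
    the target coverage (assumed rule). Falls back to the largest candidate."""
    vectors = one_hot(records, keys)
    dist = [[manhattan(a, b) for b in vectors] for a in vectors]
    for k in sorted(candidate_sizes):
        medoids = k_medoids(dist, k)
        if coverage(records, medoids, key_property) >= target:
            break
    return k, medoids


# Hypothetical toy data: six cells described by two categorical properties.
records = [
    {"cell_type": "B", "isotype": "IgG"},
    {"cell_type": "B", "isotype": "IgG"},
    {"cell_type": "B", "isotype": "IgA"},
    {"cell_type": "T", "isotype": "none"},
    {"cell_type": "T", "isotype": "none"},
    {"cell_type": "NK", "isotype": "none"},
]
size, chosen = select_representatives(
    records, ["cell_type", "isotype"], "cell_type", candidate_sizes=[2, 3, 4]
)
print(size, [records[i] for i in chosen])
```

Because the medoids are actual samples rather than averaged centroids, every selected representative is a real cell that can be carried forward for experimental characterization. In this sketch the coverage check plays the role of choosing the sample size; a real workflow would likely weigh several key properties rather than one.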


