4.7 Article

Cancer classification based on multiple dimensions: SNV patterns

Journal

COMPUTERS IN BIOLOGY AND MEDICINE
Volume 151, Issue -, Pages -

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.compbiomed.2022.106270

Keywords

Multidimensional SNV feature; Cancer classification; KNN; Oncogene

Funding

  1. National Natural Science Foundation of China [62072353]

This study proposes a method to classify cancer based on multidimensional SNV features. Features extracted from the SNVs in cancer samples show similar distribution patterns around the cluster center of each cancer type. Classification with the KNN algorithm reaches approximately 97% accuracy, and the method shows potential for oncogene discovery. Validated oncogenes located in the identified features have the highest importance among the 8 cancers.
Background: The occurrence of cancer is closely related to single nucleotide variants (SNVs). However, in DNA samples collected from patients with distinct cancers, SNVs are detected in different patterns. It is therefore important to select a classification method that exploits SNV patterns as fully as possible, which will aid in cancer diagnosis and treatment. In traditional studies, researchers combined each SNV with its neighboring nucleotides to form a trinucleotide. Mutation signatures for cancer classification were extracted from the patterns of the trinucleotides, but extracting SNV features in a single dimension may result in partial information loss and poor model performance.

Results: In this study, we defined multidimensional SNV (M-SNV) features to classify cancer. M-SNV features considered first- and second-order neighboring nucleotides of one-dimensional SNVs and included six types of features. We validated the feasibility of M-SNV features using a dataset obtained from The Cancer Genome Atlas (TCGA) consisting of 2761 samples from 12 cancers. We performed preliminary screening of 562,321 DNA mutation sites in these samples. The remaining mutation sites were characterized by cancer type in six signatures. We found that the extracted features showed a similar distribution in the cluster center of the cancer type of the samples. After the preprocessing of raw data, samples were more focused on the cancer subtype distributions at the SNV level. We used KNN (k-nearest neighbors) to classify the extracted features and employed leave-one-out cross-validation to verify the results. The classification accuracy is stable at approximately 97% and reaches 97.43% in the best case. Furthermore, we found that the validated oncogenes in the loci of the features had the highest importance among the 8 cancers.

Conclusions: It is feasible to classify cancers by the distribution of the features we defined. Moreover, our methodology has potential implications for the discovery of oncogenes.
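
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It counts SNVs together with their first-order (trinucleotide) or second-order (pentanucleotide) flanking bases, then scores a plain k-nearest-neighbors classifier with leave-one-out cross-validation. The abstract does not define the six M-SNV feature types, so the context-count feature here is only an assumed stand-in, and every function name (`context_counts`, `to_vector`, `knn_predict`, `leave_one_out_accuracy`) is hypothetical.

```python
import math
from collections import Counter


def context_counts(sequence, snvs, order=1):
    """Count SNV patterns with `order` flanking bases on each side.

    `snvs` is a list of (position, alt_base) pairs on `sequence`.
    order=1 yields trinucleotide contexts, order=2 pentanucleotide ones.
    This is an illustrative feature, not the paper's M-SNV definition.
    """
    counts = Counter()
    for pos, alt in snvs:
        if pos - order < 0 or pos + order >= len(sequence):
            continue  # skip sites without a full flanking context
        ref = sequence[pos]
        if alt == ref:
            continue  # not a variant
        left = sequence[pos - order:pos]
        right = sequence[pos + 1:pos + 1 + order]
        counts[f"{left}[{ref}>{alt}]{right}"] += 1
    return counts


def to_vector(counts, keys):
    """Turn pattern counts into a frequency vector over a fixed key order."""
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in keys]


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(X, y, k=3):
    """Hold out each sample once, predict it from the rest, return accuracy."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k) == y[i]
    return hits / len(X)


if __name__ == "__main__":
    # Toy feature vectors for two made-up classes; not real TCGA data.
    X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
    y = ["A", "A", "A", "B", "B", "B"]
    print(leave_one_out_accuracy(X, y, k=3))
    print(context_counts("ACGTAC", [(2, "T")], order=2))
```

In practice a library implementation such as scikit-learn's `KNeighborsClassifier` combined with `LeaveOneOut` would likely replace the hand-written loop; the pure-Python version is used here only to keep the sketch dependency-free.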
