☆ 4.6 Article

Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors

BMC BIOINFORMATICS (2022)

Journal

BMC BIOINFORMATICS

Volume 23, Issue 1, Pages -

Publisher

BMC

DOI: 10.1186/s12859-022-04764-1

Keywords

Semi-supervised learning; Neural network; Artificial intelligence; Supervised classifiers; DNA-methylation; Central nervous system tumor; Machine learning; Random forest

Ask authors/readers for more resources

Protocol

Community support

Reagent

Community support

Automated Summary New
Abstract

This paper explores the application of semi-supervised machine learning on methylation data to improve the accuracy of supervised learning models in classifying CNS tumors. The results show that pseudo-labels derived from the SSL method can significantly increase prediction accuracy for CNS tumors and normal controls.

Background Precision medicine for cancer treatment relies on an accurate pathological diagnosis. The number of known tumor classes has increased rapidly, and reliance on traditional methods of histopathologic classification alone has become unfeasible. To help reduce variability, validation costs, and standardize the histopathological diagnostic process, supervised machine learning models using DNA-methylation data have been developed for tumor classification. These methods require large labeled training data sets to obtain clinically acceptable classification accuracy. While there is abundant unlabeled epigenetic data across multiple databases, labeling pathology data for machine learning models is time-consuming and resource-intensive, especially for rare tumor types. Semi-supervised learning (SSL) approaches have been used to maximize the utility of labeled and unlabeled data for classification tasks and are effectively applied in genomics. SSL methods have not yet been explored with epigenetic data nor demonstrated beneficial to central nervous system (CNS) tumor classification. Results This paper explores the application of semi-supervised machine learning on methylation data to improve the accuracy of supervised learning models in classifying CNS tumors. We comprehensively evaluated 11 SSL methods and developed a novel combination approach that included a self-training with editing using support vector machine (SETRED-SVM) model and an L2-penalized, multinomial logistic regression model to obtain high confidence labels from a few labeled instances. Results across eight random forest and neural net models show that the pseudo-labels derived from our SSL method can significantly increase prediction accuracy for 82 CNS tumors and 9 normal controls. Conclusions The proposed combination of semi-supervised technique and multinomial logistic regression holds the potential to leverage the abundant publicly available unlabeled methylation data effectively. Such an approach is highly beneficial in providing additional training examples, especially for scarce tumor types, to boost the prediction accuracy of supervised models.

Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors

Journal

BMC BIOINFORMATICS

Publisher

BMC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Comprehensive study of semi-supervised learning for DNA methylation-based supervised classification of central nervous system tumors

Journal

BMC BIOINFORMATICS

Publisher

BMC

Keywords

Categories

Ask authors/readers for more resources

Protocol

Reagent

Authors

I am an author on this paper

Reviews

Primary Rating

Secondary Ratings

Novelty

Significance

Scientific rigor

Rate this paper

Recommended

Export Citation

Share Paper