4.6 Article

Heterogeneous Gene Expression Cross-Evaluation of Robust Biomarkers Using Machine Learning Techniques Applied to Lung Cancer

Journal

CURRENT BIOINFORMATICS
Volume 17, Issue 2, Pages 150-163

Publisher

BENTHAM SCIENCE PUBL LTD
DOI: 10.2174/1574893616666211005114934

Keywords

Lung cancer; microarray; RNA-Seq; gene expression; machine learning; feature selection; CDSS

Funding

  1. Spanish Ministry of Sciences, Innovation and Universities [RTI2018-101674-B-I00]
  2. Government of Andalusia [P12-TIC-2082, CV20-64934]

Ask authors/readers for more resources

This study describes and implements a machine learning methodology to evaluate the compatibility between RNA-Seq and Microarray sequencing technologies, and its application in identifying and classifying lung cancer subtypes. The results show that predictive models built using different technologies can accurately classify samples, and strong gene signatures related to lung cancer have been discovered.
Background: Nowadays, gene expression analysis is one of the most promising pillars for understanding and uncovering the mechanisms underlying the development and spread of cancer. In this sense, Next Generation Sequencing technologies, such as RNA-Seq, are currently leading the market due to their precision and cost. Nevertheless, there is still an enormous amount of non-analyzed data obtained from older technologies, such as Microarray, which could still be useful to extract relevant knowledge. Methods: Throughout this research, a complete machine learning methodology to cross-evaluate the compatibility between both RNA-Seq and Microarray sequencing technologies is described and implemented. In order to show a real application of the designed pipeline, a lung cancer case study is addressed by considering two detected subtypes: adenocarcinoma and squamous cell carcinoma. Transcriptomic datasets considered for our study have been obtained from the public repositories NCBI/GEO, ArrayExpress and GDC-Portal. From them, several gene experiments have been carried out with the aim of finding gene signatures for these lung cancer subtypes, linked to both transcriptomic technologies. With these DEGs selected, intelligent predictive models capable of classifying new samples belonging to these cancer subtypes have been developed. Results: The predictive models built using one technology are capable of discerning samples from a different technology. The classification results are evaluated in terms of accuracy, F1-score and ROC curves along with AUC. Finally, the biological information of the gene sets obtained and their relationship with lung cancer are reviewed, encountering strong biological evidence linking them to the disease. Conclusion: Our method has the capability of finding strong gene signatures which are also independent of the transcriptomic technology used to develop the analysis. In addition, our article highlights the potential of using heterogeneous transcriptomic data to increase the amount of samples for the studies, increasing the statistical significance of the results.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.6
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available