Article

i6mA-stack: A stacking ensemble-based computational prediction of DNA N6-methyladenine (6mA) sites in the Rosaceae genome

Journal

GENOMICS
Volume 113, Issue 1, Pages 582-592

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.ygeno.2020.09.054

Keywords

Sequence analysis; DNA N6-methyladenine; Machine learning; RFECV; Stacking

Funding

  1. National Research Foundation of Korea (NRF) - Korea government (MSIT) [2020R1A2C2005612]
  2. Brain Research Program of the National Research Foundation (NRF) - Korean government (MSIT) [NRF-2017M3C7A1044816]
  3. Basic Science Research Program through the National Research Foundation of Korea (NRF) - Ministry of Education [2019R1A6A3A01094685]
  4. National Research Foundation of Korea [2019R1A6A3A01094685, 2020R1A2C2005612] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)

This study proposes a machine learning method for identifying DNA N6-methyladenine (6mA) sites in Rosa chinensis and Fragaria vesca. Recursive feature elimination with cross-validation is used to extract an optimal feature subset from five DNA sequence encoding schemes, and a two-layer machine learning-based stacking model is trained on that subset to create a bioinformatics tool named 'i6mA-stack'.
DNA N6-methyladenine (6mA) is an epigenetic modification that plays a vital role in a variety of cellular processes in both eukaryotes and prokaryotes. Accurate information on 6mA sites in the Rosaceae genome may help in understanding genomic 6mA distributions and biological functions such as epigenetic inheritance. Various studies have shown that 6mA sites can be identified experimentally, but the procedures are time-consuming and costly. To overcome these drawbacks, we propose an accurate computational approach based on machine learning (ML) to identify 6mA sites in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca). To improve the performance of the proposed model and to avoid overfitting, a recursive feature elimination with cross-validation (RFECV) strategy is used to extract the optimal number of features (ONF) subset from five DNA sequence encoding schemes: Binary Encoding (BE), Ring-Function-Hydrogen-Chemical Properties (RFHC), Electron-Ion-Interaction Pseudo Potentials of Nucleotides (EIIP), Dinucleotide Physicochemical Properties (DPCP), and Trinucleotide Physicochemical Properties (TPCP). Subsequently, we use the ONF subset to train a two-layer ML-based stacking model to create a bioinformatics tool named 'i6mA-stack'.
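
The workflow described in the abstract (encode sequences, select features with RFECV, train a two-layer stacking model) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. The sequences are synthetic toy data with a planted signal, the 41-nt window length is an assumption, only two of the five encodings (BE and EIIP) are shown, the bit order of the binary encoding is an assumption, and the base learners, meta-learner, and all hyperparameters are placeholders chosen for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Binary (one-hot) encoding; the bit order per nucleotide is an assumption.
BINARY = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
# Commonly cited EIIP values for the four nucleotides.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def encode(seq):
    """Concatenate the binary and EIIP encodings of one DNA sequence."""
    binary = [bit for base in seq for bit in BINARY[base]]
    eiip = [EIIP[base] for base in seq]
    return np.array(binary + eiip, dtype=float)


def make_toy_data(n=120, length=41, seed=0):
    """Synthetic sequences centred on 'A'; positives carry a planted 'G' right after it."""
    rng = np.random.default_rng(seed)
    mid = length // 2
    seqs, labels = [], []
    for i in range(n):
        bases = [str(b) for b in rng.choice(list("ACGT"), size=length)]
        bases[mid] = "A"
        label = i % 2
        bases[mid + 1] = "G" if label == 1 else str(rng.choice(list("ACT")))
        seqs.append("".join(bases))
        labels.append(label)
    return seqs, np.array(labels)


seqs, y = make_toy_data()
X = np.vstack([encode(s) for s in seqs])  # 41 * 4 binary + 41 EIIP = 205 features

# Feature selection: RFECV drops the weakest features in steps of 5 and keeps
# the subset size with the best cross-validated accuracy.
selector = RFECV(LogisticRegression(max_iter=1000), step=5, cv=5, scoring="accuracy")

# Two-layer stacking: placeholder base learners feed a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

model = Pipeline([("select", selector), ("stack", stack)])
model.fit(X, y)
preds = model.predict(X)
n_selected = model.named_steps["select"].n_features_
print("features kept:", n_selected, "of", X.shape[1])
```

Placing RFECV and the stacking classifier in one `Pipeline` keeps the selected feature subset tied to the model that consumes it. For an unbiased performance estimate, the whole pipeline would need to be evaluated on held-out data rather than on the training sequences used here.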