4.6 Article

Comparison of machine learning and deep learning techniques in promoter prediction across diverse species

Journal

PEERJ COMPUTER SCIENCE

Publisher

PEERJ INC
DOI: 10.7717/peerj-cs.365

Keywords

Promoter Prediction; Deep Learning; Machine Learning; CNN; LSTM; Random Forest; One-hot Encoding; Frequency-based Tokenization

Funding

  1. Scheme for Promotion of Academic and Research Collaboration (SPARC) 2018-19, MHRD [P104]
  2. Symbiosis International Deemed University, India

Gene promoters, key DNA elements for regulating gene transcription, are hard to predict because many of them lack obvious sequence features, which has prompted the use of machine learning and deep learning models. In this study, frequency-based tokenization proved effective for data pre-processing: it shortened the input to a 1-D CNN and reduced training time without affecting sensitivity or specificity. The CNN outperformed the other models both in distinguishing promoter sequences from non-promoters and in species-specific classification.
Gene promoters are the key DNA regulatory elements positioned around transcription start sites and are responsible for regulating gene transcription. Various alignment-based, signal-based and content-based approaches have been reported for promoter prediction. However, since not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Therefore, many machine learning and deep learning models have been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct eukaryotes, viz. yeast (Saccharomyces cerevisiae), a plant (Arabidopsis thaliana) and human (Homo sapiens). We compared one-hot vector encoding with frequency-based tokenization (FBT) for data pre-processing on a 1-D Convolutional Neural Network (CNN) model. We found that FBT gives a shorter input dimension, reducing the training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly a CNN and a recurrent neural network with Long Short-Term Memory (LSTM), along with a random forest (RF) classifier, for promoter classification at k-mer sizes of 2, 4 and 8. We found the CNN to be superior both in distinguishing promoters from non-promoter sequences (binary classification) and in species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
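
The pre-processing ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that FBT means splitting each sequence into k-mers and giving every k-mer an integer index ranked by its frequency in the corpus, as common text tokenizers do, and that a shuffled negative is a base-level shuffle of a promoter sequence. The function names, the `stride` parameter and the toy sequences are all hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"

def kmers(seq, k, stride=1):
    """Split a DNA sequence into k-mers; stride is an assumed parameter
    (stride < k gives overlapping k-mers, stride == k gives disjoint ones)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def one_hot(seq):
    """Encode each base as a 4-element indicator vector ordered A, C, G, T."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def build_fbt_vocab(sequences, k, stride=1):
    """Rank k-mers by corpus frequency: the most frequent k-mer gets index 1.
    Ties are broken alphabetically; index 0 is reserved for unseen k-mers."""
    counts = Counter(km for s in sequences for km in kmers(s, k, stride))
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: i for i, km in enumerate(ranked, start=1)}

def fbt_encode(seq, vocab, k, stride=1):
    """Map a sequence to a list of integer k-mer indices."""
    return [vocab.get(km, 0) for km in kmers(seq, k, stride)]

def shuffled_negative(seq, rng):
    """Assumed negative-set construction: same base composition, random order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

# Toy data, not taken from the paper.
promoters = ["TATAAAGGCGCGTATA", "GGCGTATAAAGCGCTA"]
vocab = build_fbt_vocab(promoters, k=4, stride=4)
seq = promoters[0]

print(len(one_hot(seq)) * len(BASES))               # 64 values per sequence
print(len(fbt_encode(seq, vocab, k=4, stride=4)))   # 4 values per sequence
print(shuffled_negative(seq, random.Random(0)))     # one synthetic negative
```

For a 16-base toy sequence, one-hot encoding yields 64 input values while non-overlapping 4-mer tokens yield 4, which is the kind of input-size reduction the abstract credits for the shorter training time.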
