4.7 Article

Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation

Journal

COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL
Volume 19, Pages 1612-1619

Publisher

ELSEVIER
DOI: 10.1016/j.csbj.2021.03.015

Keywords

Sequence analysis; DNA N4-methylcytosine (4mC); Word embedding; Convolutional Neural Network; Web-server

Funding

  1. National Research Foundation of Korea (NRF) - Korea government (MSIT) [2020R1A2C2005612]
  2. Brain Research Program of the National Research Foundation (NRF) - Korean government (MSIT) [NRF-2017M3C7A1044816]

DNA 4mC is a key epigenetic modification involved in biological functions across different species. The computational method 4mC-w2vec enhances feature selection and performance in identifying relevant sites, surpassing current tools in genomic datasets.
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research would therefore benefit from a computational approach that relies on automatic feature selection to identify relevant sites. Here we report 4mC-w2vec, a computational method that learns feature discrimination automatically in the Rosaceae genomes, specifically in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation obtained through the word embedding technique 'word2vec'. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processes 4mC and non-4mC sites through a word embedding step that includes sub-word information for biological words defined by k-mers. The resulting embeddings serve as features that are fed into a two-layer convolutional neural network (CNN), which classifies whether a sample sequence contains a 4mC site or a non-4mC site. Our tool demonstrated performance superior to that of current tools on the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/. © 2021 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.
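
The abstract describes treating k-mers as "biological words" before embedding them. As a rough illustration of that first step only, the sketch below splits a DNA sequence into overlapping k-mers and maps them to integer ids that an embedding layer could consume. The function names, the choice of k = 3, the stride of 1, and the example sequence are all assumptions made for illustration; the abstract does not state the settings used in 4mC-w2vec, and the word2vec training and CNN stages are not shown.

```python
def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (stride 1, assumed)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(sequences, k=3):
    """Assign an integer id to every distinct k-mer, in alphabetical order."""
    tokens = {tok for s in sequences for tok in kmer_tokens(s, k)}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def encode(seq, vocab, k=3):
    """Turn a sequence into k-mer ids, skipping k-mers not in the vocabulary."""
    return [vocab[t] for t in kmer_tokens(seq, k) if t in vocab]


# Hypothetical toy sequence, not taken from the paper's datasets.
example = "ACGTAC"
vocab = build_vocab([example])
ids = encode(example, vocab)
```

In a full pipeline, id lists like `ids` (or the k-mer strings themselves) would be passed to an embedding model, and the resulting vectors would form the input matrix for the convolutional classifier.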
