Article

G4mismatch: Deep neural networks to predict G-quadruplex propensity based on G4-seq data

Journal

PLOS COMPUTATIONAL BIOLOGY
Volume 19, Issue 3

Publisher

PUBLIC LIBRARY OF SCIENCE
DOI: 10.1371/journal.pcbi.1010948

G4mismatch, a novel algorithm, accurately and efficiently predicts G-quadruplex propensity for any genomic sequence. Based on a convolutional neural network trained on almost 400 million human genomic loci, G4mismatch achieves high accuracy in predicting G-quadruplex formation and outperforms other methods.
Author summary

G-quadruplexes (G4s) are non-canonical secondary structures that have been extensively studied and found to be associated with numerous diseases. The G4-seq experiment provided valuable data, mapping G4s across the genomes of 12 different species and reporting the potential of a DNA region to form a G4 as a mismatch score. Previous methods for predicting G4s either treated G4 folding as a binary classification problem or focused on putative quadruplexes rather than predicting the raw genome-wide scores generated by the G4-seq experiment.

Our new approach, G4mismatch, is the first to use the millions of G4 mismatch scores measured by the G4-seq experiment to build a highly accurate simulator of a G4-seq experiment, one that can predict the mismatch score of any given DNA sequence and thereby reveal its potential to form a G4. In addition, our work uses data from all 12 species to demonstrate the ability of a model trained on one species to predict on other genomes, and to explore the properties that give some models an advantage over others. Moreover, we show how the model learned known and novel molecular principles underlying G4 folding.

Abstract

G-quadruplexes are non-B-DNA structures that form in the genome, facilitated by Hoogsteen bonds between guanines in single or multiple strands of DNA. The functions of G-quadruplexes are linked to various molecular and disease phenotypes, and thus researchers are interested in measuring G-quadruplex formation genome-wide. Experimentally measuring G-quadruplexes is a long and laborious process. Computational prediction of G-quadruplex propensity from a given DNA sequence is thus a long-standing challenge. Unfortunately, despite the availability of high-throughput datasets measuring G-quadruplex propensity in the form of mismatch scores, extant methods to predict G-quadruplex formation either rely on small datasets or are based on domain-knowledge rules.
We developed G4mismatch, a novel algorithm to accurately and efficiently predict G-quadruplex propensity for any genomic sequence. G4mismatch is based on a convolutional neural network trained on almost 400 million human genomic loci measured in a single G4-seq experiment. When tested on sequences from a held-out chromosome, G4mismatch, the first method to predict mismatch scores genome-wide, achieved a Pearson correlation of over 0.8. When benchmarked on independent datasets derived from various animal species, G4mismatch trained on human data predicted G-quadruplex propensity genome-wide with high accuracy (Pearson correlations greater than 0.7). Moreover, when tested in detecting G-quadruplexes genome-wide using the predicted mismatch scores, G4mismatch achieved superior performance compared to extant methods. Last, we demonstrate the ability to deduce the mechanism behind G-quadruplex formation through a unique visualization of the principles learned by the model.
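
As a rough illustration of the evaluation described above, the following Python sketch one-hot encodes a DNA sequence and computes the Pearson correlation between measured and predicted mismatch scores. One-hot encoding is a common input representation for convolutional networks and is assumed here, not taken from the paper. The function names `one_hot` and `pearson` and all example values are hypothetical; this is not the G4mismatch implementation.

```python
import math

BASES = "ACGT"  # assumed channel order; any fixed order would work


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T).

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    measured = [2.0, 35.5, 4.1, 60.2, 10.0]
    predicted = [3.1, 30.0, 5.0, 55.8, 12.4]
    print(len(one_hot("GGGTTAGGG")), "positions encoded")
    print("Pearson r =", round(pearson(measured, predicted), 3))
```

In a held-out-chromosome evaluation of the kind the abstract describes, `measured` would hold the G4-seq mismatch scores for loci on the held-out chromosome and `predicted` the model outputs for the same loci.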
