Eva Klimentova, Jakub Polacek, Petr Simecek, Panagiotis Alexiou.
Abstract
G-quadruplexes (G4s) are a class of stable nucleic acid secondary structures known to play a role in a wide spectrum of genomic functions, such as DNA replication and transcription. The classical understanding of G4 structure points to four variable-length guanine strands joined by variable-length nucleotide stretches. Experiments combining G4 immunoprecipitation with sequencing have produced a large number of highly probable G4-forming genomic sequences. The expense and technical difficulty of these experimental techniques highlight the need for computational approaches to G4 identification. Here, we present PENGUINN, a machine learning method based on convolutional neural networks that learns the characteristics of G4 sequences and accurately predicts G4s, outperforming state-of-the-art methods. We provide both a standalone implementation of the trained model and a web application that can be used to evaluate sequences for their G4 potential.
Keywords: G quadruplex; bioinformatics and computational biology; deep neural network; genomic; imbalanced data classification; machine learning; web application
Year: 2020 PMID: 33193663 PMCID: PMC7653191 DOI: 10.3389/fgene.2020.568546
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1 (A) Schematic of a typical G-quadruplex structure consisting of four G tracts with a minimum length of three, connected by non-specific loops. (B) PENGUINN convolutional neural network model. (C) Identification of G-quadruplex subsequences via randomized mutation.
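The canonical motif in panel (A), four G tracts of at least three guanines separated by loops, is the kind of pattern a regular-expression baseline searches for. The following is a minimal sketch, not the expression used in the study: the loop length of 1 to 7 nucleotides is an assumed, commonly used convention (the caption only says the loops are non-specific), and the names `G4_PATTERN` and `find_g4_motifs` are hypothetical. It scans the given strand only.

```python
import re

# Assumption: loops are 1-7 nucleotides long. The caption only states
# "non-specific loops", so this bound is illustrative, not from the source.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}", re.IGNORECASE)


def find_g4_motifs(sequence):
    """Return (start, end, matched_text) for each non-overlapping motif hit.

    Only the strand passed in is scanned; C-rich hits on the opposite
    strand would require scanning the reverse complement as well.
    """
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(sequence)]


if __name__ == "__main__":
    # Toy sequence: four GGG tracts separated by single-nucleotide loops.
    print(find_g4_motifs("TTGGGAGGGTGGGAGGGTT"))
```

A fixed pattern like this is fast and easy to interpret, but it cannot score sequences that deviate from the canonical motif, which is one motivation for learned models.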
Precision and recall values for static score prediction of regular expression, PENGUINNs (sensitive), and PENGUINNp (precise) on a scale of imbalanced datasets.
| Regular expression | 0.995 | 0.657 | 0.993 | 0.335 | 0.952 | 0.323 | 0.606 | 0.349 | 0.145 | 0.330 |
| PENGUINNs | 0.780 | 1.000 | 0.967 | 0.964 | 0.778 | 0.963 | 0.231 | 0.955 | 0.030 | 0.940 |
| PENGUINNp | 0.952 | 0.926 | 0.996 | 0.818 | 0.973 | 0.823 | 0.733 | 0.814 | 0.214 | 0.790 |
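In the later columns of the table above, precision falls sharply while recall changes comparatively little. A short sketch can illustrate why this is expected when the same classifier is applied to increasingly imbalanced data. The true-positive and false-positive rates used here (0.90 and 0.01) are hypothetical values chosen for illustration, not measurements from the study, and `expected_precision` is a hypothetical helper.

```python
def expected_precision(tpr, fpr, negatives_per_positive):
    """Expected precision for a classifier with fixed per-class rates.

    For every positive example there are `negatives_per_positive` negatives,
    so true positives scale with tpr and false positives with
    fpr * negatives_per_positive.
    """
    true_pos = tpr
    false_pos = fpr * negatives_per_positive
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical rates; recall equals tpr and does not depend on the ratio.
    tpr, fpr = 0.90, 0.01
    for negatives in (1, 9, 99, 999):
        p = expected_precision(tpr, fpr, negatives)
        print(f"pos:neg = 1:{negatives}  precision = {p:.3f}  recall = {tpr:.3f}")
```

Because recall depends only on how positives are classified, it is unaffected by the number of negatives, whereas even a small false-positive rate produces many false positives once negatives dominate.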
FIGURE 2 (A) F1 score for PENGUINNp (precise), PENGUINNs (sensitive), and Regular Expression on datasets with different pos:neg ratios. (B) Precision-recall curve comparison of PENGUINN, the best-performing state-of-the-art method G4detector K_PDS, and Regular Expression on datasets with different pos:neg ratios.
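The F1 score in panel (A) is the harmonic mean of precision and recall, so it can be recomputed from the precision and recall table. The sketch below does this for the first value pair of each row. The table's column labels are not available here, so which dataset that pair corresponds to is left unstated; since F1 is symmetric in its two arguments, the result does not depend on which value of the pair is precision and which is recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First value pair of each row in the precision/recall table.
first_pair = {
    "Regular expression": (0.995, 0.657),
    "PENGUINNs": (0.780, 1.000),
    "PENGUINNp": (0.952, 0.926),
}

f1_by_method = {name: f1_score(a, b) for name, (a, b) in first_pair.items()}

if __name__ == "__main__":
    for name, f1 in f1_by_method.items():
        print(f"{name}: F1 = {f1:.3f}")
```

The harmonic mean penalizes a large gap between the two values, which is why a method with very high precision but low recall can still score below a more balanced one.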
Area under the precision-recall curve for PENGUINN and 4 state-of-the-art programs.
| PENGUINN | | | | | |
| G4detector K PDS | 0.965 | 0.979 | 0.906 | 0.637 | 0.188 |
| G4detector PDS | 0.937 | 0.978 | 0.899 | 0.585 | 0.152 |
| G4detector K | 0.941 | 0.978 | 0.888 | 0.552 | 0.124 |
| G4Hunter | 0.972 | 0.964 | 0.851 | 0.503 | 0.093 |
| Quadron | 0.965 | 0.828 | 0.671 | 0.522 | 0.150 |
| PQSfinder | 0.977 | 0.948 | 0.861 | 0.551 | 0.101 |
Area under the precision-recall curve for PENGUINN models trained on human and mouse datasets, and evaluated on varying pos:neg ratios of human and mouse G4s.
| Dataset 1:1 human | 0.987 | |
| Dataset 1:9 human | 0.931 | |
| Dataset 1:99 human | 0.718 | |
| Dataset 1:999 human | 0.373 | |
| Dataset 1:1 mouse | 0.986 | |
| Dataset 1:9 mouse | 0.912 | |
| Dataset 1:99 mouse | 0.658 | |
| Dataset 1:999 mouse | 0.236 |