Eliška Chalupová, Ondřej Vaculík, Jakub Poláček, Filip Jozefov, Tomáš Majtner, Panagiotis Alexiou.
Abstract
BACKGROUND: The recent big data revolution in Genomics, coupled with the emergence of Deep Learning as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Even though Deep Learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is beyond the ability of most researchers in the field.
Keywords: Convolutional Neural Network; Deep Learning; Evolutionary Conservation Score; GUI; RNA Secondary Structure; Recurrent Neural Network
Year: 2022 PMID: 35361122 PMCID: PMC8973509 DOI: 10.1186/s12864-022-08414-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1a The Graphical User Interface (GUI) of ENNGene—ENNGene is operated entirely via the GUI. Users define the input parameters using simple interactive elements, such as dropdown menus and checkboxes. Warnings and hints are displayed in a user-friendly way directly as the user interacts with the interface. Because the GUI runs in a web browser, interactive plots and results are visualized immediately, both during and after the calculations. b Simplified data flow—ENNGene comprises multiple sequential modules with separate functionality, covering the whole process from input preparation and network architecture definition to model evaluation and interpretation
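The sequential module design described in the caption can be illustrated with a minimal pipeline sketch. The step names and the dict-based interface below are illustrative assumptions, not taken from the ENNGene codebase.

```python
from typing import Callable

# Hypothetical stand-ins for ENNGene-style modules; each step consumes the
# previous step's output, mirroring "subsequent modules with separate
# functionality". Names and data layout are assumptions for illustration.
def preprocess(data: dict) -> dict:
    # e.g. normalize raw sequences before encoding
    data["encoded"] = [s.upper() for s in data["sequences"]]
    return data

def train(data: dict) -> dict:
    # placeholder for network-architecture definition and training
    data["model"] = {"trained_on": len(data["encoded"])}
    return data

def evaluate(data: dict) -> dict:
    # placeholder for model evaluation and interpretation
    data["metrics"] = {"auroc": None}
    return data

def run_pipeline(data: dict, steps: list[Callable[[dict], dict]]) -> dict:
    for step in steps:
        data = step(data)
    return data

result = run_pipeline({"sequences": ["acgt", "ggcc"]},
                      [preprocess, train, evaluate])
```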
Fig. 2a Precision-recall curve—the precision-recall metric captures the relationship between the model’s positive predictive value (precision) and its sensitivity (recall) at various thresholds. b Receiver Operating Characteristic (ROC) curve—the ROC curve plots the true positive rate against the false positive rate at various thresholds. Both metrics, as calculated by ENNGene, are adjusted for multi-class classification problems and can therefore be applied to models with any number of classes. Both curves and other metrics (accuracy, loss, AUROC) are a standard part of the results exported after model evaluation, optionally with Integrated Gradients scores. c Integrated Gradients visualization—IG scores of the ten sequences with the highest predicted score per class are visualized directly in the browser. Scores are displayed in a separate row for each input type used—sequence, secondary structure, and conservation score. The higher a nucleotide’s attribution to the prediction of a given class, the deeper its red color; blue indicates a low level of attribution
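The thresholded quantities behind both curves can be sketched from scratch. This is a minimal illustration of the definitions above, not the implementation ENNGene uses internally.

```python
def pr_roc_points(scores, labels, thresholds):
    """Compute (precision, recall) and (FPR, TPR) points at each threshold.

    scores: predicted probabilities; labels: 1 = positive, 0 = negative.
    A from-scratch sketch of the curves in Fig. 2a/b.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    pts_pr, pts_roc = [], []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        # conventionally precision = 1 when nothing is predicted positive
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / pos if pos else 0.0   # = true positive rate
        fpr = fp / neg if neg else 0.0      # = false positive rate
        pts_pr.append((precision, recall))
        pts_roc.append((fpr, recall))
    return pts_pr, pts_roc
```

Sweeping the thresholds from 1 to 0 traces the two curves from the most to the least conservative operating point.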
AUROC values from the evaluation of the final models, together with the RBP24 dataset properties. Models created using ENNGene outperform the state-of-the-art tools on more than half of the experiments in the RBP24 dataset while also improving the average AUROC value over the whole dataset. The AUROC values of the other tools are taken from the original publications. The highest AUROC score reached for each experiment is highlighted in bold
| Protein | CLIP protocol | Sites | Sites | AUROC | AUROC | AUROC | AUROC | AUROC | AUROC |
|---|---|---|---|---|---|---|---|---|---|
| | PAR-CLIP | 36,902 | 31,310 | 0.895 | 0.881 | 0.919 | 0.925 | 0.915 | **0.927** |
| | HITS-CLIP | 48,095 | 44,251 | 0.765 | 0.809 | 0.879 | 0.886 | 0.884 | **0.9** |
| | PAR-CLIP | 1213 | 1197 | 0.68 | 0.714 | 0.668 | 0.643 | **0.758** | 0.74 |
| | PAR-CLIP | 1860 | 1849 | 0.8 | 0.82 | 0.755 | 0.74 | **0.83** | 0.824 |
| | PAR-CLIP | 9369 | 9136 | 0.751 | 0.792 | 0.809 | 0.823 | **0.837** | 0.832 |
| | PAR-CLIP | 8140 | 7901 | 0.855 | 0.834 | 0.888 | 0.824 | **0.893** | 0.869 |
| | HITS-CLIP | 8595 | 8436 | 0.955 | 0.966 | **0.98** | 0.966 | 0.964 | 0.978 |
| | PAR-CLIP (A) | 27,275 | 23,974 | 0.959 | 0.966 | 0.972 | 0.973 | **0.978** | 0.965 |
| | PAR-CLIP (B) | 9464 | 9283 | 0.935 | 0.961 | 0.961 | 0.962 | 0.971 | **0.98** |
| | PAR-CLIP (C) | 125,202 | 113,686 | 0.991 | **0.994** | 0.989 | 0.99 | 0.979 | 0.989 |
| | PAR-CLIP | 16,292 | 14,720 | 0.935 | 0.966 | 0.969 | 0.962 | 0.969 | **0.971** |
| | PAR-CLIP | 34,581 | 31,480 | 0.968 | 0.98 | 0.983 | 0.976 | **0.985** | 0.977 |
| | iCLIP | 21,472 | 19,794 | 0.952 | 0.962 | 0.976 | **0.978** | 0.977 | |
| | PAR-CLIP | 8539 | 6838 | 0.889 | 0.879 | 0.939 | 0.923 | 0.943 | **0.946** |
| | PAR-CLIP | 13,793 | 12,987 | 0.863 | 0.854 | 0.899 | 0.896 | 0.916 | **0.92** |
| | PAR-CLIP | 9116 | 8227 | 0.954 | **0.971** | 0.964 | 0.965 | 0.967 | 0.965 |
| | HITS-CLIP | 44,574 | 43,700 | 0.937 | 0.944 | 0.936 | 0.944 | 0.953 | **0.954** |
| | PAR-CLIP | 10,276 | 9142 | 0.957 | **0.983** | 0.973 | 0.965 | 0.97 | 0.975 |
| | HITS-CLIP | 19,438 | 17,195 | 0.898 | 0.931 | 0.929 | 0.905 | **0.945** | 0.939 |
| | PAR-CLIP | 7298 | 6606 | 0.97 | **0.983** | 0.978 | 0.978 | 0.976 | |
| | iCLIP | 92,031 | 75,079 | 0.874 | 0.876 | 0.93 | 0.935 | **0.945** | 0.924 |
| | iCLIP | 18,049 | 16,135 | 0.861 | 0.891 | 0.929 | 0.941 | 0.937 | **0.942** |
| | iCLIP | 42,332 | 36,652 | 0.833 | 0.87 | 0.922 | 0.929 | 0.934 | **0.946** |
| | PAR-CLIP | 20,962 | 20,018 | 0.82 | 0.796 | 0.875 | 0.883 | **0.907** | 0.862 |
| Average | | | | 0.887 | 0.903 | 0.918 | 0.913 | 0.931 | 0.934 |
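The AUROC values reported above can in principle be reproduced from raw prediction scores with a rank-based estimator: the probability that a randomly chosen positive site outscores a randomly chosen negative one. The following is a from-scratch sketch of that metric, not ENNGene's own evaluation code.

```python
def auroc(scores, labels):
    """Rank-based AUROC over all positive/negative score pairs.

    labels: 1 = bound (positive) site, 0 = unbound (negative) site.
    Ties count as half a favorable comparison. O(n^2) on purpose,
    for clarity rather than speed.
    """
    pairs = 0
    favorable = 0.0
    for sp, yp in zip(scores, labels):
        if yp != 1:
            continue
        for sn, yn in zip(scores, labels):
            if yn != 0:
                continue
            pairs += 1
            if sp > sn:
                favorable += 1.0
            elif sp == sn:
                favorable += 0.5
    return favorable / pairs
```

An AUROC of 1.0 means every positive outranks every negative; 0.5 is chance level.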
Fig. 3 Simplified representation of model architecture. The model in this example was trained on 150 nt long sequences using all three available input types—sequence, secondary structure, and conservation score—each represented by a separate model branch. After the network extracts information from the separate inputs, the branches are concatenated, and the network continues learning interdependencies by looking at the combined information via dense or recurrent layers. Boxes represent individual layers, while the adjacent numbers indicate the data dimensionality. A plain graphical representation of the network architecture is produced and exported by ENNGene for every trained model
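The branched architecture in Fig. 3 can be approximated with a toy forward pass: one 1-D convolutional branch per input type, concatenated along the feature axis and fed to a dense head. All layer sizes, filter counts, and the random weights below are illustrative assumptions; the real network is configured and trained inside ENNGene.

```python
import numpy as np

rng = np.random.default_rng(0)

def branch(x, n_filters=8, width=5):
    # Toy 1-D convolution + ReLU (valid padding, random weights) standing in
    # for one input branch of the Fig. 3 architecture. x: (length, channels).
    length, channels = x.shape
    w = rng.standard_normal((width, channels, n_filters))
    out = np.empty((length - width + 1, n_filters))
    for i in range(out.shape[0]):
        # contract the window (width, channels) against the filter bank
        out[i] = np.tensordot(x[i:i + width], w, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU

L = 150
seq = rng.random((L, 4))      # one-hot encoded sequence (A, C, G, T)
struct = rng.random((L, 1))   # secondary-structure signal
cons = rng.random((L, 1))     # evolutionary conservation score

# Each input type gets its own branch; outputs are concatenated and a
# dense head then learns interdependencies of the combined information.
merged = np.concatenate([branch(seq), branch(struct), branch(cons)], axis=1)
w_dense = rng.standard_normal((merged.size, 2))
logits = merged.reshape(-1) @ w_dense   # two-class output
```

With a window of 5, each branch maps a 150-step input to 146 steps of 8 filters, so the merged tensor is (146, 24) before the dense layer.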