Noah Bliss, Eckart Bindewald, Bruce A Shapiro.
Abstract
Secondary structure prediction approaches typically rely on models of equilibrium free energies that are themselves based on in vitro physical chemistry. Recent transcriptome-wide SHAPE-MaP experiments probing in vivo RNA structure provide important information that may make it possible to extend current in vitro-based RNA folding models and thereby improve the accuracy of computational RNA folding simulations with respect to the experimentally measured in vivo RNA secondary structure. Here we present a machine learning approach that uses RNA secondary structure prediction results and nucleotide sequence to predict in vivo SHAPE scores. We show that the predictions of this approach have a higher Pearson correlation coefficient with experimental SHAPE scores than those of thermodynamic folding. This could be an important step towards augmenting experimental results with computational predictions, and could help with RNA secondary structure predictions that inherently take in vivo folding properties into account.
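The comparison described in the abstract rests on the Pearson correlation between predicted and experimental per-nucleotide scores. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `pearson` and the example values are hypothetical and chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical per-nucleotide scores, made up for illustration only.
experimental = [0.1, 0.8, 0.4, 1.2, 0.0]
predicted = [0.2, 0.7, 0.5, 1.0, 0.1]
r = pearson(experimental, predicted)
```

In practice one would compute `r` once for each method (machine learning and thermodynamic folding) against the same experimental scores and compare the two values.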
Keywords: RNA; SHAPE; SHAPE-MaP; deep learning; neural network; secondary structure
Year: 2020 PMID: 32476596 PMCID: PMC7549691 DOI: 10.1080/15476286.2020.1760534
Source DB: PubMed Journal: RNA Biol ISSN: 1547-6286 Impact factor: 4.652
Partition of IDs of 194 RNAs reported by Mustoe into test set and training set.
| Data set | IDs |
|---|---|
| Test set | 7,11,21,22,24,27,28,38,41,44,45,51,54,59,61,62,73,75,76,94, |
| Training set | 1,2,3,4,5,6,8,9,10,12,13,14,15,16,17,18,19,20,23,25,26,29,30,31,32,33,34,35, |
Figure 1. Diagram of the thermo inception model designed for this paper. The sections of the model are colour coded. Each block represents a layer of the network, and each arrow shows the flow of information from one layer to the next. The dashed and dotted lines running between sections indicate that the model was run on both the upstream (5ʹ-most) data and the downstream (3ʹ-most) data, represented by the dashed and dotted lines, respectively. As described in the Methods section, the input data are processed by convolutional networks with different kernel sizes so that RNA secondary structures and sequence motifs of different sizes can be detected (sections highlighted in red, yellow, and purple). The upstream and downstream data are combined in a stacked set of dense neural networks. The model also uses several types of neural network layers that behave similarly to traditional convolutional layers. The separable convolution layer is a special type of convolutional layer in which the kernel is broken up into two separate convolutions, which reduces the overall number of parameters. The locally connected layer acts as an intermediate between convolutional layers and dense layers: corresponding nodes between layers are fully connected with their neighbors, much like the sliding-window method used by the convolutional layers. There are several other helper layers, such as the repeating layer, which extrudes a vector or tensor into a tensor of one dimension higher. The max pooling layer reduces the size of the tensor by taking the largest value along a specified axis of the input tensor. The batch normalization layer rescales the tensor to have a mean of zero and a standard deviation of one.
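The helper layers named in the Figure 1 caption can be illustrated with a small pure-Python sketch. This is not the authors' model code: the function names are hypothetical, the separable convolution is assumed to be the common depthwise-plus-pointwise variant, and the learned scale and offset of a real batch normalization layer are omitted.

```python
import math

def max_pool_1d(values, pool_size):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

def standardize(values, eps=1e-5):
    """Core of batch normalization: shift to mean 0 and scale to std ~1.
    The learned scale/offset parameters of a real layer are left out."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def conv1d_params(kernel, c_in, c_out):
    """Weights of a standard 1D convolution (biases ignored)."""
    return kernel * c_in * c_out

def separable_conv1d_params(kernel, c_in, c_out):
    """Weights of a depthwise separable 1D convolution (biases ignored):
    one kernel per input channel, then a 1x1 convolution mixing channels.
    Assumes the depthwise-plus-pointwise form of separable convolution."""
    return kernel * c_in + c_in * c_out

pooled = max_pool_1d([1, 3, 2, 5, 4, 0], pool_size=2)       # [3, 5, 4]
standard = conv1d_params(kernel=5, c_in=4, c_out=32)        # 640
separable = separable_conv1d_params(kernel=5, c_in=4, c_out=32)  # 148
```

The channel counts and kernel size are arbitrary illustrative values; they show why splitting the kernel into two convolutions reduces the parameter count.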
Figure 2. Correlations with experimental SHAPE scores for the machine learning method (red, ‘ML’), thermodynamic folding (blue, ‘rnaplfold’) and predicted structures informed by the experimental SHAPE scores (green, ‘Mustoe’). The middle portion (‘Mustoe’) depicts the correlation for the secondary structure probabilities accompanying the publication [7] (corresponding to thermodynamic folding informed by experimental SHAPE scores). The machine learning method shown in red (which did not have access to the experimental SHAPE scores for the test cases used) performs similarly to or better than the predicted structures informed by SHAPE scores (shown in green). The three experimental conditions labelled ‘cellfree’, ‘incell’ and ‘kasugamycin’ correspond to the three datasets provided in [7]: i) cell-free lysates, ii) in-cell conditions and iii) conditions of deactivated protein translation due to the presence of the antibiotic kasugamycin.
Figure 3. Violin plots of the absolute difference between predicted and experimental SHAPE scores (red), compared with a control in which the correspondence between input feature vector and output has been randomly shuffled (blue).
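One way to read the shuffled control in Figure 3 is as a permutation baseline: the predicted scores are compared against experimental scores whose pairing has been randomized. The sketch below follows that reading, which is an assumption; the caption does not specify the exact procedure, and the function names and `seed` parameter are hypothetical.

```python
import random

def abs_errors(predicted, experimental):
    """Per-position absolute difference between two score lists."""
    return [abs(p - e) for p, e in zip(predicted, experimental)]

def shuffled_control(predicted, experimental, seed=0):
    """Absolute errors after randomly re-pairing predictions and targets.
    The input lists are left unmodified."""
    rng = random.Random(seed)
    shuffled = list(experimental)
    rng.shuffle(shuffled)
    return abs_errors(predicted, shuffled)

# Hypothetical scores, made up for illustration only.
experimental = [0.1, 0.8, 0.4, 1.2, 0.0, 0.6]
predicted = [0.2, 0.7, 0.5, 1.0, 0.1, 0.5]
real = abs_errors(predicted, experimental)
control = shuffled_control(predicted, experimental)
```

If the model has learned a real relationship, the distribution of `real` should sit lower than that of `control`, which is what the two violin plots compare.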
Figure 4. Predicted SHAPE scores capture the essence of experimental SHAPE scores around translation start sites. A) The box-whisker plots show experimental SHAPE scores (left), SHAPE scores predicted via deep learning (middle) and scores derived from thermodynamic predictions (right, see Methods). For each of the three types of scores, the values are plotted relative to the translation start sites of genes (the ‘A’ of an AUG start codon corresponds to position 0). The SHAPE scores predicted via our deep learning model capture the essence of the experimental scores better than scores derived from thermodynamic folding predictions (see Methods). B) Comparing the scores at the AUG start codon (positions 0, 1 and 2 in the panels shown in A) with those at the remaining positions, a t-test shows statistically significant differences (indicating that start codon sites are less structured) for the experimental SHAPE scores (SHAPE) and our machine learning approach (ML), but not for thermodynamic folding.
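The comparison in panel B can be sketched as a two-sample test between scores at start codon positions (0, 1, 2) and scores at all other positions. The caption does not say which t-test variant was used; the sketch below assumes Welch's unequal-variance form and returns only the statistic and degrees of freedom, leaving the p-value to a statistics library. All names and the example values are hypothetical.

```python
import math
from statistics import mean, variance

def split_by_start_codon(scored_positions):
    """Split (position, score) pairs into start codon scores and the rest.
    Position 0 is the 'A' of the AUG start codon."""
    codon = [s for pos, s in scored_positions if pos in (0, 1, 2)]
    rest = [s for pos, s in scored_positions if pos not in (0, 1, 2)]
    return codon, rest

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical (position, score) pairs, made up for illustration only.
data = [(-3, 0.2), (-2, 0.3), (-1, 0.1), (0, 0.9), (1, 1.1),
        (2, 0.8), (3, 0.2), (4, 0.4), (5, 0.3)]
codon, rest = split_by_start_codon(data)
t, df = welch_t(codon, rest)
```

A positive `t` here means higher scores at the start codon, which for SHAPE reactivities corresponds to less structured sites.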