Gilad Ben Or, Isana Veksler-Lublinsky.
Abstract
BACKGROUND: MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression post-transcriptionally via base-pairing with complementary sequences on messenger RNAs (mRNAs). Due to the technical challenges involved in the application of high-throughput experimental methods, datasets of direct bona fide miRNA targets exist only for a few model organisms. Machine learning (ML)-based target prediction models were successfully trained and tested on some of these datasets. There is a need to further apply the trained models to organisms in which experimental training data are unavailable. However, it is largely unknown how the features of miRNA-target interactions evolve and whether some features have remained fixed during evolution, raising questions regarding the general, cross-species applicability of currently available ML methods.
Keywords: AGO-CLIP; CLASH; Chimeric miRNA–target interactions; Cross-species prediction; Machine learning; Target prediction; miRNA
Year: 2021 PMID: 34030625 PMCID: PMC8146624 DOI: 10.1186/s12859-021-04164-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Dataset information
| Name | Species and cell type/Developmental stage | Experimental method | References |
|---|---|---|---|
| ca1 | | CLEAR-CLIP | [ |
| ce1 | | Modified iPAR-CLIP | [ |
| ce2 | | ALG-1 iCLIP endogenous ligation | [ |
| h1 | Human embryonic kidney293 cells (HEK293) | CLASH | [ |
| h2 | Human, a mix of 6 datasets | AGO-CLIP endogenous ligation | [ |
| h3 | Human hepatoma cells (Huh-7.5) | CLEAR-CLIP | [ |
| m1 | Mouse, a mix of 3 datasets | AGO-CLIP endogenous ligation | [ |
| m2 | Mouse neuroblastoma N2A cells (ATCC) | CLEAR-CLIP | [ |
Fig. 1 A flowchart depicting the outline of the study. Overall, eight publicly available datasets of chimeric miRNA–target interactions were used in this study, including one from cattle (ca) B. taurus, two from the worm C. elegans (ce), three from humans (h), and two from mice (m). The study included four main steps. The first three steps (processing, characterization and classification) were applied separately on each dataset; in the fourth step, the relationships between datasets were examined.
Summary of the data processing pipeline
| Dataset | ca1 | ce1 | ce2 | h1 | h2 | h3 | m1 | m2 |
|---|---|---|---|---|---|---|---|---|
| No. of interactions^a | 296,297 | 3627 | 4920 | 18,514 | 10,567 | 32,712 | 1986 | 130,094 |
| No. of interactions in 3’UTRs | 30,534 | 1704 | 1206 | 8507 | 2039 | 4634 | 902 | 33,100 |
| Final dataset (canonical and non-canonical interactions) | 18,204 | 1176 | 992 | 5137 | 1150 | 2846 | 537 | 17,574 |
^a As provided by the original publications
Composition of miRNA sequences and miRNA seed families within datasets
| Dataset | ca1 | ce1 | ce2 | h1 | h2 | h3 | m1 | m2 |
|---|---|---|---|---|---|---|---|---|
| No. of interactions | 18,204 | 1176 | 992 | 5137 | 1150 | 2846 | 537 | 17,574 |
| No. of miRNA sequences | 165 | 68 | 56 | 287 | 140 | 203 | 98 | 417 |
| 90% point [miRNA sequences] | 49 (29%) | 26 (38%) | 24 (42%) | 99 (34%) | 58 (41%) | 68 (33%) | 49 (50%) | 111 (26%) |
| No. of seed families | 119 | 46 | 35 | 254 | 133 | 191 | 88 | 343 |
| 90% point [seed families] | 21 (18%) | 14 (30%) | 13 (37%) | 62 (24%) | 35 (26%) | 42 (22%) | 30 (34%) | 63 (18%) |
Fig. 2 Cumulative sum of miRNA sequence appearances in the examined datasets. Each curve corresponds to the cumulative sum of one of the datasets, where the minimum number of unique miRNA sequences needed to represent 90% of the interactions within the dataset is indicated by a filled circle. The height of each curve represents the size of the dataset and its width represents the number of unique miRNA sequences that comprise it.
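The "90% point" described in the caption can be illustrated with a minimal sketch. It assumes that miRNA sequences are ranked from most to least frequent before the cumulative sum is taken; the function name `coverage_point` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter

def coverage_point(mirna_per_interaction, fraction=0.9):
    """Smallest number of unique miRNAs whose interactions cover `fraction` of the dataset.

    Assumption: miRNAs are ranked by descending interaction count.
    """
    counts = sorted(Counter(mirna_per_interaction).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for n_mirnas, count in enumerate(counts, start=1):
        covered += count
        if covered >= fraction * total:
            return n_mirnas
    return len(counts)  # only reached for an empty input

# Toy input (made-up labels, one entry per interaction)
toy = ["miR-a"] * 12 + ["miR-b"] * 5 + ["miR-c"] * 2 + ["miR-d"]
print(coverage_point(toy))
```

Dividing the returned value by the number of unique miRNAs gives a percentage comparable to the bracketed values in the table above.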
Fig. 3 Classification of the miRNA–target duplexes, based on their base-pairing patterns. Distribution of miRNA–target duplexes across six classes according to the seed type (canonical or non-canonical) and the base-pairing density (low: < 11 bp, medium: 11–16 bp, or high: > 16 bp). The number above each bar indicates the total number of interactions in the dataset.
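The six classes in the caption combine two seed types with three density bins. A minimal sketch of that binning follows; the function name and the label format are hypothetical, and the seed type is taken as an input because the caption does not define how it is determined.

```python
def duplex_class(is_canonical_seed, n_base_pairs):
    """Assign a duplex to one of six classes using the thresholds in the caption."""
    if n_base_pairs < 11:
        density = "low"       # fewer than 11 bp
    elif n_base_pairs <= 16:
        density = "medium"    # 11 to 16 bp
    else:
        density = "high"      # more than 16 bp
    seed = "canonical" if is_canonical_seed else "non-canonical"
    return f"{seed}/{density}"

print(duplex_class(True, 14))
```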
Intra-dataset classification accuracy of different machine learning methods
| Dataset | XGBoost | RF | KNN | SGD | SVM | LR |
|---|---|---|---|---|---|---|
| ca1 | 0.937 (0.002) | 0.885 (0.004) | 0.828 (0.003) | 0.797 (0.033) | 0.895 (0.003) | 0.836 (0.004) |
| ce1 | 0.889 (0.014) | 0.833 (0.019) | 0.768 (0.019) | 0.798 (0.045) | 0.841 (0.015) | 0.843 (0.014) |
| ce2 | 0.891 (0.016) | 0.858 (0.018) | 0.768 (0.019) | 0.819 (0.034) | 0.862 (0.012) | 0.847 (0.016) |
| h1 | 0.824 (0.007) | 0.769 (0.008) | 0.731 (0.007) | 0.746 (0.011) | 0.795 (0.007) | 0.770 (0.007) |
| h2 | 0.904 (0.007) | 0.869 (0.011) | 0.857 (0.009) | 0.860 (0.03) | 0.879 (0.009) | 0.892 (0.009) |
| h3 | 0.835 (0.007) | 0.769 (0.009) | 0.744 (0.009) | 0.752 (0.034) | 0.805 (0.007) | 0.795 (0.010) |
| m1 | 0.847 (0.015) | 0.795 (0.016) | 0.758 (0.022) | 0.760 (0.038) | 0.819 (0.019) | 0.800 (0.019) |
| m2 | 0.900 (0.004) | 0.826 (0.004) | 0.797 (0.004) | 0.798 (0.017) | 0.873 (0.004) | 0.833 (0.004) |
The cells contain the means and standard deviations (in brackets) of the accuracy results acquired from 20 models that were trained and evaluated on different training-testing dataset splits
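The aggregation described in this note (a mean and standard deviation over 20 train/test splits) can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the 20% test fraction, the use of the sample standard deviation, and the names `repeated_split_accuracy`, `fit_predict` and `majority_class` are all hypothetical.

```python
import random
import statistics

def repeated_split_accuracy(samples, labels, fit_predict,
                            n_repeats=20, test_fraction=0.2, seed=0):
    """Mean and sample standard deviation of accuracy over random splits.

    `fit_predict(train_x, train_y, test_x)` is a caller-supplied callback
    that trains a model and returns predictions for `test_x`.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, int(len(indices) * test_fraction))  # assumed split size
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / n_test)
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def majority_class(train_x, train_y, test_x):
    """Placeholder classifier: always predicts the most frequent training label."""
    majority = max(set(train_y), key=train_y.count)
    return [majority] * len(test_x)
```

Any classifier can be plugged in through the callback; `majority_class` only serves as a stand-in so that the sketch runs without external libraries.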
XGBoost performance measurements
| Dataset | AUC^a | ACC^b | TPR^c | TNR^d | MCC^e | F1 score |
|---|---|---|---|---|---|---|
| ca1 | 0.983 (0.001) | 0.937 (0.002) | 0.932 (0.004) | 0.943 (0.004) | 0.874 (0.004) | 0.937 (0.002) |
| ce1 | 0.955 (0.009) | 0.889 (0.014) | 0.89 (0.018) | 0.889 (0.014) | 0.779 (0.028) | 0.89 (0.014) |
| ce2 | 0.958 (0.012) | 0.891 (0.016) | 0.884 (0.02) | 0.899 (0.019) | 0.783 (0.032) | 0.89 (0.017) |
| h1 | 0.908 (0.006) | 0.824 (0.007) | 0.816 (0.008) | 0.833 (0.008) | 0.649 (0.014) | 0.822 (0.007) |
| h2 | 0.972 (0.003) | 0.904 (0.007) | 0.886 (0.012) | 0.924 (0.011) | 0.809 (0.014) | 0.902 (0.007) |
| h3 | 0.914 (0.004) | 0.835 (0.007) | 0.823 (0.011) | 0.849 (0.009) | 0.671 (0.014) | 0.832 (0.008) |
| m1 | 0.914 (0.007) | 0.847 (0.015) | 0.834 (0.014) | 0.862 (0.024) | 0.695 (0.031) | 0.844 (0.014) |
| m2 | 0.963 (0.002) | 0.9 (0.004) | 0.891 (0.003) | 0.909 (0.005) | 0.8 (0.008) | 0.899 (0.004) |
The cells contain the means and standard deviations (in brackets) acquired from 20 models that were trained and evaluated on different training-testing dataset splits
^a Area under the receiver operating characteristic curve
^b Overall accuracy
^c True Positive Rate (Sensitivity)
^d True Negative Rate (Specificity)
^e Matthews correlation coefficient
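For reference, the threshold-based measures in the table follow from the four confusion-matrix counts using their standard definitions. The sketch below is illustrative only: the function name and the toy counts are hypothetical, AUC is omitted because it requires prediction scores rather than counts, and all denominators are assumed to be non-zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # sensitivity
    tnr = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ACC": acc, "TPR": tpr, "TNR": tnr, "MCC": mcc, "F1": f1}

# Toy counts (made up, not results from the study)
print(binary_metrics(tp=90, fp=10, tn=90, fn=10))
```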
Fig. 4 Dataset feature importance plot based on gain score. The features are sorted in descending order of importance, from the highest importance (highest gain) to the lowest. a A full view of the gain plot, emphasizing the gain decay. b A zoomed-in view, focusing on the 20 most important features.
Feature importance
| Feature/Dataset | ca1 | ce1 | ce2 | h1 | h2 | h3 | m1 | m2 | Mean |
|---|---|---|---|---|---|---|---|---|---|
| | 100* | 87* | 95* | 29* | 40* | 100* | 28 | 100* | 72 |
| | 63* | 79 | 34* | 70* | 25* | 30* | 27 | 85* | 52 |
| Number of GU bp within the site | 42* | 71* | 32* | 100* | 19 | 53* | 35* | 28* | 48 |
| Proportion of G in mRNA at the site region | 12 | 74* | 12 | 12 | 36* | 33* | 100* | 37* | 39 |
| Duplex minimum free energy | 13* | 45 | 11 | 10 | 100* | 19 | 35* | 52* | 36 |
| | 42* | 33 | 100* | 12 | 18 | 36* | 13 | 18 | 34 |
| Proportion of GG in mRNA at the site region | 30* | 21 | 10 | 12 | 7 | 30* | 79* | 26* | 27 |
| | 8 | 100* | 21 | 10 | 11 | 16 | 2 | 12 | 22 |
| Number of bulges outside the seed | 3 | 60* | 6 | 25* | 32* | 9 | 9 | 8 | 19 |
| | 8 | 42 | 37* | 7 | 11 | 13 | 15 | 6 | 17 |
| | 12 | 27 | 14 | 14 | 6 | 15 | 29* | 12 | 16 |
| | 7 | 22 | 24* | 18* | 12 | 13 | 11 | 12 | 15 |
| Number of GC bp outside the seed | 4 | 27 | 11 | 10 | 27* | 8 | 6 | 5 | 12 |
| Accessibility (nt = 21, len = 10) | 9 | 19 | 7 | 6 | 25 | 7 | 12 | 7 | 11 |
| Minimum free energy of the target site + 50 nt flanking regions | 8 | 11 | 6 | 7 | 8 | 11 | 36* | 6 | 11 |
| | 4 | 3 | 15 | 19* | 0 | 13 | 2 | 9 | 8 |
The table shows 16 features representing the union of the top 6 features of each dataset, along with their gain values which were computed by XGBoost. The features are ordered by their mean gain, scaled to the range of (0, 100), across all datasets. For the unscaled version of the table, see Additional file 1: Table S5
*Belongs to the top 6 features of the dataset
Boolean feature
Numeric feature
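The construction of this table (per-dataset gains scaled to a maximum of 100, then the union of each dataset's top 6 features) can be sketched as below. The scaling rule is an assumption inferred from each dataset column peaking at 100; the function names and the toy gain values are hypothetical and do not come from Additional file 1.

```python
def scale_gains(gains):
    """Scale one dataset's gain values so that the largest becomes 100 (assumed rule)."""
    top = max(gains.values())
    return {name: 100 * value / top for name, value in gains.items()}

def top_k(gains, k=6):
    """Names of the k features with the highest gain."""
    return set(sorted(gains, key=gains.get, reverse=True)[:k])

def union_of_top_features(gains_by_dataset, k=6):
    """Union of the top-k features across all datasets."""
    union = set()
    for gains in gains_by_dataset.values():
        union |= top_k(gains, k)
    return union

# Toy gains for two made-up datasets and three made-up features
toy = {
    "d1": {"f1": 4.0, "f2": 2.0, "f3": 1.0},
    "d2": {"f1": 0.5, "f2": 1.0, "f3": 3.0},
}
print(scale_gains(toy["d1"]))
print(union_of_top_features(toy, k=1))
```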
Fig. 5 Kullback–Leibler (KL) divergence of all dataset pairs. Each cell (i,j) represents the divergence from a source dataset i to a target dataset j (KL(j || i)), based on their miRNA seed family distributions. The black frames indicate the results of dataset pairs originating from the same species.
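The quantity KL(j || i) in the caption can be illustrated with a short sketch over seed family frequencies. The additive pseudocount (used so that families absent from one dataset do not produce a division by zero) and the natural logarithm are assumptions; the record does not state how the authors handled either. All names and toy labels are hypothetical.

```python
import math
from collections import Counter

def seed_family_distribution(families, support, pseudocount=1.0):
    """Smoothed relative frequency of each seed family over a shared support."""
    counts = Counter(families)
    total = len(families) + pseudocount * len(support)
    return {f: (counts[f] + pseudocount) / total for f in support}

def kl_divergence(p, q):
    """KL(p || q) = sum over x of p(x) * ln(p(x) / q(x))."""
    return sum(p[f] * math.log(p[f] / q[f]) for f in p if p[f] > 0)

# Toy seed-family labels, one per interaction (made up)
target_j = ["fam1", "fam1", "fam2"]
source_i = ["fam1", "fam2", "fam2", "fam3"]
support = set(target_j) | set(source_i)

p_j = seed_family_distribution(target_j, support)
p_i = seed_family_distribution(source_i, support)
print(kl_divergence(p_j, p_i))  # KL(j || i)
```

Because the divergence is asymmetric, KL(j || i) and KL(i || j) generally differ, which is why the figure reports every ordered pair.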
Fig. 6 Two-dimensional visualization of the datasets. Each point represents a single positive interaction after a dimensional reduction of its features’ space using PCA. The X and Y axes are the first and the second components of the PCA, respectively.
Fig. 7 Cross-dataset classification results. Each cell (i,j) represents the mean accuracy of the 20 XGBoost classifiers that were trained on dataset i (in "Evaluation of different machine-learning methods" section) and tested on dataset j (ACC(i, j)). The black frames indicate the results of dataset pairs originating from the same species. The accuracy results for pairs (i,i) were taken from "Intra-dataset analysis" section. Note that, for the ease of the interpretation of the results, the color scale is inverse to the scale used for the KL-divergence plot in Fig. 5.
Estimated divergence time [MYA] between species in our study
| | Mouse | Cattle | C. elegans |
|---|---|---|---|
| Human | 90 | 96 | 797 |
| Mouse | | 96 | 797 |
| Cattle | | | 797 |
Each cell represents the time since the pair of species from the corresponding row and column diverged from their common ancestor (Source: [74])
Data processing pipeline
| Source | [ | [ | [ | [ | [ |
|---|---|---|---|---|---|
| Datasets | h1 | ce1, h2, m1 | ca1 | ce2 | h3, m2 |
| miRNA sequence | miRBASE | miRBASE | miRBASE | ||
| Target sequence | Wormbase | UCSC genome browser | |||
| Site region | Ensembl Biomart + Blast | ||||
| Duplex structure | Vienna RNAduplex | ||||
| Seed filter | Canonical and non-canonical seeds only | ||||
The table describes the set of actions required to transform the datasets into a uniform format to serve as input for further data analysis and machine learning experiments. The check-mark sign (✓) represents a piece of information taken directly from the paper without additional calculations
Feature categories that are used to represent miRNA–target interactions
| Category | No. of features | Description | Group |
|---|---|---|---|
| Seed features | 13 | Seed composition and properties | High-level |
| Free energy | 7 | Free energy of the duplex and the mRNA at different regions | High-level |
| mRNA composition | 62 | mRNA composition in the site and flanking regions | High-level |
| miRNA pairing | 38 | Binding information at each miRNA position and across the miRNA–target duplex | Low-level |
| Site accessibility | 370 | Unpaired probabilities of each base | Low-level |
| Total | 490 | | |