Malik Yousef, Gokhan Goy, Burcu Bakir-Gungor.
Abstract
Increasing evidence that microRNAs (miRNAs) play a key role in carcinogenesis has revealed the need to elucidate the mechanisms of miRNA regulation and the roles of miRNAs in gene-regulatory networks. A better understanding of the interactions between miRNAs and their mRNA targets will clarify the complex biological processes that occur during carcinogenesis. Increased efforts to reveal these interactions have led to the development of a variety of tools to detect and understand them. We have recently described a machine learning approach, miRcorrNet, based on grouping and scoring (ranking) groups of genes, where each group is associated with a miRNA and the group members are genes whose expression patterns are correlated with this specific miRNA. The miRcorrNet tool requires two types of -omics data, miRNA and mRNA expression profiles, as input. In this study we describe miRModuleNet, which groups the mRNAs (genes) that are correlated with each miRNA to form a star shape, which we identify as a miRNA-mRNA regulatory module. A scoring procedure is then applied to each module to further assess its contribution in terms of classification. An important output of miRModuleNet is that it provides a hierarchical list of significant miRNA-mRNA regulatory modules. miRModuleNet was further validated on external datasets for their disease associations, and functional enrichment analysis was also performed. The application of miRModuleNet aids the identification of functional relationships between significant biomarkers and reveals essential pathways involved in cancer pathogenesis. The miRModuleNet tool and all other supplementary files are available at https://github.com/malikyousef/miRModuleNet/.
Keywords: feature selection; gene expression; integrative “omics”; machine learning; multi omics
Year: 2022 PMID: 35495139 PMCID: PMC9039401 DOI: 10.3389/fgene.2022.767455
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
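The grouping step described in the abstract can be illustrated with a minimal sketch. It assumes Pearson correlation and a hypothetical cut-off of 0.8; the function names, the threshold, and the toy expression values are illustrative assumptions and are not taken from the miRModuleNet implementation (the Figure 6 caption indicates that the tool also evaluates mutual information thresholds).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sx * sy)


def build_modules(mirna_expr, mrna_expr, threshold=0.8):
    """Map each miRNA to the genes whose |correlation| meets the threshold.

    Both inputs are dicts of name -> expression values over the same samples.
    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    modules = {}
    for mirna, m_profile in mirna_expr.items():
        genes = [
            gene
            for gene, g_profile in mrna_expr.items()
            if abs(pearson(m_profile, g_profile)) >= threshold
        ]
        if genes:
            modules[mirna] = genes  # star shape: miRNA hub, gene leaves
    return modules


# Toy data invented for illustration: one miRNA, three genes, five samples.
mirna_expr = {"miR-A": [1.0, 2.0, 3.0, 4.0, 5.0]}
mrna_expr = {
    "GENE1": [5.0, 4.0, 3.0, 2.0, 1.0],   # anti-correlated with miR-A
    "GENE2": [2.0, 4.1, 5.9, 8.0, 10.2],  # strongly correlated with miR-A
    "GENE3": [3.0, 1.0, 3.0, 1.0, 3.0],   # uncorrelated with miR-A
}
modules = build_modules(mirna_expr, mrna_expr)
```

Taking the absolute value keeps both positively and negatively correlated genes in a module; whether the tool filters by sign is not stated in this record.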
FIGURE 1. miRNA-mRNA integration techniques.
Details of the datasets utilized in miRModuleNet.
| TCGA data | Abbreviation | Control | Case | PMID |
|---|---|---|---|---|
| Bladder urothelial carcinoma | BLCA | 405 | 19 | 24476821 |
| Breast invasive carcinoma | BRCA | 760 | 87 | 31878981 |
| Kidney chromophobe | KICH | 66 | 25 | 25155756 |
| Kidney renal papillary cell carcinoma | KIRP | 290 | 32 | 28780132 |
| Kidney renal clear cell carcinoma | KIRC | 255 | 71 | 23792563 |
| Lung adenocarcinoma | LUAD | 449 | 20 | 25079552 |
| Lung squamous cell carcinoma | LUSC | 342 | 38 | 22960745 |
| Prostate adenocarcinoma | PRAD | 493 | 52 | 26544944 |
| Stomach adenocarcinoma | STAD | 370 | 35 | 25079317 |
| Papillary thyroid carcinoma | THCA | 504 | 59 | 25417114 |
| Uterine corpus endometrial carcinoma | UCEC | 174 | 23 | 23636398 |
The Control and Case columns denote the number of samples. The PMID column gives the PubMed ID of the related publication, where further information about the dataset can be found.
FIGURE 2. The general integrative approach, which is based on grouping and scoring/ranking.
FIGURE 3. An example of how sub_data is created from a group of genes and then passed to the scoring component S. g_i refers to gene i.
A sample output of the scoring component when applied to THCA data downloaded from TCGA.
| Group | Accuracy | Sensitivity | Specificity | FM | Precision | Cohen’s kappa |
|---|---|---|---|---|---|---|
| hsa-miR-101-3p | 0.89 | 0.82 | 0.92 | 0.85 | 0.88 | 0.73 |
| hsa-miR-200c-3p | 0.95 | 0.92 | 0.97 | 0.92 | 0.94 | 0.89 |
| hsa-miR-508-3p | 0.98 | 0.93 | 1.00 | 0.96 | 1.00 | 0.94 |
| hsa-miR-629-5p | 0.99 | 0.97 | 1.00 | 0.98 | 0.99 | 0.97 |
Each miRNA ID represents a group, which is generated by the Grouping Component G. Groups are sorted according to the accuracy metric.
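The metrics reported per group can all be derived from a binary confusion matrix. The following is a minimal sketch under that assumption; the function name and the example counts are hypothetical and are not taken from the paper's results.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = ratio(tp + tn, total)
    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * sensitivity, precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_pos = ratio(tp + fp, total) * ratio(tp + fn, total)
    p_neg = ratio(tn + fn, total) * ratio(tn + fp, total)
    p_chance = p_pos + p_neg
    kappa = ratio(accuracy - p_chance, 1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "precision": precision,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, fp=5, tn=45, fn=5)
```

Cohen's kappa is the only metric here that corrects for class imbalance, which may be why it sits alongside accuracy for datasets whose case and control counts differ widely.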
Pseudocode of component M, which calculates the performance.
| For |
| genes_set = |
| X_train = sub_set of Ctrain that includes the genes from the genes_set |
| X_test = sub_set of Ctest that includes the genes from the genes_set |
| RF_Model <- train Random Forest (X_train) |
| Performances = test RF_Model (X_test) |
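The pseudocode above can be sketched in Python with scikit-learn. The loop header and the definition of genes_set are truncated in the source, so this sketch assumes the loop iterates over groups and that genes_set holds the genes of the current group. The function name, the use of class labels during training, the accuracy-only scoring, and the synthetic data are likewise assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def score_groups(groups, c_train, y_train, c_test, y_test, gene_names, seed=0):
    """Train one Random Forest per group, restricted to that group's genes."""
    column = {gene: i for i, gene in enumerate(gene_names)}
    performances = {}
    for group_id, genes_set in groups.items():  # assumed loop over groups
        idx = [column[g] for g in genes_set if g in column]
        if not idx:
            continue  # no gene of this group is present in the data
        x_train = c_train[:, idx]  # sub-set of Ctrain with the group's genes
        x_test = c_test[:, idx]    # sub-set of Ctest with the group's genes
        rf_model = RandomForestClassifier(n_estimators=100, random_state=seed)
        rf_model.fit(x_train, y_train)
        performances[group_id] = accuracy_score(y_test, rf_model.predict(x_test))
    return performances


# Synthetic data for illustration: genes g1 and g2 separate the classes.
rng = np.random.default_rng(0)
gene_names = ["g1", "g2", "g3", "g4"]
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 4))
X[:, 0] += 3 * y
X[:, 1] -= 3 * y
train = np.arange(40) % 2 == 0  # alternate samples between train and test
groups = {"miR-A": ["g1", "g2"], "miR-B": ["g3", "g4"]}
scores = score_groups(groups, X[train], y[train], X[~train], y[~train], gene_names)
ranked = sorted(scores, key=scores.get, reverse=True)  # best group first
```

Sorting the groups by their score mirrors the ranking of groups by accuracy shown in the sample output table.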
FIGURE 4. miRModuleNet flowchart.
FIGURE 5. miRModuleNet workflow in KNIME.
An example performance table of miRModuleNet for the 10 top-ranked modules on the BLCA dataset.
| #Groups | #Genes | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|
| 10 | 1422.96 | 0.92 | 0.89 | 0.94 | 0.98 |
| 9 | 1254.76 | 0.92 | 0.88 | 0.93 | 0.98 |
| 8 | 1110.82 | 0.91 | 0.87 | 0.93 | 0.97 |
| 7 | 962.83 | 0.91 | 0.88 | 0.93 | 0.97 |
| 6 | 799.7 | 0.92 | 0.88 | 0.94 | 0.97 |
| 5 | 628.14 | 0.92 | 0.87 | 0.94 | 0.97 |
| 4 | 489.59 | 0.91 | 0.87 | 0.93 | 0.98 |
| 3 | 331.02 | 0.90 | 0.85 | 0.93 | 0.97 |
| 2 | 205.08 | 0.90 | 0.84 | 0.93 | 0.97 |
| 1 | 79.25 | 0.89 | 0.82 | 0.92 | 0.95 |
FIGURE 6. Comprehensive evaluation of different mutual information threshold values. The number following the underscore corresponds to the number of groups.
Performance results of miRModuleNet over the top-ranked group.
| Disease | #Genes | ACC | SEN | SPE | FM | AUC | Precision |
|---|---|---|---|---|---|---|---|
| BLCA | 79 | 0.89 | 0.82 | 0.92 | 0.85 | 0.95 | 0.88 |
| BRCA | 22 | 0.95 | 0.92 | 0.97 | 0.92 | 0.98 | 0.94 |
| KICH | 40 | 0.98 | 0.93 | 1.00 | 0.96 | 0.99 | 1.00 |
| KIRC | 64 | 0.99 | 0.97 | 1.00 | 0.98 | 0.99 | 0.99 |
| KIRP | 41 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 1.00 |
| LUAD | 4 | 0.94 | 0.90 | 0.96 | 0.90 | 0.98 | 0.93 |
| LUSC | 12 | 0.98 | 0.99 | 0.98 | 0.98 | 1.00 | 0.97 |
| PRAD | 5 | 0.86 | 0.76 | 0.91 | 0.77 | 0.92 | 0.82 |
| STAD | 115 | 0.90 | 0.81 | 0.95 | 0.85 | 0.97 | 0.92 |
| THCA | 6 | 0.93 | 0.90 | 0.95 | 0.90 | 0.98 | 0.92 |
| UCEC | 33 | 0.94 | 0.89 | 0.96 | 0.89 | 0.99 | 0.94 |
ACC stands for Accuracy, SEN stands for Sensitivity, SPE stands for Specificity, FM stands for F-Measure, AUC stands for Area Under the ROC curve.
Comparative evaluation of existing tools using 11 cancer datasets.
| Method | #Genes | Accuracy | Sensitivity | Specificity | AUC | SD |
|---|---|---|---|---|---|---|
| miRModuleNet | 78.31 | 0.96 | 0.91 | 0.98 | 0.99 | 0.04 ± 0.02 |
| miRcorrNet | 141.26 | 0.96 | 0.94 | 0.97 | 0.98 | 0.05 ± 0.05 |
| maTE | 7.48 | 0.96 | 0.94 | 0.96 | 0.98 | 0.034 ± 0.02 |
| SVM-RFE | 8 | 0.84 | 0.85 | 0.85 | 0.91 | 0.07 ± 0.04 |
| SVM-RFE | 125 | 0.96 | 0.97 | 0.95 | 0.98 | 0.05 ± 0.03 |
The AUC column reports area under the ROC curve values. All values are averages over 100 Monte Carlo cross-validation (MCCV) iterations, at the level of the top 2 groups for miRModuleNet, maTE and miRcorrNet, and at 8 and 125 genes for SVM-RFE. Standard deviation (SD) values are given for AUC.
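Averaging over repeated random splits, as in Monte Carlo cross-validation (MCCV), can be sketched as follows. The stratified sampling, the 10% test fraction, and the `evaluate` callback are assumptions made for illustration; the split ratio used in the study is not stated in this record.

```python
import random
import statistics


def mccv(labels, evaluate, n_iter=100, test_fraction=0.1, seed=0):
    """Repeat random stratified splits; return (mean, SD) of the scores.

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains
    and tests a model on the given sample indices and returns one score.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    scores = []
    for _ in range(n_iter):
        test_idx = []
        for members in by_class.values():
            k = max(1, round(len(members) * test_fraction))
            test_idx.extend(rng.sample(members, k))  # draw without replacement
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores.append(evaluate(train_idx, sorted(test_idx)))
    return statistics.mean(scores), statistics.stdev(scores)


def toy_evaluate(train_idx, test_idx):
    """Placeholder score: the fraction of samples held out for testing."""
    assert not set(train_idx) & set(test_idx)  # splits must be disjoint
    return len(test_idx) / (len(train_idx) + len(test_idx))


labels = [0] * 10 + [1] * 10
mean_score, sd_score = mccv(labels, toy_evaluate)
```

In practice the callback would return a metric such as AUC, so that the mean and SD correspond to the kind of values reported in the table.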
FIGURE 7. Functional enrichment results for BLCA and BRCA using GeneCodis. The p-values of the enriched KEGG pathways are normalized using mean normalization. SPRPSC stands for Signaling Pathways Regulating Pluripotency of Stem Cells, EGFR-TKIR stands for EGFR Tyrosine Kinase Inhibitor Resistance, CML stands for Chronic Myeloid Leukemia, NAFLD stands for Non-Alcoholic Fatty Liver Disease, ARSPDC stands for AGE-RAGE Signaling Pathway in Diabetic Complications.
FIGURE 8. Interactions of the commonly overrepresented pathways in all datasets.
Performance results on the external validation data.
| Experiments using different gene levels (1–5–30–50) | Sensitivity | Specificity | Accuracy | F-measure |
|---|---|---|---|---|
| Random 1 gene | 0.43 | 0.58 | 0.51 | 0.44 |
| Top 1 gene of miRModuleNet | 0.84 | 0.88 | 0.87 | 0.85 |
| Random 5 genes | 0.46 | 0.61 | 0.55 | 0.48 |
| Top 5 genes of miRModuleNet | 0.94 | 0.81 | 0.88 | 0.87 |
| Random 30 genes | 0.57 | 0.91 | 0.76 | 0.68 |
| Top 30 genes of miRModuleNet | 0.94 | 0.92 | 0.93 | 0.92 |
| Random 50 genes | 0.76 | 0.94 | 0.86 | 0.83 |
| Top 50 genes of miRModuleNet | 0.94 | 0.97 | 0.95 | 0.94 |
In all experiments, the model is trained on TCGA-LUSC data and tested on the external dataset, LUSC_E.
Biological validation of the miRNAs identified by miRModuleNet for LUSC data against five disease databases: dbDEMC, miRcancer, miR2Disease, PhenomiR, and HMDD.
| miRNA | Score ( | Source(s) |
|---|---|---|
| hsa-miR-181a-5p | 4.83E-58 | dbDEMC, miRcancer, PhenomiR |
| hsa-miR-126-5p | 2.79E-57 | dbDEMC, miRcancer, miR2Disease, PhenomiR, HMDD |
| hsa-miR-140-3p | 5.9E-55 | dbDEMC, miRcancer, miR2Disease, PhenomiR, HMDD |
| hsa-miR-708-5p | 5.9E-55 | dbDEMC, miRcancer |
| hsa-miR-195-5p | 5.9E-55 | dbDEMC, miRcancer, miR2Disease, PhenomiR, HMDD |
| hsa-miR-30d-5p | 7.76E-53 | dbDEMC, miRcancer, PhenomiR, HMDD |
| hsa-miR-30a-5p | 7.76E-53 | dbDEMC, miRcancer, miR2Disease, PhenomiR, HMDD |
Summary of the comparison against the databases of miRNA–disease associations.
| Disease | Number of miRNA–disease associations identified by miRModuleNet | Found in 1 database | Found in 2 databases | Found in 3 databases | Found in 4 databases | Found in 5 databases |
|---|---|---|---|---|---|---|
| BLCA | 62 | 21 | 17 | 9 | 6 | 2 |
| BRCA | 51 | 4 | 15 | 19 | 11 | — |
| KICH | 61 | 34 | 15 | — | — | — |
| KIRC | 46 | 27 | 9 | 5 | — | — |
| KIRP | 87 | 44 | 19 | 4 | — | — |
| LUAD | 91 | 11 | 26 | 31 | 15 | 8 |
| LUSC | 54 | 2 | 6 | 10 | 15 | 20 |
| PRAD | 53 | 9 | 11 | 14 | 13 | 4 |
| STAD | 35 | 8 | 14 | 6 | 4 | 2 |
| THCA | 55 | 28 | 9 | 8 | 2 | 4 |
| UCEC | 87 | 46 | 20 | — | — | — |
The numbers in the table indicate the number of identified miRNA–disease associations included in 1, 2, 3, 4, or 5 different databases.
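The per-database counts in a summary of this kind can be tallied with a simple histogram. The sketch below uses only the seven LUSC miRNAs listed in the biological validation table, so its counts cover a subset and will not match the full LUSC row of the summary table; the variable names are illustrative.

```python
from collections import Counter

# Sources for the seven LUSC miRNAs shown in the validation table.
sources = {
    "hsa-miR-181a-5p": ["dbDEMC", "miRcancer", "PhenomiR"],
    "hsa-miR-126-5p": ["dbDEMC", "miRcancer", "miR2Disease", "PhenomiR", "HMDD"],
    "hsa-miR-140-3p": ["dbDEMC", "miRcancer", "miR2Disease", "PhenomiR", "HMDD"],
    "hsa-miR-708-5p": ["dbDEMC", "miRcancer"],
    "hsa-miR-195-5p": ["dbDEMC", "miRcancer", "miR2Disease", "PhenomiR", "HMDD"],
    "hsa-miR-30d-5p": ["dbDEMC", "miRcancer", "PhenomiR", "HMDD"],
    "hsa-miR-30a-5p": ["dbDEMC", "miRcancer", "miR2Disease", "PhenomiR", "HMDD"],
}

# Number of distinct databases that support each miRNA-disease association.
support = {mirna: len(set(dbs)) for mirna, dbs in sources.items()}

# Histogram: how many miRNAs are supported by exactly k databases.
histogram = Counter(support.values())
row = [histogram.get(k, 0) for k in range(1, 6)]  # columns for 1..5 databases
```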