Farzaneh Hamidi, Neda Gilani, Reza Arabi Belaghi, Parvin Sarbakhsh, Tuba Edgünlü, Pasqualina Santaguida.
Abstract
Ovarian cancer is the second most dangerous gynecologic cancer with a high mortality rate. Classifying samples from high-dimensional, small-sample gene expression data is a challenging task. The discovery of miRNAs, small non-coding RNAs 18-25 nucleotides in length that regulate gene expression, has revealed a new layer of gene regulation, and miRNAs have been reported to play an important role in cancer. Using LASSO and Elastic Net as embedded feature selection algorithms, the present study identified 10 miRNAs that were differentially regulated in serum samples from ovarian cancer patients compared to non-cancer samples in the publicly available dataset GSE106817: hsa-miR-5100, hsa-miR-6800-5p, hsa-miR-1233-5p, hsa-miR-4532, hsa-miR-4783-3p, hsa-miR-4787-3p, hsa-miR-1228-5p, hsa-miR-1290, hsa-miR-3184-5p, and hsa-miR-320b. Further, we implemented state-of-the-art machine learning classifiers (logistic regression, random forest, artificial neural network, XGBoost, and decision trees) to build clinical prediction models. Next, the diagnostic performance of these models with the identified miRNAs was evaluated in the internal (GSE106817) and external (GSE113486) validation datasets by ROC analysis. The results showed that the first four prediction models consistently yielded an AUC of 100%. Our findings provide significant evidence that the serum miRNA profile represents a promising diagnostic biomarker for ovarian cancer.
Keywords: Biomarker; Elasticnet; Feature Selection; Gene Expression Omnibus (GEO); Lasso; Machine Learning; Ovarian Cancer
Year: 2021 PMID: 34899827 PMCID: PMC8656459 DOI: 10.3389/fgene.2021.724785
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of miRNA genes shown to be statistically significantly associated with ovarian cancer.
| Reference | Association | Up-regulated miRNA | Down-regulated miRNA |
|---|---|---|---|
| | Epithelial ovarian cancer | miR-6131, miR-1305, miR-197-3p, and miR-3651 | miR-3135b, miR-4430, miR-664b-5p, and miR-766-3p |
| | Serous ovarian cancer | miR-16, miR-20a, miR-21, and miR-27a | miR-145, miR-125B, miR-125B, and miR-100 |
| | Epithelial ovarian cancer and normal | miR-200a, miR-141, miR-200c, miR-200b, miR-182, and miR-205 | miR-127, miR-140, miR-9, miR-101, miR-147, miR-204, miR-211, miR-124a, and miR-302b |
FIGURE 1. Flowchart of feature selection and model building in the study.
miRNAs identified with an importance above the 80% threshold by LASSO and Elastic Net in the dataset GSE106817, with miRNA status.
| miRNA ID | Importance in Elastic Net (%) | Importance in LASSO (%) | adj. p-value | B | logFC | miRNA status |
|---|---|---|---|---|---|---|
| hsa-miR-5100 | 100 | 100 | <0.001 | 16.18 | 4.15 | Upregulated |
| hsa-miR-1290 | 100 | 100 | <0.001 | 13.00 | 5.61 | Upregulated |
| hsa-miR-320b | — | 88.07 | <0.001 | 12.25 | 4.11 | Upregulated |
| hsa-miR-1233-5p | 85.63 | 87.81 | <0.001 | 11.78 | 2.36 | Upregulated |
| hsa-miR-4783-3p | 100 | 87.44 | <0.001 | 10.36 | 2.89 | Upregulated |
| hsa-miR-6800-5p | — | 84.07 | <0.001 | 8.66 | −1.60 | Downregulated |
| hsa-miR-4532 | 85.51 | — | <0.001 | 6.95 | 2.90 | Upregulated |
| hsa-miR-3184-5p | 83.33 | — | <0.001 | 5.29 | −3.23 | Downregulated |
| hsa-miR-4787-3p | 100 | — | <0.001 | 3.82 | 2.30 | Upregulated |
| hsa-miR-1228-5p | 88.83 | — | <0.001 | 2.03 | −0.93 | Downregulated |
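The selection step behind this table can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes that "importance" is the absolute model coefficient scaled so that the largest equals 100, and that a miRNA is kept when it exceeds the 80% threshold in at least one of the two models (the table lists miRNAs with a value in only one column). The function names and the coefficient values are hypothetical.

```python
from typing import Dict, Set


def scaled_importance(coefs: Dict[str, float]) -> Dict[str, float]:
    """Scale absolute coefficients to 0-100 so the largest one equals 100.

    Assumption: importance is derived from |coefficient|; the source does
    not state how the importance values were computed.
    """
    abs_coefs = {name: abs(value) for name, value in coefs.items()}
    largest = max(abs_coefs.values())
    if largest == 0:
        # All coefficients were shrunk to zero: nothing is important.
        return {name: 0.0 for name in abs_coefs}
    return {name: 100.0 * value / largest for name, value in abs_coefs.items()}


def select_features(coefs: Dict[str, float], threshold: float = 80.0) -> Set[str]:
    """Return the features whose scaled importance exceeds the threshold."""
    importance = scaled_importance(coefs)
    return {name for name, value in importance.items() if value > threshold}


# Hypothetical coefficients for illustration only, NOT values from the study.
lasso_coefs = {"miR-A": 2.0, "miR-B": -1.8, "miR-C": 0.4, "miR-D": 0.0}
enet_coefs = {"miR-A": 1.5, "miR-B": 0.9, "miR-C": -1.3, "miR-D": 0.1}

# Keep a feature if either model ranks it above the threshold (assumed rule).
selected = select_features(lasso_coefs) | select_features(enet_coefs)
print(sorted(selected))  # prints ['miR-A', 'miR-B', 'miR-C']
```

Taking the union rather than the intersection is a design choice in this sketch; the intersection would keep only features that both penalized models rank highly.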
Predictive power of models for ovarian cancer classification and prediction in the external (GSE113486) validation data.
| Classifier | Hyperparameters | AUC (%) | Accuracy (%) | Sensitivity (%) | Specificity (%) | Negative predictive value (%) | Positive predictive value (%) | Kappa (%) |
|---|---|---|---|---|---|---|---|---|
| LR | Parameters | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| DT | Cp | 92.60 | 91.30 | 92.50 | 90.38 | 88.10 | 94 | 82.41 |
| RF | Mtry | 100 | 97.83 | 95 | 100 | 100 | 96.30 | 95.55 |
| ANN | Size | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| XGB | nrounds = 50, max_depth | 100 | 98.91 | 97.50 | 100 | 100 | 98.11 | 97.78 |
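For readers who want to see how the threshold-based metrics in the table relate to each other, the sketch below derives them from a 2x2 confusion matrix using the standard definitions. The function name and the example counts are hypothetical and are not taken from the study; which class the authors treated as "positive" is not stated in this record.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics as fractions in [0, 1].

    tp/fn are counts for the positive class, fp/tn for the negative class.
    Assumes every denominator below is non-zero.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only, NOT data from the study.
metrics = classification_metrics(tp=45, fn=5, fp=10, tn=40)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Multiplying each fraction by 100 gives percentages in the same form as the table.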
The area under the receiver operating characteristic curve (maximum) was used to select the optimal model.
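The logistic regression formula for predicting ovarian cancer is

$$
p = \left(1 + e^{-z}\right)^{-1},
$$

where

$$
\begin{aligned}
z ={}& 14.19 - 40.34\,(\text{hsa-miR-6800-5p}) + 3.61\,(\text{hsa-miR-1228-5p}) + 16.09\,(\text{hsa-miR-5100}) \\
& + 2.86\,(\text{hsa-miR-1290}) + 4.17\,(\text{hsa-miR-4783-3p}) - 8.9\,(\text{hsa-miR-3184-5p}) + 8\,(\text{hsa-miR-320b}) \\
& + 9.23\,(\text{hsa-miR-4532}) - 4.2\,(\text{hsa-miR-4787-3p}) - 0.65\,(\text{hsa-miR-1233-5p}).
\end{aligned}
$$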
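As a rough illustration, the logistic regression formula in this footnote can be evaluated as in the sketch below. The intercept and coefficients are copied from the footnote; the function name, the dictionary layout, and the use of the miRNA IDs from the table as keys are assumptions made here. The record does not state how expression values were scaled or preprocessed, so the inputs are assumed to be on the same scale the model was fitted on.

```python
import math
from typing import Mapping

# Intercept and coefficients as given in the footnote.
INTERCEPT = 14.19
COEFFICIENTS = {
    "hsa-miR-6800-5p": -40.34,
    "hsa-miR-1228-5p": 3.61,
    "hsa-miR-5100": 16.09,
    "hsa-miR-1290": 2.86,
    "hsa-miR-4783-3p": 4.17,
    "hsa-miR-3184-5p": -8.9,
    "hsa-miR-320b": 8.0,
    "hsa-miR-4532": 9.23,
    "hsa-miR-4787-3p": -4.2,
    "hsa-miR-1233-5p": -0.65,
}


def predict_probability(expression: Mapping[str, float]) -> float:
    """Return p = 1 / (1 + exp(-z)) for the fitted linear predictor z.

    `expression` maps each of the 10 miRNA IDs to its expression value
    (assumed to be preprocessed the same way as the training data).
    """
    z = INTERCEPT + sum(
        coef * expression[name] for name, coef in COEFFICIENTS.items()
    )
    # Numerically stable logistic function: avoid exp() of a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical input for illustration only: all expression values set to zero,
# so the result depends on the intercept alone.
baseline = {name: 0.0 for name in COEFFICIENTS}
print(predict_probability(baseline))
```

The large magnitude of several coefficients means small changes in the corresponding expression values move the predicted probability close to 0 or 1.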
The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. If the cost of adding an additional variable to the decision tree from the current node is above the value of the cp, then tree building does not continue.
mtry is the number of variables randomly sampled as candidates for splitting at each tree node; this is the name used for the parameter in the random forests literature.
Size is the number of units in a hidden layer.
Decay is the regularization parameter used to avoid over-fitting.
max_depth is the maximum depth of a tree. It is used to control over-fitting, because a higher depth allows the model to learn relations very specific to a particular sample.
gamma specifies the minimum loss reduction required to make a split: a node is split only when the resulting split gives a positive reduction in the loss function. Higher values make the algorithm more conservative. Suitable values vary with the loss function and should be tuned.
colsample_bytree denotes the fraction of columns to be randomly sampled for each tree.
min_child_weight is used to control over-fitting. Higher values prevent a model from learning relations that might be highly specific to the particular sample selected for a tree. Values that are too high can lead to under-fitting; hence, it should be tuned using cross-validation (CV).
subsample is the fraction of observations randomly sampled for each tree. Lower values make the algorithm more conservative and prevent over-fitting, but values that are too small might lead to under-fitting.
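The roles of gamma and min_child_weight described in these footnotes can be made concrete with the split criterion published for gradient-boosted trees in the XGBoost paper. The sketch below is a simplified illustration of that criterion, not the library's implementation; the function names, the L2 term `reg_lambda`, and the gradient and Hessian sums in the example are hypothetical.

```python
def split_gain(
    g_left: float,
    h_left: float,
    g_right: float,
    h_right: float,
    reg_lambda: float = 1.0,
    gamma: float = 0.0,
) -> float:
    """Loss reduction of a candidate split, minus the gamma penalty.

    g_* and h_* are the sums of first and second derivatives of the loss
    over the observations falling into the left and right child.
    """
    def score(g: float, h: float) -> float:
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def should_split(
    g_left: float,
    h_left: float,
    g_right: float,
    h_right: float,
    reg_lambda: float = 1.0,
    gamma: float = 0.0,
    min_child_weight: float = 1.0,
) -> bool:
    """Accept a split only if both children are heavy enough and the gain is positive."""
    # min_child_weight: each child needs at least this much summed Hessian.
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    # gamma: the loss reduction must exceed gamma for the split to be kept.
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# Hypothetical gradient statistics for illustration only.
stats = dict(g_left=-4.0, h_left=3.0, g_right=5.0, h_right=4.0)
print(should_split(**stats, gamma=0.0))             # prints True
print(should_split(**stats, gamma=5.0))             # prints False
print(should_split(**stats, min_child_weight=3.5))  # prints False
```

Raising either parameter rejects more candidate splits, which is why both are described as making the model more conservative.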
FIGURE 2. Boxplots of the 10 identified miRNAs in ovarian cancer patients compared with the non-cancer control patients in the dataset GSE106817.
FIGURE 3. Heatmap of hierarchical clustering analysis using the 10 identified miRNAs to distinguish different samples in the dataset GSE106817.
FIGURE 4. Diagnostic performance of the 10 identified serum miRNA signatures in the internal (GSE106817) data.
FIGURE 5. AUC of proposed models of all identified microRNAs in the internal (GSE106817) validation data.
FIGURE 6. The miRNA network with target genes.
FIGURE 7. Validation of the diagnostic performance of the three selected miRNA high signatures in the external (GSE113486) data.