Sheng Yang, Li Guo, Fang Shao, Yang Zhao, Feng Chen.
Abstract
Sequencing is widely used to discover associations between microRNAs (miRNAs) and diseases. However, the negative binomial (NB) distribution and high dimensionality of sequencing data can lead to low statistical power and poor reproducibility. Several statistical learning algorithms have been proposed to analyse sequencing data, and although evaluating these methods is essential, such studies are relatively rare. The performance of seven feature selection (FS) algorithms, namely baySeq, DESeq, edgeR, the rank sum test, lasso, particle swarm optimization decision tree (PSODT), and random forest (RF), was compared by simulation under different conditions defined by the difference of the mean, the dispersion parameter of the NB distribution, and the signal-to-noise ratio (s2n). Real data were used to evaluate the performance of RF, logistic regression, and support vector machine (SVM). Based on the simulation and real data, we discuss the behaviour of the FS and classification algorithms. The Apriori algorithm identified frequent item sets (mir-133a, mir-133b, mir-183, mir-937, and mir-96) among the deregulated miRNAs of six datasets from The Cancer Genome Atlas. Taking these findings together and considering computational memory requirements, we propose a strategy that combines edgeR and DESeq for large sample sizes.
Year: 2015 PMID: 26508990 PMCID: PMC4609795 DOI: 10.1155/2015/178572
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
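The abstract reports that the Apriori algorithm was used to find miRNAs that are deregulated together across datasets. The following is a minimal sketch of Apriori frequent item set mining, not the authors' implementation. The function name `apriori`, the `min_support` threshold, and the placeholder miRNA labels in `toy_datasets` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(items): support count} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate k-item sets: unions of frequent (k-1)-item sets whose
        # (k-1)-subsets are all frequent (the Apriori pruning rule).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Keep only candidates that reach the support threshold.
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy input: each set holds placeholder miRNA labels
# flagged as deregulated in one dataset (not data from the paper).
toy_datasets = [
    {"miR-a", "miR-b", "miR-c"},
    {"miR-a", "miR-b"},
    {"miR-a", "miR-c", "miR-d"},
    {"miR-a", "miR-b", "miR-c"},
]
result = apriori(toy_datasets, min_support=3)
```

In this toy input, the pair of `miR-b` and `miR-c` appears in only two sets, so it is dropped and no three-item candidate is generated from it.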
Parameter settings used for the simulation data.
| Scenario | Parameter | Settings |
|---|---|---|
| A1–A5 | Signal to noise (s2n) | 0.01, 0.05, 0.1, 0.15, and 0.20 |
| B1–B5 | Mean of significant variables in the case | 10, 15, 20, 25, and 30 |
| C1–C5 | Dispersion parameter of significant variables in the case | 0.125, 0.5, 1, 2, and 8 |
| | Sample size (+/−) | 40 (20/20) |
| | Number of variables | 500 |
| | Mean of significant variables in the control | 5 |
| | Mean of insignificant variables | 5 |
| | Dispersion parameter of significant variables in the control | 1 |
| | Dispersion parameter of insignificant variables | 1 |
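The parameter settings above can be turned into a small simulation sketch. This is an illustration under stated assumptions, not the authors' code: it assumes the dispersion parameter follows the convention variance = mean + dispersion × mean², and it assumes that s2n is the fraction of variables that are truly significant. Neither convention is confirmed by this record. The function names `nb_counts` and `simulate` are hypothetical.

```python
import numpy as np


def nb_counts(rng, mean, dispersion, size):
    """Draw NB counts with variance = mean + dispersion * mean**2 (assumed convention)."""
    n = 1.0 / dispersion          # NumPy's "number of successes" parameter
    p = n / (n + mean)            # success probability giving the requested mean
    return rng.negative_binomial(n, p, size=size)


def simulate(s2n=0.1, case_mean=20, case_dispersion=1.0,
             n_case=20, n_control=20, n_vars=500,
             base_mean=5, base_dispersion=1.0, seed=0):
    rng = np.random.default_rng(seed)

    # ASSUMPTION: s2n is read as the fraction of significant variables.
    n_sig = int(round(s2n * n_vars))
    is_sig = np.zeros(n_vars, dtype=bool)
    is_sig[:n_sig] = True

    # Start with every variable at the control / insignificant setting.
    counts = nb_counts(rng, base_mean, base_dispersion,
                       size=(n_case + n_control, n_vars))

    # Overwrite the significant variables in the case group only.
    counts[:n_case, :n_sig] = nb_counts(rng, case_mean, case_dispersion,
                                        size=(n_case, n_sig))

    labels = np.array([1] * n_case + [0] * n_control)  # 1 = case, 0 = control
    return counts, labels, is_sig


# Defaults correspond to s2n = 0.1, case mean = 20, case dispersion = 1.
counts, labels, is_sig = simulate()
```

Changing one argument at a time (for example `case_mean=10` or `case_dispersion=8`) reproduces the structure of the B and C scenarios under the same assumptions.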
Summary of the selected datasets.
| Number | Cancer | Feature | Sample (+/−) | SDRᵃ |
|---|---|---|---|---|
| 1 | BRCA | 903 | 206 (103/103) | 0.23 |
| 2 | HNSC | 906 | 162 (81/81) | 0.18 |
| 3 | KICH | 796 | 82 (41/41) | 0.10 |
| 4 | LUAD | 895 | 218 (109/109) | 0.24 |
| 5 | STAD | 857 | 170 (85/85) | 0.20 |
| 6 | THCA | 904 | 212 (106/106) | 0.23 |
ᵃ SDR is the ratio of the number of samples to the number of features.
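As a quick check of the SDR definition, the ratio can be recomputed from the sample and feature counts listed in the table. The dictionary layout and the helper name `sdr` are illustrative choices.

```python
# (number of samples, number of features) copied from the table above.
datasets = {
    "BRCA": (206, 903),
    "HNSC": (162, 906),
    "KICH": (82, 796),
    "LUAD": (218, 895),
    "STAD": (170, 857),
    "THCA": (212, 904),
}


def sdr(n_samples, n_features):
    """Ratio of the number of samples to the number of features."""
    return n_samples / n_features


# Round to two decimals, as in the table.
sdr_table = {name: round(sdr(n, p), 2) for name, (n, p) in datasets.items()}
```

The rounded values agree with the SDR column for all six datasets.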
Figure 1: Type I error of four statistical algorithms. (a) α = 0.05 condition. (b) Bonferroni correction.
Figure 2: Power of four statistical algorithms with different settings of three parameters. (a) Different settings of the s2n of the variables. (b) Different settings of the mean. (c) Different settings of the dispersion parameter (DP) in the case group.
Sensitivity and specificity of the seven algorithms in different settingsᵃ.
| Scenario | baySeq | baySeq (Bonᵇ) | DESeq | DESeq (Bonᵇ) | edgeR | edgeR (Bonᵇ) | Lasso | Rank sum | Rank sum (Bonᵇ) | PSODT | RF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| s2n | | | | | | | | | | | |
| A1 | 0.68/1.00 | 0.16/1.00 | 0.97/0.96 | 0.43/1.00 | 0.99/0.93 | 0.68/1.00 | 0.62/0.99 | 0.92/0.95 | 0.16/1.00 | 0.10/0.99 | 0.60/1.00 |
| A2 | 0.74/1.00 | 0.18/1.00 | 0.97/0.96 | 0.37/1.00 | 0.98/0.92 | 0.63/1.00 | 0.49/0.99 | 0.92/0.95 | 0.16/1.00 | 0.11/0.95 | 0.70/0.98 |
| A3 | 0.77/1.00 | 0.19/1.00 | 0.95/0.94 | 0.32/1.00 | 0.98/0.91 | 0.57/1.00 | 0.32/1.00 | 0.92/0.95 | 0.16/1.00 | 0.14/0.90 | 0.71/0.97 |
| A4 | 0.77/1.00 | 0.17/1.00 | 0.93/0.93 | 0.27/1.00 | 0.96/0.89 | 0.50/1.00 | 0.23/1.00 | 0.92/0.95 | 0.16/1.00 | 0.18/0.86 | 0.70/0.95 |
| A5 | 0.76/1.00 | 0.16/1.00 | 0.90/0.90 | 0.22/1.00 | 0.95/0.86 | 0.43/1.00 | 0.18/1.00 | 0.18/1.00 | 0.92/0.95 | 0.17/1.00 | 0.70/0.93 |
| Mean of significant variables | | | | | | | | | | | |
| B1 | 0.05/1.00 | 0.00/1.00 | 0.41/0.95 | 0.01/1.00 | 0.52/0.92 | 0.04/1.00 | 0.13/0.99 | 0.40/0.95 | 0.01/1.00 | 0.12/0.90 | 0.33/0.93 |
| B2 | 0.45/1.00 | 0.03/1.00 | 0.81/0.95 | 0.12/1.00 | 0.88/0.92 | 0.27/1.00 | 0.30/1.00 | 0.77/0.95 | 0.05/1.00 | 0.13/0.90 | 0.57/0.95 |
| B4 | 0.91/1.00 | 0.41/1.00 | 0.99/0.94 | 0.53/1.00 | 0.99/0.91 | 0.78/1.00 | 0.32/1.00 | 0.97/0.95 | 0.31/1.00 | 0.14/0.91 | 0.79/0.98 |
| B5 | 0.96/1.00 | 0.62/1.00 | 1.00/0.94 | 0.66/1.00 | 1.00/0.90 | 0.90/1.00 | 0.32/1.00 | 0.99/0.95 | 0.45/1.00 | 0.14/0.91 | 0.83/0.98 |
| Dispersion parameter of significant variables | | | | | | | | | | | |
| C1 | 1.00/1.00 | 0.87/1.00 | 1.00/0.92 | 0.57/1.00 | 1.00/0.89 | 0.92/1.00 | 0.37/1.00 | 1.00/0.95 | 0.97/1.00 | 0.26/0.95 | 0.97/1.00 |
| C2 | 0.89/0.99 | 0.38/0.94 | 0.98/1.00 | 0.40/0.94 | 0.99/1.00 | 0.75/0.97 | 0.35/0.93 | 1.00/1.00 | 0.61/0.96 | 0.14/0.90 | 0.86/0.90 |
| C4 | 0.71/1.00 | 0.14/1.00 | 0.90/0.95 | 0.29/1.00 | 0.92/0.92 | 0.36/1.00 | 0.24/0.99 | 0.46/0.95 | 0.01/1.00 | 0.14/0.90 | 0.52/0.95 |
| C5 | 0.73/1.00 | 0.28/1.00 | 0.73/0.96 | 0.29/1.00 | 0.71/0.93 | 0.16/1.00 | 0.00/1.00 | 0.23/0.95 | 0.00/1.00 | 0.13/0.90 | 0.44/0.94 |
ᵃ The conditions with mean = 20, dispersion parameter = 1, and s2n = 0.1 (scenarios A3, B3, and C3) are identical, so B3 and C3 are not listed separately. Each cell gives sensitivity/specificity.
ᵇ Bon indicates a result obtained with the Bonferroni correction.
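The sensitivity/specificity pairs and the effect of the Bonferroni correction can be illustrated with a short sketch. The p-values and truth labels below are hypothetical toy values, not results from the study, and the function name `sens_spec` is an illustrative choice. The sketch assumes a variable is called significant when its p-value falls below α, or below α divided by the number of tests under Bonferroni.

```python
def sens_spec(p_values, is_signal, alpha=0.05, bonferroni=False):
    """Return (sensitivity, specificity) of calls made at the given threshold."""
    m = len(p_values)
    threshold = alpha / m if bonferroni else alpha
    called = [p < threshold for p in p_values]

    tp = sum(1 for c, s in zip(called, is_signal) if c and s)
    fn = sum(1 for c, s in zip(called, is_signal) if not c and s)
    tn = sum(1 for c, s in zip(called, is_signal) if not c and not s)
    fp = sum(1 for c, s in zip(called, is_signal) if c and not s)

    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy example: 8 variables, 3 of which are truly significant.
p_values = [0.001, 0.02, 0.04, 0.30, 0.60, 0.009, 0.80, 0.50]
is_signal = [True, True, False, False, False, True, False, False]

raw = sens_spec(p_values, is_signal)                    # threshold 0.05
bon = sens_spec(p_values, is_signal, bonferroni=True)   # threshold 0.05 / 8
```

In this toy case the correction removes the single false positive but also drops two true signals, the same trade-off visible between the paired columns of the table.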
Figure 3: The frequency of selected variables of seven FS methods in the simulation. (a) Different settings of the s2n of the variables. (b) Different settings of the mean. (c) Different settings of the dispersion parameter (DP).
Figure 4: Bar plots and Venn diagrams of the number of significant miRNAs identified by different FS algorithms in six cancers. The bar plot indicates the number of significant variables. The Venn diagram illustrates the relationships of the significant variables among the six methods. (a) BRCA; (b) HNSC; (c) KICH; (d) LUAD; (e) STAD; and (f) THCA.
Summary of three classification methods using real data.
| Dataset | FS | Logistic regression PPV | Logistic regression NPV | Logistic regression AUC | RF PPV | RF NPV | RF AUC | SVM PPV | SVM NPV | SVM AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| BRCA | baySeq | 0.53 | 0.53 | 0.53 | 1.00 | 0.99 | 0.99 | 0.95 | 0.96 | 0.96 |
| | DESeq | 0.70 | 0.72 | 0.70 | 1.00 | 0.99 | 1.00 | 1.00 | 0.94 | 0.97 |
| | edgeR | 0.54 | 0.55 | 0.55 | 1.00 | 0.99 | 0.99 | 0.54 | 0.55 | 0.55 |
| | Lasso | 0.97 | 0.98 | 0.98 | 1.00 | 0.99 | 0.99 | 0.97 | 0.98 | 0.98 |
| | Rank sum | 0.55 | 0.55 | 0.55 | 1.00 | 0.99 | 0.99 | 0.55 | 0.55 | 0.55 |
| | PSODT | 0.85 | 0.86 | 0.86 | 0.99 | 0.98 | 0.98 | 0.85 | 0.86 | 0.86 |
| HNSC | baySeq | 0.35 | 0.38 | 0.37 | 0.54 | 0.56 | 0.55 | 0.63 | 0.52 | 0.58 |
| | DESeq | 0.52 | 0.57 | 0.55 | 0.53 | 0.52 | 0.52 | 0.91 | 0.47 | 0.69 |
| | edgeR | 0.32 | 0.35 | 0.33 | 0.54 | 0.54 | 0.54 | 0.32 | 0.35 | 0.33 |
| | Lasso | 0.52 | 0.76 | 0.64 | 0.55 | 0.55 | 0.55 | 0.52 | 0.76 | 0.64 |
| | Rank sum | 0.35 | 0.31 | 0.33 | 0.54 | 0.54 | 0.54 | 0.35 | 0.31 | 0.33 |
| | PSODT | 0.43 | 0.44 | 0.43 | 0.55 | 0.54 | 0.54 | 0.43 | 0.44 | 0.43 |
| KICH | baySeq | 0.36 | 0.38 | 0.37 | 0.65 | 0.66 | 0.66 | 0.68 | 0.70 | 0.69 |
| | DESeq | 0.37 | 0.39 | 0.38 | 0.66 | 0.65 | 0.66 | 0.68 | 0.84 | 0.76 |
| | edgeR | 0.40 | 0.38 | 0.39 | 0.66 | 0.65 | 0.66 | 0.40 | 0.38 | 0.39 |
| | Lasso | 0.64 | 0.82 | 0.73 | 0.65 | 0.66 | 0.65 | 0.64 | 0.82 | 0.73 |
| | Rank sum | 0.39 | 0.38 | 0.39 | 0.66 | 0.66 | 0.66 | 0.39 | 0.38 | 0.39 |
| | PSODT | 0.37 | 0.38 | 0.37 | 0.66 | 0.65 | 0.66 | 0.37 | 0.38 | 0.37 |
| LUAD | baySeq | 0.40 | 0.47 | 0.43 | 0.46 | 0.45 | 0.46 | 0.45 | 0.69 | 0.57 |
| | DESeq | 0.30 | 0.78 | 0.54 | 0.46 | 0.41 | 0.44 | 0.95 | 0.36 | 0.65 |
| | edgeR | 0.44 | 0.47 | 0.46 | 0.47 | 0.45 | 0.46 | 0.44 | 0.47 | 0.46 |
| | Lasso | 0.47 | 0.74 | 0.61 | 0.47 | 0.45 | 0.46 | 0.47 | 0.74 | 0.61 |
| | Rank sum | 0.30 | 0.36 | 0.33 | 0.47 | 0.45 | 0.46 | 0.30 | 0.36 | 0.33 |
| | PSODT | 0.36 | 0.50 | 0.43 | 0.47 | 0.45 | 0.46 | 0.36 | 0.50 | 0.43 |
| STAD | baySeq | 0.42 | 0.56 | 0.49 | 0.44 | 0.45 | 0.44 | 0.44 | 0.63 | 0.54 |
| | DESeq | 0.14 | 0.85 | 0.49 | 0.41 | 0.38 | 0.40 | 0.91 | 0.25 | 0.58 |
| | edgeR | 0.37 | 0.42 | 0.40 | 0.49 | 0.46 | 0.47 | 0.37 | 0.42 | 0.40 |
| | Lasso | 0.43 | 0.77 | 0.60 | 0.46 | 0.46 | 0.46 | 0.43 | 0.77 | 0.60 |
| | Rank sum | 0.40 | 0.48 | 0.44 | 0.44 | 0.46 | 0.45 | 0.40 | 0.48 | 0.44 |
| | PSODT | 0.36 | 0.44 | 0.44 | 0.44 | 0.46 | 0.45 | 0.36 | 0.44 | 0.40 |
| THCA | baySeq | 0.49 | 0.63 | 0.56 | 0.56 | 0.57 | 0.57 | 0.77 | 0.50 | 0.63 |
| | DESeq | 0.49 | 0.85 | 0.67 | 0.54 | 0.58 | 0.56 | 0.54 | 0.82 | 0.68 |
| | edgeR | 0.53 | 0.59 | 0.56 | 0.56 | 0.60 | 0.58 | 0.53 | 0.59 | 0.56 |
| | Lasso | 0.54 | 0.88 | 0.71 | 0.56 | 0.59 | 0.57 | 0.54 | 0.88 | 0.71 |
| | Rank sum | 0.44 | 0.44 | 0.44 | 0.57 | 0.58 | 0.58 | 0.44 | 0.44 | 0.44 |
| | PSODT | 0.48 | 0.56 | 0.52 | 0.56 | 0.56 | 0.56 | 0.48 | 0.56 | 0.52 |
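The three metrics reported in the classification table (PPV, NPV, and AUC) can be computed as in the sketch below. It uses the standard definitions and hypothetical toy labels and scores, since the record does not say how the authors computed them. The function names `ppv_npv` and `auc` and the 0.5 decision cut-off are illustrative assumptions.

```python
def ppv_npv(y_true, y_pred):
    """Positive and negative predictive value from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fp), tn / (tn + fn)


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy example (not data from the study).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

ppv, npv = ppv_npv(y_true, y_pred)
area = auc(y_true, scores)
```

PPV and NPV depend on the chosen cut-off, whereas the AUC is computed from the ranking of the scores alone.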