Thanh Nguyen, Asim Bhatti, Samuel Yang, Saeid Nahavandi.
Abstract
This paper introduces an approach to classifying RNA-seq read counts using grey relational analysis (GRA) and Bayesian Gaussian process (GP) models. Read counts are transformed to microarray-like data so that normal-based statistical methods can be applied. GRA selects differentially expressed genes by integrating the outcomes of five individual feature selection methods: the two-sample t-test, entropy test, Bhattacharyya distance, Wilcoxon test and receiver operating characteristic curve. GRA acts as an aggregate filter method that combines the advantages of the individual methods to produce significant feature subsets, which are then fed into a nonparametric GP model for classification. The proposed approach is verified on two real benchmark datasets using five-fold cross-validation. Experimental results show that the GRA-based feature selection method and the GP classifier outperform their competing methods. Moreover, GRA-GP considerably outperforms the sparse Poisson linear discriminant analysis classifiers, which were introduced specifically for read counts, across different numbers of features. The proposed approach can therefore be applied effectively in practice to read count data analysis, which is useful in many applications including understanding disease pathogenesis, diagnosis and treatment monitoring at the molecular level.
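The aggregation step described in the abstract can be illustrated with a minimal sketch of textbook grey relational analysis. It assumes each gene already has one score per criterion, with larger meaning more discriminative. The function names, the toy scores and the distinguishing coefficient `rho = 0.5` are illustrative assumptions, not values taken from the paper, and the authors' exact GRA variant may differ.

```python
def min_max_normalize(column):
    """Scale one criterion's scores to [0, 1] (larger = more discriminative)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [1.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def grey_relational_grades(scores, rho=0.5):
    """Return one grey relational grade per gene.

    scores[i][k] is the score of gene i under criterion k.
    rho is the distinguishing coefficient (0.5 is a common default, assumed here).
    """
    n_genes, n_criteria = len(scores), len(scores[0])
    columns = [min_max_normalize([row[k] for row in scores])
               for k in range(n_criteria)]
    # Deviation of each gene from the ideal reference sequence (1.0 everywhere).
    deltas = [[1.0 - columns[k][i] for k in range(n_criteria)]
              for i in range(n_genes)]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:  # every gene is identical to the reference
        return [1.0] * n_genes
    grades = []
    for row in deltas:
        # Grey relational coefficient for each criterion, then their mean.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / n_criteria)
    return grades


def select_top_genes(scores, n_select, rho=0.5):
    """Indices of the n_select genes with the highest grades."""
    grades = grey_relational_grades(scores, rho)
    order = sorted(range(len(grades)), key=lambda i: grades[i], reverse=True)
    return order[:n_select]


# Hypothetical scores for 4 genes under 5 criteria (made-up numbers).
toy_scores = [
    [5.0, 0.9, 1.2, 4.8, 0.95],  # best under every criterion
    [1.0, 0.2, 0.3, 1.1, 0.55],  # worst under every criterion
    [3.0, 0.6, 0.8, 3.0, 0.80],
    [4.0, 0.4, 1.0, 2.0, 0.70],
]
top_two = select_top_genes(toy_scores, n_select=2)
```

In this toy input the gene that leads on every criterion receives a grade of 1, and the gene that trails on every criterion receives 1/3, which is the lowest grade possible with `rho = 0.5`. The selected indices would then define the feature subset passed to a classifier.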
Year: 2016 PMID: 27783633 PMCID: PMC5082617 DOI: 10.1371/journal.pone.0164766
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Basic steps of a typical RNA-seq experiment.
Fig 2. Proposed methodology for analyzing RNA-seq count data.
Summary of RNA-seq datasets.
| Datasets | Features | Samples | Classes |
|---|---|---|---|
| Mont-Pick | 12,984 genes | 129 | CEU/YRI |
| Cervical cancer | 714 microRNAs | 58 | tumor/non-tumor |
Fig 3. Heat maps showing expression levels of the Mont-Pick dataset (a) before and (b) after voom transform.
Fig 4. Heat maps of the expression levels in the cervical cancer dataset (a) before and (b) after voom transform.
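The voom transform is provided by the limma R package. Its first step maps raw counts to log2 counts per million, with small offsets so that zero counts stay finite. The sketch below covers only that log-CPM step in plain Python. It omits the precision weights that voom estimates from the mean-variance trend, and the function name `log_cpm` and the toy matrix are illustrative assumptions rather than the authors' code.

```python
import math


def log_cpm(counts):
    """Log2 counts per million with voom-style offsets.

    counts[g][s] is the raw read count of gene g in sample s.
    Each value becomes log2((count + 0.5) / (library_size + 1) * 1e6).
    """
    n_samples = len(counts[0])
    # Library size = total reads observed in each sample (column sums).
    lib_sizes = [sum(row[s] for row in counts) for s in range(n_samples)]
    return [
        [math.log2((row[s] + 0.5) / (lib_sizes[s] + 1.0) * 1e6)
         for s in range(n_samples)]
        for row in counts
    ]


# Hypothetical 2-gene x 2-sample count matrix (made-up numbers).
toy_counts = [
    [0, 10],
    [999, 989],
]
transformed = log_cpm(toy_counts)
```

After this step the values are continuous and roughly comparable across samples with different sequencing depths, which is what allows normal-based methods such as the two-sample t-test to be applied.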
Fig 5. Distribution of data samples of the (a) Mont-Pick dataset and (b) cervical cancer dataset.
Results of feature selection methods using the Mont-Pick dataset (batch effect is addressed due to potentially different facilities).
| Metrics | ReliefF | Simba | SNR | IG | GRA |
|---|---|---|---|---|---|
| Accuracy | 93.78±0.80 (0.003) | 84.81±1.26 (0.000) | 96.05±0.49 ( | 90.80±1.22 (0.000) | 96.77±0.71 |
| F-measure | 95.44±0.82 (0.022) | 86.24±1.27 (0.000) | 94.87±0.66 (0.003) | 91.32±1.54 (0.001) | 97.64±0.48 |
| AUC | 95.16±0.69 (0.005) | 86.76±1.13 (0.000) | 95.39±0.65 (0.031) | 90.92±1.21 (0.000) | 97.43±0.51 |
| MI | 75.89±3.15 (0.012) | 42.99±3.60 (0.000) | 78.88±2.46 (0.018) | 72.68±4.73 (0.008) | 88.56±2.71 |
Results of feature selection methods using the cervical cancer dataset.
| Metrics | ReliefF | Simba | SNR | IG | GRA |
|---|---|---|---|---|---|
| Accuracy | 88.33±1.54 (0.013) | 87.35±2.02 (0.029) | 88.23±1.65 (0.031) | 90.05±1.80 ( | 93.43±1.28 |
| F-measure | 87.92±1.67 (0.023) | 84.95±2.74 (0.042) | 87.67±1.70 (0.020) | 90.77±1.58 ( | 92.91±1.57 |
| AUC | 88.23±1.47 (0.006) | 87.61±1.98 (0.022) | 89.21±1.58 (0.043) | 91.89±1.38 ( | 94.07±1.22 |
| MI | 58.48±4.60 (0.024) | 57.07±6.08 (0.044) | 57.83±4.81 ( | 66.16±5.10 ( | 73.96±4.85 |
Comparisons of classifiers using the Mont-Pick dataset (batch effect is addressed due to potentially different facilities).
| Metrics | kNN | MLP | SVM | AdaBoost | GP |
|---|---|---|---|---|---|
| Accuracy | 95.16±0.57 (0.019) | 95.15±0.70 (0.038) | 95.01±0.89 ( | 94.50±0.72 (0.026) | 96.77±0.71 |
| F-measure | 96.09±0.52 (0.019) | 94.19±0.87 (0.001) | 93.97±0.94 (0.002) | 94.99±0.66 (0.009) | 97.64±0.48 |
| AUC | 96.78±0.52 ( | 94.93±0.55 (0.000) | 94.50±0.96 (0.012) | 95.30±0.64 (0.014) | 97.43±0.51 |
| MI | 76.46±2.56 (0.007) | 80.06±2.53 (0.018) | 81.60±3.07 ( | 76.62±3.24 (0.028) | 88.56±2.71 |
Comparisons of classifiers using the cervical cancer dataset.
| Metrics | kNN | MLP | SVM | AdaBoost | GP |
|---|---|---|---|---|---|
| Accuracy | 88.11±1.53 (0.012) | 85.33±2.00 (0.002) | 87.42±2.21 (0.045) | 88.38±1.32 (0.017) | 93.43±1.28 |
| F-measure | 87.63±1.67 (0.016) | 86.18±1.93 (0.011) | 83.29±3.40 (0.019) | 87.60±1.34 (0.005) | 92.91±1.57 |
| AUC | 88.67±1.40 (0.007) | 87.22±1.77 (0.006) | 88.84±2.19 ( | 88.90±1.34 (0.010) | 94.07±1.22 |
| MI | 57.49±4.53 (0.028) | 51.53±5.24 (0.003) | 60.29±5.43 ( | 57.47±4.08 (0.030) | 73.96±4.85 |
Fig 6. Comparisons of GRA-GP method with sPLDA classifiers using the Mont-Pick dataset.
Fig 7. Comparisons of GRA-GP method with sPLDA classifiers using the cervical cancer dataset.