Katsuhiro Omae, Osamu Komori, Shinto Eguchi.
Abstract
BACKGROUND: Detection of disease-associated markers plays a crucial role in gene screening for biological studies. Two-sample test statistics, such as the t-statistic, are widely used to rank genes based on gene expression data. However, the resultant gene ranking is often not reproducible among different data sets. Such irreproducibility may be caused by disease heterogeneity.
Keywords: Gene expression analysis; Genes screening; Heterogeneity; Subsampling method; Two-sample test; U-statistic
Year: 2016 PMID: 27538512 PMCID: PMC4991096 DOI: 10.1186/s12920-016-0214-5
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Scores for two independent data sets obtained using the t-statistic. Relative importance is evaluated based on the absolute values. Red points under the vertical line denote the genes whose signs are mismatched between the two data sets
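The comparison described in this caption can be sketched in a few lines. The following is a minimal illustration on simulated data, not the authors' code: it computes a Welch t-statistic per gene in two independent toy data sets, then counts how many top-ranked genes are shared and how many genes have mismatched signs. The helper names (`welch_t`, `t_scores`, `top_k`, `simulate`) and all simulation settings are assumptions made for this example.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample Welch t-statistic for two lists of expression values."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def t_scores(disease, normal):
    """One t-statistic per gene; each input is a list of per-sample gene vectors."""
    n_genes = len(disease[0])
    return [welch_t([s[g] for s in disease], [s[g] for s in normal])
            for g in range(n_genes)]


def top_k(scores, k):
    """Indices of the k genes with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
    return set(order[:k])


def simulate(rng, n_per_group, n_genes, n_informative, shift):
    """Hypothetical toy data: the first n_informative genes are shifted in disease."""
    disease = [[rng.gauss(shift if g < n_informative else 0.0, 1.0)
                for g in range(n_genes)] for _ in range(n_per_group)]
    normal = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
              for _ in range(n_per_group)]
    return disease, normal


rng = random.Random(0)
# Two independent data sets drawn from the same (assumed) model.
scores_a = t_scores(*simulate(rng, 20, 200, 20, 1.0))
scores_b = t_scores(*simulate(rng, 20, 200, 20, 1.0))

k = 20
overlap = len(top_k(scores_a, k) & top_k(scores_b, k))
sign_mismatch = sum(1 for a, b in zip(scores_a, scores_b) if a * b < 0)
print(f"top-{k} overlap: {overlap}, sign-mismatched genes: {sign_mismatch}")
```

The toy model here is homogeneous, so it only shows how the overlap and sign mismatch are measured; it does not reproduce the heterogeneity effect discussed in the abstract.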
Fig. 2 Difference in the 95 % upper confidence limits between homo and hetero genes. Each bound is evaluated by (4) when , and τ_1 = τ_0 = 0.5
The number of homo genes in the top 100 ranking, obtained using the t and sign-sum statistics. Means (sd) over 100 repetitions are reported for each situation and sample size
| | | | | | |
| Situation I | 50.0 (4.12) | 49.7 (3.67) | 61.7 (3.20) | 80.8 (2.56) | 83.1 (2.36) |
| Situation II | 49.9 (3.50) | 49.3 (3.49) | 48.9 (3.19) | 56.5 (3.16) | 57.8 (3.23) |
| Situation III | 49.7 (3.09) | 49.4 (3.31) | 72.1 (2.77) | 73.3 (3.12) | 72.4 (3.14) |
| | | | | | |
| Situation I | 49.9 (4.19) | 49.9 (4.08) | 75.2 (2.98) | 97.3 (1.20) | 98.2 (1.00) |
| Situation II | 50.1 (3.58) | 49.9 (3.86) | 47.6 (3.48) | 64.0 (3.07) | 67.2 (3.13) |
| Situation III | 50.2 (3.65) | 50.3 (3.82) | 90.7 (2.19) | 92.6 (2.04) | 92.4 (1.98) |
The two small subscripts on each statistic denote the sampling sizes from the disease and normal groups, in that order
Fig. 3 Scores obtained using the t-statistic, Wilcoxon rank-sum statistic, and sign-sum statistic with two sampling sizes. From left to right, n=200 (upper), n=1000 (lower). The horizontal axis denotes the gene indices. The vertical axis denotes the ranking score. A high score indicates relative importance for discrimination of class labels. The first 100 genes are the homo informative genes, the next 100 are the hetero informative genes, and the last 800 genes are the non-informative genes
Reproducibility and ORRS: values are mean (sd), evaluated over 100 random separations of the full data
| | | | | | |
| Breast cancer data | 3.78 (1.92) | 3.68 (1.99) | 4.33 (2.13) | 6.70 (2.77) | 7.39 (3.37) |
| Cohort data | 23.7 (4.94) | 23.5 (4.86) | 27.5 (5.70) | 43.4 (5.39) | 42.6 (5.28) |
| Prostate cancer data | 33.4 (4.29) | 32.6 (4.69) | 39.6 (4.76) | 31.4 (4.64) | 29.4 (4.23) |
| Breast cancer data2 | 1.39 (1.35) | 1.42 (1.31) | 1.00 (1.10) | 3.33 (1.80) | 3.82 (1.91) |
| Leukemia data | 32.3 (4.47) | 31.8 (4.41) | 34.2 (4.59) | 37.4 (4.30) | 37.1 (4.22) |
| | | | | | |
| Breast cancer data | 2.20 (1.11) | 2.14 (1.16) | 2.52 (1.24) | 3.90 (1.62) | 4.30 (1.96) |
| Cohort data | 2.02 (0.42) | 2.01 (0.42) | 2.35 (0.49) | 3.71 (0.46) | 3.63 (0.45) |
| Prostate cancer data | 20.0 (2.57) | 19.5 (2.81) | 23.7 (2.85) | 18.8 (2.78) | 17.6 (2.54) |
| Breast cancer data2 | 2.73 (2.64) | 2.78 (2.57) | 1.96 (2.16) | 6.53 (3.53) | 7.49 (3.75) |
| Leukemia data | 19.3 (2.68) | 19.1 (2.64) | 20.5 (2.75) | 22.4 (2.57) | 22.2 (2.53) |
Test AUC for four real data sets: each predictor is constructed by the DLDA rule. Values are mean (sd), evaluated over 100 random separations of the full data
| Breast cancer data | AUC of the test data by DLDA | | | | |
| 10 genes | 0.698 (0.058) | 0.698 (0.058) | 0.698 (0.060) | 0.684 (0.065) | 0.679 (0.069) |
| 50 genes | 0.705 (0.046) | 0.705 (0.047) | 0.707 (0.050) | 0.712 (0.051) | 0.712 (0.050) |
| 100 genes | 0.711 (0.045) | 0.710 (0.045) | 0.712 (0.047) | 0.718 (0.045) | 0.718 (0.045) |
| Cohort data | AUC of the test data by DLDA | | | | |
| 10 genes | 0.744 (0.061) | 0.743 (0.061) | 0.751 (0.064) | 0.755 (0.064) | 0.771 (0.063) |
| 50 genes | 0.779 (0.057) | 0.778 (0.057) | 0.773 (0.053) | 0.784 (0.053) | 0.789 (0.054) |
| 100 genes | 0.782 (0.056) | 0.781 (0.057) | 0.778 (0.054) | 0.781 (0.052) | 0.784 (0.051) |
| Prostate cancer data | AUC of the test data by DLDA | | | | |
| 10 genes | 0.835 (0.025) | 0.835 (0.026) | 0.832 (0.025) | 0.823 (0.027) | 0.808 (0.032) |
| 50 genes | 0.846 (0.023) | 0.845 (0.024) | 0.847 (0.022) | 0.836 (0.029) | 0.829 (0.033) |
| 100 genes | 0.844 (0.024) | 0.842 (0.025) | 0.848 (0.021) | 0.829 (0.030) | 0.822 (0.033) |
| Breast cancer data2 | AUC of the test data by DLDA | | | | |
| 10 genes | 0.612 (0.044) | 0.614 (0.041) | 0.611 (0.043) | 0.595 (0.042) | 0.581 (0.044) |
| 50 genes | 0.634 (0.040) | 0.633 (0.040) | 0.630 (0.042) | 0.623 (0.039) | 0.619 (0.040) |
| 100 genes | 0.637 (0.040) | 0.636 (0.038) | 0.636 (0.416) | 0.630 (0.037) | 0.626 (0.038) |
| Leukemia data | AUC of the test data by DLDA | | | | |
| 10 genes | 0.981 (0.017) | 0.982 (0.017) | 0.986 (0.014) | 0.991 (0.012) | 0.990 (0.016) |
| 50 genes | 0.992 (0.014) | 0.992 (0.014) | 0.988 (0.013) | 0.994 (0.008) | 0.994 (0.008) |
| 100 genes | 0.992 (0.016) | 0.991 (0.017) | 0.989 (0.013) | 0.995 (0.006) | 0.995 (0.009) |
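How a test AUC of this kind can be produced is sketched below, under stated assumptions. DLDA (diagonal linear discriminant analysis) is implemented here in its usual textbook form, with per-gene class means and a pooled per-gene variance. Genes are ranked on the training half by absolute standardized mean difference, which stands in for the ranking statistics compared in the tables. The AUC is the Mann-Whitney estimate. The data are simulated, only one random separation is shown rather than 100, and all function names are hypothetical.

```python
import random
from statistics import mean


def fit_dlda(disease, normal):
    """Per-gene class means and pooled per-gene variance (diagonal covariance)."""
    n1, n0, p = len(disease), len(normal), len(disease[0])
    mu1 = [mean(s[g] for s in disease) for g in range(p)]
    mu0 = [mean(s[g] for s in normal) for g in range(p)]
    var = []
    for g in range(p):
        ss = sum((s[g] - mu1[g]) ** 2 for s in disease)
        ss += sum((s[g] - mu0[g]) ** 2 for s in normal)
        var.append(ss / (n1 + n0 - 2))
    return mu1, mu0, var


def rank_genes(model):
    """Order genes by absolute standardized mean difference, largest first."""
    mu1, mu0, var = model
    return sorted(range(len(var)),
                  key=lambda g: abs(mu1[g] - mu0[g]) / var[g] ** 0.5,
                  reverse=True)


def dlda_score(x, model, genes):
    """Linear discriminant score on the selected genes; larger means more disease-like."""
    mu1, mu0, var = model
    return sum((x[g] - (mu1[g] + mu0[g]) / 2) * (mu1[g] - mu0[g]) / var[g]
               for g in genes)


def auc(pos, neg):
    """Mann-Whitney estimate of P(pos > neg), counting ties as one half."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def toy_samples(rng, n, n_genes, n_informative, shift):
    """Hypothetical toy data: the first n_informative genes have mean `shift`."""
    return [[rng.gauss(shift if g < n_informative else 0.0, 1.0)
             for g in range(n_genes)] for _ in range(n)]


rng = random.Random(1)
disease = toy_samples(rng, 60, 100, 10, 1.5)
normal = toy_samples(rng, 60, 100, 10, 0.0)

# One random separation of the full data into training and test halves.
rng.shuffle(disease)
rng.shuffle(normal)
train_d, test_d = disease[:30], disease[30:]
train_n, test_n = normal[:30], normal[30:]

model = fit_dlda(train_d, train_n)   # fitted on training data only
genes = rank_genes(model)[:10]       # top 10 genes, also chosen on training data
test_auc = auc([dlda_score(x, model, genes) for x in test_d],
               [dlda_score(x, model, genes) for x in test_n])
print(f"test AUC with {len(genes)} genes: {test_auc:.3f}")
```

Selecting genes and fitting the rule on the training half alone keeps the test AUC free of selection bias. Repeating the separation and averaging would give a mean (sd) in the style of the tables above.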