Jianjun Hu, Jia Xu.
Abstract
MOTIVATION: Identification of differentially expressed genes (DEGs) from microarray datasets is one of the most important analyses in microarray data mining. Popular algorithms such as the statistical t-test rank genes based on a single statistic. The false positive rate of these methods can be reduced by considering other features of differentially expressed genes.
Year: 2010 PMID: 21047384 PMCID: PMC2975422 DOI: 10.1186/1471-2164-11-S2-S3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Distribution of true DEGs in the boundary regions of the AG-AD feature space for four datasets. Most DEGs are located in the boundary regions. Restricting the candidate list to boundary genes therefore has the potential to improve the power of gene ranking methods such as the t-test for DEG identification.
17 datasets with 284 DEGs in total. Each dataset has 22283 genes.
| Dataset | Condition A | Condition B | True DEG |
|---|---|---|---|
| GSE1462 | 4 | 4 | 4 |
| GSE1615_1 | 4 | 5 | 8 |
| GSE1650 | 18 | 12 | 8 |
| GSE2666_2 | 5 | 5 | 6 |
| GSE3524 | 16 | 4 | 4 |
| GSE3860 | 9 | 9 | 8 |
| GSE4917 | 3 | 3 | 5 |
| GSE5667_1 | 5 | 6 | 3 |
| GSE6236 | 14 | 14 | 7 |
| GSE6344 | 10 | 10 | 19 |
| GSE6740_1 | 10 | 10 | 40 |
| GSE6740_2 | 10 | 10 | 62 |
| GSE7146 | 6 | 6 | 6 |
| GSE7765 | 3 | 3 | 13 |
| GSE8441 | 11 | 11 | 9 |
| GSE9499 | 15 | 7 | 77 |
| GSE9574 | 15 | 14 | 5 |
Figure 2. Visualization of the bias of popular DEG identification algorithms. FC produces many false positives among genes with low average expression or small expression differences. RP’s false positives are sparsely scattered over the region of low expression and small average difference. tTest’s false positives are dominated by genes with low average difference. WAD has fewer false positives than the other algorithms.
Number of true DEGs missed after DB pruning (N0 = 4, R0 = 0.0017). Each dataset has 22283 genes before pruning.
| Dataset | Genes after DB pruning | True DEG | True DEGs missed by DB pruning |
|---|---|---|---|
| GSE1462 | 2054 | 4 | 0 |
| GSE1615_1 | 2449 | 8 | 3 |
| GSE1650 | 1317 | 8 | 2 |
| GSE2666_2 | 1618 | 6 | 2 |
| GSE3524 | 814 | 4 | 0 |
| GSE3860 | 2073 | 8 | 0 |
| GSE4917 | 785 | 5 | 1 |
| GSE5667_1 | 1316 | 3 | 0 |
| GSE6236 | 2231 | 7 | 0 |
| GSE6344 | 3127 | 19 | 0 |
| GSE6740_1 | 1183 | 40 | 1 |
| GSE6740_2 | 1801 | 62 | 5 |
| GSE7146 | 1274 | 6 | 1 |
| GSE7765 | 1607 | 13 | 1 |
| GSE8441 | 978 | 9 | 1 |
| GSE9499 | 1805 | 77 | 3 |
| GSE9574 | 1448 | 5 | 0 |
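The cost of pruning can be summarized by totalling the missed true DEGs over all datasets. The following is a minimal Python sketch with the rows transcribed from the table above; the variable names are illustrative and do not come from the paper.

```python
# Rows transcribed from the table above:
# dataset -> (genes kept after pruning, true DEGs, true DEGs missed by pruning)
pruning = {
    "GSE1462": (2054, 4, 0),
    "GSE1615_1": (2449, 8, 3),
    "GSE1650": (1317, 8, 2),
    "GSE2666_2": (1618, 6, 2),
    "GSE3524": (814, 4, 0),
    "GSE3860": (2073, 8, 0),
    "GSE4917": (785, 5, 1),
    "GSE5667_1": (1316, 3, 0),
    "GSE6236": (2231, 7, 0),
    "GSE6344": (3127, 19, 0),
    "GSE6740_1": (1183, 40, 1),
    "GSE6740_2": (1801, 62, 5),
    "GSE7146": (1274, 6, 1),
    "GSE7765": (1607, 13, 1),
    "GSE8441": (978, 9, 1),
    "GSE9499": (1805, 77, 3),
    "GSE9574": (1448, 5, 0),
}
TOTAL_GENES = 22283  # genes per dataset before pruning, from the table header

total_true = sum(true for _, true, _ in pruning.values())
total_missed = sum(missed for _, _, missed in pruning.values())

# Fraction of true DEGs that survive pruning, pooled over all datasets.
retained_fraction = (total_true - total_missed) / total_true

# Average fraction of the gene list that remains after pruning.
mean_kept_fraction = sum(kept for kept, _, _ in pruning.values()) / (
    len(pruning) * TOTAL_GENES
)

print(f"true DEGs: {total_true}, missed: {total_missed}")
print(f"retained: {retained_fraction:.1%}, genes kept on average: {mean_kept_fraction:.1%}")
```

Pooled over the 17 datasets, 20 of the 284 true DEGs are removed by pruning, while each gene list shrinks to a small fraction of its original size.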
Ranks of true DEGs in the original gene list and the pruned gene list (original/pruned). Genes are sorted by four DEG identification algorithms on the GSE1577 dataset. A smaller rank number for a true DEG in the pruned list means that DB pruning has correctly filtered out non-DEGs that were ranked above it.
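| Algorithm (original/pruned) | DEG 1 | DEG 2 | DEG 3 | DEG 4 | DEG 5 | DEG 6 | DEG 7 | DEG 8 | DEG 9 |
|---|---|---|---|---|---|---|---|---|---|
| tTest/tTest’ | 1404/808 | 7/6 | 1321/768 | 3800/1713 | 4741/1975 | 3633/1659 | 4145/1828 | 606/388 | 210/155 |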
| FC/FC’ | 167/153 | 154/142 | 39/33 | 18/13 | 1/1 | 22/17 | 6/5 | 1601/1249 | 80/72 |
| Rp/Rp’ | 111/85 | 91/70 | 18/12 | 9/8 | 1/1 | 16/15 | 6/6 | 4520/980 | 97/68 |
| Wad/Wad’ | 31/31 | 25/25 | 32/25 | 7/7 | 3/3 | 15/15 | 6/6 | 515/515 | 10/10 |
Increase of AUC values for DEG algorithms after DB pruning: Rp, Wad, Fc, and tTest.
| Algorithm (original/pruned) | Partial AUC (up to K=1000) | Percentage of improvement |
|---|---|---|
| Rp/Rp’ | 0.0162/0.0196 | 21% |
| Fc/Fc’ | 0.0245/0.0263 | 7.3% |
| tTest/tTest’ | 0.0284/0.0310 | 9.2% |
| Wad/Wad’ | 0.032/0.033 | 3.1% |
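The percentage column is the relative change in partial AUC, (pruned - original) / original. A short Python sketch that recomputes it from the values in the table above (the function and variable names are illustrative, not from the paper):

```python
# Partial AUC (up to K=1000) transcribed from the table above:
# algorithm -> (without DB pruning, with DB pruning)
partial_auc = {
    "Rp": (0.0162, 0.0196),
    "Fc": (0.0245, 0.0263),
    "tTest": (0.0284, 0.0310),
    "Wad": (0.032, 0.033),
}


def percent_improvement(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


improvement = {
    name: percent_improvement(before, after)
    for name, (before, after) in partial_auc.items()
}

for name, pct in improvement.items():
    print(f"{name}: {pct:.1f}%")
```

The recomputed values agree with the table to within rounding. The baseline matters when reading them: Rp shows the largest relative gain but starts from the lowest partial AUC, whereas Wad starts highest and gains least.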
Number of true DEGs identified among the top K predictions without and with DB pruning (original/pruned). Rp’, Wad’, tTest’, FC’ are algorithms with DB pruning. The total number of true DEGs of the 17 datasets is 284.
| Algorithm | K=150 | K=250 | K=350 | K=450 | K=550 |
|---|---|---|---|---|---|
| Rp/Rp’ | 74/78 | 81/91 | 92/104 | 98/122 | 106/141 |
| Fc/Fc’ | 97/98 | 120/137 | 146/159 | 164/184 | 178/198 |
| tTest/tTest’ | 132/150 | 163/181 | 179/206 | 191/218 | 202/234 |
| Wad/Wad’ | 156/156 | 195/198 | 221/221 | 227/227 | 240/240 |
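The gain from pruning at each cutoff K is the difference between the two counts in a cell, and dividing a count by the 284 true DEGs gives the recall at that cutoff. A minimal Python sketch using the values from the table above (names are illustrative, not from the paper):

```python
TOTAL_TRUE_DEGS = 284  # total over the 17 datasets, as stated in the caption
CUTOFFS = (150, 250, 350, 450, 550)

# True DEGs found in the top K predictions: (without pruning, with pruning)
hits = {
    "Rp": [(74, 78), (81, 91), (92, 104), (98, 122), (106, 141)],
    "Fc": [(97, 98), (120, 137), (146, 159), (164, 184), (178, 198)],
    "tTest": [(132, 150), (163, 181), (179, 206), (191, 218), (202, 234)],
    "Wad": [(156, 156), (195, 198), (221, 221), (227, 227), (240, 240)],
}

# Additional true DEGs recovered at each cutoff thanks to pruning.
gains = {
    name: [after - before for before, after in pairs]
    for name, pairs in hits.items()
}

# Recall of the pruned variant at each cutoff.
recall_pruned = {
    name: [after / TOTAL_TRUE_DEGS for _, after in pairs]
    for name, pairs in hits.items()
}

for name in hits:
    row = ", ".join(
        f"K={k}: +{gain} ({recall:.0%})"
        for k, gain, recall in zip(CUTOFFS, gains[name], recall_pruned[name])
    )
    print(f"{name}: {row}")
```

Read this way, the table shows that pruning helps Rp, Fc, and tTest at every cutoff, while Wad changes only at K=250.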
Figure 3. Comparison of ROC curves of DEG algorithms with and without DB pruning. WAD and tTest have higher AUC values than FC and RP. With DB pruning, tTest’s AUC value approaches that of WAD, and the AUC values of all four DEG algorithms improve.
Number of true DEGs predicted using only a subset of the samples from conditions A and B, without and with DB pruning (original/pruned). Rp’, Wad’, tTest’, FC’ are algorithms with DB pruning. The total number of true DEGs of the 17 datasets is 284.
| Samples | Algorithm | K=150 | K=250 | K=350 | K=450 | K=550 |
|---|---|---|---|---|---|---|
| 2x2 samples | Rp/Rp’ | 45/46 | 61/61 | 68/71 | 78/79 | 86/92 |
| 2x2 samples | Fc/Fc’ | 43/44 | 58/62 | 62/71 | 69/82 | 74/89 |
| 2x2 samples | tTest/tTest’ | 16/32 | 24/47 | 31/60 | 37/77 | 45/88 |
| 2x2 samples | Wad/Wad’ | 92/92 | 116/116 | 127/128 | 140/141 | 149/149 |
| 3x3 samples | Rp/Rp’ | 52/53 | 60/64 | 71/75 | 76/81 | 81/90 |
| 3x3 samples | Fc/Fc’ | 54/54 | 61/63 | 71/77 | 79/87 | 90/100 |
| 3x3 samples | tTest/tTest’ | 32/52 | 48/76 | 62/98 | 72/115 | 82/132 |
| 3x3 samples | Wad/Wad’ | 91/91 | 128/129 | 150/150 | 166/166 | 171/175 |
| 4x4 samples | Rp/Rp’ | 60/63 | 70/74 | 78/86 | 85/97 | 97/105 |
| 4x4 samples | Fc/Fc’ | 62/63 | 74/81 | 86/92 | 94/108 | 115/128 |
| 4x4 samples | tTest/tTest’ | 67/85 | 83/111 | 100/141 | 108/157 | 119/174 |
| 4x4 samples | Wad/Wad’ | 119/119 | 155/155 | 173/174 | 189/190 | 193/196 |