Kornel Chrominski, Magdalena Tkacz.
Abstract
MOTIVATION: When we were asked for help with high-level microarray data analysis (on the Affymetrix HGU-133A microarray), we faced the problem of selecting an appropriate method. We wanted a method that would yield "the best result", that is, one that detects as many "really" differentially expressed genes (DEGs) as possible, without false positives or false negatives. However, life scientists could not help us: they use their "favorite" method without giving specific reasons for the choice. We also did not find any norm or recommendation. Therefore, we decided to examine the question for our own purposes. We considered whether the results obtained using different methods of high-level microarray data analysis (Significance Analysis of Microarrays, Rank Products, Bland-Altman, the Mann-Whitney test, the t test and Linear Models for Microarray Data) would be in agreement. Initially, we conducted a comparative analysis of the results on eight real data sets from microarray experiments (from the ArrayExpress database). The results were surprising. On the same array set, the sets of DEGs detected by different methods were significantly different. We also applied the methods to artificial data sets and determined some measures that allow an overall scoring of the tested methods to be prepared for future recommendation.
Year: 2015 PMID: 26057385 PMCID: PMC4461299 DOI: 10.1371/journal.pone.0128845
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Microarray experiment steps (phases).
Frequency of hits: method name along with “differentially expressed genes” and “gene expression” phrases (Google Scholar, PubMed).
| Method name | "Differentially expressed genes" | "Gene expression" | |
|---|---|---|---|
| 32 000; |
Arraysets Characteristics.
| Arraysets | Accession number | Number of samples | Characteristics |
|---|---|---|---|
| Arrayset1 | E-GEOD-32072 | 50 | all samples from cancerous tissue (lung) |
| Arrayset2 | E-GEOD-14882 | 16 | 8 control, 8 patients with MELAS syndrome |
| Arrayset3 | E-GEOD-15852 | 86 | 43 control, 43 lung cancer tissue |
| Arrayset4 | E-MEXP-1690 | 12 | 6 control, 6 ganglioglioma |
| Arrayset5 | E-GEOD-56899 | 45 | 5 control, 40 brain tissue affected by Alzheimer's disease |
| Arrayset6 | E-GEOD-22529 | 104 | 82 chemoimmunotherapy patients, 22 from cancer tissue |
| Arrayset7 | E-TABM-794 | 102 | 50 control, 52 prostate tumours |
| Arrayset8 | E-GEOD-11038 | 72 | 25 control, 47 tissue with leukemia |
Fig 2. Arraysets preparation process.
Parameters that were fixed for each method of high-level analysis (for the purpose of experiments).
| Methods | Type of parameter | Value of parameter |
|---|---|---|
| | fold change | 2.00 |
| | p-value | 0.01 |
| | p-value | 0.05 |
| | p-value | 0.02 |
| | p-value | 0.01 |
| | p-value | 0.05 |
DEGs detected in Arraysets by different methods (22,283 in all).
| Arrayset | SAM | RP | MW | BA | TT | LIMMA |
|---|---|---|---|---|---|---|
| Arrayset1 | 3323 | 11461 | 3752 | 1782 | 2200 | 1340 |
| Arrayset2 | 1043 | 1446 | 952 | 1132 | 153 | 952 |
| Arrayset3 | 4605 | 4551 | 2260 | 1743 | 1092 | 2260 |
| Arrayset4 | 1872 | 1846 | 1848 | 914 | 320 | 1848 |
| Arrayset5 | 11590 | 2014 | 840 | 1476 | 448 | 840 |
| Arrayset6 | 659 | 3380 | 992 | 977 | 493 | 992 |
| Arrayset7 | 2798 | 3100 | 4797 | 2468 | 2581 | 2684 |
| Arrayset8 | 1716 | 1885 | 1789 | 505 | 1041 | 633 |
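To make the disagreement between methods concrete, the following minimal sketch summarizes each row of the table above: which method flags the fewest DEGs, which flags the most, and the ratio between the two. The counts are copied from the table in row order; the function names `spread` and `fraction_flagged` are illustrative only, and treating 22,283 as the total number of tested probes is an assumption based on the table caption.

```python
# Total number of tested probes; assumed from the caption "22,283 in all".
TOTAL = 22283
# Column order as it appears in the table header.
METHODS = ["SAM", "RP", "MW", "BA", "TT", "LIMMA"]
# DEG counts copied from the table, one row per arrayset, in table order.
DEG_COUNTS = [
    [3323, 11461, 3752, 1782, 2200, 1340],
    [1043, 1446, 952, 1132, 153, 952],
    [4605, 4551, 2260, 1743, 1092, 2260],
    [1872, 1846, 1848, 914, 320, 1848],
    [11590, 2014, 840, 1476, 448, 840],
    [659, 3380, 992, 977, 493, 992],
    [2798, 3100, 4797, 2468, 2581, 2684],
    [1716, 1885, 1789, 505, 1041, 633],
]


def spread(row):
    """Return (method with fewest DEGs, method with most DEGs, max/min ratio)."""
    lo = min(range(len(row)), key=row.__getitem__)
    hi = max(range(len(row)), key=row.__getitem__)
    return METHODS[lo], METHODS[hi], row[hi] / row[lo]


def fraction_flagged(row):
    """Map each method to the share of all probes it reports as DEGs."""
    return {m: n / TOTAL for m, n in zip(METHODS, row)}


if __name__ == "__main__":
    for i, row in enumerate(DEG_COUNTS, start=1):
        fewest, most, ratio = spread(row)
        print(f"row {i}: fewest={fewest}, most={most}, max/min={ratio:.1f}")
```

The ratio is a crude summary: it says nothing about whether the smaller set is contained in the larger one, which is what the Venn diagram addresses.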
Fig 3. Venn diagram for Arrayset1.
Fig 4. Distribution of values in Dataset1.
Fig 5. Distribution of values in Dataset2.
Number of aDEGs detected and assessment parameters used for each method in Dataset1 (in bold—the best, in italics—the worst).
| aDEGs detected (of all 73) | SAM | RP | BA | MW | TT | LIMMA |
|---|---|---|---|---|---|---|
| Number of detected values | 74 | 84 | 76 | 138 | 81 | 72 |
| True positives | 73 | 73 | 72 | 73 | 73 | 72 |
| True negatives | 1926 | 1916 | 1923 | 1862 | 1919 | 1927 |
| False positives | 1 | 11 | 4 | 65 | 8 | 0 |
| False negatives | 0 | 0 | 1 | 0 | 0 | 1 |
| acc | | 0.945 | 0.975 | | 0.960 | |
| rec | | | | | | |
| prec | 0.986 | 0.869 | 0.947 | | 0.901 | |
| f-measure | | 0.929 | 0.966 | | 0.948 | |
| MCC | | 0.890 | 0.947 | | 0.918 | |
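The assessment parameters in this table are all functions of the four confusion-matrix counts. The sketch below computes them with the textbook definitions of accuracy, recall, precision, F-measure and the Matthews correlation coefficient (MCC); these definitions are an assumption, since the formulas the authors used are not given in this section. With them, precision reproduces the values that survive in the table (for example 0.986 for SAM and 0.869 for RP), whereas textbook accuracy does not (0.9945 for RP against the listed 0.945), so the output should not be read as the authors' own computation.

```python
from math import sqrt

# (TP, TN, FP, FN) per method, copied from the Dataset1 table.
DATASET1 = {
    "SAM": (73, 1926, 1, 0),
    "RP": (73, 1916, 11, 0),
    "BA": (72, 1923, 4, 1),
    "MW": (73, 1862, 65, 0),
    "TT": (73, 1919, 8, 0),
    "LIMMA": (72, 1927, 0, 1),
}


def metrics(tp, tn, fp, fn):
    """Textbook confusion-matrix metrics (assumed, not taken from the paper)."""
    rec = tp / (tp + fn)
    prec = tp / (tp + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "rec": rec,
        "prec": prec,
        "f-measure": 2 * prec * rec / (prec + rec),
        # MCC is undefined when a marginal total is zero; report 0.0 then.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    for method, counts in DATASET1.items():
        row = metrics(*counts)
        print(method, {name: round(value, 3) for name, value in row.items()})
```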
Summary of excess aDEGs for each method, showing whether each was also recognized as an aDEG by the other methods.
| Method | Excess aDEGs | SAM | RP | BA | MW | TT |
|---|---|---|---|---|---|---|
| SAM | 1 | - | No | No | Yes | No |
| RP | 1 | No | - | No | Yes | Yes |
| RP | 4 | No | - | No | Yes | No |
| RP | 1 | No | - | Yes | Yes | No |
| RP | 4 | No | - | No | No | No |
| BA | 1 | No | Yes | - | Yes | No |
| BA | 1 | No | No | - | Yes | Yes |
| BA | 1 | No | No | - | Yes | No |
| BA | 4 | No | No | - | No | No |
| MW | 1 | Yes | No | No | - | No |
| MW | 1 | No | Yes | No | - | Yes |
| MW | 1 | No | Yes | Yes | - | No |
| MW | 5 | No | Yes | No | - | No |
| MW | 1 | No | No | Yes | - | Yes |
| MW | 1 | No | No | Yes | - | No |
| MW | 7 | No | No | No | - | Yes |
| MW | 49 | No | No | No | - | No |
| TT | 1 | No | Yes | No | Yes | - |
| TT | 1 | No | No | Yes | Yes | - |
| TT | 6 | No | No | No | Yes | - |
Number of aDEGs detected and assessment parameters by each method in Dataset2 (in bold—the best, in italics—the worst).
| aDEGs detected (of all 73) | SAM | RP | BA | MW | TT | LIMMA |
|---|---|---|---|---|---|---|
| Number of detected values | 69 | 98 | 50 | 149 | 85 | 76 |
| True positives | 69 | 73 | 46 | 73 | 71 | 73 |
| True negatives | 1927 | 1902 | 1923 | 1851 | 1913 | 1924 |
| False positives | 0 | 25 | 4 | 76 | 14 | 3 |
| False negatives | 4 | 0 | 27 | 0 | 2 | 0 |
| acc | 0.980 | 0.875 | 0.845 | | 0.920 | |
| rec | 0.945 | | | | 0.972 | |
| prec | | 0.744 | 0.920 | | 0.835 | 0.960 |
| f-measure | 0.971 | 0.853 | 0.747 | | 0.898 | |
| MCC | 0.957 | 0.773 | 0.665 | | 0.839 | |
Summary of excess aDEGs for each method, showing whether each was also recognized as an aDEG by the other methods.
| Method | Excess aDEGs | SAM | RP | BA | MW | TT | LIMMA |
|---|---|---|---|---|---|---|---|
| RP | 1 | NO | - | YES | YES | YES | YES |
| RP | 5 | NO | - | NO | YES | YES | NO |
| RP | 6 | NO | - | NO | YES | NO | NO |
| RP | 13 | NO | - | NO | NO | NO | NO |
| BA | 1 | NO | YES | - | YES | YES | YES |
| BA | 1 | NO | NO | - | YES | YES | NO |
| BA | 2 | NO | NO | - | NO | NO | NO |
| MW | 5 | NO | YES | NO | - | YES | NO |
| MW | 1 | NO | YES | YES | - | YES | YES |
| MW | 6 | NO | YES | NO | - | NO | NO |
| MW | 1 | NO | NO | YES | - | YES | NO |
| MW | 7 | NO | NO | NO | - | YES | NO |
| MW | 56 | NO | NO | NO | - | NO | NO |
| TT | 5 | NO | YES | NO | YES | - | NO |
| TT | 1 | NO | YES | YES | YES | - | YES |
| TT | 1 | NO | NO | YES | YES | - | NO |
| TT | 7 | NO | NO | NO | YES | - | NO |
| LIMMA | 1 | NO | YES | YES | YES | YES | - |
| LIMMA | 2 | NO | NO | NO | NO | NO | - |
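As a consistency check, the rows of this table can be summed per method and compared with the "False positives" row of the Dataset2 table. The sketch below assumes that the "-" cell marks the method that reported the excess aDEGs, and that excess aDEGs correspond to false positives; both are readings of the table, not statements from the authors. Under those assumptions the totals agree for every method.

```python
# Column order of the YES/NO flags in the table.
METHODS = ["SAM", "RP", "BA", "MW", "TT", "LIMMA"]

# (count, flags) copied row by row from the table; "-" marks the owning method.
ROWS = [
    (1, "NO - YES YES YES YES"),
    (5, "NO - NO YES YES NO"),
    (6, "NO - NO YES NO NO"),
    (13, "NO - NO NO NO NO"),
    (1, "NO YES - YES YES YES"),
    (1, "NO NO - YES YES NO"),
    (2, "NO NO - NO NO NO"),
    (5, "NO YES NO - YES NO"),
    (1, "NO YES YES - YES YES"),
    (6, "NO YES NO - NO NO"),
    (1, "NO NO YES - YES NO"),
    (7, "NO NO NO - YES NO"),
    (56, "NO NO NO - NO NO"),
    (5, "NO YES NO YES - NO"),
    (1, "NO YES YES YES - YES"),
    (1, "NO NO YES YES - NO"),
    (7, "NO NO NO YES - NO"),
    (1, "NO YES YES YES YES -"),
    (2, "NO NO NO NO NO -"),
]

# "False positives" row of the Dataset2 table.
FALSE_POSITIVES = {"SAM": 0, "RP": 25, "BA": 4, "MW": 76, "TT": 14, "LIMMA": 3}


def excess_totals(rows):
    """Sum the excess-aDEG counts per owning method (the column holding '-')."""
    totals = dict.fromkeys(METHODS, 0)
    for count, flags in rows:
        owner = METHODS[flags.split().index("-")]
        totals[owner] += count
    return totals


if __name__ == "__main__":
    totals = excess_totals(ROWS)
    for method in METHODS:
        print(method, totals[method], totals[method] == FALSE_POSITIVES[method])
```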
Overall scoring of methods for the Datasets (one plus equals one point; the more, the better).
| | SAM | RP | BA | MW | TT | LIMMA |
|---|---|---|---|---|---|---|
| Dataset1 | + + + | + | + + | - | + + | + + + |
| Dataset2 | + + + | + | + | - | + + | + + + |
| overall scoring | | 2 | 3 | | 4 | |
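The scoring rule in the caption (one plus equals one point) can be tallied mechanically. The sketch below counts the plus signs in the two dataset rows for each method; treating a "-" entry as zero points is an assumption, since the caption only defines the value of a plus. The values 2, 3 and 4 that remain in the overall scoring row are consistent with this tally for RP, BA and TT.

```python
# Marks per method for (Dataset1, Dataset2), copied from the table.
MARKS = {
    "SAM": ("+ + +", "+ + +"),
    "RP": ("+", "+"),
    "BA": ("+ +", "+"),
    "MW": ("-", "-"),
    "TT": ("+ +", "+ +"),
    "LIMMA": ("+ + +", "+ + +"),
}


def overall_score(marks):
    """One point per plus sign; a '-' entry is assumed to add nothing."""
    return sum(cell.count("+") for cell in marks)


def ranking(all_marks):
    """Methods ordered from highest to lowest score (ties keep table order)."""
    return sorted(all_marks, key=lambda m: overall_score(all_marks[m]), reverse=True)


if __name__ == "__main__":
    for method in ranking(MARKS):
        print(method, overall_score(MARKS[method]))
```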