Manhong Dai, Pinglang Wang, Andrew D Boyd, Georgi Kostov, Brian Athey, Edward G Jones, William E Bunney, Richard M Myers, Terry P Speed, Huda Akil, Stanley J Watson, Fan Meng.
Abstract
Genome-wide expression profiling is a powerful tool for implicating novel gene ensembles in cellular mechanisms of health and disease. The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip. However, its selection of probes relied on earlier genome and transcriptome annotation, which differs significantly from current knowledge. The resultant informatics problems have a profound impact on the analysis and interpretation of the data. Here, we address these critical issues and offer a solution. We identified several classes of problems at the individual probe level in the existing annotation, under the assumption that current genome and transcriptome databases are more accurate than those used for GeneChip design. We then reorganized probes on more than a dozen popular GeneChips into gene-, transcript- and exon-specific probe sets in light of up-to-date genome, cDNA/EST clustering and single nucleotide polymorphism information. Comparing analysis results between the original and the redefined probe sets reveals approximately 30-50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method. Our results demonstrate that the original Affymetrix probe set definitions are inaccurate, and many conclusions derived from past GeneChip analyses may be significantly flawed. It will be beneficial to re-analyze existing GeneChip data with updated probe set definitions.
Year: 2005 PMID: 16284200 PMCID: PMC1283542 DOI: 10.1093/nar/gni179
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Percent of potentially problematic GeneChip probe sets
| Chiptype | Unreliable representative public ID | UniGene redundancy | Containing probe(s) with multiple UniGene hits | Having probe(s) with multiple genome hits | With genomic location or strand issues | Including probe(s) with no known target | Containing allele-specific probe(s) |
|---|---|---|---|---|---|---|---|
| HG-U95Av2 | 27.9 | 21.1 | 36.6 | 16.2 | 8.8 | 4.6 | 40.5 |
| HG-U133A | 14.4 | 34.2 | 36.0 | 16.3 | 10.1 | 3.6 | 42.7 |
| HG-U133B | 22.2 | 31.4 | 22.3 | 9.3 | 10.4 | 5.0 | 35.2 |
| HG-U133 Plus | 18.2 | 47.2 | 26.1 | 11.6 | 12.0 | 4.8 | 37.6 |
| Human X3P | 21.0 | 50.8 | 22.8 | 10.6 | 10.3 | 4.8 | 32.7 |
| MG-U74Av2 | 42.7 | 18.8 | 28.8 | 16.1 | 8.8 | 10.0 | 11.7 |
| MOE430A | 13.3 | 38.6 | 30.9 | 15.0 | 10.4 | 4.1 | 11.0 |
| MOE430B | 28.5 | 31.2 | 16.5 | 5.5 | 9.9 | 11.6 | 4.6 |
| Mouse430 | 20.8 | 44.7 | 23.6 | 10.2 | 11.2 | 7.8 | 7.8 |
| Rn34A | 21.3 | 28.0 | 17.4 | 15.8 | 7.0 | 8.2 | 18.1 |
| RAE230A | 10.7 | 17.5 | 16.5 | 13.2 | 8.7 | 3.6 | 19.5 |
| RAE230B | 32.8 | 15.1 | 6.8 | 7.0 | 5.0 | 15.8 | 7.8 |
| Rat230 | 21.5 | 24.8 | 11.7 | 10.1 | 8.3 | 9.6 | 13.7 |
Probe set content comparison between Affymetrix probe sets and updated UniGene probe setsᵃ
| Chiptype | Total Affymetrix probe sets | UGID shared by both definitions | 100% Identical probe sets | Probe set content difference ≥ 50% | Unique UGID in Affymetrix probe set definition | Unique UGID in updated UniGene probe sets |
|---|---|---|---|---|---|---|
| HG-U95Av2 | 12 558 | 6847 | 3275 | 1153 | 956 | 1355 |
| HG-U133A | 22 212 | 11 182 | 4800 | 1920 | 1612 | 657 |
| HG-U133B | 22 577 | 7924 | 2799 | 2155 | 4912 | 1052 |
| HG-U133 Plus | 54 613 | 18 555 | 5624 | 5450 | 5496 | 1483 |
| Human X3P | 61 297 | 18 250 | 6339 | 5673 | 5714 | 1507 |
| MG-U74Av2 | 12 422 | 6531 | 3056 | 1217 | 1455 | 1253 |
| MOE430A | 22 626 | 11 488 | 5732 | 1694 | 1461 | 753 |
| MOE430B | 22 511 | 7866 | 2834 | 1904 | 3751 | 1147 |
| Mouse430 | 45 037 | 17 215 | 6487 | 4074 | 3356 | 1507 |
| Rn34A | 8740 | 3934 | 1538 | 886 | 990 | 595 |
| RAE230A | 15 866 | 9296 | 4586 | 1354 | 2614 | 722 |
| RAE230B | 15 276 | 6379 | 2453 | 1141 | 3034 | 890 |
| Rat230 | 31 042 | 14 598 | 5899 | 2992 | 4303 | 1384 |
ᵃ The UniGene builds used in Table 2 are HsUG 183, MmUG 146 and RnUG 142. If several old probe sets map to the same new UniGene ID, the probes in these old probe sets are merged before their content is compared with the corresponding new UniGene-based probe set.
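The merge-then-compare step described in the footnote can be sketched as follows. This is a minimal illustration, not the authors' implementation: the probe IDs, probe set IDs, the `old_to_ugid` mapping, and in particular the difference metric (symmetric difference over union) are assumptions, since the source does not state how "probe set content difference" was computed.

```python
from collections import defaultdict


def merge_by_unigene(old_probe_sets, old_to_ugid):
    """Pool the probes of all old probe sets that map to the same UniGene ID."""
    merged = defaultdict(set)
    for probe_set_id, probes in old_probe_sets.items():
        ugid = old_to_ugid.get(probe_set_id)
        if ugid is not None:  # skip old probe sets without a UniGene mapping
            merged[ugid].update(probes)
    return dict(merged)


def content_difference(old_probes, new_probes):
    """Percent of probes not shared by the two sets (assumed metric)."""
    union = old_probes | new_probes
    if not union:
        return 0.0
    return 100.0 * len(old_probes ^ new_probes) / len(union)


# Hypothetical toy data: two old probe sets map to Hs.1, one to Hs.2.
old_probe_sets = {
    "1000_at": {"p1", "p2", "p3"},
    "1001_s_at": {"p3", "p4"},
    "1002_at": {"p9"},
}
old_to_ugid = {"1000_at": "Hs.1", "1001_s_at": "Hs.1", "1002_at": "Hs.2"}
new_probe_sets = {"Hs.1": {"p1", "p2", "p3", "p4"}, "Hs.2": {"p8", "p9"}}

merged = merge_by_unigene(old_probe_sets, old_to_ugid)
for ugid in sorted(merged.keys() & new_probe_sets.keys()):
    print(ugid, content_difference(merged[ugid], new_probe_sets[ugid]))
```

Under this assumed metric, a merged set that matches the new definition exactly scores 0, which corresponds to the "100% identical" column, while a score of 50 or more would fall into the "difference ≥ 50%" column.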
Percent of shared UniGene ID between Affymetrix and other probe set definitions under different FDR thresholds
| FDR (SAM) | <1 | <2 | <5 | <10 | <20 |
|---|---|---|---|---|---|
| UG | 73.0 | 73.0 | 65.1 | 63.9 | 71.6 |
| 3UG | 75.4 | 75.4 | 67.8 | 63.8 | 68.2 |
| ENTREZG | 64.4 | 54.8 | 64.1 | 62.3 | 71.8 |
| ENSG | 70.9 | 62.3 | 61.7 | 62.0 | 70.5 |
| REFSEQ | 66.0 | 58.1 | 70.4 | 62.2 | 72.2 |
| 3REFSEQ | 67.5 | 67.5 | 67.7 | 64.5 | 70.6 |
| ENST | 69.7 | 52.3 | 67.1 | 64.8 | 71.7 |
| 3ENST | 72.9 | 65.6 | 65.8 | 62.9 | 69.2 |
| DOTS | 56.7 | 61.1 | 65.6 | 65.2 | 67.7 |
| 3DOTS | 60.6 | 61.4 | 65.0 | 65.5 | 69.3 |
ᵃ Probe set definitions whose names start with ‘3’ contain only the 11 most 3′ probes when a probe set has more than 11 probes.
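The trimming rule behind the ‘3’-prefixed definitions can be sketched as below. The probe representation is an assumption made for illustration: each probe is a hypothetical `(probe_id, position)` pair, with `position` measured along the transcript so that larger values lie closer to the 3′ end.

```python
def most_3prime(probes, keep=11):
    """Keep only the `keep` probes closest to the 3' end.

    `probes` is a list of (probe_id, position) pairs; position is assumed
    to increase toward the 3' end of the transcript. Probe sets with at
    most `keep` probes are returned unchanged.
    """
    if len(probes) <= keep:
        return list(probes)
    return sorted(probes, key=lambda p: p[1], reverse=True)[:keep]


# Hypothetical probe set with 16 probes spaced 25 bases apart.
long_set = [(f"p{i}", i * 25) for i in range(16)]
# Hypothetical probe set with only 8 probes; it is left untouched.
short_set = [(f"q{i}", i * 25) for i in range(8)]

trimmed = most_3prime(long_set)
print(len(trimmed), min(pos for _, pos in trimmed))
print(len(most_3prime(short_set)))
```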
Average similarity between different probe set definitions based on differentially expressed gene lists derived under various cut-off thresholds and analysis methodsᵃ
| Probe set definition | AFFY | UG | 3UG | ENTREZG | ENSG | REFSEQ | 3REFSEQ | ENST | 3ENST | DOTS | 3DOTS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AFFY | 100.0 | | | | | | | | | | |
| UG | **66.0** | 100.0 | | | | | | | | | |
| 3UG | 71.5 | 77.7 | 100.0 | | | | | | | | |
| ENTREZG | **65.8** | 80.1 | 73.2 | 100.0 | | | | | | | |
| ENSG | **66.4** | 78.4 | 72.6 | 87.8 | 100.0 | | | | | | |
| REFSEQ | **67.2** | 78.5 | 73.7 | 89.1 | 86.5 | 100.0 | | | | | |
| 3REFSEQ | **68.6** | 72.8 | 82.3 | 80.1 | 78.1 | 83.4 | 100.0 | | | | |
| ENST | **66.0** | 74.9 | 71.8 | 83.7 | 87.8 | 87.4 | 78.4 | 100.0 | | | |
| 3ENST | **68.7** | **68.9** | 79.6 | 76.3 | 79.8 | 78.2 | 84.4 | 82.5 | 100.0 | | |
| DOTS | **60.0** | **59.0** | **58.6** | **62.2** | **63.1** | **62.9** | **60.8** | **63.6** | **62.3** | 100.0 | |
| 3DOTS | **61.3** | **57.0** | **61.0** | **60.4** | **61.5** | **61.7** | **62.7** | **62.4** | **64.2** | 89.0 | 100.0 |
ᵃ Similarity values <70% are in bold.
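One way such an average similarity could be computed is sketched below. The source does not give the formula, so this is an assumption: similarity at one cut-off is taken as the percentage of shared gene IDs relative to the smaller of the two lists, and the reported value is the mean over all cut-offs present for both definitions. The gene IDs and thresholds are hypothetical.

```python
def percent_shared(list_a, list_b):
    """Percent of gene IDs shared, relative to the smaller list (assumed)."""
    a, b = set(list_a), set(list_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return 100.0 * len(a & b) / smaller


def average_similarity(lists_a, lists_b):
    """Mean percent_shared over the cut-off thresholds present in both inputs.

    Each argument maps a cut-off threshold to the list of differentially
    expressed gene IDs obtained with one probe set definition.
    """
    thresholds = sorted(lists_a.keys() & lists_b.keys())
    if not thresholds:
        raise ValueError("no common thresholds to compare")
    scores = [percent_shared(lists_a[t], lists_b[t]) for t in thresholds]
    return sum(scores) / len(scores)


# Hypothetical gene lists at FDR cut-offs of 1% and 5%.
affy = {1: ["Hs.1", "Hs.2"], 5: ["Hs.1", "Hs.2", "Hs.3", "Hs.4"]}
ug = {1: ["Hs.1", "Hs.5"], 5: ["Hs.1", "Hs.2", "Hs.3", "Hs.6"]}

print(average_similarity(affy, ug))
```

Normalizing by the smaller list keeps the score at 100 when one list is a subset of the other; a Jaccard index would be a stricter alternative.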
Figure 1. Hierarchical clustering of probe set definition similarity based on differentially expressed gene lists derived from the GSE974 data set using different probe set definitions and analysis methods.