Graham J G Upton, Andrew P Harrison.
Abstract
An Affymetrix GeneChip consists of an array of hundreds of thousands of probes (each a sequence of 25 bases) with the probe values being used to infer the extent to which genes are expressed in the biological material under investigation. In this article, we demonstrate that these probe values are also strongly influenced by their precise base sequence. We use data from >28 000 CEL files relating to 10 different Affymetrix GeneChip platforms and involving nearly 1000 experiments. Our results confirm known effects (those due to the T7-primer and the formation of G-quadruplexes) but reveal other effects. We show that there can be huge variations from one experiment to another, and that there may also be sizeable disparities between batches within an experiment and between CEL files within a batch.
Year: 2012 PMID: 22904084 PMCID: PMC3479185 DOI: 10.1093/nar/gks717
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Entries are the most extreme group averages calculated from the results for 10 000 HGU133Plus2 CEL files
| The largest group averages | | The smallest group averages | |
|---|---|---|---|
| CCCCC 1.05 | CCGCC 1.04 | AATTT −0.30 | AATTA −0.28 |
| CCTCC 0.97 | CTCCC 0.88 | ATTTA −0.28 | TAATT −0.28 |
The units are SDs of the standardized probe values.
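A minimal sketch of how such group averages could be computed for one CEL file. It assumes that the probe values are log-transformed and then standardized to zero mean and unit SD, and that a probe belongs to a motif's group if it contains that five-base motif at least once; neither detail is spelled out in the table caption. The function names and the toy data are invented for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def standardize(values):
    """Log-transform raw intensities, then centre and scale to unit SD (assumed preprocessing)."""
    logs = [math.log(v) for v in values]
    mu, sd = mean(logs), pstdev(logs)
    return [(x - mu) / sd for x in logs]

def motif_group_averages(sequences, values, k=5):
    """Average standardized value of the probes that contain each k-base motif."""
    z = standardize(values)
    groups = defaultdict(list)
    for seq, score in zip(sequences, z):
        # A probe joins a motif's group once, however often the motif occurs in it.
        for motif in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            groups[motif].append(score)
    return {motif: mean(scores) for motif, scores in groups.items()}

# Toy data (invented): three short "probes" with raw intensities.
sequences = ["CCCCCA", "AATTTA", "ACGTAC"]
values = [800.0, 50.0, 200.0]
averages = motif_group_averages(sequences, values)
```

Sorting `averages` by value would give the most extreme groups for that file; averaging the per-file results over many files would give entries of the kind tabulated above.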
Entries are the mean parameter estimates for the 26 dummy variables forming part of a multiple regression model fitted to each of 28 000 CEL files over 10 platforms (see the Supplementary Material for the results for the seven other platforms)
| | Human HGU133+2 | Mouse MOE430A | Arabidopsis ATH1-12501 |
|---|---|---|---|
| CCCCC | |||
| CCCGCCCC | |||
| CCGCCTCCC | |||
| TCGCCGCT | |||
| CCCCG | 0.19 | ||
| GGGG | −0.03 | ||
| (AT)CCGC | 0.24 | 0.23 | 0.21 |
| GCCCG | 0.10 | 0.17 | 0.19 |
| AGGCCA | −0.20 | −0.18 | −0.17 |
| CCCCTC | 0.21 | 0.10 | |
| CTGCCT | 0.19 | 0.20 | 0.12 |
| CTGGCC | −0.16 | −0.15 | −0.18 |
| AACCC | −0.16 | −0.19 | −0.09 |
| TCGCTC | 0.12 | 0.13 | 0.19 |
| GGGGG | 0.13 | 0.14 | −0.04 |
| ACGCCA | 0.14 | 0.14 | 0.16 |
| NotAorT | −0.14 | −0.12 | −0.17 |
| TCCCC | 0.20 | 0.12 | 0.10 |
| TCCCT | 0.20 | 0.20 | 0.07 |
| TGGGG | −0.15 | −0.11 | −0.12 |
| GCTCCTCG | 0.13 | 0.14 | 0.11 |
| GGTTGCCC | 0.08 | 0.09 | 0.10 |
| GAACCA | −0.13 | −0.12 | −0.09 |
| GGTGCT | 0.04 | 0.07 | 0.18 |
| GCCCTCCG | 0.11 | 0.12 | 0.06 |
| GTGGTTC | 0.06 | 0.07 | 0.15 |
| Median | 91% | 91% | 86% |
| No. of files | 10 000 | 1556 | 2288 |
| No. of GSEs | 322 | 107 | 160 |
The units are SD of logarithms of the raw data. Values of 0.25 SD or greater are shown in bold.
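The dummy-variable part of such a regression can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: it fits only an intercept and one 0/1 indicator per motif (the other terms of the full model are omitted), it treats each motif as a plain string (so entries such as (AT)CCGC and NotAorT are not handled), and it assumes the response is already expressed in SDs of the log data. All function names are hypothetical.

```python
def design_row(seq, motifs):
    """Intercept plus one 0/1 dummy per motif (1 if the probe contains the motif)."""
    return [1.0] + [1.0 if m in seq else 0.0 for m in motifs]

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_dummies(sequences, y, motifs):
    """Least-squares estimates from the normal equations (X'X) beta = X'y."""
    X = [design_row(s, motifs) for s in sequences]
    p = len(motifs) + 1
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * v for row, v in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

# Toy data (invented): responses built from intercept 0, CCCCC effect 1.0, GGGG effect -0.5.
motifs = ["CCCCC", "GGGG"]
sequences = ["ACCCCCA", "AGGGGA", "ACGTACGT", "CCCCCGGGG"]
y = [1.0, -0.5, 0.0, 0.5]
beta = fit_dummies(sequences, y, motifs)  # [intercept, beta_CCCCC, beta_GGGG]
```

Fitting one such model per CEL file and averaging the motif coefficients across files would yield a table of mean parameter estimates of the kind shown above.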
Frequencies with which selected group averages were the largest or smallest within their CEL file
| Largest | CCCCC | CCGCC | CCTCC | CTCCC | |
| No. of files | 3859 | 3451 | 1432 | 160 | |
| GGCGG | 4 others | 9 others | |||
| 132 | 100 | 23 | |||
| Smallest | AATTT | GCGCG | AATTA | TTTTA | CGCGA |
| No. of files | 6023 | 1240 | 808 | 546 | 419 |
| ATATA | 34 others | ||||
| 151 | 632 |
A total of 10 000 HGU133Plus2 CEL files were examined.
The numbers of probes (PM and mismatch) containing G/C-rich motifs, together with the median and maximum frequencies of occurrence of the 1024 five-base motifs for 10 Affymetrix platforms
| Platform | 5Gs | Other GGGG | 5Cs | Average G/C rich | All 5-base motifs: median | All 5-base motifs: max |
|---|---|---|---|---|---|---|
| U133+2 | 13 000 | 20 000 | 8600 | 10 000 | 24 000 | 64 000 |
| U133A/A2 | 7000 | 10 000 | 4300 | 5000 | 10 000 | 25 000 |
| ATH1 | 1800 | 4400 | 1100 | 4500 | 10 000 | 32 000 |
| DrosG | 830 | 2500 | 1100 | 7900 | 7900 | 22 000 |
| Rice | 240 | 520 | 700 | 17 000 | 26 000 | 61 000 |
| Mouse4302 | 200 | 440 | 940 | 5900 | 20 000 | 53 000 |
| Barley1 | 190 | 1500 | 900 | 6100 | 9400 | 28 000 |
| Soybean | 170 | 420 | 820 | 7100 | 20 000 | 60 000 |
| Dros2 | 2 | 41 | 93 | 7700 | 11 000 | 24 000 |
Entries are correct to two significant figures.
Average cross-correlation (×100) of selected probes in unrelated probe sets
| | 245767_at: GGGGG probe | 245767_at: Other probes | 246043_at: GGGGG probe | 246043_at: Other probes |
|---|---|---|---|---|
| 245767_at: GGGGG probe | NA | 18 | | 6 |
| 245767_at: Other probes | 18 | 30 | 8 | −4 |
| 246043_at: GGGGG probe | | 8 | NA | 21 |
| 246043_at: Other probes | 6 | −4 | 21 | 25 |
Probes containing GGGGG and the ATH1-12501 platform.
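One way to compute an average cross-correlation of this kind is sketched below. It assumes that each probe is represented by its values across a common set of CEL files and that the tabulated quantity is the mean of the pairwise Pearson correlations between the probes of two groups, multiplied by 100; the exact definition is not given in the caption, so this is an assumption. The function names and toy values are invented.

```python
from itertools import product
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences of values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_cross_correlation(group_a, group_b):
    """Mean pairwise correlation (x100) between the probes of two groups.

    Each probe is a list of values, one per CEL file. A probe is never
    paired with itself, so the function also works for a within-group average.
    """
    pairs = [(p, q) for p, q in product(group_a, group_b) if p is not q]
    return 100 * sum(pearson(p, q) for p, q in pairs) / len(pairs)

# Toy data (invented): one probe against two others, over four CEL files.
probe = [1.0, 2.0, 3.0, 4.0]
others = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
mixed = average_cross_correlation([probe], others)
aligned = average_cross_correlation([probe], [others[0]])
```

In the toy data the two correlations are +1 and −1, so `mixed` is 0, while `aligned` is 100.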
Average cross-correlation (×100) of selected probes in unrelated probe sets
| | 1556038_at: CCCCC probe | 1556038_at: Other probes | 1556502_at: CCCCC probe | 1556502_at: Other probes |
|---|---|---|---|---|
| 1556038_at: CCCCC probe | NA | −12 | | −5 |
| 1556038_at: Other probes | −12 | 11 | −9 | 6 |
| 1556502_at: CCCCC probe | | −9 | NA | −4 |
| 1556502_at: Other probes | −5 | 6 | −4 | 9 |
Probes containing CCCCC and the HGU133Plus2 platform.
Frequencies (for the HGU133Plus2 platform) of PM probes, mismatch probes (MM) and PM–MM pairs that contain subsequences of the CCGCCTCCC motif
| Subsequence length | 9 | 8 | 7 | 6 | 5 | 4 |
|---|---|---|---|---|---|---|
| PM only | 0 | 7 | 141 | 1187 | 6574 | 24 762 |
| Mismatch only | 1 | 15 | 186 | 1164 | 6536 | 20 247 |
| Both | 4 | 55 | 768 | 6024 | 36 336 | 184 639 |
| Total | 5 | 77 | 1095 | 8375 | 49 446 | 229 649 |
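The classification behind these counts can be sketched as follows. The sketch relies on the standard Affymetrix design in which the mismatch probe is the PM sequence with its middle (13th) base replaced by the complementary base. It assumes that a probe counts at length k if it contains any contiguous k-base subsequence of the motif; whether the table uses "at least k" or "longest match exactly k" is not stated, so that choice is an assumption. The function names and toy probes are invented.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm):
    """Return the MM sequence: the PM sequence with its middle base complemented."""
    mid = len(pm) // 2  # index 12 (the 13th base) for a 25-base probe
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def substrings(motif, k):
    """All contiguous k-base subsequences of the motif."""
    return {motif[i:i + k] for i in range(len(motif) - k + 1)}

def classify_pairs(pm_probes, motif, k):
    """Count PM-MM pairs by which member contains a k-base subsequence of the motif."""
    fragments = substrings(motif, k)
    counts = {"PM only": 0, "Mismatch only": 0, "Both": 0}
    for pm in pm_probes:
        in_pm = any(f in pm for f in fragments)
        in_mm = any(f in mismatch_probe(pm) for f in fragments)
        if in_pm and in_mm:
            counts["Both"] += 1
        elif in_pm:
            counts["PM only"] += 1
        elif in_mm:
            counts["Mismatch only"] += 1
    return counts

# Toy 25-base probes (invented for illustration).
pm_probes = [
    "CCGCC" + "A" * 20,             # fragment away from the middle: both
    "A" * 10 + "CCGCC" + "A" * 10,  # middle G -> C destroys the fragment: PM only
    "A" * 10 + "CCCCC" + "A" * 10,  # middle C -> G creates CCGCC: mismatch only
    "A" * 25,                       # no fragment in either probe
]
counts = classify_pairs(pm_probes, "CCGCCTCCC", 5)
```

The toy probes show why the "PM only" and "Mismatch only" rows are non-zero: a fragment that spans the middle base can be destroyed or created by the single-base change.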
Average cross-correlation (×100) of selected probes in unrelated probe sets
| | 1552808_at: CCGCCTC probe | 1552808_at: Other probes | 1552884_at: Probes 1–8 | 1552884_at: Probes 9–11 |
|---|---|---|---|---|
| 1552808_at: CCGCCTC probe | NA | −10 | | −10 |
| 1552808_at: Other probes | −10 | 13 | −13 | 4 |
| 1552884_at: Probes 1–8 | | −13 | 33 | 10 |
| 1552884_at: Probes 9–11 | −10 | 4 | 10 | 52 |
Probes containing subsets of CCGCCTCCC and the HGU133Plus2 platform.
Correlations between five groups of motif estimates (based on results for 28 000 CEL files)
| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | | 0.29 | −0.34 | 0.12 | 0.09 |
| B | 0.29 | | − | 0.25 | −0.29 |
| C | −0.34 | − | | −0.24 | 0.17 |
| D | 0.12 | 0.25 | −0.24 | 0.32 | − |
| E | 0.09 | −0.29 | 0.17 | − | |
Group A: (CCCCC, CCCCTC, CCCCG, TCCCC, TCCCT); Group B: (CCGCCTCCC, CCCGCCCC, TCGCCGCT, (AT)CCGC, CTGCCT, ACGCCA, TCGCTC, GCTCCTCG, GCCCTCCG, GGTTGCCC, GCCCG); Group C: (AGGCCA, CTGGCC, TGGGG, NotAorT, GAACCA, AACCC); Group D: (GGGG, GGGGG); Group E: (GGTGCT, GTGGTTC). Correlations with magnitudes in excess of 0.4 are in bold.
Figure 1. Scatter diagrams of the estimates for the CCCGCCCC and AGGCCA parameters plotted against date of scan. Values are in SD and result from applying the final model to the pooled data for each of 1523 day-GSE combinations for the 10 000 CEL files relating to the HGU133Plus2 platform. In each case, there is one large group plus one or two well-defined smaller groups. One group (circles) takes near-zero values for each parameter. The second group (crosses) behaves normally with respect to CCCGCCCC but has very low values for AGGCCA.
Figure 2. Scatter diagrams showing, for three individual GSEs, estimates (values are in SD) for the CCGCCTCCC parameter against time of scan. Each plot shows the average value of the estimates for a batch (CEL files scanned on the same day), with the area of each circle proportional to batch size.
Examples of variations in parameter estimates within a single day
| GSM | GSE7307 CEL files on 31 December 2003 (19:00 to 23:45) |
|---|---|
| 175786-92 | 0.28, 0.36 \| 0.26, 0.26, 0.30, 0.26, 0.28 |
| 175793-4, 7-801 | 0.42 \| 0.24, 0.25, 0.30, 0.26, 0.37, 0.45 |
| 175802-3, 5-6, 9 | 0.29, 0.44, 0.36, 0.52, 0.54 |
The values are the estimates for the β_CCGCCTCCC parameter. The values are given correct to two decimal places and are presented in the order of the scans. The scans were consecutive except where indicated with a ‘|’ symbol.
Example of the parameter estimates resulting from a single carousel run
| File id | 507 | 510 | 516 | 518 | 519 | 520 | 443 | 549 |
| Estimate | 0.62 | 0.63 | 0.64 | 0.72 | 0.62 | 0.67 | 0.84 | 0.66 |
| File id | 466 | 467 | 468 | 497 | 498 | 499 | 500 | 501 |
| Estimate | 0.49 | 0.61 | 0.66 | 0.57 | 0.61 | 0.66 | 0.54 | 0.53 |
| File id | 502 | 524 | 530 | 488 | 490 | 491 | ||
| Estimate | 0.53 | 0.48 | 0.50 | 0.57 | 0.50 | 0.63 |
The scanner id was 50201191. The first file was timed at 10:03:37 on 2 March 2006 and the last file at 13:59:59 on the same day. All the experiments form part of GSE2109. The ids given should be preceded by GSM102 (so that 507 refers to GSM102507).
Figure 3. Changes in the estimates of the dominant parameters according to the location of the 5-mer section within the 25-mer probe. The individual values have been smoothed using Friedman's variable span smoother.
The numbers (and percentages) of PM probes, on the HGU133Plus2 array, that have five-base (or longer) probe sequences in common with the most influential motifs
| CCGCCTCCC | GGGG | CCCCTC | CCCGCCCC |
| 51 545 (8.5%) | 32 538 (5.4%) | 23 694 (3.9%) | 18 667 (3.1%) |
| TCGCCGCT | CCCCC | CCCCG | None of these |
| 14 319 (2.4%) | 4255 (0.7%) | 3369 (0.6%) | 494 561 (81.8%) |
Median R²-values for selected models fitted to 10 000 CEL files using the HGU133Plus2 platform
| Dummy parameters only | 41% | Mononucleotides only | 50% |
| Mono- and dinucleotides only | 78% | Full model | 91% |