Stephanie Schneider, Temple Smith, Ulla Hansen.
Abstract
Many platforms for genome-wide analysis of gene expression contain 'redundant' measures for the same gene. For example, the most widely used platforms for gene expression microarrays, Affymetrix GeneChip® arrays, have as many as ten or more probe sets for some genes. Occasionally, individual probe sets for the same gene report different trends in expression across experimental conditions, a situation that must be resolved in order to interpret the data accurately. We developed an algorithm, SCOREM, for determining the level of agreement between such probe sets, using a statistical test of concordance, Kendall's W coefficient of concordance, and a graph-searching algorithm to identify concordant probe sets. We also present methods for consolidating each concordant group into a single value for the corresponding gene and for post hoc analysis of discordant groups. By combining statistical consolidation with sequence analysis, SCOREM possesses the unique ability to identify biologically meaningful discordant behaviors, including differing behaviors in alternate RNA isoforms and tissue-specific patterns of expression. When consolidating concordant behaviors, SCOREM outperforms other methods in detecting both differential expression and overrepresented functional categories.
Year: 2011 PMID: 22210887 PMCID: PMC3315298 DOI: 10.1093/nar/gkr1270
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Data used in evaluating the SCOREM algorithm
| Experiment | Tissue type | Conditions | Platform | Source (raw data available) [a] | No. of samples |
|---|---|---|---|---|---|
| GSE3678 [b] | Thyroid | Tumor versus normal | Human Genome U133 Plus 2.0 (GPL570) | GEO (yes) | 14 |
| GSE4051 | Retina | Nrl-knockout versus wild-type at 5 time points in development | Mouse Genome 430 2.0 (GPL1261) | GEO (yes) | 39 |
| GSE4799 | Spermatogonial stem cells | Growth factor restoration (3 time points) versus withdrawal and baseline | Mouse Genome 430 2.0 (GPL1261) | GEO (no) | 15 |
| GSE9371 | Aorta | Estrogen versus placebo in wild type, ERα- and ERβ-knockouts | Mouse Genome 430 2.0 (GPL1261) | GEO (yes) | 22 |
| BrCa [c] | Breast cancer (MCF7 cell line) | Estrogen versus placebo at 2 time points | Human Genome U133A (GPL96) | GEO (no) | 26 |
| VSMC | Aorta | Estrogen versus placebo at 3 time points | Mouse Genome 430A 2.0 (GPL339) | Authors (yes) | 18 |
| GonFat | Gonadal fat | Female versus male and female versus ovariectomized female | Mouse Genome 430 2.0 (GPL1261) | Authors (yes) | 21 |
| IngFat | Inguinal fat | Female versus male and female versus ovariectomized female | Mouse Genome 430 2.0 (GPL1261) | Authors (yes) | 21 |

[a] Whether raw data (.CEL files) were available (yes) or only preprocessed data (no).
[b] Reyes et al. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3678.
[c] This data set consists of similarly treated samples from GEO series GSE4006, GSE4025 and GSE9936.
Calculation of W from the mean of the matrix of all pairwise correlation coefficients between k judges
| | 1 | 2 | … | k-1 | k |
|---|---|---|---|---|---|
| 1 | 1.0 | ρ_{1,2} | … | ρ_{1,k-1} | ρ_{1,k} |
| 2 | ρ_{2,1} | 1.0 | … | ρ_{2,k-1} | ρ_{2,k} |
| … | … | … | … | … | … |
| k-1 | ρ_{k-1,1} | ρ_{k-1,2} | … | 1.0 | ρ_{k-1,k} |
| k | ρ_{k,1} | ρ_{k,2} | … | ρ_{k,k-1} | 1.0 |
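
As a minimal sketch of the relationship this table summarizes, the Python below computes Kendall's W in two ways: from the rank sums, and as the mean of the full k × k matrix of pairwise Spearman correlations (diagonal included). The two values agree when there are no tied ranks. Treating probe sets as the judges and samples as the ranked items is an assumption about the setup, and the function names are illustrative rather than taken from SCOREM.

```python
def ranks(values):
    """Return ranks 1..n for a list of values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(r1, r2):
    """Spearman correlation between two untied rank vectors."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_w_from_rank_sums(rank_rows):
    """Classical formula: W = 12 S / (k^2 (n^3 - n))."""
    k, n = len(rank_rows), len(rank_rows[0])
    totals = [sum(col) for col in zip(*rank_rows)]  # rank sum per item
    mean_total = k * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (k * k * (n ** 3 - n))


def kendall_w_from_correlations(rank_rows):
    """W as the mean of the full k x k Spearman matrix (diagonal = 1.0)."""
    k = len(rank_rows)
    total = sum(spearman(a, b) for a in rank_rows for b in rank_rows)
    return total / (k * k)


if __name__ == "__main__":
    # Three hypothetical judges ranking four items; the third disagrees.
    rows = [ranks(p) for p in ([1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1])]
    print(kendall_w_from_rank_sums(rows))      # approximately 0.111
    print(kendall_w_from_correlations(rows))   # approximately 0.111
```

With tied ranks the simple Spearman formula used here no longer holds, so a tie-corrected version would be needed for real expression data.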
Figure 1. Distribution of the number of probe sets per gene on the three indicated popular Affymetrix GeneChip® arrays.
Figure 2. Flowchart for algorithms for analyzing gene expression data. (a) Typical Affymetrix processing. (b) SCOREM processing with concordance testing and consolidation.
Degree of probe set consolidation by the SCOREM algorithm in eight data sets
| Experiment | Probe sets (genes) on GeneChip® | Probe sets (genes) after filtering | Unannotated | Single | Redundant (genes) | After SCOREM (genes) | No consolidation (%) | Partial consolidation (%) | Complete consolidation (%) |
|---|---|---|---|---|---|---|---|---|---|
| GSE3678 | 54 675 (19 798) | 18 725 (10 516) | 2578 | 6802 | 9345 (3714) | 6668 (3714) | 44 | 19 | 37 |
| GSE4051 | 45 101 (20 757) | 16 992 (9963) | 1996 | 6645 | 8351 (3318) | 5019 (3318) | 27 | 19 | 54 |
| GSE4799 | 45 101 (20 757) | 44 062 (20 459) | 6597 | 11 006 | 26 459 (9453) | 24 192 (9453) | 80 | 15 | 5 |
| GSE9371 | 45 101 (20 757) | 20 728 (11 184) | 1937 | 6512 | 12 279 (4672) | 9094 (4672) | 47 | 22 | 30 |
| BrCa | 22 283 (12 718) | 13 397 (8558) | 982 | 5939 | 6476 (2619) | 4031 (2619) | 28 | 19 | 53 |
| VSMC | 45 101 (20 757) | 19 914 (11 717) | 1012 | 7171 | 11 731 (4546) | 8224 (4546) | 44 | 19 | 37 |
| GonFat | 45 101 (20 757) | 24 260 (12 586) | 2472 | 7073 | 14 715 (5513) | 9320 (5513) | 30 | 23 | 48 |
| IngFat | 45 101 (20 757) | 24 364 (12 585) | 2590 | 7080 | 14 694 (5505) | 8966 (5505) | 28 | 20 | 52 |
Numbers in parentheses indicate the number of distinct genes represented by those probe sets.
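
The columns of this table are linked by simple arithmetic, which the short Python check below makes explicit: in every row, the filtered probe sets split exactly into unannotated, single and redundant ones. It also reports the average number of values per redundant gene before and after SCOREM. Reading the "After SCOREM" column as the number of values remaining for the redundant genes is an assumption, and the dictionary keys are illustrative names.

```python
# Probe-set counts copied from the consolidation table above:
# (after filtering, unannotated, single, redundant, redundant genes, after SCOREM)
ROWS = {
    "GSE3678": (18725, 2578, 6802, 9345, 3714, 6668),
    "GSE4051": (16992, 1996, 6645, 8351, 3318, 5019),
    "GSE4799": (44062, 6597, 11006, 26459, 9453, 24192),
    "GSE9371": (20728, 1937, 6512, 12279, 4672, 9094),
    "BrCa":    (13397, 982, 5939, 6476, 2619, 4031),
    "VSMC":    (19914, 1012, 7171, 11731, 4546, 8224),
    "GonFat":  (24260, 2472, 7073, 14715, 5513, 9320),
    "IngFat":  (24364, 2590, 7080, 14694, 5505, 8966),
}


def summarize(row):
    """Check the column partition and compute values per redundant gene."""
    filtered, unannotated, single, redundant, genes, after = row
    return {
        "partition_ok": unannotated + single + redundant == filtered,
        "per_gene_before": redundant / genes,
        "per_gene_after": after / genes,
    }


if __name__ == "__main__":
    for name, row in ROWS.items():
        s = summarize(row)
        print(f"{name}: partition_ok={s['partition_ok']}, "
              f"{s['per_gene_before']:.2f} -> {s['per_gene_after']:.2f} values per gene")
```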
Characterization of differentially expressed groups representing differentially expressed genes from eight gene expression data sets
| Data set | Total number of groups | Genes represented by a single group | Genes represented by multiple subgroups: (a) yes versus no | Genes represented by multiple subgroups: (b) small versus large | Genes represented by multiple subgroups: (c) down versus up |
|---|---|---|---|---|---|
| GSE3678 | 1430 | 1383 | 0 | 5 | 0 |
| GSE4051 | 4006 | 3656 | 126 | 24 | 13 |
| GSE4799 | 5122 | 4239 | 348 | 16 | 43 |
| GSE9371 | 1593 | 1509 | 32 | 3 | 2 |
| BrCa | 947 | 915 | 5 | 1 | 0 |
| VSMC | 451 | 447 | 1 | 0 | 0 |
| GonFat | 4559 | 4449 | 14 | 3 | 13 |
| IngFat | 3049 | 2933 | 32 | 2 | 6 |
Columns (a) to (c) refer to discordant groups where (a) one is changing and the other is not; (b) both are changing in the same direction, but with different magnitudes; and (c) one is increasing while the other is decreasing.
Figure 3. Examples of post hoc analysis to determine causes of discordant expression patterns. Genomic DNA is indicated as a line, with exons indicated as boxes on the line. Nucleotide positions within each gene are given with zero representing the presumed transcription start site for the indicated mRNAs. Mapping of each probe set is color-coded for each gene. Translation start and stop codons are indicated by arrows and asterisks, respectively. Whether each probe set is differentially regulated in the experimental samples, and the direction of regulation, are indicated in parentheses after each probe set id. (a) Detection of different expression patterns in known alternate RNA isoforms of Shprh. The probe sets reporting differential expression correspond to exons unique to each of the known isoforms. Expression changes are for experiment GSE4051 at the post-natal day 10 time point. (b) Identification of a possible novel isoform of the gene Dpysl2. The alternate RNA isoform could involve alternate splicing of exons 12–14 or an alternate 3′UTR (polyadenylation site). Expression changes are for experiment GSE4799, deprivation versus untreated. (c) Identification of a potential alternate promoter in Scmh1. Expression changes are for experiment GSE4051 at the embryonic day 16 time point. (d) Identification of potential novel coding exons in Mllt10, in introns 3 and/or 4. Expression changes are for experiment GSE4799, 2 h post-restoration versus untreated.
Figure 4. Graphical analysis of probe sets annotated as representing Rbm39 (encoding RNA binding motif protein 39) from multiple experiments performed on the Mouse 430 2.0 GeneChip®. (a–c) Expression profiles of probe sets across the 22 samples in the GSE9371 data set; W indicates level of concordance. (a) All 10 probe sets; W shows lack of concordance. A dotted line shows a probe set removed by gene filtering (for very low expression or very low variance) and therefore not included in the calculation of W. (b) The largest subgroup includes probe sets 4, 6, 7, 9 and 10; W shows high concordance. (c) The second subgroup includes probe sets 2 and 3; W shows high concordance. (d) Mapping of the 10 probe sets to the genomic sequence of Rbm39 and the exons of its only known transcript. Probes mapped above the line map to the coding (negative) strand, those below the line map to the non-coding (positive) strand. (e–i) Connected subgraphs indicating groups of concordant probe sets in five different experiments. In parentheses are the number of samples in each data set (n) and the critical value for W for a data set of that size. W is given for each group as a whole (bottom) and for each concordant subgroup (below each subgroup). Filled circles indicate statistically significant differential expression in that experiment. A dotted circle indicates a probe set removed during gene filtering in that data set.
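
The connected subgraphs in panels (e–i) can be illustrated with a simplified Python sketch. It is not the SCOREM implementation: it assumes that two probe sets are linked when their pairwise Kendall's W reaches a critical value, and that a concordant group is a connected component of the resulting graph. The probe set ids, the profiles and the threshold of 0.9 are hypothetical, and ties in the expression values are not handled.

```python
def ranks(values):
    """Return ranks 1..n for a list of values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def kendall_w(profiles):
    """Kendall's W for k profiles (judges) over n samples, without ties."""
    rank_rows = [ranks(p) for p in profiles]
    k, n = len(rank_rows), len(rank_rows[0])
    totals = [sum(col) for col in zip(*rank_rows)]
    mean_total = k * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (k * k * (n ** 3 - n))


def concordant_groups(profiles, critical_w):
    """Group probe sets into connected components of the concordance graph."""
    ids = list(profiles)
    adjacency = {i: set() for i in ids}
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            # Assumed edge rule: pairwise W at or above the critical value.
            if kendall_w([profiles[a], profiles[b]]) >= critical_w:
                adjacency[a].add(b)
                adjacency[b].add(a)
    groups, seen = [], set()
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search from each unvisited probe set
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    example = {
        "ps1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "ps2": [2.0, 3.0, 4.0, 5.0, 6.5],
        "ps3": [5.0, 4.0, 3.0, 2.0, 1.0],
        "ps4": [9.0, 7.0, 6.0, 4.0, 2.0],
    }
    print(concordant_groups(example, critical_w=0.9))
```

A connected component does not guarantee that W for the whole group exceeds the critical value, so a fuller treatment would re-test each candidate group, as the per-subgroup W values in the figure suggest.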
Comparison of SCOREM with other methods in terms of detection of differential expression and functional enrichment
| Data Set | Standard (+FDR) | ANOVA | Custom CDF | SCOREM (+FDR) |
|---|---|---|---|---|
| Number of genes called differentially expressed | | | | |
| GSE3678 | 1026 (1064) | 402 | 667 | 1405 (821) |
| GSE4051 | 2829 (2400) | 313 | 1678 | 3829 (1930) |
| GSE4799 | 2146 (323) | 832 | NA | 4660 (298) |
| GSE9371 | 720 (102) | NA | 480 | 1550 (123) |
| BrCa | 73 (1137) | NA | NA | 931 (592) |
| VSMC | 80 (92) | NA | 126 | 449 (53) |
| GonFat | 1753 (6005) | NA | 1966 | 4502 (3455) |
| IngFat | 777 (3925) | NA | 940 | 2989 (1782) |
| Average P-value of top overrepresented GO categories | | | | |
| GSE3678 | 4.5 × 10^−10 (1.8 × 10^−7) | 3.5 × 10^−2 | 2.5 × 10^−6 | 1.3 × 10^−10 (5.1 × 10^−9) |
| GSE4051 | 6.1 × 10^−10 (5.1 × 10^−8) | NA | 2.0 × 10^−4 | 1.6 × 10^−6 (4.9 × 10^−8) |
| GSE4799 | 3.6 × 10^−10 (5.3 × 10^−9) | 1.2 × 10^−2 | NA | 2.0 × 10^−13 (2.8 × 10^−10) |
| GSE9371 | 6.1 × 10^−7 (3.0 × 10^−5) | NA | 8.6 × 10^−6 | 1.8 × 10^−9 (9.5 × 10^−6) |
| BrCa | 3.6 × 10^−4 (5.0 × 10^−5) | NA | NA | 6.3 × 10^−5 (3.0 × 10^−5) |
| VSMC | NA (NA) | NA | 9.8 × 10^−4 | 4.5 × 10^−4 (NA) |
| GonFat | 3.2 × 10^−15 (2.5 × 10^−12) | NA | 1.2 × 10^−9 | 8.3 × 10^−13 (1.7 × 10^−13) |
| IngFat | 1.6 × 10^−8 (4.4 × 10^−9) | NA | 7.9 × 10^−6 | 3.5 × 10^−8 (1.9 × 10^−8) |

[a] Number of differentially expressed genes or average P-values of the top three or five GO categories, as given in Ref. (4).
[b] NA indicates data sets with no raw data available, where the custom CDF approach could not be applied.
[c] P-values are for the top 5 (Custom CDF) or top 10 (SCOREM) overrepresented GO categories; NA indicates fewer than five GO categories with any overrepresentation.
Comparison of the number of probes, probe sets used and genes represented in Affymetrix and Brainarray custom CDF files
| Array | CDF | Probes | Probe sets | Genes |
|---|---|---|---|---|
| Human U133 Plus 2.0 | Affymetrix | 604 258 | 54 675 | 19 798 |
| Human U133 Plus 2.0 | Custom | 277 789 | 19 008 | 18 974 |
| Human U133A | Affymetrix | 247 965 | 22 283 | 12 718 |
| Human U133A | Custom | 167 345 | 12 078 | 12 065 |
| Mouse 430 2.0 | Affymetrix | 496 468 | 45 101 | 20 757 |
| Mouse 430 2.0 | Custom | 240 917 | 17 306 | 17 289 |