Jinsheng Yu1, Paul F Cliften1, Twyla I Juehne1, Toni M Sinnwell1, Chris S Sawyer1, Mala Sharma1, Andrew Lutz1, Eric Tycksen1, Mark R Johnson1, Matthew R Minton1, Elliott T Klotz1, Andrew E Schriefer1, Wei Yang1, Michael E Heinz1, Seth D Crosby1, Richard D Head2.
Abstract
BACKGROUND: The arrival of RNA-seq as a high-throughput method competitive with the established microarray technologies has driven a need for comparative evaluation. To date, cross-platform comparisons of these technologies have covered relatively few platforms and have typically been oriented around gene name annotation. Here, we present a more extensive yet precise assessment to elucidate differences and similarities in performance across numerous aspects, including dynamic range, fidelity of raw signal and fold-change with sample titration, and concordance with qRT-PCR (TaqMan). To ensure that these results were not confounded by incompatible comparisons, we introduce the concept of a probe-mapping-directed "transcript pattern". A transcript pattern identifies probe(set)s across platforms that target a common set of transcripts for a specific gene. Thus, three levels of data were examined: entire data sets, data derived from a subset of 15,442 RefSeq genes common across platforms, and data derived from the transcript pattern defined subset of 7,034 RefSeq genes.
Year: 2015 PMID: 26385698 PMCID: PMC4575490 DOI: 10.1186/s12864-015-1913-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Number of features analyzed in comparisons for all 6 platforms/protocols
| Platforms /Protocols | # Features at probe(set) & exon level | # Non-control features at probe(set) & exon level | # Features at gene & transcript cluster level | # Genes in RefSeq & Ensembl gene database | # Detected genes & transcript clusters * | # Genes & transcript clusters for 15442 common RefSeq genes | # Probe(set)s & exons for transcript pattern restricted 7034 RefSeq genes | # Probe(set)s & exons for transcript pattern not restricted 7034 RefSeq genes |
|---|---|---|---|---|---|---|---|---|
| AGLN | 43,376 | 41,000 | 29,066 | 24,961 | 22,056 | 15,442 | 8,668 | 11,453 |
| Gene1.0 | 257,430 | 253,002 | 28,869 | 20,796 | 28,105 | 16,516 | 11,449 | 84,196 |
| HTA2.0 | 914,585 | 573,909 | 67,526 | 25,195 | 67,309 | 16,400 | 14,908 | 138,484 |
| ILMN | 47,306 | 47,214 | 34,589 | 31,320 | 23,377 | 15,442 | 8,539 | 11,414 |
| ClonTech | 1,298,791 | 1,298,791 | 62,893 | 49,085 | 43,266 | 15,594 | 16,470 | 334,318 |
| RiboZero | 1,298,791 | 1,298,791 | 62,893 | 52,865 | 48,378 | 15,594 | 16,470 | 334,318 |
Notes:
1. AGLN: 41,000 probes represent 29,066 genes by GeneSymbol & SystematicName; 29,066 “genes” are composed of 24,961 entries with symbols and 4,105 with Agilent probe_ids only
2. Gene1.0: 253,002 probesets represent 28,869 genes by Affymetrix “transcript_cluster_id” (TC); 28,869 TCs are composed of 20,796 genes with symbols and 6,209 without symbols (NetAffx na33.2)
3. HTA2.0: 573,909 probesets represent 67,526 genes by Affymetrix “transcript_cluster_id” (TC); 67,526 TCs are composed of 25,195 genes with symbols and 40,696 without symbols (NetAffx na33)
4. ILMN: 47,214 probes represent 34,589 Illumina named “genes”, of which 31,320 have official gene symbols and 3,269 are labeled with Unigene_ids (Hs.xxxxxx)
5. ClonTech: 1,298,791 exons represent 62,893 Ensembl genes (ENSGs) in the R72 database; of these 62,893, 49,085 ENSGs have at least 1 read in any of the 5 samples
6. RiboZero: 1,298,791 exons represent 62,893 Ensembl genes (ENSGs) in the R72 database; of these 62,893, 52,865 ENSGs have at least 1 read in any of the 5 samples
*A “detected” call for AGLN at gene level and for Gene1.0 and HTA2.0 at transcript cluster level was made if any probe(set) was “detected” in any sample by p < 0.05 (AGLN) or p < 0.01 (Gene1.0 & HTA2.0); for RNA-seq data, a detection call was made if any sample had a cpm > 0.25. ILMN data have “detected” calls by p < 0.05 at both probe and gene levels
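The RNA-seq detection rule in the footnote above (a gene counts as "detected" if any sample has a cpm above 0.25) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes cpm is the raw count divided by the sample's total count, times one million, and the function names and toy counts are hypothetical.

```python
def cpm(counts_by_sample):
    """Convert raw counts {sample: {gene: count}} to counts per million.

    Assumption: library size is the sum of all gene counts in the sample.
    """
    out = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        out[sample] = {g: c / library_size * 1e6 for g, c in counts.items()}
    return out


def detected_genes(counts_by_sample, threshold=0.25):
    """Return genes whose cpm exceeds `threshold` in at least one sample."""
    detected = set()
    for per_gene in cpm(counts_by_sample).values():
        detected.update(g for g, value in per_gene.items() if value > threshold)
    return detected


# Hypothetical toy counts for two samples and three genes.
toy_counts = {
    "sample_A": {"g1": 9_999_998, "g2": 2, "g3": 0},   # g2 cpm = 0.2
    "sample_B": {"g1": 4_999_990, "g2": 0, "g3": 10},  # g3 cpm = 2.0
}
print(sorted(detected_genes(toy_counts)))
```

In the toy data, g2 has reads but stays below the threshold in every sample, so it is not called detected, whereas g3 passes in one sample and is.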
Fig. 1 Diagram of transcript patterns defined in the current study. A transcript pattern selection region covers a set of transcripts that share a certain exon or exon region recognized by a probe. The model gene in this diagram is TBP (exon/intron sizes modified for illustration purposes). For example, Affymetrix probe a targets transcript pattern selection region #1, which covers transcripts 001 and 003–005 but excludes transcript 002, and thereby defines transcript pattern A. Thus, signals from Affymetrix HTA2.0 probes b, d, and e, from Gene1.0 probes c and d, from Agilent probe a, and from Illumina probes a and b will be used to summarize the expression level of the common transcript pattern B within a platform. Further, because transcript patterns B and E are common across all microarray platforms, they are kept in the transcript pattern derived subset data as two separate data points although they represent the expression level of the same specific gene (TBP)
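The grouping idea in the caption can be sketched in code: probes that target exactly the same set of transcripts of a gene share a transcript pattern, and only patterns covered on every platform are kept for cross-platform comparison. This is a hedged sketch under that reading of the caption; the function name, the input layout, and the toy probe-to-transcript assignments are hypothetical and do not reproduce the actual figure.

```python
from collections import defaultdict


def common_transcript_patterns(probe_map):
    """Group probes by (gene, set of targeted transcripts).

    probe_map: {platform: {probe_id: (gene, iterable of transcript ids)}}
    Returns {(gene, frozenset(transcripts)): {platform: [probe_ids]}},
    restricted to patterns that are targeted on every platform.
    """
    patterns = defaultdict(lambda: defaultdict(list))
    for platform, probes in probe_map.items():
        for probe_id, (gene, transcripts) in probes.items():
            key = (gene, frozenset(transcripts))
            patterns[key][platform].append(probe_id)

    n_platforms = len(probe_map)
    return {
        key: {platform: sorted(ids) for platform, ids in by_platform.items()}
        for key, by_platform in patterns.items()
        if len(by_platform) == n_platforms
    }


# Hypothetical probe annotations for one gene on two platforms.
toy_probe_map = {
    "HTA2.0": {
        "a": ("TBP", ["001", "003", "004", "005"]),
        "b": ("TBP", ["001", "002"]),
        "d": ("TBP", ["001", "002"]),
    },
    "AGLN": {
        "a": ("TBP", ["001", "002"]),
    },
}
shared = common_transcript_patterns(toy_probe_map)
```

Signals from all probes listed under one pattern and platform would then be summarized into a single expression value for that pattern.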
Samples used in the study
| Sample Name | Ratio of AGO:BMO | % BMO | Replicates: AGLN | Replicates: Gene1.0 | Replicates: HTA2.0 | Replicates: ILMN | Replicates: ClonTech | Replicates: RiboZero | Replicates: TaqMan |
|---|---|---|---|---|---|---|---|---|---|
| AGO | 1:0 | 0 | 1 | 2 | 2 | 2 | 1 | 1 | 6 |
| AG1BM4 | 1:4 | 75 | 2 | 2 | 2 | 3 | 1 | 1 | 6 |
| AG1BM16 | 1:16 | 93.75 | 2 | 2 | 2 | 2 | 1 | 1 | 6 |
| AG1BM64 | 1:64 | 98.4375 | 2 | 2 | 2 | 2 | 1 | 1 | 6 |
| BMO | 0:1 | 100 | 1 | 2 | 2 | 3 | 1 | 1 | 6 |
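The % BMO values for the three mixed samples are consistent with reading a ratio of 1:n as AGO making up 1/n of the mixture, so that % BMO = 100 × (1 − 1/n). This reading is inferred from the table values rather than stated in the source, and the helper name below is hypothetical.

```python
from fractions import Fraction


def percent_bmo(n):
    """% BMO for a 1:n sample, assuming AGO is 1/n of the total mixture."""
    return float(100 * (1 - Fraction(1, n)))


for n in (4, 16, 64):
    print(f"1:{n} -> {percent_bmo(n)} % BMO")
```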
Fig. 2 (a) Raw signal range of the two pure RNA samples (AGO and BMO) in the entire data set, represented by Box-Whisker plot (max, 75 %, median, 25 % and min); (b) signal-to-background ratio, indicated by the 99th percentile of mean signal and background values determined with the entire set of raw data in all 5 RNA samples; (c) fidelity of signal to sample titration by correlation, showing the full distribution of the correlation coefficient between signal and titration across the 5 RNA samples in the entire set of raw data, with emphasis on the percent of features (probes/probesets for microarray, genes for RNA-seq) with an absolute correlation coefficient greater than 0.5; (d) signal similarity matrix of AGO and BMO samples across all 6 platforms, generated with Spearman rank correlation using signal/count data of the RefSeq gene symbol aligned 15,442 genes and the transcript pattern defined 7,034 genes
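The titration-fidelity summary in panel (c) can be illustrated with a short sketch that computes, for each feature, the correlation between its signal and the titration level across the 5 samples, then reports the percent of features with an absolute coefficient above 0.5. Two assumptions are made here that the caption does not confirm: the coefficient is Pearson's r, and the titration variable is % BMO. The function names and toy signals are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def percent_titration_responsive(signals, titration, cutoff=0.5):
    """Percent of features whose |r| with the titration exceeds `cutoff`."""
    hits = sum(1 for s in signals.values() if abs(pearson(s, titration)) > cutoff)
    return 100 * hits / len(signals)


# % BMO for AGO, AG1BM4, AG1BM16, AG1BM64, BMO (from the samples table).
titration = [0, 75, 93.75, 98.4375, 100]

# Hypothetical log-scale signals for four features.
toy_signals = {
    "f1": [10, 8, 7, 6.5, 6],  # falls with increasing BMO
    "f2": [5, 5, 5, 5, 5],     # flat
    "f3": [1, 2, 3, 4, 5],     # rises with increasing BMO
    "f4": [7, 7, 7, 7, 7],     # flat
}
print(percent_titration_responsive(toy_signals, titration))
```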
Fig. 3 Bar charts for platform comparisons of the magnitude of differential expression, determined by average absolute fold-change. Average absolute fold-change was analyzed for each titration across all 6 platforms in the entire data set (a) as well as in the transcript pattern (TP) restricted 7,034 subset (b). To ascertain the magnitude of differential expression for a platform as a whole, the 4 average absolute fold-changes of the full titrations were averaged for both all genes and detectable genes in the entire data set (c), as well as in the TP non-restricted and TP restricted 7,034 RefSeq gene subsets (d). To gauge platform fidelity in fold-change along the sample titrations, the percent of genes with a Pearson correlation > +0.5 is indicated in panels (a) and (b). In addition, the fold-change enhancement, determined as the difference in average absolute fold-change between the bar elements from left to right for each platform, is indicated with dotted green lines in panels (c) and (d). Statistics are also given in panels (c) and (d) for the difference in average fold-change between AGLN and the other platforms, for the entire data set and for the TP-defined subset. When compared to AGLN, the average absolute fold-change was significantly lower in all platforms (p < 0.01–0.001) in the entire data set; in the TP restricted 7,034 subset, the difference was statistically significant for the 3 other microarray platforms (p < 0.01) but not for the RNA-seq protocols (p > 0.05)
Fig. 4 Scatter plots for log2 ratio of the BMO vs. AGO contrast for platform comparisons using data of (a) 15,442 common RefSeq genes and (b) transcript pattern restricted 7,034 RefSeq genes, each against AGLN. The dotted green lines are trend lines by linear regression, and the red lines are diagonal lines of the frames. The deviation of green lines from red lines indicates the degree of fold-change compression that can be quantified by slope values in the equations
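The slope-based reading of fold-change compression can be made concrete with an ordinary least-squares fit of one platform's log2 ratios against AGLN's. A slope of 1 would lie on the diagonal, while a slope below 1 means the platform reports smaller log2 ratios than AGLN for the same genes. This is a minimal sketch with hypothetical names and toy values; which platform goes on which axis is an assumption, and the caption does not say whether the authors constrained the intercept.

```python
def ols_slope_intercept(x, y):
    """Least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical log2(BMO/AGO) ratios for five genes on two platforms.
agln_log2 = [-4, -2, 0, 2, 4]
other_log2 = [-2, -1, 0, 1, 2]

slope, intercept = ols_slope_intercept(agln_log2, other_log2)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

In this toy case the second platform reports exactly half of each AGLN log2 ratio, so the fitted slope is 0.5.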
Fig. 5 TaqMan qRT-PCR validation. (a-f) Fold-change data derived from titrated samples. 6 genes were selected because the TaqMan probes targeted the same transcript pattern as did the microarray probe(set)s; (g) mean fold-change of the 6 genes across platforms; extreme values in (a-d) and (g) are indicated with a broken y-axis and the actual data; (h) mean fold-change of the 4 titrations across platforms; (i) concordance of fold-change between TaqMan qPCR (X) and microarrays/RNA-seq protocols (Y). 4 different calls were made: compressed, opposite, overestimate, and concordant. When two compared fold-changes are in the same direction but the ratio X/Y is greater than or equal to 2, a call of “compressed” is assigned. Similarly, if the fold-change ratio X/Y is less than or equal to 0.5, the comparison is deemed “overestimate”. Fold-change ratios between these values are deemed “concordant”. When two fold-changes are not in the same direction and either of them is greater than 2 or less than 0.5, the comparison is determined to be “opposite”. Concordance rates were calculated as the number of genes with “concordant” and “overestimate” calls divided by the total number of genes analyzed
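The four calls in panel (i) can be written out as a small decision function. This sketch makes several assumptions that the caption leaves open and should not be read as the authors' implementation: fold-changes are positive ratios (above 1 for up-regulation, below 1 for down-regulation); X/Y is compared on magnitudes, so that "compressed" always means the platform shows the smaller change; and pairs with opposite directions where neither value passes the 2-fold mark are returned as "unclassified", because the caption does not say how they are handled. All names and toy pairs are hypothetical.

```python
def concordance_call(taqman_fc, platform_fc):
    """Classify one TaqMan (X) vs. platform (Y) fold-change pair.

    Inputs are positive ratios; a value of exactly 1 is treated as 'up'.
    """
    x_up, y_up = taqman_fc >= 1, platform_fc >= 1
    if x_up != y_up:
        beyond = any(fc > 2 or fc < 0.5 for fc in (taqman_fc, platform_fc))
        # The caption does not define a call when neither value passes 2-fold.
        return "opposite" if beyond else "unclassified"

    # Compare magnitudes so down-regulated pairs follow the same rule.
    x_mag = max(taqman_fc, 1 / taqman_fc)
    y_mag = max(platform_fc, 1 / platform_fc)
    ratio = x_mag / y_mag
    if ratio >= 2:
        return "compressed"
    if ratio <= 0.5:
        return "overestimate"
    return "concordant"


def concordance_rate(pairs):
    """Share of pairs called 'concordant' or 'overestimate', per the caption."""
    calls = [concordance_call(x, y) for x, y in pairs]
    return sum(c in ("concordant", "overestimate") for c in calls) / len(calls)


# Hypothetical (TaqMan, platform) fold-change pairs.
toy_pairs = [(8, 2), (2, 8), (4, 3), (4, 0.25)]
print([concordance_call(x, y) for x, y in toy_pairs])
print(concordance_rate(toy_pairs))
```

Counting "overestimate" together with "concordant" in the rate follows the caption's definition, which treats only compression and direction reversal as failures.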