James R Bradford, Yvonne Hey, Tim Yates, Yaoyong Li, Stuart D Pepper, Crispin J Miller.
Abstract
BACKGROUND: RNA-Seq exploits the rapid generation of gigabases of sequence data by Massively Parallel Nucleotide Sequencing, allowing for the mapping and digital quantification of whole transcriptomes. Whilst previous comparisons between RNA-Seq and microarrays have been performed at the level of gene expression, in this study we adopt a more fine-grained approach. Using RNA samples from a normal human breast epithelial cell line (MCF-10a) and a breast cancer cell line (MCF-7), we present a comprehensive comparison between RNA-Seq data generated on the Applied Biosystems SOLiD platform and data from Affymetrix Exon 1.0ST arrays. The use of Exon arrays makes it possible to assess the performance of RNA-Seq in two key areas: detection of expression at the granularity of individual exons, and discovery of transcription outside annotated loci.
Year: 2010 PMID: 20444259 PMCID: PMC2877694 DOI: 10.1186/1471-2164-11-282
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary of read counts across different genomic locations.
| | MCF-10a | MCF-7_r1 | MCF-7_r2 |
|---|---|---|---|
| Total | 286,197,907 | 302,129,896 | 150,762,975 |
| After error filtering | 173,966,873 | 205,050,087 | 113,512,672 |
| Mappable | 47,524,622 | 46,330,340 | 33,697,119 |
| Uniquely mappable | 28,371,318 | 28,882,179 | 22,223,910 |
| Location | | | |
| Ensembl known, total | 21,709,397 | 22,031,344 | 16,980,001 |
| Ensembl known, exon | 15,996,190 | 17,439,762 | 12,800,399 |
| Ensembl known, intron | 5,713,207 | 4,591,582 | 4,179,602 |
| All annotation¹ | 24,037,188 | 23,854,633 | 18,830,788 |
| Exon junctions | | | |
| Known | 1,010,785 | 1,225,448 | - |
| Putative | 16,548 | 23,540 | - |

¹Known, predicted and EST transcripts.
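To make the attrition across filtering and mapping easier to read, the counts above can be turned into fractions. The following is a minimal sketch, not part of the original analysis: the dictionary keys and the `fractions` helper are illustrative names, and the choice of ratios is an assumption about what a reader might want to compare.

```python
# Read counts copied from the summary table above (one entry per sample).
counts = {
    "MCF-10a": {"total": 286_197_907, "filtered": 173_966_873,
                "mappable": 47_524_622, "unique": 28_371_318,
                "ensembl": 21_709_397, "exon": 15_996_190, "intron": 5_713_207},
    "MCF-7_r1": {"total": 302_129_896, "filtered": 205_050_087,
                 "mappable": 46_330_340, "unique": 28_882_179,
                 "ensembl": 22_031_344, "exon": 17_439_762, "intron": 4_591_582},
    "MCF-7_r2": {"total": 150_762_975, "filtered": 113_512_672,
                 "mappable": 33_697_119, "unique": 22_223_910,
                 "ensembl": 16_980_001, "exon": 12_800_399, "intron": 4_179_602},
}


def fractions(c):
    """Return illustrative ratios for one sample (hypothetical helper)."""
    return {
        # Share of all reads that end up uniquely mappable.
        "unique_of_total": c["unique"] / c["total"],
        # Share of uniquely mappable reads that fall in Ensembl known genes.
        "ensembl_of_unique": c["ensembl"] / c["unique"],
        # Share of Ensembl known reads that fall in exons rather than introns.
        "exon_of_ensembl": c["exon"] / c["ensembl"],
    }


if __name__ == "__main__":
    for sample, c in counts.items():
        row = ", ".join(f"{k}={v:.3f}" for k, v in fractions(c).items())
        print(f"{sample}: {row}")
```

In each sample the exon and intron counts sum exactly to the Ensembl known total, which is a quick consistency check on the table.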
Figure 1. Read locations. The proportion of unique reads in (A) MCF-10a and (B) MCF-7, mapping to four genomic locations: known exons and introns, as defined by Ensembl, other annotated regions including ESTs, Genscan predictions and Exon array probe selection regions, and un-annotated regions.
Figure 2. Correspondence between RNA-Seq and Exon arrays. (A) Determination of the read count threshold giving optimum correspondence between both platforms with respect to Present/Absent calls. (B) Present/Absent call correspondence at a read count threshold of zero in RNA-Seq and a DABG score threshold of 0.01 on the array. (C) Comparison of fold changes between RNA-Seq and the array. Red dots indicate exons flagged as Present (P) in both samples and on both platforms (PP->PP). Grey dots indicate exons flagged as Absent (A) in at least one sample on both platforms (AA->AA, PA->PA, AP->AP, PA->AP, AP->PA, AA->PA, AA->AP, PA->AA, AP->AA). Note that, due to the density of the data, some grey points representing exons Absent in both RNA-Seq samples (zero fold change) are masked by other colours. Blue dots indicate exons Absent in at least one RNA-Seq sample but flagged Present in both array samples (PA->PP, AA->PP, AP->PP), and green dots represent exons Present in both samples in RNA-Seq but flagged Absent in at least one sample on the array (PP->PA, PP->AA, PP->AP). (D) Overlap between numbers of exons called differentially expressed by the array and RNA-Seq using a log2 fold change threshold of 2.0 on the array and 3.0 in RNA-Seq (left), and a LIMMA p-value threshold of 1 × 10⁻⁴ on the array and an Audic-Claverie p-value threshold of 1 × 10⁻⁷ in RNA-Seq (right). These thresholds lead to the greatest equivalence between platforms using an overlap metric based on the CS (Equation 2).
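The Present/Absent labelling in panels (B) and (C) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: it assumes an exon is Present in RNA-Seq when its read count is strictly above the threshold, Present on the array when its DABG score is strictly below the threshold, and that the label reads RNA-Seq calls first and array calls second, as the caption's colour descriptions imply. The function names are hypothetical.

```python
def call(present: bool) -> str:
    """Map a boolean detection call to the caption's P/A letters."""
    return "P" if present else "A"


def category(reads_1, reads_2, dabg_1, dabg_2,
             read_threshold=0, dabg_threshold=0.01):
    """Build a label such as 'PA->PP' for one exon across two samples.

    Left of the arrow: RNA-Seq calls for sample 1 and sample 2.
    Right of the arrow: array calls for the same two samples.
    The strict inequalities are an assumption, not taken from the source.
    """
    seq = call(reads_1 > read_threshold) + call(reads_2 > read_threshold)
    arr = call(dabg_1 < dabg_threshold) + call(dabg_2 < dabg_threshold)
    return f"{seq}->{arr}"


def colour(label: str) -> str:
    """Assign the panel (C) colour described in the caption."""
    seq, arr = label.split("->")
    if seq == "PP" and arr == "PP":
        return "red"    # Present in both samples on both platforms
    if arr == "PP":
        return "blue"   # Absent in at least one RNA-Seq sample only
    if seq == "PP":
        return "green"  # Absent in at least one array sample only
    return "grey"       # Absent in at least one sample on both platforms


if __name__ == "__main__":
    # Made-up values for a single exon, purely for illustration.
    label = category(reads_1=12, reads_2=0, dabg_1=0.001, dabg_2=0.004)
    print(label, colour(label))
```

With two samples and two platforms there are 16 possible labels: one red, three blue, three green and nine grey, matching the groups listed in the caption.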