Joshua Xu, Zhenqiang Su, Huixiao Hong, Jean Thierry-Mieg, Danielle Thierry-Mieg, David P Kreil, Christopher E Mason, Weida Tong, Leming Shi.
Abstract
Whole-transcriptome sequencing ('RNA-Seq') has been drastically changing the scale and scope of genomic research. In order to fully understand the power and limitations of this technology, the US Food and Drug Administration (FDA) launched the third phase of the MicroArray Quality Control (MAQC-III) project, also known as the SEquencing Quality Control (SEQC) project. Using two well-established human reference RNA samples from the first phase of the MAQC project, three sequencing platforms were tested across more than ten sites with built-in truths including spike-in of external RNA controls (ERCC), titration data and qPCR verification. The SEQC project generated over 30 billion sequence reads representing the largest RNA-Seq data ever generated by a single project on individual RNA samples. This extraordinarily ultradeep transcriptomic data set and the known truths built into the study design provide many opportunities for further research and development to advance the improvement and application of RNA-Seq.
Year: 2014 PMID: 25977777 PMCID: PMC4322577 DOI: 10.1038/sdata.2014.20
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. SEQC study design.
This figure was modified from b presented in the related research manuscript[13]. Similar to the MAQC-I benchmarks, well-characterized RNA samples A and B were augmented by samples C and D, composed of A and B in known mixing ratios of 3:1 and 1:3, respectively. These allow tests for titration consistency and for the correct recovery of the known mixing ratios. Synthetic RNAs from the External RNA Control Consortium (ERCC) were both pre-added to samples A and B before mixing and also sequenced separately to assess dynamic range (samples E and F). Samples were distributed to independent sites for RNA-Seq library construction and profiling by Illumina’s HiSeq 2000 (3+4x) and Life Technologies’ SOLiD 5500 (3+1x). In addition to the replicate libraries A1…D4 at each site, for each platform, one vendor-prepared library A5…D5 was sequenced at all three official sites, giving a total of 24 libraries. At each site, each library had a unique barcode sequence and all libraries were pooled before sequencing, so each lane sequenced the same material, allowing a study of lane-specific effects. Samples A and B were also sequenced by Roche 454 GS FLX at different sites, with two runs each but no library replicates.
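The titration design described above can be turned into a simple per-gene check. The following is a minimal sketch, assuming expression values are on a linear scale, are normalized comparably across the four samples, and that A and B contribute to the mixtures in exactly 3:1 and 1:3 proportions (it ignores, for example, any difference in mRNA content between A and B). The function names are hypothetical and are not taken from the SEQC project.

```python
def expected_mixture(a, b, frac_a):
    """Expected linear-scale expression of a mixture containing
    fraction frac_a of sample A and (1 - frac_a) of sample B."""
    return frac_a * a + (1.0 - frac_a) * b


def is_titration_consistent(a, b, c, d):
    """True if the measured values follow the monotone order implied
    by the mixing ratios: A >= C >= D >= B, or the reverse order."""
    return (a >= c >= d >= b) or (a <= c <= d <= b)


# Illustrative (made-up) expression values for one gene in A and B.
a, b = 80.0, 40.0
c_expected = expected_mixture(a, b, 0.75)  # sample C is 3:1 (A:B)
d_expected = expected_mixture(a, b, 0.25)  # sample D is 1:3 (A:B)
print(c_expected, d_expected)
print(is_titration_consistent(a, b, c_expected, d_expected))
```

A gene whose measured C and D values fall outside the interval spanned by A and B, or in the wrong order, would fail this check under the stated assumptions.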
Figure 2. Mixing scheme to generate the SEQC RNA samples.
This figure was modified from Supplementary Figure S1 presented in the related research manuscript[13]. Samples MAQC-I–A and MAQC-I–B (top) were thawed from the stock acquired during the original MAQC-I study (2006); aliquots were then pooled (blue and grey tubes), adjusted to equal concentration, and mixed with ERCC mix sets E and F, respectively. The ERCC mixes contain 4 subgroups of transcripts with different molar concentration ratios (4.0, 0.67, 0.5, and 1.0) defined between the two mixes (bottom right). Equal portions of these mixtures were then titrated in 3:1 and 1:3 ratios to create samples C and D (bottom). All four samples were finally separated into 10 μl aliquots for storage and distribution to the sequencing sites.
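Because the ERCC spike-ins were added before titration, the subgroup ratios between the two mixes imply expected spike-in ratios between samples C and D as well. The sketch below works this out under explicit assumptions that are not stated in the text: each ratio is read as mix E relative to mix F, the spike-ins are carried into C and D in exactly the 3:1 and 1:3 proportions, and measured abundance is proportional to concentration. The function name is hypothetical.

```python
def expected_c_to_d_ratio(ratio_e_to_f):
    """Expected C:D abundance ratio for an ERCC transcript whose
    concentration ratio between mix E (in A) and mix F (in B) is given."""
    conc_a = ratio_e_to_f  # relative concentration in sample A (assumed E:F)
    conc_b = 1.0           # relative concentration in sample B
    conc_c = 0.75 * conc_a + 0.25 * conc_b  # C is 3 parts A to 1 part B
    conc_d = 0.25 * conc_a + 0.75 * conc_b  # D is 1 part A to 3 parts B
    return conc_c / conc_d


# The four subgroup ratios quoted in the figure legend.
for r in (4.0, 0.67, 0.5, 1.0):
    print(f"E:F = {r:<4} -> expected C:D = {expected_c_to_d_ratio(r):.3f}")
```

Algebraically this is (3r + 1) / (r + 3) for a subgroup ratio r, so the C:D ratios are compressed toward 1 relative to the A:B ratios.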
Number of sequence reads (in millions) produced at each site, listed by sample and library replicate.
Illumina HiSeq 2000 data were provided by 7 sites: BGI (Beijing Genomics Institute), CNL (Weill Cornell Medical College), MAY (Mayo Clinic), AGR (Australian Genome Research Facility), COH (City of Hope), NVS (Novartis), and NYG (the New York Genome Center). Life Technologies SOLiD 5500 data were provided by 4 sites: NWU (Northwestern University), PSU (the Pennsylvania State University), SQW (SeqWright Inc.), and LIV (the University of Liverpool). Roche 454 GS FLX data were provided by 3 sites: MGP (the Medical Genomes Project), NYU (the New York University Medical Center), and SQW (SeqWright Inc.). For each platform, the first three were official sequencing sites.

| Sample | Library | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1 | 189 | 201 | 139 | 318 | 201 | 333 | 76 | 123 | 105 | 118 | 91 | 0.57 | 0.55 | 0.50 |
| A | 2 | 157 | 184 | 256 | 343 | 202 | 382 | 79 | 124 | 115 | 130 | 0.43 | 0.61 | 0.49 | |
| A | 3 | 188 | 134 | 199 | 297 | 201 | 358 | 68 | 105 | 141 | 87 | ||||
| A | 4 | 216 | 250 | 432 | 415 | 196 | 361 | 67 | 135 | 83 | 125 | ||||
| A | 5 | 191 | 144 | 111 | 53 | 34 | 67 | 58 | |||||||
| B | 1 | 228 | 222 | 248 | 353 | 198 | 337 | 71 | 101 | 133 | 121 | 85 | 0.59 | 0.62 | 0.41 |
| B | 2 | 224 | 225 | 219 | 386 | 203 | 349 | 74 | 92 | 82 | 104 | 0.56 | 0.59 | 0.42 | |
| B | 3 | 237 | 226 | 251 | 352 | 201 | 363 | 49 | 87 | 90 | 114 | ||||
| B | 4 | 175 | 188 | 258 | 329 | 192 | 370 | 76 | 148 | 67 | 91 | ||||
| B | 5 | 134 | 121 | 90 | 56 | 38 | 76 | 49 | |||||||
| C | 1 | 183 | 226 | 154 | 412 | 209 | 341 | 75 | 93 | 92 | 87 | 66 | |||
| C | 2 | 226 | 242 | 204 | 317 | 201 | 344 | 75 | 129 | 106 | 88 | ||||
| C | 3 | 193 | 262 | 169 | 318 | 188 | 328 | 79 | 94 | 124 | 91 | ||||
| C | 4 | 187 | 241 | 315 | 390 | 198 | 348 | 79 | 117 | 144 | 93 | ||||
| C | 5 | 200 | 157 | 122 | 61 | 39 | 81 | 49 | |||||||
| D | 1 | 200 | 224 | 207 | 334 | 199 | 343 | 69 | 96 | 134 | 116 | 98 | |||
| D | 2 | 206 | 208 | 172 | 309 | 160 | 333 | 82 | 229 | 131 | 116 | ||||
| D | 3 | 195 | 215 | 256 | 352 | 198 | 391 | 80 | 78 | 101 | 76 | ||||
| D | 4 | 156 | 183 | 253 | 380 | 251 | 394 | 72 | 72 | 92 | 89 | ||||
| D | 5 | 206 | 148 | 102 | 57 | 33 | 66 | 51 | |||||||
| E | 1 | 234 | 159 | 272 | 504 | 67 | 81 | 64 | 96 | ||||||
| E | 2 | 255 | 261 | 258 | 508 | 80 | 106 | 140 | 107 | ||||||
| F | 1 | 193 | 201 | 224 | 533 | 79 | 129 | 142 | 102 | ||||||
| F | 2 | 259 | 236 | 248 | 603 | 77 | 97 | 145 | 105 | ||||||