Andy Chu, Gordon Robertson, Denise Brooks, Andrew J Mungall, Inanc Birol, Robin Coope, Yussanne Ma, Steven Jones, Marco A Marra.
Abstract
The comprehensive multiplatform genomics data generated by The Cancer Genome Atlas (TCGA) Research Network is an enabling resource for cancer research. It includes an unprecedented amount of microRNA sequence data: ~11 000 libraries across 33 cancer types. Combined with initiatives like the National Cancer Institute Genomics Cloud Pilots, such data resources will make intensive analysis of large-scale cancer genomics data widely accessible. To support such initiatives, and to enable comparison of TCGA microRNA data to data from other projects, we describe the process that we developed and used to generate the microRNA sequence data, from library construction through to submission of data to repositories. In the context of this process, we describe the computational pipeline that we used to characterize microRNA expression across large patient cohorts.
Year: 2015 PMID: 26271990 PMCID: PMC4705681 DOI: 10.1093/nar/gkv808
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Process used to generate TCGA miRNA sequence data. Strand-specific library construction was performed in parallel in 96-well plates. The example reference read pileup and stemloop are hsa-mir-21 from miRBase v21.
Annotation priorities (Pr) that are used to resolve multiple annotation type matches for a single alignment location or for multiple alignment locations for a read. See ‘Profiling small RNA abundance’.
| Pr | Annotation type | Database |
|---|---|---|
| 1 | mature strand | miRBase |
| 2 | star strand | |
| 3 | precursor miRNA | |
| 4 | stemloop, from 1 to 6 bases outside the mature strand, between the mature and star strands | |
| 5 | ‘unannotated’, any region other than the mature strand in miRNAs where no star strand is annotated | |
| 6 | snoRNA | UCSC small RNAs and RepeatMasker |
| 7 | tRNA | |
| 8 | rRNA | |
| 9 | snRNA | |
| 10 | scRNA | |
| 11 | srpRNA | |
| 12 | Other RNA repeats | |
| 13 | coding exons with zero annotated CDS region length | UCSC genes |
| 14 | 3′ UTR | |
| 15 | 5′ UTR | |
| 16 | coding exon | |
| 17 | intron | |
| 18 | LINE | UCSC RepeatMasker |
| 19 | SINE | |
| 20 | LTR | |
| 21 | Satellite | |
| 22 | RepeatMasker DNA | |
| 23 | RepeatMasker low complexity | |
| 24 | RepeatMasker simple repeat | |
| 25 | RepeatMasker other | |
| 26 | RepeatMasker unknown | |
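The priority table above can be read as a simple tie-breaking rule: when a read matches several annotation types, keep the one with the lowest priority number. The following is a minimal sketch of that rule only, not the GSC pipeline's actual code. The shortened labels, the `PRIORITY` mapping and the `resolve_annotation` function are illustrative names assumed here.

```python
# Annotation types in priority order, abbreviated from the table above.
# The labels are illustrative shorthand (assumed), not pipeline identifiers.
ANNOTATION_TYPES = [
    "mature strand", "star strand", "precursor miRNA", "stemloop",
    "unannotated", "snoRNA", "tRNA", "rRNA", "snRNA", "scRNA", "srpRNA",
    "other RNA repeats", "coding exon, zero CDS length", "3' UTR", "5' UTR",
    "coding exon", "intron", "LINE", "SINE", "LTR", "Satellite",
    "RepeatMasker DNA", "RepeatMasker low complexity",
    "RepeatMasker simple repeat", "RepeatMasker other", "RepeatMasker unknown",
]

# Map each label to its priority number (1 = highest priority).
PRIORITY = {name: rank for rank, name in enumerate(ANNOTATION_TYPES, start=1)}


def resolve_annotation(candidates):
    """Return the candidate annotation type with the lowest priority number.

    Returns None when there are no candidates; raises KeyError for a label
    that is not in the table.
    """
    candidates = list(candidates)
    if not candidates:
        return None
    return min(candidates, key=PRIORITY.__getitem__)
```

For example, a read whose alignments overlap an intron, a tRNA and a mature strand would be reported as a mature strand under this rule.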
Figure 2. Example library quality graphs generated by the GSC miRNA annotation profiling pipeline for thyroid carcinoma tumor libraries (1). For an individual library: (A and B) Percentage of small RNA annotations as a function of read length for two libraries. (C and D) Distribution of read lengths after adapter trimming with (C) a preferred narrower and (D) a wider insert length distribution. (E) For 496 thyroid tumor libraries, the relationship of miRNA species identified with at least 1 or 10 aligned reads versus all post-filtered (PF) reads aligned to miRNAs. The vertical red line shows the TCGA-specific threshold of 1 M reads.
Average run times (h:min:s) for the five TCGA bladder cancer libraries
| TCGA barcode | GSC | miRDeep2 | ShortStack |
|---|---|---|---|
| TCGA-G2-A2EL-01A-12R-A18B-13 | 3:00:38 (23) | 0:49:44 | 2:59:49 |
| TCGA-G2-A2EK-01A-22R-A18B-13 | 3:44:41 (63) | 0:39:24 | 2:57:15 |
| TCGA-FD-A3B3-01A-12R-A205-13 | 3:07:41 (27) | 0:29:57 | 2:58:43 |
| TCGA-CF-A3MI-01A-11R-A20E-13 | 2:49:32 (15) | 0:35:44 | 3:00:05 |
| TCGA-DK-A3IS-01A-21R-A21E-13 | 3:16:38 (25) | 0:49:45 | 2:59:06 |
The GSC and miRDeep2 pipelines were run with default settings. ShortStack was run with defaults but was constrained to miRBase v16 annotations. For GSC, numbers in parentheses are annotation times, i.e. processing time following read alignment, in minutes.
Percentage of reads annotated as miRNA in the five TCGA libraries
| TCGA barcode | GSC | miRDeep2 | ShortStack |
|---|---|---|---|
| TCGA-G2-A2EL-01A-12R-A18B-13 | 31.0 | 29.5 | |
| TCGA-G2-A2EK-01A-22R-A18B-13 | 32.6 | 44.3 | |
| TCGA-FD-A3B3-01A-12R-A205-13 | 52.5 | 56.2 | |
| TCGA-CF-A3MI-01A-11R-A20E-13 | 35.7 | 32.0 | |
| TCGA-DK-A3IS-01A-21R-A21E-13 | 54.6 | 46.3 |
Bold text marks the maximum percentage for the three methods for a library.
Pearson correlation coefficients for normalized miR abundance profiles generated by the three annotation methods for five TCGA libraries
| TCGA barcode | All miRs: GSC vs MD | All miRs: GSC vs SS | All miRs: MD vs SS | miRs with ≥10 reads: GSC vs MD | miRs with ≥10 reads: GSC vs SS | miRs with ≥10 reads: MD vs SS |
|---|---|---|---|---|---|---|
| TCGA-G2-A2EL-01A-12R-A18B-13 | 0.998 | 0.975 | 0.981 | 0.998 | 0.975 | 0.981 |
| TCGA-G2-A2EK-01A-22R-A18B-13 | 0.998 | 0.997 | 1.000 | 0.998 | 0.997 | 1.000 |
| TCGA-FD-A3B3-01A-12R-A205-13 | 0.998 | 0.997 | 1.000 | 0.998 | 0.997 | 1.000 |
| TCGA-CF-A3MI-01A-11R-A20E-13 | 0.998 | 0.996 | 0.999 | 0.998 | 0.996 | 0.999 |
| TCGA-DK-A3IS-01A-21R-A21E-13 | 0.990 | 0.976 | 0.994 | 0.989 | 0.975 | 0.994 |
Per-sample miR profiles were normalized to reads per 1 M miR-annotated reads (RPM). MD: miRDeep2. SS: ShortStack
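The comparison above rests on two steps: scaling each per-sample profile to reads per million miR-annotated reads, then computing a Pearson correlation between two pipelines' profiles. A minimal sketch of those steps follows. The function names, the use of zero for miRs missing from one profile, and the rule of keeping a miR only when both profiles have at least `min_reads` reads are assumptions made for illustration; the source does not state how the ≥10-read subset was selected.

```python
import math


def to_rpm(counts):
    """Scale raw read counts to reads per 1 M miR-annotated reads (RPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile has no miR-annotated reads")
    return {mir: n * 1_000_000 / total for mir, n in counts.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def profile_correlation(counts_a, counts_b, min_reads=0):
    """Correlate two RPM-normalized profiles over their shared miR list.

    Assumption: a miR is kept only if both profiles have >= min_reads reads;
    miRs absent from a profile count as zero reads.
    """
    mirs = sorted(
        m for m in set(counts_a) | set(counts_b)
        if counts_a.get(m, 0) >= min_reads and counts_b.get(m, 0) >= min_reads
    )
    rpm_a, rpm_b = to_rpm(counts_a), to_rpm(counts_b)
    return pearson([rpm_a.get(m, 0.0) for m in mirs],
                   [rpm_b.get(m, 0.0) for m in mirs])
```

Calling `profile_correlation(gsc_counts, md_counts, min_reads=10)` with two hypothetical count dictionaries would give a value comparable in kind to the right-hand columns of the table.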
Figure 3. Venn diagram of miRNAs detected at any level of read coverage, in at least one of the five bladder cancer libraries, by the GSC, miRDeep2 (MD) and ShortStack (SS) pipelines. (A) All methods were run with default settings. (B) The GSC pipeline was run with lowered quality settings that allowed multimapping and one mismatch. miRs detected by only one or two methods had low RPMs. See Tables 6 and 7.
Number of miRNA species to which reads were mapped using miRBase v16 (of 1222 possible mature miRs)
| TCGA barcode | GSC total | GSC >10 reads | miRDeep2 total | miRDeep2 >10 reads | ShortStack total | ShortStack >10 reads |
|---|---|---|---|---|---|---|
| TCGA-G2-A2EL-01A-12R-A18B-13 | 648 | 429 | 719 | 485 | 687 | 461 |
| TCGA-G2-A2EK-01A-22R-A18B-13 | 638 | 421 | 699 | 466 | 680 | 437 |
| TCGA-FD-A3B3-01A-12R-A205-13 | 612 | 365 | 676 | 404 | 658 | 385 |
| TCGA-CF-A3MI-01A-11R-A20E-13 | 575 | 323 | 629 | 363 | 614 | 348 |
| TCGA-DK-A3IS-01A-21R-A21E-13 | 650 | 426 | 697 | 473 | 689 | 445 |
In a library with 1 M miR-aligned reads, 10 reads corresponds to 10 RPM, i.e. a fraction of 1e-5 of the miR-aligned reads.
The number and normalized abundance (RPM) of miRNA species reported by the GSC, miRDeep2 and ShortStack profiling pipelines, using default settings
| Venn membership | Total miRs | GSC RPM median (mean) | GSC RPM max | miRDeep2 RPM median (mean) | miRDeep2 RPM max | ShortStack RPM median (mean) | ShortStack RPM max |
|---|---|---|---|---|---|---|---|
| All pipelines | 785 | 3.0 (1045) | 223 316 | 2.1 (1044) | 345 035 | 3.18 (1274) | 378 133 |
| GSC and MD | 12 | 0.05 (0.09) | 0.38 | 0.02 (0.03) | 0.14 | | |
| GSC and SS | 18 | 0.14 (0.57) | 7.1 | | | 0.17 (0.62) | 7.4 |
| MD and SS | 49 | | | 0.17 (3.96) | 113.9 | 0.07 (0.77) | 13.7 |
| Only SS | 19 | | | | | 0.04 (3.01) | 42.4 |
| Only MD | 30 | | | 0.06 (0.16) | 1.1 | | |
| Only GSC | 4 | 0.04 (0.04) | 0.08 | | | | |
| Grand Total | 917 | | | | | | |
See Figure 3, compare to Table 7. MD: miRDeep2. SS: ShortStack
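The Venn membership counts in the table can be derived from the per-pipeline sets of detected miRNAs. The sketch below shows one way to tally them. The `venn_membership` function and the dictionary-of-sets input are assumptions made for illustration and are not taken from any of the three pipelines.

```python
from collections import Counter


def venn_membership(detected):
    """Count miRNAs by the exact combination of pipelines that detected them.

    `detected` maps a pipeline name to the set of miRNA names it reported.
    Returns a Counter keyed by a sorted tuple of pipeline names.
    """
    all_mirs = set().union(*detected.values())
    counts = Counter()
    for mir in all_mirs:
        members = tuple(sorted(name for name, mirs in detected.items()
                               if mir in mirs))
        counts[members] += 1
    return counts
```

With keys such as `"GSC"`, `"MD"` and `"SS"`, the entry for `("GSC", "MD", "SS")` corresponds to the "All pipelines" row, `("MD", "SS")` to "MD and SS", and the sum of all entries to the grand total.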
The number and normalized abundance (RPM) of miRNA species reported by the GSC, miRDeep2 and ShortStack profiling pipelines, with GSC settings relaxed to allow up to one base mismatch and any number of mapping locations
| Venn membership | Total miRs | GSC RPM median (mean) | GSC RPM max | miRDeep2 RPM median (mean) | miRDeep2 RPM max | ShortStack RPM median (mean) | ShortStack RPM max |
|---|---|---|---|---|---|---|---|
| All pipelines | 833 | 2.85 (1201) | 273 827 | 3.44 (1201) | 354 265 | 2.91 (1201) | 378 133 |
| GSC and MD | 25 | 0.17 (0.27) | 1.90 | 0.17 (0.42) | 3.07 | | |
| GSC and SS | 25 | 0.17 (0.40) | 4.60 | | | 0.32 (0.62) | 7.4 |
| MD and SS | 1 | | | 0.79 (0.79) | 0.79 | 0.22 (0.22) | 0.22 |
| Only SS | 12 | | | | | 0.21 (4.85) | 42.4 |
| Only MD | 17 | | | 0.28 (0.35) | 1.1 | | |
| Only GSC | 38 | 0.14 (0.15) | 0.34 | | | | |
| Grand Total | 951 | | | | | | |
Average RPM of miRs that were found by both miRDeep2 and ShortStack in all five samples but were not counted as annotated by the GSC pipeline
| MIMAT | miR name | miRBase sequence | miRDeep2 mean RPM | ShortStack mean RPM |
|---|---|---|---|---|
| MIMAT0005911 | miR-1260 | AUCCCACCUCUGCCACCA | 17.34 | 1.86 |
| MIMAT0005927 | miR-1274a | GUCCCUGUUCAGGCGCCA | 0.27 | 0.09 |
| MIMAT0005946 | miR-1280 | UCCCACCGCUGCCACCC | 0.61 | 0.14 |
| MIMAT0005450 | miR-518e-5p | CUCUAGAGGGAAGCGCUUUCUG | 0.43 | 0.06 |
| MIMAT0002831 | miR-519c-5p | CUCUAGAGGGAAGCGCUUUCUG | 0.43 | 0.06 |
| MIMAT0004984 | miR-941 | CACCCGGCUGUGUGCACAUGUGC | 0.92 | 0.09 |