Lenore Pipes, Sheng Li, Marjan Bozinoski, Robert Palermo, Xinxia Peng, Phillip Blood, Sara Kelly, Jeffrey M Weiss, Jean Thierry-Mieg, Danielle Thierry-Mieg, Paul Zumbo, Ronghua Chen, Gary P Schroth, Christopher E Mason, Michael G Katze.
Abstract
RNA-based next-generation sequencing (RNA-Seq) provides a tremendous amount of new information regarding gene and transcript structure, expression and regulation. This is particularly true for non-coding RNAs, where whole-transcriptome analyses have revealed that much of the genome is transcribed and that many non-coding transcripts have widespread functionality. However, uniform resources for raw, cleaned and processed RNA-Seq data are sparse for most organisms, and this is especially true for non-human primates (NHPs). Here, we describe a large-scale RNA-Seq data and analysis infrastructure, the NHP reference transcriptome resource (http://nhprtr.org); it presently hosts data from 12 species of primates, to be expanded to 15 species/subspecies spanning great apes, Old World monkeys, New World monkeys and prosimians. Data are collected for each species using pools of RNA from comparable tissues. We provide data access in advance of its deposition at NCBI, as well as browsable tracks of alignments against the human genome using the UCSC genome browser. This resource will continue to host additional RNA-Seq data, alignments and assemblies as they are generated over the coming years, and will provide a key resource for the annotation of NHP genomes as well as informing primate studies on evolution, reproduction, infection, immunity and pharmacology.
Year: 2012 PMID: 23203872 PMCID: PMC3531109 DOI: 10.1093/nar/gks1268
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Species of the NHPRTR. Animals were chosen to represent large evolutionary distances, encompassing hominoids, Old World monkeys, New World monkeys and prosimians. Two geographic subspecies were included for each of the following species: rhesus macaques (Indian-origin and Chinese-origin) and cynomolgus macaques (Mauritian-origin and Indonesian-origin).
Figure 2. Tissue sources and methods for library construction and sequencing. (Top) The tissues being sequenced cover 21 regions that focus on the brain, immunological and sexual tissues, as well as general tissues important for pharmacogenomics. The majority of the individual tissues will eventually be sequenced as individual libraries to examine tissue-specific expression patterns. (Middle) Three different biochemical techniques were used for preparing cDNA libraries to enable the broadest examination of the transcriptome for each species. We used an RNA-ligation method for all RNA species (Total RNA-Seq), poly-A enriched cDNA synthesis (mRNA-Seq) and another version of mRNA-Seq that maintained the Watson or Crick strand of origin for the transcript by using dUTPs during second strand synthesis (UDG). (Bottom) All cDNA libraries were then subjected to a DNA normalization step using duplex-specific nuclease treatment, and then all samples were clustered, sequenced and processed using standard Illumina methods and materials, generating 41 billion reads.
Summary of current data in NHPRTR. The 40.5 billion reads span three different library preparation methods and two sequencing instruments (GAIIx and the HiSeq2000).
| Species | File size (GB) | HiSeq2000 protocol (2 × 100 nt paired-end reads) | HiSeq2000 number of read pairs | GAII protocol (100 nt single-end reads) | GAII number of reads |
|---|---|---|---|---|---|
| Baboon | 973 | mRNA-seq | 955 573 799 | mRNA-seq | 71 477 607 |
| | | UDG mRNA-seq | 918 735 897 | UDG mRNA-seq | 67 763 503 |
| | | | | Total RNA-seq | 151 524 634 |
| Chimpanzee | 94.9 | Total RNA-seq | 198 954 000 | | |
| | 399.1 | UDG mRNA-seq | 836 864 082 | | |
| Cynomolgus Macaque Indochinese | 948 | mRNA-seq | 923 307 160 | mRNA-seq | 72 016 960 |
| | | UDG mRNA-seq | 894 367 594 | UDG mRNA-seq | 63 820 198 |
| | | | | Total RNA-seq | 157 762 299 |
| Cynomolgus Macaque Mauritian | 656 | mRNA-seq | 503 249 742 | mRNA-seq | 90 166 271 |
| | | UDG mRNA-seq | 557 450 722 | UDG mRNA-seq | 91 262 189 |
| | | | | Total RNA-seq | 176 108 975 |
| Gorilla | 98.5 | Total RNA-seq | 206 526 535 | | |
| | 422.6 | UDG mRNA-seq | 886 261 413 | | |
| Japanese Macaque | 986 | mRNA-seq | 942 269 530 | mRNA-seq | 77 740 433 |
| | | UDG mRNA-seq | 943 158 996 | UDG mRNA-seq | 72 925 864 |
| | | | | Total RNA-seq | 181 184 542 |
| Marmoset | 128.8 | Total RNA-seq | 269 969 905 | | |
| | 418.9 | UDG mRNA-seq | 878 369 246 | | |
| Mouse Lemur | 97.5 | Total RNA-seq | 204 494 231 | | |
| | 378.9 | UDG mRNA-seq | 794 659 816 | | |
| Pig-tailed Macaque | 951 | mRNA-seq | 867 009 248 | mRNA-seq | 54 292 043 |
| | | UDG mRNA-seq | 991 993 458 | UDG mRNA-seq | 54 668 320 |
| | | | | Total RNA-seq | 131 548 564 |
| Rhesus Macaque Chinese | 700 | mRNA-seq | 644 468 744 | mRNA-seq | 77 142 089 |
| | | UDG mRNA-seq | 661 177 666 | UDG mRNA-seq | 75 142 089 |
| | | | | Total RNA-seq | 121 570 595 |
| Rhesus Macaque Indian | 1331.2 | mRNA-seq | 1 716 083 364 | mRNA-seq | 84 892 037 |
| | | UDG mRNA-seq | 704 493 397 | UDG mRNA-seq | 70 346 332 |
| | | | | Total RNA-seq | 168 710 995 |
| Ring-tailed Lemur | 104.8 | Total RNA-seq | 219 647 886 | | |
| | 398.7 | UDG mRNA-seq | 835 972 568 | | |
| Sooty Mangabey | 106 | Total RNA-seq | 222 192 769 | | |
| | 424.4 | UDG mRNA-seq | 889 864 522 | | |
| Total | 9618.3 | | 18 667 116 290 | | 2 112 066 539 |
| Total number of reads | | | 39 446 299 119 | | |
Although sequencing was not identical across species, owing to increased use of the higher-output HiSeq2000 for later species, both total RNA and polyA-enriched RNA preparations are represented for every species.
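The "Total number of reads" figure in the table can be reproduced from the two instrument totals. The sketch below assumes that each HiSeq2000 read pair counts as two reads and each GAII single-end read counts once; the source does not state this rule, but it is the reading under which the column totals agree with the grand total. The constant and function names are illustrative only.

```python
# Column totals copied from the "Total" row of the table.
HISEQ_READ_PAIRS = 18_667_116_290  # HiSeq2000, 2 x 100 nt paired-end read pairs
GAII_READS = 2_112_066_539         # GAII, 100 nt single-end reads


def total_reads(read_pairs: int, single_end_reads: int) -> int:
    """Count reads, assuming every paired-end fragment contributes two reads
    and every single-end read contributes one (an assumption, see lead-in)."""
    return 2 * read_pairs + single_end_reads


print(f"{total_reads(HISEQ_READ_PAIRS, GAII_READS):,}")  # 39,446,299,119
```

Under this counting rule the result matches the 39 446 299 119 reads reported in the last row of the table.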
Figure 3. Organization of the NHPRTR. We have designed our database and the interface page to give users a clear sense of the goals, organization and data types present. Pages include the project background (about.html), the latest updates (status.html), connections to other sites (links.html), contact information (contact.html) and the data from the NHPRTR (data.html, including md5sums). From the data page (middle), users can access the raw data or various forms of the processed data, including cleaned data, alignments (using BWA, TopHat or Magic) and assemblies (Oases, TransAbyss and Trinity). The data page will continually update as data are submitted and as work is completed.
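Because the data page lists md5sums alongside the files, a downloaded file can be checked for corruption before analysis. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file path and published checksum are assumed to be supplied by the user, since the source does not describe the file naming or the exact format of the checksum listing.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks so that
    very large sequencing files need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: Path, published_md5: str) -> bool:
    """Compare a local file against a checksum copied from the data page.
    The checksum string is supplied by the caller (hypothetical input)."""
    return md5_of_file(path) == published_md5.strip().lower()
```

A caller would pass the local path of a downloaded file and the checksum string shown next to it on the data page; a `False` result suggests the download should be repeated.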
Figure 4. Browsable tracks. (A) We used the BWA aligner to create cross-species maps of expression, based on alignment to orthologous sequences of the human genome. Here we show the three library preparation methods (TOT, UDG, RNA), with one in each track for seven species. (B) The inset shows how the Total RNA preparation method (middle expression track) can more readily discern non-polyadenylated genes, such as the histone genes.