Shi-Jian Zhang, Chu-Jun Liu, Mingming Shi, Lei Kong, Jia-Yu Chen, Wei-Zhen Zhou, Xiaotong Zhu, Peng Yu, Jue Wang, Xinzhuang Yang, Ning Hou, Zhiqiang Ye, Rongli Zhang, Ruiping Xiao, Xiuqin Zhang, Chuan-Yun Li.
Abstract
Although the rhesus macaque is a unique model for the translational study of human diseases, its use in biomedical research is still in its infancy due to error-prone gene structures and limited annotations. Here, we present RhesusBase for the monkey research community (http://www.rhesusbase.org). We performed strand-specific RNA-Seq studies in 10 macaque tissues and generated 1.2 billion 90-bp paired-end reads, covering >97.4% of the putative exons in macaque transcripts annotated by Ensembl. We found that at least 28.7% of the macaque transcripts were previously mis-annotated, mainly due to incorrect exon-intron boundaries, incomplete untranslated regions (UTRs) and missed exons. Compared with the previous gene models, the revised transcripts show clearer sequence motifs near splicing junctions and the ends of UTRs, as well as cleaner patterns of exon-intron distribution for expression tags and cross-species conservation scores. Strikingly, 1292 exon-intron boundary revisions between coding exons corrected previously mis-annotated open reading frames. The revised gene models were experimentally verified in randomly selected cases. We further integrated functional genomics annotations from >60 categories of public and in-house resources and developed an online accessible database. User-friendly interfaces were developed to update, retrieve, visualize and download the RhesusBase meta-data, providing a 'one-stop' resource for the monkey research community.
Year: 2012 PMID: 22965133 PMCID: PMC3531163 DOI: 10.1093/nar/gks835
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Statistics of RNA-Seq coverage on fine-scale monkey transcript structure
| Categories | Total | Covered | Percentage |
|---|---|---|---|
| Exons | 360 789 | 351 311 | 97.4 |
| Junctions | 317 969 | 273 967 | 86.2 |
| Transcripts | 42 820 | 33 914 | 79.2 |
Total: number of exons, junctions or transcripts on the basis of Ensembl gene models.
Covered: number of exons, junctions or transcripts covered by expression tags.
28.7% of Ensembl macaque transcripts were convincingly refined
| Categories | Events | Transcripts | Percentage |
|---|---|---|---|
| Junctions | 4 054 | 2 947 | 6.9 |
| 5′UTRs | 2 267 | 2 267 | 5.3 |
| 3′UTRs | 7 917 | 7 917 | 18.5 |
| New exons | 2 427 | 1 602 | 3.7 |
| Total | 16 665 | 12 303 | 28.7 |
Percentage: percentage of revised Ensembl transcripts.
Total: number of transcripts involved in the four types of refinement. Transcripts with two or more revisions were counted once.
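The percentages in the two tables above can be re-derived from the listed counts. The following is a minimal sketch of that arithmetic, not the authors' code. It assumes that the percentages in the refinement table use the 42 820 Ensembl transcripts from the coverage table as the denominator, which is an inference from the numbers rather than something the tables state; the helper name `pct` is illustrative.

```python
# Counts copied from the two tables above.
TOTAL_TRANSCRIPTS = 42_820  # assumed denominator for the refinement table

# category -> (total, covered)
coverage = {
    "Exons": (360_789, 351_311),
    "Junctions": (317_969, 273_967),
    "Transcripts": (42_820, 33_914),
}

# category -> (events, transcripts)
revised = {
    "Junctions": (4_054, 2_947),
    "5'UTRs": (2_267, 2_267),
    "3'UTRs": (7_917, 7_917),
    "New exons": (2_427, 1_602),
}
TOTAL_REVISED_TRANSCRIPTS = 12_303  # transcripts with several revisions counted once


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


coverage_pct = {name: pct(covered, total) for name, (total, covered) in coverage.items()}
revised_pct = {name: pct(n_tx, TOTAL_TRANSCRIPTS) for name, (_, n_tx) in revised.items()}
total_events = sum(events for events, _ in revised.values())
total_pct = pct(TOTAL_REVISED_TRANSCRIPTS, TOTAL_TRANSCRIPTS)

print(coverage_pct)
print(revised_pct)
print(total_events, total_pct)
```

The per-category transcript counts sum to more than 12 303 because a transcript with revisions in several categories appears in each of them but is counted once in the total.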
Figure 1. Evaluation of refined exon/intron boundaries. (A) Normalized mRNA-Seq expression tag coverage for each refined splicing junction in different categories. Exon: exonic regions defined by both gene models; Intron: intronic regions defined by both gene models; RhesusBase Exon: exonic regions defined by revised gene models, while intronic regions by previous gene models; RhesusBase Intron: intronic regions defined by revised gene models, while exonic regions by previous gene models. (B) Intron-exon distributions of cross-species conservation score. Reference: splicing junction supported by both gene models; Ensembl: splicing junction defined by Ensembl; RhesusBase: refined splicing junction in this study. (C) Sequence motifs flanking the splicing junctions calculated on the basis of previous gene models (Ensembl) and revised gene models (RhesusBase). Reference: distribution calculated using 242 603 splicing junctions supported by both gene models with at least two independent expression tags across the splicing junction; Ensembl/RhesusBase: distributions calculated using 1793 acceptor sites and 2261 donor sites on the basis of previous gene models and revised gene models. (D) One example of a revised transcript. Both the previous gene models (Ensembl) and the revised gene models (RhesusBase) are shown. RNA-Seq expression tag coverage and splicing junctions indicated by expression tags across junctions, cross-species conservation score, as well as sequenced cDNA fragments are aligned accordingly. Strand information is indicated by arrows on transcripts and exon boundaries are indicated by vertical dashed lines. The sequence surrounding the splicing junction is indicated, in which GT–AG or GC–AG sites are highlighted in red.
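Panel D of Figure 1 highlights GT–AG or GC–AG dinucleotides at the ends of introns. As a rough illustration of how such sites can be labelled, the sketch below classifies an intron by its first and last two bases. It assumes each intron is supplied as a 5′→3′ sense-strand string; the function name `classify_intron` and the toy sequences are hypothetical and are not taken from RhesusBase.

```python
from collections import Counter


def classify_intron(intron_seq: str) -> str:
    """Label an intron by its donor (first 2 nt) and acceptor (last 2 nt) dinucleotides."""
    seq = intron_seq.upper().replace("U", "T")  # accept RNA or DNA letters
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 nt long")
    donor, acceptor = seq[:2], seq[-2:]
    if acceptor == "AG" and donor == "GT":
        return "GT-AG"
    if acceptor == "AG" and donor == "GC":
        return "GC-AG"
    return "other"


# Toy sequences for illustration only (not real macaque introns).
toy_introns = [
    "GTAAGTCCCTTTCAG",
    "GCAAGTTTTTCAG",
    "guaagucccuuucag",
    "ATATCCTTAC",
]
motif_counts = Counter(classify_intron(s) for s in toy_introns)
print(motif_counts)
```

Tallying the labels over all junctions of a gene model gives a quick check of how many junctions carry one of these two motifs.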
Figure 2. Evaluation of extended 5′- or 3′-UTRs. (A) Frequencies of AAUAAA hexamer near the end of the 3′-UTRs, on the basis of previous gene models (Ensembl) and the revised gene models (RhesusBase). Negative controls were generated using flanking regions near the start site of these transcripts (Negative Controls). (B) Frequencies of AAUAAA hexamer near the end of the 3′-UTRs, for transcript annotations in human and Ensembl annotations in rhesus macaque. (C) Distribution of the transcription start sites identified by ChIP-Seq study, on the basis of the previous and revised gene models. Reference: the end of the 5′-UTR supported by both previous and new models. (D and E) Gene structures of two experimentally verified transcripts revised by RhesusBase. Both the previous gene models (Ensembl) and the revised gene models (RhesusBase) are shown. RNA-Seq expression tag coverage, splicing junctions, cross-species conservation score, as well as sequenced cDNA fragments were aligned accordingly. AATAAA site (D) or transcription start site (E) identified by ChIP-Seq study are highlighted. The RNA-Seq expression tag coverage was set to the maximal score for sites with high tag coverage (>100).
Figure 3. Evaluation of new exons and transcripts absent in Ensembl annotation. (A, B) Normalized mRNA-Seq expression tag coverage in exonic regions, upstream and downstream intronic regions, for revisions adding missed exons (A) or transcripts (B). (C) Intron–exon distributions of cross-species conservation score. Reference: exons in rhesus macaque supported by both gene models; New Exon: missed exons on the basis of Ensembl annotation; New Transcript: exons in new transcripts identified in this study. (D) Sequence motifs flanking the splicing junctions for new exons and transcripts. Distributions were calculated using 2 427 new exons (New Exons) and 24 295 exons in 8057 new transcripts (New transcripts). (E and F) Two examples are shown for the fine-scale structure of new exons missed by Ensembl (E) and new transcripts (F). Both the previous gene models (Ensembl) and the revised gene models (RhesusBase) are shown. RNA-Seq expression tag coverage, splicing junctions, cross-species conservation score, and sequenced cDNA fragments were aligned accordingly. Sequences surrounding the splicing junctions are also illustrated, in which GT-AG sites are highlighted in red.
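Panels A and B of Figure 2 compare how often the AAUAAA hexamer occurs near the end of 3′-UTRs under different gene models. A minimal sketch of such a frequency calculation follows. The 40-nt window, the function names and the toy sequences are assumptions made for illustration; the record does not state the window or method the authors used.

```python
from typing import Iterable


def has_polya_signal(utr_seq: str, window: int = 40, signal: str = "AATAAA") -> bool:
    """True if `signal` lies entirely within the last `window` nt of the 3'-UTR."""
    tail = utr_seq.upper().replace("U", "T")[-window:]  # accept RNA or DNA letters
    return signal in tail


def signal_frequency(utr_seqs: Iterable[str], window: int = 40) -> float:
    """Fraction of 3'-UTR sequences carrying the hexamer near their 3' end."""
    seqs = list(utr_seqs)
    if not seqs:
        return 0.0
    return sum(has_polya_signal(s, window) for s in seqs) / len(seqs)


# Toy 3'-UTR sequences for illustration only.
toy_utrs = [
    "C" * 40 + "AATAAA" + "C" * 20,  # hexamer 20 nt upstream of the end
    "AAUAAA" + "G" * 50,             # hexamer too far from the end
    "G" * 30 + "aauaaa" + "G" * 15,  # lower-case RNA letters, near the end
]
print(signal_frequency(toy_utrs))
```

Running the same function on UTR ends from two annotation sets, and on control regions, would give the kind of comparison the panels describe.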
Figure 4. RhesusBase data integration and abstraction. Nine functional categories of annotation were integrated and standardized: Gene Description, Gene/Transcript Structure, Expression Profile, Regulation Mode, Variation and Repeats, Comparative Genomics, Gene Function, Phenotype/Disease Association and Drug Development. Detailed descriptions of annotations in each functional category are illustrated. Annotations integrated from in-house datasets are shown in green boxes, those processed from public databases in blue boxes and those extracted directly from public databases in grey boxes. The total numbers of entries in each functional category are shown.
Statistics for RhesusBase functional genomics annotations
| Categories | Resources | All entries (Rhesus) | All gene coverage (Rhesus) | References |
|---|---|---|---|---|
| Gene description | ||||
| RhesusBase genes | This study | 22 283 | 22 283 | ( |
| Validated genes | RefSeq | 2 588 (2 588) | 2 541 (2 541) | ( |
| Putative genes | Ensembl, N-SCAN, SGP, Geneid, miRBase, GtRNAdb | 127 271 (127 271) | 31 416 (31 416) | ( |
| Transcript structure | ||||
| RhesusBase transcripts | This Study, Public Data | 50 847 (50 847) | 28 634 (28 634) | This study |
| RNA-Seq coverage | This Study, Public Data | 537 867 932 (537 867 932) | 16 462 (16 462) | This study, ( |
| Splicing junctions | This Study, Public Data | 1 380 988 (1 380 988) | 16 992 (16 992) | This study, ( |
| Expressed sequence tags | GenBank, dbEST, UCSC | 72 657 (72 657) | 8 832 (8 832) | ( |
| Transcript sequences | RefSeq | 32 685 (32 685) | 17 575 (17 575) | ( |
| Expression profile | ||||
| RNA expression identified by RNA-Seq | This Study, Public Data | 1 332 656 (982 226) | 22 198 (16 809) | This study, ( |
| RNA expression identified by in situ hybridization | Allen Brain Atlas | 12 397 (0) | 9 218 (0) | ( |
| RNA expression identified by cDNA microarray | BioGPS, Allen Brain Atlas | 48 161 (0) | 20 795 (0) | ( |
| Regulation Mode | ||||
| Transcriptional regulation | UCSC, Public Data | 235 086 (235 086) | 11 601 (0) | ( |
| Posttranscriptional regulation | This Study, Argonaute, TarBase, PicTar, TargetScan, miRanda | 82 355 (82 355) | 1 625 (1 520) | ( |
| Natural-antisense regulation | NATsDB, TransMap | 37 868 (0) | 5 463 (5 463) | ( |
| Posttranslational modification | dbPTM | 4 390 (4 390) | 223 (0) | ( |
| Variation and repeats | ||||
| Single nucleotide variation | This Study, dbSNP, CMSNP, MamuSNP, MonkeySNP | 5 682 738 (5 500 294) | 17 430 (15 743) | ( |
| Copy number variation | dbVar, DGV | 29 593 (337) | 6 068 (104) | ( |
| Genomic repeats | UCSC | 5 291 149 (5 291 149) | 15 445 (15 445) | ( |
| Comparative genomics | ||||
| Rhesus-centric pairwise alignments | UCSC | 32 487 843 (32 487 843) | 17 603 (17 603) | ( |
| Cross-species conservation score prediction | UCSC | 4 998 806 214 (4 998 806 214) | 16 435 (16 435) | This study, ( |
| Gene function | ||||
| Related publication | NCBI | 544 499 (269) | 171 (171) | ( |
| Predicted protein domain | InterPro | 28 517 (28 517) | 8 399 (8 399) | ( |
| Biological process, cellular component and molecular function | Gene Ontology | 191 251 (0) | 11 850 (0) | ( |
| Molecular pathway | KEGG, Reactome, BioCarta, PID | 12 346 (187) | 4 106 (4 106) | ( |
| Protein–Protein Interaction | IntAct, HPRD, DIP, BioGRID, BioCyc, STRING | 819 029 (672 864) | 10 606 (10 606) | ( |
| Phenotype and disease association | ||||
| Human inheritance disease | OMIM | 9 935 (0) | 6 104 (0) | ( |
| Genetic susceptible gene (genome-wide association study) | NHGRI Catalog of Published Genome-Wide Association Studies | 4 903 (0) | 3 536 (0) | ( |
| Genetic susceptible gene (low-scale association study) | GAD | 44 201 (0) | 3 535 (0) | ( |
| Transgenic mouse phenotype | MGI, PBmice | 32 080 (0) | 5 420 (0) | ( |
| Drug development | ||||
| Pharmacogenomics | PharmGKB | 21 072 (0) | 19 495 (0) | ( |
| Drug-induced differentially expressed genes | Connectivity MAP | 2 354 610 (0) | 9 125 (0) | ( |
All entries: total number of RhesusBase entries in rhesus macaque, human and mouse.
All entries (Rhesus): number of RhesusBase entries specifically for rhesus macaque.
All gene coverage: number of monkey genes with RhesusBase annotations from rhesus macaque, human and mouse.
All gene coverage (Rhesus): number of genes with RhesusBase annotations specifically from rhesus macaque.
Figure 5. Overview of the RhesusBase management system and interactive user interfaces. A comprehensive database management system and five highly interactive user interfaces were developed to support data storage, updating (A), retrieval (B), display (C, D) and downloading (E) in RhesusBase. A database update module was developed to facilitate the efficient updating of RhesusBase as more public or in-house functional data become available (A). Keyword-, location- and sequence-based query systems were developed to facilitate the retrieval of functional annotations from RhesusBase (B). Through this information retrieval system, users are referred to two different view modes to display the annotations: a gene-centric view (C) and a position-centric browser view (D). A Biomart-based download system was also developed for the offline use of RhesusBase annotations (E).