Shi-Jian Zhang, Chu-Jun Liu, Peng Yu, Xiaoming Zhong, Jia-Yu Chen, Xinzhuang Yang, Jiguang Peng, Shouyu Yan, Chenqu Wang, Xiaotong Zhu, Jingwei Xiong, Yong E Zhang, Bertrand Chin-Ming Tan, Chuan-Yun Li.
Abstract
With a genome sequence and composition highly analogous to those of human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical links between human-biased regulations and human-specific traits, using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping the human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serve as an important resource for in-depth evolutionary studies of human biology.
Keywords: RhesusBase; human evolution; human regulation; human-specific trait; next-generation sequencing; rhesus macaque
Year: 2014 PMID: 24577841 PMCID: PMC3995340 DOI: 10.1093/molbev/msu084
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Figure. Integration, processing, and evaluation of NGS data sets. (A) Overview of the NGS data sets (shown in black) and corresponding annotations (red) processed and integrated into RhesusBase2. The numbers of NGS data sets and annotation entries are also shown. (B) The quality of RNA-seq data sets was assessed by standard evaluation steps, and a RhesusBase quality score was assigned to each data set according to multiple parameters as illustrated in the box. (C) The distribution of RhesusBase quality scores for RNA-seq data sets incorporated in RhesusBase2.
Statistics of NGS Data Processed in RhesusBase.
| Platforms | Data Sets (v1) | Data Sets (v2) | Samples (v1) | Samples (v2) | Total Reads, Million (v1) | Total Reads, Million (v2) |
|---|---|---|---|---|---|---|
| Genome-seq | 0 | 1 | 0 | Brain | 0 | 2,173 |
| Exome-seq | 0 | 11 | 0 | Blood | 0 | 1,054 |
| RNA-seq | 47 | 202 | 14 | 30 | 2,068 | 10,679 |
| Small RNA-seq | 0 | 191 | 0 | 66 | 0 | 1,970 |
| CLIP-seq | 7 | 7 | HEK293 | HEK293 | 34 | 34 |
| Poly(A)-seq | 0 | 14 | 0 | 6 | 0 | 206 |
| ChIP-seq | 0 | 1,226 | 0 | 103 | 0 | 28,660 |
| ChIA-PET | 0 | 15 | 0 | 5 | 0 | 3,885 |
| Sum | 54 | 1,667 | 15 | 182 | 2,102 | 48,661 |
Note: Statistics of the NGS data sets archived in the previous (v1) and current (v2) versions of RhesusBase are summarized.
a. Adrenal, brain, breast, caudate nucleus, cerebellar cortex, cerebellum, colon, corneal endothelium, fat, frontal pole, heart, hippocampus, kidney, LCL, liver, lung, lymph node, muscle, neocortex, orbital cerebral cortex, ovary, prefrontal cortex, prostate, skin, bone marrow, spleen, testis, thymus, thyroid, white blood cells, and mixtures of 16 tissues.
b. ABC158, ALL411, basal cells, Bjab103, blastocysts, BL115, BL134, BL510, brain, breast, cerebellum, CLLM633, CLLU626, columnar cells, EBV159, endometrium, epididymis, ESC, ES-RPE, cerebral cortex, fetal RPE, frontal cortex, GC40, GC136, GCB110, GCB385, H929, heart, HEK293, HeLa, HIV412, kidney, KMS12, L1236, L428, liver, lung, Ly3, MALT413, MCL112, MCL114, MM55, MM139, Mino122, Naïve-B-cell, ovary, parthenogenetic-activated blastocysts stem cell, principal and basal cells, peripheral blood mononuclear cells, PC44, peritubular, plasma, pigmented cluster, prostate, red blood cell, seminal vesicle, superior frontal gyrus, skeletal muscle, skin, splenic414, spermatozoa, seminal vesicle, testis, tonsil, U266, and uterus.
c. Brain, ileum, kidney, liver, muscle, and testis.
d. See Hudson and Snyder (2006), Euskirchen et al. (2007), and Meyer et al. (2013) for details.
e. K562, HCT-116, HeLa-S3, MCF-7, and NB4.
Functional Annotations in RhesusBase.
| Category | Annotation type | Entries (v1) | Entries (v2) |
|---|---|---|---|
| DNA | SNP | 5,682,738 | 20,327,025 |
| RNA | mRNA expression profile | 1,330,884 | 5,682,934 |
| RNA | PA site | 0 | 1,184,140 |
| RNA | Alternative splicing | 0 | 481,865 |
| RNA | RNA editing | 0 | 1,369,446 |
| RNA | miRNA expression profile | 0 | 9,395 |
| Regulation | TFBS | 174,805 | 28,771,628 |
| Regulation | Chromatin interaction | 0 | 493,236 |
| Regulation | miRNA-binding site | 15,909 | 104,041 |
| Sum | | 7,204,336 | 58,423,710 |
Note: Statistics of the functional annotations archived in the previous (v1) and current (v2) versions of RhesusBase are summarized. From highly selected and processed NGS data sets, a total of 58,423,710 functional records were generated and incorporated in RhesusBase2 (v2), representing approximately one order of magnitude more NGS annotation entries than the previous version (v1).
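The growth from v1 to v2 described in the note can be checked directly from the table. The following is a minimal Python sketch that uses only the entry counts listed above; the dictionary layout and variable names are illustrative and are not part of RhesusBase.

```python
# Entry counts (v1, v2) copied from the "Functional Annotations in RhesusBase" table.
ANNOTATIONS = {
    "SNP": (5_682_738, 20_327_025),
    "mRNA expression profile": (1_330_884, 5_682_934),
    "PA site": (0, 1_184_140),
    "Alternative splicing": (0, 481_865),
    "RNA editing": (0, 1_369_446),
    "miRNA expression profile": (0, 9_395),
    "TFBS": (174_805, 28_771_628),
    "Chromatin interaction": (0, 493_236),
    "miRNA-binding site": (15_909, 104_041),
}

# Column totals, which should match the "Sum" row of the table.
total_v1 = sum(v1 for v1, _ in ANNOTATIONS.values())
total_v2 = sum(v2 for _, v2 in ANNOTATIONS.values())

# Records added in v2 and the overall fold increase over v1.
new_records = total_v2 - total_v1
fold_increase = total_v2 / total_v1

print(f"v1 total:      {total_v1:,}")
print(f"v2 total:      {total_v2:,}")
print(f"new records:   {new_records:,}")
print(f"fold increase: {fold_increase:.2f}")
```

The totals reproduce the Sum row (7,204,336 and 58,423,710). The difference, 51,219,374, corresponds to the 51.2 million new records cited in the abstract, and the ratio of roughly 8.1 is what the note rounds to "approximately one order of magnitude".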
Definite Refinement of the Macaque Transcripts by the RhesusBase (v2).
| Categories | Revision Events (Confirmed) | Revision Events (Novel) | Revised Transcripts (Confirmed) | Revised Transcripts (Novel) |
|---|---|---|---|---|
| Junctions | 3,202 | 909 | 2,374 | 742 |
| New exons | 2,203 | 5,053 | 1,441 | 2,904 |
| 5′-UTRs | 803 | 587 | 803 | 587 |
| 3′-UTRs | 2,781 | 3,619 | 2,781 | 3,619 |
| Sum | 19,157 | | 12,201 | |
Note: With the incorporation of new macaque NGS data sets, we further refined the macaque gene models and compared the results with the revisions reported in the previous version of RhesusBase. The numbers of previous gene model revisions confirmed by this study (confirmed) and of new gene model revisions made by this study (novel) are summarized.
Figure. Refinement and evaluation of macaque gene models. (A) Accurate refinement of the splice junctions in macaque genes is illustrated by the exon–intron distribution patterns of the RNA-seq expression tag coverage. Exon: exonic regions defined by both gene models; intron: intronic regions defined by both gene models; RhesusBase exon: previously intronic regions now defined by revised gene models as exonic regions; RhesusBase intron: previously exonic regions now defined by revised gene models as intronic regions. (B and C) Normalized RNA-seq expression tag coverage in exonic regions, upstream and downstream intronic regions, for previously missed exons (B) or transcripts (C). (D) Intron–exon distributions of cross-species conservation score. Reference: splice junction supported by both gene models; Ensembl: splice junction defined by Ensembl; RhesusBase2: refined splice junction in this study; new exon: new exons not annotated by Ensembl; NTR exon: exons in NTRs identified in this study. (E) Sequence motifs flanking the splice junctions calculated on the basis of previous gene models (Ensembl), revised gene models (RhesusBase), or the splice junctions for new exons (new exon) and NTRs (NTR exon). Reference: distribution calculated using splice junctions supported by both gene models. (F and G) Enrichments of AAUAAA/AUUAAA hexamers near the end of the 3′-UTRs were calculated based on gene models of Ensembl (release 68) (Ensembl), the previous version of RhesusBase (RhesusBase1), the current version of RhesusBase (RhesusBase2), and 5′-UTR sequences as the negative control (negative control). (H) The distributions of the distance between the PA sites estimated from poly(A)-seq data and the 3′-end of the transcripts, as defined by Ensembl (release 68) (Ensembl), the previous version of RhesusBase (RhesusBase1), and the current version of RhesusBase (RhesusBase2).
Figure. NGS-oriented genomic interfaces of macaque genomics. (A–G) Overview of database management system and interactive user interfaces in RhesusBase2. RhesusBase2 core database was developed on the basis of 1,667 NGS data sets (A) and accurate macaque gene models (B). Equipped with keywords, location, and sequence-based information retrieval systems (C), as well as BioMart and batch-download interfaces (D), RhesusBase2 allows productive data accession and management. In particular, for each human gene and its associated regulations, one gene page (E) and one regulation page (F) were designed to visualize the corresponding annotations of its macaque ortholog in a user-friendly manner. All position-related annotations were hyperlinked to a newly implemented, position-centric genome browser for efficient visualization of position-based annotations in human and rhesus macaque (G).
Figure. Application of RhesusBase in evolutionary and mechanistic characterization of a human candidate gene, DIEXF. (A–C) mRNA expression profiles for DIEXF in human and its macaque ortholog are available on the Gene Page (binned by tissue types), indicating higher expression in rhesus macaque across multiple tissues. Gene expression levels across the five tissues in human and rhesus macaque are shown as RPKM values (A), as the percentile rank of the expression levels in the associated NGS assay (B), and as relative expression levels normalized to GAPDH as the internal control (C). Data are shown as mean ± SEM (standard error of the mean). Additional survey of the DIEXF genomic regions on the Position-centric Genome Browser (D) and the Regulation Page (E) revealed a human-specific extension of the 3′-UTR (highlighted by the red bar in D), as indicated by the RhesusBase2-archived poly(A)-seq and RNA-seq annotations in human and rhesus macaque (D). This human-specific 3′-UTR region was further predicted to harbor multiple miRNA target sites (E).
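Panels A–C report expression on three different scales. As a rough illustration of how such values are commonly derived, the sketch below implements the standard RPKM definition together with a simple percentile rank and a reference-gene ratio. The function names and all example numbers are hypothetical, and the exact procedures used in RhesusBase2 may differ.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def percentile_rank(value, all_values):
    """Percentage of genes in the assay whose expression is <= value."""
    return 100.0 * sum(v <= value for v in all_values) / len(all_values)


def relative_to_reference(gene_rpkm, reference_rpkm):
    """Expression of a gene relative to an internal control such as GAPDH."""
    return gene_rpkm / reference_rpkm


# Hypothetical numbers for one gene in one sample (not taken from the paper).
gene_expression = rpkm(read_count=100, gene_length_bp=2_000,
                       total_mapped_reads=10_000_000)
assay_expression = [0.5, 1.2, 5.0, 8.3, 40.0]  # made-up RPKM values
control_expression = 50.0                      # made-up GAPDH RPKM

print(gene_expression)
print(percentile_rank(gene_expression, assay_expression))
print(relative_to_reference(gene_expression, control_expression))
```

The percentile rank and the control-normalized ratio are less sensitive than raw RPKM to differences in sequencing depth and library composition between the human and macaque assays, which is one reason to show all three scales side by side.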
Figure. Human-biased regulatory events contribute substantially to the transcriptomic difference between human and macaque. (A) Identification of human-biased regulation using RhesusBase2-archived NGS data sets. The numbers of NGS data sets used to identify each type of human-biased regulation are shown. (B) A human-specific A-to-G RNA editing site within an Alu element in the 3′-UTR of CC2D1B (chr1: 52,589,179 in NCBI36). Extent of sequencing coverage flanking the editing site is shown for both human and rhesus macaque data. At the focal editing site, the A allele is highlighted in green and the G allele in red. (C) Compared with the transcript structure of the macaque RTCA gene (in blue) and the corresponding poly(A)-seq signals (highlighted in the blue box), the transcript structure for the human ortholog is shown with a human-specific 3′-terminus (in red), indicated by poly(A)-seq signals (highlighted in the red box) and RNA-seq read densities in human and rhesus macaque data. (D) The graph shows a cumulative distribution of the log2 fold changes of mRNA abundance between human and rhesus macaque for different sets of mRNAs: targets of hsa-miR-499a-5p identified by MiRanda prediction (blue line) or by both CLIP-seq and MiRanda prediction (red line), and genomic background with all expressed genes (black line). Displacement of the curve to the right reveals decreased mRNA abundance in human, which is indicative of human-biased mRNA repression in the presence of the human-biased miRNA. (E) Quantile–quantile plot shows the distribution of the correlation coefficients of tissue expression profiles, for genes targeted by human-biased regulatory events and the genomic background.
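The comparison in panel D can be sketched as an empirical cumulative distribution of per-gene log2 fold changes. The Python example below is only an illustration under stated assumptions: the fold change is taken as macaque over human so that positive values mean lower abundance in human (matching the caption's reading of a rightward shift), a pseudocount of 1 is added to avoid division by zero, and all expression values are made up.

```python
import math


def log2_fold_change(macaque, human, pseudocount=1.0):
    """log2(macaque / human); positive values mean lower abundance in human.

    The direction and the pseudocount are assumptions made for this sketch.
    """
    return math.log2((macaque + pseudocount) / (human + pseudocount))


def ecdf(values, x):
    """Empirical cumulative distribution: fraction of values <= x."""
    return sum(v <= x for v in values) / len(values)


# Hypothetical (macaque, human) expression pairs, not data from the paper.
mirna_targets = [(30.0, 10.0), (12.0, 6.0), (8.0, 7.0)]
background = [(10.0, 10.0), (5.0, 9.0), (20.0, 15.0), (7.0, 7.0)]

target_lfc = [log2_fold_change(m, h) for m, h in mirna_targets]
background_lfc = [log2_fold_change(m, h) for m, h in background]

# A curve displaced to the right has a lower cumulative fraction at a given x.
for x in (0.0, 0.5, 1.0):
    print(x, ecdf(target_lfc, x), ecdf(background_lfc, x))
```

In practice the shift between two such distributions would be assessed with a statistical test rather than by inspecting a few points; this sketch only shows how the curves themselves are constructed.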
A Catalog of 9,295 Human-Biased Regulatory Events.
| Categories | Events | Transcripts | Genes |
|---|---|---|---|
| Alternative PA | 55 | 44 | 44 |
| RNA editing | 1,069 | 687 | 330 |
| miRNA regulation | 8,171 | 3,436 | 2,531 |
| Sum | 9,295 | 4,167 | 2,905 |
Note: A catalog of 9,295 human-biased regulatory events was identified using RhesusBase2-archived macaque annotations as a reference. The numbers of genes and transcripts with human-biased regulatory events are summarized.