Literature DB >> 33973996

Polyadenylation-related isoform switching in human evolution revealed by full-length transcript structure.

Yumei Li¹, Qing Sunny Shen¹, Qi Peng^1,2, Wanqiu Ding^1,2, Jie Zhang^1,2, Xiaoming Zhong¹, Ni A An^1,2, Mingjun Ji^1,2, Wei-Zhen Zhou³, Chuan-Yun Li^1,2.

Abstract

Rhesus macaque is a unique nonhuman primate model for human evolutionary and translational study, but the error-prone gene models critically limit its applications. Here, we de novo defined full-length macaque gene models based on single molecule, long-read transcriptome sequencing in four macaque tissues (frontal cortex, cerebellum, heart and testis). Overall, 8 588 227 poly(A)-bearing complementary DNA reads with a mean length of 14 106 nt were generated to compile the backbone of macaque transcripts, with the fine-scale structures further refined by RNA sequencing and cap analysis gene expression sequencing data. In total, 51 605 macaque gene models were accurately defined, covering 89.7% of macaque or 75.7% of human orthologous genes. Based on the full-length gene models, we performed a human-macaque comparative analysis on polyadenylation (PA) regulation. Using macaque and mouse as outgroup species, we identified 79 distal PA events newly originated in humans and found that the strengthening of the distal PA sites, rather than the weakening of the proximal sites, predominantly contributes to the origination of these human-specific isoforms. Notably, these isoforms are selectively constrained in general and contribute to the temporospatially specific reduction of gene expression, through the tinkering of previously existed mechanisms of nuclear retention and microRNA (miRNA) regulation. Overall, the protocol and resource highlight the application of bioinformatics in integrating multilayer genomics data to provide an intact reference for model animal studies, and the isoform switching detected may constitute a hitherto underestimated regulatory layer in shaping the human-specific transcriptome and phenotypic changes.

Entities: Chemical

Keywords: Iso-Seq; gene models; polyadenylation; rhesus macaque; subcellular localization; transcriptome evolution

Mesh：

Substances：

Year: 2021 PMID： 33973996 PMCID： PMC8574621 DOI： 10.1093/bib/bbab157

Source DB: PubMed Journal: Brief Bioinform ISSN： 1467-5463 Impact factor: 11.622

Introduction

Rhesus macaque (Macaque mulatta), with a genome sequence highly analogous to humans, is a unique primate model for studies of human biology [1]. Compared with other distantly related model organisms, macaque represents an effective reference in comparative genomics studies of genetic changes underlying human-specific traits [1]. In addition, as recent studies have revealed high degrees of cross-species diversity on the regulation and structure of transcripts in mammals [2-4], rhesus macaque thus serves as a unique model in translational studies for verifying and generalizing findings of gene functions reported in traditional model animals. However, the applications in rhesus macaques have been limited, partially due to the incomprehensive and error-prone gene models found in rhesus macaques. Notably, due to the scarce macaque messenger RNA (mRNA) and expressed sequence tag (EST) data, macaque gene models mainly rely on ab initio or comparative genomics-guided predictions [5]. Consequently, recent studies by us have revealed defective annotations of local, fine-scale transcript structures (e.g. exon boundaries) in >30% of macaque genes [6, 7]. Moreover, because human gene structures have been used to build these putative gene models in the current macaque gene annotations, human–macaque comparative studies of human evolution with the current macaque gene models are likely to be inadequate [5]. Recently, RNA sequencing (RNA-Seq) short reads were used to refine the local structures of these putative gene models in rhesus macaque [6, 7], with the assumption that the backbones of these putative gene models are accurate and representative. However, according to a pilot study based on single-molecule long-read transcriptome sequencing (isoform sequencing, Iso-Seq) in the cerebellum of humans and macaques, many newly identified isoforms exhibited even higher expression than the previously annotated major isoforms, indicating a significantly underestimated complexity of the primate transcriptome at the full-length transcript level [2]. Thus, a protocol integrating multilayer genomics data is thus urgently needed to define an atlas of full-length macaque gene models, and further broaden the applications of this unique model in human evolutionary and translational studies.

Results

A new protocol for de novo definition of full-length macaque gene models

To define full-length macaque gene models, we first performed Iso-Seq in 15 PacBio RSII cells and 13 PacBio Sequel cells [8] to profile the poly (A)-positive RNAs derived from four macaque tissues (frontal cortex, cerebellum, heart and testis). A total of 8 588 227 poly(A)-bearing complementary DNA (cDNA) reads were generated, with a mean length of 14 106 nt (Table 1). Notably, the mean length of annotated transcripts in macaque (Ensembl, release 96) is only 1829 nt. Since the circular mode of Iso-Seq was used [9], the majority of these transcripts were sequenced several times from the 5′ end to the poly(A) tail. We thus split these raw long reads into subreads and self-aligned them to assemble reads of insert (regions of interest, ROIs) with a substantially decreased sequencing error rate. Finally, 7 983 980 ROIs were identified, with an average of 10 read-through passes.

Table 1

Statistics of Iso-Seq data

Species	Tissue	Source	ROIs		Mean pass number	Full-length PA-containing reads	High-quality reads^a
Species	Tissue	Source	Count	Mean length	Mean pass number	Full-length PA-containing reads	High-quality reads^a
Macaque	Cerebellum	Zhang et al.	1 128 095	2432	7	620 944	443 937
	Frontal cortex	This study	2 769 148	1756	8	1 542 070	889 600
	Heart	This study	2 558 353	1590	10	1 584 962	907 278
	Testis	This study	1 528 384	1702	14	1 144 126	960 690
Human	Cerebellum	Zhang et al.	1 179 556	1971	7	729 586	446 469
	Frontal cortex	PacBio	691 846	3277	5	503 891	387 717
	Heart	PacBio	1 667 650	2973	2	394 347	221 242
	Liver	PacBio	1 789 673	2479	3	485 741	438 777
Mouse	Frontal cortex	This study	751 620	2086	16	470 764	330 869

aHigh-quality reads were defined as the reads not mapped to the gap regions, mapped uniquely to one locus, mapped with pass number >1 and no internal poly(A) primes detected.

Statistics of Iso-Seq data aHigh-quality reads were defined as the reads not mapped to the gap regions, mapped uniquely to one locus, mapped with pass number >1 and no internal poly(A) primes detected. Several lines of evidence verified the high sensitivity of this Iso-Seq study in defining the gene models. First, these ROIs showed lengths comparable with those of human full-length protein-coding transcripts annotated in GENCODE (median: 1569 versus 2011 nt) (Figure 1A) and covered 89.42% of the macaque genes in Ensembl putative annotations [5] (Figure 1B). Particularly, the majority of genes with significant expression (fragments per kilobase of transcript per million mapped read, FPKM >5 as estimated by RNA-Seq) [2] were identified at the current Iso-Seq sequencing depth (94.7% in frontal cortex, 87.4% in cerebellum, 92.8% in heart and 92.6% in testis). Even at half of the current sequencing depth, most of these genes were covered (93.1% in frontal cortex, 82.6% in cerebellum, 90.2% in heart and 90.1% in testis) (Additional File 1: Supplementary Figure S1), indicating that the current sequencing depth of Iso-Seq was sufficient for constructing gene models de novo for macaque genes with significant expression.

Figure 1

De novo definition of macaque gene models with Iso-Seq data. (A) Distribution of the lengths of Iso-Seq reads of insert (ROIs) and of human gene models annotated by GENCODE (v19) (Human GENCODE). (B) Cumulative distribution of the coverage of genes by Iso-Seq reads. All: Iso-Seq data from all four tissues. (C) The pipeline for the de novo definition of macaque gene models. (D–F) Venn diagrams showing the comparisons of gene models between the new atlas of macaque genes versus macaque protein-coding genes annotated by Ensembl (release 96) (D), RefSeq (release 94) (E) or human protein-coding genes annotated by GENCODE (F). Evaluation of the newly defined macaque gene models. (A–C) Regulatory attributes associated with the new macaque gene models: aggregate plots showing the H3K4me3 signals around the TSS of all genes (A). Densities of CpG islands around TSS of genes with CpG islands (nearby 200 bp) (B) and densities of poly(A)-Seq-identified PA sites around TTS of all genes (C). Newly defined macaque gene models; Ensembl, macaque gene models annotated by Ensembl (release 96). (D) Sequence motifs flanking splice sites of the newly defined macaque gene models. (E) Bar plots showing the distribution of genes with different coverage of protein-coding sequence, which was defined as the percentage of the newly defined macaque CDS covered by human/macaque protein sequences annotated by GenBank (release 223) and RefSeq (release 77). (F) Pie chart showing the comparisons of the CDS length between the newly defined macaque genes and human genes annotated by RefSeq (release 94). Only orthologous gene pairs in humans and macaques were considered in the comparison; Hg, the length of the CDS for human gene as annotated by RefSeq; Ma, the length of CDS for macaque orthologous gene as annotated by the new macaque gene models. (G and H) Bar plots showing the distributions of the proportion of alignment (G) and sequence identity (H) for the aligned regions between human and macaque protein sequences, which were translated from human RefSeq genes and the newly defined macaque genes, respectively. Although the feature of poly(A) selection ensures the accurate definition of the 3′ ends of transcripts with Iso-Seq reads, demarcating the 5′ ends is more difficult due to the underrepresented 5′ ends sequences on Iso-Seq reads. We thus integrated public macaque cap analysis gene expression sequencing (CAGE-Seq) data to define the 5′ ends of the transcripts (see Methods) [10-12]. Moreover, public RNA-Seq data [13-16] were further incorporated to refine the fine-scale local structures of these transcripts, and the exon–intron boundaries of gene models with relatively lower quality, such as those with lower Iso-Seq coverage, were revised accordingly (see Methods). Based on the refined gene models, we further introduced the 3-frame Met-to-stop translation method to define the open reading frame of each transcript (Figure 1C, Additional File 2: Supplementary Table S1). Overall, a total of 51 605 nonredundant macaque genes were defined de novo (see Group 2, Methods), covering 89.7 or 75.0% of previously annotated macaque protein-coding genes in Reference Sequence (RefSeq) or Ensembl, respectively, or 75.7% of the human orthologous protein-coding genes (GENCODE) (Figure 1D–F). This new set of macaque gene models revealed a substantially higher complexity in the macaque transcriptome, with the mean isoform number per gene increasing from two (Ensembl, release 96) to eight.

The macaque gene models are defined with high sensitivity and specificity

The specificity of this set of newly defined macaque gene models was also verified by multiple known regulatory features of genes, such as an enrichment of active epigenetic marks [17-19] (Figure 2A) and CpG islands [20] upstream of the 5′ ends of the transcripts (Figure 2B), as well as the enrichment of poly(A)-Seq signals upstream of the poly(A) tails [21] (Figure 2C). Moreover, unambiguous sequence motifs, consistent with previously reported splicing motifs [22], could be detected at the exon–intron boundaries defined by the new gene models (Figure 2D). Notably, 75.7% of these newly defined open reading frames corresponded to known proteins, among which 89.0% are with the sequences of coding regions showing >80% overlap with known proteins in human and rhesus macaque (Figure 2E). As the human proteome is annotated in relatively higher quality, we further compared the newly defined open reading frames with human annotated open reading frames. The comparison revealed that the majority of these newly defined macaque gene models express comparable or even longer coding sequences (CDS) (Figure 2F, 79.9%). Considerable extents of protein sequence alignment (Figure 2G, 89.2% aligned gene models with >80% matched alignment) and protein sequence similarity (Figure 2H, 92.8% aligned gene models with >80% sequence identity) to translated human proteins were also detected. Viewed together, these lines of evidence suggested that the macaque gene models are accurately defined at the full-length transcript level.

Figure 2

Evaluation of the newly defined macaque gene models. (A–C) Regulatory attributes associated with the new macaque gene models: aggregate plots showing the H3K4me3 signals around the TSS of all genes (A). Densities of CpG islands around TSS of genes with CpG islands (nearby 200 bp) (B) and densities of poly(A)-Seq-identified PA sites around TTS of all genes (C). Newly defined macaque gene models; Ensembl, macaque gene models annotated by Ensembl (release 96). (D) Sequence motifs flanking splice sites of the newly defined macaque gene models. (E) Bar plots showing the distribution of genes with different coverage of protein-coding sequence, which was defined as the percentage of the newly defined macaque CDS covered by human/macaque protein sequences annotated by GenBank (release 223) and RefSeq (release 77). (F) Pie chart showing the comparisons of the CDS length between the newly defined macaque genes and human genes annotated by RefSeq (release 94). Only orthologous gene pairs in humans and macaques were considered in the comparison; Hg, the length of the CDS for human gene as annotated by RefSeq; Ma, the length of CDS for macaque orthologous gene as annotated by the new macaque gene models. (G and H) Bar plots showing the distributions of the proportion of alignment (G) and sequence identity (H) for the aligned regions between human and macaque protein sequences, which were translated from human RefSeq genes and the newly defined macaque genes, respectively.

As a proof of principle, an example of the newly defined gene models with supporting evidence was shown (Figure 3). Briefly, the DDX20 gene is a putative RNA helicase implicated in the alteration of RNA secondary structure [23, 24], and its disruption is linked to motor neuron degenerative disease and cancer [25, 26]. Previous annotations in rhesus macaque failed to annotate the complete gene model of this important gene, whereas the newly defined gene models provided full-length structural annotations of DDX20 (Figure 3).

Figure 3

An example of improved macaque gene model. Both the previously annotated gene model (Ensembl) and a newly defined gene model are shown. The numbers of supporting reads for each transcript are shown along with the Iso-Seq reads. The positions of RNA-Seq junction reads (RNA-Seq) and CpG islands, as well as the signals of CAGE-Seq, H3K4me3 ChIPSeq and poly(A)-Seq, were presented accordingly. Overall, this atlas of de novo defined, full-length macaque gene models should facilitate the use of this unique model in human evolutionary and translational studies, especially from the perspective of previously underestimated regulatory levels such as polyadenylation (PA) regulation.

PA profile in rhesus macaque shaped by both cis- and trans-regulations

Notably, the macaque gene models were defined with definitive 3′ ends, providing informative data to investigate the PA regulation. We then used these newly defined gene models to profile the PA sites in macaque tissues and further characterized the features of this significant posttranscriptional regulation [27]. The PA sites are determined by specific cis-regulatory motifs recognized by the cleavage and PA machinery, most notably the core polyadenylation signal (PAS). This sequence motif of AAUAAA and its variants typically located 10–40 nt upstream of the cleavage site [28]. In accordance with this distribution, PA sites determined in this study showed the characteristic sequence features [29, 30] (Additional File 1: Supplementary Figures S2 and S3). To quantitatively examine the potential correlation between PAS type and PA efficiency, we further introduced a parameter of ‘PA usage’ to measure the efficiency of PA (see Methods). Although traditional approaches, such as poly(A)-Seq and RNA-Seq, might introduce false positive signals in PA site identification, the quantification of intact PA sites represents an accurate readout due to the deep sequencing coverage of the PA regions [21, 31, 32]. We thus evaluated the feasibility of Iso-Seq in the quantification of the PA events by cross-referencing with poly(A)-Seq data. Notably, for PA sites defined by both Iso-Seq and poly(A)-Seq data, the PA usage determined by the two approaches was well correlated (Additional File 1: Supplementary Figure S4), indicating the dual advantages of Iso-Seq in accurate definition and quantification of PA events at the full-length transcript level. For genes with more than one PA site, the selection of PA sites during transcription might be regulated by both cis-elements and trans-factors [27]. To quantitatively explore the contributions of the two types of regulations in PA site selection, we estimated the average PA strength of a group of PA sites (PA strength) on the basis of the fraction of sites with the AAUAAA motif (Figure 4A, Additional File 1: Supplementary Figure S5; see Methods). We then made the following observations: first, consistent with other reports [27], we found that the distal PA sites showed significantly higher PA strength than proximal PA sites on the same gene (Figure 4B, Additional File 1: Supplementary Figure S6, Fisher’s exact test, P < 2.2e-16). In addition, the highest PA usage was observed for distal PA sites carrying PAS of AAUAAA, with the corresponding proximal PA sites carrying no previously reported PAS (Figure 4C, Additional File 1: Supplementary Figure S7A–C). Notably, during the transcription process, the proximal PA sites are selected first. The higher PA strength of distal PA sites may thus increase their competition with proximal PA sites, facilitating the emergence of alternative PA events.

Figure 4

Cis- and trans-regulation of PA site selection. (A) Cumulative plots showing the distributions of PA usage in the macaque frontal cortex for three groups of PA sites with different PAS. AAUAAA, PA sites with the canonical PAS AAUAAA; Other, PA sites with other PAS variants excluding AAUAAA; No, PA sites without known PAS. (B) For genes undergoing alternative polyadenylation in the macaque frontal cortex, the proportions of the three groups of PA sites (AAUAAA, Other, No) are shown for the proximal and distal PA sites, respectively. (C) 3D histograms showing average PA usage in the macaque frontal cortex in groups with different combinations of PAS (AAUAAA, Other, No) at proximal and distal PA sites. (D) For each gene expressed in both the macaque frontal cortex and testis, pairwise comparison of the weighted length of the 3′ UTR was performed and shown. The genes with a cross-tissue difference in weighted length of >30 bp are highlighted in red (2299 genes, longer 3′ UTR in frontal cortex) or blue (723 genes, longer 3′ UTR in testis). (E) Hierarchical clustering diagram showing the Spearman correlation coefficients of pairwise comparisons of the PA usage in multiple tissue samples between human and macaque. NS, not significant; *P ≤ 0.05; **P ≤ 0.01; ***P ≤ 0.001. Second, in addition to the contributions of cis-elements, trans-regulation also contributes to the process of PA selection. In the frontal cortex and testis tissues of the same macaque animal, in which the cis-regulation is identical, we found that the weighted length of the 3′ untranslated region (UTR) is significantly longer in the frontal cortex than in the testis (Figure 4D, Wilcoxon rank-sum test, P < 2.2e-16) due to the preferentially use of distal PA sites (Additional File 1: Supplementary Figure S7D), implying the substantial contributions of tissue-biased regulators in the process of PA selection.

Human-specific PAs originated through the strengthening of the distal PA sites

As both cis- and trans-regulation could contribute to the PA site selection, we next investigated their implications in the cross-species differences in PA usage during the primate evolution. When comparing the PA usage across multiple tissues of human and macaque (see Methods), we found a more shared use of PA in different tissues of the same species, than that within the same tissue across species (Figure 4E), which indicates that the cis-element may act as the primary factor driving the cross-species PA evolution in primates. To further clarify how these cis-elements might contribute to the cross-species PA site selection, we identified 97 distal PA sites (on 89 genes) in human but not macaque on the basis of Iso-Seq and RNA-Seq profiles in human and macaque tissues (Figure 5A and B; see Methods). To investigate whether these distal PA sites are newly originated in human, rather than representing recent PA loss events in rhesus macaque, we further introduced mouse as an outgroup species. By performing Iso-Seq for poly(A)-positive RNAs derived from mouse brain, we generated 758 884 poly(A)-bearing cDNA reads. Mouse RNA-Seq data from 58 tissue samples were also analyzed to retrieve the reads distribution information across the full-length mouse transcripts (Additional File 2: Supplementary Table S1; see Methods). On the basis of these profiles, as well as the gene models annotated in GENCODE and RefSeq, we removed the candidate human PA sites detectable also in mouse, and finally identified 79 newly originated, human-specific distal PA events on 73 genes (see Methods; Figure 5A, Additional File 3: Supplementary Table S2). To further account for the variability among human populations, we checked the existence of these candidate PA events using Genotype–Tissue Expression (GTEx) RNA-Seq data from multiple human individual [33]. Overall, >83.78% of these PA events were detected in at least two human individuals (Figure 5C, Additional File 1: Supplementary Figure S8; 93.24% for brain cortex tissues from 101 individuals; 94.69% for cerebellum tissues from 128 individuals; 83.78% for heart tissues from 225 individuals). Therefore, population-level variability should not have a strong effect in defining these human-specific distal PA events, and a large proportion of these events should represent bona fide human-specific regulations.

Figure 5

Cis-regulation is primarily attributed to cross-species PA site selection. (A) A flow chart depicting the identification of distal PA sites in human but not macaque. The number of human PA sites retained in each step is shown. (B) Track view of one example gene, SYBU, with a specific distal PA site in human but not macaque. The structure of the SYBU transcript is shown, with the 3′ UTR regions of the human (middle panel) and macaque (bottom panel) gene loci shown in the magnified view. The extended 3′ UTRs in human are highlighted in the shaded regions. (C) Heat map showing the expression of extended exonic regions in human cortex samples. Horizontal red bars indicate the expression of the extended regions in the tissue sample of the corresponding individual (FPKM >0.2). (D and E) For genes with human–macaque differences in PA site selection, the proportions of the three groups of PAS (AAUAAA, Other, No; as defined in Figure 4) are shown for the newly originated human PA sites (D) and the closest common PA sites shared by human and macaque (E). (F) Boxplots of nucleotide diversity for human-specific extended regions corresponding to 79 newly originated distal PA sites in human and all synonymous sites of human genes (Syn). NS, not significant; *P ≤ 0.05; **P ≤ 0.01; ***P ≤ 0.001. Using this list, we then traced the formation of these young distal PA sites. Presumably, three models exist for the origination of these distal PA sites after the divergence of human and macaque—the strengthening of the distal sites in human but not macaque, the weakening of the proximal sites, and both the strengthening of the distal sites and the weakening of the proximal sites. Through comparing the PA strength for distal PA sites in human but not macaque, we found significantly stronger PAS in humans than in both macaques and their ancestral sequences (Figure 5D, Fisher’s exact test, human versus macaque, P = 0.0002; human versus ancestor, P = 0.0001; macaque versus ancestor, P = 0.54), whereas the associated common proximal PA sites showed no significant difference among humans, macaques and their ancestral sequences (Figure 5E, Fisher’s exact test, human versus macaque, P = 0.51; human versus ancestor, P = 0.25; macaque versus ancestor, P = 0.29). These findings thus supported the model that the strengthening of the distal PA sites, rather than the weakening of the proximal sites, predominantly contributes to the origination of these human-specific PA events. Notably, the causal relationship between the strengthening of these distal PA sites and the origination of these young distal PA sites still need further investigations.

Human transcriptome evolution through PA site selection

To investigate whether these newly originated distal PA events in human represent functional regulations, rather than neutral transcription noises, we performed population genetics analyses to test whether the extended exonic regions through the origination of the new PA event are selectively constrained, on the basis of whole-genome sequencing data from 103 human individuals [34]. Notably, for the regions specifically transcribed in human corresponding to the 79 distal PA events, the nucleotide diversity of these regions is significantly lower than that of synonymous sites on all human annotated genes (Figure 5F, Wilcoxon rank-sum test, P = 0.02). This finding thus indicates that the human-specific isoform with new PA sites are selectively constrained in general, implying their functionality in human evolution. We then investigated whether these selectively constrained, newly originated human PA sites may contribute to the evolution of the human transcriptome. We first checked whether these new PA events may change the protein sequences of the corresponding genes, and found only one PA-related exonic region encoding nine amino acids on TAP2 gene, involved in antigen presentation and bare lymphocyte syndrome. As TAP2 has been reported to participate in the transport of antigens from the cytoplasm to the endoplasmic reticulum, through interacting with multiple proteins such as TAP binding protein (TAPBP), transporter 1 (TAP1) and beta-2-microglobulin (B2M) [35], the inclusion of an additional nine amino acids may regulate its functions through the rewirement of the protein–protein interaction network. Notably, as most of the extended regions did not contain protein CDS, these events might largely be implicated in gene regulations, other than direct modifications of the CDS. As alternative PA site selection is known to impact the length of 3′ UTRs, we then investigated whether they may further change the regulation and stability of the corresponding transcripts via alternative binding of miRNAs or RNA-binding proteins (RBP) [36-38]. Specifically, the distal PA sites specific in human may contribute to the gene expression reduction in humans by the inclusion of additional miRNA-binding sites [39]. In line with this hypothesis, we found that the human genes with human-specific distal PA sites showed significantly reduced expression in comparison with their orthologous genes in macaques (Figure 6A, Wilcoxon rank-sum test, P = 0.02). Considering the regulated tissue expression distribution of miRNAs [40], it is plausible that these genes encoding longer 3′ ends specific to human may also have divergent tissue expression profiles between human and rhesus macaque. Interestingly, when examining the tissue expression distributions of these genes, we found significantly lower tissue correlation coefficients for these genes in comparison with genes with shared 3′ ends between human and macaque (Figure 6B and C, Wilcoxon rank-sum test, P = 0.01).

Figure 6

Transcriptome evolution in primates through PA site selection. (A) The ratios of FPKM between human and macaque are shown and compared for orthologous genes with (right boxplot) or without (left boxplot) distal PA sites specific in human but not macaque. (B) Boxplots showing Spearman correlation coefficients of tissue expression profiles in six tissues (frontal cortex, cerebellum, heart, kidney, muscle and testis) of human and macaque for orthologous genes with (right boxplot) or without (left boxplot) specific distal PA sites in human but not macaque. (C) Examples of genes with specific distal PA sites in human but not macaque that also exhibit divergence of tissue expression profiles. The specific tissues in which the human-specific distal PA site was initially identified were labeled by a black box in the heat maps; H, human; M, macaque. (D) Boxplots of the APEX-Seq fold changes for the extended 3′ UTR regions specific to human (Extended) and the constitutive exonic regions of the corresponding genes (Control); NS, not significant; *P ≤ 0.05; **P ≤ 0.01; ***P ≤ 0.001. (E) Heatmap showing the relative levels of the isoforms with newly originated distal PA sites in human, during the human fetal brain development. Especially, as previous case studies have indicated that the isoforms with different UTRs have different subcellular localizations [41-43], we investigated whether these genes encoding longer 3′ UTRs specific to human may show some specific profile of RNA subcellular localization, on the basis of ascorbic acid peroxidase sequencing (APEX-Seq) data in human cell lines [44]. We found that the isoforms with extended 3′ UTR specific to human are significantly enriched in nuclear structures, such as nucleus, nuclear lamina and nuclear pore, in comparison with the constitutive exons of the corresponding gene (Figure 6D, Wilcoxon rank-sum test, P < 2.2e-16 for nucleus, P = 0.02 for nuclear lamina and P = 0.01 for nuclear pore). The isoform switching from shorter to longer 3′ UTR may thus represent an efficient way to control for the expression of these genes on protein level, through the stronger nuclear retention of mRNAs. Notably, the relative levels of these new isoforms with distal PA sites are significantly regulated in human glioblastoma multiforme in contrast to normal brain tissues (Additional File 1: Supplementary Figure S9). It is thus possible that the retained mRNAs may play essential functions in stress conditions, such as in cancer cells, with a similar function of the formation of stress granules. Taken together, it is plausible that this posttranscriptional PA regulation might contribute to the tissue- or stage-specific reduction of gene expression, through the tinkering of previously existed mechanisms of nuclear retention and miRNA regulation. The isoform switching through PA selection may thus constitute a hitherto underestimated regulatory layer in shaping the human-specific transcriptome and phenotypic changes.

Discussion

Currently, the incomplete and error-prone gene models in the rhesus macaque genome critically limited the use of this unique nonhuman primate model animal in evolutionary studies and potential extension to human translational research. Here, we developed a comprehensive protocol for de novo definition of the full-length macaque gene models through integration of single-molecule sequencing, RNA-Seq, CAGE-Seq and proteomics data. The catalogued full-length macaque gene models provides an accurate reference for molecular and translational studies in rhesus macaque, as well as a viable resource for evaluating the validity and generality of genetic findings from other model animals [1]. Notably, the standard computational pipeline we developed for de novo definition of full-length gene models with multilayer omics data integration should also be applicable to other model animals in facilitating the fine-scale molecular studies. With the precise structural information, these full-length macaque gene models could be implemented to interrogate gene regulatory events associated with primate evolution. To this end, we have thoroughly profiled and delineated several transcript attributes, such as PA regulation, single-exon genes and gene fusions, which might not be adequately annotated by the traditional approaches. As a proof-of-concept example, we found that the PA regulation acts as an underestimated, yet flexible contributor to the cross-species divergence of transcriptome. This wealth of information should complement previous genetic and molecular studies of human phenotypic evolution [45-47]. Although we found that the human genes with longer 3′ UTRs showed relatively lower expression levels in contrast to their macaque orthologs, understanding this pattern from the perspective of primate evolution is not straightforward. Especially, when the new isoform with longer tails only account for a small portion of the isoform atlas encoded by this gene (Additional File 1: Supplementary Figure S10), it is unknown whether this mechanism could efficiently decrease the gene expression to the degree enough for the cross-species phenotypic changes. Notably, when investigating the RNA expression profiles of these genes in human fetal brain development, we found that the proportion of these human-specific isoforms with longer 3′ UTRs could increase substantially in stage-specific manner (Figure 6E; see Methods). For example, in week 10 to week 12, a period in human early fetal brain development characterized by the proliferation and migration of neurons, multiple key genes in neurogenesis, such as JAK2 [48], were identified to encoding more isoforms with human-specific longer tails (Figure 6E). In week 20 of late mid-fetal development characterized by the formation of synapse in the neocortical plate, most of these genes encode higher levels of isoforms with human-specific longer tails in comparison with other developmental stages (Figure 6E), such as PRDX6 previously reported to inhibits the neurogenesis of neural precursor cells [49], and N4BP3 reported to be a crucial modulator of axonal and dendritic branching [50, 51]. The isoform switching through PA selection may thus contribute substantially to human transcriptome evolution in fetal brain development and further to the species-specific phenotypic changes through reshaping the transcription profiles. In terms of alternative PA events and differential 3′ UTR lengths, we assumed that these RNA isoforms would encode identical proteins and thus have similar functional outcome on the protein level. The combinatorial effect of the introduction of the longer transcript region through PA regulation and the preferential degradation and nuclear retention of that isoform is thus relatively mild. However, isoforms with different UTRs may exhibit variable functions as a consequence of altered mRNA localization [41-43], which is an especially significant regulatory determinant in highly polarized cells such as neurons [52, 53]. In this capacity, previous studies have found that long 3′ UTR transcripts of some genes are preferentially targeted to dendrites or distal axons to regulate neuronal morphology and functions [54, 55]. Notably, RBP have been implicated in the altered mRNA localization [56]. By analyzing 184 cross-linking immunoprecipitation sequencing (CLIP-Seq) datasets corresponding to 122 RBPs from Encyclopedia of DNA Elements project (ENCODE project) project [57], we identified 37 RBPs that could bound to these extended regions in human, such as cleavage stimulation factor subunit 2 (CSTF2) (Additional File 1: Supplementary Figure S11) involved in PA processing [58, 59]. Among these RBPs, three RBPs, such as CSTF2, heterogeneous nuclear ribonucleoprotein C (HNRNPC) and TIA1 cytotoxic granule associated RNA binding protein (TIA1), preferentially bound to these human-specific extended regions in comparison with randomly selected intergenic regions (Additional File 1: Supplementary Table S3). These trans-regulations could have contributed to the altered subcellular localization of these RNA isoforms. Besides PA-related extended exonic regions, our previous study also reported events of human exonization (splicing-related internal exons) through differential nucleosome occupancy [15]. As we obtained full-length gene models in both human and macaque here, we repeated the study to identify the human-specific, splicing-related internal exons on the basis of the full-length gene models and identified 32 additional human-specific exons (Additional File 3: Supplementary Table S2). The nucleotide diversity of these exonic regions is also significantly lower than that of synonymous sites of all human annotated genes, indicating that the human-specific internal exons are selectively constrained (Supplementary Figure S12). These findings thus indicate that the exonizations through PA-related and splicing-related mechanisms should both play adaptive roles in human evolution.

Methods

RNA extraction, library preparation and deep sequencing

Total RNA from tissue samples (frontal cortex, heart and testis) of one rhesus macaque animal and frontal cortex sample from one mouse animal was extracted using TRIzol reagent (Invitrogen, catalog #15596018) and analyzed on an Agilent 2100 Bioanalyzer to assess quality (Additional File 1: Supplementary Table S4). The quantity of total RNA was measured by Qubit Invitrogen. cDNA was then synthesized using the SMARTer PCR cDNA Synthesis Kit (Clontech, catalog #634925) and PrimeSTAR GXL DNA Polymerase (Clontech, catalog #R050B). Single-molecule real-time (SMRT) bell libraries were generated by using an SMRTbell™ Template Prep Kit 1.0-SPv3 (PN 100-991-900) and subsequently sequenced by using Sequel Binding Kit 2.0, Sequel Sequencing Kit 2.1 and Sequel™ SMRT® Cell 1 M v2 Tray on the PacBio Sequel System (Pacific Biosciences). The Iso-Seq of cerebellum samples was performed as in our previous study [2] in which the sequencing was carried out on a real-time sequencer (Pacific Biosciences) with C4 sequencing reagents.

Sequencing data processing

All Iso-Seq and RNA-Seq data were processed and evaluated by following SMRTLink guidance and previously published pipelines [2]. Briefly, for Iso-Seq data, ROIs were extracted and primer sequences flanking ROIs were trimmed. Primer-trimmed reads were further processed to scan the PA tail and trimmed to obtain PA-trimmed reads (reads with a PA tail trimmed). PA-trimmed reads were then mapped to the reference genome (hg19 for human, rheMac8 for macaque) by GMAP (Genomic Mapping and Alignment Program) [60], and low-quality reads (reads mapped to gap regions, reads mapped to more than one locus, reads with pass number of <1 or internal poly(A) priming reads) were filtered to obtain the final processed alignments. For RNA-Seq and chromatin immunoprecipitation sequencing (ChIP-Seq) data, raw reads were filtered to obtain high-quality reads, which were then aligned to the corresponding reference genome using TopHat2 [61] (v2.1.1) for RNA-Seq data and Burrows–Wheeler Aligner [62] (0.7.13-r1126) for ChIP-Seq data, respectively. CAGE-Seq data [11, 12] were analyzed by following the standard FANTOM pipeline [63].

De novo definition of macaque gene models with Iso-Seq

A new pipeline of gene model definition was developed on the basis of the RefSeq Eukaryotic Genome Annotation Pipeline [64, 65]. Briefly, Iso-Seq long reads were used to define the backbone of isoforms, based on which the redundant isoforms were discarded. If necessary, the junction reads of RNA-Seq were further used to refine the splicing sites of each isoform [2]. Next, we refined the transcription start site (TSS) of those truncated isoforms by elongating them to the nearest potential TSS marked by CAGE-Seq data, if the site was within 1 kb from the 5′ end of the isoform. Single-exon isoforms with no TSS marked by CAGE-Seq data were then discarded. Finally, on the basis of the refined gene models, we further introduced the 3-frame Met-to-stop translation method to define the open reading frames of each transcript, and the nucleic acid sequences were then translated into peptide sequences. We further divided the gene models into three groups according to their qualities. (Additional File 1: Supplementary Table S5). Briefly, Group 1 represents genes with the most reliable gene models. In this category, for multiple exon genes, the full-length transcript structure was supported by poly(A)-bearing Iso-Seq reads, with each splice junction supported by RNA-Seq junction reads; for single-exon genes, the transcript structure was supported by poly(A)-bearing Iso-Seq reads, with the TSS supported by CAGE-Seq evidence. In Group 2, in addition to the genes in Group 1, all other multiple exon gene models defined by poly(A)-bearing Iso-Seq reads or refined by the RNA-Seq or CAGE-Seq data were included. Finally, in addition to the genes in Group 1 and Group 2, all other Ensembl (release 96) gene models not covered by our Iso-Seq data were also included in Group 3, which represents a comprehensive atlas of macaque gene models. The GTF (Gene Transfer Format, GTF2.2) format files (Mmul_8.0.1/rheMac8), FASTA format files for CDS sequences and protein sequences for these gene models can be downloaded from http://rhesusbase.cbi.pku.edu.cn/download/Full-length_macaque_gene_models.jsp. In this study, only genes in Group 2 were used for subsequent analyses of PA regulations.

Evaluation of the newly defined macaque gene models

To evaluate the quality of the splice junctions of the newly defined gene models, sequence motifs flanking the donor/acceptor splice sites defined by the gene models were calculated and visualized using WebLogo (v2.8) [66]. H3K4me3 ChIP-Seq dataset [17], CpG island elements and poly(A)-Seq dataset [67] were also processed and used to verify the 5′ ends and 3′ ends of these gene models (Additional File 2: Supplementary Table S1). Briefly, the aggregated H3K4me3 ChIP signals (read coverage in ChIP sample subtracted by that in input sample) across TSSs of newly defined or Ensembl annotated (release 96) macaque genes were calculated and compared. Then, genes with CpG islands located within 200 bp of TSS were used to evaluate the enrichment of CpG islands across TSS. We also calculated the poly(A)-Seq-based PA site density across transcription termination sites (TTSs) to evaluate the 3′ boundaries of these gene models. To assess the quality of the predicted open reading frames, we compared the length of open reading frames defined by gene models of macaque genes and their orthologous genes in humans (RefSeq, release 94). Pairwise sequence alignments were then performed using blastp [68] (v.2.2.28, -outfmt 6 -evalue 1e-5) between annotated protein sequences of macaque or human (GenBank or RefSeq) and the putative protein sequences translated from the newly defined macaque gene models. The extent of the aligned proportion and identity of the orthologous pairs were calculated to quantitatively evaluate similarity.

Identification and quantification of PA regulation

To identify PA sites using Iso-Seq reads, PA-containing reads were first assigned to annotated genes. The 3′ ends of these reads were then clustered if they were located within 30 bp, with the site of highest coverage being defined as the representative PA site [2]. Only PA sites with at least two supporting reads were considered for the following analyses. For each macaque gene, the usage of a specific PA site was defined as the proportion of poly(A)-bearing reads encompassing the corresponding PA site. To identify the human orthologous PA sites for these macaque counterparts, the liftOver tool was used for the cross-species alignments. Human PA sites located within 30 bp of the orthologous regions of the macaque PA site were considered conserved PA sites and further subjected to the following clustering analyses. For the conserved PA sites, the cross-species and cross-tissue Spearman correlation coefficients of the PA usage were calculated, and the samples were grouped using the hierarchical clustering method. To assess nucleotide composition around cleavage sites, nucleotide distributions were determined around all cleavage sites. For the core PAS identification, we searched for the canonical hexamer (AAUAAA) in a 50-nt window upstream of the cleavage site. If this was not identified, we then searched for the other hexamer variants (AUUAAA, AGUAAA, UAUAAA, CAUAAA, GAUAAA, AAUAUA, AAUACA, AAUAGA, ACUAAA, AAUAAA, AAUGAA) [69]. When comparing PA efficiency between different types of PAS, we found PA sites of the canonical AAUAAA showing the strongest PA efficiency (Figure 4A, Additional File 1: Supplementary Figure S5, Wilcoxon rank-sum test, P < 2.2e-16), a finding consistent with previous reports [30]. For effective estimation of the average strength of PAS, we calculated the fraction of PA sites with the AAUAAA motif in the corresponding group in the following analyses. The weighted 3′ UTR length was calculated by following previously published method [30]. Briefly, transcript with the furthest 3′ transcript end and the most distal 5′ coding region end was chosen to get the 3′ UTR region of each gene. We additionally removed genes that contain annotated introns in the 3′ UTR regions. Then, the weighted 3′ UTR length was calculated as the average of all 3′ UTR lengths per gene weighted by the contribution of each isoform’s usage as measured by the PA usage.

Identification of distal PA sites newly originated in human

To identify distal PA sites specific to human but not macaque, we began with all PA sites supported by at least two Iso-Seq reads. Orthologous human PA sites detected in rhesus macaque were removed when cross-referencing to the macaque Iso-Seq data and annotated TTS (located within 30 bp). For human-specific 3′ UTR regions defined by candidate human-specific PA sites, another filter was applied based on macaque RNA-Seq data from 280 tissue samples (Additional File 2: Supplementary Table S1): if the average FPKM level (RSeQC (RNA-Seq quality control), v2.6.3) [70] was >0.2 in the corresponding macaque tissue [71], the PA site was discarded. Then, mouse was served as an outgroup to verify these PA sites are newly originated in human. Briefly, orthologous human PA sites detected also in mouse were removed from the initial candidate list for human-specific PA sites, when cross-referencing to the mouse Iso-Seq data (Table 1), mouse RNA-Seq data from 58 tissue samples (Additional File 2: Supplementary Table S1) and annotated TTS (located within 30 bp).

Identification of splicing-related internal exons newly originated in human

To identify newly originated internal exons in human, we began with human annotated internal exons (GENCODE v19) that were not overlapped with macaque exons based on the Group 3 annotated gene models and supported by human Iso-Seq reads. Then for orthologous candidate human-specific exons in rhesus macaque and mouse, filters were applied based on macaque RNA-Seq data from 280 tissue samples and mouse RNA-Seq data from 58 tissue samples (Additional File 2: Supplementary Table S1): if the FPKM level (RSeQC, v2.6.3) [70] in any of these samples was >0.2 [71], the candidate new exon was removed from the list.

Ancestral sequence definition

To define the common ancestral state of human and macaque at each site, multiple sequence alignment data were downloaded from the UCSC (University of California, Santa Cruz) Genome Browser [72], from which the human–macaque–marmoset aligned regions were extracted and analyzed as previously described [15, 73, 74].

Population genetics analyses

On the basis of the polymorphism data of 103 human individuals from the RhesusBase PopGateway [34], we measured the nucleotide diversity for the regions specifically transcribed in human but not macaque corresponding to these young PA sites [34]. The nucleotide diversity for synonymous sites on all human annotated genes was served as the control. The Wilcoxon one-tail test was performed to determine the significance of the nucleotide diversity differences between those specifically transcribed regions and the synonymous sites on all annotated genes.

Gene expression profile analyses

For the similarity of tissue expression profiles, RNA-Seq data from six paired tissues in human and macaque (frontal cortex, cerebellum, heart, kidney, muscle and testis) were integrated for the quantification of gene expression levels (Additional File 2: Supplementary Table S1). The Spearman correlation coefficients between the orthologous genes in the two species were calculated to measure the cross-species similarity in tissue expression profiles. For the profile of RNA subcellular localization, public APEX-Seq data [44] of nine subcellular fractions (nucleus, nucleolus, nuclear lamina, nuclear pore, cytosol, endoplasmic reticulum membrane, endoplasmic reticulum lumen, outer mitochondrial membrane and mitochondrial matrix) were used to calculate the RNA subcellular localization of isoforms. For isoforms with distal PA sites in human but not macaque, the extended 3′ UTR region specific to human were extracted as representative regions, which were then used to estimate the expression levels of the isoforms with human-specific PA sites. The expression levels of the constitutive exonic regions of these genes were used as the control. Following the previously published strategy [44], we further calculated the degree of enrichment using DESeq2 [75]. For the expression profiles in different developmental stages, RNA-Seq data of human fetal brain development were obtained from https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6814/ [76] (Additional File 2: Supplementary Table S1). Prenatal human forebrain samples with fragmentation score [6] >0.885 (qualified samples) were used for the following analyses to exclude RNA-Seq datasets with potential 3′ sequencing bias. If two or more qualified samples exist for a specific developmental stage, only the sample with the highest fragmentation score was kept for the following analyses. For the expression profiles in human glioblastoma multiforme and normal brain tissues, RNA-Seq data were downloaded from the TCGA (The Cancer Genome Atlas) (https://portal.gdc.cancer.gov) and GTEx portal (https://www.gtexportal.org/home) (Additional File 2: Supplementary Table S1) [33, 77]. Then, RSeQC (v2.6.3) [70] was used for FPKM calculation of the constitutive exons, as well as the extended regions corresponding to the newly originated, distal human PA events. Finally, the percentage of the expression for the isoform with the distal PA site was estimated as the ratio of the FPKM of extended regions and the FPKM of the constitutive exons. Genes with the constitutive exon’s FPKM <0.2 were excluded from the analyses.

CLIP-Seq and RBP binding analyses

Human CLIP-Seq data were downloaded from the ENCODE data portal (www.encodeproject.org). Only peaks supported by both of the two biological replicates were retained, resulting in 184 sets of RBP-binding peaks. Then, two kinds of overlaps were generated using bedtools [78]: the overlaps between the 184 peak sets and the extended regions corresponding to the newly originated distal PA events in human; and the overlaps between the 184 peak sets and the randomly selected intergenic regions as the control. The statistical significance of the two resulting overlaps was tested with the Fisher’s exact test.

Statistical analyses

All statistical analyses were performed in R (version 3.4.0), with the corresponding tests described in the main text or figure legends. Fisher’s exact test was performed across the contingency table. For data represented in boxplots, the nonparametric Wilcoxon rank-sum test was performed to compare the two groups.

Code Availability

All the codes used to generate results can be found at GitHub via URL https://github.com/xihuimeijing/Macaque-transcriptome. A comprehensive protocol integrating multilayer genomics data to de novo define full-length gene models. A comprehensive protocol to define full-length macaque gene models with high accuracy. Human–macaque differences in PA regulation. Human transcriptome evolution through PA site selection. Click here for additional data file.

79 in total

1. RSeQC: quality control of RNA-seq experiments.

Authors: Liguo Wang; Shengqin Wang; Wei Li
Journal: Bioinformatics Date: 2012-06-27 Impact factor: 6.937

2. CpG islands in vertebrate genomes.

Authors: M Gardiner-Garden; M Frommer
Journal: J Mol Biol Date: 1987-07-20 Impact factor: 5.469

Review 3. The multiple lives of DEAD-box RNA helicase DP103/DDX20/Gemin3.

Authors: Frank Curmi; Ruben J Cauchi
Journal: Biochem Soc Trans Date: 2018-03-09 Impact factor: 5.407

Review 4. Evolution of primate gene expression.

Authors: Philipp Khaitovich; Wolfgang Enard; Michael Lachmann; Svante Pääbo
Journal: Nat Rev Genet Date: 2006-09 Impact factor: 53.242

Review 5. MicroRNAs silence gene expression by repressing protein expression and/or by promoting mRNA decay.

Authors: I Behm-Ansmant; J Rehwinkel; E Izaurralde
Journal: Cold Spring Harb Symp Quant Biol Date: 2006

6. PRDX6 Inhibits Neurogenesis through Downregulation of WDFY1-Mediated TLR4 Signal.

Authors: In Jun Yeo; Mi Hee Park; Dong Ju Son; Ji Young Kim; Kyoung Tak Nam; Byung Kook Hyun; So Young Kim; Myung Hee Jung; Min Ji Song; Hyung Ok Chun; Tae Hyung Lee; Sang-Bae Han; Jin Tae Hong
Journal: Mol Neurobiol Date: 2018-08-10 Impact factor: 5.590