Few SINEs of life: Alu elements have little evidence for biological relevance despite elevated translation.

Laura Martinez-Gomez¹, Federico Abascal², Irwin Jungreis³, Fernando Pozo⁴, Manolis Kellis³, Jonathan M Mudge⁵, Michael L Tress⁴.

Abstract

Transposable elements colonize genomes and with time may end up being incorporated into functional regions. SINE Alu elements, which appeared in the primate lineage, are ubiquitous in the human genome and more than a thousand overlap annotated coding exons. Although almost all Alu-derived coding exons appear to be in alternative transcripts, they have been incorporated into the main coding transcript in at least 11 genes. The extent to which Alu regions are incorporated into functional proteins is unclear, but we detected reliable peptide evidence to support the translation to protein of 33 Alu-derived exons. All but one of the Alu elements for which we detected peptides were frame-preserving and there was proportionally seven times more peptide evidence for Alu elements than for other primate exons. Despite this strong evidence for translation to protein we found no evidence of selection, either from cross species alignments or human population variation data, among these Alu-derived exons. Overall, our results confirm that SINE Alu elements have contributed to the expansion of the human proteome, and this contribution appears to be stronger than might be expected over such a relatively short evolutionary timeframe. Despite this, the biological relevance of these modifications remains open to question.

Year: 2019 PMID: 31886458 PMCID: PMC6924539 DOI: 10.1093/nargab/lqz023

Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268

INTRODUCTION

Transposable elements are mobile DNA sequences that are able to copy themselves into new genomic locations (1). Approximately half the human genome is made up of active and inactive transposable element segments (2–4) but the actual proportion of mobile element-derived sequences in the human genome may be considerably higher since many inactive mobile elements have diverged beyond the detection of normal search algorithms (5). Transposable elements can be divided into four major and many smaller classes (2). DNA transposons encode the transposase protein, which they need to cut and paste themselves into new genomic regions (6). There are three types of retrotransposons that use RNA intermediates to copy themselves throughout the genome (7). Long terminal repeat (LTR) retrotransposons are derived from endogenous retroviruses with LTRs, most of which are no longer active in the human genome (8). Non-LTR retrotransposons are made up of long interspersed nuclear elements (LINEs), which, like the LTRs, encode a reverse transcriptase, and short interspersed nuclear elements (SINEs), which do not encode any ORF and rely on the LINEs to carry out the copying process (7). Active transposons in the human genome are relatively infrequent and are vastly outnumbered by a ‘graveyard’ of fossil transposon copies (3). Active retrotransposons exist among the non-LTR retrotransposons, including LINE-1, SINE Alu and SINE-VNTR-Alu (SVA) elements (3). These three families, which together make up more than a quarter of the human genome, have appeared and proliferated over the past 80 million years (9). However, most copies of these retrotransposons are no longer active due to decay by truncations and mutations. For example, although there are more than 500 000 copies of the LINE-1 retrotransposon in the human genome (10), fewer than 100 copies are still intact and capable of transposition (11,12). 
Accumulation of transposable elements has been shown to have a deleterious effect on fitness (13) and their presence has been associated with many diseases (14,15). However, with time transposable element sequences can also add to the functionality of genomic features through a process of co-option in which the transposable element sequence, or part of it, is recruited to perform some function. The incorporation of transposable elements (exaptation) has been shown to contribute to the evolution of regulatory motifs (16), promoters (17) and lncRNA (18) among others, and transposable elements have been co-opted into ancient protein-coding genes, either in their main isoform (19–21) or as alternative splice variants (22).
The SINE Alu family of retrotransposons comprises primate-specific elements (23) that are derived from the small cytoplasmic 7SL RNA and are ∼300 nt long. The majority map to non-functional regions of introns or intergenic sequences (24). Alu elements can be divided into three large sub-families. The oldest, the AluJ sub-family, arose 65 million years ago and has become entirely extinct through deleterious sequence changes (25). The AluS family evolved 30 million years ago and almost all elements are fossils, though some sub-families have been found to contain active members (25). Almost all active Alu elements are from the youngest sub-family, AluY (26), though not all AluY elements are active. Like other transposable elements, Alu elements are potentially deleterious (27,28).
Unlike most transposable elements, Alu elements have a pair of dinucleotides that can form a weak 3′ splice site and facilitate their conversion into exons (29). In addition, 5′ splice sites (30) and polyadenylation sites (31) can be generated from a minimal number of base substitutions. Sorek et al. (32) found that while SINE Alu elements are incorporated into exons, they are found predominantly in alternative exons rather than constitutive exons. These alternative exons are included in transcripts at lower frequencies than alternatively spliced exons derived from other sources, and Sorek et al. found that the vast majority would lead to a frameshift or a premature termination codon. However, since exons generated from Alu elements are almost always alternatively spliced, the main isoform is intact, allowing the Alu exons to acquire functionality over time (29).
It is not clear to what extent exaptation of primate-specific Alu elements contributes to cellular proteins. Gotea and Makałowski (20) concluded that functional proteins were unlikely to contain regions derived from young transposable elements like LINE-1 and Alu. However, support for the incorporation of Alu elements in coding genes has come from microarrays (33) and proteomics data (34). Lin et al. (34) found peptide evidence for 85 Alu-derived exons, which led them to suggest that Alu elements may be a substantial source of novel coding exons and may represent species-specific differences between humans and other primates. However, the peptides that supported these 85 Alu-derived exons came from the PRIDE proteomics database (35). While the PRIDE database is an important repository of experimental data, it is uncurated and the false discovery rate cannot easily be controlled in such a huge database (36). Because of this, many novel sequences identified solely via PRIDE are likely to be false positives (37,38). The Lin et al. study (34) only managed to validate two of the Alu-derived exons when they searched the FDR-controlled PeptideAtlas database (39).
Here we investigate to what extent SINE Alu elements are incorporated into coding genes in the human reference set and attempt to determine what proportion of the Alu elements that overlap coding exons are likely to code for functional proteins.

MATERIALS AND METHODS

Human reference set

The human reference gene set used in this study was v28 of the GENCODE manual annotation (40), which is equivalent to Ensembl 92 (41). The GENCODE v28 gene set is annotated with 97 713 protein-coding transcripts.

APPRIS

The APPRIS database (42) annotates splice isoforms with structural and functional information and cross-species conservation. It also selects a single isoform with a unique protein sequence as the principal isoform for each gene (43). We have shown that most genes have a main isoform at the cellular level (44) and that the principal isoforms selected by APPRIS are a highly reliable predictor of this main cellular isoform (44). Transcripts from the GENCODE v28 reference set were tagged as principal or alternative by the APPRIS database. The distinction can also be made at the level of exons. We tagged exons whose translation would be included in the principal isoform as principal exons and the remainder, exons that belong exclusively to alternative splice variants, were tagged as alternative exons.
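The exon-level tagging described above can be sketched in a few lines of Python. This is a minimal illustration, not the APPRIS implementation: the exon key `(chromosome, start, end, strand)`, the function name `tag_exons` and the input dictionaries are all assumptions made for the example, and exons are matched by exact coordinates rather than by overlap.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical exon key: (chromosome, start, end, strand).
Exon = Tuple[str, int, int, str]


def tag_exons(transcript_exons: Dict[str, List[Exon]],
              principal_transcripts: Set[str]) -> Dict[Exon, str]:
    """Label every coding exon as 'principal' or 'alternative'.

    An exon is 'principal' if it belongs to at least one principal
    transcript; exons found only in alternative transcripts are
    'alternative'.
    """
    principal_exons: Set[Exon] = set()
    for transcript_id in principal_transcripts:
        principal_exons.update(transcript_exons.get(transcript_id, []))

    tags: Dict[Exon, str] = {}
    for exons in transcript_exons.values():
        for exon in exons:
            tags[exon] = "principal" if exon in principal_exons else "alternative"
    return tags


# Toy example: T1 is the principal transcript, T2 an alternative variant.
example = {
    "T1": [("chr1", 100, 200, "+"), ("chr1", 300, 400, "+")],
    "T2": [("chr1", 100, 200, "+"), ("chr1", 250, 280, "+")],
}
example_tags = tag_exons(example, {"T1"})
```

In the toy example the exon shared by both transcripts is tagged as principal, and only the exon unique to T2 is tagged as alternative.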

RepeatMasker

RepeatMasker regions [Smit AFA, Hubley R and Green P, http://repeatmasker.org] were obtained from the UCSC genome browser at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.out.gz and mapped to transcripts from the GENCODE v28 reference set. For the SINE Alu analysis, if a transposon mapped to both principal and alternative isoforms, we counted just the principal isoform. Where a transposon or repeat mapped to more than one gene (generally where the transposon was present in a coding gene and in a read-through gene), we only counted the transposon once.
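The two counting rules above (principal takes priority over alternative, and each element is counted once even when it maps to several genes) can be illustrated with a short Python sketch. The record layout `(repeat_id, gene_id, isoform_tag)` and the function name are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record: (repeat_id, gene_id, isoform_tag), where
# isoform_tag is either "principal" or "alternative".
Hit = Tuple[str, str, str]


def count_repeat_overlaps(hits: Iterable[Hit]) -> Dict[str, int]:
    """Count each repeat element once, preferring the principal tag."""
    best_tag: Dict[str, str] = {}
    for repeat_id, _gene_id, tag in hits:
        # Keying on repeat_id alone collapses hits in several genes
        # (for example a coding gene and a read-through gene).
        if best_tag.get(repeat_id) != "principal":
            best_tag[repeat_id] = tag

    counts = {"principal": 0, "alternative": 0}
    for tag in best_tag.values():
        counts[tag] += 1
    return counts


example_hits = [
    ("Alu_1", "GENE_A", "alternative"),
    ("Alu_1", "GENE_A", "principal"),      # principal wins for Alu_1
    ("Alu_2", "GENE_B", "alternative"),
    ("Alu_2", "GENE_B-GENE_C", "alternative"),  # read-through gene, counted once
]
example_counts = count_repeat_overlaps(example_hits)
```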

Selection tests

Using human population variation data (45) we estimated a global dN/dS value with the dNdScv R package (46) for sets of exons overlapping simple repeats, low complexity regions, and transposable elements (all defined by RepeatMasker). dNdScv reports the ratio of the non-synonymous to synonymous substitution rates (dN/dS). Although dNdScv was originally designed for cancer genomic studies, it can and has been used to quantify selection in population variation data (46). A dN/dS lower than 1 implies purifying selection. Under purifying selection, dN/dS values are expected to be lower for common alleles than for rare alleles. Values of dN/dS close to one for both rare and common alleles are compatible with neutral evolution, but can also mean there is not enough statistical power to infer negative or positive selection, or also that there is a perfect balance between negative and positive selection. To estimate dN/dS ratios cross-species we obtained primate CDS alignments from the 100 vertebrate alignments generated with MultiZ (47) for each Alu-containing exon or exon fraction with evidence of protein expression. Alignments were visually inspected for frame-shifts and STOP codons and species carrying any of these were discarded from dN/dS calculations. To gain statistical power, the alignments of the coding portions of the 36 Alu elements with peptide evidence were concatenated into a single alignment. Based on this alignment a phylogenetic tree was inferred with PhyML 3.0 (48), selecting the best fit model with SMS (49). Then we used codeml from the PAML package (50) to optimize branch lengths, estimate dN/dS ratios and calculate likelihoods. The likelihood of an M0 model with a free dN/dS ratio parameter was compared to the null hypothesis in which dN/dS was fixed at 1 (neutral evolution). P-values were calculated using a Likelihood Ratio Test (LRT) with one degree of freedom.
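As a worked illustration of the one-degree-of-freedom LRT, the following Python sketch converts two log-likelihoods into a P-value. It is not the authors' code: the function name and the example log-likelihood values are invented, and the sketch relies on the standard identity that the upper tail of a chi-squared distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def lrt_pvalue_1df(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a likelihood ratio test with one degree of freedom.

    lnl_null: log-likelihood with dN/dS fixed at 1 (neutral model).
    lnl_alt:  log-likelihood with dN/dS estimated freely (M0 model).
    """
    # Test statistic 2 * (lnL_alt - lnL_null); clamp small negative
    # values that can arise from numerical optimisation noise.
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Chi-squared survival function for 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


# Invented log-likelihoods whose difference gives a statistic of about
# 3.84, the familiar 5% critical value for one degree of freedom.
example_p = lrt_pvalue_1df(lnl_null=-1000.0, lnl_alt=-998.0792705896529)
```

A P-value above the chosen significance level means the neutral model (dN/dS = 1) cannot be rejected for that alignment.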
We tested three different alignments/trees: one containing all simians (Green monkey, Marmoset, Orangutan, Human, Chimp, Gorilla, Gibbon, Squirrel monkey, Baboon, Rhesus and Crab eating macaque), one containing apes (Orangutan, Human, Chimp, Gorilla and Gibbon), and one with just Chimp and Human. In addition, we conducted a similar analysis but fitting M0 selection models separately for each individual exon and then gathering all the individual likelihoods together (sum of Log-likelihoods). A LRT with degrees of freedom equal to the number of exons tested was conducted to compare the neutral evolution and selection models. We also carried out an analysis of selective pressure within primates using PhyloCSF (51), which uses likelihood ratios calculated from multi-species alignments and pre-computed substitution frequencies to determine whether a given nucleotide sequence is likely to represent a functional, conserved protein-coding sequence. Scores were calculated for the simian subset of the 100-vertebrates MultiZ alignment and the primate subset (simian plus Bushbaby) using the PhyloCSF ‘mle’ option. A P-value was calculated for each region by estimating the probability a non-coding region of the same length would get the same or higher PhyloCSF score, using the non-coding model previously described for PhyloCSF-psi (51), with a Holm–Bonferroni correction applied for the number of regions tested (36).
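The Holm–Bonferroni step applied to the per-region PhyloCSF P-values can be sketched as follows. This is a generic implementation of the step-down procedure written for illustration; the function name and the example P-values are assumptions and do not come from the study.

```python
from typing import List


def holm_bonferroni(pvalues: List[float]) -> List[float]:
    """Return Holm-Bonferroni adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest P-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity so adjusted values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Invented P-values for three regions.
example_adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

A region is then called significant when its adjusted P-value falls below the chosen threshold.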

Gene family analysis

We performed a phylostratification analysis following a previously described pipeline (52) based on the gene family phylogenetic reconstructions of Ensembl Compara (53). Compara v95 is constructed out of genes from 152 species, providing 43 716 annotated gene family trees. Only species with enough coverage (>5×) were considered for the analysis. Compara assigns the speciation or duplication events represented by each internal tree node to the phylogenetic level in which these events were detected (53). To estimate the gene family age and the individual gene age for all protein-coding genes annotated in GENCODE v28, human coding genes were classified in the following classes or phylostrata: Fungi/Metazoa, Bilateria, Chordata, Vertebrata, Euteleostomi, Sarcopterygii, Tetrapoda, Amniota, Mammalia, Theria, Eutheria, Boreoeutheria, Primates, Simiiformes, Catarrhini, Hominoidea, Hominidae, HomoPanGorilla and Homo sapiens. Gene family age was defined as the age class at the root of the family tree (the oldest common ancestor with a member of the gene family), while gene age is the phylostratum in which the most recent genomic event took place. Gene age for duplicated genes represents the phylostratum of the last duplication, whereas gene age always agrees with gene family age for genes without a detectable duplication origin in their gene trees. Duplication events with a consistency score (54) below 0.3 were tagged as unclear and nodes with a score of 0 were dismissed from the analysis.
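The distinction between gene family age and gene age can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the published pipeline: the input is a simplified list of duplication nodes on the path from the tree root to the human gene, and the function name, the tuple layout and the way the "unclear" flag is returned are all hypothetical.

```python
from typing import List, Tuple

# Phylostrata from oldest to youngest, as listed in the text.
PHYLOSTRATA = [
    "Fungi/Metazoa", "Bilateria", "Chordata", "Vertebrata", "Euteleostomi",
    "Sarcopterygii", "Tetrapoda", "Amniota", "Mammalia", "Theria",
    "Eutheria", "Boreoeutheria", "Primates", "Simiiformes", "Catarrhini",
    "Hominoidea", "Hominidae", "HomoPanGorilla", "Homo sapiens",
]


def gene_age(family_age: str,
             duplications: List[Tuple[str, float]]) -> Tuple[str, bool]:
    """Return (gene age, unclear flag) for one human gene.

    family_age:   phylostratum at the root of the gene family tree.
    duplications: (phylostratum, consistency score) for each duplication
                  node between the root and the human gene.
    """
    # Nodes with a consistency score of 0 are dismissed.
    usable = [(s, score) for s, score in duplications if score > 0]
    if not usable:
        # No detectable duplication: gene age equals gene family age.
        return family_age, False
    # The most recent duplication is the one in the youngest phylostratum.
    stratum, score = max(usable, key=lambda d: PHYLOSTRATA.index(d[0]))
    # Scores below 0.3 are kept but flagged as unclear.
    return stratum, score < 0.3
```

For example, a gene in a Bilateria-age family whose last well-supported duplication is dated to Primates would receive the gene age Primates, while a gene with no duplication nodes keeps the age of its family.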

Primate-derived exons

To determine whether an exon arose in the primate clade we defined as alternative all those exons that did not overlap with any exon integrated in a principal isoform in APPRIS. We removed sequences shorter than 45 bases, as these exons are likely to be too short to identify homology in the TBLASTN search (55). There were 12 540 exons in the GENCODE v28 gene set that met these criteria. The translated sequences of these exons were used as queries to search against six different mammalian non-primate genomes, cat, dog, mouse, sheep, polar bear and pig, retrieved from Ensembl v95 (41), equivalent to GENCODE v28. In the TBLASTN search we turned off low complexity filtering, defined gap opening and extension penalties of 13 and 1, respectively, and set a maximum E-value threshold of 0.1. All exons that had a significant homology hit in one of these species were discarded. We also used APPRIS annotations to filter out non-primate exons. Any alternative exon that formed part of a transcript with a conservation score of more than 1.5 (conservation in human plus chimp) was also discarded from the primate exon list. We defined 7566 primate-derived alternative exons. A total of 777 of these overlapped an Alu element so were discarded. The final list of exons that we were not able to map to any of the six non-primate mammalian species totaled 6789 exons.
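The filtering cascade described above can be summarised as a short Python sketch. The thresholds (45 bases, E-value of 0.1, conservation score of 1.5) come from the text; the `AltExon` record, its field names and the two helper functions are hypothetical, and the TBLASTN search and APPRIS scoring are assumed to have been run beforehand.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_LENGTH_NT = 45       # exons shorter than 45 bases were removed
MAX_EVALUE = 0.1         # maximum TBLASTN E-value threshold
MAX_CONSERVATION = 1.5   # APPRIS conservation score cut-off


@dataclass
class AltExon:
    """Hypothetical record for one alternative exon."""
    exon_id: str
    length_nt: int
    best_evalue: Optional[float]  # lowest E-value over the six genomes, None if no hit
    conservation: float           # conservation score of the parent transcript
    overlaps_alu: bool


def primate_derived(exons: List[AltExon]) -> List[AltExon]:
    """Keep alternative exons with no sign of a non-primate origin."""
    kept = []
    for exon in exons:
        if exon.length_nt < MIN_LENGTH_NT:
            continue  # too short for a reliable homology search
        if exon.best_evalue is not None and exon.best_evalue <= MAX_EVALUE:
            continue  # significant hit in a non-primate mammal
        if exon.conservation > MAX_CONSERVATION:
            continue  # transcript conserved beyond human plus chimp
        kept.append(exon)
    return kept


def split_by_alu(exons: List[AltExon]) -> Tuple[List[AltExon], List[AltExon]]:
    """Separate Alu-overlapping exons from the remaining primate exons."""
    alu = [e for e in exons if e.overlaps_alu]
    non_alu = [e for e in exons if not e.overlaps_alu]
    return alu, non_alu
```

With the counts reported in the text, the last step corresponds to 7566 − 777 = 6789 primate-derived exons that do not overlap an Alu element.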

Proteomics analysis

The proteomics analysis was carried out using the January 2019 human build of PeptideAtlas (39). We mapped peptides validated by PeptideAtlas to the 1224 Alu elements in the human proteome and to the 6789 alternative primate-derived exons. The advantage of using the PeptideAtlas database is that identifications from large-scale MS experiments are first subject to a pre-processing step that reduces the numbers of false positive matches. For this analysis we also rejected non-tryptic peptides, peptides that mapped to more than one gene and peptides shorter than seven amino acids. The remaining peptides mapping to SINE Alu regions or primate-derived alternative exons were validated by manual inspection of the spectra. Expert curation of peptide spectrum matches is an essential step when validating peptides that identify novel coding regions. Only those peptide-spectrum matches that passed manual inspection were deemed sufficiently reliable to confirm the translation of the inserted Alu elements or primate-derived exons.
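The automatic peptide filters that precede manual spectrum inspection can be sketched in Python as below. The length and single-gene rules follow the text; the definition of "tryptic" used here (cleavage after K or R at both ends, ignoring missed cleavages and the proline rule) is a simplifying assumption, and the function and parameter names are hypothetical.

```python
from typing import Iterable

MIN_PEPTIDE_LENGTH = 7  # peptides shorter than seven residues were rejected


def passes_filters(peptide: str,
                   genes: Iterable[str],
                   previous_residue: str,
                   is_protein_c_terminus: bool = False) -> bool:
    """Apply the length, uniqueness and tryptic filters to one peptide.

    peptide:               amino acid sequence of the identified peptide.
    genes:                 genes the peptide maps to.
    previous_residue:      residue preceding the peptide in the protein,
                           or "" if the peptide starts the protein.
    is_protein_c_terminus: True if the peptide ends the protein.
    """
    if len(peptide) < MIN_PEPTIDE_LENGTH:
        return False
    if len(set(genes)) != 1:
        return False  # peptide maps to more than one gene (or to none)
    # Simplified tryptic rule: trypsin cleaves after K or R.
    tryptic_start = previous_residue in ("K", "R", "")
    tryptic_end = peptide[-1] in ("K", "R") or is_protein_c_terminus
    return tryptic_start and tryptic_end
```

Peptides that pass these filters would still need the manual inspection of peptide-spectrum matches described above before being accepted.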

Transcript evidence

Pext (proportion expressed across transcripts) scores are normalized transcript-level measures of RNAseq expression. They are generated as part of the gnomAD project from the large-scale RNAseq analyses carried out by the GTEx consortium (56). Pext scores have been shown to distinguish highly conserved exons from exons with poor conservation. Here the Pext scores were used to measure the inclusion rates of Alu-derived exons and primate-derived exons with peptide evidence. cDNA support for Alu-derived exons and primate-derived exons with peptide evidence came from the European Nucleotide Archive (57) and NCBI RefSeq (58). Exons were counted as supported by a cDNA when the cDNA mapped to the 5′ and 3′ boundaries of the exon. cDNAs that included the exon as part of a retained intron were not counted as supporting the exon.
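The cDNA support rule can be expressed as a simple boundary check. The sketch below assumes each cDNA alignment is available as a list of aligned blocks with genomic start and end coordinates; this representation and the function name are assumptions made for the example, not a description of how the evidence was actually processed.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (genomic start, genomic end) of one aligned cDNA block


def cdna_supports_exon(blocks: List[Block], exon_start: int, exon_end: int) -> bool:
    """True if one aligned block matches both exon boundaries exactly.

    A cDNA that reads through the exon as part of a retained intron
    produces a block that extends beyond the exon boundaries, so it
    does not count as support.
    """
    for start, end in blocks:
        if start == exon_start and end == exon_end:
            return True
    return False


# Spliced cDNA that includes the exon at 400-500.
spliced = [(100, 200), (400, 500), (800, 900)]
# cDNA with a retained intron spanning the same exon.
retained = [(100, 900)]
```

Here `cdna_supports_exon(spliced, 400, 500)` is true, whereas the retained-intron alignment does not support the exon.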

RESULTS

According to RepeatMasker, remnants of transposon-based elements (not including regions predicted as Simple Repeats and Low Complexity) make up just over half of the bases in the human reference genome (50.66%). More than 20% of the fragments predicted as transposable element-derived in the human genome are SINE Alu elements, though LINE/L1 elements are the most common by number of bases because LINE/L1 elements are longer than Alu elements. By bases LINE/L1 elements make up 17.3% of the genome compared to the 10.4% of the genome that is contributed by Alu elements (Supplementary Figure S1A). Transposon-based elements were predicted to overlap CDS in 9% of GENCODE v28 transcripts (40). Almost 25% of the transposable elements that overlap coding exons are SINE Alu elements. Alu elements overlapped a total of 1224 distinct coding exons. The next most common transposable element classes were SINE MIR (789) and LINE/L1 (684). Almost all common transposon classes were found in much lower proportions within coding sequences (CDS) than within the whole genome (Supplementary Figure S1B); Alu elements total just 0.23% of the bases in the human coding reference set and LINE/L1 elements 0.12%. This is what would be expected if the presence of transposable elements were selected against in coding exons. However, some transposable element families are exceptions to the rule. The proportion of LINE/RTE-BovB elements is almost as high in CDS regions as it is in the whole genome, and DNA/hAT-Ac elements are actually more prevalent in CDS than in the genome as a whole (Figure 1A).

Figure 1.

The relative proportion of elements overlapping coding exons and their dN/dS. (A) The ratio of the percentage of transposable element bases in coding exons to the percentage of transposable element bases across the whole genome. Values close to one suggest that the presence of elements in coding sequences has not been selected against. SINE Alu elements have a ratio that is much lower than 1. Predicted simple repeats and low complexity regions are included as a comparison. (B) The dN/dS for transposable element families overlapping human coding exons for both rare and common allele frequencies. Values below one and lower dN/dS with common allele frequencies than with rare allele frequencies indicate purifying selection, while values close to one suggest that the elements are generally under neutral selection. SINE Alu elements have dN/dS values close to 1.

Taken at face value these proportions might suggest DNA/hAT-Ac transposable elements are not selected against in CDS regions. However, these are ancient transposable elements (2,59). While DNA/hAT-Ac elements preserved in CDS regions are still detectable by RepeatMasker, those outside CDS regions will not have been subject to purifying selection and may no longer be recognizable as deriving from transposable elements. This suggests that many of the ancient DNA/hAT-Ac elements have been co-opted and are evolving under purifying selection. The same is probably true for many LINE/RTE-BovB elements.
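The ratio plotted in Figure 1A can be reproduced for the two families whose percentages are quoted in the text. The Python sketch below is only a worked calculation from those rounded percentages; the function and variable names are illustrative.

```python
def cds_to_genome_ratio(pct_cds_bases: float, pct_genome_bases: float) -> float:
    """Ratio of a family's share of CDS bases to its share of genome bases."""
    return pct_cds_bases / pct_genome_bases


# (percentage of CDS bases, percentage of genome bases) as quoted in the text.
reported = {
    "SINE/Alu": (0.23, 10.4),
    "LINE/L1": (0.12, 17.3),
}

ratios = {
    family: cds_to_genome_ratio(cds, genome)
    for family, (cds, genome) in reported.items()
}
```

Both ratios come out well below one (roughly 0.02 for SINE Alu and 0.007 for LINE/L1), consistent with these families being depleted in coding sequence relative to the genome as a whole.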

Selection

In order to determine whether transposable elements that overlap annotated coding exons have acquired functional importance as proteins, we measured selection using the ratio of the rates of non-synonymous and synonymous changes (dN/dS). We estimated a global dN/dS value for exons overlapping each of the most common categories of RepeatMasker regions using dNdScv (46). The results (Figure 1B) suggest that in general DNA/hAT-Ac, and LINE/RTE-BovB transposable elements (along with LINE1/CR1 elements, simple repeats and low complexity regions) are under purifying selection, as might be expected from their partitioning between genome and proteome (Figure 1A), whereas exons overlapping most other elements (including SINE Alu elements) are not, in general, under selection and are therefore less likely to have functional importance.

SINE Alu elements locate preferentially to alternative exons

The APPRIS database (42) divides transcripts into those that give rise to the principal protein isoform and those that if translated would produce alternative isoforms (see ‘Materials and Methods’ section for more details). Exons that overlapped all RepeatMasker transposon classes were separated into those found in the APPRIS-defined principal transcripts, and those found solely in alternative transcripts. Alternative exons make up just over 10% of the exons in the reference genome, so if transposable elements were randomly distributed, we would expect to find 1 in 10 transposable elements in alternative exons and the other 90% should overlap with principal exons. This is true for exon-overlapping simple repeats (87.8%) and some older transposable elements are also found at the expected frequency in principal exons, including DNA/hAT-Ac (88.8%), SINE/SS-Deu-L2 (83.3%), SINE/tRNA (78.9%) and LINE/RTE-BovB (85.4%) elements (Supplementary Figure S2). By contrast, just 9.2% of SINE Alu elements were found in principal exons. It should be noted that APPRIS determines principal isoforms based on conserved structural and functional features and cross-species conservation. Since Alu elements arose in the primate lineage and do not form part of conserved functional or structural domains, we would expect few Alu element-derived exons to be classified as principal by APPRIS. In any case, APPRIS predictions are backed up by transcript level studies showing that internal exons overlapping Alu elements are predominantly alternatively spliced (32).

SINE Alu elements in the human reference genome

A total of 1074 distinct coding genes in GENCODE v28 have coding exons that overlap SINE Alu elements. There are 1224 Alu elements that overlap coding exons, but several genes harbour more than one element. For example, ZNF506 contains four distinct Alu overlaps in alternative 3′ exons and 23 genes overlap three different Alu elements. Genes with coding regions that overlap Alu elements are significantly enriched in zinc finger motifs relative to the whole genome. A total of 93 genes are annotated with C2H2 zinc finger domains (Fisher's test, P-value of 9.4e-16) according to SMART (60). Only one other protein domain is significantly enriched in this set, KRAB domains (P-value of 4.8e-24). KRAB domains are generally found in tandem with C2H2 zinc finger domains. Many of these genes are from the cluster of KRAB-ZNF genes at the centromere on chromosome 19. Just over half of these genes overlap a range of different Alu elements, including all six members of the ZNF431 clade (61). SINE Alu elements are more often found in the final coding exon: almost half of the coding exons that overlap Alu elements are 3′ CDS (591). Sixty per cent of the Alu elements that overlap zinc finger genes are found in the final exon. This elevated number has two possible explanations. It may be because Alu elements are likely to produce fewer deleterious effects when inserting into a 3′ exon, or it may be caused by out-of-frame insertions that generate premature stop codons. The fact that Alu insertions can easily form polyadenylation signals (33) would clearly facilitate the establishment of 3′ exons. Alu elements that insert into internal CDS may generate frameshifts in downstream exons. In fact, 50.2% of annotated Alu elements that overlap internal or first CDS are predicted to lead to frameshifts. This is somewhat fewer than expected by chance and in contrast to what was found by Sorek et al. (32). This lower number may be evidence in favour of these being truly functional exons, but it could also be caused by systematic bias given the composition of Alu sequences.
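The comparison with chance rests on a simple property of the reading frame: an inserted coding segment preserves the frame only if its length is a multiple of three. The Python sketch below works through that expectation under the assumption, made here purely for illustration, that insert lengths are spread evenly across the three possible remainders; it ignores premature stop codons, which can truncate a protein even when the frame is preserved.

```python
def is_frame_preserving(insert_length_nt: int) -> bool:
    """An insert keeps the downstream reading frame if its length is a multiple of 3."""
    return insert_length_nt % 3 == 0


def chance_frameshift_fraction(min_length_nt: int, max_length_nt: int) -> float:
    """Fraction of insert lengths in a range that would shift the frame."""
    lengths = range(min_length_nt, max_length_nt + 1)
    shifted = sum(1 for n in lengths if not is_frame_preserving(n))
    return shifted / len(lengths)


# With lengths of 1 to 300 nt equally likely, two thirds shift the frame.
expected_by_chance = chance_frameshift_fraction(1, 300)
```

Under this assumption roughly 66.7% of inserts would cause a frameshift, so the observed 50.2% is lower than the chance expectation.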

Almost all SINE Alu elements that overlap coding exons are inactive

More than 50% of the Alu elements that overlap exons are from the AluS family (55.2%) against just 7.4% of the youngest Alu family (AluY family). The AluY sub-family itself is partly active (28), but only three copies of elements from sub-families known to be active (28) are annotated in (alternative) coding exons in the reference genome. The proportions of Alu sub-families overlapping coding exons are shown in Figure 2.

Figure 2.

SINE Alu sub-families that overlap coding exons. (A) The SINE Alu family tree based on the family tree in RepeatMasker. The most common sub-families are marked with a black box. (B) The proportion of each Alu sub-family that overlaps coding exons. Members of the FRAM/FLAM, AluJ, AluS and AluY families by their proportion in coding exons in the reference genome. The most common sub-families are labeled in the chart.

Over 37% of the Alu elements that overlap coding exons are from the older FRAM/FLAM or AluJ families, compared to just 29.4% across the whole genome (Supplementary Figure S3). The difference is significant in a Chi-squared test (P < 0.0001). This may be partly because older Alu elements are often no longer detectable outside of conserved regions such as coding exons.

The NPIPB sub-family

Genes with Alu-derived exons annotated in the reference genome have a similar age distribution to the rest of the human reference set, except that there are proportionally more genes that have arisen in the primate lineage (Supplementary Figure S4). Though the difference is significant (Chi-squared test, P-value of 0.00014), it is entirely due to the 10 duplications in the NPIPB sub-family, which itself arose in the primate clade (62). The 15 members of the nuclear pore complex-interacting protein family are primate specific and found in segmental duplications on chromosome 16 (62). The nuclear pore complex-interacting proteins (NPIPs) are made up of one or two membrane-interacting regions, a central coiled-coil domain and a variable number of C-terminal repeats. Three sub-families can be distinguished by the length and composition of the repeats; the NPIPA sub-family does not contain any SINE Alu elements, but RepeatMasker defines two distinct SINE Alu elements for each member of the two NPIPB sub-families, NPIPB3/4/5/11/13 and NPIPB6/7/8/9/15. In fact, one of the three distinct types of repeats that make up the final exon in this family seems to have derived from Alu elements (Figure 3). The NPIPB6/7/8/9/15 sub-family also has an Alu-derived insertion in the second coding exon.

Figure 3.

A schematic representation of the three NPIP sub-families. The relationship between coding exons, SINE Alu elements, Pfam domains and repeats in the three NPIP sub-families. One member of each family is taken as the representative. Exons are not to scale. Each family member has an initial coding exon duplicated from an acyl-CoA synthetase medium chain family member (there is also an alternative 5′ coding exon annotated for most family members), five or six internal exons that define the Pfam domain that is unique to NPIP family members, and a variable-sized 3′ CDS that is essentially composed of repeats. The Pfam domain overlaps one set of repeats. A second set of repeats, found at the 3′ end of the final CDS in the NPIPB sub-families, appears to be composed entirely of SINE Alu element fragments.

Phylogenetic reconstruction suggests that the NPIPB sub-families derived from the ancestral NPIPA in a stepwise manner and that the evolution of NPIPB sub-families within the great apes clade coincided with the insertion of Alu elements in the coding region and a number of further retrotransposon events within the 5′ and 3′ UTRs of the NPIPB sub-family members. Since the duplications are so recent, the genes are very similar. It is not easy to distinguish whether all annotated genes are coding, or whether only some are coding and others are pseudogenes. However, at least one member of the NPIPB6/7/8/9/15 sub-family has clear evidence of protein expression in testis. All the peptide evidence in PeptideAtlas mapped to a single gene (NPIPB6), so NPIPB6 was used to represent the whole sub-family.

Alu elements in principal isoforms

Alu elements were predicted to be present in the principal exons of 103 coding genes. We carried out a detailed manual analysis of these genes to determine whether the Alu element had been incorporated into the main transcript or an alternative variant and whether or not the Alu elements were part of bona fide coding genes (63). Details of the manual annotation can be found in the Supplementary Results section. We found that the Alu element forms part of the main coding isoform of 10 genes and all the members of the NPIPB sub-family (Table 1).

Table 1.

Genes in which the Alu element is part of the main coding isoform

Gene	Gene family age	Function
BEND2	Euteleostomi	Unknown function. Expressed in testis. Alu element inserts a whole exon into the highly divergent N-terminal.
HSD17B7	Fungi-Metazoa	3-keto-steroid reductase, part of the estrogen synthesis pathway. Adds eight amino acids to the N-terminal.
NLRP1	Euteleostomi	Part of the NLRP1 inflammasome (64). The Alu region corresponds to an inserted exon that adds 27 amino acids.
NPIPB6	Simiiformes	Unknown function. Expressed in testis. Represents a primate-derived sub-family with three Alu inserts. All three extend exons.
TTF1	Chordata	Transcript termination factor in ribosome biogenesis. The Alu element adds 23 amino acids to the C-terminal.
USP19	Fungi-Metazoa	A multi-functional deubiquitinating enzyme. The Alu element extends exon 2 by 46 amino acids.
ZNF101	Bilateria	Unknown function. The Alu element inserts 49 base pairs and a stop codon into the 3′ exon of the CDS.
ZNF394	Euteleostomi	A transcriptional repressor in MAP kinase signaling (65). The element adds 8 amino acids to the C-terminal.
ZNF433	Bilateria	Activation of beta-catenin/TCF signaling. The Alu region changes a single amino acid at the C-terminal.
ZNF669	Bilateria	Unknown function. Adds 22 amino acids to the stop codon.
ZNF91	Bilateria	SVA transposable element repressor (66). The Alu element displaces two zinc finger motifs while adding 33 amino acids.

Gene family age is the age of the oldest common ancestor of the gene family.

Five of the 11 genes in which Alu elements have modified the main coding sequence code for zinc finger proteins and all but ZNF394 are primate-specific duplications of the same zinc finger family (61). The most interesting case is ZNF91. Here the Alu element, which only appears in the great apes, adds 33 amino acids to the C-terminal while displacing eight zinc-binding residues from the ancestral protein. A further change in the human lineage led to the upstream insertion of seven zinc finger binding motifs. The gain of these zinc fingers has enabled ZNF91 to become a repressor of SVA transposable elements (66). It is not clear whether the Alu insertion also contributes to this role. Eight of the Alu elements, including those in all five zinc finger genes, would extend the C-terminal of the resulting protein. It is known that zinc finger proteins are highly plastic at their C-terminals (67). All the elements, except those in BEND2 and NLRP1, have integrated into the principal isoform by ‘hijacking’ existing coding exons rather than creating new coding exons.

Peptide evidence for SINE Alu functionality

It is possible that other Alu-derived exons, besides those present in principal isoforms, have evidence for functionality. We attempted to confirm the translation to protein of the SINE Alu elements in the human proteome. We searched the PeptideAtlas database for validated peptides that mapped to the 1224 unique Alu-derived exons and manually verified the peptide-spectrum matches (PSMs) for these peptides (see ‘Materials and Methods’ section). The peptide evidence validated the translation of 33 Alu-derived exons from 29 different genes (SLC3A2 and NPIPB6 each contain three Alu-derived exons, and all three were validated by spectra from PeptideAtlas in both genes). The 29 genes with translated Alu-derived exons are shown in Supplementary Table S1. There are validated peptides for 8 of the 13 Alu elements in principal isoforms, including all three SINE Alu regions in NPIPB6. The elements in ZNF91 and USP19 are supported by two non-overlapping peptides. Although we do not find peptides that map to the Alu elements present in zinc finger proteins ZNF101, ZNF394 and ZNF669, there are peptides that uniquely identify the exons that the Alu elements are part of, so we can assume that all these Alu elements are translated as well. The remaining 25 Alu elements with validated translation are all in alternative isoforms, though some of the variants have so much peptide and RNAseq evidence that they could at least be considered strong alternative isoforms. The alternative C-terminal in CD55 is supported by three non-overlapping peptides and the inserted Alu region in NEK4 is supported by four peptides. The peptide data for these two Alu regions suggest that the Alu exons have at least as much support as the ancestral isoforms.
Twenty-two of the Alu elements for which we found valid peptides are inserts in the ancestral transcripts, and all but one insert was frame-preserving (the indel in DLGAP5 adds four amino acids and a stop codon from the last coding exon of the principal variant as a result of a frameshift). Nine of the remaining Alu-derived exons (and DLGAP5) would affect the C-terminal of the proteins while two extend the N-terminal. All SINE Alu elements for which we found verified peptide evidence modified existing CDS. In all cases the ancestral gene family predated the Alu element insertions, though we cannot be sure whether SINE Alu insertion occurred before or after gene duplication for genes ZNF101, ZNF195 and ZNF669. We cross-checked the 85 genes identified in the Lin et al. (34) paper against evidence from the PeptideAtlas database. We validated just five of the peptides detected by Lin et al. for SINE Alu elements.
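
The frame-preservation count reported above (21 of 22 inserts) can be put in context with a simple binomial calculation. The sketch below is illustrative only and is not the authors' analysis: it assumes, as a null model not stated in the paper, that a random insert preserves the reading frame with probability 1/3. The function name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Counts from the text: 22 inserts with peptide evidence, all but one in-frame.
in_frame, inserts = 21, 22

# ASSUMPTION (not from the paper): insert lengths are uniform modulo 3,
# so a random insert is frame-preserving with probability 1/3.
p_chance = 1 / 3

p_value = binomial_tail(in_frame, inserts, p_chance)
print(f"P(at least {in_frame} of {inserts} in-frame by chance) = {p_value:.2e}")
```

Under this assumed null model the probability of seeing so many frame-preserving inserts by chance is very small. A different null, for example one based on the observed length distribution of Alu-derived exons, would give a different value.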

How do Alu-derived exons compare to other primate-derived exons?

In order to determine whether the peptide evidence we found for 33 SINE Alu elements was similar to what might be expected for primate-derived alternative exons, we repeated the PeptideAtlas analysis with exons that arose in the primate clade as comparison. We only looked at primate exons tagged by APPRIS as alternative because exons within principal isoforms would be expected to form part of the expressed proteins (we found peptides for 8 of the 13 SINE Alu overlaps in verified principal exons). We curated a set of 6789 primate-derived alternative exons (see ‘Materials and Methods’ section for details). In comparison, the curated set of alternative SINE Alu-derived exons totalled 777 exons. SINE Alu elements make up 10.4% of the bases in the human genome and just over 10% of annotated primate exons are Alu-derived, which suggests that Alu elements are not any more likely to be annotated as coding exons than other non-coding regions. We mapped peptides from the PeptideAtlas database to the exons (as described in the ‘Materials and Methods’ section). After manual curation we found reliable peptide identifications for just 25 primate-derived alternative exons (0.37%). As a comparison, we found peptide evidence for 22 of the 777 SINE Alu-derived alternative exons (2.83%), proportionally more than seven times as much and significantly more than would be expected for standard primate-derived exons (P-value of <0.0001 in Chi-squared tests). This shows that a significantly higher proportion of SINE Alu elements are incorporated into expressed proteins than would be expected. We analysed transcript evidence in the form of cDNA support and Pext scores (normalized exon inclusion rates) for the 47 alternative exons with peptide evidence. There was more supporting transcript evidence for the translated Alu-derived exons than for the translated primate-derived exons.
cDNA evidence supported the expression of 19 of the 22 alternative Alu-derived exons against just 14 of the 25 primate-derived exons, while 8 of the 22 alternative Alu-derived exons had Pext scores >0.5, against none of the primate-derived alternative exons. The differences between the two sets of exons are significant: Fisher's tests showed a P-value of 0.0293 for the differences in cDNA support and 0.001 for the Pext scores. Several of the Alu-derived exons had highly tissue-specific expression patterns. For example, the Alu-derived exon in DLGAP5 had an average Pext score of just 0.1, but was completely included in endocervix, while the inclusion of the 3′ Alu-derived exon in CMC2 was noticeably higher in brain than in other tissues.
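
The comparison of peptide detection rates between the two exon sets can be reproduced approximately from the counts given in this section. The following is a minimal sketch, not the authors' code: it assumes a plain Pearson chi-squared test on a 2x2 table without continuity correction (the paper does not say which variant was used), and the function name `chi_squared_2x2` is hypothetical.

```python
from math import erfc, sqrt


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and the P-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi-squared > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


# Counts from the text.
alu_hits, alu_total = 22, 777            # Alu-derived alternative exons
primate_hits, primate_total = 25, 6789   # other primate-derived alternative exons

alu_rate = alu_hits / alu_total
primate_rate = primate_hits / primate_total
fold = alu_rate / primate_rate

chi2, p_value = chi_squared_2x2(
    alu_hits, alu_total - alu_hits,
    primate_hits, primate_total - primate_hits,
)

print(f"Alu: {alu_rate:.2%}, primate: {primate_rate:.2%}, ratio: {fold:.1f}x")
print(f"chi-squared = {chi2:.1f}, P = {p_value:.1e}")
```

One design note: the expected count of Alu-derived exons with peptides under the null hypothesis is just under five, so an exact test would be a reasonable cross-check on the chi-squared approximation.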

SINE Alu inserts and domain composition conservation

Events that cause changes in Pfam (68) domain composition tend not to be detected in proteomics experiments (69). This is presumably because, like frame-changing indels, this would normally lead to gross functional changes in the protein and be selected against. Even though all detected SINE Alu element inserts were frame-preserving, six of the events for which we found peptides would break Pfam functional domains. While this is somewhat surprising, five of the six domain-disrupting events may not actually have much effect on the functional domain. For example, the insertion in the domain in TKT is relatively short, occurs in a loop region, and the Pfam seed alignment (68) includes sequences with similar sized inserts at the same position. In CMC2 the Alu exon removes the C-terminal portion of the Pfam domain, but the C-terminal swap does not affect the beta-hairpin that this protein forms, nor the conserved cysteines. The C-terminal of the Cmc1 domain that is broken by the SINE Alu insertion is not conserved in the Pfam seed alignment (Figure 4A). The A_deamin domain in RNA-editing deaminase 1 from gene ADARB1 has two conserved N- and C-terminal sections and a central linker section without conservation. Sequences from Danio rerio, chicken and Xenopus are among those that also have insertions in this central ‘linker’ region, and the central linker region is just where the ADARB1 SINE Alu exon inserts. The insertion can be visualized mapped onto the crystallized structure (Figure 4B): it inserts into an already disordered region away from the catalytic site, in contrast to what is reported by Lin et al.

Figure 4.

The effect of the SINE Alu insertions on Pfam domains. (A) The seed alignment for Pfam domain Cmc1. Conserved cysteines are marked with red arrows; the region of the Pfam domain equivalent to the region replaced by the SINE Alu element insert is shown by the green arrow. The insert would not affect the four conserved cysteines. (B) The structure of the ADARB1 catalytic domain (PDB (70) structure: 1ZY7). The catalytic region and the phytic acid co-factor are shown with the large arrow. The SINE Alu element would be inserted into the disordered region, the start and end of which are marked by the smaller arrows, and would therefore not interact directly with the catalytic domain of ADARB1.

SINE Alu element translation and selection

The substantial evidence for the expression and translation of a small set of Alu-derived exons suggested that this subset of Alu elements might have gained functional roles in the cell. We investigated whether there was data to support this hypothesis. We defined ‘functional role’ for the purpose of this analysis as having evidence of protein-like purifying selection (71). Although SINE Alu elements as a whole are not under selective pressure (Figure 1B), it is possible that the subset of Alu elements with evidence of translation is under measurable selective constraints. Using PAML we estimated dN/dS from concatenated primate alignments (50) of the coding portion of the 33 elements with peptide evidence and for the Alu elements that overlap expressed coding exons in ZNF101, ZNF394 and ZNF669 that we can assume are also expressed as proteins. The estimated dN/dS values were not significantly different from one for the alignments of all simians, of apes, or of human and chimp (see Supplementary Table S2). An alternative analysis fitting the selection models separately for each individual exon and then multiplying the resulting likelihoods did not reject the null hypothesis of neutral evolution either. Furthermore, we found stop gains and frame-shifts in 24 of the 36 Alu-derived exons across primates, suggesting that these Alu elements have not established important functional roles across the primate clade. In order to test for significance, we looked at stop gains and frame-shifts in the primate clade for 36 exons of similar size selected at random from the 32 genes with Alu-derived exons that we analysed. Just four of these exons had frame-shifts or stop gains in the primate clade. Analysis of the same 36 elements using PhyloCSF (51), a measure of evolutionary coding potential, produced similar conclusions. 
The average PhyloCSF score for the coding portion of these Alu elements using alignments of the primate and simian clades is negative, suggesting that these regions have not been under protein-coding constraint in aggregate. However, there is one case for which we found weak evidence for coding selection. The 8-codon Alu-derived region in ZNF394 has a PhyloCSF score of 29.4, which is higher than would be expected for a region of that length that was not under protein-coding selection (uncorrected P = 0.003, multiple-hypothesis corrected P = 0.12). Further support comes from the fact that there are no indels and the stop codon immediately following it is perfectly conserved (CMC2 is the only other C-terminal addition that conserves its stop codon throughout primates). The alignment of the ZNF394 region can be seen in Supplementary Figure S5. From the point of view of human population variation there is not enough data to assess selection on this small set of exons. However, just eight of the 35 variants with a MAF greater than 0.1% are synonymous, while six (17.1%) are high impact (four stop gains and two frameshifts). By way of comparison just 3 of the 271 variants with an MAF above 0.1% in non Alu-derived exons from the relevant principal transcripts were high impact variants (1.1%). The two proportions of high impact variants are significantly different (Fisher's exact test P-value of 0.0002). The high impact variants in the Alu-derived exons occurred in both principal (two) and alternative (four) Alu-derived exons. Although the data is scarce, the frequency of high impact variants further supports the hypothesis that these Alu-derived exons have not yet gained relevant functions.
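
The two count comparisons in this section (disrupted exons across primates, and high impact variants in the human population) can be checked with a two-sided Fisher's exact test built from the hypergeometric distribution. This is a minimal sketch under stated assumptions, not the authors' code: the paper reports a Fisher's exact test only for the variant comparison and does not specify sidedness, so the values printed here are not guaranteed to match the published figure exactly. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-9  # guard against floating-point ties
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * tolerance
    )
    return min(1.0, p_value)


# Counts from the text: exons with stop gains or frame-shifts across primates.
# 24 of 36 Alu-derived exons versus 4 of 36 randomly selected exons.
p_disrupted = fisher_exact_two_sided(24, 36 - 24, 4, 36 - 4)

# Counts from the text: high impact variants with MAF above 0.1%.
# 6 of 35 in Alu-derived exons versus 3 of 271 in non Alu-derived exons.
p_variants = fisher_exact_two_sided(6, 35 - 6, 3, 271 - 3)

print(f"Disrupted exons:      P = {p_disrupted:.1e}")
print(f"High impact variants: P = {p_variants:.1e}")
```

Both comparisons give small P-values under these assumptions, in line with the conclusion that the Alu-derived exons tolerate disruptive changes more often than the comparison exons.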

DISCUSSION

SINE Alu elements make up more than 10% of the human genome; in total the genome has been colonized by close to 1.2 million SINE Alu fragments. The vast majority map to intergenic and intronic regions and just 1224 Alu fragments (0.1%) overlap annotated coding exons. The reduced proportion of SINE Alu elements in exons suggests that there is selective pressure against their inclusion in coding regions. Even where Alu fragments overlap coding exons, they do not appear to be functionally important. Coding regions that derive from SINE Alu elements are not under selective pressure and almost all annotated Alu-derived exons are found in alternative coding transcripts. Little is known about the cellular roles of any of these Alu-derived exons, though the Alu-derived exon in LIN28B has been shown to be necessary for oncogene activation (72). Alu-derived coding exons are highly enriched in zinc finger proteins (67). Although Alu elements as a whole are not under selective pressure, we find that Alu-derived exons have become part of the principal splice variant in at least 11 coding genes. In all but two genes the Alu elements have ‘colonized’ the principal isoform by merging with existing coding exons. This is perhaps not surprising since merging with functioning coding exons is likely to be a shortcut to becoming established as part of the main transcript. Large-scale proteomics experiments tend not to detect evidence for alternative splice variants (69), nor genes that have evolved de novo in the primate lineage (63), so we would expect to find little evidence of translation for Alu-derived exons. Despite this there is clear evidence for the translation of 33 Alu-derived exons and peptide and transcript evidence suggests that many of these alternative exons are strongly expressed. All but one of the 22 insertion events we detected were in-frame, significantly more than would be expected by chance. 
The proportion of SINE Alu-derived exons detected in large-scale proteomics experiments was also significantly higher than expected; more than seven times higher than that of other primate-derived exons. This may be related to the splice signals present in Alu elements (29,30). Transcription evidence supported the strength of expression of these Alu-derived exons: both inclusion rates and cDNA support were significantly stronger for the Alu-derived exons with peptide evidence than they were for the other primate-derived exons with peptide evidence. A small subset of the 1224 Alu-derived exons has clearly added to the human proteome. All the evidence suggests that these SINE Alu elements have added to the human proteome via gene modification rather than de novo gene generation. In 26 of the 29 genes with peptide evidence, the SINE Alu elements added to an established (often ancient) protein-coding gene, while in the remaining three genes the SINE Alu event may have been concurrent with, or just after, a gene duplication. We find no evidence for the conversion of any SINE Alu element into a de novo human coding gene. Despite the lack of evidence for selection in SINE Alu-derived coding exons at the population level, we expected to find some evidence of evolutionary pressure for those Alu-derived exons with evidence of translation. However, we found none. There was no evidence for any selection from cross-species alignments within the primate clade or even among great apes. While there were too few variants in common alleles to be able to draw any conclusions about purifying or positive selection from human population variation, the sizable frequency of high impact variations among the common variants supports the possibility that even those Alu-derived exons with peptide evidence have yet to gain biologically important roles. Overall it seems that although SINE Alu elements contribute to the human proteome, they add little to the range of protein functions. 

72 in total

1. Accumulation of transposable elements in the genome of Drosophila melanogaster is associated with a decrease in fitness.

Authors: E G Pasyukova; S V Nuzhdin; T V Morozova; T F C Mackay
Journal: J Hered Date: 2004 Jul-Aug Impact factor: 2.645

2. Minimal conditions for exonization of intronic sequences: 5' splice site formation in alu exons.

Authors: Rotem Sorek; Galit Lev-Maor; Mika Reznik; Tal Dagan; Frida Belinky; Dan Graur; Gil Ast
Journal: Mol Cell Date: 2004-04-23 Impact factor: 17.970

Review 3. Evolutionary history of 7SL RNA-derived SINEs in Supraprimates.

Authors: Jan Ole Kriegs; Gennady Churakov; Jerzy Jurka; Jürgen Brosius; Jürgen Schmitz
Journal: Trends Genet Date: 2007-02-20 Impact factor: 11.639

4. A distal enhancer and an ultraconserved exon are derived from a novel retroposon.

Authors: Gill Bejerano; Craig B Lowe; Nadav Ahituv; Bryan King; Adam Siepel; Sofie R Salama; Edward M Rubin; W James Kent; David Haussler
Journal: Nature Date: 2006-04-16 Impact factor: 49.962

5. A simple method for estimating the intensity of purifying selection in protein-coding genes.

Authors: R Ophir; T Itoh; D Graur; T Gojobori
Journal: Mol Biol Evol Date: 1999-01 Impact factor: 16.240

Review 6. Dynamic interactions between transposable elements and their hosts.

Authors: Henry L Levin; John V Moran
Journal: Nat Rev Genet Date: 2011-08-18 Impact factor: 53.242

7. Alternatively Spliced Homologous Exons Have Ancient Origins and Are Highly Expressed at the Protein Level.

Authors: Federico Abascal; Iakes Ezkurdia; Juan Rodriguez-Rivas; Jose Manuel Rodriguez; Angela del Pozo; Jesús Vázquez; Alfonso Valencia; Michael L Tress
Journal: PLoS Comput Biol Date: 2015-06-10 Impact factor: 4.475

8. The RIDL hypothesis: transposable elements as functional domains of long noncoding RNAs.

Authors: Rory Johnson; Roderic Guigó
Journal: RNA Date: 2014-05-21 Impact factor: 4.942

9. 2016 update of the PRIDE database and its related tools.

Authors: Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971

10. The Pfam protein families database in 2019.

Authors: Sara El-Gebali; Jaina Mistry; Alex Bateman; Sean R Eddy; Aurélien Luciani; Simon C Potter; Matloob Qureshi; Lorna J Richardson; Gustavo A Salazar; Alfredo Smart; Erik L L Sonnhammer; Layla Hirsh; Lisanna Paladin; Damiano Piovesan; Silvio C E Tosatto; Robert D Finn
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
