Transposon-derived and satellite-derived repetitive sequences play distinct functional roles in Mammalian intron size expansion.

Dapeng Wang, Yao Su, Xumin Wang, Hongxing Lei, Jun Yu.

Abstract

BACKGROUND: Repetitive sequences (RSs) are redundant, complex at times, and often lineage-specific, representing significant "building" materials for genes and genomes. According to their origins, sequence characteristics, and modes of propagation, repetitive sequences are divided into transposable elements (TEs) and satellite sequences (SSs), each with a hierarchy of related subfamilies and subgroups. The combined changes attributable to repetitive sequences alter gene and genome architectures, for example through the expansion of exonic, intronic, and intergenic sequences; most RSs propagate in a seemingly random fashion and contribute very significantly to the entire mutation spectrum of mammalian genomes.
PRINCIPAL FINDINGS: Our analysis focuses on the evolutionary features of TEs and SSs in the intronic sequences of twelve selected mammalian genomes. We divided them into four groups (primates, large mammals, rodents, and primitive mammals) and used four non-mammalian vertebrate species as the out-group. After classifying intron size variation in an intron-centric way based on RS-dominance (TE-dominant or SS-dominant intron expansions), we observed several distinct profiles in intron length and positioning in different vertebrate lineages, such as retrotransposon-dominance in mammals and DNA transposon-dominance in the lower vertebrates, amphibians and fishes. The RS patterns of mouse and rat genes are the most striking: they are distinct not only from those of other mammals but also from that of guinea pig, the third rodent species analyzed in this study. Looking into the biological functions of relevant genes, we observed a two-dimensional divergence; in particular, genes that possess SS-dominant and/or RS-free introns are enriched in tissue-specific development and transcription regulation in all mammalian lineages. In addition, we found that transposons increase intron size much more strongly than satellites do, and that among the mammals the combined effect of both RSs is greater than the simple arithmetic sum of their individual effects, whereas the opposite is found among the four non-mammalian vertebrates.
CONCLUSIONS: TE- and SS-derived RSs represent major mutational forces shaping the size and composition of vertebrate genes and genomes. Through natural selection they either fine-tune or facilitate changes in size expansion, position variation, and duplication, and thus in functions and evolutionary paths, for better survival and fitness. When analyzed globally, such changes are not only highly diverse but also interpretable in terms of lineages and biological implications.

Keywords:  intron size; mammalian genomes; satellite sequences; transposable elements

Year:  2012        PMID: 22807622      PMCID: PMC3396637          DOI: 10.4137/EBO.S9758

Source DB:  PubMed          Journal:  Evol Bioinform Online        ISSN: 1176-9343            Impact factor:   1.625


Introduction

Repetitive sequence (RS) elements are characterized as multi-copied sequences in two broadly defined classes: satellite sequences (SSs), including both micro-satellites and mini-satellites, and transposable elements (TEs) that are characterized based on sequence identity and structure, biogenesis, insertion site preference, and degree of redundancies.1,2 The RSs are evolutionarily active and show significant influences on the structures of genes and genomes, and are thus highly relevant to biological functions.3,4 It has been reported that TE-free regions are negatively selected for certain regulatory elements throughout vertebrate genomes, although the conservation of the sequence contents is often variable.5,6 Furthermore, TEs have different distributions among exonic, intronic, and intergenic regions.7 Indeed, a small number of TE classes are still active, generating population differentiation,8 and the compositional dynamics of genomic sequences exhibits step-by-step evolutionary changes as a consequence of competitions between host genomes and parasitic sequences.3 In addition, TE transposition often serves as a driving force for the conversion of introns into exons or gaining novel introns as well as alternatively spliced transcripts.9–11 Therefore, new sequence integration and the balance of exons and introns in number, length, and ordinal position of a gene provide basic materials for species evolution.12 Different subfamilies of TEs have seemingly diverse influences on genes and genomes by changing sequence length to variable extents. 
Specifically, because retrotransposons use a “copy-and-paste” mechanism whereas DNA transposons mostly use “cut-and-paste”, the former should be the primary contributor to genome size increase.2 Introns are considered the major “warehouse” of TEs,11,13 and certain families of TEs are observed to correlate with functional genes, such as mammalian interspersed repeats (MIRs) with immune genes.13 Exploiting the relationship between sequence composition and polymorphism, we noticed that minimal introns (introns in a minimal size range) have unique features distinct from those of larger introns, and demonstrated that these smaller introns escape TE-driven insertions and are also largely free from SS-driven intron expansion.14–16 As many vertebrate genomes have now been sequenced, we are able to address more questions about TE- and SS-driven intron expansion in different vertebrate lineages. In particular, we would like to understand how intron expansion relates to gene function among the three subgroups of mammals (primates, large mammals, and rodents) and what roles mutation and natural selection have played in the course of genome evolution.

Results

Intron size increase often involves lineage-specific changes in RS contents in the context of genes

To investigate the relationship between intron size and repeat insertion in a comparable fashion, we divided introns into ten size intervals, since introns generally tend to cluster within certain size ranges (Fig. 1). Based on the relationships among the curves for the three repeat types (retrotransposons, DNA transposons, and satellites), we found that the RSs of the twelve mammals fall into two basic patterns. The first pattern, found in the three rodent species and the two primitive mammals, is SS-rich, with repeat abundance ranking as retrotransposon > satellite > DNA transposon. The second pattern, found in the remaining seven mammals, has a repeat content order of retrotransposon > DNA transposon ≥ satellite (equality holds only for macaque). In addition, we observed an up-convex curvature of the retrotransposon distribution and an up-concave curvature of the DNA transposon and satellite distributions, with the exception that the satellite curves in mouse and rat are near-linear, indicating that SSs play a relatively dominant role in their intron size expansion. As for the differences between the non-mammalian vertebrates and the mammals, we found that in both zebrafish and frog DNA transposons have a higher abundance than in the other two patterns, but a decreasing slope as intron size increases. In anole and chicken, however, this trend gives way to lower abundance and an increasing slope as intron size increases. Retrotransposons are less abundant than satellites in zebrafish, frog, and anole; in chicken they are more abundant than satellites while the mode of slope remains the same; and in all twelve mammals the mode of slope changes to descending.
Figure 1

Percentage of introns with retrotransposons, DNA transposons, and satellites.

Notes: The fractions of introns with repeats are displayed over intron length intervals. The ten intervals of intron lengths are defined as: 1, (50–150); 2, (151–300); 3, (301–600); 4, (601–1000); 5, (1001–1400); 6, (1401–2000); 7, (2001–3000); 8, (3001–5000); 9, (5001–10000); and 10, (10001+).
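The binning described in the Figure 1 notes can be illustrated with a minimal Python sketch. The interval bounds are taken from the notes above; the function names, the input layout (pairs of intron length and a repeat flag), and the decision to skip introns shorter than 50 bp are assumptions made for illustration only, not the authors' pipeline.

```python
from bisect import bisect_left

# Inclusive upper bounds (bp) of intervals 1-9 from the Figure 1 notes;
# interval 10 is open-ended (10001+).
UPPER_BOUNDS = [150, 300, 600, 1000, 1400, 2000, 3000, 5000, 10000]
MIN_LENGTH = 50  # lower bound of interval 1


def intron_interval(length_bp):
    """Return the interval number (1-10), or None for introns shorter than 50 bp."""
    if length_bp < MIN_LENGTH:
        return None  # assumption: such introns fall outside the ten intervals
    return bisect_left(UPPER_BOUNDS, length_bp) + 1


def fraction_with_repeat(introns):
    """introns: iterable of (length_bp, has_repeat) pairs (hypothetical layout).

    Returns {interval: fraction of introns in that interval carrying the repeat}.
    """
    totals, hits = {}, {}
    for length_bp, has_repeat in introns:
        interval = intron_interval(length_bp)
        if interval is None:
            continue
        totals[interval] = totals.get(interval, 0) + 1
        hits[interval] = hits.get(interval, 0) + (1 if has_repeat else 0)
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```

Calling `fraction_with_repeat` once per repeat type (retrotransposon, DNA transposon, satellite) would give one curve per type over the ten intervals, which is the kind of quantity plotted in Figure 1.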

We next sought the major TE families that influence intron size in each vertebrate species or lineage by calculating the fraction of introns containing a particular RS class (Table 1). First, SINEs are the most abundant of all TEs in mammals overall. In the primates, Alu and MIR are most abundant. In the two small rodents, mouse and rat, B1, B4, and B2 are most abundant, whereas in guinea pig, the larger rodent of the group, B1 and B4 are most abundant. Second, among the four most abundant TE families in each species, the four large mammals (cow, panda, horse, and elephant) share MIR, L1, and L2, and each also has species-specific TEs: BovA in cow, tRNA-Lys in panda, and SINE:SINE in horse and elephant. MIR is abundant in all twelve mammals; opossum and platypus rank as the top two, whereas the three rodents rank behind all the other mammals. Third, the three lower vertebrates, chicken, anole, and frog, have CR1, Sauria, and Harbinger as their most abundant TEs, respectively. Zebrafish appears to have the most diverse DNA transposons, all of them quite abundant: DNA:DNA, hAT, hAT-Charlie, TcMar-Tc1, En-Spm, hAT-Ac, and Harbinger. Fourth, concerning satellite sequence classes, we found that SSs are prevalent in all sixteen vertebrates, with mouse, rat, zebrafish, and opossum being the most SS-rich.
Table 1

Percentage of introns containing repeats, classified by repetitive family.

Class/family | Percentages for species 1–16 (left to right; empty cells omitted)
DNA:Chapaev-Chap3 | 13%
DNA:DNA | 35%
DNA:En-Spm | 14%
DNA:Harbinger | 21%, 11%
DNA:MER1_type | 21%, 19%, 16%, 21%, 12%, 11%, 12%
DNA:T2 | 16%
DNA:TcMar-Tc1 | 15%
DNA:TcMar-Tigger | 10%
DNA:hAT | 23%
DNA:hAT-Ac | 11%
DNA:hAT-Charlie | 21%, 19%, 20%, 17%, 12%, 20%
LINE:CR1 | 23%, 19%
LINE:L1 | 27%, 27%, 27%, 27%, 27%, 27%, 26%, 23%, 20%, 20%, 30%
LINE:L2 | 27%, 27%, 25%, 21%, 25%, 29%, 26%, 13%, 36%, 54%
LINE:Penelope | 12%
LINE:RTE | 13%, 13%
LINE:RTE-BovB | 18%
LTR:ERV1 | 18%
LTR:ERVL-MaLR | 14%, 15%
LTR:MaLR | 13%, 13%, 11%, 19%, 17%, 13%
SINE:Alu | 49%, 49%, 51%
SINE:B1 | 41%, 35%, 37%
SINE:B2 | 31%, 30%
SINE:B4 | 32%, 30%, 26%
SINE:BovA | 36%
SINE:ID | 12%, 22%
SINE:MIR | 35%, 36%, 36%, 29%, 33%, 37%, 35%, 15%, 13%, 21%, 62%, 57%
SINE:SINE | 20%, 28%, 45%, 25%
SINE:Sauria | 15%
SINE:V | 12%
SINE:tRNA-Glu | 16%
SINE:tRNA-Lys | 34%
Satellite:Satellite | 16%
Simple repeat:Simple repeat | 27%, 26%, 27%, 20%, 24%, 17%, 18%, 44%, 40%, 26%, 31%, 20%, 13%, 30%, 14%, 34%

Notes: The percentages are fractions of introns with the selected repeats over all introns in the listed species and only those greater than 10% are showed in the table. The species codes are: 1, human; 2, orangutan; 3, macaque; 4, cow; 5, panda; 6, horse; 7, elephant; 8, mouse; 9, rat; 10, guinea pig; 11, opossum; 12, platypus; 13, chicken; 14, anole; 15, frog; and 16, zebrafish.
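As a rough illustration of how the entries of Table 1 are defined (the fraction of all introns in a species that contain a given repeat family, reported only above 10%), a minimal Python sketch follows. The input layout, one set of family names per intron, and the function name are hypothetical; the annotation step that would produce such sets (for example from RepeatMasker output) is not shown.

```python
from collections import Counter


def family_fractions(intron_families, threshold=0.10):
    """intron_families: list with one set of repeat-family names per intron
    (hypothetical layout). Returns {family: fraction of introns containing it},
    keeping only families whose fraction exceeds the threshold."""
    n_introns = len(intron_families)
    if n_introns == 0:
        return {}
    counts = Counter()
    for families in intron_families:
        counts.update(set(families))  # count each family at most once per intron
    return {
        family: count / n_introns
        for family, count in counts.items()
        if count / n_introns > threshold
    }


# Toy input, not real data: ten introns, two of which carry an Alu element.
toy_introns = [{"SINE:Alu", "LINE:L1"}, {"SINE:Alu"}, {"SINE:MIR"}] + [set()] * 7
abundant = family_fractions(toy_introns)
```

Note that an intron with several copies of the same family is counted once, because the table reports the share of introns, not copy numbers.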

We further examined the abundant TE families in each species and made several significant observations (Fig. 2). First, in the primate and large mammal lineages, MIR shows a near-linear distribution in introns of 150 bp–10,000 bp and a rapid accumulation in introns over 10,000 bp. In contrast, there is a drastic slowing-down in the rodents, particularly mouse and rat. Slowing gains of MIR are also seen in the two primitive mammals. Second, the trends of L1 and L2 insertions over intron sizes are also interesting: the two curves intersect in the large mammals and primates, with L1 < L2 before and L1 > L2 after the intersection, but do not intersect in opossum. Third, the distribution of primate-specific Alu repeats has an up-convex curvature, an indication of early saturation and preferred insertion into relatively small introns as compared to LINEs and other SINEs. The rodent-specific B1, in contrast, has a near-linear distribution and is more prevalent than B2 and B4. SINE:ID, unique to mouse and rat, seems more active in rat than in mouse. Fourth, unlike in other mammals, L2 in platypus behaves similarly to its MIR.
Figure 2

Percentage of introns with selected repeat families.

Note: The intron length intervals are defined in the same way as in Figure 1.

RS-centric intron expansion involves both size and position effects

To examine the distinct effects of TEs and SSs on intron size and position, we divided introns into four basic classes: TS (both RSs), T (TEs only), S (SSs only), and N (neither TE nor SS). We focused on three essential intron features: fraction, length, and relative position in a gene. Plotting the percentage of introns in the four classes revealed a rather heterogeneous pattern (Fig. 3). First, the primates, the large mammals, and platypus group together with a transposon-dominant pattern of T > N > TS > S, as does opossum with a pattern of T > TS > N > S. Second, mouse and rat form their own group (TS > N > T > S), as both have more satellite sequences than other mammals. Third, aside from the dominant RS-free class N, guinea pig, frog, and chicken (all N > T > TS > S) have more transposons than satellites in their introns. Fourth, anole and zebrafish have a pattern of N > TS > T > S, which resembles that of mouse and rat when N is disregarded. Taking, for each of the classes TS, T, S, and N, the species in which that class is most abundant, the fractions are 39.6% in mouse, 52.7% in platypus, 12.8% in anole, and 72% in chicken, respectively.
Figure 3

Percentage of the numbers of the four intron classes.

Note: TS, T, S, and N stand for introns with TE and SS, TE only, SS only, and without any of the two basic types, respectively.
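The four-way classification in the Figure 3 note reduces to two yes/no flags per intron. A minimal sketch, assuming each intron has already been annotated with whether it contains any TE and any SS (the flag names and input layout are hypothetical):

```python
from collections import Counter

CLASSES = ("TS", "T", "S", "N")


def intron_class(has_te, has_ss):
    """Map the two presence flags to the intron class used in Figure 3."""
    if has_te and has_ss:
        return "TS"  # both repeat types
    if has_te:
        return "T"   # TE only
    if has_ss:
        return "S"   # SS only
    return "N"       # neither


def class_percentages(introns):
    """introns: iterable of (has_te, has_ss) pairs. Returns {class: percentage}."""
    counts = Counter(intron_class(te, ss) for te, ss in introns)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in CLASSES}
    return {c: 100.0 * counts[c] / total for c in CLASSES}
```

Sorting the resulting percentages for one species gives an ordering of the kind reported in the text, such as T > N > TS > S.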

We also investigated intron size using two simple size intervals: ≤1000 bp and >1000 bp. The great majority of introns in N are small (≤1000 bp), whereas the majority of introns in TS and T are large (>1000 bp). When examining the median length, we found that intron length increases with the complexity of RS insertions: TS > T > S > N (Fig. 4). We also observed that TS introns tend to lie near the 5′-end of genes, whereas N introns tend to lie near the 3′-end, in primates, large mammals, rodents, opossum, and frog; in platypus, chicken, and anole, TS introns likewise tend to lie near the 5′-end (Fig. 5). The most extreme distributions are seen in mouse, where the transposon-rich introns tend to be near the 3′-end, and in zebrafish, where none of the four intron groups shows a significant bias.
Figure 4

Length comparison of the four intron classes.

Note: The asterisks indicate significant differences between neighbouring data groups based on the Wilcoxon rank-sum test with a cut-off of <0.05.
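The per-class length summary behind Figure 4 (median intron length, and the share of introns at or below the 1000 bp boundary used in the text) can be sketched with the Python standard library alone. The input layout and function name are assumptions, and the significance test itself is not reproduced here.

```python
from statistics import median


def length_summary(introns, cutoff_bp=1000):
    """introns: iterable of (intron_class, length_bp) pairs (hypothetical layout).

    Returns {class: (median length, fraction of introns with length <= cutoff_bp)}.
    """
    lengths_by_class = {}
    for cls, length_bp in introns:
        lengths_by_class.setdefault(cls, []).append(length_bp)
    return {
        cls: (median(lengths), sum(1 for x in lengths if x <= cutoff_bp) / len(lengths))
        for cls, lengths in lengths_by_class.items()
    }


# Toy lengths for illustration only, not values from the study.
toy = [("N", 100), ("N", 300), ("N", 1200), ("TS", 5000), ("TS", 900), ("TS", 20000)]
summary = length_summary(toy)
```

Comparing the medians across classes for one species yields an ordering such as TS > T > S > N.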

Figure 5

Position index comparisons for the four intron classes.

Note: The asterisks indicate significant differences between neighbouring data groups based on the Wilcoxon rank-sum test with a cut-off of <0.05.

We further examined both length and position effects for four selected transposon classes: LTR, LINE, SINE, and DNA. In the primates, the large mammals, and opossum, the median lengths of the introns containing them rank as LTR > DNA > LINE > SINE (Fig. 6). Among the three rodents, mouse and rat form a group of their own with a length order of DNA > LINE > LTR > SINE, whereas guinea pig follows the same pattern as the non-rodent mammals: LTR > DNA > LINE > SINE. In addition, platypus introns with LTR or DNA transposons tend to be larger than those containing LINEs or SINEs. In contrast, chicken introns with LINEs tend to be smaller than those with SINEs, DNA transposons, or LTRs. Other independent patterns include LTR > SINE > LINE > DNA in frog and LTR > LINE > SINE > DNA in zebrafish. Anole is an exception: the order is LINE > SINE > DNA, with LTR absent. The most likely reason is the lack of a well-classified LTR consensus in the RepeatMasker default library, owing to the high diversity of transposable elements in anole, especially when compared to mammals.17 In the primates, the large mammals, and guinea pig, the median position index ranks as LTR < DNA < LINE < 0, and the introns with SINEs in cow, panda, horse, human, and guinea pig have a slight bias toward the 5′-end (data not shown). In both mouse and rat, the introns with DNA transposons have the strongest 5′-end bias and those with SINEs have the weakest. In the two primitive mammals, opossum and platypus, LTRs and DNA transposons tend to be inserted into introns near the 5′-end. The chicken introns harbouring LTRs or DNA transposons have a stronger bias toward the 5′-end than those with LINEs. The order of the median intron position index for anole is LINE < SINE < DNA < 0. Frog introns show a preference for proximity to the 5′-end, which is weakest for DNA transposon-containing introns.
In zebrafish, introns with LINE, SINE or LTR have a stronger 5′-end preference, and those with LTR have the least bias.
Figure 6

Length comparisons of the four TE-containing intron classes.

Note: The asterisks indicate significant differences between neighbouring data groups based on the Wilcoxon rank-sum test with a cut-off of <0.05.

Intronic RS-abundance and RS-specificity define characteristic gene functions in different mammalian lineages

We first classified genes in a way similar to that used for introns: (1) TS, genes with both transposons and satellites in their introns; (2) T, genes with only transposons in their introns; (3) S, genes with only satellites in their introns; and (4) N, genes with neither transposons nor satellites in their introns. In general, we observed an order of TS > N > T > S in chicken and anole, but a different order of TS > T > N > S in the remaining vertebrates. Comparing the same RS class across species, the highest fractions for TS, T, S, and N are 83.1% in mouse, 33% in horse, 8.32% in chicken, and 28.4% in chicken, respectively (Fig. 7). Furthermore, we considered the functional categorization of the four gene classes in four mammalian groupings: mammals, primates, large mammals, and rodents. We found diverse development- and transcription-related functions in S and/or N genes, including “embryonic skeletal system development” and “transcription regulator activity” in mammals (Table 2), “negative regulation of neuron differentiation” and “gene expression” in primates (Table 3), “midbrain development” and “regulation of transcription” in large mammals (Table 4), and “inner ear morphogenesis” and “regulation of gene expression” in the rodents (Table 5). There are also lineage-specific and tissue-specific profiles for the expression of these genes. For instance, “hormone activity” of N genes is shared by all the major groups of mammals, and “pheromone binding” of S genes is unique to the rodents. There are also genes with immunological functions identified in the primate S genes (eg, “positive regulation of chronic inflammatory response to antigenic stimulus”) and N genes (eg, “MHC class I receptor activity”), in S genes of the large mammals (eg, “antigen processing and presentation”), and in N genes of the rodents (eg, “inflammatory response”).
In addition, some TS genes are related to fundamental structures and metabolic functions, including “cytoskeleton” and “protein homodimerization activity” in the mammals, “extracellular matrix structural constituent” and “regulation of cell shape” in the primates, “ATP biosynthetic process” in the large mammals, and “acyltransferase activity”, “protein ubiquitination”, and “phosphoinositide binding” in the rodents. There are also rodent TS genes involved in the nervous system and in responses to external stimuli or the environment. As to T genes, functions related to mitochondrial structure are found in both the primates and the large mammals.
Figure 7

Percentage of genes in four classes.

Note: TS, T, S, and N denote genes with TE and SS, TE only, SS only, and with none of the two repeat types, respectively.
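The gene-level classes in the Figure 7 note can be derived by pooling the repeat annotations of all introns in a gene. The sketch below assumes that a gene counts as TS when a TE and an SS occur anywhere among its introns, not necessarily in the same intron; that reading, the input layout, and the treatment of intronless genes (returned as `None`) are assumptions rather than details confirmed by the text.

```python
def gene_class(introns):
    """introns: iterable of (has_te, has_ss) pairs, one pair per intron of a gene.

    Returns "TS", "T", "S", or "N"; returns None for a gene without introns
    (assumption: such genes are left out of the classification).
    """
    introns = list(introns)
    if not introns:
        return None
    any_te = any(has_te for has_te, _ in introns)
    any_ss = any(has_ss for _, has_ss in introns)
    if any_te and any_ss:
        return "TS"
    if any_te:
        return "T"
    if any_ss:
        return "S"
    return "N"
```

Under this reading a gene with one TE-only intron and one SS-only intron is a TS gene even though none of its introns is a TS intron, which is consistent with TS genes being far more common (up to 83.1%) than TS introns (up to 39.6%).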

Table 2

Mammal-specific GO term enrichment of the four gene classes.

Class | GO code | GO name | Species 1–12 (one asterisk per enriched species)
TS | GO:0016324 | Apical plasma membrane | ****
TS | GO:0005516 | Calmodulin binding | ****
TS | GO:0006812 | Cation transport | ****
TS | GO:0005856 | Cytoskeleton | ****
TS | GO:0005829 | Cytosol | ****
TS | GO:0005783 | Endoplasmic reticulum | ****
TS | GO:0005887 | Integral to plasma membrane | ******
TS | GO:0023034 | Intracellular signaling pathway | ****
TS | GO:0005216 | Ion channel activity | ******
TS | GO:0008237 | Metallopeptidase activity | ****
TS | GO:0042803 | Protein homodimerization activity | ****
T | GO:0005576 | Extracellular region | *******
S | GO:0030326 | Embryonic limb morphogenesis | *****
S | GO:0009954 | Proximal/distal pattern formation | ******
S | GO:0030528 | Transcription regulator activity | *********
N | GO:0009952 | Anterior/posterior pattern formation | **********
N | GO:0048706 | Embryonic skeletal system development | *****
N | GO:0048704 | Embryonic skeletal system morphogenesis | ******
N | GO:0005576 | Extracellular region | *********
N | GO:0005179 | Hormone activity | ******
N | GO:0030528 | Transcription regulator activity | *********

Notes: The species codes are the same as those listed in Table 1. The asterisks indicate enrichment of GO terms.

Table 3

Primate-specific GO term enrichment of the four gene classes.

Class | GO code | GO name | Human / Orangutan / Macaque (one asterisk per enriched species)
TS | GO:0005201 | Extracellular matrix structural constituent | *
TS | GO:0031965 | Nuclear membrane | *
TS | GO:0008360 | Regulation of cell shape | *
T | GO:0019882 | Antigen processing and presentation | *
T | GO:0019886 | Antigen processing and presentation of exogenous peptide antigen via MHC class II | *
T | GO:0002504 | Antigen processing and presentation of peptide or polysaccharide antigen via MHC class II | *
T | GO:0004004 | ATP-dependent RNA helicase activity | *
T | GO:0005125 | Cytokine activity | *
T | GO:0022625 | Cytosolic large ribosomal subunit | *
T | GO:0010008 | Endosome membrane | *
T | GO:0004308 | Exo-alpha-sialidase activity | *
T | GO:0031640 | Killing of cells of another organism | *
T | GO:0005765 | Lysosomal membrane | *
T | GO:0042613 | MHC class II protein complex | *
T | GO:0032395 | MHC class II receptor activity | *
T | GO:0005763 | Mitochondrial small ribosomal subunit | *
T | GO:0000398 | Nuclear mRNA splicing, via spliceosome | *
T | GO:0005730 | Nucleolus | *
T | GO:0019887 | Protein kinase regulator activity | *
T | GO:0003723 | RNA binding | *
T | GO:0008380 | RNA splicing | *
T | GO:0019843 | rRNA binding | *
T | GO:0005681 | Spliceosomal complex | *
T | GO:0006414 | Translational elongation | *
T | GO:0017070 | U6 snRNA binding | *
S | GO:0004869 | Cysteine-type endopeptidase inhibitor activity | *
S | GO:0044424 | Intracellular part | *
S | GO:0045665 | Negative regulation of neuron differentiation | *
S | GO:0009887 | Organ morphogenesis | *
S | GO:0002876 | Positive regulation of chronic inflammatory response to antigenic stimulus | *
S | GO:0002925 | Positive regulation of humoral immune response mediated by circulating immunoglobulin | *
S | GO:0010843 | Promoter binding | *
S | GO:0007519 | Skeletal muscle tissue development | *
S | GO:0005164 | Tumor necrosis factor receptor binding | *
N | GO:0002474 | Antigen processing and presentation of peptide antigen via MHC class I | *
N | GO:0007267 | Cell-cell signaling | *
N | GO:0009987 | Cellular process | *
N | GO:0010467 | Gene expression | *
N | GO:0008201 | Heparin binding | *
N | GO:0042309 | Homoiothermy | *
N | GO:0050825 | Ice binding | *
N | GO:0048535 | Lymph node development | *
N | GO:0032393 | MHC class I receptor activity | *
N | GO:0000122 | Negative regulation of transcription from RNA polymerase II promoter | *
N | GO:0048663 | Neuron fate commitment | **
N | GO:0005184 | Neuropeptide hormone activity | *
N | GO:0004522 | Pancreatic ribonuclease activity | *
N | GO:0010552 | Positive regulation of gene-specific transcription from RNA polymerase II promoter | *
N | GO:0045084 | Positive regulation of interleukin-12 biosynthetic process | *
N | GO:0045944 | Positive regulation of transcription from RNA polymerase II promoter | *
N | GO:0050826 | Response to freezing | *
N | GO:0016471 | Vacuolar proton-transporting V-type ATPase complex | *

Note: The asterisks indicate significant enrichment of GO terms.

Table 4

Large-mammal-specific GO term enrichment of the four gene classes.

Class | GO code | GO name | Cow / Panda / Horse / Elephant (one asterisk per enriched species)
TS | GO:0006754 | ATP biosynthetic process | *
TS | GO:0015662 | ATPase activity, coupled to transmembrane movement of ions, phosphorylative mechanism | *
TS | GO:0006821 | Chloride transport | *
TS | GO:0007214 | Gamma-aminobutyric acid signaling pathway | *
TS | GO:0051536 | Iron-sulfur cluster binding | *
TS | GO:0016459 | Myosin complex | *
TS | GO:0004725 | Protein tyrosine phosphatase activity | *
TS | GO:0005097 | Rab GTPase activator activity | **
TS | GO:0032313 | Regulation of Rab GTPase activity | *
TS | GO:0048010 | Vascular endothelial growth factor receptor signaling pathway | *
T | GO:0022900 | Electron transport chain | *
T | GO:0007186 | G-protein coupled receptor protein signaling pathway | *
T | GO:0016021 | Integral to membrane | *
T | GO:0005743 | Mitochondrial inner membrane | **
T | GO:0005747 | Mitochondrial respiratory chain complex I | *
T | GO:0005515 | Protein binding | *
T | GO:0070469 | Respiratory chain | *
S | GO:0019882 | Antigen processing and presentation | *
S | GO:0042742 | Defense response to bacterium | *
S | GO:0030901 | Midbrain development | *
S | GO:0048663 | Neuron fate commitment | *
S | GO:0045449 | Regulation of transcription | *
N | GO:0022627 | Cytosolic small ribosomal subunit | *
N | GO:0016021 | Integral to membrane | *
N | GO:0045449 | Regulation of transcription | *

Note: The asterisks indicate significant enrichment of GO terms.

Table 5

Rodent-specific GO term enrichment of the four gene classes.

Class | GO code | GO name | Mouse / Rat / Guinea pig (one asterisk per enriched species)
TS | GO:0015629 | Actin cytoskeleton | *
TS | GO:0008415 | Acyltransferase activity | *
TS | GO:0045177 | Apical part of cell | *
TS | GO:0006915 | Apoptosis | *
TS | GO:0030424 | Axon | **
TS | GO:0008013 | Beta-catenin binding | *
TS | GO:0005975 | Carbohydrate metabolic process | *
TS | GO:0007049 | Cell cycle | **
TS | GO:0051301 | Cell division | *
TS | GO:0042995 | Cell projection | *
TS | GO:0009986 | Cell surface | **
TS | GO:0016568 | Chromatin modification | *
TS | GO:0000777 | Condensed chromosome kinetochore | *
TS | GO:0016023 | Cytoplasmic membrane-bounded vesicle | *
TS | GO:0031410 | Cytoplasmic vesicle | **
TS | GO:0030425 | Dendrite | *
TS | GO:0006281 | DNA repair | **
TS | GO:0009055 | Electron carrier activity | *
TS | GO:0005768 | Endosome | **
TS | GO:0009897 | External side of plasma membrane | *
TS | GO:0031012 | Extracellular matrix | *
TS | GO:0005925 | Focal adhesion | *
TS | GO:0005525 | GTP binding | *
TS | GO:0005096 | GTPase activator activity | **
TS | GO:0004386 | Helicase activity | *
TS | GO:0042802 | Identical protein binding | **
TS | GO:0030027 | Lamellipodium | *
TS | GO:0016042 | Lipid catabolic process | *
TS | GO:0042470 | Melanosome | *
TS | GO:0008168 | Methyltransferase activity | *
TS | GO:0005874 | Microtubule | *
TS | GO:0008017 | Microtubule binding | *
TS | GO:0005739 | Mitochondrion | **
TS | GO:0007067 | Mitosis | *
TS | GO:0006397 | mRNA processing | *
TS | GO:0043066 | Negative regulation of apoptosis | *
TS | GO:0043025 | Neuronal cell body | *
TS | GO:0005634 | Nucleus | **
TS | GO:0030165 | PDZ domain binding | *
TS | GO:0048471 | Perinuclear region of cytoplasm | **
TS | GO:0005777 | Peroxisome | **
TS | GO:0035091 | Phosphoinositide binding | *
TS | GO:0043065 | Positive regulation of apoptosis | *
TS | GO:0043123 | Positive regulation of I-kappaB kinase/NF-kappaB cascade | *
TS | GO:0014069 | Postsynaptic density | *
TS | GO:0006813 | Potassium ion transport | *
TS | GO:0042734 | Presynaptic membrane | *
TS | GO:0043234 | Protein complex | **
TS | GO:0032403 | Protein complex binding | **
TS | GO:0019904 | Protein domain specific binding | *
TS | GO:0046982 | Protein heterodimerization activity | *
TS | GO:0019901 | Protein kinase binding | *
TS | GO:0008104 | Protein localization | *
TS | GO:0008565 | Protein transporter activity | *
TS | GO:0016567 | Protein ubiquitination | **
TS | GO:0045449 | Regulation of transcription | *
TS | GO:0006974 | Response to DNA damage stimulus | *
TS | GO:0042493 | Response to drug | *
TS | GO:0001666 | Response to hypoxia | *
TS | GO:0007584 | Response to nutrient | *
TS | GO:0014070 | Response to organic cyclic substance | *
TS | GO:0004871 | Signal transducer activity | **
TS | GO:0005625 | Soluble fraction | **
TS | GO:0015293 | Symporter activity | *
TS | GO:0019717 | Synaptosome | **
TS | GO:0005802 | Trans-Golgi network | *
TS | GO:0006511 | Ubiquitin-dependent protein catabolic process | *
TS | GO:0004842 | Ubiquitin-protein ligase activity | *
S | GO:0009653 | Anatomical structure morphogenesis | *
S | GO:0001658 | Branching involved in ureteric bud morphogenesis | *
S | GO:0045165 | Cell fate commitment | **
S | GO:0042733 | Embryonic digit morphogenesis | *
S | GO:0060441 | Epithelial tube branching involved in lung morphogenesis | *
S | GO:0042472 | Inner ear morphogenesis | *
S | GO:0003676 | Nucleic acid binding | *
S | GO:0048709 | Oligodendrocyte differentiation | *
S | GO:0001569 | Patterning of blood vessels | *
S | GO:0005550 | Pheromone binding | *
S | GO:0008284 | Positive regulation of cell proliferation | *
S | GO:0010552 | Positive regulation of gene-specific transcription from RNA polymerase II promoter | *
S | GO:0045666 | Positive regulation of neuron differentiation | **
S | GO:0010468 | Regulation of gene expression | **
S | GO:0048536 | Spleen development | *
S | GO:0030878 | Thyroid gland development | *
S | GO:0016564 | Transcription repressor activity | *
N | GO:0006935 | Chemotaxis | *
N | GO:0001533 | Cornified envelope | *
N | GO:0006952 | Defense response | *
N | GO:0042742 | Defense response to bacterium | *
N | GO:0005615 | Extracellular space | **
N | GO:0006954 | Inflammatory response | **
N | GO:0007389 | Pattern specification process | *
N | GO:0004252 | Serine-type endopeptidase activity | *

Note: The asterisks stand for significant enrichment of GO terms.

The insertion profiles of TEs and SSs are diverse among the vertebrate genomes

We evaluated the expansion strength of TEs and SSs in introns based on the ratio of the repeat length to the corresponding RS-free length (Table 6). We found that zebrafish has the greatest expansion strength in TS, T, and S genes, whereas chicken has the weakest in TS and S genes and anole has the weakest in T genes. Among the mammals, opossum has the greatest strength in TS and S genes, whereas platypus has the greatest in T genes. A striking observation is that the strength of TS genes is greater than the sum of those of T and S genes in the mammals, whereas the opposite holds in the non-mammalian vertebrates (Table 6).
Table 6

Comparisons of incremental ratio of TEs and SSs.

Species | TS | T | S | T + S
Human | 0.833 | 0.612 | 0.080 | 0.693
Orangutan | 0.736 | 0.574 | 0.067 | 0.641
Macaque | 0.700 | 0.569 | 0.063 | 0.632
Cow | 0.661 | 0.435 | 0.065 | 0.500
Panda | 0.512 | 0.384 | 0.050 | 0.434
Horse | 0.545 | 0.426 | 0.059 | 0.486
Elephant | 0.660 | 0.484 | 0.094 | 0.578
Mouse | 0.500 | 0.346 | 0.071 | 0.418
Rat | 0.458 | 0.329 | 0.078 | 0.406
Guinea pig | 0.302 | 0.244 | 0.056 | 0.301
Opossum | 0.867 | 0.556 | 0.103 | 0.660
Platypus | 0.749 | 0.623 | 0.091 | 0.714
Chicken | 0.090 | 0.183 | 0.023 | 0.205
Anole | 0.116 | 0.126 | 0.035 | 0.161
Frog | 0.330 | 0.339 | 0.081 | 0.420
Zebrafish | 1.202 | 1.207 | 0.263 | 1.471

Note: Incremental ratio is defined as X/(1 − X), where X equals to the median length percentage of repeats in introns.
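The incremental ratio in the note, and the comparison of TS against T + S discussed in the text, can be written out as a short Python sketch. The formula X/(1 − X) and the four rows of values come from Table 6; the function names and the inverse helper are illustrative additions.

```python
def incremental_ratio(x):
    """Incremental ratio X / (1 - X), where X is the median fraction of
    intron length occupied by repeats (0 <= X < 1)."""
    if not 0 <= x < 1:
        raise ValueError("X must satisfy 0 <= X < 1")
    return x / (1 - x)


def repeat_fraction(ratio):
    """Inverse of incremental_ratio: recover X from a reported ratio."""
    return ratio / (1 + ratio)


def exceeds_sum(ts, t, s):
    """True when the TS ratio is greater than the sum of the T and S ratios."""
    return ts > t + s


# (TS, T, S) incremental ratios copied from four rows of Table 6.
TABLE6_ROWS = {
    "Human": (0.833, 0.612, 0.080),
    "Mouse": (0.500, 0.346, 0.071),
    "Chicken": (0.090, 0.183, 0.023),
    "Zebrafish": (1.202, 1.207, 0.263),
}

synergy = {species: exceeds_sum(*row) for species, row in TABLE6_ROWS.items()}
```

For these rows the comparison is true for human and mouse and false for chicken and zebrafish, matching the mammal versus non-mammal contrast described in the text. A ratio above 1, as in zebrafish, corresponds to repeats occupying more than half of the median intron length.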

When integrating the content of intronic repeats in individual genes based on orthology (a unique homologous gene in each species), we discovered different topological structures (Fig. 8). The features shared between the two trees are the human-orangutan and the mouse-rat clades, the distant relationship to chicken, and the closer proximity of zebrafish to placental mammals as compared to the other three non-mammalian vertebrates. With regard to TEs, the primates and the large mammals are remarkably distinct from the remaining species and are closer to the mouse-rat clade than to guinea pig. With regard to SSs, opossum is clustered with the primates as well as the rodents and the four large mammals, rather than with the other primitive mammal, platypus.
Figure 8

Topological trees constructed based on TE (A) and SS (B).

Note: A detailed procedure is described in Methods.

Discussion

Beyond whole-genome duplication, the complexity of vertebrate genomes builds upon many unique sequence and functional features, one of which is genome expansion compounded by the expansion of gene and intron sizes. There are three essential ways to increase genome size.18,19 The first is to increase the number of genes through genome and gene duplications. The second, and the most important, is gene size expansion through increases in intron size and number.20 The third is the expansion of intergenic sequences and auxiliary chromosomal structures. Considering the diversity of RSs and of their insertion/expansion mechanisms, we classified intron expansion into two categories, TE-driven and SS-driven,2,21 and speculated that they may play distinct roles in the intron size expansion of mammalian genomes. First, the profiles of TE insertions can be classified at the levels of species and lineages, such as primates, large mammals, and rodents, and we did observe similar modes within lineages and distinctions among lineages. However, exceptions do exist, as the rodents are not always cohesive: guinea pig behaves differently from mouse and rat with respect to many RS counts. Second, we emphasize the effect of RS expansion events rather than copy number counts, with the aim of obtaining a clear and direct picture that correlates intron size variation with RS insertion. In general, both TEs and SSs are reported to be non-randomly distributed in eukaryotic genomes.1,21–23 On the one hand, there is strong negative selection to protect sequences essential for the transmission of basic genetic information, such as protein-coding sequences or exons, over relatively short evolutionary time scales.
On the other hand, RSs are indispensable as the prime power and raw materials for genomes to evolve for better fitness, to generate complexity and diversity, and to promote speciation and population dynamics.2,24 RSs therefore strongly influence gene expression and regulation indirectly through variations in intron length and content.10,13 One mechanism shared by all the studied vertebrates is that both TE and SS insertions increase intron size, but the effect of the former is much greater than that of the latter. In fact, after eliminating RS insertions from all introns, we observed that the tendency of length increase across the four intron classes remains the same. In other words, in all four intron classes, large introns remain large and small introns remain small even without RS insertions. However, the introns of the anole and chicken genomes are exceptional: when RS insertions are removed from the intron sequences, the size definitions may shift, or large and small introns may no longer be clearly distinguishable (data not shown). We observed a non-random and unbalanced mechanism of intron size evolution: once introns are enlarged beyond a certain size threshold, larger introns tend to grow faster than smaller ones. Furthermore, we investigated the relationship between, and the mechanisms of, TE- and SS-driven intron expansions. Satellites can increase intron size at an early or primitive stage, as they change intron size on a relatively limited scale, whereas transposons are capable of increasing intron size on a larger (such as LINEs) and more massive (such as LTRs in multiple insertions) scale and thus have a stronger influence on intron size expansion.
Most importantly, we observed a synergy between TE-driven and SS-driven insertions, providing a greater degree of intron expansion. To understand the possible roles of RS families in gene and intron size expansions, we paid special attention to intron length and positioning within a transcript and to functional enrichment in the context of the TE vs. SS dichotomy among species and lineages. For instance, we found that TS-containing introns have a 5′-end bias in all vertebrates but zebrafish and that the RS-free (or the N class) introns have a 3′-end bias in all mammals but platypus. We have recently identified distinct functional profiles of genes evolving at different rates in primates, large mammals, and rodents,25 and in this study we used a similar classification scheme to investigate protein-coding genes with RS-driven intron expansion. For instance, DNA transposon-containing introns tend to be smaller in fraction, larger in size, and biased toward 5′-end enrichment in mouse and rat. We also pointed out that genes with TE-free introns are enriched in both development and transcription and that genes with SS-containing introns are mostly immunity-related in primates and large mammals.13 We also extracted functional categories related to the nervous system for mammalian genes possessing SS-containing introns, since microsatellite alterations may lead to neurological disorders.26 Previous studies proposed that microsatellites are unevenly positioned within different regions of protein-coding genes, such as UTRs, exons, and introns, and that they may play functional roles in regulating gene expression, splicing, mRNA export, and responses to the external environment.27 Most SSs that we studied are microsatellites, and we demonstrated that there are functional biases in SS insertions, with promoter-related regulatory genes as one of the major categories.
In addition, SSs preferentially reside in heterochromatin at or near centromeres and telomeres, where transcriptional activity is rarely detected. However, when genes are detected there, they are usually development-related and involved in epigenetic regulation and DNA methylation; the latter two lead to alterations of chromatin state and may in turn regulate the expression of SS-containing noncoding RNAs.28,29 We conclude that combined or independent effects of species/lineage-specific TEs and SSs may play an important role in the functional differentiation of intron-containing protein-coding genes. At present, the sequence-similarity-based RS library is mostly composed of known TEs, especially the collection of mammal-specific sequences. As an increasing number of high-quality non-mammalian vertebrate genomes are sequenced, and with the help of de novo identification technologies,30,31 more novel species-specific TEs should be discovered, adding stronger validation power to the current study.
It is vital for us to track down the precise timing of intron evolution and expansion in the context of lineages, especially with respect to the number of introns per gene and the length variation of introns.32 Spliceosomal introns constitute the great majority of introns in vertebrate genomes, although two opposing hypotheses on the origin of introns, “intron-early” and “intron-late”, argue that introns of this particular type are either ancient or latecomers.33 Further taxonomy-based analyses of genomes suggested that intron loss, with position- and phase-specificity, is the dominant phenomenon in modern mammals and that a large number of intron gains perhaps occurred at an early stage of animal evolution,34–36 and a recent study found several cases of intron gain in the ancestor of placental mammals in transposon-derived, domestication-related genes.37 Moreover, gene length is correlated with gene expression level and breadth and is affected by RS insertions, such as L1 and MIR.38 Housekeeping genes are often highly expressed and harbor smaller introns to reduce the processing cost of transcription, including time and energy. In contrast, tissue-specific genes are often lowly expressed and harbor larger introns, requiring more effective and complex regulatory elements.38,39 Our data, based on an RS-centric stratification approach, showed that intron expansion is strongly influenced not only by RS type but also by insertion timing, and the latter is manifested as species-specific propagation of distinct RSs.
A comparative study of five teleost genomes indicated that zebrafish experienced an ancient large-scale RS-induced intron expansion and that the RS profiles of this expansion are rather distinct from those of the other four fishes, which have relatively lower insertion frequencies.40 Based on these observations, we suspect that the RS content diversity that we observed among vertebrate introns or genes may not be straightforward to characterize with regard to precise timing, as the samples we used are still limited in scope. Insertions of both TEs and SSs should avoid damaging key regulatory sequences, such as the splice sites, the branch point, the polypyrimidine tract, and other uncharacterized functional elements, and may co-evolve with neighbouring sequences;41 in particular, TEs (eg, SINEs) facilitate the splicing of larger introns via the formation of secondary structure in mammals.42 TE- and SS-derived RSs are thus constrained to cluster or locate in intronic regions and seldom occur in core regulatory regions that are constantly under strong positive or negative selection.

Methods

We obtained RepeatMasker repetitive element and Ensembl gene structure annotation data from the UCSC Genome Database FTP server (ftp://hgdownload.cse.ucsc.edu/) for human, orangutan, macaque, cow, panda, horse, elephant, mouse, rat, guinea pig, opossum, platypus, chicken, anole, frog, and zebrafish (Table 7). We excluded from our analysis genes that do not encode proteins or that have very short introns (<50 bp). For each gene, we kept only the longest primary transcript and/or the one with the largest number of exons. Where regions annotated to different repeat families or sub-families overlap, we counted a sequence only once, unless otherwise indicated. We also collected gene-transcript-protein relationships, protein sequences, and Gene Ontology (GO) annotations from the Ensembl web or FTP sites (http://www.ensembl.org, ftp://ftp.ensembl.org), used Fisher's exact test to find enriched GO terms, and applied the Bonferroni correction with a cut-off of 0.1 to reduce the false positive rate. To compare the major phylogenetic groups in mammals, we regarded the four non-mammalian vertebrates as the out-group and considered four divisions (some are obviously lineages and others are not): mammal-specific (occurring only in the 12 mammals), primate-specific (occurring only in human, orangutan and/or macaque), non-primate large-mammal-specific (occurring only in cow, panda, horse and/or elephant) and rodent-specific (occurring only in mouse, rat and/or guinea pig). We defined the normalized position index as (2*IO-IN-1)/IN, where IO stands for the order of an intron in a gene along the direction of transcription and IN is the total number of introns in the gene. In general, we classified repeat elements into two types: transposons or TEs (LTR, LINE, SINE, and DNA transposon, of which the former three classes are retrotransposons) and satellites or SSs (satellite and microsatellite repeats).
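The normalized position index defined above can be illustrated with a minimal Python sketch. It assumes that IO is counted from 1 at the 5′ end of the transcript, which the text does not state explicitly; the function name and the five-intron example are hypothetical and are not taken from the study's own scripts.

```python
def normalized_position_index(intron_order, intron_count):
    """Return (2*IO - IN - 1) / IN.

    intron_order (IO): position of the intron along the transcription
        direction, assumed here to start at 1 for the 5'-most intron.
    intron_count (IN): total number of introns in the gene.
    """
    if intron_count < 1 or not 1 <= intron_order <= intron_count:
        raise ValueError("intron_order must lie between 1 and intron_count")
    return (2 * intron_order - intron_count - 1) / intron_count


# Hypothetical gene with five introns: values run symmetrically around 0,
# negative toward the 5' end and positive toward the 3' end.
indices = [normalized_position_index(i, 5) for i in range(1, 6)]
print(indices)  # [-0.8, -0.4, 0.0, 0.4, 0.8]
```

Under this assumption the index is bounded by ±(IN-1)/IN, so introns from genes with different intron numbers can be placed on a common 5′-to-3′ scale.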
We prepared orthologous groups using the popular MCL algorithm (http://micans.org/mcl/) with the inflation parameter set to 2 to cluster gene families after a protein-based all-against-all BLAST with a cut-off of 1e-5.43 We then selected, for phylogenetic analyses, only the groups that contain 16 genes, each of which can be assigned to a species. Finally, we used the fractions of the number and length of introns per gene to evaluate the contents of transposons and satellites for 357 orthologous genes, which form a high-dimensional vector for each species. Furthermore, we used the modified cosine of the included angle between vectors to measure the distance between species vectors,44 and adopted an approach similar to classical UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering.45 In brief, we began with the twelve initial species, combined the two nearest neighbor species into one cluster, took the center of the two points in the space as the vector of the new node, and then repeated the process until all nodes were merged into one cluster. We used the TreeView program to visualize the resulting tree-like structure.46
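The clustering step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the "modified" cosine measure is not specified in the text, so plain cosine distance (1 minus the cosine of the included angle) is substituted here, and the function names and toy vectors are hypothetical. As in the description, the new node is the midpoint of the two merged vectors, which differs from classical UPGMA, where distances between clusters are averaged.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cos(angle between u and v); a stand-in for the paper's modified cosine."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def cluster_species(vectors):
    """Repeatedly merge the two nearest nodes; return the tree as nested tuples."""
    nodes = {name: list(vec) for name, vec in vectors.items()}
    while len(nodes) > 1:
        # Find the pair of current nodes with the smallest distance.
        a, b = min(
            combinations(nodes, 2),
            key=lambda pair: cosine_distance(nodes[pair[0]], nodes[pair[1]]),
        )
        va, vb = nodes.pop(a), nodes.pop(b)
        # The new node takes the center (midpoint) of the two merged vectors.
        nodes[(a, b)] = [(x + y) / 2 for x, y in zip(va, vb)]
    return next(iter(nodes))


# Toy three-dimensional vectors (invented values, not data from the study).
toy_vectors = {
    "human": [0.9, 0.1, 0.2],
    "orangutan": [0.8, 0.1, 0.3],
    "mouse": [0.2, 0.9, 0.1],
    "rat": [0.3, 0.8, 0.2],
}
tree = cluster_species(toy_vectors)
print(tree)
```

In the study each species vector would have one or more entries per orthologous gene rather than three; the nested tuples could then be written out in Newick format for a tree viewer.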
Table 7

Species names and the numbers of introns used in this study.

| Short name | Full name | Version | Number of introns |
|---|---|---|---|
| Human | Homo sapiens | hg19 | 191,918 |
| Orangutan | Pongo pygmaeus abelii | ponAbe2 | 108,083 |
| Macaque | Macaca mulatta | rheMac2 | 135,376 |
| Cow | Bos taurus | bosTau4 | 155,350 |
| Panda | Ailuropoda melanoleuca | ailMel1 | 136,011 |
| Horse | Equus caballus | equCab2 | 128,897 |
| Elephant | Loxodonta africana | loxAfr3 | 127,667 |
| Mouse | Mus musculus | mm9 | 183,175 |
| Rat | Rattus norvegicus | rn4 | 154,905 |
| Guinea pig | Cavia porcellus | cavPor3 | 119,495 |
| Opossum | Monodelphis domestica | monDom5 | 137,533 |
| Platypus | Ornithorhynchus anatinus | ornAna1 | 101,406 |
| Chicken | Gallus gallus | galGal3 | 128,491 |
| Anole | Anolis carolinensis | anoCar2 | 122,041 |
| Frog | Xenopus tropicalis | xenTro3 | 136,091 |
| Zebrafish | Danio rerio | danRer7 | 207,279 |
References (45 in total; first 10 shown)

Review 1.  Microsatellites: genomic distribution, putative functions and mutational mechanisms: a review.

Authors:  You-Chun Li; Abraham B Korol; Tzion Fahima; Avigdor Beiles; Eviatar Nevo
Journal:  Mol Ecol       Date:  2002-12       Impact factor: 6.185

Review 2.  The rise and falls of introns.

Authors:  R Belshaw; D Bensasson
Journal:  Heredity (Edinb)       Date:  2006-03       Impact factor: 3.821

3.  Transposon-free regions in mammalian genomes.

Authors:  Cas Simons; Michael Pheasant; Igor V Makunin; John S Mattick
Journal:  Genome Res       Date:  2005-12-19       Impact factor: 9.043

Review 4.  The struggle for life of the genome's selfish architects.

Authors:  Aurélie Hua-Van; Arnaud Le Rouzic; Thibaud S Boutin; Jonathan Filée; Pierre Capy
Journal:  Biol Direct       Date:  2011-03-17       Impact factor: 4.540

5.  Conservation of human microsatellites across 450 million years of evolution.

Authors:  Emmanuel Buschiazzo; Neil J Gemmell
Journal:  Genome Biol Evol       Date:  2010-02-08       Impact factor: 3.416

6.  A novel role for minimal introns: routing mRNAs to the cytosol.

Authors:  Jiang Zhu; Fuhong He; Dapeng Wang; Kan Liu; Dawei Huang; Jingfa Xiao; Jiayan Wu; Songnian Hu; Jun Yu
Journal:  PLoS One       Date:  2010-04-12       Impact factor: 3.240

7.  The role of transposable elements in the evolution of non-mammalian vertebrates and invertebrates.

Authors:  Noa Sela; Eddo Kim; Gil Ast
Journal:  Genome Biol       Date:  2010-06-02       Impact factor: 13.583

8.  Characteristics of transposable element exonization within human and mouse.

Authors:  Noa Sela; Britta Mersch; Agnes Hotz-Wagenblatt; Gil Ast
Journal:  PLoS One       Date:  2010-06-01       Impact factor: 3.240

9.  Nonsynonymous substitution rate (Ka) is a relatively consistent parameter for defining fast-evolving and slow-evolving protein-coding genes.

Authors:  Dapeng Wang; Fei Liu; Lei Wang; Shi Huang; Jun Yu
Journal:  Biol Direct       Date:  2011-02-22       Impact factor: 4.540

10.  Extensive intron gain in the ancestor of placental mammals.

Authors:  Dušan Kordiš
Journal:  Biol Direct       Date:  2011-11-23       Impact factor: 4.540
