
Retroviruses drive the rapid evolution of mammalian APOBEC3 genes.

Jumpei Ito1, Robert J Gifford2, Kei Sato3,4.   

Abstract

APOBEC3 (A3) genes are members of the AID/APOBEC gene family that are found exclusively in mammals. A3 genes encode antiviral proteins that restrict the replication of retroviruses by inducing G-to-A mutations in their genomes and have undergone extensive amplification and diversification during mammalian evolution. Endogenous retroviruses (ERVs) are sequences derived from ancient retroviruses that are widespread in mammalian genomes. In this study we characterize the A3 repertoire and use the ERV fossil record to explore the long-term history of coevolutionary interaction between A3s and retroviruses. We examine the genomes of 160 mammalian species and identify 1,420 AID/APOBEC-related genes, including representatives of previously uncharacterized lineages. We show that A3 genes have been amplified in mammals and that amplification is positively correlated with the extent of germline colonization by ERVs. Moreover, we demonstrate that the signatures of A3-mediated mutation can be detected in ERVs found throughout mammalian genomes and show that in mammalian species with expanded A3 repertoires, ERVs are significantly enriched for G-to-A mutations. Finally, we show that A3 amplification occurred concurrently with prominent ERV invasions in primates. Our findings establish that conflict with retroviruses is a major driving force for the rapid evolution of mammalian A3 genes.
Copyright © 2020 the Author(s). Published by PNAS.

Keywords:  APOBEC3; endogenous retrovirus; evolutionary arms race; gene amplification; mammal

Year:  2019        PMID: 31843890      PMCID: PMC6955324          DOI: 10.1073/pnas.1914183116

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


Activation-induced cytidine deaminase/apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (AID/APOBEC) superfamily proteins are cellular cytosine deaminases that catalyze cytosine-to-uracil (C-to-U) mutations. AID/APOBEC family proteins contain a conserved zinc-dependent catalytic domain (Z domain) with the HxE/PCxxC motif and are closely associated with important phenomena found in vertebrates such as immunity, malignancy, metabolism, and infectious diseases (reviewed in refs. 1 and 2). For instance, AID induces somatic hypermutation in B cells and promotes antibody diversification (2), and APOBEC1 (A1) regulates lipid metabolism by enzymatically editing the mRNA of apolipoprotein B gene (3). The physiological roles of APOBEC2 (A2) and APOBEC4 (A4) remain unknown, but APOBEC3 (A3) genes are known to encode antiviral factors that restrict the replication of retroviruses (4) and other viruses (5–7). While most AID/APOBEC family genes are conserved in vertebrates, A3 genes are specific to placental mammals (1). Furthermore, whereas AID, A1, A2, and A4 genes are singly encoded in each vertebrate including mammals, dramatic expansion of the A3 repertoire occurred in many mammalian lineages, including primates (8). A3 genes are grouped into 3 classes (A3Z1, A3Z2, and A3Z3) on the basis of their conserved Z domain sequences (4, 8, 9). For example, human A3 genes are composed of 7 paralogs (A3A, A3B, A3C, A3D, A3F, A3G, and A3H). Of these, A3A, A3C, and A3H (which in other mammals are referred to as A3Z1, A3Z2, and A3Z3, respectively) contain a single Z domain, while the other 4 genes harbor double Z domains: A3Z2-A3Z1 for A3B and A3G and A3Z2-A3Z2 for A3D and A3F (8, 9). The conflict between human A3G protein and HIV type 1 (HIV-1) has been studied particularly intensively. 
Human A3G proteins are incorporated into HIV-1 particles and enzymatically induce C-to-U mutations in viral cDNA, causing guanine-to-adenine (G-to-A) mutations in the viral genome (10, 11). A3G-mediated mutations lead to the accumulation of lethal mutations and ultimately abolish viral replication. On the other hand, an HIV-1–encoded protein, viral infectivity factor (Vif), counteracts this antiviral action by degrading A3G in a ubiquitin-proteasome–dependent manner (4). Such conflicts between A3 proteins and modern viruses (particularly retroviruses) have been reported in a broad range of mammalian species and viruses infecting them (reviewed in ref. 9), and consistent with this, A3 genes contain strong signatures of diversifying selection (12–14). Endogenous retroviruses (ERVs) are retrotransposon lineages that are thought to have originated from ancient exogenous retroviruses via infection of germline cells (15, 16). ERVs occupy a substantial fraction of mammalian genomes, demonstrating extensive germline invasion by retroviruses. To combat ERVs and other intragenomic parasites, mammals have developed defense systems such as Krüppel-associated box domain-containing (KRAB) zinc finger proteins (17) and PIWI-interacting RNAs (18). A3 proteins have been shown to suppress the replication of reconstructed ERVs in cell cultures (15, 19) and in a transgenic mouse model (20). Furthermore, previous studies identified the signature of A3-mediated G-to-A mutations in ERVs, indicating that ancient retroviruses experienced attacks by A3 proteins (15, 16, 19, 21). In this study, we examine the history of evolutionary interaction between ERVs and A3 genes via genomic analysis of 160 mammalian species.

Results

Identification and Classification of Mammalian AID/APOBEC Family Genes.

We screened whole genome sequence (WGS) data of 160 mammalian species in silico and extracted 1,420 sequences showing homology to the conserved Z domains of AID/APOBEC family genes (8) (Datasets S1–S3). Phylogenetic reconstructions revealed that these Z domain loci group into 9 clades, 7 of which represent the canonical AID/APOBEC lineages (AID, A1, A2, A3Z1, A3Z2, A3Z3, and A4) (Fig. 1). We also identified additional, previously uncharacterized lineages, designated UA1 and UA2 (Fig. 1). UA1 genes were only found in basal eutherian mammal groups: afrotherians (elephants, tenrecs, and sea cows) and xenarthrans (armadillos). UA2 genes were only found in marsupials (infraclass Marsupialia) (Fig. 1). These phylogenetic relationships were supported by multiple methods (Fig. 1). In addition, HxE and PCxxC motifs corresponding to the canonical catalytic domain of AID/APOBEC proteins were found in UA1 and UA2 gene sequences. The UA1 and UA2 genes contain signatures of purifying selection, indicating they are protein-coding members of the AID/APOBEC family. Indeed, the UA2 gene in opossum (Monodelphis domestica) was annotated as APOBEC5 in a previous study (22).
Fig. 1.

Distribution and diversity of AID/APOBEC Z domains in mammalian genomes. (A) A phylogenetic tree of AID/APOBEC Z domains identified via in silico screening of 160 mammalian genomes. The tree shown here was based on an alignment of nucleic acid sequences and was reconstructed using the NJ method (63). Scale bar indicates the genetic distance. (B) Number of AID/APOBEC Z domains. Those labeled “intact” contain no premature stop codons, while the remainder are labeled as “pseudogenized.” Z domain sequences that contained unresolved regions were labeled “not determined.” (C) Number of the intact AID/APOBEC Z domains identified in each mammal species. The species tree shown here was derived from the TimeTree database (73).

As summarized in Fig. 1, we detected 157 AID, 166 A1, 157 A2, 266 A3Z1, 362 A3Z2, 146 A3Z3, 153 A4, 9 UA1, and 4 UA2 genes in the genomes of 160 mammalian species. Interestingly, A3Z1 and A3Z2 genes were highly amplified, while the other family genes were not (Fig. 1). We also found that some sequences, particularly those of A3 genes, were pseudogenized (Fig. 1). The number of A3 Z domains differed among species. In particular, A3Z1 and A3Z2 genes in Perissodactyla, Chiroptera, Primates, and Afrotheria were highly amplified (Fig. 1). Consistent with previous reports (12, 23, 24), canonical A3 genes were not detected in marsupials or monotremes (order Monotremata). Furthermore, A3Z1 was commonly absent in Rodentia, while A3Z3 was absent in Strepsirrhini and Microchiroptera. Amplification of A3Z3 genes was not detected in any mammalian groups except for Carnivora (carnivores), in which duplicated A3Z3 genes were almost entirely pseudogenized.
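The "intact", "pseudogenized", and "not determined" labels used in Fig. 1 can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes the Z domain sequence is supplied in frame, that stop codons follow the standard genetic code, and that unresolved bases are written as "N"; the function name `classify_z_domain` is hypothetical. Frameshifts are not handled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def classify_z_domain(seq: str) -> str:
    """Label an in-frame Z domain nucleotide sequence.

    Hypothetical helper: assumes `seq` starts at codon position 1 and
    that unresolved bases are written as 'N'.
    """
    seq = seq.upper()
    # Sequences with unresolved regions cannot be judged either way.
    if "N" in seq:
        return "not determined"
    # Split into complete codons, ignoring a trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # The conserved region is internal to the gene, so any stop codon
    # found inside it is treated as premature.
    if any(codon in STOP_CODONS for codon in codons):
        return "pseudogenized"
    return "intact"


if __name__ == "__main__":
    for example in ("ATGCACGCGGAA", "ATGTAAGCG", "ATGNNNGCG"):
        print(example, classify_z_domain(example))
```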

Evolution of Mammalian A3 Genes Under Strong Selection Pressures.

We used comparative genomic approaches to investigate the evolutionary history of mammalian A3 genes. As shown in Fig. 2, the positional Shannon entropy scores (in which higher entropy means lower sequence conservation) in A3Z1, A3Z2, and A3Z3 genes tended to be much higher than those found in other AID/APOBEC family genes, indicating strong diversifying selection. We detected codon sites evolving under diversifying selection by calculating dN/dS ratios using the branch-site model (25). Although the catalytic domains, which are composed of HxE and PCxxC motifs (1, 2, 4), were highly conserved among the 7 AID/APOBEC family proteins, we detected the signature of diversifying selection at numerous sites (Fig. 2). Comparisons to human A3A (A3Z1 ortholog in primates) (26), A3C (A3Z2 ortholog in primates) (27), and A3H (A3Z3 ortholog in primates) (28) revealed that these sites are preferentially detected in a structural region called loop 7, which recognizes substrate nucleic acids (Fig. 2). Furthermore, most of the sites under diversifying selection are located on the protein surface (Fig. 2).
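The positional Shannon entropy score mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the logarithm base (2), the exclusion of gap characters, and the function names `column_entropy` and `msa_entropy` are all choices made here for the example.

```python
import math
from collections import Counter


def column_entropy(column: str, gap: str = "-") -> float:
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [r for r in column if r != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def msa_entropy(msa: list[str]) -> list[float]:
    """Per-site entropy for an MSA given as equal-length aligned strings."""
    return [column_entropy("".join(col)) for col in zip(*msa)]


if __name__ == "__main__":
    # Toy alignment (made up): sites 1 and 3 are invariant, site 2 varies.
    toy_msa = ["HAE", "HCE", "HDE", "HEE"]
    print(msa_entropy(toy_msa))
```

An invariant site, such as a residue of the HxE motif, scores 0, whereas a site at which every sequence carries a different residue scores the maximum for that number of sequences.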
Fig. 2.

Evolutionary features of AID/APOBEC Z domains. The analyses are based on the MSAs of respective classes of AID/APOBEC Z domains. The MSAs of intact Z domains of AID (n = 163), A1 (n = 155), A2 (n = 251), A3Z1 (n = 332), A3Z2 (n = 132), A3Z3 (n = 152), and A4 (n = 154) (listed in Dataset S3) were used. (A) Differences in sequence conservation among the 7 classes of AID/APOBEC Z domains. Positional sequence conservation scores (Shannon’s entropy scores) were calculated at respective amino acid sites of the MSA (shown as logo plots in B). (B) Top rows show the P values (−log10) in the dN/dS ratio test [with branch-site model (25)] at each codon site. The sites under diversifying selection with statistical significance (P < 0.05) are indicated by red bars. Bottom rows show logo plots of the conserved sequences of the AID/APOBEC Z domains. Yellow square indicates the amino acid residues comprising the catalytic domain of AID/APOBEC proteins. Pink square indicates the amino acid residues corresponding to the structure loop 7. The other characteristics of each amino acid residue [e.g., Vif binding sites for human A3C (27), human A3D-CTD (27), human A3F-CTD (27, 74, 75), human A3G-NTD (41, 42), and human A3H (28, 76)] are summarized in the box to the lower left of the panel. CTD, C-terminal domain; NTD, N-terminal domain.

Investigation of amplified A3 loci revealed that the majority of A3 genes are encoded in the canonical A3 genomic locus (8, 9), flanked by the CBX6 and CBX7 genes (Fig. 3 and Dataset S4), indicating that amplification of A3 genes has mainly occurred via tandem gene duplication. However, there are exceptions to this rule: 3 primate species, Saimiri boliviensis, Aotus nancymaae, and Otolemur garnettii, were found to encode more A3 loci outside the canonical locus than within it (Fig. 3). The A3 genes in these 3 primates were mostly encoded at entirely distinct loci (Fig. 3) and exhibit double-domain (A3Z2–A3Z1) and intronless structures (Dataset S5), indicating they likely originated via retrotransposition of spliced mRNA (29). These retrotransposed A3 genes in New World monkeys were more closely related to the human A3G gene than to the other double-domain A3 genes in humans. Although most were pseudogenized (Fig. 3), some retain relatively long ORFs. In particular, 1 of the retrotransposed A3 genes in A. nancymaae (referred to as “outside #3”) retains a full-length ORF. Indeed, this gene is annotated in the Ensembl gene database (http://www.ensembl.org; Release 97; ENSANAG00000031271). Moreover, analysis of public RNA-sequencing (RNA-Seq) data revealed that mRNA of outside #3 is expressed in a broad range of tissues in A. nancymaae. Taken together, these data show that A3G-like genes have been amplified via retrotransposition in New World monkeys, and some of these amplified genes are likely functional.
Fig. 3.

Genomic location of A3 genes. (A) Genomic order of the AID/APOBEC Z domains within the canonical A3 gene locus, which is flanked by the CBX6 and CBX7 genes. Only mammalian genomes in which the CBX6 and CBX7 genes were detected in the same scaffold were analyzed. The arrows indicate the direction of respective loci. (B) Bubble plot of the number of A3 Z domains in mammals. The number of the A3 Z domains in the whole genome (x axis) and that within the canonical A3 gene locus (y axis) in each mammal are plotted. Dot size is proportional to the number of species. (C) Genomic locations of A3 Z domains in S. boliviensis, A. nancymaae, and O. garnettii. A3 Z domains within 100 kb of each other were clustered. An asterisk denotes the A3 cluster corresponding to the canonical A3 gene locus. The arrows indicate the direction of respective loci. Pseudogenized sequences are indicated with an X. The sequences indicated by double daggers are intronless sequences. (D) The association between the genomic location of A3 genes and pseudogenization. The labels “in” and “out” denote the numbers of A3 Z domains located inside or outside the canonical A3 gene locus, respectively. Results for S. boliviensis, A. nancymaae, and O. garnettii are shown. Odds ratio and P value, calculated with Fisher’s exact test, are shown.


ERVs Evidence a Long-Running Conflict Between Retroviruses and A3 Genes.

To explore the impact of A3 activity on ERVs and their ancient exogenous ancestors, we performed comparative analysis of transposable elements (TEs) in 160 mammalian genomes. As shown in Fig. 4, the TE composition of mammalian species varies with respect to the proportions of DNA transposons, SINEs, LINEs, and ERVs. To investigate the accumulation level of G-to-A mutations in ERVs, we measured the strand bias of the G-to-A mutation rate in ERVs and other TEs. Since A3 proteins selectively induce G-to-A mutations on the positive strand of retroviruses, strand bias can be an indicator of A3 attack on retroviruses. Consistent with previous reports (30–32), preferential accumulation of G-to-A mutations was observed in human ERVs but not in other human TEs (Fig. 4). We next classified mutation patterns based on the dinucleotide context. As shown in Fig. 4, ERVs in the human genome preferentially exhibited GG-to-AG or GA-to-AA mutations, consistent with the reported preferences of human A3G (GG-to-AG) and A3D, A3F, and A3H (GA-to-AA mutations) (10, 33–39). Additionally, some ERVs exhibited G-to-A hypermutation (Fig. 4).
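The strand bias measurement described above can be illustrated with a short Python sketch. It rests on assumptions that are not taken from the text: each ERV copy is compared with a positive-strand consensus of the same aligned length, a G-to-A change on the negative strand is read as a C-to-T change on the positive strand, and a pseudocount is added to avoid division by zero. The function name `strand_bias_score` and the pseudocount value are hypothetical.

```python
import math


def strand_bias_score(consensus: str, copy: str, pseudocount: float = 0.5) -> float:
    """log2 ratio of positive- to negative-strand G-to-A mutation rates.

    Assumption: `consensus` and `copy` are aligned, positive-strand
    sequences of equal length; gaps ('-') and 'N' in the copy are skipped.
    """
    if len(consensus) != len(copy):
        raise ValueError("sequences must be aligned to the same length")
    g_sites = g_to_a = c_sites = c_to_t = 0
    for ref, obs in zip(consensus.upper(), copy.upper()):
        if obs in ("-", "N"):
            continue
        if ref == "G":
            g_sites += 1
            g_to_a += obs == "A"  # bool adds as 0 or 1
        elif ref == "C":
            c_sites += 1
            c_to_t += obs == "T"  # negative-strand G-to-A seen on the positive strand
    plus_rate = (g_to_a + pseudocount) / (g_sites + 2 * pseudocount)
    minus_rate = (c_to_t + pseudocount) / (c_sites + 2 * pseudocount)
    return math.log2(plus_rate / minus_rate)


if __name__ == "__main__":
    # Toy sequences (made up): two G-to-A changes, no C-to-T changes.
    print(strand_bias_score("GGGGCCCC", "AAGGCCCC"))
```

A positive score indicates an excess of G-to-A changes on the positive strand, the direction expected from A3 activity; a score near 0 indicates no strand bias.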
Fig. 4.

Signatures of A3 activity in ERV sequences and its association with A3 amplification. (A) Proportions of ERV sequences in the genomes of mammalian species. (B) Strand bias scores of G-to-A mutation rates in human TEs (log2-transformed). The strand bias score is calculated as the G-to-A mutation rate ratio between the positive and negative strands. Dots indicate the strand bias scores of respective TE groups. (C) Dinucleotide sequence composition of G-to-A mutation sites in human ERV subfamilies. Of the top 50 ERV subfamilies with respect to the strand bias score, the top 25 ERV subfamilies with respect to the variation (i.e., coefficient of variation) among the 4 G-to-A mutation sites (GA, GT, GG, and GC) are shown. (D) ERV copies presenting the G-to-A hypermutation signature. ERV copies with >1 log2-transformed strand bias score and <0.1 false discovery rate are indicated in red. (E) Association of the number of A3 Z domains with the accumulation level of G-to-A mutations in ERVs in mammals. The x axis indicates the number of intact A3 Z domains, and the y axis indicates the mean value of the log2-transformed strand bias scores among ERVs in the genome. Correlation coefficient and P value are calculated by Pearson’s correlation.

To explore the potential impact of A3 gene amplification on ERVs, we first assessed the accumulation level of G-to-A mutations across all mammalian ERVs, then examined the association between 1) accumulation of G-to-A mutations in ERVs and 2) the number of A3 Z domains. This revealed a strong positive correlation (Fig. 4) (Pearson’s correlation coefficient = 0.69, P < 1.0E-15) wherein the possession of fewer A3 genes (e.g., nonplacental mammals and rodents) is associated with lower accumulation levels, and a higher number of A3 genes (e.g., Simiiformes and some chiropterans) is associated with higher accumulation levels.
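For readers who want to see how a Pearson correlation of this kind is computed, a minimal Python sketch follows. The numbers in the example are made up for illustration and are not data from this study; the function name `pearson_r` is hypothetical. The sketch returns only the coefficient and does not compute a P value.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up values: intact A3 Z domain counts per genome (x) and the
    # mean log2 strand bias score among ERVs in that genome (y).
    z_domain_counts = [0, 1, 3, 7, 11, 14]
    mean_strand_bias = [0.02, 0.05, 0.10, 0.21, 0.25, 0.40]
    print(pearson_r(z_domain_counts, mean_strand_bias))
```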

Correlation of A3 Gene Amplification and Diversification with ERV Activity.

We examined the association between ERV invasions and A3 gene family expansion. As shown in Fig. 5, we found that the number of A3 Z domains was positively associated with the percentage of ERVs in the mammalian genome (in Poisson regression, coefficient = 0.14, P < 1.0E-15). Thus, species in which a greater proportion of the genome is composed of ERVs tend to have a higher number of A3 genes. Exceptions occur in the rodent family Muridae, as well as in 2 other species, hedgehog (Erinaceus europaeus) and opossum (M. domestica). In all of these outlier species, a large proportion of the genome is composed of ERV sequences, but relatively few or no A3 genes appear to be present. As might be expected, ERVs in these outlier species exhibited lower accumulation levels of G-to-A mutations overall (Fig. 5). In addition, many of the ERVs identified in these species are relatively young, indicating that they derive from recent genome colonization events and have been incorporated into the germline without encountering A3-mediated mutation.
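The Poisson regression with a log link referred to above models the expected number of A3 Z domains as exp(b0 + b1 × ERV percentage). The Python sketch below fits such a model by Newton-Raphson for a single predictor. It is an illustration only: the fitting routine, the function name `fit_poisson`, and the example data are assumptions made here, and the authors' software is not stated in this section. The sketch does not compute standard errors or P values.

```python
import math


def fit_poisson(xs: list[float], ys: list[float],
                max_iter: int = 50, tol: float = 1e-10) -> tuple[float, float]:
    """Fit E[y] = exp(b0 + b1 * x) by Newton-Raphson; returns (b0, b1)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal length")
    b0 = math.log(sum(ys) / len(ys))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(y - m for y, m in zip(ys, mu))
        u1 = sum(x * (y - m) for x, y, m in zip(xs, ys, mu))
        # Fisher information matrix entries.
        i00 = sum(mu)
        i01 = sum(x * m for x, m in zip(xs, mu))
        i11 = sum(x * x * m for x, m in zip(xs, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system information * delta = score.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    # Made-up data: ERV percentage of the genome (x) and A3 Z domain count (y).
    erv_percent = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
    z_domains = [1, 2, 4, 4, 7, 9]
    print(fit_poisson(erv_percent, z_domains))
```

Under this model a coefficient of 0.14 means that each additional percentage point of ERV content multiplies the expected number of A3 Z domains by exp(0.14), roughly 1.15.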
Fig. 5.

Association between A3 gene family expansion and ERV invasion. (A and B) Association of the number of A3 Z domains with the amount of ERV insertions in the genome. Dots are colored according to the species taxa (A) or the accumulation level of G-to-A mutations in ERVs (B). The association was evaluated under the Poisson regression with log link function. (C) Temporal association of ERV invasion with A3 gene amplification in primates. (Left) Amount of ERV insertions in each age category in distinct primate species. ERV insertion date was estimated based on the genetic distance of each ERV integrant from the consensus sequence under the molecular clock assumption [2.2 × 10^−9 mutations per site per year (68)]. (Middle) Number of intact A3 Z domains. (Right) Schematic of the MSA of A3G (A3Z2-A3Z1 type) gene. Sequences of A3G genes in primates recorded in the Ensembl gene database (http://www.ensembl.org) were used. NA, not applicable (no available data).

To investigate the association of A3 gene family expansion with ERV activity, we focused on primates because the evolutionary history of primate ERVs has been explored in depth and is relatively well characterized. We assessed the age of ERV invasions in each species using a genomic distance-based method and found that ERVs prominently invaded in the common ancestors of Simiiformes (including Hominoidea, Old World monkeys, and New World monkeys) around 50 million years ago (Fig. 5, Left). In contrast, ancestors of prosimians (including lemurs, lorisoids, and tarsiers) did not experience prominent ERV invasion in this period. Furthermore, simians encoded higher numbers of A3 genes than prosimians (except for O. garnettii), suggesting that A3 gene amplification occurred early in the divergence of simian species (Fig. 5, Middle). We investigated the timing of the formation of the double-domain A3G gene (i.e., A3G gene with A3Z2-A3Z1 structure) using the Ensembl gene database (www.ensembl.org/).
We found that simian primates encoded the double-domain (A3Z2-A3Z1) A3G gene, whereas prosimians did not, suggesting that the emergence of double-domain A3G genes also occurred during this period (Fig. 5, Right). Absence of a double-domain A3G gene in prosimians is supported by the finding that no A3Z2-A3Z1 genetic structures were observed in prosimian genomes (Fig. 3). Overall, the timing of A3 gene amplification and diversification in primates was highly concordant with the timing of the prominent ERV invasions.
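The distance-based dating used for ERV insertions can be sketched as a single division: insertion age is the genetic distance from the consensus divided by the clock rate. The Python sketch below uses the rate given in the Fig. 5 legend (2.2 × 10^−9 mutations per site per year). Whether the distance is raw or corrected for multiple substitutions is not stated in this section, so the input is simply treated as substitutions per site; the function name `insertion_age_years` is hypothetical.

```python
CLOCK_RATE = 2.2e-9  # mutations per site per year (rate given in the Fig. 5 legend)


def insertion_age_years(distance: float, rate: float = CLOCK_RATE) -> float:
    """Estimate ERV insertion age from its distance to the consensus.

    Assumption: `distance` is in substitutions per site and mutations
    accumulate at a constant rate since insertion (molecular clock).
    """
    if distance < 0 or rate <= 0:
        raise ValueError("distance must be >= 0 and rate must be > 0")
    return distance / rate


if __name__ == "__main__":
    # A copy that differs from its consensus at 11% of sites dates to
    # roughly 50 million years under this rate.
    print(insertion_age_years(0.11) / 1e6, "million years")
```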

Discussion

Mammalian A3 family genes possess potent antiviral activities and are thought to have diversified during their evolution to allow targeting of a broader range of viruses (8, 12–14). ERVs provide a rich fossil record for retroviruses, enabling unique insights into the long-term coevolutionary interactions between retroviruses and their hosts. In the present study, we used the ERV fossil record to explore the coevolutionary history of A3 genes and ERVs. When examining the ERV fossil record, it is vital to keep in mind that it is necessarily an incomplete record of retrovirus evolution. The vast majority of ERV sequences are fixed in the gene pool of host species, but since 1) fixation of any novel allele is extremely unlikely in the absence of strong selection and 2) most ERV insertions are likely to be selectively neutral at best, it is reasonable to assume that the fixed ERVs we observe in the genomes of contemporary species represent a tiny subset of all of the ERVs that colonized their ancestors’ genomes. Furthermore, the ERV fossil record is presumably heavily biased toward retrovirus lineages that target germline cells, and there may have been many ancestral retrovirus lineages that never generated germline copies. Nonetheless, the fixed ERVs that are found in contemporary genomes are a unique source of retrospective information about the ancestral interactions between retroviruses and their hosts. Furthermore, because A3 genes restrict retrovirus replication via DNA editing, ERV sequences can contain genomic signatures that reveal information about their interactions with this particular group of restriction factors. We show a strong positive correlation between A3 Z copy number and the extent to which G-to-A mutations have accumulated in ERV sequences (Fig. 4). This finding reinforces the previously proposed concept (15, 16, 19, 21) that the accumulation of G-to-A mutations in ERVs reflects the antiviral activity of A3 proteins.
We further show that mammalian species that have accumulated more ERVs (measured as a proportion of their genome) tend to have higher A3 Z copy numbers (Fig. 5). In addition, our analysis revealed that A3 amplification occurred concurrently with prominent ERV invasions in primates. Overall, our findings provide evidence that the evolution of mammalian A3 genes has been shaped by a long-running evolutionary conflict with retroviruses, including those retroviruses that have actively invaded mammalian genomes during their evolution, leading to the generation of fixed ERV loci. The loop 7 region of A3 proteins is thought to determine the sequence specificity of viral nucleotide substrates (40). Our analysis indicates that this region has evolved under strong diversifying selection (Fig. 2), consistent with the idea that rapid evolution in mammalian A3 genes has been driven by interaction with viruses. Since the genes examined are not orthologous, the variation we observed may reflect diversification that occurred following gene duplication. In addition, it is well established that HIV-1 Vif, an antagonist of A3G activity, specifically binds to loop 7 of A3G, leading to degradation of A3G (41, 42). This raises the possibility that Vif-like proteins encoded by ancestral retroviruses and/or ERVs may have exerted diversifying selective pressure on A3s. Indeed, remnants of vif gene-like ORFs have been identified in endogenous lentiviruses (43–45). In addition, it has recently been reported that herpesviruses encode ribonucleotide reductase large subunits that degrade human A3 proteins (5, 46, 47) and that the A3 antagonists of Epstein–Barr virus and Kaposi’s sarcoma-associated herpesvirus specifically recognize the loop 7 structure of A3B (5). Therefore, A3 antagonists encoded by viruses other than retroviruses may also have exerted selective pressure on the loop 7 structures of A3 genes.
Most A3 genes are encoded in the canonical A3 locus and have been amplified by tandem gene duplication (Fig. 3). However, we also detected duplicated A3 genes outside this region in 3 primate species (S. boliviensis, A. nancymaae, and O. garnettii) (Fig. 3). All of these intronless A3G-like genes were amplified by retrotransposition. Furthermore, some are transcribed and may be functional. A3 genes have been amplified in multiple lineages of mammals, but in addition, many A3 genes have been lost or pseudogenized (Fig. 1). For example, the A3Z1 gene was lost in Rodentia, and the A3Z3 gene was lost in Strepsirrhini and Microchiroptera. These findings might be attributed to the genotoxic potential of these A3 genes: uncontrolled A3 expression can be harmful, and exogenous expression of human A3A (A3Z1 ortholog) in cell cultures triggers cytotoxic effects (48–50). Similarly, the aberrant expression of some human A3 proteins, particularly A3A (51, 52), A3B (A3Z2–A3Z1 ortholog) (51–54), and A3H (A3Z3 ortholog) (55), can contribute to cancer development by inducing somatic G-to-A mutations in the human genome. Unlike the A3Z1 and A3Z2 genes, A3Z3 is highly conserved in most mammals and is not amplified in most mammalian lineages. Exceptions occur in carnivores and some other species; however, almost all duplicated A3Z3 genes identified in these species were pseudogenized. Moreover, phylogenetic relationships and the pattern of the premature stop codon positions indicate that the duplication–pseudogenization events have happened twice independently during carnivore evolution. These observations support the idea that while the A3Z3 gene is indispensable for the hosts, its duplication might be genotoxic. A3 proteins can suppress retroviral replication in a G-to-A mutation-independent fashion (e.g., inhibition of reverse transcription) (56–59).
We could not address this dimension of ERV–A3 interaction because of the technical difficulty of assessing the mutation-independent effect of A3 proteins on retroviruses using only genomic information. It should also be noted that the number of A3 genes counted in this study might underestimate the true value because of relatively low resolution of many whole genome sequences. Moreover, we particularly focused on the numbers and sequences of the Z domain of AID/APOBEC family genes, and we could not fully address whether 1) some 2 Z domains compose a double domain gene and 2) there are splicing variants. Nevertheless, this is to our knowledge the most comprehensive investigation of A3 gene evolution performed to date.

Materials and Methods

Sequence Data.

WGS assemblies and RNA-Seq data analyzed in this study are summarized in Datasets S1 and S6, respectively. Mammalian TE sequences were obtained using RepeatMasker (version open-4-0-9) (http://repeatmasker.org) with Repbase RepeatMasker libraries (version 20181026) (60). RMBlast was selected as the search engine, and RepeatMasker was run with the options “-q xsmall -a -species” followed by the species name of the analyzed genome (Dataset S7).

Genome Screening.

Similarity search-based screens of sequence databanks were performed using the database-integrated genome-screening (DIGS) tool, which provides a relational database framework for performing systematic tBLASTn-based screening of WGS databanks (61). We used AID/APOBEC polypeptide sequences of 5 species (human, mouse, cow, megabat, and cat) as queries for DIGS (Dataset S2). The resultant list of hits (i.e., sequences showing homology to AID/APOBEC family genes) was filtered to remove short and low-similarity matches (tBLASTn bitscore < 50). From the DIGS hit sequences, a partial sequence region [referred to as the conserved region (8)] of AID/APOBEC family genes was extracted and used in downstream analyses. Because the conserved regions of AID/APOBEC family genes are located on a single exon, the set of loci identified via DIGS could readily be interrogated using phylogenetic approaches. We selected sequences that covered >70% of the conserved region and constructed multiple sequence alignments (MSAs) using the L-INS-I algorithm as implemented in MAFFT (version 7.407) (62). A phylogenetic tree was reconstructed using the neighbor-joining (NJ) method (63) as implemented in MEGAX (64). Only alignment sites with >85% site coverage were used for phylogenetic reconstruction. Additional tree-based filtering of the underlying dataset was performed prior to construction of a final tree: a preliminary tree was constructed, and phylogenetic outlier sequences with extremely long external branches (i.e., standardized external branch length > 5) were then detected and discarded from downstream analyses. The final set of AID/APOBEC-related loci is summarized in Dataset S3. To investigate the genomic context of AID/APOBEC-related loci, the polypeptide sequences of genes flanking the canonical A3 locus (i.e., CBX6 and CBX7) were used as queries for DIGS. Genomic synteny was illustrated using ggplot2 (https://ggplot2.tidyverse.org/) with the R library ggquiver (https://github.com/mitchelloharawild/ggquiver).
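The tree-based outlier filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "standardized external branch length" means a z-score (branch length minus the mean, divided by the sample SD across all tips), and the function name and the input dictionary of tip names to branch lengths are hypothetical.

```python
from statistics import mean, stdev


def flag_long_branch_outliers(branch_lengths, threshold=5.0):
    """Return tip names whose z-scored external branch length exceeds threshold.

    branch_lengths: dict mapping tip name -> external branch length, as read
    from a preliminary tree (hypothetical input format).
    Assumption: standardization is a plain z-score over all tips.
    """
    values = list(branch_lengths.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # All branches are identical, so nothing can be an outlier.
        return []
    return [
        name
        for name, length in branch_lengths.items()
        if (length - mu) / sd > threshold
    ]
```

Sequences returned by such a filter would be dropped before the final tree is built. Note that a z-score above 5 can only occur when the tree has a few dozen tips or more, which is the case for a screen of this size.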

Sequence Analysis.

In-frame MSAs of nucleotide sequences were constructed using the codon-based alignment algorithm implemented in MUSCLE (65). Codon sites with >50% site coverage were used for downstream analyses. Logo plots of the amino acid sequences were generated using weblogo3 (66). Positional Shannon’s entropy scores were calculated for amino acid MSAs using tools available via the Los Alamos HIV-1 sequence database website (www.hiv.lanl.gov/content/sequence/ENTROPY/entropy_one.html). A dN/dS ratio test using the branch-site model as implemented in HyPhy MEME (25) was used to detect codon sites under diversifying selection. The phylogenetic tree for this test was constructed using the maximum likelihood method as implemented in MEGAX (64).
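For readers unfamiliar with positional Shannon entropy, the calculation can be sketched as below. This is an illustration rather than the Los Alamos tool itself: the use of the natural logarithm and the exclusion of gap characters are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-"):
    """Shannon entropy H = -sum(p * ln p) over residues in one MSA column.

    Assumptions: natural logarithm; gap characters are ignored.
    """
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def alignment_entropy(sequences):
    """Per-column entropy for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*sequences)]
```

A fully conserved column scores 0, and a column split evenly between two residues scores ln 2. Higher scores mark more variable positions, such as those reported here for loop 7.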

Mutation Strand Bias Analysis.

To assess the accumulation level of G-to-A mutations in ERVs and other TEs, the strand bias of the G-to-A mutation rate was calculated. First, we calculated the number of nucleotide changes relative to consensus for each TE integrant using the pairwise sequence alignment generated by RepeatMasker. TE integrants with low-confidence alignments (Smith–Waterman score < 1,000) were excluded from the analysis. Next, G-to-A mutation rates in the positive and negative strands of each TE were calculated. Finally, the strand bias score was defined as the ratio of the G-to-A mutation rate between the positive and negative strands (i.e., the mutation rate in the positive strand divided by that in the negative strand). The strand bias score was calculated for each TE integrant or each TE group. Statistical significance of the strand bias was evaluated by Fisher’s exact test. The false discovery rate was calculated according to the Benjamini–Hochberg method (67).
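The strand bias calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: a G-to-A change on the negative strand is counted as a C-to-T change in the positive-strand alignment, each rate is normalized by the number of consensus G (or C) sites, and Fisher's exact test is applied to the resulting 2 × 2 table of mutated versus unmutated sites. All function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def strand_bias(ga, g_sites, ct, c_sites):
    """Return (strand bias score, p-value) for one TE integrant or TE group.

    ga / g_sites: G-to-A changes and consensus G sites (positive strand).
    ct / c_sites: C-to-T changes and consensus C sites (negative-strand G-to-A).
    """
    plus_rate = ga / g_sites
    minus_rate = ct / c_sites
    score = plus_rate / minus_rate if minus_rate > 0 else float("inf")
    p_value = fisher_exact_two_sided(ga, g_sites - ga, ct, c_sites - ct)
    return score, p_value


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A score above 1 indicates an excess of G-to-A changes on the positive strand. Under these assumptions, p-values collected across TE groups would then be passed together to `benjamini_hochberg`.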

Estimation of Insertion Dates of ERVs.

Insertion dates of ERV loci were estimated using both 1) ortholog distribution-based and 2) genetic distance-based methods. Ortholog distribution-based estimation was performed for ERVs in human and mouse genomes. Liftover chain files were downloaded from the UCSC genome browser (https://genome.ucsc.edu/) (Dataset S8). The Liftover program (http://genome.ucsc.edu/cgi-bin/hgLiftOver) and chain files were used to attempt to convert the genomic coordinates of ERV integrants in one species genome to those in another species using the option “minMatch=0.5.” If conversion succeeded, we inferred that the orthologous copy of the ERV integrant was likely present in the corresponding genome. In the case of mouse ERVs, we first converted genomic coordinates of ERVs in Mm9 to those in Mm10, which is the latest version of the mouse reference genome. Subsequently, the genomic coordinates in Mm10 were converted to those in the genomes of increasingly distantly related species. Insertion dates of ERVs were then estimated from the ortholog distributions. Genetic distance-based estimation of insertion dates was performed by calculating the genetic distance of each ERV integrant from a consensus sequence representing the specific lineage from which the ERV derived. The distribution of genetic distances was summarized using the Landscape function implemented in RepeatMasker. Genetic distances were converted to age estimates under the assumption of a neutral molecular clock. For Primates, Insectivora, and Marsupialia, a neutral rate of 2.2 × 10⁻⁹ mutations per year per site (68) was used. For Rodents, which experience relatively rapid rates of neutral change (69), a rate of 7.0 × 10⁻⁹ mutations per year per site was used. For each of these 2 groups, the insertion dates estimated using these rates were highly concordant between the genetic distance-based and ortholog distribution-based methods.
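The distance-to-age conversion can be illustrated with a short sketch. It assumes that the genetic distance is expressed in substitutions per site relative to the consensus and that age is simply distance divided by the neutral rate (no factor of 2, because only the lineage from consensus to integrant is measured). Both points are assumptions rather than details stated in the text, and the names below are hypothetical.

```python
# Neutral substitution rates (mutations per site per year) quoted in the text.
NEUTRAL_RATE = {
    "Primates": 2.2e-9,
    "Insectivora": 2.2e-9,
    "Marsupialia": 2.2e-9,
    "Rodentia": 7.0e-9,
}


def insertion_age_mya(distance, clade):
    """Convert a distance from consensus (substitutions/site) to an age in Mya.

    Assumption: age = distance / rate under a strict neutral molecular clock.
    """
    return distance / NEUTRAL_RATE[clade] / 1e6
```

Under these assumptions, a primate ERV that is 11% diverged from its consensus would date to roughly 50 million years ago, and a rodent ERV at 7% divergence to roughly 10 million years ago.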

RNA-Seq Analysis of AID/APOBEC Family Genes.

The RNA-Seq datasets used in the present study are summarized in Dataset S6. RNA-Seq reads were trimmed using Trimmomatic (version 0.36) (70) and subsequently mapped to the reference genomes using STAR (version 020201) (71). Reads mapped to the identified loci of AID/APOBEC family genes were counted using featureCounts (version 1.6.4) (72). Only reads mapped to unique genomic regions were counted. Read counts were normalized to the total number of uniquely mapped reads, and expression levels were measured as fragments per kilobase per million mapped fragments.
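The normalization step corresponds to the standard FPKM formula, sketched below. The function and parameter names are hypothetical, and the locus length is assumed to be the length in base pairs of the counted region.

```python
def fpkm(read_count, locus_length_bp, total_unique_mapped):
    """Fragments per kilobase per million uniquely mapped fragments.

    read_count: fragments counted for the locus (e.g., from featureCounts).
    locus_length_bp: length of the counted region in base pairs (assumed).
    total_unique_mapped: total uniquely mapped fragments in the library.
    """
    return read_count * 1e9 / (locus_length_bp * total_unique_mapped)
```

For example, 100 fragments on a 1-kb locus in a library of 1 million uniquely mapped fragments gives an FPKM of 100.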

Data Availability.

The data, associated protocols, code, and materials in this study are available at https://giffordlabcvr.github.io/A3-Evolution/.