
Paleozoic Protein Fossils Illuminate the Evolution of Vertebrate Genomes and Transposable Elements.

Martin C Frith

Abstract

Genomes hold a treasure trove of protein fossils: Fragments of formerly protein-coding DNA, which mainly come from transposable elements (TEs) or host genes. These fossils reveal ancient evolution of TEs and genomes, and many fossils have been exapted to perform diverse functions important for the host's fitness. However, old and highly degraded fossils are hard to identify, standard methods (e.g. BLAST) are not optimized for this task, and few Paleozoic protein fossils have been found. Here, a recently optimized method is used to find protein fossils in vertebrate genomes. It finds Paleozoic fossils predating the amphibian/amniote divergence from most major TE categories, including virus-related Polinton and Gypsy elements. It finds 10 fossils in the human genome (eight from TEs and two from host genes) that predate the last common ancestor of all jawed vertebrates, probably from the Ordovician period. It also finds types of transposon and retrotransposon not found in human before. These fossils have extreme sequence conservation, indicating exaptation: some have evidence of gene-regulatory function, and they tend to lie nearest to developmental genes. Some ancient fossils suggest "genome tectonics," where two fragments of one TE have drifted apart by up to megabases, possibly explaining gene deserts and large introns. This paints a picture of great TE diversity in our aquatic ancestors, with patchy TE inheritance by later vertebrates, producing new genes and regulatory elements on the way. Host-gene fossils too have contributed anciently conserved DNA segments. This paves the way to further studies of ancient protein fossils.
© The Author(s) 2022. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.

Keywords:  exaptation; paleovirology; pseudogene; retrotransposon; transposon

Year:  2022        PMID: 35348724      PMCID: PMC9004415          DOI: 10.1093/molbev/msac068

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

Genomes contain relics of formerly protein-coding DNA, which may be functionless and neutrally evolving, or in some cases have gained new, nonprotein-coding functions. Most of them are derived from either transposable elements or host genes. Transposable elements (TEs) are parasitic, or perhaps symbiotic, DNA elements that get copied or moved from one genome location to another. They have often proliferated greatly, so that for example the human genome has millions of TE-derived segments comprising at least ∼50% of the genome. Most of these segments are highly mutated fragments, no longer active TEs. TEs have had a massive impact on the evolution of their hosts (Warren et al. 2015; Etchegaray et al. 2021). They cause mutations by their proliferation, and also by ectopic recombination among TE copies, causing deletions, inversions, and duplications. This can duplicate or inactivate genes (Barsh et al. 1983; Hayakawa et al. 2001), or change their tissue-specific expression (Ting et al. 1992). Some host genes have evolved from TEs, such as the vertebrate RAG genes that generate the diverse antibodies and T-cell receptors of the immune system (Kapitonov and Koonin, 2015), and syncytin genes that seem to enable cell fusion in placental development (Dupressoir et al. 2005). Some DNA elements that regulate gene expression have also evolved from TEs (Ting et al. 1992; Jordan et al. 2003). A series of studies in 2006–2007 found thousands of TE-derived nonprotein-coding elements with strong evolutionary conservation in mammals (Bejerano et al. 2006; Kamal et al. 2006; Nishihara et al. 2006; Xie et al. 2006; Gentles et al. 2007; Lowe et al. 2007). They often occur in gene deserts, and nearest to developmental genes (Lowe et al. 2007). 
These TE insertions often predate the placental/marsupial divergence (Mesozoic), but few clearly predate the mammal/bird divergence (Paleozoic), and an exceptional handful (“at least several”) were shown to predate the amniote/amphibian divergence (Bejerano et al. 2006). It is thus remarkable that a later study claimed to find 133 TE insertions predating the divergence of humans and ray-finned fish, by comparing human TE fragments found by RepeatMasker to vertebrate genome alignments (Lowe and Haussler, 2012). The boundary between TEs and viruses is blurry, and an entire field, paleovirology, is mainly based on viral insertion fossils in eukaryote genomes (Barreat and Katzourakis, 2022). The oldest viral fossils found so far seem to be Mesozoic (Suh et al. 2014; Barreat and Katzourakis, 2022). TEs are diverse and their classification is partly arbitrary (Kojima, 2019; Storer et al. 2021), but eukaryotic TEs are conventionally split into retrotransposons which duplicate by reverse transcription of their RNA into DNA, and DNA transposons which do not. Major types of retrotransposon are: LINEs (long interspersed nuclear elements), LTR retrotransposons (which bear long terminal repeats), YR (tyrosine recombinase) retrotransposons, and Penelope-like elements. These are further subclassified, for example LINEs have clades and sub-clades such as Hero, Nimb, L1, I, and CR1. Major types of DNA transposon are: DDE transposons (named after three key amino acids in the transposase), Cryptons (YR transposons), Helitrons, and Polintons (also called Mavericks). These are also subdivided, for example DDE transposons have “superfamilies” such as Academ, hAT, Kolobok, and piggyBac. Finally, nonautonomous TEs such as short interspersed nuclear elements (SINEs) typically encode no proteins, and propagate by hijacking enzymes from autonomous TEs. 
Many types of TE have patchy presence across host genomes, meaning that a TE type is present in distantly related hosts but absent in some closer relatives of those hosts (Yuan and Wessler, 2011; Chalopin et al. 2015). This can sometimes be explained by ordinary vertical inheritance, with multiple losses of the TE family (Fawcett and Innan, 2016). Contrarily, it has been suggested that long-term vertical persistence of TEs may be rare, so their long-term persistence depends on horizontal transfer (Gilbert and Feschotte, 2018). Thus, in order to understand the evolution of TE families in eukaryotes, it is valuable to know what TE types were present in ancestral eukaryotes (Fawcett and Innan, 2016). Host-gene-derived protein fossils are often called “pseudogenes.” They usually arise from duplication of (part of) a gene, such that one of the two copies is either not expressed or dispensable so evolves away from its protein-coding ancestry. Many such duplications are created by reverse-transcription of mRNA to DNA (e.g. by retrotransposon enzymes), producing intron-depleted fossils termed “processed pseudogenes.” There are also nonduplicated “unitary pseudogenes,” for example the GULO/GULOP gene/pseudogene for making vitamin C, which is nonfunctional in primates and guinea pigs (Nishikimi et al. 1994). Some pseudogenes seem to have significant functions, for example by being transcribed into an antisense RNA regulator of its cognate gene (Korneev et al. 1999), or regulating transcription (Huang et al. 2017), or generating small interfering RNAs (Tam et al. 2008). The Xist RNA involved in X chromosome inactivation has evolved partly from a formerly protein-coding gene, and partly from TEs (Elisaphenko et al. 2008). 
The boundary between protein fossils and functional protein-coding genes is fuzzy: a decaying gene such as GULO may produce peptides whose contribution to the organism’s fitness fluctuates around zero, in the process of gene death or resurrection (Brosius and Gould, 1992; Cheetham et al. 2020). Genetic fossils are often found by comparing a genome to a database of TE or gene sequences (Harrison, 2021; Storer et al. 2021). This can be done by either DNA-to-DNA or DNA-to-protein comparison. Protein-coding DNA tends to evolve by changes that preserve the encoded amino acids or replace them with similar ones: thus highly diverged sequences can be detected more effectively at the protein level (States et al. 1991). On the other hand, protein fossils evolve without amino-acid conservation. Thus, new TE families are often found by protein-level matches to distantly related families, whereas relics of known TE families are best detected by DNA-level matches to a model approximating the family’s most-recent active ancestor. RepeatMasker files of such DNA-level matches are available for many genomes (Smit et al. 2015). Protein-level matches have usually been sought with BLAST (Altschul et al. 1997), which is not optimized for fossils. Central to sequence matching methods are parameters defining the (dis)favorability of substitutions and gaps, which provide the definition of similarity. BLAST uses a 20 × 20 amino-acid substitution matrix (BLOSUM or PAM), which is based on substitution rates in living proteins, so is likely suboptimal for fossils. Therefore, we recently developed a new DNA-to-protein matching method, which allows frameshifts within matches (Yao and Frith, 2021), implemented in LAST (https://gitlab.com/mcfrith/last). Its main advantage is that it sets the substitution, gap, and frameshift parameters by maximum-likelihood fit to given sequence data. 
It uses a richer 64 × 21 substitution matrix, allowing for example preferred matching of asparagine (encoded by aac or aat) to agc than to tca, which both encode serine. It judges homology based on not just one alignment, but on many alternative ways of aligning the putative homologs. This proved more sensitive than BLAST for finding human TE protein fossils, and for the first time it found YR retrotransposon fossils in the human genome (Yao and Frith, 2021). Here, this method is used to find new protein fossils in human and slowly evolving Lagerstätte genomes: alligator, turtle, coelacanth (a lobe-finned fish closely related to land vertebrates), and chimera (a nonbony cartilaginous fish); and also frog due to its intermediate phylogenetic position (table 1). The number of new fossils is relatively small, but they are especially ancient and include types of TE not found in human before. They thus illuminate the evolutionary history of TE content, and reveal strongly conserved ancient exaptations, including of host-gene fossils.
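The asparagine/serine example above can be made concrete by counting nucleotide differences between the codons involved. This is only an illustrative sketch: the difference count is a crude stand-in for relatedness, not the fitted scores of the 64 × 21 matrix, and the function names are hypothetical.

```python
# Codons taken from the example in the text (lowercase DNA alphabet).
ASPARAGINE_CODONS = ["aac", "aat"]
SERINE_CODONS = ["agc", "tca"]


def nucleotide_differences(codon_a: str, codon_b: str) -> int:
    """Count the positions at which two equal-length codons differ."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have the same length")
    return sum(a != b for a, b in zip(codon_a, codon_b))


def pairwise_differences(codons_a, codons_b):
    """Map each (codon_a, codon_b) pair to its number of differing bases."""
    return {
        (a, b): nucleotide_differences(a, b)
        for a in codons_a
        for b in codons_b
    }


differences = pairwise_differences(ASPARAGINE_CODONS, SERINE_CODONS)
for (asn, ser), count in sorted(differences.items()):
    print(f"{asn} -> {ser}: {count} differing base(s)")
```

Both asparagine codons are closer to agc than to tca, even though agc and tca encode the same amino acid. A 20 × 20 amino-acid matrix cannot express that distinction, whereas a codon-by-residue matrix can.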
Table 1. Genome Versions and TE Protein Fossils.

| Organism | Genome Assembly (from NCBI or UCSC) | RepeatMasker Version (source) | TE Fossils | Of Which Novel[a] (%) |
|---|---|---|---|---|
| Human, Homo sapiens | UCSC hg38.analysisSet | 4.0.7 (UCSC) | 546,821 | 1,641 (0.3) |
| Alligator, Alligator mississippiensis | ASM28112v4 | 4.0.6 (NCBI) | 410,092 | 46,065 (11) |
| Turtle, Chrysemys picta bellii | Chrysemys_picta_bellii-3.0.3 | 4.0.6 (NCBI) | 430,459 | 63,301 (15) |
| Frog, Xenopus tropicalis | UCB_Xtro_10.0 | 4.0.8 (NCBI) | 135,507 | 14,837 (11) |
| Coelacanth, Latimeria chalumnae | UCSC latCha1 | 4.0.5 (rmsk[b]) | 286,944 | 279,710 (97) |
| Chimaera, Callorhinchus milii | UCSC calMil1 | 4.0.3 (UCSC) | 105,995 | 31,098 (29) |

[a] Not found by this version of RepeatMasker.

[b] repeatmasker.org.


Results and Discussion

Protein Fossil-Finding Pipeline

For each organism, homologous segments were found between the genome and a set of protein sequences comprising TE proteins from RepeatMasker plus proteins encoded by host genes of that organism. When multiple homologies overlapped in the genome, only the strongest was kept, to avoid homologies between different types of TE or between TEs and host genes. Homologies overlapping annotated protein-coding segments of the genome were removed. Finally, host-gene homologies were discarded if they overlapped TEs annotated by RepeatMasker: this removes true-but-unwanted homologies due to host-gene protein-coding segments that evolved from, for example SINEs. The resulting fossils, including a genome browser hub, are available at https://github.com/mcfrith/protein-fossils. The homology search used a significance threshold of one expected random match to the whole set of proteins per 10^9 bp, so there would be ∼3 matches in total between the human genome and all the proteins, if the sequences were perfectly random. However, naive matching would find many nonhomologous similarities of “simple sequences” such as atatatatatatatat: these were suppressed with tantan (Frith, 2011; Yao and Frith, 2021). The false-positive rate was estimated by comparing the reversed (but not complemented) human genome to the whole set of proteins, producing 19 spurious matches in total.
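The rule "when multiple homologies overlapped, only the strongest was kept" can be sketched as a greedy filter. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `Hit` record, its fields, the example scores, and the greedy highest-score-first strategy are all hypothetical, and coordinates are assumed to be 0-based and half-open.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A hypothetical genome-to-protein homology record."""
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive
    score: float  # larger means stronger homology
    protein: str


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits share at least one base on the same sequence."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def keep_strongest(hits: List[Hit]) -> List[Hit]:
    """Visit hits from strongest to weakest, keeping a hit only if it
    does not overlap a hit that was already kept (assumed strategy)."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: (h.chrom, h.start))


example_hits = [
    Hit("chr1", 100, 400, 250.0, "L1_pol"),
    Hit("chr1", 350, 600, 120.0, "CR1_pol"),    # overlaps the stronger L1 hit
    Hit("chr1", 590, 800, 90.0, "hostGeneX"),   # overlaps only the dropped CR1 hit
    Hit("chr2", 100, 400, 80.0, "Gypsy_pol"),
]
filtered = keep_strongest(example_hits)
print([h.protein for h in filtered])
```

With this greedy reading, a weak hit survives if the only hit it overlaps was itself discarded. Whether the published pipeline treats such chains the same way is not stated in the text.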

New TE Fossils

For the organisms analyzed in this study, the number of TE protein fossils found per genome ranges from ∼100,000 to ∼500,000, most of which correspond to known TE fragments in public RepeatMasker files (table 1). The human genome has especially few new TE fossils, indicating how thoroughly human TEs have been analyzed. The coelacanth fossils are almost all new relative to the RepeatMasker annotations, simply because those annotations have very few TE types, illustrating that TE analysis is lacking for some genomes at any snapshot in time (Sotero-Caio et al. 2017).

Classifying Unknown Repeats

RepeatMasker genome annotations include repeats of unknown type, which might not be TEs (Bao et al. 2015; Smit et al. 2015). In alligator and turtle (but not the other genomes), some of these unknown repeats could be classified based on large and consistent overlaps with TE protein fossils (table 2). One of these repeats, UCON84, also occurs in the human genome: it is derived from a DDE transposon in the PIF/Harbinger superfamily (fig. 1). The UCON84 consensus sequence, obtained from Dfam (Storer et al. 2021), has shorter and weaker (but significant) homology to PIF/Harbinger proteins (not shown). The consensus is expected to approximate an ancestral sequence and thus have clearer homology, but it is hard to make an accurate consensus of ancient fragments.
Table 2. Classifying Unknown Repeats in Alligator and Turtle.

| Unknown Repeat | TE Type |
|---|---|
| REP-2_CPB | CR1 (LINE) |
| REP-3_CPB | L2 (LINE) |
| REP-6_CPB | CR1 (LINE) |
| REP-22_CPB | hAT-Tag1 |
| REP-28_CPB | CR1 (LINE) |
| REP-31_CPB | Gypsy (LTR) |
| SAT-928_Crp | Penelope |
| UCON84 | PIF/Harbinger |
Fig. 1.

Overlap between a TE protein fossil (upper box) and a repeat of unknown type (lower box) in the alligator genome (at coordinate 15,466,729 in NW_017707593.1).


Inter-Genome Homology

The age of genetic fossils can be inferred by comparing different genomes. For example, figure 2 shows a human TE fossil aligned to an L1 LINE protein, alongside mammal genome alignments from the UCSC genome browser (Kent et al. 2002; Harris, 2007). This L1 insertion is present in ape and monkey genomes but absent from bushbaby and other placental mammals, showing that the insertion occurred in a common ancestor of simians after their divergence from strepsirrhine primates. It is thus curious that the L1 insert is aligned to two marsupial genomes: opossum and Tasmanian devil. Marsupials also have L1s, and these marsupial regions are indeed annotated as L1s by RepeatMasker. Thus, these human and marsupial inserts are true homologs, because all L1s share common ancestry, but the insertions are not homologous: not descended from a common ancestral insertion. The inserts might even be orthologs, if their common ancestor is no older than the placental/marsupial divergence.
Fig. 2.

A TE protein fossil in human chromosome 8, with confusing inter-genome homology. Black bar near top: alignment of an L1 LINE protein. Green tracks: alignments between the human and other genomes. Screen shot from http://genome.ucsc.edu.

Why, then, do these marsupial alignments extend into flanking sequence beyond the insert? It is hard to determine the precise endpoint of homology between distantly related sequences: alignments overshoot or undershoot. These human-marsupial alignments were made with the HoxD55 substitution matrix and gap parameters that are prone to large overshoots (Frith et al. 2008).

For this study, new pair-wise genome alignments were made, by finding homologous regions (Frith and Noé, 2014) and cutting them down to most-similar one-to-one alignments (Frith and Kawaguchi, 2015). This tends to find higher-similarity alignments than those from UCSC and elsewhere, indicating that a higher fraction of the alignments are orthologous (Frith and Kawaguchi, 2015). This probably does not avoid nonhomologous TE insertions, so a new step was added: isolated alignments were discarded, by only keeping groups of alignments that are nearby in both genomes. Some examples are in figure 3: each panel shows one TE fossil in the human genome (central vertical stripe) that overlaps an inter-genome alignment (diagonal lines/dots). The alignments are not isolated: they are flanked by other alignments, indicating homology of not just the TE insert but also the flanking regions. Because these are distantly related genomes, most of the DNA lacks similarity and is unaligned. The alignable fragments are probably conserved by natural selection.
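The added step of discarding isolated alignments might look roughly like the sketch below. The text does not give the grouping criterion, so everything here is an assumption made for illustration: the `Alignment` record, the `max_gap` threshold and its value, and the rule that an alignment is kept when at least one other alignment lies within `max_gap` of it in both genomes.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    """A hypothetical pairwise alignment, reduced to its start points."""
    chrom_a: str  # sequence name in genome A (e.g. human)
    start_a: int
    chrom_b: str  # sequence name in genome B (e.g. turtle)
    start_b: int


def are_neighbors(x: Alignment, y: Alignment, max_gap: int) -> bool:
    """True if two alignments are close to each other in BOTH genomes."""
    return (
        x.chrom_a == y.chrom_a
        and x.chrom_b == y.chrom_b
        and abs(x.start_a - y.start_a) <= max_gap
        and abs(x.start_b - y.start_b) <= max_gap
    )


def drop_isolated(alignments: List[Alignment], max_gap: int) -> List[Alignment]:
    """Keep only alignments that have at least one neighbor (assumed rule)."""
    kept = []
    for i, aln in enumerate(alignments):
        if any(
            i != j and are_neighbors(aln, other, max_gap)
            for j, other in enumerate(alignments)
        ):
            kept.append(aln)
    return kept


example_alignments = [
    Alignment("chr8", 1_000_000, "scaf1", 500_000),
    Alignment("chr8", 1_050_000, "scaf1", 540_000),  # near the first in both genomes
    Alignment("chr8", 9_000_000, "scaf7", 20_000),   # far from everything
    Alignment("chr8", 1_020_000, "scaf9", 100_000),  # near in genome A only
]
grouped = drop_isolated(example_alignments, max_gap=100_000)
print(grouped)
```

The fourth alignment shows why proximity is required in both genomes: a lone TE-to-TE match can land next to genuine alignments in one genome while pointing to an unrelated location in the other.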
Fig. 3.

Ancient conserved TE insertions. Each panel shows alignments between part of the human genome (horizontal) and turtle (A,B) or chimera (C,D). Red dots indicate same-strand alignments, blue dots opposite-strand alignments. The central vertical lines show the location in human of the TE fossil (pink: forward strand, blue: reverse strand). The vertical gray line in panel C shows a protein-coding exon of ZFPM2.

A possible objection is that these examples might be independent insertions of an abundant TE into homologous regions of two genomes. This cannot be ruled out, but the key point is that these alignments are not only homologies but most-similar one-to-one homologies: it would be a strong coincidence for these single-best matches to independently be in homologous regions.

TE Types Newly Found in Human

The human TE protein fossils include several types of TE that have not been found in human before (table 3). These are all LINEs or DDE transposons, and are in addition to the first human YR retrotransposons (DIRS and Ngaro) and first-but-one Polintons we recently reported (Yao and Frith, 2021). Some were found directly in human, others were found in another genome and mapped to human via the inter-genome alignments (“found in” column). The E-value indicates significance/confidence of the DNA–protein homology: it is the expected number of times to find such a similarity between the whole genome and the entire set of proteins, if they were random sequences. Some of the E-values are quite high, indicating lower confidence. On the other hand, most of these putative DNA–protein homologies overlap human/nonmammal genome alignments, which would be a strong coincidence if they were random similarities (fig. 3). These DNA–protein alignments often cover conserved signature amino acids of the TE, which are not always conserved in the fossils, as expected if they have lost protein-coding function (fig. 4, supplementary fig. S1).
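To give a feel for what the larger E-values mean, an E-value can be converted to the chance of seeing at least one such match between random sequences. The sketch below assumes that chance matches follow a Poisson distribution, which is a common convention in sequence-similarity statistics but is not stated in the text; the function name is hypothetical and the three E-values are taken from table 3.

```python
import math


def prob_at_least_one_random_match(e_value: float) -> float:
    """P(one or more chance matches) = 1 - exp(-E), assuming Poisson counts.

    math.expm1 keeps precision when E is tiny.
    """
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    return -math.expm1(-e_value)


for e_value in (2.5, 0.026, 5.6e-14):
    p = prob_at_least_one_random_match(e_value)
    print(f"E = {e_value:g}: P(at least one chance match) = {p:.3g}")
```

Under this assumption an E-value of 2.5 corresponds to a high chance of a random match, which is why the overlap with human/nonmammal genome alignments matters as independent support for those entries.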
Table 3. TE Protein Fossils of Types Newly Found in Human (all detected instances of these types).

| Type | Aligned Protein | Chromosome | Start | Length (bp) | Nearest Gene | Intergene/Intron Length (kb) | Found In | E-value | Age |
|---|---|---|---|---|---|---|---|---|---|
| Retrotransposons | | | | | | | | | |
| I | I-1_DR_pol | 3 | 139,262,204 | 119 | MRPS22 | 299 | Alligator | 0.026 | Amniote |
| I | I-1_DR_pol | 9 | 32,218,473 | 204 | ACO1 | 3,171 | Turtle | 0.97 | Amniote |
| Nimb | Nimb-1_DR_pol | 2 | 4,986,057 | 216 | SOX11 | 1,844 | Alligator | 0.027 | Amniote |
| Nimb | Nimb-2_SSa_pol | 3 | 70,169,802 | 144 | MDFIC2 | 226 | Turtle | 0.014 | Amniote |
| Nimb | Nimb-2_DR_pol | 10 | 76,759,130 | 141 | KCNMA1 | 309 | Alligator | 1.6 | Amniote |
| Nimb | Nimb-2_LG_pol | 13 | 53,319,346 | 227 | OLFM4 | 4,089 | Human | 3.1e-05 | Tetrapod |
| Nimb | Nimb-12_DR_pol | X | 7,635,163 | 193 | VCX | 488 | Alligator | 0.00012 | Amniote |
| Nimb | Nimb-6_DR_pol | X | 87,287,424 | 258 | KLHL4 | 685 | Human | 5.6e-14 | Amniote |
| Nimb | Nimb-12_LMi_pol | X | 87,289,312 | 87 | KLHL4 | 685 | Human | 2.5 | Tetrapod |
| L2-Daphne | Daphne-3_OL_pol | 15 | 76,184,366 | 282 | TMEM266 | 16 | Alligator | 5.8e-05 | Amniote |
| L2-Kiri | Kiri-3_HMM_pol | 3 | 157,982,382 | 158 | SHOX2 | 592 | Turtle | 0.012 | Amniote |
| L2-Kiri | Kiri-1_DTa_pol | 16 | 53,518,165 | 255 | AKTIP | 95 | Turtle | 0.053 | Amniote |
| L2-Kiri | Kiri-4_DTa_pol | 18 | 25,027,321 | 256 | ZNF521 | 582 | Alligator | 0.0012 | Amniote |
| R2-Hero | HEROTn | 2 | 118,705,507 | 284 | EN1 | 731 | Alligator | 0.48 | Amniote |
| R2-Hero | HERO-2_BF_pol | 4 | 13,163,078 | 145 | RAB28 | 1,939 | Alligator | 1.3e-06 | Amniote |
| R2-Hero | HERO-2_BF_pol | 6 | 72,553,710 | 317 | KCNQ5 | 219 | Turtle | 6.5e-05 | Tetrapod |
| R2-Hero | HEROTn | 7 | 36,766,591 | 241 | AOAH | 128 | Alligator | 6.9e-06 | Amniote |
| R2-Hero | HERO-2_BF_pol | 8 | 71,079,407 | 240 | EYA1 | 461 | Turtle | 0.064 | Amniote |
| R2-Hero | HEROTn | 8 | 76,862,008 | 444 | ZFHX4 | 7 | Alligator | 4.4e-12 | Amniote |
| R2-Hero | HERO-1_SP_pol | 11 | 91,081,544 | 376 | CHORDC1 | 2,002 | Human | 0.0099 | Amniote |
| R2-Hero | HEROTn | 14 | 53,387,314 | 538 | DDHD1 | 796 | Alligator | 3.2e-09 | Amniote |
| R2-Hero | HEROTn | 15 | 67,559,478 | 285 | MAP2K5 | 13 | Human | 3.4e-06 | Amniote |
| R2-Hero | HEROTn | X | 31,319,778 | 294 | DMD | 57 | Human | 2.3e-08 | Amniote |
| RTE | RTE-2_LVa_pol | 1 | 88,450,891 | 352 | PKN2 | 1,335 | Human | 0.74 | |
| RTE | RTE-4_LCh_pol | 1 | 216,744,731 | 223 | ESRRG | 82 | Human | 0.2 | Amniote |
| RTE | RTE-2_LVa_pol | 2 | 198,403,329 | 316 | PLCL1 | 1,120 | Human | 0.0067 | Amniote |
| RTE | RTE-2_LVa_pol | 2 | 204,082,119 | 410 | ICOS | 584 | Human | 2.2e-13 | Amniote |
| RTE | RTE-12_SP_pol | 3 | 67,685,630 | 136 | SUCLG2 | 337 | Alligator | 5e-05 | Amniote |
| RTE | RTE-12_SP_pol | 3 | 169,026,953 | 209 | MECOM | 988 | Alligator | 0.0053 | Amniote |
| RTE | RTE1_Mars_pol | 3 | 172,228,179 | 190 | FNDC3B | 21 | Human | 0.026 | |
| RTE | RTE-4_LCh_pol | 10 | 33,939,004 | 87 | PARD3 | 775 | Turtle | 3.3e-12 | Amniote |
| RTE | UN-72133877_Spu_pol | 10 | 82,724,195 | 249 | NRG3 | 317 | Alligator | 2e-07 | Amniote |
| RTE | RTE-4_LCh_pol | 13 | 58,819,808 | 129 | DIAPH3 | 1,936 | Alligator | 0.029 | Amniote |
| RTE | RTE-2_LVa_pol | 16 | 73,371,834 | 433 | ZFHX3 | 138 | Human | 9.3e-16 | Amniote |
| RTE | RTE-4_CPB_pol | X | 97,057,978 | 283 | DIAPH2 | 108 | Human | 0.07 | |
| Rex1/Babar | REX1-1_BF_pol | 4 | 34,782,619 | 209 | ARAP2 | 4,919 | Turtle | 5.2e-13 | Amniote |
| Rex1/Babar | Rex1-24_NV_pol | 4 | 130,232,079 | 245 | C4orf33 | 4,033 | Alligator | 0.019 | Amniote |
| DNA Transposons | | | | | | | | | |
| Academ | Academ-1_NV_tp | 4 | 115,452,994 | 152 | NDST4 | 1,970 | Turtle | 0.0067 | Amniote |
| EnSpm | EnSpm-1_CGi | 2 | 129,063,904 | 314 | HS6ST1 | 1,661 | Turtle | 0.0008 | Amniote |
| EnSpm | EnSpm-11_HM | 2 | 180,729,359 | 142 | UBE2E3 | 973 | Alligator | 0.04 | Amniote |
| EnSpm | EnSpm-11_HM | 3 | 180,734,783 | 182 | CCDC39 | 233 | Human | 2 | Amniote |
| Ginger1 | Ginger1-10_HM_tp | 3 | 147,687,919 | 166 | ZIC1 | 1,281 | Alligator | 1.2e-20 | Amniote |
| hAT19 | hAT-39_LCh_tp | 1 | 3,746,554 | 188 | CCDC27 | 16 | Coelacanth | 1.5 | Sarcopterygian |
| hAT19 | hAT-31_CPB_tp | 2 | 104,030,381 | 203 | POU3F3 | 2,036 | Human | 7.6e-08 | |
| hAT19 | hAT-39_LCh_tp | 2 | 143,799,848 | 418 | ARHGAP15 | 170 | Human | 3.2e-21 | Amniote |
| hAT19 | hAT-31_CPB_tp | 4 | 34,434,122 | 613 | ARAP2 | 4,919 | Human | 5.1e-11 | |
| hAT19 | hAT-31_CPB_tp | 7 | 57,524,151 | 625 | ZNF716 | 6,572 | Human | 3.8e-14 | |
| hAT19 | hAT-13_LCh_tp | 16 | 75,903,914 | 276 | CPHXL | 551 | Alligator | 3.5e-11 | Amniote |
| hAT19 | hAT-31_CPB_tp | 20 | 26,187,394 | 444 | ZNF337 | 5,561 | Human | 3.1e-08 | |
| hAT5 | hAT-13_HM_tp | 18 | 38,666,092 | 430 | CELF4 | 4,389 | Alligator | 7.1e-05 | Tetrapod |
Fig. 4.

Alignments between TE proteins and DNA. The DNA’s translation is shown below it, with *** for stop codons. ||| indicates a match, ::: a positive substitution score, and … a zero substitution score. Red color indicates: (A) conserved residues in L1 EN domains (Moran and Gilbert, 2002), (B,C) conserved residues in LINE RT domains (Malik et al. 1999), (D) conserved residues in Penelope-like elements (Arkhipova, 2006), (E) catalytic hAT residues (Atkinson, 2015). The start coordinates are 1-based, whereas the coordinates in tables 3 and 4 are 0-based (so they differ by 1). This figure was made with maf-convert from the LAST package.

Table 4. Pre-tetrapod TE Protein Fossils Found in Human (all detected instances).

| Type | Aligned Protein | Chromosome | Start | Length (bp) | Nearest Gene | Intergene/Intron Length (kb) | Found In | E-value | Age |
|---|---|---|---|---|---|---|---|---|---|
| Retrotransposons | | | | | | | | | |
| CR1 | CR1-4_LCh_pol | 4 | 13,161,246 | 107 | RAB28 | 1,939 | Alligator | 1.9e-06 | Tetrapod |
| CR1 | UN-BfCR1_pol | 4 | 111,235,163 | 187 | PITX2 | 1,503 | Chimera | 0.0078 | Gnathostome |
| CR1 | CR1-1_CM_pol | 5 | 94,809,895 | 122 | MCTP1 | 69 | Alligator | 0.0049 | Tetrapod |
| CR1 | HER_LINE_pol | 6 | 98,370,405 | 277 | POU3F2 | 1,551 | Turtle | 0.022 | Tetrapod |
| CR1 | CR1-1_CPB_pol | 16 | 78,813,330 | 608 | WWOX | 779 | Turtle | 0.022 | Tetrapod |
| CR1 | CR1-4_LCh_pol | 18 | 72,559,688 | 232 | CBLN2 | 198 | Alligator | 9.9e-08 | Tetrapod |
| Nimb | Nimb-2_LG_pol | 13 | 53,319,346 | 227 | OLFM4 | 4,089 | Human | 3.1e-05 | Tetrapod |
| Nimb | Nimb-12_LMi_pol | X | 87,289,312 | 87 | KLHL4 | 685 | Human | 2.5 | Tetrapod |
| L1 | L1-2_LCh_pol | 3 | 70,416,318 | 138 | MDFIC2 | 642 | Coelacanth | 0.61 | Gnathostome |
| L1 | L1-3_LCh_pol | 8 | 105,654,302 | 154 | ZFPM2 | 154 | Human | 1 | Gnathostome |
| L1 | L1-5_LCh_pol | 9 | 2,050,774 | 284 | SMARCA2 | 7 | Human | 0.093 | Tetrapod |
| L1 | L1-55_DR_pol | 13 | 100,350,784 | 171 | PCCA | 28 | Human | 0.035 | Gnathostome |
| L1 | L1-42_DR_pol | 18 | 55,784,184 | 514 | TCF4 | 961 | Alligator | 6e-07 | Tetrapod |
| L1-Tx1 | Tx1-1_DR_pol | 9 | 82,613,098 | 237 | RASEF | 984 | Human | 9.6e-05 | Gnathostome |
| L1-Tx1 | Tx1-5_CGi_pol | 10 | 129,519,611 | 156 | MGMT | 69 | Coelacanth | 3e-16 | Gnathostome |
| L2 | CR1-41_DR_pol | 5 | 166,711,049 | 95 | TENM2 | 3,554 | Turtle | 0.4 | Tetrapod |
| L2 | CR1-9_DR_pol | 6 | 45,881,206 | 149 | CLIC5 | 347 | Alligator | 0.032 | Tetrapod |
| L2[a] | L2-13_DRe_pol | 7 | 108,869,078 | 298 | DNAJB9 | 2,088 | Human | 2.9e-25 | Tetrapod |
| L2-Crack | Crack-11_BF_pol | 1 | 216,137,121 | 109 | USH2A | 77 | Alligator | 3.5e-08 | Tetrapod |
| L2-Crack | Crack-1_SSa_pol | 5 | 109,502,097 | 189 | PJA2 | 280 | Human | 0.00013 | Tetrapod |
| R2-Hero | HERO-2_BF_pol | 6 | 72,553,710 | 317 | KCNQ5 | 219 | Turtle | 6.5e-05 | Tetrapod |
| RTE-X | RTEX-16_SK_pol | 19 | 30,154,303 | 438 | ZNF536 | 209 | Human | 0.00078 | Gnathostome |
| Penelope | Neptune1_Ap_pol | 4 | 187,183,800 | 258 | FAT1 | 1,272 | Alligator | 0.05 | Tetrapod |
| Penelope | Penelope-2_CPB_pol | 13 | 106,429,488 | 180 | EFNB2 | 999 | Human | 0.00017 | Tetrapod |
| Gypsy | Gypsy-14_SSa_1p | 1 | 38,669,733 | 195 | RRAGC | 791 | Human | 5.9e-11 | Tetrapod |
| Gypsy | Gypsy-37_CGi_1p | 2 | 57,598,618 | 325 | VRK2 | 1,521 | Alligator | 8.3e-12 | Tetrapod |
| Gypsy | Gypsy-13_CPB_1p | 19 | 30,111,148 | 159 | URI1 | 209 | Human | 0.017 | Tetrapod |
| Gypsy | Gypsy-24_XT_1p | X | 98,972,571 | 328 | PCDH19 | 2,687 | Human | 4.6e-23 | Tetrapod |
| DIRS | DIRS-21A_XT_pol | 3 | 55,911,374 | 235 | ERC2 | 62 | Alligator | 1.1e-08 | Tetrapod |
| DIRS | DIRS-1a_Amnio_pol | 9 | 13,728,242 | 223 | NFIB | 802 | Human | 0.0037 | Tetrapod |
| DIRS | DIRS-7_NV_pol | 9 | 20,189,423 | 234 | MLLT3 | 553 | Turtle | 0.86 | Tetrapod |
| DIRS | DIRS-9_NV_pol | 10 | 16,734,086 | 144 | RSU1 | 57 | Alligator | 0.84 | Tetrapod |
| DIRS | DIRS-5B_LCh_2p | 16 | 78,152,629 | 218 | WWOX | 49 | Turtle | 3.2e-17 | Tetrapod |
| DNA Transposons | | | | | | | | | |
| PIF/Harbinger | Harbinger-3_LCh_tp | 2 | 145,279,820 | 206 | ZEB2 | 3,324 | Human | 4.4e-05 | Tetrapod |
| PIF/Harbinger | Harbinger3_DR_tp | 2 | 176,676,027 | 180 | MTX2 | 875 | Human | 0.0012 | Tetrapod |
| hAT-Blackjack | hAT-38_LCh_tp | 7 | 14,272,601 | 191 | DGKB | 160 | Alligator | 6.5e-10 | Tetrapod |
| hAT-Tip100 | HAT-3_BF_tp | 4 | 4,696,650 | 164 | STX18 | 317 | Alligator | 0.0021 | Tetrapod |
| hAT-Tip100 | UN-Zaphod1_Ola_tp | 4 | 129,428,560 | 225 | C4orf33 | 4,033 | Human | 1.8e-11 | Tetrapod |
| hAT19 | hAT-39_LCh_tp | 1 | 3,746,554 | 188 | CCDC27 | 16 | Coelacanth | 1.5 | Sarcopterygian |
| hAT5 | hAT-13_HM_tp | 18 | 38,666,092 | 430 | CELF4 | 4,389 | Alligator | 7.1e-05 | Tetrapod |
| Crypton-A | CryptonA-1_OL_yr | 12 | 14,468,609 | 174 | ATF7IP | 9 | Alligator | 2.2e-70 | Gnathostome |
| Polinton | Polinton-1_Crp_px | 3 | 114,555,204 | 205 | ZBTB20 | 83 | Human | 2.8 | Tetrapod |
| Polinton | Polinton-1_AMi_atp | 20 | 52,577,239 | 269 | ZFP64 | 781 | Turtle | 4.4e-15 | Tetrapod |
| Polinton | Polinton-1_DR_px | 20 | 55,516,705 | 163 | CBLN4 | 1,346 | Turtle | 7.4e-11 | Tetrapod |

[a] Previously found by RepeatMasker.

Genomic data show ancient conservation and exaptation of these fossils (fig. 5, supplementary fig. S2). It can be seen that they lie in human genome regions conserved in nonmammals, and are not annotated by RepeatMasker. These regions have strong evolutionary conservation in mammals according to phastCons (Siepel et al. 2005), independent of their conservation in nonmammals. Some of these fossils overlap candidate regulatory elements or known transcription factor binding sites (Lesurf et al. 2016; Moore et al. 2020): the Hero fossil in figure 5 overlaps a CEBPB binding site, and the RTE fossil in figure 5 overlaps binding sites for GATA2, STAT1, JUND, FOS, and JUN. Figure 5 shows two Nimb fragments that coincide with conserved DNA segments: presumably they come from one Nimb insertion, which predates the amniote/amphibian divergence. (Only one of these Nimb fragments is aligned to frog: the other may be deleted or not detected in frog.)
Fig. 5.

Ancient conserved TE insertions in the human genome. Each panel shows, from top to bottom: TE protein fossils, alignments of the human genome to other vertebrate genomes, evolutionary conservation in mammals (phastCons), and repeats found by RepeatMasker. Some panels also show annotations of regulatory elements and Gencode genes (introns). Screen shots from http://genome.ucsc.edu.

These fossils clarify the historical presence of TE types in vertebrates. They make the presence of several TE types less patchy among vertebrates, thus explicable by vertical inheritance rather than horizontal transfer. Nimb-type LINEs have been found in insects, mollusks, teleost (bony) fish (Kapitonov et al. 2009; Chalopin et al. 2015), and turtle (Smit et al. 2015): here Nimb relics are found from ancient tetrapods, and also in coelacanth. This makes the presence of Nimb in vertebrates less patchy, and suggests vertical inheritance from the common ancestor of bony vertebrates. The Hero clade was found in sea urchin, lancelet, and fish (Kojima and Fujiwara, 2004; Kapitonov et al. 2009): its presence in ancient tetrapods fits with vertical inheritance from deuterostome ancestors. Hero LINEs are unusual in having a restriction-like endonuclease (Kojima and Fujiwara, 2004), unlike all other human LINE relics except Mam_R4. The I clade was previously found in fish and some invertebrates (Kapitonov et al. 2009): here hundreds are found in turtle and a few in alligator. Daphne was previously found in sea urchin and arthropods (Schön and Arkhipova, 2006), plus lancelet and zebrafish (Smit et al. 2015): here 67 fragments are found in coelacanth, 6 in chimera, 6 in turtle, and 4 in alligator, rounding out its historical presence in vertebrates. The Rex1/Babar clade has been found patchily in nonsarcopterygian fish excluding chimera, plus frog and lizard (Chalopin et al. 2015; Smit et al. 2015): here it is found in ancestral amniotes and also coelacanth and chimera, rendering its distribution nonpatchy. RepeatMasker distinguishes two types of RTE-like LINE: BovB and RTE; it finds only BovB in human, whereas it finds RTE in turtle and zebrafish. Previous reports of RTE in human seem to be BovB elements that were not classified separately (Kojima, 2018). This study finds RTEs from amniote ancestors, and thousands in coelacanth, again suggesting vertical inheritance from ancestors of bony vertebrates.

These fossils also provide support for TE origin of some genes. Ginger1 transposons were previously found in some invertebrates including lancelet (Bao et al. 2010), but not in sarcopterygians (Yuan and Wessler, 2011; Chalopin et al. 2015). Their relics are found here in alligator, turtle, coelacanth, and many in frog. This makes it more plausible that the human GIN1 gene was indeed exapted from Ginger1 in ancestral amniotes (Bao et al. 2010). At least one pre-amniote Ginger1 relic was also exapted for nonprotein-coding function (table 3). Similarly, hAT19 fossils from amniote ancestors support the hAT19 origin of the amniote-specific gene CGGBP1 (Yellan et al. 2021), which binds CGG repeats and regulates gene expression (Singh and Westermark, 2015). hAT19 fragments have been exapted for nonprotein-coding functions too. In the four tetrapod genomes just one hAT5 fragment is found, which is conserved in all of them: the single exapted relic of an ancient hAT5 infection (fig. 4). hAT5 was previously found in some invertebrates (Putnam et al. 2007) and fish (Smit et al. 2015), and is unusual in having 5 bp TSDs (target site duplications), whereas all previously known hATs have 8 bp TSDs (Putnam et al. 2007).

Anciently Conserved TE Fossils

The human genome contains diverse TE protein fossils that are older than the amniote/amphibian divergence (table 4). It is striking that they include nearly all major types of TE: LINEs, Penelope-like elements, LTR retrotransposons (Gypsy), YR retrotransposons (DIRS), DDE transposons, a Crypton, and Polintons. Eight of them (seven LINEs and a Crypton) are shared by human and chimera, making them older than the last common ancestor of all jawed vertebrates. Three of these oldest fossils are shown in figure 5: their ancient exaptation is supported by their conserved presence in mammal, reptile, and bony-fish genomes, their strong conservation in mammals (phastCons), and sometimes by evidence of regulatory function. A further 882 TE protein fossils that predate the mammal/reptile divergence were found in the human genome (table 5). Most of these (745, 84%) are novel (not annotated by RepeatMasker), as are all but one of the pre-tetrapod fossils (table 4). These ancient TE fossils are often in megabase-scale gene deserts or large ( bp) introns (tables 3 and 4). The nearest genes are significantly enriched in developmental functions such as nervous system development, cell morphogenesis, and axonogenesis (PANTHER GO overrepresentation test, Mi et al. 2021). Some other types of TE protein fossil in the human genome were never found to predate the mammal/reptile divergence: these are strikingly less diverse, just ERVs (endogenous retroviruses) and a handful of DNA transposon superfamilies (table 6).
Table 5.

Other Pre-amniote TE Protein Fossils in Human.

Type                Number    Of Which Not New[a]

Retrotransposons
CR1                 204       92
Dong-R4             1         0
Vingi               1         0
L1                  137       22
L1-Tx1              12        0
L2                  167       14
L2-Crack            25        1
BovB                28        1
RTE-X               4         0
Penelope            54        1
Gypsy               83        3
DIRS                37        0
Ngaro               28        0

DNA Transposons
Kolobok-T2          1         0
PIF/Harbinger       18        1
PiggyBac            7         0
TcMar-Mariner       1         0
TcMar-Pogo          2         0
TcMar-Tc1           4         0
TcMar-Tigger        5         0
hAT-Ac              7         1
hAT-Blackjack       12        1
hAT-Charlie         6         0
hAT-Tip100          21        0
Crypton-A           3         0
Polinton            14        0

[a] Previously found by RepeatMasker.
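As a quick consistency check, the counts in table 5 can be summed to reproduce the totals quoted in the text (882 fossils, of which 745, or 84%, are novel). The sketch below only transcribes the table; the names TABLE5 and totals are illustrative and not part of the published pipeline.

```python
# Counts transcribed from table 5: TE type -> (number, of which not new).
TABLE5 = {
    "CR1": (204, 92), "Dong-R4": (1, 0), "Vingi": (1, 0), "L1": (137, 22),
    "L1-Tx1": (12, 0), "L2": (167, 14), "L2-Crack": (25, 1), "BovB": (28, 1),
    "RTE-X": (4, 0), "Penelope": (54, 1), "Gypsy": (83, 3), "DIRS": (37, 0),
    "Ngaro": (28, 0), "Kolobok-T2": (1, 0), "PIF/Harbinger": (18, 1),
    "PiggyBac": (7, 0), "TcMar-Mariner": (1, 0), "TcMar-Pogo": (2, 0),
    "TcMar-Tc1": (4, 0), "TcMar-Tigger": (5, 0), "hAT-Ac": (7, 1),
    "hAT-Blackjack": (12, 1), "hAT-Charlie": (6, 0), "hAT-Tip100": (21, 0),
    "Crypton-A": (3, 0), "Polinton": (14, 0),
}


def totals(table):
    """Return (total, not_new, novel) fossil counts summed over all TE types."""
    total = sum(number for number, _ in table.values())
    not_new = sum(known for _, known in table.values())
    return total, not_new, total - not_new


total, not_new, novel = totals(TABLE5)
print(total, not_new, novel, f"{novel / total:.0%}")  # 882 137 745 84%
```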

Table 6.

TE Types in Human Never Found to be Pre-amniote.

Type                Number    Of Which Not New[a]

Retrotransposons
ERV1                17,653    16,951
ERVK                2,188     2,114
ERVL                22,541    21,076
ERVL-MaLR           19,140    18,111

DNA Transposons
MULE-MuDR           437       374
Merlin              37        35
TcMar-Tc2           783       751
hAT-Tag1            205       202
Helitron            23        21

[a] Previously found by RepeatMasker.

These TE relics from ancient vertebrates help us to understand the ancestral mobilome, which has been difficult, especially since TEs might have been horizontally transferred (Chalopin et al. 2015). For example, it has been suggested that mammal L1s were introduced by horizontal transfer into a common ancestor of therian (live-bearing) mammals (Ivancevic et al. 2018). We now have direct evidence that L1-like TEs were present in a common ancestor of jawed vertebrates, and hundreds of L1 fragments predate the amniote divergence (table 5). In fact, repeatmasker.org lists 353 L1 fragments in the platypus genome (ornAna1), so perhaps L1s were vertically inherited by mammals, but became inactive early in the monotreme lineage. There are also L1-Tx1 fossils from gnathostome ancestors (table 4): this supports the suggestion that L1 clades including Tx1 diverged in a common ancestor of mammals and fish (Ichiyanagi et al. 2007), which was not certain since Tx-like L1s are prone to horizontal transfer between marine hosts (Ivancevic et al. 2018). For other TE types too (DIRS, Polinton, and PIF/Harbinger), their previously noted patchiness among tetrapods (Chalopin et al. 2015) is explained by ancient loss of activity, since they were present in tetrapod ancestors. The emerging picture is that ancient vertebrates had many diverse types of TE, like present-day teleost fish but unlike mammals or birds (Chalopin et al. 2015). The pre-amniote BovB fossils (table 5) are particularly informative, because BovB has frequently been horizontally transferred (Ivancevic et al. 2018). Interestingly, the phylogeny of BovB elements differs greatly but not entirely from the phylogeny of their host organisms: amniote BovBs are all in a central branch of the tree and fish BovBs on outer branches (Ivancevic et al. 2018, fig. 2A).
Knowing that BovBs were present in amniote ancestors, it seems likely that BovB initially entered amniotes by vertical inheritance, perhaps specifically into squamate reptiles, before being horizontally transferred among amniotes and arthropod vectors. Regarding LTR retrotransposons, it is intriguing that ancient Gypsy-like fossils are found (tables 4 and 5), but ancient ERV (endogenous retrovirus) fossils are not (table 6). The origin of vertebrate retroviruses has been debated (Hayward, 2017): ERVs may have evolved from Gypsy-like elements in a common ancestor of amniotes and amphibians (Hellsten et al. 2010). The Crypton relic in ATF7IP (table 4) was found in a previous study (Kojima and Jurka, 2011), which showed that it inserted in a common ancestor of amniotes, and found a similar sequence in chimera. We can now push the age of this insertion back to the gnathostome ancestor (supplementary fig. S3). This is a similar age to other Crypton insertions that became protein-coding regions of vertebrate genes, including KCTD1 which is closely related to the ATF7IP Crypton (Kojima and Jurka, 2011). This suggests that active Cryptons may have been present in our ancestors only before the gnathostome divergence and not since. The ATF7IP Crypton has an intact open reading frame in some nonmammal vertebrates (Kojima and Jurka, 2011), including alligator and chimera (supplementary figs. S4–5): so it may have been exapted as a protein-coding sequence in gnathostome ancestors and lost function in mammals. The age of the oldest Polinton insertions is greatly increased from 95 million years (Barreat and Katzourakis, 2021) to ∼350 million years (the amniote/amphibian divergence). This age is inferred from homologous polinton fragments in (e.g.) turtle and frog, which are flanked by other turtle–frog homologies (supplementary figs. S6–7). 
So either these polintons independently inserted into homologous regions of amniote and amphibian genomes, or, more parsimoniously, they come from insertion in a common ancestor of amniotes and amphibians. Ancient insertion is also implied by human polinton relics that align to a wide range of mammals and amniotes in the UCSC genome alignments (fig. 6).
Fig. 6.

Ancient Polinton/Maverick fragments in an intron of ZBTB20 on human chromosome 3. The two fragments are colored blue-green and blue, with younger TE fossils in between (black).

These protein fossils might be much younger than their insertions, if the intact TE benefits host fitness and so remains intact (i.e. protein coding) through natural selection on the host. Intact TEs are usually thought not to benefit host fitness, but intact Polintons might protect the host from viruses, in particular iridoviruses that infect cold-blooded vertebrates (Barreat and Katzourakis, 2021). Nevertheless, the human Polinton relics are no longer intact, yet some have strong phastCons conservation in mammals indicating exaptation.

Conserved RepeatMasker Fossils

For sake of comparison, the age of previously known TE fossils (from RepeatMasker) was inferred in the same way. RepeatMasker includes many more TE fossils, especially nonprotein-coding SINEs. It is tuned to have a false-positive fraction of 0.2% (Hubley et al. 2016), which corresponds to false hits in the human genome. There are 133 RepeatMasker hits in human that are conserved in frog, of which 84 (63%) are especially ancient types of repeat: UCON, Eulor, LFSINE, and AmnSINE1 (Bejerano et al. 2006; Nishihara et al. 2006; Gentles et al. 2007). Most of these are unknown types of repeat, and may not be TEs. In contrast, there are 73 RepeatMasker hits in human that are conserved in coelacanth, which are not obviously enriched in ancient repeat types. They include primate-specific L1P and SVA elements, which are surely false-positive RepeatMasker annotations. A few may be real, but it is hard to know which ones or have confidence in them. Unfortunately, RepeatMasker files do not state the significance (E-value) of each hit. In summary, the oldest confident minimum age for previously known TE insertions (apart from TE-derived genes) is the amniote/amphibian divergence (Bejerano et al. 2006). This casts doubt on the previously reported TE insertions predating the human/teleost divergence (Lowe and Haussler, 2012). Aside from false RepeatMasker hits, that study mentioned no countermeasures for nonhomologous insertions (fig. 2). The tetrapod TEs found here (table 4) are almost completely disjoint from previously known ones: the latter are mostly unknown repeat types or SINEs. The newly found LINEs might be the autonomous counterparts of the ancient SINEs, in particular, AmnSINE1 was thought to be mobilized by an undiscovered L2-like LINE (Nishihara et al. 2006).

Genome Tectonics

Sometimes, two TE fossils of the same type lie strikingly near each other in the human genome. An example is in figure 6: two Polinton relics are separated by 44 kb, which is remarkably close considering there are only 40 Polinton fragments in the genome. They might come from two independent insertions into a Polinton hotspot, but a simpler explanation is that they come from one Polinton, and drifted apart due to younger TE insertions between them. It is well known that old TEs get fragmented by younger insertions, but it is interesting to consider how far apart they can drift. If there is a locally higher rate of insertion than deletion, this might over time produce large introns and gene deserts. Ancient fossils can be markers of such long-term rifting. Among the pre-amniote TE fossils, there are a few hundred such pairs separated by 30–3,000 kb.
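The search for such pairs can be sketched as follows. The text does not say exactly how pairs were counted, so this is only one plausible reading: it reports neighbouring fossils of the same TE type on the same chromosome whose gap lies within the stated 30–3,000 kb range. The function name, the tuple layout, and the example coordinates are all invented for illustration (they are not the coordinates of figure 6).

```python
from collections import defaultdict


def nearby_pairs(fossils, min_gap=30_000, max_gap=3_000_000):
    """Return neighbouring same-type fossil pairs whose gap is within range.

    fossils: iterable of (chrom, start, end, te_type), half-open coordinates.
    Each result is (chrom, te_type, first_span, second_span, gap_in_bp).
    """
    groups = defaultdict(list)
    for chrom, start, end, te_type in fossils:
        groups[(chrom, te_type)].append((start, end))

    pairs = []
    for (chrom, te_type), spans in sorted(groups.items()):
        spans.sort()
        # Compare each fossil with the next one of the same type.
        for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
            gap = s2 - e1
            if min_gap <= gap <= max_gap:
                pairs.append((chrom, te_type, (s1, e1), (s2, e2), gap))
    return pairs


# Hypothetical coordinates: two Polinton relics 44 kb apart on one chromosome.
example = [
    ("chr3", 100_000, 100_600, "Polinton"),
    ("chr3", 144_600, 145_000, "Polinton"),
    ("chr3", 120_000, 120_300, "L1"),
    ("chr5", 10_000, 10_500, "Polinton"),
]
print(nearby_pairs(example))
```

Restricting the comparison to neighbouring fossils of the same type is a design choice of this sketch; counting all same-type pairs within the distance range would give larger numbers.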

Host-Gene-Derived Protein Fossils

This study found 27,240 host-gene-derived protein fossils in the human genome, of which 4,303 (16%) are new: not in Gencode V37 or RefSeq pseudogenes, or RetroGenes V9 (Baertsch et al. 2008; Harrow et al. 2012; Pruitt et al. 2014). They do not overlap known protein-coding regions, but some may be unknown protein-coding exons rather than fossils. Frameshifts or premature stop codons are present in 71.3% of the new segments and 72.4% of the non-new ones, suggesting a similar (presumably low) fraction of unknown coding exons. Ancient fossils were sought in the same way as for TEs, but there is an extra difficulty. While we may find a fossil in the human genome that overlaps an alignment to (say) chimera, it might have encoded a functional protein for most of this evolutionary history, becoming a fossil only recently in the human lineage (Sheetlin et al. 2014). The aligned region of chimera was also required to be noncoding, but it may have independently become a fossil, or simply be an unannotated protein-coding exon. The nonhuman genomes presumably have less thorough gene annotation. Thus, ancient fossils were checked by manually examining UCSC phyloP graphs showing basewise evolutionary conservation in 100 vertebrates (Pollard et al. 2010). In some cases, there was a pattern of every third base being less conserved, indicating that natural selection conserved the encoded amino acids, for at least part of the history (supplementary fig. 8). In the end, two strong candidates were found for host-gene-derived fossils predating the last common ancestor of jawed vertebrates (fig. 7). These human regions are aligned to alligator, turtle, coelacanth, and chimera, and are not annotated as protein-coding in any of these genomes. The DNA–protein alignments have frameshifts (fig. 7 and ), and the basewise conservation does not suggest 3-periodicity (supplementary fig. 8). 
Their ancient conservation, and strong phastCons conservation in mammals, testifies to their exaptation for some critical but unknown function.
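The 3-periodicity check was done by eye on the phyloP graphs, but the idea can be illustrated numerically. The sketch below, with hypothetical function names and invented scores, averages per-base conservation scores by position modulo 3: a markedly lower mean at one phase would be the "every third base less conserved" pattern. It assumes a single uninterrupted reading frame, so a frameshift inside the window would blur the signal.

```python
def mean_by_codon_phase(scores):
    """Return the mean conservation score at positions 0, 1 and 2 modulo 3."""
    means = []
    for phase in range(3):
        values = scores[phase::3]
        means.append(sum(values) / len(values) if values else float("nan"))
    return means


def least_conserved_phase(scores):
    """Return the phase (0, 1 or 2) with the lowest mean score."""
    means = mean_by_codon_phase(scores)
    return min(range(3), key=lambda phase: means[phase])


# Invented scores in which every third base is weakly conserved.
scores = [5.0, 4.0, 0.5, 5.2, 4.4, 0.1, 4.8, 4.6, 0.3]
print(mean_by_codon_phase(scores), least_conserved_phase(scores))
```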
Fig. 7.

Ancient conserved pseudogenes in the human genome. (A) Match between endoplasmic reticulum resident protein 44 and chromosome 2, showing conservation in vertebrates. (B) Base-level alignment of the above. (C) Match between speckle-type POZ protein and chromosome 7. (D) Base-level alignment thereof.


Conclusions and Prospects

This study greatly increases the number and variety of Paleozoic protein fossils. Fossils of most major TE categories (except Helitrons) are found that predate the amphibian/amniote divergence. The oldest fossils, from both TEs and host genes, predate the last common ancestor of jawed vertebrates. The detection of some TE types in ancestral genomes makes their distribution in vertebrates less patchy, suggesting that ancient vertebrates had a high diversity of TEs that were vertically inherited in some lineages but lost activity in others. There are hints that marine or aquatic vertebrates are prone to horizontal TE transfer (Ivancevic et al. 2018; Zhang et al. 2020; Barreat and Katzourakis, 2021), which might explain the high ancestral diversity. These ancient fossils have strong sequence conservation, indicating exaptation, and some have evidence of regulatory function. Not only TEs but also host-gene fossils were anciently exapted with strong sequence conservation. Ancient fossils can be markers of long-term genome tectonics. It is hoped that these fossil-finding methods can easily be adapted for future studies. They are especially beneficial for finding TEs in less-studied genomes, reducing reliance on de novo repeat-finding and confusion between low copy-number TEs, multi-gene families, and TE-derived genes (Arkhipova, 2017; Makałowski et al. 2019). The fitting of substitution and gap rates could perhaps be improved: here it was done naively by comparing a genome to known TE proteins. The choice of sequence data for parameter-fitting seems important for finding ancient or unknown types of fossil. Fossil-finding could also be aided by ancestralizing the genome sequence, for example reverting recent substitutions and TE insertions. One promising application is paleovirology: Few Mesozoic and no Paleozoic viral fossils have been found so far (Barreat and Katzourakis, 2022). 
If Gypsy-like elements (Metaviridae) or Polintons are counted as viruses, Paleozoic fossils predating ∼350 million years are found here (table 4). A great challenge is to infer ancient genetic sequences from their fossil fragments, much as ancient organisms are inferred from mineral fossils. This inference might be assisted by LAST’s ability to estimate the probability that each column of a sequence alignment is correct.

Materials and Methods

The pipeline scripts are available at: https://gitlab.com/mcfrith/protein-fossils.

Genome Data

Genome sequences and their RepeatMasker annotations were downloaded from UCSC, NCBI, or repeatmasker.org (table 1). The human RepeatMasker annotations are from UCSC’s rmskOutCurrent file (last modified October 28, 2018). TE protein sequences were taken from the file RepeatPeps.lib in RepeatMasker version 4.1.2-p1. For each nonhuman genome, proteins encoded by host genes were taken from NCBI’s .faa file for that genome. For human, with the aim of getting reliable proteins, non-TE proteins with existence level 1–3 were taken from uniprot_sprot_human.dat in UniProt release 2021_02 (The UniProt Consortium, 2020). Protein-coding regions of the human genome were taken from the union of wgEncodeGencodeCompV37 and ncbiRefSeq from UCSC (Harrow et al. 2012; Pruitt et al. 2014). For each nonhuman genome, protein-coding regions were obtained from NCBI’s .gff file for that genome.

Finding Protein Fossils

The DNA/protein substitution and gap rates were found separately for each genome, by comparing it to the TE proteins, using LAST version 1250:

    lastdb -q -c myDB RepeatPeps.lib
    last-train -P8 --codon -X1 --pid=50 myDB genome.fa > te.train

The -q option appends a stop symbol * to each protein, which can be matched to (fossil) stop codons (e.g. supplementary fig. S9). The --pid=50 option makes it only use homologies with ≤ 50% amino-acid identity, with the aim of focusing on old fossils. Next, the genome was matched to TE and host-gene proteins:

    fasta-nr hostProteins RepeatPeps.lib | lastdb -q -c pDB
    lastal -D1e9 -K0 -m500 -p te.train pDB genome.fa > aln.maf

Option -D1e9 sets the significance threshold to one false hit per 10^9 bp, -K0 omits hits that overlap stronger hits in the genome, and -m500 makes it slower and more sensitive. (With lower values of m, occasionally a host-gene-derived fossil was missed and instead wrongly aligned to a TE protein.) Note that the E-values output by lastal are per-chromosome, whereas the E-values in this article are per-genome. It turns out the RepeatMasker proteins include exapted genes: they were excluded, by omitting hits to proteins whose names contain _HSgene, _Hsa_, UN-GIN, or _Xtr_eg_tp. Finally, alignments >10% covered by protein-coding annotation were removed, as were host-protein alignments >10% covered by RepeatMasker TE annotations other than Low_complexity and Simple_repeat.
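The exclusion of exapted genes by protein name can be sketched as a simple substring filter. This is a minimal illustration rather than the published pipeline script: the function names and the hit tuple layout are assumptions, and the two protein names in the example are invented.

```python
# Name fragments that, according to the text, mark exapted genes
# among the RepeatMasker proteins.
EXCLUDED_NAME_PARTS = ("_HSgene", "_Hsa_", "UN-GIN", "_Xtr_eg_tp")


def is_exapted_gene(protein_name):
    """True if the protein name contains any of the excluded fragments."""
    return any(part in protein_name for part in EXCLUDED_NAME_PARTS)


def drop_exapted_hits(hits):
    """Keep only hits whose protein name is not flagged as an exapted gene.

    hits: list of (protein_name, chrom, start, end) tuples (assumed layout).
    """
    return [hit for hit in hits if not is_exapted_gene(hit[0])]


# Hypothetical protein names, invented purely for illustration.
hits = [
    ("exampleTE_pol#LINE/L2", "chr1", 100, 400),
    ("exampleGene_HSgene#DNA/hAT", "chr2", 500, 900),
]
print(drop_exapted_hits(hits))
```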

Genome Alignments

As described above, new pair-wise genome alignments were made, with the aim of finding orthologous segments and avoiding nonhomologous insertions (fig. 2) as accurately as possible. They were made like this:

    lastdb -P8 -uMAM8 gDB genome1.fa
    last-train -P8 --revsym -D1e9 --sample-number=5000 gDB genome2.fa > g.train
    lastal -P8 -D1e9 -m100 -p g.train gDB genome2.fa | last-split -fMAF+ > many-to-one.maf
    last-split -r many-to-one.maf | last-postmask > one-to-one.maf

The -uMAM8 and -m100 options make it extremely slow and sensitive (Frith and Noé, 2014). These one-to-one alignments are available at https://github.com/mcfrith/last-genome-alignments. Next, isolated alignments were removed by defining two alignments to be “linked” if, in both genomes, they are separated by at most 10^6 bp and by at most five other alignments. Alignments were retained if linked, directly or indirectly, to at least two others.
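The removal of isolated alignments can be sketched with a union-find structure. This is an interpretation of the linking rule, not the pipeline's own code: it assumes that alignments are non-overlapping (one-to-one), that "separated by at most five other alignments" means a rank difference of at most six along each genome, and it ignores strand. The function names and tuple layout are hypothetical.

```python
def gap(a_start, a_end, b_start, b_end):
    """Distance in bp between two intervals (0 if they touch or overlap)."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def keep_linked(alns, max_gap=10**6, max_between=5, min_others=2):
    """Return alignments linked, directly or indirectly, to >= min_others others.

    alns: list of (chrom1, start1, end1, chrom2, start2, end2) tuples.
    """
    n = len(alns)
    order1 = sorted(range(n), key=lambda i: alns[i][0:3])
    order2 = sorted(range(n), key=lambda i: alns[i][3:6])
    rank2 = {index: rank for rank, index in enumerate(order2)}

    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for r, i in enumerate(order1):
        # Candidates: the next (max_between + 1) alignments along genome 1.
        for j in order1[r + 1 : r + 2 + max_between]:
            a, b = alns[i], alns[j]
            if a[0] != b[0] or a[3] != b[3]:
                continue  # different chromosome in either genome
            if abs(rank2[i] - rank2[j]) > max_between + 1:
                continue  # too many alignments between them in genome 2
            if gap(a[1], a[2], b[1], b[2]) > max_gap:
                continue
            if gap(a[4], a[5], b[4], b[5]) > max_gap:
                continue
            parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    # A component of size k links each member to k - 1 others.
    return [alns[i] for i in range(n) if sizes[find(i)] > min_others]


example = [
    ("chr1", 0, 1000, "chrA", 0, 1000),
    ("chr1", 5000, 6000, "chrA", 5000, 6000),
    ("chr1", 9000, 10000, "chrA", 9000, 10000),
    ("chr1", 5_000_000, 5_001_000, "chrB", 0, 1000),  # isolated
    ("chr2", 0, 1000, "chrC", 0, 1000),               # pair only
    ("chr2", 2000, 3000, "chrC", 2000, 3000),
]
print(keep_linked(example))  # keeps only the three chr1/chrA alignments
```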

Ancient Protein Fossils

A protein fossil was inferred to be ancient if it overlaps an inter-genome alignment. However, spurious overlaps are caused by the DNA–protein or inter-genome alignments overshooting beyond the end of homology: this often happens when the fossil is near a protein-coding exon. Therefore, the set of alignments between two genomes was reduced to those that do not overlap protein-coding annotations in either genome, and then each fossil was considered conserved if at least 30% of it is covered by alignments between those two genomes. This 30% threshold was determined empirically (supplementary fig. S9). There is likely a better way using LAST’s ability to estimate the probability of each column in an alignment.
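The conservation test described here can be sketched as an interval-coverage calculation. This is a hedged illustration only: the function names, the tuple layouts, and the dictionaries of coding intervals are assumptions, and coordinates are taken to be half-open.

```python
def overlaps_any(start, end, intervals):
    """True if [start, end) overlaps any (s, e) interval."""
    return any(s < end and start < e for s, e in intervals)


def covered_fraction(start, end, intervals):
    """Fraction of [start, end) covered by the union of the intervals."""
    covered = 0
    pos = start  # everything left of pos has already been accounted for
    for s, e in sorted(intervals):
        s, e = max(s, pos), min(e, end)
        if s < e:
            covered += e - s
            pos = e
    return covered / (end - start)


def is_conserved(fossil, alns, coding1, coding2, min_frac=0.30):
    """Apply the 30% rule to one fossil in genome 1.

    fossil: (chrom, start, end) in genome 1.
    alns: list of (chrom1, start1, end1, chrom2, start2, end2).
    coding1, coding2: dict of chrom -> list of (start, end) coding intervals.
    """
    chrom, start, end = fossil
    usable = [
        (a[1], a[2])
        for a in alns
        if a[0] == chrom
        and not overlaps_any(a[1], a[2], coding1.get(a[0], []))
        and not overlaps_any(a[4], a[5], coding2.get(a[3], []))
    ]
    return covered_fraction(start, end, usable) >= min_frac


fossil = ("chr1", 1000, 2000)
alns = [
    ("chr1", 900, 1200, "chrA", 50, 350),     # covers 200 bp of the fossil
    ("chr1", 1500, 1650, "chrA", 700, 850),   # covers 150 bp of the fossil
]
print(is_conserved(fossil, alns, {}, {}))                      # True (35%)
print(is_conserved(fossil, alns, {}, {"chrA": [(800, 900)]}))  # False (20%)
```

Dropping an alignment entirely when it overlaps coding annotation, rather than trimming it, follows the wording of the text.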

Novelty

In table 1, a TE fossil was deemed novel if at most 10% of it is covered by RepeatMasker annotations of TEs, with known “class/family,” that are on the same DNA strand. In tables 4–6, slightly different criteria were used. A TE fossil was deemed “not new” if it has nonzero overlap with a RepeatMasker genome annotation on the same DNA strand, of the same “class” (DNA, LINE, LTR, etc.). A host-gene protein fossil was deemed novel if at most 10% of it overlaps same-strand known pseudogenes.
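The "not new" criterion used for tables 4–6 can be sketched as below. The function name and tuple layouts are hypothetical; the sketch assumes that a RepeatMasker "class/family" string such as LINE/L2 carries the class before the slash, and that coordinates are half-open.

```python
def is_not_new(fossil, rmsk_annotations):
    """True if the fossil overlaps a same-strand, same-class RepeatMasker hit.

    fossil: (chrom, start, end, strand, te_class), e.g. te_class "LINE".
    rmsk_annotations: list of (chrom, start, end, strand, class_family),
        where class_family looks like "LINE/L2".
    """
    chrom, start, end, strand, te_class = fossil
    for r_chrom, r_start, r_end, r_strand, class_family in rmsk_annotations:
        same_class = class_family.split("/")[0] == te_class
        overlaps = r_start < end and start < r_end  # any nonzero overlap
        if r_chrom == chrom and r_strand == strand and same_class and overlaps:
            return True
    return False


# Invented annotations for illustration.
rmsk = [
    ("chr1", 1000, 1300, "+", "LINE/L2"),
    ("chr1", 5000, 5400, "-", "DNA/hAT-Charlie"),
]
print(is_not_new(("chr1", 1200, 1800, "+", "LINE"), rmsk))  # True
print(is_not_new(("chr1", 1200, 1800, "-", "LINE"), rmsk))  # False
```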

Nearest genes

The nearest genes were found from among those with NM_ accession numbers in ncbiRefSeqCurated from UCSC.
  70 in total

1.  Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization.

Authors:  Nicholas H Putnam; Mansi Srivastava; Uffe Hellsten; Bill Dirks; Jarrod Chapman; Asaf Salamov; Astrid Terry; Harris Shapiro; Erika Lindquist; Vladimir V Kapitonov; Jerzy Jurka; Grigory Genikhovich; Igor V Grigoriev; Susan M Lucas; Robert E Steele; John R Finnerty; Ulrich Technau; Mark Q Martindale; Daniel S Rokhsar
Journal:  Science       Date:  2007-07-06       Impact factor: 47.728

2.  Repbase Update, a database of repetitive elements in eukaryotic genomes.

Authors:  Weidong Bao; Kenji K Kojima; Oleksiy Kohany
Journal:  Mob DNA       Date:  2015-06-02

3.  Alu-mediated inactivation of the human CMP- N-acetylneuraminic acid hydroxylase gene.

Authors:  T Hayakawa; Y Satta; P Gagneux; A Varki; N Takahata
Journal:  Proc Natl Acad Sci U S A       Date:  2001-09-18       Impact factor: 11.205

Review 4.  Paleovirology of the DNA viruses of eukaryotes.

Authors:  Jose Gabriel Nino Barreat; Aris Katzourakis
Journal:  Trends Microbiol       Date:  2021-09-02       Impact factor: 17.079

5.  On "genomenclature": a comprehensive (and respectful) taxonomy for pseudogenes and other "junk DNA".

Authors:  J Brosius; S J Gould
Journal:  Proc Natl Acad Sci U S A       Date:  1992-11-15       Impact factor: 11.205

Review 6.  Overcoming challenges and dogmas to understand the functions of pseudogenes.

Authors:  Seth W Cheetham; Geoffrey J Faulkner; Marcel E Dinger
Journal:  Nat Rev Genet       Date:  2019-12-17       Impact factor: 53.242

7.  Crypton transposons: identification of new diverse families and ancient domestication events.

Authors:  Kenji K Kojima; Jerzy Jurka
Journal:  Mob DNA       Date:  2011-10-19

Review 8.  CGGBP1--an indispensable protein with ubiquitous cytoprotective functions.

Authors:  Umashankar Singh; Bengt Westermark
Journal:  Ups J Med Sci       Date:  2015       Impact factor: 2.384

Review 9.  Using bioinformatic and phylogenetic approaches to classify transposable elements and understand their complex evolutionary histories.

Authors:  Irina R Arkhipova
Journal:  Mob DNA       Date:  2017-12-06

10.  Horizontal transfer of BovB and L1 retrotransposons in eukaryotes.

Authors:  Atma M Ivancevic; R Daniel Kortschak; Terry Bertozzi; David L Adelson
Journal:  Genome Biol       Date:  2018-07-09       Impact factor: 13.583
