Eukaryotic genomes contain many endogenous retroviral sequences (ERVs). ERVs are often severely mutated, therefore difficult to detect. A platform independent (Java) program package, RetroTector (ReTe), was constructed. It has three basic modules: (i) detection of candidate long terminal repeats (LTRs), (ii) detection of chains of conserved retroviral motifs fulfilling distance constraints and (iii) attempted reconstruction of original retroviral protein sequences, combining alignment, codon statistics and properties of protein ends. Other features are prediction of additional open reading frames, automated database collection, graphical presentation and automatic classification. ReTe favors elements >1000-bp long due to its dependence on order of and distances between retroviral fragments. It detects single or low-copy-number elements. ReTe assigned a 'retroviral' score of 890-2827 to 10 exogenous retroviruses from seven genera, and accurately predicted their genes. In a simulated model, ReTe was robust against mutational decay. The human genome was analyzed in 1-2 days on a LINUX cluster. Retroviral sequences were detected in divergent vertebrate genomes. Most ReTe detected chains were coincident with Repeatmasker output and the HERVd database. ReTe did not report most of the evolutionary old HERV-L related and MalR sequences, and is not yet tailored for single LTR detection. Nevertheless, ReTe rationally detects and annotates many retroviral sequences.
Eukaryotic genomes contain many endogenous retroviral sequences (ERVs). ERVs are often severely mutated, therefore difficult to detect. A platform independent (Java) program package, RetroTector (ReTe), was constructed. It has three basic modules: (i) detection of candidate long terminal repeats (LTRs), (ii) detection of chains of conserved retroviral motifs fulfilling distance constraints and (iii) attempted reconstruction of original retroviral protein sequences, combining alignment, codon statistics and properties of protein ends. Other features are prediction of additional open reading frames, automated database collection, graphical presentation and automatic classification. ReTe favors elements >1000-bp long due to its dependence on order of and distances between retroviral fragments. It detects single or low-copy-number elements. ReTe assigned a 'retroviral' score of 890-2827 to 10 exogenous retroviruses from seven genera, and accurately predicted their genes. In a simulated model, ReTe was robust against mutational decay. The human genome was analyzed in 1-2 days on a LINUX cluster. Retroviral sequences were detected in divergent vertebrate genomes. Most ReTe detected chains were coincident with Repeatmasker output and the HERVd database. ReTe did not report most of the evolutionary old HERV-L related and MalR sequences, and is not yet tailored for single LTR detection. Nevertheless, ReTe rationally detects and annotates many retroviral sequences.
Reference retroviral sequences were collected from GenBank. Whole genomic sequences (human genome versions hg15, hg16, hg17 and hg18, chimpanzee genome versions panTro1 and panTro2, wild red jungle fowl genome versions galGal1 and galGal3, dog genome canFam2, Rhesus macaque rheMac2, and Mouse genome mm8) were downloaded via the UCSC Genome Browser (http://genome.ucsc.edu/). Results from the analyses of the various assemblies will be published separately. Most of the results in this methodological paper are based on early versions (hg15 and hg16) of the human genome assembly. However, more recent data, from hg18, panTro2, rheMac2, canFam2, musMus8 and galGal3, and the corresponding RepeatMasker output files, are also included. Reference retroviral sequences RSV (J20342 and NC_001407), ALV (NC_001408), MMTV (NC_001503), MPMV (NC_001550), JSRV (NC_001494), FLV (NC_001940), MLV (J02255 and NC_001501), HTLV1 (NC_001436), HTLV2 (M10060 and NC_001488), WDSV (NC_001867), Snakehead retrovirus (NC_001724), Xen1 (AJ506107), HIV (K03455 and NC_001802) and HFV (NC_001736), were analyzed with ReTe. The errantiviruses ZAM (AJ00387) and CER1 (U15406) were also analyzed.
Hardware
The genomic analyses were performed on (i) seven Dell Optiplex 260 office computers with 2–2.5 GHz Pentium processors, and 40 GB hard disks, and (ii) at the Uppmax (www.uppmax.uu.se) computer cluster of AMD Opteron 250, 850 and 875 CPUs running Scientific Linux 4.2. The latter configuration yielded 2–5 times shorter execution times.
Algorithms
Fragment threading
We coined this term to describe the central procedure in ReTe. It depends on a database of conserved motifs, constraints on the distances between motif ‘hits’ and a matrix of similarities between amino acids. From a programmer's point of view, ‘motifs’ are procedures for detection of conserved ERV traits in the face of mutations. The bulk of the motifs operate through simple comparison (using the acid similarity matrix) against a conserved amino acid sequence, but there are several other types (Table 1, which also includes motifs for other purposes). Each motif is connected to one or more retrovirus genera (at present alpha-, beta-, gamma-, delta-, epsilon-, spuma- and lentiretroviruses, and the related viruses Gypsy and Copia). The constraints on the distances between motif ‘hits’ are based on the position distances in known retroviruses, extended with a ‘safety margin’. At present the motifs and constraints are adapted primarily to vertebrate, especially primate, sequences, but are also flexible to change in order to accommodate other sequence analyses.
Table 1.
Motifs utilized by ReTe at present. Some of the Motifs are used in the ‘fragment threading’, others in LTR search or in putein construction
Name
Used in
Characteristics
AcidMotif
RetroVID
Compares to conserved amino acid sequence
AcidNNMotif
RetroVID
Uses neural network trained on known peptides
SplitAcidMotif
RetroVID
Like several AcidMotifs at prescribed distances
BaseMotif
RetroVID, LTRID
Compares to conserved nucleotide sequence
HyPhobMotif
RetroVID
Searches for hydrophobic region in Gag
LTRMotif
RetroVID
Encapsulates LTR candidate found by LTRID
PPTMotif
RetroVID
Special algorithm for polypurine tract
SpliceAcceptorMotif
ORFID
Searches for splice acceptor consensus
SpliceDonorMotif
ORFID
Searches for splice donor consensus
SlipperyMotif
ORFID
Searches for XXXYYYZ
ProteaseCleavageMotif
ORFID
Searches for wPf, fPv and yPi
PseudoKnotMotif
ORFID
Searches for pseudoknot-like structures
FrameShifterMotif
ORFID
Combination of SlipperyMotif and PseudoKnotMotif
PuteinStartMotif
ORFID
Compares to leading amino acid sequences in known proteins
PuteinStartMotif
ORFID
Compares to trailing amino acid sequences in known proteins
SiSeqMotif
ORFID
Scores with weight matrix for signal sequence of von Heijne
Motifs utilized by ReTe at present. Some of the Motifs are used in the ‘fragment threading’, others in LTR search or in putein constructionThe principle of ‘fragment threading’ is illustrated in Figure 1. The most likely of the often many possible combinations of motif hits is chosen according to a heuristic procedure. Motif hits are combined into ‘chains’ satisfying distance constraints, corresponding to potential ERVs, though ‘broken’ chains violating one or two constraints are also possible, to account for ERVs containing insertions or deletions (indels). To evaluate the chain, it is assigned a score and a retroviral genus (or more than one in ambiguous cases) through a vector procedure: Each motif hit is assigned a vector. Its direction is dependent on its genus and its length depends on a weight factor for the motif (see Supplementary Data S3), and how well the hit fits the motif. The motif-hit vectors are summed (with some modifications) into a vector for the whole chain. The length of this vector determines the chain score and its direction determines the retroviral genus assigned.
Figure 1.
The principle of ‘fragment threading’. Three motif hits (RT1, RT2 and RT4) are within accepted distances from each other, whereas one (RT3) is not. Motifs RT1, RT2 and RT4 can therefore be utilized to build a proviral chain.
The principle of ‘fragment threading’. Three motif hits (RT1, RT2 and RT4) are within accepted distances from each other, whereas one (RT3) is not. Motifs RT1, RT2 and RT4 can therefore be utilized to build a proviral chain.‘Fragment threading’ is simple in principle, but in order not to miss mutilated or previously unknown ERVs, the motif hit and distance constraints must be so lax that an exhaustive search of all possible combinations is not practical. This ‘combinatorial explosion’ has been countered in several ways: (i) The search is hierarchical. The motifs are grouped into 14 ‘subgenes’ (5′LTR, PBS, MA, CA, NC, DU, PR, RT (incl. RNH), DL, IN, SU, TM, PPT and 3′LTR) according to established retrovirus terminology (DL is here used to denote a dUTPase sequence integrated in the integrase region). Exhaustive ‘fragment threading’ of motif hits is applied within each subgene (except the first two and last two, which contain only one motif each) to generate subgene hits. The subgene hits are then threaded to form chains, with a limit on the number of hits tried for each subgene. (ii) Another limit is set on the length of gaps in the subgene sequence. (iii) A subset of the motifs (notably the PBS and PPT subgenes) is normally not used in the primary search, but only in refining already found chains. (iv) Long sequences are split into chunks (typically 115-kb long with 15-kb overlap) before processing. The 15-kb overlap is sufficient to minimize loss of ERVs, normally up to 10-kb long (1), in the sequence chunk border region.
Sequence statistics
The algorithms below utilize results from two unpublished studies by Blomberg:A search for oligomers which differed in frequency between ORFs and the two alternative reading frames was made in the collection of reference retroviral sequences. A systematic evaluation of gag, pro, pol and env sequences showed that hexamers yielded higher ORF selectivities than tri-, tetra- and pentamers. For example, four especially selective hexamers were GATACG, CGCAGG, CTAGAA and GAAGAT, which were 4.1–6 times more frequent in retroviral ORFs relative to the two alternative reading frames. They encode the dipeptides DT, RR, LE and ED, respectively. It seems that these combinations are less likely to occur by chance in overlapping non-coding retroviral reading frames. A list of 31 hexamers with 3–6 times selectivity for an ORF in that reading frame, relative to the other two frames, is included in ReTe. Its negative control was provided by a list of 700 non-ORF hexamers. These had no increased frequency in ORFs relative to non-ORFs.A set of LTR selective split octamers was collected after a systematic evaluation of split octamer motifs in the database of reference retroviral genomes. All combinations of two tetramers occurring within a distance range of 0–60 nt positions from each other were tested for LTR selectivity versus (i) retroviral non-LTR sequences, and (ii) a random sequence of 100 Mb. This resulted in a list of 1256 binary tetramer combinations, each with a range for the distance between them. The LTR/non-LTR selectivities were 10–165, whereas LTR/random selectivities for the same set were 0.91–44.54. The two selectivity criteria varied rather independently, indicating a considerably non-random distribution of split octamers in retroviral non-LTR sequences. For example, four especially selective combinations were TCTG<8–12>CCCC, CCCC<6–8>CACC, GACA<10–14>CTGT and GTGC<6–8>AACA, which all were 12–165 times LTR/non-LTR selective and 3–45 times LTR/random selective. The functional basis behind this selectivity is obscure. A similar approach was used before (16–18).
Alignment through dynamic programming
Several variants of this standard procedure are employed and will be referred to by these abbreviations:A1: Pairwise nucleotide alignment, essentially according to Huang (19).A2: Nucleotide alignment which is aborted if it does not seem promising; thus non-exhaustive but fast.A3: Pairwise amino acid alignment, essentially according to Huang (19), using the amino acid similarity matrix.A4: Alignment of a nucleotide sequence to a set of known, aligned peptides, between two predetermined endpoints. Path elements are scored either, depending on which is greater, (a) by the similarity score between the codon in the sequence and the best available amino acid in the alignment or (b) by an estimate of the general suitability of the reading frame of the codon, based on stop codon density, glycosylation site density (for Env), reading frame of nearby motif hits and presence of nucleotide hexamers known to be frequent in or outside retroviral ORFs, respectively. This was progressively evaluated over a window of 200 nt.
Neural networks
Standard multilayered perceptron networks trained by back projection, see e.g. (20).
Implementation
Design
ReTe is written in Java and should run on any computer with Java runtime 1.4.3 or later. For full functionality an SQL database manager, preferably MySQL, should be available. ReTe has been extensively used and tested under the Windows (with Sun Java runtime), MacOS X 10.4 (i.e. UNIX), LINUX (Red Hat 9, Shrike) and Scientific Linux operating systems. In the programming, the shareware source http://www.jibble.org/epsgraphics/ (.eps file module) has been utilized.ReTe is designed for searching entire genomes, i.e. the algorithms were chosen for speed rather than refinement. Also, information outside the ERV proper, such as integration repeats, also referred to as ‘target site duplications’, is utilized by ReTe. The design is flexible, with numerous variable parameters and facilities for plugin extensions (in the form of Java classes). Further information and documentation about ReTe is available at the URL: http://www.kvir.uu.se/RetroTector/RetroTectorProject.html.The variable parameters allow the user to adjust the unavoidable tradeoff between speed, sensitivity and selectivity. The standard settings (see the URL above) provide for the processing of a genome in about one month processor time (2 GHz Pentium) with sensitivity prioritized over selectivity. The processing time can be drastically reduced by running the program on a cluster of faster processors. The selectivity may be increased in retrospect by disregarding low-scoring results.ReTe contains modules for various operations that can be executed from a menu. However, for many of them execution is normally initiated by a script file, generated by another module, containing necessary information. The script files thus serve to connect the different modules. One of the modules (SweepScripts) handles automatic execution of scripts, so that an entire chromosome can be processed automatically.
Sequence of operations
Typically, in analyzing a chromosome, the following procedure is performed automatically (Figure 2): (i) the module SweepDNA cuts the DNA sequence into chunks as mentioned above. It also tries to identify ALUs and LINE L1 fragments and possibly other frequent nonretroviral species-specific transposons, using algorithm A2. These are excluded from the further analyses. (ii) The LTRID module identifies possible LTRs, paired and unpaired. (iii) The RetroVID module identifies possible ERVs through ‘fragment threading’, also utilizing the LTR candidates found by LTRID as motif hits belonging to the 5′LTR and 3′LTR subgenes. For chains exceeding a score threshold, it also generates scripts for the ORFID and XonID modules. If a chain has no motif hits in Env, but there are hits in IN and 3′LTR separated by a motif-empty stretch, suggesting the presence of at least a fragment of env, RetroVID generates a script for EnvTracer. (iv) The ORFID module generates putative proteins, ‘puteins’, in an attempt to reconstruct the original retroviral Gag, Pol, Pro and Env. (v) The XonID module gives hints about possible exons not found by ORFID. (vi) EnvTracer attempts to find a likely Env, employing similar principles as XonID. (vii) The CollectGenome module collects the selected output data into an SQL database.
‘LTRID’ is a module that identifies potential LTRs (Figure 3). LTRID first aims to find the polyadenylation signal, always present in an LTR, either as the characteristic sequence (AATAAA, ATTAAA or AGTAAA) or as a high score by a neural network trained for this purpose. It detects the R-U5 portion of LTRs of many Alpha-, Beta- and Gammaretroviruslike sequences. After detection of the polyadenylation signal, LTRID calculates a score for the LTR candidate by searching for other LTR characteristics (GT accumulation, a further neural network, TATA box, characteristic nucleotide sequences, binding sites of selected transcription factors, CpG-rich regions) within realistic distances from the polyadenylation signal. All of these LTR-specific features were found using the database of annotated reference retroviral genomes (Blomberg, J., unpublished data).
Within ReTe, there are about 20 other command modules, some for visualization and storage of results, others mainly for maintenance and debugging. Of particular interest are the modules:‘PseuGID’ is applied if a chain has none or very short LTRs, suggesting that the ERV may be a processed pseudogene (24). RetroVID generates a script for this module, which searches for the structures characteristic of processed pseudogenes. This function has not yet been fully tested.‘Chainview’ displays one chain at a time in detailed graphics (Figure 4; Supplementary Data S2) and in detailed text. Apart from the motif hits and the course of the predicted chain, it may also display an analysis of LTR structure, puteins and EnvTracer and XonID output related to the chain, start and stop codons, splice donors/acceptors and several other features.
Figure 4.
Chainview picture of HIV and MLV. Symbols are explained below the proviral renditions.
Chainview picture of HIV and MLV. Symbols are explained below the proviral renditions.‘Puteinview’ shows details, such as the full amino acid sequence of a putein or an exon suggested by XonID or EnvTracer.‘CollectGenome’: The mass of text files containing all the results from a ReTe analysis may be forbidding. This module extracts selected results and collects them into an SQL database, with separate tables for LTR candidates, chains and puteins. Thereby it also compares each chain to a set of RepBase consensus sequences and a set of known annotated retrovirus sequences, using algorithm A1, and Pol puteins to a set of known Pol proteins (using algorithm A3) for classification purposes. Judgment of the content compared to nonretroviral repetitive sequences is conducted as an internal control.‘Genomeview’ may be used to inspect the database generated by CollectGenome. It shows the distribution of LTR candidates and chains within the chromosomes. Chains and their relation to the RepBase reference sequences and known retroviruses may also be viewed graphically (with less detail than Chainview).‘RetroTectorShell’: This is a Windows-specific program separate from ReTe, written in Visual FoxPro (VFP). It was developed in parallel with the Java ReTe kernel for user interaction and data handling. It performs similar functions as CollectGenome and Genomeview, but has a somewhat different profile. Features so far found only in RetroTectorShell are: (i) A mechanism for collecting ReTe output into a VFP table. Each genome's retroviral content is thus contained in a single table. It disregards single LTRs and alternative ORFs. (ii) A BLAST-like algorithm for searching chains in this table according to protein or nucleic acid similarity.
RESULTS
Evaluation of ReTe using an artificial data set
A 3 × 109 nt stretch of random nucleotide sequence with equal A, T, G and C frequencies was run. Two chains with a score above the minimum (250), 258 and 273, resulted. Thus, none were above 300. The experience from many genomic analyses, primarily from the human genome versions hg15–hg18, has shown that a score cutoff of 300 almost totally eliminates spurious chains results from motif-hit combinations occurring by chance (Figure 7; ROC curves in Supplementary Data S10–S15).
Figure 7.
Chain scores with retroviral, retroviruslike and random sequences. Retroviruslike (errantiviral; gypsy elements) sequences of slime molds, insects and plants are shown to the left. Epsilonretroviral sequences of amphibians and fish are shown to the right. Scores of chains detected in a 108 random sequence are shown below the cutoff. The chains from 108 random nucleotides were obtained before the changed settings described in the text.
Evaluation of ReTe with simulated mutated retroviral sequences
In order to test the detection limits of ReTe with regard to degraded retroviral sequences (i.e. very old insertions), a test set with artificially degraded sequences was created. These sequences were based on the complete genome of HIV-1, isolate MNCG (Genbank accession no M17449), which was degraded according to four different mutational models, with decay ranging from 1 to 60% mutation (see Supplemental Data S1 for further details). The resulting test set was analyzed with ReTe and also with BLAST (see below), for comparison. Four different mutational models were selected: (i) Random substitutions with equal probabilities (Jukes–Cantor); (ii) Higher probability of transition over transversion (Kimura 2-parameter model); (iii) The Kimura 2-parameter model with insertions and deletions, with frequencies of indels applying to human pseudogenes (24) (‘indel model’); (iv) Simulation of an active retrovirus, where both endogenous and exogenous phases are subjected to purifying selection during each round of infection and replication (‘Exogenous model’). Mutations harming features important for the retroviral function will not persist in the viral population. HIV is a highly replicative retrovirus. The Los Alamos National Laboratory (http://hiv-web.lanl.gov/content/index) presents a large database with sequence data for many subtypes, including a set of full-length HIV genomes aligned with respect to nucleotide codon triplets. The ‘exogenous’ model uses this aligned data set to test which mutations are allowed.The ReTe analysis utilized default settings. BLAST (version 2.2.6, obtained from NCBI, http://www.ncbi.nlm.nih.gov/), run locally under Linux Red Hat 9, with default settings except that word length was 7. The sequences were matched against the reference sequence HIVMNCG, as well as against the entire non-redundant nucleotide database (see Supplementary Data S1 for further details).
Sensitivity and specificity
In the simulated evolution model based on HIV, ReTe chain detection was more resistant to mutational decay than BLAST sequence detection. ReTe also attempts to reconstruct the retroviral proteins. Sequence similarity in the form of percent identity between the Pol puteins and the original HIVMNCG Pol protein, shows that ReTe can detect even extensively mutated and evolutionarily distant Pol sequences (Figures 5, 6 and 7; Supplementary Data S1).
Figure 5.
Simulation of mutation of an endogenous and an exogenous retrovirus. Average scores for 20 sequences at each level of mutation, divided by maximum score for unmutated HIVMNCG, when analyzed with ReTe and BLAST, are shown. (a) Normalized score for sequences in the endogenous (indel) model. ReTe is more tolerant to mutation than BLAST, when scoring a sequence as retroviral. (b) Normalized score for sequences in the exogenous model. These sequences receive high scores throughout the analysis when analyzed with ReTe. BLAST, on the other hand, does not as readily recognize the sequences as descendant from HIVMNCG. Further information is given in the Supplementary Data, S1.
Figure 6.
Sequence similarity (percent identity, gap positions excluded) for the Pol puteins compared to the unmutated HIVMNCG Pol protein. Average for 20 puteins at each level of mutation. (a) Endogenous (indel) model puteins: sequence identity. (b) Exogenous model puteins: sequence identity. Further information is given in the Supplementary Data, S1.
Simulation of mutation of an endogenous and an exogenous retrovirus. Average scores for 20 sequences at each level of mutation, divided by maximum score for unmutated HIVMNCG, when analyzed with ReTe and BLAST, are shown. (a) Normalized score for sequences in the endogenous (indel) model. ReTe is more tolerant to mutation than BLAST, when scoring a sequence as retroviral. (b) Normalized score for sequences in the exogenous model. These sequences receive high scores throughout the analysis when analyzed with ReTe. BLAST, on the other hand, does not as readily recognize the sequences as descendant from HIVMNCG. Further information is given in the Supplementary Data, S1.Sequence similarity (percent identity, gap positions excluded) for the Pol puteins compared to the unmutated HIVMNCG Pol protein. Average for 20 puteins at each level of mutation. (a) Endogenous (indel) model puteins: sequence identity. (b) Exogenous model puteins: sequence identity. Further information is given in the Supplementary Data, S1.Chain scores with retroviral, retroviruslike and random sequences. Retroviruslike (errantiviral; gypsy elements) sequences of slime molds, insects and plants are shown to the left. Epsilonretroviral sequences of amphibians and fish are shown to the right. Scores of chains detected in a 108 random sequence are shown below the cutoff. The chains from 108 random nucleotides were obtained before the changed settings described in the text.
Evaluation of ReTe with whole genome sequences from a variety of species
As seen in Figures 7 and 8, chain scores in the human genome version hg18 ranged from the cutoff of 300–4400. High scoring (>2000) chains were from exogenous retroviruses, and from structurally intact endogenous proviruses like the betaretroviruslike HERV-K(HML2) (25). Incomplete proviruses, with LTR-gag-env-LTR etc. scored 250–400 points. Errantivirus sequences like gypsy scored 250–950. Pseudovirus sequences like copia scored 200–350, or not at all. Epsilonretrovirus chains scored 350–1050.
Figure 8.
Frequencies of scores of the chains reported by ReTe version 1.0 from the human (hg18), chimpanzee (panTro2), rhesus (rheMac2), dog (canFam2), mouse (mm8) and chicken (galGal3) genome assemblies.
Frequencies of scores of the chains reported by ReTe version 1.0 from the human (hg18), chimpanzee (panTro2), rhesus (rheMac2), dog (canFam2), mouse (mm8) and chicken (galGal3) genome assemblies.As shown in Tables 2 and 3, and Figure 7, ReTe can detect retroviral sequences in a wide variety of genomes, also from less complete assemblies (e.g. panTro1). The four major genes were detected in many of the chains. However, env sequences were less common than the other three. Detection of the env gene, the least conserved of the four major retroviral genes, poses special difficulties. There are few conserved motifs in its SU portion, and the conserved motifs in the transmembrane protein often do not provide enough basis for a putein reconstruction. The inclusion of the EnvTracer module was intended to diminish false negativity in env gene detection. The lower frequency of env in the retroviral chains (Table 3) is probably due to mutational decay, occasional misses by EnvTracer or to env-less proviruses which may transpose without leaving the cell, e.g. the betaretroviruslike IAP elements in mice (26), and possibly the bulk of the HERVH elements in humans (27–30).
Table 2.
ReTe predicted genomic ERV contents
Genome analyzed
Total detected elements
Elements with ReTe version 0.12, score ≥300b
Homo sapiens (hg16)
18 213
3164
Pan troglodytes (PanTro1)
13 003
2117
Gallus gallus (gg01)
3921
262
aThe compilations are under improvement, depending on sequence draft qualities and ReTe optimization.
bReTe score ≥300 suggests true integrations of relatively intact elements. Modified from (28).
Table 3.
Proviral chains of score ≥300 detected in recent runs with ReTe version 1.0 and more complete genome assemblies
Genus/Host
ERVs (all chains)
2 × LTR
gag
pro
pol
env
Full lengtha
Alpha-like
galGal3
33
4
17
13
26
1
1
AlphaBeta-likeb
galGal3
61
29
45
37
51
10
7
Beta-like
hg18
770
299
424
498
694
328
142
panTro2
768
287
423
511
683
307
137
rheMac2
828
348
533
548
733
321
138
canFam2
60
47
11
11
51
11
0
musMus8
5928
2746
3446
3391
5316
1339
536
galGal3
162
38
74
83
150
16
6
Gamma-like
hg18
2713
1439
2070
1436
2491
1178
466
panTro2
2055
1234
1195
976
1738
953
317
rheMac2
1736
994
1086
875
1523
750
254
canFam2
438
242
224
199
371
96
18
musMus8
1461
808
974
921
1370
744
455
galGal3
3
3
0
1
1
0
0
Spuma-like
hg18
145
75
3
49
142
42
0
panTro2
96
58
0
26
92
12
0
rheMac2
96
58
0
26
92
12
0
canFam2
5
5
0
0
5
2
0
musMus8
438
307
2
428
436
285
1
galGal3
3
3
0
1
1
0
0
aDetected ‘LTR-gag-pro-pol-env-LTR’.
bNovel intermediate group as a result of Pol phylogenetic analysis (34). Modified from (28).
ReTe predicted genomic ERV contentsaThe compilations are under improvement, depending on sequence draft qualities and ReTe optimization.bReTe score ≥300 suggests true integrations of relatively intact elements. Modified from (28).Proviral chains of score ≥300 detected in recent runs with ReTe version 1.0 and more complete genome assembliesaDetected ‘LTR-gag-pro-pol-env-LTR’.bNovel intermediate group as a result of Pol phylogenetic analysis (34). Modified from (28).As discussed below, a basic problem for sensitivity and specificity determination for an ERV detection algorithm is the absence of a generally recognized and curated HERV database. An established general mechanism for repeat detection, and a limited level of repeat characterization, is provided by RepeatMasker (7,11), based on RepBase (6). HERVd (12,13) represented an attempt to reduce the fragmentation of Repeatmasker output, and to amend its retroviral nomenclature. Unfortunately, it could not be maintained. Nevertheless, these sources are the best established references.The ReTe cutoff score of 300 is motivated (i) by the lack of chains from random sequences above this limit, (ii) by the clear reduction of chains not overlapping RepeatMasker hits when this cutoff is used (Supplementary Data S10–S15), (iii) The relation between sensitivity and specificity versus Repeatmasker hits (receiver operating characteristic; ROC) curves for three genomes (Supplementary Data S10–S15).ReTe-derived sequences were evaluated versus RepeatMasker output and HERVd annotations on the human genome hg15, see also supplementary Data S6 and S7, as well as (28–31). Using ≥300 as a chain score cutoff, 3373 retroviral chains were detected in hg15. Repeatmasker reported 457 600 ‘LTR’ elements, generally as fragments. A total of 2625 ReTe chains were colocalized with a Repeatmasker entry, using a criterion of overlap within 12 000 nt of the start point of the ReTe chain. The Supplementary Data S2 and S4 shows a chain missed in hg15. The remainder occurred in both data sets. A detailed comparison of these discrepancies requires a detailed ERV classification, which is out of scope for this article. The Supplementary Data (S6–S17) does however give a survey. Repeat-based recognition does not in itself identify retroviral sequences. RepeatMasker relies on the man-made assignments in RepBase to characterize elements as ‘LTR’ elements or not.In a comparison of ReTe hg15 findings with HERVd, 255 515 partial or full elements in total were annotated in HERVd (which is based on the hg15 genome version). Of these, 3117 had a coincident ReTe chain. Thus, with the present motifs and distance constraints ReTe misses many HERV-L related sequences, which is an evolutionarily old and deviating group. According to RepeatMasker it constitutes ∼1.9% of the human genome. If they are subtracted from the abovementioned 3%, this leaves around 1% as detectable by ReTe. The 3373 chains scoring ≥300 found by ReTe version 0.10 in hg15 cover 24 487 571 nt, i.e. 0.79% of the human genome. A probable explanation for the discrepancy is that ReTe prefers relatively complete proviruses, and misses some fragmented ones. This is elaborated in Supplementary Data S8–S17. It is likely that by modifications in the distance model, and addition of more motifs, a greater proportion of HERV-L can be detected with ReTe.On the other hand, ReTe is not dependent upon repetition for detection, and therefore could detect single or low-copy-number retroviral elements. Examples are ERV-FRD on chromosome 6p24.2 (32), and HERV-Fc1 (33) on chromosome Xq21.33, which are single or low-copy-number elements with unusually open reading frames (Supplementary Data S2 and S4). Both elements are however now fully or partially covered in an April 2007 version of the RM output of the human genome version hg18.
Accuracy
Reference retroviral genomes give chains with scores of; 3341 (RSV), 3792 (MMTV), 3439 (MPMV), 3022 (MLV), 2934 (FLV), 2814 (HIV-1), 2345 (HTLV2) and 879 (WDSV). The errantivirus elements ZAM and CER1 give 935 and 687, respectively. They are gypsy elements from Drosophila melanogaster and Caenorhabditis elegans, respectively. They are related to the Orthoretroviruses (Figure 7; and Supplementary Data S2), and have the same genome organization. The four major genes are predicted in all of the 10 viruses, except for the env gene in RSV. This may be due to deranged env distances due to the presence of src, and/or insufficient coverage of alpharetroviral SU and TM motifs. The start and stop positions of the respective ORFs were correct within 10% of the annotated position, see e.g. (1). Larger deviations are occasionally observed in complex retroviruses (lenti, delta and epsilon retroviruses), which have one or several additional regulatory protein genes. They can occur before gag and around env. Similar problems are caused by the sarcoma viruses, where oncogenes disrupt the retroviral structure (Supplementary Data, S2 and S5).As mentioned in the discussion, ReTe has already been used in several studies on human ERVs (29,31,34–36).
Performance
ReTe was applied to the the human genome versions hg15, hg16, hg17 and hg18, the chimpanzee genome versions panTro1 and panTro2, the chicken genome versions galGal1 and galGal3, the dog genome version canFam2, the mouse genome version mm8 and the opossum genome monDom4. Depending on settings, a full analysis of the human and chimpanzee genomes takes 5–6 days, the chicken genome 3 days, using the computer set described in Systems and Methods. On the Uppmax Opteron Linux-based cluster, 1–2 days suffices.
DISCUSSION
Program design considerations
Solitary LTR detection is one of the most demanding aspects of retroviral sequence recognition. The principle of LTR selective split octamers used in ReTe is similar to the one used in the program Matinspector® (Genomatix Gmbh, Germany) (16–18). Despite a rather high LTR selectivity of the presented LTR recognition algorithm (LTRID), the number of false-positive hits is overwhelming when entire genomes are processed. Initially, we considered using hidden Markov models (HMMs) (21) for the detection of retroviral structures. This is computationally intensive, and we instead chose the faster algorithm ‘fragment threading’. We are currently attempting a limited introduction of HMMs for improved detection of solitary LTRs, and a few other motifs. The principle of ORF-selective hexamers used in ORFID is also used in gene-finding algorithms like GenScan (37). Clearly, certain binary triplet combinations are more likely to occur in retroviral ORFs compared to the two alternative reading frames. The functional basis for this selectivity is obscure.We had the practising retrovirologist and geneticist in mind in the design. Although the modular design of ReTe leaves ample scope for future improvement, it has already proved useful in its present form (28–31,34–36,38–40). ReTe gives a rich basis for assessment of the functionality and taxonomy of a retroviral element. The ability to export protein and nucleic acid sequences of the four major retroviral genes (gag, pro, pol and env) from ERVs of an entire genome in FASTA format will aid phylogenetic studies, and promote the understanding of the ‘retroproteomes’ of these organisms. The usefulness of the ready availability of nucleic acid frequency, LTR divergence, Pol-based classification versus reference retroviral elements, and degree of nucleotide identity to RepBase elements for all detected ERVs in a genome has been demonstrated in studies on HERV-H (29,30), ERV3 (31), ERV9/HERV-W (36) and a comparison of ERVs unique to humans and chimpanzees (35). In addition, features such as splice prediction, prediction of additional ORFs besides Gag, Pro, Pol and Env, gag-pro-pol readthrough mechanisms and LTR structure aid further functional studies on ERVs.
ReTe, RepBase, RepeatMasker and HERVd
Repeat-based recognition of ERVs, using RepBase (6) and its corollaries RepeatMasker (7,11), Hubley,R. and Green,P., unpublished data (http://repeatmasker.org) and HERVd (12,13), has been conducted for over a decade. An elaborate classification (RepBase) based on nucleic acid identity to machine-generated consensus sequences, through the Censor program (10), exists for many repeated elements. It is gradually being supplemented by user contributions. This classification and detection procedure should be scrutinized against other alternatives. ReTe provides an independent route to ERV detection and classification. Only by using several approaches can a rational classification of ERVs be achieved. ReTe favors elements >1000-bp long due to its dependence on the presence of several retroviral fragments in the right order at approximate distances typical of retroviruses. RepeatMasker, however, can detect considerably shorter sequences but is limited by the need for a minimum number of repeats for recognition. It is also limited by a lack of internal retroviral structure interpretation. An exact appraisal is not possible due to the often fragmented nature of both HERVd and Repeatmasker outputs. Published methods for retrieval of retroviral sequences either center around detection of LTR pairs (41,42), specific conserved sequences, like TM (43,44), or RT, combined with an ORF search (45,46), or general repeat detection, collected in RepBase (6,10) and used in RepeatMasker (7,11) and HERVd (12,13). HESAS (HERVs Expression and Structure Analysis System) (47) merges dbEST information with Repeatmasker-based output. It yields information about the expression and structure of HERVs. However, none of them has the broad scope of ReTe.Speed of analysis is essential as the sequencing and assembly of genomes becomes faster. Sequence lengths from 104 to 1010 nt can realistically be analyzed. As demonstrated, ReTe faithfully reconstructed many features both of the simple retrovirus MoMLV, and the complex retrovirus HIV (Figure 4; Supplementary Data S2 and S5). Many of the ReTe functions may be further optimized, e.g. splice site prediction based on canonical consensus sequences and LTR detection.
Limits of retroviral sequence detection
The limitation to the four different nucleotides in DNA imposes restrictions on the possibility for sequence recognition of mutated sequences. This is observed in multiple nucleic acid alignments, where identities <50% must be regarded with caution. The typical mutation frequency (here discussed as substitutions) for sequences without selection pressure is ∼0.2% per million years (48). Thus, 1% substitution corresponds to ∼5 Mya, and 50% corresponds to ∼250 Mya. Consequently, 200–300 Mya is a detection limit for selection neutral retroviral sequences, which probably are the majority of the ERVs. Selection for a functional protein, leading to persistence of conserved retroviral amino acid motifs, can push back the limits for recognition considerably. This is demonstrated by the ability of ReTe to recognize widely divergent retrovirus-related retrotransposons like Errantiviruses of invertebrates (Figure 7). Thus, using a collection of motifs (Supplementary Data S3) largely (but not exclusively) derived from retroviral sequences of higher vertebrates, ReTe can detect retroviruslike sequences in amphibians, insects and worms. In this situation, the model-based approach of ReTe surpasses the unbiassed recognition via the BLAST algorithm (Figures 5 and 6). This attests to the structural antiquity of retroviruses. However, the demonstrated ability to detect highly mutated ERVs requires a relatively intact structural backbone. Secondary integrations of a few large (e.g. LINEs) or many smaller (e.g. SINEs) elements into an ERV can derange structure beyond repair by the ‘broken chain’ function of ReTe, and masking of nonretroviral repeats. This problem is inherent to the systematic structural approach of ReTe. On the other hand, ReTe often provides an interpreted proviral structure, which can be used in further studies.The structural model of ReTe thus allows recognition of many retroviral sequences. There are both minor and major obstacles to widening the scope of detection. Adjustments of the distance constraints and inclusion of more motifs, are simple measures which may lead to an enhanced recognition of retroviral sequences like HERV-L. However, the pol gene of copia has the gene order IN..RT instead of the usual RT..IN, which would require the use of alternative models for these elements. The MalR retrotransposons (11) are incomplete and very divergent from orthoretroviral gene structure, with very few recognizable conserved motifs. The latter two are major challenges for ReTe.ReTe analysis of five vertebrate genomes (Figure 8) demonstrated that vertebrate lineages have a variable number and type of ERVs. The mouse had a high number of high-scoring chains, and the dog and chicken genome had a low number. The human, chimpanzee and rhesus genomes were intermediate. Especially complete proviruses were betaretroviruslike in mouse and gammaretroviruslike in humans (Table 3). The findings extend and confirm previous observations (28,49). Several of these species differences must have arisen relatively late in evolution, probably due to more or less successful modifications of antiretroviral restrictions or changes in habitat and habits (50–52). Further work with the extensive retroviral sequence data set provided by ReTe will undoubtedly shed light both on retroviral and vertebrate evolution.
CONCLUSION
ReTe is a rational tool for detection and annotation of retroviral sequences, alone, in contigs or in entire genome assemblies. It provides ERV detection based on retrovirological expert knowledge and is independent of RepBase and RepeatMasker. Further developments include improvements to the motifs, distance constraints and alignment library and to the database and presentation module. An extension to include more of HERV-L, gypsy and copia elements is considered.
SUPPLEMENTARY DATA
ReTe and documentation is available from the authors. Contact Jonas.Blomberg@medsci.uu.se. Supplementary information is given at http://www.kvir.uu.se/RetroTector%5CRetroTectorProject.html and at NAR online.
Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki Journal: Nature Date: 2001-02-15 Impact factor: 49.962
Authors: David L Reed; Russell W Currier; Shelley F Walton; Melissa Conrad; Steven A Sullivan; Jane M Carlton; Timothy D Read; Alberto Severini; Shaun Tyler; R Eberle; Welkin E Johnson; Guido Silvestri; Ian N Clarke; Teresa Lagergård; Sheila A Lukehart; Magnus Unemo; William M Shafer; R Palmer Beasley; Tomas Bergström; Peter Norberg; Andrew J Davison; Paul M Sharp; Beatrice H Hahn; Jonas Blomberg Journal: Ann N Y Acad Sci Date: 2011-08 Impact factor: 5.691