Evgeny M Zdobnov, Mónica Campillos, Eoghan D Harrington, David Torrents, Peer Bork.
Abstract
We suggest an annotation strategy for genes encoded by retroviruses and transposable elements (RETRA genes) based on a set of marker protein domains. RETRA genes are usually masked in vertebrate genomes before automated gene prediction pipelines are applied, under the assumption that they provide no selective advantage to the host. Yet we show that about 1000 genes in the four vertebrate gene sets analyzed contain at least one RETRA gene marker domain. Using the conservation of genomic neighborhood (synteny), we were able to discriminate between RETRA genes with putative functionality in the vertebrates and those that probably function only in the context of mobile elements. We identified 35 such genes in human, along with their corresponding mouse and rat orthologs, which included almost all known human genes with similarity to mobile elements. The results also imply that the vast majority of the remaining RETRA genes in current gene sets are unlikely to encode vertebrate functions. To automatically annotate RETRA genes in other vertebrate genomes, we provide as a tool a set of marker protein domains and a manually refined list of domesticated or ancestral RETRA genes for rescuing genes with vertebrate functions.
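The combination of marker domains and a rescue list described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' tool: the function name `classify_gene`, the three labels it returns, and the small example sets are assumptions made for illustration. The Pfam accessions and the gene names PEG10 and PNMA2 are taken from the domain table in this record, but the sets here are deliberately incomplete.

```python
# Illustrative subset of RETRA marker domains (Pfam accessions from the table
# in this record); the real marker set is larger.
RETRA_MARKER_DOMAINS = {"PF02994", "PF00078", "PF00665", "PF03732"}

# Illustrative rescue list: genes with a reported host function that should
# not be masked. The real list is manually refined and longer.
RESCUE_LIST = {"PEG10", "PNMA2"}


def classify_gene(gene_name, pfam_hits,
                  marker_domains=RETRA_MARKER_DOMAINS,
                  rescue_list=RESCUE_LIST):
    """Label a predicted gene from its Pfam domain hits.

    Returns "no_marker" if the gene has no RETRA marker domain,
    "rescued" if it has one but is on the rescue list,
    and "retra_candidate" otherwise.
    """
    if not set(pfam_hits) & set(marker_domains):
        return "no_marker"
    if gene_name in rescue_list:
        return "rescued"
    return "retra_candidate"


if __name__ == "__main__":
    print(classify_gene("PEG10", ["PF03732"]))
    print(classify_gene("unnamed_prediction", ["PF02994", "PF00078"]))
    print(classify_gene("some_kinase", ["PF00069"]))
```

Checking the rescue list only after a marker domain has been found keeps the two inputs independent, so either the marker set or the rescue list can be updated without touching the other.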
Year: 2005 PMID: 15716312 PMCID: PMC549403 DOI: 10.1093/nar/gki236
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic overview of autonomous TEs and retroviruses and their protein content. The highlighted domains are characteristic of these elements and correspond to the marker domains used in this study. (A) Protein content of the most common vertebrate retrotransposons and DNA transposons. The single ORF of DNA transposons, encoding a transposase, is depicted. LINEs contain two open reading frames (ORF1 and ORF2), encoding an RNA-binding protein and a pol gene with homology to reverse transcriptases and endonucleases. LTR-retrotransposons can include ORFs such as (i) gag, encoding a protein that forms the structural component of a cytoplasmic particle within which the reverse transcription reaction takes place, and (ii) pol, which encodes in most elements an aspartic protease (Pro), a reverse transcriptase (RT), a ribonuclease H (RNase H) and an integrase (Int). These elements can also include an env protein and thus show a protein content very similar to that of an endogenous retrovirus. (B) Protein content of retroviruses. In comparison to an endogenous retrovirus, a retrovirus possesses additional proteins that are usually not recognizable in endogenous retroviruses.
Figure 2. The list of InterPro domain signatures corresponding to genes in DNA transposons, retrotransposons and retroviruses was compiled on the basis of a literature survey and two protein and domain annotation-based procedures (for details see Methods).
RETRA genes of four vertebrates identified in frequently used gene sets
| Species | Ensembl: final set v33 (v34) | Ensembl: GeneScan (aut.) | NCBI: final set | NCBI: Gnomon (aut.) | Twinscan (comparative) |
|---|---|---|---|---|---|
| Human | 251 (230) | 51 | 65 | 60 | 33 |
| Rat | 181 | 479 | 98 | 173 | 190 |
| Mouse | 277 | 475 | 807 | 1195 | 116 |
| Fugu | 679 | 333 | 670 | 220 | na |
| Sum | 1137 | 1338 | 1640 | 1648 | 339 |
The gene sets are: (i) the Ensembl (15) pipeline, which is based on (ii) the automatic gene discovery method GeneScan (16); (iii) the NCBI pipeline, which is based on (iv) the Gnomon predictor; and (v) a comparative gene prediction method, Twinscan (17). All sets correspond to human genome assembly build 33 (see more details in Supplementary Table 2); the reference numbers for the more recent human genome assembly build 34 available from Ensembl are given in brackets. na: Twinscan gene predictions for fugu are not available.
a: Sum is not comparable.
Detailed breakdown of RETRA domain occurrences in public gene sets
| Pfam | InterPro description | Ensembl: final set v33 (v34) | Ensembl: GeneScan (aut.) | NCBI: final set | NCBI: Gnomon (aut.) | TwinScan |
|---|---|---|---|---|---|---|
Only domains that match at least 10 proteins in any of the Ensembl sets are shown. At the time of analysis only an Ensembl gene set based on human genome assembly build 34 was available, shown in brackets.
Detailed analysis of putative RETRA genes in human Ensembl gene set (corresponding to genome assembly build 34)
| InterPro (Pfam) | RETRA domain description | Total | EST | BRH | Synteny | Known |
|---|---|---|---|---|---|---|
| IPR004244 (PF02994) | L1 transposable element | 127 | 7(1) | 3 | 0(1) | |
| IPR000477 (PF00078) | RNA-directed DNA polymerase (Reverse transcriptase) | 54 | 9(1) | 5 | 1(2) | Telomerase (O14746) (36,37) Hur1 (25) |
| IPR004875 (PF03184) | CENP-B protein | 14 | 14(12) | 9 | 9(12) | CENP-B (P07199) (29); Jerky (Q60976) (30); TIGD2, TIGD3, TIGD6 and TIGD7 (31) |
| IPR006695 (PF04218) | CENP-B, N-terminal DNA-binding | | | | | YCE7_HUMAN |
| IPR002050 (PF00429) | ENV polyprotein (coat polyprotein) | 13 | 6(0) | 2 | 0(0) | Syncytin 1+ and 2+ (Q9NZG3, P60508) (33,35) |
| IPR001584 (PF00665) | Integrase, catalytic domain | 10 | 9(6) | 4 | 3(6) | Gin-1 (NM_017676) (39) |
| IPR008906 (PF05699) | HAT dimerisation | 6 | 6(5) | 3 | 3(5) | P52rIPK (O43422); ZBED1+ (NM_004729) (34) |
| IPR008180 (PF00692) | DeoxyUTP pyrophosphatase | 5 | 3(1) | 2 | 1(1) | dUTP pyrophosphatase (P33316) (41) |
| IPR004295 (PF03056) | Env gp36 protein, HERV | 4 | 2(0) | 1 | 0(0) | |
| IPR005162 (PF03732) | Retrotransposon gag protein | 4 | 4(4) | 4 | 4(4) | PEG10 (Q9UPV1) (42); PNMA2 (O94959) (26) |
| IPR003322 (PF02337) | Retroviral GAG p10 protein | 4 | 1(0) | 2 | 0(0) | |
| IPR003656 (PF02892) | BED finger | 4 | 4(3) | 1 | 2(3) | ZBED1+ (NM_004729) (34) |
| IPR001995 (PF00077) | Peptidase A2A, retrovirus | 3 | 1(0) | 1 | 0(0) | |
| IPR000721 (PF00607) | Retroviral nucleocapsid protein Gag | 3 | 2(0) | 2 | 0(0) | |
| IPR002156 (PF00075) | RNase H | 3 | 2(1) | 2 | 1(1) | RNase H (O60930) (43) |
| IPR004191 (PF02920) | Tn916 integrase, N-terminal DNA binding | 2 | 2(2) | 1 | 2(2) | Liprin-beta 1 |
| IPR003036 (PF02093) | Core shell protein Gag P30 | 2 | 1(0) | 1 | 0(0) | |
| IPR001888 (PF01359) | Transposase, type 1 | 2 | 1(1) | 1 | 1(1) | SETMAR (NM_006515) (45) |
| IPR001037 (PF00552) | Retroviral integrase, C-terminal | 1 | 1(0) | 1 | 0(0) | |
| IPR003308 (PF02022) | Integrase, N-terminal zinc-binding | 1 | 0(0) | 1 | 0(0) | |
| IPR002514 (PF01527) | Transposase IS3 | 1 | 1(1) | 0 | 1(1) | |
We identified 35 RETRA genes in synteny between humans and rodents (some genes have more than one domain listed in the table). Of these, 21 have been reported in the literature, 18 of which were detected by our approach (the exceptions, Syncytin 1, Syncytin 2 and ZBED1, are labeled with +). A detailed list of human genes that contain RETRA domains but are retained in synteny, with the corresponding orthologs in other species, is provided in the supplementary material (Supplementary Table 5). Total: total number of proteins with the corresponding RETRA domain (some genes may have more than one domain); EST: number of genes confirmed by at least one EST or mRNA (the number of genes in synteny with matched ESTs is shown in brackets); BRH: number of human genes with a putative ortholog in the mouse or rat genomes identified by best reciprocal hit in the current gene sets; Synteny: number of putative orthologs found in synteny by an automatic procedure (the number found by manual inspection is shown in brackets); Known: previously known human genes with these domains reported in the literature. These domains can be used as RETRA markers in annotation pipelines, provided that a rescue procedure for functional genes is used.
a: In the human proteins P52rIPK (O43422), Liprin-beta 1 and YCE7_HUMAN, found in synteny with rodents, we detected RETRA domains only in the human lineage, suggesting a recent acquisition of the domains in these proteins. However, the YCE7_HUMAN gene seems to have acquired only the N-terminal CENP-B domain.
+: Known RETRA genes that recently acquired a host function and were missed by our approach.
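The BRH and Synteny columns of the table rest on two generic comparisons that can be sketched in Python. This is a simplified illustration, not the procedure used in the study: the function names, the flanking window of two genes on each side, the `min_shared` threshold and the toy identifiers (`hA`, `mA`, and so on) are all assumptions, and similarity scores are taken as given rather than computed.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where each gene is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def flanking_genes(gene_order, gene, window=2):
    """Genes within `window` positions on either side of `gene`."""
    i = gene_order.index(gene)
    return set(gene_order[max(0, i - window):i] + gene_order[i + 1:i + 1 + window])


def in_synteny(order_a, order_b, gene_a, gene_b, orthologs,
               window=2, min_shared=1):
    """True if enough neighbors of gene_a have orthologs next to gene_b.

    `orthologs` maps genes of genome A to their orthologs in genome B.
    """
    mapped = {orthologs[g] for g in flanking_genes(order_a, gene_a, window)
              if g in orthologs}
    return len(mapped & flanking_genes(order_b, gene_b, window)) >= min_shared


if __name__ == "__main__":
    human_vs_mouse = {("hA", "mA"): 200, ("hA", "mB"): 50,
                      ("hB", "mA"): 120, ("hB", "mB"): 90}
    mouse_vs_human = {("mA", "hA"): 210, ("mA", "hB"): 100,
                      ("mB", "hA"): 40, ("mB", "hB"): 95}
    print(best_reciprocal_hits(human_vs_mouse, mouse_vs_human))

    human_order = ["h1", "h2", "hA", "h3", "h4"]
    mouse_order = ["m1", "m2", "mA", "m3", "m4"]
    neighbors = {"h1": "m1", "h2": "m2", "h3": "m3", "h4": "m4"}
    print(in_synteny(human_order, mouse_order, "hA", "mA", neighbors))
```

In the toy data, `hB` scores best against `mA`, but `mA` scores best against `hA`, so only the `hA`/`mA` pair counts as a best reciprocal hit. The synteny check then asks whether that pair also sits among orthologous neighbors in both gene orders.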