Mikk Puustusmaa1, Heleri Kirsip2, Kevin Gaston3, Aare Abroi4,5.
Abstract
Almost a century has passed since the discovery of papillomaviruses. A few decades of research have given a wealth of information on the molecular biology of papillomaviruses. Several excellent studies have been performed looking at the long- and short-term evolution of these viruses. However, when and how papillomaviruses originated is still a mystery. In this study, we systematically searched the (sequenced) biosphere to find distant homologs of papillomaviral protein domains. Our data show that, even when structural information is included, which allows us to find deeper evolutionary relationships than sequence-only methods, only half of the protein domains in papillomaviruses have relatives in the rest of the biosphere. We show that the major capsid protein L1 and the replication protein E1 have relatives in several viral families, sharing three protein domains with Polyomaviridae and Parvoviridae. However, only the E1 replication protein has connections with cellular organisms. Most likely, the papillomavirus ancestor is of marine origin, a biotope that is not very well sequenced at the present time. Nevertheless, there is no evidence as to how papillomaviruses originated and how they became vertebrate and epithelium specific.
Keywords: origin; papillomaviruses; protein domains; structural domains
Year: 2017 PMID: 28832519 PMCID: PMC5618006 DOI: 10.3390/v9090240
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Domain coverage comparison for UniProt “complete proteomes” in the PfamA and SUPERFAMILY databases.
| | PfamA_28 * Sequence Coverage 1 | PfamA_28 * Residue Coverage 2 | PfamA_28 * No. of Genomes | SUPERFAMILY Sequence Coverage 1 | SUPERFAMILY Residue Coverage 2 | SUPERFAMILY No. of Genomes |
|---|---|---|---|---|---|---|
| Archaea | 73.8 | 58.0 | 182 | 64.4 | 61.1 | 122 |
| Bacteria | 82.0 | 63.3 | 3513 | 67.6 | 62.6 | 1153 |
| Eukaryota | 67.9 | 38.6 | 422 | 56.9 | 38.8 | 440 |
| Viruses | 84.4 | 65.7 | 1198 | 34.3 | 28.1 | 4041 |
| dsDNA viruses | 62.5 | 52.9 | 270 | 24.8 | 25.4 | 1758 |
| | 90.8 | 83.8 | 76 | 69.5 | 57.5 | 125 |
| | 92.5 | 70.3 | 10 | 60.2 | 65.3 | 50 |
| | 74.7 | 56.3 | 23 | 69.5 | 55.0 | 81 |
| | 97.0 | 79.9 | 34 | 18.5 | 15.1 | 332 |
| | 74.2 | 53.6 | 28 | 27.6 | 20.7 | 57 |
1 Sequence coverage shows the percentage of proteins in a genome which are covered by at least one domain. 2 Residue coverage shows the percentage of amino acids from all proteins of a genome which are within domain models. * PfamA_28 data from “complete genomes” subset.
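The two coverage measures defined in the footnotes can be illustrated with a minimal Python sketch. The input layout (protein length plus a list of 1-based, inclusive domain spans), the function names, and the choice to count overlapping domain models only once are all assumptions made for illustration; the table itself does not state how overlaps or per-genome averaging were handled.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: protein ID -> (length in aa, [(start, end), ...]),
# with 1-based inclusive coordinates of the matched domain models.
Proteome = Dict[str, Tuple[int, List[Tuple[int, int]]]]


def sequence_coverage(proteome: Proteome) -> float:
    """Percentage of proteins covered by at least one domain."""
    if not proteome:
        return 0.0
    covered = sum(1 for _, spans in proteome.values() if spans)
    return 100.0 * covered / len(proteome)


def residue_coverage(proteome: Proteome) -> float:
    """Percentage of all amino acids that fall within a domain model.

    Assumption: residues covered by overlapping domains count once.
    """
    total = sum(length for length, _ in proteome.values())
    if total == 0:
        return 0.0
    covered = 0
    for length, spans in proteome.values():
        positions = set()
        for start, end in spans:
            # Clip spans to the protein so bad coordinates cannot inflate the count.
            positions.update(range(max(1, start), min(length, end) + 1))
        covered += len(positions)
    return 100.0 * covered / total


# Toy data, not taken from the study: one protein with two overlapping
# domains and one protein with no domain assignment.
toy = {"P1": (100, [(1, 50), (41, 60)]), "P2": (100, [])}
print(sequence_coverage(toy), residue_coverage(toy))
```

In this toy case one of two proteins carries a domain, and 60 of 200 residues lie inside a domain model, which shows how a genome can score much higher on sequence coverage than on residue coverage.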
Figure 1. Location of Papillomavirus (PV) proteins and protein domains using Bovine PV type 1 as an example. Bovine PV type 1 encodes 9 proteins including the oncoproteins E6, E7 and E5, the viral helicase E1, the helicase loading factor and transcription factor E2, and the L1 and L2 coat proteins. E8^E2 and E1^E4 proteins are not shown on the figure. Location of open reading frames (ORFs) does not correspond to reading frames.
PV_PfamA domain occurrence in biosphere.
| PfamA ID | PfamA Name | PV 5 | PDB PfamA_28 2 | PfamA Domain Length 3 | PDB PfamA_31 2 | Best Coverage of PfamA by PDB (% aa) | Eukaryota (Proteomes) 1 | Bacteria (Proteomes) 1 | Archaea (Proteomes) 1 | Viruses 1,4 | Eukaryota (Full up) 1 | Bacteria (Full up) 1 | Archaea (Full up) 1 | Viruses (Full up) 1,4,6 | HMMER E 1 | HMMER B 1 | HMMER A 1 | HMMER V 1,6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PF00500 | Late_protein_L1 | 76 | 10 | 498 | 18 | 0.96 | - | - | - | - | - | - | - | - | - | - | - | - |
| PF00508 | PPV_E2_N | 76 | 8 | 200 | 8 | 0.98 | - | - | - | - | - | - | - | - | - | - | - | - |
| PF00511 | PPV_E2_C | 76 | 16 | 80 | 16 | 0.96 | - | - | - | - | - | - | - | - | - | - | - | - |
| PF00513 | Late_protein_L2 | 76 | 0 | 525 | 0 | | - | - | - | - | - | - | - | - | - | - | - | - |
| PF00518 | E6 | 71 | 7 | 110 | 8 | 0.99 | - | - | - | - | - | - | - | - | - | - | - | - |
| PF00519 | PPV_E1_C | 74 | 7 | 432 | 8 | 0.96 | - | 1 | - | - | - | 20 | - | 1 | - | 1 | - | 1 |
| PF00524 | PPV_E1_N | 72 | 0 | 121 | 0 | | 2 | - | - | - | 4 | - | - | - | - | - | - | - |
| PF00527 | E7 | 71 | 3 | 93 | 4 | 0.50 | - | - | - | - | - | - | - | - | - | - | - | - |
| PF02711 | Pap_E4 | 25 | 0 | 95 | 0 | | - | - | - | - | - | - | - | - | - | - | - | - |
| PF03025 | Papilloma_E5 | 9 | 0 | 72 | 0 | | - | - | - | - | - | - | - | - | - | - | - | - |
| PF05776 | Papilloma_E5A | 5 | 0 | 91 | 0 | | - | - | - | - | - | - | - | - | - | - | - | - |
| PF08135 | EPV_E5 | 3 | 0 | 43 | 0 | | - | - | - | - | - | - | - | - | - | - | - | - |
“-” No true positive hits were found. 1 Number of distinct proteomes/species in database with given taxonomic restrictions coding respective domain. 2 Number of Protein Data Bank (PDB) entries for respective PfamA domain. 3 Model length. 4 Excluding papillomaviruses. 5 76 PV proteomes in this database. 6 Excluding Polyomaviridae and Parvoviridae.
Figure 2. Location of PV domains in the “Galaxy of folds”. PV structural domains are marked by white crosses and visualised on protein domain space. Domains in Structural Classification of Proteins (SCOP) were clustered using the software CLANS based on their all-against-all pairwise similarities, as measured by HHsearch p-values [34]. Domains are coloured according to their SCOP class: all-α (blue); all-β (cyan); α/β (red); α + β (yellow); small proteins (green); multi-domain proteins (orange); and membrane proteins (magenta). PV protein domain name and SCOP identifier are indicated.
PV_SF domain occurrence in biosphere.
| SCOP/SF ID | Classification | SF/FOLD | Families/SF | Description | PV | Viruses 1 | Plasmids 2 | Archaea | Bacteria | Eukaryota | HMMER A | HMMER B | HMMER E | HMMER V 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55464 | d.89.1 | 1 | 5 | Origin of replication-binding domain, RBD-like (E1 DBD) | 123 | - | - | |||||||
| 52540 | c.37.1 | 1 | 24 | P-loop containing nucleoside triphosphate hydrolases (E1 helicase) | 123 | ND | ND | ND | ND | |||||
| 51332 | b.91.1 | 1 | 1 | E2 regulatory, transactivation domain (E2 TAD) | 123 | - | - | - | - | - | - | - | - | - |
| 54957 | d.58.8 | 59 | 1 | Viral DNA-binding domain (E2 DBD) | 123 | - | - | - | - | - | - | - | ||
| 88648 | b.121.6 | 7 | 1 | Group I dsDNA viruses (L1) | 123 | - | - | - | - | - | - | - | ||
| 161229 | g.90.1 | 1 | 1 | E6 C-terminal domain-like | 115 | - | - | - | - | - | - | - | - | |
| 161234 | g.91.1 | 1 | 1 | E7 C-terminal domain-like | 108 | - | - | - | - | - | - | - | - | - |
| 55464:52540 | DBD + helicase | 123 | - | - | ND | |||||||||
“-” No true positive hits were found. “ND” Not determined. “Underlined” Number of primary hits. “Bold” Number of true positive hits. * Number of true positives without Polyomaviridae, Parvoviridae and Geminiviridae. “?” Questionable result. 1 Excluding papillomaviruses. 2 Number of proteins. The non-redundant set of genomes contains 122 archaeal, 1153 bacterial, and 440 eukaryotic species (i.e., redundant strains and isolates removed). DBD: DNA-binding domain; TAD: transactivation domain. For more detailed information, see Table S3.
Figure 3. Distribution of protein domains in viral families by superkingdoms. Each panel shows data for the corresponding viral family. The number in parentheses in each panel title corresponds to the number of distinct domains (SF) found in the respective viral family. The y-axis shows the number of domains (SF) from the viral family that are covered by any of the three superkingdoms. The x-axis shows the decile of the genomes in which the viral protein domains are found, by superkingdom. In the Papillomaviridae panel, the lines for Archaea and Eukaryota overlap.
Number of sequences containing SF_55464 with different domain architectures.
| No. of 52540 Domains | PV 1 123 * | Other Viruses | Plasmids | Bacteria | Eukaryota | |||
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 350 | 10 | 64 | 35 | 4 |
| 1 | 122 | 49 | 33 | 14 | 6 | 20 | 20 | 5 |
| 2 | 0 | 0 | 0 | 0 | 1 | 334 | 183 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
1 In HPV53, only the DBD part of E1 is annotated. 2 In Polyomaviridae, Merkel cell polyomavirus does not have an annotated full-length Large-T protein in the version of the database used (the current version of NCBI viral genomes does include it). 3 Geminiviruses often have more than one replication protein isoform annotated. * Number of genomes in the respective viral family.
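A tally of this kind can be sketched in a few lines of Python: keep only sequences that contain SF_55464, count how many SF_52540 domains each one carries, and group the counts by taxonomic group. The record layout, the function name `tally_architectures`, and the toy records are hypothetical and are not taken from the study's pipeline or data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

RBD = "55464"     # origin of replication-binding domain superfamily
P_LOOP = "52540"  # P-loop NTP hydrolase superfamily (E1 helicase)


def tally_architectures(
    records: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Counter]:
    """Count sequences containing SF_55464 by their number of SF_52540 domains.

    records: (group, superfamily IDs assigned along the sequence) pairs,
    a hypothetical layout chosen for this illustration.
    Returns: group -> Counter mapping number of SF_52540 domains to
    number of sequences.
    """
    table: Dict[str, Counter] = defaultdict(Counter)
    for group, domains in records:
        if RBD not in domains:
            continue  # only sequences with SF_55464 enter the table
        table[group][domains.count(P_LOOP)] += 1
    return table


# Made-up records for illustration only.
records = [
    ("PV", [RBD, P_LOOP]),
    ("PV", [RBD]),
    ("Bacteria", [RBD, P_LOOP, P_LOOP]),
    ("Bacteria", [P_LOOP]),  # skipped: no SF_55464
]
for group, counts in tally_architectures(records).items():
    print(group, dict(counts))
```

Each inner counter corresponds to one column of a table like the one above, with its keys playing the role of the “No. of 52540 Domains” rows.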
Figure 4. Summary figure of the relationship of PV domains with other parts of the biosphere. Virus family names are abbreviated without the “-viridae” suffix. In the green circle, the relationships according to the SCOP and SUPERFAMILY resources are shown. In the yellow circle, the relationships according to extended structural analysis from published articles and structures are shown. The genera Lymphocryptovirus and Rhadinovirus belong to the subfamily γ-Herpesvirinae. For the E1 helicase domain, only evolutionary relationships via the domain pair SF_55464:SF_52540 are shown.