John C Bramley1, Alex L Yenkin1, Mark A Zaydman2, Aaron DiAntonio1,3, Jeffrey D Milbrandt1,2, William J Buchser4.
Abstract
Protein domain-based approaches to analyzing sequence data are valuable tools for examining and exploring genomic architecture across genomes of different organisms. Here, we present a complete dataset of domains from the publicly available sequence data of 9,051 reference viral genomes. The data provided contain information such as sequence position and neighboring domains from 30,947 pHMM-identified domains from each reference viral genome. Domains were identified from viral whole-genome sequence using automated profile Hidden Markov Models (pHMM). This study also describes the framework for constructing "domain neighborhoods", as well as the dataset representing it. These data can be used to examine shared and differing domain architectures across viral genomes, to elucidate potential functional properties of genes, and potentially to classify viruses.
Year: 2020 PMID: 32587259 PMCID: PMC7316859 DOI: 10.1038/s41597-020-0536-1
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Data Processing Pipeline. On the far left are the steps taken to assemble the datasets in this manuscript. Pre and Post refer to two different custom software programs that manage the data. Explanations of each step are written in the figure. The diagram on the right shows how different sequence data are processed, and how protein domain metadata are extracted and processed. GBK files are GenBank format, FNA files are nucleotide FASTA files, and FAA files are amino acid FASTA files. Gene metadata includes the name, accession, and genomic coordinates of a gene or open reading frame. Domain metadata includes the name, clan, E-value, and genomic coordinates of a protein domain. The de-overlap process (dagger) is shown in the lower panel. This illustrates how the HMMER3-identified domains are curated to filter out duplicate domains that have been over-identified due to the windowing approach. The E-value is listed after an example domain (showing an example clan). The highlighted domain is compared to each overlapping domain to decide whether the overlapping domain is removed, based on percentage overlap, E-value, and clan. The domains with green checks would be retained and the others would be removed. 45% and 33% overlapping thresholds are displayed.
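The de-overlap step in Fig. 1 can be illustrated with a minimal Python sketch. The caption states only that removal depends on percentage overlap, E-value, and clan, and that 45% and 33% thresholds are displayed. Everything else here is an assumption made for illustration: the `Hit` record, the greedy best-E-value-first order, measuring overlap against the shorter domain, and applying the 33% threshold to same-clan pairs and the 45% threshold to other pairs. It is not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pHMM domain hit (hypothetical record layout)."""
    name: str
    clan: str
    start: int      # genomic start coordinate, inclusive
    end: int        # genomic end coordinate, inclusive
    evalue: float


def overlap_fraction(a, b):
    """Shared length divided by the length of the shorter hit (assumed metric)."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return shared / shorter


def de_overlap(hits, same_clan_threshold=0.33, other_threshold=0.45):
    """Greedy filter: visit hits from best to worst E-value and keep a hit
    only if it does not overlap an already-kept hit beyond the threshold.
    Which threshold applies to which case is an assumption."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        clash = False
        for other in kept:
            limit = same_clan_threshold if hit.clan == other.clan else other_threshold
            if overlap_fraction(hit, other) > limit:
                clash = True
                break
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Hypothetical hits: dom2 overlaps dom1 heavily, shares its clan, and has a worse E-value.
hits = [
    Hit("dom1", "clanA", 100, 400, 1e-30),
    Hit("dom2", "clanA", 250, 500, 1e-5),
    Hit("dom3", "clanB", 390, 700, 1e-20),
]
kept_names = [h.name for h in de_overlap(hits)]
```

With these made-up coordinates, `dom2` is dropped because it overlaps the better-scoring `dom1` by about 60% of its length, while `dom3` survives because its overlap with `dom1` is only a few percent.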
Fig. 2 Construction of Clusters. An example (with a restricted set of rows and columns) of how the row clustering is performed. The goal of this clustering is to group related rows together. A row is any grouping of a genomic set of domains (usually a whole virus or a specific virus’s domain neighborhood). For each domain (listed across the top as clans and domains), the inverse square of the domain distance from a domain of interest is used as the value for that column (nearest neighbors would have a value of 1/1² = 1 [blue] and a neighbor 4 domains away would have a value of 1/4² = 0.0625 [red]). If a virus (row) doesn’t contain the clan, the column in that row is assigned a value of 0 (equivalent to a large distance). The result is that rows which have a similar pattern of domains (like the two Salmonella phages) are clustered next to each other. 10.6084/m9.figshare.11879253.v1.
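The row values described in Fig. 2 can be sketched in a few lines of Python. The inverse-square weighting and the 0 for absent domains come from the caption. The function names, the example domain names, and two tie-breaking choices (distance is measured to the nearest copy of the domain of interest, and a domain that occurs several times keeps its largest value) are assumptions for illustration only.

```python
def inverse_square_row(domains, center):
    """Return {domain name: 1/d**2} for one genome.

    domains: domain (or clan) names in genomic order.
    center:  the domain of interest; d is the distance, counted in domain
             positions, to the nearest occurrence of it (assumed).
    """
    centers = [i for i, name in enumerate(domains) if name == center]
    row = {}
    if not centers:
        return row  # genome lacks the domain of interest
    for i, name in enumerate(domains):
        if name == center:
            continue
        d = min(abs(i - c) for c in centers)
        value = 1.0 / d ** 2
        # Repeated domains keep their closest (largest) value (assumed).
        row[name] = max(row.get(name, 0.0), value)
    return row


def build_matrix(genomes, center):
    """Stack rows into a matrix; domains absent from a genome get 0."""
    rows = {g: inverse_square_row(d, center) for g, d in genomes.items()}
    columns = sorted({name for r in rows.values() for name in r})
    matrix = {g: [r.get(c, 0.0) for c in columns] for g, r in rows.items()}
    return columns, matrix


# Hypothetical genomes with made-up domain names.
genomes = {
    "virus_1": ["A", "B", "Helicase", "C", "D", "E", "F"],
    "virus_2": ["B", "Helicase", "C"],
}
columns, matrix = build_matrix(genomes, "Helicase")
row_1 = inverse_square_row(genomes["virus_1"], "Helicase")
```

Here `row_1["B"]` is 1/1² = 1 and `row_1["F"]` is 1/4² = 0.0625, matching the two values quoted in the caption. The resulting matrix could then be passed to any hierarchical clustering routine.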
Fig. 3 Domain neighborhoods centered around a helicase domain. (a) Dendrogram of neighborhoods centered on helicase-associated domains for a set of viruses. (b) Domain neighborhoods of genomes that contain helicase-associated domains. The helicase-associated domains are the center (0) positions. Each track (set of domains in a row) corresponds to the virus and dendrogram branch in (a). (c) Full dendrogram following clustering of helicase domains. (d) Domain neighborhood of all helicase-associated-domain-containing genomes. The area enclosed in the red box represents the subset used in panels (a,b). (e) Cluster position of viral families throughout the constructed neighborhood. (f) Mosaic plot of common domains that co-occur with helicase domains. The size of each block is scaled to the sum of the inverse-square distances between the named domain and the domain of interest (in this case helicase), using the same metric as in Fig. 2. This mosaic plot aggregates all of the reference viruses together, thus showing the conserved partners of helicases (largest domains at the top of the chart). 10.6084/m9.figshare.11879253.v1.
Fig. 4 Comparison of Whole Genome vs. Gene Search method for finding domains. Two representations of the distribution of the number of unique domains per viral genome. Higher numbers mean that more unique domains were found using the 6-frame translated contigs than using the gene/ORF method (as expected). More unique domains are found per genome when using the whole contig than when using genes or identified ORFs. The same E-value cut-off, 0.01, was used in each case. 10.6084/m9.figshare.12132762.v1.
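For readers unfamiliar with the 6-frame translation mentioned in Fig. 4, the following is a minimal standard-library sketch. It assumes the standard genetic code and translates straight through stop codons (written as `*`). The function names are hypothetical, unknown codons are written as `X`, and the source does not say which tool the authors used for this step.

```python
# Standard genetic code, laid out in the usual TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq):
    """Return the three forward and three reverse-strand translations."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translate("ATGGCCTAA")
```

Each of the six protein strings can then be searched against a profile HMM library, which is what allows domains outside annotated genes to be found.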
Fig. 5 Two example Viral Genomes. Viral genomes showing the annotated coding genes in yellow, and the identified PFAM domains in red and blue. (a) NC_001418, “Pseudomonas phage Pf3”, showing representative correspondence between the contig-domain method and the gene-domain method. (b) NC_001500, “Spleen focus-forming virus”, showing the advantage of the contig-based method, specifically that additional high-quality domains are identified outside of annotated coding regions (GAG_P12, RVE, RVT_1, RVP). The start and stop of each domain are demarcated by the bottom and top of the blue and red bars, respectively. The blue bars indicate the domains as identified within the six-frame translated portion of the viral genome’s contig. The red portion shows the domain as identified within the gene. Gray indicates that one of the methods didn’t find the whole extent of the domain compared with the other. In (b), there are several domains (GAG_P12, RVE, RVT_1, RVP) that have no corresponding red bar, since no domain was identified with the gene method. 10.6084/m9.figshare.12132903.v1.
Fig. 6 Taxonomic Grouping from Domain Neighborhood Clusters. Contingency matrix of virus family and cluster number. The matrix is scaled to the maximum value on a per-cluster basis. Values closer to 1 are darker. Clusters tend to contain only a single virus family. 10.6084/m9.figshare.11879253.v1.
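The per-cluster scaling described in Fig. 6 can be sketched as follows. The input format (a list of family and cluster pairs), the function name, and the example counts are hypothetical. Only the scaling rule, dividing each count by the largest count in its cluster, comes from the caption.

```python
from collections import Counter


def scaled_contingency(assignments):
    """Build a family-by-cluster contingency matrix and scale every
    cluster (column) so that its largest entry equals 1.

    assignments: iterable of (family, cluster) pairs, one per genome.
    Returns {family: {cluster: scaled count}}.
    """
    counts = Counter(assignments)
    families = sorted({family for family, _ in counts})
    clusters = sorted({cluster for _, cluster in counts})
    # Every cluster contains at least one genome, so each maximum is > 0.
    col_max = {c: max(counts[(f, c)] for f in families) for c in clusters}
    return {
        f: {c: counts[(f, c)] / col_max[c] for c in clusters}
        for f in families
    }


# Made-up example: cluster 1 is mostly Siphoviridae, cluster 2 is all Flaviviridae.
assignments = (
    [("Siphoviridae", 1)] * 3
    + [("Myoviridae", 1)]
    + [("Flaviviridae", 2)] * 2
)
matrix = scaled_contingency(assignments)
```

A cluster dominated by one family shows a single value of 1 with the remaining entries near 0, which is the pattern the caption describes.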
Fig. 7 Flavivirus example. (a) Dendrogram following unsupervised clustering using the FLAVI_NS1 domain as the center. All genomes containing the NS1 domain were included in the clustering. (b) Corresponding domain neighborhood of NS1-containing genomes. NS1 is the center (0) position. Additional FLAVI-associated domains are commonly found near NS1 in many genomes. (c) FLAVI_NS1 mosaic plot displaying domains commonly occurring with NS1. 10.6084/m9.figshare.11879253.v1.
Fig. 8 Clustering of Unclassified Phage. (a) Genomic neighborhoods of genomes clustered using helicase domains as the center. Zooming in on these neighborhoods reveals genomes characterized as unclassified that have a series of close neighbors belonging to the Siphoviridae family (b). The dendrogram in (b) places this unclassified bacterial virus amongst members of the Siphoviridae family, indicating that it could potentially be a member of this family of viruses. (c) Further examination of the genomic neighborhood corresponding to the region displayed in (b) shows the similarity of the local domain structure to that of members of the Siphoviridae family. 10.6084/m9.figshare.11879253.v1.
Unclassified Virus Families with Similar Viruses and Putative Families.
| Accession | Virus Name | Family Left | Family Right | Putative Family |
|---|---|---|---|---|
| NC_009552 | Geobacillus_virus_E2 | Siphoviridae | Siphoviridae | |
| NC_009552 | Geobacillus_virus_E2 | Siphoviridae | Siphoviridae | |
| NC_011356 | Enterobacteria_phage_YYZ-2008 | Siphoviridae | Siphoviridae | Siphoviridae |
| NC_011356 | Enterobacteria_phage_YYZ-2008 | Siphoviridae | Siphoviridae | Siphoviridae |
| NC_012531 | Solenopsis_invicta_virus_3 | Tymoviridae | Tymoviridae | |
| NC_024489 | Asterionellopsis_glacialis_RNA_virus | Dicistroviridae | Dicistroviridae | |
| NC_027867 | Mollivirus_sibericum | Myoviridae | Myoviridae | |
| NC_027867 | Mollivirus_sibericum | Iridoviridae | Podoviridae | Iridoviridae |
| NC_030453 | Diaphorina_citri_flavi-like_virus | Flaviviridae | Flaviviridae | |
| NC_030651 | Nylanderia_fulva_virus_1 | Endornaviridae | Endornaviridae | Endornaviridae |
| NC_031264 | Brucella_phage_BiPBO1 | Siphoviridae | Siphoviridae | Siphoviridae |
| NC_031264 | Brucella_phage_BiPBO1 | Siphoviridae | Siphoviridae | Siphoviridae |
| NC_031912 | Flavobacterium_phage_Fpv20 | Myoviridae | Myoviridae | |
| NC_031914 | Flavobacterium_phage_Fpv1 | Myoviridae | Myoviridae | |
| NC_031917 | Pseudoalteromonas_phage_BS5 | Siphoviridae | Siphoviridae | Siphoviridae |
| NC_032127 | Beihai_hepe-like_virus_4 | Astroviridae | Astroviridae | |
| NC_032128 | Wuhan_spider_virus_2 | Iflaviridae | Iflaviridae | |
| NC_032129 | Beihai_mollusks_virus_1 | Picornavirales | Picornavirales | |
| NC_032206 | Beihai_picorna-like_virus_35 | Picornavirales | Picornavirales | |
| NC_032221 | Hubei_picorna-like_virus_31 | Iflaviridae | Iflaviridae | |
| NC_032412 | Posavirus_sp | Dicistroviridae | Dicistroviridae | |
| NC_032434 | Beihai_mantis_shrimp_virus_5 | Dicistroviridae | Picornavirales | Dicistroviridae |
| NC_032444 | Beihai_barnacle_virus_3 | Astroviridae | Astroviridae | |
| NC_032445 | Beihai_mantis_shrimp_virus_1 | Alphatetraviridae | Alphatetraviridae | |
| NC_032446 | Beihai_barnacle_virus_4 | Dicistroviridae | Dicistroviridae | |
| NC_032456 | Beihai_hepe-like_virus_8 | Astroviridae | Astroviridae | |
| NC_032464 | Beihai_mantis_shrimp_virus_4 | Dicistroviridae | Dicistroviridae | |
| NC_032473 | Wenzhou_picorna-like_virus_3 | Picornavirales | Picornavirales | |
| NC_032530 | Beihai_picorna-like_virus_64 | Dicistroviridae | Dicistroviridae | |
| NC_032539 | Beihai_picorna-like_virus_100 | Dicistroviridae | Dicistroviridae | |
| NC_032545 | Beihai_picorna-like_virus_32 | Iflaviridae | Iflaviridae | |
| NC_032554 | Beihai_sipunculid_worm_virus_2 | Hepeviridae | Hepeviridae | |
| NC_032556 | Beihai_picorna-like_virus_61 | Dicistroviridae | Dicistroviridae | |
| NC_032578 | Beihai_picorna-like_virus_69 | Dicistroviridae | Dicistroviridae | |
| NC_032588 | Beihai_picorna-like_virus_93 | Hypoviridae | Hypoviridae | |
| NC_032590 | Beihai_picorna-like_virus_63 | Dicistroviridae | Dicistroviridae | |
| NC_032593 | Beihai_razor_shell_virus_1 | Dicistroviridae | Dicistroviridae | |
| NC_032611 | Beihai_picorna-like_virus_125 | Dicistroviridae | Dicistroviridae | |
| NC_032619 | Beihai_picorna-like_virus_4 | Dicistroviridae | Dicistroviridae | |
| NC_032638 | Beihai_picorna-like_virus_91 | Iflaviridae | Iflaviridae | |
| NC_032639 | Beihai_picorna-like_virus_33 | Dicistroviridae | Dicistroviridae | |
| NC_032642 | Beihai_picorna-like_virus_82 | Dicistroviridae | Dicistroviridae | |
| NC_032755 | Changjiang_picorna-like_virus_3 | Dicistroviridae | Dicistroviridae | |
| NC_032756 | Hubei_picorna-like_virus_40 | Iflaviridae | Iflaviridae | |
| NC_032758 | Hubei_tick_virus_2 | Hypoviridae | Hypoviridae | |
| NC_032759 | Hubei_picorna-like_virus_56 | Endornaviridae | Endornaviridae | |
| NC_032766 | Hubei_orthoptera_virus_1 | Dicistroviridae | Dicistroviridae | |
| NC_032769 | Hubei_picorna-like_virus_42 | Iflaviridae | Iflaviridae | |
| NC_032774 | Hubei_diptera_virus_1 | Dicistroviridae | Dicistroviridae | Dicistroviridae |
| NC_032790 | Hubei_picorna-like_virus_39 | Iflaviridae | Iflaviridae | Iflaviridae |
| NC_032798 | Shahe_endorna-like_virus_1 | Myoviridae | Endornaviridae | Myoviridae |
| NC_032808 | Wenzhou_picorna-like_virus_2 | Dicistroviridae | Dicistroviridae | |
| NC_032834 | Wenling_picorna-like_virus_5 | Dicistroviridae | Picornavirales | Dicistroviridae |
| NC_032874 | Changjiang_crawfish_virus_1 | Picornaviridae | Picornaviridae | |
| NC_032890 | Wenzhou_channeled_applesnail_virus_3 | Dicistroviridae | Dicistroviridae | |
| NC_032911 | Wenzhou_picorna-like_virus_39 | Dicistroviridae | Dicistroviridae | |
| NC_032916 | Hubei_picorna-like_virus_17 | Picornavirales | Dicistroviridae | Picornavirales |
| NC_032940 | Shahe_picorna-like_virus_1 | Picornaviridae | Picornaviridae | |
| NC_032979 | Hubei_picorna-like_virus_26 | Iflaviridae | Iflaviridae | |
| NC_032989 | Wenzhou_shrimp_virus_4 | Picornavirales | Picornavirales | |
| NC_032990 | Hubei_picorna-like_virus_63 | Togaviridae | Togaviridae | |
| NC_033001 | Wenzhou_picorna-like_virus_37 | Podoviridae | Podoviridae | |
| NC_033003 | Hubei_picorna-like_virus_61 | Hypoviridae | Hypoviridae | |
| NC_033025 | Hubei_picorna-like_virus_38 | Dicistroviridae | Dicistroviridae | |
| NC_033030 | Hubei_coleoptera_virus_1 | Iflaviridae | Iflaviridae | |
| NC_033041 | Wenzhou_picorna-like_virus_20 | Iflaviridae | Iflaviridae | |
| NC_033061 | Hubei_odonate_virus_2 | Iflaviridae | Iflaviridae | |
| NC_033094 | Hubei_tetragnatha_maxillosa_virus_2 | Iflaviridae | Iflaviridae | |
| NC_033099 | Hubei_picorna-like_virus_28 | Iflaviridae | Iflaviridae | |
| NC_033108 | Hubei_picorna-like_virus_32 | Dicistroviridae | Iflaviridae | Dicistroviridae |
| NC_033131 | Wenzhou_picorna-like_virus_5 | Dicistroviridae | Dicistroviridae | |
| NC_033136 | Hubei_picorna-like_virus_43 | Iflaviridae | Iflaviridae | |
| NC_033151 | Wenzhou_picorna-like_virus_4 | Iflaviridae | Iflaviridae | |
| NC_033184 | Wenzhou_hepe-like_virus_1 | Endornaviridae | Endornaviridae | |
| NC_033194 | Sanxia_water_strider_virus_8 | Iflaviridae | Iflaviridae | |
| NC_033195 | Hubei_picorna-like_virus_35 | Iflaviridae | Iflaviridae | |
| NC_033204 | Hubei_endorna-like_virus_1 | Endornaviridae | Endornaviridae | |
| NC_033205 | Wenzhou_picorna-like_virus_26 | Dicistroviridae | Dicistroviridae | |
| NC_033206 | Hubei_odonate_virus_5 | Iflaviridae | Iflaviridae | Iflaviridae |
| NC_033226 | Hubei_picorna-like_virus_51 | Iflaviridae | Iflaviridae | |
| NC_033227 | Hubei_picorna-like_virus_22 | Dicistroviridae | Dicistroviridae | |
| NC_033238 | Hubei_picorna-like_virus_41 | Endornaviridae | Endornaviridae | |
| NC_033243 | Hubei_odonate_virus_4 | Iflaviridae | Iflaviridae | |
| NC_033246 | Sanxia_picorna-like_virus_3 | Dicistroviridae | Dicistroviridae | |
| NC_033262 | Sanxia_picorna-like_virus_5 | Marnaviridae | Marnaviridae | |
| NC_033419 | Wuchan_romanomermis_nematode_virus_1 | Hepeviridae | Hepeviridae | |
| NC_033421 | Wuhan_arthropod_virus_1 | Astroviridae | Astroviridae | |
| NC_033462 | Wuhan_fly_virus_4 | Iflaviridae | Iflaviridae | |
| NC_033714 | Wuhan_spider_virus_5 | Virgaviridae | Virgaviridae | |
| NC_033722 | Wuhan_insect_virus_33 | Dicistroviridae | Dicistroviridae | Dicistroviridae |
| NC_033825 | Bat_badicivirus_1 | Iflaviridae | Iflaviridae | |
| NC_034248 | Rhizobium_phage_RHEph10 | Mimiviridae | Herpesviridae | Mimiviridae |
| NC_034249 | Kaumoebavirus | Mimiviridae | Myoviridae | Mimiviridae |
| NC_034249 | Kaumoebavirus | Phycodnaviridae | Mimiviridae | Phycodnaviridae |
| NC_034383 | Pacmanvirus_A23 | Mimiviridae | Phycodnaviridae | Mimiviridae |
| NC_034622 | Alfalfa_virus_S | Alphaflexiviridae | Alphaflexiviridae | Alphaflexiviridae |
| NC_035465 | Morelia_viridis_nidovirus | Coronaviridae | Coronaviridae | |
| Measurement(s) | Protein Domain • RNA viral genome • DNA viral genome • protein domain neighborhoods • protein domain cluster |
| Technology Type(s) | digital curation • bioinformatics method • Cluster Analysis |
| Factor Type(s) | Viral Genome |
| Sample Characteristic - Organism | Viruses |