Guillermo Rangel-Pineros, Andrew Millard, Slawomir Michniewski, David Scanlan, Kimmo Sirén, Alejandro Reyes, Bent Petersen, Martha R J Clokie, Thomas Sicheritz-Pontén.
Abstract
Background: Fast and computationally efficient strategies are required to explore genomic relationships within an increasingly large and diverse phage sequence space. Here, we present PhageClouds, a novel approach using a graph database of phage genomic sequences and their intergenomic distances to explore the phage genomic sequence space.
Keywords: comparative genomics; genomic graph database; phage genomics
Year: 2021 PMID: 36147515 PMCID: PMC9041511 DOI: 10.1089/phage.2021.0008
Source DB: PubMed Journal: Phage (New Rochelle) ISSN: 2641-6530
FIG. 1. The number of complete phage genomes deposited in GenBank over time. The introduction of NGS technologies in the early 2000s was followed by an exponential increase in the number of phage genomes deposited in GenBank. The stacked bar plot on the left indicates the proportion of phage genomes that target different bacterial genera. The top eight targeted bacterial genera account for half of the phage genomes currently available in GenBank.
List of Reference Databases Used for Building Our Graph Database
| Source | Sequence count | Date accessed | Entries with targeted host (%) | Entries with country (%) |
|---|---|---|---|---|
| GenBank | 17,062 | July 29, 2021 | 87.5 | 62.1 |
| GPD | 142,809 | January 1, 2021 | 23 | 92.7 |
| TARA Oceans | 195,728 | November 2, 2020 | 0 | 0 |
| GTDB prophages | 64,180 | July 17, 2020 | 98.6 | 0 |
| IMG/VR terrestrial | 45,364 | January 29, 2021 | 9.82 | 0 |
| PIGEON | 95,047 | June 5, 2021 | 0 | 0 |
| GVD | 15,330 | June 23, 2020 | 0 | 0 |
| CHVD | 42,142 | June 5, 2021 | 0 | 0 |
| Horse virome | 1,640 | June 5, 2021 | 0 | 0 |
| MarineUKSouth | 9,915 | June 5, 2021 | 0 | 0 |
| Slurry | 6,633 | June 5, 2021 | 0 | 0 |
Data on the number of sequences provided by each database, the last date they were accessed, and the percentage of entries that have information on targeted hosts and countries of isolation/detection are provided.
CHVD, Cenote Human Virome Database; GPD, Gut Phage Database; GTDB, Genome Taxonomy Database; GVD, Gut Virome Database; IMG/VR, Integrated Microbial Genome/Virus; PIGEON, Phages and Integrated Genomes Encapsidated Or Not.
FIG. 2. Clouds of phages targeting Pseudomonas. The graph database was queried to retrieve all phages that target the genus Pseudomonas and all phages to which they are connected at a maximum intergenomic distance of 0.15. Node colors indicate the database from which the corresponding phages were collected, and node size is proportional to genome size. The red arrow points to a cloud largely composed of GenBank phages targeting other genera, which is further discussed in the text. Names of some representative Pseudomonas aeruginosa phages are displayed next to the clouds that contain them.
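The retrieval described in this caption can be illustrated with a minimal Python sketch. It is not the PhageClouds implementation: it assumes that a cloud corresponds to a connected component of the graph once edges above the distance threshold are discarded, and the function name `find_clouds`, the edge tuple layout, and the phage labels are all hypothetical.

```python
from collections import defaultdict, deque


def find_clouds(edges, seeds, max_distance=0.15):
    """Return the groups of phages ("clouds") reachable from the seed phages.

    edges: iterable of (phage_a, phage_b, distance) tuples (hypothetical layout).
    seeds: phages matching the query, e.g. those annotated as targeting Pseudomonas.
    """
    # Keep only the edges at or below the intergenomic distance threshold.
    adjacency = defaultdict(set)
    for a, b, distance in edges:
        if distance <= max_distance:
            adjacency[a].add(b)
            adjacency[b].add(a)

    clouds, seen = [], set()
    for seed in seeds:
        if seed in seen:
            continue  # this seed already belongs to a cloud found earlier
        cloud, queue = {seed}, deque([seed])
        seen.add(seed)
        # Breadth-first traversal over the filtered graph.
        while queue:
            node = queue.popleft()
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    cloud.add(neighbour)
                    queue.append(neighbour)
        clouds.append(cloud)
    return clouds


# Hypothetical distances between five made-up phages.
edges = [
    ("P1", "P2", 0.10),
    ("P2", "P3", 0.14),
    ("P3", "P4", 0.30),  # above the threshold, so P4 is not reached
    ("P5", "P6", 0.05),  # not connected to the seed
]
clouds = find_clouds(edges, seeds=["P1"])
```

A seed with no edge under the threshold comes back as a one-member cloud, which matches the notion of a singleton used in the later figures.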
Running Times for Searching Phage Clouds Related to Different Sets of Query Phages, Using Different Combinations of Search Parameters and 20 Computing Cores
| Query phages | 0.15, GB | 0.21, GB | 0.15, WGD | 0.21, WGD |
|---|---|---|---|---|
| Single-query phage | 00:00:10 | 00:00:10 | 00:00:11 | 00:00:15 |
| 79 new GenBank phages | 00:13:42 | 00:25:09 | 00:39:37 | 02:22:11 |
Column headers give the selected intergenomic distance threshold, followed by the graph database entries included in the search.
Running times are given in h:min:s format.
GB, GenBank; WGD, Whole graph database.
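To compare the running times above numerically, the h:min:s strings can be converted to seconds. The short Python sketch below does this for the shortest and longest times in the 79-phage row; the helper name `to_seconds` is hypothetical and is not part of PhageClouds.

```python
def to_seconds(hms):
    """Convert an 'h:min:s' string such as '02:22:11' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Shortest and longest running times reported for the 79 new GenBank phages.
fastest = to_seconds("00:13:42")
slowest = to_seconds("02:22:11")
slowdown = slowest / fastest
print(f"{fastest} s vs {slowest} s, ratio {slowdown:.1f}")
```

For this row, the slowest search takes roughly ten times as long as the fastest one.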
FIG. 3. Searching phage clouds for a set of input query phages and user-defined intergenomic distance thresholds. A set of 79 phage genomes from GenBank that were not included in the graph database was used for searching phage clouds using two different intergenomic distance thresholds, 0.15 (A, C) and 0.21 (B, D). (A, B) show the result of searching clouds composed exclusively of GenBank phages, while the clouds in (C, D) include phages from all reference databases. (E) illustrates some examples of query phages shown as singletons in (B) but captured within some of the clouds present in (D). Arrows in (B, C, D) point to the query nodes present in the clouds found in (E). Node colors indicate the database from which the corresponding phages were collected, and node size is proportional to genome size.
Best Reference Matches for 16 Singleton Phage Genomes That Resulted from a GenBank-Constrained Search of the Graph Database
| Query phage | Best match | Source database | Intergenomic distance |
|---|---|---|---|
| LC629455:uncultured_marine_virus | Station58_DCM_ALL_assembly_NODE_127_length_41467_cov_96.610693 | TARA Oceans | 0.108736 |
| LC629456:uncultured_marine_virus | Station58_DCM_ALL_assembly_NODE_174_length_37974_cov_33.677971 | TARA Oceans | 0.101924 |
| LC629458:Caudovirales_sp | Station56_SUR_ALL_assembly_NODE_61_length_88758_cov_21.840028 | TARA Oceans | 0.064562 |
| LC629459:uncultured_marine_virus | Station18_DCM_ALL_assembly_NODE_544_length_22784_cov_15.229311 | TARA Oceans | 0.166993 |
| LC629461:uncultured_marine_virus | Flavobacteriales_bacterium__NHFO01000063_phage12__64kb | GTDB prophages | 0.128538 |
| LC629467:uncultured_marine_virus | Bacterium_isolate__PAZJ01000031_phage54__57kb | GTDB prophages | 0.077934 |
| LC629474:uncultured_marine_virus | Station31_SUR_ALL_assembly_NODE_1914_length_19272_cov_27.587397 | TARA Oceans | 0.168987 |
| LC629494:Caudovirales_sp | Station30_DCM_ALL_assembly_NODE_1322_length_15351_cov_25.741828 | TARA Oceans | 0.181003 |
| LC629500:Siphoviridae_sp | Station58_DCM_ALL_assembly_NODE_568_length_22502_cov_143.650866 | TARA Oceans | 0.037512 |
| LC629563:uncultured_marine_virus | Station100_SUR_ALL_assembly_NODE_1036_length_14465_cov_49.298959 | TARA Oceans | 0.123668 |
| LC629575:uncultured_marine_virus | Station123_SUR_ALL_assembly_NODE_2449_length_10403_cov_69.936799 | TARA Oceans | 0.129898 |
| LC629600:uncultured_marine_virus | Station76_DCM_ALL_assembly_NODE_3213_length_10752_cov_25.308030 | TARA Oceans | 0.108739 |
| LC629609:Libanvirus_sp | Station18_DCM_ALL_assembly_NODE_18_length_107714_cov_7.163711 | TARA Oceans | 0.058169 |
| LC629612:uncultured_marine_virus | PIGEON_EarthsVirome_17958 | PIGEON | 0.160845 |
| MT025940:Enquatrovirus_sp | Station168_IZZ_ALL_assembly_NODE_477_length_89554_cov_52.179924 | TARA Oceans | 0.208478 |
| MW822601:Synechoccus_phage_S-SRP02 | IMGVR_UViG_3300035703_000158 | IMG/VR (Terrestrial) | 0.204246 |
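A table of this kind can be derived by keeping, for each query phage, the reference with the smallest intergenomic distance. The Python sketch below shows one way to do that under stated assumptions: the function name `best_matches` and the tuple layout are hypothetical, and the second candidate row is made up purely for illustration. Only the first row is taken from the table.

```python
def best_matches(distances, max_distance=None):
    """Keep the closest reference for every query phage.

    distances: iterable of (query, reference, source_db, distance) tuples
    (hypothetical layout). Pairs above max_distance are ignored when it is set.
    """
    best = {}
    for query, reference, source, distance in distances:
        if max_distance is not None and distance > max_distance:
            continue
        # Replace the stored match only when the new one is strictly closer.
        if query not in best or distance < best[query][2]:
            best[query] = (reference, source, distance)
    return best


candidates = [
    # Row taken from the table above.
    ("LC629500:Siphoviridae_sp",
     "Station58_DCM_ALL_assembly_NODE_568_length_22502_cov_143.650866",
     "TARA Oceans", 0.037512),
    # Made-up, more distant candidate for the same query (illustration only).
    ("LC629500:Siphoviridae_sp", "hypothetical_reference", "GPD", 0.190000),
]
matches = best_matches(candidates, max_distance=0.21)
```

Queries with no reference under the threshold are simply absent from the returned dictionary, so they can be reported separately as unmatched.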
FIG. 4. Herelleviridae clouds colored by genus. Clouds containing at least one member of the family Herelleviridae were retrieved from the graph database, using an intergenomic distance threshold of 0.15. Nodes are colored based on genus memberships, according to the annotations in the corresponding GenBank files. Gray nodes correspond to Herelleviridae phages without genus affiliation. Black nodes correspond to GenBank phages classified in other families or not currently classified at the family level. White nodes refer to phages from any of the other databases used to create the graph database. The red arrows highlight examples discussed in the text of GenBank phages classified in other phage families.
FIG. 5. Searching phage clouds for a set of 971 marine vOTUs. The search was conducted over the complete graph database using an intergenomic distance threshold of 0.20. (A) illustrates a section of the obtained phage clouds (the complete set is illustrated in Supplementary Figure S3). Yellow nodes represent vOTUs from the U.K. coastal water viromes, and blue nodes represent entries from the TARA Oceans data set included in the PhageClouds graph database. (B) shows a detailed view of the cloud highlighted in red from (A). The green and blue arrows point to the phage sequences present in the sequence alignment depicted in (C). vOTUs, viral operational taxonomic units.