Robert A Edwards, Katelyn McNair, Karoline Faust, Jeroen Raes, Bas E Dutilh.
Abstract
Metagenomics has changed the face of virus discovery by enabling the accurate identification of viral genome sequences without requiring isolation of the viruses. As a result, metagenomic virus discovery leaves the first and most fundamental question about any novel virus unanswered: What host does the virus infect? The diversity of the global virosphere and the volumes of data obtained in metagenomic sequencing projects demand computational tools for virus-host prediction. We focus on bacteriophages (phages, viruses that infect bacteria), the most abundant and diverse group of viruses found in environmental metagenomes. By analyzing 820 phages with annotated hosts, we review and assess the predictive power of in silico phage-host signals. Sequence homology approaches are the most effective at identifying known phage-host pairs. Compositional and abundance-based methods contain significant signal for phage-host classification, providing opportunities for analyzing the unknowns in viral metagenomes. Together, these computational approaches further our knowledge of the interactions between phages and their hosts. Importantly, we find that all reviewed signals significantly link phages to their hosts, illustrating how current knowledge and insights about the interaction mechanisms and ecology of coevolving phages and bacteria can be exploited to predict phage-host relationships, with potential relevance for medical and industrial applications. © FEMS 2015.
Keywords: CRISPR; co-occurrence; metagenomics; oligonucleotide usage; phages; viruses of microbes
Year: 2015 PMID: 26657537 PMCID: PMC5831537 DOI: 10.1093/femsre/fuv048
Source DB: PubMed Journal: FEMS Microbiol Rev ISSN: 0168-6445 Impact factor: 16.408
Computational signals to identify bacteriophage–host relationships. The column ‘Performance’ shows for how many of the 820 phages in our benchmarking dataset we could correctly predict the host species (see Fig. 4).
| Signal category | Explanation and approach | Performance | Comments |
|---|---|---|---|
| Abundance profiles | Phages can only thrive in an environment if their host is also present. Phage and bacterial abundance patterns in metagenomes can be used to identify their association by (lagged) correlation. | Bacterium with the most similar abundance profile is the correct host species for 9.5% of the phages. | The metagenomics protocol affects the sensitivity of detecting phages and bacteria in a sample. Ecological processes such as Kill-the-Winner can lead to non-linear dynamics that confound straightforward correlations. Stratification of samples by environment may improve the performance. |
| Genetic homology | Genetic homology between phage and bacterial nucleotide and protein sequences may represent sequences that were acquired by a phage during a past infection event. | Top hit is the correct host species for 38.5% and 29.8% of the phages with blastn and blastx, respectively. | This signal depends on a comprehensive reference database to identify which bacteria are most similar to a given phage. Some gene families are more prone to horizontal gene transfer, leading to some genes being more frequently shared. |
| CRISPRs | Bacteria place a 25 to 75 bp fragment of an infecting phage sequence into CRISPR arrays on their genome. These arrays can be identified and the spacers aligned to phage genomes to detect recent infections. Multiple spacers between a bacterium and a phage enhance the signal. | Bacterium with the most similar CRISPR spacer is the correct host for 15.1% of the phages. Bacterium with the highest number of CRISPR spacers is the correct host for 21.3% of phages. | Only ∼40% of bacteria and ∼70% of archaea encode a CRISPR system, and the spacers in a CRISPR array are rapidly turned over in the environment. Most CRISPR spacers do not match any known sequence, so although this approach is specific (few false positives), it is not very sensitive (many false negatives). |
| Exact matches | Exact matches between phage and bacterial genomes can represent integration sites, CRISPR spacers, regions of genetic homology or integrated prophages. | Bacterium with the longest exact match is the correct host species for 40.5% of the phages. | Very short exact matches around the length of integration sites do not contain a significant signal as they can occur randomly. |
| Oligonucleotide profiles | Over time, phages ameliorate their nucleotide composition towards that of the host. This reflects intracellular nucleotide pools, codon usage and tRNA availability, and restriction-modification systems. | Bacterium with the most similar 4-mer or codon usage profile is the correct host species for 17.2% or 10.4% of the phages, respectively. | Contrary to this signal, it has often been observed that prophages have a different nucleotide usage profile than the surrounding host genome. Some phages carry tRNA genes to alter the typical host codon usage profile. GC content is a 1D measure that does not have a lot of discriminatory power. |
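
As an illustration of the oligonucleotide-profile signal in the table, the following minimal sketch builds tetramer (k = 4) frequency profiles and ranks candidate hosts by Pearson correlation with the phage profile. The function names (`kmer_profile`, `pearson`, `rank_hosts`) are hypothetical, and the choices to count a single strand and to skip windows containing ambiguous bases are assumptions made here, not details taken from the study.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Relative frequency of every A/C/G/T k-mer in seq, in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumption: skip windows with N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_hosts(phage_seq, host_seqs, k=4):
    """Sort candidate hosts by profile correlation with the phage, best first."""
    phage = kmer_profile(phage_seq, k)
    scores = {name: pearson(phage, kmer_profile(seq, k))
              for name, seq in host_seqs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real genomes.
    hosts = {"at_rich": "TATATATATATATA", "gc_rich": "GCGCGCGCGCGCGC"}
    print(rank_hosts("ATATATATATATAT", hosts))
```

On real genomes the same ranking would be computed over whole phage and bacterial sequences; whether reverse complements are merged into the profile is a design choice that this sketch leaves out.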
Figure 4. Percentage of phages with a correctly predicted bacterial species among the top scoring hosts using the different computational phage–host prediction approaches. Only the highest scoring bacteria were included, but if multiple top scoring hosts were present, the prediction was scored as correct if the correct host was among the predicted hosts. For details, including the percentage of phages with a correctly predicted host at different taxonomic levels, see Tables S1–18 (Supporting Information).
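
The scoring rule in this caption (keep only the top scoring hosts, and count a prediction as correct if the annotated host is among the tied top scorers) can be sketched as follows. The data layout, a mapping from phage to per-host scores, and the function names are assumptions made for illustration.

```python
def top_hosts(scores):
    """Return the set of hosts that share the maximum score."""
    if not scores:
        return set()
    best = max(scores.values())
    return {host for host, score in scores.items() if score == best}


def percent_correct(predictions, true_hosts):
    """Percentage of phages whose annotated host is among the top scorers.

    predictions: phage -> {host: score}
    true_hosts:  phage -> annotated host
    A phage without any scored host counts as not correctly predicted.
    """
    correct = sum(
        1 for phage, host in true_hosts.items()
        if host in top_hosts(predictions.get(phage, {}))
    )
    return 100.0 * correct / len(true_hosts)


if __name__ == "__main__":
    # Toy scores for illustration only.
    predictions = {
        "p1": {"A": 5, "B": 5},  # tie: correct if the true host is A or B
        "p2": {"A": 1, "B": 3},
        "p3": {},                # no hit at all
    }
    true_hosts = {"p1": "B", "p2": "A", "p3": "A"}
    print(percent_correct(predictions, true_hosts))
```

Scoring ties as correct makes the reported percentage an optimistic estimate whenever many hosts share the top score, which is worth keeping in mind when comparing signals.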
Figure 1. ROC curves displaying the classification accuracy of computational phage–host prediction approaches. (A) Pearson correlation of phage and bacterial abundance profiles across environments; (B) overall alignment length of blastn hits between phage and bacterial genome sequences; (C) number of matching proteins in blastx search of phage DNA to bacterial proteins; (D) percent identity of CRISPR spacers aligned to phage genomes; (E) number of matching CRISPR spacers in phage genomes; (F) length of longest exact match between phage and bacterial genomes; (G) Pearson correlation of oligonucleotide usage profiles (tetramers, k = 4, for other lengths of k, see Fig. S2, Supporting Information); (H) similarity in codon usage profiles of phage and bacterial coding regions; (I) similarity in GC content between phage and bacterial genomes. Note that in some ROC plots, the TP and FP rates do not continue to FP rate = 1; TP rate = 1. In those cases, we used cutoffs for assignment of a hit.
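
For readers unfamiliar with how such curves are built, here is a minimal sketch that turns scored phage–host pairs into ROC points, with an optional score cutoff. It also shows why a cutoff can stop the curve before it reaches FP rate = 1; TP rate = 1: pairs below the cutoff are never called hits. The input format and the names `roc_points` and `auc` are assumptions for illustration, not the authors' pipeline.

```python
def roc_points(scored_pairs, cutoff=None):
    """Compute ROC points from (score, is_true_host) pairs.

    Higher scores mean stronger predicted association. Pairs scoring below
    `cutoff` are never called hits, so the curve may end before (1, 1).
    Returns a list of (false positive rate, true positive rate) tuples.
    """
    pairs = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    pos = sum(1 for _, is_true in pairs if is_true)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one true and one false pair")
    if cutoff is not None:
        pairs = [p for p in pairs if p[0] >= cutoff]

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume all pairs tied at this score before emitting one point.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC points that were actually reached."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    # Toy scores for illustration only.
    pairs = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
    print(roc_points(pairs))
    print(roc_points(pairs, cutoff=0.5))  # stops short of (1, 1)
    print(auc(roc_points(pairs)))
```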
Figure 2. The identification of the number of phages matching a CRISPR spacer in a bacterial genome depends on the number of mismatches between the spacer and the phage genome. (A) Number of phages that match at least one CRISPR spacer in a given host; (B) number of phages that match at least two CRISPR spacers in a given host. Incorrect host predictions are shown with solid bars and correct host predictions are shown with grey bars.
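
The dependence on the allowed number of mismatches can be made concrete with a small sketch that counts how many spacers of one bacterium match a phage genome at a given mismatch tolerance. It uses a brute-force ungapped scan of both strands, which is an assumption chosen for clarity rather than the alignment method used in the study, and the function names are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def min_mismatches(spacer, genome):
    """Fewest mismatches of spacer against any ungapped window, either strand.

    Returns len(spacer) if the genome is shorter than the spacer.
    """
    best = len(spacer)
    for target in (genome, revcomp(genome)):
        for i in range(len(target) - len(spacer) + 1):
            window = target[i:i + len(spacer)]
            mismatches = sum(a != b for a, b in zip(spacer, window))
            best = min(best, mismatches)
    return best


def count_matching_spacers(spacers, genome, max_mismatches=0):
    """Number of spacers matching the genome within max_mismatches."""
    return sum(1 for s in spacers
               if min_mismatches(s, genome) <= max_mismatches)


if __name__ == "__main__":
    # Toy sequences for illustration only.
    phage = "AAACCCGGGTTTACGTACGT"
    spacers = ["CCCGGG", "CCCGGA", "TTTTTT"]
    for allowed in (0, 1):
        print(allowed, count_matching_spacers(spacers, phage, allowed))
```

Raising `max_mismatches` increases the number of matching spacers, and with it the chance of spurious hits, which is the trade-off the two panels explore.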
Figure 3. Histogram showing the length of the longest exact match for each phage, divided into correct and incorrect hosts. The approximate size ranges of several mechanisms leading to exact matches between phage and bacterial genomes are indicated. Note that multiple bacterial genomes can have the same longest exact match with a given phage, in which case they are all included.
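
A minimal sketch of the longest-exact-match signal is given below. It uses a quadratic dynamic-programming longest-common-substring routine on a single strand, which is an assumption that only suits short toy sequences; genome-scale comparisons would need an index-based method. The `min_length` parameter and the function names are hypothetical; the parameter reflects the observation that very short matches can occur randomly, and tied hosts are all returned, as in the figure.

```python
def longest_exact_match(a, b):
    """Length of the longest substring shared by a and b (O(len(a)*len(b)))."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                # Extend the match that ended at the previous positions.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best = cur[j]
        prev = cur
    return best


def hosts_with_longest_match(phage, hosts, min_length=1):
    """Return (set of hosts tied for the longest exact match, match length).

    hosts: name -> sequence. Matches shorter than min_length are ignored,
    in which case the returned set is empty.
    """
    lengths = {name: longest_exact_match(phage, seq)
               for name, seq in hosts.items()}
    best = max(lengths.values(), default=0)
    if best < min_length:
        return set(), best
    return {name for name, n in lengths.items() if n == best}, best


if __name__ == "__main__":
    # Toy sequences for illustration only.
    hosts = {"h1": "TTACGTTG", "h2": "GGTTGCAA", "h3": "CCCC"}
    print(hosts_with_longest_match("ACGTTGCA", hosts))
```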