Roman Kogay, Taylor B Neely, Daniel P Birnbaum, Camille R Hankel, Migun Shakya, Olga Zhaxybayeva.
Abstract
Many of the sequenced bacterial and archaeal genomes encode regions of viral provenance. Yet, not all of these regions encode bona fide viruses. Gene transfer agents (GTAs) are thought to be former viruses that are now maintained in genomes of some bacteria and archaea and are hypothesized to enable exchange of DNA within bacterial populations. In Alphaproteobacteria, genes homologous to the "head-tail" gene cluster that encodes structural components of the Rhodobacter capsulatus GTA (RcGTA) are found in many taxa, even if they are only distantly related to Rhodobacter capsulatus. Yet, in most genomes available in GenBank RcGTA-like genes have annotations of typical viral proteins, and therefore are not easily distinguished from their viral homologs without additional analyses. Here, we report a "support vector machine" classifier that quickly and accurately distinguishes RcGTA-like genes from their viral homologs by capturing the differences in the amino acid composition of the encoded proteins. Our open-source classifier is implemented in Python and can be used to scan homologs of the RcGTA genes in newly sequenced genomes. The classifier can also be trained to identify other types of GTAs, or even to detect other elements of viral ancestry. Using the classifier trained on a manually curated set of homologous viruses and GTAs, we detected RcGTA-like "head-tail" gene clusters in 57.5% of the 1,423 examined alphaproteobacterial genomes. We also demonstrated that more than half of the in silico prophage predictions are instead likely to be GTAs, suggesting that in many alphaproteobacterial genomes the RcGTA-like elements remain unrecognized.
Keywords: Rhodobacter capsulatus; GTA; binary classification; carbon depletion; support vector machine; virus exaptation
Year: 2019 PMID: 31560374 PMCID: PMC6821227 DOI: 10.1093/gbe/evz206
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Fig. 1.—The “head–tail” cluster of the Rhodobacter capsulatus GTA “genome” and the amino acid composition of viral and alphaproteobacterial homologs for some of its genes. Genes that are used in the machine-learning classification are highlighted in gray. For those genes, the heatmap below a gene shows the relative abundance of each amino acid (rows) averaged across the RcGTA-like and viral homologs that were used in the classifier training (columns). The amino acids are sorted by the absolute difference in the average relative abundance between RcGTA-like and viral homologs, which was additionally averaged across 11 genes. The heatmaps of the amino acid composition in the individual homologs are shown in supplementary figure S1, Supplementary Material online.
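The ordering described in this caption can be illustrated with a minimal Python sketch. It assumes one gene and two small sets of protein sequences; the function names and the toy sequences are hypothetical and are not taken from GTA-Hunter. The additional averaging across the 11 genes mentioned in the caption is omitted for brevity.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def relative_abundance(protein):
    """Fraction of each standard amino acid in one protein sequence."""
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def mean_abundance(proteins):
    """Average the per-sequence relative abundances over a set of homologs."""
    profiles = [relative_abundance(p) for p in proteins]
    return {aa: sum(p[aa] for p in profiles) / len(profiles) for aa in AMINO_ACIDS}

def rank_by_difference(gta_proteins, viral_proteins):
    """Sort amino acids by |mean GTA abundance - mean viral abundance|, largest first."""
    gta = mean_abundance(gta_proteins)
    virus = mean_abundance(viral_proteins)
    diffs = {aa: abs(gta[aa] - virus[aa]) for aa in AMINO_ACIDS}
    return sorted(AMINO_ACIDS, key=lambda aa: diffs[aa], reverse=True)

# Toy sequences (hypothetical, for illustration only).
gta_like = ["AAAAGG", "AAAGGG"]
viral = ["KKKKGG", "KKKGGG"]
ranking = rank_by_difference(gta_like, viral)
print(ranking[:2])  # the two amino acids that differ most between the toy sets
```

In this toy input, alanine and lysine differ most between the two sets while glycine does not differ at all, so A and K head the ranking.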
Fig. 2.—The pseudocode of the SVM classifier algorithm that distinguishes RcGTA-like genes from the “true” viruses. The algorithm is implemented in the GTA-Hunter software package (see “Software Implementation” section).
Number of the RcGTA Homologs in the “True GTA” and “True Virus” Training Data Sets
| Gene | “True GTAs” | “True Viruses” |
|---|---|---|
| | 69 | 1,646 |
| | 65 | 769 |
| | 62 | 465 |
| | 67 | 627 |
| | 61 | 19 |
| | 62 | 96 |
| | 66 | 61 |
| | 63 | 12 |
| | 73 | 57 |
| | 67 | 124 |
| | 67 | 155 |
The Combinations of Features and Parameters That Showed the Highest Weighted Accuracy Score (WAS) in Cross-Validation
| Gene | Weighted Accuracy Score, WAS (%) | Matthews Correlation Coefficient, MCC | k-mer (Size) | PseAAC (Value of | Grouping Based on Physicochemical Properties of Amino Acids | | |
|---|---|---|---|---|---|---|---|
| | 100 | 1 | 2 | — | — | 10,000 | 0.02 |
| | 100 | 1 | 3 | — | — | 10,000 | 0.02 |
| | 100 | 1 | 3 | 3 | — | 10,000 | 0.02 |
| | 100 | 1 | 3 | — | — | 100 | 0.02 |
| | 95.9 | 0.88 | 4 | — | + | 0.1 | 0.02 |
| | 99.4 | 0.98 | 2 | 3 | — | 0.1 | 0.03 |
| | 100 | 1 | 2 | — | — | 100 | 0.1 |
| | 95.6 | 0.90 | 5 | — | — | 10,000 | 0.05 |
| | 99.1 | 0.98 | 2 | — | — | 100 | 0 |
| | 99.6 | 0.99 | 6 | 6 | — | 0.01 | 0.03 |
| | 99.7 | 0.99 | 2 | — | — | 10,000 | 0.02 |
Note.—The listed parameter sets were used in predictions of the RcGTA-like genes in 1,423 alphaproteobacterial genomes. See Materials and Methods for the procedure on selecting one parameter set in the cases where multiple parameter sets had the identical highest WAS.
Throughout the table, “—” denotes that the feature type was not used.
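Two of the quantities in this table, the k-mer features and the Matthews correlation coefficient, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GTA-Hunter implementation: the function names, the toy sequence, and the confusion-matrix counts are hypothetical. The weighted accuracy score is not defined in this section, so it is not reproduced here.

```python
from collections import Counter
from math import sqrt

def kmer_frequencies(protein, k):
    """Relative frequency of each overlapping k-mer in a protein sequence."""
    if k < 1 or len(protein) < k:
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(protein[i:i + k] for i in range(len(protein) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC from the four cells of a binary confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Toy sequence (hypothetical): its overlapping 2-mers are MK, KM, MK, KA.
freqs = kmer_frequencies("MKMKA", 2)

# Hypothetical counts: a classifier with no errors has MCC = 1,
# which corresponds to the rows above that report WAS = 100 and MCC = 1.
perfect = matthews_corrcoef(tp=10, tn=20, fp=0, fn=0)
print(freqs, perfect)
```

A frequency dictionary of this kind would be turned into a fixed-length vector (one position per possible k-mer) before being passed to an SVM; that step is left out to keep the sketch short.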
Distribution of Prophages and RcGTA-Like Elements across Different Orders within Class Alphaproteobacteria
| Order | Number of Genomes | Number of Prophages | Number of RcGTA-Like Clusters | Number of OTUs | Corrected Abundance of RcGTA-Like Clusters | Percentage of OTUs That Have RcGTA-Like Clusters |
|---|---|---|---|---|---|---|
| Acetobacterales | 62 | 34 | 0 | 34 | 0 | 0 |
| Azospirillales | 13 | 10 | 0 | 12 | 0 | 0 |
| Caedibacterales | 1 | 0 | 0 | 1 | 0 | 0 |
| Caulobacterales | 50 | 30 | 39 | 45 | 35 | 78 |
| Elsterales | 1 | 0 | 0 | 1 | 0 | 0 |
| Kiloniellales | 5 | 1 | 0 | 3 | 0 | 0 |
| Oceanibaculales | 2 | 1 | 0 | 2 | 0 | 0 |
| Paracaedibacterales | 1 | 2 | 0 | 1 | 0 | 0 |
| Parvibaculales | 5 | 5 | 2 | 5 | 2 | 40 |
| Pelagibacterales | 5 | 0 | 0 | 5 | 0 | 0 |
| Rhizobiales | 730 | 763 | 435 | 300 | 155 | 52 |
| Rhodobacterales | 241 | 318 | 208 | 174 | 150 | 86 |
| Rhodospirillales | 24 | 10 | 0 | 15 | 0 | 0 |
| Rickettsiales | 70 | 18 | 0 | 24 | 0 | 0 |
| Sneathiellales | 2 | 1 | 0 | 2 | 0 | 0 |
| Sphingomonadales | 207 | 115 | 132 | 169 | 110 | 65 |
| Thalassobaculales | 1 | 0 | 0 | 1 | 0 | 0 |
| Unclassified order 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| Unclassified order 2 | 1 | 2 | 1 | 1 | 1 | 100 |
| Unclassified order 3 | 1 | 2 | 1 | 1 | 1 | 100 |
See “Detection of RcGTA-Like Genes and Head–Tail Clusters in Alphaproteobacteria” section for explanation about the correction.
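The percentages in the last column appear to equal the corrected abundance divided by the number of OTUs, and the raw cluster counts sum to the 57.5% of 1,423 genomes quoted in the abstract. The short Python check below reproduces both figures from the table values. It rests on two assumptions that are inferred from the numbers and not stated in this section: that the last column is computed this way, and that each genome contributes at most one counted cluster. Variable and function names are illustrative.

```python
# (order, genomes, RcGTA-like clusters, OTUs, corrected abundance), copied from the table above.
# Only orders with at least one RcGTA-like cluster are listed; all other orders have zero.
rows = [
    ("Caulobacterales", 50, 39, 45, 35),
    ("Parvibaculales", 5, 2, 5, 2),
    ("Rhizobiales", 730, 435, 300, 155),
    ("Rhodobacterales", 241, 208, 174, 150),
    ("Sphingomonadales", 207, 132, 169, 110),
    ("Unclassified order 2", 1, 1, 1, 1),
    ("Unclassified order 3", 1, 1, 1, 1),
]
TOTAL_GENOMES = 1423  # number of examined genomes, from the abstract

def percent_otus_with_clusters(corrected, otus):
    """Assumed definition: corrected abundance as a percentage of OTUs, rounded."""
    return round(100 * corrected / otus)

per_order = {name: percent_otus_with_clusters(corr, otus)
             for name, _, _, otus, corr in rows}

total_clusters = sum(clusters for _, _, clusters, _, _ in rows)
overall = round(100 * total_clusters / TOTAL_GENOMES, 1)

print(per_order)
print(total_clusters, overall)
```

Under these assumptions the per-order values match the table (78, 40, 52, 86, 65, 100, 100), and the 818 clusters correspond to 57.5% of the 1,423 genomes.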
Fig. 3.—Distribution of the detected RcGTA-like clusters across the class Alphaproteobacteria. The presence of RcGTA-like clusters is mapped to a reference phylogenetic tree that was reconstructed from a concatenated alignment of 83 marker genes (see Materials and Methods and supplementary table S9, Supplementary Material online). The branches of the reference tree are collapsed at the taxonomic rank of “order,” and the number of OTUs within the collapsed clade is shown in parentheses next to the order name. Orange and brown bars depict the proportion of OTUs with and without the predicted RcGTA-like clusters, respectively. The orders that contain at least one OTU with an RcGTA-like cluster are colored in green. Nodes 1–3 mark the last common ancestors of the unclassified orders. Node 4 marks the lineage where, based on this study, the RcGTA-like element should have already been present. Nodes 5 and 7 mark the lineages that were previously inferred to represent the last common ancestor of the RcGTA-like element by Shakya et al. (2017) and Lang and Beatty (2007), respectively. Node 6 marks the clade where RcGTA-like elements are the most abundant. The tree is rooted using homologs from Escherichia coli str. K12 substr. DH10B and Pseudomonas aeruginosa PAO1 genomes. Branches with ultrafast bootstrap values ≥95% are marked with black circles. The scale bar shows the number of substitutions per site. The full reference tree is provided in the FigShare repository.
Fig. 4.—An overlap between prophage and GTA predictions. The “predicted RcGTA-like clusters” set refers to the GTA-Hunter predictions, whereas the “predicted intact prophages” set denotes predictions made by the PHASTER program (Arndt et al. 2016) on the subset of the genomes that are found within clade 4 (fig. 3).
Fig. 5.—The number of predicted “intact” prophages in alphaproteobacterial genomes. The 1,423 genomes were divided into two groups: those without GTA-Hunter-predicted RcGTA-like clusters (in brown) and those with these RcGTA-like clusters (in dark orange). For the latter group, the number of prophages was recalculated after the RcGTA-like clusters that were designated as prophages were removed (in light orange). The distribution of the number of predicted intact prophages within each data set is shown as a violin plot with the black point denoting the average value. The data sets with significantly different average values are denoted by asterisks (P < 0.001; Mann–Whitney U test).