Mauricio Barrientos-Somarribas, David N Messina, Christian Pou, Fredrik Lysholm, Annelie Bjerkner, Tobias Allander, Björn Andersson, Erik L L Sonnhammer.
Abstract
Massive amounts of metagenomics data are currently being produced, and in all such projects a sizeable fraction of the resulting data shows little or no homology to known sequences. It is likely that this fraction contains novel viruses, but identification is challenging since they frequently lack homology to known viruses. To overcome this problem, we developed a strategy to detect ORFan protein families in shotgun metagenomics data, using similarity-based clustering and a set of filters to extract bona fide protein families. We applied this method to 17 virus-enriched libraries originating from human nasopharyngeal aspirates, serum, feces, and cerebrospinal fluid samples. This resulted in 32 predicted putative novel gene families. Some families showed detectable homology to sequences in metagenomics datasets and protein databases after reannotation. Notably, one predicted family matches an ORF from the highly variable Torque Teno virus (TTV). Furthermore, follow-up from a predicted ORFan resulted in the complete reconstruction of a novel circular genome. Its organisation suggests that it most likely corresponds to a novel bacteriophage in the Microviridae family, and it was therefore named bacteriophage HFM.
Year: 2018 PMID: 29311716 PMCID: PMC5758519 DOI: 10.1038/s41598-017-18341-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Table 1. Summary of the sequenced viral-enriched libraries.
| Sample type | DNA Libraries | RNA Libraries | 454 Platform | Total Reads |
|---|---|---|---|---|
| Feces | 1 | 1 | Titanium | 1 459 816 |
| Serum | 4 | 3 | GS FLX & Titanium | 1 095 915 |
| Nasopharyngeal Swabs | 2 | 2 | GS & GS FLX | 703 790 |
| Nasopharyngeal & Throat Swabs | 1 | 1 | Titanium | 432 919 |
| Cerebrospinal Fluid | 1 | 1 | Titanium | 209 748 |
| Total | | | | 3 902 188 |
The libraries analysed were prepared by pooling patient specimens of five different sample types. At least one DNA and one RNA library were sequenced for each sample type, and in some cases more. The libraries were sequenced between 2008 and 2012, a period during which the 454 pyrosequencing platform evolved; this is reflected in the different total number of reads per library.
Figure 1. Flowchart of the ORFan protein family prediction pipeline. The diagram starts with the raw set of reads from the libraries described in Table 1. Squares in blue describe the preprocessing steps performed to obtain a data set consisting of unannotated sequences. The unannotated sequences were subsequently processed through our prediction pipeline (in green), resulting in 32 predicted protein families.
Figure 2. (a) Summary of the resulting 32 high-confidence families. The scatterplot summarizes the basic statistics of the predicted proteins. The X and Y axes encode ORF length and RNAcode p-value, respectively, while the size of each dot is scaled by the number of sequences in the alignment. Protein families are colored based on their hits to the different databases. (b) Example of RNAcode output for predicted ORFan family 457. The multiple alignment for cluster 457 is shown with the RNAcode-predicted peptide sequence on top and the high-scoring segment highlighted in yellow. Codons colored in green indicate synonymous mutations, suggesting that selective pressure acts on those sites to preserve the amino acid. In contrast, pink or red codons indicate non-synonymous mutations, which do not preserve the encoded amino acid.
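The distinction between synonymous and non-synonymous codon changes described for panel (b) can be illustrated with a minimal sketch. It assumes the standard genetic code and two in-frame, gap-free sequences of equal length; the function names `classify_codon_change` and `classify_alignment` are hypothetical, and this is not how RNAcode itself scores alignments, which relies on a statistical model rather than a simple per-codon comparison.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Label a codon pair as identical, synonymous, or non-synonymous."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon == alt_codon:
        return "identical"
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"  # nucleotides differ, amino acid is preserved
    return "non-synonymous"  # the encoded amino acid changes


def classify_alignment(ref, other):
    """Compare two in-frame, gap-free, equal-length sequences codon by codon."""
    if len(ref) != len(other) or len(ref) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    return [
        classify_codon_change(ref[i:i + 3], other[i:i + 3])
        for i in range(0, len(ref), 3)
    ]


# Toy example (made-up sequences): CTT -> CTC keeps Leu, GAA -> GAT changes Glu to Asp.
labels = classify_alignment("ATGCTTGAA", "ATGCTCGAT")
```

An excess of synonymous over non-synonymous changes across an alignment is the kind of signal that suggests a region is under selection to preserve its protein product.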
Summary of how many of the 32 predicted novel ORFan protein families have hits to various microbiome or other databases.
| Protein family origin | # families | Microbiome DBs: MetaHIT | Microbiome DBs: NCBI env | Microbiome DBs: HMRGD | NCBI nr | No Hits |
|---|---|---|---|---|---|---|
| Fecal | 23 | 23 | human gut (22); marine & human gut (1) | 0 | 2 | 0 |
| Fecal & CSF | 1 | 1 | gut (1) | 0 | 0 | 0 |
| Serum | 1 | 0 | — | 0 | 1 | 0 |
| Serum & CSF | 5 | 0 | wastewater (1) | 0 | 1 | 3 |
| Serum, CSF & Mucus | 2 | 1 | wastewater (1) | 0 | 1 | 0 |
The families are grouped by the source of their samples.
Protein family hits to described proteins.
| Family | DB | Tool | Best Hit (protein) | Best hit (species) | Curated Annotation |
|---|---|---|---|---|---|
| 1217 | nr | blastp & hmmsearch | unknown | Veillonella sp. CAG:933 | Bacterial protein |
| 532 | nr | hmmsearch | hypothetical protein | M. rupellensis | Bacterial protein |
| 565b | nr | blastp | hypothetical protein H257_12751 | A. astaci | Putative replication protein, viral or bacterial |
| 565b | nr | hmmsearch | putative replication protein | Phytophthora parasitica virus | Putative replication protein, viral or bacterial |
| 956b | nr | blastp & hmmsearch | hypothetical protein | C. trachomatis | Torque Teno virus ORF |
Four of the 32 ORFan protein families match proteins in the NR database. The table describes for each family: (1) the database of the hit, (2) the tool used to detect the similarity, (3) the description of the highest scoring hit, (4) the annotated species for the highest scoring hit, and (5) our manually curated annotation for the protein based on all the significant hits for the protein family. Manual annotation was required since the best hit for a sequence does not always correspond to the most plausible annotation, due to wrong metadata or to the discovery of a distant relative of a protein conserved in many different organisms.
Figure 3. (a) Diagram of the bacteriophage HFM genome. This circular fragment was amplified from fecal samples using primers designed based on cluster 179a. The genome contains 7 candidate ORFs, all of which are located on the same strand and together cover ~93% of the genome. Annotation suggests viral provenance due to the presence of virus-like protein motifs such as a phage capsid motif (cap) and a replication protein (rep). The protein family (cluster 179a) from which the primers were designed is highlighted in light blue in ORF 6. (b) Phylogenetic tree showing the position of bacteriophage HFM in relation to 54 clearly annotated Microviridae genomes from the public databases. Due to lack of homology, it was impossible to include more distantly related sequences. The tree is a maximum likelihood tree, calculated using RAxML with 1000 bootstraps.
Figure 4. Detection of known viruses using homology-based methods. Our method was used to detect unknown viruses, while Kaiju, Kraken and MetaPhlAn2 were used to detect known viruses. The UpSet[49] plot shows the overlaps between the viral species detected by each tool and our method. The number of species detected by each tool is given in parentheses next to the tool name, and each bar reflects the number of viruses detected by a specific combination of tools. The Phytophthora parasitica virus was detected as a novel family by our method because it was not a known virus at that time, and the family matching TTV had only very weak similarity.
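The bar heights in an UpSet-style plot correspond to exclusive intersections: the number of species detected by exactly one particular combination of tools. A minimal sketch of that bookkeeping follows. The function name `exclusive_intersections`, the tool labels, and the species identifiers are all hypothetical placeholders and do not reproduce the results in the figure.

```python
from collections import Counter


def exclusive_intersections(detections):
    """Count items detected by exactly each combination of tools.

    `detections` maps a tool name to the set of species it reported.
    Returns a Counter keyed by frozenset of tool names.
    """
    membership = {}
    for tool, species_set in detections.items():
        for species in species_set:
            membership.setdefault(species, set()).add(tool)
    return Counter(frozenset(tools) for tools in membership.values())


# Made-up example data, for illustration only.
example = {
    "tool_a": {"virus_1", "virus_2", "virus_3", "virus_5"},
    "tool_b": {"virus_2", "virus_3"},
    "tool_c": {"virus_3", "virus_4"},
}
counts = exclusive_intersections(example)

# Print one line per combination, largest bars first.
for combo, n in counts.most_common():
    print(" & ".join(sorted(combo)), n)
```

Because each species is assigned to exactly one combination, the counts sum to the total number of distinct species across all tools.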