Uljana Hesse, Peter van Heusden, Bronwyn M Kirby, Israel Olonade, Leonardo J van Zyl, Marla Trindade.
Abstract
Sequencing, assembly, and annotation of environmental virome samples are challenging. Methodological biases and differences in species abundance result in fragmentary read coverage; sequence reconstruction is further complicated by the mosaic nature of viral genomes. In this paper, we focus on biocomputational aspects of virome analysis, emphasizing latent pitfalls in sequence annotation. Using simulated viromes that mimic the challenges of environmental data, we assessed the performance of five assemblers (CLC-Workbench, IDBA-UD, SPAdes, RayMeta, ABySS). Individual analyses of relevant scaffold length fractions revealed shortcomings of some programs in reconstruction of viral genomes with excessive read coverage (IDBA-UD, RayMeta), and in accurate assembly of scaffolds ≥50 kb (SPAdes, RayMeta, ABySS). The CLC-Workbench assembler performed best in terms of genome recovery (including highly covered genomes) and correct reconstruction of large scaffolds, and was used to assemble a virome from a copper-rich site in the Namib Desert. We found that scaffold network analysis and cluster-specific read reassembly improved reconstruction of sequences with excessive read coverage, and that strict data filtering for non-viral sequences prior to downstream analyses was essential. In this study, we describe novel viral genomes identified in the Namib Desert copper site virome. Taxonomic affiliations of diverse proteins in the dataset and phylogenetic analyses of circovirus-like proteins indicated links to the marine habitat. Considering additional evidence from this dataset, we hypothesize that viruses may have been carried from the Atlantic Ocean into the Namib Desert by fog and wind, highlighting the impact of the extended environment on an investigated niche in metagenome studies.
Keywords: Namib Desert; annotation; assembly; circovirus; simulated/virtual metagenomes; virome
Year: 2017 PMID: 28167933 PMCID: PMC5253355 DOI: 10.3389/fmicb.2017.00013
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Assembly results for the simulated virome dataset SIM1 for all contigs larger than 500 nt.
| Read pairs mapped concordantly to assembly (%) | 97 | 74 | 91 | 97 | 50 |
| >1 times to same scaffold (%) | 0.1 | 0.0 | 8.8 | 0.5 | 5.5 |
| >1 times to different scaffolds (%) | 8.1 | 4.2 | 6.1 | 6.0 | 1.3 |
| Number of scaffolds (% misassembled) | 7545 (3) | 9454 (1) | 8333 (1) | 10461 (1) | 8179 (2) |
| 500–999 nt | 3400 (1) | 4336 (1) | 4016 (0) | 5288 (1) | 3974 (2) |
| 1–1.999 kb | 1946 (2) | 2655 (1) | 2127 (0) | 2811 (0) | 2190 (2) |
| 2–4.999 kb | 1207 (4) | 1556 (1) | 1285 (0) | 1416 (1) | 1336 (1) |
| 5–9.999 kb | 426 (5) | 458 (1) | 451 (1) | 411 (4) | 374 (1) |
| 10–19.999 kb | 227 (5) | 168 (2) | 224 (4) | 212 (9) | 161 (4) |
| 20–49.999 kb | 240 (7) | 189 (3) | 169 (10) | 229 (7) | 110 (6) |
| 50–99.999 kb | 68 (10) | 62 (5) | 47 (30) | 66 (14) | 25 (60) |
| 100–199.999 kb | 27 (7) | 26 (19) | 12 (17) | 24 (21) | 9 (44) |
| ≥200 kb | 4 (25) | 4 (0) | 2 (50) | 4 (25) | 0 (na) |
| Max scaffold length: | 277,192 | 257,672 | 249,422 | 262,666 | 178,000 |
| N50: | 27,249 | 15,340 | 11,876 | 17,270 | 5723 |
| Scaffolds correctly mapped to genomes | 7360 | 9389 | 8266 | 10,338 | 8022 |
| Scaffolds correctly mapped to genomes (%) | 98 | 99 | 99 | 99 | 98 |
| Number of misassembled scaffolds | 185 | 65 | 67 | 123 | 157 |
| Total correctly assembled length (Mb) | 30.1 | 31.0 | 24.8 | 31.5 | 19.0 |
| Genomes hit | 558 | 557 | 531 | 554 | 490 |
| Genomes recovered | 195 | 147 | 124 | 181 | 45 |
| ALE score [0,1] | 0.2 | 0.6 | 0.3 | 0.3 | 1.0 |
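The N50 values reported in these tables can be reproduced from a list of scaffold lengths. The following is a minimal sketch of the standard N50 definition, not the authors' own script; the function name `n50` and the `min_length` parameter are illustrative, and it assumes that "larger than 500 nt" means scaffolds of at least 500 nt are kept, as the 500–999 nt bin suggests.

```python
def n50(lengths, min_length=500):
    """Return the N50 of the scaffolds that pass the length cutoff.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembled length.
    `min_length` is an assumed cutoff mirroring the 500 nt filter.
    """
    # Keep scaffolds at or above the cutoff, longest first.
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no scaffolds pass the length cutoff")

    half_total = sum(kept) / 2
    running = 0
    for n in kept:
        running += n
        # The first scaffold that pushes the running sum to half
        # of the total length defines the N50.
        if running >= half_total:
            return n


if __name__ == "__main__":
    # Toy scaffold lengths; the 100 nt scaffold is filtered out.
    print(n50([100, 600, 700, 1000, 1200]))
```

Because N50 depends on the total assembled length, it should be read together with the misassembly percentages: a high N50 can also result from incorrectly joined scaffolds.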
Assembly results for the simulated virome dataset SIM2 for all contigs larger than 500 nt.
| Read pairs mapped concordantly to assembly (%) | 91 | 62 | 67 | 95 | 43 |
| >1 times to same scaffold (%) | 0.1 | 0.0 | 0.3 | 42.0 | 0.8 |
| >1 times to different scaffolds (%) | 8.3 | 3.8 | 6.6 | 4.8 | 0.8 |
| Number of scaffolds (% misassembled) | 11,991 (38) | 19,346 (46) | 14,196 (51) | 12,414 (47) | 11,279 (18) |
| 500–999 nt | 4268 (20) | 5593 (30) | 4389 (21) | 5826 (25) | 5093 (9) |
| 1000–1499 nt | 1684 (30) | 2513 (43) | 1881 (37) | 2040 (37) | 1914 (14) |
| 1500–2099 nt | 3373 (17) | 5955 (16) | 3050 (22) | 1475 (42) | 3217 (8) |
| ≥2100 nt: | 2666 (100) | 5285 (100) | 4876 (100) | 3073 (100) | 1055 (100) |
| Max scaffold length: | 226,834 | 47,930 | 24,446 | 365,072 | 134,592 |
| N50: | 4023 | 2001 | 2454 | 14,846 | 2001 |
| Scaffolds correctly mapped to genomes | 4697 | 9136 | 5254 | 4524 | 5586 |
| Scaffolds correctly mapped to genomes (%) | 39 | 47 | 37 | 36 | 50 |
| Number of misassembled scaffolds | 7294 | 10,210 | 8942 | 7890 | 5693 |
| Total correctly assembled length (Mb) | 6.3 | 13.3 | 6.9 | 4.6 | 6.9 |
| Genomes hit | 4592 | 8267 | 4995 | 4030 | 5120 |
| Genomes recovered | 1809 | 4420 | 1881 | 451 | 1517 |
| ALE score [0,1] | 0.4 | 0.8 | 0.7 | 0.3 | 1.0 |
Assembly results for the simulated virome dataset SIM3 for all contigs larger than 500 nt.
| Read pairs mapped concordantly to assembly (%) | 89 | 31 | 40 | 92 | 18 |
| >1 times to same scaffold (%) | 0.2 | 0.0 | 29.1 | 1.1 | 1.1 |
| >1 times to different scaffolds (%) | 2.8 | 3.1 | 3.5 | 1.4 | 0.9 |
| Number of scaffolds (% misassembled) | 9068 (2) | 9543 (1) | 8430 (1) | 12,231 (1) | 7248 (1) |
| 500–999 nt | 4433 (1) | 4537 (1) | 4571 (0) | 6657 (0) | 4067 (1) |
| 1–1.999 kb | 2449 (2) | 2694 (1) | 2295 (0) | 3280 (1) | 1873 (1) |
| 2–4.999 kb | 1352 (3) | 1548 (1) | 1041 (1) | 1573 (1) | 867 (1) |
| 5–9.999 kb | 456 (3) | 436 (1) | 266 (2) | 374 (4) | 226 (1) |
| 10–19.999 kb | 166 (5) | 154 (2) | 117 (4) | 152 (6) | 130 (4) |
| 20–49.999 kb | 165 (5) | 132 (6) | 111 (8) | 155 (7) | 75 (8) |
| 50–99.999 kb | 35 (3) | 29 (10) | 23 (22) | 28 (14) | 7 (43) |
| 100–199.999 kb | 11 (0) | 12 (8) | 6 (17) | 11 (18) | 3 (67) |
| ≥200 kb | 1 (0) | 1 (0) | 0 (na) | 1 (0) | 0 (na) |
| Max scaffold length: | 244,834 | 244,953 | 163,001 | 245,076 | 141,829 |
| N50: | 8073 | 5958 | 5262 | 4759 | 3766 |
| Scaffolds correctly mapped to genomes | 8900 | 9471 | 8380 | 12,126 | 7158 |
| Scaffolds correctly mapped to genomes (%) | 98 | 99 | 99 | 99 | 99 |
| Number of misassembled scaffolds | 168 | 72 | 50 | 105 | 90 |
| Total correctly assembled length (Mb) | 25.0 | 24.2 | 18.1 | 26.1 | 13.6 |
| Genomes hit | 565 | 552 | 514 | 559 | 479 |
| Genomes recovered | 148 | 111 | 83 | 126 | 23 |
| ALE score [0,1] | 0.2 | 0.9 | 0.8 | 0.2 | 1.0 |
Assembly results for the Namib Desert copper-site dataset for all contigs larger than 200 nt.
| Primary assembly | 6,877,654 | 38,365 | 5,356,821 | 77.9 | 913 | 35,461 |
| Sub-assembly 1 | 2,241,214 | 35,966 | 1,562,711 | 69.7 | 961 | 35,670 |
| Sub-assembly 2 | 3,629,636 | 135 | 3,385,364 | 93.3 | – | 1835 |
Figure 1. Taxonomic tree of all viral protein sequences discovered in this study. Proteins were assigned taxonomic IDs using the lowest common ancestor method. Bars depict numbers of proteins assigned to the corresponding taxa.
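The lowest common ancestor idea behind the taxonomic assignment can be illustrated with a short sketch. This is a generic version of the technique under stated assumptions, not the tool or parameters used in the study: each database hit of a protein is represented as a root-to-leaf lineage, and the protein is assigned to the deepest taxon shared by all of its hits. The function name and the toy lineages are illustrative only.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    Each lineage is a list of taxon names ordered from root to leaf.
    The LCA is the last element of the longest common prefix.
    """
    if not lineages:
        return None

    common = []
    # zip stops at the shortest lineage, so a hit resolved only to a
    # higher rank caps the depth of the assignment.
    for taxa_at_depth in zip(*lineages):
        if all(t == taxa_at_depth[0] for t in taxa_at_depth):
            common.append(taxa_at_depth[0])
        else:
            break
    return common[-1] if common else None


if __name__ == "__main__":
    # Toy lineages for the hits of one hypothetical protein.
    hits = [
        ["Viruses", "Circoviridae", "Circovirus"],
        ["Viruses", "Circoviridae", "Cyclovirus"],
    ]
    print(lowest_common_ancestor(hits))
```

A protein whose hits disagree at a low rank is therefore assigned to a higher, more conservative taxon rather than to any single best hit.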
Figure 2. Phylogenetic tree of the circovirus-like replication-associated proteins. Alignments were generated using MAFFT, the phylogenetic tree was produced using the RAxML BlackBox, and the tree was visualized using FigTree. The asterisk marks the replication and capsid proteins identified in this study.
Figure 3. Phylogenetic tree of the circovirus-like capsid proteins. Alignments were generated using MAFFT, the phylogenetic tree was produced using the RAxML BlackBox, and the tree was visualized using FigTree. The asterisk marks the replication and capsid proteins identified in this study.
Annotation table for novel circoviral-like genomes discovered in this study.
| Contig_21 | ORF1 | 128–937 (−1) | PF00844 | 44% id to hypothetical protein from Circoviridae 3 LDMD-2013 | putative gemini coat protein |
| | ORF2 | 1127–2130 (+2) | PF02407, PF00910 | 38% id to replication-associated protein of the Avon-Heathcote Estuary associated circular virus 5 | putative viral replication protein |
| | ORI | 1078–1115 | GACC | | |
| Contig_71 | ORF1 | 109–1410 (+1) | PF02407, PF00910 | 40% id to the replication-associated protein of the Avon-Heathcote Estuary associated circular virus 5 | putative viral replication protein |
| | ORF2 | 1572–2360 (−3) | PF00844 | 35% id to a hypothetical protein from Circoviridae 3 LDMD-2013 | putative gemini coat protein |
| | ORI | 42–83 | CT | | |
| Contig_176 | ORF1 | 50–955 (+2) | PF02407 | 59% id to the putative replication initiation protein of the | putative viral replication protein |
| | ORF2 | 1187–1951 (−3) | – | – | – |
| | ORI | 1–33 | CCT | | |
The nonanucleotide sequence where ssDNA synthesis is initiated is shown in bold.
Figure 4. Comparison of the published .