Daniel J White, Jing Wang, Richard J Hall.
Abstract
Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on the disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, the assembly of these reads into longer, contiguous stretches of nucleotide sequence (contigs), and the identification of the contigs by matching them against known databases, such as those hosted by GenBank or Ensembl. This technique, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, we recently found that results can differ strikingly depending on which assembler is used. In this study, we formally test the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and the assembly of its whole genome, using a data set in which we have confirmed the presence of the virus by empirical laboratory techniques, and we compare overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The implications of these results for the field of pathogen discovery are considered.
Keywords: algorithms; assemblers; de novo metagenomics; pathogen discovery; test
Year: 2017 PMID: 28414526 PMCID: PMC5610382 DOI: 10.1089/cmb.2017.0008
Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479
Assemblers Used and Their Parameters
| MIRA | job = est,denovo,accurate parameters = COMMON_SETTINGS -NW:cnfs = no:cmrnl = no -GE:not = 12 SOLEXA_SETTINGS -CL:cpat = no |
| MIRA (singletons) | job = est,denovo,accurate parameters = COMMON_SETTINGS -NW:cnfs = no:cmrnl = no -GE:not = 12 SOLEXA_SETTINGS -AS:mrpc = 1 -OUT:sssip = yes -CL:cpat = no |
| VELVET velveth: | kmer is 99 and used paired-end reads |
| velvetg: | -cov_cutoff auto -exp_cov auto -unused_reads yes |
| METAVELVET velveth: | kmer is 99 and used paired-end reads |
| velvetg: | -cov_cutoff auto -exp_cov auto -unused_reads yes |
| SPADES | -k 21,33,55,77,99,127 --careful --pe1-1 --pe1-2 --cov-cutoff auto |
| OMEGA | -pe -l 100 |
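The VELVET and SPADES settings in the table can be sketched as full command lines. This is only an illustration: the flags are those listed above, but the input file names, output directories, and helper function names are hypothetical, and the use of a single interleaved FASTQ file for `velveth` is an assumption rather than something the table states. MIRA, METAVELVET, and OMEGA are left out because the table does not give enough detail to reconstruct their invocations. The commands are built as argument lists and printed, not executed.

```python
import shlex

# Hypothetical input file names; only the flags come from the parameter table.
READS_INTERLEAVED = "reads_interleaved.fastq"
READS_R1 = "reads_R1.fastq"
READS_R2 = "reads_R2.fastq"


def velvet_commands(out_dir="velvet_k99"):
    """Return (velveth, velvetg) argument lists for a k=99 paired-end run.

    Note: a k-mer of 99 needs a Velvet build compiled with a large enough
    MAXKMERLENGTH; the default build does not support it.
    """
    velveth = ["velveth", out_dir, "99", "-fastq", "-shortPaired", READS_INTERLEAVED]
    velvetg = [
        "velvetg", out_dir,
        "-cov_cutoff", "auto",
        "-exp_cov", "auto",
        "-unused_reads", "yes",
    ]
    return velveth, velvetg


def spades_command(out_dir="spades_out"):
    """Return the SPAdes argument list with the k-mer series from the table."""
    return [
        "spades.py",
        "-k", "21,33,55,77,99,127",
        "--careful",
        "--pe1-1", READS_R1,
        "--pe1-2", READS_R2,
        "--cov-cutoff", "auto",
        "-o", out_dir,
    ]


if __name__ == "__main__":
    # Print shell-quoted commands for inspection instead of running them.
    for cmd in (*velvet_commands(), spades_command()):
        print(shlex.join(cmd))
```

Keeping each command as a list makes it straightforward to pass to `subprocess.run` later without shell quoting problems.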
Performance of Five Different Assemblers in Virus Discovery
| | MIRA | MIRA (singletons) | VELVET | METAVELVET | SPADES | OMEGA |
| General assembly | ||||||
| Wall clock time | 08:36:07 | 19:52:01 | 00:14:51 | 00:17:03 | 1d11:06:25 | 01:28:22 |
| Size (megabytes) | 221.1 | 418.5 | 384.7 | 384.7 | 38.2 | 9.8 |
| No. of reads assembled | 4,427,786 (43.1) | 6,341,535 (61.8) | 4,201,448 (40.9) | 4,201,448 (40.9) | N/A | N/A |
| No. of contigs | 512,750 | 1,616,750 | 1,002,664 | 1,002,675 | 87,340 | 16,269 |
| N50 | 458 | 273 | 348 | 348 | 468 | 582 |
| Maximum contig length (bp) | 32,359 | 22,061 | 2418 | 2344 | 12,339 | 11,634 |
| Consensus | 202,611,745 | 370,456,167 | 350,400,905 | 350,424,276 | 34,622,064 | 8,196,640 |
| Sensitivity | ||||||
| No. of contigs assigned taxon id (%) | 447,185 (87.2) | 1,353,011 (83.7) | 873,014 (87.1) | 878,000 (87.6) | 71,498 (81.9) | 11,962 (73.5) |
| No. of “unknown” contigs (%) | 62,664 (12.2) | 257,368 (15.9) | 124,513 (12.4) | 119,625 (11.9) | 15,018 (17.2) | 4057 (24.9) |
| No. of viral taxa | 20 | 30 | 44 | 39 | 12 | 8 |
| Number of unique viral taxa | 0 | 5 | 3 | 1 | 0 | 0 |
| Power and accuracy | ||||||
| Max bit score for a viral contig | 4697 | 4697 | 4213 | 4213 | 4744 | 4473 |
| Length of contig (in bp) | 2885 | 2885 | 2344 | 2344 | 2765 | 2490 |
| Taxon ID of highest scoring viral contig | Sewage-associated circular DNA virus 27 | Sewage-associated circular DNA virus 27 | Sewage-associated circular DNA virus 27 | Sewage-associated circular DNA virus 27 | Sewage-associated circular DNA virus 27 | Sewage-associated circular DNA virus 27 |
| Region of rowi kiwi CVLV with highest BLASTn hit | 1740/1740 | 1740/1740 | 1952/1952 | 1580/1580 | 1713/1713 | 1693/1694 |
| Detection of rowi kiwi CVLV homologue | Yes (1) | Yes (1) | Yes (2) | Yes (2) | No | No |
| Highest BLAST bit score to rowi kiwi CVLV homologue | 66.2 (54/66) | 66.2 (54/66) | 113 (149/207) | 113 (149/207) | / | / |
Defined as the number of reads that contributed to contigs.
Set at species level, with minimum number of contributing contigs before taxon assigned by MEGAN set at 2.
Defined as the number of taxa not shared with another assembler.
When VELVET and METAVELVET are combined, there are 19 viruses reported that are not found with the other assemblers.
Identity across region expressed as a fraction of longest matching region, in base pairs.
Meles meles circovirus-like virus (White et al., 2016).
CVLV, circovirus-like virus.
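Several metrics in the performance table can be recomputed from raw assembler output. The sketch below shows one common way to calculate N50, the percentages given in parentheses, and the "unique viral taxa" count as defined in the notes (taxa not shared with another assembler). The function names and the toy taxon labels are hypothetical, and the N50 definition used here (the contig length at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size) is the conventional one, assumed rather than confirmed to match the authors' tooling.

```python
def n50(contig_lengths):
    """Length of the contig at which the cumulative length, counted from the
    longest contig down, first reaches half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths is empty")


def percent(part, whole):
    """Percentage rounded to one decimal place, as reported in the table."""
    return round(100.0 * part / whole, 1)


def unique_taxa(taxa_by_assembler):
    """Map each assembler to the taxa that no other assembler reported."""
    unique = {}
    for name, taxa in taxa_by_assembler.items():
        others = set().union(
            *(t for other, t in taxa_by_assembler.items() if other != name)
        )
        unique[name] = set(taxa) - others
    return unique


if __name__ == "__main__":
    # Toy contig lengths, not data from the study.
    print(n50([2, 3, 4, 5, 6]))  # 5

    # First-column values from the table: contigs assigned a taxon id,
    # "unknown" contigs, and the total number of contigs.
    print(percent(447185, 512750))  # 87.2
    print(percent(62664, 512750))   # 12.2

    # Hypothetical taxon labels for three hypothetical assemblers.
    toy = {"A": {"v1", "v2"}, "B": {"v2", "v3"}, "C": {"v2"}}
    print({k: sorted(v) for k, v in unique_taxa(toy).items()})
```

Taking the union of taxa across assemblers, rather than relying on one assembler's list, reflects the table's pattern: some viral taxa appear in the output of only one assembler.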