Annika Brinkmann1, Andreas Andrusch2, Ariane Belka3, Claudia Wylezich3, Dirk Höper3, Anne Pohlmann3, Thomas Nordahl Petersen4, Pierrick Lucas5, Yannick Blanchard5, Anna Papa6, Angeliki Melidou6, Bas B Oude Munnink7, Jelle Matthijnssens8, Ward Deboutte8, Richard J Ellis9, Florian Hansmann10, Wolfgang Baumgärtner10, Erhard van der Vries11, Albert Osterhaus12, Cesare Camma13, Iolanda Mangone13, Alessio Lorusso13, Maurilia Marcacci13, Alexandra Nunes14, Miguel Pinto14, Vítor Borges14, Annelies Kroneman15, Dennis Schmitz7,15, Victor Max Corman16, Christian Drosten16, Terry C Jones16,17, Rene S Hendriksen4, Frank M Aarestrup4, Marion Koopmans7, Martin Beer3, Andreas Nitsche2.
Abstract
Quality management and independent assessment of high-throughput sequencing-based virus diagnostics have not yet been established as a mandatory approach for ensuring comparable results. The sensitivity and specificity of viral high-throughput sequence data analysis are highly affected by bioinformatics processing using publicly available and custom tools and databases and thus differ widely between individuals and institutions. Here we present the results of the COMPARE [Collaborative Management Platform for Detection and Analyses of (Re-)emerging and Foodborne Outbreaks in Europe] in silico virus proficiency test. An artificial, simulated in silico data set of Illumina HiSeq sequences was provided to 13 different European institutes for bioinformatics analysis to identify viral pathogens in high-throughput sequence data. Comparison of the participants' analyses shows that the use of different tools, programs, and databases for bioinformatics analyses can impact the correct identification of viral sequences from a simple data set. The identification of slightly mutated and highly divergent virus genomes has been shown to be most challenging. Furthermore, the interpretation of the results, together with a fictitious case report, by the participants showed that in addition to the bioinformatics analysis, the virological evaluation of the results can be important in clinical settings. External quality assessment and proficiency testing should become an important part of validating high-throughput sequencing-based virus diagnostics and could improve the harmonization, comparability, and reproducibility of results. There is a need for the establishment of international proficiency testing, like that established for conventional laboratory tests such as PCR, for bioinformatics pipelines and the interpretation of such results.
Keywords: external quality assessment; high-throughput sequencing; next-generation sequencing; proficiency testing; virus diagnostics
Year: 2019 PMID: 31167846 PMCID: PMC6663916 DOI: 10.1128/JCM.00466-19
Source DB: PubMed Journal: J Clin Microbiol ISSN: 0095-1137 Impact factor: 5.948
Tools and programs for analysis of HTS data used in the COMPARE virus proficiency test
| Program (reference) | Application | Description/relevance for viral HTS | URL |
|---|---|---|---|
| BWA | Alignment (nucleotide) | Burrows-Wheeler Alignment Tool for efficient alignment of short sequencing reads against a large reference genome. Based on string matching with Burrows-Wheeler transform. | |
| DIAMOND | Alignment (protein) | Double-index alignment of NGS data. Shown to be as much as 20,000 times faster than comparable programs, with high sensitivity. | |
| FastQC | Quality control, trimming | Generates base quality scores and sequence contents, sequence length distributions, identification of duplicate or overrepresented sequences, adapter, and k-mer contents. | |
| Kmerfinder | Taxonomic assignment | Online user interface also allows the prediction of human and vertebrate viruses. | |
| Kraken | Alignment (nucleotide) | Uses only exact alignments for its taxonomic classification with high speed. | |
| MetaPhlAn | Taxonomic assignment | Metagenomic Phylogenetic Analysis is a tool for the taxonomic assignment of microbial communities. High accuracy and speed are supported by only high-confidence matches. Such approaches allow the assignment of 25,000 microbial reads per second but might fail with viral genomes, which often lack common markers and genes. | |
| MGMapper | Pipeline | Online tool for processing, assigning, and analyzing HTS sequences. | |
| MIRA | Assembly | Mimicking Intelligent Read Assembly, an overlap-layout-consensus graph (OLC) assembler for metagenomics data from several sequencing platforms. Assembles the most as well as the largest contigs among | |
| NCBI BLAST | Alignment (nucleotide and protein) | Basic local alignment search tool. Offers very sensitive online and stand-alone alignments of nucleotides, translated nucleotides, and protein sequences. | |
| One Codex | Taxonomic assignment | Web-based data platform for k-mer-based taxonomic classification. Very high degrees of sensitivity and specificity, even when analyzing highly divergent and mutated sequences. | |
| PAIPline | Pipeline | Pipeline for metagenomic analysis of HTS data. | |
| QUASR | Pipeline | Combination of several R packages and external software for HTS read analysis. Part of the Bioconductor project. | |
| RIEMS | Pipeline | Pipeline for metagenomics sequence analysis, combining several established programs and tools for pathogen detection in one automated workflow. Separated into a workflow of accurate and fast “basic analysis” and a more sensitive “further analysis.” | |
| Skewer | Quality control, trimming | Trimming of primer and adapter sequences focusing on the characteristics of paired-end and mate-pair reads. A statistical scheme based on quality values allows the accurate trimming of adapters with mismatches. | |
| SNAP | Alignment (nucleotide) | As much as 10 to 100 times faster than similar alignment programs but offers greater sensitivity due to richer error acceptance. | |
| SPAdes, MetaSPAdes | Assembly | De Bruijn graph assembler. MetaSPAdes specifically addresses the challenges that arise with complex metagenomics data. | |
| Taxonomer | Taxonomic assignment | Web-based tool for nucleotide- and protein-based read assignment. User-friendly interactive result visualization. Based on exact k-mer matching with low error tolerance. Speed as high as ∼32 million reads/min. Furthermore, protein-based read identification offers the detection of divergent viral sequences but is based on exact k-mer matching without error allowance. | |
| Trimmomatic | Quality control, trimming | Paired-end sequence reads can be cut from technical sequences as adapters, primers, or low-quality bases. Has been shown to improve downstream analyses considerably, for example, | |
| USEARCH | Alignment (protein) | Exceptionally high speed for protein or translated nucleotide read alignment. The sensitivity of USEARCH is comparable to that of the NCBI protein BLAST, but USEARCH is ∼350 times faster. | |
| Velvet | Assembly | Can be used for | |
Listed in alphabetical order.
Composition of the simulated sequence data set
| Organism | No. of reads | Nucleotide sequence identity with reference (%) |
|---|---|---|
| Human | 4,834,491 | 100 |
| | 500,000 | 100 |
| | 500,000 | 100 |
| | 500,000 | 100 |
| Torque teno virus | 1,917 | 100 |
| Human herpesvirus 1 | 2,000 | 100 |
| Measles virus | 1,000 | 82 |
| (Novel) avian bornavirus | 500 | 55 |
The total number of reads is 6,339,908.
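The stated total can be checked against the per-organism read counts in the table. The following is a minimal Python sketch; the labels `unnamed_1` to `unnamed_3` are placeholders for the three rows whose organism names are missing in this record and are not names from the source.

```python
# Read counts per organism, copied from the composition table.
# "unnamed_1" to "unnamed_3" are placeholder labels (assumption): the
# organism names for these three rows are missing in this record.
read_counts = {
    "Human": 4_834_491,
    "unnamed_1": 500_000,
    "unnamed_2": 500_000,
    "unnamed_3": 500_000,
    "Torque teno virus": 1_917,
    "Human herpesvirus 1": 2_000,
    "Measles virus": 1_000,
    "(Novel) avian bornavirus": 500,
}

viral_names = [
    "Torque teno virus",
    "Human herpesvirus 1",
    "Measles virus",
    "(Novel) avian bornavirus",
]

total_reads = sum(read_counts.values())
viral_reads = sum(read_counts[name] for name in viral_names)
viral_fraction = viral_reads / total_reads

print(f"total reads: {total_reads:,}")
print(f"viral reads: {viral_reads:,} ({viral_fraction:.4%} of the data set)")
```

The sum reproduces the stated total, and the four viruses together account for less than 0.1% of all reads.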
Interpretation of bioinformatics results
| Participant | Results of bioinformatics | Results of diagnostics | Participant’s background |
|---|---|---|---|
| 1 | TTV, HSV-1, MeV | HSV-1 | Bioinformatics |
| 2 | TTV, HSV-1, MeV | HSV-1 | Food and environmental health |
| 3 | TTV, HSV-1, MeV, nABV | SSPE/HSV-1 | Veterinarian, virology |
| 4 | HSV-1 | HSV-1 | University, virology |
| 5 | TTV, HSV-1, MeV, nABV | nABV | Virology |
| 6 | TTV, HSV-1, MeV, nABV | nABV | Medical research |
| 7 | TTV, HSV-1, MeV | SSPE | Animal and plant health |
| 8 | TTV, HSV-1, MeV | SSPE | Veterinarian, virology |
| 9 | TTV, HSV-1, MeV | SSPE | Public health |
| 10 | TTV, HSV-1, MeV | SSPE | Public health |
| 11 | TTV, HSV-1, MeV | SSPE | Public health and environment |
| 12 | TTV, HSV-1, MeV, nABV | SSPE/HSV-1 | Diagnostics, virology |
| 13 | TTV, HSV-1, MeV | SSPE | Virology |
Abbreviations: TTV, Torque teno virus; HSV-1, human herpesvirus 1; MeV, measles virus; nABV, novel avian bornavirus; SSPE, subacute sclerosing panencephalitis.
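The spread of diagnostic interpretations can be tallied directly from the table. This is a small illustrative sketch in Python; the dictionary is a transcription of the table rows, and the variable names are chosen here for illustration.

```python
from collections import Counter

# participant -> (viruses found by bioinformatics, diagnostic interpretation),
# transcribed from the interpretation table.
results = {
    1: ("TTV, HSV-1, MeV", "HSV-1"),
    2: ("TTV, HSV-1, MeV", "HSV-1"),
    3: ("TTV, HSV-1, MeV, nABV", "SSPE/HSV-1"),
    4: ("HSV-1", "HSV-1"),
    5: ("TTV, HSV-1, MeV, nABV", "nABV"),
    6: ("TTV, HSV-1, MeV, nABV", "nABV"),
    7: ("TTV, HSV-1, MeV", "SSPE"),
    8: ("TTV, HSV-1, MeV", "SSPE"),
    9: ("TTV, HSV-1, MeV", "SSPE"),
    10: ("TTV, HSV-1, MeV", "SSPE"),
    11: ("TTV, HSV-1, MeV", "SSPE"),
    12: ("TTV, HSV-1, MeV, nABV", "SSPE/HSV-1"),
    13: ("TTV, HSV-1, MeV", "SSPE"),
}

# How often each diagnostic interpretation was reported.
diagnosis_counts = Counter(diagnosis for _, diagnosis in results.values())

# Participants whose bioinformatics analysis detected the novel avian bornavirus.
nabv_detected = sorted(
    p for p, (found, _) in results.items() if "nABV" in found.split(", ")
)

for diagnosis, count in diagnosis_counts.most_common():
    print(f"{diagnosis}: {count}")
print("nABV detected by participants:", nabv_detected)
```

Of the four participants who detected nABV reads, two reported nABV as the diagnosis and two reported SSPE/HSV-1, which shows that read detection and clinical interpretation can diverge.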
Sensitivity for identified reads of the COMPARE virus proficiency test
| Participant | Sensitivity: Torque teno virus | Sensitivity: Human herpesvirus | Sensitivity: Measles virus | Sensitivity: Avian bornavirus | No false-positive result | Time of analysis (h) |
|---|---|---|---|---|---|---|
| 1 | 0.99 | 0.21 | 0 | √ | 3 | |
| 2 | 1.01 | 0.46 | 0 | √ | 15.5 | |
| 3 | 0.96 | 0.96 | √ | 60 | ||
| 4 | 0 | 0.10 | 0 | 0 | √ | 216 |
| 5 | 0.98 | √ | 26 | |||
| 6 | 0.84 | – | 12 | |||
| 7 | 0.94 | 4.00 | 1.41 | 0 | √ | 6 |
| 8 | 1.04 | 0.99 | 0 | √ | 7 | |
| 9 | 0.29 | 0.84 | 0.49 | 0 | √ | 5 |
| 10 | 0 | √ | 48 | |||
| 11 | 0 | √ | 14 | |||
| 12 | 1.02 | 0.23 | √ | 18 | ||
| 13 | 1.02 | 0.90 | 0.34 | 0 | √ | 48 |
Numbered randomly.
√, no false-positive result; –, false-positive result(s).
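This record does not define how sensitivity was calculated. Values above 1 (for example 4.00) suggest that it is the number of reads a participant assigned to a virus divided by the number of reads simulated for that virus. The Python sketch below works under that assumption only; the function name and the example counts are hypothetical and are not taken from the study.

```python
# Simulated read counts per virus, from the data set composition table.
SIMULATED_READS = {"TTV": 1_917, "HSV-1": 2_000, "MeV": 1_000, "nABV": 500}


def read_sensitivity(virus: str, assigned_reads: int) -> float:
    """Reads assigned to `virus` divided by reads simulated for it.

    This ratio is an assumed definition; the record does not state the
    formula used for the sensitivity table.
    """
    return assigned_reads / SIMULATED_READS[virus]


# Hypothetical counts for illustration, not results from the study.
under_assigned = read_sensitivity("MeV", 490)     # fewer reads than simulated
over_assigned = read_sensitivity("HSV-1", 8_000)  # more reads than simulated

print(f"MeV: {under_assigned:.2f}")
print(f"HSV-1: {over_assigned:.2f}")
```

Under this reading, a value above 1 would mean that more reads were attributed to a virus than were actually simulated for it, for instance through misassigned reads from other organisms.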
FIG 1 Numbers of Torque teno virus (TTV), human herpesvirus 1 (HSV-1), measles virus (MeV), and novel avian bornavirus (nABV) reads identified by participants 1 to 13.
Total time of computational analysis, maximum computer/server specifications, and reference databases used
| Participant | Time of analysis (h) | Database | Operating system | CPU | CPU MHz | RAM (GB) |
|---|---|---|---|---|---|---|
| 1 | 3 | NCBI nt | UNIX | VM | VM | VM |
| 2 | 15.5 | NCBI nt | Ubuntu 16.04 LTS | 56 | 1,270 | 378 |
| 3 | 60 | NCBI nt/nr | CentOS 6 | 24 | 2,400 | 64 |
| 4 | 216 | NCBI nt | Windows XP | Intel Core i5 | 2,300 | 8 |
| 5 | 26 | NCBI viral db | OS X | 2 | NA | NA |
| 6 | 12 | NCBI nr | Ubuntu 14.04 | 32 | 2,000 | 503 |
| 7 | 6 | ViPR and NCBI nt | BioLinux Ubuntu 14.04 | 8 | 3.6 | 16 |
| 8 | 7 | NCBI nt | CentOS 6.5 | 64 | 2,300 | 250 |
| 9 | 5 | NCBI nr | Ubuntu 12.04.5 | NA | 3,800 | 50 |
| 10 | 48 | NCBI nt | CentOS 6.5 | 2 × AMD Opteron | 2,200 | 32 |
| 11 | 14 | NCBI nt/nr | RHEL | VM, variable | VM, variable | VM, variable |
| 12 | 18 | NCBI viral db | Linux Mint | Intel Xeon X5650 | 6 × 2.67 GHz | 25 |
| 13 | 48 | NCBI nt | Ubuntu 14.04.4 LTS | 2 × AMD Opteron 6174 | 24 × 2.2 GHz | 128 |
nr, nonredundant; nt, nucleotide; db, database; VM, virtual machine; NA, not available.
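The reported analysis times vary widely across participants. A short Python sketch summarizing the "Time of analysis (h)" column of the table above:

```python
from statistics import median

# Time of analysis in hours for participants 1 to 13, from the table above.
analysis_hours = [3, 15.5, 60, 216, 26, 12, 6, 7, 5, 48, 14, 18, 48]

fastest = min(analysis_hours)
slowest = max(analysis_hours)
median_hours = median(analysis_hours)
fold_range = slowest / fastest

print(f"median: {median_hours} h")
print(f"range: {fastest} to {slowest} h ({fold_range:.0f}-fold)")
```

The table alone does not show how much of this spread is due to hardware, database choice, or workflow design, so the summary is descriptive only.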
FIG 2 Simplified comparison of different bioinformatics workflows for virus identification used in the COMPARE virus proficiency test. Colored plus signs indicate the identification of human herpesvirus (turquoise), Torque teno virus (turquoise), measles virus (blue), or avian bornavirus (red).