Denis Kutnjak, Lucie Tamisier, Ian Adams, Neil Boonham, Thierry Candresse, Michela Chiumenti, Kris De Jonghe, Jan F Kreuze, Marie Lefebvre, Gonçalo Silva, Martha Malapi-Wight, Paolo Margaria, Irena Mavrič Pleško, Sam McGreig, Laura Miozzi, Benoit Remenant, Jean-Sebastien Reynard, Johan Rollin, Mike Rott, Olivier Schumpp, Sébastien Massart, Annelies Haegeman.
Abstract
High-throughput sequencing (HTS) technologies have become indispensable tools for plant virus diagnostics and research thanks to their ability to detect any plant virus in a sample without prior knowledge. Because HTS technologies rely heavily on bioinformatic analysis of the huge number of sequences generated, it is of utmost importance that researchers have access to efficient and reliable bioinformatic tools and understand the principles, advantages, and disadvantages of the tools used. Here, we present a critical overview of the steps involved in HTS as employed for plant virus detection and virome characterization. We start with sample preparation and nucleic acid extraction as appropriate to the chosen HTS strategy, followed by basic data analysis requirements, an extensive overview of the in-depth data processing options, and taxonomic classification of the viral sequences detected. By presenting the bioinformatic tools and a detailed overview of the consecutive steps that can be used to implement a well-structured HTS data analysis in an easy and accessible way, this paper is targeted at both beginners and expert scientists engaging in HTS plant virome projects.
Keywords: bioinformatics; detection; discovery; high-throughput sequencing; plant virus
Year: 2021 PMID: 33920047 PMCID: PMC8071028 DOI: 10.3390/microorganisms9040841
Source DB: PubMed Journal: Microorganisms ISSN: 2076-2607
Figure 1. Glossary of terms commonly used in bioinformatics analysis of high-throughput sequencing (HTS) data for plant virus detection.
Figure 2. Flowchart representing different approaches for the analysis of HTS data for the detection of plant viruses. Boxes represent different steps in data analysis and interpretation. Arrows connect different possible sequences of the analysis steps. As an example, a non-exhaustive list of possible analysis tools is added in square brackets at each of the analysis steps. Tools designated with * are intended for use with long-read or, specifically, nanopore sequencing data. Pointing hands lead to the text sections (or figures) with a more detailed description of the corresponding steps.
Summary of the most commonly used similarity search strategies with advantages and limitations for each of the strategies.
| Tool Name | Advantages | Limits and Considerations | Important Thresholds |
|---|---|---|---|
| BLASTx or BLASTn | High sensitivity | Slow; intensive use of computing power if a large database is used; BLASTx needed for the detection of divergent novel viruses; BLASTn needed for the detection of viroids and noncoding regions of viral genomes or satellites; performance improved by prior assembly of contigs. | Minimum percentage of identity; length of identified region of similarity; minimal e-value, bit-score. |
| MegaBLAST | Faster than BLASTn | Less sensitive than BLASTn; only useful for detection of nucleotide sequences very similar to the ones in the database used; performance improved by prior assembly of contigs. | Minimum percentage of identity; length of identified region of similarity; minimal e-value, bit-score. |
| BLASTp | High sensitivity | Slow, need to translate nucleotide sequences to proteins first; performance improved by prior assembly of contigs; not applicable for viroids or noncoding regions of viral genomes or satellites. | Minimum percentage of identity; length of identified region of similarity; minimal e-value, bit-score. |
| DIAMOND | Faster than BLASTx | Less sensitive, annotation less accurate than BLAST; performance improved by prior assembly of contigs; only available for searches against protein databases; not applicable for viroids or noncoding regions of viral genomes or satellites. | Minimum percentage of identity; length of identified region of similarity; minimal e-value, bit-score; use sensitive mode. |
| Burrows-Wheeler transform-based mapping algorithms (e.g., BWA or Bowtie2) | Does not require prior assembly of contigs, high sensitivity for short sequences | Only allows detection of known agents. Difficult to adjust mapping stringency to (1) allow detection of divergent isolates while (2) avoiding cross-mapping between related agents; prior assembly of contigs reduces cross-mapping between related agents. | Mapping stringency (e.g., mismatch penalties, gap open/extension penalties, percent of read length matching reference, minimum percentage of identity) |
| HMMER or HMMScan | High efficiency for detection of distant homologs | Annotation more complex for protein families shared between cellular organisms and viruses; not applicable for viroids or noncoding regions of viral genomes or satellites. | Minimal e-value. |
| K-mer based classification algorithms (Kraken or Taxonomer) | Fast | Requires large computer memory; accuracy may be limited for the shorter genomes of plant viruses; the confidence scoring of the results is not straightforward. | C/Q ratio for Kraken (consult the manual). |
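
The thresholds listed for the BLAST-type searches above (percentage of identity, length of the similarity region, e-value, bit-score) can be applied programmatically to tabular search output. The following is a minimal sketch, assuming the standard 12-column BLAST tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The cutoff values, the function names, and the example records are illustrative assumptions, not recommendations from the paper; the e-value is treated here as an upper bound.

```python
import csv
import io
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float    # percentage of identical positions
    length: int      # alignment length
    evalue: float
    bitscore: float


def parse_hits(handle: Iterable[str]) -> List[Hit]:
    """Parse tab-separated BLAST output (12 default columns of -outfmt 6)."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hits.append(Hit(row[0], row[1], float(row[2]), int(row[3]),
                        float(row[10]), float(row[11])))
    return hits


def filter_hits(hits: List[Hit],
                min_pident: float = 80.0,      # hypothetical cutoff
                min_length: int = 100,         # hypothetical cutoff
                max_evalue: float = 1e-5,      # hypothetical cutoff
                min_bitscore: float = 50.0) -> List[Hit]:  # hypothetical cutoff
    """Keep only hits that pass all four thresholds."""
    return [h for h in hits
            if h.pident >= min_pident
            and h.length >= min_length
            and h.evalue <= max_evalue
            and h.bitscore >= min_bitscore]


def best_hit_per_query(hits: List[Hit]) -> Dict[str, Hit]:
    """Return the highest bit-score hit for each query sequence."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return best


# Made-up records for illustration only (contig and reference names are hypothetical).
EXAMPLE = (
    "contig_1\tref_A\t95.2\t850\t41\t0\t1\t850\t101\t950\t0.0\t1350\n"
    "contig_1\tref_B\t82.0\t300\t54\t0\t1\t300\t10\t309\t1e-60\t240\n"
    "contig_2\tref_C\t71.5\t60\t17\t0\t5\t64\t200\t259\t0.002\t38.5\n"
)

if __name__ == "__main__":
    kept = filter_hits(parse_hits(io.StringIO(EXAMPLE)))
    for query, hit in best_hit_per_query(kept).items():
        print(query, hit.subject, hit.pident, hit.evalue)
```

Appropriate cutoffs depend on the database, the search tool, and whether reads or assembled contigs are queried, so the defaults above should be tuned rather than reused as-is.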
Figure 3. Checklist of the most important considerations to keep in mind during HTS data processing for detection of plant viruses.
Figure 4. Checklist of the most important considerations during taxonomic classification of plant viruses detected by HTS.
Figure 5. Quick-start guide assisting selection of analysis approaches for plant virus detection from HTS data.
List of selected easy-to-use analysis solutions for detection of plant viruses with their pros and cons.
| Pipeline | Brief Description | Web Link/Publication | Pros | Cons |
|---|---|---|---|---|
| | Virus discovery using sRNA and RNAseq sequences | | Easy to use: a single command runs one or multiple datasets simultaneously. Automatically organizes results and presents them in an HTML table providing key metrics on coverage, sequence depth, virus and genus name, and links to a visual map and the NCBI GenBank reference sequence. Options to modify key assembly, mapping, and reporting parameters. Windows version with visual interface and automatic quality control and trimming to be released in 2021. Available via user account online. | Uses the complete NCBI GenBank database for viruses (divided by host type) for reference mapping and identity searches. NCBI GenBank sequences are poorly curated and may lead to reports of wrong results. Creating and formatting a new custom or up-to-date NCBI GenBank reference library is not very straightforward, and ready-formatted updates are not uploaded very regularly to the VirusDetect webpage. Currently requires a Linux environment, which is an impediment for many diagnosticians. Default reporting cutoff settings are optimized for siRNA to minimize false positives due to index-hopping; however, they may lead to the non-reporting of low-concentration viruses. |
| | HTS sample manager with virus detection, discovery and analysis workflows | | Open source, with a modern graphical interface optimized for cloud computing. User and group control with password protection, sample data management, security, and QA features. Support for multiple workflows and versioned databases for viral and non-viral pathogens. Can process short and long reads (Illumina). Result visualization, filtering, and sorting. HTTP API for automation or integration with other services such as LIMS. Can also be controlled via the command line for more complex tasks. | Requires some computational skills (or the help of an informatician) to install as a local server on a Linux operating system. Limited ability to change parameters within a workflow. |
| | Command-line tool for virus detection and viral diversity estimation | | Wide range of options to modify assembly, mapping, annotation, and clustering parameters. Performs parallel analysis of samples from the same dataset. Estimation of viral diversity through Operational Taxonomic Units (OTUs). Easy results visualization with Krona and phylogenetic trees. | Requires a Linux environment, which is an impediment for many diagnosticians. Needs cluster access for the annotation step. Requires good knowledge of the command line and Unix package installation. |
| | Online virus discovery tool | | Available via user account online. Performs reference mapping, | Analysis by the online version can take several days. Output only in text files: experience needed for further interpretation. |
| | Command-line tool for virus detection | | Simple: can be executed with one command but has a number of parameters/tools that can be tweaked. Uses the full nt and nr GenBank databases, so it is sensitive. Manual inspection of results with a local MEGAN installation improves accuracy. Supports single and paired-end analysis. Supports BLASTn/MEGAN parallelization. | Requires a Linux environment, which is an impediment for many diagnosticians. Dependent on locally stored nt and nr GenBank databases. BLASTx stage can take a long time. Manual inspection of results with a local MEGAN installation is required. |
| | k-mer based command-line tool for virus detection | | Available as a Galaxy plug-in or as a command-line tool that can be installed using conda. k-mer based rather than assembly and mapping based, which makes it more sensitive and computationally less intensive. | Requires a Linux environment for the command-line tool, which is an impediment for many diagnosticians. |
| | Targeted virus detection using an e-probe based approach | | Results easy to interpret, good sensitivity. Requires relatively low computational resources. | Undescribed viruses or viral strains are not detectable with this pipeline. Only grapevine and citrus viruses are available; however, e-probes for other viruses can be designed. Requires a Linux environment, which is an impediment for many diagnosticians. |
| | Online metagenomic analysis tool | | Both standalone version and web server available. Quick analysis not requiring any knowledge of bioinformatics and data analysis. Prepared downloadable databases available. | Not specifically made for virus detection. Protein based, hence blind to non-coding sequences (viroids, satellites). |
| | Workflow system for computational analyses | | Web-based platform. Open source. Vast choice of computational biology tools. | Data upload is limited unless you set up your own local Galaxy server. Not specifically made for virus detection. |
| | Online metagenomic analysis tool | | Easy-to-use visual interface for results. Quick analysis not requiring any knowledge of bioinformatics and data analysis. | Not possible to change parameters of the workflow. Complementary software needed for read alignment. Not specifically made for virus detection. |
| | Software for molecular biology and sequence analysis | | Graphical interface. Multiple plugins available, including some frequently used freeware assembly algorithms. Automated, customizable workflows. Constant release of updated versions and customer support. Efficient visualization tools. Free trial version available. | Licensed, with a license fee. HTS data analysis requires computational resources. |
| | Comprehensive software solution of molecular biology analysis tools | | Graphical interface. Automated, customizable workflows. Constant release of updated versions and customer support. Efficient visualization tools. Free trial version available. | Expensive ongoing licensing fee. HTS data analysis requires computational resources. |
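
The k-mer based detection mentioned in the table above, as opposed to assembly and mapping, can be illustrated with a toy sketch: index every k-mer of a set of reference sequences, then assign each read to the reference that shares the largest fraction of its k-mers. This is only an illustration of the general idea and not the algorithm of any tool listed here; the function names, the `min_fraction` cutoff, the value of `k`, and the sequences are all made-up assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, Optional, Set, Tuple


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield all k-mers of seq that contain only unambiguous nucleotides."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def build_index(references: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of reference names that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def classify_read(read: str, index: Dict[str, Set[str]], k: int,
                  min_fraction: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (best reference or None, fraction of read k-mers supporting it).

    min_fraction is a hypothetical confidence cutoff for this sketch.
    """
    counts: Counter = Counter()
    total = 0
    for kmer in kmers(read, k):
        total += 1
        for name in index.get(kmer, ()):
            counts[name] += 1
    if total == 0 or not counts:
        return None, 0.0
    name, matched = counts.most_common(1)[0]
    fraction = matched / total
    return (name, fraction) if fraction >= min_fraction else (None, fraction)


# Made-up reference sequences for illustration only.
REFERENCES = {
    "virus_A": "ATGGCGTACGTTAGCCTAGGATCCGATTACG",
    "virus_B": "TTGACCGGTATCGCAATGCCGTAAGCTTGGA",
}

if __name__ == "__main__":
    idx = build_index(REFERENCES, k=8)
    print(classify_read(REFERENCES["virus_A"][5:25], idx, k=8))
    print(classify_read("CCCCCCCCCCCCCCCC", idx, k=8))
```

Real k-mer classifiers use much larger k-mer databases, compact index structures, and taxonomy-aware scoring, which is where the memory requirements and the less intuitive confidence scores come from.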