Philippe Pérot, Thomas Bigot, Sarah Temmam, Béatrice Regnault, Marc Eloit.
Abstract
We present Microseek, a pipeline for virus identification and discovery based on RVDB-prot, a comprehensive, curated and regularly updated database of viral proteins. Microseek analyzes raw metagenomic Next Generation Sequencing (mNGS) data by performing quality-control steps and de novo assembly, and by scoring the Lowest Common Ancestor (LCA) from translated reads and contigs. Microseek runs on a local computer. The outcome of the pipeline is displayed through a user-friendly, dynamic graphical interface. Using two representative mNGS datasets derived from human tissue and plasma specimens, we illustrate how Microseek works and report its performance. In silico spikes of known viral sequences were used, along with spikes of fake Neopneumovirus sequences generated at variable evolutionary distances from known members of the Pneumoviridae family. Results were compared to Chan Zuckerberg ID (CZ ID), a reference cloud-based mNGS pipeline. We show that Microseek reliably identifies known viral sequences and performs well for the detection of distant pseudoviral sequences, especially in complex samples such as human plasma, while minimizing non-relevant hits.
Keywords: bioinformatics; diagnostic; discovery; metagenomics; pipeline; virus
Year: 2022 PMID: 36146797 PMCID: PMC9500916 DOI: 10.3390/v14091990
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.818
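The abstract mentions scoring the Lowest Common Ancestor (LCA) from translated reads and contigs. As a minimal sketch of the general idea only, and not Microseek's actual implementation, the LCA of several protein hits can be taken as the longest root-to-leaf prefix shared by their taxonomic lineages. The function name and the example lineages below are hypothetical and chosen for illustration.

```python
from typing import Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> list[str]:
    """Return the longest root-to-leaf prefix shared by all lineages."""
    if not lineages:
        return []
    shared: list[str] = []
    # zip(*lineages) walks the ranks in parallel and stops at the shortest lineage.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of the protein hits obtained for one translated read.
hits = [
    ["Viruses", "Pneumoviridae", "Orthopneumovirus", "Human orthopneumovirus"],
    ["Viruses", "Pneumoviridae", "Orthopneumovirus", "Bovine orthopneumovirus"],
    ["Viruses", "Pneumoviridae", "Metapneumovirus", "Human metapneumovirus"],
]

lca = lowest_common_ancestor(hits)
print(lca)  # ['Viruses', 'Pneumoviridae']
```

In this toy case the hits disagree at the genus level, so the read is assigned to the family. This is the kind of outcome reported as "family", "genus" or "sp." in the taxonomic precision columns of the tables.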
Figure 1. Microseek pipeline steps and output examples. (A) Step details of the pipeline. For each operation, software and database names are indicated. The times on the right give the cumulative duration of each step (in minutes) for the Tissues (T) and Plasma (P) datasets, respectively, in the spiked experiments at d1. For instance, for the Plasma dataset, step B1 would take 1113 min if it were run as a single chunk on a single CPU. (B) Example of a result browser webpage. The Parvovirus B19 taxon was selected from the Krona chart as an example. (C) Resulting list of hits for the Tissues dataset after filtering on Human Parvovirus B19 with e-value ≤ 10⁻³, showing information such as the existence of contigs (C), e-value, %ID, sequence length and protein names. Information boxes associated with each hit give the nt and aa sequences and allow direct access to the corresponding NCBI entries.
Detection of known viruses. The table compares spiked and detected sequences for the six known viruses at dilution 1, dilution 10 and dilution 100 in the Tissues and Plasma experiments. sp.: species; n.a.: not applicable.
| Dilution | Genome | Virus | Tissues: spiked sequences (n) | Tissues: spiked horizontal coverage (%) | Tissues: detected horizontal coverage, Microseek (%) | Tissues: detected horizontal coverage, CZ ID (%) | Tissues: taxonomic precision (family, genus, sp.), Microseek | Tissues: taxonomic precision (family, genus, sp.), CZ ID | Plasma: spiked sequences (n) | Plasma: spiked horizontal coverage (%) | Plasma: detected horizontal coverage, Microseek (%) | Plasma: detected horizontal coverage, CZ ID (%) | Plasma: taxonomic precision (family, genus, sp.), Microseek | Plasma: taxonomic precision (family, genus, sp.), CZ ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d1 | ssDNA | Parvovirus B19 | 158 | 99.8 | 92.7 | 99.0 | sp. | sp. | 533 | 99.8 | 90.8 | 99.0 | sp. | sp. |
| d1 | dsDNA | Epstein–Barr virus | 500 | 7.5 | 5.7 | 6.5 | sp. | sp. | 3650 | 96.3 | 79.7 | 51.0 | sp. | sp. |
| d1 | ssRNA+ | Coxsackievirus B6 | 171 | 94.8 | 93.8 | 94.5 | sp. | sp. | 667 | 99.6 | 99.3 | 96.4 | sp. | sp. |
| d1 | ssRNA− | Respiratory syncytial virus | 107 | 62.8 | 59.5 | 57.4 | sp. | sp. | 67 | 51.1 | 49.1 | 18.1 | sp. | sp. |
| d1 | dsRNA | Reovirus 1 (10 segments) | 1404 | 98.3 | 97.7 | 98.3 | sp. | sp. | 3786 | 97.9 | 97.4 | 92.7 | sp. | sp. |
| d1 | Retrovirus | HIV-1 | 160 | 59.0 | 50.8 | 55.8 | sp. | sp. | 556 | 99.7 | 55.3 | 93.0 | sp. | sp. |
| d10 | ssDNA | Parvovirus B19 | 16 | 35.1 | 30.4 | 29.7 | sp. | sp. | 53 | 78.7 | 69.5 | 34.4 | sp. | sp. |
| d10 | dsDNA | Epstein–Barr virus | 49 | 2.0 | 1.7 | 1.9 | sp. | sp. | 365 | 27.1 | 19.3 | 7.8 | sp. | sp. |
| d10 | ssRNA+ | Coxsackievirus B6 | 18 | 30.7 | 27.1 | 27.0 | sp. | sp. | 67 | 78.5 | 72.1 | 30.3 | sp. | sp. |
| d10 | ssRNA− | Respiratory syncytial virus | 11 | 10.4 | 9.4 | 10.4 | sp. | sp. | 7 | 6.9 | 6.9 | 1.0 | sp. | sp. |
| d10 | dsRNA | Reovirus 1 (10 segments) | 138 | 56.9 | 56.5 | 55.5 | sp. | sp. | 379 | 75.9 | 75.2 | 38.7 | sp. | sp. |
| d10 | Retrovirus | HIV-1 | 16 | 22.5 | 16.4 | 19.2 | sp. | sp. | 56 | 63.5 | 25.9 | 27.4 | sp. | sp. |
| d100 | ssDNA | Parvovirus B19 | 3 | 7.3 | 2.7 | 7.3 | genus | sp. | 5 | 13.4 | 10.7 | 0.0 | sp. | n.a. |
| d100 | dsDNA | Epstein–Barr virus | 5 | 0.3 | 0.3 | 0.3 | sp. | sp. | 36 | 3.1 | 2.5 | 0.8 | sp. | sp. |
| d100 | ssRNA+ | Coxsackievirus B6 | 3 | 5.6 | 2.0 | 5.6 | sp. | sp. | 7 | 14.2 | 14.2 | 2.0 | sp. | sp. |
| d100 | ssRNA− | Respiratory syncytial virus | 3 | 3.0 | 3.0 | 3.0 | sp. | sp. | 1 | 1.0 | 1.0 | 1.0 | sp. | sp. |
| d100 | dsRNA | Reovirus 1 (10 segments) | 30 | 16.1 | 16.1 | 15.8 | sp. | sp. | 37 | 20.4 | 20.4 | 6.4 | sp. | sp. |
| d100 | Retrovirus | HIV-1 | 2 | 3.3 | 1.6 | 3.3 | sp. | sp. | 6 | 9.8 | 1.6 | 3.3 | genus | sp. |
Figure 2. Detection of six known viruses (spiked vs. detected). Scatter plots of spiked (X axis) vs. detected (Y axis) percentage horizontal coverage for the 6 known viruses of the study, for the Tissues (A,B) and Plasma (C,D) experiments. Each virus is depicted by 3 points, corresponding to the d1 (large circles), d10 (intermediate circles) and d100 (small circles) experiments, and is associated with a specific color (red: Parvovirus B19; orange: Epstein–Barr virus; green: Coxsackievirus B6; blue: HRSV; purple: Reovirus-1; grey: HIV-1). The identity line (Y = X) is represented in black.
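The tables and scatter plots report percentage horizontal coverage, which the record does not define. The sketch below assumes the usual meaning: the percentage of reference genome positions covered by at least one sequence. The function name, the half-open interval convention and the example intervals are assumptions made for illustration, not taken from Microseek or CZ ID.

```python
def horizontal_coverage(intervals, reference_length):
    """Percentage of reference positions covered by at least one interval.

    Intervals are 0-based, half-open (start, end) alignment coordinates.
    """
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(intervals):
        # Skip the part that overlaps what was already counted,
        # and clip to the reference boundaries.
        start = max(start, current_end, 0)
        end = min(end, reference_length)
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / reference_length


# Hypothetical alignments on a 1000-nt reference: two overlapping, one isolated.
alignments = [(0, 100), (50, 150), (400, 500)]
print(horizontal_coverage(alignments, 1000))  # 25.0
```

Overlapping sequences are counted once, so adding reads to an already covered region changes depth but not horizontal coverage.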
Figure 3. Phylogeny of Neopneumoviruses. Protein sequences of the polymerases were aligned with MAFFT v7.450, and the phylogeny was reconstructed with IQ-TREE 2.0.6 using the JTT+F+G4 substitution model. The resulting tree was rooted using members of the Paramyxoviridae family as outgroup. AMPV: Avian metapneumovirus; HMPV: Human metapneumovirus; MPV: Murine orthopneumovirus; BRSV: Bovine orthopneumovirus; HRSV: Human respiratory syncytial virus. Accession numbers are indicated in the figure. Numbers on the branches represent their lengths.
Detection of Neopneumoviruses. The table compares spiked and detected sequences for the three Neopneumoviruses at dilution 1 and dilution 10 in the Tissues and Plasma experiments. n.a.: not applicable.
| Dilution | Virus | Tissues: spiked sequences (n) | Tissues: spiked horizontal coverage (%) | Tissues: detected horizontal coverage, Microseek (%) | Tissues: detected horizontal coverage, CZ ID (%) | Tissues: taxonomic precision (family, genus, sp.), Microseek | Tissues: taxonomic precision (family, genus, sp.), CZ ID | Plasma: spiked sequences (n) | Plasma: spiked horizontal coverage (%) | Plasma: detected horizontal coverage, Microseek (%) | Plasma: detected horizontal coverage, CZ ID (%) | Plasma: taxonomic precision (family, genus, sp.), Microseek | Plasma: taxonomic precision (family, genus, sp.), CZ ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d1 | Neopneumovirus-1 | 107 | 81.6 | 73.3 | 71.5 | | | 67 | 52.2 | 47.7 | 19.4 | | |
| d1 | Neopneumovirus-2 | 107 | 70.5 | 51.6 | 56.6 | | | 67 | 56.1 | 38.1 | 18.4 | | |
| d1 | Neopneumovirus-3 | 107 | 70.1 | 20.6 | 34.0 | | | 67 | 52.5 | 14.3 | 11.8 | | |
| d10 | Neopneumovirus-1 | 11 | 12.1 | 9.7 | 10.9 | | | 7 | 8.4 | 7.2 | 1.2 | | |
| d10 | Neopneumovirus-2 | 11 | 12.9 | 4.8 | 10.5 | | | 7 | 8.4 | 3.6 | 3.6 | | |
| d10 | Neopneumovirus-3 | 11 | 12.1 | 0.0 | 3.6 | n.a. | | 7 | 7.6 | 2.7 | 1.2 | | |
Figure 4. Detection of three Neopneumoviruses (spiked vs. detected). Scatter plots of spiked (X axis) vs. detected (Y axis) percentage horizontal coverage for the 3 fake Neopneumoviruses of the study, for the Tissues (A,B) and Plasma (C,D) experiments. Each virus is depicted by 2 points, corresponding to the d1 (large circles) and d10 (intermediate circles) experiments, and is associated with a specific color (blue: Neopneumovirus-1; purple: Neopneumovirus-2; pink: Neopneumovirus-3). The identity line (Y = X) is represented in black.
Figure 5. Signal-to-noise ratio in the Plasma experiments. Histograms below the table represent the signal (blue bars with counts above zero on the primary axis) vs. noise (orange bars with counts below zero on the primary axis) and associated signal-to-noise ratios (black dots on the secondary axis).