Ilya Plyusnin, Ravi Kant, Anne J Jääskeläinen, Tarja Sironen, Liisa Holm, Olli Vapalahti, Teemu Smura.
Abstract
The study of microbiome data holds great potential for elucidating the biological and metabolic functioning of living organisms and their role in the environment. Metagenomic analyses have shown that humans, along with, for example, domestic animals, wildlife and arthropods, are colonized by an immense community of viruses. The current coronavirus pandemic (COVID-19) heightens the need to rapidly detect previously unknown viruses in an unbiased way. The increasing availability of metagenomic data in this era of next-generation sequencing (NGS), along with increasingly affordable sequencing technologies, highlights the need for reliable and comprehensive methods to manage such data. In this article, we present a novel bioinformatics pipeline called LAZYPIPE for identifying both previously known and novel viruses in host-associated or environmental samples and give examples of virus discovery based on it. LAZYPIPE is a Unix-based pipeline for automated assembly and taxonomic profiling of NGS libraries, implemented as a collection of C++, Perl, and R scripts.
Keywords: NGS data analysis; bioinformatics pipeline; taxonomic profiling; viral metagenomics; virome; virus discovery
Year: 2020 PMID: 33408878 PMCID: PMC7772471 DOI: 10.1093/ve/veaa091
Source DB: PubMed Journal: Virus Evol ISSN: 2057-1577
Figure 1. Lazypipe flowchart. Binaries and scripts are displayed in white, input and output files in green.
Assessing the accuracy of virus taxon retrieval by different tools.
| Tool | Rank | TP | FP | FN | Pr | Rc | F |
|---|---|---|---|---|---|---|---|
| Lazypipe-nt | Genus | 41 | 2 | 4 | 0.953 | 0.911 | 0.932 |
| Lazypipe | Genus | 41 | 2 | 4 | 0.953 | 0.911 | 0.932 |
| Centrifuge | Genus | 45 | 8 | 0 | 0.849 | 1.000 | 0.918 |
| MetaPhlan2 | Genus | 32 | 4 | 13 | 0.889 | 0.711 | 0.790 |
| Kraken2 | Genus | 21 | 1 | 24 | 0.955 | 0.467 | 0.627 |
| Lazypipe-nt | Species | 69 | 2 | 15 | 0.972 | 0.821 | 0.890 |
| Lazypipe | Species | 72 | 8 | 12 | 0.900 | 0.857 | 0.878 |
| Centrifuge | Species | 80 | 47 | 4 | 0.630 | 0.952 | 0.758 |
| MetaPhlan2 | Species | 38 | 7 | 46 | 0.844 | 0.452 | 0.589 |
| Kraken2 | Species | 16 | 1 | 68 | 0.941 | 0.190 | 0.317 |
Within each rank, tools are ordered by descending F1-score for virus taxa predicted from the simulated metagenome (Fosso et al. 2017).
TP, true positives; FP, false positives; FN, false negatives; Pr, precision; Rc, recall; F, F1-score.
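The Pr, Rc and F columns follow from the TP, FP and FN counts through the standard definitions of precision, recall and F1-score. The short sketch below recomputes them for three genus-level rows of the virus taxon table; the function name `precision_recall_f1` is illustrative only and is not part of Lazypipe.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# (TP, FP, FN) counts copied from genus-level rows of the virus taxon table.
genus_counts = {
    "Lazypipe-nt": (41, 2, 4),
    "Centrifuge": (45, 8, 0),
    "Kraken2": (21, 1, 24),
}

for tool, counts in genus_counts.items():
    pr, rc, f = precision_recall_f1(*counts)
    print(f"{tool}: Pr={pr:.3f} Rc={rc:.3f} F={f:.3f}")
```

Rounded to three decimals, the recomputed values agree with the corresponding table rows, for example Pr = 0.953, Rc = 0.911 and F = 0.932 for Lazypipe-nt.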
Assessing the accuracy of viral and bacterial taxon retrieval by different tools.
| Tool | Rank | TP | FP | FN | Pr | Rc | F |
|---|---|---|---|---|---|---|---|
| Lazypipe | Genus | 84 | 22 | 4 | 0.792 | 0.955 | 0.866 |
| MetaPhlan2 | Genus | 70 | 7 | 18 | 0.909 | 0.795 | 0.848 |
| Lazypipe-nt | Genus | 64 | 12 | 24 | 0.842 | 0.727 | 0.780 |
| Kraken2 | Genus | 50 | 3 | 38 | 0.943 | 0.568 | 0.709 |
| Centrifuge | Genus | 82 | 162 | 6 | 0.336 | 0.932 | 0.494 |
| MetaPhlan2 | Species | 105 | 10 | 51 | 0.913 | 0.673 | 0.775 |
| Lazypipe | Species | 143 | 94 | 13 | 0.603 | 0.917 | 0.728 |
| Lazypipe-nt | Species | 100 | 40 | 56 | 0.714 | 0.641 | 0.676 |
| Kraken2 | Species | 52 | 22 | 104 | 0.703 | 0.333 | 0.452 |
| Centrifuge | Species | 126 | 471 | 30 | 0.211 | 0.808 | 0.335 |
Within each rank, tools are ordered by descending F1-score for all taxa predicted from the simulated metagenome (Fosso et al. 2017).
TP, true positives; FP, false positives; FN, false negatives; Pr, precision; Rc, recall; F, F1-score.
Figure 2. Assessing classification accuracy with precision–recall analysis. (A) Precision–recall curves for reported virus taxa. (B) Precision–recall curves for all reported taxa. The area under each precision–recall curve is displayed after the tool’s name in the figure legend. The dot on each curve corresponds to the maximum F1 value (Fmax).
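The two summary statistics named in this caption, Fmax and the area under the precision–recall curve, can be sketched as follows. The curve points are hypothetical and not taken from the study, and trapezoidal integration is an assumption, since the caption does not state how the area was computed. The function names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def f_max(points):
    """Return (Fmax, (recall, precision)) for the best point on the curve."""
    best = max(points, key=lambda rp: f1_score(rp[1], rp[0]))
    return f1_score(best[1], best[0]), best


def area_under_pr(points):
    """Trapezoidal area under a curve given as (recall, precision) pairs.

    Assumption: only the recall range covered by the points is integrated.
    """
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Hypothetical (recall, precision) points, for illustration only.
curve = [(0.2, 1.0), (0.5, 0.9), (0.8, 0.8), (1.0, 0.5)]

best_f, best_point = f_max(curve)
print(f"Fmax={best_f:.3f} at recall={best_point[0]}, precision={best_point[1]}")
print(f"Area under curve={area_under_pr(curve):.3f}")
```

For these illustrative points, Fmax is reached at recall 0.8 and precision 0.8.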
Virus contigs retrieved by Lazypipe for the mink fecal sample.
| Order | Family | Genus | Length (nt) | Closest match | Gene | Identity (%) |
|---|---|---|---|---|---|---|
| Picornavirales | | | 8,990 | Kilifi virus | | 30 |
| NA | Caliciviridae | | 8,006 | Norovirus GIV and GVI | ORF1 | 89 |
| | | | | GII | ORF2 | 63 |
| | | | | | ORF3 | 57 |
| NA | Caliciviridae | | 7,511 | Sapovirus genotype XII | ORF1 | 81 |
| | | | | | VP1 | 73 |
| | | | | | ORF2 | 45 |
| NA | Parvoviridae | | 3,069 | Chicken chapparvovirus 1 | NS | 96 |
| | | | | Chicken chapparvovirus 2 | VP1 | 35 |
| NA | Parvoviridae | | 2,448 | Chiropteran protoparvovirus 1 | NS | 44 |
| | | | | Carnivore amdoparvovirus 1 | VP1 | 39 |
| Unclassified | Toti-like viruses | | 3,792 | Beihai toti-like virus 4 | | 29–32 |
| Unclassified | Toti-like viruses | | 3,219 | Hubei unio douglasiae virus 1 | | 38–48 |
| Unclassified | Picobirna-like viruses | | 1,346 | Beihai picobirna-like virus 11 | | 81 |
| Unclassified | Noda-like viruses | | 1,250 | Beihai barnacle virus 11 | | 53 |
| Unclassified | Noda-like viruses | | 1,070 | Wenzhou noda-like virus 2 | | 46 |
| Unclassified | Noda-like viruses | | 857 | Wenzhou noda-like virus 2 | | 78 |
| Unclassified | Noda-like viruses | | 785 | Wenling noda-like virus 1 | | 72 |
| Unclassified | Noda-like viruses | | 943 | Wuhan pillworm virus 4 | | 42 |
| Unclassified | Circo-like virus | | 2,377 | uncultured marine virus | | 34 |
| Unclassified | Circo-like virus | | 919 | Anguilla anguilla circovirus | | 60 |
| Unclassified | Circo-like virus | | 692 | Dromedary stool-associated circular ssDNA virus | | 55 |
| Unclassified | Circo-like virus | | 537 | Hermit crab-associated circular genome | | 52 |
Only contigs longer than 500 nt are shown. Length (nt), contig length in nucleotides; Identity (%), amino acid identity to the closest match.
Lazypipe summary for various sample types with known human pathogenic viruses.
| Host | Sample type | Genus | Length (nt) | Closest match | Identity (%) |
|---|---|---|---|---|---|
| Human | CSF | | 7,384 | Coxsackievirus B5 | 99 |
| Human | Serum | | 7,375 | Coxsackievirus A6 | 100 |
| Human | Serum | | 7,321 | Human parechovirus 3 | 99 |
| Human | Brain (cerebellum) | | 10,681 | TBEV | 100 |
| Human | Nasopharyngeal swab | | 29,806 | SARS-coronavirus-2 | 100 |
| | | | 333–702 | Human mastadenovirus C | 96–100 |
| | Tick homogenate | | 11,090 | TBEV | 99 |
| | | | 2,696–3,014 | Alongshan virus | 96–99 |
Segmented genome. Length (nt), contig length in nucleotides; Identity (%), amino acid identity to the closest match.
Lazypipe summary for SARS-CoV-2 clinical samples from Wuhan, China.
| Accession | Library | Virus | Taxid | Readn | Readn% | Csum | Contign | Length (nt) |
|---|---|---|---|---|---|---|---|---|
| SRR11092058 | WIV02 | SARS-related coronavirus | 694009 | 36 | 0.517 | 3 | 9 | 362–568 |
| SRR11092063 | WIV02-2 | SARS-related coronavirus | 694009 | 559 | 0.368 | 1 | 23 | 305–2,169 |
| SRR11092057 | WIV04 | SARS-related coronavirus | 694009 | 732 | 13.088 | 1 | 15 | 393–4,488 |
| SRR11092062 | WIV04-2 | SARS-related coronavirus | 694009 | 5,918 | 3.003 | 1 | 1 | 29,850 |
| SRR11092062 | WIV04-2 | Influenza A virus | 11320 | 274 | 0.139 | 1 | 2 | 1,065–4,609 |
| SRR11092062 | WIV04-2 | Autographa californica multiple nucleopolyhedrovirus | 307456 | 205 | 0.104 | 1 | 2 | 397–440 |
| SRR11092061 | WIV05 | SARS-related coronavirus | 694009 | 234 | 0.051 | 1 | 20 | 315–2,044 |
| SRR11092061 | WIV05 | Saccharomyces 20S RNA narnavirus | 186772 | 135 | 0.029 | 2 | 1 | 2,378 |
| SRR11092060 | WIV06-2 | SARS-related coronavirus | 694009 | 525 | 0.142 | 1 | 22 | 305–2,530 |
| SRR11092060 | WIV06-2 | Spodoptera frugiperda rhabdovirus | 1481139 | 165 | 0.045 | 1 | 1 | 468 |
| SRR11092060 | WIV06-2 | Saccharomyces 20S RNA narnavirus | 186772 | 103 | 0.028 | 2 | 3 | 439–1,423 |
| SRR11092064 | WIV07 | Influenza A virus | 11320 | 9,063 | 0.097 | 1 | 4 | 399–4,765 |
| SRR11092064 | WIV07 | Saccharomyces 20S RNA narnavirus | 186772 | 3,386 | 0.036 | 1 | 1 | 2,440 |
| SRR11092064 | WIV07 | SARS-related coronavirus | 694009 | 819 | 0.009 | 2 | 16 | 408–5,727 |
| SRR11092064 | WIV07 | Bamboo mosaic virus | 35286 | 325 | 0.003 | 2 | 1 | 355 |
| SRR11092064 | WIV07 | Spodoptera frugiperda rhabdovirus | 1481139 | 168 | 0.002 | 2 | 1 | 442 |
| SRR11092059 | WIV07-2 | Saccharomyces 20S RNA narnavirus | 186772 | 1,693 | 0.019 | 2 | 1 | 2,440 |
| SRR11092059 | WIV07-2 | SARS-related coronavirus | 694009 | 467 | 0.005 | 2 | 5 | 1,531–5,727 |
Viral taxa identified by Lazypipe from public Illumina libraries sequenced from five patients with pneumonia at the early stage of the COVID-19 outbreak in Wuhan, China. Lazypipe correctly identified SARS-CoV in eight out of nine samples. Additionally, Influenza A was identified in two samples and human mastadenovirus C coinfection in one sample. Accession, NCBI SRA accession; Library, library identifier; Virus, name of the viral taxon; Readn, reads assigned to the taxon; Csum, csum confidence score (see text); Contign, contigs assigned to the taxon; Length (nt), contig length in nucleotides.