Rosa L Allesøe, Camilla K Lemvigh, My V T Phan, Philip T L C Clausen, Alfred F Florensa, Marion P G Koopmans, Ole Lund, Matthew Cotten.
Abstract
SUMMARY: Here, we present an automated pipeline for Download Of NCBI Entries (DONE) and continuous updating of a local sequence database based on user-specified queries. The database can be built from either protein or nucleotide sequences and can contain all entries or complete genomes only. The pipeline can automatically clean the database by removing entries that match a database of user-specified sequence contaminants. The default contamination entries include sequences from the NCBI UniVec database of plasmids, marker genes and sequencing adapters, an E. coli genome, rRNA sequences, vectors and satellite sequences. Furthermore, duplicates are removed and the database is automatically screened for sequences from green fluorescent protein, luciferase and antibiotic resistance genes, which might be present in some GenBank viral entries and could lead to false positives in virus identification. For use of the database, we also present an approach for dealing with possible human contamination. We show the applicability of DONE by downloading a virus database comprising 37 virus families. We observed an average increase of 16,776 new entries downloaded per month for the 37 families. In addition, we demonstrate the utility of a custom database compared to a standard reference database for classifying both simulated and real sequence data.

AVAILABILITY AND IMPLEMENTATION: The DONE pipeline for downloading and cleaning is deposited in a publicly available repository (https://bitbucket.org/genomicepidemiology/done/src/master/).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2021 PMID: 33031509 PMCID: PMC8097684 DOI: 10.1093/bioinformatics/btaa857
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
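The query-driven download described in the abstract can be illustrated with a minimal sketch that composes a search request for the public NCBI E-utilities `esearch` endpoint. This is not the DONE implementation: the function names `build_query` and `esearch_url`, the choice of Entrez field tags (`[Organism:exp]`, `[Title]`, `[SLEN]`) and the example parameters are assumptions made for illustration, and the actual pipeline may build its queries differently.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(taxid, complete_only=False, length_range=None):
    """Compose an Entrez search term for one taxon.

    Hypothetical helper: the field tags are standard Entrez syntax, but
    whether DONE uses these exact terms is an assumption.
    """
    terms = [f"txid{taxid}[Organism:exp]"]  # taxon plus all descendants
    if complete_only:
        terms.append('"complete genome"[Title]')
    if length_range is not None:
        low, high = length_range
        terms.append(f"{low}:{high}[SLEN]")  # sequence-length window
    return " AND ".join(terms)


def esearch_url(taxid, db="nucleotide", **options):
    """Return an esearch URL for a nucleotide or protein query."""
    if db not in ("nucleotide", "protein"):
        raise ValueError("db must be 'nucleotide' or 'protein'")
    params = {
        "db": db,
        "term": build_query(taxid, **options),
        "usehistory": "y",  # keep the result set on the server for later fetching
        "retmax": 0,        # only the hit count and history keys are needed here
    }
    return ESEARCH + "?" + urlencode(params)
```

Re-running the same query on a schedule and fetching only identifiers not yet stored locally would be one way to approximate continuous updating; the taxonomy ID passed to `esearch_url` is whatever the user specifies.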
Fig. 1. Overview of the automated download and cleaning process with DONE. Annotations include sub-database name, species and taxonomy ID
Fig. 2. Results for three download options for two virus families, Picornaviridae and Reoviridae: all entries, entries with ‘complete’ or ‘partial’ in the description, and entries within predefined length criteria specific to each virus family. (a) Venn diagram of entry overlap for Picornaviridae. (b) Length profile distributions for Picornaviridae. (c) Venn diagram of entry overlap for Reoviridae. (d) Length profile distributions for Reoviridae. The colors in (a) and (c) are as follows. Gray: ‘all’ +, ‘complete/partial’ -, ‘length’ -. Light blue: ‘all’ +, ‘complete/partial’ -, ‘length’ +. Blue: ‘all’ +, ‘complete/partial’ +, ‘length’ -. Dark blue: ‘all’ +, ‘complete/partial’ +, ‘length’ +
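The four Venn regions named in the Fig. 2 caption can be reproduced with a small sketch that assigns each entry to a region from its description and length. The keyword test, the helper names `venn_color` and `count_regions`, and the example entries and length window are assumptions for illustration only; the family-specific length criteria used by the authors are not given here.

```python
from collections import Counter

# Region colors as listed in the Fig. 2 caption, keyed by
# (has 'complete'/'partial' in description, within length criteria).
# Every entry belongs to the 'all' set by definition.
REGION_COLORS = {
    (False, False): "gray",
    (False, True): "light blue",
    (True, False): "blue",
    (True, True): "dark blue",
}


def venn_color(description, length, length_range):
    """Return the Venn region color for one entry."""
    text = description.lower()
    has_keyword = "complete" in text or "partial" in text
    low, high = length_range
    in_length = low <= length <= high
    return REGION_COLORS[(has_keyword, in_length)]


def count_regions(entries, length_range):
    """Count entries per region; entries are (description, length) pairs."""
    return Counter(venn_color(desc, n, length_range) for desc, n in entries)


# Hypothetical entries and a hypothetical length window of 7000-9000 nt.
EXAMPLE_ENTRIES = [
    ("Virus X, complete genome", 7500),
    ("Virus X polyprotein gene, partial cds", 900),
    ("Virus X isolate 1", 7400),
    ("Virus X isolate 2", 300),
]
EXAMPLE_COUNTS = count_regions(EXAMPLE_ENTRIES, (7000, 9000))
```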
Fig. 3. Percentage of mapped reads in the simulated viral metagenomic sample mapping to each of the sub-databases using the k-mer-based alignment tool KVIT. The simulated sample contains equal amounts of reads for the five included viral families (Caliciviridae, Coronaviridae, Paramyxoviridae, Polyomaviridae and Rhabdoviridae) and additional contamination reads (phages, human and E. coli). Here, we show the distribution on a viral family level
Percentage of entries removed at each similarity threshold against the contamination database of common contaminants
| Virus family | All, 50% | All, 60% | All, 75% | All, 95% | Length-filtered, 50% | Length-filtered, 60% | Length-filtered, 75% | Length-filtered, 95% |
|---|---|---|---|---|---|---|---|---|
| Herpesviridae | 8.74 | 8.51 | 6.75 | 1.33 | 65.01 | 64.95 | 52.83 | 12.63 |
| Papillomaviridae | 0.16 | 0.0042 | 0.0042 | 0 | 0.61 | 0.017 | 0.017 | 0 |
| Picornaviridae | 0.051 | 0.042 | 0.033 | 0.00088 | 0.6 | 0.54 | 0.46 | 0.013 |
| Polyomaviridae | 1.1 | 1.1 | 1.1 | 0.41 | 2.18 | 2.18 | 2.18 | 1.27 |
| Retroviridae (not HIV1) | 0.2 | 0.16 | 0.097 | 0.026 | 3.35 | 2.91 | 1.26 | 0.62 |
| Togaviridae | 0.21 | 0.2 | 0.024 | 0.012 | 0.63 | 0.63 | 0.049 | 0 |
Note: Column headers give the database and the similarity threshold; values are percentages of entries removed. We only list virus families for which more than 0.5% of entries were removed at the lowest threshold (50%) in one of the databases. The ‘all’ database consists of every entry associated with the downloaded taxid. The ‘length filtered’ database applies length criteria to the downloads that are specific to each database.
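The relationship between the similarity threshold and the percentage of entries removed can be sketched as follows. The sketch assumes that each entry already has a single best similarity score (in percent) against the contamination database and that an entry is removed when that score is at or above the threshold; how DONE actually computes similarity, and whether the comparison is inclusive, is not stated here. The function name `percent_removed` and the example scores are hypothetical.

```python
def percent_removed(best_similarity, thresholds=(50, 60, 75, 95)):
    """Percentage of entries removed at each similarity threshold.

    best_similarity: one score per entry, the highest percent similarity
    of that entry to any sequence in the contamination database
    (0 if there was no match). Assumed input, not DONE's own format.
    """
    total = len(best_similarity)
    if total == 0:
        raise ValueError("no entries to evaluate")
    return {
        t: 100.0 * sum(score >= t for score in best_similarity) / total
        for t in thresholds
    }


# Hypothetical scores for ten entries.
EXAMPLE_SCORES = [0, 10, 55, 62, 80, 97, 30, 0, 0, 0]
EXAMPLE_RESULT = percent_removed(EXAMPLE_SCORES)
```

Because a stricter threshold can only remove a subset of what a looser one removes, the percentages cannot increase from left to right within a database, which matches the pattern in the table.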
Additional hits identified with KVIT above the coverage threshold of 80% besides the known species included in the simulated sample
| Family hit (entry name/species) | All raw | All clean | All clean (decon) | Length filtered raw | Length filtered clean | Length filtered clean (decon) |
|---|---|---|---|---|---|---|
| Herpesviridae (Stealth virus 1) | 1 | 0 | 0 | 0 | 0 | 0 |
| Retroviridae not HIV1 (Human Endogenous Retrovirus K, Multiple sclerosis associated retrovirus, Human endogenous retrovirus) | 20 | 20 | 1 | 0 | 0 | 0 |
Note: Values are the number of unique entries identified for each family; the specific species names are listed in parentheses after the family.
Fig. 4. Distribution of the reads mapping to each sub-database on virus family level when using the k-mer-based alignment tool KVIT on 10 real metagenomic samples from pigs. (a) Using the ‘all’ clean database with human decontamination. (b) Using the ‘length filtered’ clean database with human decontamination