Anne Gleizes, Florian Laubscher, Nicolas Guex, Christian Iseli, Thomas Junier, Samuel Cordey, Jacques Fellay, Ioannis Xenarios, Laurent Kaiser, Philippe Le Mercier.
Abstract
The huge genetic diversity of circulating viruses is a challenge for diagnostic assays for emerging or rare viral diseases. High-throughput technology offers a new opportunity to explore the global virome of patients without preconceptions about the culpable pathogens, but it requires a solid reference dataset to be accurate. Virosaurus has been designed to offer an unbiased, automated and annotated database for clinical metagenomics studies and diagnosis. Raw viral sequences were extracted from GenBank and cleaned up to remove potentially erroneous sequences. Complete sequences were identified for all genera infecting vertebrates, plants and other eukaryotes (insects, fungi, etc.). To facilitate the analysis of clinically relevant viruses, we annotated all sequences with official and common virus names, acronyms, genotypes, and genomic features (linear, circular, DNA, RNA, etc.). Sequences were clustered to remove redundancy at 90% or 98% identity. Analysis of the clustering results reveals the state of knowledge of the virus genetic landscape. Because herpesviruses and poxviruses were under-represented among complete genomes relative to their potential diversity in nature, we used genes instead of complete genomes for those viruses in Virosaurus.
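The redundancy-removal step can be illustrated with a minimal sketch of greedy clustering at an identity threshold. This is not the Virosaurus pipeline: the abstract does not name the clustering tool, and the `identity` function here is a toy stand-in that assumes pre-aligned, equal-length sequences. The function names and isolate labels are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions.

    Assumption: sequences are pre-aligned and of equal length. A real
    pipeline would compute identity from an actual alignment.
    """
    if len(a) != len(b):
        raise ValueError("toy identity expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy incremental clustering.

    Sequences are visited longest first (ties broken by name). Each one
    joins the first cluster whose representative it matches at or above
    the threshold; otherwise it becomes a new representative.
    """
    clusters: dict[str, list[str]] = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical 100-base isolates: b differs from a at 5 positions, c at 20.
ref = "ACGT" * 25
seqs = {
    "isolate_a": ref,
    "isolate_b": "N" * 5 + ref[5:],
    "isolate_c": "N" * 20 + ref[20:],
}
clusters_90 = greedy_cluster(seqs, 0.90)
clusters_98 = greedy_cluster(seqs, 0.98)
```

In this toy input the looser 90% threshold merges the two closest isolates, while the 98% threshold keeps all three as separate representatives, which is the trade-off between the two Virosaurus datasets described above.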
Keywords: HTS; bioinformatics; complete genome; database; diagnostics; sequencing; viral infections; viruses
Year: 2020 PMID: 33139591 PMCID: PMC7693494 DOI: 10.3390/v12111248
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Figure 1. Workflow for the creation of the Virosaurus datasets. References are in blue, and output datasets are in green.
Figure 2. Examples of virus genome annotation. The usual name and clinical typing should be the default output for clinical studies and are shown in red.
Figure 3. Example of gathering reads for the same virus. On the left, 10 isolates represent the clusters for this virus. Twenty-eight reads show homology to those reference sequences; they can all be grouped under the “human polyomavirus 2” entity, which facilitates interpretation of the results.
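The grouping described in this caption can be sketched as a lookup from reference sequence to viral entity, followed by a count. The data layout, function name and isolate identifiers below are hypothetical and are not taken from the Virosaurus file format.

```python
from collections import Counter


def reads_per_entity(read_hits, reference_to_entity):
    """Count reads per viral entity.

    read_hits: iterable of (read_id, reference_id) pairs, one per aligned read.
    reference_to_entity: maps each reference (cluster representative) to the
    entity name it is annotated with. Unknown references are counted as
    "unassigned".
    """
    counts = Counter()
    for _read_id, reference_id in read_hits:
        counts[reference_to_entity.get(reference_id, "unassigned")] += 1
    return counts


# Hypothetical example: three isolates annotated with the same entity.
reference_to_entity = {
    "isolate_01": "human polyomavirus 2",
    "isolate_02": "human polyomavirus 2",
    "isolate_03": "human polyomavirus 2",
}
read_hits = [
    ("read_1", "isolate_01"),
    ("read_2", "isolate_03"),
    ("read_3", "isolate_02"),
    ("read_4", "isolate_99"),  # reference missing from the annotation map
]
counts = reads_per_entity(read_hits, reference_to_entity)
```

Reads that hit different isolates of the same virus end up under a single name, so the report lists one entity instead of one line per reference sequence.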
Figure 4. Relative number of sequences for the 13 most sequenced human viruses: (A) total sequences from GenBank, (B) complete virus sequences, (C) Virosaurus 98 and (D) Virosaurus 90. (Data from release 2019_10.)
Figure 5. Percentage of sequence reduction by clustering complete genomes at 90%. (Data from Virosaurus release 2019_10.)
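One plausible reading of "percentage of sequence reduction" is the share of complete genomes that are no longer present once only cluster representatives are kept. The caption does not give the formula, so the definition below is an assumption, and the counts are made up for illustration.

```python
def percent_reduction(n_complete: int, n_clusters: int) -> float:
    """Percentage of sequences removed by clustering (assumed definition)."""
    if n_complete <= 0:
        raise ValueError("n_complete must be positive")
    return 100 * (n_complete - n_clusters) / n_complete


# Hypothetical counts: 1000 complete genomes collapse into 25 clusters.
reduction = percent_reduction(1000, 25)
```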
Figure 6. Human blood samples were sequenced, and reads generated using the RNA protocol [9] were aligned to the Virosaurus database. The result is easy to interpret and confirms that the patient was positive for a novel human astrovirus, HIV-1 and HHV-8, as previously reported [23]. Top panel: 2D representation of detected sequences, with % segment coverage on the X-axis and median depth on the Y-axis; bottom panel: raw data. Dot size reflects the number of reads. Anellovirus (TTV) sequences were also detected. The Virosaurus hierarchy allows reads to be allocated to viral entities at the level of the virus (HIV-1, HHV-8, MastV-6) or higher (TTV). Unclassified SEN viruses are TTV-like genomes. Mamastrovirus (Novel) is a subtype label that differentiates novel (i.e., MLB and VA/HMO) from classical human astroviruses.
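The two plotted quantities can be sketched from a per-position depth vector for one reference segment. The caption does not say whether the median is taken over all positions or only over covered ones; this sketch assumes all positions, and the function name and depth values are hypothetical.

```python
from statistics import median


def segment_stats(depths: list[int]) -> tuple[float, float]:
    """Return (% of positions covered by at least one read, median depth).

    Assumption: the median is computed over every position of the segment,
    including positions with zero depth.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d > 0)
    return 100 * covered / len(depths), float(median(depths))


# Hypothetical 8-position segment.
coverage_pct, median_depth = segment_stats([0, 0, 3, 5, 7, 0, 2, 2])
```

A segment with high coverage and moderate depth is more convincing evidence of a virus than one with many reads stacked on a short region, which is why the figure plots the two values against each other.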