
The RNA virus database.

Robert Belshaw, Tulio de Oliveira, Sidney Markowitz, Andrew Rambaut.

Abstract

The RNA Virus Database is a database and web application describing the genome organization and providing analytical tools for the 938 known species of RNA virus. It can identify submitted nucleotide sequences, can place them into multiple whole-genome alignments (in species where more than one isolate has been fully sequenced) and contains translated genome sequences for all species. It has been created for two main purposes: to facilitate the comparative analysis of RNA viruses and to become a hub for other, more specialised virus Web sites. It is available at the following four mirrored sites: http://virus.zoo.ox.ac.uk/rnavirusdb; http://hivweb.sanbi.ac.za/rnavirusdb; http://bioinf.cs.auckland.ac.nz/rnavirusdb; http://tree.bio.ed.ac.uk/rnavirusdb.

Year: 2008 PMID: 18948277 PMCID: PMC2686580 DOI: 10.1093/nar/gkn729

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Viruses are divided into two similar-sized groups depending on whether the virus particle contains DNA or RNA, and, as causes of human fatality, RNA viruses are by far the more important (1). New viral diseases continue to appear as a result of several changes in human activity: travel, population growth, interaction with wild habitats etc. Well-known novel, or emergent, RNA diseases include severe acute respiratory syndrome (SARS) (2), human immunodeficiency virus 1 (HIV-1) (3), and may come to include Avian influenza H5N1 virus (4). These emergent diseases are an important factor behind the increase in the number of genome sequences that NCBI treats as representing new species (Figure 1). In 2005, more than 200 new virus species were submitted to GenBank (more recent dates are less reliable because there is typically a delay between submission and public availability). As more emergent viruses appear, it is important to have a site that allows their genomes to be compared to those of known viruses. The origin of most major infectious diseases is unknown because of our ignorance of the diversity of pathogens in wild animals. This restricts our ability to both predict risks and develop treatments (5).

Figure 1.

Submission of new virus species to GenBank between 1970 and 2006. Dates are the earliest given in the accession (either of submission or publication). Submissions after 2006 are excluded because accessions are typically made public only following publication, and thus the frequency of submissions in more recent time periods is underestimated.

Despite some advances (6,7), the evolutionary history of RNA viruses is in general poorly known, especially the deep phylogenetic relationships between virus families (8,9). We believe that one of the reasons for this is a lack of easily available translated genes and genomes for all species, and the lack of aligned genome sequences representing different isolates of the same species.

In addition to the need to facilitate greater comparative analysis of RNA viruses is the need to link together the existing virus Web sites and their underlying databases. There are many Web sites that provide genomic data, tools for genetic analysis and/or biological information for some viruses (see ‘Links’ on our site home page). The RNA Virus Database is intended to complement these other sites by providing basic genomic information and tools for all RNA viruses and linking the user to more specialist sites, where they exist. For example, for HIV-1 and hepatitis C virus (HCV), we provide links to sites such as the Los Alamos Laboratory on the main page for each of these viruses (find by typing HIV-1 or HCV into the search window in the top toolbar). For such viruses, we do not duplicate the work of other groups by attempting to display the available diversity of genomes. We intend that the RNA Virus Database should develop further as a hub for other sites and we therefore encourage other workers to contact us with details of sites that they wish to have linked to ours. Also, we encourage workers to ‘adopt a virus’ and improve and/or expand the information that we provide for individual species. This can be done by emailing us or by getting involved directly in developing the database, which is an Open Source project available at our GoogleCode site (http://code.google.com/p/rnavirusdb).

Some of the data and tools on the RNA Virus Database can be found elsewhere, but not all of them. For example, NCBI's Genome site provides genomic overviews of virus species and pairwise alignments of other isolates to the reference sequence, but it does not provide multiple alignments or complete translated genomes as we do. Similarly, its general Entrez site provides pairwise alignments of the query sequence and similar sequences in the database, plus a phylogenetic tree calculated from those distances; however, no multiple alignment is built. We also corrected the (few) errors in the GenBank entries, and our database records features such as RNA editing (10) that make genome translation problematic.

We have, therefore, created the RNA Virus Database as a user-friendly site devoted to RNA viruses, providing essential genomic data and tools (discussed in more detail below) and links to the other virus Web sites. The three main features are as follows: (i) provision of multiple whole-genome alignments, and of gene and whole-genome translations, for all RNA virus species; (ii) an identification and taxonomic searching facility; and (iii) guidance to other web resources.

WHOLE-GENOME ALIGNMENTS

We provide multiple alignments of whole genomes (as nucleotides) for all species where GenBank contains multiple representatives. Our database, thus, currently has multiple alignments for approximately half the species (available from the main page of any virus species under ‘Download alignment’). These alignments were made by downloading from GenBank all complete (or near-complete) genomes using the BioPerl GenBank modules (11). Accidental mismatches were excluded by performing a preliminary alignment using BlastAlign (12), which is designed to cope with non- or poorly homologous sequences and reports such matches. The sequences that showed clear homology to the NCBI reference sequences (we used a cutoff of a maximum of 40% of positions being represented by gaps in the BlastAlign alignment), up to a maximum of 50, were then aligned using ClustalW (default parameter values) (13). For species for which we have at least three sequences, a neighbor-joining tree was then constructed using PAUP (with HKY-adjusted genetic distances) (14), and this tree is displayed both as a pdf [via the TreeGraph program (15)] and using FigTree, which is a new Java-based tree-drawing application created by one of us (Rambaut,A., unpublished data). FigTree will also display strain and isolate information as well as accession numbers.
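The filtering step described above (a 40% gap cutoff relative to the reference, capped at 50 sequences) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the dictionary input format and the decision to count the reference sequence towards the cap of 50 are all assumptions made for the example.

```python
def gap_fraction(aligned_seq, gap_char="-"):
    """Fraction of alignment positions that are gaps in one aligned sequence."""
    if not aligned_seq:
        return 1.0
    return aligned_seq.count(gap_char) / len(aligned_seq)


def select_homologous(aligned, reference_id, max_gap_fraction=0.40, max_sequences=50):
    """Return IDs of the reference plus sequences passing the gap cutoff.

    `aligned` maps sequence ID -> aligned sequence (hypothetical input format,
    standing in for a preliminary BlastAlign alignment).
    Assumption: the reference counts towards `max_sequences`.
    """
    kept = [reference_id]
    for seq_id, seq in aligned.items():
        if len(kept) >= max_sequences:
            break
        if seq_id == reference_id:
            continue
        if gap_fraction(seq) <= max_gap_fraction:
            kept.append(seq_id)
    return kept


example = {
    "ref": "ACGTACGTAC",
    "a": "ACGT--GTAC",  # 20% gaps: kept
    "b": "A-----G---",  # 80% gaps: rejected as a likely mismatch
    "c": "ACGTACG---",  # 30% gaps: kept
}
selected = select_homologous(example, "ref")
```

The sequences selected in this way would then be passed to a multiple aligner such as ClustalW.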

IDENTIFY RELATIONSHIPS OF SUBMITTED SEQUENCES

Our site allows virus nucleotide or amino acid sequences submitted by the user to be identified or, if the query is a new species, its closest relative to be found. In addition, the genomic location of any matched region of the library sequences is shown. For this we use NCBI's suite of BLAST programs (16) (go to the ‘BLAST’ link on the toolbar of the home page). Once the most closely related reference species has been located, the query sequence can then be placed into a whole-genome multiple alignment for that species (where such an alignment is present) in order to show the query's phylogenetic relationships to the genomes in our database (go to ‘Align your sequence’ on the virus species main page). An example of this process is illustrated in Figure 2. Two procedures are available here for building the new multiple alignment. (i) A BLAST of the query to the reference species sequence provides coordinates from the resulting pair-wise alignment. These coordinates are then used to select homologous regions from the reference multiple alignment, and a new multiple alignment is then built using ClustalW along with a phylogenetic tree using PAUP as described above. (ii) BlastAlign (described above) is used to generate a new multiple alignment using the query sequence plus the sequences from the reference multiple alignment.
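Procedure (i) relies on translating coordinates from the pairwise BLAST hit on the reference sequence into columns of the reference multiple alignment. The sketch below shows one way this mapping could be done; it is an assumption-laden illustration rather than the site's implementation. The function names and the dictionary representation of the alignment are hypothetical, and minus-strand hits (where BLAST reports start greater than end) are not handled.

```python
def reference_to_column(aligned_reference, gap_char="-"):
    """List the alignment column of each ungapped reference position, in order."""
    return [col for col, base in enumerate(aligned_reference) if base != gap_char]


def slice_alignment(alignment, reference_id, ref_start, ref_end):
    """Cut out the alignment columns spanning reference positions ref_start..ref_end.

    Coordinates are 1-based and inclusive, as BLAST reports them for a
    plus-strand hit. `alignment` maps sequence ID -> aligned sequence.
    """
    columns = reference_to_column(alignment[reference_id])
    if not 1 <= ref_start <= ref_end <= len(columns):
        raise ValueError("coordinates fall outside the reference sequence")
    first = columns[ref_start - 1]
    last = columns[ref_end - 1]
    return {seq_id: seq[first:last + 1] for seq_id, seq in alignment.items()}


alignment = {
    "ref": "AC-GTA",
    "x": "ACTGTA",
    "y": "A--GTT",
}
# Reference positions 2..3 (C and G) span alignment columns 1..3.
region = slice_alignment(alignment, "ref", 2, 3)
```

The resulting sub-alignment, together with the query sequence, would then be realigned (with ClustalW in the procedure described above).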

Figure 2.

Screenshots illustrating use of the RNA Virus Database to investigate a submitted virus nucleotide sequence.

Another database table that is available as part of the download, ‘isolates’, includes the biological data of the isolates used in the whole-genome multiple alignments [except for HIV-1, HCV and hepatitis B virus (HBV), where we used manually built multiple alignments; for these three species, accession numbers for the isolates are given in the Supplementary Data as AccessionHIVHCVHBV.xls].

TRANSLATED GENES AND GENOMES

We provide amino acid sequences for all virus genes (or, more strictly, for all Open Reading Frames) plus complete translated genomes for each virus species (go to ‘Proteins’ on the toolbar of any virus species main page). The translated genomes are intended to facilitate phylogenetic analysis of more distantly related viruses (9). One feature that makes annotation of RNA viruses difficult is that most species have some gene overlap (17), i.e. where the same nucleotides code for two different genes by being read in two different frames. We, therefore, allow the user to select from three possible options for dealing with this feature: (i) have overlapped regions excised from the translated genome, (ii) have one only of any overlapped amino acid sequences represented or (iii) have all the amino acids sequences present, with overlapped sequences placed sequentially (and thus the nucleotides represented twice). Using a key word search of the GenBank entries, and a standard reference work (18) where this did not reveal a match, we have placed most of the genes into functional groups (go to ‘Proteins’ on the toolbar of the home page).
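The three options for overlapping genes can be illustrated with a small sketch. This is one possible reading of the options, not the database's own code: the function name, the mode labels ("excise", "one", "all"), the codon-level treatment of overlaps and the rule that the earlier-listed gene is the one retained under option (ii) are all assumptions. Only forward-strand ORFs given as 0-based half-open intervals are handled, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_genome(genome, orfs, mode="all"):
    """Concatenate ORF translations, treating overlaps according to `mode`.

    orfs: list of (start, end) forward-strand intervals, 0-based, half-open.
    mode: "excise" drops every codon touching an overlapped nucleotide,
          "one" keeps overlapped codons only for the first ORF listed,
          "all" keeps everything, so overlapped nucleotides are read twice.
    """
    if mode not in {"excise", "one", "all"}:
        raise ValueError("mode must be 'excise', 'one' or 'all'")
    coverage = Counter(pos for start, end in orfs for pos in range(start, end))
    seen = set()  # nucleotides already covered by earlier ORFs
    protein = []
    for start, end in orfs:
        for codon_start in range(start, end - 2, 3):
            positions = range(codon_start, codon_start + 3)
            if mode == "excise" and any(coverage[p] > 1 for p in positions):
                continue
            if mode == "one" and any(p in seen for p in positions):
                continue
            codon = genome[codon_start:codon_start + 3].upper().replace("U", "T")
            protein.append(CODON_TABLE.get(codon, "X"))  # X marks ambiguous codons
        seen.update(range(start, end))
    return "".join(protein)


genome = "ATGAAACCCGGGTTT"
orfs = [(0, 9), (7, 13)]  # second ORF overlaps the first by two nucleotides, in another frame
results = {mode: translate_genome(genome, orfs, mode) for mode in ("excise", "one", "all")}
```

For this toy genome the three modes give "MKG", "MKPG" and "MKPPG" respectively, which shows how the choice changes the length of the concatenated translation.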

DATABASE STRUCTURE AND DATA SOURCES

The RNA Virus Database is a PHP web application on top of a MySQL database. The underlying database has eight tables linked as shown in Figure 3. PHP is available at http://www.php.net, but should come pre-installed on UNIX machines. MySQL is available free from http://www.mysql.com. All data have been taken from the NCBI's Genome and Nucleotide sites at the following two URLs.

Figure 3.

Relationships of tables in the underlying MySQL database. Table names are in large bold font and interlinking column names are in small regular font.

http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/10239.html
http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide

Species names, nucleotide sequences and accession numbers were downloaded directly from GenBank using BioPerl modules, while further details of the virus (gene coordinates, taxonomic affinities etc.) were subsequently extracted from the flatfile of all GenBank entries that can be downloaded from the NCBI Genome site (19). Our approach was to treat all entries in the NCBI Virus Genome sites as species and to follow their taxonomic classification, although we only give virus type (e.g. single-stranded positive-sense, retrotranscribing), family and genus at our site. We, therefore, follow NCBI's inclusion of hepadnaviruses, which include HBV, and caulimoviruses among the RNA viruses despite their mature virion containing DNA rather than RNA (their possession of reverse transcriptase clearly places them biologically and evolutionarily among the reverse-transcribing group of RNA viruses).

AVAILABILITY, FUTURE EXTENSIONS AND UPDATES

The URLs of our mirrors are given in the Abstract. The PHP scripts can be accessed using Subversion (http://subversion.tigris.org/) from our GoogleCode site at http://code.google.com/p/rnavirusdb. The MySQL database (a gzipped 16 Mb dump) may also be downloaded from the same site for installation on the user's computer if required. A README with installation instructions is present among the PHP scripts. If required, the database can subsequently be updated by other users following instructions and Perl scripts given in the Supplementary Material (perl_scripts.tar). This updating involves a series of short intervening manual steps (we find that complete automation of such processes is inefficient). We intend to update the database on at least a 6-monthly basis in order to include newly discovered viruses, and are currently working to incorporate biological and epidemiological data. We also intend to release a DNA virus database using the same format soon. As discussed in the Introduction, we are keen to collaborate with other groups over further developments of our database.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Wellcome Trust (to R.B.); EU Marie Curie Fellowship scheme (to T.d.O.); the Royal Society (to A.R.).

Conflict of interest statement. None declared.

15 in total

1. The versatility of paramyxovirus RNA polymerase stuttering.

Authors: S Hausmann; D Garcin; C Delenda; D Kolakofsky
Journal: J Virol Date: 1999-07 Impact factor: 5.103

Review 2. Big nidovirus genome. When count and order of domains matter.

Authors: A E Gorbalenya
Journal: Adv Exp Med Biol Date: 2001 Impact factor: 2.622

Review 3. The causes and consequences of HIV evolution.

Authors: Andrew Rambaut; David Posada; Keith A Crandall; Edward C Holmes
Journal: Nat Rev Genet Date: 2004-01 Impact factor: 53.242

4. The Bioperl toolkit: Perl modules for the life sciences.

Authors: Jason E Stajich; David Block; Kris Boulez; Steven E Brenner; Stephen A Chervitz; Chris Dagdigian; Georg Fuellen; James G R Gilbert; Ian Korf; Hilmar Lapp; Heikki Lehväslaiho; Chad Matsalla; Chris J Mungall; Brian I Osborne; Matthew R Pocock; Peter Schattner; Martin Senger; Lincoln D Stein; Elia Stupka; Mark D Wilkinson; Ewan Birney
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

5. BlastAlign: a program that uses blast to align problematic nucleotide sequences.

Authors: Robert Belshaw; Aris Katzourakis
Journal: Bioinformatics Date: 2004-08-13 Impact factor: 6.937

6. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

7. A reevaluation of the higher taxonomy of viruses based on RNA polymerases.

Authors: P M Zanotto; M J Gibbs; E A Gould; E C Holmes
Journal: J Virol Date: 1996-09 Impact factor: 5.103

Review 8. Evolution and taxonomy of positive-strand RNA viruses: implications of comparative analysis of amino acid sequences.

Authors: E V Koonin; V V Dolja
Journal: Crit Rev Biochem Mol Biol Date: 1993 Impact factor: 8.250

9. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors: J D Thompson; D G Higgins; T J Gibson
Journal: Nucleic Acids Res Date: 1994-11-11 Impact factor: 16.971

10. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
