euHCVdb: the European hepatitis C virus database.

Christophe Combet, Nicolas Garnier, Céline Charavay, Delphine Grando, Daniel Crisan, Julien Lopez, Alexandre Dehne-Garcia, Christophe Geourjon, Emmanuel Bettler, Chantal Hulo, Philippe Le Mercier, Ralf Bartenschlager, Helmut Diepolder, Darius Moradpour, Jean-Michel Pawlotsky, Charles M Rice, Christian Trépo, François Penin, Gilbert Deléage.

Abstract

The hepatitis C virus (HCV) genome shows remarkable sequence variability, leading to the classification of at least six major genotypes, numerous subtypes and a myriad of quasispecies within a given host. A database allowing researchers to investigate the genetic and structural variability of all available HCV sequences is an essential tool for studies on the molecular virology and pathogenesis of hepatitis C as well as drug design and vaccine development. We describe here the European Hepatitis C Virus Database (euHCVdb, http://euhcvdb.ibcp.fr), a collection of computer-annotated sequences based on reference genomes. The annotations include genome mapping of sequences, use of recommended nomenclature, subtyping as well as three-dimensional (3D) molecular models of proteins. A WWW interface has been developed to facilitate database searches and the export of data for sequence and structure analyses. As part of an international collaborative effort with the US and Japanese databases, the European HCV Database (euHCVdb) is mainly dedicated to HCV protein sequences, 3D structures and functional analyses.

Substances: Viral Proteins

Year: 2006; PMID: 17142229; PMCID: PMC1669729; DOI: 10.1093/nar/gkl970

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

Hepatitis C virus (HCV) infection is a major cause of chronic hepatitis, liver cirrhosis and hepatocellular carcinoma worldwide. The HCV genome is ∼9600 nt in length and carries a single, long open reading frame (ORF) flanked by 5′ and 3′ non-translated regions. The ORF encodes a polyprotein of ∼3000 amino acids that is processed by cellular and viral proteases to yield at least 10 mature proteins: C, E1, E2, p7, NS2, NS3, NS4A, NS4B, NS5A and NS5B (1,2). The sequence diversity among HCV genomes leads to the definition of a large number of subtypes distributed into six genotypes (3). In addition, HCV exists within its hosts as a pool of genetically distinct but closely related variants referred to as quasispecies (4). It is now well established that the genotype is a predictive factor of the response to interferon-α therapy (5). Consequently, intensive sequence analyses of HCV genomes are currently conducted, and >40 000 nt sequences have been deposited to date into the DDBJ/EMBL/GenBank databases. In order to manage such large and growing collections of sequences, to facilitate sequence analysis and drug and vaccine design, several specialized databases have been developed (6). The database team members are involved in a network of experts for HCV nomenclature definition and harmonization (3). We present here the European HCV Database (euHCVdb). This database combines computer-annotated HCV sequences with protein three-dimensional (3D) models and it is linked to numerous sequence and structure analysis tools on dedicated websites. The euHCVdb is mainly oriented towards the structural biology of HCV, including protein sequence, structure and function analyses (2).

DATABASE CONSTRUCTION

The euHCVdb is an extension of the French HCV Database developed in 1999 (7). It is updated on a monthly basis from the EMBL nucleotide sequence database (8) and maintained in the PostgreSQL relational database management system (RDMS). The programs for parsing the EMBL database flat files, annotating HCV entries, and populating and querying the database use SQL and Java. Great effort has been devoted to developing a fully automated annotation procedure using a reference set of 26 well-characterized complete HCV genomes representing 18 subtypes (3). The molecular models based on protein homology are automatically computed and stored in a separate database by Modeome3D, a protein molecular model database management system written in Python (N. Garnier, C. Combet, C. Geourjon, G. Deléage and E. Bettler, manuscript in preparation). The building procedure starts with the parsing of the EMBL database (vrl section) flat text file. The entries corresponding to the organism HCV are loaded into the RDMS. The second step of the procedure is the sequence annotation, which results in the creation of the euHCVdb and the euHCVdb3D databases by adding annotations to the EMBL data (i.e. genome mapping, protein features, genotype/subtype and 3D models). The sets of pre-computed multiple sequence alignments of complete proteins or reference genomes and 3D models are also updated monthly.
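
The first step of the building procedure (parsing the flat file and keeping only HCV entries) can be illustrated with a minimal sketch. This is not the authors' code, which is written in Java and SQL. It assumes only the standard EMBL flat-file layout (two-letter line codes, `AC` and `OS` lines, `//` as entry terminator). The function names, the prefix match on the `OS` line and the sample accessions `XX000001` and `XX000002` are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def iter_embl_entries(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per EMBL flat-file entry (entries end with '//')."""
    accessions: List[str] = []
    organism: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            if accessions:
                yield {
                    # The first accession on the AC line(s) is the primary one.
                    "primary_accession": accessions[0],
                    "organism": " ".join(organism),
                }
            accessions, organism = [], []
        elif line.startswith("AC "):
            # Line content starts at column 6; accessions are ';'-separated.
            accessions += [a.strip() for a in line[5:].split(";") if a.strip()]
        elif line.startswith("OS "):
            organism.append(line[5:].strip())


def is_hcv(entry: Dict[str, str]) -> bool:
    """Assumed selection rule: the organism name starts with 'Hepatitis C virus'."""
    return entry["organism"].lower().startswith("hepatitis c virus")


def hcv_accessions(lines: Iterable[str]) -> List[str]:
    """Primary accessions of the entries that would be loaded into the database."""
    return [e["primary_accession"] for e in iter_embl_entries(lines) if is_hcv(e)]


# Made-up entries, not real EMBL records.
SAMPLE = """\
ID   XX000001; SV 1; linear; genomic RNA; STD; VRL; 9600 BP.
AC   XX000001; XX000009;
OS   Hepatitis C virus subtype 1b
//
ID   XX000002; SV 1; linear; genomic DNA; STD; VRL; 3200 BP.
AC   XX000002;
OS   Hepatitis B virus
//
"""

if __name__ == "__main__":
    print(hcv_accessions(SAMPLE.splitlines()))  # prints ['XX000001']
```

Keeping the primary accession as the record key mirrors the way euHCVdb entry identifiers are built from the EMBL primary accession number.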

DATABASE CONTENTS

The format of euHCVdb is an extension of the EMBL database format. Thus, the primary accession number, the creation date, the bibliographic references, the source feature and the sequence of each EMBL database entry are conserved in the euHCVdb entries. The euHCVdb entry identifiers are built from the EMBL primary accession number to avoid any change in references for sequences. A text search on the EMBL source feature is performed to retrieve subtype and isolate name of the deposited sequence. They are stored in dedicated qualifiers in the euHCVdb entries. The sequence of the entry is then retrieved and used to run a FASTA (9) similarity search against the sequences of the reference genomes. The annotations of the most similar genome are retrieved to map regions and features on the current sequence. The current entry is then subtyped by using two sets of reference nucleotide sequences as described in (3). These reference sequences cover two subgenomic regions, one in C/E1 and one in NS5B proteins, and are stored in two sequence databanks. A FASTA search is run for each sequence against each databank and the provisional genotype field of the entry is filled only if the following requirements are satisfied: sufficient coverage of the alignment, similarity level and consistency of genotype between the two subgenomic regions. The annotation procedure ends by looking for protein 3D structural templates to build 3D models of the HCV proteins by homology modeling. For each sequence, a similarity search is run against a selected set of HCV protein structures deposited in the Protein Data Bank (PDB) (10). When acceptable alignments are found, the 3D models are computed using the Geno3D program for automatic comparative homology modeling of proteins (11). A product feature (prod_ft qualifier) indicated by model3d, which includes the template PDB code, is added to the corresponding mat_peptide feature of the entry and the 3D model is stored in the euHCVdb3D. 

Each euHCVdb entry offers external cross-references to the NCBI taxonomy, the UniProt knowledgebase (UniProtKB) (12) and the EMBL database, as well as internal cross-references to the reference genome used for annotation and to the euHCVdb3D molecular models. The euHCVdb database is also cross-referenced by the UniProtKB/Swiss-Prot (12) and LANL HCV (13) databases and by the NCBI LinkOut system. All these cross-references allow rapid collection of additional information when required. The automatic annotation procedure ensures standardized nomenclature for all entries across the database and builds a description of the HCV genomic regions and proteins that are included in the entry. This procedure also provides bibliographic references, cross-references to external databases, genotype, well-characterized sites (e.g. hypervariable region 1, HVR1) or domains (e.g. NS3 helicase), the source of the sequence (e.g. isolate), and structural data in the form of 3D protein models. The protein annotations are done in close collaboration with the Swiss-Prot group of the Swiss Institute of Bioinformatics. The format and the controlled vocabulary of the UniProtKB/Swiss-Prot database are used in euHCVdb at the level of the prod_ft.
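
The subtyping rule described in this section (fill the provisional genotype only when coverage, similarity and agreement between the C/E1 and NS5B regions are all satisfied) can be sketched as follows. The text does not give the thresholds or say how sequences covering only one region are handled, so the `Hit` class, the cut-off values `min_identity` and `min_coverage`, and the requirement that both regions have a hit are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Best FASTA hit of a query against one reference databank (hypothetical type)."""
    genotype: str    # genotype/subtype label of the best-matching reference, e.g. "1b"
    identity: float  # fraction of identical positions in the alignment, 0..1
    coverage: float  # fraction of the reference region covered by the alignment, 0..1


def provisional_genotype(
    c_e1: Optional[Hit],
    ns5b: Optional[Hit],
    min_identity: float = 0.90,  # assumed threshold, not taken from the paper
    min_coverage: float = 0.80,  # assumed threshold, not taken from the paper
) -> Optional[str]:
    """Return a genotype label only if both regions pass the cut-offs and agree."""
    for hit in (c_e1, ns5b):
        if hit is None:
            return None  # assumption: both regions must be present
        if hit.identity < min_identity or hit.coverage < min_coverage:
            return None  # insufficient similarity or alignment coverage
    if c_e1.genotype != ns5b.genotype:
        return None      # inconsistent genotype between the two regions
    return c_e1.genotype


if __name__ == "__main__":
    print(provisional_genotype(Hit("1b", 0.95, 0.90), Hit("1b", 0.93, 0.85)))  # prints 1b
    print(provisional_genotype(Hit("1b", 0.95, 0.90), Hit("3a", 0.93, 0.85)))  # prints None
```

Returning `None` rather than a best guess keeps the field empty whenever any criterion fails, which matches the conservative behaviour the text describes.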

WEB INTERFACE

The euHCVdb is accessible through a website (http://euhcvdb.ibcp.fr) (Figure 1). It is divided into static and dynamic parts. The static part allows the user to access the description of genomic regions or proteins by clicking on pictures (Figure 1A). Pre-computed multiple sequence alignments of reference genomes or complete protein sequences can be viewed and edited with the EditAlignment Java applet developed in our team (Figure 1B). When the experimental 3D structures of the molecules are known, links to PDB files are available that allow users to view and analyze the structure using the Jmol applet. Users can also find recommended nomenclature, the list of the reference genomes and links to other resources in the static part.

Figure 1

An overview of euHCVdb. (A) A static page describing known data for NS3 protein. (B) The multiple sequence alignment of NS3 protein from (A) viewed in the EditAlignment applet. (C) A query result page with the display and tools boxes, and list of hits. (D) An entry details page, with data for NS3 and links to the euHCVdb3D models. (E) A molecular model viewed through the euHCVdb3D interface. The Jmol applet is linked with the alignment between the modeled and template PDB sequences.

In the dynamic part, a query system (not shown) allows the building of dynamic sets of sequences or 3D models according to criteria selected by the user through the query interface (>30 different criteria can be defined), e.g. ‘extract all the 3D models of the NS3 protease of confirmed genotype 3, 4 and 5’. The results (Figure 1C) are displayed in a table where each row corresponds to an entry or a genomic region/protein that is described by a small set of identifiers and characteristics such as accession number, protein name, genotype, isolate, length and description. Links to other HCV databases are also provided in this table. A hypertext link on the accession number of each entry allows display of the complete entry data in the Entry details page (Figure 1D). This page contains hyperlinks to external resources such as the EMBL database, to retrieve partial or complete nucleotide or protein sequences of the entry or to the Jmol Web page to analyze pre-computed 3D protein models when available (Figure 1E). The results in Figure 1C can be ordered by ascending/descending values of each column. Above the table, a form allows the user to change the number of results per page or to switch to a given page or accession number. A toolbox is also available to export data (e.g. sequences) to other resources. Data to be exported depend on the Sequence type selected in the query form.
Available sequence types are the full-length nucleotide sequence or the corresponding polyprotein, a genomic region or a protein, or a sequence corresponding to a polynucleotide or protein feature. The toolbox allows the export of all/page/(un)selected results as a text file of entries, a Pearson sequence file or a tab-delimited text file. Sequences can be transferred to the NPS@ server (14), which is an integrated sequence analysis Web server with a simple interface that provides access to 46 programs for sequence analyses (e.g. BLAST and CLUSTAL W) and 12 biological databases. With the support of this server, the user can, for example, obtain a multiple sequence alignment of the full-length NS3 protein sequences to analyze the amino acid variability at each position of the alignment with the residue repertoire tool (F. Dorkeld, F. Penin, G. Deléage and C. Combet, unpublished data). In addition, the 3D models of different variants can be extracted from the database and further analyzed using the PIG Web server for protein structure analysis (N. Garnier, G. Deléage and E. Bettler, manuscript in preparation). Selected 3D models can be superimposed to generate a multiple structural alignment. The user can interactively analyze the fitted models by using the Jmol applet (Figure 1E) that is dynamically linked to the corresponding sequence alignment. In this way, one can identify the differences between the variants. Such analysis is helpful for a better understanding of structure–function relationships, which is of high relevance, e.g. for understanding drug resistance. The user can also visualize the residue conservation at the model level, obtain a list of accessible residues, or detect ligand-binding or active sites using the SuMo Web server (15).
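
The idea of analyzing amino acid variability at each position of an alignment can be illustrated with a short sketch. The residue repertoire tool itself is unpublished, so this is not its implementation. It simply counts the residues observed in each column of an already aligned set of sequences. The function names and the toy alignment are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def residue_repertoire(alignment: Sequence[str], gap: str = "-") -> List[Counter]:
    """Count the residues observed in each alignment column, ignoring gaps."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        Counter(seq[col] for seq in alignment if seq[col] != gap)
        for col in range(width)
    ]


def variable_positions(alignment: Sequence[str]) -> List[int]:
    """1-based alignment positions at which more than one residue is observed."""
    return [
        col + 1
        for col, counts in enumerate(residue_repertoire(alignment))
        if len(counts) > 1
    ]


# Toy alignment for illustration; these are not HCV sequences.
TOY_ALIGNMENT = ["MAPIT", "MAPVT", "MSPIT", "MAP-T"]

if __name__ == "__main__":
    print(variable_positions(TOY_ALIGNMENT))  # prints [2, 4]
```

A per-column count of this kind is the usual starting point for spotting variable positions, which can then be examined on superimposed 3D models of the variants.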

DATABASE STATISTICS

The database has been running since March 2005. The current release, number 71 (October 2006), comprises 43 124 entries and 646 protein 3D models representing 914 protein chains for a total of 383 443 residues and 2 930 254 atoms. The database currently receives ∼200 queries a day.

CONCLUSIONS

The relational database of annotated HCV sequences and protein 3D models we have developed focuses on HCV protein sequence, structure and function analyses. The automatic annotation process used to generate euHCVdb guarantees the consistency of annotations across the database, which allows efficient keyword searches and comprehensive sequence and structure analyses. The euHCVdb website permits dynamic queries through a very simple interface, and query results can be further analyzed with a set of bioinformatics programs available in the NPS@ and PIG Web servers. By providing the ability to conduct complex searches and analyses, euHCVdb is a powerful tool for researchers working in the HCV field. Moreover, we are working in an international collaborative effort with the US and Japanese HCV databases to improve the use of harmonized HCV data and nomenclature, and making an effort to be as complementary as possible with the goal of providing helpful and efficient tools for HCV researchers around the world.

15 in total

Review 1. NPS@: network protein sequence analysis.

Authors: C Combet; C Blanchet; C Geourjon; G Deléage
Journal: Trends Biochem Sci Date: 2000-03 Impact factor: 13.807

2. Geno3D: automatic comparative molecular modelling of protein.

Authors: Christophe Combet; Martin Jambon; Gilbert Deléage; Christophe Geourjon
Journal: Bioinformatics Date: 2002-01 Impact factor: 6.937

Review 3. Hepatitis C virus genetic variability: pathogenic and clinical implications.

Authors: Jean-Michel Pawlotsky
Journal: Clin Liver Dis Date: 2003-02 Impact factor: 6.126

4. The SuMo server: 3D search for protein functional sites.

Authors: Martin Jambon; Olivier Andrieu; Christophe Combet; Gilbert Deléage; François Delfaud; Christophe Geourjon
Journal: Bioinformatics Date: 2005-09-01 Impact factor: 6.937

Review 5. Unravelling hepatitis C virus replication from genome to function.

Authors: Brett D Lindenbach; Charles M Rice
Journal: Nature Date: 2005-08-18 Impact factor: 49.962

6. Hepatitis C databases, principles and utility to researchers.

Authors: Carla Kuiken; Masashi Mizokami; Gilbert Deleage; Karina Yusim; Francois Penin; Tadasu Shin-I; Céline Charavay; Ning Tao; Daniel Crisan; Delphine Grando; Anita Dalwani; Christophe Geourjon; Ashish Agrawal; Christophe Combet
Journal: Hepatology Date: 2006-05 Impact factor: 17.425

Review 7. Consensus proposals for a unified system of nomenclature of hepatitis C virus genotypes.

Authors: Peter Simmonds; Jens Bukh; Christophe Combet; Gilbert Deléage; Nobuyuki Enomoto; Stephen Feinstone; Phillippe Halfon; Geneviève Inchauspé; Carla Kuiken; Geert Maertens; Masashi Mizokami; Donald G Murphy; Hiroaki Okamoto; Jean-Michel Pawlotsky; François Penin; Erwin Sablon; Tadasu Shin-I; Lieven J Stuyver; Heinz-Jürgen Thiel; Sergei Viazov; Amy J Weiner; Anders Widell
Journal: Hepatology Date: 2005-10 Impact factor: 17.425

8. HCVDB: hepatitis C virus sequences database.

Authors: Christophe Combet; François Penin; Christophe Geourjon; Gilbert Deléage
Journal: Appl Bioinformatics Date: 2004

9. EMBL Nucleotide Sequence Database: developments in 2005.

Authors: Guy Cochrane; Philippe Aldebert; Nicola Althorpe; Mikael Andersson; Wendy Baker; Alastair Baldwin; Kirsty Bates; Sumit Bhattacharyya; Paul Browne; Alexandra van den Broek; Matias Castro; Karyn Duggan; Ruth Eberhardt; Nadeem Faruque; John Gamble; Carola Kanz; Tamara Kulikova; Charles Lee; Rasko Leinonen; Quan Lin; Vincent Lombard; Rodrigo Lopez; Michelle McHale; Hamish McWilliam; Gaurab Mukherjee; Francesco Nardone; Maria Pilar Garcia Pastor; Siamak Sobhany; Peter Stoehr; Katerina Tzouvara; Robert Vaughan; Dan Wu; Weimin Zhu; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. The Universal Protein Resource (UniProt): an expanding universe of protein information.

Authors: Cathy H Wu; Rolf Apweiler; Amos Bairoch; Darren A Natale; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Raja Mazumder; Claire O'Donovan; Nicole Redaschi; Baris Suzek
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
