HDAPD: a web tool for searching the disease-associated protein structures.

Yi-Ruen Lin, Hsin-Yuan Wei, Tsung-Lin Tsai, Thy-Hou Lin.

Abstract

BACKGROUND: The structures of disease-associated proteins are important for structure-based drug design against a particular disease. Until now, protein structures have usually been searched by PDB id or by sequence information. In the HDAPD database presented here, however, the structure of a disease-associated protein can be searched directly by keying in the name of the associated disease. DESCRIPTION: A search in HDAPD is initiated by keying in keywords for a disease, protein name, protein type, or PDB id. The protein sequence can be presented in FASTA format and copied directly for a BLAST search. HDAPD is also interfaced with Jmol so that users can view and manipulate a protein structure. Gene ontology data such as cellular components, molecular functions, and biological processes are provided once a hyperlink to Gene Ontology (GO) is clicked. Further, HDAPD provides a link to the KEGG map, which shows where the protein is placed in a metabolic pathway and how it relates to other proteins there. The latest literature for the protein retrieved from PubMed (titles, journals, authors, and abstracts) is also presented as a list of controllable length.
CONCLUSIONS: Since the HDAPD data content can be routinely updated through a purpose-built PHP-MySQL web page, the database is useful for finding the structures of disease-associated proteins that may play important roles in disease development, in support of structure-based drug design against those diseases.

Year:  2010        PMID: 20158919      PMCID: PMC2833151          DOI: 10.1186/1471-2105-11-88

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Most current disease databases provide clinical information such as features, syndromes, diagnostic methods, and therapies for doctors, pharmacists, nurses, and medical technology staff who work in clinical settings. Disease-associated databases remain few compared with the molecular biology databases built so far. The Online Mendelian Inheritance in Man (OMIM) [1] of NCBI and GeneCards [2] are two disease databases that relate diseases to genes. Genes and Disease [3] is a database that collects articles and discusses the genes that give rise to diseases. The International Classification of Diseases (ICD) [4] is a WHO resource that supplies a diagnostic classification of diseases and is used to analyze the general health situation of population groups and to monitor the incidence and prevalence of diseases and other health-related problems. Disease-related protein structures are of great research interest to both experimental and computational scientists. Their importance stems from the fact that they provide molecular pictures of disease processes, a necessary prerequisite for structure-based drug design. In this work, we have developed a web search tool, the Human Disease-Associated Protein Database (HDAPD), for researchers interested in disease-associated protein structures. A search is initiated by typing related keywords and returns information on the structure, structure-related sequence, location, function, pathway, and literature for a disease-associated protein collected in the database.

Construction and content

The data in HDAPD come from two kinds of sources: databases and the literature. The disease-associated proteins were mined from research articles and from the databases Genes and Disease, OMIM, and ICD. The proteins collected for each disease were grouped following the ICD classification of diseases. A background introduction extracted from the Genetics Home Reference (GHR) database [5] or the online publication GeneReviews [6] is listed for each disease name clicked. The overall disease and literature information was extracted from OMIM, Genes and Disease, ICD, PubMed [7], GeneReviews, and GHR, while the associated protein information such as name, structure, sequence, classification, PDB id, ontological data, and KEGG map was extracted from PDB [8], UniProt [9], GO [10], and KEGG [11]. We first searched Genes and Disease for proteins associated with diseases and then fed this information into OMIM to mine more potential proteins. These disease-associated proteins were grouped by the disease they cause. We also extracted disease information from ICD and processed it in the same way as that extracted from Genes and Disease. The gathered information on diseases and disease-related proteins was classified and tabulated in MySQL [12] tables. Meanwhile, the background information for each disease was extracted from GeneReviews and GHR and recorded as text files. The disease name was treated as a header or footer to separate the disease introductions and to locate the paragraph in which the disease-related information was recorded. The list of diseases with a brief introduction for each, and the full list of disease-associated proteins collected in HDAPD, are presented as pull-down lists once the corresponding hyperlinks on the front page are clicked.
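The tabulation and keyword search described above can be illustrated with a minimal sketch. It is not the HDAPD schema: the table names, column names, and the three sample rows are assumptions chosen for illustration, and Python's built-in sqlite3 module stands in for the MySQL server so that the example runs on its own.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical disease/protein layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT NOT NULL, icd_group TEXT);
        CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL, pdb_id TEXT NOT NULL);
        CREATE TABLE disease_protein (
            disease_id INTEGER REFERENCES disease(id),
            protein_id INTEGER REFERENCES protein(id)
        );
        """
    )
    # Illustrative rows only; they are not copied from HDAPD.
    conn.executemany(
        "INSERT INTO disease VALUES (?, ?, ?)",
        [
            (1, "Lung cancer", "Neoplasms"),
            (2, "Diabetes", "Endocrine, nutritional and metabolic diseases"),
        ],
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [
            (1, "Epidermal growth factor receptor", "1M17"),
            (2, "Tumor protein 53", "1TUP"),
            (3, "Insulin", "4INS"),
        ],
    )
    conn.executemany(
        "INSERT INTO disease_protein VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 3)],
    )
    return conn


def search_structures(conn, keyword):
    """Return (disease, protein, PDB id) rows matching a disease name,
    protein name, or PDB id. SQLite's LIKE ignores ASCII case."""
    pattern = f"%{keyword}%"
    return conn.execute(
        """
        SELECT d.name, p.name, p.pdb_id
        FROM disease d
        JOIN disease_protein dp ON dp.disease_id = d.id
        JOIN protein p ON p.id = dp.protein_id
        WHERE d.name LIKE ? OR p.name LIKE ? OR p.pdb_id LIKE ?
        ORDER BY d.name, p.pdb_id
        """,
        (pattern, pattern, pattern),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for row in search_structures(db, "lung"):
        print(row)
```

The link table keeps the many-to-many relation between diseases and proteins explicit, so one keyword query can start from either side; whether HDAPD stores the relation this way is not stated in the text.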

Utility

Perl scripts [13] were used to divide the proteins into groups and to send them to UniProtKB to search for the corresponding PDB codes. More comprehensive information for each protein found, such as protein description, source organism, classification, experimental method, and X-ray structure, was extracted from the wwPDB FTP archive. A separate MySQL table was generated for each protein found with a PDB entry. We also integrated a molecular viewer in HDAPD so that users can view the structures of the proteins collected in the database. A hyperlink to Jmol [14], a molecular viewer written as a Java applet [15], was built into HDAPD so that users can directly view and manipulate the structure of a protein found. The functional bar in Jmol is opened by moving the cursor into the Jmol template and clicking the right mouse button. In addition to the functional bar, Jmol script commands can be looked up in the "Jmol interactive scripting documentation" via a hyperlink. The desired Jmol script can be entered into the box provided below the Jmol template to manipulate the protein molecule. The protein sequences were collected from the wwPDB FTP archive and saved as one large text file. This file was split in alphabetical order according to the second character of the PDB code. The sequence of a protein can be accessed from the corresponding protein page by clicking the ProteinSequence button at the bottom. The PDB id, molecular type, length, and name are presented ahead of the sequence. The sequence is also presented in FASTA [16] format and can be copied directly into BLAST [17] for further searches if the hyperlink "FASTA format" at the upper right of the page is clicked.
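As an illustration of the sequence handling described above, the following sketch formats one record as FASTA with the PDB id, molecular type, length, and name ahead of the sequence, and derives a file bucket from the PDB code. The exact header layout, the 60-character line width, the reading of the split rule as "second character of the PDB code", and the placeholder sequence are all assumptions; the original work used Perl scripts that are not reproduced here.

```python
import textwrap


def to_fasta(pdb_id, mol_type, name, sequence, width=60):
    """Format one sequence as a FASTA record.

    The header carries PDB id, molecular type, length, and name, in that
    order (an assumed layout). The sequence is wrapped at `width` characters.
    """
    sequence = "".join(sequence.split()).upper()
    header = f">{pdb_id} mol:{mol_type} length:{len(sequence)} {name}"
    return "\n".join([header] + textwrap.wrap(sequence, width))


def bucket_key(pdb_id):
    """Return the second character of a PDB code, used here as the key for
    splitting one large sequence file into smaller alphabetical files."""
    return pdb_id[1].upper()


if __name__ == "__main__":
    # Placeholder sequence, not a real protein.
    print(to_fasta("1TUP", "protein", "TUMOR SUPPRESSOR P53", "ACDEFGHIKL" * 7))
    print(bucket_key("1TUP"))
```

A record produced this way can be pasted directly into a BLAST query box, since BLAST accepts FASTA input with a single ">" header line followed by the sequence lines.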
The GO Consortium uses a consistent language and controlled vocabularies to describe the roles of genes and their related proteins, including cellular component, molecular function, and biological process, in eukaryotic organisms. The protein data in GO Human are collected from UniProt while those in GO Rattus norvegicus are collected from the Rat Genome Database [18]. To access the GO data, a protein name typed into HDAPD is converted to an internal id used to identify the GO ids. The GO ids fall into three parts according to the three kinds of ontological data provided by the database: location, function, and description. These are presented in three tables, each divided into left and right portions: the GO ids are recorded in the left portion and the corresponding definitions in the right. The GO ids can be keyed into AmiGO [19], a web tool designed by GO, to obtain further data such as term linkages from the GO database. We collected a variety of protein names and ids from several databases and kept this information as tables in HDAPD. The KEGG protein and pathway ids acquired from UniProt are used to access the KEGG maps, which show the role of a protein in a metabolic pathway and whether the protein interacts with other proteins in the same pathway. When the KEGG button built for each collected protein in HDAPD is clicked, a two-part table is presented, with the pathway hyperlinks on the left and the names of the corresponding pathways on the right. Clicking a pathway hyperlink connects HDAPD to the KEGG server and brings up the corresponding KEGG analyses and graphic illustrations. HDAPD also provides a hyperlink to the PubMed database of NCBI on the protein page. The search is initiated by a keyword generated by a Perl script and sent to the NCBI server. The information extracted by HDAPD from PubMed consists of the titles, journals, authors, and abstracts published for the protein structure of interest, presented as a long list. Not only is the length of this list controllable, but its content can also be sorted by author, last author, journal, or publication date.
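The length-controllable, sortable literature list can be sketched as follows. The record layout, the function name, and the choice of ascending order are assumptions for illustration; the three sample records are entries from the reference list of this page, abbreviated to their first and last authors.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Citation:
    title: str
    authors: list  # surname-first strings; first and last entries are used for sorting
    journal: str
    published: date


# Sort keys for the four orderings named in the text.
SORT_KEYS = {
    "author": lambda c: c.authors[0].lower(),
    "last_author": lambda c: c.authors[-1].lower(),
    "journal": lambda c: c.journal.lower(),
    "date": lambda c: c.published,
}


def sort_literature(citations, order="date", limit=None):
    """Return citations sorted ascending by `order`, truncated to `limit` items."""
    ordered = sorted(citations, key=SORT_KEYS[order])
    return ordered if limit is None else ordered[:limit]


SAMPLE = [
    Citation("KEGG for linking genomes to life and the environment",
             ["Kanehisa M", "Yamanishi Y"], "Nucleic Acids Res", date(2007, 12, 12)),
    Citation("Gene ontology: tool for the unification of biology",
             ["Ashburner M", "Sherlock G"], "Nat Genet", date(2000, 5, 1)),
    Citation("McKusick's Online Mendelian Inheritance in Man (OMIM)",
             ["Amberger J", "Hamosh A"], "Nucleic Acids Res", date(2008, 10, 8)),
]

if __name__ == "__main__":
    for c in sort_literature(SAMPLE, order="last_author", limit=2):
        print(c.authors[-1], "|", c.journal, "|", c.published.isoformat())
```

Keeping the sort keys in one dictionary means a new ordering only needs one more entry, and `limit` plays the role of the list-length control.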

Discussion

The protein structures collected in HDAPD can be routinely updated through six purpose-designed PHP-MySQL templates, as shown in Figure 1. A comparison of the database contents and search functions provided by HDAPD with those provided by NCBI Entrez, EMBL, UniProt, and GHR is shown in Table 1. While the other databases provide more information on protein sequence, gene ontology, and protein structure, GHR supplies only disease-related information, including literature (Table 1). Moreover, neither NCBI Entrez nor EMBL provides a disease list, an introduction for each disease, or disease-associated protein lists. UniProt does give an introduction for each disease but provides no list of diseases or of disease-associated proteins (Table 1). Undoubtedly, NCBI Entrez, EMBL, and UniProt are far superior to HDAPD in providing information on protein sequences, whether determined or annotated (Table 1). In fact, most of the protein sequences in HDAPD are collected from PDB and are therefore associated with determined structures. Given some keywords, NCBI Entrez, EMBL, and UniProt provide a protein list with PDB id, classification, and taxonomy information attached to each protein. HDAPD, however, takes only keywords for diseases or disease-associated proteins when searching protein structures. The protein lists provided by HDAPD are also classified by function, and hyperlinks to PDB, GO, KEGG, and PubMed are provided when a protein on a list is clicked. Note that no GO information is provided by NCBI Entrez. However, KEGG pathway descriptions and maps are provided by all the databases compared except GHR (Table 1). Moreover, all the databases compared except GHR can provide literature search results for the protein structures searched. However, the literature found in HDAPD can be selected and sorted in alphabetical order of first author, last author, or journal name, or in chronological order of publication date.
Figure 1

The protein structures collected in HDAPD can be routinely updated through six purpose-designed PHP-MySQL templates, (a) to (f). These templates are used for entering (a) disease and protein names; (b) disease and protein classification; (c) disease and PDB code; (d) protein names, UniProt and GO id; (e) protein name, UniProt id, GO id, gene name, NCBI id, and KEGG id; and (f) protein name, protein description, source, authors, resolution, and method.

Table 1

A comparison of the database contents and search functions provided by HDAPD with those provided by NCBI Entrez, EMBL, UniProt, and GHR

Databases | HDAPD | NCBI Entrez | EMBL | UniProt | GHR

Web site | http://140.114.100.145/index.html | http://www.ncbi.nlm.nih.gov/sites/gquery | http://www.ebi.ac.uk/embl/ | http://www.uniprot.org/ | http://ghr.nlm.nih.gov/
FunctionsDisease typesICD-10; Genes and Disease (classified diseases into 14 groups; 285 diseases)(classified diseases into 17 groups)

Disease list

Disease introductionGene Review; Genetics Home Reference

Disease introduction

Disease-associated proteinsICD-10; Genes and Disease; OMIM

Disease-associated protein list***

*A protein list is provided by typing in keywords.

Determine-d and annotated protein sequencesSwissProt, PIR, PRF, PDB,PRIDEUniProtKBUniRefUniParcUniProtKBUniRefUniParc

Sequence database

Determine-d protein structure3-D macromolecular structuresPDB(3,189 all are diseases-associated)MMDB(over 20000 structures, but not disease-associated)PDBePDBPDBe

PDB ID

Compound

Classification

Source

Resolution

Method

Author List

Accession Date

Protein sequence(Primary)

Molecular viewer

◎: A hyperlink to PDB is provided for a search result.

Gene ontologyGOTaxonomyBioCatalogueGOSBOTaxonomyGOTaxonomy

Gene Ontology (GO)

◎: A hyperlink to PubMed of NCBI is provided for a search result.

PathwayPathway and systems of interacting moleculesKEGGKEGGReactomeBioModelsReactomeRheaUniProtKB/Swiss-Prot

Pathway description

Pathway map

LiteratureFull text and journal articlesPubMedPubMedMedline PatentsPubMed SRS CiteXplorePubMed
Literature extracting

Author

Journal

Relative date

Sorting

Date

◎: A hyperlink to PubMed of NCBI is provided for a search result.

PRIDE: Proteomics identification database, UniProtKB: UniProt knowledge base of protein sequences, UniRef: UniProt non-redundant reference databases, UniParc: Non-redundant archive of protein sequences, PDB: Protein Data Bank, MMDB: The Molecular modeling database, PDBe: Macromolecular structures database, GO: Gene ontology, Taxonomy: NCBI Taxonomy database of organism names, BioCatalogue: BioCatalogue, SBO: Systems biology ontology, KEGG: Kyoto encyclopedia of genes and genomes, Reactome: Database of core biochemical pathways and reactions, BioModels: Database of mathematical models of biological interest, Rhea: Manually annotated database of chemical reactions created in collaboration with the Swiss Institute of Bioinformatics (SIB), PubMed: PubMed of NCBI, Medline: Citations and abstracts from many life-science journals, Patents: Biology-related abstracts of patent applications.

The search results for two diseases and one disease-associated protein, namely Lung cancer, Diabetes, and Tumor protein 53, returned by HDAPD and by the four databases compared are shown in Table 2. The table brings out the major difference between HDAPD and the other four databases: all the disease-associated proteins collected in HDAPD are classified into disease groups, while no such classification is given in the others. Therefore, the structure of a disease-associated protein can be searched more conveniently through HDAPD, though more information on sequence, taxonomy, and genome is provided by the other databases (Table 2). Apart from UniProt, HDAPD finds more structures of Lung cancer-associated proteins and of Tumor protein 53 than NCBI Entrez and EMBL do (Table 2). However, UniProt finds more structures of Diabetes-associated proteins than HDAPD (Table 2). Apart from HDAPD and UniProt, GO-related information is either indirectly provided (NCBI Entrez) or unavailable (EMBL and GHR) (Table 2). In general, UniProt provides much more GO information than HDAPD, since the latter focuses only on protein structures while the former covers all relevant protein sequences. The same holds for the number of literature records found by HDAPD compared with those found by NCBI Entrez and EMBL (Table 2). However, HDAPD provides more KEGG pathway information for both Lung cancer and Diabetes than all the other databases compared (Table 2).
Table 2

Comparison of the search results returned by HDAPD, NCBI Entrez, EMBL, UniProt, and GHR for two diseases and one disease-associated protein, namely Lung cancer, Diabetes, and Tumor protein 53.

Description | HDAPD | NCBI Entrez | EMBL | UniProt | GHR
Disease & proteinLung cancer293699

tumor protein 5311

diabetes161705

Protein sequencesLung cancer-203141524765

tumor protein 53-6793532

diabetes-19353751991

Protein structureLung cancer215517332360

tumor protein 536112279

diabetes9062545471082

GO: biological processLung cancer1872none0625

tumor protein 5348none031

diabetes1127none1902

GO: cellular componentLung cancer2017none0668

tumor protein 5311none031

diabetes1080none0895

GO: molecular functionLung cancer1953none0662

tumor protein 5312none027

diabetes1096none0862

Number of KEGG paths involvedLung cancer89356126

tumor protein 532012802

diabetes354652160

literatureLung cancer6512481730481056352876

tumor protein 533831157392818251

diabetes1363533934332543511155

* The GO information in NCBI Entrez is indirectly provided through a hyperlink to PDB.

Conclusions

In this work, we present HDAPD, a comprehensive web tool for searching a variety of important information such as name, sequence, structure, ontological data, metabolic pathway data, and relevant literature for the disease-associated protein structures collected in the database. Currently, the database holds 454 diseases and 11657 disease-associated protein structures. Moreover, the numbers of proteins collected by HDAPD with available GO ontological ids and KEGG metabolic pathway ids are 1086 and 1079, respectively. These numbers will gradually increase to reflect new developments in the related fields, since the HDAPD data content can be routinely updated through a purpose-built PHP-MySQL web page. The database is useful for finding the structures of disease-associated proteins that may play important roles in disease development, in support of structure-based drug design against those diseases.

Availability and Requirements

Project name: building a web tool for searching some disease-related protein structures
Project home page: http://www.life.nthu.edu.tw/~lslth; HDAPD can be freely accessed at http://140.114.100.145/index.html
Operating system: Windows XP
Programming language: Perl, PHP, and MySQL
License: None
Any restriction to use by academics: None

Authors' contributions

YRL wrote most of the source code for HDAPD, HYW and TLT conducted the testing, and THL wrote the manuscript. All the authors read and approved the final manuscript.
  8 in total

1.  Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors:  M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal:  Nat Genet       Date:  2000-05       Impact factor: 38.330

Review 2.  Applying the Gene Ontology in microbial annotation.

Authors:  Michelle G Giglio; Candace W Collmer; Jane Lomax; Amelia Ireland
Journal:  Trends Microbiol       Date:  2009-07-02       Impact factor: 17.079

3.  Proteomics FASTA archive and reference resource.

Authors:  Jayson A Falkner; James A Hill; Philip C Andrews
Journal:  Proteomics       Date:  2008-05       Impact factor: 3.984

4.  GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support.

Authors:  M Rebhan; V Chalifa-Caspi; J Prilusky; D Lancet
Journal:  Bioinformatics       Date:  1998       Impact factor: 6.937

5.  The Rat Genome Database, update 2007--easing the path from disease to data and back again.

Authors:  Simon N Twigger; Mary Shimoyama; Susan Bromberg; Anne E Kwitek; Howard J Jacob
Journal:  Nucleic Acids Res       Date:  2006-12-06       Impact factor: 16.971

6.  The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data.

Authors:  Helen Berman; Kim Henrick; Haruki Nakamura; John L Markley
Journal:  Nucleic Acids Res       Date:  2006-11-16       Impact factor: 16.971

7.  McKusick's Online Mendelian Inheritance in Man (OMIM).

Authors:  Joanna Amberger; Carol A Bocchini; Alan F Scott; Ada Hamosh
Journal:  Nucleic Acids Res       Date:  2008-10-08       Impact factor: 16.971

8.  KEGG for linking genomes to life and the environment.

Authors:  Minoru Kanehisa; Michihiro Araki; Susumu Goto; Masahiro Hattori; Mika Hirakawa; Masumi Itoh; Toshiaki Katayama; Shuichi Kawashima; Shujiro Okuda; Toshiaki Tokimatsu; Yoshihiro Yamanishi
Journal:  Nucleic Acids Res       Date:  2007-12-12       Impact factor: 16.971
