
VaProS: a database-integration approach for protein/genome information retrieval.

Takashi Gojobori1,2, Kazuho Ikeo2, Yukie Katayama3, Takeshi Kawabata4, Akira R Kinjo4, Kengo Kinoshita5,6, Yeondae Kwon3, Ohsuke Migita7,8, Hisashi Mizutani2, Masafumi Muraoka2, Koji Nagata3, Satoshi Omori5, Hideaki Sugawara2, Daichi Yamada9, Kei Yura10,11.   

Abstract

Life science research now relies heavily on a wide range of databases covering genome sequences, transcription, protein three-dimensional (3D) structures, protein-protein interactions, phenotypes and so forth. The knowledge accumulated by omics research is so vast that a computer-aided search of data is now a prerequisite for starting a new study. In addition, a combinatorial search across these databases can yield new ideas and hypotheses that can be examined by wet-lab experiments. By virtually integrating the related databases on the Internet, we have built a new web application that helps life science researchers retrieve the expert knowledge stored in the databases and build new hypotheses about their research targets. This web application, named VaProS, emphasizes the interconnection between the functional information of genome sequences and protein 3D structures, such as the structural effect of a gene mutation. In this manuscript, we present the concept of VaProS, the databases and tools that can be accessed without any knowledge of database locations and data formats, and the power of the search, exemplified by an investigation of the molecular mechanisms of lysosomal storage disease. VaProS can be freely accessed at http://p4d-info.nig.ac.jp/vapros/ .

Keywords:  Big data analysis; Database integration; Lysosomal storage disease; Protein 3D structure

Year:  2016        PMID: 28012137      PMCID: PMC5274651          DOI: 10.1007/s10969-016-9211-3

Source DB:  PubMed          Journal:  J Struct Funct Genomics        ISSN: 1345-711X


Introduction

Advances in molecular biology have yielded a huge amount of biological data, including DNA/RNA/protein sequences [1-3], their expression levels [4], differences in sequence among individuals [5], three-dimensional (3D) structures of biomolecules [6], phenotypes of organisms [7] and so forth. These data are stored in independent databases on the Internet, and researchers exploit these databases to gain new knowledge about the targets of their studies. Database mining facilitates knowledge acquisition and the building of new hypotheses for planning new experiments [8]. The growth in data size has been met by increases in storage capacity and by the invention of new algorithms that search the whole data swiftly. A famous example of a tool for quick database search in this field is BLAST [9], which finds similar sequences in huge nucleotide and amino acid sequence databases. Further growth of the individual databases and the increasing variety of databases may have enhanced the chances of performing novel experiments by extending the scope of hypotheses, yet the lack of technology for integrating different types of databases, and of an application for searching multiple databases, has precluded extensive use of this approach. Researchers aiming for an integrated search of different databases must approach the databases one by one, learn how to use each of them and obtain the information relevant to their studies. They then have to integrate the data obtained from the different databases by themselves. This process evidently requires tedious labour as well as skills for manipulating data in different formats. Hence, the biggest hurdle to overcome in current life science research is the complexity of integrating databases in a way that enables us to come up with novel ideas and hypotheses.
Once up-to-date data are comprehensively integrated, researchers with experience in a specific field can start deducing hypotheses in a data-driven manner. Efforts to integrate the management of different databases have been made by a number of groups [10-14]. Linking data within a common framework is one possible approach, and Semantic Web technologies have become increasingly popular in recent years [15]. While Semantic Web technologies based on linked open data and ontologies are a promising approach, an extremely diverse set of ontologies, as well as non-uniform use of URIs (Uniform Resource Identifiers) by different parties to describe identical resources, makes it difficult to integrate various information resources without extensive manual intervention. Although some efforts have been dedicated to solving these difficulties (e.g., http://identifiers.org), it will take time for the research community to agree on a unified convention. To overcome these difficulties in searching multiple life science databases, we started developing a new type of application that searches databases in different locations simultaneously with a simple search query and displays the results in a simple interface at http://p4d-info.nig.ac.jp/vapros/. We named the application VaProS, VAriation effect of PROtein Structure and function. The name derives from the aim of the application, namely to focus on analysing the effects of DNA sequence variations on protein structure and function. VaProS aims to realize the idea of a “data cloud”, that is, to retrieve data without any knowledge of the databases scattered across the Internet. What distinguishes VaProS from other general database integration efforts is the importance it places on the relationships among biological molecules and phenomena. These relationships are governed by the central dogma; hence all events can be described in either a gene-centered or a protein-centered manner.
Phenotypic changes of an organism likely derive from changes in the biological system of the organism, which is sustained by a network of biomolecules, and those biomolecules are ultimately encoded in DNA. This flow of information runs in the opposite direction to the central dogma, and hence the organization of data and databases in VaProS follows the information flow of the central dogma. Technically, the search results from the various databases are interconnected using the protein as a hub of information. In the following sections, the details of VaProS and an example of its usage are presented.

Materials and methods

Databases and tools on the internet

Table 1 lists the databases that are integrated in VaProS and the tools that visualize the search results. VaProS is made up of 16 databases and 15 tools. The latest information on the databases, namely their versions and sizes, is given at http://p4d-info.nig.ac.jp/vapros/statistics.html. The databases were integrated either through dynamic links or by copying data from the original site to the local VaProS site. Ideally, all the databases should be accessed dynamically to avoid time lags in the data and to save local disk space, but such dynamic access often sacrifices a prompt response to a query. Therefore, we downloaded part of the data from each database to achieve an optimum response speed. Data updates are scheduled once every six months to keep abreast of the latest data in all the databases. VaProS deals with data on humans, rats and mice and focuses on phenomena related to humans.
Table 1

Components of VaProS

DB/tool name | Data resource | Search tool | Data/function used in VaProS | Method of access | Original location | Reference
EntrezGene | | | Nomenclature, reference and other biological information of genes | Copy and link | http://www.ncbi.nlm.nih.gov/gene/ | [16]
UniProtKB | | | Amino acid sequences with biological annotation such as ontology and classification | Copy and link | http://www.uniprot.org/ | [17]
BioGRID | | | Genetic and protein interactions with curation based on biomedical literature | Copy and link | http://thebiogrid.org/ | [18]
ChEMBL | | | Drug-like small molecules with interacting proteins | Copy and link | https://www.ebi.ac.uk/chembl/ | [19]
DrugBank | | | Drug molecules combined with drug target information | Copy and link | http://www.drugbank.ca/ | [20]
IntAct | | | Molecular interactions obtained from literature and direct submission | Copy and link | http://www.ebi.ac.uk/intact/ | [21]
PID (NDEx) | | | Biological interaction data of proteins | Copy and link | http://www.ndexbio.org/#/ | [22]
Reactome | | | Biological pathway data | Copy and link | http://www.reactome.org/ | [23]
OMIM | | | Mendelian disease related phenotype and its causative gene | Link | http://www.omim.org/ | [7]
hGtoP | | | 3D structural and comparative genomics annotations of humans, mice and rats | Link | http://p4d-info.nig.ac.jp/hGTOP/ | [24]
Natural Ligand Database | | | 3D models of proteins and their natural ligands registered in KEGG reaction database | Link | http://nldb.hgc.jp/nldb/ | [25]
COXPRESdb | | | Relationship of gene expression based on RNAseq and microarray data | Link | http://coxpresdb.jp/ | [26]
Mutation@A Glance | | | Genetic variants on proteins including disease-causing mutations observed in humans | Link | http://harrier.nagahama-i-bio.ac.jp/mutation/ | [27]
3D Interaction | | | Models of protein 3D structure and the structure in complex with other molecules | Link | http://homcos.pdbj.org/ | [28]
Autophagy DB | | | List of genes and proteins for autophagy | Built-in | http://www.tanpaku.org/autophagy/ | [29]
GNP expression | | | Genes clustered by expression pattern showing co-regulation and anti-regulation | Built-in | http://genomenetwork.nig.ac.jp/ |
Molecular Interactions | | | Graphic tool for interaction networks of proteins, compounds and phenotypes | Built-in | |
TagCloud | | | Graphic tool to display frequency of words in the titles of papers registered in UniProt | Built-in | |
Pathway DB | | | Finder of the related pathways from the databases in use | Built-in | |
Phenotype | | | Finder of Mendelian disease related to the protein/gene in query | Built-in | |
Cis-finder | | | Finder of the cis element candidate motifs in DNA sequence | Built-in | |
S-VAR | | | Evaluator of the impact of missense mutation in a protein | Link | http://p4d-info.nig.ac.jp/s-var/ |
Genome explorer | | | Annotator of genes with transcription start sites and other biological function | Built-in | http://genomenetwork.nig.ac.jp/ |
NOREN | | | ID connector from UniProt AC to all the other IDs of the databases in use | Built-in | http://cib.cf.ocha.ac.jp/DC/ |

Data integration and data presentation in VaProS

VaProS is unique in its style of data integration. VaProS tries to integrate different databases dynamically, and the data in the databases are related to one another through the UniProt accession key. The central dogma guarantees the relationship between the biomolecules in an organism; hence all phenotypes should basically stem from perturbations of biomolecules. Therefore, phenomena observed in an organism can be tagged to either DNA or protein. We chose a protein identifier to tag all the other data, because VaProS is aimed at the analysis of protein variation. UniProtKB [2], GeneCards [12] and COSMIC [13] take a similar approach to the integration of relevant data. VaProS emphasizes graphical presentation of the search results, as found in “Molecular Interactions” and “TagCloud”, and analyses of protein 3D structures, as found in “hGtoP”, “3D Interaction” and “Natural Ligand Database”.
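The hub idea described above can be illustrated with a minimal Python sketch. It is not the VaProS implementation: the function name `integrate`, the three source names and the record layout are assumptions made for illustration, and UniProt entry names from Table 2 stand in for accession keys. The values (gene symbols, disease names, PDB IDs) are taken from Table 2.

```python
from collections import defaultdict

def integrate(sources):
    """Merge per-database records into one combined view per protein key.

    `sources` maps a (hypothetical) database name to a dict that is
    keyed by a protein identifier, which acts as the hub of the join.
    """
    hub = defaultdict(dict)
    for db_name, records in sources.items():
        for protein_key, payload in records.items():
            hub[protein_key][db_name] = payload
    return dict(hub)

# Illustrative data only; entry names are used in place of accessions.
sources = {
    "gene": {"HEXA_HUMAN": "HEXA", "HEXB_HUMAN": "HEXB"},
    "phenotype": {"HEXA_HUMAN": "Tay-Sachs disease",
                  "HEXB_HUMAN": "Sandhoff disease"},
    "structure": {"HEXA_HUMAN": "2GJX", "HEXB_HUMAN": "5BRO"},
}

view = integrate(sources)
print(view["HEXA_HUMAN"])
```

Keying every source by the same protein identifier means that a new database can be added without touching the others, which is one plausible reason to prefer a single hub over pairwise links.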

Search method

VaProS accepts keywords, DNA/protein sequences, EntrezGene IDs and UniProtKB accessions as queries (Fig. 1). A keyword can be a gene name, a protein name, a ligand (drug) name, a disease (phenotype) name or an identifier found in the databases. Keyword input is assisted by a keyword-suggestion function: given incomplete input, VaProS finds related keywords in the keyword database and shows a list of candidate words below the query input window. Once the Search button is pressed, VaProS passes the input data to NOREN, an original tool that searches for all the IDs in the databases related to the query. NOREN is based on the ID mapping table provided by UniProtKB [2] and on BLAST [9]. The result of the query is presented to the user as a list of candidates. The candidates are categorized into three types, namely Gene/Protein, Ligand and Phenotype (Fig. 2). The user may select the most relevant element in the list, press the “Details (Go)” button and obtain the results of the search performed with the IDs relevant to the keywords (Fig. 3). The results are presented through the tools listed in Table 1. The search results shown by each tool can be opened by clicking the corresponding icon on the left in Fig. 3.
Fig. 1

The top page of VaProS located at http://p4d-info.nig.ac.jp/vapros

Fig. 2

The initial search result by VaProS. The query word is “HEXA”, the causative gene of Tay-Sachs disease

Fig. 3

The search result in detail by pressing the “Details (Go)” button in Fig. 2. The protein–protein interactions and frequently used terms in literature related to HEXA are displayed
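The keyword suggestion and the three-way grouping of candidates described in the search method can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how VaProS matches incomplete input, so simple case-insensitive prefix matching is assumed, and the names `suggest`, `group_candidates` and the small keyword list are hypothetical.

```python
def suggest(prefix, keywords, limit=5):
    """Return up to `limit` keywords starting with the typed prefix.

    Prefix matching is an assumption; the matching rule used by
    VaProS is not described in the text.
    """
    p = prefix.lower()
    return sorted(k for k in keywords if k.lower().startswith(p))[:limit]

def group_candidates(hits):
    """Group (name, category) pairs into the three candidate types."""
    groups = {"Gene/Protein": [], "Ligand": [], "Phenotype": []}
    for name, category in hits:
        groups.setdefault(category, []).append(name)
    return groups

# Hypothetical keyword database and search hits.
keywords = ["HEXA", "HEXB", "hexosaminidase", "GLB1", "gangliosidosis"]
hits = [("HEXA", "Gene/Protein"), ("Tay-Sachs disease", "Phenotype")]

suggestions = suggest("hex", keywords)
grouped = group_candidates(hits)
print(suggestions)
print(grouped)
```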


Results and discussion

We explored the current knowledge on lysosomal storage diseases (LSDs) and built a tenable hypothesis as a case study to show the usage of VaProS. Similar analyses can be conducted on other diseases by accessing http://p4d-info.nig.ac.jp/vapros.

Lysosomal storage disease

Lysosomes are subcellular organelles responsible for the physiological turnover of cell constituents. They contain catabolic enzymes that require a low pH environment for optimal function. LSDs are a heterogeneous group of more than 50 rare inherited disorders characterized by the accumulation of undigested or partially digested macromolecules (Table 2). LSDs ultimately result in cellular dysfunction and clinical abnormalities. LSDs are caused by deficiencies or defects in lysosomal enzymes, in proteins necessary for the normal post-translational modification of lysosomal enzymes, in the activator proteins of lysosomal enzymes, and in proteins important for proper intracellular trafficking between the lysosome and other intracellular compartments. The individual diseases are rare, but LSDs as a group affect many people around the world, with a frequency of about one in every 7000–8000 live births [30, 31].
Table 2

Lysosomal storage diseases

Disease | Type | Gene | UniProt ID | PDB | PDB identity*
Mucopolysaccharidosis | IH (Hurler syndrome) | | | |
 | IH-S (Hurler-Scheie syndrome) | IDUA | IDUA_HUMAN | 3W81 | 100%
 | IS (Hurler, Hurler/Scheie, Scheie syndrome) | | | |
 | II (Hunter syndrome) | IDS | IDS_HUMAN | 4UG4 | 36%
 | III-A (Sanfilippo syndrome) | SGSH | SPHM_HUMAN | 4MIV | 100%
 | III-B | NAGLU | ANAG_HUMAN | 4XWH | 100%
 | III-C | HGSNAT | HGNAT_HUMAN | |
 | III-D | GNS | GNS_HUMAN | 4UG4 | 30%
 | IV-A (Morquio syndrome) | GALNS | GALNS_HUMAN | 4FDI | 100%
 | IV-B | GLB1 | BGAL_HUMAN | 3WF2 | 100%
 | VI (Maroteaux-Lamy syndrome) | ARSB | ARSB_HUMAN | 1FSU | 100%
 | VII (Sly syndrome) | GUSB | BGLR_HUMAN | 1BHG | 100%
 | IX (Hyaluronidase deficiency) | HYAL1 | HYAL1_HUMAN | 2PE4 | 99%
Niemann-Pick disease | A | SMPD1 | ASM_HUMAN | 5FC5 | 35%
 | B | | | |
 | C1 | NPC1 | NPC1_HUMAN | 3JD8 | 100%
 | C2 | NPC2 | NPC2_HUMAN | 2HKA | 80%
GM1 gangliosidosis | I | GLB1 | BGAL_HUMAN | 3WF2 | 100%
 | II | GLB1 | BGAL_HUMAN | 3WF2 | 100%
 | III | GLB1 | BGAL_HUMAN | 3WF2 | 100%
GM2 gangliosidosis | Tay-Sachs disease | HEXA | HEXA_HUMAN | 2GJX | 99%
 | Sandhoff’s disease | HEXB | HEXB_HUMAN | 5BRO | 98%
 | AB variant | GM2A | SAP3_HUMAN | 1PUB | 100%
Sulfatide lipidosis | Metachromatic leukodystrophy | ARSA | ARSA_HUMAN | 1N2L | 100%
 | | ARSA | ARSA_HUMAN | 1N2L | 100%
 | Multiple sulfatase deficiency | ARSB | ARSB_HUMAN | 1FSU | 100%
 | | SUMF1 | SUMF1_HUMAN | 1Y1H | 100%
Saposin deficiency | Prosaposin deficiency | | | 4V2O | 100% (fragments)
 | Krabbe disease, atypical | | | 3BQQ |
 | Saposin B deficiency | PSAP | SAP_HUMAN | 2DOB |
 | Gaucher disease, atypical | | | 1SN6 |
Glycogenosis | II (Pompe disease) | GAA | LYAG_HUMAN | 2QLY | 47%
Gaucher disease | Gaucher disease | GBA | GLCM_HUMAN | 2WKL | 100%
Fabry disease | Fabry disease | GLA | AGAL_HUMAN | 3LXB | 99%
Ceramidosis | Farber’s disease | ASAH1 | ASAH1_HUMAN | |
Krabbe disease | Krabbe disease | GALC | GALC_HUMAN | 4UFH | 84%
Cholesterol ester storage disease | Cholesterol ester storage disease | LIPA | LICH_HUMAN | 1K8Q | 60%
Wolman disease | Wolman disease | | | |
Glycoprotein disorder | Alpha-fucosidosis | FUCA1 | FUCO_HUMAN | 2ZXA | 39%
 | Alpha-mannosidosis | MAN2B1 | MA2B1_HUMAN | 1O7D | 83%
 | Beta-mannosidosis | MANBA | MANBA_HUMAN | 2VR4 | 31%
 | Aspartylglycosaminuria | AGA | ASPG_HUMAN | 1APZ | 99%
 | Galactosialidosis | CTSA | PPGB_HUMAN | 1IVY | 99%
 | Mucolipidosis I | NEU1 | NEUR1_HUMAN | 1EUS | 37%
 | Mucolipidosis II | | | |
 | Mucolipidosis III | GNPTAB | GNPTA_HUMAN | 2N6D | 99% (fragment)
 | Schindler’s disease | NAGA | NAGAB_HUMAN | 4DO4 | 99%
Membrane metabolism disorder | Cystinosis | CTNS | CTNS_HUMAN | |
 | Sialic acid storage disease (Salla disease) | SLC17A5 | S17A5_HUMAN | |
 | Cathepsin K deficiency disease (pycnodysostosis) | CTSK | CATK_HUMAN | 7PCK | 100%
 | Cobalamin F disease (cblF) | LMBRD1 | LMBD1_HUMAN | |
 | Danon disease | LAMP2 | LAMP2_HUMAN | 2MOM | 100% (fragment)
Neuronal Ceroid Lipofuscinosis | Neuronal ceroid lipofuscinosis-1 | PPT1 | PPT1_HUMAN | 3GRO | 100%
 | Neuronal ceroid lipofuscinosis-2 | TPP1 | TPP1_HUMAN | 3EDY | 100%
 | Neuronal ceroid lipofuscinosis-3 | CLN3 | CLN3_HUMAN | |
 | Neuronal ceroid lipofuscinosis-4A | CLN6 | CLN6_HUMAN | |
 | Neuronal ceroid lipofuscinosis-4B | DNAJC5 | DNJC5_HUMAN | 2CTW | 100% (fragment)
 | Neuronal ceroid lipofuscinosis-5 | CLN5 | CLN5_HUMAN | |
 | Neuronal ceroid lipofuscinosis-6 | CLN6 | CLN6_HUMAN | |
 | Neuronal ceroid lipofuscinosis-7 | MFSD8 | MFSD8_HUMAN | |
 | Neuronal ceroid lipofuscinosis-8 | CLN8 | CLN8_HUMAN | |
 | Neuronal ceroid lipofuscinosis-10 | CTSD | CATD_HUMAN | 2PSG | 49%
 | Neuronal ceroid lipofuscinosis-11 | GRN | GRN_HUMAN | 2JYE | 100% (fragment)
 | Neuronal ceroid lipofuscinosis-12 | ATP13A2 | AT132_HUMAN | 3WGV | 27%
 | Neuronal ceroid lipofuscinosis-13 | CTSF | CATF_HUMAN | 1M6D | 99%
 | Neuronal ceroid lipofuscinosis-14 | KCTD7 | KCTD7_HUMAN | 4UES | 50% (fragment)
Congenital disorder of glycosylation | IA | PMM2 | PMM2_HUMAN | 2AMY | 100%

*Amino acid sequence identity between the UniProt and PDB entries

Search in the first step

A search by a keyword “gangliosidosis”, one of the major groups in LSDs, resulted in six candidates as shown in Fig. 4. As shown in Table 2, gangliosidoses are classified into two types, GM1 and GM2, both of which are further classified into three subtypes. The estimated incidence of GM1-gangliosidosis is 1 per 100,000 to 200,000 births, and those of GM2-gangliosidosis are 1 per 360,000 births for Tay-Sachs disease and 1 per 310,000 or 1,000,000 births for Sandhoff disease. GM2-gangliosidosis AB variant is extremely rare [32]. Each line in the search result happened to correspond to an individual entry of genetic diseases/disorders in OMIM database [7]. The three types in GM1-gangliosidoses (types I, II, and III) were related to the same “Molecule Symbol”, namely GLB1 gene, but Tay-Sachs disease, Sandhoff disease and AB variant in GM2-gangliosidoses were related to HEXA, HEXB and GM2A genes, respectively. Each gene was linked to the databases listed in Table 1. Ticking the far left box in Fig. 4 and pressing “Details (Go)” button on the top led the user to the further detail of the selected item. In the following section, the search result of each tool listed in Table 1 is explained.
Fig. 4

Initial search result by VaProS. The search by “gangliosidosis” initially results in a table of candidates


Molecular interactions

The interaction network of the proteins encoded by HEXA and HEXB was found in the “Molecular Interactions” window (Fig. 5). This window can be displayed by ticking both HEXA and HEXB in the table shown in Fig. 4 and pressing the “Details (Go)” button. A protein is represented by a big node and a ligand by a small node. A protein–protein/ligand interaction is represented by an edge. Figure 5 shows that eight proteins and four ligands interact with both HEXA and HEXB, and that each protein interacts with a number of other proteins and ligands. These interactions were extracted from the different databases listed in Table 1. In Fig. 5, the nodes in red are proteins associated with diseases. This information was extracted from OMIM (Table 1), and the catalog of specific diseases is given on the right side of the window. There are two nodes in red that interact with both HEXA and HEXB, which suggests disease–disease interactions. Right-clicking a node extends its protein–protein interactions. A path between two nodes in the window can be detected automatically using “Path Search” in the top menu.
Fig. 5

“Molecular Interactions” after selecting HEXA and HEXB in the initial search result (Fig. 4). A big node represents a protein, a small node represents a ligand and an edge represents a protein–protein/ligand interaction. A node in red is associated with a disease (selected in the top-right window)

By clicking a node or an edge, the detailed information of the node/edge can be displayed at the bottom right of the window. In Fig. 5, HEXA was selected; hence the details of HEXA were presented on the right. The link to “3D Interaction” shows the 3D structural information of the HEXA protein. In 3D Interaction, the SiteTable/SitesByVariants link leads users to the information that VaProS aims to provide, namely the relationship between variations in DNA and protein structure/function.
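Two operations mentioned in this section, finding the partners shared by HEXA and HEXB and searching for a path between two nodes, can be sketched on a toy graph. The algorithm behind “Path Search” is not described in the text, so breadth-first search is used here purely as an assumption. The node names P1, P2, P3 and L1 are hypothetical placeholders, not real interaction partners.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def common_neighbours(graph, a, b):
    """Nodes that interact with both a and b."""
    return graph.get(a, set()) & graph.get(b, set())

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):  # sorted for determinism
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy network with placeholder proteins (P*) and a ligand (L1).
edges = [("HEXA", "P1"), ("HEXB", "P1"), ("HEXA", "L1"),
         ("HEXB", "L1"), ("P1", "P2"), ("HEXB", "P3")]
graph = build_graph(edges)
shared = common_neighbours(graph, "HEXA", "HEXB")
path = shortest_path(graph, "HEXA", "P3")
print(sorted(shared), path)
```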

TagCloud

An overview of the target protein can be obtained by analyzing the frequency of words in the manuscripts related to the protein. Figure 6 is the result of such an analysis of the titles of papers registered in UniProtKB under the entry HEXA. TagCloud emphasizes words that frequently appear in the titles of these papers by enlarging their font size. Visual inspection of TagCloud shows that the HEXA protein is a beta-hexosaminidase and may be a multimeric protein. TagCloud also confirms that the protein is connected with disease. These facts are trivial for specialists in the field, but they are not for researchers in other fields, for whom they are valuable information in interdisciplinary studies. The list of papers whose titles use a word can be found by clicking the word in the TagCloud.
Fig. 6

Artistic representation of the frequency of words in the titles of the manuscripts stored in the entry of HEXA_HUMAN in UniProtKB. The visualization was realized by d3-cloud (https://github.com/jasondavies/d3-cloud)
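The counting step behind such a word cloud can be sketched in a few lines of Python. This is an illustration only and does not use d3-cloud: the stop-word list, the linear font scaling and the three example titles are all assumptions invented for the sketch, not data from UniProtKB.

```python
import re
from collections import Counter

# Small, hypothetical stop-word list.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "for", "to", "with", "on"}

def title_word_counts(titles):
    """Count words across paper titles, ignoring case and stop words."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z][a-z0-9\-]*", title.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts

def font_sizes(counts, min_px=10, max_px=40):
    """Scale font size linearly with frequency (an assumed rule)."""
    if not counts:
        return {}
    top = max(counts.values())
    return {w: min_px + (max_px - min_px) * c / top
            for w, c in counts.items()}

# Invented titles for illustration.
titles = [
    "Structure of beta-hexosaminidase A",
    "Mutations in beta-hexosaminidase and Tay-Sachs disease",
    "A disease mutation screen",
]
counts = title_word_counts(titles)
sizes = font_sizes(counts)
print(counts.most_common(2))
```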


hGtoP

hGtoP provides the relationship between genes and proteins, as its name suggests (G in hGtoP stands for Gene/Genome and P for Protein). The original GtoP was developed by Kawabata et al. [24]. VaProS includes a human-specific GtoP as one of its tools. With hGtoP, the structural information of the query protein is easily found. In addition, homologues of the protein in different species can be found. “3D Interaction” and “hGtoP” contain similar information about protein 3D structures; however, the former focuses more on 3D modelling of complex structures, and the latter focuses on comparative genomics. The PDB information in Table 2 was obtained with hGtoP and 3D Interaction. This information clarified that some of the proteins involved in LSDs do not have structural information yet. In other words, Table 2 provides valuable information for the structural genomics of LSDs, namely the target proteins whose 3D structures remain to be determined.

Natural ligand database

Natural Ligand Database (NLDB) [25] provides the model of protein structures with natural ligands. The idea stemmed from the fact that many ligands in PDB are modified ligands for the sake of crystallisation and the bridge between those modified ligands and natural ligands should be provided to enhance the 3D structural information in PDB. The search query “GM1 gangliosidosis” led the user to the causative gene GLB1. NLDB demonstrated that GLB1 was involved in 15 KEGG reactions, and these 15 reactions were classified into five pathways according to “UniProt search view”. The same 15 reactions can be found in “Pathway DB” tool (Table 1). Of the five pathways in NLDB, glycosphingolipid biosynthesis (hsa00604) contained the reaction of GM1 degradation (R06010). In this reaction, 56 natural ligand complexes were registered in NLDB derived from the proteins of various species. The link to human beta-galactosidase, the product of GLB1 gene, with galactose (PDB ID: 3THC) led to the 3D structures of the ligand–protein complex with reported variation in amino acid residues. The variations in amino acid residues around the ligand-binding site were highlighted on the table of NLDB window. In this entry, ten variations were reported around the ligand-binding site, and eight of them were related to diseases, namely three to GM1 type I, one to GM1 type II, one to GM1 type III, and two to mucopolysaccharidosis IV-B.

COXPRESdb

Gene co-expression sometimes sheds light on the relationships between genes, and COXPRESdb [26] provides a user-friendly interface to gene co-expression information in humans, mice and rats. The search query “GM2 gangliosidosis” on VaProS led the user to a list of causative genes that included HEXA and HEXB. COXPRESdb demonstrated the co-expression of HEXA (PCC = 0.43) and HEXB (PCC = 0.45), which, according to OMIM, are associated with Tay-Sachs disease and Sandhoff disease, respectively. By following the link to COXPRESdb, the user can also check the co-expression networks of HEXA and HEXB, which led to the finding that ten more lysosomal proteins are tightly co-expressed with them.
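For readers unfamiliar with the PCC values quoted above, the following sketch computes a correlation between two expression profiles, assuming that PCC denotes Pearson's correlation coefficient. The function name `pearson` and the numeric profiles are illustrative assumptions; they are not COXPRESdb data and do not reproduce the reported values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression levels of two genes across five samples.
gene_a = [2.0, 3.5, 1.0, 4.2, 3.1]
gene_b = [1.8, 3.0, 1.5, 3.9, 2.2]
r = pearson(gene_a, gene_b)
print(round(r, 2))
```

A value near 1 indicates that the two genes rise and fall together across samples, a value near 0 indicates no linear relationship, and a value near -1 indicates opposite behaviour.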

Other tools

GNP Expression, Phenotype, Autophagy DB, Genome Explorer and Mutation@A Glance (Table 1) show the search results in tabulated or graphical form. The user can further analyze the database behind each tool by following the link in the search result. S-VAR is a special tool that evaluates the impact of an amino acid substitution on the function of the protein. When a specific mutation is entered in the S-VAR window, the tool runs several programs that evaluate the impact of the mutation [33-36] and provides the individual results and a consensus result to the user at intervals.

Building hypothesis on the relationship between phenotype and protein 3D structure

Combining the search results from the individual tools can provide a basis for building a hypothesis that can be verified by wet-lab experiments. The search query “GM1 gangliosidosis” on VaProS led the user to the current knowledge that the causative gene is GLB1, which encodes the lysosomal enzyme β-d-galactosidase. By following the link to OMIM [7], the user can acquire detailed information on the disease, namely that GM1 gangliosidosis is classified into three types according to the age of onset and severity: type I (infantile), type II (juvenile) and type III (adult), as shown in Table 2. “3D Interaction” summarized the variations leading to each type of the disease on the protein 3D structure (Fig. 7). Visual inspection of the figure shows that the variations tend to be located inside the protein. Indeed, the ratio of buried residues was 95.5% for type I, 88.2% for type II and 86.7% for type III. These ratios are significantly higher than the average for the whole protein, 56.8% (Table 3). Note that the ratio of buried variation sites is highest in type I and lowest in type III. Generally, mutations at buried sites often make the protein less stable than the native one. Hence the observation suggests that the variations in type I have more impact on the stability of the protein than those in type III. 3D Interaction also provided the amino acid frequencies of homologous proteins. A mutation to an amino acid that is rare among homologues implies that this amino acid type has seldom been fixed at the site during molecular evolution. The ratios of mutations to amino acids used by no homologue were 78.4% (type I), 73.7% (type II) and 53.3% (type III). Serious phenotypes are expected from mutations to amino acid types rarely observed in homologues. The buriedness and the trends in amino acid types from type I to type III apparently correlate with the degree of severity of each type of GM1 gangliosidosis. A similar trend was discussed by Ohto et al. at the time they determined the 3D structure of the protein [37]. VaProS enables such complex hypothesis building in a short period of time.
Fig. 7

Variations of GM1-gangliosidosis mapped onto the protein 3D structure of GLB1. Variations in type I (a) and variations in type III (b). The structure of human β-galactosidase (PDB ID: 3WF2) is used for the mapping

Table 3

Summary of the mutated sites of GLB1 on protein 3D structure for GM1-gangliosidosis

GM1 type IGM1 type IIGM1 type IIIAll residues
Number of residues451715677
Buried residuea (%)95.588.286.756.8
Exposed residueb (%)4.511.813.343.2

aResidue with relative solvent accessibility less than 20%.

bResidue with relative solvent accessibility no less than 20%.
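The buried/exposed classification defined in the footnotes of Table 3 can be expressed as a short Python sketch. The 20% cutoff comes from the footnotes; the function names and the example accessibility values are hypothetical and are not the GLB1 data.

```python
BURIED_CUTOFF = 20.0  # percent relative solvent accessibility (Table 3 footnotes)

def is_buried(rsa_percent):
    """A residue is buried if its relative solvent accessibility is below 20%."""
    return rsa_percent < BURIED_CUTOFF

def buried_ratio(rsa_values):
    """Percentage of residues classified as buried."""
    if not rsa_values:
        raise ValueError("no residues given")
    buried = sum(1 for r in rsa_values if is_buried(r))
    return 100.0 * buried / len(rsa_values)

# Hypothetical accessibilities (%) of five mutated sites.
example = [0.0, 5.2, 19.9, 20.0, 47.5]
ratio = buried_ratio(example)
print(ratio)
```

Note that a residue with exactly 20% accessibility counts as exposed, which follows the "no less than 20%" wording of footnote b.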


Conclusion

Here we launched VaProS, a new type of database integration application. VaProS enables a quick search of multiple databases, with the individual search results related to one another. This application can be used as a textbook by researchers in different fields for acquiring expert knowledge, and as a tool for building data-driven hypotheses that can be tested by wet-lab experiments [16-23].

1.  dbSNP: the NCBI database of genetic variation.

Authors:  S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  Announcing the worldwide Protein Data Bank.

Authors:  Helen Berman; Kim Henrick; Haruki Nakamura
Journal:  Nat Struct Biol       Date:  2003-12

3.  Mutation@A Glance: an integrative web application for analysing mutations from human genetic diseases.

Authors:  Atsushi Hijikata; Rajesh Raju; Shivakumar Keerthikumar; Subhashri Ramabadran; Lavanya Balakrishnan; Suresh Kumar Ramadoss; Akhilesh Pandey; Sujatha Mohan; Osamu Ohara
Journal:  DNA Res       Date:  2010-04-01       Impact factor: 4.458

4.  NDEx, the Network Data Exchange.

Authors:  Dexter Pratt; Jing Chen; David Welker; Ricardo Rivas; Rudolf Pillich; Vladimir Rynkov; Keiichiro Ono; Carol Miello; Lyndon Hicks; Sandor Szalma; Aleksandar Stojmirovic; Radu Dobrin; Michael Braxenthaler; Jan Kuentzer; Barry Demchak; Trey Ideker
Journal:  Cell Syst       Date:  2015-10-28       Impact factor: 10.304

Review 5.  GM1 gangliosidosis: review of clinical, molecular, and therapeutic aspects.

Authors:  Nicola Brunetti-Pierri; Fernando Scaglia
Journal:  Mol Genet Metab       Date:  2008-06-03       Impact factor: 4.797

6.  In-silico human genomics with GeneCards.

Authors:  Gil Stelzer; Irina Dalah; Tsippi Iny Stein; Yigeal Satanower; Naomi Rosen; Noam Nativ; Danit Oz-Levi; Tsviya Olender; Frida Belinky; Iris Bahir; Hagit Krug; Paul Perco; Bernd Mayer; Eugene Kolker; Marilyn Safran; Doron Lancet
Journal:  Hum Genomics       Date:  2011-10       Impact factor: 4.639

7.  The BioGRID interaction database: 2015 update.

Authors:  Andrew Chatr-Aryamontri; Bobby-Joe Breitkreutz; Rose Oughtred; Lorrie Boucher; Sven Heinicke; Daici Chen; Chris Stark; Ashton Breitkreutz; Nadine Kolas; Lara O'Donnell; Teresa Reguly; Julie Nixon; Lindsay Ramage; Andrew Winter; Adnane Sellam; Christie Chang; Jodi Hirschman; Chandra Theesfeld; Jennifer Rust; Michael S Livstone; Kara Dolinski; Mike Tyers
Journal:  Nucleic Acids Res       Date:  2014-11-26       Impact factor: 19.160

8.  COSMIC: exploring the world's knowledge of somatic mutations in human cancer.

Authors:  Simon A Forbes; David Beare; Prasad Gunasekaran; Kenric Leung; Nidhi Bindal; Harry Boutselakis; Minjie Ding; Sally Bamford; Charlotte Cole; Sari Ward; Chai Yin Kok; Mingming Jia; Tisham De; Jon W Teague; Michael R Stratton; Ultan McDermott; Peter J Campbell
Journal:  Nucleic Acids Res       Date:  2014-10-29       Impact factor: 16.971

9.  NLDB: a database for 3D protein-ligand interactions in enzymatic reactions.

Authors:  Yoichi Murakami; Satoshi Omori; Kengo Kinoshita
Journal:  J Struct Funct Genomics       Date:  2016-08-16

10.  The Reactome pathway Knowledgebase.

Authors:  Antonio Fabregat; Konstantinos Sidiropoulos; Phani Garapati; Marc Gillespie; Kerstin Hausmann; Robin Haw; Bijay Jassal; Steven Jupe; Florian Korninger; Sheldon McKay; Lisa Matthews; Bruce May; Marija Milacic; Karen Rothfels; Veronica Shamovsky; Marissa Webber; Joel Weiser; Mark Williams; Guanming Wu; Lincoln Stein; Henning Hermjakob; Peter D'Eustachio
Journal:  Nucleic Acids Res       Date:  2015-12-09       Impact factor: 16.971
