
Identification of potential drug targets implicated in Parkinson's disease from human genome: insights of using fused domains in hypothetical proteins as probes.

N Rathankar, K A Nirmala, Varun Khanduja, H G Nagendra.

Abstract

High-throughput genome sequencing has led to a data explosion in sequence databanks, with an imbalance in sequence-structure-function relationships, leaving a substantial fraction of proteins annotated only as hypothetical proteins. Functions of such proteins can be assigned by analyzing and characterizing the domains of which they are composed. Domains are the basic evolutionary units of proteins, and most proteins contain multiple domains. A subset of multidomain proteins contains fused (overlapping) domains, in which the sequences of two or more domains overlap. These fused domains result from gene fusion events, and their implication in diseases is well established. Hence, an attempt has been made in this paper to identify hypothetical proteins from the human genome that contain fused domains and are homologous to parkinsonian targets present in the KEGG database. This research identified 18 hypothetical proteins with domains fused with ubiquitin domains and with homology to targets in the parkinsonian pathway.

Year:  2011        PMID: 22389811      PMCID: PMC3263550          DOI: 10.5402/2011/265253

Source DB:  PubMed          Journal:  ISRN Neurol        ISSN: 2090-5505


1. Introduction

A hypothetical protein is defined as “a protein coded by a gene with no known function based on its DNA sequence” [2]. Certain regions in hypothetical proteins are highly conserved between species in both composition and sequence. Proteins with such regions are annotated as conserved hypothetical proteins, and their share ranges from 13% in E. coli and 14% in Rickettsia prowazekii to 40% in Pyrococcus abyssi and 47% in Plasmodium falciparum [3]. The human genome, too, has about 20% of its proteins classified as hypothetical [4-6]. The function of such proteins can be predicted from the arrangement of distinct domains [7] in them, since this arrangement in proteomes reflects the fundamental evolutionary differences in their genomes [8]. For proteins containing more than one domain, however, only the general function can be suggested. Predicting a protein's function from domains alone is particularly difficult when there are no clear-cut boundaries between two domains. Proteins with appreciable overlap in their domain boundaries are known as fused-domain-containing proteins or chimeric proteins. Such proteins are formed by gene duplication and combination during evolution: two or more genes that originally coded for separate proteins are joined [9]. Translation of this fusion gene results in a single polypeptide with functional properties derived from each of the original proteins [10]. Analysis of these fused domains in related genomes reveals that fused domain proteins in eukaryotic genomes correspond to single, full-length proteins in prokaryotic genomes [11]. Proteins with fused domains [12] in a genome are likely to be involved in metabolic and signaling pathways [13]. A study by Chia and Kolatkar [14] illustrates that domain fusions can be used to predict protein-protein interactions. This method has proven to be effective in predicting functional links between proteins. 
Analysis of the structures of multidomain single-chain peptides in their study revealed that domain pairs located less than 30 residues apart on a chain share a physical interface, and their interactions are conserved. Apart from their normal functions, multidomain proteins are also implicated in several diseases. The bcr-abl fusion protein is a well-known example of an oncogenic fusion protein and is considered to be the primary oncogenic driver of chronic myelogenous leukemia [15]. A study of 70 positionally cloned human genes mutated in diseases found that a significantly high proportion of these “disease genes” contained several signaling domains, including the DEATH domain, and play active roles in cell signaling [16, 17]. The Structural Classification of Proteins (SCOP) [18] classifies multidomain proteins by fold, where a protein contains two or more domains belonging to different classes. On this basis, SCOP 1.73 classifies the multidomain PDB structures into 53 folds, covering 1277 structures in total. A recent classification of multidomains in the SCOP database by Wang and Caetano-Anollés [19] broadly groups them into five categories, namely, (i) single-domain proteins, (ii) single domain in multidomain proteins, (iii) domain repeats, (iv) domain repeats in multidomains, and (v) domain pairs. Interestingly, none of these classifications addresses proteins containing fused/overlapping domains. Hence, an attempt has been made in this paper to classify the multidomain proteins from the human hypothetical protein dataset into three major classes, namely, nonrepeating and unique domains, repeat and nonoverlapping domains, and overlapping/fused domains. Further, as a case study, an in-depth analysis has been carried out to elucidate the roles of multidomain proteins involved in Parkinson's disease.

2. Materials and Methods

Characterizing protein function in a proteome is a multistep process involving selection of homologs, building multiple sequence alignments, extracting relevant domain information, and then applying this information to the proteome using machine learning methods such as Hidden Markov Models (HMMs), Support Vector Machines (SVMs), consensus sequences, and so forth, in order to assign functional annotation. Hence, multiple sequence alignments from the CDD [20] database were used as targets to build HMMs. This approach has seen success in classifying human proteins with novel functions [21]. The protocol followed is outlined below.

2.1. Step 1: Extracting the Dataset of Multidomain Proteins

In order to extract the hypothetical proteins with multidomains, domain information from the CDD was used as a resource, and HMMs were built for all 2009 domains present in the CDD using the HMMBUILD module of HMMER. These HMMs were then searched against the hypothetical proteins database using the HMMSEARCH [22] module. Only sequences with an e-value below 0.001 were considered meaningful hits, which resulted in a total of 1,777 sequences.
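
The selection step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes the HMMSEARCH results have already been parsed into (sequence ID, domain ID, e-value) tuples, and the function names, sample identifiers, and e-values are hypothetical. Only the 0.001 cut-off comes from the text.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # threshold stated in the text


def domains_per_sequence(hits, cutoff=E_VALUE_CUTOFF):
    """Group significant domain hits by sequence.

    hits: iterable of (sequence_id, domain_id, e_value) tuples, assumed
          to come from already-parsed search output.
    Returns {sequence_id: set of domain_ids with e_value < cutoff}.
    """
    grouped = defaultdict(set)
    for seq_id, domain_id, e_value in hits:
        if e_value < cutoff:
            grouped[seq_id].add(domain_id)
    return dict(grouped)


def split_single_multi(grouped):
    """Separate single-domain sequences from multidomain ones."""
    single = {s for s, d in grouped.items() if len(d) == 1}
    multi = {s for s, d in grouped.items() if len(d) > 1}
    return single, multi


# Hypothetical example data (identifiers and e-values are made up).
example_hits = [
    ("seqA", "cd00196", 1e-20),
    ("seqA", "cd01769", 3e-12),
    ("seqB", "cd00053", 5e-05),
    ("seqC", "cd00054", 0.02),  # fails the cut-off and is dropped
]
grouped = domains_per_sequence(example_hits)
single, multi = split_single_multi(grouped)
print(sorted(single), sorted(multi))
```

Applied to the real search output, the same grouping would give the single-domain and multidomain counts used in the next step.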

2.2. Step 2: Extracting Fused Domains Sequences from Multidomain Sequences

Of these 1777 protein sequences, 984 contained a single domain and 793 were multidomain sequences. A parameter known as the overlapping ratio (L), whose defining equation is not reproduced here, was calculated for all 793 multidomain sequences. Sequences with L = 0 denote nonoverlapping multidomain proteins, and those with L > 0 denote multidomain proteins with fused domains. A cut-off value of L = 0.50 was chosen to extract the more probable fused domain sequences from the multidomain sequence dataset. These calculations resulted in a total of 360 sequences with nonoverlapping domains (L = 0) and 433 sequences with overlapping or fused domains (L > 0.5). Interestingly, 20% of the domain fusions are attributable to the three domains cd00053, cd00054, and cd00079.
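
As an illustration of how such an overlap measure might be computed, the sketch below takes two domain regions on the same sequence and returns a ratio between 0 and 1. The formula is an assumption: the ratio is taken as the number of overlapping residues divided by the length of the shorter region, which may differ from the definition the authors used. The function names are hypothetical.

```python
def overlap_ratio(region_a, region_b):
    """Overlap ratio between two domain regions on one sequence.

    Regions are (start, end) pairs, 1-based and inclusive.
    ASSUMPTION: ratio = overlapping residues / length of the shorter
    region. The source text does not give its own formula.
    """
    (a_start, a_end), (b_start, b_end) = region_a, region_b
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    shorter = min(a_end - a_start + 1, b_end - b_start + 1)
    return overlap / shorter


def classify(region_a, region_b, cutoff=0.50):
    """Label a domain pair using the L = 0 and L > 0.5 rules from the text."""
    ratio = overlap_ratio(region_a, region_b)
    if ratio == 0:
        return "nonoverlapping"
    return "fused" if ratio > cutoff else "below cutoff"


print(classify((14, 82), (11, 82)))     # regions that almost coincide
print(classify((1, 100), (101, 200)))   # adjacent regions, no overlap
print(classify((1, 100), (91, 200)))    # small overlap
```

With this definition, two regions that nearly coincide score close to 1, while adjacent regions score 0.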

2.3. Step 3: Clustering Domains Based on Overlap Data

Frequencies of the fused domains in the hypothetical proteins dataset were used as input for clustering with the Cytoscape software [23]. This yielded a total of 17 clusters (Figure 1), of which the largest had a total of 36 domains resulting from 106 hypothetical sequences.
Figure 1

17 clusters in fused hypothetical proteins containing 36 domains.
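
The text does not state which clustering procedure was run inside Cytoscape. One simple way to picture the grouping is sketched below: domains are nodes, an edge joins two domains that are fused in at least one sequence, and each connected component is taken as a cluster. Treating clusters as connected components is an assumption made for illustration, and the example edges are hypothetical pairings of domain IDs that appear elsewhere in the paper.

```python
from collections import defaultdict


def cluster_domains(fused_pairs):
    """Group domains into clusters via connected components.

    fused_pairs: iterable of (domain_a, domain_b) pairs observed fused
    in at least one sequence.
    Returns a list of sets of domain IDs, largest cluster first.
    """
    neighbours = defaultdict(set)
    for a, b in fused_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)


# Hypothetical fusion edges for illustration only.
example_pairs = [
    ("cd00196", "cd01769"),
    ("cd01769", "cd01809"),
    ("cd01491", "cd01492"),
]
clusters = cluster_domains(example_pairs)
print(clusters)
```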

Sequences in the largest cluster, which contains ubiquitin, ubiquitin-like, and kinase motor domains, were associated with diseases such as Alzheimer's, Von Hippel Lindau, juvenile parkinsonism, and spinocerebellar ataxia. In a similar way, the domains in each cluster were analyzed using their functional information from the CDD, and the clusters with their functions are listed in Table 1.
Table 1

Summary of 17 clusters along with their involvement in diseases and major functions. Words in bold indicate neurodegenerative disorders and the role of ubiquitin domains in these disorders.

| Cluster no. | No. of domains | Function of domains | No. of sequences | Disease implication |
| --- | --- | --- | --- | --- |
| 1 | 36 | Ubiquitin & kinase motor domain | 106 | Alzheimer's, Von Hippel Lindau, juvenile parkinsonism, spinocerebellar ataxia |
| 2 | 11 | NTF2, Ubiquitin, CoA dehydrogenase | 11 | Fatty acid disorders |
| 3 | 20 | Myosin Motor | 135 | Familial hypertrophic cardiomyopathy, neuromuscular disorders |
| 4 | 9 | Cyclophilin | 3 | Immunosuppression, antiviral activity |
| 5 | 16 | vWFA & PH, Ubiquitin | 196 | Von Willebrand disease, thrombotic thrombocytopenic purpura (TTP) |
| 6 | 6 | PH & PTB | 70 | Cardiovascular diseases |
| 7 | 10 | tRNA synthase | 15 | Aminoacyl tRNA synthetase: Charcot-Marie-Tooth disease type 2D, Mobius syndrome, cardiac disorders |
| 8 | 13 | Ubiquitin & PLAT | 32 | Neurological disorders |
| 9 | 5 | RNA binding | 1 | Myxoid liposarcoma, SARS |
| 10 | 5 | SIR 2 | 3 | Neurological disorders |
| 11 | 5 | HMG box | 10 | Alzheimer's disease, heart diseases, diabetes and cancer |
| 12 | 4 | PI3K | 2 | Cancer, diabetes and respiratory |
| 13 | 4 | Sm & Sm-like | 8 | Inflammatory bowel disease, Salla disease, diabetes |
| 14 | 3 | Methyl-CpG and PH | 3 | Cancer, Down syndrome-like Alzheimer's, Wiskott-Aldrich syndrome, B-cell chronic lymphocytic leukemia |
| 15 | 3 | EVH1 | 1 | Wiskott-Aldrich syndrome |
| 16 | 3 | CGH & Ntn | 1 | Lysosomal storage disease and progressive neurodegeneration, Farber's disease, and Alzheimer's disease |
| 17 | 3 | Nidogen, thyroglobulin type 1 | 6 | Human gastrointestinal cancer, cancer, acute leukemia, heart diseases |

Table 1 indicates clearly that ubiquitin-like domains are involved in neurological disorders. Hence, clusters with fused ubiquitin domains were considered for further analysis, as they could become potential drug targets for a variety of neurodegenerative disorders [24]. Based on this criterion, sequences from clusters 1, 2, 5, and 8 were selected. To investigate the role of multidomains in neurodegenerative diseases, fused ubiquitin domains related to Parkinson's disease were considered. Parkinson's disease (PD) is a progressive disorder of the central nervous system affecting approximately one million people in the United States alone, where 50,000 new cases are reported annually [1]. Clinically, the disease is characterized by a decrease in spontaneous movements, gait difficulty, postural instability, rigidity, and tremor [25]. At the molecular level, the genes that have been suggested to cause hereditary parkinsonism, and the chromosomal loci associated with parkinsonism in other families, are listed in Table 2. From Table 2, it is clear that ubiquitin/ubiquitin-like domains play a dominant role in the onset of Parkinson's disease. Hence, fused ubiquitin domain sequences from clusters 1, 2, 5, and 8 were considered for a detailed investigation to ascertain their roles as well.
Table 2

A description of genes, domains, and type of inheritance for Parkinson's disease (source: Nirit Lev and Melamed [1]).

| Gene/locus/assignment | Domains present | Inheritance | Age of onset |
| --- | --- | --- | --- |
| α-Synuclein/SNCA/PARK1 & 4 | Synuclein/ubiquitin-like | Autosomal dominant/susceptibility | Early/late |
| Parkin/PRKN/PARK2 | Parkin and ubiquitin | Autosomal recessive/possible susceptibility | Juvenile/early |
| Ubiquitin C-terminal hydrolase/UCH-L1/PARK5 | Peptidase, ubiquitin | Autosomal dominant/susceptibility | Late |
| DJ-1/DJ-1/PARK7 | GATase/ubiquitin-like | Autosomal recessive | Early |

2.4. Step 4: Arrival of a Target Dataset for Parkinson's Disease

In order to characterize these hypothetical proteins with fused domains, 18 sequences belonging to the Parkinson's disease pathway were extracted from the KEGG database [26] (Figure 2) and queried against the CDD [20]. Four of the eighteen sequences (UB, UBA1, PARK2, and PINK1) were observed to have fused ubiquitin domains. These sequences are highlighted in Figure 2, which shows the KEGG Parkinson's disease pathway.
Figure 2

Parkinson's disease pathway from the KEGG disease database. Proteins encircled with red color are the ones having fused domains with ubiquitin domains (source: the KEGG disease pathway database).

2.5. Step 5: Extracting Relevant Homologues for Parkinson Diseased Targets

These four Parkinson's disease sequences (UB, UBA1, PARK2, and PINK1) were searched against the sequences in the four clusters (i.e., clusters 1, 2, 5, and 8). A cut-off e-value of 1e-04 was used as a filter to arrive at relevant hypothetical protein homologues. This search resulted in a total of 18 hypothetical sequences, which could be potential drug targets. The homologues, together with the corresponding sequences from the KEGG database, are shown in Table 3.
Table 3

Hypothetical proteins homologous to KEGG sequences with fused domains in Parkinson's disease pathway.

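| Sl. no. | KEGG protein ID | KEGG protein Gi ID | Cluster no. | No. of hypothetical proteins | Hypothetical protein Gi ID |
| --- | --- | --- | --- | --- | --- |
| 1 | UB | 11024714 | 1 | 1 | 5912028 |
| 2 | PARK2 | 4758884 | 1 | 5 | 10241759 |
|  |  |  |  |  | 12052812 |
|  |  |  |  |  | 44662819 |
|  |  |  |  |  | 57997480 |
|  |  |  |  |  | 37589137 |
| 3 | UBA1 | 23510338 | 2 | 1 | 12053109 |
|  |  |  | 8 | 4 | 7018418 |
|  |  |  |  |  | 7018436 |
|  |  |  |  |  | 34304594 |
|  |  |  |  |  | 63994165 |
| 4 | PINK1 | 14165272 | 5 | 7 | 1905906 |
|  |  |  |  |  | 3510234 |
|  |  |  |  |  | 5912043 |
|  |  |  |  |  | 12053281 |
|  |  |  |  |  | 52545876 |
|  |  |  |  |  | 57997093 |
|  |  |  |  |  | 57997188 |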

2.6. Step 6: Sequence Analysis of Hypothetical Proteins

Cluster-1: (Ubiquitin, Ubiquitin-Like & Kinase Motor Domain(s)).

Cluster 1 is the largest of the 17 clusters. This cluster (Figure 3) contained a total of 36 domains resulting from 106 hypothetical sequences. Sequences in this cluster were associated with neurodegenerative disorders such as Alzheimer's, juvenile parkinsonism, and spinocerebellar ataxia. A database search for the four Parkinson's disease targets (UB, PARK2, UBA1, and PINK1) against the sequences in this cluster resulted in six hypothetical sequences from the human genome, of which five were related to UB and three to PARK2, with two of them homologous to both UB and PARK2.
Figure 3

Domains in cluster-1.

As illustrated in Table 4, the fusions between the domains cd00196, cd01796, and cd01089 are conserved in all six hypothetical sequences and their respective protein targets. A pairwise comparison of these sequences with their targets is shown in Figure 4 to reiterate the same at the sequence level.
Table 4

Conservation of domain fusions in Parkinson's disease targets and human hypothetical sequences.

| KEGG sequence | Hypothetical protein Gi ID | Cd00196 region | Cd01769 region | Cd01809 region | Region in target | Sequence identity in the fused region (%) |
| --- | --- | --- | --- | --- | --- | --- |
| UB | 5912028 | 14–82, 90–158, 166–234 | 14–82, 90–158, 166–234 | 11–82, 87–158, 163–234 | 4–72, 4–72, 5–72 | 98 |
|  | 12052812 | 50–107 | 52–107 | 50–105 | 4–72, 4–72, 5–72 | 30 |
|  | 44662819 | 4–72 | 4–70 | 3–70 | 4–72, 4–72, 5–72 | 30 |
|  | 37589137 | 4–72 | 4–70 | 3–70 | 4–72, 4–72, 5–72 | 30 |
|  | 57997480 | 21–87 | 20–87 | 17–87 | 4–72, 4–72, 5–72 | 36 |
| PARK2 | 5912028 | 14–82, 90–158, 166–234 | 14–82, 90–158, 166–234 | 11–82, 87–158, 163–234 | 4–72, 4–72, 5–72 | 30 |
|  | 12052812 | 50–107 | 52–107 | 50–105 | 4–72, 4–72, 5–72 | 29 |
|  | 10241759 | 1–67 | 1–67 | 1–65 | 4–72, 4–72, 5–72 | 30 |
Figure 4

Multiple sequence alignment of the fused domains in 6 unique hypothetical proteins and their target sequences (PARK2 and ubiquitin sequence).

Mutational Analysis

Mutational analysis of these proteins was carried out using the PROSITE [27] signature PS00299 (Figure 5).
Figure 5

PROSITE signature PS00299 comparison in hypothetical sequences for a ubiquitin domain. Square brackets in the signature indicate the presence of any one of the listed residues at that position, whereas x(3) indicates any three amino acids. The red-colored residues are the strictly conserved residues, the blue-colored ones are the residues present in the regular expression pattern, and the orange-colored ones are the mutant residues as observed in the mutant database.

This signature of 26 residues, spanning positions 27 to 52, is characteristic of the ubiquitin domain. Of the four Parkinson's disease targets, PARK2 and the ubiquitin protein (UB) have this signature. The motif was compared with the homologs of the parkinsonian targets, and the observed substitutions were looked up in the Protein Mutant Database (PMD) [28] to infer the effects of such mutations. The mutants in the ubiquitin domain and their altered functions are shown in Table 5.
Table 5

Mutational analysis of the hypothetical proteins with ubiquitin domains.

| Sl. no. | Gi ID | Mutational positions | Function as predicted by protein mutant database (PMD) |
| --- | --- | --- | --- |
| 1 | 5912028 | Nil | No change |
| 2 | 37589137 | I30V | Stability is retained. |
|  |  | G35K | Melting temperature at pH 3.0 decreases. |
| 3 | 44662819 | I30V | Stability is retained. |
|  |  | G35K | Melting temperature at pH 3.0 decreases. |
| 4 | 57997480 | G35S | Melting temperature at pH 3.0 decreases. |
|  |  | K48R | Increase in morphologic response of cells to canavanine, accumulation of high-molecular-weight ubiquitin conjugates and proteome substrates is observed. |
| 5 | 12052812 | R42L | Ubiquitin adenylate affinity for E1 protein decreases. |
|  |  | G35K | Melting temperature at pH 3.0 decreases. |
| 6 | 10241759 | R42L | Ubiquitin adenylate affinity for E1 protein decreases. |
|  |  | G35K | Melting temperature at pH 3.0 decreases. |
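
The signature comparison can be illustrated with a short sketch that converts a PROSITE-style pattern into a regular expression and locates it in a sequence. Two inputs here are assumptions that do not come from the paper: the PS00299 pattern string and the 76-residue human ubiquitin sequence are quoted from memory of the public records and should be checked against PROSITE and UniProt before reuse. The helper names are hypothetical, and the converter handles only the simple pattern elements used here.

```python
import re

# ASSUMPTION: pattern and sequence are recalled from public PROSITE and
# UniProt records, not taken from the paper; verify before reuse.
PS00299 = ("K-x(2)-[LIVM]-x-[DESAK]-x(3)-[LIVM]-[PAQ]-x(3)-Q-x-"
           "[LIVM]-[LIVMC]-[LIVMFY]-x-G-x(4)-[DE]")
UBIQUITIN = ("MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDG"
             "RTLSDYNIQKESTLHLVLRLRGG")

_ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # excluded residues
            core = "[^" + core[1:-1] + "]"
        if low is not None:              # repetition count or range
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def substitute(sequence, mutation):
    """Apply a substitution such as 'I30V' (1-based position)."""
    wild, position, new = mutation[0], int(mutation[1:-1]), mutation[-1]
    if sequence[position - 1] != wild:
        raise ValueError(f"{mutation}: residue {position} is not {wild}")
    return sequence[:position - 1] + new + sequence[position:]


signature = re.compile(prosite_to_regex(PS00299))
hit = signature.search(UBIQUITIN)
span = (hit.start() + 1, hit.end())  # 1-based, inclusive
print(span)  # (27, 52)

# Substitutions listed in Table 5, applied one at a time.
for mutation in ("I30V", "G35K", "G35S", "R42L", "K48R"):
    mutant = substitute(UBIQUITIN, mutation)
    print(mutation, bool(signature.search(mutant)))
```

Under these assumptions, each substitution from Table 5 still matches the pattern. I30V stays within the allowed set [LIVM], and the others fall on wildcard positions. This would be consistent with their effects being looked up in the PMD rather than inferred from the signature alone.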

Cluster-2

Cluster-2 (Figure 6) had 11 domains spanning a total of 11 hypothetical protein sequences. The majority of the domains in cluster-2 were ubiquitin, NTF2, and ubiquitin-like domains. Sequences in this cluster were associated with fatty acid disorders. A database search for the four Parkinson's disease targets (UB, PARK2, UBA1, and PINK1) against the sequences in this cluster resulted in only one hypothetical protein from the human genome. In this hypothetical protein, the domains cd01491 and cd01492 are fused, and this fusion is conserved in UBA1 (Table 6). A pairwise comparison of the hypothetical sequence (gi:12053109) with its parkinsonian homolog (UBA1) is shown in Figure 7.
Figure 6

Domains in cluster-2.

Table 6

Conservation of domain fusions in Parkinson's disease targets and human hypothetical sequences.

| KEGG sequence | Hypothetical protein Gi ID | Cd01492 region | Cd01491 region | Region in UBA1 | Sequence identity in the fused region |
| --- | --- | --- | --- | --- | --- |
| UBA1 | 12053109 | 13–162 | 13–162 | 54–162 | 52/160 ≈ 32% |
Figure 7

Pairwise sequence alignment of the fused domains between the ubiquitin sequence and its homolog hypothetical sequence (gi:12053109).
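
The identity values reported in the tables can be pictured with a small helper that scores a pairwise alignment. The paper does not state which denominator was used. The sketch assumes identity is counted over alignment columns where neither sequence has a gap, which is one common convention. The function name and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    ASSUMPTION: the denominator is the number of alignment columns in
    which neither sequence has a gap; other conventions exist.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        identical += a == b
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy alignment: 4 identical residues out of 5 ungapped columns.
print(percent_identity("ACDE-G", "ACNEFG"))

# The ratio quoted in Table 6, 52 identical residues over 160 positions.
print(100.0 * 52 / 160)
```

The second line evaluates to 32.5, which the table rounds to 32%.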

Cluster-5

Cluster-5 had 16 domains spanning 196 hypothetical protein sequences, as depicted in Figure 8. The majority of the domains were PH, vWFA, and ubiquitin-like domains. Sequences in this cluster were associated with von Willebrand disease, thrombotic thrombocytopenic purpura (TTP), hemolytic uremic syndrome (HUS), and ADAMTS13. A database search for the four Parkinson's disease targets (UB, PARK2, UBA1, and PINK1) against the sequences in this cluster resulted in seven hypothetical sequences from the human genome, homologous to the parkinsonian target PINK1. The fusions between the domains cd00192 and cd00180 are conserved in all the hypothetical sequences and PINK1. Comparison of domain fusions in hypothetical sequences with their targets is presented in Table 7. A multiple sequence alignment of these sequences with PINK1 is shown in Figure 9.
Figure 8

Domains in cluster-5.

Table 7

Conservation of domain fusions in Parkinson's disease targets and human hypothetical sequences in cluster-5.

| KEGG sequence | Hypothetical protein Gi ID | Cd00180 region | Cd00192 region | Region in PINK1 | Sequence identity in the fused region (%) |
| --- | --- | --- | --- | --- | --- |
| PINK1 | 57997188 | 14–288 | 19–273 | 271–501 | 17 |
|  | 5912043 | 58–340 | 63–337 | 271–501 | 8 |
|  | 52545876 | 147–405 | 152–388 | 271–501 | 18 |
|  | 1905906 | 53–303 | 57–299 | 271–501 | 19 |
|  | 12053281 | 173–419 | 173–415 | 271–501 | 11 |
|  | 57997093 | 199–438 | 193–433 | 271–501 | 19 |
|  | 3510234 | 28–287 | 34–284 | 271–501 | 13 |
Figure 9

Multiple-sequence alignment of the fused domains in 7 unique hypothetical proteins and PINK1 sequence.

Cluster-8

Cluster-8 has 13 domains spanning 32 hypothetical protein sequences as shown in Figure 10. The functions of the majority of the domains were related to ubiquitin, PLAT, and ubiquitin-like domains. Sequences in this cluster were associated with neurological disorders.
Figure 10

Domains in cluster-8.

A database search for the four Parkinson's disease targets (UB, PARK2, UBA1, and PINK1) against the sequences in this cluster resulted in four hypothetical sequences from the human genome, homologous to the parkinsonian target UBA1. The fusions between the domains cd01488, cd01489, and cd01490 were observed to be conserved in all the hypothetical sequences and UBA1. Comparison of domain fusions in hypothetical sequences with their targets is presented in Table 8. A pairwise comparison of these sequences with their targets is shown in Figure 11.
Table 8

Conservation of domain fusions in Parkinson's disease targets and human hypothetical sequences.

| KEGG sequence | Hypothetical protein Gi ID | Cd01488 region | Cd01489 region | Cd01490 region | Region in UBA1 | Sequence identity in the fused region (%) |
| --- | --- | --- | --- | --- | --- | --- |
| UBA1 | 63994165 | 198–377 | 198–397 | 197–512 | 470–671 | 53 |
|  | 34304594 | 1–168 | 1–169 | 1–303 | 470–671 | 48 |
|  | 7018436 | 30–192 | 30–200 | 30–419 | 470–671 | 31 |
|  | 7018418 | 71–368 | 71–331 | 71–342 | 470–671 | 25 |
Figure 11

Multiple-sequence alignment of the fused domains in 4 unique hypothetical proteins and their target sequence (UBA1).

3. Results and Discussions

This study was initiated to understand the diversity of functions in proteins with multiple fused domains and to characterize the hypothetical proteins containing multiple fused domains from the human genome. The approach involved characterizing 15480 hypothetical protein sequences based on the identification of domains using the CDD database. This provided 1777 sequences with domains, of which 984 were single-domain and 793 were multidomain sequences. Of these 793 sequences, 433 were multidomain fused proteins. Frequencies of the fused domains in these 433 proteins were used as input for clustering with Cytoscape, which yielded a total of 17 clusters, as depicted in Figure 1. Four of these 17 clusters had ubiquitin fused-domain-containing sequences, which play an important role in a variety of neuropathological conditions including Parkinson's disease, Pick's disease, and Alzheimer's disease, as indicated in Table 1. The ubiquitin domain consists of 76 amino acids and has been found in all eukaryotic cells. Apart from its role in protein degradation, ubiquitin is also involved in Parkinson's disease. Parkinson's disease-related genes such as PARK2 and PINK1 have ubiquitin domains associated with them. Mutations in these sequences have prominently been associated with the onset of Parkinson's disease. As a case study, sequences implicated in Parkinson's disease were used as a basis to characterize the hypothetical proteins from the above-mentioned four clusters. From Table 2, it is clear that ubiquitin/ubiquitin-like domains play a dominant role in the onset of Parkinson's disease. Hence, fused ubiquitin domain sequences from clusters 1, 2, 5, and 8 were considered for a detailed investigation to ascertain their roles as well. Similarity searches revealed 18 hypothetical proteins homologous with the sequences implicated in Parkinson's disease, as shown in Table 3. 
Sequences in each of these clusters were then multiply aligned with the parkinsonian targets UB, UBA1, PARK2 & PINK1, to ascertain the presence of key patterns/signatures amongst them. As illustrated in Figures 4, 7, 9, and 11, conservation of residues amongst hypothetical proteins and Parkinson's sequences is highlighted.

4. Conclusions

We conclude that the presence of fused domains in ubiquitin-containing proteins from parkinsonian targets can be used as a probe to identify and characterize the functions of 18 hypothetical sequences from the human genome, which could serve as lead targets for designing drugs against Parkinson's disease.
References (27 in total)

1.  KEGG: kyoto encyclopedia of genes and genomes.

Authors:  M Kanehisa; S Goto
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors:  Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal:  Genome Res       Date:  2003-11       Impact factor: 9.043

3.  The Protein Mutant Database.

Authors:  T Kawabata; M Ota; K Nishikawa
Journal:  Nucleic Acids Res       Date:  1999-01-01       Impact factor: 16.971

4.  SMART, a simple modular architecture research tool: identification of signaling domains.

Authors:  J Schultz; F Milpetz; P Bork; C P Ponting
Journal:  Proc Natl Acad Sci U S A       Date:  1998-05-26       Impact factor: 11.205

5.  Mokola virus glycoprotein and chimeric proteins can replace rabies virus glycoprotein in the rescue of infectious defective rabies virus particles.

Authors:  T Mebatsion; M J Schnell; K K Conzelmann
Journal:  J Virol       Date:  1995-03       Impact factor: 5.103

6.  Functional classification using phylogenomic inference.

Authors:  Duncan Brown; Kimmen Sjölander
Journal:  PLoS Comput Biol       Date:  2006-06-30       Impact factor: 4.475

7.  Genome sequences and great expectations.

Authors:  I Iliopoulos; S Tsoka; M A Andrade; P Janssen; B Audit; A Tramontano; A Valencia; C Leroy; C Sander; C A Ouzounis
Journal:  Genome Biol       Date:  2000-12-29       Impact factor: 13.583

8.  Conserved 'hypothetical' proteins: new hints and new puzzles.

Authors:  M Y Galperin
Journal:  Comp Funct Genomics       Date:  2001

9.  Hypo, hype and 'hyp' human proteins.

Authors:  Prashanth Suravajhala
Journal:  Bioinformation       Date:  2007-07-10

10.  Domain fusion analysis by applying relational algebra to protein sequence and domain databases.

Authors:  Kevin Truong; Mitsuhiko Ikura
Journal:  BMC Bioinformatics       Date:  2003-05-06       Impact factor: 3.169
