
PRODOC: a resource for the comparison of tethered protein domain architectures with in-built information on remotely related domain families.

O Krishnadev, N Rekha, S B Pandit, S Abhiman, S Mohanty, L S Swapna, S Gore, N Srinivasan.

Abstract

PROtein Domain Organization and Comparison (PRODOC) comprises several programs that enable convenient comparison of proteins as a sequence of domains. The in-built dataset currently consists of approximately 698 000 proteins from 192 organisms with complete genomic data, and all the SWISSPROT proteins obtained from the Pfam database. All the entries in PRODOC are represented as a sequence of functional domains, assigned using hidden Markov models, instead of as a sequence of amino acids. On average 69% of the proteins in the proteomes and 49% of the residues are covered by functional domain assignments. Software tools allow the user to query the dataset with a sequence of domains and identify proteins with the same or a jumbled or circularly permuted arrangement of domains. As it is proposed that proteins with jumbled or the same domain sequences have similar functions, this search tool is useful in assigning the overall function of a multi-domain protein. Unique features of PRODOC include the generation of alignments between multi-domain proteins on the basis of the sequence of domains and in-built information on distantly related domain families forming superfamilies. It is also possible using PRODOC to identify domain sharing and gene fusion events across organisms. An exhaustive genome-genome comparison tool in PRODOC also enables the detection of successive domain sharing and domain fusion events across two organisms. The tool permits the identification of gene clusters involved in similar biological processes in two closely related organisms. The URL for PRODOC is http://hodgkin.mbu.iisc.ernet.in/~prodoc.


Year: 2005 PMID: 15980440 PMCID: PMC1160235 DOI: 10.1093/nar/gki474

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Modular representation of gene products as sequences of functional domains, instead of as sequences of amino acids, is useful in understanding the molecular basis of the functions of multi-domain proteins (1–3). Knowledge of the functions of the individual domains of a multi-domain protein contributes to our understanding of the properties of the protein as a whole (4–6). Viewing multi-domain proteins as sequences of domains also enables the identification of gene fusion events, interacting proteins (7,8) and preferred domain associations (9–14), and the comparison of sequences of domains helps in obtaining clues about domain function. For example, in two multi-domain proteins with many common domains, alignment of a region of unknown function with a domain of known function raises the possibility of a distant relationship between the region of unknown function and the aligned domain. Realizing the importance of viewing proteins as sequences of domains, many databases of protein domain families and sequences of protein domains have been developed, such as PRODOM (15), DOMO (16), BLOCKS (17), Pfam (18), SMART (19), InterPRO (20), PRINTS (21) and DART (22). The entire compendium of proteins listed in SWISSPROT (23) is available in SWISSPFAM, wherein every SWISSPROT entry is represented as a sequence of domains. Domain assignments to various proteomes are also available in the form of databases (24–26).

Several software tools are available in PRODOC (PROtein Domain Organization and Comparison) to facilitate searching for a given sequence of domains in various genomes, identification of domain fusion events, recognition of gene products with identical or similar domain compositions and identification of proteins with a circularly permuted or jumbled arrangement of the order of domains. A tool for complete genome–genome comparison is also available in PRODOC. By considering two genomes at a time the program can identify series of gene products that exhibit domain sharing. This process enables the proposal of functional gene clusters in the two genomes. This is radically different from COG (27) as we consider the sharing of a domain family to be a criterion in identifying series of gene fusion events in a set of genes from two organisms.

The database component of PRODOC is the sequence of functional domains of proteins encoded in a large number of organisms, as well as the entire set of proteins in the SWISSPROT database. The objective behind the generation of the PRODOC suite of programs is that it should provide a convenient platform to perform domain analysis at the genomic scale for the applications mentioned above.

The most distinguishing feature of PRODOC compared with similar resources for domain analysis is the use of the notion of remotely related domain families forming superfamilies. A superfamily is constituted by families which exhibit similarity in the functions and structures of protein domains (28). We have incorporated in PRODOC knowledge of such distantly related protein domain families in a superfamily with and without known three-dimensional structures (29,30). Thus it is possible to recognize those sequences of domains with one or more domains belonging to the same superfamily as those in the query. Such searches enable the user to study the evolution of the functions of multi-domain proteins. It has been suggested that homologous protein domains with extensive sequence divergence, forming protein domain superfamilies, are involved in novel domain combinations during gene fusion events while retaining the broad nature of the function (14). It is suggested that such variations in domain recruitment and high sequence divergence form turning points in otherwise similar biochemical pathways (14).

THE CONSTRUCTION AND ORGANIZATION OF PRODOC

The various tools and datasets present in PRODOC and the software's overall organization are shown in Figure 1, and these features are discussed below.

Figure 1

The organization of PRODOC and the utilities offered.

Domain assignments to genomes

The amino acid sequences of predicted gene products in the completely sequenced genomes of various organisms are available in public databases such as NCBI, ENSEMBL (31,32), FlyBase (33) and PlasmoDB (34). The hidden Markov models (HMMs) for protein domain families available in the PfamA dataset have been used in generating a database of HMMs for 7677 domain families available in Pfam (18) version 16. HMMER (35) enables the mapping of various domains along the amino acid sequence of the query. An E-value threshold of 10⁻² is considered reasonable for the assignment of domains. In addition, it is ensured that the alignment is of considerable length (36). For the current and first major release of PRODOC, the domain assignments for the proteins from various genomes and for those proteins listed in SWISSPROT have been obtained from Pfam and SWISSPFAM, respectively. The domain assignments are confined to the regions showing a significant match with the HMMs of protein families, leaving a proportion of the gene products with no domain assignment. At the time of preparation of this article, the PRODOC database contained functional domain assignments for 192 completed proteomes (156 eubacterial, 19 archaeal and 17 eukaryotic proteomes) consisting of 697 976 proteins. Typically, 69% and 68% of the proteins of a proteome are covered by HMM-based domain assignments in prokaryotic and eukaryotic organisms, respectively. In each of these proteins the domain assignments span a substantial proportion of the sequence, and on average 49% of the residues are covered by domain assignments. In the future the dataset will be updated periodically using HMMER2 running on locally available multi-processor systems.
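
The acceptance rule and the residue coverage figure described above can be illustrated with a minimal Python sketch. It is not the PRODOC pipeline: the `Hit` record, the family names and the minimum aligned length of 30 residues are assumptions made for illustration (the text states only that the alignment must be of considerable length), and the hits are supplied by hand rather than parsed from HMMER output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One HMM match on a protein (hypothetical record layout)."""
    family: str    # domain family name, illustrative values only
    start: int     # first matched residue, 1-based inclusive
    end: int       # last matched residue, 1-based inclusive
    evalue: float  # E-value reported for the match


E_VALUE_THRESHOLD = 1e-2  # threshold quoted in the text
MIN_ALIGNED_LENGTH = 30   # assumed value; the text gives no number


def accept_hits(hits, max_evalue=E_VALUE_THRESHOLD,
                min_length=MIN_ALIGNED_LENGTH):
    """Keep hits that pass both criteria, ordered along the sequence."""
    kept = [h for h in hits
            if h.evalue <= max_evalue and (h.end - h.start + 1) >= min_length]
    return sorted(kept, key=lambda h: h.start)


def residue_coverage(hits, protein_length):
    """Fraction of residues that fall inside at least one accepted hit."""
    covered = set()
    for h in hits:
        covered.update(range(h.start, h.end + 1))
    return len(covered) / protein_length


hits = [
    Hit("FamA", 1, 50, 1e-5),    # accepted
    Hit("FamB", 40, 100, 0.5),   # rejected: E-value above threshold
    Hit("FamC", 61, 100, 1e-3),  # accepted
    Hit("FamD", 10, 20, 1e-10),  # rejected: aligned region too short
]
accepted = accept_hits(hits)
architecture = [h.family for h in accepted]
coverage = residue_coverage(accepted, protein_length=200)
```

The ordered list of accepted family names is the "sequence of domains" representation used in the rest of the article; residues outside accepted hits remain domain-unassigned.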

Tool for the comparison of proteins with linear and shuffled domain order

One of the tools available in PRODOC allows the user to query the datasets for occurrences of a sequence of domains. It has been observed that in many similar multi-domain proteins, the order of occurrence of domains in the primary structure is different. Such cases cannot be easily detected by simple amino acid sequence search methods, but a tool has been built in PRODOC to search for such cases. The user is allowed to input a number of domains as a query. Following this step, a search is made in the dataset of interest to identify all the multi-domain proteins with a different or cyclically permuted order of domains compared with the query protein. It is known that the overall functions of two proteins related by jumbled domain architectures are often similar (37).
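
The distinction between identical, circularly permuted and jumbled domain orders can be made concrete with a short Python sketch. It is only an illustration under simple assumptions: each protein is a list of domain family names, and the function names and returned labels are hypothetical, not part of PRODOC.

```python
from collections import Counter


def is_circular_permutation(query, target):
    """True if target is a rotation of query (same domains, cyclic shift)."""
    query, target = list(query), list(target)
    if len(query) != len(target):
        return False
    if not query:
        return True
    doubled = query + query
    n = len(target)
    # Every rotation of query appears as a window of length n in query+query.
    return any(doubled[i:i + n] == target for i in range(n))


def classify_arrangement(query, target):
    """Relate the domain order of target to that of query."""
    query, target = list(query), list(target)
    if query == target:
        return "identical"
    if is_circular_permutation(query, target):
        return "circular_permutation"
    if Counter(query) == Counter(target):
        return "jumbled"  # same domain composition, different order
    return "different"


query = ["P", "Q", "R", "S"]
results = {
    "same": classify_arrangement(query, ["P", "Q", "R", "S"]),
    "rotated": classify_arrangement(query, ["R", "S", "P", "Q"]),
    "shuffled": classify_arrangement(query, ["Q", "P", "R", "S"]),
    "other": classify_arrangement(query, ["P", "Q", "T"]),
}
```

Comparing multisets with `Counter` keeps repeated domains distinct, so a protein with two copies of a family is not treated as equivalent to one with a single copy.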

Tool for the comparison of the sequences of domain families considering superfamily relationships

When comparing the sequences of the domains of two multi-domain proteins, it is possible that some of the domains in one protein are distantly related to domain(s) in the other protein, that is, they belong to the same superfamily. We have formed a dataset of distantly related Pfam domain families by relating Pfam families with proteins of known three-dimensional structure and by identifying new potential sequence superfamilies (29,30). This information is used in the domain architecture search tool to identify distantly related multi-domain proteins with one or more domains related by a superfamily connection (14).
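
One simple way to picture a superfamily-aware comparison is to replace each family name by its superfamily before comparing two domain sequences. The Python sketch below does this under stated assumptions: the family-to-superfamily mapping is a hand-made, hypothetical dictionary (in PRODOC this knowledge comes from the superfamily dataset cited above), and the function names and labels are illustrative.

```python
# Hypothetical mapping: two distantly related families share one superfamily.
FAMILY_TO_SUPERFAMILY = {
    "FamA1": "SF_A",
    "FamA2": "SF_A",
    "FamB": "SF_B",
}


def to_superfamilies(architecture, mapping):
    """Replace each family by its superfamily; unmapped families stay as-is."""
    return [mapping.get(family, family) for family in architecture]


def match_level(query, target, mapping=FAMILY_TO_SUPERFAMILY):
    """Report whether two domain sequences match at family or superfamily level."""
    query, target = list(query), list(target)
    if query == target:
        return "family_match"
    if to_superfamilies(query, mapping) == to_superfamilies(target, mapping):
        return "superfamily_match"
    return "no_match"


exact = match_level(["FamA1", "FamB"], ["FamA1", "FamB"])
remote = match_level(["FamA1", "FamB"], ["FamA2", "FamB"])
unrelated = match_level(["FamA1", "FamB"], ["FamC", "FamB"])
```

The same substitution can be applied before an order-insensitive comparison, so that jumbled or circularly permuted arrangements are also recognized at the superfamily level.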

Clues to the functions of domain-unassigned regions

Tools in PRODOC can also aid remote homology detection and function annotation based on alignment of a domain with a region with no domain assignment. For example, in two multi-domain proteins with many common domains, alignment of a region of unknown function with a domain of known function raises the possibility of a distant relationship between the region of unknown function and the aligned domain. Thus PRODOC can be helpful in suggesting new possibilities for the functional annotation of domain-unassigned regions.
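
The idea of aligning a domain-unassigned region against a known domain can be sketched as a global alignment over domain tokens. The example below uses a standard Needleman-Wunsch dynamic program as a stand-in; the article does not state which alignment algorithm PRODOC uses, and the scoring values, the `?` token for an unassigned region and the function names are all assumptions.

```python
UNASSIGNED = "?"  # token standing for a region with no domain assignment
GAP = -1          # assumed gap penalty


def pair_score(a, b):
    """Assumed scoring: unassigned regions pair neutrally with anything."""
    if a == UNASSIGNED or b == UNASSIGNED:
        return 0
    return 2 if a == b else -2


def align_domains(x, y):
    """Globally align two domain sequences; returns a list of (x_item, y_item)."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]),
                           dp[i - 1][j] + GAP,
                           dp[i][j - 1] + GAP)
    # Trace back from the bottom-right corner, preferring aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1])):
            pairs.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + GAP:
            pairs.append((x[i - 1], None))
            i -= 1
        else:
            pairs.append((None, y[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def annotation_hints(pairs):
    """Domains aligned opposite an unassigned region: candidate annotations."""
    hints = []
    for a, b in pairs:
        if a == UNASSIGNED and b not in (None, UNASSIGNED):
            hints.append(b)
        elif b == UNASSIGNED and a not in (None, UNASSIGNED):
            hints.append(a)
    return hints


alignment = align_domains(["P", "Q", "R", "S"], ["P", "Q", UNASSIGNED, "S"])
hints = annotation_hints(alignment)
```

Here the unassigned region of the second protein lines up with domain `R` of the first, which is the kind of hint, not proof, of a distant relationship that the paragraph above describes.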

Tools for the identification of domain sharing, gene fusion and functional clusters

Putative gene fusion events across two organisms can be identified using PRODOC. This can be illustrated as follows. Let us consider that two different gene products in organism A encode domain families P and Q, respectively. If, in a closely related organism B, a protein with domain families P and Q fused as a single gene product can be identified, this forms a potential gene fusion event and the possibility of functional interaction between the two gene products in organism A is raised (7,8,11,38,39). PRODOC also facilitates searches to identify several domain fusion events successively. For example, it is possible that, in organism B, the gene product with domain families P and Q is also tethered to another domain family, R. In such a situation one can search in organism A for a gene product containing the domain family R. If such a gene product can be found in organism A and domain family S is tethered to R, a further search can be made in organism B for a protein with domain S, and so on. Such a repetitive search for successive domain fusion events across two organisms will eventually result in two sets of genes from two organisms with several domains shared between the sets. Such sets of gene products can be considered functional clusters of proteins involved in similar series of events in similar biological pathways across the two organisms. Using PRODOC, the user can easily compare domain organization between two genomes of interest. Pairs of proteins that contain at least one common domain are displayed as output. This information can be harnessed to derive cases of gene fusion and functional gene clusters.
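
The P, Q, R, S walk-through above can be expressed as a short Python sketch. It is a simplified illustration, not the PRODOC implementation: each proteome is assumed to be a dictionary from a protein identifier to its set of domain families, and the identifiers, function names and toy data are hypothetical.

```python
from itertools import combinations


def fusion_candidates(proteome_a, proteome_b):
    """Pairs of A proteins with disjoint domains that both share a domain
    with one B protein, i.e. their domains appear fused in organism B."""
    found = []
    for b_id, b_domains in proteome_b.items():
        for a1, a2 in combinations(sorted(proteome_a), 2):
            d1, d2 = proteome_a[a1], proteome_a[a2]
            if d1 & b_domains and d2 & b_domains and not (d1 & d2):
                found.append((a1, a2, b_id))
    return found


def successive_domain_sharing(seed_id, proteome_a, proteome_b):
    """Alternate between the two organisms, adding every protein that shares
    a domain with the set collected so far from the other organism."""
    cluster_a, cluster_b = {seed_id}, set()
    domains_a, domains_b = set(proteome_a[seed_id]), set()
    changed = True
    while changed:  # stop when a full pass adds nothing new
        changed = False
        for pid, domains in proteome_b.items():
            if pid not in cluster_b and domains & domains_a:
                cluster_b.add(pid)
                domains_b |= domains
                changed = True
        for pid, domains in proteome_a.items():
            if pid not in cluster_a and domains & domains_b:
                cluster_a.add(pid)
                domains_a |= domains
                changed = True
    return cluster_a, cluster_b


# Toy proteomes mirroring the example in the text.
organism_a = {"a1": {"P"}, "a2": {"Q"}, "a3": {"R", "S"}, "a4": {"U"}}
organism_b = {"b1": {"P", "Q", "R"}, "b2": {"S"}, "b3": {"T"}}

fusions = fusion_candidates(organism_a, organism_b)
cluster_a, cluster_b = successive_domain_sharing("a1", organism_a, organism_b)
```

Starting from the protein carrying P, the search reaches the fused P-Q-R protein in organism B, then the Q and R-S proteins in organism A, then the S protein in organism B, while proteins with unrelated domains (`a4`, `b3`) stay outside the two sets.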

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

39 in total

1. The geometry of domain combination in proteins.

Authors: Matthew Bashton; Cyrus Chothia
Journal: J Mol Biol Date: 2002-01-25 Impact factor: 5.469

2. Sensitive protein comparisons with profiles and hidden Markov models.

Authors: K Hofmann
Journal: Brief Bioinform Date: 2000-05 Impact factor: 11.622

3. SUPFAM--a database of potential protein superfamily relationships derived by comparing sequence-based and structure-based families: implications for structural genomics and function annotation in genomes.

Authors: Shashi B Pandit; Dilip Gosar; S Abhiman; S Sujatha; Sayali S Dixit; Natasha S Mhatre; R Sowdhamini; N Srinivasan
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

4. The Ensembl genome database project.

Authors: T Hubbard; D Barker; E Birney; G Cameron; Y Chen; L Clark; T Cox; J Cuff; V Curwen; T Down; R Durbin; E Eyras; J Gilbert; M Hammond; L Huminiecki; A Kasprzyk; H Lehvaslaiho; P Lijnzaad; C Melsopp; E Mongin; R Pettett; M Pocock; S Potter; A Rust; E Schmidt; S Searle; G Slater; J Smith; W Spooner; A Stabenau; J Stalker; E Stupka; A Ureta-Vidal; I Vastrik; M Clamp
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

Review 5. Protein domain analysis in the era of complete genomes.

Authors: Richard R Copley; Tobias Doerks; Ivica Letunic; Peer Bork
Journal: FEBS Lett Date: 2002-02-20 Impact factor: 4.124

6. The identification of functional modules from the genomic association of genes.

Authors: Berend Snel; Peer Bork; Martijn A Huynen
Journal: Proc Natl Acad Sci U S A Date: 2002-04-30 Impact factor: 11.205

Review 7. The natural history of protein domains.

Authors: Chris P Ponting; Robert R Russell
Journal: Annu Rev Biophys Biomol Struct Date: 2001-10-25

8. Domain combinations in archaeal, eubacterial and eukaryotic proteomes.

Authors: G Apic; J Gough; S A Teichmann
Journal: J Mol Biol Date: 2001-07-06 Impact factor: 5.469

9. Prediction of protein interactions: metabolic enzymes are frequently involved in gene fusion.

Authors: S Tsoka; C A Ouzounis
Journal: Nat Genet Date: 2000-10 Impact factor: 38.330

10. Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions.

Authors: A J Enright; C A Ouzounis
Journal: Genome Biol Date: 2001 Impact factor: 13.583

3 in total

1. Evolution of domain promiscuity in eukaryotic genomes--a perspective from the inferred ancestral domain architectures.

Authors: Inbar Cohen-Gihon; Jessica H Fong; Roded Sharan; Ruth Nussinov; Teresa M Przytycka; Anna R Panchenko
Journal: Mol Biosyst Date: 2010-12-03

2. Accommodation of profound sequence differences at the interfaces of eubacterial RNA polymerase multi-protein assembly.

Authors: Lakshmipuram Seshadri Swapna; Nambudiry Rekha; Narayanaswamy Srinivasan
Journal: Bioinformation Date: 2012-01-06

3. DAhunter: a web-based server that identifies homologous proteins by comparing domain architecture.

Authors: Byungwook Lee; Doheon Lee
Journal: Nucleic Acids Res Date: 2008-04-14 Impact factor: 16.971
