d-Omix: a mixer of generic protein domain analysis tools.

Duangdao Wichadakul¹, Somrak Numnark, Supawadee Ingsriswang.

Abstract

Domain combination provides important clues to the roles of protein domains in protein function, interaction and evolution. We have developed a web server, d-Omix (a Mixer of Protein Domain Analysis Tools), intended as a unified platform to analyze, compare and visualize protein data sets with respect to various aspects of protein domain combinations. Given InterProScan files for protein sets of interest provided by users, the server offers four services for domain analysis. First, it constructs a protein phylogenetic tree based on a distance matrix calculated from protein domain architectures (DAs), allowing comparison with a sequence-based tree. Second, it calculates and visualizes the versatility, abundance and co-presence of protein domains via a domain graph. Third, it compares the similarity of proteins based on DA alignment. Fourth, it builds a putative protein network derived from domain-domain interactions from DOMINE. Users may select a variety of input data files and flexibly choose domain search tools (e.g. hmmpfam, superfamily) for a specific analysis. Results from d-Omix can be interactively explored and exported into various formats such as SVG, JPG, BMP and CSV. Users with only protein sequences can also prepare an InterProScan file using a service provided by the server. The d-Omix web server is freely available at http://www.biotec.or.th/isl/Domix.

Year: 2009 PMID: 19465389 PMCID: PMC2703976 DOI: 10.1093/nar/gkp329

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein domains are units of evolution (1,2). Different combinations of protein domains generate several types of modifications affecting protein functions. Addition or deletion of domains can modify substrate binding, increase or decrease catalytic activity, change the categorized reaction, cause loss of catalytic function, or regulate enzyme function (3). Comparing protein domain combinations and architectures (DAs) sheds light on their related functions, possible annotations of unknown proteins, and evolution. Domain combination has been analyzed for examining and predicting protein functions (3–6), protein cellular localization (7,8) and protein–protein interactions (PPIs), especially through domain fusion (9,10) and domain–domain interactions (DDIs) (11–14). To analyze and compare different domain combinations, a topology of co-occurring domains called a domain graph was introduced (15). The highly connected nodes, or versatile nodes, in the graph characterize functional hubs in various cellular facets (15,16) and functional homogeneity (17). Domain distance was proposed to measure the similarity between two DAs for investigating protein evolution. The number of mismatched domains in the alignment relates to the number of evolutionary events (18), and proteins having the same DA tend to have evolved from the same ancestor (19).

Several web servers for protein domain analysis and visualization are available. Among them are CDART (20), PDART (21), PfamAlyzer (22) and DAhunter (23), all of which mainly support homology searches based on domain architectures. The CADO (17) web server allows a user to query a domain graph and compare domain combinations among the organisms in its built-in database. The TreeDomViewer (24) web server provides a visualization tool that overlays protein domain information on a phylogenetic tree. The PhyloDome (25) web server provides a quick visualization of the lineage-specific distribution of protein domains.

In this article, we propose a new web server, d-Omix, which differs from previously developed servers in two respects. First, it integrates various analyses of domain combinations into a unified and comparative platform. Second, all services except the building of a putative protein network can be used with various domain search tools.

WEB SERVER IMPLEMENTATION

The d-Omix web server is organized into five sections: a Data tab for data submission and four service tabs, namely the Tree tab for comparative protein evolution based on domain distances; the Graph tab for comparative domain combination based on domain graphs; the Alignment tab for comparative proteomes based on domain architecture alignments; and the Interaction tab for building a putative protein interaction network from DDIs.

Data submission

The d-Omix web server requires an InterProScan (26) file in raw format as input. Under the Data tab, users may upload multiple files and merge some of them for comparative analyses across protein sets (e.g. among pathways in the same organism or among organisms for the same pathway). InterProScan files generated from the proteomes of model organisms with sequenced genomes are normally available (e.g. TAIR8_all.domains of Arabidopsis thaliana (Arabidopsis) from http://www.arabidopsis.org/, all.interpro of TIGR Rice release 6 from http://rice.Plantbiology.msu.edu/). Users with only protein sequences can also prepare the InterProScan file using the ‘Prepare InterProScan file’ feature. Figure 1A shows the Data tab with the Example1 data sets, which contain proteins from the Arabidopsis and rice proteomes that are related by DAs to three microRNA-processing proteins in Arabidopsis: DCL1 (AT1G01040), AGO1 (AT1G48410) and DRB4 (AT3G62800).
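The step from a raw InterProScan file to per-protein domain architectures can be sketched as follows. This is a minimal illustration, not the server's code: it assumes the tab-separated raw layout of InterProScan 4 (protein ID in column 1, analysis method in column 4, member-database entry ID in column 5, start and end in columns 7 and 8). The function name `read_domain_architectures` is hypothetical, and the checksum, coordinates, E-values and dates in the sample rows are invented placeholders.

```python
from collections import defaultdict


def read_domain_architectures(lines):
    """Group domain hits by (method, protein), ordered N- to C-terminal.

    Assumed raw columns (tab-separated): 0 protein ID, 3 analysis method,
    4 member-database entry ID, 6 start, 7 end.
    """
    hits = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip blank or malformed rows
        protein, method, domain = fields[0], fields[3], fields[4]
        start = int(fields[6])
        hits[(method, protein)].append((start, domain))
    # Sorting by start position yields the domain architecture (DA).
    return {key: [dom for _, dom in sorted(found)] for key, found in hits.items()}


# Placeholder rows: only the protein and Pfam identifiers come from the text.
sample = [
    "AT1G48410.1\tCRC\t1048\tHMMPfam\tPF02171\tPiwi\t690\t1000\t1e-90\tT\t01-Jan-2009",
    "AT1G48410.1\tCRC\t1048\tHMMPfam\tPF02170\tPAZ\t380\t515\t1e-40\tT\t01-Jan-2009",
    "AT1G48410.1\tCRC\t1048\tHMMPfam\tPF08699\tDUF1785\t325\t377\t1e-20\tT\t01-Jan-2009",
]
das = read_domain_architectures(sample)
print(das[("HMMPfam", "AT1G48410.1")])
```

Keying the result by analysis method as well as protein keeps the architectures from different domain search tools separate, which mirrors the per-tool result tabs described for the server.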

Figure 1.

Screenshots of the d-Omix web interface. (A) Data tab with Example1 data sets. (B) The DA-based tree generated from the mergefile between the TAIR8 and TIGR6 data sets in Example1. (C) The alignment results between the TAIR8 and TAIR8_same data sets, which are the same set of proteins from Arabidopsis, and between Arabidopsis as source (TAIR8) and rice as target (TIGR6) data sets. (D) The domain graph built from the mergefile in Example1. The highlighted node in the domain graph corresponds to the highlighted row in the summary table on the left. Colors of the edges in the graph indicate different sources of protein sets. (E) A putative protein network of the TAIR8_select data set with the detailed DDIs between the DCL1 (AT1G01040.1) and DRB4 (AT3G62800.1) proteins.

All d-Omix services operate on input data sets selected and/or merged in the Data tab, together with the eleven domain search tools incorporated in InterProScan. Users may choose only some data sets and domain search tools for a specific run. An analysis of a large data set will be batched. The results of all services are presented as a series of tabs, one per chosen data set, for the highlighted search tool. Users may switch the representation between protein domain ID (e.g. PF03368) and AC (e.g. DUF283) if both are available in the input data files. Clicking a protein domain links to its corresponding online database.

Comparative protein evolution

The Tree tab enables users to explore common ancestors, conservation, or lineage-specific DAs among proteins. It uses PHYLIP (27) to construct a phylogenetic tree for a selected protein set from a distance matrix of DA scores calculated for all pairs of proteins. CLUSTALW (28) is also incorporated to build an alternative phylogenetic tree based on global sequence alignments. The DA-based tree complements the sequence-based tree. It reveals the closest neighbor of each domain architecture and efficiently categorizes multi-domain proteins that are distantly related or contain ‘promiscuous domains’ (18). Promiscuous domains such as PF00017 (SH2) and PF00400 (WD40) are small, versatile and typically repetitive, and occur in proteins with a variety of functions (9). Users may compare trees generated from different domain search tools (e.g. hmmpfam, hmmsmart, etc.) or distance matrices (e.g. DA-based, sequence-based). Proteins with the same or similar DAs are clustered together. Colors of proteins in trees indicate their source data sets. Users may export trees in SVG, JPG, BMP or NEWICK format and edit the tree using PhyloWidget (29). The DA-based tree built from the mergefile data set in Example1 (Figure 1B) reveals the conservation of Dicer and Argonaute proteins between Arabidopsis and rice. The clustered sets are categorized by their detailed DAs, whose differences might be caused by domain insertion/deletion, suggesting possible functional modifications. In addition, the tree suggests specific co-occurrences of the PAZ (PF02170), DUF1785 (PF08699) and Piwi (PF02171) domains in the cluster of Argonaute proteins and of the PAZ and DUF283 domains in the Dicer proteins.
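As an illustration of how pairwise DA scores could be handed to a distance-based PHYLIP program (for example neighbor), the sketch below writes a square distance matrix in PHYLIP's text layout. The function name `phylip_distance_matrix` is hypothetical and the score values are arbitrary placeholders rather than computed results; how d-Omix itself invokes PHYLIP is not described in the text.

```python
def phylip_distance_matrix(names, pair_scores):
    """Format symmetric pairwise scores as a square PHYLIP distance matrix."""

    def dist(a, b):
        if a == b:
            return 0.0
        # Scores are symmetric, so accept either key orientation.
        if (a, b) in pair_scores:
            return pair_scores[(a, b)]
        return pair_scores[(b, a)]

    rows = [f"{len(names):5d}"]  # first line: number of taxa
    for a in names:
        label = a[:10].ljust(10)  # strict PHYLIP names are 10 characters wide
        rows.append(label + " ".join(f"{dist(a, b):.4f}" for b in names))
    return "\n".join(rows) + "\n"


# Placeholder scores for three loci named in the text (values are invented).
names = ["AT1G01040", "AT1G48410", "AT3G62800"]
scores = {
    ("AT1G01040", "AT1G48410"): 0.5,
    ("AT1G01040", "AT3G62800"): 1.0,
    ("AT1G48410", "AT3G62800"): 1.0,
}
matrix_text = phylip_distance_matrix(names, scores)
print(matrix_text)
```

Because strict PHYLIP labels are limited to 10 characters, identifiers with a splice-variant suffix such as AT1G01040.1 would be truncated; a real pipeline would probably map proteins to short labels first and restore the full names in the resulting tree.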

Comparative proteome

The Alignment tab enables users to compare the similarity and explore the diversification of proteins based on domain architectures within and across data sets. It calculates DA scores for all pairs of proteins between source and target data sets, analogous to BLAST but with DA-based comparison. Users may limit the alignment results using the DA score and a hit limit; a lower DA score indicates more similar DAs. The alignment results are summarized in a table, where each row shows a protein name with its DA from the source data set and the number of hit proteins with satisfying DA scores from each target data set. The number of hits suggests DA conservation, proteins with redundant or related functions, and possible annotations for unknown proteins. To explore the alignments in detail, users may click on the hit number for further information. Figure 1C shows the alignment results within the same set of proteins from Arabidopsis and between Arabidopsis as source and rice as target data sets. Results with exactly matched DAs (DA score = 0) show that most Arabidopsis proteins hit some rice proteins with the same DA. Arabidopsis and rice have 11 and 22 proteins, respectively, with exactly the same DA as the AGO1 (AT1G48410) protein of Arabidopsis.
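The source-against-target search with a score cutoff and hit limit can be sketched as below. The names `find_hits` and `exact_match_score`, the protein labels and the toy DAs are all hypothetical; the scorer passed in is a stand-in that returns 0 only for identical DAs, which corresponds to the exact-match case (DA score = 0) and not to the server's full scoring.

```python
def find_hits(source_das, target_das, score, max_score=0.0, hit_limit=None):
    """For each source protein, list target proteins with score <= max_score.

    Hits are sorted from most to least similar (lowest score first) and
    truncated to hit_limit when one is given.
    """
    results = {}
    for s_name, s_da in source_das.items():
        scored = [(score(s_da, t_da), t_name) for t_name, t_da in target_das.items()]
        hits = sorted(hit for hit in scored if hit[0] <= max_score)
        results[s_name] = hits[:hit_limit]
    return results


def exact_match_score(da1, da2):
    """Stand-in scorer: 0.0 for identical DAs, 1.0 otherwise."""
    return 0.0 if da1 == da2 else 1.0


# Toy data sets; protein labels are made up for illustration.
source = {"ago_like": ["DUF1785", "PAZ", "Piwi"]}
target = {
    "rice_A": ["DUF1785", "PAZ", "Piwi"],
    "rice_B": ["PAZ", "Piwi"],
}
hits = find_hits(source, target, exact_match_score, max_score=0.0)
print(hits)
```

Passing the scorer as a parameter keeps the search logic independent of how DA similarity is computed, so the same loop could be reused with a graded score and a looser cutoff.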

Comparative domain combination

The Graph tab builds domain graphs (15) that enable users to (i) investigate the versatility and abundance of protein domains and domain pairs, (ii) explore the modularity of protein domains based on the clustering coefficient (30) and (iii) compare shared and specific domain pairs across data sets. The results include a summary table with sortable versatility and abundance values for all domains occurring in the protein sequences of a selected data set and domain search tool. Clicking a domain in the summary table highlights its corresponding node and neighbors (co-present domains) in the domain graph on the right and shows the clustering coefficient. Domains in a small cluster with a clustering coefficient close to 1 tend to have high functional homogeneity (17). The number of neighbors of a domain in the graph represents the versatility of the domain. The most versatile domains tend to be functional centers in different biological aspects (15,16). Clicking a co-presence abundance number in the summary table, or an edge label in the graph, provides the corresponding protein list for the domain pair. An arrowed edge in a domain graph indicates that both domains are present in consecutive order from the N- to the C-terminus. Users may save domain graphs in SVG, JPG, BMP or DOT format and further explore a large graph using ZGRViewer (31), which offers smooth zooming. The domain graph built from the mergefile in Example1 is shown in Figure 1D. The functions of the DUF283 domain and its neighbors (e.g. PAZ, dsrm) tend to be homogeneous. This agrees with the previous report that the DUF283 domain contains a double-stranded RNA-binding fold and is involved in siRNA/miRNA selection (32). The co-presence of the DUF1785, Piwi and PAZ domains in both Arabidopsis and rice proteins suggests their related functions in RNA silencing by AGO1.
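The graph measures named above can be sketched as follows. This is an illustration under stated assumptions: abundance is taken as the number of proteins containing a domain, pair abundance as the number of proteins containing both domains, and the clustering coefficient as the standard fraction of neighbor pairs that are themselves connected. The function names and the toy architectures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_domain_graph(das):
    """Return (neighbors, abundance, pair_abundance) for a set of DAs."""
    neighbors = defaultdict(set)
    abundance = defaultdict(int)
    pair_abundance = defaultdict(int)
    for da in das.values():
        domains = sorted(set(da))  # repeats within one protein count once
        for d in domains:
            abundance[d] += 1
            _ = neighbors[d]  # make sure isolated domains still get a node
        for a, b in combinations(domains, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
            pair_abundance[(a, b)] += 1
    return neighbors, abundance, pair_abundance


def clustering_coefficient(neighbors, domain):
    """Fraction of pairs of neighbors that are connected to each other."""
    nbrs = sorted(neighbors[domain])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return 2.0 * links / (k * (k - 1))


# Toy architectures using domain names from the text (not real proteins).
toy = {
    "p1": ["DUF283", "PAZ", "dsrm"],
    "p2": ["DUF1785", "PAZ", "Piwi"],
    "p3": ["dsrm"],
}
neighbors, abundance, pair_abundance = build_domain_graph(toy)
print("versatility of PAZ:", len(neighbors["PAZ"]))
print("clustering coefficient of DUF283:", clustering_coefficient(neighbors, "DUF283"))
```

In this toy set PAZ is the versatile node because it co-occurs with every other multi-domain partner, while DUF283 sits in a small, fully connected cluster.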

Building putative protein interaction

The Interaction tab allows users to investigate possible PPIs for an input protein set. It builds a putative protein interaction network based on DDIs from DOMINE (33). Each edge of a putative PPI represents an existing DDI between the two proteins, and its color denotes the DDI confidence level from DOMINE. Users may filter the network based on these confidence levels. The DA alignment detail on the right shows the DAs of all participating proteins in the network on the left. Clicking a PPI in the network limits the DA alignment detail on the right to the DAs of the two proteins of that PPI. Clicking a domain involved in the DDIs between the two proteins highlights the domain and its interacting partners. The DDI tab lists all source DDIs of the current putative protein interaction network. Users may filter the network to focus on a specific domain of interest and its DDIs. All PPIs in the protein network are listed under the PPI tab and interactively updated according to the filtering conditions. A larger number of DDIs with a high confidence level between a protein pair suggests a higher chance of protein interaction. All DDIs of a PPI are shown under the DDIofPPI tab when the number of all DDIs is clicked. As with domain graphs, users may send the current protein network to ZGRViewer for smooth zooming. Figure 1E shows a putative protein network of selected proteins from Arabidopsis (the TAIR8_select data set in Example1), where each protein is the representative of a cluster or group of proteins with the same DA resulting from the DA-based tree and DA alignment. The possible PPI between the DCL1 (AT1G01040.1) and DRB4 (AT3G62800.1) proteins comes from five DDIs from DOMINE, three of which have a high DDI confidence level. DRB4 and HYL1 (AT1G09700.1) have been reported to interact with DCL4 (AT5G20320.1) and DCL1, respectively (34). DCL4 has the same DA as DCL1, while HYL1 has the same DA as DRB4.
While a putative PPI might not identify the exact participating partners, it suggests and/or narrows down possible partners and their related domains for the interaction.
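Deriving putative PPIs from DDIs can be sketched as follows: two proteins are linked whenever some domain of one has a listed interaction with some domain of the other. The function names, the domain labels D1 to D4, the proteins P1 to P3 and the confidence labels "high" and "low" are placeholders for illustration and are not taken from DOMINE.

```python
from itertools import combinations


def putative_ppis(das, ddis):
    """Link protein pairs that share at least one known DDI.

    ddis maps frozenset({domain_a, domain_b}) to a confidence label.
    Returns {(protein1, protein2): [(domain1, domain2, confidence), ...]}.
    """
    network = {}
    for (p1, da1), (p2, da2) in combinations(sorted(das.items()), 2):
        support = []
        for a in sorted(set(da1)):
            for b in sorted(set(da2)):
                conf = ddis.get(frozenset((a, b)))
                if conf is not None:
                    support.append((a, b, conf))
        if support:
            network[(p1, p2)] = support
    return network


def filter_by_confidence(network, allowed):
    """Keep only DDIs whose confidence is allowed; drop PPIs left empty."""
    kept = {}
    for pair, support in network.items():
        remaining = [ddi for ddi in support if ddi[2] in allowed]
        if remaining:
            kept[pair] = remaining
    return kept


# Placeholder proteins, domains and DDIs.
toy_das = {"P1": ["D1", "D2"], "P2": ["D3"], "P3": ["D4"]}
toy_ddis = {frozenset(("D1", "D3")): "high", frozenset(("D2", "D3")): "low"}
network = putative_ppis(toy_das, toy_ddis)
print(network)
print(filter_by_confidence(network, {"high"}))
```

Keeping the full list of supporting DDIs per protein pair makes it straightforward to count them or to filter by confidence, as described for the PPI and DDIofPPI tabs.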

METHODS

DA score

The DA score measures the similarity between two protein sequences based on the alignment of their DAs. With one protein as source and the other as target, their protein domain units are compared in order from the N-terminus to the C-terminus and scored with the following function. We define the Matched Domain Ratio for Pfam (MDR_Pfam) as the number of matched Pfam domains, in conserved order, between proteins P1 and P2 divided by the total number of Pfam domains in the longer DA, denoted len(longer_DA). Gap is the number of gaps inserted during the alignment of the two DAs. The number 10 in the above equation is introduced to keep gap penalties small. Small gap penalties are needed to avoid sporadic gaps in long repeating regions (18). A DA score between two proteins is calculated separately for each source of protein domains (e.g. Pfam, SUPERFAMILY); the lower the DA score, the more similar the DAs of the two proteins. The DA score is fundamental to both the tree and alignment services.
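The quantities defined above can be illustrated with a small sketch. The scoring equation itself is not reproduced in this text, so the combination used below, (1 − MDR) + Gap/10, is an assumed form chosen only so that identical DAs score 0 and lower values mean more similar DAs; it should not be read as the published formula. Matches in conserved order are counted with a longest common subsequence, and Gap is approximated as the length difference between the two DAs; both choices are simplifications, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two domain lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def da_score_sketch(da1, da2):
    """Assumed form: (1 - MDR) + Gap / 10. Not the published equation."""
    longer = max(len(da1), len(da2))
    if longer == 0:
        return 0.0  # two empty DAs are treated as identical
    mdr = lcs_length(da1, da2) / longer  # matched domains over len(longer_DA)
    gap = abs(len(da1) - len(da2))  # simplification: minimum gaps needed
    return (1.0 - mdr) + gap / 10.0


print(da_score_sketch(["PAZ", "DUF1785", "Piwi"], ["PAZ", "DUF1785", "Piwi"]))
print(da_score_sketch(["PAZ", "DUF1785", "Piwi"], ["PAZ", "Piwi"]))
```

Dividing the gap count by 10 keeps a single gap from outweighing a matched domain, which is the role the text gives to the constant.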

Domain graph

A domain graph is an undirected graph in which each vertex represents a protein domain and an edge between two domains indicates the co-presence of the two domains on at least one protein sequence (15). d-Omix builds a domain graph according to this definition and extends it with direction. A domain graph is drawn using GraphViz (35) via the PHP GraphViz extension.
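A domain graph extended with direction could be written in GraphViz DOT syntax roughly as follows. The sketch is in Python rather than the PHP used by the server, the function name `domain_graph_dot` is hypothetical, and the toy architecture only reuses domain names from the text. Consecutive domains get an arrowed edge; other co-present pairs get an edge with `dir=none`, the GraphViz attribute that suppresses arrowheads.

```python
def domain_graph_dot(das):
    """Render DAs as DOT: arrows for consecutive domains, plain edges otherwise."""
    consecutive = set()
    copresent = set()
    for da in das.values():
        # Adjacent domains in N- to C-terminal order; self-pairs from repeats are skipped.
        consecutive.update((a, b) for a, b in zip(da, da[1:]) if a != b)
        uniq = sorted(set(da))
        copresent.update((a, b) for i, a in enumerate(uniq) for b in uniq[i + 1:])
    lines = ["digraph domain_graph {"]
    for a, b in sorted(consecutive):
        lines.append(f'  "{a}" -> "{b}";')
    for a, b in sorted(copresent):
        if (a, b) not in consecutive and (b, a) not in consecutive:
            lines.append(f'  "{a}" -> "{b}" [dir=none];')
    lines.append("}")
    return "\n".join(lines)


# Toy architecture for illustration only.
dot_text = domain_graph_dot({"p1": ["DUF283", "PAZ", "dsrm"]})
print(dot_text)
```

The resulting text can be rendered by any GraphViz layout program or opened in a DOT viewer.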

FUNDING

Cluster Program Management Office (CPMO) [grant FC0033 B21 (SPA B1-1)]; National Science and Technology Development Agency (NSTDA), Thailand. Funding for open access charge: Cluster Program Management Office (FC0033 B21).

Conflict of interest statement. None declared.

32 in total

1. An alignment-free domain architecture similarity search (ADASS) algorithm for inferring homology between multi-domain proteins.

Authors: Divya P Syamaladevi; Adwait Joshi; Ramanathan Sowdhamini
Journal: Bioinformation Date: 2013-06-08

1. Detecting protein function and protein-protein interactions from genome sequences.

2. Scale-free behavior in protein domain networks.

3. Annotation transfer for genomics: measuring functional divergence in multi-domain proteins.

4. CDART: protein homology by domain architecture.

5. Predicting protein cellular localization using a domain projection method.

6. Comparative analysis of protein domain organization.

7. PreSPI: a domain combination based prediction system for protein-protein interaction.

8. Specific interactions between Dicer-like proteins and HYL1/DRB-family dsRNA-binding proteins in Arabidopsis thaliana.

9. PhyloDome--visualization of taxonomic distributions of domains occurring in eukaryote protein sequence sets.

10. InterProScan: protein domains identifier.
