Mutharasu Gnanavel, Prachi Mehrotra, Ramaswamy Rakshambikai, Juliette Martin, Narayanaswamy Srinivasan, Ramachandra M Bhaskara.
Abstract
BACKGROUND: The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Because far more protein sequences than structures are available, tools that can generate functionally homogeneous clusters from sequence information alone are of great importance. Traditional alignment-based tools, which cluster on the basis of sequence similarity, work well in most cases. For multi-domain proteins, however, alignment quality may be poor owing to varied protein lengths, domain shuffling or circular permutations. Since multi-domain proteins are ubiquitous in nature, alignment-free tools that overcome these shortcomings of alignment-based comparison are required. Further, existing tools classify proteins using only domain-level information and hence miss the information encoded in the tethered regions or accessory domains. Our method, in contrast, takes the full-length sequence of a protein into account, consolidating the complete sequence information to understand a given protein better.
Year: 2014 PMID: 25282152 PMCID: PMC4287353 DOI: 10.1186/1471-2105-15-343
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic of the CLAP server. Left panel: the inputs to the server are a set of n protein sequences (FASTA format), an optional tree-parsing cut-off ‘×’ between 0 and 1, and an optional tab-delimited file containing domain architecture details for each protein. Middle panel: a pairwise sequence comparison is performed using the Local Matching Scores method and a normalized distance matrix is computed. Right panel: this distance matrix is subjected to hierarchical clustering using Ward's method. The resulting dendrogram is parsed using the user-specified cut-off ‘×’. The clusters obtained are analyzed for similarities in domain architectures.
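The three stages in the schematic (pairwise comparison, Ward clustering, dendrogram parsing) can be sketched roughly as follows. This is an illustration under stated assumptions, not the server's implementation: the Local Matching Scores method is not described in this record, so a k-mer Jaccard distance is swapped in as a stand-in; the function names (`kmer_set`, `distance_matrix`, `ward_linkage`, `cut_tree`) are hypothetical; and reading the cut-off as a fraction of the tallest merge height is a guess at how the dendrogram might be parsed.

```python
from itertools import combinations

def kmer_set(seq, k=3):
    """Set of overlapping k-mers (words of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def distance_matrix(seqs, k=3):
    """Pairwise Jaccard distances in [0, 1]. Stand-in only, NOT the LMS score."""
    n = len(seqs)
    sets = [kmer_set(s, k) for s in seqs]
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        union = sets[i] | sets[j]
        sim = len(sets[i] & sets[j]) / len(union) if union else 1.0
        d[i][j] = d[j][i] = 1.0 - sim
    return d

def ward_linkage(d):
    """Agglomerative clustering with Ward's criterion (Lance-Williams update).

    Returns merges as (members_a, members_b, height), in merge order.
    """
    n = len(d)
    clusters = {i: (i,) for i in range(n)}  # cluster id -> member indices
    # Ward's update is applied to squared distances; keys are (low id, high id).
    d2 = {(i, j): d[i][j] ** 2 for i, j in combinations(range(n), 2)}
    merges, next_id = [], n
    while len(clusters) > 1:
        (a, b), dab = min(d2.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for c, members in clusters.items():
            if c in (a, b):
                continue
            nc = len(members)
            dac = d2[tuple(sorted((a, c)))]
            dbc = d2[tuple(sorted((b, c)))]
            updated[c] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        # Drop every distance that involves the two merged clusters.
        d2 = {key: v for key, v in d2.items() if a not in key and b not in key}
        left, right = clusters.pop(a), clusters.pop(b)
        clusters[next_id] = left + right
        for c, v in updated.items():
            d2[(c, next_id)] = v  # new ids are always the largest
        merges.append((left, right, dab ** 0.5))
        next_id += 1
    return merges

def cut_tree(merges, n, x):
    """Apply merges whose height is <= x * (tallest merge). Assumed semantics."""
    limit = x * merges[-1][2]
    clusters = {frozenset([i]) for i in range(n)}
    for left, right, height in merges:
        if height > limit:
            break
        clusters -= {frozenset(left), frozenset(right)}
        clusters.add(frozenset(left + right))
    return sorted(sorted(c) for c in clusters)

# Toy input: two pairs of near-identical sequences (made up for illustration).
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSGGSGGSG", "GGSGGSGGSA"]
merges = ward_linkage(distance_matrix(seqs, k=3))
print(cut_tree(merges, len(seqs), x=0.5))
```

Because Ward merge heights never decrease, the cut can simply stop at the first merge above the limit. With real data, a library routine for Ward linkage would normally replace the hand-written loop.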
Clustering of different data-sets of small, medium and large sized protein sequences using different methods
| Method | Number of clusters | Threshold | Word length | Processing time |
| --- | --- | --- | --- | --- |
| CW | 15 | 0.5 | NA | 0 m 11.835 s |
| k-tuple | 3 | 0.5 | 2 | 0 m 1.539 s |
| CLAP | 7 | 0.5 | 5 | 2 m 28.322 s |
| CLUSS | 68 | NA | 4 | 0 m 11.000 s |
| CD-HIT | 223 | 0.5 | 3 | 0 m 0.034 s |
| | | | | |
| CW | 23 | 0.5 | NA | 0 m 59.788 s |
| k-tuple | 3 | 0.5 | 2 | 0 m 5.659 s |
| CLAP | 17 | 0.5 | 5 | 9 m 52.099 s |
| CLUSS | NA | NA | NA | 0 m 11.000 s |
| CD-HIT | 607 | 0.5 | 3 | 0 m 0.091 s |
| | | | | |
| CW | 2 | 0.5 | NA | 8 m 46.895 s |
| k-tuple | 3 | 0.5 | 2 | 0 m 2.25 s |
| CLAP | 3 | 0.5 | 5 | 2 m 50.918 s |
| CLUSS | 95 | NA | 4 | 0 m 3.133 s |
| CD-HIT | 227 | 0.5 | 3 | 0 m 0.592 s |
| | | | | |
| CW | 5 | 0.5 | NA | 32 m 50.379 s |
| k-tuple | 2 | 0.5 | 2 | 0 m 7.789 s |
| CLAP | 7 | 0.5 | 5 | 11 m 12.664 s |
| CLUSS | NA | NA | NA | NA |
| CD-HIT | 708 | 0.5 | 3 | 0 m 3.281 s |
| | | | | |
| CW | 15 | 0.5 | NA | 42 m 1.184 s |
| k-tuple | 4 | 0.5 | 2 | 0 m 2.91 s |
| CLAP | 4 | 0.5 | 5 | 4 m 22.752 s |
| CLUSS | NA | NA | NA | NA |
| CD-HIT | 125 | 0.5 | 3 | 0 m 0.916 s |
The processing time was measured on the workstation that hosts the CLAP web-server, which has a 2.40 GHz Intel Xeon processor and 16 GB RAM and runs CentOS. The number of clusters generated at a specific threshold and the word length used in the computations are also shown.
Figure 2. Plot of the Robinson-Foulds (RF) distance between dendrograms from CLAP and ClustalW with respect to different sizes of protein sequences.
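For readers unfamiliar with the metric, the RF distance counts the groupings that appear in one tree but not in the other. The sketch below is a minimal illustration, assuming rooted dendrograms written as nested tuples and the plain (unnormalized) count; which RF variant and tree representation were used for the figure is not stated here, and the names `clades` and `rf_distance` are hypothetical.

```python
def clades(tree):
    """Leaf sets of all internal nodes below the root of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf label
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the root clade is shared by every tree on these leaves
    return root, found

def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must be built on the same set of leaves")
    return len(clades_a ^ clades_b)

# Toy dendrograms over five made-up labels; they differ in how A, B and C group.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(tree_1, tree_2))  # 2: {A, B} and {A, C} are each unique to one tree
```

A distance of 0 means the two dendrograms imply identical groupings; larger values indicate more disagreement between the methods.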