Aydin Albayrak, Hasan H Otu, Ugur O Sezerman.
Abstract
BACKGROUND: Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.
Year: 2010 PMID: 20718947 PMCID: PMC2936399 DOI: 10.1186/1471-2105-11-428
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
General Properties of the Datasets
| Family | # of sequences | # of subfamilies | μ Length | σ Length | μ PID* |
|---|---|---|---|---|---|
| Crotonases | 467 | 13 | 332 | 87 | 21 |
| Mandelate racemases | 184 | 8 | 416 | 74 | 27 |
| Vicinal oxygen chelates | 309 | 18 | 294 | 108 | 14 |
| Haloacid dehalogenases | 195 | 14 | 303 | 137 | 12 |
| Nucleotidyl cyclases | 75 | 2 | 1059 | 200 | 21 |
| Acyl transferases | 177 | 2 | 290 | 12 | 41 |
| GH2 hydrolases | 33 | 4 | 872 | 160 | 15 |
* Mean Percent Identity (μ PID) is the average of all pairwise sequence identities in a given family.
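The footnote defines μ PID as the average over all pairwise identities. The record does not say how each pairwise identity was computed, so the sketch below is only one plausible reading: it assumes the sequences are already aligned to equal length, uses `-` as the gap character, and skips columns where both sequences have a gap. The function names are illustrative, not taken from the study.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned, equal-length sequences.

    Assumption: columns where both sequences carry a gap ('-') are ignored;
    a gap opposite a residue counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # shared gap column carries no information
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_percent_identity(aligned: list[str]) -> float:
    """Average percent identity over all unordered sequence pairs."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    toy_family = ["AAAA", "AAAT", "AATT"]  # toy data, not from the paper
    print(round(mean_percent_identity(toy_family), 1))
```

Other conventions for the denominator (for example, the length of the shorter sequence) would give different values, so the numbers in the table should not be expected to match this sketch exactly.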
Reduced Amino Acid Alphabets
| Scheme | Size | Matrix | Gaps# | Reference |
|---|---|---|---|---|
| ML* | 4,8,10,15 | BL50 | 12/2 | [ |
| EB§ | 13,11,9,8,5 | BL62 | 11/1 | [ |
| HSDM* | 17 | HSDM | 19/1 | [ |
| SDM* | 12 | SDM | 7/1 | [ |
| GBMR* | 4 | BL62 | 11/1 | [ |
| RANDOM§ | 4,4,4 | BL62 | 11/1 | This study |
Reduced amino acid schemes used in this study. * Substitution matrices for these reduced alphabets were obtained from reference [33]. § BL62 frequency counts were used to derive these substitution matrices using the formula outlined in reference [33]. # Gap opening/gap extension penalties used for MSAs in ClustalW2.
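Recoding a sequence with a reduced alphabet amounts to replacing every residue with a single representative of its group. The sketch below illustrates that step only. The four-group partition `HYPOTHETICAL_GROUPS` is invented for illustration and is not any of the schemes listed in the table (their groupings are not given in this record); mapping unknown residues to `X` is likewise an assumption.

```python
# Hypothetical 4-letter partition of the 20 amino acids (illustration only;
# NOT the ML4, GBMR4, or RANDOM scheme from the table).
HYPOTHETICAL_GROUPS = ["AVLIMC", "FWYH", "STNQGP", "DEKR"]


def build_recoding_table(groups: list[str]) -> dict[str, str]:
    """Map every residue to the first letter of the group that contains it."""
    table: dict[str, str] = {}
    for group in groups:
        representative = group[0]
        for residue in group:
            if residue in table:
                raise ValueError(f"residue {residue!r} appears in two groups")
            table[residue] = representative
    return table


def recode(sequence: str, table: dict[str, str]) -> str:
    """Rewrite a sequence in the reduced alphabet.

    Assumption: residues outside the table (e.g. 'B', 'Z') become 'X'.
    """
    return "".join(table.get(residue, "X") for residue in sequence.upper())


if __name__ == "__main__":
    table = build_recoding_table(HYPOTHETICAL_GROUPS)
    print(recode("AAILNAIIANNL", table))
```

Under this toy partition the recoded sequence uses at most four distinct letters, which is the property the reduced alphabets of size 4 in the table share.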
Lempel-Ziv Complexity
| Component of sequence X = AAILNAIIANNL | Component number |
|---|---|
| A | 1 |
| AI | 2 |
| L | 3 |
| N | 4 |
| AII | 5 |
| AN | 6 |
| NL | 7 |
The exhaustive library construction and Lempel-Ziv complexity score calculation of sequence X.
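The decomposition in the table can be reproduced with a short sketch. It assumes the usual exhaustive-history rule: starting at the current position, a component is extended for as long as it already occurs in the text that precedes its last character, and the complexity score is the number of components. The function names are illustrative and not taken from the study.

```python
def lz_exhaustive_history(sequence: str) -> list[str]:
    """Split a sequence into its Lempel-Ziv exhaustive-history components."""
    components: list[str] = []
    start = 0
    n = len(sequence)
    while start < n:
        length = 1
        # Grow the candidate while it can be copied from earlier text.
        # The search window excludes only the candidate's final character,
        # so overlapping copies are allowed.
        while (start + length <= n
               and sequence[start:start + length] in sequence[:start + length - 1]):
            length += 1
        # Slicing truncates at the end of the sequence, so a final
        # reproducible remainder still forms one component.
        components.append(sequence[start:start + length])
        start += length
    return components


def lz_complexity(sequence: str) -> int:
    """Lempel-Ziv complexity: the number of exhaustive-history components."""
    return len(lz_exhaustive_history(sequence))


if __name__ == "__main__":
    x = "AAILNAIIANNL"
    print(lz_exhaustive_history(x))  # ['A', 'AI', 'L', 'N', 'AII', 'AN', 'NL']
    print(lz_complexity(x))          # 7
```

The substring search makes this quadratic or worse in sequence length; it is meant to mirror the table step by step rather than to be efficient.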
Figure 1. Protocol Overview. For RCM, the original sequences and the sequences recoded with reduced alphabets are used to calculate RCM-based distances, which are then passed in turn to the Neighbor-Joining and Retree programs of the PHYLIP v3.68 package. For MSA, alignments are first carried out using ClustalW2 with the substitution matrix corresponding to each amino acid alphabet. Following bootstrap analysis with ClustalW2, the Retree program is used to root the trees with midpoint rooting and to discard branch lengths. Each phylogenetic tree is then passed to the TBC algorithm, along with its attribute file that shows the expert assignment of each sequence to each family, to calculate the TBC error.
Figure 2. Tree topology of the simulated dataset. The identical topology of the three phylogenetic trees (i.e., RCM tree, bootstrap tree and true tree) for the simulated dataset is shown.
Figure 3. Phylogenetic trees of protein families. RCM trees were drawn using the ML15 alphabet. For each family, the taxa corresponding to different subfamilies are colored differently. (A) Crotonases, (B) Mandelate racemases, (C) Vicinal oxygen chelates, (D) Haloacid dehalogenases, (E) Nucleotidyl cyclases, (F) Acyl transferases, (G) GH2 hydrolases.
TBC errors for top performing RAAA
| Alphabet | Statistic | Crotonases RCM | Crotonases MSA | Mandelate racemases RCM | Mandelate racemases MSA | Vicinal oxygen chelates RCM | Vicinal oxygen chelates MSA | Haloacid dehalogenases RCM | Haloacid dehalogenases MSA | Nucleotidyl cyclases RCM | Nucleotidyl cyclases MSA | Acyl transferases RCM | Acyl transferases MSA | GH2 hydrolases RCM | GH2 hydrolases MSA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 letter | Accuracy (%) | 100 | 100 | 100 | 100 | 91.6 | 91.3 | 93.3 | 99.5 | 100 | 100 | 91.5 | 97.2 | 87.9 | 100 |
| 20 letter | Error (%) | 0 | 0 | 0 | 0 | 8.4 | 8.7 | 6.7 | 0.5 | 0 | 0 | 8.5 | 2.8 | 12.1 | 0 |
| Top performing RAAA | Accuracy (%) | 100 | 100 | 100 | 100 | 92.2 | 91.3 | 96.9 | 99.5 | 100 | 100 | 97.2 | 97.2 | 100 | 100 |
| Top performing RAAA | Error (%) | 0 | 0 | 0 | 0 | 7.8 | 8.7 | 3.1 | 0.5 | 0 | 0 | 2.8 | 2.8 | 0 | 0 |
| Top performing RAAA | RAAA | GBMR4 | ML4 | ML4 | GBMR4 | EB8 | GBMR4 | ML15 | ML8 | ML4 | GBMR4 | ML4 | ML4 | ML4 | ML4 |
TBC accuracy and percentage of TBC error are reported for the 20-letter alphabet and the top performing RAAA. If two RAAAs with the same size have identical TBC accuracies, both RAAAs are reported at the final row in the table. Bold entries correspond to top performers using RCM and MSA for the specified datasets.