Micah Hamady, Jeremy Widmann, Shelley D Copley, Rob Knight.
Abstract
MotifCluster finds related motifs in a set of sequences, and clusters the sequences into families using the motifs they contain. MotifCluster, at http://bmf.colorado.edu/motifcluster, lets users test whether proteins are related, cluster sequences by shared conserved motifs, and visualize motifs mapped onto trees, sequences and three-dimensional structures. We demonstrate MotifCluster's accuracy using gold-standard protein superfamilies; using recommended settings, families were assigned to the correct superfamilies with 0.17% false positive and no false negative assignments.
Year: 2008 PMID: 18706079 PMCID: PMC2575518 DOI: 10.1186/gb-2008-9-8-r128
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Summary of key features of MotifCluster and a selection of other programs that perform clustering of motifs or remote homology detection
| Strategy | Program | Overview of program | Publication |
| Clustering proteins by motifs they contain | MotifCluster | Takes aligned or unaligned protein and nucleotide sequences and a MEME file showing motifs; allows clustering of the sequences according to the motifs they contain, and visualization of the motifs on the aligned and unaligned sequences and three-dimensional structures | This article |
| Clustering of transcription factor binding sites (in DNA) | MCAST | Takes list of transcription factor binding sites as input: uses hidden Markov models to find | [ |
| | Cluster-Buster | Takes list of transcription factor binding sites as input: uses Forward algorithm and expected uniform distribution to find motif co-occurrence in DNA | [ |
| | ClusterDraw | Takes list of transcription factor binding sites as input: uses r-scan algorithm and sweep over parameter values to visualize significant clusters as peaks on the DNA sequence | [ |
| | COMET | Calculates significance of collection of position-specific score matrices that appear in order: can apply to DNA or protein, in principle | [ |
| | PEAKS | Calculates significance of collection of transcription factor binding sites that appear at specified distance from transcription start site or other feature in the DNA | [ |
| | CompMoby | Aligns all pairs of motifs that appear significant in different promoters, then groups these into clusters using the CAST algorithm: DNA-specific | [ |
| | CREME | Identifies groups of DNA motifs that co-occur significantly within a defined distance using both order-dependent and order-independent models | [ |
| | PHYLOCLUS | Uses Bayesian method to find clusters of evolutionarily conserved DNA motifs that appear in different promoters | [ |
| | INCLUSive | Clusters genes based on microarray analysis: feeds promoters to Gibbs sampler to find DNA motifs overrepresented in each cluster | [ |
| Identifying kernels for SVMs* | SVM kernels | Introduces kernels based on k-word occurrences and best BLAST hit for SVM clustering: does not focus on conserved motifs | [ |
| | WCM (word correlation matrices) | Introduces k-word kernel for SVM clustering based on correlations in appearance of pairs of k-words: does not focus on conserved motifs | [ |
| | ODH (oligomer distance histograms) | Introduces new kernel for SVM clustering based on histograms of distances between all words in protein: does not focus on conserved motifs | [ |
| Iterative BLAST | Shotgun | BLAST-based approach for identifying remote homologs by iterative searches: not motif-based | [ |
| | DivergentSet | Among other features, can perform BLAST and PSI-BLAST versions of Shotgun and choose representative sequences of each group: not motif-based | [ |
| | Cascade PSI-BLAST | Performs iterative steps of PSI-BLAST, otherwise like Shotgun: not motif-based | [ |
| | ProClust | Performs graph-based connection of proteins based on pairwise sequence similarity: not motif-based | [ |
| k-word clustering | CD-Hit | Clusters proteins based on shared segments of overall sequence, not by motifs already known to be significant | [ |
| Profile-profile alignment | COMPASS | Performs profile-profile alignments for remote homology detection: assesses statistical significance of matches in the profiles overall, rather than specifically using shared motifs | [ |
| Clustering of motifs | STAMP | Aligns motifs with one another so that relationships among motifs can be detected; performs many other tasks for promoter characterization, but specific to promoters | [ |
| | TAMO | Performs many functions for | [ |
| | SOMBRERO | Aligns and clusters DNA motifs with one another to improve transcription factor binding site searches | [ |
| Identification of functions in labeled structures | FunClust | Takes set of three-dimensional structures with annotated functions; identifies three-dimensional motif fragments that are common to the structures with each function | [ |
*SVMs are support vector machines, a common machine learning approach to pattern classification. A kernel is a function that calculates the inner product of all pairs of input vectors in an abstract space, which is an important step in the process and affects the clustering.
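To make the footnote concrete, the following is a minimal sketch of a k-word kernel, assuming the simplest possible feature map (counts of overlapping k-words). It is an illustration only and is not the kernel used by any program in the table; the function names `kword_counts`, `spectrum_kernel` and `gram_matrix` are hypothetical.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping k-word (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=2):
    """Inner product of the k-word count vectors of sequences a and b."""
    counts_a = kword_counts(a, k)
    counts_b = kword_counts(b, k)
    # Words absent from b contribute 0, because Counter returns 0 for them.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def gram_matrix(seqs, k=2):
    """Kernel value for every pair of input sequences (the matrix an SVM consumes)."""
    return [[spectrum_kernel(a, b, k) for b in seqs] for a in seqs]
```

The resulting matrix is symmetric, and each diagonal entry is the squared length of that sequence's k-word count vector.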
Figure 1. Methods for measuring distances between sequences using motif information. (a) Common fraction score; (b) Longest common substring score; (c) Needleman-Wunsch alignment score; (d) delta-delta score.
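The first two measures in Figure 1 can be sketched as follows, assuming each sequence has already been reduced to an ordered list of motif identifiers. The normalizations are assumptions made for illustration (a Jaccard-style ratio for the common fraction, and division by the longer list for the longest common substring); the exact definitions used by MotifCluster are not given here, and all function names are hypothetical.

```python
def common_fraction_distance(a, b):
    """1 - (motifs shared by a and b) / (motifs present in either); an assumed definition."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def longest_common_substring(a, b):
    """Length of the longest contiguous run of motifs that occurs in both lists."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous motif of both lists.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def lcs_distance(a, b):
    """1 - (longest shared run) / (length of the longer motif list); an assumed normalization."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - longest_common_substring(a, b) / longest
```

For example, the motif lists `["m1", "m2", "m3", "m4"]` and `["m2", "m3", "m5"]` share two of five distinct motifs and a contiguous run of length two.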
Figure 2. Clustering of motifs found in 80 members of the RpiA and RpiB families of ribose 5-phosphate isomerases. The blue box encloses RpiAs, and the red box encloses RpiBs.
Figure 3. Graph representation of clusters generated from motifs identified in (a) members of the dihydroorotase and (b) β-phosphoglucomutase families, which belong to separate superfamilies, and (c) members of the 2-haloacid dehalogenase (blue) and β-phosphoglucomutase (red) families, which belong to the same superfamily. The families can be subdivided further into additional groups by increasing the edge threshold.
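The effect of the edge threshold can be illustrated with a small sketch. It assumes that two sequences are joined by an edge when their pairwise similarity is at least the threshold, so that raising the threshold removes edges and splits components; this reading, the similarity values and the function name `connected_components` are assumptions for illustration rather than MotifCluster's implementation.

```python
def connected_components(n, similarity, threshold):
    """Group n sequences into components; an edge joins i and j when
    similarity[i][j] >= threshold (assumed edge rule)."""
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Largest component first, matching the idea of a "leading" component.
    return sorted(groups.values(), key=lambda g: (-len(g), g))


similarity = [
    [1.0, 0.9, 0.4, 0.1],
    [0.9, 1.0, 0.5, 0.1],
    [0.4, 0.5, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
loose = connected_components(4, similarity, 0.3)
strict = connected_components(4, similarity, 0.6)
```

With these made-up values, the looser threshold keeps sequences 0, 1 and 2 in one component, while the stricter threshold separates sequence 2 from the other two.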
Figure 4. Summary of the performance of MotifCluster using motifs found by MEME and the Gibbs sampler for 741 pairs of families in the gold-standard set of families. (a) Incorrect inferences of superfamily assignment. (b) Failure to assign sequences to the leading component (for members of the same superfamily) or to one of the two leading components (for members of two different superfamilies). The numbers 1 and 2 in the legend (for example, Gibbs 1 and Gibbs 2) refer to the two largest components, which invariably contain most of the sequences from the two distinct families when these families belong to different superfamilies.
Figure 5. Analysis of haloacid dehalogenases. (a) Clustering of motifs in the haloacid dehalogenase and β-phosphoglucomutase families of the haloacid dehalogenase superfamily. (b) Sequences of the three shared motifs, with highly conserved and mechanistically important residues highlighted by MotifCluster.
Figure 6. Motifs identified by MEME mapped onto the crystal structures of (a) haloacid dehalogenase [PDB:1QQ7] and (b) β-phosphoglucomutase [PDB:1O03] by MotifCluster.
Figure 7. Active site regions of (a) haloacid dehalogenase and (b) β-phosphoglucomutase, with conserved residues highlighted according to the motif color scheme shown in Figure 6. Note that the side-chain coloring was added manually in PyMOL.
Figure 8. Clustering of a divergent set of 96 sequences from the Prx, Trx and CMP families. Prxs are circled in red, Trxs are circled in blue and CMPs are circled in green. In each case, both the clustering of motifs and the connected components are shown. (a) Clustering of the Prx (top right graph) and Trx (bottom right graph) families using the NW module alignment 1-(actual/max) score; (b) clustering of the Prx, Trx, and CMP families using the NW module alignment 1-(actual/max) score. The 34 Trx sequences range from 89-578 residues in length (average 141) and are 48.6% identical on average. The 40 Prx sequences range from 133-321 residues in length (average 180) and are 43.8% identical on average. The 22 CMP sequences range from 121-403 residues in length (average 194) and are 44.7% identical on average.
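The 1-(actual/max) score named in the Figure 8 legend can be sketched as a Needleman-Wunsch global alignment over motif lists. The scoring values (match +1, mismatch -1, gap -1), the choice of "max" as the score of the longer list aligned to itself, and the clamping of negative scores to zero are all assumptions made for illustration; the function names are hypothetical.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of two motif lists (scores only, no traceback)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * gap] + [0] * len(b)
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def nw_distance(a, b, match=1, mismatch=-1, gap=-1):
    """1 - (actual / max), with max taken as a perfect self-alignment
    of the longer list (an assumed normalization)."""
    max_score = match * max(len(a), len(b))
    if max_score <= 0:
        return 0.0
    # Clamp negative alignment scores so the distance stays within [0, 1].
    actual = max(0, needleman_wunsch_score(a, b, match, mismatch, gap))
    return 1.0 - actual / max_score
```

Under these assumptions, identical motif lists have distance 0, and lists with no alignable motifs have distance 1.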
Figure 9. Phylogenetic tree of 96 sequences from the Prx, Trx and CMP families. This figure shows clustering of the Prx, Trx, and CMP families using the phylogenetic tree generated by MUSCLE. See legend to Figure 8 for details of the sequence families and display.