Juliana S Bernardes, Alessandra Carbone, Gerson Zaverucha.
Abstract
BACKGROUND: Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, when we deal with proteins in the "twilight zone", we observe that only some segments of the sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM).
Year: 2011 PMID: 21429187 PMCID: PMC3078102 DOI: 10.1186/1471-2105-12-83
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flowchart of the method. A) Training phase. Each sequence in the positive training set is represented through first-order logic predicates. WARMR learns logical rules on this set. These rules are converted into binary attributes in order to train propositional models; this step is called propositionalization. Next, each sequence in the positive and negative training sets is represented through these binary attributes, and finally propositional models, such as DTs or SVMs, are trained. B) Test phase. Each sequence in the positive and negative test sets is represented through the binary attributes that correspond to the logical rules learned during the training phase. Next, the propositional model is tested, and its output is divided into sequences classified as positive and sequences classified as negative.
Figure 2. Partial alignment of "Glucocorticoid receptor-like (DNA-binding domain)" superfamily sequences. Some conserved positions are highlighted: * marks completely conserved positions, • marks partially conserved positions. The alignment was built using CLUSTALW.
Logical rules constructed by WARMR.
| WARMR output |
|---|
| R1 : Homologous(A):- col(A,c,24), col(A,c,27), col(A,c,51) |
| R2 : Homologous(A):- col(A,c,45) |
| R3 : Homologous(A):- col(A,k,29) |
| R4 : Homologous(A):- hydrophobic(A,2) |
| R5 : Homologous(A):- aminoacidPairRatio(A,cg,1) |
| R6 : Homologous(A):- col(A,B,34), small(B) |
| R1 : 100% of homologous proteins have the C amino acid in positions 24, 27 and 51. |
| R2 : 70% of homologous proteins have the C amino acid in position 45. |
| R3 : 45% of homologous proteins have the K amino acid in position 29. |
| R4 : 70% of homologous proteins have between 10 and 20% of hydrophobic amino acids. |
| R5 : 77% of homologous proteins have at least 1 pair of CG. |
| R6 : 100% of homologous proteins have a small amino acid in position 34. |
Some logical rules learned from "Glucocorticoid receptor-like (DNA-binding domain)" sequences and their alignment (see Figure 2). They are shown in their original WARMR output and their interpretation is given below.
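The conversion of rules such as R1 to R3 into binary attributes (propositionalization) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the representation of a sequence as a set of ground facts, the `RULES` dictionary and the function name `propositionalize` are all assumptions made for the example, and only fully ground rules are handled (a rule with a variable, such as R6, would need an extra lookup step).

```python
# Sketch: turn WARMR-style ground rules into 0/1 attributes.
# The data layout (a set of ground facts per sequence) is an assumption.

# Each rule body is a list of conditions that must all hold for the sequence.
RULES = {
    "R1": [("col", "c", 24), ("col", "c", 27), ("col", "c", 51)],
    "R2": [("col", "c", 45)],
    "R3": [("col", "k", 29)],
}


def propositionalize(facts, rules=RULES):
    """Return one binary attribute per rule: 1 if every condition is in `facts`."""
    return [int(all(cond in facts for cond in body)) for body in rules.values()]


# Hypothetical sequence with C at columns 24, 27, 51 and K at column 29.
example_facts = {
    ("col", "c", 24),
    ("col", "c", 27),
    ("col", "c", 51),
    ("col", "k", 29),
}
example_vector = propositionalize(example_facts)
```

The resulting vectors (one per sequence) are the kind of input a decision tree or SVM would be trained on.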
Average AUC for Sfull and S30 databases.
| Methods | Sfull | S30 |
|---|---|---|
| ILP-SVM-Seq | 0.79 | 0.77 |
| ILP-SVM-Aln | 0.81 | 0.81 |
| ILP-SVM-Aln | 0.80 | 0.81 |
| ILP-SVM-Seq-Aln | 0.85 | 0.80 |
| ILP-SVM-Aln | 0.82 | 0.82 |
| ILP-SVM-Seq-Aln | 0.87 | 0.82 |
| ILP-DT-Seq | 0.67 | 0.65 |
| ILP-DT-Aln | 0.70 | 0.69 |
| ILP-DT-Aln | 0.68 | 0.67 |
| ILP-DT-Seq-Aln | 0.72 | 0.69 |
| ILP-DT-Aln | 0.71 | 0.71 |
| ILP-DT-Seq-Aln | 0.74 | 0.71 |
| SVM-Ngram-LSA | 0.79 | 0.77 |
| SVM-LA | 0.80 | |
| PSI-BLAST | 0.75 | 0.69 |
| HMMer-3.0 | 0.63 | 0.60 |
Figure 3. Performance as measured by AUC-ROC. For each SCOP family, we ran the SVM and DT models T times (see Methods) over the same positive set but over several different negative sets. We calculated the AUC-ROC for each run and obtained the AUC-ROC of the SCOP family by averaging over all runs. We plot the AUC-ROC against the number of families that achieved a given AUC-ROC score or better for the database Sfull (A) and for the database S30 (B). We show only the ILP models that achieved the best performance. We carried out a rank-sum test [38] to compare the curves (see Table 4). For both databases, ILP-SVM-Seq-Aln-Aln outperformed all methods except SVM-LA, and ILP-DT-Seq-Aln-Aln outperformed only HMMer-3.0.
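The evaluation described in this caption (one AUC-ROC per run, averaged per family, then counted against a score threshold) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the AUC is computed with the pairwise-comparison definition rather than any particular library, and classifier scores are assumed to be plain floats where higher means "more likely homologous".

```python
def auc_roc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def family_auc(pos_scores_per_run, neg_scores_per_run):
    """Average the per-run AUC values for one family."""
    aucs = [auc_roc(p, n) for p, n in zip(pos_scores_per_run, neg_scores_per_run)]
    return sum(aucs) / len(aucs)


def families_at_or_above(family_aucs, threshold):
    """Number of families whose averaged AUC reaches `threshold` or better."""
    return sum(a >= threshold for a in family_aucs)
```

Sweeping `threshold` from 0 to 1 and plotting `families_at_or_above` against it would give a curve of the kind the caption describes.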
Average AUC and number of logical rules according to chi-square test for Sfull and S30 databases.
| Methods | Sfull, no chi-square | | | S30, no chi-square | | |
|---|---|---|---|---|---|---|
| ILP-SVM-Seq | 0.79 (228.59) | 0.79 (89.15) | 0.75 (59.30) | 0.77 (311.09) | 0.77 (91.04) | 0.70 (26.4) |
| ILP-SVM-Aln | 0.81 (44.91) | 0.81 (34.98) | 0.77 (12.76) | 0.81 (56.72) | 0.80 (36.44) | 0.72 (13.16) |
| ILP-SVM-Aln | 0.80 (191.65) | 0.79 (139.61) | 0.75 (66.15) | 0.81 (241.96) | 0.81 (178.72) | 0.73 (71) |
| ILP-SVM-Seq-Aln | 0.85 (311.09) | 0.83 (144.07) | 0.79 (35.8) | 0.80 (381.12) | 0.80 (178.56) | 0.74 (49.96) |
| ILP-SVM-Aln | 0.82 (236.56) | 0.82 (174.59) | 0.79 (46.04) | 0.82 (283.12) | 0.82 (209.76) | 0.80 (57.28) |
| ILP-SVM-Seq-Aln | 0.87 (502.74) | 0.85 (283.69) | 0.81 (74.3) | 0.82 (623.56) | 0.82 (357.28) | 0.79 (90.96) |
| ILP-DT-Seq | 0.67 (228.59) | 0.67 (89.15) | 0.62 (59.30) | 0.65 (311.09) | 0.65 (91.04) | 0.61 (26.4) |
| ILP-DT-Aln | 0.70 (44.91) | 0.70 (34.98) | 0.72 (12.76) | 0.69 (56.72) | 0.69 (36.44) | 0.65 (13.16) |
| ILP-DT-Aln | 0.68 (191.65) | 0.68 (139.61) | 0.64 (66.15) | 0.67 (241.96) | 0.67 (178.72) | 0.62 (71) |
| ILP-DT-Seq-Aln | 0.72 (311.09) | 0.71 (144.07) | 0.67 (35.8) | 0.69 (381.12) | 0.68 (178.56) | 0.63 (49.96) |
| ILP-DT-Aln | 0.71 (236.56) | 0.71 (174.59) | 0.73 (46.04) | 0.71 (283.12) | 0.70 (209.76) | 0.62 (57.28) |
| ILP-DT-Seq-Aln | 0.74 (502.74) | 0.74 (283.69) | 0.69 (74.3) | 0.71 (623.56) | 0.71 (357.28) | 0.63 (90.96) |
Rank-sum test p-values for curves of Figure 3.
| ILP-SVM-Seq-Aln | 0.93 | 4.93e-05 | 2.55e-07 | 5.63e-07 |
| ILP-DT-Seq-Aln | 7.27e-06 | 2.23e-04 | 0.17 | 5.44e-07 |
| ILP-SVM-Seq-Aln | 0.07 | 0.05 | 3.85e-05 | 1.33e-06 |
| ILP-DT-Seq-Aln | 1.44e-05 | 0.015 | 0.41 | 1.2e-05 |
We consider a result with p ≤ 0.05 to be significant.
Sequential Predicates
| Property/amino acid set | Predicate |
|---|---|
| 1- small {A,G,S,T} | small(X,Y) |
| 2- polar {D,E,H,K,N,Q,R,S,T,W,Y} | polar(X,Y) |
| 3- polar uncharged {N,Q} | polarUncharged(X,Y) |
| 4- aromatic {F,H,W,Y} | aromatic(X,Y) |
| 5- charged {D,E,H,I,K,L,R,V} | charged(X,Y) |
| 6- positively charged {H,K,R} | positivelyCharged(X,Y) |
| 7- negatively charged {D,E} | negativelyCharged(X,Y) |
| 8- tiny {A,G} | tiny(X,Y) |
| 9- bulky {F,H,R,W,Y} | bulky(X,Y) |
| 10- aliphatic {I,L,V} | aliphatic(X,Y) |
| 11- hydrophobic {I,L,M,V} | hydrophobic(X,Y) |
| 12- hydrophilic basic {K,R,H} | hydrophilicBasic(X,Y) |
| 13- hydrophilic acidic {E,D,N,Q} | hydrophilicAcidic(X,Y) |
| 14- neutral weakly hydrophobic {A,G,P,S,T} | neutralWeakHydrophobic(X,Y) |
| 15- hydrophobic aromatic {F,W,Y} | hydrophobicAromatic(X,Y) |
| 16- acidic {E,D} | acidic(X,Y) |
| 17- amino acid ratio | aminoacidRatio(X,W,Y) |
| 18- amino acid pair ratio | aminoacidPairRatio(X,W,Y) |
For each predicate of the form property(X, Y), numbered from 1 to 16, X is the sequence identifier and Y is the percentage of amino acids in the sequence that have the corresponding physico-chemical property. For the predicate aminoacidRatio(X, W, Y), X is the sequence identifier, W is an amino acid, and Y is the percentage of amino acid W within sequence X. For the predicate aminoacidPairRatio(X, W, Y), X and Y are defined as before, and W is a pair of amino acids.
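To make the Y argument of these predicates concrete, the following sketch computes the three kinds of percentages for a sequence. It is only an approximation of what the text describes: the function names are hypothetical, only four of the property sets from the table are included, and counting amino acid pairs over overlapping adjacent windows is an assumption. The rule interpretations in the WARMR table suggest that the values are binned before being used in predicates; no binning is applied here because the scheme is not given in this section.

```python
# Property sets copied from the "Sequential Predicates" table (subset only).
PROPERTY_SETS = {
    "small": set("AGST"),
    "tiny": set("AG"),
    "aliphatic": set("ILV"),
    "hydrophobic": set("ILMV"),
}


def property_percentage(seq, prop):
    """Percentage of residues in `seq` that belong to the property set `prop`."""
    members = PROPERTY_SETS[prop]
    return 100.0 * sum(aa in members for aa in seq) / len(seq)


def aminoacid_ratio(seq, aa):
    """Percentage of residues in `seq` equal to the amino acid `aa`."""
    return 100.0 * seq.count(aa) / len(seq)


def aminoacid_pair_ratio(seq, pair):
    """Percentage of adjacent residue pairs equal to `pair`.

    Overlapping windows of length 2 are used; this is an assumption.
    """
    windows = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return 100.0 * windows.count(pair) / len(windows)
```

For the toy sequence `"AGCGC"`, `property_percentage(seq, "small")` gives 60.0 and `aminoacid_pair_ratio(seq, "CG")` gives 25.0. Sequences shorter than two residues would need separate handling.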
Figure 4. Creating ground predicates from alignment positions. A ground predicate is defined for each alignment position. For instance, the ground predicate col(s, v, 1) means that the sequence s has the amino acid v in the first alignment position.
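A small sketch of how col facts like col(s, v, 1) could be generated from one aligned row is given below. The function name, the tuple representation of a fact and the decision to skip gap characters are assumptions made for illustration; the caption does not say how gaps are treated.

```python
def col_predicates(seq_id, aligned_seq, gap="-"):
    """Return col(seq_id, residue, position) facts for one aligned sequence.

    Positions are 1-based alignment columns; gap columns are skipped
    (an assumption, since the source does not specify gap handling).
    """
    return [
        ("col", seq_id, residue.lower(), pos)
        for pos, residue in enumerate(aligned_seq, start=1)
        if residue != gap
    ]


# Hypothetical aligned row: V in column 1, a gap in column 2, C in column 3.
facts = col_predicates("s", "V-C")
```

Here `facts` contains `("col", "s", "v", 1)` and `("col", "s", "c", 3)`, mirroring the col(s, v, 1) example in the caption.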
Figure 5. Using pHMMs to create ground predicates for alignment positions. A) First, the positive training sequences are aligned. B) Second, a pHMM is built from this alignment. Each query sequence (negative training sequences, and positive and negative test sequences) is matched against the pHMM, producing the Viterbi output that shows the correspondence between each amino acid in the query sequence and the pHMM states (match (M), insert (I) or delete (D)). C) Finally, since we know the mapping between alignment positions and pHMM states, we create ground predicates for each query sequence in a similar way to Figure 4.
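Step C can be illustrated with a sketch that walks a Viterbi state path and emits col facts for residues assigned to match states. Everything here is an assumption made for the example: the path is given as a plain string of `M`, `I` and `D` characters, `match_to_column` is a hypothetical list giving the alignment column of each match state, and residues emitted by insert states are simply ignored. It does not reproduce the output format of any particular pHMM tool.

```python
def col_predicates_from_path(seq_id, query, path, match_to_column):
    """Create col facts for a query sequence from its Viterbi state path.

    M and I states consume one residue of `query`; M and D states advance
    to the next match state. Only residues emitted by M states yield facts.
    """
    facts = []
    residue_index = 0  # next unread residue in `query`
    state_index = 0    # current match-state index (0-based)
    for state in path:
        if state == "M":
            column = match_to_column[state_index]
            facts.append(("col", seq_id, query[residue_index].lower(), column))
            residue_index += 1
            state_index += 1
        elif state == "I":
            residue_index += 1  # inserted residue: no alignment column
        elif state == "D":
            state_index += 1    # deleted match state: no residue consumed
        else:
            raise ValueError(f"unknown state: {state!r}")
    return facts


# Hypothetical query "ACK": A matches state 0, C is inserted,
# state 1 is deleted, K matches state 2.
path_facts = col_predicates_from_path("q", "ACK", "MIDM", [1, 2, 4])
```

With these inputs, `path_facts` holds `("col", "q", "a", 1)` and `("col", "q", "k", 4)`, so the query ends up described with the same column numbering as the training alignment.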