Andreas Karwath, Ross D King.
Abstract
BACKGROUND: The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify approximately 50% of homologies (with a false positive rate set at 1/1000).
Year: 2002 PMID: 11972320 PMCID: PMC107726 DOI: 10.1186/1471-2105-3-11
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A graphical representation of two different distributions of a homology search. The first distribution represents homologous sequences found by the search; while the second distribution represents the non-homologous hits produced by the search. Depending on the cut-off value used, a part of the distribution is called true positives (TP) as they were predicted by the search to be homologous and are homologous; while a small part of the real homologous proteins is predicted to be non-homologous proteins. This part of the distribution is called false negatives (FN). The second distribution is split as well into two parts: the first part being the so-called false positives (FP), non-homologous proteins being predicted to be homologous. The second part of this distribution are non-homologous proteins predicted to be non-homologous. This part is called true negatives (TN). The cut-off value is indicated by a vertical line. It is clear that for any cut-off value, false positives will be included in a prediction.
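The four regions described in the caption can be counted directly once a cut-off is fixed. The following is a minimal sketch, not the authors' code: it assumes that a higher score means "more likely homologous" (for E-values the comparison would be reversed), and the scores and labels are hypothetical toy data. Precision and recall are computed with their standard definitions.

```python
from typing import NamedTuple, Sequence


class Confusion(NamedTuple):
    tp: int  # homologous, predicted homologous
    fp: int  # non-homologous, predicted homologous
    fn: int  # homologous, predicted non-homologous
    tn: int  # non-homologous, predicted non-homologous


def confusion_at_cutoff(scores: Sequence[float],
                        is_homolog: Sequence[bool],
                        cutoff: float) -> Confusion:
    """Count TP/FP/FN/TN when hits scoring >= cutoff are predicted homologous."""
    tp = fp = fn = tn = 0
    for score, truth in zip(scores, is_homolog):
        predicted = score >= cutoff  # assumption: higher score = better hit
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


def precision(c: Confusion) -> float:
    """Fraction of predicted homologues that really are homologous."""
    return c.tp / (c.tp + c.fp)


def recall(c: Confusion) -> float:
    """Fraction of real homologues that were predicted homologous."""
    return c.tp / (c.tp + c.fn)


# Hypothetical toy data, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
is_homolog = [True, True, False, True, False, True, False]
counts = confusion_at_cutoff(scores, is_homolog, cutoff=0.5)
print(counts, precision(counts), recall(counts))
```

Moving `cutoff` trades false positives against false negatives, which is the point the figure makes.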
The distribution of the number of rules learnt for different targets using HIall and HIseq. HIall can generally describe patterns using fewer rules. This is expected, as it uses more types of biological background knowledge. Note the strange bimodal distribution for HIseq rules. The reason for this is unknown.
| Number of rules | HIall | HIseq |
|---|---|---|
| 0 | 423 | 485 |
| 1 | 425 | 371 |
| 2 | 701 | 228 |
| 3 | 133 | 137 |
| 4 | 38 | 81 |
| 5 | 37 | 49 |
| 6 | 37 | 38 |
| 7 | 14 | 31 |
| 8 | 11 | 119 |
| 9 | 2 | 4 |
| 10 | 1 | 0 |
The distribution of the size of the rules learnt, i.e. the number of predicates used in each rule in a rule set. In the HIall setting, the most common predicates in rules with only one predicate were references to databases, followed by SWISS-PROT description arguments and keywords. The larger the number of predicates used in this setting, the more dominant becomes the use of predicates based on pure amino acid distributions and predicted secondary structure. In the HIseq setting a similar shift from predicates involving amino acid distributions towards predicted secondary structure predicates was observed. Rules with more than eight predicates are solely based on secondary structure.
| Number of predicates used in each rule | HIall | HIseq |
|---|---|---|
| 1 | 1030 | 169 |
| 2 | 369 | 940 |
| 3 | 314 | 810 |
| 4 | 121 | 340 |
| 5 | 14 | 86 |
| 6 | 1 | 18 |
| 7 | 1 | 6 |
| 8 | 1 | 1 |
| 9 | 0 | 1 |
The precision and recall for PSI-BLAST, HIall and HIseq.
| Method | Precision | Recall |
|---|---|---|
| PSI-BLAST | 0.34 | 0.717 |
| HIall | 0.32 | 0.787 |
| HIseq | 0.30 | 0.789 |
The contingency tables for χ² comparing PSI-BLAST with HIall and HIseq in the twilight zone. The numbers in brackets are the expected values.
| | PSI-BLAST | HIall | Total |
|---|---|---|---|
| True Positives | 460 (512.68) | 312 (259.32) | 772 |
| False Positives | 574 (521.32) | 211 (263.68) | 785 |
| Total | 1034 | 523 | 1557 |

| | PSI-BLAST | HIseq | Total |
|---|---|---|---|
| True Positives | 460 (490.21) | 208 (177.79) | 668 |
| False Positives | 574 (543.79) | 167 (197.21) | 741 |
| Total | 1034 | 375 | 1409 |
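The bracketed expected values follow from the row and column totals (row total × column total / grand total). A minimal sketch that reproduces them from the observed counts and computes the Pearson χ² statistic is given below; the function names are illustrative, and no continuity correction is applied, which may differ from the procedure used in the paper.

```python
from typing import List

Table = List[List[float]]


def expected_counts(observed: Table) -> Table:
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed: Table) -> float:
    """Pearson chi-square statistic, without continuity correction."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


# Observed counts from the tables: rows = TP, FP; columns = PSI-BLAST, HI method.
psi_vs_hiall = [[460, 312], [574, 211]]
psi_vs_hiseq = [[460, 208], [574, 167]]

for name, table in [("HIall", psi_vs_hiall), ("HIseq", psi_vs_hiseq)]:
    rounded = [[round(e, 2) for e in row] for row in expected_counts(table)]
    print(name, rounded, chi_square(table))
```

With one degree of freedom, the statistic can then be compared against the usual critical value (3.84 at the 5% level).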
Figure 2. The three ROC curves produced by PSI-BLAST, HIall, and HIseq for predictions in the twilight zone. While the ROC curve for PSI-BLAST results from applying ROC analysis directly to the results produced, the ROC curves for both HI methods are maximized using an optimal value for re-sorting. The ROC curve for HIall dominates the other two curves at all times, while the curves for PSI-BLAST and HIseq oscillate around each other. HIseq dominates the PSI-BLAST curve between ~0.38 and ~0.5.
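For readers unfamiliar with ROC analysis, the sketch below shows one common way to build a ROC curve from a scored hit list and to integrate it with the trapezoidal rule. It is a generic illustration with hypothetical toy data; it does not model the re-sorting factor used for the HI methods, and it assumes that higher scores indicate more confident predictions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def roc_points(scores: Sequence[float], labels: Sequence[bool]) -> List[Point]:
    """Sweep the cut-off from strictest to loosest and record (FPR, TPR)."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Hits with identical scores cross the cut-off together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def auroc(points: Sequence[Point]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data, for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
toy_labels = [True, True, False, True, False, True, False]
print(auroc(roc_points(toy_scores, toy_labels)))
```

One curve dominates another when its true positive rate is at least as high at every false positive rate, which is the sense in which the caption compares the three methods.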
The results of the k-fold cross-validation, with the areas under the ROC curve (AUROC) and the optimal parameters. The variations in the optimal factors are due to some factors f being an order of magnitude higher than the rest of the factors.
| k | AUROC HIall | Factor (HIall) | AUROC HIseq | Factor (HIseq) |
|---|---|---|---|---|
| 5 | 0.6525 ± 0.059 | 9.6 × 10⁻⁵ ± 6.07 × 10⁻⁵ | 0.6135 ± 0.0589 | 6.0 × 10⁻² ± 2.82 × 10⁻² |
| 10 | 0.6728 ± 0.1085 | 7.8 × 10⁻⁵ ± 4.47 × 10⁻⁵ | 0.6391 ± 0.1022 | 7.5 × 10⁻² ± 2.12 × 10⁻² |
| 25 | 0.7342 ± 0.1234 | 6.6 × 10⁻⁵ ± 2.8 × 10⁻⁵ | 0.6951 ± 0.1029 | 7.56 × 10⁻² ± 1.73 × 10⁻² |
Figure 3. The calculated areas under the ROC curve (AUROC) for both HI methods (HIall and HIseq) over a range of re-sorting factors. The AUROC values for HIall increase steadily and reach their maximum of 0.651 at a factor of 6 × 10⁻⁵, while the AUROC values for HIseq first increase and then decrease again, peaking at 0.613 at a factor of 8 × 10⁻². Compared with PSI-BLAST, which has an AUROC value of 0.607, HIseq shows the smaller improvement, whereas HIall increases the AUROC value by approximately 7.4 per cent.
Three selected examples of rules generated by HIall and HIseq. Here, # rules is the total number of rules found, # pc is the number of positive examples covered in the training data, # pnc is the number of positive examples not covered in the training data, % CovP is the percentage coverage of the positive examples in the training data, % CovN is the percentage coverage of negative examples in the training data, # uc is the number of uncertain examples covered, and # unc is the number of uncertain examples not covered.
| PDB | # rules | # pc | # pnc | % CovP | % CovN | # uc | # unc |
|---|---|---|---|---|---|---|---|
| HIall | | | | | | | |
| 1CPC | 2 | 120 | 0 | 100.00 | 1.00 | 1 | 2 |
| 1MPP | 1 | 91 | 1 | 98.91 | 0.00 | 2 | 13 |
| 1MLA | 1 | 17 | 5 | 77.27 | 0.00 | 4 | 13 |
| HIseq | | | | | | | |
| 1CPC | 2 | 89 | 31 | 74.17 | 1.20 | 1 | 2 |
| 1MPP | 3 | 62 | 30 | 67.39 | 1.20 | 3 | 12 |
| 1MLA | 1 | 16 | 6 | 72.73 | 0.20 | 5 | 12 |
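The % CovP column is consistent with # pc / (# pc + # pnc) expressed as a percentage. This relation is inferred from the table values rather than stated in the caption, so the short check below should be read as a sanity test under that assumption.

```python
def coverage_percent(covered: int, not_covered: int) -> float:
    """Percentage of examples covered, assuming coverage = pc / (pc + pnc)."""
    return 100.0 * covered / (covered + not_covered)


# (# pc, # pnc) pairs copied from the table above.
positives = {
    ("HIall", "1CPC"): (120, 0),
    ("HIall", "1MPP"): (91, 1),
    ("HIall", "1MLA"): (17, 5),
    ("HIseq", "1CPC"): (89, 31),
    ("HIseq", "1MPP"): (62, 30),
    ("HIseq", "1MLA"): (16, 6),
}

cov_p = {key: round(coverage_percent(pc, pnc), 2)
         for key, (pc, pnc) in positives.items()}
for key, value in cov_p.items():
    print(key, value)
```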
The HI rules learnt to identify 1CPC (C-Phycocyanin) are shown first in their original Prolog form and then in English translation. Two sets of rules are shown: those learnt using HIall, and those learnt using HIseq. All numbers were discretised into 10 levels for ease of symbolic induction (1 low – 10 high).
| Rule | Definition |
|---|---|
| HIall | Prolog |
| | homologous(A) :- |
| | desc(A,chain), |
| | amino_acid_ratio_rule(A,h,1). |
| | homologous(A) :- |
| | keyword(A,phycobilisome). |
| | English |
| | A protein is homologous if |
| a1 | it has the word 'chain' in its SWISS-PROT description line and |
| | it has a level 1 histidine content in the residue chain |
| a2 | or it has the word 'phycobilisome' as a SWISS-PROT keyword. |
| HIseq | Prolog |
| | homologous(A) :- |
| | amino_acid_ratio_rule(A,w,1), |
| | amino_acid_ratio_rule(A,h,1), |
| | amino_acid_pair_ratio_rule(A,l,r,10). |
| | homologous(A) :- |
| | mol_wt_rule(A,3), |
| | sec_struc_distribution_rule(A,a,10). |
| | English |
| | A protein is homologous if |
| s1 | it has a level 1 tryptophan content and |
| | it has a level 1 histidine content and |
| | it has a level 10 leucine-arginine pair content |
| s2 | or it has a level 3 molecular weight and |
| | it has a level 10 predicted α-helix content. |
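To make the logical structure of these rules explicit (a conjunction of conditions inside each clause, a disjunction across clauses), the two HIall clauses for 1CPC can be rewritten as ordinary boolean tests. This is an illustrative sketch only: the `Protein` record and its field names are hypothetical, and treating `desc(A,chain)` as a whole-word match on the description line is an assumption about how the original predicate behaves.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Protein:
    """Hypothetical record holding the background facts the rules refer to."""
    description: str = ""                                  # SWISS-PROT description line
    keywords: Set[str] = field(default_factory=set)        # SWISS-PROT keywords
    amino_acid_level: Dict[str, int] = field(default_factory=dict)  # residue -> level 1..10


def homologous_1cpc_hiall(p: Protein) -> bool:
    """True if either HIall clause for 1CPC fires."""
    # Clause a1: 'chain' in the description AND level 1 histidine content.
    rule_a1 = ("chain" in p.description.lower().split()
               and p.amino_acid_level.get("h") == 1)
    # Clause a2: 'phycobilisome' is a keyword.
    rule_a2 = "phycobilisome" in p.keywords
    return rule_a1 or rule_a2


# Hypothetical example records, for illustration only.
by_description = Protein(description="C-phycocyanin alpha chain",
                         amino_acid_level={"h": 1})
by_keyword = Protein(keywords={"phycobilisome"})
unrelated = Protein(description="Hemoglobin alpha chain",
                    amino_acid_level={"h": 5})

print(homologous_1cpc_hiall(by_description),
      homologous_1cpc_hiall(by_keyword),
      homologous_1cpc_hiall(unrelated))
```

The HIseq clauses would follow the same pattern, with tests on amino acid, residue-pair, molecular weight and secondary structure levels in place of the annotation lookups.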
The HI rules learnt to identify 1MLA are shown in English translation. The secondary structure elements along the sequence are ordered into ten equal groups (deciles). The 1st decile contains the 10% of elements nearest the N-terminus and the 10th decile those nearest the C-terminus.
| Rule | Condition |
|---|---|
| HIall | A protein is homologous if |
| a1 | it has the word 'synthase' in its description line and |
| | it has in the 10th decile of predicted secondary structures a coil of length level 4. |
| HIseq | A protein is homologous if |
| s1 | it has in the 10th decile of predicted α-helices a helix of length level 3 and |
| | it has in the 10th decile of predicted β-strands a strand of length level 1. |
The HI rules learnt to identify 1MPP are shown in English translation.
| Rule | Condition |
|---|---|
| HIall | A protein is homologous if |
| a1 | it has the classification 'eukaryota' and |
| | it has the PROSITE pattern 'PS00141'. |
| HIseq | A protein is homologous if |
| s1 | it has a level 10 serine-serine pair content and |
| | it has a level 10 glycine-serine pair content and |
| | it has in the 8th decile of predicted β-strands a strand of length level 9 |
| | or |
| s2 | it has a molecular weight of level 7 and |
| | it has in the 9th decile of predicted coils a coil of length level 1 |
| | or |
| s3 | it has a level 2 histidine content and |
| | it has in the 7th decile of predicted secondary structures a β-strand of length level 5 and |
| | it has in the 7th decile of predicted secondary structures a coil of length level 5 and |
| | it has in the 4th decile of predicted secondary structures a β-strand of length level 6. |