Rabie Saidi, Mondher Maddouri, Engelbert Mephu Nguifo.
Abstract
BACKGROUND: This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been widely used to encode biological sequences into feature vectors, which enables the use of well-known machine-learning classifiers that require this format. However, designing a suitable feature space for a set of proteins is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step.
Year: 2010 PMID: 20377887 PMCID: PMC2868007 DOI: 10.1186/1471-2105-11-175
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Sequence pre-processing based on motif extraction. This figure describes the process of sequence encoding. The extracted motifs are used as attributes to build a binary context where each row represents a sequence.
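The binary context described in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a motif is a plain substring and that an attribute is set to 1 when the motif occurs exactly in the sequence. The function name `build_binary_context` and the toy sequences are hypothetical.

```python
from typing import List


def build_binary_context(sequences: List[str], motifs: List[str]) -> List[List[int]]:
    """Return one row per sequence; a cell is 1 if the motif occurs in the sequence."""
    return [[1 if motif in seq else 0 for motif in motifs] for seq in sequences]


# Toy data (hypothetical); the motif names are borrowed from Table 1.
motifs = ["LLK", "GGP", "RI", "PP"]
sequences = ["MLLKRIA", "GGPPP", "AAAA"]

context = build_binary_context(sequences, motifs)
for seq, row in zip(sequences, context):
    print(seq, row)
```

Each row of `context` is a feature vector that a standard classifier can consume; the number of columns equals the number of extracted motifs, which is the "number of attributes" reported in the result tables.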
Motifs clustering
| Motif | LLK | IMK | VMK | GGP | RI | RV | RF | RA | PP |
|---|---|---|---|---|---|---|---|---|---|
| P | 0.89 | 0.87 | 0.86 | 0 | 0.75 | 0.72 | 0.72 | 0.5 | 0 |
| Main motif | LLK | LLK | LLK | GGP | RI | RI | RI | RV | PP |
The first row is a set of motifs (Table 1) sorted by their lengths and P. The third row shows the main motif of each cluster.
Figure 2. Motifs clustering. This figure illustrates the set of clusters and main motifs obtained from the data of Table 1 after application of our algorithm. RV belongs to 2 clusters and is the main motif of one of them.
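The clusters in Figure 2 can be recovered from the main-motif assignments of Table 1 by grouping. The sketch below only regroups the published assignments; it does not reproduce the clustering algorithm itself. It assumes that a main motif always belongs to its own cluster, which is consistent with the caption's remark about RV. The variable names are illustrative.

```python
from collections import defaultdict

# Motif -> main motif, copied from Table 1.
main_motif_of = {
    "LLK": "LLK", "IMK": "LLK", "VMK": "LLK",
    "GGP": "GGP",
    "RI": "RI", "RV": "RI", "RF": "RI",
    "RA": "RV",
    "PP": "PP",
}

clusters = defaultdict(set)
for motif, main in main_motif_of.items():
    clusters[main].add(main)   # assumption: a main motif is in its own cluster
    clusters[main].add(motif)

for main, members in clusters.items():
    print(main, sorted(members))

# Motifs that appear in more than one cluster.
shared = sorted(
    m for m in main_motif_of
    if sum(m in members for members in clusters.values()) > 1
)
print(shared)
```

Under this grouping the nine motifs collapse into five clusters, so only the five main motifs would be kept as attributes, and RV is the only motif shared between two clusters.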
Experimental data
| Dataset (source) | Identity percentage (%) | Family/class | Size | Total |
|---|---|---|---|---|
| DS1 (Swiss-prot) | 48 | High-potential Iron-Sulfur Protein | 19 | 60 |
| | | Hydrogenase Nickel Incorporation Protein HypA | 20 | |
| | | Glycine Dehydrogenase | 21 | |
| DS2 (Swiss-prot) | 48 | Chemokine | 255 | 510 |
| | | Melanocortin | 255 | |
| DS3 (Swiss-prot) | 25 | Monomer | 208 | 717 |
| | | Homodimer | 335 | |
| | | Homotrimer | 40 | |
| | | Homotetramer | 95 | |
| | | Homopentamer | 11 | |
| | | Homohexamer | 23 | |
| | | Homooctamer | 5 | |
| DS4 (Swiss-prot) | 28 | Human TLR | 14 | 40 |
| | | Non-human TLR | 26 | |
| DS5 (SCOP) | 84 | All-α domain | 70 | 277 |
| | | All-β domain | 61 | |
| | | α/β domain | 81 | |
| | | α + β domain | 65 | |
Machine learning classifiers coupled with encoding methods
| Dataset | Mtr | Clfr | Encoding method | | | |
|---|---|---|---|---|---|---|
| DS1 | CA | C4.5 | 96.7 | 95 | 95 | 96.7 |
| | | SVM | 93.3 | | | |
| | | NB | 86.7 | 90 | 81.7 | 80 |
| | | NN | 63.3 | 78.3 | 60 | 61.7 |
| | NA | | 4935 | 2060 | 4905 | 2565 |
| DS2 | CA | C4.5 | 99.6 | 99.4 | 99.8 | 99.4 |
| | | SVM | 99.4 | | | |
| | | NB | 74.7 | | | |
| | | NN | 98.8 | | | |
| | NA | | 6503 | 7055 | 10058 | 1312 |
| DS3 | CA | C4.5 | - | | | |
| | | SVM | - | 78.94 | | |
| | | NB | - | 59.4 | | |
| | | NN | - | 77 | | |
| | NA | | 7983 | - | 8403 | 508 |
| DS4 | CA | C4.5 | 57.5 | 77.5 | 82.5 | |
| | | SVM | 67.5 | 65 | 87.5 | 87.5 |
| | | NB | 57.5 | 40 | 92.6 | |
| | | NN | 52.5 | 60 | 80 | 80 |
| | NA | | 5561 | 3602 | 7116 | 5505 |
| DS5 | CA | C4.5 | 75.5 | 75.1 | 67.9 | 73.3 |
| | | SVM | 84.1 | 81.2 | 82.3 | 82.3 |
| | | NB | 77.3 | 63.7 | 84.5 | |
| | | NN | 80.5 | 79.4 | 78 | 78 |
| | NA | | 6465 | 2393 | 13830 | 13083 |
Mtr: Metric, Clfr: Classifier, CA: Classification Accuracy (%), NA: Number of Attributes.
Comparison between Blast and DDSM in terms of accuracy (%)
| Dataset | Blast-based | (DDSM & SVM) | Best of DDSM (from table 3) |
|---|---|---|---|
| DS1 | 100 | 96.7 | 96.7 |
| DS2 | 100 | 100 | 100 |
| DS3 | 69.60 | 78.94 | 79.2 |
| DS4 | 78.57 | 87.5 | 95 |
| DS5 | 78.3 | 82.3 | 85.9 |
Figure 3. ROC curve samples for the NB classifier in the dataset DS3 with the DDSM, DD and NG encoding methods. The positive class is Homotetramer. This figure shows a sample of ROC curves of the NB classifier based on the DDSM, DD and NG encoding methods with Homotetramer as the positive class (DS3). The DDSM-based ROC curve is clearly higher than the other two. A ROC graph makes it possible to compare two or more supervised learning algorithms. It depicts relative trade-offs between true positive rates and false positive rates [49]. A synthetic indicator can be derived from the ROC curve, known as the AUC (Area Under the Curve). The AUC indicates the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. There is a threshold value: if the instances are classified at random, the AUC will be equal to 0.5, so a significant AUC must exceed this threshold.
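The ranking interpretation of the AUC given in the caption can be computed directly by counting pairs. The sketch below is illustrative only: the function name `auc_by_ranking`, the convention of counting ties as one half, and the toy scores are assumptions, not values from the paper.

```python
from typing import Sequence


def auc_by_ranking(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs where the positive
    instance gets the higher score. Ties count as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy classifier scores (hypothetical); label 1 marks the positive class.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = auc_by_ranking(scores, labels)
print(round(auc, 3))

# A classifier that gives every instance the same score cannot rank at all.
uninformative = auc_by_ranking([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(uninformative)
```

In the toy data five of the six positive/negative pairs are ordered correctly, while the constant-score classifier lands exactly on the 0.5 threshold mentioned above.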
Experimental results per substitution matrix for DS3
| Substitution matrix | Attributes | C4.5 accuracy (%) | SVM accuracy (%) | NB accuracy (%) | NN accuracy (%) |
|---|---|---|---|---|---|
| Blosum45 | 377 | 78.5 | 79.2 | 59.4 | 77.7 |
| Blosum62 | 508 | 79.2 | 78.9 | 59.4 | 77 |
| Blosum80 | 532 | 77.6 | 80.5 | 60 | 77.6 |
| Pam30 | 2873 | 77.8 | 82 | 60.3 | 76.7 |
| Pam70 | 802 | 78.1 | 80.5 | 60.5 | 77 |
| Pam250 | 1123 | 77.3 | 79.4 | 59.6 | 78.7 |
Experimental results per substitution matrix for DS4
| Substitution matrix | Attributes | C4.5 accuracy (%) | SVM accuracy (%) | NB accuracy (%) | NN accuracy (%) |
|---|---|---|---|---|---|
| Blosum45 | 5095 | 82.5 | 85 | 95 | 80 |
| Blosum62 | 5505 | 82.5 | 87.5 | 95 | 80 |
| Blosum80 | 5968 | 72.5 | 87.5 | 92.5 | 80 |
| Pam30 | 7005 | 82.5 | 92.5 | 92.5 | 65 |
| Pam70 | 5846 | 82.5 | 85 | 92.5 | 80 |
| Pam250 | 1948 | 82.5 | 77.5 | 95 | 80 |
Experimental results per substitution matrix for DS5
| Substitution matrix | Attributes | C4.5 accuracy (%) | SVM accuracy (%) | NB accuracy (%) | NN accuracy (%) |
|---|---|---|---|---|---|
| Blosum45 | 12603 | 69.3 | 82.3 | 85.9 | 78 |
| Blosum62 | 13083 | 73.3 | 82.3 | 85.9 | 78 |
| Blosum80 | 13146 | 70.1 | 82.3 | 84.1 | 78 |
| Pam30 | 13830 | 69.3 | 82.3 | 84.5 | 78 |
| Pam70 | 13822 | 70.4 | 82.3 | 84.5 | 78 |
| Pam250 | 1969 | 66.1 | 85.2 | 79.4 | 78 |
Comparison with results reported in (Yu et al., 2006) for DS3
| Methods | Accuracy % | Correctly classified sequences |
|---|---|---|
| DDSM & C4.5 | 79.2 | 568 |
| DDSM & SVM | 78.9 | 588 |
| DDSM & NB | 59.4 | 434 |
| DDSM & NN | 77 | 564 |
| FDC & NN | 75.2 | 539 |
| AAC & NN | 41.4 | 297 |
| Blast-based | 69.6 | 499 |
Comparison with results reported in (Chen et al., 2006) and (Zhou, 1998) for DS5
| Methods | Accuracy % | Correctly classified sequences |
|---|---|---|
| DDSM & C4.5 | 73.3 | 203 |
| DDSM & SVM | 82.3 | 228 |
| DDSM & NB | 85.9 | 238 |
| DDSM & NN | 78 | 216 |
| Blast-based | 78.3 | 220 |
| AAC[ | 80.5 | 223 |
| pair-coupled AAC[ | 77.6 | 215 |
| PseAAC[ | 80.5 | 223 |
| SVM fusion [ | 87.7 | 243 |
| AAC[ | 79.1 | 219 |
| AAC[ | 59.9 | 166 |
| AAC[ | 55.2 | 153 |
| AAC[ | 74.7 | 206 |
| AAC[ | 79.4 | 219 |
| AAC[ | 84.1 | 233 |
| AAC[ | 79.4 | 219 |