Sébastien Boisvert, Mario Marchand, François Laviolette, Jacques Corbeil.
Abstract
BACKGROUND: Human immunodeficiency virus type 1 (HIV-1) infects cells by means of ligand-receptor interactions. This lentivirus uses the CD4 receptor in conjunction with a chemokine coreceptor, either CXCR4 or CCR5, to enter a target cell. HIV-1 is characterized by high sequence variability. Nonetheless, within this extensive variability, certain features must be conserved to define functions and phenotypes. Determining the coreceptor usage of HIV-1 from its envelope protein sequence is an instance of a well-studied machine learning problem known as classification. The support vector machine (SVM), with string kernels, has proven to be very efficient for dealing with a wide class of classification problems ranging from text categorization to protein homology detection. In this paper, we investigate how the SVM can predict HIV-1 coreceptor usage when it is equipped with an appropriate string kernel.
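To make the idea of a string kernel concrete, the following is a minimal sketch of a plain k-mer spectrum kernel, a generic and widely used string kernel. It is an illustration only and is not claimed to be the kernel used in this paper; the function name `spectrum_kernel`, the parameter `k`, and the toy sequences are assumptions.

```python
from collections import Counter


def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of s and t.

    Generic spectrum kernel, shown for illustration only (assumption:
    this is not necessarily the kernel evaluated in the paper).
    """
    counts_s = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    counts_t = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Only k-mers present in both strings contribute to the sum.
    return sum(c * counts_t[kmer] for kmer, c in counts_s.items())


def gram_matrix(sequences, k: int = 3):
    """Pairwise kernel values, the form an SVM solver that accepts a
    precomputed kernel would typically consume."""
    return [[spectrum_kernel(a, b, k) for b in sequences] for a in sequences]


if __name__ == "__main__":
    toy = ["CTRPNNNTRK", "CTRPGNNTRR"]  # made-up sequences, not real data
    print(gram_matrix(toy, k=3))
```

Because the kernel compares raw substrings, sequences of different lengths can be compared directly without first aligning them.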
Year: 2008; PMID: 19055831; PMCID: PMC2637298; DOI: 10.1186/1742-4690-5-110
Source DB: PubMed; Journal: Retrovirology; ISSN: 1742-4690; Impact factor: 4.602
Figure 1. The algorithm for computing .
Figure 2. The algorithm for extracting the features of a string s into a Map. Here, s(i : j) denotes the substring of s starting at position i and ending at position j.
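The figure itself is not reproduced here. As a rough sketch of what extracting substring features into a map can look like, the code below counts every substring of length at most `max_len`. The choice of feature set, the name `extract_features`, and the `max_len` parameter are assumptions made for illustration; the paper's actual feature definition may differ.

```python
from collections import Counter


def extract_features(s: str, max_len: int) -> Counter:
    """Map each substring of s with length <= max_len to its count.

    Hypothetical feature set chosen for illustration. The caption's
    s(i : j) includes position j, whereas a Python slice excludes its
    end index, hence the j + 1 below.
    """
    features = Counter()
    n = len(s)
    for i in range(n):
        for j in range(i, min(i + max_len, n)):
            features[s[i:j + 1]] += 1
    return features


if __name__ == "__main__":
    print(extract_features("ABA", max_len=2))
```

A hash map keeps the representation sparse: only substrings that actually occur in the string are stored.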
Figure 3. The algorithm for merging every feature from the set S = {(s1, y1), (s2, y2), ..., (s, y)} of all support vectors into a Map representing the discriminant w.
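One way to read this caption is through the standard SVM expansion, in which the discriminant is a weighted sum of the support vectors' feature vectors, with weights given by the dual coefficient times the label. The sketch below assumes that reading. The names `merge_support_vectors`, `decision_value`, the `alphas` argument, and the substring feature map are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def extract_features(s: str, max_len: int = 2) -> Counter:
    """Hypothetical feature map: counts of substrings of length <= max_len."""
    features = Counter()
    for i in range(len(s)):
        for j in range(i, min(i + max_len, len(s))):
            features[s[i:j + 1]] += 1
    return features


def merge_support_vectors(support_vectors, alphas, max_len: int = 2) -> dict:
    """Accumulate alpha_i * y_i * phi(s_i) over all support vectors.

    support_vectors: list of (sequence, label) pairs with label in {-1, +1}.
    alphas: dual coefficients (assumed to come from a trained SVM).
    """
    w = defaultdict(float)
    for (s, y), alpha in zip(support_vectors, alphas):
        for feature, count in extract_features(s, max_len).items():
            w[feature] += alpha * y * count
    return dict(w)


def decision_value(w: dict, s: str, bias: float = 0.0, max_len: int = 2) -> float:
    """Score a new sequence as <w, phi(s)> + bias."""
    phi = extract_features(s, max_len)
    return sum(w.get(f, 0.0) * c for f, c in phi.items()) + bias


if __name__ == "__main__":
    svs = [("AB", +1), ("BA", -1)]  # toy support vectors, not real data
    w = merge_support_vectors(svs, alphas=[0.5, 0.5])
    print(sorted(w.items(), key=lambda kv: kv[1]))
```

Sorting the merged map by weight, as in the usage lines, gives the features that push a prediction most strongly toward either class.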
Datasets. Contradictions are in parentheses.
| Coreceptor usage | Training set: negative examples | Training set: positive examples | Training set: total | Test set: negative examples | Test set: positive examples | Test set: total |
| CCR5 | 200 (13) | 1225 (12) | 1425 (25) | 225 (22) | 1200 (16) | 1425 (38) |
| CXCR4 | 1050 (44) | 375 (18) | 1425 (62) | 1027 (28) | 398 (21) | 1425 (38) |
| CCR5 and CXCR4 | 1250 (57) | 175 (30) | 1425 (87) | 1252 (48) | 173 (35) | 1425 (83) |
HIV-1 subtypes.
| Subtype | Training set | Test set | Total |
| A | 39 | 46 | 85 |
| B | 955 | 943 | 1898 |
| C | 168 | 149 | 317 |
| 02_AG | 12 | 15 | 27 |
| O | 11 | 11 | 22 |
| D | 69 | 95 | 164 |
| A1 | 25 | 18 | 43 |
| AG | 5 | 5 | 10 |
| 01_AE | 97 | 106 | 203 |
| G | 7 | 7 | 14 |
| Others | 37 | 30 | 67 |
| Total | 1425 | 1425 | 2850 |
Sequence length distribution. The minimum length is 31 residues and the maximum length is 40 residues.
| Residues | Training set | Test set | Total |
| 31 | 1 | 0 | 1 |
| 32 | 0 | 0 | 0 |
| 33 | 2 | 2 | 4 |
| 34 | 18 | 22 | 40 |
| 35 | 210 | 189 | 399 |
| 36 | 1142 | 1162 | 2304 |
| 37 | 30 | 31 | 61 |
| 38 | 11 | 10 | 21 |
| 39 | 11 | 8 | 19 |
| 40 | 0 | 1 | 1 |
| Total | 1425 | 1425 | 2850 |
Classification results on the test sets. Accuracy, specificity and sensitivity are defined in Methods. See [25] for a description of the ROC area.
| Coreceptor usage | SVM parameter C | Kernel parameter | Support vectors | Accuracy | Specificity | Sensitivity | ROC area |
| CCR5 | 0.04 | 3 | 204 | 96.63% | 85.33% | 98.75% | 98.68% |
| CXCR4 | 0.7 | 9 | 392 | 93.68% | 96.00% | 87.68% | 96.59% |
| CCR5 and CXCR4 | 2 | 15 | 430 | 94.38% | 98.16% | 67.05% | 90.16% |
| CCR5 | 9 | 1 | 200 | 96.42% | 87.55% | 98.08% | 98.12% |
| CXCR4 | 0.02 | 0.05 | 321 | 92.21% | 97.56% | 78.39% | 95.11% |
| CCR5 and CXCR4 | 0.5 | 0.1 | 399 | 92.28% | 97.20% | 56.64% | 87.49% |
| CCR5 | 0.4 | 30 | 533 | 96.35% | 83.55% | 98.75% | 98.95% |
| CXCR4 | 0.0001 | 30 | 577 | 94.80% | 97.56% | 87.68% | 96.25% |
| CCR5 and CXCR4 | 0.2 | 35 | 698 | 95.15% | 99.20% | 65.89% | 90.97% |
| CCR5 | - | - | - | 99.15% | 99.55% | 99.08% | - |
| CXCR4 | - | - | - | 98.66% | 99.70% | 95.97% | - |
| CCR5 and CXCR4 | - | - | - | 97.96% | 99.68% | 85.54% | - |
| CCR5 | 0.3 | 40 | 425 | 98.45% | 92.88% | 99.50% | 99.17% |
| CXCR4 | 0.0001 | 35 | 611 | 98.66% | 99.70% | 95.97% | 98.29% |
| CCR5 and CXCR4 | 0.0001 | 40 | 618 | 97.96% | 99.68% | 85.54% | 96.27% |
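As a check on how the three percentage columns relate to each other, the sketch below assumes the standard confusion-matrix definitions of accuracy, specificity, and sensitivity. The counts used in the example are back-calculated from the first CCR5 row and the CCR5 test-set sizes (225 negative and 1200 positive examples); they are an inference, not figures reported in the paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Return (accuracy, specificity, sensitivity) from confusion counts.

    Assumes the usual definitions:
      accuracy    = (TP + TN) / (TP + TN + FP + FN)
      specificity = TN / (TN + FP)
      sensitivity = TP / (TP + FN)
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, specificity, sensitivity


if __name__ == "__main__":
    # Counts inferred from the first CCR5 row (assumption):
    # 192 of 225 negatives and 1185 of 1200 positives classified correctly.
    acc, spec, sens = classification_metrics(tp=1185, tn=192, fp=33, fn=15)
    print(f"{acc:.2%} {spec:.2%} {sens:.2%}")  # 96.63% 85.33% 98.75%
```

With strongly imbalanced classes, as in the CCR5 and CXCR4 task, accuracy is dominated by the majority class, which is why specificity and sensitivity are reported alongside it.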
Figure 4. Features (20 are shown) with highest and lowest weights for each coreceptor usage prediction task.
Available methods. The results column contains the metric and what the classifier is predicting.
| Reference | Learning method | Training set | Testing set | Multiple alignments | Results |
| Pillai et al. 2003 | Charge rule | 271 | - | yes | Accuracy (CXCR4): 87.45% |
| Resch et al. 2001 | Neural networks | 181 | - | yes | Specificity (X4): 90.00% |
| Pillai et al. 2003 | SVM | 271 | - | yes | Accuracy (CXCR4): 90.86% |
| Jensen et al. 2003 | PSSM¹ | 213 | 175 | yes | Specificity (CXCR4): 96.00% |
| Jensen et al. 2006 | PSSM | 279 | - | yes | Specificity (CXCR4): 94.00%² |
| Sander et al. 2007 | SVM | 432 | - | yes | Accuracy (CXCR4): 91.56% |
| Xu et al. 2007 | Random forests | 651 | - | yes | Accuracy (R5): 95.10% |
| Lamers et al. 2008 | Neural networks | 149 | - | yes | Accuracy (R5X4): 75.50% |
| This manuscript | SVM | 1425 | 1425 | | Accuracy (CXCR4): |
¹ Position-specific scoring matrices
² Subtype C