Dominik Heider, Jens Verheyen, Daniel Hoffmann.
Abstract
BACKGROUND: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths.
Year: 2011 PMID: 21453485 PMCID: PMC3079662 DOI: 10.1186/1756-0500-4-94
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Workflow of the applied procedure. The protein sequences are first encoded as vectors of numerical descriptor values, e.g. with the hydropathy values of Kyte and Doolittle [30]. These vectors are normalized to a fixed length by applying the described interpolation methods. Finally, the normalized encoded sequences are used as input for the random forests in the classification.
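The encoding step of this workflow can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `encode_hydropathy` is hypothetical, and the table holds the standard Kyte-Doolittle hydropathy values for the 20 canonical amino acids.

```python
# Kyte-Doolittle hydropathy index for the 20 canonical amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_hydropathy(sequence):
    """Map a protein sequence to a list of hydropathy values (hypothetical helper).

    Raises KeyError for symbols outside the 20 canonical amino acids;
    how ambiguous residues were handled in the study is not stated here.
    """
    return [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]


# The first four residues of an HIV-1 protease sequence.
print(encode_hydropathy("PQIT"))  # [-1.6, -3.5, 4.5, -0.7]
```

The resulting vector has one value per residue, so its length still varies with the sequence length until the interpolation step is applied.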
Summary of the datasets
| dataset | # sequences | positive samples | negative samples | length |
|---|---|---|---|---|
| APV | 768 | 61% | 39% | 99.70 ± 1.24% |
| ATV | 329 | 48% | 52% | 99.59 ± 1.06% |
| IDV | 827 | 51% | 49% | 99.68 ± 1.23% |
| LPV | 517 | 45% | 55% | 99.73 ± 1.22% |
| NFV | 844 | 40% | 60% | 99.67 ± 1.22% |
| RTV | 795 | 49% | 51% | 99.71 ± 1.24% |
| SQV | 826 | 60% | 40% | 99.69 ± 1.23% |
| 3TC | 633 | 31% | 69% | 240.87 ± 2.33% |
| ABC | 628 | 29% | 71% | 240.54 ± 4.20% |
| AZT | 630 | 52% | 48% | 240.87 ± 2.33% |
| d4T | 630 | 54% | 46% | 240.54 ± 4.20% |
| ddI | 632 | 49% | 51% | 240.87 ± 2.33% |
| TDF | 353 | 67% | 33% | 240.72 ± 1.88% |
| DLV | 732 | 64% | 36% | 241.28 ± 1.49% |
| EFV | 734 | 62% | 38% | 241.32 ± 1.49% |
| NVP | 746 | 57% | 43% | 241.30 ± 1.48% |
| BVM | 155 | 28% | 72% | 20.77 ± 2.07% |
| GTP | 1435 | 46% | 54% | 232.18 ± 22.37% |
| MIP | 49 | 39% | 61% | 261.41 ± 21.47% |
The table summarizes the number of sequences within each dataset, the percentages of positive and negative samples, and the average sequence lengths ± standard deviations in percent.
Prediction results
| Drug | target length | linear | splines | fmm | periodic | natural |
|---|---|---|---|---|---|---|
| APV | max | 0.934 ± 0.001 | 0.929 ± 0.002 | 0.928 ± 0.001 | 0.927 ± 0.001 | 0.928 ± 0.001 |
| APV | most | 0.932 ± 0.001 | 0.934 ± 0.001 | 0.932 ± 0.002 | 0.933 ± 0.001 | 0.933 ± 0.001 |
| ATV | max | 0.936 ± 0.002 | 0.917 ± 0.003 | 0.920 ± 0.002 | 0.919 ± 0.002 | 0.920 ± 0.002 |
| ATV | most | 0.928 ± 0.002 | 0.915 ± 0.003 | 0.919 ± 0.003 | 0.918 ± 0.003 | 0.920 ± 0.003 |
| IDV | max | 0.972 ± 0.001 | 0.968 ± 0.001 | 0.968 ± 0.001 | 0.968 ± 0.001 | 0.968 ± 0.001 |
| IDV | most | 0.970 ± 0.001 | 0.970 ± 0.001 | 0.971 ± 0.001 | 0.971 ± 0.001 | 0.972 ± 0.001 |
| LPV | max | 0.964 ± 0.001 | 0.963 ± 0.001 | 0.963 ± 0.001 | 0.962 ± 0.001 | 0.963 ± 0.001 |
| LPV | most | 0.963 ± 0.001 | 0.964 ± 0.001 | 0.963 ± 0.001 | 0.963 ± 0.001 | 0.964 ± 0.001 |
| NFV | max | 0.941 ± 0.001 | 0.938 ± 0.001 | 0.940 ± 0.001 | 0.940 ± 0.001 | 0.940 ± 0.001 |
| NFV | most | 0.939 ± 0.001 | 0.943 ± 0.001 | 0.947 ± 0.001 | 0.946 ± 0.001 | 0.945 ± 0.001 |
| RTV | max | 0.984 ± 0.001 | 0.980 ± 0.001 | 0.981 ± 0.001 | 0.981 ± 0.001 | 0.981 ± 0.001 |
| RTV | most | 0.983 ± 0.001 | 0.986 ± 0.001 | 0.986 ± 0.001 | 0.986 ± 0.001 | 0.986 ± 0.001 |
| SQV | max | 0.955 ± 0.001 | 0.950 ± 0.001 | 0.951 ± 0.001 | 0.951 ± 0.001 | 0.951 ± 0.001 |
| SQV | most | 0.952 ± 0.001 | 0.953 ± 0.001 | 0.957 ± 0.001 | 0.955 ± 0.001 | 0.956 ± 0.001 |
| 3TC | max | 0.933 ± 0.002 | 0.936 ± 0.002 | 0.939 ± 0.002 | 0.938 ± 0.002 | 0.939 ± 0.002 |
| 3TC | most | 0.927 ± 0.003 | 0.934 ± 0.002 | 0.937 ± 0.002 | 0.937 ± 0.002 | 0.937 ± 0.003 |
| ABC | max | 0.916 ± 0.002 | 0.906 ± 0.002 | 0.909 ± 0.003 | 0.909 ± 0.002 | 0.909 ± 0.002 |
| ABC | most | 0.914 ± 0.003 | 0.910 ± 0.003 | 0.918 ± 0.003 | 0.919 ± 0.002 | 0.918 ± 0.003 |
| AZT | max | 0.908 ± 0.002 | 0.890 ± 0.002 | 0.894 ± 0.002 | 0.893 ± 0.002 | 0.894 ± 0.002 |
| AZT | most | 0.908 ± 0.002 | 0.898 ± 0.002 | 0.905 ± 0.002 | 0.903 ± 0.002 | 0.904 ± 0.002 |
| d4T | max | 0.903 ± 0.002 | 0.886 ± 0.002 | 0.889 ± 0.002 | 0.889 ± 0.002 | 0.889 ± 0.002 |
| d4T | most | 0.900 ± 0.002 | 0.892 ± 0.002 | 0.901 ± 0.002 | 0.899 ± 0.002 | 0.901 ± 0.002 |
| ddI | max | 0.853 ± 0.003 | 0.829 ± 0.003 | 0.837 ± 0.003 | 0.836 ± 0.003 | 0.836 ± 0.002 |
| ddI | most | 0.852 ± 0.003 | 0.841 ± 0.003 | 0.846 ± 0.003 | 0.839 ± 0.003 | 0.844 ± 0.003 |
| TDF | max | 0.832 ± 0.004 | 0.808 ± 0.005 | 0.817 ± 0.004 | 0.818 ± 0.005 | 0.816 ± 0.005 |
| TDF | most | 0.825 ± 0.005 | 0.812 ± 0.005 | 0.813 ± 0.005 | 0.814 ± 0.005 | 0.813 ± 0.005 |
| DLV | max | 0.901 ± 0.002 | 0.888 ± 0.002 | 0.891 ± 0.002 | 0.891 ± 0.002 | 0.891 ± 0.002 |
| DLV | most | 0.898 ± 0.002 | 0.881 ± 0.002 | 0.882 ± 0.002 | 0.883 ± 0.002 | 0.883 ± 0.002 |
| EFV | max | 0.932 ± 0.002 | 0.921 ± 0.002 | 0.928 ± 0.002 | 0.929 ± 0.002 | 0.928 ± 0.002 |
| EFV | most | 0.925 ± 0.002 | 0.911 ± 0.002 | 0.915 ± 0.002 | 0.919 ± 0.002 | 0.915 ± 0.002 |
| NVP | max | 0.917 ± 0.002 | 0.910 ± 0.002 | 0.916 ± 0.002 | 0.917 ± 0.002 | 0.916 ± 0.002 |
| NVP | most | 0.908 ± 0.003 | 0.902 ± 0.003 | 0.906 ± 0.003 | 0.909 ± 0.003 | 0.906 ± 0.003 |
| BVM | max = most | 0.918 ± 0.002 | 0.932 ± 0.002 | 0.932 ± 0.002 | 0.923 ± 0.003 | 0.933 ± 0.002 |
| GTP | max | 0.981 ± 0.001 | 0.979 ± 0.001 | 0.978 ± 0.001 | 0.977 ± 0.001 | 0.979 ± 0.001 |
| GTP | most | 0.980 ± 0.001 | 0.979 ± 0.001 | 0.979 ± 0.001 | 0.976 ± 0.001 | 0.979 ± 0.001 |
| MIP | max | 0.815 ± 0.010 | 0.789 ± 0.013 | 0.789 ± 0.011 | 0.787 ± 0.016 | 0.788 ± 0.017 |
| MIP | most | 0.827 ± 0.012 | 0.815 ± 0.014 | 0.813 ± 0.014 | 0.816 ± 0.013 | 0.812 ± 0.013 |
AUC ± standard deviations. Here, max denotes the maximal sequence length occurring within a dataset and most denotes the most frequent sequence length in a dataset. For BVM, most and max are the same.
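The two target lengths named in this caption can be derived from a dataset as in the sketch below. The helper name `target_lengths` is hypothetical, and the tie-breaking rule (prefer the longer length when two lengths are equally frequent) is an assumption that the source does not specify.

```python
from collections import Counter


def target_lengths(sequences):
    """Return (max_length, most_frequent_length) for a list of sequences.

    Assumption: ties in frequency are resolved in favor of the longer length.
    """
    lengths = [len(seq) for seq in sequences]
    counts = Counter(lengths)
    most = max(counts, key=lambda length: (counts[length], length))
    return max(lengths), most


# Toy sequences with lengths 4, 3, 3, 5.
print(target_lengths(["AAAA", "AAA", "AAA", "AAAAA"]))  # (5, 3)
```

When every sequence in a dataset has the same most frequent and maximal length, both values coincide, which is the situation reported for BVM.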
Wilcoxon Signed-Rank tests
| method | APV | ATV | IDV | LPV | NFV | RTV | SQV | 3TC | ABC | AZT | d4T | ddI | TDF | DLV | EFV | NVP | BVM | GTP | MIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| linear | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * | * | |
| splines | |||||||||||||||||||
| fmm | * | * | |||||||||||||||||
| periodic | * | * | |||||||||||||||||
| natural | * | * | |||||||||||||||||
| linear | * | * | * | * | * | * | * | * | * | * | * | ||||||||
| splines | |||||||||||||||||||
| fmm | * | * | * | * | * | * | * | ||||||||||||
| periodic | * | * | * | ||||||||||||||||
| natural | * | * | * | * | * | * | |||||||||||||
Wilcoxon signed-rank tests on the AUC distributions. The best-performing method with significantly higher AUC values (α = 0.05) is marked with *. When a test is not significant, more than one method is marked. The upper part shows the results of the max-interpolation, the lower part the results of the most-interpolation.
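A paired comparison of this kind can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: the function name is hypothetical, the test is one-sided (first sample greater than second), the p-value uses the normal approximation without tie or continuity correction, and the numbers in the usage example are made up.

```python
import math


def wilcoxon_signed_rank(x, y):
    """One-sided Wilcoxon signed-rank test for paired samples (H1: x > y).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)

    # Normal approximation of the null distribution of w_plus.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return w_plus, p_value


# Illustrative paired values only; not data from the study.
method_a = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
method_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
w_plus, p = wilcoxon_signed_rank(method_a, method_b)
print(w_plus, p < 0.05)  # 55.0 True
```

For small samples or many ties, an exact test from a statistics library is preferable to this approximation.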
Comparison of the prediction accuracy
| drug | Rhee | Hou | this study |
|---|---|---|---|
| APV | 84% | 88% | |
| ATV | 77% | 86% | |
| IDV | 79% | 86% | |
| LPV | 81% | 91% | |
| NFV | 82% | 87% | |
| RTV | 89% | 93% | |
| SQV | 84% | ||
| 3TC | * | ||
| ABC | 77% | * | |
| AZT | 76% | * | |
| d4T | 78% | * | |
| ddI | 75% | * | |
| TDF | 73% | * | |
| DLV | 84% | * | |
| EFV | 87% | * | |
| NVP | * | 87% | |
*: Hou et al. used only the PI datasets [21].
AUC comparison
| drug | Kierczak | this study |
|---|---|---|
| 3TC | 0.95 ± 0.03 | 0.94 ± 0.00 |
| ABC | 0.83 ± 0.05 | 0.92 ± 0.00 |
| AZT | 0.89 ± 0.05 | 0.91 ± 0.00 |
| d4T | 0.85 ± 0.06 | 0.90 ± 0.00 |
| ddI | 0.82 ± 0.08 | 0.85 ± 0.00 |
| TDF | 0.85 ± 0.05 | 0.83 ± 0.00 |
| DLV | 0.76 ± 0.06 | 0.90 ± 0.00 |
| EFV | * | 0.93 ± 0.00 |
| NVP | 0.85 ± 0.05 | 0.92 ± 0.00 |
AUC ± standard deviations.
*: Kierczak et al. analyzed the NRTI and NNRTI datasets except EFV [22].
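For readers unfamiliar with the AUC values compared in these tables, the following sketch computes the area under the ROC curve as the fraction of positive-negative pairs that a classifier ranks correctly, counting ties as half. The function name `auc` and the toy labels and scores are illustrative assumptions; the study's own evaluation code is not shown in this record.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (1 = positive).
    scores: classifier scores, higher meaning "more likely positive".
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # pair ranked correctly
            elif p == n:
                correct += 0.5   # tie counts as half
    return correct / (len(positives) * len(negatives))


# Toy example: three of the four positive-negative pairs are ordered correctly.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for datasets of the sizes listed in the summary table.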
Figure 2. Most important sequence positions for the PI classification. Sequences of HIV-1 protease with the ten most important positions marked in gray.
Figure 3. Simple linear interpolation. Circles mark descriptor values for the amino acids, Xs represent interpolated values. In this example, the sequence is interpolated from 8 to 15 values.
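The interpolation from 8 to 15 values described in this caption can be reproduced with the sketch below. It assumes that the target points are evenly spaced between the first and last original position, so both endpoints are preserved; the function name `interpolate_linear` is hypothetical and the input values are illustrative.

```python
def interpolate_linear(values, target_length):
    """Resample a list of numbers to target_length by linear interpolation.

    Assumption: output points are evenly spaced over the original index
    range, so the first and last original values are kept.
    """
    n = len(values)
    if n == 0 or target_length < 1:
        raise ValueError("need a non-empty input and a positive target length")
    if n == 1:
        return [values[0]] * target_length
    if target_length == 1:
        return [values[0]]

    step = (n - 1) / (target_length - 1)
    result = []
    for i in range(target_length):
        position = i * step
        left = min(int(position), n - 2)   # index of the left neighbor
        fraction = position - left         # relative distance to the left neighbor
        result.append(values[left] + fraction * (values[left + 1] - values[left]))
    return result


# Eight illustrative descriptor values stretched to fifteen.
stretched = interpolate_linear([0, 2, 4, 6, 8, 10, 12, 14], 15)
print(len(stretched))  # 15
print(stretched[:5])   # [0.0, 1.0, 2.0, 3.0, 4.0]
```

With 8 inputs and 15 outputs the step is 0.5, so every second output coincides with an original value and the values in between are midpoints of their neighbors, matching the circles and Xs in the figure.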