Koen Van der Borght, Elke Van Craenenbroeck, Pierre Lecocq, Margriet Van Houtte, Barbara Van Kerckhove, Lee Bacheler, Geert Verbeke, Herman van Vlijmen.
Abstract
BACKGROUND: Linear regression models are used to quantitatively predict drug resistance, the phenotype, from the HIV-1 viral genotype. As new antiretroviral drugs become available, new resistance pathways emerge and the number of resistance associated mutations continues to increase. To accurately identify which drug options are left, the main goal of the modeling has been to maximize predictivity and not interpretability. However, we originally selected linear regression as the preferred method for its transparency as opposed to other techniques such as neural networks. Here, we apply a method to lower the complexity of these phenotype prediction models using a 3-fold cross-validated selection of mutations.
Year: 2011 PMID: 21966893 PMCID: PMC3223907 DOI: 10.1186/1471-2105-12-386
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Complexity and performance of 3F and Reference models on genotype-phenotype data sequenced at Virco up to September 2006
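| Drug | Samples (up to Sep 2006) | Reference Sep 2006: single terms (b) | Reference Sep 2006: interaction terms (c) | Reference Sep 2006: mutations (d) | 3F (a): single terms (b) | 3F (a): interaction terms (c) | 3F (a): mutations (d) | Unseen data Sep 2006 - Dec 2008: samples | Unseen data: ase (e) | Unseen data: ase (e) |
|---|---|---|---|---|---|---|---|---|---|---|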
| Nucleoside RT inhibitors | ||||||||||
| AZT | 45734 | 80 | 108 | 123 | 66 | 77 | 102 | 8698 | 0.091 | 0.093 |
| 3TC | 47422 | 59 | 64 | 70 | 43 | 52 | 45 | 8733 | 0.059 | 0.059 |
| ddI | 47269 | 49 | 21 | 62 | 50 | 25 | 54 | 8746 | 0.054 | 0.054 |
| d4T | 47235 | 47 | 34 | 68 | 54 | 20 | 60 | 8749 | 0.050 | 0.050 |
| ABC | 45908 | 71 | 46 | 90 | 63 | 24 | 68 | 8749 | 0.048 | 0.048 |
| FTC | 16440 | 31 | 35 | 46 | 34 | 34 | 36 | 8722 | 0.086 | 0.086 |
| TDF | 31640 | 64 | 91 | 110 | 79 | 83 | 111 | 8757 | 0.065 | 0.064 |
| Nonnucleoside RT inhibitors | ||||||||||
| NVP | 47400 | 124 | 190 | 142 | 103 | 148 | 110 | 8729 | 0.101 | 0.100 |
| EFV | 46054 | 191 | 167 | 211 | 126 | 101 | 142 | 8687 | 0.266 | 0.264 |
| ETR | 18166 | 122 | 158 | 160 | 94 | 72 | 119 | 8493 | 0.126 | 0.124 |
a. July-September 2006 genotype-phenotype data were used as the validation set for 3F.
b. Number of single terms (first-order effects) in the model.
c. Number of interaction terms in the model.
d. Number of mutations in the model.
e. Average squared error on unseen genotype-phenotype data collected between September 2006 and December 2008.
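The average squared error (ase) named in the last footnote can be written out explicitly. The symbols are introduced here for illustration and are not taken from the source: $y_i$ is the measured log fold change of unseen sample $i$, $\hat{y}_i$ is the model prediction, and $n$ is the number of unseen samples.

```latex
\mathrm{ase} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```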
Figure 1. RT mutations as first-order effects in the 3F linear regression models. RT mutations with their regression coefficient (RWF, in log Fold Change) and standard error (STDERR) in the 3F linear models. Mutations with |RWF| ≥ 0.5 are labeled. RWF stands for Resistance Weight Factor (terminology adopted from VirtualPhenotype™-LM).
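Because the RWFs are regression coefficients on the log fold-change scale, a prediction is the sum of the weights of the terms present in a genotype. The following is a minimal sketch of that arithmetic, not the authors' implementation. The weights, the zero intercept and the base-10 logarithm are all assumptions chosen for illustration; only the mutation names come from the tables in this record.

```python
# Hypothetical resistance weight factors (RWF), assumed to be in log10 fold change.
# These numbers are illustrative only and do not come from the published models.
RWF_SINGLE = {"103N": 1.2, "181C": 0.9}
RWF_PAIR = {frozenset({"103N", "181C"}): 0.3}
INTERCEPT = 0.0  # assumed

def predict_log_fc(mutations):
    """Sum the intercept, first-order terms and interaction terms present in the genotype."""
    present = set(mutations)
    total = INTERCEPT
    total += sum(w for m, w in RWF_SINGLE.items() if m in present)
    # An interaction term contributes only when both of its mutations are present.
    total += sum(w for pair, w in RWF_PAIR.items() if pair <= present)
    return total

log_fc = predict_log_fc(["103N", "181C"])
fold_change = 10 ** log_fc  # back-transform, under the log10 assumption
```

With a sparse mutation vocabulary, a dictionary lookup like this keeps each term's contribution visible, which is the transparency the abstract cites as the reason for preferring linear regression.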
Figure 2. Impact of known NNRTI resistance-associated mutations. LDA on reference (blue) and 3F (red) predicted phenotypes. Mutations shown are from the list of 44 known non-nucleoside RT inhibitor resistance-associated mutations [14] having F1 > 0, ranked by F1.
Novel non-nucleoside RT inhibitor resistance associated mutations
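| Mutation (c) | Wild-type frequency (d) | Mutation frequency (e) | FC NVP (a) | FC EFV (a) | FC ETR (a) | max(LDA F1) (b) | n11 (b, f) | n11/(n11 + n01) (b, f, g) | min(LDA cutoff) (b, h) |
|---|---|---|---|---|---|---|---|---|---|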
|  | 67699 | 31 |  |  |  | 0.327 (EFV) | 13 (ETR) | 8/(8 + 10) (EFV) | 1.35 (ETR) |
|  | 76783 | 31 |  |  |  | 0.118 (EFV) | 2 (EFV) | 2/(2 + 1) (EFV) | 2.03 (ETR) |
|  | 67699 | 17 |  |  |  | 0.063 (ETR) | 16 (ETR) | 16/(16 + 472) (ETR) | 1.44 (ETR) |
| K219H | 67913 | 129 |  |  |  | 0.047 (ETR) | 14 (ETR) | 14/(14 + 447) (ETR) | 1.45 (ETR) |
| K219D | 67913 | 161 |  |  |  | 0.044 (ETR) | 24 (ETR) | 24/(24 + 894) (ETR) | 1.22 (ETR) |
|  | 67699 | 15 |  |  |  | 0.039 (ETR) | 8 (ETR) | 8/(8 + 391) (ETR) | 1.50 (ETR) |
| T376S | 15617 | 7415 |  |  |  | 0.035 (EFV) | 135 (EFV) | 135/(135 + 145) (EFV) | 1.67 (ETR) |
|  | 72643 | 3 | NA | NA | NA | 0.019 (ETR) | 3 (ETR) | 3/(3 + 310) (ETR) | 1.45 (ETR) |
|  | 77197 | 8 |  |  |  | 0.017 (ETR) | 6 (ETR) | 6/(6 + 678) (ETR) | 1.63 (ETR) |
|  | 72643 | 9 |  |  |  | 0.016 (ETR) | 4 (ETR) | 4/(4 + 503) (ETR) | 1.33 (ETR) |
| K102L | 72791 | 12 |  |  |  | 0.015 (ETR) | 4 (ETR) | 4/(4 + 505) (ETR) | 1.77 (ETR) |
|  | 72175 | 94 |  |  |  | 0.015 (ETR) | 4 (ETR) | 4/(4 + 434) (ETR) | 1.66 (ETR) |
|  | 75529 | 44 |  |  |  | 0.015 (ETR) | 5 (ETR) | 5/(5 + 617) (ETR) | 1.69 (ETR) |
|  | 75869 | 1828 |  |  |  | 0.014 (ETR) | 14 (ETR) | 14/(14 + 157) (ETR) | 2.10 (ETR) |
| M357T | 44115 | 17866 | NA | 0.013 (ETR) | 117 (ETR) | 117/(117 + 199) (ETR) | 1.88 (ETR) | ||
| T139R | 76899 | 243 |  |  |  | 0.010 (ETR) | 2 (ETR) | 2/(2 + 137) (ETR) | 2.22 (ETR) |
| E370G | 75915 | 489 | NA | NA | NA | 0.007 (ETR) | 2 (ETR) | 1/(1 + 1) (EFV) | 2.38 (ETR) |
|  | 45829 | 18410 | NA | NA | NA | 0.006 (ETR) | 52 (ETR) | 52/(52 + 94) (ETR) | 2.16 (ETR) |
|  | 79037 | 98 |  |  |  | 0.004 (ETR) | 1 (ETR) | 1/(1 + 353) (ETR) | 1.95 (ETR) |
| S379C | 69973 | 3578 | NA | NA | NA | 0.001 (ETR) | 1 (ETR) | 1/(1 + 1) (ETR) | 3.35 (ETR) |
| R206I | 79051 | 8 |  |  |  | NA | 0 (ETR) | 0/(0 + 102) (ETR) | 2.33 (ETR) |
| S134N | 79041 | 19 |  |  |  | NA | 0 (ETR) | 0/(0 + 69) (ETR) | 2.45 (ETR) |
|  | 76783 | 59 | NA | NA | NA | NA | 0 (ETR) | 0/(0 + 20) (ETR) | 2.73 (ETR) |
| I382T | 78025 | 329 |  |  |  | NA | 0 (ETR) | 0/(0 + 2) (ETR) | 3.30 (ETR) |
| D237E | 78246 | 423 | NA | NA | NA | NA | 0 (EFV) | 0/(0 + 3) (EFV) | 3.67 (ETR) |
| N348T | 74372 | 170 | NA | NA | NA | NA | 0 (ETR) | 0/(0 + 0) (ETR) | 4.05 (ETR) |
| E399G | 66049 | 670 | NA | NA | NA | NA | 0 (ETR) | 0/(0 + 0) (ETR) | 4.10 (ETR) |
|  | 72912 | 10 |  |  |  | NA | 0 (EFV) | 0/(0 + 2) (EFV) | 4.16 (ETR) |
|  | 76892 | 41 |  |  |  | NA | 0 (NVP) | 0/(0 + 0) (NVP) | 4.70 (NVP) |
|  | 72462 | 5930 | NA | NA | NA | NA | 0 (ETR) | 0/(0 + 0) (ETR) | 5.01 (ETR) |
|  | 72175 | 50 |  |  |  | NA | 0 (EFV) | 0/(0 + 3) (EFV) | 5.04 (NVP) |
|  | 72175 | 7 |  |  |  | NA | 0 (EFV) | 0/(0 + 3) (EFV) | 5.08 (EFV) |
| T139K | 76899 | 348 |  |  |  | NA | 0 (ETR) | 0/(0 + 0) (ETR) | 5.10 (ETR) |
| T165L | 75078 | 183 | NA | NA | NA | NA | 0 (ETR) | 0/(0 + 0) (ETR) | 5.81 (ETR) |
| 59810 | 1756 | NA | NA | 0 (NVP) | 0/(0 + 0) (NVP) | 6.07 (NVP) | |||
| V241M | 78771 | 23 |  |  |  | NA | 0 (NVP) | 0/(0 + 0) (NVP) | 6.90 (NVP) |
| I382L | 78025 | 228 | NA | NA | NA | NA | 0 (NVP) | 0/(0 + 0) (NVP) | 7.30 (NVP) |
| G335S | 65035 | 1877 | NA | NA | NA | NA | 0 (ETR) | 0/(0 + 0) (ETR) | 7.55 (ETR) |
|  | 66049 | 10697 | NA | NA | NA | NA | 0 (ETR) | 0/(0 + 0) (ETR) | 7.98 (ETR) |
| R358K | 70517 | 5995 | NA | NA | NA | NA | 0 (NVP) | 0/(0 + 0) (NVP) | 8.06 (NVP) |
a. Fold-Change (FC) range from 3 measurements, unless otherwise indicated between brackets. FC > Biological Cut-Off (BCO) in bold, FC ≤ BCO in italic; BCO for NVP is 6.0, BCO for EFV is 3.3 and BCO for ETR is 3.2.
b. Summarized for the three non-nucleoside RT inhibitors (NNRTI).
c. Top 40 mutations, ranked by max(LDA F1) descending, then by min(LDA cutoff) ascending. Mutations shown are from the list of 124 NNRTI mutations with RWF ≥ 0 and LDA cutoff > 0 for NVP, EFV and ETR. Known NNRTI positions or novel mutations listed in [32,33] are shown in bold.
d. Frequency of wild-type (not within a mixture) in LDA data set.
e. Frequency of mutation (not within a mixture) in LDA data set.
f. n11 is the number of samples with the amino acid mutation having a predicted phenotype above the LDA cutoff.
g. n01 is the number of samples with the wild-type amino acid having a predicted phenotype above the LDA cutoff.
h. Cutoff in log Fold-Change (taking the wild-type and mutation frequency percentages as prior probabilities in the LDA can result in cutoff values outside of the range of the predicted phenotypes).
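The counts n11 and n01 defined in the footnotes, and the fraction n11/(n11 + n01) reported in the table, can be illustrated with a short sketch. The function name, argument names and sample values are hypothetical. The sketch assumes each sample is either pure wild type or pure mutant at the position (mixtures excluded, as in the footnotes) and that "above" means strictly greater than the cutoff.

```python
def lda_counts(predicted_log_fc, has_mutation, cutoff):
    """Count samples predicted above the cutoff, split into mutant (n11) and wild type (n01)."""
    pairs = list(zip(predicted_log_fc, has_mutation))
    n11 = sum(1 for y, mutant in pairs if mutant and y > cutoff)
    n01 = sum(1 for y, mutant in pairs if not mutant and y > cutoff)
    total = n11 + n01
    # Rows reported as 0/(0 + 0) have no defined fraction, so return None there.
    fraction = n11 / total if total else None
    return n11, n01, fraction

# Hypothetical predicted phenotypes (log fold change) and mutation indicators.
predicted = [2.0, 1.6, 0.3, 1.9, 0.1]
mutant = [True, True, True, False, False]
n11, n01, fraction = lda_counts(predicted, mutant, cutoff=1.5)
```

A cutoff that lies above every predicted phenotype gives n11 = n01 = 0, which matches the 0/(0 + 0) entries at the bottom of the table.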
Site-Directed Mutants of novel NNRTI resistance associated mutations 139R, 219D and 219H in combination with 103N+181C and SDM 181G
| SDM | drug | |||
|---|---|---|---|---|
| 139R | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 219D | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 219H | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 103N | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 103N+181C | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 139R+103N+181C | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 219D+103N+181C | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 219H+103N+181C | NVP | | | |
| | EFV | | | |
| | ETR | | | |
| 181G | NVP | | | |
| | EFV | | | |
| | ETR | | | |
a. FC > Biological Cut-Off (BCO) in bold, FC ≤ BCO in italic; BCO for NVP is 6.0, BCO for EFV is 3.3 and BCO for ETR is 3.2.
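The bold/italic convention in the footnote is a per-drug threshold comparison. A minimal sketch follows; the cut-off values are the ones quoted in the footnote, while the function name is hypothetical.

```python
# Biological cut-offs (fold change) quoted in the table footnote.
BCO = {"NVP": 6.0, "EFV": 3.3, "ETR": 3.2}

def above_bco(drug, fold_change):
    """Return True when the fold change exceeds the drug's biological cut-off (FC > BCO)."""
    return fold_change > BCO[drug]
```

The comparison is strict, so a fold change exactly equal to the cut-off falls in the "FC ≤ BCO" group.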
Figure 3. Reference and 3F methodology: schematic overview. The 3F method used the cross-validated prediction error (CVPRESS) as its selection criterion, whereas the reference approach used significance levels (p-values). The reference approach used two stepwise regression procedures: all possible pairs of the mutations in the first-order model were formed, and these pairs were the candidates for entry into the second-order model. In the 3F method, the initial search space consisted of all individual mutations, and a mutation pair could enter the model only if both mutations in the pair had already been selected.
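The 3F search described in the caption can be sketched as a forward selection driven by a 3-fold cross-validated PRESS, with the rule that a pair becomes a candidate only after both of its mutations are in the model. This is an illustrative simplification, not the authors' procedure: it uses a single random division, adds terms without ever removing them, and stops when no candidate lowers CVPRESS. All function names are hypothetical.

```python
import itertools
import numpy as np

def cv_press(X, y, folds):
    """Sum of squared held-out prediction errors for an OLS fit with intercept."""
    press = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        press += float(np.sum((y[test_idx] - A_test @ beta) ** 2))
    return press

def design(G, terms):
    """Build one column per term: (i,) is a single mutation, (i, j) a mutation pair."""
    cols = [np.prod(G[:, list(t)], axis=1) for t in terms]
    return np.column_stack(cols) if cols else np.empty((G.shape[0], 0))

def forward_select(G, y, n_folds=3, seed=0):
    """Greedy forward selection of terms that lower the cross-validated PRESS."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)  # one random division
    selected = []
    best = cv_press(design(G, selected), y, folds)
    while True:
        singles = sorted(t[0] for t in selected if len(t) == 1)
        candidates = [(i,) for i in range(G.shape[1]) if (i,) not in selected]
        # Pairs are eligible only when both mutations are already selected.
        candidates += [p for p in itertools.combinations(singles, 2) if p not in selected]
        scored = [(cv_press(design(G, selected + [c]), y, folds), c) for c in candidates]
        if not scored:
            break
        score, term = min(scored)
        if score >= best:  # assumed stopping rule: no candidate improves CVPRESS
            break
        selected.append(term)
        best = score
    return selected, best
```

Here `G` is a samples-by-mutations 0/1 matrix and `y` the log fold change. Restricting pairs to already-selected mutations keeps the candidate set small compared with forming all possible pairs up front.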
3F model selection on genotype-phenotype data up to September 2006
| Class | Drug | # 3F models (a) | # lower SBC | # lower AIC | SBC < reference (c) | AIC < reference (d) | Test set samples (b) | ase (b) | # random divisions (e) |
|---|---|---|---|---|---|---|---|---|---|
| Nucleoside RT inhibitors (f) | AZT | 300 | 86 | 0 | yes | no | 800 | 0.103 | 296 |
| | 3TC | 150 | 60 | 34 | yes | no | 807 | 0.037 | 99 |
| | ddI | 150 | 20 | 70 | no | yes | 807 | 0.049 | 83 |
| | d4T | 120 | 41 | 35 | yes | no | 806 | 0.040 | 81 |
| | ABC | 200 | 111 | 53 | yes | no | 807 | 0.038 | 95 |
| | FTC | 80 | 28 | 22 | yes | yes | 804 | 0.071 | 76 |
| | TDF | 400 | 66 | 196 | no | yes | 807 | 0.039 | 298 |
| NNRTI (g) | NVP | 400 | 93 | 0 | yes | no | 801 | 0.089 | 391 |
| | EFV | 500 | 101 | 0 | yes | no | 807 | 0.246 | 386 |
| | ETR | 700 | 49 | 0 | yes | no | 777 | 0.113 | 656 |
| Protease inhibitors | IDV | 485 | 50 | 51 | yes | yes | 805 | 0.075 | 482 |
| | NFV | 375 | 64 | 6 | yes | yes | 808 | 0.063 | 375 |
| | SQV | 600 | 53 | 0 | yes | no | 807 | 0.092 | 575 |
| | APV | 1000 | 0 | 656 | no | yes | 808 | 0.060 | 709 |
| | LPV | 500 | 205 | 28 | yes | no | 807 | 0.157 | 319 |
| | ATV | 1275 | 0 | 2 | no | yes | 805 | 0.117 | 1158 (h) |
| | TPV | 1000 | 641 | 142 | yes | no | 806 | 0.059 | 428 |
| | DRV | 1000 | 823 | 799 | yes | yes | 816 | 0.096 | 707 |
a. The number of 3F models generated was arbitrary but taken large enough such that at least one 3F model was found with a lower SBC or AIC than the reference on the genotype-phenotype data set up to July 2006.
b. From the remaining 3F models with lower SBC or AIC than the reference, the 3F model was then selected with the lowest average squared error (ase) on an unseen genotype-phenotype data set collected between July and September 2006 (test set) containing approximately 800 samples.
c. SBC of the selected 3F model < SBC reference on the test set (yes/no).
d. AIC of the selected 3F model < AIC reference on the test set (yes/no).
e. The number of different random divisions used in the stepwise regression in the selected 3F model.
f. For the nucleoside RT inhibitors the number of random divisions needed was less than 100, with exception of AZT and TDF.
g. For the non-nucleoside RT inhibitors most random divisions were needed for ETR.
h. ATV was the only drug for which more than 1000 different random divisions were needed.
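The two-stage choice described in footnotes a and b (keep the 3F models that beat the reference on SBC or AIC, then take the one with the lowest test-set ase) can be sketched as follows. The class, field and function names are hypothetical, the numbers in the usage lines are invented, and the least-squares forms AIC = n ln(SSE/n) + 2p and SBC = n ln(SSE/n) + p ln(n) are assumed here; the source does not state which formulas were used.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    sse_train: float  # sum of squared errors on the training data
    n_params: int     # number of regression parameters, including the intercept
    ase_test: float   # average squared error on the held-out test set

def aic(sse, n, p):
    """Akaike information criterion for a least-squares fit (assumed form)."""
    return n * math.log(sse / n) + 2 * p

def sbc(sse, n, p):
    """Schwarz Bayesian criterion for a least-squares fit (assumed form)."""
    return n * math.log(sse / n) + p * math.log(n)

def select_model(candidates, reference, n_train):
    """Keep candidates with lower AIC or SBC than the reference, then pick the lowest test ase."""
    ref_aic = aic(reference.sse_train, n_train, reference.n_params)
    ref_sbc = sbc(reference.sse_train, n_train, reference.n_params)
    kept = [c for c in candidates
            if aic(c.sse_train, n_train, c.n_params) < ref_aic
            or sbc(c.sse_train, n_train, c.n_params) < ref_sbc]
    return min(kept, key=lambda c: c.ase_test) if kept else None

# Invented numbers for illustration only.
reference = Candidate("reference", sse_train=100.0, n_params=50, ase_test=0.10)
pool = [
    Candidate("A", sse_train=101.0, n_params=30, ase_test=0.090),
    Candidate("B", sse_train=120.0, n_params=60, ase_test=0.050),
    Candidate("C", sse_train=99.0, n_params=45, ase_test=0.095),
]
chosen = select_model(pool, reference, n_train=1000)
```

In this example candidate B has the lowest test error but is filtered out first, because it is worse than the reference on both information criteria.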