N Lance Hepler, Konrad Scheffler, Steven Weaver, Ben Murrell, Douglas D Richman, Dennis R Burton, Pascal Poignard, Davey M Smith, Sergei L Kosakovsky Pond.
Abstract
Since its identification in 1983, HIV-1 has been the focus of a research effort unprecedented in scope and difficulty, whose ultimate goals, a cure and a vaccine, remain elusive. One of the fundamental challenges in accomplishing these goals is the tremendous genetic variability of the virus, with some genes differing at as many as 40% of nucleotide positions among circulating strains. Because of this, the genetic bases of many viral phenotypes, most notably the susceptibility to neutralization by a particular antibody, are difficult to identify computationally. Drawing upon open-source general-purpose machine learning algorithms and libraries, we have developed a software package, IDEPI (IDentify EPItopes), for learning genotype-to-phenotype predictive models from sequences with known phenotypes. IDEPI can apply learned models to classify sequences of unknown phenotypes, and also identify specific sequence features which contribute to a particular phenotype. We demonstrate that IDEPI achieves performance similar to or better than that of previously published approaches on four well-studied problems: finding the epitopes of broadly neutralizing antibodies (bNab), determining coreceptor tropism of the virus, identifying compartment-specific genetic signatures of the virus, and deducing drug-resistance associated mutations. The cross-platform Python source code (released under the GPL 3.0 license), documentation, issue tracking, and a pre-configured virtual machine for IDEPI can be found at https://github.com/veg/idepi.
Year: 2014 PMID: 25254639 PMCID: PMC4177671 DOI: 10.1371/journal.pcbi.1003842
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. IDEPI workflow.
Abbreviations: MSA - multiple sequence alignment; mRMR - minimum redundancy maximum relevance; SVM - support vector machine.
IDEPI performance in predicting phenotype and recovering features from simulated data.
| Simulation | L | M | Median sensitivity | Median specificity | Median MCC | Median features | Mean recovery, slow (%) | Mean recovery, intermediate (%) | Mean recovery, fast (%) | FP |
|---|---|---|---|---|---|---|---|---|---|---|
| Simple | 5 | ≥1 | 0.98 | 1.0 | 0.98 | 2 | 11.1 | 56.6 | 80.0 | 0.09 |
| Intermediate | 8 | ≥2 | 0.95 | 1.0 | 0.94 | 3 | 10.4 | 42.6 | 71.6 | 0.16 |
| Complex | 10 | ≥3 | 0.85 | 0.98 | 0.78 | 3 | 6.0 | 39.4 | 58.3 | 0.16 |
| Random | N/A | N/A | 0.57 | 0.47 | 0.04 | 1 | N/A | N/A | N/A | 1 |
Forward feature selection (to optimize MCC) and 10-fold nested cross-validation were used to learn the models. L: the number of sites in an epitope; M: the number of escape mutations needed to confer resistance; epitope recovery classes are based on simulated evolutionary rates; FP: mean (per replicate) number of selected features not in a simulated epitope. A feature was counted as recovered if it was selected in 50% or more of cross-validation replicates.
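The two rules stated in this table note can be written down compactly. The following is a minimal sketch, not IDEPI's own code: the function names, the example epitope sites, and the per-replicate selections are all hypothetical and only illustrate the "at least M of L sites" resistance rule and the 50% recovery threshold.

```python
from collections import Counter


def is_resistant(mutated_sites, epitope_sites, m):
    """Return True if at least m of the epitope sites carry an escape mutation."""
    return len(set(mutated_sites) & set(epitope_sites)) >= m


def recovered_features(selected_per_replicate, min_fraction=0.5):
    """Return the features selected in at least min_fraction of the replicates."""
    n = len(selected_per_replicate)
    # Count each feature at most once per replicate.
    counts = Counter(f for rep in selected_per_replicate for f in set(rep))
    return {f for f, c in counts.items() if c / n >= min_fraction}


# Hypothetical "Intermediate" scenario: L = 8 epitope sites, M = 2.
epitope = {160, 161, 162, 165, 166, 167, 169, 171}
print(is_resistant({160, 300}, epitope, m=2))       # only one epitope site is hit
print(is_resistant({160, 169, 300}, epitope, m=2))  # two epitope sites are hit

# Hypothetical features selected in four cross-validation replicates.
replicates = [{160, 169}, {160}, {160, 300}, {169, 171}]
print(sorted(recovered_features(replicates)))
```

Under this reading, a selected feature that lies outside the simulated epitope (site 300 in the example) would count toward FP rather than toward recovery.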
Figure 2. IDEPI performance, measured by MCC, as a function of the number of model features.
(A): on a representative of each of the four classification problems, (B): on predicting resistance to a particular broadly neutralizing monoclonal antibody. Abbreviations: NVP - Nevirapine; DRAM - drug resistance associated mutations; HAD - HIV associated dementia; bNab - broadly neutralizing antibody. The optimal number of features is highlighted with a filled circle for each line plot.
IDEPI performance in predicting phenotypes from genotypes based on training data analyzed previously.
| Problem | N | B | F | Sens. | Spec. | Accu. | MCC | Benchmark (IDEPI: ref) |
|---|---|---|---|---|---|---|---|---|
| NVP resistance | 1461 | 62.3% | 80 | 0.88 | 0.97 | 0.92 | 0.83 | CV Accu. 0.92: 0.92 |
| V3 tropism | 1356 | 15.1% | 90 | 0.89 | 0.94 | 0.94 | 0.78 | Training Accu. 0.95:0.96 |
| Dementia | 861 | 70.3% | 90 | 0.96 | 0.93 | 0.95 | 0.89 | CV Accu. 0.95:0.75 |
| 2F5 bNab | 465 | 48.6% | 3 | 0.93 | 0.88 | 0.90 | 0.81 | Training Accu. 0.90 vs proportion of residuals explained 0.49 |
| b12 bNab | 247 | 64.4% | 5 | 0.74 | 0.62 | 0.70 | 0.36 | Training Accu. 0.75:0.86 |
| 10E8 bNab | 178 | 4.0% | 5 | 0.30 | 0.96 | 0.93 | 0.23 | Training Accu. 0.96 vs proportion of residuals explained 0.21 |
| PG9 bNab | 301 | 26.2% | 60 | 0.56 | 0.86 | 0.78 | 0.43 | Training Accu. 0.96 vs proportion of residuals explained 0.31 |
| PGT-121 bNab | 118 | 37.2% | 1 | 0.80 | 0.79 | 0.80 | 0.58 | Training Accu. 0.80 vs proportion of residuals explained 0.52 |
| 8ANC131 bNab | 178 | 30.9% | 15 | 0.51 | 0.69 | 0.63 | 0.19 | |
| 8ANC195 bNab | 178 | 42.7% | 2 | 0.94 | 0.75 | 0.83 | 0.67 | Training Accu. 0.83 vs proportion of residuals explained 0.58 |
IDEPI metrics were obtained using 5-fold cross-validation. B (balance) is defined as the proportion of "positive" training samples. The number of features (F) was chosen by selecting a value from a pre-defined grid to maximize cross-validation MCC.
Benchmark methods referenced in the table above:
random forests trained on combined sequence and structural features using resistance classifications from the Stanford Drug Resistance Database [51];
a two-level classifier combining random forest predictions based on electrostatic hull and hydrophobicity features of the V3 loop (680 features), trained on the same data [27];
a hierarchical decision tree classifier using composite amino-acid features, trained on the same data [35];
a rule-based additive regression model trained to minimize IC50 residuals [45];
an ensemble classifier using signature rules and logistic regression, trained on the same data [44].
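The columns of the performance table are linked: given the balance B (proportion of positive samples), sensitivity and specificity determine accuracy and MCC. The sketch below is an illustration under stated assumptions rather than IDEPI's implementation. It treats the tabulated values as pooled rates, which is only approximately true for per-fold cross-validation averages, and both function names and the example grid are hypothetical.

```python
from math import sqrt


def accuracy_and_mcc(sensitivity, specificity, balance):
    """Derive accuracy and MCC from sensitivity, specificity, and the
    proportion of positive samples (all given as fractions)."""
    tp = balance * sensitivity
    fn = balance * (1.0 - sensitivity)
    tn = (1.0 - balance) * specificity
    fp = (1.0 - balance) * (1.0 - specificity)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return tp + tn, mcc


def best_feature_count(mcc_by_count):
    """Pick the grid value with the highest cross-validation MCC."""
    return max(mcc_by_count, key=mcc_by_count.get)


# NVP resistance row: Sens. 0.88, Spec. 0.97, B = 62.3%.
acc, mcc = accuracy_and_mcc(0.88, 0.97, 0.623)
print(f"accuracy={acc:.2f}, MCC={mcc:.2f}")

# Hypothetical grid of feature counts with made-up MCC values.
print(best_feature_count({1: 0.40, 5: 0.55, 10: 0.52}))
```

For the NVP resistance row this gives an accuracy of about 0.91 and an MCC of about 0.83, close to the tabulated 0.92 and 0.83; the small gap in accuracy is consistent with rounding of the reported rates.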
Key features selected by IDEPI for each of the example problems.
| Problem | Rank | Identity | Direction | MCC | Remarks |
|---|---|---|---|---|---|
| NVP resistance | 1 | K103K | Susceptible | 0.46 | Canonical NNRTI resistance site |
| NVP resistance | 2 | Y181Y | Susceptible | 0.65 | Canonical NNRTI resistance site |
| NVP resistance | 3 | G190G | Susceptible | 0.74 | Canonical NNRTI resistance site |
| V3 tropism | 1 | PNGS (N301) | CCR5 | 0.55 | Essential for CCR5 binding |
| V3 tropism | 2 | R306R | CCR5 | 0.67 | dual-tropic viruses |
| Dementia | 1 | T297K | Non-HAD | 0.57 | |
| Dementia | 2 | PNGS (N488) | HAD | | |
| Dementia | 3 | R298D | Non-HAD | | |
| Dementia | 4 | I320[] | Non-HAD | | |
| Dementia | 5 | PNGS (T188) | HAD | 0.71 | |
| 2F5 bNab | 1 | K665K | Susceptible | 0.73 | Parts of the canonical linear epitope (662–667) |
| 2F5 bNab | 2 | A667A | Susceptible | 0.75 | |
| b12 bNab | 1 | D185D | Susceptible | 0.26 | The strongest association found in [...] |
| 10E8 bNab | 3–4 | T676T | Susceptible | N/A | A part of the structural epitope |
| PG9 bNab | 1 | PNGS (N160) | Susceptible | 0.36 | Key residue for binding based on structure |
| PG9 bNab | 8 | V169E | Resistant | | |
| PGT-121 bNab | 1 | PNGS (N301+N332) | Susceptible | 0.58 | [...] neutralization |
| 8ANC195 bNab | 1 | PNGS (N234+N276) | Susceptible | 0.59 | Encompasses the three mutants (sites 234, 236, and 276) any of which confers resistance |
| 8ANC195 bNab | 2 | PNGS (N160+N230) | Resistant | 0.67 | |
| 8ANC131 bNab | 3.75 | PNGS (N339+Q442) | Resistant | | |
| 8ANC131 bNab | 5 | K151G | Susceptible | | |
Notation: T297K means that K is found at position 297 (HXB2 coordinates; T is the residue found in HXB2); PNGS (T188) is a potential N-linked glycosylation site with N at HXB2 coordinate 188; PNGS (N234+N276) is a pair of potential N-linked glycosylation sites with N at HXB2 coordinates 234 and 276; [] is a deletion relative to HXB2. Features are ranked by the order in which they were added to the model, averaged over cross-validation replicates. For datasets with little signal (e.g. 10E8 bNab, 8ANC131 bNab), feature ranks varied considerably among CV replicates, hence the best-ranking feature has a median rank worse than 1. The values in the MCC column are for the models with the corresponding number of features (e.g. the MCC of a 2-feature model for V3 tropism is 0.67).
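The feature notation described above is regular enough to parse mechanically. The following is a small sketch of such a parser, assuming labels are written exactly as in the table; the function name `parse_feature` and the dictionary layout are hypothetical and are not part of IDEPI's interface.

```python
import re

# Single-site feature: HXB2 residue, HXB2 coordinate, observed residue or [].
SITE = re.compile(r"^([A-Z])(\d+)([A-Z]|\[\])$")
# Glycosylation feature: PNGS(...) holding one or more '+'-joined sites.
PNGS = re.compile(r"^PNGS\s*\(([^)]+)\)$")
PNGS_SITE = re.compile(r"^([A-Z])(\d+)$")


def parse_feature(label):
    """Parse a feature label into a small dictionary describing it."""
    label = label.strip()
    m = PNGS.match(label)
    if m:
        sites = []
        for part in m.group(1).split("+"):
            s = PNGS_SITE.match(part.strip())
            if s is None:
                raise ValueError(f"unrecognized PNGS site: {part!r}")
            sites.append({"hxb2_residue": s.group(1), "position": int(s.group(2))})
        return {"kind": "pngs", "sites": sites}
    m = SITE.match(label)
    if m is None:
        raise ValueError(f"unrecognized feature label: {label!r}")
    ref, pos, obs = m.group(1), int(m.group(2)), m.group(3)
    if obs == "[]":
        return {"kind": "deletion", "hxb2_residue": ref, "position": pos}
    return {"kind": "residue", "hxb2_residue": ref, "position": pos, "observed": obs}


for label in ["T297K", "I320[]", "PNGS (T188)", "PNGS (N234+N276)"]:
    print(label, "->", parse_feature(label))
```

Keeping the HXB2 reference residue alongside the coordinate makes labels such as K103K readable as "the HXB2 residue K is retained at position 103".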
IDEPI model performance on independent datasets and comparison with benchmark methods.
| Problem | N (independent dataset) | Reference (independent dataset) | Benchmark | Performance |
|---|---|---|---|---|
| NVP resistance | 1639 | | Stanford HIVdb | Cohen's [...] |
| V3 tropism | 74 | | Best of 5 methods, including SVM, decision trees, and position-specific scoring matrices | Accu. IDEPI 0.91 vs 0.86 |
| Dementia | 10 | | Ensemble of rule learning and decision trees from [...] | IDEPI 10/10 vs 8/10 |
| b12 bNab | 55 | | Ensemble of signatures and logistic regression | Accu. IDEPI 0.73 vs 0.61 |