Adam Zemla, Tanya Kostova, Rodion Gorchakov, Evgeniya Volkova, David W C Beasley, Jane Cardosa, Scott C Weaver, Nikos Vasilakis, Pejman Naraghi-Arani.
Abstract
A computational approach for the identification and assessment of genomic sequence variability (GeneSV) is described. For a given nucleotide sequence, GeneSV collects information about the permissible nucleotide variability (changes that potentially preserve function) observed in corresponding regions of genomic sequences, and combines it with conservation/variability results from protein sequence and structure-based analyses of the evaluated protein coding regions. GeneSV was used to predict the effects (functional vs. non-functional) of 37 amino acid substitutions on the NS5 polymerase (RdRp) of dengue virus type 2 (DENV-2), 36 of which are not observed in any publicly available DENV-2 sequence. Thirty-two novel mutants with single amino acid substitutions in the RdRp were generated using a DENV-2 reverse genetics system. In 81% (26 of 32) of the predictions tested, GeneSV correctly predicted the viability of the introduced mutations. For 4 of 5 (80%) mutants with double amino acid substitutions proximal in structure to one another, GeneSV was also correct in its predictions. The predictive capabilities of the developed system were illustrated on dengue virus, an RNA virus, but the general approach described in the manuscript for characterizing real or theoretically possible variations in genomic and protein sequences can be applied to any organism.
Keywords: dengue virus (DENV); genomic sequence variability; mutant viability; protein structure; quasispecies
Year: 2014 PMID: 24453480 PMCID: PMC3893053 DOI: 10.4137/BBI.S13076
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Snapshot of results generated by the GeneSV system for 32 hypothetical mutations (complete output is provided in the supplemental data – Suppl 1), experimental validation of those results, and comparison of those results to other similar systems.
| #BASE | RNUM | COD2 | A | TSTV | COD1 | A | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | CH | SCON | GCSID4 | AASIDH | GSV1 | GSV2 | SIFT | PRV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7840 | 91 | TGT | C | 0 3 | ATG | M | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | −2.1 | 70.80 | 76.00 | V2 | V2 | V | V |
| 7840 | 91 | TGT | C | 0 2 | GCT | A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −1.1 | – | 66.70 | V4 | V4 | V | V |
| 8104 | 179 | TGC | C | 0 2 | GTC | V | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −4.0 | – | 62.60 | V4 | V4 | V | N |
| 8104 | 179 | TGC | C | 0 2 | GCC | A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −2.0 | – | 63.40 | V4 | V4 | V | N |
| 8245 | 226 | GGG | G | 1 1 | ACG | T | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −8.0 | – | 63.40 | V4 | V4 | V | V |
| 8245 | 226 | GGG | G | 2 1 | AAC | N | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −1.0 | – | 63.30 | V4 | V4 | N | V |
| 8245 | 226 | GGG | G | 1 0 | GAG | E | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | +0 | −5.0 | – | – | N6 | N6 | N | N |
| 8245 | 226 | GGG | G | 1 1 | AGC | S | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −2.0 | – | 68.00 | V4 | V4 | V | V |
| 8404 | 279 | AAA | K | 0 1 | AAC | N | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 4 | −0.1 | – | 66.90 | V4 | V4 | V | V |
| 8404 | 279 | AAA | K | 1 0 | GAA | E | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | −4 | 2.2 | 73.90 | 81.00 | V2 | V2 | V | V |
| 8767 | 400 | TGC | C | 0 2 | ACC | T | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 4.6 | *69.20 | 93.80 | V3 | V3 | V | V |
| 8767 | 400 | TGC | C | 0 3 | ACA | T | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 4.6 | 82.70 | 93.80 | V1 | V1 | V | V |
| 8878 | 437 | AGA | R | 1 1 | GCA | A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 8 | −4.0 | – | 62.40 | V4 | V4 | V | V |
| 8878 | 437 | AGA | R | 1 1 | CAA | Q | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 8 | 0.0 | – | 65.50 | V4 | V3 | V | V |
| 8878 | 437 | AGA | R | 2 0 | GAA | E | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | −8 | 0.2 | *67.70 | 72.80 | V3 | V4 | V | V |
| 8878 | 437 | AGA | R | 0 1 | AGC | S | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 8 | −3.0 | – | 57.10 | V4 | V4 | N | V |
| 9016 | 483 | TTC | F | 1 0 | CTC | L | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −2.0 | – | 17.90 | V4 | V4 | V | N |
| 9016 | 483 | TTC | F | 0 1 | GTC | V | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | −4.0 | *72.90 | 20.30 | V3 | V4 | N | N |
| 9016 | 483 | TTC | F | 1 0 | TCC | S | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | −6.0 | – | – | N7 | N7 | N | N |
| 9016 | 483 | TTC | F | 0 1 | TAC | Y | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 4.2 | 85.40 | 88.90 | V2 | V2 | V | V |
| 9223 | 552 | AAG | K | 0 2 | CTG | L | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | −3.0 | 70.80 | 92.40 | V2 | V2 | V | V |
| 9223 | 552 | AAG | K | 1 1 | ATA | I | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 5 | −5.0 | *81.40 | 92.90 | V3 | V3 | V | V |
| 9223 | 552 | AAG | K | 0 1 | CAG | Q | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 5 | 0.0 | – | 66.60 | V3 | V3 | V | V |
| 9223 | 552 | AAG | K | 0 1 | AAC | N | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 5 | −2.0 | – | *69.60 | V4 | V4 | V | V |
| 9667 | 700 | TGG | W | 0 2 | TTC | F | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0.0 | – | 66.60 | V4 | V4 | V | N |
| 9667 | 700 | TGG | W | 0 2 | GCG | A | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | −8.0 | – | 18.50 | V4 | V4 | V | N |
| 9667 | 700 | TGG | W | 1 1 | TAC | Y | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1.0 | – | 69.00 | V4 | V4 | V | N |
| 9667 | 700 | TGG | W | 1 2 | GAC | D | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | +0 | −8.0 | – | – | N6 | N6 | N | N |
| 9667 | 700 | TGG | W | 1 1 | GAG | E | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | +0 | −6.0 | – | 49.70 | V4 | V4 | V | N |
| 9667 | 700 | TGG | W | 0 1 | TGC | C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | −5.0 | – | – | N6 | N6 | N | N |
| 9700 | 711 | CAC | H | 0 1 | AAC | N | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 7 | 1.1 | – | 75.60 | V4 | V4 | V | N |
| 9700 | 711 | CAC | H | 1 0 | CGC | R | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | −7 | −2.0 | – | – | N6 | N6 | N | N |
Notes: The mutations listed in blue font are those that were experimentally confirmed as not viable. The mutations listed in green font are those with a correct prediction of their functionality. The mutations listed in red font are those with an incorrect prediction of their functionality. Viable predictions are marked as (V), non-viable predictions as (N), and (–) indicates that no result was reported. For each base position, the following information is reported.
Abbreviations: Rnum, residue number within a reference protein; Cod2, codon at the base position in the reference; A, corresponding amino acid; TsTv, numbers of transitions and transversions for the Cod2 -> Cod1 mutation; Cod1, codon at the base position in the test (mutant) set; i, binary check of whether a given codon fulfills criterion Oi (i = 1, ..., 8); ch, conservation of charged amino acids (HKR DE) (0–9), with ‘+’ when test is charged; sCON, sequence conservation [−9, 9]; gcSid4, sequence identity to genomic sequences from the expanded dataset (exact codon match); aaSidH, sequence identity to protein sequences from homologous proteins only (amino acid match); GSV1, viability predictions V1–V4 and N5–N7 calculated by GeneSV using dataset 2012.02.12; GSV2, viability predictions V1–V4 and N5–N7 calculated by GeneSV using dataset 2013.02.18; SIFT, viability predictions V or N by the SIFT server; PRV, viability predictions V or N by the PROVEAN server.
(*) – codons with fewer than 3 position hits should be evaluated with caution (“viability” predictions may not be reliable).
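The TsTv column can be reproduced from the Cod2 and Cod1 columns alone. The following is a minimal sketch, not GeneSV code: the function name `count_ts_tv` is hypothetical, and it assumes the standard definition in which a transition is a purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) change and every other base change is a transversion.

```python
from typing import Tuple

# Illustrative sketch only; names are hypothetical and not part of GeneSV.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID_BASES = PURINES | PYRIMIDINES


def count_ts_tv(ref_codon: str, test_codon: str) -> Tuple[int, int]:
    """Return (transitions, transversions) for a ref -> test codon change."""
    ref, test = ref_codon.upper(), test_codon.upper()
    if len(ref) != len(test):
        raise ValueError("codons must have the same length")
    if not (set(ref) <= VALID_BASES and set(test) <= VALID_BASES):
        raise ValueError("codons must contain only A, C, G, T")

    transitions = transversions = 0
    for ref_base, test_base in zip(ref, test):
        if ref_base == test_base:
            continue  # unchanged position contributes to neither count
        pair = {ref_base, test_base}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # change within the same chemical class
        else:
            transversions += 1  # purine <-> pyrimidine change
    return transitions, transversions


if __name__ == "__main__":
    # Codon pairs taken from the table above (Cod2 -> Cod1).
    for ref, test in [("TGT", "ATG"), ("GGG", "AAC"), ("AAA", "GAA")]:
        print(ref, "->", test, count_ts_tv(ref, test))
```

For the table rows TGT -> ATG, GGG -> AAC, and AAA -> GAA, this counting agrees with the listed TsTv values of 0 3, 2 1, and 1 0.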
Figure 1. A structural model of NS5 polymerase (FLIC; gi:132271146; 900 aa). Residues highlighted in green are those positions that were selected for testing various predictions via generation of mutants.
Protein sequence and structure-based characteristics calculated for 10 selected positions in the RNA-dependent RNA polymerase gene of DENV-2 from a full-length infectious clone [FLIC] used to test our predictions.
| RESIDUE-POSITION | SSE | ACC | EPI | ECON | SCON (FLIC) | SCON (REFSEQ) |
|---|---|---|---|---|---|---|
| C-91 | H | 0 | 0 | 4.9 | 5.6 (C) | 5.6 (C) |
| C-179 | E | 0 | 0 | 8.9 | 8.9 (C) | 8.9 (C) |
| G-226 | C | 1 | 1 | 8.9 | 8.9 (G) | 8.9 (G) |
| K-279 | H | 0 | 0 | −0.9 | 2.2 (K) | 2.2 (K) |
| C-400 | H | 0 | 0 | 4.9 | −3.0 (C) | 4.6 (T) |
| R-437 | H | 0 | 0 | 2.9 | 4.5 (R) | 2.1 (K) |
| F-483 | H | 0 | 0 | 5.9 | 7.7 (F) | 7.7 (F) |
| K-552 | H | 0 | 1 | 1.9 | 3.5 (K) | −1.2 (M) |
| W-700 | E | 1 | 1 | 6.9 | 8.9 (W) | 8.9 (W) |
| H-711 | E | 0 | 0 | 5.9 | 7.7 (H) | 7.7 (H) |
Abbreviations: SSE, secondary structure element assignments; ACC, solvent accessibility scores; EPI, predictions that indicate that a given position is a part of the epitope; eCON, estimated amino acid conservation using Shannon’s entropy and frequencies calculated by the SeqalSV and StralSV algorithms; sCON, sequence similarity conservation using Sum-of-Pairs algorithm; H, helix; C, coil; E, strand.
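As a rough illustration of the entropy ingredient behind eCON, the sketch below computes Shannon's entropy for a single alignment column. It is an assumption-laden example rather than the SeqalSV/StralSV procedure: the function name `shannon_entropy` is hypothetical, gaps are simply ignored, and the conversion from raw entropy to the eCON score scale is not described in this text, so it is left out.

```python
import math
from collections import Counter


def shannon_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the residues in one alignment column.

    Gap characters ('-') are ignored. This is a generic textbook
    calculation, not the GeneSV implementation.
    """
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        raise ValueError("column contains no residues")
    total = len(residues)
    counts = Counter(residues)
    # H = -sum(p * log2(p)) over the observed residue frequencies p
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    print(shannon_entropy("CCCCCCCC"))  # fully conserved column
    print(shannon_entropy("KKKKRRRR"))  # two residues, equal frequency
```

A fully conserved column has zero entropy, and entropy grows as more residue types appear at similar frequencies, which is why low entropy corresponds to high conservation.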
Results from the experimental validation of 5 sets of double mutations at the positions K279 and R437 for which constructed mutant viruses were rescued from plasmid transfected in Vero cells.
| MUTATION IN INFECTIOUS CLONE | GENESV PREDICTION | VIRUS RESCUED | PREDICTION VALIDATED? |
|---|---|---|---|
| (AAA)K-279-N(AAC) | Viable | Not Viable | No |
| (AAA)K-279-R(AGA) | Viable | Yes | Yes |
| (AAA)K-279-E(GAA) | Viable | Yes | Yes |
| (AAA)K-279-E(GAA) | Viable | Yes | Yes |
| (AAA)K-279-E(GAA) | Not Viable | Not Viable | Yes |
Results from GeneSV assessment of new mutations identified by comparison of two datasets of genetic sequences separated by 12 months: 2012.02.12 and 2013.02.18.
| NAME | SIZE | NEW-MUTATIONS | PREDICTED VIABLE | PREDICTED NON-VIABLE |
|---|---|---|---|---|
| Capsid | 114 | 57 | 51 | 6 |
| prM | 166 | 55 | 50 | 5 |
| Envelope | 495 | 250 | 237 | 13 |
| Nsp_1 | 352 | 106 | 105 | 1 |
| Nsp_2A | 218 | 129 | 128 | 1 |
| NS2B | 130 | 55 | 55 | 0 |
| NS3 | 618 | 146 | 146 | 0 |
| Nsp_4A | 150 | 72 | 70 | 2 |
| Nsp_4B | 248 | 68 | 66 | 2 |
| NS5 | 900 | 232 | 231 | 1 |
| Total | 3391 | 1170 | 1088 | 25 |
| 5-end(*) | 96* | 4* | 3* | 1* |
| 3-end(*) | 456* | 31* | 26* | 5* |
| Total | 552* | 35* | 29* | 6* |
Notes: Name: region in the DENV-2 genome using RefSeq as a reference sequence. Size: number of codons in protein coding regions or bases in noncoding regions (marked by *). New: number of new mutations observed in the library 2013.02.18 and NOT seen in 2012.02.12. Predicted Viable: number of mutations predicted to be “VIABLE” (correct predictions). Predicted Non-viable: number of mutations predicted as “NOT viable” (incorrect predictions).
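Because every new mutation in this table was actually observed in the 2013.02.18 library, the share predicted viable in each region can be read as a per-region rate of correct predictions. The sketch below computes that share from the per-region rows as printed above; the variable and function names are hypothetical and the calculation is only an illustration, not part of GeneSV.

```python
# Per-region rows copied from the table above:
# (name, new mutations, predicted viable, predicted non-viable)
REGION_ROWS = [
    ("Capsid", 57, 51, 6),
    ("prM", 55, 50, 5),
    ("Envelope", 250, 237, 13),
    ("Nsp_1", 106, 105, 1),
    ("Nsp_2A", 129, 128, 1),
    ("NS2B", 55, 55, 0),
    ("NS3", 146, 146, 0),
    ("Nsp_4A", 72, 70, 2),
    ("Nsp_4B", 68, 66, 2),
    ("NS5", 232, 231, 1),
]


def viable_fraction(new_mutations: int, predicted_viable: int) -> float:
    """Fraction of newly observed mutations that were predicted viable."""
    if new_mutations <= 0:
        raise ValueError("new_mutations must be positive")
    return predicted_viable / new_mutations


# Map each region name to its fraction of correct ("viable") predictions.
FRACTIONS = {
    name: viable_fraction(new, viable)
    for name, new, viable, _non_viable in REGION_ROWS
}

if __name__ == "__main__":
    for name, fraction in FRACTIONS.items():
        print(f"{name:10s} {fraction:.1%}")
```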
Figure 2. Column chart representation of the frequencies of mutations (codon variations) observed in locations with different amino-acid and position characteristics within each protein coding region in DENV-2 dataset 2013.02.18. The bars colored in black and marked as “Obsrvd” show the average frequencies observed within a given gene. Frequencies of mutations observed in the coil regions within a given gene are colored in cyan, strand – blue, helix – dark blue, buried – yellow, exposed – red, not a part of the antigenic epitopes – light green, and the frequencies of mutations within the epitope regions – green. The AVERAGE set of bars shows the corresponding frequencies calculated as averages over all protein coding regions combined.
Comparison of sequence mutation analysis tools based on types of datasets used for homology searches, and approaches they use for conservation analysis. List of structural features contributing to calculated conservation scores.
| METHOD | DATASETS FOR HOMOLOGY SEARCHES: GENOMIC SEQUENCES | DATASETS FOR HOMOLOGY SEARCHES: PROTEIN SEQUENCES | PROTEIN STRUCTURES | CONSERVATION ANALYSIS: SEQUENCE ALIGNMENT | CONSERVATION ANALYSIS: STRUCTURAL FEATURES |
|---|---|---|---|---|---|
| SIFT | – | Yes | – | PsiBlast | – |
| PROVEAN | – | Yes | – | Blastp | – |
| PolyPhen-2 | – | Yes | Yes | Blast | Yes* |
| GeneSV | Yes | Yes | Yes | PsiBlast | Yes |
Notes:
accuracy of constructed structural models.
surface exposure (location of the mutation position: buried, exposed).
structural element (coil, helix, strand).
antigenic site (position within the epitope: yes, no).
structural conservation.
(*) – structure-based features are estimated from the analysis of closest homologs with known 3D structure (no structural models are constructed).
| METHOD | REPORTED VALUES |
|---|---|
| PROVEAN | (78%; 0.837) |
| SIFT | (78%; 0.857) |
| GeneSV | (81%; 0.885) |
| Prediction by consensus | (84%; 0.902) |
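The text does not state how the consensus prediction was formed from the three tools. One plausible reading is a simple majority vote over the GeneSV, SIFT, and PROVEAN calls, and the sketch below illustrates only that assumption. The function names are hypothetical; the collapse of GeneSV labels V1–V4 to viable and N5–N7 to not viable follows the abbreviations given for the first table.

```python
from collections import Counter


def genesv_to_binary(label: str) -> str:
    """Collapse a GeneSV label (V1-V4 or N5-N7) to 'V' or 'N'."""
    label = label.strip().upper()
    if not label or label[0] not in ("V", "N"):
        raise ValueError(f"unexpected GeneSV label: {label!r}")
    return label[0]


def majority_vote(genesv: str, sift: str, provean: str) -> str:
    """Majority call among three binary predictions.

    ASSUMPTION: majority voting is an illustrative choice; the source
    does not describe how its consensus prediction was computed.
    """
    votes = [genesv_to_binary(genesv), sift.strip().upper(), provean.strip().upper()]
    if any(v not in ("V", "N") for v in votes):
        raise ValueError("each prediction must be 'V' or 'N'")
    # With three binary votes there is always a strict majority.
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Example label combinations of the kind listed in the first table.
    print(majority_vote("V4", "V", "N"))
    print(majority_vote("N6", "N", "N"))
```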