Christopher S Poultney, Glenn L Butterfoss, Michelle R Gutwein, Kevin Drew, David Gresham, Kristin C Gunsalus, Dennis E Shasha, Richard Bonneau.
Abstract
Temperature-sensitive (ts) mutations are mutations that exhibit a mutant phenotype at high or low temperatures and a wild-type phenotype at normal temperature. Temperature-sensitive mutants are valuable tools for geneticists, particularly in the study of essential genes. However, finding ts mutations typically relies on generating and screening many thousands of mutations, which is an expensive and labor-intensive process. Here we describe an in silico method that uses Rosetta and machine learning techniques to predict a highly accurate "top 5" list of ts mutations given the structure of a protein of interest. Rosetta is a protein structure prediction and design code, used here to model and score how proteins accommodate point mutations with side-chain and backbone movements. We show that integrating Rosetta relax-derived features with sequence-based features results in accurate temperature-sensitive mutation predictions.
Year: 2011 PMID: 21912654 PMCID: PMC3166291 DOI: 10.1371/journal.pone.0023947
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Rosetta score terms and derived features.
| Feature | Description |
| score | overall score: weighted sum of other score terms |
| fa_atr | Lennard-Jones attractive component |
| fa_rep | Lennard-Jones repulsive component |
| fa_sol | Lazaridis-Karplus solvation energy |
| fa_intra_rep | LJ repulsive between same-residue atoms |
| pro_close | proline ring closure energy |
| fa_pair | pair term, statistics-based: electrostatics, disulfides |
| hbond_sr_bb | H-bonds: backbone-to-backbone, close in sequence |
| hbond_lr_bb | H-bonds: backbone-to-backbone, distant in sequence |
| hbond_bb_sc | H-bonds: backbone-to-side chain |
| hbond_sc | H-bonds: side chain-to-side chain |
| dslf_ss_dst | disulfide bond S-S distance score |
| dslf_cs_ang | disulfide bond Cβ-S-S angle score |
| dslf_ss_dih | disulfide bond S-S dihedral score |
| dslf_ca_dih | disulfide bond Cα-Cβ-S-S dihedral score |
| rama | probability of backbone φ, ψ angles |
| omega | deviation of ω bond angle from its ideal of 180° |
| fa_dun | rotamer self-energy from Dunbrack library |
| p_aa_pp | probability of amino acid given φ, ψ |
| ref | reference state (unfolded) energy |
| Repack_average_score | average of overall score across 3 relax iterations |
| Repack_stdev_score | stdev of overall score across 3 relax iterations |
| gdtmm1_1 | maxsub fraction: maxsub term/# residues, using maxsub rms thresh = 1.0 and distance thresh = 1.0 |
| gdtmm2_2 | maxsub fraction: rms thresh = 2.0, distance thresh = 2.0 |
| gdtmm3_3 | maxsub fraction: rms thresh = 3.0, distance thresh = 3.0 |
| gdtmm4_3 | maxsub fraction: rms thresh = 4.0, distance thresh = 3.0 |
| gdtmm7_4 | maxsub fraction: rms thresh = 7.0, distance thresh = 4.0 |
| irms | RMS from input structure |
| maxsub | size of C |
| maxsub2.0 | maxsub w/rms thresh = 2.0 and distance thresh = 3.5 |
| rms | RMS from native |
Removed due to high correlation with other feature(s).
Always zero.
Rosetta score terms and descriptions. Three features were derived from each Rosetta score term, denoted by suffix Q1, Q2, or Q3, based on mutant distribution quartiles 1–3 as described in Methods and Fig. 5. Superscripts denote feature groups removed from the final training set.
Figure 5. Quartile method for comparing distributions of Rosetta score terms.
Quartiles 1–3 were calculated for the mutant ensemble distribution (top) of the omega score term, which measures deviation of the ω bond angle from its ideal of 180°. Q1–Q3 are indicated by red lines, with the corresponding values above and percentiles below. The mutant Q1–Q3 values were then mapped to locations in the wild type (wt) ensemble distribution (bottom). Q1–Q3 of the mutant distribution are again indicated by red lines, with their percentiles relative to the wt distribution shown below. Wild type ensemble Q1–Q3 are shown in blue for reference.
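The quartile mapping described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `quartile_features`, the interpolation method used for the quartiles, and the choice to report each feature as the percentage of wt scores at or below the mutant quartile value are all assumptions.

```python
from bisect import bisect_right
from statistics import quantiles


def quartile_features(mutant_scores, wt_scores):
    """Return mutant Q1-Q3 values and their percentiles within the wt ensemble.

    Hypothetical helper: the exact quartile and percentile conventions
    used in the paper are not given here, so these are assumptions.
    """
    # Q1, Q2, Q3 of the mutant ensemble (linear interpolation, inclusive method)
    q_values = quantiles(mutant_scores, n=4, method="inclusive")
    wt_sorted = sorted(wt_scores)
    percentiles = {}
    for name, q in zip(("Q1", "Q2", "Q3"), q_values):
        # Percentile = share of wt scores less than or equal to the mutant quartile
        rank = bisect_right(wt_sorted, q)
        percentiles[name] = 100.0 * rank / len(wt_sorted)
    return q_values, percentiles
```

Applied to one score term at a time, this yields three numbers per term. A mutant ensemble that is shifted relative to the wt ensemble gives percentiles far from 25, 50, and 75.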
Figure 1. Typical ensembles of structures produced by Rosetta relax runs for calmodulin.
Shown here are structures generated by Rosetta relax runs that allow protein structures to “relax” to a lower energy state. The starting structure – one domain of yeast calmodulin – is shown in green, and the generated structures are shown in gray, with runs starting from the native structure on the left and runs from a mutation (F89I) on the right. The mutated site is shown in red in the mutant structure. The wt ensemble shows less variation in both difference from the starting structure and difference within the ensemble than the mutation ensemble. The differences between wild-type and mutation ensembles are quantified by comparing distributions of Rosetta score terms.
Figure 2. Effects of a single amino acid mutation.
Shown here is a Rosetta-generated structure for one mutation (F89I) to yeast calmodulin. The relaxed starting structure is shown in transparent gray, the mutant structure in pink, the native phenylalanine at position 89 in transparent blue, and the mutation to isoleucine in solid red. The mutated structure has accommodated the F89I mutation by small backbone movements, such as the shift in helix position at residue 89, and reconfiguration of nearby side chains.
Figure 3. Sites of predicted temperature-sensitive mutations.
The crystal structure of one domain of yeast calmodulin is shown in cartoon representation in green. Residues in the hydrophobic core are shown as green sticks, and hydrophobic core residues with predicted ts mutations are shown in purple. Of the top 20 predictions on calmodulin, 10 each from SVM-LIN and SVM-RBF, 15 mutations occur at these six sites.
Figure 4. Training set statistics.
Counts are shown for the total number of proteins (#prot), positions (#pos), and non-ts and ts samples, separated by species. The training set comprises a total of 205 mutations (75 ts, 130 non-ts) to 177 sites in 66 proteins. Yeast has the largest number of samples, and the most balanced distribution of ts and non-ts samples; worm has only 5 ts samples, and fly lacks non-ts samples. The difference between the number of proteins and the number of positions for yeast is due to the presence of the histone complex data, which comprise many mutations to different positions within the same structure.
Non-Rosetta structure-based features.
| Feature | Description |
| ACCP | solvent-accessible surface area (ACC) |
| ss_H | secondary structure: α-helix |
| ss_S | secondary structure: β-strand |
| ss_L | secondary structure: loop region |
Worsened performance.
Structure-based features not based on Rosetta score terms. Superscripts denote features removed from the final training set.
Sequence-based features.
| Feature | Description |
| aminochange | four-category change in amino acid: 0 = same amino acid, 1 = different amino acid in same category, 2 = different category |
| aminochange | seven-category change in amino acid |
| pssm_mut | log-likelihood of mutated amino acid from position-specific scoring matrix |
| pssm_nat | log-likelihood of native amino acid from position-specific scoring matrix |
| pssm_diff | difference in log-likelihood of mutated and native amino acid |
| freq_mut | frequency of mutated amino acid in multiple sequence alignment |
| freq_nat | frequency of native amino acid in multiple sequence alignment |
| freq_diff | difference in frequency of mutated and native amino acid |
| info_cont | position information content from PSI-BLAST |
Removed due to high correlation with other feature(s).
Worsened performance.
Sequence-based features from BLAST, PSI-BLAST, or other analysis. Superscripts denote features removed from the final training set.
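As a rough illustration of the frequency-based features in the table above, the sketch below computes them for a single alignment column. It is a simplification under stated assumptions: the function name `column_features` is hypothetical, gaps are ignored, differences are taken as mutated minus native, and information content is measured against a uniform 20-letter background, whereas PSI-BLAST uses background frequencies and pseudocounts.

```python
from collections import Counter
from math import log2


def column_features(column, native, mutant):
    """Frequency features and information content for one alignment column.

    Hypothetical helper; `column` is a string of residues with '-' for gaps.
    """
    residues = [aa for aa in column if aa != "-"]  # ignore gaps
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    total = len(residues)
    freq_nat = counts[native] / total
    freq_mut = counts[mutant] / total
    # Shannon entropy of the column, in bits
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    # Assumption: information content relative to a uniform 20-letter background
    info_cont = log2(20) - entropy
    return {
        "freq_nat": freq_nat,
        "freq_mut": freq_mut,
        "freq_diff": freq_mut - freq_nat,  # assumed sign: mutated minus native
        "info_cont": info_cont,
    }
```

For example, `column_features("FFFFFFII", "F", "I")` describes an F-to-I substitution at a position where six of eight aligned sequences carry phenylalanine.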
Figure 6. SVM-RBF parameter space.
SVM-RBF precision on the ts class is shown as a function of the C and γ parameters. Values shown are the mean across the five leave-out CV runs, and range from 0.5822 to 0.788. Blue circles indicate the parameter values yielding the highest ts precision for each of the five leave-out CV runs. The final median C and γ values are indicated by the black cross. While the optimum parameter values across the five leave-out CV runs differ, they are all located along the “valley” of high precision that is visible running from upper right to lower left, indicating that multiple combinations of C and γ values lead to classifiers having similarly good performance.
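The selection step in this caption (best parameters per leave-out CV run, then a median across runs) might look like the sketch below. It assumes the two RBF parameters are the usual cost C and kernel width gamma, that each run's results are available as a dictionary from `(C, gamma)` to ts-class precision, and that the median is taken separately for each parameter; the function names are hypothetical.

```python
from statistics import median


def best_params(precision_grid):
    """Return the (C, gamma) pair with the highest ts precision in one CV run."""
    return max(precision_grid, key=precision_grid.get)


def final_params(per_run_grids):
    """Median of the per-run optimal parameters, taken per parameter."""
    best = [best_params(grid) for grid in per_run_grids]
    c_final = median(c for c, _ in best)
    gamma_final = median(g for _, g in best)
    return c_final, gamma_final
```

Because the medians are taken independently, the final pair need not coincide with any single run's optimum. That is acceptable when, as the caption notes, many parameter combinations along the high-precision valley perform similarly.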
Figure 7. Classifier performance.
Receiver operating characteristic (ROC) curves are shown for SVM-LIN, SVM-RBF, and SVM-seq (an RBF classifier trained only on sequence data). Each curve plots true positive rate (tpr) against false positive rate (fpr); the reference line for random classification is shown in gray. The gap between each classifier's curve and the reference line shows the improvement of our method over random classification. The steep slope at the lower left of the classifier curves indicates that the highest-ranked predictions are the most likely to be accurate for all three classifiers. Area under curve: SVM-LIN = 0.713, SVM-RBF = 0.734, SVM-seq = 0.563.
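For readers unfamiliar with how such curves and areas are obtained, here is a minimal, generic sketch of ROC construction and trapezoidal AUC from classifier scores. It is not the authors' evaluation code; the function names and the label convention (1 = ts, 0 = non-ts) are assumptions.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) points by sweeping a threshold over descending scores.

    labels: 1 = ts (positive), 0 = non-ts (negative). Both classes must occur.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all samples that share the same score as one threshold step
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A random ranking gives an area near 0.5 (the gray reference line) and a perfect ranking gives 1.0. A curve that rises steeply at the lower left corresponds to top-ranked predictions that are mostly true positives.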