ManChon U, Eric Talevich, Samiksha Katiyar, Khaled Rasheed, Natarajan Kannan.
Abstract
Cancer is a genetic disease that develops through a series of somatic mutations, a subset of which drive cancer progression. Although cancer genome sequencing studies are beginning to reveal the mutational patterns of genes in various cancers, identifying the small subset of "causative" mutations from the large subset of "non-causative" mutations, which accumulate as a consequence of the disease, is a challenge. In this article, we present an effective machine learning approach for identifying cancer-associated mutations in human protein kinases, a class of signaling proteins known to be frequently mutated in human cancers. We evaluate the performance of 11 well-known supervised learners and show that a multiple-classifier approach, which combines the performances of individual learners, significantly improves the classification of known cancer-associated mutations. We introduce several novel features related specifically to structural and functional characteristics of protein kinases and find that the level of conservation of the mutated residue at specific evolutionary depths is an important predictor of oncogenic effect. We consolidate the novel features and the multiple-classifier approach to prioritize and experimentally test a set of rare unconfirmed mutations in the epidermal growth factor receptor tyrosine kinase (EGFR). Our studies identify T725M and L861R as rare cancer-associated mutations inasmuch as these mutations increase EGFR activity in the absence of the activating EGF ligand in cell-based assays.
Year: 2014 PMID: 24743239 PMCID: PMC3990476 DOI: 10.1371/journal.pcbi.1003545
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Selected protein features.
| Feature | Votes | AvgRank |
| Protein kinase family | 5 | 1.40 |
| Protein kinase group | 5 | 1.80 |
| Amino acid type, WT | 5 | 8.00 |
| BLOSUM62 pairwise score | 5 | 8.20 |
| Side-chain polarity, mutant | 5 | 11.00 |
| Conservation of wild type in all kinases | 5 | 11.60 |
| Conservation of consensus type in kinase group | 5 | 11.60 |
| Conservation of consensus type in all kinases | 5 | 13.00 |
| Conservation of consensus type in kinase family | 4 | 5.75 |
| Kinase subdomain | 4 | 6.00 |
| Average mass of amino acid, WT | 4 | 7.50 |
| Is a binding site? | 4 | 8.25 |
| Van der Waals volume, WT | 4 | 8.75 |
| Site modification type (if any) | 4 | 9.25 |
| Amino acid type, mutant | 4 | 10.75 |
| Side-chain polarity, WT | 4 | 11.50 |
| Is in protein kinase domain? | 3 | 11.67 |
The “Votes” column indicates how many feature selection algorithms voted for that feature during the 10-fold cross-validation selection procedure; the “AvgRank” column gives the rank of the feature averaged over the algorithms that selected it.
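The vote and average-rank bookkeeping described in this note can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `aggregate_selection`, the feature keys, and the toy rankings are all hypothetical, and the sketch assumes that each selector reports a rank only for the features it selected.

```python
from collections import defaultdict


def aggregate_selection(rankings):
    """Aggregate per-selector rankings into (feature, votes, avg_rank) rows.

    `rankings` is a list with one dict per feature selection algorithm,
    mapping feature name -> rank (1 = best). A feature missing from a
    dict is treated as not selected by that algorithm (an assumption).
    """
    votes = defaultdict(int)
    rank_sum = defaultdict(float)
    for ranking in rankings:
        for feature, rank in ranking.items():
            votes[feature] += 1
            rank_sum[feature] += rank
    rows = [(f, votes[f], rank_sum[f] / votes[f]) for f in votes]
    # Order as in the table: most votes first, then lowest average rank.
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


# Toy input: three hypothetical selectors, invented ranks.
toy_rankings = [
    {"kinase_family": 1, "kinase_group": 2, "subdomain": 3},
    {"kinase_family": 2, "kinase_group": 1},
    {"kinase_family": 1, "subdomain": 5},
]
table = aggregate_selection(toy_rankings)
for feature, n_votes, avg_rank in table:
    print(f"{feature}\t{n_votes}\t{avg_rank:.2f}")
```

Sorting by votes first and average rank second reproduces the row order used in the table, where all five-vote features precede the four-vote ones.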
Confusion matrix of individual classifier performance.
| Algorithms | TP | FN | TN | FP |
| J48 (Tree) | 221 | 5 | 318 | 13 |
| Random Forest | 216 | 10 | 320 | 11 |
| NB Tree | 217 | 9 | 311 | 20 |
| Functional Tree | 217 | 9 | 323 | 8 |
| Decision Table | 222 | 4 | 296 | 35 |
| DTNB | 219 | 7 | 321 | 10 |
| LWL(J48+KNN) | 220 | 6 | 316 | 15 |
| Bayes Net | 221 | 5 | 313 | 18 |
| Naive Bayes | 218 | 8 | 309 | 22 |
| SVM | 219 | 7 | 323 | 8 |
| Neural Network | 218 | 8 | 321 | 10 |
| Combined (0.5) | 223 | 3 | 328 | 3 |
All in silico experiments were evaluated with 10-fold cross-validation. TP means an instance in the positive set (COSMIC) was correctly classified as causative; TN means an instance in the negative set (dbSNP) was correctly classified as non-causative. FN and FP are the corresponding misclassified positive and negative instances.
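The rate-based metrics reported for each classifier follow from these four counts by the standard definitions. The sketch below applies them to the J48 row; the function name `confusion_metrics` is illustrative and not taken from the article.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Standard metrics for the positive class from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)          # also called recall or sensitivity
    fp_rate = fp / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "accuracy": accuracy,
        "precision": precision,
        "recall": tp_rate,
        "f_measure": f_measure,
    }


# Counts for J48 (Tree), copied from the confusion matrix table.
j48 = confusion_metrics(tp=221, fn=5, tn=318, fp=13)
for name, value in j48.items():
    print(f"{name}: {value:.3f}")
```

For J48 this gives a TP rate of 0.978, an FP rate of 0.039, an accuracy of 0.968, a precision of 0.944 and an F-measure of 0.961.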
Comparison of performance of individual and combined classifiers.
| Algorithm | TP Rate | FP Rate | Accuracy | Precision | Recall | F-Measure |
| J48 (Tree) | 0.978 | 0.039 | 0.968 | 0.944 | 0.978 | 0.961 |
| Random Forest | 0.956 | 0.033 | 0.962 | 0.952 | 0.956 | 0.954 |
| NB Tree | 0.960 | 0.060 | 0.948 | 0.916 | 0.960 | 0.937 |
| Functional Tree | 0.960 | 0.024 | 0.969 | 0.964 | 0.960 | 0.962 |
| Decision Table | 0.982 | 0.106 | 0.930 | 0.864 | 0.982 | 0.919 |
| DTNB | 0.969 | 0.030 | 0.969 | 0.956 | 0.969 | 0.963 |
| LWL(J48+KNN) | 0.973 | 0.045 | 0.962 | 0.936 | 0.973 | 0.954 |
| Bayes Net | 0.978 | 0.054 | 0.959 | 0.925 | 0.978 | 0.951 |
| Naive Bayes | 0.965 | 0.066 | 0.946 | 0.908 | 0.965 | 0.936 |
| SVM | 0.969 | 0.024 | 0.973 | 0.965 | 0.969 | 0.967 |
| Neural Network | 0.965 | 0.030 | 0.968 | 0.956 | 0.965 | 0.960 |
| Combined (0.5) | 0.987 | 0.009 | 0.989 | 0.987 | 0.987 | 0.987 |
Each algorithm was trained using the selected features and evaluated with 10-fold cross-validation. Values are averages of the metrics evaluated with respect to the positive and negative classes.
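The record does not spell out how the individual learners are merged into the "Combined (0.5)" classifier. One common scheme, shown here purely as an assumption, averages the per-classifier probabilities that a mutation is causative and calls it causative when the mean reaches a threshold of 0.5. The function name `combined_vote` and the example probabilities are hypothetical.

```python
def combined_vote(probabilities, threshold=0.5):
    """Soft-voting sketch: average per-classifier probabilities.

    ASSUMPTION: the combination rule and the reading of "0.5" as a
    decision threshold are illustrative, not confirmed by the source.
    Returns (mean probability, predicted causative?).
    """
    if not probabilities:
        raise ValueError("need at least one classifier probability")
    mean_prob = sum(probabilities) / len(probabilities)
    return mean_prob, mean_prob >= threshold


# Invented outputs of three classifiers for one mutation.
score, is_causative = combined_vote([0.9, 0.8, 0.4])
print(f"combined score = {score:.3f}, causative = {is_causative}")
```

Averaging lets a confident majority outvote a single dissenting learner, which is one plausible reason a combined classifier can have fewer false positives and false negatives than any individual one.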
Top predicted unconfirmed mutations.
| Rank | Priority Score | Position | WT | Mutant |
| 1* | 0.97699 | 861 | L | R |
| 2* | 0.97649 | 724 | G | S |
| 3 | 0.97644 | 721 | G | S |
| 4 | 0.97577 | 858 | L | K |
| 5 | 0.97566 | 721 | G | D |
| 6 | 0.97559 | 861 | L | P |
| 7 | 0.97558 | 862 | L | P |
| 8 | 0.97509 | 719 | G | A |
| 9 | 0.97507 | 721 | G | A |
| 10 | 0.97483 | 729 | G | R |
| 11 | 0.97369 | 857 | G | E |
| 12 | 0.97365 | 719 | G | V |
| 13 | 0.97185 | 854 | T | A |
| 14 | 0.97110 | 735 | G | S |
| 15 | 0.97023 | 856 | F | S |
| 16 | 0.96854 | 856 | F | L |
| 17 | 0.96507 | 729 | G | E |
| 18 | 0.96462 | 855 | D | G |
| 19 | 0.96399 | 779 | G | S |
| 20 | 0.96291 | 858 | L | A |
| 21* | 0.96238 | 725 | T | M |
| 22 | 0.96210 | 858 | L | W |
| 23 | 0.96034 | 779 | G | C |
| 24 | 0.95998 | 723 | F | S |
| 25* | 0.95649 | 858 | L | Q |
| 26 | 0.95400 | 858 | L | M |
| 27 | 0.95381 | 731 | W | R |
| 28 | 0.95333 | 799 | L | R |
| 29 | 0.95268 | 720 | S | P |
| 30 | 0.95253 | 838 | L | P |
| … | | | | |
| 161* | 0.61788 | 746 | E | K |
Probability scores and rankings of the top predicted mutations. Scores were calculated with the multiple classifier trained on COSMIC v.50 data. Asterisks indicate the five mutations selected for cell-based assays.
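As a small illustration of how such a ranking is assembled, the sketch below sorts the five assayed mutations by priority score and builds the conventional wild type, position, mutant label (for example L861R). The scores are copied from the table; the tuple layout and the function name `mutation_label` are illustrative choices, and the resulting order is relative to these five entries only, not the full ranking.

```python
def mutation_label(position, wild_type, mutant):
    """Format a substitution as <WT><position><mutant>, e.g. L861R."""
    return f"{wild_type}{position}{mutant}"


# (position, WT, mutant, priority score) for the five assayed mutations,
# deliberately listed out of order.
predictions = [
    (746, "E", "K", 0.61788),
    (725, "T", "M", 0.96238),
    (861, "L", "R", 0.97699),
    (858, "L", "Q", 0.95649),
    (724, "G", "S", 0.97649),
]

# Highest priority score first.
ranked = sorted(predictions, key=lambda p: p[3], reverse=True)
labels = [mutation_label(pos, wt, mut) for pos, wt, mut, _ in ranked]
for label, (_, _, _, score) in zip(labels, ranked):
    print(f"{label}\t{score:.5f}")
```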
Feature values of selected mutations.
| Mutation | E746K | L861R | L858Q | G724S | T725M |
| Protein Family | EGFR | EGFR | EGFR | EGFR | EGFR |
| Protein Group | TK | TK | TK | TK | TK |
| Wildtype amino acid | E | L | L | G | T |
| Blosum62 | 1 | −2 | −2 | 0 | −1 |
| Side Chain Polarity Mut | 1 | 1 | 1 | 1 | 0 |
| Conservation AllKinase Wild | 0.049284 | 0.12363 | 0.409919 | 0.545306 | 0.122923 |
| Conservation Group | 0.392157 | 0.215686 | 0.803922 | 0.94902 | 0.380392 |
| Conservation AllKinase | 0.186021 | 0.070003 | 0.51031 | 0.716841 | 0.146852 |
| Conservation Family | 0.454545 | 0.818182 | 0.818182 | 1 | 0.909091 |
| Sub domain | II | VIb | VIb | I | I |
| Avg Mass | 146.18934 | 174.20274 | 146.14594 | 105.09344 | 149.20784 |
| binding site | NA | NA | NA | NA | NA |
| Van der Waals Volume Wild | 109 | 124 | 124 | 48 | 93 |
| modification | NA | NA | NA | NA | Phosphorylation |
| snp amino acid | K | R | Q | S | M |
| Side Chain Polarity Wild | 1 | 0 | 0 | 0 | 1 |
| Is Pk Domain | 0 | 0 | 0 | 0 | 0 |
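The conservation rows above are fractions between 0 and 1 measured at different evolutionary depths (family, group, all kinases). The record does not give the exact formula, so the following is only a sketch under a simple assumption: conservation is the fraction of non-gap residues in an alignment column that match a given residue, and the consensus type is the most frequent residue in that column. The function names and the toy column are invented for illustration.

```python
from collections import Counter


def conservation(column, residue, gap="-"):
    """Fraction of non-gap residues in an alignment column equal to `residue`.

    ASSUMPTION: a simple frequency-based definition; the article's exact
    scoring scheme may differ.
    """
    residues = [aa for aa in column if aa != gap]
    if not residues:
        return 0.0
    return residues.count(residue) / len(residues)


def consensus_conservation(column, gap="-"):
    """Return (consensus residue, its frequency) for an alignment column."""
    residues = [aa for aa in column if aa != gap]
    if not residues:
        return None, 0.0
    consensus, count = Counter(residues).most_common(1)[0]
    return consensus, count / len(residues)


# Toy column from a hypothetical 11-sequence family alignment.
family_column = "LLLLLLLLLMI"
wt_score = conservation(family_column, "L")
consensus, consensus_score = consensus_conservation(family_column)
print(f"wild type L: {wt_score:.6f}; consensus {consensus}: {consensus_score:.6f}")
```

Running the same function on columns drawn from family-level, group-level and all-kinase alignments would yield one value per evolutionary depth, which is the shape of the conservation features listed in the table.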
Figure 1. Structural location of selected EGFR mutation sites.
Protein crystal structure [PDB: 2JIU] shown as cartoon, with sites G724, T725, L858 and L861 shown as spheres. Structural regions highlighted in yellow are kinase subdomain I and the activation loop. The structure image was generated using PyMOL [75].
Figure 2. Auto-phosphorylation of wild-type and mutant EGFR and impact of mutations on downstream EGFR signaling.
The blot shows phosphorylation of the activation-loop tyrosine Y845 and the four C-terminal tail tyrosines (Y1045, Y1068, Y1086 and Y1173) in EGFR, and of two downstream proteins, ERK1/2 and AKT, in the presence (+) and absence (−) of EGF. “Un” indicates untransfected CHO cells. Total levels of EGFR (GFP), ERK1/2, AKT and tubulin (control) are also shown.
Figure 3. Quantified tyrosine auto-phosphorylation levels of wild-type and mutant EGFR.
Quantified phosphorylation levels are shown as histograms. Quantification was done using ImageJ.