Jay K Khurana, Jay E Reeder, Antony E Shrimpton, Juilee Thakar.
Abstract
BACKGROUND: Non-synonymous single nucleotide polymorphisms (nsSNPs) are the most common DNA sequence variation associated with disease in humans. Thus, determining the clinical significance of each nsSNP is of great importance. Potentially detrimental nsSNPs may be identified by genetic association studies or by functional analysis in the laboratory, both of which are expensive and time-consuming. Existing computational methods lack the accuracy and features needed to facilitate nsSNP classification for clinical use. We developed the GESPA (GEnomic Single nucleotide Polymorphism Analyzer) program to predict the pathogenicity and disease phenotype of nsSNPs.
Year: 2015 PMID: 26206375 PMCID: PMC4513380 DOI: 10.1186/s12859-015-0673-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Flowchart overview of the algorithm used in GESPA. a Input, output and retrieval of data from available resources. The left cells show acceptable inputs, including a HUGO symbol or a batch file containing multiple genes and nsSNPs of interest. The middle cells represent the resources used to obtain the required information. Genes whose data are already saved on a custom SQL-based cloud server are downloaded instantly, circumventing slower retrieval from databases. The right cells represent the output obtained from each resource. b Steps involved in the nsSNP pathogenicity and phenotype prediction algorithm. The steps described in the methods section are listed in the middle table and determine the pathogenicity and phenotype of the given nsSNP. The results are presented in table format.
GESPA pathogenicity classifier accuracy compared to other software using the humsavar test set
| Software | Sensitivity (%)<sup>a</sup> | Specificity (%)<sup>b</sup> | Balanced accuracy (%)<sup>c</sup> | ROC curve AUC |
|---|---|---|---|---|
| GESPA (humsavar cross validation) | 96.41 | 79.49 | 87.95 | 0.936 |
| GESPA (humvar training set) | 96.31 | 79.23 | 87.78 | 0.932 |
| PolyPhen-2 | 88.68 | 62.45 | 75.56 | 0.847 |
| SIFT | 85.03 | 68.95 | 76.99 | 0.854 |
| PROVEAN | 78.39 | 79.11 | 78.75 | 0.848 |
| Mutation Assessor | 85.29 | 71.02 | 78.15 | 0.848 |
<sup>a</sup> Sensitivity = TP/(TP + FN)
<sup>b</sup> Specificity = TN/(TN + FP)
<sup>c</sup> Balanced accuracy = (Sensitivity + Specificity)/2
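The footnote definitions can be checked directly against the reported values. The following Python sketch is not part of GESPA; the function names are illustrative, and it assumes that TP, TN, FP and FN denote true-positive, true-negative, false-positive and false-negative counts with pathogenic nsSNPs as the positive class. It recomputes balanced accuracy from the sensitivity and specificity listed in the table.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Specificity = TN / (TN + FP)."""
    return tn / (tn + fp)


def balanced_accuracy(sens: float, spec: float) -> float:
    """Balanced accuracy = (sensitivity + specificity) / 2."""
    return (sens + spec) / 2


# (sensitivity %, specificity %) copied from the comparison table.
reported = {
    "GESPA (humsavar cross validation)": (96.41, 79.49),
    "GESPA (humvar training set)": (96.31, 79.23),
    "PolyPhen-2": (88.68, 62.45),
    "SIFT": (85.03, 68.95),
    "PROVEAN": (78.39, 79.11),
    "Mutation Assessor": (85.29, 71.02),
}

for name, (sens, spec) in reported.items():
    # Recomputed values may differ from the table in the last digit,
    # since the table entries are themselves rounded.
    print(f"{name}: {balanced_accuracy(sens, spec):.3f} %")
```

Because balanced accuracy weights both classes equally, a tool with high sensitivity but low specificity (for example 88.68 % and 62.45 %) scores lower than its sensitivity alone would suggest.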
Fig. 2 GESPA main interface. The gene summary interface allows access to important nsSNP and gene annotations. General information on the gene of the selected nsSNP is provided in the top right corner. Access to other important information related to the gene, including nucleotide and protein sequences and alignments, is located on the right side. Alignments and sequences open in separate closable tabs, while gene info opens the corresponding page for the gene on NCBI. Access to annotations related to SNPs is found in the lower portion of the interface. Predictions of SNP phenotype and pathogenicity are displayed in the main summary table.
Fig. 3 Multinomial ROC curve for 5-fold cross validation on the humsavar dataset. The multinomial ROC curve (black curve) is the average of 5 ROC curves, each representing the pathogenicity prediction accuracy of GESPA after training on four folds of humsavar and testing on the remaining fold. The WPC Score, PSIC Score, and nsSNPs for literature were used as predictors for pathogenicity. The point of maximum balanced accuracy on each curve (intersection of the yellow line and the black curve) was used to determine the optimal cutoff points of the WPC and PSIC Scores for each fold. The AUC was found to be 0.936.
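Choosing a cutoff at the point of maximum balanced accuracy can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the GESPA implementation: it uses a single hypothetical score per variant rather than the combined WPC and PSIC predictors, treats label 1 as pathogenic, calls a variant pathogenic when its score is at or above the cutoff, and runs on made-up toy data. The name `best_cutoff` is illustrative.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, balanced_accuracy) for the cutoff that maximizes
    balanced accuracy. A variant is predicted pathogenic (label 1)
    when score >= cutoff. Ties keep the lowest such cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (None, -1.0)
    # Every distinct observed score is a candidate cutoff,
    # i.e. one point on the ROC curve.
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        bal = (tp / n_pos + tn / n_neg) / 2
        if bal > best[1]:
            best = (cutoff, bal)
    return best


# Toy data (hypothetical): 5 pathogenic and 3 neutral variants.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1, 1, 1]

cutoff, bal_acc = best_cutoff(toy_scores, toy_labels)
print(f"cutoff={cutoff}, balanced accuracy={bal_acc:.2f}")
```

In a cross-validation setting, a selection of this kind would be repeated on the training folds of each split, giving one cutoff per fold as described in the caption.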
GESPA phenotype prediction for a randomized sample of 1080 nsSNPs
| Data not available (15.7 %) | n (%) | Data available (84.3 %) | n (%) |
|---|---|---|---|
| Correct phenotype, less accurate prediction | 79 (46 %) | Correct prediction | 870 (96 %) |
| No SNPs on gene in given range | 73 (43 %) | Incorrect prediction | 40 (4 %) |
| No previous SNPs on ClinVar | 18 (11 %) | | |
| Overall accuracy: 80.56 % | | | |
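The percentages in this table follow from the counts. The short Python sketch below recomputes them; the variable names are illustrative and the counts are copied from the table. It suggests that the overall accuracy corresponds to the 870 correct predictions divided by all 1080 sampled nsSNPs, so the 79 cases with a correct phenotype but less accurate prediction do not appear to be counted as correct.

```python
total = 1080  # randomized sample size from the table caption

# Counts copied from the table.
unavailable = {
    "Correct phenotype, less accurate prediction": 79,
    "No SNPs on gene in given range": 73,
    "No previous SNPs on ClinVar": 18,
}
available = {
    "Correct prediction": 870,
    "Incorrect prediction": 40,
}

n_unavailable = sum(unavailable.values())
n_available = sum(available.values())

# Share of the sample in each group.
pct_unavailable = 100 * n_unavailable / total
pct_available = 100 * n_available / total

# Accuracy among nsSNPs with data, and over the whole sample.
accuracy_with_data = 100 * available["Correct prediction"] / n_available
overall_accuracy = 100 * available["Correct prediction"] / total

print(f"Data not available: {pct_unavailable:.1f} %")
print(f"Data available: {pct_available:.1f} %")
print(f"Accuracy when data available: {accuracy_with_data:.0f} %")
print(f"Overall accuracy: {overall_accuracy:.2f} %")
```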