Lakshmi K Matukumalli, John J Grefenstette, David L Hyten, Ik-Young Choi, Perry B Cregan, Curtis P Van Tassell.
Abstract
BACKGROUND: Single nucleotide polymorphisms (SNPs) constitute more than 90% of the genetic variation, and hence can account for most trait differences among individuals in a given species. The polymorphism detection programs PolyBayes and PolyPhred produce many false positive SNP predictions even with stringent parameter values. We developed a machine learning (ML) method to augment PolyBayes and improve its prediction accuracy. ML methods have also been applied successfully to other bioinformatics problems, including the prediction of genes, promoters, transcription factor binding sites and protein structures.
Year: 2006 PMID: 16398931 PMCID: PMC1955739 DOI: 10.1186/1471-2105-7-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Application of the machine learning program in training and test/prediction modes. The left side of the flow chart represents the training mode, in which the input features and the expected output are fed together to the ML program. The program then analyzes the data and generates a model in the form of a decision tree or a set of production rules. The right side of the flow chart represents the testing or prediction mode, in which the model generated in training mode is used to evaluate a new set of input features and predict an output.
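The two modes in Figure 1 can be illustrated with a deliberately small sketch. It assumes a one-level decision tree (a stump) as a stand-in for the ML program, which is not the learner used in the study; the names `train_stump` and `predict`, the choice of two feature columns, and all data values are hypothetical.

```python
from typing import List, Tuple

Features = List[float]
Model = Tuple[int, float]  # (feature index, threshold)


def train_stump(examples: List[Tuple[Features, bool]]) -> Model:
    """Training mode: features and expert labels go in together, a model comes out.

    The stump predicts "true SNP" when the chosen feature is >= the threshold.
    """
    best_model: Model = (0, 0.0)
    best_errors = len(examples) + 1
    n_features = len(examples[0][0])
    for index in range(n_features):
        # Try every observed value of this feature as a candidate threshold.
        for candidate, _ in examples:
            threshold = candidate[index]
            errors = sum(
                (features[index] >= threshold) != label
                for features, label in examples
            )
            if errors < best_errors:
                best_model, best_errors = (index, threshold), errors
    return best_model


def predict(model: Model, features: Features) -> bool:
    """Prediction mode: apply the trained model to unseen features."""
    index, threshold = model
    return features[index] >= threshold


# Hypothetical candidates: [PolyBayes probability, average quality of minor allele]
training = [
    ([0.99, 30.0], True),
    ([0.95, 28.0], True),
    ([0.97, 12.0], False),
    ([0.60, 10.0], False),
]
model = train_stump(training)
new_candidates = [[0.90, 35.0], [1.00, 8.0]]
predictions = [predict(model, c) for c in new_candidates]
print(model, predictions)
```

On this toy data the stump splits on the second feature, which mirrors the point of the figure: the expert labels are needed only in training mode, and the resulting model is then applied to candidates it has not seen.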
Final set of optimized features chosen for machine learning
| No. | Feature | Values |
|---|---|---|
| 1 | Sequence depth | Continuous |
| 2 | Variation type | transition, transversion, indel |
| 3 | PolyBayes probability | Continuous |
| 4 | Frequency of major allele | Continuous |
| 5 | Frequency of minor allele | Continuous |
| 6 | Relative distance from closest end | Continuous |
| 7 | Agreement in the forward and reverse reads | Continuous |
| 8 | Maximum quality of the major allele | Continuous |
| 9 | Maximum quality of the minor allele | Continuous |
| 10 | Average quality of major allele | Continuous |
| 11 | Average quality of minor allele | Continuous |
| 12 | Haplotype of second variation | Continuous |
| 13 | Local average quality | Continuous |
| 14 | Overall average quality | Continuous |
| 15 | Alignment quality | Continuous |
| 16 | Common repeats | Repeat_type |
A detailed definition and explanation of these features is given in the Methods section. Feature values are either continuous within a given numerical range or discrete with a limited set of options.
Comparison of ML and PolyBayes on test data set
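| Measure | ML (1) | ML (2) | PolyBayes |
|---|---|---|---|
| True positives (TP) | 1153 | 1202 | 1435 |
| True negatives (TN) | 16,748 | 16,706 | NA |
| False positives (FP) | 207 | 249 | 16,955 |
| False negatives (FN) | 282 | 233 | NA |
| Accuracy (%) | 97.3 | 97.4 | 7.8 |
| Sensitivity (%) | 80.3 | 83.8 | 100 (Set) |
| Specificity (%) | 98.7 | 98.5 | NA |
| PPV (%) | 84.8 | 82.8 | 7.8 |
| NPV (%) | 98.3 | 98.6 | NA |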
We define the following terms to contrast ML performance with PolyBayes. A SNP prediction program produces a true positive (TP) if it predicts a SNP that the expert judges true. Likewise, a false positive (FP) is a predicted SNP that the expert judges false, a true negative (TN) is a prediction of a non-SNP that concurs with the expert, and a false negative (FN) is a failure to identify a SNP that the expert identified. The following parameters were used to measure the performance of the ML output: accuracy (the fraction of candidate SNPs correctly classified), sensitivity (the fraction of true SNPs correctly identified), specificity (the fraction of false SNPs correctly identified), positive predictive value (the fraction of predicted SNPs that are true) and negative predictive value (the fraction of predicted non-SNPs that are correctly classified).
Accuracy = (TP + TN)/total
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Positive Predictive Value (PPV) = TP/(TP + FP)
Negative Predictive Value (NPV) = TN/(TN + FN)
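The five formulas can be checked against the comparison table with a short sketch. The `Counts` class is a hypothetical helper, and the example counts assume that the first four rows of the first table column are TP, TN, FP and FN, in that order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one SNP predictor, judged against the expert calls."""
    tp: int
    tn: int
    fp: int
    fn: int

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn

    def accuracy(self) -> float:
        return (self.tp + self.tn) / self.total

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.fp + self.tn)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


# Assumed reading of the first column of the comparison table.
ml = Counts(tp=1153, tn=16748, fp=207, fn=282)

for name in ("accuracy", "sensitivity", "specificity", "ppv", "npv"):
    value = getattr(ml, name)()
    print(f"{name}: {100 * value:.1f}%")
```

Under this reading, TP + FN gives 1435 expert-confirmed SNPs and FP + TN gives 16,955 rejected candidates, which match the PolyBayes column of the table.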
Applying the machine learning program reduces the number of false positives from 16,955 to about 250 (207 and 249 in the two ML columns). Accuracy rises from 7.8% with PolyBayes alone to about 97%, and the positive predictive value rises from 7.8% to between 82.8% and 84.8%.
Comparison of positive predictive values (PPV) from PolyBayes and ML predictor
| | TP | FP | PPV (%) |
|---|---|---|---|
| | 20 | 1756 | 1.1 |
| | 38 | 1529 | 2.4 |
| | 31 | 1683 | 1.8 |
| | 45 | 2015 | 2.2 |
| | 50 | 1613 | 3.0 |
| | 53 | 1055 | 4.8 |
| | 148 | 2069 | 6.7 |
| | 1050 | 5235 | 16.7 |
| PolyBayes (total) | 1435 | 16955 | |
| ML predictor | 1153 | 207 | |
TP: True Positive, FP: False Positive. Positive predictive value (PPV) = TP/(TP + FP).
The proportion of true positives among the predictions can be increased by using more stringent PolyBayes posterior probability cut-off values. However, even when the posterior probability cut-off is set to the maximum of 1.0, the positive predictive value with PolyBayes is less than 20%. Application of machine learning gave a 5- to 10-fold increase in PPV across the different PolyBayes posterior probability values.
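The PPV figures in this table can be reproduced from the TP and FP counts alone. The sketch below assumes that the first eight rows are PolyBayes posterior probability bins, that the ninth row is the PolyBayes total, and that the last row is the ML predictor; the bin boundaries themselves are not available here, so the rows are identified only by position.

```python
from typing import List, Tuple

# (TP, FP) for each of the eight PolyBayes rows, in table order.
bins: List[Tuple[int, int]] = [
    (20, 1756),
    (38, 1529),
    (31, 1683),
    (45, 2015),
    (50, 1613),
    (53, 1055),
    (148, 2069),
    (1050, 5235),
]


def ppv_percent(tp: int, fp: int) -> float:
    """Positive predictive value, TP/(TP + FP), as a percentage to one decimal."""
    return round(100 * tp / (tp + fp), 1)


per_bin = [ppv_percent(tp, fp) for tp, fp in bins]

# Pooling the bins should give back the PolyBayes total row.
total_tp = sum(tp for tp, _ in bins)
total_fp = sum(fp for _, fp in bins)
pooled_polybayes = ppv_percent(total_tp, total_fp)

# ML predictor row of the same table.
ml_ppv = ppv_percent(1153, 207)

print(per_bin)
print(total_tp, total_fp, pooled_polybayes, ml_ppv)
```

The pooled counts come to 1435 true positives and 16,955 false positives, which supports reading the ninth row as the sum over the eight bins.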
Figure 2. Simplified decision tree. The pruned decision tree has 491 nodes. The figure shows only the top four layers of nodes, which indicate the most critical features in the ML decision-making process. A detailed version of this tree is available at the website [21].
Figure 3. SNP likelihood in sequences showing common variation. The positions indicated in dark grey are the polymorphic positions. Sequences 2 and 4 show common variation at two positions in the sequence alignment, and hence these polymorphisms are more likely to be real than the common variation shown in sequences 1 and 5 or the variation in sequence 3.
Algorithm for haplotype variation factor determination
The haplotype variation factor is a measure of the co-variation observed within the same chromatogram across different SNP loci. For each SNP locus, we first calculate the fraction of minor-allele observations that co-vary, that is, that occur on a chromatogram also showing a minor allele at a different SNP locus. These fractions are then summed over all positions, and the sum is divided by the total number of polymorphisms to give the mean value, the haplotype variation factor.
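The definition above can be sketched as follows. This is one reading of the prose, not the authors' algorithm: the function name, the input layout (one list of minor-allele flags per chromatogram) and the treatment of loci with no minor allele (fraction 0) are all assumptions. The example data are hypothetical and only mimic the pattern described for Figure 3.

```python
from typing import Dict, List, Tuple


def haplotype_variation_factor(
    minor_flags: Dict[str, List[bool]],
) -> Tuple[List[float], float]:
    """Return the per-locus co-variation fractions and their mean.

    minor_flags maps a chromatogram id to one flag per SNP locus;
    a flag is True when that chromatogram shows the minor allele there.
    """
    reads = list(minor_flags.values())
    if not reads:
        return [], 0.0
    n_loci = len(reads[0])
    fractions: List[float] = []
    for locus in range(n_loci):
        minor_reads = [read for read in reads if read[locus]]
        if not minor_reads:
            fractions.append(0.0)  # assumption: no minor allele means no co-variation
            continue
        # A minor allele co-varies when the same chromatogram carries
        # a minor allele at any other SNP locus.
        covarying = sum(
            1
            for read in minor_reads
            if any(flag for other, flag in enumerate(read) if other != locus)
        )
        fractions.append(covarying / len(minor_reads))
    return fractions, sum(fractions) / n_loci


# Hypothetical alignment with four polymorphic positions:
# sequences 2 and 4 share minor alleles at two positions, sequences 1 and 5
# share one, and sequence 3 has a minor allele at a position of its own.
example = {
    "seq1": [False, False, True, False],
    "seq2": [True, True, False, False],
    "seq3": [False, False, False, True],
    "seq4": [True, True, False, False],
    "seq5": [False, False, True, False],
}
fractions, factor = haplotype_variation_factor(example)
print(fractions, factor)
```

In this example only the two positions shared by sequences 2 and 4 receive a non-zero fraction, consistent with the statement that polymorphisms co-varying on the same chromatograms are more likely to be real.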