Peng Jiang, Haonan Wu, Jiawei Wei, Fei Sang, Xiao Sun, Zuhong Lu.
Abstract
In yeast, meiotic recombination is initiated by DNA double-strand breaks (DSBs), which occur at relatively high frequencies in some genomic regions (hotspots) and at relatively low frequencies in others (coldspots). Although observations of individual hot/cold spots have given clues to the mechanism of recombination initiation, predicting hot/cold spots from DNA sequence information remains a challenging task. In this article, we introduce a random forest (RF) prediction model to detect recombination hot/cold spots in the yeast genome. The out-of-bag (OOB) estimate indicated that the RF classifier achieved high prediction performance, with 82.05% total accuracy and a Matthews correlation coefficient (MCC) of 0.638. Compared with an alternative machine-learning algorithm, the support vector machine (SVM), the RF method performs better in both sensitivity and specificity. The prediction model is implemented as a web server (RF-DYMHC) and is freely available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast genome and prediction parameters (RI value and non-overlapping window scan size), the program reports the predicted hot/cold spots and marks them in color.
Year: 2007 PMID: 17478517 PMCID: PMC1933199 DOI: 10.1093/nar/gkm217
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
The prediction performance of the RF model using the gapped dinucleotide composition feature
| Features | Se (%) | Sp (%) | MCC | ACC (%) |
|---|---|---|---|---|
| Gap{0} | 79.57 | 83.02 | 0.615 | 80.94 |
| Gap{1} | 79.81 | 83.10 | 0.619 | 81.12 |
| Gap{0,1} | 80.59 | 84.26 | 0.638 | 82.05 |
(a) RF model with parameters mtry = 4 and ntree = 1000. The prediction system was evaluated by OOB estimation.
(b) Gapped dinucleotide composition features were used. The integers inside the braces indicate the number of intervening bases.
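
To make the feature definition concrete, the following is a minimal sketch of how a gapped dinucleotide composition vector might be computed. It assumes that each feature is the relative frequency of an ordered base pair separated by a fixed number of intervening bases, and that Gap{0,1} is the concatenation of the Gap{0} and Gap{1} vectors (32 values). The function names, the normalization, and the handling of ambiguous bases are illustrative assumptions, not details taken from the record above.

```python
from itertools import product

ALPHABET = "ACGT"
# The 16 ordered base pairs, in a fixed order: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]


def gapped_dinucleotide_composition(seq, gap):
    """Return 16 relative frequencies of base pairs separated by `gap` bases.

    gap = 0 gives the ordinary (adjacent) dinucleotide composition.
    Normalizing by the number of counted pairs is an assumption.
    """
    seq = seq.upper()
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - gap - 1):
        pair = seq[i] + seq[i + gap + 1]
        if pair in counts:  # skip pairs containing N or other ambiguity codes
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def feature_vector(seq, gaps=(0, 1)):
    """Concatenate the compositions for each gap, e.g. Gap{0,1} -> 32 values."""
    vec = []
    for g in gaps:
        vec.extend(gapped_dinucleotide_composition(seq, g))
    return vec


if __name__ == "__main__":
    v = feature_vector("ACGTACGTTTGACA")
    print(len(v))  # one value per (gap, base pair) combination
```

Under these assumptions, each 16-value block sums to 1 for any sequence long enough to contain at least one valid pair, so the vectors are comparable across sequences of different lengths.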
Figure 1. Expected prediction accuracy for sequences with different reliability indices (RI). The accuracy and the fraction of sequences with each RI value are given. The expected accuracy for sequences with higher RI is much higher than that for sequences with lower RI.
Performance comparisons with the SVMs. The training data set was randomly divided into two data sets (data set 1 and data set 2) of approximately equal size. The performance was evaluated by two-fold validation.
| Classifier | Test 1 Se (%) | Test 1 Sp (%) | Test 1 MCC | Test 1 ACC (%) | Test 2 Se (%) | Test 2 Sp (%) | Test 2 MCC | Test 2 ACC (%) |
|---|---|---|---|---|---|---|---|---|
| RF | 77.02 | 84.31 | 0.615 | 81.15 | 70.20 | 89.82 | 0.616 | 80.56 |
| SVM | 74.04 | 84.31 | 0.588 | 79.90 | 69.41 | 89.47 | 0.605 | 80.00 |
(a) Test 1 used data set 1 for parameter tuning and training, and data set 2 for evaluating prediction performance.
(b) Test 2 used data set 2 for parameter tuning and training, and data set 1 for evaluating prediction performance.
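
The four measures reported in the tables (Se, Sp, MCC, ACC) can all be derived from a binary confusion matrix. The sketch below uses the standard definitions and assumes that hotspots are treated as the positive class, which the record above does not state. The counts in the usage example are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute Se, Sp, MCC and ACC from confusion-matrix counts.

    tp/fn: positives (assumed here to be hotspots) predicted right/wrong.
    tn/fp: negatives (assumed here to be coldspots) predicted right/wrong.
    Se, Sp and ACC are returned as percentages, MCC as a value in [-1, 1].
    """
    se = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + fn + tn + fp
    acc = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; 0.0 is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Se": 100 * se, "Sp": 100 * sp, "MCC": mcc, "ACC": 100 * acc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    m = classification_metrics(tp=80, fn=20, tn=90, fp=10)
    print({k: round(x, 3) for k, x in m.items()})
```

Because MCC uses all four cells of the confusion matrix, it is less sensitive to class imbalance than total accuracy, which is one reason it is commonly reported alongside Se and Sp.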
Figure 2. Box plots of recombination rates of the predicted hot/cold spots with different RI values. The median value is represented by a line within the rectangular box. The lower and upper edges of the rectangle represent the first and third quartiles, respectively. The circles and stars represent the ‘mild’ and ‘extreme’ outliers, respectively.