
RF-DYMHC: detecting the yeast meiotic recombination hotspots and coldspots by random forest model using gapped dinucleotide composition features.

Peng Jiang, Haonan Wu, Jiawei Wei, Fei Sang, Xiao Sun, Zuhong Lu.

Abstract

In yeast, meiotic recombination is initiated by double-strand DNA breaks (DSBs), which occur at relatively high frequencies in some genomic regions (hotspots) and relatively low frequencies in others (coldspots). Although observations concerning individual hot/cold spots have given clues as to the mechanism of recombination initiation, predicting hot/cold spots from DNA sequence information is a challenging task. In this article, we introduce a random forest (RF) prediction model to detect recombination hot/cold spots from the yeast genome. The out-of-bag (OOB) estimation of the model indicated that the RF classifier achieved high prediction performance, with 82.05% total accuracy and a Matthews correlation coefficient (MCC) of 0.638. Compared with an alternative machine-learning algorithm, the support vector machine (SVM), the RF method performed better in both sensitivity and specificity. The prediction model is implemented as a web server (RF-DYMHC) and is freely available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast genome and prediction parameters (RI value and non-overlapping scan window size), the program reports the predicted hot/cold spots and marks them in color.

Substances: Codon; Nucleotides

Year: 2007 PMID: 17478517 PMCID: PMC1933199 DOI: 10.1093/nar/gkm217

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In yeast, meiotic recombination is initiated by double-strand DNA breaks (DSBs). Meiotic DSBs occur at relatively high frequencies in some genomic regions, called hotspots, whereas regions with low DSB frequencies are called coldspots (1). Several studies have examined whether hot/cold spots share common DNA sequences and/or structural elements (2,3). Hotspots were found to be non-randomly associated with regions of high G + C base composition and certain transcriptional profiles, whereas coldspots were non-randomly associated with centromeres and telomeres. Although observations concerning individual hot/cold spots have given clues as to the mechanism of recombination initiation, predicting hot/cold spots from DNA sequence information remains a challenging task. So far, nearly all methods for finding recombination hot/cold spots are based on population-genetic data (4–6), and no software or web server has been reported that predicts hot/cold spots from a single DNA sequence. In this study, we present a novel machine-learning method, a random forest (RF) model, to detect yeast meiotic recombination hotspots and coldspots from genome sequences. Although several studies have demonstrated a correlation between the synonymous codon usage pattern and the recombination rate in Caenorhabditis elegans, mouse, human and other species (7–13), most hotspots are intergenic rather than intragenic, so attributes based on gene codon usage patterns may not be applicable to non-coding regions. For that reason, an ORF (open reading frame)-independent feature, the gapped dinucleotide composition, was used in our study. Compared with an alternative machine-learning algorithm, the support vector machine (SVM), the RF method performed better in both sensitivity and specificity.
The prediction model is implemented as a web server (RF-DYMHC) and it is freely available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast DNA sequence and prediction parameters (RI-value and non-overlapping scan window size), the program reports the predicted hot/cold spots and marks them in color.

MATERIALS AND METHODS

Data sets

Gerton et al. (14) estimated the relative recombination rates for loci of the yeast Saccharomyces cerevisiae using DNA microarrays at single-gene resolution. To estimate DSB formation adjacent to each ORF, they measured the ratio of hybridization of a DSB-enriched probe (P2) to that of a total genomic probe (P1). The relative strength of the recombination rate was estimated by the P2/P1 hybridization ratio. The experiments were repeated seven times for each of the 6200 genes. In this article, we take the median value as the relative recombination rate of each sequence. If any repeated array value was missing, the sequence was excluded. In total, 5266 sequences were collected. Sequences with a relative hybridization ratio ≥1.5 are defined as hotspots, and those with a relative hybridization ratio <0.82 are defined as coldspots. Thus, we obtained 490 hotspots and 591 coldspots, which composed the training data set. The S. cerevisiae mitochondrial DNA sequence, which served as a negative control for our method, was downloaded from the Saccharomyces Genome Database (15) at http://www.yeastgenome.org/. All data sets used in this article can be downloaded from http://www.bioinf.seu.edu.cn/Recombination/datasets.htm
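The labelling rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `label_sequence`, the use of `None` for a missing array value and the example ratios are all assumptions; only the median rule and the two thresholds (≥1.5 and <0.82) come from the text.

```python
from statistics import median

HOT_THRESHOLD = 1.5    # median ratio >= 1.5 -> hotspot (threshold from the text)
COLD_THRESHOLD = 0.82  # median ratio < 0.82 -> coldspot (threshold from the text)


def label_sequence(ratios):
    """Label one ORF from its replicate P2/P1 ratios.

    `ratios` holds the replicate values; None marks a missing array value
    (a hypothetical encoding chosen for this sketch).
    """
    if any(r is None for r in ratios):
        return "excluded"      # any missing replicate removes the sequence
    m = median(ratios)
    if m >= HOT_THRESHOLD:
        return "hotspot"
    if m < COLD_THRESHOLD:
        return "coldspot"
    return "unlabeled"         # between the thresholds: not in the training set


# Hypothetical replicate values, for illustration only.
hot = label_sequence([1.2, 1.6, 1.5, 1.7, 1.4, 1.9, 1.5])
cold = label_sequence([0.7, 0.8, 0.9, 0.81, 0.6, 0.85, 0.79])
missing = label_sequence([1.0, None, 1.1, 0.9, 1.0, 1.2, 1.0])
```

Sequences that fall between the two thresholds are left out of the training set, which is why only 1081 of the 5266 sequences are used for training.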

Gapped dinucleotide composition features

The gapped dinucleotide composition is the fraction of each pair of nucleotides separated by k intervening bases within a sequence. It can be defined as

$$f_k(i) = \frac{n_k(i)}{n_k}, \qquad i = 1, \ldots, 16,$$

where $n_k(i)$ is the observed total number of the i-th two nucleotides with k intervening bases and $n_k$ is the total number of all possible two nucleotides with k intervening bases. If k = 0, $f_0(i)$ is the dinucleotide composition (16).
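A minimal sketch of this feature, assuming that the 16 pairs are ordered AA, AC, ..., TT and that positions containing symbols other than A, C, G or T are skipped (neither detail is stated in the text). The function name is hypothetical.

```python
from itertools import product

# The 16 ordered nucleotide pairs, AA .. TT (ordering is an assumption).
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def gapped_dinucleotide_composition(seq, k):
    """Return {pair: fraction} for nucleotide pairs separated by k bases."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    # Position i is paired with position i + k + 1 (k bases in between).
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # assumption: skip pairs with N or other symbols
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in DINUCLEOTIDES}
    return {pair: counts[pair] / total for pair in DINUCLEOTIDES}


comp0 = gapped_dinucleotide_composition("ACGTACGT", 0)  # adjacent pairs
comp1 = gapped_dinucleotide_composition("ACGTACGT", 1)  # one base in between
```

For the 8-base example there are 7 adjacent pairs and 6 pairs with one intervening base, so `comp0["AC"]` is 2/7 and `comp1["AG"]` is 2/6.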

Random forest

RF is a classifier consisting of an ensemble of tree-structured classifiers (17). RF takes advantage of two powerful machine-learning techniques: bagging (18) and random feature selection. In bagging, each tree is trained on a bootstrap sample of the training data, and predictions are made by majority vote of the trees. RF is a further development of bagging. Instead of using all features, RF randomly selects a subset of features to split on at each node when growing a tree. To assess the prediction performance of the algorithm, RF performs a type of cross-validation in parallel with the training step by using the so-called out-of-bag (OOB) samples. Specifically, during training, each tree is grown using a particular bootstrap sample. Since bootstrapping is sampling with replacement from the training data, some of the sequences will be ‘left out’ of the sample, while others will be repeated in the sample. The ‘left out’ sequences constitute the OOB sample. On average, each tree is grown using about $1 - e^{-1} \approx 0.632$ (roughly 2/3) of the training sequences, leaving $e^{-1} \approx 0.368$ (roughly 1/3) as OOB. Because OOB sequences have not been used in the tree construction, they can be used to estimate the prediction performance (19,20). The RF algorithm was implemented with the randomForest R package (21).
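The 2/3 versus 1/3 split follows from the bootstrap itself: a given sequence is missed by all n draws with probability (1 − 1/n)^n, which tends to e^{−1} as n grows. The sketch below checks this for n = 1081 (the 490 + 591 training sequences); the function names, the number of trials and the seed are arbitrary choices made for this illustration.

```python
import math
import random


def expected_in_bag_fraction(n):
    """Probability that a given sample appears at least once in n draws."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_in_bag_fraction(n, trials=200, seed=0):
    """Average fraction of distinct samples over repeated bootstrap draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        in_bag = {rng.randrange(n) for _ in range(n)}  # sample with replacement
        total += len(in_bag) / n
    return total / trials


n_train = 1081  # 490 hotspots + 591 coldspots
exact = expected_in_bag_fraction(n_train)
simulated = simulated_in_bag_fraction(n_train)
limit = 1.0 - math.exp(-1.0)  # about 0.632
```

Both `exact` and `simulated` land close to `limit`, so roughly 37% of the training sequences are out-of-bag for any single tree.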

Support vector machine

SVM is a supervised machine-learning technique for data classification based on statistical learning theory (22). SVM seeks an optimal hyperplane to separate two classes of samples. It uses kernel functions to map the original data to a feature space of higher dimension and locates an optimal separating hyperplane there. The SVM algorithm was implemented with the e1071 (version 1.5-12) R package (23). We tried different kernels (linear, RBF, and second- and third-order polynomial), and the RBF kernel performed best (data not shown). We therefore used the SVM with an RBF kernel as a competitive machine-learning method for comparison with the RF algorithm. The parameters C and γ of the RBF kernel were optimized by the standard grid search (24).

Prediction system assessment

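For a two-class prediction problem, each classified instance falls into one of four categories: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The total prediction accuracy (ACC), specificity (Sp), sensitivity (Se) and Matthews correlation coefficient (MCC) (25) used to assess the prediction system are given by

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{Se} = \frac{TP}{TP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$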
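As a worked illustration, the four measures can be computed from confusion-matrix counts using their standard definitions. The function name is hypothetical, and the counts in the example are not reported by the authors: they were back-calculated here so that they reproduce the RF Test 1 row of Table 2, with hotspots taken as the positive class.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Standard Se, Sp, ACC and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + fp + tn + fn)  # total accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Se": se, "Sp": sp, "ACC": acc, "MCC": mcc}


# Assumed counts (back-calculated, not taken from the paper).
m = classification_metrics(tp=181, fp=48, tn=258, fn=54)
summary = {
    "Se (%)": round(100 * m["Se"], 2),
    "Sp (%)": round(100 * m["Sp"], 2),
    "MCC": round(m["MCC"], 3),
    "ACC (%)": round(100 * m["ACC"], 2),
}
```

With these counts, `summary` gives Se 77.02%, Sp 84.31%, MCC 0.615 and ACC 81.15%.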

Reliability index

Here, the reliability index (RI) was used to assess the reliability of recombination hotspot and coldspot predictions. For the RF algorithm, an intuitive RI can be derived from $f_+$ and $f_-$, the fractions of votes for the positive and negative classes of each sample, respectively.

RESULTS

Constructing the RF prediction model with gapped dinucleotide composition features

The prediction results of the RF classifiers are shown in Table 1. The performance was evaluated by the OOB estimation on the training data set. The gap {0} and gap {1} dinucleotide composition-based RF prediction models achieved total accuracies of 80.94 and 81.12%, respectively. The prediction performance can be improved by combining the two composition features: the gap {0, 1}-based RF model achieved 82.05% total accuracy and an MCC of 0.638.

Table 1.

The prediction performance of the RF model using the gapped dinucleotide composition feature^a

Features^b	Se (%)	Sp (%)	MCC	ACC (%)
Gap{0}	79.57	83.02	0.615	80.94
Gap{1}	79.81	83.10	0.619	81.12
Gap{0,1}	80.59	84.26	0.638	82.05

^a RF model with parameters mtry = 4 and ntree = 1000. The prediction system was evaluated by the OOB estimation.

^b The gapped dinucleotide composition features were used. The integers inside the braces indicate the number of intervening bases.

Reliability index of the RF model

The reliability of a prediction is an important factor that gives users more information about its quality. We adopted the RI to indicate the level of certainty of the prediction model. The results, shown in Figure 1, were obtained through the OOB estimation. They indicate that the higher the RI, the more reliable the prediction. When RI > 6, the total prediction accuracy is >90%. Approximately 78.1% of the predicted sequences had RI > 2, which indicates that the RF prediction model is reliable.

Figure 1.

Expected prediction accuracy for sequences with different reliability indices. The accuracy and the fraction of sequences with particular RI are given. The expected accuracy of sequences with higher RI is much better than those with lower RI.

Comparison with the SVM prediction model

SVMs have been reported to outperform other machine-learning methods in many fields of pattern recognition (24,26–31). We therefore chose the SVM prediction model as an alternative algorithm to compare with the RF prediction model. To make the comparison impartial, a two-fold cross-validation was implemented. We randomly divided the training data set into two independent data sets (data set 1 and data set 2) of approximately equal size. We then used one data set for parameter tuning (the parameters were optimized by the standard grid search (24)) and training, and the other data set for evaluating the prediction performance. As shown in Table 2, the RF classifier achieved higher sensitivity than the SVM classifier in both tests, and equal (Test 1) or higher (Test 2) specificity.
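The two-fold protocol can be sketched as below. This is a schematic under stated assumptions, not the authors' pipeline: `two_fold_validation`, `fit_majority` and `accuracy` are hypothetical names, the majority-class "model" is only a stand-in for the RF or SVM training step, and the toy data are invented.

```python
import random
from collections import Counter


def two_fold_validation(samples, fit, evaluate, seed=0):
    """Split samples into two halves; train on one half, test on the other, then swap."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    set1, set2 = shuffled[:half], shuffled[half:]
    results = {}
    for name, train, test in (("Test 1", set1, set2), ("Test 2", set2, set1)):
        model = fit(train)                     # tuning and training see only `train`
        results[name] = evaluate(model, test)  # held-out half is used only here
    return results


def fit_majority(train):
    """Stand-in learner: remember the most common label in the training half."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(model, test):
    """Fraction of held-out samples whose label equals the stored label."""
    return sum(1 for _, label in test if label == model) / len(test)


# Invented toy data: (identifier, label) pairs.
toy = [("seq%d" % i, "hot" if i % 3 else "cold") for i in range(20)]
results = two_fold_validation(toy, fit_majority, accuracy)
```

The point of the design is that parameter tuning happens inside `fit`, so the evaluation half never influences the chosen parameters.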

Table 2.

Performance comparisons with the SVMs. The training data set was randomly divided into two data sets (data set 1 and data set 2) of approximately equal size. The performance was evaluated by two-fold validation.

Classifier	Test 1^a				Test 2^b
	Se (%)	Sp (%)	MCC	ACC (%)	Se (%)	Sp (%)	MCC	ACC (%)
RF	77.02	84.31	0.615	81.15	70.20	89.82	0.616	80.56
SVM	74.04	84.31	0.588	79.90	69.41	89.47	0.605	80.00

^a Test 1 used data set 1 for parameter tuning and training, and data set 2 for prediction performance evaluation.

^b Test 2 used data set 2 for parameter tuning and training, and data set 1 for prediction performance evaluation.

Applying the RF model to full genome analysis

To evaluate the sensitivity and specificity of the RF model in detecting hotspots and coldspots across the full genome, we trained the RF model on the training data set and tested it on the remaining 4185 sequences. The distribution of recombination rates of the predicted hot/cold spots at different RI values is shown in Figure 2. As the RI value increases, the recombination rates of the predicted hotspots tend to increase and those of the predicted coldspots tend to decrease. Predicted hotspots and coldspots with a higher RI value are therefore more likely to be ‘true’ hotspots or coldspots. RI thus acts as a regulating parameter that controls the trade-off between sensitivity and specificity. We set a cutoff of RI > 7. Out of the 4185 sequences, a total of 195 sequences were predicted as hotspots and 591 sequences were predicted as coldspots. Approximately 81.0% of the predicted hotspots had relative recombination ratios >1.09 and ∼80.0% of the predicted coldspots had relative recombination ratios <1.07.

Figure 2.

Box plots of recombination rates of the predicted hot/cold spots with different RI values. The median value is represented by a line within the rectangular box. The lower and upper edges of the rectangle represent the first and third quartiles, respectively. The circles and stars represent the ‘mild’ and ‘extreme’ outliers, respectively.

Since it would be surprising to find meiotic recombination hot/cold spots in mtDNA, the S. cerevisiae mitochondrial sequence can serve as a negative control for our method. We used the RF model to scan the S. cerevisiae mitochondrial DNA with a non-overlapping window (window size: 0.5 kb). The results showed that all RI values were ≤5 and ∼98.8% of the RI values were ≤3, which is consistent with current knowledge.

Web server

The prediction model is implemented as a web server named RF-DYMHC, available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast genome and prediction parameters (RI value and non-overlapping scan window size), the program breaks the input sequence into subsequences. Each subsequence constitutes a sample, and each sample is mapped into a 32-dimensional feature space reflecting the gap {0} and gap {1} dinucleotide compositions. The web server returns the predicted hotspots and coldspots and marks them in color. More details about the input and output formats are available at http://www.bioinf.seu.edu.cn/Recombination/Manual.htm
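The windowing and feature-mapping step might look like the following sketch. It is not the server's code: the function names are hypothetical, and dropping a trailing window shorter than the scan size, the AA..TT feature ordering and the skipping of non-ACGT symbols are all assumptions. The input below is a synthetic repeat, not a yeast sequence.

```python
from itertools import product

PAIRS = ["".join(p) for p in product("ACGT", repeat=2)]  # 16 pairs, AA .. TT


def non_overlapping_windows(seq, size):
    """Yield (start, subsequence); a short trailing piece is dropped (assumption)."""
    for start in range(0, len(seq) - size + 1, size):
        yield start, seq[start:start + size]


def composition(seq, k):
    """16 fractions of nucleotide pairs separated by k intervening bases."""
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # assumption: ignore pairs containing non-ACGT symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]


def feature_vector(window):
    """Concatenate gap {0} and gap {1} compositions into 32 features."""
    w = window.upper()
    return composition(w, 0) + composition(w, 1)


genome = "ACGT" * 300  # synthetic 1200-base input, for illustration only
windows = list(non_overlapping_windows(genome, 500))
features = [feature_vector(sub) for _, sub in windows]
```

Each entry of `features` would then be passed to the trained classifier, whose vote fractions determine the RI for that window.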

DISCUSSION

Detecting meiotic recombination hotspots and coldspots in eukaryotic genomes by computational techniques is a challenging problem. In this article, we have introduced an RF-based method to detect recombination hot/cold spots from the yeast genome. The OOB estimation of the prediction model indicated that the RF classifier achieved high prediction accuracy. The RF model was also compared with an alternative machine-learning algorithm, the SVM, and was found to outperform it in both sensitivity and specificity. We used the RF model to test the remaining 4185 sequences. The results indicated that the RI controlled the trade-off between sensitivity and specificity.

Although the prediction model was constructed as a two-class model, we also attempted to construct a three-class RF prediction model. We ranked the Gerton et al. data set (5266 sequences) by the median value of the seven microarrays. The top one-third of the sequences were marked as hotspots, the bottom one-third as coldspots and the rest as neutral sequences. The total accuracy of the OOB estimation was 51.22%, only 17.89 percentage points higher than that of a random classifier (33.33%). Approximately 65.60% of the mispredicted coldspots were falsely predicted as neutral, while ∼67.23% of the mispredicted neutral sequences were classified as coldspots. These results indicated that the three-class RF model failed to separate the coldspots from the neutral sequences.

Because the experimental identification of recombination hot/cold spots is time-consuming and costly, it is infeasible for large numbers of genomic sequences. Hence, detecting them efficiently and reliably by computational approaches is important. Further improvement of our model will focus on incorporating more attributes. Our prediction system will also be optimized as more experimentally validated data sets become available.

REFERENCES (10 of 26 in total)

1. Michael P H Stumpf; Gilean A T McVean. Estimating recombination rates from population-genetic data. Nat Rev Genet, 2003-12. (Review)

2. Manoj Bhasin; G P S Raghava. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res, 2004-07-01.

3. Xiaojing Yu; Jianping Cao; Yudong Cai; Tieliu Shi; Yixue Li. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol, 2005-11-07.

4. G Marais; D Mouchiroud; L Duret. Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc Natl Acad Sci U S A, 2001-04-24.

5. M Lichten; A S Goldman. Meiotic recombination hotspots. Annu Rev Genet, 1995. (Review)

6. J Perry; A Ashworth. Evolutionary rate of a gene affected by chromosomal position. Curr Biol, 1999-09-09.

7. J M Cherry; C Ball; S Weng; G Juvik; R Schmidt; C Adler; B Dunn; S Dwight; L Riles; R K Mortimer; D Botstein. Genetic and physical maps of Saccharomyces cerevisiae. Nature, 1997-05-29.

8. Gabriel Marais; Gwenaël Piganeau. Hill-Robertson interference is a minor determinant of variations in codon bias across Drosophila melanogaster and Caenorhabditis elegans genomes. Mol Biol Evol, 2002-09.

9. Richard M Kliman; Naheelah Irving; Maria Santiago. Selection conflicts, gene expression, and codon usage trends in yeast. J Mol Evol, 2003-07.

10. Tong Zhou; Jianhong Weng; Xiao Sun; Zuhong Lu. Support vector machine for classification of meiotic recombination hotspots and coldspots in Saccharomyces cerevisiae based on codon composition. BMC Bioinformatics, 2006-04-26.
