iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC.

Hui Yang^1, Wang-Ren Qiu^1,2, Guoqing Liu^3, Feng-Biao Guo^1, Wei Chen^1,4,5, Kuo-Chen Chou^1,5, Hao Lin^1,5.

Abstract

Meiotic recombination is caused by meiotic double-strand DNA breaks. In some regions the frequency of DNA recombination is relatively high, while in others it is low: the former are usually called "recombination hotspots" and the latter "recombination coldspots". Information about hot and cold spots may provide important clues for understanding the mechanism of genome evolution. Therefore, it is important to predict these spots accurately. In this study, we rebuilt the benchmark dataset by unifying its samples to the same length (131 bp). On this foundation, and using an SVM (Support Vector Machine) classifier, a new predictor called "iRSpot-Pse6NC" was developed by incorporating the key hexamer features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. Rigorous cross-validation showed that the proposed predictor is superior to its counterparts in overall accuracy, stability, sensitivity and specificity. For the convenience of most experimental scientists, a web-server for iRSpot-Pse6NC has been established at http://lin-group.cn/server/iRSpot-Pse6NC, by which users can easily obtain their desired results without needing to work through the detailed mathematical equations involved.

Keywords: 5-step rules; Key hexamers; PseKNC; Recombination spot; SVM; Webserver

Year: 2018 PMID: 29989083 PMCID: PMC6036749 DOI: 10.7150/ijbs.24616

Source DB: PubMed Journal: Int J Biol Sci ISSN: 1449-2288 Impact factor: 6.580

Introduction

Meiotic recombination, which is caused by meiotic double-strand DNA breaks (DSBs) 1 (Figure ), occurs at each generation in diploid organisms. Meiosis not only guarantees the stability of a species' chromosome number but also provides an evolutionary mechanism for adapting to environmental changes 2. Recombination can lead to a change in genetic information between homologous chromosomes. Thus, it is one of the main driving forces in genome evolution. Regions where the frequency of DNA recombination is relatively high are referred to as recombination hotspots, while regions where it is low are referred to as recombination coldspots 3-5. There have been many in-depth studies of recombination sites 3; 6-9. Gerton et al. 3 mapped double-strand break sites on the chromosomes of Saccharomyces cerevisiae (S. cerevisiae) and found that hotspots were non-randomly associated with regions of high GC base composition, while coldspots were non-randomly associated with the centromeres and telomeres. Some hotspots that require transcription factor binding are called α hotspots, and others are called β hotspots 3. Recently, there have been new developments in research on recombination sites. ChIP experiments showed that substantial Spo11 persists at Rec8 binding sites during DSB formation 10; PRDM9, a histone H3K4 trimethyltransferase, is involved in the initiation of recombination at recombination hotspots 11; and regions with high nucleosome occupancy were found to have a high recombination rate in the yeast genome 12. The correct identification of recombination spots can provide important clues for understanding the mechanism of evolution. Generally, biochemical experiments can produce accurate information for determining recombination spots.
However, with the development of high-throughput sequencing techniques, more and more genome data are being generated, so determining recombination spots by wet-lab experiments requires increasingly expensive materials and long experimental periods. Machine learning-based methods are a good choice for identifying recombination spots in a timely and accurate manner. Up to now, several methods have been developed to identify recombination spots. Jiang et al. first developed a model based on gapped dinucleotide composition and random forest (RF) to predict meiotic recombination hotspots and coldspots in S. cerevisiae 13. In the meantime, Zhou et al. established an SVM-based model to discriminate hotspots from coldspots in S. cerevisiae by using codon composition 14. Subsequently, Liu et al. proposed using the increment of diversity combined with quadratic discriminant analysis for predicting recombination spots 15. Chen et al. developed a new DNA sample descriptor called pseudo dinucleotide composition (PseDNC) to improve prediction accuracy for recombination hotspots and coldspots 16. Following the concept of PseDNC, Li et al. 17 and Qiu et al. 18 also developed different prediction models to address this problem. Liu et al. incorporated feature weights into a recombination hotspot prediction model 19. A predictor called iRSpot-DACC was also presented to predict recombination hotspots and coldspots 20. Recently, the same problem was further investigated by including the Z curve approach 21 and the ensemble learning approach 22. Although the aforementioned methods achieved quite encouraging results, further studies are needed for the following reasons. (i) The DNA samples used to train the models have different lengths, which prevents them from yielding a widely useful model because users do not know what working length should be used for a query DNA sequence.
For example, when using the aforementioned methods to scan a chromosome, we do not know the optimal width of the scan window 23 for the biological sequence concerned. In fact, the published webservers based on those methods give only a single prediction even for a chromosome with a length of thousands of base pairs, although there are many recombination points in the genome. Therefore, most of those models are quite limited in practical applications. (ii) Some works 13; 14; 21; 24 used codon composition or coding region information to formulate DNA samples. However, recombination spots are not always located in coding regions; some non-coding regions may also contain them. Thus, these methods cannot identify recombination spots in intergenic regions. (iii) The prediction results are still far from satisfactory; the accuracy should be further improved. (iv) Only three webservers have been published. For the convenience of most experimental scientists, more user-friendly webservers are needed. The present study was devoted to developing a more powerful predictor in this area by addressing the aforementioned four issues. To make the development of the new predictor logically clearer and its application more useful in practice, we followed Chou's 5-step rules 25, as done in a series of recent studies (see, e.g., 26-35).

Materials and Methods

Benchmark dataset: hot/cold spots DNA sequences

According to Chou's 5-step rules, the first prerequisite for establishing an effective predictor for a biological system is to construct or select a high-quality benchmark dataset. In this study, the raw data were derived from Gerton et al. 3, who used DNA microarrays as a single-gene resolution method to estimate DSB formation adjacent to each ORF at S. cerevisiae loci. They measured the ratio of DSB-rich probes hybridized to total genomic probes. Based on the experimental data, Jiang et al. 13 constructed a benchmark dataset including 490 recombination hotspots and 591 coldspots. So far, most of the existing models 13-20 have been built on this benchmark dataset. The length distribution of the original samples is shown in Figure . The lengths are distributed over a wide range, from 131 bp for the shortest sample to thousands of bp for the longest. To overcome this shortcoming, we rebuilt the benchmark dataset according to the observation that recombination hotspots are correlated with peaks of G+C base composition 3. We unified the length of each sample to 131 bp because the shortest sequence is 131 bp long. For sequences longer than 131 bp, we chose the 131-bp subsequence with the maximum GC content. As a result, the new dataset also has 490 recombination hotspot samples and 591 recombination coldspot samples, but all the sequences are now 131 bp long. The new benchmark dataset can be downloaded from http://lin-group.cn/server/iRSpot-Pse6NC.
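The trimming step can be illustrated with a minimal Python sketch. The function name `max_gc_window` is hypothetical, and the tie-breaking rule (keep the leftmost window when several share the maximum G+C count) is an assumption, since the text does not say how ties were resolved.

```python
def max_gc_window(seq: str, width: int = 131) -> str:
    """Return the leftmost width-bp subsequence of seq with the highest G+C count."""
    seq = seq.upper()
    if len(seq) < width:
        raise ValueError("sequence is shorter than the window width")
    # 1 for G or C, 0 for any other symbol
    is_gc = [1 if base in "GC" else 0 for base in seq]
    current = sum(is_gc[:width])  # G+C count of the first window
    best_count, best_start = current, 0
    for start in range(1, len(seq) - width + 1):
        # Slide the window one base: add the entering base, drop the leaving one.
        current += is_gc[start + width - 1] - is_gc[start - 1]
        if current > best_count:  # strict '>' keeps the leftmost maximum (assumption)
            best_count, best_start = current, start
    return seq[best_start:best_start + width]
```

A sequence that is exactly 131 bp long is returned unchanged, which matches the treatment of the shortest sample described above.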

Hexamer composition and its PseKNC vector

Translating a DNA sequence with L bases into a vector is the second important step in developing a predictor for discriminating recombination hotspots from coldspots. This is because all the existing machine-learning algorithms can handle only vectors, not sequences, as elaborated in 36. However, a vector in a discrete framework might lose all the sequence-order or pattern information. To deal with this problem, PseAAC (Pseudo Amino Acid Composition) was introduced 37. Since the concept of PseAAC was proposed, it has rapidly spread into many areas of biomedicine and drug development 38; 39 as well as nearly all areas of computational proteomics (see, e.g., 40-48 and the long list of references cited in a recent review paper 49). Encouraged by the success of PseAAC in dealing with protein/peptide sequences, the idea has been extended to DNA/RNA sequences 16; 22; 24; 32; 50 in computational genomics via PseKNC (Pseudo K-tuple Nucleotide Composition) 51; 52. According to 53, for a DNA sample with L nucleic acid residues

$$\mathbf{D} = N_1 N_2 N_3 \cdots N_L, \tag{1}$$

its general form of PseKNC can be formulated as

$$\mathbf{D} = [\phi_1 \; \phi_2 \; \cdots \; \phi_u \; \cdots \; \phi_Z]^{\mathbf{T}}, \tag{2}$$

where T is the transposing operator, the subscript Z is an integer, and its value and the components $\phi_u$ (u = 1, 2, ..., Z) depend on how the desired features and properties are extracted from the DNA sequence. In this study, their definitions are described below. K-tuple (also called K-mer) nucleotide composition has important biological significance 54: the DNA sequence can largely be determined from its K-tuple nucleotide frequency distribution; i.e., this distribution contains most of the information of the DNA sequence. K-mer nucleotide composition has been widely used in gene identification 55 and in the recognition of other regulatory elements 24; 56-59. Several studies 60; 61 have shown that the hexamer (6-mer) distribution has unique properties among species and among different DNA fragments. Thus, the dimension of the PseKNC vector in Eq.2 is

$$Z = 4^6 = 4096, \tag{3}$$

and its components are given by

$$\phi_u = \frac{n_u}{L - 5} \quad (u = 1, 2, \cdots, 4096), \tag{4}$$

where $n_u$ and L denote the number of occurrences of the u-th hexamer and the length of the sample sequence, respectively (a sequence of length L contains L - 5 overlapping hexamers). Thus, the DNA sample is uniquely defined by a 4096-D PseKNC vector.
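As an illustration of the hexamer encoding described above, the following Python sketch maps a DNA sequence to a 4096-dimensional frequency vector. The names `HEXAMERS` and `hexamer_composition` are hypothetical; the lexicographic ordering of hexamers over A, C, G, T and the normalization by the number of overlapping windows (L - 5) are assumptions rather than details confirmed by the text. Windows containing symbols other than A, C, G, T are simply skipped.

```python
from itertools import product

K = 6
# All 4**6 = 4096 hexamers in lexicographic order over A, C, G, T (ordering is an assumption).
HEXAMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {hexamer: i for i, hexamer in enumerate(HEXAMERS)}


def hexamer_composition(seq: str) -> list:
    """Return the 4096-D vector of overlapping hexamer frequencies of seq."""
    seq = seq.upper()
    windows = len(seq) - K + 1  # number of overlapping hexamers, L - 5
    if windows < 1:
        raise ValueError("sequence is shorter than one hexamer")
    counts = [0] * len(HEXAMERS)
    for i in range(windows):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:  # skip windows with ambiguous bases such as N
            counts[idx] += 1
    return [c / windows for c in counts]
```

For a 131-bp benchmark sample the vector is built from 126 overlapping windows, so most of the 4096 components are zero for any single sequence.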

The rule for ranking features

The DNA sequence is represented by a set of 4096 features, which may cause three problems 62-63: (i) inclusion of redundant or irrelevant information; (ii) over-fitting of the model and reduced flexibility; (iii) the curse of dimensionality and computational difficulty. These problems can be alleviated by feature selection 64. Many effective feature selection techniques have been proposed, such as diffusion maps 65, principal component analysis (PCA) 66-68, analysis of variance (ANOVA) 69; 70, the recursive feature elimination algorithm 71; 72, and geometry preserving projections (GPP) 73. These techniques are all quite efficient in alleviating the interference from noise or irrelevant features so as to improve the prediction quality. In this study, the hexamers were ranked by the binomial distribution approach. Here, let us define a prior probability given by

$$q_i = \frac{m_i}{M}, \tag{5}$$

where M is the total number of occurrences of all hexamers in the benchmark dataset (including both positive and negative samples), and $m_i$ represents the number of hexamer occurrences in the i-th type, with i = 1 referring to the positive subset and i = 2 to the negative subset. Now, the probability of the j-th hexamer occurring in type i can be formulated as

$$P_{ij} = \sum_{m=n_{ij}}^{N_j} \frac{N_j!}{m!\,(N_j - m)!}\, q_i^{m} (1 - q_i)^{N_j - m}, \tag{6}$$

where $N_j$ represents the total number of occurrences of the j-th hexamer in the benchmark dataset and $n_{ij}$ the number of its occurrences in type i. The smaller $P_{ij}$ is, the lower the probability that the j-th hexamer occurs in type i by chance, meaning that the hexamer has more biological significance. The confidence level (CL) of the j-th hexamer occurring in the i-th type of sample is defined by

$$CL_{ij} = 1 - P_{ij}. \tag{7}$$

Suppose

$$CL_j = \max(CL_{1j}, CL_{2j}); \tag{8}$$

thus the 4096 hexamers can be ranked according to the values of Eq.8.
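The ranking rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the probability is taken to be the upper binomial tail (the chance of seeing at least the observed number of occurrences in a type), and each hexamer is scored by the larger of its two confidence levels. The binomial terms are built in log space with `math.lgamma` so that large occurrence counts do not overflow.

```python
from math import exp, lgamma, log


def binomial_tail(n_ij: int, N_j: int, q_i: float) -> float:
    """P(X >= n_ij) for X ~ Binomial(N_j, q_i), with terms built in log space."""
    total = 0.0
    for m in range(n_ij, N_j + 1):
        log_term = (lgamma(N_j + 1) - lgamma(m + 1) - lgamma(N_j - m + 1)
                    + m * log(q_i) + (N_j - m) * log(1.0 - q_i))
        total += exp(log_term)
    return min(total, 1.0)  # guard against rounding slightly above 1


def rank_hexamers(pos_counts: dict, neg_counts: dict) -> list:
    """Rank hexamers by confidence level, highest first.

    pos_counts / neg_counts map each hexamer to its total number of occurrences
    in the positive / negative subset; both subsets must be non-empty.
    """
    m = (sum(pos_counts.values()), sum(neg_counts.values()))
    M = m[0] + m[1]
    q = (m[0] / M, m[1] / M)  # prior probability of each type
    cl = {}
    for hexamer in set(pos_counts) | set(neg_counts):
        n = (pos_counts.get(hexamer, 0), neg_counts.get(hexamer, 0))
        N_j = n[0] + n[1]
        # Confidence level per type is 1 - tail probability; keep the larger one.
        cl[hexamer] = max(1.0 - binomial_tail(n[i], N_j, q[i]) for i in (0, 1))
    return sorted(cl.items(), key=lambda item: item[1], reverse=True)
```

A hexamer that occurs equally often in both subsets receives a low confidence level, while one concentrated in either subset moves to the top of the ranking.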

Support vector machine

Support vector machine (SVM) is a supervised machine learning algorithm based on statistical learning theory and has been successfully applied in bioinformatics 74. The basic idea of SVM is to transform the data into a high-dimensional feature space and then determine the optimal separating hyperplane. For a brief formulation of SVM and how it works, see 75; 76; for more details, see the monograph 77. In this study, we used the free software LIBSVM 3.20, developed by Chang and Lin 78. Because of its good classification performance, the radial basis kernel function was used to obtain the best classification hyperplane. The two parameters, C and γ, were preliminarily optimized through a grid search strategy. The predictor thus built is called iRSpot-Pse6NC, where "i" stands for "identify", "RSpot" for "Recombination Spots", and "Pse6NC" for "Pseudo 6-tuple Nucleotide Composition".
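To make the two tuned parameters concrete, the sketch below writes out the radial basis kernel, K(x, y) = exp(-γ‖x - y‖²), and enumerates a grid of candidate (C, γ) pairs. The search ranges are not stated in the text; the powers of two used here follow a commonly used default for LIBSVM grid searches and are an assumption, as are the function names.

```python
from itertools import product
from math import exp


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def parameter_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs with C = 2**c and gamma = 2**g."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]


# Assumed ranges: C in 2^-5 ... 2^15 and gamma in 2^-15 ... 2^3, both in steps of 2^2.
grid = parameter_grid(range(-5, 16, 2), range(-15, 4, 2))
```

In a grid search, each pair in `grid` would be scored by cross-validation with an SVM trainer such as LIBSVM, and the pair with the highest accuracy would be kept.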

Results and Discussion

Cross-validation

To evaluate the quality of a new predictor, one needs to consider two things: (i) what metrics should be used to measure its performance? (ii) what test method should be adopted to calculate these metrics? In the literature, the following four metrics are usually used to measure a predictor's quality 79: (i) overall accuracy (Acc); (ii) stability (MCC); (iii) sensitivity (Sn); and (iv) specificity (Sp). However, their conventional expressions taken directly from math books lack intuition and are difficult for most biological scientists to understand. Fortunately, by means of the symbols introduced by Chou in studying signal peptides 23, the four conventional metrics can be converted to a set of intuitive ones 16; 80; 81 as given below:

$$\begin{cases}
Sn = 1 - \dfrac{N_-^+}{N^+}, & 0 \le Sn \le 1 \\[2ex]
Sp = 1 - \dfrac{N_+^-}{N^-}, & 0 \le Sp \le 1 \\[2ex]
Acc = 1 - \dfrac{N_-^+ + N_+^-}{N^+ + N^-}, & 0 \le Acc \le 1 \\[2ex]
MCC = \dfrac{1 - \left( \dfrac{N_-^+}{N^+} + \dfrac{N_+^-}{N^-} \right)}{\sqrt{\left( 1 + \dfrac{N_+^- - N_-^+}{N^+} \right) \left( 1 + \dfrac{N_-^+ - N_+^-}{N^-} \right)}}, & -1 \le MCC \le 1
\end{cases} \tag{9}$$

where $N^+$ represents the total number of positive samples investigated, $N_-^+$ the number of positive samples incorrectly predicted as negative, $N^-$ the total number of negative samples investigated, and $N_+^-$ the number of negative samples incorrectly predicted as positive. As pointed out by many recent publications (see, e.g., 22; 32; 33; 50; 82-90), the meanings of Sn, Sp, Acc, and MCC become much clearer when using Eq.9. With a set of intuitive metrics, the next question is how to test their values. As is well known, the independent dataset test, the subsampling (or K-fold cross-validation) test, and the jackknife test are the three cross-validation methods widely used for testing a prediction method 91. To reduce the computational cost, in this study we adopted 5-fold cross-validation (namely K = 5), as done by many investigators with SVM as the prediction engine (see, e.g., 24; 26; 92-95).
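The four metrics can be computed directly from the four counts described above. The following Python sketch uses hypothetical argument names (`n_pos`, `n_neg`, `false_neg`, `false_pos`) for the total positives, total negatives, positives predicted as negative, and negatives predicted as positive; it assumes both classes are non-empty and that the predictor does not assign every sample to a single class, since MCC is undefined in that case.

```python
from math import sqrt


def chou_metrics(n_pos: int, n_neg: int, false_neg: int, false_pos: int) -> dict:
    """Sn, Sp, Acc and MCC from class totals and misclassification counts."""
    sn = 1 - false_neg / n_pos                        # sensitivity
    sp = 1 - false_pos / n_neg                        # specificity
    acc = 1 - (false_neg + false_pos) / (n_pos + n_neg)
    # Denominator is zero only when all samples are predicted as one class.
    denom = sqrt((1 + (false_pos - false_neg) / n_pos)
                 * (1 + (false_neg - false_pos) / n_neg))
    mcc = (1 - (false_neg / n_pos + false_pos / n_neg)) / denom
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

For an illustrative (not experimental) case of 100 positive and 100 negative samples with 20 false negatives and 10 false positives, the function gives Sn = 0.8, Sp = 0.9, Acc = 0.85 and MCC ≈ 0.7035, the same MCC as the conventional formula based on true and false positives and negatives.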

Comparison with existing methods

Table 1 lists the metric rates (Eq.9) achieved by iRSpot-Pse6NC via 5-fold cross-validation on the benchmark dataset. To facilitate comparison, it also lists the corresponding rates obtained by iRSpot-PseDNC 16, iRSpot-KNCPseAAC 18, and IDQD 15 using exactly the same cross-validation method and the same benchmark dataset. As the table shows, the rates achieved by iRSpot-Pse6NC are higher than those of its counterparts in all four metrics (for example, an Acc of 0.8408 versus 0.7792 for iRSpot-PseDNC), indicating that the proposed predictor is superior to the existing predictors in this area.

Feature analysis

As mentioned in Materials and Methods, the dimension of the hexamer vector is 4096, which is large enough to cause high-dimension problems. To exclude noisy and redundant features, we used incremental feature selection (IFS) to find the feature subset that maximizes accuracy. First, we ranked the 4096 hexamers according to Eqs.5-8. Second, 4096 feature subsets were constructed: the first subset contained the top-ranked hexamer, the second was produced by adding the second-ranked hexamer to the first subset, and so on. Third, the SVM with 5-fold cross-validation was used to examine the accuracies of the 4096 feature subsets. With Acc as the vertical coordinate and the number of features as the horizontal coordinate, we plotted the IFS curve in Figure . The peak of the curve is 84.08%, located at a horizontal coordinate of 381. This result (84.08%) is dramatically higher than that obtained with all features (71.04%). Meanwhile, the number of features considered was reduced from 4096 to 381, indicating that the proposed feature selection technique can pick out the optimal hexamers and thereby improve the prediction quality. Accordingly, the 381 hexamers were selected to form the optimal feature subset used to train the prediction model. To further investigate the performance of the optimal model across the entire range of SVM decision values, we drew the ROC curve 96 in Figure . The AUC (Area Under the ROC Curve) reaches 0.9084, indicating that the proposed method is quite promising and has the potential to become a useful high-throughput tool for predicting recombination spots. To further analyze the contributions of different features in the prediction model, a heat map 97 is provided (Figure ), which is a graphical representation of a matrix colored according to its CL values scaled between 0 and 1. As can be seen from Figure , most of the 4096 hexamers are blue or green, indicating that most of them are irrelevant to recombination spot recognition. The figure also shows that hexamers with high GC content, e.g., CGCCGG, AGCCGG, GCAGCT, GCCGGA and AGTGGG, have the top five CL values among all the features, with a confidence level of CL > 98.3%. Moreover, we performed a detailed analysis of the 381 optimal hexamers with CL > 98.3% to investigate the relationship between the features and GC content (Figure ). In this figure, the abscissa denotes the GC content from 0% to 100%, and the vertical axis indicates the percentage of positive and negative samples at the GC content shown on the abscissa. The optimal hexamers with high GC content have a higher proportion in positive samples, whereas hexamers with lower GC content have a higher proportion in negative samples. This means that there is a close relationship between GC content and hotspots, which once again supports the validity of the way we handled the data.
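The incremental feature selection procedure described above can be sketched generically in Python. The function name `incremental_feature_selection` and the `evaluate` callback are hypothetical; in the study the score of each subset was the 5-fold cross-validated accuracy of an SVM, which the caller would supply as `evaluate`. Keeping the smallest subset when several reach the same best score is an assumption.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score nested subsets of a ranked feature list and keep the best one.

    ranked_features: features ordered from most to least informative.
    evaluate: callable taking a list of features and returning a score
              (for example, cross-validated accuracy).
    Returns (best_subset, best_score, curve), where curve holds
    (number_of_features, score) pairs for plotting the IFS curve.
    """
    best_subset, best_score = [], float("-inf")
    curve = []
    for k in range(1, len(ranked_features) + 1):
        subset = list(ranked_features[:k])  # top-k features
        score = evaluate(subset)
        curve.append((k, score))
        if score > best_score:  # strict '>' keeps the smallest best subset (assumption)
            best_subset, best_score = subset, score
    return best_subset, best_score, curve
```

With 4096 ranked hexamers this loop trains and validates 4096 models, which is why a fast scoring scheme such as 5-fold cross-validation is used instead of a jackknife test.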

Web-server and user guide

As pointed out in 25 and demonstrated in many follow-up publications (see, e.g., 28; 30; 32; 35; 81; 98-116), user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors. Indeed, a new prediction method accompanied by a user-friendly web-server would have significantly greater impact 36; 49. In view of this, a web-server for iRSpot-Pse6NC has been established. Furthermore, for the convenience of most experimental scientists, step-by-step instructions are given below.
Step 1. Open the web server at http://lin-group.cn/server/iRSpot-Pse6NC and you will see the top page of iRSpot-Pse6NC on your computer screen (Figure ).
Step 2. Click on the WEB SERVER button to start the prediction. Either type or copy/paste the query DNA sequences into the input box at the center of Figure . The input sequences should be in FASTA format. Then click on the Submit button to see the predicted result.
Step 3. Click on the DOWNLOAD button to download the benchmark datasets used to train and test the iRSpot-Pse6NC predictor.
Step 4. Click on the CITATION button to find the relevant papers that document the detailed development and algorithm of iRSpot-Pse6NC.
Step 5. Click on the HELP button to view the relevant instructions and the caveats for using the server.

Table 1

A comparison of the proposed predictor with the existing ones.

Method	Sn^a	Sp^a	Acc^a	MCC^a
iRSpot-Pse6NC^b	0.7571	0.9103	0.8408	0.6805
iRSpot-PseDNC^c	0.6234	0.9052	0.7792	0.5585
iRSpot-KNCPseAAC^d	0.6102	0.8951	0.7660	0.5334
IDQD^e	0.6959	0.7509	0.7259	0.4469

^a See Eq.9 for the metric definitions.
^b Proposed in this paper.
^c From 16.
^d From 18.
^e From 15.
