
Recombination Hotspot/Coldspot Identification Combining Three Different Pseudocomponents via an Ensemble Learning Approach.

Bingquan Liu¹, Yumeng Liu², Dong Huang³.

Abstract

Recombination presents a nonuniform distribution across the genome. Genomic regions that present relatively higher frequencies of recombination are called hotspots while those with relatively lower frequencies of recombination are recombination coldspots. Therefore, the identification of hotspots/coldspots could provide useful information for the study of the mechanism of recombination. In this study, a new computational predictor called SVM-EL was proposed to identify hotspots/coldspots across the yeast genome. It combined Support Vector Machines (SVMs) and Ensemble Learning (EL) based on three features including basic kmer (Kmer), dinucleotide-based auto-cross covariance (DACC), and pseudo dinucleotide composition (PseDNC). These features are able to incorporate the nucleic acid composition and their order information into the predictor. The proposed SVM-EL achieves an accuracy of 82.89% on a widely used benchmark dataset, which outperforms some related methods.


Year: 2016 PMID: 27648451 PMCID: PMC5015011 DOI: 10.1155/2016/8527435

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

Meiotic recombination is the exchange of alleles between homologous chromosomes during meiosis [1]. It provides material for natural selection by producing diverse gametes, and it may also contribute to genome evolution via gene conversion or mutagenesis [2-4]. Although the exact locations where recombination occurs and its mechanism are still unclear, it is well established that recombination plays an important role in promoting genome evolution. Several studies of chromosomes [5-7] have found that recombination is distributed nonuniformly across the genome. Genomic regions with relatively high recombination frequencies are called hotspots, while those with relatively low frequencies are called coldspots [8, 9]. With the number of sequenced genomes growing explosively, more reliable methods for identifying recombination spots are urgently needed. Although much information can be acquired from experiments, predicting recombination hotspots and coldspots remains a challenging task.

Recently, several computational models have been presented to identify recombination hotspots/coldspots. For example, Liu et al. [10] proposed a model based on sequence kmer frequencies that combines the increment of diversity with quadratic discriminant analysis (IDQD). This method was later improved by adding gaps into the kmers [11]. Chen et al. presented a predictor called iRSpot-PseDNC, trained with pseudo dinucleotide composition features [12]. These methods extract features from DNA sequences from different aspects. For example, the model based on oligonucleotide frequencies considers nucleic acid composition information, whereas iRSpot-PseDNC incorporates both the local nucleic acid composition information and the global information of the DNA sequences.
Therefore, it is reasonable to combine these complementary predictors to further improve the performance of recombination hotspot/coldspot identification. In this regard, three basic predictors trained with basic kmer (Kmer) [13], dinucleotide-based auto-cross covariance (DACC) [14, 15], and pseudo dinucleotide composition (PseDNC) [16], respectively, were combined via the framework of ensemble learning approach, and a novel predictor called SVM-EL was proposed. All these features can be easily generated by a recently proposed tool called Pse-in-One [17], which is able to generate various features only based on the DNA, RNA, or protein sequence information.

2. Materials and Methods

2.1. Benchmark Dataset

The benchmark dataset S was obtained from Liu et al. [10]:

$$S = S^{+} \cup S^{-} \tag{1}$$

where the subset $S^{+}$ contains 490 recombination hotspots, the subset $S^{-}$ contains 591 recombination coldspots, and the symbol ∪ represents the "union" in set theory.

2.2. Feature Vectors Generated by Pse-in-One

SVM-EL is developed by combining the outcomes of three individual predictors trained on different features: basic kmer (Kmer) [13], dinucleotide-based auto-cross covariance (DACC) [14, 15], and pseudo dinucleotide composition (PseDNC). These basic features can be generated with Pse-in-One [17], which provides two approaches to generate feature vectors. One is the web server (http://bioinformatics.hitsz.edu.cn/Pse-in-One/) and the other is the stand-alone tool (http://bioinformatics.hitsz.edu.cn/Pse-in-One/download/). Suppose a DNA sequence D is

$$D = R_1 R_2 R_3 \cdots R_L \tag{2}$$

where L represents the DNA sequence length and $R_i$ (i = 1, 2, ⋯, L) is the nucleic acid at position i. The three basic features used in the current study can then be described as follows.

2.2.1. Kmer

Kmer [13] is an approach representing DNA sequences by the occurrence frequencies of kmers. The Kmer contains the local sequence-order information and it can be generated with the help of Pse-in-One by the following steps. For web server approach, firstly, choose DNA sequences (PseDAC-General), then select Kmer in the tab of Mode, and set the value of k. Secondly, input or upload the DNA sequence file in FASTA format, click the Submit button, and then you will see the results and you can download them as a text file (Figure 1).

Figure 1

An example of the kmer features' generation by using Pse-in-One.

For the stand-alone approach, Kmer features can be generated with the following command line:

    ./kmer.py -f svm -l +1 3 DNA

where -f svm sets the output file format to the LIBSVM training data format, -l +1 indicates that the input file contains positive samples only, k equals 3, and the sequence type is DNA.
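The kmer representation described above can be illustrated with a short Python sketch. It is not the Pse-in-One implementation: the function name `kmer_frequencies`, the lexicographic ACGT ordering of the features, the normalization by the number of counted windows, and the skipping of windows that contain non-ACGT characters are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return the normalized frequencies of all 4**k kmers (ACGT order).

    Hypothetical helper for illustration; Pse-in-One may order or
    normalize its Kmer features differently.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


vec = kmer_frequencies("ACGTACGT", 2)
print(len(vec))  # one feature per possible dinucleotide
```

With k = 6, the value used for SVM-Kmer in this study, such a vector would have 4^6 = 4096 entries.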

2.2.2. Dinucleotide-Based Auto-Cross Covariance (DACC)

Dinucleotide-based auto-cross covariance (DACC) [14, 15] is the combination of DAC [14, 15, 19] and DCC [14, 15]. DAC measures the correlation between two dinucleotides for one DNA property [17], whereas DCC measures the correlation between two dinucleotides for two different properties [17]. Given a DNA sequence D represented as (2), the DAC feature can be calculated as [17]

$$\mathrm{DAC}(\mu, \mathrm{lag}) = \sum_{i=1}^{L-\mathrm{lag}-1} \frac{\left(P_{\mu}(R_i R_{i+1}) - \bar{P}_{\mu}\right)\left(P_{\mu}(R_{i+\mathrm{lag}} R_{i+\mathrm{lag}+1}) - \bar{P}_{\mu}\right)}{L-\mathrm{lag}-1} \tag{3}$$

where μ is the dinucleotide property index; L is the length of the DNA sequence; lag represents the distance between two dinucleotides; $P_{\mu}(R_i R_{i+1})$ represents the value of dinucleotide $R_i R_{i+1}$ at position i for the dinucleotide property index μ; and $\bar{P}_{\mu} = \sum_{j=1}^{L-1} P_{\mu}(R_j R_{j+1})/(L-1)$ represents the average value of $P_{\mu}(R_i R_{i+1})$ for a DNA sequence.

Given a DNA sequence D represented as (2), the DCC feature can be calculated as [17]

$$\mathrm{DCC}(\mu_1, \mu_2, \mathrm{lag}) = \sum_{i=1}^{L-\mathrm{lag}-1} \frac{\left(P_{\mu_1}(R_i R_{i+1}) - \bar{P}_{\mu_1}\right)\left(P_{\mu_2}(R_{i+\mathrm{lag}} R_{i+\mathrm{lag}+1}) - \bar{P}_{\mu_2}\right)}{L-\mathrm{lag}-1} \tag{4}$$

where $\mu_1$ and $\mu_2$ are two different dinucleotide property indices; L is the DNA sequence length; lag is the distance between two dinucleotides; $P_{\mu_1}(R_i R_{i+1})$ ($P_{\mu_2}(R_i R_{i+1})$) represents the value of dinucleotide $R_i R_{i+1}$ at position i for the dinucleotide property index $\mu_1$ ($\mu_2$); and $\bar{P}_{\mu_1}$ ($\bar{P}_{\mu_2}$) represents the average value of $P_{\mu_1}(R_i R_{i+1})$ ($P_{\mu_2}(R_i R_{i+1})$) for a DNA sequence.

The DACC features contain global sequence-order information and can be generated via Pse-in-One [17] using either of its two generation approaches. For the web server approach, firstly, choose the DNA sequences (PseDAC-General) option, then select DACC in the tab of Mode, and set the value of lag. Secondly, upload a user-defined physicochemical index file called user_property; the values of the fifteen dinucleotide physicochemical properties are shown in Table 1. Finally, input or upload the DNA sequence file in FASTA format and click the Submit button; the results are then displayed and can be downloaded as a text file (Figure 2).

Table 1

The values of fifteen DNA dinucleotide properties.

	AA/TT	AC/GT	AG/CT	AT	CA/TG	CC/GG	CG	GA/TC	GC	TA
F-roll	0.04	0.06	0.04	0.05	0.04	0.04	0.04	0.05	0.05	0.03
F-tilt	0.08	0.07	0.06	0.10	0.06	0.06	0.06	0.07	0.07	0.07
F-twist	0.07	0.06	0.05	0.07	0.05	0.06	0.05	0.06	0.06	0.05
F-slide	6.69	6.80	3.47	9.61	2.00	2.99	2.71	4.27	4.21	1.85
F-shift	6.24	2.91	2.80	4.66	2.88	2.67	3.02	3.58	2.66	4.11
F-rise	21.34	21.98	17.48	24.79	14.51	14.25	14.66	18.41	17.31	14.24
Roll	1.05	2.01	3.60	0.61	5.60	4.68	6.02	2.44	1.70	3.50
Tilt	−1.26	0.33	−1.66	0.00	0.14	−0.77	0.00	1.44	0.00	0.00
Twist	35.02	31.53	32.29	30.72	35.43	33.54	33.67	35.67	34.07	36.94
Slide	−0.18	−0.59	−0.22	−0.68	0.48	−0.17	0.44	−0.05	−0.19	0.04
Shift	0.01	−0.02	−0.02	0.00	0.01	0.03	0.00	−0.01	0.00	0.00
Rise	3.25	3.24	3.32	3.21	3.37	3.36	3.29	3.30	3.27	3.39
Energy	−1.00	−1.44	−1.28	−0.88	−1.45	−1.84	−2.17	−1.30	−2.24	−0.58
Enthalpy	−7.60	−8.40	−7.80	−7.20	−8.50	−8.00	−10.60	−8.20	−9.80	−7.20
Entropy	−21.30	−22.40	−21.00	−20.40	−22.70	−19.90	−27.20	−22.20	−24.40	−21.30

Figure 2

An example of the DACC features' generation by using Pse-in-One.

For the stand-alone approach, DACC features can be generated with the following command line:

    ./acc.py -e user_property -f svm -l +1 3 DNA DACC

where -e user_property specifies the user-defined physicochemical index file, -f svm and -l +1 have the same meaning as in the Kmer command line, the parameter lag equals 3, the sequence type is DNA, and the method used is DACC.
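The DAC and DCC definitions of this section can be sketched in plain Python as follows. This is an illustrative sketch rather than the acc.py script: the helper names (`expand`, `dac`, `dcc`, `dacc`), the use of only the Twist and Roll rows of Table 1, the use of raw (unnormalized) property values, and the ordering of the output features are assumptions.

```python
# Column layout of Table 1: reverse-complementary dinucleotides share a value.
COLUMNS = ["AA/TT", "AC/GT", "AG/CT", "AT", "CA/TG",
           "CC/GG", "CG", "GA/TC", "GC", "TA"]


def expand(values):
    """Map the 10 Table 1 columns onto all 16 dinucleotides."""
    table = {}
    for names, value in zip(COLUMNS, values):
        for dinuc in names.split("/"):
            table[dinuc] = value
    return table


# Two of the fifteen properties in Table 1 (values copied from the table).
PROPS = {
    "Twist": expand([35.02, 31.53, 32.29, 30.72, 35.43,
                     33.54, 33.67, 35.67, 34.07, 36.94]),
    "Roll": expand([1.05, 2.01, 3.60, 0.61, 5.60,
                    4.68, 6.02, 2.44, 1.70, 3.50]),
}


def _values(seq, prop):
    """Property value of every overlapping dinucleotide in seq."""
    return [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]


def dcc(seq, prop1, prop2, lag):
    """Cross covariance between two properties at distance lag."""
    v1, v2 = _values(seq, prop1), _values(seq, prop2)
    m1, m2 = sum(v1) / len(v1), sum(v2) / len(v2)
    n = len(seq) - lag - 1  # number of dinucleotide pairs at this lag
    return sum((v1[i] - m1) * (v2[i + lag] - m2) for i in range(n)) / n


def dac(seq, prop, lag):
    """Auto covariance: the cross covariance of a property with itself."""
    return dcc(seq, prop, prop, lag)


def dacc(seq, props, max_lag):
    """Concatenate DAC and DCC features for lags 1..max_lag."""
    names = list(props)
    feats = [dac(seq, props[a], lag)
             for lag in range(1, max_lag + 1) for a in names]
    feats += [dcc(seq, props[a], props[b], lag)
              for lag in range(1, max_lag + 1)
              for a in names for b in names if a != b]
    return feats


features = dacc("ACGTTGCAAGCT", PROPS, 3)
print(len(features))  # N*N*lag values for N properties
```

For N properties and a maximum lag, the sketch yields N × lag DAC values plus N × (N − 1) × lag DCC values, so the feature dimension grows quadratically with the number of properties supplied.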

2.2.3. Pseudo Dinucleotide Composition (PseDNC)

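Given a DNA sequence D represented as (2), the PseDNC feature vector D can be defined as [17]

$$\mathbf{D} = \left[d_1\ d_2\ \cdots\ d_{16}\ d_{16+1}\ \cdots\ d_{16+\lambda}\right]^{T} \tag{5}$$

where

$$d_k = \begin{cases} \dfrac{f_k}{\sum_{i=1}^{16} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 1 \le k \le 16 \\[2ex] \dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 17 \le k \le 16+\lambda \end{cases} \tag{6}$$

where $f_k$ (1 ≤ k ≤ 16) represents the normalized frequency of dinucleotides along the DNA sequence; w (0 ≤ w ≤ 1) represents the weight factor; λ is the number of correlation tiers counted along the DNA sequence; and $\theta_j$ (1 ≤ j ≤ λ) measures the correlation between dinucleotides in the DNA, which is defined as

$$\theta_j = \frac{1}{L-j-1}\sum_{i=1}^{L-j-1} \Theta\left(R_i R_{i+1}, R_{i+j} R_{i+j+1}\right), \quad j = 1, 2, \ldots, \lambda \tag{7}$$

where

$$\Theta\left(R_i R_{i+1}, R_j R_{j+1}\right) = \frac{1}{\mu}\sum_{u=1}^{\mu}\left[P_u(R_i R_{i+1}) - P_u(R_j R_{j+1})\right]^2 \tag{8}$$

where μ is the number of dinucleotide property indices, and $P_u(R_i R_{i+1})$ ($P_u(R_j R_{j+1})$) represents the value of dinucleotide $R_i R_{i+1}$ ($R_j R_{j+1}$) at position i (j) for the u-th dinucleotide property index.

Pseudo dinucleotide composition (PseDNC) [17] incorporates not only the local nucleic acid composition information and the global or long-range information along the DNA sequences, but also the dinucleotide properties into the feature vectors. For the web server approach, the generation steps of the feature vectors are similar to those for DACC; an example is shown in Figure 3.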

Figure 3

An example of the PseDNC features' generation by using Pse-in-One.

For the stand-alone approach, the command line is

    ./pse.py -e user_property -f svm -l +1 7 0.3 DNA PseDNC

where -e user_property, -f svm, and -l +1 have the same meaning as in the DACC command line, lambda equals 7, the value of weight equals 0.3, the sequence type is DNA, and the method used is PseDNC. The meanings of all the parameters for these scripts are described in [17].
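The PseDNC construction can be sketched in Python under the standard pseudo dinucleotide composition definition: 16 dinucleotide frequencies followed by λ weighted correlation factors, all divided by a common denominator. This is an assumption-laden illustration, not the pse.py script. The names `pse_dnc` and `expand` are hypothetical, only the Twist and Roll rows of Table 1 are used, and the property values are taken raw, whereas a real tool may standardize them first.

```python
from itertools import product

# Column layout of Table 1: reverse-complementary dinucleotides share a value.
COLUMNS = ["AA/TT", "AC/GT", "AG/CT", "AT", "CA/TG",
           "CC/GG", "CG", "GA/TC", "GC", "TA"]


def expand(values):
    """Map the 10 Table 1 columns onto all 16 dinucleotides."""
    return {d: v for names, v in zip(COLUMNS, values)
            for d in names.split("/")}


TWIST = expand([35.02, 31.53, 32.29, 30.72, 35.43,
                33.54, 33.67, 35.67, 34.07, 36.94])
ROLL = expand([1.05, 2.01, 3.60, 0.61, 5.60,
               4.68, 6.02, 2.44, 1.70, 3.50])


def pse_dnc(seq, props, lam, w):
    """Return a (16 + lam)-dimensional PseDNC-style vector for seq."""
    dinucs = ["".join(p) for p in product("ACGT", repeat=2)]
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Local composition: normalized dinucleotide frequencies f_1..f_16.
    freq = [pairs.count(d) / len(pairs) for d in dinucs]
    # Sequence-order part: one correlation factor theta_j per tier j.
    thetas = []
    for j in range(1, lam + 1):
        n = len(seq) - j - 1
        total = 0.0
        for i in range(n):
            # Mean squared property difference between two dinucleotides.
            total += sum((p[pairs[i]] - p[pairs[i + j]]) ** 2
                         for p in props) / len(props)
        thetas.append(total / n)
    denom = sum(freq) + w * sum(thetas)
    return [f / denom for f in freq] + [w * t / denom for t in thetas]


vec = pse_dnc("ACGTTGCAAGCT", [TWIST, ROLL], lam=7, w=0.3)
print(len(vec))  # 16 composition terms plus 7 correlation terms
```

Because every component shares the same denominator, the entries of the vector sum to 1; the weight w controls how much of that total is assigned to the sequence-order terms.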

2.3. Support Vector Machine (SVM)

Support Vector Machine (SVM) is an algorithm based on the statistical learning theory proposed by Vapnik [20-22], and it has been widely used for many bioinformatics tasks [23-27]. In the current study, the LIBSVM package version 3.21 [18] was employed. The two SVM parameters, the kernel width parameter γ and the regularization parameter C, were optimized via the grid tool provided by LIBSVM [18]. Three basic predictors are proposed in the current study: SVM-Kmer, SVM-DACC, and SVM-PseDNC. The parameter values for SVM-Kmer are

$$C = 2^{7}, \quad \gamma = 2 \tag{9}$$

The parameter values for SVM-DACC are

$$C = 2^{3}, \quad \gamma = 2^{-3} \tag{10}$$

The parameter values for SVM-PseDNC are

$$C = 2^{13}, \quad \gamma = 2^{3} \tag{11}$$

2.4. Ensemble Learning

In machine learning, ensemble learning is the process by which multiple classifiers are constructed and combined based on the same dataset to obtain a better performance than a single classifier [28, 29] and existing popular multiobjective optimization evolutionary algorithms can be used for ensemble learning [30, 31]. Ensemble classifier also performed well in several bioinformatics problems. In the current study, the basic framework for an ensemble classifier is illustrated in Figure 4. The final results are obtained by fusing three individual classifier outcomes, as illustrated below.

Figure 4

The basic framework for an ensemble classifier.

Suppose the ensemble classifier ℂ is defined as

$$\mathbb{C} = \mathbb{C}_1 \oplus \mathbb{C}_2 \oplus \mathbb{C}_3 \tag{12}$$

where $\mathbb{C}_1$ represents the classifier SVM-Kmer, $\mathbb{C}_2$ represents the classifier SVM-DACC, and $\mathbb{C}_3$ represents the classifier SVM-PseDNC. The symbol ⊕ denotes the fusing operator. The process of the ensemble classifier can then be formulated as follows:

$$Y_j(S) = \frac{1}{3}\sum_{i=1}^{3} P_i(S, L_j), \quad j = 1, 2 \tag{13}$$

where $L_1$ is the set containing only recombination hotspots and $L_2$ is the set of recombination coldspots. $P_i(S, L_j)$ is the probability that DNA sequence S belongs to category $L_j$, as obtained by the ith basic classifier. The category to which the query DNA S belongs is determined from its average probability calculated by (13); that is, suppose that

$$Y_{\mu}(S) = \max\{Y_1(S), Y_2(S)\} \tag{14}$$

where the operator max selects the larger value in the brackets, and the subscript μ indicates that the query DNA S belongs to category $L_{\mu}$.
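The fusion rule described in this section, averaging the class probabilities of the three basic classifiers and choosing the category with the larger average, can be sketched as follows. The function name `fuse`, the example probabilities, and the tie-breaking toward "hotspot" are assumptions for illustration; the probabilities themselves would come from the trained SVMs.

```python
def fuse(prob_rows):
    """Average per-classifier probabilities and pick the larger one.

    prob_rows holds one (p_hotspot, p_coldspot) pair per basic
    classifier, e.g. SVM-Kmer, SVM-DACC, and SVM-PseDNC.
    """
    n = len(prob_rows)
    y_hot = sum(p[0] for p in prob_rows) / n
    y_cold = sum(p[1] for p in prob_rows) / n
    # Ties go to "hotspot" here; the source does not specify tie handling.
    label = "hotspot" if y_hot >= y_cold else "coldspot"
    return label, y_hot, y_cold


# Hypothetical outputs of the three basic classifiers for one query sequence.
label, y_hot, y_cold = fuse([(0.70, 0.30), (0.40, 0.60), (0.55, 0.45)])
print(label, round(y_hot, 2), round(y_cold, 2))
```

In this made-up example the second classifier alone would vote for "coldspot", but the averaged probability still favors "hotspot", which is how the ensemble can correct an individual classifier.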

2.5. Criteria for Performance Evaluation

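The prediction results can be divided into true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN) [32]. In the current study, the jackknife test [33-37] was employed and four evaluation indexes were adopted: Sensitivity (Se), Specificity (Sp), Accuracy (Acc), and Matthews Correlation Coefficient (Mcc). They are described as

$$\begin{aligned} \mathrm{Se} &= \frac{TP}{TP + FN} \\ \mathrm{Sp} &= \frac{TN}{TN + FP} \\ \mathrm{Acc} &= \frac{TP + TN}{TP + FN + TN + FP} \\ \mathrm{Mcc} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \end{aligned} \tag{15}$$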
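In terms of the four counts, the evaluation indexes take their standard forms, which the Python sketch below computes. The example counts are not reported in the source: they are back-calculated, as an assumption, from the SVM-Kmer row of Table 2 (Se = 75.92%, Sp = 86.29%) and the benchmark sizes of 490 hotspots and 591 coldspots.

```python
from math import sqrt


def metrics(tp, fn, fp, tn):
    """Return (Se, Sp, Acc, Mcc) from the four confusion-matrix counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return se, sp, acc, mcc


# Counts inferred (not reported) for SVM-Kmer: 372 of 490 hotspots and
# 510 of 591 coldspots classified correctly.
se, sp, acc, mcc = metrics(tp=372, fn=118, fp=81, tn=510)
print(f"Se={se:.2%} Sp={sp:.2%} Acc={acc:.2%} Mcc={mcc:.3f}")
```

Under these inferred counts the four values agree with the SVM-Kmer row of Table 2 to the reported precision, which is a useful consistency check on the definitions.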

3. Results and Discussion

3.1. Performance of the Three Basic Classifiers

Sequence order is an inherent property of DNA sequences and is important for their classification. Three basic methods based on sequence-order information were therefore adopted to identify recombination hotspots/coldspots. Table 2 shows the performance of the three methods. SVM-DACC and SVM-PseDNC outperform SVM-Kmer in prediction accuracy. The main reason is that SVM-Kmer is based only on local sequence-order information, while both SVM-DACC and SVM-PseDNC also contain global sequence-order information.

Table 2

Results on benchmark dataset for different predictors proposed in the current study.

Predictor	Test method	Se (%)	Sp (%)	Acc (%)	MCC
SVM-Kmer^a	Jackknife	75.92	86.29	81.59	0.628
SVM-DACC^b	Jackknife	76.12	87.99	82.61	0.649
SVM-PseDNC^c	Jackknife	72.04	90.69	82.24	0.644
SVM-EL	Jackknife	76.33	88.33	82.89	0.654

a. The parameters used are k = 6 for SVM-Kmer and $C = 2^{7}$ and $\gamma = 2$ for LIBSVM [18].

b. The parameters used are lag = 6 for SVM-DACC and $C = 2^{3}$ and $\gamma = 2^{-3}$ for LIBSVM [18].

c. The parameters used are λ = 7 and w = 0.3 for SVM-PseDNC and $C = 2^{13}$ and $\gamma = 2^{3}$ for LIBSVM [18].

3.2. The Performance of the Three Basic Predictors Can Be Further Improved by Using Ensemble Learning

Based on the analysis above, we have proposed three basic predictors for identifying recombination hotspots/coldspots. These methods capture DNA information from different aspects. Therefore, we presented a complementary method, SVM-EL, which fuses these basic methods to improve the prediction performance. The performance of SVM-EL is shown in Table 2: it achieves the highest sensitivity, accuracy, and MCC of the four predictors, although SVM-PseDNC has the highest specificity. In addition, the receiver operating characteristic (ROC) curves of the four classifiers are drawn in Figure 5. AUC, the area under the ROC curve, is often used to indicate the performance of a classifier: the larger the value, the better the classifier.

Figure 5

The comparison of different predictors for hotspots/coldspots identification. The areas under ROC curves (AUC) of SVM-EL, SVM-DACC, SVM-Kmer, and SVM-PseDNC are 0.91, 0.90, 0.89, and 0.87, respectively.

As shown in Figure 5, the predictor SVM-EL showed the top performance, outperforming three basic methods: SVM-Kmer, SVM-DACC, and SVM-PseDNC.
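For readers unfamiliar with AUC, the sketch below computes it directly from classifier scores using the rank-based (Mann-Whitney) formulation: the fraction of hotspot/coldspot pairs in which the hotspot receives the higher score, with ties counted as one half. The function name `auc` and the example scores are hypothetical; the source does not state how its ROC curves were computed.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up hotspot probabilities for three hotspots and three coldspots.
value = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(value, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the larger values in Figure 5 indicate better classifiers.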

3.3. Comparison with Other Related Predictors

Two state-of-the-art methods, IDQD [10] and iRSpot-PseDNC, were selected to compare with the proposed SVM-EL. Table 3 shows the results of various methods on the benchmark dataset.

Table 3

Results on benchmark dataset for different predictors.

Predictor	Test method	Se (%)	Sp (%)	Acc (%)	MCC
IDQD^a	5-fold	79.40	81.00	80.30	0.603
iRSpot-PseDNC^b	Jackknife	73.06	89.49	82.04	0.638
SVM-EL	Jackknife	76.33	88.33	82.89	0.654

aFrom Liu et al. [10].

bFrom Chen et al. [12].

According to Table 3, SVM-EL achieves the highest accuracy and MCC of the compared methods, although IDQD has the highest sensitivity and iRSpot-PseDNC the highest specificity. The main reason is that IDQD and SVM-Kmer consider only local sequence-order information, whereas iRSpot-PseDNC, SVM-DACC, and SVM-PseDNC improve on them by incorporating global sequence-order information. SVM-EL, by fusing the basic predictors, incorporates both the local nucleic acid information and the global information. Therefore, we conclude that SVM-EL would be a useful tool for hotspot/coldspot identification.

4. Conclusion

In this article, we proposed a predictor called SVM-EL for yeast hotspot/coldspot identification, which combines Support Vector Machines (SVMs) with Ensemble Learning (EL). Fusing predictors trained on different features contributes to the improvement of prediction accuracy. SVM-EL combines three predictors trained on basic kmer (Kmer), dinucleotide-based auto-cross covariance (DACC), and pseudo dinucleotide composition (PseDNC) features, respectively. All these features can be generated by Pse-in-One [17], a web server for generating various DNA, RNA, or protein features, which also provides an easy-to-use stand-alone version. In the jackknife test, SVM-EL achieved higher accuracy and MCC than the other predictors considered. In the future, we will consider using other approaches for yeast hotspot/coldspot identification, such as bioinspired computing models [38-45].

