Rong Wang, Yong Xu, Bin Liu.
Abstract
Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity. Accurate identification of recombination spots is useful for studying DNA function. To improve prediction accuracy, researchers have proposed several computational methods for recombination spot identification. The k-mer feature is one of the most useful features for modeling the properties and function of DNA sequences, but it has an inherent limitation: if the word length k is large, k-mer occurrences are close to a binary variable, with a few k-mers present once and most k-mers absent. This usually causes a sparsity problem and reduces classification accuracy. To solve this problem, we add gaps into k-mers and introduce a new feature called the gapped k-mer (GKM) for identification of recombination spots. Using this feature, we present a new predictor called SVM-GKM, which combines gapped k-mers with a Support Vector Machine (SVM) for recombination spot identification. Experimental results on a widely used benchmark dataset show that SVM-GKM outperforms other highly related predictors. Therefore, SVM-GKM would be a powerful predictor for computational genomics.
Year: 2016 PMID: 27030570 PMCID: PMC4814916 DOI: 10.1038/srep23934
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. An example to show the tree structure of k-mer counting.
This example uses an alphabet of only two letters, A and T. We use k = 3 and three sequences, S1 = AAAAT, S2 = ATTTT, and S3 = AATA, to build the k-mer tree. Each node t at depth d represents a sequence of length d, denoted by s(t), which is determined by the path from the root of the tree to t. At depth d = 3, for node t6, s(t6) = ‘AAA’: S1 contains two counts of this k-mer, while S2 and S3 do not contain it. For the node with s(t) = ‘AAT’, S1 and S3 each contain one count, and S2 does not contain this k-mer. The paths to these two nodes differ by only one mismatch.
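The counts quoted in this caption can be checked with a few lines of Python. This is only a minimal sketch that counts overlapping k-mers directly rather than building the tree; the labels `S1`, `S2`, `S3` (assigned in the order the sequences are listed) and the helper names `kmer_counts` and `mismatches` are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


# The three sequences from the caption; the labels are assumed from their order.
sequences = {"S1": "AAAAT", "S2": "ATTTT", "S3": "AATA"}
counts = {name: kmer_counts(seq, 3) for name, seq in sequences.items()}

for kmer in ("AAA", "AAT"):
    # A Counter returns 0 for k-mers that never occur in a sequence.
    print(kmer, {name: c[kmer] for name, c in counts.items()})

print("mismatches between AAA and AAT:", mismatches("AAA", "AAT"))
```

A tree-based counter would produce the same per-sequence counts while sharing prefixes between k-mers; the dictionary version above is shown only because it is the shortest way to reproduce the numbers.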
Figure 2. The influence of parameter k on the performance of two predictors.
The two predictors are SVM-GKM and kmer-SVM. We consider word lengths k from 8 to 15 and choose the match length m = 7 for the SVM-GKM predictor. SVM-GKM achieves its best result at k = 13, and kmer-SVM achieves its best result at k = 10.
Figure 3. Comparison of SVM-GKM and kmer-SVM with four performance measures.
This figure shows the best results achieved by SVM-GKM and kmer-SVM, with word length k = 13 and match length m = 7 for SVM-GKM, and word length k = 10 for kmer-SVM. SVM-GKM outperforms kmer-SVM on all four performance measures.
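To make the roles of k and m concrete, the following sketch enumerates gapped k-mers naively. It assumes that m is the number of non-gap (matching) positions within a word of length k, and that gaps are written as `*`; the function name `gapped_kmer_counts` is hypothetical, and this brute-force enumeration is not claimed to be the algorithm used by SVM-GKM.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gapped_kmer_counts(seq, k, m, gap="*"):
    """Count gapped k-mers: each length-k window with m bases kept and k - m gaps."""
    kept_sets = list(combinations(range(k), m))  # every choice of m kept positions
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for kept in kept_sets:
            chars = [gap] * k
            for j in kept:
                chars[j] = window[j]
            counts["".join(chars)] += 1
    return counts


# Toy parameters (k = 3, m = 2) keep the output small enough to read.
toy = gapped_kmer_counts("AATA", k=3, m=2)
print(sorted(toy.items()))

# Number of gap placements per window for the settings reported above.
print(comb(13, 7))  # 1716
```

Because many different k-mers share the same gapped pattern, gapped k-mer counts tend to be less sparse than plain k-mer counts at the same word length, which is the motivation given in the abstract.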
Results of different methods for recombination spot identification.
| Predictor | Sn(%) | Sp(%) | Acc(%) | MCC |
|---|---|---|---|---|
| SVM-GKM (a) | 81.22 | 90.69 | 86.57 | 0.728 |
| iRSpot-PseDNC (b) | 81.63 | 88.14 | 85.19 | 0.692 |
| IDQD (c) | 79.40 | 81.00 | 80.30 | 0.603 |
| kmer-SVM (d) | 74.49 | 84.75 | 82.31 | 0.597 |

(a) The parameters used: k = 13 and m = 7.
(b) From Chen et al. (ref. 53).
(c) From Liu et al. (ref. 17).
(d) The parameter used: k = 10.
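For readers unfamiliar with the four measures in this table, the sketch below computes sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC) from a confusion matrix using their standard definitions. The function name `performance_measures` and the counts passed to it are hypothetical and are not taken from the benchmark dataset.

```python
from math import sqrt


def performance_measures(tp, fn, tn, fp):
    """Return (Sn, Sp, Acc, MCC) from true/false positive and negative counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Hypothetical confusion matrix, used only to exercise the formulas.
sn, sp, acc, mcc = performance_measures(tp=80, fn=20, tn=90, fp=10)
print(f"Sn={sn:.2%} Sp={sp:.2%} Acc={acc:.2%} MCC={mcc:.3f}")
```

MCC is reported alongside accuracy because it stays informative when the positive and negative classes are of different sizes.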
Comparison of the most discriminative gapped k-mer with two known motifs in hotspot sequences.
| Motif name | Sequence | Matching bases |
|---|---|---|
| M26 | A | CCG |
| 4095 | CC |
(a) These two motifs in hotspots were reported in ref. 57. The most discriminative gapped k-mer, ‘CCG*T**C**CA*’, matches both motifs. The matching bases are shown in bold.
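Matching a gapped k-mer against a sequence amounts to comparing only the non-gap positions. The sketch below shows one way to do this, assuming `*` marks a gap that matches any base; the function name `match_positions` and the toy sequence are hypothetical and are not the M26 or 4095 motif sequences.

```python
def match_positions(pattern, seq, gap="*"):
    """Return start indices where every non-gap base of pattern agrees with seq."""
    k = len(pattern)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(p == gap or p == b for p, b in zip(pattern, window)):
            hits.append(i)
    return hits


pattern = "CCG*T**C**CA*"  # gapped k-mer quoted in the footnote above
informative = sum(c != "*" for c in pattern)
print(len(pattern), informative)  # word length and number of non-gap positions

# Hypothetical toy sequence constructed to contain one occurrence of the pattern.
toy_seq = "TTCCGATGGCTTCAGTT"
print(match_positions(pattern, toy_seq))
```

The pattern has 13 positions of which 7 are non-gap, which agrees with the parameters k = 13 and m = 7 reported for SVM-GKM.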