He Peng.
Abstract
BACKGROUND: Conserved nucleic acid sequences play an essential role in transcriptional regulation. Motifs or templates derived from nucleic acid sequence datasets are commonly used as biomarkers to predict biochemical properties, such as protein binding sites, or to identify specific non-coding RNAs. In many cases, template-based nucleic acid sequence classification performs better than feature extraction methods such as N-gram and k-spaced pairs classification. The availability of large-scale experimental data provides an unprecedented opportunity to improve motif extraction methods. Extracting patterns from large-scale data is crucial for building predictive models.
Keywords: Mutational information mining; Long range correlation; Sequence feature extraction
Year: 2020 PMID: 32341900 PMCID: PMC7179567 DOI: 10.7717/peerj.8965
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. A schematic diagram of the algorithm.
(A) A concise flow of the algorithm for detecting collaborative frequent sequence patterns. (B) The core processing step of the algorithm, shown in simplified form.
Figure 2. A detailed demonstration of the algorithm.
(A) Each frequent sequence maps to a column of a table. Each sub-sequence is represented by a symbol, and the combinations of frequent sequences are derived. (B) The order of the sub-sequences within each combination of frequent sequences is obtained by re-scanning the original sequences. (C) Approximate matching of the frequent sequences, implemented with a non-deterministic finite automaton algorithm. (D) Compact data structure for recording mutation information. The first digit represents the type of mutation, which can be sub (substitution), ins (insertion), or del (deletion). Location numbering begins at zero.
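The mutation record described in panel (D) can be illustrated with a minimal sketch. It is not the paper's implementation: it swaps the non-deterministic finite automaton for a plain edit-distance alignment with traceback, and the numeric type codes (`SUB`, `INS`, `DEL`), the function name `mutation_records`, and the tuple layout are assumptions made for illustration. Positions are zero-based, and an insertion is recorded at the template position it precedes.

```python
from typing import List, Tuple

# Assumed numeric codes; the source only says the first digit encodes the type.
SUB, INS, DEL = 0, 1, 2


def mutation_records(template: str, observed: str) -> List[Tuple[int, int, str]]:
    """Return (type, zero-based template position, base) edits that turn
    `template` into `observed`, using a standard edit-distance alignment."""
    n, m = len(template), len(observed)
    # dist[i][j] = edit distance between template[:i] and observed[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if template[i - 1] == observed[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Trace back from the bottom-right corner to recover one optimal edit script.
    records: List[Tuple[int, int, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and template[i - 1] != observed[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + int(mismatch):
            if mismatch:
                records.append((SUB, i - 1, observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            records.append((DEL, i - 1, ""))
            i -= 1
        else:
            # Inserted base sits in front of template position i.
            records.append((INS, i, observed[j - 1]))
            j -= 1
    records.reverse()
    return records
```

For example, `mutation_records("ACGT", "AGGT")` yields a single substitution record at position 1, and an exact match yields an empty list, so unchanged occurrences cost no storage.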
The data size for each species.
| Species | Positive data size | Negative data size |
|---|---|---|
|  | 282 | 500 |
|  | 298 | 457 |
|  | 691 | 2,094 |
|  | 691 | 1,437 |
|  | 238 | 443 |
|  | 691 | 2,310 |
|  | 691 | 2,023 |
Average accuracy (10-fold cross-validation) of miRNA identification for various species.
| Species | CFSP | k-mer | gkmSVM |
|---|---|---|---|
|  | 81.46% | 64.07% | 65.95% |
|  | 90.20% | 85.70% | 90.79% |
|  | 93.18% | 91.20% | 77.43% |
|  | 84.87% | 77.73% | 70.28% |
|  | 93.39% | 86.34% | 81.10% |
|  | 93.04% | 89.80% | 89.02% |
|  | 85.08% | 75.86% | 55.67% |
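The k-mer baseline in this comparison is typically built from counts of overlapping length-k substrings. The sketch below shows that generic featurization only; the function name `kmer_counts` and the choice of `k` are illustrative assumptions, since this record does not state the value of k or any normalization used in the paper.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in `sequence`."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k contributes no k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Usage: "ACGTAC" contains the 2-mers AC, CG, GT, TA, AC.
counts = kmer_counts("ACGTAC", 2)
```

Such counts are usually laid out as a fixed-length vector over all possible k-mers before being passed to a classifier.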
ROC results of miRNA identification for various species.
| Species | CFSP | k-mer | gkmSVM |
|---|---|---|---|
|  | 0.90 | 0.73 | 0.78 |
|  | 0.86 | 0.68 | 0.93 |
|  | 0.97 | 0.76 | 0.89 |
|  | 0.87 | 0.75 | 0.80 |
|  | 0.97 | 0.81 | 0.87 |
|  | 0.92 | 0.72 | 0.95 |
|  | 0.83 | 0.78 | 0.74 |
Results of CFSP combined with various machine learning methods.
| Species | Method | Accuracy | ROC |
|---|---|---|---|
|  | Weighted SVM | 93.52% | 0.98 |
|  | Random Forest | 94.24% | 0.97 |
|  | Neural Network | 92.81% | 0.97 |
|  | Weighted SVM | 92.00% | 0.97 |
|  | Random Forest | 91.00% | 0.99 |
|  | Neural Network | 89.67% | 0.95 |
|  | Weighted SVM | 79.33% | 0.86 |
|  | Random Forest | 80.44% | 0.83 |
|  | Neural Network | 78.97% | 0.86 |
Average accuracy and ROC (10-fold cross-validated) for Sigma-54 promoter prediction.
| Method | Average accuracy | Average ROC |
|---|---|---|
| CFSP+svm | 79.82% | 0.89 |
| k-mer+svm | 75.16% | 0.84 |
| CFSP+Random Forest | 82.96% | 0.93 |
| k-mer+Random Forest | 79.33% | 0.86 |
Figure 3. Bootstrap statistics for the CFSP method (A) and the k-mer method (B).
The results of piRNA prediction.
| Method | Sp (%) | Sn (%) | Acc (%) |
|---|---|---|---|
| k-mer | 98.4 | 52.04 | 75.22 |
| Pibomd | 89.76 | 91.48 | 90.62 |
| Asysm-Pibomd | 96.2 | 72.68 | 84.44 |
| CFSP | 89.11 | 89.17 | 89.12 |
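Assuming Sp, Sn, and Acc carry their usual meanings of specificity, sensitivity, and accuracy, they follow from confusion-matrix counts as in the sketch below. The function name `sp_sn_acc` and the example counts are illustrative and are not taken from the paper.

```python
from typing import Tuple


def sp_sn_acc(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float]:
    """Return (specificity, sensitivity, accuracy) as fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)   # fraction of true positives recovered
    specificity = tn / (tn + fp)   # fraction of true negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return specificity, sensitivity, accuracy


# Usage with made-up counts: 100 positives and 100 negatives.
sp, sn, acc = sp_sn_acc(tp=90, fn=10, tn=80, fp=20)
```

When the positive and negative sets are the same size, accuracy equals the mean of Sp and Sn. Most rows of the table fit that relation closely, which would be consistent with a balanced test set, although this record does not state the class sizes.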
Figure 4. (A) One conserved sequence that occurs 79 times in 46,264 binding site peaks from the ChIP-seq dataset. The mutation profile of this conserved sequence is illustrated, where '_' indicates that the base is unchanged, DEL indicates that the base is lost, and INS X indicates that a new base X is inserted in front of this base. (B) Several repeated-element patterns. (C) The first column shows the top five DNA motifs mined by the MEME-ChIP tools (Machanick & Bailey, 2011). The second column lists the similar conserved sequences found by the CFSP algorithm. The third column lists the position-specific scoring matrices (PSSMs) derived from the mutational information. The similarity between each MEME motif and the corresponding conserved sequence in PSSM format was calculated with the STAMP motif comparison tool (Mahony & Benos, 2007). The E-values for the similarity of those pairs are displayed in the fourth column. (D) One motif is selected from each group clustered by gkmSVM descriptors, and the corresponding motif found by the CFSP algorithm is listed below it. (E) Additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER; Target: SREBF1) were collected from https://www.encodeproject.org. The top two motifs in each file were selected using the MEME tools, and the corresponding motifs found by our algorithm are listed below them.
Data sets for protein sequence identification.
| Data name | Training or test | Positive or negative | Data size |
|---|---|---|---|
| Biofilm | Training Data Set | Positive Sequences | 1,305 |
| Biofilm | Training Data Set | Negative Sequences | 1,463 |
| Biofilm | Test Data Set | Positive Sequences | 145 |
| Biofilm | Test Data Set | Negative Sequences | 163 |
| Integrins | Training Data Set | Positive Sequences | 100 |
| Integrins | Training Data Set | Negative Sequences | 518 |
| Integrins | Test Data Set | Positive Sequences | 12 |
| Integrins | Test Data Set | Negative Sequences | 58 |
Results of protein sequence classification.
| Data set | Method | Precision | Recall | F1 |
|---|---|---|---|---|
| Integrins | ProtVecX (best representation) | 1 | 0.83 | 0.91 |
| Integrins | Frequent sequence tuples | 1 | 0.91 | 0.97 |
| Biofilm formation | ProtVecX (best representation) | 0.82 | 0.56 | 0.72 |
| Biofilm formation | Frequent sequence tuples | 0.97 | 0.78 | 0.87 |
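For reference, F1 is the harmonic mean of precision and recall. The short sketch below uses an illustrative function name, `f1_score`, and shows the standard formula. Reported F1 values may have been computed from unrounded precision and recall, so recomputing from the rounded table entries need not reproduce every figure exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Usage: the first Integrins row, precision 1 and recall 0.83.
value = round(f1_score(1.0, 0.83), 2)
```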