Mengmeng Zhu, Michael Gribskov.
Abstract
BACKGROUND: Micropeptides are small proteins of length ≤ 100 amino acids. Short open reading frames that could produce micropeptides were traditionally ignored due to technical difficulties, as few small peptides had been experimentally confirmed. In the past decade, a growing number of micropeptides have been shown to play significant roles in vital biological activities. Despite the increased amount of data, we still lack bioinformatics tools for specifically identifying micropeptides from DNA sequences. Indeed, most existing tools for classifying coding and noncoding ORFs were built on datasets in which "normal-sized" proteins were considered to be positives and short ORFs were generally considered to be noncoding. Since the functional and biophysical constraints on small peptides are likely to be different from those on "normal" proteins, methods for predicting short translated ORFs must be trained independently from those for longer proteins.
Keywords: Coding; Machine learning; Micropeptide; Noncoding; Small ORF; lncRNA; sORF; smORF
Year: 2019 PMID: 31703551 PMCID: PMC6842143 DOI: 10.1186/s12859-019-3033-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Training and test data sets
| Dataset | #Positive | #Negative | #Total |
|---|---|---|---|
| Training | 3194 | 2369 | 5563 |
| Test | 823 | 567 | 1390 |
#positive: number of positive data points
#negative: number of negative data points
#total: total number of data points
Fig. 1 Parameter Optimization. The avg. F1 and accuracy are shown at the best t for the indicated values of λ. 10-fold cross validation results with different λ and t combinations on the training set. λ: the hyperparameter for regularization strength in logistic regression; t: the hyperparameter for threshold in logistic regression; best t: when λ is fixed, the t from t ∈ {0, 0.05, 0.1, …, 0.95, 1.0} that gives the best performance; avg. F1 val: the average F1 score on the 10 validation sets when both λ and t are fixed; avg. accu val: the average accuracy on the 10 validation sets when both λ and t are fixed
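The "best t" selection described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `f1_score`, `best_threshold`, and `THRESHOLDS` and the toy data are hypothetical, a sequence is assumed to be called positive when its predicted probability is at least t, and a single set of predictions stands in for the average over the 10 validation folds.

```python
def f1_score(labels, predictions):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if that denominator is zero."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels, thresholds):
    """Return (t, F1) for the threshold with the highest F1; ties keep the smallest t."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        # Assumption: probability >= t means "predicted positive".
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Candidate thresholds from the caption: 0, 0.05, ..., 1.0
THRESHOLDS = [round(0.05 * i, 2) for i in range(21)]

# Toy predicted probabilities and true labels (illustrative only).
toy_probs = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 1, 0, 1, 0, 0]

t, score = best_threshold(toy_probs, toy_labels, THRESHOLDS)
```

In a full grid search, this inner loop would be repeated for each candidate λ, with the score averaged across folds before the best (λ, t) pair is chosen.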
MiPepid results on the training set
| F1 | Accuracy (positive) | Accuracy (negative) | Accuracy (overall) |
|---|---|---|---|
| 0.9845 | 0.9818 | 0.9827 | 0.9822 |
“positive” and “negative” refer to the accuracies of MiPepid on the positive and negative subsets, respectively;
“overall” refers to the accuracy on the whole training set (positive + negative)
MiPepid results on the blind test set
| F1 | Accuracy (positive) | Accuracy (negative) | Accuracy (overall) |
|---|---|---|---|
| 0.9640 | 0.9587 | 0.9559 | 0.9576 |
“positive” and “negative” refer to the accuracies of MiPepid on the positive and negative subsets, respectively;
“overall” refers to the accuracy on the whole test set (positive + negative)
List of micropeptides published after Dec 2015
| Micropeptide name | Protein sequence length | in SmProt non-highConf | Reference |
|---|---|---|---|
| MOXI | 56 | yes | [ |
| DWORF | 35 | yes | [ |
| Myomixer / Minion | 84 | yes | [ |
| SPAR | 90 | no | [ |
| HOXB-AS3 | 53 | no | [ |
in SmProt non-highConf: If this micropeptide was already included in the SmProt [34] non-high-confidence subset, then the value is “yes”, otherwise “no”
Comparison with existing methods on the blind test set and the new_positive dataset
| Method | Positive #correct | Positive accuracy | Negative #correct | Negative accuracy | Overall F1 | Overall accuracy | New_positive #correct | New_positive accuracy |
|---|---|---|---|---|---|---|---|---|
| CPC [ | 17 | 0.02 | 567 | 1.00 | 0.04 | 0.42 | 0 | 0.00 |
| CPC2 [ | 61 | 0.07 | 567 | 1.00 | 0.14 | 0.45 | 0 | 0.00 |
| CPAT [ | 261 | 0.32 | 567 | 1.00 | 0.48 | 0.60 | 3 | 0.60 |
| MiPepid (our method) | 789 | 0.96 | 542 | 0.96 | 0.96 | 0.96 | 5 | 1.00 |
positive: the positive subset of the blind test set;
negative: the negative subset of the blind test set;
overall: the overall performance on the blind test set;
#correct: the number of correctly classified cases by a method;
accuracy: #correct divided by the total number of cases in that dataset/subset;
F1: the F1 score
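The accuracy and F1 values in this table can be recomputed from the #correct counts and the blind test set sizes (823 positives, 567 negatives). The sketch below is illustrative only: the function name `metrics_from_counts` is an assumption, and F1 is taken to be the standard harmonic mean of precision and recall on the positive class.

```python
def metrics_from_counts(pos_correct, neg_correct, n_pos, n_neg):
    """Derive per-class accuracy, overall accuracy, and F1 from #correct counts."""
    tp = pos_correct            # positives classified as positive
    fn = n_pos - pos_correct    # positives classified as negative
    tn = neg_correct            # negatives classified as negative
    fp = n_neg - neg_correct    # negatives classified as positive
    denom = 2 * tp + fp + fn
    return {
        "positive_accuracy": tp / n_pos,
        "negative_accuracy": tn / n_neg,
        "overall_accuracy": (tp + tn) / (n_pos + n_neg),
        "f1": 2 * tp / denom if denom else 0.0,
    }


# Blind test set sizes from the data set table: 823 positives, 567 negatives.
N_POS, N_NEG = 823, 567

mipepid = metrics_from_counts(789, 542, N_POS, N_NEG)
cpc = metrics_from_counts(17, 567, N_POS, N_NEG)

for name, m in (("MiPepid", mipepid), ("CPC", cpc)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Rounded to two decimals, the results agree with the MiPepid and CPC rows. A method that labels almost everything as noncoding (as CPC does here) keeps a perfect negative accuracy while its F1 collapses.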
Comparison with sORFfinder
| Method | Positive #correct | Positive accuracy | Negative #correct | Negative accuracy | Overall F1 | Overall accuracy |
|---|---|---|---|---|---|---|
| sORFfinder | 708 | 0.86 | 506 | 0.89 | 0.89 | 0.87 |
| MiPepid (our method) | 789 | 0.96 | 542 | 0.96 | 0.96 | 0.96 |
positive: the positive subset of the blind test set;
negative: the negative subset of the blind test set;
overall: the overall performance on the blind test set;
#correct: the number of correctly classified cases by a method;
accuracy: #correct divided by the total number of cases in that dataset/subset;
F1: the F1 score
MiPepid’s prediction on the non-high-confidence data in SmProt
| Data source | #sORFs | avg sORF length (aa) | #Predicted positive | Proportion |
|---|---|---|---|---|
| high-throughput literature mining | 25,663 | 44 | 20,516 | 0.80 |
| ribosome profiling | 13,715 | 36 | 8,596 | 0.63 |
| MS data | 324 | 15 | 233 | 0.72 |
high-throughput literature mining: published sORFs that were identified using high-throughput experimental methods;
ribosome profiling: sORFs predicted from Ribo-Seq data;
MS data: sORFs predicted from MS data;
#sORFs: number of sORFs from a particular data source;
avg sORF length (aa): the average length of sORFs measured in number of amino acids;
#predicted positive: number of sORFs that are predicted as positive by MiPepid;
proportion: #predicted positive divided by #sORFs
Fig. 2 Predicted Coding Probability. Coding probability as a function of the predicted small ORF length. Scatterplot of the length of sORF vs. predicted coding probability for the non-high-confidence sORFs in SmProt. aa: number of amino acids. The horizontal line y = 0.6 separates sORFs predicted as positive (predicted coding probability ≥ 0.6) from those predicted as negative
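The decision rule in this caption, and the proportion reported in the SmProt table above, amount to a simple cutoff on the predicted coding probability. The following is a minimal sketch: `CODING_THRESHOLD`, `is_predicted_coding`, `predicted_positive_proportion`, and the toy probabilities are hypothetical names and values, not part of MiPepid's interface.

```python
CODING_THRESHOLD = 0.6  # cutoff stated in the figure caption


def is_predicted_coding(prob, threshold=CODING_THRESHOLD):
    """An sORF is predicted positive when its coding probability is >= threshold."""
    return prob >= threshold


def predicted_positive_proportion(probs, threshold=CODING_THRESHOLD):
    """Fraction of sORFs predicted positive: #predicted positive / #sORFs."""
    if not probs:
        raise ValueError("need at least one predicted probability")
    n_positive = sum(1 for p in probs if is_predicted_coding(p, threshold))
    return n_positive / len(probs)


# Toy coding probabilities (illustrative only, not values from SmProt).
toy_probs = [0.91, 0.60, 0.59, 0.12, 0.75]
proportion = predicted_positive_proportion(toy_probs)
```

Because the comparison is inclusive, an sORF with a probability of exactly 0.6 counts as predicted positive.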
MiPepid’s prediction on small protein-coding genes in model organisms
| Species | #seq | %Predicted positive |
|---|---|---|
| | 422 | 96.68% |
| yeast ( | 502 | 93.63% |
| arabidopsis ( | 2888 | 98.61% |
| zebrafish ( | 2481 | 96.78% |
| mouse ( | 6451 | 97.54% |
#seq: number of small protein-coding sequences
%predicted positive: percentage of sequences predicted as coding by MiPepid