Xiaoyong Pan, Lei Chen, Kai-Yan Feng, Xiao-Hua Hu, Yu-Hang Zhang, Xiang-Yin Kong, Tao Huang, Yu-Dong Cai.
Abstract
Small nucleolar RNAs (snoRNAs) are a new type of functional small RNA involved in the chemical modification of rRNAs, tRNAs, and small nuclear RNAs. They have been reported to play important roles in tumorigenesis via various regulatory modes: snoRNAs both participate in the regulation of methylation and pseudouridylation and regulate the expression patterns of their host genes. This study investigated the expression patterns of snoRNAs in eight major cancer types in TCGA using several machine learning algorithms. The expression levels of snoRNAs were first analyzed with a powerful feature selection method, Monte Carlo feature selection (MCFS), which yielded a feature list and a set of informative features. Incremental feature selection (IFS) was then applied to the feature list to extract the optimal features/snoRNAs, that is, those on which the support vector machine (SVM) yields its best performance. The discriminative snoRNAs included HBII-52-14, HBII-336, SNORD123, HBII-85-29, HBII-420, U3, HBI-43, SNORD116, SNORA73B, SCARNA4, HBII-85-20, etc.; with them, the SVM achieved a Matthews correlation coefficient (MCC) of 0.881 in predicting the eight cancer types. In addition, the informative features were fed into the Johnson reducer and the repeated incremental pruning to produce error reduction (RIPPER) algorithm to generate classification rules, which clearly show the different snoRNA expression patterns in different cancer types. The results indicate that the extracted discriminative snoRNAs can be important for identifying cancer samples of different types, and that the expression patterns of snoRNAs in different cancer types can be partly uncovered by quantitative recognition rules.
Keywords: Monte Carlo feature selection; RIPPER algorithm; cancer type; snoRNA; support vector machine
Year: 2019 PMID: 31052553 PMCID: PMC6539089 DOI: 10.3390/ijms20092185
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
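The abstract reports a single MCC value for an eight-class problem, so some multiclass form of the coefficient must be in use. The sketch below assumes the common multiclass generalization computed directly from a confusion matrix (rows are true classes, columns are predicted classes); the record itself does not state which formula the authors applied, and the function name `multiclass_mcc` is illustrative.

```python
import math


def multiclass_mcc(confusion):
    """Multiclass MCC from a square confusion matrix.

    Assumption: confusion[i][j] counts samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    # Row sums: how many samples truly belong to each class.
    true_counts = [sum(confusion[k]) for k in range(n)]
    # Column sums: how many samples were predicted as each class.
    pred_counts = [sum(confusion[i][k] for i in range(n)) for k in range(n)]

    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts)
    )
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Degenerate case (e.g. every sample predicted as one class).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy two-class matrix, not data from the study.
    print(round(multiclass_mcc([[5, 1], [2, 4]]), 4))
```

For two classes this expression reduces to the familiar binary MCC, which makes it easy to sanity-check against hand calculations.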
Results of 10-fold cross-validation by using 69 produced classification rules and the optimal support vector machine (SVM) classifier.
| Cancer Type | Classification Rules: TPR † | Classification Rules: FPR ‡ | Optimal SVM Classifier: TPR † | Optimal SVM Classifier: FPR ‡ |
|---|---|---|---|---|
| HNSC | 0.683 | 0.040 | 0.877 | 0.023 |
| KIRC | 0.815 | 0.046 | 0.927 | 0.003 |
| LGG | 0.928 | 0.012 | 0.979 | 0.003 |
| LUAD | 0.550 | 0.051 | 0.839 | 0.034 |
| LUSC | 0.595 | 0.047 | 0.731 | 0.023 |
| PRAD | 0.868 | 0.019 | 0.978 | 0.023 |
| THCA | 0.884 | 0.033 | 0.966 | 0.004 |
| UCEC | 0.684 | 0.038 | 0.869 | 0.006 |
† True positive rate, ‡ False positive rate.
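Per-class TPR and FPR values such as those in the table above are typically derived one-vs-rest from a multiclass confusion matrix. The following is a minimal sketch of that calculation, assuming rows are true classes and columns are predicted classes; the function name `per_class_rates` and the toy matrix are illustrative and are not taken from the study.

```python
def per_class_rates(confusion):
    """Return a list of (TPR, FPR) pairs, one per class, one-vs-rest.

    Assumption: confusion[i][j] counts samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    rates = []
    for k in range(n):
        tp = confusion[k][k]
        actual_pos = sum(confusion[k])                     # true class k
        predicted_pos = sum(confusion[i][k] for i in range(n))
        fp = predicted_pos - tp                            # other classes called k
        actual_neg = total - actual_pos                    # all samples not in k
        tpr = tp / actual_pos if actual_pos else 0.0
        fpr = fp / actual_neg if actual_neg else 0.0
        rates.append((tpr, fpr))
    return rates


if __name__ == "__main__":
    # Toy three-class matrix, not data from the study.
    toy = [[8, 1, 1], [2, 7, 1], [0, 2, 8]]
    for label, (tpr, fpr) in zip("ABC", per_class_rates(toy)):
        print(label, round(tpr, 3), round(fpr, 3))
```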
Figure 1. Confusion matrices for 10-fold cross-validation based on two classifiers. (A) The confusion matrix yielded by the 69 produced classification rules for classifying samples from 8 cancer types. The numbers were pooled by running 10-fold cross-validation on the training data thrice. (B) The confusion matrix yielded by the optimal support vector machine (SVM) classifier. The numbers were pooled by running 10-fold cross-validation on the training data once.
Figure 2. IFS curves derived from the IFS method and SVM classifiers. The x-axis is the number of features used to build the classifiers, and the y-axis is the corresponding Matthews correlation coefficient (MCC). (A) The IFS curve yielded by 10-fold cross-validation. (B) The IFS curve yielded by 5-fold cross-validation.
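An IFS curve of this kind can be produced by scoring classifiers built on progressively longer prefixes of the ranked feature list and keeping the prefix with the best score. The sketch below shows only that loop. In the study the score is the MCC of an SVM under cross-validation, whereas here `score_fn` is a hypothetical callback supplied by the caller, and the `step` parameter and toy scores are assumptions for illustration.

```python
def incremental_feature_selection(ranked_features, score_fn, step=1):
    """Evaluate top-k feature subsets and return the best one.

    ranked_features: features ordered from most to least important.
    score_fn: hypothetical callback mapping a feature subset to a score
              (for example, cross-validated MCC of a classifier).
    Returns (best_subset, best_score, curve), where curve is a list of
    (k, score) points suitable for plotting.
    """
    curve = []
    best_k, best_score = 0, float("-inf")
    for k in range(step, len(ranked_features) + 1, step):
        score = score_fn(ranked_features[:k])
        curve.append((k, score))
        # Strict comparison keeps the smallest subset on ties.
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score, curve


if __name__ == "__main__":
    # Toy ranking and made-up scores, not results from the study.
    ranked = ["f1", "f2", "f3", "f4"]
    toy_scores = {1: 0.50, 2: 0.80, 3: 0.70, 4: 0.80}
    subset, score, curve = incremental_feature_selection(
        ranked, lambda feats: toy_scores[len(feats)]
    )
    print(subset, score, curve)
```

Preferring the smallest subset on ties is a design choice of this sketch; the source does not say how ties were resolved.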
The highest and lowest accuracies for each cancer type yielded by the optimal SVM classifier across the ten cross-validation folds.
| Cancer Type | Highest Accuracy | Lowest Accuracy |
|---|---|---|
| HNSC | 0.930 | 0.807 |
| KIRC | 0.949 | 0.897 |
| LGG | 1.000 | 0.922 |
| LUAD | 0.875 | 0.786 |
| LUSC | 0.808 | 0.654 |
| PRAD | 1.000 | 0.925 |
| THCA | 1.000 | 0.931 |
| UCEC | 0.930 | 0.789 |
Sample sizes of eight major cancer types.
| Cancer Type | Name | Number of Samples |
|---|---|---|
| HNSC | Head and neck squamous cell carcinoma | 567 |
| KIRC | Kidney renal clear cell carcinoma | 587 |
| LGG | Lower grade glioma | 512 |
| LUAD | Lung adenocarcinoma | 559 |
| LUSC | Lung squamous cell carcinoma | 521 |
| PRAD | Prostate adenocarcinoma | 535 |
| THCA | Thyroid carcinoma | 581 |
| UCEC | Uterine corpus endometrial carcinoma | 571 |
| Total | --- | 4433 |