Lei Chen, Yu-Hang Zhang, Xiaoyong Pan, Min Liu, Shaopeng Wang, Tao Huang, Yu-Dong Cai.
Abstract
Messenger RNA (mRNA) and long noncoding RNA (lncRNA) are two main subgroups of RNA that participate in transcription regulation. With the development of next-generation sequencing, an increasing number of lncRNAs have been identified, and many of their hidden functions have been revealed. However, the differences between lncRNAs and mRNAs remain unclear. For example, it is not known whether lncRNAs have stronger tissue specificity than mRNAs, or which tissues express more lncRNAs. To investigate such tissue expression differences between mRNAs and lncRNAs, we encoded 9339 lncRNAs and 14,294 mRNAs with 71 expression features: 69 maximum expression features for 69 types of cells, one feature for the maximum expression in all cells, and one expression specificity feature measured as the Chao-Shen-corrected Shannon entropy. Using feature selection methods (maximum relevance minimum redundancy and incremental feature selection) together with the random forest algorithm, we identified 13 features that capture the dissimilarity between lncRNAs and mRNAs. The 11 cell subtype features indicate the cell types in which lncRNAs and mRNAs differ most in expression; such cell subtypes may serve as cell models for lncRNA identification and functional investigation. The expression specificity feature suggests that mRNAs and lncRNAs are expressed in different cell types, and the maximum expression feature suggests that their maximum expression levels differ. In addition, a rule learning algorithm, repeated incremental pruning to produce error reduction (RIPPER), was employed to produce classification rules for distinguishing lncRNAs from mRNAs. These rules gave results competitive with random forest and provide a clearer picture of the different expression patterns of lncRNAs and mRNAs.
The results not only revealed the heterogeneous expression patterns of lncRNAs and mRNAs, but also gave rise to the development of a new tool to identify the potential biological functions of such RNA subgroups.
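The expression specificity feature is described as a Chao-Shen-corrected Shannon entropy. As a rough illustration of that estimator only, the following sketch computes it from a vector of non-negative integer counts (for example, one count per cell type). How the authors converted expression values into counts, and whether they rescaled the result, is not stated here, so the input format and the function name `chao_shen_entropy` are assumptions.

```python
import math

def chao_shen_entropy(counts):
    """Chao-Shen coverage-adjusted Shannon entropy (in nats).

    `counts` is assumed to be a list of non-negative integer counts,
    one per category (hypothetical input format).
    """
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    if n == 0:
        return 0.0
    singletons = sum(1 for c in counts if c == 1)
    if singletons == n:
        # Avoid an estimated coverage of zero when every count is 1.
        singletons = n - 1
    coverage = 1.0 - singletons / n  # Good-Turing sample coverage estimate
    entropy = 0.0
    for c in counts:
        p = coverage * c / n  # coverage-adjusted probability
        # Horvitz-Thompson correction: divide by the probability that
        # this category is observed at all in a sample of size n.
        entropy -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return entropy
```

Under this sketch, a profile concentrated in one category yields a low entropy, while a profile spread evenly across categories yields a high one.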
Keywords: cell type; expression specificity; feature selection; lncRNA; mRNA
Year: 2018 PMID: 30384456 PMCID: PMC6274976 DOI: 10.3390/ijms19113416
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Overall procedure of the computational scheme for investigating lncRNAs and mRNAs with expression features. The 71 expression features were analyzed by the maximum relevance minimum redundancy (mRMR) method, resulting in an mRMR feature list. Then, the incremental feature selection (IFS) method with the random forest (RF) algorithm was used to extract the optimal features and build the optimal RF classifier. At the same time, the IFS method with the repeated incremental pruning to produce error reduction (RIPPER) algorithm was adopted to learn classification rules.
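The mRMR step in this scheme ranks features by trading off relevance to the class label against redundancy with features already chosen. The following is a minimal sketch of one common greedy formulation (relevance minus mean redundancy, both measured by mutual information on discrete values). It is not the authors' implementation: the discretized inputs, the difference-based score, and the names `mutual_information` and `mrmr_rank` are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

def mrmr_rank(features, labels):
    """Greedy mRMR ordering.

    `features` maps a feature name to a list of discrete values
    (continuous expression values would need to be discretized first).
    """
    remaining = list(features)
    relevance = {f: mutual_information(features[f], labels) for f in remaining}
    ranked = []

    def score(f):
        if not ranked:
            return relevance[f]
        redundancy = sum(
            mutual_information(features[f], features[s]) for s in ranked
        ) / len(ranked)
        return relevance[f] - redundancy

    while remaining:
        best = max(remaining, key=score)  # ties go to the earliest feature
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

In this formulation, a feature that duplicates an already selected one is pushed down the list even when it is highly relevant on its own.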
Figure 2. IFS curves based on the results predicted by the IFS method with five different classification algorithms. The X-axis represents the number of features used in the classification, and the Y-axis represents the Matthews correlation coefficient (MCC). The optimal MCCs (marked with red diamonds) for random forest (RF), repeated incremental pruning to produce error reduction (RIPPER), the nearest neighbor algorithm (1-NN), support vector machine (SVM), and logistic regression (LR) are 0.895, 0.888, 0.818, 0.622, and 0.806, respectively. For RF, the MCC first exceeds 0.880 when the top 13 features are used, whereas for RIPPER, the MCC first reaches 0.870 when the top 10 features are used.
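The IFS curves described here can be thought of as scoring every top-k prefix of the ranked feature list and then reading off either the best k or the smallest k that reaches a chosen score. The sketch below shows that loop in isolation. The `evaluate` callback stands in for whatever classifier evaluation is used (for instance, a cross-validated MCC), and all function names are hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score each top-k prefix of a ranked feature list.

    `evaluate` is a caller-supplied function (an assumption of this sketch)
    that takes a list of feature names and returns a score such as MCC.
    Returns the full curve, the best k, and the best score.
    """
    curve = []
    for k in range(1, len(ranked_features) + 1):
        curve.append((k, evaluate(ranked_features[:k])))
    best_k, best_score = max(curve, key=lambda point: point[1])
    return curve, best_k, best_score

def first_k_reaching(curve, threshold):
    """Smallest k whose score is at least `threshold`, or None if never reached."""
    for k, score in curve:
        if score >= threshold:
            return k
    return None
```

Choosing the smallest k that reaches a threshold, rather than the global optimum, trades a small loss in score for a much shorter feature list.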
Performance of the optimal random forest (RF), repeated incremental pruning to produce error reduction (RIPPER), nearest neighbor algorithm (1-NN), support vector machine (SVM), and logistic regression (LR) classifiers.
| Classification Algorithm | Number of Used Features | SN | SP | ACC | MCC |
|---|---|---|---|---|---|
| RF | 53 | | 0.940 | | 0.895 |
| RIPPER | 70 | 0.952 | | 0.946 | 0.888 |
| 1-NN | 69 | 0.932 | 0.896 | 0.911 | 0.818 |
| SVM | 71 | 0.758 | 0.861 | 0.820 | 0.622 |
| LR | 19 | 0.944 | 0.876 | 0.903 | 0.806 |
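The four measures in this table (SN, SP, ACC, MCC) are standard functions of the binary confusion matrix. As a minimal sketch, they can be computed as follows. Which class is treated as positive is not stated in this excerpt, so the sketch simply takes the four counts as given, and the function name `binary_metrics` is hypothetical.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    sn = tp / (tp + fn)                    # sensitivity (true positive rate)
    sp = tn / (tn + fp)                    # specificity (true negative rate)
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}
```

Unlike accuracy, MCC accounts for all four cells of the confusion matrix, which makes it more informative when the two classes differ in size, as they do here (9339 lncRNAs versus 14,294 mRNAs).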
Top 13 features in the maximum relevance minimum redundancy (mRMR) feature list.
| No. | Feature Name |
|---|---|
| 1 | Expression specificity |
| 2 | Intestinal epithelial cell |
| 3 | Neutrophil |
| 4 | Hepatocyte |
| 5 | Mast cell |
| 6 | Fibroblast of the conjuctiva |
| 7 | Reticulocyte |
| 8 | Mesenchymal cell |
| 9 | Lymphocyte of b lineage |
| 10 | Neuronal stem cell |
| 11 | Macrophage |
| 12 | Pericyte cell |
| 13 | Max cpm in all facet |
Figure 3. A box plot illustrating the performance, measured by the Matthews correlation coefficient, of the random forest (RF) and repeated incremental pruning to produce error reduction (RIPPER) algorithms on 1000 randomly produced feature subsets. For RF, each feature subset contained the expression specificity feature and 12 other features randomly selected from the remaining 70 features, while for RIPPER, each feature subset contained the expression specificity feature and 9 other features randomly selected from the remaining 70 features.
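The random baseline in this figure can be reproduced in outline by drawing feature subsets that always include one fixed feature plus a set number of others sampled without replacement. The sketch below generates such subsets only; it does not train any classifier. The placeholder feature names, the seed, and the function name are assumptions.

```python
import random

def random_subsets_with_fixed(all_features, fixed, n_random, n_subsets, seed=0):
    """Draw feature subsets that each contain `fixed` plus `n_random` others.

    The other features are sampled without replacement from the features
    that remain after excluding `fixed`.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    pool = [f for f in all_features if f != fixed]
    return [[fixed] + rng.sample(pool, n_random) for _ in range(n_subsets)]

# Placeholder names for the 70 non-specificity features (hypothetical).
features = ["Expression specificity"] + [f"feature_{i}" for i in range(1, 71)]
rf_subsets = random_subsets_with_fixed(features, "Expression specificity", 12, 1000)
ripper_subsets = random_subsets_with_fixed(features, "Expression specificity", 9, 1000)
```

Each subset would then be scored with the same classifier and evaluation protocol as the selected features, so that the box plot compares like with like.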
Eighteen classification rules yielded by repeated incremental pruning to produce error reduction (RIPPER) on the top ten features.
| Rule Number | Condition | Outcome |
|---|---|---|
| Rule-1 | (Neuronal stem cell ≤ 2.58) and (Mesenchymal cell ≤ 1.44) and (Intestinal epithelial cell ≤ 0) and (Mesenchymal cell ≤ 0.197) and (Expression specificity ≥ 0.552495) | lncRNA |
| Rule-2 | (Neuronal stem cell ≤ 2.58) and (Hepatocyte ≤ 0) and (Mast cell ≤ 0.151) | lncRNA |
| Rule-3 | (Neuronal stem cell ≤ 2.58) and (Hepatocyte ≤ 0) and (Intestinal epithelial cell ≤ 0.246) and (Mesenchymal cell ≤ 2.04) and (Neuronal stem cell ≤ 0) | lncRNA |
| Rule-4 | (Neuronal stem cell ≤ 5.17) and (Mast cell ≤ 0.842) and (Intestinal epithelial cell ≤ 0.737) and (Neuronal stem cell ≤ 0.542) and (Mesenchymal cell ≤ 2.75) | lncRNA |
| Rule-5 | (Neuronal stem cell ≤ 2.58) and (Hepatocyte ≤ 1.73) and (Intestinal epithelial cell ≤ 0.246) and (Mesenchymal cell ≤ 2.63) and (Neuronal stem cell ≤ 0) | lncRNA |
| Rule-6 | (Neuronal stem cell ≤ 2.58) and (Hepatocyte ≤ 1.73) and (Mast cell ≤ 0.352) and (Expression specificity ≤ 0.299388) and (Mast cell ≤ 0) | lncRNA |
| Rule-7 | (Neuronal stem cell ≤ 5.17) and (Hepatocyte ≤ 1.73) and (Intestinal epithelial cell ≤ 0.246) and (Expression specificity ≥ 0.497214) and (Lymphocyte of b lineage ≥ 1.23) | lncRNA |
| Rule-8 | (Neuronal stem cell ≤ 2.58) and (Hepatocyte ≤ 1.73) and (Mesenchymal cell ≤ 1.44) and (Intestinal epithelial cell ≤ 0) | lncRNA |
| Rule-9 | (Neuronal stem cell ≤ 5.17) and (Mesenchymal cell ≤ 4.52) and (Reticulocyte ≤ 0) and (Fibroblast of the conjuctiva ≤ 0) and (Hepatocyte ≤ 3.46) and (Expression specificity ≥ 0.311106) and (Intestinal epithelial cell ≤ 4.18) and (Neuronal stem cell ≤ 2.71) | lncRNA |
| Rule-10 | (Neuronal stem cell ≤ 5.17) and (Mast cell ≤ 0.842) and (Intestinal epithelial cell ≤ 1.72) and (Hepatocyte ≤ 0) and (Fibroblast of the conjuctiva ≤ 0) | lncRNA |
| Rule-11 | (Neuronal stem cell ≤ 5.17) and (Hepatocyte ≤ 5.19) and (Intestinal epithelial cell ≤ 1.23) and (Neuronal stem cell ≤ 0.542) and (Mesenchymal cell ≤ 10.1) and (Mast cell ≤ 0.907) | lncRNA |
| Rule-12 | (Neuronal stem cell ≤ 2.58) and (Mesenchymal cell ≤ 4.5) and (Intestinal epithelial cell ≤ 0.983) and (Hepatocyte ≤ 0) and (Fibroblast of the conjuctiva ≤ 0) | lncRNA |
| Rule-13 | (Neuronal stem cell ≤ 5.17) and (Mesenchymal cell ≤ 5.18) and (Mesenchymal cell ≤ 1.45) and (Neutrophil ≥ 0.848) and (Expression specificity ≤ 0.333673) and (Mesenchymal cell ≤ 0.904) and (Fibroblast of the conjuctiva ≤ 0) | lncRNA |
| Rule-14 | (Neuronal stem cell ≤ 5.17) and (Mesenchymal cell ≤ 5.18) and (Neuronal stem cell ≤ 2.58) and (Intestinal epithelial cell ≤ 0.737) and (Neuronal stem cell ≤ 0) and (Expression specificity ≤ 0.517863) and (Mesenchymal cell ≥ 2.87) | lncRNA |
| Rule-15 | (Neuronal stem cell ≤ 5.17) and (Mast cell ≤ 1.06) and (Hepatocyte ≤ 5.19) and (Neuronal stem cell ≤ 0.542) and (Intestinal epithelial cell ≤ 6.88) and (Mesenchymal cell ≤ 28.5) | lncRNA |
| Rule-16 | (Neuronal stem cell ≤ 5.17) and (Mesenchymal cell ≤ 5.28) and (Fibroblast of the conjuctiva ≤ 0) and (Mast cell ≤ 0.842) and (Intestinal epithelial cell ≤ 11.8) and (Neuronal stem cell ≤ 2.58) and (Hepatocyte ≤ 20.1) | lncRNA |
| Rule-17 | (Neuronal stem cell ≤ 7.75) and (Intestinal epithelial cell ≤ 1.72) and (Mesenchymal cell ≤ 4.52) and (Neuronal stem cell ≤ 2.58) and (Lymphocyte of b lineage ≤ 4.07) and (Neutrophil ≥ 5.72) and (Mast cell ≤ 19) and (Expression specificity ≥ 0.115115) | lncRNA |
| Rule-18 | Otherwise | mRNA |
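A rule list of this kind can be applied by checking the rules in order and returning the outcome of the first rule whose conditions all hold, falling back to the default ("Otherwise") outcome. The sketch below encodes only Rule-2 and Rule-8 from the table as examples; the data structure and the function name `classify` are assumptions, and a full implementation would need all 17 conditional rules.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge}

# Two rules copied from the table above; the remaining rules are omitted here.
RULES = [
    ("Rule-2", [("Neuronal stem cell", "<=", 2.58),
                ("Hepatocyte", "<=", 0),
                ("Mast cell", "<=", 0.151)]),
    ("Rule-8", [("Neuronal stem cell", "<=", 2.58),
                ("Hepatocyte", "<=", 1.73),
                ("Mesenchymal cell", "<=", 1.44),
                ("Intestinal epithelial cell", "<=", 0)]),
]

def classify(profile, rules=RULES, default="mRNA"):
    """Return (label, rule name) for the first rule whose conditions all hold.

    `profile` maps feature names to expression feature values.
    """
    for name, conditions in rules:
        if all(OPS[op](profile[feat], thr) for feat, op, thr in conditions):
            return "lncRNA", name
    return default, "Otherwise"
```

For example, a profile with a neuronal stem cell value of 1.0, a hepatocyte value of 0, and a mast cell value of 0.1 satisfies Rule-2 and would be labeled lncRNA by this sketch.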
Figure 4. The rule learning procedure of the repeated incremental pruning to produce error reduction (RIPPER) algorithm [85].