Ying Li, Lizheng Wei, Cankun Wang, Jianing Zhao, Siyu Han, Yu Zhang, Wei Du.
Abstract
BACKGROUND: Long non-coding RNAs (lncRNAs) play important roles in physiological and pathological processes. Identifying lncRNA-protein interactions (LPIs) is essential for understanding the molecular mechanisms of lncRNAs and inferring their functions. Given the overwhelming size of the biomedical literature, extracting LPIs directly from text is both promising and challenging. However, no webserver currently exists for extracting LPI relationships from the literature.
Keywords: Corpus; Logistic regression; Multiple text features; Named entity recognition; lncRNA–protein interaction
Year: 2022 PMID: 35428172 PMCID: PMC9013167 DOI: 10.1186/s12859-022-04665-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 The flowchart of LPInsider. The flowchart consists of six parts: (1) entering the literature text, (2) text preprocessing, (3) named entity recognition, (4) extracting text features, (5) classification by logistic regression, and (6) the outcome, including entity information and the judgment of positive or negative samples. For example, after a user types “BC1 RNA associates with Pura”, the webserver recognizes that BC1 is a lncRNA and Pura is a protein, finds the interaction between BC1 and Pura, and returns the result to the user.
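The six steps in the flowchart can be illustrated with a minimal sketch. This is not LPInsider's implementation: the dictionaries, the keyword list, and the function names `tokenize`, `recognize_entities`, and `analyze` are all hypothetical, dictionary lookup stands in for the trained named entity recognizer, and a simple keyword rule stands in for feature extraction plus logistic regression.

```python
import re

# Hypothetical toy dictionaries (assumption); a real system would draw
# entity names from curated lncRNA and protein databases.
LNCRNA_NAMES = {"BC1", "H19", "MALAT1"}
PROTEIN_NAMES = {"Pura", "Igf2"}
INTERACTION_KEYWORDS = {"associates", "binds", "interacts"}


def tokenize(sentence):
    """Step 2 (preprocessing): split the sentence into word tokens."""
    return re.findall(r"[A-Za-z0-9\-]+", sentence)


def recognize_entities(tokens):
    """Step 3 (NER): dictionary lookup standing in for a trained model."""
    lncrnas = [t for t in tokens if t in LNCRNA_NAMES]
    proteins = [t for t in tokens if t in PROTEIN_NAMES]
    return lncrnas, proteins


def analyze(sentence):
    """Steps 1-6 in miniature: text in, entities and a judgment out."""
    tokens = tokenize(sentence)
    lncrnas, proteins = recognize_entities(tokens)
    has_keyword = any(t.lower() in INTERACTION_KEYWORDS for t in tokens)
    # Steps 4-5 are replaced by a keyword rule in this sketch.
    interaction = bool(lncrnas and proteins and has_keyword)
    return {"lncRNAs": lncrnas, "proteins": proteins, "interaction": interaction}


result = analyze("BC1 RNA associates with Pura")
```

With the caption's example sentence, `result` lists BC1 as a lncRNA, Pura as a protein, and flags an interaction.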
Statistics of lncRNAs and proteins
| Database | lncRNAs | Proteins |
|---|---|---|
| HGNC | 11,513 | 0 |
| GENCODE | 6424 | 0 |
| LncRNADisease | 373 | 0 |
| Lnc2Cancer | 1618 | 0 |
| RAID v2.0 | 3460 | 10,968 |
| LncRInter | 277 | 318 |
| NPInter | 76,870 | 7442 |
| RPISeq | 0 | 2043 |
| UniProt | 0 | 83,692 |
| STRING | 0 | 21,129,733 |
The comparison of the two types of negative samples
| lncRNA | Protein | Interaction keyword | Negative word | Example |
|---|---|---|---|---|
| True | True | False | / | There was no significant change in Igf2 or H19 expression in brain |
| True | True | True | True | We found no association between the FISH results and MALAT1 expression in patients |
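The two rows of the table can be read as a small decision rule. The sketch below encodes that reading; the function name and the labels "type 1" and "type 2" are assumptions introduced here for illustration, and the "/" in the table is interpreted as "not applicable".

```python
def negative_sample_type(has_lncrna, has_protein, has_keyword, has_negative_word):
    """Classify a sentence into one of the two negative-sample types.

    Returns "type 1", "type 2", or None when the sentence matches neither row.
    """
    if not (has_lncrna and has_protein):
        return None  # both entity types are required in either row
    if not has_keyword:
        return "type 1"  # entities present, no interaction keyword
    if has_negative_word:
        return "type 2"  # interaction keyword negated by a negative word
    return None  # keyword without negation: not a negative sample by this rule


# First table row: no interaction keyword, negative word not applicable.
first_row = negative_sample_type(True, True, False, False)
# Second table row: interaction keyword plus a negative word ("no association").
second_row = negative_sample_type(True, True, True, True)
```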
The results of tenfold cross-validation verification on LPIs Corpus, IntAct and LncRNADisease
| Database | Type | Precision | Recall | f1-score |
|---|---|---|---|---|
| LPIs Corpus | lncRNA | 0.9541 | 0.9836 | 0.9686 |
| LPIs Corpus | protein | 0.71857 | 0.8727 | 0.7881 |
| LncRNADisease | lncRNA | 0.5597 | 0.5174 | 0.5378 |
| IntAct | protein | 0.4555 | 0.2300 | 0.3057 |
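As a sanity check on the table, the f1-score is the harmonic mean of precision and recall, $F_1 = 2PR/(P+R)$. The sketch below recomputes it from the reported precision and recall. Small gaps are expected, since averaging f1 across cross-validation folds need not equal the f1 of the averaged precision and recall; the tolerance used is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1-score) copied from the table above.
rows = {
    ("LPIs Corpus", "lncRNA"): (0.9541, 0.9836, 0.9686),
    ("LPIs Corpus", "protein"): (0.71857, 0.8727, 0.7881),
    ("LncRNADisease", "lncRNA"): (0.5597, 0.5174, 0.5378),
    ("IntAct", "protein"): (0.4555, 0.2300, 0.3057),
}

# Largest absolute difference between recomputed and reported f1-scores.
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in rows.values())
```

Every reported f1-score agrees with the recomputed value to within about 0.0001.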
Fig. 2 Screenshot of LPInsider’s webserver
Fig. 3 Screenshot of a part of the result returned by webserver
Fig. 4 Example of syntactic structure vector
Using multiple text features
| Features | Classifier | Accuracy | Precision | Recall | f1-score |
|---|---|---|---|---|---|
| Semantic word vector | LGBM | 0.84920 | 0.86992 | 0.83037 | 0.84754 |
| SVM | 0.85534 | 0.75734 | 0.84071 | ||
| Logistic regression | 0.91635 | ||||
| | Random forest | 0.81373 | 0.84939 | 0.77344 | 0.80728 |
| | Xgboost | 0.86683 | 0.87700 | 0.85637 | 0.86522 |
| Semantic word vectors and syntactic structure vectors | LGBM | 0.85659 | 0.88094 | 0.83152 | 0.85391 |
| SVM | 0.88173 | 0.81453 | 0.87453 | ||
| Logistic regression | 0.92684 | ||||
| | Random forest | 0.82158 | 0.86687 | 0.76823 | 0.81217 |
| | Xgboost | 0.87882 | 0.89386 | 0.86452 | 0.87787 |
| Semantic word vectors, syntactic structure vectors and distance vectors | LGBM | 0.87476 | 0.89739 | 0.85298 | 0.87276 |
| SVM | 0.89080 | 0.83035 | 0.88486 | ||
| Logistic regression | 0.93218 | ||||
| | Random forest | 0.82360 | 0.86757 | 0.77133 | 0.81454 |
| | Xgboost | 0.88382 | 0.90279 | 0.86435 | 0.88169 |
| Semantic word vectors, syntactic structure vectors, distance vectors and part-of-speech vectors | LGBM | 0.89205 | 0.91488 | 0.87007 | 0.89046 |
| SVM | 0.91719 | 0.88221 | 0.91513 | ||
| Logistic regression | 0.93304 | ||||
| | Random forest | 0.84216 | 0.87469 | 0.80599 | 0.83753 |
| | Xgboost | 0.89164 | 0.91930 | 0.86366 | 0.88926 |
Bold indicates the better experimental results
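The idea of combining several text features and feeding them to logistic regression can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the vector sizes and toy values are invented, the features are assumed to be combined by simple concatenation, and plain batch gradient descent stands in for whatever solver the original system uses.

```python
import math


def concat_features(semantic, syntactic, distance, pos):
    """Join the four per-sample vectors into one flat feature vector."""
    return list(semantic) + list(syntactic) + list(distance) + list(pos)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (interaction) when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical toy samples: (semantic, syntactic, distance, part of speech, label).
samples = [
    ([0.9, 0.8], [1, 0], [0.2], [1, 0], 1),
    ([0.8, 0.7], [1, 0], [0.3], [1, 0], 1),
    ([0.1, 0.2], [0, 1], [0.8], [0, 1], 0),
    ([0.2, 0.1], [0, 1], [0.9], [0, 1], 0),
]
X = [concat_features(*s[:4]) for s in samples]
y = [s[4] for s in samples]
weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

Dropping one of the arguments to `concat_features` mimics the smaller feature sets compared in the table.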
The performances of LPInsider and three deep learning methods
| Classifier | Accuracy | Precision | Recall | f1-score |
|---|---|---|---|---|
| textCNN | 0.85935 | 0.86855 | 0.85747 | 0.86022 |
| Capsule network | 0.71352 | 0.71590 | 0.84718 | 0.75284 |
| LSTM | 0.89497 | 0.88557 | 0.90181 | |
| LPInsider | 0.90380 |
Bold indicates the better experimental results
Fig. 5 P-R curves of LPInsider with four machine learning and three deep learning models when using multiple text features
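A P-R curve is traced by sweeping a decision threshold over a classifier's scores and recording precision and recall at each step. The sketch below shows one simple way to compute the points. The function name, scores, and labels are hypothetical, and tied scores are not grouped, which a production implementation would handle.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs as the threshold moves down the ranking."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    tp = fp = 0
    points = []
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Hypothetical classifier scores and true labels (1 = interaction).
points = precision_recall_points([0.9, 0.8, 0.6, 0.4], [1, 0, 1, 0])
```

Plotting recall on the x-axis against precision on the y-axis for each model gives curves of the kind the figure compares.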