Andre Lamurias, Luka A Clarke, Francisco M Couto.
Abstract
Many biomedical relation extraction approaches are based on supervised machine learning, requiring an annotated corpus. Distant supervision aims at training a classifier by combining a knowledge base with a corpus, reducing the amount of manual effort necessary. This is particularly useful for biomedicine because many databases and ontologies have been made available for many biological processes, while the availability of annotated corpora is still limited. We studied the extraction of microRNA-gene relations from text. MicroRNA regulation is an important biological process due to its close association with human diseases. The proposed method, IBRel, is based on distantly supervised multi-instance learning. We evaluated IBRel on three datasets, and the results were compared with a co-occurrence approach as well as a supervised machine learning algorithm. While supervised learning outperformed on two of those datasets, IBRel obtained an F-score 28.3 percentage points higher on the dataset for which there was no training set developed specifically. To demonstrate the applicability of IBRel, we used it to extract 27 miRNA-gene relations from recently published papers about cystic fibrosis. Our results demonstrate that our method can be successfully used to extract relations from literature about a biological process without an annotated corpus. The source code and data used in this study are available at https://github.com/AndreLamurias/IBRel.
Year: 2017 PMID: 28263989 PMCID: PMC5338769 DOI: 10.1371/journal.pone.0171929
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Corpora used to develop and evaluate the system.
Each line refers to a corpus, how it was used (Dev: development; Eval: evaluation; NER: Named Entity Recognition; RE: Relation extraction), the total number of relevant entities and relations annotated, and the number of documents.
| Corpus | NER Dev | NER Eval | RE Dev | RE Eval | Entities | Relations | Documents |
|---|---|---|---|---|---|---|---|
| Bagewadi’s | X | X | X | X | 1963 | 318 | 301 |
| miRTex | | X | X | X | 1245 | 771 | 350 |
| TransmiR | | X | | X | 1145 | 547 | 243 |
| IBRel-miRNA | | | X | | 52970 | NA | 4000 |
| IBRel-CF | | | | X | 612 | NA | 51 |
Fig 1. Pipeline used to perform the experiments.
The input text (A) first goes through natural language processing tools to generate token features (B), then a named entity recognition module (C) to identify named entities and finally relation extraction (D) to extract relations between entities. Bagewadi (E), miRTex (F), TransmiR (G) and IBRel-miRNA (H) refer to the four corpora previously described.
Fig 2. Multi-instance learning bags.
For each sentence, we generated bags according to the distinct miRNA-gene pairs mentioned in the text. If a pair exists in the reference database, the bag is labeled as positive. Multi-instance learning assumes that at least one of the instances of a positive bag should describe a true relation.
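The bag construction described in this caption can be sketched roughly as follows. This is a minimal illustration, not the IBRel implementation: the function name `build_bags`, the input layout, and the placeholder identifiers (`mir-A`, `GENE1`, ...) are all assumptions made for the example, and the reference database is reduced to a plain set of pairs.

```python
from collections import defaultdict


def build_bags(sentences, known_pairs):
    """Group candidate instances into bags keyed by (miRNA, gene) pair.

    sentences   -- iterable of (sentence_id, mirna_ids, gene_ids); the IDs
                   are assumed to be already normalized entity identifiers
    known_pairs -- set of (mirna_id, gene_id) tuples standing in for the
                   reference database (hypothetical, simplified)
    """
    instances = defaultdict(list)
    for sentence_id, mirnas, genes in sentences:
        # Every distinct miRNA-gene pair in a sentence yields one instance.
        for mirna in sorted(set(mirnas)):
            for gene in sorted(set(genes)):
                instances[(mirna, gene)].append(sentence_id)
    # A bag is positive when its pair appears in the reference database.
    return {
        pair: {"label": pair in known_pairs, "instances": ids}
        for pair, ids in instances.items()
    }


# Toy data with placeholder identifiers (not taken from the paper).
toy_sentences = [
    ("s1", ["mir-A"], ["GENE1", "GENE2"]),
    ("s2", ["mir-A"], ["GENE1"]),
    ("s3", ["mir-B"], ["GENE1"]),
]
toy_database = {("mir-A", "GENE1")}

bags = build_bags(toy_sentences, toy_database)
for pair, bag in bags.items():
    print(pair, bag["label"], bag["instances"])
```

Only the bag carries a label here; individual sentences are left unlabeled, which reflects the multi-instance assumption that at least one instance in a positive bag describes a true relation.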
Example of gene entities identified that were then matched with UniProt entries.
Entity text refers to the original text found in the abstract, while Entry name and Entry ID refer to UniProt entries.
| Entity text | Entry name | Entry ID |
|---|---|---|
| Smad | SMAD3_HUMAN | P84022 |
| N-Myc | NDRG1_HUMAN | Q92597 |
| Interferon regulatory factor 3 | IRF3_HUMAN | Q14653 |
| Egr-2 | EGR2_HUMAN | P11161 |
Example of miRNA entities identified that were then matched with miRBase entries.
Entity text refers to the original text found in the abstract, while Entry name and Entry ID refer to miRBase entries.
| Entity text | Entry name | Entry ID |
|---|---|---|
| miRNA-155 | hsa-miR-155 | MI0000681 |
| miR-200 | hsa-miR-200a | MI0000737 |
| miR125a | hsa-mir-125a | MI0000469 |
| microRNA-9 | hsa-mir-9 | MI0000466 |
miRNA-gene relations extraction evaluation results on each corpus, comparing co-occurrence, supervised and IBRel (window size = 3).
P, R and F refer to precision, recall and F-score.
| Gold standard | Co-occurrence P | Co-occurrence R | Co-occurrence F | SL kernel P | SL kernel R | SL kernel F | IBRel P | IBRel R | IBRel F |
|---|---|---|---|---|---|---|---|---|---|
| Bagewadi’s | 0.528 | 0.992 | 0.689 | 0.661 | 0.886 | | 0.493 | 0.577 | 0.532 |
| miRTex | 0.474 | 0.910 | 0.623 | 0.536 | 0.837 | | 0.583 | 0.285 | 0.383 |
| TransmiR | 0.147 | 0.851 | 0.250 | 0.238 | 0.090 | 0.130 | 0.359 | 0.486 | |
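Since F here is the balanced F-score, the harmonic mean of P and R, any F value in the table can be checked against the other two columns, and a missing one can be approximately recovered. The sketch below assumes that definition. It also assumes that the two IBRel values that survive on the TransmiR row are P and R, which is an inference from the numbers and not something the record states.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the table above.
reported = {
    ("Bagewadi's", "Co-occurrence"): (0.528, 0.992),
    ("Bagewadi's", "IBRel"): (0.493, 0.577),
    ("TransmiR", "SL kernel"): (0.238, 0.090),
    # Assumption: the two surviving IBRel values on TransmiR are P and R.
    ("TransmiR", "IBRel"): (0.359, 0.486),
}

for (corpus, method), (p, r) in reported.items():
    print(f"{corpus:12s} {method:14s} F = {f_score(p, r):.3f}")
```

Under that assumption the TransmiR F-score for IBRel comes out at about 0.413, roughly 28.3 points above the 0.130 reported for the SL kernel, which agrees with the gap quoted in the abstract. Recomputed values can differ from published ones in the last digit because P and R are themselves rounded.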
Entity recognition evaluation results on each corpus, for miRNA and gene entities.
P, R and F refer to precision, recall and F-score.
| Gold standard | miRNA P | miRNA R | miRNA F | Gene P | Gene R | Gene F |
|---|---|---|---|---|---|---|
| Bagewadi’s | 0.902 | 0.936 | 0.919 | 0.814 | 0.580 | 0.677 |
| miRTex | 0.934 | 0.948 | 0.941 | 0.803 | 0.788 | 0.795 |
| TransmiR | 0.726 | 0.651 | 0.687 | 0.255 | 0.618 | 0.361 |
miRNA-gene relations extracted from the IBRel-CF corpus using IBRel, ordered by maximum confidence level.
| miRNA | Gene | Sentences | Documents | Max. Confidence | Correct |
|---|---|---|---|---|---|
| hsa-mir-494 | CFTR | 10 | 5 | 0.996 | Y |
| hsa-mir-93 | CXCL8 | 6 | 1 | 0.978 | Y |
| hsa-mir-101-1 | CFTR | 8 | 3 | 0.96 | Y |
| hsa-mir-224 | SLC4A4 | 5 | 1 | 0.937 | Y |
| hsa-mir-145 | CFTR | 5 | 3 | 0.871 | Y |
| hsa-mir-193b | BRCA1 | 2 | 1 | 0.86 | N |
| hsa-mir-193b | CFTR | 2 | 1 | 0.857 | Y |
| hsa-mir-155 | AKT1 | 4 | 1 | 0.828 | Y |
| hsa-miR-199a-5p | AKT1 | 5 | 1 | 0.807 | Y |
| hsa-mir-183 | IDH2 | 3 | 1 | 0.763 | Y |
| hsa-mir-155 | CXCL8 | 5 | 2 | 0.736 | Y |
| hsa-mir-125b-1 | CFTR | 4 | 1 | 0.709 | Y |
| hsa-mir-125a | LIN28A | 5 | 1 | 0.705 | N |
| hsa-mir-224 | CFTR | 4 | 1 | 0.655 | Y |
| hsa-mir-99b | LIN28A | 5 | 1 | 0.651 | N |
| hsa-mir-99b | KRT18 | 3 | 1 | 0.65 | N |
| hsa-mir-126 | TOM1L1 | 2 | 1 | 0.647 | Y |
| hsa-miR-199a-5p | CAV1 | 5 | 1 | 0.642 | Y |
| hsa-miR-509-3p | CFTR | 3 | 2 | 0.613 | Y |
| hsa-mir-125a | KRT18 | 3 | 1 | 0.58 | N |
| hsa-mir-221 | ATF6 | 3 | 1 | 0.546 | Y |
| hsa-mir-145 | SMAD3 | 3 | 1 | 0.543 | Y |
| hsa-mir-138-1 | CFTR | 3 | 1 | 0.539 | Y |
| hsa-mir-99b | CFTR | 2 | 1 | 0.519 | Y |
| hsa-mir-223 | CFTR | 3 | 1 | 0.513 | Y |
| hsa-mir-125a | CFTR | 2 | 1 | 0.512 | Y |
| hsa-let-7e | LIN28A | 3 | 1 | 0.508 | N |
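A table of this shape can be produced by collapsing sentence-level predictions into one row per miRNA-gene pair. The sketch below is one plausible way to do that and is not taken from the IBRel code: the function name, the tuple layout of the predictions, and the toy values are assumptions for illustration only.

```python
from collections import defaultdict


def aggregate_relations(predictions):
    """Collapse per-sentence predictions into one row per miRNA-gene pair.

    predictions -- iterable of (mirna, gene, doc_id, sentence_id, confidence)
                   tuples (hypothetical layout)
    Returns rows of (mirna, gene, n_sentences, n_documents, max_confidence),
    sorted by max_confidence in descending order.
    """
    stats = defaultdict(
        lambda: {"sentences": set(), "documents": set(), "max_conf": 0.0}
    )
    for mirna, gene, doc_id, sentence_id, confidence in predictions:
        entry = stats[(mirna, gene)]
        entry["sentences"].add((doc_id, sentence_id))
        entry["documents"].add(doc_id)
        entry["max_conf"] = max(entry["max_conf"], confidence)

    rows = [
        (mirna, gene, len(e["sentences"]), len(e["documents"]), e["max_conf"])
        for (mirna, gene), e in stats.items()
    ]
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows


# Toy predictions with placeholder identifiers and made-up confidences.
toy_predictions = [
    ("mir-A", "GENE1", "doc1", 1, 0.62),
    ("mir-A", "GENE1", "doc1", 4, 0.91),
    ("mir-A", "GENE1", "doc2", 2, 0.55),
    ("mir-B", "GENE2", "doc3", 1, 0.70),
]

relation_rows = aggregate_relations(toy_predictions)
for row in relation_rows:
    print(row)
```

Ranking by the maximum confidence follows the same intuition as the multi-instance setting: a single strongly supported sentence is enough to surface a pair, however many weaker mentions accompany it.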