Juan Miguel Cejuela, Shrikant Vinchurkar, Tatyana Goldberg, Madhukar Sollepura Prabhu Shankar, Ashish Baghudana, Aleksandar Bojchevski, Carsten Uhlig, André Ofner, Pandu Raharja-Liu, Lars Juhl Jensen, Burkhard Rost.
Abstract
BACKGROUND: The subcellular localization of a protein is an important aspect of its function. However, the experimental annotation of locations is incomplete even for well-studied model organisms. Text mining might help database curators add experimental annotations from the scientific literature. Existing extraction methods have difficulty distinguishing relationships between proteins and cellular locations co-mentioned in the same sentence.
Entities:
Keywords: Annotations; Database curation; GO; Protein; Relation extraction; Subcellular localization; Text mining
Year: 2018 PMID: 29343218 PMCID: PMC5773052 DOI: 10.1186/s12859-018-2021-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Most related proteins and localizations were close to each other. Repetitions of relationships were collapsed at the document level after normalizing the entities: proteins to UniProtKB and localizations to GO. In the LocTextCorpus, the majority of unique relations were annotated between entities occurring in the same sentence (distance 0 = D0; 66% of all relations) or in adjacent sentences (distance 1 = D1; 15%). Combined, D0+D1 accounted for 81% of the relations. When repetitions were also removed by taking the GO hierarchy into account (child identifiers are more specific than their parents), D0+D1 accounted for 89% of all unique relationships
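The counting described in the caption above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the tuple layout, the function name `distance_histogram`, and the placeholder identifiers are assumptions, and keeping the smallest distance for a repeated relation is one plausible way to collapse repetitions that the source does not specify.

```python
from collections import Counter

def distance_histogram(relations):
    """Share of unique relations per sentence distance.

    relations: iterable of
        (doc_id, protein_id, location_id, protein_sentence, location_sentence)
    where the identifiers are already normalized (hypothetical layout).
    """
    closest = {}
    for doc_id, protein_id, location_id, p_sent, l_sent in relations:
        key = (doc_id, protein_id, location_id)
        dist = abs(p_sent - l_sent)
        # Assumption: a repeated relation keeps its smallest distance.
        closest[key] = min(dist, closest.get(key, dist))
    counts = Counter(closest.values())
    total = sum(counts.values())
    return {d: counts[d] / total for d in sorted(counts)}

# Placeholder identifiers, for illustration only.
example = [
    ("doc1", "PROT:A", "LOC:X", 0, 0),
    ("doc1", "PROT:A", "LOC:X", 0, 3),  # repetition, collapsed into the line above
    ("doc1", "PROT:A", "LOC:Y", 1, 2),
    ("doc2", "PROT:B", "LOC:X", 4, 4),
    ("doc2", "PROT:B", "LOC:Z", 2, 5),
]
hist = distance_histogram(example)
print(hist)
```

On this toy input, half of the four unique relations fall at distance 0 and a quarter at distance 1, which is the kind of D0 and D1 share the caption reports for the real corpus.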
LocText (RE only) and STRING Tagger (NER); intrinsic evaluation
| Method and evaluation | P | R | F ± StdErr |
|---|---|---|---|
| | 84% | 78% | 81% ± 1 |
| | 80% | 78% | 79% ± 2 |
| | 90% | 71% | 80% ± 3 |
| | 96% | 92% | 94% ± 1 |
| | 93% | 68% | 79% ± 3 |
| | 75% | 74% | 74% ± 3 |
Performances of the NER and RE components independently evaluated on the LocTextCorpus; P=precision, R=recall, F ±StdErr=F-measure with standard error
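As a reminder of how the three measures relate, the following sketch computes precision, recall, and the F-measure (harmonic mean of P and R) from raw counts. The function names are illustrative assumptions, and the standard error is not reproduced because the source does not state how it was estimated.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Recomputing F from two (rounded) P/R pairs of the table above.
f_a = f_measure(0.84, 0.78)
f_b = f_measure(0.96, 0.92)
print(round(f_a * 100), round(f_b * 100))
```

Because the table reports P and R rounded to whole percentages, an F value recomputed this way can differ from the published one by a percentage point.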
Fig. 2 LocText full pipeline (NER + RE); intrinsic evaluation. Using the STRING Tagger-extracted (“predicted”) entities, both LocText and Baseline had low and comparable F-measures (F=57% ± 4 and F=51% ± 3, respectively); however, LocText was optimized for precision (P=86%)
LocText found novel GO annotations in latest publications; extrinsic evaluation
| Org. | # | C | C&NR | C&NT | C&NR,NT |
|---|---|---|---|---|---|
| Human | 20 | 13 (65%) | 10 (50%) | 9 (45%) | 7 (35%) |
| Yeast | 20 | 17 (85%) | 12 (60%) | 6 (30%) | 4 (20%) |
| Cress | 20 | 16 (80%) | 11 (55%) | 9 (45%) | 7 (35%) |
| Total | 60 | 46 (77%) | 33 (55%) | 24 (40%) | 18 (30%) |
LocText mined protein-location relations not tagged in Swiss-Prot from recent publications: 2012-2017 for human and 1990-2017 for yeast and cress (column Org. = organism). Column #: 60 novel text-mined annotations (20 for each organism) were manually verified. C = correct: 77% were correct; 55% were correct and had no relation (NR) in Swiss-Prot; 40% were correct and were not in the text (NT) descriptions of Swiss-Prot; 30% were correct and neither had a relation nor appeared in text descriptions
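The summary percentages quoted in this caption follow directly from the per-organism counts in the table. A short check, with the counts copied from the table and the variable names chosen only for illustration:

```python
# Counts per organism: (#, C, C&NR, C&NT, C&NR,NT), copied from the table.
rows = {
    "Human": (20, 13, 10, 9, 7),
    "Yeast": (20, 17, 12, 6, 4),
    "Cress": (20, 16, 11, 9, 7),
}

# Column-wise sums over the three organisms.
totals = [sum(column) for column in zip(*rows.values())]

# Percentages relative to the number of verified annotations.
n_verified = totals[0]
percentages = [round(100 * count / n_verified) for count in totals[1:]]

print(totals)
print(percentages)
```

The sums reproduce the 60 verified annotations and the 77%, 55%, 40% and 30% shares given in the caption.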
Fig. 3 LocText pipeline. The input consists of text documents (e.g. from PubMed). First, the STRING Tagger recognizes named entities (NER): proteins (green in the example; linked to UniProtKB), cellular localizations (pink; linked to GO), and organisms (yellow; linked to NCBI Taxonomy). Then, the relation extractor (RE) of LocText resolves which proteins and localizations are related (as in “localized in”). The output is a list of text-mined relationships (GO annotations) linked to the original text sources (e.g. PMIDs)
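The two-stage shape of such a pipeline can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the dictionary lookup merely stands in for the STRING Tagger, the same-sentence co-mention rule is a naive stand-in rather than LocText's relation extractor, and the protein names and identifiers and the document identifier are placeholders (only the two GO identifiers, for nucleus and cytoplasm, are real).

```python
import re
from itertools import product

# Hypothetical toy dictionaries standing in for a real NER component.
PROTEINS = {"ProtA": "PROTEIN:A", "ProtB": "PROTEIN:B"}
LOCATIONS = {"nucleus": "GO:0005634", "cytoplasm": "GO:0005737"}

def tag(sentence, dictionary):
    """Return identifiers of dictionary names found as whole words (case-insensitive)."""
    return [
        identifier
        for name, identifier in dictionary.items()
        if re.search(r"\b" + re.escape(name) + r"\b", sentence, flags=re.IGNORECASE)
    ]

def extract_relations(doc_id, text):
    """Naive stand-in for RE: relate every protein and location in the same sentence."""
    relations = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        proteins = tag(sentence, PROTEINS)
        locations = tag(sentence, LOCATIONS)
        for protein, location in product(proteins, locations):
            # Keep the source document so each relation stays traceable.
            relations.add((doc_id, protein, location))
    return sorted(relations)

text = (
    "ProtA accumulates in the nucleus. "
    "ProtB was not studied here. "
    "The cytoplasm was stained."
)
relations = extract_relations("DOC:0", text)
print(relations)
```

Only the first sentence mentions both a protein and a localization, so the sketch emits a single relation. A co-mention rule like this cannot tell related from merely co-occurring entities, which is the difficulty the abstract attributes to existing extraction methods.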