Mariana Neves, Alexander Damaschun, Nancy Mah, Fritz Lekschas, Stefanie Seltmann, Harald Stachelscheid, Jean-Fred Fontaine, Andreas Kurtz, Ulf Leser.
Abstract
Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it in specialized databases for structured delivery to users. It is a slow, error-prone, complex and costly, yet highly important, task. Previous experience has shown that text mining can assist in many of its phases, especially in the triage of relevant documents and the extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cells or anatomical parts have been extracted. Validation of half of these data resulted in a precision of ~50%, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve better performance for event extraction. Database URL: http://www.cellfinder.org/
Year: 2013 PMID: 23599415 PMCID: PMC3629873 DOI: 10.1093/database/bat020
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overview of the literature curation pipeline for the CellFinder database. It includes the following steps: triage of potentially relevant documents, retrieval of full text, preprocessing (sentence splitting, tokenization and parsing), named-entity recognition (genes, proteins, cell lines, cell types, organs, tissues, expression triggers), extraction of gene expression events, manual validation of the results and integration into the database. Automatic procedures are shown in red, whereas the manual ones are shown in blue.
Figure 2. Examples of gene expression events for the kidney stem cell corpus (PMID 17389645, PMCID PMC1885650). Each expression trigger (dark yellow) is related to exactly one gene/protein (in blue) and exactly one cell (in yellow) or anatomical part (in red). However, the corpus was also annotated with entities that do not take part in any event. Visualization of the corpus was provided by the Brat annotation tool (60).
Table 1. Statistics on the corpora
| Features | CF-hESC Training | CF-hESC Development | CF-hESC Test | CF-Kidney Training | CF-Kidney Development | CF-Kidney Test |
|---|---|---|---|---|---|---|
| Documents | 6 | 2 | 2 | 6 | 2 | 2 |
| Sentences | 1379 | 259 | 539 | 1578 | 618 | 383 |
| Sentences with entities | 944 | 163 | 302 | 1344 | 527 | 314 |
| Sentences with events | 147 | 26 | 40 | 240 | 210 | 122 |
| Entities | 4158 | 583 | 1260 | 4834 | 3443 | 1748 |
| Genes/proteins | 1264 | 163 | 355 | 1440 | 1338 | 782 |
| Cell lines | 198 | 72 | 141 | 11 | 8 | 1 |
| Cell types | 1556 | 179 | 524 | 917 | 259 | 72 |
| Anatomical parts | 921 | 137 | 173 | 2116 | 1380 | 617 |
| Expression triggers | 219 | 32 | 67 | 350 | 458 | 276 |
| Relationships | 944 | 160 | 390 | 1144 | 1404 | 1320 |
| Expression-Gene/protein | 472 | 84 | 195 | 572 | 702 | 660 |
| Expression-CellLine | 13 | 6 | 36 | 14 | 5 | |
| Expression-CellType | 435 | 56 | 122 | 411 | 398 | 86 |
| Expression-anatomy | 24 | 18 | 37 | 147 | 299 | 574 |
Information is shown for the training, development and test data sets of the CF-hESC and CF-Kidney corpora. It includes the number of documents, sentences, sentences with entities and sentences with events. The number of annotations is presented by entity type, and the number of events is also shown according to the entities participating in the relationships.
Figure 3. Screenshot of Bionotate configured for the validation of the gene expression events. Three named entities are always pre-annotated: a trigger (in green), a gene (in blue) and a cell line, cell type or anatomical part (in red). The answers assess whether the biological event is taking place, its negation, the accuracy of the named-entity recognition and the relevance of the publication from which the snippet was derived.
Table 2. Evaluation of the automatic named-entity recognition on the CF-hESC and CF-Kidney corpora
| Corpora | Data set | Match | Genes | C. lines | C. types | Anatomy | Expression |
|---|---|---|---|---|---|---|---|
| CF-hESC | Development | Ex. | 0.61/0.54 | 0.68/0.61 | 0.14/0.15 | 0.34/0.34 | 0.72/0.15 |
| CF-hESC | Development | OT | 0.75/0.65 | 0.94/0.85 | 0.62/0.66 | 0.48/0.45 | 0.91/0.19 |
| CF-hESC | Development | Ov. | 0.82/0.69 | 0.94/0.81 | 0.70/0.73 | 0.72/0.62 | 0.97/0.20 |
| CF-hESC | Test | Ex. | 0.68/0.65 | 0.40/0.49 | 0.25/0.28 | 0.30/0.25 | 0.45/0.08 |
| CF-hESC | Test | OT | 0.76/0.72 | 0.58/0.65 | 0.58/0.65 | 0.43/0.35 | 0.54/0.09 |
| CF-hESC | Test | Ov. | 0.77/0.71 | 0.61/0.69 | 0.77/0.82 | 0.81/0.71 | 0.55/0.10 |
| CF-Kidney | Development | Ex. | 0.34/0.45 | 1.00/0.33 | 0.17/0.26 | 0.69/0.75 | 0.68/0.43 |
| CF-Kidney | Development | OT | 0.35/0.46 | 1.00/0.33 | 0.18/0.27 | 0.88/0.87 | 0.69/0.43 |
| CF-Kidney | Development | Ov. | 0.46/0.56 | 1.00/0.34 | 0.77/0.80 | 0.90/0.89 | 0.76/0.47 |
| CF-Kidney | Test | Ex. | 0.69/0.76 | 1.00/0.33 | 0.89/0.86 | 0.67/0.74 | 0.80/0.42 |
| CF-Kidney | Test | OT | 0.70/0.77 | 1.00/0.33 | 0.93/0.89 | 0.69/0.76 | 0.80/0.42 |
| CF-Kidney | Test | Ov. | 0.70/0.77 | 1.00/0.33 | 0.94/0.91 | 0.72/0.77 | 0.81/0.42 |
Results are shown for the development and test data sets in the format recall/F-score. Matching is evaluated regarding same span and entity type (Ex.), overlapping span and same type (OT) and overlapping span of any entity type (Ov.).
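The three matching criteria can be illustrated with a minimal sketch. The `Entity` type, the character-offset convention and the example annotations are assumptions made here for illustration; they are not taken from the CellFinder evaluation code.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    start: int  # character offset, inclusive (assumed convention)
    end: int    # character offset, exclusive (assumed convention)
    type: str   # e.g. "Gene", "CellType", "Anatomy"


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def matches(gold: Entity, pred: Entity, mode: str) -> bool:
    """Apply one of the three matching criteria from the caption."""
    if mode == "Ex.":  # same span and same entity type
        return gold == pred
    if mode == "OT":   # overlapping span and same entity type
        return overlaps(gold, pred) and gold.type == pred.type
    if mode == "Ov.":  # overlapping span, any entity type
        return overlaps(gold, pred)
    raise ValueError(f"unknown matching mode: {mode}")


def recall(gold: List[Entity], pred: List[Entity], mode: str) -> float:
    """Fraction of gold entities matched by at least one prediction."""
    if not gold:
        return 0.0
    found = sum(any(matches(g, p, mode) for p in pred) for g in gold)
    return found / len(gold)


# Hypothetical annotations: the second prediction covers only part of the
# gold span and carries a different type.
gold = [Entity(0, 4, "Gene"), Entity(10, 25, "CellType")]
pred = [Entity(0, 4, "Gene"), Entity(17, 25, "Anatomy")]

for mode in ("Ex.", "OT", "Ov."):
    print(mode, recall(gold, pred, mode))
```

In this toy case only the most lenient criterion (Ov.) credits the second prediction, which mirrors how the scores in the table rise from Ex. to Ov.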
Table 3. Statistics on the extracted named entities
| Annotations | Genes | C. lines | C. types | Anatomy | Expression |
|---|---|---|---|---|---|
| Distinct mentions | 702 829 | 81 074 | 183 820 | 565 860 | 681 370 |
| Distinct spans | 34 222 | 1825 | 9142 | 14 874 | 892 |
| Distinct ids | 34 353 | 11 875 | 1150 | 4300 | |
For each entity type, the number of annotations, distinct spans and identifiers is shown. Sometimes more than one identifier is assigned to a mention, hence the high number of identifiers. Trigger words (Expression) are not normalized to any ontology.
Table 4. Evaluation of TEES during training
| Data sets | Relationship | Development P | Development R | Development F | Test P | Test R | Test F |
|---|---|---|---|---|---|---|---|
| CF-hESC | Cell | 0.86 | 0.56 | 0.68 | 0.77 | 0.45 | 0.57 |
| CF-hESC | Gene | 0.91 | 0.68 | 0.78 | 0.82 | 0.90 | 0.86 |
| CF-hESC | Event | 0.60 | 0.35 | 0.44 | 0.38 | 0.53 | 0.44 |
| CF-Kidney | Cell | 0.71 | 0.50 | 0.59 | 0.62 | 0.68 | 0.65 |
| CF-Kidney | Gene | 0.60 | 0.82 | 0.69 | 0.73 | 0.75 | 0.74 |
| CF-Kidney | Event | 0.17 | 0.49 | 0.25 | 0.12 | 0.56 | 0.20 |
| CF-Both | Cell | 0.77 | 0.55 | 0.65 | 0.69 | 0.64 | 0.67 |
| CF-Both | Gene | 0.67 | 0.81 | 0.73 | 0.69 | 0.84 | 0.76 |
| CF-Both | Event | 0.55 | 0.48 | 0.51 | 0.50 | 0.56 | 0.53 |
Evaluation is shown for the ‘Cell’ and ‘Gene’ relationships on the development and test data sets described in Table 1. The complete events, each derived from a ‘Cell’ and a ‘Gene’ argument associated with the same trigger, are also shown. For each training run, evaluation is carried out on the corresponding development and test data sets, i.e. two documents for each single corpus (CF-hESC and CF-Kidney) and four documents when training on the joined corpus (CF-Both). Predictions were performed over the gold-standard named-entity annotations. ‘P’ refers to ‘Precision’, ‘R’ to ‘Recall’ and ‘F’ to ‘F-score’.
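As a brief illustration of how the three columns relate, the sketch below assumes that ‘F’ is the balanced F-score (the harmonic mean of precision and recall). The source does not state the formula explicitly, but the assumption reproduces the CF-hESC development rows.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean); assumed to be the 'F' column."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the CF-hESC development columns of the table.
cf_hesc_dev = {
    "Cell": (0.86, 0.56),
    "Gene": (0.91, 0.68),
    "Event": (0.60, 0.35),
}

for relationship, (p, r) in cf_hesc_dev.items():
    print(relationship, round(f_score(p, r), 2))
# Cell 0.68
# Gene 0.78
# Event 0.44
```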
Table 5. Evaluation of gene expression extraction
| Data sets | Relationship/Event | Development P | Development R | Development F | Test P | Test R | Test F | Predictions |
|---|---|---|---|---|---|---|---|---|
| CF-hESC | Cell | 0.43 | 0.06 | 0.10 | 0.76 | 0.33 | 0.46 | 14 551 |
| CF-hESC | Gene | 0.35 | 0.22 | 0.27 | 0.76 | 0.79 | 0.77 | 112 372 |
| CF-hESC | Event | 0.50 | 0.08 | 0.14 | 0.27 | 0.05 | 0.08 | 4280 |
| CF-hESC | Triplets | 0.06 | 0.51 | 0.10 | 0.05 | 0.35 | 0.09 | |
| CF-Kidney | Cell | 0.44 | 0.02 | 0.05 | 0.52 | 0.57 | 0.55 | 109 934 |
| CF-Kidney | Gene | 0.62 | 0.06 | 0.10 | 0.77 | 0.69 | 0.73 | 5520 |
| CF-Kidney | Event | | | | | | | 115 |
| CF-Kidney | Triplets | 0.02 | 0.19 | 0.04 | 0.02 | 0.28 | 0.05 | |
| CF-Both | Cell | 1.0 | 0.01 | 0.02 | 0.70 | 0.64 | 0.67 | 69 079 |
| CF-Both | Gene | 0.33 | 0.01 | 0.01 | 0.69 | 0.84 | 0.76 | 3792 |
| CF-Both | Event | | | | | | | 178 |
| CF-Both | Triplets | 0.02 | 0.22 | 0.04 | 0.03 | 0.30 | 0.05 | |
We trained the TEES system on three data sets: CF-hESC, CF-Kidney and CF-Both. Results for the ‘Cell’ and ‘Gene’ relationships were provided by TEES during processing of the documents. Performance for complete events is evaluated allowing overlapping matches for entity spans, but requiring equal entity types and argument types. The triplets correspond to every possible combination of triggers, genes/proteins and cells or anatomical parts in the same sentence, i.e. they give the highest recall achievable by any relationship extraction system given the predicted entities. The ‘Predictions’ column presents the number of relationships or complete events extracted from the 2376 full texts on kidney research with each of the trained models. ‘P’ refers to ‘Precision’, ‘R’ to ‘Recall’ and ‘F’ to ‘F-score’.
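The triplet baseline described in the caption can be sketched as a Cartesian product over the entities of one sentence. The type labels and the example sentence content below are hypothetical and only serve to show the combinatorics; this is not the authors' implementation.

```python
from itertools import product
from typing import List, Tuple

Mention = Tuple[str, str]  # (surface text, entity type)

# Assumed label names for the cell/anatomy argument of an event.
CELL_OR_ANATOMY = {"CellLine", "CellType", "Anatomy"}


def candidate_triplets(sentence_entities: List[Mention]):
    """Every (trigger, gene, cell/anatomy) combination within one sentence."""
    triggers = [e for e in sentence_entities if e[1] == "Expression"]
    genes = [e for e in sentence_entities if e[1] == "Gene"]
    cells = [e for e in sentence_entities if e[1] in CELL_OR_ANATOMY]
    return list(product(triggers, genes, cells))


# Invented sentence annotations, for illustration only.
sentence = [
    ("expressed", "Expression"),
    ("Pax2", "Gene"),
    ("Six2", "Gene"),
    ("nephron progenitors", "CellType"),
    ("kidney", "Anatomy"),
]

triplets = candidate_triplets(sentence)
print(len(triplets))  # 1 trigger x 2 genes x 2 cells/anatomical parts = 4
```

Because every combination is proposed, such a baseline trades precision for recall, which is consistent with the low precision and comparatively high recall in the ‘Triplets’ rows.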
Table 6. Evaluation of the gene expression snippets in Bionotate
| Answers | CF-hESC No. snippets | CF-hESC % | CF-Kidney No. snippets | CF-Kidney % | CF-Both No. snippets | CF-Both % | Total No. snippets | Total % |
|---|---|---|---|---|---|---|---|---|
| 1. Yes | 1204 | 49.1 | 34 | 29.5 | 6 | 3.3 | 1244 | 45.4 |
| 2. Yes (negation) | 47 | 1.9 | 3 | 2.6 | 0 | 0 | 50 | 1.8 |
| 3. No (but entities correct) | 218 | 9.0 | 8 | 7.0 | 1 | 0.6 | 227 | 8.3 |
| 4. No (trigger wrong) | 194 | 8.0 | 28 | 24.3 | 78 | 43.8 | 300 | 11.0 |
| 5. No (gene wrong) | 346 | 14.1 | 11 | 9.6 | 6 | 3.4 | 363 | 13.2 |
| 6. No (cell/anatomy wrong) | 207 | 8.5 | 26 | 22.6 | 9 | 5.1 | 242 | 8.8 |
| 7. No (gene/cell/anatomy wrong) | 55 | 2.2 | 4 | 3.5 | 1 | 0.6 | 60 | 2.2 |
| 8. No (irrelevant document) | 177 | 7.2 | 1 | 0.9 | 77 | 43.2 | 255 | 9.3 |
| Total | 2448 | 100 | 115 | 100 | 178 | 100 | 2741 | 100 |
A total of 2741 snippets (gene expression events) were validated. These events were predicted by the three models obtained by training the TEES event extraction system on the CF-hESC, CF-Kidney and CF-Both data sets. Percentages for each answer are also shown.
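The percentages in the ‘Total’ column can be derived from the snippet counts, as in the sketch below. Counting answers 1 and 2 together as confirmed events is an assumption made here for illustration, and recomputed values may differ from the table in the last digit because of rounding.

```python
# Snippet counts copied from the 'Total' column of the table above.
counts = {
    "1. Yes": 1244,
    "2. Yes (negation)": 50,
    "3. No (but entities correct)": 227,
    "4. No (trigger wrong)": 300,
    "5. No (gene wrong)": 363,
    "6. No (cell/anatomy wrong)": 242,
    "7. No (gene/cell/anatomy wrong)": 60,
    "8. No (irrelevant document)": 255,
}

total = sum(counts.values())
percentages = {answer: round(100 * n / total, 1) for answer, n in counts.items()}

# Assumption: both "Yes" answers count as correctly extracted events.
confirmed = counts["1. Yes"] + counts["2. Yes (negation)"]
confirmed_share = round(100 * confirmed / total, 1)

print(total)                  # 2741
print(percentages["1. Yes"])  # 45.4
print(confirmed_share)        # 47.2
```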