| Literature DB >> 26428294 |
Suwisa Kaewphan, Sofie Van Landeghem, Tomoko Ohta, Yves Van de Peer, Filip Ginter, Sampo Pyysalo.
Abstract
MOTIVATION: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
Year: 2015 PMID: 26428294 PMCID: PMC4708107 DOI: 10.1093/bioinformatics/btv570
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Qualitative comparison of three corpora on different criteria
| Characteristics | CellFinder | JNLPBA* | Gellus |
|---|---|---|---|
| Annotation diversity** | 17.10% (66/386) | 58.38% (2528/4330) | 32.31% (210/650) |
| Document size | 10 full-texts | 2404 abstracts | 1212 documents*** |
| Annotation span | excl. head nouns | incl. head nouns | excl. head nouns |
| Domain | Embryonic stem cells | Blood transcription factors | Random + cancer |
| Cell line definition | Specific | Specific + Generic | Specific |
| Normalized cell lines (%) | 66.84 (258/386) | 45.24 (1959/4330) | 79.38 (516/650) |
*The statistics of the cell line section of the derived corpus differ slightly from the original.
**This represents the number of unique strings per number of mentions.
***The corpus consists of 300 PubMed abstracts and extracts from 912 PMC full text documents.
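The annotation diversity figures in the table are the number of unique mention strings divided by the total number of mentions. A quick illustrative check of the reported percentages (values taken from the table; the helper function is ours, not from the paper):

```python
# Annotation diversity = unique mention strings / total mentions, as a percentage.
def annotation_diversity(unique_strings, total_mentions):
    return 100.0 * unique_strings / total_mentions

# Counts from the corpus comparison table above.
corpora = {
    "CellFinder": (66, 386),
    "JNLPBA": (2528, 4330),
    "Gellus": (210, 650),
}

for corpus, (unique, total) in corpora.items():
    print(f"{corpus}: {annotation_diversity(unique, total):.2f}%")
```

Running this reproduces the 17.10%, 58.38% and 32.31% values shown in the table.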
Comparison of performance (precision/recall/F-score, %) across different corpora for the overlap matching criterion

| System | Trained on | JNLPBA | CellFinder | Gellus |
|---|---|---|---|---|
| Dictionary (approximate) | N/A | 19.83/44.60/27.45 | 36.47/79.59/50.02 | 13.76/92.74/23.96 |
| Dictionary (exact) | N/A | 54.14/42.40/47.56 | 74.84/76.87/75.84 | 55.46/86.59/67.61 |
| GENIA tagger | JNLPBA | 66.92/69.80/68.33 | 20.00/23.13/21.45 | 40.50/55.87/46.96 |
| ABNER | JNLPBA | 65.27/70.80/67.92 | 22.88/25.17/23.97 | 39.91/52.51/45.36 |
| Gimli | JNLPBA | 32.35/23.81/27.43 | 42.86/43.02/42.94 | |
| NERsuite | JNLPBA | 57.60/76.40/65.68 | 15.71/39.46/22.47 | 27.72/63.69/38.63 |
| NERsuite + dict | JNLPBA | 63.45/76.60/69.41 | 30.48/63.95/41.29 | 37.65/72.07/49.46 |
| NERsuite | CellFinder | 31.99/23.00/26.76 | 54.71/81.63/65.51 | 30.13/37.43/33.39 |
| NERsuite + dict | CellFinder | 60.13/33.80/43.27 | 72.40/74.30/73.34 | |
| NERsuite | Gellus | 73.16/31.00/43.55 | 51.85/28.57/36.84 | 79.39/71.51/75.25 |
| NERsuite + dict | Gellus | 73.81/41.80/53.37 | 89.43/74.83/81.48 | |
The numbers displayed in bold font represent the best performing systems for each test corpus. (Note that ABNER, the GENIA tagger and Gimli were evaluated with the provided models, trained solely on the original JNLPBA training data.)
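The "Dictionary (exact)" baseline in the table tags a span as a cell line mention only when it matches a dictionary entry verbatim. A minimal sketch of such a matcher, assuming a tiny illustrative dictionary (the lexicon actually used by the paper is far larger):

```python
def exact_dictionary_tag(tokens, dictionary, max_len=5):
    """Return (start, end) token spans that exactly match a dictionary entry,
    preferring the longest match at each position (greedy left-to-right)."""
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate span first, shrinking toward a single token.
        for j in range(min(i + max_len, len(tokens)), i, -1):
            if " ".join(tokens[i:j]) in dictionary:
                found = (i, j)
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched span
        else:
            i += 1
    return matches

cell_lines = {"HeLa", "MCF-7", "U2 OS"}  # illustrative entries only
tokens = "HeLa and U2 OS cells were cultured".split()
print(exact_dictionary_tag(tokens, cell_lines))  # [(0, 1), (2, 4)]
```

The high precision and limited recall of the exact-match rows above is typical of this approach: it never tags unknown variants, but everything it tags is a known name.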
Results of NERsuite trained on different corpora and applied to the CLL corpus
| Trained model | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|
| CellFinder | 80.00 | 28.15 | 41.65 |
| CellFinder + dict | 86.90 | 69.50 | 77.23 |
| JNLPBA | 75.92 | 52.49 | 62.07 |
| JNLPBA + dict | 82.93 | 79.47 | 81.16 |
| Gellus | 92.90 | 40.47 | 56.38 |
| Gellus + dict | 90.22 | 82.11 | 85.98 |
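The F-scores throughout these tables are the harmonic mean of precision and recall. A quick check against the Gellus + dict row (values from the table; the small discrepancy stems from the paper reporting P and R rounded to two decimals):

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Gellus + dict row from the CLL evaluation table.
print(f"{f_score(90.22, 82.11):.2f}")  # ~85.97, reported as 85.98
```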