Maryam Habibi, David Luis Wiegandt, Florian Schmedding, Ulf Leser.
Abstract
Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need to automatically analyze today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical findings. However, NER on patents has long been neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a prerequisite for achieving high-quality text mining results.
Keywords: Chemical named entity recognition; Ensemble approach; Patent mining; Performance measurements; Simple chemical elements
Year: 2016 PMID: 27843493 PMCID: PMC5086069 DOI: 10.1186/s13321-016-0172-0
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
The details of the gold standard patent corpora containing the annotations for chemicals
| Corpus | Number of patents | Annotated entities | Number of annotations |
|---|---|---|---|
| CEMP training set (CEMP_T) | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 33543 (without normalization) |
| CEMP development set (CEMP_D) | 7000 patents (title and abstract) | ABBREVIATION, FAMILY, FORMULA, TRIVIAL, MULTIPLE, SYSTEMATIC, IDENTIFIERS | 32142 (without normalization) |
| CHEBI patent corpus (chapati) | 40 full patents (title, abstract, claims, description) | CLASS, CHEMICAL, ONT, FORMULA, LIGAND, CM | 18746 (normalized to CHEBI identifiers) |
| BioSemantic patent corpus (BioS) | 200 full patents (title, abstract, claims, description) | IUPAC, SMILES, InChi, ABBREVIATION, MOA, DISEASE, FORMULA, REGISTRY NUMBER, GENERIC, TRADEMARK, CAS NUMBER, TARGET | 400125 (without normalization) |
Details on the chemical NER tools in terms of training sets, databases to which the entities are normalized, classes of chemicals addressed, and tokenization methods
| NER tool | Training set | Databases | Classes | Tokenization method |
|---|---|---|---|---|
| tmChem | CHEMDNER corpus at BioCreative IV (training and development sets) | CHEBI, MESH | SYSTEMATIC, FORMULA, FAMILY, TRIVIAL, IDENTIFIER, MULTIPLE, ABBREVIATION | Tokenization at every non-letter, non-digit character, at number-letter changes, and where a lowercase letter is followed by an uppercase letter |
| ChemSpot | A subset of SCAI Corpus | ChemIDplus, CHEBI, CAS NUMBER, PubChem, InChI, DrugBank, KEGG, Human Metabolome, MESH | SYSTEMATIC, FORMULA, FAMILY, TRIVIAL, IDENTIFIER, MULTIPLE, ABBREVIATION | Tokenization at every non-letter, non-digit character and at number-letter changes |
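The two tokenization schemes in the table above differ only in the extra lowercase-to-uppercase split attributed to tmChem. The following is a minimal sketch of those rules as described, not the tools' actual tokenizers. The function name `tokenize` and the flag `split_case` are hypothetical, and the sketch assumes ASCII letters and assumes that whitespace separates tokens without being emitted as a token.

```python
import re

# Runs of ASCII letters, runs of digits, or one other non-space character.
# Assumption: whitespace only separates tokens and is not emitted itself.
_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")

# Zero-width position between a lowercase and an uppercase letter.
_CASE_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def tokenize(text, split_case=False):
    """Split at non-alphanumeric characters and at number-letter changes.

    With split_case=True, additionally split where a lowercase letter is
    followed by an uppercase letter (the extra rule listed for tmChem).
    """
    tokens = _TOKEN.findall(text)
    if not split_case:
        return tokens
    result = []
    for token in tokens:
        # Mark each case boundary with a space, then split on it.
        result.extend(_CASE_BOUNDARY.sub(" ", token).split())
    return result


print(tokenize("2-methylpropan-1-ol"))
print(tokenize("NaCl H2O", split_case=False))
print(tokenize("NaCl H2O", split_case=True))
```

Under these assumptions, the only difference between the two modes on "NaCl" is that the case-sensitive variant separates "Na" from "Cl".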
Fig. 1 Precision, recall and F-measure of the ChemSpot and tmChem NER tools on the gold standard corpora
The top 10 entities with the highest false positive (FP) counts for each chemical NER tool on the four different corpora
| CEMP_T | CEMP_D | ||
|---|---|---|---|
| ChemSpot | tmChem | ChemSpot | tmChem |
| Water 951 |
| Water 842 |
| Sugar 66 |
| CH |
| DEG 155 |
| Peptide 153 |
| Peptide 107 | NO 42 | Chitosan 130 | O 46 |
| Chitosan 91 | Solvate 40 | DEG 108 | NO 45 |
| Starch 81 |
| Parkinson 80 | N 44 |
| Hydrogen 38 |
| Sulfate 37 |
| Parkinson 60 | Beta-cyclodextrin 34 |
| Beta-cyclodextrin 36 |
Common mistakes are shown in italic
The top 10 entities with the highest false negative (FN) counts for each chemical NER tool on the four different corpora
| CEMP_T | CEMP_D | ||
|---|---|---|---|
| ChemSpot | tmChem | ChemSpot | tmChem |
| Heteroaryl 82 |
| Heteroaryl 87 |
| S 86 |
| S 86 |
| N 71 | Cyano 85 |
| Alkoxy 67 |
| Alkoxy 63 |
| Oligonucleotides 50 |
| Halo 52 |
| Opioid 50 |
Common mistakes are shown in italic
Fig. 2 Distributions of FP counts from high to low, for unique entities covering 25% of cases, obtained by tmChem and ChemSpot over all corpora. The x-axis represents the number of unique entities. The distributions are notably different for full patents compared to patent abstracts
Fig. 3 Distributions of FN counts from high to low, for unique entities covering 25% of cases, obtained by tmChem and ChemSpot over all corpora. The x-axis represents the number of unique entities. The distributions are very similar for full patents and patent abstracts
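One way to read "unique entities covering 25% of cases" in the two captions above is as the shortest list of most frequent error strings whose counts together reach a quarter of all errors. The sketch below follows that reading, which is an assumption rather than the authors' documented procedure. The function name `head_covering` and the toy mentions are hypothetical.

```python
from collections import Counter


def head_covering(mentions, fraction=0.25):
    """Return the most frequent entities, in descending count order, that
    together account for at least `fraction` of all error mentions."""
    counts = Counter(mentions).most_common()  # sorted from high to low
    total = sum(count for _, count in counts)
    head, covered = [], 0
    for entity, count in counts:
        if covered >= fraction * total:
            break
        head.append((entity, count))
        covered += count
    return head


# Hypothetical false-positive mentions, one string per erroneous span.
toy_fp = ["water"] * 6 + ["peptide"] * 3 + ["starch"]
print(head_covering(toy_fp))        # entities covering 25% of the cases
print(head_covering(toy_fp, 0.75))  # entities covering 75% of the cases
```

A short head means a few entities dominate the errors, while a long head means the errors are spread over many distinct strings.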
Fig. 4 The FP and FN counts of simple chemical elements, normalized by the FP and FN counts obtained for all entities by tmChem and ChemSpot over all corpora
Fig. 5 Precision, recall and F-measure over recognized spans obtained by the ChemSpot and tmChem NER tools on the gold standard corpora. Results that include simple elements are marked “+”, and results that exclude them are marked “−”
Fig. 6 Precision, recall and F-measure over recognized spans obtained by ChemSpot, tmChem, and the intersection and union of their outputs on the gold standard corpora
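The intersection and union in the caption above correspond to the high-precision and high-recall ensembles mentioned in the abstract. A minimal sketch of such ensembles follows, assuming each tool's output is a set of `(document id, start, end)` spans and that a prediction counts as correct only on an exact span match. The matching criterion, the function name `prf`, and the toy spans are assumptions for illustration.

```python
def prf(predicted, gold):
    """Span-level precision, recall and F-measure with exact matching."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical spans: (document id, start offset, end offset).
gold = {("d1", 0, 5), ("d1", 10, 17), ("d1", 20, 24)}
chemspot = {("d1", 0, 5), ("d1", 10, 17), ("d1", 30, 35)}
tmchem = {("d1", 0, 5), ("d1", 20, 24)}

ensembles = {
    "ChemSpot": chemspot,
    "tmChem": tmchem,
    "union": chemspot | tmchem,         # keeps a span if either tool found it
    "intersection": chemspot & tmchem,  # keeps a span only if both found it
}
for name, spans in ensembles.items():
    p, r, f = prf(spans, gold)
    print(f"{name}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

The union can never have lower recall than either tool alone, while the intersection typically trades recall for precision because it keeps only spans both tools agree on.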
Fig. 7 Precision, recall and F-measure values of the models trained using different corpora on the CEMPs, chapati, and BioS patent corpora
The execution time, in seconds, of NER tools over 10 full patent documents and 10 journal articles
| Text genre | ChemSpot | tmChem |
|---|---|---|
| 10 Patent documents | 562 | 66 |
| 10 Scientific articles | | |
The execution times of both systems are lower on scientific articles (shown in italic) than on patents
Statistical measurements calculated over 17,000 patent documents and 17,000 journal articles
| Text genre | Sentence length | Document length | Number of unique TLAs | Number of TLAs | Number of tables | Number of figures |
|---|---|---|---|---|---|---|
| Patents | 21.12 | | | | | |
| Articles | | 3512.30 | 8.47 | 44.73 | 2.03 | 2.97 |
The largest values are represented in italic for each measurement
The full confusion matrix for the ambiguous entity “alkyl” calculated for ChemSpot and tmChem over CEMP_T and CEMP_D corpora
| ChemSpot CEMP_T | Predicted “alkyl” | Predicted others | tmChem CEMP_T | Predicted “alkyl” | Predicted others |
|---|---|---|---|---|---|
| Actual “alkyl” | 354 (TP) | 74 (FN) | Actual “alkyl” | 206 (TP) | 226 (FN) |
| Actual others | 260 (FP) | 599 (TN) | Actual others | 39 (FP) | 816 (TN) |
The full confusion matrix for the ambiguous entity “H” calculated for ChemSpot and tmChem over chapati and BioS corpora containing complete patent documents
| ChemSpot chapati | Predicted “H” | Predicted others | tmChem chapati | Predicted “H” | Predicted others |
|---|---|---|---|---|---|
| Actual “H” | 33 (TP) | 37 (FN) | Actual “H” | 36 (TP) | 34 (FN) |
| Actual others | 11 (FP) | 789 (TN) | Actual others | 46 (FP) | 754 (TN) |
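The per-entity confusion matrices above can be turned into precision, recall and F-measure with the standard definitions. The sketch below does this arithmetic for the TP, FP and FN counts copied from the two tables. The function name `precision_recall_f` and the dictionary layout are hypothetical, and the TN counts are not needed for these three measures.

```python
def precision_recall_f(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F = 2PR/(P+R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Counts taken from the confusion matrices for "alkyl" (CEMP_T) and "H" (chapati).
matrices = {
    ("ChemSpot", "CEMP_T", "alkyl"): {"tp": 354, "fp": 260, "fn": 74},
    ("tmChem", "CEMP_T", "alkyl"): {"tp": 206, "fp": 39, "fn": 226},
    ("ChemSpot", "chapati", "H"): {"tp": 33, "fp": 11, "fn": 37},
    ("tmChem", "chapati", "H"): {"tp": 36, "fp": 46, "fn": 34},
}

for (tool, corpus, entity), counts in matrices.items():
    p, r, f = precision_recall_f(**counts)
    print(f"{tool} on {corpus}, entity {entity!r}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

For “alkyl”, the counts give ChemSpot the higher recall (354 of 428 actual mentions) and tmChem the higher precision (206 of 245 predictions); for “H”, the pattern is reversed, with ChemSpot more precise (33 of 44 predictions) and tmChem recalling slightly more (36 of 70 actual mentions).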