Yaoyun Zhang, Jun Xu, Hui Chen, Jingqi Wang, Yonghui Wu, Manu Prakasam, Hua Xu.
Abstract
Medicinal chemistry patents contain rich information about chemical compounds. Although much effort has been devoted to extracting chemical entities from scientific literature, few patent mining systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of information extraction systems for medicinal chemistry patents, the 2015 BioCreative V challenge organized a track on Chemical and Drug Named Entity Recognition from patent text (CHEMDNER patents). This track included three individual subtasks: (i) Chemical Entity Mention Recognition in Patents (CEMP), (ii) Chemical Passage Detection (CPD) and (iii) Gene and Protein Related Object task (GPRO). We participated in two of the subtasks, CEMP and CPD, using machine learning-based systems built on conditional random fields (CRF) and structured support vector machines (SSVMs). To improve the performance of the NER systems, two strategies were proposed for feature engineering: (i) domain knowledge features of dictionaries, chemical structural patterns and semantic type information present in the context of the candidate chemical and (ii) unsupervised feature learning algorithms to generate word representation features by Brown clustering and a novel binarized word embedding to enhance the generalizability of the system. Further, the system output for the CPD task was generated from the patent titles and abstracts using the chemicals recognized in the CEMP task. The effects of the proposed feature strategies on both machine learning-based systems were investigated.
Our best system achieved the second-best performance among 21 participating teams in CEMP, with a precision of 87.18%, a recall of 90.78% and an F-measure of 88.94%, and was the top-performing system among nine participating teams in CPD, with a sensitivity of 98.60%, a specificity of 87.21%, an accuracy of 94.75%, a Matthews correlation coefficient (MCC) of 88.24%, a precision at full recall (P_full_R) of 66.57% and an area under the precision-recall curve (AUC_PR) of 0.9347. The SSVM-based CEMP systems outperformed the CRF-based CEMP systems when using the same features. Features generated from both the domain knowledge and unsupervised learning algorithms significantly improved the chemical NER task on patents.
Database URL: http://database.oxfordjournals.org/content/2016/baw049
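As a quick consistency check on the reported CEMP scores, the sketch below recomputes the F-measure from the stated precision and recall. It assumes the F-measure is the balanced F1 (the harmonic mean of precision and recall); the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the best CEMP run, in percent.
best_cemp_f = f_measure(87.18, 90.78)
print(round(best_cemp_f, 2))  # 88.94, matching the reported F-measure
```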
Year: 2016 PMID: 27087307 PMCID: PMC4834204 DOI: 10.1093/database/baw049
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The workflow of our system for chemical named entity recognition from patents.
Statistics of the training and development datasets of the BioCreative V CHEMDNER patents challenge
| Types | Training set | Development set |
|---|---|---|
| ABBREVIATION | 588 | 454 |
| FAMILY | 12 209 | 11 710 |
| FORMULA | 2239 | 2120 |
| IDENTIFIER | 99 | 125 |
| MULTIPLE | 140 | 141 |
| SYSTEMATIC | 9570 | 9194 |
| TRIVIAL | 8698 | 8298 |
| ALL | 32 955 | 32 042 |
CHEMDNER patents: CHEMDNER from patent text.
Figure 2. An example of the BIO representation of chemical named entities.
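A minimal sketch of BIO labelling follows: the first token of each entity is tagged B, the remaining entity tokens I, and every other token O. The example sentence, the token-index span format and the function name `bio_tags` are assumptions made for illustration; they are not taken from Figure 2.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], entities: List[Tuple[int, int]]) -> List[str]:
    """Tag tokens with B/I/O given entity spans as (start, end) token
    indices, with `end` exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entities:
        tags[start] = "B"              # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # tokens inside the entity
    return tags


# Hypothetical sentence with two chemical mentions.
tokens = ["treated", "with", "acetic", "acid", "and", "ethanol"]
entities = [(2, 4), (5, 6)]            # "acetic acid", "ethanol"
tagged = list(zip(tokens, bio_tags(tokens, entities)))
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

A sequence labeller such as a CRF or SSVM then predicts one of these three tags per token, and contiguous B/I runs are read back as entity mentions.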
Illustration of features identified as prefixes/suffixes, n-grams of characters and prefixes/suffixes of a chemical named entity
| Chemical name | Benzylamino |
|---|---|
| prefixes/suffixes | b, be, ben, ino, no, o |
| n-grams of characters | be, ben, en, enz, nz, nzy, zy, zyl, yl, yla, la, lam, am, ami, mi, min, in, ino, no |
| Prefixes/suffixes of chemicals | benzyl, amino |
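The three feature rows in the table above can be reproduced with a few lines of code. The sketch below assumes affixes of length 1 to 3 and character n-grams of length 2 and 3, which is what the table entries for "Benzylamino" imply; the `CHEMICAL_AFFIXES` list is a small hypothetical stand-in for whatever lexicon of chemical prefixes and suffixes the system used.

```python
def affixes(word, max_len=3):
    """Generic prefixes and suffixes of length 1..max_len."""
    w = word.lower()
    prefixes = [w[:n] for n in range(1, max_len + 1)]
    suffixes = [w[-n:] for n in range(max_len, 0, -1)]
    return prefixes, suffixes


def char_ngrams(word, sizes=(2, 3)):
    """Character n-grams, emitted position by position."""
    w = word.lower()
    grams = []
    for i in range(len(w)):
        for n in sizes:
            if i + n <= len(w):
                grams.append(w[i:i + n])
    return grams


# Hypothetical mini-lexicon of chemical prefixes/suffixes (assumption).
CHEMICAL_AFFIXES = ("benzyl", "amino", "methyl", "hydroxy")


def chemical_affixes(word):
    """Lexicon entries that the word starts or ends with."""
    w = word.lower()
    return [a for a in CHEMICAL_AFFIXES if w.startswith(a) or w.endswith(a)]


prefixes, suffixes = affixes("Benzylamino")
print(prefixes + suffixes)              # b, be, ben, ino, no, o
print(char_ngrams("Benzylamino"))       # be, ben, en, enz, ..., ino, no
print(chemical_affixes("Benzylamino"))  # benzyl, amino
```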
Figure 3. An example of semantic type annotation for context feature extraction.
Figure 4. A comparison between real-valued and binarized embedding features.
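To illustrate the difference between real-valued and binarized embedding features, the sketch below discretizes each embedding dimension into three symbols. The thresholding rule is an assumption, since the abstract does not spell out the scheme: values at or above the mean of the positive values in a dimension become `POS`, values at or below the mean of the negative values become `NEG`, and everything else becomes `ZERO`. The toy vectors and the function name `binarize` are likewise hypothetical.

```python
from statistics import mean


def binarize(embeddings):
    """Map each real-valued dimension to 'POS', 'NEG' or 'ZERO'.

    embeddings: dict mapping word -> list of floats (equal lengths).
    Thresholds are per-dimension means of positive / negative values
    (an assumed scheme, not confirmed by the source).
    """
    dims = len(next(iter(embeddings.values())))
    pos_mean, neg_mean = [], []
    for d in range(dims):
        column = [vec[d] for vec in embeddings.values()]
        pos = [v for v in column if v > 0]
        neg = [v for v in column if v < 0]
        # With no positive (negative) values, the threshold is unreachable.
        pos_mean.append(mean(pos) if pos else float("inf"))
        neg_mean.append(mean(neg) if neg else float("-inf"))

    binarized = {}
    for word, vec in embeddings.items():
        binarized[word] = [
            "POS" if v >= pos_mean[d] else "NEG" if v <= neg_mean[d] else "ZERO"
            for d, v in enumerate(vec)
        ]
    return binarized


# Toy two-dimensional embeddings (hypothetical values).
toy = {
    "benzene": [0.9, -0.1],
    "toluene": [0.3, -0.7],
    "the": [-0.5, 0.4],
}
features = binarize(toy)
print(features)
```

Discrete symbols of this kind can be used directly as categorical features in a CRF or SSVM, whereas real-valued dimensions enter the model as numeric feature weights.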
The performance of CRF-based and SSVM-based CEMP systems with different types of features on the development dataset (%)
| Method | CRF precision | CRF recall | CRF F-measure | SSVM precision | SSVM recall | SSVM F-measure |
|---|---|---|---|---|---|---|
| Baseline | 85.05 | 86.18 | 85.61 | 85.63 | 87.78 | 86.53 |
| +Chemical pattern | 85.28 | 86.16 | 85.72 (+0.11) | 85.82 | 87.74 | 86.61 (+0.08) |
| +Gene lexicon | 85.53 | 86.29 | 85.91 (+0.19) | 85.81 | 87.92 | 86.76 (+0.15) |
| +Semantic type | 85.48 | 86.52 | 86.00 (+0.09) | 85.77 | 88.27 | 86.87 (+0.11) |
| +ChemSpot | 82.49 | 90.24 | 86.19 (+0.19) | 82.86 | 91.82 | 87.07 (+0.20) |
| +Word embedding | 82.30 | 91.06 | 86.46 (+0.27) | 82.73 | 92.43 | 87.31 (+0.24) |
| +Brown clustering | 86.34 | 87.58 | 86.96 (+0.50) | 86.10 | 89.44 | 87.74 (+0.43) |
| +Post | 86.02 | 88.45 | 87.22 (+0.26) | 85.88 | 89.99 | 87.89 (+0.15) |
CRF: conditional random fields; SSVM: structural support vector machine; CEMP: Chemical Entity Mention Recognition in Patents.
The performance of CRF-based and SSVM-based systems on the test set for the CEMP task (%)
| Training dataset | Algorithm | Precision | Recall | F-measure |
|---|---|---|---|---|
| Train + development | CRF | | 89.64 | 88.59 |
| Train + development | SSVM | 87.18 | **90.78** | **88.94** |
CRF: conditional random fields; SSVM: structural support vector machine; CEMP: Chemical Entity Mention Recognition in Patents. Top performance in each column is bolded.
The performance of CRF-based and SSVM-based systems on the test set for the CPD task (%)
| Training dataset | Algorithm | Sensitivity | Specificity | Accuracy | MCC | P_full_R |
|---|---|---|---|---|---|---|
| Train + development | CRF | 98.32 | | 94.59 | 87.85 | 66.27 |
| Train + development | SSVM | **98.60** | 87.21 | **94.75** | **88.24** | **66.57** |
CRF: conditional random fields; SSVM: structural support vector machine; CPD: chemical passage detection. Top performance in each column is bolded.
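For reference, the CPD metrics in the table above can all be derived from a binary confusion matrix over passages. The sketch below uses the standard definitions of sensitivity, specificity, accuracy and MCC; the counts passed in are hypothetical and are not the challenge's actual confusion matrix.

```python
import math


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def cpd_metrics(tp, fp, tn, fn):
    """Passage-level metrics from true/false positive/negative counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _safe_div(tp, tp + fn),   # recall on chemical passages
        "specificity": _safe_div(tn, tn + fp),   # recall on non-chemical passages
        "accuracy": _safe_div(tp + tn, tp + fp + tn + fn),
        "mcc": _safe_div(tp * tn - fp * fn, denom),
    }


# Hypothetical counts, for illustration only.
m = cpd_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

MCC stays informative when the two passage classes are imbalanced, which is one reason to report it alongside accuracy.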
Examples of chemical named entity recognition errors
| Error type | Example |
|---|---|
| Gene & proteins | The compositions comprise antisense compounds, particularly antisense oligonucleotides, targeted to |
| Breaking long chemicals | The derivative has a structure expressed by the formula ( |
| Recognize partially | |
| Unmatched punctuations | particularly sphingosine (SPH) and |
| Uncommon context | The invention relates to a steroid derivative which steroidal skeleton is bound at |
*The correct chemical mentions are bolded and underlined, while the misrecognized chemicals are bolded and italicized.