O A Tarasova, A V Rudik, N Yu Biziukova, D A Filimonov, V V Poroikov.
Abstract
MOTIVATION: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical-chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced. METHODS AND
Keywords: CNE; CNER; Chemical named entity recognition; Mpro inhibitors; Naïve Bayes classifier; SARS-CoV-2
Year: 2022 PMID: 35964150 PMCID: PMC9375066 DOI: 10.1186/s13321-022-00633-4
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 8.489
Examples of chemical named entities belonging to particular types
| Type | Example of a chemical named entity |
|---|---|
| Abbreviation | Mtx |
| Systematic | Anthracene, phenylenediamine |
| Formula | H2O2 |
| Family | Flavonoids |
| Trivial | Haloperidol |
| Chemical Named Entity (CNE) | Anthracene, phenylenediamine, haloperidol, H2O2, Mtx, flavonoids |
| Non-CNE | Quarantine |
Fig. 1 a The types are arranged for each variant: target token and a set of tokens before and after the target token; b an example of a set of multi-n-grams with n = 5 generations
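The text representation described in the Fig. 1 caption (a target token plus the tokens around it, broken into n-grams of five symbols) can be illustrated with a minimal sketch. The function names `char_ngrams` and `context_features`, the lowercasing, the handling of tokens shorter than n, and the tagging of each n-gram with its relative position are all assumptions made for illustration; the source does not specify how the authors generate their multi-n-grams.

```python
def char_ngrams(token, n=5):
    """Return all overlapping character n-grams of a token.

    Assumption: a token no longer than n is returned whole.
    """
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def context_features(tokens, index, window=1, n=5):
    """Collect n-grams for the target token and its neighbours.

    Each n-gram is paired with the offset of its token relative to the
    target (-1 = token before, 0 = target, +1 = token after).
    This position tagging is a hypothetical design choice.
    """
    features = []
    for offset in range(-window, window + 1):
        pos = index + offset
        if 0 <= pos < len(tokens):  # skip positions outside the sentence
            for gram in char_ngrams(tokens[pos].lower(), n):
                features.append((offset, gram))
    return features


# Toy sentence; "haloperidol" is the target token.
tokens = "treated with haloperidol daily".split()
features = context_features(tokens, tokens.index("haloperidol"))
```

With a window of one token, the feature list here covers "with", "haloperidol", and "daily"; widening `window` or changing `n` gives the other variants compared in Fig. 2.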
Fig. 2 The relationship between length of an n-gram, context window, and accuracy of CNER: a for class “Systematic”, b for class “Trivial”, and c average IA for all classes
Accuracy of chemical named entity recognition using the naïve Bayes approach, with texts represented by n-grams of five symbols and a context window of one token before and after the analyzed token
| Type | N* | R** | IA*** |
|---|---|---|---|
| Abbreviation | 12,506 | 118 | 0.99 |
| Formula | 13,466 | 110 | 0.99 |
| Family | 19,017 | 78 | 0.97 |
| Systematic | 32,510 | 46 | 0.99 |
| Trivial | 25,140 | 59 | 0.98 |
| CNE | 102,639 | 14 | 0.98 |
| Non-CNE | 1,480,509 | 1.01 | 0.98 |
*: N is the number of text fragments used for training
**: R is the ratio of the number of all tokens to the number of tokens belonging to a certain type, indicating a measure of dataset imbalance
***: IA is the invariant accuracy
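The imbalance measure R defined in the table footnote (all tokens divided by the tokens of a given type) is simple arithmetic, shown in the sketch below. The function name `imbalance_ratio` and the toy label list are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return R for each type: total token count / count of that type."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / count for label, count in counts.items()}


# Toy labelling (illustrative only): 13 non-CNE tokens and 1 trivial name.
labels = ["Non-CNE"] * 13 + ["Trivial"]
ratios = imbalance_ratio(labels)
```

A larger R means a rarer type: in this toy example the single "Trivial" token gives R = 14, while the dominant "Non-CNE" type stays close to 1.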
Fig. 3 The relationships between the values of accuracy and B-statistics for the types: a “Systematic”; b “Trivial”; c “CNE”; d “non-CNE”
Fig. 4 An example of chemical named entity extraction based on naïve-Bayes estimations