Abstract
The rapid increase in the rate at which digital information is published in all disciplines has created a pressing need for techniques that simplify the use of this information. The chemistry literature is rich in information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature, then "text mining" the extracted data to determine contextual relationships, helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, the authors briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based, machine learning, and hybrid approaches to chemical named entity recognition (NER), together with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
Keywords: Chemical entities; Chemical names; Information extraction
Year: 2014 PMID: 24834132 PMCID: PMC4022577 DOI: 10.1186/1758-2946-6-17
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Example of classes of chemical entities (in bold) extracted by different systems from the chemical literature.
Description and examples of the methods of expressing chemical structural information ([5,8] and http://en.wikipedia.org)
| Naming method | Description | Example |
| --- | --- | --- |
| 1. Systematic names | Reflect the information of the chemical structure, as in International Union of Pure and Applied Chemistry (IUPAC) names. | ‘3-(3,4-dihydroxyphenyl)prop-2-enoic acid’ |
| 2. Trivial names | Do not reflect the structure of the chemical substance. | ‘caffeic acid’, used for ‘3-(3,4-dihydroxyphenyl)prop-2-enoic acid’ |
| 3. Semi-systematic names | At least one part is used in the systematic sense; IUPAC-like, non-IUPAC names. | In ‘N-benzoylglycine’ the part ‘benzoyl’ is systematic, whereas ‘glycine’ is the trivial name for ‘ |
| 4. Common or generic names | Names applied to a class of compounds. | camphor, water and alcohol |
| 5. Registered trademark/brand names | Identify the brand owner as the commercial source of products. | ‘aspirin’ |
| 6. Company codes | A company code identifies the compound within the company. | ZD5077 = ICI204636 = ZM204636 |
| 7. Acronyms and abbreviations | Used to obtain short names. | DMS for dimethyl sulfate |
| 8. Index and reference numbers | Registry numbers such as Chemical Abstracts Service (CAS) registry numbers, Beilstein registry numbers, etc. | The CAS number of water is 7732-18-5 |
| 9. Anaphors | Compounds are named earlier in the text but co-referenced by a shorter name, called the anaphor, later in the text. | A compound number is an anaphor where … bioactivity is found in compounds [ |
| 10. Sum formula | Consists of the elements contributing to a compound and the number of their occurrences. | ‘ |
| 11. Chemical structures | Explicit and implicit structures. | Markush structures, where R1 = CH3, COOH, etc. |
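Of the naming methods listed above, registry numbers are the easiest to recognize with a simple rule. The following is a minimal sketch in Python, assuming the standard CAS layout (two to seven digits, two digits, one check digit) and check-digit rule (body digits weighted 1, 2, 3, ... from the right, summed modulo 10). The names `CAS_PATTERN`, `is_valid_cas`, and `find_cas_numbers` are hypothetical and are not taken from any system discussed in the review.

```python
import re
from typing import List

# Assumed layout of a CAS Registry Number: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if `candidate` has the CAS layout and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


def find_cas_numbers(text: str) -> List[str]:
    """Return every CAS-shaped substring of `text` that passes the check-digit test."""
    return [
        m.group(0)
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(m.group(0))
    ]


if __name__ == "__main__":
    print(find_cas_numbers("The CAS number of water is 7732-18-5."))
```

For water (7732-18-5) the weighted sum is 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, so the check digit is 5. The check-digit test filters out most number strings that merely resemble a registry number, which a bare pattern match would accept.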
Chemical text corpora for evaluating and training the NER applications
| Corpus | Annotated entities | Reference | Availability |
| --- | --- | --- | --- |
| IUPAC training corpus | IUPAC names | [ | |
| SCAI | All chemical names | [ | |
| PubMed corpus | Compounds, reagents, chemical adjectives, enzymes and prefixes | [ | Not available |
| Sciborg corpus | All chemical names | [ | Not available |
| GENIA corpus | Biological and some chemical entities | [ | |
| European Patent Office and ChEBI | All chemical names | [ | |
| CHEMDNER corpus | Chemical compounds and drugs | [ | |
Figure 2. The main steps for developing the chemical NER system.
Description of common categories of textual features with some examples, summarized from [23]
| Feature category | Description and examples |
| --- | --- |
| Linguistic | To find the prefix that is common to all variations of a term, to find the root term of a variant word, to assign each token to a grammatical category, or to divide the text into syntactically correlated groups of words (e.g. chunking, lemmatization, stemming and part-of-speech (POS) tagging) |
| Orthographic | To capture knowledge on word formation from the presence of certain features (e.g. capitalization and symbols) |
| Morphological | To reflect common structures and/or sub-sequences of characters among entities (e.g. suffixes and prefixes, character n-grams and word shape patterns) |
| Context | To establish a higher-level relationship between the tokens and the extracted features (e.g. windows and conjunctions) |
| Lexicons | To add domain knowledge to the feature set in order to optimize the NER system. Dictionaries of domain terms are matched against the entity names in the text, and the resulting tags are used as features. Examples of dictionary types: target entity names and trigger names. |
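The orthographic, morphological and context categories above can be illustrated with a minimal Python sketch. It assumes whitespace-tokenized input; the function names (`word_shape`, `char_ngrams`, `token_features`), the feature keys and the `<PAD>` marker are hypothetical choices for illustration, not the feature set of any particular system. Linguistic and lexicon features are omitted because they need an external tagger or dictionary.

```python
import re
from typing import Dict, List


def word_shape(token: str) -> str:
    """Map uppercase letters to 'A', lowercase to 'a', digits to '0'; keep the rest."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def char_ngrams(token: str, n: int = 3) -> List[str]:
    """Return all contiguous character n-grams of `token`."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def token_features(tokens: List[str], index: int, window: int = 1) -> Dict[str, object]:
    """Build a feature dictionary for the token at `index`."""
    token = tokens[index]
    features: Dict[str, object] = {
        # Orthographic features: capitalization, digits, symbols.
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_symbol": any(not c.isalnum() for c in token),
        # Morphological features: affixes, word shape, character n-grams.
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "shape": word_shape(token),
        "char_3grams": char_ngrams(token.lower()),
    }
    # Context features: neighbouring tokens inside a fixed window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        pos = index + offset
        in_range = 0 <= pos < len(tokens)
        features[f"word[{offset:+d}]"] = tokens[pos].lower() if in_range else "<PAD>"
    return features


if __name__ == "__main__":
    sentence = ["DMS", "means", "dimethyl", "sulfate"]
    print(token_features(sentence, 2))
```

Feature dictionaries of this kind are the usual input to a sequence classifier; which classifier is used is left open here.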
Figure 3. Types of NER systems with some related techniques.
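Of the system types named in this review, the dictionary-based approach is the simplest to sketch. The Python example below assumes a small in-memory lexicon and case-insensitive matching that prefers the longest name; the functions `build_matcher` and `tag_entities` and the sample lexicon are hypothetical illustrations, not a reconstruction of any tool covered by the review.

```python
import re
from typing import Iterable, List, Tuple


def build_matcher(names: Iterable[str]) -> "re.Pattern":
    """Compile one pattern that matches any lexicon entry (non-empty lexicon assumed)."""
    # Sort longest first so 'N-benzoylglycine' is tried before 'glycine'.
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def tag_entities(text: str, matcher: "re.Pattern") -> List[Tuple[int, int, str]]:
    """Return (start, end, surface form) for every lexicon hit in `text`."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


if __name__ == "__main__":
    # Hypothetical lexicon built from example names used in this document.
    lexicon = ["caffeic acid", "aspirin", "camphor", "water",
               "N-benzoylglycine", "glycine"]
    matcher = build_matcher(lexicon)
    sample = "Caffeic acid and N-benzoylglycine were dissolved in water."
    for start, end, name in tag_entities(sample, matcher):
        print(start, end, name)
```

A matcher like this only finds names that are already in the lexicon, so it cannot recognize unseen systematic names. Rule-based and machine learning methods address that limitation.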