Lezan Hawizy, David M Jessop, Nico Adams, Peter Murray-Rust.
Abstract
BACKGROUND: The primary method of scientific communication is the published scientific article or thesis, which uses natural language combined with domain-specific terminology. As such, these documents contain free-flowing unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches.
Year: 2011 PMID: 21575201 PMCID: PMC3117806 DOI: 10.1186/1758-2946-3-17
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Figure 1. Tokenisation.
Figure 2. OSCAR Tagging.
Figure 3. Regex Tagging.
Figure 4. English POS Tagging.
Figure 5. Basic English Syntax Tree. http://en.wikipedia.org/wiki/File:Basic_english_syntax_tree.svg.
Figure 6. AST Output of ANTLR Parse.
Phrases Recognised by ChemicalTagger
| Phrase Name | Example |
|---|---|
| Add-Phrase | Benzoyl peroxide (85 mg) was |
| Apparatus-Action | A 50-ml round-bottom flask |
| Concentrate-Phrase | The filtrate was |
| Cool-Phrase | The reaction was then |
| Degass-Phrase | The solution was |
| Dissolve-Phrase | Salt was |
| Dry-Phrase | The yellow product was |
| Extract-Phrase | the products were |
| Filter-Phrase | The solution was |
| Heat-Phrase | The mixture was |
| Partition-Phrase | The reaction mixture was |
| Precipitate-Phrase | |
| Purify-Phrase | The mixture was |
| Quench-Phrase | The reaction was |
| Recover-Phrase | The precipitate was |
| Remove-Phrase | The solvent was |
| Stir-Phrase | The reaction mixture was |
| Synthesize-Phrase | |
| Wait-Phrase | The mixture was |
| Wash-Phrase | The resin was |
| Yield-Phrase | Chromatography |
Figure 7.
Figure 8. Graph of Reaction Paths.
Number of Phrases Marked up by Annotators and ChemicalTagger
| Phrase Name | Annotators' Markup | ChemicalTagger Markup |
|---|---|---|
| Add | 46-49 | 47 |
| ApparatusAction | 18-23 | 21 |
| Concentrate | 10-11 | 11 |
| Cool | 23-28 | 24 |
| Degass | 19-29 | 22 |
| Dissolve | 29-34 | 30 |
| Dry | 36-40 | 39 |
| Extract | 11-12 | 10 |
| Filter | 21-26 | 20 |
| Heat | 15-26 | 17 |
| Partition | 2-7 | 3 |
| Precipitate | 15-20 | 13 |
| Purify | 25-32 | 26 |
| Quench | 16-16 | 16 |
| Recover | 0-9 | 9 |
| Remove | 18-21 | 30 |
| Stir | 33-37 | 34 |
| Synthesize | 50-66 | 46 |
| Wait | 2-8 | 14 |
| Wash | 25-26 | 25 |
| Yield | 35-39 | 36 |
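
One way to read the table above is to check, for each phrase type, whether the ChemicalTagger count falls inside the span of counts produced by the annotators. The sketch below does this, assuming that an entry such as `46-49` is the minimum and maximum count across the annotators (an interpretation, not something stated in the table). The name `outside_annotator_range` is hypothetical and introduced only for illustration.

```python
# Counts copied from the table above:
# phrase name -> (annotator minimum, annotator maximum, ChemicalTagger count).
# Reading "46-49" as a min-max range across annotators is an assumption.
counts = {
    "Add": (46, 49, 47),
    "ApparatusAction": (18, 23, 21),
    "Concentrate": (10, 11, 11),
    "Cool": (23, 28, 24),
    "Degass": (19, 29, 22),
    "Dissolve": (29, 34, 30),
    "Dry": (36, 40, 39),
    "Extract": (11, 12, 10),
    "Filter": (21, 26, 20),
    "Heat": (15, 26, 17),
    "Partition": (2, 7, 3),
    "Precipitate": (15, 20, 13),
    "Purify": (25, 32, 26),
    "Quench": (16, 16, 16),
    "Recover": (0, 9, 9),
    "Remove": (18, 21, 30),
    "Stir": (33, 37, 34),
    "Synthesize": (50, 66, 46),
    "Wait": (2, 8, 14),
    "Wash": (25, 26, 25),
    "Yield": (35, 39, 36),
}


def outside_annotator_range(table):
    """Return the phrase types whose tool count lies outside [minimum, maximum]."""
    return {
        name: (low, high, tool)
        for name, (low, high, tool) in table.items()
        if not low <= tool <= high
    }


outliers = outside_annotator_range(counts)
for name, (low, high, tool) in sorted(outliers.items()):
    print(f"{name}: annotators {low}-{high}, ChemicalTagger {tool}")
```

Under this reading, six of the 21 phrase types (Extract, Filter, Precipitate, Remove, Synthesize and Wait) fall outside the annotators' range, with Remove and Wait over-counted and the other four under-counted.
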
Action Name Agreement (%)
| | Annotator1 | Annotator2 | Annotator3 | Annotator4 | ChemicalTagger |
|---|---|---|---|---|---|
| Annotator1 | - | 91.4 | 94.0 | 94.3 | 92.1 |
| Annotator2 | 91.4 | - | 92.2 | 92.5 | 91.5 |
| Annotator3 | 94.0 | 92.2 | - | 94.0 | 92.0 |
| Annotator4 | 94.3 | 92.5 | 94.0 | - | 92.2 |
| ChemicalTagger | 92.1 | 91.5 | 92.0 | 92.2 | - |
| Mean ChemicalTagger-annotator agreement | 91.9 | | | | |
| Mean inter-annotator agreement | 93.1 | | | | |
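
The two summary values at the foot of the table above appear to be means over the pairwise matrix: one over the four ChemicalTagger-annotator pairs and one over the six annotator-annotator pairs. The following sketch recomputes both from the rounded entries shown. The function name `summarise` is hypothetical, and because the inputs are already rounded to one decimal place the results may differ from the published figures in the last digit.

```python
from statistics import mean

# Upper triangle of the symmetric agreement matrix above (values in %).
agreement = {
    ("Annotator1", "Annotator2"): 91.4,
    ("Annotator1", "Annotator3"): 94.0,
    ("Annotator1", "Annotator4"): 94.3,
    ("Annotator1", "ChemicalTagger"): 92.1,
    ("Annotator2", "Annotator3"): 92.2,
    ("Annotator2", "Annotator4"): 92.5,
    ("Annotator2", "ChemicalTagger"): 91.5,
    ("Annotator3", "Annotator4"): 94.0,
    ("Annotator3", "ChemicalTagger"): 92.0,
    ("Annotator4", "ChemicalTagger"): 92.2,
}


def summarise(pairs, system="ChemicalTagger"):
    """Return (mean system-vs-human agreement, mean human-vs-human agreement)."""
    system_scores = [value for pair, value in pairs.items() if system in pair]
    human_scores = [value for pair, value in pairs.items() if system not in pair]
    return mean(system_scores), mean(human_scores)


system_mean, human_mean = summarise(agreement)
print(f"ChemicalTagger vs annotators: {system_mean:.2f}")
print(f"Between annotators:           {human_mean:.2f}")
```

The same function can be applied to the other agreement matrices in this record by swapping in their values.
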
Filtered Phrase Agreement (%)
| | Annotator1 | Annotator2 | Annotator3 | Annotator4 | ChemicalTagger |
|---|---|---|---|---|---|
| Annotator1 | - | 75.1 | 70.2 | 75.0 | 61.4 |
| Annotator2 | 75.1 | - | 77.6 | 80.0 | 60.7 |
| Annotator3 | 70.2 | 77.6 | - | 79.0 | 56.5 |
| Annotator4 | 75.0 | 80.0 | 79.0 | - | 63.0 |
| ChemicalTagger | 61.4 | 60.7 | 56.5 | 63.0 | - |
| Mean ChemicalTagger-annotator agreement | 60.4 | | | | |
| Mean inter-annotator agreement | 76.2 | | | | |
Phrase Alignment Using the Needleman-Wunsch Algorithm
| Annotator1 | Annotator2 |
|---|---|
| 1. to a 25 ml three-necked round-bottomed flask fitted with a dean-stark trap, a condenser, and a nitrogen inlet/outlet and magnetic stirrer | 1. a 25 ml three-necked round-bottomed flask fitted with a dean-stark trap, a condenser, and a nitrogen inlet/outlet |
| 2. was subsequently sealed with a rubber septum | 2. which was subsequently sealed with a rubber septum |
| 3. stirring the reaction mixture overnight at room temperature | 3. after stirring the reaction mixture overnight at room temperature |
| 4. evaporation of the eluate | |
| 5. afforded 8 as a white solid (2.63 g, 57% yield) | 4. which then afforded 8 as a white solid (2.63 g, 57% yield) |
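
The alignment in the table above can be illustrated with a small Needleman-Wunsch implementation that treats each annotator's phrase list as a sequence and scores pairs of phrases by string similarity. This is a minimal sketch, not the authors' implementation: the similarity function (based on `difflib.SequenceMatcher`), its rescaling to the range -1 to 1, and the gap penalty of -0.5 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

GAP = -0.5  # assumed gap penalty, not taken from the source


def similarity(a, b):
    """Assumed phrase score: SequenceMatcher ratio rescaled from [0, 1] to [-1, 1]."""
    return 2.0 * SequenceMatcher(None, a, b, autojunk=False).ratio() - 1.0


def needleman_wunsch(xs, ys, score=similarity, gap=GAP):
    """Globally align two phrase lists; a gap is represented by None."""
    n, m = len(xs), len(ys)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = table[i - 1][0] + gap
    for j in range(1, m + 1):
        table[0][j] = table[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]),  # pair phrases
                table[i - 1][j] + gap,  # phrase in xs has no partner
                table[i][j - 1] + gap,  # phrase in ys has no partner
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]):
            aligned.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or table[i][j] == table[i - 1][j] + gap):
            aligned.append((xs[i - 1], None))
            i -= 1
        else:
            aligned.append((None, ys[j - 1]))
            j -= 1
    aligned.reverse()
    return aligned


annotator1 = [
    "to a 25 ml three-necked round-bottomed flask fitted with a dean-stark trap, "
    "a condenser, and a nitrogen inlet/outlet and magnetic stirrer",
    "was subsequently sealed with a rubber septum",
    "stirring the reaction mixture overnight at room temperature",
    "evaporation of the eluate",
    "afforded 8 as a white solid (2.63 g, 57% yield)",
]
annotator2 = [
    "a 25 ml three-necked round-bottomed flask fitted with a dean-stark trap, "
    "a condenser, and a nitrogen inlet/outlet",
    "which was subsequently sealed with a rubber septum",
    "after stirring the reaction mixture overnight at room temperature",
    "which then afforded 8 as a white solid (2.63 g, 57% yield)",
]

alignment = needleman_wunsch(annotator1, annotator2)
for left, right in alignment:
    print(f"{(left or '-')[:40]:<40} | {(right or '-')[:40]}")
```

With these assumed scores, the phrase "evaporation of the eluate" is paired with a gap, matching the empty cell in the table, while the remaining phrases are paired with their near-identical counterparts.
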
Phrase Alignment Agreement (%)
| | Annotator1 | Annotator2 | Annotator3 | Annotator4 | ChemicalTagger |
|---|---|---|---|---|---|
| Annotator1 | - | 90.2 | 89.2 | 91.1 | 88.4 |
| Annotator2 | 90.2 | - | 90.8 | 91.6 | 89.8 |
| Annotator3 | 89.2 | 90.8 | - | 91.6 | 87.2 |
| Annotator4 | 91.1 | 91.6 | 91.6 | - | 90.2 |
| ChemicalTagger | 88.4 | 89.8 | 87.2 | 90.2 | - |
| Mean ChemicalTagger-annotator agreement | 88.9 | | | | |
| Mean inter-annotator agreement | 90.8 | | | | |