Maria Liakata, Shyamasree Saha, Simon Dobnik, Colin Batchelor, Dietrich Rebholz-Schuhmann.
Abstract
MOTIVATION: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication.
Year: 2012 PMID: 22321698 PMCID: PMC3315721 DOI: 10.1093/bioinformatics/bts071
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
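The 11 CoreSC categories named in the abstract can be written down as a small label set. The sketch below is illustrative only and is not the authors' implementation: the class name `CoreSC` and the helper `from_abbrev` are hypothetical, and the three-letter values are taken from the column headers of the tables in this record.

```python
from enum import Enum


class CoreSC(Enum):
    """Sentence-level Core Scientific Concepts listed in the abstract.

    Values are the three-letter abbreviations used as table column headers.
    """
    BACKGROUND = "Bac"
    CONCLUSION = "Con"
    EXPERIMENT = "Exp"
    GOAL = "Goa"
    METHOD = "Met"
    MOTIVATION = "Mot"
    OBSERVATION = "Obs"
    RESULT = "Res"
    MODEL = "Mod"
    OBJECT = "Obj"
    HYPOTHESIS = "Hyp"


def from_abbrev(abbrev: str) -> CoreSC:
    """Look up a category by abbreviation, ignoring case ("BAC" or "Bac")."""
    return CoreSC(abbrev.strip().capitalize())


# Hypothetical usage: attach one label to one sentence.
labelled_sentence = ("We hypothesise that X inhibits Y.", from_abbrev("HYP"))
```

Normalising the case in `from_abbrev` lets the same lookup serve both the "Bac" style and the "BAC" style of header used in the tables.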
Fig. 1. Example of discourse labelling using CoreSC.
Fig. 2. Hierarchical representation of concepts and properties in the CoreSC scheme.
Statistics on the training data (ART/CoreSC corpus)
| Measure | Bac | Con | Exp | Goa | Met | Mot | Obs | Res | Mod | Obj | Hyp | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of sentences | 7606 | 3636 | 3858 | 582 | 4281 | 541 | 5410 | 8404 | 3656 | 1161 | 780 | 39 915 |
| Number of words | 193 930 | 102 173 | 93 882 | 16 564 | 107 309 | 13 737 | 123 394 | 224 353 | 99 313 | 29 215 | 21 315 | 1 025 185 |
| Percentage of sentences | 19 | 9 | 10 | 1 | 11 | 1 | 14 | 21 | 9 | 3 | 2 | |
| Words per sentence (mean) | 25.5 | 28.1 | 24.33 | 28.46 | 25.07 | 25.39 | 22.81 | 26.7 | 27.16 | 25.16 | 27.33 | |
| Words per sentence (SD) | 12.32 | 12.49 | 20.6 | 12.69 | 11.4 | 10.34 | 11.44 | 12.65 | 14.76 | 11.16 | 12.01 | |
| κ-IAA | 0.87 | 0.89 | 0.65 | 0.60 | 0.74 | 0.46 | 0.79 | 0.78 | 0.43 | 0.81 | 0.46 | |
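The derived rows of the statistics table (percentage of sentences and mean words per sentence) follow directly from the two count rows. The sketch below recomputes them as a consistency check; the counts are copied from the table, while the function names are hypothetical.

```python
# Counts copied from the "Number of sentences" and "Number of words" rows.
sentences = {
    "Bac": 7606, "Con": 3636, "Exp": 3858, "Goa": 582, "Met": 4281, "Mot": 541,
    "Obs": 5410, "Res": 8404, "Mod": 3656, "Obj": 1161, "Hyp": 780,
}
words = {
    "Bac": 193930, "Con": 102173, "Exp": 93882, "Goa": 16564, "Met": 107309,
    "Mot": 13737, "Obs": 123394, "Res": 224353, "Mod": 99313, "Obj": 29215,
    "Hyp": 21315,
}


def sentence_share(counts):
    """Percentage of all sentences per category, rounded to whole percent."""
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}


def mean_words_per_sentence(word_counts, sentence_counts):
    """Mean sentence length per category, rounded to two decimals."""
    return {
        cat: round(word_counts[cat] / sentence_counts[cat], 2)
        for cat in sentence_counts
    }


share = sentence_share(sentences)
mean_len = mean_words_per_sentence(words, sentences)
```

The category counts sum to the reported totals (39 915 sentences, 1 025 185 words), and the recomputed shares and means agree with the table rows after rounding.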
Micro precision, recall and F-measure for different system configurations, with highest value for each measure per category in bold
| Features | Classifier | Acc | BAC P | BAC R | BAC F | CON P | CON R | CON F | EXP P | EXP R | EXP F | GOA P | GOA R | GOA F | MET P | MET R | MET F | MOT P | MOT R | MOT F | OBS P | OBS R | OBS F | RES P | RES R | RES F | MOD P | MOD R | MOD F | OBJ P | OBJ R | OBJ F | HYP P | HYP R | HYP F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B | Multinomial | 14 | 19 | 19 | 19 | 9 | 9 | 9 | 10 | 10 | 10 | 1 | 1 | 1 | 11 | 11 | 11 | 1 | 1 | 1 | 13 | 14 | 14 | 21 | 22 | 21 | 9 | 9 | 9 | 3 | 3 | 3 | 2 | 2 | 2 |
| CRFS | 45.3 | 45 | 60 | 51 | 35 | 28 | 31 | 72 | 74 | 73 | 17 | 24 | 29 | 28 | 28 | 24 | 11 | 15 | 49 | 49 | 49 | 43 | 43 | 43 | 49 | 47 | 48 | 42 | 26 | 32 | 24 | 12 | 16 | ||
| 39.9 | 41 | 47 | 44 | 26 | 23 | 25 | 61 | 67 | 64 | 27 | 18 | 22 | 24 | 24 | 24 | 15 | 12 | 14 | 43 | 45 | 44 | 38 | 37 | 38 | 39 | 39 | 39 | 31 | 25 | 27 | 17 | 13 | 15 | ||
| L | 41.2 | 41 | 50 | 45 | 30 | 22 | 25 | 66 | 66 | 66 | 30 | 26 | 26 | 25 | 26 | 22 | 15 | 18 | 48 | 44 | 46 | 39 | 44 | 41 | 45 | 38 | 41 | 33 | 28 | 30 | 21 | 13 | 16 | ||
| 41.2 | 44 | 52 | 47 | 29 | 27 | 28 | 68 | 70 | 69 | 32 | 19 | 24 | 26 | 26 | 26 | 15 | 12 | 13 | 44 | 45 | 45 | 40 | 38 | 39 | 43 | 42 | 42 | 31 | 23 | 26 | 17 | 12 | 14 | ||
| L | 44.9 | 45 | 62 | 52 | 36 | 27 | 30 | 74 | 66 | 70 | 39 | 12 | 19 | 28 | 31 | 29 | 26 | 05 | 08 | 49 | 43 | 46 | 42 | 49 | 45 | 52 | 42 | 46 | 39 | 19 | 26 | 23 | 09 | 13 | |
| CRFS | 34.7 | 51 | 55 | 32 | 39 | 72 | 75 | 39 | 13 | 19 | 17 | 22 | 28 | 10 | 15 | 40 | 46 | 31 | 37 | 37 | 45 | 42 | 18 | 25 | 28 | 06 | 10 | ||||||||
| 34.6 | 53 | 60 | 56 | 41 | 39 | 40 | 69 | 73 | 71 | 32 | 21 | 25 | 27 | 25 | 26 | 23 | 45 | 47 | 46 | 44 | 43 | 43 | 45 | 45 | 45 | 35 | 26 | 30 | 18 | 12 | 14 | ||||
| CRFS | 50.4 | 56 | 65 | 60 | 46 | 44 | 74 | 41 | 21 | 31 | 13 | 18 | 50 | 49 | 47 | 53 | 52 | 42 | 28 | 26 | 14 | 18 | |||||||||||||
| 47.7 | 54 | 60 | 57 | 43 | 40 | 41 | 69 | 73 | 71 | 35 | 20 | 25 | 29 | 28 | 28 | 22 | 16 | 18 | 47 | 49 | 48 | 45 | 44 | 45 | 49 | 49 | 49 | 38 | 28 | 32 | 21 | 18 | |||
| L | 56 | 50 | 41 | 72 | 75 | 37 | 20 | 26 | 25 | 29 | 25 | 06 | 10 | 47 | 50 | 54 | 13 | ||||||||||||||||||
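The F column in the table above is the harmonic mean of the P and R columns. The sketch below shows that relation, plus one common way of micro-averaging by pooling counts. The pooling scheme and the per-fold counts are assumptions made for illustration; this record does not state how the authors aggregated their scores.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(counts):
    """Micro-averaged P, R and F from (tp, fp, fn) tuples.

    Counts are pooled before dividing, so larger folds weigh more.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# A fully populated BAC triple from the table: P = 41, R = 47, reported F = 44.
bac_f = f_measure(41, 47)

# Hypothetical (tp, fp, fn) counts for two folds, not taken from the paper.
p, r, f = micro_prf([(8, 2, 4), (6, 4, 2)])
```

Because the table reports P and R as rounded percentages, an F value recomputed from them can differ from the printed F by one point.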
Fig. 3. F-score versus κ for CoreSCs.
Fig. 4. Confidence value when LibS, CRFSuite and manual annotation agree.
Fig. 5. Confidence value when there is no agreement on annotation.
Fig. 6. Confidence value scores per category for the entire corpus.
Fig. 7. Confusion matrix for CoreSC categories according to LibS.
F-measures for CRFSuite LOOF, 9-fold cross-validation
| Feat | BAC | CON | EXP | GOA | MET | MOT | OBS | RES | MOD | OBJ | HYP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 60 | 44 | 76 | 28 | 30 | 18 | 51 | 47 | 52 | 34 | 18 | |
| 60 | 44 | 76 | 27 | 30 | 18 | 51 | 47 | 53 | 34 | 19 | |
| 44 | 76 | 26 | 30 | 18 | 51 | 47 | 52 | 34 | 17 | ||
| 60 | 44 | 76 | 28 | 18 | 51 | 47 | 52 | 34 | 18 | ||
| 60 | 43 | 76 | 27 | 30 | 18 | 51 | 47 | 52 | 33 | 17 | |
| 60 | 44 | 76 | 27 | 30 | 18 | 51 | 48 | 51 | 35 | 17 | |
| 60 | 44 | 76 | 27 | 30 | 51 | 47 | 52 | 34 | 18 | ||
| 60 | 44 | 76 | 27 | 30 | 19 | 51 | 47 | 52 | 34 | 18 | |
| 60 | 43 | 75 | 26 | 30 | 18 | 51 | 47 | 52 | 34 | 19 | |
| 60 | 44 | 76 | 50 | 47 | 51 | 33 | 15 | ||||
| 59 | 43 | 75 | 27 | 30 | 22 | 50 | 46 | 50 | 32 | 19 | |
| 25 | |||||||||||
| 60 | 44 | 75 | 27 | 30 | 18 | 51 | 47 | 52 | 35 | 17 | |
| 60 | 44 | 76 | 26 | 18 | 51 | 47 | 53 | 34 | 16 | ||
| 60 | 45 | 76 | 27 | 30 | 51 | 48 | 52 | 34 | 18 | ||
| 60 | 45 | 76 | 28 | 30 | 18 | 51 | 47 | 53 | 34 | 19 | |
| 60 | 44 | 76 | 27 | 30 | 18 | 51 | 47 | 52 | 34 | 18 | |
| 60 | 44 | 76 | 28 | 30 | 18 | 51 | 47 | 52 | 34 | 18 | |
| 60 | 44 | 76 | 27 | 30 | 18 | 51 | 47 | 52 | 34 | 18 | |
| 60 | 44 | 76 | 27 | 30 | 51 | 47 | 52 | 34 | 17 |
F-measures for LibLinear LOOF, 9-fold cross-validation
| Feat | BAC | CON | EXP | GOA | MET | MOT | OBS | RES | MOD | OBJ | HYP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 57 | 41 | 71 | 25 | 28 | 18 | 48 | 45 | 49 | 32 | 18 | |
| 55 | 41 | 71 | 28 | 27 | 20 | 48 | 43 | 46 | 32 | 17 | |
| 57 | 42 | 71 | 24 | 28 | 19 | 48 | 45 | 48 | 31 | 17 | |
| 55 | 41 | 71 | 25 | 28 | 19 | 48 | 45 | 48 | 31 | 15 | |
| 57 | 40 | 72 | 25 | 28 | 20 | 48 | 45 | 49 | 33 | 18 | |
| 57 | 41 | 71 | 26 | 28 | 21 | 48 | 45 | 49 | 31 | 17 | |
| 57 | 41 | 71 | 26 | 28 | 19 | 48 | 44 | 48 | 32 | 17 | |
| 57 | 41 | 71 | 25 | 28 | 19 | 48 | 45 | 49 | 32 | 17 | |
| 57 | 41 | 71 | 26 | 28 | 18 | 48 | 45 | 48 | 32 | 18 | |
| 56 | 40 | 70 | 25 | 27 | 19 | 48 | 44 | 47 | 30 | 19 | |
| 56 | 41 | 72 | 26 | 27 | 47 | 44 | 46 | 31 | 16 | ||
| 54 | 40 | 70 | 25 | 27 | 20 | 46 | 43 | 45 | 27 | 17 | |
| 56 | 40 | 71 | 29 | 19 | 47 | 44 | 49 | 31 | 17 | ||
| 57 | 42 | 71 | 25 | 28 | 20 | 48 | 45 | 49 | 32 | 16 | |
| 57 | 41 | 71 | 25 | 29 | 21 | 47 | 45 | 48 | 31 | 17 | |
| 57 | 42 | 71 | 25 | 28 | 20 | 48 | 45 | 49 | 32 | 19 | |
| 57 | 41 | 71 | 26 | 28 | 19 | 48 | 45 | 48 | 32 | 17 | |
| 57 | 41 | 71 | 25 | 28 | 18 | 48 | 45 | 49 | 32 | 18 | |
| 57 | 41 | 71 | 26 | 28 | 19 | 48 | 45 | 49 | 33 | 17 | |
| 57 | 41 | 71 | 24 | 28 | 20 | 48 | 45 | 49 | 32 | 16 |
Fig. 8. Single feature classification with LibL, illustrating the contribution of 15 individual features.
Numbers for each type of feature
| Feat | Uni | Bi | GR | VPOS | Subj | Dobj | Iobj | Obj2 | Verb | VC | P | H | Gl | L | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No. | 10 515 | 42 438 | 11 854 | 6 | 3843 | 7414 | 45 | 59 | 1543 | 10 | 1 | 12 | 53 | 9 | 3 |
Abbreviations: L, length; H, history; C, citation; Gl, global features, including absloc, sectionid, struct1-3 and sectionloc.