Tudor Groza, Hamed Hassanzadeh, Jane Hunter.
Abstract
Today's search engines and digital libraries offer little or no support for discovering the scientific artifacts (hypotheses, supporting/contradicting statements, or findings) that form the core of written scientific communication. Consequently, we currently have no means of identifying central themes within a domain or of detecting gaps between accepted knowledge and newly emerging knowledge, as a means for tracking the evolution of hypotheses from incipient phases to maturity or decline. We present a hybrid Machine Learning approach, using an ensemble of four classifiers, for recognizing scientific artifacts (ie, hypotheses, background, motivation, objectives, and findings) within biomedical research publications, as a precursory step toward the general goal of automatically creating argumentative discourse networks that span multiple publications. The performance achieved by the classifiers ranges from 15.30% to 78.39%, depending on the target class. The set of features used for classification has led to promising results. Furthermore, using the features strictly in a local, publication scope, ie, without aggregating corpus-wide statistics, increases the versatility of the ensemble of classifiers and enables its direct applicability without the necessity of re-training.
Keywords: conceptualization zones; information extraction; scientific artifacts
Year: 2013 PMID: 23645987 PMCID: PMC3623603 DOI: 10.4137/BII.S11572
Source DB: PubMed Journal: Biomed Inform Insights ISSN: 1178-2226
The CoreSC annotation scheme [10] and its adaptation to our goals.
| Category | Description | Re-purposed category |
|---|---|---|
| Hypothesis (HYP) | A statement that needs to be confirmed by experiments and data | Hypothesis (HYP) |
| Motivation (MOT) | The reasons supporting the investigation | Motivation (MOT) |
| Background (BAC) | Accepted background knowledge and previous work | Background (BAC) |
| Goal (GOA) | The target state of the investigation | Objective (OBJ) |
| Object (OBJ) | The main theme or product of the investigation | Objective (OBJ) |
| Method-New (MET) | Means by which the investigation is carried out and the goal is planned to be achieved | Out of scope (O) |
| Method-Old (MET) | A method proposed in previous works | Background (BAC) |
| Experiment (EXP) | An experimental method | Out of scope (O) |
| Model (MOD) | A description of the model or framework used in the investigation | Out of scope (O) |
| Observation (OBS) | A statement describing data or phenomena encountered during the investigation | Finding (FIN) |
| Result (RES) | A factual statement about the outcome of the investigation | Finding (FIN) |
| Conclusion (CON) | A statement that connects observations and results to the initial hypothesis | Finding (FIN) |
Notes: The left column presents the original annotation scheme used in the ART corpus, while the right column shows the transformations we have applied to re-purpose this scheme to our goals.
The coverage of the re-purposed classes from the ART corpus [2].
| Category | No. sentences | Coverage | Re-purposed category | No. sentences | Coverage |
|---|---|---|---|---|---|
| HYP | 780 | 1.95% | HYP | 780 | 1.95% |
| MOT | 541 | 1.35% | MOT | 541 | 1.35% |
| BAC | 7,606 | 19.05% | BAC | 10,229 | 25.62% |
| MET (old) | 2,623 | 6.57% | | | |
| GOA | 582 | 1.45% | OBJ | 1,743 | 4.36% |
| OBJ | 1,161 | 2.90% | | | |
| MET (new) | 1,658 | 4.15% | O | 9,172 | 22.97% |
| EXP | 3,858 | 9.66% | | | |
| MOD | 3,656 | 9.15% | | | |
| OBS | 5,410 | 13.55% | FIN | 17,450 | 43.71% |
| RES | 8,404 | 21.05% | | | |
| CON | 3,636 | 9.10% | | | |
Notes: Naturally, by merging some of the initial types, our re-purposed classes gained more weight in the overall corpus distribution.
Experimental results of the individual classifiers.
| Classifier | HYP P | HYP R | HYP F1 | MOT P | MOT R | MOT F1 | BAC P | BAC R | BAC F1 | OBJ P | OBJ R | OBJ F1 | FIN P | FIN R | FIN F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MALLET | 14.84 | 1.77 | 3.16 | 15.74 | 0.52 | 0.98 | 59.41 | 46.60 | 51.86 | 49.39 | 12.24 | 19.41 | 65.23 | 86.21 | 74.20 |
| CRF++ | 18.87 | 8.38 | 10.95 | 20.47 | 2.59 | 4.54 | 60.93 | 59.21 | | 52.93 | 29.15 | | 71.78 | 82.98 | |
| YamCha1vs1 | 19.70 | 11.31 | | 17.12 | 9.07 | | 54.47 | 60.96 | 57.35 | 43.44 | 25.84 | 32.30 | 73.20 | 77.91 | 75.43 |
| YamCha1vsAll | 12.08 | 6.05 | 7.79 | 12.69 | 7.63 | 9.45 | 55.93 | 57.43 | 56.50 | 34.44 | 26.01 | 29.41 | 71.29 | 78.11 | 74.49 |
Notes: Bold numbers denote the best F1 score achieved for the particular class. We can observe that YamCha1vs1 outperforms the other classifiers in the first two classes, while CRF++ achieves the best results in the latter three.
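The scores above are standard per-class precision, recall, and F1 computed over sentence-level labels. A minimal sketch of this computation (the gold/predicted sequences are illustrative, not corpus data):

```python
# Per-class precision, recall, and F1 over sentence-level labels.
# The gold/pred sequences below are illustrative, not corpus data.

def prf1(gold, pred, cls):
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["HYP", "BAC", "FIN", "FIN", "OBJ"]
pred = ["HYP", "FIN", "FIN", "BAC", "OBJ"]
print(prf1(gold, pred, "FIN"))  # (0.5, 0.5, 0.5)
```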
Experimental results of the direct set operations (see list below).
| Operation | HYP P | HYP R | HYP F1 | MOT P | MOT R | MOT F1 | BAC P | BAC R | BAC F1 | OBJ P | OBJ R | OBJ F1 | FIN P | FIN R | FIN F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DS_OP1 | 17.14 | 15.12 | | 17.53 | 10.54 | 13.05 | 51.46 | 70.88 | 59.47 | 43.91 | 36.05 | | 68.74 | 89.63 | |
| DS_OP2 | 14.48 | 12.39 | 13.59 | 16.74 | 9.60 | 12.06 | 49.92 | 66.67 | 56.89 | 42.54 | 29.58 | 34.76 | 64.28 | 92.64 | 75.86 |
| DS_OP3 | 14.48 | 12.39 | 12.75 | 14.50 | 9.82 | 11.63 | 52.66 | 69.68 | | 37.44 | 38.83 | 37.93 | 67.27 | 90.65 | 77.18 |
| DS_OP4 | 14.80 | 13.75 | 13.72 | 13.98 | 13.19 | | 50.72 | 66.27 | 57.29 | 35.02 | 38.40 | 36.47 | 68.93 | 86.02 | 76.48 |
Notes: Bold numbers denote the best F1 score achieved. As expected, the best results were achieved by the set operations that include the two best individual classifiers, CRF++ and YamCha1vs1. Direct set operations: DS_OP1: CRF++ ∪ YamCha1vs1; DS_OP2: MALLET ∪ YamCha1vs1; DS_OP3: CRF++ ∪ YamCha1vsAll; DS_OP4: YamCha1vs1 ∪ YamCha1vsAll.
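A direct set operation can be read as a union over the sentence sets each classifier labels with the target class. A minimal sketch, with illustrative sentence indices and predictions:

```python
# A direct set operation: union of the sentence sets two classifiers
# label with a given class (e.g. DS_OP1 = CRF++ ∪ YamCha1vs1).
# Classifier outputs below are illustrative, not corpus data.

def positives(pred, cls):
    return {i for i, p in enumerate(pred) if p == cls}

crfpp      = ["HYP", "BAC", "O",   "FIN"]
yamcha1vs1 = ["O",   "BAC", "HYP", "FIN"]

ds_op1_hyp = positives(crfpp, "HYP") | positives(yamcha1vs1, "HYP")
print(sorted(ds_op1_hyp))  # [0, 2]
```

Taking the union keeps every sentence either classifier flags, which raises recall at the expense of precision and is consistent with the high recall figures in the table above.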
Experimental results of the paired set operations (see list below)—best F1 scores are marked in bold.
| Operation | HYP P | HYP R | HYP F1 | MOT P | MOT R | MOT F1 | BAC P | BAC R | BAC F1 | OBJ P | OBJ R | OBJ F1 | FIN P | FIN R | FIN F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PS_OP1 | 23.17 | 4.56 | 7.34 | 25.66 | 1.31 | 2.45 | 63.81 | 55.64 | 59.21 | 58.52 | 25.16 | | 72.07 | 83.42 | 77.28 |
| PS_OP2 | 26.82 | 6.14 | | 22.99 | 4.41 | | 59.31 | 60.90 | | 53.80 | 23.90 | 32.96 | 70.72 | 87.37 | 78.13 |
| PS_OP3 | 21.49 | 4.24 | 6.93 | 21.54 | 3.69 | 6.18 | 58.93 | 60.47 | 59.53 | 52.94 | 21.95 | 30.80 | 71.16 | 86.89 | |
Notes: Here we observe the effect induced by the best individual classifiers on the overall pair efficiency: they complement the less efficient classifiers to achieve the best results (eg, PS_OP2 in HYP, MOT and BAC). Paired set operations: PS_OP1: (MALLET ∪ CRF++) ∩ (YamCha1vs1 ∪ YamCha1vsAll); PS_OP2: (MALLET ∪ YamCha1vs1) ∩ (CRF++ ∪ YamCha1vsAll); PS_OP3: (MALLET ∪ YamCha1vsAll) ∩ (CRF++ ∪ YamCha1vs1).
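A paired set operation composes two unions with an intersection. A sketch of PS_OP1 under the same kind of illustrative setup (toy predictions, not corpus data):

```python
# PS_OP1 = (MALLET ∪ CRF++) ∩ (YamCha1vs1 ∪ YamCha1vsAll), applied to
# the sentence sets labelled with one class. Predictions are illustrative.

def positives(pred, cls):
    return {i for i, p in enumerate(pred) if p == cls}

mallet       = ["HYP", "O",   "O",   "O"]
crfpp        = ["O",   "HYP", "O",   "O"]
yamcha1vs1   = ["HYP", "HYP", "O",   "O"]
yamcha1vsall = ["O",   "O",   "HYP", "O"]

ps_op1 = (positives(mallet, "HYP") | positives(crfpp, "HYP")) \
       & (positives(yamcha1vs1, "HYP") | positives(yamcha1vsall, "HYP"))
print(sorted(ps_op1))  # [0, 1]
```

The outer intersection requires agreement between the two pairs, trading some recall back for precision, which matches the generally higher precision of the paired operations compared with the direct ones.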
Experimental results of the voting mechanism.
| Veto holder | HYP P | HYP R | HYP F1 | MOT P | MOT R | MOT F1 | BAC P | BAC R | BAC F1 | OBJ P | OBJ R | OBJ F1 | FIN P | FIN R | FIN F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MALLET | 22.63 | 3.13 | 5.33 | 5.92 | 0.40 | 0.74 | 62.77 | 56.36 | 59.15 | 55.14 | 22.72 | 32.01 | 70.26 | 87.60 | 77.93 |
| CRF++ | 21.01 | 5.06 | 7.87 | 12.89 | 1.31 | 2.36 | 62.40 | 58.06 | | 57.12 | 26.09 | | 71.64 | 86.64 | |
| YamCha1vs1 | 23.31 | 6.14 | | 21.55 | 4.57 | | 59.11 | 60.08 | 59.44 | 54.51 | 24.28 | 33.40 | 72.29 | 83.85 | 77.60 |
| YamCha1vsAll | 16.41 | 3.78 | 5.99 | 20.06 | 4.59 | 7.34 | 59.55 | 59.66 | 59.44 | 53.42 | 24.24 | 33.20 | 71.63 | 84.55 | 77.51 |
Notes: We can observe that the results follow the same pattern as in the individual classification: the classifiers that achieved the best individual results also perform well as veto holders in the voting mechanism.
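The exact voting rule is not spelled out in this excerpt, so the following is only one plausible sketch, in which the designated veto holder can override the majority vote whenever it predicts its veto class:

```python
# One plausible reading of a voting mechanism with a veto holder
# (an assumption; the article's exact rule is not given here).

from collections import Counter

def vote(predictions, veto_holder, veto_class):
    """predictions: {classifier_name: label} for a single sentence."""
    if predictions[veto_holder] == veto_class:
        return veto_class
    return Counter(predictions.values()).most_common(1)[0][0]

preds = {"MALLET": "BAC", "CRF++": "FIN",
         "YamCha1vs1": "BAC", "YamCha1vsAll": "BAC"}
print(vote(preds, "CRF++", "FIN"))   # 'FIN' (veto overrides the majority)
print(vote(preds, "MALLET", "FIN"))  # 'BAC' (plain majority)
```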
F1 scores for one-feature classification using the best CRF++ model.
| Feature | HYP | MOT | BAC | OBJ | FIN |
|---|---|---|---|---|---|
| f_adjectives | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_cc | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_figs | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_pronouns | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_relsectplace | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_sectionplace | 0.00 | 0.00 | 0.00 | 0.00 | 60.98 |
| f_topics_distro | 0.00 | 0.00 | 0.00 | 0.00 | 61.26 |
| f_verbs | 0.00 | 0.00 | 0.00 | | |
| f_adverbs | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_citation | 0.00 | 0.00 | 0.00 | | |
| f_hedging | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_pronouns_distro | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_rhetrel | 0.00 | 0.00 | 0.02 | 0.00 | 61.05 |
| f_tables | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_vbclasses | 0.00 | 0.00 | 0.00 | | |
| f_adverbs_distro | 0.00 | 0.00 | 0.00 | 0.00 | 61.33 |
| f_citation_distro | 0.00 | 0.00 | 0.00 | | |
| f_paperplace | 0.00 | 0.00 | 0.00 | | |
| f_relparplace | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_rhetrel_distro | 0.00 | 0.00 | 1.37 | 0.00 | 61.40 |
| f_topic | 0.00 | 0.00 | 0.00 | 0.00 | 61.05 |
| f_vbclasses_distro | 0.00 | 0.00 | 0.00 | | |
| cf_adverbs | 0.00 | 0.00 | 0.00 | 13.78 | 60.87 |
| cf_hedging | 0.00 | 0.00 | 0.00 | 13.78 | 61.17 |
| cf_pronouns | 0.00 | 0.00 | 0.00 | 13.78 | 61.17 |
| cf_rhetrel | 0.00 | 0.00 | 0.32 | 13.78 | 61.09 |
| cf_vbclasses | 0.00 | 0.00 | | 13.78 | |
| cf_verbs | 0.00 | 0.00 | 4.62 | 13.78 | |
Notes: Bold numbers denote the most interesting F1 scores achieved by diverse features. We can observe that only the well-represented classes in the corpus yield successful F1 scores.
F1 scores for “leave one feature out” classification using the best CRF++ model.
| Feature | HYP | MOT | BAC | OBJ | FIN |
|---|---|---|---|---|---|
| f_adjectives | 10.81 | 3.16 | 59.55 | 37.14 | 76.60 |
| f_cc | 9.65 | 3.27 | 59.67 | 37.11 | 76.57 |
| f_figs | 9.80 | 2.97 | 59.13 | | 76.25 |
| f_pronouns | | 4.06 | 59.62 | 37.39 | 76.52 |
| f_relsectplace | 8.20 | 2.67 | 59.36 | 34.99 | 76.10 |
| f_sectionplace | 10.73 | 4.09 | 59.64 | 37.21 | 76.46 |
| f_topics_distro | 7.94 | 1.94 | 59.77 | 35.35 | 76.58 |
| f_verbs | 9.61 | 2.88 | 59.51 | 34.13 | 76.02 |
| f_adverbs | 10.36 | 3.54 | 59.61 | 36.64 | 76.61 |
| f_citation | 10.39 | | 59.68 | 36.92 | 76.63 |
| f_hedging | 6.91 | 3.51 | 59.68 | 36.03 | 76.62 |
| f_pronouns_distro | 10.63 | 3.50 | 59.85 | | 76.68 |
| f_rhetrel | | 4.34 | | 36.86 | 76.66 |
| f_tables | 10.79 | | 59.50 | 36.98 | 76.23 |
| f_vbclasses | 10.03 | 4.30 | 59.72 | 37.40 | 76.64 |
| f_adverbs_distro | 9.80 | | 59.63 | | 76.68 |
| f_citation_distro | 10.36 | 3.89 | 58.90 | 36.26 | 76.25 |
| f_paperplace | 9.32 | 3.48 | 56.94 | 35.71 | 74.45 |
| f_relparplace | 10.45 | 3.64 | 59.53 | 35.33 | 76.26 |
| f_rhetrel_distro | 9.10 | 3.23 | 59.97 | 36.45 | 76.63 |
| f_topic | 9.64 | | 59.35 | 37.17 | 76.30 |
| f_vbclasses_distro | 10.03 | 4.25 | 59.54 | 36.10 | 76.36 |
| cf_adverbs | 9.53 | | 59.60 | | 76.74 |
| cf_hedging | 9.68 | 4.39 | 59.75 | 36.89 | 76.77 |
| cf_pronouns | 10.51 | 4.10 | 59.81 | 36.42 | 76.62 |
| cf_rhetrel | 10.11 | | 59.87 | 36.75 | 76.69 |
| cf_vbclasses | 10.92 | 4.43 | 59.70 | 37.27 | 76.42 |
| cf_verbs | 10.29 | 2.61 | 59.77 | | 76.35 |
Notes: As with the one-feature classification, bold numbers denote the most interesting F1 scores achieved, this time by leaving the corresponding feature out of the classification model. Interestingly, in some cases the F1 score is higher than the overall F1 score achieved by the final model.
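The "leave one feature out" protocol itself is straightforward to sketch: retrain with each feature removed in turn and compare against the full-model score. `train_and_eval` below is a hypothetical stand-in for the actual CRF++ training run, and the toy scorer is purely illustrative:

```python
# "Leave one feature out": retrain with each feature removed and report
# the drop relative to the full model. train_and_eval is a hypothetical
# stand-in for the CRF++ run; the toy weights are purely illustrative.

def ablate(features, train_and_eval):
    full = train_and_eval(features)
    deltas = {}
    for feat in features:
        reduced = [f for f in features if f != feat]
        deltas[feat] = full - train_and_eval(reduced)
    return deltas  # positive delta: removing the feature hurts F1

weights = {"f_hedging": 2.0, "f_verbs": 1.0, "f_topic": 0.5}
score = lambda feats: 70.0 + sum(weights[f] for f in feats)
print(ablate(list(weights), score))
# {'f_hedging': 2.0, 'f_verbs': 1.0, 'f_topic': 0.5}
```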
Classification confusion matrix based on the best CRF++ model.
| | HYP | MOT | BAC | OBJ | FIN | O |
|---|---|---|---|---|---|---|
| HYP | | 1 | 83 | 4 | 398 | 45 |
| MOT | 3 | | 396 | 19 | 90 | 19 |
| BAC | 39 | 31 | | 123 | 2,409 | 1,420 |
| OBJ | 0 | 5 | 368 | | 572 | 288 |
| FIN | 58 | 0 | 1,399 | 160 | | 1,168 |
Note: Bold numbers denote correctly classified instances.
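Per-class precision and recall can be recovered from such a matrix by dividing each diagonal entry by its column sum and row sum, respectively. A sketch on a small illustrative two-class matrix (not the table above):

```python
# Per-class precision and recall from a confusion matrix whose rows are
# gold classes and columns predicted classes. The 2x2 matrix below is a
# small illustrative example, not the article's matrix.

def per_class_pr(matrix, labels):
    scores = {}
    for k, cls in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # rest of the gold row
        fp = sum(row[k] for row in matrix) - tp  # rest of the predicted column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cls] = (precision, recall)
    return scores

labels = ["BAC", "FIN"]
matrix = [[8, 2],   # gold BAC: 8 predicted BAC, 2 predicted FIN
          [4, 6]]   # gold FIN: 4 predicted BAC, 6 predicted FIN
print(per_class_pr(matrix, labels))  # BAC: P=2/3, R=0.8; FIN: P=0.75, R=0.6
```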
Comparative overview of the F1 scores achieved by the different techniques.
| | HYP | MOT | BAC | OBJ | FIN |
|---|---|---|---|---|---|
| Individual | 13.71 | 11.72 | 59.88 | 37.40 | 76.93 |
| Direct set operations | | | 59.85 | | 77.76 |
| Paired set operations | 9.59 | 7.22 | 59.95 | 34.99 | 78.20 |
| Voting | 9.53 | 7.37 | | 35.67 | |
Notes: Overall, the proposed hybrid methods perform best, with the direct set operations being the most consistent aggregation technique.
Comparative overview of the classification results on the 11 classes proposed by Liakata et al [16].
| | BAC | CON | EXP | GOA | MET | MOT | OBS | RES | MOD | OBJ | HYP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Liakata et al | 62 | 45 | 76 | 28 | 30 | 20 | 51 | 51 | 53 | 34 | 19 |
| Individual | 57 | 41 | 73 | 19 | 22 | 11 | 46 | 45 | 34 | 30 | 16 |
| Direct set operations | 56 | | 73 | | 20 | 14 | | | 37 | | |
| Paired set operations | 57 | 40 | | 16 | 19 | 8 | 42 | 43 | 37 | 24 | 13 |
| Voting | 57 | 40 | 73 | 17 | 20 | 8 | 44 | 43 | 36 | 27 | 17 |
Notes: Bold numbers denote F1 scores close to the ones obtained by Liakata et al. Overall, our model performs fairly well, with a few exceptions; the decrease in efficiency is explained by the increased versatility of our classification model.