Seid Muhie Yimam, Chris Biemann, Ljiljana Majnaric, Šefket Šabanović, Andreas Holzinger.
Abstract
In this article, we demonstrate the impact of interactive machine learning: we develop a biomedical entity recognition dataset using a human-in-the-loop approach. In contrast to classical machine learning, human-in-the-loop approaches do not operate on predefined training or test sets, but assume that human input regarding system improvement is supplied iteratively. Here, during annotation, a machine learning model is built on previous annotations and used to propose labels for subsequent annotation. To demonstrate that such interactive and iterative annotation speeds up the development of quality dataset annotation, we conduct three experiments. In the first experiment, we carry out a simulation of iterative annotation and show that only a handful of medical abstracts need to be annotated to produce suggestions that increase annotation speed. In the second experiment, clinical doctors conducted a case study in annotating medical terms in documents relevant to their research. The third experiment explores the annotation of semantic relations with relation instance learning across documents. The experiments validate our method qualitatively and quantitatively, and give rise to a more personalized, responsive information extraction technology.
Keywords: Biomedical entity recognition; Data mining; Human in the loop; Interactive annotation; Knowledge discovery; Machine learning; Relation learning
Year: 2016 PMID: 27747591 PMCID: PMC4999566 DOI: 10.1007/s40708-016-0036-4
Source DB: PubMed Journal: Brain Inform ISSN: 2198-4026
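The iterative annotation loop described in the abstract (train on previous annotations, propose labels for the next document, let the human correct them, repeat) can be sketched as a small simulation. This is a minimal illustration only: the most-frequent-label lookup below is a stand-in for the actual learner, and the function names `train`, `suggest`, `simulate` and the toy sentences are hypothetical, not taken from the article or from WebAnno.

```python
from collections import Counter, defaultdict

def train(annotated):
    """Build a most-frequent-label lookup from sentences of (token, label) pairs.

    Stand-in model (assumption): the article does not describe its learner here.
    """
    counts = defaultdict(Counter)
    for sentence in annotated:
        for token, label in sentence:
            counts[token.lower()][label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def suggest(model, tokens):
    """Propose a label per token; unseen tokens default to 'O' (outside)."""
    return [model.get(t.lower(), "O") for t in tokens]

def simulate(gold_sentences):
    """Simulate interactive annotation, one sentence per round.

    Each round: retrain on everything annotated so far, propose labels for the
    next sentence, score the proposals against the gold labels, then add the
    gold sentence to the training pool (the 'human correction' step).
    Returns the per-round fraction of correct suggestions.
    """
    annotated, history = [], []
    for sentence in gold_sentences:
        model = train(annotated)
        tokens = [tok for tok, _ in sentence]
        gold = [lab for _, lab in sentence]
        proposed = suggest(model, tokens)
        correct = sum(p == g for p, g in zip(proposed, gold))
        history.append(correct / len(gold))
        annotated.append(sentence)
    return history

# Toy gold data (hypothetical, for illustration only).
GOLD = [
    [("IL-2", "protein"), ("binds", "O"), ("T", "cell"), ("cells", "cell")],
    [("IL-2", "protein"), ("activates", "O"), ("T", "cell"), ("cells", "cell")],
]

history = simulate(GOLD)
print(history)
```

In this toy run the suggestions improve as soon as one sentence has been annotated, which is the effect the simulation experiment measures at scale.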
Fig. 3 Automation suggestions using the WebAnno automation component after annotating (b) 5 initial and (c) 9 additional abstracts. Correct suggestions are marked in grey, while wrong suggestions are marked in red. (a) is the correct annotation by a medical expert. (Color figure online)
Relation types from (a) the BioNLP shared task 2011 and (b) identified during the relation annotation process by our medical expert
| Relation type | Description |
|---|---|
| (a) Relation types from BioNLP 2011 | |
| Equivalent | Two protein or cell components are equivalent |
| Protein-component | The protein-component is a less specific object-component relation that holds between a gene or protein and its component, such as a protein domain or the promoter of a gene. |
| Subunit-complex | Subunit-complex is a component-object relation that holds between a protein complex and its subunits, individual proteins |
| (b) New relation types | |
| Activator-reactor | Two proteins linked with the same reaction; the first one is responsible for starting the reaction and the second one responsible for its sustainability |
| Antibody–antigen | An immune protein with the ability to specifically bind the antigen, a foreign substance, and to neutralise its toxicity |
| Cell-marker | A set of surface proteins typical for a cell lineage or a stage of development |
| Cell-variant | The main cell lineage and the subtypes which are the parts of this larger cell family |
| DNA-transcript | DNA and its mRNA (messenger RNA), which translates the gene's message to a protein product |
| Ligand–receptor | Two proteins or molecules which can bind to each other because of the complementarity of the binding site |
| Protein-variant | Two proteins with similar structure and function |
Fig. 1 Relation copy annotator. Upper pane: relation annotation by the annotator. Lower pane: relation suggestions that can be copied by the user to the upper pane
Evaluation result for the BioNLP-NLPBA 2004 task using an interactive online learning approach with different sizes of training dataset (in number of sentences) measured in precision, recall and F-measure on the fixed development dataset
| Sentences | Recall | Precision | F-score |
|---|---|---|---|
| 40 | 27.27 | 39.05 | 32.11 |
| 120 | 37.74 | 44.01 | 40.63 |
| 280 | 46.68 | 51.39 | 48.92 |
| 600 | 53.23 | 54.89 | 54.05 |
| 1240 | 57.83 | 57.74 | 57.78 |
| 2520 | 59.35 | 61.26 | 60.29 |
| 5080 | 62.32 | 64.03 | 63.16 |
| 10,200 | 66.43 | 67.50 | 66.96 |
| 18,555 | 69.48 | 69.16 | 69.32 |
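The F-score column above can be reproduced from the recall and precision columns, assuming "F-measure" means the balanced F1 score (the harmonic mean of precision and recall); the table does not state this explicitly. The sketch below recomputes each row and checks that it agrees with the reported value to within rounding. The helper name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (sentences, recall, precision, reported F-score), copied from the table.
ROWS = [
    (40, 27.27, 39.05, 32.11),
    (120, 37.74, 44.01, 40.63),
    (280, 46.68, 51.39, 48.92),
    (600, 53.23, 54.89, 54.05),
    (1240, 57.83, 57.74, 57.78),
    (2520, 59.35, 61.26, 60.29),
    (5080, 62.32, 64.03, 63.16),
    (10200, 66.43, 67.50, 66.96),
    (18555, 69.48, 69.16, 69.32),
]

# Absolute difference between the recomputed and the reported F-score per row.
deviations = [abs(f_measure(p, r) - f) for _, r, p, f in ROWS]

for (n, r, p, f), d in zip(ROWS, deviations):
    print(f"{n:>6} sentences: recomputed {f_measure(p, r):.2f}, reported {f:.2f}")
```

Every row agrees with the reported value to within 0.01, so the columns are internally consistent under this definition.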
Fig. 2 Learning curve showing the performance of interactive automation for BioNLP-NLPBA 2004 data set using different sizes of training data. (Color figure online)
Machine learning automation and expert annotator performance for BioNLP 2011 REL shared task dataset
| Mode | Annotator type | Recall | Precision | F-score |
|---|---|---|---|---|
| Automation | Entity | 61.94 | 49.31 | 54.91 |
| Automation | Protein | 57.31 | 50.97 | 53.95 |
| Expert | Entity | 29.11 | 22.90 | 25.63 |
| Expert | Protein | 71.94 | 59.28 | 65.00 |
Analysis of relation suggestions. For a total of 20 randomly selected BioNLP 2011 REL shared task documents, a total of 397 relations were annotated. In the process, the system produced on average 2.1 suggestions per relation and 19.85 suggestions per document. The last column shows the average number of relation suggestions across several documents
| Docs | All | Rels | Per rel. | Per doc | Across docs |
|---|---|---|---|---|---|
| 20 | 397 | 193 | 2.1 | 19.85 | 0.18 |
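The two per-unit averages in this table follow from its first three columns by simple division. The sketch below assumes that the "All" column (397) is the count being averaged and that "Rels" (193) is the number of distinct relations; this reading is inferred from the arithmetic and is not stated in the source. The "across docs" value (0.18) cannot be derived from the other columns and is left out.

```python
# Values copied from the table; the meaning of each column is an assumption.
docs = 20          # number of documents
all_count = 397    # "All" column
rels = 193         # "Rels" column

per_relation = all_count / rels   # average per relation
per_document = all_count / docs   # average per document

print(f"per relation: {per_relation:.1f}")
print(f"per document: {per_document:.2f}")
```

Rounded to the precision used in the table, these give 2.1 per relation and 19.85 per document, matching the reported figures.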