Yukun Chen, Thomas A Lask, Qiaozhu Mei, Qingxia Chen, Sungrim Moon, Jingqi Wang, Ky Nguyen, Tolulola Dawodu, Trevor Cohen, Joshua C Denny, Hua Xu.
Abstract
BACKGROUND: Active learning (AL) has shown promising potential to minimize annotation cost while maximizing performance in building statistical natural language processing (NLP) models. However, very few studies have investigated AL in a real-life setting in the medical domain.
Year: 2017 PMID: 28699546 PMCID: PMC5506567 DOI: 10.1186/s12911-017-0466-9
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Workflow of Active LEARNER
An example of a cluster that contains multiple sentences about prescription
| Cluster representative | Sentences in a cluster |
|---|---|
| X | 14. Dulcolax 10 mg p.o. or p.r. q. day p.r.n. |
| | 9. Amaryl 4 mg p.o. q. day . |
| | 3. Nortriptyline 25 mg p.o. q. h.s. |
| | 2) Metformin 500 mg p.o. q. 8 h . |
| | … |
Schedule of the user study
| Time | Event | Task | Duration |
|---|---|---|---|
| Week 0 | Guided Training | 1. Annotation guideline review | 30 min |
| | | 2. Sentence-by-sentence annotation and review using the interface | 45 min |
| | Practice | 1. Three quarter-hour sessions of annotation practice | 45 min |
| | | 2. Four half-hour sections of annotation using | 3 h |
| Week 1 | Annotation warm up training | 1. Sentence-by-sentence annotation and review using the interface | 15-30 min |
| | | 2. Two 15 min sessions of annotation practice | 30 min |
| | Main study for method Random | Four 30 min sessions of annotation using Method 2 | 3 h |
| | | 15-min break between sessions | |
| Week 2 | Annotation warm up training | 1. Sentence-by-sentence annotation and review using the interface | 15-30 min |
| | | 2. Two 15 min sessions of annotation practice | 30 min |
| | Main study for method CAUSE | Four 30 min sessions of annotation using Method 2 | 3 h |
| | | 15-min break between sessions | |
Characteristics (counts of sentences, words, and entities, words per sentence, entities per sentence, and entity density) in five folds of the dataset and the pool of querying data
| | Sentence count | Word count | Entity count | Words per sentence | Entities per sentence | Entity densityᵃ |
|---|---|---|---|---|---|---|
| Fold 1 | 4,085 | 44,403 | 5,395 | 10.87 | 1.32 | 0.25 |
| Fold 2 | 4,085 | 45,588 | 5,183 | 11.16 | 1.27 | 0.24 |
| Fold 3 | 4,084 | 45,355 | 5,201 | 11.11 | 1.27 | 0.24 |
| Fold 4 | 4,085 | 45,141 | 5,263 | 11.05 | 1.29 | 0.25 |
| Fold 5 | 4,084 | 44,834 | 5,177 | 10.98 | 1.27 | 0.24 |
| Pool (Fold 2 + 3 + 4 + 5) | 16,338 | 180,918 | 20,824 | 11.07 | 1.27 | 0.24 |

ᵃ Entity density is the number of words of the entities divided by the total number of words
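
As a minimal sketch of how the derived columns relate to the raw counts, the snippet below recomputes words per sentence and entities per sentence for Fold 1 and applies the footnote's definition of entity density to a toy example. The helper names (`per_sentence`, `entity_density`) and the BIO-tagged toy sentence are assumptions made for illustration; they are not taken from the study.

```python
def per_sentence(count, sentence_count):
    """Average number of items (words or entities) per sentence."""
    return count / sentence_count


def entity_density(tags):
    """Share of tokens inside an entity, given one BIO tag per token.

    Assumes every token tagged anything other than "O" belongs to an entity.
    """
    inside = sum(1 for tag in tags if tag != "O")
    return inside / len(tags)


# Fold 1 counts copied from the table above.
fold1_sentences, fold1_words, fold1_entities = 4085, 44403, 5395
words_per_sentence = per_sentence(fold1_words, fold1_sentences)
entities_per_sentence = per_sentence(fold1_entities, fold1_sentences)

# Hypothetical tagging of a four-token sentence, for illustration only.
toy_tags = ["B-drug", "I-drug", "O", "O"]
toy_density = entity_density(toy_tags)

print(f"{words_per_sentence:.2f} words/sentence")
print(f"{entities_per_sentence:.2f} entities/sentence")
print(f"{toy_density:.2f} toy entity density")
```

The table does not list the number of words inside entities per fold, so the reported densities cannot be recomputed from it directly; the toy example only shows the ratio being applied.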
Fig. 2 Simulated learning curves by 5-fold cross validation that plot F-measure vs. number of words in the training set for random sampling (Random), least confidence (Uncertainty), and CAUSE that used least confidence to measure uncertainty
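
The least confidence measure named in the caption is commonly defined as one minus the probability the model assigns to its most likely labeling of a sentence. The sketch below illustrates only that scoring and ranking step, under the assumption that a model has already supplied those probabilities; the function names and the toy pool are hypothetical, and the clustering step of CAUSE is not described in this record and is not reproduced here.

```python
def least_confidence(best_sequence_prob):
    """Uncertainty score: 1 minus the probability of the model's best labeling."""
    return 1.0 - best_sequence_prob


def select_most_uncertain(pool, k):
    """Return the ids of the k sentences with the highest least-confidence score.

    pool maps a sentence id to the probability of the model's best label
    sequence for that sentence (assumed to come from some trained tagger).
    """
    ranked = sorted(pool, key=lambda sid: least_confidence(pool[sid]), reverse=True)
    return ranked[:k]


# Hypothetical probabilities for three unlabeled sentences.
pool = {"s1": 0.95, "s2": 0.40, "s3": 0.70}
queried = select_most_uncertain(pool, k=2)
print(queried)
```

With these toy values the least confident sentence is `s2`, followed by `s3`, so those two would be sent for annotation first.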
User annotation counts, speed, and quality comparison in the 120-min main study
| Users | Methods | Annotated entity count | Annotation speed (Entities per min) | Annotation quality (F-measure) |
|---|---|---|---|---|
| User 1 | Random | 945 | 7.88 | 0.82 |
| | CAUSE | 926 | 7.72 | 0.83 |
| User 2 | Random | 882 | 7.35 | 0.81 |
| | CAUSE | 948 | 7.90 | 0.82 |
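
The annotation speeds in this table are consistent with dividing each annotated entity count by the 120 minutes of the main study. The short check below is a sketch of that arithmetic; the variable names are illustrative, and the assumption that exactly 120 minutes was used as the divisor is inferred from the table caption rather than stated outright.

```python
MAIN_STUDY_MINUTES = 120  # duration given in the table caption

# Annotated entity counts copied from the table.
entity_counts = {
    ("User 1", "Random"): 945,
    ("User 1", "CAUSE"): 926,
    ("User 2", "Random"): 882,
    ("User 2", "CAUSE"): 948,
}

# Entities annotated per minute for each user and method.
speeds = {key: count / MAIN_STUDY_MINUTES for key, count in entity_counts.items()}

for (user, method), speed in speeds.items():
    print(f"{user} {method}: {speed:.3f} entities per min")
```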
Fig. 3 Learning curves of F-measure vs. annotation time in minutes by Random and CAUSE from user 1 and 2
Comparison between Random and CAUSE in ALC score and F-measure of the last model in the 120-min main study
| User Index | Evaluated method | ALC scores | F-measure of models at 120 min |
|---|---|---|---|
| User 1 | Random | 0.812 | 0.680 |
| | CAUSE | 0.783 | 0.666 |
| User 2 | Random | 0.820 | 0.682 |
| | CAUSE | 0.831 | 0.691 |
Characteristics of Random and CAUSE in each 120-min main study from user 1 and 2 (part 1)
| User | Method | Annotated Sentences | Words in annotated sentences | Entities in annotated sentences | Words in entities |
|---|---|---|---|---|---|
| User 1 | Random | 655 | 8,023 | 945 | 1,915 |
| | CAUSE | 232 | 6,333 | 926 | 2,145 |
| User 2 | Random | 651 | 7,325 | 882 | 1,952 |
| | CAUSE | 240 | 6,455 | 948 | 2,404 |
Characteristics of Random and CAUSE in each 120-min main study from user 1 and 2 (part 2)
| User | Method | Sentences per min | Words per sentence | Words per min | Entities Per Sentence | Entity Density |
|---|---|---|---|---|---|---|
| User 1 | Random | 5.53 | 12.24 | 67.70 | 1.44 | 0.24 |
| | CAUSE | 1.97 | 27.00 | 53.30 | 3.99 | 0.34 |
| User 2 | Random | 5.55 | 11.25 | 62.44 | 1.35 | 0.27 |
| | CAUSE | 2.01 | 26.98 | 54.33 | 3.95 | 0.37 |