
Evaluating the impact of pre-annotation on annotation speed and potential bias: natural language processing gold standard development for clinical named entity recognition in clinical trial announcements.

Todd Lingren, Louise Deleger, Katalin Molnar, Haijun Zhai, Jareen Meinzen-Derr, Megan Kaiser, Laura Stoutenborough, Qi Li, Imre Solti.

Abstract

OBJECTIVE: To present a series of experiments: (1) to evaluate the impact of pre-annotation on the speed of manual annotation of clinical trial announcements; and (2) to test for potential bias, if pre-annotation is utilized.
METHODS: To build the gold standard, 1400 clinical trial announcements from the clinicaltrials.gov website were randomly selected and double annotated for diagnoses, signs, symptoms, Unified Medical Language System (UMLS) Concept Unique Identifiers, and SNOMED CT codes. We used two dictionary-based methods to pre-annotate the text. We evaluated the annotation time and potential bias through F-measures and ANOVA tests and implemented Bonferroni correction.
RESULTS: Time savings ranged from 13.85% to 21.5% per entity. Inter-annotator agreement (IAA) ranged from 93.4% to 95.5%. There was no statistically significant difference for IAA and annotator performance in pre-annotations.
CONCLUSIONS: In every experiment pair, the annotator with the pre-annotated text needed less time to annotate than the annotator with non-labeled text. The time savings were statistically significant. Moreover, the pre-annotation did not reduce the IAA or annotator performance. Dictionary-based pre-annotation is a feasible and practical method to reduce the cost of annotation of clinical named entity recognition in the eligibility sections of clinical trial announcements without introducing bias in the annotation process.


Keywords: Information Extraction; Natural Language Processing; Pre-annotation; clinical trial announcements; named entity recognition; umls


Year: 2013 PMID: 24001514 PMCID: PMC3994857 DOI: 10.1136/amiajnl-2013-001837

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

Objective

Natural language processing (NLP) projects require manually annotated gold standard corpora to train and test supervised, machine learning-based algorithms or, in the case of rule-based methods, to test the performance of the rules. In light of the high cost of expert manual annotations, NLP researchers need robust methods to speed up the annotation process, without biasing the generated gold standard. In our institution, we are working on an NIH-funded project to automate clinical trial eligibility screening by using NLP algorithms. This effort requires the development of a substantial manually annotated gold standard. As such, this annotation is very time-consuming and costly. In this study, our aim is to present a series of experiments: (1) to evaluate the impact of pre-annotation on the speed of manual annotation of clinical trial announcements (CTA); and (2) to test for potential bias, if pre-annotation is utilized. We define potential bias as either increasing the discrepancy between annotators measured by inter-annotator agreement (IAA) or decreasing the agreement (called annotator performance in our study) between the annotations of the annotator with pre-annotated text and the eventual gold standard. The annotation task included labeling medical named entities in two classes: disease/disorder and sign/symptom. Unified Medical Language System (UMLS) Concept Unique Identifiers (CUI) and SNOMED-CT codes were also annotated for each entity. The rest of the paper is structured as follows. In the ‘Background and significance’ section, we present relevant literature. In ‘Data and methods’, we describe the data, experimental methods, and analytical approaches. In the ‘Results’ section, we present the results. In the ‘Discussion’ section, we discuss the findings, limitations, and future research questions. In the final section, we provide our conclusions.

Background and significance

Pre-annotation has been studied widely in NLP tasks such as Named Entity Recognition (NER) (biomedical1–4 and astrophysical5 domains), part of speech (POS) tagging (Wall Street Journal6–9 and medical literature2 10), and Semantic Frame/Role Labeling.11 These approaches used machine learning systems with varying sizes of training data. Some systems did active learning pre-annotation, incrementally training on iterative human input and presenting annotators with pre-annotated text,2 5 8 while others4 10 relied on an existing tool such as MetaMap12 to generate a pre-annotation set to apply to the whole text. Many applications for different domains have been built in order to semi-automatically annotate text as the user is working, updating future files with machine learning output based on previous annotations.13–18 These efforts all seek to decrease annotation time, but in our study we focus on the role of a single pre-annotation set for particular named entities in the clinical domain. The main contribution of this study is in evaluating, in the clinical domain, whether dictionary-based annotation sets provide substantial savings in time without biasing the annotation.

Several studies evaluated the time savings of pre-annotation. Using Wall Street Journal text, Ringger et al6 studied the cost considerations of generating a POS-tagged gold standard using many annotators and were able to reduce, by half, the amount of time it took to annotate the same amount of data. They concluded that the hourly cost savings were partially dependent on the (self-rated) expert level of the annotator. In the biomedical domain, Ganchev et al2 developed a semi-automated system to pre-annotate MEDLINE abstracts with a high-recall named entity tagger for gene mentions, and reported an astounding 75% reduction in time for the best tagger. In the clinical domain and using some of the same entity classes as our study, Ogren et al4 used MetaMap to pre-annotate for disease and disorder. They reported a longer time for the pre-annotation set and doubted that there was any benefit in the pre-annotation method, citing spurious annotations that needed to be corrected. By building on these earlier works, we compare the performance and time savings from different annotation sets in the clinical domain. Our study is unique in the biomedical domain, as it evaluates the statistical significance of the potential bias effect of pre-annotation in addition to time and cost savings.

Machine learning-based pre-annotation is built upon training the model on a small amount of annotated text. A question remains whether the machine learning model is necessary in these cases, or whether a simple dictionary-based pre-annotation set is sufficient. Due to the initial smaller training set, the performance of a machine learning model is expected to be lower than that of a dictionary-based approach. We hypothesize that the dictionary-based approach might not have as many spurious results as Ogren et al’s approach; consequently, the dictionary-based pre-annotation will successfully reduce annotation time.

In evaluating the development of the Penn Treebank, Fort and Sagot8 compared the quality of pre-annotation (using different POS taggers) and reported no significant difference in performance (Krippendorff's α19) between the two annotators, discounting the concern that pre-annotation causes bias. Nor did Névéol et al10 find bias from pre-annotation on semantic annotation of PubMed queries. Other than in some limited domain set tasks, such as surname recognition3 or POS tagging,20 no dictionary-based pre-annotation method has been studied. Although not a dictionary method, pre-annotation of dates based on regular expressions was used to help decrease the time per annotation in a protected health information de-identification task of clinical notes.21

Data and methods

The annotation task in our study included annotating disease/disorder and sign/symptom entities. We followed the annotation guidelines and schema from the SHARPn project.22 The SHARPn guidelines find and normalize clinically relevant mentions to Clinical Element Model templates, linking CUIs to mentions and identifying attributes and modifiers. We employed two experienced annotators (henceforth referred to as A1 and A2) with bachelor degrees who had been trained using these guidelines. One annotator had previous clinical expertise (as a registered nurse) and a Bachelor of Science degree in Nursing. Chapman et al23 demonstrated that using both clinician and non-clinician annotators does not bias the annotated corpus, although non-clinicians need longer training time. The annotators were given access to the UMLS Terminology Services SNOMED CT24 and Metathesaurus Browsers, in order to look up terms and assign CUIs and SNOMED-CT Codes (CODEs). The following is an example sentence from a CTA: ‘Suspected of having lung cancer due to clinical symptoms, such as positive sputum cytology, hemoptysis, unresolved pneumonia, persistent cough…’ A sample screen shot from the SNOMED-CT browser while searching for lung cancer is shown in figure 1.

Figure 1

UMLS Terminology Services SNOMED-CT browser: search for lung cancer.

Malignant tumor of lung is the best match for lung cancer and so the CODE (listed in the browser window as Concept: 363358000) and CUI (C0242379) are annotated with the span lung cancer. The five entities in the sample sentence (lung cancer, symptoms, hemoptysis, pneumonia, and persistent cough) are all annotated with associated CUI and CODEs, as shown in figure 2.

Figure 2

Sample disease/disorder and sign/symptom entities.

Three entities belong to sign/symptom class and two are disease/disorder. Lab or test results (such as positive x-ray or positive sputum cytology) were not annotated. The Protégé plug-in Knowtator25 was used for annotating the corpora. A screenshot from the program used to annotate is shown in figure 3.

Figure 3

Pre-annotated clinical trial announcement text in Knowtator.

Data

The CTA corpus for these experiments is composed of 1400 CTAs randomly selected from the clinicaltrials.gov website26 (a total of 141 386 documents as of March 2013). We annotated only the eligibility criteria sections of the CTAs. One thousand of the 1400 CTAs were previously annotated27 without pre-annotation for disease/disorder and sign/symptom. The 1000 were split in half and randomly assigned to a control group and a dictionary generation set. The control and dictionary generation sets are non-overlapping with the experiment sets. More detail is provided on the distribution of the remaining 400 CTAs into experiment sets in the ‘Methods’ section and in figure 4. The CTAs averaged 196.3 tokens per file, with an average of 7.1 disease/disorder and sign/symptom entities per 100 tokens.

Figure 4

Experiment study design.

Methods

The two annotators were given sets of CTAs (both non-labeled and pre-annotated) to annotate in the Knowtator program for disease/disorder and sign/symptom. The sample size was determined based on the training size requirements of the machine learning algorithms that utilized the annotated CTAs. The underlying informatics projects provided the foundation for the exploratory pre-annotation experiments. The actual sample size is based not on the number of experiment CTAs (400) but on the units of analysis, namely the number of annotated entities and the number of tokens that the annotators read. Across all the control, dictionary, and experiment sets the annotators read nearly 350 000 tokens (348 445) and annotated 19 002 medical named entities. For the non-labeled text, annotators were asked to annotate disease/disorder and sign/symptom entities, as described above. For a pre-annotated text, annotators were given the following choices: removing an annotation they thought was spurious; keeping or modifying said annotation; or adding an additional annotation. Figure 3 depicts the Knowtator program, with a set of pre-annotations on a particular CTA for an annotator to remove, correct, approve, or add a new annotation. In adjudication all disagreements and any remaining ambiguities were resolved.

Pre-annotation procedure

Whereas previous studies relied on machine learning output to generate pre-annotation, we relied on a dictionary method in our study (figure 4). We evaluated two dictionaries of different sizes and origins, and each dictionary entry consisted of three items: term, UMLS CUI, and SNOMED-CT code. The first dictionary type was created by extracting annotations from the dictionary generation set of 500 CTAs, as described in the Data section. This dictionary is called the ‘automated dictionary’, as it represents the automatically extracted set of all of the annotations of the gold standard set. The CTA automated dictionary contains 3414 diseases/disorders and 294 signs/symptoms. The second dictionary type was created manually by the annotators, over several weeks. During the adjudication process of the double annotated, gold standard generation of the dictionary generation set of 500 CTAs, the annotators developed a list of what they determined to be common annotation decisions (‘manual dictionary’). The CTA manual dictionary contains 522 disease/disorder entities and 47 signs/symptoms. We used regular expression matching to pre-annotate the text, with the dictionary terms as input (see figure 4, Experiments). The list of matches and their offsets was imported into Knowtator in order to assign the class labels for each term. We wrote a program to assign the UMLS CUIs and SNOMED-CT codes to the pre-annotated terms. Table 1 shows the number of dictionary matches for each experiment set.
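The dictionary lookup described above can be illustrated with a minimal sketch. This is not the authors' program: the function names, the dictionary layout (lowercased term mapped to class, CUI, and SNOMED-CT code), and the placeholder identifiers for hemoptysis are assumptions made for illustration. Only the lung cancer CUI and code are taken from the text. The sketch also attaches class, CUI, and code in one step, whereas the study assigned class labels in Knowtator and codes with a separate program.

```python
import re

def build_pattern(terms):
    """Compile one case-insensitive regex that matches any term as a whole token."""
    # Longer terms first so that a long term is preferred over a shorter prefix.
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    alternation = "|".join(re.escape(t) for t in ordered)
    # (?<!\w) and (?!\w) keep matches from starting or ending inside a word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def pre_annotate(text, dictionary):
    """Return pre-annotations with character offsets.

    `dictionary` maps a lowercased term to (entity class, UMLS CUI, SNOMED-CT code).
    """
    if not dictionary:
        return []
    pattern = build_pattern(dictionary)
    annotations = []
    for match in pattern.finditer(text):
        entity_class, cui, code = dictionary[match.group(0).lower()]
        annotations.append({
            "start": match.start(),
            "end": match.end(),
            "text": match.group(0),
            "class": entity_class,
            "cui": cui,
            "code": code,
        })
    return annotations

# Hypothetical two-entry dictionary; the hemoptysis identifiers are placeholders.
example_dictionary = {
    "lung cancer": ("disease/disorder", "C0242379", "363358000"),
    "hemoptysis": ("sign/symptom", "CUI-PLACEHOLDER", "CODE-PLACEHOLDER"),
}
example_text = "Suspected of having Lung Cancer due to clinical symptoms, such as hemoptysis."
example_hits = pre_annotate(example_text, example_dictionary)
```

A single alternation keeps the pass over each document linear in practice, but it does not tolerate variable whitespace inside multi-word terms; a production matcher would need to normalize that.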

Table 1

CTA pre-annotation experiments

Document sets	Corpus	Number of files	DD entities	SS entities	Annotator with pre-annotated text	Dictionary method	Hypothesis
Dictionary	CTA	500	6478	484	N/A	N/A
Control	CTA	500	8117	474	N/A	N/A
Experiment 1
1.1	CTA	100	719	39	A2	Manually generated	Using human annotator collected dictionary of annotation terms to pre-annotate CTAs will reduce annotation time without accompanied bias
1.2	CTA	100	603	38	A1	Manually generated
Experiment 2
2.1	CTA	100	878	102	A2	Automatically generated	Using automatically generated dictionary of annotation terms to pre-annotate CTAs will reduce annotation time without accompanied bias
2.2	CTA	100	994	76	A1	Automatically generated

A1, annotator 1; A2, annotator 2; CTA, clinical trial announcements; DD, disease/disorder; SS, sign/symptom.

We split the text for each experiment into two sets, Set1 and Set2. A1 received non-labeled text in Set1 and pre-annotated text in Set2; A2 received pre-annotated text in Set1 and non-labeled text in Set2. Table 1 details each of these sets, as follows: the total number of entities for each set; the annotator who had the pre-annotated set (A1 or A2); the dictionary type that was used for the pre-annotation (manual vs automated); and the hypothesis tested in the experiment. For the dictionary and control sets, the number of entities shown is the number of entities in the gold standard. For the experiment sets, the number is the result of pre-annotation (the number of entities given to the annotator with pre-annotation). Figure 4 also details the study design for the experiments.

Experiments

There are two experiments (labeled: 1, 2 in table 1). As shown in figure 4, each experiment is split into two sub-experiment sets (eg, 1.1, 1.2; 2.1, 2.2). The first document set (listed as ‘Dictionary’) includes 500 traditionally-annotated, gold standard CTAs and is the source of pre-annotation terms for experiments 1 and 2. The second document set (listed as ‘Control’) includes 500 traditionally-annotated, gold standard CTAs. The experiment sets 1 and 2 comprise the remaining 400 CTAs for experiments. Each experiment set was double annotated and adjudicated for a final gold standard.

Experiment 1

A1 was given 100 non-labeled CTAs in set 1.1 and 100 pre-annotated CTAs in set 1.2. A2 was given pre-annotated CTAs in set 1.1 and non-labeled CTAs in set 1.2. The purpose of this experiment was to evaluate the potential bias of the CTA manual dictionary pre-annotation on the annotator and potential pre-annotation time savings using terms for pre-annotation that were collected by the annotators in their earlier CTA annotation projects.

Experiment 2

The purpose of this experiment was to evaluate the potential bias of the CTA automated dictionary pre-annotation on the annotator and potential pre-annotation time savings. A1 was given 100 non-labeled CTAs in set 2.1 and 100 pre-annotated CTAs in set 2.2. A2 was given pre-annotated CTAs in set 2.1 and non-labeled CTAs in set 2.2. The purpose of doing both experiments 1 and 2 is to compare how different pre-annotation dictionary types (automated and manual in the CTA corpus) affect the IAA, performance relative to the eventual gold standard (resulting from the adjudication process), and potential time savings.

Measuring annotator bias

By comparing the IAA for each set in an experiment (eg, experiment 1: sets 1.1 and 1.2), we looked for potential bias caused by annotating text with pre-annotation. The F-measure (equation 3) calculated is the harmonic mean between precision (equation 1) and recall (equation 2). The IAA compares the agreement between each annotator by temporarily treating one annotator (eg, A1) as the gold standard and calculating the F-measure for the other annotator (eg, A2).28 When we report on the F-measure IAA, we list only one per class because the F-measure is identical for each annotator (A1's precision relative to A2 is A2's recall relative to A1).
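The symmetry noted above can be checked with a short sketch. It uses the standard definitions precision = TP/(TP+FP), recall = TP/(TP+FN), and F = 2PR/(P+R), which is what equations 1 to 3 are taken to be here. Representing an annotation as a (start, end, class) tuple and requiring exact matches are assumptions of this sketch; the text does not state the matching criterion, and the example spans are invented.

```python
def f_measure(reference, candidate):
    """F-measure of `candidate` annotations when `reference` is treated as gold."""
    reference, candidate = set(reference), set(candidate)
    true_positives = len(reference & candidate)
    # Precision: share of the candidate's annotations that the reference also has.
    precision = true_positives / len(candidate) if candidate else 0.0
    # Recall: share of the reference's annotations that the candidate found.
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented annotations for one file: (start offset, end offset, entity class).
a1_spans = {(0, 11, "DD"), (20, 30, "SS"), (40, 49, "DD")}
a2_spans = {(0, 11, "DD"), (20, 30, "SS"), (60, 65, "SS"), (70, 75, "DD")}

iaa_a1_as_gold = f_measure(a1_spans, a2_spans)
iaa_a2_as_gold = f_measure(a2_spans, a1_spans)
```

Swapping the roles exchanges precision and recall, so the harmonic mean is unchanged; both calls reduce to 2·TP divided by the sum of the two annotation counts.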

Measuring individual annotators’ distance from adjudicated gold standard

After the double annotation of each experiment set, the annotators met in adjudication (under the supervision of one of the investigators) and came to an agreement on a final gold standard. An F-measure was calculated for each annotator, relative to the gold standard for each entity class (disease/disorder and sign/symptom) for that set. This is what we are calling the annotator's performance. Comparing the performance between the annotator who received non-labeled text and the annotator who received pre-annotated text, within a single experiment set (eg, 1.1), helps to show any potential biasing effect that pre-annotation has on the annotators’ performances, relative to the gold standard. We can also compare the same annotator's (eg, A1) annotation speed, in the experiment set with non-labeled text (eg, 1.1), with the experiment set with pre-annotated text (1.2). The impact of pre-annotation on annotation speed is measured for the same annotator across sub-experiments (eg, 1.1 vs 1.2), using the same corpus and dictionary approach, while the impact of pre-annotation on creating bias is measured between annotators within sub-experiments (eg, A1 vs A2 in 1.1 and then again in 1.2). The experiments are repeated two times, for dictionary differences (manual vs automated). In addition, there is a further multiplying factor of two based on which annotator is getting pre-annotated text. Altogether there are four sub-experiments (as shown in table 1) to control for dictionary and pre-annotation.

Statistical analysis

We performed one-way analysis of variance (ANOVA) on nine variables: A1 F-measure against the gold standard, for disease/disorder annotation; A2 F-measure against the gold standard, for disease/disorder annotation; A1 F-measure against the gold standard, for sign/symptom annotation; A2 F-measure against the gold standard, for sign/symptom annotation; A1 versus A2 IAA, for disease/disorder annotation; A1 versus A2 IAA, for sign/symptom annotation; number of class entities, tokens, and CUI/CODE entities. The purpose of performing an ANOVA test on each of these variables was to determine if the variance between files and annotators was statistically significant. Due to the number of different tests conducted, we applied a very conservative Bonferroni correction to account for the increased possibility of type I error. Thus, to adjust for nine different significance tests with multiple variables that may not be independent,29 findings were considered statistically significant at p<0.0001. The experiment sets were compared with the control documents, both as pairs (1.1/1.2, etc.) and individually (1.1 vs Control, 1.2 vs Control, etc.). To test whether the IAA between annotators or each annotator's performance was significantly different among experiment sets, we calculated the F-measures per file. These per-file F-measures were compared using a one-way ANOVA test for statistical significance among the different experimental groups. Table 2 shows the per-set averages of the F-measures for IAA and annotator performance.
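As a rough illustration of the comparison described above, the sketch below computes the one-way ANOVA F statistic for groups of per-file scores and the plain Bonferroni threshold for nine tests. The function names and the toy numbers are assumptions, not study data, and the sketch stops short of converting the statistic to a p value, which requires the F distribution from a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    group_means = [mean(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    # Raises ZeroDivisionError if there is no within-group variation.
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests

# Toy per-file scores for two groups (illustrative values only).
toy_f, toy_df_between, toy_df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4]])
nine_test_threshold = bonferroni_threshold(0.05, 9)
```

With a family-wise alpha of 0.05, nine tests give a per-test threshold of about 0.0056, so the p<0.0001 cut-off used in the study is stricter than the plain correction would require.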

Table 2

IAA and annotator performance

Experiment set	IAA (%)	A1 performance (%)	A2 performance (%)
1.1	95.5	98.8	96.4
1.2	93.4	98.2	95.2
2.1	93.7	97.0	96.0
2.2	94.7	97.0	96.9

A1, annotator 1; A2, annotator 2; IAA, inter-annotator agreement.


Time savings

To test for time savings in annotation for each set, we recorded the annotation times and compared them to evaluate the effect of pre-annotation. Table 3 displays the time savings for pre-annotated text over non-labeled text. For each experiment set, the amount of time needed to annotate and the calculated time savings of annotating with pre-annotated text are indicated. Also included is the average of both sets of each experiment. For example, set 1.1 took 17.7 h for pre-annotated text and 20.5 h for non-labeled text. This represents an overall time savings of 13.9% for A2, who had pre-annotated text. Also in set 1.1, A2 took an average of 45.4 s per entity with pre-annotated text, while A1 took an average of 7.3 s longer per entity with non-labeled text. The average between the two sub-experiment 1 sets was 16.6% for overall time and for per-entity time savings. The greatest overall time savings is in set 2.2 (automated dictionary pre-annotation) with 21.5%; the average for experiment 2 was 20.8%. A paired t test shows that the time savings in each experiment set were significant (p<0.01).
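The percentage saved can be recomputed from the per-entity times as a quick check. The sketch below assumes the savings are expressed relative to the non-labeled time; the function name is illustrative. Because the published times are rounded, recomputed percentages can differ from the tabulated ones in the first decimal place.

```python
def percent_saved(pre_annotated, non_labeled):
    """Time saved with pre-annotation, as a percentage of the non-labeled time."""
    return (non_labeled - pre_annotated) / non_labeled * 100.0

# Per-entity times in seconds (pre-annotated, non-labeled), as reported in table 3.
per_entity_seconds = {
    "1.1": (45.4, 52.7),
    "1.2": (34.9, 43.3),
    "2.1": (30.5, 38.2),
    "2.2": (28.7, 36.6),
}

savings = {name: percent_saved(pre, non)
           for name, (pre, non) in per_entity_seconds.items()}
```

For set 1.1 this gives (52.7 − 45.4) / 52.7 ≈ 13.85%, which matches the lower bound quoted in the abstract.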

Table 3

Overall and per entity time savings

Experiment set	Overall time, pre-annotated (hours)	Overall time, non-labeled (hours)	Overall % saved	Overall average per experiment (%)	Time per entity, pre-annotated (seconds)	Time per entity, non-labeled (seconds)	Per-entity % saved	Per-entity average per experiment (%)	p Value
1.1	17.7	20.5	13.9		45.4	52.7	13.9		<0.01
1.2	14.3	17.7	19.3	16.6	34.9	43.3	19.3	16.6	<0.01
2.1	14	17.5	20.0		30.5	38.2	20.0		<0.01
2.2	14.25	18.2	21.5	20.8	28.7	36.6	21.5	20.8	<0.01


Comparisons for statistical significance

Each experiment set has three F-measures (agreement between the annotators and performance for each annotator). The performance reported is the combined class F-value, which is the F-measure for both classes of disease/disorder and sign/symptom; these are listed in table 2. Table 4 lists the p values from the results of the ANOVA comparisons for each experiment pair. The purpose of this comparison is to examine if there is a significant difference in an annotator's performance when receiving pre-annotated or non-labeled text. The annotator F-measures are separated according to entity class. The control 500 CTA set, where no pre-annotation occurred, provides a set for comparison. In each column, an experiment set is compared against the control set. In the first column the pooled CTA text sets (1.1–2.2) were compared against the control set. In the second column, the set 1.1 was compared, and so on.

Table 4

Statistical significance of experiments

Statistical significance of experiments 1–2 (CTA)
Variable	CTA vs 500*	1.1 vs 500	1.2 vs 500	2.1 vs 500	2.2 vs 500
A1 vs GS (D)	0.37	0.38	0.43	0.44	0.08
A1 vs GS (S)	0.42	<0.0001	<0.0001	0.3	0.98
A2 vs GS (D)	0.36	0.14	0.01	0.22	0.11
A2 vs GS (S)	0.38	<0.0001	<0.0001	0.57	0.01
IAA (D)	0.34	0.96	0.24	0.54	0.95
IAA (S)	0.06	<0.0001	<0.0001	0.38	0.73
Code_Ent	0.35	0.14	0.28	0.48	0.22
DS_Ent	0.45	0.13	0.03	0.98	0.16
Tokens	0.4	0.06	0.03	0.11	0.15

*Control for CTA.

A1, annotator 1; A2, annotator 2; CTA, clinical trial announcements; D, disease/disorder; GS, gold standard; IAA, inter-annotator agreement; S, sign/symptom.

Bold indicates statistical significance at p<0.0001.

The results in table 4 show that, for signs and symptoms, the annotators’ performance and the IAA in experiment sets 1.1 and 1.2 differ significantly from the control set at the Bonferroni-corrected p<0.0001 level. None of the other comparisons shows a statistically significant difference. Table 4 also lists the p value for each variable in intra-experiment ANOVA comparisons. There is no statistically significant difference between manual and automated experiments.

Discussion

In every experiment pair, the annotator with the pre-annotated text took less time to annotate than the annotator with non-labeled text. This illustrates a clear time savings and, unlike other studies,4 spurious annotations in the pre-annotation set did not affect the annotator’s performance. The time saved in each experiment was significant (p<0.01). The time savings result in part from reducing the amount of time an annotator has to look up entities to match with the UMLS terminology database (see figure 1). The automated dictionary pre-annotation experiment (2.1/2.2) shows greater per-entity time savings, compared to the manual dictionary experiment sets (20.8% time savings vs 16.6%). The smaller time savings of the manual dictionary-based pre-annotation set may be due to a lack of coverage, since the automated dictionary contained more than six times the total entries (3708 vs 569). During adjudication we learned that many of the shorter abbreviations (eg, ‘ms’ (multiple sclerosis), ‘all’ (acute lymphoblastic leukemia)) produce spurious annotations that cost time to remove. The pre-annotation program performed the lookup and annotated without regard to capitalization, matching complete tokens only. For example, the abbreviation MS would match both ‘ms’ and ‘MS’, but not ‘aims’. Modifications to the pre-annotation program could be developed to allow shorter (two to three letter) abbreviations to be case sensitive and further increase time savings for pre-annotated tasks. To put the time savings in perspective, a 3-month (60 work days) annotation project can be reduced to as little as 48 days when using an automatically generated pre-annotation dictionary on CTAs. When using a manually created pre-annotation dictionary on the same corpus, the 60 days can be reduced to 50. For projects that implement double annotation, the saved labor is twice the saved time, as 10 days of saved annotation time saves 20 days of labor.
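The modification suggested above, making two to three letter abbreviations case sensitive, might look like the following sketch. It is an illustration under assumptions rather than a change the study implemented or evaluated: the function names, the `short_len` cut-off parameter, and the example sentence are hypothetical.

```python
import re

def compile_term(term, short_len=3):
    """Whole-token pattern; terms of `short_len` characters or fewer are case sensitive."""
    flags = 0 if len(term) <= short_len else re.IGNORECASE
    return re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", flags)

def find_matches(text, terms, short_len=3):
    """Return (start, end, matched text) tuples sorted by position."""
    hits = []
    for term in terms:
        for match in compile_term(term, short_len).finditer(text):
            hits.append((match.start(), match.end(), match.group(0)))
    return sorted(hits)

terms = ["MS", "ALL", "multiple sclerosis"]
sentence = "History of MS; all aims of the study apply to Multiple Sclerosis patients."

# Short abbreviations case sensitive: the ordinary word 'all' is no longer flagged.
strict_hits = [hit[2] for hit in find_matches(sentence, terms)]
# short_len=0 makes every term case-insensitive, the behavior described in the text.
loose_hits = [hit[2] for hit in find_matches(sentence, terms, short_len=0)]
```

In both modes the whole-token boundaries keep ‘MS’ from matching inside ‘aims’. The trade-off of the strict mode is that a lowercased abbreviation such as ‘uti’ would be missed unless the dictionary lists that form explicitly.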

Performance

Compared to the eventual gold standard, the annotator without pre-annotation missed short abbreviations more often, including ‘v’ (vomiting), ‘uti’ (urinary tract infection), and ‘mm’ (multiple myeloma). Pre-annotation can capture these short tokens. However, the annotator without pre-annotation missed fewer long phrasal annotations which require close reading of the text such as ‘lack of progress in his speech sound development’ and ‘decreased active rotation range of motion’. In addition, although the time savings were significant, the annotator with pre-annotated text tended to allow frequently occurring terms like ‘disease’ or ‘infections’ to remain unmodified, even if there were additional qualifying terms like ‘autoimmune’ or ‘hepatitis’.

The purpose of performing an ANOVA test on each of the nine variables was to determine if any of the variance was statistically significant. We demonstrated that the class entities, tokens, and CUI/CODE entities were not statistically significantly different, in most of the set comparisons, when compared to the baseline set. This indicates that the texts’ structures are not so different as to cause annotation speed differences. In CTAs, sign/symptom entities are not as frequent as disease/disorder and only average 0.9 entities per file, or 0.32 per 100 tokens. We believe that the rare sign/symptom entities in this corpus did not provide a strong basis for the statistical significance test and that this is the reason why the sign/symptom IAA and annotator performance were statistically significantly different in experiment 1 (table 4). Another important comparison point is the intra-experiment ANOVA calculation for annotator performances. This indicates the potential statistical significance of the variance between the annotator who had pre-annotation and the annotator who did not, between manual and automated dictionary experiments. In no category of F-measure was the performance difference statistically significant to a Bonferroni corrected p value of 0.0001 (table 4, intra-experiment significance); that is, the pre-annotation did not introduce annotation bias.

Limitations

A limitation of this study is that annotation time savings and potential annotation bias are not tested in the same sub-experiments. However, this is mitigated by the careful study design and ANOVA tests. Another limitation is the focus on just one corpus (CTA) and one source of dictionaries. Although preliminary results showed a similar pattern for pre-annotation experiments on a clinical note corpus, further research is needed with multiple different clinical corpora. Future studies should also experiment with other dictionary sources such as UMLS. Finally, cross-corpora pre-annotation experiments have been planned with dictionaries generated from different types of clinical texts. For practical purposes it is a limitation of our proposed method that we did not test the pre-annotation value of the dictionaries based on the number of underlying documents. That is, we used a fixed set of 500 documents to generate the dictionaries instead of consecutively increasing sets (eg, 100, 200, and so on documents). Future research should test if a dictionary based on a smaller number of documents would have a beneficial effect.

Conclusions

This study evaluated the effects of pre-annotation on annotation time and annotator bias in the annotation of disease/disorder and sign/symptom entities for an important clinical corpus, CTAs. The pre-annotated set was created from either an automatically extracted or a manually generated dictionary. Time savings were statistically significant and present in all of the experiments, when the annotator used pre-annotated text. There was no statistically significant difference in annotator performance or IAA between using a manually or automatically collected dictionary of pre-annotation sets. Furthermore, pre-annotated text did not introduce bias for the annotations. We conclude that either manually or automatically generated dictionary-based pre-annotation is a feasible and practical method to reduce the cost of clinical NER in the eligibility sections of CTAs without introducing bias in the annotation process.

