
Annotation and initial evaluation of a large annotated German oncological corpus.

Madeleine Kittner^1, Mario Lamping^2,3, Damian T Rieke^2,3,4, Julian Götze^5, Bariya Bajwa^5, Ivan Jelas^3, Gina Rüter^3, Hanjo Hautow^1, Mario Sänger^1, Maryam Habibi^1, Marit Zettwitz^3, Till de Bortoli^3, Leonie Ostermann^5, Jurica Ševa^1, Johannes Starlinger^1, Oliver Kohlbacher^6,7,8,9, Nisar P Malek^5, Ulrich Keilholz^3, Ulf Leser^1.

Abstract

OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction from German medical texts.
MATERIALS AND METHODS: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research.
RESULTS: The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences and keep 25% as a held-out data set for unbiased evaluation of future IE tools. On this held-out data set, our baselines reach, depending on the entity type, F1-scores of 0.72-0.90 for named entity recognition, 0.10-0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection.
DISCUSSION: Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important.
CONCLUSION: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.

Entities: Chemical

Keywords: German language; corpus annotation; medical information extraction

Year: 2021 PMID: 33898938 PMCID: PMC8054032 DOI: 10.1093/jamiaopen/ooab025

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

LAY SUMMARY

In this work, we present the Berlin-Tübingen-Oncology (BRONCO) corpus, a German medical text corpus of 200 discharge summaries from the oncology departments of two hospitals. In the corpus, mentions of diagnoses, treatments, and medications are annotated together with a number of attributes such as laterality, negation, and speculation. The corpus will be freely available to the research community for training and evaluating models for information extraction. To our knowledge, BRONCO will be the first freely available German medical corpus. To comply with data protection law, we anonymized all documents and shuffled the sentences in the publicly available version of the corpus. Consequently, applications are limited to the sentence level. We also provide baselines for named entity recognition, named entity normalization, and negation and speculation detection using state-of-the-art techniques.

INTRODUCTION

Clinical documentation contains a vast amount of patient-specific information, including disease etiology, family background, symptoms, examination results, and treatments. A systematic analysis of large quantities of documents can help to improve clinical care, to support clinical decision making, and to quality-control clinical pathways. However, documentation is mostly available in free text format at least in Germany, and its retrospective analysis for a given research hypothesis requires reading and understanding often hundreds or thousands of long and complex texts. Clinical natural language processing (NLP) investigates methods for automated information extraction (IE) specifically designed to process clinical text containing incomplete sentences, complex syntax, medical vocabulary, and idiosyncratic abbreviations. The quality of clinical NLP tools depends on the availability of annotated medical corpora for training and evaluation. Thus, the sharing of annotated corpora is indispensable: (1) The performance of different tools can be evaluated and compared, (2) reproducibility of previous results can be checked, and (3) machine-learning based NLP tools can be developed by groups world-wide without the time-consuming effort of corpus annotation, which furthermore requires high levels of medical knowledge. To secure patient privacy, sharing of medical reports is only allowed either with explicit patient consent or when texts are fully anonymized. In the United States, HIPAA (Health Insurance Portability and Accountability Act of 1996; https://www.hipaajournal.com/de-identification-protected-health-information/) defines anonymization of medical data as the removal of 18 distinct Protected Health Information (PHI) identifiers. 
Based on these regulations, the NLP community developed tools for automatic deidentification of medical narratives, which greatly eased the development and publication of annotated corpora, such as MIMIC or corpora provided through shared tasks such as i2b2/n2c2 (https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/), SemEval (http://alt.qcri.org/semeval2014/), and the CLEF eHealth lab series (http://clef-ehealth.org/). The access to corpora, in turn, enabled the development of freely available high-quality IE tools, among them MetaMap and cTAKES.

Compared to English, the development of clinical NLP tools for German medical text is still in its infancy. As in the United States, anonymized patient data can be shared in principle. However, there exists no clear definition of the PHI identifiers that need to be removed to obtain a fully anonymized medical document. Instead, the decision whether a certain approach achieves anonymization rests with the data protection officers at each institution. It is therefore extremely difficult (1) to obtain medical documents for NLP research outside of hospitals and (2) to share those data with other research groups. Consequently, although several annotation studies on German clinical corpora have been carried out previously, all those corpora are kept closed. The most recent corpus is 3000PA, which contains 3000 documents from 3 clinical sites and has been annotated with medication parameters. For German clinical texts, there are also no freely available IE or anonymization tools, and the reported quality of IE methods on closed corpora can neither be evaluated independently nor reproduced externally. Supplementary Table S1 gives key characteristics of selected clinical corpora used for IE in different languages, showing that several corpora for languages other than German have been freely available for years, especially for English.

In this work, we present the freely available Berlin-Tübingen-Oncology corpus (BRONCO). It consists of shuffled sentences from 200 German discharge summaries of cancer patients annotated with medical entities. BRONCO was created by the nationally funded project “Personalizing Oncology via Semantic Integration of Data” (PersOnS; see https://persons-project.informatik.uni-tuebingen.de/). The ultimate aim of BRONCO is to foster the development of high-quality NLP tools for extracting the precise disease history of cancer patients. As a first step toward this goal, we manually anonymized documents and annotated diagnoses, treatments, and medication. Additionally, medical entities were annotated with attributes (laterality, negation, speculation, and possible in the future). We also created baselines for a set of IE tasks using state-of-the-art technologies. To allow unbiased evaluation of IE tools, we randomly split the corpus into 2 parts. The larger subset, BRONCO150, contains 8976 sentences and 8760 annotations and is available under a liberal license for training and evaluation of IE tools (https://www2.informatik.hu-berlin.de/~leser/bronco/index.html). The second subset, BRONCO50, with 2458 sentences and 2364 annotations, is kept closed as held-out data. As a further means to prevent deanonymization, sentences in both corpora are randomly shuffled. In the future, we will offer the service to evaluate new IE tools on BRONCO50 in our lab.

MATERIALS AND METHODS

Corpus design and preprocessing

We randomly selected 200 discharge summaries of patients suffering from hepatocellular carcinoma or melanoma who were treated between 2013 and 2016 at the university hospitals in Berlin or Tübingen. After careful anonymization, the study on these data and the publication of BRONCO were approved by the Data Protection Officers of both hospitals and the ethics committee of Charité (EA1/322/20). Documents were extracted from electronic patient records, converted to plain text, and manually anonymized by 1 or 2 clinicians at each hospital. Anonymization included removal of direct identifiers such as names, age, contact details, IDs, and locations. Dates, persons, and hospital names were preannotated using regular expressions with the annotation tool Ellogon (http://www.ellogon.org/index.php/annotation-tool). All dates within each document were automatically shifted by a fixed number of days to preserve the chronological order of events. The number of days was chosen randomly for each document.
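The per-document date shift described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for BRONCO: the date pattern (DD.MM.YYYY), the offset range, and the names `shift_dates` and `anonymize_dates` are assumptions made for this example.

```python
import random
import re
from datetime import datetime, timedelta

# Assumed date format: German-style DD.MM.YYYY (hypothetical; real notes vary).
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")


def shift_dates(text: str, offset_days: int) -> str:
    """Shift every DD.MM.YYYY date in `text` by the same number of days."""

    def _shift(match: re.Match) -> str:
        try:
            original = datetime.strptime(match.group(0), "%d.%m.%Y")
        except ValueError:
            # Not a valid calendar date (e.g. 99.99.2015); leave it untouched.
            return match.group(0)
        return (original + timedelta(days=offset_days)).strftime("%d.%m.%Y")

    return DATE_PATTERN.sub(_shift, text)


def anonymize_dates(documents: list[str], seed: int = 0, max_offset: int = 365) -> list[str]:
    """Apply one randomly chosen offset per document (offset range is assumed)."""
    rng = random.Random(seed)
    return [shift_dates(doc, rng.randint(1, max_offset)) for doc in documents]
```

Because all dates in a document move by the same offset, intervals between events and their chronological order are unchanged.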

Annotation scope

At first, we annotated section headings in all discharge summaries (see Supplementary Material for details). In specific sections, we annotated medical entities that are particularly important for the disease history of cancer patients, namely diagnosis, treatment, and medication. As is common practice in NLP research, by “medical entities” we mean linguistic entities, that is, words or phrases that designate objects or processes relevant in health, including expressions clinicians use to describe patient-related matters. For terminology grounding (normalization) of medical entities, we utilized terminologies commonly used in clinical practice in Germany. A diagnosis is a disease, a symptom, or a medical observation that can be matched with the German Modification of the International Classification of Diseases (ICD10; www.dimdi.de/dynamic/de/klassifikationen/icd/icd-10-gm/). A treatment is a diagnostic procedure, an operation, or a systemic cancer treatment that can be found in the Operationen- und Prozedurenschlüssel (OPS; www.dimdi.de/dynamic/de/klassifikationen/ops/). A medication is a pharmaceutical substance or a drug that can be related to the Anatomical Therapeutic Chemical Classification System (ATC; www.dimdi.de/dynamic/de/arzneimittel/atc-klassifikation/). Examples for each entity type are shown in Figure 1. Whenever applicable, medical entities were annotated with laterality (right, left, or both sides), negation (e.g., a diagnosis is ruled out or a medication is paused), speculation (e.g., a diagnosis is unclear), or whether the entity is expressed as a possible future event (e.g., a procedure is planned for the future). Examples for each attribute are shown in lines 1–5 in Figure 1. We defined a number of rules in our annotation guideline (available on the BRONCO website) to clarify any ambiguous situation we encountered in our corpus. These rules are shown in Supplementary Appendix A.

Figure 1.

Exemplary excerpts from original discharge summaries and annotations, shown in BRAT visualization. Attributes in brackets have the following meaning: laterality right (R), negated entity (negative), speculative entity (speculative), and entity possible in the future (possibleFuture). Additionally, codes resulting from entity normalization are given in brackets.

Annotation process

The annotation process was conducted by an annotation leader who prepared documents for annotation, developed annotation guidelines together with the medical experts, and organized conflict resolution but did not perform any annotations. For organizational reasons, annotations were performed by 2 groups of annotators, group A (2 medical experts) and group B (3 medical experts and 3 medical students). Annotation guidelines were developed by the annotation leader and group A using 9 documents (see Supplementary Appendix B); adaptations required by situations encountered only later were possible. An overview of the complete annotation process is illustrated in Figure 2. Technically, we used the Brat Rapid Annotation Tool (BRAT).

Figure 2.

Annotation procedure including deidentification, annotation of section titles, and annotation of medical entities with attributes. Altogether, 1 annotation leader and 9 medical annotators were involved in different parts of the process.

Group A annotated 87 documents, of which 32 documents were double annotated for quality control. Differences in annotations were discussed with the annotation leader and resolved based on the guidelines and mutual agreement. To speed up annotation, we preannotated 59 documents with frequently annotated phrases, such as “CT Thorax/Abdomen/Becken” (computed tomography of thorax, abdomen, and pelvis), using exact matching. Annotators had to check and correct preannotations. Group B annotated 113 documents. Here, we used a different procedure because the performance of the medical students varied during training. First, 3 medical students double annotated all documents without preannotations. Then the 3 medical experts of group B resolved conflicts using BRAT as shown in Figure 3. Training of annotators and the considerations that led to this procedure are described in Supplementary Appendix B.

Figure 3.

Visualization of mismatches between annotations of 2 annotators, shown in BRAT visualization. (A) One of the annotations misses Laterality R. (B) “Oberbauchsonographie” (sonography of the upper abdomen) is annotated by only 1 annotator, and “Ausschluss von Leberfiliae” (exclusion of liver metastasis) is annotated with different text spans and only once with the attribute possibleFuture.

Interannotator agreement (IAA) is calculated as microaveraged phrase-level F1-score before conflict resolution. We used phrase-level IAA because most diagnosis and treatment annotations consist of multiple tokens, like “hepatozelluläres Karzinom” (hepatocellular carcinoma). In such cases, phrase-level IAA is more suitable than token-level IAA, as phrases with different boundaries are detected as disagreements and can be resolved during conflict resolution. We used the average F1-score instead of Cohen’s κ because the number of negative (not marked) phrases is poorly defined.
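To make the agreement measure concrete, the following sketch computes a microaveraged phrase-level F1 between two annotators. It assumes that annotations are available as sets of (start, end, label) tuples per document and that a phrase only counts as agreed when both boundaries and the label match exactly; the function name and data layout are illustrative and not taken from the BRONCO tooling.

```python
def phrase_level_f1(annotations_a, annotations_b):
    """Microaveraged phrase-level F1 between two annotators.

    Each argument maps a document id to a set of (start, end, label) tuples.
    Counts are pooled over all documents before F1 is computed (microaverage).
    """
    matched = total_a = total_b = 0
    for doc_id in set(annotations_a) | set(annotations_b):
        spans_a = annotations_a.get(doc_id, set())
        spans_b = annotations_b.get(doc_id, set())
        matched += len(spans_a & spans_b)  # exact span and label agreement
        total_a += len(spans_a)
        total_b += len(spans_b)
    if total_a + total_b == 0:
        return 0.0
    # F1 = 2*TP / (2*TP + FP + FN) = 2*matched / (|A| + |B|)
    return 2 * matched / (total_a + total_b)
```

The measure is symmetric in the two annotators, and, unlike Cohen’s κ, it does not require counting the phrases that neither annotator marked.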

Corpus creation

After annotation, the corpus was split into 2 parts containing only the annotated sections of 150 and 50 documents, referred to as BRONCO150 and BRONCO50, respectively. In each part, sentences were split based on punctuation. To avoid splitting sentences after abbreviations like “Z.n.” (condition after), we used a list of common German medical abbreviations retrieved from Wikipedia (https://de.wikipedia.org/wiki/Medizinische_Abk%C3%BCrzungen) as exceptions. As an additional measure against potential deanonymization, we randomly shuffled sentences within each part of BRONCO. Finally, we further split BRONCO150 into 5 sets to allow reproducible cross validation.

We performed 2 analyses on BRONCO150 to evaluate the effect of shuffling sentences. First, we calculated similarity scores between sentences originating from the same document and those coming from different documents. Second, we tried to reconstruct the original documents through clustering of sentences. We performed hierarchical clustering to segment the sentences into 150 groups, that is, 1 group per original document. For each group, we measured from how many documents the sentences originate. For both analyses, we used cosine similarity over TF-IDF representations of the documents.
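The sentence splitting and shuffling steps can be illustrated with a small sketch. It is a simplification under stated assumptions: tokens are separated by whitespace, the abbreviation set is a tiny hypothetical excerpt rather than the Wikipedia list used for BRONCO, and the function names are invented for this example.

```python
import random

# Hypothetical excerpt of an abbreviation list; the corpus used a much larger one.
ABBREVIATIONS = {"Z.n.", "V.a.", "z.B.", "ca."}


def split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation, except after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


def shuffle_corpus(documents: list[str], seed: int = 42) -> list[str]:
    """Pool the sentences of all documents and shuffle them reproducibly."""
    pool = [sentence for doc in documents for sentence in split_sentences(doc)]
    random.Random(seed).shuffle(pool)
    return pool
```

A fixed seed keeps the shuffled order reproducible. A rule this simple would still split after tokens such as abbreviated dates, which is one reason an exception list is needed in practice.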

Baseline methods for information extraction

We developed baseline tools using state-of-the-art techniques for named entity recognition (NER), named entity normalization (NEN), and detection of negated and uncertain entities. Performance of all baselines was measured as microaveraged precision, recall, and F1. To this end, both BRONCO corpora were tokenized and tagged with part-of-speech using JCORE models that have been trained on a closed German clinical corpus (FRAMED). Gold standard annotations were used to convert the corpora to IOB format.
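As an illustration of the IOB conversion, the sketch below tags tokens from character-offset annotations. The tuple layouts and the function name `to_iob` are assumptions for this example, and it handles only contiguous entities; the noncontinuous entities present in BRONCO would need additional treatment.

```python
def to_iob(tokens, entities):
    """Convert character-offset entity annotations to IOB tags.

    tokens:   list of (text, start, end) tuples in sentence order
    entities: list of (start, end, label) tuples, assumed contiguous
    """
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity if it lies fully within its span.
            if tok_start >= ent_start and tok_end <= ent_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags
```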

Named entity recognition

We applied the conditional random fields (CRF) implementation CRFsuite and a bidirectional long short-term memory network with a final CRF layer (LSTM-CRF). For both CRF and LSTM-CRF, we used the default feature sets plus a number of further lexical and orthographic features. We also tested the impact of FastText embeddings trained on German Wikipedia articles. CRF and LSTM-CRF models were evaluated with 5-fold cross validation on BRONCO150 and trained on the full BRONCO150 corpus for evaluation on the held-out corpus.
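The kind of lexical and orthographic features mentioned above can be sketched as per-token feature dictionaries, a representation commonly used as input to CRF toolkits. The concrete feature set below is hypothetical; the features actually used for the baselines are described in Supplementary Appendix C.

```python
def token_features(tokens, i):
    """Build a feature dictionary for the token at position i (illustrative set)."""
    word = tokens[i]
    return {
        # lexical features
        "lower": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        # orthographic features
        "is_upper": word.isupper(),
        "is_title": word.istitle(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        # context features
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Feature dictionaries for every token of one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```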

Named entity normalization

We implemented a simple approach using a dictionary lookup with Apache Solr 7.5.0 (https://lucene.apache.org/) followed by a reranking of candidates using a previously published inference method. In addition to the dictionaries used for annotation, we applied Rote Liste (Rote Liste Service GmbH, 2/2019) for mentions of branded drug names. To evaluate NEN, gold standard entity annotations were extracted from the BRONCO corpora and subjected to normalization.
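The general idea of dictionary-based normalization can be shown with a toy lookup that ranks terminology entries by token overlap. This sketch does not reproduce the Solr index or the reranking method used for the baseline: the Jaccard scoring, the threshold, and the placeholder codes `X1` and `X2` are assumptions made purely for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace and hyphens."""
    return set(text.lower().replace("-", " ").split())


def normalize(mention, terminology, min_score=0.5):
    """Return (code, score) of the best-matching entry, or None.

    terminology: list of (code, term) pairs; the score is the Jaccard
    overlap between mention tokens and term tokens.
    """
    mention_tokens = tokenize(mention)
    best = None
    for code, term in terminology:
        term_tokens = tokenize(term)
        union = mention_tokens | term_tokens
        score = len(mention_tokens & term_tokens) / len(union) if union else 0.0
        if score >= min_score and (best is None or score > best[1]):
            best = (code, score)
    return best


# Toy terminology with placeholder codes (not real ICD10, OPS, or ATC codes).
TOY_TERMINOLOGY = [
    ("X1", "Leberzellkarzinom"),
    ("X2", "Bösartiges Melanom der Haut"),
]
```

With this toy terminology, the clinical phrase “hepatozelluläres Karzinom” shares no token with “Leberzellkarzinom” and is therefore not matched, which illustrates the vocabulary mismatch between clinical jargon and controlled vocabularies.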

Negation and speculation detection

We applied NegEx, which detects negated and uncertain (speculated) entities using a list of trigger terms and rules for defining their scope. We applied the original list of German trigger terms from Chapman et al as well as an updated list from Cotik et al. For evaluation, all sentences and gold standard entity annotations were fed into NegEx. More details on all applied methods are shown in Supplementary Appendix C.
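The trigger-and-scope principle behind NegEx can be sketched in a few lines. This is a strongly simplified illustration and not the NegEx implementation: it only looks at triggers preceding the entity within a fixed token window, and the trigger terms below are hypothetical examples rather than entries from the Chapman or Cotik lists.

```python
# Illustrative trigger terms (assumed for this sketch).
NEGATION_TRIGGERS = ["kein", "keine", "keinen", "ohne"]
SPECULATION_TRIGGERS = ["verdacht auf", "v.a.", "fraglich"]


def classify_entity(sentence: str, entity_start: int, window: int = 5) -> str:
    """Label the entity beginning at character offset `entity_start`.

    Only the `window` tokens before the entity are searched for triggers.
    """
    preceding = sentence[:entity_start].lower().split()[-window:]
    context = " " + " ".join(preceding) + " "  # pad so whole tokens are matched
    if any(f" {trigger} " in context for trigger in NEGATION_TRIGGERS):
        return "negative"
    if any(f" {trigger} " in context for trigger in SPECULATION_TRIGGERS):
        return "speculative"
    return "affirmed"
```

A full rule set additionally covers triggers that follow the entity and terms that end a trigger's scope, which this sketch omits.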

RESULTS

We first report on the estimated quality and frequency of annotated entities in BRONCO and both subsets. Next, we study whether shuffling of sentences in BRONCO150 actually prevents reconstruction of documents. Finally, we report, separately for both parts of BRONCO, on baseline results for information extraction using state-of-the-art techniques.

Quality of annotations

The quality of annotation is measured separately for annotation and normalization of medical entities and attributes for both groups of annotators (A and B). Table 1 shows the IAA for all double annotated documents for group A (32 documents) and group B (113 documents). Annotation of text spans reaches high agreement for all medical entities in group A (IAA 0.81–0.94). Agreement for normalization is also high with IAA 0.73–0.90. For both levels of annotation quality, text span and normalization, agreement increases from treatment to diagnosis to medication. Also, the agreement between annotations of attributes is high, especially for negation and laterality with IAA of 0.81 and 0.75, respectively. Agreement is generally lower for group B: 0.66–0.87 for text spans, 0.47–0.75 for normalization, and 0.37–0.53 for attributes. Note that all conflicting annotations were manually resolved in the final BRONCO.

Table 1.

Interannotator agreement (IAA) calculated as microaveraged phrase-level F1 for 2 corpus sets annotated by 2 groups of annotators (A, B)

	Group A			Group B
Annotation type	No. of instances	Text span	Code/attribute	No. of instances	Text span	Code/attribute
Diagnosis	734	0.88 (0.94)	0.84	2860	0.69 (0.79)	0.54
Treatment	522	0.81 (0.93)	0.73	1730	0.66 (0.77)	0.47
Medication	300	0.94 (0.96)	0.90	927	0.87 (0.92)	0.75
Laterality	104	–	0.75	452	–	0.53
Negation	76	–	0.81	319	–	0.50
Speculation	81	–	0.69	288	–	0.44
Possible Future	37	–	0.68	244	–	0.37

Note: IAA was calculated before conflict resolution. For text spans, IAA is also given as (token level) Cohen’s κ in parentheses. Number of double annotated documents: group A (32) and group B (113).

Frequency of entities

Corpus statistics and frequencies of annotated entities for both parts of BRONCO as well as for the complete corpus are shown in Table 2. Overall, BRONCO contains 11 124 annotations of medical entities and 3118 annotations of attributes. Most frequent annotations are diagnosis (5245), followed by treatment (3866) and medication (2013). Judged by the number of unique instances (26–45% of all annotations), the vocabulary is quite versatile for each type of entity. Overall, 1256 medical entities (10%) are related to a specific laterality, and about 15% are either negated (630 entities), speculated (613 entities), or may possibly occur in the future (619 entities). Overall, 796 medical entities (7%) are noncontinuous.

Table 2.

Frequency of annotated medical entities and attributes in BRONCO and its 2 subsets, together with general statistics

Annotation type	BRONCO150	BRONCO50	BRONCO complete	Unique instances
Diagnosis	4080	1165	5245	2394
Treatment	3050	816	3866	1101
Medication	1630	383	2013	532
Total medical entities	8760	2364	11 124	–
Laterality	1033	223	1256
Negation	503	127	630
Speculation	474	139	613
Possible future	479	140	619
Total attributes	2489	629	3118
#Documents	150	50	200
#Sentences	8976	2458	11 434
#Tokens	70 572	19 370	89 942

Note: Unique instances are the number of unique mentions within the complete corpus.

No reconstruction of documents

First, we compared the similarity of sentences in BRONCO150. Supplementary Figure S2 shows the distributions of pairwise similarities for sentences of the same (left) and different original documents (right) having at least 1 word in common. There is almost no difference regarding in- and cross-document sentence pairs. Furthermore, about 90% of all sentence pairs do not share a single word and therefore have zero similarity in the 1-hot encoding we applied here (note that these pairs were excluded to create Supplementary Figure S2, since otherwise the boxplots degenerate to flat lines). Furthermore, we studied how much a hierarchical clustering of sentences (with cutoffs to create 150 clusters) reconstructs the original documents. Figure 4 shows the distribution of numbers of documents per cluster. On average, clusters consist of 60 sentences originating from 22 different documents. There are only 3 clusters having sentences just from a single original document. These clusters contain only 10–13 sentences which cover 6–21% of a complete document. As there are also 18 clusters of similar sizes and similar average pairwise similarity between cluster members, we see no way of identifying pure (yet still very incomplete) clusters without knowledge of the original document.
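The per-cluster measurement reported here, namely how many original documents contribute sentences to each cluster, can be sketched as follows. The sketch assumes that a cluster id and the id of the originating document are known for every sentence; the clustering step itself is not shown, and the function names are illustrative.

```python
from collections import defaultdict


def documents_per_cluster(cluster_ids, document_ids):
    """Map each cluster id to the number of distinct source documents in it."""
    origins = defaultdict(set)
    for cluster, doc in zip(cluster_ids, document_ids):
        origins[cluster].add(doc)
    return {cluster: len(docs) for cluster, docs in origins.items()}


def pure_clusters(cluster_ids, document_ids):
    """Clusters whose sentences all stem from a single original document."""
    counts = documents_per_cluster(cluster_ids, document_ids)
    return sorted(cluster for cluster, n in counts.items() if n == 1)
```

Note that this evaluation needs the true document ids, which are not part of the published corpus.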

Figure 4.

Distribution of documents per cluster after hierarchical clustering of sentences in BRONCO150.

Performance of IE baselines

We compared the performance of a CRF and a LSTM-CRF with and without using German (nonbiomedical) word embeddings. Results are listed in Table 3. On the BRONCO150 subset, the CRF approach outperforms LSTM-CRF by ∼3pp F1 on diagnosis, by ∼1pp on treatment, and by ∼2pp on medication; differences are very similar on the BRONCO50 subset. Word embeddings have only marginal impact on the CRF, but considerably improve performance of the LSTM-CRF approach (+5pp, +3pp, and +3pp for diagnosis, treatment, and medication, respectively). A notable difference exists between the results for diagnosis and treatment on BRONCO150 versus BRONCO50 for both approaches, where F1 scores are lower for the held-out part. We attribute this drop to the fact that results for BRONCO150 are obtained using cross-validation over randomly shuffled sentences, which means that sentences from the same document often are contained in the training and the test data. This increases the chances that individual entities of the test split already have been seen in the training data. Though this might be considered as a form of information leakage, we decided against creating the folds in BRONCO150 at the level of documents, as this would make document reconstruction easier and reidentification of individuals possible. Clearly, results for the BRONCO50 subset should be considered as more realistic.

Table 3.

Performance for baseline methods for NEN and NER (CRF and LSTM-CRF) with and without pretrained word embeddings (WE)

Annotation type	Task	Method	BRONCO150			BRONCO50
Annotation type	Task	Method	P	R	F1	P	R	F1
Diagnosis	NER	CRF	0.80(0.01)	0.71(0.02)	0.75(0.02)	0.79	0.67	0.72
		CRF+WE	0.782(0.006)	0.70(0.02)	0.74(0.01)	0.77	0.66	0.71
		LSTM	0.75(0.03)	0.69(0.03)	0.72(0.01)	0.78	0.65	0.71
		LSTM+WE	0.81(0.08)	0.74(0.08)	0.77(0.08)	0.79	0.65	0.72
	NEN	Dictionary lookup	0.58	0.54	0.56	0.54	0.50	0.52
Treatment	NER	CRF	0.86(0.02)	0.78(0.01)	0.82(0.01)	0.83	0.73	0.78
		CRF+WE	0.85(0.02)	0.78(0.01)	0.81(0.01)	0.81	0.73	0.76
		LSTM	0.83(0.04)	0.79(0.03)	0.81(0.02)	0.85	0.69	0.76
		LSTM+WE	0.85(0.06)	0.82(0.07)	0.84(0.06)	0.76	0.74	0.75
	NEN	Dictionary lookup	0.18	0.13	0.15	0.15	0.12	0.13
Medication	NER	CRF	0.96(0.008)	0.85(0.02)	0.90(0.009)	0.94	0.87	0.90
		CRF+WE	0.96(0.004)	0.84(0.009)	0.90(0.006)	0.95	0.85	0.90
		LSTM	0.91(0.05)	0.86(0.03)	0.88(0.02)	0.95	0.85	0.89
		LSTM+WE	0.96(0.02)	0.87(0.06)	0.91(0.04)	0.91	0.89	0.90
	NEN	Dictionary lookup	0.66	0.68	0.67	0.64	0.69	0.66

Note: Results for BRONCO150 are averaged over 5 folds, with standard deviations in parentheses. Best (highest) values per entity type, corpus, and w/o WE are bold.

For NEN, we applied a dictionary lookup approach combined with candidate reranking. Results are listed in Table 3. We find the best performance in terms of F1 for medication (0.67 and 0.66), followed by diagnosis (0.56 and 0.52), for BRONCO150 and BRONCO50, respectively. For treatment, performance only reaches F1 0.15 (BRONCO150) and F1 0.13 (BRONCO50).

For negation and speculation detection, we applied NegEx using 2 available lists of trigger terms, from Chapman et al and Cotik et al. Using the Chapman list, negation detection reaches F1 0.44 (BRONCO150) and F1 0.37 (BRONCO50), as shown in Table 4. Speculation detection is worse: F1 only reaches 0.02 and 0.09 on BRONCO150 and BRONCO50, respectively. The recently published Cotik list improves results, but F1 scores nevertheless do not exceed 0.55 for negation and 0.33 for speculation detection in both corpora.

Table 4.

Negation and speculation detection of entities using NegEx with 2 lists of German trigger terms: Chapman et al and Cotik et al

		BRONCO150				BRONCO50
Annotation type	Trigger list	#GSC	P	R	F1	#GSC	P	R	F1
Negation	Chapman	503	0.57	0.35	0.44	127	0.45	0.31	0.37
Negation	Cotik	503	0.62	0.50	0.55	127	0.52	0.55	0.54
Speculation	Chapman	474	0.13	0.01	0.02	139	0.26	0.06	0.09
Speculation	Cotik	474	0.54	0.24	0.33	139	0.71	0.22	0.33

DISCUSSION

We present the BRONCO, a large and freely available corpus of German oncological discharge summaries. The corpus consists of shuffled sentences and is annotated with medical entities (diagnosis, treatment, medication) and their attributes (laterality, negation, speculation, possible in the future). Additionally, we developed baselines for NER, NEN, and negation and speculation detection and evaluated them on 2 subsets of the corpus. BRONCO150 will be published openly. Application of BRONCO is limited to sentence-level IE tasks. Nevertheless, we believe BRONCO can have a positive impact on German clinical NLP because it is the first sizable corpus that will become freely available. Some previously built German medical corpora for IE exceed the size of BRONCO. For instance, 3000PA, created by large German research consortium (https://www.medizininformatik-initiative.de/), contains 3000 documents annotated with medication and related parameters. A subset of 3000PA annotated with diagnosis, findings, and symptoms contains 1.5M tokens. However, none of these is publicly available. The main reason for this situation undoubtedly is the uncertainty among researchers and data protection officers when a corpus can be given the status of being “fully anonymized,” as required by German and European regulations. We reacted in 3 ways to this issue: first, the corpus was completely manually deidentified. This process was confirmed by Charité and UKT data protection officers. Second, we only annotate and publish certain sections of the discharge summaries, avoiding all sections containing mostly biographic information. Third, we shuffled all sentences in the 2 subcorpora to blur their order and relationships. We performed an attempt to break this shuffling using sentence clustering and showed that it failed. Our method for this attempt has, however, limitations. 
The most important one probably is that in our 1-hot encoding words must appear syntactically identical to be matched between sentences, ignoring their semantics. One could try to overcome the limitation by using precomputed language models.,, However, extremely large and domain-specific corpora necessary to train good language models are either not available or kept closed. The same is true for any potentially existing language models. BRONCO provides high-quality annotations (1) because in all double annotated documents (145 out of 200) conflicts were dissolved in a controlled process and (2) because all single annotated documents were annotated by persons that achieved high IAA with their peers for all levels of annotation. The IAA for entity annotation (0.69–0.88 for diagnosis, 0.66–0.81 for treatment, and 0.87–0.94 for medication) is comparable to previous annotations studies: achieved 0.88–0.99 for medication and reached 0.637 for diagnosis. Annotation studies on English clinical corpora are in the range of 0.7–0.88 (F1) or 0.73 (Cohen’s κ) for entities like disorder, procedures, or chemicals and drugs.,, As expected, normalization of entities was more difficult than merely finding entity mentions, especially for treatment concepts. Annotators were much less familiar with OPS (especially medical students) than with the other terminologies. Documentation officers, who create ICD10, OPS, and DRG coding as part of their professional activities, probably would have been a better fit for this task. The 2-step annotation process we used for group B achieved the best balance between work time/cost and annotation quality. For comparable medical annotation projects, we therefore recommend the following procedure: First, annotations should be performed by persons, preferably more senior medical students, specifically hired for the annotation task. 
Every document should be annotated at least twice, and annotators are asked to highlight phrases where they are not sure how to proceed. In a second step, trained staff members only correct such phrases and conflicting annotations which significantly reduces the time they have to invest. The first step may include preannotations of frequently annotated terms to further speed up the process. However, this procedure certainly is a challenge for building truly large corpora containing thousands of documents. Generally, recent years have shown that neural network based NER taggers outperform all other methods for biomedical texts, at least for English. In recent NER studies on German clinical corpora, CRF and LSTM-CRF methods have been used. A CRF and a character-level Bi-LSTM-CRF was trained for several types of medical entities, including medical condition, treatments and medications on 627 clinical notes from nephrology annotated with UMLS. Their F1-scores for treatment and medication are ∼5pp and ∼11pp worse for the CRF and ∼3pp and ∼2pp worse for the Bi-LSTM-CRF, when compared to BRONCO150 results (the precise setup for evaluation is not clear in the paper, but it certainly used a form of cross-validation. Therefore, we compare to BRONCO150 and not BRONCO50), though in their case the Bi-LSTM-CRF always outperformed the CRF. Their F1-scores for medical condition are ∼9pp and 13pp better for CRF and the Bi-LSTM-CRF, respectively. The rule-based system JUMEx extracted among other entities medication names from 3000PA reaching F1-scores of 0.65 compared to 0.90 for BRONCO50. On the Jena subset of 3000PA the CRF-based JCORE pipeline extracted among other entities diagnosis mentions with F1-score of 0.48 compared to 0.72 for BRONCO50. The latter 2 studies work on much larger corpora (more than 1.5M tokens) using 10-fold cross validation while for BRONCO150 we could only apply 5-fold. Additionally, 3000PA covers documents from a broader domain than BRONCO. 
Further progress in NER may be achieved by adding more fine-grained language models. For German medical texts, such models are not yet available. However, it would be worth testing the German instance of the multilingual BERT language model.

For NEN, we applied dictionary matching followed by a reranking of candidate terms. Results are mixed: whereas F1-scores for diagnosis and medication are somewhat encouraging (52% and 66% on BRONCO50, respectively), the performance for treatments is very low (13%). These poor results can be related to the well-known vocabulary mismatch between the language of controlled vocabularies and clinical jargon. OPS in particular contains very complex concepts. Building interface terminologies may help to overcome this issue, as may making German translations of rich terminologies such as SNOMED-CT, augmented with proper synonym sets, accessible to the research community. Abbreviations are often specific to organizations and thus notoriously difficult to include in general terminologies. Additionally, tools for abbreviation resolution might be worthwhile here to improve terminology grounding.

Negation and speculation detection using NegEx showed diverse results. We achieved the best performance using trigger terms from Cotik et al. An F1-score of 0.55 for negation can be considered a promising basis for future improvements, yet a score of 0.33 for speculation is clearly not satisfactory. Note that Cotik et al report F1-scores of 0.91 for negation and 0.55 for speculation on their corpus, indicating the still highly corpus-specific nature of the trigger term lists. To improve negation and speculation detection, one could either substantially extend the list of trigger terms and their scopes, or adapt other tools like ConText, a more advanced version of NegEx currently available only for English. Training polarity models could also be tested.
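To illustrate why trigger-based detection depends so strongly on the trigger list and its scope, the following Python sketch implements a heavily simplified, NegEx-style check. It is not the NegEx implementation used for BRONCO: the trigger terms, the fixed scope of 5 tokens, and the function `classify_entity` are assumptions made for illustration, and the trigger lists are not those of Cotik et al.

```python
import re
from typing import List

# Illustrative trigger lists only (assumption, NOT the lists of Cotik et al).
NEGATION_TRIGGERS: List[str] = ["kein", "keine", "keinen", "ohne"]
SPECULATION_TRIGGERS: List[str] = ["verdacht auf", "v.a."]
SCOPE_TOKENS = 5  # assumed scope: a trigger affects the next 5 tokens


def _contains_trigger(window: str, triggers: List[str]) -> bool:
    """True if any trigger occurs in the window as a whole word or phrase."""
    for trigger in triggers:
        pattern = r"(?<!\w)" + re.escape(trigger) + r"(?!\w)"
        if re.search(pattern, window):
            return True
    return False


def classify_entity(sentence: str, entity: str) -> str:
    """Label an entity mention as 'negated', 'speculated', or 'affirmed'.

    Simplification: only triggers that precede the entity are considered,
    and negation takes priority over speculation.
    """
    text = sentence.lower()
    position = text.find(entity.lower())
    if position < 0:
        raise ValueError("entity not found in sentence")
    preceding_tokens = text[:position].split()
    window = " ".join(preceding_tokens[-SCOPE_TOKENS:])
    if _contains_trigger(window, NEGATION_TRIGGERS):
        return "negated"
    if _contains_trigger(window, SPECULATION_TRIGGERS):
        return "speculated"
    return "affirmed"


examples = [
    ("Kein Hinweis auf Metastasen.", "Metastasen"),
    ("Verdacht auf Pneumonie.", "Pneumonie"),
    ("Patient erhielt Cisplatin.", "Cisplatin"),
]
for sentence, entity in examples:
    print(entity, "->", classify_entity(sentence, entity))
```

Any phrasing that is missing from the trigger lists, or any trigger that lies outside the fixed scope, leaves the entity labeled as affirmed. This is consistent with the observation above that trigger term lists tuned on one corpus transfer poorly to another.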

CONCLUSION

We provide BRONCO, the first annotated German medical corpus freely available to the research community. This corpus makes it possible to train, evaluate, and compare methods for basic NLP tasks in the medical domain, such as NER, NEN, and the detection of different attributes of named entities.

FUNDING

This work was funded by the German Bundesministerium für Bildung und Forschung (BMBF), grants 031L0030B and 031L0023B, and the Deutsche Forschungsgemeinschaft, grant LE1428/1-1. D.T.R. is a participant in the BIH-Charité Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin and the Berlin Institute of Health.

AUTHOR CONTRIBUTIONS

U.L. and O.K. conceived the idea of the project. U.L., N.P.M., and U.K. supervised the work. M.K. organized the annotation process and performed data analysis, supported by M.S., H.H., J.Š., J.S., and M.H. M.L., D.T.R., I.J., G.R., M.Z., T.B., J.G., B.B., and L.O. anonymized and annotated the corpus. M.K. and U.L. wrote the manuscript with input from all authors.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONFLICT OF INTEREST STATEMENT

None declared.

DATA AVAILABILITY

The data underlying this article (BRONCO150) will be shared on reasonable request and based on a data usage agreement. Please visit https://www2.informatik.hu-berlin.de/~leser/bronco/index.html.

23 in total

1. A simple algorithm for identifying negated findings and diseases in discharge summaries.

Authors: W W Chapman; W Bridewell; P Hanbury; G F Cooper; B G Buchanan
Journal: J Biomed Inform Date: 2001-10 Impact factor: 6.317

2. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications.

Authors: Guergana K Savova; James J Masanz; Philip V Ogren; Jiaping Zheng; Sunghwan Sohn; Karin C Kipper-Schuler; Christopher G Chute
Journal: J Am Med Inform Assoc Date: 2010 Sep-Oct Impact factor: 4.497

3. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions.

Authors: Wendy W Chapman; Prakash M Nadkarni; Lynette Hirschman; Leonard W D'Avolio; Guergana K Savova; Ozlem Uzuner
Journal: J Am Med Inform Assoc Date: 2011 Sep-Oct Impact factor: 4.497

4. CDA-Compliant Section Annotation of German-Language Discharge Summaries: Guideline Development, Annotation Campaign, Section Classification.

Authors: Christina Lohr; Stephanie Luther; Franz Matthies; Luise Modersohn; Danny Ammon; Kutaiba Saleh; Andreas G Henkel; Michael Kiehntopf; Udo Hahn
Journal: AMIA Annu Symp Proc Date: 2018-12-05

5. Interface Terminologies, Reference Terminologies and Aggregation Terminologies: A Strategy for Better Integration.

Authors: Stefan Schulz; Jean-Marie Rodrigues; Alan Rector; Christopher G Chute
Journal: Stud Health Technol Inform Date: 2017

6. De-identification of clinical notes via recurrent neural network and conditional random field.

Authors: Zengjian Liu; Buzhou Tang; Xiaolong Wang; Qingcai Chen
Journal: J Biomed Inform Date: 2017-06-01 Impact factor: 6.317

7. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports.

Authors: Henk Harkema; John N Dowling; Tyler Thornblade; Wendy W Chapman
Journal: J Biomed Inform Date: 2009-05-10 Impact factor: 6.317

8. A long journey to short abbreviations: developing an open-source framework for clinical abbreviation recognition and disambiguation (CARD).

Authors: Yonghui Wu; Joshua C Denny; S Trent Rosenbloom; Randolph A Miller; Dario A Giuse; Lulu Wang; Carmelo Blanquicett; Ergin Soysal; Jun Xu; Hua Xu
Journal: J Am Med Inform Assoc Date: 2017-04-01 Impact factor: 4.497

9. Towards comprehensive syntactic and semantic annotations of the clinical narrative.

Authors: Daniel Albright; Arrick Lanfranchi; Anwen Fredriksen; William F Styler; Colin Warner; Jena D Hwang; Jinho D Choi; Dmitriy Dligach; Rodney D Nielsen; James Martin; Wayne Ward; Martha Palmer; Guergana K Savova
Journal: J Am Med Inform Assoc Date: 2013-01-25 Impact factor: 4.497

10. Fine-grained information extraction from German transthoracic echocardiography reports.

Authors: Martin Toepfer; Hamo Corovic; Georg Fette; Peter Klügl; Stefan Störk; Frank Puppe
Journal: BMC Med Inform Decis Mak Date: 2015-11-12 Impact factor: 2.796
