Term sets: A transparent and reproducible representation of clinical code sets.

Richard Williams, Benjamin Brown, Evan Kontopantelis, Tjeerd van Staa, Niels Peek.

Abstract

OBJECTIVE: Clinical code sets are vital to research using routinely-collected electronic healthcare data. Existing code set engineering methods pose significant limitations when considering reproducible research. To improve the transparency and reusability of research, these code sets must abide by FAIR principles; this is not currently happening. We propose 'term sets', an equivalent alternative to code sets that are findable, accessible, interoperable and reusable.
MATERIALS AND METHODS: We describe a new code set representation, consisting of natural language inclusion and exclusion terms (term sets), and explain its relationship to code sets. We formally prove that any code set has a corresponding term set. We demonstrate utility by searching for recently published code sets, representing them as term sets, and reporting on the number of inclusion and exclusion terms compared with the size of the code set.
RESULTS: Thirty-one code sets from 20 papers covering diverse disease domains were converted into term sets. The term sets were on average 74% the size of their equivalent original code set. Four term sets were larger due to deficiencies in the original code sets.
DISCUSSION: Term sets can concisely represent any code set. This may reduce barriers for examining and reusing code sets, which may accelerate research using healthcare databases. We have developed open-source software that supports researchers using term sets.
CONCLUSION: Term sets are independent of clinical code terminologies and therefore: enable reproducible research; are resistant to terminology changes; and are less error-prone as they are shorter than the equivalent code set.

Year: 2019; PMID: 30763407; PMCID: PMC6375602; DOI: 10.1371/journal.pone.0212291

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Clinical code terminologies, such as SNOMED [1] and ICD [2], are dictionaries of terms that allow clinicians to record events in electronic health records (EHRs) using alpha-numeric codes rather than free text. This makes patient records more manageable for clinical care, and allows secondary uses of the data, such as researchers performing retrospective observational studies. Researchers construct clinical code sets [3-5] to represent the medical concepts they wish to investigate. This is a time-consuming activity, and prone to errors which can lead to biases in subsequent analyses [6]. Storing code sets in a format that facilitates validation, sharing and reuse is important, and called for frequently [7-10].

Code sets, also called code lists and value sets [3,9], range from one code to several thousand. The Value Set Authority Centre (VSAC) [11] provides a repository for code sets allowing their sharing and reuse. Its largest code set, for “Problem”, contains 117,930 SNOMED codes. This code set is likely not useful, but there are several useful ones that contain thousands of codes: Trauma (ICD-10) 18,524, Fracture lower body (ICD-10) 5,902, Infection (SNOMED) 4,066 and Cancer (SNOMED) 3,867. Verifying large code sets, by checking that all included codes are correct, and also that no codes are missing, is an enormous task and acts as a barrier to reuse [3]. Updating code sets as terminologies change over time, and sub-setting or extending code sets, are laborious and error-prone activities.

This is important because differences in code sets can cause large variations in findings. Rodriguez et al [12] found rheumatoid arthritis (RA) incidence to be 0.15 per 1000 person-years, while Watson et al [13], in the same database, found it to be 1.03 per 1000 person-years; a sevenfold difference. Another study [14] calculated the weekly incidence of infectious intestinal disease as: 8.3/100,000 if using the World Health Organisation’s ICD-10 code set; 10.24/100,000 if using the Royal College of General Practitioners Research and Surveillance Centre’s ICD-9 code set; and 17.93/100,000 if using the ontological definition on which the paper was based.

The FAIR principles [15] aim to improve the transparency and reusability of scientific data and the algorithms and tools for processing and curating that data. Clinical code sets are a key part of the research process and should abide by FAIR principles; they should be findable, accessible, interoperable and reusable. This is not currently the case. Almost all code sets are unpublished [4] and therefore not accessible. Those that are published, on dedicated repositories such as VSAC or clinicalcodes.org [16], are findable but reuse is a challenge. In theory, reuse is achieved by downloading the relevant code set and applying it to an EHR database. However, checking the code set for errors involves reading the definition of each code to confirm that it correctly belongs in the set, and also speculatively searching the rest of the terminology for codes that may have been omitted. This is arguably as time-consuming as constructing the code set from scratch and is one of the current barriers to reuse. There is also currently no way to determine whether a missing code was accidentally or deliberately omitted, so it is impossible to tell whether a mistake was made or whether the code set definition contained a subtlety not otherwise described.

Objective

We propose a new representation of selection criteria for EHR-based studies, based on lists of inclusion and exclusion terms. We introduce a methodology for constructing code sets that takes advantage of this representation, and show that our method can represent any possible code set. The resulting representation is typically more concise than the code set, and therefore more practical for other researchers to verify, validate and ultimately reuse with confidence.

Materials and methods

We introduce ‘term sets’ to define cohort selection criteria for EHR-based studies. A ‘term set’ consists of three parts: inclusion terms describing the feature of interest (e.g. ‘stroke’, ‘heart failure’); exclusion terms describing things of no interest (e.g. ‘family history’, ‘screening’); and the target clinical code terminology and version (e.g. terminology = SNOMED-CT, version = uk-edition-v20180401). A code set is created from a term set by searching the terminology for codes whose definitions contain inclusion terms but do not contain exclusion terms.

Relationship between code sets and term sets

The traditional representations of cohort selection criteria are clinical code sets, which are applied to EHR databases via a query language. Code sets are extensional; they enumerate every code in the set. Term sets by contrast are intensional; they provide necessary and sufficient conditions by which a code is a member of the set. When applied to a particular terminology and version, a term set uniquely defines a code set. For example, consider the phrase “countries of the world”, which is intensional, as compared with a complete list of countries of the world, which is extensional. The list of countries changes over time, but at any point the extensional list can be derived from the intensional definition. Similarly, the extensional code set can be derived from the intensional term set.

Procedure for constructing term sets

Our method to construct a term set:

1. Select a clinical code terminology.
2. Decide upon one or more inclusion terms, e.g. ‘heart failure’.
3. Perform a search within the terminology for codes with a definition matching the inclusion terms. The search rules are described below.
4. Optionally exclude matching definitions by adding exclusion terms. E.g. for ‘stroke’, it would make sense to exclude the term ‘family history’.
5. For hierarchical code terminologies, return codes that are descendants of matching codes, with definitions that do not contain an inclusion term.
6. Add inclusion or exclusion terms to explicitly include or exclude these descendant codes.
7. Iterate until all inclusion terms have been added, and there are no unmatched descendants.

Deciding upon inclusion and exclusion terms is often a complex task requiring medical expertise. Therefore, when implementing this method, a clinician would need to be involved, or at the very least an expert in the particular disease domain. However, for now we concentrate on the method itself, rather than its implementation. A worked example for the method can be found in S2 Appendix.
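The check for unmatched descendants in the procedure above can be sketched in Python. This is only an illustration under stated assumptions: the codes, the `parent` map and the function names are hypothetical, and `matches` is a simplified all-words rule rather than the full search rules described in the next section.

```python
import re

def words(text):
    """Lower-cased word tokens of a term or definition."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(term, definition):
    """Simplified rule (assumption): every word of the term occurs in the definition."""
    return words(term) <= words(definition)

def unmatched_descendants(definitions, parent, inclusion, exclusion):
    """Codes below a selected code that no inclusion or exclusion term has decided yet."""
    def hit(code, terms):
        return any(matches(t, definitions[code]) for t in terms)

    selected = {c for c in definitions
                if hit(c, inclusion) and not hit(c, exclusion)}
    pending = []
    for code in definitions:
        if hit(code, inclusion) or hit(code, exclusion):
            continue  # already decided by an existing term
        ancestor = parent.get(code)
        while ancestor is not None:
            if ancestor in selected:
                pending.append(code)
                break
            ancestor = parent.get(ancestor)
    return sorted(pending)

# Hypothetical toy terminology; these are not real Read or SNOMED codes.
definitions = {
    "A": "Heart failure",
    "A1": "Congestive heart failure",
    "A2": "Left ventricular failure",
    "B": "Family history of heart failure",
}
parent = {"A": None, "A1": "A", "A2": "A", "B": None}

inclusion = ["heart failure"]
exclusion = ["family history"]
print(unmatched_descendants(definitions, parent, inclusion, exclusion))  # ['A2']

# The researcher decides to include the descendant explicitly and re-checks.
inclusion.append("left ventricular failure")
print(unmatched_descendants(definitions, parent, inclusion, exclusion))  # []
```

The loop ends when the returned list is empty, which corresponds to the final step of the procedure.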

Search rules

Case insensitive

The term [fracture] matches “Shoulder fracture” and “Fracture of shoulder”.

Words are matched in any order

The term [shoulder fracture] matches “Shoulder fracture” and “Fracture of shoulder”.

All words must be present

The term [type 2 diabetes] matches “Diabetes, type 2” and “History of type 2 diabetes”, but not “Type 1 diabetes”.

Use quotes to match exactly

The term [“type 2 diabetes”] matches “Type 2 diabetes” and “History of type 2 diabetes” but not “Diabetes, type 2”.

Wildcards allow partial word searching

The term [diabet*] matches “Diabetes” and “Diabetic patient”.

Exact matches are never excluded

The term [heart failure] always matches “Heart failure” even if [heart] were excluded.
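One possible reading of these six rules is sketched below in Python. It is not the implementation used by the authors' software: the function names, the tokenisation (alphanumeric words only) and the toy codes are assumptions made for illustration.

```python
import re
from fnmatch import fnmatchcase

QUOTES = "\"“”"  # straight and curly double quotes

def tokens(text, keep_wildcard=False):
    """Lower-case the text and split it into word tokens."""
    pattern = r"[a-z0-9*]+" if keep_wildcard else r"[a-z0-9]+"
    return re.findall(pattern, text.lower())

def term_matches(term, definition):
    """Case-insensitive, any word order, all words present, quotes, wildcards."""
    term = term.strip()
    def_words = tokens(definition)
    if len(term) >= 2 and term[0] in QUOTES and term[-1] in QUOTES:
        # Quoted term: the words must appear consecutively, in order.
        phrase = tokens(term[1:-1])
        n = len(phrase)
        return n > 0 and any(def_words[i:i + n] == phrase
                             for i in range(len(def_words) - n + 1))
    patterns = tokens(term, keep_wildcard=True)
    return bool(patterns) and all(
        any(fnmatchcase(word, p) for word in def_words) for p in patterns)

def is_exact(term, definition):
    """True when the term is, word for word, the whole definition."""
    return tokens(term.strip().strip(QUOTES)) == tokens(definition)

def is_selected(definition, inclusion, exclusion):
    """Exact inclusion matches are never excluded; otherwise exclusions win."""
    if any(is_exact(t, definition) for t in inclusion):
        return True
    return (any(term_matches(t, definition) for t in inclusion)
            and not any(term_matches(t, definition) for t in exclusion))

def code_set(terminology, inclusion, exclusion):
    """Derive the code set for a {code: definition} terminology."""
    return sorted(code for code, definition in terminology.items()
                  if is_selected(definition, inclusion, exclusion))

# Hypothetical toy terminology; the codes are placeholders.
toy = {
    "X1": "Heart failure",
    "X2": "Family history of heart failure",
    "X3": "Congestive heart failure",
    "X4": "Renal failure",
}
print(code_set(toy, ["heart failure"], ["family history"]))  # ['X1', 'X3']
```

Treating a term as a list of word patterns keeps the any-order and all-words rules to a single expression, and the wildcard rule then only needs `fnmatchcase` applied per word.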

Proof that any code set can be represented as a term set

This ensures that our method can actually be used in practice for all code sets.

Clinical code terminology

A clinical code terminology T = (C,D,f) is a set of codes C, a set of definitions D, and a mapping function f:C→D that links each code c∈C with a set of one or more definitions d∈D. Examples for SNOMED CT, Read v2 and ICD-10 would be: The mapping function is surjective; each element of D is mapped to by at least one element of C. The inverse function f⁻¹:D→C therefore exists for all definitions in D and is defined such that ∀ d∈D, f⁻¹(d) = Y with c∈Y⇔d∈f(c).

Matching definition set

For a set of word sequences W = {w₁,…,wₙ} and a terminology T = (C,D,f) we define the matching definition set MD(T,W) as the set of all definitions d∈D where at least one w∈W matches d.

Matching definition set with exclusions

Given two sets of word sequences W,E and a terminology T = (C,D,f) we define the matching definition set with exclusions MDE(T,W,E) as the set of all definitions d∈D where at least one w∈W matches d and no e∈E matches d.

Matching concept set

For a terminology T = (C,D,f), and two sets of word sequences W,E, we define the matching concept set M(T,W,E) as all codes in the terminology that have a definition in MDE(T,W,E). Alternatively: M(T,W,E) = ⋃ f⁻¹(d), taken over all d∈MDE(T,W,E).

Proposition

Any subset of clinical codes from a terminology can be represented by a set of inclusion terms and a set of exclusion terms. Formally, for terminology T = (C,D,f) and any X = {x₁,x₂,…,xₙ}, a subset of C, there exists a set of inclusion word sequences I = {i₁,i₂,…,iₘ} and a set of exclusion word sequences E = {e₁,e₂,…,eₖ} such that M(T,I,E) = X.

Proof

Let I = f(X) and E = f(X). Then M(T,I,E) = X. For a complete proof and all definitions, see S1 Appendix.

Term set software

We have developed a web application (https://getset.herokuapp.com) that implements the above methods and allows users to create and verify term sets. The tool is currently implemented for Read v2 codes [17], which are used in UK general practice; however, it is straightforward to extend to other hierarchical terminologies such as ICD or SNOMED. Once created, term sets can be automatically verified and then shared via GitHub (https://github.com/). Users are encouraged to add their name, a short title and a description, so that researchers reusing their set can easily determine their intent.

Empirical study

The proof above demonstrates “completeness”: any code set can be represented as a term set. We also wished to demonstrate “efficiency”: a term set is shorter than the equivalent code set and is therefore easier and quicker to check. We therefore conducted an empirical study in which we found published clinical code sets, created their equivalent term set representations, and reported on their relative sizes.

GetSet is currently configured with Read v2, so we searched PubMed for papers using the Clinical Practice Research Datalink (CPRD) [18], a large primary care database containing Read v2 codes with hundreds of publications annually. We used the search term ("CPRD"[all fields] or "Clinical Practice Research Datalink"[all fields]) and sorted the results by date descending. Reviewing recent papers ensured that we could demonstrate that our method is valid for the current state of the art in clinical code set engineering.

We reviewed each paper in turn and included those that required the construction of code sets to define a cohort of patients. Cohort definition is the focal point of each paper, so its code sets are the ones most likely to appear. Also, by focussing on cohort definition, we avoided over-representation from papers with numerous code sets. For each paper reviewed we extracted any code sets that described a patient cohort for a condition/diagnosis that had not been previously included. Certain conditions will likely be studied more frequently than others; restricting ourselves to one code set per condition ensured we had a sufficient variety of diseases. We continued to review papers until code sets were discovered from 20 distinct papers. This ensured we would find at least 20 code sets for a variety of diagnoses and from a variety of authors.

We then created term set representations for each code set, using the above method, with the following caveats:

1. Any ‘medcodes’ (CPRD’s code dictionary) were first converted to Read v2 codes.
2. We removed all codes except Read v2 (e.g. CPRD also contains Oxmis codes, which were in use pre-2000, and CTV3 codes).
3. Where multiple codes have identical definitions, and the code set has included some but not all, we extended the code set to include them all.

For each code set we reported on the code set size and compared this with the number of inclusion and exclusion terms in our equivalent representation.

Results

The PubMed search was executed on 17th January 2018 by the lead author and returned 809 papers. The target of code sets from 20 distinct papers was reached after reviewing 45 papers; no further papers were reviewed. The 20 papers consisted of: 18 which included their code set in the paper, as a supplement, or in an online repository; 1 with code sets available on request so they were requested and received; and 1 that referenced code sets from another paper so this was retrieved to obtain the code sets. A total of 31 code sets for cohort definitions were found in the 20 papers. For further detail see: https://doi.org/10.5281/zenodo.1316984. The median number of codes in each code set was 48 (IQR [18,120]). The smallest code set was for Stevens-Johnson syndrome and contained 1 code, while the largest code set, for infections that could lead to a potential hospitalization, contained 3,219 codes. Each code set was successfully converted into a term set using our previously described procedure. The term sets are available at https://doi.org/10.5281/zenodo.1316984. The full list of code set definitions, their sizes, and the equivalent term set sizes are in Table 1. Nine code sets omitted codes with definitions identical to an included code and so these codes were added prior to the conversion process. As an example, the code set for rheumatoid arthritis included the code “N040R00: Rheumatoid nodule”, but did not include the code “N042200: Rheumatoid nodule”, therefore N042200 was added prior to the conversion to a term set. The full list of extra codes for these nine code sets is available in S1 Table.
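The pre-processing step for identical definitions can be illustrated with a short Python sketch. The two “Rheumatoid nodule” codes are the ones quoted above; the third entry, the function name and the case-insensitive comparison of definitions are assumptions for illustration only.

```python
def extend_with_identical_definitions(code_set, definitions):
    """Add every code whose definition is identical to that of an included code."""
    def key(text):
        # Assumption: definitions are compared ignoring case and extra spaces.
        return " ".join(text.lower().split())

    wanted = {key(definitions[c]) for c in code_set if c in definitions}
    twins = {c for c, d in definitions.items() if key(d) in wanted}
    return set(code_set) | twins

definitions = {
    "N040R00": "Rheumatoid nodule",
    "N042200": "Rheumatoid nodule",
    "ZZZ0000": "Unrelated example definition",  # hypothetical placeholder code
}
extended = extend_with_identical_definitions({"N040R00"}, definitions)
print(sorted(extended))  # ['N040R00', 'N042200']
```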

Table 1

Code set descriptions and sizes, the size of the related inclusion/exclusion term sets, and the inclusion/exclusion term sizes as proportions of the original code set size.

Proportions ≤ 100% are displayed in bold.

Cohort definition code sets	Code set size	Number of inclusion terms	Number of exclusion terms	Number of inclusion and exclusion terms as % of code set size
Type 2 diabetes mellitus [19]	116	5	9	12.1%
Cancer except non-melanoma skin cancer [20]	1395	67	144	15.1%
Total knee replacement [21]	40	2	8	25%
Polymyalgia rheumatica [22]	3	1	0	33.3%
Asthma specific [23]	120	4	51	45.8%
Hidradenitis suppurativa (HS) [24]	2	1	0	50%
Shortness of breath excluded [5]	29	11	4	51.7%
Shortness of breath [5]	48	11	14	52.1%
Dementia [25]	74	8	31	52.7%
Non acute heart failure [26]	40	22	0	55%
Ethnicity [27]	183	46	63	59.6%
Potential hospitalized infections [28]	3219	1383	537	59.6%
Tuberculosis [29]	151	4	95	65.6%
Shoulder dislocation [30]	18	2	10	66.7%
Country of birth [27]	467	241	88	70.4%
Giant cell arteritis [22]	7	4	1	71.4%
Type 1 diabetes mellitus [31]	35	4	25	82.9%
Psoriatic arthritis [32]	7	5	1	85.7%
Possible undiagnosed HS [24]	47	40	2	89.4%
Religion [27]	112	72	29	90.2%
Fragility fracture [33]	18	3	14	94.4%
Rheumatoid arthritis [34]	57	13	42	96.5%
Living alone [27]	65	39	25	98.5%
Colorectal cancer [35]	23	9	14	100%
Stevens-Johnson syndrome [36]	1	1	0	100%
Toxic epidermal necrolysis [36]	5	4	1	100%
Myotonic dystrophy type 1 [37]	2	2	0	100%
Marital status [27]	148	34	131	111.5%
Cohabitation [27]	85	22	79	118.8%
Residence [27]	168	92	111	120.1%
Heart failure [26]	55	20	48	123.6%

The total size of the term sets was on average 74% of the size of the code sets. In four code sets the total number of inclusion and exclusion terms exceeded the size of the code set: marital status, cohabitation, residence and heart failure. The code sets for marital status and cohabitation both use the code “1331.00: Single”. The inclusion term “single” matches many unrelated codes, so many exclusion terms are needed. The code sets for residence and heart failure were perhaps poorly defined by the original authors. The residence code set aims to include codes that describe a person’s residential status and includes such wonderful terms as “Fall from cliff, occurrence in residential institution” and “Bitten by crocodile, occurrence in residential institution”, but then doesn’t include the terms “Prolonged stay in weightless environment, occurrence in residential institution” or “Victim of avalanche, occurrence in residential institution”. In order to represent this precisely with a term set we needed to include a large number of unnecessary exclusion terms. Finally, the heart failure code set includes some, but not all, cardiomyopathy codes. There is no clinical reason for this, and the number of inclusion terms would reduce if “cardiomyopathy” could be included, as opposed to the current situation where the exact definitions of 15 cardiomyopathy codes must be included.
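As a worked check of the final column of Table 1, the proportion is the number of inclusion plus exclusion terms divided by the code set size. The short Python sketch below recomputes it for three rows copied from the table; the function and variable names are illustrative.

```python
def term_set_percentage(code_set_size, n_inclusion, n_exclusion):
    """Inclusion plus exclusion terms as a percentage of the code set size."""
    return 100 * (n_inclusion + n_exclusion) / code_set_size

# (code set size, inclusion terms, exclusion terms), taken from Table 1
rows = {
    "Type 2 diabetes mellitus": (116, 5, 9),
    "Total knee replacement": (40, 2, 8),
    "Heart failure": (55, 20, 48),
}
for name, (size, n_incl, n_excl) in rows.items():
    print(f"{name}: {term_set_percentage(size, n_incl, n_excl):.1f}%")
# Type 2 diabetes mellitus: 12.1%
# Total knee replacement: 25.0%
# Heart failure: 123.6%
```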

Discussion

We have developed a method for creating clinical code sets that incorporates metadata on how the code set was created. We have demonstrated with a formal proof that our method works for any code set, and have shown empirically that the lists of inclusion and exclusion terms are on average shorter than the list of codes themselves.

A recent HL7 initiative provides a method for defining intensional value sets (code sets) [38]. Using this method a researcher can define a set of rules which, when applied to a terminology, generate a code set. However, this does not give the creator of the code set any support, methodology or tools for how to create the rules for the intensional definition. In a similar way, Reference Sets [39] within SNOMED can be used to specify a subset of concepts for use in a particular application, but without creation support. Reference sets are also specific to SNOMED. Our approach provides a generalizable methodology and software tool which are used to build term sets and their associated code sets. Integration of the approaches could be achieved if term sets created with our software were exportable to the HL7 definition of an intensional value set. This would then provide a robust and transparent code set creation process, along with a precise, formal definition.

There are at least four existing tools and associated methodologies for constructing clinical code sets. Davé and Petersen [40] created code sets by searching for synonymous terms and browsing the hierarchy. The final Stata script can be shared so that the process can be scrutinized. Others have developed R/Stata scripts: pcdsearch [41] and CALIBERcodelists [42,43]. These scripts reuse the ideas of Davé and Petersen, while allowing more complex queries using Boolean operators and regular expressions. Recently Watson et al. [5] presented a three-stage process: defining the clinical concept a priori with clinician assistance; searching a clinical terminology using R or Stata to create an initial code set; and producing a final code set via a Delphi exercise with at least two GPs (the main difference to previous approaches).

Our approach builds on the strengths of these methods while addressing certain limitations. Each method above has a way of excluding codes; typically by specifying the codes themselves. By using exclusion terms, we produce metadata that is uncoupled from particular terminologies and is more readable to reviewers of the code set. The output of the above methods is always a script (Stata or R). By not tying our method to a particular scripting language, and using a simple web application, we reduce the barriers to the methodical creation, inspection and reuse of code sets. Allowing regular expressions may help the code set creator; however, it will likely act as a further barrier to reuse if the expressions get overcomplicated or if the next researcher is unfamiliar with regular expressions. We have kept our search strategy as simple as possible to mitigate this problem. Although some of the reviewed code sets may have used one of the above methods, none made available the scripts used to create them. It is probably a safe assumption that this is true for the majority of code sets. The problem, for researchers reusing the code set, is that it is unknown which codes are missing and whether they were omitted deliberately or accidentally. Using our methodology these decisions become explicit. A future researcher may disagree with a decision, but at least it is available for scrutiny, and they can reuse the generated code set by tweaking the definition rather than starting from scratch.

Clinician involvement in code set development is critical, but precisely how research groups incorporate our methodology into their working practices is an open question. One option would be to use the three-stage process from Watson et al. with steps one and two (synonym definition and code set creation) facilitated with our tool.

We found examples where definitions only make sense when considered in the context of the hierarchy. E.g. the term “single” could be a numerical descriptor or a marital status. Our search strategy could be extended to examine the definitions of each code’s ancestors. A search for “marital status single” would then return the code with the definition “single” only if it had ancestors that contained the words “marital” and “status”. This would alleviate the problem where inclusion terms with low specificity (“single” as a marital status, “white” as an ethnicity) lead to large numbers of exclusion terms.

The Read dictionary has a prefix-based hierarchy (G30’s parent is G3, G3’s parent is G). Two of the code sets we analysed (Dementia and potential hospitalized infections) used wildcards to represent multiple codes, e.g. “A*” to represent “A….” and all of its descendants. This leads to shorter code sets, which are easier to interpret; however, it is problematic for two reasons. Firstly, when a code is included in a set it is not necessary that all descendants should also be included, and simply using a wildcard gives no guarantees that the researcher has inspected and accepted each code. Secondly, as the actual codes used in the analysis are not explicitly provided, it is impossible to determine which codes were actually used because code dictionaries change over time, with codes added and removed. Our methodology, which encourages users to specify inclusion (or exclusion) terms to match all descendants of included codes, leads to more complete synonym lists and gives extra confidence to researchers reusing the code set.

Various problems were identified in the code sets (examples in Table 2). They fall into three categories: codes are included which do not correspond to the code set description; codes are omitted when they are obviously part of the code set; and some included and omitted codes are contradictory and should either all be included or all omitted. As we aimed to reproduce the code sets exactly, we have invariably created term sets with more inclusion and exclusion terms than are strictly necessary. By correcting the four code sets which had larger associated term sets we saw their average term set to code set proportion fall from 118.5% to 77.3%; all four term sets are now smaller than the code sets. For code sets constructed from scratch using our tool we would expect the number of inclusion and exclusion terms to be further reduced.
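The ancestor-aware search suggested above (“marital status single”) is described only as a possible extension, so the following Python sketch is speculative. It assumes an explicit `parent` map, hypothetical codes, and the additional rule that at least one search word must occur in the code's own definition, so that a search does not return every descendant of a matching ancestor.

```python
import re

def words(text):
    """Lower-cased word tokens of a term or definition."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def ancestors(code, parent):
    """Yield the ancestors of a code, nearest first."""
    current = parent.get(code)
    while current is not None:
        yield current
        current = parent.get(current)

def matches_in_context(term, code, definitions, parent):
    """All term words occur in the code's definition or in an ancestor's definition."""
    own = words(definitions[code])
    context = set()
    for ancestor in ancestors(code, parent):
        context |= words(definitions[ancestor])
    wanted = words(term)
    # Assumed extra rule: the code itself must contribute at least one word.
    return bool(wanted & own) and wanted <= (own | context)

# Hypothetical hierarchy; the codes and the "Dose frequency" branch are invented.
definitions = {
    "M": "Marital status",
    "M1": "Single",
    "M2": "Married",
    "Q": "Dose frequency",
    "Q1": "Single",
}
parent = {"M": None, "M1": "M", "M2": "M", "Q": None, "Q1": "Q"}

hits = [code for code in definitions
        if matches_in_context("marital status single", code, definitions, parent)]
print(hits)  # ['M1']
```

With this rule the marital-status “Single” is found without any exclusion terms, while the unrelated “Single” under a different parent is never matched.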

Table 2

Examples of problems encountered with code sets.

Code set	Included/omitted	Code	Definition	Reason for problem
Fragility fracture	Included:	S22..00	Fracture of humerus	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
	Included:	S222000	Closed fracture of humerus NOS
	Omitted:	S22z.00	Fracture of humerus NOS
Potential hospitalized infections	Included:	A53..00	Herpes zoster	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
		F501611	Herpes zoster—otitis externa
		A35..00	Erysipelas
	Omitted:	F501411	Erysipelas—otitis externa
	Included:	AB…00	Mycoses (and all descendant codes)
	Omitted:	FyuN500	Otitis externa in mycoses
		Hyu0E00	Pneumonia in mycoses classified elsewhere
		N016.00	Arthropathy associated with mycoses
Type II diabetes mellitus	Included:	C105100	Diabetes mellitus, adult onset, + ophthalmic manifestation	There is no clinical reason for type II diabetes why you would include the first two codes and exclude the second two. They should all be included.
	Included:	C10z100	Diabetes mellitus, adult onset, + unspecified complication
	Omitted:	C100100	Diabetes mellitus, adult onset, no mention of complication
	Omitted:	C101100	Diabetes mellitus, adult onset, with ketoacidosis
Type I diabetes mellitus	Included:	C10EE00	Type 1 diabetes mellitus with hypoglycaemic coma	There is no clinical reason for type I diabetes why you would include the first two codes and exclude the second two. They should all be included.
	Included:	C10EN00	Type 1 diabetes mellitus with ketoacidotic coma
	Omitted:	C10E300	Type 1 diabetes mellitus with multiple complications
	Omitted:	C10EA00	Type 1 diabetes mellitus without complication
Rheumatoid arthritis	Included:	N065.00	Unspecified polyarthropathy or polyarthritis	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
Rheumatoid arthritis	Omitted:	N065.11	Polyarthropathy not elsewhere classified
Marital status	Included:	13IL300	Wife alive	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
	Omitted:	13IL700	Husband alive
	Included:	13IL.00	Health of spouse
	Omitted:	13Fe.00	Lives with spouse
	Included:	13ID.00	Partner unemployed
	Omitted:	13IZ400	Partner alive
		13IZ500	Partner unwell
		13IZ600	Partner well
Cohabitation	Included:	13IL300	Wife alive	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
	Omitted:	13IL700	Husband alive
	Included:	13IL.00	Health of spouse
	Omitted:	13Fe.00	Lives with spouse
	Omitted:	13HG.11	Spouse left home
Living alone	Included:	13FH.00	Lives with relatives	These codes are examples of living with someone and are therefore not examples of living alone.
		13Is.00	Lives with grandfather
		13It.00	Lives with grandmother
Residence	Included:	U10F100	Fall from cliff, occurrence in residential institution	The included codes indicate residence in a residential institution. The omitted codes are equivalent to this and should be included.
	Included:	U128100	Bitten by crocodile or alligator, occurrence in residential institution
	Omitted:	U1B2100	Prolonged stay in weightless environment, occurrence in residential inst…
	Omitted:	U196100	Victim of avalanche, occurrence in residential institution
Religion	Omitted:	13yL.00	Tibetan Buddhist	These codes are examples of religions and should be included.
	Omitted:	13yu.00	Coptic orthodox
	Omitted:	13zS.00	Weslyan Methodist
Country of birth	Omitted:	13dt.00	Born in Isle of Man	These codes are indicative of country of birth and so should be included.
	Omitted:	13du.00	Born in Faroe Islands
	Omitted:	13dv.00	Born in Greenland
Ethnicity	Omitted:	9TC..00	Roma ethnic group	These codes are descriptive of ethnicity and so should be included.
	Omitted:	9TC0.00	Bulgarian Roma
	Omitted:	9TC1.00	Czech Roma
	Included:	9S6..00	Indian	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
	Omitted:	1347.00	Indian origin
Heart failure	Included:	G55..00	Cardiomyopathy	There is no clinical reason for heart failure why you would include one code for cardiomyopathy but then exclude others. They should all be included.
	Omitted:	G558200	Dystrophic cardiomyopathy
	Omitted:	G558400	Amyloid cardiomyopathy
	Omitted:	G558.00	Cardiomyopathy in disease EC
Shortness of breath	Omitted:	173g.00	Breathlessness causing difficulty eating	This code is a synonym for shortness of breath and so should be included.
Tuberculosis	Included:	Ayu1900	Miliary tuberculosis, unspecified	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
	Omitted:	Ayu1800	Other miliary tuberculosis
	Included:	Ayu1300	Respiratory TB unspecified, no mention of bacteriological confirmation
	Omitted:	Ayu1100	Respiratory TB unspecified, confirmed bacteriologically and histologically
Cancer except non-melanoma skin cancer	Omitted:	B305B00	Malignant neoplasm of fourth metacarpal bone	These are types of cancer and should be included.
	Omitted:	ByuB.00	Malignant neoplasm of thyroid and other endocrine glands
	Omitted:	B640000	B-cell acute lymphoblastic leukaemia
	Omitted:	B624.12	Hairy cell leukaemia
	Omitted:	B509.00	Malignant melanoma of eye
Asthma	Included:	679J000	Health education—asthma self management	The included and omitted codes are contradictory. This leads to additional, unnecessary, inclusion and exclusion terms.
Asthma	Omitted:	679J.00	Health education–asthma

There are reasons why published code sets have omissions that are not necessarily errors. A researcher might justifiably decide that it is more important to capture a short list of codes that occur most frequently in their dataset than to focus on codes that occur infrequently or not at all. This may be adequate for their own research, but it is not sufficient for other researchers who want to reuse the code set on different data sources. The burden of large code sets might have encouraged researchers to keep their code sets short, but with our methodology this is no longer a restriction, as validation can be performed on the shorter term sets rather than on the code sets.

Another valid reason for omissions is that code dictionaries change over time, so codes recently added to a terminology may not appear in a code set. This becomes a question of how best to keep code sets updated over time, and our approach provides a simple way to do this. Previously, a researcher updating a code set who had not kept records of their search strategy from several years before might end up recreating the code set. Now, with the inclusion and exclusion terms captured and stored alongside the code set, one simply executes the term set definition against the updated code dictionary to see what additional codes may or may not need to be included.

We have demonstrated our method using Read codes; however, the only precondition is that a terminology maps codes to definitions in a hierarchy, so our method would transfer easily to other terminologies such as SNOMED and ICD. One interesting avenue for further investigation is whether code sets can be translated into different terminologies. Once a researcher has defined a code set for one terminology, they could use the web tool to switch to a second terminology and automatically apply the same inclusion and exclusion terms to define a code set for that terminology. This would be useful for researchers using UK primary care data, which is migrating from Read to SNOMED.
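
The update step described above, re-running a stored term set against a newer code dictionary, can be illustrated with a minimal sketch. It is not the authors' web tool: the function `apply_term_set`, the two dictionary versions, and all codes and descriptions are hypothetical, and matching is assumed to be a simple case-insensitive substring test that ignores the terminology hierarchy.

```python
from typing import Dict, List, Set


def apply_term_set(
    dictionary: Dict[str, str],
    include: List[str],
    exclude: List[str],
) -> Set[str]:
    """Return codes whose description matches an inclusion term and no exclusion term.

    Assumption: a term matches when it occurs as a case-insensitive substring
    of the code description; the code hierarchy is ignored in this sketch.
    """
    matched = set()
    for code, description in dictionary.items():
        text = description.lower()
        included = any(term.lower() in text for term in include)
        excluded = any(term.lower() in text for term in exclude)
        if included and not excluded:
            matched.add(code)
    return matched


# Hypothetical dictionary versions (made-up codes and descriptions).
dictionary_v1 = {
    "A1": "Asthma",
    "A2": "Exercise induced asthma",
    "A3": "Cardiac asthma",
}
dictionary_v2 = {**dictionary_v1, "A4": "Allergic asthma", "A5": "Asthma resolved"}

# The term set stored alongside the original code set.
include_terms = ["asthma"]
exclude_terms = ["cardiac asthma", "asthma resolved"]

old_codes = apply_term_set(dictionary_v1, include_terms, exclude_terms)
new_codes = apply_term_set(dictionary_v2, include_terms, exclude_terms)

# Codes that the updated dictionary adds to the resulting code set.
codes_to_review = sorted(new_codes - old_codes)
print(codes_to_review)
```

Under the same assumptions, switching terminology amounts to passing a different dictionary to the same function while the inclusion and exclusion terms stay unchanged.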

Strengths

We have shown that our method works formally via the proof and empirically via the code set mapping exercise. Using recent code sets from a variety of authors and for a variety of conditions demonstrates the generalisability of our technique. We have built upon the ideas from existing tools and methodologies as well as the recommendations from our earlier review [3].

Limitations

The search for papers was performed by a single author. However, given the transparency of the search strategy, the biggest risk is that a paper containing a code set was incorrectly rejected. Any such error would presumably be random and would not affect the results. The list of papers reviewed is also available for inspection at https://doi.org/10.5281/zenodo.1316984. The decision to select code sets for the cohort definition, rather than for the outcomes or the confounders, could have affected the results. However, we found code sets for a wide variety of conditions and had few problems converting them into our format, so we consider it likely that this would extend to other conditions.

Code sets can be represented in multiple ways, some of which will be easier to understand than others. Some researchers may therefore be able to produce ‘better’ term sets. This can also be seen as a strength, as researchers are more likely to use term sets that are more clearly defined, so these term sets will prevail at the expense of those that are harder to understand.

There may be occasions where it is unclear whether a code should be included, for example if clinicians use the code in different ways. At present, one solution is to create two or more term sets that either include or exclude the uncertain codes. These term sets would have slightly different inclusion and exclusion lists, and their associated descriptions would highlight how sensitive or specific each term set was.

Finally, although term sets are largely terminology agnostic, on occasion particular inclusion and exclusion terms are loosely tied to the terminology used. One extreme example in Read v2 is the term “G21z00: …without congestive cardic failure”, which misspells the word “cardiac”. Selecting this code requires an inclusion term of “cardic failure”, which could be confusing and is unlikely to work in other terminologies. This is, however, an infrequent occurrence.
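
The workaround described above for codes whose inclusion is uncertain, publishing two or more term sets with slightly different exclusion lists and a description of each, might look like the following sketch. The `TermSet` class, its fields, and every code and description are hypothetical and chosen only for illustration; matching is again assumed to be a case-insensitive substring test.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class TermSet:
    """A hypothetical term set: a description plus inclusion and exclusion terms."""

    description: str
    include: Tuple[str, ...]
    exclude: Tuple[str, ...] = ()

    def codes(self, dictionary: Dict[str, str]) -> FrozenSet[str]:
        """Codes matching at least one inclusion term and no exclusion term."""

        def hit(terms: Tuple[str, ...], text: str) -> bool:
            return any(term.lower() in text for term in terms)

        return frozenset(
            code
            for code, desc in dictionary.items()
            if hit(self.include, desc.lower()) and not hit(self.exclude, desc.lower())
        )


# Made-up dictionary in which one code ("D3") is of uncertain relevance.
dictionary = {
    "D1": "Type 2 diabetes mellitus",
    "D2": "Type 2 diabetes mellitus with neuropathy",
    "D3": "Suspected type 2 diabetes mellitus",
}

sensitive = TermSet(
    description="Sensitive variant: keeps 'suspected' codes",
    include=("type 2 diabetes",),
)
specific = TermSet(
    description="Specific variant: excludes 'suspected' codes",
    include=("type 2 diabetes",),
    exclude=("suspected",),
)

# The uncertain codes are exactly those on which the two variants disagree.
uncertain_codes = sorted(sensitive.codes(dictionary) - specific.codes(dictionary))
print(uncertain_codes)
```

Keeping both variants, each with its own description, lets a later reader see which codes were judged uncertain by comparing the two results.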

Conclusion

We have developed a new representation of cohort selection criteria for EHR-based studies, the term set, which consists of inclusion and exclusion terms together with a clinical code terminology and version. We have described a method to create term sets and developed an open-source web application that implements this procedure. We have shown that our representation is as expressive as clinical code sets, but more efficient. Finally, term sets are easier to share, inspect, and reuse, because they are independent of specific (versions of) clinical terminologies. We expect that this will benefit transparent and reproducible research with EHR data.

Full proof and definitions.

Full formal proof and all definitions for the claim that a term set can represent any code set. (DOCX)

Inconsistent duplicated codes.

Codes added to code sets where a code with an identical definition had been excluded. (DOCX)

Worked example.

Step by step construction of a term set for Type 2 Diabetes. (DOCX)

37 in total

1. Clinical code set engineering for reusing EHR data for research: A review.

Authors: Richard Williams; Evangelos Kontopantelis; Iain Buchan; Niels Peek
Journal: J Biomed Inform Date: 2017-04-22 Impact factor: 6.317

2. Angiotensin receptor blocker use and gastro-oesophageal cancer survival: a population-based cohort study.

Authors: J Busby; Ú McMenamin; A Spence; B T Johnston; C Hughes; C R Cardwell
Journal: Aliment Pharmacol Ther Date: 2017-11-03 Impact factor: 8.171

3. Rheumatoid arthritis in UK primary care: incidence and prior morbidity.

Authors: L A García Rodríguez; L B Tolosa; A Ruigómez; S Johansson; M-A Wallander
Journal: Scand J Rheumatol Date: 2009 May-Jun Impact factor: 3.641

4. Risk of uveitis and inflammatory bowel disease in people with psoriatic arthritis: a population-based cohort study.

Authors: Rachel Charlton; Amelia Green; Gavin Shaddick; Julia Snowball; Alison Nightingale; William Tillett; Catherine H Smith; Neil McHugh
Journal: Ann Rheum Dis Date: 2017-11-01 Impact factor: 19.103

5. An Ontology to Improve Transparency in Case Definition and Increase Case Finding of Infectious Intestinal Disease: Database Study in English General Practice.

Authors: Simon de Lusignan; Stacy Shinneman; Ivelina Yonova; Jeremy van Vlymen; Alex J Elliot; Frederick Bolton; Gillian E Smith; Sarah O'Brien
Journal: JMIR Med Inform Date: 2017-09-28

6. Temporal trends and patterns in heart failure incidence: a population-based study of 4 million individuals.

Authors: Nathalie Conrad; Andrew Judge; Jenny Tran; Hamid Mohseni; Deborah Hedgecott; Abel Perez Crespillo; Moira Allison; Harry Hemingway; John G Cleland; John J V McMurray; Kazem Rahimi
Journal: Lancet Date: 2017-11-21 Impact factor: 79.321

7. Selection of medical diagnostic codes for analysis of electronic patient records. Application to stroke in a primary care database.

Authors: Martin C Gulliford; Judith Charlton; Mark Ashworth; Anthony G Rudd; Andre Michael Toschke
Journal: PLoS One Date: 2009-09-24 Impact factor: 3.240

8. Data resource profile: cardiovascular disease research using linked bespoke studies and electronic health records (CALIBER).

Authors: Spiros C Denaxas; Julie George; Emily Herrett; Anoop D Shah; Dipak Kalra; Aroon D Hingorani; Mika Kivimaki; Adam D Timmis; Liam Smeeth; Harry Hemingway
Journal: Int J Epidemiol Date: 2012-12-05 Impact factor: 7.196

9. The FAIR Guiding Principles for scientific data management and stewardship.

Authors: Mark D Wilkinson; Michel Dumontier; IJsbrand Jan Aalbersberg; Gabrielle Appleton; Myles Axton; Arie Baak; Niklas Blomberg; Jan-Willem Boiten; Luiz Bonino da Silva Santos; Philip E Bourne; Jildau Bouwman; Anthony J Brookes; Tim Clark; Mercè Crosas; Ingrid Dillo; Olivier Dumon; Scott Edmunds; Chris T Evelo; Richard Finkers; Alejandra Gonzalez-Beltran; Alasdair J G Gray; Paul Groth; Carole Goble; Jeffrey S Grethe; Jaap Heringa; Peter A C 't Hoen; Rob Hooft; Tobias Kuhn; Ruben Kok; Joost Kok; Scott J Lusher; Maryann E Martone; Albert Mons; Abel L Packer; Bengt Persson; Philippe Rocca-Serra; Marco Roos; Rene van Schaik; Susanna-Assunta Sansone; Erik Schultes; Thierry Sengstag; Ted Slater; George Strawn; Morris A Swertz; Mark Thompson; Johan van der Lei; Erik van Mulligen; Jan Velterop; Andra Waagmeester; Peter Wittenburg; Katherine Wolstencroft; Jun Zhao; Barend Mons
Journal: Sci Data Date: 2016-03-15 Impact factor: 6.444

10. Protective effect of antirheumatic drugs on dementia in rheumatoid arthritis patients.

Authors: Andy Judge; Cesar Garriga; Nigel K Arden; Simon Lovestone; Dani Prieto-Alhambra; Cyrus Cooper; Christopher J Edwards
Journal: Alzheimers Dement (N Y) Date: 2017-11-09
