Chao Pang1, Annet Sollie2, Anna Sijtsma3, Dennis Hendriksen4, Bart Charbon4, Mark de Haan4, Tommy de Boer4, Fleur Kelpin4, Jonathan Jetten4, Joeri K van der Velde4, Nynke Smidt3, Rolf Sijmons2, Hans Hillege5, Morris A Swertz6.
Abstract
There is an urgent need to standardize the semantics of biomedical data values, such as phenotypes, to enable comparative and integrative analyses. However, it is unlikely that all studies will use the same data collection protocols. As a result, retrospective standardization is often required, which involves matching original (unstructured or locally coded) data to widely used coding or ontology systems such as SNOMED CT (clinical terms), ICD-10 (International Classification of Diseases) and HPO (Human Phenotype Ontology). This data curation is usually a time-consuming task performed by a human expert. To help mechanize this process, we have developed SORTA, a computer-aided system for rapidly encoding free text or locally coded values to a formal coding system or ontology. SORTA matches original data values (uploaded in semicolon-delimited format) to a target coding system (uploaded as an Excel spreadsheet, or in OWL [Web Ontology Language] or OBO [Open Biomedical Ontologies] format). It then semi-automatically shortlists candidate codes for each data value using Lucene and n-gram based matching algorithms, and can also learn from matches chosen by human experts. We evaluated SORTA's applicability in two use cases. For the LifeLines biobank, we used SORTA to recode 90 000 free text values (including 5211 unique values) about physical exercise to MET (Metabolic Equivalent of Task) codes. For the CINEAS clinical symptom coding system, we used SORTA to map to HPO, enriching HPO when necessary (315 terms matched so far). Out of the shortlists at rank 1, we found a recall/precision of 0.97/0.98 in LifeLines and of 0.58/0.45 in CINEAS. More importantly, users found the tool both a major time saver and a quality improvement because SORTA reduced the chances of human mistakes. Thus, SORTA can dramatically ease data (re)coding tasks and we believe it will prove useful for many more projects.
Database URL: http://molgenis.org/sorta or as an open source download from http://www.molgenis.org/wiki/SORTA.
Year: 2015 PMID: 26385205 PMCID: PMC4574036 DOI: 10.1093/database/bav089
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Comparison of existing tools with SORTA
| | SORTA | BioPortal Annotator | ZOOMA | Shiva | Agreement maker | LogMap | Peregrine |
|---|---|---|---|---|---|---|---|
| Comparable similarity score | Y | N | N | N | Y | Y | N |
| Import code system in ontology format | Y | Y | Y | Y | Y | Y | Y |
| Import code system in Excel format | Y | N | N | N | N | N | N |
| Uses lexical index to improve performance | Y | Y | Y | N | N | Y | Y |
| Code/Recode data directly in the tool | Y | N | N | N | Y | N | N |
| Tool available as online service | Y | Y | Y | N/A | N/A | N/A | N |
| Support partial matches | Y | N | N | Y | Y | Y | N |
| Match complex data values | Y | N | N | Y | Y | Y | N |
| Learns from curated dataset | Y | N | Y | N | N | N | N |
Y represents Yes; N represents No; N/A represents unknown
ZOOMA and BioPortal Annotator were the closest to our needs.
Figure 1. SORTA overview. The desired coding system or ontology can be uploaded in OWL/OBO or Excel format and is indexed for fast matching searches. Data values can be uploaded and then automatically matched with the indexed ontology using Lucene. A list of the most relevant concepts is retrieved from the index and matching percentages are calculated using the n-gram algorithm so that users can easily evaluate the matching score. Users can choose the mappings from the suggested list.
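To illustrate how an n-gram matching percentage of this kind could be computed, here is a minimal sketch. It assumes character bigrams and a Dice coefficient, which may differ from the exact scoring SORTA uses, and the function names (`bigrams`, `ngram_similarity`, `shortlist`) are hypothetical. The Lucene retrieval step is not reproduced; the sketch simply scores every label it is given.

```python
from collections import Counter


def bigrams(text):
    """Split a string into character bigrams, with start/end padding."""
    padded = f"^{text.lower().strip()}$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def ngram_similarity(a, b):
    """Return a 0-100 matching percentage (Dice coefficient on bigrams).

    Assumption: this is an illustrative formula, not necessarily SORTA's.
    """
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    shared = sum((count_a & count_b).values())  # bigrams present in both
    total = sum(count_a.values()) + sum(count_b.values())
    return 100.0 * 2 * shared / total


def shortlist(value, labels):
    """Rank candidate labels for one input value, best match first."""
    scored = [(label, ngram_similarity(value, label)) for label in labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for label, score in shortlist("swiming", ["running", "swimming", "hockey"]):
        print(f"{label}: {score:.0f}%")
```

Because the score is normalized to a percentage, scores for different input values are comparable, which is what lets a user judge at a glance whether the top candidate is plausible.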
Example of how to upload a coding system and a coding/recoding target
| Concept ID | Concept Label | System ID |
|---|---|---|
| 02060 | cardio training | MET |
| 02020 | bodypump | MET |
| 18310 | swimming | MET |
| 15430 | kung fu | MET |
| 15350 | hockey | MET |
| 12150 | running | MET |
This example shows an Excel file with MET (Metabolic Equivalent of Task), a system developed to standardize physical activity, in which each concept ID includes a list of different sports representing specific amounts of energy consumption.
Example of how to upload data values and a coding/recoding source
| Name (required) | Synonym_1 (optional) | OMIM (optional) |
|---|---|---|
| 2,4-dienoyl-CoA reductase deficiency | DER deficiency | 222745 |
| 3-methylcrotonyl-CoA carboxylase deficiency | 3MCC | 210200 |
| Acid sphingomyelinase deficiency | ASM | 607608 |
At minimum, one column of values should be provided: the first column with the header ‘Name’. Additional optional columns that start with ‘Synonym_’ can contain the synonyms for input values. Other optional column headers can contain other identifiers, e.g. in this example OMIM.
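The column rules above can be sketched as a small parser. This is an illustration only, assuming the semicolon-delimited upload format mentioned in the abstract; the function name `read_input_values` and the shape of the returned dictionaries are hypothetical and not part of SORTA.

```python
import csv
import io


def read_input_values(text):
    """Parse semicolon-delimited input into name, synonyms and identifiers.

    Assumed rules (from the table note): the first column must be 'Name',
    columns starting with 'Synonym_' hold synonyms, and any other column
    is treated as an extra identifier such as OMIM.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    if not reader.fieldnames or reader.fieldnames[0] != "Name":
        raise ValueError("the first column header must be 'Name'")

    records = []
    for row in reader:
        synonyms, identifiers = [], {}
        for header, value in row.items():
            # Skip the name column, surplus cells (header None) and blanks.
            if header is None or header == "Name" or not value:
                continue
            if header.startswith("Synonym_"):
                synonyms.append(value.strip())
            else:
                identifiers[header] = value.strip()
        records.append({
            "name": row["Name"].strip(),
            "synonyms": synonyms,
            "identifiers": identifiers,
        })
    return records


if __name__ == "__main__":
    sample = (
        "Name;Synonym_1;OMIM\n"
        "2,4-dienoyl-CoA reductase deficiency;DER deficiency;222745\n"
        "Acid sphingomyelinase deficiency;ASM;607608\n"
    )
    for record in read_input_values(sample):
        print(record)
```

Using a semicolon as the delimiter means that values containing commas, such as "2,4-dienoyl-CoA reductase deficiency", need no quoting.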
Figure 2. Example of coding a physical activity. A list of MET codes was matched with input and sorted based on similarity scores, from which the proper code can be selected to recode the input. If none of the candidate codes is suitable, users can either search for codes manually or decide to use ‘Unknown code’. If the button ‘Code data’ is clicked, the input is recoded only with the selected code. If the button ‘Code and add’ is clicked, the input is recoded and the input gets added to the code as a new synonym. The example is a typo of the Dutch word for ‘swimming’. zwemmen = swimming, zwemmen 2x = twice a week, soms zwemmen = occasional swimming, gym-zwemmen = water gym.
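The difference between ‘Code data’ and ‘Code and add’ can be sketched as follows. This is a minimal illustration of the behaviour described in the caption, not SORTA's implementation; the class `CodeSystem` and its method names are hypothetical.

```python
class CodeSystem:
    """Toy in-memory code system that can learn synonyms from a curator."""

    def __init__(self):
        self.labels = {}    # concept ID -> official label
        self.synonyms = {}  # concept ID -> set of learned synonyms

    def add_concept(self, concept_id, label):
        self.labels[concept_id] = label
        self.synonyms.setdefault(concept_id, set())

    def code_data(self, value, concept_id):
        """'Code data': recode the input, leave the code system unchanged."""
        if concept_id not in self.labels:
            raise KeyError(f"unknown concept ID: {concept_id}")
        return (value, concept_id)

    def code_and_add(self, value, concept_id):
        """'Code and add': recode the input and store it as a new synonym."""
        mapping = self.code_data(value, concept_id)
        self.synonyms[concept_id].add(value)
        return mapping

    def terms(self, concept_id):
        """All strings future inputs can be matched against for this code."""
        return {self.labels[concept_id]} | self.synonyms[concept_id]


if __name__ == "__main__":
    met = CodeSystem()
    met.add_concept("18310", "swimming")
    met.code_data("zwemmen 2x", "18310")   # recoded, nothing learned
    met.code_and_add("zwemmen", "18310")   # recoded and learned
    print(sorted(met.terms("18310")))
```

Storing the curated input as a synonym is what allows a later upload of similar values to match the same code directly.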
Figure 3. Receiver operating characteristic (ROC) curves evaluating performance on LifeLines data. Blue represents the performance before the researcher recoded all the LifeLines data. During coding, the researcher introduced new knowledge to the database and if a similar dataset was uploaded again (e.g. second rounds of the same questionnaire), the coding performance greatly improved as shown by the red curve.
Precision and recall for the LifeLines case study
| Rank cut-off | Recall (before coding) | Precision (before coding) | F-measure (before coding) | Recall (after coding) | Precision (after coding) | F-measure (after coding) |
|---|---|---|---|---|---|---|
| 1 | 0.59 | 0.65 | 0.62 | 0.97 | 0.98 | 0.97 |
| 2 | 0.66 | 0.39 | 0.49 | 0.97 | 0.50 | 0.66 |
| 3 | 0.71 | 0.29 | 0.41 | 0.97 | 0.34 | 0.50 |
| 4 | 0.74 | 0.24 | 0.36 | 0.97 | 0.26 | 0.41 |
| 5 | 0.76 | 0.21 | 0.33 | 0.97 | 0.21 | 0.35 |
| 6 | 0.77 | 0.19 | 0.30 | 0.97 | 0.18 | 0.30 |
| 7 | 0.78 | 0.17 | 0.28 | 0.97 | 0.15 | 0.26 |
| 8 | 0.78 | 0.16 | 0.27 | 0.98 | 0.14 | 0.25 |
| 9 | 0.78 | 0.14 | 0.24 | 0.98 | 0.12 | 0.21 |
| 10 | 0.79 | 0.14 | 0.24 | 0.98 | 0.11 | 0.20 |
| 11 | 0.79 | 0.13 | 0.22 | 0.98 | 0.10 | 0.18 |
| 12 | 0.79 | 0.12 | 0.21 | 0.98 | 0.09 | 0.16 |
| 13 | 0.79 | 0.12 | 0.21 | 0.98 | 0.09 | 0.16 |
| 14 | 0.79 | 0.12 | 0.21 | 0.98 | 0.08 | 0.15 |
| 15 | 0.79 | 0.11 | 0.19 | 0.98 | 0.08 | 0.15 |
| 16 | 0.79 | 0.11 | 0.19 | 0.98 | 0.07 | 0.13 |
| 17 | 0.79 | 0.11 | 0.19 | 0.98 | 0.07 | 0.13 |
| 18 | 0.80 | 0.11 | 0.19 | 0.98 | 0.06 | 0.11 |
| 19 | 0.80 | 0.10 | 0.18 | 0.98 | 0.06 | 0.11 |
| 20 | 0.80 | 0.10 | 0.18 | 0.98 | 0.06 | 0.11 |
| 30 | 0.80 | 0.10 | 0.18 | 0.98 | 0.04 | 0.08 |
| 50 | 0.80 | 0.09 | 0.16 | 0.98 | 0.03 | 0.06 |
In total, 90 000 free text values about physical exercise (of which 5211 were unique) were recoded using the MET coding system. The table shows recall and precision per rank cut-off in the SORTA result before coding (using only the MET code descriptions) and after coding (when a human curator had already processed a large set of SORTA recommendations by hand).
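The relationship between the three measures in the table can be sketched as follows. The F-measure is the standard harmonic mean of precision and recall. The definitions of precision and recall at a rank cut-off used here are an assumption (a hit is an input whose correct code appears within the top k candidates; precision divides hits by all candidates shown), and the function names are hypothetical.

```python
def precision_recall_at_k(shortlists, gold, k):
    """Compute precision and recall at rank cut-off k.

    shortlists: dict mapping input value -> ranked list of candidate codes
    gold:       dict mapping input value -> the correct code
    """
    hits = 0      # inputs whose correct code is within the top k
    returned = 0  # total number of candidates shown within the top k
    for value, candidates in shortlists.items():
        top_k = candidates[:k]
        returned += len(top_k)
        if gold[value] in top_k:
            hits += 1
    precision = hits / returned if returned else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Rank 1 values from the table, before and after coding.
    print(round(f_measure(0.65, 0.59), 2))
    print(round(f_measure(0.98, 0.97), 2))
```

Under these definitions, each additional rank adds one more candidate per input but at most one more hit, which is consistent with precision falling steadily while recall levels off as the cut-off grows.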
Figure 4. Example of matching the input value ‘external auditory canal defect’ with HPO ontology terms. A list of candidate HPO ontology terms was retrieved from the index and sorted based on similarity scores. Users can select a mapping by clicking the ‘v’ button. If none of the candidate mappings are suitable, users can choose the ‘No match’ option.
Comparison of SORTA, BioPortal and ZOOMA
| Rank cut-off | SORTA recall | SORTA precision | SORTA F-measure | BioPortal recall | BioPortal precision | BioPortal F-measure | ZOOMA recall | ZOOMA precision | ZOOMA F-measure |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.58 | 0.45 | 0.51 | 0.34 | 0.54 | 0.42 | 0.17 | 0.63 | 0.27 |
| 2 | 0.69 | 0.27 | 0.39 | 0.35 | 0.44 | 0.39 | 0.17 | 0.60 | 0.26 |
| 3 | 0.73 | 0.19 | 0.30 | 0.35 | 0.44 | 0.39 | 0.18 | 0.60 | 0.28 |
| 4 | 0.76 | 0.15 | 0.25 | N/A | N/A | N/A | N/A | N/A | N/A |
| 5 | 0.78 | 0.13 | 0.22 | N/A | N/A | N/A | N/A | N/A | N/A |
| 6 | 0.81 | 0.11 | 0.19 | N/A | N/A | N/A | N/A | N/A | N/A |
| 7 | 0.81 | 0.09 | 0.16 | N/A | N/A | N/A | N/A | N/A | N/A |
| 8 | 0.83 | 0.08 | 0.15 | N/A | N/A | N/A | N/A | N/A | N/A |
| 9 | 0.83 | 0.08 | 0.15 | N/A | N/A | N/A | N/A | N/A | N/A |
| 10 | 0.85 | 0.07 | 0.13 | N/A | N/A | N/A | N/A | N/A | N/A |
| 11 | 0.85 | 0.06 | 0.11 | N/A | N/A | N/A | N/A | N/A | N/A |
| 12 | 0.85 | 0.06 | 0.11 | N/A | N/A | N/A | N/A | N/A | N/A |
| 13 | 0.86 | 0.06 | 0.11 | N/A | N/A | N/A | N/A | N/A | N/A |
| 14 | 0.86 | 0.05 | 0.09 | N/A | N/A | N/A | N/A | N/A | N/A |
| 15 | 0.87 | 0.05 | 0.09 | N/A | N/A | N/A | N/A | N/A | N/A |
| 16 | 0.87 | 0.05 | 0.09 | N/A | N/A | N/A | N/A | N/A | N/A |
| 17 | 0.87 | 0.05 | 0.09 | N/A | N/A | N/A | N/A | N/A | N/A |
| 18 | 0.88 | 0.04 | 0.08 | N/A | N/A | N/A | N/A | N/A | N/A |
| 19 | 0.88 | 0.04 | 0.08 | N/A | N/A | N/A | N/A | N/A | N/A |
| 20 | 0.88 | 0.04 | 0.08 | N/A | N/A | N/A | N/A | N/A | N/A |
| 30 | 0.89 | 0.03 | 0.06 | N/A | N/A | N/A | N/A | N/A | N/A |
| 50 | 0.92 | 0.02 | 0.04 | N/A | N/A | N/A | N/A | N/A | N/A |
N/A: not applicable.
Evaluation based on the CINEAS case study in which 315 clinical symptoms were matched to Human Phenotype Ontology. The table shows the recall/precision per position in SORTA, BioPortal Annotator and ZOOMA. N.B. both BioPortal Annotator and ZOOMA have a limitation that they can only find exact matches and return a maximum of three candidates.
Figure 5. Performance comparison for matching HPO terms among three algorithms. Lucene (blue line), combination of Lucene + n-gram (red) and combination of Lucene + n-gram + inverse document frequency (green).