Sebastian Mate, Marvin Kampf, Wolfgang Rödle, Stefan Kraus, Rumyana Proynova, Kaisa Silander, Lars Ebert, Martin Lablans, Christina Schüttler, Christian Knell, Niina Eklund, Michael Hummel, Petr Holub, Hans-Ulrich Prokosch.
Abstract
BACKGROUND: High-quality clinical data and biological specimens are key for medical research and personalized medicine. The Biobanking and Biomolecular Resources Research Infrastructure-European Research Infrastructure Consortium (BBMRI-ERIC) aims to facilitate access to such biological resources. The accompanying ADOPT BBMRI-ERIC project kick-started BBMRI-ERIC by collecting colorectal cancer data from European biobanks.
Year: 2019 PMID: 31509880 PMCID: PMC6739205 DOI: 10.1055/s-0039-1695793
Source DB: PubMed Journal: Appl Clin Inform ISSN: 1869-0327 Impact factor: 2.342
Example for the EAV table format (mock-up data)
| Patient ID | Concept | Value | Instance |
|---|---|---|---|
| 1 | TNM-T | T1 | 1 |
| 1 | TNM-N | N0 | 1 |
| 1 | TNM-M | M1 | 1 |
| 1 | Age | 45 | 1 |
| 1 | Metastases | | |
| 1 | Metastases | | |
| 1 | Gender | Male | 1 |
| 1 | Date of Surgery | 21.07.2013 | 1 |
| 2 | TNM-T | T4 | 1 |
| 2 | TNM-N | N0 | 1 |
| … | … | … | … |
Abbreviation: EAV, Entity-Attribute-Value.
Example for the flat file format (mock-up data)
| Patient | TNM-T | TNM-N | TNM-M | Age | Metastases | Gender | Date of surgery |
|---|---|---|---|---|---|---|---|
| 1 | T1 | N0 | M1 | 45 | | Male | 21.07.2013 |
| 2 | T4 | N0 | M0 | 34 | | Female | 23.11.2018 |
| 3 | T2 | N0 | M1 | 21 | Osseus | Male | 04.04.2017 |
| 4 | T1 | N1 | M1 | 43 | Brain; Osseus | Male | 19.03.2012 |
| 5 | T2 | N2 | M1 | 76 | Hepatic | Female | 10.09.2013 |
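
The relationship between the two input formats can be illustrated with a short Python sketch. It pivots EAV rows into one flat record per patient and joins repeated concepts with "; ", mirroring the "Brain; Osseus" cell in the flat file example. The function name `eav_to_flat`, the tuple layout, and the sample rows are assumptions made for illustration; they are not taken from the ADOPT BBMRI-ERIC tools.

```python
from collections import defaultdict

# Hypothetical mock-up EAV rows: (patient_id, concept, value, instance).
eav_rows = [
    (3, "TNM-T", "T2", 1),
    (3, "Metastases", "Osseus", 1),
    (4, "TNM-T", "T1", 1),
    (4, "Metastases", "Brain", 1),
    (4, "Metastases", "Osseus", 2),
]


def eav_to_flat(rows, separator="; "):
    """Pivot EAV rows into one dict per patient, joining repeated concepts."""
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, concept, value, instance in rows:
        grouped[patient_id][concept].append((instance, value))

    flat = {}
    for patient_id, concepts in grouped.items():
        # Sort by instance number so repeated values keep a stable order.
        flat[patient_id] = {
            concept: separator.join(value for _, value in sorted(pairs))
            for concept, pairs in concepts.items()
        }
    return flat


flat = eav_to_flat(eav_rows)
print(flat[4]["Metastases"])  # Brain; Osseus
```

The instance column is what allows a concept such as Metastases to occur more than once per patient in the EAV format; the flat format has to pack those occurrences into a single cell.
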
Fig. 1 The extract-transform-load (ETL) pipeline as configured for the ADOPT BBMRI-ERIC project, shown for two exemplary biobanks, one contributing an Entity-Attribute-Value (EAV) file and the other a flat file. The files are extracted from the Biobank Information Management Systems (BIMS, left), processed by our tools into an XML file, and finally loaded into the OSSE system (right). CMF, central metadata definition file; LMF, local metadata definition file.
Fig. 2 Illustration of the bag-of-words algorithm in MDRMatcher. After normalization and semantic expansion, it compares all n-grams from the source and the target item and computes a similarity score.
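
The n-gram comparison described in this caption can be sketched roughly as follows. This is a minimal illustration, not MDRMatcher's implementation: it assumes word-level n-grams, uses a Dice coefficient as the similarity score, and leaves out semantic expansion entirely. The function names `normalize`, `word_ngrams`, and `similarity` are hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_ngrams(tokens, max_n=3):
    """Return the set of all word n-grams for n = 1..max_n."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def similarity(source, target, max_n=3):
    """Dice coefficient over the two n-gram sets, between 0.0 and 1.0."""
    a = word_ngrams(normalize(source), max_n)
    b = word_ngrams(normalize(target), max_n)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


print(similarity("TNM-T", "tnm t"))  # 1.0
```

With this scoring, "Date of Surgery" and "Surgery date" share only the two single-word n-grams, so word order and filler words lower the score. That is one reason a real matcher needs the normalization and expansion steps the caption mentions.
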
Fig. 3 The MappingGUI program, which is used to curate mappings between source and target terms and values.
Supported data type transformations, with source data types shown on the left and target data types on top
| From/To | Enumerated | Integer | Float | Boolean (True) | Boolean (False) | String | Date | DateTime |
|---|---|---|---|---|---|---|---|---|
| (rows not recoverable from the extracted text) | | | | | | | | |
Note: Italics indicates the type of handling, bold the return value. Blank indicates no transformation rule.
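
A table of this kind, with source types on one axis and target types on the other, maps naturally onto a lookup keyed by (source type, target type). The sketch below shows that idea only. Which pairs are actually supported and how each is handled is not recoverable here, so the three rules, the decimal-comma handling, and the names `CONVERTERS` and `transform` are all assumptions. The date format follows the mock-up data (for example 21.07.2013).

```python
from datetime import datetime


def to_integer(value):
    return int(value)


def to_float(value):
    # Accepting a decimal comma is an assumption, not a documented rule.
    return float(value.replace(",", "."))


def to_date(value):
    # Day.month.year, as in the mock-up tables.
    return datetime.strptime(value, "%d.%m.%Y").date()


# Hypothetical rule table: (source type, target type) -> converter.
CONVERTERS = {
    ("String", "Integer"): to_integer,
    ("String", "Float"): to_float,
    ("String", "Date"): to_date,
}


def transform(value, source_type, target_type):
    """Return (ok, result); ok is False if no rule exists or parsing fails."""
    if source_type == target_type:
        return True, value
    converter = CONVERTERS.get((source_type, target_type))
    if converter is None:
        return False, None
    try:
        return True, converter(value)
    except ValueError:
        return False, None


print(transform("21.07.2013", "String", "Date"))
```

Returning a success flag instead of raising keeps good and bad transformations countable per record, which is the kind of tally an ETL report needs.
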
Fig. 4 A four-axial classification scheme to assess the mapping quality of MDRMatcher.
Assessment of the automatic mapping quality for metadata items received
| Property | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 |
|---|---|---|---|---|---|---|---|
| Conceptual | Yes | No | Yes | Yes | Yes | No | Yes |
| Mapping | Yes | No | No | Yes | Yes | Yes | No |
| Correctness | Yes | Yes | No | No | No | No | No |
| Matching | Yes | No | Yes | Yes | No | No | No |
| Biobank 1 | 163 | 10 | 17 | 11 | 2 | 2 | 0 |
| Biobank 2 | 121 | 1 | 21 | 17 | 0 | 1 | 0 |
| Biobank 3 | 83 | 7 | 36 | 14 | 2 | 1 | 2 |
| Biobank 4 | 94 | 0 | 12 | 8 | 2 | 0 | 0 |
| Biobank 5 | 89 | 0 | 14 | 13 | 0 | 1 | 0 |
| Biobank 6 | 117 | 0 | 7 | 9 | 0 | 0 | 0 |
| Biobank 7 | 142 | 0 | 17 | 18 | 1 | 0 | 0 |
| Biobank 8 | 117 | 0 | 18 | 12 | 2 | 0 | 1 |
| Biobank 9 | 97 | 1 | 12 | 18 | 0 | 0 | 0 |
| Biobank 10 | 129 | 0 | 16 | 12 | 1 | 1 | 0 |
| Total | 1152 | 19 | 170 | 132 | 10 | 6 | 3 |
| Percent | 77.21 | 1.27 | 11.39 | 8.85 | 0.67 | 0.40 | 0.20 |
Note: The numbers are per source item, which comprises a concept and a value, e.g., “UICC_STAGE = Not known.”
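
The Percent row follows directly from the Total row. As a quick check, the short Python snippet below recomputes it; the variable names are chosen for illustration only.

```python
# Totals per class, copied from the "Total" row of the table.
totals = {1: 1152, 2: 19, 3: 170, 4: 132, 5: 10, 6: 6, 7: 3}

# Number of source items across all ten biobanks.
n_items = sum(totals.values())

# Share of each class in percent, rounded to two decimals.
percent = {cls: round(100 * count / n_items, 2) for cls, count in totals.items()}

print(n_items)
print(percent)
```

The sum of the class totals is 1,492, which agrees with the total number of different source items reported in the ETL results table.
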
Confusion matrix for the evaluation of MDRMatcher as an automatic mapper
Analysis of wrong behavior of MDRMatcher per target data element in classes 3 to 5 for metadata items
| Type of problem | Class 3 | Class 4 | Class 5 | Total |
|---|---|---|---|---|
| Missing synonyms | 22 | 22 | ||
| No down-ranking | 21 | 21 | ||
| | 3 | 6 | 4 | 13 |
| Wrong up-ranking due to bad fuzzy matches | 3 | 8 | 11 | |
| Unfavorable removal of redundant words | 3 | 8 | 11 | |
| Inability to compare Roman with Arabic number | 1 | 8 | 9 | |
| Use of different languages | 5 | 5 | ||
| Spelling errors or inconsistent naming | 5 | 5 | ||
| Unable to map something to “Other” | 2 | 2 | ||
| Other | 4 | 12 | 18 |
ETL results for the actual facts data received from 10 biobanks participating in the CCDC pilot as reported by ETLHelper
| # | Property | Biobank 1 | Biobank 2 | Biobank 3 | Biobank 4 | Biobank 5 | Biobank 6 | Biobank 7 | Biobank 8 | Biobank 9 | Biobank 10 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Patients | 1,066 | 218 | 50 | 55 | 300 | 600 | 300 | 308 | 218 | 300 | 3,415 |
| 2 | Data records | 50,020 | 8,756 | 2,067 | 2,477 | 11,984 | 22,617 | 13,225 | 13,729 | 9,160 | 13,573 | 147,608 |
| 3 | Different concepts | 53 | 47 | 49 | 47 | 47 | 43 | 49 | 52 | 40 | 56 | 483 |
| 4 | Different source items | 205 | 161 | 145 | 116 | 117 | 133 | 178 | 150 | 129 | 158 | 1,492 |
| 5 | Mappings between source and target items | 193 | 159 | 134 | 116 | 116 | 133 | 178 | 150 | 127 | 158 | 1,464 |
| 6 | Data records that should have a mapping | 50,020 | 8,756 | 2,067 | 2,477 | 11,984 | 22,617 | 11,325 | 13,729 | 9,160 | 13,573 | 145,708 |
| 7 | Data records that do have a mapping | 49,263 | 8,331 | 1,931 | 2,477 | 11,983 | 22,617 | 11,325 | 13,729 | 9,148 | 13,573 | 144,377 |
| 8 | Percentage of data records that have a mapping | 98.5% | 95.1% | 93.4% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 99.9% | 100.0% | 99.1% |
| 9 | Data records that don't have a mapping | | | | 0 | | 0 | 0 | 0 | | 0 | |
| 10 | Total number of transformations | 49,263 | 8,331 | 1,931 | 2,477 | 11,983 | 22,617 | 13,225 | 13,729 | 9,148 | 13,573 | 146,277 |
| 11 | | 35,786 | 6,544 | 1,383 | 1,907 | 9,436 | 19,350 | 10,294 | 10,798 | 7,622 | 10,234 | 113,354 |
| 12 | | 1,119 | | 100 | 28 | | | | 308 | 218 | 91 | 1,864 |
| 13 | | 7,093 | 815 | 230 | 300 | 995 | 2,568 | 2,140 | 1,627 | 436 | 1,851 | 18,055 |
| 14 | | | 102 | | | 300 | | | | 218 | | 620 |
| 15 | | 2,654 | 435 | 126 | 55 | 636 | 98 | 191 | 380 | 218 | 797 | 5,590 |
| 16 | | 2,132 | 388 | 92 | 110 | 652 | 600 | 600 | 616 | 436 | 600 | 6,226 |
| 17 | Total number of good transformations | 48,784 | 8,284 | 1,931 | 2,400 | 11,983 | 22,616 | 13,225 | 13,729 | 9,148 | 13,573 | 145,673 |
| 18 | | | 0 | 0 | | 0 | | 0 | 0 | 0 | 0 | |
| 19 | | 0 | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | |
| 20 | Total number of bad transformations | | | 0 | | 0 | | 0 | 0 | 0 | 0 | |
| 21 | Percentage of good transformations | 99.0% | 99.4% | 100.0% | 96.9% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 99.6% |
| 22 | Percentage of input data records that could be mapped and transformed | 97.5% | 94.6% | 93.4% | 96.9% | 100.0% | 100.0% | 100.0% | 100.0% | 99.9% | 100.0% | 98.7% |
Abbreviations: CCDC, Colon Cancer Data Collection; ETL, extract-transform-load.
Note: Rejected data records are printed in bold.
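
The percentage rows can be reproduced from the count rows of the ETL results table. The Python snippet below does this for the Total column. The formulas are inferred from the reported numbers rather than stated in the source: row 8 as row 7 divided by row 6, row 21 as row 17 divided by row 10, and row 22 as row 17 divided by row 2. Reading rows 9 and 20 as simple differences is likewise an assumption.

```python
# Counts copied from the "Total" column of the ETL results table.
data_records = 147_608          # row 2
should_have_mapping = 145_708   # row 6
do_have_mapping = 144_377       # row 7
transformations = 146_277       # row 10
good_transformations = 145_673  # row 17


def pct(part, whole):
    """Percentage rounded to one decimal, as printed in the table."""
    return round(100 * part / whole, 1)


mapped_pct = pct(do_have_mapping, should_have_mapping)      # compare row 8
good_pct = pct(good_transformations, transformations)       # compare row 21
end_to_end_pct = pct(good_transformations, data_records)    # compare row 22

# Assumed definitions of the rejected-record counts (rows 9 and 20).
unmapped = should_have_mapping - do_have_mapping
bad_transformations = transformations - good_transformations

print(mapped_pct, good_pct, end_to_end_pct)
print(unmapped, bad_transformations)
```

All three recomputed percentages agree with the reported totals of 99.1%, 99.6%, and 98.7%.
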