Robinette Renner, Shengyu Li, Yulong Huang, Ada Chaeli van der Zijp-Tan, Shaobo Tan, Dongqi Li, Mohan Vamsi Kasukurthi, Ryan Benton, Glen M Borchert, Jingshan Huang, Guoqian Jiang.
Abstract
BACKGROUND: The medical community uses a variety of data standards for both clinical and research reporting needs. ISO 11179 Common Data Elements (CDEs) represent one such standard that provides robust data point definitions. Another standard is the Biomedical Research Integrated Domain Group (BRIDG) model, which is a domain analysis model that provides a contextual framework for biomedical and clinical research data. Mapping the CDEs to the BRIDG model is important; in particular, it can facilitate mapping the CDEs to other standards. Unfortunately, manual mapping, which is the current method for creating the CDE mappings, is error-prone and time-consuming; this creates a significant barrier for researchers who utilize CDEs.
Keywords: Artificial neural network; Biomedical research integrated domain group (BRIDG) model; Common data element; Schema mapping
Year: 2019 PMID: 31865899 PMCID: PMC6927104 DOI: 10.1186/s12911-019-0979-5
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1. CDE structure. A CDE consists mainly of two parts: the Data Element Concept and the Value Domain.
Fig. 2. Example of CDE structure. The CDE structure for a data element capturing a patient’s specific type of Acute Myeloid Leukemia.
Fig. 3. Oncology subset of the BRIDG model [6]. This is the Oncology view of the BRIDG model.
Similarity of testing data to training data
| Testing Set | Number of CDEs | Semantic Similarity |
|---|---|---|
| Similar | 52 | 86.54% |
| Moderately Different | 220 | 68.64% |
| Different | 186 | 4.52% |
Fig. 4. Example of word similarity matrix. This is the similarity matrix for the question text of the CDE Acute Myeloid Leukemia Classification Type and the CDE Chronic Myelogenous Leukemia Classification Type. Their question texts are “What was the classification of the acute myelogenous leukemia?” and “What was the classification of the chronic myelogenous leukemia?”, respectively. After calculating the similarity between every pair of words and generating the word similarity matrix, we build the word similarity list by repeatedly taking the maximum similarity from the matrix. Each selected maximum is shown with a grey background. Once a maximum has been selected, its row and column are ignored and no longer take part in the sorting.
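The selection procedure described in the Fig. 4 caption can be sketched as a greedy loop over the matrix. This is a minimal illustration and not the authors' implementation: the function name `build_similarity_list` and the toy matrix values are assumptions, and the sketch takes the word-to-word similarities as given rather than computing them.

```python
def build_similarity_list(matrix):
    """Greedily pick the largest remaining similarity, then drop its row and column.

    `matrix[i][j]` is the similarity between word i of one question text and
    word j of the other. Returns the picked similarities in selection order.
    """
    if not matrix or not matrix[0]:
        return []
    free_rows = list(range(len(matrix)))
    free_cols = list(range(len(matrix[0])))
    picked = []
    while free_rows and free_cols:
        # Largest value among cells whose row and column are both still available.
        r, c = max(
            ((i, j) for i in free_rows for j in free_cols),
            key=lambda cell: matrix[cell[0]][cell[1]],
        )
        picked.append(matrix[r][c])
        # Ignore this row and this column from now on.
        free_rows.remove(r)
        free_cols.remove(c)
    return picked


# Toy values for illustration only (not taken from Fig. 4).
toy_matrix = [
    [1.0, 0.2, 0.1],
    [0.2, 0.6, 0.3],
    [0.1, 0.3, 1.0],
]
print(build_similarity_list(toy_matrix))
```

Because each word can be matched at most once, the resulting list has as many entries as the shorter of the two texts has words.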
Fig. 5. Neural network structure. The inputs are the similarities of six attributes. The output is the overall similarity between two CDEs.
Fig. 6. Training flow chart. The training flow demonstrates the process of training and recommendation.
Algorithm parameters
| Parameter | Description | Values Tested | Optimal Values |
|---|---|---|---|
| Training ratio | Ratio of training to verification data | 75% training and 25% verification; 90% training and 10% verification | 75% training and 25% verification |
| Training CDEs per BRIDG class | Determines the training list | 4–10 | 8 |
| Similarity threshold | Determines the threshold for considering two words to be similar | 0.6–1.0 | 0.7 or 0.8 |
Accuracy with different training-to-verification ratios
| Top n | Accuracy (training set : verification set = 3:1) | Accuracy (training set : verification set = 9:1) |
|---|---|---|
| 1 | 33.99% | 41.52% |
| 2 | 51.96% | 63.16% |
| 3 | 64.71% | 73.10% |
| 4 | 71.90% | 80.70% |
| 5 | 76.14% | 83.04% |
| 6 | 82.03% | 85.38% |
| 7 | 85.29% | 87.72% |
| 8 | 86.93% | 90.06% |
| 9 | 89.22% | 91.23% |
| 10 | 92.16% | 94.15% |
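A top-n accuracy of this kind can be computed as sketched below, assuming that "Top n" means the correct BRIDG class appears among the n highest-ranked recommendations for a CDE. The function name `top_n_accuracy` and the class labels `"A"`, `"B"`, `"C"` are hypothetical placeholders, not data from the study.

```python
def top_n_accuracy(ranked_recommendations, true_classes, n):
    """Fraction of CDEs whose true class is among the first n recommendations.

    ranked_recommendations: one list per CDE, best candidate first.
    true_classes: the correct class for each CDE, in the same order.
    """
    if len(ranked_recommendations) != len(true_classes):
        raise ValueError("one ranked list is required per true class")
    if not true_classes:
        raise ValueError("at least one CDE is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_recommendations, true_classes)
        if truth in ranked[:n]
    )
    return hits / len(true_classes)


# Hypothetical recommendations for three CDEs whose true class is "A".
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "A"]
for n in (1, 2, 3):
    print(n, top_n_accuracy(ranked, truth, n))
```

By construction the value cannot decrease as n grows, which matches the monotone increase in both columns of the table.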
Fig. 7. Best match rates per testing set. Bars show the best matching rate for testing sets with different degrees of semantic similarity to the training set: similar, moderately different, and different. The blue bars represent the case in which the testing set contains only CDEs mapped to BRIDG classes with sufficient training data. The orange bars represent the case in which the testing set contains all CDEs.
Example of attribute weights
| Weight Name | Weight (Verification - 1232) | Weight (Verification - 1504) |
|---|---|---|
| | 0.174620962 | 0.174501362 |
| | 0.156449029 | 0.155271219 |
| | 0.159148243 | 0.156401809 |
| | 0.159921116 | 0.164160267 |
| | 0.160018435 | 0.164215956 |
| | 0.189842216 | 0.185449386 |
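The six weights in each column sum to approximately 1, so one plausible reading is that the overall similarity is a weighted sum of the six attribute similarities. The sketch below illustrates that reading only. It is an assumption, not a confirmed description of the network in Fig. 5; the function name `overall_similarity` and the example attribute similarities are hypothetical, and the weights are those listed for Verification - 1232.

```python
# Weights for Verification - 1232, in table order (attribute names are not given here).
WEIGHTS_1232 = [
    0.174620962,
    0.156449029,
    0.159148243,
    0.159921116,
    0.160018435,
    0.189842216,
]


def overall_similarity(attribute_similarities, weights=WEIGHTS_1232):
    """Weighted sum of per-attribute similarities (assumed linear combination)."""
    if len(attribute_similarities) != len(weights):
        raise ValueError("one similarity is required per weight")
    return sum(w * s for w, s in zip(weights, attribute_similarities))


# Hypothetical attribute similarities between two CDEs.
example = [0.9, 0.5, 0.7, 1.0, 0.8, 0.6]
print(round(sum(WEIGHTS_1232), 6))
print(overall_similarity(example))
```

Under this reading, two CDEs that match perfectly on every attribute receive an overall similarity of about 1, and the near-uniform weights mean no single attribute dominates the score.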