Chao Pang, Dennis Hendriksen, Martijn Dijkstra, K Joeri van der Velde, Joel Kuiper, Hans L Hillege, Morris A Swertz.
Abstract
OBJECTIVE: Pooling data across biobanks is necessary to increase statistical power, reveal more subtle associations, and synergize the value of data sources. However, searching for desired data elements among the thousands of available elements and harmonizing differences in terminology, data collection, and structure is arduous and time-consuming.
Keywords: Biobank; Data integration; Harmonization; Search
Year: 2014 PMID: 25361575 PMCID: PMC4433361 DOI: 10.1136/amiajnl-2013-002577
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1: Harmonization process. Many studies need to pool data to reach sufficient statistical power; however, matching the data elements of interest to the available data elements is a daunting task.
Figure 2: Example of query expansion. ‘Parental diabetes mellitus’ is annotated with the ontology terms ‘Parental’ and ‘Diabetes mellitus.’ Each term is then expanded with its synonyms, giving three terms for ‘Diabetes mellitus’ and three terms for ‘Parental,’ so all 3 × 3 = 9 combinations are used for the search (only four are shown here).
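The expansion described in Figure 2 can be sketched as a Cartesian product over per-term synonym lists. This is only an illustration: the caption gives the counts (three synonyms per term) but not the synonyms themselves, so the lists below are hypothetical, as is the function name `expand_query`.

```python
from itertools import product

# Hypothetical synonym lists: the caption states three terms per ontology
# term but does not list them, so these strings are placeholders.
synonyms = {
    "Parental": ["parental", "parent", "father or mother"],
    "Diabetes mellitus": ["diabetes mellitus", "diabetes", "DM"],
}


def expand_query(synonyms_by_term):
    """Return every query string that takes one synonym per ontology term."""
    return [" ".join(combo) for combo in product(*synonyms_by_term.values())]


queries = expand_query(synonyms)
print(len(queries))  # 3 x 3 combinations
for query in queries[:4]:
    print(query)
```

Adding a synonym to either list multiplies the number of search strings, which is why the caption shows only a subset.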
Figure 3: Overview of BiobankConnect. Data elements of interest (target) are matched to all available data elements (source), based on knowledge from the ontology terms.
Figure 4: Matching results produced by BiobankConnect. (A) Matching data elements for ‘Parental diabetes mellitus’ in Prevend. The gold standard matches are two data elements, V57A_1 and V57B_1, located in the second and third positions. (B) The matching data element for ‘History of hypertension’ in the NCDS database. The best match in the experts’ opinion is ‘downhibp,’ located in the first position on the candidate list. CM, cohort member.
Precision and recall performance
| Rank | FINRISK P | FINRISK R | Hunt P | Hunt R | KORA P | KORA R | MICROS P | MICROS R | NCDS P | NCDS R | Total P | Total R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.91 | 0.50 | 0.61 | 0.16 | 0.88 | 0.53 | 0.73 | 0.27 | 0.59 | 0.17 | 0.75 | 0.28 |
| 2 | 0.68 | 0.72 | 0.65 | 0.34 | 0.67 | 0.79 | 0.53 | 0.37 | 0.48 | 0.27 | 0.60 | 0.44 |
| 3 | 0.57 | 0.88 | 0.59 | 0.46 | 0.48 | 0.83 | 0.45 | 0.46 | 0.37 | 0.30 | 0.49 | 0.52 |
| 4 | 0.45 | 0.90 | 0.53 | 0.55 | 0.40 | 0.89 | 0.39 | 0.52 | 0.31 | 0.33 | 0.42 | 0.58 |
| 5 | 0.39 | 0.95 | 0.47 | 0.60 | 0.34 | 0.92 | 0.33 | 0.56 | 0.27 | 0.36 | 0.36 | 0.62 |
| 6 | 0.34 | 0.97 | 0.42 | 0.64 | 0.31 | 0.96 | 0.30 | 0.61 | 0.25 | 0.39 | 0.32 | 0.65 |
| 7 | 0.29 | 0.97 | 0.39 | 0.69 | 0.27 | 0.96 | 0.27 | 0.63 | 0.23 | 0.41 | 0.29 | 0.68 |
| 8 | 0.26 | 0.97 | 0.37 | 0.73 | 0.25 | 0.98 | 0.25 | 0.67 | 0.21 | 0.44 | 0.27 | 0.71 |
| 9 | 0.23 | 0.97 | 0.35 | 0.77 | 0.24 | 1.00 | 0.24 | 0.68 | 0.19 | 0.44 | 0.25 | 0.72 |
| 10 | 0.22 | 0.98 | 0.33 | 0.81 | 0.22 | 1.00 | 0.22 | 0.70 | 0.17 | 0.44 | 0.23 | 0.74 |
| 11 | 0.20 | 0.98 | 0.31 | 0.82 | 0.21 | 1.00 | 0.21 | 0.71 | 0.16 | 0.44 | 0.22 | 0.75 |
| 12 | 0.19 | 0.98 | 0.29 | 0.83 | 0.20 | 1.00 | 0.20 | 0.72 | 0.15 | 0.45 | 0.21 | 0.75 |
| 13 | 0.18 | 0.98 | 0.27 | 0.84 | 0.19 | 1.00 | 0.19 | 0.74 | 0.14 | 0.46 | 0.20 | 0.76 |
| 14 | 0.17 | 0.98 | 0.25 | 0.84 | 0.18 | 1.00 | 0.19 | 0.77 | 0.14 | 0.47 | 0.19 | 0.77 |
| 15 | 0.16 | 0.98 | 0.24 | 0.85 | 0.17 | 1.00 | 0.19 | 0.79 | 0.13 | 0.49 | 0.18 | 0.78 |
| 16 | 0.15 | 0.98 | 0.23 | 0.86 | 0.16 | 1.00 | 0.18 | 0.82 | 0.13 | 0.50 | 0.17 | 0.80 |
| 17 | 0.14 | 0.98 | 0.22 | 0.86 | 0.16 | 1.00 | 0.18 | 0.84 | 0.13 | 0.51 | 0.17 | 0.79 |
| 18 | 0.14 | 0.98 | 0.21 | 0.87 | 0.15 | 1.00 | 0.18 | 0.85 | 0.12 | 0.51 | 0.15 | 0.81 |
| 19 | 0.13 | 0.98 | 0.20 | 0.87 | 0.14 | 1.00 | 0.17 | 0.87 | 0.12 | 0.52 | 0.16 | 0.81 |
| 20 | 0.13 | 0.98 | 0.19 | 0.88 | 0.14 | 1.00 | 0.17 | 0.87 | 0.11 | 0.53 | 0.14 | 0.82 |
| 30 | 0.09 | 0.98 | 0.13 | 0.91 | 0.11 | 1.00 | 0.14 | 0.93 | 0.08 | 0.57 | 0.11 | 0.85 |
| 50 | 0.06 | 0.98 | 0.09 | 0.94 | 0.10 | 1.00 | 0.11 | 0.96 | 0.06 | 0.64 | 0.08 | 0.88 |
Calculated per biobank and for total.
P, precision; R, recall.
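To make the rank cutoff in the table concrete, here is a minimal sketch of precision and recall at rank k for a single desired data element. It uses the standard definitions; how the per-biobank figures above were pooled across elements is not stated here, so the sketch makes no claim about that. The example list follows Figure 4A (gold standard matches V57A_1 and V57B_1 in the second and third positions); the other candidate names are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k candidates of one ranked list."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Candidate list shaped like Figure 4A. OTHER_1 and OTHER_2 are hypothetical
# names standing in for the non-matching candidates.
ranked = ["OTHER_1", "V57A_1", "V57B_1", "OTHER_2"]
relevant = {"V57A_1", "V57B_1"}

for k in (1, 2, 3):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"rank {k}: P={p:.2f} R={r:.2f}")
```

As k grows, recall can only stay level or rise while precision tends to fall, which is the trade-off visible down each column pair of the table.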
Figure 5: Receiver operating characteristic (ROC) curve. Matching performance for 32 data elements in five different biobanks. Note that BiobankConnect retrieves only a subset of data elements based on the semantic/lexical similarity queries; the ROC curves therefore end before reaching (1.00, 1.00). For the remaining data elements we simulated a line of non-discrimination, indicated by dotted lines.
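One way to read the dotted continuation in Figure 5 is as a straight segment from the last observed ROC point to (1.00, 1.00), which is what a random ordering of the unretrieved elements would produce on average. The sketch below assumes that reading; the ROC points are hypothetical and the function names are illustrative only.

```python
def extend_roc(points):
    """Anchor a partial ROC curve at (0, 0) and (1, 1).

    The final straight segment to (1, 1) stands in for the data elements
    that were never retrieved (assumed to be in random order).
    """
    pts = sorted(points)
    if not pts or pts[0] != (0.0, 0.0):
        pts.insert(0, (0.0, 0.0))
    if pts[-1] != (1.0, 1.0):
        pts.append((1.0, 1.0))
    return pts


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical (false positive rate, true positive rate) points for a curve
# that stops early because only a subset of elements was retrieved.
partial_curve = [(0.1, 0.6), (0.2, 0.8)]
full_curve = extend_roc(partial_curve)
print(full_curve)
print(auc(full_curve))
```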
Ranking performance
| Rank | P1 (using ontology) | Cumulative P1 | P2 (Lucene matching) | Cumulative P2 |
|---|---|---|---|---|
| 1 | 63.9% (n = 122) | 63.9% (n = 122) | 51.3% (n = 98) | 51.3% (n = 98) |
| 2 | 14.1% (n = 27) | 78.0% (n = 149) | 12.0% (n = 23) | 63.4% (n = 121) |
| 3 | 8.40% (n = 16) | 86.4% (n = 165) | 8.37% (n = 16) | 71.7% (n = 137) |
| 4 | 3.10% (n = 6) | 89.5% (n = 171) | 4.18% (n = 8) | 75.9% (n = 145) |
| 5 | 3.70% (n = 7) | 93.2% (n = 178) | 5.23% (n = 10) | 81.2% (n = 155) |
| 6 | 3.10% (n = 6) | 96.3% (n = 184) | 1.04% (n = 2) | 82.2% (n = 157) |
| 7 | 0.00% (n = 0) | 96.3% (n = 184) | 0.00% (n = 0) | 82.2% (n = 157) |
| 8 | 1.50% (n = 3) | 97.8% (n = 187) | 1.04% (n = 2) | 83.2% (n = 159) |
| 9 | 0.60% (n = 1) | 98.4% (n = 188) | 2.09% (n = 4) | 85.3% (n = 163) |
| 10 | 0.00% (n = 0) | 98.4% (n = 188) | 0.52% (n = 1) | 85.9% (n = 164) |
| >10 | 0.00% (n = 0) | 98.4% (n = 188) | 3.66% (n = 7) | 89.5% (n = 171) |
| Not found | 1.60% (n = 3) | | 10.5% (n = 20) | |
| Total | 100% (n = 191) | | 100% (n = 191) | |
P1 and P2 show the rank of the 191 expert-selected ‘best’ matches within the automatically produced lists of relevant matches, using ontology annotations of the desired data elements (P1) or Lucene matching only (P2). BiobankConnect predicted the ‘best’ match as first choice (rank 1) in 63.9% of cases and within the ‘top 10’ in 98.4% of cases.
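The cumulative column is a running sum of the per-rank counts divided by the 191 expert-selected matches. A short sketch using the P1 counts from the table (the rank 7 count is taken as 0, consistent with the cumulative count staying at 184):

```python
from itertools import accumulate

TOTAL_MATCHES = 191

# P1 counts of 'best' matches found at ranks 1 to 10, read from the table.
# Rank 7 is taken as 0 because the cumulative count does not change there.
p1_counts = [122, 27, 16, 6, 7, 6, 0, 3, 1, 0]

cumulative_counts = list(accumulate(p1_counts))
cumulative_percent = [100 * c / TOTAL_MATCHES for c in cumulative_counts]

for rank, (count, pct) in enumerate(
    zip(cumulative_counts, cumulative_percent), start=1
):
    print(f"rank {rank}: n = {count}, {pct:.1f}%")
```

Recomputed percentages can differ from the printed table in the last digit because of rounding.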
BiobankConnect reduces the number of data elements that need to be checked
| Biobank | R1 (via BiobankConnect) | R2 (string matching) | R3 (random search) |
|---|---|---|---|
| KORA (75) | 1.5 | 1.8 | 36 |
| MICROS (119) | 2.0 | 1.3 | 59 |
| FINRISK (223) | 1.5 | 1.9 | 111 |
| Hunt (353) | 2.5 | 4.1 | 174 |
| NCDS (516) | 1.2 | 1.8 | 260 |
| Prevend (6174) | 2.2 | 4.3 | 3109 |
| Average | 1.8 | 2.7 | 3730 |
| Missed elements | 3 | 20 | 0 |
R1, R2, and R3 show the average rank of the ‘best’ match when searching with BiobankConnect, with Lucene string matching only, and by random iteration, respectively.
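The average rank of the ‘best’ match and the count of missed elements can be computed from ranked candidate lists as sketched below. This assumes that missed elements are counted separately and left out of the average, which the table layout suggests but the text does not state. The first candidate list follows Figure 4B (‘downhibp’ in first position); every other name is hypothetical.

```python
def average_best_rank(ranked_lists, best_matches):
    """Return (average 1-based rank of the best match, number missed).

    Assumption: targets whose best match is absent from the candidate
    list count as missed and are excluded from the average.
    """
    ranks = []
    missed = 0
    for target, best in best_matches.items():
        candidates = ranked_lists.get(target, [])
        if best in candidates:
            ranks.append(candidates.index(best) + 1)
        else:
            missed += 1
    average = sum(ranks) / len(ranks) if ranks else None
    return average, missed


# 'downhibp' is from Figure 4B. All other names are hypothetical.
ranked_lists = {
    "History of hypertension": ["downhibp", "other_a", "other_b"],
    "Target B": ["other_c", "other_d", "best_b"],
    "Target C": ["other_e", "other_f"],
}
best_matches = {
    "History of hypertension": "downhibp",
    "Target B": "best_b",
    "Target C": "best_c",
}

average, missed = average_best_rank(ranked_lists, best_matches)
print(average, missed)
```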