Chao Pang1, David van Enckevort2, Mark de Haan2, Fleur Kelpin2, Jonathan Jetten2, Dennis Hendriksen2, Tommy de Boer2, Bart Charbon2, Erwin Winder2, K Joeri van der Velde2, Dany Doiron3, Isabel Fortier3, Hans Hillege4, Morris A Swertz1.
Abstract
MOTIVATION: While the size and number of biobanks, patient registries and other data collections are increasing, biomedical researchers still often need to pool data for statistical power, a task that requires time-intensive retrospective integration.
Year: 2016 PMID: 27153686 PMCID: PMC4937195 DOI: 10.1093/bioinformatics/btw155
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of the MOLGENIS/connect framework
Fig. 2. Example of the EMX data upload format. Data can be uploaded using Excel. Metadata describing the columns of each data sheet (i.e. ‘entity’) must be provided in a special ‘attributes’ sheet. Data values are stored in ordinary sheets (e.g. ‘patients’). The ‘categorical’ gender attribute and the ‘xref’ disease attribute refer to two other sheets, ‘genders’ and ‘diseases’ (omitted for readability)
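The sheet layout described in the Fig. 2 caption can be illustrated with a small in-memory sketch. This is not the MOLGENIS importer: the column names `name`, `entity`, `dataType` and `refEntity`, the assumption that every referenced sheet has an `id` column, and the helper `check_workbook` are all hypothetical and chosen only to mirror the caption.

```python
# Hypothetical stand-in for an EMX-style workbook: one entry per sheet,
# each sheet a list of rows. The 'attributes' sheet describes the columns
# of every other sheet ('entity').
workbook = {
    "attributes": [
        {"name": "id", "entity": "patients", "dataType": "string", "refEntity": None},
        {"name": "gender", "entity": "patients", "dataType": "categorical", "refEntity": "genders"},
        {"name": "disease", "entity": "patients", "dataType": "xref", "refEntity": "diseases"},
        {"name": "id", "entity": "genders", "dataType": "string", "refEntity": None},
        {"name": "id", "entity": "diseases", "dataType": "string", "refEntity": None},
    ],
    "patients": [
        {"id": "p1", "gender": "f", "disease": "d1"},
        {"id": "p2", "gender": "m", "disease": "d2"},
    ],
    "genders": [{"id": "f"}, {"id": "m"}],
    "diseases": [{"id": "d1"}, {"id": "d2"}],
}


def check_workbook(book):
    """Return a list of human-readable problems found in the workbook."""
    problems = []
    attributes = book["attributes"]
    for entity, rows in book.items():
        if entity == "attributes":
            continue
        # Attribute descriptions that belong to this data sheet, keyed by column name.
        described = {a["name"]: a for a in attributes if a["entity"] == entity}
        for row_number, row in enumerate(rows, start=1):
            for column, value in row.items():
                attr = described.get(column)
                if attr is None:
                    problems.append(f"{entity} row {row_number}: column '{column}' is not described")
                    continue
                # 'categorical' and 'xref' values must exist in the referenced sheet.
                if attr["dataType"] in ("categorical", "xref"):
                    valid_ids = {r["id"] for r in book.get(attr["refEntity"], [])}
                    if value not in valid_ids:
                        problems.append(
                            f"{entity} row {row_number}: '{value}' not found in '{attr['refEntity']}'"
                        )
    return problems
```

Calling `check_workbook(workbook)` on the sample above returns an empty list; a row with an undescribed column or an unknown gender code would be reported instead.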
Fig. 3. Example of algorithm generation for the target attribute BMI from the Prevend data source: (1) a transformation template is generated from the candidate matches (using Magma syntax), (2) the template is automatically edited based on unit conversion rules, if applicable, and (3) the software evaluates whether more complex algorithm templates can be used. Based on two good candidate matches and the desired ‘BMI’ target, a previously used BMI conversion algorithm is proposed that incorporates the unit conversion rules (e.g. from ‘cm’ to ‘m’ because BMI is recorded in the composite unit kg/m²)
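The unit-conversion step mentioned in the Fig. 3 caption can be sketched in a few lines. The sketch below is written in Python rather than Magma, and the conversion table `UNIT_TO_BASE` and the functions `convert` and `bmi` are hypothetical names introduced for illustration; it only shows why a height recorded in ‘cm’ must be rescaled to ‘m’ before the BMI formula is applied.

```python
# Hypothetical conversion table: unit -> (base unit, factor to reach the base unit).
UNIT_TO_BASE = {
    "kg": ("kg", 1.0),
    "g": ("kg", 0.001),
    "m": ("m", 1.0),
    "cm": ("m", 0.01),
}


def convert(value, from_unit, to_unit):
    """Convert a value between two units that share the same base unit."""
    from_base, from_factor = UNIT_TO_BASE[from_unit]
    to_base, to_factor = UNIT_TO_BASE[to_unit]
    if from_base != to_base:
        raise ValueError(f"cannot convert {from_unit} to {to_unit}")
    return value * from_factor / to_factor


def bmi(weight, weight_unit, height, height_unit):
    """Compute BMI in kg/m^2 after normalising the source units."""
    weight_kg = convert(weight, weight_unit, "kg")
    height_m = convert(height, height_unit, "m")
    return weight_kg / height_m ** 2
```

For a source that stores weight in kilograms and height in centimetres, `bmi(80, "kg", 180, "cm")` rescales the height to 1.8 m before dividing, which gives roughly 24.7 kg/m².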
Fig. 4.Mapping project overview. The attributes of the target DataSchema are shown on the left of the table. The columns contain matching attributes from each of the sources. New source data can be added by clicking the ‘+Add source’ button. Attribute matches and conversion algorithms are automatically generated and colour coded to indicate if the algorithms are generated with high confidence (perfect match in semantic search) or low quality (partial match in semantic search) or to indicate if an algorithm has been curated by the user
Summary of the quality measures of algorithm generator and semantic search (in percentages)
Cells are colour-coded to represent the amount of human input (manual work) required to fix the matching, with green being the easiest and red being the most difficult (Please see the online article at http://bioinformatics.oxfordjournals.org/ for the colour-coded table).
Quality measures of algorithm generator and semantic search in percentages, grouped by attribute topic
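| Topic (number of target attributes) | Algorithm generator: Perfect (%) | Algorithm generator: Good (%) | Algorithm generator: Bad (%) | Semantic search: Perfect (%) | Semantic search: Good (%) | Semantic search: Bad (%) |
|---|---|---|---|---|---|---|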
| Diet (10) | 50 | 40 | 10 | 70 | 30 | 0 |
| Disease (14) | 86 | 14 | 0 | 71 | 29 | 0 |
| Drink (8) | 0 | 38 | 63 | 50 | 38 | 13 |
| Education (17) | 0 | 82 | 18 | 65 | 35 | 0 |
| Food (42) | 88 | 5 | 7 | 14 | 33 | 52 |
| General (18) | 28 | 50 | 22 | 50 | 11 | 39 |
| Job (8) | 0 | 100 | 0 | 25 | 0 | 75 |
| Measurement (42) | 62 | 17 | 21 | 74 | 10 | 17 |
| Medication (11) | 0 | 36 | 64 | 27 | 36 | 36 |
| Smoking (14) | 14 | 21 | 64 | 14 | 57 | 29 |
| Total (184) | 47 | 30 | 22 | 46 | 26 | 28 |
The numbers between brackets indicate the number of target attributes.
Fig. 5. Scatter plot visualizing the success rates of the algorithm generator and semantic search per attribute domain. The X-axis and Y-axis represent the ‘useful algorithm’ category (the generated algorithm is correct or partially correct) and the ‘useful search’ category (the matched source attributes fall within the top 20 of the suggested list) of the algorithm generator and semantic search in Table 2. The numbers in parentheses are the number of attributes for the corresponding topics
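The two plotted quantities can be derived from the per-topic percentages in the table above. The sketch below assumes that ‘useful’ means the sum of the ‘Perfect’ and ‘Good’ columns for both the algorithm generator and the semantic search; that reading of the caption is an assumption, and `TABLE_ROWS` and `useful_rates` are hypothetical names. Only a few rows are copied from the table for brevity.

```python
# A few rows copied from the table: topic -> (algorithm perfect, good, bad,
# search perfect, good, bad), all in percent.
TABLE_ROWS = {
    "Diet": (50, 40, 10, 70, 30, 0),
    "Disease": (86, 14, 0, 71, 29, 0),
    "Food": (88, 5, 7, 14, 33, 52),
    "Job": (0, 100, 0, 25, 0, 75),
}


def useful_rates(rows):
    """Return topic -> (useful algorithm %, useful search %).

    Assumption: 'useful' is taken to be 'Perfect' plus 'Good'.
    """
    rates = {}
    for topic, (a_perfect, a_good, _a_bad, s_perfect, s_good, _s_bad) in rows.items():
        rates[topic] = (a_perfect + a_good, s_perfect + s_good)
    return rates


if __name__ == "__main__":
    for topic, (algorithm, search) in useful_rates(TABLE_ROWS).items():
        print(f"{topic}: useful algorithm {algorithm}%, useful search {search}%")
```

Under this assumption, a topic such as Job lands at a high ‘useful algorithm’ value but a low ‘useful search’ value, which is the kind of contrast the scatter plot is meant to expose.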