Naveen Ashish, Peehoo Dewan, Arthur W Toga.
Abstract
This work is focused on mapping biomedical datasets to a common representation, as an integral part of data harmonization for integrated biomedical data access and sharing. We present GEM, an intelligent software assistant for automated data mapping across different datasets or from a dataset to a common data model. The GEM system automates data mapping by providing precise suggestions for data element mappings. It leverages the detailed metadata about elements in associated dataset documentation such as data dictionaries that are typically available with biomedical datasets. It employs unsupervised text mining techniques to determine similarity between data elements and also employs machine-learning classifiers to identify element matches. It further provides an active-learning capability where the process of training the GEM system is optimized. Our experimental evaluations show that the GEM system provides highly accurate data mappings (over 90% accuracy) for real datasets of thousands of data elements each, in the Alzheimer's disease research domain. Further, the effort in training the system for new datasets is also optimized. We are currently employing the GEM system to map Alzheimer's disease datasets from around the globe into a common representation, as part of a global Alzheimer's disease integrated data sharing and analysis network called GAAIN. GEM achieves significantly higher data mapping accuracy for biomedical datasets compared to other state-of-the-art tools for database schema matching that have similar functionality. With the use of active-learning capabilities, the user effort in training the system is minimal.
Keywords: active learning; common data model; data harmonization; data mapping; machine learning
Year: 2016 PMID: 26793094 PMCID: PMC4710756 DOI: 10.3389/fninf.2015.00030
Source DB: PubMed Journal: Front Neuroinform ISSN: 1662-5196 Impact factor: 4.081
Figure 1. Data element mapping.
Figure 2. Element information in data dictionary.
Figure 3. GEM architecture.
Set of features for classification.
| Category | Feature [short name] | Description | Values |
|---|---|---|---|
| Description similarity | TFIDF | The TFIDF similarity between the text descriptions of the two data elements, taken from their respective data dictionaries. | A (real) value in the range 0.0–1.0 |
| Description similarity | Topic model based text similarity score [Topic score] | A topic model is built from the column descriptions of all the data elements of the two sources. The score is the cosine similarity of the topic distributions of the two data elements. | A (real) value in the range 0.0–1.0 |
| Description similarity | TFIDF rank [Tfidf rank] | The (ordinal) rank based on the TFIDF text match score. | Integer, with 1 denoting the top (best) match |
| Description similarity | Topic model rank [Topic rank] | The rank based on the topic model based text match score. | Integer, with 1 denoting the top match |
| Description similarity | Edit distance [Edit distance] | A word-based edit distance between the text descriptions of the two elements. | Integer |
| Element names | Name match applicable [Name match] | Whether a name match score is applicable for a given element pair. | Binary, Y or N |
| Element names | Name match score [Name score] | A name match score provided by an element name matching classifier module. | A (real) value in the range 0.0–1.0 |
| Metadata constraints | Cardinality [Source/Target cardinality] | The number of possible data values for an element (if discrete). | Integer ≥ 1 |
| Metadata constraints | Range [Source/Target min/max] | The numeric range (if applicable). | Real numbers for the min and max of the range |
| Other | Table correspondence score [Table score] | A score that captures how well the tables containing the two elements correspond to each other. | A (real) value in the range 0.0–1.0 |
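Three of the description-based features above (the TFIDF score, the TFIDF rank, and the word-based edit distance) can be illustrated with a short Python sketch. This is not GEM's implementation: the tokenizer, the smoothed IDF formula, and the function names `tfidf_vectors`, `cosine`, `word_edit_distance`, and `rank_targets` are all assumptions made for illustration, and the example descriptions are hypothetical.

```python
import math
from collections import Counter

_PUNCT = ".,;:()[]"


def tokenize(text):
    """Lowercase, split on whitespace, strip simple punctuation (an assumed tokenizer)."""
    words = (w.strip(_PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption): terms found in every document keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors, clamped to [0.0, 1.0]."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return min(1.0, dot / norm) if norm else 0.0


def word_edit_distance(a, b):
    """Levenshtein distance where the unit of edit is a whole word."""
    x, y = tokenize(a), tokenize(b)
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wx != wy)))
        prev = cur
    return prev[-1]


def rank_targets(source_desc, target_descs):
    """Return (target_index, score, rank) tuples, best match first (rank 1)."""
    vecs = tfidf_vectors([source_desc] + list(target_descs))
    scores = [cosine(vecs[0], v) for v in vecs[1:]]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i], rank) for rank, i in enumerate(order, 1)]


# Hypothetical data dictionary descriptions, for illustration only.
source = "Age of the subject at baseline visit"
targets = [
    "Subject age at baseline",
    "Mini mental state examination total score",
    "Gender of the subject",
]
for index, score, rank in rank_targets(source, targets):
    print(rank, round(score, 3), targets[index])
```

In a setup like this, the score and rank for each candidate pair would be two of several inputs to a classifier, alongside the name, metadata, and table features listed in the table.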
Figure 4. Training data for GEM.
Figure 5. GEM data mapping and active learning interface. (A) GEM UI, (B) Process.
Figure 6. Mapping accuracy. (A) Mapping accuracy, (B) Classifiers.
Figure 7. Feature relevance.
Impact of active-learning on user effort.
| 10 | 342 | 34 | 213 | 21 | 2500 | 250 | 0.79 |
| 20 | 691 | 35 | 431 | 22 | 5000 | 250 | 0.84 |
| 30 | 1004 | 35 | 674 | 22 | 7500 | 250 | 0.91 |
Based on passive learning effort estimation.