Tian Mi, Sanguthevar Rajasekaran, Robert Aseltine.
Abstract
BACKGROUND: Recent large-scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms proposed in the literature are adept at integrating records from two different data sets. Our algorithms are aimed at integrating multiple (in particular, more than two) data sets efficiently.
Year: 2012 PMID: 22741525 PMCID: PMC3439324 DOI: 10.1186/1472-6947-12-59
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Experimental results on simulated data sets (constant threshold)
| | Size | 1,000 | | | 5,000 | | | 10,000 | | |
| | BIA | 99.4% | 98.8% | 14702 | 98.5% | 97.0% | 342411 | - | - | - |
| | PCD | 99.4% | 98.8% | 12422 | 98.5% | 97.0% | 334583 | - | - | - |
| | IDS | 99.4% | 98.8% | 11562 | 98.5% | 97.0% | 291162 | 99.7% | 99.3% | 1200810 |
| | IDS(FCED) | 99.4% | 98.8% | 6031 | 98.5% | 97.0% | 164307 | 99.7% | 99.3% | 693665 |
| | TPA | 92.2% | 78.3% | 406 | 90.7% | 69.9% | 7703 | 97.7% | 91.8% | 39843 |
| | TPA(FCED) | 92.2% | 78.3% | 265 | 90.7% | 69.9% | 5266 | 97.7% | 91.8% | 21640 |
| | BIA | 99.4% | 98.8% | 14453 | 98.3% | 97.0% | 354332 | - | - | - |
| | PCD | 99.4% | 98.8% | 13812 | 98.3% | 97.0% | 360613 | - | - | - |
| | IDS | 99.4% | 98.8% | 13859 | 98.3% | 97.0% | 317052 | 99.6% | 99.4% | 1351910 |
| | IDS(FCED) | 99.4% | 98.8% | 9016 | 98.3% | 97.0% | 204071 | 99.6% | 99.4% | 861785 |
| | TPA | 92.2% | 78.3% | 469 | 90.7% | 70.0% | 9484 | 97.7% | 91.8% | 42436 |
| | TPA(FCED) | 92.2% | 78.3% | 359 | 90.7% | 70.0% | 6140 | 97.7% | 91.8% | 27155 |
| | BIA | 98.8% | 98.8% | 11000 | 96.9% | 96.4% | 301960 | - | - | - |
| | PCD | 98.8% | 98.8% | 11234 | 96.9% | 96.4% | 299756 | - | - | - |
| | IDS | 98.8% | 98.8% | 10547 | 96.9% | 96.4% | 254805 | 98.8% | 98.9% | 1046654 |
| | IDS(FCED) | 98.8% | 98.8% | 5390 | 96.9% | 96.4% | 145604 | 98.8% | 98.9% | 587013 |
| | TPA | 92.4% | 78.6% | 391 | 90.7% | 69.9% | 7516 | 97.7% | 91.8% | 33499 |
| | TPA(FCED) | 92.4% | 78.6% | 235 | 90.7% | 69.9% | 4313 | 97.7% | 91.8% | 20312 |
Experimental results on simulated data sets (proportional threshold)
| | Size | 1,000 | | | 5,000 | | | 10,000 | | |
| | BIA | 98.4% | 96.9% | 14593 | 97.7% | 95.4% | 345880 | - | - | - |
| | PCD | 98.4% | 96.9% | 13515 | 97.7% | 95.4% | 345645 | - | - | - |
| | IDS | 98.4% | 96.9% | 11422 | 97.7% | 95.4% | 298069 | 99.5% | 99.0% | 1225476 |
| | IDS(FCED) | 98.4% | 96.9% | 7125 | 97.7% | 95.4% | 173932 | 99.5% | 99.0% | 673650 |
| | TPA | 91.8% | 77.7% | 515 | 90.4% | 69.5% | 9203 | 97.6% | 91.7% | 38499 |
| | TPA(FCED) | 91.8% | 77.7% | 281 | 90.4% | 69.5% | 4500 | 97.6% | 91.7% | 23374 |
| | BIA | 98.4% | 96.9% | 14547 | 97.8% | 95.6% | 384222 | - | - | - |
| | PCD | 98.4% | 96.9% | 14156 | 97.8% | 95.6% | 381191 | - | - | - |
| | IDS | 98.4% | 96.9% | 13671 | 44.6% | 99.6% | 343927 | 99.6% | 99.1% | 1416142 |
| | IDS(FCED) | 98.4% | 96.9% | 9078 | 44.6% | 99.6% | 222305 | 99.6% | 99.1% | 884472 |
| | TPA | 91.8% | 77.7% | 485 | 42.5% | 90.5% | 10140 | 97.6% | 91.7% | 45436 |
| | TPA(FCED) | 91.8% | 77.7% | 344 | 42.5% | 90.5% | 6484 | 97.6% | 91.7% | 29030 |
| | BIA | 98.6% | 97.2% | 11890 | 97.8% | 95.5% | 314115 | - | - | - |
| | PCD | 98.6% | 97.2% | 12046 | 97.8% | 95.5% | 313006 | - | - | - |
| | IDS | 98.6% | 97.2% | 11031 | 97.8% | 96.0% | 272085 | 99.6% | 99.1% | 1083059 |
| | IDS(FCED) | 98.6% | 97.2% | 5937 | 97.8% | 96.0% | 165495 | 99.6% | 99.1% | 610262 |
| | TPA | 91.8% | 77.7% | 250 | 90.1% | 70.0% | 7843 | 97.6% | 91.7% | 32827 |
| | TPA(FCED) | 91.8% | 77.7% | 171 | 90.1% | 70.0% | 4297 | 97.6% | 91.7% | 21046 |
Accuracies in 5-fold cross validation on picking up the threshold
| constant = 30 | 99.3% | 99.3% | 99.7% | 99.6% | 97.3% |
| proportion = 0.35 | 99.2% | 99.1% | 99.4% | 99.0% | 97.0% |
Experimental results on real data sets (N = 1,083,878)
| constant | t = 1 | 1:52:41 | 0:27:29 | 94,381 | 87,756 | 108,800 | 93.0% | 80.7% |
| | t = 1 | 3:11:17 | 0:29:33 | 101,864 | 99,562 | 108,800 | 97.8% | 91.6% |
| | - | 1:06:04 | 1:04:13 | 90,950 | 83,270 | 108,800 | 91.6% | 76.5% |
| | t = 1 | 2:04:09 | 1:06:04 | 101,344 | 99,711 | 108,800 | 98.4% | 91.6% |
| proportional | t = 0.1 | 1:55:24 | 0:30:56 | 94,521 | 87,966 | 108,800 | 93.1% | 80.9% |
| | t = 0.1 | 3:14:37 | 0:44:05 | 101,254 | 99,346 | 108,800 | 98.1% | 91.3% |
| | - | 1:04:32 | 1:05:41 | 90,950 | 83,270 | 108,800 | 91.6% | 76.5% |
| | t = 0.1 | 2:06:16 | 1:09:02 | 100,896 | 98,949 | 108,800 | 98.1% | 90.9% |
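The two percentages in each row appear to follow from the three counts in the same row. The sketch below assumes (the column headers are not preserved here, so this is an interpretation rather than the authors' stated definition) that the counts are, in order, the number of linked records, the number of correctly linked records, and the number of true matches, and that the percentages are accuracy and completeness as named in the Figure 1 caption. The function name `linkage_rates` is hypothetical.

```python
def linkage_rates(linked: int, correct: int, total_true: int) -> tuple[float, float]:
    """Return (accuracy %, completeness %) rounded to one decimal place.

    Assumed definitions (not confirmed by the table itself):
      accuracy     = correct / linked
      completeness = correct / total_true
    """
    accuracy = 100.0 * correct / linked
    completeness = 100.0 * correct / total_true
    return round(accuracy, 1), round(completeness, 1)


# Counts taken from the first row of the real-data table.
first_row = linkage_rates(linked=94_381, correct=87_756, total_true=108_800)
print(first_row)  # (93.0, 80.7)
```

Under these assumptions the first row reproduces the reported 93.0% and 80.7%.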
Four-category analysis on real data sets (N = 1,083,878)
| constant | t = 1 | 93.0% | 2.2% | 0.0% | 4.8% |
| | t = 1 | 97.7% | 2.1% | 0.0% | 0.2% |
| | - | 91.6% | 1.7% | 0.0% | 6.7% |
| | t = 1 | 98.4% | 1.3% | 0.0% | 0.3% |
| proportional | t = 0.1 | 93.1% | 2.2% | 0.0% | 4.7% |
| | t = 0.1 | 98.1% | 0.1% | 0.0% | 0.4% |
| | - | 91.6% | 1.7% | 0.0% | 6.7% |
| | t = 0.1 | 98.1% | 1.3% | 0.0% | 0.6% |
Performance comparison with FEBRL
| Size | 1000 | | 2000 | | 3000 | | |
| Our Algorithms | 100.0% | 766 | 100.0% | 3766 | 100.0% | 8735 | |
| | 99.0% | 2125 | 100.0% | 11171 | 100.0% | 15922 | |
| | 99.0% | 2563 | 98.2% | 9172 | 97.7% | 20391 | |
| | 100.0% | 187 | 100.0% | 250 | 100.0% | 469 | |
| | 100.0% | 234 | 100.0% | 453 | 100.0% | 828 | |
| | 99.2% | 203 | 98.4% | 516 | 98.0% | 1047 | |
| FEBRL | 100.0% | 40438 | 100.0% | 173597 | - | >15 min | No blocking |
| | 100.0% | 1284 | 100.0% | 2284 | 100.0% | 3265 s | With blocking |
Figure 1. Relationship between thresholds and accuracy/completeness. (A) EDname with constant thresholds; (B) RDED with constant thresholds; (C) NDED with constant thresholds; (D) EDname with proportional thresholds; (E) RDED with proportional thresholds; (F) NDED with proportional thresholds.
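The difference between a constant and a proportional threshold can be illustrated with a minimal sketch. It assumes that two records are compared by the edit (Levenshtein) distance between name strings, that a constant threshold is a fixed maximum distance, and that a proportional threshold is a fraction of the longer string's length. These definitions, and the names `edit_distance` and `is_match`, are assumptions made for illustration and are not taken from the tables or the caption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_match(a: str, b: str, threshold: float, proportional: bool = False) -> bool:
    """Decide whether two strings are close enough to be linked.

    Assumption: a proportional threshold scales with the longer string.
    """
    limit = threshold * max(len(a), len(b)) if proportional else threshold
    return edit_distance(a, b) <= limit


# One substitution separates the two spellings.
constant_result = is_match("jonathan", "jonathon", threshold=1)
proportional_result = is_match("jonathan", "jonathon", threshold=0.1, proportional=True)
print(constant_result, proportional_result)  # True False
```

With a constant threshold of 1 the pair is linked, while a proportional threshold of 0.1 allows only 0.8 edits for an eight-character string and rejects it. This shows one way a threshold choice could trade accuracy against completeness.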