Abdullah-Al Mamun1, Robert Aseltine2, Sanguthevar Rajasekaran1.
Abstract
BACKGROUND: Record linkage integrates records across multiple related data sources, identifying duplicates and accounting for possible errors. Real-life applications require efficient algorithms that merge these voluminous data sources and find all records belonging to the same individual. Our recently devised, highly efficient record linkage algorithms provide the best-known solutions to this challenging problem.
Year: 2015 PMID: 25942687 PMCID: PMC4420456 DOI: 10.1371/journal.pone.0124449
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Web-based user interface.
(A) shows the first and main page of the website, where users select data files, choose configurations, and submit them. (B) is the instruction page. After logging in (C), users can view their submission history. (D) shows a sample submission history page.
Records for 5 people having 9 attributes.
| ID | FN | LN | SSN | DoB | G | SchID | MN | SSID |
|---|---|---|---|---|---|---|---|---|
| 1 | Risa | Pierce | 133183594 | 09261990 | M | 1524 | Vesta | 0676221410 |
| 2 | Maile | Kramer | 135370878 | 07261991 | F | 1526 | Lenna | 0957261480 |
| 3 | Kimberly | Battle | 141274186 | 04071982 | F | 1527 | Jacki | 0144591609 |
| 4 | Kamal | Mcclain | 148965694 | 10091991 | M | 70000 | Luisa | 0278635088 |
| 5 | Yvonne | Vaughan | 153614228 | 02061992 | F | 70003 | Basil | 0368901550 |
Each row of the table corresponds to one row of the file Input01.csv.
Records for 5 people having 4 attributes.
| ID | First Initial | Last Name | Social Security Number |
|---|---|---|---|
| 1 | R | Pierce | 133183594 |
| 2 | M | Kramer | 135370878 |
| 3 | K | Battle | 141274186 |
| 4 | K | Mcclain | 148965694 |
| 8 | L | MUELLER | 184498846 |
Each row of the table corresponds to one row of the file Input02.csv.
Records for 5 people having 8 attributes.
| ID | FirstName | LastName | DateOfBirth | Gender | SchID | MN | SSID |
|---|---|---|---|---|---|---|---|
| 1 | RISA | PIERCE | 09261990 | M | 001524 | VESTA | 0676221410 |
| 2 | MAILE | KRAMER | 07261991 | F | 001526 | LENNA | 0957261480 |
| 3 | KIMBERLY | BATTLE | 04071982 | F | 001527 | JACKI | 0144591609 |
| 5 | YVONNE | VAUGHAN | 02061992 | F | 070003 | BASIL | 0368901550 |
| 8 | KELSIE | MUELLER | 01131992 | M | 070020 | JAKE | 7243583370 |
Each row of the table corresponds to one row of the file Input03.csv.
Fig 2. Screenshot of input parameter selection for our 3 example files.
Fig 3. Screenshot of linkage criteria for our 3 example files.
Generated output for our example data sets.
| Cluster ID | File Name | ID | First Name | Last Name | SSN | Date Of Birth | Gender | School ID | Middle Name | SSID |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Input02.csv | 1 | r | pierce | 133183594 | | | | | |
| 1 | Input01.csv | 1 | risa | pierce | 133183594 | 09261990 | m | 1524 | vesta | 0676221410 |
| 1 | Input03.csv | 1 | risa | pierce | | 09261990 | m | 1524 | vesta | 0676221410 |
| 2 | Input02.csv | 2 | m | kramer | 135370878 | | | | | |
| 2 | Input01.csv | 2 | maile | kramer | 135370878 | 07261991 | f | 1526 | lenna | 0957261480 |
| 2 | Input03.csv | 2 | maile | kramer | | 07261991 | f | 1526 | lenna | 0957261480 |
| 3 | Input02.csv | 3 | k | battle | 141274186 | | | | | |
| 3 | Input01.csv | 3 | kimberly | battle | 141274186 | 04071982 | f | 1527 | jacki | 0144591609 |
| 3 | Input03.csv | 3 | kimberly | battle | | 04071982 | f | 1527 | jacki | 0144591609 |
| 4 | Input02.csv | 4 | k | mcclain | 148965694 | | | | | |
| 4 | Input01.csv | 4 | kamal | mcclain | 148965694 | 10091991 | m | 70000 | luisa | 0278635088 |
| 5 | Input01.csv | 5 | yvonne | vaughan | 153614228 | 02061992 | f | 70003 | basil | 0368901550 |
| 5 | Input03.csv | 5 | yvonne | vaughan | | 02061992 | f | 70003 | basil | 0368901550 |
| 6 | Input03.csv | 8 | kelsie | mueller | | 01131992 | m | 70020 | jake | 7243583370 |
| 6 | Input02.csv | 8 | l | mueller | 184498846 | | | | | |
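The output table groups records from different files under one cluster ID even when two of them share no identifier directly: Input02.csv carries an SSN but no SSID, and Input03.csv carries an SSID but no SSN. The following is a minimal sketch of how such transitive grouping can work. It is not the tool's own algorithm. The matching rule (exact equality on `SSN` or `SSID`) and the names `normalize`, `DisjointSet`, and `link` are assumptions made for illustration, and the normalization (lowercasing, stripping leading zeros from `SchID`) is inferred only from the differences between the input and output tables. The actual criteria are those configured in Fig 3.

```python
from collections import defaultdict


def normalize(attrs):
    """Lowercase all values and strip leading zeros from SchID (inferred rule)."""
    out = {key: value.strip().lower() for key, value in attrs.items() if value}
    if "SchID" in out:
        out["SchID"] = out["SchID"].lstrip("0") or "0"
    return out


class DisjointSet:
    """Union-find structure used to merge records into clusters."""

    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def link(records, keys=("SSN", "SSID")):
    """Cluster records that share an exact value on any key (hypothetical rule)."""
    cleaned = [normalize(attrs) for _, _, attrs in records]
    sets = DisjointSet(len(records))
    for key in keys:
        first_seen = {}  # value -> index of the first record carrying it
        for index, attrs in enumerate(cleaned):
            value = attrs.get(key)
            if value is None:
                continue
            if value in first_seen:
                sets.union(first_seen[value], index)
            else:
                first_seen[value] = index
    clusters = defaultdict(list)
    for index, (file_name, record_id, _) in enumerate(records):
        clusters[sets.find(index)].append((file_name, record_id))
    return sorted(clusters.values())


# A subset of the example records: (file name, ID, attributes).
sample = [
    ("Input01.csv", "1", {"FN": "Risa", "LN": "Pierce", "SSN": "133183594",
                          "SchID": "1524", "SSID": "0676221410"}),
    ("Input02.csv", "1", {"FN": "R", "LN": "Pierce", "SSN": "133183594"}),
    ("Input03.csv", "1", {"FN": "RISA", "LN": "PIERCE",
                          "SchID": "001524", "SSID": "0676221410"}),
    ("Input01.csv", "4", {"FN": "Kamal", "LN": "Mcclain", "SSN": "148965694",
                          "SchID": "70000", "SSID": "0278635088"}),
    ("Input02.csv", "4", {"FN": "K", "LN": "Mcclain", "SSN": "148965694"}),
    ("Input03.csv", "5", {"FN": "YVONNE", "LN": "VAUGHAN",
                          "SchID": "070003", "SSID": "0368901550"}),
]

clusters = link(sample)
for cluster in clusters:
    print(cluster)
```

With exact matching, the Input02.csv and Input03.csv records for ID 1 end up in the same cluster only because each matches the Input01.csv record. This sketch does not tolerate typos or differing initials, so it should not be expected to reproduce every cluster in the table.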
Time comparison of RLT-S with FEBRL, FRIL, and TPA (FCED).
| Tool Name | (1000, 1000) | (2000, 2000) | (3000, 3000) | (4000, 4000) | (5000, 5000) |
|---|---|---|---|---|---|
| RLT-S | 95 | 110 | 142 | 212 | 237 |
| FEBRL | 330 | 834 | 1630 | 2770 | 4150 |
| FRIL | 841 | 1992 | 3555 | 6043 | 8683 |
| TPA (FCED) | 172 | 223 | 274 | 360 | 433 |
Times shown are in milliseconds. Each column header gives (number of records in the first file, number of records in the second file).
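To make the comparison easier to read, the timings can be converted into ratios relative to RLT-S. The short sketch below does only that arithmetic on the values in the table. The names `times_ms` and `speedups` are illustrative and not part of any tool.

```python
# Timings in milliseconds, copied from the table above.
times_ms = {
    "RLT-S": [95, 110, 142, 212, 237],
    "FEBRL": [330, 834, 1630, 2770, 4150],
    "FRIL": [841, 1992, 3555, 6043, 8683],
    "TPA (FCED)": [172, 223, 274, 360, 433],
}


def speedups(times, baseline="RLT-S"):
    """Return each tool's time divided by the baseline's time, per input size."""
    base = times[baseline]
    return {
        tool: [round(t / b, 1) for t, b in zip(values, base)]
        for tool, values in times.items()
        if tool != baseline
    }


ratios = speedups(times_ms)
for tool, values in ratios.items():
    print(tool, values)
```

For the largest input, (5000, 5000), the ratios come out to roughly 17.5 for FEBRL, 36.6 for FRIL, and 1.8 for TPA (FCED), meaning RLT-S took that many times less time on this data.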