Asif Sohail, Muhammad Murtaza Yousaf.
Abstract
BACKGROUND: Record de-duplication is the process of identifying records that refer to the same entity. It plays a pivotal role in data mining applications that involve the integration of multiple data sources and data cleansing. The task is challenging because of its computational complexity and the variations in data representation across data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both require tuning a set of parameters, such as the choice of a particular blocking or windowing variant and the selection of an appropriate window size for each dataset.
Keywords: Data integration; Inverted index; Record comparison reduction; Record linkage/de-duplication
Year: 2016 PMID: 27067004 PMCID: PMC4828843 DOI: 10.1186/s12911-016-0280-9
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
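The abstract names blocking and windowing as the two ways of cutting down record comparisons. The following is a minimal sketch of both ideas, not the authors' implementation: the function names, the toy records, and the key functions are hypothetical, and the windowing shown is the classic record-level sorted neighbourhood rather than any specific variant evaluated in the paper.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Compare only records that share the same blocking key value."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def windowing_pairs(records, key, window_size):
    """Sort records by key, then pair each record with the next
    (window_size - 1) records in the sorted order."""
    ordered = sorted(records, key=lambda rec_id: (key(records[rec_id]), rec_id))
    pairs = set()
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:i + window_size]:
            pairs.add(tuple(sorted((left, right))))
    return pairs


def first_letter(rec):
    """Hypothetical blocking key: first letter of the surname."""
    return rec["surname"][:1]


def exact_surname(rec):
    """Hypothetical sorting key: the surname itself."""
    return rec["surname"]


# Toy data, invented for illustration only.
records = {
    1: {"surname": "smith"},
    2: {"surname": "smyth"},
    3: {"surname": "jones"},
    4: {"surname": "smith"},
}

print(sorted(blocking_pairs(records, first_letter)))       # [(1, 2), (1, 4), (2, 4)]
print(sorted(windowing_pairs(records, exact_surname, 2)))  # [(1, 3), (1, 4), (2, 4)]
```

With four records a full comparison would need six pairs; each method here produces three, and which three depends on the key and, for windowing, on the window size. That dependence is the tuning burden the abstract refers to.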
Fig. 1 Record de-duplication process
Fig. 2 Proposed framework for record de-duplication
Fig. 3 Possible results of duplicate detection
Confusion Matrix
| Actual | Classified by RL technique as match | Classified by RL technique as non-match |
|---|---|---|
| Match (M) | True matches: True Positives (TP) | False non-matches: False Negatives (FN) |
| Non-match (U) | False matches: False Positives (FP) | True non-matches: True Negatives (TN) |
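The four cells of the confusion matrix are the inputs to the usual quality measures. The sketch below uses the standard definitions of precision, recall, and F-score; the function names and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def precision(tp, fp):
    """Share of classified matches that are true matches."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of true matches that were classified as matches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Invented counts for illustration only.
print(precision(90, 10))              # 0.9
print(recall(90, 30))                 # 0.75
print(round(f_score(90, 10, 30), 4))  # 0.8182
```

True negatives do not enter any of the three measures, which matters in de-duplication because non-matching pairs vastly outnumber matching ones.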
Datasets for FRAMEWORK Evaluation
| Dataset name | No. of fields | No. of records | No. of original records | No. of duplicate records |
|---|---|---|---|---|
| Dataset-A | 12 | 1000 | 500 | 500 |
| Dataset-C | 12 | 1000 | 600 | 400 |
Permutations for Experimental Evaluation
| Indexing technique | Methodology | Encoding function for indexing key | Field comparison functions |
|---|---|---|---|
| 1. Blocking | • Single Key Blocking (SKB) | • Soundex (SDX) | • Soundex |
| 2. Windowing with window sizes 3, 6, 9, …, 30 | • Single Key Windowing (SKW) | | |
Linking Fields and Comparison Functions
| Linking fields | Comparison function |
|---|---|
| postcode | Edit Distance |
| address_1 | Q-gram |
| soc_sec_id | Edit Distance |
| given_name | Soundex/Substring |
| surname | Soundex/Substring |
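Two of the comparison functions listed above, edit distance and q-gram, can be sketched as follows. This is an illustration under stated assumptions: the normalisation to a 0..1 similarity, the choice of q = 2, the Dice coefficient for q-grams, and all function names are assumptions, since the table does not specify them.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def qgrams(s, q=2):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram multisets (an assumed variant)."""
    ga, gb = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    if not ga or not gb:
        return 1.0 if a == b else 0.0
    common = sum((ga & gb).values())
    return 2 * common / (sum(ga.values()) + sum(gb.values()))


print(edit_distance("kitten", "sitting"))          # 3
print(edit_similarity("2602", "2620"))             # 0.5
print(qgram_similarity("main st", "main street"))  # 0.75
```

Edit distance suits short, code-like fields such as postcode, where a single keystroke error should cost little; q-grams tolerate the abbreviations and reorderings common in address fields.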
Results of Experiment using Full Index
| Dataset | Dataset-A | Dataset-C |
|---|---|---|
| Record Comparisons | 499500 | 499500 |
| Classified Matches | 496 | 1054 |
| Classified possible matches | 2 | 135 |
| Pairs Quality or Precision | 0.000993 | 0.002210 |
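The record-comparison figure for the full index follows from comparing every record with every other record once. A short check, assuming the 1000-record datasets listed earlier; the function names are hypothetical, and the pairs-quality line only reproduces the Dataset-A value in the table above.

```python
from math import comb


def full_index_comparisons(n_records):
    """Number of unordered record pairs: n * (n - 1) / 2."""
    return comb(n_records, 2)


def pairs_quality(classified_matches, comparisons):
    """Matches found per comparison made."""
    return classified_matches / comparisons


comparisons = full_index_comparisons(1000)
print(comparisons)                                # 499500
print(round(pairs_quality(496, comparisons), 6))  # 0.000993
```

Because the pair count grows quadratically with the number of records, even these small datasets need roughly half a million comparisons, which is what motivates blocking and windowing.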
Setup for BLOCKING Experiments (X = A OR C)
| Experiment category | Exp. code | Blocking key | Encoding function for blocking key |
|---|---|---|---|
| Single Key Blocking (SKB) | DX-SKB | | 1. Soundex (SDX) |
| Composite Key Blocking (CKB) | DX-CKB | | |
| Multipass Blocking (MPB) | DX-MPB | | |
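The encoding functions applied to blocking keys can be sketched as follows. The Soundex routine follows the standard American Soundex rules; reading SB4 and SB3 as the first four and first three characters of a field is an assumption, and the function names are hypothetical.

```python
# Standard American Soundex digit groups.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name):
    """Four-character phonetic code: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (encoded + "000")[:4]


def substring_key(value, length=4):
    """Assumed reading of SB4/SB3: lower-cased prefix of the field."""
    return value.strip().lower()[:length]


print(soundex("Smith"), soundex("Smyth"))  # S530 S530
print(soundex("Robert"))                   # R163
print(substring_key("Smith", 4))           # smit
print(substring_key("Smyth", 4))           # smyt
```

The example shows the trade-off between the two encodings: Soundex places the spelling variants "Smith" and "Smyth" in the same block, while a four-character prefix separates them.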
Results of blocking methods for dataset-A
| Blocking method | Single key blocking (SKB) | Composite key blocking (CKB) | Multipass blocking (MPB) |
|---|---|---|---|
| Record comparisons | 4096 | 484 | 5279 |
| Matches | 474 | 476 | 495 |
| F-Score | 0.969 | 0.975 | 0.989 |
Results of blocking methods for dataset-C
| Blocking method | Single key blocking (SKB) | Composite key blocking (CKB) | Multipass blocking (MPB) |
|---|---|---|---|
| Record comparisons | 4175 | 583 | 5340 |
| Matches | 848 | 551 | 1008 |
| F-Score | 0.886 | 0.684 | |
Setup for WINDOWING Experiments (X = A OR C)
| Experiment category | Exp. code | Description | Sorting key |
|---|---|---|---|
| Single key windowing (SKW) | DX-SKW-SDX | Dataset X - Single Key Windowing - Soundex encoding | |
| | DX-SKW-SB4 | Dataset X - Single Key Windowing - Substring4 encoding | |
| Composite key windowing (CKW) | DX-CKW-SDX | Dataset X - Composite Key Windowing - Soundex encoding | |
| | DX-CKW-SB4 | Dataset X - Composite Key Windowing - Substring4 encoding | |
| Multipass windowing (MPW) | DX-MPW-SDX | Dataset X - Multipass Windowing - Soundex encoding | |
| | DX-MPW-SB4 | Dataset X - Multipass Windowing - Substring4 encoding | |
Results of windowing variants (Dataset-A and Dataset-C)
| Dataset | Dataset-A | | | | | Dataset-C | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Window size | 3 | 6 | 12 | 21 | 30 | 3 | 6 | 12 | 21 | 30 |
| Record Pairs : SKW-SDX | 9551 | 17151 | 32679 | 54322 | 75768 | 8476 | 14777 | 27629 | 47081 | 65030 |
| Matches : SKW-SDX | 469 | 478 | 481 | 485 | 486 | 764 | 864 | 918 | 965 | 978 |
| F-Score : SKW-SDX | 0.959 |
| 0.948 | 0.929 |
|
| 0.886 | 0.904 |
| 0.895 |
| Record Pairs : SKW-SB4 | 10271 | 18591 | 34882 | 58769 | 81591 | 9808 | 16380 | 30259 | 50809 | 70322 |
| Matches : SKW-SB4 | 479 | 482 | 483 | 484 | 484 | 900 | 949 | 967 | 979 | 981 |
| F-Score : SKW-SB4 |
| 0.963 | 0.948 | 0.923 |
| 0.910 |
| 0.926 | 0.911 |
|
| Record Pairs : CKW-SDX | 3539 | 7186 | 14437 | 24966 | 35314 | 3342 | 6624 | 13033 | 22469 | 31713 |
| Matches : CKW-SDX | 469 | 477 | 482 | 488 | 490 | 519 | 662 | 783 | 862 | 912 |
| F-Score : CKW-SDX | 0.965 |
| 0.968 | 0.963 |
|
| 0.765 | 0.840 | 0.878 |
|
| Record Pairs : CKW-SB4 | 3651 | 7430 | 14884 | 25905 | 36655 | 3803 | 7409 | 14240 | 24426 | 34359 |
| Matches : CKW-SB4 | 487 | 488 | 491 | 491 | 492 | 750 | 863 | 922 | 955 | 969 |
| F-Score : CKW-SB4 |
| 0.981 | 0.976 | 0.965 |
|
| 0.892 | 0.918 |
| 0.923 |
| Record Pairs : MPW-SDX | 13858 | 25968 | 49722 | 82615 | 114045 | 12158 | 21982 | 41884 | 70643 | 96960 |
| Matches : MPW-SDX | 496 | 496 | 496 | 496 | 496 | 889 | 976 | 1015 | 1032 | 1034 |
| F-Score : MPW-SDX |
| 0.970 | 0.944 | 0.907 |
| 0.902 |
| 0.936 | 0.912 |
|
| Record Pairs : MPW-SB4 | 15614 | 29191 | 54977 | 91679 | 125521 | 14434 | 25208 | 46783 | 78219 | 107261 |
| Matches : MPW-SB4 | 494 | 494 | 494 | 494 | 494 | 1022 | 1031 | 1032 | 1034 | 1035 |
| F-Score : MPW-SB4 |
| 0.964 | 0.936 | 0.894 |
|
| 0.961 | 0.939 | 0.905 |
|
Fig. 4 Number of record comparisons of windowing variants using SDX (Dataset-A)
Fig. 5 Number of matches of windowing variants using SDX (Dataset-A)
Fig. 6 F-Score of windowing variants using SDX and SB4 (Dataset-A)
Best window sizes for dataset-A and dataset-C
| Windowing variant | Window size for Dataset-A (SDX) | Window size for Dataset-A (SB4) | Window size for Dataset-C (SDX) | Window size for Dataset-C (SB4) |
|---|---|---|---|---|
| Multipass Windowing – MPW (Highest matches) | 3–6 | 3–6 | 21–24 | 6–9 |
| Composite Key Windowing – CKW (Least comparisons) | 21–24 | 6–9 | 30 | 30 |
| Single Key Windowing - SKW | 21–24 | 6–9 | 30 | 21–24 |
Fig. 7 Comparison of MPB and MPW for Dataset-A (a) and Dataset-C (b)
Number of comparisons using proposed framework
| Phases of the proposed framework | Number of comparisons |
|---|---|
| CKB using SB4 | 583 |
| MPB using SB4 | 5304 |
| MPW using SB4 | 14434 |
| Total | 20321 |
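The total in the table above can be checked by summing the three phases and setting the result against the 499500 comparisons of the full index. The reduction ratio computed here is a common record-linkage measure and is an addition for illustration; the source does not report it, and the variable names are hypothetical.

```python
# Per-phase comparison counts copied from the table above.
phase_comparisons = {
    "CKB using SB4": 583,
    "MPB using SB4": 5304,
    "MPW using SB4": 14434,
}
full_index = 499500  # comparisons needed without any indexing

total = sum(phase_comparisons.values())
# Assumed measure: share of full-index comparisons that are avoided.
reduction_ratio = 1 - total / full_index

print(total)                      # 20321
print(round(reduction_ratio, 4))  # 0.9593
```

Under this reading, running all three phases still avoids about 96% of the comparisons that a full index would require.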