Antonio Maratea, Angelo Ciaramella, Giuseppe Pio Cianci.
Abstract
Record linkage aims to identify records from multiple data sources that refer to the same real-world entity. It is a well-known data quality process that has been studied since the second half of the last century, with an established pipeline and a rich literature of case studies mainly covering the census, administrative and health domains. In this paper, a method is proposed to recognize matching records from real municipalities and banks through multiple similarity criteria and a Neural Network classifier: starting from a labeled subset of the available data, several similarity measures are first combined and weighted to build a feature vector, then a Multi-Layer Perceptron (MLP) network is trained and tested to find matching pairs. For validation, seven real datasets were used (three from banks and four from municipalities), purposely chosen in the same geographical area to increase the probability of matches. Training involved only two municipalities, while testing involved all sources (municipalities vs. municipalities, banks vs. banks, and municipalities vs. banks). The proposed method scored remarkable results in terms of both precision and recall, clearly outperforming threshold-based competitors. ©2020 Maratea et al.
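The two-stage idea described in the abstract (per-field similarity measures assembled into a feature vector, then an MLP classifying candidate pairs as match or non-match) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: the record schema in `FIELDS`, the two similarity measures (a `difflib` sequence ratio and a token-level Jaccard index), the toy records, the network size and the learning rate. The weighting of the similarity measures mentioned in the abstract is not reproduced here; the features are passed to the network unweighted.

```python
import math
import random
from difflib import SequenceMatcher

# Hypothetical record schema; the paper's actual fields are not given in the abstract.
FIELDS = ("name", "surname", "address", "birth_date")

def edit_ratio(a, b):
    """Character-level similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a, b):
    """Token-level Jaccard similarity in [0, 1]; two empty strings count as equal."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def features(r1, r2):
    """Feature vector for a record pair: two similarity scores per field."""
    vec = []
    for field in FIELDS:
        vec.append(edit_ratio(r1[field], r2[field]))
        vec.append(jaccard(r1[field], r2[field]))
    return vec

class TinyMLP:
    """One hidden tanh layer and a sigmoid output, trained with plain SGD."""

    def __init__(self, n_in, n_hidden=4, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return h, 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, x):
        """Estimated probability that the pair described by x is a match."""
        return self._forward(x)[1]

    def fit(self, X, y, epochs=200, lr=0.1):
        for _ in range(epochs):
            for x, target in zip(X, y):
                h, p = self._forward(x)
                d_out = p - target  # cross-entropy gradient w.r.t. the output pre-activation
                for j, hj in enumerate(h):
                    d_hid = d_out * self.w2[j] * (1.0 - hj * hj)  # uses w2 before its update
                    self.w2[j] -= lr * d_out * hj
                    self.b1[j] -= lr * d_hid
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hid * xi
                self.b2 -= lr * d_out

# Toy records, invented for illustration only.
a = {"name": "Mario", "surname": "Rossi", "address": "Via Roma 12 Napoli", "birth_date": "1970-03-14"}
b = {"name": "Mario", "surname": "Rosi", "address": "V. Roma 12 Napoli", "birth_date": "1970-03-14"}
c = {"name": "Lucia", "surname": "Bianchi", "address": "Corso Italia 5 Salerno", "birth_date": "1985-11-02"}
d = {"name": "Lucia", "surname": "Bianchi", "address": "Corso Italia 5", "birth_date": "1985-11-02"}

# Labeled pairs: 1 = same entity, 0 = different entities.
pairs = [(a, b, 1), (c, d, 1), (a, c, 0), (b, d, 0)]
X = [features(r1, r2) for r1, r2, _ in pairs]
y = [label for _, _, label in pairs]

model = TinyMLP(n_in=len(X[0]))
model.fit(X, y)
match_score = model.predict_proba(features(a, b))
```

A threshold-based baseline of the kind the abstract compares against would instead average or sum the similarity scores and declare a match above a fixed cut-off; the classifier replaces that hand-tuned cut-off with a decision boundary learned from the labeled pairs.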
Keywords: Deduplication; Entity resolution; Feature extraction; Neural networks; Record Linkage
Year: 2020 PMID: 33816910 PMCID: PMC7924437 DOI: 10.7717/peerj-cs.258
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992