Lydia González-Serrano1, Pilar Talón-Ballestero1, Sergio Muñoz-Romero1,2, Cristina Soguero-Ruiz1, José Luis Rojo-Álvarez1,2.
Abstract
Customer Relationship Management (CRM) is a fundamental tool in the hospitality industry, and it can be seen as a big-data scenario because of the large number of records that managers handle every year. Data quality is crucial for the success of these systems, and one of the main issues to be solved by businesses in general, and by hospitality businesses in particular, is the identification of duplicated customers. This problem has received little attention in the recent literature, probably in part because it is not easy to state in statistical terms. In the present work, we address duplicated-customer identification as a large-scale data analysis problem, and we propose and benchmark a general-purpose solution for it. Our system consists of four basic elements: (a) a generic feature representation for the customer fields in a simple table-shaped database; (b) an efficient distance for comparing feature values, namely the Levenshtein distance computed with the Wagner-Fischer algorithm; (c) a big-data implementation using basic map-reduce techniques to readily support the comparison of strategies; (d) an X-from-M criterion to identify the possible neighbors of a duplicated-customer candidate. We analyze the mass density function of the distances in the CRM text-based fields and characterize their behavior and consistency in terms of the entropy and the mutual information of these fields. Our experiments on a large CRM from a multinational hospitality chain show that the distance distributions are statistically consistent for each feature, and that neighborhood thresholds are adjusted automatically by the system in a first step and can subsequently be fine-tuned according to the manager's experience.
The entropy distributions for the different variables, as well as the mutual information between pairs, are characterized by multimodal profiles, where a wide gap between close and far fields is often present. This motivates the proposal of the so-called X-from-M strategy, which is shown to be computationally affordable and can provide the expert with a reduced number of duplicate candidates to supervise, with low X values being enough to ensure the sensitivity required at the automatic detection stage. The proposed system again supports the benefits of big-data technologies in CRM scenarios for hotel chains and, rather than relying on ad-hoc heuristic rules, promotes the research and development of theoretically principled approaches.
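
Elements (b) and (d) of the system can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names, the per-field thresholds, and the example records are hypothetical, the map-reduce distribution of the comparisons is omitted, and the thresholds are fixed by hand rather than adjusted automatically.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance via the Wagner-Fischer dynamic program (two rows)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (ca != cb)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def is_neighbor(candidate: Dict[str, str], other: Dict[str, str],
                thresholds: Dict[str, int], x: int) -> bool:
    """X-from-M: at least x of the M compared fields lie within their threshold."""
    close_fields = sum(
        levenshtein(candidate[field], other[field]) <= threshold
        for field, threshold in thresholds.items()
    )
    return close_fields >= x


# Hypothetical thresholds (in characters) and records, for illustration only.
thresholds = {"name": 2, "surname": 2, "email": 3}
rec_a = {"name": "Maria", "surname": "Gonzalez", "email": "mgonzalez@example.com"}
rec_b = {"name": "Maria", "surname": "Gonzales", "email": "m.gonzalez@example.com"}
rec_c = {"name": "John", "surname": "Smith", "email": "jsmith@example.org"}

print(is_neighbor(rec_a, rec_b, thresholds, x=2))  # all three fields are close
print(is_neighbor(rec_a, rec_c, thresholds, x=1))  # no field is close
```

Here M is the number of entries in `thresholds`. Lowering `x` enlarges the neighborhood of a candidate, which favors sensitivity at the cost of more candidates for the expert to review.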
Keywords: Customer Relationship Management; Levenshtein distance; X-from-M strategy; big data; duplicate detection; entropy; hospitality industry; mass density function; mutual information; name matching
Year: 2019 PMID: 33267133 PMCID: PMC7514908 DOI: 10.3390/e21040419
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Figure 1. Normalized distributions for the different features in the experimental database. For visualization purposes, features have been sorted in descending order of the position of the maximum of each distribution; this ordering is strongly consistent with their nature.
Figure 2. Distance in number of characters obtained for when screening percentile in the statistical distributions of each of the m features. Panels show the mean and 95% confidence interval (shaded in gray) for the 100 independent realizations. The lower panel is a zoom for the range , given that this region turns out to be the most interesting one for neighborhood purposes of distance .
Figure 3. Density mass function estimations and their asymptotic CI for examples of features with different nature (a–c). Estimated entropies and asymptotic evolution for representative example features (d–f). Estimated mutual information and asymptotic evolution for representative examples of pairs of features (g–i).
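
The entropy and mutual information mentioned in this caption can be illustrated with simple plug-in (empirical-frequency) estimates. This is only a sketch under that assumption: the estimator actually used and its asymptotic confidence intervals are not specified in this record, and the sample distances below are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(samples: Sequence[Hashable]) -> float:
    """Plug-in entropy estimate in bits from the empirical mass function."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Hypothetical distances (in characters) observed for two fields over the same pairs.
name_distances = [0, 1, 5, 6, 0, 1, 5, 6]
email_distances = [0, 0, 9, 9, 0, 0, 9, 9]

print(entropy(name_distances))
print(mutual_information(name_distances, email_distances))
```

Plug-in estimates are biased for small samples, so the values stabilize only as the number of compared pairs grows.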
Figure 4. Averaged neighborhood of a specific customer , obtained for 100 independent realizations, as a function of the fixed threshold in distance and the number of features required to fulfill its corresponding threshold, according to the X-from-M criterion.
Trade-off between Sensitivity and Specificity.
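| | Sensitivity, TPR (%): 1 | Sensitivity, TPR (%): 2 | Sensitivity, TPR (%): 3 | Sensitivity, TPR (%): 4 | Sensitivity, TPR (%): 5 | Specificity, TNR (%): 1 | Specificity, TNR (%): 2 | Specificity, TNR (%): 3 | Specificity, TNR (%): 4 | Specificity, TNR (%): 5 |
|---|---|---|---|---|---|---|---|---|---|---|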
| 1 | 89.16 | 91.66 | 96.86 | 99.99 | 100.0 | 99.60 | 77.48 | 38.08 | 90.54 | 94.67 |
| 2 | 89.16 | 91.61 | 96.77 | 99.99 | 99.99 | 99.79 | 89.48 | 64.19 | 83.40 | 81.69 |
| 3 | 89.16 | 91.61 | 96.77 | 99.99 | 99.99 | 99.84 | 90.03 | 65.28 | 75.27 | 68.51 |
| 4 | 89.16 | 91.39 | 96.50 | 99.97 | 99.99 | 99.84 | 90.99 | 67.39 | 71.30 | 60.59 |
| 5 | 89.16 | 90.00 | 93.71 | 99.90 | 99.94 | 99.84 | 95.66 | 76.37 | 74.37 | 58.89 |
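
For reference, the two rates reported in the table follow the standard definitions TPR = TP / (TP + FN) and TNR = TN / (TN + FP). The sketch below computes them from labeled pairs; the labels and predictions are hypothetical and are not taken from the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[bool],
                            y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (TPR, TNR) in percent; assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # duplicates flagged
    fn = sum(1 for t, p in pairs if t and not p)      # duplicates missed
    tn = sum(1 for t, p in pairs if not t and not p)  # non-duplicates left alone
    fp = sum(1 for t, p in pairs if not t and p)      # non-duplicates flagged
    return 100.0 * tp / (tp + fn), 100.0 * tn / (tn + fp)


# Hypothetical ground truth (True = real duplicate) and detector output.
truth = [True, True, True, False, False, False, False, False]
flagged = [True, True, False, False, False, False, False, True]

tpr, tnr = sensitivity_specificity(truth, flagged)
print(f"TPR = {tpr:.2f}%, TNR = {tnr:.2f}%")
```

The False Negative Rate is the complement of the sensitivity, FNR = 100 - TPR when both are expressed in percent.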
Figure 5. Trade-off between False Negative Rate (a) and True Negative Rate (b).