Chloé-Agathe Azencott1,2, Maïté Laurent3, Catherine Noguès4,5, Nadine Andrieu1,6, Dominique Stoppa-Lyonnet3,7,8, Yue Jiao3,1,6, Fabienne Lesueur1,6, Noura Mebirouk1,6, Lilian Laborde9, Juana Beauvallet1,6, Marie-Gabrielle Dondon1,6, Séverine Eon-Marchais1,6, Anthony Laugé3, Sandrine M Caputo10.
Abstract
BACKGROUND: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors.
Keywords: Hybrid process; Probabilistic linkage; Record linkage; Supervised machine learning
Year: 2021 PMID: 34325649 PMCID: PMC8320036 DOI: 10.1186/s12874-021-01299-6
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
An example of a record pair comparison and its PRL likelihood score calculation
| | CTR | NUMFAM | SUJID | GENDER | Yob | Mob | Dob | BRCA1 | BRCA2 | MUT_HGVS | PRL Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Individual GEMO_5789 | 1 | 17455 | 0001 | 2 | 1959 | 08 | 05 | 1 | 0 | c.3403C > T | – |
| Individual GENEPSO_01082300001 | 1 | 08230 | 0001 | 2 | 1958 | 08 | 05 | 1 | 0 | c. 3481_3491del | – |
| Similarity | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0.7825 | – |
| | 0.02272 | 0.00025 | 0.0018 | 0.5000 | 0.01098 | 0.07692 | 0.03125 | 0.3333 | 0.3333 | 0.0006 | – |
| | 5.45 | 11.95 | 9.1 | 0.99 | 6.49 | 3.68 | 4.99 | 1.57 | 1.57 | 10.69 | sum( |
| | 5.45 | 0 | 9.1 | 0.99 | 0 | 3.68 | 4.99 | 1.57 | 1.57 | 8.36 | sum( |
| score | | | | | | | | | | | 0.6322 |
Ten matching variables were used to identify record pairs: BRCA1 mutational status (BRCA1), BRCA2 mutational status (BRCA2), mutation description using the HGVS nomenclature (MUT_HGVS), gender (GENDER), recruiting center number (CTR), family number (NUMFAM), individual number in the family (SUJID), year of birth (Yob), month of birth (Mob) and day of birth (Dob). BRCA1 and BRCA2 matching variables: 1: “carrier of a BRCA1/2 mutation”, 0: “non-carrier of a BRCA1/2 mutation”. GENDER matching variable: 1: male, 2: female. The similarity vector in the third row is used as input to the machine learning approaches. The PRL score is calculated from the weights and the similarities.
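The score in the last row can be reproduced from the table itself. The sketch below assumes, based only on the numbers shown, that the PRL score is the sum of weight × similarity divided by the sum of the weights. The function name `prl_score` and the variable names are illustrative and are not taken from the study.

```python
# Weights and similarities copied from the example table, in column order:
# CTR, NUMFAM, SUJID, GENDER, Yob, Mob, Dob, BRCA1, BRCA2, MUT_HGVS
WEIGHTS = [5.45, 11.95, 9.1, 0.99, 6.49, 3.68, 4.99, 1.57, 1.57, 10.69]
SIMILARITY = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0.7825]


def prl_score(weights, similarity):
    """Assumed formula: sum(weight * similarity) / sum(weight)."""
    if len(weights) != len(similarity):
        raise ValueError("weights and similarity must have the same length")
    weighted = sum(w * s for w, s in zip(weights, similarity))
    return weighted / sum(weights)


score = prl_score(WEIGHTS, SIMILARITY)
print(round(score, 3))  # 0.632, in line with the 0.6322 reported in the table
```

The small difference in the fourth decimal place is what one would expect from the rounding of the weights as displayed in the table.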
Fig. 1 Elaboration of the hybrid record linkage process and its main steps. a Assignment of the matching status by PRL followed by manual review, to obtain a set of true matches representing the gold standard. b Selection of the best-performing supervised machine learning algorithm. c Selection of the best-performing method among PRL, ML, and PRL + ML. d Training of a final ML model on a larger subset of the initial datasets. e Application of the optimal linkage method to link the updated databases
Fig. 2 Score distribution of 15,653,232 record pairs in dataset 1. a Whole score distribution. b Zoom on the distribution for the highest scores
Mean performance for the ML algorithms trained on the Atrain dataset, evaluated on Atest
| Models | Recall (M) | Recall (SD) | Precision (M) | Precision (SD) |
|---|---|---|---|---|
| Bernoulli | 0.01172 | 0.00079 | 0.01139 | 0.00096 |
| CT | 0.9841 | 0.016 | 0.9779 | 0.0059 |
| Bagged trees | 0.9809 | 0.012 | 0.9826 | 0.0080 |
| AdaBoost | 0.9839 | 0.011 | 0.9828 | 0.0075 |
| RF | | 0.011 | 0.9824 | 0.010 |
| SVM | 0.9821 | 0.017 | 0.9789 | 0.0068 |
| NNET | 0.9823 | 0.012 | | 0.0078 |
Six machine learning algorithms were tested: Classification Tree (CT), Bagged trees, AdaBoost, Random Forest (RF), Support Vector Machine (SVM) and Neural Network (NNET). M mean, SD standard deviation. The highest mean values among the different algorithms are highlighted in bold
Fig. 3 Performance of three linkage methods: PRL (Probabilistic Record Linkage), RF (Random Forest) and PRL + RF. PRL thresholds vary from 0.6 to 0.8. a Comparison of their recalls. b Comparison of their precisions
Fig. 4 Comparison of candidate matches predicted by the RF and PRL models for the updated databases. a RF and PRL identified 819 and 1268 new candidate matches, respectively; 772 candidate matches were common to both approaches. b After manual review, PRL + RF led to the identification of 738 true matches, of which 727 were found by PRL and 715 by RF when each method was used on its own. 704 true matches were identified by both approaches, 23 only by PRL, and 11 only by the RF model
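The counts in this caption are consistent under inclusion-exclusion, which the short sketch below checks. It only re-derives numbers from the caption. The variable names are illustrative, and the "confirmed share" ratios are not reported in the caption: they assume that every true match attributed to a method was among that method's candidate matches.

```python
# Counts taken from the Fig. 4 caption.
rf_candidates, prl_candidates, common_candidates = 819, 1268, 772
prl_true, rf_true, both_true = 727, 715, 704

# Inclusion-exclusion: size of the union of two overlapping sets.
union_candidates = rf_candidates + prl_candidates - common_candidates
union_true = prl_true + rf_true - both_true

# True matches found by one method but not the other.
only_prl = prl_true - both_true
only_rf = rf_true - both_true

# Share of each method's candidates confirmed by manual review (derived, assumed).
prl_confirmed = prl_true / prl_candidates
rf_confirmed = rf_true / rf_candidates

print(union_candidates, union_true, only_prl, only_rf)  # 1315 738 23 11
print(round(prl_confirmed, 3), round(rf_confirmed, 3))  # 0.573 0.873
```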
Fig. 5 General overview of the hybrid record linkage process. a Probabilistic record linkage (PRL) followed by a stage of manual review is first applied to build a dataset allowing the construction of a supervised machine learning (ML) model. b The PRL + ML combined linkage is then used to classify the updated datasets (record pair comparison from Database X′ and Database Y′). The ML model obtained in (a) is used (dotted arrow) for the prediction in (b)
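As a rough illustration of step (b), the sketch below combines the two classifiers by taking the union of their candidate matches and passing that union to manual review. The union rule is an assumption suggested by the counts in Fig. 4. The threshold value, the `toy_model` stand-in for a trained classifier, the function name `hybrid_candidates`, and all record-pair data are hypothetical placeholders, not the study's model or data.

```python
from typing import Callable, Dict, Sequence, Set


def hybrid_candidates(
    similarity_vectors: Dict[str, Sequence[float]],
    prl_scores: Dict[str, float],
    ml_predict: Callable[[Sequence[float]], bool],
    prl_threshold: float,
) -> Dict[str, Set[str]]:
    """Return pair ids flagged by PRL, by the ML model, and by either one."""
    prl = {pid for pid, s in prl_scores.items() if s >= prl_threshold}
    ml = {pid for pid, vec in similarity_vectors.items() if ml_predict(vec)}
    # Assumed combination rule: review every pair flagged by at least one method.
    return {"prl": prl, "ml": ml, "to_review": prl | ml}


def toy_model(vec: Sequence[float]) -> bool:
    """Hypothetical stand-in for a trained classifier such as a random forest."""
    return sum(vec) / len(vec) >= 0.75


# Hypothetical similarity vectors (ten matching variables) and PRL scores.
similarity_vectors = {
    "p1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1.0],
    "p2": [1, 0, 1, 1, 0, 1, 1, 1, 1, 0.78],
    "p3": [0, 0, 0, 1, 0, 0, 0, 1, 1, 0.1],
}
prl_scores = {"p1": 1.0, "p2": 0.63, "p3": 0.08}

result = hybrid_candidates(similarity_vectors, prl_scores, toy_model, prl_threshold=0.7)
print(sorted(result["to_review"]))  # ['p1', 'p2']
```

In this toy run, pair `p2` falls below the PRL threshold but is still sent to review because the stand-in model flags it, which is the kind of complementary behaviour the combined approach is meant to capture.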