| Literature DB >> 34903103 |
Elaheh Shafieibavani [1], Benjamin Goudey [1,2], Isabell Kiral [1], Peter Zhong [1], Antonio Jimeno-Yepes [1], Annalisa Swan [1], Manoj Gambhir [1], Andreas Buechner [3], Eugen Kludt [3], Robert H Eikelboom [4,5,6], Cathy Sucher [4,5], Rene H Gifford [7], Riaan Rottier [8], Kerrie Plant [8], Hamideh Anjomshoa [1,9].
Abstract
While cochlear implants have helped hundreds of thousands of individuals, it remains difficult to predict the extent to which an individual's hearing will benefit from implantation. Several publications indicate that machine learning may improve predictive accuracy of cochlear implant outcomes compared to classical statistical methods. However, existing studies are limited in terms of model validation and evaluating factors like sample size on predictive performance. We conduct a thorough examination of machine learning approaches to predict word recognition scores (WRS) measured approximately 12 months after implantation in adults with post-lingual hearing loss. This is the largest retrospective study of cochlear implant outcomes to date, evaluating 2,489 cochlear implant recipients from three clinics. We demonstrate that while machine learning models significantly outperform linear models in prediction of WRS, their overall accuracy remains limited (mean absolute error: 17.9-21.8). The models are robust across clinical cohorts, with predictive error increasing by at most 16% when evaluated on a clinic excluded from the training set. We show that predictive performance is unlikely to improve through increased sample size alone, with a doubling of sample size estimated to improve performance by only 3% on the combined dataset. Finally, we demonstrate how the current models could support clinical decision making, highlighting that subsets of individuals can be identified that have a 94% chance of improving WRS by at least 10 percentage points after implantation, which is likely to be clinically meaningful. We discuss several implications of this analysis, focusing on the need to improve and standardize data collection.
Keywords: cochlear implant; machine learning; predictive model
Mesh:
Year: 2021 PMID: 34903103 PMCID: PMC8764462 DOI: 10.1177/23312165211066174
Source DB: PubMed Journal: Trends Hear ISSN: 2331-2165 Impact factor: 3.293
Cohort demographics: the total number of patients and the reported distribution by gender (number of females and their percentage in brackets). For the remaining statistics, the number of patients is reported with the mean and standard deviation in brackets: word recognition score (WRS) with CI and with HA, pure tone average (PTA), and years of severe to profound deafness (YRS-D), the latter two reported for both the implanted and contralateral ears. All individuals in this study were implanted between 2003 and 2018.
| VUMC | ESIA | MHH | Combined dataset | |
|---|---|---|---|---|
| Number of patients | 453 | 246 | 1790 | 2489 |
| Number of females | 453 (199, 43.9%) | NA | 1790 (986, 55.1%) | 2243 (1185, 47.6%) |
| Age(CI) | 453 (65.7, 13.8) | 246 (64.7, 14.0) | 1790 (57.3, 16.7) | 2489 (59.6, 16.3) |
| WRS(CI) | 453 (45.0, 22.6) | 246 (42.8, 23.1) | 1790 (53.5, 28.0) | 2489 (50.9, 26.9) |
| WRS(HA) | 376 (8.4, 12.3) | 238 (7.0, 11.4) | 709 (4.2, 9.5) | 1323 (5.9, 10.9) |
| PTA (implanted ear) | 450 (97.7, 19.3) | 246 (116.7, 14.1) | 1771 (98.5, 17.6) | 2467 (100.2, 18.4) |
| PTA (contralateral ear) | 450 (83.4, 25.5) | 246 (85.5, 29.0) | 1740 (76.3, 28.6) | 2436 (78.5, 28.3) |
| YRS-D (implanted ear) | 396 (24.9, 17.1) | 230 (27.2, 18.3) | 1373 (8.1, 12.5) | 1999 (13.7, 16.5) |
| YRS-D (contralateral ear) | 58 (26.9, 14.9) | 62 (28.3, 17.0) | 592 (11.6, 17.0) | 712 (14.3, 17.8) |
Features included or excluded in the baseline models and in the novel models developed in this work. These features have been found to significantly impact post-implantation hearing performance in previous studies. Here, ✓ indicates the feature is used in the model, and - indicates the feature is not used. Age-D is calculated as the difference between Age-CI and YRS-D. Both WRS(HA) and PB are used as pre-operative speech test measures.
| Feature | Baseline model A | Baseline model B | Baseline model C | Models in this work |
|---|---|---|---|---|
| Age at onset of s/p deafness (Age-D) | ✓ | ✓ | ✓ | ✓ |
| Duration of HA use (YRS-HA) | - | - | ✓ | ✓ |
| Etiology | ✓ | ✓ | - | ✓ |
| PTA | - | - | ✓ | ✓ |
| PTA | - | ✓ | - | ✓ |
| PTA | - | - | ✓ | ✓ |
| Duration of s/p deafness (YRS-D) | ✓ | ✓ | ✓ | ✓ |
| Age-CI | ✓ | ✓ | ✓ | ✓ |
| Pre-operative speech test | - | - | ✓ | ✓ |
| Native speaker | - | - | - | ✓ |
| Implant side | - | - | - | ✓ |
Figure 1. Comparison of the MAE of predicting the post-operative WRS on all datasets combined using four novel models and three baseline models, where Models A and B are linear and Model C is a random forest. The first four boxes use all features in our dataset, while the remaining boxes are the baseline models. Statistical significance of the drop in performance compared to XGB-RF is shown by lines above the bars, where symbols correspond to the following p-values: ns: p > 0.05, *: p ≤ 0.05, **: p ≤ 0.01, ***: p ≤ 0.001, ****: p ≤ 0.0001.
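The comparison in Figure 1 rests on two pieces: the MAE of each model's predictions and a paired significance test on the per-recipient absolute errors. A minimal sketch of that computation, using synthetic placeholder data (the WRS values and model errors below are invented for illustration, not taken from the study):

```python
import numpy as np
from scipy import stats

def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted scores."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Synthetic stand-ins: true post-operative WRS and two models' predictions.
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 100, 200)
pred_a = y_true + rng.normal(0, 18, 200)  # hypothetical model A
pred_b = y_true + rng.normal(0, 24, 200)  # hypothetical model B

# Paired test on the absolute errors of the two models for the same recipients.
err_a = np.abs(y_true - pred_a)
err_b = np.abs(y_true - pred_b)
stat, p = stats.wilcoxon(err_a, err_b)
```

The Wilcoxon signed-rank test here is one reasonable choice for paired error comparisons; the paper does not specify which test underlies its significance markers.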
Figure 2. Evaluation of XGB-RF predictive performance under increasingly stringent validation settings. We consider single-clinic and all-clinics cross-validation, where models are trained and evaluated in a cross-validation framework using data from either a single clinic or all clinics combined, with results from the latter stratified across the three datasets. As multiple models are constructed, we show the distribution of these results. Finally, external validation shows the result of training a model on two cohorts and evaluating on the remaining cohort. These results are a single score for a single model and hence are shown as a horizontal line.
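The external-validation setting described above is a leave-one-clinic-out scheme: train on two cohorts, test on the third. A sketch of that loop, assuming entirely synthetic features and outcomes (the clinic labels match the study's cohort names, but nothing else here is from the paper, and a plain random forest stands in for the XGB-RF model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-ins: features X, outcome y, and a clinic label per recipient.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 1, 300)
clinic = np.repeat(["VUMC", "ESIA", "MHH"], 100)

# Leave-one-clinic-out: train on two clinics, evaluate on the held-out one.
results = {}
for held_out in np.unique(clinic):
    train, test = clinic != held_out, clinic == held_out
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train], y[train])
    results[held_out] = mean_absolute_error(y[test], model.predict(X[test]))
```

Comparing each held-out MAE against the within-clinic cross-validation MAE gives the "at most 16% increase" style of robustness statement reported in the abstract.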
Figure 3. Impact of adjusting training dataset size on MAE. Here, the amount of data used to train the model is varied, with a fixed 10% of each dataset withheld to evaluate the model. Dotted lines show fitted logarithmic curves, used to extrapolate the impact of additional training data of the same quality on model performance.
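The extrapolation in Figure 3 amounts to fitting a logarithmic learning curve to the observed (training size, MAE) points and evaluating it at larger sizes. A minimal sketch, with hypothetical learning-curve points (the sizes and MAE values below are invented, not the study's measurements):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical learning-curve points: training-set size vs observed MAE.
sizes = np.array([250.0, 500.0, 1000.0, 2000.0])
maes = np.array([23.5, 22.0, 20.8, 19.9])

def log_curve(n, a, b):
    # MAE modelled as decreasing logarithmically with sample size n.
    return a - b * np.log(n)

# Fit the curve, then extrapolate to a doubled training set.
(a, b), _ = curve_fit(log_curve, sizes, maes)
predicted_at_double = log_curve(2 * sizes[-1], a, b)
```

Because the curve flattens logarithmically, doubling the sample size buys only a small MAE reduction, which is the basis for the abstract's estimate of roughly 3% improvement from doubled data.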
Figure 4. Comparison of the accuracy of predicting discretized post-operative WRS using four machine learning models and the three baseline models. The first four boxes use all features in our dataset, while the remaining boxes are the baseline models, where Models A and B are linear and Model C is a random forest. Statistical significance of the drop in performance compared to XGB-Lin is shown by lines above the bars, where symbols correspond to the following p-values: ns: p > 0.05, *: p ≤ 0.05, **: p ≤ 0.01, ***: p ≤ 0.001, ****: p ≤ 0.0001.
Figure 5. a) Distribution of actual delta WRS (WRS(CI) - WRS(HA)) for individuals predicted to fall within seven predicted WRS ranges. b) Positive predictive value (PPV) of achieving a post-implantation delta WRS of at least 10, considering different prediction intervals: all predictions, and the three highest-risk groups in subplot a). c) Negative predictive value (NPV) of achieving a post-implantation delta WRS of at least 10, considering different prediction intervals: the three lowest-risk groups in subplot a) and all predictions. Note that the colours of the bars correspond to those in subplot a).
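The PPV and NPV in Figure 5 follow the standard definitions: among those predicted to reach the delta-WRS threshold, the fraction who actually do (PPV), and among those predicted not to reach it, the fraction who actually do not (NPV). A minimal sketch with a tiny invented example (the boolean vectors below are illustrative, not study data):

```python
import numpy as np

def ppv_npv(predicted_positive, actual_positive):
    """PPV = P(actual+ | predicted+); NPV = P(actual- | predicted-)."""
    predicted_positive = np.asarray(predicted_positive, dtype=bool)
    actual_positive = np.asarray(actual_positive, dtype=bool)
    ppv = actual_positive[predicted_positive].mean()
    npv = (~actual_positive[~predicted_positive]).mean()
    return float(ppv), float(npv)

# Hypothetical: predicted vs actual achievement of delta WRS >= 10.
pred = [True, True, True, False, False]
actual = [True, True, False, False, True]
ppv, npv = ppv_npv(pred, actual)  # ppv = 2/3, npv = 1/2
```

Restricting the computation to a high-prediction subgroup, as in subplot b), is what yields figures like the 94% PPV quoted in the abstract.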