Mikkel Fly Kragh, Henrik Karstoft.
Abstract
Embryo selection within in vitro fertilization (IVF) is the process of evaluating the quality of fertilized oocytes (embryos) and selecting the best embryo(s) available within a patient cohort for subsequent transfer or cryopreservation. In recent years, artificial intelligence (AI) has been used extensively to improve and automate the embryo ranking and selection procedure by extracting relevant information from embryo microscopy images. The AI models are evaluated based on their ability to identify the embryo(s) with the highest chance(s) of achieving a successful pregnancy. Studies are divided, however, on whether such evaluations should be based on ranking performance or on pregnancy prediction. As a result, a variety of performance metrics are reported, and comparisons between studies are often made on different outcomes and data foundations. Moreover, superiority of AI methods over manual human evaluation is often claimed based on retrospective data, without any mention of potential bias. In this paper, we provide a technical view on some of the major topics that divide how current AI models are trained, evaluated and compared. We explain and discuss the most common evaluation metrics and relate them to the two separate evaluation objectives, ranking and prediction. We also discuss when and how to compare AI models across studies and explain in detail how a selection bias is inevitable when comparing AI models against current embryo selection practice in retrospective cohort studies.
Keywords: Artificial intelligence; Embryo selection; Model evaluation and comparison; Selection bias
Year: 2021 PMID: 34173914 PMCID: PMC8324599 DOI: 10.1007/s10815-021-02254-6
Source DB: PubMed Journal: J Assist Reprod Genet ISSN: 1058-0468 Impact factor: 3.412
List of studies that used AI on image data to predict or rank embryos based on pregnancy outcome. The reported information only concerns evaluation of pregnancy-related outcomes. Therefore, if a study includes additional tasks such as blastocyst prediction, these are not included in the table
| Reference | Input | Outcome | Embryo population (a) | Human vs. AI (b) | Metrics (c) |
|---|---|---|---|---|---|
| [ | Static image | Fetal heartbeat | *-D5-blastocyst | ✓ | Accuracy, AUC |
| [ | Static image, patient age | Beta-HCG | *-D5/D6-blastocyst | - | Accuracy, sensitivity, specificity, PPV, NPV, FPR, FNR, F1, AUC |
| [ | Static image, patient age, blastocyst age, lab settings | Ploidy/beta-HCG | *-D5/D6-blastocyst | ✓ | Accuracy, sensitivity, specificity, PPV, AUC, NDCG |
| [ | Time-lapse video | Fetal heartbeat | ICSI-D3-*, ICSI-D5-* | | Sensitivity, PPV, AUC |
| [ | Time-lapse video | Fetal heartbeat (2) | *-D5-* | - | AUC |
| [ | Static image | Fetal heartbeat | *-D5-blastocyst | ✓ | Accuracy, sensitivity, specificity |
| [ | Static image | “pregnancy” | *-*-blastocyst | - | Accuracy, sensitivity, specificity |
| [ | Static image | Live birth | *-D5/D6-blastocyst | | Accuracy, sensitivity, specificity, AUC |
| [ | Static image | Live birth | *-D5-blastocyst (1) | - | Accuracy, sensitivity, specificity, PPV, NPV, AUC |
| [ | Static image, annotations, patient info (age, BMI, ...) | Live birth | *-D5/D6-blastocyst | | Accuracy, sensitivity, specificity, informedness, AUC |
| [ | Static image, annotations, patient info (age, BMI, ...) | Live birth | *-D5/D6-blastocyst | - | Accuracy, sensitivity, specificity, informedness, PPV, NPV, AUC |
| [ | Time-lapse video | Fetal heartbeat | *-*-* | ✓ | PPV, NPV, AUC |
| [ | Time-lapse video | Fetal heartbeat | *-D5/D6-* | | AUC |

(a) The notation for Embryo population is explained in “Data foundation” and visualized in Fig. 1.
(b) Human vs. AI comparisons are discussed in “Bias in model comparisons”.
(c) All metrics are explained in detail in “Evaluation metrics: which performance measure to use?”.
(1) Only aneuploid miscarriages (confirmed with genetic testing of chorionic villus samples) were included as negative live births.
(2) Negative fetal heartbeat was assumed for all non-transferred embryos that had “failed or abnormal fertilization, grossly abnormal morphology or aneuploidy from preimplantation genetic testing”.
Fig. 1. Example scheme for reporting embryo population and outcome. A study reporting prediction of live birth on transferred day 5 blastocysts fertilized by ICSI would have the embryo population ICSI-D5-Blastocyst-Transfer and outcome live birth.
Fig. 2. Confusion matrix and definitions of common binary classification metrics.
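As a rough illustration of the metrics named in Fig. 2 and in the table above, the sketch below computes them from the four confusion-matrix counts using the standard textbook definitions. It is not code from the paper: the `ConfusionMatrix` type, the `binary_metrics` function and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of binary outcomes (hypothetical helper type, not from the paper)."""
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    tn: int  # predicted negative, actually negative
    fn: int  # predicted negative, actually positive


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(cm):
    """Compute common binary classification metrics from a confusion matrix."""
    sensitivity = _ratio(cm.tp, cm.tp + cm.fn)  # true positive rate (recall)
    specificity = _ratio(cm.tn, cm.tn + cm.fp)  # true negative rate
    return {
        "accuracy": _ratio(cm.tp + cm.tn, cm.tp + cm.fp + cm.tn + cm.fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": _ratio(cm.tp, cm.tp + cm.fp),  # positive predictive value
        "npv": _ratio(cm.tn, cm.tn + cm.fn),  # negative predictive value
        "fpr": 1 - specificity,               # false positive rate
        "fnr": 1 - sensitivity,               # false negative rate
        "f1": _ratio(2 * cm.tp, 2 * cm.tp + cm.fp + cm.fn),
        "informedness": sensitivity + specificity - 1,  # Youden's J
    }


# Made-up counts for 100 transferred embryos.
example = binary_metrics(ConfusionMatrix(tp=30, fp=20, tn=40, fn=10))
```

All of these values depend on the threshold used to turn a score into a positive or negative prediction, so the same model yields a different confusion matrix at each threshold.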
Fig. 3. Example of a hypothetical distribution of predicted scores across positive and negative implantation outcomes and the corresponding receiver operating characteristic (ROC) curve.
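The area under the ROC curve (AUC) can be read as the probability that a randomly chosen positive outcome receives a higher score than a randomly chosen negative one, which is why it is a ranking measure rather than a calibration measure. A minimal sketch of that pairwise definition follows; the function name `pairwise_auc` and the example scores are hypothetical and not taken from the paper.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Runs in O(n_pos * n_neg) time, which is fine for small illustrations.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores: 8 of the 9 positive/negative pairs are ordered correctly.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(pairwise_auc(scores, labels))
```

Because only the ordering of scores matters, applying any strictly increasing transformation to the scores leaves the AUC unchanged.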
Fig. 4. Calibration plot linking predicted probabilities to actual success rates. Grouped observations (triangles) represent success rates for embryos grouped by similar predictions. Flexible calibration (solid line) represents a smoothed estimate of observed success rates in relation to model predictions. The distributions of scores for positive and negative pregnancy outcomes are shown at the bottom of the graph.
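The grouped observations in a calibration plot can be sketched as below. The caption does not say how embryos were grouped, so equal-width probability bins are an assumption made here for simplicity; the function name `calibration_groups` and the example values are likewise hypothetical.

```python
def calibration_groups(probabilities, outcomes, n_bins=5):
    """Group predictions into equal-width bins (an assumption, see lead-in).

    Returns one (mean predicted probability, observed success rate, count)
    tuple per non-empty bin, ordered from lowest to highest predictions.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError("predicted probabilities must lie in [0, 1]")
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    groups = []
    for members in bins:
        if not members:
            continue
        mean_prediction = sum(p for p, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        groups.append((mean_prediction, success_rate, len(members)))
    return groups


# Made-up predictions and outcomes for six embryos.
groups = calibration_groups(
    [0.10, 0.15, 0.30, 0.35, 0.80, 0.90],
    [0, 0, 0, 1, 1, 1],
)
```

For a well-calibrated model, the observed success rate in each group is close to the mean predicted probability, so the points lie near the diagonal of the plot.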
Fig. 5. Influence of test set sample size (log scale) on standard deviations for different metrics. Solid lines denote mean values for each metric, whereas shaded regions illustrate the standard deviations. Potential performance improvements caused by increasing the sample size of the training set are not addressed in this analysis.
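The effect of test set size on the spread of a metric can be illustrated with a small simulation. The sketch below is not the analysis behind Fig. 5: it covers accuracy only, assumes that each test prediction is independently correct with a fixed probability, and uses hypothetical names (`accuracy_spread`, `binomial_sd`) and an arbitrary true accuracy of 0.7.

```python
import math
import random
import statistics


def accuracy_spread(n_test, true_accuracy=0.7, repeats=500, seed=0):
    """Simulate repeated test sets of size n_test.

    Each prediction is correct with probability true_accuracy (assumed).
    Returns the mean and standard deviation of the measured accuracy.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        estimates.append(correct / n_test)
    return statistics.mean(estimates), statistics.stdev(estimates)


def binomial_sd(n_test, true_accuracy=0.7):
    """Analytic standard deviation of accuracy under the same assumption."""
    return math.sqrt(true_accuracy * (1 - true_accuracy) / n_test)


for n_test in (50, 200, 1000):
    mean, sd = accuracy_spread(n_test)
    print(n_test, round(mean, 3), round(sd, 3), round(binomial_sd(n_test), 3))
```

Under this assumption the standard deviation shrinks with the square root of the test set size, so quadrupling the number of test embryos roughly halves the uncertainty of the reported accuracy.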
Fig. 6. Influence of selection bias on model comparison.
Influence of selection bias on model comparison with different performance metrics
| | Accuracy (Model 1) | Accuracy (Model 2) | Informedness (Model 1) | Informedness (Model 2) | F1-score (Model 1) | F1-score (Model 2) | AUC (Model 1) | AUC (Model 2) |
|---|---|---|---|---|---|---|---|---|
| Overall | 0.70 | 0.70 | 0.41 | 0.41 | 0.68 | 0.68 | 0.76 | 0.76 |
| | 0.60 | 0.71 | 0.13 | 0.37 | 0.74 | 0.79 | 0.59 | 0.74 |
| | 0.60 | 0.66 | 0.14 | 0.25 | 0.74 | 0.76 | 0.59 | 0.66 |
| | 0.61 | 0.63 | 0.14 | 0.15 | 0.75 | 0.75 | 0.59 | 0.61 |
All performance measures are obtained from the simulations in Fig. 6
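The mechanism behind this kind of selection bias can be illustrated with a toy simulation. The sketch below is not the simulation used for Fig. 6, and it will not reproduce the numbers in the table. It assumes a latent embryo quality, two models that observe it with equal and independent noise, and a retrospective data set that keeps only the embryos scored highest by Model 1, which stands in for current selection practice. All names and parameter values are hypothetical.

```python
import math
import random


def rank_auc(scores, labels):
    """AUC from the rank sum of the positives (ties get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate_selection_bias(n=20000, keep_fraction=0.3, seed=0):
    """Return ((auc1, auc2) on all embryos, (auc1, auc2) on selected ones)."""
    rng = random.Random(seed)
    quality = [rng.gauss(0, 1) for _ in range(n)]        # latent, assumed
    model1 = [q + rng.gauss(0, 1) for q in quality]      # equally noisy views
    model2 = [q + rng.gauss(0, 1) for q in quality]
    # Outcome probability rises with quality (logistic link, assumed).
    outcome = [int(rng.random() < 1 / (1 + math.exp(-2 * q))) for q in quality]

    overall = (rank_auc(model1, outcome), rank_auc(model2, outcome))

    # Retrospective data: only embryos ranked highest by Model 1 were transferred.
    n_keep = max(2, round(keep_fraction * n))
    cutoff = sorted(model1, reverse=True)[n_keep - 1]
    kept = [i for i in range(n) if model1[i] >= cutoff]
    kept_outcome = [outcome[i] for i in kept]
    selected = (
        rank_auc([model1[i] for i in kept], kept_outcome),
        rank_auc([model2[i] for i in kept], kept_outcome),
    )
    return overall, selected


overall, selected = simulate_selection_bias()
print("all embryos:", overall)
print("selected by Model 1:", selected)
```

In this toy setting both models perform about equally on all embryos, yet Model 1 looks worse than Model 2 on the selected subset, because selecting on Model 1's own score restricts the range over which it can still discriminate.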