Performance of a deep learning based neural network in the selection of human blastocysts for implantation.

Charles L Bormann^1,2, Manoj Kumar Kanakasabapathy^3, Prudhvi Thirumalaraju^3, Raghav Gupta^3, Rohan Pooniwala^3, Hemanth Kandula^3, Eduardo Hariton^1, Irene Souter^1,2, Irene Dimitriadis^1,2, Leslie B Ramirez^4, Carol L Curchoe^5,6, Jason Swain^6, Lynn M Boehnlein^7, Hadi Shafiee^2,3.

Abstract

Deep learning in in vitro fertilization is currently being evaluated in the development of assistive tools for the determination of transfer order and implantation potential using time-lapse data collected through expensive imaging hardware. Assistive tools and algorithms that can work with static images, however, can help in improving the access to care by enabling their use with images acquired from traditional microscopes that are available to virtually all fertility centers. Here, we evaluated the use of a deep convolutional neural network (CNN), trained using single timepoint images of embryos collected at 113 hr post-insemination, in embryo selection amongst 97 clinical patient cohorts (742 embryos) and observed an accuracy of 90% in choosing the highest quality embryo available. Furthermore, a CNN trained to assess an embryo's implantation potential directly using a set of 97 euploid embryos capable of implantation outperformed 15 trained embryologists (75.26% vs. 67.35%, p<0.0001) from five different fertility centers.

Keywords: blastocysts; convolutional neural networks; embryology; euploid embryos; human; human embryos; in vitro fertilization; medicine

Year: 2020 PMID: 32930094 PMCID: PMC7527234 DOI: 10.7554/eLife.55301

Source DB: PubMed Journal: Elife ISSN: 2050-084X Impact factor: 8.140

Introduction

Assisted reproductive technologies (ART) such as in vitro fertilization (IVF), while a solution for many infertile couples, have been inefficient, with an average success rate of approximately 30% reported in 2015 in the US (CDC, 2015). IVF is also expensive, costing patients well over $10,000 out-of-pocket for each ART cycle in the US, with many patients requiring multiple cycles to achieve a successful pregnancy (CDC, 2015; Birenbaum-Carmeli, 2004; Toner, 2002). Although multiple factors such as maternal age, medical diagnosis, gamete and embryo quality, and endometrium receptivity determine the success of ART cycles, the non-invasive selection of the highest-quality embryo available in a patient’s cohort (the top-quality embryo) for transfer remains one of the most important factors in achieving successful ART outcomes (Vaegter et al., 2017; Barash et al., 2017; Conaghan et al., 2013; Wong et al., 2013; Racowsky et al., 2015; Filho et al., 2010; Machtinger and Racowsky, 2013; Demko et al., 2016; Einarsson et al., 2017; Hill et al., 1989; Erenus et al., 1991; Paulson et al., 1990; Osman et al., 2015). Traditional methods of embryo selection rely on visual embryo morphological assessment and are highly practice-dependent and subjective (Storr et al., 2017; Baxter Bendus et al., 2006; Paternot et al., 2009). Fully automated assessments of embryos are challenging owing to the complexity of embryo morphologies. Emulating the skill of highly trained embryologists in a fully automated system has been a major challenge in previous work on computer-aided embryo assessment, which has focused on measuring specific expert-defined parameters such as zona pellucida thickness variation, number of blastomeres, degree of cell symmetry, and cytoplasmic fragmentation (Rocha et al., 2017a; Rocha et al., 2017b).
Machine learning is loosely defined as a computer program that learns a given task over time through experience and improves itself to achieve the best possible task performance. In the past decade, advances in hardware compute performance and machine learning techniques have significantly improved their applicability in real-world medical and non-medical problems. Recently, machine learning has been proposed as a solution for automated analysis of embryo morphologies (Rocha et al., 2017b; Bormann et al., 2020; Dimitriadis et al., 2019; Thirumalaraju et al., 2019; Khosravi et al., 2019; Kanakasabapathy et al., 2019a). This work makes use of a deep convolutional neural network (CNN), a representation learning technique, that has been proven to be effective in image classification tasks. Unlike most prior computer-aided algorithms, including some techniques of machine learning used for embryo assessment, the reported CNN architecture allows automated embryo feature selection and analysis at the pixel level without any interference by an embryologist (Rocha et al., 2017a; Rocha et al., 2017b). Such networks do not depend on human-specified features and can develop an ability to evaluate embryos categorically through iterative learning from thousands of examples. The use of deep-learning in IVF has also been explored; however, these recent neural network-based approaches have focused on either classifying embryos based on morphological quality and were not evaluated for transfer outcomes, or were developed with the use of time-lapse series of images toward the evaluation of implantation (Khosravi et al., 2019; Tran et al., 2019). It is important to emphasize here that most fertility centers do not possess time-lapse imaging hardware even in the United States of America (Dolinko et al., 2017). 
The lack of availability of such hardware limits an otherwise promising technology mostly to resource-rich settings and fails to improve quality of and access to care in resource-constrained settings where such advances are sorely needed (Wahl et al., 2018; Hosny and Aerts, 2019). Furthermore, in current clinical practice, embryos with the highest morphological grades (top-quality) are the first to be transferred; clinically, however, these decisions are made manually, even with time-lapse imaging systems. The development of networks that can measure an embryo’s potential for implantation and help in rank ordering the embryos in a patient’s cohort for transfer has utility in virtually all fertility centers. Conventionally, embryo transfers are performed at the cleavage or the blastocyst stage of development. Embryos are at the cleavage stage 2–3 days after fertilization and develop further in suitable culture conditions to reach the blastocyst stage 5–7 days after fertilization. Blastocyst embryo transfers, in particular, have been associated with better implantation rates and have helped lower the number of embryos transferred at a time (De Croo et al., 2019). Therefore, in this study, we investigated the use of a CNN pre-trained with 1.4 million ImageNet images and transfer-learned using 2440 static human embryo images recorded at a single time-point of 113 hr post insemination (hpi) for the development of neural networks that can help identify embryos capable of implantation and identify the top-quality embryos (Figure 1). The top-quality embryos were identified by combining a previously developed network (Xception architecture), trained to classify embryos based on their blastocyst quality, with a genetic algorithm scheme (Figure 1; Thirumalaraju et al., 2020).
The original neural network was trained on a hierarchical system of categorization, derived from a clinical Gardner-based grading system, to minimize data sparsity and improve overall network learning (Kanakasabapathy et al., 2019a; Thirumalaraju et al., 2020; Kanakasabapathy et al., 2019b; Esteva et al., 2017). The two inference classes, non-blastocysts and blastocysts, comprised training classes 1 and 2 and training classes 3, 4, and 5, respectively (Figure 1). Pre-training with a large dataset of images from ImageNet honed the ability of the developed CNN to identify the shape, structure, and texture variations between morphologically complex embryos with minimal data requirements, while the genetic algorithm helped in rank ordering embryos by generating unified scores (Figure 1). The developed network was evaluated using an independent test set comprising 97 patient-embryo cohorts. Embryos of the highest quality that were selected from the patient cohorts were evaluated using known implantation outcomes.

Figure 1.

Classification and selection of embryos at 113 hpi.

The schematic shows neural networks that classify and rank order embryos based on their morphological quality (network A) and classify embryos based on their implantation potential (network B). The two networks share a common Xception architecture, but the classification layers are specific to each task. Network A also uses a genetic algorithm that helps in generating embryo scores by using the softmax output of the network with weights generated by the algorithm during training. Embryo(s) with the highest scores are evaluated for single embryo and double embryo transfer scenarios using the retrospective test set. The implantation potential is given by the softmax output of the neural network.

Additionally, we investigated whether the neural network can be trained to directly differentiate between embryos based on their potential for implantation (Figure 1). Our tests with patient cohorts using the algorithm do not account for the ploidy status of the embryos. Since pre-implantation genetic screened (PGS) euploid embryos are associated with a higher chance of implantation, we also designed a neural network and evaluated its performance in refining the selection of screened embryos based on their implantation potential. The evaluations using the patient cohorts tend to yield embryo selections with unknown outcomes or ploidy status; therefore, for this section of the study, we utilized a test set of 97 euploid embryos with known implantation outcomes. The CNN was trained and evaluated in identifying euploid embryos capable of implantation, and its performance was compared against that of 15 embryologists from five different fertility centers across the United States of America.

Results

Evaluation of embryo selection based on embryo quality

In our evaluations of the CNN in categorizing embryos imaged at 113 hpi based on their morphology, the network performed with an accuracy of 90.97% (area under the curve: 0.96) in differentiating between blastocysts and non-blastocysts (n = 742) (Kanakasabapathy et al., 2019a; Thirumalaraju et al., 2020; Figure 2—figure supplement 1). The high accuracy indicated that the trained network was concordant with embryologists in categorizing embryos. To establish a rank order, these categorization scores (five values per embryo) need to be considered relative to the scores of the other embryos in the cohort. To use the five probability values effectively for calculating the embryo score, we utilized a genetic algorithm, which is well-suited for optimization problems with multiple existing solutions. Here, the genetic algorithm enabled the developed CNN to select the top-quality embryos within a patient’s embryo cohort at 113 hpi. Therefore, once we established that the network was capable of categorizing embryos based on their morphologies with high accuracy, we used a genetic algorithm and the network-defined probability values of the embryos belonging to each of the five training classes to rank order the embryos for transfer. The 5 × 1 weight vector generated by the genetic algorithm during its training phase was used in evaluating retrospectively collected embryo cohorts from 97 patients. The final weights utilized in this study were −10.01226347, −3.63697951, −3.32090987, 2.15367795, and 2.8715555 for classes 1 through 5, respectively. Embryos were ranked by the algorithm from the highest score to the lowest.
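The scoring and ranking step can be illustrated with a short sketch. It assumes that the embryo score is the weighted sum (dot product) of the five softmax probabilities with the reported weights; the text states only that the score combines the softmax output with the genetic-algorithm weights, so the exact combination rule is an assumption. The function names and the example cohort are hypothetical.

```python
# Weights for training classes 1..5, as reported in the text.
WEIGHTS = [-10.01226347, -3.63697951, -3.32090987, 2.15367795, 2.8715555]


def embryo_score(softmax_probs, weights=WEIGHTS):
    """Weighted sum of the five class probabilities (assumed scoring rule)."""
    if len(softmax_probs) != len(weights):
        raise ValueError("expected one probability per training class")
    return sum(p * w for p, w in zip(softmax_probs, weights))


def rank_cohort(cohort, top_k=None):
    """Return embryo ids sorted from highest to lowest score.

    cohort maps an embryo id to its five softmax probabilities.
    top_k=1 mimics single embryo transfer, top_k=2 double embryo transfer.
    """
    ranked = sorted(cohort, key=lambda eid: embryo_score(cohort[eid]), reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical cohort of three embryos (probabilities are made up).
cohort = {
    "embryo_a": [0.02, 0.03, 0.10, 0.25, 0.60],
    "embryo_b": [0.05, 0.10, 0.50, 0.25, 0.10],
    "embryo_c": [0.70, 0.20, 0.05, 0.03, 0.02],
}

print(rank_cohort(cohort))           # full rank order
print(rank_cohort(cohort, top_k=1))  # single-embryo selection
```

Under this assumed rule, the strongly negative weights for classes 1 through 3 penalize any probability mass on non-blastocyst or early blastocyst classes, while the positive weights for classes 4 and 5 reward it, which is why a single continuous score can separate embryos that share the same predicted class.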

Figure 2—figure supplement 1.

Confusion matrix of the network in classifying embryos based on their morphological quality.

According to the American Society for Reproductive Medicine guidelines on the limits to the number of embryos per transfer, one embryo is transferred for high-prognosis patients <37 years of age and two or more embryos are transferred for patients >37 years of age as well as younger patients with low prognosis (Practice Committee of the American Society for Reproductive Medicine and Practice Committee of the Society for Assisted Reproductive Technology, 2017). Therefore, in this study, the selection accuracy was assessed for scenarios of single embryo transfers (SET) and double embryo transfers (DET). Using embryo cohort images (n = 742) from the 97 patients, the accuracy of the selections made by 5 well-trained embryologists was evaluated in comparison to the selections made by the CNN + genetic algorithm (CNNg). The rank-ordering performed by the algorithm may not utilize the same features used by embryologists in identifying the top embryos for transfer. Therefore, we initially evaluated the ability of both groups to effectively select (i) blastocyst(s) for transfer and (ii) the highest quality of blastocyst(s) (HQB) available for transfer. High-quality blastocysts are defined as embryos that met the freezing criteria (>3 CC blastocyst grade; see Materials and methods) of the Massachusetts General Hospital (MGH) fertility clinic. For blastocyst selections at 113 hpi, the CNNg algorithm performed with an accuracy of 98.96% for SET, which was similar (p>0.05) to the average accuracy of the embryologists (96.91%, CI: 94.69% to 99.12%) (n = 5) (Figure 2A). However, when two embryo selections for DET were allowed based on blastocyst and non-blastocyst classification, the CNNg algorithm performed with an accuracy of 100.00%, which was better (p<0.05) than that of the embryologists (n = 5), who performed with an average accuracy of 98.76% (CI: 97.69% to 99.83%) (Figure 2B).

Figure 2.

Classification and selection of embryos at 113 hpi.

(A) The performance in single embryo selections by embryologists and the algorithm in selecting blastocysts using embryo morphologies obtained at 113 hpi from 97 patient cohorts. (B) The performance in double embryo selections by the two groups in selecting blastocysts (n = 97 patient cohorts). (C) The performance in single embryo selections by the two groups in selecting the highest quality blastocysts (n = 97 patient cohorts). (D) The performance of the two groups in selecting the highest quality blastocysts when two selections were provided (n = 97 patient cohorts).

Confusion matrix of the network in classifying embryos based on their morphological quality.

The matrix provides the network’s confusion between the five training classes. The dotted lines represent the separation between non-blastocysts (classes 1 and 2) and blastocysts (classes 3, 4, and 5). The reported accuracy is the binary classification accuracy of the CNN in differentiating between the two inference classes (non-blastocysts and blastocysts).

Toward the selection of HQB at 113 hpi, the accuracy of the CNNg algorithm for SET was 89.69%, similar (p>0.05) to that of the embryologists (n = 5), who performed with an average accuracy of 90.31% (CI: 87.50% to 93.11%) (Figure 2C). When two embryo selections for DET at 113 hpi were allowed, the system performed with a better (p<0.05) accuracy of 97.94% in comparison to the embryologists, who performed with an average accuracy of 96.91% (CI: 96.00% to 97.81%) (Figure 2D). The evaluations indicated that the two groups made selections that were of similar or marginally different quality. Since the network was trained on the MGH classification criteria, the comparable performance of the CNNg algorithm and embryologists indicated that the neural network had trained sufficiently and made selections of clinically acceptable quality. In our evaluations, the selections made by each group, while of similar quality, were not necessarily the same embryos from each cohort, and thus their transfer outcomes may differ.

Evaluation of selections using implantation outcomes

It is critical to evaluate the system performance in selecting the patient embryos based on pregnancy (implantation) outcome. Typically, in a clinical IVF cycle, the top-quality embryo is selected from the cohort of available embryos and is transferred to the patient. Other embryos of similarly high quality are often frozen, based on the freezing criteria used by the fertility center, for transfer in subsequent procedures for the same patient if needed. Frozen cycle transfers are not performed for all patients. Hence, the CNNg algorithm was evaluated in embryo selection for SET at 113 hpi using patient embryo cohorts, based on the actual implantation outcomes of the selected embryos; the associated cycle characteristics (n = 97) are provided in Supplementary file 1-table 1. The test dataset was retrospectively collected based on pre-defined selection criteria, and evaluations of transfer outcomes were performed using fresh embryo transfer cycles. The system selected 97 embryos in 97 patient embryo cohorts (742 embryos in total), out of which 44 embryos had known implantation outcomes. The accuracy of the system in SET through embryo selection at 113 hpi based on implantation outcome was 59.1%, while the implantation success rate for the 102 transferred embryos at the MGH fertility center was 44.1% for blastocyst transfers (Supplementary file 1-table 2). Furthermore, prior reports suggest that in general practice, the average implantation rates for manual embryo selection and transfers at blastocyst stages can be as low as 34% (Martins et al., 2017). A limitation of a retrospective study is that not all embryos are transferred, so the implantation outcomes of all embryos selected by the CNNg algorithm cannot be evaluated.
Therefore, although the dataset was prepared without taking into consideration the availability of subsequent frozen cycle transfers, we investigated with the fertility center whether the patients of the test set had any subsequent embryo transfers using the frozen embryos from the test set. When we consider subsequent frozen embryo transfers, five embryos originally selected by the CNNg algorithm at 113 hpi had known implantation outcomes, of which four led to successful implantations (Supplementary file 1-table 2). The accuracy of the CNNg algorithm in SET, when both fresh and frozen embryo transfers were considered, was 61.2%. In such a scenario, for this specific dataset, the implantation success rate at the MGH fertility center was 48.5% for blastocyst transfers when including both frozen and fresh transfers. The results suggest that the CNNg algorithm has the potential to improve clinical transfer outcomes. It should, however, be emphasized that in this particular analysis the performance of the system was evaluated using only the embryos selected by the network and the embryologists.

Furthermore, to evaluate whether a CNN can potentially measure implantation potential through morphology alone, the network was used in a pilot study to evaluate a pooled set of 29 embryo images with known transfer outcomes based on their potential for implantation. The network was trained as a binary classifier, and the softmax probability values output by the network were used as the embryo’s implantation potential. The CNN was retrained using 281 embryo images with known implantation outcomes that did not overlap with the test set, and the final classification layer was replaced with two classes: negative for implantation and positive for implantation.
The ability to differentiate embryos was measured through a receiver operating characteristic (ROC) curve analysis, which established an area under the curve (AUC) of 0.771 (CI: 0.579 to 0.906) (p<0.05), and the CNN performed with an accuracy of 82.76% (CI: 64.23% to 94.15%) (Figure 3A). Ten out of 11 embryos with an implantation potential of over 0.47 had implanted and, similarly, 12 out of 18 embryos that scored less than 0.47 did not implant according to the patient cycle history.
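The Figure 3 caption states that the accuracy error bar is a Clopper-Pearson exact binomial 95% confidence interval. A minimal standard-library sketch of that interval follows; it is a generic implementation based on bisection over the binomial distribution, not the authors' code. The count of 24 correct out of 29 is inferred here from the reported 82.76% and is not stated explicitly in the text.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# 24 of 29 is inferred from the reported accuracy, not given in the text.
low, high = clopper_pearson(24, 29)
print(f"accuracy = {24 / 29:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

With only 29 embryos the exact interval is necessarily wide, which is one reason the pilot result was followed by the larger euploid evaluation.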

Figure 3.

Performance in identifying embryos based on implantation outcomes.

(A) The performance of the neural network system in identifying embryos that implanted compared to the baseline historical implantation for the image set (n = 29). The error-bar represents the Clopper-Pearson exact binomial 95% confidence interval. (B) The performance of the neural network system in identifying euploid embryos that implanted compared to the performance of 15 embryologists in identifying implanting embryos (n = 97). The error-bar represents the 95% confidence interval of the embryologists’ performance in identifying implanting embryos.

Implantation potential and the relative implantation rates using the euploid embryo test set.

The scatter plot illustrates the implantation potential of the euploid embryos evaluated in this study as measured by the neural network (n = 97). The ground truth represents actual clinical transfer outcomes.

Evaluation of euploid embryos based on their implantation potential

After we observed high performance in the artificial intelligence (AI)-based implantation potential prediction when compared with historical clinical data, we conducted a multi-center evaluation of the AI system by comparing its implantation prediction accuracy with that of the embryo selections of 15 embryologists from five different fertility clinics. Here, we used 97 genetically screened euploid embryos transferred at 113 hpi to remove the effect of chromosomal abnormalities as a confounder, which existed in the pilot study (29 patient embryos). The IVF cycle characteristics in which these embryos were used are provided in Supplementary file 1-table 3. The system performed with an accuracy of 75.26% while the embryologists performed with an average accuracy of 67.35% (CI: 64.52% to 70.19%) in differentiating euploid embryos based on their implantation outcome (Figure 3B). A one-sample t-test revealed that the CNN significantly outperformed (p<0.05) the embryologists in predicting embryo implantation by measuring the implantation potential of euploid embryos using a static image obtained at a single time-point of 113 hpi. The average implantation score of euploid embryos misclassified based on their implantation outcome using the CNN was 0.57, and 95% of the misclassified euploid embryos possessed scores ranging between 0.51 and 0.63. Implantation scores closer to 0.5 indicate lower confidence in system predictions, while implantation scores closer to 0 or 1 indicate higher confidence in system predictions (Figure 3—figure supplement 1). These results indicate that the majority of system errors in misclassifying the euploids occur among the embryos with the lowest confidence. Approximately 91% of euploid embryos with implantation potential scores of 0.80 or higher, and nearly 81% of embryos with implantation potential scores above 0.66, successfully implanted when transferred (Figure 3—figure supplement 1). Similarly, around 78% of euploid embryos with an implantation potential <0.33 failed to implant when transferred (Figure 3—figure supplement 1). These results suggest that the network’s implantation scores agree well with transfer outcomes even in high-quality euploid embryos.
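The one-sample t-test mentioned above compares the embryologists' individual accuracies against a single fixed value, here the CNN accuracy. The sketch below computes only the test statistic and its degrees of freedom with the standard library. The per-embryologist accuracies are not reported in the text, so the input list is a hypothetical placeholder, and the p-value lookup (which needs a t-distribution table or a statistics package) is left out.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    standard_error = stdev(values) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(values) - mu0) / standard_error, n - 1


# Hypothetical accuracies (%) for 15 embryologists; NOT data from the study.
embryologist_acc = [62.9, 70.1, 66.0, 71.1, 64.9, 68.0, 59.8, 72.2,
                    67.0, 69.1, 63.9, 73.2, 66.0, 70.1, 65.0]
cnn_acc = 75.26  # reported CNN accuracy on the 97 euploid embryos

t_stat, dof = one_sample_t(embryologist_acc, cnn_acc)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

A strongly negative statistic under this setup would indicate that the embryologists' mean accuracy lies well below the CNN's accuracy relative to the spread between embryologists.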

Figure 3—figure supplement 1.

Implantation potential and the relative implantation rates using the euploid embryo test set.

Discussion

Deep neural networks hold value in aiding clinical decision making and have received significant attention from the IVF community. The deep-neural network-based approach showcased here is an objective approach to one of the more subjective but important parts of a clinical IVF process: embryo selection for transfer (Bormann et al., 2020). Since over 80% of fertility clinics rely on non-time-lapse imaging systems as part of their clinical processes, neural network-based algorithms that rely purely on static single-timepoint images can effectively assist in decision making (Dolinko et al., 2017). In our study, we have evaluated two neural network-based approaches for improving embryo selection. Firstly, we have demonstrated that a deep-neural network in combination with a genetic algorithm (CNNg) can yield a continuous score that represents the quality of the embryo and that objective orders of transfer can be determined for a given set of embryos using such scores. The ranking algorithm studied here was able to consistently select embryos of the highest available morphological quality. Although the network was trained to classify embryos based on their quality, it performed well even in differentiating between embryos of the same class when combined with a genetic algorithm. The benefit of such systems is particularly evident in cases where selections made by the clinic/embryologist, although of similar grade, resulted in lower overall transfer success rates. Our networks focused only on morphological features for embryo quality assessments due to data scarcity. The network’s learning can be compounded with data from additional timepoints, morphokinetics, and patient and cycle-specific information for more personalized IVF predictions and outcomes. Recently, Tran et al. studied the use of a deep-learning model (IVY) that can analyze whole time-lapse videos instead of specific time points for fetal heartbeat prediction (Tran et al., 2019).
However, the study was flawed since embryos with unknown outcomes (non-transferred embryos) were considered as negative outcome cases, which made up most of their dataset (~90%). The heavy class bias in their dataset and improper study design severely limit any conclusions that can be drawn from the work. A major hurdle for the development of networks capable of analyzing multi-timepoint images together with additional patient-specific information is the limited availability of diversified data with known clinical outcomes. During training, the lack of availability of such data prevents the networks from effectively learning relevant outcome-associated patterns in data. The need for data scales with the complexity of the task and the number of variables introduced. While this work focuses primarily on the utility of a deep-learning algorithm for embryo evaluations at 113 hpi, it is also possible to develop similar networks for embryo evaluations at different timepoints, provided that sufficient data with matched outcomes/annotations are available. We have evaluated a similar network for use with cleavage-stage embryos (70 hpi) and showed that deep-learning approaches can outperform trained embryologists in certain tasks such as embryo selection (Thirumalaraju et al., 2019; Kanakasabapathy et al., 2020). A major concern in any clinical practice, however, is the loss of viable embryos due to system errors. Therefore, the AI-based embryo selection algorithm reported here does not make any suggestion on discarding embryos. All embryos assessed by the CNNg in the selection process may be cryopreserved as per clinical practice. Thereby, our approach will not negatively affect the cumulative pregnancy rate since viable embryos will not be lost. However, it may improve the pregnancy rate, as the system may be able to improve the chance of achieving a pregnancy faster with fewer embryos transferred.
Furthermore, it is important to note that in its current stage this system is intended to act only as an assistive tool for embryologists. The embryologists can include the system’s prediction to make better judgments during embryo selection. The scores provided by the algorithm are continuous, but it can also be easily modified to present its scoring results in either a binary or a more categorical format. Clinically, besides morphological features, various other important metrics and parameters are considered by embryologists at the time of decision making, such as the ploidy status of the transferable embryos. PGS-verified euploid embryos have been shown to possess a higher probability of successful outcome but cost a hefty premium on top of the cycle costs at most fertility centers in the United States (Drazba et al., 2014). Furthermore, for patients with two or more euploid embryos, additional assessments of embryo morphology are required to select the best embryo for transfer, since euploidy does not inherently guarantee implantation. Thus far, to the best of our knowledge, no system, deep-learning-based or otherwise, has been shown to be capable of differentiating between euploid blastocysts based on their capacity for implantation. Euploid embryos are usually of the highest available quality, and differentiating between them objectively and reliably through manual analysis can be extremely challenging. The CNN-based approach, through direct estimations of implantation potential from 113 hpi embryo morphology, outperformed trained embryologists in identifying implanting embryos from a set of PGS euploid embryos. This accomplishment exhibits the potential of artificial intelligence-based approaches to improve success rates in the IVF lab.
Our observations indicated that the system performed with a significantly better agreement with the actual implantation outcome for embryos with implantation scores closer to 1 or 0 (higher confidence). Furthermore, the comparison between the decisions made by 15 embryologists from different fertility centers in the US and the deep-neural network showed that neural networks can outperform embryologists in identifying embryos capable of implantation. Hence, by applying the suggestions of a CNN, a trained embryologist can improve their selection of the embryo with the highest implantation potential. Advances in artificial intelligence have fostered numerous applications that have the potential to improve standard-of-care in different fields of medicine. While other groups have also evaluated different use cases for machine learning in assisted reproductive medicine, this approach is novel in how it used a CNN trained on a large dataset to make predictions based on static images. The approach has shown the potential of CNNs to aid embryologists in selecting the embryo with the highest implantation potential, especially amongst high-quality euploid embryos. Although the current retrospective study shows that these systems can perform better than highly trained embryologists, randomized controlled trials are required before routine use in clinical practice is adopted.

Materials and methods

Data collection and preparation

Data were collected at the Massachusetts General Hospital (MGH) fertility center in Boston, Massachusetts. We used 3469 recorded videos of embryos collected from 543 patients with informed consent for research and publication, under approval for secondary research use by the Massachusetts General Hospital Institutional Review Board (IRB#2017P001339 and IRB#2019P002392). All the experiments were performed in compliance with the relevant laws and institutional guidelines of the Massachusetts General Hospital, Brigham and Women's Hospital, and Partners Healthcare. The videos were collected using a commercial time-lapse imaging system (Vitrolife Embryoscope). The imaging system used a Leica 20x objective that collected images at 10-min intervals under illumination from a single 635 nm LED. Each patient’s set of embryos was exported as videos (.avi) using the imaging system software. The videos of individual embryos were broken down into their respective frames to extract images from all timepoints post insemination. The images were identified by their timestamps, and only images collected at 113 ± 0.05 hr post insemination were processed and used in this study. The extracted images were 250 × 250 pixels and were cropped to 210 × 210 pixels. The cropping removed both the timestamps and identifiers present in the frame. All embryos used in the study were annotated using images from the fixed time-point (113 hpi) by senior-level embryologists with a minimum of 5 years of human IVF training. Annotations for embryo implantation were assigned based on clinical outcomes. Out-of-focus images were included in the datasets and used for both testing and training. Only images of embryos that were completely non-discernible were removed from the study as part of the data cleaning procedure.
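The frame selection and cropping steps described above can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, frames are represented as plain nested lists rather than decoded video frames, and the crop is assumed to be centered, since the text gives the input and output sizes but not the crop position.

```python
def select_frame(timestamps_h, target_h=113.0, tol_h=0.05):
    """Return the index of the frame closest to target_h within tol_h hours.

    Returns None when no frame falls inside the tolerance window.
    """
    best = None
    for i, t in enumerate(timestamps_h):
        dist = abs(t - target_h)
        if dist <= tol_h and (best is None or dist < abs(timestamps_h[best] - target_h)):
            best = i
    return best


def center_crop(image, out_size=210):
    """Crop a row-major image (list of rows) to out_size x out_size.

    A centered crop is an assumption made for illustration.
    """
    height, width = len(image), len(image[0])
    if out_size > height or out_size > width:
        raise ValueError("crop size exceeds image size")
    top = (height - out_size) // 2
    left = (width - out_size) // 2
    return [row[left:left + out_size] for row in image[top:top + out_size]]


# Frames arrive roughly every 10 min (about 0.167 hr), so at most one frame
# can fall inside the +/- 0.05 hr window around 113 hpi.
timestamps = [112.80, 112.97, 113.13]        # hypothetical timestamps in hours
frame = [[0] * 250 for _ in range(250)]      # placeholder 250 x 250 frame
index = select_frame(timestamps)
cropped = center_crop(frame) if index is not None else None
```

Because the tolerance window is narrower than the imaging interval, some embryos may have no frame at all inside the window, which is why the selection function returns None rather than falling back to the nearest frame.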

Hierarchical categorization

The two networks in this study used two categorization systems. The network focused on the rank ordering of embryos used a hierarchical categorization system. The embryo images at the 113 hpi time point were categorized into training classes 1 through 5, as described in detail elsewhere (Thirumalaraju et al., 2020). Briefly, Class 1 comprised degenerated embryos that did not begin compaction, while Class 2 embryos were those that reached the morula stage by 113 hpi. Classes 1 and 2 together formed the ‘non-blastocysts’ inference class. Class 3 embryos exhibited features of an early blastocyst, highlighted by the presence of a blastocoel cavity and a thick zona pellucida, but lacked expansion. Class 4 embryos were blastocysts with blastocoel cavities occupying over half of the embryo volume, but either their inner cell mass (ICM) or trophectoderm (TE) was of poor quality. They were non-freezable quality embryos (<3CC), where 3 represents the degree of expansion (range 1–6) and C represents the quality of the ICM and TE (range A–D), respectively. Class 5 embryos, however, met cryopreservation criteria (>3CC) and included full blastocysts to hatched blastocysts. Classes 3, 4, and 5 together formed the ‘blastocysts’ inference class.

The two inference classes are used because the differentiation of blastocysts and non-blastocysts is a universally accepted categorization that is relevant to embryologists, while the five-class categorization is specific to the neural network training, performance and evaluation (Thirumalaraju et al., 2020). Networks that were focused on estimating an embryo’s implantation potential used a two-class training and inference system: positive for implantation and negative for implantation.
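As an illustration of how the five training classes collapse into the two inference classes, the sketch below maps a five-element probability vector to its predicted training class and the corresponding inference class. Taking the argmax of the network output is an assumption made for this example, and the names `INFERENCE_CLASS` and `predict` are hypothetical.

```python
import numpy as np

# Training class -> inference class, following the grouping given in the text.
INFERENCE_CLASS = {
    1: "non-blastocyst",
    2: "non-blastocyst",
    3: "blastocyst",
    4: "blastocyst",
    5: "blastocyst",
}


def predict(class_probabilities):
    """Return (training_class, inference_class) for one embryo.

    class_probabilities: five values, one per training class 1..5.
    The training class is taken as the argmax (an assumption of this sketch).
    """
    probs = np.asarray(class_probabilities, dtype=float)
    if probs.shape != (5,):
        raise ValueError("expected exactly five class probabilities")
    training_class = int(np.argmax(probs)) + 1  # classes are numbered from 1
    return training_class, INFERENCE_CLASS[training_class]
```

For example, `predict([0.05, 0.10, 0.20, 0.25, 0.40])` returns `(5, "blastocyst")`.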

Neural network training for 113 hpi

The 113 hpi evaluation dataset included images of 2440 embryos categorized across five classes post-cleaning based on their clinical annotations made at 113 hpi. Our training set for this classification task used 1188 images, with a validation dataset of 510 images obtained at 113 hpi. With the availability of unskewed validation sets prior to augmentation, we used a data generator during training, which performed random rotations and flips across all classes on the fly. The resulting network, which performed with an accuracy of 90.97%, was used in this study in combination with our genetic algorithm. The genetic algorithm was trained and tested with the training data prior to testing it with our independent test data. No human interaction was required/performed once the images were provided to the system during testing, as the entire process was fully automated. The independent non-overlapping test set consisted of 742 images of embryos originating from 97 patients. The selections were compared with embryologist selections.

The network was also trained to classify embryos with successful and unsuccessful implantation. For this task, 281 embryo images with known implantation outcomes were used for training. Implantation signifies the attachment of a blastocyst into the endometrium. The status of implantation was clinically verified by ultrasound ~6 weeks after embryo transfer. Ninety-seven euploid embryos were evaluated by 15 embryologists, including director-level embryologists, from five different fertility centers.
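The on-the-fly augmentation mentioned above can be sketched with a small generator. The text does not name the framework or the rotation angles used, so this example assumes rotations by multiples of 90 degrees (which keep a square 210 × 210 image the same shape) and uses hypothetical function names.

```python
import numpy as np


def augment(image, rng):
    """Apply a random 0/90/180/270 degree rotation and random flips to one image."""
    out = np.rot90(image, k=int(rng.integers(0, 4)))
    if rng.random() < 0.5:
        out = np.fliplr(out)  # horizontal flip
    if rng.random() < 0.5:
        out = np.flipud(out)  # vertical flip
    return out


def batch_generator(images, labels, batch_size, rng):
    """Yield (augmented_images, labels) batches indefinitely.

    images: array of shape (n, H, W) with H == W; labels: array of shape (n,).
    batch_size must not exceed n because sampling is without replacement.
    """
    n = len(images)
    while True:
        idx = rng.choice(n, size=batch_size, replace=False)
        yield np.stack([augment(images[i], rng) for i in idx]), labels[idx]
```

Because the transformations are drawn anew for every batch, the network rarely sees exactly the same pixel arrangement twice, while the stored dataset itself is left unchanged.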

Embryo selection algorithm development

A genetic algorithm was designed to perform selections in combination with the neural network. The genetic algorithm component utilizes the probability scores of every embryo belonging to each of the five different classes to generate a transfer score that can be used to effectively identify the best embryo available in a cohort. For system evaluations, we used an independent set of embryos (100 patients; 2–12 embryos per patient), with no overlap with the training data set used for any prior exercise. The patient cohorts were chosen under the following criteria: (i) each patient embryo cohort had to possess at least two 2PN embryos, and (ii) at least one embryo of the patient embryo cohort developed to blastocyst stage by 113 hpi.
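A minimal sketch of the transfer score follows. It uses the final weights for classes 1 through 5 that the authors report in their response to the reviewers; the function names and the example cohort are hypothetical. This section refers to probability scores while the genetic algorithm section refers to output logits, so the input is simply called the per-class output here.

```python
import numpy as np

# Final weights for classes 1..5 as reported in the author response of this paper.
FINAL_WEIGHTS = np.array(
    [-10.01226347, -3.63697951, -3.32090987, 2.15367795, 2.8715555]
)


def transfer_scores(class_outputs, weights=FINAL_WEIGHTS):
    """Dot product of each embryo's five class outputs with the weight vector."""
    outputs = np.asarray(class_outputs, dtype=float)
    if outputs.ndim != 2 or outputs.shape[1] != weights.shape[0]:
        raise ValueError("expected an (n_embryos, 5) array")
    return outputs @ weights


def rank_cohort(class_outputs, weights=FINAL_WEIGHTS):
    """Return embryo indices ordered from highest to lowest transfer score."""
    scores = transfer_scores(class_outputs, weights)
    return np.argsort(-scores, kind="stable")


# Hypothetical cohort of three embryos (rows) with outputs for classes 1..5.
cohort = [
    [0.90, 0.05, 0.03, 0.01, 0.01],  # most likely degenerated
    [0.01, 0.04, 0.15, 0.50, 0.30],  # most likely Class 4
    [0.00, 0.02, 0.08, 0.20, 0.70],  # most likely Class 5
]
order = rank_cohort(cohort)  # the embryo at order[0] would be selected first
```

With these weights, probability mass on Class 1 is penalized heavily and mass on Classes 4 and 5 is rewarded, so the third embryo in the example ranks first.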

Genetic algorithm

We trained a genetic algorithm to select the morphologically highest quality embryo from a given cohort. The algorithm has four phases: initialization, selection, crossover, and mutation. The classified embryos for each patient were sorted according to their identifier numbers allotted by the deep neural network. During initialization, a population of weights was generated at random. The population had a size of 100, with a 5 × 1 matrix representing each weight. Each weight defined a possible solution for the rank-ordering of embryos based on their quality using the five training classes.

The dot product of the weights with the output logits provided by the CNN was used in the calculation of the fitness. The fitness function checked whether the embryo picked by a given solution (specimen) belonged to the highest class available within the tested cohort of patient embryos. If the selected embryo belonged to the top class, the score was increased; if it did not, the score was not modified. The algorithm ran multiple cycles to select the optimal set of weights for achieving a suitable rank order of embryos based on their quality. At each cycle, all the weight sets in the given population were used to rank-order embryos within the training set. After iterating over all patients’ cohorts, the total scores were used to select the best 20 weight sets of the given population, which were taken for crossover and mutation to repeat the process.

These selected weights (specimens) were then bred with each other with a probability set to 20%. Two specimens were randomly selected from the top pool and a random binary 5 × 1 matrix was created, where 1 indicates that the given element should be switched and 0 indicates that it should not be switched.
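The four phases of this section can be sketched roughly as follows, including the replacement and mutation steps described in the remainder of the section. This is an illustration under stated assumptions and not the authors' implementation: the initial and mutation values are drawn from a standard normal distribution, ten breeding attempts are made per cycle, a cycle cap (`max_cycles`) is added so the loop always terminates, and the best-scoring weight set is returned (on convergence all members score the same, so this matches picking one at random). All names are hypothetical.

```python
import numpy as np


def fitness(weights, cohorts):
    """Count cohorts whose top-scored embryo belongs to the highest class present.

    cohorts: list of (outputs, classes) pairs; outputs has shape (n, 5) and
    classes holds the annotated training class (1..5) of each embryo.
    """
    score = 0
    for outputs, classes in cohorts:
        classes = np.asarray(classes)
        pick = int(np.argmax(np.asarray(outputs, dtype=float) @ weights))
        if classes[pick] == classes.max():
            score += 1
    return score


def run_ga(cohorts, pop_size=100, n_top=20, p_cross=0.2, p_mut=0.2,
           n_mut=5, max_cycles=200, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: random population of 5-element weight vectors.
    population = rng.normal(size=(pop_size, 5))
    for _ in range(max_cycles):
        scores = np.array([fitness(w, cohorts) for w in population])
        if np.all(scores == scores[0]):  # whole population has the same score
            break
        # Selection: keep the n_top best-scoring weight sets.
        top = population[np.argsort(-scores)[:n_top]].copy()
        # Crossover: swap the masked elements of two random specimens in place,
        # so the offspring replace their parents in the top group.
        for _ in range(n_top // 2):
            if rng.random() < p_cross:
                i, j = rng.choice(n_top, size=2, replace=False)
                mask = rng.integers(0, 2, size=5).astype(bool)
                top[i, mask], top[j, mask] = top[j, mask].copy(), top[i, mask].copy()
        # Mutation: n_mut children per specimen, each perturbed with probability p_mut.
        children = []
        for w in top:
            for _ in range(n_mut):
                child = w.copy()
                if rng.random() < p_mut:
                    child = child + rng.normal(size=5)
                children.append(child)
        population = np.array(children)  # n_top * n_mut = 100 with the defaults
    final_scores = np.array([fitness(w, cohorts) for w in population])
    return population[int(np.argmax(final_scores))]
```

With the default settings, 20 selected specimens times five children each reproduce the population of 100 mentioned in the text.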
The new specimens produced by crossover replaced their parents in the top selected group. Otherwise, when no breeding took place, the weights remained the same. After breeding, each specimen from the top selected group was mutated to give five mutations by adding a random float 5 × 1 matrix with a probability of 20%. These mutations were then added to the new population and the selection step was repeated with the new population of 100. The genetic algorithm ran until the entire population converged to the same score, after which a random weight was selected from the population as the final weight. The final generated weights were then used to test the embryo cohorts within our test set.

Decision letter

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary: This paper is a successful example of an emerging trend in medicine in which machine learning algorithms are employed to integrate clinical data in a rigorous and consistent manner in ways that outperform trained medical specialists in making critical decisions – in this case selecting embryos for implantation following in-vitro fertilization.

Decision letter after peer review:

Thank you for submitting your article "Performance of a deep-learning based neural network in selection of human blastocysts for implantation" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Michael Eisen as the Senior and Reviewing Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
As the editors have judged that your manuscript is of interest, but as described below that additional work is required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

In this work, the authors demonstrate a CNN-based algorithm for the selection of embryos at 113 hours post-insemination. Although a number of machine learning algorithms for embryo classification have been reported, the value of this work is that it demonstrated improved selection of embryos at 113 hpi. There were, however, several important issues raised during review and discussion that need to be addressed in a revised manuscript.

1) There are several types of classification and inference performed in the study, but it was often hard to follow the order and logic of each classification. It would help immensely to have a figure that outlines the overall logic of the classification process – what the objective of the classification/inference is and how it fits into the overall embryo selection process. As it is, Figure 1 doesn't really accomplish this, and it made it difficult to follow the manuscript in places.

2) In the Introduction, discussion on the importance of evaluating embryos based on transfer outcomes needs to be strengthened. In the current version, the authors discuss the lack of time-lapse imaging systems in fertility clinics to justify the novelty of the work. However, to me, the major contribution of this work is the application of CNN for evaluating embryo quality at 113 hpi.

3) There was not a clear justification for why 113 hpi was used or what would be expected if other times were used.

4) It was not fully clear to the reviewers how the ground truth was established and whether there is any data on how good this ground truth is. Similarly it was not entirely clear whether any of the human annotation was used in the predictions as opposed to in training. We assume not as the authors describe this as a fully automated system, but this needs to be clarified.

5) The description of the images used should be strengthened. As it is, the central data for the paper are not adequately described with respect to image quality, completeness, etc.

6) The authors distinguish between 5 classes: 1-2 are non-blastocyst and 3-5 are blastocyst classes. They claim 90% accuracy in separating these. This should be a trivial task unless many embryos are between Morula and Early Blast (classes 2 and 3), but they don't provide a table to show the class distribution. A confusion matrix would be informative.

7) For 113 hpi blastocyst selection, the authors only report the accuracy values for both SET and DET. However, it is important to know whether the algorithm marks an embryo as blastocyst when it really is not (i.e. false positive) or as not-blastocyst when it really is (i.e. false negative). So, can the authors add a confusion matrix to show the data for all 4 cases?

8) The authors use a genetic algorithm (GA) to rank embryos. While the supplement provides a clearer description of what they do, the "genetic algorithm" part of Figure 1A is a bit misleading. In general, the Genetic Algorithm is not described well. To calculate the ranking, they multiply the probability scores from the CNN with a 5x1 weight matrix. The GA is used to optimize the weight matrix, and is not used during inference. It would be interesting to see the trained weights of this matrix – it would explain how much each class contributes to ranking.

9) It is unclear why the 5 classes were reduced to 2 for some of the analyses.

10) There is inconsistent naming of models. Early on they use a CNN, they later combine the CNN and genetic algorithm (subsection “Evaluation of embryo selection based on embryo quality”, first paragraph) and after that begin using the term system. It's not clear whether the latter two are the same or different.

11) They have an undefined term HQB which makes it hard to understand how experiments in the last paragraph of the subsection “Evaluation of embryo selection based on embryo quality” differ from each other.

12) In the subsection "Evaluation of selection using implantation outcomes", they do not provide any rationale for using only fresh embryos. They simply state that they do, and later combine the dataset with frozen embryos.

13) They aren't clear about what the implantation potential is, possibly the probability from the softmax of the CNN.

14) 5 embryos originally selected by the model had known outcome in a subsequent frozen transfer, and 4 of them led to successful implantation. This is nice, but what about the remaining 49 with an unknown outcome? I don't think any conclusion can be drawn based on the 5 with known outcomes. More data like this is needed.

15) In the Materials and methods section, the authors write that 3469 recorded videos of embryos were collected. How have the images at fixed time-points been obtained from these videos? Have you processed the images in any manner? The authors should describe more in detail how the data (i.e. images) were processed prior to feeding them to CNN for training.
16) Some of the acronyms (HQB, 3CC) appear in the Results section without full names. Although they are written in the Materials and methods sections, considering the order of the manuscript (Results then Materials and methods as currently it is), they should appear in the Results section.

17) In the Discussion, it would be helpful if the authors commented, based on their results, on the potential advantage or limitation of using a single static image for this purpose, as opposed to several images or a video clip.

18) The authors should cite and discuss the publication by Tran et al., 2019, and compare/contrast how this current submission differs and or adds to the existing literature.

19) The sentences near the end of the subsection “Evaluation of Euploid embryos based on their implantation potential” are important but are grammatically incorrect and therefore hard to understand. I get what they are trying to say but it's just poorly worded. On this note, there are a number of grammatical errors throughout the paper.

20) Citations basically stop at the Discussion section and there are some statements that definitely need literature support.

Author response

[…]

1) There are several types of classification and inference performed in the study, but it was often hard to follow the order and logic of each classification. It would help immensely to have a figure that outlines the overall logic of the classification process – what the objective of the classification/inference is and how it fits into the overall embryo selection process. As it is, Figure 1 doesn't really accomplish this, and it made it difficult to follow the manuscript in places.

Thank you for your comment. We have now updated Figure 1 of the revised manuscript to reflect the work described more accurately.

2) In the Introduction, discussion on the importance of evaluating embryos based on transfer outcomes needs to be strengthened. In the current version, the authors discuss the lack of time-lapse imaging systems in fertility clinics to justify the novelty of the work. However, to me, the major contribution of this work is the application of CNN for evaluating embryo quality at 113 hpi.

Thank you for your comment. The major contribution of our work is the development and evaluation of deep-learning models that can help identify embryos capable of implantation and identify the top-quality embryos from embryo sets. We have evaluated two approaches using static 113 hpi embryo images in this work: (i) the utility of a CNN in combination with a genetic algorithm for rank-ordering embryos based on their quality and (ii) the utility of a CNN for the direct evaluation and estimation of the implantation potential of an embryo based on its morphology. While our method of utilizing static images for both of the approaches greatly contributes towards improving the accessibility to AI technologies for most fertility centers, our development of neural networks for the evaluation of euploid embryos is unique. Euploid embryos are usually very high-quality embryos and thus are difficult to differentiate between, based on their implantation potential, through visual inspection. However, since currently there exists no assay or technology that can help differentiate between these embryos, embryologists rely on their intuition and judgment. In this work, we have shown that the network designed to measure implantation potential based on the embryo morphology at 113 hpi outperformed trained embryologists in identifying implanting embryos from a set of PGS euploid embryos. We have now modified the Introduction of the manuscript to further clarify the significance of the reported work.

3) There was not a clear justification for why 113 hpi was used or what would be expected if other times were used.
We have now provided a rationale for using 113 hpi embryo images in the Introduction of the revised manuscript. Other time points can also be used; however, the networks need to be trained using data from that timepoint since embryos are biological cells and their morphology changes rapidly in a matter of hours.

“Conventionally, embryo transfers are performed at the cleavage or the blastocyst stage of development. […] Therefore, in this study, we have investigated the use of a CNN pre-trained with 1.4 million ImageNet images and transfer-learned using 2440 static human embryo images recorded at a single time-point of 113 hours post insemination (hpi) for the development of neural networks that can help identify embryos capable of implantation and for identifying the top quality embryos (Figure 1).”

“While this work focuses primarily on the utility of deep-learning algorithm for embryo evaluations at 113 hpi, it is also possible to develop similar networks for embryo evaluations at different timepoints, provided that sufficient data with matched outcomes/annotations are available. We have evaluated a similar network for use with cleavage-stage embryos (70 hpi) and showed that deep-learning approaches can outperform trained embryologists in certain tasks such as embryo selection (Thirumalaraju et al., 2019).”

4) It was not fully clear to the reviewers how the ground truth was established and whether there is any data on how good this ground truth is. Similarly it was not entirely clear whether any of the human annotation was used in the predictions as opposed to in training. We assume not as the authors describe this as a fully automated system, but this needs to be clarified.

The ground truth of embryo quality grading was established through manual grading annotations provided by embryologists. The ground truth in the implantation study is based on clinical implantation outcomes that were established through ultrasound inspection, 6 weeks after embryo transfer. No user input was used/needed by the algorithm in deciding the rank order and the implantation potential of the embryos during the test phase. We have now provided additional clarifications in the Materials and methods section of the revised manuscript.

“All embryos used in the study were annotated using images from the fixed time-points (113 hpi) by senior-level embryologists with a minimum of 5 years of human IVF training. Annotations for embryo implantation were assigned based on clinical outcomes.”

“Implantation signifies the attachment of a blastocyst into the endometrium. The status of implantation was clinically verified by ultrasound ~6 weeks after embryo transfer.”

“No human interaction was required/performed once the images were provided to the system during testing, as the entire process was fully automated.”

5) The description of the images used should be strengthened. As it is, the central data for the paper are not adequately described with respect to image quality, completeness, etc.

We have now provided additional information on the image collection and pre-processing prior to use with CNN in the Materials and methods section of the revised manuscript.

“The videos were collected using a commercial time-lapse imaging system (Vitrolife Embryoscope). […] Only images of embryos that were completely non-discernable were removed from the study as part of the data cleaning procedure.”

6) The authors distinguish between 5 classes: 1-2 are non-blastocyst and 3-5 are blastocyst classes. They claim 90% accuracy in separating these. This should be a trivial task unless many embryos are between Morula and Early Blast (classes 2 and 3), but they don't provide a table to show the class distribution. A confusion matrix would be informative.

We have now provided a confusion matrix (Figure 2—figure supplement 1) in the revised manuscript.

7) For 113 hpi blastocyst selection, the authors only report the accuracy values for both SET and DET. However, it is important to know whether the algorithm marks an embryo as blastocyst when it really is not (i.e. false positive) or as not-blastocyst when it really is (i.e. false negative). So, can the authors add a confusion matrix to show the data for all 4 cases?

The SET and DET selections are performed after rank ordering with the scores obtained from the combination of the CNN SoftMax probabilities and the genetic algorithm weights. Essentially, the use of the genetic algorithm (GA) helps in selecting only embryos of the highest quality and confidence by generating a rank order, and thereby aids in minimizing the error of the CNN system. Therefore, a confusion matrix is not suitable to understand the overall algorithm. However, a confusion matrix can help better understand the CNN classifier (without the genetic algorithm). We have now provided the confusion matrix of the CNN in Figure 2—figure supplement 1.

8) The authors use a genetic algorithm (GA) to rank embryos. While the supplement provides a clearer description of what they do, the "genetic algorithm" part of Figure 1A is a bit misleading. In general, the Genetic Algorithm is not described well. To calculate the ranking, they multiply the probability scores from the CNN with a 5x1 weight matrix. The GA is used to optimize the weight matrix, and is not used during inference. It would be interesting to see the trained weights of this matrix – it would explain how much each class contributes to ranking.

We have now provided the final weights generated by the genetic algorithm in the Results section. The genetic algorithm tries to heavily de-emphasize class 1 (arrested/degenerate) embryos. Interestingly, it also tries to avoid selecting embryos that are early stage blastocysts. Unsurprisingly, it prioritizes features that are associated with the highest quality embryos during selection.

“…once we established that the network was capable of categorizing embryos based on their morphologies with high accuracy, we utilized the network defined probability values of the embryos, belonging to each of the 5 training classes, with a genetic algorithm to rank order the embryos for transfer. […] The final weights utilized in this study were -10.01226347, -3.63697951, -3.32090987, 2.15367795, and 2.8715555 for classes 1 through 5, respectively.”

9) It is unclear why the 5 classes were reduced to 2 for some of the analyses.

The 2 deep learning models used two classification systems. We have now clarified them in more detail in the Materials and methods section of the revised manuscript.

“The two networks in this study used two categorization systems. The network focused on the rank ordering of embryos used a hierarchical categorization system. […] Networks that were focused on estimating an embryo’s implantation potential used a two-class training and inference system- positive for implantation and negative for implantation.”

10) There is inconsistent naming of models. Early on they use a CNN, they later combine the CNN and genetic algorithm (subsection “Evaluation of embryo selection based on embryo quality”, first paragraph) and after that begin using the term system. It's not clear whether the latter two are the same or different.

In this study, we have evaluated two deep learning-based approaches for identifying embryos capable of implantation. One version of the neural networks was designed to classify and rank order embryos based on their morphological quality, and the other version was designed to classify embryos based on the implantation potential. The two networks share a common Xception architecture, but the classification layers are specific to each task. We have now expanded the Introduction section of the revised manuscript to improve clarity on the different approaches evaluated in this study.

“Therefore, in this study, we have investigated the use of a CNN pre-trained with 1.4 million ImageNet images and transfer-learned using 2440 static human embryo images recorded at a single time-point of 113 hours post insemination (hpi) for the development of neural networks that can help identify embryos capable of implantation and for identifying the top quality embryos (Figure 1). […] The CNN was trained and evaluated in identifying euploid embryos capable of implantation and the performance was compared against those of 15 embryologists from 5 different fertility centers across the United States of America.”

11) They have an undefined term HQB which makes it hard to understand how experiments in the last paragraph of the subsection “Evaluation of embryo selection based on embryo quality” differ from each other.

HQB stands for high-quality blastocysts, which is now defined in the Results section of the revised manuscript. We have also provided additional clarifications to improve the clarity on the rationale and design of the experiments that were performed.

“According to the American Society for Reproductive Medicine guidelines on the limits to the number of embryo transfer, 1 embryo is transferred for high prognosis patients with <37 years of age and 2 or more embryos are transferred for patients with >37 years of age as well as younger patients with low prognosis (Guidance on the limits to the number of embryos to transfer: a committee opinion, 2017). […] High-quality blastocysts are defined as embryos that met the freezing criteria (>3CC blastocyst grade; see Materials and methods) of the Massachusetts General Hospital (MGH) fertility clinic.”

12) In the subsection "Evaluation of selection using implantation outcomes", they do not provide any rationale for using only fresh embryos. They simply state that they do, and later combine the dataset with frozen embryos.
We have now included our rationale on why fresh cycles were used in evaluating the networks in the revised manuscript. “Typically, in a clinical IVF cycle, the top-quality embryo is selected from the cohort of available embryos and is transferred to the patient. […] The test dataset was retrospectively collected based on pre-defined selection criteria and evaluations of transfer outcomes were performed using fresh embryo transfer cycles.” “A limitation of a retrospective study is that not all embryos are transferred. Implantation outcomes of all embryos selected by the algorithm cannot be evaluated. Therefore, although the dataset was prepared not taking into consideration the availability of subsequent frozen cycle transfers, we investigated all patients from the test set who had subsequent embryo transfers using frozen embryos from the test set.” 13) They aren't clear about what the implantation potential is, possibly the probability from the softmax of the CNN. We have now explicitly stated in the Results section that implantation potential is the SoftMax output of the CNN. “The network was trained as a binary classifier and the SoftMax probability values outputted by the network was used as the embryo’s implantation potential.” 14) 5 embryos originally selected by the model had known outcome in a subsequent frozen transfer, and 4 of them led to successful implantation. This is nice, but what about the remaining 49 with an unknown outcome. I don't think any conclusion can be drawn based on the 5 with known outcomes. More data like this is needed. The frozen cycle transfers evaluated in this study was not one of the primary objectives. Typically, in a clinical IVF cycle, the top-quality embryo is selected from the cohort of available embryos and is transferred to the patient. 
Embryos of similarly high quality are often frozen, based on the freezing criteria used by the fertility center, for transfer in subsequent procedures for the same patient if needed. Frozen cycle transfers are not performed for all patients. Therefore, the retrospective test set of patient embryo cohorts (n=97) was selected, and evaluations of transfer outcomes were performed using fresh embryo transfer cycles. A limitation of such a retrospective study is that not all embryos are transferred. Therefore, although the dataset was prepared without taking into consideration the availability of subsequent frozen cycle transfers, we checked with the fertility center whether the patients of the test set had any subsequent embryo transfers using the frozen embryos from the test set and reported our findings.

15) In the Materials and methods section, the authors write that 3469 recorded videos of embryos were collected. How have the images at fixed time-points been obtained from these videos? Have you processed the images in any manner? The authors should describe in more detail how the data (i.e. images) were processed prior to feeding them to CNN for training.

We have now provided more information on the pre-processing that was performed on the input images prior to feeding them to the CNN for training and testing in the Materials and methods section of the revised manuscript. Briefly, the images were only cropped to remove timestamps and identifiers. No additional image pre-processing was performed.

“The imaging system used a Leica 20x objective that collected images at 10 min intervals under illumination from a single 635 nm LED. […] Only images of embryos that were completely non-discernable were removed from the study as part of the data cleaning procedure.”

16) Some of the acronyms (HQB, 3CC) appear in the Results section without full names. Although they are written in the Materials and methods section, considering the order of the manuscript (Results, then Materials and methods, as it currently stands), the full names should appear in the Results section.

We have now provided the full names of the acronyms at their first instance in the revised manuscript.

17) In the Discussion, it would be helpful if the authors commented, based on their results, on the potential advantage or limitation of using a single static image for this purpose, as opposed to several images or a video clip.

We have now discussed the benefits and challenges of using several images/video clips for embryo assessment in the revised manuscript.

“The network’s learning can be compounded with data from additional timepoints, morphokinetics, and patient and cycle-specific information for more personalized IVF predictions and outcomes…”

“A major hurdle for the development of networks capable of analyzing multi-timepoint images and with additional patient-specific information is the limited availability of diversified data with known clinical outcomes. […] The need for data scales with the complexity of the task and the number of variables introduced.”

18) The authors should cite and discuss the publication by Tran et al., 2019, and compare/contrast how this current submission differs from and/or adds to the existing literature.

We had cited the work by Tran et al. in the first version of the manuscript. However, in the revised manuscript we have now added a discussion around their work. Briefly, we mention that the work by Tran et al. was flawed due to improper study design and thus is of limited value. Furthermore, Tran et al. focus on utilizing only time-lapse videos, which restricts their approach to expensive, bulky, and rarely available instruments and prevents its use at most fertility centers in the US. In contrast, our approach focuses on single timepoint images, enabling potential access to critical future technology. Additionally, our work has shown, for the first time, the ability of a neural network to identify embryos capable of implantation amongst similar-quality euploid embryos.

Discussion:

“Recently, Tran et al. studied the use of a deep-learning model (IVY) that can analyze whole time-lapse videos instead of specific time points for fetal heartbeat prediction (Tran et al., 2019). […] A major hurdle for the development of networks capable of analyzing multi-timepoint images and with additional patient-specific information is the limited availability of diversified data with known clinical outcomes.”

“Clinically, besides morphological features, various other important metrics and parameters are considered by embryologists at the time of decision making such as taking into account the ploidy status of the transferable embryos. […] The CNN-based approach, through direct estimations of implantation potential from 113 hpi embryo morphology, outperformed trained embryologists in identifying implanting embryos from a set of PGS euploid embryos.”

19) The sentences near the end of the subsection “Evaluation of Euploid embryos based on their implantation potential” are important but are grammatically incorrect and therefore hard to understand. I get what they are trying to say but it's just poorly worded. On this note, there are a number of grammatical errors throughout the paper.

Thank you for your comment. We have now rephrased these sentences in the revised manuscript and have also added a supplementary figure for clarity. We have also corrected all typographical errors throughout the revised manuscript.

“Approximately 91% of euploid embryos with implantation potential scores of 0.80 or higher, and nearly 81% of embryos with implantation potential scores above 0.66 successfully implanted when transferred (Figure 3—figure supplement 1). […] These results suggest that the network’s implantation scores agree well with transfer outcomes even in high-quality euploid embryos.”

20) Citations basically stop at the Discussion section and there are some statements that definitely need literature support.

We have now added additional references in the Discussion section where they were needed.
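As an illustration of the single-timepoint pre-processing described in the response to comment 15 (images collected at 10 min intervals, analysis at 113 hpi, cropping only to remove timestamps and identifiers), the following is a minimal Python sketch and not the authors' pipeline. The function names, the `first_frame_hpi` parameter (the hpi at which recording began), the nearest-frame rule, and the assumption that overlays sit in rows at the top or bottom of the image are all hypothetical; the source does not state how frames were selected or where the overlays were located.

```python
from typing import List, Sequence

FRAME_INTERVAL_MIN = 10  # from the text: one image every 10 min
TARGET_HPI = 113         # from the text: single timepoint at 113 hpi


def nearest_frame_index(first_frame_hpi: float, n_frames: int,
                        target_hpi: float = TARGET_HPI,
                        interval_min: float = FRAME_INTERVAL_MIN) -> int:
    """Return the index of the frame acquired closest to target_hpi.

    first_frame_hpi is a hypothetical input: hours post-insemination
    at which the time-lapse recording started.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    offset_min = (target_hpi - first_frame_hpi) * 60
    idx = round(offset_min / interval_min)
    # Clamp so recordings that start late or end early still return a frame.
    return max(0, min(n_frames - 1, idx))


def crop_overlay(image: Sequence[Sequence[int]],
                 top: int = 0, bottom: int = 0) -> List[List[int]]:
    """Drop rows from the top and bottom of a row-major image.

    Assumes (hypothetically) that timestamps and identifiers occupy
    full-width bands at the image edges; pixel values are left untouched.
    """
    end = len(image) - bottom
    return [list(row) for row in image[top:end]]


if __name__ == "__main__":
    # Recording assumed to start at 18 hpi with 700 frames available.
    print(nearest_frame_index(first_frame_hpi=18.0, n_frames=700))
    toy = [[0, 0], [5, 6], [7, 8], [9, 9]]  # first and last rows mimic overlays
    print(crop_overlay(toy, top=1, bottom=1))
```

Keeping the cropping step free of any rescaling or filtering mirrors the statement that no additional image pre-processing was performed.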

31 in total

1. Interobserver and intraobserver variation in day 3 embryo grading.

Authors: Allison E Baxter Bendus; Jacob F Mayer; Sharon K Shipley; William H Catherino
Journal: Fertil Steril Date: 2006-10-30 Impact factor: 7.329

2. A critical appraisal of time-lapse imaging for embryo selection: where are we and where do we need to go?

Authors: Catherine Racowsky; Peter Kovacs; Wellington P Martins
Journal: J Assist Reprod Genet Date: 2015-07-01 Impact factor: 3.412

Review 3. Time-lapse microscopy and image analysis in basic and clinical embryo development research.

Authors: C Wong; A A Chen; B Behr; S Shen
Journal: Reprod Biomed Online Date: 2012-11-20 Impact factor: 3.828

4. Embryo implantation after human in vitro fertilization: importance of endometrial receptivity.

Authors: R J Paulson; M V Sauer; R A Lobo
Journal: Fertil Steril Date: 1990-05 Impact factor: 7.329

5. Dermatologist-level classification of skin cancer with deep neural networks.

Authors: Andre Esteva; Brett Kuprel; Roberto A Novoa; Justin Ko; Susan M Swetter; Helen M Blau; Sebastian Thrun
Journal: Nature Date: 2017-01-25 Impact factor: 49.962

6. Development and evaluation of inexpensive automated deep learning-based imaging systems for embryology.

Authors: Manoj Kumar Kanakasabapathy; Prudhvi Thirumalaraju; Charles L Bormann; Hemanth Kandula; Irene Dimitriadis; Irene Souter; Vinish Yogesh; Sandeep Kota Sai Pavan; Divyank Yarravarapu; Raghav Gupta; Rohan Pooniwala; Hadi Shafiee
Journal: Lab Chip Date: 2019-11-22 Impact factor: 6.799

7. A review on automatic analysis of human embryo microscope images.

Authors: E Santos Filho; J A Noble; D Wells
Journal: Open Biomed Eng J Date: 2010-10-11

8. Deep learning enables robust assessment and selection of human blastocysts after in vitro fertilization.

Authors: Zev Rosenwaks; Olivier Elemento; Nikica Zaninovic; Iman Hajirasouliha; Pegah Khosravi; Ehsan Kazemi; Qiansheng Zhan; Jonas E Malmsten; Marco Toschi; Pantelis Zisimopoulos; Alexandros Sigaras; Stuart Lavery; Lee A D Cooper; Cristina Hickman; Marcos Meseguer
Journal: NPJ Digit Med Date: 2019-04-04

9. Intra- and inter-observer analysis in the morphological assessment of early-stage embryos.

Authors: Goedele Paternot; Johanna Devroe; Sophie Debrock; Thomas M D'Hooghe; Carl Spiessens
Journal: Reprod Biol Endocrinol Date: 2009-09-29 Impact factor: 5.211

10. Consistency and objectivity of automated embryo assessments using deep neural networks.

Authors: Charles L Bormann; Prudhvi Thirumalaraju; Manoj Kumar Kanakasabapathy; Hemanth Kandula; Irene Souter; Irene Dimitriadis; Raghav Gupta; Rohan Pooniwala; Hadi Shafiee
Journal: Fertil Steril Date: 2020-04 Impact factor: 7.329
