1 Introduction
Infertility is a common condition: roughly one in six couples faces some degree of infertility, defined as failing to conceive naturally over the course of 1 year. Male infertility factors contribute to approximately half of all cases.[1] In vitro fertilization (IVF) has made parenthood possible for many people who could not conceive otherwise. IVF enables fertilization of a female egg by a male sperm outside the female body. The selection of sperm cells is especially crucial for intracytoplasmic sperm injection (ICSI), a common type of IVF, in which the clinician chooses a single sperm cell using a micropipette and injects it into the female egg in a dish. Approximately 10 eggs are obtained in each IVF cycle through surgery following the woman’s hormonal treatment, and it is not uncommon that no fertilized egg develops into a good-quality embryo.[2] Unfortunately, far less effort is invested in analyzing sperm cells for IVF sperm selection than in analyzing eggs. The World Health Organization (WHO) provides criteria for male fertility evaluation based on imaging of the sperm sample.[3] These criteria mainly consist of evaluating the sperm morphology, motility, and DNA fragmentation status. While motility assessment is done on live sperm cells swimming in a dish, internal morphological assessment and DNA fragmentation assays are typically performed on fixed and stained sperm cells, and on different portions of the sample; hence, different evaluations are done on different cells. Certain sperm cells may possess WHO2010-normal morphology, but not motility, or may possess both WHO2010-normal morphology and motility, but fragmented DNA, causing defective embryo development or delivery.[4] Current tests cannot distinguish between these scenarios, leading to a lack of consistency in sperm evaluation and selection performed by different clinicians, as well as a large margin of error, even when automated analysis is performed.
Deep convolutional neural networks (CNNs) have recently proven to be an efficient tool for image analysis and classification.[5–9] The model is generally built from sequential layers, each providing a nonlinear mapping of the previous layer output to the following layer. Recently, generative adversarial neural networks (GANs)[10] have been successfully used for virtual staining of microscopic images.[11–15] These neural networks include a generator model and a discriminator model, where the generator takes random noise and maps it to an image, and the discriminator classifies images as real or generated.[16] Automatic classification of sperm cells by deep learning could become the next gold standard. However, it is still difficult to obtain reliable automatic sperm classifiers or virtual staining using only qualitative 2D images as an input. The biological mechanisms that connect sperm movement, morphology, and contents to fertilization potential and normal pregnancy are not yet fully understood.[17, 18]
In the absence of staining, sperm cells are nearly transparent under bright-field microscopy, as their optical properties differ only slightly from those of their surroundings, resulting in a weak image contrast. An internal contrast mechanism that can be used when imaging sperm cells without staining is their refractive index.[19–21] Phase imaging creates stain-free quantitative image contrast based on the optical path delay (OPD) induced in the light beam as it interacts with the sample, which can be recorded by interferometry. Conventional phase contrast imaging methods for sperm cells, such as Zernike phase contrast microscopy, differential interference contrast (DIC) microscopy, and Hoffman modulation contrast microscopy, are not quantitative; thus, they do not enable interpretation of the resulting phase images in terms of the quantitative optical thickness of the sperm cell. In addition, these techniques present significant imaging artifacts, especially near the cell edges. Quantitative phase imaging records the full complex wave front of the sample, including the optical thickness map of the cell, which is equal to the integral of the refractive-index values across the cell thickness. This map is proportional to the cell dry mass surface density, thereby providing cellular parameters that have not been available to clinicians.[22] Until recently, quantitative phase imaging implementations were limited to optics labs, due to the bulkiness of the optical setup, difficulty of alignment, and sensitivity to mechanical vibrations. In recent years, we have made significant efforts and succeeded in making these wave front sensors applicable and affordable for direct clinical use.[21, 23–25]
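The relation between the measured phase, the OPD, and the dry mass can be illustrated numerically. The following is a minimal sketch and not the processing used in this work: the wavelength, the pixel area, and the specific refraction increment of 0.18 mL/g (equivalently 0.18 µm³/pg, a value commonly quoted for cellular dry mass) are assumptions introduced only for illustration, as are the function names.

```python
import numpy as np

def phase_to_opd(phase_rad, wavelength_um):
    """Convert an unwrapped phase map (radians) to optical path delay (micrometers)."""
    return phase_rad * wavelength_um / (2.0 * np.pi)

def dry_mass_pg(opd_um, pixel_area_um2, alpha_um3_per_pg=0.18):
    """Integrate the OPD over the cell area and divide by the assumed
    specific refraction increment to estimate the dry mass (picograms)."""
    return opd_um.sum() * pixel_area_um2 / alpha_um3_per_pg

# Toy phase map: a 2 x 2 pixel "cell" with a phase delay of pi radians.
phase = np.zeros((4, 4))
phase[1:3, 1:3] = np.pi

opd = phase_to_opd(phase, wavelength_um=0.633)   # hypothetical wavelength
mass = dry_mass_pg(opd, pixel_area_um2=0.01)     # hypothetical pixel area
```

Because the OPD is a physical quantity, sums and variances taken over it (for example, the dry mass and the texture variance used later as features) are meaningful in a way that intensity values from nonquantitative phase contrast images are not.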
DNA fragmentation is a critical biomarker in sperm cells. Studies have shown that even in normal semen samples, ≈20% of the sperm cells have fragmented DNA, which becomes worse with age.[26] Sperm DNA fragmentation has been associated with reduced fertilization rates, reduced embryo quality, reduced pregnancy rates, and increased miscarriage rates. Thus, sperm cells with fragmented DNA should not be selected for IVF. DNA fragmentation is not well correlated with morphology and motility[27] and cannot be imaged at the moment without staining. Detecting DNA fragmentation requires molecular staining, which cannot be carried out during IVF.
In this work, we present a new approach for measuring the DNA fragmentation status of live and unstained dynamic sperm cells, at the individual-cell level, in parallel to measuring the cell motility and morphology as though they have been stained. This is achieved by using a clinic-ready interferometric module to acquire dynamic sperm cells and record their quantitative phase profiles, followed by using an algorithmic architecture that includes CNNs to virtually stain the cells and classify them based on their morphology, motility, and DNA fragmentation status. Each of the scores for morphology, motility, and fragmentation is mapped to a number between 0 and 1, where 0 means the lowest possible quality and 1 means the highest possible quality. All information is then mapped to a 3D scatter plot for each cell.
Previous attempts used machine learning to find correlations between pairs of stain-free bright-field (low-contrast) images and DNA-fragmentation-stained images, as well as for motility or morphology.[28–34] In contrast, here we use rapid interferometric imaging, which provides not only the low-contrast amplitude image but also the quantitative phase profile of the cell, including internal cellular organelle contrast and content-related cellular texture, attributed to the DNA fragmentation level of the cell. As the phase profile is quantitative, having meaningful and highly informative structural (topographic) and content (refractive-index)-related values at all points of the cell, the gap between the images in the input pair is smaller than in the previous methods that used bright-field imaging. Thus, the presented approach allows individual-cell DNA fragmentation classification, in addition to internal-morphological-structure virtual staining of each cell (which is not possible by bright-field imaging), enabling more accurate fertility grading based on the triple generalized score (WHO morphology, motility, and DNA fragmentation).
We show that the number of cells that pass all three criteria cannot be accurately determined by the number of cells that pass each criterion separately, which necessitates a different fertility grading procedure than is used today.
2 Results
The architecture of the overall system for attaining a generalized fertility evaluation per patient is shown in Figure 1. The proposed technique enables simultaneously obtaining, for each of the dynamic sperm cells imaged in a dish, the full stain-like morphology, DNA fragmentation, and motility status, without the need for chemical staining. As shown in Figure 1, the analysis includes four deep neural networks (DNNs) and least-squares linear approximation: DNN1 performs morphological virtual staining, DNN2 performs morphological scoring, DNN3 performs DNA fragmentation scoring, DNN4 performs DNA virtual staining, and motility scoring is performed using least-squares linear approximation. To generate quantitative stain-free imaging data, we implemented a clinic-ready holographic setup (Figure S1, Supporting Information) to acquire stain-free quantitative phase maps of the cells. This setup is composed of an inverted microscope and a custom-built common-path, compact interferometric module, connected to the microscope camera port (see Experimental Section). The acquired off-axis interferograms were processed into the optical thickness maps of the sperm cells, taking into consideration both their physical thickness and their refractive index contents. As shown in Figure S2 and in Video S1, Supporting Information, these maps can be visualized in HEMA-staining-like colors, thereby distinguishing between different subcellular structures. We designed a custom-built tracking algorithm and tracked all cells dynamically, resulting in a space–time array per cell.
Using the networks trained on well-established ground-truth stain-based labels for sperm morphology and DNA fragmentation, as well as quantitative motility analysis using least-squares linear approximation, we can obtain for each live and dynamic cell the intracellular morphology and associated parameters, motility parameters, and the DNA fragmentation status using only the quantitative stain-free imaging data. We then map each cell to one point in a 3D space, with the axes being morphology, motility, and fragmentation values, thereby displaying the complete cellular status of each cell, for live cells without using cell staining. The combined cell points per patient in the 3D space are depicted as a sphere, with the sphere volume representing the patient’s generalized fertility score. Regardless of the triple scoring of each individual cell, DNN1 and DNN4 use GAN models for virtual staining for visualization purposes, which is complementary to the automatic scoring capability.
Using this scheme, we quantitatively imaged 5101 dynamic human sperm cells from eight donors, resulting in 51 809 images. The collection of all cells measured per donor and their representation on the 3D scatter plot uniquely characterizes the fertility status of the donor. Instead of having three different criteria measured on three different populations of the sample, where there is the risk of having cells that pass one criterion and fail one or more of the other criteria, we are now able to generate a triple-criteria score per cell.
2.1 Full Morphology Evaluation and Virtual Staining
As shown in Figure 1 and Video S1, Supporting Information, the morphological evaluation of each cell includes virtual staining of the cell, extraction of the cell morphological features, and the final cell classification, as recommended by the 2010 World Health Organization[3] (WHO2010) for stained cells, although no chemical staining is used and the cells are not fixed but rather swim dynamically in a dish. For both the cell virtual staining and the WHO2010-based morphology scoring tasks, we used DNNs. The first DNN in Figure 1, DNN1, performs virtual staining of individual sperm cells.[14] It is based on a conventional GAN architecture that is designed to generate new high-quality images. DNN1 was trained with the stain-free optical thickness maps of sperm cells and their chemically stained counterparts as labels, for the exact same sperm cell. Then, the trained network could transform the stain-free optical thickness maps into virtually stained images, making them look as though they were chemically stained, without actual staining (Figure S3, Supporting Information), thereby providing the information necessary for gold-standard evaluation. The network was tested through classification by a trained embryologist, after randomizing the data order, and yielded very similar results to those obtained by chemical staining, as presented below. In the current study, virtual staining is implemented on live and highly dynamic sperm cells for the first time, in contrast to our previous research that demonstrated virtual staining on fixed sperm cells.[14] In addition, we analyzed the optical thickness maps of the cells using classical image processing techniques to automatically extract six morphological features: nucleus area, acrosome area, total head area, mean posterior–anterior difference, dry mass, and the variance of the optical thickness values quantifying the texture of the cell. These six features were calculated as explained in the study by Mirsky et al.[35] and were used as input to DNN3 for DNA fragmentation scoring, as well as to give the clinician further quantitative morphological parameters, which are not yet in standard use but may become so in the future. Note that virtual staining is demonstrated as an additional annotation guide to the embryologist, and not as the input or ground-truth label to DNN2 for morphological scoring or DNN3 for DNA fragmentation scoring presented below.
DNN2 in Figure 1 performs sperm scoring based on the WHO2010 morphology guidelines, which include five criteria: head shape, acrosome size, number of vacuoles, midpiece shape and orientation, and cytoplasmic droplets. Note that, whereas our experiment examined the entire cell population without bias, these WHO2010 guidelines for sperm morphology were previously formulated by examining sperm cells retrieved from human female cervical secretion following sexual intercourse, where it is assumed that sperm cells that reached there have higher fertilization potential. During training, the network received the optical thickness map of the cell along with, as labels, the ground-truth WHO2010 classification by a trained embryologist, which is based on the chemically stained image. Once the model is trained, the network predicts the morphology score without using the chemical staining-based label. The network outputs six parameters: the five WHO2010 criteria per cell (head shape, acrosome size, number of vacuoles, midpiece shape and orientation, and cytoplasmic droplets) together with a sixth parameter, the direct overall prediction of whether the cell passes all five criteria according to the embryologist (independent of the first five outputs). In addition, a combined overall prediction, which checks whether the model predicted a “pass” for all five qualifications represented by the first five outputs, was calculated for comparison with the direct overall prediction (see Experimental Section). Figure 2 presents the architecture of DNN2, as well as the receiver operating characteristic (ROC) curve and precision-recall curve (PRC). This network was built as a standard CNN because all the information necessary for cell classification based on the WHO2010 guidelines is available in the image. We started with very few layers and gradually increased the number of layers until generalized learning was obtained, as indicated by the training curves as well as the test success metrics.
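The difference between the direct and the combined overall predictions can be sketched as follows. This is only an illustration: the six output values are hypothetical, and the per-output pass threshold of 0.5 is an assumption rather than a value taken from the Experimental Section.

```python
import numpy as np

CRITERIA = ["head_shape", "acrosome_size", "vacuoles",
            "midpiece", "cytoplasmic_droplets"]

def combined_overall(criteria_probs, threshold=0.5):
    """Combined prediction: pass only if every one of the five criteria passes."""
    probs = np.asarray(criteria_probs, dtype=float)
    return bool(np.all(probs >= threshold))

# Hypothetical six outputs for one cell: five criteria, then the direct overall output.
outputs = np.array([0.91, 0.84, 0.97, 0.42, 0.88, 0.63])
criteria, direct = outputs[:5], outputs[5]

combined = combined_overall(criteria)   # fails here: the midpiece output is below 0.5
direct_pass = bool(direct >= 0.5)       # the independent sixth output can still pass
```

The two predictions can disagree for a given cell, which is why comparing them is informative: the direct output is learned from the embryologist's overall judgment, whereas the combined one is a logical AND of five separately learned outputs.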
The ROCs for the five qualifications attained areas under the curves (AUCs) of 95.7%, 96.3%, 100.0%, 98.7%, and 86.7%, respectively. The PRCs for the five qualifications attained AUCs of 93.6%, 95.5%, 100.0%, 99.7%, and 97.4%, respectively. The direct overall parameter, representing the direct classification, attained AUC of 96.2% for the ROC and 93.5% for the PRC, precision of 90.9%, and accuracy of 93.1%. AUCs are used here as an indicator of network performance. Precision is expected to be a dominant predictor for a successful sperm classifier, as a person’s potential of fertilization is mostly dependent on his best cells. These cells reach the egg in natural insemination and are the only cells that should be selected for IVF. Once the network is trained and tested, it can classify the cell without having the ground-truth label from the trained embryologist. The direct overall output was set as the coordinate value of the morphological axis in the 3D scatter plot for the cell examined.
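For readers less familiar with these metrics, the following minimal sketch shows how the ROC AUC, precision, and accuracy can be computed from labels and network scores. The labels, scores, and the 0.5 decision threshold are hypothetical; the ROC AUC is computed through its rank interpretation (the probability that a randomly chosen passing cell scores higher than a randomly chosen failing cell), which is one standard way to obtain it and not necessarily the procedure used in this work.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly;
    tied pairs count as half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def precision_accuracy(labels, scores, threshold=0.5):
    """Precision = TP / (TP + FP); accuracy = fraction of correct decisions."""
    labels = np.asarray(labels, dtype=bool)
    pred = np.asarray(scores, dtype=float) >= threshold
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return float(precision), float(np.mean(pred == labels))

# Hypothetical embryologist labels (1 = pass) and network scores for six cells.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = roc_auc(labels, scores)
precision, accuracy = precision_accuracy(labels, scores)
```

Precision penalizes only cells wrongly predicted as passing, which matches the emphasis above on reliably identifying the best cells.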
2.2 Motility Evaluation
To assess the sperm motility on an individual-cell basis, we used the space–time arrays extracted from the cell tracking algorithm, composed of the locations of each cell in each frame. The trajectory of one representative cell is shown in Figure 3. We divided the trajectory into one-second windows with a half-window stride, as progressively motile cells are expected to swim linearly only over short time intervals. One of these windows is shown on the right of Figure 3. We then calculated eight motility parameters suggested by the WHO2010: curvilinear velocity (VCL), straight-line velocity (VSL), average path velocity (VAP), linearity (LIN), wobble (WOB), straightness (STR), beat-cross frequency (BCF), and mean angular displacement (MAD). The median values of these parameters over all windows per cell were taken as the final motility parameters per cell. For each donor, another subsample was tested by an experienced embryologist according to the qualitative WHO2010 protocol, classifying each sperm cell into one of three motility classes: immotile, nonprogressively motile, and progressively motile. To compare the qualitative and quantitative motility tests, we conducted two comparative experiments. First, we calculated the correlation between the qualitative and quantitative motility test results over all eight donors. To do this, we chose the four quantitative motility parameters (VCL, VSL, VAP, and VSL × LIN) that most resembled the qualitative assessment, resulting in high and significant correlations. The other parameters yielded small or no correlation. Second, to cancel the effect of sampling errors, a video containing 87 sperm cells was processed both quantitatively by our automatic algorithm and qualitatively by the embryologist. This increased the correlations (from 0.49 to 0.75) and their statistical significance (p-values decrease from 0.15 to ).
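The windowed computation described above can be sketched as follows for six of the eight parameters (BCF and MAD are omitted for brevity). This is an illustration under stated assumptions and not the tracking code used in this work: the frame rate, the moving-average smoothing used to obtain the average path, and the function names are all hypothetical.

```python
import numpy as np

def window_params(xy, dt, smooth=5):
    """Return VCL, VSL, VAP, LIN, WOB, STR for one window of (x, y) positions.
    The window must contain more than `smooth` samples."""
    xy = np.asarray(xy, dtype=float)
    duration = dt * (len(xy) - 1)
    # VCL: length of the frame-to-frame path divided by the elapsed time.
    vcl = np.linalg.norm(np.diff(xy, axis=0), axis=1).sum() / duration
    # VSL: straight distance between first and last positions over the elapsed time.
    vsl = np.linalg.norm(xy[-1] - xy[0]) / duration
    # VAP: path length along a moving-average path (smoothing choice is an assumption).
    kernel = np.ones(smooth) / smooth
    avg = np.column_stack([np.convolve(xy[:, k], kernel, mode="valid") for k in range(2)])
    vap = np.linalg.norm(np.diff(avg, axis=0), axis=1).sum() / (dt * (len(avg) - 1))
    lin = vsl / vcl if vcl > 0 else 0.0   # linearity
    wob = vap / vcl if vcl > 0 else 0.0   # wobble
    str_ = vsl / vap if vap > 0 else 0.0  # straightness
    return np.array([vcl, vsl, vap, lin, wob, str_])

def cell_motility(xy, fps):
    """Median of the per-window parameters over one-second windows, half-window stride."""
    xy = np.asarray(xy, dtype=float)
    n = int(round(fps))        # frame intervals in a one-second window
    step = max(n // 2, 1)      # half-window stride
    if len(xy) <= n:
        raise ValueError("track is shorter than one window")
    rows = [window_params(xy[s:s + n + 1], 1.0 / fps)
            for s in range(0, len(xy) - n, step)]
    return np.median(np.array(rows), axis=0)

# Toy track: a cell swimming in a straight line at 50 um/s for 2 s at 20 frames/s.
fps = 20
t = np.arange(41) / fps
track = np.column_stack([50.0 * t, np.zeros_like(t)])
params = cell_motility(track, fps)
```

For this perfectly straight track, the three velocities coincide and the three ratios equal 1; a real, meandering track yields VCL > VAP > VSL and ratios below 1.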
We then used a least-squares approximation to define a linear equation that maps all automatically extracted quantitative motility parameters to the three qualitative classes defined above, to ensure that no previously acquired fertility score is overlooked by our protocol. The normalized function value of the linear equation was used as the coordinate of the motility axis in the 3D scatter plot.
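A least-squares linear mapping of this kind can be sketched as follows. The numeric class encoding (0, 1, 2 for immotile, nonprogressively motile, and progressively motile), the intercept term, the normalization by the largest class value, and the clipping to the range 0 to 1 are all assumptions made for illustration, as are the feature values.

```python
import numpy as np

def fit_motility_map(features, classes):
    """Least-squares fit of class = features @ w + b; returns [w..., b]."""
    features = np.asarray(features, dtype=float)
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, np.asarray(classes, dtype=float), rcond=None)
    return coef

def motility_score(features, coef, n_classes=3):
    """Evaluate the linear equation and normalize it to the range 0 to 1."""
    features = np.asarray(features, dtype=float)
    design = np.column_stack([features, np.ones(len(features))])
    return np.clip(design @ coef / (n_classes - 1), 0.0, 1.0)

# Hypothetical (VCL, VSL) values for three cells labeled by an embryologist.
features = [[0.0, 0.0], [10.0, 5.0], [20.0, 20.0]]
classes = [0, 1, 2]

coef = fit_motility_map(features, classes)
scores = motility_score(features, coef)
```

With more cells than unknowns, the fit is no longer exact, and the normalized value acts as a continuous motility coordinate lying between the discrete qualitative classes.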
2.3 DNA Fragmentation Evaluation and Virtual Staining
We used DNN3 to grade each live sperm cell according to its DNA fragmentation level. In contrast to Barnea et al.,[36] who showed only statistical differences for specific parameters in large populations of sperm cells measured by interferometry, with population overlaps, here we provide an individual-cell DNA measurement without staining. For training, we used pairs of images: the stain-free quantitative thickness map of the cell and, as the label, the image of the same cell after it was stained by acridine orange, a DNA fragmentation indicator emitting green fluorescence for double-stranded DNA (nonfragmented) and red fluorescence for single-stranded DNA or RNA (fragmented). Results are shown in Figure S4, Supporting Information. The automatically extracted morphological parameters were also fed into DNN3, creating a bimodal neural network. Figure 4a presents the network architecture (see also Experimental Section). The ROC AUC and PRC AUC were both above 0.98. After training and testing, the network can take a stain-free quantitative optical thickness map of the cell and output the cell fragmentation level without the need for chemical staining. In its bimodal architecture, DNN3 receives both the optical thickness map and the six morphological features calculated from this map, in addition to the chemically stained cell image label, because these features were found to correlate with the DNA fragmentation level of the cell,[36] practically guiding the network. When training without these six features, we obtained inferior results: accuracy of 0.90, ROC AUC of 0.78, and PRC AUC of 0.98.
Another DNN (DNN4) used a GAN architecture to generate virtually stained DNA fragmentation images. It was trained to virtually stain the quantitative optical thickness map, creating a semblance of its acridine-orange-stained counterpart. The network architecture is shown in Figure 4b and its virtual staining operation is demonstrated in Figure 4c, yielding a virtually stained image that is very similar to the chemically stained sperm cell image shown in the center. Note that this virtual staining was not used to score the cells presented in Figures 5 and 6 according to their DNA fragmentation status; this scoring was done by DNN3. Only then did we use DNN4, with the fragmentation score from DNN3 as the label, to virtually stain the optical thickness map and make it look as if the cell had been chemically stained with acridine orange, as a visual tool for the embryologist. The verification of DNN4 virtual staining results was done by a trained embryologist (see Section 4.7).
2.4 Patient’s Fertility Scoring
To evaluate the patient’s fertility, we calculated the morphology, motility, and DNA fragmentation of each sperm cell as explained above, for live cells and without staining, and then mapped the cell into a single point in a 3D space with axes representing these three criteria. Our evaluation also includes virtual morphological and acridine-orange staining per cell. As shown in Figure 5, the presented technique enables gathering all single-cell triple-criteria information regarding the population of sperm cells. The left column in Figure 5 shows the resulting 3D scatter plots for the eight donors measured, presenting both the 2D projection of each cell on each of the three planes and the location of that cell in the 3D space. The right column in Figure 5 shows the intersections of the three criteria as Venn diagrams, where each of the three circles represents a specific criterion. The number of cells that pass each criterion, together with the number of cells that pass each set of two criteria and all criteria, is displayed, where the passing thresholds are set to 0.5 for morphology, 0.29 for motility, and 0.61 for fragmentation. This figure demonstrates the great variability between donors. In relation to the other criteria, some donors have more cells that pass the fragmentation test (donors 2–5), while others have more cells that pass the motility test (donors 1, 6–8). Some donors have more cells that are both nonfragmented and morphologically intact (donors 1–3, 5, 7), while others have more cells that are both morphologically intact and motile (donors 4, 6, 8). The percentage of the overall passing cells differs among the donors: donor 5 (Figure 5e) has significantly more cells than donor 2 (Figure 5b), yet donor 2 has three more sperm cells that pass all three criteria. The thresholds may be adjusted according to the patient’s specific needs, as shown in Figure S5 and Video S2, Supporting Information.
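The counts displayed in such Venn diagrams follow directly from the per-cell triple scores and the three thresholds. The sketch below uses the thresholds stated above; the four cells are hypothetical, and treating a score equal to the threshold as a pass is an assumption.

```python
import numpy as np

# Passing thresholds from the text: morphology, motility, fragmentation.
THRESHOLDS = np.array([0.5, 0.29, 0.61])

def venn_counts(scores, thresholds=THRESHOLDS):
    """Count cells passing each criterion, each pair of criteria, and all three."""
    passed = np.asarray(scores, dtype=float) >= thresholds   # (N, 3) booleans
    m, v, f = passed.T
    return {
        "morphology": int(m.sum()),
        "motility": int(v.sum()),
        "fragmentation": int(f.sum()),
        "morphology&motility": int((m & v).sum()),
        "morphology&fragmentation": int((m & f).sum()),
        "motility&fragmentation": int((v & f).sum()),
        "all": int((m & v & f).sum()),
    }

# Hypothetical (morphology, motility, fragmentation) scores for four cells.
cells = [[0.9, 0.5, 0.8],
         [0.6, 0.1, 0.7],
         [0.2, 0.4, 0.9],
         [0.4, 0.2, 0.3]]
counts = venn_counts(cells)
```

Raising or lowering the entries of `THRESHOLDS` and recounting corresponds to the threshold personalization shown in Figure S5 and Video S2, Supporting Information.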
To check whether the test dependencies were consistent between the donors, we checked the statistical independence between the criteria for each donor. This was implemented by multiplying the three percentages of cells passing each criterion and comparing the result to the percentage of the cells passing all three criteria. For example, for donor 3 (Figure 5c), the number of cells that actually pass all three criteria is 1.53 times the number of cells expected under the independent-criteria assumption, whereas for donor 4 (Figure 5d), the same ratio is only 0.95. This shows that the criteria are not completely independent, as the expected percentage of passing cells under the independent-criteria assumption differs from the actual result, and that the criteria dependencies differ between the donors.
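This independence check amounts to a one-line comparison, sketched below on a hypothetical pass/fail table (the table and function name are illustrative only). A ratio of 1 means that the observed fraction of triple-passing cells equals the product of the three marginal fractions, as expected for independent criteria.

```python
import numpy as np

def independence_ratio(passed):
    """Observed fraction of cells passing all criteria divided by the fraction
    expected if the criteria were statistically independent."""
    passed = np.asarray(passed, dtype=bool)       # (N cells, 3 criteria)
    expected = passed.mean(axis=0).prod()         # product of marginal pass fractions
    observed = passed.all(axis=1).mean()          # fraction passing all three
    if expected == 0:
        raise ValueError("at least one criterion is passed by no cell")
    return float(observed / expected)

# Hypothetical pass/fail table: columns are morphology, motility, fragmentation.
passed = [[True,  True,  True],
          [True,  False, True],
          [False, True,  True],
          [False, False, False]]
ratio = independence_ratio(passed)   # 0.25 observed vs 0.5 * 0.5 * 0.75 expected
```

A ratio above 1, as in this toy table, indicates that passing cells cluster together more than independence would predict; a ratio below 1 indicates the opposite.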
(1)
where $\langle x \rangle = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the normalized morphology value averaged over all N cells measured per donor, $\langle y \rangle = \frac{1}{N}\sum_{i=1}^{N} y_i$ is the normalized motility value averaged over all N cells measured per donor, and $\langle z \rangle = \frac{1}{N}\sum_{i=1}^{N} z_i$ is the normalized DNA fragmentation value averaged over all N cells measured per donor. This normalized overall score P represents the fertility scores obtained by the current practice, as the averaging was done on each criterion separately, with the underlying assumption of sample homogeneity.
(2)
This parameter takes the distance from the origin for each cell in the normalized 3D space, and averages all these distances per donor.
(3)
These new scores, K and K1, depend on how well the cells stand in all three criteria, whereas a donor might have no sperm cells that pass all three criteria together and still have a high previous fertility score P.
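The difference between averaging each criterion separately and averaging per cell can be illustrated numerically. The sketch below rests on assumptions and should not be read as the exact definitions: it takes P as the distance from the origin of the point formed by the three averaged criteria, and K as the average of the per-cell distances from the origin, both divided by √3 so that they lie between 0 and 1. K1 is not sketched. The two cells and the function names are hypothetical.

```python
import numpy as np

def previous_score(cells):
    """Assumed form of P: distance from the origin of the per-criterion averages,
    divided by sqrt(3) (normalization is an assumption)."""
    cells = np.asarray(cells, dtype=float)
    return float(np.linalg.norm(cells.mean(axis=0)) / np.sqrt(3))

def per_cell_score(cells):
    """Assumed form of K: average of each cell's distance from the origin,
    divided by sqrt(3) (normalization is an assumption)."""
    cells = np.asarray(cells, dtype=float)
    return float(np.linalg.norm(cells, axis=1).mean() / np.sqrt(3))

# Two hypothetical cells, each excelling in a different single criterion.
cells = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
P = previous_score(cells)
K = per_cell_score(cells)
```

For a heterogeneous population, the two quantities differ because averaging before taking the distance discards the per-cell pairing of the three criteria; this is the information that the per-cell scores retain.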
We now show that these new scores result in different donor grading, implying that erroneous donor grading is done today, even under the homogeneous sample assumption. Figure 6 shows the previous fertility scores together with the new scores obtained for the eight donors. In this figure, each sphere represents a donor. The distance of the center of the sphere from the origin of the 3D space represents the donor’s status as per the current practice, P, and can be seen as the first value next to the donor number. On the other hand, the sphere diameter is correlative to the donor’s first new fertility score, K, which can be seen on the right in the second row next to each donor’s sphere. The second new score, K1, can be seen on the left in the second row next to each donor’s sphere. As seen, the donors’ rankings according to the old and new scores differ from each other. Donor 3, for instance, takes the last place according to the K score, as this donor’s average sperm quality is the lowest, although donor 4 would be labeled as the worst donor according to the previous fertility grading, P. Donor 1 takes fourth place according to the K score, fifth place according to the K1 score, and sixth place according to the current score, P. On the other hand, the two best donors, donors 7 and 6, are consistent among all three scores, while donor 5 is the third best donor for both the K and the P scores but not for the K1 score, implying that the new scores may be especially useful in discriminating situations of infertility.
3 Conclusion
In this work, we present a new capability to simultaneously measure individual sperm morphology, motility, and fragmentation, as well as virtually stain the sperm cells. Today, a clinician is unable to gather this necessary information per sperm cell, and instead the fertility examination provides independent percentages regarding each evaluation separately, on different parts of the cell population, where statistical changes may occur between subsamples. Thus, currently, there is no information regarding how many cells would have passed two or all three tests. In addition, the harsh fixation and staining protocols required for performing the morphology and DNA fragmentation assays might damage the reliability of the assays if not performed correctly, and create different results when performed by different labs. Our virtual classifiers solve these issues as they do not require staining and perform all three assays on the same cells, while they are alive and dynamic. We show that our virtual assays correlate with the chemical gold standards, and when applied together on each cell, give us the capability to consider the previously unavailable intersections between criteria when counting the normal and abnormal cells per patient. We show that patients may differ in the statistical dependencies of the different tests (Figure 5) and that consequently a patient with overall lower independent scores may very well have a higher score per cell or even a higher percentage of qualified sperm cells (Figure 6), emphasizing the need for the new approach and the insufficiency of the current examination.
This approach is expected to give rise to personalized sperm quality evaluations by adjusting the thresholds to suit the current need. Figure S5 and Video S2, Supporting Information, show how different thresholds result in different sperm groups. Without any specialized needs, one may want to use the original thresholds, and obtain a result similar to that which is shown in Figure S5a, Supporting Information. However, another would possibly want to increase the fragmentation threshold, as shown in Figure S5b, Supporting Information, securing a low percent of fragmented cells from a potential sperm donation. In another case, one may personalize these thresholds for selecting sperm cells for fertilization and may want to raise all thresholds in order to select the most promising sperm cells, creating scatter plots similar to the one in Figure S5c, Supporting Information. Other situations may result in decreasing or increasing different thresholds, depending on which patient is being evaluated, what types of infertility problems he possesses and for what reason this evaluation is being held.
Here, we demonstrated the usefulness of the method for male fertility evaluation prior to IVF. In the future, the integration of a portable clinic-ready interferometric module that can be attached to the standard microscopic systems used today in the clinic, together with our virtual staining and classification triple-criteria method at the single-cell level, can enable computer-assisted sperm selection during IVF in real time, allowing clinicians to choose the most qualified sperm cells for egg injection, with the best morphology, motility, and DNA fragmentation scores. As the scoring is automatic, rather than manual, in the future, the system may be integrated with a robotic selector and independently choose the sperm cells with the highest probability of fertilization. Moreover, incorporating these systems will give rise to advanced research linking types of chosen sperm cells with fertilization and pregnancy success, with the ability to personalize medicine for patients suffering from fertility problems.
To conclude, we have suggested a new approach that utilizes stain-free quantitative phase imaging and deep learning, in order to improve male fertility assessments through combined morphology, motility, and DNA fragmentation scores, allowing clinicians and researchers to obtain previously unattainable fertility evaluations at the single sperm-cell level.
Acknowledgements
Applied Sciences and Engineering grant from the Ministry of Science and Technology of Israel.
Conflict of Interest
The authors declare no conflict of interest.