Literature DB >> 35382336

Accurate Tumor Subtype Detection with Raman Spectroscopy via Variational Autoencoder and Machine Learning.

Chang He¹, Shuo Zhu¹, Xiaorong Wu², Jiale Zhou², Yonghui Chen², Xiaohua Qian¹, Jian Ye^1,3,4.

Abstract

Accurate diagnosis of cancer subtypes is a great guide for the development of surgical plans and prognosis in the clinic. Raman spectroscopy, combined with the machine learning algorithm, has been demonstrated to be a powerful tool for tumor identification. However, the analysis and classification of Raman spectra for biological samples with complex compositions are still challenges. In addition, the signal-to-noise ratio of the spectra also influences the accuracy of the classification. Herein, we applied the variational autoencoder (VAE) to Raman spectra for downscaling and noise reduction simultaneously. We validated the performance of the VAE algorithm at the cellular and tissue levels. VAE successfully downscaled high-dimensional Raman spectral data to two-dimensional (2D) data for three subtypes of non-small cell lung cancer cells and two subtypes of kidney cancer tissues. Gaussian naïve bayes was applied to subtype discrimination with the 2D data after VAE encoding at both the cellular and tissue levels, significantly outperforming the discrimination results using original spectra. Therefore, the analysis of Raman spectroscopy based on VAE and machine learning has great potential for rapid diagnosis of tumor subtypes.

Entities: Chemical

Year: 2022 PMID： 35382336 PMCID： PMC8973095 DOI： 10.1021/acsomega.1c07263

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Early diagnosis of cancer is highly important for improving the 10 year survival of patients. As the treatments of different subtypes have significant differences, identification of cancer subtypes is an essential reference for developing a therapy plan.[1] At present, the most mainstream clinical methods are the pathological diagnosis and immunochemistry.[2,3] However, these processes are too complicated and time-consuming, and the probability of misjudgment strongly relies on personal experience. Therefore, it is beneficial to develop a rapid, noninvasive, and highly sensitive technique that can accurately detect and recognize different subtypes of tumors. Raman spectroscopy is a label-free molecular vibrational spectroscopic method that can provide specific fingerprint information about the chemical composition of the biological sample, which is beneficial for clinical diagnosis.[4−6] The Raman spectra consist of spectral information from nucleic acids, proteins, lipids, carbohydrates, and so on. Different types of cancer cells have different structures and compositions, corresponding to different spectra, which may lead to the possible differentiation of tumor subtypes.[7−9] Raman spectra of biological samples consist of molecular information with complex composition, which makes it almost impossible to achieve the peak intensity of the target molecule directly. Therefore, an efficient analytical technique is needed to process the Raman signal. As a significant branch of machine learning, classification algorithms such as linear discriminant analysis (LDA) and supporting vector machine (SVM) are currently the most common methods for analyzing Raman spectra.[6,10−14] These classification algorithms commonly distinguish Raman spectra by mining the differences between data belonging to different classes. Instead of the traditional method of extracting peak intensity, these algorithms process the entire information on the spectra when classifying them, and the more comprehensive information makes it more accurate.[15,16] Currently, researchers have implemented algorithms to discriminate Raman spectra of tumors at the cellular and tissue levels.[17−19] Although Raman spectroscopy has been shown to be applicable for the identification of cancer subtypes, there is still a problem that may affect the application of the classifier: feature redundancy.[20−23] The spectra consist of hundreds or even thousands of data points, which make up multiple Raman peaks in the spectrum. But not every Raman peak contributes to distinguish different types of Raman spectra. Thus, unimportant data points may create feature redundancy that can have a negative impact on the classification performance. At present, the feature selection method based on variance thresholding is a common one to eliminate invalid information in machine learning, but it fails to consider the entire spectral information comprehensively and thus has limited improvement on the spectra classification.[24−26] In addition, a poor signal-to-noise ratio means that the intensity of the weaker peaks is highly susceptible to being masked or affected by noise, thus affecting the classification effect. These deficiencies make the classification algorithms ineffective in differentiating Raman spectra of different tumor subtypes, which consist of similar compositions. In this work, we combined the nine classification algorithms with Raman spectroscopy to distinguish the subtypes of cancers. The variational autoencoder (VAE) was employed to reduce feature redundancy and to accomplish noise reduction. Some studies have been conducted using autoencoders (e.g., sparse autoencoder) to process spectra for improved analysis in fields such as tumor types, microplastics, and drug resistance.[27−29] However, the application of autoencoders for classifying tumor subtypes is still lacking. VAE encodes Raman spectra as two-dimensional (2D) information, which fits a Gaussian distribution, and the fewer data dimensions make it insufficient to store noise information.[30−32] Nine classification models were trained to classify the VAE-encoded Raman spectra of three subtypes of non-small cell lung cancer (NSCLC) cell lines (A549, H1299, and H460) and two subtypes of kidney cancer tissues (clear cell renal cell carcinomas (CCRCC) and papillary renal cell carcinoma (PRCC)). Meanwhile, we compared the classification performance of different classifiers for both the original spectra and the VAE-compressed data, and the results show that the VAE-processed data generally give better classification results. For NSCLC cells, the Gaussian naïve bayes (NB) achieves the best classification result with an accuracy of 89.6%. For kidney cancer data, the Gaussian NB represented the best classification performance with an accuracy of 81.4%.

Materials and Methods

Cell Culture and Preparation

The three different types of human non-small cell lung cancer (NSCLC) cell lines A549, H1299, and H460 were obtained from American Type Culture Collection (ATCC). All cells were cultured in DMEM with 10% fetal bovine serum, 100 U·mL–1 penicillin, and 100 μg·mL–1 streptomycin at 37 °C in a cell incubator with 5% CO2. The mentioned reagents were bought from Gibco. Before the Raman measurements were performed, the NSCLC cell lines were washed with phosphate-buffered solution (PBS) and fixed with 4% paraformaldehyde for 10 min at room temperature. Then the cells were purified with ultrapure water to remove paraformaldehyde.

Clinical Sample Acquisition and Pathological Diagnosis

All kidney cancer tissues were acquired from patients who were diagnosed with kidney cancer with a preoperative imaging technique. The tissues were excised intraoperatively or removed by biopsy. A total of 30 patient samples were retrieved, including 16 CCRCC and 14 PRCC. The diameter of each tissue was about 1–2 mm. The acquisition of tissue samples and all experiments in this study were approved by the Ethics Committee of Renji Hospital, School of Medicine, Shanghai Jiao Tong University. The pathological diagnoses, as the gold standard for classification results, were accomplished with the standard procedure in Renji Hospital of Shanghai Jiao Tong University. All tissues were stored in a 0 °C environment until Raman spectra were collected.

Raman Spectra Acquisition

Raman spectra were obtained on a confocal Raman system (Horiba, XploRA PLUS) from the fixed cells on quartz plates. For the acquisition of cell Raman spectra, a laser beam of 532 nm was used as an excitation source. The laser power was 39.1 mW on the sample with a 60×/NA 0.7 objective. Raman spectra from all cell samples were recorded in the spectral range of 400–4000 cm–1. The collection time was 2 s for each Raman spectrum. All cells were scanned with a multipoint grid containing the whole cell in steps of 0.5–2 μm. For each cell, more than 150 spectra were collected in the acquisition. After the cell spectra were measured, the nucleus was stained. The staining reagent was DNA-binding fluorescent dye 4′,6-diamidino-2-phenylindole (DAPI). Fluorescence images were acquired using a fluorescence microscope (Leica, DM 2500) to distinguish the nucleus. The spectra acquired at the acquisition location within the DAPI-stained region are considered nuclear spectra. A total of 140 NSCLC cells were collected (H1299: 58 cells; H460: 50 cells; A549: 32 cells). To balance the sample size of the subtypes in the train set, different numbers of spectra were chosen for different subtypes of cells during the training process (20 spectra for each cell nucleus and cytoplasm of H1299, 25 spectra for H460, and 35 spectra for A549). Before the Raman spectrum acquisition of kidney cancer tissues, all tissues were taken out from a 0 °C environment and placed at room temperature (26 °C) for 30 min to avoid the influence of temperature on the Raman tests.[33−35] The tissues were placed on quartz plates, and the laser was illuminated through the quartz plate onto the tissues. Raman spectra from kidney cancer tissues were excited by a 785 nm laser and recorded in the spectral range of 200–1800 cm–1, and the collection time was 5 s with the laser power of 29.8 mW. The laser spot area is approximately 3 μm. For each point, two Raman spectra were acquired and took their average as the final spectrum to reduce the random noise. For each tissue, 100 spectra were collected and 30 spectra were randomly chosen to establish the data set in the case of information saturation.

Data Preprocessing

Spectral data from our experiments were preprocessed and analyzed using Python. Before analysis, the preprocessing of Raman spectral data is crucial. A strong background always exists in Raman spectra of biological samples, and noise from different sources also contributes to the Raman signal. Removing the background and other noise before analysis is highly important. First, cell spectra were selected from all spectra according to the Raman peak between 2800 and 3000 cm–1. Then the spectra were background-subtracted. In the next step, the spectra were smoothed by employing Savitzky–Golay (SG), a filter using a polynomial order of 3 and a frame length of 15.[36,37] It is a filter based on local polynomial least-squares fitting in the time domain. Afterward, we used adaptive iterative reweighted penalized least-squares (air-PLS), which iterates the least-square method to remove the baseline.[38,39] The background-subtracted spectra were finally normalized to the standard normal distribution. The same process was repeated for all three types of cells. For clinical samples, the SG filter and airPLS with the same parameters as cell samples were applied to preprocess the Raman spectra.

Variational Autoencoder

The VAE consists of an encoder and a decoder. The encoder compresses the original data into a latent space (often of lower dimension than the original space), and the decoder decompresses the data in the latent space and makes it as close as possible to the original data. Unlike traditional autoencoders, for each sample, the encoder of VAE outputs not a definite value but a probability distribution, which is represented by the mean and variance of a normal distribution to ensure the continuity of the encoding. The VAE algorithm has two major advantages in the analysis of Raman spectra. The first advantage is to perform clustering of spectra. Since each spectrum is encoded as a distribution that overlaps with each other, if the distributions are from not similar spectra, then a larger loss term will be generated during training VAE. This characteristic allows classification models to be better trained and predicted from data in the latent space. The second advantage is denoising. The dimensionality of the latent space is not sufficient to support its storage of random and complex noise. Thus, VAE allows the noise to be further removed to purify the spectral information. Before training the VAE models, data from unimportant silent regions were removed from all spectra, and each spectrum is guaranteed to have 1056 dimensions for convolution operations. Normally, a larger volume of data is desired to prevent overfitting as the number of parameters to be learned in an unsupervised learning model increases.[40] A simplified VAE network architecture was designed to avoid overfitting owing to the limitation of the volume of spectral sets. The encoder was composed of four convolution layers with 1, 16, 32, and 32 kernels. All of the convolution layers have the kernel size of 4 pixels and the stride of 2. Each convolution layer is batch-normalized and followed with a 40% dropout layer to avoid overfitting. The output of the encoder was flattened and transformed by three fully connected layers with 128, 32, and 16 neurons. All of these seven layers in the encoder employed ReLU as activation functions. Finally, a fully connected layer with two neurons was used to map the output to the latent space. The decoder has the opposite network architecture to the encoder, except that the last convolutional layer, which has only one filter, does not have any activation function to output the predicted spectra. The loss function of VAE was defined as Kullback–Leibler (KL) divergence +0.9* root-mean-square errors between input spectra and predicted spectra.[41−43] The batch size of VAE training was 50. All VAE models were accomplished with Keras in Python and used the stochastic gradient descent optimizer with a learning rate of 0.0005 and momentum of 0.9.

Classification Using Machine Learning

All Raman spectra were split into train and test sets in a 7:3 ratio at the cell or patient level. For the classification of cells, the train set contains 98 cells (H1299: 40; H460: 35; A549: 23), and the test set contains 42 cells (H1299: 18; H460: 15; A549: 9). For kidney cancer, the train set contains 21 patients (CCRCC: 11, PRCC: 10), and the test set contains 9 patients (CCRCC: 5, PRCC: 4). Different classification models built were employed to classify original spectra after preprocessing and data in latent space separately, including random forest (RF), multilayer perception (MLP), SVM with linear kernel and radial basic function (RBF) kernel, logistic regression (LR), k nearest neighbor (KNN), Gaussian naïve bayes (NB), Adaboost, and LDA.[44−51] All of the models were built with leave-one-patient-out cross-validation (LOPOCV).[52,53] The LOPOCV used the Raman spectra of one patient or cell as a validation set and the others as the train set. The validation set is then looped until each patient’s data are used as the validation set for training the model. In contrast to the function of the test set, LOPOCV divides the data in the train set to validate the model for tuning parameters during the train stage. Multiple classifiers were usually established for multiple classifications, and each classifier corresponds to a false positive rate (FPR) and a true positive rate (TPR). Here, FPR and TPR are defined with the following formulas:where FP, TN, FN, and TP are the number of false positive, true negative, false negative, and true positive samples, respectively. Obviously, TPR and FPR will be influenced by the change of threshold which is used to determine the classes. For a classifier, we can get a TPR and FPR point pair based on its performance in the test sample. Therefore, the classifier can be mapped to a point on this plane. With adjusting the threshold used in this classifier, we can get a curve that goes through the points (0,0) and (1,1), which is the receiver operating characteristic (ROC) curve of this classifier. An overall standard is needed to evaluate the performance of multiclassifiers. Here, macroaveraging and microaveraging are employed to evaluate the general performance of machine learning models. Macro-average is the direct calculation of the FPR and TPR of each binary classifier, and their average is the area under the ROC curve (AUC) of multiclassifiers.[54] The microaverage calculates the arithmetic average of TP, FP, TN, and FN of each binary classifier and then obtains the FPR and TPR of multiclassifiers according to these average values, and finally, the ROC curve and AUC are determined.

Results and Discussion

Data Set of NSCLC Cells

Lung cancer, being the second largest number of new cases, poses a serious threat to human life.[55] NSCLC is the most common type of lung cancer, accounting for about 85% of lung cancers.[56] We employed three different cell subtypes of NSCLC to verify the performance of Raman spectroscopy at the cellular level for subtype diagnosis. Before collecting Raman spectra, all the cell samples were fixed with 4% paraformaldehyde for 10 min. Some studies have suggested that paraformaldehyde fixation is a sample preparation method with minimal impact on biochemical information over living conditions.[57,58] To the best of our knowledge, 4% paraformaldehyde fixation for 10 min is optimal for Raman studies.[59,60] The remaining molecular information (such as proteins and lipids) on cell samples is sufficient to distinguish between cell types.[61,62] The whole-cell Raman mapping was performed on each cell model in this work, and the cell nuclei were stained with DAPI to distinguish the nucleus boundary in the cell after Raman acquisition. From the example Raman spectrum in the range of 200–4000 cm–1 (Figure S1a), the background signal almost obscures the Raman signal in the range of 200–400 cm–1, thus the scanning ranges of all the cell samples were set to 400–4000 cm–1. For each cell, more than 150 Raman spectra were measured in acquisition with a 532 nm laser. Compared to the common 785 nm laser, the 532 nm laser could acquire the more significant Raman signals of the C–H bond for the cellular spectra, although it excites a stronger fluorescence signal. The fluorescence background can be removed by baseline correction. Figure shows the bright-field images, Raman mapping images (plotted by the Raman bands between 2700 and 3100 cm–1), DAPI-stained fluorescent images, and the average normalized Raman spectra of three subtypes of NSCLC of A549, H1299, and H460. The cancer cells have large heterogeneity so that the size and morphology are not reliable evidence for identifying the subtypes of NSCLC cells. An obvious Raman band can be detected in the spectral region between 2700 and 3100 cm–1 (Figure iv), which is the stretching vibration mode of the C–H bond. This band originates from two main aliphatic C–H bonds located at 2854 and 2940 cm–1 and other abundant C–H bonds belonging to various biomolecules. Hobro et al. showed that, although the use of paraformaldehyde fixation has an influence on this region, its influence is very weak relative to the Raman intensity of this region.[57]Figure S2 shows the Raman spectra at different detection positions (point 1, quartz; point 2, cytoplasm; point 3, nucleus), indicating that no significant Raman bands exist in this spectral range on the quartz substrate. Therefore, the intensity of the Raman signal between 2700 and 3100 cm–1 can be utilized to determine whether the spectrum is inside the cell (shown in Figure ii). The clearer cell boundaries are shown in Figure S2 by overexposed heatmaps. The area of the nucleus is characterized by DAPI staining, and therefore, we can further separate the Raman spectra from the nucleus and cytoplasm (Figure iii). It can be found that there are slight differences in the Raman spectra between the nucleus and cytoplasm for different cells (Figure S2). The Raman signal from 2100 to 2700 cm–1 was removed from the spectra as this range is in the biological transparency window, and typically, it does not affect subsequent data analysis.[63,64]

Figure 1

(i) Bright-field images, (ii) Raman mapping images plotted by the shaded Raman band between 2700 and 3100 cm–1, (iii) DAPI-stained fluorescent images, (iv) their overlay images, and (v) mean Raman spectra from (a) A549, (b) H460, and (c) H1299 NSCLC cell lines. The scale bars in panel (i) are 10 μm. The blue curves represent the mean Raman spectra only from the nuclei region, and red curves represent spectra only from the cytoplasm region. The shaded areas along the spectra represent the standard deviations of the means.

Classification of Different Cellular Raman Data Sets

To compare the effects of data from different parts of a cell on the classification effect, we established four different SVM with RBF kernel models based on different data sets: (1) nuclear model, where the data set consists of spectra of cell nucleus only; (2) cytoplasmic model, where the data set consists of spectra of cell cytoplasm only; (3) cell model, where the data set consists of all spectra of three subtypes of NSCLC cells; the model only outputs the cell subtype no matter if the spectrum is from the nucleus or cytoplasm; (4) six-classes model, where the data set consists of spectral data of the nucleus and cytoplasm of three subtypes of cells. Therefore, it contains six classes of spectral data. The model outputs the cell subtype to which the spectrum belongs along with whether it is a cytoplasmic or nuclear spectrum. SVM is the classical algorithm used for classification based on Raman spectra. The train set consists of the spectra from 98 cells (40 cells for H1299, 35 cells for H460, and 23 cells for A549). To balance the sample size of each cell subtype in the train set, 20 spectra were randomly selected for each cell nucleus and cytoplasm of H1299, 25 spectra for H460, and 35 spectra for A549. Each spectrum was standardized after smoothing and baseline correction. The LOPOCV method was used to evaluate the classification performance of the model on the train set. To assess the classification performance of the machine learning models on unknown data, we adopt the Raman spectra from 42 other cells (18 cells for H1299, 15 cells for H460, and 9 cells for A549) as the test set. For the SVM classifier with more than two classes, the model outputs the classification scores which represent the relative probabilities that every spectrum belongs to every class and takes the class with the maximum probability as the final classification result. Figure i shows the classification scores of every spectrum in the test set, which are predicted by the SVM models based on the entire spectral information. For the six-classes model, we visualize the 6D classification scores on 3D space using t-distributed stochastic neighbor embedding (t-SNE). t-SNE is a powerful technology to visualize high-dimensional data in low-dimensional space while preserving the original distribution of the data. As can be seen from the collection of data points of different colors and shapes, spectra belonging to different types were successfully separated. The results show that the SVM has good classification performance for all four models. Even for the complex six-classes model which distinguishes the cytoplasm and nucleus of different cell types, different types of data points can still be separated.

Figure 2

(i) Graphic classification results and (ii) ROC curves and AUC values of four SVM models based on different types of data sets: (a) cytoplasmic model, (b) nuclear model, (c) cell model, (d) six-classes model. The gray dashed lines represent the ROC curves for the completely random guesses. The classification model mostly provides the relative probability that the spectrum is positive. If the classification score is greater than the specified threshold, it will be distinguished as a positive sample; otherwise, it is a negative sample. As thresholds change, common performance indicators will be influenced, such as FPR and TPR. Therefore, we introduce the ROC curve to measure the performance of the model at full thresholds. As shown in Figure ii, we depicted the ROC curves of every model. The micro- and macroaverage ROC curves also were calculated with the ROC curves of every single classifier to evaluate the general performance of the classification model (see the details in Materials and Methods). Meanwhile, the AUC values were calculated to evaluate the model performance numerically. For every model, the AUC value of the classifier with H460 as the positive sample is significantly superior to the other classifiers. This result indicates that the classification performance of the model to H460 is better. All AUC values are higher than 0.8, which means that the classification performances of the four models are credible. The classification results are shown in Table . The results at the spectrum level refer to the percentage of all spectra that are accurately classified. The accuracies on the cell level are developing statistics of the result at the spectral level that select the class with the highest number of classification spectra in a cell as the final predicted class of this cell. Although the accuracy of the six-classes model is slightly lower than that of the other three models at the spectral level, it achieves 100% accuracy at the cellular level. Therefore, the six-classes model is considered to obtain the best classification results for cells.

Table 1

SVM Result of the Distinction between Three Subtypes of NSCLC

model	level	LOPOCV (%)	accuracy (%)
nuclear model	spectrum	92.5	91.9
nuclear model	cell		97.6
cytoplasmic model	spectrum	90.1	88.4
cytoplasmic model	cell		97.6
cell model	spectrum	90.5	89.2
cell model	cell		97.6
six-classes model	spectrum	85.7	83.3
six-classes model	cell		100

VAE Dimensionality Reduction of Cellular Data

In the fingerprint and C–H regions, there is rich information on biomolecular Raman bands, giving them the potential to distinguish different subtypes of cells. However, the abundant molecular species may cause feature redundancy, which means some bands are indistinguishable between different subtypes of cancer cells. Meanwhile, noise is another main factor affecting classification performance. The VAE was employed to mitigate the effects of feature redundancy and noise. Figure shows the schematic representation of VAE. The VAE consists of an encoder, which compresses the Raman spectral data into a 2D latent space, and a decoder, decoding the 2D data from the latent space into a spectrum. The encoder compresses the Raman spectrum into data points that fit a Gaussian distribution. The latent space formed by the neurons at the end of the encoder is often only two or three dimensions, which makes it insufficient to store randomly occurring noise signals, thus simultaneously achieving noise reduction and dimensionality reduction. The VAE uses a convolutional network architecture, and the specific architecture and training details are described in Materials and Methods.

Figure 3

Schematic representation of the principle of VAE. VAE consists of two parts including an encoder and a decoder. The encoder downscales the input Raman spectra into a 2D latent space. In the latent space, similar data points will be relatively closer. The decoder converts the downscaled 2D data points back to Raman spectra. As shown in Figure a, the data points of different classes, which are represented by different shapes and colors, are trained to fit the Gaussian distribution in the latent space. The spectra with similarities are grouped into the same cluster, while those that are not similar are separated. Figure b visualizes the train loss and test loss of the VAE with the increase of epochs. The loss represents the KL divergence and the root-mean-square errors between the input spectra and predicted spectra (see the details in Materials and Methods). Rapid reduction of both the train loss and test loss after 30 epochs represents that the fast convergence of the VAE model, which suggests strong reliability of the VAE for prediction. Figure c exhibits the comparisons between the VAE-generated spectra of the central points of every cluster in the latent space and the mean spectra from the same class of NSCLC spectra. The great similarity between the practical spectra and the VAE-generated spectra indicates that the clustering behavior of the VAE encoder reflects the similarity trend of the practical spectral distribution. In addition, the spectra generated by VAE have the characteristics of low noise like the average spectrum, indicating that VAE contributes to noise reduction. Meanwhile, the stochastic original spectra and the corresponding VAE-generated spectra shown in Figure S3 indicate that the VAE can perform noise filtering while preserving the spectral features in the process of encoding and compression.

Figure 4

Graphic representation of the VAE on the test set for different subtypes of NSCLC cells. (a) VAE-encoded space of NSCLC spectra. (b) Loss function of VAE on the training set and test set. (c) Experiment-acquired averaged Raman spectra (red) and VAE-generated spectra (blue) from the centers of different classes: (i) A549 cytoplasm; (ii) A549 nuclei; (iii) H460 cytoplasm; (iv) H460 nuclei; (v) H1299 cytoplasm; (vi) H1299 nuclei. (d) ROC curves of Gaussian NB and SVM with RBF kernel models based on VAE-encoded data and original spectra of subtypes of NSCLC. The gray dashed lines represent the ROC curves for the completely random guesses. The redundant features are confirmed to be removed by calculating both correlations between features and variances of features. The variances for every feature of original spectral data and VAE-encoded data in the train set were calculated to detect the significance of the features. Features with low variance usually include useless information. From Figure S4a, the variances of the spectral data are all significantly smaller than VAE-encoded data, and there are variances of numerous features close to 0. This result indicates that the useless features are removed by the dimension reduction of VAE. The Pearson correlation coefficient (PCC) was applied to compute the correlation between features. The absolute value of PCC is in the range of 0 to 1, and the closer to 1 shows the higher correlation. A strong correlation between features represents the duplication of the information contained in the features. Usually, an absolute value of PCC greater than 0.5 represents a relatively strong correlation.[65] The statistical results of PCC values suggest a relatively strong correlation between numerous features of the original spectral data (Figure S4b). In contrast, the correlation between the two features after VAE encoding is extremely weak (the absolute PCC is 0.036), which indicates the features are independent of each other. These results prove that the redundant features are effectively removed after dimension reduction by VAE.

Classification of VAE-Encoded Cellular Data

To verify the improvement of the classification performance by VAE, we trained the nine machine learning algorithms based on the same training set to classify the spectra of different subtypes of NSCLC cells through the entire spectral information and the data in latent space, including RF, MLP, SVM with linear kernel, SVM with RBF kernel, LR, KNN, Gaussian NB, Adaboost, and LDA. Because of the better classification performance at the cellular level, the six-classes model was used to evaluate the classification performance of every algorithm. For every classification model trained by entire spectral information, the feature selection method based on variance thresholding is used to remove low variance features. We only included 200 features before model training to prevent overfitting. As shown in Table , for each algorithm, the accuracy of the classifier trained using the VAE-encoded data was generally higher than that of the classifier trained using the entire spectral information. The accuracy improvement varies among different classification algorithms, indicating that there are discrepancies in the improvement effect of VAE on different algorithms. Raman spectra are high-dimensional data and have high correlations between adjacent features because the peaks possess a certain width. After dimension reduction by VAE, the dimension of data turns into two dimensions, and the features are relatively independent from each other. Therefore, for algorithms that are appropriate for low-dimensional or low-relevance data, their classification accuracies will be significantly improved after dimension reduction by VAE, such as Adaboost and Gaussian NB.[66] In contrast, for algorithms suitable for high-dimensional data, the improvements in classification accuracies after dimension reduction are not significant, such as MLP.[67] Among them, Gaussian NB achieved the best classification results for the VAE-encoded data, and SVM with the RBF kernel had the best results for the entire spectral information. A similar result can also be proven by the ROC curves of these two classification models in Figure d. SVM with the RBF kernel has a great advantage for the classification of nonlinear high-dimensional data. In contrast, Gaussian NB has better classification performance for low-dimensional data. The role of VAE is to compress high-dimensional data into low-dimensional data. Therefore, Gaussian NB produces better classification results for VAE-encoded data. As shown in Figure d, the ROC curves of the Gaussian NB and SVM with RBF kernel models trained with the entire spectral information are relatively weaker, indicating that the dimensionality reduction process of VAE has a beneficial performance on the classification models.

Table 2

Classification Result of Nine Models on Six Classes of NSCLC Cell Spectra with and without VAE

	without VAE		with VAE
model	LOPOCV (%)	accuracy (%)	LOPOCV (%)	accuracy (%)
Gaussian NB	81.4	80.7	89.6	89.6
RF	83.2	83.0	88.5	88.0
SVM (RBF)	85.7	83.3	88.4	86.3
SVM (linear)	83.7	83.3	88.3	88.6
LDA	84.3	83.6	87.9	87.7
LR	82.9	81.6	86.3	85.7
Ada	78.7	75.3	87.7	86.4
KNN	77.0	75.2	84.7	83.7
MLP	80.2	79.7	81.3	81.1

We also established four Gaussian NB models based on different data sets (nuclear model, cytoplasmic model, cell model, and six-classes model). The accuracies and ROC curves of these four Gaussian NB models on the test set are shown in Table S1 and Figure S5. It can be found that all four Gaussian NB models show superior performance to the SVM models (see Table ), and the accuracy can reach 100% on the cell level for each model. We also compared the influence of four dimension reduction methods on the classification performance including principal component analysis (PCA), t-SNE, uniform manifold approximation and projection (UMAP), and VAE (Table S2). The LDA, Gaussian NB, and SVM algorithms that perform effectively on low-dimensional data were selected for the classification of data after dimensionality reduction. The LOPOCVs and accuracies of the algorithms were computed to evaluate the classification performances (Table S2). The results showed that UMAP and VAE were significantly superior to PCA and t-SNE in improving the classification results. Although the improvement of LDA classification by UMAP is better than VAE, the combination of VAE and Gaussian NB still achieves the best classification performance. Therefore, compared with other three dimension reduction algorithms, VAE can effectively perform dimension reduction and improve the classification performance.

VAE Dimensionality Reduction of Kidney Cancer Data

To further validate the performance of Raman spectroscopy combined with VAE for cancer diagnosis on the tissue level, we collected Raman spectra of kidney cancer tissues from patients removed by surgery or puncture and established the machine learning models for in vitro cancer subtype diagnosis. Kidney cancer caused about 180,000 deaths, and the incidence is gradually increasing at a rate of 2.2% in 2020.[55] We retrieved tumor tissues from 16 CCRCC patients and 14 PRCC patients for evaluation. The scanning ranges were set to 200–1800 cm–1 for all tissue samples due to the weak Raman signal after the biological transparency window (Figure S1b). For each tissue, approximately 100 Raman spectra were acquired, but in the case of signal saturation, some of the spectra were excluded and typically only 30 spectra of the tumor tissue were employed to train and test the machine learning models after preprocessing. We calculated the PCC between the average spectra of 100 spectra and 30 spectra. The PCC shows that there is a strong correlation between the two mean spectra (PCCs > 0.95), which means that the tissue information derived from 30 spectra is almost equivalent to 100 spectra. Raman spectra of 70% of the patient samples (11 CCRCC and 10 PRCC) were used as the train set to train the VAE model and the classification models, and the remaining spectra were used to perform model performance tests. Figure shows the results achieved by the trained VAE model on the test set. As shown in Figure a, a commendable differentiation effect can be achieved for the data set of the two subtypes of kidney cancer in the latent space of VAE. Data points of the same subtype are clustered together, while data points of different subtypes are aggregated into different clusters. VAE maintains the original similarity relationship of the data well while performing dimensionality reduction of Raman spectra of kidney cancer. The convergences, consisting of the KL divergences and mean square errors between the decoded spectra and practical spectra in train and test processes, are shown in Figure b. Both the train loss and the test loss converge after 50 epochs, demonstrating that the VAE model is robust to the encoding and decoding process of kidney cancer data. Figure c demonstrates the comparisons between the VAE-generated spectra of the central points of different clusters in the latent space and the mean spectra of different subtypes of cancer. Similar to the cell results, the agreement between the mean spectra and generated spectra suggests that the VAE accomplishes both spectral information compression and noise reduction. The variances and correlation coefficients indicate that the useless and highly correlated features were removed after the dimension reduction of VAE (Figure S6).

Figure 5

Graphic representation of the VAE on the test set for two subtypes of kidney cancer. (a) VAE-encoded space of kidney cancer spectra. (b) Loss function of VAE on the training set and test set. (c) Experiment-acquired averaged Raman spectra (red) and VAE-generated spectra from the centers of different subtypes, (i) CCRCC and (ii) PRCC, of kidney cancer spectra (blue). Insets in panel (c) show pathological images of two subtypes. The scale bars are 1 mm. (d) Graphic classification results of RF based on the original spectra. (e) ROC curves of Gaussian NB and RF based on VAE-encoded data and original spectra of subtypes of kidney cancer.

Classification of Kidney Cancer

The same nine algorithms were trained based on the validation of LOPOCV to classify different subtypes of kidney cancer tissues through entire spectral information and VAE-encoded data in latent space, respectively. Table represents the classification results of nine models on VAE-encoded spectral data of kidney cancer. It shows that Gaussian NB achieves the best performance among the nine classification algorithms, with a differentiation accuracy of 81.4% and an AUC value of 0.835. In addition, the classification of the data using VAE after dimensionality reduction is generally better compared with the results obtained using the entire spectral information (Table S3). Figure d shows the classification effect of the RF, which achieves the best performance based on the entire spectral information on the test set. The two axes represent the probability that the spectral data belong to the two subtypes calculated by RF. The results demonstrated that the split between the data of the two subtypes was not apparent. We also compared the ROC curves of RF and Gaussian NB based on both the entire spectral information and VAE-encoded data (Figure e). The ROC curves of the Gaussian NB classifier based on the VAE-encoded data are significantly better than those of the other three classifiers, indicating that VAE could significantly improve classification performance with the help of Gaussian NB. Thus, it demonstrates that VAE performs effective information purification and noise reduction for the spectra of kidney cancer. At the same time, the accuracy of up to 81.4% proves that Raman spectroscopy combined with VAE is an effective tool for in vitro biopsy of subtypes of kidney cancer, therefore aiding the surgeon in designing the surgical plan.

Table 3

Classification Results of Nine Models on VAE-Encoded Data of Kidney Cancer Spectra

model	LOPOCV (%)	accuracy (%)	AUC
Gaussian NB	82.7	81.4	0.835
LDA	80.6	80.2	0.823
LR	79.3	78.9	0.795
MLP	79.2	78.4	0.806
SVM (RBF)	79.2	78.7	0.793
KNN	79.1	77.6	0.785
SVM (linear)	78.5	76.4	0.777
RF	78.5	77.3	0.783
Ada	73.4	72.6	0.747

Conclusions

In summary, we developed an analysis technique based on the label-free Raman spectroscopy to identify tumor subtypes. We applied an autoencoder network of VAE to attenuate the effects of high feature dimensionality and noise. VAE downscales Raman spectral data in high-dimensional space to a low-dimensional space (only two dimensions) and maintains the Gaussian distribution in the latter. The low-dimensional information in the latent space of VAE makes it impossible to store randomly occurring noise information for noise reduction purposes. Among nine types of machine learning algorithms combined with VAE, Gaussian NB achieved the best performance to classify the cell data from three subtypes of NSCLC and the tissue data from two subtypes of kidney cancer. In contrast to the conventional SVM algorithms, Gaussian NB improves the classification accuracy from 83.3 to 89.6% at the cellular level and from 77.5 to 81.4% at the tissue level. These results suggest that the analysis of Raman spectra using the VAE-combined machine learning algorithms can assist physicians in making rapid, noninvasive, and more accurate discrimination of the subtypes of cancer cells and tissues. It is of great significance for both the development of surgical plans and postoperative drug administration.

38 in total

1. Baseline correction using adaptive iteratively reweighted penalized least squares.

Authors: Zhi-Min Zhang; Shan Chen; Yi-Zeng Liang
Journal: Analyst Date: 2010-02-19 Impact factor: 4.616

2. Studies of chemical fixation effects in human cell lines using Raman microspectroscopy.

Authors: Aidan D Meade; Colin Clarke; Florence Draux; Ganesh D Sockalingum; Michel Manfait; Fiona M Lyng; Hugh J Byrne
Journal: Anal Bioanal Chem Date: 2010-01-20 Impact factor: 4.142

3. The effect of cell fixation on the discrimination of normal and leukemia cells with laser tweezers Raman spectroscopy.

Authors: James W Chan; Douglas S Taylor; Deanna L Thompson
Journal: Biopolymers Date: 2009-02 Impact factor: 2.505

4. Higher criticism thresholding: Optimal feature selection when useful features are rare and weak.

Authors: David Donoho; Jiashun Jin
Journal: Proc Natl Acad Sci U S A Date: 2008-09-24 Impact factor: 11.205

5. Pathologic diagnosis as the gold standard.

Authors: L B Rorke
Journal: Cancer Date: 1997-02-15 Impact factor: 6.860

Review 6. Machine learning: Trends, perspectives, and prospects.

Authors: M I Jordan; T M Mitchell
Journal: Science Date: 2015-07-17 Impact factor: 47.728

7. Label-free diagnosis of lung cancer with tissue-slice surface-enhanced Raman spectroscopy and statistical analysis.

Authors: Kun Zhang; Chunyan Hao; Yanyan Huo; Baoyuan Man; Chao Zhang; Cheng Yang; Mei Liu; Chuansong Chen
Journal: Lasers Med Sci Date: 2019-04-13 Impact factor: 3.161

8. Sensing of Hydrogen Sulfide Gas in the Raman-Silent Region Based on Gold Nano-Bipyramids (Au NBPs) Encapsulated by Zeolitic Imidazolate Framework-8.

Authors: Jiaming Chen; Longhua Guo; Lifen Chen; Bin Qiu; Guolin Hong; Zhenyu Lin
Journal: ACS Sens Date: 2020-12-04 Impact factor: 7.711

9. Intraoperative brain cancer detection with Raman spectroscopy in humans.

Authors: Michael Jermyn; Kelvin Mok; Jeanne Mercier; Joannie Desroches; Julien Pichette; Karl Saint-Arnaud; Liane Bernstein; Marie-Christine Guiot; Kevin Petrecca; Frederic Leblond
Journal: Sci Transl Med Date: 2015-02-11 Impact factor: 17.956

10. Sensitivity of Transmission Raman Spectroscopy Signals to Temperature of Biological Tissues.

Authors: Adrian Ghita; Pavel Matousek; Nick Stone
Journal: Sci Rep Date: 2018-05-30 Impact factor: 4.379

2 in total

1. Label-Free Differentiation of Cancer and Non-Cancer Cells Based on Machine-Learning-Algorithm-Assisted Fast Raman Imaging.

Authors: Qing He; Wen Yang; Weiquan Luo; Stefan Wilhelm; Binbin Weng
Journal: Biosensors (Basel) Date: 2022-04-15

2. A Machine Learning Framework for Detecting COVID-19 Infection Using Surface-Enhanced Raman Scattering.

Authors: Eloghosa Ikponmwoba; Okezzi Ukorigho; Parikshit Moitra; Dipanjan Pan; Manas Ranjan Gartia; Opeoluwa Owoyele
Journal: Biosensors (Basel) Date: 2022-08-02

2 in total