
Early lung cancer diagnostic biomarker discovery by machine learning methods.

Ying Xie1, Wei-Yu Meng1, Run-Ze Li1, Yu-Wei Wang1, Xin Qian2, Chang Chan2, Zhi-Fang Yu2, Xing-Xing Fan1, Hu-Dan Pan1, Chun Xie1, Qi-Biao Wu1, Pei-Yu Yan1, Liang Liu1, Yi-Jun Tang2, Xiao-Jun Yao3, Mei-Fang Wang4, Elaine Lai-Han Leung5.   

Abstract

Early diagnosis has been shown to improve the survival rate of lung cancer patients, and the availability of blood-based screening could increase screening uptake for early lung cancer. The present study sought plasma metabolites in Chinese patients that could serve as diagnostic biomarkers for lung cancer. We used an interdisciplinary approach, applied here to lung cancer for the first time, that combines metabolomics with machine learning methods to detect early diagnostic biomarkers. We enrolled 110 lung cancer patients and 43 healthy individuals. Levels of 61 plasma metabolites were obtained from a targeted metabolomic study using LC-MS/MS. A specific combination of six metabolic biomarkers discriminated stage I lung cancer patients from healthy individuals (AUC = 0.989, sensitivity = 98.1%, specificity = 100.0%). The top 5 metabolic biomarkers by relative importance, ranked with the FCBF algorithm, could also be potential screening biomarkers for early detection of lung cancer. Naïve Bayes is recommended as a practical tool for early lung tumor prediction. This research provides strong support for the feasibility of blood-based screening and offers a more accurate, quick and integrated tool for early lung cancer diagnosis. The proposed interdisciplinary method could be adapted to cancers other than lung cancer.
Copyright © 2020 The Authors. Published by Elsevier Inc. All rights reserved.


Keywords:  Biomarker; Early diagnosis; Lung cancer; Machine learning; Metabolites

Year:  2020        PMID: 33217646      PMCID: PMC7683339          DOI: 10.1016/j.tranon.2020.100907

Source DB:  PubMed          Journal:  Transl Oncol        ISSN: 1936-5233            Impact factor:   4.243


Introduction

Worldwide, lung carcinoma has been the leading cause of cancer death over the past few decades. In January 2019, the National Central Cancer Registry of China (NCCRC) released its latest nationwide tumor statistics, based on population-based registry data gathered from 368 tumor registries in 2015. According to this report, lung cancer has the highest incidence among malignant tumors in China. Moreover, the 5-year survival rate for patients with lung tumors is low, at 18%. However, the survival rate can increase to approximately 55% if lung cancer is diagnosed early. It has also been reported that early-stage patients who receive proper treatment can have a 5-year survival rate of around 40% [1]. Unfortunately, over 70% of patients are diagnosed only after their tumors have progressed to advanced stages, and most of them are not suitable for surgery [2]. This is partly because current early-stage diagnostic methods are still not sensitive or specific enough. It is therefore important to identify robust diagnostic biomarkers of lung cancer, particularly for the diagnosis of early lung tumor progression. Metabolomic studies have been used to identify the metabolic pathways and metabolites that regulate tumor progression and physiological function [3], [4], [5]. Metabolomics can provide information on the cellular metabolic processes that drive tumorigenesis and tumor progression. These metabolites can also help distinguish tumor stage, histological type, and even the response to drug treatment [6]. Such changes in metabolite patterns have been used to evaluate the clinical characteristics of colorectal tumor [7], ovarian tumor [8], renal tumor [9], oral tumor [10], and pancreatic tumor [11]. Even so, for lung cancer, more specific and sensitive biomarkers still need to be identified through metabolic analysis.
Artificial intelligence (AI) is the capacity of machines to imitate human behavior, and it is particularly adept at handling large amounts of data. Machine learning is an application of AI that allows computer systems to learn automatically from experience without being explicitly programmed [12]. Fundamentally, machine learning means using algorithms to parse data, learning from it, and then making a prediction or decision about new data sets [13]. In cancer, machine learning has already been used to build survival and prognostic prediction models in pancreatic cancer, bladder cancer, advanced nasopharyngeal carcinoma and breast cancer [14], [15], [16], [17]. In some cases, their performance has been comparable to that of human experts [18]. Machine learning can be seen as an approach to designing a model that learns from experience and improves its performance [19]. These models aim to find effective variables and the relationships between them. Over the past few years, the field of AI has moved from largely theoretical studies to real-world applications [20,21]. The application of AI in several domains is now associated with great expectations, yet a large gap remains in cancer research, especially in lung cancer. In this study, the major aim of the metabolomics research on lung cancer was to discover clinical metabolic biomarkers that show representative alterations between lung tumor patients and healthy individuals. We also looked for biomarkers that distinguish histological subtypes and disease stages, especially early stages. For the first time, we applied machine learning to plasma metabolite features to develop a diagnostic model for early-stage lung cancer.

Materials and methods

Patients and groups

A total of 110 patients and 43 healthy individuals from Hubei Taihe Hospital were included in this study. The Institutional Review Board of Taihe Hospital, Hubei University of Medicine approved the study involving patients and healthy individuals. All individuals provided written informed consent prior to participation in the investigation, with permission for sample collection, usage and data analysis. The final diagnosis was established from clinical symptoms and histopathological examination of operative specimens. According to the TNM staging system, patients were classified as stage I (n = 54), stage II (n = 31), or stage III (n = 25). Based on the WHO classification of tumors [22], the tumors were classified as adenocarcinomas (n = 63), squamous carcinomas (n = 41) and other histological types (n = 6).

Targeted metabolomic study using LC-MS/MS

The targeted metabolomic study was performed with previously reported LC-MS methods [23,24]. In brief, plasma samples (200 μL) were thawed on ice and mixed with N-ethylmaleimide PBS buffer (10 mM, 200 μL) and 1000 μL of methanol containing 10 ng/mL of the internal standard (IS) Phe-d5. The mixed solution was then incubated for 20 min at −20 °C and centrifuged at 13,000 rpm for 10 min at 4 °C. The supernatants were dried under nitrogen flow at 4 °C and reconstituted with 30% methanol for LC-MS analysis. Chromatographic separation was carried out on a Waters XBridge™ BEH C18 analytical column (2.5 μm, 3.0 × 100 mm; Waters, Torrance, CA) using a Waters ACQUITY UPLC coupled with a 4000 Q-TRAP mass spectrometer. The mobile phase consisted of 0.1% formic acid in water (solvent A) and methanol (solvent B), run with the following gradient program: 0–3.0 min (0–1% B); 3.0–10.0 min (1–3% B); 10.0–14.0 min (3–50% B); 14.0–18.0 min (50–95% B); 18.0–22.0 min (95–0% B); followed by a 3-min re-equilibration step. The flow rate was 0.6 mL/min, and 10 μL was injected for LC-MS. Mass spectra were acquired in both negative and positive electrospray ionization (ESI) modes with the following parameters: gas temperature, 450 °C; ion spray voltage, ±4500 V; ion source gas 1 (nebulizer gas), 40 psi (N2); ion source gas 2 (auxiliary gas), 40 psi (N2); curtain gas, 20 psi. Targeted MS/MS (MRM) mode was used with collision energies ranging from 10 V to 40 V. All LC-MS data were acquired with AB Analyst Software (Version 1.6.2). The intensity of each ion was normalized to the peak area of the IS prior to multivariate statistical analysis. For the metabolomic assay, principal component analysis (PCA) and orthogonal projection to latent structures discriminant analysis (OPLS-DA) were performed with SIMCA-p14 software (Umetrics AB, Umeå, Sweden) for the tested groups. Variables were first screened by VIP values, and those with values exceeding 1 were considered eligible for group discrimination.
The selected metabolites were further confirmed by a comparison between normal controls and disease controls, with a P-value of less than 0.05. MetaboAnalyst 3.0 was used for integrated analysis, pathway impact and enrichment.

Statistical analysis

All statistical analyses were performed using SPSS Statistics 22.0 (SPSS, Chicago, IL, United States), GraphPad Prism 8.0 (GraphPad Software, La Jolla, CA, United States), TBtools v0.6735 [25] and Orange software 3.23 [26]. Receiver operating characteristic (ROC) curve analysis was used to evaluate the diagnostic performance of metabolites. P values < 0.05 were considered statistically significant.
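
As an illustration of what the AUC from a ROC analysis measures, the sketch below computes it in its rank-based (Mann–Whitney) form: the probability that a randomly chosen patient has a higher marker level than a randomly chosen healthy individual, with ties counted as one half. This is not the SPSS or Orange implementation used in the study; the function name `auc_mann_whitney` and the sample values are hypothetical.

```python
def auc_mann_whitney(cases, controls):
    """AUC as the fraction of (case, control) pairs in which the case
    has the higher value; tied pairs count as one half."""
    if not cases or not controls:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case_value in cases:
        for control_value in controls:
            if case_value > control_value:
                wins += 1.0
            elif case_value == control_value:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical metabolite levels (arbitrary units), not data from the study.
patients = [30.1, 27.4, 25.0, 22.8, 31.6]
healthy = [20.3, 23.5, 19.9, 24.1]
auc = auc_mann_whitney(patients, healthy)
print(f"AUC = {auc:.3f}")
```

For a marker that is lower in patients than in healthy individuals, this function returns a value below 0.5, and the direction of the comparison would need to be reversed.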

Machine learning methods

Six machine learning techniques, K-nearest neighbor (KNN), Naïve Bayes, AdaBoost, Support Vector Machine (SVM), Random Forest, and Neural Network, were used with 10-fold cross-validation for early lung tumor prediction based on the metabolomic biomarker features. SVM is a classification algorithm that constructs a decision boundary between two categories, enabling labels to be predicted from feature vectors [27]. KNN is a simple nonparametric classification method that is often preferred when there is little prior knowledge of the data [28]. Random Forest is an ensemble tree method in which trees are grown by binary recursive splitting of right-censored data [29]. Naïve Bayes is a statistical classifier that predicts class membership probabilities [30]. It assumes that all variables contribute to the classification independently and provides the result for prediction [31]. A Neural Network aims to simulate neurons and the human brain. Its artificial neurons assign mathematical weights to the input features so that the network can eventually predict an output [32]. Eighty percent of the samples, comprising stage I lung tumor patients (n = 43) and healthy individuals (n = 35), were selected by stratified sampling from each group as the training set used to train all models uniformly. The remaining 20% of the samples, comprising stage I lung cancer patients (n = 11) and healthy individuals (n = 8), formed the test set used for evaluation. The training set was used to generate the prediction model, which was then used to predict the diagnosis of the test set. To test and compare the models, the sensitivity, specificity, precision, classification accuracy, and AUC (area under the curve) value of each model were used to measure performance.
The sensitivity, specificity, and classification accuracy values were derived from the counts of true negatives (TN), false negatives (FN), true positives (TP), and false positives (FP), defined as follows: TP = lung tumor patients correctly diagnosed as patients; FP = healthy individuals incorrectly identified as patients; TN = healthy individuals correctly identified as healthy; FN = lung tumor patients incorrectly identified as healthy.
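
The definitions above translate directly into the usual formulas: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and classification accuracy = (TP + TN) / (TP + FP + TN + FN). The following minimal sketch applies them to the KNN training-set counts reported in Table 2. The function name `diagnostic_metrics` is hypothetical, and the class-weighted precision is an assumption about how the precision column was computed; it happens to reproduce the tabulated values but is not stated in the text.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and class-weighted precision
    computed from confusion-matrix counts."""
    positives = tp + fn   # actual patients
    negatives = tn + fp   # actual healthy individuals
    total = positives + negatives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one patient and one healthy individual")
    # Precision of each predicted class; 0.0 if that class was never predicted.
    precision_patient = tp / (tp + fp) if (tp + fp) else 0.0
    precision_healthy = tn / (tn + fn) if (tn + fn) else 0.0
    return {
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
        "accuracy": (tp + tn) / total,
        # Assumption: precision averaged over both classes, weighted by class size.
        "weighted_precision": (positives * precision_patient
                               + negatives * precision_healthy) / total,
    }


# KNN on the training set, counts taken from Table 2.
knn_train = diagnostic_metrics(tp=38, fp=0, tn=35, fn=5)
for name, value in knn_train.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives a sensitivity of 0.884, a specificity of 1.000, and an accuracy of 0.936, matching the KNN training row of Table 2.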

Results

Metabolic biomarkers for detection of early lung tumor

Our study aimed to identify metabolites that could act as promising biomarkers for distinguishing lung tumor patients from healthy individuals, as well as disease stages and pathological patterns, with high sensitivity and specificity. First, stage I lung cancer patients (n = 54) and healthy individuals (n = 43) were separated by unsupervised hierarchical clustering, with the heat map shown in Fig. 1. By the Mann–Whitney U test, 46 of the 61 metabolites (Fig. 2A and Table S1) differed significantly between the groups (p-value < 0.05). Increased levels of l-Leucine, l-Valine, serine and 25 other metabolites were observed in stage I lung cancer patients compared with healthy individuals, whereas fumaric acid, citric acid, PC (36:4), and 15 other metabolites were downregulated. These 46 influential metabolites are potential markers for pre-clinical screening of lung tumor.
Fig. 1

Heatmap depicting the metabolomic biomarker levels of Stage I lung tumor patients (n = 54) and healthy people (n = 43). Stage I lung tumor patients and healthy people were grouped by hierarchical clustering of metabolomic biomarker levels.

Fig. 2

Metabolic biomarkers for detection of early lung tumor and evaluation of different histological types. (A) 46 influential metabolomic biomarkers with statistical significance of Stage I lung tumor patients (mean value with SD). Through Mann–Whitney U test, 46 influential metabolic biomarkers showed statistically significant difference (p-value<0.05) among 61 metabolites. (B) PCA of 10 metabolomic biomarkers in early lung tumor detection. It revealed a clear separation between stage I lung tumor patients and healthy individuals. (C) ROC curve of metabolomic biomarkers and combined variates in early lung tumor detection. The combination of six variates included proline, l-kynurenine, spermidine, amino-hippuric acid, palmitoyl-l-carnitine and taurine. (D) ROC analysis of metabolomic biomarkers and combined variates of adenocarcinoma (n = 63) and squamous carcinoma (n = 41) patients. The combination of four variates included hypoxanthine, l-Kynurenine, proline and Carnitine. SD, standard deviation. PCA, principal component analysis. ROC, receiver operating characteristic.

Next, these 46 influential metabolic biomarkers were used to construct ROC curves. Based on the AUC (area under the ROC curve) value, sensitivity and specificity, the top 10 metabolic biomarkers with higher diagnostic value (AUC > 0.800) are shown in Table 1 and Fig. S1. In addition, PCA (principal component analysis) with these 10 metabolites revealed a clear separation between stage I lung tumor patients and healthy individuals (Fig. 2B and Table S2).
In the comparison between stage I lung tumor patients and healthy individuals, proline showed the best AUC value of 0.923 (95% CI: 0.871–0.975), with a sensitivity of 79.6% and specificity of 93.0% at the cut off value of 24.350.
Table 1

ROC analysis of metabolomic biomarkers and combined variates.

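ROC analysis of metabolomic biomarkers and combined variates in early lung tumor detection.

| Variable | AUC | Std. error | Asymptotic 95% CI, lower bound | Asymptotic 95% CI, upper bound | Optimal cut off | Sensitivity | Specificity | Youden index |
|---|---|---|---|---|---|---|---|---|
| L-Kynurenine | 0.825 | 0.043 | 0.740 | 0.909 | 0.975 | 85.2% | 72.1% | 0.573 |
| Proline | 0.923 | 0.026 | 0.871 | 0.975 | 24.350 | 79.6% | 93.0% | 0.727 |
| Spermidine | 0.890 | 0.035 | 0.821 | 0.958 | 7.195 | 81.5% | 90.7% | 0.722 |
| Amino-hippuric acid | 0.811 | 0.045 | 0.722 | 0.900 | 4.035 | 68.5% | 93.0% | 0.615 |
| Palmitoyl-l-carnitine | 0.906 | 0.032 | 0.843 | 0.969 | 3.655 | 74.1% | 100.0% | 0.741 |
| Taurine | 0.920 | 0.032 | 0.856 | 0.983 | 71.300 | 88.9% | 95.3% | 0.842 |
| Phenylalanine | 0.848 | 0.038 | 0.774 | 0.922 | 125.500 | 79.6% | 76.7% | 0.564 |
| L-Valine | 0.876 | 0.036 | 0.806 | 0.946 | 167.000 | 68.5% | 95.3% | 0.639 |
| o-Tyr | 0.822 | 0.043 | 0.738 | 0.906 | 24.650 | 83.3% | 72.1% | 0.554 |
| Carnitine | 0.848 | 0.040 | 0.769 | 0.926 | 4.680 | 72.2% | 93.0% | 0.652 |
| Combination of two | 0.933 | 0.028 | 0.878 | 0.978 | 0.337 | 85.2% | 93.0% | 0.782 |
| Combination of three | 0.968 | 0.019 | 0.931 | 1.000 | −0.147 | 94.4% | 97.7% | 0.921 |
| Combination of six | 0.989 | 0.011 | 0.967 | 1.000 | −0.102 | 98.1% | 100.0% | 0.981 |

ROC analysis of metabolomic biomarkers and combined variates of adenocarcinoma and squamous carcinoma patients.

| Variable | AUC | Std. error | Asymptotic 95% CI, lower bound | Asymptotic 95% CI, upper bound | Optimal cut off | Sensitivity | Specificity | Youden index |
|---|---|---|---|---|---|---|---|---|
| L-Kynurenine | 0.423 | 0.060 | 0.306 | 0.540 | 1.050 | 77.8% | 24.4% | 0.022 |
| Proline | 0.580 | 0.057 | 0.469 | 0.692 | 35.150 | 54.0% | 65.9% | 0.198 |
| Carnitine | 0.536 | 0.058 | 0.422 | 0.650 | 6.835 | 38.1% | 75.6% | 0.137 |
| Hypoxanthine | 0.639 | 0.055 | 0.531 | 0.746 | 0.092 | 69.8% | 56.1% | 0.259 |
| Hippuric acid | 0.628 | 0.056 | 0.519 | 0.737 | 2.620 | 49.2% | 77.5% | 0.267 |
| Combination of four | 0.740 | 0.049 | 0.644 | 0.837 | 0.556 | 58.7% | 78.0% | 0.368 |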

Abbreviations: ROC, receiver operating characteristic; AUC, area under the curve.
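
The Youden index reported in Table 1 is sensitivity + specificity − 1, and an optimal cut-off is commonly chosen as the threshold that maximizes it. The sketch below illustrates that selection on hypothetical marker values. Scanning midpoints between adjacent observed values and treating values at or above the cut-off as positive are assumptions, not necessarily the conventions of the software used in the study.

```python
def youden_index(sensitivity, specificity):
    """Youden's J statistic for one operating point."""
    return sensitivity + specificity - 1.0


def best_cutoff(cases, controls):
    """Return (cutoff, J, sensitivity, specificity) maximizing Youden's J.
    Candidate cut-offs are midpoints between adjacent distinct values;
    a value >= cutoff is classified as a case."""
    values = sorted(set(cases) | set(controls))
    best = None
    for low, high in zip(values, values[1:]):
        cutoff = (low + high) / 2
        sensitivity = sum(v >= cutoff for v in cases) / len(cases)
        specificity = sum(v < cutoff for v in controls) / len(controls)
        j = youden_index(sensitivity, specificity)
        if best is None or j > best[1]:
            best = (cutoff, j, sensitivity, specificity)
    return best


# Taurine row of Table 1: sensitivity 88.9%, specificity 95.3%.
print(round(youden_index(0.889, 0.953), 3))

# Hypothetical marker levels (arbitrary units), not data from the study.
patients = [30.1, 27.4, 25.0, 22.8, 31.6]
healthy = [20.3, 23.5, 19.9, 24.1]
cutoff, j, sens, spec = best_cutoff(patients, healthy)
print(f"cut-off = {cutoff:.2f}, J = {j:.2f}")
```

Small differences from the tabulated Youden index, such as 0.726 against 0.727 for proline, are expected when the index is recomputed from rounded percentages.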

Moreover, combinations of metabolic biomarkers were evaluated by logistic regression analysis to enhance the sensitivity and accuracy of early-stage lung cancer diagnosis. As shown in Table 1, Table S3 and Fig. 2C, the combination of six variables (metabolites) raised the AUC to 0.989 (95% CI: 0.967–1.000, sensitivity = 98.1%, specificity = 100.0%). The metabolites used were proline, l-kynurenine (AUC = 0.825, sensitivity = 85.2%, specificity = 72.1%), spermidine (AUC = 0.890, sensitivity = 81.5%, specificity = 90.7%), amino-hippuric acid (AUC = 0.811, sensitivity = 68.5%, specificity = 93.0%), palmitoyl-l-carnitine (AUC = 0.906, sensitivity = 74.1%, specificity = 100.0%) and taurine (AUC = 0.920, sensitivity = 88.9%, specificity = 95.3%). These results indicate that these 6 metabolic biomarkers could act as a promising combination for early detection of lung tumor.

Metabolic biomarkers for disease progression

We were also interested in how metabolites change across disease stages. To identify changes in metabolite levels with tumor stage progression, the Kruskal–Wallis test was applied. Fig. 3A shows the 10 metabolites that differed significantly among stage I (n = 54), stage II (n = 31), and stage III (n = 25) lung tumor patients and healthy individuals (n = 43). Although there were statistically significant differences between lung tumor patients and healthy individuals (Fig. S2 and Table S4), the metabolic biomarkers performed poorly in discriminating among stage I, II, and III lung cancer patients (Fig. 3B–D, Table S5 and Table S6). Both the non-parametric test and the receiver operating characteristic curves suggested persistently abnormal metabolite levels in lung cancer patients compared with healthy individuals.
Fig. 3

Changes in metabolomic biomarkers with tumor stage progression. (A) Levels of the 10 metabolites that differed significantly among stage I (n = 54), stage II (n = 31), and stage III (n = 25) lung tumor patients and healthy individuals (n = 43). (B) ROC curve of metabolomic biomarkers of stage I (n = 54) lung tumor patients. (C) ROC curve of metabolomic biomarkers of stage II (n = 31) lung tumor patients. (D) ROC curve of metabolomic biomarkers of stage III (n = 25) lung tumor patients. ROC, receiver operating characteristic.


Metabolic biomarkers for evaluation of different histological types

Predicting the histological type of lung cancer, particularly distinguishing squamous carcinoma from adenocarcinoma, is an important diagnostic requirement in clinical practice. Several clinical studies have demonstrated that the toxicity and efficacy of treatment differ by tumor histological type [33]. Identifying the histological type will therefore help improve treatment efficiency. We used the metabolite levels of the adenocarcinoma (n = 63) and squamous carcinoma (n = 41) patients to construct ROC curves and performed the Mann–Whitney U test. Only 2 metabolites differed significantly between adenocarcinoma and squamous carcinoma: hippuric acid (p-value = 0.029) and hypoxanthine (p-value = 0.017). In the ROC analysis (Fig. S3), hypoxanthine showed an AUC value of 0.639 (95% CI: 0.531–0.746), with a sensitivity of 69.8% and specificity of 56.1% at a cut-off value of 0.092. Hippuric acid showed an AUC value of 0.628 (95% CI: 0.519–0.737), with a sensitivity of 49.2% and specificity of 77.5% at a cut-off value of 2.620. As shown in Fig. 2D and Table 1, the combination of four variates raised the AUC to 0.740 (95% CI: 0.644–0.837, sensitivity = 58.7%, specificity = 78.0%); the four variates were hypoxanthine, l-Kynurenine (AUC = 0.423, sensitivity = 77.8%, specificity = 24.4%), proline (AUC = 0.580, sensitivity = 54.0%, specificity = 65.9%) and Carnitine (AUC = 0.536, sensitivity = 38.1%, specificity = 75.6%). These results indicate that the metabolic biomarkers performed poorly in distinguishing lung tumor histological types in our study, since all AUC values were below 0.800, with poor sensitivity and specificity.

Utilization of machine learning methods

To develop the early lung tumor prediction model, we considered six machine learning techniques: K-nearest-neighbor (KNN), Naïve Bayes, AdaBoost, Support Vector Machine (SVM), Random Forest, and Neural Network (Fig. 4A). The training set of stage I lung tumor patients (n = 43) and healthy individuals (n = 35) was used to develop machine learning models based on the metabolic biomarker features. All 61 metabolites were used as features to develop the models on the Orange 3.23 platform. To determine which model would provide the most precise predictions on the lung tumor metabolomics data, the sensitivity, specificity, precision, classification accuracy, and AUC value of the six machine learning models were assessed (Supplementary methods). The best value for each evaluation metric is highlighted in Table 2. In the training set, Naïve Bayes and Neural Network gave better results than the other techniques (KNN, AdaBoost, SVM, and Random Forest). The precision, classification accuracy, specificity, sensitivity and AUC value of Naïve Bayes and Neural Network were all 100.0%. The AdaBoost technique performed poorly (precision = 0.885, classification accuracy = 0.885, specificity = 0.886, sensitivity = 0.884, and AUC = 0.885).
Fig. 4

Machine learning was applied to develop the diagnostic model for early stages of lung cancer. (A) Machine learning applications build early lung tumor prediction models. (B) To validate diagnosis performance of machine learning models and demonstrate the specificity of metabolic biomarker features of early lung cancer patients found in our study, we created a scrambled set that showed no predictive value. AUC, area under the curve. AdaBoost, Adaptive Boosting. SVM, support vector machines. KNN, k-nearest neighbor.

Table 2

Machine learning models used for early lung tumor detection based on the metabolomic biomarker features.

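| Set | Model | TP | FP | TN | FN | Classification accuracy | Sensitivity | Specificity | AUC | Precision |
|---|---|---|---|---|---|---|---|---|---|---|
| Training set | KNN | 38 | 0 | 35 | 5 | 0.936 | 0.884 | 1.000 | 1.000 | 0.944 |
| Training set | SVM | 43 | 2 | 33 | 0 | 0.974 | 1.000 | 0.943 | 1.000 | 0.975 |
| Training set | Random Forest | 41 | 0 | 35 | 2 | 0.974 | 0.953 | 1.000 | 1.000 | 0.976 |
| Training set | Neural Network | 43 | 0 | 35 | 0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Training set | Naïve Bayes | 43 | 0 | 35 | 0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Training set | AdaBoost | 38 | 4 | 31 | 5 | 0.885 | 0.884 | 0.886 | 0.885 | 0.885 |
| Test set | KNN | 9 | 0 | 8 | 2 | 0.895 | 0.818 | 1.000 | 1.000 | 0.916 |
| Test set | SVM | 10 | 0 | 8 | 1 | 0.947 | 0.909 | 1.000 | 1.000 | 0.953 |
| Test set | Random Forest | 11 | 2 | 6 | 0 | 0.895 | 1.000 | 0.750 | 1.000 | 0.911 |
| Test set | Neural Network | 10 | 0 | 8 | 1 | 0.947 | 0.909 | 1.000 | 1.000 | 0.953 |
| Test set | Naïve Bayes | 11 | 0 | 8 | 0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Test set | AdaBoost | 4 | 0 | 8 | 7 | 0.632 | 0.364 | 1.000 | 0.682 | 0.804 |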

Abbreviations: AdaBoost, Adaptive Boosting; SVM, support vector machines; KNN, k-nearest neighbor; TN, true negative; FN, false negative; TP, true positive; FP, false positive; AUC, area under the curve.

Subsequently, we used our trained diagnostic machine learning models to classify the test set, which consisted of stage I lung cancer patients (n = 11) and healthy individuals (n = 8), to evaluate their performance. As shown in Table 2, the Naïve Bayes model showed the best performance on all evaluation parameters. The specificity of Naïve Bayes, Neural Network, KNN, AdaBoost, and SVM was 1.000, which indicates good predictive power for healthy individuals. For overall quality of prediction, the AUC of Naïve Bayes, Neural Network, KNN, Random Forest, and SVM was 1.000. The sensitivity of Naïve Bayes and Random Forest was 1.000, and that of SVM and Neural Network was 0.909, indicating few false negatives (FN). In medical settings, false negatives matter more than false positives (FP) [34]. Consequently, these four models (Naïve Bayes, Random Forest, SVM and Neural Network) were appropriate for the diagnosis of early lung tumor. In classification accuracy, the Naïve Bayes model reached 1.000, and the SVM and Neural Network models reached 0.947. All three models are accurate enough to be considered for medical application.
Naïve Bayes is a simple probabilistic classifier based on applying Bayes' theorem with strong independence and normality assumptions between the variables [35]. It is one of the most effective machine learning algorithms and has been widely employed for prediction [36]. Given the above, our study showed that Naïve Bayes was the best model, with the highest sensitivity, specificity, and accuracy. Therefore, Naïve Bayes is recommended as a practical tool for early lung tumor prediction. The Fast Correlation-Based Filter (FCBF) algorithm is a supervised method based on information theory [37]. It both identifies features that are correlated with the class and eliminates redundant features [38]. Based on the symmetrical uncertainty (SU), FCBF ranks features in descending order of correlation and discards redundant features that are less correlated [39]. Consequently, an optimal subset of correlated and non-redundant features is obtained, which could maximize the diagnostic potential of the extractable information. The relative importance of each metabolic biomarker feature was computed using the FCBF algorithm. All metabolite features were ranked and scored according to their ability to discern the classification label of an object (Table S7). According to the ranking, the top 8 metabolic biomarkers were used to develop the machine learning models, and Table S8 and Fig. S4 show how the AUC values changed with the number of variates. When the top 5 variates (taurine, Palmitoyl-l-carnitine, proline, 2-DG, and PE (36:4)) were used as metabolite features, the prediction model performed excellently, similar to the previous models. Therefore, these 5 metabolic biomarkers could be potential candidates for pre-clinical screening of lung cancer.
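
The symmetrical uncertainty on which FCBF relies can be written as SU(X, Y) = 2 · IG(X | Y) / (H(X) + H(Y)), where H is the Shannon entropy and IG the information gain; it ranges from 0 (independent) to 1 (one variable fully determines the other). The sketch below computes SU for discrete variables only. It is not the full FCBF procedure and not the Orange implementation; the assumption that continuous metabolite levels have first been discretized (here into hypothetical "high" and "low" bins) is ours.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((count / n) * log2(count / n)
                for count in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG / (H(X) + H(Y)), with IG = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    joint_entropy = entropy(list(zip(x, y)))
    information_gain = hx + hy - joint_entropy
    return 2.0 * information_gain / (hx + hy)


# Hypothetical discretized feature values and class labels (1 = patient).
labels = [1, 1, 0, 0]
informative = ["high", "high", "low", "low"]    # tracks the label exactly
uninformative = ["high", "low", "high", "low"]  # independent of the label
su_informative = symmetrical_uncertainty(informative, labels)
su_uninformative = symmetrical_uncertainty(uninformative, labels)
print(su_informative, su_uninformative)
```

FCBF would rank features by their SU with the class label and then drop a feature when its SU with an already selected feature is at least as large as its SU with the label.
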
Then, based on the results of the logistic regression analysis, the combination of the top three variates (taurine, Palmitoyl-l-carnitine, and proline) showed an AUC value of 0.968 (95% CI: 0.931–1.000), with a sensitivity of 94.4% and specificity of 97.7% in the classical analysis (Table 1). To validate the diagnostic performance of the machine learning models and demonstrate the specificity of the metabolic biomarker features of early lung cancer patients found in our study, we performed a control experiment [40]. We created a scrambled training set by replacing the correctly labeled training set with one in which the labels were randomly assigned. As expected, the accuracy of our machine learning models dropped markedly to a level equivalent to random guessing, showing no predictive value and validating the specific predictive signature of the metabolic biomarker features (Fig. 4B).
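
The label-scrambling control can be sketched as follows: train once on correctly labeled data, then repeatedly retrain on the same features with shuffled labels and compare test accuracy. The nearest-centroid classifier, the single synthetic feature, and the number of repetitions are all stand-in assumptions chosen to keep the example short; none of them is one of the six models or the data used in the study.

```python
import random


def fit_centroids(values, labels):
    """Mean feature value for each class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


def predict(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def accuracy(centroids, values, labels):
    hits = sum(predict(centroids, v) == lab for v, lab in zip(values, labels))
    return hits / len(labels)


# Synthetic, well-separated feature values (1 = patient, 0 = healthy).
train_x = [9.5, 10.2, 10.8, 11.1, 9.9, 10.4, 0.4, 1.1, 0.8, 1.5, 0.2, 0.9]
train_y = [1] * 6 + [0] * 6
test_x = [10.0, 9.7, 10.6, 0.5, 1.2, 0.7]
test_y = [1, 1, 1, 0, 0, 0]

true_accuracy = accuracy(fit_centroids(train_x, train_y), test_x, test_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
scrambled = []
for _ in range(500):
    shuffled_y = train_y[:]
    rng.shuffle(shuffled_y)  # break the link between features and labels
    scrambled.append(accuracy(fit_centroids(train_x, shuffled_y), test_x, test_y))
mean_scrambled_accuracy = sum(scrambled) / len(scrambled)

print(f"correct labels: {true_accuracy:.2f}")
print(f"scrambled labels (mean of 500 runs): {mean_scrambled_accuracy:.2f}")
```

Averaging over many shuffles matters here: a single scrambled run of such a simple classifier can land far above or below chance, whereas the mean settles near 0.5 for balanced classes.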

Discussion

Lung cancer is the leading cause of cancer-related mortality worldwide, and early diagnosis could improve the survival rate. At present, high-risk people are generally recommended annual radiologic screening by low-dose computed tomography (LDCT) [41], which is also the only method currently used clinically for lung cancer detection. Owing to its significant cost and high false-discovery rate [42], uptake of CT screening is unsatisfactory. The availability of blood-based screening, including detection of plasma metabolic biomarkers, could therefore increase screening uptake among lung cancer patients. Our present study attempted to discover plasma metabolites in Chinese patients as predictive biomarkers for lung cancer diagnosis. In this work, we used an interdisciplinary approach, applied to lung cancer for the first time, to detect early lung cancer diagnostic biomarkers by combining metabolomics and machine learning methods. The highly sensitive and accurate metabolomics technology and the machine learning methods distinguish our work from previous studies that used miRNAs as biomarkers for early lung cancer diagnosis. Many previous studies of breast cancer, prostate cancer and other cancers have screened miRNAs as biomarkers for distinguishing cancer patients and different tumor subtypes [43]. Nevertheless, miRNA screening has limitations, such as high cost and technical monopolization [44]. In addition, routine clinical tests of blood antigens such as CEA, CA125 and SCC still suffer from low sensitivity and low accuracy [45]. At the same time, small biopsy specimens involve trauma and can yield false negatives because of the limited sampling site [46]. In our present study, we focused on identifying metabolic biomarkers for the diagnosis of early-stage lung cancer. Herein, we identified a new range of metabolites by determining their plasma profiles in patients prior to clinical diagnosis of lung tumor.
Among the 61 metabolites, we found 10 metabolic biomarkers that could act as promising biomarkers for early detection of lung tumor. In particular, the combination of six variates raised the AUC to 0.989 with a sensitivity of 98.1% and specificity of 100.0%, which has not been reported so far. Based on these results, we recommend the specific combination of these six metabolites as a screening biomarker for early detection of lung tumor. We believe that this finding could allow us to develop a specific, sensitive, and minimally invasive tool for early lung tumor prevention and prediction. To date, machine learning approaches, which provide a promising alternative to classical data analysis methods, have been used in varied biomedical applications, including drug discovery [47] and biomarker development [48]. Unlike conventional types of data analysis, which require prior knowledge of biological dependencies, machine learning can improve diagnostic capability through abundant, high-quality data. One team collected 23 items of demographic data and tumor-related parameters from 102 cervical cancer patients who had undergone radical hysterectomy, and investigated diverse machine learning models to predict the 5-year survival rate of patients [49]. Another group used colposcopy findings and human papillomavirus (HPV) biomarkers from 2267 women to develop a clinical decision support scoring system based on artificial neural networks (ANN) for cervical intraepithelial neoplasia patients, in which the ANN predicted with higher accuracy than cytology with or without an HPV test [50]. Moreover, it has been shown that machine learning can be used to identify a serum microRNA panel as a potential biomarker for the detection of gastric cancer [51]. Six types of machine learning techniques were used to select three biomarkers (miR-21-5p, miR-29c-3p, and miR-22-3p) from a published miRNA profiling study (GSE23739).
Herein, we used metabolic biomarkers as machine-learning features for lung cancer diagnosis for the first time, and the results obtained were analyzed and discussed. The five metabolic biomarkers with the highest relative importance (taurine, Palmitoyl-l-carnitine, proline, PE (36:4) and 2-DG), as ranked by the FCBF algorithm, could be potential candidates for pre-clinical screening of lung cancer. In this study, several machine learning models were applied and compared, and were evaluated on a test set and in a control experiment (Fig. 4B and Table 2). Naïve Bayes achieved the best assessment values, while Neural Network and SVM also achieved decent values. For standalone screening models, high sensitivity is desirable to minimize false negatives. As shown in Table 2, the Naïve Bayes, Random Forest, Neural Network, and SVM models, with their high sensitivity, have dependable and stable potential for early lung tumor prediction. Furthermore, the specificity thresholds defined in the training set for the Naïve Bayes, Neural Network, and SVM models achieved a similar specificity when applied to the test set, indicating that the models are well calibrated. Taken together, Naïve Bayes, Neural Network, and SVM models based on metabolic biomarker features may be useful for the diagnosis of early lung tumor. These results provide strong support for the feasibility of blood-based screening based on metabolomics technology and machine learning for early lung cancer diagnosis.
Cancer progression is strongly related to cellular metabolism, and dysregulated metabolism is conducive to tumor initiation and progression [52]. Cancer cells alter their metabolic pathways, which require specific enzymes to catalyze biochemical reactions, to meet the growing demands of rapid cell proliferation and division.
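The threshold check described in the model evaluation above (fixing a specificity threshold on the training set, then measuring specificity again on the test set) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy scores and the 80% target are all assumptions made purely for illustration, and the scores stand in for the predicted probabilities of any of the classifiers.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(labels: Sequence[int], scores: Sequence[float],
                            threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity); a score >= threshold is called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_specificity(labels: Sequence[int], scores: Sequence[float],
                              target: float) -> float:
    """Lowest observed score that, used as a threshold, reaches the target
    specificity on the given (training) data."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target specificity must be in (0, 1]")
    # Specificity can only rise as the threshold rises, so scan upwards.
    # The infinite fallback calls nobody positive (specificity 1.0).
    for candidate in sorted(set(scores)) + [float("inf")]:
        _, spec = sensitivity_specificity(labels, scores, candidate)
        if spec >= target:
            return candidate
    raise AssertionError("unreachable: the infinite threshold always qualifies")


# Synthetic scores (1 = lung cancer, 0 = healthy); NOT data from the study.
train_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
train_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
test_labels = [1, 1, 1, 0, 0, 0, 0, 0]
test_scores = [0.85, 0.5, 0.3, 0.45, 0.35, 0.2, 0.1, 0.05]

# Fix the threshold on the training set only, then reuse it unchanged.
threshold = threshold_for_specificity(train_labels, train_scores, target=0.8)
train_sens, train_spec = sensitivity_specificity(train_labels, train_scores, threshold)
test_sens, test_spec = sensitivity_specificity(test_labels, test_scores, threshold)
print(f"threshold={threshold}, train spec={train_spec:.2f}, test spec={test_spec:.2f}")
```

If the specificity on the held-out data stays close to the value fixed during training, the threshold transfers well. A large drop would suggest that the score distribution of healthy individuals differs between the two sets.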
Metabolites, including the metabolic biomarkers for lung cancer diagnosis identified in our study, have functions related to tumorigenesis and tumor progression. Proline dehydrogenase (PRODH), which catalyzes the first step of proline degradation, is activated by lymphoid-specific helicase (LSH) to decrease proline levels [53]. Proline catabolism involving PRODH has been shown either to promote tumor survival through ROS-induced autophagy or ATP production, or to act as a tumor suppressor by initiating ROS-mediated apoptosis, depending on the tumor microenvironment [54,55]. A recent study found that PRODH promotes lung cancer tumorigenesis by inducing the expression of IKKα-dependent inflammatory genes and epithelial to mesenchymal transition (EMT) [56]. High levels of l-kynurenine could provide a microenvironment for lung tumor growth by initiating T-cell apoptosis, inhibiting T-cell proliferation and leading to immune tolerance [57,58]. Spermidine/spermine N1-acetyltransferase (SSAT) is the pivotal protein involved in the synthesis and homeostasis of spermidine [59]. SSAT catalyzes the rate-limiting step in the catabolism of polyamines, which play a particular role in maintaining the membrane potential and regulating cell volume [60,61]. Recent studies have reported that SSAT is upregulated in lung cancer [62]. 2-Deoxy-d-glucose (2-DG), a glucose analogue, is converted to 2-DG-P by hexokinase. 2-DG-P cannot be metabolized further, but it can allosterically inhibit hexokinase, a rate-limiting enzyme of glycolysis [63]. By blocking glycolysis, 2-DG influences various biological processes: it can inhibit N-linked glycosylation, increase oxidative stress, and efficiently suppress cell growth and invasion [64]. The amino acid 2-aminoethanesulfonic acid, commonly known as taurine, has widespread physiological effects and has been confirmed as an endogenous anti-injury substance [65].
Taurine can up-regulate the expression of N-acetyl galactosaminyl transferase 2, down-regulate the expression of matrix metalloproteinase-2, and inhibit potential invasion and metastasis [66]. Previous studies have proposed that changes in taurine levels could be used to predict the malignant transformation and formation of breast, bladder and colorectal tumors [67], [68], [69].
In March 2020, Chabon et al. integrated genomic features for non-invasive early lung cancer detection [70], providing an initial demonstration that machine learning methods can be used for lung cancer detection. Based on cell-free DNA (cfDNA) features, the researchers developed and prospectively validated a machine-learning method termed ‘lung cancer likelihood in plasma’ (Lung-CLiP), which could discriminate early lung cancer patients from controls. In a screening test, they observed a sensitivity of 63% in stage I lung cancer patients with 80% specificity. By comparison, the Naïve Bayes, Neural Network, and SVM models developed in our study, which use metabolic biomarkers as features, achieved sensitivity and specificity > 85%. Although our machine learning models are less accurate than LDCT, this strategy could potentially increase the total number of patients screened. As this strategy progresses into clinical trials, abundant sample data will allow performance to be improved by using more advanced machine-learning algorithms.
On the other hand, our current study still has several limitations. First, our analysis was built on data from a single healthcare institution in a limited geographic region, and, apart from non-small cell lung cancer, the number of patients with other types of lung cancer was inadequate. Therefore, further confirmatory studies at other institutions are necessary prior to implementation. Second, our data only included metabolite levels.
More information on lung tumor patients and healthy individuals, such as age, smoking history, concurrent tumor diagnoses, and past medical history, would be helpful for further study. We propose that integrating plasma metabolic biomarkers with CT screening or other lung tumor features could further improve performance. In any case, a suitable way to apply this strategy in clinical practice still needs to be worked out, for example by combining it with an electronic chip system so that the plasma tests and model application run as an assembly line.
One potential application of our approach is as a first-line screening step for some lung cancer patients. Despite being candidates for LDCT as high-risk individuals, these patients are not being screened because of concerns about false positives, limited access and other reasons. Patients who test positive would then be referred to LDCT screening. Additionally, by modifying the machine learning methods and incorporating features appropriate for other cancer types, we expect that it will be feasible to develop strategies combining metabolomics and machine learning for the diagnosis of a diverse range of malignancies.
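The referral pathway proposed above, in which only people with a positive blood test go on to LDCT, can be made concrete with a short back-of-the-envelope sketch. Every input here is a hypothetical placeholder rather than a result of this study: the cohort size and the 1% prevalence are invented for illustration, and 0.85 is used only because it is the lower bound quoted for sensitivity and specificity. The function name `triage_counts` is likewise an assumption.

```python
def triage_counts(n_screened: int, prevalence: float,
                  sensitivity: float, specificity: float) -> dict:
    """Expected outcomes when a blood test is the first step and only
    test-positive people are referred on to LDCT."""
    cases = n_screened * prevalence
    non_cases = n_screened - cases
    true_pos = cases * sensitivity                # cases correctly referred
    false_pos = non_cases * (1.0 - specificity)   # healthy people referred
    referred = true_pos + false_pos
    return {
        "referred_to_ldct": referred,
        "missed_cases": cases - true_pos,
        # Share of referred people who actually have lung cancer.
        "ppv": true_pos / referred if referred > 0 else float("nan"),
    }


# Hypothetical scenario: 10,000 high-risk people, 1% of whom have lung cancer.
example = triage_counts(n_screened=10_000, prevalence=0.01,
                        sensitivity=0.85, specificity=0.85)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because healthy individuals vastly outnumber cases in any screening population, the number of LDCT referrals in such a calculation is driven mainly by specificity, while the number of missed cases is driven by sensitivity.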

Conclusions

A pioneering interdisciplinary method was proposed in this study to detect early lung cancer diagnostic biomarkers by combining metabolomics and machine learning methods. Metabolic biomarkers demonstrate significant diagnostic strength for early detection of lung tumor.

CRediT authorship contribution statement

Ying Xie: Methodology, Investigation, Writing - review & editing. Wei-Yu Meng: Writing - original draft, Methodology, Formal analysis. Run-Ze Li: Writing - original draft, Conceptualization, Investigation. Yu-Wei Wang: Software, Data curation. Xin Qian: Investigation, Resources. Chang Chan: Investigation, Resources. Zhi-Fang Yu: Investigation, Resources. Xing-Xing Fan: Resources. Hu-Dan Pan: Resources. Chun Xie: Resources, Project administration. Qi-Biao Wu: Resources. Pei-Yu Yan: Resources. Liang Liu: Resources, Supervision. Yi-Jun Tang: Resources, Supervision. Xiao-Jun Yao: Conceptualization, Writing - review & editing, Funding acquisition. Mei-Fang Wang: Conceptualization, Resources, Writing - review & editing. Elaine Lai-Han Leung: Conceptualization, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.