Meiling Sheng1, Zhaohui Dong2, Yanping Xie3. 1. Department of Respiration, Jinhua People's Hospital, Jinhua, Zhejiang 321000, China. 2. Department of Intensive Care Unit, First Hospital of Huzhou, First Affiliated Hospital of Huzhou University, Huzhou, Zhejiang 313000, China. 3. Department of Respiratory Medicine, First Hospital of Huzhou, First Affiliated Hospital of Huzhou University, Huzhou, Zhejiang 313000, China, xieyp011@163.com.
Abstract
BACKGROUND: Lung cancer is a severe cancer with a high death rate. The 5-year survival rate for stage III lung cancer is much lower than that for stage I. Early detection and intervention can significantly increase the survival time of lung cancer patients. However, conventional lung cancer-screening methods, such as chest X-rays, sputum cytology, positron-emission tomography (PET), low-dose computed tomography (CT), magnetic resonance imaging, and gene-mutation, -methylation, and -expression biomarkers of lung tissue, are invasive, involve radiation exposure, or are expensive. Liquid biopsy is non-invasive and does little harm to the body. It can reflect early-stage dysfunctions of tumorigenesis and enable early detection and intervention. METHODS: In this study, we analyzed RNA-sequencing data of tumor-educated platelets (TEPs) in 402 non-small-cell lung cancer (NSCLC) patients and 231 healthy controls. A total of 48 biomarker genes were selected with the minimal-redundancy maximal-relevance (MRMR) and incremental feature-selection (IFS) methods. RESULTS: A support vector-machine (SVM) classifier based on the 48 biomarker genes predicted NSCLC with leave-one-out cross-validation (LOOCV) sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) of 0.925, 0.827, 0.889, and 0.760, respectively. Network analysis of the 48 genes revealed that the WASF1 actin cytoskeleton module, PRKAB2 kinase module, RSRC1 ribosomal protein module, PDHB carbohydrate-metabolism module, and three intermodule hubs (TPM2, MYL9, and PPP1R12C) may play important roles in NSCLC tumorigenesis and progression. CONCLUSION: The 48-gene TEP liquid-biopsy biomarkers may facilitate early screening of NSCLC and thereby help prolong the survival of cancer patients.
Lung cancer is a severe cancer with a high death rate.1,2 Early detection of lung cancer is the most effective way to increase survival time, since survival time is directly associated with lung cancer stage and patients treated early have better prognoses.3 The 5-year survival rates for stage I and stage III lung cancer patients are 67% and 23%, respectively.3 The survival difference between early-stage and late-stage lung cancer is huge. Therefore, early screening of lung cancer is the key to lung cancer prevention and therapy.
Conventionally, lung cancer is detected through chest X-rays, sputum cytology, positron-emission tomography (PET), low-dose computed tomography (CT), and magnetic resonance imaging.4 However, many diagnosed patients are already in late stages.5 Although PET and CT are achieving progressively higher resolutions and can detect smaller tumors, they involve radiation exposure and are expensive.
In recent years, sequencing technologies have developed rapidly. It has been found that tumor tissue can release small numbers of tumor cells, DNA, RNA, or exosomes into blood. These tumor cells in blood are called circulating tumor cells (CTCs).6 CTCs can now be isolated, and the DNA and RNA within them sequenced accurately.7 Other types of liquid-biopsy components include ctDNA, ctRNA, exosomes, and tumor-educated platelets (TEPs).8 Tumor-derived exosomes contain various molecules, such as dsDNA and small RNA, and can reflect the status of tumor cells.9 TEPs are blood platelets that contain tumor RNAs.10 They are a great source of tumor-derived RNAs. Several studies have shown that TEP RNAs can be cancer biomarkers.10–12 Liquid biopsy has become ever more important in early lung cancer detection and is one of the foundations of personalized medicine.13 It can reflect early-stage dysfunctions of tumorigenesis and enable early detection and intervention.
In this study, we analyzed RNA-sequencing data of TEPs in 402 non-small-cell lung cancer (NSCLC) patients and 231 healthy controls. By comparing their expression differences with the minimal redundancy, maximal relevance (MRMR) method, differentially expressed genes were ranked. Then, with incremental feature selection (IFS), optimal biomarkers were selected. Finally, a support vector machine (SVM) classifier based on the optimal biomarkers was constructed and evaluated. TEP biomarkers could be a useful way to enable early intervention in lung cancer patients and prolong their survival.
Methods
Blood gene-expression profiles of NSCLC
Blood gene-expression profiles of NSCLC patients were downloaded from the Gene Expression Omnibus with accession number GSE89843 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE89843).14 There were 402 NSCLC samples and 231 healthy control samples. Samples with chronic pancreatitis, epilepsy, multiple sclerosis, insignificant atherosclerosis, pulmonary hypertension, stable angina pectoris, and unstable angina pectoris were excluded. Expression levels of 4,722 genes in TEPs were measured using RNA sequencing. We considered the 402 NSCLC samples as positive samples, the 231 healthy control samples as negative samples, and the expression levels of the 4,722 genes as classification features. The goal was to identify the differentially expressed genes between NSCLC and healthy controls and construct an effective TEP-biomarker-based NSCLC classifier. The workflow of TEP-biomarker-based NSCLC-classifier construction is shown in Figure 1. First, TEP data were preprocessed as a matrix with rows of samples and columns of genes. Then, all genes were ranked with the MRMR method.15 Next, using the ranked-gene list, the IFS method18–23 was applied to optimize the biomarker-gene set. Finally, biomarkers were determined and the final SVM classifier constructed. Each step is described in the following sections.
Figure 1
Workflow of TEP biomarker-based NSCLC classifier construction.
Notes: First, TEP data were preprocessed as a matrix with rows of samples and columns of genes. Then, genes were ranked with the MRMR method. After MRMR, genes were all ranked. Then, with the ranked-gene list, incremental feature selection was adopted to optimize the biomarker-gene set. Finally, biomarkers were determined and the final SVM classifier constructed.
Biomarker-gene selection based on MRMR and IFS methods
We used the MRMR method15 to rank the genes based on their relevance to the sample labels (NSCLC or healthy control) and their redundancy with one another. To illustrate this method clearly, let $\Omega$, $\Omega_s$, and $\Omega_t$ represent the complete set of candidate genes for biomarker ranking, the $m$ already-selected biomarker genes, and the $n$ to-be-selected genes, respectively. The relevance of gene $g$ from $\Omega_t$ with sample type $t$ can be measured with mutual information ($I$):16,17

$$I(g, t) = \iint p(g, t) \log \frac{p(g, t)}{p(g)\,p(t)} \, dg \, dt$$

With mutual information defined, the redundancy ($R$) of gene $g$ with the selected biomarker genes in $\Omega_s$ can be calculated:

$$R(g) = \frac{1}{m} \sum_{g_i \in \Omega_s} I(g, g_i)$$

To select the best gene $g_j$ from $\Omega_t$ that maximizes its relevance with sample type $t$ and minimizes its redundancy with the selected biomarker genes in $\Omega_s$, we need to maximize the MRMR function:

$$\max_{g_j \in \Omega_t} \left[ I(g_j, t) - \frac{1}{m} \sum_{g_i \in \Omega_s} I(g_j, g_i) \right], \quad j = 1, 2, \ldots, n$$

After all rounds of evaluation, a ranked-gene list can be obtained:

$$S = \{ g'_1, g'_2, \ldots, g'_h, \ldots, g'_N \}$$

where $N$ is the total number of candidate genes in $\Omega$. The position $h$ of a gene in this ranked list reflects the trade-off between its relevance to the sample classes, ie, whether a sample is NSCLC, and its redundancy with the biomarker genes selected before it, ie, genes with smaller index values. Genes nearer the top are better candidates than genes nearer the bottom.
To reduce computational complexity, we analyzed only the top 500 MRMR genes. To determine how many genes should be selected to form the optimal biomarkers, we adopted the IFS method18–23 and constructed 500 SVM classifiers. In this study, we used the SVM function with default parameters from the R package e1071 (https://cran.r-project.org/web/packages/e1071) to build the SVM classifier. Each time, a candidate gene set $S_k = \{ g'_1, g'_2, \ldots, g'_k \}$ ($1 \le k \le 500$) consisting of the top $k$ genes in the MRMR list was used to build the SVM classifier. The performance of the top-$k$-gene classifier was evaluated with leave-one-out cross-validation (LOOCV). Finally, an IFS curve was plotted, with the number of top genes on the x-axis and the LOOCV Matthews correlation coefficients (MCCs) of the classifiers on the y-axis. Based on the IFS curve, we can decide how many genes should be used to build a classifier with high performance and low complexity. Usually, the peak or the change point of the IFS curve is chosen.
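The greedy ranking described above can be illustrated with a minimal sketch. It is written in plain Python for discretized expression values and is not the C/C++ mRMR software used in this study; the function names `mutual_information` and `mrmr_rank`, the gene names, and the toy data are all hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log(p(x,y) / (p(x) p(y)))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def mrmr_rank(features, labels):
    """Greedy MRMR ranking; `features` maps gene name -> discretized values."""
    relevance = {g: mutual_information(v, labels) for g, v in features.items()}
    redundancy_sum = {g: 0.0 for g in features}  # running sum of I(g, g_i)
    selected, remaining = [], set(features)
    while remaining:
        def score(g):
            # relevance minus mean redundancy with the genes already selected
            red = redundancy_sum[g] / len(selected) if selected else 0.0
            return relevance[g] - red
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
        for g in remaining:
            redundancy_sum[g] += mutual_information(features[g], features[best])
    return selected

# Toy data (invented): 1 = NSCLC, 0 = healthy control
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "geneA": [1, 1, 1, 0, 0, 0, 0, 0],  # informative
    "geneB": [1, 1, 1, 0, 0, 0, 0, 0],  # exact copy of geneA, fully redundant
    "geneC": [1, 1, 0, 1, 0, 0, 1, 0],  # weaker but complementary
}
ranking = mrmr_rank(features, labels)
print(ranking)
```

In this toy case geneB is as relevant as geneA but is ranked last, because once geneA has been selected its redundancy penalty outweighs its relevance.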
Prediction-performance evaluation of SVM classifier
As mentioned, LOOCV,24,25 also known as jackknife testing, was used to evaluate the prediction performance of each SVM classifier. LOOCV runs for n rounds, where n is the number of samples. In each round, one sample is held out for testing while the classifier is trained on all the other samples, so that after n rounds every sample has been tested exactly once. LOOCV is widely used to evaluate prediction performance.26 Although the independent test has also been widely used, the selection of independent-test samples is arbitrary, and sometimes the choice of different validation cohorts may lead to totally different conclusions, as the validation samples may have different distributions from the training samples.26 Cross-validation can overcome these problems.26
By comparing the predicted sample classes with the actual sample classes, sensitivity (Sn), specificity (Sp), accuracy (ACC), and MCC were calculated to evaluate prediction performance:

$$\mathrm{Sn} = \frac{TP}{TP + FN}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN stand for true positive (NSCLC predicted as NSCLC), true negative (healthy control predicted as healthy control), false positive (healthy control predicted as NSCLC), and false negative (NSCLC predicted as healthy control), respectively. Since the sizes of positive (NSCLC) and negative (healthy control) samples were imbalanced in this study, MCC was a better measurement than ACC, because MCC takes both sensitivity and specificity into account.27
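The four measures follow directly from the confusion-matrix counts. A minimal sketch, using the standard definitions of Sn, Sp, ACC, and MCC and the counts reported in Table 2; the function name `classification_metrics` is hypothetical.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}

# Counts from Table 2: 372 NSCLC and 191 healthy controls predicted correctly,
# 40 healthy controls predicted as NSCLC, 30 NSCLC predicted as healthy controls
metrics = classification_metrics(tp=372, tn=191, fp=40, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these reproduce the reported values of 0.925, 0.827, 0.889, and 0.760.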
Results and discussion
Genes showing different expression patterns between NSCLC and healthy controls
We obtained the top 500 most discriminative genes of NSCLC and healthy control samples using the MRMR method. The MRMR method is based on information theory. Mutual information is used to measure relevance and redundancy. It has been widely used in the bioinformatics field.28–32 We used a C/C++ version of MRMR software (http://home.penglab.com/proj/mRMR/) to apply the gene-ranking process. Unlike statistical test methods, such as the t-test for case–control experiment design and ANOVA for multiple-group design, MRMR not only considers the relevance between genes and sample classes but also redundancy between genes.
Optimal biomarkers identified from MRMR gene list with IFS methods
After MRMR analysis, we applied the IFS procedure to select the optimal number of top MRMR genes to form the biomarker-gene set. The relationship between the number of genes and prediction MCCs was plotted as an IFS curve (Figure 2). The LOOCV MCC peaked at 0.764 when 266 genes were used, but it had already reached 0.760 when only the top 48 genes were used. To balance a small gene set against a high prediction MCC, we chose the top 48 genes as the optimal biomarker-gene set, since adding genes beyond 48 did not increase the MCC appreciably. The 48 genes are shown in Table 1.
Figure 2
IFS curve showing how prediction performance improved when more and more genes were used to construct the classifier.
Notes: The IFS curve explained the relationship between the number of genes and prediction performance, assessed by MCCs in this study. The x-axis denotes the number of top genes that were used to construct the SVM classifier, and the y-axis denotes the LOOCV MCCs of the classifiers. The highest MCC was achieved when 266 genes were used. However, after 48 genes were used, the IFS curve entered the plateau area and did not increase much, even when increasing numbers of genes were included. To consider both model complexity and model performance, we chose the 48 genes as the optimal biomarker-gene set.
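The trade-off between the peak and the start of the plateau can be made explicit with a simple rule: take the smallest gene count whose MCC lies within a tolerance of the peak. This is only a sketch of one possible rule, not the procedure used in this study, where the cutoff was chosen from the curve. The function name `choose_gene_count` and the tolerance of 0.005 are assumptions, and only the MCCs at 48 and 266 genes are reported values; the other points are invented placeholders.

```python
def choose_gene_count(mcc_by_k, tolerance=0.005):
    """Return (peak_k, compact_k) for an IFS curve given as {gene count: MCC}."""
    # Peak of the curve; ties are broken in favor of fewer genes
    peak_k = max(mcc_by_k, key=lambda k: (mcc_by_k[k], -k))
    peak = mcc_by_k[peak_k]
    # Smallest gene count whose MCC is within `tolerance` of the peak
    compact_k = min(k for k, v in mcc_by_k.items() if v >= peak - tolerance)
    return peak_k, compact_k

curve = {
    10: 0.700,   # invented placeholder
    48: 0.760,   # reported in the text
    100: 0.761,  # invented placeholder
    266: 0.764,  # reported in the text (peak)
    500: 0.750,  # invented placeholder
}
peak_k, compact_k = choose_gene_count(curve)
print(peak_k, compact_k)
```

With these values the rule returns 266 as the peak and 48 as the compact choice, matching the selection described above.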
The 48 genes were chosen based on MRMR and IFS methods. To evaluate their prediction power objectively, we calculated LOOCV sensitivity, specificity, accuracy, and MCC. The confusion matrix of predicted sample classes and actual sample classes is shown in Table 2. LOOCV sensitivity, specificity, accuracy and MCC of the 48-gene classifier were 0.925, 0.827, 0.889, and 0.760, respectively.
Table 2
Confusion matrix of predicted sample classes and actual sample classes using 48 genes
| | Actual NSCLC | Actual healthy controls |
|---|---|---|
| Predicted NSCLC | 372 | 40 |
| Predicted healthy controls | 30 | 191 |
Abbreviation: NSCLC, non-small-cell lung cancer.
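A confusion matrix of this kind is what a LOOCV loop accumulates, one held-out prediction at a time. The sketch below shows only the loop structure. To stay dependency-free it uses a nearest-centroid classifier as a stand-in for the SVM, which is an assumption and not the classifier used in this study; the function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier: return the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(sorted(centroids), key=lambda lab: sq_dist(centroids[lab], sample))

def loocv_confusion(xs, ys, positive=1):
    """Hold out each sample once, train on the rest, and tally the predictions."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        pred = nearest_centroid_predict(train_x, train_y, xs[i])
        if pred == positive:
            counts["TP" if ys[i] == positive else "FP"] += 1
        else:
            counts["FN" if ys[i] == positive else "TN"] += 1
    return counts

# Toy data (invented): two features per sample, 1 = NSCLC, 0 = healthy control
xs = [[2.0, 2.0], [2.2, 1.8], [1.8, 2.1], [0.2, 0.1],
      [0.0, 0.1], [0.1, -0.1], [-0.2, 0.0], [0.3, 0.2]]
ys = [1, 1, 1, 1, 0, 0, 0, 0]
counts = loocv_confusion(xs, ys)
print(counts)
```

The fourth toy sample is a positive that sits among the negatives, so it is the single false negative in the tally.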
To demonstrate the discriminative power of these 48 genes for NSCLC and healthy control samples more intuitively, we drew a heat map using these 48 genes (Figure 3). Even without an advanced machine-learning algorithm such as SVM, simple hierarchical clustering grouped most NSCLC and healthy control samples into the right clusters. Upregulation and downregulation patterns of these 48 genes differed clearly between NSCLC and healthy control samples.
Figure 3
Heat map of NSCLC and healthy control samples using the selected 48 genes.
Notes: The NSCLC and healthy control samples were hierarchically clustered using the 48 selected genes. There were very clear clusters of NSCLC and healthy controls. Most samples were grouped into the right cluster.
Abbreviation: NSCLC, non-small-cell lung cancer.
Biological significance of the 48 biomarker genes
To explore the regulatory mechanisms of the 48 genes, we mapped them onto Search Tool for the Retrieval of Interacting Genes/Proteins (STRING),33 a comprehensive and widely used protein functional association network.34–39 The subnetwork of these 48 genes extracted from STRING is shown in Figure 4, with selected genes highlighted in red. It can be seen that there were several modules on the network that were circled together.
Figure 4
Modules and intermodule hubs of biomarker genes on STRING network.
Note: Four modules (WASF1 module, PRKAB2 module, RSRC1 module, and PDHB module) and three intermodule hubs (TPM2, MYL9, and PPP1R12C) were revealed on the biomarker subnetwork.
Abbreviation: STRING, Search Tool for the Retrieval of Interacting Genes/Proteins.
On the bottom left is the WASF1 module, which included MYO5A and WASF1. These two genes both interacted with NCKAP1, CYFIP2, and CYFIP1. In this WASF1 module, four genes (CYFIP1, CYFIP2, NCKAP1, and WASF1) were involved in the KEGG pathway hsa04810 (regulation of actin cytoskeleton). It has been reported that the actin cytoskeleton is associated with lung cancer migration and invasion.40
The WASF1 module interacted with the PRKAB2 module and PDHB module through the intermodule hubs. There were three intermodule hubs: TPM2, MYL9, and PPP1R12C. They connected the WASF1 actin-cytoskeleton module, the PRKAB2 kinase module, and the PDHB carbohydrate-metabolism module. Interestingly, these intermodule hubs ranked significantly higher than the intramodule genes. TPM2, PPP1R12C, and MYL9 ranked fifth, 12th, and 25th, respectively (Table 1). These intermodule hubs are understudied. Only one study has suggested that MYL9 is downregulated in NSCLC and may be associated with tumorigenesis of NSCLC.41 Unlike findings from traditional lung cancer-tissue analysis, these intermodule hubs may reflect an earlier dysfunction in NSCLC and are worth further investigation.
In the PRKAB2 module, PRKAB2 is a family member of AMPK. AMPK is a key pathway in NSCLC and engages in cross talk with the EGFR pathway to sensitize the response of NSCLC cells to lung cancer therapeutics, such as erlotinib treatment.42 The PDHB module included PDHB, MLH3, and SLC38A1. Functional analysis of these modules using GATHER43 suggested that seven members (ACLY, CS, DLAT, DLST, OGDH, PDHA2, and PDHB) were involved in GO:0006092 main pathways of carbohydrate metabolism, with P<0.0001 and Bayes factor of 21. Reprogramming of cellular energy metabolism is one of the hallmarks of cancer,44 and cancer cell growth and proliferation need a lot of energy. The module was significantly enriched in carbohydrate metabolism. MLH3 and SLC38A1 were less connected with these carbohydrate-metabolism genes than PDHB. Also, it has been reported that the haplotype MSH3 was associated with lung cancer45 and that SLC38A1 was significantly overexpressed in NSCLC.46
At the top middle was the RSRC1 module, which included RSRC1 and FLOT1. Within this module, eight genes (RPS11, RPS14, RPS15, RPS26, RPS28, RPS3, RPS3A, and RPS9) that RSRC1 interacted with were ribosomal protein genes. Ribosomes are important for protein biosynthesis, and there have been several reports that downregulation of ribosomal proteins can inhibit or attenuate NSCLC growth and migration.47–49 They have also been considered oncogenes of NSCLC.49 Another gene was FLOT1. It has been reported that in NSCLC, the expression of FLOT1 was abnormal and correlated with tumor progression and poor survival.50
To summarize, the possible biological mechanism of the NSCLC TEP biomarkers is shown in Figure 5. The intermodule hub genes, including TPM2, MYL9, and PPP1R12C, stitched together the WASF1 module, which regulated actin cytoskeleton, the PRKAB2 module, which was involved in the AMPK–EGFR pathway, and the PDHB module, which was involved in carbohydrate metabolism. The PDHB module interacted with the RSRC1 module, which was associated with protein biosynthesis, growth, and migration.
Figure 5
Possible biological mechanism of the NSCLC TEP biomarkers.
Notes: Intermodule-hub genes, including TPM2, MYL9, and PPP1R12C, stitched together the WASF1 module, which regulated actin cytoskeleton, the PRKAB2 module, which was involved in the AMPK–EGFR pathway, and the PDHB module, which was involved in carbohydrate metabolism. The PDHB module interacted with the RSRC1 module, which was associated with protein biosynthesis, growth, and migration.
Early detection of lung cancer is critical for NSCLC patients, since early-stage patients have much longer survival than late-stage patients. Unfortunately, conventional lung cancer-screening methods, such as chest X-rays, sputum cytology, PET, CT, and magnetic resonance imaging, are invasive, involve radiation exposure, or are expensive. Liquid biopsy makes early detection possible, since CTCs, ctDNA, ctRNA, exosomes, and TEPs reflect early changes during tumorigenesis. By analyzing TEP RNA-sequencing data of NSCLC patients and healthy controls, we identified 48 TEP biomarkers. An SVM classifier built on these biomarkers predicted NSCLC with a LOOCV MCC of 0.760. In-depth biological network analysis suggested that there were four modules and three intermodule hubs that may be involved in triggering NSCLC. Our results provide novel insights into tumorigenesis and a useful tool for early detection and treatment of NSCLC.