Literature DB >> 32420331

Epigenome-Wide Tobacco-Related Methylation Signature Identification and Their Multilevel Regulatory Network Inference for Lung Adenocarcinoma.

Yan-Mei Dong1, Ming Li2, Qi-En He1, Yi-Fan Tong1, Hong-Zhi Gao3, Yi-Zhi Zhang4, Ya-Meng Wu2, Jun Hu4, Ning Zhang2, Kai Song1.   

Abstract

Tobacco exposure is one of the major risks for the initiation and progress of lung cancer. The exact corresponding mechanisms, however, are mainly unknown. Recently, a growing body of evidence has been collected supporting the involvement of DNA methylation in the regulation of gene expression in cancer cells. The identification of tobacco-related signature methylation probes and the analysis of their regulatory networks at different molecular levels may be of a great help for understanding tobacco-related tumorigenesis. Three independent lung adenocarcinoma (LUAD) datasets were used to train and validate the tobacco exposure pattern classification model. A deep selecting method was proposed and used to identify methylation signature probes from hundreds of thousands of the whole epigenome probes. Then, BIMC (biweight midcorrelation coefficient) algorithm, SRC (Spearman's rank correlation) analysis, and shortest path tracing method were explored to identify associated genes at gene regulation level and protein-protein interaction level, respectively. Afterwards, the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis and GO (Gene Ontology) enrichment analysis were used to analyze their molecular functions and associated pathways. 105 probes were identified as tobacco-related DNA methylation signatures. They belong to 95 genes which are involved in hsa04512, hsa04151, and other important pathways. At gene regulation level, 33 genes are uncovered to be highly related to signature probes by both BIMC and SRC methods. Among them, FARSB and other eight genes were uncovered as Hub genes in the gene regulatory network. Meanwhile, the PPI network about these 33 genes showed that MAGOH, FYN, and other five genes were the most connected core genes among them. These analysis results may provide clues for a clear biological interpretation in the molecular mechanism of tumorigenesis. Moreover, the identified signature probes may serve as potential drug targets for the precision medicine of LUAD.
Copyright © 2020 Yan-Mei Dong et al.

Entities:  

Mesh:

Substances:

Year:  2020        PMID: 32420331      PMCID: PMC7201762          DOI: 10.1155/2020/2471915

Source DB:  PubMed          Journal:  Biomed Res Int            Impact factor:   3.411


1. Introduction

Lung cancer has been the leading cause of cancer-related mortality throughout the world for decades [1]. It has been shown that smokers 15–30 times more likely to get lung cancer or die from it than lifetime never-smokers. Moreover, smoking during cancer therapy may influence radiotherapy and chemotherapy outcomes [2]. Tobacco exposure is the major risk for lung cancer; however, 10% to 15% of patients still have no history of it [3, 4]. Being the major form of lung cancer associated with never-smokers [5-7], when considering therapies for lung adenocarcinoma (LUAD) patients, the carcinogenic mechanisms of smokers are believed to differ from those of never-smokers. The rising trend in the proportion of never-smokers in LUAD urgently requires the understanding of such differences on molecular levels for the development of precision medicine. DNA methylation, a useful and stable surrogate of the genetic response, has recently been suggested to be one of the potential mechanisms for smoking-related health outcomes [8-11]. Many studies have demonstrated that smoking can induce genomic instability by producing genetic mutations and altering epigenetic modifications. Recently, epigenome-wide association studies (EWAS) have opened a new avenue for lung cancer research. Hypermethylation of promoter regions has frequently been observed in smokers with and without lung cancer [12]. The cumulative smoking dose (pack-years) has been shown to correlate well with the frequency of methylation in cancer-free heavy smokers [13]. However, the highly reproducible DNA methylation markers linked to smoking and their corresponding mechanism remain unclear. In our previous study, we identified 48 genes whose promoter DNA methylation levels were highly related to tobacco exposure for LUAD patients. In that paper, the average methylation value of the probes located in a gene's promoter area was used as the methylation value of that gene. Then, such average methylation values of protein-coding genes were used as variables to do the tobacco exposure classification. In short, our previous study was limited in analyzing promoter methylation levels of protein-coding genes. Recent studies, however, show that DNA methylation in the body areas of genes might be involved in differential promoter usage and also in transcription elongation and alternative splicing [14-16]. Gene body methylation has long been ignored but is more prevalent in the genome than promoter hypermethylation. It may be associated with increased gene expression. Therefore, investigating every probe as an individual variable rather than investigating only promoter probes of protein-coding genes may uncover real signature markers. Correspondingly, the aim of this study is twofold: (1) to identify signature methylation probes highly related to tobacco exposure from hundreds of thousands of epigenome-wide probes and (2) to uncover the important genes related to these signature probes from different molecular levels, i.e., gene regulatory level and protein-protein interaction level. Figure 1 and Figure S1 show the corresponding flowchart of our study.
Figure 1

Flowchart of current research. Cited from “Genetic and Environmental Contributions to Functional Connectivity Architecture of the Human Brain.”

2. Materials and Methods

2.1. Data

Two GEO (Gene Expression Omnibus) datasets and one TCGA (The Cancer Genome Atlas) dataset were used for training and validation, respectively. They are summarized in Table 1.
Table 1

The summary of the clinical information of LUAD samples in all datasets.

LUADDiscovery cohortValidation cohort
GSE39279TCGAGSE66836
Number of samples145172121
Age40–9033–8439–84
Gender
 Females769166
 Males698155
Smoking history
 Current102105105
 Never436716
Stage
 Stage I848970
 Stage II244924
 Stage III302325
 Stage IV7102
 Not applicable1
Smoking pack-years
 0–509958Unknown
 51–1003519Unknown
 101–12528Unknown
 Not applicable987Unknown

The ages of 4 smokers and 5 nonsmokers in TCGA data are unknown.

These two GEO datasets with access numbers of GSE39279 and GSE66836 were downloaded from https://www.ncbi.nlm.nih.gov/geo/ [17]. GSE39279 was used as the training data, while GSE66836 was used as a validation data. Their methylation levels (β-value) of 485,577 CpG-probes were collected via the Infinium Human Methylation 450 BeadChip. Their clinical information was collected from the supplemental document of reference [18]. TCGA dataset was downloaded through the public Genomic Data Commons Data Portal (GDC: https://portal.gdc.cancer.gov/). The level 3 methylation data of LUAD patients measured by Infinium Human Methylation 450 BeadChip was used as the second validation sample set. Additionally, the level 3 mRNA expression data measured by the Illumina HiSeq 2000 RNA Sequencing Version 2 platform was also downloaded and used in our study. For LUAD patients, we shall use the following definitions for tobacco usage and exposure [19]: Current smoker: an adult who has smoked at least 100 cigarettes in their lifetime and who currently smokes cigarettes or has quit within the previous 12 months. Never smoker: an adult who has never smoked or has smoked less than 100 cigarettes in their lifetime. Ever smoker: an adult who has smoked at least 100 cigarettes in their lifetime (irrespective of whether they are currently smoking).

2.2. Statistic Methods

2.2.1. Jarque–Bera Test

In statistics, the Jarque–Bera test is a goodness-of-fit test to decide whether sample data have the skewness and kurtosis matching a normal distribution [20]. Due to the unknown distribution of the probes, Jarque–Bera test was used to test their goodness-of-fit of the normal distribution.

2.2.2. Wilcoxon Rank-Sum Test

The Wilcoxon rank-sum test (also called the Mann–Whitney-Wilcoxon (MWW), Mann–Whitney U test, or Wilcoxon-Mann–Whitney test) is a nonparametric test determining whether two independent samples were selected from populations having the same distribution. Even though the t-test is one of the most widely used statistical procedures, but it assumes that the variable in question is normally distributed in the two groups. When this assumption is in doubt, the nonparametric Wilcoxon rank-sum test is suggested as an alternative to the t-test, which does not rely on the distributional assumptions [21]. Wilcoxon rank-sum test with Benjamini-Hochberg correction, a powerful tool to control the false discovery rate (FDR), was used to select probes whose patterns are significantly different between current-smokers and never-smokers.

2.2.3. Partial Least Squares (PLS)

PLS [22] is a widely used algorithm for modeling the relationship between sets of observed variables by means of latent variables. It comprises regression and classification tasks as well as dimension reduction and modeling [23, 24]. Instead of finding hyperplanes of minimum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables (i.e., classification labels) and the observed variables (methylation values of probes in our case) to a new lower space [25-27]. Therefore, it performs very well for the analysis of high-dimension-small-sample data in bioinformatics.

2.2.4. Biweight Midcorrelation Coefficient (BIMC) Algorithm

BIMC (biweight midcorrelation coefficient) algorithm was proposed by Yuan et al. [28]. Pearson's correlation coefficient is a representative method of correlation coefficient approaches. However, Pearson's correlation coefficient is sensitive to outliers. Biweight midcorrelation is considered to be a good alternative to Pearson's correlation coefficient since it is more robust to outliers.

2.3. Spearman's Rank Correlation

Unlike Pearson's correlation coefficient, Spearman's correlation coefficient is more accurate in describing the correlation in the nonlinear relationship and is not sensitive to significant outliers in samples.

2.3.1. Differential Correlation Strategy

After calculating the correlation coefficients using BIMC or SRC [28], the Differential Correlation Strategy (DCS) was used to identify genes that are significantly associated with the identified signature probes throughout the whole genome genes. To do this, the genome-wide expression (GE) data and the epigenome-wide methylation (ME) data were used together as the input variables to BIMC or SRC to calculate their correlation coefficients. Afterwards, the correlation coefficients were used as input variables to the DCS algorithm. In this study, the two thresholds TS1 and TS2 of DCS were optimized as 0.3 and 0.6, respectively.

2.4. Data Preprocessing

For all datasets, genes whose GE or ME values were missing in all samples were removed. The GE data was log 2 transformed. Patient samples lacking any of the important clinical parameters (age, gender, cancer stage, overall survival time, and vital status) were removed. Genes and probes on X and Y chromosomes were removed. As a result, 16050 genes remained in GE data and 374176 probes were available in ME data, respectively.

2.5. Signature Probe Identification

The big challenge in bioinformatics, especially in epigenomic analysis, is the tremendous overwhelming imbalance between the variable size and the sample size. In our case, there were several thousands of probes but less than 180 samples in each dataset. The number of variables was more than 20 times the number of samples. It is not possible to identify signature probes in one step. Therefore, a new deep selecting method was proposed to identify tobacco-related signature probes. Learning from the concept of “Deep Learning” methods for extracting features step by step, the proposed deep selecting method based on PLS extracts features and removes the least important variables step by step. It consists of the following steps: (1) using Wilcoxon rank-sum test to roughly select probes whose methylation levels were significantly different between smokers and never-smokers; (2) initiating a tobacco exposure pattern classification model with methylation values of all significantly different probes as input variables; (3) sorting probes according to their contributions to the classification model; (4) removing the least important probes; (5) remodeling the classification model; (6) repeating steps 3–6 iteratively until the classification accuracy could not be improved any more. Then, the remaining probes are considered as the signature probes since using only their ME values can accurately predict the tobacco exposure pattern of patients. Figure S1 shows the corresponding pipeline of it. Fivefold cross-validation was explored to train the classification model. The sensitivity (SN), specificity (SP), and accuracy (ACC) were used to evaluate the classification performance of the model. Please see the supplementary document for their definitions.

2.6. Uncovering Related Genes at the Gene Regulatory Level

To identify genes significantly related to our identified signature probes, BIMC and SRC integrated into the DCS methods were explored with the GE and ME values of the whole genome-wide genes as input variables, respectively. Their corresponding gene regulatory network was visualized using OmicShare, a free online platform for data analysis (http://www.omicshare.com/).

2.7. Uncovering Related Genes at the Protein-Protein Interaction Level

The betweenness of a shortest path protein is the number of shortest paths across the protein. Then, the shortest path proteins were ranked by betweenness in descending order. The proteins whose betweenness was greater than 3,000 were picked out and their corresponding genes were treated as genes related to our identified signature probes at the PPI level. The Dijkstra algorithm served to find the shortest path in the graph G between two given proteins, which was implemented in the R package “igraph” [29]. In order to ensure the validity and precision of our results, we randomly chose 134 proteins in the PPI network for the shortest path tracing and repeated the procedure 100 times, and a permutation test was performed. Then, we removed 5 genes that appear more frequently in randomized results, which were TP53, AKT1, HSP90AA, CTNNB1, and UBC [30, 31]. A weighted PPI network [32] among related genes was obtained from the STRING (Search Tool for the Retrieval of Interacting Genes) database (version 10.0). The STRING database is a database for searching known and predicted interactions between proteins (http://string.embl.de/) [33]. The ID of the human species is 9,606. There are 8,548,002 pairs of related interaction forces in it. The R package “STRINGdb” was used to map the corresponding protein IDs of the identified signature genes. Moreover, in this network, proteins are denoted as nodes and the interaction of every two proteins is given as an edge marked with a d score. The d value can be considered to represent the protein distances to each other: a smaller distance value indicates that the protein pair has a higher interaction confidence score, which means they may have more analogous functions.

2.8. Genes and Genomes Pathway Analysis and Gene Ontology Enrichment Analysis

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database resource for understanding high-level functions and utilities of biological systems, such as cells, organisms, and ecosystems, from molecular-level information. Such information is obtained from large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies [34]. To do KEGG pathway analysis, the obtained differential gene identifiers are transformed into Ensemble ID by DAVID online analysis tool (https://david.ncifcrf.gov/) [35]. KEGG pathway analysis and GO (Gene Ontology) enrichment analysis were performed to analyze the functions of genes to which the signature probes belong. To do this, the KOBAS (http://kobas.cbi.pku.edu.cn/) online analyzing tool was used [36-38]. The flowchart of tobacco exposure pattern signature probe identification and their multilevel regulatory network analysis is shown in Figure 1 and Figure S1. All analyses were performed using MATLAB or R codes. Please refer to supplementary materials for more details.

3. Results

3.1. Identified Signature Probes and Their Classification Performance

According to the result obtained by the Jarque–Bera test, only 16.08% of 374176 probes were normally distributed. Therefore, the nonparametric Wilcoxon rank-sum test with the Benjamini-Hochberg FDR was used to roughly select probes whose methylation patterns in smokers and never-smokers were significantly different. As a result, 3,557 probes were selected at this step. Then, the proposed deep selecting method based on PLS was used to identify the real signature probes from these 3557 probes. Considering the prediction accuracies of all training and validation datasets, 105 probes were eventually identified as signatures because using only their methylation values can achieve the best classification performance. Table 2 lists their probe IDs and the corresponding genes. Other details are listed in Table S1 in the supplementary materials including their median methylation values in smoking and nonsmoking groups and their P values (Wilcoxon rank-sum test) between these two groups, respectively.
Table 2

List of 105 signature methylation probes.

NumberProbeChromosomeGene symbolFeature type
1cg00334821chr7LIMK1
2cg00370022chr15CYP1A1N_Shelf
3cg00688979chr6CCHCR1
4cg00702638chr3KIAA1143; KIF15Island
5cg00976097chr5AHRRIsland
6cg00993400chr1AL672294.1; ZNF692S_Shore
7cg01049916chr10FGFR2
8cg01637537chr22FAM83FN_Shore
9cg01971034chr1N_Shelf
10cg02050426chr18CDH20Island
11cg02387679chr5IQGAP2Island
12cg02498206chr3BOC
13cg02826525chr2ARHGEF33; RP11-173C1.1Island
14cg02988118chr2N_Shelf
15cg03078488chr7IGF2BP3N_Shore
16cg03277049chr3LINC00886Island
17cg03642695chr17TEKT3
18cg03789372chr8
19cg03806812chr2
20cg03945895chr1PRDM2
21cg03985801chr1LGR6N_Shore
22cg04267214chr1Island
23cg04616529chr16CLEC16A
24cg04865290chr3TMEM110; TMEM110-MUSTN1N_Shelf
25cg05033369chr1FCRLA
26cg05559381chr5
27cg05575921chr5AHRRN_Shore
28cg05752786chr1SYT2Island
29cg05787209chr16STX1BS_Shore
30cg05951221chr2ECEL1P1Island
31cg06010163chr6
32cg06227763chr18
33cg06540950chr5
34cg06637330chr5N_Shelf
35cg07160783chr16
36cg07325233chr21AP000295.9; IL10RB; IL10RB-AS1Island
37cg07709148chr8RP11-486M23.1
38cg07813142chr2SP5Island
39cg08008475chr13RNY1P1
40cg08374798chr20COL9A3N_Shore
41cg08733957chr1GALEN_Shore
42cg08894131chr1GJA5
43cg09194449chr7PTPRN2Island
44cg09278187chr1FOXJ3
45cg09370982chr16RP11-20I23.1; TBC1D24S_Shore
46cg09799983chr2CYP1B1; CYP1B1-AS1Island
47cg10076730chr13COL4A2N_Shore
48cg10354195chr10LRRC27
49cg10385390chr1PARK7S_Shore
50cg10413224chr7BMPERIsland
51cg10650290chr7PTPRN2
52cg11545521chr11PTPRJ
53cg11751707chr2CYP1B1; CYP1B1-AS1Island
54cg11954332chr1PRRX1
55cg12020590chr11
56cg12387247chr19FCER2
57cg13563863chr19FZR1Island
58cg13654445chr9NTRK2
59cg13990746chr1ANKRD45Island
60cg14270346chr9RP11-613M10.9; SHB
61cg14320852chr9
62cg14373988chr1PEX10N_Shore
63cg14419740chr7PTPRN2Island
64cg15233380chr13SHISA2
65cg15513657chr14MEG3
66cg15585555chr1RASSF5S_Shore
67cg15680620chr1ANKRD45S_Shore
68cg15922705chr6COL9A1N_Shore
69cg16315376chr1SYT2Island
70cg16322479chr5EXOC3; EXOC3-AS1S_Shore
71cg16377959chr5LINC01019
72cg16840978chr8RAB11FIP1N_Shelf
73cg16884847chr2PRKCE
74cg17211612chr12DNAH10
75cg17373442chr3CHST2Island
76cg17676618chr10
77cg18001059chr7MRPL32; PSMA2Island
78cg18713316chr1KCNN3Island
79cg18883807chr14
80cg18919659chr22PACSIN2
81cg19341901chr7
82cg19859270chr3CPOX; GPR15
83cg20439473chr17VEZF1N_Shelf
84cg20459495chr2
85cg20538211chr4IGFBP7-AS1N_Shelf
86cg20546279chr7S_Shore
87cg20628376chr10RP11-351M16.3
88cg21012061chr1ELK4; MFSD4
89cg21083936chr11S_Shelf
90cg21500300chr12BCAT1; RP11-662I13.2S_Shore
91cg21885107chr14PAPLN; RP4-647C14.2Island
92cg23369748chr6SASH1
93cg23501962chr11RP4-683L5.1; SLC1A2N_Shore
94cg23854567chr12PXN
95cg24203542chr11NAV2
96cg24279017chr12ETV6
97cg24772753chr2SP5Island
98cg25192619chr6CCDC167Island
99cg26005485chr8FAM135BN_Shore
100cg26029292chr8ZNF7S_Shelf
101cg26076054chr5AHRRIsland
102cg26582784chr7AC002454.1; CDK6Island
103cg26799398chr1ECM1; TARS2
104cg26972614chr11IL18BP
105cg27052537chr7
These 105 probes are located in different regions of 95 genes. For clarity, they were called signature genes. Six signature genes were related to multiple probes. They were AHRR (cg05575921, cg26076054, and cg00976097), ANKRD45 (cg13990746 and cg15680620), CYP1B1 (cg09799983 and cg11751707), PTPRN2 (cg09194449, cg10650290, and cg14419740), SP5 (cg07813142 and cg24772753), and SYT2 (cg05752786 and cg16315376). Twenty signature probes are not located in any genes, which means these signature probes may locate at intragenic areas or their information remains unknown due to the experimental limitations. The corresponding classification performance is shown in Table 3. The sensitivity, specificity, and accuracy of the training set are 98.04%, 100%, and 98.62%, respectively. All sensitivities, specificities, and accuracies of our two independent validation sets are higher than 80%. Moreover, the differences between these sensitivities and specificities are smaller than 4.2%. All these results indicated that the reliability of the model is very high. They also confirmed the essentiality of the identified signature probes.
Table 3

The classification performance obtained by the methylation values of 105 identified signature probes.

DataDatabaseSNSPACC
Training cohortGSE392790.980410.9862
Validation cohortTCGA0.84760.8060.8314
GSE668360.83810.81250.8347

The strongly related genes and their regulatory network at different levels.

To uncover the genes strongly related to our identified signature probes or signature genes, both BIMC and SRC methods were ted into the DCS method to analyze the whole genome genes using their GE and ME values. Using the BIMC method, 84 genes were uncovered to be highly correlated with signature probes. Meanwhile, using the SRC method, 134 genes were uncovered to be highly related to signature probes. Among them, 34 genes were in common. Figure 2 shows the gene regulatory network of these 34 genes. In this network, FARSB, PAICS, HMMR, PBK, NRAS, BRIP1, PSMD12, PSMD11, and SKA2 have the highest connections (>10.75), which means they are the core genes in this regulatory network. For clarity, they were named 84 BIMC genes, 134 SRC genes, and 34 common genes.
Figure 2

The gene weight network of important genes. The soft threshold was used to calculate the connectivity of the genes. Gene connectivity was expressed using the color of nodes (nodes representing genes). (a) A gene weight network constructed by 34 genes. (b) A gene weight network of part of the genes with significant weights performed a map of gene weights.

Protein regulation analysis was performed on 95 signature genes, 84 BIMC genes, and 134 SRC genes using the shortest path tracing algorithm. The corresponding protein-protein interaction (PPI) networks were shown in Figure 3 with dots representing proteins and edges representing interactions between proteins.
Figure 3

The protein-protein interaction (PPI) networks. (a) PPI corresponding to the 95 signature genes (the genes corresponding to the protein whose betweenness is greater than or equal to 100). (b) PPI corresponding to the 84 BIMC genes (the genes corresponding to the protein whose betweenness is greater than or equal to 100). (c) PPI corresponding to the 134 SRC genes (the genes corresponding to the protein whose betweenness is greater than or equal to 200).

3.2. Molecular Function and Pathway Analysis

The significant biological functions of 95 signature genes obtained by KEGG pathway analysis and GO analysis are listed in Tables S2 and S3 in the supplementary materials. Nine pathways were significantly enriched. Three of them may be very important for LUAD: hsa04512. Specific interactions between cells and the ECM are mediated by transmembrane molecules, mainly integrins and perhaps also proteoglycans, CD36, or other cell-surface-associated components. These interactions lead to a direct or indirect control of cellular activities such as adhesion, migration, differentiation, proliferation, and apoptosis. hsa04151. The phosphatidylinositol 3′-kinase- (PI3K-) Akt signaling pathway is activated by many types of cellular stimuli or toxic insults and regulates fundamental cellular functions such as transcription, translation, proliferation, growth, and survival. hsa04510. Cell-matrix adhesions play essential roles in important biological processes including cell motility, cell proliferation, cell differentiation, regulation of gene expression, and cell survival.

4. Discussion

As we mentioned above, identifying signature probes throughout hundreds of thousands of probes was a big challenge in our case. Therefore, we proposed a deep selecting method to overcome it. Least important probes were supposed to have more noise than useful information. Removing them can improve the classification accuracy. In every round of deep selecting method, the comparative contributions significantly varied after removing noisy probes. Therefore, the remodeling of the classification model and recalculating of the contributions of probes in steps 3–5 of this method were very important. With this method, 105 probes were identified as tobacco-related signatures. Using only their methylation values, the tobacco exposure patterns of LUAD patients can be classified correctly. For the training dataset GSE39279, the accuracy of the 105-signature probe classifier was 98.62%. Such a high accuracy usually indicates the overfitting of the model training. But the accuracies of two independent validation datasets were both higher than 83%, which proves the high quality in model generalization. More importantly, these two validation datasets (GSE66836 and TCGA) were collected by two different organizations from two different LUAD cohorts. The good-enough accuracies for both of them strongly proved the satisfactory predictive performance of the classification model. It also strongly proved the essentiality of the identified signature probes. Figure 4 shows that the 105 probes were mainly located in the open sea (45.7%) and island (25.7%) regions. On the contrary, the percentages of probes distributed over shore (19.1%) and shelf (9.5%) areas were very low. Additionally, these probes were mainly distributed in the promoter region (23.8%) and the body region (45.7%).
Figure 4

Distribution of feature types of 105 signature probes.

Apart from six signature genes being related to multiple probes, there are also 16 probes belonging to multiple genes. Normally, for one-probe-multiple-gene cases, the first gene in the list is the most likely one corresponding to that probe. Therefore, the first gene was chosen as the corresponding gene to that probe. Therefore, for multiple-gene cases, we chose the probe and the first corresponding gene in the list for further relationship analysis. Table S4 lists the coefficient values between 85 probes and 77 genes (multiple-probe-one-gene is allowed). Among these 85 probes, 25 of them are located in the promoter area. But only 15 of these 25 promoter probes negatively regulate their genes. The methylations of the other 10 have positive relationships with their genes' expression values. For signature probes located at the body area of genes, 28 of them negatively regulate their gene expression while the other 20 positively regulate their gene expression. In addition, Figure S2 shows the Volcano plot of the expression levels of 95 genes corresponding to 105 probes in smoker and never-smoker groups. The detailed information is shown in Table S5 in the supplementary materials. It can be seen from Table S5 that smoking alters the level of gene expression. ECM1, AHRR, GPR15, and PR11-351M16.3 were significantly upregulated in smokers. On the contrary, LINC01019 and MFSD4 were significantly downregulated in never-smokers. From the literature research using PubMed, 25 of the 95 signature genes are associated with smoking, 6 are associated with LUAD, 2 are associated with LUSC, 4 are associated with NSCLC, 9 are associated with lung cancer, 22 are associated with cancer, and the remaining 34 genes are still unclear (see Table S6). Among these 95 signature genes, AHRR is one of the most important genes. It corresponds to cg00976097, cg05575921, and cg26076054 probes in 105 signature probes. Several published reports showed: (1) the DNA methylation of it in lung tissue of smokers was significantly lower than that in nonsmokers, while DNA methylation was inversely proportional to the level of AHRR mRNA in smokers and nonsmokers [39, 40]; (2) the protein encoded by AHRR is involved in the regulation of cell growth and differentiation and the immune system [41]; (3) AHRR was found to be the most significantly different probe set between never-smokers and current-smokers [41-43]. Figure 5 shows the methylation levels of cg05575921 in current-smoker and never-smoker samples. The P values were 0.4E – 4 which means that its methylation levels were significantly different in these two group samples.
Figure 5

Beeswarm plot of gene AHRR (cg05575921).

Figure 2 shows the gene regulatory network of related genes uncovered at the gene regulatory level. FARSB, PAICS, HMMR, PBK, NRAS, BRIP1, PSMD12, PSMD11, and SKA2 were the most important genes because they had the most connections with other genes. Several of them have been reported to be important for LUAD or NSCLC as in the following examples: It is known from The Human Protein Atlas that FARSB is a protein-coding gene and a prognostic indicator for certain cancers [44]. PAICS was pointed out to be used as a prognostic biomarker for aggressive LUAD [45]. HMMR is closely related to the tumor outgrowth to the metastatic relapse of LUAD [46]. Studies have shown that the expression of PBK/TOPK was significantly associated with the adverse prognosis in NSCLC [47]. The NRAS gene provides instructions for making a protein called N-Ras that is involved primarily in regulating cell division. Ohashi et al. pointed out that NRAS mutations were common in people with a history of smoking in LUAD [48]. Mutations in BRIP1 were frequently observed in LUAD [49]. Both PSMD12 and PSMD11 are involved in cell cycle progression, apoptosis, or DNA damage repair [50]. PRR11-SKA2 gene pair is shown to be synergistically overexpressed in lung cancer and participates in the process of lung cancer development [51]. Under the premise that the corresponding protein betweenness of the gene is greater than or equal to 100, we found that the genes obtained by the BIMC gene and the PPI network were a subset of the genes obtained by the SRC gene and the PPI network. Therefore, we only analyzed the genes obtained from the BIMC gene and their PPI network (see Table S7). In total, 33 possible related genes were uncovered. The top 8 genes with the highest betweenness were MAGOH, FYN, CBL, RHOA, MAPK8, CYC1, ITGB1, and CCL26. FYN is a member of the protein-tyrosine kinase oncogene family. It encodes a membrane-associated tyrosine kinase that has been implicated in the control of cell growth. Du et al. [52] determined that CBL plays a key role in the development and metastasis of lung tumors and that c-CBL mutations may contribute to the carcinogenic potential of MET and EGFR in lung cancer. RHOA is a protein-coding gene, and RHOA-related diseases include adenocarcinoma and peripheral T-cell lymphoma. Konstantinidou G et al. [53] revealed that RHOA is involved in the biological process of maintaining the signal axis required for KRAS-driven lung adenocarcinoma. The signaling pathway of MAPK8 affects the regulation of gene expression in non-small-cell lung cancer [54]. Li et al. [55] stated clearly that CYC1 silencing inhibited growth in osteosarcoma cells via increasing apoptosis and damaging energy metabolism. Pellinen et al. [56] pointed out that ITGB1 is associated with poor prognosis of prostate cancer. CCL26 is involved in the regulation of eosinophil function in asthmatic patients [57]. Although these genes may not be indispensable markers associated with smoking in lung adenocarcinoma, they reveal that cancer usually has a common molecular mechanism and specific genes with type-dependent patterns.

5. Conclusions

The identification of biomolecular markers is one of the most important issues in biomedicine and genomics. Lung adenocarcinoma is a disease with a high mortality rate in cancer. Researchers are eager to find its related genes, which is helpful to expose its mechanism and to develop effective treatments. This study used statistical methods to find 105 signature probes, which are methylation biomarkers of lung adenocarcinoma associated with tobacco exposure patterns. The analysis of gene weighting regulatory networks and protein regulatory networks revealed genes that are directly or indirectly related to these 105 signature probes at different levels. It is hopeful that this research will be of great help to medical workers.
  50 in total

1.  Extracellular Matrix Receptor Expression in Subtypes of Lung Adenocarcinoma Potentiates Outgrowth of Micrometastases.

Authors:  Laura E Stevens; William K C Cheung; Sally J Adua; Anna Arnal-Estapé; Minghui Zhao; Zongzhi Liu; Kelly Brewer; Roy S Herbst; Don X Nguyen
Journal:  Cancer Res       Date:  2017-02-14       Impact factor: 12.701

2.  CYC1 silencing sensitizes osteosarcoma cells to TRAIL-induced apoptosis.

Authors:  Guodong Li; Dong Fu; Wenqing Liang; Lin Fan; Kai Chen; Liancheng Shan; Shuo Hu; Xiaojun Ma; Ke Zhou; Biao Cheng
Journal:  Cell Physiol Biochem       Date:  2014-11-28

3.  Partial least squares methods: partial least squares correlation and partial least square regression.

Authors:  Hervé Abdi; Lynne J Williams
Journal:  Methods Mol Biol       Date:  2013

4.  Molecular analysis of plasma DNA for the early detection of lung cancer by quantitative methylation-specific PCR.

Authors:  Kimberly Laskie Ostrow; Mohammad O Hoque; Myriam Loyo; Marianna Brait; Alissa Greenberg; Jill M Siegfried; Jennifer R Grandis; Autumn Gaither Davis; William L Bigbee; William Rom; David Sidransky
Journal:  Clin Cancer Res       Date:  2010-06-30       Impact factor: 12.531

5.  The gene pair PRR11 and SKA2 shares a NF-Y-regulated bidirectional promoter and contributes to lung cancer development.

Authors:  Yitao Wang; Ying Zhang; Chundong Zhang; Huali Weng; Yi Li; Wei Cai; Mengyu Xie; Yinjiang Long; Qing Ai; Zhu Liu; Gang Du; Sen Wang; Yulong Niu; Fangzhou Song; Toshinori Ozaki; Youquan Bu
Journal:  Biochim Biophys Acta       Date:  2015-07-08

6.  Multi-class tumor classification by discriminant partial least squares using microarray gene expression data and assessment of classification models.

Authors:  Yongxi Tan; Leming Shi; Weida Tong; G T Gene Hwang; Charles Wang
Journal:  Comput Biol Chem       Date:  2004-07       Impact factor: 2.877

7.  KOBAS server: a web-based platform for automated annotation and pathway identification.

Authors:  Jianmin Wu; Xizeng Mao; Tao Cai; Jingchu Luo; Liping Wei
Journal:  Nucleic Acids Res       Date:  2006-07-01       Impact factor: 16.971

Review 8.  DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies.

Authors:  Xu Gao; Min Jia; Yan Zhang; Lutz Philipp Breitling; Hermann Brenner
Journal:  Clin Epigenetics       Date:  2015-10-16       Impact factor: 6.551

9.  Role and regulation of coordinately expressed de novo purine biosynthetic enzymes PPAT and PAICS in lung cancer.

Authors:  Moloy T Goswami; Guoan Chen; Balabhadrapatruni V S K Chakravarthi; Satya S Pathi; Sharath K Anand; Shannon L Carskadon; Thomas J Giordano; Arul M Chinnaiyan; Dafydd G Thomas; Nallasivam Palanisamy; David G Beer; Sooryanarayana Varambally
Journal:  Oncotarget       Date:  2015-09-15

10.  ITGB1-dependent upregulation of Caveolin-1 switches TGFβ signalling from tumour-suppressive to oncogenic in prostate cancer.

Authors:  Teijo Pellinen; Sami Blom; Sara Sánchez; Katja Välimäki; John-Patrick Mpindi; Hind Azegrouz; Raffaele Strippoli; Raquel Nieto; Mariano Vitón; Irene Palacios; Riku Turkki; Yinhai Wang; Miguel Sánchez-Alvarez; Stig Nordling; Anna Bützow; Tuomas Mirtti; Antti Rannikko; María C Montoya; Olli Kallioniemi; Miguel A Del Pozo
Journal:  Sci Rep       Date:  2018-02-05       Impact factor: 4.379

View more
  2 in total

1.  MAGOH/MAGOHB Inhibits the Tumorigenesis of Gastric Cancer via Inactivation of b-RAF/MEK/ERK Signaling.

Authors:  Yong Zhou; Zhongqi Li; Xuan Wu; Laizhen Tou; Jingjing Zheng; Donghui Zhou
Journal:  Onco Targets Ther       Date:  2020-12-10       Impact factor: 4.147

2.  Contribution of upregulated aminoacyl-tRNA biosynthesis to metabolic dysregulation in gastric cancer.

Authors:  Xiaoling Gao; Rui Guo; Yonghong Li; Guolan Kang; Yu Wu; Jia Cheng; Jing Jia; Wanxia Wang; Zhenhao Li; Anqi Wang; Hui Xu; Yanjuan Jia; Yuanting Li; Xiaoming Qi; Zhenhong Wei; Chaojun Wei
Journal:  J Gastroenterol Hepatol       Date:  2021-08-01       Impact factor: 4.369

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.