Literature DB >> 30361509

Identification of Novel Genes in Human Airway Epithelial Cells associated with Chronic Obstructive Pulmonary Disease (COPD) using Machine-Based Learning Algorithms.

Shayan Mostafaei1, Anoshirvan Kazemnejad2, Sadegh Azimzadeh Jamalkandi3, Soroush Amirhashchi4, Seamas C Donnelly5,6, Michelle E Armstrong5, Mohammad Doroudian5.   

Abstract

The aim of this project was to identify candidate novel therapeutic targets to facilitate the treatment of COPD using machine-based learning (ML) algorithms and penalized regression models. In this study, 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (9 GOLD stage I and 12 GOLD stage II) were included (n = 133). 20,097 probes were generated from a small airway epithelium (SAE) microarray dataset obtained from these subjects previously. Subsequently, the association between gene expression levels and smoking and COPD, respectively, was assessed using: AdaBoost Classification Trees, Decision Tree, Gradient Boosting Machines, Naive Bayes, Neural Network, Random Forest, Support Vector Machine and adaptive LASSO, Elastic-Net, and Ridge logistic regression analyses. Using this methodology, we identified 44 candidate genes, 27 of these genes had been previously been reported as important factors in the pathogenesis of COPD or regulation of lung function. Here, we also identified 17 genes, which have not been previously identified to be associated with the pathogenesis of COPD or the regulation of lung function. The most significantly regulated of these genes included: PRKAR2B, GAD1, LINC00930 and SLITRK6. These novel genes may provide the basis for the future development of novel therapeutics in COPD and its associated morbidities.

Entities:  

Mesh:

Year:  2018        PMID: 30361509      PMCID: PMC6202402          DOI: 10.1038/s41598-018-33986-8

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

Chronic obstructive pulmonary disease (COPD) is a progressive inflammatory disease characterized by airway obstruction and is predicted to be among the first three causes of death worldwide[1,2]. Clinical presentations include emphysema, small airway obstructions and chronic bronchitis. COPD has been shown to develop in 30% of smokers and smoking history, combined with reduced daily physical activity, may be the main risk factor associated with the development of COPD[3]. Additional risk factors in COPD, in genetically susceptible individuals, include a history of maternal smoking, second hand smoke, polluted air, maternal/paternal asthma, childhood asthma or respiratory infections and malnutrition[4]. Although COPD archetypically manifests itself in males, recent studies have demonstrated an increased incidence and mortality rates in females. Furthermore, female patients with COPD are more often misdiagnosed and/or underdiagnosed[5,6]. From a genetic perspective, COPD is a complex disease arising from mutations in multiple alleles and the lack of integration of data in this disease has been attributed to dispersed, independent genome-wide association studies (GWAS)[7]. DNA microarrays now permit scientists to screen thousands of genes simultaneously in order to determine which genes are active, hyperactive or silent in normal or COPD tissue. Furthermore, network-based medicine has also been recently employed to facilitate the investigation of genomics, transcriptomics, proteomics and other “–omics” in order to better understand complex diseases, such as COPD[8]. However, from a biological perspective, only a only a small subset of genes identified by these methodologies will be strongly indicative of the target disease[9]. Therefore, in this study, we employed a novel methodology, namely machine-based learning algorithms combined with penalized regression models, in order to study genomic change in COPD in a more selective manner. Furthermore, we have also had a longstanding interest in the genetics of COPD, formally as part of a European Union consortium[10-13]. Here, we now extend on these initial observations. This study was designed to apply signaling-network methodology with machine-based learning methods to better understand the genetic etiology of smoking exposure and COPD in 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (9 of GOLD stage I and 12 of GOLD stage II) were included (Total: n = 133). Furthermore, AdaBoost Classification Trees, Decision Tree, Gradient Boosting Machines, Naive Bayes, Neural Network, Random Forest, Support Vector Machine (as machine learning algorithms) and adaptive LASSO, elastic-net, and ridge logistic regression (as statistical models) were also applied. In summary, we identified 44 candidate genes associating with smoking exposure and the incidence/progression of COPD. We also identified 17 novel genes, which were not previously associated with COPD, the regulation of lung function or smoking exposure. The most significantly regulated of these genes included: PRKAR2B, GAD1, LINC00930, and SLITRK6. These novel genes may provide the basis for the future development of novel therapeutics in COPD and warrant further investigation and validation.

Results

Differential analysis of gene expression data

In this study, 54,675 probes were screened using the microarray dataset generated from SAE cells previously from: 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (42.8% of GOLD stage I and 57.2% of GOLD stage II) (Table 1)[14]. Differential analysis was subsequently performed in order to select 20,097 probes. Subsequently, 718 probes and 544 genes (Fig. 1) were identified which were significantly changed (all p values < 0.0001) in COPD patients compared with healthy non-smokers. These genes, which include USP27X, PPP4R4, AHRR, PRKAR2B, GAD1, CYP1A1 and CYP1B1, are listed in the Supplementary File S1.
Table 1

Basic characteristics of the study samples.

CharacteristicsCOPD Smoker (N = 21)Healthy Smoker (N = 59)Healthy Non-smoker (N = 53)P-value
Age (Year)*50.38 ± 7.08142.93 ± 7.26741.0 ± 11.30<0.001
Smoking (pack per year)*36.98 ± 23.95327.6 ± 16.9750.078
FVC*97 ± 20109 ± 13107 ± 130.004
FEV1*74 ± 20107 ± 14105 ± 14<0.001
FEV1/FVC*61 ± 880 ± 581 ± 6<0.001
Sex+Male17 (81)39 (66.1)38 (71.7)0.535
FemaleRef.
Ethnic+Caucasian14 (66.6)14 (23.7)20 (37.7)0.038
BlackRef.
Stage+ (Gold) of COPDII12 (57.2)NA
I9 (42.8)Ref.

* indicated as mean ± standard deviation, + indicated as N (%), Ref. considered as the reference level for each categorical variable, NA: not applicable.

Figure 1

Schematic demonstrating study plan and flowchart.

Basic characteristics of the study samples. * indicated as mean ± standard deviation, + indicated as N (%), Ref. considered as the reference level for each categorical variable, NA: not applicable. Schematic demonstrating study plan and flowchart.

Module identification

Normalized gene expression data was used for module identification in the SPD algorithm. In total, 576 modules were identified. Three modules were biologically more related to the progression and phenotype of COPD including, 119, 242 and 324. The minimal spanning trees obtained from the SPD algorithm are shown in Fig. 2. All the genes involved in COPD progression are presented in Table 2 and then included in machine-learning and statistical modeling approaches. From these three selected modules, gene expression within two of the modules (Fig. 2a,b), associated with COPD-progression, was increased in SAE cells. In contrast, gene expression within the third module (Fig. 2c), associated with COPD-progression, was decreased in SAE cells. In Fig. 2d, classification of samples was shown based on the disease stage (dark blue = healthy non-smoker, light blue = healthy smoker, light brown = COPD stage I and dark brown = COPD stage II).
Figure 2

Genes involved in the progression of the COPD based on the minimal- inclusive trees were obtained from SPD algorithm (dark blue = healthy non-smoker, light blue = healthy smoker, light brown = stage I of COPD smoker and dark brown = stage II of COPD smoker).

Table 2

List of the genes involved in the progression of COPD by sample progression discovery (SPD) algorithm.

Related Modules with progression of COPDNumber of involved GenesGenes Symbol
Module 11948MUCL1, LOC652993, LINC00639, LINC00942, TXNRD1, CYP1B1, ME1, GAD1, CBR3, CYP1A1, NRG1, CYP4F3, AKR1B10, HTR2B, NR0B1, GRM1, ABCC3, CDRT1, AKR1C3, CBR1, TRIM9, SPP1, ADH7, FTH1P5, FTL, ADD3-AS1, AKR1C1, SLC7A11, CACNA2D3, LHX6, CABYR, HS3ST3A1, PLEKHA8P1, BACH2, SFRP2, RPSA, CLIP4, ST3GAL4-AS1, SAMD5, AHRR, ANKDD1A, LINC00589, TMCC3, RNF175, RIMKLA, LOC100652994, GPX2, LOC344887
Module 24210LINC00930, UCHL1, REEP1, EGF, CLEC11A, TMEM74B, DNHD1, C4orf48, C6orf164, JAKMIP3
Module 32432ZSCAN4, LOC338667, PRKAR2B, PLAG1, ZNF211, SCGB1A1, TLR5, KANK1, PPP4R4, THSD7A, CYB5A, GMNN, GPRC5A, PIEZO2, GFOD1, ZNF419, THSD4, CCDC37, PAPLN, GLI3, PRKAG2-AS1, PRDM11, LOC285812, SCGB3A1, USP27X, KCNA1, LOC100507560, PRDM16, SLITRK6, CYP4Z1, GPR115, RASSF10
Genes involved in the progression of the COPD based on the minimal- inclusive trees were obtained from SPD algorithm (dark blue = healthy non-smoker, light blue = healthy smoker, light brown = stage I of COPD smoker and dark brown = stage II of COPD smoker). List of the genes involved in the progression of COPD by sample progression discovery (SPD) algorithm.

Gene selection and prediction

Based on the machine-learning and statistical penalized algorithms, and after adjustment of the effect of pack per year of smoking, elastic-net logistic regression had the highest AUC (82%), sensitivity (85%), specificity (51%) and lowest misclassification error rate (25%). In reverse, decision trees method has lowest AUC (57%), sensitivity (69%), specificity (43%) and highest misclassification error rate (39%) than other algorithms. Based on the elastic-net logistic regression, the most important selected genes included, THSD4, PPP4R4, JAKMIP3, LINC00930, DNHD1, TMCC3, CCDC37, PRDM11, GLI3, ABCC3, ADH7, SAMD5, RASSF10, USP27X, GAD1, CYP1A1, NR0B1, CYP1B1, PLAG1, PIEZO2, SCGB1A1, LOC100507560. Consequently, 44 candidate genes identified here are associated with either the occurrence or progression of COPD, or lung function (Table 3). According to the results of each computational method, 44 were selected and the computational methods were hierarchically clustered, simultaneously (Figs 3 and 4). Of these 44 genes, 27 have been previously reported in the literature to be associated with COPD, lung function (FVC, FEV1 or the FEV1/FVC ratio) or other lung diseases. These 27 genes also include the genes of THSD4, PPP4R4, SCGB1A1, and NRG1, already detected in GWA studies to determine single nucleotide polymorphisms (SNPs) specifically for COPD (Table 4). Furthermore, in our study, SNPs within 4 additional genes have been detected in GWAS studies carried out previously in lung-related studies including: PRDM11 and AHRR FVC, smoking[15,16], CYP1A1 childhood bronchitis[17] and CYP1B1 lung cancer[18]. In this study, we have identified 17 genes which have not previously been detected in COPD studies, these include: LINC00942, REEP1, C6orf164, LINC00589, JAKMIP3, LINC00930, DNHD1, TMCC3, ADH7, PRKAR2B, GAD1, LOC338667, CYB5A, PIEZO2, SLITRK6, KCNA1 and LOC100507560 (Table 4). These genes may represent novel biomarkers in the diagnosis and prognosis of COPD. Figure 5 depicts the functional protein-association networks for the 44 selected genes, as shown by STRING.
Table 3

Probes and corresponding 44 genes selected by ML algorithms and penalized regression models for association between the genes with occurrence and progression of COPD. The effect of smoking (pack per year) was adjusted in all of the methods.

Gene SymbolProbe IDNumber of MethodsLASSOAdapt. LASSOElastic netRidgeSVMGBMNBRFANNRTABCT
1.PPP4R4233002_at380%78%96%
2.THSD4222835_at290%43%
3.NRG1206343_s_at355%55%65%
4.SCGB1A1205725_at630%61%54%54%78%64%
5.AHRR229354_at898%96%77%76%76%68%48%76%
6.CYP1A1205749_at1190%82%20%65%73%11%72%74%43%77%73%
7.CYP1B1202437_s_at988%80%32%65%64%35%64%58%65%
8.PRDM11229687_s_at150%
9.CBR3205379_at114%
10.AKR1C1217626_at110%
11.AKR1C3209160_at15%
12.GRM1207299_s_at14%
13.CYP4Z1237395_at167%
14.UCHL1201387_s_at157%
15.CABYR219928_s_at154%
16.GPRC5A203108_at2100%100%
17.CCDC37243758_at150%
18.GLI3227376_at338%12%43%
19.ABCC3208161_s_at330%58%52%
20.SAMD5228653_at324%41%57%
21.RASSF10238755_at523%75%75%68%64%
22.USP27X230620_at1199%94%31%100%100%100%100%100%49%100%100%
23.HTR2B206638_at15%
24.NR0B1206645_s_at533%66%66%58%66%
25.PLAG1205372_at526%61%61%61%61%
26.SCGB3A1230378_at565%58%58%65%58%
27.LHX6219884_at155%
28.LINC009421558308_at152%
29.REEP1204364_s_at145%
30.C6orf164230506_at144%
31.LINC00589232718_at113%
32.JAKMIP3233076_at4100%64%98%56%
33.LINC009301556768_at378%4%100%
34.DNHD1229631_at153%
35.TMCC3235146_at752%82%87%82%64%73%84%
36.ADH7210505_at327%27%54%
37.PRKAR2B203680_at796%96%76%74%73%76%74%
38.GAD1205278_at923%74%67%48%67%73%46%84%67%
39.LOC3386671564786_at365%65%43%
40.CYB5A217021_at665%63%3%63%87%64%
41.PIEZO2219602_s_at656%65%60%60%68%60%
42.SLITRK6235976_at458%57%57%57%
43.KCNA1230849_at352%53%53%
44.LOC100507560231379_at938%48%74%82%74%62%41%100%50%
AUC%Sensitivity (SD)Specificity (SD)Misclassification Error Rate (SD)79%74%82%76.6%61.6%76%77%80%70%57%74.7%
0.83 (0.14)0.81 (0.16)0.85 (0.13)10.92 (0.10)0.98 (0.04)0.84 (0.12)0.95 (0.08)0.68 (0.17)0.69 (0.20)0.81 (0.14)
0.5 (0.30)0.37 (0.10)0.51 (0.29)00.15 (0.13)0.02 (0.07)0.49 (0.26)0.07 (0.15)0.66 (0.24)0.43 (0.24)0.39 (0.14)
0.27 (0.14)0.31 (0.15)0.25 (0.10)0.30 (0.03)0.31 (0.06)0.30 (0.05)0.26 (0.09)0.31 (0.09)0.32 (0.13)0.39 (0.12)0.31 (0.11)

Important index (value) for each gene in any method was reported. The third column indicated number of studies that it confirmed the association of each gene with progression of the COPD. Third column indicated sum of number of methods that it confirmed each gene (Range score: 0 to 11).

Figure 3

Interactive cluster heatmap displaying importance index of the forty-four candidate genes (as columns) in each of the machine learning and statistical methods (as rows), rows and columns of the heatmap have been reordered according to a hierarchical clustering, represented by the dendrogram, colors represent importance index of the genes (red to yellow: lower to higher of importance value).

Figure 4

Spearman’s rank correlation, co-expression, matrix between the selected genes: heatmap for hierarchical clustering the forty-four candidate genes based on their pattern of gene expression.

Table 4

Confirmation of the association of selected genes with COPD/or lung function by literature reviewing in PubMed databank with ((“COPD” OR “Lung Function”) AND “name of each selected gene”).

Gene SymbolProbe IDNumber of studiesReferences (PMIDs)
1.PPP4R4233002_at128170284
2.THSD4222835_at627564456, 24286382, 23932459, 22461431, 21965014, 20010834
3.NRG1206343_s_at1528950338, 28901268, 28604730, 28396363, 28391773, 27626312, 26837769, 26200269, 25870798, 25531467, 25501131, 25384085, 24469108, 23390248, 22665269
4.SCGB1A1205725_at427081700, 26937342, 26159408, 23144326
5.AHRR229354_at928854564, 29262847, 28100713, 28056099, 27924164, 27632354, 26667048, 22232023, 18172554
6.CYP1A1205749_at10829212267, 29076184, 28827732, 28283091, and etc.
7.CYP1B1202437_s_at3829110844, 28858732, and etc.
8.PRDM11229687_s_at128938616
9.CBR3205379_at126916823
10.AKR1C1217626_at829344298, 28210161, 26338969, 24976539, 23534707, 23474755, 17266043, 16915569
11.AKR1C3209160_at723534707, 28704416, 27629782, 25603868, 23665002, 23519145, 15284179
12.HTR2B206638_at127301951
13.GRM1207299_s_at123303475
14.CYP4Z1237395_at119473719
15.UCHL1201387_s_at528688920, 25615526, 23534707, 21143527, 17108109
16.CABYR219928_s_at526938915, 26843620, 24362251, 17317841, 21274509
17.GPRC5A203108_at1029382653, 28849235, 28088789, 26447616, 25621293, 25311788, 23154545, 22239913, 20686609, 20563252
18.CCDC37243758_at226200272, 22011669
19.GLI3227376_at327146893, 23736020, 23667589
20.ABCC3208161_s_at424176985, 23369236, 22699933, 19107936
21.SAMD5228653_at125411851
22.RASSF10238755_at124433832
23.USP27X230620_at127013495
24.NR0B1206645_s_at128965760
25.PLAG1205372_at229305497, 29249655
26.SCGB3A1230378_at526937342, 21636547, 20849603, 20660313, 19334046
27.LHX6219884_at428900494, 28396596, 27610375, 24157876
28.LINC009421558308_at0
29.REEP1204364_s_at0
30.C6orf164230506_at0
31.LINC00589232718_at0
32.JAKMIP3233076_at0
33.LINC009301556768_at0
34.DNHD1229631_at0
35.TMCC3235146_at0
36.ADH7210505_at0
37.PRKAR2B203680_at0
38.GAD1205278_at0
39.LOC3386671564786_at0
40.CYB5A217021_at0
41.PIEZO2219602_s_at0
42.SLITRK6235976_at0
43.KCNA1230849_at0
44.LOC100507560231379_at0
Figure 5

STRING protein–protein interaction networks for the forty-four candidate genes.

Probes and corresponding 44 genes selected by ML algorithms and penalized regression models for association between the genes with occurrence and progression of COPD. The effect of smoking (pack per year) was adjusted in all of the methods. Important index (value) for each gene in any method was reported. The third column indicated number of studies that it confirmed the association of each gene with progression of the COPD. Third column indicated sum of number of methods that it confirmed each gene (Range score: 0 to 11). Interactive cluster heatmap displaying importance index of the forty-four candidate genes (as columns) in each of the machine learning and statistical methods (as rows), rows and columns of the heatmap have been reordered according to a hierarchical clustering, represented by the dendrogram, colors represent importance index of the genes (red to yellow: lower to higher of importance value). Spearman’s rank correlation, co-expression, matrix between the selected genes: heatmap for hierarchical clustering the forty-four candidate genes based on their pattern of gene expression. Confirmation of the association of selected genes with COPD/or lung function by literature reviewing in PubMed databank with ((“COPD” OR “Lung Function”) AND “name of each selected gene”). STRING protein–protein interaction networks for the forty-four candidate genes.

Investigation of the differential expression of genes in healthy non-smokers (HNS; control subjects), healthy smokers, COPD patients, and COPD Stage I and II patients

In this study, we also investigated the differential expression of our 44 candidate genes in healthy non-smokers (HNS; control subjects; n = 53), healthy smokers (HS; n = 59), COPD patients (n = 21), and COPD stage I (COPD I; n = 9) and II (COPD II; n = 12) patients, respectively. We investigated the differential gene expression between HNS and HS and found significant differences in expression in 39/44 (88.6%) of all genes. In addition, 16/17 (94.1%) of the genes, not previously detected associating with COPD or lung function, were differentially expressed (Table 5; column HS v HNS). We then investigated the differential expression of these 44 genes in HS and COPD patients. Here, 24/44 (54.5%) of all genes studies were significantly regulated. Furthermore, 10/17 previously undetected genes in COPD/lung function were differentially regulated (Table 5; column COPD v HS). Finally, we investigated the regulation of these 44 genes in COPD Stage I and II patients compared with HS (Table 5; columns stage I v HS and stage II v HS). Here, we observed that 5/44 (11.4%; COPD stage I) and 16/44 (36.3%; COPD stage II) were differentially regulated. Among the previously undetected genes in COPD/lung function, 10/17 (58.8%) and 6/17 (35.3%) were significantly different in COPD stage I and II, respectively, compared with HS. A number of genes were significantly different in all four analyses (HS v HNS; HS v COPD; HS v COPD I; HS v COPD II), including: USP27X, AHRR, CYP1A1 and CYP1B1. Interestingly, of these genes, not previously identified to associate with COPD/lung function, PRKAR2B and GAD1 were significantly different in all four analyses. Therefore, this study reveals for the first time the potential role of PRKAR2B and GAD1 in COPD and smoking-related dysfunction in lung.
Table 5

Relative expression of 44 candidate genes in healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II).

Gene SymbolHealthy Non-Smoker (N = 53)Healthy Smoker (N = 59)COPD smoker (N = 21)COPD stage I smoker (N = 9)COPD stage II smoker (N = 12)Fold Regulation, adjusted p-value (HS vs. HNS)Fold Regulation, adjusted p-value (COPD vs. HS)Fold Regulation, adjusted p-value (stage I vs. HS)Fold Regulation, adjusted p-value (stage II vs. HS)
1.PPP4R46.56 ± 0.756.01 ± 0.885.56 ± 0.695.88 ± 0.445.32 ± 0.77−1.091, 0.004−1.081, 0.026−1.022, 0.99−1.130, 0.036
2.THSD47.59 ± 0.507.34 ± 0.547.11 ± 0.517.47 ± 0.316.84 ± 0.46−1.034, 0.052−1.032, 0.1131.018, 0.99−1.073, 0.013
3.NRG13.79 ± 0.254.05 ± 0.473.80 ± 0.383.94 ± 0.483.69 ± 0.251.069, 0.002−1.066, 0.005−1.028, 0.99−1.097, 0.021
4.SCGB1A114.48 ± 0.1514.36 ± 0.2014.15 ± 0.3214.18 ± 0.2414.11 ± 0.37−1.008, 0.015−1.015, 0.002−1.013, 0.103−1.018, 0.002
5.AHRR3.86 ± 0.164.58 ± 0.655.23 ± 0.765.42 ± 0.815.08 ± 0.711.186, 0.0011.142, 0.0011.183, 0.0011.109, 0.025
6.CYP1A14.61 ± 0.246.12 ± 1.847.72 ± 1.808.04 ± 1.527.48 ± 2.021.327, 0.0011.261, 0.0011.314, 0.0031.222, 0.018
7.CYP1B13.64 ± 0.517.74 ± 2.049.50 ± 1.169.92 ± 0.599.19 ± 1.402.126, 0.0011.227, 0.0021.282, 0.0031.187, 0.045
8.PRDM115.67 ± 0.365.35 ± 0.325.28 ± 0.295.34 ± 0.225.23 ± 0.33−1.059, 0.001−1.013, 0.363−1.002, 0.99−1.023, 0.99
9.CBR37.23 ± 0.438.15 ± 0.688.35 ± 0.648.47 ± 0.508.26 ± 0.731.127, 0.0011.025, 0.2871.039, 0.8191.013, 0.99
10.AKR1C16.82 ± 0.688.56 ± 1.118.32 ± 1.208.44 ± 1.238.22 ± 1.211.255, 0.001−1.029, 0.542−1.014, 0.99−1.041, 0.99
11.AKR1C310.30 ± 0.4311.85 ± 0.7211.96 ± 0.6612.20 ± 0.4711.78 ± 0.741.150, 0.0011.009, 0.7481.029, 0.783−1.006, 0.99
12.HTR2B3.92 ± 0.214.21 ± 0.304.25 ± 0.314.19 ± 0.404.29 ± 0.231.074, 0.0011.010, 0.618−1.005, 0.991.019, 0.99
13.GRM13.74 ± 0.114.04 ± 0.344.16 ± 0.554.21 ± 0.364.13 ± 0.661.080, 0.0011.030, 0.5711.042, 0.991.022, 0.99
14.CYP4Z16.83 ± 0.576.24 ± 0.575.92 ± 0.385.89 ± 0.455.94 ± 0.33−1.094, 0.001−1.054, 0.02−1.059, 0.445−1.050, 0.404
15.UCHL15.30 ± 0.568.79 ± 1.509.27 ± 1.819.24 ± 1.699.28 ± 1.971.658, 0.0011.055, 0.1831.051, 0.991.055, 0.97
16.CABYR4.86 ± 0.286.92 ± 1.237.46 ± 1.377.98 ± 1.137.09 ± 1.461.424, 0.0011.078, 0.1621.153, 0.0031.024, 0.99
17.GPRC5A7.58 ± 0.637.50 ± 0.438.01 ± 0.617.89 ± 0.718.10 ± 0.54−1.010, 0.991.068, 0.0011.052, 0.2621.080, 0.005
18.CCDC379.44 ± 0.549.35 ± 0.539.26 ± 0.639.21 ± 0.809.29 ± 0.50−1.009, 0.99−1.010, 0.381−1.015, 0.99−1.006, 0.99
19.GLI37.59 ± 0.386.74 ± 0.576.62 ± 0.386.69 ± 0.436.56 ± 0.34−1.126, 0.001−1.018, 0.292−1.007, 0.99−1.027, 0.99
20.ABCC36.95 ± 0.447.88 ± 0.617.62 ± 0.787.68 ± 0.767.57 ± 0.821.134, 0.001−1.034, 0.226−1.026, 0.99−1.041, 0.733
21.SAMD53.74 ± 0.143.98 ± 0.283.93 ± 0.494.0 ± 0.283.87 ± 0.621.064, 0.001−1.013, 0.1961.005, 0.99−1.028, 0.99
22.RASSF107.67 ± 0.497.06 ± 0.596.62 ± 0.556.80 ± 0.316.47 ± 0.66−1.086, 0.001−1.066, 0.001−1.038, 0.99−1.091, 0.006
23.USP27X7.43 ± 0.287.13 ± 0.406.65 ± 0.406.70 ± 0.276.60 ± 0.48−1.042, 0.001−1.072, 0.001−1.064, 0.007−1.080, 0.001
24.NR0B13.93 ± 0.244.39 ± 0.774.76 ± 0.774.87 ± 0.894.67 ± 0.691.117, 0.0011.084, 0.0251.109, 0.1521.064, 0.960
25.PLAG15.79 ± 0.565.25 ± 0.554.81 ± 0.544.95 ± 0.544.70 ± 0.54−1.103, 0.001−1.091, 0.002−1.060, 0.743−1.117, 0.025
26.SCGB3A114.21 ± 0.4113.92 ± 0.6013.16 ± 0.8313.53 ± 0.9112.87 ± 0.65−1.021, 0.061−1.058, 0.001−1.029, 0.412−1.081, 0.001
27.LHX65.36 ± 0.345.80 ± 0.446.12 ± 0.516.11 ± 0.516.12 ± 0.531.082, 0.0011.055, 0.0291.053, 0.2461.055, 0.147
28.LINC009423.58 ± 0.173.87 ± 0.714.11 ± 0.994.24 ± 1.234.01 ± 0.791.081, 0.0351.062, 0.2581.096, 0.2511.036, 0.99
29.REEP19.34 ± 0.639.82 ± 0.569.61 ± 0.779.38 ± 0.939.78 ± 0.621.051, 0.001−1.022, 0.329−1.047, 0.408−1.004, 0.99
30.C6orf1645.49 ± 0.506.09 ± 0.556.21 ± 0.616.17 ± 0.436.23 ± 0.731.109, 0.0011.020, 0.3151.013, 0.991.023, 0.99
31.LINC005895.15 ± 0.265.38 ± 0.315.40 ± 0.315.33 ± 0.175.44 ± 0.381.044, 0.0011.004, 0.706−1.009, 0.991.011, 0.99
32.JAKMIP34.51 ± 0.215.27 ± 0.615.68 ± 0.855.71 ± 0.985.65 ± 0.791.168, 0.0011.078, 0.0481.083, 0.0991.072, 0.160
33.LINC009306.01 ± 0.436.91 ± 0.576.54 ± 0.676.54 ± 0.826.53 ± 0.581.150, 0.001−1.057, 0.059−1.056, 0.503−1.058, 0.186
34.DNHD16.64 ± 0.487.14 ± 0.537.14 ± 0.747.14 ± 0.927.13 ± 0.611.075, 0.0011.000, 0.9121, 0.99−1.001, 0.99
35.TMCC33.82 ± 0.384.09 ± 0.484.45 ± 0.434.49 ± 0.474.42 ± 0.421.071, 0.0071.088, 0.0011.098, 0.0881.081, 0.149
36.ADH78.01 ± 0.9310.81 ± 0.6910.70 ± 0.6210.81 ± 0.5610.61 ± 0.681.350, 0.001−1.010, 0.2871, 0.99−1.019, 0.99
37.PRKAR2B7.35 ± 0.586.45 ± 0.725.86 ± 0.595.89 ± 0.665.83 ± 0.56−1.139, 0.001−1.101, 0.001−1.095, 0.096−1.106, 0.015
38.GAD14.91 ± 0.616.31 ± 1.07.25 ± 0.867.07 ± 1.107.39 ± 0.661.285, 0.0011.149, 0.0011.120, 0.0761.171, 0.001
39.LOC3386675.39 ± 0.325.19 ± 0.235.04 ± 0.255.03 ± 0.175.04 ± 0.30−1.038, 0.002−1.030, 0.014−1.032, 0.647−1.030, 0.675
40.CYB5A4.72 ± 0.194.71 ± 0.184.53 ± 0.244.62 ± 0.234.45 ± 0.23−1.003, 0.99−1.040, 0.002−1.019, 0.99−1.058, 0.010
41.PIEZO27.24 ± 0.856.48 ± 0.825.89 ± 0.635.77 ± 0.725.96 ± 0.58−1.117, 0.001−1.100, 0.004−1.123, 0.090−1.087, 0.225
42.SLITRK67.97 ± 0.616.86 ± 0.846.24 ± 0.786.31 ± 1.046.18 ± 0.55−1.162, 0.001−1.099, 0.004−1.087, 0.322−1.110, 0.041
43.KCNA16.87 ± 1.035.79 ± 0.945.19 ± 0.775.49 ± 0.894.97 ± 0.51−1.186, 0.001−1.116, 0.011−1.054, 0.99−1.165, 0.043
44.LOC1005075605.85 ± 0.695.42 ± 0.614.92 ± 0.484.92 ± 0.554.91 ± 0.44−1.079, 0.002−1.102, 0.001−1.102, 0.155−1.104, 0.080

The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range.

Relative expression of 44 candidate genes in healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II). The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range.

Investigation of the gender effect on differential gene expression in HNS, HS and COPD (Stage I and II) patients

Here we examined the effects of gender on the expression of our 44 candidate genes. We demonstrated that the expression of 40/44 (90.9%) of these genes is significantly different in HS men compared to HNS men (Table 6; HS v HNS). In addition, 15/17 (88.2%) of the novel genes previously undetected in COPD/lung function had significantly different expression levels (Table 6; HS v HNS) in men. Investigation of the expression levels of the 44 candidate genes in men with COPD versus HS revealed that 21/44 (47.7%) of genes were significantly different (Table 6; COPD v HS) and 10/17 (58.8%) of previously undetected genes in COPD/lung function were also significantly different. When HS were compared to COPD Stage I and II patients, respectively, 4/44 (Stage I; 9.0%) and 7/44 (Stage II; 15.9%) of the total candidate genes were significantly different in male HS compared to HNS. Of the 17 novel genes detected in this study, 1/17 (Stage I; 5.9%) and 3/17 (Stage II; 17.6%) were significantly different in males compared to HS (Table 6; Stage I or Stage II v HS). A number of the 44 candidate genes were significantly different in males across all four analyses, these included USP27X, AHRR, and the novel gene, JAKMIP3.
Table 6

Comparison of relative expression of 44 candidate genes between healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II) in men and women groups separately.

Sex groupGene SymbolHealthy Non-Smoker (N = 37)Healthy Smoker (N = 38)COPD smoker (N = 17)COPD stage I smoker (N = 9)COPD stage II smoker (N = 8)Fold Regulation, adjusted p-value (HS vs. HNS)Fold Regulation, adjusted p-value (COPD vs. HS)Fold Regulation, adjusted p-value (stage I vs. HS)Fold Regulation, adjusted p-value (stage II vs. HS)
Men1.PPP4R46.45 ± 0.776.10 ± 0.845.66 ± 0.655.88 ± 0.445.41 ± 0.79−1.057, 0.395−1.078, 0.061−1.038, 0.99−1.128, 0.14
2.THSD47.61 ± 0.497.28 ± 0.497.22 ± 0.447.47 ± 0.316.95 ± 0.41−1.044, 0.022−1.008, 0.611.026, 0.99−1.048, 0.40
3.NRG13.76 ± 0.224.04 ± 0.513.82 ± 0.393.94 ± 0.483.69 ± 0.231.073, 0.010−1.058, 0.032−1.024, 0.99−1.095, 0.10
4.SCGB1A114.49 ± 0.1314.35 ± 0.2114.17 ± 0.3214.19 ± 0.2514.15 ± 0.41−1.010, 0.030−1.013, 0.035−1.011, 0.260−1.014, 0.138
5.AHRR3.87 ± 0.174.67 ± 0.675.33 ±± 0.795.43 ± 0.815.23 ± 0.811.206, 0.0011.141, 0.0031.162, 0.0021.120, 0.064
6.CYP1A14.63 ± 0.246.31 ± 1.847.73 ± 1.618.04 ± 1.527.40 ± 1.751.364, 0.0011.225, 0.0051.274, 0.0111.172, 0.318
7.CYP1B13.70 ± 0.568.04 ± 2.059.68 ± 0.889.92 ± 0.609.42 ± 1.112.174, 0.0011.204, 0.0131.234, 0.0191.172, 0.246
8.PRDM115.67 ± 0.355.33 ± 0.345.31 ± 0.305.35 ± 0.225.27 ± 0.39−1.062, 0.001−1.004, 0.8271.003, 0.99−1.012, 0.99
9.CBR37.27 ± 0.448.27 ± 0.658.44 ± 0.648.48 ± 0.508.40 ± 0.801.137, 0.0011.021, 0.3921.026, 0.991.017, 0.99
10.AKR1C16.74 ± 0.628.61 ± 1.108.55 ± 1.098.44 ±± 1.248.67 ± 0.971.277, 0.001−1.007, 0.884−1.019, 0.991.007, 0.99
11.AKR1C310.28 ± 0.4111.9 ± 50.7612.06 ± 0.6012.20 ± 0.4711.91 ± 0.731.162, 0.0011.013, 0.9561.021, 0.99−1.004, 0.99
12.HTR2B3.86 ± 0.174.26 ± 0.304.24 ± 0.324.19 ± 0.404.30 ± 0.211.104, 0.001−1.005, 0.884−1.016, 0.991.009, 0.99
13.GRM13.74 ± 0.124.06 ± 0.354.19 ± 0.594.20 ± 0.374.17 ± 0.811.083, 0.0021.032, 0.4781.036, 0.991.028, 0.99
14.CYP4Z16.81 ± 0.596.18 ± 0.545.94 ± 0.355.90 ± 0.455.99 ± 0.21−1.102, 0.002−1.040, 0.101−1.048, 0.88−1.032, 0.99
15.UCHL15.30 ± 0.628.67 ± 1.379.46 ± 1.639.25 ± 1.699.72 ± 1.631.635, 0.0011.091, 0.0611.066, 0.991.121, 0.13
16.CABYR4.86 ± 0.287.14 ± 1.187.60 ± 1.397.98 ± 1.137.20 ± 1.631.470, 0.0011.064, 0.3081.118, 0.151.008, 0.99
17.GPRC5A7.64 ± 0.617.57 ± 0.438.02 ± 0.627.89 ± 0.718.17 ± 0.52−0.990, 0.991.059, 0.0061.043, 0.571.080, 0.032
18.CCDC379.48 ± 0.559.29 ± 0.579.25 ± 0.659.22 ± 0.809.28 ± 0.47−1.021, 0.969−1.004, 0.61−1.008, 0.99−1.001, 0.99
19.GLI37.55 ± 0.396.62 ± 0.556.60 ± 0.416.70 ± 0.436.50 ± 0.39−1.141, 0.001−1.003, 0.9271.013, 0.99−1.018, 0.99
20.ABCC36.97 ± 0.467.95 ± 0.547.72 ± 0.777.68 ± 0.767.77 ± 0.831.140, 0.001−1.030, 0.412−1.034, 0.99−1.023, 0.99
21.SAMD53.74 ± 0.144.02 ± 0.303.94 ± 0.544.00 ± 0.283.87 ± 0.761.074, 0.002−1.020, 0.133−1.003, 0.99−1.037, 0.99
22.RASSF107.70 ± 0.447.12 ± 0.576.67 ± 0.526.81 ± 0.316.51 ± 0.67−1.082, 0.001−1.067, 0.004−1.045, 0.552−1.093, 0.02
23.USP27X7.43 ± 0.297.12 ± 0.396.68 ± 0.286.71 ± 0.276.65 ± 0.30−1.043, 0.001−1.066, 0.001−1.061, 0.007−1.071, 0.002
24.NR0B13.94 ± 0.244.39 ± 0.784.84 ± 0.764.88 ± 0.894.80 ± 0.641.114, 0.0051.103, 0.0111.112, 0.1811.094, 0.627
25.PLAG15.73 ± 0.615.22 ± 0.564.83 ± 0.574.95 ± 0.554.69 ± 0.60−1.098, 0.002−1.081, 0.018−1.055, 0.99−1.113, 0.133
26.SCGB3A114.25 ± 0.3413.86 ± 0.6413.31 ± 0.8413.54 ± 0.9113.07 ± 0.72−1.028, 0.036−1.041, 0.025−1.024, 0.93−1.060, 0.004
27.LHX65.36 ± 0.335.85 ± 0.386.10 ± 0.496.12 ± 0.526.07 ± 0.481.091, 0.0011.043, 0.0711.046, 0.3361.038, 0.754
28.LINC009423.58 ± 0.193.88 ± 0.804.21 ± 1.054.25 ± 1.244.17 ± 0.871.084, 0.1721.085, 0.1551.094,0.5511.073, 0.99
29.REEP19.29 ± 0.669.81 ± 0.569.62 ± 0.789.39 ± 0.939.88 ±0.541.056, 0.005−1.020, 0.434−1.045, 0.5811.007, 0.99
30.C6orf1645.45 ± 0.426.14 ± 0.556.23 ± 0.526.17 ± 0.436.29 ± 0.621.126, 0.0011.015, 0.4891.005, 0.991.024, 0.99
31.LINC005895.09 ± 0.225.36 ± 0.345.34 ± 0.285.34 ± 0.175.35 ± 0.391.053, 0.001−1.004, 0.899−1.004, 0.99−1.002, 0.99
32.JAKMIP34.50 ± 0.205.28 ± 0.565.75 ± 0.855.71 ± 0.985.79 ± 0.741.173, 0.0011.089, 0.0491.082, 0.091.096, 0.07
33.LINC009306.01 ± 0.426.76 ± 0.536.56 ± 0.646.55 ± 0.826.58 ± 0.411.126, 0.001−1.030, 0.334−1.033, 0.99−1.028, 0.99
34.DNHD16.66 ± 0.507.07 ± 0.557.16 ± 0.807.15 ± 0.937.18 ± 0.691.061, 0.0201.013, 0.6621.011, 0.991.015, 0.99
35.TMCC33.83 ± 0.434.15 ± 0.524.55 ± 0.424.49 ± 0.474.61 ± 0.381.083, 0.0211.096, 0.0021.084, 0.3231.112, 0.095
36.ADH77.94 ± 0.8810.9 ± 10.7410.78 ± 0.5410.82 ± 0.5610.73 ± 0.561.375, 0.001−1.011, 0.166−1.009, 0.99−1.017, 0.99
37.PRKAR2B7.26 ± 0.636.43 ± 0.715.84 ± 0.575.89 ± 0.665.79 ± 0.50−1.128, 0.001−1.101, 0.004−1.092, 0.161−1.112, 0.066
38.GAD14.89 ± 0.606.33 ± 1.097.23 ± 0.897.07 ± 1.107.41 ± 0.611.295, 0.0011.142, 0.0091.117, 0.1621.171, 0.024
39.LOC3386675.37 ± 0.335.17 ± 0.234.98 ± 0.215.03 ± 0.184.92 ± 0.24−1.040, 0.006−1.038, 0.012−1.027, 0.99−1.050, 0.134
40.CYB5A4.7 ± 0.204.72 ± 0.184.58 ± 0.204.62 ± 0.234.53 ± 0.171.0, 0.99−1.031, 0.014−1.021, 0.99−1.041, 0.089
41.PIEZO27.13 ± 0.866.34 ± 0.735.83 ± 0.635.78 ± 0.725.89 ± 0.55−1.125, 0.001−1.087, 0.02−1.097, 0.326−1.077, 0.735
42.SLITRK67.95 ± 0.656.89 ± 0.786.34 ± 0.836.32 ± 1.046.36 ± 0.57−1.155, 0.001−1.087, 0.028−1.090, 0.317−1.082, 0.379
43.KCNA16.80 ± 1.005.84 ± 0.905.34 ± 0.735.50 ± 0.905.16 ± 0.45−1.166, 0.001−1.094, 0.041−1.062, 0.99−1.131, 0.271
44.LOC1005075605.85 ± 0.695.39 ± 0.604.95 ± 0.524.92 ± 0.564.98 ± 0.50−1.084, 0.012−1.089, 0.004−1.096, 0.263−1.083, 0.507
Gene Symbol Healthy Non-Smoker (N = 16) Healthy Smoker (N = 21) COPD smoker (N = 4) COPD stage I smoker (N = 0) COPD stage II smoker (N = 4) Fold Regulation, adjusted p-value (HS vs. HNS)Fold Regulation, adjusted p-value (COPD vs. HS)Fold Regulation, adjusted p-value (stage I vs. HS)Fold Regulation, adjusted p-value (stage II vs. HS)
Women1.PPP4R46.87 ±± 0.635.84 ± 0.965.15 ± 0.815.15 ± 0.81−1.175, 0.005−1.135, 0.034−1.135, 0.034
2.THSD47.58 ± 0.547.44 ± 0.636.63 ± 0.566.63 ± 0.56−1.018, 0.99−1.123, 0.047−1.123, 0.047
3.NRG13.87 ± 0.304.07 ± 0.393.71 ± 0.343.71 ± 0.341.051, 0.291−1.099, 0.184−1.099, 0.184
4.SCGB1A114.48 ± 0.2014.40 ± 0.1614.04 ± 0.3114.04 ± 0.31−1.006, 0.682−1.025, 0.006−1.025, 0.006
5.AHRR3.84 ± 0.164.43 ± 0.614.81 ± 0.384.81 ± 0.381.154, 0.0021.085, 0.5421.085, 0.542
6.CYP1A14.59 ± 0.265.79 ± 1.857.66 ± 2.807.66 ± 2.801.260, 0.0371.324, 0.0811.324, 0.081
7.CYP1B13.52 ± 0.357.21 ± 1.968.77 ± 2.048.77 ± 2.042.050, 0.0011.216, 0.271.216, 0.27
8.PRDM115.70 ± 0.395.39 ± 0.315.15 ± 0.185.15 ± 0.18−1.057, 0.029−1.047, 0.56−1.047, 0.56
9.CBR37.14 ± 0.397.93 ± 0.707.98 ± 0.537.98 ± 0.531.111, 0.0021.006, 0.991.006, 0.99
Women10.AKR1C17.00 ± 0.838.48 ± 1.177.40 ± 1.357.40 ± 1.351.211, 0.002−1.146, 0.224−1.146, 0.224
11.AKR1C310.36 ± 0.5211.68 ± 0.6111.55 ± 0.8311.55 ± 0.831.127, 0.002−1.011, 0.99−1.011, 0.99
12.HTR2B4.08 ± 0.224.13 ± 0.314.29 ± 0.314.29 ± 0.311.012, 0.991.037, 0.971.037, 0.97
13.GRM13.75 ± 0.114.04 ± 0.344.05 ± 0.324.05 ± 0.321.076, 0.0101.004, 0.991.004, 0.99
14.CYP4Z16.91 ± 0.546.36 ± 0.635.84 ± 0.525.84 ± 0.52−1.085, 0.037−1.090, 0.321−1.090, 0.321
15.UCHL15.32 ± 0.419.03 ± 1.728.49 ± 2.668.49 ± 2.661.698, 0.001−1.063, 0.99−1.063, 0.99
16.CABYR4.87 ± 0.326.55 ± 1.266.88 ± 1.266.88 ± 1.261.343, 0.0011.051, 0.991.051, 0.99
17.GPRC5A7.42 ± 0.697.39 ± 0.437.96 ± 0.647.96 ± 0.64−1.005, 0.991.077, 0.2161.077, 0.216
18.CCDC379.36 ± 0.549.48 ± 0.449.34 ± 0.629.34 ± 0.621.013, 0.99−1.015, 0.99−1.015, 0.99
19.GLI37.70 ± 0.366.99 ± 0.556.71 ± 0.226.71 ± 0.22−1.101, 0.001−1.042, 0.723−1.042, 0.723
20.ABCC36.91 ± 0.437.76 ± 0.727.20 ± 0.787.20 ± 0.781.122, 0.002−1.077, 0.345−1.077, 0.345
21.SAMD53.78 ± 0.173.91 ± 0.253.87 ± 0.213.87 ± 0.211.037, 0.20−1.010, 0.99−1.010, 0.99
22.RASSF107.64 ± 0.626.96 ± 0.626.41 ± 0.746.41 ± 0.74−1.097, 0.011−1.086, 0.374−1.086, 0.374
23.USP27X7.47 ± 0.307.16 ± 0.446.51 ± 0.816.51 ± 0.81−1.043, 0.158−1.10, 0.04−1.10, 0.04
24.NR0B13.94 ± 0.244.40 ± 0.774.44 ± 0.844.44 ± 0.841.119, 0.0651.009, 0.991.009, 0.99
25.PLAG15.95 ± 0.405.32 ± 0.554.74 ± 0.494.74 ± 0.49−1.120, 0.002−1.121, 0.115−1.121, 0.115
26.SCGB3A114.13 ± 0.5514.05 ± 0.5112.51 ± 0.2612.51 ± 0.26−1.006, 0.99−1.123, 0.001−1.123, 0.001
27.LHX65.37 ± 0.405.73 ± 0.556.22 ± 0.686.22 ± 0.681.067, 0.1161.086, 0.2531.086, 0.253
28.LINC009423.59 ± 0.143.85 ± 0.513.73 ± 0.633.73 ± 0.631.074, 0.154−1.034, 0.99−1.034, 0.99
29.REEP19.47 ± 0.589.85 ± 0.579.59 ± 0.819.59 ± 0.811.040, 0.217−1.028, 0.99−1.028, 0.99
30.C6orf1645.58 ± 0.676.01 ± 0.576.14 ± 1.026.14 ± 1.021.077, 0.2241.021, 0.991.021, 0.99
31.LINC005895.30 ± 0.295.43 ± 0.255.64 ± 0.345.64 ± 0.341.025, 0.5241.037, 0.5491.037, 0.549
32.JAKMIP34.54 ± 0.235.28 ± 0.725.39 ± 0.945.39 ± 0.941.164, 0.0021.020, 0.991.020, 0.99
33.LINC009306.04 ± 0.477.20 ± 0.566.45 ± 0.926.45 ± 0.921.191, 0.001−1.116, 0.07−1.116, 0.07
34.DNHD16.61 ± 0.467.28 ± 0.477.05 ± 0.477.05 ± 0.471.102, 0.001−1.033, 0.99−1.033, 0.99
35.TMCC33.82 ± 0.274.01 ± 0.404.07 ± 0.234.07 ± 0.231.051, 0.2831.016, 0.991.016, 0.99
36.ADH78.22 ± 1.0510.63 ± 0.5710.37 ± 0.9210.37 ± 0.921.294, 0.001−1.025, 0.99−1.025, 0.99
37.PRKAR2B7.59 ± 0.376.49 ± 0.755.92 ± 0.765.92 ± 0.76−1.169, 0.001—1.096, 0.325−1.096, 0.325
38.GAD14.98 ± 0.666.27 ± 0.867.35 ± 0.877.35 ± 0.871.260, 0.0011.171, 0.0581.171, 0.058
39.LOC3386675.45 ± 0.335.24 ± 0.255.31 ± 0.265.31 ± 0.26−1.040, 0.111.013, 0.991.013, 0.99
40.CYB5A4.74 ± 0.194.69 ± 0.204.31 ± 0.304.31 ± 0.30−1.010, 0.99−1.089, 0.005−1.089, 0.005
41.PIEZO27.53 ± 0.826.78 ±± 0.926.14 ± 0.696.14 ± 0.69−1.110, 0.054−1.105, 0.488−1.105, 0.488
42.SLITRK68.02 ± 0.516.82 ± 0.995.85 ± 0.345.85 ± 0.34−1.176, 0.001−1.166, 0.069−1.166, 0.069
43.KCNA17.04 ± 1.155.72 ± 1.024.63 ± 0.774.63 ± 0.77−1.231, 0.002−1.234, 0.184−1.234, 0.184
44.LOC1005075605.88 ± 0.735.48 ± 0.634.81 ± 0.314.81 ± 0.31−1.073, 0.227−1.141, 0.171−1.141, 0.171

The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range.

Comparison of relative expression of 44 candidate genes between healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II) in men and women groups separately. The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range. We then investigated the expression of our 44 candidate genes, including our 17 novel genes, in HNS, HS, COPD, and COPD Stage I and II. Here, we determined that 52.3% of the 44 candidate genes were significantly differentially expressed in HS compared to HNS females (Table 6; HS v HNS). In addition, 47.1% of the 17 novel genes were significantly different in HS females compared to HNS females. A comparison of female COPD patients to HS females revealed that expression of 7/44 (15.9%) of the 44 candidate genes and 2/17 (11.8%) of the 17 novel genes were significantly different in COPD patients compared to HS (Table 6; COPD v HS). Furthermore, we also observed of significant difference in COPD Stage II compared with HS in 6/44 (13.6%) of the 44 candidate genes and 2/17 (11.8%) of the novel genes detected in this study in females (Table 6; Stage II v HS). A number of the 44 candidate genes were significantly different in females across all four analyses, these included CYP1A1 and the novel genes, LINC00930, GAD1 and SLITRK6.

Investigation of the age effect on differential gene expression in HNS, HS and COPD (Stage I and II) patients

In this study, we also investigated the effect of age (i.e. subject or patient age: < or ≥50 years) on gene expression in HNS, HS, COPD patients, and stage I and II patients. We observed a significant change in the gene expression of 40/44 (90.9%) total candidate genes and 15/17 (88.2%) novel genes in subjects ≤ 50 years (Table 7; <50 years; HS v HNS). Comparison of HS to COPD patients revealed that expression of 20/44 (45.5%) of our candidate genes were significantly different in COPD patients ≤ 50 years. In addition, expression of 8/17 (47.1%) of our novel genes were significantly different in COPD patients compared to HS ≤ 50 years (Table 7; <50 years; COPD v HS). We also investigated differential gene expression in Stage I and II COPD patients ≤ 50 years and determined that 6/44 (13.6%; Stage I) and 9/44 (20.5%; Stage II) candidate genes, respectively, were significantly different in patients ≤ 50 years. In our cohort of 17 novel genes, we determined that 2/17 (11.8%; Stage I) and 3/17 (17.6%; Stage II) among our total 44 candidate genes, respectively, were significantly different in patients ≤ 50 years (Table 7; <50 years; Stage I and Stage II v HS). Furthermore, a certain number of these candidate genes were significantly different in subjects ≤ 50 years, across all four analysis groups, which included USP27X, CYP1A1 and the novel genes of JAKMIP3 and GAD1.
Table 7

Comparison of relative expression of 44 candidate genes between healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II) in age groups separately.

Age groupGene SymbolHealthy Non-Smoker (N = 45)Healthy Smoker (N = 50)COPD smoker (N = 10)COPD stage I smoker (N = 5)COPD stage II smoker (N = 5)Fold Regulation, adjusted p-value (HS vs. HNS)Fold Regulation, adjusted p-value (COPD vs. HS)Fold Regulation, adjusted p-value (stage I vs. HS)Fold Regulation, adjusted p-value (stage II vs. HS)
<50 years old1.PPP4R46.56 ± 0.776.01 ± 0.865.23 ± 0.605.64 ± 0.434.84 ± 0.48−1.091,0.009−1.149, 0.005−1.065, 0.99−1.242, 0.009
2.THSD47.65 ± 0.477.36 ± 0.567.24 ± 0.477.47 ± 0.337.02 ± 0.51−1.039, 0.049−1.017, 0.5791.015, 0.99−1.049, 0.91
3.NRG13.81 ± 0.264.05 ± 0.453.89 ± 0.504.06 ± 0.623.72 ± 0.291.064, 0.008−1.041, 0.1311.002, 0.99−1.090, 0.336
4.SCGB1A114.50 ± 0.1514.36 ± 0.1914.01 ± 0.4114.15 ± 0.3313.88 ± 0.47−1.010, 0.005−1.025, 0.009−1.015, 0.149−1.035, 0.001
5.AHRR3.87 ± 0.154.53 ± 0.635.27 ± 0.735.66 ± 0.714.91 ± 0.591.168, 0.0011.163, 0.0041.250, 0.0011.085, 0.618
6.CYP1A14.61 ± 0.245.95 ± 1.797.91 ± 2.168.22 ± 1.817.61 ± 2.671.290, 0.0011.329, 0.0081.381, 0.0071.278, 0.046
7.CYP1B13.64 ± 0.537.54 ± 2.049.75 ± 1.1710.23 ± 0.549.30 ± 1.532.072, 0.0011.293, 0.0031.357, 0.0061.234, 0.17
8.PRDM115.72 ± 0.365.36 ± 0.345.20 ± 0.315.30 ± 0.185.11 ± 0.40−1.067, 0.001−1.031, 0.193−1.011, 0.99−1.048, 0.82
9.CBR37.26 ± 0.438.10 ± 0.688.58 ± 0.598.74 ± 0.418.44 ± 0.751.116, 0.0011.059, 0.0391.079, 0.1551.042, 0.99
10.AKR1C16.92 ± 0.678.54 ± 1.148.71 ± 1.188.58 ± 1.428.85 ± 1.041.235, 0.0011.020, 0.4911.004, 0.991.036, 0.99
11.AKR1C310.31 ± 0.4611.82 ± 0.7312.15 ± 0.6712.42 ± 0.4211.88 ± 0.821.146, 0.0011.028, 0.2751.051, 0.2661.006, 0.99
12.HTR2B3.94 ± 0.214.22 ± 0.324.29 ± 0.364.31 ± 0.514.26 ± 0.181.072, 0.0011.017, 0.4211.021, 0.991.010, 0.99
13.GRM13.74 ± 0.114.01 ± 0.304.37 ± 0.734.38 ± 0.404.36 ± 1.011.072, 0.0011.090, 0.1121.091, 0.0681.087, 0.028
14.CYP4Z16.88 ± 0.586.27 ± 0.595.90 ± 0.355.89 ± 0.455.91 ± 0.27−1.097, 0.001−1.063, 0.058−1.065, 0.834−1.062, 0.903
15.UCHL15.33 ± 0.598.71 ± 1.5010.11 ± 1.7910.35 ± 1.369.88 ± 2.311.636, 0.0011.161, 0.0081.188, 0.0491.134, 0.181
16.CABYR4.86 ± 0.296.85 ± 1.207.84 ± 1.648.52 ± 1.127.22 ± 1.991.409, 0.0011.145, 0.0371.244, 0.0031.054, 0.99
17.GPRC5A7.58 ± 0.667.50 ± 0.458.16 ± 0.617.98 ± 0.688.35 ± 0.55−1.011, 0.991.088, 0.0031.065, 0.3921.114, 0.010
18.CCDC379.47 ± 0.559.34 ± 0.559.54 ± 0.669.40 ± 0.809.68 ± 0.55−1.013, 0.991.021, 0.4331.006, 0.991.036, 0.99
19.GLI37.62 ± 0.396.72 ± 0.606.51 ± 0.486.61 ± 0.496.41 ± 0.49−1.133, 0.001−1.032, 0.221−1.017, 0.99−1.050, 0.99
20.ABCC36.96 ± 0.437.91 ± 0.627.98 ± 0.698.14 ± 0.667.82 ± 0.761.136, 0.0011.009, 0.7161.029, 0.99−1.011, 0.99
21.SAMD53.74 ± 0.153.97 ± 0.284.13 ± 0.624.13 ± 0.244.13 ± 0.891.060, 0.0011.040, 0.4551.041, 0.991.041, 0.614
22.RASSF107.71 ± 0.507.08 ± 0.576.41 ± 0.606.68 ± 0.366.15 ± 0.72−1.089, 0.001−1.105, 0.002−1.060, 0.62−1.152, 0.002
23.USP27X7.46 ± 0.287.16 ± 0.406.52 ± 0.506.74 ± 0.346.31 ± 0.58−1.041, 0.002−1.098, 0.001−1.063, 0.075−1.136, 0.001
24.NR0B13.95 ± 0.254.36 ± 0.785.15 ± 0.845.35 ± 0.904.97 ± 0.811.103, 0.0031.181, 0.0051.226, 0.0051.139, 0.23
25.PLAG15.87 ± 0.555.26 ± 0.574.74 ± 0.574.93 ± 0.644.55 ± 0.49−1.116, 0.001−1.110, 0.009−1.067, 0.99−0.866, 0.045
26.SCGB3A114.25 ± 0.4013.93 ± 0.5913.48 ± 0.7813.89 ± 0.5113.09 ± 0.85−1.023, 0.025−1.033, 0.092−1.002, 0.99−1.064, 0.006
27.LHX65.33 ± 0.355.80 ± 0.446.24 ± 0.576.28 ± 0.616.20 ± 0.591.087, 0.0011.076, 0.0161.084, 0.0841.069, 0.25
28.LINC009423.57 ± 0.173.85 ± 0.694.45 ± 1.284.67 ± 1.534.24 ± 1.091.079, 0.0731.156, 0.181.212, 0.0081.101, 0.828
29.REEP19.40 ± 0.609.79 ± 0.559.81 ± 0.749.91 ± 0.919.71 ± 0.631.041, 0.0131.002, 0.7161.013, 0.99−1.009, 0.99
30.C6orf1645.53 ± 0.516.06 ± 0.566.08 ± 0.526.17 ± 0.535.99 ± 0.561.097, 0.0011.003, 0.661.017, 0.99−1.012, 0.99
31.LINC005895.15 ± 0.275.38 ± 0.325.32 ± 0.295.33 ± 0.165.30 ± 0.411.045, 0.001−1.011, 0.848−1.010, 0.99−1.015, 0.99
32.JAKMIP34.50 ± 0.225.25 ± 0.616.18 ± 0.816.20 ± 0.926.16 ± 0.801.166, 0.0011.177, 0.0021.180, 0.0011.174, 0.001
33.LINC009306.02 ± 0.366.90 ± 0.596.77 ± 0.716.87 ± 0.796.66 ± 0.691.146, 0.001−1.019, 0.716−1.004, 0.99−1.035, 0.99
34.DNHD16.68 ± 0.507.15 ± 0.547.37 ± 0.837.47 ± 0.987.27 ± 0.751.070, 0.0011.031, 0.3581.044, 0.991.017, 0.99
35.TMCC33.80 ± 0.284.11 ± 0.484.48 ± 0.534.56 ± 0.624.40 ± 0.481.082, 0.0011.090, 0.0191.109, 0.121.071, 0.85
36.ADH78.08 ± 0.9410.78 ± 0.7210.87 ± 0.6011.09 ± 0.5910.66 ± 0.581.335, 0.0011.008, 0.8041.029, 0.99−1.012, 0.99
37.PRKAR2B7.35 ± 0.616.46 ± 0.745.74 ± 0.515.83 ± 0.465.64 ± 0.59−1.137, 0.001−1.125, 0.004−1.107, 0.239−1.145, 0.054
38.GAD14.87 ± 0.566.25 ± 0.997.44 ± 0.877.26 ± 1.017.62 ± 0.781.283, 0.0011.190, 0.0021.162, 0.0681.219, 0.005
39.LOC3386675.41 ± 0.345.20 ± 0.245.03 ± 0.225.04 ± 0.205.03 ± 0.26−1.040, 0.002−1.034, 0.071−1.031, 0.99−1.034, 0.99
40.CYB5A4.73 ± 0.204.73 ± 0.184.61 ± 0.234.70 ± 0.194.53 ± 0.26−1.000, 0.99−1.026, 0.235−1.006, 0.99−1.043, 0.99
41.PIEZO27.34 ± 0.846.49 ± 0.875.71 ± 0.645.71 ± 0.795.71 ± 0.54−1.132, 0.001−1.137, 0.008−1.136, 0.291−1.135, 0.255
42.SLITRK68.05 ± 0.596.89 ± 0.846.01 ± 0.996.10 ± 1.315.93 ± 0.67−1.169, 0.001−1.146, 0.015−1.129, 0.249−1.161, 0.043
43.KCNA16.95 ± 1.045.79 ± 0.965.19 ± 0.725.27 ± 0.965.11 ± 0.47−1.200, 0.001−1.116, 0.031−1.098, 0.99−1.133, 0.658
44.LOC1005075605.90 ± 0.715.45 ± 0.625.03 ± 0.655.01 ± 0.755.05 ± 0.63−1.083, 0.006−1.083, 0.041−1.087, 0.99−1.079, 0.99
Age group Gene Symbol Healthy Non-Smoker (N = 8) Healthy Smoker (N = 9) COPD smoker (N = 11) COPD stage I smoker (N = 4) COPD stage II smoker (N = 7) Fold Regulation, adjusted p-value (HS vs. HNS)Fold Regulation, adjusted p-value (COPD vs. HS)Fold Regulation, adjusted p-value (stage I vs. HS)Fold Regulation, adjusted p-value (stage II vs. HS)
≥50 years old1.PPP4R46.64 ± 0.646.05 ± 1.165.87 ± 0.636.20 ± 0.235.70 ± 0.75−1.098, 0.99−1.031, 0.1911.024, 0.99−1.062, 0.99
2.THSD47.29 ± 0.627.15 ± 0.336.99 ± 0.537.48 ± 0.336.72 ± 0.42−1.019, 0.99−1.023, 0.3661.046, 0.99−1.065, 0.663
3.NRG13.71 ± 0.184.02 ± 0.633.72 ± 0.223.80 ± 0.153.68 ± 0.251.084, 0.606−1.081, 0.315−1.059, 0.99−1.095, 0.447
4.SCGB1A114.38 ± 0.1314.38 ± 0.3014.27 ± 0.1314.23 ± 0.1014.28 ± 0.161.000, 0.99−1.008, 0.132−1.010, 0.99−1.007, 0.99
5.AHRR3.79 ± 0.235.13 ± 0.745.19 ± 0.825.15 ± 0.965.21 ± 0.811.353, 0.0121.012, 0.921.004, 0.991.016, 0.99
6.CYP1A14.64 ± 0.237.88 ± 1.637.55 ± 1.467.83 ± 1.267.40 ± 1.641.698, 0.001−1.044, 0.763−1.007, 0.99−1.065, 0.99
7.CYP1B13.68 ± 0.439.76 ± 1.129.28 ± 1.159.55 ± 0.479.12 ± 1.432.650, 0.001−1.052, 0.546−1.022, 0.99−1.069, 0.99
8.PRDM115.42 ± 0.185.32 ± 0.125.35 ± 0.265.41 ± 0.275.32 ± 0.27−1.019, 0.991.006, 0.8411.018, 0.99−1.000, 0.99
9.CBR37.06 ± 0.448.60 ± 0.548.15 ± 0.638.17 ± 0.478.14 ± 0.741.218, 0.001−1.055, 0.159−1.052, 0.99−1.056, 0.99
≥50 years old10.AKR1C16.20 ± 0.408.75 ± 0.907.97 ± 1.158.28 ± 1.147.80 ± 1.211.412, 0.001−1.098, 0.159−1.057, 0.99−1.121, 0.596
11.AKR1C310.24 ± 0.2812.18 ± 0.5911.79 ± 0.6311.92 ± 0.4211.72 ± 0.751.190, 0.001−1.033, 0.269−1.022, 0.99−1.040, 0.911
12.HTR2B3.85 ± 0.214.17 ± 0.224.22 ± 0.284.05 ± 0.184.32 ± 0.281.083, 0.1421.012, 0.99−1.028, 0.991.036, 0.99
13.GRM13.75 ± 0.154.37 ± 0.543.98 ± 0.173.99 ± 0.173.97 ± 0.191.164, 0.008−1.098, 0.056−1.095, 0.328−1.099, 0.141
14.CYP4Z16.58 ± 0.456.02 ± 0.455.95 ± 0.425.92 ± 0.535.96 ± 0.38−1.094, 0.20−1.012, 0.688−1.017, 0.99−1.009, 0.99
15.UCHL15.18 ± 0.359.51 ± 1.428.56 ± 1.548.03 ± 1.048.89 ± 1.741.836, 0.001−1.111, 0.269−1.185, 0.467−1.070, 0.99
16.CABYR4.89 ± 0.327.69 ± 1.357.13 ± 0.997.36 ± 0.837.00 ± 1.111.575, 0.001−1.079, 0.228−1.045, 0.99−1.099, 0.99
17.GPRC5A7.60 ± 0.457.60 ± 0.207.88 ± 0.617.79 ± 0.847.93 ± 0.501.000, 0.991.037, 0.1081.024, 0.991.042, 0.99
18.CCDC379.33 ± 0.529.48 ± 0.369.02 ± 0.509.00 ± 0.869.04 ± 0.221.016, 0.99−1.051, 0.044−1.054, 0.982−1.049, 0.701
19.GLI37.39 ± 0.366.91 ± 0.256.73 ± 0.256.81 ± 0.396.68 ± 0.15−1.069, 0.044−1.027, 0.269−1.015, 0.99−1.034, 0.99
20.ABCC36.92 ± 0.597.66 ± 0.527.31 ± 0.757.16 ± 0.487.40 ± 0.891.106, 0.379−1.048, 0.315−1.070, 0.99−1.034, 0.99
21.SAMD53.77 ± 0.154.08 ± 0.333.75 ± 0.243.85 ± 0.283.70 ± 0.221.082, 0.183−1.088, 0.044−1.059, 0.953−1.104, 0.06
22.RASSF107.48 ± 0.406.90 ± 0.796.81 ± 0.466.97 ± 0.176.72 ± 0.56−1.083, 0.534−1.013, 0.3661.010, 0.99−1.027, 0.99
23.USP27X7.32 ± 0.366.88 ± 0.396.77 ± 0.276.67 ± 0.196.82 ± 0.31−1.064, 0.168−1.016, 0.482−1.031, 0.99−1.008, 0.99
24.NR0B13.84 ± 0.184.68 ± 0.724.43 ± 0.524.35 ± 0.474.48 ± 0.571.217, 0.041−1.056, 0.421−1.076, 0.99−1.044, 0.99
25.PLAG15.35 ± 0.505.23 ± 0.414.88 ± 0.534.98 ± 0.504.82 ± 0.58−1.023, 0.99−1.072, 0.159−1.050, 0.99−1.085, 0.99
26.SCGB3A113.96 ± 0.3913.88 ± 0.7512.87 ± 0.7913.10 ± 1.2012.73 ± 0.49−1.006, 0.99−1.078, 0.044−1.059, 0.642−1.090, 0.042
27.LHX65.55 ± 0.265.89 ± 0.466.01 ± 0.465.91 ± 0.336.07 ± 0.531.062, 0.8631.020, 0.2691.003, 0.991.030, 0.99
28.LINC009423.64 ± 0.204.02 ± 0.933.83 ± 0.443.77 ± 0.273.86 ± 0.531.105, 0.987−1.050, 0.841−1.066, 0.99−1.042, 0.99
29.REEP18.96 ± 0.7510.11 ± 0.619.43 ± 0.788.77 ± 0.419.84 ± 0.661.129, 0.028−1.072, 0.088−1.153, 0.024−1.028, 0.99
30.C6orf1645.27 ± 0.446.38 ± 0.516.33 ± 0.686.18 ± 0.356.42 ± 0.821.212, 0.016−1.008, 0.841−1.033, 0.991.006, 0.99
31.LINC005895.14 ± 0.205.42 ± 0.335.47 ± 0.325.35 ± 0.205.55 ± 0.361.053, 0.6091.009, 0.763−1.012, 0.991.024, 0.99
32.JAKMIP34.55 ± 0.175.52 ± 0.605.26 ± 0.635.16 ± 0.785.31 ± 0.581.214, 0.022−1.049, 0.482−1.069, 0.99−1.039, 0.99
33.LINC009305.98 ± 0.797.01 ± 0.416.34 ± 0.596.16 ± 0.766.44 ± 0.521.172, 0.061−1.106, 0.044−1.137, 0.348−1.088, 0.767
34.DNHD16.42 ± 0.277.07 ± 0.516.94 ± 0.616.77 ± 0.807.04 ± 0.531.100, 0.202−1.019, 0.92−1.045, 0.99−1.004, 0.99
35.TMCC34.03 ± 0.804.02 ± 0.554.43 ± 0.354.42 ± 0.234.44 ± 0.42−1.002, 0.991.102, 0.0351.098, 0.991.105, 0.99
36.ADH77.63 ± 0.8911.08 ± 0.4410.54 ± 0.6310.48 ± 0.2910.58 ± 0.781.453, 0.001−1.051, 0.044−1.058, 0.99−1.048, 0.99
37.PRKAR2B7.38 ± 0.426.38 ± 0.555.97 ± 0.675.96 ± 0.935.97 ± 0.55−1.158, 0.04−1.069, 0.228−1.070, 0.99−1.068, 0.99
38.GAD15.22 ± 0.876.89 ± 1.087.09 ± 0.866.85 ± 1.327.23 ± 0.571.321, 0.0251.029, 0.688−1.007, 0.991.049, 0.99
39.LOC3386675.32 ± 0.205.14 ± 0.225.05 ± 0.295.02 ± 0.175.06 ± 0.35−1.035, 0.99−1.018, 0.315−1.024, 0.99−1.015, 0.99
40.CYB5A4.70 ± 0.174.58 ± 0.184.45 ± 0.234.53 ± 0.274.41 ± 0.21−1.026, 0.99−1.029, 0.228−1.012, 0.99−1.040, 0.81
41.PIEZO26.64 ± 0.776.52 ± 0.336.05 ± 0.615.86 ± 0.746.16 ± 0.57−1.019, 0.99−1.078, 0.159−1.111, 0.81−1.058, 0.99
42.SLITRK67.50 ± 0.626.67 ± 0.966.46 ± 0.516.60 ± 0.696.38 ± 0.42−1.125, 0.298−1.033, 0.482−1.010, 0.99−1.046, 0.99
43.KCNA16.40 ± 1.005.83 ± 0.825.20 ± 0.855.79 ± 0.864.89 ± 0.69−1.098, 0.99−1.121, 0.159−1.007, 0.99−1.193, 0.356
44.LOC1005075605.59 ± 0.525.20 ± 0.524.82 ± 0.224.81 ± 0.184.83 ± 0.25−1.075, 0.638−1.079, 0.088−1.081, 0.85−1.077, 0.655

The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range.

Comparison of relative expression of 44 candidate genes between healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II) in age groups separately. The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range. We then investigated the differential regulation of these 44 candidate genes in HNS, HS, COPD patients, and Stage I and II patients over 50 years (Table 7; ≥50 years). In this age group, gene expression was not significantly different. This was surprising, as the symptoms of COPD worsen with age and one would expect associated gene regulation to become more dysregulated. Specifically, a comparison of HS to HNS in subjects over 50 years revealed that expression of 16/44 (36.4%) of the candidate genes and 5/17 (29.4%) 17 novel genes were significantly different (Table 7; ≥50 years; HS v HNS). Investigation of gene expression in COPD versus HS in subjects over 50 years revealed that expression of 3/44 (6.8%) of the 44 candidate genes and 1/17 (5.9%) of the novel genes were significantly different (Table 7; ≥ 50 years; COPD v HS). Subsequently, we investigated the differential gene expression in Stage I and II COPD patients over 50 years and determined that 1/44% (2.3%; Stage I) and 0/44 (0%; Stage II), respectively, were significantly different in patients ≤50 years. In our cohort of 17 novel genes, we determined that expression of only 1/17% (5.9%; Stage I) and 0/17 (0%; Stage II) genes, respectively, were significantly different in patients ≤50 years (Table 7; ≥50 years; Stage I and Stage II v HS). Furthermore, in COPD patients over 50 years, no genes were significantly different across all four analysis groups (i.e. HS v HNS; COPD v HS; Stage I v HS and Stage II v HS).

Investigation of the effect of cigarette pack number per year on differential gene expression in HS and COPD (Stage I and II) patients

Here, we also investigated the effect of cigarette pack number per year (i.e. < or ≥50 cigarette packs/year) on the gene expression in HS, COPD patients, and Stage I and II patients. We analyzed the differential regulation of the 44 candidate genes in HS, COPD patients, and Stage I and II patients who consumed less than 50 packs/year (Table 8; <50 packs/year). In this age group, gene expression was not significantly different. Investigation of gene expression in COPD versus HS in subjects who consumed less than 50 packs/year revealed that expression of 4/44 (9.1%) candidate genes and 2/17 (11.8%) novel genes were significantly different (Table 8; <50 packs/year; COPD v HS). Subsequently, we studied the differential gene expression in Stage II COPD patients who consumed ≥50 packs/year and determined that 4/44 (9.1%) candidate genes, were significantly different in patients ≤50 years. In our cohort of 17 novel genes, we determined that the expression of only 1/17 (5.9%) of Stage II genes was significantly different in patients who consumed ≥50 packs/year compared to HS (Table 8; <50 packs/year; Stage I and Stage II v HS). Furthermore, a certain number of candidate genes were significantly different in subjects who consumed ≥50 packs/year, across in both analysis groups, which included SAMD5, PLAG1 and the novel gene, SLITRK6.
Table 8

Comparison of relative expression of 44 candidate genes between healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II) in number of pack of cigarette per year, separately.

Smoking groupGene SymbolHealthy Smoker (N = 37)COPD smoker (N = 6)COPD stage I smoker (N = 0)COPD stage II smoker (N = 6)Fold Regulation, adjusted p-value (COPD vs. HS)Fold Regulation, adjusted p-value (stage I vs. HS)Fold Regulation, adjusted p-value (stage II vs. HS)
<50 packs per year1.PPP4R46.69 ± 0.535.81 ± 0.105.81 ± 0.10−1.153, 0.074−1.153, 0.074
2.THSD47.28 ± 0.386.99 ± 0.266.99 ± 0.26−1.042, 0.369−1.042, 0.369
3.NRG13.97 ± 0.523.66 ± 0.013.66 ± 0.01−1.086, 0.428−1.086, 0.428
4.SCGB1A114.31 ± 0.3614.40 ± 0.2314.40 ± 0.231.007, 0.7621.007, 0.762
5.AHRR5.01 ± 0.445.74 ± 1.435.74 ± 1.431.145, 0.5721.145, 0.572
6.CYP1A15.93 ± 1.397.84 ± 1.617.84 ± 1.611.321, 0.181.321, 0.18
7.CYP1B18.60 ± 1.2510.10 ± 0.4310.10 ± 0.431.175, 0.1891.175, 0.189
8.PRDM115.41 ± 0.215.41 ± 0.295.41 ± 0.291.001, 0.9711.001, 0.971
9.CBR38.18 ± 0.488.89 ± 0.598.89 ± 0.591.086, 0.1591.086, 0.159
10.AKR1C19.19 ± 0.398.91 ± 0.868.91 ± 0.86−1.031, 0.737−1.031, 0.737
11.AKR1C312.26 ± 0.3812.45 ± 0.1212.45 ± 0.121.015, 0.5551.015, 0.555
12.HTR2B4.04 ± 0.274.33 ± 0.364.33 ± 0.361.072, 0.2821.072, 0.282
13.GRM14.30 ± 0.624.14 ± 0.254.14 ± 0.25−1.038, 0.708−1.038, 0.708
14.CYP4Z16.13 ± 0.346.10 ± 0.056.10 ± 0.05−1.006, 0.876−1.006, 0.876
15.UCHL110.40 ± 0.8410.34 ± 0.5210.34 ± 0.52−1.006, 0.908−1.006, 0.908
16.CABYR8.02 ± 0.558.08 ± 1.238.08 ± 1.231.007, 0.9391.007, 0.939
17.GPRC5A7.21 ± 0.387.80 ± 0.077.80 ± 0.071.082, 0.0941.082, 0.094
18.CCDC379.24 ± 0.238.88 ± 0.018.88 ± 0.01−1.040, 0.086−1.040, 0.086
19.GLI36.57 ± 0.446.72 ± 0.036.72 ± 0.031.023, 0.6851.023, 0.685
20.ABCC38.13 ± 0.407.80 ± 0.397.80 ± 0.39−1.042, 0.371−1.042, 0.371
21.SAMD54.31 ± 0.343.44 ± 0.083.44 ± 0.08−1.252, 0.003−1.252, 0.003
22.RASSF107.33 ± 0.376.52 ± 1.206.52 ± 1.20−1.125, 0.529−1.125, 0.529
23.USP27X7.12 ± 0.446.61 ± 0.086.61 ± 0.08−1.077, 0.18−1.077, 0.18
24.NR0B14.41 ± 0.435.12 ± 0.705.12 ± 0.701.160, 0.1451.160, 0.145
25.PLAG15.37 ± 0.154.56 ± 0.344.56 ± 0.34−1.178, 0.005−1.178, 0.005
26.SCGB3A113.83 ± 0.8113.18 ± 0.0513.18 ± 0.05−1.049, 0.319−1.049, 0.319
27.LHX65.86 ± 0.346.27 ± 0.706.27 ± 0.701.070, 0.5491.070, 0.549
28.LINC009424.20 ± 1.013.86 ± 0.223.86 ± 0.22−1.088, 0.417−1.088, 0.417
29.REEP110.30 ± 0.3010.13 ± 0.3710.13 ± 0.37−1.017, 0.629−1.017, 0.629
30.C6orf1646.39 ± 0.446.60 ± 0.376.60 ± 0.371.033, 0.5871.033, 0.587
31.LINC005895.34 ± 0.115.55 ± 0.215.55 ± 0.211.040, 0.1151.040, 0.115
32.JAKMIP35.44 ± 0.265.73 ± 0.115.73 ± 0.111.052, 0.2221.052, 0.222
33.LINC009307.36 ± 0.416.71 ± 0.056.71 ± 0.05−1.097, 0.022−1.097, 0.022
34.DNHD17.08 ± 0.456.97 ± 0.476.97 ± 0.47−1.016, 0.774−1.016, 0.774
35.TMCC34.39 ± 0.644.74 ± 0.444.74 ± 0.441.081, 0.5481.081, 0.548
36.ADH711.06 ± 0.2011.17 ± 0.3111.17 ± 0.311.010, 0.5821.010, 0.582
37.PRKAR2B6.48 ± 0.515.64 ± 0.575.64 ± 0.57−1.149, 0.11−1.149, 0.11
38.GAD17.65 ± 0.567.33 ± 0.047.33 ± 0.04−1.044, 0.46−1.044, 0.46
39.LOC3386675.06 ± 0.254.83 ± 0.074.83 ± 0.07−1.047, 0.276−1.047, 0.276
40.CYB5A4.67 ± 0.224.38 ± 0.024.38 ± 0.02−1.066, 0.139−1.066, 0.139
41.PIEZO26.46 ± 0.346.44 ± 0.326.44 ± 0.32−1.003, 0.931−1.003, 0.931
42.SLITRK67.17 ± 0.106.61 ± 0.056.61 ± 0.05−1.085, 0.001−1.085, 0.001
43.KCNA15.76 ± 0.525.45 ± 0.565.45 ± 0.56−1.056, 0.517−1.056, 0.517
44.LOC1005075605.09 ± 0.464.90 ± 0.304.90 ± 0.30−1.038, 0.607−1.038, 0.607
Smoking group Gene Symbol Healthy Smoker (N = 22) COPD smoker (N = 15) COPD stage I smoker (N = 9) COPD stage II smoker (N = 6) Fold Regulation, adjusted p-value (COPD vs. HS)Fold Regulation, adjusted p-value (stage I vs. HS)Fold Regulation, adjusted p-value (stage II vs. HS)
≥50 packs per year1.PPP4R45.95 ± 0.895.49 ± 0.715.82 ± 0.435.23 ± 0.82−1.084, 0.047−1.023, 0.99−1.138, 0.054
2.THSD47.34 ± 0.567.11 ± 0.547.51 ± 0.326.81 ± 0.50−1.032, 0.1781.022, 0.99−1.078, 0.02
3.NRG14.06 ± 0.473.82 ± 0.413.97 ± 0.503.70 ± 0.28−1.063, 0.014−1.021, 0.99−1.096, 0.03
4.SCGB1A114.37 ± 0.1814.12 ± 0.3314.20 ± 0.2614.06 ± 0.37−1.018, 0.001−1.012, 0.135−1.022, 0.001
5.AHRR4.55 ± 0.675.18 ± 0.725.47 ± 0.864.96 ± 0.521.138, 0.0011.203, 0.0011.092, 0.169
6.CYP1A16.15 ± 1.897.78 ± 1.888.25 ± 1.497.42 ± 2.181.265, 0.0031.342, 0.0021.207, 0.059
7.CYP1B17.67 ± 2.109.43 ± 1.239.95 ± 0.639.03 ± 1.491.229, 0.0061.298, 0.0041.177, 0.15
8.PRDM115.35 ± 0.345.25 ± 0.295.32 ± 0.225.19 ± 0.34−1.019, 0.239−1.005, 0.99−1.030, 0.99
9.CBR38.15 ± 0.708.33 ± 0.638.57 ± 0.468.14 ± 0.711.022, 0.3831.052, 0.428−1.001, 0.99
≥50 packs per year10.AKR1C18.51 ± 1.158.21 ± 1.258.36 ± 1.308.09 ± 1.27−1.037, 0.475−1.018, 0.99−1.051, 0.99
11.AKR1C311.8 ± 20.7311.88 ± 0.6812.16 ± 0.4911.66 ± 0.761.007, 0.9681.029, 0.92−1.014, 0.99
12.HTR2B4.23 ± 0.314.24 ± 0.334.18 ± 0.434.29 ± 0.231.002, 0.884−1.012, 0.991.013, 0.99
13.GRM14.03 ± 0.304.16 ± 0.594.20 ± 0.394.13 ± 0.741.032, 0.7711.042, 0.8111.026, 0.99
14.CYP4Z16.25 ± 0.605.92 ± 0.405.93 ± 0.485.91 ± 0.35−1.056, 0.034−1.055, 0.706−1.059, 0.379
15.UCHL18.65 ± 1.489.28 ± 1.879.52 ± 1.619.09 ± 2.131.073, 0.111.100, 0.4631.051, 0.9
16.CABYR6.83 ± 1.247.40 ± 1.448.06 ± 1.186.91 ± 1.501.083, 0.1571.180, 0.011.011, 0.99
17.GPRC5A7.54 ± 0.438.05 ± 0.657.92 ± 0.758.16 ± 0.581.068, 0.0031.051, 0.3531.083, 0.008
18.CCDC379.37 ± 0.559.39 ± 0.569.40 ± 0.669.38 ± 0.511.002, 0.8021.003, 0.991.002, 0.99
19.GLI36.76 ± 0.596.62 ± 0.416.73 ± 0.456.54 ± 0.37−1.021, 0.25−1.004, 0.99−1.034, 0.946
20.ABCC37.86 ± 0.627.65 ± 0.827.79 ± 0.747.53 ± 0.90−1.027, 0.428−1.008, 0.99−1.044 0.82
21.SAMD53.95 ± 0.263.97 ± 0.503.98 ± 0.293.96 ± 0.641.005, 0.531.007, 0.991.003, 0.99
22.RASSF107.04 ± 0.606.61 ± 0.526.79 ± 0.336.47 ± 0.62−1.065, 0.002−1.036, 0.99−1.088, 0.018
23.USP27X7.13 ± 0.406.65 ± 0.446.72 ± 0.296.60 ± 0.54−1.072, 0.001−1.062, 0.017−1.081, 0.001
24.NR0B14.39 ± 0.804.77 ± 0.795.00 ± 0.874.59 ± 0.701.087, 0.0321.139, 0.0581.047, 0.99
25.PLAG15.24 ± 0.584.84 ± 0.574.98 ± 0.584.74 ± 0.58−1.083, 0.012−1.054, 0.99−1.106, 0.07
26.SCGB3A113.93 ± 0.5913.25 ± 0.7913.81 ± 0.5112.82 ± 0.71−1.051, 0.002−1.008, 0.99−1.087, 0.001
27.LHX65.80 ± 0.456.09 ± 0.526.08 ± 0.546.10 ± 0.531.050, 0.0881.048, 0.4981.050, 0.271
28.LINC009423.84 ± 0.684.16 ± 1.064.31 ± 1.304.05 ± 0.871.083, 0.291.121, 0.0931.053, 0.99
29.REEP19.78 ± 0.569.64 ± 0.739.55 ± 0.869.71 ± 0.65−1.015, 0.484−1.024, 0.99−1.007, 0.529
30.C6orf1646.07 ± 0.566.19 ± 0.646.21 ± 0.456.17 ± 0.781.020, 0.3761.023, 0.991.016, 0.99
31.LINC005895.39 ± 0.335.39 ± 0.335.35 ± 0.185.42 ± 0.411.000, 0.843−1.008, 0.991.006, 0.99
32.JAKMIP35.26 ± 0.635.74 ± 0.885.86 ± 0.945.64 ± 0.871.091, 0.0361.113, 0.0191.071, 0.233
33.LINC009306.87 ± 0.576.60 ± 0.656.73 ± 0.706.50 ± 0.63−1.041, 0.204−1.021, 0.99−1.057, 0.29
34.DNHD17.15 ± 0.547.25 ± 0.707.36 ± 0.787.17 ± 0.651.014, 0.7211.029, 0.991.003, 0.99
35.TMCC34.07 ± 0.474.43 ± 0.454.51 ± 0.504.36 ± 0.411.088, 0.0011.107, 0.0551.072, 0.348
36.ADH710.79 ± 0.7210.65 ± 0.6510.84 ± 0.5910.50 ± 0.69−1.013, 0.2451.005, 0.99−1.027, 0.99
37.PRKAR2B6.45 ± 0.745.85 ± 0.615.83 ± 0.685.87 ± 0.59−1.103, 0.003−1.106, 0.082−1.099, 0.062
38.GAD16.20 ± 0.967.39 ± 0.757.38 ± 0.827.40 ± 0.741.192, 0.0011.190, 0.0011.195, 0.002
39.LOC3386675.20 ± 0.235.07 ± 0.265.05 ± 0.185.09 ± 0.31−1.026, 0.056−1.031, 0.865−1.022, 0.99
40.CYB5A4.72 ± 0.194.54 ± 0.254.63 ± 0.254.47 ± 0.25−1.040, 0.012−1.018, 0.99−1.054, 0.004
41.PIEZO26.49 ± 0.865.87 ± 0.635.86 ± 0.725.88 ± 0.59−1.106, 0.007−1.107, 0.268−1.104, 0.168
42.SLITRK66.83 ± 0.886.18 ± 0.836.28 ± 1.116.11 ± 0.58−1.105, 0.011−1.088, 0.443−1.119, 0.03
43.KCNA15.80 ± 0.975.10 ± 0.765.39 ± 0.894.89 ± 0.58−1.137, 0.007−1.076, 0.99−1.186, 0.032
44.LOC1005075605.46 ± 0.614.93 ± 0.514.94 ± 0.594.92 ± 0.48−1.108, 0.001−1.104, 209−1.108, 0.087

The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range.

Comparison of relative expression of 44 candidate genes between healthy controls (smokers and non-smokers) and COPD smoker patients (stage I and stage II) in number of pack of cigarette per year, separately. The Adj. P is based on the marginally adjusted 𝑝 values by the Benjamini-Hochberg-FDR correction at α = 0.05; Median ± Interquartile range. In COPD patients who consumed ≥50 packs/year, we observed a significant change in gene expression in 22/44 (50%) candidate genes and 9/17 (52.9%) novel genes compared with HS (Table 8; ≥50 packs/year; COPD v HS). We also investigated differential gene expression in Stage I and II COPD patients who consumed ≥50 packs/year and determined that 7/44 (15.9%; Stage I) and 10/44 (22.7%; Stage II) candidate genes were significantly different compared to gene expression in HSs. In our cohort of 17 novel genes, we determined that gene expression in 2/17% (11.8%; Stage I) and 4/17% (23.5%; Stage II) was significantly different in COPD patients who consumed ≥50 packs/year (Table 8; ≥50 packs/year; Stage I and Stage II v HS). In addition, a certain number of the 44 candidate genes were significantly different across all four-analysis groups, which included USP27X, CYP1A1 and the novel genes, PRKAR2B and GAD1.

Discussion

Chronic obstructive pulmonary disease (COPD) is a progressive inflammatory disease characterized by airway obstruction and is predicted to be among the first three causes of death worldwide[1,2]. A significant degree of clinical heterogeneity has been observed in COPD patients. In functional terms, all COPD patients experience a loss in lung function, as measured using FEV1 and FVC. However, these clinical parameters are not optimal and FEV1 has been shown to correlate weakly with clinical outcome and health status[19,20]. Currently, there is an unmet clinical need to identify novel biomarkers that will facilitate improved diagnosis and prognosis in COPD. To date, COPD has been shown to develop in 30% of smokers, with smoking being one of the main risk factors associated with the development of COPD[3]. The aim of this project was to identify candidate novel biomarkers, which may provide future novel therapeutic targets, in order to facilitate the treatment of COPD using machine-based learning algorithms and penalized regression models. In this study, 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (9 GOLD stage I and 12 GOLD stage II) were included (n = 133). 20,097 probes were generated from SAE microarray data obtained from these subjects previously[14]. Consequently, 44 candidate genes were identified to be associated with the occurrence or progression of COPD, or lung function. Of these 44 genes, 27 have been previously reported in the literature to be associated with COPD or lung function (FVC, FEV1 or the FEV1/FVC ratio). In this study, we also identified 17 genes not previously detected in COPD studies that may represent novel biomarkers in the diagnosis and prognosis of COPD. In our analyses among healthy non-smokers and healthy smokers and COPD patients (GOLD stage I and II), the most significantly regulated novel genes were: PRKAR2B, GAD1, LINC00930 and SLITRK6. PRKAR2B is a protein kinase type II-beta regulatory subunit dependent on cAMP and encoded by the PRKAR2B gene in human[21]. In our overall analyses, expression of PRKAR2B was significantly downregulated in healthy smokers and in COPD patients (and in COPD stage II) compared to healthy non-smokers. Furthermore, in males, PRKAR2B expression was also significantly downregulated in healthy smokers and in COPD patients compared to healthy non-smokers. In females, these differences were less pronounced. In subjects less than 50 years, PRKAR2B expression was significantly downregulated in healthy smokers and in COPD patients compared to healthy non-smokers. In patients over 50 years, these differences were less pronounced. With regards to smoking exposure, COPD patients who smoked more than 50 packs per year had significantly lower PRKAR2B gene expression than healthy non-smokers. This decrease was not evident in COPD patients who smoked less than 50 cigarette packs per year. Thus, we hypothesise that PRKAR2B may represent a previously unknown factor both in pathogenesis of COPD and smoking exposure. PRKAR2B is an important protein kinase in cAMP signaling, and other researchers have demonstrated that cAMP is a protective factor in the lung and COPD. Furthermore, cAMP has been shown to attenuate pro-inflammatory responses whilst concomitantly increasing anti-inflammatory responses in a number of innate immune cells[22]. The reduced PRKAR2B gene expression observed in this study may reveal PRKAR2B as a novel target in the treatment of COPD. In contrast, in this study we also observed a significant upregulation of the novel gene, GAD1, in healthy smokers and COPD patients (and in stage II) compared to healthy non-smokers. In male only subjects, this pattern was replicated. The increase in expression of GAD1 in healthy smokers and COPD patients compared to healthy non-smokers was marginally less significant than in male subjects, as expected. In subjects younger than 50 years, there was a more significant increase in GAD1 expression in healthy smokers and COPD patients compared to subjects over 50 years and also the non-smokers. Smoking exposure only significantly increased GAD1 levels in healthy smokers and COPD patients who smoked more than 50 packs per year. There were no significant changes in GAD1 expression in subjects who smoked less than 50 packs per year. Other studies have shown that levels of γ-aminobutyric acid (GABA) and glutamic acid decarboxylase 1 (GAD1), the enzyme that synthesizes GABA, are significantly increased in neoplastic tissues[23]. Furthermore, other researchers have shown that the GAD1 promoter is hypermethylated in a number of cancer cells. This effect was shown to lead to the production of high levels of GAD1, as opposed to gene silencing which one would expect. The GAD1 promoter contains a number of CpG island motifs which facilitate this hypermethylation. In this study, we hypothesise that the increased levels of GAD1 detected following smoking exposure and COPD could mean that GAD1 is an important target in the treatment of this smoking-related disease. Previous studies have demonstrated that patients with COPD are at an increased risk for both the development of primary lung cancer, as well as poor outcome after lung cancer diagnosis and treatment[24]. Targeting the knockdown of GAD1 in COPD may attenuate the increased risk of lung cancer in COPD patients. LINC00930 was an additional novel gene detected in this study using machine-based learning. This is a “long intergenic non-protein coding (linc) RNA 930” (https://www.genecards.org/cgi-bin/carddisp.pl?gene=LINC00930). Interestingly, some other novel genes including LINC00942 and LINC00589 were also detected in this study. However, they were not as significantly regulated by smoking exposure or in COPD patients. In general, lincRNAs are found in between coding genes rather than antisense to them or within introns. Although the specific function of lincRNAs is not well known, they are thought to contribute to RNA stability in cells and hence gene expression. In our study, LINC009030 expression was significantly increased in healthy smokers overall, and in males and females, compared to healthy non-smokers. LINC009030 expression was significantly increased in smokers less than 50 years when only compared to non-smokers. In the over-50 age group, COPD patients had significantly less LINC009030 expression compared to healthy non-smokers. Regarding the smoking exposure, COPD patients and Stage II COPD patients who smoked less than 50 packs per year had significantly less LINC009030 expression compared to healthy non-smokers. In this study, LINC009030 was the only novel gene whose expression was significantly upregulated in smokers while significantly downregulated in COPD patients. Additional experimentation is required for elucidating the mechanism underlying this result in order to evaluate the therapeutic potential of targeting LINC009030 in smoking-related morbidities and in COPD patients. SLITRK6 was the last novel gene, which we determined to be significantly regulated in smoking and in COPD in this study. SLITRK6 is a member of the SLITRK family of neuronal transmembrane proteins that was discovered as a bladder tumor antigen using suppressive subtractive hybridization[25]. Using immunohistochemistry, SLITRK6 has been shown to be extensively expressed in multiple epithelial tumors, including lung, bladder and breast cancer, as well as in glioblastoma[25]. In our study, we demonstrated that SLITRK6 was significantly downregulated in smokers and COPD patients (including stage II) compared to healthy non-smokers. Male smokers and COPD patients also had significantly lower SLITRK6 expression compared to non-smokers. In females, SLITRK6 expression was significantly less in smokers, but not COPD patients, compared to non-smokers. Smokers and COPD patients (including stage II) aged less than 50 years had significantly lower SLITRK6 expression compared to non-smokers. These effects were not evident in subjects over 50 years. Concerning the smoking exposure, COPD patients (including stage II) who smoked more than 50 packs per year had significantly lower SLITRK6 expression compared to non-smokers. This effect was also evident in COPD patients (including stage II) who smoked less than 50 packs per year. The highlight of this study was that the expression of the oncogenesis-promoting enzyme, GAD1, is significantly increased in response to smoking exposure and in COPD. Although SLITRK6 is considered a tumourogenesis promoting factor, it was significantly decreased in this study in responses to smoking exposure and in COPD. Here, we hypothesise that GAD1 may represent a better target in order to attenuate the incidence of cancer following COPD. However, further investigations are needed to explain the exact function of SLITRK6 in smoking-associated morbidities and in COPD. More recently, machine-based learning algorithms have gained increasing attention in bioinformatics and biology research[26,27]. In contrast, regularization-based regression models (e.g. LASSO logistic regression) have already been used widely in microarray analysis[28]. Microarray analysis has a number of limitations including overfitting and multi-collinearity. In order to address these issues, regularization of parameters is required[29]. In this study, the area under the receiver operator characteristic curve (AUC), the sensitivity and specificity, and the misclassification error rate were quantitated for machine-based learning algorithm and penalty-based statistical method used. In this study based on repeated 5-CV, the elastic-net, random forest, and LASSO regularized logistic regression models were found to perform better than the naive Bayes, ridge, gradient boosting machines, adaptive boosting classification trees, extension of LASSO, artificial neural network, support vector machines, and decision tree models, respectively. Elastic-net regularization produced a sparse model with good prediction accuracy and good grouping-capability. This result is in keeping with those from previous studies, which demonstrated that elastic-net frequently performs better than ridge and LASSO for model selection consistency and prediction accuracy in microarray datasets[28,30,31]. Therefore, the results of this study are in agreement with those from previous ones. In summary, we employed machine-based learning algorithms and penalized regression models in order to identify 44 candidate genes, whose expression was significantly regulated by smoking exposure and/or COPD. We also identified 17 novel genes which were not previously determined to be associated with smoking exposure or COPD. We determined that four of these novel genes, namely PRKARB2, GAD1, LINC00930 and SLITKR6, were the most significantly regulated by smoking exposure or in COPD. We also determined that elastic-net logistic regression in our dataset had a higher accuracy rate compared to the other algorithms. Therefore, in microarray data, elastic-net logistic regression may provide a useful methodology for future studies in the discovery of novel diagnostic- and prognostic-biomarkers, and novel therapeutic targets in the treatment of COPD and other smoking-related diseases. The strengths of this study include the use of modern and accepted computational methods, the address of potential sources of bias, validation of all of the results by literature review, the use of appropriate cross-validation method (repeated 5-CV), enrichment analysis and the use of STRING networks. The main limitations of the study are the small sample sizes and that this is a case-control study only. Specifically, this study does not establish the respective associations between the 44 candidate genes and any COPD outcomes (e.g. lung function changes, exacerbations or mortality). Therefore, future studies by us will investigate and validate the candidacy of our 44 novel genes as novel therapeutic targets in COPD using larger patient cohorts. Furthermore, these studies, will investigate the association between these 44 novel genes and the change in lung function over time (FEV1 or FEV1/FVC ratio), incidence of exacerbations and mortality in COPD patients.

Methods

Study subjects and dataset

In this study, 59 healthy smokers, 53 healthy non-smokers and 21 COPD smokers (9 of stage I, GOLD I and 12 of stage II, GOLD II) were included (Total: n = 133). Subjects were predominantly male (n = 95; 71.4%) and Caucasian (n = 48; 36.1%; Table 1). Pulmonary function tests from COPD patients revealed that forced vital capacity (FVC), forced expiratory volume 1 (FEV1) and the FEV1/FVC ratio were significantly lower in these patients compared with healthy smokers and healthy non-smokers (Table 1). From these subjects, the raw data of gene expression architecture in the small airway epithelium (SAE) cells of COPD was used[14]. 20,097 probes from 133 subjects were generated. Genome-wide gene expression analysis (GWAS) was performed using HG-U133 Plus 2.0 array (Affymetrix, Santa Clara, CA)[14]. Overall microarray quality was verified by the criteria: (1) 3′/5′ ratio for GAPDH ≤3; and (2) scaling factor ≤10.0. The captured image data from the HG-U133 Plus 2.0 arrays was processed using MAS5 algorithm. The data was normalized using GeneSpring version 7.3.1 (Agilent technologies, Palo Alto, CA). See Supplemental Methods for further details. The raw data is available at the Gene Expression Omnibus (GEO) site (http://www.ncbi.nlm.nih.gov/geo/), accession number for this dataset is GSE20257.

Gene expression analysis

Raw data (.CEL format) files were qualified, normalized, statistical comparison, removing batch effects and other unwanted variation were performed by “Affy,” “Limma” and “SVA” R packages, respectively. The cutoff of false discovery rate and fold-change for differentially expressed genes was considered at level of 0.10, and more than 2, respectively. Sample progression discovery (SPD) as a novel unsupervised computational approach to identify patterns of biological progression underlying microarray gene expression data. SPD assumes that individual samples of a microarray dataset are related by an unknown biological process, and that each sample represents one unknown point along the progression of that process. SPD aims to organize the samples in a manner that reveals the underlying progression and to simultaneously identify subsets of genes that are responsible for that progression. This method does not depend on prior knowledge and only uses gene expression information[32]. In this method, divisive/consensus k-means as a clustering gene algorithm was used (200 iterations in each consensus k-means partitioning, and 0.7 threshold for module coherence). Also, least number of genes in each modules was 10. SPD analysis was done by MATLAB 7 software.

Machine learning (ML) algorithms

Various ML algorithms including AdaBoost classification trees, decision tree, Gradient Boosting machines, Naive Bayes, neural network, random forest, support vector machine were performed in order to find genes associated with occurrence and progression of COPD, and the best ML method which has best accuracy and performance to predict COPD. All ML methods were applied using “adabag”, “CART”, “gbm”, “naivebayes”, “neuralnet”, “‘randomForest”, and “e1071” R packages.

Adaptive LASSO, elastic-net, and ridge logistic regression

The ridge regression uses an L2 penalty to regularize parameters, all of the estimated coefficients are nonzero, and hence no gene selection is performed. But, LASSO regression use the L1 penalty instead, and hence provide automatic gene selection. In other hand, ridge penalty tends to shrink the coefficients of correlated variables toward each other, good for multi-collinearity, grouped selection. But, the lasso penalty is somewhat indifferent to the choice among a set of strong but correlated variables. Therefore, LASSO is good for simultaneous estimation and eliminating trivial genes but not good for grouped selection. Elastic-net is introduced as a compromise between these two techniques, and has a penalty which is a mix of L1 and L2 penalty, combine strength between ridge and lasso[33]. In adaptive LASSO regression where adaptive weights, inverse absolute value of LASSO coefficient was used for each variable as its weight in adaptive LASSO, are used for penalizing different coefficients in the L1 penalty. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Unlike to the LASSO, the adaptive LASSO is consistent for gene selection[34]. The mentioned penalized logistics regression methods were done by “glmnet” R package (https://cran.r-project.org/package=glmnet). In Table 9, the summary of each machine-learning and penalized statistical methods with some of advantages and limitations were mentioned.
Table 9

Gene selection methods: Definitions, acronyms and main advantages and limitations.

Gene Selection Method NameGene Selection Method AcronymMain AdvantagesMain Limitations
Least Absolute Shrinkage and Selection OperatorLASSO(1) Smaller mean squared error (MSE) than conventional methods;(2) It is good for simultaneous estimation and eliminating trivial genes;(3) Coefficients being easy to implement is another of the merits.(1) It is not good for grouped selection;(2) For highly correlated variables, conventional methods have predictive performance empirically observed to be better than LASSO;(3) This method has shown to not always provide consistent variable selection;(4) Its estimators are biased always;(5) Its efficiency depends greatly on the number of dimension of genes.
Adaptive Least Absolute Shrinkage and Selection OperatorAdapt. LASSO(1) This method has all of advantages of the LASSO.(2) This method uses adaptive weights to penalize coefficients differently;(3) Adaptive LASSO provides a more consistent solution than LASSO.(1) It is not good for grouped selection;(2) For highly correlated variables, conventional methods have predictive performance empirically observed to be better than adapt. LASSO;(3) Its estimators are biased always;(4) Its efficiency depends greatly on the number of dimension of genes.
Elastic net regularizationElastic net(1) This method selects groups of correlatedvariables together, shares nice properties of both theLASSO and ridge;(2) It can be considered for situations with p > n, it allows the number of selected features to exceed the sample size;(3) This method has predictive performance better than LASSO and ridge.(1) It can only apply to two-class feature selection problems, it cannot resolve multi-class feature selection problems directly;(2) Its estimators aren’t robust against outliers
Ridge Logistic RegressionRidge(1) It handles the multi-collinearity problem(2) Ridge regression can reduce the variance (with an increasing bias);(3) Can improve predictive performance than ordinary least square approach.(1) It is not able to shrink coefficients to exactly zero;(2) It cannot perform variable selection; it includes all of predictors (e.g. genes) in the final model;(3) It cannot handles the overfitting problem.
Support Vector MachinesSVM(1) It has a regularization parameter for avoiding overfitting;(2) It uses the kernel trick;(3) It is defined by a convex optimization problem (no local optimization);(4) It is a powerful classifier that works well on a wide range of classification problems, in other words, it is very good when we have no idea on the data;(5) It can apply for high dimensional and not linearly separable situations.(1) Choosing a good kernel function is not easy;(2) It has several key parameters that need to be set correctly to achieve the best classification results for any given problem;(3) Long training time for large datasets and large amount of training data; it was computationally intensive, especially the grid search for tuning its parameters;(4) Difficult to understand and interpret the final model, variable weights and individual impact.
Gradient Boosting Machines (stochastic)GBM(1) It can apply for high dimensional situations;(2) It works well in the situation with a lot of main and interaction parameters;(3) It can automatically select variables;(4) It is robust to outliers and missing data;(5) It can handle the numerous correlated and irrelevant variables problems;(6) It is an ensemble learning.(1) Long training time for large datasets;(2) Difficult to understand and interpret the model;(3) Prone to overfitting.
Naive BayesNB(1) It is easy to implement as a single learning;(2) If the its conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models (e.g. logistic regression);(3) It needs less training data than other algorithms;(1) Class conditional independence assumption for all of variables (e.g. genes);(2) It is defined by a local optimization problem.
Random ForestRF(1) It can apply for high dimensional situations;(2) It is robust to outliers and missing data;(3) It has less variance than a single decision tree;(4) Training each tree perform independently.(1) It is complex;(2) It requires more computational resources and are also less intuitive;(3) Its prediction process using random forests is time-consuming than decision trees;(4) It assumes that model errors are uncorrelated and uniform.
Artificial Neural NetworkANN(1) It is easy to implement;(2) It can approximate any function between the independent and dependent variables;(3) It handles all possible interactions between the dependent variables;(4) It does not require any assumptions, in other words, it is very good when we have no idea on the data.(1) It solved for local optimization;(2) Parameters are hard to interpret;(3) Long training time for large neural networks.
Decision TreesRT(1) Easy to interpret and explain as a single learning;(2) It is very fast;(3) Its estimators are robust against outliers;(4) Can be combined with other decision techniques;(5) It handles missing values and filling them in with the most probable value.(1) Prone to overfitting;(2) Instability;(3) This method has predictive performance worse than random forest;(4) It solved for local optimization.
AdaBoost Classification Trees (Adaptive Boosting)ABCT(1) It can be less susceptible to the overfitting problem than most learning algorithms;(2) It combines a set of weak learners in order to form a strong classifier and selection of weak classifier is easy;(3) It is a machine learning meta-algorithm.(1) It can be sensitive to noisy data and outliers;(2) Requirement of a large amount of training data and long training time.
Gene selection methods: Definitions, acronyms and main advantages and limitations.

Gene set enrichment analysis

Gene set enrichment analysis is a method to identify classes of genes that are over-represented in a large set of genes and may have an association with disease phenotypes (e.g. occurrence of COPD). The Comprehensive gene set enrichment analysis web server 2016 update called “Enrichr” was applied[35].

Cross-validation, stability and accuracy

K-fold cross-validation scheme (k-cv) is a very commonly employed technique used to evaluate classifier performance. K-CV estimation of the error is the average value of the errors committed in each fold. Thus, the K-CV error estimator depends on two factors: the training set and the partition into folds. Sensitivity analysis was performed to changes in the training set and sensitivity to changes in the folds[36]. The bootstrap (or subsampling) is another way to bring down the high variability of cross-validation, to aims stability selection[37]. Repeated cross-validation is a good strategy for (a) optimizing the complexity of regression models and (b) for a realistic estimation of prediction errors when the model is applied to new cases[38,39]. In the present study, the algorithms split the data set by using repeated random 100 times sub-sampling in 5-fold cross-validation, permuting the sample labels every time. Cross-validated performance was summarized by observed sensitivity and specificity, and misclassification error rate. Furthermore, the area under the Receiver Operator Characteristic (ROC) curve (AUC), was used to calculate of classifiers performance[40,41]. Also, in order to assessing literature validation for any results, literature mining was used in the PubMed databank. Interactive cluster heatmaps was applied by “heatmaply” R package (https://cran.r-project.org/web/packages/heatmaply)[42]. A heatmap is a popular graphical method for visualizing high-dimensional data. A static heatmap, as an interactive heatmaps, was used to represent biological data, in which colors are used to represent the values (importance index) in a matrix where columns and rows are the machine learning and statistical methods (instances) and genes selected (attributes), respectively. Rows and columns are sorted using a hierarchical clustering technique[43]. The study’s follow chart is shown in Fig. 1. Dataset 1
  34 in total

1.  Sensitivity analysis of kappa-fold cross validation in prediction error estimation.

Authors:  Juan Diego Rodríguez; Aritz Pérez; Jose Antonio Lozano
Journal:  IEEE Trans Pattern Anal Mach Intell       Date:  2010-03       Impact factor: 6.226

Review 2.  Lung cancer in chronic obstructive pulmonary disease: enhancing surgical options and outcomes.

Authors:  Stacy Raviv; Keenan A Hawkins; Malcolm M DeCamp; Ravi Kalhan
Journal:  Am J Respir Crit Care Med       Date:  2010-12-22       Impact factor: 21.405

3.  Cigarette smoking reprograms apical junctional complex molecular architecture in the human airway epithelium in vivo.

Authors:  Renat Shaykhiev; Fouad Otaki; Prince Bonsu; David T Dang; Matthew Teater; Yael Strulovici-Barel; Jacqueline Salit; Ben-Gary Harvey; Ronald G Crystal
Journal:  Cell Mol Life Sci       Date:  2010-09-06       Impact factor: 9.261

4.  Health status and the spiral of decline.

Authors:  Paul W Jones
Journal:  COPD       Date:  2009-02       Impact factor: 2.409

5.  Enrichr: a comprehensive gene set enrichment analysis web server 2016 update.

Authors:  Maxim V Kuleshov; Matthew R Jones; Andrew D Rouillard; Nicolas F Fernandez; Qiaonan Duan; Zichen Wang; Simon Koplev; Sherry L Jenkins; Kathleen M Jagodnik; Alexander Lachmann; Michael G McDermott; Caroline D Monteiro; Gregory W Gundersen; Avi Ma'ayan
Journal:  Nucleic Acids Res       Date:  2016-05-03       Impact factor: 16.971

6.  Genome-wide association analysis identifies six new loci associated with forced vital capacity.

Authors:  Daan W Loth; María Soler Artigas; Sina A Gharib; Louise V Wain; Nora Franceschini; Beate Koch; Tess D Pottinger; Albert Vernon Smith; Qing Duan; Chris Oldmeadow; Mi Kyeong Lee; David P Strachan; Alan L James; Jennifer E Huffman; Veronique Vitart; Adaikalavan Ramasamy; Nicholas J Wareham; Jaakko Kaprio; Xin-Qun Wang; Holly Trochet; Mika Kähönen; Claudia Flexeder; Eva Albrecht; Lorna M Lopez; Kim de Jong; Bharat Thyagarajan; Alexessander Couto Alves; Stefan Enroth; Ernst Omenaas; Peter K Joshi; Tove Fall; Ana Viñuela; Lenore J Launer; Laura R Loehr; Myriam Fornage; Guo Li; Jemma B Wilk; Wenbo Tang; Ani Manichaikul; Lies Lahousse; Tamara B Harris; Kari E North; Alicja R Rudnicka; Jennie Hui; Xiangjun Gu; Thomas Lumley; Alan F Wright; Nicholas D Hastie; Susan Campbell; Rajesh Kumar; Isabelle Pin; Robert A Scott; Kirsi H Pietiläinen; Ida Surakka; Yongmei Liu; Elizabeth G Holliday; Holger Schulz; Joachim Heinrich; Gail Davies; Judith M Vonk; Mary Wojczynski; Anneli Pouta; Asa Johansson; Sarah H Wild; Erik Ingelsson; Fernando Rivadeneira; Henry Völzke; Pirro G Hysi; Gudny Eiriksdottir; Alanna C Morrison; Jerome I Rotter; Wei Gao; Dirkje S Postma; Wendy B White; Stephen S Rich; Albert Hofman; Thor Aspelund; David Couper; Lewis J Smith; Bruce M Psaty; Kurt Lohman; Esteban G Burchard; André G Uitterlinden; Melissa Garcia; Bonnie R Joubert; Wendy L McArdle; A Bill Musk; Nadia Hansel; Susan R Heckbert; Lina Zgaga; Joyce B J van Meurs; Pau Navarro; Igor Rudan; Yeon-Mok Oh; Susan Redline; Deborah L Jarvis; Jing Hua Zhao; Taina Rantanen; George T O'Connor; Samuli Ripatti; Rodney J Scott; Stefan Karrasch; Harald Grallert; Nathan C Gaddis; John M Starr; Cisca Wijmenga; Ryan L Minster; David J Lederer; Juha Pekkanen; Ulf Gyllensten; Harry Campbell; Andrew P Morris; Sven Gläser; Christopher J Hammond; Kristin M Burkart; John Beilby; Stephen B Kritchevsky; Vilmundur Gudnason; Dana B Hancock; O Dale Williams; Ozren Polasek; Tatijana Zemunik; Ivana Kolcic; Marcy F Petrini; Matthias Wjst; Woo Jin Kim; David J Porteous; Generation Scotland; Blair H Smith; Anne Viljanen; Markku Heliövaara; John R Attia; Ian Sayers; Regina Hampel; Christian Gieger; Ian J Deary; H Marike Boezen; Anne Newman; Marjo-Riitta Jarvelin; James F Wilson; Lars Lind; Bruno H Stricker; Alexander Teumer; Timothy D Spector; Erik Melén; Marjolein J Peters; Leslie A Lange; R Graham Barr; Ken R Bracke; Fien M Verhamme; Joohon Sung; Pieter S Hiemstra; Patricia A Cassano; Akshay Sood; Caroline Hayward; Josée Dupuis; Ian P Hall; Guy G Brusselle; Martin D Tobin; Stephanie J London
Journal:  Nat Genet       Date:  2014-06-15       Impact factor: 38.330

7.  Discovering biological progression underlying microarray samples.

Authors:  Peng Qiu; Andrew J Gentles; Sylvia K Plevritis
Journal:  PLoS Comput Biol       Date:  2011-04-14       Impact factor: 4.475

Review 8.  Risk factors and early origins of chronic obstructive pulmonary disease.

Authors:  Dirkje S Postma; Andrew Bush; Maarten van den Berge
Journal:  Lancet       Date:  2014-08-11       Impact factor: 79.321

9.  Tobacco smoking leads to extensive genome-wide changes in DNA methylation.

Authors:  Sonja Zeilinger; Brigitte Kühnel; Norman Klopp; Hansjörg Baurecht; Anja Kleinschmidt; Christian Gieger; Stephan Weidinger; Eva Lattka; Jerzy Adamski; Annette Peters; Konstantin Strauch; Melanie Waldenberger; Thomas Illig
Journal:  PLoS One       Date:  2013-05-17       Impact factor: 3.240

10.  Multiple facets of cAMP signalling and physiological impact: cAMP compartmentalization in the lung.

Authors:  Anouk Oldenburger; Harm Maarsingh; Martina Schmidt
Journal:  Pharmaceuticals (Basel)       Date:  2012-11-30
View more
  11 in total

1.  Novel computational analysis of large transcriptome datasets identifies sets of genes distinguishing chronic obstructive pulmonary disease from healthy lung samples.

Authors:  Fabienne K Roessler; Birke J Benedikter; Bernd Schmeck; Nadav Bar
Journal:  Sci Rep       Date:  2021-05-13       Impact factor: 4.379

2.  The identification of co-expressed gene modules in Streptococcus pneumonia from colonization to infection to predict novel potential virulence genes.

Authors:  Sadegh Azimzadeh Jamalkandi; Morteza Kouhsar; Jafar Salimian; Ali Ahmadi
Journal:  BMC Microbiol       Date:  2020-12-17       Impact factor: 3.605

3.  Reticular Basement Membrane Thickness Is Associated with Growth- and Fibrosis-Promoting Airway Transcriptome Profile-Study in Asthma Patients.

Authors:  Stanislawa Bazan-Socha; Sylwia Buregwa-Czuma; Bogdan Jakiela; Lech Zareba; Izabela Zawlik; Aleksander Myszka; Jerzy Soja; Krzysztof Okon; Jacek Zarychta; Paweł Kozlik; Sylwia Dziedzina; Agnieszka Padjas; Krzysztof Wojcik; Michal Kepski; Jan G Bazan
Journal:  Int J Mol Sci       Date:  2021-01-20       Impact factor: 5.923

4.  Strain-specific behavior of Mycobacterium tuberculosis in A549 lung cancer cell line.

Authors:  Shima Hadifar; Shayan Mostafaei; Ava Behrouzi; Abolfazl Fateh; Parisa Riahi; Seyed Davar Siadat; Farzam Vaziri
Journal:  BMC Bioinformatics       Date:  2021-03-25       Impact factor: 3.169

5.  Construction of genetic classification model for coronary atherosclerosis heart disease using three machine learning methods.

Authors:  Wenjuan Peng; Yuan Sun; Ling Zhang
Journal:  BMC Cardiovasc Disord       Date:  2022-02-12       Impact factor: 2.298

Review 6.  Intellectual disability: dendritic anomalies and emerging genetic perspectives.

Authors:  Tam T Quach; Harrison J Stratton; Rajesh Khanna; Pappachan E Kolattukudy; Jérome Honnorat; Kathrin Meyer; Anne-Marie Duchemin
Journal:  Acta Neuropathol       Date:  2020-11-23       Impact factor: 17.088

Review 7.  Chronic Obstructive Pulmonary Disease: Proprioception Exercises as an Addition to the Rehabilitation Process.

Authors:  Bruno Bordoni; Marta Simonelli
Journal:  Cureus       Date:  2020-05-13

8.  m6A RNA methylation regulators could contribute to the occurrence of chronic obstructive pulmonary disease.

Authors:  Xinwei Huang; Dongjin Lv; Xiao Yang; Min Li; Hong Zhang
Journal:  J Cell Mol Med       Date:  2020-09-22       Impact factor: 5.310

9.  Network-Based Transcriptomic Analysis Identifies the Genetic Effect of COVID-19 to Chronic Kidney Disease Patients: A Bioinformatics Approach.

Authors:  Md Rabiul Auwul; Chongqi Zhang; Md Rezanur Rahman; Md Shahjaman; Salem A Alyami; Mohammad Ali Moni
Journal:  Saudi J Biol Sci       Date:  2021-06-10       Impact factor: 4.219

Review 10.  COVID-19 Susceptibility in chronic obstructive pulmonary disease.

Authors:  Jordi Olloquequi
Journal:  Eur J Clin Invest       Date:  2020-09-02       Impact factor: 5.722

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.