
A diVIsive Shuffling Approach (VIStA) for gene expression analysis to identify subtypes in Chronic Obstructive Pulmonary Disease.

Jörg Menche, Amitabh Sharma, Michael H Cho, Ruth J Mayer, Stephen I Rennard, Bartolome Celli, Bruce E Miller, Nick Locantore, Ruth Tal-Singer, Soumitra Ghosh, Chris Larminie, Glyn Bradley, John H Riley, Alvar Agusti, Edwin K Silverman, Albert-László Barabási.   

Abstract

BACKGROUND: An important step toward understanding the biological mechanisms underlying a complex disease is a refined understanding of its clinical heterogeneity. Relating clinical and molecular differences may allow us to define more specific subtypes of patients that respond differently to therapeutic interventions.
RESULTS: We developed a novel unbiased method called diVIsive Shuffling Approach (VIStA) that identifies subgroups of patients by maximizing the difference in their gene expression patterns. We tested our algorithm on 140 subjects with Chronic Obstructive Pulmonary Disease (COPD) and found four distinct, biologically and clinically meaningful combinations of clinical characteristics that are associated with large gene expression differences. The dominant characteristic in these combinations was the severity of airflow limitation. Other frequently identified measures included emphysema, fibrinogen levels, phlegm, BMI and age. A pathway analysis of the differentially expressed genes in the identified subtypes suggests that VIStA is capable of capturing specific molecular signatures within each group.
CONCLUSIONS: The introduced methodology allowed us to identify combinations of clinical characteristics that correspond to clear gene expression differences. The resulting subtypes for COPD contribute to a better understanding of its heterogeneity.

Year:  2014        PMID: 25032995      PMCID: PMC4101699          DOI: 10.1186/1752-0509-8-S2-S8

Source DB:  PubMed          Journal:  BMC Syst Biol        ISSN: 1752-0509


Background

Chronic obstructive pulmonary disease (COPD) is one of the most prevalent chronic diseases (4th cause of death globally), with increasing incidence worldwide. Understanding of the disease pathobiology is far from complete and only a few novel therapeutic mechanisms of action have been identified. Tobacco smoking is the main risk factor for COPD, but only a fraction of all smokers develops the disease [1]. This variable response to smoking, plus the observation that COPD aggregates in families, strongly suggests a genetic component to the disease [2-6]. Yet, COPD is a very heterogeneous and complex disease, with varied pulmonary and extra-pulmonary clinical manifestations [7]. Understanding and characterizing this biological and clinical heterogeneity could help identify subgroups of patients (subtypes) that may benefit from different therapeutic strategies [8]. To investigate the genomic and pathobiological basis of COPD subtypes with distinct clinical manifestations, we applied several novel and complementary computational strategies to differential gene expression analysis. We used expression data from induced sputum samples of former smokers with COPD and varying degrees of airflow limitation. The patients are a subset of the large ECLIPSE cohort, which is a multi-center, 3-year observational international study that collected clinical, genetic, proteomic and biomarker measures in a population of COPD patients [9]. Specifically, in the current study we sought to: (i) compare the gene expression pattern between patient groups with different clinical characteristics; (ii) conversely, assess the clinical characteristics of groups of patients with distinct gene expression patterns identified by a novel diVIsive Shuffling Approach (VIStA) developed specifically for this purpose (see below).
Unexpectedly, we found that the reverse approach (ii) showed greater potential to identify specific pathways that may offer novel therapeutic targets [10] than the traditional approach (i).

Methods

Study design, participants and ethics

The ECLIPSE cohort is a large, prospective, observational and controlled study (Clinicaltrials.gov identifier NCT00292552; GSK study code SCO104960), whose design has been published previously [9]. Here, we investigated differential gene expression in induced sputum samples of a subset of the participants that included 140 former smokers with COPD (70 with moderate or GOLD stage 2 and 70 with severe or GOLD stage 3-4 airflow limitation, matched for age and gender) with characterized clinical and laboratory measures (Table 1). Sputum induction and processing with dithiothreitol (DTT) was performed using standard methods as described previously [5]; details on the generation and processing of the expression data can be found in [3]. The ECLIPSE study complies with the Declaration of Helsinki and Good Clinical Practice Guidelines and was approved by the Ethics Committees and Institutional Review Boards of all participating centers. All participants provided written informed consent.
Table 1

Summary of the characteristics of 140 subjects with sputum gene expression data from the ECLIPSE Cohort.

| Measure | Value |
| --- | --- |
| Demographics and clinical data | |
| Age, yrs | 65 ± 5.5 |
| Males, % | 66 |
| Body mass index, kg/m^2 | 26.8 ± 5.2 |
| Smoking exposure, pack-yrs | 48.3 ± 29.1 |
| Annual exacerbation rate, year^-1 | 0.98 ± 1.6 |
| Lung function | |
| FEV1, L | 1.26 ± 0.45 |
| FEV1, % reversibility | 9.5 ± 10.4 |
| FEV1/FVC, % | 43.2 ± 11.5 |
| Imaging | |
| Emphysema, -950 HU, % | 19.2 ± 12.2 |
| Emphysema, extent code | 2.8 ± 1.8 |
| Systemic inflammation | |
| hsCRP (mg/L) | 8.24 ± 15.0 |
| IL6 (pg/mL) | 7.8 ± 36 |
| IL8 (pg/mL) | 9.3 ± 5.2 |
| CCL18 (ng/mL) | 121.7 ± 46 |
| Fibrinogen (mg/dL) | 481.9 ± 107.6 |
| TNFA (ng/mL) | 103.2 ± 624 |
| SPD (ng/mL) | 120.6 ± 78 |
| Induced sputum | |
| Total cell count, × 10^6 | 7.5 ± 1.78 |
| Neutrophils, % | 64.8 ± 8.5 |
| Eosinophils, % | 3.1 ± 2.04 |
| Lymphocytes, % | 25.4 ± 7.9 |

Note that all subjects are COPD patients and former smokers. The values represent mean ± standard deviation, frequency or proportion, as appropriate.

Selection of clinical measures

Table 2 shows the clinical measures selected by COPD experts (SR, BC, AA, EKS) based on their association with important clinical outcomes (e.g. exacerbations, hospitalizations and death). The degree of airflow limitation (GOLDCD) was determined using spirometry, and distances walked over 6 minutes (DWALK) were measured using standard methodology. Standardized questionnaires were used to obtain smoking status, cough and sputum (PHLEGM) production. COPD exacerbations in the year prior to the study were recorded, as well as body mass index (BMI). All subjects underwent a low-dose computed tomography (CT) scan of the chest to determine both airway disease and emphysema (FV950 as a quantitative assessment, and EMPHETCD as a radiologist's score) [11]. Several inflammatory biomarkers were measured in peripheral blood [12]. For details on the definitions and acquisition procedures of the above measures see [9].
Table 2

Summary of the clinical characteristics of COPD patients identified as most relevant by clinical experts.

| Category | Continuous Variable for Quantitative Analysis | Discrete Variable | Bins | Characteristics | Differentially expressed genes at FDR < 0.05 |
| --- | --- | --- | --- | --- | --- |
| Chronic Bronchitis | Not applicable | Cough with Phlegm for at least 3 mos/yr for at least 2 years | low extreme (Q1 = 64) | neither chronic cough nor chronic phlegm | 0 |
| | | | high extreme (Q4 = 46) | both chronic cough and chronic phlegm | |
| History of Exacerbations | Number of exacerbations per year | 2 or more per year and less than 2 per year | low extreme (Q1 = 26) | 0 - Never | 0 |
| | | | high extreme (Q4 = 17) | 3 - Always | |
| Body Mass Index (kg/m^2) | BMI | BMI < 21, 21-30, > 30 | low extreme (Q1 = 18) | BMI < 21 | 0 |
| | | | high extreme (Q4 = 35) | BMI > 30 | |
| Airflow Limitation severity | FEV1 (% predicted) | GOLD Stage | low extreme (Q1 = 69) | < 2-GOLD stage | 6,049 |
| | | | high extreme (Q4 = 13) | >4 GOLD stage | |
| 6 Minute Walk Distance | Quantitative 6MWD | < 350 meters and > 350 meters | low extreme (Q1 = 38) | >350 meters | 0 |
| | | | high extreme (Q4 = 101) | >350 meters | |
| Radiologist Emphysema assessment | Emphysema severity category: Not affected (N): 0; Trivial (T): 1; Mild (M) 5-25%: 2; Moderate (O) 25-50%: 3; Severe (S) 50-75%: 4; Very Severe (V) > 75%: 5 | Yes/No/Uncertain | low extreme (Q1 = 40) | 0-1.5 - No emphysema | 0 |
| | | | high extreme (Q4 = 45) | 4-5 - severe | |
| Densitometric Emphysema | Emphysema at -950 HU | Emphysema >10% (Yes/No) | low extreme (Q1 = 37) | Emphysema >10% = No | 0 |
| | | | high extreme (Q4 = 95) | Emphysema >10% = Yes | |
| CT Airway Disease | Pi10 (Square root of wall area of 10 mm internal perimeter airways) | GOLD Stages 2-4 with Emphysema < 5% (Yes) or > 5% (No) | low extreme (Q1 = 63) | Trivial (< 5%) | 0 |
| | | | high extreme (Q4 = 33) | Severe (50-75%), very severe (>75%) | |

Columns 4-6 show the results of the differential gene expression analysis comparing the subjects of the defined bins or extremes for each characteristic. Q1/Q4 refer to the number of patients in the respective group (1st and 4th "quartile"). The only single characteristic yielding significantly differentially expressed genes is the degree of airflow limitation as given by GOLD stage (GOLDCD).

Note that there were no controls with normal lung function among the subjects. Hence, we cannot compare COPD to normal but only the differences between COPD patients [1].

Relationship between clinical characteristics and gene expression

To investigate the relationships between differences in gene expression and clinical trait occurrence, we used two complementary analyses: (i) For each of the clinical characteristics introduced above, we divided the patients into two groups based on clinically relevant cut-points (Table 2, column 5) and computed gene expression differences between the two groups. Gene expression analysis was performed using Significance Analysis of Microarrays (SAM) [13] with a false discovery rate (FDR) of 5% as cutoff. (ii) We used VIStA (see below) to identify groups of patients with maximized differential gene expression and then compared their clinical characteristics.

diVIsive Shuffling Approach (VIStA)

We developed a novel unbiased method called diVIsive Shuffling Approach (VIStA) to identify groups of patients with maximal difference in gene expression. The algorithm consists of the following steps: (i) The n subjects are randomly partitioned into three groups of comparable size (Figure 1A). A SAM analysis is performed and the number of genes differentially expressed between groups 1 and 2 is counted. Group 3 serves as a "reservoir" of individuals for the subsequent steps. (ii) An individual from group 1 or 2 is randomly swapped with an individual from the reservoir group 3. We repeat the SAM analysis, counting again the new number of differentially expressed genes (Figure 1A). If this count increases, the swap is accepted, otherwise rejected. (iii) Step (ii) is iterated until the number of differentially expressed genes reaches a plateau (Figure 1B), typically after approximately 1000 attempted swaps. The corresponding groups 1 & 2 represent a combination of patients with high differential gene expression. Starting with different random initial configurations, we repeat the whole procedure (i) through (iii) 500 times, resulting in 500 end configurations, each characterized by a large number of differentially expressed genes. In order to explore the extent to which these 500 subdivisions are clinically relevant and distinct, we analyze them individually for statistically significant differences in clinical characteristics between the members of group 1 and 2.

Figure 1

Schematic representation of the diVIsive Shuffling Approach (VIStA). A Initially the subjects are divided randomly into three groups; gene expression differences are calculated between group 1 & 2, the third group serves as a reservoir for the subsequent shuffling steps. At each shuffling step, a subject from group 1 or 2 is randomly exchanged with a subject from the reservoir. If the number of differentially expressed genes increases thereby, the swap is accepted, otherwise rejected. B 20 exemplary time series of the number of differentially expressed genes between group 1 & 2 as a function of the number of attempted shuffles. The different curves correspond to different random initial divisions. After approximately 1000 shuffles the groups converge and present a large, stationary number of differentially expressed genes. C For each of the obtained divisions (500 in total), clinical characteristics in group 1 & 2 are compared.
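The shuffling procedure described above can be illustrated with a minimal Python sketch. It is not the authors' C implementation: the SAM analysis is replaced by a simple Welch t-statistic threshold (an assumption made here for brevity), the data are synthetic, and the names `count_de_genes`, `vista_run` and the parameter `t_cut` are hypothetical.

```python
import random
from statistics import mean, stdev

def count_de_genes(expr, group1, group2, t_cut=3.0):
    """Count genes whose absolute Welch t-statistic between the two groups
    exceeds t_cut. This is a simple stand-in for the SAM analysis."""
    count = 0
    for gene in expr:  # expr: one list of expression values per gene, indexed by subject
        a = [gene[i] for i in group1]
        b = [gene[i] for i in group2]
        se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
        if se > 0 and abs(mean(a) - mean(b)) / se > t_cut:
            count += 1
    return count

def vista_run(expr, n_subjects, reservoir_size, n_swaps, rng):
    """One run: random initial division, then greedy swaps with the reservoir."""
    subjects = list(range(n_subjects))
    rng.shuffle(subjects)
    reservoir = subjects[:reservoir_size]          # group 3
    rest = subjects[reservoir_size:]
    half = len(rest) // 2
    groups = [rest[:half], rest[half:]]            # groups 1 and 2
    best = count_de_genes(expr, groups[0], groups[1])
    history = [best]
    for _ in range(n_swaps):
        g = groups[rng.randrange(2)]               # pick group 1 or 2 at random
        i = rng.randrange(len(g))
        j = rng.randrange(len(reservoir))
        g[i], reservoir[j] = reservoir[j], g[i]    # attempt the swap
        new = count_de_genes(expr, groups[0], groups[1])
        if new > best:
            best = new                             # accept: count increased
        else:
            g[i], reservoir[j] = reservoir[j], g[i]  # reject: undo the swap
        history.append(best)
    return groups[0], groups[1], reservoir, history

# Synthetic toy data (not study data): 10 of 40 genes follow a hidden two-class structure.
rng = random.Random(0)
n_subjects, n_genes = 30, 40
hidden = [s % 2 for s in range(n_subjects)]
expr = [[rng.gauss(2.0 * hidden[s] if g < 10 else 0.0, 1.0) for s in range(n_subjects)]
        for g in range(n_genes)]
g1, g2, res, history = vista_run(expr, n_subjects, reservoir_size=10, n_swaps=200, rng=rng)
print(history[0], history[-1])
```

Because a swap is kept only when the count strictly increases, the recorded count never decreases within a run; repeating the run from different random starts would give the different locally optimal divisions that the text describes.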
For each subdivision, we identify the set of clinical characteristics (Table 2) that differ significantly between patients in group 1 and group 2 using a Mann-Whitney U test (significance threshold of p-value ≤ 0.05) for all continuous characteristics (e.g. BMI) and Fisher's exact test for binary characteristics (e.g. gender) (Figure 1C). We find that, with the exception of two subdivisions, all remaining 498 subdivisions show a statistically significant difference in at least one clinical characteristic. This suggests that the shuffling algorithm indeed does identify biologically or clinically distinct divisions of patients in most cases. The frequency with which individual clinical characteristics appear as significantly different between the two groups can therefore be used to identify the combinations of clinical characteristics that co-determine gene expression differences. Note that the VIStA approach is fundamentally different from clustering techniques like hierarchical or k-means clustering. The latter attempt to identify cohesive groups based on similarity, while VIStA, on the contrary, is a divisive algorithm based on maximizing the differences between groups. Another important difference from standard clustering approaches is that by design VIStA is able to identify a large number of locally optimal divisions.
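As a rough illustration of the per-subdivision comparison, the following Python sketch implements both tests with the standard library only. The Mann-Whitney p-value uses the normal approximation without tie correction, which is a simplification assumed here; the function names and all input values are hypothetical and are not study data.

```python
from math import comb, erfc, sqrt

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables with the same margins that are no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def mann_whitney_two_sided(xs, ys):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted((v, g) for g, vals in enumerate((xs, ys)) for v in vals)
    rank_sum_x, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(xs), len(ys)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return erfc(abs(z) / sqrt(2))

# Hypothetical values for one subdivision (not study data).
bmi_group1 = [31.2, 29.8, 33.1, 30.4, 28.9, 32.0]
bmi_group2 = [21.5, 22.3, 20.1, 23.8, 19.9, 22.7]
significant = []
if mann_whitney_two_sided(bmi_group1, bmi_group2) <= 0.05:
    significant.append("BMI")
if fisher_exact_two_sided(4, 2, 3, 3) <= 0.05:  # e.g. males/females in each group
    significant.append("SEX")
print(significant)
```

Collecting such lists over all subdivisions and counting how often each characteristic appears gives the frequencies that the text uses to rank clinical characteristics.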

Technical considerations

We use a relatively low confidence cut-off of FDR ≤ 0.1 for the SAM analysis in steps (i) and (ii) in order to facilitate the emergence of an initial "seed" grouping. Parameter estimates were robust to variation in the exact choice of this cut-off. Within the SAM framework, the FDR is based on a comparison with random permutations, see [13] for details. Note that instead of SAM one could also use other approaches to determine the number of differentially expressed genes at each iteration step, for example using the p-values of simple t-tests or a minimal fold-change. As VIStA consists of repeated differential expression analyses, the same limitations as for conventional approaches apply for the minimal number of subjects and general data quality. We implemented a reservoir of 40 subjects (group 3) in order to resemble a gene expression analysis based on extremes, e.g. the 25% of subjects with the lowest BMI vs the 25% of subjects with the highest BMI. In principle, the third group is not strictly necessary, as shuffling can be performed between two groups. Increasing the size of the reservoir group could affect power through selection of more extreme subjects or by reducing the sample size for the differential expression analysis, so whether or not a reservoir is useful will depend on the concrete application. As detailed below, we find that 500 independent runs of VIStA provided sufficient statistical power for a robust distinction between four different subgroups in this study. Generally, a higher number of independent runs could lead to the discovery of more subtle subgroups. It is important to note, however, that the predictive power of the approach is ultimately limited by the quality and size of the expression data, as well as the clinical characteristics. The algorithm was implemented in the programming language C. A single run with 2,000 iterations takes around three hours on a standard PC. However, the vast majority of the computing time is used to perform the SAM analysis, so using a simpler technique for the differential gene expression analysis would drastically reduce the execution time if necessary.

Results & discussion

Differential gene expression of single clinical characteristics

We first attempted to identify statistically significant gene expression differences between patient groups that differ in a single clinical characteristic. To be specific, we aimed to identify genes that were differentially expressed at FDR <0.05 using bins of clinical characteristics as presented in Table 2, such as COPD severity, the history of exacerbation or BMI. As shown in Table 2, apart from the severity of airflow limitation as assessed by the GOLD stage, none of the other clinical measures identified significant gene expression changes. This failure suggests that these clinical characteristics are not sufficiently discriminative to capture gene expression variation in COPD. We hypothesized that there are indeed potential molecular drivers to disease heterogeneity, but a single clinical characteristic is unable to capture them. Therefore, we developed an inverse (divisive) clustering methodology to group the 140 COPD patients included in the study based on their gene expression patterns, and then explored the clinical characteristics of the obtained groups (Figure 1). Figure 2 presents the results of the VIStA analysis, offering a comparison of the clinical characteristics (GOLDCD, FV950, EMPHETCD, BMI, PHLEGM, AGE, DWALK, COUGH and Sex) and inflammatory biomarker levels (interleukin (IL)-6, IL-8, high-sensitivity C-reactive protein (HSCRP), chemokine (C-C motif) ligand 18 (CCL18), surfactant protein D (SPD), fibrinogen (FIBRINOG), and tumor necrosis factor alpha (TNFA)) associated with patient subtypes that display the most extreme sputum gene expression pattern differences. We found that the severity of airflow limitation (GOLDCD) was the single most important determinant of differential gene expression, being statistically significant in 95% of all VIStA outputs (n = 477, Figure 2A). This is consistent with our finding discussed above that GOLDCD was the only single clinical variable associated with differential gene expression. The second most common clinical determinant of differential sputum gene expression was emphysema, quantified by either density mask analysis (FV950) or assessed qualitatively by the radiologist (EMPHETCD) (81% and 63% of all VIStA outcomes, respectively, Figure 2A), whereas BMI, phlegm, age and DWALK were observed in 53%, 36%, 27% and 25% of the VIStA outcomes, respectively (Figure 2A). Plasma fibrinogen was the most frequently identified systemic biomarker (64% of all VIStA outcomes).
Figure 2

Combination of clinical characteristics associated with groups from VIStA. A Number of times the characteristics were found significantly different between group 1 & 2 in a total of 500 divisions. Severity of airflow limitation (GOLDCD) is the single most important determinant of differential gene expression, being statistically significant in 95% of all VIStA outputs. B Summary of the individual and pairwise number of significant occurrences of the clinical characteristics. Node size is proportional to the number of times a measure was found significant, the width of a link indicates how often two measures appeared significant in the same VIStA division. The core group contains severity of airflow limitation (GOLDCD) and the two emphysema measures EMPHETCD and FV950. C, Number of times that pairwise combinations of clinical characteristics co-occurred in the 500 VIStA outcomes. The most significant pair (as compared to a Null model of independent occurrence) is EMPHETCD and FV950, which are both measures of emphysema. D The most frequent and significant triplet is a combination of GOLDCD and EMPHETCD and FV950, measuring disease severity. E We find significant combinations of the disease severity triplet in B with four clinical characteristics: BMI, PHLEGM, DWALK and AGE.


Combination of clinical traits from VIStA

Figure 2B illustrates how often combinations (pairs) of significant single clinical characteristics (or inflammatory biomarkers) co-occur in the different VIStA subtypes by the width of the links between them. The statistical significance of each co-occurrence (Figure 2C-E) was calculated using a binomial model that assumes independence of the individual characteristics or biomarker levels as the Null hypothesis. In order to quantify the extent to which the VIStA outcomes could reflect spurious associations, we also generated 10,000 random divisions of the patients and analyzed how often the individual characteristics and their combinations appear as significant (Figure 2C-E). We find that the divisions obtained by VIStA show a much higher number of significant clinical characteristics than expected by chance, with the exceptions of the biomarkers CCL18, TNFA and SPD and the variables COUGH and SEX. Similarly, combinations of significant characteristics appear more frequently than for randomly assigned divisions. We observed (Figure 2C) that the pairwise co-occurrences of clinical characteristics and inflammatory biomarkers were dominated by airflow limitation severity (GOLDCD). Other characteristics frequently observed in combinations include emphysema (EMPHETCD or FV950), fibrinogen levels, phlegm, BMI and age. Most pairs appear with the frequency expected for the Null hypothesis of independent individual clinical characteristics (see the many non-significant p-values in Figure 2C-E), implying that their association is not significant (e.g. EMPHETCD and GOLDCD). A notable exception is EMPHETCD and FV950, whose statistical association is expected, given that the two variables are not independent but are different measures of the same clinical characteristic (emphysema). Figure 2D, E shows the observed and expected co-occurrence of triplets and quartets of clinical characteristics and inflammatory biomarkers.
The most frequent and significant triplet consists of severity of airflow limitation (GOLDCD) and the two emphysema measures EMPHETCD and FV950. GOLDCD and either one of the severity of emphysema measures FV950 or EMPHETCD co-occurred in almost all triplets, which is again expected given their pathobiological relationship in patients with COPD. Figure 2E lists the most frequent combinations of four variables. We find that the most significant combinations are those which include the triple GOLDCD, FV950 and EMPHETCD, together with one additional variable, the most significant being FIBRINOGEN, BMI, PHLEGM, DWALK and age. In the following, however, we have not considered fibrinogen as the basis for a subtype since it is a biomarker rather than a clinical characteristic. In summary, Figure 2C-E suggests four distinct clinical parameters that define groups of patients with considerable gene expression differences. In all groups the patients are characterized by different disease severity (GOLDCD) and emphysema (i.e. EMPHETCD and FV950) but in addition, each group also has one clear distinctive parameter: high/low BMI (Group I), exercise capacity (DWALK) (Group II), Age (Group III) or presence/absence of phlegm production (Group IV) (Table 3). For example, group IA has high GOLDCD, emphysema, FV950 and low BMI, while group IB has low GOLDCD, emphysema, FV950 and high BMI.
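The binomial Null model of independent occurrence can be sketched as follows. The single-characteristic frequencies (95%, 81%, 63%) are the ones reported in the text for GOLDCD, FV950 and EMPHETCD; the observed co-occurrence count is a hypothetical placeholder, since the actual counts are only shown in Figure 2, and the one-sided upper tail is an assumption about how the p-value is defined.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_runs = 500
# Fractions of VIStA outcomes in which each characteristic was significant (from the text).
freq = {"GOLDCD": 0.95, "FV950": 0.81, "EMPHETCD": 0.63}

# Under independence, a pair co-occurs in one run with the product of the single frequencies.
p_pair = freq["FV950"] * freq["EMPHETCD"]
expected_pair = n_runs * p_pair
# The same logic extends to triplets.
expected_triplet = n_runs * freq["GOLDCD"] * freq["FV950"] * freq["EMPHETCD"]

observed_pair = 300  # hypothetical count, for illustration only
p_value = binom_tail(observed_pair, n_runs, p_pair)
print(round(expected_pair, 2), round(expected_triplet, 2), p_value)
```

A small p-value here means the pair appears together more often than independent occurrence would predict, which is the sense in which EMPHETCD and FV950 are reported as a significant pair.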
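Table 3

Summary of the clinical measures, biomarkers, and cell counts among the four groups of COPD patients identified from the results of Figure 2: each group combines GOLDCD, EMPHETCD and FV950, with either BMI (Group I), DWALK (Group II), AGE (Group III) or Phlegm (Group IV).

| Measure | Group IA (n = 25) | Group IB (n = 23) | p-value | Group IIA (n = 21) | Group IIB (n = 32) | p-value | Group IIIA (n = 15) | Group IIIB (n = 28) | p-value | Group IVA (n = 20) | Group IVB (n = 26) | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Age | 65.4 | 65.4 | - | 63.9 | 65.4 | - | 58.73 | 68.7 | *** | 63 | 65.96 | - |
| Lung Function | | | | | | | | | | | | |
| FEV1 | 1.72 | 0.89 | *** | 1.7 | 0.9 | *** | 1.72 | 0.93 | ** | 1.79 | 0.9 | *** |
| FEV1/FVC (%) | 57.88 | 32.43 | *** | 55.0 | 32.5 | *** | 56.53 | 33.93 | *** | 57.1 | 33.04 | *** |
| FEV1 reversibility (%) | 7.64 | 4.73 | *** | 11.5 | 7.2 | *** | 11.62 | 5.93 | *** | 10.4 | 5.5 | *** |
| Radiologist Emphysema | | | | | | | | | | | | |
| Emphysema severity | 1.2 | 4.2 | *** | 1.3 | 4.2 | *** | 1.336 | 3.7 | *** | 1.275 | 4.2 | *** |
| Densitometric Emphysema | | | | | | | | | | | | |
| Emphysema at -950 HU | 6.98 | 33.42 | *** | 6.6 | 31.7 | *** | 7.71 | 28.09 | *** | 7.06 | 33.42 | *** |
| Airflow Obstruction | | | | | | | | | | | | |
| GOLD Stage | 2 | 3.3 | *** | 2.0 | 3.3 | *** | 2 | 3.2 | *** | 2 | 3.3 | *** |
| Body Mass Index | 30.76 | 21.21 | *** | 27.3 | 24.2 | * | 29.87 | 25.8 | - | 28.67 | 24.42 | * |
| Chronic Bronchitis (ATS_CB), 1 = no CB | 1 = 24% | 1 = 30.4% | - | 1 = 85.7% | 1 = 62.5% | - | 1 = 6.7% | 1 = 37% | - | 1 = 100% | 1 = 46.2% | *** |
| Phlegm, 1 = no chronic phlegm | 1 = 56% | 1 = 35% | - | 1 = 62% | 1 = 41% | - | 1 = 66.6% | 1 = 33% | - | 1 = 100% | 1 = 0% | *** |
| 6 Minute Walk Distance | 428.32 | 330.02 | ** | 508.8 | 273.9 | *** | 438.97 | 321.83 | ** | 462.59 | 322.9 | ** |
| Exacerbations, 0 = no exacerbations | 0 = 68% | 0 = 34.8% | - | 0 = 71.4% | 0 = 28.1% | ** | 0 = 60% | 0 = 37% | - | 0 = 70% | 0 = 38.5% | * |
| CCL6 | 7.3 | 6.33 | - | 7.0 | 6.8 | - | 6.19 | 6.73 | - | 8.64 | 6.9 | - |
| IL6 | 5.65 | 20.2 | - | 4.3 | 18.6 | - | 2.79 | 6.89 | ** | 3.72 | 18.53 | - |
| IL8 | 8.88 | 10.77 | - | 8.3 | 9.4 | - | 7.5 | 10.28 | * | 9.8 | 10.65 | - |
| TNFa | 26.99 | 160.32 | - | 31.7 | 162.9 | - | 2.35 | 60.44 | - | 24.14 | 142.3 | - |
| CCL18 | 130.3 | 117.59 | - | 126.9 | 124.4 | - | 115.8 | 117.94 | - | 134 | 126.19 | - |
| CRPHS | 10.4 | 9.6 | - | 9.9 | 9.5 | - | 5.7 | 9.72 | - | 5108.5 [sic, two values fused] | | - |
| FIBRINOG | 494.9 | 499.1 | - | 481.0 | 506.2 | - | 456.8 | 498.58 | - | 510.8 | 489.84 | - |
| SPD | 129.64 | 110.94 | - | 124.9 | 119.1 | - | 79.73 | 116.3 | * | 138.76 | 109.7 | - |
| mMRC | 3 | 2.09 | ** | 1.0 | 2.4 | *** | 1.21 | 2.04 | * | 1.1 | 2.04 | * |
| SGRQ | 43.29 | 55.55 | ** | 35.9 | 56.0 | *** | 41.8 | 52.87 | * | 36 | 57.73 | *** |
| FFMI | 19.53 | 16.13 | *** | 18.5 | 17.1 | * | 18.48 | 17.8 | - | 18.83 | 17.15 | * |
| % Fat (Tissue) | 34.92 | 29.08 | ** | 31.0 | 31.7 | - | 35.96 | 31.94 | - | 32.69 | 31.25 | - |
| Neutrophils, % (Neut_Blq) | 61.38 | 64.87 | - | 60.7 | 67.0 | ** | 61.34 | 66.69 | * | 62.08 | 65.55 | - |
| Eosinophils, % (Eos blq) | 3 | 3.1 | - | 3.5 | 3.1 | - | 2.48 | 2.92 | - | 3.26 | 3.3 | - |
| Lymphocytes, % (lymhblq) | 28.63 | 24.84 | - | 28.6 | 23.2 | * | 29.19 | 23.55 | * | 27.77 | 23.688 | - |

* = p-value < 0.05; ** = p-value < 0.01; *** = p-value < 0.0001; - = not significant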

To further characterize these subtypes suggested by VIStA, we subdivided the full set of all 140 ECLIPSE patients according to the identified clinical characteristics, resulting in 8 groups of 15 to 28 patients. First, we explored a number of clinical, biomarker and cell count measures of the subjects in each group. We find, for example, that serum levels of the biomarkers IL6, IL8 and SPD are significantly higher in group IIIB than in IIIA, a difference that was not observed in other groups. Similarly, the proportions of neutrophils and lymphocytes in sputum differed significantly between groups IIIB and IIIA (Table 3). We then performed a separate differential gene expression analysis (now with a more stringent FDR <0.05) on the subgroups, finding 821 unique genes for Group I, 528 for Group II, 1,394 genes for Group III and 637 for Group IV (Figure 3B). The four groups share 7,592 genes that are differentially expressed in all of them. As expected, 80% of these genes were previously identified as differentially expressed comparing patients with moderate (GOLD 2) with those with more severe disease (GOLD 3&4) (Figure 3C). We conclude that the common core is dominated by severity of COPD, while the uniquely differentially expressed genes between the groups represent additional variation.
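The split into a common core and group-specific genes amounts to simple set operations, as in the following sketch. The gene identifiers are toy placeholders and the function name `unique_and_common` is hypothetical; "unique" is taken here to mean differentially expressed in exactly one group, which is an assumption about how the Venn diagram was read.

```python
def unique_and_common(de_sets):
    """Return the genes shared by all groups and, per group, the genes found in no other group."""
    common = set.intersection(*de_sets.values())
    unique = {}
    for name, genes in de_sets.items():
        others = set().union(*(g for n, g in de_sets.items() if n != name))
        unique[name] = genes - others
    return common, unique

# Toy gene sets (hypothetical identifiers, not study data).
de_sets = {
    "I": {"g1", "g2", "g3", "g7"},
    "II": {"g1", "g2", "g4"},
    "III": {"g1", "g2", "g5", "g7"},
    "IV": {"g1", "g2", "g6"},
}
common, unique = unique_and_common(de_sets)

# Fraction of the common core that also appears in a severity comparison (toy set).
severity_genes = {"g1", "g9"}
overlap_fraction = len(common & severity_genes) / len(common)
print(sorted(common), {k: sorted(v) for k, v in unique.items()}, overlap_fraction)
```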
Figure 3

Four subtypes and differentially expressed genes. A The combinations of phenotypic measures that define the subtypes predicted by the VIStA method: all four subtypes share a common core of high values of GOLDCD, FV950 and EMPHETCD, reflecting disease severity. Each of the individual subtypes I-IV presents one additional clinical characteristic: BMI (subtype I), DWALK (II), AGE (III) or PHLEGM (IV). B Venn diagram showing the number of differentially expressed genes unique to each subtype, as well as common to all four subtypes. The common genes show a large overlap with the genes differentially expressed between subjects with GOLDCD 2 and subjects with GOLDCD 3&4, indicating that these genes reflect mostly disease severity.


Specific genes & pathways in the groups from VIStA

For a further evaluation of the molecular level differences among the four groups, we performed a pathway enrichment analysis for the core set of genes common to all groups, as well as for the unique gene set of each group. Pathway annotations were obtained from the Molecular Signatures Database (MSigDB) published by the Broad Institute, Version 3.1 [14]. MSigDB integrates several different pathway databases; we use KEGG, Biocarta and Reactome. The enrichment analysis between a given gene set and a pathway was done using Fisher's exact test. As shown in Table 4, the top pathways show little overlap between the four groups, providing further evidence for VIStA's ability to capture molecular elements that are specific to each subtype. Several identified pathways were related to metabolism, diabetes and inflammation. Group I was most enriched with inflammatory pathways including for example the FC-Gamma-R mediated phagocytosis (p = 0.007) and CDC6-association with ORC:origin-complex pathways (p = 0.015). Further pathways include small cell lung cancer (p = 0.004) and maturity onset diabetes of the young (p = 0.009) [15]. Group II was enriched with lipid transport and beta-cell and insulin signaling pathways like beta cell (p = 0.005), HDL mediated lipid transport (p = 0.006) and GTP hydrolysis pathways (p = 0.007). In group III, pathways related to cell cycle control like mitotic prometaphase (p = 0.0048), and downstream signaling pathways (p = 0.003) with innate-immunity and GAB1 signaling were enriched. In group IV, distinct gap channel and inflammation pathways were identified like peptide ligand binding (p = 0.0006), gap-junction assembly (p = 0.0008) and chemokine signaling pathways (p = 0.0013).
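A one-sided Fisher's exact test for over-representation is equivalent to the upper tail of a hypergeometric distribution, which the following sketch computes. The text does not state the size of the background gene universe or whether a one- or two-sided test was used, so both are assumptions here; all numbers are toy values and the function name `enrichment_p` is hypothetical.

```python
from math import comb

def enrichment_p(overlap, set_size, pathway_size, universe):
    """P(X >= overlap) when set_size genes are drawn without replacement
    from a universe containing pathway_size pathway genes."""
    total = comb(universe, set_size)
    upper = min(set_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total  # exact integer ratio converted to float

# Toy example: 8 of 20 pathway genes found in a set of 100 genes,
# with an assumed background of 1,000 genes (expected overlap is 2).
p_toy = enrichment_p(overlap=8, set_size=100, pathway_size=20, universe=1000)
print(p_toy)
```

The "overlap" and "all pathway genes" columns of Table 4 correspond to the `overlap` and `pathway_size` arguments; the p-value additionally depends on the gene set size and the chosen background.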
Table 4

The 10 most strongly enriched pathways in the set of genes common among all four groups described in table 3, as well as in the individual gene sets of each group.

Top ten pathways among Common Genes

pathway                                                        p-value   overlap  all pathway genes
REACTOME_GENE_EXPRESSION                                       1.22E-35  235      425
REACTOME_DIABETES_PATHWAYS                                     1.91E-33  214      383
REACTOME_METABOLISM_OF_PROTEINS                                9.48E-28  134      215
REACTOME_CELL_CYCLE_MITOTIC                                    7.34E-25  167      306
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION               1.24E-23  104      161
KEGG_HUNTINGTONS_DISEASE                                       3.16E-23  114      185
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                      1.09E-21  130      229
REACTOME_ELECTRON_TRANSPORT_CHAIN                              1.11E-21  60       75
REACTOME_RNA_POLYMERASE_I_III_AND_MITOCHONDRIAL_TRANSCRIPTION  2.72E-21  82       120
REACTOME_INFLUENZA_LIFE_CYCLE                                  1.11E-20  89       137

Top ten pathways among Group I Genes

pathway                                                        p-value     overlap  all pathway genes
REACTOME_INORGANIC_CATION_ANION_SLC_TRANSPORTERS               0.00133586  7        94
KEGG_SMALL_CELL_LUNG_CANCER                                    0.00359651  6        84
KEGG_FC_GAMMA_R_MEDIATED_PHAGOCYTOSIS                          0.00723812  6        97
KEGG_MATURITY_ONSET_DIABETES_OF_THE_YOUNG                      0.00921957  3        25
REACTOME_AMINO_ACID_AND_OLIGOPEPTIDE_SLC_TRANSPORTERS          0.00984371  4        48
REACTOME_SLC_MEDIATED_TRANSMEMBRANE_TRANSPORT                  0.01009928  8        169
KEGG_B_CELL_RECEPTOR_SIGNALING_PATHWAY                         0.01020969  5        75
KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_LACTO_AND_NEOLACTO_SERIES  0.0102887   3        26
REACTOME_NUCLEAR_RECEPTOR_TRANSCRIPTION_PATHWAY                0.01133769  4        50
REACTOME_CDC6_ASSOCIATION_WITH_THE_ORC:ORIGIN_COMPLEX          0.01516009  2        11

Top ten pathways among Group II Genes

pathway                                                           p-value  overlap  all pathway genes
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS              0.00552  5        101
REACTOME_HDL_MEDIATED_LIPID_TRANSPORT                             0.00637  2        11
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT  0.00675  5        106
REACTOME_FACILITATIVE_NA_INDEPENDENT_GLUCOSE_TRANSPORTERS         0.00759  2        12
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                      0.00911  5        114
REACTOME_TRANSLATION                                              0.01121  5        120
REACTOME_TRANSMEMBRANE_TRANSPORT_OF_SMALL_MOLECULES               0.01148  7        218
REACTOME_IRS_RELATED_EVENTS                                       0.01182  4        79
REACTOME_INFLUENZA_LIFE_CYCLE                                     0.01889  5        137
REACTOME_DEADENYLATION_OF_MRNA                                    0.02469  2        22

Top ten pathways among Group III Genes

pathway                                                              p-value     overlap  all pathway genes
REACTOME_DOWN_STREAM_SIGNAL_TRANSDUCTION                             0.00302075  5        35
REACTOME_GAB1_SIGNALOSOME                                            0.00324484  3        11
REACTOME_SIGNALING_IN_IMMUNE_SYSTEM                                  0.00470548  20       366
REACTOME_MITOTIC_PROMETAPHASE                                        0.00489389  8        92
REACTOME_INNATE_IMMUNITY_SIGNALING                                   0.00584887  10       136
REACTOME_SIGNALLING_TO_RAS                                           0.0060289   4        26
REACTOME_FORMATION_OF_PLATELET_PLUG                                  0.00753191  12       186
REACTOME_GRB2_SOS_PROVIDES_LINKAGE_TO_MAPK_SIGNALING_FOR_INTERGRINS  0.00821656  3        15
REACTOME_MYOGENESIS                                                  0.00895252  4        29
REACTOME_HEMOSTASIS                                                  0.01301061  15       274

Top ten pathways among Group IV Genes

pathway                                                       p-value  overlap  all pathway genes
REACTOME_PEPTIDE_LIGAND_BINDING_RECEPTORS                     0.00059  12       173
REACTOME_GAP_JUNCTION_ASSEMBLY                                0.00076  4        19
KEGG_CHEMOKINE_SIGNALING_PATHWAY                              0.00133  12       190
REACTOME_GAP_JUNCTION_TRAFFICKING                             0.00340  4        28
REACTOME_CHEMOKINE_RECEPTORS_BIND_CHEMOKINES                  0.00787  5        55
REACTOME_ACTIVATION_OF_ATR_IN_RESPONSE_TO_REPLICATION_STRESS  0.00936  4        37
REACTOME_SIGNALING_IN_IMMUNE_SYSTEM                           0.00943  16       366
KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY                        0.01126  7        108
REACTOME_CELL_CYCLE_CHECKPOINTS                               0.01237  7        110
KEGG_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY                0.01263  8        137

Finally, we identified genes with at least a 2-fold change (FC) in expression [16,17] at an FDR of <0.05; see Table 5 for the specific set of upregulated and downregulated genes in each subgroup. For example, MMP7 was found to be upregulated in group I (BMI), consistent with findings in [18], where nutritionally induced obese mice showed alterations in MMP and TIMP expression, thus providing further evidence for the role of these proteolytic system genes in the COPD subtype with low BMI.
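
A selection of this kind (fold change of at least 2 in either direction, FDR below 0.05) can be sketched as follows. This is an illustrative sketch only: the text does not state which FDR procedure was used, so the Benjamini-Hochberg adjustment here is an assumption, and the function names and toy values are hypothetical. Fold changes are signed as in Table 5, with negative values denoting downregulation.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(stats: Dict[str, Tuple[float, float]],
                 min_fold: float = 2.0,
                 max_fdr: float = 0.05) -> Dict[str, float]:
    """`stats` maps gene -> (signed fold change, raw p-value)."""
    genes = list(stats)
    q_values = benjamini_hochberg([stats[g][1] for g in genes])
    return {
        g: stats[g][0]
        for g, q in zip(genes, q_values)
        if abs(stats[g][0]) >= min_fold and q < max_fdr
    }


# Toy input, invented for illustration (not the study's data).
toy = {
    "geneA": (2.4, 0.001),
    "geneB": (-2.9, 0.004),
    "geneC": (1.3, 0.0005),   # significant, but fold change too small
    "geneD": (2.1, 0.2),      # large fold change, but not significant
}
print(select_genes(toy))
```

Applying the FDR correction across all tested genes before filtering on fold change keeps the error-rate guarantee tied to the full set of hypotheses rather than to the filtered subset.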
Table 5

Top ten upregulated and downregulated unique genes and their fold-change (FC) in each group (In group II, only five unique genes are downregulated).

Group I             Group II            Group III           Group IV
Gene          FC    Gene          FC    Gene          FC    Gene          FC

LOC100127940  2.8   RP-3377H14.5  2.4   DDX3Y         4.6   IL1F9         2.5
PDCD6         2.4   ZFYVE16       2.2   EIF1AY        3.2   IL23A         2.5
AHRR          2.4   TGFBR1        2.2   HELB          3     TUB           2.4
CD1B          2.4   MARCH6        2.2   LOC100130224  2.9   GJB2          2.3
KIT           2.4   CAPZA1        2.2   UTY           2.9   CD22          2.3
CADM1         2.3   KIAA0319      2.2   ADORA3        2.9   FAF1          2.3
MMP7          2.3   DHX36         2.2   ARNT2         2.9   MBOAT7        2.3
C20orf197     2.3   DLGAP4        2.1   CXCL14        2.6   SULT2A1       2.3
RNF144A       2.2   RIF1          2.1   TMEM61        2.6   TMEM88        2.3
MYO1B         2.2   NT5C2         2.1   PPARGC1B      2.6   CHST7         2.3

SGK493        -2.0  TIFAB         -2.0  C1orf201      -2.5  VASH1         -2.3
ALS2CR4       -2.1  CCDC42        -2.2  ST3GAL3       -2.5  LINC00607     -2.3
ENPP5         -2.2  HBE1          -2.2  APOOL         -2.6  KLHDC7B       -2.3
FLJ14082      -2.2  NAPSB         -2.2  IL28RA        -2.6  DHODH         -2.3
LOC1441204    -2.2  C4orf7        -3.8  ZNF624        -2.6  CDDC113       -2.3
LOC100134569  -2.2                      SMAD5         -2.6  IGF2BP3       -2.3
FAM101A       -2.7                      NRP1          -2.6  C3orf27       -2.3
LOC92270      -2.8                      LOC654342     -2.6  ZNF618        -2.3
HPR           -2.9                      TSIX          -3.3  AKR1C4        -2.4
HP            -2.9                      XIST          -4.1  LOC401321     -2.4

Conclusion

We have found that with the exception of severity of airflow limitation, categorizing COPD subtypes according to a single clinical characteristic does not yield groups of patients with significant gene expression differences. In this study, we therefore introduced a novel methodology that allowed us to identify combinations of clinical characteristics that correspond to clear gene expression differences. Our results suggest that while gene expression differences are mainly driven by the severity of airflow limitation and the extent of emphysema, a smaller, yet discriminative contribution is also observed for a set of additional clinical characteristics: BMI, distance walked, age and chronic phlegm production, each defining a subtype of patients. Validation of these groups and the underlying pathways will require replication in a second cohort of COPD subjects. Note that additional differences may also exist for clinical characteristics that have not been considered in the present study. The observed subgroups with combinations of different clinical characteristics are consistent with the clinical heterogeneity of COPD, where a given patient may manifest more than one measurable feature of COPD, suggesting either that the underlying mechanisms contribute to more than one feature or that multiple mechanisms are maladapted in an individual. While we focused on COPD in this study, the proposed VIStA method can be more generally applied to any other complex, heterogeneous disease and presents a promising approach to the important problem of disease heterogeneity and subtyping/subgrouping. A better understanding of this problem is invaluable, for example, for improving the selection of patients for evaluating novel agents. To the extent that gene expression reflects genetic and epigenetic variation, the subtypes identified by our method may further suggest different approaches to identifying genetic susceptibility.

Competing interests

RJM, RTS, BEM, NL, JR, CL, GB are employees of GlaxoSmithKline and own shares and share options in the company.

Authors' contributions

JM, AS, ALB carried out the analysis and wrote the manuscript; AA, BC, SR, ES, RTS, BM, JR, NL provided expertise on the ECLIPSE data and COPD and wrote the manuscript; MHC, RM, SG, CL, GB advised on the analysis and wrote the manuscript.
  18 in total

1.  The SERPINE2 gene is associated with chronic obstructive pulmonary disease.

Authors:  Dawn DeMeo; Thomas Mariani; Christoph Lange; Stephen Lake; Augusto Litonjua; Juan Celedón; John Reilly; Harold A Chapman; David Sparrow; Avrum Spira; Jennifer Beane; Victor Pinto-Plata; Frank E Speizer; Steve Shapiro; Scott T Weiss; Edwin K Silverman
Journal:  Proc Am Thorac Soc       Date:  2006-08

2.  Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors:  Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-30       Impact factor: 11.205

3.  Detection of differentially expressed genes in lymphomas using cDNA arrays: identification of clusterin as a new diagnostic marker for anaplastic large-cell lymphomas.

Authors:  A Wellmann; C Thieblemont; S Pittaluga; A Sakai; E S Jaffe; P Siebert; M Raffeld
Journal:  Blood       Date:  2000-07-15       Impact factor: 22.113

4.  Use of a cDNA microarray to analyse gene expression patterns in human cancer.

Authors:  J DeRisi; L Penland; P O Brown; M L Bittner; P S Meltzer; M Ray; Y Chen; Y A Su; J M Trent
Journal:  Nat Genet       Date:  1996-12       Impact factor: 38.330

5.  Evaluation of COPD Longitudinally to Identify Predictive Surrogate End-points (ECLIPSE).

Authors:  J Vestbo; W Anderson; H O Coxson; C Crim; F Dawber; L Edwards; G Hagan; K Knobil; D A Lomas; W MacNee; E K Silverman; R Tal-Singer
Journal:  Eur Respir J       Date:  2008-01-23       Impact factor: 16.671

6.  Gene expression profiling of human lung tissue from smokers with severe emphysema.

Authors:  Avrum Spira; Jennifer Beane; Victor Pinto-Plata; Aran Kadar; Gang Liu; Vishal Shah; Bartolome Celli; Jerome S Brody
Journal:  Am J Respir Cell Mol Biol       Date:  2004-09-16       Impact factor: 6.914

7.  Molecular biomarkers for quantitative and discrete COPD phenotypes.

Authors:  Soumyaroop Bhattacharya; Sorachai Srisuma; Dawn L Demeo; Steven D Shapiro; Raphael Bueno; Edwin K Silverman; John J Reilly; Thomas J Mariani
Journal:  Am J Respir Cell Mol Biol       Date:  2008-10-10       Impact factor: 6.914

8.  Expression of genes involved in oxidative stress responses in airway epithelial cells of smokers with chronic obstructive pulmonary disease.

Authors:  Stefan Pierrou; Per Broberg; Rory A O'Donnell; Krzysztof Pawłowski; Robert Virtala; Eva Lindqvist; Audrey Richter; Susan J Wilson; Gilbert Angco; Sebastian Möller; Håkan Bergstrand; Witte Koopmann; Elisabet Wieslander; Per-Erik Strömstedt; Stephen T Holgate; Donna E Davies; Johan Lund; Ratko Djukanovic
Journal:  Am J Respir Crit Care Med       Date:  2006-12-07       Impact factor: 21.405

9.  Considerations when using the significance analysis of microarrays (SAM) algorithm.

Authors:  Ola Larsson; Claes Wahlestedt; James A Timmons
Journal:  BMC Bioinformatics       Date:  2005-05-29       Impact factor: 3.169

10.  The presence and progression of emphysema in COPD as determined by CT scanning and biomarker expression: a prospective analysis from the ECLIPSE study.

Authors:  Harvey O Coxson; Asger Dirksen; Lisa D Edwards; Julie C Yates; Alvar Agusti; Per Bakke; Peter Ma Calverley; Bartolome Celli; Courtney Crim; Annelyse Duvoix; Paola Nasute Fauerbach; David A Lomas; William Macnee; Ruth J Mayer; Bruce E Miller; Nestor L Müller; Stephen I Rennard; Edwin K Silverman; Ruth Tal-Singer; Emiel Fm Wouters; Jørgen Vestbo
Journal:  Lancet Respir Med       Date:  2013-02-01       Impact factor: 30.700

