Li Wang, William K Oh, Jun Zhu.
Abstract
Blood-based biomarker assays have the advantage of being minimally invasive. Diagnostic and prognostic models built on peripheral blood gene expression have been reported for various types of disease. However, most of these studies focused on a single disease type and did not address whether the identified gene expression signature is disease-specific or more widely applicable across diseases. We conducted a meta-analysis of 46 whole blood gene expression datasets covering a wide range of diseases and physiological conditions. Our analysis uncovered a striking overlap of signature genes shared by multiple diseases, driven by an underlying common pattern of cell component change, specifically an increase in myeloid cells and a decrease in lymphocytes. These observations reveal the necessity of building disease-specific classifiers that can distinguish different disease types as well as normal controls, and highlight the importance of cell component change in deriving models based on blood gene expression. We developed a new strategy for building blood-based disease-specific models that leverages both cell component changes and cell molecular state changes, and we demonstrate its superiority using independent datasets.
Year: 2016 PMID: 27596246 PMCID: PMC5011717 DOI: 10.1038/srep32976
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Whole blood gene expression profile datasets.
| Dataset Name | Case# | Control# | Gene# | Platform | Disease Name |
|---|---|---|---|---|---|
| Aging_GSE33828 | 381 | 500 | 23097 | Illumina HumanHT-12 V4.0 (GPL10558) | |
| ASLE_GSE19491 | 28 | 17 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | ASLE |
| BacterialPneumonia_GSE20346 | 26 | 36 | 19957 | Illumina HumanHT-12 V3.0 (GPL6947) | Pneumonia |
| BreastCancer_GSE16443 | 67 | 54 | 16752 | ABI Human Genome Survey Microarray V2 (GPL2986) | BreastCancer |
| ColorectalCancer_E-MEXP-3756 | 20 | 20 | 21049 | Affymetrix HG-U133_Plus_2 (GPL570) | ColorectalCancer |
| CoronaryArteryDiesease_Cathgen_GSE20686 | 87 | 52 | 19749 | Agilent-014850 (GPL4133) | CoronaryArteryDiesease |
| CoronaryArteryDiesease_PREDICT_GSE20686 | 99 | 99 | 19749 | Agilent-014850 (GPL4133) | CoronaryArteryDiesease |
| CRPC_GSE37199 | 63 | 31 | 20618 | Affymetrix HG-U133_Plus_2 (GPL570) | CRPC |
| CRPChighrisk_GSE37199 | 14 | 49 | 20618 | Affymetrix HG-U133_Plus_2 (GPL570) | |
| InfluenzaVaccine_day28_GSE30101 | 18 | 18 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | |
| InfluenzaVaccine_day3_GSE30101 | 18 | 23 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | |
| InfluenzaVaccine_day7_GSE30101 | 18 | 18 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | InfluenzaVaccine |
| InfluenzaVaccine_GSE30101 | 202 | 208 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | |
| IntermediateCoronaryArteryDiesease_Cathgen_GSE20686 | 56 | 52 | 19749 | Agilent-014850 (GPL4133) | |
| LTB_test_GSE19491 | 21 | 28 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | LTB |
| LTB_training_GSE19491 | 16 | 12 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | LTB |
| LungCancer_GSE12771 | 97 | 95 | 24614 | Illumina human-6 v2.0 (GPL6102) | LungCancer |
| LungCancer_GSE20189 | 81 | 80 | 13211 | Affymetrix HG-U133A_2 (GPL571) | LungCancer |
| LungCancer_GSE42834 | 16 | 118 | 23871 | Illumina HumanHT-12 V4.0 (GPL10558) | LungCancer |
| LungCancerStage_GSE20189 | 29 | 52 | 13211 | Affymetrix HG-U133A_2 (GPL571) | |
| MajorDepressiveDisorder_GSE19738 | 66 | 66 | 13331 | Agilent-012391 (GPL6848) | MajorDepressiveDisorder |
| MultipleSclerosis_GSE41850 | 170 | 60 | 17549 | Affymetrix Human Exon 1.0 (GPL16209) | MultipleSclerosis |
| Obesity_GSE18897 | 20 | 20 | 21049 | Affymetrix HG-U133_Plus_2 (GPL570) | Obesity |
| Obesity_E-MTAB-54 | 49 | 25 | 21049 | Affymetrix HG-U133_Plus_2 (GPL570) | Obesity |
| Parkinson_GSE6613 | 50 | 22 | 13211 | Affymetrix HG-U133A (GPL96) | Parkinson |
| Pneumonia_GSE42834 | 24 | 118 | 23871 | Illumina HumanHT-12 V4.0 (GPL10558) | Pneumonia |
| PneumovaxVaccine_day28_GSE30101 | 15 | 18 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | |
| PneumovaxVaccine_day3_GSE30101 | 18 | 23 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | |
| PneumovaxVaccine_day7_GSE30101 | 18 | 18 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | PneumovaxVaccine |
| PneumovaxVaccine_GSE30101 | 197 | 208 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | |
| PSLE_GSE19491 | 82 | 19 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | PSLE |
| PTB_test_GSE19491 | 49 | 28 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | TB |
| PTB_training_GSE19491 | 13 | 12 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | TB |
| RheumatoidArthritis_GSE17755 | 112 | 45 | 14358 | Hitachisoft AceGene Human Oligo Chip (GPL1291) | RheumatoidArthritis |
| Sarcoid_GSE42834 | 83 | 118 | 23871 | Illumina HumanHT-12 V4.0 (GPL10558) | Sarcoid |
| Schizophrenia_GSE38485 | 106 | 96 | 19969 | Illumina HumanHT-12 V3.0 (GPL6947) | Schizophrenia |
| SevereInfluenza_GSE20346 | 19 | 36 | 19957 | Illumina HumanHT-12 V3.0 (GPL6947) | SevereInfluenza |
| SleepRestriction_16.5_GSE39445 | 22 | 23 | 19541 | Agilent-026817 (GPL15331) | |
| SleepRestriction_25.5_GSE39445 | 23 | 22 | 19541 | Agilent-026817 (GPL15331) | |
| SleepRestriction_34.5_GSE39445 | 20 | 20 | 19541 | Agilent-026817 (GPL15331) | |
| SleepRestriction_7.5_GSE39445 | 23 | 22 | 19541 | Agilent-026817 (GPL15331) | SleepRestriction |
| SleepRestriction_GSE39445 | 212 | 215 | 19541 | Agilent-026817 (GPL15331) | |
| STAPH_GSE19491 | 40 | 23 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | STAPH |
| STILL_GSE19491 | 31 | 22 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | STILL |
| STREP_GSE19491 | 12 | 23 | 19982 | Illumina HumanHT-12 V3.0 (GPL6947) | STREP |
| TB_GSE42834 | 40 | 118 | 23871 | Illumina HumanHT-12 V4.0 (GPL10558) | TB |
&Datasets with empty disease names were not included in building disease-specific classifiers (see Methods for details).
*10 independent datasets used in evaluating performance of disease-specific classifiers.
$For the Aging_GSE33828 dataset, samples were split into case (old) and control (young) groups at a cutoff age of 60 years; this cutoff was chosen empirically so that the case and control groups are of similar size.
Figure 1. (A) Fold change profiles of 3161 disease-informative genes across the 46 datasets. Rows represent genes, and columns represent datasets. Each cell represents the log2 fold change of the corresponding gene in the corresponding dataset, calculated by comparing gene expression in case samples with that in control samples. For display purposes, cells with values >2 (<−2) were set to 2 (−2). Grey cells indicate that data were not available. (B) Correlation matrix of disease datasets. Each cell represents the Spearman's correlation coefficient of the gene fold change profiles (as shown in A) between two datasets. In both (A) and (B), disease datasets were clustered using complete-linkage hierarchical clustering. The distance matrix used in hierarchical clustering was calculated as 1 - the correlation matrix shown in (B).
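The quantities named in the Figure 1 legend can be illustrated with a minimal, standard-library sketch. It is not the authors' code. It assumes expression values are already log2-transformed, so that the log2 fold change is the difference of group means, and it represents an unavailable gene as `None`; all function names are hypothetical. The clustering step itself is omitted, and only its input distance matrix is computed.

```python
import math


def log2_fold_change(case_log2, control_log2):
    """Mean case minus mean control, assuming log2-scale inputs (an assumption)."""
    return sum(case_log2) / len(case_log2) - sum(control_log2) / len(control_log2)


def clip_for_display(value, limit=2.0):
    """Cap a fold change at +/- limit, as described for the heatmap display."""
    return max(-limit, min(limit, value))


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(profile_a, profile_b):
    """Spearman correlation over genes available in both datasets (None = missing)."""
    pairs = [(a, b) for a, b in zip(profile_a, profile_b)
             if a is not None and b is not None]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))


def distance_matrix(profiles):
    """Pairwise distance = 1 - Spearman correlation between dataset profiles."""
    return [[1.0 - spearman(p, q) for q in profiles] for p in profiles]
```

Each entry of `profiles` is one dataset's fold change profile over the same ordered gene list. The resulting matrix could then be passed to any complete-linkage hierarchical clustering routine.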
Figure 2. Heatmap of commonly regulated genes across different types of blood cell lines.
Rows represent up-regulated genes (left) or down-regulated genes (right). Columns represent blood cell lines, grouped by lineage (column legend). Abbreviations: HSC: hematopoietic stem cell. MYP: myeloid progenitor. ERY: erythroid cell. MEGA: megakaryocyte. GM: granulocyte/monocyte. EOS: eosinophil. BASO: basophil. DEND: dendritic cell.
Figure 3. Heatmap of cell component change profiles.
Cell frequencies were estimated by DSA (A) and CIBERSORT (B). Each row represents a dataset, and the row side color marks the 19 similar datasets (orange) and the others (black). Each column represents a different cell component. The color of each cell encodes the t-statistic from testing the difference in that cell component between the case and control groups of the dataset. Complete-linkage clustering was applied with distance = 1 - Pearson's correlation of two profiles. Grey indicates that the t-statistic could not be calculated; in this analysis, that occurs when the estimated proportion of the cell component is zero for all samples in the dataset.
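As a rough illustration of the per-cell statistic in the Figure 3 legend, the sketch below computes a two-sample t-statistic on estimated cell fractions and returns `None` where the legend would show a grey cell. The legend does not say which t-test variant was used; Welch's unequal-variance form is assumed here, and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def cell_component_t_stat(case_fractions, control_fractions):
    """Welch t-statistic (assumed variant) for case vs control; None if not calculable."""
    if len(case_fractions) < 2 or len(control_fractions) < 2:
        return None
    se_squared = (variance(case_fractions) / len(case_fractions)
                  + variance(control_fractions) / len(control_fractions))
    if se_squared == 0:
        # For example, the estimated fraction is zero for every sample.
        return None
    return (mean(case_fractions) - mean(control_fractions)) / math.sqrt(se_squared)


def change_profile(case_by_component, control_by_component):
    """One heatmap row: t-statistic per cell component for a single dataset."""
    return {
        component: cell_component_t_stat(case_by_component[component],
                                         control_by_component[component])
        for component in case_by_component
    }
```

A positive value means the component is estimated to be more abundant in cases than in controls, which is the direction the abstract reports for myeloid cells.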
Figure 4. Numbers of differentially expressed genes identified by the csSAM and SAM algorithms (FDR < 0.1).
Datasets annotated with the red star represent the 19 similar disease datasets in Fig. 1.
Figure 5. Workflow of training and validating disease-specific classifiers using the deconvolution-based strategy.
Figure 6. Performance of disease-specific classifiers based on the residual gene expression profiles or the original gene expression profiles when different classification algorithms were used (A) or when different cell deconvolution methods were used to estimate residual profiles (B). The Y-axis represents the average AUC score assessed on the 10 independent datasets. Error bars were calculated by running the training and evaluation 10 times (see Methods for details of the sample splitting scheme). The X-axis represents the number of top-ranking genes (based on t-statistics) preselected for inclusion in the classifiers.
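The two axes of Figure 6 can be made concrete with a small sketch: preselecting the top-ranking genes by t-statistic, and averaging AUC over several evaluation datasets. This is an illustration under stated assumptions, not the authors' pipeline. Ranking by absolute t-statistic is an assumption, AUC is computed as the pairwise rank statistic, and all function and gene names are hypothetical.

```python
def top_genes(t_stats, k):
    """Names of the k genes with the largest absolute t-statistic (assumed ranking)."""
    return sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)[:k]


def auc(scores, labels):
    """Chance that a random case (label 1) outscores a random control (label 0)."""
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case_score in case_scores:
        for control_score in control_scores:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


def average_auc(per_dataset):
    """Mean AUC over (scores, labels) pairs, one pair per evaluation dataset."""
    values = [auc(scores, labels) for scores, labels in per_dataset]
    return sum(values) / len(values)
```

Here `scores` stands for whatever a trained classifier outputs per sample; the classifier itself is outside the scope of this sketch.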
Figure 7. Performance of disease-specific classifiers when residual gene expression-based classifiers were combined with cell component-based classifiers.
The figure legend is the same as Fig. 6.