Michael Seifert, Khalil Abou-El-Ardat, Betty Friedrich, Barbara Klink, Andreas Deutsch.
Abstract
Changes in gene expression programs play a central role in cancer. Chromosomal aberrations such as deletions, duplications and translocations of DNA segments can lead to highly significant positive correlations of gene expression levels of neighboring genes. This should be utilized to improve the analysis of tumor expression profiles. Here, we develop a novel model class of autoregressive higher-order Hidden Markov Models (HMMs) that carefully exploit local data-dependent chromosomal dependencies to improve the identification of differentially expressed genes in tumors. Autoregressive higher-order HMMs overcome generally existing limitations of standard first-order HMMs in the modeling of dependencies between genes in close chromosomal proximity by the simultaneous usage of higher-order state-transitions and autoregressive emissions as novel model features. We apply autoregressive higher-order HMMs to the analysis of breast cancer and glioma gene expression data and perform in-depth model evaluation studies. We find that autoregressive higher-order HMMs clearly improve the identification of overexpressed genes with underlying gene copy number duplications in breast cancer in comparison to mixture models, standard first- and higher-order HMMs, and other related methods. The performance benefit is attributed to the simultaneous usage of higher-order state-transitions in combination with autoregressive emissions. This benefit could not be reached by using each of these two features independently. We also find that autoregressive higher-order HMMs are better able to identify differentially expressed genes in tumors independent of the underlying gene copy number status in comparison to the majority of related methods.
This is further supported by the identification of well-known and of previously unreported hotspots of differential expression in glioblastomas, demonstrating the efficacy of autoregressive higher-order HMMs for the analysis of individual tumor expression profiles. Moreover, we reveal interesting novel details of systematic alterations of gene expression levels in known cancer signaling pathways distinguishing oligodendrogliomas, astrocytomas and glioblastomas. An implementation is available at www.jstacs.de/index.php/ARHMM.
Year: 2014 PMID: 24955771 PMCID: PMC4067306 DOI: 10.1371/journal.pone.0100295
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Local chromosomal dependencies of gene expression levels in different types of cancer.
Spatial correlations of expression levels of genes in increasing chromosomal order up to ten were quantified by an average autocorrelation function that considers each chromosome-specific expression profile in each individual tumor sample. The autocorrelation function quantifies the similarity of gene expression levels of neighboring genes on a chromosome at a fixed distance. Corresponding average autocorrelation functions are shown for three types of cancer: (i) different types of gliomas (red) [33], (ii) breast cancer expression profiles (orange) [3] and (iii) glioblastoma expression profiles (grey) [4]. Additionally, the green curve represents the average autocorrelation function of normal brain reference gene expression profiles taken from [33]. Due to chromosomal aberrations in gliomas, expression levels of genes in close chromosomal proximity tend to show greater similarity in gliomas (red) than in corresponding normal brain tissues (green). Moreover, the black curve represents mean values and standard deviations of the average autocorrelation function for randomly permuted glioma gene expression profiles from [33] across 100 repeats. The observation of significant local chromosomal dependencies in tumor expression profiles compared to permuted expression profiles motivates the development of autoregressive higher-order HMMs for the analysis of tumor expression profiles.
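The quantity described in this caption can be illustrated with a minimal sketch. It assumes the standard sample autocorrelation estimator, averaged over profiles; the exact estimator used by the authors is not stated here, and the function names and the synthetic data are hypothetical.

```python
import numpy as np

def autocorrelation(profile, max_lag=10):
    """Sample autocorrelation of one chromosome-specific profile for lags 1..max_lag.

    Assumes the profile is longer than max_lag and not constant.
    """
    x = np.asarray(profile, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

def average_autocorrelation(profiles, max_lag=10):
    """Average the per-profile autocorrelation over all chromosomes and samples."""
    return np.mean([autocorrelation(p, max_lag) for p in profiles], axis=0)

rng = np.random.default_rng(0)
# Hypothetical data: 20 profiles of 200 genes each, smoothed so that
# neighboring genes have similar values (a stand-in for aberrant regions).
smooth = [np.convolve(rng.normal(size=204), np.ones(5) / 5, mode="valid")
          for _ in range(20)]
# Permuting gene order destroys the chromosomal neighborhood structure.
permuted = [rng.permutation(p) for p in smooth]

acf_smooth = average_autocorrelation(smooth)
acf_perm = average_autocorrelation(permuted)
print(np.round(acf_smooth[:3], 2), np.round(acf_perm[:3], 2))
```

On such data the permuted profiles give values close to zero at every lag, which is the role the black curve plays in the figure.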
Figure 2. Selected state space representations of models included in the novel model class of autoregressive higher-order HMMs.
State space representations of selected models included in the class of autoregressive higher-order HMMs. Hidden states and emissions are shown as nodes. Arrows between nodes define modeled statistical dependencies. a) Standard mixture model (AR(0)-HMM(0)). b) Mixture model with second-order autoregressive emissions (AR(2)-HMM(0)). c) Standard HMM with first-order state-transitions (AR(0)-HMM(1)). d) Standard higher-order HMM with second-order state-transitions (AR(0)-HMM(2)). e) Autoregressive higher-order HMM with second-order state-transitions and second-order autoregressive emissions (AR(2)-HMM(2)).
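The dependency structure in these panels can be summarized as a factorization of the joint distribution. This is a sketch under assumed notation, since the original symbols are not preserved here: q_t is the hidden state and o_t the emission at position t, T is the profile length, M is the order of the autoregressive emission process and L is the order of the state-transition process.

```latex
P(q_1,\dots,q_T,\, o_1,\dots,o_T)
  = \prod_{t=1}^{T}
    P\bigl(q_t \mid q_{t-L},\dots,q_{t-1}\bigr)\,
    P\bigl(o_t \mid q_t,\, o_{t-M},\dots,o_{t-1}\bigr)
```

At the first positions the conditioning sets are truncated to the predecessors that exist. Setting L = 0 and M = 0 gives a mixture model with independent states, L = 1 and M = 0 gives the standard first-order HMM, and L = 2 with M = 2 corresponds to panel e).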
Figure 3. Identification of overexpressed genes with increased copy numbers in breast cancer by different autoregressive HMMs.
Systematic comparison of the identification of overexpressed genes with at least three-fold increased copy numbers by autoregressive HMMs based on breast cancer gene expression profiles from [3]. Each autoregressive HMM combines an emission process of a given autoregressive order (AR order) with a state-transition process of a given order (HMM-Order). The left column shows the performances reached by autoregressive HMMs trained and applied to fifty percent of the breast cancer gene expression data. The right column shows the performances of these models reached on the remaining unseen fifty percent of the data set. For each model, the identification of candidate genes of overexpression with at least three-fold increased copy numbers is quantified by the true positive rate (TPR) reached at a fixed false positive rate of 5%. Six different scenarios are shown. a) and b) represent performances of the different models with respect to our standard initial model parameter settings. c) and d) represent average performances and corresponding standard deviations reached with respect to systematically changed initial model parameters. e) and f) represent average performances and corresponding standard deviations reached with respect to systematically modified prior hyperparameter settings. The predictions of autoregressive HMMs are generally very robust. Models utilizing a combination of higher-order state-transitions and autoregressive emissions (e.g. AR()-HMM() and AR()-HMM()) clearly outperform the mixture model (AR(0)-HMM(0)), the standard first-order HMM (AR(0)-HMM(1)), and standard higher-order HMMs (AR()-HMM()).
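The performance measure used in this figure, the TPR at a fixed FPR of 5%, can be sketched as follows. The function name, the thresholding convention (a gene is called positive when its score is strictly above the threshold) and the toy data are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def tpr_at_fpr(scores, labels, fpr_level=0.05):
    """True positive rate at the strictest threshold whose FPR stays <= fpr_level.

    scores: higher means stronger evidence for overexpression.
    labels: True for genes in the positive reference set.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    neg = np.sort(scores[~labels])[::-1]           # negatives, highest score first
    k = int(np.floor(fpr_level * neg.size))        # negatives allowed above threshold
    threshold = neg[k] if k < neg.size else -np.inf
    return float(np.mean(scores[labels] > threshold))

# Toy data: 100 negative genes with scores 0..99 and 4 positive genes.
negatives = np.arange(100.0)
positives = np.array([99.5, 97.0, 94.5, 50.0])
scores = np.concatenate([negatives, positives])
labels = np.concatenate([np.zeros(100, bool), np.ones(4, bool)])

tpr = tpr_at_fpr(scores, labels)
print(tpr)  # 0.75: three of the four positives score above the 5% FPR threshold
```

Fixing the FPR makes models with different score scales comparable at the same operating point.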
Table 1. Method-specific characteristics for the identification of differentially expressed genes in breast cancer.
| Method | Reference | Underexpressed Prop. | Underexpressed Mean | Underexpressed Sd | Overexpressed Prop. | Overexpressed Mean | Overexpressed Sd | Runtime |
| Wavelet | | 18.58% | −0.09 | 0.63 | 9.52% | 0.13 | 0.84 | 3 min 36 s |
| BioHMM | | 7.41% | −0.30 | 0.88 | 9.96% | 0.39 | 0.90 | 5 min 03 s |
| FHMM | | 6.37% | −0.37 | 0.75 | 5.42% | 0.62 | 0.92 | 2 min 59 s |
| CBS | | 2.66% | −0.19 | 0.72 | 1.91% | 0.47 | 0.98 | 3 min 02 s |
| CGHseg | | 2.45% | −0.11 | 0.64 | 0.97% | 0.33 | 1.10 | 2 min 52 s |
| ChARM | | 1.02% | −0.30 | 0.66 | 1.84% | 0.31 | 0.77 | - |
| GLAD | | 1.54% | −1.95 | 1.00 | 1.77% | 1.85 | 0.76 | 2 min 51 s |
| DSHMM | | 1.48% | −2.18 | 0.97 | 2.25% | 1.90 | 0.70 | 1 min 26 s |
| AR()-HMM() | see Methods | 1.51% | −2.23 | 0.88 | 2.19% | 1.85 | 0.66 | 2 min 56 s |
| MixMod | | 1.34% | −2.39 | 0.81 | 1.84% | 2.13 | 0.57 | 11 s |
The proportion of genes predicted as under- or overexpressed in relation to the total number of measured genes and the corresponding means and standard deviations of the underlying measured log-ratios are summarized for each method based on the breast cancer gene expression data set from [3]. The different methods were grouped into three categories according to their proportion of identified differentially expressed genes and the corresponding mean log-ratio columns. The rightmost column specifies the runtimes of the different methods required to analyze the data set. All methods except ChARM, MixMod, DSHMM and AR()-HMM() were run on the ADaCGH web-server [52] utilizing parallel computations (AMD Opteron 2.2 GHz CPUs with 6 GB RAM). The remaining methods were run on a standard laptop with Intel CPU T9500 2.6 GHz and 2 GB RAM.
Figure 4. Comparison of an autoregressive higher-order HMM to related existing methods.
Comparison of the AR(4)-HMM(2) and related methods with respect to the identification of overexpressed genes with at least three-fold increased copy numbers based on breast cancer data from [3]. The performance of each method is quantified by a receiver operating characteristic (ROC) curve displaying the true positive rate (TPR) reached at different levels of false positive rates (FPR). The AR(4)-HMM(2) with a fourth-order autoregressive emission process and a second-order state-transition process reaches the best performance (red).
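A ROC curve of the kind described here can be computed from per-gene scores and a binary reference labeling. The sketch below is a generic construction with hypothetical names and toy data. It assumes distinct scores (ties are not grouped) and adds a trapezoidal area under the curve as a single-number summary, which is not a measure reported in this caption.

```python
import numpy as np

def roc_curve(scores, labels):
    """Return (fpr, tpr) arrays obtained by lowering the threshold gene by gene."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores, kind="stable")     # highest score first
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)                  # true positives so far
    fp = np.cumsum(~sorted_labels)                 # false positives so far
    tpr = np.concatenate(([0.0], tp / labels.sum()))
    fpr = np.concatenate(([0.0], fp / (~labels).sum()))
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy example: six genes, three of them in the positive reference set.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
roc_auc = area_under_curve(fpr, tpr)
print(round(roc_auc, 4))  # 0.8889, i.e. 8 of 9 positive/negative pairs ranked correctly
```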
Figure 5. Most discriminative signaling pathways distinguishing different types of gliomas.
Overview of known cancer signaling pathways identified to show largely distinct patterns of overexpressed genes in oligodendrogliomas, astrocytomas and glioblastomas based on predictions of the AR()-HMM(). The overlap of the top-ranking overexpressed genes with the specific signaling pathways is quantified from top 100 to top 600. Grey curves show random expectations with respect to the number of genes in a specific pathway. Robust systematic differences between the different types of gliomas are clearly visible.
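The overlap analysis in this caption can be sketched as below. The random expectation is taken here as the mean of a hypergeometric draw (top-k size times the fraction of measured genes in the pathway); how the grey curves were actually computed is not stated in the caption, so this is an assumption, and all names and data are hypothetical.

```python
def pathway_overlap(ranked_genes, pathway, top_sizes=(100, 200, 300, 400, 500, 600)):
    """For each top-k cut, return (k, observed overlap, expected overlap by chance)."""
    pathway = set(pathway)
    n_total = len(ranked_genes)
    # Only pathway genes that were actually measured can be drawn by chance.
    n_pathway = len(pathway & set(ranked_genes))
    results = []
    for k in top_sizes:
        observed = len(pathway.intersection(ranked_genes[:k]))
        expected = k * n_pathway / n_total         # hypergeometric mean
        results.append((k, observed, expected))
    return results

# Hypothetical ranking of 1000 genes (most overexpressed first) and a
# 50-gene pathway concentrated among the top 200 ranks.
ranked_genes = [f"g{i}" for i in range(1000)]
pathway = [f"g{i}" for i in range(0, 200, 4)]

overlap = pathway_overlap(ranked_genes, pathway, top_sizes=(100, 200, 300))
for k, observed, expected in overlap:
    print(k, observed, expected)
```

An observed overlap well above the expected value at every cut is what a pathway with systematically overexpressed genes would look like in such a plot.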
Figure 6. Systematic characterization of the most discriminative signaling pathways distinguishing different types of gliomas.
a) Characteristic view on the most discriminative pathways between oligodendrogliomas, astrocytomas and glioblastomas at the level of the top 300 overexpressed genes in Figure 5. b) Selected gene-based view on the most discriminative signaling pathways shown in a). The Venn diagrams show pathway-specific overlaps of overexpressed genes between the different types of gliomas. The strong overlap of genes between the different types of gliomas indicates the presence of common core sets of affected genes. These pathway-specific core gene sets are further extended towards the glioma with the greatest number of overexpressed genes. The corresponding genes are summarized in Table 2.
Table 2. Genes overexpressed in the most discriminative pathways distinguishing different types of gliomas.
| Gene | OD | AS | GBM | Signaling Pathways | Annotation |
| ANGPT2 | 0 | 0 | 1 | PI3K-Akt | Angiopoietin-2 |
| CAV1 | 0 | 0 | 1 | Focal Adh. | Caveolin |
| COL1A1 | 0 | 0 | 1 | ECM, Focal Adh., PI3K-Akt | Collagen alpha-1(I) chain |
| COL4A2 | 0 | 0 | 1 | ECM, Focal Adh., PI3K-Akt | Canstatin |
| COL5A2 | 0 | 0 | 1 | ECM, Focal Adh., PI3K-Akt | Collagen alpha-2(V) chain |
| FN1 | 0 | 0 | 1 | ECM, Focal Adh., PI3K-Akt | Fibronectin (Ugl-Y3) |
| LAMB1 | 0 | 0 | 1 | ECM, Focal Adh., PI3K-Akt | Laminin subunit beta-1 |
| LAMC1 | 0 | 0 | 1 | ECM, Focal Adh., PI3K-Akt | Laminin subunit gamma-1 |
| VEGFA | 0 | 0 | 1 | Focal Adh., PI3K-Akt, VEGF | Vascular endothelial growth factor A |
| RELA | 0 | 1 | 0 | PI3K-Akt | Transcription factor p65 |
| TGFB1 | 0 | 1 | 0 | TGF-Beta | Transforming growth factor beta-1 |
| CDK2 | 0 | 1 | 1 | PI3K-Akt | Cyclin-dependent kinase 2 |
| COL1A2 | 0 | 1 | 1 | ECM, Focal Adh., PI3K-Akt | Collagen alpha-2(I) chain |
| COL4A1 | 0 | 1 | 1 | ECM, Focal Adh., PI3K-Akt | Collagen alpha-1(IV) chain |
| LAMB2 | 0 | 1 | 1 | ECM, Focal Adh., PI3K-Akt | Laminin subunit beta-2 |
| TLR2 | 0 | 1 | 1 | PI3K-Akt | Toll-like receptor 2 |
| BMP7 | 1 | 0 | 0 | TGF-Beta | Bone morphogenetic protein 7 |
| NOG | 1 | 0 | 0 | TGF-Beta | Noggin |
| NOTCH1 | 1 | 0 | 0 | Notch | Neurogenic locus notch homolog protein 1 |
| PGF | 1 | 0 | 0 | Focal Adh., PI3K-Akt | Placenta growth factor |
| RFC4 | 1 | 0 | 0 | DNA Repair, Telomere | Replication factor C subunit 4 |
| RFC5 | 1 | 0 | 0 | DNA Repair, Telomere | Replication factor C subunit 5 |
| RUVBL1 | 1 | 0 | 0 | Telomere | RuvB-like 1 |
| BMP2 | 1 | 1 | 0 | TGF-Beta | Bone morphogenetic protein 2 |
| CBLB | 1 | 1 | 0 | TGF-Beta | E3 ubiquitin-protein ligase CBL-B |
| DDIT4 | 1 | 1 | 0 | PI3K-Akt | DNA damage-inducible transcript 4 protein |
| DLL3 | 1 | 1 | 0 | Notch | Delta-like protein 3 |
| E2F5 | 1 | 1 | 0 | TGF-Beta | Transcription factor E2F5 |
| ID1 | 1 | 1 | 0 | TGF-Beta | DNA-binding protein inhibitor ID-1 |
| ID4 | 1 | 1 | 0 | TGF-Beta | DNA-binding protein inhibitor ID-4 |
| MAML2 | 1 | 1 | 0 | Notch | Mastermind-like protein 2 |
| CD44 | 1 | 1 | 1 | ECM | CD44 antigen |
| CDK4 | 1 | 1 | 1 | PI3K-Akt | Highly similar to Cell division protein kinase 4 |
| COL3A1 | 1 | 1 | 1 | ECM, Focal Adh., PI3K-Akt | Collagen alpha-1(III) chain |
| DTX3L | 1 | 1 | 1 | Notch | E3 ubiquitin-protein ligase DTX3L |
| EIF4EBP1 | 1 | 1 | 1 | PI3K-Akt, TGF-Beta | Euk. transl. initiation factor 4E-binding protein 1 |
| F2R | 1 | 1 | 1 | PI3K-Akt | Proteinase-activated receptor 1 |
| ID3 | 1 | 1 | 1 | TGF-Beta | DNA-binding protein inhibitor ID-3 |
| MYC | 1 | 1 | 1 | PI3K-Akt, TGF-Beta | Myc proto-oncogene protein |
| TNC | 1 | 1 | 1 | ECM, Focal Adh., PI3K-Akt | Tenascin |
| TP53 | 1 | 1 | 1 | PI3K-Akt | Cellular tumor antigen p53 |
Overexpressed genes representing the most discriminative cancer signaling pathways distinguishing oligodendrogliomas (OD), astrocytomas (AS) and glioblastomas (GBM) at the level of the top 300 genes (Figure 6). Genes identified as overexpressed in a specific glioma type are indicated by ‘1’, otherwise ‘0’. The column ‘Signaling Pathways’ represents the corresponding membership of each gene in one or more of these pathways.
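The 0/1 indicators in the table map directly onto the regions of a three-set Venn diagram. The sketch below groups a few rows copied from the table by their indicator pattern; the function and variable names are hypothetical.

```python
from collections import defaultdict

TYPES = ("OD", "AS", "GBM")

# A few rows copied from the table: gene -> (OD, AS, GBM) indicators.
indicators = {
    "VEGFA": (0, 0, 1),
    "TGFB1": (0, 1, 0),
    "CDK2": (0, 1, 1),
    "NOTCH1": (1, 0, 0),
    "BMP2": (1, 1, 0),
    "MYC": (1, 1, 1),
    "TP53": (1, 1, 1),
}

def venn_regions(indicators):
    """Group genes by the exact set of glioma types in which they are overexpressed."""
    regions = defaultdict(list)
    for gene, flags in indicators.items():
        key = tuple(t for t, flag in zip(TYPES, flags) if flag)
        regions[key].append(gene)
    return dict(regions)

regions = venn_regions(indicators)
core = regions[("OD", "AS", "GBM")]    # genes shared by all three glioma types
print(core)  # ['MYC', 'TP53']
```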