Stephen P Ficklin, Leland J Dunwoodie, William L Poehlman, Christopher Watson, Kimberly E Roche, F Alex Feltus.
Abstract
A gene co-expression network (GCN) describes associations between genes and points to genetic coordination of biochemical pathways. However, genetic correlations in a GCN are only detectable if they are present in the sampled conditions. With the increasing quantity of gene expression samples available in public repositories, there is greater potential for discovery of genetic correlations from a variety of biologically interesting conditions. However, even if gene correlations are present, their discovery can be masked by noise. Noise is introduced from natural variation (intrinsic and extrinsic), systematic variation (caused by sample measurement protocols and instruments), and algorithmic and statistical variation created by selection of data processing tools. A variety of published studies, approaches and methods attempt to address each of these contributions of variation to reduce noise. Here we describe an approach using Gaussian Mixture Models (GMMs) to address natural extrinsic (condition-specific) variation during network construction from mixed input conditions. To demonstrate utility, we build and analyze a condition-annotated GCN from a compendium of 2,016 mixed gene expression data sets from five tumor subtypes obtained from The Cancer Genome Atlas. Our results show that GMMs help discover tumor subtype specific gene co-expression patterns (modules) that are significantly enriched for clinical attributes.
Year: 2017 PMID: 28819158 PMCID: PMC5561081 DOI: 10.1038/s41598-017-09094-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. High, Medium, and Low Differences in Gene Expression Dependency. These scatterplots give examples of high, medium, and low differences between the Spearman and Pearson correlation methods. The x- and y-axes show the log2-transformed expression levels of the two genes in each pair. The two plots on the left (top and bottom) show pairs of transcripts for which the two methods differ strongly: either the Pearson correlation coefficient (PCC) is high and the Spearman correlation coefficient (SCC) is low, or vice versa. They contain fewer samples than the other plots because of missing values. The middle two plots show pairs with high correlation under one method and mid-range correlation under the other. The right two plots show pairs for which both PCC and SCC are high. The title of each scatterplot gives the PCC and SCC values for that comparison.
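The contrast between the two coefficients can be reproduced on toy data. The following is a minimal Python sketch, not the authors' code: the expression values in `gene_a` and `gene_b` are hypothetical, and the helper names `pearson`, `ranks`, and `spearman` are illustrative. A single outlying sample pushes the PCC close to 1, while the rank-based SCC stays near zero.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation coefficient (SCC): Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical log2 expression values: an uncorrelated cloud plus one outlier
gene_a = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 8.0]
gene_b = [2.1, 1.8, 2.0, 1.7, 1.9, 2.2, 9.0]

pcc = pearson(gene_a, gene_b)   # high, driven by the single outlying sample
scc = spearman(gene_a, gene_b)  # near zero, because ranks damp the outlier
print(f"PCC={pcc:.2f} SCC={scc:.2f}")
```

Pairs like this one fall into the "high difference" category of the figure: the two methods disagree because one sample dominates the linear fit.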
Figure 2. The Tumor GMM Gene Co-expression Network. (A) Graph representation of the network. Points represent nodes (i.e. transcripts) and edges represent co-expression of transcripts. Modules were identified using the link communities method and are uniquely colored. Not all nodes were assigned to a module. (B) The node degree distribution plot, demonstrating scale-free behavior of the network. (C) The average clustering coefficient plot, demonstrating a hierarchical network.
Figure 3. GMM Pairwise Gene Expression Scatterplots. The Gaussian Mixture Model (GMM) algorithm is applied to the same random examples shown in Fig. 1. Each cluster (mode) of samples is identified with a different color. The position and orientation of the Gaussian variance of each cluster are indicated with a black circle, and the cluster centers lie at the intersection of the variance axes.
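The idea behind this figure (split the samples of a gene pair into Gaussian clusters, then measure correlation inside each cluster) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a plain expectation-maximization (EM) fit with a fixed number of components, the data are synthetic, and the names `gauss2d`, `fit_gmm`, and `pearson` are hypothetical. The source does not describe here how the number of clusters is chosen.

```python
import math
import random

def gauss2d(p, mu, cov):
    """Density of a 2-D Gaussian; cov is given as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mu[0], p[1] - mu[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse
    m = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m) / (2.0 * math.pi * math.sqrt(det))

def fit_gmm(points, means, n_iter=50, reg=1e-6):
    """Fit a GMM with len(means) components by EM; return hard labels."""
    k = len(means)
    means = list(means)
    covs = [(1.0, 0.0, 1.0)] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample
        resp = []
        for p in points:
            dens = [weights[j] * gauss2d(p, means[j], covs[j]) for j in range(k)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and covariance of each component
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj + reg
            syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj + reg
            sxy = sum(r[j] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nj
            weights[j], means[j], covs[j] = nj / len(points), (mx, my), (sxx, sxy, syy)
    return [max(range(k), key=lambda j: r[j]) for r in resp]

def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data (assumption): one condition in which the two genes are
# co-expressed, and a second condition in which they are not.
rng = random.Random(0)
cluster_a = []
for _ in range(60):
    x = rng.gauss(2.0, 0.5)
    cluster_a.append((x, x + rng.gauss(0.0, 0.15)))
cluster_b = [(rng.gauss(8.0, 0.5), rng.gauss(3.0, 0.5)) for _ in range(60)]
samples = cluster_a + cluster_b

# Deterministic initialization: the samples with the smallest and largest x
labels = fit_gmm(samples, [min(samples), max(samples)])

per_cluster_pcc = {}
for label in sorted(set(labels)):
    members = [p for p, l in zip(samples, labels) if l == label]
    per_cluster_pcc[label] = pearson(members)
print(per_cluster_pcc)
```

Testing correlation per cluster rather than over all samples is what lets a condition-specific relationship surface even when the other condition shows none.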
GMM Network Modules with Enriched Clinical Annotations.

| Tumor subtype | BLCA | OV | LGG | THCA | GBM |
|---|---|---|---|---|---|
| Enriched modules | 13 | 15 | 32 | 9 | 18 |

| Sex | Female | Male |
|---|---|---|
| Enriched modules | 11 | 22 |

| Tumor stage | Stage I | Stage II | Stage III | Stage IV | Stage IVA | Stage IVC |
|---|---|---|---|---|---|---|
| Enriched modules | 10 | 3 | 0 | 10 | 5 | 0 |

| Race / ethnicity | NHL | HL | W | AA | A | NHPI | AIAN |
|---|---|---|---|---|---|---|---|
| Enriched modules | 2 | 3 | 22 | 0 | 6 | 0 | 0 |

Each value indicates the number of modules in the GMM network enriched for the specified annotation with a p-value < 0.001. BLCA (bladder cancer), OV (ovarian cancer), LGG (lower grade glioma), THCA (thyroid cancer), GBM (glioblastoma), NHL (not Hispanic or Latino), HL (Hispanic or Latino), W (White), AA (African American), A (Asian), NHPI (Native Hawaiian or Pacific Islander), AIAN (American Indian, Alaska Native).
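The record above does not state which statistical test produced these p-values. As an illustration only, a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) is one common way to ask whether an annotation is over-represented among the samples behind a module. The function name `hypergeom_upper_tail` and all counts other than the 2,016 total samples are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n_draw, n_success, n_total):
    """P(X >= k) when n_draw samples are drawn without replacement from
    n_total samples, of which n_success carry the annotation."""
    upper = min(n_draw, n_success)
    numer = sum(
        comb(n_success, i) * comb(n_total - n_success, n_draw - i)
        for i in range(k, upper + 1)
    )
    return numer / comb(n_total, n_draw)

# Hypothetical counts: 400 of the 2,016 samples carry one annotation, and
# 40 of the 50 samples underlying a module carry it as well.
p_value = hypergeom_upper_tail(k=40, n_draw=50, n_success=400, n_total=2016)
is_enriched = p_value < 0.001  # threshold taken from the table legend
print(p_value, is_enriched)
```

With many modules and annotations tested, a multiple-testing correction would normally be considered as well; whether one was applied is not stated here.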
Figure 4. Network Sample Composition Heat Map. Each edge in the network is annotated with a string of 1’s and 0’s, referred to as a sample string. For the human cancer network, each string consists of 2,016 characters, where a 1 indicates that the sample is present in the cluster that formed the edge and a 0 indicates that it is not. Hierarchical clustering, using the heat map function of the R statistical package, was used to order edges by similarity of their sample strings and to generate this figure. Red indicates a 0 in the sample string and green indicates a 1. Each column of the heat map represents an individual edge in the network. Rows represent samples and are grouped by cancer type (i.e. BLCA, GBM, LGG, OV and THCA). An artificial black line was added to separate the “lanes” of each cancer type.
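A sample string as described in this caption can be built and compared with a few lines of Python. This is a hedged sketch: the eight samples, their type labels, and the helper names `sample_string`, `hamming`, and `composition` are hypothetical, and Hamming distance is used only as one plausible dissimilarity, since the caption does not say which distance the clustering used.

```python
def sample_string(cluster_members, n_samples):
    """'1' where the sample index belongs to the edge's cluster, else '0'."""
    members = set(cluster_members)
    return "".join("1" if i in members else "0" for i in range(n_samples))

def hamming(a, b):
    """Number of positions at which two sample strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def composition(s, sample_types):
    """Fraction of each type's samples that are present in the string."""
    totals, present = {}, {}
    for bit, t in zip(s, sample_types):
        totals[t] = totals.get(t, 0) + 1
        present[t] = present.get(t, 0) + (bit == "1")
    return {t: present[t] / totals[t] for t in totals}

# Hypothetical layout: 8 samples instead of 2,016, grouped by cancer type
sample_types = ["BLCA", "BLCA", "GBM", "GBM", "GBM", "LGG", "LGG", "LGG"]

edge_a = sample_string([2, 3, 4, 5], len(sample_types))  # "00111100"
edge_b = sample_string([2, 3, 4], len(sample_types))     # "00111000"

distance = hamming(edge_a, edge_b)            # 1: the edges differ in one sample
fractions = composition(edge_a, sample_types)  # per-type share of samples present
print(edge_a, edge_b, distance, fractions)
```

Pairwise distances of this kind are what a hierarchical clustering routine would consume to place edges with similar sample membership next to each other.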
Figure 5. Non-GMM vs GMM PCC and SCC values. (A) Scatterplot of Pearson vs Spearman correlation for the edges of a cancer network constructed from the same input dataset but without GMMs. (B) The same plot for the GMM network. The number of points in each panel (corresponding to edges in the network) is indicated by the variable n. Contour lines have been added to indicate point density.