Gregory W Schwartz, Jelena Petrovic, Yeqiao Zhou, Robert B Faryabi.
Abstract
High-throughput analyses of the transcriptome and the proteome are each used to interrogate complex oncogenic processes in cancer. However, an outstanding challenge is how to combine these complementary, yet partially disparate, data sources to accurately identify tumor-specific gene products and clinical biomarkers. Here, we introduce inteGREAT for robust and scalable differential integration of high-throughput measurements. With inteGREAT, each data source is represented as a co-expression network, which is analyzed to characterize the local and global structure of each node across networks. inteGREAT scores the degree to which the topology of each gene in both the transcriptome and proteome networks is conserved within a tumor type, yet different from other normal or malignant cells. We demonstrated the high performance of inteGREAT through several analyses: deconvolving synthetic networks, rediscovering known diagnostic biomarkers, establishing relationships between tumor lineages, and elucidating putative prognostic biomarkers, which we experimentally validated. Furthermore, we introduce the application of a clumpiness measure to quantitatively describe tumor lineage similarity. Together, inteGREAT not only infers functional and clinical insights from the integration of transcriptomic and proteomic data sources in cancer, but can also be readily applied to other heterogeneous high-throughput data sources. inteGREAT is open source and available to download from https://github.com/faryabib/inteGREAT.
Keywords: cancer biology; data integration; network analysis; proteomics; transcriptomics
Year: 2018 PMID: 29971090 PMCID: PMC6018483 DOI: 10.3389/fgene.2018.00205
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Overview of the inteGREAT algorithm. inteGREAT supports both a non-differential (top left) and a differential (top right) integration analysis. Abundance data for all genes from the transcriptome (left matrix for both analyses) and the proteome (right matrix for both analyses) are used to generate a correlation network for each data source. For the differential integration analysis, these abundance values come from two phenotypes, as shown by the red and blue overlays. inteGREAT measures the local similarity (structure of the immediate neighbors, one hop away, shown by red-marked edges) or the global similarity (structure of the neighborhood multiple hops away, as defined by the random walk with restart) for each gene and produces a ranked-order list of putative biomarkers with assigned confidences. Increasing red fills of nodes mark ascending gene rank.
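The workflow in Figure 1 can be illustrated with a minimal Python sketch. It is not the inteGREAT implementation: the function names, the use of Pearson correlation, the cosine comparison of edge profiles, and the restart probability of 0.5 are all assumptions chosen for illustration, and the tool itself may define these steps differently.

```python
import numpy as np

def correlation_network(abundance):
    """Gene-by-gene correlation matrix from a genes x samples abundance matrix."""
    net = np.corrcoef(abundance)
    np.fill_diagonal(net, 0.0)  # ignore self-correlation
    return net

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def local_similarity(net_a, net_b, gene):
    """Assumed local score: compare a gene's one-hop edge weights in two networks."""
    return cosine(net_a[gene], net_b[gene])

def random_walk_with_restart(net, gene, restart=0.5, tol=1e-10, max_iter=1000):
    """Visiting probabilities of a walker that keeps restarting at `gene`."""
    weights = np.abs(net)
    col_sums = weights.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    transition = weights / col_sums        # column-stochastic transition matrix
    start = np.zeros(net.shape[0])
    start[gene] = 1.0
    p = start.copy()
    for _ in range(max_iter):
        nxt = (1 - restart) * (transition @ p) + restart * start
        if np.abs(nxt - p).sum() < tol:
            return nxt
        p = nxt
    return p

def global_similarity(net_a, net_b, gene, restart=0.5):
    """Assumed global score: compare multi-hop visiting profiles of a gene."""
    return cosine(random_walk_with_restart(net_a, gene, restart),
                  random_walk_with_restart(net_b, gene, restart))
```

Under these assumptions, a gene whose neighborhood looks alike in the transcriptome and proteome networks scores close to 1, and ranking genes by such scores gives an ordered list of candidates.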
Figure 2. Accuracy of inteGREAT for various noise types across different numbers of data sources (networks) and network sizes (facet label). (A) Simulations with a varying percentage of the edges permuted. Existing edges are randomly chosen for permutation from a uniform distribution. (B) Simulations with a varying percentage of the vertices deleted. Vertices are randomly chosen for deletion from a uniform distribution. (C) Simulations with a varying amount of noise. Noise refers to the addition of random values to the network edges, drawn from a Gaussian distribution whose standard deviation is given on the x-axis.
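The three perturbations in Figure 2 can be sketched on a symmetric weighted adjacency matrix as follows. This is a hedged illustration, not the simulation code behind the figure: the function names are hypothetical, and "permuting" is assumed here to mean shuffling weights among the selected existing edges.

```python
import numpy as np

def permute_edges(net, fraction, rng):
    """Shuffle the weights of a uniformly chosen fraction of existing edges."""
    rows, cols = np.triu_indices(net.shape[0], k=1)
    existing = np.flatnonzero(net[rows, cols] != 0)
    n_pick = int(round(fraction * existing.size))
    picked = rng.choice(existing, size=n_pick, replace=False)
    shuffled = rng.permutation(picked)
    out = net.copy()
    out[rows[picked], cols[picked]] = net[rows[shuffled], cols[shuffled]]
    out[cols[picked], rows[picked]] = out[rows[picked], cols[picked]]  # keep symmetry
    return out

def delete_vertices(net, fraction, rng):
    """Drop a uniformly chosen fraction of vertices; also return the kept indices."""
    n = net.shape[0]
    drop = rng.choice(n, size=int(round(fraction * n)), replace=False)
    keep = np.setdiff1d(np.arange(n), drop)
    return net[np.ix_(keep, keep)], keep

def add_gaussian_noise(net, sd, rng):
    """Add zero-mean Gaussian noise with standard deviation `sd` to every edge."""
    noise = np.triu(rng.normal(scale=sd, size=net.shape), k=1)
    return net + noise + noise.T  # symmetric noise, diagonal untouched
```

Passing a seeded generator such as `np.random.default_rng(1)` makes each simulated perturbation reproducible.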
Figure 3. Differential integration of basal vs. luminal breast cancer subtypes identified known gene-programs associated with these tumor subtypes and detected ESR1 and GATA3 as their differential biomarkers. (A) Top 20 gene-programs associated with the most differential genes between basal and luminal breast cancer subtypes for three differential analyses: integration of transcriptome and proteome, transcriptome only, and proteome only. Pre-ranked GSEA was performed based on the gene-programs defined by MSigDB C2 curated gene sets and the ranked-order list of differential genes generated by each analysis. Black cells signify a gene set or pathway among the 20 most significantly enriched pathways for that column. (B) Ten runs of differential integration of basal vs. luminal using local similarity. The final ranked-order list was generated by joining the ranked-order lists of the individual runs using the rank product. ESR1 and GATA3 are marked in red and blue, respectively. (C) Final rank product of 10 runs based on global similarity.
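The joining of per-run lists in panels (B) and (C) can be sketched with the standard rank product, taken here as the geometric mean of a gene's rank across runs. The function name and the convention that higher scores rank first are assumptions; how inteGREAT handles ties and confidence estimates is not described in this record, so ties are simply broken by sort order.

```python
import numpy as np

def rank_product(score_runs):
    """Geometric mean of each gene's rank across runs.

    score_runs: runs x genes array; a higher score is assumed to be more
    differential, so the top score in a run receives rank 1.
    """
    scores = np.asarray(score_runs, dtype=float)
    order = np.argsort(-scores, axis=1)            # descending order per run
    ranks = np.empty_like(order)
    run_idx = np.arange(scores.shape[0])[:, None]
    ranks[run_idx, order] = np.arange(1, scores.shape[1] + 1)
    return np.exp(np.log(ranks).mean(axis=0))      # geometric mean over runs

# Two hypothetical runs over three genes
runs = [[0.9, 0.1, 0.5],
        [0.5, 0.1, 0.9]]
print(rank_product(runs))
```

Genes with the smallest rank product are the ones that rank consistently high across runs.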
Figure 4. Pan-cancer differential integration. (A,B) Relationships between tumor molecular features and tissue-of-origin. (A) Heatmap of Spearman correlations between local similarities of differential integrations, where each tile is the Spearman correlation between two differential integration similarity score vectors. (B) Heatmap of clumpiness values of cancer types from the dendrogram of hierarchically clustered columns of Spearman's rhos for the differential integrations in (A). (C) Identification of putative diagnostic biomarkers. Heatmap of genes with at least one outlier in a differential integration analysis. Columns underwent z-score normalization before outlier removal, and rows after removal. Genes with CI widths >0.04 were removed. (D) Prognostic significance of putative biomarkers from (C), inferred from survival analysis of clinical outcomes reported in the Pathology Atlas database. Each gene was designated as one of four states in each comparative study: an outlier in that comparison (orange), significantly prognostic in a tissue in that comparison (blue), both an outlier and prognostic (purple), or neither (white). (E,F) RNA-seq analysis of ANXA1 (E) and ARF5 (F) expression in HC1599 (basal), MB-157 (basal), and MCF-7 (luminal) cell lines.
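The heatmap in panel (A) amounts to a pairwise Spearman correlation matrix over similarity score vectors, which can be sketched as below. The helper names are hypothetical, and ties are broken by sort order rather than averaged, which is a simplification; the clumpiness measure in panel (B) is not reproduced here.

```python
import numpy as np

def to_ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties broken by sort order."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(values.size)
    ranks[order] = np.arange(1, values.size + 1)
    return ranks

def spearman_matrix(score_vectors):
    """Pairwise Spearman's rho: Pearson correlation of the rank vectors.

    score_vectors: one similarity score vector per differential integration,
    all defined over the same genes in the same order.
    """
    ranked = np.vstack([to_ranks(v) for v in score_vectors])
    return np.corrcoef(ranked)
```

Each row and column of the result corresponds to one differential integration, so the matrix can be passed directly to a hierarchical clustering or heatmap routine.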