Chongzhi Zang, Tao Wang, Ke Deng, Bo Li, Sheng'en Hu, Qian Qin, Tengfei Xiao, Shihua Zhang, Clifford A Meyer, Housheng Hansen He, Myles Brown, Jun S Liu, Yang Xie, X Shirley Liu.
Abstract
High-dimensional genomic data analysis is challenging because of noise and bias in high-throughput experiments. We present a computational method, matrix analysis and normalization by concordant information enhancement (MANCIE), for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and The Cancer Genome Atlas (TCGA) data, and copy number and expression agreement in Cancer Cell Line Encyclopedia (CCLE) data, and has broad applications in cross-platform, high-dimensional data integration.
Year: 2016 PMID: 27072482 PMCID: PMC4833864 DOI: 10.1038/ncomms11305
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1: Overview of MANCIE.
Each row vector in the adjusted matrix is generated from the corresponding row vectors in the main matrix and the associated matrix. On the basis of the correlation between the main row vector m and the associated row vector c, one of three scenarios will be chosen. See more details in the online methods.
Figure 2: Case study on ENCODE data.
(a,b) Multi-dimensional scaling map representing genomic data from 61 cell lines. Each data point represents a cell line, with its tissue type labelled in the same colour as in the legend. (a) Top, raw DHS data; bottom, MANCIE-adjusted DHS data. (b) Top, raw expression data; bottom, MANCIE-adjusted expression data. (c,d) Adjusted Rand index comparing K-means clustering on the data with actual tissue-type clustering. K-means clustering was performed 1,000 times with random seeds. The three boxes represent original data (blue), data MANCIE-adjusted with random data matrices (cyan) and data MANCIE-adjusted with the other data type (red). (c) DHS data; (d) gene-expression data. P values were calculated using the Wilcoxon rank sum test. (e) Relationship between the magnitude of MANCIE adjustment and the deviation of the GC-content distribution of DNase-seq reads. The magnitude of MANCIE adjustment was calculated as the Euclidean distance between the sample data vectors before and after MANCIE adjustment. The deviation refers to the distance from each sample's data point to the centre of mass in the map of mean versus coefficient of variation of the GC-content distribution in Supplementary Fig. 2c. Labels in parentheses give the top sequence motif enriched in the most increased DHS in the corresponding cell line after MANCIE adjustment.
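The adjusted Rand index used in panels (c,d) can be illustrated with a minimal sketch. This is the standard pair-counting definition written with the Python standard library only; it is not the authors' code, and the function name and the toy label vectors in the usage note are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples.

    Standard pair-counting definition: 1.0 means identical partitions
    (up to renaming of clusters); values near 0 mean chance-level agreement.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: partitions trivially agree

    # Contingency counts: samples falling in each (true, predicted) cell.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together under each view.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (together_both - expected) / (maximum - expected)
```

For example, with hypothetical tissue labels `["blood", "blood", "liver", "liver"]` and K-means assignments `[1, 1, 0, 0]`, the function returns 1.0, because the two partitions are identical apart from cluster names. Repeating this over many K-means runs with different seeds would give a distribution like the boxes described above.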
Figure 3: Case studies on METABRIC and TCGA data.
(a) Kaplan–Meier plots for an example showing the dichotomized risk scores from the original matrices (left) and the adjusted matrices (right) under a correlation threshold of 0.93 using the METABRIC data. Patient samples were separated into two groups according to the predicted risk scores from the selected genes. The high-risk group is labelled in red and the low-risk group is labelled in blue. The high-risk group is better separated from the low-risk group when using the MANCIE-adjusted expression data (right) than when using the original data (left). (b) P value scores (−log10 P value) in survival prediction using METABRIC gene-expression data, comparing before and after MANCIE adjustment with CNV data. The gene selection thresholds are set to 0.7, 0.75, 0.8, 0.85, 0.9 and 0.93, from left to right and from top to bottom, respectively. (c) Difference in P value scores (−log10 P value) in survival prediction with each gene signature using TCGA gene-expression data before and after adjustment by MANCIE or SVA. Gene signatures are labelled with the first author's name of the publication. Error bars represent the s.d. of the results from 1,000 random samples.
Figure 4: Case study on CCLE/GDSC data.
(a,b) Correlation between CNV and RNA expression for the gene NDUFC2. The expression data were either raw CCLE data (a) or CCLE data MANCIE-adjusted with GDSC data (b). ρ refers to the Spearman correlation coefficient. (c) Distribution of the correlation difference before and after MANCIE adjustment, for all genes. The Spearman correlation coefficient between CNV and RNA expression was calculated for each gene, and the correlation difference was calculated by subtracting the coefficient without MANCIE adjustment from the coefficient with MANCIE adjustment. P value was calculated using the one-tailed paired t-test. (d) Distribution of the correlation difference comparing the raw data with SVA-adjusted expression data. P value was calculated using the one-tailed paired t-test.
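The per-gene correlation difference described for panel (c) can be sketched as follows. This is an illustrative standard-library implementation of the Spearman coefficient (Pearson correlation of average ranks), not the authors' code; the function names, the dictionary layout keyed by gene, and the toy values in the usage note are all assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_differences(cnv, expr_raw, expr_adjusted):
    """Per gene: rho(CNV, adjusted expression) - rho(CNV, raw expression).

    Each argument is a dict mapping a gene name to a list of values,
    one per cell line, in the same cell-line order (assumed layout).
    """
    return {
        gene: spearman(cnv[gene], expr_adjusted[gene])
        - spearman(cnv[gene], expr_raw[gene])
        for gene in cnv
    }
```

With a made-up gene `"GENE_A"`, CNV values `[1, 2, 3, 4, 5]`, raw expression `[2, 1, 4, 3, 5]` and adjusted expression `[1, 2, 3, 4, 5]`, the raw coefficient is 0.8 and the adjusted coefficient is 1.0, so the difference is 0.2. A positive difference means agreement between CNV and expression improved after adjustment; the paired t-test over all genes is a separate step not shown here.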