Wenjuan Gu, Hyungwon Choi, Debashis Ghosh.
Abstract
With an increasing number of cancer profiling studies assaying both mRNA transcript levels and DNA copy number, a natural question is whether information can be combined across the two types of genomic data. In this article, we assess the nature of the association between the two types of data across several experiments. We report several findings: 1) the global correlation between gene expression and copy number is relatively weak but consistent across studies; 2) there is strong evidence for a cis-dosage effect of copy number on gene expression; 3) segmenting the copy number levels helps to improve correlations.
Keywords: circular binary segmentation; high-dimensional data; machine learning; two-color microarray platform
Year: 2008 PMID: 19259399 PMCID: PMC2623285 DOI: 10.4137/cin.s342
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Description of datasets used in the analysis. The Dataset column gives the first author of the publication containing the data.
| Dataset | Organ site | In vivo/in vitro | Number of genes | Number of samples |
|---|---|---|---|---|
| Pollack | Breast | In vivo | 4841 | 36 |
| Hyman | Breast | In vitro | 6823 | 14 |
| Heidenblad | Pancreas | In vitro | 8879 | 16 |
| Zhao | Prostate | In vitro | 14824 | 8 |
| Kim | Lung | In vitro | 21066 | 22 |
Summary statistics for pairwise correlation between copy number and expression.
| Dataset | Min | 1st Quartile | Median | Mean | 3rd Quartile | Max | SD |
|---|---|---|---|---|---|---|---|
| Pollack | −0.49 | −0.02 | 0.11 | 0.12 | 0.26 | 0.90 | 0.22 |
| Hyman | −0.83 | −0.07 | 0.14 | 0.15 | 0.36 | 0.99 | 0.31 |
| Heidenblad | −0.79 | −0.04 | 0.16 | 0.15 | 0.34 | 0.92 | 0.28 |
| Zhao | −0.95 | −0.14 | 0.19 | 0.16 | 0.48 | 0.99 | 0.41 |
| Kim | −0.79 | 0.02 | 0.22 | 0.20 | 0.39 | 0.94 | 0.27 |
Notes: Min refers to minimum; Max refers to maximum. SD is standard deviation.
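The summary above can be illustrated with a minimal sketch, assuming each dataset is available as two mappings from gene identifier to a list of values across the same samples in the same order. The function names, the data layout, and the choice of quartile method are assumptions made for illustration; the paper's own preprocessing is not reproduced here.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return sxy / math.sqrt(sxx * syy)


def summarize_gene_correlations(copy_number, expression):
    """Summarize per-gene correlations between copy number and expression.

    copy_number, expression: dict mapping gene id -> list of values,
    one value per sample (hypothetical layout, same sample order in both).
    """
    rs = [pearson(copy_number[g], expression[g])
          for g in copy_number if g in expression]
    rs = [r for r in rs if not math.isnan(r)]
    q1, median, q3 = statistics.quantiles(rs, n=4, method="inclusive")
    return {
        "min": min(rs), "q1": q1, "median": median,
        "mean": statistics.mean(rs), "q3": q3, "max": max(rs),
        "sd": statistics.stdev(rs),
    }


# Toy data: three genes measured on three samples.
toy_cn = {"g1": [1, 2, 3], "g2": [1, 2, 3], "g3": [1, 2, 3]}
toy_ex = {"g1": [2, 4, 6], "g2": [3, 2, 1], "g3": [2, 1, 2]}
toy_summary = summarize_gene_correlations(toy_cn, toy_ex)
```

Each gene contributes one correlation coefficient, so the summary describes the distribution of coefficients over genes within a dataset.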
Summary of univariate correlation results across five studies.
| Dataset | Number of genes | # genes most correlated with themselves | # genes most correlated with genes from the same chromosome |
|---|---|---|---|
| Pollack | 4841 | 42 (0.90%) | 619 (13.21%) |
| Hyman | 6823 | 28 (0.41%) | 648 (9.50%) |
| Heidenblad | 8879 | 8 (0.09%) | 713 (8.03%) |
| Zhao | 14284 | 3 (0.02%) | 853 (6.09%) |
| Kim | 21066 | 79 (0.5%) | 2344 (14.0%) |
Notes: In this table, the third column gives the number of genes whose expression has its largest correlation in magnitude with the copy number of the same gene (i.e. the same spot on the microarray). The fourth column gives the number of genes whose largest correlation between expression and copy number was with a spot that mapped to a gene on the same chromosome.
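The counting described in the notes can be sketched as follows. This is a brute-force illustration under stated assumptions: the function name, the dictionary layout, and the `chromosome` lookup are hypothetical, ties are broken by taking the first best match, and a gene that best matches itself is also counted as a same-chromosome match.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def count_best_matches(copy_number, expression, chromosome):
    """Count genes whose best copy-number match is themselves / same chromosome.

    copy_number, expression: dict gene id -> list of values across samples.
    chromosome: dict gene id -> chromosome label (hypothetical annotation).
    """
    n_self = n_same_chrom = 0
    for gene, expr in expression.items():
        best, best_abs = None, -1.0
        for other, cn in copy_number.items():
            r = pearson(expr, cn)
            if not math.isnan(r) and abs(r) > best_abs:
                best, best_abs = other, abs(r)
        if best is None:
            continue  # no defined correlation for this gene
        if best == gene:
            n_self += 1
        if chromosome[best] == chromosome[gene]:
            n_same_chrom += 1
    return n_self, n_same_chrom


# Toy data: three genes on four samples, two chromosomes.
toy_ex = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1.5], "C": [1, -1, 1, -1]}
toy_cn = {"A": [2, 4, 6, 8], "B": [1, -1, 1, -1], "C": [0, 1, 0, 3]}
toy_chrom = {"A": "1", "B": "1", "C": "2"}
toy_counts = count_best_matches(toy_cn, toy_ex, toy_chrom)
```

The double loop is quadratic in the number of genes, which is acceptable for a sketch but would likely need a vectorized implementation at the scale of the datasets listed above.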
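Correlations across five datasets taking segmentation into account. Each block of four rows corresponds to one dataset.
| Segmentation | Min | 1st Quartile | Median | 3rd Quartile | Max | SD | MAD |
|---|---|---|---|---|---|---|---|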
| Both Unsegmented | −0.859 | −0.057 | 0.159 | 0.373 | 0.956 | 0.303 | 0.246 |
| Expression Segmented | −0.805 | −0.025 | 0.211 | 0.432 | 0.974 | 0.317 | 0.259 |
| Copynumber Segmented | −0.902 | −0.022 | 0.219 | 0.453 | 0.969 | 0.324 | 0.267 |
| Both Segmented | −0.734 | 0.259 | 0.471 | 0.652 | 0.976 | 0.281 | 0.224 |
| Both Unsegmented | −0.777 | −0.051 | 0.147 | 0.34 | 0.949 | 0.276 | 0.224 |
| Expression Segmented | −0.746 | 0.069 | 0.268 | 0.463 | 0.948 | 0.285 | 0.229 |
| Copynumber Segmented | −0.804 | −0.008 | 0.196 | 0.395 | 0.952 | 0.288 | 0.234 |
| Both Segmented | −0.691 | 0.247 | 0.464 | 0.645 | 0.975 | 0.288 | 0.231 |
| Both Unsegmented | −0.547 | −0.028 | 0.106 | 0.248 | 0.898 | 0.208 | 0.165 |
| Expression Segmented | −0.607 | 0.082 | 0.241 | 0.399 | 0.947 | 0.226 | 0.183 |
| Copynumber Segmented | −0.54 | 0.005 | 0.166 | 0.331 | 0.914 | 0.231 | 0.186 |
| Both Segmented | −0.396 | 0.327 | 0.479 | 0.62 | 0.971 | 0.214 | 0.172 |
| Both Unsegmented | −0.946 | −0.127 | 0.196 | 0.488 | 0.989 | 0.406 | 0.338 |
| Expression Segmented | −0.939 | 0.085 | 0.42 | 0.675 | 0.99 | 0.401 | 0.331 |
| Copynumber Segmented | −0.968 | −0.115 | 0.227 | 0.525 | 0.99 | 0.418 | 0.348 |
| Both Segmented | −0.923 | 0.309 | 0.623 | 0.819 | 0.997 | 0.383 | 0.304 |
| Both Unsegmented | −0.803 | 0.051 | 0.241 | 0.42 | 0.972 | 0.263 | 0.214 |
| Expression Segmented | −0.726 | 0.124 | 0.328 | 0.522 | 0.932 | 0.279 | 0.228 |
| Copynumber Segmented | −0.722 | 0.077 | 0.287 | 0.479 | 0.957 | 0.278 | 0.228 |
| Both Segmented | −0.644 | 0.275 | 0.482 | 0.675 | 0.924 | 0.282 | 0.228 |
Notes: SD represents standard deviation; MAD represents mean absolute deviation. Both Unsegmented refers to correlation coefficients between gene expression and copy number computed from the raw values after the preprocessing steps described in the paper. Expression Segmented refers to correlation coefficients computed after running the algorithm of Olshen et al. (2004) on the gene expression data of each individual sample. Copynumber Segmented refers to correlation coefficients computed after running the same algorithm on the copy number data of each individual sample. Both Segmented refers to correlation coefficients computed after running the algorithm on both the copy number and the gene expression data of each individual sample.
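To make the idea of segmenting a single sample's profile concrete, here is a simplified stand-in, not the algorithm of Olshen et al. (2004). It uses plain recursive binary segmentation with a fixed gain threshold, whereas circular binary segmentation (named in the keywords) uses a different split statistic and a significance test. The function name and the `min_gain` and `min_size` parameters are assumptions chosen for illustration. Values are assumed to be ordered by genomic position within one chromosome of one sample.

```python
def segment(values, min_gain=1.0, min_size=2):
    """Replace a profile by piecewise-constant segment means.

    Recursively splits where the reduction in within-segment sum of
    squares is largest, stopping when the best reduction is at most
    min_gain. Simplified illustration only; NOT circular binary
    segmentation.
    """
    n = len(values)
    total = sum(values)
    best_k, best_gain = None, min_gain
    left_sum = 0.0
    for k in range(1, n):
        left_sum += values[k - 1]
        if k < min_size or n - k < min_size:
            continue  # keep every segment at least min_size long
        diff = left_sum / k - (total - left_sum) / (n - k)
        gain = k * (n - k) / n * diff * diff
        if gain > best_gain:
            best_k, best_gain = k, gain
    if best_k is None:
        return [total / n] * n  # no worthwhile split: one flat segment
    return (segment(values[:best_k], min_gain, min_size)
            + segment(values[best_k:], min_gain, min_size))


# Toy profile: a baseline region followed by a gained region.
toy_profile = [0.1, -0.1, 0.0, 0.1, 2.0, 2.1, 1.9, 2.0]
toy_segmented = segment(toy_profile)
```

After smoothing each sample in this way, the per-gene correlations across samples would be computed on the segmented values instead of the raw ones, which corresponds to the Expression Segmented, Copynumber Segmented, and Both Segmented rows depending on which data type is smoothed.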