David M Budden, Edmund J Crampin.
Abstract
BACKGROUND: Predictive gene expression modelling is an important tool in computational biology due to the volume of high-throughput sequencing data generated by recent consortia. However, the scope of previous studies has been restricted to a small set of cell-lines or experimental conditions due to an inability to leverage distributed processing architectures for large, sharded data-sets.
Keywords: Epigenetics; Gene expression; Histone modifications; MapReduce
Year: 2016 PMID: 27816056 PMCID: PMC5097851 DOI: 10.1186/s12859-016-1313-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
All ENCODE cell-lines for which matched ChIP-seq data was available for the full set of histone modifications considered in this study (listed in Table 2)
| Cell-line | Tier | Description | Lineage | Tissue | Karyotype |
|---|---|---|---|---|---|
| A549 | 2 | Alveolar carcinoma | Endoderm | Epithelium | Cancer |
| GM12878 | 1 | B-lymphocyte | Mesoderm | Blood | Normal |
| H1-hESC | 1 | Embryonic stem cells | Inner cell mass | Embryonic stem cell | Normal |
| HeLa-S3 | 2 | Cervical carcinoma | Ectoderm | Cervix | Cancer |
| HepG2 | 2 | Hepatocellular carcinoma | Endoderm | Liver | Cancer |
| HUVEC | 2 | Umbilical vein endothelial cells | Mesoderm | Blood vessel | Normal |
| K562 | 1 | Leukemia | Mesoderm | Blood | Cancer |
| NHEK | 3 | Epidermal keratinocytes | Ectoderm | Skin | Normal |
All histone modifications considered in this study. The remaining histone modifications available from ENCODE are unsuitable for this study as they exert their functional role in non-promoter regions (e.g. H3K36me3 in the 3′-UTR)
| Histone modification | Regulatory role | Chromatin localisation |
|---|---|---|
| | Bivalency | Euchromatin |
| | Activator/Bivalency | Euchromatin |
| | Activator | Euchromatin |
| | Repressor | Constitutive heterochromatin |
| | Activator | Euchromatin |
| | Repressor/Bivalency | Facultative heterochromatin |
Fig. 1 Density plots of predicted (Ŷ) versus measured (Y) mRNA transcript abundance for cancerous (top row, mean adj. R² = 0.608) and normal cell-lines (bottom row, mean adj. R² = 0.581). The adj. R² performance and λ regularisation parameter (fitted using 10-fold cross-validation) are reported for each cell-line
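
The adj. R² values reported in this caption penalise the ordinary R² by the number of predictors. A minimal sketch of the standard formula follows, assuming n genes and p predictors; the function name `adjusted_r2` and the toy values are illustrative and not taken from the study.

```python
def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    y            : measured values (e.g. transcript abundance per gene)
    y_hat        : predicted values, same length as y
    n_predictors : p, the number of model predictors (assumed, not from the source)
    """
    n = len(y)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    mean_y = sum(y) / n
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up toy values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
toy_adj_r2 = adjusted_r2(measured, predicted, n_predictors=1)
```

With many genes and only a handful of predictors, the correction is small, so adj. R² and R² are nearly identical in that regime.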
Fig. 2 Hierarchical clustering of cell-lines by mRNA transcript abundance residuals. The three mesodermal derivatives GM12878, K562 and HUVEC cluster together, suggesting that residuals are partially non-random and instead convey meaningful biological information. Consistently, it is evident that the expression levels of many genes are poorly predicted across all eight cell-lines, presumably capturing divergence from histone modification-mediated regulation (explored in detail in our previous study [2])
Fig. 3 a Genome-wide accuracy of mRNA transcript abundance predictions (adj. R²) for models trained and tested on each pairwise combination of cell lines. These results are strikingly non-symmetric, with significant dissimilarity between columns (predictions) but not rows (training observations). b Distribution of each fitted model parameter across all cell-lines considered in this study
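
The pairwise design in panel a (train on one cell line, predict every cell line) can be sketched as below. This is a simplified illustration, not the study's model: it swaps in one-predictor ordinary least squares without regularisation, scores with plain R² rather than adj. R², and uses made-up data. The names `fit_line`, `r2`, `pairwise_r2`, `cell_A` and `cell_B` are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (single predictor, no regularisation)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r2(y, y_hat):
    """Coefficient of determination; negative when worse than predicting the mean."""
    my = mean(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def pairwise_r2(data):
    """data maps cell-line name -> (x, y). Returns {(train, test): R^2}."""
    scores = {}
    for train, (x_tr, y_tr) in data.items():
        a, b = fit_line(x_tr, y_tr)          # row: the training cell line
        for test, (x_te, y_te) in data.items():
            y_hat = [a + b * xi for xi in x_te]
            scores[(train, test)] = r2(y_te, y_hat)  # column: the predicted cell line
    return scores


# Made-up toy data: same predictor values, different response scale.
toy = {
    "cell_A": ([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]),
    "cell_B": ([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]),
}
scores = pairwise_r2(toy)
```

Even in this toy case the matrix is not symmetric: the score for training on `cell_A` and testing on `cell_B` differs from the reverse, because R² is normalised by the variance of whichever cell line is being predicted.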