Daniel Capurso, Hao Xiong, Mark R Segal.
Abstract
BACKGROUND: Applying supervised learning/classification techniques to epigenomic data may reveal properties that differentiate histone modifications. Previous analyses sought to classify nucleosomes containing histone H2A/H4 arginine 3 symmetric dimethylation (H2A/H4R3me2s) or H2A.Z using human CD4+ T-cell chromatin immunoprecipitation sequencing (ChIP-Seq) data. However, these efforts achieved only modest accuracy and offered limited biological interpretation. Here, we investigate the impact of using appropriate data pre-processing (deduplication, normalization, and position-finding, i.e. peak-finding, to identify stable nucleosome positions) in conjunction with advanced classification algorithms, notably discriminatory motif feature selection and random forests. Performance assessments are based on accuracy and interpretative yield.
Year: 2012 PMID: 23153121 PMCID: PMC3559892 DOI: 10.1186/1471-2164-13-630
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Classifying stable nucleosomes containing H2A/H4R3me2s or H2A.Z using histone modification features. (a) Receiver Operating Characteristic (ROC) curve, demonstrating classifier performance. (b) Random forests feature importance by mean decrease in Gini index. Features have a higher frequency in H2A/H4R3me2s nucleosomes (red) or H2A.Z nucleosomes (blue). The dashed, vertical line shows the estimated (permutation-based) significance threshold after multiple testing correction. (c) A classification tree with splits (no borders) and leaves (borders), below which is the number of nucleosomes classified correctly and, in parentheses, incorrectly at that stage. Leaves show the predicted class labels of nucleosomes partitioned there. Splits show the condition that best separates the data. Branch labels indicate the directions in which the split condition is true (“yes”) and false (“no”).
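The "mean decrease in Gini index" in panel (b) and the "condition that best separates the data" in panel (c) both rest on the same quantity: how much a split reduces Gini impurity. The following is a minimal sketch of that calculation for a single split, not the authors' implementation; the function names and the toy labels are illustrative assumptions. A random forest would accumulate such decreases over every split made on a feature and average them across trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy, made-up labels (not data from the study).
parent = ["H2A.Z", "H2A.Z", "H2A/H4R3me2s", "H2A/H4R3me2s"]
perfect_split = gini_decrease(parent, parent[:2], parent[2:])
useless_split = gini_decrease(parent, parent[::2], parent[1::2])
print(perfect_split, useless_split)
```

A split that separates the two classes completely yields the largest possible decrease for a balanced two-class node, while a split that leaves both children as mixed as the parent yields none.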
Figure 2. Classifying stable nucleosomes containing H2A/H4R3me2s or H2A.Z using DNA sequence features. (a) Receiver Operating Characteristic (ROC) curve, demonstrating classifier performance. (b) Random forests feature importance by mean decrease in Gini index. Features have a higher frequency in H2A/H4R3me2s nucleosomal DNA (red) or H2A.Z nucleosomal DNA (blue). (c) Frequency histogram of the number of occurrences of the motif TCCATT in H2A/H4R3me2s nucleosomal DNA (red) or H2A.Z nucleosomal DNA (blue).
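A histogram like the one in panel (c) can be built by counting motif occurrences per nucleosomal sequence and then tallying how many sequences share each count. The sketch below assumes forward-strand, case-insensitive matching with overlaps allowed; whether the original analysis also scanned the reverse complement is not stated here. The helper names and toy sequences are hypothetical.

```python
from collections import Counter


def count_motif(sequence, motif="TCCATT"):
    """Count occurrences of `motif` in `sequence` (overlaps allowed, case-insensitive)."""
    seq, mot = sequence.upper(), motif.upper()
    count = 0
    start = seq.find(mot)
    while start != -1:
        count += 1
        # Advance by one base so overlapping matches are not skipped.
        start = seq.find(mot, start + 1)
    return count


def occurrence_histogram(sequences, motif="TCCATT"):
    """Map each occurrence count to the number of sequences with that count."""
    return Counter(count_motif(s, motif) for s in sequences)


# Toy sequences for illustration only (not data from the study).
toy_sequences = ["ATTCCATTCCATT", "GGGCGCGC", "atTCCATTcg"]
histogram = occurrence_histogram(toy_sequences)
print(sorted(histogram.items()))
```

Computing one such histogram per class (H2A/H4R3me2s versus H2A.Z) would give the two distributions that the panel overlays.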
Satellite II and III DNA consensus sequences
| satellite II DNA | [(atTCCATTcg)2 + (atg)1–2]n |
| satellite III DNA | [(ATTCC)7–13 + (ATTcgggttg)1]n |
Subscripts, rendered here as the numbers that follow each closing parenthesis or bracket, indicate the number of occurrences of a subsequence in the consensus sequence. The motif TCCATT is displayed in uppercase. For satellite III DNA, the motif also appears when two instances of the first subsequence (ATTCC) are juxtaposed. Adapted from [30,31].
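To make the consensus notation concrete, the sketch below expands both consensus sequences into plain strings and locates TCCATT within them. It is only an illustration under stated assumptions: the function names and default repeat counts are hypothetical choices within the ranges given in the table, and the sequences are uppercased, so the table's case highlighting is not preserved.

```python
def satellite_ii(atg_repeats=1, n=1):
    """Expand [(atTCCATTcg)2 + (atg)1-2]n for a chosen number of atg repeats."""
    return (("ATTCCATTCG" * 2) + ("ATG" * atg_repeats)) * n


def satellite_iii(attcc_repeats=7, n=1):
    """Expand [(ATTCC)7-13 + (ATTcgggttg)1]n for a chosen number of ATTCC repeats."""
    return (("ATTCC" * attcc_repeats) + "ATTCGGGTTG") * n


def find_motif(sequence, motif="TCCATT"):
    """Return the 0-based start positions of `motif` in `sequence` (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


# A single ATTCC unit does not contain the motif; two juxtaposed units do.
print(find_motif("ATTCC"), find_motif("ATTCC" * 2))

# Motif positions in one expanded consensus unit of each satellite type.
print(find_motif(satellite_ii()))
print(find_motif(satellite_iii()))
```

In the satellite III expansion, every junction between consecutive ATTCC units, and the junction between the last ATTCC and the following ATTcgggttg, produces one copy of the motif, which is the juxtaposition effect the table note describes.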
Figure 3. Relationship between stable nucleosomes containing histone modifications and satellite II and III DNA sequences. (a) The percentage contributions of types of repetitive elements to the total DNA sequence bound to nucleosomes containing the indicated histone modification. (b) The fraction of start site-aligned satellite II (upper) or III (lower) DNA sequences occupied by stable nucleosomes containing the indicated histone modification.