Desiree Tillo, Timothy R Hughes.
Abstract
BACKGROUND: The relative preference of nucleosomes to form on individual DNA sequences plays a major role in genome packaging. A wide variety of DNA sequence features are believed to influence nucleosome formation, including periodic dinucleotide signals, poly-A stretches and other short motifs, and sequence properties that influence DNA structure, including base content. It was recently shown by Kaplan et al. that a probabilistic model using composition of all 5-mers within a nucleosome-sized tiling window accurately predicts intrinsic nucleosome occupancy across an entire genome in vitro. However, the model is complicated, and it is not clear which specific DNA sequence properties are most important for intrinsic nucleosome-forming preferences.
Year: 2009 PMID: 20028554 PMCID: PMC2808325 DOI: 10.1186/1471-2105-10-442
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Model feature weights selected by Lasso for eleven different training data sets. Chromosomes from which 1,000,000 random nucleotide positions were taken are given at bottom. Correlation coefficients are given in the middle, using a test set that does not include any of the random nucleotide positions used in the training set. The top panel is a zoom-in of the 16 features that were weighted in more than half of the eleven runs. Weights do not directly reflect importance or proportion of the data that a feature explains, because features are unit-normalized prior to analysis, and can have dissimilar distributions.
Figure 2. Performance of a 14-feature linear model of intrinsic nucleosome sequence preference. (A) Scatter plot vs. test set (yeast chromosomes 10-16), shown as a heat-map. Axis values are log2 normalized nucleosome occupancy (see Methods). (B) Model scores (probabilistic[8] and linear) and in vivo and in vitro nucleosome occupancy[8] within a 20 kb region of chromosome 14. (C) and (D) Correlation of the 14-feature model score with measured in vivo nucleosome occupancy in yeast (C) and with the Kaplan model across chr10-16 (test set) (D).
Comparison of nucleosome occupancy prediction models on different data sets
| Model | Summary | Performance (Pearson R) | | | | | | Correlation with %G+C (Yeast, 150 bp windows) |
|---|---|---|---|---|---|---|---|---|
| Kaplan et al., 2009[ | Probabilistic model based on | 0.87 | ||||||
| Lasso model (this study) | See | 0.85 | ||||||
| Field et al., 2008[ | Probabilistic model based on 5-mer preferences measured | 0.64 | ||||||
| %G+C | The percentage of guanine and cytosine bases in a DNA sequence. | 0.53* | 0.49* | 0.78* | 0.25 | 0.42 | 0.47 | 1 |
| Lasso model[ | Linear regression model trained on | 0.23 | 0.22 | 0.55 | ||||
| Peckham et al., 2007[ | SVM classifier trained on overrepresented k-mers (k = 1-6) found in nucleosome occupied and depleted sequences determined | 0.22 | 0.29 | 0.33 | 0.57 | |||
| Yuan and Liu, 2008[ | Computes predicted nucleosome occupancy based on periodic dinucleotide signals found in nucleosomal and linker DNA sequences determined from | 0.02 | 0.05 | 0.35 | 0.30 | |||
| Miele et al., 2008[ | Computes free energy landscape of nucleosome formation using an estimation of dinucleotide-dependent DNA flexibility and intrinsic curvature. | 0.38 | 0.22 | 0.21 | 0.25 | 0.49 | ||
| Segal et al., 2006[ | Probabilistic model trained on yeast data, using a position specific scoring matrix derived from a collection of nucleosome-bound sequences obtained from | NaN | NaN | 0.05 | 0.09 | 0.05 | 0.05 | 0.07 |
| Ioshikhes et al., 2006[ | Computes the correlation of periodic AA/TT dinucleotide motifs in a given sequence with those found in a set of 204 eukaryotic and viral nucleosomal sequences determined through | -0.03 | -0.03 | 0.01 | 0.07 | -0.03 | -0.01 | 0.01 |
| Tolstorukov et al., 2007,2008[ | Estimates the dinucleotide-dependent cost of deformation caused by threading a given sequence on a template comprising the path of DNA found on the experimentally determined structure of the nucleosome core particle. | 0.01 | 0.004 | 0 | -0.001 | -0.001 | -0.001 | -0.0003 |
| Segal et al., 2006[ | Probabilistic model trained on yeast data, using a position specific scoring matrix derived from a collection of nucleosome-bound sequences obtained from | NaN | NaN | -0.2 | 0.001 | -0.06 | -0.05 | -0.21 |
Pearson correlation is shown as a performance metric. Nucleosome occupancy was predicted in yeast using only sequence from the test set (chr10-16) and chromosome III in C. elegans. "NaN" indicates that a score of "0" was obtained for each sequence (since this model[23] requires the sequence be > 150 bp in length). Models are sorted by their average rank in performance. Asterisks (*) and text in bold denote the top three and top 50% performing models for each data set, respectively.
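The "NaN" entries follow from how the Pearson coefficient is defined: if a model returns the same score (here "0") for every sequence, the scores have zero variance and the correlation is undefined. The following is a minimal sketch of that behaviour in plain Python. The function name `pearson_r` and the choice to return NaN rather than raise an error are assumptions made for illustration, not the authors' implementation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns NaN when either input has zero variance (for example, a model
    that scores every sequence as 0), because the coefficient is undefined.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan  # assumption: report undefined correlation as NaN
    return sxy / math.sqrt(sxx * syy)
```

Calling `pearson_r` with a constant score vector and any measured occupancy vector yields NaN, which matches the way the table reports models that could not score short sequences.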
Figure 3. Correlation of each of the 14 features with nucleosome occupancy. (A) Graphic illustration of the correlation of each of the 14 sequence features with nucleosome occupancy in vitro and in vivo across the yeast genome (data from Kaplan et al.[8]). (B-D) Scatter plots showing performance of linear models on test set using only G+C content (B), AAAA occurrence (C), or both (D) as inputs. (E) Kaplan model score vs. proportion of G+C over all 150 bp tiling windows in the yeast genome.
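The two inputs used in panels B-D, G+C content and AAAA occurrence, can be computed per tiling window roughly as follows. This is a sketch under stated assumptions: the function name `window_features` is hypothetical, AAAA matches are counted with overlap on the given strand only, and the `step` parameter is added for convenience. The caption does not say how the authors counted occurrences.

```python
def window_features(genome, window=150, step=1):
    """Yield (start, gc_fraction, aaaa_count) for each tiling window.

    Assumptions: overlapping AAAA matches are counted, and only the given
    strand is scanned.
    """
    genome = genome.upper()
    for start in range(0, len(genome) - window + 1, step):
        w = genome[start:start + window]
        # Proportion of G and C bases in the window.
        gc = (w.count("G") + w.count("C")) / window
        # Number of positions where the 4-mer AAAA begins.
        aaaa = sum(1 for i in range(window - 3) if w[i:i + 4] == "AAAA")
        yield start, gc, aaaa
```

For example, `list(window_features("AAAAAGCGC", window=5, step=2))` produces one tuple per window starting at positions 0, 2 and 4; a linear model would then take these two values per window as its inputs.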
Figure 4. Correlation of DNA structural parameters, calculated as the average over a 150-base window, with nucleosome occupancy. Calculations were made using dinucleotide and other coefficients obtained from the PROPERTY database (http://srs6.bionet.nsc.ru/srs6bin/cgi-bin/wgetz?-page+LibInfo+-newId+-lib+PROPERTY). Nucleosome occupancy data are from Kaplan et al.[8] and Lee et al.[2]. Pearson correlation is shown.
Figure 5. Relative nucleosome preference of different subsets of synthetic 150-mers. (A) and (B) Dependence of relative nucleosome preference (as log2(occupancy ratio)) on G+C content (A) and maximum poly-A length (B). Oligonucleotides categorized as "Neutral %G+C" in (B) are those with 45-55% G+C. The graph below each plot shows the frequency of the selected attribute in the oligonucleotides analyzed, and also in the human and yeast genomes. (C) Dependence of relative occupancy on poly-A content and CpG status. Poly-A containing oligonucleotides are defined as containing at least four consecutive adenine bases. CpG oligonucleotides are defined as having a G+C content ≥50%, with an observed/expected CpG ratio ≥0.6 (Obs/Exp CpG = (number of CpG × N)/(number of G × number of C), where N = length of sequence[37]). The sequencing readout (rather than array readout) data from Kaplan et al. were used in this analysis. On all box plots, whiskers indicate 10th and 90th percentiles.
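The oligonucleotide categories defined in this caption can be expressed as a short classifier. The sketch below applies the stated thresholds (at least four consecutive adenines; G+C content ≥50% and observed/expected CpG ≥0.6). The function names are hypothetical, only the given strand is scanned, and returning 0 when a sequence has no G or no C is an assumption, since the ratio is undefined in that case.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def obs_exp_cpg(seq):
    """Observed/expected CpG: (number of CpG * N) / (num G * num C)."""
    seq = seq.upper()
    n_g, n_c = seq.count("G"), seq.count("C")
    if n_g == 0 or n_c == 0:
        return 0.0  # assumption: treat the undefined ratio as 0
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cpg * len(seq) / (n_g * n_c)


def max_poly_a(seq):
    """Length of the longest run of consecutive A bases."""
    best = run = 0
    for base in seq.upper():
        run = run + 1 if base == "A" else 0
        best = max(best, run)
    return best


def classify(seq):
    """Apply the caption's thresholds to one oligonucleotide."""
    return {
        "poly_a": max_poly_a(seq) >= 4,
        "cpg": gc_fraction(seq) >= 0.5 and obs_exp_cpg(seq) >= 0.6,
    }
```

As a usage example, `classify("CGCGCGAAAA" * 15)` describes a 150-mer that is both poly-A containing and CpG by these definitions, while `classify("ATAT" * 10)` is neither.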