Xin Wang, Guohua Wang, Changyu Shen, Lang Li, Xinguo Wang, Sean D Mooney, Howard J Edenberg, Jeremy R Sanford, Yunlong Liu.
Abstract
Massively parallel pyrosequencing is a high-throughput technology that can sequence hundreds of thousands of DNA/RNA fragments in a single experiment. Combining it with immunoprecipitation-based biochemical assays, such as cross-linking immunoprecipitation (CLIP), provides a genome-wide method to detect the sites at which proteins bind DNA or RNA. In a CLIP-pyrosequencing experiment, the resolution of the detected protein-binding regions is partly determined by the lengths of the detected RNA fragments (CLIP amplicons) after trimming by RNase digestion. These fragments are usually 50 to 70 nucleotides long. Many genomic regions are marked by multiple RNA fragments. In this paper, we report an empirical approach to refine the localization of protein-binding regions by using the distribution pattern of the detected RNA fragments and the sequence specificity of RNase digestion. We present two regions to which multiple amplicons map as examples to demonstrate this approach.
Year: 2008 PMID: 18366606 PMCID: PMC2386059 DOI: 10.1186/1471-2164-9-S1-S17
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Distribution of detected amplicons in the pyro-CLIP experiment. A. Detected and undetected sequence fragments. B. Distribution of detected amplicons in the experiment.
Figure 2. Estimation of the sequence specificity of RNase digestion.
Figure 3. Searching strategy. A. Distribution of detected amplicons in the experiment. B. Distribution of detected amplicons in the simulation.
Figure 4. Histogram of the estimated likelihood of RNase digestion on 967 6-bp motifs that occur at least once on the edge of the experimentally detected RNA fragments.
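The edge-motif tally behind this kind of histogram can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that the "edge" of a fragment means its first and last 6 bases, and the function names `edge_motifs` and `edge_motif_frequencies` are hypothetical.

```python
from collections import Counter

K = 6  # motif length used in the figure and table


def edge_motifs(fragment, k=K):
    """Return the k-mers at the two ends of one fragment.

    Assumption: the 'edge' is the terminal k bases on each side.
    """
    fragment = fragment.upper()
    if len(fragment) < k:
        return []  # too short to contribute an edge motif
    return [fragment[:k], fragment[-k:]]


def edge_motif_frequencies(fragments, k=K):
    """Relative frequency of each k-mer among all fragment edges."""
    counts = Counter()
    for frag in fragments:
        counts.update(edge_motifs(frag, k))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}


if __name__ == "__main__":
    # Made-up fragments, for illustration only.
    demo = ["AATAAACCGGTTTCTACA", "AATAAAGGGGGGTCTGAA"]
    print(edge_motif_frequencies(demo))
```

Motifs that never occur at an edge are simply absent from the result, which matches the caption's restriction to motifs seen at least once.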
Top 1% 6-bp motifs and their RNase digestion frequencies
| human transcriptome (entropy=0.96) | | detected RNA fragments (entropy=0.42) | | adjusted frequencies (entropy=0.56) | |
| 6-bp | frequency | 6-bp | frequency | 6-bp | frequency |
| AAAAAA | 0.0035 | AATAAA | 0.262 | TCTACA | 0.333 |
| TTTTTT | 0.0035 | TCTACA | 0.054 | AATAAA | 0.085 |
| AAAAAT | 0.0016 | TTGAAT | 0.033 | TTGAAT | 0.044 |
| ATTTTT | 0.0016 | AACAGA | 0.021 | CCTACA | 0.036 |
| AAAATA | 0.0015 | AACAAG | 0.019 | AACAGA | 0.024 |
| TATTTT | 0.0013 | CCTACA | 0.019 | AACAAG | 0.023 |
| TAAAAA | 0.0013 | TCTGAA | 0.018 | TCTGAA | 0.016 |
| TTTTTA | 0.0013 | TTCAGA | 0.014 | TTCAGA | 0.016 |
| AAATAA | 0.0013 | ATTCTT | 0.013 | CGTGAA | 0.011 |
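One plausible way to obtain "adjusted" frequencies and entropy values of this kind is sketched below. Both steps are assumptions, not taken from the source: the adjustment is modelled as dividing each observed edge frequency by its transcriptome background frequency and renormalizing, and the entropy is modelled as Shannon entropy divided by its maximum so that it lies between 0 and 1. The function names are hypothetical.

```python
import math


def adjust_by_background(observed, background):
    """Divide observed motif frequencies by background frequencies, then renormalize.

    Assumption: this is one common way to correct for motif abundance;
    the source does not state the exact adjustment used.
    """
    ratios = {
        motif: freq / background[motif]
        for motif, freq in observed.items()
        if background.get(motif, 0) > 0
    }
    total = sum(ratios.values())
    if total == 0:
        return {}
    return {motif: r / total for motif, r in ratios.items()}


def normalized_entropy(freqs, n_categories):
    """Shannon entropy scaled to [0, 1] by the maximum, log(n_categories)."""
    h = -sum(p * math.log(p) for p in freqs.values() if p > 0)
    return h / math.log(n_categories)


if __name__ == "__main__":
    # Toy two-motif example with made-up numbers.
    observed = {"A": 0.5, "B": 0.5}
    background = {"A": 0.25, "B": 0.75}
    adjusted = adjust_by_background(observed, background)
    print(adjusted)
    print(normalized_entropy(adjusted, n_categories=2))
```

For 6-bp motifs, `n_categories` would be 4**6 = 4096 under this normalization. A value near 1 means the frequencies are spread almost evenly across motifs, and a lower value means a few motifs dominate.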
Figure 5. Genomic region that contains one protein binding site. A. Relative likelihood of RNase digestion at each genomic locus. B. Similarity between the distribution of experimentally detected amplicons and that of simulated amplicons, assuming protein binding occurs at each locus in turn. A higher correlation coefficient implies a higher probability of protein binding. C. Distribution of detected RNA fragments. D. Distribution of simulated fragments based on the best predicted locus (marked by a blue ellipse).
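The search described in panel B can be sketched as a scan over candidate loci. This is a simplified illustration under stated assumptions, not the authors' simulation: the random simulation is replaced by a deterministic expected-coverage profile, in which a fragment containing the binding site reaches position x only if no RNase cut occurs between the site and x. The per-position cut probabilities `cut_prob` and all function names are hypothetical.

```python
import math


def expected_coverage(cut_prob, site):
    """Expected coverage profile if the protein binds at index `site`.

    Assumption: coverage at x equals the probability that no cut occurs
    at any position between the site (exclusive) and x (inclusive).
    """
    n = len(cut_prob)
    cov = [0.0] * n
    cov[site] = 1.0
    survive = 1.0
    for x in range(site + 1, n):       # extend downstream
        survive *= 1.0 - cut_prob[x]
        cov[x] = survive
    survive = 1.0
    for x in range(site - 1, -1, -1):  # extend upstream
        survive *= 1.0 - cut_prob[x]
        cov[x] = survive
    return cov


def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_binding_site(observed, cut_prob):
    """Score every locus by correlation with the observed coverage; return the best."""
    scores = [
        pearson(observed, expected_coverage(cut_prob, s))
        for s in range(len(observed))
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    cut_prob = [0.2] * 21                         # made-up digestion likelihoods
    observed = expected_coverage(cut_prob, 12)    # synthetic "observed" coverage
    site, scores = best_binding_site(observed, cut_prob)
    print(site, round(scores[site], 3))
```

In practice `observed` would be the per-position count of detected amplicons, and `cut_prob` would come from motif-level digestion likelihoods such as those in panel A.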
Figure 6. Genomic region that contains two protein binding sites. A. Relative likelihood of RNase digestion at each genomic locus. B. Distribution of detected RNA fragments. C. Distribution of simulated fragments based on the best predicted loci (marked by blue ellipses). D. Similarity between the distribution of experimentally detected amplicons and that of simulated amplicons, assuming that protein binding occurs at each pair of genomic loci in turn. The blue frame indicates the region where the highest similarity is observed (dark red).
Figure 7. Effects of computational and biological variations on binding site prediction. A. Robustness of binding site prediction to antibody non-specificity and blastn inaccuracy. B. Robustness of binding site prediction to inaccurate estimation of the sequence specificity of RNase digestion.