NTAP: for NimbleGen tiling array ChIP-chip data analysis.

Kun He, Xueyong Li, Junli Zhou, Xing-Wang Deng, Hongyu Zhao, Jingchu Luo.

Abstract

SUMMARY: NTAP is designed to analyze ChIP-chip data generated by the NimbleGen tiling array platform and to accomplish various pattern recognition tasks that are especially useful for epigenetic studies. The modular design of NTAP makes the data processing highly customizable. Users can either use NTAP to perform the full NimbleGen tiling array data analysis, or choose post-processing modules in NTAP to analyze pre-processed epigenetic data generated by other platforms. The output of NTAP can be saved in standard GFF format files and visualized in GBrowse.
AVAILABILITY AND IMPLEMENTATION: The source code of NTAP is freely available at http://ntap.cbi.pku.edu.cn/. It is implemented in Perl and R and can be used on Linux, Mac and Windows platforms.

Year: 2009 PMID: 19468055 PMCID: PMC2705232 DOI: 10.1093/bioinformatics/btp320

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Genome-level high-density tiling arrays are becoming more accessible for genome-wide profiling studies, including transcriptome identification (Bertone et al., 2004), transcription factor binding site identification (Lee et al., 2007), histone modification profiling (Gendrel et al., 2005; Li et al., 2008), DNA methylation profiling (Hayashi et al., 2007) and comparative genome hybridization. Each type of study requires specific analysis methods and tools because the strategies behind different tiling array applications vary extensively. As a result, several models have been proposed and software tools developed for the analysis of different types of tiling array data (Chung et al., 2007; Ji et al., 2008; Li et al., 2005; Wang et al., 2006; Zhang et al., 2007). However, there is still room for improvement in the analysis of epigenetic features, including histone modifications and DNA methylation. Recognizing the distribution patterns of modifications at both the local (gene) and global (chromosome) levels is usually required to draw biologically meaningful conclusions (Hayashi et al., 2007; Li et al., 2008). Here, we present the NimbleGen Tiling array data Analysis Package (NTAP), which was designed for histone modification profiling analysis (Li et al., 2008) and can also be applied to other ChIP-chip data (Lee et al., 2007). The advantage of our package is its ability to generate reports for various pattern recognition questions instead of focusing only on identifying significantly enriched oligos or genomic regions.

2 FUNCTIONS AND FEATURES

NTAP was developed using the R statistical language to take advantage of the powerful statistical functions of other open source packages, especially those from the Bioconductor project (http://www.bioconductor.org/). The analysis comprises five main steps: importing, normalization, feature identification, oligo mapping and post-processing for pattern recognition.

2.1 Data importing

We implemented an R function similar to the ‘read.maimages’ function in the limma package (Smyth, 2004) to import NimbleGen raw data into limma data object formats for normalization.
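The import step can be illustrated with a minimal Python sketch. It is not NTAP's R importer. It assumes each channel is a tab-delimited report with '#' comment lines followed by a header row, and the column names PROBE_ID and PM are assumptions made here for illustration, as are the function names.

```python
import csv
import io

def read_pair(handle, id_col="PROBE_ID", signal_col="PM"):
    """Return {probe_id: intensity} from one tab-delimited channel report.

    id_col and signal_col are assumed column names, not taken from the paper.
    """
    # Drop '#' comment lines; the first remaining line is treated as the header.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row[id_col]: float(row[signal_col]) for row in reader}

def pair_to_channels(chip_handle, input_handle):
    """Align two single-channel reports on the probes they share."""
    chip = read_pair(chip_handle)
    ref = read_pair(input_handle)
    probes = sorted(set(chip) & set(ref))
    return probes, [chip[p] for p in probes], [ref[p] for p in probes]

# Tiny in-memory example standing in for two real files.
chip_file = io.StringIO("# demo header\nPROBE_ID\tPM\nP1\t200\nP2\t50\n")
input_file = io.StringIO("# demo header\nPROBE_ID\tPM\nP2\t100\nP1\t100\n")
probes, chip_signal, input_signal = pair_to_channels(chip_file, input_file)
```

Keeping the two channels as parallel lists ordered by probe mirrors the two-channel layout that downstream normalization expects.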

2.2 Data normalization

Users can apply various microarray normalization methods to the imported datasets through the limma package functions ‘normalizeBetweenArrays’ and ‘normalizeWithinArrays’. Unlike expression profiling arrays, whose log-transformed ratio distributions are usually symmetric around zero, the distribution of ChIP-chip results tends to be skewed toward the ChIP channel. Because only the protein-bound DNA fragments are pulled down by a specific antibody, more positive log-transformed ChIP/Input ratios are expected. Thus, the rank-invariant set scheme (Buck and Lieb, 2004) was incorporated for better data normalization.
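The idea behind a rank-invariant set can be sketched as follows. This is a simplified illustration, not the procedure of Buck and Lieb (2004) or NTAP's implementation: it assumes that probes whose rank barely changes between channels are unenriched, and it centers all log2 ratios on the median ratio of those probes. The function names and the max_rank_shift threshold are hypothetical.

```python
import math
from statistics import median

def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out

def rank_invariant_normalize(chip, ref, max_rank_shift=0.05):
    """Center log2(ChIP/Input) on probes whose rank is stable across channels.

    max_rank_shift is the tolerated rank difference as a fraction of the
    number of probes (a hypothetical tuning parameter).
    """
    n = len(chip)
    r_chip, r_ref = ranks(chip), ranks(ref)
    log_ratio = [math.log2(c / r) for c, r in zip(chip, ref)]
    invariant = [i for i in range(n)
                 if abs(r_chip[i] - r_ref[i]) <= max_rank_shift * n]
    if not invariant:
        raise ValueError("no rank-invariant probes; relax max_rank_shift")
    offset = median(log_ratio[i] for i in invariant)
    return [m - offset for m in log_ratio], invariant

# The third probe is enriched and changes rank, so it is excluded from the set.
normalized, invariant = rank_invariant_normalize(
    [200, 400, 6400, 1600], [100, 200, 400, 800], max_rank_shift=0.0)
```

Because the offset is estimated only from rank-stable probes, the enriched tail of the distribution does not pull the center of the ratios upward, which is the concern the paragraph above raises for symmetric normalization.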

2.3 Feature identification

Tiling arrays usually contain several oligos per gene rather than one oligo per gene. For example, the traditional whole-genome array for expression studies in Arabidopsis thaliana usually contains only 23k oligos, while a customized whole-genome tiling array tiled at ∼250 bp resolution may contain ∼400k oligos. The much larger number of oligos on a single array makes the traditional methods for feature identification infeasible. For tiling array data, expressed mRNA or pulled-down DNA fragments can cause the signals of a group of neighboring oligos to increase simultaneously. Therefore, our package implements the non-parametric Wilcoxon rank-sum method to compare the signals of the ChIP channel and the reference channel for a group of oligos using sliding windows. Under certain circumstances, however, the density of a tiling array may not be high enough to use the Wilcoxon method. In these cases, we use simple comparison linear models implemented in limma (Smyth, 2004) to identify single oligos whose signal is significantly increased in the ChIP channel. We then consider a genomic region ‘positive’ if it contains either a single oligo that meets stringent user-defined criteria or a group of neighboring oligos that meet less stringent criteria.
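A sliding-window rank-sum comparison of this kind can be sketched in Python. This is an illustration under stated assumptions, not NTAP's R code: it uses the normal approximation to the Wilcoxon rank-sum (Mann-Whitney U) statistic without tie or continuity correction, which is rough for very small windows, and the window is defined as a fixed number of consecutive oligos. All function names are hypothetical.

```python
import math

def midranks(values):
    """1-based ranks in which tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_pvalue(x, y):
    """One-sided p-value for 'x tends to exceed y' (normal approximation)."""
    n1, n2 = len(x), len(y)
    r = midranks(list(x) + list(y))
    u = sum(r[:n1]) - n1 * (n1 + 1) / 2      # Mann-Whitney U for sample x
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability

def sliding_window_test(chip, ref, window=5):
    """One p-value per window of consecutive oligos (inputs sorted by position)."""
    return [rank_sum_pvalue(chip[s:s + window], ref[s:s + window])
            for s in range(len(chip) - window + 1)]

# Oligos 5 to 9 (0-based 4 to 8) carry an elevated ChIP signal.
chip = [1, 2, 1, 2, 20, 21, 22, 23, 24, 2, 1]
ref = [1.5] * 11
pvalues = sliding_window_test(chip, ref, window=5)
```

Testing a window rather than a single oligo exploits the fact that a bound fragment raises the signal of several neighboring oligos at once, so a run of moderately elevated oligos can reach significance even when no individual oligo does.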

2.4 Mapping oligos to gene models

Genome data are kept up-to-date by genome sequencing consortia or curation groups, who usually release their data as standard XML format files that can be easily parsed to obtain the coordinates of gene models. A Perl module was implemented to retrieve the position of each gene model on each chromosome and to determine the position of a specific oligo relative to its nearby gene model(s). Signal distribution patterns among different groups of genes can then be determined from the stored relative position information.
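The relative-position calculation can be sketched as follows. This is a Python illustration rather than the Perl module itself. It assumes gene models are already parsed into start, end and strand, that the transcription start site is the 5' end of the gene model, and that an oligo is represented by its center coordinate. The Gene class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    start: int   # leftmost chromosome coordinate
    end: int     # rightmost chromosome coordinate (end > start)
    strand: str  # '+' or '-'

def relative_position(oligo_center, gene):
    """Return (offset_bp, fraction) of an oligo relative to the gene's 5' end.

    offset_bp is positive downstream in the direction of transcription;
    fraction is 0 at the 5' end of the gene model and 1 at its 3' end.
    """
    length = gene.end - gene.start
    if gene.strand == "+":
        offset = oligo_center - gene.start
    else:
        offset = gene.end - oligo_center
    return offset, offset / length

def nearest_gene(oligo_center, genes):
    """Pick the gene model closest to the oligo (distance 0 if it is inside)."""
    def distance(g):
        if g.start <= oligo_center <= g.end:
            return 0
        return min(abs(oligo_center - g.start), abs(oligo_center - g.end))
    return min(genes, key=distance)

g1 = Gene("g1", 1000, 3000, "+")
g2 = Gene("g2", 5000, 6000, "-")
```

Storing both the absolute offset and the fraction of gene length keeps either alignment strategy available later without re-reading the annotation.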

2.5 Post-processing functions

The following questions are frequently asked in epigenomics research. What is the modification distribution pattern relative to genes and does it vary between different organs/tissues? Is there an association between specific histone modification levels and gene sizes, or gene expression levels? To answer these questions, we implemented several R functions to align genes, to calculate the average ChIP/Input intensity ratio of the oligos within sliding windows, and to plot the final results for different groups (Fig. 1).

Fig. 1.

Demonstration of two different methods for the alignment of gene models and recognition of histone modification patterns. (A) Two different strategies to align genes (three genes with different lengths are used as examples). Alignment without gene length normalization overlays all oligos according to their absolute distance (kb) from the transcription start site, whereas alignment with length normalization overlays them according to their relative position (percentile) from the transcription start site. (B) The histone modification distribution pattern in different user-defined gene sub-groups, which in this particular case contain genes of various lengths. (C) The tissue-specific histone modification distribution pattern on all genes, obtained with the two strategies demonstrated in (A).
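The two alignment strategies can be sketched in Python. This is an illustration, not NTAP's R functions: it uses fixed, non-overlapping bins instead of sliding windows for brevity, and it assumes each oligo is given as (offset from the transcription start site in bp, gene length in bp, ChIP/Input ratio). The function names and bin sizes are hypothetical.

```python
import math
from collections import defaultdict

def binned_profile(points, bin_width):
    """Average signal per bin; points is an iterable of (x, signal)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, signal in points:
        b = math.floor(x / bin_width)  # floor also handles upstream (negative) x
        sums[b] += signal
        counts[b] += 1
    return {b * bin_width: sums[b] / counts[b] for b in sorted(sums)}

def absolute_profile(oligos, bin_bp=500):
    """Align genes at the TSS and bin by absolute distance (bp)."""
    return binned_profile(((off, ratio) for off, _, ratio in oligos), bin_bp)

def normalized_profile(oligos, bin_pct=25):
    """Align genes at the TSS and bin by position as a percentage of gene length."""
    return binned_profile(
        ((100 * off / length, ratio) for off, length, ratio in oligos), bin_pct)

# Two genes of different lengths (1 kb and 4 kb), two oligos each.
oligos = [(100, 1000, 2.0), (600, 1000, 1.0),
          (400, 4000, 4.0), (2400, 4000, 3.0)]
by_distance = absolute_profile(oligos)
by_percent = normalized_profile(oligos)
```

In this toy input the oligos at 600 bp and 2400 bp fall into different bins under absolute alignment but into the same bin under length normalization, because both sit 60% of the way along their gene.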

2.6 Result visualization

Quality control is a key step in ensuring the validity of the overall data analysis. An R function was implemented to calculate the correlation coefficient of the raw intensities between any pair of replicates. MA-plots of the array hybridization results are also generated to examine the intensity ratio (M) versus the averaged intensity (A) and to reveal possible non-linear biases that require special normalization methods. After raw data processing, all oligos are mapped back to the most up-to-date chromosome sequences, and the ChIP/Input ratio of each oligo can then be plotted along the chromosome. These values can be displayed by a program within NTAP or exported in GFF format for display in the Generic Genome Browser, GBrowse (Stein et al., 2002).
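The quantities behind these checks, and a GFF-style export line, can be sketched in Python. This is an illustration, not NTAP's R code. It assumes the Pearson coefficient as the correlation measure, the usual definitions M = log2(ChIP) - log2(Input) and A = (log2(ChIP) + log2(Input)) / 2, and a nine-column tab-separated GFF record. The source and feature labels and the GFF3-style Name attribute are placeholders chosen here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two replicate intensity vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ma_values(chip, ref):
    """Return (M, A): log2 ratio and mean log2 intensity for each oligo."""
    m = [math.log2(c) - math.log2(r) for c, r in zip(chip, ref)]
    a = [(math.log2(c) + math.log2(r)) / 2 for c, r in zip(chip, ref)]
    return m, a

def gff_line(seqid, start, end, score,
             source="NTAP", feature="oligo", name=None):
    """Format one nine-column GFF record; source/feature labels are placeholders."""
    attributes = f"Name={name}" if name else "."
    fields = [seqid, source, feature, str(start), str(end),
              f"{score:.3f}", ".", ".", attributes]
    return "\t".join(fields)

m, a = ma_values([8, 16], [2, 16])
record = gff_line("Chr1", 100, 149, m[0])
```

Plotting M against A for each array makes intensity-dependent bias visible as curvature in the cloud of points, which is what motivates a non-linear normalization.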

3 IMPLEMENTATION

Most of the functions are implemented in the R statistical language (http://www.r-project.org/) and Perl. Users can also choose any other software to pre-process their data before using our post-processing modules.

REFERENCES

1. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

2. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments.

Authors: Michael J Buck; Jason D Lieb
Journal: Genomics Date: 2004-03 Impact factor: 5.736

3. Profiling histone modification patterns in plants using genomic tiling microarrays.

Authors: Anne-Valérie Gendrel; Zachary Lippman; Rob Martienssen; Vincent Colot
Journal: Nat Methods Date: 2005-03 Impact factor: 28.547

4. A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences.

Authors: Wei Li; Clifford A Meyer; X Shirley Liu
Journal: Bioinformatics Date: 2005-06 Impact factor: 6.937

5. Linear models and empirical bayes methods for assessing differential expression in microarray experiments.

Authors: Gordon K Smyth
Journal: Stat Appl Genet Mol Biol Date: 2004-02-12

6. NMPP: a user-customized NimbleGen microarray data processing pipeline.

Authors: Xiangfeng Wang; Hang He; Lei Li; Runsheng Chen; Xing Wang Deng; Songgang Li
Journal: Bioinformatics Date: 2006-10-12 Impact factor: 6.937

7. High-resolution mapping of DNA methylation in human genome using oligonucleotide tiling array.

Authors: Hiroshi Hayashi; Genta Nagae; Shuichi Tsutsumi; Kiyofumi Kaneshiro; Takazumi Kozaki; Atsushi Kaneda; Hajime Sugisaki; Hiroyuki Aburatani
Journal: Hum Genet Date: 2006-09-26 Impact factor: 4.132

8. Analysis of transcription factor HY5 genomic binding sites revealed its hierarchical role in light regulation of development.

Authors: Jungeun Lee; Kun He; Viktor Stolc; Horim Lee; Pablo Figueroa; Ying Gao; Waraporn Tongprasit; Hongyu Zhao; Ilha Lee; Xing Wang Deng
Journal: Plant Cell Date: 2007-03-02 Impact factor: 11.277

9. High-resolution mapping of epigenetic modifications of the rice genome uncovers interplay between DNA methylation, histone methylation, and gene expression.

Authors: Xueyong Li; Xiangfeng Wang; Kun He; Yeqin Ma; Ning Su; Hang He; Viktor Stolc; Waraporn Tongprasit; Weiwei Jin; Jiming Jiang; William Terzaghi; Songgang Li; Xing Wang Deng
Journal: Plant Cell Date: 2008-02-08 Impact factor: 11.277

10. An integrated software system for analyzing ChIP-chip and ChIP-seq data.

Authors: Hongkai Ji; Hui Jiang; Wenxiu Ma; David S Johnson; Richard M Myers; Wing H Wong
Journal: Nat Biotechnol Date: 2008-11-02 Impact factor: 54.908
