damidseq_pipeline: an automated pipeline for processing DamID sequencing datasets.

Abstract

DamID is a powerful technique for identifying regions of the genome bound by a DNA-binding (or DNA-associated) protein. Currently, no method exists for automatically processing next-generation sequencing DamID (DamID-seq) data, and the use of DamID-seq datasets with normalization based on read-counts alone can lead to high background and the loss of bound signal. DamID-seq thus presents novel challenges in terms of normalization and background minimization. We describe here damidseq_pipeline, a software pipeline that performs automatic normalization and background reduction on multiple DamID-seq FASTQ datasets.
AVAILABILITY AND IMPLEMENTATION: Open-source and freely available from http://owenjm.github.io/damidseq_pipeline. The damidseq_pipeline is implemented in Perl and is compatible with any Unix-based operating system (e.g. Linux, Mac OS X).
CONTACT: o.marshall@gurdon.cam.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances: DNA-Binding Proteins

Year: 2015 PMID: 26112292 PMCID: PMC4595905 DOI: 10.1093/bioinformatics/btv386

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

DamID is a well-established technique for discovering regions of DNA bound by or associated with proteins (van Steensel and Henikoff, 2000). It has been used to map the genome-wide binding of transcription factors, chromatin proteins, nuclear complexes associated with DNA and RNA pol II (e.g. Choksi et al., 2006; Filion et al., 2010; Singer et al., 2014; Southall et al., 2013). The technique can be performed in cell culture, whole organisms (van Steensel and Henikoff, 2000) or with cell-type specificity (Southall et al., 2013), and requires no fixation or antibody purification. DamID involves the fusion of a bacterial DNA adenine methylase (Dam) to any DNA-associated protein of interest. The bacterial Dam protein methylates adenine in the sequence GATC and, given that higher eukaryotes lack native adenine methylation, the DNA-binding footprint of the protein of interest is uniquely detectable through isolating sequences flanked by methylated GATC sites. However, a major consideration with DamID is that any Dam protein within the nucleus will non-specifically methylate adenines in GATC sequences at accessible regions of the genome. For this reason, DamID is always performed concurrently with a Dam-only control, and the final DNA-binding profile is typically presented as a log2(Dam-fusion/Dam-only) ratio. Although the majority of published DamID experiments have used tiling microarrays for data analysis, next-generation sequencing (NGS) allows greater sensitivity and higher accuracy. Although several recent studies have used NGS with DamID (Carl and Russell, 2015; Clough et al., 2014; Lie-A-Ling et al., 2014; Wu and Yao, 2013), these have relied upon a comparison of peak binding intensities between read-count-normalized Dam-fusion and Dam samples. Depending on the characteristics of the Dam-fusion protein (see later), this approach may lead to real signal being lost, and correct normalization of the datasets is required to detect all binding by many Dam-fusion proteins.
Here, we describe a software pipeline for the automated processing of DamID-sequencing (DamID-seq) data, including normalization and background reduction algorithms.

2 Algorithms

Although DamID-seq data can be aligned and binned as per all NGS data, two issues arise that are specific to DamID. The first major consideration is the correct normalization of the Dam-fusion and Dam-control samples. The greatest contribution to many Dam-fusion protein datasets is the non-specific methylation of accessible genomic regions (e.g. Fig. 1B), with a mean correlation between Dam alone and Dam-fusion datasets of 0.70 (n = 4, Spearman’s correlation). Representing the data as a (Dam-fusion/Dam) ratio in theory negates such non-specific methylation. However, strong methylation signals at highly bound regions in the Dam-fusion dataset will reduce the relative numbers of reads present at accessible genomic regions in this dataset (see, for example, the occupancy of Dam-RNA Pol II over the eyeless gene in Fig. 1), and normalizing the data based on read counts alone can therefore produce a strong negative bias to the ratio file [Fig. 1B (iii), Supplementary Fig. S5A]. Depending on the characteristics of the fusion protein, this negative bias can lead to real signal being lost (Fig. 1). Although microarray data inadvertently overcame this bias through the manual adjustment of laser intensities during microarray scanning, until now no method has existed for correctly normalizing DamID-seq datasets.
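
The negative bias described above can be illustrated with a small worked example. The following is a minimal sketch on a hypothetical toy genome (the fragment counts are invented for illustration and are not taken from the paper): both samples share the same background at accessible fragments, but the Dam-fusion sample carries additional strong signal at two bound fragments. After scaling each sample to reads per million (RPM), the background ratio is pushed below zero and the bound signal is underestimated.

```python
import math

def rpm(counts):
    """Scale raw counts to reads per million mapped reads."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

# Hypothetical toy genome: 8 accessible background fragments, then 2 bound fragments.
dam_only = [100] * 8 + [100, 100]
dam_fusion = [100] * 8 + [900, 900]  # identical background, strong signal where bound

# log2(Dam-fusion/Dam) ratio after read-count (RPM) normalization alone.
ratio = [math.log2(f / d) for f, d in zip(rpm(dam_fusion), rpm(dam_only))]

background_bias = ratio[0]     # would be 0 if normalization were correct
bound_ratio = ratio[-1]        # would be log2(9) if normalization were correct
true_bound_ratio = math.log2(900 / 100)

print(f"background log2 ratio: {background_bias:.2f}")
print(f"bound log2 ratio: {bound_ratio:.2f} (unbiased: {true_bound_ratio:.2f})")
```

In this toy case every fragment is shifted by the same amount, log2(1000/2600), because the extra reads at bound fragments inflate the Dam-fusion library size used as the RPM denominator.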

Fig. 1.

Results of the damidseq_pipeline. (A) The gene eyeless (ey) (highlighted) is expressed in D. melanogaster larval neural stem cells (Southall et al., 2013) and previously published microarray DamID in these cells (i) shows RNA polymerase II occupancy (Southall et al., 2013). (B) Performing DamID-seq in the same cell type illustrates the high correlation between Dam-Pol II (i) and Dam alone (ii) in terms of RPM (read counts/million mapped reads). Taking the ratio of the two RPM-normalized datasets fails to show significant RNA pol II occupancy at ey (iii); however, processing via the damidseq_pipeline software successfully recovers the RNA pol II occupancy profile while minimizing background (iv). See Supplementary Methods for experimental details.

In order to correct for this negative bias we use the read counts from accessible genomic regions, as determined from the Dam-only dataset, as the basis for normalization, while avoiding regions likely to contain real signal in the Dam-fusion sample. We use the following algorithm to adjust the Dam-fusion dataset.

1. Given the GATC-site resolution of DamID, we divide the read counts into GATC fragments.
2. All GATC fragments lacking read counts are excluded.
3. The remaining GATC fragments are divided into deciles.
4. Given the high probability that the highest 10% of Dam-fusion read counts represent bound signal rather than background signal, we exclude fragments that have scores in this decile.
5. The first three deciles of the Dam sample can generate inconsistent normalization values if included (Supplementary Table S2), so we exclude fragments that lie within this range.
6. The distribution of the log2(Dam-fusion/Dam) ratio for all remaining fragments is determined via the Gaussian kernel density estimate $\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$ with $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$, where $h$ is the bandwidth, estimated via the method of Silverman (1986): $h = 0.9 \min\left(\sigma, \mathrm{IQR}/1.34\right) n^{-1/5}$ (where $\sigma$ is the standard deviation of the sample and IQR the interquartile range). For speed considerations, we estimate kernel density over 300 equally spaced points within a bounded interval of the log2 ratio.
7. The point of maximum kernel density represents the point of maximum correspondence between Dam-fusion and Dam values; if both samples are correctly normalized this value should equal 0. We therefore normalize all Dam-fusion values by $1/2^{x_{\max}}$, where $x_{\max} = \arg\max_x \hat{f}_h(x)$.

In addition to ensuring correct normalization, a second important consideration is the reduction of background noise. Regions without specific methylation will have randomly distributed background counts that, when a ratio file is generated, will generate a large degree of noise. Such noise can potentially obscure peak detection. In order to mitigate this effect we add pseudocounts to both datasets. In order to maintain equivalence between replicates with differing numbers of reads (assuming that $\mathrm{genome}_{\mathrm{bound}} \ll \mathrm{genome}_{\mathrm{unbound}}$), the number of pseudocounts added is proportional to the sequencing coverage, scaled by a constant $c$ (see Supplementary Table S1 for a comparison of gene calls with different read-depths). Adding pseudocounts increases the number and the total genomic coverage of detected peaks and increases the signal:noise ratio (Supplementary Figs S1–S4).

The combination of these two methods compares favorably with previously published microarray data [Fig. 1B (iv)] or DamID-seq data (Supplementary Figs S1–S4; Supplementary Fig. S5).
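
The normalization steps described in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the pipeline's Perl implementation: the function names are hypothetical, a fragment is treated as "lacking read counts" if either sample is zero there, the evaluation interval of -5 to 5 is a placeholder chosen for the toy data, and the synthetic counts are invented so that the Dam-fusion background is deliberately under-scaled by a factor of two.

```python
import math
import random
import statistics

def silverman_bandwidth(xs):
    """Silverman (1986) rule of thumb: 0.9 * min(sigma, IQR/1.34) * n^(-1/5)."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    spread = min(statistics.stdev(xs), (q3 - q1) / 1.34)
    return 0.9 * spread * len(xs) ** (-1 / 5)

def kde_mode(xs, lo, hi, points=300):
    """Grid point in [lo, hi] with the highest Gaussian kernel density."""
    h = silverman_bandwidth(xs)
    norm = len(xs) * h * math.sqrt(2 * math.pi)
    best_x, best_density = lo, -1.0
    for k in range(points):
        x = lo + (hi - lo) * k / (points - 1)
        density = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs) / norm
        if density > best_density:
            best_x, best_density = x, density
    return best_x

def quantile(values, q):
    """Value at fraction q of the sorted sample (simple index-based cutoff)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(q * len(s)))]

def normalization_factor(fusion, dam, lo=-5.0, hi=5.0):
    """Return (scale factor for Dam-fusion values, mode of the log2 ratio)."""
    # Assumption: drop fragments with no reads in either sample.
    pairs = [(f, d) for f, d in zip(fusion, dam) if f > 0 and d > 0]
    fusion_cut = quantile([f for f, _ in pairs], 0.9)  # top Dam-fusion decile
    dam_cut = quantile([d for _, d in pairs], 0.3)     # first three Dam deciles
    ratios = [math.log2(f / d) for f, d in pairs
              if f < fusion_cut and d >= dam_cut]
    mode = kde_mode(ratios, lo, hi)
    return 1.0 / 2 ** mode, mode

# Synthetic per-fragment counts (hypothetical): background Dam-fusion signal is
# half the Dam signal with log-normal noise; every 10th fragment is "bound".
rng = random.Random(42)
dam, fusion = [], []
for i in range(1000):
    d = 0 if i % 50 == 7 else rng.randint(1, 200)
    f = d * 0.5 * 2 ** rng.gauss(0, 0.2)
    if i % 10 == 0:
        f *= 16
    dam.append(d)
    fusion.append(f)

factor, mode = normalization_factor(fusion, dam)
normalized_fusion = [f * factor for f in fusion]
print(f"mode of log2 ratio: {mode:.2f}, scale factor: {factor:.2f}")
```

Because the toy background ratio is centered on log2(0.5), the estimated mode should lie near -1 and the scale factor near 2, which brings the background log2 ratio back toward 0 while leaving the bound fragments enriched.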

3 Implementation

The damidseq_pipeline software is implemented in Perl, and will process multiple single-end read sequencing files in FASTQ or BAM format. The pipeline can match sequencing adaptors to sample names, automatically identifies the Dam-only control, and performs alignment, read-length extension, normalization, background reduction and ratio file generation (see Supplementary Methods for details). A large number of user-configurable options are provided, including the ability to adjust the normalization algorithm parameters, generate read-count normalized files and add a user-specified number of pseudocounts. Parameters specified on the command-line can be saved as defaults if the user desires. The damidseq_pipeline software is open-source and freely available at http://owenjm.github.io/damidseq_pipeline. A detailed set of installation and usage instructions is provided at the above website, along with a small example dataset.

9 in total

1. Identification of in vivo DNA targets of chromatin proteins using tethered dam methyltransferase.

Authors: B van Steensel; S Henikoff
Journal: Nat Biotechnol Date: 2000-04 Impact factor: 54.908

2. Sex- and tissue-specific functions of Drosophila doublesex transcription factor target genes.

Authors: Emily Clough; Erin Jimenez; Yoo-Ah Kim; Cale Whitworth; Megan C Neville; Leonie U Hempel; Hania J Pavlou; Zhen-Xia Chen; David Sturgill; Ryan K Dale; Harold E Smith; Teresa M Przytycka; Stephen F Goodwin; Mark Van Doren; Brian Oliver
Journal: Dev Cell Date: 2014-12-22 Impact factor: 12.270

3. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells.

Authors: Guillaume J Filion; Joke G van Bemmel; Ulrich Braunschweig; Wendy Talhout; Jop Kind; Lucas D Ward; Wim Brugman; Inês J de Castro; Ron M Kerkhoven; Harmen J Bussemaker; Bas van Steensel
Journal: Cell Date: 2010-09-30 Impact factor: 41.582

4. Drosophila COP9 signalosome subunit 7 interacts with multiple genomic loci to regulate development.

Authors: Ruth Singer; Shimshi Atar; Osnat Atias; Efrat Oron; Daniel Segal; Joel A Hirsch; Tamir Tuller; Amir Orian; Daniel A Chamovitz
Journal: Nucleic Acids Res Date: 2014-08-08 Impact factor: 16.971

5. RUNX1 positively regulates a cell adhesion and migration program in murine hemogenic endothelium prior to blood emergence.

Authors: Michael Lie-A-Ling; Elli Marinopoulou; Yaoyong Li; Rahima Patel; Monika Stefanska; Constanze Bonifer; Crispin Miller; Valerie Kouskoff; Georges Lacaud
Journal: Blood Date: 2014-07-31 Impact factor: 22.113

6. Common binding by redundant group B Sox proteins is evolutionarily conserved in Drosophila.

Authors: Sarah H Carl; Steven Russell
Journal: BMC Genomics Date: 2015-04-13 Impact factor: 3.969

7. Cell-type-specific profiling of gene expression and chromatin binding without cell isolation: assaying RNA Pol II occupancy in neural stem cells.

Authors: Tony D Southall; Katrina S Gold; Boris Egger; Catherine M Davidson; Elizabeth E Caygill; Owen J Marshall; Andrea H Brand
Journal: Dev Cell Date: 2013-06-20 Impact factor: 12.270

8. Prospero acts as a binary switch between self-renewal and differentiation in Drosophila neural stem cells.

Authors: Semil P Choksi; Tony D Southall; Torsten Bossing; Karin Edoff; Elzo de Wit; Bettina E Fischer; Bas van Steensel; Gos Micklem; Andrea H Brand
Journal: Dev Cell Date: 2006-12 Impact factor: 12.270

9. Spatial compartmentalization at the nuclear periphery characterized by genome-wide mapping.

Authors: Feinan Wu; Jie Yao
Journal: BMC Genomics Date: 2013-08-30 Impact factor: 3.969
