
Single nucleotide-level mapping of DNA double-strand breaks in human HEK293T cells.

Bernard J Pope¹, Khalid Mahmood¹, Chol-Hee Jung¹, Peter Georgeson¹, Daniel J Park².

Abstract

Constitutional biological processes involve the generation of DNA double-strand breaks (DSBs). The production of such breaks and their subsequent resolution are also highly relevant to neurodegenerative diseases and cancer, in which extensive DNA fragmentation has been described (Stephens et al., 2011; Blondet et al., 2001). Tchurikov et al. (2011, 2013) have reported previously that frequent sites of DSBs occur in chromosomal domains involved in the co-ordinated expression of genes. This group report that hot spots of DSBs in human HEK293T cells often coincide with H3K4me3 marks, which are associated with active transcription (Kravatsky et al., 2015), and that frequent sites of DNA double-strand breakage are likely to be relevant to cancer genomics (Tchurikov et al., 2013, 2016). Recently, they applied a RAFT (rapid amplification of forum termini) protocol that selects for blunt-ended DSB sites and mapped these to the human genome within defined co-ordinate 'windows'. In this paper, we re-analyse public RAFT data to derive sites of DSBs at the single-nucleotide level across the reference genome build for human HEK293T cells (https://figshare.com/s/35220b2b79eaaaf64ed8). This refined mapping, combined with accessory ENCODE data tracks and ribosomal DNA-related sequence annotations, will likely be of value for the design of clinically relevant targeted assays such as those for cancer susceptibility, diagnosis, treatment-matching and prognostication.


Keywords: Double-strand breaks; Forum domains; Fragile sites; HEK293T; Human genome

Year: 2016 PMID: 27942458 PMCID: PMC5133665 DOI: 10.1016/j.gdata.2016.11.007

Source DB: PubMed Journal: Genom Data ISSN: 2213-5960

Direct link to deposited data

https://figshare.com/s/35220b2b79eaaaf64ed8

Experimental design, materials and methods

Sequencing data

The FASTQ file for Illumina Genome Analyzer IIx (GAIIx) run accession SRR944107 (single-end reads) was downloaded from http://www.ebi.ac.uk/ena/data/view/SRR944107, having sourced the accession code via http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49302. The origins of these data have been reported previously [12]. Briefly, HEK293T cells were suspended in 1% low-melt agarose prior to lysis. DNA was then fractionated by gel electrophoresis and collected by electroelution. Free DNA ends (sites of DSBs) were ligated to a double-stranded biotinylated adapter oligonucleotide before digestion with the restriction endonuclease Sau3AI. DSB site-containing termini were phase-purified using streptavidin paramagnetic particles, eluted via EcoRI restriction endonuclease digestion and then subjected to Sau3AI site adapter ligation and PCR amplification. PCR products were ligated to Illumina adapters, allowing them to be represented in either orientation. Library fragments of ~ 200–400 bp (insert plus adapter and PCR primer sequences) were band isolated from agarose gels and the purified libraries were sequenced in single-ended fashion using the Illumina Genome Analyzer IIx sequencing platform.

Data processing

Fig. 1 provides a schematic representation of our bioinformatic analysis pipeline. Specifications are summarised in Table 1. In the first step, we used our custom software (raft_fastq_2sites_parse.py; Table 1) to produce a modified representation of the downloaded FASTQ file (SRR944107.fastq). This tool is available at https://github.com/djpark1974/raft_hotspots_se. Briefly, it filters reads based on the observation of expected arrangements of adapter sequences, with the stringent requirement that both adapters be evident in a given read. Reads exhibiting evidence of ligation artefacts or insufficient evidence of expected adapter sequences were removed. Accepted reads were processed to trim adapter sequences, and those with library inserts greater than or equal to 25 nucleotides in length were retained and transformed to orient the DSB site at the start.
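
The read-filtering logic described above can be illustrated with a minimal Python sketch. This is not the authors' raft_fastq_2sites_parse.py. The adapter sequences below are hypothetical placeholders, and the assumed read layout (DSB-end adapter, insert, Sau3AI-end adapter, in either orientation) and the rule that a repeated adapter signals a ligation artefact are assumptions made only for illustration. The sketch handles the sequence line alone; a real FASTQ transformation would also have to trim and reverse the quality string.

```python
from typing import Optional

# Hypothetical placeholder adapters (NOT the real RAFT adapter sequences).
DSB_ADAPTER = "GTCAGCTTACGA"   # assumed adapter ligated to the DSB end
SITE_ADAPTER = "CATGGAGTTCCA"  # assumed adapter ligated to the Sau3AI end
MIN_INSERT = 25                # inserts of >= 25 nt are retained (from the text)

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def _extract(seq: str) -> Optional[str]:
    """Return the insert if seq reads DSB adapter, insert, site adapter."""
    # Assumption: an adapter that is absent or occurs more than once
    # indicates insufficient evidence or a ligation artefact.
    if seq.count(DSB_ADAPTER) != 1 or seq.count(SITE_ADAPTER) != 1:
        return None
    start = seq.find(DSB_ADAPTER) + len(DSB_ADAPTER)
    end = seq.find(SITE_ADAPTER)
    if end < start:
        return None
    return seq[start:end]


def orient_insert(read: str) -> Optional[str]:
    """Trim adapters and return the insert with the DSB end first, or None."""
    # Library fragments can be sequenced in either orientation, so the
    # reverse complement of the read is tried as well.
    for candidate in (read, revcomp(read)):
        insert = _extract(candidate)
        if insert is not None and len(insert) >= MIN_INSERT:
            return insert
    return None
```

With this design, a read sequenced from the Sau3AI end is reverse-complemented before trimming, so every retained insert begins at the putative DSB site.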

Fig. 1

Schematic illustration of our bioinformatic analysis pipeline to derive counts of DSBs by co-ordinate across genome-build hg19 concatenated with rDNA contiguous sequence U13369.1.

Table 1

Materials, data, tools and resources employed in the present study.

Systems and resources	Specifications
Sequencing platform	GAIIx single-read (SRR944107.fastq)
Cell line	Human HEK293T cells
Sequencing library	RAFT-seq
Reference files	hg19.fa;U13369.1.fa;ENCFF001TDO.bed;hg19_rmsk.bed;hg19_GATC5.bed
Data processing software	raft_fastq_2sites_parse.py;bwa (0.7.5a);samtools (1.3.1);bedtools (2.17.0);raft_bed_2sites_parse.py

The rDNA contiguous sequence U13369.1 was concatenated with human reference genome build hg19 (reference files U13369.1.fa and hg19.fa; Table 1), and the combined reference was indexed using BWA (version 0.7.5a) [4]. Reads of the transformed FASTQ file were then mapped to this combined reference using BWA. SAMtools (version 1.3.1) [5] was used to convert from SAM file format to BAM file format and to sort the resulting BAM file. BEDtools (version 2.17.0) [7] was then employed to produce a BED file representing the mapping, including CIGAR string information and mapping orientation.

To reduce false positives resulting from mapping artefacts, we filtered out reads that overlapped with ENCODE project [3] blacklist regions and RepeatMasker-derived repetitive regions. The file used for this filter was created by sorting a concatenation of the hg19 co-ordinate-associated files ENCFF001TDO.bed and hg19_rmsk.bed (Table 1).

We then used our custom software (raft_bed_2sites_parse.py; Table 1; available at https://github.com/djpark1974/raft_hotspots_se) to further filter the data and to count the number of observations of DSBs at each co-ordinate in the filtered BED file. Briefly, this tool assesses the orientation of mapping for each read. Since we presented the DSB at the beginning of each read prior to mapping, we can determine the exact location of the DSB at the single nucleotide level for each read. This tool also performs additional filtering steps. Reads that mapped in either orientation were treated as likely to be erroneous if the CIGAR string showed evidence of clipping at either terminus. Additionally, we required reads to exhibit mapping qualities (MQs) of greater than 40 for them to be included in our DSB site counting.
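
The orientation-aware counting step can be sketched as follows. This is an illustrative Python sketch, not the authors' raft_bed_2sites_parse.py. It assumes a seven-column BED line of the kind BEDtools produces when CIGAR output is requested (chromosome, 0-based start, end, read name, mapping quality, strand, CIGAR). It also assumes that the DSB site is reported as the 1-based co-ordinate of the first base of the read: the leftmost aligned base for a forward-strand read and the rightmost aligned base for a reverse-strand read.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

MQ_THRESHOLD = 40              # reads must have MQ > 40 (from the text)
CLIPPING = re.compile(r"[SH]")  # soft or hard clipping in a CIGAR string


def dsb_site(bed_line: str) -> Optional[Tuple[str, int]]:
    """Return (chromosome, 1-based DSB co-ordinate), or None if filtered."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end, _name, mapq, strand, cigar = fields[:7]
    if int(mapq) <= MQ_THRESHOLD:
        return None
    if CLIPPING.search(cigar):
        return None
    if strand == "+":
        # Read start maps to the left end of the alignment.
        return chrom, int(start) + 1
    if strand == "-":
        # Read start maps to the right end of the alignment.
        return chrom, int(end)
    return None


def count_dsb_sites(lines: Iterable[str]) -> Counter:
    """Count retained reads per (chromosome, DSB co-ordinate)."""
    counts: Counter = Counter()
    for line in lines:
        site = dsb_site(line)
        if site is not None:
            counts[site] += 1
    return counts
```

In this sketch, a forward-strand read aligned to 0-based positions 99 to 130 and a reverse-strand read aligned to 69 to 100 both contribute to 1-based co-ordinate 100, because both reads begin at that base.
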
For increased specificity, we removed DSB sites located within 5 base pairs of a Sau3AI consensus site (GATC); the file of such regions, hg19_GATC5.bed (Table 1), was derived via a custom Python script. To enable detailed downstream analyses, we have supplemented the co-ordinate-DSB-count data with annotation derived from the ENCODE project and BLAST alignment scores derived from aligning U13369.1 to the human genome. ENCODE annotations are recorded with the identity and proximity of respective ENCODE elements, and BLAST alignment scores provide the highest scoring sequence similarity match for a contiguous sequence spanning a given co-ordinate. Fig. 2 illustrates the frequency of DSBs at single nucleotide resolution sites across the hg19 reference human genome.
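
The Sau3AI proximity filter described above can be illustrated with a short Python sketch. The custom script used in the study is not reproduced here. This version rests on one assumption: "within 5 base pairs" is taken to mean the GATC site itself plus 5 flanking bases on each side. The function names are hypothetical.

```python
from typing import List, Tuple

SITE = "GATC"  # Sau3AI consensus site
FLANK = 5      # assumed flank, in bases, on each side of the site


def gatc_windows(seq: str, flank: int = FLANK) -> List[Tuple[int, int]]:
    """Return merged 0-based half-open intervals around each GATC site."""
    seq = seq.upper()
    windows: List[Tuple[int, int]] = []
    pos = seq.find(SITE)
    while pos != -1:
        start = max(0, pos - flank)
        end = min(len(seq), pos + len(SITE) + flank)
        if windows and start <= windows[-1][1]:
            # Overlapping or adjacent to the previous window: merge.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
        pos = seq.find(SITE, pos + 1)
    return windows


def near_gatc(position_1based: int, windows: List[Tuple[int, int]]) -> bool:
    """True if a 1-based co-ordinate falls inside any window."""
    p = position_1based - 1
    return any(start <= p < end for start, end in windows)
```

The intervals use the 0-based half-open convention of BED files, so they could be written out directly as BED records for one chromosome. DSB sites for which near_gatc returns True would be removed.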

Fig. 2

Circos plot depicting relative DSB counts by co-ordinate for chromosome 19 of human genome-build hg19. The outer numbers indicate co-ordinates in megabases along the chromosome. Black bars indicate gene regions. The red portion indicates centromeric DNA.

Discussion

Here, we present the relative frequencies of DSBs across the human reference genome for HEK293T cells, at single nucleotide resolution. Since DNA strand breakage and genomic rearrangements are highly relevant to cancer and other diseases [2], [8], [14], it is probable that our new data will have utility for the development of clinically important diagnostic tests. The highest ranking DSB regions reported by Tchurikov et al. [12] for the SRR944107.fastq dataset predominantly relate to regions that would be likely to present problems to short read mapping software, such as satellite sequences. In an attempt to reduce mapping-related artefacts, we have elected to remove regions known to result in low-confidence mapping from our analysis. Top ranking single nucleotide-resolved DSB sites resulting from our analysis relate to regions listed previously as enriched for DSBs, albeit to a lesser extent than reported for numerous low-complexity (and low confidence) sequence regions. Our data relate to a particular subgroup of DSBs. The RAFT protocol from which our data are derived is theoretically enriched for blunt-ended forum domain termini, previously shown to be associated with transcriptional control [6], [10], [11]. They will be biased towards termini that occur within a particular range of genomic distances from a Sau3AI restriction endonuclease site. Future protocols that make use of multiple restriction endonucleases for cleavage following the initial ligation step, as alternatives to (and as well as) Sau3AI, should mitigate this to a large extent. Our data are further biased towards genomic regions that can be mapped unambiguously. We have applied high-stringency thresholding on mapping quality as part of our algorithm and discarded library elements that could not be uniquely assigned to a single genomic location with high confidence and, as such, repetitive genomic elements harbouring DSB sites will not be represented. 
It should be noted that the data we present relate to human HEK293T cells. Other cell-types will likely exhibit differences in their RAFT-detectable DSB profiles due to variations in higher-order chromosomal architecture and DNA cleavage-inducing enzyme activity. These differences will be elucidated with the expansion of studies to a range of cell and tissue types. We have supplemented the profiling of the relative frequency of DSB sites in HEK293T cells with ENCODE-derived annotation [3], including regional information pertaining to important transcription factor binding sites and other marks of gene regulation, regions of DNaseI hypersensitivity and repetitive elements. Further, we provide annotation in the form of sequence similarity scores, derived from BLAST analysis [1], for sites that occur in regions with high similarity to human ribosomal DNA, since such sequences are known to include hot spots for DSBs and present particular mapping challenges due to their representation at high copy number at multiple sites in the genome. This information should assist with the selection of suitable targets for diagnostic test design, allowing the user optionally to avoid sites that present excessive mapping difficulties or to focus on regions associated with particular genomic marks, for example. The refined characterisation of the propensity for particular types of DSBs, such as those identified by the RAFT procedure, across the human genome will likely allow more efficient assessment of genomic ‘scarring’ for an individual. This should be highly relevant to clinical management approaches such as risk stratification for particular types of cancer and treatment response prediction. As such, the use of these data has the potential to be beneficial to the reduction of disease associated mortality and morbidity.

12 in total

1. Transient massive DNA fragmentation in nervous system during the early course of a murine neurodegenerative disease.

Authors: B Blondet; A Aït-Ikhlef; M Murawsky; F Rieger
Journal: Neurosci Lett Date: 2001-06-15

2. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08

3. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28

4. A user's guide to the encyclopedia of DNA elements (ENCODE).

Authors: ENCODE Project Consortium
Journal: PLoS Biol Date: 2011-04-19

5. Massive genomic rearrangement acquired in a single catastrophic event during cancer development.

Authors: Philip J Stephens; Chris D Greenman; Beiyuan Fu; Fengtang Yang; Graham R Bignell; Laura J Mudie; Erin D Pleasance; King Wai Lau; David Beare; Lucy A Stebbings; Stuart McLaren; Meng-Lay Lin; David J McBride; Ignacio Varela; Serena Nik-Zainal; Catherine Leroy; Mingming Jia; Andrew Menzies; Adam P Butler; Jon W Teague; Michael A Quail; John Burton; Harold Swerdlow; Nigel P Carter; Laura A Morsberger; Christine Iacobuzio-Donahue; George A Follows; Anthony R Green; Adrienne M Flanagan; Michael R Stratton; P Andrew Futreal; Peter J Campbell
Journal: Cell Date: 2011-01-07

6. Genome-wide profiling of forum domains in Drosophila melanogaster.

Authors: Nickolai A Tchurikov; Olga V Kretova; Dmitri V Sosin; Ivan A Zykov; Igor F Zhimulev; Yuri V Kravatsky
Journal: Nucleic Acids Res Date: 2011-01-18

7. Mapping of genomic double-strand breaks by ligation of biotinylated oligonucleotides to forum domains: Analysis of the data obtained for human rDNA units.

Authors: N A Tchurikov; O V Kretova; D M Fedoseeva; V R Chechetkin; M A Gorbacheva; A A Karnaukhov; G I Kravatskaya; Y V Kravatsky
Journal: Genom Data Date: 2014-11-12

8. Hot spots of DNA double-strand breaks in human rDNA units are produced in vivo.

Authors: Nickolai A Tchurikov; Dmitry V Yudkin; Maria A Gorbacheva; Anastasia I Kulemzina; Irina V Grischenko; Daria M Fedoseeva; Dmitri V Sosin; Yuri V Kravatsky; Olga V Kretova
Journal: Sci Rep Date: 2016-05-10

9. Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2010-01-15

10. DNA double-strand breaks coupled with PARP1 and HNRNPA2B1 binding sites flank coordinately expressed domains in human chromosomes.

Authors: Nickolai A Tchurikov; Olga V Kretova; Daria M Fedoseeva; Dmitri V Sosin; Sergei A Grachev; Marina V Serebraykova; Svetlana A Romanenko; Nadezhda V Vorobieva; Yuri V Kravatsky
Journal: PLoS Genet Date: 2013-04-04