
CGAT: computational genomics analysis toolkit.

David Sims, Nicholas E Ilott, Stephen N Sansom, Ian M Sudbery, Jethro S Johnson, Katherine A Fawcett, Antonio J Berlanga-Taylor, Sebastian Luna-Valero, Chris P Ponting, Andreas Heger.

Abstract

Computational genomics seeks to draw biological inferences from genomic datasets, often by integrating and contextualizing next-generation sequencing data. CGAT provides an extensive suite of tools designed to assist in the analysis of genome scale data from a range of standard file formats. The toolkit enables filtering, comparison, conversion, summarization and annotation of genomic intervals, gene sets and sequences. The tools can both be run from the Unix command line and installed into visual workflow builders, such as Galaxy.


Year: 2014 PMID: 24395753 PMCID: PMC3998125 DOI: 10.1093/bioinformatics/btt756

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

A central task in computational genomics is to extract biologically meaningful summaries and annotations from short read sequences to facilitate both visualization and statistical analysis. Commonly, this process starts by mapping next-generation sequencing (NGS) reads and quantitating their distribution in genomic features such as transcripts with expression level and transcription factor binding sites with peak scores. This initial contextualization phase is well supported by specialized tools such as Tophat/Cufflinks (Trapnell ) or MACS (Feng ). In a second phase, datasets are typically integrated to allow interpretation, asking, for example, how many transcription factor binding sites are associated with each of exonic, intronic, flanking and intergenic genomic annotations. This phase necessarily relies on computational tools that can describe, integrate and summarize a variety of feature files produced from the initial phase and external annotation sources. Here we introduce a collection of tools that assist genomic scientists in successfully performing this crucial data integration and interpretation phase, bridging the gap from raw data to biologically interpretable results. We have made extensive use of these tools in a number of NGS projects (Long , Rajan , Ramagopalan ).

2 OVERVIEW

The computational genomics analysis toolkit comprises >50 tools, each with documentation and examples. Tools are tagged to facilitate discovery. Tags associate tools with broad themes (genomic intervals, gene sets, sequences), standard genomic file formats (BED, GTF, BAM, FASTA/Q) and the type of computation performed by the tool, such as statistical summary, format conversion, annotation, comparison or filtering. As an illustrative example, a gene set can be annotated with the tool gtf2table. In fact, gtf2table provides >25 different methods to annotate transcript models. Annotation is dependent on auxiliary data: given a genome sequence, transcripts can be annotated by composition (e.g. %GC); given a reference gene set, transcripts can be marked as fragments or extensions, enabling the user to ascertain the completeness of transcript models built from RNA-seq data. Given a BAM file with NGS read data, gtf2table can compute coverage in sense/antisense direction over transcript models. As another example, bam2geneprofile computes and plots metagene profiles from mapped NGS read data in BAM format (Fig. 1a). Different metagene models (with/without UTRs/introns, etc.) and various normalization options are available. Finally, the tool bam2peakshape computes read densities in specified genomic intervals to generate matrix data suitable for visualization in heatmaps (Fig. 1b). The toolkit also contains standard sequence analysis utilities such as fasta2table, which annotates sequences with CpG frequencies, codon frequencies and amino acid composition. To assist the interpretation of NGS data, the toolkit implements various classification schemes for transcript data or interval data. RNA-seq-derived transcripts can be marked as instances, fragments, extensions or alternative versions of transcripts in a reference gene set. Chromatin immunoprecipitation-sequencing (ChIP-seq) intervals can be marked as intronic, intergenic or within the UTR, upstream or downstream regions of transcript models. Finally, the toolkit provides tools to summarize genomic datasets, reporting the number of intervals or transcripts per chromosome, size distributions of features and more.
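To make the idea of annotating sequences by composition concrete, the following is a minimal sketch in plain Python of the kind of per-sequence summary described above (%GC and CpG content). It is not the implementation used by fasta2table or gtf2table; the function names `read_fasta` and `sequence_composition`, the output keys and the observed/expected CpG formula are assumptions chosen for illustration.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_composition(seq):
    """Return length, GC fraction, CpG count and CpG observed/expected ratio."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Observed/expected CpG (illustrative formula): (#CG * N) / (#C * #G).
    cpg_oe = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc": gc_fraction, "cpg_count": cpg, "cpg_oe": cpg_oe}


if __name__ == "__main__":
    example = [">seq1 toy example", "ACGT", "CGGG", ">seq2", "AAAA"]
    for identifier, sequence in read_fasta(example):
        print(identifier, sequence_composition(sequence))
```

One row per sequence, keyed by identifier, is the tabular shape that the "2table" naming scheme suggests; the real tools offer many more counters than this sketch.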

Fig. 1.

Visualization of the output of (a) bam2geneprofile and (b) bam2peakshape
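The metagene profile shown in panel (a) can be thought of as an average of read counts over genes that have been rescaled to a common length. The sketch below illustrates that idea in plain Python under strong simplifying assumptions: a single chromosome, reads reduced to one position each, genes treated as unspliced intervals, and no normalization for sequencing depth. The function name `metagene_profile` and its parameters are hypothetical, and the metagene models and normalization options of bam2geneprofile are not reproduced here.

```python
from bisect import bisect_left


def metagene_profile(genes, read_positions, n_bins=10):
    """Average per-bin read counts over genes scaled to a common length.

    genes: list of (start, end, strand) half-open intervals on one chromosome.
    read_positions: iterable of 0-based read positions on the same chromosome.
    Returns a list of n_bins mean counts, ordered 5' to 3' along the gene.
    """
    positions = sorted(read_positions)
    totals = [0.0] * n_bins
    used = 0
    for start, end, strand in genes:
        length = end - start
        if length < n_bins:
            continue  # too short to split into n_bins bins
        counts = [0] * n_bins
        # Binary search for the reads falling inside [start, end).
        first = bisect_left(positions, start)
        last = bisect_left(positions, end)
        for pos in positions[first:last]:
            counts[(pos - start) * n_bins // length] += 1
        if strand == "-":
            counts.reverse()  # orient minus-strand genes 5' to 3'
        for i, count in enumerate(counts):
            totals[i] += count
        used += 1
    return [total / used for total in totals] if used else totals


if __name__ == "__main__":
    genes = [(0, 100, "+"), (200, 300, "-")]
    reads = [5, 15, 95, 205, 295, 299]
    print(metagene_profile(genes, reads, n_bins=10))
```

Reversing the bins for minus-strand genes is what lets profiles from both strands be averaged in a common 5' to 3' orientation.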

3 USAGE

We introduce the usage of the computational genomics analysis toolkit with a brief example. The fully worked example can be found online. Given a set of transcription factor binding intervals from a ChIP-seq experiment in BED format (nfkb.bed), we wish to determine how many binding intervals lie within exons, introns or intergenic sequence using a reference gene set from GENCODE (Harrow ), in GTF format (hg19.gtf). We then want to plot the density of binding relative to transcript models and examine the chromatin signatures within the intervals:

    cgat gtf2gtf --sort=gene < hg19.gtf
    | cgat gtf2gtf --merge-exons-with-utr (1)
    | cgat gtf2gtf --filter=longest-gene (2)
    | cgat gtf2gff --flank=5000 --method=genome (3)
    > annotations.gff
    cgat bed2table --filename-gff=annotations.gff --counter=classifier-chipseq < nfkb.bed > annotated_peaks.tsv (4)

The above sequence of Unix commands in turn (1) merges all exons of alternative transcripts per gene, (2) retains the longest gene in the case of overlapping genes and (3) annotates exonic, intronic, intergenic and flanking regions (size = 5 kb) within and between genes. Choosing different options can provide different sets of answers. Instead of merging all exons per gene, the longest transcript might be selected by replacing (2) with gtf2gtf --filter=longest-transcript. Note that the creation of annotations.gff goes beyond simple interval intersection, as gene structures are normalized from multiple possible alternative transcripts to a single transcript that is chosen by the user depending on what is most relevant for the analysis. The generated annotations are then used to classify the transcription factor binding sites using bed2table (4). The profile of ChIP-seq binding over genes can be calculated and plotted using bam2geneprofile (Fig. 1a). Chromatin state at ChIP-seq peaks can be investigated by integrating H3K4me1 and H3K4me3 data for a relevant tissue (ENCODE Project Consortium ) using bam2peakshape and plotted in R (R Core Team, 2012) (Fig. 1b). Statistical significance can be assessed using tools such as GAT (Heger ). More usage examples, including testing for functional enrichment, assessment of CpG content in long non-coding RNA promoters and clustering metagenomic contigs on tetranucleotide frequency, can be found online.
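The classification step in the example above can be illustrated with a small, self-contained sketch. It assigns each binding interval to the annotation class that covers most of its bases and then counts intervals per class. This is only one plausible rule and is not claimed to match the classifier-chipseq counter of bed2table; the function names, the "unannotated" label and the single-chromosome tuple representation are all assumptions made for illustration.

```python
from collections import Counter


def classify_interval(start, end, annotations):
    """Return the annotation label covering most bases of [start, end).

    annotations: list of (start, end, label) half-open intervals on the
    same chromosome as the query interval.
    """
    overlap = Counter()
    for a_start, a_end, label in annotations:
        shared = min(end, a_end) - max(start, a_start)
        if shared > 0:
            overlap[label] += shared
    if not overlap:
        return "unannotated"  # hypothetical label for intervals with no overlap
    # Largest overlap wins; ties are broken alphabetically for determinism.
    return min(overlap, key=lambda label: (-overlap[label], label))


def summarize(peaks, annotations):
    """Count how many peaks fall into each annotation class."""
    return Counter(classify_interval(s, e, annotations) for s, e in peaks)


if __name__ == "__main__":
    annotations = [
        (0, 1000, "intergenic"),
        (1000, 6000, "flank"),
        (6000, 6200, "exon"),
        (6200, 7000, "intron"),
        (7000, 7300, "exon"),
    ]
    peaks = [(100, 300), (5900, 6150), (6150, 6400), (6900, 7200), (9000, 9100)]
    for label, count in sorted(summarize(peaks, annotations).items()):
        print(label, count)
```

A linear scan over all annotations per peak is adequate for a toy example; genome-scale data would call for an indexed or sorted-sweep approach.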

4 IMPLEMENTATION

We aim to write legible and maintainable code that can serve as an entry point into computational methods for biologists. The toolkit is implemented in the Python language (van Rossum, 1995). Some performance-critical sections have been implemented in Cython (Behnel ). The toolkit can be installed from common Python package repositories. Dependencies will be installed automatically, although some tools require external software to be installed. All tools are freely available under the BSD 3-clause licence. The toolkit is under constant development, and community involvement in the project is welcome. Regression tests ensure that core functionality is maintained as scripts are extended. All tools are built using a common coding style and follow a naming scheme centred on common genomic file formats. The tools have a consistent command line interface enabling them to be combined into work flows using Unix pipes and integrated into automated pipelines allowing automated and parallel execution. They use a consistent logging mechanism to facilitate issue tracking. Furthermore, the use of common genomic formats means that tools can be easily combined with other popular genomic software such as BEDtools (Quinlan and Hall, 2010), University of California, Santa Cruz tools (Kuhn ) or biopieces, http://www.biopieces.org. An RDF (Resource Description Framework, http://www.w3.org/RDF) description of each tool can be generated for use with tools such as CLI-MATE (Tatum ) to generate XML files for a variety of workflow frameworks, such as Galaxy (Goecks ).

Funding: This work was funded by the UK Medical Research Council.

Conflict of Interest: none declared.

REFERENCES

1. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks.

Authors: Cole Trapnell; Adam Roberts; Loyal Goff; Geo Pertea; Daehwan Kim; David R Kelley; Harold Pimentel; Steven L Salzberg; John L Rinn; Lior Pachter
Journal: Nat Protoc Date: 2012-03-01 Impact factor: 13.491

2. Identifying ChIP-seq enrichment using MACS.

Authors: Jianxing Feng; Tao Liu; Bo Qin; Yong Zhang; Xiaole Shirley Liu
Journal: Nat Protoc Date: 2012-08-30 Impact factor: 13.491

3. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

4. A ChIP-seq defined genome-wide map of vitamin D receptor binding: associations with disease and evolution.

Authors: Sreeram V Ramagopalan; Andreas Heger; Antonio J Berlanga; Narelle J Maugeri; Matthew R Lincoln; Amy Burrell; Lahiru Handunnetthi; Adam E Handel; Giulio Disanto; Sarah-Michelle Orton; Corey T Watson; Julia M Morahan; Gavin Giovannoni; Chris P Ponting; George C Ebers; Julian C Knight
Journal: Genome Res Date: 2010-08-24 Impact factor: 9.043

5. GENCODE: the reference human genome annotation for The ENCODE Project.

Authors: Jennifer Harrow; Adam Frankish; Jose M Gonzalez; Electra Tapanari; Mark Diekhans; Felix Kokocinski; Bronwen L Aken; Daniel Barrell; Amonida Zadissa; Stephen Searle; If Barnes; Alexandra Bignell; Veronika Boychenko; Toby Hunt; Mike Kay; Gaurab Mukherjee; Jeena Rajan; Gloria Despacio-Reyes; Gary Saunders; Charles Steward; Rachel Harte; Michael Lin; Cédric Howald; Andrea Tanzer; Thomas Derrien; Jacqueline Chrast; Nathalie Walters; Suganthi Balasubramanian; Baikang Pei; Michael Tress; Jose Manuel Rodriguez; Iakes Ezkurdia; Jeltje van Baren; Michael Brent; David Haussler; Manolis Kellis; Alfonso Valencia; Alexandre Reymond; Mark Gerstein; Roderic Guigó; Tim J Hubbard
Journal: Genome Res Date: 2012-09 Impact factor: 9.043

6. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

7. An integrated encyclopedia of DNA elements in the human genome.

Authors: ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

8. Epigenetic conservation at gene regulatory elements revealed by non-methylated DNA profiling in seven vertebrates.

Authors: Hannah K Long; David Sims; Andreas Heger; Neil P Blackledge; Claudia Kutter; Megan L Wright; Frank Grützner; Duncan T Odom; Roger Patient; Chris P Ponting; Robert J Klose
Journal: Elife Date: 2013-02-26 Impact factor: 8.140

9. The UCSC genome browser and associated tools.

Authors: Robert M Kuhn; David Haussler; W James Kent
Journal: Brief Bioinform Date: 2012-08-20 Impact factor: 11.622

10. GAT: a simulation framework for testing the association of genomic intervals.

Authors: Andreas Heger; Caleb Webber; Martin Goodson; Chris P Ponting; Gerton Lunter
Journal: Bioinformatics Date: 2013-06-18 Impact factor: 6.937
