Pybedtools: a flexible Python library for manipulating genomic datasets and annotations.

Ryan K Dale, Brent S Pedersen, Aaron R Quinlan.

Abstract

SUMMARY: pybedtools is a flexible Python software library for manipulating and exploring genomic datasets in many common formats. It provides an intuitive Python interface that extends the popular BEDTools genome arithmetic tools. The library is well documented and efficient, and allows researchers to quickly develop simple yet powerful scripts that enable complex genomic analyses.

AVAILABILITY: pybedtools is maintained under the GPL license. Stable versions of pybedtools as well as documentation are available on the Python Package Index at http://pypi.python.org/pypi/pybedtools.

CONTACT: dalerr@niddk.nih.gov; arq5x@virginia.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year:  2011        PMID: 21949271      PMCID: PMC3232365          DOI: 10.1093/bioinformatics/btr539

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Due to advances in DNA sequencing technologies, genomic datasets are rapidly expanding in size and complexity (Stein, 2010). It is now clear that the primary bottleneck in genomics is data analysis and interpretation, not data generation. Therefore, researchers depend upon fast, flexible ‘genome arithmetic’ tools for interrogating and comparing diverse datasets of genome features. For example, genome arithmetic is used to interpret results from whole-genome sequencing, ChIP-seq and RNA-seq experiments by integrating experimental datasets with genes, genetic variation and the wealth of existing genome annotations (1000 Genomes Project Consortium, 2010; ENCODE Project Consortium, 2011). These analyses are complicated by the fact that they are often done via custom scripts or one-off manipulations that are inefficient and difficult to reproduce and maintain. Tools designed to manipulate, intersect and annotate these datasets in commonly used formats greatly facilitate such analyses and provide a consistent framework for reproducible research.

Here we introduce pybedtools, which extends the BEDTools (Quinlan and Hall, 2010) genome arithmetic utilities by providing a powerful interface combining the benefits of Python scripting and the BEDTools libraries. Using a simple syntax, it allows researchers to analyze datasets in BED (Kent et al., 2002), VCF (Danecek et al., 2011), GFF, BEDGRAPH (Kent et al., 2002) and SAM/BAM (Li et al., 2009) formats without the need for format conversion.

2 APPROACH

The pybedtools library allows one to manipulate datasets at both the file and individual feature level using the BedTool and Interval classes, respectively. It integrates high-level BEDTools programs through the Python subprocess module, and lower level BEDTools functionality by exposing a subset of BEDTools' libraries. At the core of pybedtools is the BedTool class. Typically, a BedTool is initially created with a file name. BEDTools programs are then accessed as methods of BedTool objects (e.g. BedTool.intersect for the BEDTools program intersectBed) with arguments identical to the user's installed version of BEDTools. However, in addition to passing filenames as in typical BEDTools command line usage, one may also pass collections of Interval objects which can be manipulated in Python on a feature-by-feature basis. Furthermore, BedTool methods return new BedTool instances, allowing users to chain many operations together in a fashion similar to the UNIX command line.

The pybedtools package provides a standardized interface to individual features in diverse genomics datasets, thus allowing one to iterate through datasets while accessing chromosome, start and stop coordinates with identical syntax, regardless of the underlying file format. This abstraction is made possible via Cython (http://cython.org, last accessed Aug 2011) which exposes the BEDTools file manipulation, feature parsing and overlap detection functions. In terms of speed and memory efficiency, pybedtools therefore compares favorably with Galaxy's (Giardine et al., 2005) bx-python, Kent source (Kent et al., 2002) and the original BEDTools software (Supplementary Fig. 1).

Formats with different coordinate systems (e.g. BED vs GFF) are handled with uniform, well-defined semantics described in the documentation. Additional features and example scripts illustrating the library's functionality are in the documentation at http://packages.python.org/pybedtools.
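To make the point about coordinate systems concrete, the sketch below shows one way features from BED (0-based, half-open) and GFF (1-based, closed) files can be brought onto a single convention so that the same start and end attributes mean the same thing for both. It is a pure-Python illustration and not pybedtools' internal code; the Feature class and the from_bed, from_gff and overlaps helpers are hypothetical names introduced only for this example.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Toy stand-in for a genomic feature, stored 0-based and half-open."""
    chrom: str
    start: int
    end: int


def from_bed(chrom, start, end):
    # BED coordinates are already 0-based and half-open.
    return Feature(chrom, int(start), int(end))


def from_gff(chrom, start, end):
    # GFF coordinates are 1-based and closed: shift the start down by one
    # and keep the end to obtain the same 0-based, half-open interval.
    return Feature(chrom, int(start) - 1, int(end))


def overlaps(a, b):
    # Half-open intervals overlap when each one starts before the other ends.
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# The same 101 bases written in each format map to the same Feature.
bed_feature = from_bed("chr1", 99, 200)
gff_feature = from_gff("chr1", 100, 200)
print(bed_feature == gff_feature)
```

Once every feature is on one convention, overlap tests and length calculations no longer need to know which file format a feature came from.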

3 APPLICATION

The pybedtools package employs a syntax that is intuitive to Python programmers. For example, given an annotation file of genes, hg19.gff, and a file containing relevant genetic variation, snps.bed, one can identify genes that contain SNPs by creating a BedTool for each file and intersecting them. At this point, one can easily examine the genes that overlap SNPs by iterating over the result, or filter the results with simple boolean functions.

The underlying BEDTools commands send their results to ‘standard output’. To assist in managing intermediate files, pybedtools automatically saves these results as temporary files that are deleted when Python exits. Results can be explicitly saved with the saveas() method. Given a FASTA file of the genome, hg19.fa, sequences for this subset of genes can also be retrieved and saved.

One of the more powerful extensions provided by the pybedtools interface is the ability to mix file operations with feature operations in a way that makes otherwise difficult tasks very accessible with minimal code. For example, the closest gene (within 5 kb) to each intergenic SNP can be identified in a few lines: intergenic SNPs are obtained by set subtraction (snps - genes), the closest method is then called with the arguments d=True and stream=True to produce a result named nearby, and each feature of nearby (called gene here) is kept when its last field, gene[-1], is below 5000.

This example illustrates several powerful features of pybedtools that confer additional functionality and greatly simplify analyses as compared with the BEDTools command line utilities (see Supplementary Material for an analogous experiment with BEDTools). For example, set subtraction between BedTools is used to extract features that are unique to one file (snps - genes). Similarly, one may also use the addition operator to identify features in the first file that overlap features in multiple datasets (e.g. snps + novel_snps + genes). Moreover, there is essentially no limit to the number of files that can be compared with the + and - operators. Arguments sent to BedTool methods are passed on to the corresponding BEDTools programs.
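The workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' original listings: the function names, the chr21 filter, the output file names and the handling of negative distances are hypothetical choices made for this example, while BedTool, intersect, filter, saveas, sequence, save_seqs, closest and the - operator are pybedtools calls. pybedtools is imported inside the functions so that the two small helpers can be tried on their own; calling the functions requires pybedtools, BEDTools and the input files named in the text.

```python
def on_chromosome(chrom):
    """Return a boolean filter that keeps features located on `chrom`."""
    def keep(feature):
        return feature.chrom == chrom
    return keep


def within_distance(fields, max_distance=5000):
    """Check the distance that the d=True argument appends as the last field.

    A negative value is treated here as "no neighbor reported" and rejected;
    that interpretation is an assumption of this sketch.
    """
    distance = int(fields[-1])
    return 0 <= distance < max_distance


def genes_with_snps(genes_fn="hg19.gff", snps_fn="snps.bed",
                    out_fn="genes-with-snps.gff", fasta_fn=None):
    """Intersect genes with SNPs, print, filter and save the result."""
    from pybedtools import BedTool  # needs pybedtools and BEDTools installed

    genes = BedTool(genes_fn)
    snps = BedTool(snps_fn)

    # u=True mirrors the -u flag: report each gene once if any SNP overlaps it.
    hits = genes.intersect(snps, u=True)

    # The same attribute names work whatever the underlying file format is.
    for gene in hits:
        print(gene.chrom, gene.start, gene.end, gene.name)

    # filter() keeps features for which the function returns True;
    # saveas() writes them to a permanent file and returns a new BedTool.
    saved = hits.filter(on_chromosome("chr21")).saveas(out_fn)

    if fasta_fn is not None:
        # Retrieve the sequence of each saved gene and write it to a FASTA file.
        saved.sequence(fi=fasta_fn).save_seqs(out_fn + ".fa")
    return saved


def genes_near_intergenic_snps(genes_fn="hg19.gff", snps_fn="snps.bed",
                               max_distance=5000):
    """Return names of genes whose closest intergenic SNP is within max_distance."""
    from pybedtools import BedTool

    genes = BedTool(genes_fn)
    snps = BedTool(snps_fn)

    # Set subtraction: SNPs that overlap no gene.
    intergenic_snps = snps - genes

    # Depending on the BEDTools version, closest may expect sorted inputs.
    nearby = genes.closest(intergenic_snps, d=True, stream=True)

    return [gene.name for gene in nearby if within_distance(gene, max_distance)]
```

Keeping the filter and the distance check as plain functions means they can be exercised on ordinary Python objects before being applied to real Interval objects.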
The argument d=True tells the BEDTools closestBed program to append the distance (in base pairs) between each SNP and the closest gene to the end of each line, equivalent to the -d argument typically given on the command line. Additionally, the argument stream=True indicates that the resulting BedTool object will stream results as a Python iterable of Interval objects instead of saving the results to a temporary file. This saves disk space and reduces file operations when performing many operations on large files.

Also note the indexing of the Interval object gene via [-1]. This retrieves the last item on the line, which, because of the d=True argument, represents the distance in base pairs between each SNP and gene. All elements of a line can be accessed from an Interval object by their integer index, and core attributes by their name. Finally, although nearby represents results that are a composite of GFF and BED features (i.e. genes and snps), the operation that produced nearby was driven by the gene GFF file. Therefore gene.name is seamlessly extracted from the GFF ‘attributes’ field.

Pybedtools also allows one to integrate sequence alignments in the widely used SAM/BAM format into their analyses. For example, one can identify sequence alignments that overlap coding exons by intersecting a BAM file with a file of exons, either step by step or as a single chained statement.

Some BEDTools programs require files containing chromosome sizes. Pybedtools handles these automatically with the genome keyword argument to methods that wrap such programs. For example, a single command can create a bedGraph file of read coverage for the hg19 assembly.
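The BAM and coverage analyses described above might look like the following sketch. The function names and the file names reads.bam, exons.bed, reads-in-exons.bam and hg19.bedgraph are hypothetical placeholders; intersect, genome_coverage and saveas are pybedtools methods, and bg=True is assumed to correspond to the -bg (bedGraph output) flag of the underlying BEDTools program. Calling either function requires pybedtools, BEDTools and real input files.

```python
def reads_in_exons(bam_fn="reads.bam", exons_fn="exons.bed",
                   out_fn="reads-in-exons.bam"):
    """Keep alignments that overlap exons, written as one chained statement."""
    from pybedtools import BedTool  # needs pybedtools and BEDTools installed

    # A file name can be passed directly as the second dataset.
    return BedTool(bam_fn).intersect(exons_fn).saveas(out_fn)


def read_coverage(bam_fn="reads.bam", assembly="hg19",
                  out_fn="hg19.bedgraph"):
    """Write a bedGraph of read coverage for the given assembly."""
    from pybedtools import BedTool

    # genome= asks pybedtools to supply chromosome sizes for the named
    # assembly instead of requiring a separate chromosome-sizes file.
    reads = BedTool(bam_fn)
    return reads.genome_coverage(bg=True, genome=assembly).saveas(out_fn)
```

Because each method returns a new BedTool, both analyses reduce to a single expression that ends in saveas() to keep the result after Python exits.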

4 CONCLUSION

The pybedtools package provides a convenient and flexible interface to both the BEDTools command-line tools and efficient functions in the BEDTools C++ libraries. Pybedtools simplifies complicated analyses by extending the functionality in BEDTools and by providing, to our knowledge, the first Python library offering a common interface for manipulating datasets in diverse formats. Other new functionality includes: set operations on multiple datasets using a simple, intuitive syntax, the ability to filter features and select specific columns or attributes, a unified interface to common attributes (e.g. chromosome, start, end, name and strand) from many file formats, and a documented command history. Pybedtools provides researchers with a simple and efficient interface for exploring complex genomics datasets in widely used formats.

Funding: Intramural Program of the National Institute of Diabetes and Digestive and Kidney Diseases.

Conflict of Interest: none declared.

REFERENCES

1.  The human genome browser at UCSC.

Authors:  W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal:  Genome Res       Date:  2002-06       Impact factor: 9.043

2.  Galaxy: a platform for interactive large-scale genome analysis.

Authors:  Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal:  Genome Res       Date:  2005-09-16       Impact factor: 9.043

3.  The case for cloud computing in genome informatics.

Authors:  Lincoln D Stein
Journal:  Genome Biol       Date:  2010-05-05       Impact factor: 13.583

4.  A map of human genome variation from population-scale sequencing.

Authors:  Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean
Journal:  Nature       Date:  2010-10-28       Impact factor: 49.962

5.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

6.  BEDTools: a flexible suite of utilities for comparing genomic features.

Authors:  Aaron R Quinlan; Ira M Hall
Journal:  Bioinformatics       Date:  2010-01-28       Impact factor: 6.937

7.  A user's guide to the encyclopedia of DNA elements (ENCODE).

Authors:  ENCODE Project Consortium
Journal:  PLoS Biol       Date:  2011-04-19       Impact factor: 8.029

8.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937
