Literature DB >> 27153673

Goldilocks: a tool for identifying genomic regions that are 'just right'.

Samuel M Nicholls¹, Amanda Clare², Joshua C Randall³.

Abstract

UNLABELLED: : We present Goldilocks: a Python package providing functionality for collecting summary statistics, identifying shifts in variation, discovering outlier regions and locating and extracting interesting regions from one or more arbitrary genomes for further analysis, for a user-provided definition of interesting.
AVAILABILITY AND IMPLEMENTATION: Goldilocks is freely available open-source software distributed under the MIT licence. Source code is hosted publicly at https://github.com/SamStudio8/goldilocks and the package may also be installed using pip install goldilocks. Documentation can be found at https://goldilocks.readthedocs.org CONTACT: : msn@aber.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Species

Mesh：

Year: 2016 PMID： 27153673 PMCID： PMC4920124 DOI： 10.1093/bioinformatics/btw116

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Goldilocks is a Python package designed to take a census of sequence data in order to find regions that meet some criteria that are ‘just right’ for further analysis. It can be used by importing the package into a standalone script or via the command line tool. The tool accepts sequence data which could be from one or more individuals (with one or more sequences from each), organized into chromosomes or contigs and stored as FASTA. Goldilocks must also be supplied with a desired census strategy defining the criteria of interest, such as occurrences of individual nucleotide bases, motifs, deviations from a reference or GC-content. Chromosomes (or contigs) are slid beneath a window of a user-defined length and overlap and the strategy is applied over all input genomes. Goldilocks was developed as part of a quality control study to first determine what constitutes ‘bad’ quality data. For a large study, a pipeline was constructed to repeatedly perform variant calling, withholding one sample from the analysis in turn. A given sample’s downstream effect could then be quantified by measuring the difference in the accuracy of a run in the absence of that sample. However, repeated application of computationally expensive and time consuming analysis of whole genomes was unsustainable. The problem space needed to be reduced to a representative region: the genome’s Goldilocks. For our study, we defined a 1 Mb region as ‘just right’ if it contained within ±10 percentiles of the median number of single nucleotide polymorphisms (SNPs) appearing over all regions of our genome wide association study. These regions were then sorted by the number of SNPs appearing within the corresponding regions in a separate genotyping chip study. This allowed us to maximize the number of sites at which to check consensus between a run of our quality control pipeline and the SNP chip. Goldilocks was originally designed to return regions that matched these explicit criteria across multiple samples. The package has since been made more flexible and can be used to find regions of interest based on any arbitrary criteria such as GC-content, density of target motifs, pre-defined metrics and uncalled nucleotides (N’s).

2 Features

2.1 Customizable and interchangeable strategies

Criteria defining the metric for ‘interesting’ are held by Goldilocks in a user-selected strategy. Strategies are responsible for annotating regions of sequence data as interesting (or not) and quantifying the result as a summary statistic that can be used later for ranking. Whilst Goldilocks includes a number of simple strategies for counting nucleotides, motifs, consensus to a reference and GC-content, the package offers a simple framework that allows one to customize or create strategies of their own (for an example refer to the supplement). Strategies implement the same interface and are thus all interchangeable, replacing the need for multiple stand-alone tools and data processing steps.

2.2 Simple sorting and flexible filtering

Goldilocks provides a simple but powerful query function capable of filtering and sorting regions by their absolute or percentile distance from the result of a mathematical operation over the whole set of regions, such as max, min, mean, median or proximity to an arbitrary target value. Candidates can be excluded based on the start or end position of the region, chromosome (or contig) and simple expressions on the strategy values themselves. Queries can be chained, allowing users to build up more complex queries to find regions with more specific criteria of interest.

2.3 Plotting and exporting

Goldilocks provides several functions for extracting collected region metadata and the results of queries on that data. Sequences that lie on regions identified as interesting can be exported as FASTA, regions themselves can be output as BED. The package can output publication-ready graphs via matplotlib, or region metadata as a ‘melted’ dataframe for easy plotting with external packages such as R’s ggplot2 (Wickham, 2009). Data may also be exported in an arbitrarily delimited format for import into virtually any other tool, including Circos (Krzywinski ). For examples, please refer to the supplement.

2.4 Multiprocessed census and efficient input

The census step can be executed over a user specified number of processes to improve performance on both desktops and servers. Sequence FASTA files are not read into physical RAM but instead mapped to virtual memory and accessed via indices stored in their corresponding index, avoiding expensive input reading operations. On a Core i5 laptop using four processes, Goldilocks can calculate GC-content over a whole human genome (hs37d5) with a window size of 100 kbp (and overlap of 50 kbp) in 3 min while using less than 2 GB of memory. On the same machine Goldilocks could simultaneously take a census of 10 short nucleotide motifs with regular expressions and overlaps in 15 min or 50 simple motifs in just over 5 min and using less than 3 GB of memory.

3 Discussion

The flexibility of Goldilocks lends itself for use in many different scenarios, for example: Summarizing genomesTabulate or plot variation in some statistic such as GC-content across one or more genomes for inspection. Exploring extremesFind regions demonstrating unusual extremes with respect to the rest of the sequence data, such as high numbers of repeating subsequences or low GC-content. Identifying shiftsDetect shifts in variation in some statistic along a sequence, such as identification of chimeric activity within contigs. Discovering outliersDetect regions of sequences with properties indicative of data quality issues. Seeding sequencesLocate subsequences to use as appropriate seeds for execution of other tools and algorithms. Extracting subsetsFrom a large set of regions of sequence, extract a subset which meets some desired arbitrary criteria for further analysis. Alternative counting mechanisms, such as those available in Galaxy (Giardine ) through EMBOSS (Rice ) or various statistics tools in the Galaxy Toolshed, can accomplish similar functionality. However, as different tools require specific input formats and produce a variety of output formats, additional data handling steps are required as part of any pipeline, making it difficult to customize or swap counting plugins. Goldilocks is not intended to replace already existing dedicated software for counting tasks that require specific optimized data structures such as khmer (Crusoe ) for counting k-mers, but instead provides a general suite of swappable counters. Genome browsers such as the UCSC Genome Browser (Kent ) allow the user to view a variety of tracks displaying the locations of different genomic properties. Interpreting large amounts of sequencing data by eye in such browsers can be difficult. Alternative browsers that allow zoomed browsing of interesting regions, such as the LayerCake (Correll ) visualization tool, attempt to ameliorate this problem. Goldilocks will automatically find regions of interest, suitable for further browsing or plotting if required. Epiviz and Epivizr (Chelaru ) combine both statistical analysis and visualization, connecting a web-based browsing environment to an R/Bioconductor-based calculation environment. Users interactively add more tracks or plots to further explore the areas surrounding the regions of interest in the genome. Their platform is comprehensive and provides a far larger scope for analysis than Goldilocks. We see Goldilocks as a lightweight Python solution for straightforward queries, and a useful addition to existing bioinformatics pipelines. Goldilocks has minimal dependencies: the core requires just NumPy (Oliphant, 2007) and matplotlib (Hunter, 2007) to enable plotting. We provide a command line tool that offers access to the base functionality of the package to users without having to write a script of their own (for example usage see the supplement). Goldilocks is packaged with a testing suite, results and coverage of which are available online along with documentation. Source code is publicly available under an open source licence for review and community contribution.

7 in total

1. EMBOSS: the European Molecular Biology Open Software Suite.

Authors: P Rice; I Longden; A Bleasby
Journal: Trends Genet Date: 2000-06 Impact factor: 11.639

2. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

3. LayerCake: a tool for the visual comparison of viral deep sequencing data.

Authors: Michael Correll; Adam L Bailey; Alper Sarikaya; David H O'Connor; Michael Gleicher
Journal: Bioinformatics Date: 2015-07-07 Impact factor: 6.937

4. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

5. Circos: an information aesthetic for comparative genomics.

Authors: Martin Krzywinski; Jacqueline Schein; Inanç Birol; Joseph Connors; Randy Gascoyne; Doug Horsman; Steven J Jones; Marco A Marra
Journal: Genome Res Date: 2009-06-18 Impact factor: 9.043

6. The khmer software package: enabling efficient nucleotide sequence analysis.

Authors: Michael R Crusoe; Hussien F Alameldin; Sherine Awad; Elmar Boucher; Adam Caldwell; Reed Cartwright; Amanda Charbonneau; Bede Constantinides; Greg Edvenson; Scott Fay; Jacob Fenton; Thomas Fenzl; Jordan Fish; Leonor Garcia-Gutierrez; Phillip Garland; Jonathan Gluck; Iván González; Sarah Guermond; Jiarong Guo; Aditi Gupta; Joshua R Herr; Adina Howe; Alex Hyer; Andreas Härpfer; Luiz Irber; Rhys Kidd; David Lin; Justin Lippi; Tamer Mansour; Pamela McA'Nulty; Eric McDonald; Jessica Mizzi; Kevin D Murray; Joshua R Nahum; Kaben Nanlohy; Alexander Johan Nederbragt; Humberto Ortiz-Zuazaga; Jeramia Ory; Jason Pell; Charles Pepe-Ranney; Zachary N Russ; Erich Schwarz; Camille Scott; Josiah Seaman; Scott Sievert; Jared Simpson; Connor T Skennerton; James Spencer; Ramakrishnan Srinivasan; Daniel Standage; James A Stapleton; Susan R Steinman; Joe Stein; Benjamin Taylor; Will Trimble; Heather L Wiencko; Michael Wright; Brian Wyss; Qingpeng Zhang; En Zyme; C Titus Brown
Journal: F1000Res Date: 2015-09-25

7. Epiviz: interactive visual analytics for functional genomics data.

Authors: Florin Chelaru; Llewellyn Smith; Naomi Goldstein; Héctor Corrada Bravo
Journal: Nat Methods Date: 2014-08-03 Impact factor: 28.547

7 in total