The Biological Reference Repository (BioR): a rapid and flexible system for genomics annotation.

Jean-Pierre A Kocher¹, Daniel J Quest¹, Patrick Duffy¹, Michael A Meiners¹, Raymond M Moore¹, David Rider¹, Asif Hossain¹, Steven N Hart¹, Valentin Dinu¹.

Abstract

MOTIVATION: The Biological Reference Repository (BioR) is a toolkit for annotating variants. BioR stores public and user-specific annotation sources in indexed JSON-encoded flat files (catalogs). The BioR toolkit provides the functionality to combine and retrieve annotation from these catalogs via the command-line interface. Several catalogs from commonly used annotation sources and instructions for creating user-specific catalogs are provided. Commands from the toolkit can be combined with other UNIX commands for advanced annotation processing. We also provide instructions for the development of custom annotation pipelines.
AVAILABILITY AND IMPLEMENTATION: The package is implemented in Java and makes use of external tools written in Java and Perl. The toolkit can be executed on Mac OS X 10.5 and above or any Linux distribution. The BioR application, the quickstart and user guide documents, and many biological examples are available at http://bioinformaticstools.mayo.edu.

MeSH:
Genomics
Software Design

Year: 2014 PMID: 24618464 PMCID: PMC4071205 DOI: 10.1093/bioinformatics/btu137

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Next-generation sequencing (NGS) technology platforms are providing unprecedented opportunities to study genomic variants that are associated with clinical conditions and drug response. Using NGS technologies, researchers can identify mutations associated with rare diseases, characterize somatic variants in tumors for diagnostic or prognostic purposes, or guide therapeutic treatment. Although the large amount of data produced by NGS platforms and the time needed to process them are largely being addressed by expanding IT infrastructure, high-performance computing and code optimization, the annotation process needed to interpret the thousands of variants found in individual genomes remains challenging. The annotation process requires extracting and combining information from disparate external and in-house annotation sources, or even command-line tools. Several applications such as ANNOVAR (Wang et al.), GEMINI (Paila et al.) and TREAT (Asmann et al.) have recently been developed to automate the annotation and filtering of genomic variants. However, these systems are restrictive: expansion and maintenance of annotation depend on the authors’ availability and willingness, and annotation and filtering are often combined, making integration with other tools challenging. Other approaches such as Bio2RDF (Belleau et al.) propose converting annotation sources into Resource Description Framework (RDF) format, which can be loaded into a triple store database for querying. This approach, although flexible because it allows independent integration of new annotation sources, presents scalability limitations and does not integrate well with existing command-line tools. Under production loads, the number of searches needed to annotate variants can become extremely large. For instance, annotating ∼30 million variants from 10 whole-genome sequencing runs per day with annotation extracted from 10 data sources would involve >300 million queries.
In this article, we present the Biological Reference Repository (BioR), a flexible and scalable infrastructure for the specific purpose of gene and variant annotation. BioR is built around a slightly modified version of the JSON format (http://www.json.org/), referred to in this article as TJSON. To improve usability, BioR provides a toolkit (the BioR toolkit) that includes a set of UNIX command-line functions for catalog management and annotation extraction. The BioR toolkit is engineered to work in high-performance computing environments and to scale to multiple simultaneous instances.

2 METHODS AND RESULTS

2.1 The TJSON representation

The TJSON representation is used by catalogs and serves as the standard input/output for most functions of the BioR toolkit. A TJSON line consists of a mix of tab-delimited values and JSON strings (see example below). Like JSON, TJSON is a compact, readable and hierarchical format that can store the one-to-many relationships present in relational annotation sources. TJSON was preferred over alternatives such as XML because, in addition to being readable, it is relatively compact. Like XML, it can represent complex hierarchical data structures in a single text string. The hierarchical structures existing in relational data sources are therefore maintained in BioR catalogs. JSON strings can easily be extracted from a TJSON line and processed with the JSON libraries available in most programming languages, such as Perl, Java and Python. BioR provides the commands necessary to retrieve nested values from JSON strings. An example of TJSON, where ‘\t’ is a tab character (typically non-displaying) acting as a column separator, is:

1024\t145.6\t{"_type":"gene","_strand":"+", "_minBP":10954,"_maxBP":11507,"note":"similarity to: 1 Protein", "GeneID":"100506145"}\t12.334
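As a rough illustration of how the JSON strings in a TJSON line can be processed with a standard JSON library, the following Python sketch splits the example line on tabs and decodes the JSON column. The function name parse_tjson_line is hypothetical and is not part of the BioR toolkit; the rule used to recognize a JSON column (a field wrapped in braces) is an assumption made for this sketch.

```python
import json

def parse_tjson_line(line):
    """Split a TJSON line on tabs and decode any column that holds a JSON object.

    Hypothetical helper for illustration; not a BioR command.
    """
    columns = []
    for field in line.rstrip("\n").split("\t"):
        # Assumption: a field wrapped in braces is a JSON object;
        # every other field is kept as a plain string.
        if field.startswith("{") and field.endswith("}"):
            columns.append(json.loads(field))
        else:
            columns.append(field)
    return columns

# The example line from the text, with real tab characters as separators.
example = (
    '1024\t145.6\t{"_type":"gene","_strand":"+", "_minBP":10954,'
    '"_maxBP":11507,"note":"similarity to: 1 Protein", "GeneID":"100506145"}\t12.334'
)

columns = parse_tjson_line(example)
gene = columns[2]
print(gene["GeneID"], gene["_minBP"], gene["_maxBP"])  # prints: 100506145 10954 11507
```

Splitting on raw tab characters is safe here because JSON requires tab characters inside strings to be escaped, so a valid JSON column cannot contain a literal tab.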

2.2 BioR toolkit

The BioR toolkit includes a set of commands for the management of catalogs and the extraction of annotation based on genomic coordinates, variant or gene information. These stand-alone commands, which are executed like common UNIX commands, leverage third-party JSON libraries to process JSON strings. TJSON is intentionally used as standard input/output by most BioR commands so that multiple BioR commands can be chained into a single UNIX command line using standard piping syntax. Users can add functions to the toolkit or operate on their data with conventional UNIX tools, as long as each function operates on TJSON strings. The BioR toolkit also includes commands to convert tab-delimited input files (such as VCF and BED files) into TJSON strings and to convert TJSON into tab-delimited output files. Any metadata recorded in a VCF- or GFF-style header (lines starting with ‘#’) in the input file is carried through by the BioR toolkit functions and recorded in the output file. The commands included in the BioR toolkit are listed in Supplementary Table S1. Finally, the BioR toolkit supports two command-line utilities for annotating variants: (i) bior_snpeff, which integrates SnpEff annotations (Cingolani et al.), and (ii) bior_vep, which annotates files using Ensembl’s Variant Effect Predictor (www.ensembl.org/info/docs/variation/vep/).
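A user-written function that fits into such a pipeline could look roughly like the Python sketch below. It converts VCF-style lines into TJSON by appending a JSON column and passes ‘#’ header lines through unchanged. The function name vcf_to_tjson, the choice of JSON keys and the sample records are all assumptions made for illustration; this is not the implementation of any BioR command.

```python
import json

# Names of the first five VCF columns, used here as JSON keys (an assumption of this sketch).
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT"]

def vcf_to_tjson(lines):
    """Yield each data line with a JSON column appended; header lines pass through.

    Written as a generator so that only one line is held in memory at a time.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Header metadata is carried through untouched.
            yield line
            continue
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        record = dict(zip(VCF_FIELDS, fields))
        yield line + "\t" + json.dumps(record)

# Made-up input for illustration only.
sample = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t10954\trs_example\tA\tG",
]

out = list(vcf_to_tjson(sample))
for row in out:
    print(row)
```

In a real pipeline the same generator could be fed from standard input and its results written to standard output, which is what allows such a script to sit between other commands in a UNIX pipe.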

2.3 BioR annotation catalogs

BioR catalogs use a readable, indexable and schema-free format for storing and rapidly accessing arbitrary structured data such as genomic features, diseases, conditions, genetic tests and drugs. Catalogs are modular, based on specific data sources or tools, and can be built and queried independently of other catalogs. They use the TJSON representation to store annotation information and the corresponding genomic coordinates. The first tab-delimited field stores the origin of the sequence (usually a chromosome). The next two fields record the start and end coordinates of a genomic interval for position-dependent annotations; for position-independent annotations, both fields are set to 0. These three fields are indexed by Tabix (Li, 2011). The last field is a JSON string that contains all the data from the original source. To reduce the storage footprint and accelerate coordinate-based searches, catalogs are compressed using the open-source BGZip (Danecek et al.) and indexed using Tabix. The Tabix index file is stored in the same directory as the related catalog. The BioR toolkit takes advantage of the Tabix library to perform coordinate-based overlap searches. BioR can also search on identifiers, which can be indexed with a BioR toolkit command for fast querying. Finally, to accelerate coordinate-based and variant-matching searches, a set of semantically consistent identifiers called Golden Identifiers are automatically indexed. These identifiers are implicitly used by some BioR commands (Supplementary Table S2).
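The four-field catalog layout described above can be sketched in Python as follows. The sketch builds catalog rows and runs a coordinate-based overlap query with a plain linear scan, which stands in for the indexed Tabix lookup that BioR actually relies on. The helper names, the closed-interval convention and all records other than the gene from the TJSON example are assumptions made for illustration.

```python
import json

def make_catalog_row(landmark, start, end, data):
    """Build one catalog row: landmark, start, end, then the JSON payload."""
    return "\t".join([str(landmark), str(start), str(end), json.dumps(data)])

def overlaps(row, landmark, start, end):
    """Return True if the row's interval overlaps [start, end] on the same landmark.

    Assumption: both intervals are closed (end coordinate included).
    """
    row_landmark, row_start, row_end, _ = row.split("\t", 3)
    return (row_landmark == landmark
            and int(row_start) <= end
            and int(row_end) >= start)

rows = [
    make_catalog_row("1", 10954, 11507, {"_type": "gene", "GeneID": "100506145"}),
    # The next two records are made up for illustration.
    make_catalog_row("1", 20000, 21000, {"_type": "gene", "GeneID": "hypothetical-2"}),
    make_catalog_row("2", 10954, 11507, {"_type": "gene", "GeneID": "hypothetical-3"}),
]

# Linear scan used only as a stand-in for a Tabix-indexed query.
hits = [json.loads(r.split("\t", 3)[3]) for r in rows if overlaps(r, "1", 11000, 11100)]
print(hits)
```

A linear scan reads every row, so it is only practical for small examples; the point of compressing with BGZip and indexing with Tabix is to avoid exactly this full pass on large catalogs.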

2.4 Building BioR catalogs

The complexity of building BioR catalogs depends on how data are organized in the annotation source. Data available in tab-delimited text format can be readily converted to a BioR catalog using the command ‘bior_create_catalog’ and a configuration file describing each column. When annotations are extracted from complex systems such as relational databases, programming is required to reformat related tables into a single tab-delimited text file. A BioR catalog must be created for each set of related tables the user wants to use.
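The reformatting step for relational sources might look something like the Python sketch below, which folds a one-to-many relationship between two tables into one row per parent record, with the child records nested inside the JSON column. The table contents, column names and the function build_rows are hypothetical; this is a sketch under those assumptions, not the procedure used to build the published catalogs.

```python
import json
from collections import defaultdict

# Two made-up relational tables: one gene has many transcripts.
genes = [  # (gene_id, chrom, start, end)
    ("G2", "1", 20000, 21000),
    ("G1", "1", 10954, 11507),
]
transcripts = [  # (transcript_id, gene_id)
    ("T1", "G1"),
    ("T2", "G1"),
    ("T3", "G2"),
]

def build_rows(genes, transcripts):
    """Return one tab-delimited row per gene with its transcripts nested in JSON."""
    by_gene = defaultdict(list)
    for transcript_id, gene_id in transcripts:
        by_gene[gene_id].append(transcript_id)

    rows = []
    # Tabix expects position-sorted input, so sort by chromosome and start.
    # Chromosome names are compared as strings in this sketch.
    for gene_id, chrom, start, end in sorted(genes, key=lambda g: (g[1], g[2])):
        data = {"gene_id": gene_id, "transcripts": by_gene[gene_id]}
        rows.append("\t".join([chrom, str(start), str(end), json.dumps(data)]))
    return rows

rows = build_rows(genes, transcripts)
for row in rows:
    print(row)
```

Nesting the child records keeps the hierarchy of the source intact in a single line, so a later query returns a gene together with all of its transcripts without a second lookup.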

2.5 BioR catalog library

BioR includes 19 documented catalogs built from the most commonly used data sources (Supplementary Table S3). It also includes a list of catalogs built from UCSC Genome Browser tracks (Kent et al.). To increase clinical applicability, pharmacogenomics catalogs built from PharmGKB, DrugBank and the Therapeutic Target Database are also provided.

2.6 Example

The following example illustrates how sample variant rsIDs stored in the file rsIDs.txt can be annotated with the European allele frequency from the 1000 Genomes Project. First, using the ‘bior_lookup’ command, rsIDs in the rsIDs.txt file are matched to entries in the dbSNP.tsv.bgz catalog containing the identifier ‘ID’. Matching entries in JSON format are piped to the function ‘bior_same_variant’. This function uses the Golden Identifiers present in the JSON string to look up allele frequencies in the KGenomes.tsv.gz catalog. Finally, the function ‘bior_drill’ and the UNIX command ‘cut’ reformat the TJSON string into a tab-delimited output.

$ cat rsIDs.txt | bior_lookup -p ID -d dbSNP.tsv.bgz | bior_same_variant -d KGenomes.tsv.gz | bior_drill -c -1 -p INFO.EUR_AF | cut -f 1,3

This command pipeline annotates 100 000 rsIDs in 2 min 23 s on a MacBook Pro (2.3 GHz Intel Core i7, solid-state drive, 8 GB RAM).
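To make the final reformatting step of this pipeline more concrete, the Python sketch below extracts a nested value such as INFO.EUR_AF from the JSON column at the end of a TJSON line and emits it as a plain tab-delimited field. The function drill is a hypothetical stand-in written for this illustration; it does not reproduce the options or exact output of ‘bior_drill’, and the input line and frequency value are made up.

```python
import json

def drill(line, path):
    """Replace the trailing JSON column with the value found at a dotted path.

    Hypothetical helper for illustration; not the bior_drill implementation.
    """
    *leading, json_column = line.rstrip("\n").split("\t")
    value = json.loads(json_column)
    for key in path.split("."):
        # Assumption: a missing key is reported as "." so the column count stays fixed.
        if not isinstance(value, dict) or key not in value:
            value = "."
            break
        value = value[key]
    return "\t".join(leading + [str(value)])

# Made-up TJSON line: an identifier column followed by a JSON column.
line = 'rs_example\t{"ID": "rs_example", "INFO": {"EUR_AF": 0.12}}'

print(drill(line, "INFO.EUR_AF"))  # prints: rs_example<TAB>0.12
```

Because the result is ordinary tab-delimited text, standard tools such as ‘cut’ can then select the columns of interest, as in the command shown above.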

3 RESULTS

BioR is an open annotation tool. It includes a toolkit with a base set of commands needed to build and index catalogs and retrieve annotations. Annotations can be retrieved based on location (genomic coordinates) or identifiers. The TJSON format is used for catalogs and as input/output for most of the toolkit functions, facilitating the assembly of complex pipelines. Because the TJSON format is readable, users can design their own scripts to extract annotation from catalogs. Scripts can also be intermixed with toolkit commands as long as the TJSON format is maintained. The stream-based approach on which BioR is built significantly reduces the memory footprint. In addition, the BioR toolkit is inherently parallel and can be configured to take advantage of computers with multi-core architectures. BioR catalogs can easily be combined into new catalogs to decrease retrieval time by avoiding multiple cross-catalog queries. In conclusion, BioR is a rapid and flexible system for annotating high-throughput genomics experiments.

REFERENCES

1. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

2. Tabix: fast retrieval of sequence features from generic TAB-delimited files.

Authors: Heng Li
Journal: Bioinformatics Date: 2011-01-05 Impact factor: 6.937

3. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.

Authors: Pablo Cingolani; Adrian Platts; Le Lily Wang; Melissa Coon; Tung Nguyen; Luan Wang; Susan J Land; Xiangyi Lu; Douglas M Ruden
Journal: Fly (Austin) Date: 2012 Apr-Jun Impact factor: 2.160

4. Bio2RDF: towards a mashup to build bioinformatics knowledge systems.

Authors: François Belleau; Marc-Alexandre Nolin; Nicole Tourigny; Philippe Rigault; Jean Morissette
Journal: J Biomed Inform Date: 2008-03-21 Impact factor: 6.317

5. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.

Authors: Kai Wang; Mingyao Li; Hakon Hakonarson
Journal: Nucleic Acids Res Date: 2010-07-03 Impact factor: 16.971

6. The variant call format and VCFtools.

Authors: Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal: Bioinformatics Date: 2011-06-07 Impact factor: 6.937

7. TREAT: a bioinformatics tool for variant annotations and visualizations in targeted and exome sequencing data.

Authors: Yan W Asmann; Sumit Middha; Asif Hossain; Saurabh Baheti; Ying Li; High-Seng Chai; Zhifu Sun; Patrick H Duffy; Ahmed A Hadad; Asha Nair; Xiaoyu Liu; Yuji Zhang; Eric W Klee; Krishna R Kalari; Jean-Pierre A Kocher
Journal: Bioinformatics Date: 2011-11-15 Impact factor: 6.937

8. GEMINI: integrative exploration of genetic variation and genome annotations.

Authors: Umadevi Paila; Brad A Chapman; Rory Kirchner; Aaron R Quinlan
Journal: PLoS Comput Biol Date: 2013-07-18 Impact factor: 4.475
