Knime4Bio: a set of custom nodes for the interpretation of next-generation sequencing data with KNIME.

Pierre Lindenbaum¹, Solena Le Scouarnec, Vincent Portero, Richard Redon.

Abstract

SUMMARY: Analysing large amounts of data generated by next-generation sequencing (NGS) technologies is difficult for researchers or clinicians without computational skills. They are often compelled to delegate this task to computer biologists working with command line utilities. The availability of easy-to-use tools will become essential with the generalization of NGS in research and diagnosis. It will enable investigators to handle much more of the analysis. Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilized to quickly retrieve previously published scientific findings.

Year: 2011 PMID: 21984761 PMCID: PMC3208396 DOI: 10.1093/bioinformatics/btr554

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Next-generation sequencing (NGS) technologies have led to an explosion in the amount of data to be analysed. As an example, a VCF (Danecek ) file (Variant Call Format, a standard specification for storing genomic variations in a text file) produced by the 1000 Genomes Project contains about 25 million single nucleotide variants (SNVs) [http://tinyurl.com/ALL2of4intersection (retrieved September 2011)], making it difficult to extract relevant information using spreadsheet programs. While computer biologists are used to invoking common command line tools, such as Perl and R, when analysing those data through Unix pipelines, scientific investigators generally lack the technical skills necessary to handle these tools and need to delegate data manipulation to a third party. Scientific workflow and data integration platforms aim to make those tasks more accessible to research scientists. These tools are modular environments enabling easy visual assembly and interactive execution of an analysis pipeline (typically a directed graph) in which a node defines a task to be executed on input data and an edge between two nodes represents a data flow. These applications provide an intuitive framework that scientists can use themselves to build complex analyses. They allow data reproducibility and workflow sharing. Galaxy (Blankenberg ), Cyrille2 (Fiers ) and Mobyle (Néron ) are three web-based workflow engines that users have to install locally if computational needs on datasets are very large, or if absolute security is required. Alternatively, software such as the KNIME (Berthold ) workbench or Taverna (Hull ) runs on the user's desktop and can interact with local resources. Taverna focuses on web services and may require a large number of nodes even for a simple task. In contrast, KNIME provides the ability to modify the nodes without having to re-run the whole analysis.
We have chosen the latter tool to develop Knime4Bio, a set of new nodes mostly dedicated to the filtering and manipulation of VCF files. Although many standard nodes provided by KNIME can be used to perform such analyses, our nodes add new functionalities, some of which are described below.
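The node-and-edge model described above can be illustrated with a toy sketch in Python. This is not KNIME's execution engine: the `Workflow` class, its method names and the example rows are all hypothetical, and the caching scheme is only one plausible way to avoid re-running unchanged upstream nodes after a node is modified.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Workflow:
    """Toy workflow: nodes are tasks, edges carry data between tasks."""

    def __init__(self):
        self.tasks = {}   # node name -> function of the upstream outputs
        self.inputs = {}  # node name -> list of upstream node names
        self.cache = {}   # node name -> last computed output

    def add_node(self, name, func, inputs=()):
        """Add or replace a node; replacing it invalidates downstream results."""
        self.tasks[name] = func
        self.inputs[name] = list(inputs)
        self.invalidate(name)

    def invalidate(self, name):
        """Drop the cached output of a node and of everything downstream."""
        self.cache.pop(name, None)
        for child, parents in self.inputs.items():
            if name in parents:
                self.invalidate(child)

    def run(self):
        """Execute nodes in dependency order, skipping cached ones."""
        executed = []
        for name in TopologicalSorter(self.inputs).static_order():
            if name not in self.cache:
                args = [self.cache[parent] for parent in self.inputs[name]]
                self.cache[name] = self.tasks[name](*args)
                executed.append(name)
        return executed


# Made-up rows: (chromosome, position, quality).
wf = Workflow()
wf.add_node("read", lambda: [("chr1", 100, 50.0), ("chr1", 200, 5.0), ("chr2", 300, 80.0)])
wf.add_node("filter", lambda rows: [r for r in rows if r[2] >= 20.0], inputs=["read"])
wf.add_node("count", lambda rows: len(rows), inputs=["filter"])
first_run = wf.run()

# Changing the filter threshold only re-runs the filter and what depends on it.
wf.add_node("filter", lambda rows: [r for r in rows if r[2] >= 60.0], inputs=["read"])
second_run = wf.run()
```

In this sketch the second call to `run()` leaves the `read` node untouched, which mirrors the idea of modifying one node without repeating the whole analysis.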

2 IMPLEMENTATION

The Java API for KNIME was used to write the new nodes, which were deployed and documented using dedicated XML descriptors. A typical workflow for analysing exome sequencing data starts by loading VCF files into the working environment. The data contained in the INFO or the SAMPLE columns are extracted, and the next task consists of annotating SNVs and/or indels. One node predicts the consequence of variations at the transcript/protein level. For each variant, genomic sequences of overlapping transcripts are retrieved from the UCSC knownGene database (Hsu ) to identify variants leading to premature stop codons, non-synonymous variants and variants likely to affect splicing. Some nodes have been designed to find the intersection between the variants in the VCF file and various sources of annotated genomic regions, which can be: a local BED file, a remote URL, a MySQL table, a file indexed with tabix (Li, 2011), a BigBed or a BigWig file (Kent ). Other nodes are able to incorporate data from other databases: dbNSFP (Liu ), dbSNP, Entrez Gene, PubMed, the EMBL STRING database, UniProt, Reactome and GeneOntology (von Mering ), MediaWiki, or to export the data to SIFT (Ng and Henikoff, 2001), Polyphen2 (Adzhubei ), BED or MediaWiki formats. After being annotated, some SNVs (e.g. intronic) can be excluded from the dataset and the remaining data are rearranged by grouping the variants per sample or per gene as a pivot table. Some visualization tools have also been implemented: the Picard API (Li ) or the IGV browser (Robinson ) can be used to visualize the short reads overlapping a variation. As a proof of concept, we tested our nodes by analysing the exomes of six patients from a previously published study (Isidor ) related to the Hajdu-Cheney syndrome (Fig. 1). For this purpose, short reads were mapped to the human genome reference sequence using BWA (Li and Durbin, 2010) and variants were called using SAMtools mpileup (Li ).
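As a rough illustration of two of the generic steps above (extracting the INFO column and intersecting variants with annotated regions), the following Python sketch may help. It is not the Knime4Bio code, which is written in Java; the function names, the example VCF line and the region names are made up. The only format facts it relies on are that VCF positions are 1-based and that BED intervals are 0-based and half-open.

```python
def parse_info(info_field):
    """Turn 'DP=14;DB' into {'DP': '14', 'DB': True} (flags have no value)."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def parse_vcf_line(line):
    """Parse the eight fixed VCF columns of one tab-delimited data line."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position, as in the VCF specification
        "id": vid,
        "ref": ref,
        "alt": alt,
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": parse_info(info),
    }


def overlapping_regions(variant, bed_regions):
    """Names of the BED regions (0-based, half-open) overlapping the REF allele."""
    v_start = variant["pos"] - 1           # convert to 0-based
    v_end = v_start + len(variant["ref"])  # half-open end of the REF allele
    return [
        name
        for chrom, start, end, name in bed_regions
        if chrom == variant["chrom"] and start < v_end and v_start < end
    ]


# Made-up example data, not taken from the study.
line = "chr1\t1050\t.\tG\tA\t50\tPASS\tDP=14;DB"
regions = [
    ("chr1", 1000, 1100, "regionA"),
    ("chr1", 2000, 2100, "regionB"),
    ("chr2", 1000, 1100, "regionC"),
]
variant = parse_vcf_line(line)
hits = overlapping_regions(variant, regions)
```

A linear scan is used here for clarity; for large region sets an indexed lookup (tabix is one of the sources mentioned above) would be the more realistic choice.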
Homozygous variants, known SNPs (from dbSNP) and poor-quality variants were discarded, and only non-synonymous variants and variants introducing premature stop codons were considered. On a RedHat server (64-bit, 4 processors, 2 GB of RAM), our KNIME pipeline generated a list of six genes in 45 min: CELSR1, COL4A2, MAGEF1, MYO15A, ZNF341 and, more importantly, NOTCH2, the expected candidate gene.
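The filtering and pivoting logic described above could be sketched as follows. This is a minimal Python illustration under stated assumptions, not the published workflow: the record fields, the consequence labels, the quality threshold of 30 and the gene and sample names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical consequence labels; the real nodes may name them differently.
KEPT_CONSEQUENCES = {"non_synonymous", "stop_gained"}


def keep(variant, min_qual=30.0):
    """Keep heterozygous, novel, good-quality, protein-altering variants.

    The threshold min_qual is an assumption; the source gives no cutoff.
    """
    return (
        variant["genotype"] == "het"
        and not variant["in_dbsnp"]
        and variant["qual"] >= min_qual
        and variant["consequence"] in KEPT_CONSEQUENCES
    )


def pivot_by_gene(variants):
    """Count the retained variants per gene and per sample."""
    table = defaultdict(lambda: defaultdict(int))
    for v in variants:
        table[v["gene"]][v["sample"]] += 1
    return {gene: dict(samples) for gene, samples in table.items()}


def record(sample, gene, genotype, in_dbsnp, qual, consequence):
    """Build one made-up annotated variant record."""
    return {
        "sample": sample, "gene": gene, "genotype": genotype,
        "in_dbsnp": in_dbsnp, "qual": qual, "consequence": consequence,
    }


# Made-up records for two samples and three placeholder genes.
variants = [
    record("P1", "GENE_A", "het", False, 60.0, "stop_gained"),
    record("P2", "GENE_A", "het", False, 45.0, "non_synonymous"),
    record("P1", "GENE_B", "hom", False, 70.0, "non_synonymous"),  # homozygous
    record("P2", "GENE_B", "het", True, 70.0, "non_synonymous"),   # known SNP
    record("P1", "GENE_C", "het", False, 10.0, "stop_gained"),     # poor quality
    record("P2", "GENE_C", "het", False, 55.0, "intronic"),        # not coding
]
filtered = [v for v in variants if keep(v)]
pivot = pivot_by_gene(filtered)
```

In this toy dataset only `GENE_A` survives the filters, with one variant in each sample; a gene-by-sample table of this kind is what allows candidate genes to be compared across patients.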

Fig. 1.

Screenshot of a Knime4Bio workflow for the NOTCH2 analysis.

The workflow was posted on myexperiment.org at: www.myexperiment.org/workflows/2320.

3 DISCUSSION

In practical terms, a computer biologist worked alongside our users to help them with the construction of a workflow. After this short tutorial, they were able to quickly experiment with the interface, add some nodes and modify the parameters without any further assistance, except for the suggestion or the configuration of some specific nodes (for example, those that require a snippet of Java code). At the time of writing, Knime4Bio contains 55 new nodes. We believe Knime4Bio is an efficient interactive tool for NGS analysis.

17 in total

1. Tabix: fast retrieval of sequence features from generic TAB-delimited files.

Authors: Heng Li
Journal: Bioinformatics Date: 2011-01-05 Impact factor: 6.937

2. The UCSC Known Genes.

Authors: Fan Hsu; W James Kent; Hiram Clawson; Robert M Kuhn; Mark Diekhans; David Haussler
Journal: Bioinformatics Date: 2006-02-24 Impact factor: 6.937

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. A method and server for predicting damaging missense mutations.

Authors: Ivan A Adzhubei; Steffen Schmidt; Leonid Peshkin; Vasily E Ramensky; Anna Gerasimova; Peer Bork; Alexey S Kondrashov; Shamil R Sunyaev
Journal: Nat Methods Date: 2010-04 Impact factor: 28.547

5. BigWig and BigBed: enabling browsing of large distributed datasets.

Authors: W J Kent; A S Zweig; G Barber; A S Hinrichs; D Karolchik
Journal: Bioinformatics Date: 2010-07-17 Impact factor: 6.937

6. Integrative genomics viewer.

Authors: James T Robinson; Helga Thorvaldsdóttir; Wendy Winckler; Mitchell Guttman; Eric S Lander; Gad Getz; Jill P Mesirov
Journal: Nat Biotechnol Date: 2011-01 Impact factor: 54.908

7. Taverna: a tool for building and running workflows of services.

Authors: Duncan Hull; Katy Wolstencroft; Robert Stevens; Carole Goble; Mathew R Pocock; Peter Li; Tom Oinn
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

8. High-throughput bioinformatics with the Cyrille2 pipeline system.

Authors: Mark W E J Fiers; Ate van der Burgt; Erwin Datema; Joost C W de Groot; Roeland C H J van Ham
Journal: BMC Bioinformatics Date: 2008-02-12 Impact factor: 3.169

9. Mobyle: a new full web bioinformatics framework.

Authors: Bertrand Néron; Hervé Ménager; Corinne Maufrais; Nicolas Joly; Julien Maupetit; Sébastien Letort; Sébastien Carrere; Pierre Tuffery; Catherine Letondal
Journal: Bioinformatics Date: 2009-08-17 Impact factor: 6.937

10. Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2010-01-15 Impact factor: 6.937

10 in total

1. Rare Coding Variants in ANGPTL6 Are Associated with Familial Forms of Intracranial Aneurysm.

Authors: Romain Bourcier; Solena Le Scouarnec; Stéphanie Bonnaud; Matilde Karakachoff; Emmanuelle Bourcereau; Sandrine Heurtebise-Chrétien; Céline Menguy; Christian Dina; Floriane Simonet; Alexis Moles; Cédric Lenoble; Pierre Lindenbaum; Stéphanie Chatel; Bertrand Isidor; Emmanuelle Génin; Jean-François Deleuze; Jean-Jacques Schott; Hervé Le Marec; Gervaise Loirand; Hubert Desal; Richard Redon
Journal: Am J Hum Genet Date: 2018-01-04 Impact factor: 11.025

2. NIH Image to ImageJ: 25 years of image analysis.

Authors: Caroline A Schneider; Wayne S Rasband; Kevin W Eliceiri
Journal: Nat Methods Date: 2012-07 Impact factor: 28.547

Review 3. Proteogenomics from a bioinformatics angle: A growing field.

Authors: Gerben Menschaert; David Fenyö
Journal: Mass Spectrom Rev Date: 2015-12-15 Impact factor: 10.946

4. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations.

Authors: Xiaoming Liu; Xueqiu Jian; Eric Boerwinkle
Journal: Hum Mutat Date: 2013-07-10 Impact factor: 4.878

5. MassCascade: Visual Programming for LC-MS Data Processing in Metabolomics.

Authors: Stephan Beisken; Mark Earll; David Portwood; Mark Seymour; Christoph Steinbeck
Journal: Mol Inform Date: 2014-04-22 Impact factor: 3.353

Review 6. Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies.

Authors: Alexandre G de Brevern; Jean-Philippe Meyniel; Cécile Fairhead; Cécile Neuvéglise; Alain Malpertuy
Journal: Biomed Res Int Date: 2015-06-01 Impact factor: 3.411

7. hiPSC-derived cardiomyocytes from Brugada Syndrome patients without identified mutations do not exhibit clear cellular electrophysiological abnormalities.

Authors: Christiaan C Veerman; Isabella Mengarelli; Kaomei Guan; Michael Stauske; Julien Barc; Hanno L Tan; Arthur A M Wilde; Arie O Verkerk; Connie R Bezzina
Journal: Sci Rep Date: 2016-08-03 Impact factor: 4.379

8. ImmunoNodes - graphical development of complex immunoinformatics workflows.

Authors: Benjamin Schubert; Luis de la Garza; Christopher Mohr; Mathias Walzer; Oliver Kohlbacher
Journal: BMC Bioinformatics Date: 2017-05-08 Impact factor: 3.169

9. KNIME-CDK: Workflow-driven cheminformatics.

Authors: Stephan Beisken; Thorsten Meinl; Bernd Wiswedel; Luis F de Figueiredo; Michael Berthold; Christoph Steinbeck
Journal: BMC Bioinformatics Date: 2013-08-22 Impact factor: 3.169

10. Dysfunction of the Voltage-Gated K+ Channel β2 Subunit in a Familial Case of Brugada Syndrome.

Authors: Vincent Portero; Solena Le Scouarnec; Zeineb Es-Salah-Lamoureux; Sophie Burel; Jean-Baptiste Gourraud; Stéphanie Bonnaud; Pierre Lindenbaum; Floriane Simonet; Jade Violleau; Estelle Baron; Eléonore Moreau; Carol Scott; Stéphanie Chatel; Gildas Loussouarn; Thomas O'Hara; Philippe Mabo; Christian Dina; Hervé Le Marec; Jean-Jacques Schott; Vincent Probst; Isabelle Baró; Céline Marionneau; Flavien Charpentier; Richard Redon
Journal: J Am Heart Assoc Date: 2016-06-10 Impact factor: 5.501
