
SPAI: an interactive platform for indel analysis.

Mohammad Shabbir Hasan, Liqing Zhang.

Abstract

BACKGROUND: Insertions and deletions (indels) are the most common form of structural variation in the human genome. Indels not only contribute to genetic diversity but also cause diseases, so assessing indels in the human genome has become a topic of considerable interest to the research community. This growing interest in indel calling has led to the development of a good number of indel calling tools. However, all of these tools are command line based and require Computer Science (CS) expertise to run, which makes them challenging for researchers from non-CS backgrounds.
METHODS: In this paper, we describe an interactive platform named SPAI, which stands for Single Platform for Analyzing Indels.
RESULTS: As a Graphical User Interface (GUI) tool, SPAI lets users run several popular indel calling tools and perform several analyses on the indel calling results without any knowledge of command line programming.
CONCLUSIONS: SPAI is written in Java and has been tested on the Linux operating system.

Year:  2016        PMID: 27585593      PMCID: PMC5009558          DOI: 10.1186/s12864-016-2824-x

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Single Nucleotide Polymorphisms (SNPs) constitute the major portion of the genetic variation in the human genome. However, recent studies show that insertions and deletions, collectively known as indels, also contribute substantially to genetic diversity [1, 2]. As structural variants, indels can alter human phenotype and can therefore cause several kinds of diseases [3, 4]. For example, Cystic Fibrosis, one of the most common genetic diseases in humans, is most commonly caused by the deletion of 3 base pairs (bps), which eliminates a single amino acid from the encoded protein [5]. Similarly, insertion of base pairs into the DNA sequence changes gene function and causes diseases such as Fragile X Syndrome [6], Mendelian disorders [7], Haemophilia [8], Neurofibromatosis [9], Muscular Dystrophy [10], and Cancer [11, 12]. In addition to causing diseases, indels within the promoter region influence gene expression and can help explain the differences in gene expression observed among humans [13], which supports the use of indels as genetic markers in natural populations [14].
With the introduction of Next Generation Sequencing (NGS) technology, it is now possible to sequence the human genome at an unprecedented rate [1], and whole genome sequencing (WGS) is now possible at the individual level [15-19]. Whole genome sequencing has revealed numerous genetic variations that were not previously reported [20], and these variation profiles can also be used to predict an ancestor’s traits such as height, weight, appearance, and intelligence [1]. The idea of predicting the future health of individuals in order to design personalized medicine is therefore rapidly approaching. However, accurate detection of genetic variation at the individual level remains a key challenge in evolutionary genomic research. Fortunately, a good number of indel calling tools have been developed to meet this challenge [21, 22].
In recent years, many indel calling tools have been developed that are publicly available and popular among researchers. These tools include Genome Analysis Tool Kit (GATK) [23, 24], VarScan [25, 26], Pindel [27], SAMtools [28], Dindel [29], Platypus [30], P-Dindel [21], SV-M [31], Stampy [32], PEMer [33], Hydra [34], BreakDancer [35], FreeBayes [36], and indelMINER [37]. A close look at these tools reveals that all of them are command line based [38]. Command line tools are very useful for batch processing, but they require a certain level of Computer Science (CS) expertise. Researchers from non-CS backgrounds such as Biology, Chemistry, and Microbiology who work on evolutionary genomics therefore often find it difficult to run these tools and to explore their features by changing parameters on the command line. Research by several usability labs reveals that the usability of a product can be significantly improved through the use of a Graphical User Interface (GUI) [39], and providing GUIs for command line based bioinformatics tools has been successful before [40]. To address the usability problem of indel calling tools, here we describe SPAI (Single Platform for Analyzing Indels). SPAI provides the user with a complete platform for indel research. In addition to running popular indel calling tools, users can download alignment files (BAM files) from the 1000 Genomes Project [41] to use as input to these tools. Users can also obtain coverage information for these alignment files and view the alignment files in a tabular format. SPAI likewise displays the indel calling results in a tabular format, giving better insight into the called indels. For downstream analysis, SPAI lets users compare the results from various tools and visualize those comparisons using graphs and charts.
As an interactive tool, SPAI therefore lets users carry out the tasks needed for indel research without any prior knowledge of command line programming.

Methods

SPAI, which is written in Java, comes with several features that are briefly described below.

Running different indel calling tools from GUI

Existing indel calling tools can be divided into four major categories: alignment based methods, split read methods, paired end read mapping methods, and haplotype based methods [22]. The current version of SPAI includes tools from two of these categories: alignment based methods and split read methods. From the alignment based category, SPAI includes GATK Unified Genotyper, VarScan, Dindel, and SAMtools. It also includes Pindel, which belongs to the split read category. These tools are installed automatically when the user installs SPAI. The following tools will be added in the next release of SPAI: GATK Haplotype Caller, Platypus, FreeBayes, and indelMINER. Tools from other categories will be added as the development of SPAI proceeds. Figure 1 shows the main GUI of SPAI. As the figure shows, the user only needs to specify the inputs (alignment file and reference sequence file), the output file location, and which tool to run. By default, SPAI runs the selected indel calling tool with its default settings. However, SPAI allows advanced users to change the parameters of each tool and run the tool with those settings.
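To illustrate what such a GUI hides from the user, the following is a minimal sketch of how the selected inputs could be turned into a command line for one of the bundled tools. The paper does not describe how SPAI invokes the tools internally, so this is an assumption rather than SPAI's actual code. The function name, the jar location, and the file names are hypothetical; the options shown follow the GATK 3.x UnifiedGenotyper command line, where `-glm INDEL` restricts calling to indels.

```python
from typing import List, Optional


def build_unified_genotyper_command(
    bam_path: str,
    reference_path: str,
    output_vcf: str,
    extra_args: Optional[List[str]] = None,
    gatk_jar: str = "GenomeAnalysisTK.jar",  # hypothetical install location
) -> List[str]:
    """Return the argument list for an indel-only GATK 3.x UnifiedGenotyper run."""
    command = [
        "java", "-jar", gatk_jar,
        "-T", "UnifiedGenotyper",
        "-R", reference_path,   # reference sequence file
        "-I", bam_path,         # alignment file
        "-glm", "INDEL",        # call indels only
        "-o", output_vcf,       # output VCF location
    ]
    # Advanced users may override defaults by appending their own options.
    if extra_args:
        command.extend(extra_args)
    return command


command = build_unified_genotyper_command("sample.bam", "reference.fasta", "indels.vcf")
print(" ".join(command))
# A GUI front end could then launch it with subprocess.run(command, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.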
Fig. 1

Main window of SPAI

Some of the programs require substantial processing power to generate results in a reasonable time. Although the current version of SPAI is desktop based, we are working on moving the processing step to a cloud based service so that sufficient processing power can be provided; computation time will then no longer depend on the configuration of the user’s computer.

Download alignment files from the 1000 Genomes project

The input to the existing indel calling tools is a sequence alignment file, which is usually available in BAM format. In most cases BAM files are very large, which makes them difficult to download with a conventional downloader. To assist the user, SPAI integrates an efficient downloader that can download single or multiple BAM files from the FTP server of the 1000 Genomes project. As shown in Fig. 2, the left panel of the downloader window lists all human samples currently available in the 1000 Genomes project. From this list the user selects the sample and the chromosome for which the alignment file is needed. The selected file is then added to the “Download List” shown in the right panel of the downloader window. The user can add multiple files to the download list or remove files from it. When the user hits the “Start Download” button, SPAI starts downloading the BAM file(s) and stores them in the location specified by the user.
Fig. 2

Downloader window of SPAI

Sometimes it is inconvenient to store large BAM files locally on the user’s computer. Therefore, in a future release, SPAI will store the URLs of the alignment files in a text file on the user’s computer and will save the alignment files in cloud based storage. When specifying the inputs, the user will then be able to give the URL of an alignment file instead of its physical location. SPAI will also accept external links to alignment files stored in locations other than the FTP server of the 1000 Genomes Project; in that case SPAI will fetch the alignment file and save it temporarily to the cloud based storage for use during execution.

Comparing the results of different indel calling tools

Users can compare the indel calling results produced by different tools, which is a useful feature for downstream analysis. SPAI provides a benchmark dataset [42] that contains 2 million small and large indels (lengths vary from 1 bp to 10,000 bps) found in the 24 chromosomes of 79 diverse humans. This dataset is considered to be the most reliable for indels in the human genome and has been used as the “gold standard” in other studies [22, 43, 44]. From the VCF (Variant Call Format) files that the tools generate as output, SPAI calculates the recall, precision, and F-measure of each tool by comparing its results with the benchmark dataset. Users can view the comparisons as graphs (shown in Fig. 3), which helps them gain insight into the performance of the tools. For comparison purposes, SPAI also allows users to supply results (in VCF format) from tools that are not included in SPAI. This is useful when a user wants to assess the performance of a newly developed tool by comparing its results with those of existing tools as well as with the “gold standard” indels.
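The three statistics named above have standard definitions: precision is the fraction of called indels that appear in the benchmark, recall is the fraction of benchmark indels that were called, and the F-measure is their harmonic mean. The sketch below computes them for a toy example. The paper does not state how SPAI decides that a called indel matches a benchmark indel, so exact equality of (chromosome, position, reference allele, alternate allele) is an assumption made here, and the function name and sample data are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Indel = Tuple[str, int, str, str]  # (chromosome, position, REF, ALT)


def evaluate_calls(called: Iterable[Indel], benchmark: Iterable[Indel]) -> Dict[str, float]:
    """Compute precision, recall, and F-measure against a benchmark set.

    Assumption: two indels match only if all four fields are identical.
    """
    called_set = set(called)
    benchmark_set = set(benchmark)
    true_positives = len(called_set & benchmark_set)

    precision = true_positives / len(called_set) if called_set else 0.0
    recall = true_positives / len(benchmark_set) if benchmark_set else 0.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


# Toy data: the benchmark holds four indels; the tool calls three, two of them correct.
benchmark = [
    ("chr1", 100, "AT", "A"),
    ("chr1", 250, "G", "GCC"),
    ("chr2", 400, "CTT", "C"),
    ("chr3", 900, "T", "TA"),
]
called = [
    ("chr1", 100, "AT", "A"),
    ("chr2", 400, "CTT", "C"),
    ("chr5", 50, "A", "AG"),
]

scores = evaluate_calls(called, benchmark)
print(scores)  # precision = 2/3, recall = 1/2, F-measure = 4/7
```

In practice the same indel can be written at different positions within a repeat, so a real comparison may need to normalize alleles or allow a position tolerance before matching.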
Fig. 3

Comparing the results of different tools with benchmark dataset

Displaying the alignment files and indel calling results in tabular format

Alignment files contain useful information about the reads (such as mapping quality, CIGAR string, and type of the read). However, the standard format for alignment files is BAM, a binary format that cannot be opened in a text editor. Although SAMtools can convert a BAM file to a text format (SAM), large SAM files are not convenient to open in a text editor. To solve this problem, SPAI includes a third party tool called BAMSeek [45]. It can show large BAM files in a tabular format, and the user can get useful information about the alignment by hovering the mouse over the corresponding column of the table. Similarly, BAMSeek can display large VCF files (the output of the indel calling tools) in a tabular format, from which the user can gain insight into the called indels. Figure 4 shows the tabular display of a BAM file.
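As an illustration of the information such a table exposes, the sketch below splits one SAM text record into its eleven mandatory columns and inspects the CIGAR string for insertions or deletions. It is not BAMSeek's implementation; the function names and the example read are hypothetical, and only the column layout and CIGAR operation letters come from the SAM format.

```python
import re
from typing import Dict, List, Tuple, Union

# The eleven mandatory SAM columns, in order.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_COLUMNS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}
CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_record(line: str) -> Dict[str, Union[int, str]]:
    """Map the mandatory columns of one SAM alignment line to named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    record: Dict[str, Union[int, str]] = {}
    for name, value in zip(SAM_COLUMNS, fields):
        record[name] = int(value) if name in INTEGER_COLUMNS else value
    return record


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '5M2I3M' into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def has_indel(cigar: str) -> bool:
    """True if the alignment contains an insertion (I) or a deletion (D)."""
    return any(op in ("I", "D") for _, op in parse_cigar(cigar))


# Hypothetical read: 5 matched bases, a 2-base insertion, then 3 matched bases.
line = "read1\t99\tchr1\t100\t60\t5M2I3M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_record(line)
print(record["MAPQ"], parse_cigar(record["CIGAR"]), has_indel(record["CIGAR"]))
```

A table viewer would show one such record per row, with one column per field.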
Fig. 4

Tabular display of a BAM file

Determining the depth of coverage

Depth of coverage is the average number of reads that represent a given nucleotide in the sequence. In most cases, a high depth of coverage is desired for calling indels confidently [22]. SPAI calculates the coverage of an alignment file so that users can decide whether the file should be used as input for indel calling.
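The definition above can be made concrete with a small calculation: count how many reads cover each position, then average over the region. The sketch below does this for a toy region. The paper does not say how SPAI computes coverage, so this is an illustrative assumption; each read is treated as one contiguous half-open interval (0-based start, exclusive end), ignoring gaps and clipping, and the function names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def per_base_depth(reads: Sequence[Tuple[int, int]], region_length: int) -> List[int]:
    """Return the number of reads covering each position of the region.

    Each read is a half-open interval (start, end) in 0-based coordinates.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    changes = [0] * (region_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            changes[start] += 1
            changes[end] -= 1

    depth = []
    running = 0
    for change in changes[:region_length]:
        running += change
        depth.append(running)
    return depth


def mean_depth(reads: Sequence[Tuple[int, int]], region_length: int) -> float:
    """Average depth of coverage over the region."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return sum(per_base_depth(reads, region_length)) / region_length


# Three reads over a 10-base region: 5 + 5 + 4 = 14 aligned bases in total.
reads = [(0, 5), (3, 8), (6, 10)]
depths = per_base_depth(reads, 10)
average = mean_depth(reads, 10)
print(depths)   # [1, 1, 1, 2, 2, 1, 2, 2, 1, 1]
print(average)  # 14 aligned bases / 10 positions = 1.4
```

Equivalently, the mean depth is the total number of aligned bases divided by the length of the region.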

Future work

SPAI is an ongoing project under active development. In a future release, we plan to move the indel calling steps of the tools to a cloud based service, which will remove the computational burden from the user’s computer. Moreover, the future release will not require the user to download large alignment files. Instead, the user will supply the URL of an alignment file, either from the FTP server of the 1000 Genomes project or from other sources, and SPAI will fetch the file and produce the results in the cloud based service. We will also let users open an account in the cloud based service, where previously obtained results will be stored so that they can be reused later instead of running the tools again. We plan to keep including newly developed indel calling tools, and the next release will include some widely used tools such as GATK Haplotype Caller, Platypus, FreeBayes, and indelMINER. We will also add utility tools such as UPS-indel, a tool to find ambiguous indels [46]. In addition, we plan to include tools that display the effect of a list of indels in coding and non-coding regions. A batch processing feature will also be added, which will allow the user to perform multiple tasks simultaneously.

Results and Discussion

In this paper we addressed the following two research questions:

What is the necessity of creating a GUI for existing indel calling tools?

Researchers depend heavily on existing command line programs for calling indels from their datasets and for downstream analysis in their research domains. Since most of these tools are very popular and widely used by the research community, one may ask why we do not simply keep them as they are. From our experience of working on indel projects, we realized that to explore the features of these tools by changing parameters and/or inputs, users need to write a command every time. This requires some expertise in command line programming, and retyping commands for every run reduces the usability of the tools. To solve this usability problem and to remove the need for command line expertise, we designed SPAI, a Graphical User Interface (GUI) based tool. SPAI allows users to explore the features of existing indel calling tools simply by selecting inputs through a regular file browser and setting parameters in a text box. This not only saves the time required for writing commands but also gives users a better experience. Moreover, since SPAI includes multiple indel calling tools in the same platform, users can run multiple tools at the same time on the same inputs without writing a single command.

How can SPAI help in downstream analysis of indels?

After calling indels, the next step is the downstream analysis of the called results. The research question we addressed is how SPAI can help in this context. SPAI comes with a list of known indels [42] for the human genome, which has been used as a benchmark dataset by many researchers. After importing indel calling results from different tools, SPAI compares those indels with this benchmark dataset. SPAI presents the comparison results in graphical form and also provides statistics such as recall, precision, and F-measure. This feature allows users to assess the performance of the indel calling tools based on these metrics. Users can also supply their own benchmark dataset and a list of indels produced by their own tool. Since SPAI produces graphs and charts of the performance metrics, users can easily assess the performance of their tools without carrying out these comparisons themselves. This also saves time and ensures a better user experience.
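Importing results from a tool means reading its VCF output and keeping only the indel records. The sketch below shows one simple way to do that, under the assumption that an indel is any alternate allele whose length differs from the reference allele. It is not SPAI's importer; the function name and the sample lines are hypothetical, symbolic alleles such as `<DEL>` are skipped, and breakend notation is not handled.

```python
from typing import Iterable, List, Tuple

Indel = Tuple[str, int, str, str]  # (chromosome, position, REF, ALT)


def extract_indels(vcf_lines: Iterable[str]) -> List[Indel]:
    """Collect indel alleles from the data lines of a VCF file.

    Simplification: an ALT allele counts as an indel when its length differs
    from REF. Symbolic alleles, '*' and '.' are ignored.
    """
    indels: List[Indel] = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the header line
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        for alt in fields[4].split(","):  # multi-allelic sites list several ALTs
            if alt.startswith("<") or alt in (".", "*"):
                continue
            if len(alt) != len(ref):
                indels.append((chrom, pos, ref, alt))
    return indels


# Hypothetical VCF content: one SNP, one deletion, one site with an insertion and a SNP.
vcf_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t60\tPASS\t.",
    "chr2\t300\t.\tC\tCGG,T\t70\tPASS\t.",
]

indels = extract_indels(vcf_lines)
print(indels)  # [('chr1', 200, 'AT', 'A'), ('chr2', 300, 'C', 'CGG')]
```

The resulting tuples can be placed in a set and intersected with a benchmark set to obtain the counts behind recall, precision, and F-measure.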

Conclusions

Indels constitute the most common form of structural variation in the human genome and have been found to cause diseases by abolishing gene functions. In addition, indels can influence human traits and gene expression and can therefore be used as genetic markers. These observations point to the need for variant profiling, which should be achievable now that whole genome sequencing at the individual level is possible thanks to Next Generation Sequencing (NGS) technology. A good number of indel calling tools have been developed that can be used for variant profiling. However, all of them are command line based and require a certain level of Computer Science (CS) expertise. As evolutionary genomics is a multi-disciplinary research area, people from non-CS backgrounds are also involved in indel calling research and should be able to use these tools without prior knowledge of command line programming. Here we introduce SPAI (Single Platform for Analyzing Indels), which provides users with an interactive platform for running popular indel calling tools through a user friendly GUI and for performing different analyses without any command line programming. We believe that people from non-CS backgrounds in particular will find SPAI useful in their indel calling research.

References (10 of 36 shown)

1.  VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing.

Authors:  Daniel C Koboldt; Qunyuan Zhang; David E Larson; Dong Shen; Michael D McLellan; Ling Lin; Christopher A Miller; Elaine R Mardis; Li Ding; Richard K Wilson
Journal:  Genome Res       Date:  2012-02-02       Impact factor: 9.043

2.  VarScan: variant detection in massively parallel sequencing of individual and pooled samples.

Authors:  Daniel C Koboldt; Ken Chen; Todd Wylie; David E Larson; Michael D McLellan; Elaine R Mardis; George M Weinstock; Richard K Wilson; Li Ding
Journal:  Bioinformatics       Date:  2009-06-19       Impact factor: 6.937

3.  Haemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism for mutation in man.

Authors:  H H Kazazian; C Wong; H Youssoufian; A F Scott; D G Phillips; S E Antonarakis
Journal:  Nature       Date:  1988-03-10       Impact factor: 49.962

4.  A highly annotated whole-genome sequence of a Korean individual.

Authors:  Jong-Il Kim; Young Seok Ju; Hansoo Park; Sheehyun Kim; Seonwook Lee; Jae-Hyuk Yi; Joann Mudge; Neil A Miller; Dongwan Hong; Callum J Bell; Hye-Sun Kim; In-Soon Chung; Woo-Chung Lee; Ji-Sun Lee; Seung-Hyun Seo; Ji-Young Yun; Hyun Nyun Woo; Heewook Lee; Dongwhan Suh; Seungbok Lee; Hyun-Jin Kim; Maryam Yavartanoo; Minhye Kwak; Ying Zheng; Mi Kyeong Lee; Hyunjun Park; Jeong Yeon Kim; Omer Gokcumen; Ryan E Mills; Alexander Wait Zaranek; Joseph Thakuria; Xiaodi Wu; Ryan W Kim; Jim J Huntley; Shujun Luo; Gary P Schroth; Thomas D Wu; HyeRan Kim; Kap-Seok Yang; Woong-Yang Park; Hyungtae Kim; George M Church; Charles Lee; Stephen F Kingsmore; Jeong-Sun Seo
Journal:  Nature       Date:  2009-07-08       Impact factor: 49.962

5.  Natural genetic variation caused by small insertions and deletions in the human genome.

Authors:  Ryan E Mills; W Stephen Pittard; Julienne M Mullaney; Umar Farooq; Todd H Creasy; Anup A Mahurkar; David M Kemeza; Daniel S Strassler; Chris P Ponting; Caleb Webber; Scott E Devine
Journal:  Genome Res       Date:  2011-04-01       Impact factor: 9.043

6.  The diploid genome sequence of an individual human.

Authors:  Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal:  PLoS Biol       Date:  2007-09-04       Impact factor: 8.029

7.  A practical method to detect SNVs and indels from whole genome and exome sequencing data.

Authors:  Daichi Shigemizu; Akihiro Fujimoto; Shintaro Akiyama; Tetsuo Abe; Kaoru Nakano; Keith A Boroevich; Yujiro Yamamoto; Mayuko Furuta; Michiaki Kubo; Hidewaki Nakagawa; Tatsuhiko Tsunoda
Journal:  Sci Rep       Date:  2013       Impact factor: 4.379

8.  An integrated map of genetic variation from 1,092 human genomes.

Authors:  Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal:  Nature       Date:  2012-11-01       Impact factor: 49.962

9.  Insertion-deletion polymorphisms (indels) as genetic markers in natural populations.

Authors:  Ulo Väli; Mikael Brandström; Malin Johansson; Hans Ellegren
Journal:  BMC Genet       Date:  2008-01-22       Impact factor: 2.797

10.  The Human Gene Mutation Database: 2008 update.

Authors:  Peter D Stenson; Matthew Mort; Edward V Ball; Katy Howells; Andrew D Phillips; Nick St Thomas; David N Cooper
Journal:  Genome Med       Date:  2009-01-22       Impact factor: 11.117
