QualitySNPng: a user-friendly SNP detection and visualization tool.

Harm Nijveen, Martijn van Kaauwen, Danny G Esselink, Brechtje Hoegen, Ben Vosman.

Abstract

QualitySNPng is a new software tool for the detection and interactive visualization of single-nucleotide polymorphisms (SNPs). It uses a haplotype-based strategy to identify reliable SNPs and is optimized for the analysis of current RNA-seq data, but it can also be used on genomic DNA sequences derived from next-generation sequencing experiments. QualitySNPng does not require a sequenced reference genome and delivers reliable SNPs for diploid as well as polyploid species. The tool features a user-friendly interface, multiple filtering options to handle typical sequencing errors, support for SAM and ACE files and interactive visualization. QualitySNPng produces high-quality SNP information that can be used directly in genotyping-by-sequencing approaches for application in QTL and genome-wide association mapping as well as to populate SNP arrays. The software can be used as a stand-alone application with a graphical user interface or as part of a pipeline system like Galaxy. Versions for Windows, Mac OS X and Linux, as well as the source code, are available from http://www.bioinformatics.nl/QualitySNPng.

Year: 2013 PMID: 23632165 PMCID: PMC3692117 DOI: 10.1093/nar/gkt333

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Recent developments in sequencing technology have revolutionized genetic research, as vast amounts of sequencing data are now becoming available. From these data, single-nucleotide polymorphism (SNP) information can be extracted that is useful for genetic analysis, including quantitative trait locus (QTL) mapping and genome-wide association studies (1,2). Although several tools for SNP detection are already available (3–5), they usually require Linux command line skills to run and a separate program to visualize the results. More user-friendly software would greatly benefit the community.

Since its publication, the QualitySNP pipeline for SNP detection in diploid and polyploid species (6) has been used successfully in dozens of projects in plant and animal genetics, for instance for the identification of SNP markers in crop plants (7), zebra finch (8), water fleas (9), snakes (10) and scallops (11). Because QualitySNP can use de novo assembled sequence alignments as input, it can also be used for species without a reference genome. The original QualitySNP was developed and optimized for Sanger-sequenced expressed sequence tag (EST) data; however, the nature of DNA and RNA sequencing has changed drastically during the past 6 years, making an update necessary.

Here, we present QualitySNPng, which was specifically tuned to identify SNPs in data from the current next-generation sequencing (NGS) platforms. It features a graphical user interface (GUI), support for the popular SAM format (3), general performance improvements that allow analysis of large data sets and additional filtering parameters that address specific characteristics of NGS data from different platforms. The identified SNPs can be viewed in the context of predicted haplotypes and per input sample, making the tool ideally suited for genotyping-by-sequencing approaches (1). Additionally, QualitySNPng can be used as a component in an analysis pipeline on platforms like Galaxy (12).

FEATURES

SNP calling

QualitySNPng takes as input a sequence alignment file in SAM (3) or ACE (13) format with single-end or paired-end reads, as produced by read mappers like Bowtie (14) and BWA (15) or de novo assemblers like CABOG (16) and PCAP (17). Like the original QualitySNP (6), QualitySNPng uses three filtering steps to eliminate unreliable variations.

The first filter labels all nucleotide differences that occur in a minimum number of reads as potential SNPs. The user can set this minimum as an absolute number or as a fraction of the total number of reads.

The second filter takes into account the quality of the sequence containing the variant nucleotide and retains only the high confidence SNPs. The base quality, characterized by the Phred score (18), is used for this when it is present in the input sequence alignment. If no Phred score is present, all nucleotides in the input reads are assumed to be of high quality. Additionally, the score can be modified based on specific sequence patterns. For instance, variations found in homopolymeric tracts can be set to low quality. This option is particularly useful when Roche/454 sequences are processed, as these are known to be prone to homopolymer-associated errors (19). A number of nucleotides at the 5′- or 3′-ends can also be labelled as low quality, for instance to avoid false SNPs caused by incomplete adaptor trimming.

The third filter involves predicting haplotypes based on the high confidence SNPs. A variation is considered a reliable SNP only if it is supported by one or more haplotypes. Compared with the original QualitySNP software, the order of the second and third filters was reversed to make sure that the detected haplotypes are based on high confidence SNPs only.

The run time largely depends on the size and nature of the input sequencing data, ranging from less than a minute for a set of ∼25 000 contigs (∼100 reads/contig) to 10 min for one large single contig of 7000 bp with 800 000 reads. Larger and more variable sequence alignments can take longer, also depending on the stringency of the settings: lowering the threshold for potential SNPs results in more work for the second and third filters, which are computationally the most expensive. For large input files that are expected to take several hours to process, one can use the command line ‘server mode’ option of the tool to perform the SNP calling on a compute server and subsequently analyse the results using the GUI.
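The first two filters described above can be illustrated with a minimal Python sketch. This is not the QualitySNPng implementation (which is written in C++); all function names and the default thresholds (`min_phred=20`, `end_mask=3`, `min_run=4`) are assumptions chosen for illustration. The sketch also assumes that a position counts as a potential SNP when at least two different alleles each reach the read threshold.

```python
from collections import Counter


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def potential_snp_alleles(column_bases, min_reads=2, min_fraction=None):
    """Filter 1 (sketch): alleles at one alignment column that occur in a
    minimum number of reads, given as an absolute count or as a fraction.
    Returns an empty list when fewer than two alleles pass."""
    counts = Counter(column_bases)
    if min_fraction is None:
        threshold = min_reads
    else:
        threshold = min_fraction * len(column_bases)
    alleles = [base for base, n in counts.items() if n >= threshold]
    return sorted(alleles) if len(alleles) >= 2 else []


def in_homopolymer(seq, pos, min_run=4):
    """True if seq[pos] lies in a run of at least min_run identical bases."""
    base = seq[pos]
    start = pos
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = pos
    while end < len(seq) - 1 and seq[end + 1] == base:
        end += 1
    return end - start + 1 >= min_run


def is_high_quality(read_seq, read_quals, pos,
                    min_phred=20, end_mask=3, min_run=4):
    """Filter 2 (sketch): judge one base call within one read.
    read_quals is a list of Phred scores, or None if unavailable."""
    if pos < end_mask or pos >= len(read_seq) - end_mask:
        return False  # within the masked 5' or 3' end of the read
    if in_homopolymer(read_seq, pos, min_run):
        return False  # homopolymeric tract set to low quality
    if read_quals is None:
        return True  # no Phred scores: assume high quality
    return read_quals[pos] >= min_phred


# Example: 5 reads show A and 2 reads show G at one column.
print(potential_snp_alleles(list("AAAAGGA"), min_reads=2))
```

Lowering `min_reads` or `min_fraction` lets more columns through the first filter, which is why, as noted above, the later and more expensive filters then have more work to do.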

Viewing results

The results of the SNP calling can be viewed directly using the GUI, and they are also saved in structured text files for later reference or further processing. The different contigs from the input sequence alignments are listed in a table showing the number of SNPs, reads and haplotypes. The haplotype count in the table is corrected for fragmented haplotypes by taking the maximal number of haplotypes that is found per SNP position. Haplotypes become fragmented when SNPs are too far apart to be linked to one allele by a single sequence read or a read pair (see Figure 1 for an example). The contig list can be filtered based on the numbers of reads, SNPs and haplotypes and on the (partial) contig name.
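The correction for fragmented haplotypes can be sketched as follows. The data representation is an assumption made for illustration: each predicted haplotype (or haplotype fragment) is a dictionary mapping SNP position to allele, and the function name `corrected_haplotype_count` is hypothetical.

```python
from collections import Counter


def corrected_haplotype_count(haplotypes):
    """Return the maximal number of haplotypes covering any single SNP
    position, so that unlinked fragments are not counted separately."""
    per_position = Counter()
    for hap in haplotypes:
        for pos in hap:
            per_position[pos] += 1
    return max(per_position.values(), default=0)


# Two SNPs (100, 150) are linked by reads; a third SNP (900) is too far
# away to be linked, so its alleles appear as separate fragments.
fragments = [
    {100: "A", 150: "C"},
    {100: "G", 150: "T"},
    {900: "A"},
    {900: "C"},
]
print(len(fragments))                        # raw count: 4
print(corrected_haplotype_count(fragments))  # corrected count: 2
```

In this example the raw count of four fragments would wrongly suggest more than two haplotypes for a diploid, whereas no single SNP position is covered by more than two.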

Figure 1.

Screenshot of QualitySNPng output. Result of the SNP detection using an Arabidopsis thaliana RNA-seq data set from two accessions that were mapped to Arabidopsis transcripts (20). On the left, the list of transcripts is shown, limited here using the filter options to those with between 8 and 25 SNPs and between 1000 and 2000 reads. The details for the selected transcript are shown on the right: the top window shows the predicted haplotypes, the middle window shows the alleles per accession (Col-0 and Can-0) and the bottom window shows the reads aligned to the transcript sorted per haplotype (reads without SNPs are not shown).

Selecting a contig shows a window with the aligned reads and the SNPs indicated, a table with the haplotypes and their alleles per SNP position and a table showing the alleles for the different samples in the input data (Figure 1). For this last table to appear, the input sequence alignment file should be annotated with a ‘read group’ (see SAM format definition) per read or, alternatively, have group labels included in the read names. The overview per sample can for instance be used to compare alleles between different accessions, strains or ecotypes and for genotyping by sequencing. Manual inspection of the read alignment together with the haplotype overview gives insight into the quality of the alignment, the local read coverage and the positions of the SNPs. Based on this visual inspection, one can decide to alter the stringency of the filter settings and rerun the SNP calling. The reads can be sorted on start position or per haplotype and can be viewed at different zoom levels.

For the creation of a SNP array, marker SNPs can be selected and exported with flanking sequence of a specified length as a structured text file that can be imported into a standard spreadsheet program or an assay design program. To avoid problems in SNP scoring, we suggest selecting markers from contigs that have no more than the maximum expected number of haplotypes, i.e. two for diploid species, as contigs with more haplotypes may contain paralogous sequences. To further increase the chance of obtaining markers that will perform well on arrays, one could use the BLAST program (21) to eliminate marker sequences that show high similarity to other genes, as was shown previously (7).
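The marker selection advice can be sketched in Python as below. This is an illustration only: the contig dictionaries, the function names and the `LEFT[A/G]RIGHT` bracket notation are assumptions, not the export format that QualitySNPng itself writes. Positions are 0-based.

```python
def flanked_marker(contig_seq, snp_pos, alleles, flank=5):
    """Format one SNP as LEFT[A/G]RIGHT with up to `flank` bases of
    flanking sequence on each side (shorter near the contig ends)."""
    left = contig_seq[max(0, snp_pos - flank):snp_pos]
    right = contig_seq[snp_pos + 1:snp_pos + 1 + flank]
    return f"{left}[{'/'.join(alleles)}]{right}"


def select_markers(contigs, max_haplotypes=2, flank=5):
    """Keep SNPs only from contigs with no more haplotypes than expected
    (two for a diploid); contigs above that may contain paralogs."""
    rows = []
    for contig in contigs:
        if contig["haplotypes"] > max_haplotypes:
            continue
        for pos, alleles in contig["snps"]:
            marker = flanked_marker(contig["sequence"], pos, alleles, flank)
            rows.append((contig["name"], pos, marker))
    return rows


# Hypothetical input: contig_2 has three haplotypes and is skipped.
contigs = [
    {"name": "contig_1", "sequence": "ACGTACGTACGT", "haplotypes": 2,
     "snps": [(6, ("G", "T"))]},
    {"name": "contig_2", "sequence": "TTGACCATGGCA", "haplotypes": 3,
     "snps": [(4, ("C", "A"))]},
]
for name, pos, marker in select_markers(contigs, flank=3):
    print(name, pos, marker, sep="\t")
```

Tab-separated rows like these can be opened in a spreadsheet program; a subsequent BLAST check of each marker sequence, as suggested above, would be a separate step.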

IMPLEMENTATION

QualitySNPng was written in C++ using the Qt toolkit. The same executable file can be used interactively with the GUI, or as a command line tool for inclusion in analysis pipelines to be run on a compute server. The software compiles and runs on the Windows, Mac OS X and Linux operating systems. The output data are saved as CSV text files and can be reloaded for later analysis using QualitySNPng, or processed further by custom scripts.

DISCUSSION AND FUTURE DIRECTIONS

We believe there is a strong need for user-friendly software tools that allow biologists to directly analyse and visualize their data. QualitySNPng is a versatile tool that combines SNP detection and genotyping with interactive visualization of the results. The GUI with its pre-set filter options is easy to use and also highly configurable for specific needs. QualitySNPng is routinely used in-house for marker SNP identification in several projects (22–24). In one project, QualitySNPng was used to analyse RNA-seq data with up to 8 million reads per transcript to genotype a mixture of a few hundred accessions (unpublished), making use of the ‘server mode’ option to run on a compute server. We expect that developments such as cloud computing will make analyses of this size possible without leaving the GUI. The source code of QualitySNPng is freely available, and we encourage further development and implementation of the software in custom SNP analysis pipelines or adaptation for specific applications.

FUNDING

The Netherlands Consortium for Systems Biology, which is part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research. Funding for open access charge: Wageningen University and Research Centre.

Conflict of interest statement. None declared.

24 in total

1. VarScan: variant detection in massively parallel sequencing of individual and pooled samples.

Authors: Daniel C Koboldt; Ken Chen; Todd Wylie; David E Larson; Michael D McLellan; Elaine R Mardis; George M Weinstock; Richard K Wilson; Li Ding
Journal: Bioinformatics Date: 2009-06-19 Impact factor: 6.937

Review 2. Genome-wide genetic marker discovery and genotyping using next-generation sequencing.

Authors: John W Davey; Paul A Hohenlohe; Paul D Etter; Jason Q Boone; Julian M Catchen; Mark L Blaxter
Journal: Nat Rev Genet Date: 2011-06-17 Impact factor: 53.242

3. Consed: a graphical tool for sequence finishing.

Authors: D Gordon; C Abajian; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

Review 4. Genotype and SNP calling from next-generation sequencing data.

Authors: Rasmus Nielsen; Joshua S Paul; Anders Albrechtsen; Yun S Song
Journal: Nat Rev Genet Date: 2011-06 Impact factor: 53.242

5. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

6. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

7. Single nucleotide polymorphism discovery from expressed sequence tags in the waterflea Daphnia magna.

Authors: Luisa Orsini; Mieke Jansen; Erika L Souche; Sarah Geldof; Luc De Meester
Journal: BMC Genomics Date: 2011-06-13 Impact factor: 3.969

8. SNP markers retrieval for a non-model species: a practical approach.

Authors: Arwa Shahin; Thomas van Gurp; Sander A Peters; Richard Gf Visser; Jaap M van Tuyl; Paul Arens
Journal: BMC Res Notes Date: 2012-01-29

9. QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species.

Authors: Jifeng Tang; Ben Vosman; Roeland E Voorrips; C Gerard van der Linden; Jack A M Leunissen
Journal: BMC Bioinformatics Date: 2006-10-09 Impact factor: 3.169

10. Aggressive assembly of pyrosequencing reads with mates.

Authors: Jason R Miller; Arthur L Delcher; Sergey Koren; Eli Venter; Brian P Walenz; Anushka Brownley; Justin Johnson; Kelvin Li; Clark Mobarry; Granger Sutton
Journal: Bioinformatics Date: 2008-10-24 Impact factor: 6.937