
SNP-o-matic.

Heinrich Magnus Manske, Dominic P Kwiatkowski.

Abstract

MOTIVATION: High throughput sequencing technologies generate large amounts of short reads. Mapping these to a reference sequence consumes large amounts of processing time and memory, and read mapping errors can lead to noisy or incorrect alignments. SNP-o-matic is a fast, memory-efficient and stringent read mapping tool offering a variety of analytical output functions, with an emphasis on genotyping. AVAILABILITY: http://snpomatic.sourceforge.net.

Year:  2009        PMID: 19574284      PMCID: PMC2735664          DOI: 10.1093/bioinformatics/btp403

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


Analysis of genome variation has been revolutionized by the advent of next-generation sequencing technologies (Bentley et al., 2008; Li et al., 2008b; Shendure and Ji, 2008). The short length of sequence reads, e.g. 50 base pairs, can pose considerable challenges in achieving accurate genome alignment, particularly if the genome sequence is highly polymorphic. Discovery of single nucleotide polymorphisms (SNPs) and other variants depends on the alignment algorithm allowing some mismatches to the reference sequence, but allowing too many mismatches may lead to incorrect alignments. Thus the process of discovering novel variants amounts to a complex statistical problem, particularly if sequencing errors and other sources of noise are taken into account. Various discovery algorithms have been developed and this is an area of much research interest (for example MAQ, Li et al., 2008a; and bowtie, Langmead et al., 2009). Here we focus on the problem of describing the genotype of an individual using short-read sequencing data. In principle, this can be incorporated into the same algorithms used for discovering novel variants, an approach that appears to work well for the human genome (Bentley et al., 2008). However, there are circumstances in which it may be useful to separate SNP discovery from SNP genotyping. For example, SNP discovery in Plasmodium falciparum is particularly complicated due to 80% AT content, many repeat sequences, regions of extreme polymorphism and the multiclonality of natural isolates. Thus different SNP discovery algorithms return widely different results. One way of addressing this problem is to begin by annotating the reference genome with all the putative SNPs generated by different discovery algorithms. Then individual SNPs may be genotyped by performing a stringent alignment of the sequencing reads against the reference genome, allowing for all the putative variable positions. 
To support this sort of genotyping analysis, we developed SNP-o-matic as a fast way of mapping short sequence reads to a reference genome with a list of putative variable positions that are specified at the outset. The default settings are highly stringent, returning only those sequence reads that align perfectly with the reference genome after allowing for the putative variable positions. An important feature of SNP-o-matic, which allows the rapid processing of large volumes of sequencing data, is that the reference genome sequence is first indexed (on the fly or by using a pre-computed index from disk), and then each sequence read or read pair is examined one at a time. This avoids having to build and store an index of the reads, saving both compute cycles and memory. Indexing of the reference genome is done in memory on the fly from a generic FASTA file. A list of putative SNPs, if supplied, is integrated into the reference before indexing, and all permutations of this SNP-containing sequence are indexed. Indexing the 25 Mb P. falciparum genome (without SNPs) takes about 30 s on a single CPU core and occupies ∼1 GB of memory. A memory-saving option can reduce both memory and indexing time significantly at the expense of a longer mapping phase. The index can be stored in a file for future use, further reducing the time required for this step, or to facilitate the analysis of larger genomes (Supplementary Material). Reads are supplied in either FASTA or FASTQ (http://maq.sourceforge.net/fastq.shtml) format; read pairs can be in either single or split files. In performance tests, mapping 10 million 37-base paired reads against the P. falciparum genome takes 70 s on a single CPU core, not counting the indexing. No additional memory is required for the mapping. Additional time and memory may be required for some of the output functions.
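The idea of indexing all permutations of a SNP-containing reference can be illustrated with a minimal Python sketch. This is not SNP-o-matic's actual C++ implementation: the function name `build_snp_aware_index`, the `snps` dictionary (0-based position mapped to alternative bases) and the tiny k-mer length are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product


def build_snp_aware_index(reference, snps, k):
    """Map every k-mer, including putative SNP variants, to its start positions.

    reference: reference sequence as a string
    snps: dict of 0-based position -> string of alternative bases
          (hypothetical input format, chosen for this sketch)
    k: length of the indexed k-mer
    """
    index = defaultdict(set)
    for start in range(len(reference) - k + 1):
        # Allowed bases at each position of the window: the reference base
        # plus any putative alternative alleles at that position.
        choices = []
        for pos in range(start, start + k):
            alleles = {reference[pos]} | set(snps.get(pos, ""))
            choices.append(sorted(alleles))
        # Enumerate every combination of alleles within this window.
        for combo in product(*choices):
            index["".join(combo)].add(start)
    return index


reference = "ACGTACGTTA"
snps = {2: "T", 5: "A"}  # putative SNPs: G/T at position 2, C/A at position 5
index = build_snp_aware_index(reference, snps, k=4)

print(sorted(index["ACGT"]))  # reference k-mer, found at positions 0 and 4
print(sorted(index["ACTT"]))  # variant k-mer, only reachable via the SNP at position 2
```

The number of indexed k-mers grows with the product of the allele counts inside each window, which is one plausible reason why densely clustered SNPs increase indexing time and memory.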
For genotyping, an indexed kmer of variable length (default 26 bases) is compared to the kmer of the same length from each read (or from both reads in a read pair). Matches in these bases thus have to be perfect with respect to the putative SNPs. The remaining bases of the read are then compared base-by-base to the reference. By default, these matches have to be perfect as well, but a limited number of mismatches can be allowed. This stringency avoids false SNP calls in genotyping mode that would otherwise be caused by aligning reads containing sequencing errors. Thus, SNP-o-matic will generally map fewer reads than other algorithms, but the mapping will have much higher accuracy. When allowing mismatches, the kmer length can be varied to increase mapping tolerance (Supplementary Material). Both parts of a read pair have to map on the same chromosome for valid mapping; a fragment range can be used to limit their mapping distance to conform to the expected size distribution for the library. An optional mode can increase stringency by ensuring that at least one read of a read pair maps uniquely within the reference genome. By default, SNP-o-matic will find and use all valid mappings for a read or read pair within the reference. When using read pairs, the stringent mapping algorithm can sometimes map one of the reads in the pair, but not the other. SNP-o-matic can output various data about such read pairs. From the mapping position, orientation, and fragment size, a likely position can be estimated for the non-mapping read. Based on this information, reads can be grouped by position and assembled to discover variation. Additionally, the estimated area can be searched for mappings with some mismatches, resulting in potential new SNP calls. This output is the primary method used by SNP-o-matic to discover new SNPs and small-scale variation, both of which require further downstream analysis (Supplementary Material).
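The two-stage comparison described at the start of this paragraph (an exact kmer lookup followed by a base-by-base check of the remaining bases) can be sketched in Python as follows. The sketch assumes that the kmer is taken from the start of the read and that only the forward strand is considered; the names `build_index`, `map_read` and `max_mismatches` are hypothetical and do not correspond to SNP-o-matic options.

```python
from collections import defaultdict
from itertools import product


def build_index(reference, snps, k):
    """Index every k-mer of the reference, expanding putative SNP alleles."""
    index = defaultdict(set)
    for start in range(len(reference) - k + 1):
        choices = [reference[p] + snps.get(p, "") for p in range(start, start + k)]
        for kmer in product(*choices):
            index["".join(kmer)].add(start)
    return index


def map_read(read, reference, snps, index, k, max_mismatches=0):
    """Return sorted start positions at which the read maps.

    Stage 1: the first k bases must match an indexed k-mer exactly
             (putative SNP alleles count as matches).
    Stage 2: the remaining bases are compared one by one, tolerating
             at most `max_mismatches` differences.
    """
    hits = []
    for start in index.get(read[:k], ()):
        if start + len(read) > len(reference):
            continue  # read would run off the end of the reference
        mismatches = 0
        for offset in range(k, len(read)):
            pos = start + offset
            allowed = reference[pos] + snps.get(pos, "")
            if read[offset] not in allowed:
                mismatches += 1
        if mismatches <= max_mismatches:
            hits.append(start)
    return sorted(hits)


reference = "ACGTACGTTAGC"
snps = {2: "T", 9: "C"}
index = build_index(reference, snps, k=4)

perfect = "ACTTACGTTC"    # carries both alternative alleles, no errors
with_error = "ACTTACGATC"  # one sequencing error outside the k-mer

print(map_read(perfect, reference, snps, index, k=4))                        # [0]
print(map_read(with_error, reference, snps, index, k=4))                     # []
print(map_read(with_error, reference, snps, index, k=4, max_mismatches=1))   # [0]
```

A read whose list of hits has exactly one entry maps uniquely in this sketch, while a longer list corresponds to a read with several valid mappings.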
Scripts for such analysis are under development and will eventually be incorporated into the SNP-o-matic package to augment its core function. Similarly, both reads of the pair may map to the reference, but not on the same chromosome. This information can be used to detect misassemblies. When using (super)contigs as reference sequence, read pairs can thus be used to link contigs together, determine their order, and estimate the size of the gap between two contigs. One output type of SNP-o-matic is the read bin, a file containing reads grouped by mapping behavior. Bins are a quick and easy way to filter a read set, for example to remove DNA contamination and noise from non-uniquely mapping reads, or to gather non-mapping reads for further study or assembly. Available bins are single-mapping reads (uniquely mapped in the genome), multiple-mapping reads, non-mapping reads, and reads containing IUPAC bases (e.g. ‘N’); the latter are ignored by SNP-o-matic for mapping. Mapping/alignment output is available as pileup, coverage (base count per position), CIGAR format (http://biowiki.org/CigarFormat), gff format (http://biowiki.org/GffFormat), SAM format (http://samtools.sourceforge.net/) and sqlite database (http://www.sqlite.org/). For accurate SNP genotyping, it is advantageous to take account of sequence quality scores, especially in regions with low coverage. SNP-o-matic can generate an output file showing each instance where a mapped read covers a putative SNP. Each output line contains the read name, allele position on the reference, reference and observed allele, quality score of the allele base, average and minimum quality of both the entire read and the five bases on either side of the allele-calling base, and auxiliary data. These data can be further quality-filtered and used to generate a list of non-reference majority alleles. Other outputs include observed fragment size distribution, insertion/deletion predictions and inversion detection.
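The per-SNP output line described above can be approximated with a short Python sketch. It only illustrates which quantities are summarised; the field names, the function `allele_quality_summary` and the decision to exclude the allele base itself from the five-base flanks are assumptions, not the actual SNP-o-matic file layout.

```python
from statistics import mean


def allele_quality_summary(read_name, read_seq, read_quals, read_start,
                           snp_pos, ref_base, flank=5):
    """Summarise one mapped read's evidence at one putative SNP.

    read_start and snp_pos are 0-based reference coordinates.
    Returns None when the read does not cover the SNP position.
    """
    offset = snp_pos - read_start
    if not 0 <= offset < len(read_seq):
        return None
    # Qualities of up to `flank` bases on either side of the allele base.
    # Assumption: the allele base itself is not part of the flank window.
    left = read_quals[max(0, offset - flank):offset]
    right = read_quals[offset + 1:offset + 1 + flank]
    flanks = left + right
    return {
        "read": read_name,
        "position": snp_pos,
        "reference_allele": ref_base,
        "observed_allele": read_seq[offset],
        "allele_quality": read_quals[offset],
        "read_mean_quality": mean(read_quals),
        "read_min_quality": min(read_quals),
        "flank_mean_quality": mean(flanks) if flanks else None,
        "flank_min_quality": min(flanks) if flanks else None,
    }


row = allele_quality_summary(
    read_name="read_1",
    read_seq="ACGTACGTAC",
    read_quals=[30, 30, 20, 10, 40, 40, 40, 40, 40, 40],
    read_start=100,
    snp_pos=103,
    ref_base="G",
)
print(row)
```

In this example the read shows a non-reference `T` with an allele quality of 10, well below the flank mean of 35, which is the kind of call a downstream quality filter might discard.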
These can also be determined by alternative algorithms from the aforementioned mapping/alignment outputs. SNP-o-matic is written in C++/C (for performance optimization). Compilation with the Intel icpc compiler has shown significant runtime improvements over g++. We carried out a number of performance tests, which are described in the Supplementary Material and briefly summarized below. The initial tests were based on an artificial dataset consisting of a 1 Mbp reference genome whose AT content (80%) is similar to that of the P. falciparum genome, and a duplicate genome with randomly introduced SNPs and indels. Solexa read pairs (2 × 37 bases) with random errors (one in five reads) were generated from the altered genome. SNP-o-matic correctly genotyped the SNPs when they were given as a putative SNP list. As expected, coverage dropped substantially when a SNP list was not supplied, unless the mapping stringency was reduced. We have not attempted to conduct a comprehensive comparison of the performance of SNP-o-matic with SNP discovery algorithms such as MAQ, as it is designed primarily as a tool to be used after the stage of SNP discovery. However, as an illustration of where SNP-o-matic may be useful, we found that, when analyzing clusters of six SNPs in the simulated dataset, MAQ only called two of the SNPs, whereas SNP-o-matic called all six correctly when they were supplied in a putative SNP list. The current version of SNP-o-matic does not directly detect indels, but can be adapted to do so by using an optional ‘wobble function’ to identify read pairs where one read maps perfectly but the other does not, and then using an algorithm such as velvet (Zerbino and Birney, 2008) to assemble the non-mapping reads into a contig, which is then mapped to the region covering the deletion site using an algorithm such as blat (Kent, 2002). Using this approach, we found that it was possible to detect a five-base deletion that was introduced into the simulated dataset described above.
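A dataset of the kind used in the initial tests can be imitated with the following Python sketch. It is a simplified stand-in rather than the authors' simulation: it introduces SNPs only (no indels), uses a fixed fragment length, interprets "one in five reads" as a 20% chance that a read receives a single substitution, and every name and default value (`simulate`, `fragment_len`, the genome size of 10,000 bases) is an assumption.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def simulate(seed=1, genome_len=10_000, at_content=0.8, n_snps=20,
             n_pairs=100, read_len=37, fragment_len=200, error_rate=0.2):
    """Return (reference, altered genome, SNP dict, list of read pairs)."""
    rng = random.Random(seed)
    gc = 1.0 - at_content
    # AT-rich random reference: A and T share at_content, G and C share the rest.
    reference = "".join(rng.choices(
        "ATGC",
        weights=[at_content / 2, at_content / 2, gc / 2, gc / 2],
        k=genome_len,
    ))

    # Duplicate genome with randomly introduced SNPs.
    altered = list(reference)
    snps = {}
    for pos in rng.sample(range(genome_len), n_snps):
        alt = rng.choice([b for b in "ACGT" if b != reference[pos]])
        altered[pos] = alt
        snps[pos] = alt
    altered = "".join(altered)

    def with_error(read):
        # With probability `error_rate`, substitute one randomly chosen base.
        if rng.random() >= error_rate:
            return read
        i = rng.randrange(len(read))
        wrong = rng.choice([b for b in "ACGT" if b != read[i]])
        return read[:i] + wrong + read[i + 1:]

    # Read pairs are sampled from the altered genome, one read from each
    # end of the fragment, the second one reverse-complemented.
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(genome_len - fragment_len + 1)
        fragment = altered[start:start + fragment_len]
        read1 = with_error(fragment[:read_len])
        read2 = with_error(reverse_complement(fragment[-read_len:]))
        pairs.append((read1, read2))
    return reference, altered, snps, pairs


reference, altered, snps, pairs = simulate()
at_fraction = sum(base in "AT" for base in reference) / len(reference)
print(f"AT content: {at_fraction:.2f}, SNPs: {len(snps)}, read pairs: {len(pairs)}")
```

Because the positions of the introduced SNPs are known, such a dataset allows genotype calls to be checked directly against the truth, with or without the SNP list supplied to the mapper.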
Finally, in the Supplementary Material, we provide data on the performance of SNP-o-matic on human chromosomes 1, X, and Y. Based on these findings, we estimate that, for an entire human genome with a pre-computed index and the memory-saving option, mapping the test reads should take ∼20 min and 29 GB of RAM. A similar timeframe, with <3 GB RAM usage, would be expected for a chromosome-by-chromosome serial execution; this would require an additional, albeit simple, filtering step to ensure uniqueness across the entire genome.
  7 in total

1.  BLAT--the BLAST-like alignment tool.

Authors:  W James Kent
Journal:  Genome Res       Date:  2002-04       Impact factor: 9.043

2.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors:  Daniel R Zerbino; Ewan Birney
Journal:  Genome Res       Date:  2008-03-18       Impact factor: 9.043

3.  Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors:  Heng Li; Jue Ruan; Richard Durbin
Journal:  Genome Res       Date:  2008-08-19       Impact factor: 9.043

4.  Next-generation DNA sequencing.

Authors:  Jay Shendure; Hanlee Ji
Journal:  Nat Biotechnol       Date:  2008-10       Impact factor: 54.908

5.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

6.  SOAP: short oligonucleotide alignment program.

Authors:  Ruiqiang Li; Yingrui Li; Karsten Kristiansen; Jun Wang
Journal:  Bioinformatics       Date:  2008-01-28       Impact factor: 6.937

7.  Accurate whole human genome sequencing using reversible terminator chemistry.

Authors:  David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal:  Nature       Date:  2008-11-06       Impact factor: 49.962
