PineSAP--sequence alignment and SNP identification pipeline.

Jill L Wegrzyn, Jennifer M Lee, John Liechty, David B Neale.

Abstract

The Pine Alignment and SNP Identification Pipeline (PineSAP) provides a high-throughput solution to single nucleotide polymorphism (SNP) prediction using multiple sequence alignments from re-sequencing data. The pipeline combines customized scripting, existing utilities and machine learning to increase the speed and accuracy of SNP calls. Its implementation results in significantly improved multiple sequence alignments and SNP identifications when compared with existing solutions. The use of machine learning in the SNP identifications extends the pipeline's application to any eukaryotic species where full genome sequence information is unavailable. AVAILABILITY: All code used for this pipeline is freely available at the Dendrome project website (http://dendrome.ucdavis.edu/adept2/resequencing.html).

Year: 2009; PMID: 19667082; PMCID: PMC2752621; DOI: 10.1093/bioinformatics/btp477

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 INTRODUCTION

Single nucleotide polymorphism (SNP) detection involves looking across multiple sequence alignments and identifying base discrepancies. The higher the sequence coverage and quality score at a given site, the more confident the SNP prediction. In high-throughput studies, we rely on existing tools to make confident identifications, as visual confirmation is not an option. Poor initial alignments can greatly increase both the false positive and false negative rates of SNP predictions. Automated alignment and SNP detection from re-sequencing data where a reference genome sequence is not available is an ongoing challenge. Pine (Pinus) sequence data presents an even greater obstacle as it is highly polymorphic (an average of one SNP per 50 bases) (Neale, 2007).

Existing programs such as Phrap (Lee and Vega, 2004) are heavily used for aligning genomic re-sequencing data and providing direct input to SNP prediction programs. Phrap is, by definition, a DNA sequence assembler and does not perform well when tasked with aligning highly polymorphic re-sequencing samples. It will often place individuals with different haplotypes into separate contigs. If the stringency of Phrap is reduced (in an attempt to force the creation of a single contig), misalignments of indels lead to poor overall sequence alignments. Quality benchmarks evaluated across several DNA and RNA aligners found ProbconsRNA (Do et al., 2005) to be highly accurate (Carroll et al., 2007; Wilm et al., 2006); however, it proves to be prohibitively slow for high-throughput studies.

SNP identification solutions that can accommodate fluorescence-based re-sequencing reads often require a genomic reference sequence. The utilities PolyPhred (Nickerson et al., 1997) and Polybayes (Marth et al., 1999) rely primarily on quality scores and sequence coverage. In tests utilizing Polybayes and Polyphred in loblolly pine (Pinus taeda L.), we obtained at best 78% prediction accuracy, with the majority of the discrepancy resulting from false positives. Recently, machine learning techniques have been applied to the problem of computational SNP discovery (Matukumalli et al., 2006; Unneberg et al., 2005; Zhang et al., 2005). Two of these applications either rely on an existing reference genome sequence (Zhang et al., 2005) or process EST sequences directly rather than raw tracefiles (Unneberg et al., 2005). The SNP-Phage application (Matukumalli et al., 2006) can call SNPs from fluorescent reads without a reference sequence; however, it does not address the alignment of highly polymorphic organisms prior to SNP calling.

PineSAP was developed as a high-throughput solution to analyze re-sequencing data in the form of chromatogram files for forward and reverse reads of multiple individuals. The pipeline presented here runs on the Unix/Linux platform and was written in Perl. PineSAP implements a combination of Phred, Phrap and ProbconsRNA to efficiently and accurately call bases and align re-sequencing reads. Following alignment, SNPs and indels are identified through the Polyphred and Polybayes packages. Sequence-based information is extracted and processed through a supervised machine learning algorithm for the purpose of accepting or rejecting the SNP predictions.

2 ALIGNMENT

The alignment section of the pipeline implements a de-coupled and modified version of phredPhrap on the original chromatogram files. Phred (Ewing et al., 1998) is responsible for the base calls and the assignment of quality scores. Phred is integrated into this pipeline with parameters to trim low quality ends in order to prevent alignment issues from bases in these regions. Conservative Phrap parameters are applied to prevent any misalignments in the resulting contigs and to ensure that all reads are retained. Contig consensus sequences are exported from the ace format file generated from phredPhrap and aligned with ProbconsRNA. For each contig, an aligned FASTA file is created with each read in the contig aligned to the consensus sequence. These files along with the aligned file of all contig consensus sequences are used to generate a single multi-sequence FASTA file. In this file, each read is aligned to an overall consensus for the amplicon based on the alignment in the ProbconsRNA output and each read's alignment to the consensus sequence of its member contig. The final multi-sequence aligned FASTA file is converted back to an ace formatted file suitable for input to Polybayes and Polyphred.
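The merging step described above, in which each read is carried from its contig consensus onto the overall amplicon alignment, can be illustrated with a minimal Python sketch. This is not the PineSAP code (the pipeline itself is written in Perl); the function names and data layout are hypothetical, and the sketch assumes each read row has already been laid out column-for-column against the ungapped consensus of its contig, with '-' wherever the read has no base.

```python
GAP = "-"


def project_read(gapped_consensus: str, read_on_consensus: str) -> str:
    """Re-space one read so it follows the gaps that the consensus-level
    alignment introduced into its contig consensus.

    Assumption (hypothetical layout): read_on_consensus has exactly one
    character per ungapped consensus base.
    """
    ungapped_len = sum(1 for base in gapped_consensus if base != GAP)
    if ungapped_len != len(read_on_consensus):
        raise ValueError("read row does not match the ungapped consensus length")
    read_chars = iter(read_on_consensus)
    # Wherever the consensus received a gap, the read receives one too;
    # otherwise the next read character is copied through.
    return "".join(GAP if base == GAP else next(read_chars)
                   for base in gapped_consensus)


def merge_contig_alignments(consensus_alignment: dict, contig_reads: dict) -> dict:
    """Build one multi-sequence alignment from per-contig read alignments.

    consensus_alignment: contig name -> gapped consensus (all equal length)
    contig_reads: contig name -> {read name -> read row on that consensus}
    """
    merged = {}
    for contig, gapped in consensus_alignment.items():
        for read_name, row in contig_reads.get(contig, {}).items():
            merged[read_name] = project_read(gapped, row)
    return merged


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    consensus_alignment = {"Contig1": "ACGT-ACGT", "Contig2": "ACGTTAC-T"}
    contig_reads = {
        "Contig1": {"readA": "ACGTACGT", "readB": "--GTACG-"},
        "Contig2": {"readC": "ACGTTACT"},
    }
    for name, row in merge_contig_alignments(consensus_alignment, contig_reads).items():
        print(name, row)
```

Because every read inherits the gap pattern of its own contig consensus, all rows end up the same length as the consensus-level alignment, which is what a single multi-sequence FASTA file requires.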

3 SNP CALLING

The purpose of the classifier is to evaluate the accuracy of the SNP calls resulting from Polyphred and Polybayes. Sequence-based statistics were derived through a customized feature extraction program and fed as a vector for each polymorphism to the J48 classification tree available in the WEKA classifier package. The final set of features represents the local and global sequence variation, alignment depth and quality, local and global base quality and sequence alignment quality. From the set of 300 amplicons used for training, all of the true positive and false negative vectors are labeled as real SNPs, and the true negative and false positive vectors as non-SNPs. The resulting SNP calls were extracted from their respective output files along with flanking sequence, quality scores and a normalized confidence score.
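To make the idea of a per-polymorphism feature vector concrete, the following Python sketch computes a few site-level statistics from an alignment and its base qualities. The source does not list the exact features PineSAP extracts, so the four features, their names and the window size here are assumptions chosen only to mirror the categories mentioned above (variation, depth and local base quality); the classifier itself (J48 in WEKA) is not reproduced.

```python
from collections import Counter
from statistics import mean

GAP = "-"


def site_features(reads, quals, col, flank=2):
    """Return a small, hypothetical feature vector for one alignment column.

    reads: equal-length gapped read strings
    quals: per-read lists of Phred-style quality values, same shape as reads
    col:   0-based column of the candidate polymorphism
    flank: number of columns on each side used for the local window
    """
    bases = [read[col] for read in reads if read[col] != GAP]
    depth = len(bases)

    # Minor allele frequency: share of the second most common base.
    ranked = Counter(bases).most_common()
    minor_count = ranked[1][1] if len(ranked) > 1 else 0
    maf = minor_count / depth if depth else 0.0

    # Base quality at the site itself, ignoring gapped reads.
    site_q = [q[col] for read, q in zip(reads, quals) if read[col] != GAP]

    # Base quality in a window around the site, again ignoring gaps.
    lo, hi = max(0, col - flank), min(len(reads[0]), col + flank + 1)
    window_q = [q[j] for read, q in zip(reads, quals)
                for j in range(lo, hi) if read[j] != GAP]

    return {
        "depth": depth,
        "minor_allele_freq": maf,
        "mean_site_quality": mean(site_q) if site_q else 0.0,
        "mean_window_quality": mean(window_q) if window_q else 0.0,
    }
```

One such dictionary per candidate site, together with a real-SNP or non-SNP label from manual validation, is the kind of input a supervised classification tree would be trained on.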

4 VALIDATION

Manual validation of the alignments was completed with the same 300 loblolly pine amplicons. Each amplicon had between 8 and 36 reads. With Phrap alone and default parameters, 23% of the amplicons were placed in a single contig, 27% into two and 20% into three. Amplicons with two or more contigs could be forced into a single contig 82% of the time; however, the resulting alignment had problems in 34% of cases. The alignment method implemented in PineSAP improves the success rate to 98%. To evaluate speed, the alignment method described above was compared with a straight ProbconsRNA implementation, which we determined would take ∼25 times longer to process 36 sequences per amplicon. The classification tree generated from the training sequences was tested against a unique set of 120 sequences with 563 manually validated SNPs. All SNP calls were classified by visual inspection of the Polyphred and Polybayes predictions in Consed (Gordon et al., 1998). The classification tree resulted in a significant overall improvement, with a calculated accuracy of 93.6% (Table 1).

Table 1. Results of SNP prediction on the test sequence data

Evaluation (%)	J48	Polyphred	Polybayes
Accuracy	93.6	76.25	78.02
Sensitivity	88.21	83.22	86.54
Specificity	98.73	N/A	N/A
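
The source does not spell out how the three measures in Table 1 were calculated. Assuming the standard confusion-matrix definitions, they can be computed as in the Python sketch below; the function name is hypothetical and the counts in the usage note are invented for illustration, not taken from the paper.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity and specificity as fractions in [0, 1].

    tp/fn: real SNPs that were called / missed
    tn/fp: non-SNPs that were rejected / wrongly called
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one classified site is required")

    def ratio(numerator, denominator):
        # An undefined ratio (empty class) is reported as NaN, not as 0.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

For example, with made-up counts of 90 true positives, 5 false positives, 95 true negatives and 10 false negatives, the function returns an accuracy of 0.925, a sensitivity of 0.9 and a specificity of 0.95; multiplying by 100 gives percentages in the form used in Table 1.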

References (12 in total)

1. A general approach to single-nucleotide polymorphism discovery.

Authors: G T Marth; I Korf; M D Yandell; R T Yeh; Z Gu; H Zakeri; N O Stitziel; L Hillier; P Y Kwok; W R Gish
Journal: Nat Genet Date: 1999-12 Impact factor: 38.330

2. Heterogeneity detector: finding heterogeneous positions in Phred/Phrap assemblies.

Authors: W H Lee; V B Vega
Journal: Bioinformatics Date: 2004-05-06 Impact factor: 6.937

3. SNP discovery using advanced algorithms and neural networks.

Authors: Per Unneberg; Michael Strömberg; Fredrik Sterky
Journal: Bioinformatics Date: 2005-03-03 Impact factor: 6.937

4. ProbCons: Probabilistic consistency-based multiple sequence alignment.

Authors: Chuong B Do; Mahathi S P Mahabhashyam; Michael Brudno; Serafim Batzoglou
Journal: Genome Res Date: 2005-02 Impact factor: 9.043

5. Genomics to tree breeding and forest health. (Review)

Authors: David B Neale
Journal: Curr Opin Genet Dev Date: 2007-12 Impact factor: 5.578

6. DNA reference alignment benchmarks based on tertiary structure of encoded proteins.

Authors: Hyrum Carroll; Wesley Beckstead; Timothy O'Connor; Mark Ebbert; Mark Clement; Quinn Snell; David McClellan
Journal: Bioinformatics Date: 2007-08-08 Impact factor: 6.937

7. Consed: a graphical tool for sequence finishing.

Authors: D Gordon; C Abajian; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

8. PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing.

Authors: D A Nickerson; V O Tobe; S L Taylor
Journal: Nucleic Acids Res Date: 1997-07-15 Impact factor: 16.971

9. An enhanced RNA alignment benchmark for sequence alignment programs.

Authors: Andreas Wilm; Indra Mainz; Gerhard Steger
Journal: Algorithms Mol Biol Date: 2006-10-24 Impact factor: 1.405

10. Application of machine learning in SNP discovery.

Authors: Lakshmi K Matukumalli; John J Grefenstette; David L Hyten; Ik-Young Choi; Perry B Cregan; Curtis P Van Tassell
Journal: BMC Bioinformatics Date: 2006-01-06 Impact factor: 3.169
