Literature DB >> 19463194

Perl module and PISE wrappers for the integrated analysis of sequence data and SNP features.

B Jayashree¹, A Bhanuprakash, Anusha Jami, P Srinivasa Reddy, Spurthi Nayak, Rajeev K Varshney.

Abstract

BACKGROUND: There is a need for software scripts and modules for format parsing, data manipulation, statistical analysis and annotation especially for tasks related to marker identification from sequence data and sequence diversity analysis.
RESULTS: Here we present several new Perl scripts and a module for sequence data diversity analysis. To enable the use of these software with other public domain tools, we also make available PISE (Pasteur Institute Software Environment) wrappers for these Perl scripts and module. This enables the user to generate pipelines for automated analysis, since PISE is a web interface generator for bioinformatics programmes.
CONCLUSION: A new set of modules and scripts for diversity statistic calculation, format parsing and data manipulation are available with PISE wrappers that enable pipelining of these scripts with commonly used contig assembly and sequence feature prediction software, to answer specific sequence diversity related questions.

Entities: Chemical Species

Year: 2009 PMID： 19463194 PMCID： PMC2694205 DOI： 10.1186/1756-0500-2-92

Source DB: PubMed Journal: BMC Res Notes ISSN： 1756-0500

Background

Single Nucleotide Polymorphisms (SNPs) are commonly found throughout the genome and provide dense maps over small chromosomal regions. The recent advances in sequencing and genotyping have made large scale SNP diversity analysis possible in several crop species. This helps assess genome variation that can then be harnessed for crop improvement. Sequence diversity information may be desirable across defined groups of sequences, such as candidate gene transcripts from different genotypes, or assembled transcripts for a particular marker from more than one genotype. The grouping could be based on the objective of the study – across race, location, genes or regions within genes. Sequence data analysis usually involves steps such as clustering of sequence data, to determine redundancy levels. Sequence assembly is carried out to generate consensus sequences or contigs and singlets. The user then processes this output to determine presence of microsatellites or SNPs. Along with SNP identification it is also desirable to obtain other aspects from the alignment; such as SNP and indel (insertion-deletions) frequency, the type of variant and haplotypes, PIC value for the SNP and haplotype besides nucleotide diversity (π). Validation of predicted SNP(s) through wet lab experiments is the next step to convert the identified SNP into a genetic marker. Although more than 30 SNP genotyping platforms are currently available, these are both expensive and demand considerable expertise. One solution for validating the identified SNP(s) through cost effective SNP genotyping platform is development of CAPS (cleaved amplified polymorphic sequences) marker by predicting the restriction enzyme that can use the identified SNP as a recognition site. There are several available software solutions for sequence clustering, and a few popular ones for assembly. The popular group of Clustal programmes [1], d2-cluster [2] for EST clustering and cap3 or PCAP [3,4], the TIGR assembler [5] or Phrap [6] are used for sequence assembly. Similarly, there are many freely available software programmes for the identification of SNPs[7-10]. DnaSP reports on nucleotide polymorphism features from aligned sequence data [11]. None of them however automate group wise identification and reporting of polymorphism statistics and more importantly consider the presence of heterozygous loci in the sequence data. Many available programs read heterozygous SNPs as missing/bad quality sequence data and thus do not consider them for analysis. As a result features such as sequence diversity, PIC of SNP and haplotypes, etc. may be underestimated. The need for a module that could report SNP features for any number of user defined groups coupled with the need to be able to calculate statistics taking into consideration the presence of heterozygous loci led to the development of the SNP DIVersity ESTimator module (divest.pm). Sequence analysis involves pipelining of data from one software to another and often also includes branched flows such as when annotation of sequences with putative function is also a requirement. Format conversion scripts to convert output of one program to input of another are needed when the user wants to pipeline several tools and modules. Along with output parsing scripts, some degree of automation can be achieved in data analysis tasks. The availability of software environments for pipelining and workflow management help the user to create custom analysis pipelines. The PISE programme [12,13] is a robust environment that has been around for many years now and allows integration of internal scripts/tools that are part of the target execution environment as well as external tools that a user runs. Ease of use is achieved through the creation of a graphical user interface (GUI) for all of the programmes/scripts available in the environment and the chaining together of scripts to facilitate automation and analysis. So rather than reinvent a workflow environment; we implemented PISE locally and provided PISE XML wrappers for the Perl scripts and modules developed by us, besides making them available as web services. The availability of the programs and wrapper sripts allow users to implement versatile pipelines either in the familiar browser environment or in the Taverna workbench. The modules and scripts are available to all interested users.

Implementation

The source code for the programs has been written in Perl 5.8. For the diversity estimator module, Bioperl modules have been used. The scripts have been tested within the PISE environment and are being used independently as well, on machines running either Linux or Windows OS. An example pipeline that makes use of the scripts written is indicated (Figure 1A). The modules allow a large amount of data to be processed; we have tested it to work with up to 99 user defined groups consisting of up to 225 sequences per group. Interface development involves writing the XML specification for the programme/script/module using the grammar provided , where file manipulations and redirection of output files to another programme are also specified. The scripts are also available as a soaplab service and can so be called through the Taverna workflow tool.

Figure 1

A. An example pipeline that can be executed through the PISE environment. Blue boxes represent inputs and outputs in the pipeline. B. The PISE interface to the diversity estimator module. C. Screenshot of divest.pm ouput.

Results and discussion

Functionality of Perl modules and scripts

The module divest.pm is the diversity estimator module that reads an input file derived from assembly programmes like cap3 or the alignment file output from multiple sequence alignment programmes like ClustalW or BioEdit. The file contains the assemblies of reads for each user defined group. The user can choose to calculate diversity statistics for all groups or one particular group. The statistics calculated include nucleotide diversity, number of SNPs, and type of variant whether transition or transversion, SNP frequency, number of haplotypes, the PIC (polymorphism information content) of SNP and haplotype, haplotype frequency and also number and frequency of indels. SNP and indel frequency are simply calculated as the total length of the sequence in base pairs/number of SNPs or indels. The count of number of haplotypes and haplotype frequency is based on the standard methods [14]. An important addition in these calculations is the consideration of heterozygous loci. The user can replace an N with an H (in case of true heterozygote) based on two peaks in chromatograms obtained from sequencing machines. The algorithm assigns 'H' two alleles (A/T or C/G each with a value of 0.5) and this is used in haplotype analysis and sequence diversity calculations. The nucleotide diversity π is calculated using the formula [15] where k is the number of SNPs identified in an alignment of 'n' genotypes, L is the number of basepairs and . The PIC of SNP is calculated using the formula [16] where pi is the frequency of the ith allele at a given SNP locus. The haplotype diversity is calculated using the formula [17,18] where K is the number of haplotypes, pi is the frequency of haplotypes and n is the total number of reads. The statistics generated is exported in Excel format, besides the script also generates input file to Network 4.502 . Network helps generate evolutionary trees from various data including haplotype and haplotype frequency data. Scripts that enable the conversion of a multiple sequence alignment into the diversity estimator input file have been currently written for the ClustalW and cap3 programmes. Another script that enables concatenating cap3 contig and singlets file is also available, the output of this file generates a 'unigene' file that can serve as input to the MISA microsatellite analysis programme. An example analysis pipeline is provided in Figure 1 along with the PISE interface to the diversity estimator programme and its output.

Accessibility of modules and scripts

Figure 2 illustrates the XML specification for a module like divest.pm. Chaining together of programmes is made feasible through the parameter setting in the XML file, piping output files to input file of the next programme. In a PISE implementation available at , all mentioned scripts are available under assembly, format parsing, statistics and data manipulation directories. A sample ACD file for the divest module is also given, this provides meta-data regarding the input and output files for this module and can be used by soaplab for deploying divest.pm as a web service.

Figure 2

PISE XML wrapper and ACD file for the diversity estimator module.

Intended use

The modules and scripts being provided through the PISE environment can be used for various purposes (a) identifying microsatellites in a set of unigenes, (b) calculating the diversity between two or more groups of data based on geographical location of the genotypes (e.g. cultivated, wild, landraces, etc.), regions of the gene (e.g. intronic and exonic regions) being sampled, (c) sequence diversity features in a candidate gene across several genotypes and (d) identifying SNPs and finding out how many of these are convertible to CAPS, (e) functionally annotate sequences. The ability to chain scripts to generate pipelines reduces the file management burden for the user. A part of the pipeline indicated in Figure 1 can also be executed through the Taverna workflow environment [19], which provides web service access besides local service access. The advantage to the user would be the accessibility and integration of databases through web services. Some of the scripts are available as soaplab web services at http://220.227.242.214:8080/soaplab2/; under the heading sequence analysis. As the modules and scripts are being released as open source code, interested users can continue to use as well as make improvements to them.

Conclusion

Modules and scripts that facilitate sequence diversity analysis, format parsing and data manipulation are being made available along with source code. The module/scripts can be chained together with other freely available sequence analysis software to generate pipelines for automated data analysis. The pipelining has been currently made feasible through the PISE web interface generator and a part of it can also be executed on the Taverna workbench. The module and scripts are available to interested users.

Availability & requirements

Modules available from: . Can be accessed and used at Soaplab service: Operating System: platform independent Programming language: Perl PISE SNP2CAPS MISA, cap3 (optional; ; ) Any restrictions to use by non-academics: none

Abbreviations

ACD: Ajax Command Definition; SNP: Single Nucleotide Polymorphism; PIC: Polymorphism Information Content; XML: Extensible Markup Language; PISE: Pasteur Institute Software Environment; CAPS: Cleaved Amplified Polymorphic markers; EST: Expressed Sequence Tag; PCAP: Parallel Contig Assembly Program; CPAN: Comprehensive Perl Archive Network; MISA: MIcroSAtellite Identification tool

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BA, AJ coded for the Perl modules and scripts, PSR and SN used the programmes on their datasets, RKV articulated the use cases, indices for the diversity estimator module while JB developed the pipeline architecture and wrote the manuscript.

16 in total

1. A Web interface generator for molecular biology programs in Unix.

Authors: C Letondal
Journal: Bioinformatics Date: 2001-01 Impact factor: 6.937

Review 2. Applications of single nucleotide polymorphisms in crop genetics.

Authors: Antoni Rafalski
Journal: Curr Opin Plant Biol Date: 2002-04 Impact factor: 7.834

Review 3. Pise: software for building bioinformatics webs.

Authors: Don Gilbert
Journal: Brief Bioinform Date: 2002-12 Impact factor: 11.622

4. Multiple sequence alignment with the Clustal series of programs.

Authors: Ramu Chenna; Hideaki Sugawara; Tadashi Koike; Rodrigo Lopez; Toby J Gibson; Desmond G Higgins; Julie D Thompson
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

5. Haplotype diversity in 11 candidate genes across four populations.

Authors: T H Beaty; M D Fallin; J B Hetmanski; I McIntosh; S S Chong; R Ingersoll; X Sheng; R Chakraborty; A F Scott
Journal: Genetics Date: 2005-06-18 Impact factor: 4.562

6. Automated discovery of single nucleotide polymorphism and simple sequence repeat molecular genetic markers.

Authors: Jacqueline Batley; Erica Jewell; David Edwards
Journal: Methods Mol Biol Date: 2007

7. d2_cluster: a validated method for clustering EST and full-length cDNAsequences.

Authors: J Burke; D Davison; W Hide
Journal: Genome Res Date: 1999-11 Impact factor: 9.043

8. Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.).

Authors: R Kota; S Rudd; A Facius; G Kolesov; T Thiel; H Zhang; N Stein; K Mayer; A Graner
Journal: Mol Genet Genomics Date: 2003-08-23 Impact factor: 3.291

9. HaploSNPer: a web-based allele and SNP detection tool.

Authors: Jifeng Tang; Jack A M Leunissen; Roeland E Voorrips; C Gerard van der Linden; Ben Vosman
Journal: BMC Genet Date: 2008-02-28 Impact factor: 2.797

10. QualitySNP: a pipeline for detecting single nucleotide polymorphisms and insertions/deletions in EST data from diploid and polyploid species.

Authors: Jifeng Tang; Ben Vosman; Roeland E Voorrips; C Gerard van der Linden; Jack A M Leunissen
Journal: BMC Bioinformatics Date: 2006-10-09 Impact factor: 3.169

6 in total

1. The first set of EST resource for gene discovery and marker development in pigeonpea (Cajanus cajan L.).

Authors: Nikku L Raju; Belaghihalli N Gnanesh; Pazhamala Lekha; Balaji Jayashree; Suresh Pande; Pavana J Hiremath; Munishamappa Byregowda; Nagendra K Singh; Rajeev K Varshney
Journal: BMC Plant Biol Date: 2010-03-11 Impact factor: 4.215

2. Development and use of genic molecular markers (GMMs) for construction of a transcript map of chickpea (Cicer arietinum L.).

Authors: Neha Gujaria; Ashish Kumar; Preeti Dauthal; Anuja Dubey; Pavana Hiremath; A Bhanu Prakash; Andrew Farmer; Mangla Bhide; Trushar Shah; Pooran M Gaur; Hari D Upadhyaya; Sabhyata Bhatia; Douglas R Cook; Greg D May; Rajeev K Varshney
Journal: Theor Appl Genet Date: 2011-03-08 Impact factor: 5.699

3. Single feature polymorphisms (SFPs) for drought tolerance in pigeonpea (Cajanus spp.).

Authors: Rachit K Saxena; Xinping Cui; Vivek Thakur; Barbara Walter; Timothy J Close; Rajeev K Varshney
Journal: Funct Integr Genomics Date: 2011-05-06 Impact factor: 3.410

4. Allele diversity for abiotic stress responsive candidate genes in chickpea reference set using gene based SNP markers.

Authors: Manish Roorkiwal; Spurthi N Nayak; Mahendar Thudi; Hari D Upadhyaya; Dominique Brunel; Pierre Mournet; Dominique This; Prakash C Sharma; Rajeev K Varshney
Journal: Front Plant Sci Date: 2014-06-05 Impact factor: 5.753

5. Candidate gene analysis for determinacy in pigeonpea (Cajanus spp.).

Authors: Reyazul Rouf Mir; Himabindu Kudapa; Sandhya Srikanth; Rachit K Saxena; Ashutosh Sharma; Sarwar Azam; Kulbhushan Saxena; R Varma Penmetsa; Rajeev K Varshney
Journal: Theor Appl Genet Date: 2014-10-21 Impact factor: 5.699

6. A comprehensive resource of drought- and salinity- responsive ESTs for gene discovery and marker development in chickpea (Cicer arietinum L.).

Authors: Rajeev K Varshney; Pavana J Hiremath; Pazhamala Lekha; Junichi Kashiwagi; Jayashree Balaji; Amit A Deokar; Vincent Vadez; Yongli Xiao; Ramamurthy Srinivasan; Pooran M Gaur; Kadambot Hm Siddique; Christopher D Town; David A Hoisington
Journal: BMC Genomics Date: 2009-11-15 Impact factor: 3.969

6 in total