B Jayashree, Manindra S Hanspal, Rajgopal Srinivasan, R Vigneshwaran, Rajeev K Varshney, N Spurthi, K Eshwar, N Ramesh, S Chandra, David A Hoisington.
Abstract
The large amounts of EST sequence data available for a single species, as well as for several species within a genus, provide an easy source for identifying intra- and interspecies single nucleotide polymorphisms (SNPs). In the case of model organisms, the data available are numerous, given the degree of redundancy in the deposited EST data. Several bioinformatics tools can be used to mine these data; however, using them requires a certain level of expertise: the tools have to be run sequentially with accompanying format conversion, and steps such as clustering and assembly of sequences become time-intensive even for moderately sized datasets. We report here a pipeline of open source software, extended to run on multiple CPU architectures, that can be used to mine large EST datasets for SNPs and to identify restriction sites for assaying those SNPs, so that cost-effective CAPS assays can be developed for SNP genotyping in genetics and breeding applications. At the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), the pipeline has been implemented to run on a Paracel high-performance system consisting of four dual AMD Opteron processors running Linux with MPICH. The pipeline can be accessed through user-friendly web interfaces at http://hpc.icrisat.cgiar.org/PBSWeb and is available on request for academic use. We have validated the pipeline by mining chickpea ESTs for interspecies SNPs, developing CAPS assays for SNP genotyping, and confirming the restriction digestion pattern at the sequence level.
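The idea behind a CAPS (cleaved amplified polymorphic sequence) assay can be illustrated with a minimal Python sketch: a SNP is assayable when it creates or destroys a restriction enzyme recognition site, so the two alleles give different digestion patterns. This is not the SNP2CAPS implementation used in the pipeline. The function names, the four-enzyme table, and the toy allele sequences are assumptions made here for illustration only.

```python
# Illustrative sketch only; not the SNP2CAPS code used by the pipeline.
# Recognition sequences of four common restriction enzymes. All four are
# palindromic, so scanning a single strand is sufficient in this example.
ENZYMES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
    "TaqI": "TCGA",
}


def count_sites(seq, site):
    """Count occurrences (overlaps allowed) of a recognition site in seq."""
    seq = seq.upper()
    count = 0
    start = seq.find(site)
    while start != -1:
        count += 1
        start = seq.find(site, start + 1)
    return count


def caps_candidates(allele_a, allele_b, enzymes=ENZYMES):
    """Return {enzyme: (sites in allele_a, sites in allele_b)} for enzymes
    whose number of recognition sites differs between the two alleles."""
    result = {}
    for name, site in enzymes.items():
        a = count_sites(allele_a, site)
        b = count_sites(allele_b, site)
        if a != b:
            result[name] = (a, b)
    return result


# Hypothetical alleles that differ by one A/G SNP inside an EcoRI site.
allele_a = "ACGTGAATTCTTGCA"
allele_b = "ACGTGAGTTCTTGCA"
print(caps_candidates(allele_a, allele_b))
```

With these toy alleles only EcoRI distinguishes the two sequences, because the SNP removes its recognition site from the second allele. A real tool would read aligned contigs and a full enzyme database rather than a hard-coded table.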
Year: 2007 PMID: 18273384 PMCID: PMC2216057 DOI: 10.1155/2007/35604
Source DB: PubMed Journal: Comp Funct Genomics ISSN: 1531-6912
Figure 1. Web interfaces to the pipeline: (A) the job submission page; (B) retrieval of output files; (C) visualization of PCAP assemblies in Consed; (D) visualization of polybayes alignment files using Gendoc; (E) SNP2CAPS output; (F) example Excel sheet returning number of haplotypes, INDELs, PIC, and P values from PCAP or polybayes output files.
Test datasets for the pipeline, output, and time taken (running on four 64-bit dual AMD Opteron nodes of the Paracel high-performance Linux cluster).
| Species | Size of EST dataset | Number of clusters | Maximum size of cluster | Minimum size of cluster | Average size of cluster | Number of contigs | Total SNPs identified | Indels | Total time taken |
|---|---|---|---|---|---|---|---|---|---|
| Wheat | 306699318 bp (579879 seq.) | 27461 | 268269 | 2 | 9.769 | 39280 | 12217 | 10734 | 11 h 16 min |
| Maize | 183675067 bp (407423 seq.) | 22650 | 125014 | 2 | 5.5193 | 28008 | 10780 | 7822 | 7 h 35 min |
| Soybean | 135866187 bp (330436 seq.) | 22043 | 99619 | 2 | 4.5193 | 34622 | 7423 | 8178 | 7 h 4 min |
| | 124381970 bp (227587 seq.) | 18488 | 54471 | 2 | 2.9462 | 21599 | 9463 | 8700 | 4 h 30 min |
| | 121146352 bp (226923 seq.) | 16151 | 36053 | 2 | 2.2322 | 23839 | 6942 | 16319 | 4 h 2 min |
| | 2596245 bp (48334 seq.) | 5537 | 1877 | 2 | 0.3389 | 4558 | 2086 | 1110 | 1 h |
| | 8357124 bp (14381 seq.) | 1473 | 2517 | 2 | 1.7087 | 1180 | 454 | 532 | 22 min |
| Rye | 4342748 bp (9253 seq.) | 1295 | 174 | 2 | 0.1343 | 568 | 218 | 86 | 18 min |
| Millet | 1486253 bp (3106 seq.) | 440 | 184 | 2 | 0.4181 | 135 | 28 | 35 | 3 min |
| Pigeonpea | 428564 bp (925 seq.) | 88 | 14 | 2 | 0.1590 | 6 | 4 | 1 | 1 min |
(a) Output of the first step of the pipeline, namely, clustering with parallelized MegaBlast.
(b) Average size of cluster = maximum size of cluster/number of clusters.
(c) Number of contigs derived from PCAP output.
(d) SNPs and indels identified from the polybayes output file by custom scripts.
(e) Total time taken from EST file upload to SNP2CAPS output.
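The footnote's definition of average cluster size can be checked against the table with a short Python sketch. The rows below are copied from the named-species entries of the table; the function name and the tolerance are assumptions chosen for this illustration.

```python
# Rows copied from the table above:
# (species, number of clusters, maximum size of cluster, reported average).
ROWS = [
    ("Wheat", 27461, 268269, 9.769),
    ("Maize", 22650, 125014, 5.5193),
    ("Soybean", 22043, 99619, 4.5193),
    ("Rye", 1295, 174, 0.1343),
    ("Millet", 440, 184, 0.4181),
    ("Pigeonpea", 88, 14, 0.1590),
]


def average_cluster_size(max_size, n_clusters):
    """Average size of cluster as defined in the table footnote:
    maximum size of cluster divided by number of clusters."""
    return max_size / n_clusters


for species, n_clusters, max_size, reported in ROWS:
    computed = average_cluster_size(max_size, n_clusters)
    # Tolerance of 1e-3 is an arbitrary choice for this check.
    agrees = abs(computed - reported) < 1e-3
    print(f"{species}: computed {computed:.4f}, reported {reported}, agrees={agrees}")
```

Under this definition the computed ratios match the reported column to within the chosen tolerance; the reported figures appear to be truncated rather than rounded (for example, 184/440 is about 0.41818, reported as 0.4181).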
Figure 2. Multiple sequence alignment of the sequences obtained using the CL3e primer in 11 Cicer genotypes.