Sarwar Azam1, Abhishek Rathore1, Trushar M Shah1, Mohan Telluri1, BhanuPrakash Amindala1, Pradeep Ruperao2, Mohan A V S K Katta1, Rajeev K Varshney1.
Abstract
Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing (NGS) data commonly require working knowledge of the command line interface, massive computational resources and expertise, which makes them daunting for biologists. Further, the SNP information generated may not be readily usable for downstream processes such as genotyping. Hence, a comprehensive pipeline called Integrated SNP Mining and Utilization (ISMU) has been developed by integrating several open source NGS tools with a graphical user interface, for SNP discovery and their utilization in developing genotyping assays. The pipeline features pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction methods (SAMtools/SOAPsnp/CNS2snp and CbCC) and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of the genotypes analyzed, in addition to SNPs against the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value, with flanking sequences, for identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to quickly process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data. The pipeline is useful for members of the plant genetics and breeding community without computational expertise who want to discover SNPs and use them in genomics, genetics and breeding studies. The pipeline has been parallelized to process large NGS datasets.
It has been developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as free standalone software.
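To make the assay-oriented output mentioned in the abstract more concrete, the following is a minimal sketch of how a polymorphism information content (PIC) value and a flanked SNP string might be computed. It assumes the commonly used Botstein-style PIC formula and a `LEFT[REF/ALT]RIGHT` layout for the flanking sequence; the abstract does not state which formula or layout ISMU uses, and the function names `pic` and `flanked_snp` are hypothetical.

```python
def pic(allele_freqs):
    """PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2 (assumed formula)."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Expected homozygosity term.
    homozygosity = sum(p * p for p in allele_freqs)
    # Correction term over all unordered pairs of alleles.
    correction = sum(
        2 * allele_freqs[i] ** 2 * allele_freqs[j] ** 2
        for i in range(len(allele_freqs))
        for j in range(i + 1, len(allele_freqs))
    )
    return 1.0 - homozygosity - correction


def flanked_snp(reference, pos, ref_allele, alt_allele, flank=5):
    """Return 'LEFT[REF/ALT]RIGHT' for a 0-based SNP position (assumed layout)."""
    if reference[pos] != ref_allele:
        raise ValueError("reference allele does not match the sequence")
    left = reference[max(0, pos - flank):pos]
    right = reference[pos + 1:pos + 1 + flank]
    return f"{left}[{ref_allele}/{alt_allele}]{right}"


if __name__ == "__main__":
    # Hypothetical biallelic SNP with equal allele frequencies.
    print(pic([0.5, 0.5]))
    # Hypothetical toy reference sequence, not taken from any dataset.
    print(flanked_snp("ACGTACGTACGTACG", 7, "T", "C", flank=5))
```

For a biallelic SNP the PIC under this formula peaks at 0.375 when both alleles are equally frequent, which is why it can serve as a simple ranking criterion when choosing markers.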
Year: 2014 PMID: 25003610 PMCID: PMC4086967 DOI: 10.1371/journal.pone.0101754
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The work-flow of the ISMU pipeline.
The work-flow of the ISMU pipeline is divided into three main steps: (A) data import and quality pre-processing, (B) sequence alignment and SNP discovery, and (C) visualization and generation of input files for genotyping assays.
Details about whole genome re-sequencing (WGRS) dataset used for evaluation of the pipeline.
| Genotype name | Type | Total number of raw reads (PE) | Read length (bp) | Total number of filtered reads (PE) | Read length (bp) (post filter) | Alignment (%) | Number of SNPs as compared to the reference genome |
| Pistol | | 33,467,106 | 101 | 32,193,245 | 86/87 | 93.03 | 317,991 |
| Hat Trick | | 32,021,614 | 101 | 31,126,432 | 86/87 | 92.53 | 156,255 |
| Slasher | | 31,093,427 | 101 | 30,286,838 | 86/88 | 92.51 | 351,844 |
| Genesis 90 | | 32,210,496 | 101 | 31,467,878 | 86/88 | 92.66 | 253,472 |
Raw reads from the datasets listed above were filtered and then aligned against the chickpea reference genome. These reads covered 92% of the reference genome. The pre-processing step of the pipeline trimmed the 101 bp reads into paired-end reads of 86 bp/87 bp or 86 bp/88 bp. The Hat Trick genotype yielded roughly half as many SNPs as the other genotypes.
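The figures in the WGRS table can be checked with a few lines of arithmetic. The sketch below recomputes the share of reads retained after filtering and the ratio of Hat Trick SNPs to the mean of the other three genotypes; the read and SNP counts are copied from the table, while the function names are hypothetical.

```python
# (raw PE reads, filtered PE reads, SNPs vs. reference), copied from the WGRS table.
wgrs = {
    "Pistol": (33_467_106, 32_193_245, 317_991),
    "Hat Trick": (32_021_614, 31_126_432, 156_255),
    "Slasher": (31_093_427, 30_286_838, 351_844),
    "Genesis 90": (32_210_496, 31_467_878, 253_472),
}


def retention_percent(raw, filtered):
    """Percentage of raw reads that survive the filtering step."""
    return 100.0 * filtered / raw


def snp_ratio_to_others(table, genotype):
    """SNP count of one genotype divided by the mean SNP count of the others."""
    others = [row[2] for name, row in table.items() if name != genotype]
    return table[genotype][2] / (sum(others) / len(others))


if __name__ == "__main__":
    for name, (raw, filtered, _snps) in wgrs.items():
        print(f"{name}: {retention_percent(raw, filtered):.2f}% of reads retained")
    print(f"Hat Trick / mean of others: {snp_ratio_to_others(wgrs, 'Hat Trick'):.2f}")
```

With these inputs the filter retains between 96% and 98% of reads for every genotype, and the Hat Trick ratio comes out at about 0.51, consistent with the statement that it shows roughly half as many SNPs.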
Restriction site associated DNA (RAD) sequence dataset used for evaluation of the pipeline.
| Genotype name | Total number of reads (SE) | Read length (bp) | Total number of filtered reads | Read length (bp) (post filter) | Alignment (%) |
| ICCV 03107 | 2,360,400 | 100 | 2,250,687 | 78 | 91.27 |
| ICC 4918 | 5,761,446 | 100 | 5,486,801 | 78 | 89.82 |
| ICC 4930 | 10,595,164 | 100 | 10,103,218 | 78 | 89.90 |
| ICC 4958 | 10,874,599 | 100 | 10,400,166 | 79 | 91.83 |
| ICC 5270 | 8,198,607 | 100 | 7,790,524 | 78 | 90.38 |
| ICCV 05530 | 8,011,084 | 100 | 7,611,453 | 78 | 89.96 |
| ICC 5810 | 8,587,698 | 100 | 8,213,783 | 79 | 90.68 |
| ICC 5912 | 5,422,669 | 100 | 5,183,888 | 79 | 91.12 |
| ICC 6263 | 8,167,648 | 100 | 7,808,763 | 79 | 90.89 |
| ICC 8261 | 6,245,558 | 100 | 5,943,309 | 81 | 88.94 |
Raw reads from the datasets listed above were filtered and then aligned against the chickpea reference genome. The reads covered 88% to 92% of the reference genome. The pre-processing step of the pipeline trimmed the 100 bp reads into single-end reads of 78 bp to 81 bp.
RNAseq dataset used for evaluation of the pipeline.
| Genotype name | Raw reads (PE) | Raw read length (bp) | Filtered reads (PE) | Filtered read length (bp) | Alignment (%) | SNPs with reference |
| HuaU12 | 6,857,839 | 90/90 | 6,733,549 | 72/74 | 82.51 | 41,225 |
| HuaU606 | 6,771,173 | 90/90 | 6,649,229 | 72/74 | 78.71 | 44,984 |
RNA sequencing reads from two peanut genotypes were included in this dataset. Raw reads were filtered and then aligned against the peanut unigene sequences (ftp://ftp.ncbi.nih.gov/repository/UniGene/Arachis_hypogea/Ahy.seq.uniq.gz) as reference. The pre-processing step of the pipeline trimmed the 90 bp reads into paired-end reads of 72 bp/74 bp.
Figure 2. A snapshot of SNPs in four chickpea genotypes compared to the reference genome.
The Venn diagram shows the distribution of SNPs detected among four genotypes (Pistol, Hat Trick, Slasher and Genesis 90). The genotype CDC Frontier was used as the reference sequence. For instance, a total of 95,329 SNPs were found to be concordant between the Pistol and Hat Trick genotypes. Similarly, 62,291 SNPs were common to all four genotypes.
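The counts behind a Venn diagram of this kind reduce to set intersections. The sketch below illustrates the idea on a tiny invented example: each genotype is represented by the set of SNPs it carries relative to the reference, and the shared counts are the sizes of pairwise and all-way intersections. The SNP tuples, the chromosome labels such as `Ca1`, and the function name `shared_snps` are hypothetical and are not taken from the ISMU output.

```python
from itertools import combinations


def shared_snps(calls):
    """Count SNPs shared by each genotype pair and by all genotypes.

    calls maps a genotype name to a set of (chromosome, position, allele)
    tuples describing its SNPs against the reference.
    """
    pairwise = {
        (a, b): len(calls[a] & calls[b])
        for a, b in combinations(sorted(calls), 2)
    }
    common_to_all = set.intersection(*calls.values()) if calls else set()
    return pairwise, len(common_to_all)


# Hypothetical toy SNP sets; real sets would hold hundreds of thousands of entries.
toy_calls = {
    "Pistol": {("Ca1", 101, "A"), ("Ca1", 250, "T"), ("Ca2", 77, "G")},
    "Hat Trick": {("Ca1", 101, "A"), ("Ca2", 77, "G")},
    "Slasher": {("Ca1", 101, "A"), ("Ca1", 250, "T"), ("Ca3", 5, "C")},
    "Genesis 90": {("Ca1", 101, "A"), ("Ca3", 5, "C")},
}

if __name__ == "__main__":
    pairwise, common = shared_snps(toy_calls)
    for pair, count in pairwise.items():
        print(pair, count)
    print("common to all:", common)
```

Keying each SNP by position and allele, rather than by position alone, means two genotypes only count as concordant when they carry the same variant at the same site.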
Pairwise SNP distribution between genotypes identified in RAD dataset.
| | ICCV 03107 | ICC 4918 | ICC 4930 | ICC 4958 | ICC 5270 | ICCV 05530 | ICC 5810 | ICC 5912 | ICC 6263 | ICC 8261 |
| | 5068 | 6250 | 9206 | 8502 | 7418 | 7461 | 8347 | 4985 | 6372 | 5664 |
| | 442 | 470 | 599 | 501 | 499 | 528 | 455 | 667 | 471 | |
| | 700 | 648 | 624 | 723 | 606 | 704 | 763 | 623 | | |
| | 1151 | 998 | 828 | 993 | 637 | 977 | 828 | | | |
| | 829 | 1016 | 852 | 892 | 752 | 617 | | | | |
| | 945 | 886 | 791 | 783 | 613 | | | | | |
| | 972 | 778 | 924 | 774 | | | | | | |
| | 793 | 958 | 761 | | | | | | | |
| | 910 | 743 | | | | | | | | |
| | 581 | | | | | | | | | |
The genotypes show more variation against the reference than against each other in pairwise comparisons, indicating missing SNPs (a characteristic of RADseq) that could be imputed. Overall, the number of SNPs between pairs of genotypes ranged from 442 to 1151.
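One way to see why missing calls depress pairwise counts is to count SNPs only at sites where both samples have a call. The sketch below does this on invented data; it is an illustration of the effect described in the note, not the method ISMU uses, and the site names, alleles and function names are hypothetical.

```python
from itertools import combinations

MISSING = None  # marker for a site with no call in a given sample


def count_differences(a, b):
    """Count sites where both samples are called and their alleles differ."""
    return sum(
        1
        for site in a.keys() & b.keys()
        if a[site] is not MISSING and b[site] is not MISSING and a[site] != b[site]
    )


def pairwise_snp_counts(calls):
    """Map every unordered pair of sample names to its SNP count."""
    return {
        (x, y): count_differences(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }


# Hypothetical toy calls; the reference has a base at every site,
# while the two genotypes each lack a call at one site.
toy = {
    "reference": {"s1": "A", "s2": "C", "s3": "G", "s4": "T"},
    "G1": {"s1": "G", "s2": "C", "s3": MISSING, "s4": "A"},
    "G2": {"s1": "G", "s2": "T", "s3": "A", "s4": MISSING},
}

if __name__ == "__main__":
    for pair, count in pairwise_snp_counts(toy).items():
        print(pair, count)
```

In this toy case each genotype differs from the reference at two or three sites, but only one site is comparable and different between G1 and G2, because a site missing in either sample drops out of the pairwise comparison.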
Run time profile of the ISMU pipeline with three datasets (WGRS, RAD and RNAseq).
| Datasets | WGRS | RAD | RNAseq |
| Method (aligner-SNPcaller) | BWA-samtools | Bowtie-samtools | SOAP2-CbCC |
| Total number of cores | 18 | 18 | 18 |
| Total number of genotypes | 4 | 10 | 2 |
| Input file size (Gigabytes) | 105 | 19.4 | 6.2 |
| Total time (hours) | 26.25 | 4.5 | 2 |
| Disk space (Gigabytes) | 250 | 57 | 17.5 |
| Peak memory (Gigabytes) | 45 | 48 | 3.6 |
Three datasets (WGRS, RAD and RNAseq) were analysed independently with 18 processors on a Linux machine with 48 GB RAM. The disk space used for analysis, the peak memory used and the total run time were recorded. Analysis of the RNAseq dataset was quicker than that of the RAD and WGRS datasets owing to both its smaller input size and its smaller reference sequences (pseudo-molecules/contigs). The disk space requirements were roughly proportional to input size.
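The proportionality between disk space and input size can be checked directly from the run time table. The sketch below derives input throughput (gigabytes per hour) and the disk-to-input ratio for each dataset; the numbers are copied from the table, and the function name `summarize` is hypothetical.

```python
# Input size, total time and disk space copied from the run time table.
runs = {
    "WGRS": {"input_gb": 105.0, "hours": 26.25, "disk_gb": 250.0},
    "RAD": {"input_gb": 19.4, "hours": 4.5, "disk_gb": 57.0},
    "RNAseq": {"input_gb": 6.2, "hours": 2.0, "disk_gb": 17.5},
}


def summarize(run):
    """Derive throughput and disk overhead for one pipeline run."""
    return {
        "gb_per_hour": run["input_gb"] / run["hours"],
        "disk_to_input": run["disk_gb"] / run["input_gb"],
    }


if __name__ == "__main__":
    for name, run in runs.items():
        s = summarize(run)
        print(f"{name}: {s['gb_per_hour']:.2f} GB/h, "
              f"disk/input = {s['disk_to_input']:.2f}")
```

For all three runs the disk footprint falls between two and three times the input size, which is a useful rule of thumb when provisioning storage for a similar analysis.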
Comparison of key features of the ISMU pipeline with similar pipelines.
| Features | ISMU | SIMPLEX | ngs_backbone | GATK | inGAP | SeqGene | GAMES | TREAT | Atlas2 |
| Free of charge | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| SE/PE data handling | Y/Y | Y/Y | Y/N | Y/Y | Y/Y | n.m | Y/Y | Y/Y | Y/Y |
| NS/CS data handling | Y/N | Y/Y | Y/Y | Y/Y | Y/N | n.m | Y/Y | Y/N | Y/Y |
| Alignment | Y | Y | Y | N | Y | Y | N | Y | N |
| No. of alignment tools | 5 | 1 | 1 | N | 2 | n.m | N | 2 | N |
| Variant annotation | Y | Y | N | Y | N | Y | Y | Y | N |
| Highly customizable | Y | Y | Y | Y | N | Y | Y | Y | N |
| Homo-/heterozygosity | Y/Y | Y/Y | N/N | Y/Y | N/N | Y/Y | N/N | Y/Y | Y/Y |
| Quality reports | Y | Y | Y | Y | N | Y | N | Y | N |
| Graphical user interface | Y | N | N | N | Y | N | N | N | Y |
| Standalone | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| HPC support | Y | Y | Y | Y | N | N | N | Y | N |
| Multi user support | Y | Y | N | N | N | N | N | N | N |
| Cloud support | N | Y | N | N | N | N | N | Y | Y |
ISMU is one of the few tools that provide an easy-to-use graphical user interface (GUI) together with a wide choice of open source tools (alignment and variant calling) for handling NGS data. The information describing features of the other pipelines is derived from Fisher et al. [60]. The symbols “Y” and “N” represent presence and absence of the feature in the pipeline, respectively. Numbers (1, 2, 5) indicate the number of tools included in the pipeline. “n.m” means the feature was not mentioned.