Swetansu Pattnaik, Srividya Vaidyanathan, Durgad G Pooja, Sa Deepak, Binay Panda.
Abstract
The advent of next generation sequencing (NGS) technologies has revolutionised the way biologists produce, analyse and interpret data. Although NGS platforms provide a cost-effective way to discover genome-wide variants from a single experiment, variants discovered by NGS need follow-up validation because of the high error rates associated with various sequencing chemistries. Recently, whole exome sequencing has been proposed as an affordable alternative to whole genome runs, but it still requires follow-up validation of all novel exomic variants. Customarily, a consensus approach is used to overcome the systematic errors inherent to the sequencing technology and to the alignment and post-alignment variant detection algorithms. However, this approach requires multiple sequencing chemistries, multiple alignment tools and multiple variant callers, which may not be viable in terms of time and money for individual investigators with limited informatics know-how. Biologists often lack the requisite training to deal with the huge amount of data produced by NGS runs and have difficulty choosing among the freely available tools for NGS data analysis. Hence, there is a need to customise the NGS data analysis pipeline to preferentially retain true variants by minimising the incidence of false positives, and to make the choice of the right analytical tools easier. To this end, we have sampled different freely available tools used at the alignment and post-alignment stages and suggest the most suitable combination, determined by a simple framework of pre-existing metrics, to create significant datasets.
Year: 2012 PMID: 22238694 PMCID: PMC3253117 DOI: 10.1371/journal.pone.0030080
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Steps involved in generating a highly significant SNP dataset.
The different NGS aligners and variant callers sampled in our study.
| Sequencing platform | Aligners | SNP callers | Datasets |
| --- | --- | --- | --- |
| Illumina GAIIx, paired-end short-insert library, read length 76 bp | Bowtie, SMALT, Stampy, SSAHA, Novoalign, BWA, BFAST | Samtools, GATK, Freebayes, Bambino | SureSelect-enriched exome data: 02B, 12L, 20T |
Figure 2. The real time elapsed in calculating alignment maps.
Figure 3. Base quality plots of sample 02B.
(A) Depicting the effect of seven aligners. (B) Depicting the effect of four variant callers.
Figure 4. The variant rediscovery percentages determined using whole genome SNP array.
(A) All exonic variants. (B) dbSNP positive variants. The y-axis represents the percent rediscovery rate in relation to the aligner that performed the best (taken as 100%).
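The normalisation described in the Figure 4 legend can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the representation of variants as `(chromosome, position)` tuples and the example rates are all assumptions made for the sketch.

```python
def rediscovery_rate(called_sites, array_sites):
    """Fraction of SNP-array sites that were also called from the NGS data.

    Both arguments are sets of (chromosome, position) tuples; this
    representation is an assumption made for the sketch.
    """
    if not array_sites:
        raise ValueError("array_sites must not be empty")
    return len(called_sites & array_sites) / len(array_sites)


def relative_to_best(rates):
    """Rescale per-aligner rates so that the best aligner equals 100%."""
    best = max(rates.values())
    if best <= 0:
        raise ValueError("at least one rate must be positive")
    return {aligner: 100.0 * rate / best for aligner, rate in rates.items()}


# Hypothetical raw rates, for illustration only (not values from the study).
raw_rates = {"aligner_a": 0.50, "aligner_b": 0.25}
print(relative_to_best(raw_rates))
```

Scaling to the best performer makes the aligners directly comparable within one panel, at the cost of hiding the absolute rediscovery rate of the best aligner itself.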
The Ti/Tv ratios of 28 different aligner-caller combinations for samples 02B, 12L and 20T.
| Ti/Tv for Exonic SNPs | BWA | BFAST | BOWTIE | STAMPY | NovoMPI | SMALT | SSAHA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | | | | | | | |
| | 3.78 | 3.59 | 4.12 | 3.28 | 3.22 | 3.49 | 3.53 |
| | 2.73 | 2.73 | 2.86 | 2.77 | 2.77 | 2.79 | 2.75 |
| | 0.32 | 0.402 | 0.62 | 2.37 | 2.70 | 0.29 | 0.30 |
| | 2.56 | 2.55 | 2.89 | 2.8 | 2.87 | 2.62 | 2.59 |
| | | | | | | | |
| | 3.52 | 3.46 | 4.08 | 3.29 | 3.24 | 3.41 | 3.46 |
| | 2.69 | 2.72 | 2.90 | 2.76 | 2.75 | 2.78 | 2.74 |
| | 0.25 | 0.34 | 0.525 | 2.26 | 2.63 | 0.22 | 0.23 |
| | 2.27 | 2.43 | 2.86 | 2.71 | 2.85 | 2.52 | 2.47 |
| | | | | | | | |
| | 3.80 | 3.45 | 3.99 | 3.30 | 3.30 | 3.38 | 3.47 |
| | 2.74 | 2.73 | 2.85 | 2.75 | 2.77 | 2.78 | 2.75 |
| | 0.32 | 0.402 | 0.62 | 2.37 | 2.70 | 0.29 | 0.30 |
| | 2.24 | 2.33 | 2.72 | 2.70 | 2.82 | 2.31 | 2.26 |
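Each Ti/Tv value above is the ratio of transition SNPs (A<->G or C<->T) to transversion SNPs (any other single-base change) in one call set. The sketch below shows how such a ratio could be computed from reference/alternate allele pairs. It is an illustration only: the function name and the input format are assumptions, and the study's own tooling is not described in this record.

```python
# Transitions stay within purines (A<->G) or within pyrimidines (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Entries that are not single-base substitutions (indels, ambiguous
    bases, ref == alt) are skipped.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions found; ratio is undefined")
    return ti / tv


# Hypothetical calls, for illustration only: three transitions, one transversion.
example_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ti_tv_ratio(example_calls))  # 3.0
```

Of the twelve possible single-base substitutions, four are transitions and eight are transversions, so purely random errors would give a ratio near 0.5. A ratio well below the values seen for other call sets on the same data may therefore point to an excess of false positives.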
Figure 5. The alignment statistics of the percentage of reads aligned by different aligners.