Weixin Wang, Zhi Wei, Tak-Wah Lam, Junwen Wang.
Abstract
The rapid development of next-generation sequencing (NGS) technology provides a new opportunity to extend the scale and resolution of genomic research. Efficiently mapping millions of short reads to a reference genome and making accurate SNP calls are two major challenges in taking full advantage of NGS. In this article, we review the current software tools for mapping and SNP calling and evaluate their performance on samples from The Cancer Genome Atlas (TCGA) project. We found that BWA and Bowtie outperform the other alignment tools in overall performance on the Illumina platform, while NovoalignCS showed the best overall performance for SOLiD. Furthermore, we show that next-generation sequencing platforms have significantly lower coverage and poorer SNP-calling performance in the CpG islands, promoters and 5'-UTR regions of the genome. NGS experiments targeting these regions should use a higher sequencing depth than is used for typical genomic regions.
Year: 2011 PMID: 22355574 PMCID: PMC3216542 DOI: 10.1038/srep00055
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of the representative software tools
| Program | Version | Algorithm | Color space supported | Read length supported (bp) | Gapped alignment | Paired-end supported | Can output all (suboptimal) hits | Output format |
|---|---|---|---|---|---|---|---|---|
| Bowtie | 0.12.7 | FM-index | Yes | ≤ 1024 | No | Yes | Yes | SAM |
| BWA | 0.5.8c | FM-index | Yes | Arbitrary | Yes | Yes | Yes | SAM |
| SOAP2 | 2.2 | FM-index | No | ≤ 1024 | No | Yes | Yes | SOAP2 |
| RMAP | 2.0.5 | Hash reads | No | Arbitrary | No | Yes | Yes | BED |
| ZOOM | 1.5.0 | Hash reads | Yes | ≤ 240 | Yes | Yes | Yes | ZOOM |
| Maq | 0.7.1 | Hash reads | Yes | ≤ 127 | Yes | Yes | No | Maq |
| Novoalign | 2.07.00 | Hash ref. | Yes (a) | Arbitrary | Yes | Yes | Yes | SAM |
| SHRiMP | 2.1.0 | Hash ref. | Yes | Arbitrary | Yes | Yes | Yes | SAM |

a. NovoalignCS supports the SOLiD platform.
Performance assessment of eight NGS mapping tools on Illumina paired-end sequencing data of SRR018643
| Program | Category | Version | Index time (h:m:s) | Index peak memory (GB) | Alignment time (h:m:s) | Alignment peak memory (GB) | Reads aligned (%) |
|---|---|---|---|---|---|---|---|
| Bowtie | BWT | 0.12.7 | 3:43:36 | 5.5 | 2:22:36 | 2.9 | 67.55 |
| BWA | BWT | 0.5.8c | 1:46:42 | 1.5 | 8:24:12 | 5.0 | 72.99 |
| SOAP2 | BWT | 2.20 | 1:45:54 | 2.3 | 10:22:26 | 6.8 | 60.93 |
| RMAP | Hash reads | 2.0.5 | N/A | N/A | 10:15:18 | 10.0 | 55.98 |
| ZOOM | Hash reads | 1.5.0 | N/A | N/A | 7:01:53 | 10.2 | 62.86 |
| Maq | Hash reads | 0.7.1 | 0:01:56 | 0.34 | 39:10:43 | 8.1 | 71.94 |
| Novoalign | S-W | 2.07.06 | 0:06:28 | 13.5 | 144:25:35 | 13.1 | 77.65 |
| SHRiMP | S-W | 2.1.0 | 4:08:13 | 12.0 | 1065:10:05 | 12.0 | 81.23 |

a. Run in the default -n mode, allowing no more than 2 mismatches in the first 28 bases (the seed region) and requiring that the sum of Phred quality values at all mismatched positions (not just in the seed) not exceed 70; --chunkmbs 2000 to dedicate more memory to the descriptor; -I 0 -X 1000 to filter by insert size; -S to print in SAM format; and -p 1 to use 1 thread. Other parameters are default.

b. For the aln function, -k 2 and -l 28 allow an edit distance of at most 2 in the 28-base seed region, and -t 1 uses 1 thread. For the sampe function, -a 1000 sets the maximum insert size. Other parameters are default.

c. With -m 0 and -x 1000 to filter by insert size, -l 28 to set a 28-base seed region, -M 4 to report the best hits, which have at most 2 mismatches in the seed region, and -p 1 to use 1 thread. Other parameters are default.

d. Run with the rmappe function, with -m 5 to allow no more than 5 mismatches in the whole read, and -min-sep 0 and -max-sep 1000 to restrict the insert size.

e. These aligners do not rely on an index; they create a hash table in memory on every run.

f. With -pemin 0 -pemax 1000 to restrict the insert size and -mm 5 to allow at most 5 mismatches in the whole read.

g. For the map step, with -a 1000 and -A 1000.

h. Set more precisely with -i 586 101, which gives the mean and the standard deviation of the insert size (bp).

i. Due to memory limitations, we had to split the genome into 5 chunks. We prepared the seeds for each chunk as an index and ran the alignment procedure sequentially. Settings: -N 1 -p opp-in -I 0,1000 -m 20 -i -25 -g -40 -e -10 -E (-N 1 denotes 1 thread).
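
The alignment times in the table above span very different scales, which is easier to see once the h:m:s strings are converted to seconds and expressed relative to one tool. The following is a minimal sketch of that arithmetic; the times are copied from the table, while the function names and the choice of Bowtie as the baseline are illustrative assumptions rather than anything taken from the article.

```python
# Alignment times (h:m:s) copied from the performance table above.
ALIGNMENT_TIMES = {
    "Bowtie": "2:22:36",
    "BWA": "8:24:12",
    "SOAP2": "10:22:26",
    "RMAP": "10:15:18",
    "ZOOM": "7:01:53",
    "Maq": "39:10:43",
    "Novoalign": "144:25:35",
    "SHRiMP": "1065:10:05",
}


def hms_to_seconds(hms):
    """Convert an 'h:m:s' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def relative_times(times, baseline):
    """Express every alignment time as a multiple of the baseline tool's time."""
    base = hms_to_seconds(times[baseline])
    return {tool: hms_to_seconds(t) / base for tool, t in times.items()}


if __name__ == "__main__":
    # Baseline choice (Bowtie) is an assumption made for illustration only.
    ratios = relative_times(ALIGNMENT_TIMES, "Bowtie")
    for tool, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{tool:10s} {ratio:8.1f}x")
```

Speed is only one axis of the comparison: the slowest tools in the table also report the highest percentage of aligned reads, so the ratios should be read alongside the last column.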
Figure 1. The relationship between sequence fold and genomic coverage.
The length of each colour bar represents the percentage of bases in the whole genome with the corresponding depth, for the corresponding volume of sequenced bases.
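
The quantities behind Figure 1 can be illustrated with a short sketch: the sequencing fold is the total number of sequenced bases divided by the genome length, and each bar segment corresponds to the percentage of genome positions reaching a given depth. The code below is an assumption-laden illustration, not the authors' pipeline; the function names, the depth thresholds and the toy depth vector are all hypothetical.

```python
def sequencing_fold(total_sequenced_bases, genome_length):
    """Average fold coverage: sequenced bases divided by genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_sequenced_bases / genome_length


def percent_at_or_above(depths, thresholds):
    """Percentage of positions whose depth is >= each threshold.

    depths: per-base read depths, one entry per genome position.
    thresholds: depth cut-offs to report (hypothetical values, chosen by the caller).
    """
    total = len(depths)
    if total == 0:
        raise ValueError("depths must not be empty")
    return {
        t: 100.0 * sum(1 for d in depths if d >= t) / total
        for t in thresholds
    }


if __name__ == "__main__":
    # Toy per-base depths for a 10-base "genome" (made-up values).
    toy_depths = [0, 0, 1, 3, 5, 10, 12, 2, 0, 7]
    print(sequencing_fold(sum(toy_depths), len(toy_depths)))
    print(percent_at_or_above(toy_depths, (1, 5, 10)))
```

In practice the per-base depths would come from the alignment output rather than a hand-written list.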
Figure 2. Coverage comparisons for different genetic regions at ten-fold coverage.
P-values from a bootstrap t-test (all less than 2.2e-16) show significantly poorer coverage of the CpG-island region compared with the genomic background or the gene region. The promoter and 5′UTR regions are also significantly under-covered. (One star: p-value < 0.05; two stars: p-value < 0.01; three stars: p-value < 0.001.)
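
The caption does not spell out how the bootstrap was combined with the t-test, so the following is only one plausible reading, sketched with the standard library: resample per-region coverage values with replacement and look at the bootstrap distribution of the difference in means. The function name, the number of resamples, the percentile interval and the toy coverage values are all assumptions made for illustration, not the procedure or data used in the article.

```python
import random
from statistics import mean


def bootstrap_mean_difference(group_a, group_b, n_resamples=2000, seed=0):
    """Bootstrap the difference in means (group_a minus group_b).

    Returns the observed difference and a 95% percentile interval
    computed from n_resamples resamples drawn with replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return mean(group_a) - mean(group_b), (lower, upper)


if __name__ == "__main__":
    # Made-up per-region mean depths, for illustration only.
    cpg_island = [3, 4, 2, 5, 3, 4, 2, 3]
    background = [9, 10, 8, 11, 10, 9, 12, 10]
    observed, interval = bootstrap_mean_difference(cpg_island, background)
    print(observed, interval)
```

An interval that lies entirely below zero would indicate lower coverage in the first group, which is the direction of the effect the caption reports for CpG islands.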
Figure 3. The relationship between the number of probes covered and genomic sequence fold (583,891 SNP probes in total).
Figure 4. Comparison of SNP calling qualities (AUCs) of three software tools at different depths.
Figure 5. AUC (area under the curve) comparison for different genetic regions.
The CpG-island region has significantly poorer performance than the genomic background (p-value = 0.000972) or the gene region (p-value = 0.0003607). The promoter (p-value = 0.00873) and 5′UTR (p-value = 0.00946) regions show a similar pattern. The gene region also shows slightly lower performance (p-value = 0.0004641).
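
For readers unfamiliar with the AUC used in Figures 4 and 5, the sketch below shows one common way to compute it: rank calls by a quality score, trace the ROC curve, and integrate with the trapezoid rule. It assumes each SNP call carries a numeric score and a true/false label (for example, agreement with an array genotype); the function name and the toy inputs are hypothetical, and the article may have computed AUC differently.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the trapezoid rule.

    scores: call quality scores, higher meaning more confident.
    labels: truthy for a correct call, falsy for an incorrect one.
    Calls with tied scores are added to the curve as a single step.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    auc = 0.0
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume every call tied at this score before adding a ROC point.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr = tp / positives
        fpr = fp / negatives
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return auc


if __name__ == "__main__":
    # Toy scores and labels (made up): one incorrect call outranks a correct one.
    print(roc_auc([0.9, 0.7, 0.6, 0.3], [1, 0, 1, 0]))
```

An AUC of 1.0 means every correct call is ranked above every incorrect one, and 0.5 corresponds to a ranking no better than chance.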