Brittney N Keel, Warren M Snelling.
Abstract
Ongoing developments and cost decreases in next-generation sequencing (NGS) technologies have led to an increase in their application, which has greatly enhanced the fields of genetics and genomics. Mapping sequence reads onto a reference genome is a fundamental step in the analysis of NGS data. Efficient, accurate alignment of the reads onto the reference genome is essential because it determines the overall quality of downstream analyses. In this study, we evaluate the performance of three Burrows-Wheeler transform-based mappers, BWA, Bowtie2, and HISAT2, in the context of paired-end Illumina whole-genome sequencing of livestock, using simulated sequence data sets with varying sequence read lengths, insert sizes, and levels of genomic coverage, as well as five real data sets. The mappers were evaluated based on two criteria: computational resource/time requirements and robustness of mapping. Our results show that BWA and Bowtie2 tend to be more robust than HISAT2, while HISAT2 was significantly faster and used less memory than both BWA and Bowtie2. We conclude that no single mapper is ideal in all scenarios; rather, the choice of alignment tool should be driven by the application and sequencing technology.
Keywords: genomic coverage; livestock; mapper comparison; mapping algorithm; whole-genome sequencing
Year: 2018 PMID: 29535759 PMCID: PMC5834436 DOI: 10.3389/fgene.2018.00035
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Parameters for simulated data sets used in this study.
| Data set | Insert size (bp) | Read length (bp) | Genomic coverage |
| --- | --- | --- | --- |
| H350_100 | 350 | 100 | High (10x−25x) |
| H350_150 | 350 | 150 | High (10x−25x) |
| H550_100 | 550 | 100 | High (10x−25x) |
| H550_150 | 550 | 150 | High (10x−25x) |
| M350_100 | 350 | 100 | Medium (5x−10x) |
| M350_150 | 350 | 150 | Medium (5x−10x) |
| M550_100 | 550 | 100 | Medium (5x−10x) |
| M550_150 | 550 | 150 | Medium (5x−10x) |
| L350_100 | 350 | 100 | Low (1x−5x) |
| L350_150 | 350 | 150 | Low (1x−5x) |
| L550_100 | 550 | 100 | Low (1x−5x) |
| L550_150 | 550 | 150 | Low (1x−5x) |
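The coverage levels in the table above depend on read count, read length, and genome size. As a rough illustration (not the authors' simulation code), the sketch below uses the standard relation coverage = (number of reads × read length) / genome size to estimate how many read pairs a target mean coverage requires. The function name `read_pairs_for_coverage` and the 2.5 Gb genome size are hypothetical choices made for this example only.

```python
import math


def read_pairs_for_coverage(target_coverage, read_length, genome_size):
    """Estimate the paired-end read pairs needed for a target mean coverage.

    Rearranges coverage = (2 * pairs * read_length) / genome_size,
    since each pair contributes two reads of read_length bases.
    """
    if target_coverage <= 0 or read_length <= 0 or genome_size <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_coverage * genome_size / (2 * read_length))


# Hypothetical genome size (an assumption, not a value taken from the study).
GENOME_SIZE = 2_500_000_000

if __name__ == "__main__":
    for coverage in (1, 5, 10, 25):
        for read_length in (100, 150):
            pairs = read_pairs_for_coverage(coverage, read_length, GENOME_SIZE)
            print(f"{coverage}x, {read_length} bp reads: {pairs:,} pairs")
```

Note that insert size does not enter this calculation: it affects how the two reads of a pair are spaced, not how many bases are sequenced.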
Real data sets used in this study.
| Data set | Species | DNA extraction | Library prep | Read length (bp) | Insert size (bp) | Sequencer |
| --- | --- | --- | --- | --- | --- | --- |
| R1 | Swine | PC, SE | TS | 100 | 350 | HiSeq2500 |
| R2 | Swine | PC, SE | TS | 150 | 350 | NextSeq500 |
| R3 | Swine | WIZ | TS-PCRF | 150 | 550 | NextSeq500 |
| R4 | Cattle | PC, QIA | AGI | 100 | 350 | HiSeq2500 |
| R5 | Cattle | PC, QIA | AGI | 150 | 350 | HiSeq2500 |
PC, phenol-chloroform extraction; SE, salt extraction; WIZ, Wizard SV96 Genomic Purification Kit; QIA, QIAamp DNA Mini kit.
TS, TruSeq DNA sample prep kit; TS-PCRF, TruSeq DNA PCR-Free sample prep kit; AGI, Agilent SureSelect Target Enrichment System Kit I or Kit II.
Figure 1. Input data size vs. execution time for the five real data sets used in this study.
Figure 2. Input data size vs. execution time for the five real data sets used in this study.
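Area under the precision-recall curve (PR AUC) for each mapper in each of the 350 bp insert simulated data sets.
| Mapper and data set (100 bp reads) | PR AUC | Mapper and data set (150 bp reads) | PR AUC |
| --- | --- | --- | --- |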
| BWA_H350_100 | 0.9940 (0.0109) | BWA_H350_150 | 0.9937 (0.0112) |
| Bowtie2_H350_100 | 0.9883 (0.0152) | Bowtie2_H350_150 | 0.9951 (0.0099) |
| Hisat2_H350_100 | 0.9666 (0.0254) | Hisat2_H350_150 | 0.9664 (0.0255) |
| BWA_M350_100 | 0.9940 (0.0109) | BWA_M350_150 | 0.9936 (0.0112) |
| Bowtie2_M350_100 | 0.9883 (0.0152) | Bowtie2_M350_150 | 0.9950 (0.0099) |
| Hisat2_M350_100 | 0.9666 (0.0254) | Hisat2_M350_150 | 0.9664 (0.0255) |
| BWA_L350_100 | 0.9883 (0.0152) | BWA_L350_150 | 0.9936 (0.0113) |
| Bowtie2_L350_100 | 0.9883 (0.0152) | Bowtie2_L350_150 | 0.9950 (0.0100) |
| Hisat2_L350_100 | 0.9667 (0.0254) | Hisat2_L350_150 | 0.9664 (0.0255) |
Standard error is given in parentheses.
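For readers unfamiliar with the metric, the following is a minimal sketch of how an area under a precision-recall curve can be computed for a read mapper. It is not the authors' procedure: it assumes the curve is traced by sweeping a mapping-quality (MAPQ) threshold, that the truth of each alignment is known from simulation, and that the curve is extended to recall 0 at the precision of the strictest threshold. The function names and the toy records are hypothetical.

```python
def pr_points(records, total_reads):
    """Return (recall, precision) points, one per MAPQ threshold.

    records: (mapq, is_correct) tuples for the mapped reads.
    total_reads: number of simulated reads, mapped or not.
    Thresholds are swept from strictest to most permissive, so recall
    is non-decreasing along the returned list.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    thresholds = sorted({mapq for mapq, _ in records}, reverse=True)
    points = []
    for threshold in thresholds:
        kept = [correct for mapq, correct in records if mapq >= threshold]
        n_correct = sum(kept)
        points.append((n_correct / total_reads, n_correct / len(kept)))
    return points


def pr_auc(points):
    """Integrate precision over recall with the trapezoidal rule."""
    if not points:
        return 0.0
    # Assumed convention: start the curve at recall 0 with the precision
    # observed at the strictest threshold.
    curve = [(0.0, points[0][1])] + list(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(curve, curve[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


if __name__ == "__main__":
    # Toy data: five simulated reads, all mapped, three placed correctly.
    toy = [(60, True), (60, True), (30, True), (30, False), (0, False)]
    print(pr_auc(pr_points(toy, total_reads=5)))
```

Other conventions for the start of the curve or for interpolation between points would give slightly different areas, so values are only comparable when computed the same way.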
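Area under the precision-recall curve (PR AUC) for each mapper in each of the 550 bp insert simulated data sets.
| Mapper and data set (100 bp reads) | PR AUC | Mapper and data set (150 bp reads) | PR AUC |
| --- | --- | --- | --- |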
| BWA_H550_100 | 0.9943 (0.0106) | BWA_H550_150 | 0.9938 (0.0111) |
| Bowtie2_H550_100 | 0.9875 (0.0157) | Bowtie2_H550_150 | 0.9947 (0.0103) |
| Hisat2_H550_100 | 0.9672 (0.0252) | Hisat2_H550_150 | 0.9667 (0.0254) |
| BWA_M550_100 | 0.9943 (0.0106) | BWA_M550_150 | 0.9939 (0.0110) |
| Bowtie2_M550_100 | 0.9874 (0.0157) | Bowtie2_M550_150 | 0.9947 (0.0103) |
| Hisat2_M550_100 | 0.9673 (0.0252) | Hisat2_M550_150 | 0.9668 (0.0253) |
| BWA_L550_100 | 0.9944 (0.0106) | BWA_L550_150 | 0.9939 (0.0111) |
| Bowtie2_L550_100 | 0.9875 (0.0157) | Bowtie2_L550_150 | 0.9947 (0.0103) |
| Hisat2_L550_100 | 0.9673 (0.0252) | Hisat2_L550_150 | 0.9666 (0.0254) |
Standard error is given in parentheses.
Figure 3. Heatmap of the average percentage of properly paired reads for the 12 simulated data sets.
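As an illustration of the quantity plotted in this figure, the sketch below computes the percentage of properly paired reads from SAM records using the FLAG field, where bit 0x2 marks a read mapped in a proper pair, 0x100 a secondary alignment, and 0x800 a supplementary alignment. Counting only primary records is an assumption made here, and the function name and example records are hypothetical; the study's own tooling may have counted differently.

```python
PROPER_PAIR = 0x2       # each segment properly aligned according to the aligner
SECONDARY = 0x100       # secondary alignment
SUPPLEMENTARY = 0x800   # supplementary alignment


def percent_properly_paired(sam_lines):
    """Percentage of primary SAM records flagged as properly paired."""
    total = 0
    proper = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if flag & PROPER_PAIR:
            proper += 1
    return 100.0 * proper / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical records: two properly paired reads (99, 147), one read
    # whose mate is unmapped (73), one unmapped read (133), and one
    # supplementary record (2147) that is ignored.
    flags = [99, 147, 73, 133, 2147]
    lines = ["@HD\tVN:1.6"] + [
        f"read{i}\t{flag}\tchr1\t100\t60\t100M\t=\t300\t350\t*\t*"
        for i, flag in enumerate(flags)
    ]
    print(percent_properly_paired(lines))
```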
Scoring of aligners for various sequencing parameters based on criteria evaluated in this study; + indicates low score, ++ indicates medium score, and +++ indicates high score.
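| Ins. (bp) | 350 | 350 | 550 | 550 | 350 | 350 | 550 | 550 | 350 | 350 | 550 | 550 | 350 | 350 | 550 | 550 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RL (bp) | 100 | 150 | 100 | 150 | 100 | 150 | 100 | 150 | 100 | 150 | 100 | 150 | 100 | 150 | 100 | 150 |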
| BWA | + | + | + | + | + | + | + | + | +++ | +++ | +++ | +++ | +++ | +++ | +++ | +++ |
| Bowtie2 | ++ | ++ | ++ | ++ | +++ | +++ | ++ | ++ | +++ | +++ | +++ | +++ | ++ | ++ | +++ | +++ |
| HISAT2 | +++ | +++ | +++ | +++ | +++ | +++ | +++ | +++ | ++ | ++ | ++ | ++ | + | + | + | + |