Kelly M Robinson1, Aziah S Hawkins1, Ivette Santana-Cruz1, Ricky S Adkins1, Amol C Shetty1, Sushma Nagaraj1, Lisa Sadzewicz1, Luke J Tallon1, David A Rasko1,2, Claire M Fraser1,3, Anup Mahurkar1, Joana C Silva1,2, Julie C Dunning Hotopp2,1.
Abstract
As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows-Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % reads mapping; in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium-human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
Keywords: BWA; Brugia; Plasmodium; Wolbachia; dual-species alignment; genome sequence alignment
Year: 2017 PMID: 29114401 PMCID: PMC5643015 DOI: 10.1099/mgen.0.000122
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
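The approach recommended in the abstract (concatenate all source genomes into one reference, then align with a longer bwa-mem seed) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the file names are hypothetical, and it assumes a standard BWA installation in which `bwa index` builds the index and `bwa mem -k` sets the minimum seed length. The script only builds and prints the command lines; it does not run the aligner.

```python
import shlex


def concat_references(fasta_paths, combined_path):
    """Stream several FASTA files into one combined reference file."""
    with open(combined_path, "w") as out:
        for path in fasta_paths:
            with open(path) as src:
                for line in src:
                    # Guard against a missing final newline so records never fuse.
                    out.write(line if line.endswith("\n") else line + "\n")
    return combined_path


def bwa_commands(reference, reads_1, reads_2, seed_length=23):
    """Return the index and paired-end alignment commands as argument lists."""
    index_cmd = ["bwa", "index", reference]
    mem_cmd = ["bwa", "mem", "-k", str(seed_length), reference, reads_1, reads_2]
    return index_cmd, mem_cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    index_cmd, mem_cmd = bwa_commands("combined.fa", "reads_1.fastq", "reads_2.fastq")
    print(shlex.join(index_cmd))
    print(shlex.join(mem_cmd) + " > combined.sam")
```

Returning argument lists rather than shell strings keeps the sketch safe to pass to `subprocess.run` if one chooses to execute it; the 23 nt default mirrors the seed length suggested in the abstract.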
Total number of reads for each dataset
The datasets used in this study are listed with their total number of reads.
| Dataset | Accession | Total no. of reads |
|---|---|---|
| | SRR5188379 | 31 494 916 |
| | ERR015379 | 6 051 406 |
| | ERR012739 | 8 080 550 |
| | ERR015360 | 25 867 194 |
| 1000 Genomes | ERR022446 | 219 493 146 |
| Simulated metagenome | Dryad acc. (doi:10.5061/dryad.m1m0p) | 5 490 726 |
Fig. 1. The percentage of mapped read pairs for all bwa-mem datasets. The percentage of mapped read pairs for all datasets against all references is shown for each seed length for (a) the human–Plasmodium dataset, (b) the Brugia–Wolbachia dataset and (c) the human-only dataset. The sequencing reads were aligned to the reference genome of each species separately, as well as a combined reference of both genomes, as indicated in the legend. Mappings to the reference genome of the minority member in the sample (human for the Plasmodium–human dataset and Wolbachia for the Brugia–Wolbachia dataset) were plotted on the secondary y-axis on the right, while all others were plotted on the primary y-axis on the left. The sum of the mappings to individual reference genomes is illustrated with a dashed line to enable comparisons.
The percentage of read pairs that aligned to each reference
The percentage of read pairs from a subset of datasets that aligned to each reference is provided for bwa-aln with default parameters and for bwa-mem with seed lengths from 18 to 30 nt.
| Dataset | Reference | bwa-aln | 18 nt | 19 nt | 20 nt | 21 nt | 22 nt | 23 nt | 24 nt | 25 nt | 26 nt | 27 nt | 28 nt | 29 nt | 30 nt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Plasmodium–human | Human | 10.2 | 24.1 | 22.4 | 21.2 | 20.4 | 19.9 | 19.6 | 19.4 | 19.2 | 19.1 | 19.0 | 18.8 | 18.7 | 18.5 |
| | Plasmodium | 78.1 | 83.6 | 83.3 | 83.1 | 82.9 | 82.8 | 82.6 | 82.4 | 82.3 | 83.6 | 83.3 | 83.1 | 82.9 | 82.8 |
| | Combined reference | 88.3 | 97.1 | 96.9 | 96.7 | 96.5 | 96.3 | 96.1 | 95.9 | 95.7 | 95.5 | 95.3 | 95.1 | 94.9 | 94.6 |
| Brugia–Wolbachia | Brugia | 91.9 | 94.4 | 94.0 | 93.8 | 93.8 | 93.7 | 93.7 | 93.6 | 93.6 | 93.6 | 93.6 | 93.5 | 93.5 | 93.5 |
| | Wolbachia | 2.7 | 3.3 | 3.0 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 2.9 | 2.8 | 2.8 | 2.8 | 2.8 |
| | Combined reference | 94.5 | 96.4 | 96.2 | 96.1 | 96.1 | 96.1 | 96.1 | 96.0 | 96.0 | 96.0 | 96.0 | 96.0 | 96.0 | 96.0 |
| 1000 Genomes | Human | 95.9 | 99.7 | 99.6 | 99.6 | 99.6 | 99.6 | 99.5 | 99.5 | 99.4 | 99.4 | 99.4 | 99.3 | 99.2 | 99.2 |
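The over-mapping reported in the abstract can be checked with simple arithmetic on the 18 nt bwa-mem values for the Plasmodium–human data (24.1 % to human, 83.6 % to Plasmodium, 97.1 % to the combined reference). The sketch below only redoes that sum; the variable names are illustrative.

```python
# Percentage of read pairs mapped with bwa-mem at an 18 nt seed length,
# as reported for the Plasmodium-human data.
separate = {"human": 24.1, "Plasmodium": 83.6}
combined = 97.1

# Mapping to each reference separately lets one read pair count twice.
total_separate = round(sum(separate.values()), 1)
excess_over_100 = round(total_separate - 100.0, 1)
excess_over_combined = round(total_separate - combined, 1)

print(f"Sum of separate mappings: {total_separate:.1f} %")
print(f"Excess over 100 %: {excess_over_100:.1f} percentage points")
print(f"Excess over combined reference: {excess_over_combined:.1f} percentage points")
```

A sum above 100 % is only possible if some read pairs align to both references, which is the over-mapping that the combined reference is meant to remove.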
Fig. 2. Percentage of mapped read pairs for different Plasmodium–human datasets. The percentage of mapped read pairs for all datasets against all references is shown for each seed length for three Plasmodium datasets containing varying levels of human sequence. ERR015379 is estimated to contain 88 % reads with best matches to P. falciparum reads and 7 % reads with best matches to human sequences. ERR012739 was estimated to contain 73 % Plasmodium reads and 18 % human reads. ERR015360 was estimated to contain 60 % Plasmodium reads and 3 % human reads. All three datasets showed the same trends as the original dataset examined, namely over-mapping of reads when mapped to separate references that could be largely eliminated by mapping to an aggregate reference and/or increasing the seed length to 23 nt.
Fig. 3. The difference between the observed and expected percentage of mapped reads for simulated reads from a mock metagenomic community. Because genome sizes differ, the percentage of reads mapped varied extensively among organisms; therefore, the difference between the observed percentage of mapped read pairs and the known percentage of mapped read pairs was examined for the simulated data. This value is shown for each seed length for a simulated dataset of 101 bp paired-end sequencing reads from a mock metagenomic community. Because the data are simulated, the known percentage of mapped reads was calculated from the 8× sequencing depth and the relative fraction that each organism's genome contributes to the population. The aggregate percentage difference in mapping was plotted on the secondary y-axis on the right, while all individual percentage differences were plotted on the primary y-axis on the left.
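The quantity plotted in Fig. 3 (observed minus known percentage of mapped reads) can be illustrated with a small sketch. The authors' exact calculation is not given here, so this assumes that every genome is simulated at the same depth, which makes the expected share of reads proportional to genome size. The organism names, genome sizes and observed percentages are all hypothetical.

```python
def expected_percentages(genome_sizes):
    """Expected percentage of reads per genome, ASSUMING equal depth for all genomes."""
    total = sum(genome_sizes.values())
    return {name: 100.0 * size / total for name, size in genome_sizes.items()}


def mapping_differences(observed, expected):
    """Observed minus expected percentage of mapped reads, per organism."""
    return {name: observed[name] - expected[name] for name in expected}


# Hypothetical genome sizes (bp) and hypothetical observed mapping percentages.
genome_sizes = {"organism_A": 6_000_000, "organism_B": 3_000_000, "organism_C": 1_000_000}
observed = {"organism_A": 60.5, "organism_B": 30.0, "organism_C": 9.8}

expected = expected_percentages(genome_sizes)
differences = mapping_differences(observed, expected)

for name, diff in differences.items():
    print(f"{name}: expected {expected[name]:.1f} %, observed {observed[name]:.1f} %, "
          f"difference {diff:+.1f}")
```

A positive difference would indicate more reads mapping to that genome than were simulated from it, and a negative one fewer.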
Fig. 4. CPU time for all bwa-mem datasets. The mean and SD of CPU time in hours across three replicate alignments is plotted against each seed length for a subset of datasets. In the two dual-species datasets, the sequencing reads were aligned to the reference genome of each species separately, as well as a combined reference of both genomes. The 1000 Genomes dataset was only aligned to the human reference genome.
The CPU time in hours for each replicate for a subset of datasets and references
The CPU time in hours is provided for a subset of datasets for bwa-aln with default parameters and bwa-mem with 18–30 nt seed lengths. For bwa-aln, the CPU time reported is the sum of the CPU time for aligning read 1, the CPU time for aligning read 2 and the CPU time for running the sampe algorithm on those two alignments.
| Reference | bwa-aln | 18 nt | 19 nt | 20 nt | 21 nt | 22 nt | 23 nt | 24 nt | 25 nt | 26 nt | 27 nt | 28 nt | 29 nt | 30 nt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Plasmodium | 0.3±0.1 | 0.2±0 | 0.2±0 | 0.2±0 | 0.2±0 | 0.2±0 | 0.2±0 | 0.2±0 | 0.2±0 | 0.2±0 | 0.2±0 | 0.1±0 | 0.1±0 | 0.1±0 |
| Human | 0.9±0.1 | 1.7±0.1 | 1.1±0.1 | 0.8±0 | 0.7±0 | 0.6±0 | 0.5±0 | 0.5±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.3±0 | 0.3±0 |
| Human–Plasmodium | 1±0.1 | 0.7±0 | 0.6±0 | 0.6±0 | 0.5±0 | 0.5±0 | 0.5±0 | 0.5±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.3±0 | 0.3±0 |
| 1000 Genomes | 34.1±1 | 21.1±1.7 | 19.2±1.9 | 17.7±1.6 | 17.8±1.5 | 17.3±1.4 | 16.8±1 | 15.6±1.8 | 15.7±1.1 | 15.5±1 | 15.1±1.6 | 14.2±0.1 | 13.3±1.2 | 12.8±1.1 |
| | 1.3±0 | 1.1±0 | 1±0 | 0.9±0 | 0.9±0 | 0.9±0 | 0.8±0 | 0.8±0 | 0.8±0 | 0.8±0 | 0.7±0 | 0.8±0 | 0.8±0 | 0.7±0 |
| | 0.1±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 | 0.4±0 |
| | 1.3±0 | 1.1±0.1 | 1±0 | 0.9±0.1 | 0.9±0 | 0.9±0 | 0.8±0 | 0.8±0 | 0.8±0 | 0.8±0 | 0.8±0 | 0.7±0 | 0.7±0 | 0.7±0 |
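The CPU-time comparison in the abstract follows from the 18 nt bwa-mem means for the Plasmodium–human data: 1.7 h against the human reference plus 0.2 h against the Plasmodium reference, versus 0.7 h against the combined reference. The sketch below redoes that arithmetic and shows how a bwa-aln total is composed from its three steps, as described in the table caption. The per-step bwa-aln hours in the example are hypothetical, since only totals are reported.

```python
def bwa_aln_cpu_hours(aln_read1, aln_read2, sampe):
    """Total bwa-aln CPU time: align read 1, align read 2, then run sampe."""
    return aln_read1 + aln_read2 + sampe


# Mean bwa-mem CPU hours at an 18 nt seed length (Plasmodium-human data).
separate_hours = {"human": 1.7, "Plasmodium": 0.2}
combined_hours = 0.7

total_separate = round(sum(separate_hours.values()), 1)
hours_saved = round(total_separate - combined_hours, 1)
speedup = total_separate / combined_hours

print(f"Separate references: {total_separate:.1f} CPU hours")
print(f"Combined reference:  {combined_hours:.1f} CPU hours")
print(f"Saved: {hours_saved:.1f} CPU hours ({speedup:.1f}x less CPU time)")

# Hypothetical per-step values, only to show how a bwa-aln total is summed.
example_total = bwa_aln_cpu_hours(aln_read1=0.4, aln_read2=0.4, sampe=0.1)
print(f"Example bwa-aln total: {example_total:.1f} CPU hours")
```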