Anthony Westbrook1, Jordan Ramsdell1,2, Taruna Schuelke3, Louisa Normington2, R Daniel Bergeron1,2, W Kelley Thomas2,3, Matthew D MacManes2,3.
Abstract
MOTIVATION: Whole metagenome shotgun sequencing is a powerful approach for assaying the functional potential of microbial communities. We currently lack tools that efficiently and accurately align DNA reads against protein references, the technique necessary for constructing a functional profile. Here, we present PALADIN, a novel modification of the Burrows-Wheeler Aligner that provides accurate alignment, robust reporting capabilities and orders-of-magnitude improved efficiency by directly mapping in protein space.
Year: 2017 PMID: 28158639 PMCID: PMC5423455 DOI: 10.1093/bioinformatics/btx021
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. PALADIN internal pipeline, outlining each step in the indexing and alignment process, modifications made to the BWA source, and options for file input. Shading indicates whether a step uses native PALADIN code or modified BWA code. See Supplementary Figure 5 for output options and further pipeline details.
Fig. 2. PALADIN performance for the given range of minimum score threshold values. Tests were performed by aligning the generated read set against the UniProt SwissProt reference database. Performance is measured by normalizing the percent of reads mapped, the similarity score, the average mapping quality and the number of unique proteins detected. As the maximal generalized performance of all four metrics centers around the threshold score of 15, this was used as the default parameter value in PALADIN. Variations to the score threshold were shown to have a consistently larger impact on performance than other alignment parameters.
Establishing a positive control
| | BWA | NovoAlign | PALADIN |
|---|---|---|---|
| Reads mapped % | 96.02 | 36.39 | 98.00 |
| Correctly mapped % | 86.23 | 91.65 | 93.39 |
| Mapping quality (60) | 58.67 | 59.84 | 59.11 |
| Detected proteins | 20461 | 21889 | 22127 |
Positive control was established by aligning the simulated reads against the coding regions of the original six test genomes. Percentage of reads correctly mapped and mapping quality were used to demonstrate algorithm and method correctness. Quality scores are calculated using the Phred scale, with 60 representing the highest level of confidence.
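The Phred scale mentioned in the caption relates a quality score Q to an error probability p through Q = -10 log10(p). The short sketch below illustrates that standard conversion; the function names are chosen here for illustration and are not taken from PALADIN or BWA.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Mapping quality 60 corresponds to roughly a one-in-a-million chance
    # that the reported mapping position is wrong.
    for q in (20, 50, 60):
        print(q, phred_to_error_prob(q))
```

Under this scale, the threshold of 20 used in the later tables corresponds to an error probability of about 1 in 100.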
Mapping efficiency against filtered Swiss-Prot
| | BWA | NovoAlign | PALADIN |
|---|---|---|---|
| Reads mapped % | 19.79 | 0.56 | 25.65 |
| Similarity index | 0.81 | 0.81 | 0.85 |
| Mapping quality (60) | 25.21 | 54.51 | 25.88 |
| Detected proteins | 6314 | 2265 | 7855 |
Mapping efficiency against the filtered Swiss-Prot database using simulated reads. Detected proteins were counted only from reads with mapping quality scores of 20 or greater. Results showed an improvement in both quantity and functional accuracy when aligning in protein space with PALADIN.
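The "detected proteins" filter described in this caption can be sketched as follows, assuming the alignments are available as SAM text (the format BWA produces), where column 2 is the bitwise flag, column 3 the reference name and column 5 the mapping quality. The function name and the sample records are hypothetical and serve only to illustrate the counting step; this is not the authors' script.

```python
def detected_proteins(sam_lines, min_mapq=20):
    """Return the set of reference (protein) names hit by at least one
    mapped read whose mapping quality is >= min_mapq."""
    proteins = set()
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # SAM records have 11 mandatory fields
            continue
        flag = int(fields[1])
        rname = fields[2]
        mapq = int(fields[4])
        if flag & 0x4 or rname == "*":    # 0x4 marks an unmapped read
            continue
        if mapq >= min_mapq:
            proteins.add(rname)
    return proteins


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = [
        "@SQ\tSN:protA\tLN:300",
        "read1\t0\tprotA\t10\t60\t30M\t*\t0\t0\t*\t*",
        "read2\t0\tprotB\t5\t12\t30M\t*\t0\t0\t*\t*",
        "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
        "read4\t0\tprotA\t40\t25\t30M\t*\t0\t0\t*\t*",
    ]
    print(len(detected_proteins(sample)))
```

Each protein is counted once regardless of how many reads map to it, which matches the "unique proteins detected" wording used for Fig. 2.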
Detected proteins mapped against the UniRef90
| Type | Project | BWA | PALADIN |
|---|---|---|---|
| Lung | Cystic Fibrosis Metagenome | 60 296 | 40 251* |
| Gut | HMP Core Microbiome | 175 448 | 190 947 |
| Soil | Merlot Microbiome | 7921 | 11 792 |
Number of proteins detected for three empirical read sets mapped against UniRef90: Lung (BioProject:PRJNA71831), Gut (BioSample:SAMN00037421) and Soil (MG-RAST:4520320.3), when filtered for reads with mapping quality scores of 20 or greater. Note: the lower performance of PALADIN for the lung set is a result of the proteins being underrepresented in the filtered version of UniRef90. Mapping against the full UniRef90 yielded 66 592 proteins detected.
Performance comparison for large read sets
| Read Count | Aligner | Effective Time (HMS) | Aligns/min |
|---|---|---|---|
| 1 000 000 | PALADIN | 00:07:10 | 139 535 |
| 1 000 000 | RAPSearch2 | 32:54:59 | 506 |
| 1 000 000 | DIAMOND | 00:51:42 | 19 342 |
| 1 000 000 | Lambda | 04:46:23 | 3492 |
| 240 000 000 | PALADIN | 31:15:02 | 127 998 |
| 240 000 000 | BLASTX | 250 000:00:00 | 16 |
Performance evaluation of the three applicable BLAST competitors was performed against a 1 000 000 read subset of the full 240 000 000 read set. To estimate the linear alignment rate (alignments per minute), the effective time reflects the difference between total time and the constant setup time of loading the indexed reference. Due to obvious time constraints in the case of the BLASTX comparison, alignment rate was calculated using an 8000 read subset, from which the effective time was estimated. For this final test, PALADIN was given the entire 240 000 000 read set as a query. In both tests, PALADIN outperformed all other alignment tools.
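The Aligns/min column can be reproduced from the read count and the effective time. The sketch below shows that arithmetic; the helper names are illustrative and not part of any of the benchmarked tools, and the effective times are copied from the table rather than measured.

```python
def hms_to_minutes(hms: str) -> float:
    """Convert an 'H:M:S' string to minutes; spaces used as thousands
    separators in the hour field are ignored."""
    hours, minutes, seconds = (int(part.replace(" ", "")) for part in hms.split(":"))
    return hours * 60 + minutes + seconds / 60


def aligns_per_minute(read_count: int, effective_hms: str) -> int:
    """Alignment rate, rounded to the nearest whole alignment per minute."""
    return round(read_count / hms_to_minutes(effective_hms))


if __name__ == "__main__":
    rows = [
        ("PALADIN", 1_000_000, "00:07:10"),
        ("RAPSearch2", 1_000_000, "32:54:59"),
        ("DIAMOND", 1_000_000, "00:51:42"),
        ("Lambda", 1_000_000, "04:46:23"),
        ("PALADIN", 240_000_000, "31:15:02"),
        ("BLASTX", 240_000_000, "250 000:00:00"),
    ]
    for aligner, reads, hms in rows:
        print(aligner, reads, aligns_per_minute(reads, hms))
```

Because the effective time already excludes the constant cost of loading the indexed reference, the resulting rate describes only the portion of the run that scales with the number of reads.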
Accuracy comparison for large read sets
| Aligner | Consensus % | Bit Difference | Quality Magnitude |
|---|---|---|---|
| RAPSearch2 | 99.34 | n/a* | n/a* |
| DIAMOND | 96.36 | 45.38 | 13.66 |
| Lambda | 98.85 | 47.62 | 14.33 |
| BLASTX | 99.23 | 20.60 | 6.20 |
Accuracy evaluation of the four protein alignment tools, calculated against the read sets outlined in Table 4. Consensus percentage indicates the portion of PALADIN mapped reads (filtered for a mapping quality score of 50) that matched the subject protein of the compared tool's alignment. Quality magnitude, calculated via the mean bit score difference, notes the significant decrease in alignment quality of reads left unmapped by PALADIN. Note, this calculation could not be performed on the RAPSearch2 results, as a software bug prevented writing read headers to the alignment report.
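The consensus percentage described in this caption can be sketched as a comparison of per-read protein assignments. The version below makes two assumptions that the caption does not spell out: "filtered for a mapping quality score of 50" is read as mapping quality of at least 50, and reads that the compared tool did not align are left out of the denominator. The function name, data layout and sample values are hypothetical.

```python
def consensus_percent(paladin_hits, other_hits, min_mapq=50):
    """Percentage of PALADIN-mapped reads (mapping quality >= min_mapq)
    whose protein matches the one reported by the compared tool.

    paladin_hits: dict of read_id -> (protein_id, mapping_quality)
    other_hits:   dict of read_id -> protein_id
    """
    compared = 0
    agree = 0
    for read_id, (protein, mapq) in paladin_hits.items():
        # Assumption: low-quality reads and reads the other tool
        # did not align are excluded from the comparison.
        if mapq < min_mapq or read_id not in other_hits:
            continue
        compared += 1
        if other_hits[read_id] == protein:
            agree += 1
    return 100.0 * agree / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical assignments, for illustration only.
    paladin = {
        "r1": ("P1", 60),
        "r2": ("P2", 55),
        "r3": ("P3", 60),
        "r4": ("P4", 50),
        "r5": ("P5", 10),   # below the quality filter, ignored
    }
    other = {"r1": "P1", "r2": "P2", "r3": "P9", "r4": "P4", "r5": "P5"}
    print(consensus_percent(paladin, other))
```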