Abstract
As the scope of microbial surveys expands with the parallel growth in sequencing capacity, a significant bottleneck in data analysis is the ability to generate a biologically meaningful multiple sequence alignment. The most commonly used aligners have varying alignment quality and speed, tend to depend on a specific reference alignment, or lack a complete description of the underlying algorithm. The purpose of this study was to create and validate an aligner that quickly generates a high-quality alignment and has the flexibility to use any reference alignment. Using the simple nearest alignment space termination algorithm, the resulting aligner operates in linear time, requires a small memory footprint, and generates a high-quality alignment. In addition, the alignments generated for variable regions were of as high a quality as the alignments of full-length sequences. As implemented, the method was able to align 18 full-length 16S rRNA gene sequences and 58 V2 region sequences per second to the 50,000-column SILVA reference alignment. Most importantly, the resulting alignments were of a quality equal to SILVA-generated alignments. The aligner described in this study will enable scientists to rapidly generate robust multiple sequence alignments that are implicitly based upon the predicted secondary structure of the 16S rRNA molecule. Furthermore, because the implementation is not connected to a specific database, it is easy to generalize the method to reference alignments for any DNA sequence.
Year: 2009 PMID: 20011594 PMCID: PMC2788221 DOI: 10.1371/journal.pone.0008230
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Flowchart describing the alignment algorithm.
The published and current greengenes aligner algorithm is shown in black and the modifications that were tested in this study are shown in blue.
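As a rough illustration of what aligning against a reference alignment involves, the sketch below places a candidate's bases into the columns of a template row taken from a reference alignment. It is not the published implementation: the function name `project_onto_reference` and the gap characters are hypothetical, and the sketch assumes the candidate has no insertions relative to the template, so the gap-removal step that gives the nearest alignment space termination algorithm its name is left out.

```python
GAP_CHARS = "-."  # assumed gap symbols in the reference alignment


def project_onto_reference(template_ref, candidate_pw):
    """Place candidate bases into the columns of the reference alignment.

    template_ref: the template's row in the reference alignment (with gaps).
    candidate_pw: the candidate row of a global pairwise alignment against the
                  ungapped template, one character per template base
                  ('-' where the candidate has a deletion).
    Assumption: the candidate has no insertions relative to the template.
    """
    n_bases = sum(1 for c in template_ref if c not in GAP_CHARS)
    if len(candidate_pw) != n_bases:
        raise ValueError("candidate row must have one character per template base")
    bases = iter(candidate_pw)
    # Copy the template's gap columns; fill every base column from the candidate.
    return "".join("-" if c in GAP_CHARS else next(bases) for c in template_ref)


aligned = project_onto_reference("A-CG--T", "A-GT")
```

Because the output always has the same number of columns as the reference alignment, any reference alignment can be swapped in without changing the code.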
Figure 2. Comparison of alignments generated by the RDP, greengenes, and SILVA databases.
Alignments were taken between positions 60 and 113 of the E. coli 16S rRNA gene sequence for E. coli and four Enterococcus spp. The alignment generated for these sequences within this region using 8-mers and the Needleman-Wunsch algorithm was identical to that found in the SILVA alignment. The lower-case bases in the RDP alignment indicate unaligned positions. For the greengenes and SILVA alignments, yellow highlighting represents bases that are predicted to form traditional Watson-Crick base-pairs in the secondary structure, gray highlighting represents weak base-pairs, black highlighting represents bases that will not form base-pairs, and a lack of highlighting represents bases that are predicted to be in loop structures.
Table 1. Comparison of search methods when using the V2 and V19 candidate sequences and full-length template sequences.ᵃ
| Region | Method | Speed (seqs/s) | % Correct template | % Δsimilarity (sd) |
| V2 | 5-mers | 118 | 50.6 | 11.7 (12.8) |
| | 6-mers | 180 | 68.3 | 7.2 (11.2) |
| | 7-mers | 280 | 72.6 | 6.1 (10.6) |
| | 8-mers | 225 | 72.7 | 6.1 (10.5) |
| | 9-mers | 202 | 72.4 | 6.1 (10.5) |
| | 10-mers | 104 | 71.7 | 6.3 (10.6) |
| | Suffix tree | 5.5 | 39.4 | 15.3 (13.6) |
| | blastn | 8.8 | 49.0 | 13.0 (13.8) |
| V19 | 5-mers | 37 | 54.9 | 8.3 (10.0) |
| | 6-mers | 34 | 72.0 | 5.0 (8.5) |
| | 7-mers | 41 | 74.6 | 4.5 (8.2) |
| | 8-mers | 49 | 74.6 | 4.5 (8.2) |
| | 9-mers | 52 | 74.1 | 4.6 (8.3) |
| | 10-mers | 43 | 73.4 | 4.7 (8.4) |
| | Suffix tree | 1.9 | 62.3 | 7.2 (9.9) |
| | blastn | 0.7 | 63.4 | 6.8 (9.6) |
Data for the other regions and comparisons to region-specific template sequences are provided in Tables S1 and S2.
The average percentage difference in similarity between the correct template and the actual template returned by the search method for each candidate sequence. Smaller values indicate that more similar sequences were identified. Values in parentheses represent the standard deviation.
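The k-mer rows in the table compare word sizes for picking a template. A minimal sketch of one way such a search could work is given below, assuming the template is chosen as the sequence sharing the most distinct k-mers with the candidate. The function names and the toy sequences are hypothetical, and the published implementation may score or index k-mers differently.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers (overlapping words of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_template(candidate, templates, k=8):
    """Return the name of the template sharing the most distinct k-mers.

    templates: dict mapping a template name to its unaligned sequence.
    Assumption: ties are broken by dictionary order; a real search would
    precompute a k-mer index instead of rescanning every template.
    """
    cand = kmers(candidate, k)
    return max(templates, key=lambda name: len(cand & kmers(templates[name], k)))


# Toy data for illustration only.
toy_templates = {"t1": "ACGTACGTTAGC", "t2": "TTTTGGGGCCCC"}
hit = best_template("ACGTACGTTA", toy_templates, k=4)
```

Larger k makes each shared word more specific but leaves fewer words per sequence, which is one plausible reason the percentage of correct templates in the table levels off around 7-mers to 9-mers.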
Table 2. Number of candidate sequences that did not yield a significant blast match against the full-length or region-specific template databases.
| Region | Total Candidate Seqs. | Template type | Count | % |
| V19 | 186,206 | Full-length | 0 | 0.00 |
| | | Region-specific | NA | NA |
| V14 | 139,987 | Full-length | 0 | 0.00 |
| | | Region-specific | 0 | 0.00 |
| V12 | 139,987 | Full-length | 12 | 0.01 |
| | | Region-specific | 12 | 0.01 |
| V2 | 186,206 | Full-length | 106 | 0.06 |
| | | Region-specific | 105 | 0.06 |
| V23 | 186,206 | Full-length | 10 | 0.01 |
| | | Region-specific | 11 | 0.01 |
| V3 | 186,206 | Full-length | 432 | 0.23 |
| | | Region-specific | 432 | 0.23 |
| V4 | 186,206 | Full-length | 548 | 0.29 |
| | | Region-specific | 546 | 0.29 |
| V6 | 186,206 | Full-length | 64,389 | 34.6 |
| | | Region-specific | 64,089 | 34.4 |
| V89 | 77,685 | Full-length | 1 | 0.00 |
| | | Region-specific | 1 | 0.00 |
| V9 | 77,685 | Full-length | 14 | 0.02 |
| | | Region-specific | 14 | 0.02 |
Table 3. Summary of alignment improvement for V19 candidate sequences using the blastn, Gotoh, or Needleman-Wunsch pairwise alignment algorithms when the best template was selected for each candidate sequence.ᵃ
| Alignment method | Speed (seqs/s) | Gap opening | Gap extension | % Δsimilarity (sd) |
| blastn | 10–12 | 5 | 2 | 0.42 (0.89) |
| | | 4 | 2 | 0.41 (0.87) |
| | | 3 | 2 | 0.42 (0.87) |
| | | 2 | 2 | 0.41 (0.82) |
| | | 1 | 2 | 0.43 (0.78) |
| | | 4 | 1 | 0.34 (0.68) |
| | | 3 | 1 | 0.36 (0.68) |
| | | 2 | 1 | 0.39 (0.69) |
| Gotoh | 15–17 | 5 | 2 | 0.23 (0.44) |
| | | 4 | 2 | 0.24 (0.45) |
| | | 3 | 2 | 0.27 (0.48) |
| | | 2 | 2 | 0.29 (0.49) |
| | | 1 | 2 | 0.34 (0.55) |
| | | 4 | 1 | 0.25 (0.45) |
| | | 3 | 1 | 0.29 (0.49) |
| | | 2 | 1 | 0.32 (0.52) |
| | | 1 | 1 | 0.41 (0.61) |
| Needleman-Wunsch | 21–24 | 5 | NA | 0.27 (0.49) |
| | | 4 | NA | 0.30 (0.51) |
| | | 3 | NA | 0.34 (0.55) |
| | | 2 | NA | 0.42 (0.62) |
| | | 1 | NA | 0.38 (0.60) |
Data for the other regions and comparisons to region-specific template sequences are provided in Tables S3 and S4.
The average difference between the percent similarity of the template sequence to the candidate sequence as aligned by each implementation and its percent similarity to the SILVA-aligned candidate sequence. Positive values indicate that the candidate alignment is more similar to the template sequence than the SILVA alignment is; negative values indicate that it is less similar. Values in parentheses indicate the standard deviation.
blastn does not permit these gap penalties when using a match reward and mismatch penalty of 1, and the Needleman-Wunsch algorithm takes only one gap penalty parameter.
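To make the single-gap-penalty point concrete, here is a minimal textbook sketch of Needleman-Wunsch global alignment in which every gap position costs the same amount. The Gotoh variant would instead track separate opening and extension costs. The defaults for match reward and mismatch penalty (both 1) follow the note above; the function name, the default gap penalty, and the tie-breaking order in the traceback are assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch_penalty=1, gap_penalty=2):
    """Global alignment with one linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else -mismatch_penalty

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),   # align the two bases
                score[i - 1][j] - gap_penalty,     # gap in b
                score[i][j - 1] - gap_penalty,     # gap in a
            )

    # Trace back from the bottom-right corner, preferring the diagonal.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT", gap_penalty=2)
```

With one penalty per gap position, a run of three gaps costs exactly three times a single gap, which is why the table lists a gap extension value of NA for this method.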
Table 4. Analysis of optimal alignment settings for each variable region when using full-length, region-specific, and vertical-gap filtered full-length template sequences.
| Region | Template sequences | Speed (seqs/s) | % Δsimilarity (sd) | % Trimmed |
| V19 | Full-length | 18 | 0.34 (0.64) | 0.17 |
| | Region-specific | NA | NA | NA |
| | Vertical-gap filtered | 22 | 0.34 (0.65) | 0.17 |
| V14 | Full-length | 31 | 0.30 (0.84) | 0.20 |
| | Region-specific | 37 | 0.31 (0.83) | 0.20 |
| | Vertical-gap filtered | 41 | 0.29 (0.84) | 0.20 |
| V12 | Full-length | 51 | 0.29 (1.59) | 0.29 |
| | Region-specific | 79 | 0.40 (1.52) | 0.26 |
| | Vertical-gap filtered | 88 | 0.32 (1.57) | 0.27 |
| V2 | Full-length | 58 | −0.09 (1.23) | 0.02 |
| | Region-specific | 100 | −0.01 (1.16) | 0.10 |
| | Vertical-gap filtered | 105 | −0.09 (1.23) | 0.02 |
| V23 | Full-length | 43 | 0.07 (0.91) | 0.02 |
| | Region-specific | 64 | 0.14 (0.86) | 0.23 |
| | Vertical-gap filtered | 65 | 0.07 (0.91) | 0.02 |
| V3 | Full-length | 69 | −0.18 (1.50) | 0.00 |
| | Region-specific | 122 | −0.06 (1.35) | 0.27 |
| | Vertical-gap filtered | 151 | −0.16 (1.49) | 0.00 |
| V4 | Full-length | 61 | −0.19 (1.00) | 0.00 |
| | Region-specific | 100 | −0.07 (0.75) | 0.00 |
| | Vertical-gap filtered | 109 | −0.19 (1.00) | 0.00 |
| V6 | Full-length | 78 | −0.61 (3.63) | 0.02 |
| | Region-specific | 145 | −0.02 (2.92) | 0.44 |
| | Vertical-gap filtered | 204 | −0.64 (3.66) | 0.02 |
| V89 | Full-length | 45 | 0.09 (0.78) | 0.13 |
| | Region-specific | 70 | 0.12 (0.75) | 0.12 |
| | Vertical-gap filtered | 64 | 0.09 (0.78) | 0.13 |
| V9 | Full-length | 61 | 0.01 (1.24) | 0.21 |
| | Region-specific | 100 | 0.08 (1.14) | 0.17 |
| | Vertical-gap filtered | 102 | 0.01 (1.24) | 0.21 |
See description for Table 3.
The percentage of sequences in which fewer than 95% of the bases were aligned to the template sequence.
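The trimmed percentage defined above is simple to compute once the number of aligned bases per sequence is known. The sketch below assumes each candidate is summarized as a pair of counts (bases aligned, total bases); the function name and input layout are hypothetical and are not taken from the study's code.

```python
def percent_trimmed(counts, threshold=0.95):
    """Percentage of sequences whose aligned fraction falls below threshold.

    counts: list of (aligned_bases, total_bases) tuples, one per candidate.
    """
    if not counts:
        return 0.0
    flagged = sum(1 for aligned, total in counts if aligned / total < threshold)
    return 100.0 * flagged / len(counts)


# Toy counts for illustration only: two of the four fall below 95%.
example = percent_trimmed([(100, 100), (90, 100), (96, 100), (50, 100)])
```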
Table 5. Analysis of two versions of the greengenes aligner when aligning various regions to full-length SILVA-aligned template sequences.
| Region | greengenes version | Speed (seqs/s) | % Δsimilarity (sd)ᵃ | % Trimmedᵇ |
| V19 | Original | 15 | 0.31 (0.83) | 3.09 |
| | Current | 0.6 | 0.37 (0.82) | 3.06 |
| V14 | Original | 18 | 0.26 (1.19) | 4.82 |
| | Current | 1.5 | 0.39 (1.14) | 4.27 |
| V12 | Original | 18 | 0.36 (4.51) | 4.51 |
| | Current | 4.5 | 0.69 (2.73) | 5.40 |
| V2 | Original | 19 | −0.09 (0.91) | 0.91 |
| | Current | 5.0 | 0.27 (2.77) | 2.04 |
| V23 | Original | 17 | 0.07 (1.15) | 2.16 |
| | Current | 2.2 | 0.25 (1.27) | 2.08 |
| V3 | Original | 20 | −0.02 (1.85) | 2.18 |
| | Current | 7.3 | 0.00 (4.41) | 2.52 |
| V4 | Original | 19 | −0.19 (1.03) | 0.77 |
| | Current | 5.9 | −0.27 (4.44) | 1.10 |
| V6 | Original | 26 | −4.58 (19.5) | 18.0 |
| | Current | 12 | −28.4 (39.9) | 37.7 |
| V89 | Original | 20 | 0.14 (1.07) | 1.99 |
| | Current | 2.0 | 0.27 (1.08) | 1.96 |
| V9 | Original | 21 | 0.16 (1.76) | 2.09 |
| | Current | 3.0 | 0.29 (1.99) | 1.82 |
See descriptions for Table 4.
Searching with 7-mers and using blastn to align.
Searching and aligning with blastn.