Elmar Pruesse, Jörg Peplies, Frank Oliver Glöckner.
Abstract
MOTIVATION: In the analysis of homologous sequences, computation of multiple sequence alignments (MSAs) has become a bottleneck. This is especially troublesome for marker genes such as ribosomal RNA (rRNA), for which millions of sequences are already publicly available and individual studies can easily produce hundreds of thousands of new sequences. Methods have been developed to cope with such numbers, but further improvements are needed to meet accuracy requirements.
Year: 2012 PMID: 22556368 PMCID: PMC3389763 DOI: 10.1093/bioinformatics/bts252
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The alignment of the selected reference sequences is converted from RC-MSA representation (top) to PO-MSA representation (bottom)
Fig. 2. SINA alignment accuracy decreases almost linearly with the shared fractional identity of candidate and reference when using one reference sequence (red line). Using larger numbers of reference sequences markedly increases accuracy
Fig. 3. An alternative implementation which used simple column-profiles built from the selected reference sequences showed overall lower accuracy. Increasing the number of reference sequences quickly led to a degradation in accuracy
Results from leave-query-out benchmarks
| | 5S rRNA | tRNA | U5 | SILVA LSU |
|---|---|---|---|---|
| Dataset sequences | 597 | 1113 | 232 | 1588 |
| PyNAST (%) | 98.6 | 96.4 | 94.0 | 98.9 |
| mothur (%) | 97.5 | 92.1 | 93.3 | 98.9 |
| SINA (%) | 99.3 | 97.6 | 96.1 | 99.2 |
The reported percentages are the average Q scores. Only sequences aligned by all three tools were considered.
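The record does not spell out how the Q score is computed. As a minimal sketch, assuming the Q score of a single candidate is the fraction of its residues placed in the same alignment column as in the trusted reference alignment, it could be computed as follows. The function names `column_positions` and `q_score` and the gap characters `-` and `.` are illustrative assumptions, not part of any of the benchmarked tools.

```python
def column_positions(aligned_row, gap_chars="-."):
    """Return the alignment column index of every residue, in sequence order."""
    return [i for i, ch in enumerate(aligned_row) if ch not in gap_chars]


def q_score(test_row, trusted_row, gap_chars="-."):
    """Fraction of residues placed in the same column as in the trusted row.

    Assumption: both rows hold the same ungapped sequence and share one
    column coordinate system (that of the reference MSA).
    """
    test_cols = column_positions(test_row, gap_chars)
    trusted_cols = column_positions(trusted_row, gap_chars)
    if len(test_cols) != len(trusted_cols):
        raise ValueError("rows do not contain the same number of residues")
    if not trusted_cols:
        return 0.0
    matches = sum(t == r for t, r in zip(test_cols, trusted_cols))
    return matches / len(trusted_cols)


if __name__ == "__main__":
    # One of four residues (the G) sits in a different column.
    print(q_score("ACG-U", "AC-GU"))
```

Averaging this value over all candidates, and expressing it as a percentage, would give figures on the same scale as the table, under the stated assumption about the metric.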
Results using test data sampled from the SILVA SSU dataset
| | All SSU samples | | <80% identity | |
|---|---|---|---|---|
| Reference size | 1000 | 5000 | 1000 | 5000 |
| Sequences | 100 000 | 500 000 | 5443 | 8811 |
| Mean identity (%) | 92.34 | 95.24 | 75.71 | 75.9 |
| PyNAST^a (%) | 96.7 | 97.6 | 90 | 89 |
| | (0.20) | (0.08) | (1.7) | (1.5) |
| mothur (%) | 96.6 | 97.8 | 88 | 88 |
| | (0.23) | (0.07) | (2.0) | (1.3) |
| SINA (%) | 98.9 | 99.3 | 94 | 93 |
| | (0.12) | (0.03) | (1.2) | (1.1) |
The average Q scores shown were obtained by randomly sampling sequences from the SILVA SSU-based test data to create 100 reference MSAs and benchmark sets. This was repeated once with a reference MSA size of 1000 and once with a size of 5000. The SD between Q score averages from each of the 100 reference MSAs is shown in parentheses. The two columns on the right show the results when considering only difficult cases where the candidate sequences have <80% identity with all sequences in the respective reference MSA.
a PyNAST failed to align 0.5% of the sequences.
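The aggregation described in the table note can be sketched as below. This is an illustration under stated assumptions, not the authors' benchmark code: the overall figure is taken as the mean of the per-reference-MSA averages, the SD is the sample standard deviation between those averages, and fractional identity is counted over columns where both aligned rows carry a residue. The names `fractional_identity`, `is_difficult`, and `summarize` are hypothetical.

```python
from statistics import mean, stdev


def fractional_identity(row_a, row_b, gap_chars="-."):
    """Share of identical residues over columns where both rows have a residue.

    Assumption: the denominator is the number of residue-residue columns;
    the source does not state which definition was used.
    """
    pairs = [(a, b) for a, b in zip(row_a, row_b)
             if a not in gap_chars and b not in gap_chars]
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def is_difficult(candidate, references, threshold=0.80):
    """True if the candidate has < threshold identity with every reference."""
    return all(fractional_identity(candidate, ref) < threshold
               for ref in references)


def summarize(q_scores_per_replicate):
    """Return (mean of per-replicate averages, SD between those averages).

    q_scores_per_replicate: one list of Q scores per sampled reference MSA.
    """
    averages = [mean(scores) for scores in q_scores_per_replicate]
    return mean(averages), stdev(averages)


if __name__ == "__main__":
    # Two toy replicates instead of the 100 used in the benchmark.
    print(summarize([[0.9, 1.0], [0.8, 0.9]]))
    print(is_difficult("ACGU", ["ACGA", "AAAA"]))
```

With 100 replicates, `summarize` would receive 100 lists, and `is_difficult` would select the subset reported in the two right-hand columns.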