Andre P Masella, Andrea K Bartram, Jakub M Truszkowski, Daniel G Brown, Josh D Neufeld.
Abstract
BACKGROUND: Illumina paired-end reads are used to analyse microbial communities by targeting amplicons of the 16S rRNA gene. Publicly available tools are needed to assemble overlapping paired-end reads while correcting mismatches and uncalled bases; many errors could be corrected to obtain higher sequence yields using quality information.
Year: 2012 PMID: 22333067 PMCID: PMC3471323 DOI: 10.1186/1471-2105-13-31
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic of paired-end assembly. Typical scenario: forward and reverse reads are overlapped and the primer regions are removed to reconstruct the sequences. Highly overlapping scenario: for short templates, the overlapping region may include the primer regions.
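The typical scenario in the schematic can be illustrated with a minimal sketch. This is a naive exact-match overlap merge, not PANDAseq's algorithm: it tolerates no mismatches, ignores quality scores, does not remove primers, and does not handle the highly overlapping case. The function names and the `min_overlap` parameter are hypothetical, chosen only for this illustration.

```python
# Naive paired-end overlap merge (illustrative only; NOT PANDAseq's method).
# Assumptions: reads contain only A, C, G, T, N; the overlap must match exactly.

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def naive_overlap_merge(forward, reverse, min_overlap=3):
    """Merge a forward read with a reverse read from the opposite strand.

    The reverse read is reverse-complemented, then the longest suffix of
    the forward read that exactly equals a prefix of it is taken as the
    overlap. Returns the merged sequence, or None if no overlap of at
    least `min_overlap` bases exists.
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    for k in range(longest, min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None


# Usage: simulate a read pair from a short hypothetical template.
template = "ACGTTGCAAGGCTTAACC"
forward_read = template[:12]
reverse_read = reverse_complement(template[6:])
merged = naive_overlap_merge(forward_read, reverse_read)
```

In this toy pair the two reads share six bases, so the merge reconstructs the template. Real reads contain mismatches and uncalled bases in the overlap, which is why an assembler that uses quality information is needed.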
Figure 2. Quality scores of assembled masked data. A perfect 16S rRNA sequence from Sinorhizobium meliloti was masked using real Illumina quality scores and the resulting paired-end sequences were assembled with PANDAseq. A histogram of quality scores for the assembled sequences is shown.
Read error correction frequencies
| Quality (Geometric Mean) | 0.9 - 1.0 | 0.6 - 0.9 |
|---|---|---|
| Error-free Input and Output | 544 669 | 21 095 |
| All Errors Retained | 4 023 | 4 675 |
| Input Errors Reduced | 50 082 | 27 668 |
| Errors Introduced | 0 | 37 |
| Total | 598 774 | 53 475 |
Summary of error frequencies in assembled Illumina paired-end reads generated from sequenced V3-region amplicons of Methylococcus capsulatus strain Bath.
All error data were analyzed solely within the region of overlap, which was relevant to PANDAseq assembly. Low-abundance "contamination" was observed in the dataset (data not shown), possibly due to reagents used for PCR. Such contaminant sequences contribute to the counts of sequences in which errors were retained. This category also contains sequences in which both reads have low-quality bases with quality scores masked by CASAVA.
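The table bins sequences by a geometric-mean quality value between 0 and 1. As a rough sketch of how such a value could be computed, the example below converts Phred scores to per-base probabilities of being correct and takes their geometric mean. The Phred relation (error probability 10^(-Q/10)) is standard; treating the table's quality value as the geometric mean of these per-base probabilities is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def phred_to_p_correct(q):
    """Probability that a base call is correct, given its Phred score."""
    return 1.0 - 10.0 ** (-q / 10.0)


def geometric_mean_quality(phred_scores):
    """Geometric mean of per-base correctness probabilities.

    ASSUMPTION: this mirrors the "Quality (Geometric Mean)" column only
    in spirit; the source does not spell out the exact calculation.
    """
    if not phred_scores:
        raise ValueError("at least one quality score is required")
    probs = [phred_to_p_correct(q) for q in phred_scores]
    if any(p <= 0.0 for p in probs):
        # A Phred score of 0 means the base is certainly wrong.
        return 0.0
    # Average in log space to avoid underflow on long sequences.
    log_mean = sum(math.log(p) for p in probs) / len(probs)
    return math.exp(log_mean)


# Usage: Phred 10 corresponds to a 0.9 probability of a correct call.
example_quality = geometric_mean_quality([10, 10, 10])
```

Under this assumption, a sequence whose bases all have Phred score 10 would sit at the boundary between the two quality bins in the table.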
Figure 3. Comparison of output of various assemblers. A scatter plot of the percentage of paired-end sequence assemblies from sequenced V3-region amplicons of Methylococcus capsulatus strain Bath against the average number of mismatching nucleotides between the assembled sequence and the reference sequence. The comparison was done between PANDAseq and three alternative assemblers (see text).
Number of sequences with correct overlap regions
| Assembler | Correct Assemblies | Percentage of Output |
|---|---|---|
| PANDAseq | 628 131 | 96.30 |
| BIPES (Merger) | 621 357 | 94.63 |
| SHERA | 637 646 | 92.35 |
| iTags (PHRAP) | 3 578 | 0.55 |
Summary of the number of correct overlap sequences in assembled output from sequenced V3-region amplicons of Methylococcus capsulatus strain Bath.
The percentage of sequences with error-free overlap regions is shown as a fraction of the total output for each assembler.
Figure 4. Rank abundance curves for control libraries. Rank-abundance curves for defined multi-organism libraries [1] assembled at two different quality thresholds using PANDAseq and naïve assembly followed by clustering with CD-HIT into OTUs of 97% identity.