Ali May, Sanne Abeln, Mark J Buijs, Jaap Heringa, Wim Crielaard, Bernd W Brandt.
Abstract
Massively parallel sequencing of microbial genetic markers (MGMs) is used to uncover the species composition in a multitude of ecological niches. These sequencing runs often contain a sample with known composition that can be used to evaluate the sequencing quality or to detect novel sequence variants. With NGS-eval, the reads from such (mock) samples can be used (i) to explore the differences between the reads and their references and (ii) to estimate the sequencing error rate. This tool maps these reads to references, then calculates and visualizes the different types of sequencing errors. Clearly, sequencing errors can only be accurately calculated if the reference sequences are correct. However, even with known strains, it is not straightforward to select the correct references from databases. We previously analysed a pyrosequencing dataset from a mock sample to estimate sequencing error rates and detected sequence variants in our mock community, allowing us to obtain an accurate error estimation. Here, we demonstrate the variant detection and error analysis capability of NGS-eval with Illumina MiSeq reads from the same mock community. While tailored towards the field of metagenomics, this server can be used for any type of MGM-based reads. NGS-eval is available at http://www.ibi.vu.nl/programs/ngsevalwww/.
Year: 2015 PMID: 25878034 PMCID: PMC4489229 DOI: 10.1093/nar/gkv346
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. An example of interactive error plots for the (forward) reads obtained from the V4 region of the 16S rRNA gene of Fusobacterium nucleatum. (A) The reads were mapped to a single F. nucleatum reference sequence. At position 60 in the read above, a variant can be observed as a high substitution peak. Here, the server suggests a consensus base with IUPAC code R (A or G). (B) The reference sequence for the variant was added to the set of reference sequences and the reads were re-mapped to the new set, which removed the substitution peak and reduced the error rate. (C) When zooming into the region between positions 42 and 80, the complete absence of the substitution peak at position 60 can be observed, as well as two sequencing errors. Note the change in the number of mapped sequences (blue line) between (A) and (B), which results from variant sequence reads being mapped to the new reference sequence (not shown) during the re-mapping.
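As a rough illustration of how a variant position such as the one in panel (A) could be summarized, the sketch below derives an IUPAC ambiguity code from per-position base counts. The function name `iupac_consensus`, the 20% frequency threshold and the example counts are assumptions made for illustration; the abstract does not describe how NGS-eval itself chooses the consensus base.

```python
# Illustrative only: a hypothetical helper, not NGS-eval's implementation.

# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(base_counts, min_fraction=0.2):
    """Return the IUPAC code covering every base seen in at least
    `min_fraction` of the reads at one position (threshold is assumed)."""
    total = sum(base_counts.values())
    if total == 0:
        return "N"  # no coverage at this position
    kept = frozenset(
        base for base, count in base_counts.items()
        if count / total >= min_fraction
    )
    if not kept:
        # No base reaches the threshold: fall back to the most frequent one.
        kept = frozenset([max(base_counts, key=base_counts.get)])
    return IUPAC_CODES.get(kept, "N")


# Made-up counts for a position where the reads split between A and G.
print(iupac_consensus({"A": 540, "G": 460, "C": 3, "T": 1}))
```

With these made-up counts only A and G pass the threshold, so the call returns R, while a position dominated by a single base returns that base unchanged.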
Sequencing error statistics, such as the percentage of insertions, deletions and mismatches, are reported for each reference sequence and for the sample as a whole.
| Reference | InF | InR | DelF | DelR | SubF | SubR | MisF | MisR |
|---|---|---|---|---|---|---|---|---|
| | 0.03 | 0.02 | 0.03 | 0.02 | 1.2 | 1.6 | 1.3 | 1.7 |
| | 0.004 | 0.05 | 0.005 | 0.02 | 0.45 | 1.9 | 0.46 | 2.0 |
| | 0.009 | 0.08 | 0.01 | 0.05 | 0.56 | 2.6 | 0.58 | 2.7 |
| | 0.007 | 0.04 | 0.02 | 0.03 | 0.21 | 1.8 | 0.23 | 1.9 |
| All references | 0.01 | 0.04 | 0.01 | 0.03 | 0.60 | 1.6 | 0.62 | 1.7 |
The table shows the values for the chimera-free forward and reverse reads after separate calculations by NGS-eval. In: insertions, Del: deletions, Sub: substitutions, Mis: mismatches (= In + Del + Sub), F: forward and R: reverse reads. All values are percentages.
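To make the relation Mis = In + Del + Sub concrete, the following minimal sketch counts the three error types from gapped pairwise alignments and reports them as percentages. It is not NGS-eval's code: the function name `error_percentages`, the use of `-` as the gap character and the choice of aligned reference bases as the denominator are all assumptions, and the server may normalize differently.

```python
# Illustrative only: a hypothetical helper, not NGS-eval's implementation.

def error_percentages(aligned_pairs):
    """Compute In, Del, Sub and Mis percentages from (reference, read)
    alignment strings of equal length, with '-' marking a gap.

    Assumption: percentages are relative to the number of aligned
    reference bases (non-gap reference columns)."""
    insertions = deletions = substitutions = ref_bases = 0
    for ref, read in aligned_pairs:
        if len(ref) != len(read):
            raise ValueError("aligned strings must have equal length")
        for r, q in zip(ref, read):
            if r == "-" and q == "-":
                continue  # empty column, carries no information
            if r == "-":
                insertions += 1  # read has a base the reference lacks
                continue
            ref_bases += 1
            if q == "-":
                deletions += 1  # reference base missing from the read
            elif q != r:
                substitutions += 1  # both present, but different
    if ref_bases == 0:
        raise ValueError("no aligned reference bases")

    def pct(count):
        return 100.0 * count / ref_bases

    return {
        "In": pct(insertions),
        "Del": pct(deletions),
        "Sub": pct(substitutions),
        "Mis": pct(insertions + deletions + substitutions),
    }


# Toy alignment: one substitution, one insertion, one deletion
# over ten reference bases.
stats = error_percentages([("ACGT-ACGTAC", "ACCTGAC-TAC")])
print(stats)
```

Because the table values are rounded, the reported Mis column can differ slightly from the sum of the rounded In, Del and Sub columns.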