
NGS-eval: NGS Error analysis and novel sequence VAriant detection tooL.

Ali May¹, Sanne Abeln², Mark J Buijs³, Jaap Heringa², Wim Crielaard³, Bernd W Brandt⁴.

Abstract

Massively parallel sequencing of microbial genetic markers (MGMs) is used to uncover the species composition in a multitude of ecological niches. These sequencing runs often contain a sample with known composition that can be used to evaluate the sequencing quality or to detect novel sequence variants. With NGS-eval, the reads from such (mock) samples can be used to (i) explore the differences between the reads and their references and to (ii) estimate the sequencing error rate. This tool maps these reads to references and calculates as well as visualizes the different types of sequencing errors. Clearly, sequencing errors can only be accurately calculated if the reference sequences are correct. However, even with known strains, it is not straightforward to select the correct references from databases. We previously analysed a pyrosequencing dataset from a mock sample to estimate sequencing error rates and detected sequence variants in our mock community, allowing us to obtain an accurate error estimation. Here, we demonstrate the variant detection and error analysis capability of NGS-eval with Illumina MiSeq reads from the same mock community. While tailored towards the field of metagenomics, this server can be used for any type of MGM-based reads. NGS-eval is available at http://www.ibi.vu.nl/programs/ngsevalwww/.

Substances: Genetic Markers

Year: 2015 PMID: 25878034 PMCID: PMC4489229 DOI: 10.1093/nar/gkv346

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Microbial genetic markers (MGMs) are genes or other DNA sequences that are widely used in phylogenetic and taxonomic analyses, for instance, in species classification and profiling of community structures in environmental sequencing (metagenomic) samples (1). The properties of MGMs that make them suitable for such analyses are their universal presence across species as well as their highly informative and relatively conserved sequence composition (2). The most commonly used MGMs for eukaryotes include the internal transcribed spacer region (3) and the 18S ribosomal RNA (rRNA) gene (4), and for prokaryotes, the spacer region between the 16S and 23S rRNA genes (5), as well as these genes themselves (6). Although limited by laborious and costly molecular techniques, earlier studies relying on the cloning and (partial) sequencing of MGMs have uncovered the previously unknown biological diversity in various ecosystems (7,8). Recently, next-generation sequencing (NGS) has become a standard method for determining the community structure in environmental samples and other samples of microbial communities, for example, in seawater (9) and soil (10). Moreover, the same technique initiated the characterization of the human microbiome in health (11) and in disease (12), making it possible to establish relations between microbiome and host health status. Environmental sequencing studies often include a ‘mock’ community sample, which is a low-diversity community with known composition. The sequencing data acquired from the mock samples has been used to (i) determine the influence of experimental noise on diversity estimates (13,14), (ii) standardize and improve experimental protocols to ensure consistency between sequencing runs (15) and (iii) evaluate the accuracy of data cleaning and taxonomic analysis pipelines (16–20). 
Furthermore, the mock samples can be used to determine the overall quality of a sequencing run, as well as error rates, such as the insertion, deletion and substitution rate (21,22). The accurate estimations of these errors predominantly depend on the use of correct reference sequences. This makes it essential to detect sequence variants that are missing in the reference dataset, which may otherwise lead to inflated errors (23). The identification of variants in metagenomic samples by the use of genetic markers is also key to detect clinically relevant novel bacterial strains (24) and taxonomic reconstruction (25). Numerous tools exist for the correction of errors in high-throughput sequencing data (26), including those specifically developed for MGMs (19,20,27). However, there are only a limited number of methods for error rate calculation. DRISEE is an error estimation tool designed specifically for whole-genome shotgun metagenomics sequences and depends on the presence of artificially duplicated reads, making it unsuitable for reads from MGMs (28). To our knowledge, the only computational tool currently available for estimating sequencing error in reads from MGMs is the seq.error command in mothur (29). Here, the reads are aligned to a reference alignment of marker genes (e.g. 16S rDNAs). Next, the leading and trailing bases in reads that do not fall into an overlapping alignment region are considered artefacts and are trimmed before error rate estimation. This may lead to undesired effects. Since no visualization is provided, it is difficult to get insight into the error rates due to the likely presence of novel variants. There are existing methods for variant calling and single nucleotide polymorphism discovery (30–32). However, these tools mainly focus on determining the significance of rare variants in single-organism studies using whole-genome shotgun data. 
Here, our purpose is different; we are looking for common variants in MGMs in microbial community samples that may affect error rates. Note that rare variants typically do not influence the accuracy of error rates. NGS-eval, presented here, facilitates the identification of common variants by visualizing the frequency of errors on each reference sequence; this allows the user to compare such frequencies to expected error rates and to determine whether they result from the presence of a variant sequence. We have developed NGS-eval, a user-friendly web server, for estimating different types of sequencing errors in (mock) samples from MGM-based sequencing runs. The interactive plots in our tool can be used to explore the differences between the reads and their reference sequences to detect novel sequence variants. Using a mock community sample sequenced on an Illumina MiSeq platform, we show that accurate error rate estimations can only be achieved by the detection of such variants. While most suitable in the field of environmental sequencing, the NGS-eval server can be used for any type of marker-based sequencing output.

MATERIALS AND METHODS

Data preparation

The reads should be processed to ensure that contaminants, that is, reads from species not included in the sample, are removed. In addition, to estimate only the sequencing error, experimental bias other than that of sequencing, such as chimeric sequences formed during PCR amplification (17), should be removed from the reads. A number of data processing methods exist for this purpose (29,33,34). A description is also available in the NGS-eval online documentation. Please note that remaining contaminants can still inflate the estimated error.

Web server

Input

The required inputs consist of two sets of nucleotide sequences: the NGS reads (e.g. from Roche 454 or Illumina platforms, from a single sample) and the reference sequences corresponding to the reads. The references should be in FASTA format, whereas the reads can be uploaded in FASTA or FASTQ format, either uncompressed or compressed (gzip or zip). Optional inputs include the (gene-specific part of the) primer sequences used in the amplification of the marker gene or sequence. To prevent non-specific priming in reads from leading to inflated error estimates (cf. 23), the degenerate primer bases in the reference sequences can optionally be expanded to the corresponding IUPAC ambiguity characters. We also recommend using the processing option to trim (PCR) full-length reference sequences to the region of interest, for instance, full-length 16S rDNA sequences to the V4 hypervariable region. Likewise, in the case of paired-end sequencing, where the forward or the reverse reads may not fully cover the region of interest, these trimmed reference sequences can be further truncated to a length specified by the user (sequences shorter than this length are not filtered out).
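The reference trimming described above can be illustrated with a short Python sketch. It is not the server's implementation, which is not detailed here. The names `primer_regex`, `reverse_complement` and `trim_reference` are hypothetical, the primers are assumed to match the reference exactly apart from IUPAC ambiguity codes, and excluding the primer sites from the trimmed region is an assumption made for this example.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regular-expression classes.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a sequence, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regular expression."""
    return "".join(IUPAC_REGEX[base] for base in primer.upper())


def trim_reference(reference, fwd_primer, rev_primer, max_length=None):
    """Return the region between the two primer sites, or None if not found.

    Assumption: the primer sites themselves are excluded. If `max_length`
    is given, the region is truncated to that length; shorter regions are
    returned unchanged rather than discarded.
    """
    pattern = re.compile(
        primer_regex(fwd_primer)
        + "(.*?)"
        + primer_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(reference.upper())
    if match is None:
        return None
    region = match.group(1)
    return region if max_length is None else region[:max_length]


# Toy reference: 2 flanking bases, forward site, region, reverse site, 2 bases.
toy_reference = "AAGTGCCAGCACGTACGTACATTAGATACC"
print(trim_reference(toy_reference, "GTGYCAGC", "TWTCTAAT"))
print(trim_reference(toy_reference, "GTGYCAGC", "TWTCTAAT", max_length=6))
```

A regular expression is used here only because it keeps the example short. Real primer sites often contain mismatches, in which case an alignment-based search would be more appropriate.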

Processing

First, the reads are dereplicated: one read becomes the representative read for each unique sequence and the IDs of all reads identical to it are stored. Next, the best-matching reference for each representative read is determined using the usearch_global command in USEARCH v.8.0 (35). Subsequently, optimal alignments are calculated by globally aligning each representative read to its reference sequence using the Needleman–Wunsch alignment algorithm implemented in EMBOSS needleall v6.6.0 (36). These alignments are parsed and sequencing errors, such as mismatches, insertions and deletions, are calculated for each reference sequence and for the overall sample. Finally, JavaScript objects are produced, which are used to plot the interactive graphs for each reference sequence in the user's web browser (using jqPlot, an open source project by Chris Leonello; http://www.jqplot.com/).
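The dereplication and error-counting steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the server's code. The function names `dereplicate` and `count_errors` are hypothetical, and each alignment is assumed to be available as two equal-length gapped strings (with `-` as the gap character and no column containing two gaps), as a global aligner would produce. Insertions and deletions are defined relative to the reference.

```python
def dereplicate(reads):
    """Group identical reads.

    `reads` is an iterable of (read_id, sequence) pairs. The result maps
    each unique sequence to the IDs of all reads that share it; the first
    ID in each list can serve as the representative read.
    """
    groups = {}
    for read_id, seq in reads:
        groups.setdefault(seq.upper(), []).append(read_id)
    return groups


def count_errors(aligned_ref, aligned_read):
    """Count substitutions, insertions and deletions in one global alignment.

    A gap in the reference is counted as an insertion in the read, and a
    gap in the read as a deletion.
    """
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned sequences must have equal length")
    counts = {"sub": 0, "ins": 0, "del": 0, "length": len(aligned_ref)}
    for ref_base, read_base in zip(aligned_ref, aligned_read):
        if ref_base == "-":
            counts["ins"] += 1
        elif read_base == "-":
            counts["del"] += 1
        elif ref_base != read_base:
            counts["sub"] += 1
    return counts


groups = dereplicate([("r1", "ACGT"), ("r2", "acgt"), ("r3", "ACGA")])
print(groups)
print(count_errors("ACG-TA", "AC-TTC"))
```

Because only representative reads are aligned, the counts for a representative would presumably be multiplied by the number of reads in its group (for example `len(groups["ACGT"])`) before being added to the per-reference totals.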

RESULTS AND DISCUSSION

Overview

NGS studies of microbial genetic markers (MGMs), for instance, the 16S rRNA gene, often include a ‘mock’ sample with a known species profile. Such a sample can be used for a variety of tasks, ranging from the evaluation of sequencing quality to the optimization of computational pipelines that handle NGS datasets. The NGS-eval server enables the analysis of the reads obtained from such microbial community samples for two main purposes:

(i) Calculating the rates of different sequencing error types, such as insertions, deletions and substitutions. The results can be used to evaluate the overall quality of a sequencing run as well as to assess the influence of corrective tools, such as error correction algorithms, on the resulting data.

(ii) Detecting common sequence variants in the sample and correcting the reference sequences, which is essential for accurate error rate estimates. This functionality can also be helpful for the identification of novel variants. The user can add such variants to the set of reference sequences and the server can be rerun to obtain an error rate that is more representative of sequencing error only.

Error analysis and variant detection

Previously, we analysed a pyrosequencing dataset, where the V5–V7 hypervariable region of the 16S rRNA gene was sequenced for a mock community (23). Using an initial reference dataset, the error rate calculated for substitutions was 10-fold higher than the values reported in literature. Further analysis with an earlier NGS-eval version led to the identification of seven novel sequence variants. The error rates were reduced to expected values after including these variants as additional reference sequences. Here, we analysed 251-bp long forward and reverse reads from a paired-end Illumina MiSeq dataset, where the V4 hypervariable region was sequenced for the same mock community, following a MiSeq 16S rDNA protocol (37). Before mapping, chimeras in the forward or reverse reads were removed using USEARCH v.8.0 (35) by following the chimera removal procedure described in the NGS-eval online documentation. NGS-eval was separately run for the forward and reverse reads. The high substitution peak in Figure 1A shows an example of a common variant from this dataset, which was later confirmed with BLAST (NCBI BLAST against nr) to be present in sequences of the same strain, illustrating how error rates can be estimated more correctly using our server (Figure 1B and C). The overall error rate was calculated by summing the number of mismatches in all alignments and dividing the result by the total length of the alignments. A detailed description of this calculation is given in the online server documentation. The overall estimates for the error rates in the forward and reverse reads were 0.62% and 1.7%, respectively. This difference is expected since the reverse reads are generally of lower quality than the forward reads. Table 1 shows an overview of the error statistics reported by NGS-eval. The overall combined error rate, 1.2%, was similar to the values obtained previously (0.8%) for the MiSeq platform (38). 
The error rate reported for the same platform can be as low as 0.1% in the case of shotgun reads and trimming low-quality tails (39,40).
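The overall error rate calculation described above, the total number of mismatches divided by the total alignment length, can be written as a short Python sketch. The function name `overall_error_rate` and the input layout are hypothetical. Following the note to Table 1, a mismatch count is taken here to include insertions, deletions and substitutions. Whether and how the server weights each alignment by the number of identical reads is not stated here, so the sketch simply expects one entry per alignment to be counted.

```python
def overall_error_rate(alignments):
    """Overall error rate in percent.

    `alignments` is an iterable of (n_mismatches, alignment_length) pairs,
    one pair per alignment that should contribute to the estimate.
    """
    total_mismatches = 0
    total_length = 0
    for n_mismatches, length in alignments:
        total_mismatches += n_mismatches
        total_length += length
    if total_length == 0:
        raise ValueError("no aligned positions to compute a rate from")
    return 100.0 * total_mismatches / total_length


# Three toy alignments: 5 mismatches over 1000 aligned positions.
example = [(2, 250), (0, 250), (3, 500)]
print(f"{overall_error_rate(example):.2f}%")  # prints 0.50%
```

Dividing the pooled mismatches by the pooled length, rather than averaging per-alignment rates, keeps long alignments from being under-weighted relative to short ones.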

Figure 1.

An example of interactive error plots for the (forward) reads obtained from the V4 region of the 16S rRNA gene of Fusobacterium nucleatum. (A) The reads were mapped to a single F. nucleatum reference sequence. At position 60 in the read above, a variant can be observed as a high substitution peak. Here, the server suggests a consensus base with IUPAC code R (A or G). (B) The reference sequence for the variant was added to the set of reference sequences and the reads were re-mapped to the new set, which led to the removal of the substitution peak and a reduced error rate. (C) When zooming into the region between positions 42 and 80, the complete absence of the substitution peak at position 60 can be observed, as well as two sequencing errors. Note the change in the number of mapped sequences (blue line) between (A) and (B), resulting from the mapping of variant sequence reads to the new reference (not shown) sequence during the re-mapping.

Table 1.

Sequencing error statistics, such as the percentage of insertions, deletions and mismatches, are reported for each reference sequence and for the sample as a whole

Reference	In_F	In_R	Del_F	Del_R	Sub_F	Sub_R	Mis_F	Mis_R
S. oralis	0.03	0.02	0.03	0.02	1.2	1.6	1.3	1.7
S. mutans	0.004	0.05	0.005	0.02	0.45	1.9	0.46	2.0
P. gingivalis	0.009	0.08	0.01	0.05	0.56	2.6	0.58	2.7
P. nigrescens	0.007	0.04	0.02	0.03	0.21	1.8	0.23	1.9
All references	0.01	0.04	0.01	0.03	0.60	1.6	0.62	1.7

The table shows the values for the chimera-free forward and reverse reads after separate calculations by NGS-eval. In: insertions, Del: deletions, Sub: substitutions, Mis: mismatches (= In + Del + Sub), F: forward and R: reverse reads. All values are percentages.

Interactive visualization and server output files

The interactive analysis and visualization of the frequencies and positions of the errors in each reference resulted in the discovery of a sequence variant for one of the species (Figure 1). This functionality is provided by plotting the error data (e.g. insertions, deletions and substitutions) along the sequence coordinate of a selected reference. The data series to be plotted can be selected as well as the error axis scale(s) (unscaled, relative or logarithmic). In addition, the reference sequence itself can be added to the plot, which provides detailed insight into the bases at different positions on the zoomable sequence axis. Furthermore, the data points in the plots are clickable: upon a click, the corresponding position in the consensus sequence below the plot is highlighted. To support off-line usage and more in-depth analysis by the user, the error reports, as well as the calculated consensus sequences, can be downloaded. These reports include a separate file for each reference sequence and a report for the total error rates.
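The idea behind the substitution peaks and the suggested consensus base (such as the IUPAC code R in Figure 1A) can be illustrated with a simplified Python sketch. It assumes gap-free reads that all start at the first reference position, which real alignments do not guarantee. The function name `substitution_profile` and the `min_fraction` threshold are hypothetical; the rule the server uses to call a consensus base is not described here.

```python
from collections import Counter

# Sets of observed bases mapped to their IUPAC ambiguity code.
IUPAC_CODE = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def substitution_profile(reference, reads, min_fraction=0.2):
    """Per-position (substitution frequency, consensus base) pairs.

    A base contributes to the consensus if at least `min_fraction` of the
    reads covering that position carry it (hypothetical threshold).
    """
    profile = []
    for pos, ref_base in enumerate(reference):
        column = Counter(read[pos] for read in reads if pos < len(read))
        depth = sum(column.values())
        if depth == 0:
            profile.append((0.0, ref_base))
            continue
        sub_freq = (depth - column[ref_base]) / depth
        frequent = frozenset(
            base for base, n in column.items() if n / depth >= min_fraction
        )
        profile.append((sub_freq, IUPAC_CODE.get(frequent, "N")))
    return profile


# Toy data: 4 of 10 reads carry an A instead of G at the third position.
toy_reads = ["ACGTA"] * 6 + ["ACATA"] * 4
for position, (freq, base) in enumerate(
    substitution_profile("ACGTA", toy_reads), start=1
):
    print(position, freq, base)
```

A substitution frequency far above the expected sequencing error rate at a single position, as at the third position of this toy example, is the kind of signal that points to a variant sequence rather than to sequencing error.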

CONCLUSION

The NGS-eval server provides a user-friendly way to inspect NGS datasets obtained from the sequencing of genetic markers in microbial communities. The error calculation functionality enables the evaluation of the overall sequencing quality and can further be used to assess the outcome of NGS data processing pipelines. The interactive plots in NGS-eval quickly illustrate the read coordinates where the errors occur. A high frequency of errors at specific positions can be useful for detecting novel (common) sequence variants and for identifying the differences between the strains that are present in the sample and those that are used as reference sequences.

40 in total

1. Search and clustering orders of magnitude faster than BLAST.

Authors: Robert C Edgar
Journal: Bioinformatics Date: 2010-08-12 Impact factor: 6.937

2. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample.

Authors: J Gregory Caporaso; Christian L Lauber; William A Walters; Donna Berg-Lyons; Catherine A Lozupone; Peter J Turnbaugh; Noah Fierer; Rob Knight
Journal: Proc Natl Acad Sci U S A Date: 2010-06-03 Impact factor: 11.205

3. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons.

Authors: Brian J Haas; Dirk Gevers; Ashlee M Earl; Mike Feldgarden; Doyle V Ward; Georgia Giannoukos; Dawn Ciulla; Diana Tabbaa; Sarah K Highlander; Erica Sodergren; Barbara Methé; Todd Z DeSantis; Joseph F Petrosino; Rob Knight; Bruce W Birren
Journal: Genome Res Date: 2011-01-06 Impact factor: 9.043

4. Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing.

Authors: André Gilles; Emese Meglécz; Nicolas Pech; Stéphanie Ferreira; Thibaut Malausa; Jean-François Martin
Journal: BMC Genomics Date: 2011-05-19 Impact factor: 3.969

5. Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions.

Authors: Jens Reeder; Rob Knight
Journal: Nat Methods Date: 2010-09 Impact factor: 28.547

6. QIIME allows analysis of high-throughput community sequencing data.

Authors: J Gregory Caporaso; Justin Kuczynski; Jesse Stombaugh; Kyle Bittinger; Frederic D Bushman; Elizabeth K Costello; Noah Fierer; Antonio Gonzalez Peña; Julia K Goodrich; Jeffrey I Gordon; Gavin A Huttley; Scott T Kelley; Dan Knights; Jeremy E Koenig; Ruth E Ley; Catherine A Lozupone; Daniel McDonald; Brian D Muegge; Meg Pirrung; Jens Reeder; Joel R Sevinsky; Peter J Turnbaugh; William A Walters; Jeremy Widmann; Tanya Yatsunenko; Jesse Zaneveld; Rob Knight
Journal: Nat Methods Date: 2010-04-11 Impact factor: 28.547

7. Sampling and pyrosequencing methods for characterizing bacterial communities in the human gut using 16S sequence tags.

Authors: Gary D Wu; James D Lewis; Christian Hoffmann; Ying-Yu Chen; Rob Knight; Kyle Bittinger; Jennifer Hwang; Jun Chen; Ronald Berkowsky; Lisa Nessel; Hongzhe Li; Frederic D Bushman
Journal: BMC Microbiol Date: 2010-07-30 Impact factor: 3.605

8. Removing noise from pyrosequenced amplicons.

Authors: Christopher Quince; Anders Lanzen; Russell J Davenport; Peter J Turnbaugh
Journal: BMC Bioinformatics Date: 2011-01-28 Impact factor: 3.169

9. UCHIME improves sensitivity and speed of chimera detection.

Authors: Robert C Edgar; Brian J Haas; Jose C Clemente; Christopher Quince; Rob Knight
Journal: Bioinformatics Date: 2011-06-23 Impact factor: 6.937

10. A core gut microbiome in obese and lean twins.

Authors: Peter J Turnbaugh; Micah Hamady; Tanya Yatsunenko; Brandi L Cantarel; Alexis Duncan; Ruth E Ley; Mitchell L Sogin; William J Jones; Bruce A Roe; Jason P Affourtit; Michael Egholm; Bernard Henrissat; Andrew C Heath; Rob Knight; Jeffrey I Gordon
Journal: Nature Date: 2008-11-30 Impact factor: 49.962
