MeCorS: Metagenome-enabled error correction of single cell sequencing reads.

Andreas Bremges¹, Esther Singer², Tanja Woyke², Alexander Sczyrba¹.

Abstract

We present a new tool, MeCorS, to correct chimeric reads and sequencing errors in Illumina data generated from single amplified genomes (SAGs). It uses sequence information derived from accompanying metagenome sequencing to accurately correct errors in SAG reads, even from ultra-low coverage regions. In evaluations on real data, we show that MeCorS outperforms BayesHammer, the most widely used state-of-the-art approach. MeCorS performs particularly well in correcting chimeric reads, which greatly improves both accuracy and contiguity of de novo SAG assemblies.
AVAILABILITY AND IMPLEMENTATION: https://github.com/metagenomics/MeCorS
CONTACT: abremges@cebitec.uni-bielefeld.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2016 PMID: 27153586 PMCID: PMC4937190 DOI: 10.1093/bioinformatics/btw144

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The vast majority of microbial species found in nature has yet to be grown in pure culture, turning metagenomics and, more recently, single cell genomics into indispensable methods to access the genetic makeup of microbial dark matter (Brown et al.; Rinke et al.). Frequently, single amplified genomes (SAGs) and shotgun metagenomes are generated from the same environmental sample and are methodologically combined, e.g. to validate metagenome bins with single cells or to improve the SAG’s assembly contiguity (Campbell et al.; Hess et al.). However, a single cell’s DNA needs to be amplified prior to sequencing, usually by multiple displacement amplification (MDA; Lasken, 2007). This amplification is heavily biased, leading to uneven sequencing depth, including ultra-low coverage regions where essentially no informed error correction is possible (Chitsaz et al.; Supplementary Fig. S1). Moreover, chimera formation occurs roughly once per 10 kbp during MDA, further complicating SAG assembly (Nurk et al.; Rodrigue et al.). While an array of error correction tools exists for a variety of use cases (Laehnemann et al.), only one tool was specifically designed to correct SAG data: hammer (Medvedev et al.), recently refined to BayesHammer (Nikolenko et al.).

We propose a metagenome-enabled error correction strategy for single cell sequencing reads. Our method takes advantage of largely unbiased metagenomic coverage, enabling it to correct positions with too low a coverage for SAG-only error correction, and to correct chimeric SAG reads through non-chimeric metagenome reads.

2 Methods

We correct potential errors using an algorithm similar to solving the spectral alignment problem (Pevzner et al.). Given a set of trusted k-mers, we use a heuristic method to find a sequence with minimal corrections such that each k-mer on the corrected sequence is trusted. Using a k-mer size of 31, we consider a k-mer trusted if it occurs at least twice in the accompanying metagenome. This coverage threshold was determined empirically to work with most datasets (Supplementary Fig. S2).

Our correction algorithm was inspired by fermi (Li, 2012) and BFC (Li, 2015), but we do not act on the assumption of uniform sequencing coverage, thereby accounting for the tremendous variation of coverage across the SAG. Instead, we exploit metagenomic sequence information to correct errors resulting from amplification and sequencing, as well as chimeras, even in ultra-low coverage regions of the SAG. The non-chimeric nature of the metagenome reads enables an implicit and thorough write-through correction of chimeric SAG reads.

MeCorS works in three phases:

1. MeCorS collects all 31-mers (and their reverse complements) occurring in the SAG reads. It uses this information to initialize a hash table with the 31-mers being valid keys.
2. MeCorS scans the accompanying metagenomic reads. For each stored 31-mer, it counts the occurrence of the next (i.e. the 32nd) base in the metagenome and stores the totals in the hash table. This step is largely I/O bound and dominates MeCorS’s runtime.
3. MeCorS processes each SAG read by using the 31-mer hash table to check if the 32nd base is sufficiently supported in the metagenome. Untrusted 32nd bases are replaced with the most frequent and trusted 32nd bases from the metagenome.
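As a rough illustration, the three phases described above can be sketched in Python. This is a minimal sketch under simplifying assumptions, not the MeCorS implementation: all function names (`collect_kmers`, `count_next_bases`, `correct_read`, `correct_sag_reads`) are hypothetical, reads are plain strings, a Python dict stands in for the hash table, and correction runs only left to right, so the first k bases of a read are never changed.

```python
# Hypothetical sketch of metagenome-enabled k-mer correction (not MeCorS itself).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (non-ACGT characters are kept as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def collect_kmers(sag_reads, k):
    """Phase 1: every k-mer of the SAG reads (both strands) becomes a valid key."""
    table = {}
    for read in sag_reads:
        for strand in (read, revcomp(read)):
            for i in range(len(strand) - k + 1):
                table.setdefault(strand[i:i + k], {})  # value: next-base counts
    return table


def count_next_bases(table, metagenome_reads, k):
    """Phase 2: for each stored k-mer, count the base that follows it in the metagenome."""
    for read in metagenome_reads:
        for strand in (read, revcomp(read)):
            for i in range(len(strand) - k):
                counts = table.get(strand[i:i + k])
                if counts is not None:
                    nxt = strand[i + k]
                    counts[nxt] = counts.get(nxt, 0) + 1


def correct_read(read, table, k, min_count=2):
    """Phase 3: replace an unsupported (k+1)-th base by the most frequent trusted one."""
    bases = list(read)
    for i in range(len(bases) - k):
        # The k-mer is taken from the partially corrected read; if it is not a
        # stored key (e.g. it contains a freshly corrected base), nothing happens.
        counts = table.get("".join(bases[i:i + k]), {})
        if counts.get(bases[i + k], 0) >= min_count:
            continue  # next base is sufficiently supported in the metagenome
        best, n = max(counts.items(), key=lambda kv: kv[1], default=(None, 0))
        if n >= min_count:
            bases[i + k] = best
    return "".join(bases)


def correct_sag_reads(sag_reads, metagenome_reads, k=31, min_count=2):
    """Run the three phases in order; k=31 and min_count=2 follow the text."""
    table = collect_kmers(sag_reads, k)
    count_next_bases(table, metagenome_reads, k)
    return [correct_read(r, table, k, min_count) for r in sag_reads]
```

With a toy k (say 5) and a few short strings this runs instantly. On real data the phase 2 scan over the metagenome would dominate, in line with the I/O-bound behaviour noted above.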

3 Results and discussion

As a realistic benchmark, we used eight Escherichia coli K12-MG1655 SAGs from Clingenpeel et al., a strain for which the complete genome sequence is available (Supplementary Table S1). A concomitant in vitro mock metagenome consisting of 26 microbial species, including E. coli K12-MG1655, was sequenced on Illumina’s HiSeq platform (Bowers et al.). Based on metagenome read mapping, we estimate the relative abundance of E. coli at 0.15%, corresponding to a mean per-base coverage of only 20.7× (Supplementary Table S2).

We evaluated MeCorS along with BayesHammer (Nikolenko et al.), a widely used error correction tool for SAG data. Our method corrects more errors than BayesHammer, producing a significantly higher fraction of better and perfect reads after correction (Table 1; Supplementary Table S3). In contrast to BayesHammer, MeCorS reduces the number of chimeric SAG reads by one order of magnitude, likely due to the non-chimeric nature of the metagenome reads. MeCorS works well with modern single cell assemblers, most notably reducing the misassembly rate of both IDBA-UD (Peng et al.) and SPAdes (Bankevich et al.) by half, while providing high sequence contiguity (Fig. 1). Poorly amplified SAGs in particular benefit from metagenome-enabled error correction, yielding improved assembly accuracy and contiguity (Supplementary Tables S4 and S5).
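The mean per-base coverage quoted above is, in essence, the number of metagenome bases mapped to the genome divided by the genome length. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the genome length constant is an assumption (the commonly cited E. coli K-12 MG1655 reference length), not a value given in the text.

```python
# Assumption: reference genome length in bp; not stated in the text.
ECOLI_K12_MG1655_LENGTH = 4_641_652


def mean_coverage(mapped_bases, genome_length):
    """Mean per-base coverage = total mapped bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return mapped_bases / genome_length


def bases_for_coverage(coverage, genome_length):
    """Inverse relation: mapped bases needed to reach a given mean coverage."""
    return coverage * genome_length
```

Under the assumed genome length, a mean coverage of 20.7× corresponds to roughly 96 Mbp of metagenome sequence mapped to E. coli.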

Table 1.

Performance of SAG error correction

Program	% perfect	% chimeric	% better	% worse
Raw	22.52 ± 1.07	0.73 ± 0.15	–	–
BayesHammer	80.35 ± 8.77	0.77 ± 0.17	71.66 ± 2.12	0.33 ± 0.06
MeCorS	95.52 ± 0.43	0.06 ± 0.02	75.45 ± 1.11	0.26 ± 0.03

Mean percentage and standard deviation of perfect reads, chimeric reads (i.e. reads with parts mapped to different places), corrected reads becoming better and worse than the raw reads. Evaluation as described in Li (2015); please refer to Supplementary Table S3 for per-SAG metrics, including runtime and memory usage.
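The perfect, better and worse categories can be illustrated with a small sketch. It assumes, for simplicity, that the true sequence of each read is known and that errors are substitutions only, so differences are counted with a Hamming distance. The evaluation referenced above relies on read mapping, and the chimeric category (parts of a read mapping to different places) is not modelled here. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def classify(raw, corrected, truth):
    """Compare a corrected read with its raw version, relative to the truth."""
    before = hamming(raw, truth)
    after = hamming(corrected, truth)
    return {
        "perfect": after == 0,      # no remaining differences
        "better": after < before,   # fewer differences than the raw read
        "worse": after > before,    # more differences than the raw read
    }


def summarize(triples):
    """Percentage of reads per category over (raw, corrected, truth) triples."""
    results = [classify(*t) for t in triples]
    n = len(results)
    return {key: 100.0 * sum(r[key] for r in results) / n
            for key in ("perfect", "better", "worse")}
```

Note that a read can be both better and perfect, so the columns of such a summary need not add up to 100%.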

Fig. 1.

Effect on SAG assembly. We corrected the raw reads (R) with BayesHammer (B; Nikolenko et al.) or MeCorS (M). We then used IDBA-UD (Peng et al.) and SPAdes (Bankevich et al.) to assemble the SAGs. Brackets indicate all statistically significant changes (P < 0.05; two-tailed Wilcoxon signed-rank test). Quality assessment with QUAST (Gurevich et al.); Supplementary Tables S4 and S5 contain in-depth assembly statistics.

We note that such a hybrid error correction of SAG data may result in miscorrection of rare variants. If the captured cell contains a variant that is rare or absent in the corresponding metagenome, correction will be biased towards the most abundant variant in the metagenome sequence. If strain resolution is desired, we suggest polishing the SAG assembly using the uncorrected raw data. In all other cases, SAG assemblies benefit directly from metagenome-enabled error correction via MeCorS.

Uneven genome coverage and chimera formation present the biggest challenges in the downstream processing and analysis of SAG datasets to date. We propose MeCorS for the correction of SAG reads when complementary metagenome datasets are available. Error and chimera correction is essential for improved SAG assembly and demonstrates a powerful application of combined shotgun metagenome and single cell sequencing.

Funding

A.B. is supported by a fellowship from the CLIB Graduate Cluster Industrial Biotechnology and is partially funded by the International DFG Research Training Group GRK 1906/1. The work conducted by the U.S. Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported under Contract No. DE-AC02-05CH11231.

Conflict of Interest: none declared.

19 in total

1. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

Authors: Yu Peng; Henry C M Leung; S M Yiu; Francis Y L Chin
Journal: Bioinformatics Date: 2012-04-11 Impact factor: 6.937

2. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly.

Authors: Heng Li
Journal: Bioinformatics Date: 2012-05-07 Impact factor: 6.937

3. Assembling single-cell genomes and mini-metagenomes from chimeric MDA products.

Authors: Sergey Nurk; Anton Bankevich; Dmitry Antipov; Alexey A Gurevich; Anton Korobeynikov; Alla Lapidus; Andrey D Prjibelski; Alexey Pyshkin; Alexander Sirotkin; Yakov Sirotkin; Ramunas Stepanauskas; Scott R Clingenpeel; Tanja Woyke; Jeffrey S McLean; Roger Lasken; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal: J Comput Biol Date: 2013-10 Impact factor: 1.479

4. Single-cell genomic sequencing using Multiple Displacement Amplification. (Review)

Authors: Roger S Lasken
Journal: Curr Opin Microbiol Date: 2007-10-17 Impact factor: 7.934

5. Unusual biology across a group comprising more than 15% of domain Bacteria.

Authors: Christopher T Brown; Laura A Hug; Brian C Thomas; Itai Sharon; Cindy J Castelle; Andrea Singh; Michael J Wilkins; Kelly C Wrighton; Kenneth H Williams; Jillian F Banfield
Journal: Nature Date: 2015-06-15 Impact factor: 49.962

6. BFC: correcting Illumina sequencing errors.

Authors: Heng Li
Journal: Bioinformatics Date: 2015-05-06 Impact factor: 6.937

7. BayesHammer: Bayesian clustering for error correction in single-cell sequencing.

Authors: Sergey I Nikolenko; Anton I Korobeynikov; Max A Alekseyev
Journal: BMC Genomics Date: 2013-01-21 Impact factor: 3.969

8. Reconstructing each cell's genome within complex microbial communities-dream or reality?

Authors: Scott Clingenpeel; Alicia Clum; Patrick Schwientek; Christian Rinke; Tanja Woyke
Journal: Front Microbiol Date: 2015-01-08 Impact factor: 5.640

9. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction.

Authors: David Laehnemann; Arndt Borkhardt; Alice Carolyn McHardy
Journal: Brief Bioinform Date: 2015-05-29 Impact factor: 11.622

10. Whole genome amplification and de novo assembly of single bacterial cells.

Authors: Sébastien Rodrigue; Rex R Malmstrom; Aaron M Berlin; Bruce W Birren; Matthew R Henn; Sallie W Chisholm
Journal: PLoS One Date: 2009-09-02 Impact factor: 3.240
