Qian Zhou, Xiaoquan Su, Gongchao Jing, Kang Ning.
Abstract
Next-generation sequencing (NGS) technology has revolutionized metagenomic research. However, NGS data usually contain sequencing artifacts such as low-quality reads and contaminating reads, which can significantly compromise downstream analysis. Many quality control (QC) tools have been proposed, but few of them have been verified to be suitable or efficient for metagenomic data, which are composed of multiple genomes and are more complex than other kinds of NGS data. Here we present a metagenomic data QC method named Meta-QC-Chain. Meta-QC-Chain combines multiple QC functions: technical tests describe the status of the input data and identify potential errors, quality trimming filters out bases and reads of poor sequencing quality, and contamination screening identifies higher eukaryotic species, which are considered contamination in metagenomic data. Most computing processes are optimized with parallel programming. Testing on an 8-GB real dataset showed that Meta-QC-Chain trimmed low sequencing-quality reads and contaminating reads, and that the whole quality control procedure was completed within 20 min. Therefore, Meta-QC-Chain provides a comprehensive, useful and high-performance QC tool for metagenomic data. Meta-QC-Chain is publicly available for free at: http://computationalbioenergy.org/meta-qc-chain.html.
Keywords: Metagenomic data; Next-generation sequencing; Parallel computing; Quality control
Year: 2014 PMID: 24508279 PMCID: PMC4411374 DOI: 10.1016/j.gpb.2014.01.002
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. The workflow and functions of Meta-QC-Chain for metagenomic data quality control
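Quality trimming, one of the functions in this workflow, can be illustrated with a minimal sketch. This is not Meta-QC-Chain's implementation: it assumes Phred+33 quality encoding and trims only from the 3' end, and the function names `phred_scores` and `trim_read` and the thresholds `min_q` and `min_len` are hypothetical choices made for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into integer Phred scores.

    The Phred+33 offset is an assumption; older Illumina data may use +64.
    """
    return [ord(ch) - offset for ch in quality]


def trim_read(seq, quality, min_q=20, min_len=3):
    """Trim low-quality bases from the 3' end of one read.

    Returns (trimmed_seq, trimmed_quality), or None if the remaining
    read is shorter than min_len. Both thresholds are illustrative and
    are not Meta-QC-Chain defaults.
    """
    scores = phred_scores(quality)
    end = len(scores)
    # Walk back from the 3' end while the base quality is below min_q.
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None  # read is discarded entirely
    return seq[:end], quality[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_read("ACGTAC", "IIII##"))
    print(trim_read("ACGT", "####"))
```

Because each read is processed independently, a step of this kind is a natural candidate for the parallel execution the abstract mentions.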
Figure 2. GC distribution plots generated by Meta-Parallel-QC for two human saliva metagenomic datasets. Shown in the graph is the read GC content distribution of two real human saliva metagenomic datasets, R1 (A) and R2 (B). Detailed information about these two datasets is listed in Table 1.
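A read GC content distribution of the kind plotted in this figure can be computed as in the following sketch. It is a generic illustration, not the tool's code: the helper names `gc_percent` and `gc_histogram` and the 10-point bin width are assumptions, and ambiguous bases such as N are simply excluded from the denominator.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of one read, or None if it has no called bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / called


def gc_histogram(reads, bin_width=10):
    """Count reads per GC bin; keys are the lower edge of each bin.

    bin_width is assumed to divide 100 evenly. Reads with 100% GC are
    placed in the top bin rather than in a bin of their own.
    """
    hist = Counter()
    for seq in reads:
        pct = gc_percent(seq)
        if pct is None:
            continue  # skip reads made up only of ambiguous bases
        start = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        hist[start] += 1
    return dict(sorted(hist.items()))


if __name__ == "__main__":
    print(gc_histogram(["ATAT", "ATGC", "GGCC", "NNNN"]))
```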
Table 1. Summary of the three datasets examined in the current study
| R1 | 19,185,960 | 5.0 | 9,414,926 | 1.2 |
| R2 | 33,134,512 | 8.3 | 20,951,704 | 2.8 |
| S1 | 22,127,714 | 2.2 | 22,127,714 | 2.2 |
Figure 3. Contaminating species identified from the three metagenomic datasets by Meta-QC-Chain. Human was identified as the largest contaminating species in the real sequenced human saliva datasets R1 (A) and R2 (B). Chlorophyta algae species were identified as possible contaminants in the simulated dataset S1 (C). “1 more” or “11 more” indicates additional species identified with a very low proportion of 18S rRNAs, which can be neglected here.
Running time of Meta-QC-Chain and Prinseq on the three datasets
| R1 | 1 min 02 s | 8 min 33 s | 1 min 53 s | 11 min 28 s | 50 min 43 s |
| R2 | 1 min 37 s | 14 min 07 s | 4 min 04 s | 19 min 48 s | 76 min 03 s |
| S1 | 2 min 38 s | 4 min 19 s | 10 min 14 s | 17 min 01 s | 64 min 48 s |
Note: R1 and R2 are the two metagenomic datasets generated from human saliva and sequenced in-house, whereas S1 is a simulated dataset used for testing.
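The running times above can be compared with a short script. The column headers did not survive in this record, so the sketch makes a loudly stated assumption: the first three time columns are individual Meta-QC-Chain steps, the fourth is the Meta-QC-Chain total, and the fifth is Prinseq. The names `to_seconds` and `summarize` are hypothetical helpers.

```python
import re

# Time cells copied from the running-time table. Column meaning is ASSUMED:
# three per-step times, then the Meta-QC-Chain total, then Prinseq.
ROWS = {
    "R1": ["1 min 02 s", "8 min 33 s", "1 min 53 s", "11 min 28 s", "50 min 43 s"],
    "R2": ["1 min 37 s", "14 min 07 s", "4 min 04 s", "19 min 48 s", "76 min 03 s"],
    "S1": ["2 min 38 s", "4 min 19 s", "10 min 14 s", "17 min 01 s", "64 min 48 s"],
}


def to_seconds(text):
    """Convert a string such as '8 min 33 s' into a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*min\s*(\d+)\s*s\s*", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def summarize(rows):
    """Return {dataset: (sum of step columns, listed total, column 5 / column 4)}."""
    summary = {}
    for name, cells in rows.items():
        secs = [to_seconds(cell) for cell in cells]
        step_sum = sum(secs[:3])
        listed_total, other_tool = secs[3], secs[4]
        summary[name] = (step_sum, listed_total, other_tool / listed_total)
    return summary


if __name__ == "__main__":
    for name, (step_sum, total, ratio) in summarize(ROWS).items():
        print(f"{name}: steps={step_sum}s listed_total={total}s ratio={ratio:.2f}")
```

Under this reading, the three step columns add up exactly to the fourth column for R1 and R2 (688 s and 1188 s). For S1 the steps add up to 17 min 11 s against a listed 17 min 01 s; which cell carries the discrepancy cannot be determined from this record.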