Quality control and preprocessing of metagenomic datasets.

Abstract

SUMMARY: Here, we present PRINSEQ for easy and rapid quality control and data preprocessing of genomic and metagenomic datasets. Summary statistics of FASTA (and QUAL) or FASTQ files are generated in tabular and graphical form and sequences can be filtered, reformatted and trimmed by a variety of options to improve downstream analysis.
AVAILABILITY AND IMPLEMENTATION: This open-source application was implemented in Perl and can be used as a stand alone version or accessed online through a user-friendly web interface. The source code, user help and additional information are available at http://prinseq.sourceforge.net/.

Year: 2011 PMID: 21278185 PMCID: PMC3051327 DOI: 10.1093/bioinformatics/btr026

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

High-throughput sequencing has revolutionized microbiology and accelerated genomic and metagenomic analyses; however, downstream sequence analysis is compromised by low-quality sequences, sequence artifacts and sequence contamination, eventually leading to misassembly and erroneous conclusions. These problems necessitate better tools for quality control and preprocessing of all sequence datasets. For most next-generation sequence datasets, the quality control should include the investigation of length, GC content, quality score and sequence complexity distributions; sequence duplication; contamination; artifacts; and number of ambiguous bases. In the preprocessing step, the sequence ends should be trimmed and unwanted sequences should be filtered. Here, we describe an application able to provide graphical guidance and to perform filtering, reformatting and trimming on FASTA (and QUAL) or FASTQ files. The program is publicly available through a user-friendly web interface and as a stand alone version. The web interface allows online analysis and data export for subsequent analysis.

2 METHODS

2.1 Sequence complexity

The sequence complexity is evaluated as the mean of complexity values using a window of size 64 and a step size of 32. There are two types of sequence complexity measures implemented in PRINSEQ. Both use overlapping nucleotide triplets as words and are scaled to a maximum value of 100. The first is an adaptation of the DUST algorithm (Morgulis et al., 2006) used as BLAST search preprocessing for masking low complexity regions:

$$C_D = \sum_{i=1}^{k} \frac{n_i (n_i - 1)(w - 2)\, s}{2\,(l - 1)\, l}$$

where $k = 4^3$ is the alphabet size, $w$ is the window size, $n_i$ is the number of words $i$ in a window, $l \le 62$ is the number of possible words in a window of size 64 and $s = 100/31$ is the scaling factor. The second method evaluates the block-entropies of words using the Shannon–Wiener method:

$$C_E = -\sum_{i=1}^{k} \frac{n_i}{l} \log_k\!\left(\frac{n_i}{l}\right)$$

where $n_i$ is the number of words $i$ in a window of size $w$, $l$ is the number of possible words in a window and $k$ is the alphabet size. For windows of size $w < 66$, $k = l$ and otherwise $k = 4^3$.
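The windowed scoring described above can be sketched in a few lines of Python. This is an illustrative sketch, not the PRINSEQ source: the function names, the handling of a shorter final window, and the per-window formulas are assumptions chosen to agree with the stated definitions (triplet words, window 64, step 32, DUST scaling factor 100/31, entropy with log base k). The factor of 100 applied to the entropy value is likewise an assumption taken from the statement that both measures are scaled to a maximum of 100.

```python
import math
from collections import Counter

WINDOW = 64         # window size w
STEP = 32           # step size
ALPHABET = 4 ** 3   # number of distinct nucleotide triplets (k)


def windows(seq):
    """Yield windows of WINDOW bases every STEP bases; the last may be shorter."""
    start = 0
    while True:
        yield seq[start:start + WINDOW]
        if start + WINDOW >= len(seq):
            break
        start += STEP


def triplet_counts(window):
    """Count the overlapping triplets (words) in one window."""
    return Counter(window[i:i + 3] for i in range(len(window) - 2))


def dust_window(window):
    """DUST-style score of one window, before the final 100/31 scaling."""
    counts = triplet_counts(window)
    l = sum(counts.values())            # number of words in this window
    if l < 2:
        return 0.0
    raw = sum(n * (n - 1) / 2 for n in counts.values()) / (l - 1)
    return raw * (WINDOW - 2) / l       # rescale windows shorter than WINDOW


def entropy_window(window):
    """Block entropy of the triplets in one window, in the range 0..1."""
    counts = triplet_counts(window)
    l = sum(counts.values())
    if l < 2:
        return 0.0
    k = l if len(window) < 66 else ALPHABET
    return -sum((n / l) * math.log(n / l, k) for n in counts.values())


def complexity(seq, method="dust"):
    """Mean per-window complexity, scaled to a maximum of 100."""
    seq = seq.upper()
    if method == "dust":
        score, scale = dust_window, 100 / 31
    else:
        score, scale = entropy_window, 100   # assumed scaling for entropy
    values = [score(w) for w in windows(seq)]
    return scale * sum(values) / len(values)
```

Note that the two measures point in opposite directions: a homopolymer such as 64 A's gets the maximum DUST score and an entropy of zero, so a low-complexity filter would use a maximum threshold for DUST and a minimum threshold for entropy.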

2.2 Dinucleotide odds ratio

The basic version of the dinucleotide odds ratio calculation (Burge et al., 1992) is used without taking into account the occurrence of ambiguous characters such as N. In addition, the commonly used version that accounts for the complementary antiparallel structure of double-stranded DNA introduces an additional dinucleotide by simply concatenating the sequence with its reverse complement. To account for this, the odds ratios are calculated using the number $n_X$ of nucleotide X and the number $n_{XY}$ of dinucleotide XY only for nucleotides A, C, G and T on the forward strand:

$$\rho_{XY} = \frac{2\,(n_{XY} + n_{Y'X'})\, m^2}{d\,(n_X + n_{X'})(n_Y + n_{Y'})}$$

where X′ is the complement of nucleotide X, $m$ is the number of valid nucleotides and $d$ is the number of valid dinucleotides in the sequence.
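As a rough illustration of the strand-symmetric calculation, the sketch below counts nucleotides and dinucleotides on the forward strand only, skips anything that is not A, C, G or T, and combines each count with that of its complement (or reverse complement for dinucleotides). The function name and the choice to return NaN when a nucleotide and its complement are both absent are assumptions made for this example, not details taken from PRINSEQ.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def dinucleotide_odds_ratios(seq):
    """Strand-symmetric odds ratios for all 16 dinucleotides."""
    seq = seq.upper()
    # Count only valid nucleotides and dinucleotides on the forward strand.
    mono = Counter(b for b in seq if b in COMPLEMENT)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in COMPLEMENT and seq[i + 1] in COMPLEMENT
    )
    m = sum(mono.values())   # number of valid nucleotides
    d = sum(di.values())     # number of valid dinucleotides
    if d == 0:
        return {}

    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            xc, yc = COMPLEMENT[x], COMPLEMENT[y]
            # The reverse complement of XY is Y'X'.
            f_xy = (di[x + y] + di[yc + xc]) / (2 * d)
            f_x = (mono[x] + mono[xc]) / (2 * m)
            f_y = (mono[y] + mono[yc]) / (2 * m)
            ratios[x + y] = f_xy / (f_x * f_y) if f_x and f_y else float("nan")
    return ratios
```

By construction a dinucleotide and its reverse complement receive the same ratio (for example AC and GT), so only 10 of the 16 values are distinct.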

2.3 Tag sequence probability

Tag sequences are artifacts at the sequence ends such as adapter or barcode sequences. A k-mer approach is used to calculate the probability of a tag sequence at the 5′- or 3′-end. The k-mers are aligned and shifted before calculating the frequencies, as described by Schmieder et al. (2010), to account for sequencing limitations.
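The idea behind the k-mer approach can be shown with a deliberately simplified sketch: if one k-mer dominates the first (or last) k bases across reads, a tag is likely present. The version below only measures the share of the most common end k-mer. It omits the alignment and shifting step mentioned above, and the function name, the default k = 5 and the return format are assumptions for illustration.

```python
from collections import Counter


def end_kmer_frequency(reads, k=5, end="5prime"):
    """Return the most common k-mer at one read end and its share of reads.

    Simplified: no alignment or shifting of k-mers is performed.
    """
    kmers = Counter()
    for read in reads:
        if len(read) < k:
            continue  # too short to contribute a k-mer
        kmers[read[:k] if end == "5prime" else read[-k:]] += 1
    total = sum(kmers.values())
    if total == 0:
        return None, 0.0
    kmer, count = kmers.most_common(1)[0]
    return kmer, count / total
```

A share close to 1 at either end suggests that the reads carry a common tag, whereas untagged reads from a diverse sample would be expected to give a much lower value.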

2.4 Sequence duplication

Sequence replication can occur during different steps of the sequencing protocol, and can therefore generate artificial duplicates (Gomez-Alvarez et al., 2009). Here, duplicates are categorized into the following groups: (i) exact duplicate, (ii) 5′ duplicate (sequence matches the 5′-end of a longer sequence), (iii) 3′ duplicate, (iv) exact duplicate with the reverse complement of another sequence and (v) 5′/3′ duplicate with the reverse complement of another sequence. The duplicates are identified independently by sorting and prefix/suffix matching of the sequences.
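Sorting makes prefix matching cheap: in a sorted list of distinct sequences, any sequence that is a prefix of another one is immediately followed by a sequence that starts with it. The sketch below uses that property for categories (i) to (iii), applies it to reversed strings for 3′ duplicates, and uses a set lookup for category (iv). Category (v) is left out for brevity. The function names, label strings and precedence order are assumptions for this example.

```python
REVCOMP_TABLE = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(REVCOMP_TABLE)[::-1]


def proper_prefixes(strings):
    """Return the strings that are a prefix of another string in the set."""
    ordered = sorted(strings)
    # A prefix sorts directly before the block of strings that extend it.
    return {a for a, b in zip(ordered, ordered[1:]) if b.startswith(a)}


def classify_duplicates(seqs):
    """Label each sequence as a duplicate type, or None if it is kept."""
    distinct = set(seqs)
    five_prime = proper_prefixes(distinct)
    # Suffix matching is prefix matching on the reversed sequences.
    three_prime = {s[::-1] for s in proper_prefixes({s[::-1] for s in distinct})}

    seen = set()
    labels = []
    for s in seqs:
        if s in seen:
            label = "exact"
        elif s in five_prime:
            label = "5prime"
        elif s in three_prime:
            label = "3prime"
        elif reverse_complement(s) in seen:
            label = "revcomp_exact"   # only the later of the pair is flagged
        else:
            label = None
        seen.add(s)
        labels.append(label)
    return labels
```

For example, `classify_duplicates(["ACGTAC", "ACGTAC", "ACG", "TAC", "GTACGT", "GGGG"])` labels the second read as an exact duplicate, `ACG` as a 5′ duplicate, `TAC` as a 3′ duplicate and `GTACGT` as the reverse complement of the first read.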

3 FEATURES

3.1 Quality control

The summary statistics provided include the number of sequences and number of bases in the FASTA or FASTQ file, tables with minimum, maximum, range, mean, standard deviation and mode for read length and GC content, and charts for read length distribution, GC content distribution, quality scores, sequence complexity, sequence duplicates, occurrence of Ns and poly-A/T tails. Additionally, the base frequencies at the sequence ends and the probability of tag sequences are provided to the user. The dinucleotide odds ratios can be used to identify possible contamination (Willner et al., 2009) and the dinucleotide relative abundance profile can be used to compare the user metagenome to other microbial or viral metagenomes using principal component plots. The assembly measures such as N50 or N90 are helpful for datasets containing contigs.
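For readers unfamiliar with the assembly measures, N50 is the contig length at which the contigs of that length or longer contain at least half of all bases; N90 uses 90% instead. A minimal sketch follows (the function name and the integer `percent` parameter are choices made for this example):

```python
def nxx(lengths, percent=50):
    """Return the Nxx value (N50 by default) for a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if running * 100 >= total * percent:
            return length
    return 0
```

With contig lengths 2, 3, 4, 5 and 6 (20 bases in total), `nxx(lengths)` gives 5 and `nxx(lengths, percent=90)` gives 3.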

3.2 Sequence filtering

Sequences can be filtered by their length, quality scores, GC content, number or percentage of ambiguous base N, non-IUPAC characters for nucleic acids, number of sequences, sequence duplicates, sequence complexity (for example, to remove simple repeat sequences such as ATATATATAT), and custom filters defined by the user given a predefined grammar.
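A few of these filters can be combined into a single predicate, as in the sketch below. It covers only length, GC content and the number of Ns. The function and parameter names are hypothetical and do not correspond to PRINSEQ option names.

```python
def gc_percent(seq):
    """GC content in percent, computed over A, C, G and T only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    valid = gc + seq.count("A") + seq.count("T")
    return 100 * gc / valid if valid else 0.0


def passes_filters(seq, min_len=1, max_len=None, min_gc=0, max_gc=100, max_n=None):
    """Return True if the sequence satisfies every active filter."""
    if len(seq) < min_len:
        return False
    if max_len is not None and len(seq) > max_len:
        return False
    if not min_gc <= gc_percent(seq) <= max_gc:
        return False
    if max_n is not None and seq.upper().count("N") > max_n:
        return False
    return True


reads = ["ACGTACGTAC", "ACG", "ACGTNNNNAC", "GGGGCCCCGG"]
kept = [r for r in reads if passes_filters(r, min_len=5, max_gc=80, max_n=1)]
```

Here only the first read is kept: the second is too short, the third has too many Ns and the fourth exceeds the GC limit.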

3.3 Sequence trimming

The trimming options allow users to trim sequences to a specific length, trim bases from the 5′- and 3′-end, trim poly-A/T tails and trim by quality scores with user-defined options. The trimming of sequences can generate new sequence duplicates and therefore, trimming is performed before most filtering steps.
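Two of these operations are easy to illustrate. The sketch below trims a poly-A or poly-T run from the 3′-end and removes low-quality bases from the 3′-end until a base meets the threshold. Both are simplified stand-ins for the user-defined options mentioned above. The function names, the default minimum tail length and the default quality threshold are assumptions.

```python
import re


def trim_poly_tail(seq, min_len=5):
    """Remove a run of at least min_len A's or T's from the 3' end."""
    pattern = r"(A{%d,}|T{%d,})$" % (min_len, min_len)
    match = re.search(pattern, seq)
    return seq[:match.start()] if match else seq


def trim_quality_right(seq, quals, threshold=20):
    """Drop 3' bases while their quality score is below the threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]
```

Because two reads that differ only in their tails become identical after trimming, running a duplicate check after this step (as the text notes) catches duplicates that were not visible before.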

3.4 Sequence formatting

The sequences can be modified to change them to upper or lower case (for example, to remove soft-masking), convert between RNA and DNA sequences, change the line width in FASTA and QUAL files, remove sequence headers or rename sequence identifiers. Additionally, FASTQ inputs can be converted into FASTA and QUAL format, and vice versa.
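The FASTQ to FASTA and QUAL conversion amounts to splitting each record into a sequence entry and a list of numeric quality scores. The sketch below assumes well-formed four-line FASTQ records and Phred+33 quality encoding. Both are assumptions of this example, and the function name is hypothetical.

```python
def fastq_to_fasta_qual(fastq_text, offset=33):
    """Convert four-line FASTQ records into FASTA and QUAL text."""
    lines = fastq_text.strip().splitlines()
    fasta, qual = [], []
    for i in range(0, len(lines), 4):
        header, seq, _, quality = lines[i:i + 4]
        name = header[1:]                      # drop the leading '@'
        fasta.append(f">{name}\n{seq}")
        # Decode each quality character into its numeric score.
        scores = " ".join(str(ord(c) - offset) for c in quality)
        qual.append(f">{name}\n{scores}")
    return "\n".join(fasta) + "\n", "\n".join(qual) + "\n"


fasta_text, qual_text = fastq_to_fasta_qual("@r1\nACGT\n+\nIIII\n")
```

For this one-record input, `fasta_text` is `">r1\nACGT\n"` and `qual_text` is `">r1\n40 40 40 40\n"`.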

3.5 Web interface

The web version includes sample datasets to compare and test the program. All graphics are generated using the Cairo graphics library (http://cairographics.org/). The web interface allows the submission of compressed FASTA (and QUAL) or FASTQ files to reduce the time of data upload. Currently, ZIP, GZIP and BZIP2 compression algorithms are supported allowing direct processing of compressed data from the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra). The filter, trim and reformat options can be exported and imported for similar processing of different datasets. Additionally, the web interface provides predefined option sets to perform different types of preprocessing. Data uploaded using the web interface can be shared or accessed at a later point using unique data identifiers.

4 BRIEF SURVEY OF ALTERNATIVE PROGRAMS

There are different applications that provide quality control and preprocessing features for sequence datasets. PRINSEQ was compared with three other available programs, each offering various additional features and functions. Although the programs have been designed to process short read data, they are able to process longer read sequences. SolexaQA (Cox et al., 2010) is software written in Perl that allows investigation and trimming of sequences by their base quality scores. The software does not provide additional summary statistics or preprocessing features and requires a working installation of R and Perl modules such as GD to produce graphical outputs. FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) is software written in Java that provides summary statistics for FASTQ files. In its current version, FastQC does not provide data preprocessing features. The FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) is a collection of command line tools that provide preprocessing features and summaries for quality scores and nucleotide distributions. The tools were recently integrated into the Galaxy platform (Blankenberg et al., 2010). All of these programs are still in active development and new functions will undoubtedly be added over time.

5 CONCLUSION

PRINSEQ allows scientists to efficiently check and prepare their datasets prior to downstream analysis. The web interface is simple and user-friendly, and the stand alone version allows offline analysis and integration into existing data processing pipelines. The results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced and whether the sample contains any contamination from DNA preparation or host. The tool provides a computational resource able to handle the amount of data that next-generation sequencers are capable of generating and can place the process more within reach of the average research lab.

REFERENCES

1. A fast and symmetric DUST implementation to mask low-complexity DNA sequences.

Authors: Aleksandr Morgulis; E Michael Gertz; Alejandro A Schäffer; Richa Agarwala
Journal: J Comput Biol Date: 2006-06 Impact factor: 1.479

2. Systematic artifacts in metagenomes from complex microbial communities.

Authors: Vicente Gomez-Alvarez; Tracy K Teal; Thomas M Schmidt
Journal: ISME J Date: 2009-07-09 Impact factor: 10.302

3. Manipulation of FASTQ data with Galaxy.

Authors: Daniel Blankenberg; Assaf Gordon; Gregory Von Kuster; Nathan Coraor; James Taylor; Anton Nekrutenko
Journal: Bioinformatics Date: 2010-06-18 Impact factor: 6.937

4. TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets.

Authors: Robert Schmieder; Yan Wei Lim; Forest Rohwer; Robert Edwards
Journal: BMC Bioinformatics Date: 2010-06-23 Impact factor: 3.169

5. Over- and under-representation of short oligonucleotides in DNA sequences.

Authors: C Burge; A M Campbell; S Karlin
Journal: Proc Natl Acad Sci U S A Date: 1992-02-15 Impact factor: 11.205

6. Metagenomic signatures of 86 microbial and viral metagenomes.

Authors: Dana Willner; Rebecca Vega Thurber; Forest Rohwer
Journal: Environ Microbiol Date: 2009-03-18 Impact factor: 5.491

7. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data.

Authors: Murray P Cox; Daniel A Peterson; Patrick J Biggs
Journal: BMC Bioinformatics Date: 2010-09-27 Impact factor: 3.169
