Thomas Nordahl Petersen, Oksana Lukjancenko, Martin Christen Frølund Thomsen, Maria Maddalena Sperotto, Ole Lund, Frank Møller Aarestrup, Thomas Sicheritz-Pontén.
Abstract
A growing number of species and gene identification studies rely on next-generation sequence analysis of either single-isolate or metagenomics samples. Several methods are available to perform taxonomic annotation, and a previous metagenomics benchmark study has shown that a vast number of false positive species annotations are a problem unless thresholds or post-processing are applied to differentiate between correct and false annotations. MGmapper is a package that processes raw next-generation sequence data and performs reference-based sequence assignment, followed by a post-processing analysis to produce reliable taxonomy annotation at species- and strain-level resolution. An in vitro bacterial mock community sample comprising 8 genera, 11 species and 12 strains was previously used to benchmark metagenomics classification methods. After applying a post-processing filter, we obtained 100% correct taxonomy assignments at species and genus level. Sensitivity and precision of 75% were obtained for strain-level annotations. A comparison between MGmapper and Kraken at species level shows that MGmapper assigns taxonomy using 84.8% of the sequence reads, compared to 70.5% for Kraken; both methods identified all species with no false positives. Extensive read count statistics are provided in plain text and Excel sheets for both rejected and accepted taxonomy annotations. Custom databases can be used with the command-line version of MGmapper, and the complete pipeline is freely available as a Bitbucket package (https://bitbucket.org/genomicepidemiology/mgmapper). A web version (https://cge.cbs.dtu.dk/services/MGmapper) provides the basic functionality for analysis of small fastq datasets.
Year: 2017 PMID: 28467460 PMCID: PMC5415185 DOI: 10.1371/journal.pone.0176469
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Mock bacterial composition of in vitro and in silico datasets.
| Genus | Strain | In vitro | In silico |
|---|---|---|---|
| Bacillus | DSM7 | x | x |
| Bacillus | ATCC 14579 | x | x |
| Burkholderia | J2315 | x | x |
| Escherichia | K-12 | x | x |
| Frankia | CcI3 | x | x |
| Micrococcus | NCTC 2665 | x | x |
| Pseudomonas | PAO1 | x | x |
| Pseudomonas | UCBPP-PA14 | x | x |
| Pseudomonas | Pf-5 | x | x |
| Pseudomonas | KT2440 | x | x |
| Rhodobacter | SB 1003 | x | x |
| Streptomyces | A3(2) | x | x |
| Nocardioides | JS614 | | x |
Taxonomy clades are as follows: 12 strains, 11 species and 8 genera for the in vitro dataset, and 13 strains, 12 species and 9 genera for the in silico dataset.
Fig 1. A schematic flowchart for processing of paired-end sequences with MGmapper.
MGmapper processes fastq reads in four steps: (I) Trimming, and mapping of reads against the phiX bacteriophage genome to remove potential positive control reads. (II) Mapping to the specified reference databases, followed by post-processing of the BWA-MEM alignments to remove reads with a low alignment score or insufficient alignment coverage. (III) Identification of best hits. In bestmode, a read pair is assigned to only one specific reference sequence, based on the highest sum of alignment scores. In fullmode, a read pair is assigned to a reference sequence even if a higher alignment score is found when mapping to another reference sequence database; this provides the best target match considering only the sequences present in one particular reference database. (IV) Compilation of abundance statistics, read and nucleotide counts, depth, coverage, and summary reports.
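The difference between bestmode and fullmode in step (III) can be illustrated with a minimal Python sketch. This is not MGmapper's implementation: the hit records, the database and reference names, and the tie-breaking rule (first hit seen wins) are all assumptions made for illustration.

```python
# Hypothetical hit records: (read pair id, database, reference, summed
# alignment score of both mates). All names and scores are made up.
hits = [
    ("pair1", "Bacteria", "refA", 180),
    ("pair1", "Bacteria", "refB", 150),
    ("pair1", "Plasmid", "refP", 195),
    ("pair2", "Bacteria", "refB", 170),
]


def assign_bestmode(hits):
    """For each read pair, keep the single best-scoring hit across all databases."""
    best = {}
    for pair, db, ref, score in hits:
        # Assumption: on a tie, the first hit encountered is kept.
        if pair not in best or score > best[pair][2]:
            best[pair] = (db, ref, score)
    return best


def assign_fullmode(hits):
    """For each read pair and each database, keep the best-scoring hit within that database."""
    best = {}
    for pair, db, ref, score in hits:
        key = (pair, db)
        if key not in best or score > best[key][1]:
            best[key] = (ref, score)
    return best


bestmode = assign_bestmode(hits)
fullmode = assign_fullmode(hits)
print(bestmode["pair1"])              # one assignment per read pair
print(fullmode[("pair1", "Bacteria")])  # one assignment per read pair and database
```

In this toy input, pair1 is assigned only to the plasmid reference in bestmode, whereas in fullmode it also keeps its best hit within the Bacteria database, even though that hit scores lower.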
Read count statistics and reference sequence information.
| Column identifier | Description |
|---|---|
| Database | Name of reference sequence database |
| Ref. seq | Name of reference sequence or clade name |
| S_Abundance(10^2) | Size normalized read count abundance |
| R_Abundance(%) | Read count abundance |
| Size(bp) | Size of reference sequence |
| Seq_count | Number of sequences in the clade (always 1 at strain level) |
| Nucleotides | Total number of nucleotides mapped to reference sequence |
| Covered positions | Number of nucleotide positions covered by the reads |
| Coverage | Covered positions/size of ref sequence |
| Depth | Nucleotides/size of ref sequence |
| ReadCount | Number of reads mapped to a ref seq or clade |
| ReadCount uniq | Number of reads mapped uniquely to a ref seq or clade |
| Mismatches | Number of nucleotide mismatches, also known as edit distance |
| Description | Description from fasta header or clade name |
The size-normalized abundance is calculated as S_Abundance = ‘ReadCount × 100 / Size’ for single-end reads and ‘ReadCount × 100 / (2 × Size)’ for paired-end reads. R_Abundance(%) is the number of reads mapped to a taxonomy clade, expressed as a percentage of the number of reads that remain after trimming and removal of reads mapping to the phiX genome.
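The derived columns can be written out as a short Python sketch that follows the definitions given in the table and the note above. The function names and the example numbers are hypothetical and are not taken from MGmapper's code or output.

```python
def s_abundance(read_count, size_bp, paired_end=True):
    """Size-normalized abundance: ReadCount x 100 / Size, halved for paired-end reads."""
    mates = 2 if paired_end else 1
    return read_count * 100 / (mates * size_bp)


def r_abundance(read_count, reads_after_cleaning):
    """Reads mapped to a clade as a percentage of reads left after trimming and phiX removal."""
    return 100 * read_count / reads_after_cleaning


def coverage(covered_positions, size_bp):
    """Fraction of reference positions covered by at least one read."""
    return covered_positions / size_bp


def depth(nucleotides, size_bp):
    """Average number of mapped nucleotides per reference position."""
    return nucleotides / size_bp


# Hypothetical example: a 5 Mbp reference with 20,000 mapped reads
# out of 1,000,000 reads remaining after cleaning.
print(s_abundance(20_000, 5_000_000, paired_end=True))   # paired-end
print(s_abundance(20_000, 5_000_000, paired_end=False))  # single-end
print(r_abundance(20_000, 1_000_000))
```

With the same read count and reference size, the paired-end formula yields half the single-end value because of the factor 2 in the denominator.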
Benchmarking of the in vitro data mapped against a bacteria reference sequence database.
| Clade level | TP | FN | FP |
|---|---|---|---|
| Strain | 9 | 3 | 3 |
| Species | 11 | 0 | 0 |
| Genus | 8 | 0 | 0 |
The columns TP, FN and FP give the number of true positive, false negative and false positive taxonomy annotations at strain, species and genus level.
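The 75% sensitivity and precision reported for strain-level annotations follow from these counts under the standard definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP). The sketch below assumes those definitions, since the formulas are not spelled out here; the counts are copied from the table above.

```python
def sensitivity(tp, fn):
    """Fraction of clades truly present that were annotated."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of annotated clades that are truly present."""
    return tp / (tp + fp)


# (TP, FN, FP) per clade level, copied from the in vitro benchmark table.
in_vitro = {
    "Strain": (9, 3, 3),
    "Species": (11, 0, 0),
    "Genus": (8, 0, 0),
}

for level, (tp, fn, fp) in in_vitro.items():
    print(f"{level}: sensitivity={sensitivity(tp, fn):.0%}, precision={precision(tp, fp):.0%}")
```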
Benchmarking of the in silico data mapped against a bacteria reference sequence database.
| Clade level | TP | FN | FP | Dataset id |
|---|---|---|---|---|
| Strain | 11 | 2 | 0 | A |
| Species | 12 | 0 | 0 | A |
| Genus | 9 | 0 | 0 | A |
| Strain | 11 | 2 | 0 | B |
| Species | 12 | 0 | 0 | B |
| Genus | 9 | 0 | 0 | B |
| Strain | 12 | 1 | 0 | C |
| Species | 12 | 0 | 0 | C |
| Genus | 9 | 0 | 0 | C |
| Strain | 12 | 1 | 0 | D |
| Species | 12 | 0 | 0 | D |
| Genus | 9 | 0 | 0 | D |
The columns TP, FN and FP give the number of true positive, false negative and false positive taxonomy annotations at strain, species and genus level. The column ‘Dataset id’ refers to the four datasets A, B, C and D, with read lengths of 100 bp, 250 bp, 500 bp and 1000 bp, respectively.
Read count benchmark at genus level.
| Genus | Read count abundance MGmapper (%) | Read count abundance Kraken (%) |
|---|---|---|
| Pseudomonas | 34.82 | 34.98 |
| Bacillus | 11.22 | 11.21 |
| Burkholderia | 8.05 | 7.05 |
| Micrococcus | 7.68 | 7.74 |
| Escherichia | 7.53 | 2.48 |
| Frankia | 7.31 | 7.37 |
| Streptomyces | 7.24 | 7.29 |
| Rhodobacter | 6.91 | 6.96 |
| Total | 90.76 | 87.21 |
‘Read count abundance’ is reported as the percentage of reads assigned to each of the 8 genera present in the in vitro dataset by the two methods, MGmapper and Kraken. The last row, ‘Total’, is the summed percentage of reads assigned to genera present in the sample.
Read count benchmark at species level.
| Species | Read count abundance MGmapper (%) | Read count abundance Kraken (%) |
|---|---|---|
| Pseudomonas aeruginosa | 15.93 | 15.93 |
| Pseudomonas protegens | 10.81 | 10.76 |
| Pseudomonas putida | 7.95 | 7.44 |
| Micrococcus luteus | 7.68 | 7.74 |
| Escherichia coli | 7.52 | 2.37 |
| Frankia sp. CcI3 | 7.31 | 7.36 |
| Rhodobacter capsulatus | 6.91 | 6.96 |
| Burkholderia cenocepacia | 6.17 | 4.19 |
| Bacillus amyloliquefaciens | 5.75 | 1.96 |
| Streptomyces coelicolor | 5.38 | 4.40 |
| Bacillus cereus | 3.40 | 1.34 |
| Total | 84.82 | 70.45 |
‘Read count abundance’ is reported as the percentage of reads assigned to each of the 11 species present in the in vitro dataset by the two methods, MGmapper and Kraken. The last row, ‘Total’, is the summed percentage of reads assigned to the species present in the sample.
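As a simple consistency check, the per-species percentages can be summed and compared with the ‘Total’ row. The sketch below uses the values copied from the species-level table; the variable names are illustrative, and small deviations from the reported totals are expected because the table entries are rounded to two decimals.

```python
# Read count abundance (%) per species: (MGmapper, Kraken),
# copied from the species-level table above.
species_abundance = {
    "Pseudomonas aeruginosa": (15.93, 15.93),
    "Pseudomonas protegens": (10.81, 10.76),
    "Pseudomonas putida": (7.95, 7.44),
    "Micrococcus luteus": (7.68, 7.74),
    "Escherichia coli": (7.52, 2.37),
    "Frankia sp. CcI3": (7.31, 7.36),
    "Rhodobacter capsulatus": (6.91, 6.96),
    "Burkholderia cenocepacia": (6.17, 4.19),
    "Bacillus amyloliquefaciens": (5.75, 1.96),
    "Streptomyces coelicolor": (5.38, 4.40),
    "Bacillus cereus": (3.40, 1.34),
}

mgmapper_total = sum(m for m, _ in species_abundance.values())
kraken_total = sum(k for _, k in species_abundance.values())

print(f"MGmapper total: {mgmapper_total:.2f}%")
print(f"Kraken total:   {kraken_total:.2f}%")
print(f"Difference:     {mgmapper_total - kraken_total:.2f} percentage points")
```

The same per-species view shows where the two methods differ most: the largest gaps are for Escherichia coli and the two Bacillus species.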