Jolanta Kawulok, Sebastian Deorowicz.
Abstract
The study of environmental samples has been developing rapidly. Characterizing the composition of an environment broadens our knowledge of the relationship between species composition and environmental conditions. An important step in determining the composition of a sample is comparing the extracted DNA fragments with sequences derived from known organisms. In this paper, we introduce an algorithm called CoMeta (Classification of metagenomes), which assigns a query read (a DNA fragment) to one of the groups previously prepared by the user. Typically, a group corresponds to a taxon at one of the taxonomic ranks (e.g., phylum, genus); however, the prepared groups may instead contain sequences having various functions. CoMeta uses an exact method for read classification based on short subsequences (k-mers), together with a fast program for indexing large sets of k-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at https://github.com/jkawulok/cometa under a free GNU GPL 2 license.
Year: 2015 PMID: 25884504 PMCID: PMC4401624 DOI: 10.1371/journal.pone.0121453
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Dictionary of symbols and acronyms used in the description of the classification.
| Symbol | Description |
|---|---|
| | match score, similarity between the query read and the set of reference sequences |
| Ξ | match rate score, percentage ratio of the match score to the read length |
| | number of various groups to which the reads were classified |
| | output files after assignment to the best group |
| | number of incorrectly classified reads |
| | set of all reference sequences |
| | set of reference sequences for the |
| | subsequence ( |
| | dataset of reads |
| | match cut-off value of sequence identity |
| | number of various sets (groups) of reference sequences with which the query read is compared at the |
| | number of reads not classified to any group |
| | query read |
| | reference sequence |
| | number of correctly classified reads |
Fig 1. The processing pipeline for metagenomic read classification for a single rank.
To avoid cluttering the diagram, the upper index j, which indicates the j-th level of taxonomic classification, is omitted from the symbols.
Fig 2. Taxonomy tree-based classification.
Iterative execution of stage II (Classification) in Fig 1.
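The top-down idea behind this figure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CoMeta implementation: the `TaxonNode` class, the `hit_rate` score (the percentage of the read's k-mers found in a group), and the choice to follow only the single best group at each level are all hypothetical simplifications. The parameter `mc` plays the role of the match cut-off MC.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonNode:
    """Hypothetical tree node: a group name, its k-mer set, and child groups."""
    name: str
    kmers: set = field(default_factory=set)
    children: list = field(default_factory=list)


def hit_rate(read, kmer_set, k):
    """Simplified stand-in score: percentage of the read's k-mers in the set."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    hits = sum(read[i:i + k] in kmer_set for i in range(total))
    return 100.0 * hits / total


def classify_top_down(read, root, k, mc):
    """Walk down the tree, keeping the best child whose score exceeds mc.

    Returns the list of group names chosen at each level; an empty list
    means the read was not classified even at the first level.
    """
    path = []
    node = root
    while node.children:
        # Only the children of the previously chosen group are compared,
        # which is what keeps the number of comparisons small.
        best = max(node.children, key=lambda c: hit_rate(read, c.kmers, k))
        if hit_rate(read, best.kmers, k) <= mc:
            break  # no child group is similar enough; stop descending
        path.append(best.name)
        node = best
    return path
```

Because each level only inspects the children of the group selected one level above, a read that fails at the phylum level is never compared with any genus-level group.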
Fig 3. An example of comparing the query read with the reference sequence.
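A k-mer based comparison of this kind can be illustrated with a short Python sketch. It assumes that the match score counts the read positions covered by at least one k-mer that also occurs in the reference group; the source only describes the score as a similarity measure, so this definition, along with the function names, is an assumption made for illustration. The match rate score Ξ is then the match score expressed as a percentage of the read length, as in the dictionary of symbols.

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of seq together with its start position."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(reference_seqs, k):
    """Collect the distinct k-mers of a group of reference sequences."""
    index = set()
    for ref in reference_seqs:
        index.update(kmer for _, kmer in kmers(ref, k))
    return index


def match_score(read, index, k):
    """Count read positions covered by at least one k-mer found in the index.

    Assumption: this coverage-based definition is one plausible reading of
    the match score, not a definition confirmed by the source.
    """
    covered = [False] * len(read)
    for i, kmer in kmers(read, k):
        if kmer in index:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered)


def match_rate(read, index, k):
    """Match rate score: match score as a percentage of the read length."""
    return 100.0 * match_score(read, index, k) / len(read) if read else 0.0


if __name__ == "__main__":
    index = build_index(["ACGTACGTGG"], k=4)
    # The read differs from the reference at a single position.
    print(match_score("ACGTACCTGG", index, k=4))  # prints 6
    print(match_rate("ACGTACCTGG", index, k=4))   # prints 60.0
```

In the usage example, one substitution in a 10-nucleotide read removes four of its seven 4-mers from the index, so only the first six positions remain covered.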
Comparison of FACS algorithms with CoMeta.
| k | MC [%] | Sensitivity [%] | Precision [%] | Classified [%] | Time [hh:mm:ss] |
|---|---|---|---|---|---|
| FACS-P | | | | | |
| 18 | 80 | 97.62 | 97.86 | 99.76 | 00:03:14 |
| 21 | 65 | 97.86 | 98.08 | 99.78 | 00:02:49 |
| 21 | 70 | 97.82 | 98.27 | 99.55 | 00:02:49 |
| 24 | 55 | 97.77 | 98.12 | 99.64 | 00:02:36 |
| 27 | 45 | 97.65 | 98.07 | 99.58 | 00:02:27 |
| FACS-C | | | | | |
| 17 | 30 | | 90.20 | 99.93 | 00:01:08 |
| 17 | 40 | 98.78 | 93.25 | 98.78 | 00:01:12 |
| 19 | 30 | 99.48 | 92.65 | 99.48 | |
| 21 | 30 | 98.26 | 94.27 | 98.27 | |
| pre-CoMeta | | | | | |
| 15 | 55 | 99.30 | 93.56 | 99.31 | 00:01:52 |
| 18 | 45 | 99.42 | 93.36 | 99.43 | 00:01:21 |
| 21 | 45 | 99.05 | 93.93 | 99.06 | 00:01:08 |
| 25 | 30 | 99.56 | 92.05 | 99.57 | 00:01:09 |
| 27 | 35 | 99.36 | 93.07 | 99.37 | 00:01:16 |
| CoMeta | | | | | |
| 18 | – | 97.91 | 97.91 | | 00:01:37 |
| 21 | – | 98.40 | 98.41 | 99.99 | 00:01:36 |
| 24 | – | 98.69 | 98.75 | 99.93 | 00:01:37 |
| 27 | – | 98.71 | | 99.63 | 00:01:30 |
Comparison of the best classification results obtained using four methods (bold values indicate the best score for each column):
FACS-P: the FACS 2.1 program in Perl [49]. Once a read is classified to some G-th reference sequence, it is not compared with any further reference sequence;
FACS-C: the FACS program in C, downloaded from https://github.com/SciLifeLab/facs. A read is classified to every reference sequence to which its similarity is higher than MC;
pre-CoMeta: only the comparison step of the CoMeta algorithm (without assignment). This strategy is similar to the one implemented in FACS-C;
CoMeta: the full proposed algorithm; each read is classified to the reference sequence with the highest score.
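The three percentage columns can be related to the read counts listed in the dictionary of symbols (correctly classified, incorrectly classified, and unclassified reads). The sketch below uses the usual definitions and assumes that each read receives at most one label; the function name and the formulas are assumptions, since the source does not spell them out. Under these definitions precision equals sensitivity divided by the classified fraction, which agrees with the row for k = 21 and MC = 65 (97.86 / 99.78 ≈ 98.08%). It does not apply to methods that may assign one read to several reference sequences.

```python
def classification_metrics(n_correct, n_incorrect, n_unclassified):
    """Return sensitivity, precision and classified rate as percentages.

    Assumes every read gets at most one label, so the three counts
    partition the whole dataset of reads.
    """
    total = n_correct + n_incorrect + n_unclassified
    if total == 0:
        raise ValueError("the dataset contains no reads")
    classified = n_correct + n_incorrect
    return {
        # share of all reads that were assigned to the right group
        "sensitivity": 100.0 * n_correct / total,
        # share of the assigned reads that were assigned correctly
        "precision": 100.0 * n_correct / classified if classified else 0.0,
        # share of all reads that were assigned to any group
        "classified": 100.0 * classified / total,
    }
```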
Fig 4. Classification accuracy for CoMeta in Experiment One.
Accuracy of classification is shown when taking into account only the match files (dotted line with square mark) and when considering additionally the mismatch files (solid line with a circle mark). The performance curve reflects various k-mer lengths.
Fig 5. Classification accuracy for Experiment One using various values of the k parameter.
Plot A shows the scores after classification using FACS-P, plot B using FACS-C, and plot C using pre-CoMeta. Each series shows the results for 11 different threshold values, from left to right in each plot: MC = 30, 35, 40, …, 80 [%].
Comparison of programs using 454 reads.
| Program | FACS 269bp | MetaPhyler 300bp | CARMA 265bp | PhyloPythia 961bp |
|---|---|---|---|---|
| Percentage of classified reads | ||||
| CARMA | 29.0 | 93.6 | 68.7 | 61.3 |
| MEGAN | 48.4 | 88.2 | 90.5 | 62.2 |
| MetaPhyler | 0.2 | 80.9 | 0.5 | 0.6 |
| MG-RAST | 27.1 | 29.8 | 80.2 | 70.5 |
| LMAT | 24.7 (26.4) | 96.5 | 80.4 | 98.3 |
| LMAT | 92.5 (98.8) | 99.3 | 86.0 | 82.7 |
| MiniKraken | — | 100.0 | 96.7 | 98.0 |
| CoMeta | 93.6 (100.0) | 100.0 | 99.9 | 94.7 |
| CoMeta | — | 100.0 | 98.9 | 97.4 |
| Sensitivity (percentage) | ||||
| CARMA | 26.7 | 93.4 | 68.5 | 59.8 |
| MEGAN | 42.5 | 87.9 | 90.3 | 61.0 |
| MetaPhyler | 0.1 | 80.7 | 0.5 | 0.5 |
| MG-RAST | 25.0 | 29.7 | 80.1 | 67.2 |
| LMAT | 24.7 (26.3) | 95.7 | 80.4 | 98.1 |
| LMAT | 92.5 (98.7) | 98.5 | 86.0 | 82.5 |
| MiniKraken | — | 99.9 | 96.7 | 97.7 |
| CoMeta | 93.4 (99.7) | 99.6 | 99.1 | 94.1 |
| CoMeta | — | 99.8 | 98.9 | 96.2 |
| Precision (percentage) | ||||
| CARMA | 92.0 | 99.7 | 99.7 | 97.4 |
| MEGAN | 78.1 | 99.7 | 99.8 | 98.1 |
| MetaPhyler | 84.0 | 99.7 | 100.0 | 83.8 |
| MG-RAST | 92.4 | 99.8 | 99.9 | 95.3 |
| LMAT | 99.9 (99.9) | 97.8 | 100.0 | 99.8 |
| LMAT | 100.0 (100.0) | 97.8 | 100.0 | 99.8 |
| MiniKraken | — | 99.9 | 100.0 | 99.7 |
| CoMeta | 99.8 (99.8) | 99.6 | 99.1 | 99.3 |
| CoMeta | — | 99.8 | 99.9 | 98.8 |
a: The results of the program are taken from the paper by Bazinet and Cummings [56].
b: The values in brackets are the results for the FACS 269bp dataset after filtering out reads with more than 50% of unknown nucleotides (Ns). The values outside the brackets are for the whole dataset.
CoMeta allDb parameters: MC = 30%, k = 24.
CoMeta micDb parameters: MC = 5%, k = 30.
LMAT kML and kFull parameter: ms = 0.
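The filtering step mentioned in note b can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, reads are plain strings, and a read with exactly 50% of Ns is kept because the note only excludes reads with more than 50%.

```python
def unknown_fraction(read):
    """Fraction of positions in the read that are unknown nucleotides (N)."""
    if not read:
        return 0.0
    return read.upper().count("N") / len(read)


def filter_reads(reads, max_unknown=0.5):
    """Keep reads whose fraction of Ns does not exceed max_unknown."""
    return [r for r in reads if unknown_fraction(r) <= max_unknown]
```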
Comparison of programs for various level classification using Illumina reads.
| Programs | HiSeq 92 bp Sensitivity | HiSeq 92 bp Precision | HiSeq 92 bp Classified | MiSeq 156 bp Sensitivity | MiSeq 156 bp Precision | MiSeq 156 bp Classified |
|---|---|---|---|---|---|---|
| PHYLUM | ||||||
| LMAT | 89.89 | 99.74 | 90.12 | 88.23 | 99.47 | 88.70 |
| MiniKraken | 65.34 | 99.79 | 65.48 | 75.88 | 99.93 | 75.93 |
| CoMeta | 81.64 | 98.97 | 82.49 | 86.71 | 99.11 | 87.49 |
| CLASS | ||||||
| LMAT | 88.06 | 99.66 | 88.36 | 85.79 | 99.65 | 86.09 |
| MiniKraken | 65.16 | 99.65 | 65.39 | 75.73 | 99.91 | 75.80 |
| CoMeta | 80.87 | 98.14 | 82.40 | 86.34 | 98.83 | 87.36 |
| ORDER | ||||||
| LMAT | 86.48 | 99.80 | 86.65 | 81.00 | 99.63 | 81.30 |
| MiniKraken | 64.89 | 99.51 | 65.21 | 75.52 | 99.87 | 75.62 |
| CoMeta | 80.34 | 97.73 | 82.21 | 85.39 | 98.01 | 87.12 |
| FAMILY | ||||||
| LMAT | 84.96 | 99.79 | 85.14 | 79.40 | 99.72 | 79.62 |
| MiniKraken | 64.75 | 99.46 | 65.10 | 75.43 | 99.81 | 75.57 |
| CoMeta | 80.13 | 97.61 | 82.09 | 85.05 | 97.76 | 87.00 |
| GENUS | ||||||
| LMAT | 84.74 | 99.80 | 84.91 | 73.75 | 99.53 | 74.10 |
| MiniKraken | 64.54 | 99.45 | 64.90 | 71.95 | 98.04 | 73.39 |
| MiniKraken | 66.12 | 99.44 | — | 67.95 | 97.41 | — |
| Kraken | 77.15 | 99.20 | — | 73.46 | 94.71 | — |
| Kraken-GB | 93.75 | 99.51 | — | 86.23 | 98.48 | — |
| CoMeta | 79.82 | 97.44 | 81.92 | 77.50 | 90.83 | 85.32 |
a: The results of the program were computed by the authors.
b: The results of the program are taken from the paper by Wood and Salzberg [51].
CoMeta micDb parameters: MC = 5%, k = 24. LMAT kFull parameter: ms = 0.
Comparison of RAM memory usage and CPU times.
| Program | FACS 269bp | MetaPhyler 300bp | CARMA 265bp | PhyloPythia 961bp | HiSeq 92bp | MiSeq 156bp |
|---|---|---|---|---|---|---|
| CPU Runtime (minutes) | ||||||
| CARMA | 290880 | 77340 | 74950 | 360107 | — | — |
| MEGAN | 288020 | 72060 | 72010 | 351060 | — | — |
| MetaPhyler | 10 | 20 | 2 | 28 | — | — |
| MG-RAST | 60 | 10080 | 20160 | 12960 | — | — |
| LMAT | 36 (60) | 58 | 43 | 348 | — | — |
| LMAT | 54 (93) | 213 | 38 | 772 | 15 | 33 |
| MiniKraken | — | 1.22 | 1.07 | 2.95 | 1.3 | 1.2 |
| CoMeta | 41 (76) | 14 | 28 | 144 | — | — |
| CoMeta | — | 9 | 14 | 35 | 8 | 9 |
| CoMeta | — | — | — | 79 | 42 | 68 |
| Memory Usage (Megabytes of RAM) | ||||||
| CARMA | 100 | 100 | 100 | 120 | — | — |
| MEGAN | 1024 | 1024 | 1024 | 1410 | — | — |
| MetaPhyler | 5734 | 5734 | 5734 | 5734 | — | — |
| MG-RAST | — | — | — | — | — | — |
| LMAT | 17000 (17284) | 17019 | 2128 | 13311 | — | — |
| LMAT | 9295 (9481) | 13247 | 13286 | 15092 | 5807 | 12392 |
| MiniKraken | — | 4098 | 3210 | 4100 | 1317 | 1449 |
| CoMeta | 71260 (71903) | 70743 | 71313 | 69508 | — | — |
| CoMeta | — | 19552 | 19320 | 19552 | 10297 | 17689 |
a: The results of the program are taken from the paper by Bazinet and Cummings [56].
b: The values in brackets are the results for the FACS 269bp dataset after filtering out reads with more than 50% of unknown nucleotides (Ns). The values outside the brackets are for the whole dataset.
The FACS 269 bp, MetaPhyler 300 bp, and CARMA 265 bp datasets were classified to the phylum level, whilst the PhyloPythia 961 bp, HiSeq 92 bp, and MiSeq 156 bp datasets were classified to the genus level. Besides the times of classification to the genus level for CoMeta micDb (ge), the table also shows the times of classification to an earlier level, the phylum level (ph).
Compact k-mer database, where the reads are classified into the phylum rank.
| | Archaea | Bacteria | Eukaryota | Viroids | Viruses | Total |
|---|---|---|---|---|---|---|
| num groups | 6 | 36 | 55 | 1 | 1 | 99 |
| num seq | 261,295 | 4,036,205 | 10,205,401 | 3,127 | 1,175,053 | 15,681,081 |
| | 1.9 GB | 17.0 GB | 29.9 GB | 1.1 MB | 1.1 GB | 49.9 GB |
| | 2.2 GB | 34.4 GB | 93.7 GB | 1.1 MB | 1.4 GB | 131.7 GB |
| | 2.3 GB | 37.6 GB | 111.9 GB | 1.2 MB | 1.5 GB | 153.3 GB |
| | 2.3 GB | 39.0 GB | 117.4 GB | 1.3 MB | 1.6 GB | 160.4 GB |
| | 2.4 GB | 39.3 GB | 120.9 GB | 1.4 MB | 1.7 GB | 164.2 GB |
| | 2.4 GB | 39.6 GB | 123.3 GB | 1.4 MB | 1.8 GB | 167.0 GB |
Total size of the compact k-mer databases for groups of the phylum rank at various k-mer lengths. The number of groups belonging to each superkingdom is given in the first row, and the number of sequences in the second. The sizes of each dataset are provided in S1 Supporting Information.