Rachid Ounit, Steve Wanamaker, Timothy J Close, Stefano Lonardi.
Abstract
BACKGROUND: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce.
Year: 2015 PMID: 25879410 PMCID: PMC4428112 DOI: 10.1186/s12864-015-1419-2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Genus-level classification accuracy and speed of CLARK, KRAKEN, and NBC for four simulated metagenomes and several k-mer lengths
| Tool | k | HiSeq precision | HiSeq sensitivity | HiSeq speed | MiSeq precision | MiSeq sensitivity | MiSeq speed | simBA-5 precision | simBA-5 sensitivity | simBA-5 speed | simHC.20.500 precision | simHC.20.500 sensitivity | simHC.20.500 speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NBC | 15 ∗ |  |  | 0.008 |  |  | 0.007 |  |  | 0.007 |  |  | 0.005 |
| NBC | 13 ∗ | 78.85 | 78.85 | 0.011 | 77.70 | 77.70 | 0.009 | 92.41 | 92.41 | 0.010 | 98.57 | 98.57 | 0.006 |
| NBC | 11 ∗ | 58.97 | 58.97 |  | 64.43 | 64.43 |  | 46.10 | 46.10 |  | 86.83 | 86.83 |  |
|  | 31 |  | 77.78 |  |  | 77.69 |  | 98.88 | 89.67 |  |  |  | 121 |
|  | 27 | 98.98 | 79.88 | 538 | 93.50 | 78.57 | 433 |  | 93.09 | 585 | 99.67 |  |  |
|  | 23 | 97.33 | 81.97 | 530 | 90.06 | 80.02 | 426 | 98.71 | 94.54 | 559 | 99.59 |  | 119 |
|  | 20 | 87.00 |  | 532 | 82.45 |  | 420 | 97.38 |  | 549 | 99.43 | 99.41 | 115 |
|  | 31 |  | 77.76 |  |  | 77.59 |  | 98.28 | 89.35 |  | 96.83 | 96.55 |  |
|  | 27 | 99.01 | 79.85 | 2,048 | 93.91 | 78.47 | 1,240 |  | 92.73 | 1,917 |  | 96.57 | 231 |
|  | 23 | 97.45 | 81.89 | 1,923 | 90.56 | 79.75 | 1,186 | 98.25 | 94.18 | 1,824 | 96.80 | 96.57 | 228 |
|  | 20 | 90.22 |  | 1,546 | 86.28 |  | 965 | 98.07 |  | 1,478 | 96.71 |  | 211 |
|  | 31 |  | 77.25 |  |  | 77.44 |  |  | 88.62 |  |  |  |  |
|  | 27 | 99.07 | 79.37 | 2,796 | 93.90 | 78.29 | 1,522 | 98.90 | 92.26 | 2,554 | 99.67 |  | 241 |
|  | 23 | 97.85 | 81.36 | 2,679 | 90.98 | 79.57 | 1,482 | 98.75 | 94.26 | 2,394 | 99.60 |  | 244 |
|  | 20 | 88.60 |  | 2,567 | 83.35 |  | 1,456 | 97.73 |  | 2,306 | 99.43 | 99.41 | 239 |
|  | 31 |  | 76.84 | 6,224 |  |  | 5,308 |  | 87.46 | 7,023 |  |  | 3,809 |
|  | 27 | 98.79 | 78.19 | 6,410 | 94.12 | 73.73 | 5,555 | 98.11 |  | 7,992 | 90.99 | 83.71 | 4,196 |
|  | 23 | 96.67 |  | 7,015 | 90.57 | 72.35 | 6,329 | 97.21 | 89.07 | 8,989 | 90.46 | 79.27 | 4,574 |
|  | 20 | 82.07 | 70.11 |  | 80.05 | 65.25 |  | 90.02 | 77.04 |  | 70.86 | 57.40 |  |
|  | 31 |  | 72.72 |  |  |  |  |  | 77.85 | 26,171 | 97.63 | 97.31 | 15,426 |
|  | 27 | 99.43 | 74.67 | 29,897 | 96.93 | 75.68 | 28,459 | 98.93 | 84.86 |  | 97.47 | 97.18 |  |
|  | 23 | 98.93 | 78.20 | 31,112 | 95.01 | 76.88 | 26,747 | 98.34 | 90.20 | 26,647 |  |  | 15,408 |
|  | 20 | 94.74 |  | 30,029 | 90.57 | 76.60 | 25,789 | 96.61 |  | 26,545 | 93.94 | 93.82 | 15,587 |
| CLARK-l | 27 | 98.45 | 62.30 | 1,525 | 92.11 | 69.64 | 861 | 95.96 | 52.00 | 1,705 | 99.49 | 98.94 | 143 |
Performance statistics for several choices of the k-mer length for NBC, KRAKEN, CLARK and their fast variants on the classification of the “HiSeq”, “MiSeq”, “simBA-5” and “simHC.20.500” metagenomic datasets against the 695 genus-level targets. Precision and sensitivity are expressed as percentages, while speed is expressed in 10³ reads per minute. KRAKEN-Q and CLARK-E are faster, but less accurate, variants of these tools. CLARK-l is a less memory-intensive version of CLARK which runs only for k = 27. Experiments were carried out in single-threaded mode. ∗The parameter k is referred to as N in the NBC manuscript.
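The caption reports precision, sensitivity, and speed without restating how they are computed. The following is a minimal sketch under stated assumptions: precision is taken as correct assignments over all assignments made, and sensitivity as correct assignments over all input reads, which is a common convention for read classifiers but is not confirmed by this excerpt. The function names and the example counts are hypothetical.

```python
def precision_sensitivity(n_correct, n_assigned, n_total):
    """Return (precision, sensitivity) as percentages.

    Assumed definitions (not stated in the excerpt):
      precision   = correct assignments / assignments made
      sensitivity = correct assignments / total input reads
    """
    precision = 100.0 * n_correct / n_assigned if n_assigned else 0.0
    sensitivity = 100.0 * n_correct / n_total if n_total else 0.0
    return precision, sensitivity


def speed_kreads_per_minute(n_reads, elapsed_seconds):
    """Return throughput in units of 10^3 reads per minute."""
    return (n_reads / 1000.0) / (elapsed_seconds / 60.0)


# Hypothetical run: 10,000 reads, 9,000 assigned, 8,100 correct, 12 seconds.
p, s = precision_sensitivity(8100, 9000, 10000)
print(f"precision={p:.2f}% sensitivity={s:.2f}%")
print(f"speed={speed_kreads_per_minute(10000, 12):.1f} x 10^3 reads/min")
```

Under these assumed definitions, a tool that assigns every read has identical precision and sensitivity, which matches the pattern in the rows marked ∗.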
Species-level classification accuracy and speed of CLARK, KRAKEN, and NBC for four simulated metagenomes
| Tool | HiSeq precision | HiSeq sensitivity | HiSeq speed | MiSeq precision | MiSeq sensitivity | MiSeq speed | simBA-5 precision | simBA-5 sensitivity | simBA-5 speed | simHC.20.500 precision | simHC.20.500 sensitivity | simHC.20.500 speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 68.67 | 68.70 | 0.008 | 68.33 | 68.33 | 0.007 | 91.74 | 91.74 | 0.007 | 94.32 | 94.32 | 0.005 |
|  | 69.44 | 61.46 | 272 | 70.72 | 62.45 | 239 | 91.32 | 82.48 | 269 | 94.34 | 94.32 | 96 |
|  | 74.00 | 53.49 | 2,332 | 77.72 | 58.72 | 1,361 | 92.99 | 78.70 | 1,976 | 84.67 | 84.31 | 237 |
|  | 86.74 | 58.59 | 3,011 | 89.49 | 61.84 | 1,566 | 98.85 | 76.80 | 2,855 | 94.67 | 94.26 | 251 |
|  | 75.88 | 50.78 | 6,224 | 78.07 | 53.68 | 5,308 | 92.67 | 74.39 | 7,023 | 82.40 | 74.84 | 3,809 |
|  | 90.08 | 55.18 | 30,976 | 94.31 | 58.36 | 24,029 | 98.92 | 66.02 | 24,996 | 92.78 | 92.38 | 15,583 |
|  | 85.35 | 53.95 | 1,676 | 85.89 | 64.91 | 904 | 85.55 | 46.28 | 1,702 | 94.06 | 93.53 | 141 |
Precision and sensitivity are expressed as percentages, while speed is expressed in 10³ reads per minute, for NBC, KRAKEN, and CLARK on the classification of the “HiSeq”, “MiSeq”, “simBA-5” and “simHC.20.500” metagenomic datasets against the 1473 species-level targets, in single-threaded mode.
Figure 1. Classification performance of CLARK for several k-mer lengths and for various datasets. CLARK’s precision, sensitivity, assignment rate, average confidence scores and precision of high confidence assignments (HC) for several choices of the k-mer length on the “HiSeq” metagenomic dataset (a), the “MiSeq” metagenomic dataset (b), the “simBA-5” metagenomic dataset (c), the “simHC.20.500” metagenomic dataset (d), and barley unigenes (e). Panels (a) to (d) are results of the classification against the 695 genus-level targets.
Summary of the genus-level classification for three Human Microbiome Project datasets (k = 20)
| Sample ID | High confidence assignments | Low confidence assignments | Unassigned reads | Average confidence score | Five most frequent genera |
|---|---|---|---|---|---|
| 015072 (vagina) | 62.3% | 25.9% | 11.8% | 0.868 |  |
| 019120 (mouth) | 55.1% | 28.2% | 16.7% | 0.842 |  |
| 023847 (nose) | 68.3% | 23.8% | 7.9% | 0.954 |  |
Columns: (1) short read sample ID; (2) percentage of high confidence assignments; (3) percentage of low confidence assignments; (4) percentage of unassigned reads; (5) average confidence score for all assignments; (6) five most frequent genera in high confidence assignments (listed in decreasing order). An assignment is high confidence if the confidence score is higher than 0.75, low confidence otherwise.
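The high/low confidence split in this caption is a simple threshold rule, so columns (2) to (5) can be derived from per-read confidence scores. The sketch below assumes each read carries either a score or `None` when unassigned; that input layout, the function name, and the example scores are hypothetical and are not CLARK's actual output format. Only the 0.75 threshold comes from the caption.

```python
HIGH_CONFIDENCE_THRESHOLD = 0.75  # from the caption: high if score > 0.75


def summarize_assignments(scores):
    """Summarize per-read confidence scores.

    `scores` holds one entry per read: a float confidence score,
    or None when the read was left unassigned (assumed layout).
    """
    n_total = len(scores)
    assigned = [s for s in scores if s is not None]
    n_high = sum(1 for s in assigned if s > HIGH_CONFIDENCE_THRESHOLD)
    n_low = len(assigned) - n_high
    n_unassigned = n_total - len(assigned)

    def pct(n):
        return 100.0 * n / n_total if n_total else 0.0

    return {
        "high_pct": pct(n_high),
        "low_pct": pct(n_low),
        "unassigned_pct": pct(n_unassigned),
        # Average over assigned reads only.
        "avg_confidence": sum(assigned) / len(assigned) if assigned else 0.0,
    }


# Hypothetical scores for four reads; a score of exactly 0.75 counts as low.
print(summarize_assignments([0.9, 0.8, 0.75, None]))
```

The three percentages sum to 100 by construction, as they do in each row of the table.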
Summary of CLARK’s assignment of 50,646 unigenes (EST assemblies) to barley chromosome arms (assemblies) and centromeres (k = 19)
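| Target | Distinct k-mers | Discriminative k-mers | Assigned objects | Low confidence assignments | High confidence assignments |
|---|---|---|---|---|---|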
| 1H | 180,176,713 | 108,894,740 | 8,197 | 21.1% | 78.9% |
| 2HC | - | 814,357 | 15 | 93.3% | 6.7% |
| 2HL | 103,679,920 | 64,700,161 | 4,776 | 15.8% | 84.2% |
| 2HS | 90,912,314 | 54,449,430 | 3,334 | 17.3% | 82.7% |
| 3HC | - | 1,532,968 | 29 | 79.3% | 20.7% |
| 3HL | 123,140,951 | 78,158,244 | 4,726 | 16.7% | 83.3% |
| 3HS | 111,951,787 | 70,473,478 | 3,159 | 20.4% | 79.6% |
| 4HC | - | 3,105,047 | 54 | 50.0% | 50.0% |
| 4HL | 106,999,773 | 64,749,958 | 3,531 | 14.4% | 85.6% |
| 4HS | 89,027,872 | 51,612,790 | 2,468 | 16.4% | 83.6% |
| 5HC | - | 604,030 | 9 | 88.9% | 11.1% |
| 5HL | 117,915,094 | 77,128,375 | 6,111 | 12.2% | 87.8% |
| 5HS | 58,067,400 | 34,037,607 | 1,619 | 17.8% | 82.2% |
| 6HC | - | 469,530 | 9 | 100.0% | 0.0% |
| 6HL | 74,485,223 | 44,221,184 | 2,973 | 12.4% | 87.6% |
| 6HS | 111,834,123 | 83,957,421 | 2,721 | 24.4% | 75.6% |
| 7HC | - | 795,923 | 9 | 88.9% | 11.1% |
| 7HL | 92,603,503 | 58,159,248 | 3,556 | 10.9% | 89.1% |
| 7HS | 90,217,777 | 55,276,671 | 3,350 | 12.6% | 87.4% |
| Total | 1,351,012,450 | 853,141,162 | 50,646 | 16.5% | 83.5% |
Columns: (1) barley chromosome 1H, twelve chromosome arms, and six centromeres; (2) number of distinct k-mers in each target; (3) number of discriminative k-mers present in target sequences (must occur at least once); (4) number of assigned objects per target; (5) number of low confidence assignments per target; (6) number of high confidence assignments per target; (7) percentage of low confidence assignments (as a fraction of the total number of assigned objects per target); (8) percentage of high confidence assignments (as a fraction of the total number of assigned objects per target).
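Columns (2) and (3) distinguish all distinct k-mers in a target from its discriminative k-mers. A minimal sketch of that distinction follows, assuming a discriminative k-mer is one that occurs in exactly one target after discarding k-mers seen fewer than a minimum number of times within their target. The function names, the `min_count` parameter, and the order of the two filtering steps are illustrative assumptions, and the sketch ignores reverse complements for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def discriminative_kmers(targets, k, min_count=1):
    """Map each target name to the set of k-mers found in that target only.

    `targets` maps a target name to a list of sequences. A k-mer is kept
    for a target only if it occurs at least `min_count` times there.
    """
    per_target = {}
    for name, seqs in targets.items():
        counts = Counter(km for s in seqs for km in kmers(s, k))
        per_target[name] = {km for km, c in counts.items() if c >= min_count}

    # Count how many targets contain each surviving k-mer.
    owners = Counter(km for kset in per_target.values() for km in kset)
    return {
        name: {km for km in kset if owners[km] == 1}
        for name, kset in per_target.items()
    }


# Toy example with two hypothetical targets and k = 3.
toy = {"armA": ["ACGTA"], "armB": ["CGTAC"]}
print(discriminative_kmers(toy, k=3))
```

In the toy example the k-mers shared by both targets are dropped, leaving only those unique to each one. Raising `min_count` mimics a stricter occurrence requirement.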
Summary of CLARK’s assignment of 15,665 BACs (represented as reads) to barley chromosome arms (reads) and centromeres (k = 19)
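| Target | Distinct k-mers | Discriminative k-mers | Assigned objects | Low confidence assignments | High confidence assignments |
|---|---|---|---|---|---|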
| 1H | 448,768,897 | 126,997,864 | 2,068 | 4.2% | 95.8% |
| 2HC | - | 1,738,722 | 0 | - | - |
| 2HL | 451,729,142 | 102,959,160 | 1,417 | 2.1% | 97.9% |
| 2HS | 401,605,473 | 79,225,936 | 1,071 | 2.4% | 97.6% |
| 3HC | - | 4,631,639 | 0 | - | - |
| 3HL | 553,420,081 | 138,939,217 | 1,423 | 2.2% | 97.8% |
| 3HS | 538,777,930 | 113,354,224 | 892 | 3.5% | 96.5% |
| 4HC | - | 6,428,726 | 70 | 14.3% | 85.7% |
| 4HL | 494,923,209 | 106,930,230 | 1,127 | 2.3% | 97.7% |
| 4HS | 462,144,322 | 85,650,765 | 888 | 3.4% | 96.6% |
| 5HC | - | 1,643,194 | 0 | - | - |
| 5HL | 558,710,983 | 121,491,586 | 1,657 | 2.3% | 97.7% |
| 5HS | 281,062,766 | 57,181,745 | 658 | 2.4% | 97.6% |
| 6HC | - | 1,287,133 | 0 | - | - |
| 6HL | 311,443,157 | 70,856,097 | 1,136 | 2.0% | 98.0% |
| 6HS | 877,169,677 | 255,819,549 | 850 | 2.9% | 97.1% |
| 7HC | - | 1,697,991 | 0 | - | - |
| 7HL | 366,612,780 | 82,987,499 | 1,175 | 2.0% | 98.0% |
| 7HS | 365,475,556 | 83,848,867 | 1,233 | 2.8% | 97.2% |
| Total | 6,111,843,973 | 1,443,670,144 | 15,665 | 2.7% | 97.3% |
Columns: (1) barley chromosome 1H, twelve chromosome arms, and six centromeres; (2) number of distinct k-mers in each target; (3) number of discriminative k-mers present in target sequences (must occur at least twice); (4) number of assigned objects per target; (5) number of low confidence assignments per target; (6) number of high confidence assignments per target; (7) percentage of low confidence assignments (as a fraction of the total number of assigned objects per target); (8) percentage of high confidence assignments (as a fraction of the total number of assigned objects per target).