Jolanta Kawulok, Michal Kawulok, Sebastian Deorowicz.
Abstract
BACKGROUND: Nowadays, not only single genomes but also metagenomes are commonly analyzed. A metagenome is a set of DNA fragments (reads) derived from microbes living in a given environment. Metagenome analysis aims to extract crucial information on the organisms that have left their traces in an investigated environmental sample. In this study we focus on the MetaSUB Forensics Challenge (organized within the CAMDA 2018 conference), which consists in predicting the geographical origin of metagenomic samples. Contrary to existing methods for environmental classification, which are based on taxonomic or functional classification, we rely on the similarity between a sample and the reference database computed at the read level.
Keywords: CAMDA challenge; Environmental classification; K-mers; MetaSUB; Metagenome; Sequence classification; Urban microbiome
Year: 2019 PMID: 31722729 PMCID: PMC6854650 DOI: 10.1186/s13062-019-0251-z
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
The content of the primary data set before and after removing human DNA fragments
| ID | Country | City | #samples | Average #reads per sample (original data) | Average #reads per sample (without human DNA) |
|---|---|---|---|---|---|
| SCL | Chile | Santiago | 20 | 14,895,560 | 10,281,642 |
| TOK | Japan | Tokyo | 20 | 28,234,328 | 12,172,488 |
| AKL | New Zealand | Auckland | 15 | 4,929,497 | 4,849,711 |
| HAM | New Zealand | Hamilton | 16 | 6,073,774 | 5,999,711 |
| OFA | Nigeria | Offa | 20 | 35,469,676 | 34,936,176 |
| PXO | Portugal | Porto | 60 | 5,100,568 | 3,406,160 |
| NYC | USA | New York | 126 | 8,437,471 | 7,059,544 |
| SAC | USA | Sacramento | 34 | 25,153,713 | 22,627,578 |
| Together | | | 311 | 12,757,221 | 10,224,299 |
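
As a reading aid, the share of reads removed as human DNA can be estimated per city from the two averages in the table. The sketch below is illustrative only: the function name `human_fraction` is hypothetical, and the result is a ratio of per-sample averages, which is not necessarily equal to the average of per-sample ratios.

```python
# Average reads per sample (original, without human DNA), copied from the table.
AVG_READS = {
    "SCL": (14_895_560, 10_281_642),
    "TOK": (28_234_328, 12_172_488),
    "AKL": (4_929_497, 4_849_711),
    "HAM": (6_073_774, 5_999_711),
    "OFA": (35_469_676, 34_936_176),
    "PXO": (5_100_568, 3_406_160),
    "NYC": (8_437_471, 7_059_544),
    "SAC": (25_153_713, 22_627_578),
}


def human_fraction(original, filtered):
    """Share of the average read count that was removed as human DNA."""
    return 1.0 - filtered / original


for city, (original, filtered) in AVG_READS.items():
    print(f"{city}: {human_fraction(original, filtered):.1%} of reads removed")
```

On these averages, Tokyo loses more than half of its reads, while Auckland, Hamilton, and Offa lose under 2%.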
Fig. 1 A map presenting the origin of the samples in the MetaSUB dataset. The eight cities marked in blue are included in the primary dataset, and the four cities marked in red are the origins of the samples included in the C2 and C3 sets. The map also shows the classification accuracies (obtained using the proposed method) for the cities from the primary dataset: blue indicates the scores for the primary dataset (based on leave-one-out cross-validation), and green shows the scores for the C1 set (which includes samples from four of the eight cities in the primary dataset)
The test sets (C1, C2, and C3) before and after removing human DNA fragments
| Test set → | C1 | C2 | C3 |
|---|---|---|---|
| #samples | 30 | 3×12 | 16 |
| Average #reads per sample (original) | 4,637,923 | 28,907,439 | 18,000,000 |
| Average #reads (without human DNA) | 3,871,596 | 25,082,590 | 15,027,017 |
Fig. 2 The processing pipeline for classifying metagenomic reads to one of the constructed classes. D: the k-mer database for the human reference sequence; the k-mer databases built from the original datasets for each of the N classes; {D1, D2, …, DN}: the k-mer databases after subtracting D, for each of the N classes; Ri: the ith read from a query sample; Ξ: the result of matching the jth read to the ith class (match rate score); x: one of the constructed classes. Each blue block indicates data stored in a separate file
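
The pipeline in this caption can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: k-mer databases are modelled as plain Python sets, the match rate score is assumed to be the fraction of a read's k-mers found in a class database, and canonical (reverse-complement) k-mers are ignored. All function names are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_class_db(class_reads, human_db, k):
    """Union the k-mers of all reads of a class, then subtract the human k-mers."""
    db = set()
    for read in class_reads:
        db |= kmers(read, k)
    return db - human_db


def match_rate(read, class_db, k):
    """ASSUMPTION: fraction of the read's k-mers that occur in the class database."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & class_db) / len(read_kmers)


# Toy data with k=4; real reads would use a much larger k.
K = 4
human_db = kmers("AAAACCCC", K)
class_dbs = {
    "city_a": build_class_db(["ACGTACGTAC", "AAAACCCC"], human_db, K),
    "city_b": build_class_db(["TTGGCCAATT"], human_db, K),
}
scores = {name: match_rate("ACGTACGT", db, K) for name, db in class_dbs.items()}
print(scores)
```

In this toy setup the query read shares all of its k-mers with `city_a` and none with `city_b`, and the human-derived k-mers never enter either class database.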
Classification accuracy obtained for the primary dataset using our method with class-level filtering at ci=4
We report the scores for three approaches to accumulating the similarity points for a sample: a) simple sum, b) fractional sum, and c) weighted sum, each for different values of the threshold and of the maximum number of classes that a single read can be classified to
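
The three accumulation schemes are only named here, so the sketch below uses assumed definitions for illustration: the simple sum gives one point to every class a read is assigned to, the fractional sum splits one point evenly among those classes, and the weighted sum splits one point in proportion to the match rate scores. It is also an assumption that reads matching more classes than the allowed maximum are skipped. The function and parameter names are hypothetical.

```python
def sample_points(read_scores, threshold, max_classes, mode):
    """Accumulate per-class similarity points for one sample.

    read_scores: list of dicts mapping class name -> match rate score of one read.
    All scoring rules below are ASSUMED definitions, used for illustration only.
    """
    totals = {}
    for scores in read_scores:
        # Keep only the classes whose score reaches the threshold.
        hits = {c: s for c, s in scores.items() if s >= threshold and s > 0}
        # Assumed rule: skip reads matching no class or more than max_classes.
        if not hits or len(hits) > max_classes:
            continue
        weight_sum = sum(hits.values())
        for cls, score in hits.items():
            if mode == "simple":
                points = 1.0
            elif mode == "fractional":
                points = 1.0 / len(hits)
            elif mode == "weighted":
                points = score / weight_sum
            else:
                raise ValueError(f"unknown mode: {mode}")
            totals[cls] = totals.get(cls, 0.0) + points
    return totals


# Hypothetical match rate scores for three reads of one sample.
reads = [
    {"NYC": 0.9, "SAC": 0.3},
    {"NYC": 0.6, "SAC": 0.6, "TOK": 0.1},
    {"TOK": 0.2},
]
for mode in ("simple", "fractional", "weighted"):
    print(mode, sample_points(reads, threshold=0.25, max_classes=2, mode=mode))
```

With these toy scores, the simple and fractional sums tie NYC and SAC, while the weighted sum favours NYC because the first read matches it more strongly.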
Classification accuracy obtained for the primary dataset using our method with sample-level filtering at ci=4
We report the scores for three approaches to accumulating the similarity points for a sample: a) simple sum, b) fractional sum, and c) weighted sum, each for different values of the threshold and of the maximum number of classes that a single read can be classified to
Classification accuracy obtained for the C1 test set using our method with class-level filtering at ci=4
We report the scores for three approaches to accumulating the similarity points for a sample: a) simple sum, b) fractional sum, and c) weighted sum, each for different values of the threshold and of the maximum number of classes that a single read can be classified to
Classification accuracy obtained for the C1 test set using our method with sample-level filtering at ci=4
We report the scores for three approaches to accumulating the similarity points for a sample: a) simple sum, b) fractional sum, and c) weighted sum, each for different values of the threshold and of the maximum number of classes that a single read can be classified to
Confusion matrix for the primary dataset obtained using our method with sample-level filtering and similarity points computed using the weighted sum
| Tested ↓ / Predicted → | AKL | HAM | NYC | OFA | PXO | SAC | SCL | TOK | ALL |
|---|---|---|---|---|---|---|---|---|---|
| AKL | **4** | 6 | 4 | 0 | 1 | 0 | 0 | 0 | 15 |
| HAM | 2 | **13** | 1 | 0 | 0 | 0 | 0 | 0 | 16 |
| NYC | 1 | 1 | **113** | 11 | 0 | 0 | 0 | 0 | 126 |
| OFA | 0 | 0 | 0 | **20** | 0 | 0 | 0 | 0 | 20 |
| PXO | 0 | 0 | 1 | 0 | **57** | 0 | 0 | 2 | 60 |
| SAC | 0 | 0 | 3 | 0 | 0 | **30** | 1 | 0 | 34 |
| SCL | 0 | 0 | 1 | 0 | 3 | 0 | **16** | 0 | 20 |
| TOK | 0 | 0 | 0 | 0 | 1 | 0 | 0 | **19** | 20 |
| ALL | 7 | 20 | 123 | 31 | 62 | 30 | 17 | 21 | 311 |
The diagonal values in bold indicate the correct results
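
Per-class precision (PPV), recall (TPR), and overall accuracy follow directly from this confusion matrix. The sketch below is a plain-Python illustration with hypothetical function names; each diagonal entry equals the row total minus the off-diagonal counts of that row.

```python
CLASSES = ["AKL", "HAM", "NYC", "OFA", "PXO", "SAC", "SCL", "TOK"]

# Rows: tested (true) class; columns: predicted class.
CONFUSION = [
    [4, 6, 4, 0, 1, 0, 0, 0],
    [2, 13, 1, 0, 0, 0, 0, 0],
    [1, 1, 113, 11, 0, 0, 0, 0],
    [0, 0, 0, 20, 0, 0, 0, 0],
    [0, 0, 1, 0, 57, 0, 0, 2],
    [0, 0, 3, 0, 0, 30, 1, 0],
    [0, 0, 1, 0, 3, 0, 16, 0],
    [0, 0, 0, 0, 1, 0, 0, 19],
]


def per_class_scores(matrix):
    """Return a list of (PPV, TPR) pairs, one per class."""
    n = len(matrix)
    scores = []
    for i in range(n):
        correct = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column total
        actual = sum(matrix[i])                          # row total
        ppv = correct / predicted if predicted else None
        tpr = correct / actual if actual else None
        scores.append((ppv, tpr))
    return scores


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


for name, (ppv, tpr) in zip(CLASSES, per_class_scores(CONFUSION)):
    print(f"{name}: PPV={ppv:.2f} TPR={tpr:.2f}")
print(f"ACC={accuracy(CONFUSION):.3f}")
```

For example, AKL has 4 correct predictions out of 7 predicted and 15 tested samples, giving a PPV of about 0.57 and a TPR of about 0.27.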
Scores obtained for the primary dataset using cross validation
| Method | Metric | AKL | HAM | NYC | OFA | PXO | SAC | SCL | TOK | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Ryan | #correct | 7 | 10 | 25 | 20 | 60 | 16 | 18 | 20 | |
| | PPV | 0.54 | 0.56 | 0.96 | 0.95 | 0.98 | 1 | 1 | 1 | |
| | TPR | 0.47 | 0.63 | 0.96 | 1 | 1 | 1 | 0.9 | 1 | |
| Sanchez et al. | #correct | 9 | 11 | 110 | 17 | 60 | 34 | 17 | 20 | |
| | PPV | 0.69 | 0.73 | 0.95 | 0.89 | 1 | 0.83 | 0.89 | 0.71 | |
| | TPR | 0.6 | 0.69 | 0.87 | 0.85 | 1 | 1 | 0.85 | 1 | |
| Harris et al. | | - | - | - | - | - | - | - | - | - |
| Walker and Datta | | 0.6 | 0.62 | 0.58 | 0.95 | 0.87 | 0.76 | 0.3 | 0.7 | |
| | | - | - | - | - | - | - | - | - | - |
| Zhu | #correct | 5 | 3 | 114 | 14 | 51 | 31 | 17 | 15 | |
| | TPR | 0.33 | 0.19 | 0.9 | 0.74 | 0.85 | 0.91 | 0.85 | 0.75 | |
| Chierici et al. | | - | - | - | - | - | - | - | - | - |
| Our method using Mash | #correct | 15 | 15 | 50 | 20 | 60 | 31 | 19 | 20 | |
| | PPV | 0.34 | 0.26 | 1.00 | 0.67 | 1.00 | 1.00 | 1.00 | 1.00 | |
| | TPR | 1.00 | 0.94 | 0.40 | 1.00 | 1.00 | 0.91 | 0.95 | 1.00 | |
| Our method using Mash | #correct | 15 | 16 | 42 | 20 | 60 | 34 | 20 | 20 | |
| | PPV | 0.65 | 0.18 | 1.00 | 0.83 | 1.00 | 1.00 | 1.00 | 1.00 | |
| | TPR | 1.00 | 1.00 | 0.33 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
| Our method using Mash | #correct | 15 | 16 | 44 | 20 | 60 | 34 | 19 | 20 | |
| | PPV | 0.60 | 0.18 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | |
| | TPR | 1.00 | 1.00 | 0.35 | 1.00 | 1.00 | 1.00 | 0.95 | 1.00 | |
| Our method using CoMeta (class-level filtering) | #correct | 4 | 12 | 116 | 20 | 37 | 34 | 13 | 20 | |
| | PPV | 0.67 | 0.63 | 0.92 | 0.74 | 1.00 | 0.97 | 1 | 0.42 | |
| | TPR | 0.27 | 0.75 | 0.92 | 1.00 | 0.62 | 1.00 | 0.65 | 1.00 | |
| Our method using CoMeta (sample-level filtering) | #correct | 4 | 13 | 113 | 20 | 57 | 30 | 16 | 19 | |
| | PPV | 0.57 | 0.65 | 0.92 | 0.65 | 0.92 | 1.00 | 0.94 | 0.9 | |
| | TPR | 0.27 | 0.81 | 0.9 | 1.00 | 0.95 | 0.88 | 0.8 | 0.95 | |
We report the number of correctly classified samples (#correct), precision (PPV), and recall (TPR) for each class, as well as the overall accuracy (ACC). Some of the values are missing, as they were not reported in the referenced papers. Also, we show the number of samples (N), as in some works, the results for a subset of all of N=311 samples were reported
Similarities (in %) of the samples in the C1 test set to the individual classes from the primary dataset, obtained using our method
Detailed classification outcomes obtained using different methods for the C1 test set. The correct results are highlighted
Classification scores obtained for the C1 test set using different methods
| Method | Metric | NYC | OFA | PXO | SCL | Overall accuracy |
|---|---|---|---|---|---|---|
| Harris et al. | #correct | 0 | 5 | 10 | 5 | |
| | PPV | - | 0.83 | 1.00 | 1.00 | |
| | TPR | 0.00 | 1.00 | 1.00 | 1.00 | |
| Walker and Datta | #correct | 4 | 1 | 9 | 2 | |
| | PPV | 0.67 | 1.00 | 0.75 | 0.50 | |
| | TPR | 0.40 | 0.20 | 0.90 | 0.40 | |
| Zhu | #correct | 3 | 5 | 7 | 4 | |
| | PPV | - | 5.00 | 0.58 | 1.00 | |
| | TPR | 0.30 | 1.00 | 0.70 | 0.80 | |
| Chierici et al. | #correct | 10 | 0 | 10 | 5 | |
| | PPV | 0.67 | - | 1.00 | 1.00 | |
| | TPR | 1.00 | 0.00 | 1.00 | 1.00 | |
| Our method using Mash | #correct | 0 | 3 | 4 | 2 | |
| | PPV | - | 1 | 1 | 0.5 | |
| | TPR | 0 | 0.6 | 0.4 | 0.4 | |
| Our method using Mash | #correct | 0 | 3 | 6 | 5 | |
| | PPV | - | 1 | 1 | 1 | |
| | TPR | 0 | 0.6 | 0.6 | 1 | |
| Our method using Mash | #correct | 0 | 3 | 5 | 4 | |
| | PPV | - | 1 | 1 | 1 | |
| | TPR | 0 | 0.6 | 0.5 | 0.8 | |
| Our method using CoMeta (class-level filtering) | #correct | 10 | 4 | 2 | 4 | |
| | PPV | 0.91 | 1.00 | 0.91 | 1.00 | |
| | TPR | 1.00 | 0.80 | 1.00 | 0.80 | |
| Our method using CoMeta (sample-level filtering) | #correct | 10 | 4 | 10 | 4 | |
| | PPV | 0.91 | 1.00 | 1.00 | 1.00 | |
| | TPR | 1.00 | 0.80 | 0.20 | 0.80 | |
We report the number of correctly classified samples (#correct), precision (PPV), and recall (TPR) for each class, as well as the overall accuracy (ACC)
Similarities (in %) of the samples that originate from Ilorin (Nigeria) in the C2 test set to the individual classes from the primary dataset, obtained using our method
Similarities (in %) of the samples that originate from Lisbon (Portugal) in the C2 test set to the individual classes from the primary dataset, obtained using our method
Similarities (in %) of the samples that originate from Boston (USA) in the C2 test set to the individual classes from the primary dataset, obtained using our method
Mutual similarities (in %) between the samples in the C3 test set, obtained using our method
Mutual similarities (in %) between the samples in the C3 test set, obtained using our method, normalized independently for each row
The samples were sorted manually to identify four clusters (cluster 1: C3_01, C3_13, and C3_15, cluster 2: C3_07, C3_08, C3_10, C3_11, and C3_16, cluster 3: C3_03, C3_05, C3_09, and C3_12, and cluster 4: C3_04, C3_06, C3_14, and C3_02)
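
Row-wise normalization of a similarity matrix can be sketched as below. The exact scheme is not stated in this caption, so scaling each row so that its largest entry becomes 100% is an assumption made for illustration; the function name and the input values are hypothetical and are not results from the study.

```python
def normalize_rows(matrix):
    """Scale each row independently so that its largest entry becomes 100 (percent)."""
    normalized = []
    for row in matrix:
        peak = max(row)
        normalized.append([100.0 * v / peak if peak else 0.0 for v in row])
    return normalized


# Hypothetical raw similarities between three samples (not values from the study).
raw = [
    [50.0, 10.0, 5.0],
    [2.0, 8.0, 4.0],
    [1.0, 1.0, 4.0],
]
for row in normalize_rows(raw):
    print([round(v, 1) for v in row])
```

Normalizing each row separately makes samples with very different absolute similarity levels comparable, which helps when grouping them into clusters by eye.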
Similarities (in %) of the samples in the C3 test set to the individual classes from the primary dataset and from the C2 test set, obtained using our method
Three out of four ground-truth origins were identical to those of the samples from the C2 set
The number of unique k-mers in the class-level databases extracted from the primary dataset (for k=24) after filtering infrequent k-mers (with ci=4) from (i) sample-level databases and (ii) class-level databases
| Class name | Class-level filtering | Sample-level filtering |
|---|---|---|
| Chile, Santiago | 3,330,241,847 | 1,947,678,404 |
| Japan, Tokyo | 6,179,603,359 | 3,436,570,406 |
| New Zealand, Auckland | 586,168,771 | 567,504,772 |
| New Zealand, Hamilton | 897,549,433 | 845,417,208 |
| Nigeria, Offa | 3,293,428,857 | 2,833,690,965 |
| Portugal, Porto | 3,793,750,265 | 3,108,855,323 |
| USA, New York | 7,413,034,106 | 4,252,342,215 |
| USA, Sacramento | 2,413,540,643 | 599,036,464 |
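
The difference between the two filtering variants can be illustrated on toy data. The sketch below assumes that ci is a minimum-count cutoff (k-mers seen fewer than ci times are discarded), that class-level filtering pools counts over all samples of a class before applying the cutoff, and that sample-level filtering applies the cutoff within each sample before taking the union. The function names are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across the reads of one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def class_level_db(samples, k, ci):
    """Pool counts over all samples of a class, then drop k-mers seen fewer than ci times."""
    pooled = Counter()
    for reads in samples:
        pooled.update(count_kmers(reads, k))
    return {kmer for kmer, n in pooled.items() if n >= ci}


def sample_level_db(samples, k, ci):
    """Drop infrequent k-mers within each sample first, then take the union."""
    db = set()
    for reads in samples:
        counts = count_kmers(reads, k)
        db |= {kmer for kmer, n in counts.items() if n >= ci}
    return db


# Two toy samples of one class, with k=4 and ci=2.
samples = [["ACGT", "ACGT", "TTTT"], ["ACGT", "ACGT", "TTTT"]]
print(class_level_db(samples, k=4, ci=2))   # TTTT reaches the cutoff only when pooled
print(sample_level_db(samples, k=4, ci=2))
```

Under these assumptions the sample-level database is always a subset of the class-level one, which agrees with the smaller sample-level counts in the table.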