Zachary N. Harris, Eliza Dhungel, Matthew Mosior, Tae-Hyuk Ahn.
Abstract
BACKGROUND: Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environment, and it is widely used to survey the communities of microorganisms living in diverse ecosystems. To understand the metagenomic profile of one of the densest interaction spaces for millions of people, the public transit system, the MetaSUB International Consortium has collected and sequenced metagenomes from subways of different cities across the world. In collaboration with CAMDA, MetaSUB has made the metagenomic samples from these cities available for an open data-analysis challenge including, but not limited in scope to, the identification of unknown samples.
Keywords: CAMDA; Machine learning; MetaSUB; Metagenomics; Taxonomy profiling
Year: 2019 PMID: 31370905 PMCID: PMC6676585 DOI: 10.1186/s13062-019-0242-0
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Primary and unknown data sets: sample sizes for each city and the unknown set, along with the size of the clean files (in GB)
| Location | Acronym | Number of samples | Total size (GB) of clean files (FASTQ format) | Total number of reads (filtered) |
|---|---|---|---|---|
| Auckland, New Zealand | AKL | 15 | 47.8 | 136,022,160 |
| Hamilton, Canada | HAM | 16 | 61.5 | 179,554,428 |
| Sacramento, US | SAC | 16 | 36.5 | 105,326,430 |
| Santiago, Chile | SCL | 20 | 215.3 | 613,721,390 |
| Offa, Nigeria | OFA | 20 | 438.2 | 1,267,427,220 |
| Porto, Portugal | PXO | 60 | 132.2 | 380,372,340 |
| Tokyo, Japan | TOK | 20 | 308.6 | 1,103,076,136 |
| New York, US | NYC | 26 | 368.8 | 1,086,713,476 |
| Unknown | UNK | 30 | 75.3 | 219,935,058 |
Fig. 1 The analysis pipeline presented in this paper. Here we show the two-pronged approach used in this analysis: the data were analyzed under a read-based and an assembly-based approach. In the read-based approach, we used taxonomic profiling to generate machine learning features for city prediction. In the assembly-based approach, we used two different reduced-representation paradigms to generate the features for machine learning
Fig. 2 LDA plots of the read-based approach. a LDA with all species. b LDA with rare species (present in < 5% of samples) removed
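The projection in Fig. 2 can be approximated with scikit-learn. This is a minimal sketch only: the feature matrix here is synthetic, standing in for the per-species taxonomic abundances, and the labels stand in for the eight cities; none of the shapes or values come from the paper.

```python
# Sketch of an LDA projection as in Fig. 2 (synthetic stand-in data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_samples, n_species, n_cities = 120, 50, 8
X = rng.random((n_samples, n_species))          # stand-in abundance features
y = rng.integers(0, n_cities, size=n_samples)   # stand-in city labels

# Project onto the first two discriminant axes for a 2-D scatter plot
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)
print(X_2d.shape)  # (120, 2)
```

For panel b, the rare-species filter would drop feature columns present in fewer than 5% of samples before fitting the LDA.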
Fig. 3 Confusion matrices for the read-based approach. a Confusion matrix for the random forest model trained on a random 70/30 train/test data partition. b Confusion matrix for the random forest model trained on a random 70/30 train/test data partition of the rare-species-removed data set
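The evaluation behind Fig. 3, a random forest trained on a 70/30 split and scored with a confusion matrix, can be sketched as follows. The data are synthetic stand-ins for the taxonomic features, and the hyperparameters are illustrative assumptions rather than the paper's settings.

```python
# Sketch of a 70/30 random-forest evaluation as in Fig. 3 (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((120, 50))           # stand-in abundance features
y = rng.integers(0, 8, size=120)    # stand-in labels for eight cities

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, rf.predict(X_te))  # rows: true city, cols: predicted
print(cm.sum())  # 36 test samples in total
```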
System usage for the read-based approach and the two assembly-based approaches (PP and PL), measured on a single node
| Method | CPU usage | Wall-clock time (hours) | Memory usage (RAM) |
|---|---|---|---|
| Read-based | 16 cores | 187.2 | 62 GB |
| PP assembly | 24 cores | 83.28 | 500 GB |
| PL assembly | 24 cores | 38.4 | 500 GB |
Evaluation of the predictions for the 30 unknown-city samples from the read-based RF and the PP-assembly-based RF. Predictions that match neither the true labels nor each other are shown in red; predictions that do not match the true labels but agree with each other are shown in blue
Fig. 4 LDA of the assembly-based approach. a LDA of the random paired-end subset assembly (PP). b LDA of the left-only subset assembly (PL)
Fig. 5 Confusion matrices for the assembly-based approach. a Confusion matrix for the random forest model trained on a random 70/30 train/test data partition of the random paired-end subset assembly. b Confusion matrix for the random forest model trained on a random 70/30 train/test data partition of the left-only assembly
Model prediction accuracies (%) based on cross-validation of the training set. RF-10: random forest with 10 random decision trees; RF-20: random forest with 20 random decision trees; SVM: default support vector machine; SVM-N: SVM with normalized features; MLP: default multilayer perceptron; MLP-C: multilayer perceptron with complex nodal architecture (described in Methods)
| Model | Accuracy (%) |
|---|---|
| RF-10 | 87.9 |
| RF-20 | 89.7 |
| SVM | 43.1 |
| SVM-N | 32.8 |
| MLP | 63.7 |
| MLP-C | 55.2 |
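A model comparison like the table above can be sketched with cross-validated accuracy in scikit-learn. This is illustrative only: the data are synthetic, the models merely approximate the labels RF-10/RF-20/SVM/SVM-N/MLP, and the MLP-C architecture from the Methods is not reproduced.

```python
# Sketch of a cross-validated model comparison (synthetic stand-in data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.random((120, 50))
y = rng.integers(0, 8, size=120)

models = {
    "RF-10": RandomForestClassifier(n_estimators=10, random_state=0),
    "RF-20": RandomForestClassifier(n_estimators=20, random_state=0),
    "SVM":   SVC(),
    "SVM-N": make_pipeline(StandardScaler(), SVC()),  # normalized features
    "MLP":   MLPClassifier(max_iter=500, random_state=0),
}
# Mean accuracy over 5 stratified folds for each model
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```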