Runzhi Zhang (1), Alejandro R Walker (2), Susmita Datta (3). (1) Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, FL, 32610, USA. (2) Department of Oral Biology, University of Florida, 1395 Center Drive, Gainesville, FL, 32610, USA. (3) Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, FL, 32610, USA. susmita.datta@ufl.edu.
Abstract
BACKGROUND: The composition of microbial communities can be location-specific, and differences in taxon abundance among locations could help reveal city-specific signatures and accurately predict the origin of a sample. In this study, whole genome shotgun (WGS) metagenomics data from samples collected across 16 cities around the world and samples from another 8 cities were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB "Forensic Challenge". Feature selection, normalization, three machine learning methods, PCoA (principal coordinates analysis) and ANCOM (analysis of composition of microbiomes) were applied to both the main and mystery datasets. RESULTS: Feature selection, combined with the machine learning methods, revealed that the combination of the common features was effective for predicting the origin of the samples. Average error rates of 11.93% and 30.37% across the three machine learning methods were obtained for the main and mystery datasets, respectively. When the samples from the main dataset were used to predict the labels of samples from the mystery dataset, nearly 89.98% of the test samples could be correctly labeled as "mystery" samples. PCoA showed that nearly 60% of the total variability of the data could be explained by the first two PCoA axes. Although many cities overlapped, some cities were separated in the PCoA. The results of ANCOM, combined with the importance scores from the Random Forest (RF), indicated that the common "family" and "order" features of the main dataset and the common "order" features of the mystery dataset were the most informative for prediction. CONCLUSIONS: The classification results suggested that the composition of the microbiomes was distinctive across the cities and could be used to identify sample origins. This was also supported by the results from ANCOM and the importance scores from the RF. In addition, the accuracy of the prediction could be improved with more samples and greater sequencing depth.
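The statement that the first two PCoA axes explain nearly 60% of the total variability can be illustrated with a minimal sketch of classical PCoA. This is not the authors' pipeline: the Bray-Curtis dissimilarity, the toy count matrix, and the helper names `bray_curtis` and `pcoa` are all assumptions made for illustration, and the abstract does not state which distance measure was used.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples).

    Assumption: the abstract does not name a distance; Bray-Curtis is
    used here only because it is common for abundance tables.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den if den > 0 else 0.0
    return dist

def pcoa(dist):
    """Classical PCoA on a symmetric distance matrix.

    Returns (coords, explained), where explained[k] is the share of
    variability on axis k, computed over positive eigenvalues only.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances (Gower's centering).
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)
    # Sort axes from largest to smallest eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Drop zero and negative eigenvalues (non-Euclidean distances can yield them).
    keep = eigvals > 1e-12
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    coords = eigvecs * np.sqrt(eigvals)
    explained = eigvals / eigvals.sum()
    return coords, explained

# Hypothetical toy abundance table: 5 samples (rows) x 4 taxa (columns).
toy_counts = np.array([
    [120, 30, 5, 0],
    [110, 40, 8, 1],
    [10, 5, 90, 60],
    [8, 2, 100, 70],
    [50, 50, 50, 50],
])

coords, explained = pcoa(bray_curtis(toy_counts))
first_two = explained[:2].sum()
print(f"Share of variability on the first two axes: {first_two:.2%}")
```

Plotting the first two columns of `coords`, colored by city, would give the kind of ordination in which some cities separate while others overlap.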
Keywords:
ANCOM; Linear discriminant analysis; Machine learning; Microbiome; OTU; PCoA; Random Forest; Support vector machine; WGS