Quy Cao (1), Xinxin Sun (2), Karun Rajesh (3, 4), Naga Chalasani (5), Kayla Gelow (6), Barry Katz (6), Vijay H Shah (7), Arun J Sanyal (8), Ekaterina Smirnova (2).
1. Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
2. Biostatistics Department, Virginia Commonwealth University, Richmond, VA, United States.
3. Bioinformatics Department, Virginia Commonwealth University, Richmond, VA, United States.
4. Department of Biostatistics, Harvard University, Boston, MA, United States.
5. Division of Gastroenterology, Department of Internal Medicine, Indiana University, Indianapolis, IN, United States.
6. Department of Biostatistics, Indiana University, Indianapolis, IN, United States.
7. Division of Gastroenterology, Department of Internal Medicine, Mayo Clinic, Rochester, MN, United States.
8. Division of Gastroenterology, Hepatology and Nutrition, Department of Internal Medicine, Virginia Commonwealth University, Richmond, VA, United States.
Abstract
Background: The accuracy of microbial community detection in 16S rRNA marker-gene and metagenomic studies suffers from contamination and sequencing errors, which lead either to falsely identifying microbial taxa that were not in the sample or to misclassifying the taxa of DNA fragment reads. Removing contaminants and filtering rare features are two common approaches to this problem. Contaminant detection methods use auxiliary information from the sequencing process to identify known contaminants, whereas filtering methods remove taxa that are present in a small number of samples and have small counts in the samples where they are observed. Filtering reduces the extreme sparsity of microbiome data and has been shown to correctly remove contaminant taxa in cultured "mock" datasets, where the true taxa compositions are known. Although filtering is frequently used, careful evaluation of its effect on the data analysis and scientific conclusions remains unreported. Here, we assess the effect of filtering on alpha and beta diversity estimation, as well as its impact on identifying taxa that discriminate between disease states.

Results: The effect of filtering on microbiome data analysis is illustrated on four datasets: two mock quality control datasets, in which the same cultured samples with known microbial composition are processed at different labs, and two disease study datasets. In the quality control datasets, filtering reduces the magnitude of differences in alpha diversity and alleviates technical variability between labs while preserving between-sample similarity (beta diversity). In the disease study datasets, DESeq2 and linear discriminant analysis Effect Size (LEfSe) were used to identify taxa that are differentially abundant across groups of samples, and random forest models were used to rank the features that contribute most to disease classification. Filtering retains the significant taxa and preserves the classification ability of the models, as measured by the area under the receiver operating characteristic curve (AUC). A comparison of filtering with the contaminant removal method shows that the two have complementary effects, and we advise using them in conjunction.

Conclusions: Filtering reduces the complexity of microbiome data while preserving their integrity in downstream analysis. This mitigates the sensitivity of classification methods and reduces technical variability, allowing researchers to generate more reproducible and comparable results in microbiome data analysis.
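The filtering rule described in the Background (drop taxa that appear in few samples and with small counts) can be illustrated with a minimal sketch. The function names, the toy count table, and the thresholds `min_samples=2` and `min_count=2` are assumptions chosen for illustration; they are not the thresholds or software used in the study.

```python
import math


def filter_rare_taxa(counts, min_samples=2, min_count=2):
    """Keep taxa with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps a taxon name to its list of per-sample read counts.
    Both thresholds are illustrative assumptions, not values from the study.
    """
    kept = {}
    for taxon, row in counts.items():
        n_supporting = sum(1 for c in row if c >= min_count)
        if n_supporting >= min_samples:
            kept[taxon] = row
    return kept


def shannon(counts, sample_idx):
    """Shannon alpha diversity (natural log) of one sample in the count table."""
    col = [row[sample_idx] for row in counts.values() if row[sample_idx] > 0]
    total = sum(col)
    return -sum((c / total) * math.log(c / total) for c in col)


# Hypothetical toy table: four taxa observed across four samples.
toy = {
    "taxon_A": [120, 95, 110, 130],
    "taxon_B": [40, 55, 38, 61],
    "taxon_C": [0, 1, 0, 0],  # a single read in a single sample
    "taxon_D": [0, 0, 3, 0],  # observed in only one sample
}

filtered = filter_rare_taxa(toy)
print(sorted(filtered))
print(shannon(toy, 1), shannon(filtered, 1))
```

In this toy table the two rare taxa are dropped, so the Shannon index of a sample that contained one of them decreases slightly after filtering, which is the kind of shift in alpha diversity that the Results section evaluates on real data.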