Vivek Ramanan (1,2), Shanti Mechery (2), Indra Neil Sarkar (1,2,3)
(1) Center of Computational Molecular Biology, Brown University, Providence, Rhode Island, USA. (2) Center for Biomedical Informatics, Brown University, Providence, Rhode Island, USA. (3) Rhode Island Quality Institute, Providence, Rhode Island, USA.
Abstract
MOTIVATION: Microbiome datasets are often constrained by sequencing limitations. GenBank, maintained by the National Center for Biotechnology Information (NCBI), is the largest collection of publicly available DNA sequences. The metadata of GenBank records are a largely understudied resource and may be uniquely leveraged to access the sum of prior studies focused on microbiome composition. Here, we developed a computational pipeline to analyze GenBank metadata containing data on hosts, microorganisms, and their place of origin. This work provides the first opportunity to leverage the totality of GenBank to shed light on the compositional data practices that shape how microbiome datasets are formed and to examine host-microbiome relationships.

RESULTS: The collected dataset spans multiple kingdoms of microorganisms (bacteria, viruses, archaea, protozoa, fungi, and invertebrate parasites) and hosts from multiple taxonomic classes, including mammals, birds, and fish. A human subset of this dataset provides insights into gaps in current microbiome data collection, which is biased towards clinically relevant pathogens. Clustering and phylogenetic analyses reveal the potential to use these data to model host taxonomy and evolution, with groupings formed by host diet, environment, and coevolution.

AVAILABILITY: The GenBank Host-Microbiome Pipeline is available at https://github.com/bcbi/genbank_holobiome. The GenBank loader is available at https://github.com/bcbi/genbank_loader.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
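To make the kind of metadata described above concrete, the following is a minimal sketch of how host, organism, and place-of-origin qualifiers could be read from GenBank flat-file text and tallied into host-microbe pairs. It is not the authors' pipeline (see the repositories listed under AVAILABILITY for that). The function names `source_metadata` and `host_microbe_counts` and the toy records in `EXAMPLE` are hypothetical, and the sketch assumes each qualifier value fits on a single line.

```python
import re
from collections import Counter

# Matches single-line qualifiers such as:   /host="Homo sapiens"
# Assumption: values do not wrap onto a second line.
QUALIFIER = re.compile(r'^\s*/(organism|host|country)="([^"]*)"')


def source_metadata(record_text):
    """Return the first organism, host, and country qualifier in one record."""
    found = {}
    for line in record_text.splitlines():
        match = QUALIFIER.match(line)
        if match and match.group(1) not in found:
            found[match.group(1)] = match.group(2)
    return found


def host_microbe_counts(flat_file_text):
    """Count (host, organism) pairs across records terminated by '//' lines."""
    counts = Counter()
    for record in re.split(r"^//[ \t]*$", flat_file_text, flags=re.MULTILINE):
        meta = source_metadata(record)
        # Records without a host qualifier carry no host-microbe relationship.
        if "host" in meta and "organism" in meta:
            counts[(meta["host"], meta["organism"])] += 1
    return counts


# Toy records invented for illustration; they are not real GenBank entries.
EXAMPLE = '''LOCUS       TOY0001
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="Escherichia coli"
                     /host="Homo sapiens"
                     /country="USA"
//
LOCUS       TOY0002
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="Escherichia coli"
                     /host="Homo sapiens"
//
LOCUS       TOY0003
FEATURES             Location/Qualifiers
     source          1..100
                     /organism="Homo sapiens"
//
'''

if __name__ == "__main__":
    for (host, organism), n in host_microbe_counts(EXAMPLE).items():
        print(host, organism, n)
```

Counting pairs rather than individual records is one simple way to surface the collection biases mentioned in RESULTS: heavily studied host-pathogen combinations accumulate large counts, while rarely sampled ones stay near zero.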