A curated data resource of 214K metagenomes for characterization of the global antimicrobial resistome.

Hannah-Marie Martiny¹, Patrick Munk¹, Christian Brinch¹, Frank M Aarestrup¹, Thomas N Petersen¹.

Abstract

The growing threat of antimicrobial resistance (AMR) calls for new epidemiological surveillance methods, as well as a deeper understanding of how antimicrobial resistance genes (ARGs) have been transmitted around the world. The large pool of sequencing data available in public repositories provides an excellent resource for monitoring the temporal and spatial dissemination of AMR in different ecological settings. However, only a limited number of research groups globally have the computational resources to analyze such data. We retrieved 442 Tbp of sequencing reads from 214,095 metagenomic samples from the European Nucleotide Archive (ENA) and aligned them using a uniform approach against ARGs and 16S/18S rRNA genes. Here, we present the results of this extensive computational analysis and share the counts of reads aligned. Over 6.76 × 10⁸ read fragments were assigned to ARGs and 3.21 × 10⁹ to rRNA genes, where we observed distinct differences in both the abundance of ARGs and the link between microbiome and resistome compositions across various sampling types. This collection is another step towards establishing global surveillance of AMR and can serve as a resource for further research into the environmental spread and dynamic changes of ARGs.

Year: 2022 PMID: 36067158 PMCID: PMC9447899 DOI: 10.1371/journal.pbio.3001792

Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 9.593

Introduction

The vast amount of genomic data available in public data repositories is a unique and potentially important resource for research and genomic surveillance of antimicrobial resistance (AMR). Using these datasets, collected from locations all over the world, across different years, and from various sampling sources, might further aid our understanding of the emergence and distribution of antimicrobial resistance genes (ARGs). Depositing genomic sequence data in a public repository is today a major and often mandatory step when publishing in peer-reviewed journals, and several repositories have been created for this purpose by the members of the International Nucleotide Sequence Database Collaboration (INSDC) [1], including the European Nucleotide Archive (ENA) [2]. The amount of sequencing data available at ENA continues to increase, with an estimated doubling time of 18 months (https://www.ebi.ac.uk/ena/browser/about/statistics; accessed 2022-03-08). Several approaches for analyzing genomic data, depending on the sample type, are already well established. However, the exploration of these resources is often restricted to a few research groups, since both sufficient skills in bioinformatics and access to high-performance computing resources are needed to handle the large amount of available data. Existing collections of analyzed datasets tend to focus either on specific sample sources, such as humans [3,4], marine environments [5], or urban sewage [6,7], or on specific genera [8]. The COVID-19 pandemic in particular has highlighted the value of data sharing to trace the spread and evolution of the virus [9]. Despite the attempts to standardize the analysis workflows of these databases, they are limited in their ability to generalize across environments and locations. A recent study [10] has shared a searchable collection of 661K bacterial genomes for exploring the global bacterial diversity across different origins, providing an easy-to-access resource for genomic research.
While this is an impressive data-sharing effort, the authors did not include metagenomic samples in their pipeline. Metagenomic techniques aim to sequence all DNA in a sample and can be used to characterize the microbiome in different environments [11,12], discover novel organisms [13], monitor disease [14,15], and track specific genes, such as ARGs [5,6,16]. Here, we present a large-scale metagenomic analysis of 214,095 metagenomic samples retrieved from ENA. We used an assembly-free approach, aligning sequencing reads against ARGs and 16S/18S ribosomal RNA genes. We have previously published an in-depth analysis of the distribution of mobilized colistin resistance [17] based on these data. Now we both share the entire collection of mapping results and showcase how to characterize the global resistome and microbiome with this dataset. The curated metadata and mapping results are available at https://doi.org/10.5281/zenodo.6919377 and documentation at https://hmmartiny.github.io/mARG/Tables.html.

Materials and methods

Retrieval of metagenomes

We retrieved metagenomic datasets from ENA [2] uploaded between 2010-01-01 and 2020-01-01 that had library source as “METAGENOMIC” and library strategy of “WGS.” We collected 214,095 sequencing runs from 146,732 samples from 6,307 projects corresponding to 442 Tbp of raw reads taking up 300 TB of storage. The associated metadata for each sample was also retrieved.

Preprocessing and mapping of sequencing reads

The retrieved raw FASTQ reads were trimmed and aligned against reference sequences, as outlined in Martiny et al. (2022) [17]. In brief, we used FASTQC v.0.11.15 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for read quality checking and BBduk2 v.36.49 [18] for trimming the raw sequencing reads. With the k-mer-based alignment tool KMA 1.2.21 [19], the trimmed reads were mapped against reference sequences from 2 different databases: the AMR gene database ResFinder [20] (downloaded 2020-01-25), which contained 3,085 sequences of acquired ARGs, and the rRNA gene database Silva [21] (version 138, downloaded 2020-01-16), which had 2,225,272 reference sequences, more than 88% of them being 16S/18S rRNA genes. For KMA, we used the following alignment parameters: scores of 1, −2, −3, and −1 for a match, mismatch, gap opening, and gap extension, respectively. For read pairing, we used a value of 7 and a minimum relative alignment score of 0.75. Data retrieval, quality checking, trimming, and read alignments were done using the Danish National Supercomputer for Life Sciences (https://www.computerome.dk/).
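To make the scoring parameters above concrete, the following is a minimal sketch of how an alignment could be scored with a match of 1, mismatch of −2, gap opening of −3, and gap extension of −1, and then compared with a threshold of 0.75. It is not KMA's implementation. Two things are assumptions made only for illustration: the gap convention (the first gapped position pays the opening cost and each further position pays the extension cost) and the definition of the relative score as the raw score divided by the alignment length. The function names are hypothetical.

```python
def alignment_score(query, template, match=1, mismatch=-2,
                    gap_open=-3, gap_extend=-1):
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(query) != len(template):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_side = None  # which sequence the current run of gaps is in
    for q, t in zip(query, template):
        if q == "-" or t == "-":
            side = "query" if q == "-" else "template"
            # Assumption: opening cost for the first gapped position,
            # extension cost for each following position of the same gap.
            score += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += match if q == t else mismatch
            gap_side = None
    return score


def relative_score(query, template):
    """Assumption: relative score = raw score / alignment length."""
    if not query:
        raise ValueError("alignment must not be empty")
    return alignment_score(query, template) / len(query)


def passes_threshold(query, template, min_relative_score=0.75):
    """True if the alignment reaches the minimum relative score."""
    return relative_score(query, template) >= min_relative_score
```

Under these assumptions, a single mismatch in an 8-base alignment gives a relative score of 5/8, which falls below 0.75, so such a short alignment would be rejected while a perfect match would pass.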

Standardization of metadata

The following attributes for each metagenome were standardized: sampling location, sampling host or environment (referred to as a host below), and sampling date. To standardize the label for sampling locations, we looked at the values entered in the two fields “country” and “location.” First, the latitude and longitude coordinates were mapped to a country using the Python library Shapely 1.7.1 [22] to find the matching area defined in one of the 3 public domain map datasets (countries, marine, and lakes) available in the Natural Earth Data collection. If the lookup failed or the coordinates were not given, the second step was to match the text attribute in the country label to ISO 3166 country codes with a fuzzy search with the Python library PyCountry 20.7.3 (https://github.com/flyingcircusio/pycountry). Finally, if the 2 lookup searches did not yield a match, we did a manual lookup of the country labels to standardize the text. For the standardization of host labels, we mapped the taxonomic id given by the attribute “host_tax_id” to the NCBI Taxonomy database [23], or if the feature was missing, the “tax_id” was used instead. Since the only way to curate entered collection dates is to look up suspicious dates in published studies manually, and that was deemed too time-intensive, we decided to replace dates entered as later than 2020-01-01 in the sample attribute field “collection_date” with the missing value NULL.
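The lookup order and the date rule described above can be sketched as follows. This is an illustration, not the authors' code: the three lookups are passed in as hypothetical callables and a hypothetical manual table instead of calling Shapely or PyCountry directly, dates are assumed to be ISO-formatted strings (YYYY-MM-DD), and treating unparsable dates as missing is an additional assumption.

```python
from datetime import date

CUTOFF = date(2020, 1, 1)  # dates later than this are treated as missing


def clean_collection_date(raw, cutoff=CUTOFF):
    """Return a date, or None if missing, unparsable, or later than cutoff."""
    if not raw:
        return None
    try:
        parsed = date.fromisoformat(raw)
    except ValueError:
        # Assumption: values that are not ISO dates are treated as missing.
        return None
    return None if parsed > cutoff else parsed


def standardize_country(coords, country_text, by_coords, by_text, manual):
    """Try the coordinate lookup, then the fuzzy text lookup, then a manual table.

    by_coords and by_text are hypothetical callables that return a country
    code or None; manual is a hypothetical dict keyed by lower-cased labels.
    """
    if coords is not None:
        hit = by_coords(coords)
        if hit:
            return hit
    if country_text:
        hit = by_text(country_text)
        if hit:
            return hit
        return manual.get(country_text.strip().lower())
    return None
```

Keeping the lookups as injected callables makes the fallback order easy to test in isolation; in practice they would wrap the point-in-polygon search and the fuzzy country-name search.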

Measuring the abundance of ARGs

Since we report the fragment count aligned to each reference gene, the mapping results are compositional and should be treated as such [24]. In the simplest form, the ARG abundance for a sample or sample group can be calculated as the log-ratio of the count of reads, $n_i$, aligned to each ARG $i$ over the total sum of rRNA read fragments, $n_{\mathrm{rRNA}}$:

$$\text{ARG load} = \log\left(\frac{\sum_{i=1}^{D} n_i}{n_{\mathrm{rRNA}}}\right)$$

where $D$ is the number of ARGs and $n_{\mathrm{rRNA}}$ is the number of read fragments aligned to rRNA genes. Each ARG count $n_i$ has been adjusted with the length of the gene in kilobases. The relative abundance of resistance classes was calculated as the proportion of ARG fragments assigned to each class, scaled with κ = 100:

$$p_c = \kappa \cdot \frac{\sum_{i \in c} n_i}{\sum_{i=1}^{D} n_i}$$

where the sum in the numerator runs over the ARGs belonging to resistance class $c$.
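As a worked illustration of the two abundance measures described above, the sketch below computes a log-ratio ARG load and the class composition scaled to 100 from plain dictionaries. It rests on several assumptions: the natural logarithm is used because the text does not state the base, the length adjustment is applied as count divided by gene length in kilobases, the class proportions are computed on the length-adjusted counts, and the gene and class names in the usage note are placeholders.

```python
import math
from collections import defaultdict


def length_adjusted(arg_counts, gene_lengths_bp):
    """Divide each ARG fragment count by the gene length in kilobases."""
    return {g: n / (gene_lengths_bp[g] / 1000.0) for g, n in arg_counts.items()}


def arg_load(arg_counts, gene_lengths_bp, rrna_fragments):
    """Log-ratio of summed, length-adjusted ARG counts over rRNA fragments."""
    if rrna_fragments <= 0:
        raise ValueError("need at least one rRNA fragment")
    total = sum(length_adjusted(arg_counts, gene_lengths_bp).values())
    if total <= 0:
        raise ValueError("no ARG fragments: the log-ratio is undefined")
    return math.log(total / rrna_fragments)  # assumption: natural log


def class_composition(arg_counts, gene_lengths_bp, gene_class, kappa=100.0):
    """Proportion of adjusted ARG counts per resistance class, scaled by kappa."""
    per_class = defaultdict(float)
    for gene, value in length_adjusted(arg_counts, gene_lengths_bp).items():
        per_class[gene_class[gene]] += value
    grand_total = sum(per_class.values())
    if grand_total <= 0:
        raise ValueError("no ARG fragments: proportions are undefined")
    return {c: kappa * v / grand_total for c, v in per_class.items()}
```

For example, with counts `{"geneA": 10, "geneB": 20}`, lengths of 1,000 and 2,000 bp, and 100 rRNA fragments, both genes contribute 10 adjusted fragments, so the load is log(20/100) and two classes with one gene each would receive 50 each.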

Diversity measurements

Besides the read abundance values, we report the species richness, Shannon diversity index [25], and the Gini–Simpson [26] diversity index of read counts of ARGs, genera, and phyla per sample. Species richness is the number of different genes or taxonomic groups present in the sample with at least 1 read fragment aligned. The Shannon index (H′) was calculated using the proportions of reads, $p_i = n_i / N$:

$$H' = -\sum_{i=1}^{D} p_i \log p_i$$

whereas the Gini–Simpson index (GS) was calculated using the read counts $n = [n_1, \ldots, n_D]$, where $D$ is the number of genes or taxonomic groups and $N = \sum_{i=1}^{D} n_i$ is the total count of reads for the group:

$$GS = 1 - \frac{\sum_{i=1}^{D} n_i (n_i - 1)}{N (N - 1)}$$

Together with these 2 indices, we also report the sample-wise unique number of reference sequences or taxonomic groups matched.
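The three diversity measures described above can be sketched in a few lines. The exact formulas are assumptions where the text leaves them open: the Shannon index uses the natural logarithm, and the Gini–Simpson index uses the count-based form 1 − Σ n(n − 1) / (N(N − 1)), chosen because the text says it was calculated from read counts and their total. Returning 0 for samples with fewer than 2 reads is also an illustrative choice.

```python
import math


def richness(counts):
    """Number of genes or taxonomic groups with at least 1 aligned fragment."""
    return sum(1 for n in counts if n > 0)


def shannon(counts):
    """Shannon index H' from read proportions (assumption: natural log)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def gini_simpson(counts):
    """Gini-Simpson index from read counts (assumed count-based form)."""
    total = sum(counts)
    if total < 2:
        return 0.0  # assumption: undefined cases are reported as 0
    return 1.0 - sum(n * (n - 1) for n in counts) / (total * (total - 1))
```

A sample with counts `[5, 5, 0]` has a richness of 2, a Shannon index of ln 2, and a Gini–Simpson index of 5/9 under these definitions, while a sample dominated by a single group scores 0 on both indices.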

Results

Here, we present a large-scale mapping of 442 Tbp of raw reads from 214,095 metagenomic samples, suitable for analyzing the distribution of acquired antimicrobial resistance genes and 16S/18S rRNA genes. Furthermore, we have spent considerable effort standardizing 3 main sample attributes: sampling date, location, and source. To facilitate easy access and usage, we have shared the mapping results and corrected metadata in 3 different data formats (TSV, HDF, and MySQL dumps). We also provide tutorials with code examples in R and Python on using the data in different scenarios. All data files are available at https://doi.org/10.5281/zenodo.6919377. By collecting the sequencing reads from ENA, we could also verify the inherent bias of specific sample types or sources being overrepresented simply due to their availability in the public repository. While the 214,095 metagenomic datasets were collected from 797 different hosts, most were either of human or marine origin (Fig 1A). A similarly skewed geographical distribution towards European and North American countries was observed in the sampling locations (Fig 1B). The distribution of samples according to the sampling year reveals that a considerable number were collected between 2010 and 2020 (Fig 1C).

Fig 1

Distribution of metagenomes reveals the overrepresentation of samples from specific sources.

(a) Number of samples grouped per sampling host, where only hosts with more than 1,000 samples are plotted. (b) Sample locations for metagenomes with available GPS coordinates; each marker is a sample. A total of 83,903 samples did not have coordinates available. (c) Year in which a sample was collected. A total of 84,238 of the samples did not have a valid sampling date recorded. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377, and the base layer map was created with data from https://www.naturalearthdata.com/.

Of the more than 1.8 × 10¹² raw sequencing reads, corresponding to 442.1 Tbp, 93% were generated using Illumina sequencing technologies (S1 Fig). We mapped over 1.69 × 10¹² trimmed read fragments, with a median of 784,748 fragments per sample (range 1 to 916,901,400) (Fig 2A). Approximately 0.04% of all read fragments could be aligned to ARGs, and 0.19% to rRNA genes. Overall, the count of aligned read fragments increased with the amount of sequencing reads and bases available (S3 Fig). The number of ARG fragments aligned increased with the number of aligned rRNA fragments, although for 34% of the samples, we did not find any ARGs despite having read fragments aligning to 16S rRNA genes (Fig 2B). The microbial differences between the sampling origins were reflected in the number of aligned fragments (S4 Fig).

Fig 2

Distribution of available and aligned fragments.

(a) Density distribution of available fragments per sample. (b) The distribution compares the number of fragments mapped to rRNA genes and ARGs. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

The global abundance of antimicrobial resistance

To measure the global distribution of ARGs and the composition of the resistome, we calculated the abundance of ARGs as the log-ratio of ARG fragments over summed rRNA sequence fragments. Almost all of the reference sequences from the ResFinder database had at least 1 fragment aligned, and only 94 ARGs had no hits (S2 Fig). The median observed resistance load per metagenomic sample was 11.74 (log range: −1.45 to 23.52) (Fig 3A), which appeared to be mainly dependent on the geographic origin and environment (Fig 3B–3D) and not on which year the sample was taken. For example, samples originating from locations within Europe showed similar abundance levels for most of the samples but with several outliers, whereas multiple samples from locations in the Oceania region had a much broader load distribution with few outliers (Fig 3C).

Fig 3

Boxplots of ARG abundances in metagenomic samples show that levels vary across different origins.

(a) Distribution of ARG abundance per sample. (b) Distribution of sample-wise ARG abundance grouped by sampling year. (c) Sample-wise ARG abundance per sampling location. (d) Sample-wise ARG abundance grouped by hosts. Only hosts with more than 1,000 metagenomes analyzed are shown. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

While the distribution of sample-wise resistance loads illustrates the high variability in this data collection (Fig 3), we saw that once we stratified the relative ARG read proportions per resistance class and sample type, there were clear separations between different groups (Fig 4). For the sampling years with a considerable number of samples available (2004 to 2019), the relative proportion of classes was relatively consistent, with Tetracycline reads being the most common, except for a spike of Beta-lactam reads in 2017 (Fig 4A). Across the continents and large water bodies, we observed that ARGs conferring resistance to Aminoglycosides or Beta-lactam antimicrobials were more common in water environments, whereas mainland regions had a more diverse distribution (Fig 4B). Once we stratified by sampling host or source, the distribution of resistance classes was very dependent on the group, as seen by the high proportion of read fragments aligned to, for example, Phenicol for marine and soil samples and Tetracycline reads being highly prevalent in mice (Mus musculus) samples (Fig 4C).

Fig 4

Composition of reads assigned to ARGs from different resistance classes grouped by sampling origin.

(a) Grouped by sampling year. (b) Grouped per sampling location. (c) Grouped per sampling host. Only hosts with more than 1,000 metagenomes analyzed are shown. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

Linking the microbiome diversity with resistance diversity

The relationship between the diversity of the microbiome and the resistance genes was quantified by calculating the species richness and 2 alpha diversity measurements (Shannon and Gini–Simpson) at the ARG level and at the phylum and genus taxonomic levels. Without looking at the sample origin, we observed that a majority of the samples had both high microbial diversity and high ARG diversity (Figs 5 and S5). However, the relationship between genus and ARG diversity indexes differed between sampling sources: in several groups, the 2 diversity measurements did not track each other, suggesting that increased diversity of microbes in, for example, soil samples does not necessarily lead to a higher diversity of resistance genes. In contrast, the chicken (Gallus gallus) samples had elevated ARG diversity despite having lower microbial diversity (Fig 5).

Fig 5

The genus–ARG diversity relationship for all metagenomic samples.

The Gini–Simpson diversity indexes were calculated on genus categories (x-axis) compared to ARG levels (y-axis). Left: scatterplot of all samples. Right: samples colored by selected host or environmental origins. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

Discussion

Global surveillance of AMR based on genomics continues to become more accessible due to the advancement in NGS technologies and the practice of sharing raw sequencing data in public repositories. Standardized pipelines and databases are needed to utilize these large data volumes for tracking the dissemination of AMR. We have uniformly processed the sequencing reads of 214,095 metagenomes for the abundance analysis of ARGs. Our data sharing efforts enable users to perform abundance analyses of individual ARGs, the resistome, and the microbiome across different environments, geographic locations, and sampling years. We have given a brief characterization of the distribution of ARGs according to the collection of metagenomes. However, in-depth analyses remain to be performed to investigate the influence of temporal, geographical, and environmental origins on the dissemination and evolution of antimicrobial resistance. For example, analyzing the spread of specific ARGs across locations and different environments could reveal new transmission routes of resistance and guide the design of intervention strategies to stop the spread. We have previously published a study focusing on the distribution of mobilized colistin resistance (mcr) genes using this data resource, showing how widely disseminated the genes were [17]. Another use of the data collection could be to explore how the changes in microbial abundances affect and are affected by the resistome. Furthermore, our coverage statistics of reads aligned to ARGs could be used to investigate the rate of new variants occurring in different reservoirs. Even though we have focused on the threat of antimicrobial resistance, potential applications of this resource can be to look at the effects of, for example, climate changes on microbial compositions. 
Our observed read fragment counts could also be linked with other types of genomic data, for example, to evaluate the risk of ARG mobility, accessibility, and pathogenicity in assembled genomes [27,28] or to verify observations from clinical data [29]. We recommend that potential users consider all the confounders present in this data collection in their statistical tests and modeling workflows, keeping in mind that the experimental methods and sequencing platforms dictate the obtained sequencing reads and that metadata for a sample might be mislabeled, despite our efforts to minimize those kinds of errors. Furthermore, it is essential to consider the compositional nature of microbiomes [30]: the number of reads obtained depends not on the amount of genetic material in the sample but on the capacity of the sequencing platform [24,31]. Various statistical methods that take compositionality into account already exist [24,32,33]. Finally, it is important to highlight that the results we have presented here include fragment counts of 1 for the sake of transparency, and we recommend that potential users apply appropriate filters in their analysis. The sequencing data in public repositories have continued to grow, giving us plenty of opportunities to expand our data collection further. To establish a truly global surveillance program of AMR, sequencing data should be analyzed as soon as they are published in these archives. Although this would require access to even more computational resources, we hope to achieve this soon and to compare our approach with other methods, such as AMRFinderPlus [34] and CARD [35]. As new sequencing technologies become more widely used, the settings of our alignment procedure should also be tuned to take better advantage of, and account for the flaws of, different sequencing platforms.
With this data resource, we have taken a step towards enabling the scientific community to utilize the wealth of information in these metagenomic samples to broaden our understanding of the dissemination of antimicrobial resistance and changes in microbiomes at both local and global scales through time and environments.
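The two recommendations in the Discussion, filtering low fragment counts and respecting compositionality, could be applied along the following lines. This is a sketch under stated assumptions, not a prescribed workflow: the threshold of 2 fragments is an arbitrary example, the centered log-ratio (CLR) transform is one commonly used compositional method and is not necessarily the one the authors have in mind, and the pseudocount of 0.5 used to handle zeros is an illustrative choice.

```python
import math


def filter_min_fragments(counts, min_fragments=2):
    """Keep only genes with at least min_fragments aligned fragments."""
    return {name: n for name, n in counts.items() if n >= min_fragments}


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a list of counts.

    Assumption: a pseudocount is added to every count so that zeros
    can be log-transformed.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    logs = [math.log(n + pseudocount) for n in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]
```

CLR values sum to zero within a sample and preserve the ordering of the counts, so they can be compared across samples with different sequencing depths; the appropriate filter and pseudocount depend on the analysis.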

Distribution of samples per sequencing instrument platform.

(a) Sample count per platform. (b) Distribution of raw sequencing read counts per platform. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

More than 96% of ARG templates had at least 1 aligned fragment.

The bars illustrate the percentage of ARGs per resistance class without and with at least 1 aligned fragment. The parenthesis after each class label contains the number of genes found out of the total available templates. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

The sample-wise distribution of aligned (a) ARG or (b) rRNA fragments compared to raw sequencing base counts. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

The sample-wise distribution of aligned rRNA fragments and ARG fragments, colored by selected host and environmental sources.

The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

Additional distributions showing the relationship between ARGs and genera for all metagenomic samples.

(a) The richness of genus groups (x-axis) vs. ARG richness (y-axis). (b) The relationship between Shannon diversity index calculated on genus level (x-axis) and ARGs (y-axis). Right: samples colored by selected host or environmental origins. The data underlying this figure can be found at https://doi.org/10.5281/zenodo.6919377.

19 May 2022

Dear Dr Martiny,

Thank you for submitting your manuscript entitled "A curated data resource of 214K metagenomes for characterization of the global resistome" for consideration as a Methods and Resources by PLOS Biology. Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire. Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed the checks it will be sent out for review. To provide the metadata for your submission, please Login to Editorial Manager (https://www.editorialmanager.com/pbiology) within two working days, i.e. by May 23 2022 11:59PM.

If your manuscript has been previously reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process.
Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time. If you would like to send previous reviewer reports to us, please email me at rroberts@plos.org to let me know, including the name of the previous journal and the manuscript ID the study was given, as well as attaching a point-by-point response to reviewers that details how you have or plan to address the reviewers' concerns.

During the process of completing your manuscript submission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF. Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland Roberts, Senior Editor, PLOS Biology, rroberts@plos.org

14 Jul 2022

Dear Dr Martiny,

Thank you for your patience while your manuscript "A curated data resource of 214K metagenomes for characterization of the global resistome" was peer-reviewed at PLOS Biology. It has now been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers. Based on the broadly very favourable reviews, we are likely to accept this manuscript for publication, provided you satisfactorily address the points raised by the reviewers. Please also make sure to address the following data and other policy-related requests.

a) Please address the concerns raised by the three reviewers.
b) Please could you change your Title to something slightly more explicit for our wider readership? We suggest "A curated data resource of 214K metagenomes characterizes the global antimicrobial resistome" c) Please address my Data Policy requests below; specifically, we need you to supply the numerical values underlying Figs 1ABC, 2AB, 3ABCD, 4ABC, 5, S1AB, S2, S3, S4AB. I note that your Zenodo deposition currently only seems to contain relatively “raw” values, rather than those directly shown in the Figure – please could you include the latter, clearly labelled? If you’ve used any custom code, please also include this. d) Please also cite the location of the data clearly in each main and supplementary Fig legend, e.g. “The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.6519844” As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript. We expect to receive your revised manuscript within two weeks. To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following: - a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list - a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable) - a track-changes file indicating any changes that you have made to the manuscript. 
Sincerely, Roli Roberts Roland Roberts, PhD Senior Editor, rroberts@plos.org, PLOS Biology ------------------------------------------------------------------------ DATA POLICY: You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms: 1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore). 2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 1ABC, 2AB, 3ABCD, 4ABC, 5, S1AB, S2, S3, S4AB. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values). IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend. 
Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

DATA NOT SHOWN?

Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).

REVIEWERS' COMMENTS:

Reviewer #1:

Martiny et al. describe a new data resource that is the product of intensive and large-scale bioinformatics analysis of metagenomic data for the presence and abundance of acquired antimicrobial resistance genes (ARGs). This paper can be viewed from two perspectives: (1) a data science contribution to allow the community to better examine AMR transmission patterns and (2) knowledge gained from analysis of the data.

DATA SCIENCE

From the data science perspective, this is a very significant contribution to the field using the latest standards and mining of >200,000 metagenomic datasets, totalling more than 400 TB of sequence data. Considerable effort was put into data harmonization and normalization to provide a high-value data set to AMR researchers. The data, pipelines, software, and results are provided in a well-organized open format, allowing their analysis by the broader community. As the amount of computation needed to produce this data set is well beyond most AMR researchers interested in using genomics to understand ARG transmission patterns, this contribution is novel and of high value. Software and their versions are properly described in the methods section, but parameters used for KMA are not outlined. Were default parameters used?
It is fine that the methods are presented "in brief" given a citation of previous work, but the manuscript would be improved if this included the cut-offs used by KMA to determine aligned reads (MAPQ?). Similarly, a more explicit statement in the methods that use of ResFinder focuses on the analysis of acquired ARGs and does not include resistance via mutation (e.g. PointFinder) would be helpful.

ANALYSIS

Analysis and interpretation of the data are thin, with some issues that need to be addressed, but this does not undermine that the primary purpose of the manuscript is to describe the generation and content of the data produced for the broader community. Full analyses of these data are beyond the scope of this manuscript and the authors perform an adequate overview analysis and summary of the major sub-sets and trends in the data. However, the statement of "a general trend" in lines 227-229 does not appear supported by Figures 5 & S4. This section should be re-written to carefully discuss patterns supported by the data, such as exists for the chicken data, instead of broad statements based on unconvincing patterns in the plots.

The data include ARGs that have as little as a single read fragment aligned and these ARGs were used in the species richness estimates. Can the authors explain why they did not include a minimum coverage cut-off in these analyses?

The results presented are broken down by host (i.e. environment), location, and ResFinder drug class, but not by ARG families. While others are very likely to analyze these data for transmission patterns of ARGs or ARG families, at least an anecdotal investigation of a few ARGs would help illustrate the value of the data. Perhaps something recent like MCR versus the AACs? The "trends" mentioned above may be more obvious at the level of ARG families.

DISCUSSION

Successful annotation of metagenomics data for ARGs requires both good software for sequencing read alignment and good reference data.
Both KMA and ResFinder reflect the latest standards but like CARD and other databases, ResFinder's reference data is primarily from clinical isolates. It is possible there are ARGs in the environmental metagenomics data that are sufficiently different from these reference data to a degree that KMA is unsuccessful. CARD has its "Resistomes & Variants" data to provide an alternate in silico diversity of >200,000 ARG alleles for sequence read alignment. I'm not suggesting a re-analysis of these data with a broader in silico reference sequence collection, but I think the discussion should address this possible bias, i.e. false-negative results for divergent ARGs because of the algorithm/reference choice.

As mentioned above, the manuscript has little in the assessment of ARG transmission patterns, which is fine as it was not the major point of the paper, but Zhang et al. (PMID 34362925; Nature Communications 12: 4765) & Zhang et al. (PMID 35322038; Nature Communications 13: 1553) have recently published some large scale ARG metagenomic analyses that included assessment of ARG transmission and generation of risk metrics. At a minimum, the discussion should place the authors' work in the context of these recent efforts.

As mentioned above, the data include ARGs that have as little as a single read fragment aligned. The authors should add a statement that they are including these data for complete transparency so others can decide their own cut-offs when analyzing the data.

No information is given in the discussion on the long-term maintenance of this resource. What is the plan as new metagenomics data become available? CARD has (beta) pathogen-of-origin kmer tools for ARGs; will the authors be exploring similar methods to provide a more pathogen-centric perspective in future analyses?

MINOR POINTS

Figure 1C caption should mention the number of samples for which the collection date was NULL.

Lines 210-214 have very confusing grammar.
The phrase "ARG template" is used without proper definition.

Reviewer #2:

The resource presented by Martiny et al. is timely and has potential to boost research on antimicrobial resistance through widening access. The methods are presented clearly, and the datasets are made publicly available. There are a few points that I recommend the authors consider towards improving the quality through cross-checking some of the analyses.

1. Given the impressive volume and the breadth of the data analysed, it is quite surprising that 96 ARGs did not have any alignments. A cross-checking of these results and any indicators of the underlying reasons (e.g., these ARGs being very specific to the environments not being represented here?) will be important.

2. Figure 4C, what do the rows 'Metagenome' and 'Metagenomes' refer to? Some error in metadata curation?

3. Fosfomycin ARG (green in Figure 4C), seems quite high in Food Metagenomes, while it is barely present in panels A and B. I suppose this indicates uneven distribution of sampling 'host'? Also, is there a known connection between food microbiomes and fosfomycin resistance? BTW, 'environment' will be a much better and more accurate term than 'host'.

Minor comments:

1. Line 81: uploaded between 2010-01-01 and 2020-01-01 that had library source as 'METAGEOMIC'

2. Inconsistent line spacing starting on page 5 line 135 through page 6 line 152

3. Any data on how sequencing depth affects detection of ARGs or 16S/18S genes? For example, Fig 2A could be converted from a density plot to a scatterplot showing sequencing depth vs fragments

4. Figure 3, what do different colour shades mean for boxes?

Reviewer #3:

[Note: because of other commitments, this reviewer was only able to give us the following preliminary comments; we hope that they will nevertheless be useful]

My opinion on that paper is that it's a valuable analysis, and seems to have been done carefully. I had only two technical quibbles:

1.
The manuscript does not explain how the analysis workflow handles two key issues. First: AMR databases are full of different versions of the same gene - e.g. there are more than 170 allelic versions of the CTX-M gene. Were all reads mapped to a database containing all of these, or were representatives chosen? If mapping to everything, what was done with reads that mapped to multiple alleles of one gene, and how were counts resolved? Second: I don't understand why, when calculating abundance, using counts of reads mapping to ribosomal RNA as a denominator makes sense, as rRNA arrays are different lengths in different species.

2. The text seems to suggest the same mapping workflow was used for nanopore, pacbio, and illumina. Is this really true? The same kmer size also? If yes, a lot of sensitivity will have been lost in the long read data, although since this is <10% of the data, this is not really a big issue.

I also had one red flag: Given the high rate of metadata errors in the ENA, I am suspicious of the samples dated between 1845 and 1905 in Figure 3 - is there a way to check these? If there is no associated publication discussing old metagenomes, I would honestly consider discarding those datapoints as mislabelled.

Figure 4 is great, v interesting!

8 Aug 2022

Submitted filename: response_PLOS_Biology.pdf

9 Aug 2022

Dear Dr Martiny,

Thank you for the submission of your revised Methods and Resources "A curated data resource of 214K metagenomes for characterization of the global antimicrobial resistome" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Tobias Bollenbach, I'm pleased to say that we can in principle accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then.
Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study.

Sincerely,

Roli Roberts

Roland G Roberts, PhD
Senior Editor
PLOS Biology
rroberts@plos.org

30 in total

1. A field guide for the compositional analysis of any-omics data.

Authors: Thomas P Quinn; Ionas Erb; Greg Gloor; Cedric Notredame; Mark F Richardson; Tamsyn M Crowley
Journal: Gigascience Date: 2019-09-01 Impact factor: 6.524

2. The EnteroBase user's guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity.

Authors: Zhemin Zhou; Nabil-Fareed Alikhan; Khaled Mohamed; Yulei Fan; Mark Achtman
Journal: Genome Res Date: 2019-12-06 Impact factor: 9.043

3. Inferring correlation networks from genomic survey data.

Authors: Jonathan Friedman; Eric J Alm
Journal: PLoS Comput Biol Date: 2012-09-20 Impact factor: 4.475

4. Bacterial phylogeny structures soil resistomes across habitats.

Authors: Kevin J Forsberg; Sanket Patel; Molly K Gibson; Christian L Lauber; Rob Knight; Noah Fierer; Gautam Dantas
Journal: Nature Date: 2014-05-21 Impact factor: 49.962

5. Urban metagenomics uncover antibiotic resistance reservoirs in coastal beach and sewage waters.

Authors: Pablo Fresia; Verónica Antelo; Cecilia Salazar; Matías Giménez; Bruno D'Alessandro; Ebrahim Afshinnekoo; Christopher Mason; Gastón H Gonnet; Gregorio Iraola
Journal: Microbiome Date: 2019-02-28 Impact factor: 14.650

6. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences.

Authors: Grace A Blackwell; Martin Hunt; Kerri M Malone; Leandro Lima; Gal Horesh; Blaise T F Alako; Nicholas R Thomson; Zamin Iqbal
Journal: PLoS Biol Date: 2021-11-09 Impact factor: 8.029

7. An omics-based framework for assessing the health risk of antimicrobial resistance genes.

Authors: An-Ni Zhang; Jeffry M Gaston; Chengzhen L Dai; Shijie Zhao; Mathilde Poyet; Mathieu Groussin; Xiaole Yin; Li-Guan Li; Mark C M van Loosdrecht; Edward Topp; Michael R Gillings; William P Hanage; James M Tiedje; Katya Moniz; Eric J Alm; Tong Zhang
Journal: Nat Commun Date: 2021-08-06 Impact factor: 14.919

8. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools.

Authors: Christian Quast; Elmar Pruesse; Pelin Yilmaz; Jan Gerken; Timmy Schweer; Pablo Yarza; Jörg Peplies; Frank Oliver Glöckner
Journal: Nucleic Acids Res Date: 2012-11-28 Impact factor: 16.971

9. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence.

Authors: Michael Feldgarden; Vyacheslav Brover; Narjol Gonzalez-Escalona; Jonathan G Frye; Julie Haendiges; Daniel H Haft; Maria Hoffmann; James B Pettengill; Arjun B Prasad; Glenn E Tillman; Gregory H Tyson; William Klimke
Journal: Sci Rep Date: 2021-06-16 Impact factor: 4.996

10. Assessment of global health risk of antibiotic resistance genes.

Authors: Zhenyan Zhang; Qi Zhang; Tingzhang Wang; Nuohan Xu; Tao Lu; Wenjie Hong; Josep Penuelas; Michael Gillings; Meixia Wang; Wenwen Gao; Haifeng Qian
Journal: Nat Commun Date: 2022-03-23 Impact factor: 17.694