Meaningful Use of Pathogen Genomic Data.

Abstract

Population genomic analysis is a powerful tool to understand the evolutionary history of pathogens and the factors contributing to the success or failure of lineages. These studies have significant implications for human health, as evident from our ongoing tracking of SARS-CoV-2. In their article, Gill et al. (J. L. Gill, J. Hedge, D. J. Wilson, and R. C. MacLean, mBio 12:e02168-21, 2021, https://doi.org/10.1128/mBio.02168-21) demonstrate the utility of pathogen genomic data by comprehensively elucidating the origin of methicillin-resistant Staphylococcus aureus ST239. To accomplish this, they leveraged newly developed tools for querying large genomic data sets. Overall, these analyses rely on the availability of representative genomic data along with their associated metadata: information about where and when samples were collected, clinical and epidemiological characteristics, and phenotypic properties. However, in many instances, these data are missing. Here, I borrow the term "meaningful use" from the Health IT field to describe the need to maximize the utility of genomic data and make suggestions for how to address the current limitations.

Keywords: MRSA; Staphylococcus aureus; bioinformatics; genomic databases; genomic epidemiology; population genomics

Year: 2022 PMID: 35467413 PMCID: PMC9239187 DOI: 10.1128/mbio.00311-22

Source DB: PubMed Journal: mBio Impact factor: 7.786

COMMENTARY

A common goal among population genomic studies is to understand the evolutionary history of a pathogen and the relative success of its lineages. These approaches are dependent on access to pathogen genomic data, rich metadata, and preferably the isolate and/or relevant phenotypic profiling. In their recent mBio publication, Gill et al. provide a clear example of how, when available, such data can be applied to answering unresolved questions, thereby increasing the utility of genomic data beyond their initial purpose (1). The authors analyzed published Staphylococcus aureus genomic data spanning 70 years to understand the origins of ST239, a well-known methicillin-resistant S. aureus (MRSA) strain. After its emergence, MRSA ST239 became a leading cause of health care-associated infections, highly prevalent and globally distributed, except in the United States, where it has remained relatively rare. Before their work, it was proposed that ST239 arose from a large recombination event between ST8 and ST30 strains (2). By querying representative genomes of ST30 and ST8, the authors were able to provide more definitive evidence of this hybrid evolution. They showed that the genomic backbone was most related to ST8, while the acquired region originated from an ST30 lineage that evolved from the phage type 80/81 clone, a notorious strain of methicillin-susceptible S. aureus (MSSA) that frequently caused hospital outbreaks in the 1950s and 1960s. As the SCCmec type III element, which confers resistance to multiple antibiotics, was not found among ∼1,900 published ST30 genomes, they posit that it was acquired from an ST30 ancestor after divergence of the ST239 lineage. Furthermore, they were able to date the origin of ST239 MRSA to between 1920 and 1945, echoing findings from another group that recently showed that methicillin-resistant S. aureus emerged prior to the clinical introduction of methicillin (3).
To explain the recent decrease in prevalence of ST239, the authors investigated the competitive fitness of ST239, finding it lower than that of its ST30 and ST8 progenitors. To reinforce this finding, they assessed selective forces acting on the genomic backbone and acquired regions, finding a fitness cost associated with acquisition of genes coding for antimicrobial resistance. Nevertheless, it remains unclear whether the decline of ST239 was the result of direct competition with other lineages or some combination of competition with improvements in infection prevention in the health care setting and decreased selective pressure as the result of antibiotic stewardship. Although the limited geographic distribution of ST239 was not specifically addressed by their analysis, Gill and colleagues also provide a putative explanation for why it was never a successful strain in North America. In the United States, USA100 and USA800 belonging to ST5 and USA500 belonging to ST8 were historically the leading causes of health care-associated MRSA infections (4). These lineages emerged in the Western Hemisphere at approximately the same time that ST239 was emerging elsewhere. The emergence of the highly successful ST8 USA300 North American epidemic clone in the 1990s likely ensured that ST239 would never become established (5). This was supported by the authors’ findings that ST8 demonstrated greater fitness and the observation that the few sequenced ST239 North American isolates were interspersed throughout the ST239 global phylogeny, suggesting multiple introductions that never took hold. Taken together, their findings provide insight into the fate of prevalent lineages and considerably advance our understanding of how new lineages emerge, how selection may act on different parts of the genome, and why we observe considerable geographic variation in lineage distribution.
More work is needed to investigate the relative contribution of human interventions and strain competition, especially among MRSA and MSSA lineages. Overall, the authors present an eloquent synthesis of computational and experimental approaches combined with genomic detective work, highlighting the promise of pathogen genomic data meta-analysis. The present analysis was made possible by the wealth of published sequencing data on S. aureus, which is one of the best-represented bacterial species in publicly available genomic data repositories (6). Over the last decade, the advances in sequencing technology and accompanying reduction in sequencing costs have resulted in an exponential increase in the number of published microbial data sets (Fig. 1) (7). However, the vast majority of these data exist as raw sequencing experiments deposited in the European Nucleotide Archive (ENA) and/or NCBI Sequence Read Archive (SRA), which makes them less accessible for routine querying. This limitation, compounded by the scarcity of available metadata, has diminished the utility of genomic data for secondary studies. In particular, it has been difficult for researchers to identify relevant data sets (e.g., genomes of samples collected from a disease type/sampling site, date range, or location) or search genomes for features of interest such as a specific genotype, virulence or antibiotic resistance determinants, or other mobile genetic elements. Often this difficulty results in the generation of new genomic data when existing data may have sufficed had metadata been made available. The ideal genomic database would offer more comprehensive and easily searchable metadata, which would enable improved tracking of the emergence and spread of epidemiologically important pathogens, elucidate the factors contributing to emergence, and facilitate the identification of novel mechanisms for virulence and antimicrobial resistance.

FIG 1

Draft prokaryotic genomes published yearly in the National Center for Biotechnology Information (NCBI) database (https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt). The release dates of notable sequencing platforms are noted on the figure. ONT, Oxford Nanopore Technologies.

Current limitations are partially being addressed by significant methodological advances in the development of computational tools for handling the ever-growing amount of data. Two such tools, BIGSI (https://bigsi.readme.io/) and Staphopia (https://staphopia.emory.edu/), were used in the present study (8, 9). BIGSI allows for efficient indexing and rapid querying of large genomic databases. As an early proof of concept, its developers indexed the raw sequencing data in the entire ENA database, while in subsequent iterations these data were assembled into draft genomes and then indexed (6). As a result, researchers are no longer limited to searching the ∼330,000 published draft prokaryotic genomes in NCBI/ENA but can also search the >660,000 genomes that previously existed only as raw sequencing data (6). Staphopia houses a centralized database for S. aureus genomes and pathogen-specific results such as genotype and antibiotic resistance and virulence gene profiles. Its successor Bactopia (https://bactopia.github.io/) provides an analysis framework that can be easily applied to other pathogens. In combination, these tools allowed the authors of the present study to easily identify all published genomes belonging to ST239, ST30, and ST8 in the Staphopia database and then use BIGSI to search all microbial genomic data for the SCCmec-III mobile genomic element sequence found in ST239. Simply put, their analysis would not have been possible or as comprehensive without these tools. During the development of Staphopia, its authors also highlighted two common limitations of large genomic databases: the limited availability of accessible metadata and uneven geographic representation.
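The idea of indexing many genomes once and then querying them for a sequence of interest can be illustrated with a toy sketch. This is not BIGSI's implementation or API: it assumes genomes are short plain strings, stores exact k-mer sets in memory (real tools use far more compact data structures), ignores reverse complements, and every name in it (`kmers`, `build_index`, `query`, the sample IDs) is hypothetical.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of genome IDs that contain it."""
    index: Dict[str, Set[str]] = {}
    for genome_id, seq in genomes.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(genome_id)
    return index


def query(index: Dict[str, Set[str]], query_seq: str, k: int,
          min_fraction: float = 1.0) -> List[str]:
    """Return sorted IDs of genomes holding at least min_fraction of the query's k-mers."""
    wanted = kmers(query_seq, k)
    if not wanted:  # query shorter than k: nothing to look up
        return []
    hits: Dict[str, int] = {}
    for kmer in wanted:
        for genome_id in index.get(kmer, ()):
            hits[genome_id] = hits.get(genome_id, 0) + 1
    return sorted(g for g, n in hits.items() if n / len(wanted) >= min_fraction)


# Hypothetical toy "genomes"; real inputs would be assemblies or reads.
genomes = {
    "sample_A": "ATGCGTACGTTAGC",
    "sample_B": "TTTTGCGTACGAAA",
    "sample_C": "CCCCCCCCCCCCCC",
}
index = build_index(genomes, k=5)
matches = query(index, "GCGTACG", k=5)
```

Building the index is the expensive step and is done once; each later query only looks up the k-mers of the query sequence, which is what makes repeated searches across a large collection practical.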
In their work, Petit and Read found that of the ∼43,000 S. aureus samples deposited between 2010 and 2017, only 40% had a collection date, 35% had a geographic location, and 35% had an isolate source (9). Only 28% of samples could be linked to a publication, and while most journals require genomic data to be deposited and the metadata published, there are minimal standards for which fields are required. Even if rich metadata were included with associated publications, it is tedious for researchers to identify relevant genomic studies and then aggregate largely incomplete and inconsistent metadata. A recent secondary analysis of S. aureus genomic data that we performed used 436 genomes spanning 45 published studies and 55 NCBI bioprojects (10). This required resource-intensive abstraction of metadata, and in many instances, the study authors were contacted to obtain the requisite data for meaningful analysis. Finally, by assessing the available metadata, it is evident that there is a strong geographic bias among published genomic data, with much of it composed of samples collected from North America and Western Europe. This has clear implications for population genomic studies. For example, accurately tracking the demographic history of a pathogen or identifying the origin of a recombination block relies on having representative sampling of the ancestor within the data set. Myriad entities such as private laboratories, public health agencies, and academic researchers routinely generate sequencing data, and reasons can vary for not publishing detailed metadata, including (i) privacy concerns, (ii) the desire to use the data to its fullest extent before others gain access, (iii) wanting credit for invested resources, and (iv) a feeling of inequity among data generators.
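Completeness figures like those reported by Petit and Read above are per-field proportions, and a similar audit can be run on any collection of sample records before a secondary analysis. The following is a minimal sketch under stated assumptions: the field names, the list of placeholder strings treated as missing, and the example records are all illustrative, not taken from Staphopia or any repository schema.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, Optional[str]]

# Placeholder strings counted as "no value" (an assumption; adjust per data source).
MISSING = {"", "missing", "not collected", "not applicable", "unknown", "na", "n/a"}


def is_present(value: Optional[str]) -> bool:
    """True if the field holds something other than a missing-value placeholder."""
    return value is not None and value.strip().lower() not in MISSING


def completeness(records: List[Record], fields: Iterable[str]) -> Dict[str, float]:
    """Return, for each field, the percentage of records with a usable value."""
    total = len(records)
    report: Dict[str, float] = {}
    for field in fields:
        filled = sum(1 for r in records if is_present(r.get(field)))
        report[field] = 100.0 * filled / total if total else 0.0
    return report


# Hypothetical sample records with the kinds of gaps described in the text.
records: List[Record] = [
    {"collection_date": "2015-03", "geo_loc_name": "USA: Georgia", "isolation_source": "blood"},
    {"collection_date": "missing", "geo_loc_name": "United Kingdom", "isolation_source": None},
    {"collection_date": "2012", "geo_loc_name": "", "isolation_source": "not collected"},
    {"collection_date": None, "geo_loc_name": None},
]
report = completeness(records, ["collection_date", "geo_loc_name", "isolation_source"])
```

Treating placeholder strings such as "missing" as absent matters in practice, because a field that is technically populated may still carry no usable information.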
Furthermore, when data are generated by a public health entity, common data elements such as date of collection or location can be considered protected health information released at the discretion of the agency. The debate about genomic data-sharing practices and policies has been brought to the forefront of scientific discussion during the SARS-CoV-2 pandemic (11). There are strong arguments from groups advocating for unrestricted data access and others wanting a more conservative approach. Moving forward, the scientific community should revisit how sequencing data are published along with their associated metadata. A combination of top-down and bottom-up approaches will likely be needed to address the current limitations. While the National Institutes of Health (NIH) published a genomic data-sharing policy in 2015 that set requirements for the publication of data generated through federal funding, the metadata requirements are relatively minimal (12). Further, the disparity between the number of published assembled draft genomes and those stored as raw sequencing data is partially the result of the current policy, which requires that genomic data be deposited within 45 days after generation (13). This approach favors rapid, prepublication data availability, which undeniably is necessary for responding to public health emergencies. However, the proper collection and curation of metadata requires considerable effort and is frequently performed during preparation for publication. The unintended result is a less-than-optimal genomic database. Currently, NIH is revisiting this policy and has posted a request for information, providing an opportunity for stakeholders in the fields of genomics, infectious diseases, public health, ecology, microbiology, and epidemiology to shape the long-term vision and balance acceptability with usability.
As a community of scholars with a vested interest in access to useful microbial genomic data, we can improve our practices by publishing detailed metadata with their associated sequence data. While a minimal acceptable data set is not well established, at the least, the year and month of collection, location of collection (at a minimum region/state/province and country), source (e.g., human, fomite, or animal; carriage or disease if from a host), and disease type should be included. Ideally, phenotypic properties that cannot be inferred from the genomic data such as antibiotic susceptibility should be included. Further, as more complete metadata and draft genome assemblies are generated during analysis, these data should be appended to the primary submission, and as a community, we should determine whether policies requiring this are needed. Most importantly, metadata should not be relegated to an unsearchable table in a supplemental PDF file. If more detailed metadata are available that are not compatible with current data structures, repositories like Data Dryad (https://datadryad.org/) may provide an alternative to supplemental metadata tables. Retrospectively, we can leverage computational advances to scrape metadata from publications and link them to accession numbers of genomic data sets. This may be most useful for pathogen-specific databases such as Staphopia. Finally, in terms of increasing the geographic representativeness of genomic data, efforts like that of the Global Pneumococcal Sequencing Project are a model for engaging countries with limited resources and ensuring that their participation is equitable (14). Together, these efforts would greatly improve the richness and utility of genomic data sets and make analyses such as the one by Gill et al. (1) more commonplace.
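The minimal data set proposed above (year and month of collection, country and region, source, and disease type) could be checked automatically before submission. The sketch below is one possible encoding, not an established standard: the field names, the `YYYY-MM` date convention, and the `validate` helper are assumptions made for illustration, and a real checker would need to allow disease type to be marked not applicable for carriage isolates.

```python
import re
from typing import Dict, List, Optional

# Accepts "YYYY-MM" or "YYYY-MM-DD"; a bare year is not enough for the minimal set.
YEAR_MONTH = re.compile(r"^\d{4}-(0[1-9]|1[0-2])(-\d{2})?$")

# Hypothetical field names for the minimal data set described in the text.
REQUIRED = ("collection_date", "country", "region", "source", "disease_type")


def validate(record: Dict[str, Optional[str]]) -> List[str]:
    """Return a list of problems; an empty list means the minimal set is satisfied."""
    problems: List[str] = []
    for field in REQUIRED:
        if not (record.get(field) or "").strip():
            problems.append(f"missing {field}")
    date = (record.get("collection_date") or "").strip()
    if date and not YEAR_MONTH.match(date):
        problems.append("collection_date lacks year and month (expected YYYY-MM)")
    return problems


# Hypothetical records: one complete, one with a bare year and no region.
complete = {
    "collection_date": "2019-07",
    "country": "Kenya",
    "region": "Nairobi",
    "source": "human, disease",
    "disease_type": "bacteremia",
}
incomplete = {
    "collection_date": "2019",
    "country": "Kenya",
    "source": "human, disease",
    "disease_type": "bacteremia",
}
complete_problems = validate(complete)
incomplete_problems = validate(incomplete)
```

Returning a list of problems rather than a single pass/fail flag lets a submitter see every gap at once, which suits metadata that is usually curated in batches.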

11 in total

1. Evolution of Staphylococcus aureus by large chromosomal replacements.

Authors: D Ashley Robinson; Mark C Enright
Journal: J Bacteriol Date: 2004-02 Impact factor: 3.490

2. Why some researchers oppose unrestricted sharing of coronavirus genome data.

Authors: Amy Maxmen
Journal: Nature Date: 2021-05 Impact factor: 49.962

3. Genomic Epidemiology and Global Population Structure of Exfoliative Toxin A-Producing Staphylococcus aureus Strains Associated With Staphylococcal Scalded Skin Syndrome.

Authors: Taj Azarian; Eleonora Cella; Sarah L Baines; Margot J Shumaker; Carol Samel; Mohammad Jubair; David A Pegues; Michael Z David
Journal: Front Microbiol Date: 2021-08-18 Impact factor: 6.064

4. Pulsed-field gel electrophoresis typing of oxacillin-resistant Staphylococcus aureus isolates from the United States: establishing a national database.

Authors: Linda K McDougal; Christine D Steward; George E Killgore; Jasmine M Chaitram; Sigrid K McAllister; Fred C Tenover
Journal: J Clin Microbiol Date: 2003-11 Impact factor: 5.948

5. Range Expansion and the Origin of USA300 North American Epidemic Methicillin-Resistant Staphylococcus aureus.

Authors: Lavanya Challagundla; Xiao Luo; Isabella A Tickler; Xavier Didelot; David C Coleman; Anna C Shore; Geoffrey W Coombs; Daniel O Sordelli; Eric L Brown; Robert Skov; Anders Rhod Larsen; Jinnethe Reyes; Iraida E Robledo; Guillermo J Vazquez; Raul Rivera; Paul D Fey; Kurt Stevenson; Shu-Hua Wang; Barry N Kreiswirth; Jose R Mediavilla; Cesar A Arias; Paul J Planet; Rathel L Nolan; Fred C Tenover; Richard V Goering; D Ashley Robinson
Journal: mBio Date: 2018-01-02 Impact factor: 7.867

6. Ultrafast search of all deposited bacterial and viral genomic data.

Authors: Phelim Bradley; Henk C den Bakker; Eduardo P C Rocha; Gil McVean; Zamin Iqbal
Journal: Nat Biotechnol Date: 2019-02-04 Impact factor: 54.908

7. Global genomic pathogen surveillance to inform vaccine strategies: a decade-long expedition in pneumococcal genomics. (Review)

Authors: Stephen D Bentley; Stephanie W Lo
Journal: Genome Med Date: 2021-05-17 Impact factor: 11.117

8. Evolutionary Processes Driving the Rise and Fall of Staphylococcus aureus ST239, a Dominant Hybrid Pathogen.

Authors: Daniel J Wilson; R Craig MacLean; Jacqueline L Gill; Jessica Hedge
Journal: mBio Date: 2021-12-14 Impact factor: 7.786

9. Emergence of methicillin resistance predates the clinical use of antibiotics.

Authors: Jesper Larsen; Claire L Raisen; Xiaoliang Ba; Nicholas J Sadgrove; Guillermo F Padilla-González; Monique S J Simmonds; Igor Loncaric; Heidrun Kerschner; Petra Apfalter; Rainer Hartl; Ariane Deplano; Stien Vandendriessche; Barbora Černá Bolfíková; Pavel Hulva; Maiken C Arendrup; Rasmus K Hare; Céline Barnadas; Marc Stegger; Raphael N Sieber; Robert L Skov; Andreas Petersen; Øystein Angen; Sophie L Rasmussen; Carmen Espinosa-Gongora; Frank M Aarestrup; Laura J Lindholm; Suvi M Nykäsenoja; Frederic Laurent; Karsten Becker; Birgit Walther; Corinna Kehrenberg; Christiane Cuny; Franziska Layer; Guido Werner; Wolfgang Witte; Ivonne Stamm; Paolo Moroni; Hannah J Jørgensen; Hermínia de Lencastre; Emilia Cercenado; Fernando García-Garrote; Stefan Börjesson; Sara Hæggman; Vincent Perreten; Christopher J Teale; Andrew S Waller; Bruno Pichon; Martin D Curran; Matthew J Ellington; John J Welch; Sharon J Peacock; David J Seilly; Fiona J E Morgan; Julian Parkhill; Nazreen F Hadjirin; Jodi A Lindsay; Matthew T G Holden; Giles F Edwards; Geoffrey Foster; Gavin K Paterson; Xavier Didelot; Mark A Holmes; Ewan M Harrison; Anders R Larsen
Journal: Nature Date: 2022-01-05 Impact factor: 69.504

10. Staphylococcus aureus viewed from the perspective of 40,000+ genomes.

Authors: Robert A Petit; Timothy D Read
Journal: PeerJ Date: 2018-07-12 Impact factor: 2.984