proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes.

Daniel R Mende¹, Ivica Letunic², Oleksandr M Maistrenko³, Thomas S B Schmidt³, Alessio Milanese³, Lucas Paoli⁴, Ana Hernández-Plaza⁵, Askarbek N Orakov³, Sofia K Forslund⁶, Shinichi Sunagawa⁴, Georg Zeller³, Jaime Huerta-Cepas⁵, Luis Pedro Coelho⁷,⁸, Peer Bork³,⁶,⁹,¹⁰.

Abstract

Microbiology depends on the availability of annotated microbial genomes for many applications. Comparative genomics approaches have been a major advance, but consistent and accurate annotations of genomes can be hard to obtain. In addition, newer concepts such as the pan-genome concept are still being implemented to help answer biological questions. Hence, we present proGenomes2, which provides 87 920 high-quality genomes in a user-friendly and interactive manner. Genome sequences and annotations can be retrieved individually or by taxonomic clade. Every genome in the database has been assigned to a species cluster and most genomes could be accurately assigned to one or multiple habitats. In addition, general functional annotations and specific annotations of antibiotic resistance genes and single nucleotide variants are provided. In short, proGenomes2 provides threefold more genomes, enhanced habitat annotations, updated taxonomic and functional annotation and improved linkage to the NCBI BioSample database. The database is available at http://progenomes.embl.de/.

Year: 2020 PMID: 31647096 PMCID: PMC7145564 DOI: 10.1093/nar/gkz1002

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Large-scale genomics has been instrumental in improving our understanding of microbes. Microbiology has developed into a data-intensive field with the availability of thousands of sequenced genomes (1–3). Over the last 20+ years, the number of bacteria and archaea with sequenced genomes has grown exponentially (4,5). Annotations are essential for understanding microbes from their genomic data: they enable researchers to pinpoint potential functions and allow for comparative analyses (6). For this reason, we initially developed proGenomes and are continuing to improve the database. Several publicly accessible databases provide genomes with basic or more elaborate annotations. For example, the NCBI RefSeq database (7) makes a comprehensive set of genomes available to the public, though with only minimal annotations. Further, databases such as Ensembl Bacteria (8), the DOE's Joint Genome Institute Integrated Microbial Genomes & Microbiomes (JGI IMG/M) database (9), or the PATRIC (Pathosystems Resource Integration Center) database (10) contain more sophisticated, but often select information and annotations. For these databases, the taxonomic annotations are selected by the submitter of each genome. This leads to inconsistencies between clades across the tree of life, especially at the species level, as the species definition for bacteria and archaea remains a highly debated topic among microbiologists (11,12). In general, and not only due to user errors, inconsistencies are widespread in genomic databases (13–15). A successful effort to increase the consistency of the taxonomy at higher taxonomic levels is the Genome Taxonomy Database (GTDB) (12), while proGenomes v1 (4) used specI (5), which delineates species from genomic information. The pan-genome concept has been an important advance in microbial genomics and microbiology overall (16,17).
Due to the availability of many genome sequences within one species, researchers can now explore the pan-genome of many species and study their functional repertoire. Still, most genomes are studied on an individual basis, even in comparative approaches. Dedicated databases for pan-genomes exist, but these are often focused on specific taxonomic clades or lack in-depth functional annotations. Hence, the availability of pan-genomes for many species could facilitate many studies and applications. Here, we present proGenomes2, which was developed to address these issues as an update of the existing proGenomes database. The updated version provides three times as many genome sequences and annotations and a higher phylogenetic coverage, while adding information about the pan-genome of every species cluster. A number of workflows were improved for proGenomes2, including enhanced habitat annotations and linkage to the NCBI BioSample database. The database is available at http://progenomes.embl.de/.

DATABASE CONSTRUCTION AND CHARACTERISTICS

proGenomes2 provides the available microbial genomes and customizable subsets in a readily downloadable and user-friendly manner. Genomes and sets of genomes can be found and retrieved using the taxonomic name of the organism, species or clade. The provided information can be accessed, explored interactively and downloaded easily. The database will be updated regularly in the future and major upgrades of the underlying computational pipeline are planned every two years. The genomes for proGenomes2 were obtained on 15 May 2017 and the NCBI taxonomy database used was downloaded on 8 January 2019.

Genome collection

We downloaded all bacterial and archaeal genomes that were available from the NCBI Nucleotide database on 15 May 2017. Gene annotations provided with the deposited genomes were used when available. If gene annotations were absent from the deposited genomes, we used geneMarkS to predict genes (18). To exclude low-quality genomes, we applied the same criteria as in the original proGenomes version: an N50 above 10 kbp, fewer than 300 contigs and more than 30 of 40 universal, single-copy marker genes (19,20). We further removed genomes that had since been removed from the NCBI Nucleotide database and genomes that we found to be of low quality by manual quality filtering. After these filtering steps, 87 920 high-quality genomes remained (10 333 genomes were removed). In comparison, proGenomes version 1 contained 25 038 high-quality genomes. proGenomes2 normalizes genome, gene and protein identifiers to a consistent scheme, linking them to NCBI taxonomic and BioSample IDs, to facilitate downstream automated processing. This also ensures access to information about the sequenced sample that is provided by the NCBI BioSample Database (21).
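The three numeric thresholds above can be illustrated with a minimal Python sketch. It is not the pipeline used for proGenomes2: the `Assembly` record, its field names and the toy data are assumptions made for illustration, and the manual filtering step is not modelled.

```python
from dataclasses import dataclass
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


@dataclass
class Assembly:
    name: str                  # hypothetical identifier
    contig_lengths: List[int]  # contig lengths in bp
    marker_genes_found: int    # how many of the 40 universal marker genes were detected


def passes_quality_filter(a: Assembly) -> bool:
    """Apply the three thresholds quoted in the text."""
    return (
        n50(a.contig_lengths) > 10_000
        and len(a.contig_lengths) < 300
        and a.marker_genes_found > 30
    )


# Toy data, invented for illustration only.
good = Assembly("toy_good", [2_000_000, 50_000, 12_000], 40)
fragmented = Assembly("toy_fragmented", [8_000] * 400, 38)
print(passes_quality_filter(good), passes_quality_filter(fragmented))
```

The fragmented toy assembly fails on both its N50 and its contig count, even though most marker genes were found.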

Species clusters definitions using the specI approach

specI species clusters provide an accurate and consistent solution for genomics-based species definitions that are largely consistent with the consensus from morphological and phenotypic evaluation. We calculated specI species clusters as in the previous proGenomes version using the methodology described in (5), resulting in 12 221 specI species clusters for the 87 920 genomes currently in proGenomes2 (proGenomes version 1: 5306 specI clusters). In short, pairwise genome-to-genome identities were calculated as a length-weighted average of the nucleotide identities of a set of 40 universal, single-copy marker genes (19,20), computed with vsearch (v1.8.0) (22). These were transformed into distances and clustered using average linkage with a cutoff of 3.5% distance (96.5% nucleotide identity). Genomes are annotated according to the NCBI taxonomy (downloaded on 8 January 2019 and available at https://doi.org/10.5281/zenodo.3357977). The annotation of each specI cluster is derived from the annotations of the genomes that compose it.
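The clustering idea can be sketched in a few lines of Python. This is a toy illustration, not the specI implementation described in (5): the per-gene identities are assumed to come from an aligner such as vsearch, the function names are hypothetical, and treating the 3.5% cutoff as inclusive is an assumption.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# One (identity, alignment_length) pair per shared marker gene.
MarkerHits = List[Tuple[float, int]]


def weighted_identity(hits: MarkerHits) -> float:
    """Length-weighted mean of per-gene nucleotide identities (0..1)."""
    total_len = sum(length for _, length in hits)
    return sum(ident * length for ident, length in hits) / total_len


def average_linkage(dist: Dict[FrozenSet[str], float], genomes: List[str],
                    cutoff: float = 0.035) -> List[List[str]]:
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters until the smallest average distance exceeds the cutoff."""
    clusters = [[g] for g in genomes]

    def avg(a: List[str], b: List[str]) -> float:
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), avg(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Toy pairwise marker-gene comparisons, invented for illustration.
hits = {
    ("A", "B"): [(0.99, 1000), (0.97, 500)],
    ("A", "C"): [(0.90, 1000), (0.92, 500)],
    ("B", "C"): [(0.91, 1000), (0.93, 500)],
}
dist = {frozenset(pair): 1.0 - weighted_identity(h) for pair, h in hits.items()}
clusters = average_linkage(dist, ["A", "B", "C"])
print(sorted(clusters))
```

In this toy example A and B are about 1.7% apart and are merged, while C stays separate because its average distance to the merged cluster is well above the cutoff. The quadratic search is only practical for small inputs.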

Selection of representative genomes

Many species and even strains have been sequenced multiple times, leading to an increasing amount of redundancy in genomic databases. Non-redundant datasets are increasingly important in microbial genomics (23). Hence, proGenomes provides a non-redundant set of 12 221 representative genomes and habitat-specific subsets. These datasets are precompiled and can be readily downloaded. For each specI cluster, one representative genome was selected. For this purpose, a whitelist containing several highly important genomes was compiled. For all specI clusters that contained a genome on the whitelist, this genome was selected as representative. For all other specI clusters, the representative genomes were selected using citation counts as well as the N50 measure as a proxy for genome quality, while completely assembled genomes were selected preferentially.
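A possible reading of this selection rule is sketched below in Python. The text does not state how citation counts, N50 and assembly completeness are combined, so the strict priority order used here (whitelist, then completeness, then citations, then N50) is an assumption, as are the `Genome` record and its field names.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Genome:
    genome_id: str   # hypothetical identifier
    complete: bool   # True if the genome is completely assembled
    citations: int   # citation count
    n50: int         # N50 in bp, used as a proxy for assembly quality


def pick_representative(cluster: Iterable[Genome], whitelist: Set[str]) -> Genome:
    """Return the whitelisted genome if the cluster has one; otherwise
    rank by (complete, citations, n50). The ranking order is assumed."""
    members = list(cluster)
    for g in members:
        if g.genome_id in whitelist:
            return g
    return max(members, key=lambda g: (g.complete, g.citations, g.n50))


# Toy cluster, invented for illustration.
cluster = [
    Genome("g1", False, 12, 90_000),
    Genome("g2", True, 3, 4_000_000),
    Genome("g3", True, 3, 4_500_000),
]
print(pick_representative(cluster, set()).genome_id)
print(pick_representative(cluster, {"g1"}).genome_id)
```

Without a whitelist the complete genome with the larger N50 is chosen; with `g1` whitelisted, the whitelist overrides every other criterion.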

Pan-genomes

In addition to representative genomes, proGenomes provides pan-genomes for the specI clusters. These are non-redundant sets of genes that represent the genetic diversity within a specI (species) cluster. The pan-genome of each specI cluster was generated in two steps. First, genes were dereplicated by sorting them and removing identical sequences. Second, these dereplicated gene sets were further clustered with cd-hit-est (24) at 95% identity and 90% coverage to produce non-redundant versions (exact options used: -c 0.95 -G 0 -g 1 -aS 0.9) (25). The resulting pan-genomes can be used for many applications, ranging from metagenomic mapping to evolutionary analyses of functional genes across different species. For clusters with more than one genome, this reduced the number of genes from 283 million to 63 million, while providing far greater coverage of the functional repertoire than the representative genomes alone (21.8 million) (Figure 1).
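The two steps can be sketched as follows. The dereplication function is a minimal illustration that treats only exactly identical strings as duplicates (reverse complements are not considered); the second function merely assembles a cd-hit-est command line from the options quoted above and does not run it. The function names and file names are placeholders.

```python
from typing import Dict, List


def dereplicate(genes: Dict[str, str]) -> Dict[str, str]:
    """Sort genes by sequence and keep the first gene of each run of
    identical sequences."""
    kept: Dict[str, str] = {}
    previous = None
    for gene_id, seq in sorted(genes.items(), key=lambda kv: (kv[1], kv[0])):
        if seq != previous:
            kept[gene_id] = seq
            previous = seq
    return kept


def cd_hit_est_command(fasta_in: str, fasta_out: str) -> List[str]:
    """Build (but do not execute) a cd-hit-est call with the quoted options:
    95% identity (-c), local identity (-G 0), most-similar-cluster mode (-g 1)
    and 90% alignment coverage of the shorter sequence (-aS)."""
    return ["cd-hit-est", "-i", fasta_in, "-o", fasta_out,
            "-c", "0.95", "-G", "0", "-g", "1", "-aS", "0.9"]


# Toy gene set, invented for illustration.
genes = {"g1": "ATGAAA", "g2": "ATGCCC", "g3": "ATGAAA"}
unique = dereplicate(genes)
print(sorted(unique))
print(" ".join(cd_hit_est_command("dereplicated.fna", "pangenome.fna")))
```

The command list could be passed to `subprocess.run` on a system where cd-hit-est is installed.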

Figure 1.

Cumulative number of genes in specI clusters with more than one genome. Number of all redundant genes (left), genes represented in pan-genomes (middle) and genes from representative genomes (right).

Functional annotation

The functional repertoire of a microbial genome defines its phenotype, lifestyle and ecological role. Hence, a consistent, accurate and comprehensive functional annotation of its genes is pivotal to our understanding of a microorganism. One of the main aims of proGenomes is to provide consistently computed functional annotations of protein-coding genes. For general annotations, we use eggNOG-mapper with eggNOG 5.0 (26), software that assigns protein-coding genes to orthologous groups, which are in turn assigned to functions and broader functional categories. proGenomes2 also provides dedicated antibiotic resistance annotations of both antimicrobial resistance genes and resistance-conveying single nucleotide variants. As in proGenomes version 1, these annotations are based on integrated results from the Comprehensive Antibiotic Resistance Database (CARD) (23,27) and ResFams (28). Since both databases (CARD v3.0.0 and ResFams v1.2.2) map to the antibiotic resistance ontology (ARO), the ARO hierarchy (as per CARD version 3.0) was used to assess which antibiotics each resistance gene determinant protects against. Proxy terms for 'unspecified beta-lactam' and 'multidrug efflux pump' were added to reconcile ambiguities in some annotations. For complexes listed in the ARO, such as those composed of disparate subunits, synergies between hits were counted within each genome, reflecting how the presence of several interacting antibiotic resistance genes can provide further resistance. Overall, almost 288 million protein-coding genes were annotated using eggNOG (proGenomes version 1: 80 million). This information can be explored interactively on the proGenomes website.

Habitat information

proGenomes2 assigns each genome and species to one or more habitats. As previously, the habitat annotations are based on the PATRIC database. Information regarding the isolation source was parsed from PATRIC database version 3.5.28 (accessed on 8 December 2018) (29). Habitat annotations are available for 7218 out of the 12 221 specI clusters (59 344/87 920 isolates). Species were considered associated with a habitat if at least one genome was isolated from that habitat. We expanded the habitat classification from the previous version and provide three broad (soil-associated, aquatic, host-associated) and additionally five more specific categories (mud/sediment, freshwater, disease-associated, food-associated), all chosen to be biologically meaningful (Figure 2). Representative genomes for these subsets of clusters are available for bulk download from the website. We removed the category 'multiple' used in the previous version and instead allow annotation with more than one category in proGenomes2. As many genomes were annotated as food-associated or as isolated from sediment or freshwater, we introduced these as new categories. Annotations are provided with each genome and specI cluster as well as in one large downloadable file.
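The "at least one genome" rule, combined with the possibility of several habitats per species, amounts to a set union over the genomes of each cluster. A minimal Python sketch, with hypothetical identifiers and toy data:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def species_habitats(
    genome_to_cluster: Dict[str, str],
    genome_habitats: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """A cluster is associated with a habitat as soon as one of its genomes
    was isolated from it; clusters may carry several habitats."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for genome_id, habitat in genome_habitats:
        result[genome_to_cluster[genome_id]].add(habitat)
    return dict(result)


# Toy data, invented for illustration; g4 has no habitat annotation.
genome_to_cluster = {"g1": "specI_a", "g2": "specI_a", "g3": "specI_b", "g4": "specI_c"}
annotations = [
    ("g1", "host-associated"),
    ("g1", "disease-associated"),
    ("g2", "freshwater"),
    ("g3", "soil-associated"),
]
habitats = species_habitats(genome_to_cluster, annotations)
print(habitats)
```

Clusters whose genomes carry no habitat annotation, such as `specI_c` here, simply do not appear in the result, which mirrors the fact that only part of the clusters have habitat information.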

Figure 2.

Habitat annotations. In proGenomes2, organisms can be associated with multiple habitat categories. The left panel shows the overlap between specI species cluster habitat annotations. The right panel shows how many species (clusters) and strains are associated with each habitat category.

Website

To navigate and access the proGenomes2 database, we provide a website (http://progenomes.embl.de). A search function allows users to access data for specific taxonomic groups or specI clusters directly. Information about single genomes is provided interactively with direct links to relevant external database information, while for higher-level taxonomic groups concise information about the organisms belonging to that group is displayed. The website further allows users to provide their own genomes for annotation and contextualization within the proGenomes framework. Taxonomic placement with respect to the above-described specI clusters proceeds in four steps. First, genes and proteins in the uploaded genome are identified using Prodigal (30). Second, among these genes, the 40 marker genes on which the specI clusters were defined are identified using the methodology described in (5). Third, the extracted marker genes are compared to those from the 12 221 specI clusters using vsearch (22) with gene-specific thresholds. Finally, a consistent annotation for the genome is derived from the mapping of individual marker genes (more details are documented on the proGenomes website). This taxonomic placement tool is also available as stand-alone software. Functional annotations via the eggNOG-mapper web server (31) are accessible via a link.
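The text does not specify how the final, consistent annotation is derived from the individual marker-gene mappings. One plausible rule is a majority vote, sketched below in Python purely as an assumption: the function name, the `min_fraction` parameter and the marker names are hypothetical, and the input is taken to be the best specI cluster per marker gene after the vsearch comparison and thresholding of step three.

```python
from collections import Counter
from typing import Dict, Optional


def place_genome(marker_hits: Dict[str, Optional[str]],
                 min_fraction: float = 0.5) -> Optional[str]:
    """Assign the genome to the specI cluster supported by more than
    `min_fraction` of its detected marker genes, or return None.

    marker_hits maps each detected marker gene to the specI cluster of its
    best hit, or to None if no hit passed that gene's threshold.
    """
    votes = Counter(c for c in marker_hits.values() if c is not None)
    if not votes:
        return None
    cluster, count = votes.most_common(1)[0]
    return cluster if count / len(marker_hits) > min_fraction else None


# Toy mapping result, invented for illustration.
hits_per_marker = {
    "marker_01": "specI_a",
    "marker_02": "specI_a",
    "marker_03": "specI_a",
    "marker_04": "specI_b",
    "marker_05": None,
}
print(place_genome(hits_per_marker))
```

Here three of five markers agree on `specI_a`, so the genome is placed there. A stricter `min_fraction` would leave ambiguous genomes unassigned.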

Future outlook

We are planning to upgrade proGenomes further in the near future. The upcoming improvements include the integration of the GTDB taxonomy and the inclusion of metagenome-assembled genomes (MAGs). To enable this, we aim to improve our quality measures for genomes in general, with a specific focus on issues related to MAGs.

DISCUSSION

proGenomes provides consistent taxonomic and functional annotations for quality-filtered genomes, as well as non-redundant, habitat-specific sets of representative genomes. The easy-to-use website provides a wide range of information relevant to researchers interested in microbial genomics and allows the customization of subsets of genomes for download, thus facilitating comparative studies that address questions from evolution, population genetics, functional genomics and many other research fields. We intend proGenomes to be a valuable resource for studies ranging from those focusing on one or a few organisms to those analyzing large-scale evolutionary patterns or complex microbial communities.

30 in total

1. Accurate and universal delineation of prokaryotic species.

Authors: Daniel R Mende; Shinichi Sunagawa; Georg Zeller; Peer Bork
Journal: Nat Methods Date: 2013-07-28 Impact factor: 28.547

Review 2. Microbiology in the post-genomic era.

Authors: Duccio Medini; Davide Serruto; Julian Parkhill; David A Relman; Claudio Donati; Richard Moxon; Stanley Falkow; Rino Rappuoli
Journal: Nat Rev Microbiol Date: 2008-05-13 Impact factor: 60.633

3. Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology.

Authors: Molly K Gibson; Kevin J Forsberg; Gautam Dantas
Journal: ISME J Date: 2014-07-08 Impact factor: 10.302

4. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome".

Authors: Hervé Tettelin; Vega Masignani; Michael J Cieslewicz; Claudio Donati; Duccio Medini; Naomi L Ward; Samuel V Angiuoli; Jonathan Crabtree; Amanda L Jones; A Scott Durkin; Robert T Deboy; Tanja M Davidsen; Marirosa Mora; Maria Scarselli; Immaculada Margarit y Ros; Jeremy D Peterson; Christopher R Hauser; Jaideep P Sundaram; William C Nelson; Ramana Madupu; Lauren M Brinkac; Robert J Dodson; Mary J Rosovitz; Steven A Sullivan; Sean C Daugherty; Daniel H Haft; Jeremy Selengut; Michelle L Gwinn; Liwei Zhou; Nikhat Zafar; Hoda Khouri; Diana Radune; George Dimitrov; Kisha Watkins; Kevin J B O'Connor; Shannon Smith; Teresa R Utterback; Owen White; Craig E Rubens; Guido Grandi; Lawrence C Madoff; Dennis L Kasper; John L Telford; Michael R Wessels; Rino Rappuoli; Claire M Fraser
Journal: Proc Natl Acad Sci U S A Date: 2005-09-19 Impact factor: 11.205

5. Genome-wide experimental determination of barriers to horizontal gene transfer.

Authors: Rotem Sorek; Yiwen Zhu; Christopher J Creevey; M Pilar Francino; Peer Bork; Edward M Rubin
Journal: Science Date: 2007-10-18 Impact factor: 47.728

6. VSEARCH: a versatile open source tool for metagenomics.

Authors: Torbjørn Rognes; Tomáš Flouri; Ben Nichols; Christopher Quince; Frédéric Mahé
Journal: PeerJ Date: 2016-10-18 Impact factor: 2.984

7. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database.

Authors: Baofeng Jia; Amogelang R Raphenya; Brian Alcock; Nicholas Waglechner; Peiyao Guo; Kara K Tsang; Briony A Lago; Biren M Dave; Sheldon Pereira; Arjun N Sharma; Sachin Doshi; Mélanie Courtot; Raymond Lo; Laura E Williams; Jonathan G Frye; Tariq Elsayegh; Daim Sardar; Erin L Westman; Andrew C Pawlowski; Timothy A Johnson; Fiona S L Brinkman; Gerard D Wright; Andrew G McArthur
Journal: Nucleic Acids Res Date: 2016-10-26 Impact factor: 16.971

8. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

Authors: Qingyu Chen; Justin Zobel; Karin Verspoor
Journal: Database (Oxford) Date: 2017-01-10 Impact factor: 3.451

9. Toward richer metadata for microbial sequences: replacing strain-level NCBI taxonomy taxids with BioProject, BioSample and Assembly records.

Authors: Scott Federhen; Karen Clark; Tanya Barrett; Helen Parkinson; James Ostell; Yuichi Kodama; Jun Mashima; Yasukazu Nakamura; Guy Cochrane; Ilene Karsch-Mizrachi
Journal: Stand Genomic Sci Date: 2014-01-29

10. IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes.

Authors: I-Min A Chen; Ken Chu; Krishna Palaniappan; Manoj Pillay; Anna Ratner; Jinghua Huang; Marcel Huntemann; Neha Varghese; James R White; Rekha Seshadri; Tatyana Smirnova; Edward Kirton; Sean P Jungbluth; Tanja Woyke; Emiley A Eloe-Fadrosh; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
