COLOMBOS v3.0: leveraging gene expression compendia for cross-species analyses.

Marco Moretto, Paolo Sonego, Nicolas Dierckxsens, Matteo Brilli, Luca Bianco, Daniela Ledezma-Tejeida, Socorro Gama-Castro, Marco Galardini, Chiara Romualdi, Kris Laukens, Julio Collado-Vides, Pieter Meysman, Kristof Engelen.

Abstract

COLOMBOS is a database that integrates publicly available transcriptomics data for several prokaryotic model organisms. Compared to the previous version, it has more than doubled in size, both in terms of species and data available. The manually curated condition annotation has been overhauled as well, giving more complete information about samples' experimental conditions and their differences. Functionality-wise, cross-species analyses now enable users to analyse expression data for all species simultaneously and to identify candidate genes with evolutionarily conserved expression behaviour. All the expression-based query tools have been substantially improved: they are no longer restricted to retrieving co-expressed data and can instead return more complex patterns of expression behaviour. COLOMBOS is freely available through a web application at http://colombos.net/. The complete database is also accessible via REST API or downloadable as tab-delimited text files.
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2015        PMID: 26586805      PMCID: PMC4702885          DOI: 10.1093/nar/gkv1251

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

COLOMBOS is a collection of expression data from both microarray and RNA-Seq experiments for several prokaryotic species, taken from publicly available databases such as the Gene Expression Omnibus (GEO) (1) and ArrayExpress (2). Its uniqueness resides in the ability to cope with data heterogeneity and directly integrate data coming from different platforms and technologies. Other gene expression compendia are usually either built from data for a single transcriptomics platform or rely on the integration of expression analysis results rather than of the actual measurements. In COLOMBOS, however, data are collected and curated starting from the original raw intensities for microarrays and sequence reads for RNA-Seq, and then processed with a robust normalization and quality control pipeline to allow direct comparison of gene expression behaviour across different experiments and platforms (3). This results in a single expression matrix for every species, its rows representing the measured genes and its columns representing condition contrasts, i.e. comparisons between test and reference samples of different biological conditions. Attention is also given to the acquisition of meta-data describing the biological conditions surveyed in an experiment, so that all the included samples and condition contrasts are formally annotated by means of a controlled vocabulary of condition properties. This annotation is a manual effort with the purpose of making the data comparable from a biological viewpoint and of yielding reliable interpretations of expression patterns. COLOMBOS compendia are accessible using the web interface, through a set of REST API calls, or via the R (4) package Rcolombos; they are also available for download in their entirety for use of COLOMBOS data in third-party stand-alone applications.
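The layout of such an expression matrix can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a contrast value is the log2 ratio of a test sample's normalized intensity over that of its reference sample; the text above does not specify the exact measure, and the sample names, gene names and numbers are invented.

```python
import math

# Hypothetical normalized intensities per sample (gene -> value); toy numbers.
samples = {
    "wt_aerobic": {"geneA": 100.0, "geneB": 50.0, "geneC": 200.0},
    "wt_anaerobic": {"geneA": 400.0, "geneB": 50.0, "geneC": 100.0},
    "mut_anaerobic": {"geneA": 100.0, "geneB": 25.0},  # geneC not measured
}

# Each contrast pairs a test sample with a reference sample.
contrasts = [
    ("wt_anaerobic", "wt_aerobic"),
    ("mut_anaerobic", "wt_anaerobic"),
]


def log_ratio(test, ref):
    """Return log2(test / ref), or None when either measurement is missing."""
    if test is None or ref is None:
        return None
    return math.log2(test / ref)


def build_matrix(samples, contrasts):
    """Rows are genes, columns are contrasts (assumed log2 test/reference)."""
    genes = sorted({g for values in samples.values() for g in values})
    return {
        gene: [
            log_ratio(samples[t].get(gene), samples[r].get(gene))
            for t, r in contrasts
        ]
        for gene in genes
    }


matrix = build_matrix(samples, contrasts)
for gene, row in matrix.items():
    print(gene, row)
```

A gene that is not measured in one of the two samples yields a missing value for that contrast, which is how gaps can arise when platforms with different gene coverage are combined.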
Different types of analyses can be done using the COLOMBOS web interface itself; typical operations include starting from a set of known genes to find the conditions where they are (co)-expressed or to identify additional co-expressed genes. COLOMBOS’ tools are designed for users to ‘play around’ with the compendia, exploring the data with respect to the biological question they are interested in. They are encouraged to try different types of search queries based on genes or conditions, the available annotations or by relying on the actual expression values in a way reminiscent of a BLAST functionality with gene expression behaviour instead of sequence similarity. They can then visualize their results, use them as a basis for new queries to find additional (anti-)co-expressed genes, generate clusters to separate disjoint expression profiles, explore the overlap between multiple query results and potentially combine them, etc. There are several detailed use case tutorials on the website, illustrating step-by-step how concrete examples of conceptually different biological questions could be handled through the COLOMBOS interface. The previous v2.0, with all of its original databases and tools, will be kept available for future reference alongside COLOMBOS v3.0; how to access it is explained in the website's Help section.
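The idea of querying by expression behaviour rather than by sequence can be sketched as follows. This is not the COLOMBOS search algorithm; it is a minimal illustration that assumes Pearson correlation against the mean profile of the query genes as the similarity measure, with invented gene names and values.

```python
import math

# Toy expression matrix: gene -> values across five contrasts (invented numbers).
expression = {
    "geneA": [2.0, 1.0, 0.0, -1.0, -2.0],
    "geneB": [1.8, 1.1, 0.2, -0.9, -2.1],
    "geneC": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "geneD": [0.1, -0.2, 0.0, 0.2, -0.1],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_by_similarity(expression, query_genes):
    """Rank all non-query genes by correlation with the mean query profile."""
    n = len(next(iter(expression.values())))
    profile = [
        sum(expression[g][i] for g in query_genes) / len(query_genes)
        for i in range(n)
    ]
    scores = {
        gene: pearson(profile, values)
        for gene, values in expression.items()
        if gene not in query_genes
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_similarity(expression, ["geneA"])
for gene, score in ranking:
    print(gene, round(score, 3))
```

Genes at the top of the ranking are candidates for co-expression with the query, while strongly negative scores at the bottom point to anti-co-expressed candidates.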

DATA CONTENT UPDATE

COLOMBOS v2.0 (5) was composed of seven bacterial species, four more than it contained at its inception. The current update includes an additional twelve species of biomedical or industrial relevance, including some Archaea. The main criteria for selecting these new species were the amount of publicly available expression data, the quality of the genome annotation and their perceived status as model organisms. A complete overview of the available species and associated statistics can be found in Table 1. The previous compendia have also been updated with recent experiments, in some extreme cases leading to an almost 2-fold increase of available data. For instance, the biggest compendium is that for Escherichia coli, which now contains over 4000 condition contrasts, nearly 2000 more than COLOMBOS v2.0 and almost as many as its number of genes, rendering the expression matrix virtually square. Gene lists, representing the species’ measurable transcripts, have been created from the NCBI RefSeq database (6), and various gene annotation data were added (or updated) from UniProt-GOA (7), RegulonDB (8), BioCyc (9) and EcoCyc (10), or species-specific published datasets (11).
Table 1.

Overview of the data available in COLOMBOS

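Species | Strain | Number of genes | Number of contrasts | Missing values (%) | First inclusion | Samples | Experiments | Platforms
Escherichia coli | MG1655 | 4321 | 4077 | 3.6 | v1.0 | 5510 | 254 [15] | 73
Bacillus subtilis | 168 | 4176 | 1259 | 3.7 | v1.0 | 1814 | 45 | 35
Salmonella enterica serovar Typhimurium | cross-strain | 6261 | 1066 | 41.6 | v1.0 | 1856 | 36 | 22
 | LT2 | 4556 | 172 | 6.4 | v2.0 | 316 | 8 | 10
 | 14028S | 5416 | 681 | 22.7 | v2.0 | 1252 | 17 | 7
 | SL1344 | 4655 | 213 | 9.8 | v2.0 | 288 | 11 | 9
Streptomyces coelicolor | A3(2) | 8239 | 371 | 7.3 | v2.0 | 546 | 7 [2] | 7
Pseudomonas aeruginosa | PAO1 | 5647 | 559 | 1.6 | v2.0 | 592 | 33 | 2
Helicobacter pylori | 26695 | 1616 | 133 | 3.1 | v2.0 | 256 | 8 | 5
Bacillus anthracis | Ames | 5039 | 66 | 3.0 | v3.0 | 75 | 4 | 4
Bacillus cereus | ATCC 14579 | 5231 | 283 | 2.4 | v3.0 | 392 | 16 | 10
Bacteroides thetaiotaomicron | VPI-5482 | 4816 | 333 | 1.9 | v3.0 | 353 | 19 | 4
Campylobacter jejuni | NCTC 11168 | 1572 | 152 | 12.5 | v3.0 | 260 | 14 | 11
Clostridium acetobutylicum | ATCC 824 | 3778 | 377 | 2.4 | v3.0 | 419 | 12 | 11
Lactobacillus rhamnosus | GG | 2834 | 79 | 3.6 | v3.0 | 158 | 3 | 2
Methanococcus maripaludis | S2 | 1722 | 364 | 1.5 | v3.0 | 728 | 19 | 3
Shigella flexneri | 301 | 4315 | 35 | 17.0 | v3.0 | 38 | 3 | 3
Sinorhizobium meliloti | 1021 | 6218 | 424 | 2.7 | v3.0 | 713 | 20 [19] | 10
Streptococcus pneumoniae | D39 | 1914 | 68 | 5.7 | v3.0 | 136 | 12 | 2
Thermus thermophilus | HB8 | 2173 | 444 | 1.4 | v3.0 | 480 | 6 | 3
Yersinia pestis | CO92 | 3979 | 36 | 6.1 | v3.0 | 72 | 5 | 2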

Rows of the table represent all the species and strains for which a gene expression compendium is hosted. Columns represent (from left to right): the species name, the strain used as reference genome for microarray probe to gene mapping and RNA-Seq read alignment, the total number of genes in the compendium, the total number of contrasts in the compendium, the percentage of missing values, the COLOMBOS version of the first inclusion of the respective species or strain, the total number of samples from which the compendium's contrasts are built, the total number of corresponding experiments on GEO and ArrayExpress (the latter indicated between square brackets) and the total number of platforms represented.

Complete sample annotation

COLOMBOS sports an annotation system for condition contrast related meta-data which relies on a manually curated and controlled vocabulary. It is an essential information source that aids in the interpretation of gene expression patterns. As COLOMBOS condition contrasts represent comparisons between two samples (a ‘test’ sample compared to a ‘reference’ sample), in the past only condition properties which represented actual differences between the two samples were annotated. The major drawback of this approach is that it disregards what is shared between both samples: two contrasts could be annotated exactly the same regardless of the condition ‘background’ of their individual samples. For instance, when two contrasts had measured the exact same decrease in oxygen concentration, they would have been annotated identically. If one of the contrasts however had wild-type strains for both test and reference samples, and the other contrast had strains with a mutation in a gene important in aerobic respiration, this information would not be apparent from the contrast's annotation, while it is arguably an important factor to acknowledge. For this COLOMBOS update, we have fully overhauled the annotation system to instead work at the sample level (as opposed to the contrast level) and consequently hold the meta-information for both a contrast's samples’ experimental conditions, and not only the differences between them. When looking up a condition contrast in the COLOMBOS database, you will now be presented with the biological background (e.g. strains, medium, growth conditions) as well as the biological difference that results in the displayed expression behaviour.
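The difference between contrast-level and sample-level annotation can be made concrete with a small Python sketch. The property names and values below are invented and do not reproduce the COLOMBOS controlled vocabulary; the sketch only shows how annotating both samples allows the shared background and the difference of a contrast to be derived.

```python
# Hypothetical sample annotations; property names and values are invented.
samples = {
    "s1": {"strain": "wild type", "medium": "LB", "oxygen": "aerobic"},
    "s2": {"strain": "wild type", "medium": "LB", "oxygen": "anaerobic"},
    "s3": {"strain": "mutant X", "medium": "LB", "oxygen": "aerobic"},
    "s4": {"strain": "mutant X", "medium": "LB", "oxygen": "anaerobic"},
}


def describe_contrast(test, reference):
    """Split sample-level annotations into shared background and differences."""
    keys = sorted(set(test) | set(reference))
    background = {
        k: test[k]
        for k in keys
        if k in test and k in reference and test[k] == reference[k]
    }
    # Differences are stored as (reference value, test value).
    difference = {
        k: (reference.get(k), test.get(k))
        for k in keys
        if test.get(k) != reference.get(k)
    }
    return background, difference


background_wt, difference_wt = describe_contrast(samples["s2"], samples["s1"])
background_mut, difference_mut = describe_contrast(samples["s4"], samples["s3"])

print(difference_wt == difference_mut)   # same oxygen shift in both contrasts
print(background_wt, background_mut)     # but the strain background differs
```

Under contrast-level annotation only the `difference` part would be recorded, so the two contrasts would look identical; keeping the `background` part preserves the distinction between the wild-type and the mutant experiment.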

FUNCTIONALITY UPDATE

Cross-species analysis

A completely new functionality in COLOMBOS v3.0 is the ability to work with all species simultaneously. The data from different organisms have been integrated on a higher level based on clusters of homologous genes (CHGs) constructed with OrthoMCL v2.0.9 (12) using the default settings as applied to the protein sequences for the strains included in COLOMBOS v3.0. These CHGs can be thought of as the rows of an overarching expression matrix obtained by stitching together the individual compendia. Expression data for orthologous genes, i.e. genes assigned to the same CHG, are aligned across the respective species; species without a representative gene in a CHG can be thought of as having missing values. In case a CHG contains paralogous genes (multiple genes from the same species), their expression values are averaged. All data analysis tools included in COLOMBOS have been adapted to deal with these new cross-species compendia, so that this complex expression matrix can be queried and explored with the same flexibility as any single species. The cross-species comparison is not only a novelty for the identification of co-expressed gene sets across several species, e.g. for evolutionary studies, but also has several advantages for the way compendia can be constructed. We can now build compendia for different strains and integrate them at the species level using homologue mappings. This has a clear advantage: instead of using a single reference strain's genome to represent the species, as was done before, we can now explicitly recognize genomic differences between strains and thus improve read alignment (RNA-Seq) or probe-to-gene mapping (microarrays) to generate higher quality expression data. This concept has been used to improve our Salmonella enterica serovar Typhimurium compendium, where the original consisted of more or less equal parts of three different strains with minor differences in their genomic content.
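The stitching of single-species compendia into one cross-species matrix can be sketched in a few lines of Python. The species, gene and CHG identifiers and all values are invented, and the CHG assignments are given by hand rather than computed with OrthoMCL; the sketch only mirrors the three rules stated above (align by CHG, average paralogues, fill absent species with missing values).

```python
# Toy per-species compendia: gene -> values over that species' own contrasts.
compendia = {
    "species_1": {"g1a": [1.0, 2.0], "g1b": [3.0, 4.0], "g1c": [0.5, 0.5]},
    "species_2": {"g2a": [-1.0, 0.0, 1.0]},
}

# Hypothetical clusters of homologous genes: CHG -> species -> member genes.
chgs = {
    "CHG_1": {"species_1": ["g1a", "g1b"], "species_2": ["g2a"]},
    "CHG_2": {"species_1": ["g1c"]},
}


def stitch(compendia, chgs):
    """Build a CHG-by-contrast matrix spanning the contrasts of all species."""
    species_order = sorted(compendia)
    n_contrasts = {
        s: len(next(iter(compendia[s].values()))) for s in species_order
    }
    matrix = {}
    for chg, members in chgs.items():
        row = []
        for species in species_order:
            genes = members.get(species, [])
            if not genes:
                # No representative gene: missing values for this species.
                row.extend([None] * n_contrasts[species])
                continue
            for i in range(n_contrasts[species]):
                values = [
                    compendia[species][g][i]
                    for g in genes
                    if compendia[species][g][i] is not None
                ]
                # Paralogues within one species are averaged.
                row.append(sum(values) / len(values) if values else None)
        matrix[chg] = row
    return matrix


cross_species = stitch(compendia, chgs)
for chg, row in cross_species.items():
    print(chg, row)
```

Each row has one entry per contrast of every species, so the resulting matrix can be handled like a single-species compendium with additional missing values.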

Analysis tools

Several changes have been made to the web portal's suite of analysis tools, the RESTful web service and the R API. These are mainly related to the query functionalities that actually make use of the expression values themselves (‘BLASTing with expression data’). While these previously looked solely for consistent co-expression, they are now capable of returning complex patterns of expression behaviour across sets of query genes (or conditions). For instance, in v2.0 the Quicksearch functionality would return, for a set of user-defined genes, the contrasts where those genes behave in a similar and coherent way. These are not necessarily the most informative, or relevant, contrasts for the user, especially for larger gene sets for which co-expression behaviour might be rare and unrepresentative. By default, the Quicksearch in v3.0 will visualize complex patterns of co-expression by running a biclustering on the returned module data, and will not necessarily return contrasts where the query input genes behave in the same way (although this functionality is still available in the Advanced search). Other improvements include various export functionalities so that COLOMBOS results can be easily imported into other widely used tools or databases (such as Cytoscape (13) and BioCyc) for further downstream analysis.
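To illustrate what restricting a search to coherent co-expression means, the following Python sketch selects the contrasts in which all query genes change in the same direction by at least a given amount. It is not the Quicksearch implementation; the coherence criterion, the threshold `min_abs` and all values are assumptions made for this example, and no biclustering is performed.

```python
# Toy expression values for three query genes over four contrasts (invented).
expression = {
    "geneA": [2.0, -1.5, 0.1, 1.0],
    "geneB": [1.8, -1.2, -0.1, -1.0],
    "geneC": [2.2, -1.6, 0.0, 0.2],
}


def coherent_contrasts(expression, query_genes, min_abs=1.0):
    """Return indices of contrasts where all query genes agree in direction.

    A contrast is kept when every query gene is up (or every gene is down)
    and the weakest change is at least `min_abs` in absolute value.
    """
    n = len(expression[query_genes[0]])
    selected = []
    for i in range(n):
        values = [expression[g][i] for g in query_genes]
        same_direction = all(v > 0 for v in values) or all(v < 0 for v in values)
        strong = min(abs(v) for v in values) >= min_abs
        if same_direction and strong:
            selected.append(i)
    return selected


selected = coherent_contrasts(expression, ["geneA", "geneB", "geneC"])
print(selected)
```

With these toy values the last contrast is discarded because geneA and geneB respond in opposite directions; a coherence-only search hides such contrasts, whereas a search that allows complex patterns could still report them.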

DISCUSSION AND FUTURE PLANS

COLOMBOS’ growth over the years has been a continuous effort towards better gene expression data integration and easier exploration and interpretation. Not only has the amount of data more than doubled, but this last major update is another step in the direction of improving the strengths and eliminating the weaknesses of the previous version(s). The redesigned condition annotation system provides a more reliable interpretation of expression patterns with respect to the biological stimuli that are causing them. The new cross-species capabilities have the obvious advantage over the old system of allowing gene expression analyses on all species simultaneously, but they also enable a more accurate mapping of measurements by separating different strains within the same species. Keeping the compendia up-to-date, as well as expanding the scope by adding new organisms, is naturally our first priority. We generally select new species or strains based on data availability, but are always open to suggestions or requests from users who are interested in access to a gene expression compendium for a particular species. Further improvements and new functionalities that revolve around cross-species capabilities are planned for future versions. Flexibility regarding CHG selection and composition, as well as new tools to empower users when dealing with complex CHGs, are amongst the priorities. For instance, instead of being limited to pre-calculated, fixed CHGs for which homologues cannot be re-defined and that encompass all species in the compendia, as is the case now, users will be able to define the settings to create CHGs for the species of their choice and consequently integrate the data from the corresponding compendia more dynamically. Updated tools will likewise enable a finer management of CHGs, unlike e.g. the current paralogues’ expression calculation, which averages across all paralogues without allowing a different evaluation that considers the variability amongst them, and will give users the ability to compare expression-derived measures, such as co-expression scores or networks, across species.
REFERENCES

1.  NCBI GEO: archive for functional genomics data sets--update.

Authors:  Tanya Barrett; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Michelle Holko; Andrey Yefanov; Hyeseung Lee; Naigong Zhang; Cynthia L Robertson; Nadezhda Serova; Sean Davis; Alexandra Soboleva
Journal:  Nucleic Acids Res       Date:  2012-11-27       Impact factor: 16.971

2.  EcoCyc: fusing model organism databases with systems biology.

Authors:  Ingrid M Keseler; Amanda Mackie; Martin Peralta-Gil; Alberto Santos-Zavaleta; Socorro Gama-Castro; César Bonavides-Martínez; Carol Fulcher; Araceli M Huerta; Anamika Kothari; Markus Krummenacker; Mario Latendresse; Luis Muñiz-Rascado; Quang Ong; Suzanne Paley; Imke Schröder; Alexander G Shearer; Pallavi Subhraveti; Mike Travers; Deepika Weerasinghe; Verena Weiss; Julio Collado-Vides; Robert P Gunsalus; Ian Paulsen; Peter D Karp
Journal:  Nucleic Acids Res       Date:  2012-11-09       Impact factor: 16.971

3.  COLOMBOS: access port for cross-platform bacterial expression compendia.

Authors:  Kristof Engelen; Qiang Fu; Pieter Meysman; Aminael Sánchez-Rodríguez; Riet De Smet; Karen Lemmens; Ana Carolina Fierro; Kathleen Marchal
Journal:  PLoS One       Date:  2011-07-14       Impact factor: 3.240

4.  ArrayExpress update--simplifying data submissions.

Authors:  Nikolay Kolesnikov; Emma Hastings; Maria Keays; Olga Melnichuk; Y Amy Tang; Eleanor Williams; Miroslaw Dylag; Natalja Kurbatova; Marco Brandizi; Tony Burdett; Karyn Megy; Ekaterina Pilicheva; Gabriella Rustici; Andrew Tikhonov; Helen Parkinson; Robert Petryszak; Ugis Sarkans; Alvis Brazma
Journal:  Nucleic Acids Res       Date:  2014-10-31       Impact factor: 16.971

5.  RefSeq microbial genomes database: new representation and annotation strategy.

Authors:  Tatiana Tatusova; Stacy Ciufo; Boris Fedorov; Kathleen O'Neill; Igor Tolstoy
Journal:  Nucleic Acids Res       Date:  2013-12-06       Impact factor: 16.971

6.  The GOA database: gene Ontology annotation updates for 2015.

Authors:  Rachael P Huntley; Tony Sawford; Prudence Mutowo-Meullenet; Aleksandra Shypitsyna; Carlos Bonilla; Maria J Martin; Claire O'Donovan
Journal:  Nucleic Acids Res       Date:  2014-11-06       Impact factor: 19.160

7.  RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more.

Authors:  Heladia Salgado; Martin Peralta-Gil; Socorro Gama-Castro; Alberto Santos-Zavaleta; Luis Muñiz-Rascado; Jair S García-Sotelo; Verena Weiss; Hilda Solano-Lira; Irma Martínez-Flores; Alejandra Medina-Rivera; Gerardo Salgado-Osorio; Shirley Alquicira-Hernández; Kevin Alquicira-Hernández; Alejandra López-Fuentes; Liliana Porrón-Sotelo; Araceli M Huerta; César Bonavides-Martínez; Yalbi I Balderas-Martínez; Lucia Pannier; Maricela Olvera; Aurora Labastida; Verónica Jiménez-Jacinto; Leticia Vega-Alvarado; Victor Del Moral-Chávez; Alfredo Hernández-Alvarez; Enrique Morett; Julio Collado-Vides
Journal:  Nucleic Acids Res       Date:  2012-11-29       Impact factor: 16.971

8.  Evolution of Intra-specific Regulatory Networks in a Multipartite Bacterial Genome.

Authors:  Marco Galardini; Matteo Brilli; Giulia Spini; Matteo Rossi; Bianca Roncaglia; Alessia Bani; Manuela Chiancianesi; Marco Moretto; Kristof Engelen; Giovanni Bacci; Francesco Pini; Emanuele G Biondi; Marco Bazzicalupo; Alessio Mengoni
Journal:  PLoS Comput Biol       Date:  2015-09-04       Impact factor: 4.475

9.  COLOMBOS v2.0: an ever expanding collection of bacterial expression compendia.

Authors:  Pieter Meysman; Paolo Sonego; Luca Bianco; Qiang Fu; Daniela Ledezma-Tejeida; Socorro Gama-Castro; Veerle Liebens; Jan Michiels; Kris Laukens; Kathleen Marchal; Julio Collado-Vides; Kristof Engelen
Journal:  Nucleic Acids Res       Date:  2013-11-08       Impact factor: 16.971

10.  The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases.

Authors:  Ron Caspi; Tomer Altman; Richard Billington; Kate Dreher; Hartmut Foerster; Carol A Fulcher; Timothy A Holland; Ingrid M Keseler; Anamika Kothari; Aya Kubo; Markus Krummenacker; Mario Latendresse; Lukas A Mueller; Quang Ong; Suzanne Paley; Pallavi Subhraveti; Daniel S Weaver; Deepika Weerasinghe; Peifen Zhang; Peter D Karp
Journal:  Nucleic Acids Res       Date:  2013-11-12       Impact factor: 16.971
