Víctor H Tierrafría, Claire Rioualen, Heladia Salgado, Paloma Lara, Socorro Gama-Castro, Patrick Lally, Laura Gómez-Romero, Pablo Peña-Loredo, Andrés G López-Almazo, Gabriel Alarcón-Carranza, Felipe Betancourt-Figueroa, Shirley Alquicira-Hernández, J Enrique Polanco-Morelos, Jair García-Sotelo, Estefani Gaytan-Nuñez, Carlos-Francisco Méndez-Cruz, Luis J Muñiz, César Bonavides-Martínez, Gabriel Moreno-Hagelsieb, James E Galagan, Joseph T Wade, Julio Collado-Vides.
Abstract
Genomics has set the basis for a variety of methodologies that produce high-throughput datasets identifying the different players that define gene regulation, particularly regulation of transcription initiation and operon organization. These datasets are available in public repositories, such as the Gene Expression Omnibus or ArrayExpress. However, accessing and navigating such a wealth of data is not straightforward. No resource currently exists that offers all available high- and low-throughput data on transcriptional regulation in Escherichia coli K-12 in a form that is easy to use both as whole datasets and as individual interactions and regulatory elements. RegulonDB (https://regulondb.ccg.unam.mx) began gathering high-throughput dataset collections in 2009, starting with transcription start sites, then adding ChIP-seq and gSELEX in 2012, with up to 99 different experimental high-throughput datasets available in 2019. In this paper we present a radical upgrade to more than 2000 high-throughput datasets, processed to facilitate their comparison, introducing up-to-date collections of transcription termination sites and transcription units, transcription factor binding interactions derived from ChIP-seq, ChIP-exo, gSELEX and DAP-seq experiments, and expression profiles derived from RNA-seq experiments. For ChIP-seq experiments we offer both the data as presented by the authors and data uniformly processed in-house, which enhances their comparability as well as the traceability of the methods and the reproducibility of the results. Furthermore, we have expanded the tools available for browsing and visualization across and within datasets. We include comparisons against previously existing knowledge in RegulonDB from classic experiments, a nucleotide-resolution genome viewer, and an interface that enables users to browse datasets by querying their metadata.
A particular effort was made to automatically extract detailed experimental growth conditions by implementing an assisted curation strategy that applies natural language processing (NLP) and machine learning. We provide summaries with the total number of interactions found in each experiment, as well as tools to identify common results among different experiments. This is a long-awaited resource for making use of this wealth of knowledge and advancing our understanding of the biology of the model bacterium E. coli K-12.
Keywords: ChIP-exo; ChIP-seq; DAP-seq; Escherichia coli K-12; High-Throughput Nucleotide Sequencing; RNA-seq; Transcriptional Regulatory Network; gSELEX
Year: 2022 PMID: 35584008 PMCID: PMC9465075 DOI: 10.1099/mgen.0.000833
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Data model for HT dataset collections represented as a Unified Modelling Language (UML) class diagram. The links represent bidirectional associations between two classes, and the numbers 1, 0..*, 1..* represent the multiplicity value. For example, the class Dataset can have 0 or 1 Author DataFile. The components of datasets are the Metadata, defined as properties in the Dataset class, the Growth Conditions, curated manually or using the NLP method, and related data files, either gathered from authors or processed for uniformity.
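The data model described in the Fig. 1 caption can be sketched as plain Python classes. This is only an illustration under stated assumptions: the caption names the Dataset class, the Author DataFile and the Growth Conditions, but every field name below (`dataset_id`, `dataset_type`, `term_type`, `curation`, `uniform_files`), and the 0..* multiplicity for uniformly processed files, is hypothetical and not taken from the RegulonDB schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataFile:
    path: str
    source: str  # assumed vocabulary: "author" or "uniform"


@dataclass
class GrowthCondition:
    term_type: str  # e.g. "Temperature" (hypothetical field name)
    value: str      # e.g. "37 C"
    curation: str   # "manual" or "nlp", the two routes named in the caption


@dataclass
class Dataset:
    dataset_id: str    # hypothetical identifier
    dataset_type: str  # e.g. "TF binding"
    # Metadata are modelled as properties of the Dataset class.
    metadata: Dict[str, str] = field(default_factory=dict)
    growth_conditions: List[GrowthCondition] = field(default_factory=list)
    # Multiplicity 0..1: a dataset has at most one author-provided file.
    author_file: Optional[DataFile] = None
    # Assumed multiplicity 0..*: files processed for uniformity.
    uniform_files: List[DataFile] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Guard the 0..1 association against a mislabelled file.
        if self.author_file is not None and self.author_file.source != "author":
            raise ValueError("author_file must have source == 'author'")


example = Dataset(
    dataset_id="DS-0001",
    dataset_type="TF binding",
    metadata={"strategy": "ChIP-seq"},
    growth_conditions=[GrowthCondition("Temperature", "37 C", "nlp")],
    author_file=DataFile("peaks_author.bed", "author"),
)
```

Using `Optional` for the author file and a list for the uniformly processed files mirrors the 0..1 and 0..* multiplicities directly in the types.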
Fig. 2. Overview of the RegulonDB HT framework. This diagram summarizes the three types of dataset collections built in RegulonDB HT: i) genomic features (TUs, TSSs, and TTSs), ii) TF binding and iii) gene expression, displayed as grayscale background columns; and the steps implemented to generate them: i) data gathering, ii) curation, iii) normalization and iv) integration, displayed as horizontal lanes. Further details are described in the Methods sections on datasets.
Fig. 3. Steps for growth conditions extraction using our NLP method.
Fig. 4. RegulonDB-HT search tool. This tool gives access to all types of HT datasets retrieved so far; the example shown is access to a TF binding HT dataset. (a) RegulonDB portal. (b) RegulonDB HT collections. (c) Content of a TF binding dataset, from the ChIP-seq subcollection.
Number and content of RegulonDB HT datasets
| Object | Strategy | No. of datasets | No. of objects curated from papers | No. of objects identified from raw data | Additional information |
|---|---|---|---|---|---|
| Gene expression | RNA-seq | 1864 (a) | | 4618 (b) | |
| TF Binding | ChIP-seq | 29 (c) | 6585 peaks | 13167 peaks, 5108 sites | Table S2 |
| TF Binding | ChIP-exo | 94 | 23170 peaks | | Table S3 |
| TF Binding | gSELEX | 164 | 35022 peaks | | Table S4 |
| TF Binding | DAP-seq | 215 | 19540 peaks | | Table S5 |
| TUs | RNA-seq | 5 | 12347 (d) | | Table S6 |
| TSSs | RNA-seq | 16 | 68049 (d) | | Table S7 |
| TTSs | RNA-seq | 5 | 5326 (d) | | Table S8 |
a, The total of SRRs retrieved, which include 575 only in DEE2, 914 (820 GSMs) only in GEO, and 375 (337 GSMs) in both DEE2 and GEO
b, Average number of genes per dataset.
c, Including 27 processed by authors and 28 processed in house.
d, The number of these objects may be higher than in the original publications because they were calculated per dataset, after our uniformization process.
nd, Object identification not determined by the RegulonDB Team.
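The counts in footnote a can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this record; the variable names are illustrative.

```python
# SRR counts from footnote a, split by repository membership.
srr_only_dee2 = 575
srr_only_geo = 914
srr_both = 375

# The three groups are disjoint, so the total is a plain sum.
srr_total = srr_only_dee2 + srr_only_geo + srr_both

# SRRs and GSMs reachable through GEO (GEO-only plus shared with DEE2).
srr_in_geo = srr_only_geo + srr_both
gsm_in_geo = 820 + 337

print(srr_total, srr_in_geo, gsm_in_geo)
```

The total equals the 1864 gene expression datasets in the table, and the GEO subtotals (1289 SRRs, 1157 GSMs) agree with the figures quoted in the Fig. 8 caption.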
Fig. 5. TFs with binding identified by ChIP-exo, ChIP-seq, DAP-seq and/or gSELEX. (a) Comparison of TFs studied with LT approaches available in RegulonDB, with TFs examined with HT technologies. In RegulonDB, 222 TFs have been confirmed by classical LT evidence with at least one regulatory interaction (displayed as a horizontal blue bar). Each vertical bar represents a group of TFs associated with LT and/or HT experiments, as displayed by the black dots in the bottom rows. (b) Average percentage of TF-gene interactions with classical evidence in RegulonDB, identified in data processed by authors.
Fig. 6. Number of TFRSs from classic RegulonDB (blue bars), those found in in-house processed ChIP-seq peaks (yellow bars), and those identified in peaks through pattern-matching, using RegulonDB TF motifs (grey bars).
Fig. 7. High-throughput TSS datasets collected and mapped to RegulonDB classic TSSs. (a) Number of TSSs per HT dataset. (b) Number of HT TSSs that match with at least one classic TSS. (c) Number of classic TSSs that match with at least one HT TSS, for each HT dataset.
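Panels (b) and (c) of Fig. 7 count matches in opposite directions, so the two numbers need not agree. The sketch below illustrates this with a simple positional match. The matching rule is an assumption made for illustration only: a TSS is represented as a `(position, strand)` tuple, and two TSSs match when they lie on the same strand within a hypothetical window of ±5 bp. The criterion actually used for RegulonDB is not stated in this record.

```python
from bisect import bisect_left
from collections import defaultdict


def index_by_strand(tss_list):
    """Group TSS positions by strand and sort them for binary search."""
    by_strand = defaultdict(list)
    for position, strand in tss_list:
        by_strand[strand].append(position)
    for positions in by_strand.values():
        positions.sort()
    return by_strand


def has_match(position, sorted_positions, window):
    """True if any reference position lies within +/- window of position."""
    i = bisect_left(sorted_positions, position - window)
    return i < len(sorted_positions) and sorted_positions[i] <= position + window


def count_matching(query, reference, window=5):
    """Count query TSSs that match at least one reference TSS."""
    ref_index = index_by_strand(reference)
    return sum(
        1
        for position, strand in query
        if has_match(position, ref_index.get(strand, []), window)
    )


# Toy coordinates, not real data.
classic = [(100, "+"), (500, "-"), (900, "+")]
ht = [(102, "+"), (104, "+"), (498, "+"), (2000, "-")]

ht_matching_classic = count_matching(ht, classic)  # analogue of panel (b)
classic_matching_ht = count_matching(classic, ht)  # analogue of panel (c)
```

In the toy data two HT TSSs fall near the same classic TSS, so the HT-side count is 2 while the classic-side count is 1.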
F1-score in testing for types of growth condition
| Growth condition | Precision | Recall | F1-score | Support* |
|---|---|---|---|---|
| Optical density | 1.00 | 1.00 | 1.00 | 21 |
| pH | 1.00 | 1.00 | 1.00 | 10 |
| Technique | 1.00 | 1.00 | 1.00 | 33 |
| Culture medium | 1.00 | 0.80 | 0.89 | 56 |
| Temperature | 0.86 | 0.80 | 0.83 | 15 |
| Agitation | 1.00 | 0.29 | 0.44 | 7 |
| Growth phase | 0.94 | 0.76 | 0.84 | 21 |
| Aeration | 0.63 | 0.59 | 0.61 | 88 |
| Genetic background | 0.89 | 0.86 | 0.88 | 78 |
| Medium supplements | 0.88 | 0.84 | 0.86 | 136 |
| Genome version | 1.00 | 0.50 | 0.67 | 6 |
*Support stands for the number of growth conditions available in testing data for evaluation.
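The F1-score column follows from the precision and recall columns through the standard harmonic-mean definition, F1 = 2PR / (P + R). The short sketch below recomputes a few rows as a consistency check; the function name is illustrative and this is not the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the table: (precision, recall).
culture_medium = round(f1_score(1.00, 0.80), 2)
aeration = round(f1_score(0.63, 0.59), 2)
genome_version = round(f1_score(1.0, 0.5), 3)

# Agitation: the rounded recall 0.29 gives 0.45, whereas an unrounded recall
# of 2/7 (an inference from the support of 7, not stated in the source)
# reproduces the reported 0.44.
agitation_from_rounded = round(f1_score(1.00, 0.29), 2)
agitation_from_fraction = round(f1_score(1.00, 2 / 7), 2)
```

Small discrepancies of this kind are expected whenever F1 is recomputed from precision and recall values that have already been rounded to two decimals.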
Fig. 8. Foreground bar plot: fraction of SRRs for each type of growth condition, based on the GC terms retrieved for RNA-seq datasets from GEO (1289 SRRs, 1157 GSMs, 95 GSEs). Of the 3224 extracted GC terms, 2680 were mapped to MCO entities and 544 were not. Background bar plot: fraction of GSMs for each type of growth condition in the training data (228 GSMs from 27 GSEs).