ORegAnno 3.0: a community-driven resource for curated regulatory annotation.

Robert Lesurf, Kelsy C Cotto, Grace Wang, Malachi Griffith, Katayoon Kasaian, Steven J M Jones, Stephen B Montgomery, Obi L Griffith.

Abstract

The Open Regulatory Annotation database (ORegAnno) is a resource for curated regulatory annotation. It contains information about regulatory regions, transcription factor binding sites, RNA binding sites, regulatory variants, haplotypes, and other regulatory elements. ORegAnno differentiates itself from other regulatory resources by facilitating crowd-sourced interpretation and annotation of regulatory observations from the literature and highly curated resources. It contains a comprehensive annotation scheme that aims to describe both the elements and outcomes of regulatory events. Moreover, ORegAnno assembles these disparate data sources and annotations into a single, high quality catalogue of curated regulatory information. The current release is an update of the database previously featured in the NAR Database Issue, and now contains 1 948 307 records, across 18 species, with a combined coverage of 334 215 080 bp. Complete records, annotation, and other associated data are available for browsing and download at http://www.oreganno.org/.
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2015        PMID: 26578589      PMCID: PMC4702855          DOI: 10.1093/nar/gkv1203

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The Open Regulatory Annotation database (ORegAnno) was first released about a decade ago (1), with the aim of collecting and synthesizing a catalogue of regulatory elements. It remains unique in the field because of its focus on collecting high quality, curated regulatory records from the literature. Moreover, ORegAnno relies on a thriving community of scientists who are interested in contributing to the resource, as well as utilizing its data. Since the last release of ORegAnno in early 2008 (2), the amount and types of published regulatory data have grown exponentially. This relates in part to high-throughput studies from the ENCODE consortium and others, who have performed an enormous number of ChIP-seq, DNase-seq, FAIRE-seq and other experiments aiming to identify biochemically available and transcriptionally active regions of genomes (3). While these efforts are excellent resources for identifying candidate regulatory regions, ENCODE efforts have suggested that as much as 80% of the genome could be functional (3). This controversial finding has been the focus of much attention in the community, with several commentaries pointing out that these types of high-throughput data are prone to overestimates due to experimental and statistical methods that result in a high number of false positive calls (4–6). Moreover, they do not necessarily provide a comprehensive understanding of all of the elements involved in gene regulation. For example, knowing the region of DNA that is bound by a transcription factor does not directly indicate whether the expression of any gene is altered, nor whether an alteration results in up- versus down-regulation. Validation of the genomic regions identified by ENCODE and others requires a large amount of low-throughput experimental data paired with careful manual curation.
Additionally, much of the available evidence supporting gene regulation is dispersed across various experiments, specialized datasets, and individual publications, making it cumbersome to obtain regulatory information that has been released by the community across this broad set of sources. The current version of ORegAnno seeks to address these issues by cataloging a large number of new, curated, high quality regulatory records that are derived from published literature and other data resources.

RESULTS

Overview

The current version of ORegAnno now has a total of 1 948 307 unique records. These records cover a combined 334 215 080 bp across 18 species (Figure 1A and B). The vast majority of these records are mapped to human and mouse genomes, with 1 452 466 records in human (261 660 516 bp in the GRCh38/hg38 genome assembly version) and 415 808 records in mouse (57 253 973 bp in the GRCm38/mm10 genome assembly version).
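A combined coverage figure of this kind can be illustrated with a minimal Python sketch. It assumes that coverage counts each base once, so overlapping records on the same chromosome are merged before their lengths are summed; the source does not state exactly how the totals were computed. The function name, the `(chrom, start, end)` tuple layout, and the 0-based half-open coordinate convention are illustrative choices, not part of ORegAnno.

```python
from collections import defaultdict


def combined_coverage(records):
    """Count distinct bases covered by a set of genomic intervals.

    records: iterable of (chrom, start, end) tuples using 0-based,
    half-open coordinates (an assumed convention for this sketch).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in records:
        by_chrom[chrom].append((start, end))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current merged block.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Hypothetical records, not taken from the database.
example = [
    ("chr1", 100, 200),
    ("chr1", 150, 260),  # overlaps the previous record
    ("chr1", 400, 450),
    ("chr2", 0, 10),
]
print(combined_coverage(example))  # 160 + 50 + 10 = 220
```

Merging before summing keeps the total from being inflated when several records describe the same region, which matters when records from multiple sources are pooled.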
Figure 1.

Current content of the ORegAnno database. Content statistics are divided by species (A and B), regulatory type (C and D), and data source (E and F).

As a measure of the success of our community-based participation, ORegAnno currently has 1044 registered users. Excluding contributions from the principal authors of this paper, 13 301 records have been contributed by members of the broader community (The Open Regulatory Annotation Consortium). ORegAnno continues to have a robust verification system to ensure that contributed records are accurate and appropriately annotated. A set of trusted consortium members has been granted ‘validator’ status, allowing them to review and up- or down-vote records. This results in individual record scores that are visible to all users. Moreover, when a record is negatively scored, it will typically be assigned a deprecated status. ORegAnno additionally includes an ontology for summarizing the experimental evidence that supports the regulatory elements and outcome in each record. Together, these features allow users to filter records according to various quality criteria. The ORegAnno database has served as a repository for publishing regulatory sites derived from experimental data (7), and it has been incorporated into other resources including Babelomics (8), cisRED (9), ConTra (10), GRASP (11), i-cisTarget (12), LASAGNA-Search 2.0 (13), the UCSC Genome Browser (14) and more. Similarly, the annotated information included in ORegAnno has been used to construct gene regulation networks for the development of other tools and the analysis of gene expression data (15–19).
ORegAnno records were used in the REC-set design for a capture sequence reagent (20), and as part of the definition for regulatory sites of the human genome (tier 2) in the Genome Modeling System (21), an analysis information management system at the McDonnell Genome Institute of Washington University that has been used to process over 4800 human whole genome samples, over 40 000 exomes, and over 1400 transcriptomes. Similarly, ORegAnno has been adapted into the information systems of other research centers including the Broad Institute and Cancer Research UK, where it has been used in the analysis of several high-impact studies (22–25). Because ORegAnno focuses on curated regulatory information, the total genomic coverage found in ORegAnno is smaller than that identified by resources such as ENCODE or the ENSEMBL regulatory tracks (26), which are largely a summary of ENCODE data (Figure 2). This trade-off is part of an effort to ensure that ORegAnno represents a high-quality curated set of regulatory elements, with the aim of maintaining a low number of false positive records.
Figure 2.

Comparison of the genomic coverage captured by ORegAnno and the ENSEMBL Regulatory Track. A Venn diagram demonstrates coverage overlaps for human genome assembly version GRCh38/hg38, with sets sized to scale. The ENSEMBL Regulatory Track is divided into two main sets, a track overview set and the transcription factor binding site (TFBS) set.


Updates

Older records, including those that were added through crowd-sourcing efforts via the web, have been updated to ensure that only accurate and up-to-date gene symbols are being used. This was accomplished both by automatically updating symbols using NCBI Gene or ENSEMBL identifiers and by manually checking incorrect and missing data. In addition, previously missing identifiers from NCBI Gene or ENSEMBL have been added where possible, allowing for future automated updates to ensure the accuracy of these gene lists. These updates have resulted in 423 automated changes and 13 174 manually curated changes (13 597 total) affecting 10 386 records. For all ORegAnno records (existing and new), genomic coordinates have been updated and expanded using liftOver (27). This involved converting older genomic coordinates to newer assembly versions, as well as converting coordinates from new versions to older assemblies. Thus, each record may now be associated with multiple genomic coordinates (from multiple assembly versions). For example, since the last version of ORegAnno was published in 2008, the human genome assembly version GRCh38/hg38 was released. All existing ORegAnno human records having genomic coordinates based on assembly versions GRCh36/hg18 or GRCh37/hg19 now have additional updated coordinates using GRCh38/hg38. Similarly, new records that were entered using GRCh38/hg38 coordinates have received additional coordinates based on GRCh37/hg19 and GRCh36/hg18. This allows users to access the genomic coordinates of regulatory regions for the assembly versions that best suit their purposes. Finally, new types of transcriptional regulation have been defined in the current release (Figure 1C and D). These include microRNA and small non-coding RNA binding sites, as well as operons that function to regulate multiple genes under a single promoter.

New records

ORegAnno has maintained a focus on incorporating records derived from high quality, manually curated evidence for gene regulation. These typically include experimental evidence demonstrating that binding of a regulatory element to a specific region of DNA or RNA alters corresponding gene expression levels. In total, the current release of ORegAnno contains 2010 unique records covering 112 582 bp derived directly through literature curation, including 661 records that have been added since the previous ORegAnno release. Highly validated external databases that had been incorporated into earlier ORegAnno releases have been updated. This includes 1874 new records covering an additional 3 591 656 bp derived from VISTA Enhancers (28) (2196 total records covering 3 996 796 total bp), 2934 new records covering an additional 863 201 bp derived from the Yeast Regulatory Map (29) (7320 total records covering 899 449 total bp), as well as 2051 new transcription factor binding site records covering an additional 29 405 bp derived from REDfly (30) (2695 total records covering 913 486 total bp). Previously, ORegAnno had imported records from FlyReg (31), which has since been merged into REDfly. New records have been created by importing data from external databases that were not found in previous ORegAnno releases. This includes 1 093 443 records covering 11 780 604 bp imported from the JASPAR CORE database (32), which contains a curated, non-redundant set of experimentally obtained transcription factor binding sites in eukaryotes. 783 742 records covering 300 003 052 bp were imported from the PAZAR database (33), which included only records with curated evidence of transcription factor binding and regulatory sequence annotation across various species. 
11 451 records covering 4 194 677 bp were derived from RegulonDB (34), a database of transcriptional regulation in Escherichia coli K-12, which includes manually curated records that have been complemented with high throughput datasets and comprehensive computational predictions. We combined conserved miRNA target site predictions from miRanda-mirSVR (35) with experimentally validated miRNA-target interaction data from miRTarBase (36), leading to the addition of 3072 new ORegAnno records covering 44 353 bp. 131 records covering 1216 bp were derived from NFI-RegulomeDB (37), a database with curated binding sites for the NFI (Nuclear Factor I) family of transcription factors using data from the published literature. Finally, 51 transcription factor binding site records covering 7503 bp were created from the PCNE database of phylogenetically conserved noncoding elements (38). Because of the open and accessible design of the ORegAnno database and website, ORegAnno has been used for submitting published experimental data. Since the previous ORegAnno release, four datasets derived from high throughput studies have been submitted, and were subsequently curated to ensure that only regulatory regions with a high degree of evidence were retained. These include RELA (p65) ChIP-PET binding sites in human monocytes (39) (489 records covering 52 886 bp), ESR1 binding sites in human MCF-7 breast cancer cells (40) (1234 records covering 165 538 bp), Esr1 binding sites in mouse liver (41) (5568 records covering 2 378 460 bp), and Foxa2 binding sites in mouse liver (7) (11 475 records covering 8 236 933 bp). In all of these cases, DNA sequences were filtered according to signal strength and proximity to signal peak to reduce false positive calls. A summary of the number of records and genomic coverage contributed by each data source is shown in Figure 1E, F and Supplementary Table S1.

Data access

The ORegAnno database continues to be accessible under an open-source license (GNU Lesser General Public License), in order to encourage development and participation from the community. Monthly ORegAnno database summaries are generated automatically and provide fundamental regulatory information from ORegAnno in a tab-delimited text file that is available for free download, without the need to register with the ORegAnno website (http://www.oreganno.org/). The ORegAnno website back end code has been updated to improve security and performance, and to accommodate the new data types, dataset sources, and the increased number of records that have been added since the previous release. New search functionality has been added, including the ability to browse records by transcription factor/regulatory element of interest. Source code for the ORegAnno website is available at https://java.net/projects/oreganno/. The regulatory regions and associated annotation for all supported species have been submitted to the UCSC Genome Browser (14) as updates to existing ORegAnno tracks. This updates existing tracks with a more comprehensive collection of putative regulatory elements, and additionally provides new tracks on several genome assembly versions.
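As an illustration of how a tab-delimited summary file might be consumed, the following minimal Python sketch reads tab-separated text and keeps records for one species while skipping deprecated ones. The column names (`record_id`, `species`, `status`, `type`), the status values, and the sample rows are hypothetical placeholders; the actual header of the downloadable ORegAnno file is not given in this paper and should be checked before adapting the code.

```python
import csv
import io


def filter_records(tsv_text, species, exclude_deprecated=True):
    """Return rows for `species`, optionally dropping deprecated records.

    The column names used here are assumptions for illustration only.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if row["species"] != species:
            continue
        if exclude_deprecated and row["status"].lower() == "deprecated":
            continue
        kept.append(row)
    return kept


# Hypothetical sample content standing in for a downloaded summary file.
sample = (
    "record_id\tspecies\tstatus\ttype\n"
    "REC1\tHomo sapiens\tcurrent\ttranscription factor binding site\n"
    "REC2\tHomo sapiens\tdeprecated\tregulatory region\n"
    "REC3\tMus musculus\tcurrent\ttranscription factor binding site\n"
)

for row in filter_records(sample, "Homo sapiens"):
    print(row["record_id"], row["type"])
```

For a real download, the same function could be given the contents of the file read from disk in place of the `sample` string.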

Applications

Recently, there has been immense focus on the role of regulatory regions in cancer. In particular, recurrent somatic mutations in the TERT promoter have been identified in various cancer types (42–45), and are associated with increased expression of TERT. Although the importance of TERT up-regulation in cancer has been well-established for nearly two decades (46), it is only in recent years that we have identified the regulatory mechanism driving TERT up-regulation in such cases. While additional efforts have identified a small number of other recurrent regulatory mutations in cancer (47–49), this number is far smaller than the recurrent protein-coding mutations that have been identified. This is likely due to several factors, including that most cancer survey projects have focused primarily on coding regions by using exome capture reagents to enrich for these regions, and that the TERT promoter region, as with many other genes, has a high GC content making both PCR amplification and sequencing challenging. Previous identification of coding regions of the genome made it possible to perform exome targeted sequencing of these regions in a large number of cancer cases at a relatively low cost. Similarly, we have used ORegAnno and other sources to design a ‘regulome’ capture reagent for targeted sequencing. The high quality, relatively small coverage of literature-curated transcription factor binding sites, regulatory polymorphisms, and NFI-RegulomeDB (37) sites identified in ORegAnno, in conjunction with regulatory regions defined by FunSeq (50), and 500 bp regions upstream of each gene transcription start site, were used to define the ‘regulome’ region. As a proof of principle, we then applied ‘regulome’ capture-sequencing to ten normal/tumor pairs of hepatocellular carcinoma (HCC).
Overall coverage of the regulatory region defined in the capture reagent was higher in whole regulome sequencing (WRS) samples versus whole genome sequencing (WGS) samples of the same tissues, with median average read depths of 29× in WGS normal, 49× in WGS tumor, 60× in WRS normal and 68× in WRS tumor (Figure 3A, Supplementary Table S2). This improved coverage allowed us to reliably identify the canonical somatic TERT promoter mutation C228T in six of the ten cases, an illustrative example of which is shown in Figure 3B.
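One component of the ‘regulome’ definition described above, the 500 bp region upstream of each transcription start site (TSS), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the capture reagent: coordinates are taken to be 0-based and half-open, the TSS base itself is excluded from the window, and clipping at the far end of a chromosome is not handled. The function name is hypothetical.

```python
def upstream_window(tss, strand, size=500):
    """Return the (start, end) interval covering `size` bp upstream of a TSS.

    tss: 0-based position of the transcription start site.
    strand: "+" or "-"; upstream means lower coordinates on "+" and
    higher coordinates on "-".
    The returned interval is 0-based and half-open (assumed convention).
    """
    if strand == "+":
        # Clip at the chromosome start so coordinates never go negative.
        return (max(0, tss - size), tss)
    if strand == "-":
        return (tss + 1, tss + 1 + size)
    raise ValueError(f"unknown strand: {strand!r}")


# Hypothetical TSS positions, for illustration only.
print(upstream_window(1000, "+"))  # (500, 1000)
print(upstream_window(1000, "-"))  # (1001, 1501)
print(upstream_window(100, "+"))   # (0, 100), clipped at the chromosome start
```

Handling the strand explicitly matters because a fixed offset toward lower coordinates would place the window inside the gene body for genes on the reverse strand.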
Figure 3.

Capture reagent using ORegAnno sites improves coverage of regulatory regions in human hepatocellular carcinoma. (A) Coverage across the entire ‘regulome’ is visualized as a heatmap for each of the ten HCC cases. WRS samples have greater sequencing read depth across the targeted region, compared with WGS samples. (B) An illustrative IGV (51) screenshot is shown of the TERT promoter for one HCC case. A canonical C228T somatic mutation is observed in the WRS data, but cannot be reliably called in the WGS data.

  51 in total

1.  Genome-wide identification of estrogen receptor alpha-binding sites in mouse liver.

Authors:  Hui Gao; Susann Fält; Albin Sandelin; Jan-Ake Gustafsson; Karin Dahlman-Wright
Journal:  Mol Endocrinol       Date:  2007-09-27

2.  Use of data-biased random walks on graphs for the retrieval of context-specific networks from genomic data.

Authors:  Kakajan Komurov; Michael A White; Prahlad T Ram
Journal:  PLoS Comput Biol       Date:  2010-08-19       Impact factor: 4.475

3.  REDfly v3.0: toward a comprehensive database of transcriptional regulatory elements in Drosophila.

Authors:  Steven M Gallo; Dave T Gerrard; David Miner; Michael Simich; Benjamin Des Soye; Casey M Bergman; Marc S Halfon
Journal:  Nucleic Acids Res       Date:  2010-10-21       Impact factor: 16.971

4.  Recurrent somatic mutations in regulatory regions of human cancer genomes.

Authors:  Collin Melton; Jason A Reuter; Damek V Spacek; Michael Snyder
Journal:  Nat Genet       Date:  2015-06-08       Impact factor: 38.330

5.  Autophagy Regulatory Network - a systems-level bioinformatics resource for studying the mechanism and regulation of autophagy.

Authors:  Dénes Türei; László Földvári-Nagy; Dávid Fazekas; Dezső Módos; János Kubisch; Tamás Kadlecsik; Amanda Demeter; Katalin Lenti; Péter Csermely; Tibor Vellai; Tamás Korcsmáros
Journal:  Autophagy       Date:  2015       Impact factor: 16.016

6.  The ensembl regulatory build.

Authors:  Daniel R Zerbino; Steven P Wilder; Nathan Johnson; Thomas Juettemann; Paul R Flicek
Journal:  Genome Biol       Date:  2015-03-24       Impact factor: 13.583

7.  Genome-wide analysis of noncoding regulatory mutations in cancer.

Authors:  Nils Weinhold; Anders Jacobsen; Nikolaus Schultz; Chris Sander; William Lee
Journal:  Nat Genet       Date:  2014-09-28       Impact factor: 38.330

8.  An integrative approach to inferring gene regulatory module networks.

Authors:  Michael Baitaluk; Sergey Kozhenkov; Julia Ponomarenko
Journal:  PLoS One       Date:  2012-12-20       Impact factor: 3.240

9.  An improved map of conserved regulatory sites for Saccharomyces cerevisiae.

Authors:  Kenzie D MacIsaac; Ting Wang; D Benjamin Gordon; David K Gifford; Gary D Stormo; Ernest Fraenkel
Journal:  BMC Bioinformatics       Date:  2006-03-07       Impact factor: 3.169

10.  RANK- and c-Met-mediated signal network promotes prostate cancer metastatic colonization.

Authors:  Gina Chia-Yi Chu; Haiyen E Zhau; Ruoxiang Wang; André Rogatko; Xu Feng; Majd Zayzafoon; Youhua Liu; Mary C Farach-Carson; Sungyong You; Jayoung Kim; Michael R Freeman; Leland W K Chung
Journal:  Endocr Relat Cancer       Date:  2014-03-04       Impact factor: 5.678
