Literature DB >> 19934259

GeneSigDB--a curated database of gene expression signatures.

Aedín C Culhane1, Thomas Schwarzl, Razvan Sultana, Kermshlise C Picard, Shaita C Picard, Tim H Lu, Katherine R Franklin, Simon J French, Gerald Papenhausen, Mick Correll, John Quackenbush.   

Abstract

The primary objective of most gene expression studies is the identification of one or more gene signatures; lists of genes whose transcriptional levels are uniquely associated with a specific biological phenotype. Whilst thousands of experimentally derived gene signatures are published, their potential value to the community is limited by their computational inaccessibility. Gene signatures are embedded in published article figures, tables or in supplementary materials, and are frequently presented using non-standard gene or probeset nomenclature. We present GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb) a manually curated database of gene expression signatures. GeneSigDB release 1.0 focuses on cancer and stem cells gene signatures and was constructed from more than 850 publications from which we manually transcribed 575 gene signatures. Most gene signatures (n = 560) were successfully mapped to the genome to extract standardized lists of EnsEMBL gene identifiers. GeneSigDB provides the original gene signature, the standardized gene list and a fully traceable gene mapping history for each gene from the original transcribed data table through to the standardized list of genes. The GeneSigDB web portal is easy to search, allows users to compare their own gene list to those in the database, and download gene signatures in most common gene identifier formats.

Entities:  

Mesh:

Year:  2009        PMID: 19934259      PMCID: PMC2808880          DOI: 10.1093/nar/gkp1015

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Microarray gene expression profiling and other high throughput technologies have been applied to investigate and classify thousands of biological conditions. Most studies report one or more gene signatures; lists of genes that are differentially regulated between the cellular states under study, for example in a cell or tissue type, in response to treatment or at a specific time point. The value of these experimentally derived gene signatures often extend beyond their initial publication. A range of applications have been developed to use them, including Gene Set Enrichment Analysis (GSEA) which analyzes gene expression data to look for groups of genes (or gene lists) over-represented among statistically significant genes from a particular experiment (1–3). In breast cancer, a number of experimentally derived gene expression signatures including Mammaprint and Oncotype DX have been developed into commercial diagnostic assays (4) and are being validated in large scale clinical trials (5,6). Gene signatures are analyzed and validated on new gene expression data (7,8) and novel computational methods are being developed for meta analysis of gene signatures. Finally, because published experimentally derived gene signatures are typically selected to differentiate between different classes of samples, meta-analysis of multiple gene lists may provide deeper insight into the biological mechanisms underlying a wide range of processes. While public databases such as ArrayExpress and GEO have been developed to capture gene expression data, there is no existing resource to capture the valuable end-product of the analysis of those data—the gene lists that the analyses produce. Instead, these gene lists are often included in tables or figures embedded in publications or included as supplementary material on the journal’s or the author’s website, making them generally inaccessible to automated computational analysis. If one is able to access these lists, one often finds that the lists are reported using non-standard gene identifiers, making comparison to other lists, or often to the original data, a significant challenge. To be of maximal value, gene signatures should be available through a resource that provides gene lists in a common standard format that is computationally accessible. In addition it should provide the original gene signature table as transcribed from the publication. Reproduction of a computationally accessible original transcribed gene signature table may provide additional signature meta-data, such as information and annotation about the experimental conditions and the criteria used in generating gene lists from the data (such as t-statistics scores or other ranking information) which is useful in some gene set analyses. There have been a number of attempts to collect experimentally derived gene signatures, however these do not generally retain the original transcribed gene signature data from the publication. The largest collections of gene signature are available in MSigDB (3). MSigDB (v2.5) provides gene signatures as annotated lists of gene symbols and have curated gene signatures from 344 publications. Curators and users can submit gene lists as a two column table, containing a gene identifier and its gene symbol. However many MSigDB gene signatures contain only gene symbols, thereby limiting their future re-annotation. The Lists of Lists-Annotated (LOLA; http://www.lola.gwu.edu) database (9) contains 47 gene lists (v1.2, October 2009) and gene list input format is limited to EntrezGene or Affymetrix probeset identifiers. SignatureDB (http://lymphochip.nih.gov/signaturedb/) (10) provides 147 published and non-published gene signatures related to haematopoietic cells. The number of cancer gene signatures in Cancer Genes (http://cbio.mskcc.org/cancergenes) (11) is 26, of which 4 are from the published literature. The lack of retention of original gene list identifiers by these resources, prevents remapping of the original signatures as better genome assemblies and annotation become available (12–14). Possibly due to the limitations of the gene signature input formats, frequently little or no information is provided on the process used to map gene signatures to the identifiers that are reported. To address these issues and facilitate gene list meta-analysis, we have systematically collected published gene signatures from publications indexed in PubMed and mapped them to a common, standardized format, and have made these available in GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb). We do not collect or re-analyze the gene expression data as this is being done by other projects including gene expression atlas (http://www.ebi.ac.uk/gxa). For the GeneSigDB initial release, we reviewed over 850 published articles and manually transcribed 575 experimental cancer gene signatures from tables, figures or Supplementary Data from 319 of those papers. Signatures were curated, annotated, and mapped to the genome, providing 560 standardized gene lists. GeneSigDB provides both the original transcribed gene signature as well as the gene signature in a standard format and we publish a mapping-trace showing how each gene identifier in the original signature was re-annotated. The GeneSigDB web portal allows users to search for gene signatures and provides tools to compare gene signatures, to convert gene lists to common gene identifiers and download gene signatures in over 30 different gene identifier formats.

COLLECTION AND CURATION OF GENE SIGNATURES

Papers likely to contain one or more gene lists were first identified using PubMed searches of the form XXX AND (‘genechip’ OR microarray OR ‘gene expression’) AND (‘gene signature’ OR ‘gene list’ OR ‘expression profile’ OR ‘Classifier’ OR ‘Predictor’) AND English [la] NOT Review [pt], where XXX represents terms relevant to the particular search being conducted, such as ‘breast cancer’ or ‘stem cells.’ A full list of these terms is given in Supplementary Table S1. GeneSigDB v1.0 is based on a search of PubMed which was performed on 15 July 2009. Each article was downloaded and gene signatures were transcribed from the manuscript or its supplementary materials. Information about the source and contents of each gene signature (Tables 1 and 2) were captured into an Excel spreadsheet template designed to capture gene signatures and associated annotation. Gene signatures appeared in a wide variety of places within particular manuscripts, including tables and graphical or textual figures (such as hierarchical clustering heatmaps) in the primary manuscripts and in supplementary pdf, excel, or text documents. Supplementary files appeared in a variety of places, including websites maintained by journals and on authors’ personal websites. Each gene signature was given a signature identifier (SigID) PMID-X, where PMID is the PubMed identifier of the article and X is the table, figure or supplementary file number from which the gene signature was extracted, for example 18490921-Table 3 indicates the gene signature was extracted from Table 3 of the article with the PMID 18490921 (15). Gene signatures were stored as tab-delimited text files and named SigID.txt. Metadata associated with each gene signature were extracted from the Excel file and stored as an xml file, SigID-index.xml, the elements of which are summarized in Figure 1 and Tables 1 and 2. The XSD schema is available File 2 in Supplementary Data. EnsEMBL gene identifiers were used as the primary gene identifier in standardized files to allow gene signature comparisons within GeneSigDB. All gene identifiers that could be mapped (listed in Table 2) to the genome were extracted into a file named SigID-mapping.txt (Figure 1) and were searched against EnsEMBL (version 55, July 2009) using BioMart (Perl API). The search history for each gene was saved in a file called SigID-maptrace.txt. If multiple gene identifiers successfully mapped to the genome, a hierarchal ranking of identifiers was used to select the best gene match (see online documentation for full description of mapping process). Standardized gene lists were stored in a file named SigID-standardized.txt. An individual directory was created for each PMID, which stored a PDF of the source manuscript and the five files derived from the gene signatures; SigID.txt, SigID-mapping.txt, SigID-maptrace.txt, SigID-standardized.txt and SigID-index.xml (Figure 1). These files are respectively, the original gene signature, the mappable gene identifiers from SigID.txt, the mapping-trace showing how each gene was mapped, a list of EnsEMBL gene identifiers that correspond to genes in SigID.txt, and xml annotation of SigID.txt.
Table 1.

Gene Signature Metadata (GeneSigDB v1.0)

NameDescription
PMIDPubMed identifier
TissueName of search term set used to search PubMed. (Supplementary Table S1)
OrganismSpecies common name (human, mouse, etc)
PlatformName of microarray or other experimental technique used to derive gene signature (selection from constrained list, Supplementary Data S2)
Platform descriptionDescription of platform
Genes articleNumber of genes in gene signature (as described in the text of the article)
SigIDSignature identifier, in the format PMID-XXX, where XXX is the gene signature table, figure or supplementary file e.g. 18490921-Table 3
Sig nameName of gene signature, in the format Tissue_AuthorYear_NumberofGenes _Description. Description is optional. e.g. Breast _Bertucci08_75genes
Sig descriptionDescription of gene signature, typically extracted from table or figure legend (free text)
File associatedName of tab delimited file gene signature file. Format is SigID.txt
URLURL from where gene signature was downloaded
Column mappingsContent of each column in gene signature file (selection from constrained list in Table 2)
Table 2.

Column mappings (GeneSigDB v1.0)

Mapping elementDescriptionMapping filea
Probe IDPlatform specific identifierYesb
Clone IDIMAGE clone identifierYesc
GenBank IDGenBank accession numberYesc
UniGene IDUnigene identifier.Yes
EntrezGene IDEntrezGene or LocusLink identifierYes
Gene symbolHGNC official gene symbolYes
CCDS IDConsensus Coding Sequence Database IDYes
EnsEMBL IDEnsEMBL gene IDYes
RefSeq IDRefSeq gene identifierYes
Protein IDProtein sequence ID, SwissProt, UniProtNo
Chromosome mapChromosomal localization dataNo
Geneset specific factorFactor or classifier specific to data, characterNo
Geneset specific statisticsFold change, Ranking of genes, T-statistics, correlation, p-value, numericNo
Gene descriptionDescription or title of geneNo
Other gene descriptionKEGG, GO terms, Keywords, etcNo

aYes indicates these columns were extracted to SigID-mapping.txt for searching biomart.

bNot all platform Probe IDs are sequence mapped in biomart. For some common unmapped microarrays, we sequence matched the probe sequences to the genome. Others were ignored.

cGenBank EST and IMAGE clone ID sequences are not in EnsEMBL and these were mapped via Unigene (See Methods).

Table 3.

Number of articles processed and gene signatures extracted by species

HumanMouseRatTotal
Publications263398308
Gene signatures4658411560
Genes14 1979755773
Number of platforms329438
Average genes/signature13221388
Figure 1.

Curation of gene signatures in GeneSigDB. (A) GeneSigDB hierarchical file structure SigID.txt, SigID-mapping.txt, SigID-maptrace.txt, SigID-standardized.txt and SigID-index.xml. These files are respectively, the original gene signature, the mappable gene identifiers from SigID.txt, the mapping-trace showing how each gene was mapped, a list of EnsEMBL gene identifiers that correspond to genes in SigID.txt, and xml annotation of SigID.txt. (B) An example xml gene signature annotation file.

Gene Signature Metadata (GeneSigDB v1.0) Column mappings (GeneSigDB v1.0) aYes indicates these columns were extracted to SigID-mapping.txt for searching biomart. bNot all platform Probe IDs are sequence mapped in biomart. For some common unmapped microarrays, we sequence matched the probe sequences to the genome. Others were ignored. cGenBank EST and IMAGE clone ID sequences are not in EnsEMBL and these were mapped via Unigene (See Methods). Number of articles processed and gene signatures extracted by species Curation of gene signatures in GeneSigDB. (A) GeneSigDB hierarchical file structure SigID.txt, SigID-mapping.txt, SigID-maptrace.txt, SigID-standardized.txt and SigID-index.xml. These files are respectively, the original gene signature, the mappable gene identifiers from SigID.txt, the mapping-trace showing how each gene was mapped, a list of EnsEMBL gene identifiers that correspond to genes in SigID.txt, and xml annotation of SigID.txt. (B) An example xml gene signature annotation file.

GeneSigDB RELEASE 1.0

The initial release of GeneSigDB provides 560 curated, standardized gene signatures related to cancer of the breast, ovary, lung, colon, skin, prostate, bladder, endometrial, kidney, thyroid and gene signatures related to stem cells (Tables 3 and 4). These are principally derived from gene expression studies in human tumors and cell lines (n=465) but also include additional signatures from mouse (n = 84) and rat (n = 11) (Table 3). We have curated a number of other species but have not presently mapped these to EnsEMBL genomes.
Table 4.

Number of articles processed and gene signatures extracted by search terms (tissue)

Search termsNumber of manuscripts (articles)
Gene signatures
Average of genesa
PubMed hitsDownloaded and processedCuratedMappedCuratedMapped to EnsEMBL
Bladder645610101818115
Breast471241134131243238190
Colon95542020353584
Endometrial1515549816
Kidney1212437655
Liver12927881212144
Lung1671012928424134
Ovary108722828414175
Prostate1361023028524847
Skin88339928
Stem cell1901414542104101205
Thyroid2614222216
Uterus548111145
Total1475851319308575560

aAverage number of genes per signature which were mapped to EnsEMBL genes.

Number of articles processed and gene signatures extracted by search terms (tissue) aAverage number of genes per signature which were mapped to EnsEMBL genes. The number of gene signatures per tumor type varies considerably (Table 4). Breast cancer which has been subjected to extensive gene expression profiling resulting in a new molecular subtype categorization and commercial diagnostic signatures (4), has a high number of gene signatures (n = 238). There is also a large number of gene signatures from stem cell research (n = 101), which are divided by those reported in studies of human (n = 52) and mouse (n = 49) in GeneSigDB. But in other fields of cancer research we have fewer gene signatures (kidney n = 6, endometrial n = 8). The average number of genes per signature across all gene signatures is 81 genes. However, we see variation is the number per tissue which again may reflect the ‘maturity’ of the analysis in a particular tumor. There is a broad correlation between the average number of genes per gene signature and the number of signatures collected for that tumor.

Gene loss when mapping gene signatures

The lack of standardization in reporting gene lists, and the continued evolution of the genome sequence and its annotation cause some loss in mapping gene signatures to standard identifiers. As can be seen in Table 5, many published gene signatures do not provide probe identifiers when reporting signatures despite the fact that the primary identification of the gene lists reported relies on the array probes (or probesets) rather than the genes themselves. Authors tend to report gene names or gene symbols but rarely provide details on how the gene annotations were obtained or the version of the database that was used for mapping. The latter can be important as some databases such as UniGene ‘retire’ cluster identifiers or change the gene associated with a particular cluster and other resources that rely on the genome and its annotation can change associations as the genome sequence is refined over time. In 15 of 575 curated signatures, no mappable gene identifiers were provided, the authors publish their gene signature simply as a set of gene descriptions. In Table 6, it can be seen that the success of mapping is greatly affected by the identifier provided. One lesson that clearly emerges from this analysis is that those identifiers closest to the primary data, such as probe identifiers, have the highest rate of mapping to the EnsEMBL geneIDs that are our standard identifiers. This failure in mapping severely limits our ability to compare gene lists between studies, underscoring the need for standardization in reporting gene lists. Although in Table 6 it appears that there is a low mapping success rate for EMBL/GenBank identifiers (16% success), this is a subset of EMBL/GenBank identifiers and this low mapping rate is an artifact of the search approach we are using that will be corrected in the next release GeneSigDB.
Table 5.

Frequency of different gene identifiers in mapped gene signatures

IdentifierFrequency all gene signatures (n = 560)Frequency human gene signatures (n = 465)
Gene description476381
Gene symbol432359
Probe ID212173
GenBank ID128109
UniGene ID9675
RefSeq ID7553
EntrezGene ID4939
Clone ID3026
EnsEMBL ID119
Table 6.

Success of matching different gene signature identifiers to an EnsEMBL gene

ID typeSpeciesSuccess (unique IDs)Failures (unique IDs)Percentage success
affy_hg_u133_ plus_2Human1108594292
affy_u133_x3pHuman1801094
affy_hg_u133aHuman638428895
affy_hg_u95av2Human6132396
affy_hg_u95aHuman25292
affy_hugeneflHuman43884
affy_mouse430_2Mouse381619595
affy_moe430aMouse340915995
affy_mg_u74av2Mouse215631787
affy_mg_u74aMouse116199
agilent_ wholegenomeHuman, mouse389661486
EntrezgeneHuman985948695
refseq_dnaHuman558657790
ensembl_gene_idHuman312048686
hgnc_symbolHuman5254290864
mgi_symbolMouse124778861
rgd_symbolRat30113569
unigeneHuman, mouse2131186553
embl (genbank)Human308154116
Frequency of different gene identifiers in mapped gene signatures Success of matching different gene signature identifiers to an EnsEMBL gene

Overlap in genes among gene signatures

Having assembled GeneSigDB, obvious questions are which genes are most common in the various signatures that have been reported and whether there is significant overlap between reported signatures in various cancers. To analyze the overlap between gene signatures, the union of all gene lists was compared to each individual list to generate a binary matrix of presence (1) or absence (0) calls of each gene in each gene list. GeneSigDB human gene signatures (n = 465) contain 14 197 EnsEMBL genes. Histograms showing the distribution of genes in human and mouse gene signatures are provided File 3 in Supplementary Data. A large number of genes only occur in only 1 of 465 gene signatures (n = 3611), and 10 586 genes occur in 5 or fewer of the 465 gene signatures. We used a simulation approach to estimate the number of genes which might be in gene signatures by chance (described File 3 in Supplementary Data) and excluded low abundance genes. We investigated the overlap of the remaining 9920 human genes across 465 gene signatures in GeneSigDB. Figure 2 shows the overlap in gene content across all gene signatures. It can clearly be seen that related tumors such as breast and ovarian have a large overlap relative to other tumor types. This may reflect the fact that breast and ovarian cancers are known to have some common genetic components, such as mutations in the BRCA1 and BRCA2 genes. Figure 2b shows a hierarchical clustering of genes in signatures that was performed using a Sorensen’s coefficient asymmetrical measure of binary distance which gives double weight to presence and ignores absence, as we assume that presence of a gene in two lists is more informative than its absence in two gene lists. An absent gene may not be truly absent; gene expression platforms may not sample the entire genome, the probes for a particular gene may be ineffective, or the applied feature selection approach may prove sub-optimal. Since the gene x gene signature overlap matrix is sparse, scoring a double-zero between two gene lists would result in high similarity scores for many gene lists containing only a few genes. In Figure 2b, we see tumor types which are well represented in GeneSigDB cluster apart from those for which we have fewer numbers of gene signatures. However it is intriguing to observe that breast, ovarian and stem cell, colon and prostate signatures cluster, and this may reflect common etiology in these cancers.
Figure 2.

Overlap in gene signatures across tumor types. (A) Heatmap-style representation of gene overlap. Each row is a gene and each column is a tissue type. Presence of a gene is indicated by a black line, and absence is white. It can be seen that some gene maybe linked to phenotypic subclasses in many tumors, but there are generally many more tumor-specific genes, likely indicating the effect of the tissue of origin. (B) Dendrogram of hierarchical cluster analysis that was performed using a Sorensen's coefficient asymmetrical measure of binary distance and joined using Ward's minimum variance method.

Overlap in gene signatures across tumor types. (A) Heatmap-style representation of gene overlap. Each row is a gene and each column is a tissue type. Presence of a gene is indicated by a black line, and absence is white. It can be seen that some gene maybe linked to phenotypic subclasses in many tumors, but there are generally many more tumor-specific genes, likely indicating the effect of the tissue of origin. (B) Dendrogram of hierarchical cluster analysis that was performed using a Sorensen's coefficient asymmetrical measure of binary distance and joined using Ward's minimum variance method. We investigated if there were genes that occur frequently in many gene signatures. We observed 80 genes occur in 25 or more gene signatures. The most frequently observed genes occur in 50 of the 465 genes signatures and are MAD2LI (ENSG00000164109) and RRM2 (ENSG00000171848). However, these occur predominately in breast genes signatures (n = 42/50). We therefore examined which genes are common in many tissues types and observed 29 genes occur in 7 or more tissue types (Table 7). We performed a representational analysis to search Gene Ontology terms and KEGG pathways that are over-represented in that set. Not surprisingly, the most common functional classes found were those associated with cell cycle, consistent with the fact that cancer is a disease that disrupts normal cell cycle control (File 3 in Supplementary Data).
Table 7.

Most common genes across genes signatures from all tissue types

EnsEMBL Gene IDHgnc symbolNumber tissue typesCounts of gene signatures in tissue types
BldBrCoEndKdLiLuOvPrSkSCThyUt
ENSG00000113140SPARC922420012313100
ENSG00000115414FN1812210021750200
ENSG00000131747TOP2A803010111320400
ENSG00000134755DSC280810111110100
ENSG00000151914DST801520011112300
ENSG00000157456CCNB2802710102210210
ENSG00000134057CCNB1802211003131200
ENSG00000087586AURKA803312100111400
ENSG00000120992LYPLA181611011010200
ENSG00000132646PCNA711920002320100
ENSG00000169429IL8721430001320200
ENSG00000121966CXCR4711610000111300
ENSG00000139318DUSP6701110021110200
ENSG00000146674IGFBP370820011430100
ENSG00000170312CDC2702811002120400
ENSG00000185275CD24L4701511001110200
ENSG00000164171ITGA270510110120200
ENSG00000176890TYMS711720011200200
ENSG00000044115CTNNA171810001101100
ENSG00000164442CITED2731021001200100
ENSG00000003436TFPI701010110102100
ENSG00000108821COL1A1711500014110400
ENSG00000111348ARHGDIB711000032110100
ENSG00000196139AKR1C371500011110300
ENSG00000204262COL5A2711300002212100
ENSG00000142871CYR61711100010141200
ENSG00000170345FOS712400010150201
ENSG00000167642SPINT270700101121100
ENSG00000175063UBE2C702500101210110

Tissue types are Bladder (Bld), Breast (Br), Colon (Co), Endometrial (End), Kidney (Kd), Liver (Li), Lung (Lu), Ovary (Ov), Prostate (Pr), Skin (Sk), Stem Cell (SC), Thyroid (Thy), Uterus (Ut).

Most common genes across genes signatures from all tissue types Tissue types are Bladder (Bld), Breast (Br), Colon (Co), Endometrial (End), Kidney (Kd), Liver (Li), Lung (Lu), Ovary (Ov), Prostate (Pr), Skin (Sk), Stem Cell (SC), Thyroid (Thy), Uterus (Ut).

THE GeneSigDB WEB INTERFACE

Querying GeneSigDB

There are two entry points into GeneSigDB, a publication-based and a gene-based search. The publication search queries articles and retrieves a list of publications and the signatures they describe. The results are based on two independent searches. The first is a full-text search of the articles indexed in GeneSigDB. This search includes the article title, authors, affiliations, abstract, introduction, methods, results, discussion and other items in the main body of the publication; the reference section is not included in this search. A second search queries only the information indexed by PubMed which includes title, author names, journal, title and abstract. The most common publication search terms would be an author name, article title, journal name or keywords. Terms can be combined using standard Boolean operators and examples are provided in the documentation online. The gene search queries the annotation of genes within indexed signatures. One can enter either a single gene or multiple genes into the gene search. Gene search terms can be gene symbols, EnsEMBL, Entrezgene, Affymetrix, Illumina or other common microarray probe identifiers. A gene list can be entered in space or comma separated format. Wildcard searches are permitted, for example BRCA* will return BRCA1 and BRCA2 genes. Examples of both publication and gene searches are provided in the help documentation online.

GeneSigDB data views

There are three data views in GeneSigDB; a publication view, a gene signature view, and a gene view. The first is the publication view which contains information about the publication (authors, title, journal, publication date and abstract) and links to all gene signatures extracted from that publication. The second is the gene signature view which presents the gene signature metadata (described in Tables 1 and 2) and data related to the gene signature (Figure 1), including the original transcribed gene signature table and a standardized gene list of EnsEMBL identifiers and gene symbols. GeneSigDB also provides a history of how each gene was mapped to EnsEMBL. When a gene cannot be mapping to an EnsEMBL gene, this is clearly stated. The third data view is the gene view, which provides gene annotation information such as gene synonyms, description and gene identifiers of popular databases (EnsEMBL, EntrezGene, RefSeq). Where a gene signature is of non-human origin, the human orthologue of the gene is provided (where possible). A gene view lists all gene signatures in which that gene can be found.

Visualization of overlap between gene signatures

To visualize the overlap between multiple gene signatures, one can tick checkboxes selecting multiple gene signatures in several views including the publication search results, gene search results, or gene or publication entry view. These gene signatures are passed to a gene signature comparison view. This opens a gene × signature comparison matrix in which the rows are genes and the columns are signatures; the elements of the matrix are colored heatmap-style red or grey to represent presence or absence respectively. The default setting is that only genes present in two or more signatures are shown. As an example, Figure 3 shows the overlap of Fanconi anemia associated genes in GeneSigDB gene signatures. This analysis is based on a gene search with the wild card search FANC* which returned 12 human genes and 2 mouse genes. We selected the 12 human genes by ticking the checkboxes and then clicked on the compare button to visualize the overlap in these in the comparison view.
Figure 3.

Visualization of gene overlap in gene signatures. In this example, we searched for Fanconi anemia-related genes by performing a gene search for FANC* which returned 12 human genes and 2 mouse genes. This screen shows the overlap between gene signatures in which these 12 genes are present in at least 2/12. Red and grey boxes indicate presence or absence, respectively.

Visualization of gene overlap in gene signatures. In this example, we searched for Fanconi anemia-related genes by performing a gene search for FANC* which returned 12 human genes and 2 mouse genes. This screen shows the overlap between gene signatures in which these 12 genes are present in at least 2/12. Red and grey boxes indicate presence or absence, respectively.

Downloading gene signatures

In the publication search results, gene search results or gene or publication entry view, gene signatures can be selected for download using checkboxes which are passed to a download page. There, a user can choose to download the standard gene list (EnsEMBL gene identifiers and gene symbols) or can choose to convert gene signatures into one or many commonly used identifiers, including Entrezgene, ReqSeq gene identifiers or Affymetrix, Agilent or Illumina probe identifiers. There is no limit to the number of identifiers that can be selected or to the number of gene signatures that can be downloaded concurrently. Each gene signature is provided in a separate comma separated file and if multiple gene signatures are downloaded together, these are compressed into one zip file.

Architecture of Web interface to GeneSigDB

The web interface to GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb) is based on HTML, CSS, JSP, XML and Java 1.6 technologies. The application runs on an Apache Tomcat 6 web application server running on a CentOS 5 Linux server. Front-end interactivity makes use of the jQuery 1.3.2 Javascript library and server-side processing is based on the Apache Lucene 2.0 framework. GeneSigDB is not based on traditional database technology since the application does not require a high level of read/write access. Instead, the full text of articles along with numerous other curated resources are indexed and cross-referenced using the Lucene model (http://lucene.apache.org/) to produce a high-performance search system.

Comparison of GeneSigDB to other gene signature resources

GeneSigDB provides a large collection of experimentally derived gene signatures. To the best of our knowledge, only MSigDB contains more curated gene signatures. MSigDB contains 1186 curated (c2) gene signatures from 344 publications, however the overlap between MSigDB and GeneSigDB is minimal. Only signatures from 13 publications are contained in both MSigDB and GeneSigDB. Consequently GeneSigDB release 1.0 provides a large number of cancer and stem cell gene signatures that were not previously computationally accessible. One fundamental aspect of GeneSigDB that differs from existing resources is the importance given to traceability of each gene signature. Each gene signature has a signature identifier PMID-X, where PMID is the PubMed identifier of the article and X is the table, figure or supplementary file number from which the gene signature was extracted, so that it can be easily traced to the original publication. In addition we provide a transcribed copy of the original table from the article. A fully traceable gene history for each gene from the original transcribed data table through to the standardized list of genes is also provided, including version number of all databases used in generating gene annotation. Therefore the source gene of identifiers in each standardized gene list should be unambiguous. Since original gene identifiers are stored and formatted in an annotation pipeline, GeneSigDB standardized gene lists and annotation will be updated with each release of GeneSigDB.

SUMMARY AND FUTURE DIRECTIONS

GeneSigDB fills an important need within the community—the need to standardize gene expression signatures to facilitate comparison and to allow them to be easily queried and used in other analyses. GeneSigDB release v1.0 focused on cancer and stem cell gene signatures because these together represent some of the largest sources of gene expression-based signatures. Because of the way in which these signatures were identified, we anticipate that they may capture many of the underlying processes associated with the development and progression of cancers and that their comparison may yield additional insight into the disease. We also recognize that many of these processes likely are important in other disease and non-disease phenotypes. Consequently, we plan to expand GeneSigDB to include a broader range of gene signatures, both from other disease-based studies and signatures arising from different technologies such as copy number variation arrays. The GeneSigDB web site interface will continue to improve and we intend to implement several new features that will vastly improve the gene signature comparison visualization. We are also working to expand gene signature annotation in GenSigDB to provide web links to GEO or ArrayExpress datasets (where applicable), and biological sample information on the source of each gene signature, and in future also hope to implement a controlled vocabulary to enable better searching and analysis of gene signatures. One lesson that we have learned from GeneSigDB is that there is a pressing need for standardization of gene expression signatures. Our creation of this database grew out of a desire to do a simple computational analysis of published gene expression signatures to look for similarities between tumors arising in different organ sites. While databases such as ArrayExpress and GEO have become valuable repositories for the raw data from expression studies, the gene expression signatures that are the results of expert analysis of those data are currently not stored or reported in a systematic fashion. While GeneSigDB represents an attempt to remedy the situation, the need for extensive manual assembly and curation argues for the development of standardized reporting formats for gene signatures to facilitate their broader use and reuse.

LICENSE

The software used in constructing GeneSigDB is open source software and provided under the Artistic License. All content within GeneSigDB is provided without restriction.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: US National Institutes of Health (grant numbers R01-CA098522 and 1P50HG004233); the Dana-Farber Cancer Institute Women’s Cancer Program; and funds provided through the Dana-Farber Strategic Plan Initiative. Conflict of interest statement. None declared.
  15 in total

Review 1.  List of lists-annotated (LOLA): a database for annotation and comparison of published microarray gene lists.

Authors:  Patrick Cahan; Amera M Ahmad; Harry Burke; Sidney Fu; Yinglei Lai; Liliana Florea; Nachiket Dharker; Todd Kobrinski; Prachee Kale; Timothy A McCaffrey
Journal:  Gene       Date:  2005-09-02       Impact factor: 3.688

2.  Analyzing gene expression data in terms of gene sets: methodological issues.

Authors:  Jelle J Goeman; Peter Bühlmann
Journal:  Bioinformatics       Date:  2007-02-15       Impact factor: 6.937

3.  Concordance among gene-expression-based predictors for breast cancer.

Authors:  Cheng Fan; Daniel S Oh; Lodewyk Wessels; Britta Weigelt; Dimitry S A Nuyten; Andrew B Nobel; Laura J van't Veer; Charles M Perou
Journal:  N Engl J Med       Date:  2006-08-10       Impact factor: 91.245

4.  Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors:  Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-30       Impact factor: 11.205

5.  Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival.

Authors:  Howard Y Chang; Dimitry S A Nuyten; Julie B Sneddon; Trevor Hastie; Robert Tibshirani; Therese Sørlie; Hongyue Dai; Yudong D He; Laura J van't Veer; Harry Bartelink; Matt van de Rijn; Patrick O Brown; Marc J van de Vijver
Journal:  Proc Natl Acad Sci U S A       Date:  2005-02-08       Impact factor: 11.205

6.  Gene set enrichment analysis using linear models and diagnostics.

Authors:  Assaf P Oron; Zhen Jiang; Robert Gentleman
Journal:  Bioinformatics       Date:  2008-09-11       Impact factor: 6.937

7.  Lobular and ductal carcinomas of the breast have distinct genomic and expression profiles.

Authors:  F Bertucci; B Orsetti; V Nègre; P Finetti; C Rougé; J-C Ahomadegbe; F Bibeau; M-C Mathieu; I Treilleux; J Jacquemier; L Ursule; A Martinec; Q Wang; J Bénard; A Puisieux; D Birnbaum; C Theillet
Journal:  Oncogene       Date:  2008-05-19       Impact factor: 9.867

Review 8.  Development of the 21-gene assay and its application in clinical practice and clinical trials.

Authors:  Joseph A Sparano; Soonmyung Paik
Journal:  J Clin Oncol       Date:  2008-02-10       Impact factor: 44.544

9.  Novel definition files for human GeneChips based on GeneAnnot.

Authors:  Francesco Ferrari; Stefania Bortoluzzi; Alessandro Coppe; Alexandra Sirota; Marilyn Safran; Michael Shmoish; Sergio Ferrari; Doron Lancet; Gian Antonio Danieli; Silvio Bicciato
Journal:  BMC Bioinformatics       Date:  2007-11-15       Impact factor: 3.169

10.  Transcript-based redefinition of grouped oligonucleotide probe sets using AceView: high-resolution annotation for microarrays.

Authors:  Jun Lu; Joseph C Lee; Marc L Salit; Margaret C Cam
Journal:  BMC Bioinformatics       Date:  2007-03-29       Impact factor: 3.169

View more
  61 in total

1.  Conceptual aspects of large meta-analyses with publicly available microarray data: a case study in oncology.

Authors:  Markus Schmidberger; Sabine Lennert; Ulrich Mansmann
Journal:  Bioinform Biol Insights       Date:  2011-01-23

2.  Network2Canvas: network visualization on a canvas with enrichment analysis.

Authors:  Christopher M Tan; Edward Y Chen; Ruth Dannenfelser; Neil R Clark; Avi Ma'ayan
Journal:  Bioinformatics       Date:  2013-06-07       Impact factor: 6.937

3.  Molecular signatures database (MSigDB) 3.0.

Authors:  Arthur Liberzon; Aravind Subramanian; Reid Pinchback; Helga Thorvaldsdóttir; Pablo Tamayo; Jill P Mesirov
Journal:  Bioinformatics       Date:  2011-05-05       Impact factor: 6.937

4.  DSigDB: drug signatures database for gene set analysis.

Authors:  Minjae Yoo; Jimin Shin; Jihye Kim; Karen A Ryall; Kyubum Lee; Sunwon Lee; Minji Jeon; Jaewoo Kang; Aik Choon Tan
Journal:  Bioinformatics       Date:  2015-05-19       Impact factor: 6.937

5.  A three-gene model to robustly identify breast cancer molecular subtypes.

Authors:  Benjamin Haibe-Kains; Christine Desmedt; Sherene Loi; Aedin C Culhane; Gianluca Bontempi; John Quackenbush; Christos Sotiriou
Journal:  J Natl Cancer Inst       Date:  2012-01-18       Impact factor: 13.506

6.  Identification of ovarian cancer driver genes by using module network integration of multi-omics data.

Authors:  Olivier Gevaert; Victor Villalobos; Branimir I Sikic; Sylvia K Plevritis
Journal:  Interface Focus       Date:  2013-08-06       Impact factor: 3.906

7.  Identifying master regulators of cancer and their downstream targets by integrating genomic and epigenomic features.

Authors:  Olivier Gevaert; Sylvia Plevritis
Journal:  Pac Symp Biocomput       Date:  2013

8.  Discovering causal signaling pathways through gene-expression patterns.

Authors:  Jignesh R Parikh; Bertram Klinger; Yu Xia; Jarrod A Marto; Nils Blüthgen
Journal:  Nucleic Acids Res       Date:  2010-05-21       Impact factor: 16.971

9.  Data reporting standards: making the things we use better.

Authors:  John Quackenbush
Journal:  Genome Med       Date:  2009-11-25       Impact factor: 11.117

10.  Breast cancer biomarker discovery in the functional genomic age: a systematic review of 42 gene expression signatures.

Authors:  M C Abba; E Lacunza; M Butti; C M Aldaz
Journal:  Biomark Insights       Date:  2010-10-27
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.