Aedín C Culhane, Thomas Schwarzl, Razvan Sultana, Kermshlise C Picard, Shaita C Picard, Tim H Lu, Katherine R Franklin, Simon J French, Gerald Papenhausen, Mick Correll, John Quackenbush.
Abstract
The primary objective of most gene expression studies is the identification of one or more gene signatures: lists of genes whose transcriptional levels are uniquely associated with a specific biological phenotype. Whilst thousands of experimentally derived gene signatures are published, their potential value to the community is limited by their computational inaccessibility. Gene signatures are embedded in published article figures, tables or supplementary materials, and are frequently presented using non-standard gene or probeset nomenclature. We present GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb), a manually curated database of gene expression signatures. GeneSigDB release 1.0 focuses on cancer and stem cell gene signatures and was constructed from more than 850 publications, from which we manually transcribed 575 gene signatures. Most gene signatures (n = 560) were successfully mapped to the genome to extract standardized lists of EnsEMBL gene identifiers. GeneSigDB provides the original gene signature, the standardized gene list and a fully traceable gene mapping history for each gene from the original transcribed data table through to the standardized list of genes. The GeneSigDB web portal is easy to search, allows users to compare their own gene list to those in the database, and lets them download gene signatures in most common gene identifier formats.
Year: 2009 PMID: 19934259 PMCID: PMC2808880 DOI: 10.1093/nar/gkp1015
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Gene Signature Metadata (GeneSigDB v1.0)
| Name | Description |
|---|---|
| PMID | PubMed identifier |
| Tissue | Name of search term set used to search PubMed. ( |
| Organism | Species common name (human, mouse, etc) |
| Platform | Name of microarray or other experimental technique used to derive gene signature (selection from constrained list, |
| Platform description | Description of platform |
| Genes article | Number of genes in gene signature (as described in the text of the article) |
| SigID | Signature identifier, in the format PMID-XXX, where XXX is the gene signature table, figure or |
| Sig name | Name of gene signature, in the format Tissue_AuthorYear_NumberofGenes_Description. Description is optional, e.g. Breast_Bertucci08_75genes |
| Sig description | Description of gene signature, typically extracted from table or figure legend (free text) |
| File associated | Name of the tab-delimited gene signature file. Format is SigID.txt |
| URL | URL from where gene signature was downloaded |
| Column mappings | Content of each column in gene signature file (selection from constrained list in |
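The SigID and Sig name conventions in the table above can be checked mechanically. The following is a minimal sketch, not GeneSigDB code: the function and class names are hypothetical, and the regular expression assumes the gene count is always written as `<digits>genes`, as in the single example given (Breast_Bertucci08_75genes).

```python
import re
from typing import NamedTuple, Optional, Tuple


class SigName(NamedTuple):
    tissue: str
    author_year: str
    n_genes: int
    description: Optional[str]


# Assumption: the gene count is written as "<digits>genes", as in the
# example Breast_Bertucci08_75genes; names that deviate are rejected.
_SIG_NAME = re.compile(r"^([^_]+)_([^_]+)_(\d+)genes(?:_(.+))?$")


def parse_sig_name(name: str) -> SigName:
    """Split Tissue_AuthorYear_NumberofGenes[_Description] into fields."""
    match = _SIG_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised signature name: {name!r}")
    tissue, author_year, n_genes, description = match.groups()
    return SigName(tissue, author_year, int(n_genes), description)


def split_sig_id(sig_id: str) -> Tuple[str, str]:
    """Split a PMID-XXX identifier into (PMID, source label)."""
    pmid, sep, source = sig_id.partition("-")
    if not sep or not pmid.isdigit() or not source:
        raise ValueError(f"unrecognised SigID: {sig_id!r}")
    return pmid, source
```

For the example name in the table, `parse_sig_name("Breast_Bertucci08_75genes")` yields tissue `Breast`, author/year `Bertucci08`, 75 genes and no description.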
Column mappings (GeneSigDB v1.0)
| Mapping element | Description | Mapping file |
|---|---|---|
| Probe ID | Platform specific identifier | Yes |
| Clone ID | IMAGE clone identifier | Yes |
| GenBank ID | GenBank accession number | Yes |
| UniGene ID | UniGene identifier | Yes |
| EntrezGene ID | EntrezGene or LocusLink identifier | Yes |
| Gene symbol | HGNC official gene symbol | Yes |
| CCDS ID | Consensus Coding Sequence Database ID | Yes |
| EnsEMBL ID | EnsEMBL gene ID | Yes |
| RefSeq ID | RefSeq gene identifier | Yes |
| Protein ID | Protein sequence ID, SwissProt, UniProt | No |
| Chromosome map | Chromosomal localization data | No |
| Geneset specific factor | Factor or classifier specific to data, character | No |
| Geneset specific statistics | Fold change, Ranking of genes, | No |
| Gene description | Description or title of gene | No |
| Other gene description | KEGG, GO terms, Keywords, etc | No |
(a) Yes indicates that these columns were extracted to SigID-mapping.txt for searching biomart.
(b) Not all platform Probe IDs are sequence mapped in biomart. For some common unmapped microarrays, we sequence matched the probe sequences to the genome. Others were ignored.
(c) GenBank EST and IMAGE clone ID sequences are not in EnsEMBL, and these were mapped via UniGene (see Methods).
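The "Mapping file" column above amounts to a column filter: only columns annotated with a mappable element are carried over to SigID-mapping.txt. A minimal sketch of that filter follows, assuming each signature file comes with an ordered list of column-mapping labels; the function names are hypothetical and this is not the authors' pipeline.

```python
from typing import List, Sequence

# Mapping elements marked "Yes" in the column mappings table.
MAPPABLE = frozenset({
    "Probe ID", "Clone ID", "GenBank ID", "UniGene ID", "EntrezGene ID",
    "Gene symbol", "CCDS ID", "EnsEMBL ID", "RefSeq ID",
})


def mapping_columns(column_mappings: Sequence[str]) -> List[int]:
    """Indices of the columns that would be copied to SigID-mapping.txt."""
    return [i for i, element in enumerate(column_mappings) if element in MAPPABLE]


def extract_mapping_rows(rows: Sequence[Sequence[str]],
                         column_mappings: Sequence[str]) -> List[List[str]]:
    """Keep only the mappable columns of each row, preserving column order."""
    keep = mapping_columns(column_mappings)
    return [[row[i] for i in keep] for row in rows]
```

Descriptive columns such as gene descriptions or fold changes are dropped by this filter but remain available in the original SigID.txt.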
Number of articles processed and gene signatures extracted by species
| | Human | Mouse | Rat | Total |
|---|---|---|---|---|
| Publications | 263 | 39 | 8 | 308 |
| Gene signatures | 465 | 84 | 11 | 560 |
| Genes | 14 197 | 9755 | 773 | – |
| Number of platforms | 32 | 9 | 4 | 38 |
| Average genes/signature | 132 | 213 | 88 | – |
Figure 1. Curation of gene signatures in GeneSigDB. (A) GeneSigDB hierarchical file structure: SigID.txt, SigID-mapping.txt, SigID-maptrace.txt, SigID-standardized.txt and SigID-index.xml. These files are, respectively, the original gene signature, the mappable gene identifiers from SigID.txt, the mapping trace showing how each gene was mapped, a list of EnsEMBL gene identifiers that correspond to genes in SigID.txt, and an XML annotation of SigID.txt. (B) An example XML gene signature annotation file.
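The five per-signature files named in the Figure 1 caption follow a fixed suffix convention, which can be sketched as a small lookup. The role names and the function are hypothetical, and the sketch returns bare file names only, since the directory layout is not described here.

```python
from typing import Dict

# Suffixes taken from the Figure 1 caption; the role names are illustrative.
_SUFFIXES = {
    "signature": ".txt",                  # original transcribed gene signature
    "mapping": "-mapping.txt",            # mappable identifiers from SigID.txt
    "maptrace": "-maptrace.txt",          # how each gene was mapped
    "standardized": "-standardized.txt",  # resulting EnsEMBL gene identifiers
    "index": "-index.xml",                # XML annotation of SigID.txt
}


def signature_files(sig_id: str) -> Dict[str, str]:
    """Return the expected file name for each role of one signature."""
    return {role: f"{sig_id}{suffix}" for role, suffix in _SUFFIXES.items()}
```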
Number of articles processed and gene signatures extracted by search terms (tissue)
| Search terms | Articles: PubMed hits | Articles: downloaded and processed | Articles: curated | Articles: mapped | Gene signatures: curated | Gene signatures: mapped to EnsEMBL | Average of genes (a) |
|---|---|---|---|---|---|---|---|
| Bladder | 64 | 56 | 10 | 10 | 18 | 18 | 115 |
| Breast | 471 | 241 | 134 | 131 | 243 | 238 | 190 |
| Colon | 95 | 54 | 20 | 20 | 35 | 35 | 84 |
| Endometrial | 15 | 15 | 5 | 4 | 9 | 8 | 16 |
| Kidney | 12 | 12 | 4 | 3 | 7 | 6 | 55 |
| Liver | 129 | 27 | 8 | 8 | 12 | 12 | 144 |
| Lung | 167 | 101 | 29 | 28 | 42 | 41 | 34 |
| Ovary | 108 | 72 | 28 | 28 | 41 | 41 | 75 |
| Prostate | 136 | 102 | 30 | 28 | 52 | 48 | 47 |
| Skin | 8 | 8 | 3 | 3 | 9 | 9 | 28 |
| Stem cell | 190 | 141 | 45 | 42 | 104 | 101 | 205 |
| Thyroid | 26 | 14 | 2 | 2 | 2 | 2 | 16 |
| Uterus | 54 | 8 | 1 | 1 | 1 | 1 | 45 |
| Total | 1475 | 851 | 319 | 308 | 575 | 560 | |
(a) Average number of genes per signature that were mapped to EnsEMBL genes.
Frequency of different gene identifiers in mapped gene signatures
| Identifier | Frequency all gene signatures ( | Frequency human gene signatures ( |
|---|---|---|
| Gene description | 476 | 381 |
| Gene symbol | 432 | 359 |
| Probe ID | 212 | 173 |
| GenBank ID | 128 | 109 |
| UniGene ID | 96 | 75 |
| RefSeq ID | 75 | 53 |
| EntrezGene ID | 49 | 39 |
| Clone ID | 30 | 26 |
| EnsEMBL ID | 11 | 9 |
Success of matching different gene signature identifiers to an EnsEMBL gene
| ID type | Species | Success (unique IDs) | Failures (unique IDs) | Percentage success |
|---|---|---|---|---|
| affy_hg_u133_plus_2 | Human | 11085 | 942 | 92 |
| affy_u133_x3p | Human | 180 | 10 | 94 |
| affy_hg_u133a | Human | 6384 | 288 | 95 |
| affy_hg_u95av2 | Human | 613 | 23 | 96 |
| affy_hg_u95a | Human | 25 | 2 | 92 |
| affy_hugenefl | Human | 43 | 8 | 84 |
| affy_mouse430_2 | Mouse | 3816 | 195 | 95 |
| affy_moe430a | Mouse | 3409 | 159 | 95 |
| affy_mg_u74av2 | Mouse | 2156 | 317 | 87 |
| affy_mg_u74a | Mouse | 116 | 1 | 99 |
| agilent_wholegenome | Human, mouse | 3896 | 614 | 86 |
| Entrezgene | Human | 9859 | 486 | 95 |
| refseq_dna | Human | 5586 | 577 | 90 |
| ensembl_gene_id | Human | 3120 | 486 | 86 |
| hgnc_symbol | Human | 5254 | 2908 | 64 |
| mgi_symbol | Mouse | 1247 | 788 | 61 |
| rgd_symbol | Rat | 301 | 135 | 69 |
| unigene | Human, mouse | 2131 | 1865 | 53 |
| embl (genbank) | Human | 308 | 1541 | 16 |
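The percentage column in the table above is consistent with successes divided by total unique IDs, rounded down to a whole percent. That rounding rule is inferred from the tabulated values, not stated in the source. A minimal sketch:

```python
def percent_success(success: int, failures: int) -> int:
    """Whole-percent mapping success, rounded down (assumed rounding rule)."""
    total = success + failures
    if total == 0:
        raise ValueError("no identifiers to evaluate")
    # Integer arithmetic avoids floating-point edge cases when flooring.
    return success * 100 // total
```

For example, `percent_success(180, 10)` gives 94 for affy_u133_x3p, and `percent_success(308, 1541)` gives 16 for embl (genbank), matching the table.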
Figure 2. Overlap in gene signatures across tumor types. (A) Heatmap-style representation of gene overlap. Each row is a gene and each column is a tissue type. Presence of a gene is indicated by a black line and absence by white. Some genes may be linked to phenotypic subclasses in many tumors, but there are generally many more tumor-specific genes, likely reflecting the effect of the tissue of origin. (B) Dendrogram from a hierarchical cluster analysis that used Sorensen's coefficient, an asymmetric measure of binary distance, with clusters joined by Ward's minimum variance method.
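The distance used for the dendrogram in Figure 2B can be illustrated with the usual definition of Sorensen's coefficient on presence/absence data, where the distance between gene sets A and B is 1 − 2|A∩B| / (|A| + |B|). Genes absent from both sets do not contribute, which is what makes the measure asymmetric. The sketch below covers only the pairwise distances; the Ward linkage step is not shown, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple


def sorensen_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """1 - 2|A & B| / (|A| + |B|); shared absences are ignored."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))


def pairwise_distances(
    gene_sets: Dict[str, Iterable[str]]
) -> Dict[Tuple[str, str], float]:
    """Distance for every unordered pair of tissue types, keyed by name."""
    return {
        (x, y): sorensen_distance(gene_sets[x], gene_sets[y])
        for x, y in combinations(sorted(gene_sets), 2)
    }
```

Two tissue types sharing one gene out of two each would be at distance 0.5; tissue types with no genes in common are at distance 1.0.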
Most common genes across gene signatures from all tissue types (tissue columns give counts of gene signatures in each tissue type)
| EnsEMBL gene ID | HGNC symbol | Number of tissue types | Bld | Br | Co | End | Kd | Li | Lu | Ov | Pr | Sk | SC | Thy | Ut |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000113140 | SPARC | 9 | 2 | 24 | 2 | 0 | 0 | 1 | 2 | 3 | 1 | 3 | 1 | 0 | 0 |
| ENSG00000115414 | FN1 | 8 | 1 | 22 | 1 | 0 | 0 | 2 | 1 | 7 | 5 | 0 | 2 | 0 | 0 |
| ENSG00000131747 | TOP2A | 8 | 0 | 30 | 1 | 0 | 1 | 1 | 1 | 3 | 2 | 0 | 4 | 0 | 0 |
| ENSG00000134755 | DSC2 | 8 | 0 | 8 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| ENSG00000151914 | DST | 8 | 0 | 15 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 0 | 0 |
| ENSG00000157456 | CCNB2 | 8 | 0 | 27 | 1 | 0 | 1 | 0 | 2 | 2 | 1 | 0 | 2 | 1 | 0 |
| ENSG00000134057 | CCNB1 | 8 | 0 | 22 | 1 | 1 | 0 | 0 | 3 | 1 | 3 | 1 | 2 | 0 | 0 |
| ENSG00000087586 | AURKA | 8 | 0 | 33 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 4 | 0 | 0 |
| ENSG00000120992 | LYPLA1 | 8 | 1 | 6 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 0 |
| ENSG00000132646 | PCNA | 7 | 1 | 19 | 2 | 0 | 0 | 0 | 2 | 3 | 2 | 0 | 1 | 0 | 0 |
| ENSG00000169429 | IL8 | 7 | 2 | 14 | 3 | 0 | 0 | 0 | 1 | 3 | 2 | 0 | 2 | 0 | 0 |
| ENSG00000121966 | CXCR4 | 7 | 1 | 16 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 3 | 0 | 0 |
| ENSG00000139318 | DUSP6 | 7 | 0 | 11 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 2 | 0 | 0 |
| ENSG00000146674 | IGFBP3 | 7 | 0 | 8 | 2 | 0 | 0 | 1 | 1 | 4 | 3 | 0 | 1 | 0 | 0 |
| ENSG00000170312 | CDC2 | 7 | 0 | 28 | 1 | 1 | 0 | 0 | 2 | 1 | 2 | 0 | 4 | 0 | 0 |
| ENSG00000185275 | CD24L4 | 7 | 0 | 15 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 2 | 0 | 0 |
| ENSG00000164171 | ITGA2 | 7 | 0 | 5 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 0 | 2 | 0 | 0 |
| ENSG00000176890 | TYMS | 7 | 1 | 17 | 2 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 2 | 0 | 0 |
| ENSG00000044115 | CTNNA1 | 7 | 1 | 8 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| ENSG00000164442 | CITED2 | 7 | 3 | 10 | 2 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 |
| ENSG00000003436 | TFPI | 7 | 0 | 10 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 0 |
| ENSG00000108821 | COL1A1 | 7 | 1 | 15 | 0 | 0 | 0 | 1 | 4 | 1 | 1 | 0 | 4 | 0 | 0 |
| ENSG00000111348 | ARHGDIB | 7 | 1 | 10 | 0 | 0 | 0 | 3 | 2 | 1 | 1 | 0 | 1 | 0 | 0 |
| ENSG00000196139 | AKR1C3 | 7 | 1 | 5 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 3 | 0 | 0 |
| ENSG00000204262 | COL5A2 | 7 | 1 | 13 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 2 | 1 | 0 | 0 |
| ENSG00000142871 | CYR61 | 7 | 1 | 11 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | 1 | 2 | 0 | 0 |
| ENSG00000170345 | FOS | 7 | 1 | 24 | 0 | 0 | 0 | 1 | 0 | 1 | 5 | 0 | 2 | 0 | 1 |
| ENSG00000167642 | SPINT2 | 7 | 0 | 7 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 1 | 1 | 0 | 0 |
| ENSG00000175063 | UBE2C | 7 | 0 | 25 | 0 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 1 | 1 | 0 |
Tissue types are Bladder (Bld), Breast (Br), Colon (Co), Endometrial (End), Kidney (Kd), Liver (Li), Lung (Lu), Ovary (Ov), Prostate (Pr), Skin (Sk), Stem Cell (SC), Thyroid (Thy), Uterus (Ut).
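In the table above, the number of tissue types for a gene equals the number of tissue columns with a non-zero signature count. A minimal sketch of that tally, using the tissue abbreviations from the legend and the SPARC row as input (the function name is hypothetical):

```python
from typing import Dict, List, Sequence

TISSUES = ["Bld", "Br", "Co", "End", "Kd", "Li", "Lu",
           "Ov", "Pr", "Sk", "SC", "Thy", "Ut"]


def tissues_with_gene(counts: Sequence[int]) -> List[str]:
    """Tissue types in which the gene occurs in at least one signature."""
    if len(counts) != len(TISSUES):
        raise ValueError("expected one count per tissue type")
    return [tissue for tissue, n in zip(TISSUES, counts) if n > 0]


# Per-tissue signature counts for SPARC, copied from the table.
SPARC_COUNTS: Dict[str, Sequence[int]] = {
    "SPARC": [2, 24, 2, 0, 0, 1, 2, 3, 1, 3, 1, 0, 0],
}
```

`len(tissues_with_gene(SPARC_COUNTS["SPARC"]))` is 9, in agreement with the tabulated value.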
Figure 3. Visualization of gene overlap in gene signatures. In this example, we searched for Fanconi anemia-related genes by performing a gene search for FANC*, which returned 12 human genes and 2 mouse genes. This screen shows the overlap between gene signatures that contain at least 2 of these 12 genes. Red and grey boxes indicate presence and absence, respectively.
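The search illustrated in Figure 3 combines a wildcard gene query with a minimum-overlap filter. The sketch below reproduces that logic on in-memory data. It is not the GeneSigDB portal's implementation: the function names are hypothetical, and shell-style matching via `fnmatch` is an assumption about how `FANC*` is interpreted.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List, Set


def search_genes(pattern: str, symbols: Iterable[str]) -> List[str]:
    """Gene symbols matching a shell-style wildcard such as 'FANC*'."""
    return sorted(s for s in symbols if fnmatchcase(s, pattern))


def signatures_sharing(query_genes: Iterable[str],
                       signatures: Dict[str, Iterable[str]],
                       min_hits: int = 2) -> Dict[str, Set[str]]:
    """Signatures containing at least min_hits of the query genes."""
    query = set(query_genes)
    hits: Dict[str, Set[str]] = {}
    for sig_id, genes in signatures.items():
        shared = query & set(genes)
        if len(shared) >= min_hits:
            hits[sig_id] = shared
    return hits
```

The returned sets of shared genes per signature correspond to the red (present) boxes in the figure; query genes missing from a retained signature correspond to the grey boxes.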