Imad Abugessaisa, Hisashi Shimoji, Serkan Sahin, Atsushi Kondo, Jayson Harshbarger, Marina Lizio, Yoshihide Hayashizaki, Piero Carninci, Alistair Forrest, Takeya Kasukawa, Hideya Kawaji.
Abstract
The Functional Annotation of the Mammalian Genome project (FANTOM5) mapped transcription start sites (TSSs) and measured their activities in a diverse range of biological samples. The project generated a large data set, including detailed information about the profiled samples, the uncovered TSSs at high base-pair resolution on the genome, their transcriptional initiation activities, and further information on transcriptional regulation. As the research progressed, a series of additional analyses based on the raw experimental data enriched the data sets available for exploring the transcriptome in the individual cellular states encoded in mammalian genomes. To make this heterogeneous data set accessible and useful for investigators, we developed a web-based database called Semantic catalog of Samples, Transcription initiation And Regulators (SSTAR). SSTAR uses the open source wiki software MediaWiki along with the Semantic MediaWiki (SMW) extension, which provides the flexibility to model, store, and display the series of data sets produced during the course of the FANTOM5 project. Our use of SMW demonstrates the utility of the framework for disseminating large-scale analysis results. SSTAR is a case study in handling biological data generated by a large-scale research project, in terms of maintenance and growth alongside research activities.

Database URL: http://fantom.gsc.riken.jp/5/sstar/
Year: 2016 PMID: 27402679 PMCID: PMC4940433 DOI: 10.1093/database/baw105
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. SSTAR data model. The SSTAR data model consists of six classes, which represent the main ‘categories’ in SMW. Each oval represents a class and the kind of data it stores. A relationship between two categories is drawn as an arrow. The direction of the arrow indicates which of the two classes stores the relationship (indicated by the end of the arrow) as a class attribute. The head and color of the arrow indicate the type of relationship.
Figure 2. Implementation scheme of the data model with SMW templates. (A) Template dependencies. Each page in SSTAR uses one or more SMW templates; a template points to style sheets and calls semantic properties and SSTAR plug-ins to deliver a particular function. (B) Data flow from depositing the data file on the MediaWiki server to rendering the page in the client. (C) Code snippet showing a template call (EntrezGene) with the semantic properties that generate the page EntrezGene:4602 (http://fantom.gsc.riken.jp/5/sstar/EntrezGene:4602). (D) Statements in the ‘EntrezGene’ template that store template parameters as semantic properties. The statements add the semantic properties ‘GeneID’, ‘LocusTag’ and ‘type_of_gene’. (E) An example of an inline semantic query that retrieves the association between two categories (gene and CAGE peaks) and shows the result as an unnumbered list. (F) An example of the inline semantic query modified to call the SSTAR ucsc_gb_link function, which provides a genomic view of the FFCP in the UCSC genome browser.
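The inline semantic query described in panel (E) can be illustrated with a small sketch. The Python helper below only assembles the wikitext of an SMW `#ask` call with the `ul` (unnumbered list) result format; it is not the query used in SSTAR. FFCP is one of the SSTAR categories, while the property name `EntrezGene` linking a CAGE peak page to its gene is an assumption made for illustration.

```python
def build_ask_query(conditions, printouts=(), result_format="ul"):
    """Assemble the wikitext of a Semantic MediaWiki #ask inline query.

    conditions    -- selection conditions such as "Category:FFCP"
    printouts     -- property names to display next to each result page
    result_format -- SMW result format; "ul" renders an unnumbered list
    """
    # Each condition is wrapped in [[...]]; SMW combines them with AND.
    parts = ["".join("[[" + c + "]]" for c in conditions)]
    # Printout requests are written as ?PropertyName.
    parts += ["?" + p for p in printouts]
    parts.append("format=" + result_format)
    return "{{#ask: " + " | ".join(parts) + " }}"


# Hypothetical query: CAGE peak pages (category FFCP) associated with gene 4602.
# The property name "EntrezGene" is an assumption, not taken from SSTAR.
query = build_ask_query(["Category:FFCP", "EntrezGene::4602"])
print(query)
```

Keeping the conditions and printouts as separate arguments mirrors the structure of an `#ask` call, so the same helper could be reused for other category pairs.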
Figure 3. Graphical representation of a gene (MYB). (A) Result of a search for the MYB gene in SSTAR, with its associated motifs and list of TSS regions. The user can open a UCSC genome browser view of the MYB gene. (B) Table showing the expression of five TSS regions associated with MYB. (C) Graphical representation of (B), in which the x-axis represents individual samples and the y-axis represents expression intensities.
Summary of the number of objects in SSTAR
| Data class > Category | Attributes | Human | Mouse |
|---|---|---|---|
| | human readable description | 201 802 | 158 966 |
| | association with genes* | 174 802 | 136 492 |
| | co-expression module* | 240 776 | 180 000 |
| | CAGE expression* | 201 802 | 158 966 |
| | ontology-based sample term enrichment analysis* | 6 097 409 | 2 843 664 |
| | human readable description | 408 504 | 317 946 |
| | DNA-binding motifs (only for transcription factors) | 91 | 88 |
| | association with TSS regions (CAGE peaks)+ | 174 802 | 136 492 |
| | TSS region (CAGE peak) cluster+ | 310 | 278 |
| | pathway enrichment | 1672 | 1385 |
| | gene ontology enrichment | 48 473 | 38 751 |
| | ENCODE TF ChIP-seq peak enrichment analysis (with Coexpression) | 29 778 | N/A |
| | sample ontology enrichment* | 224 852 | 87 181 |
| | relative expression of the co-expression cluster* | 889 | 389 |
| | human readable description | 1816 | 1018 |
| | classification according to the sample ontology (‘Ancestor terms’)* | 34 689 | 12 357 |
| | transcription factors with enriched expression | 983 000 | 367 000 |
| | co-expression clusters with enriched expression | 300 798 | 81 474 |
| | repeat families with enriched expression | 130 789 | 34 865 |
| | overrepresented JASPAR motifs | 112 | 112 |
| | overrepresented novel unique motifs | 169 | 168 |
| | Homer de novo motifs | 39 320 | 14 680 |
| | description | 3782 | |
| | parent terms | 9640 | |
| | children terms | 9620 | |
| | human readable description | 687 | |
| | association to promoter expression* | 1278 | |
SSTAR data objects, their corresponding categories in MediaWiki, and the attributes of each object. Relationships to other objects are marked with a symbol (*: forward; +: reverse).
Mapping of the modeling entities in SSTAR
| Modeling entities | MediaWiki | SMW |
|---|---|---|
| Class | Category | |
| Object | Page | |
| Attributes | Template parameters | Semantic properties |
Data objects and their relationships; the table shows the mapping between the data model (column 1), MediaWiki (column 2) and SMW (column 3).
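The mapping above can be sketched in a few lines of Python that generate wikitext: an object becomes a page that calls a template, and the template stores its parameters as semantic properties with SMW's `#set` parser function. This is a minimal illustration under stated assumptions, not the SSTAR templates themselves. The template name EntrezGene and the property names GeneID, LocusTag and type_of_gene appear in the Figure 2 caption; the helper names and the parameter values are hypothetical.

```python
def template_call(template, params):
    """Wikitext of a page (object) that calls a template with its attributes."""
    lines = ["{{" + template]
    lines += ["|" + name + "=" + str(value) for name, value in params.items()]
    lines.append("}}")
    return "\n".join(lines)


def set_statement(param_names, category):
    """Wikitext placed inside the template itself.

    #set copies each template parameter into a semantic property of the
    same name, and the category link assigns the page to its class.
    """
    lines = ["{{#set:"]
    # {{{name|}}} is MediaWiki syntax for "parameter value, or empty if unset".
    lines += ["|" + name + "={{{" + name + "|}}}" for name in param_names]
    lines.append("}}")
    lines.append("[[Category:" + category + "]]")
    return "\n".join(lines)


# Illustrative attribute values; they are placeholders, not SSTAR data.
attributes = {"GeneID": "4602", "LocusTag": "-", "type_of_gene": "protein-coding"}

print(template_call("EntrezGene", attributes))      # content of one page
print(set_statement(attributes.keys(), "EntrezGene"))  # body of the template
```

Separating the page content from the template body reflects the division in the table: pages carry only parameter values, while the template decides which of them become queryable semantic properties.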
Figure 4. Measurements of page timing. The x-axis shows the FANTOM5 categories and the y-axis shows the different timings in seconds, with Memcached ON and OFF, for the six categories (n = 87, n = 1003, n = 3011, n = 52, n = 16 and n = 10). The box denotes the median. The bars on each column show the 25th and 75th percentiles. Latency changed drastically between cache-on and cache-off. Loading and parsing time and rendering time did not change for any category.
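The summary statistics shown in the figure (median with 25th and 75th percentiles) can be computed as in the sketch below. The timing values are invented placeholders used only to make the example runnable; they are not measurements from SSTAR, and the function name `summarize` is hypothetical.

```python
import statistics


def summarize(timings):
    """Return the 25th percentile, median and 75th percentile of a sample."""
    # quantiles(n=4) returns the three quartile cut points; the "inclusive"
    # method treats the sample minimum and maximum as the 0th and 100th
    # percentiles.
    p25, median, p75 = statistics.quantiles(timings, n=4, method="inclusive")
    return {"p25": p25, "median": median, "p75": p75}


# Placeholder latencies in seconds for one category (not real measurements).
latency = {
    "cache_on": [0.21, 0.25, 0.22, 0.30, 0.24],
    "cache_off": [1.9, 2.4, 2.1, 2.8, 2.2],
}

for condition, values in latency.items():
    print(condition, summarize(values))
```

Running the same summary separately for latency, loading and parsing time, and rendering time would give the per-category values needed to compare the cache-on and cache-off conditions.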
Number of semantic property values
| Category | Semantic property values |
|---|---|
| Cell ontology | 646 |
| Coexpression clusters | 4882 |
| EntrezGene | 100286 |
| FFCP | 721537 |
| FF ontology | 5149 |
| FF samples | 3605 |
| FF terms | 1544 |
| Human disease ontology | 260 |
| JASPAR motif | 113 |
| MCL coexpression mm9 | 3771 |
| MacroAPE 1083 | 1080 |
| Motif | 691 |
| MotifCluster | 204 |
| NonRedundantMotifCluster | 210 |
| Novel motif | 170 |
| SwissregulonMotif | 198 |
| Time courses | 36 |
| Uber anatomy ontology | 1354 |
The table shows categories and their corresponding semantic property values in SSTAR.
Comparison between SSTAR and other systems using SMW / MediaWiki
| System | URL | Pages | Number of semantic properties | Semantic property values |
|---|---|---|---|---|
| FANTOM5 SSTAR | | 415 676 | 196 | 54 266 939 |
| ArthropodBase Wiki | | 9902 | 171 | 67 407 |
| Bioinformatics.Org | | 2378 | 85 | 5898 |
| GeneWiki+ | | 91 379 | 245 | 1 978 820 |
| GMOD | | 3873 | 54 | 9580 |
| MetaBase | | 4618 | 31 | 13 286 |
| NeuroLex | | 76 374 | 198 | 546 389 |
| OpenToxipedia | | 1280 | 8 | 4329 |
| Pest information Wiki | | 134 915 | 41 | 1 047 499 |
| SNPedia | | 111 052 | 103 | 4 313 629 |
| SEQanswers wiki | | 3338 | 110 | 36 623 |
Number of pages and semantic properties for each system; the statistics were collected on 17 February 2015.