Johan Leyritz, Stéphane Schicklin, Sylvain Blachon, Céline Keime, Céline Robardet, Jean-François Boulicaut, Jérémy Besson, Ruggero G Pensa, Olivier Gandrillon.
Abstract
BACKGROUND: There is an increasing need in transcriptome research for gene expression data and pattern warehouses. It is important to integrate into these warehouses both raw transcriptomic data and properties encoded in these data, such as local patterns. DESCRIPTION: We have developed an application called SQUAT (SAGE Querying and Analysis Tools), which is available at http://bsmc.insa-lyon.fr/squat/. This database gives access to both raw SAGE data and patterns mined from these data, for three species (human, mouse and chicken). It supports simple queries such as "In which biological situations is my favorite gene expressed?" as well as much more complex queries such as "What are the genes that are frequently co-over-expressed with my gene of interest in given biological situations?". Connections with external web databases enrich biological interpretations and enable sophisticated queries. To illustrate the power of SQUAT, we show and analyze the results of three different queries, one of which led to a biological hypothesis that was experimentally validated.
Year: 2008 PMID: 18801154 PMCID: PMC2567996 DOI: 10.1186/1471-2105-9-378
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
A summary of SQUAT possibilities.
| Input | Information sought | SQUAT menu |
| gene name | corresponding tags | Tag/Gene identification -> Gene information search |
| | NCBI description | Tag/Gene identification -> Gene information search |
| | aliases | Tag/Gene identification -> Gene information search |
| | Gene Ontology data | Tag/Gene identification -> Gene information search |
| | transcript | Tag/Gene identification -> Gene information search |
| | promoter | Tag/Gene identification -> Gene information search |
| | SAGE libraries | Queries on raw SAGE data -> SAGE libraries search |
| | formal concepts/QSG | Queries on formal concepts -> Simple concepts search |
| tag sequence | corresponding tags | Tag/Gene identification -> Tag-to-gene assignment |
| | NCBI description | Tag/Gene identification -> Tag-to-gene assignment |
| | transcript | Tag/Gene identification -> Tag-to-gene assignment |
| | promoter | Tag/Gene identification -> Tag-to-gene assignment |
| | Gene Ontology data | Tag/Gene identification -> Tag-to-gene assignment |
| | SAGE libraries | Tag/Gene identification -> Tag-to-gene assignment |
| | | Queries on raw SAGE data -> SAGE libraries search |
| | formal concepts/QSG | Queries on formal concepts -> Simple concepts search |
| | expression sub-matrix | Queries on raw SAGE data -> Expression sub-matrix |
| | expression in normal and cancer cells | Queries on raw SAGE data -> Expression sub-matrix |
| Gene Ontology term | corresponding tags | Queries on raw SAGE data -> Gene Ontology search |
| | corresponding genes | Queries on raw SAGE data -> Gene Ontology search |
| | description | Queries on raw SAGE data -> Gene Ontology search |
| | SAGE library | Queries on raw SAGE data -> Gene Ontology search |
| | formal concepts/QSG | Queries on formal concepts -> Advanced concepts search |
| SAGE library | corresponding tags | Queries on raw SAGE data -> Tags search |
| | SAGEmap description | Queries on raw SAGE data -> Tags search |
| | formal concepts/QSG | Queries on formal concepts -> Advanced concepts search |
| | expression sub-matrix | Queries on raw SAGE data -> Expression sub-matrix |
| nucleotide sequence | reverse sequence | Tag/Gene identification -> Nucleotids sequence handling |
| | complementary sequence | Tag/Gene identification -> Nucleotids sequence handling |
| global keywords | corresponding tags | Tag/Gene identification -> Gene product finder |
| | NCBI description | Tag/Gene identification -> Gene product finder |
| | Gene Ontology ID/term | Queries on formal concepts -> Advanced concepts search |
| accession number (RefSeq) or gene description | promoter | Promoter search |
Current content of the SQUAT website.
| | Human | Mouse | Chicken |
| Number of short SAGE libraries | 355 | 280 | 13 |
| Number of long SAGE libraries | 116 | 214 | 2 |
| Total number of different tags | 666 189 | 489 686 | 105 224 |
| Number of tags used for concepts generation | 29 016 | 29 343 | 12 345 |
| Number of concepts | 314 016 | 1 104 920 | 4 691 |
Figure 2. A schematic view of the pipeline that establishes a link between RefSeq transcripts and their promoter sequences for the three species. For human and mouse, data are available through DBTSS (DataBase of Transcriptional Start Sites; [39]), which provides the exact RefSeq transcript TSS (Transcriptional Start Site) position on a genome assembly and, when one exists, alternative TSS positions for this transcript. DBTSS provides at least one TSS position for 53% of the human transcripts and for 46% of the mouse transcripts. To provide TSS positions for the rest of the transcripts, we used BLAT [40]. In this way, 83% of the human transcripts and 75% of the mouse transcripts were assigned a TSS position. Since no data are available in DBTSS for the chicken, we first used data from Ensembl [41] to establish, when possible, the link between the RefSeq transcripts and the Ensembl transcripts. A few RefSeq transcripts correspond to several Ensembl transcripts, which gives our database alternative TSS positions for the chicken as well. Transcripts that could not be linked to Ensembl were aligned with BLAT on the same genome assembly version as the one used by the Ensembl release. In the end, 85% of chicken RefSeq transcripts were assigned a TSS position with this pipeline, which is close to the values obtained for the other two species.
Figure 3. Gene expression matrix (A), formal concepts (B) and QSG (C). A shows a toy example of a gene expression matrix displaying the level of expression of 4 genes (G1 to G4) in 4 biological situations (S1 to S4). To extract formal concepts, one first has to encode some gene expression property. We decided to encode over-expression by applying the mid-range method [11]. One first defines a threshold per gene as (max value − min value)/2 + min value, that is, the midpoint between the gene's minimum and maximum expression values. For the G1 gene, this threshold is 62.5. All expression values below or equal to the threshold are set to 0, and all values strictly above the threshold are set to 1. This yields the binary matrix (B). One then extracts all formal concepts from this matrix. A formal concept is a bi-set of genes and situations such that all genes are simultaneously over-expressed in all the situations, and such that neither a gene nor a situation can be added without introducing a null value (formal concepts are maximal bi-sets). From the toy example, three formal concepts can be extracted (shown below the B matrix). It is immediately apparent that the first two concepts are closely related. It is therefore tempting to aggregate them, which creates a Quasi-synexpression group (QSG; [17]) containing three genes and three situations. One possible representation of a QSG is shown in C, where the values indicate the number of formal concepts supporting each gene-to-situation association.
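The steps described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression values are made up (they are not those of Figure 3), the function names are hypothetical, and the brute-force enumeration is a stand-in that suits a toy matrix only, not the mining algorithm actually used to build SQUAT. The threshold is taken as the midpoint between each gene's minimum and maximum values.

```python
from itertools import combinations

# Hypothetical expression values (genes x situations); NOT the values of Figure 3.
situations = ["S1", "S2", "S3", "S4"]
expression = {
    "G1": [100, 120, 5, 0],
    "G2": [80, 90, 70, 10],
    "G3": [3, 50, 60, 2],
    "G4": [0, 1, 0, 40],
}


def mid_range_binarize(values):
    """Return 1 for values strictly above the min/max midpoint, else 0."""
    threshold = (max(values) - min(values)) / 2 + min(values)
    return [1 if v > threshold else 0 for v in values]


def formal_concepts(binary, situations):
    """Enumerate maximal (genes, situations) bi-sets that contain only 1s.

    Brute force over gene subsets: acceptable for a toy matrix only.
    """
    genes = list(binary)
    found = set()
    for size in range(1, len(genes) + 1):
        for subset in combinations(genes, size):
            # Situations in which every gene of the subset is over-expressed.
            shared = [j for j in range(len(situations))
                      if all(binary[g][j] for g in subset)]
            if not shared:
                continue
            # Close the gene set: every gene over-expressed in all shared situations.
            closed = tuple(g for g in genes if all(binary[g][j] for j in shared))
            found.add((closed, tuple(situations[j] for j in shared)))
    return sorted(found)


def qsg_support(concepts):
    """Count, for each gene/situation pair, how many concepts contain it."""
    counts = {}
    for genes, sits in concepts:
        for g in genes:
            for s in sits:
                counts[(g, s)] = counts.get((g, s), 0) + 1
    return counts


binary = {gene: mid_range_binarize(values) for gene, values in expression.items()}
concepts = formal_concepts(binary, situations)
for genes, sits in concepts:
    print(genes, sits)
```

Passing a chosen subset of closely related concepts to `qsg_support` gives a table of counts comparable in spirit to panel C. How concepts are selected for aggregation is left to the user here.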
Figure 4. Hierarchical clustering analysis of the 51 formal concepts. The concepts, shown on the left, are represented according to the libraries they contain (shown on top). A red square indicates a library within a concept, and a green square a library not within the corresponding concept. From the hierarchical clustering shown on the left, one concept appears to be very different from the rest (n° 153308), and a group of concepts appears sufficiently similar to be grouped within a QSG.
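The red/green representation in this figure amounts to a binary membership matrix (concepts in rows, libraries in columns), from which pairwise distances between concepts can be computed before clustering. The sketch below illustrates that idea only. The concept and library names are hypothetical, and the Jaccard distance is an assumption, since the caption does not state which distance or linkage was used.

```python
from itertools import combinations

# Hypothetical concepts: identifier -> set of libraries the concept contains.
concept_libraries = {
    "c1": {"libA", "libB", "libC"},
    "c2": {"libA", "libB", "libD"},
    "c3": {"libA", "libB", "libC", "libD"},
    "c4": {"libE", "libF"},
}
libraries = sorted(set().union(*concept_libraries.values()))


def membership_row(libs):
    """Binary row: 1 (red square) if the library is in the concept, else 0 (green)."""
    return [1 if lib in libs else 0 for lib in libraries]


def jaccard_distance(a, b):
    """1 minus the ratio of shared libraries to all libraries in either concept."""
    return 1 - len(a & b) / len(a | b)


# Pairwise distances between concepts (assumed metric, see the note above).
distances = {
    (x, y): jaccard_distance(concept_libraries[x], concept_libraries[y])
    for x, y in combinations(sorted(concept_libraries), 2)
}


def mean_distance(name):
    """Average distance from one concept to all the others."""
    values = [d for pair, d in distances.items() if name in pair]
    return sum(values) / len(values)


# The concept that is, on average, furthest from the rest.
outlier = max(concept_libraries, key=mean_distance)
print(outlier, round(mean_distance(outlier), 2))
```

A distance table of this kind could then be passed to any standard hierarchical clustering routine. Concepts that end up under a common node are the candidates for aggregation into a QSG.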
Figure 5. Hierarchical clustering analysis of the formal concepts associating the Sca2 tag. Shown is a part of the clustering displaying concepts (on the left) according to the tags they contain (on top, not shown). A red square indicates a tag within a concept, and a green square a tag not within the corresponding concept. From the hierarchical clustering shown on the left, one group of concepts appears sufficiently similar to be grouped within a QSG, and is extracted using node 54.
Figure 6. DAVID analysis of the genes in the QSG associated with node 54 (see Figure 5). A shows a graphical representation of the four overrepresented groups obtained using the "Gene Functional Classification" menu with the "high classification stringency" option. The four groups displayed enrichments of 3.01, 2.48, 1.21 and 0.19 times, respectively. The genes are displayed in rows and the functional categories in columns. B and C show enlargements of the first two groups. Blue squares indicate a gene that belongs to a functional category.