Jie Zheng, Julia Stoyanovich, Elisabetta Manduchi, Junmin Liu, Christian J Stoeckert.
Abstract
The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis: clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases by the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Database URL: http://www.cbil.upenn.edu/annotCompute/
Year: 2011 PMID: 22190598 PMCID: PMC3244265 DOI: 10.1093/database/bar045
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. System architecture of AnnotCompute. Off-line processing is executed once a month, and builds a dissimilarity matrix of experiments. This matrix is used at query time to produce a ranked list of results in ‘query-by-example’, or to cluster results in the ‘clustering’ scenario.
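The two-stage flow in Figure 1 can be illustrated with a minimal sketch. This excerpt does not define the dissimilarity measures AnnotCompute actually uses, so the Jaccard dissimilarity below is an assumed stand-in, and the experiment identifiers, annotation sets and function names are hypothetical.

```python
from itertools import combinations


def jaccard_dissimilarity(a, b):
    """Assumed stand-in measure: 1 - |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def build_matrix(annotations):
    """Off-line step: dissimilarity for every unordered pair of experiments."""
    matrix = {exp: {exp: 0.0} for exp in annotations}
    for x, y in combinations(annotations, 2):
        d = jaccard_dissimilarity(annotations[x], annotations[y])
        matrix[x][y] = d
        matrix[y][x] = d
    return matrix


def query_by_example(matrix, query, k=10):
    """Query-time step: the k experiments least dissimilar to the query."""
    others = [(d, exp) for exp, d in matrix[query].items() if exp != query]
    return [exp for d, exp in sorted(others)[:k]]


# Hypothetical toy annotations, not taken from ArrayExpress.
annotations = {
    "EXP-A": {"disease state design", "H. sapiens", "metastasis"},
    "EXP-B": {"disease state design", "H. sapiens", "normal"},
    "EXP-C": {"time series design", "M. musculus"},
}
matrix = build_matrix(annotations)
ranked = query_by_example(matrix, "EXP-A")
print(ranked)
```

Precomputing the matrix keeps query time down to a lookup and a sort, which is consistent with the off-line/query-time split described in the caption.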
Figure 2. Annotation statistics for the ArrayExpress data set. The score is computed as the total number of extracted annotations per experiment, and is plotted on the x-axis. Each MAGE-TAB field that contains one or more valid ontology terms increments the score by 1, whereas fields with terms such as ‘unknown’, ‘none’ and ‘N/A’ do not increment the score. The field ‘Biomaterial characteristics’ may contain several ontology annotation categories, and so may increment the score by more than 1. A higher annotation score indicates that an experiment is annotated more richly. The percentage of the data set with a given score is plotted on the y-axis. Data used in the figure were downloaded on 1 August 2011.
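The scoring rule in the Figure 2 caption can be sketched as follows. The data layout (a dictionary of field names to term lists, with ‘Biomaterial characteristics’ holding one term list per category) is a hypothetical simplification, and "valid" is approximated here as "non-empty and not a placeholder"; the sketch does not check terms against an ontology.

```python
# Placeholder values that, per the caption, do not increment the score.
PLACEHOLDERS = {"unknown", "none", "n/a"}


def is_valid(term):
    """Approximation of 'valid ontology term': non-empty and not a placeholder."""
    cleaned = term.strip().lower()
    return bool(cleaned) and cleaned not in PLACEHOLDERS


def annotation_score(experiment):
    """Count fields (and biomaterial categories) holding at least one valid term."""
    score = 0
    for field, value in experiment.items():
        if field == "Biomaterial characteristics":
            # Hypothetical layout: category name -> list of terms.
            # Each category with a valid term adds 1, so this field may add more than 1.
            score += sum(
                1 for terms in value.values() if any(is_valid(t) for t in terms)
            )
        elif any(is_valid(t) for t in value):
            score += 1
    return score


# Hypothetical experiment record.
experiment = {
    "Experiment design types": ["transcription profiling"],
    "Experiment factor types": ["unknown"],
    "Biomaterial characteristics": {
        "sex": ["male"],
        "age": ["N/A"],
        "developmental stage": ["adult"],
    },
}
print(annotation_score(experiment))
```

In this toy record the design-type field contributes 1, the factor-type field contributes nothing, and two of the three biomaterial categories contribute 1 each.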
Effectiveness of ‘query-by-example’
| Use case | Query experiment | NDCG |
|---|---|---|
| Metastasis | E-GEOD-2280 | 0.98 |
| Metastasis | E-GEOD-2685 | 1 |
| Metastasis | E-GEOD-15641 | 0.68 |
| Insulin | E-MEXP-867 | 0.613 |
| Insulin | E-TABM-141 | 0.641 |
| Insulin | E-GEOD-11484 | 0.203 |
| Aging | E-GEOD-3305 | 0.817 |
| Aging | E-GEOD-11882 | 0.055 |
| Aging | E-GEOD-3305 (enriched) | 0.944 |
| Aging | E-GEOD-11882 (enriched) | 0.894 |
Effectiveness of ‘query-by-example’ for three use cases and for several query experiments. Effectiveness is measured by normalized discounted cumulative gain (NDCG), which ranges from 0 to 1, with a score of 1 corresponding to the highest possible effectiveness.
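For readers unfamiliar with NDCG, the sketch below uses one common formulation (gain equal to the relevance grade, discount of log2(rank + 1)). This excerpt does not state which NDCG variant or relevance scale the authors used, so both are assumptions, and the relevance lists are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(
        rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades for a ranked result list, most relevant first.
print(ndcg([2, 1, 0]))  # ranking already ideal
print(ndcg([0, 1, 2]))  # most relevant result ranked last
```

A ranking that places the most relevant experiments first scores 1; pushing relevant experiments down the list lowers the score toward 0.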
Clusters for the ‘Metastasis’ use case
| Cluster | Size | Quality | Description |
|---|---|---|---|
| 1 | 142 | 1.3 | Description: years, months, plus, patient, transcription profiling, transcription, index, mm, carcinoma, soft Experiment Design Types: transcription profiling (121), disease state design (40), co-expression design (29) Experiment Factor Types: disease state (28), organism part (11), disease staging (10) Experiment Factor Values: normal (18), metastasis (16), node (15) Taxons: Biomaterial Characteristics: sex—female (26), sex—male (20), disease state—normal (15) |
| 2 | 36 | 1.3 | Description: Experiment Design Types: transcription profiling (35), co-expression design (5), individual genetic characteristics design (3) Experiment Factor Types: genotype (3), treatment (2) Experiment Factor Values: wild-type (3), cells (3), p1a (2) Taxons: M. musculus (30), R. norvegicus (5) Biomaterial Characteristics: biosource type—fresh sample (3), developmental stage—adult (2), time unit—weeks (2) |
| 3 | 35 | 1 | Description: x, taxol, fac, x4, 12, x12, fec, weekly, 4, mg/m2 Experiment Design Types: transcription profiling by array (35) Experiment Factor Types: cell line (11), tissue (8), cell type (5) Experiment Factor Values: not (11), specified (11), 4 (9) Taxons: H. sapiens (28), M. musculus (6) Biomaterial Characteristics: treatment comments—12 paclitaxel + 4fac (4), age—62 (3), age—71 (3) |
| 4 | 12 | 1.2 | Description: strain or line design, cell line, cms4-met, cms4, p63, amplification, RNA, transcription, transcription profiling, Experiment Design Types: transcription profiling (7), strain or line design (5), cell type comparison design (3) Experiment Factor Types: cell line (12) Experiment Factor Values: cms4-met (3), 4t1 (2), cms4 (2) Taxons: H. sapiens (7), M. musculus (5) Biomaterial Characteristics: biosourcetype—fresh sample (6), sex—male (3), cell line—cms4-met (2) |
| 5 | 5 | 1.3 | Description: comparative genomic hybridization by array, dog, vhl, dna, tissue, specified, inactivated, 1858, sporadic, not Experiment Design Types: comparative genomic hybridization by array (5) Experiment Factor Types: cell line (2) Experiment Factor Values: not (3), specified (3), cell (2) Taxons: |
| 6 | 2 | 1.5 | Description: mir-10a, repressor, activity, disease state—colorectal adenocarcinoma, age—50 years, cell line—sw480, sex—male … Experiment Design Types: co-expression design (2), Taxons: Biomaterial Characteristics: sex—male (2), developmental stage—adult (2), age—50 years (2) |
| 7 | 2 | 0.8 | Description: chip-chip by tiling array, characterization, agent—hep3b tta4-ptre-lap-flag cultured without doxycycline during 10 days … Experiment Design Types: chip-chip by tiling array (2) Taxons: |
Clustering result for the query ‘metastasis or metastatic’. ‘Size’ is the number of experiments in a cluster. ‘Quality’ is the average quality score assigned to a cluster by users; it ranges from 0 (worst) to 2 (best).
Clusters for the ‘Insulin’ use case
| Cluster | Size | Quality | Description |
|---|---|---|---|
| 1 | 38 | 1.7 | Description: Experiment Design Types: transcription profiling (38), compound treatment design (8), genetic modification design (5) Experiment Factor Types: compound treatment design (6), genetic modification (5), compound (4) Experiment Factor Values: insulin (3), glucose (3), gene knock out (3) Taxons: mus musculus (27), R. norvegicus (9), Drosophila melanogaster (2) Biomaterial Characteristics: organism part—islet (4), sex—male (4), developmental stage—adult (3) |
| 2 | 16 | 1.7 | Description: transcription profiling, gip-dependent, disease state, stem, transcription, profiling, human, history, cell line, cushings, family … Experiment Design Types: transcription profiling (16), cell type comparison design (4), disease state design (3) Experiment Factor Types: disease state (4), cell line (3), cell type (2) Experiment Factor Values: 2 (2), type (2), tissue (2) Taxons: H. sapiens (14), R. norvegicus (2) Biomaterial Characteristics: sex—male (3), disease state—normal (2), time unit—years (2) |
| 3 | 13 | 1 | Description: transcription profiling by array, five, total, years, female, mean, pooled, range, time point, age Experiment Design Types: transcription profiling by array (13) Experiment Factor Types: strain or line (3), tissue (2) Experiment Factor Values: 3 (2), fat (2), high (2) Taxons: M. musculus (6), R. norvegicus (3), H. sapiens (2) Biomaterial Characteristics: tissue—liver (4), gender—male (3), gender—female (2) |
| 4 | 2 | 0.8 | Description: weeks, lean, training, mm, time, time series design, exercise, obese Experiment Design Types: time series design (2), co-expression design (2), transcription profiling (2) Experiment Factor Types: time (2) Experiment Factor Values: weeks (2), 1 (2), 4 (2) |
| 5 | 2 | 1.7 | Description: oil, diet, olive, cod, coconut, lard, its, lipids, media, micelles Experiment Design Types: transcription profiling (2), co-expression design (2), growth condition design (2) Experiment Factor Types: growth condition (2) |
| 6 | 2 | 2 | Description: biomarker, progression, study, disease, diabetes, rat, tissue, adipose, liver, Taxons: R. norvegicus (2) |
Clustering result for the query ‘insulin and glucose’. ‘Size’ is the number of experiments in a cluster. ‘Quality’ is the average quality score assigned to a cluster by users; it ranges from 0 (worst) to 2 (best).
Clusters for the ‘Aging’ use case
| Cluster | Size | Quality | Description |
|---|---|---|---|
| 1 | 32 | 1.7 | Description: transcription profiling, transcription, flies, months, selected, diet, sex—male, span, Experiment Design Types: transcription profiling (32), co-expression design (7), compound treatment design (3) Experiment Factor Types: age (4), strain or line (3), compound (3) Experiment Factor Values: months (4), 30 (3), control (3) Taxons: M. musculus (13), D. melanogaster (6), H. sapiens (5) |
| 2 | 6 | 0.2 | Description: transcription profiling by array, expression, gene Experiment Design Types: transcription profiling by array (6) Taxons: |
| 3 | 4 | 1.7 | Description: glp-4 bn2, individual genetic characteristics design, genotype, leu2, his3, ura3, daf-2 m577, met15, delta0, mutants, genotype … Experiment Design Types: co-expression design (4), transcription profiling (4), individual genetic characteristics design (4) Experiment Factor Types: genotype (4) Experiment Factor Values: delta0 (2), wild-type (2) Taxons: Saccharomyces cerevisiae (2) Biomaterial Characteristics: genotype—wild-type (2), genotype—his3, leu2, met15, ura3 isc1::kanmx4 (2) |
| 4 | 3 | 0.7 | Description: collected, week, years, percent, living, free, parasitic, biosource type—fresh sample, age—6, old, growth condition design Experiment Design Types: transcription profiling (3), growth condition design (2) Experiment Factor Types: age (2) Biomaterial Characteristics: sex—female (3), biosource type—fresh sample (3), age—6 (2) |
| 5 | 2 | 1 | Description: wrn, compared, treated, vitamin, with/without, experiment, feeding, protein, liver, wt, c Taxons: |
Clustering result for the query ‘longevity or life span or lifespan’. ‘Size’ is the number of experiments in a cluster. ‘Quality’ is the average quality score assigned to a cluster by users; it ranges from 0 (worst) to 2 (best).
Effectiveness of clustering for the extended evaluation
| Query | Size | Clusters (total) | Clusters (nonsingleton) | Quality (minimum) | Quality (maximum) | Quality (average) |
|---|---|---|---|---|---|---|
| Alzheimer | 34 | 5 | 5 | 1 | 2 | 1.2 |
| Autism | 12 | 3 | 3 | 1 | 2 | 1.7 |
| Cell and cycle and arrest | 30 | 5 | 4 | 1 | 2 | 1.5 |
| Enhancer and promoter | 27 | 5 | 5 | 1 | 2 | 1.4 |
| Flow and cytometry | 119 | 10 | 5 | 1 | 2 | 1.8 |
| Melanoma | 108 | 10 | 7 | 1 | 2 | 1.7 |
| Menin | 56 | 7 | 5 | 1 | 2 | 1.2 |
| Methylation | 199 | 10 | 10 | 1 | 2 | 1.3 |
| Migration | 119 | 10 | 5 | 1 | 2 | 1.2 |
| Olfactory | 46 | 6 | 4 | 2 | 2 | 2 |
| Average | | | | | | 1.5 |
Results of an evaluation of the effectiveness of clustering for 10 queries. In the table, ‘Size’ is the total number of experiments returned by the query. We report both the total number of clusters and the number of ‘nonsingleton clusters’, which contain at least two experiments. ‘Quality’ is the quality score averaged across nonsingleton clusters, on a scale from 0 (worst) to 2 (best).
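As a consistency check on the final row, the sketch below recomputes the overall figure from the per-query average quality values in the table, assuming the ‘Average’ row is the unweighted mean of those ten values (the excerpt does not say how it was derived).

```python
from statistics import mean

# Per-query average quality, copied from the table above.
average_quality = {
    "Alzheimer": 1.2,
    "Autism": 1.7,
    "Cell and cycle and arrest": 1.5,
    "Enhancer and promoter": 1.4,
    "Flow and cytometry": 1.8,
    "Melanoma": 1.7,
    "Menin": 1.2,
    "Methylation": 1.3,
    "Migration": 1.2,
    "Olfactory": 2.0,
}

# Unweighted mean across queries, rounded to the table's one decimal place.
overall = round(mean(average_quality.values()), 1)
print(overall)
```

Under that assumption the recomputed value matches the 1.5 reported in the table.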