
Retrieval, alignment, and clustering of computational models based on semantic annotations.

Marvin Schulz, Falko Krause, Nicolas Le Novère, Edda Klipp, Wolfram Liebermeister.

Abstract

The exploding number of computational models produced by Systems Biologists over the last years is an invitation to structure and exploit this new wealth of information. Researchers would like to trace models relevant to specific scientific questions, to explore their biological content, to align and combine them, and to match them with experimental data. To automate these processes, it is essential to consider semantic annotations, which describe their biological meaning. As a prerequisite for a wide range of computational methods, we propose general and flexible similarity measures for Systems Biology models computed from semantic annotations. By using these measures and a large extensible ontology, we implement a platform that can retrieve, cluster, and align Systems Biology models and experimental data sets. At present, its major application is the search for relevant models in the BioModels Database, starting from initial models, data sets, or lists of biological concepts. Beyond similarity searches, the representation of models by semantic feature vectors may pave the way for visualisation, exploration, and statistical analysis of large collections of models and corresponding data.

Year: 2011 PMID: 21772260 PMCID: PMC3159965 DOI: 10.1038/msb.2011.41

Source DB: PubMed Journal: Mol Syst Biol ISSN: 1744-4292 Impact factor: 11.429

Introduction

The rise of Systems Biology as a mainstream field of research triggered a fast accumulation of knowledge about cellular networks, their biochemical details, and their dynamic behaviour. Much of this complex information is condensed in mathematical models, which statically or dynamically describe the interconversion of biochemical compounds within reaction networks. A wealth of models, picturing various regions of the cellular networks, are available in public repositories like the BioModels Database (Le Novère et al, 2006) or JWS Online (Olivier and Snoep, 2004) in the machine-readable format Systems Biology Markup Language (SBML; Hucka et al, 2003). Meta-information on existing databases can be found on websites like PathGuide (Bader et al, 2006). The models in these repositories serve as information sources and they may be reused, refined, and combined for new research studies. Continued research aiming for improved and complex models, e.g., for biomedical purposes, makes it desirable to change or combine models automatically with the help of computers. Such an automatic processing would be easier if published models and data were based on a common list of well-defined elements with a fixed naming convention. However, since models cover a growing number of entities and describe processes by various levels of granularity, the meaning of model elements is established on a case-by-case basis by machine-readable annotations that link them to entries in public web resources. Annotations may, for instance, relate cell compartments to Gene Ontology entries (Ashburner et al, 2000) and small chemical compounds to entries from ChEBI (Degtyarenko et al, 2008). The MIRIAM initiative (Le Novère et al, 2005) has proposed a standard format for biochemical annotations, consisting of the URN of a web resource, an ID for the referenced resource element, and a biological qualifier stating a logical relation (e.g., ‘hasPart') between the model element and the resource entry. 
The use of public web resources and ontologies like the Systems Biology Ontology makes annotations unambiguous and facilitates search, visualisation, and automated reasoning. While building complex models, Systems Biologists need to search for relevant models, to rank or classify them, and to check how models differ, overlap, and complement each other (Liebermeister, 2008; Krause et al, 2010). Furthermore, models need to be validated and refined with experimental data, which have to be retrieved and aligned to the models beforehand. All these tasks call for automation and most of them require quantitative similarity measures between models and data sets, which should capture their biological meaning and be computable quickly and reliably. The situation is comparable to the early days of bioinformatics, when nucleotide sequences became harder and harder to compare until filtering methods like FASTA and BLAST (Lipman and Pearson, 1985; Altschul et al, 1990) led to a breakthrough. A comparison of models can be based on various biological or mathematical aspects, including the biological entities and processes described, the mathematical formalism, details of the equations and numerical values, or even the dynamic behaviour. In particular, most Systems Biology models describe biological networks and can be depicted and treated as networks themselves. The automatic comparison of biological networks has been widely discussed in the literature. In their review, Sharan and Ideker (2006) distinguish between three main groups of applications. (i) Network alignment describes the process in which complete networks (e.g., protein–protein interaction networks from different species; Matthews et al, 2001) are compared in order to unveil similar and different regions. (ii) Network integration combines networks of different types to gain particular information (e.g., enriched protein interconnection patterns; Zhang et al, 2005). 
(iii) Network querying detects parts of a large network that resemble a query motif (e.g., to search how a metabolic pathway is conserved across species; Pinter et al, 2005). On the computational side, the alignment and the querying problem can be tackled by similar algorithms, which differ in how they compare network structures and how they relate nodes. The alignment of structures has evolved from simple paths (similar to sequence alignment) (Kelley et al, 2003) and trees (Pinter et al, 2005) to general graph structures (Yang and Sze, 2007). Depending on the application, the comparison of nodes can be based on their labels, on the relatedness of their annotations (e.g., their EC numbers; Tohsato et al, 2000), or on their chemical structures (Hattori et al, 2003). For a review on network querying algorithms, the reader is referred to Fionda and Palopoli (2011). When comparing Systems Biology models, the network structure may be less informative because the same system may be described by alternative models at different levels of granularity (Markevich et al, 2004). Graph reduction techniques (Gay et al, 2010) can partially handle this problem, but only if the networks are not too different. A more direct way to find biologically similar models, especially for searching and ranking (Henkel et al, 2010), is to compare their semantic annotations using methods from information retrieval (IR), as introduced in Box 1. However, the comparison of semantic annotations involves two general challenges: (i) annotations may describe the same chemical entity or process, but point to entries in different web resources; (ii) different web resource entries can share subtle biochemical relationships (e.g., the molecular species ATP3+ referenced in one model being a special case of—rather than identical to—ATP referenced in another data set). 
To overcome these challenges, intra-ontology relationships and cross references for a large number of relevant web resources have to be combined in an integrative ontology, which can then be used to compare entries from various resources.

Case study: semantic similarity measures for SBML models

As an example case for the use of semantic annotations, we present a system for retrieval, clustering, and alignment of SBML models. It relies on a technical infrastructure for handling biological concepts (BCs) and on semantics-based similarity scores for models and data sets. We applied our framework to models from BioModels Database, validated the calculated model similarities with human expert knowledge, and present a number of practical applications. Researchers can use our online services at http://www.semanticsbml.org to retrieve Systems Biology models resembling a given SBML model or related to an experimental data set. Furthermore, they can cluster models by their semantic similarities and visually align their elements. The mathematical and technical details are explained in part below and extensively in the Supplementary Appendix.

Challenges in the automatic comparison of model elements

The biological meaning of SBML elements can be declared by annotations according to the MIRIAM standard. Comparing SBML elements comprising possibly many annotations imposes even more challenges than the comparison of single annotations: (i) the relationships between model elements and resource entries, stated by qualifiers, may be complex (e.g., ‘hasPart' rather than a simple ‘is'); (ii) each model element may contain several annotations, describing its different aspects; (iii) annotations may be missing, unspecific, or simply wrong.

Semantic annotations in BioModels Database

To compare biological identifiers and to evaluate the relationships between them efficiently, we developed the query engine libSBAnnotation, which collects biochemical knowledge from several public web resources and combines it in a single ontology. Equivalent entries from different resources are replaced by single ‘Biological Concepts' (BCs), whereas similar entries (e.g., ‘α-D-glucose' versus ‘glucose') are represented by different BCs, but connected by ontology relationships (e.g., ‘is_a'). Queries can be posed through a programming interface or through a web service compliant with the Representational State Transfer (REST) software architecture style (Fielding, 2000). libSBAnnotation makes it easy to explore the semantic annotations present in BioModels Database, the major public collection of MIRIAM-compliant models (249 models in the 17th release from May 2010, which we use in the current study). Comprehensive statistics for the most recent BioModels release are provided at our online service. Approximately 69% of all compartments, species, and reactions are annotated. They carry about 1.7 annotations per annotated element, and almost all of these annotations use ‘isVersionOf', ‘hasPart', or ‘is' qualifiers. The annotations refer to a broad range of web resources and the high abundance of Gene Ontology and UniProt entries shows that the models contain more proteins than small metabolites. Figure 1 shows the prevalence of about 2000 BCs within all 249 models in the form of an annotation matrix. Positive matrix elements indicate which models (columns) contain annotations referring to certain BCs (rows). The numerical values may also state how often a BC appears in a model and which qualifiers are used. The matrix columns, called feature vectors, can be seen as the ‘annotation fingerprints' of models and may serve for simple comparisons and visualisation by multivariate statistical methods.
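The annotation matrix and its columns can be illustrated with a minimal Python sketch. The function names, the model names, and the BC labels below are hypothetical and chosen only for illustration; they are not part of libSBAnnotation, and the sketch uses binary entries only (no occurrence counts or qualifiers).

```python
def annotation_matrix(model_annotations):
    """Build a binary annotation matrix.

    model_annotations maps a model name to the set of BCs it references.
    Rows correspond to BCs, columns to models; entry 1 means the model
    contains an annotation pointing to that BC.
    """
    models = sorted(model_annotations)
    bcs = sorted({bc for anns in model_annotations.values() for bc in anns})
    matrix = [[1 if bc in model_annotations[m] else 0 for m in models]
              for bc in bcs]
    return bcs, models, matrix


def feature_vector(matrix, column):
    """Return one matrix column, i.e. the 'annotation fingerprint' of a model."""
    return [row[column] for row in matrix]


# Hypothetical toy data (not taken from BioModels Database).
toy = {
    "model_1": {"BC_ATP", "BC_MAPK"},
    "model_2": {"BC_ATP", "BC_glucose"},
}
bcs, models, matrix = annotation_matrix(toy)
fingerprint_1 = feature_vector(matrix, 0)
fingerprint_2 = feature_vector(matrix, 1)
```

Keeping the BC order fixed across all models is what makes the columns directly comparable.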

Figure 1

Annotation matrix. Semantic annotations link the elements of Systems Biology models to Biological Concepts from public web resources. The associations between them can be represented by a matrix: positive entries (red) show that a model (column) contains an annotation pointing to a certain Biological Concept (row). Left: annotation matrix for the BioModels Database, sorted by two-way agglomerative clustering. Right: close-up showing a number of MAP kinase models and Biological Concepts. Matrix visualised by GenePattern (Reich et al, 2006).

Similarity measures for BCs, annotations, and models

The libSBAnnotation interconnects BCs by various semantic relationships and thus forms an ontology. To express the direct and indirect relationships between BCs by numbers, we developed a series of similarity measures which resemble the scores used in semantic text analysis. Based on the similarities between individual BCs, we then define similarity measures between entire models. We investigated two groups of such measures and tested their performance in practical applications. Vector-based similarity measures are solely based on the feature vectors, i.e., on the set of BCs referenced by a model. Two feature vectors are compared by the cosine coefficient and a special metric is used to acknowledge that annotations can point to different, yet similar BCs (e.g., ‘α-D-glucose' versus ‘glucose'). Structure-based similarity measures, in contrast, start with a pairwise comparison of individual model elements and combine the resulting similarities in more complex model similarity scores. In the spirit of probabilistic reasoning, one of these similarity measures can combine evidence from several annotations, distinguish between missing information (i.e., no element annotations) and negative information (i.e., annotations pointing to different BCs), and weight different combinations of biological qualifiers and relationships between BCs. For a practical test, we evaluated all similarity measures with benchmark models from BioModels Database, the largest public collection of curated SBML models. After manually classifying the models by the biochemical pathways described, we clustered them all by each of the measures (similar to Figure 1) and compared the clusters with the predefined classification. A detailed evaluation can be found in the Supplementary Appendix. The vector-based similarity measures have a number of advantages: first of all, they performed well in the comparison and are easy to compute. 
Moreover, being solely based on sets of annotations, these annotation measures not only apply to kinetic or structural SBML models, but also to annotated ‘omics' data or any type of data associated with a list of BCs. Therefore, they are used in our online tools and will be explained below. More details and descriptions of other similarity measures are given in the Supplementary Appendix.

Similarity between BCs

Similarity measures for ontology elements have been discussed in the literature from a theoretical (Lin, 1998) and a practical point of view (Resnik, 1995; Budanitsky and Hirst, 2001; Li et al, 2003) and have been implemented in software tools (Lord et al, 2003). They are usually computed from the distance between entries in the relationship graph, their most specific common ancestor, and a corpus, a collection of text or data in which the appearance frequencies of ontology terms can be counted. Following Li et al (2003), we define a similarity σBC(μ, ν) between BCs μ and ν taking into account three factors: (i) their weighted distance (f1) in the ontology forest, (ii) their depths (f2) (distance from a root), and (iii) their rarity (f3) in BioModels Database. The three factors are combined in a single formula [equation not shown]. The factor f1 yields a high similarity if two BCs are connected by a short relationship path [equation not shown]; in its definition, p∈P is a possible path of relation arrows (r) between the two BCs and frts scores each relation type by a value between 0 and 1 (see Supplementary Appendix for numerical values). If there is no path between the BCs, f1 is set to 0. A sensitivity analysis in the Supplementary Appendix shows that the choice of numerical values for frts has only little effect on the model retrieval results. Since too few benchmark examples are available to optimise these parameters reliably, we use ad hoc values chosen before testing any of the measures. BCs that are deeper in the relationship graph are usually more precise because they comprise fewer subconcepts (e.g., ‘D-glucopyranose' being a subconcept of ‘carbohydrate'). Accordingly, if two pairs of BCs are connected by a similar path, the pair with the lower depth (e.g., ‘carbohydrate' and ‘sugar') should be less similar than the more specific pair (e.g., ‘D-glucopyranose' and ‘α-D-glucose'). This is implemented by the exponent f2 [equation not shown], in which d(μ) is the path length of ‘is_a' relationships between an ontology element μ and its root. 
Since some BCs (e.g., ATP) appear very often in BioModels Database, a semantic density factor f3 can be introduced to downweight them in the similarity measure. For each BC μ, it is computed [equation not shown] from cμ, the number of occurrences of μ in BioModels Database, from the number of occurrences of μ and all its ‘is_a' specialisations, and from a normalisation term. However, the term f3 did not seem to improve the results in our evaluation. This agrees with the findings of Li et al (2003), who concluded that the use of an ‘information factor' in text analysis decreases the quality of their similarity measure. Because of this and because the frequency of individual annotations is already part of our null model for computing P-values, we omitted this term from the online model search.
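As an illustration of the path-based factor f1, the following Python sketch scores the best relationship path between two BCs. It rests on assumptions that are not confirmed by the text above: f1 is taken to be the maximum, over all paths, of the product of the relation-type scores along the path; relations are traversed in both directions; and the scores in `RELATION_SCORES` as well as the toy ontology edges are hypothetical (the values actually used are given in the Supplementary Appendix). The depth exponent f2 and the density factor f3 are not modelled.

```python
import heapq

# Hypothetical relation-type scores in (0, 1]; not the published values.
RELATION_SCORES = {"is_a": 0.8, "part_of": 0.5}


def path_similarity(edges, source, target, scores=RELATION_SCORES):
    """Best product of relation scores over any path from source to target.

    edges is a list of (bc_a, relation_type, bc_b) triples.
    Returns 0.0 if the two BCs are not connected.
    """
    graph = {}
    for a, relation, b in edges:
        # Assumption: a relation can be followed in either direction.
        graph.setdefault(a, []).append((b, scores[relation]))
        graph.setdefault(b, []).append((a, scores[relation]))

    # Dijkstra-style search; valid because every score is at most 1,
    # so extending a path can never increase its product.
    best = {source: 1.0}
    heap = [(-1.0, source)]
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if node == target:
            return score
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbour, weight in graph.get(node, []):
            candidate = score * weight
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return 0.0


# Hypothetical toy ontology fragment.
edges = [
    ("alpha-D-glucose", "is_a", "D-glucopyranose"),
    ("D-glucopyranose", "is_a", "glucose"),
    ("glucose", "is_a", "carbohydrate"),
]
close_pair = path_similarity(edges, "alpha-D-glucose", "glucose")
unrelated = path_similarity(edges, "glucose", "ATP")
```

Under these assumptions, a two-step ‘is_a' path scores 0.8 × 0.8, and a BC with no connecting path scores 0, in line with the rule that f1 is set to 0 when no path exists.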

Vector-based model similarity

The elements of an SBML model, such as cellular compartments, molecular species, or biochemical reactions, can be annotated with links to various web resources. Given a general list of BCs, each model or data set M can be represented by a feature vector $v_M$ with components $v_{M,i}=1$ if the ith BC appears in an annotation in model M and $v_{M,i}=0$ otherwise. Similarities between two models M and N can be defined by functions of their feature vectors. From the various measures used in IR, we chose the cosine of the angle between the feature vectors (van Rijsbergen, 1979; Salton and McGill, 1986). Despite its simplicity, this cosine coefficient allows for reasonable comparisons between models. However, it cannot detect the resemblance between similar, but slightly different BCs (e.g., CHEBI:17634 for D-glucose and CHEBI:17925 for α-D-glucose). To capture such biochemical similarities, we replace the scalar product by a quadratic form based on the similarity matrix S for the BCs ($S_{ij}=\sigma_{BC}(\mu, \nu)$, where μ is the ith and ν is the jth BC):

$$\sigma(M, N) = \frac{v_M^{T} S\, v_N}{\sqrt{\left(v_M^{T} S\, v_M\right)\left(v_N^{T} S\, v_N\right)}}$$

Since the feature vectors and the similarities σBC(μ, ν) are non-negative, this formula yields non-negative model similarities even if the similarity matrix is not positive definite. For positive-definite matrices S, the formula can be interpreted in terms of transformed feature vectors as proposed in the topic-based vector space model (TVSM; Becker and Kuropka, 2003).
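The effect of replacing the scalar product by a quadratic form can be seen in a small Python sketch. It assumes the usual cosine normalisation, with the quadratic form used in both numerator and denominator; the function names, the three-BC toy list, and the similarity value 0.8 between the two glucose BCs are hypothetical.

```python
from math import sqrt


def quadratic_form(u, S, v):
    """Compute u^T S v for plain Python lists."""
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def model_similarity(u, v, S):
    """Cosine-like similarity in which S replaces the identity matrix."""
    denom = sqrt(quadratic_form(u, S, u) * quadratic_form(v, S, v))
    return quadratic_form(u, S, v) / denom if denom > 0 else 0.0


# Toy BC list: [D-glucose, alpha-D-glucose, ATP]; all values are hypothetical.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
S = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]

model_m = [1, 0, 1]  # annotated with D-glucose and ATP
model_n = [0, 1, 1]  # annotated with alpha-D-glucose and ATP

plain_cosine = model_similarity(model_m, model_n, identity)
semantic_cosine = model_similarity(model_m, model_n, S)
```

With the identity matrix the two toy models share only ATP and score 0.5; once the two glucose BCs are declared similar, the score rises to 0.9.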

Implementation and data

The query engine libSBAnnotation and all described methods were implemented in Python. The code is freely available at http://sourceforge.net/projects/semanticsbml. Online tools for similarity calculations, model retrieval, clustering based on the TVSM similarities, and alignment based on greedy pairing, as well as a public REST API for programmatic access are provided at http://www.semanticsbml.org.

Model search

Ranked retrieval of SBML models from BioModels Database

Model similarities can be used to find models or data sets referring to a given query model or a list of BCs. As a practical application, we have implemented a similarity search for models from BioModels Database (Figures 2 and 3). The retrieved models are ranked by similarity scores, where high scores indicate that query model and retrieved model share a large fraction of similar annotations. To discard models describing unrelated pathways, we filter out low-scoring search results by a statistical significance test. Our null hypothesis states that BCs appear in the models independently and with the same BC-specific frequencies as in BioModels Database. Accordingly, low P-values indicate that the query model and a retrieved model share a set of common BCs that is unlikely to appear just by chance, which suggests that they describe the same biological pathways.

Figure 2

Model search based on a microarray study. The experiment (Klevecz et al, 2004) revealed differential gene expression during metabolic oscillations in yeast, which are coupled with bursts in DNA replication. Using the list of differentially expressed genes as a query, we obtained models of the affected pathways. The retrieved models describe methionine or more general amino-acid metabolism (BioModels 66, 212, 68, 15, 190, 213, and 18), sulphur metabolism (90), ubiquitination (105), and the DNA polymerase (15), and thus cover three of the six functional categories of the query genes. Bar lengths and colours show the vector-based model similarity scores.

Figure 3

Results of a semantic model search in BioModels Database. Starting from the kinase cascade model of Huang and Ferrell (1996), a ranked list of similar models was retrieved automatically. The first 15 models contain complete kinase cascades or parts of them. The top hit is the query model itself.

Given the similarity score σ of a certain retrieved model, the P-value states how probable it is to obtain an equal or higher similarity score from a random model. For the calculation, we randomly sample feature vectors in which each BC appears with the probability (b+1)/(B+1), where b is the number of models referring to this BC and B is the total number of models. We generate N=998 such random models and count how many of them show higher similarities to the query model than our actual retrieved model. From this number n, the P-value is estimated by the Bayesian estimator 〈P〉=(n+1)/(N+2) with a uniform prior for the P-value. For practical reasons, we also compute a second P-value for the model overlap $v_M^{T} v_N$ (the scalar product of the two feature vectors), which can be computed without the need for random sampling. Analytically calculating P-values for other similarity scores turned out to be too slow for efficient online searches. More details on the methods can be found in the Supplementary Appendix.
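The sampling procedure described above can be sketched in Python as follows. The function names and the toy database are hypothetical, and a plain cosine coefficient stands in for whichever similarity score is being tested; the sampling probability (b+1)/(B+1), the sample size N=998, and the estimator (n+1)/(N+2) are taken from the text.

```python
import random
from math import sqrt


def cosine(u, v):
    """Plain cosine coefficient between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def bc_probabilities(database_vectors):
    """Per-BC sampling probability (b + 1) / (B + 1)."""
    B = len(database_vectors)
    n_bcs = len(database_vectors[0])
    return [(sum(vec[i] for vec in database_vectors) + 1) / (B + 1)
            for i in range(n_bcs)]


def estimate_p_value(observed, query, probabilities,
                     similarity=cosine, n_samples=998, rng=None):
    """Estimate <P> = (n + 1) / (N + 2) from N random models."""
    rng = rng or random.Random()
    n_higher = 0
    for _ in range(n_samples):
        # Each BC is drawn independently, as stated by the null hypothesis.
        random_model = [1 if rng.random() < p else 0 for p in probabilities]
        if similarity(query, random_model) > observed:
            n_higher += 1
    return (n_higher + 1) / (n_samples + 2)


# Hypothetical toy database of four models over five BCs.
database = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
probs = bc_probabilities(database)
query = database[0]
observed = cosine(query, database[1])
p_value = estimate_p_value(observed, query, probs, rng=random.Random(0))
```

With N=998 the estimate is bounded between 1/1000 and 999/1000, so the smallest attainable value is 10^-3.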

Search for models related to an experimental result

A model search may begin with a list of genes involved in a certain biological process. As an example (see Figure 2), we considered a microarray study on gene regulation during metabolic oscillations in the yeast S. cerevisiae. The experimental data of Klevecz et al (2004) show that the observed oscillations are coupled with bursts in DNA replication. Differentially expressed genes tend to be involved in sulphur and methionine metabolism and in the production of ubiquitin proteasomes, ribosomes, and the DNA polymerase. To retrieve models related to this gene set, we described the genes by MIRIAM-compliant annotations and started a search for relevant models. The retrieval returned models of methionine and sulphur metabolism, a model of the ubiquitin proteasome system, and a model containing a DNA polymerisation reaction (see Figure 2). Although the functional categories of the query genes were not explicitly used in the query, they agree with the search results. Similar model searches could start from any list of metabolites, genes, or proteins. Further examples and practical hints for the annotation and retrieval process can be found on our website.

Search results for a signal transduction model

The result of a model search starting from an MAP kinase cascade model, BioModel 9 (Huang and Ferrell, 1996), is shown in Figure 3. The topmost 15 models in the list describe either MAP kinase cascades or parts of them. While the models 11, 14, and 10 are as detailed as the query model, the models of Markevich (26–31) represent only the activation of MAPK, but in more detail, and the models 84, 116, 32, 149, and 33 contain additional proteins around the MAP kinase cascade. All similarities are highly significant (estimated P-values around 10^−3). Models further down in the list share some general annotations with the query model, for instance, Gene Ontology terms for protein phosphorylation and dephosphorylation, but they rarely describe MAP kinase cascades. Depending on the frequency of the common annotations, the retrieved models may still appear significant, but they show much lower similarity scores than the first 15 hits.

Model clustering

Unsupervised clustering is one of the prominent applications of similarity measures. As an example, we clustered the first 10 models from Figure 3 by agglomerative clustering using vector-based similarities. As shown in Figure 4, the two model groups describing either the complete MAP kinase cascade (9, 10, 11, and 14) or MAPK activation (26–31) are clearly distinguished. Furthermore, the models 11 and 14, which stem from the same publication (Levchenko et al, 2000), show the highest similarity among the complete MAP kinase cascades, whereas model 10, the only model with enzymatic reactions represented with Michaelis–Menten-like kinetics, appears most distant to all others. Among the MAPK activation models, the clustering clearly distinguishes between models containing effective enzymatic rate laws (27, 29, and 31) and the ones containing elementary reaction steps (26, 28, and 30). The reason for this distinction is not the structural difference between the models, but the fact that many elementary reactions in these models (in contrast to the enzymatic ones) were annotated with Gene Ontology terms for enzyme binding or dissociation.
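A minimal Python sketch of agglomerative clustering on pairwise similarities is given below. Average linkage is an assumption (the text does not state which linkage was used), and the labels and similarity values are hypothetical rather than actual BioModels scores.

```python
def average_linkage_clustering(labels, similarity):
    """Agglomerative clustering on a pairwise similarity table.

    similarity maps frozenset({a, b}) to the similarity of items a and b.
    Returns the merges in order as (cluster_1, cluster_2, average_similarity).
    """
    def average(c1, c2):
        total = sum(similarity[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        candidates = [(average(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j]
        score, i, j = max(candidates)
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical similarities between four toy models.
labels = ["A", "B", "C", "D"]
toy_similarity = {
    frozenset(("A", "B")): 0.9,
    frozenset(("C", "D")): 0.8,
    frozenset(("A", "C")): 0.2,
    frozenset(("A", "D")): 0.2,
    frozenset(("B", "C")): 0.2,
    frozenset(("B", "D")): 0.2,
}
merges = average_linkage_clustering(labels, toy_similarity)
```

The order of the merges, read from first to last, corresponds to the dendrogram from the leaves to the root.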

Figure 4

Clustering of computational models. The MAP kinase models retrieved from BioModels Database (see Figure 3) contain subgroups that are successfully detected by the clustering. Despite their different network structures, the models of Markevich (26–31), describing the same biological pathway, show high pairwise similarities (>0.82). Dendrogram drawn by DendroUPGMA (Garcia-Vallve et al, 1999).

Model alignment

One of the key challenges in automated model merging is to match equivalent elements from two models. To realise such a model alignment, we employed a greedy pairing: the two elements with the highest pairwise similarity are successively matched until all remaining similarities fall below a certain threshold. At our website, the user can visually align annotated SBML models or data sets to similar models from BioModels Database. An example, the visual alignment between the MAP kinase cascade models, BioModel 84 and BioModel 9, is shown in Figure 5. Although both models share the same general structure, the Huang model shows the phosphorylation states of the MAP kinases in much higher resolution. Structure-based alignment methods could not detect the similarity between these models without heavily increasing the ‘fuzziness' (number of node insertions/deletions/mismatches) of their matching, which in turn leads to a low specificity.
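The greedy pairing can be sketched in a few lines of Python. The function name and the toy element names and scores are hypothetical; the sketch also assumes that each element is matched at most once, which the text does not state explicitly. The default threshold of 0.25 is the value used for the visual alignment in Figure 5.

```python
def greedy_alignment(similarities, threshold=0.25):
    """Greedily pair elements of two models by descending similarity.

    similarities maps (element_of_model_1, element_of_model_2) to a score.
    Pairing stops once the remaining scores fall below the threshold.
    """
    pairs = []
    used_first, used_second = set(), set()
    ranked = sorted(similarities.items(), key=lambda item: item[1],
                    reverse=True)
    for (a, b), score in ranked:
        if score < threshold:
            break  # all remaining scores are lower still
        if a in used_first or b in used_second:
            continue  # assumption: one-to-one matching
        pairs.append((a, b, score))
        used_first.add(a)
        used_second.add(b)
    return pairs


# Hypothetical element similarities between two toy models.
toy_scores = {
    ("MAPK", "ERK"): 0.9,
    ("MAPK", "MEK"): 0.6,
    ("MAPKK", "MEK"): 0.8,
    ("MAPKKK", "Raf"): 0.1,
}
alignment = greedy_alignment(toy_scores)
```

A greedy scheme is simple and fast, but it does not guarantee the pairing with the highest total similarity; an optimal assignment would require a different algorithm.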

Figure 5

Visual alignment between computational models. An MAP kinase model (BioModel 84; Hornberg et al, 2005) (blue) is aligned with the more detailed BioModel 9 (Huang and Ferrell, 1996) (red). The reaction networks represent chemical species (circles) and reactions (squares) connected by reactant (green) and product edges (red). Orange edges connect elements between models if their similarity scores exceed a threshold value of 0.25.

As a further application, we used model alignments to tile the metabolic network of the yeast S. cerevisiae with kinetic models available in BioModels Database. By iteratively aligning kinetic models to the yeast consensus metabolic network (Herrgård et al, 2008), we could cover about 15% of the network with eight kinetic models (most of which described the metabolism of other organisms), whereas all other models contributed only a few additional elements (see the Supplementary Appendix for details). This automatic comparison shows that existing kinetic models are far from sufficient to build a comprehensive model of yeast metabolism. In the future, the same method could help to find white spots in the cellular networks that might deserve further modelling efforts.

Discussion

The construction of large Systems Biology models in a bottom-up style requires models that are easy to reuse. Public model repositories, standard formats, and annotation schemes have already been established, and the wealth of information stored in models is ready to be processed by computer programmes. While interconversion of annotations is mainly a matter of technology, defining suitable similarities between models is a more delicate task. The reason is that similarity measures are not given a priori, but need to reflect specific human intentions and expectations in order to be useful for practical applications. Computational models can resemble each other in two complementary ways: first, they may describe similar biological systems; and second, in case they do, they may describe them using a similar level of granularity, similar formulas, or similar quantitative values. In the present approach, we focused on the first aspect, which is fully captured by the biological annotations. The technical challenges mentioned above were solved by interconnecting model annotations and BCs, by assigning quantitative weights to the biological qualifiers and relationships between BCs, and by condensing all information within the similarity measures. The second aspect, which concerns model formulation, network structure, mathematical statements, and numerical values, was ignored here. Of course, the similarity measures could be extended to compare enzymatic rate laws (e.g., by evaluating annotations with Systems Biology Ontology identifiers) or mathematical formulae. However, similarity scores for mathematical statements or network structures would strongly depend on specific model formalisms, while the comparison of annotations is not even limited to Systems Biology models, but may apply to models from other fields and even experimental data sets or annotated scientific literature (Cheung et al, 2010). 
Another reason for semantic comparisons is that biologists typically search for models describing a certain biochemical process, irrespective of the mathematical details. A paper such as Markevich et al (2004), for instance, represents the same biochemical process (MAPK phosphorylation) by six different mathematical structures and, therefore, different reaction networks. Annotations make it easy to recognise the similarity between these models, while network-based model similarities would emphasise their differences. Like many other classification tasks, model retrieval crucially depends on a sensible choice of the null model used for computing the P-values. Since the null model is used to distinguish between meaningless and meaningful similarities, it needs to be chosen as carefully as the similarity measure itself. In general, it should capture the typical properties of models that are not specifically interesting as search results. In the present approach, the main aim was to find models that specifically share annotations with a query model. Unspecific BCs, especially those that are very frequent, are likely to lead to spurious similarity values. Our null model was tailored to reproduce exactly this effect in order to tag the resulting low similarities as insignificant. In the future, larger model databases and more specific search tasks may raise the need for advanced null models, specifically constructed to match and exclude unintended search results. The comparison of SBML models by semantic annotations works well in practice and may pave the way to promising applications. 
By transforming and normalising the semantic feature vectors, similarities can be rewritten in terms of Euclidean distances, which makes them amenable to multivariate methods such as Kohonen maps, biclustering, principal and independent component analysis (Pearson, 1901; Comon, 1994), non-negative matrix factorisations (Lee and Seung, 1999), classification by support vector machines, and search for prototype models. These methods, in turn, may have various practical applications in the visualisation and statistical analysis of large model sets. As depicted in Figure 6, automated searches for models and experimental data can be helpful in early and later stages of the modelling process. Existing models can provide information about additional reactions, enzymatic rate laws, and parameter values, or suggest alternative descriptions of biochemical processes. More complex searches using positive and negative weights for the individual features, e.g., for models that contain certain annotations and lack certain others, could help to extend existing models by additional pathways. Finally, the possibility to start the retrieval process from ‘omics' data opens up new applications, including pathway enrichment analyses, comparison between experimental data and simulation results, or automated model parameter fitting and model selection.
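The link between the similarity scores and Euclidean distances mentioned above can be written out as a short derivation. It is a sketch under the assumption that the similarity matrix S is positive definite, so that it can be factorised as $S = L^{T} L$; the symbols $L$, $x$, and $y$ are introduced here for illustration only.

```latex
% Transformed feature vectors (assumption: S = L^T L is positive definite)
x = L\, v_M, \qquad y = L\, v_N, \qquad x^{T} y = v_M^{T} S\, v_N

% Normalisation to unit length
\hat{x} = \frac{x}{\lVert x \rVert}, \qquad \hat{y} = \frac{y}{\lVert y \rVert}

% Squared Euclidean distance between the normalised vectors
\lVert \hat{x} - \hat{y} \rVert^{2}
  = \hat{x}^{T}\hat{x} + \hat{y}^{T}\hat{y} - 2\,\hat{x}^{T}\hat{y}
  = 2 - 2\,\frac{v_M^{T} S\, v_N}{\sqrt{\left(v_M^{T} S\, v_M\right)\left(v_N^{T} S\, v_N\right)}}
```

A high similarity therefore corresponds to a small distance between the normalised, transformed feature vectors, which is the form expected by distance-based multivariate methods.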

Figure 6

Semantic model comparison can be useful during hypothesis generation, modelling, experimental verification, and model refinement. Given a model or an experimental data set, similar models or data can be found in repositories and used to extend existing models, refine them using data, and finally select the most appropriate model. Models and data sets of interest can further be mapped, aligned, combined, and classified or displayed by clustering.

Conclusion

As models and data in Systems Biology are rapidly accumulating, automatic searches for models or data sets and pairwise alignments between them become increasingly important. For efficient searches, models and data have to adhere to standard formats, contain reliable biological annotations, and be stored in central, publicly accessible repositories. Public databases already provide a significant number of well-annotated models and data, and model comparison may promote various applications, making it possible to exploit an otherwise hardly manageable amount of knowledge. By facilitating the reuse of models and data, such comparisons may become a basic method in computational Systems Biology, just as tools like BLAST (Altschul et al, 1990) have become for scientists working with sequence data.

Introduction and comparison of different similarity measures

In the Appendix, we present a number of model similarity measures and evaluate their use for unsupervised clustering. Furthermore, we demonstrate how a similarity search can be used for tiling the yeast metabolic network with kinetic models.

33 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Learning the parts of objects by non-negative matrix factorization.

Authors: D D Lee; H S Seung
Journal: Nature Date: 1999-10-21 Impact factor: 49.962

3. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways.

Authors: Masahiro Hattori; Yasushi Okuno; Susumu Goto; Minoru Kanehisa
Journal: J Am Chem Soc Date: 2003-10-01 Impact factor: 15.419

4. Minimum information requested in the annotation of biochemical models (MIRIAM).

Authors: Nicolas Le Novère; Andrew Finney; Michael Hucka; Upinder S Bhalla; Fabien Campagne; Julio Collado-Vides; Edmund J Crampin; Matt Halstead; Edda Klipp; Pedro Mendes; Poul Nielsen; Herbert Sauro; Bruce Shapiro; Jacky L Snoep; Hugh D Spence; Barry L Wanner
Journal: Nat Biotechnol Date: 2005-12 Impact factor: 54.908

5. Biological network querying techniques: analysis and comparison.

Authors: Valeria Fionda; Luigi Palopoli
Journal: J Comput Biol Date: 2011-03-21 Impact factor: 1.479

6. Ranked retrieval of Computational Biology models.

Authors: Ron Henkel; Lukas Endler; Andre Peters; Nicolas Le Novère; Dagmar Waltemath
Journal: BMC Bioinformatics Date: 2010-08-11 Impact factor: 3.169

7. A graphical method for reducing and relating models in systems biology.

Authors: Steven Gay; Sylvain Soliman; François Fages
Journal: Bioinformatics Date: 2010-09-15 Impact factor: 6.937

8. Structured digital tables on the Semantic Web: toward a structured digital literature.

Authors: Kei-Hoi Cheung; Matthias Samwald; Raymond K Auerbach; Mark B Gerstein
Journal: Mol Syst Biol Date: 2010-08-24 Impact factor: 11.429

9. Pathguide: a pathway resource list.

Authors: Gary D Bader; Michael P Cary; Chris Sander
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. Motifs, themes and thematic maps of an integrated Saccharomyces cerevisiae interaction network.

Authors: Lan V Zhang; Oliver D King; Sharyl L Wong; Debra S Goldberg; Amy H Y Tong; Guillaume Lesage; Brenda Andrews; Howard Bussey; Charles Boone; Frederick P Roth
Journal: J Biol Date: 2005-06-01
