MOTIVATION: Large-scale phenotyping projects such as the Sanger Mouse Genetics project are ongoing efforts to help identify the influences of genes and their modification on phenotypes. Gene-phenotype relations are crucial to the improvement of our understanding of human heritable diseases as well as the development of drugs. However, given that there are ∼: 20 000 genes in higher vertebrate genomes and the experimental verification of gene-phenotype relations requires a lot of resources, methods are needed that determine good candidates for testing. RESULTS: In this study, we applied an association rule mining approach to the identification of promising secondary phenotype candidates. The predictions rely on a large gene-phenotype annotation set that is used to find occurrence patterns of phenotypes. Applying an association rule mining approach, we could identify 1967 secondary phenotype hypotheses that cover 244 genes and 136 phenotypes. Using two automated and one manual evaluation strategies, we demonstrate that the secondary phenotype candidates possess biological relevance to the genes they are predicted for. From the results we conclude that the predicted secondary phenotypes constitute good candidates to be experimentally tested and confirmed. AVAILABILITY: The secondary phenotype candidates can be browsed through at http://www.sanger.ac.uk/resources/databases/phenodigm/gene/secondaryphenotype/list. CONTACT: ao5@sanger.ac.uk or ds5@sanger.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: Large-scale phenotyping projects such as the Sanger Mouse Genetics project are ongoing efforts to help identify the influences of genes and their modification on phenotypes. Gene-phenotype relations are crucial to the improvement of our understanding of human heritable diseases as well as the development of drugs. However, given that there are ∼: 20 000 genes in higher vertebrate genomes and the experimental verification of gene-phenotype relations requires a lot of resources, methods are needed that determine good candidates for testing. RESULTS: In this study, we applied an association rule mining approach to the identification of promising secondary phenotype candidates. The predictions rely on a large gene-phenotype annotation set that is used to find occurrence patterns of phenotypes. Applying an association rule mining approach, we could identify 1967 secondary phenotype hypotheses that cover 244 genes and 136 phenotypes. Using two automated and one manual evaluation strategies, we demonstrate that the secondary phenotype candidates possess biological relevance to the genes they are predicted for. From the results we conclude that the predicted secondary phenotypes constitute good candidates to be experimentally tested and confirmed. AVAILABILITY: The secondary phenotype candidates can be browsed through at http://www.sanger.ac.uk/resources/databases/phenodigm/gene/secondaryphenotype/list. CONTACT: ao5@sanger.ac.uk or ds5@sanger.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
A causative gene has not yet been identified for almost half of the existing human
heritable diseases (Schofield ). Without the knowledge of the molecular basis of a disease,
treatment possibilities are limited to treating symptoms instead of curing the underlying
defects. In order to be able to find cures and prevention mechanisms for human genetic
disorders, we need to comprehensively understand how each disease originates and progresses
over time. A collection of human diseases together with confirmed and speculative causes is
available from resources such as the Online Mendelian Inheritance in Man (OMIM) (Amberger ) or Orphanet
(Aymé, 2003) database.In the quest for identifying causative genes for humangenetic disorders, model organisms
have gained increasing importance due to the opportunities arising from targeted gene
modifications. For example, the mouse shares 99% of genes with humans, and gene
modifications leading to phenotypes characteristic for a disease may offer clues to the
origins of this disease (Rosenthal and Brown,
2007). Experimental results of mutagenesis experiments are stored in
species-specific Model Organism Database (MOD)s (Leonelli and Ankeny, 2012), e.g. the Sanger Mouse Genetics Project (Sanger-MGP)
(White ),
WormBase (Yook ),
the Mouse Genome Database (MGD) (Bult ) or FlyBase (Drysdale
and FlyBase Consortium, 2008).The Sanger-MGP is part of the International Mouse Phenotyping Consortium (IMPC) project
that aims to identify the phenotypic implications of 20 000 genes by 2021 (Brown and Moore, 2012). In the framework of this
project genetically modified mouse models are assessed according to 20 pre-defined standard
operating procedures (SOPs) that are linked to measurable physical parameters to ascertain
the implications of genetic mutations on phenotypes (Mallon ). An example of a SOP is the assessment of
the grip strength of mice at the age of 9 weeks to assess their neuromuscular function as
muscle strength (https://www.mousephenotype.org/impress/protocol/83/7). Mammalian Phenotype
Ontology (MP) (Smith and Eppig, 2009)
annotations are assigned using a reference range method followed by an expert review. Later
studies explore the application of other statistical methods to assign MP phenotype
annotations based on the obtained parameter readings (Beck ; Karp
), however, these methods only cover a subset of
the phenotypes covered by the 20 SOPs.The process of assessing physical measurements in accordance with the 20 pre-defined SOPs
is referred to as primary phenotyping (Justice, 2008). According to the European Mouse Disease Clinic
(EUMODIC) web page (http://www.eumodic.org/) “A
distributed network of centres with in depth expertise in a number of phenotyping domains
will undertake more complex, secondary phenotyping screens and apply them to a subset of
the mice which have shown interesting phenotypes in the primary screen.”.
However, with the increasing amount of genes being assessed in the primary phenotyping
screen, a manual investigation for interesting results from the primary screens becomes
impossible. In addition, the manual assessment of experimental results is time consuming,
expensive and requires trained biologists. Therefore, automated methods enabling the search
for promising secondary phenotypes are needed to complement the results obtained from the
primary screens.Existing automated solutions include the prediction of phenotypes based on orthologous
genes (Groth ;
McGary ) as
well as functional annotations of genes (King
). However, orthologous genes do not necessarily
exhibit the same function or expression patterns across different species and therefore, do
not always provide reliable answers. A solution relying only on existing phenotype
annotations could overcome the problem in differing gene function and resulting phenotypes
across different species. To the best of our knowledge, the prediction of secondary
phenotypes from primary screen annotations in combination with literature-curated phenotypes
in mouse has not been addressed before.Here, we present an association rule mining approach that enables the identification of
potential secondary phenotype screens in mouse using data from MGD, and complementing the
primary phenotype screens in Sanger-MGP. Applying association rule mining, we were able to
discover 188 rules, covering 242 phenotypes and leading to 1967 predictions for secondary
phenotypes for 244 genes contained in the Sanger-MGP database. These 1967 suggested
gene–phenotype associations include 136 unique phenotypes for which new assays can be
defined for. The predicted associations are neither contained in MGD nor Sanger-MGP. We
automatically as well as manually evaluated the secondary phenotype predictions and can
demonstrate that our results show viable candidates. In conclusion, we believe that novel
biological hypotheses and secondary phenotype screens can be formulated from the predicted
secondary phenotypes.
2 METHODS
Figure 1 illustrates the overall workflow of
this study. The following subsections describe the prediction approach and the utilized
datasets in detail.
Fig. 1.
Overall workflow
of the study. After determining-related phenotypes, the primary phenotype annotations
assigned to genes in Sanger-MGP are enriched with potentially related phenotypes. The
additional, predicted secondary phenotypes are evaluated in several
steps
Overall workflow
of the study. After determining-related phenotypes, the primary phenotype annotations
assigned to genes in Sanger-MGP are enriched with potentially related phenotypes. The
additional, predicted secondary phenotypes are evaluated in several
steps
2.1 Prediction of secondary phenotype candidates for mouse genes
To determine candidates for secondary phenotyping, we first analysed MGD’s
phenotype annotations for mouse mutants. We hypothesized that phenotypes that
significantly co-occur with each other more often than expected by chance, given the
overall amount of phenotype annotations, constitute good candidates for the secondary
phenotype experiments. For example, it is known that body weight correlates with bone
density or grip strength and changes in body weight often lead to changes in the
correlated phenotypes (Karp , b; Valdar ). Using a
large dataset of phenotype annotations, we can determine pairs of phenotypes that may be
biologically linked.Association rule mining was originally used to find patterns of items that are frequently
purchased together in one transaction in a supermarket. Each association rule assigns a
probability to an implication based on the dataset, e.g. how likely is it that someone who
bought bread and milk also purchased butter. In the Bioinformatics domain, association
rule mining has been previously successfully applied to large annotation sets with the aim
to find relationships between gene functions described with Gene Ontology (GO) (Botstein ; Kumar ; Manda ). As the
goal of determining significantly co-occurring concepts to define relationships is the
same here, association rule mining can also be applied. We used the apriori (http://www.borgelt.net/doc/apriori/apriori.html) software implementation
(Agrawal ;
Borgelt, 2003) with the following parameter
settings: with
-tr to enforce the output of association rules instead of item sets,
-s-6 to obtain only rules that are supported by at least six item sets,
-m2 to include only rules with a minimum of two items,
-n2 to include only rules with a maximum of two items,
-c90 to only allow rules with a confidence of 90%,
-ep to provide P-values for each rule and -v
“%e” to add the P-value separated by space
to each of the rules. The input to the apriori software was the set of literature-curated
phenotype annotations of mouse genes and the output of rules of the type
phenotype_1 → phenotype_2. As a starting point, we limited the
output to rules including only two phenotypes to avoid complex dependencies between the
annotations. However, in future work we aim to extend the approach to address more complex
dependencies between phenotypes.-tr -s-6 -m2 -n2 -c90 -ep -v
‘‘%e’’The two parameters that are used to narrow down the associations’ rules to obtain
meaningful, biologically related phenotypes, are support and confidence. We set the
support for association rules to six which means that a minimum of six genes have to be
annotated with both phenotype_1 and phenotype_2. The
confidence corresponds to the ratio of genes being annotated with
phenotype_1 as well as phenotype_2 over the genes that
are only annotated with phenotype_1. In our case, at least 90% of
the genes annotated with phenotype_1 must have also
phenotype_2 as annotation in order for this rule to be reported.
Changing either parameter may lead to the report of different association rules in the
output. We considered this to be conservative settings for an initial study, and the
determination of the ideal settings for both parameters is subject to future work.Rules are returned together with their corresponding P-value to enable
potential further filtering and user confidence, e.g.MP:0004725 <- MP:0009448 0MP:0005606 <- MP:0009448
3.9905e-212MP:0005606 <- MP:0009557
4.22726e-182MP:0000245 <- MP:0011171
2.64518e-125All extracted rules are then sorted according to the phenotypes including them, i.e. one
phenotype may potentially be associated with more than one secondary phenotype candidate.
As shown by the rules given before as an example, a decreased platelet ATP
level phenotype (MP:0009448) would be associated with an increased
bleeding time phenotypes (MP:0005606) and a decreased platelet
serotonin level (MP:0004725). Following this procedure will lead to a list of
mapped phenotypes including the phenotypes from the high-throughput assessment in the
primary phenotype screening defined in the SOPs. Therefore, the mapped phenotypes are then
filtered to exclude the phenotypes covered by the Sanger-MGP SOPs. Because of this
filtering, we obtain only predictions for secondary phenotypes that have not been included
in the primary screens.For all the genes contained in Sanger-MGP, we then generated a list of secondary
phenotype annotation predictions by going through all the existing phenotype annotations
for a gene and adding those phenotypes that have been mapped based on co-occurrence. Then
we removed all gene–phenotype associations that have been identified already and are
contained in either MGD or Sanger-MGP to only generate potentially novel links between
genes and phenotypes. We refer to the remaining phenotype annotations as predicted
secondary phenotypes.We chose the MGD phenotype annotations for gene knockouts as basis for our predictions
and downloaded the report file on July 20, 2013. The downloaded file comprised 126 522 MP
annotations for 9447 genes, covering 7393 unique MP concepts. The Sanger-MGP covers 20
SOPs that correspond to 367 MP annotations. Deducting the 367 MP that are covered by the
SOPs from the unique number of MP concepts in MGD, provides the target phenotype
annotation space. This means that 7027 unique phenotype concepts can be potentially
associated with any of the 725 genes that had been assessed by the primary screens in the
Sanger-MGP at the time this study was conducted. All the phenotype annotation datasets
were applied without conducting a taxonomic closure on the annotations. However, once the
secondary phenotypes have been predicted, corresponding assays would have to be determined
to test the generated phenotype hypotheses.
2.2 Evaluation of secondary phenotype predictions
We evaluated the secondary phenotype predictions automatically as well as manually. The
automated evaluation was realized by applying the secondary phenotype candidates in two
use cases for phenotype annotations: the clustering of genes according to phenotypes
leading to clusters of gene function, and the prediction of disease gene candidates by
comparing disease phenotypes with phenotypes that have been determined to be affected by a
gene mutation. In both use cases, we applied first phenotype annotations determined during
the primary screens, and after that a combination of the primary screen annotations
together with the predictions for secondary phenotypes. We assume that if the performance
improves when adding the secondary phenotype predictions, the secondary phenotypes possess
biological validity. In addition, we manually investigated five diseases further where the
predictability of at least one known causative gene improved. More information about the
evaluation of the secondary phenotypes is provided in the following subsections.
2.2.1 Automated evaluation based on gene function
In previous studies, it has been demonstrated that phenotype annotations can be used to
determine biologically meaningful clusters with respect to gene function and protein
interactions (van Driel ). Oti et al. extended the method to validate the content
of three human phenotype databases with respect to consistency and completeness. We
assume that the secondary phenotype predictions once added to the annotations assigned
in the primary screens improve consistency and completeness of the phenotype data.
Therefore, we applied the method introduced by Oti et al. relying on
the biological coherence of gene clusters built on phenotype similarity. The biological
coherence is calculated based on the overlap of GO annotations among all the genes
falling into one cluster.To assess the biological coherence without and with the predicted secondary phenotypes,
we generated gene clusters based on the primary phenotypes solely, as well as clusters
based on the primary and secondary phenotype data in conjunction. Before the actual
clustering step, we performed a taxonomic closure based on MP, which means that all
superclasses for each assigned phenotype annotation were added to a gene’s
phenotype annotation set. Clusters were formed based on the phenotype similarities, and
the similarity between pairs of genes based on their phenotype annotations using a
Jaccard coefficient (the ratio of shared phenotypes over the unique set of phenotypes
assigned to both the genes). Genes were clustered with respect to their phenotype
similarity using average linkage clustering. Clusters were determined by applying the
Dynamic Treecut package in R (Langfelder
) to the obtained dendrogram. We set the
parameters of the Dynamic Treecut package to require a minimum of two genes falling into
one cluster. For each of the determined cluster, the biological coherence was calculated
with
where C(i,j) is the term overlap
between gene i and gene j, and n is
the number of genes in this cluster. The overall biological coherence score for all
clusters is obtained by averaging the individual scores for the clusters:
where m is the number of clusters formed for a particular dataset.In compliance with the method described in (Oti
), the datasets are not directly compared,
instead they are compared to randomized datasets to correct for gene annotation biases.
For this purpose, we randomized each of the two phenotype annotation sets—primary
and secondary—maintaining the number and uniqueness of phenotype annotations per
gene. We randomized the original set of annotations 1000 times leading to 2002 phenotype
annotation sets in total. We increased the number of randomizations from 30 to 1000 to
compensate for the fact that Sanger-MGP contains a number of genes that are poorly
described in terms of gene function and may generate a high variation of coherence score
ratios otherwise. For each phenotype annotation set, the ratio of the overall biological
coherence C of the respective original phenotype annotation
set (either primary or secondary) over the randomized data is calculated. If this ratio
is >1, the biological coherence of the original dataset is greater than the
randomization data; vice versa for scores in the range [0,1], the
biological coherence for the randomized data exceeds the coherence of the respective
original dataset. The ratio scores are then summarized in box plots and the difference
between all the ratios of both datasets is calculated with a non-parametric, two-sided
Wilcoxon rank-sum test implemented in R. Figure
2 illustrates this evaluation step.
Fig.
2.
Illustration of the calculation of biological coherence
scores to evaluate secondary phenotype predictions. Boxes that possess the same
background colour are based on the same analysis scripts, only the input data
differ (either randomized or original data). Black boxes symbolize the ratio of
the biological coherence original versus randomized data which are used as input
for the box plots depicted in Figure
3
Illustration of the calculation of biological coherence
scores to evaluate secondary phenotype predictions. Boxes that possess the same
background colour are based on the same analysis scripts, only the input data
differ (either randomized or original data). Black boxes symbolize the ratio of
the biological coherence original versus randomized data which are used as input
for the box plots depicted in Figure
3
Fig. 3.
Adding the predicted secondary
phenotype annotation to the Sanger-MGP genes with reference range annotations and
using these to create gene clusters based on phenotype similarity, improves the
biological coherence of the obtained gene clusters
We assessed gene cluster coherence based on functional annotations of mouse genes. For
this purpose, we downloaded the GO annotations of mouse genes from the MGD database on
July 12, 2013 (ftp://ftp.informatics.jax.org/pub/reports/gene_association.mgi). The
dataset comprised annotations for 25 499 MGD marker accession identifiers with 13 551
unique GO concepts, with an average of 11.64 GO annotations per gene. Using the original
Sanger-MGP dataset with primary annotations only, we obtain 33 clusters for the 480
genes investigated that are then assessed for the biological coherence based on their
gene function annotations.
2.2.2 Automated evaluation based on disease gene candidate predictions
Among other tools for disease gene candidate prediction, PhenoDigm uses phenotype
annotations to predict gene candidates underlying a disease (Smedley ). Disease gene
candidates are predicted based on the primary phenotype annotations assigned to mouse
and zebrafish models and their phenotypic similarity to humangenetic disorders
described in OMIM (http://www.human-phenotype-ontology.org/contao/index.php/downloads.html).
The better the overlap of phenotypes between a model and a disease, the higher the
corresponding knockout gene is ranked for this disease.To assess the performance of a ranking algorithm, commonly Receiver Operating
Characteristic (ROC) curves are used that are calculated based on a benchmark dataset.
In our case, we used known gene–disease associations to assess the value of the
predictions. If the secondary phenotypes add value to the predictions, then a
performance increase should be visible from the Area Under Curve (AUC) of the two ROC
curves (one for the predictions based on primary phenotype data, and one for the
predictions based on primary and secondary phenotype data). To test whether the increase
in the AUC is significant, we used a two-sided test for ROC curves available online
(http://vassarstats.net/roc_comp.html; Hanley and McNeil, 1982).Using PhenoDigm as an automated evaluation algorithm of the secondary phenotype
predictions required a benchmark set of known gene–disease associations. If the
secondary phenotype annotations improve the phenotypic overlap of genes and diseases,
the ROC curves used for the evaluation should show an improvement. To generate the ROC
curves, we used the gene–disease associations contained in OMIM’s MorbidMap
file (http://omim.org/downloads), which
was downloaded on July 20, 2013. This dataset comprised 3781 gene–disease
associations, including 2530 genes and 3158 diseases.
2.2.3 Manual evaluation
To evaluate some of the secondary phenotype predictions, we manually investigated some
of the cases were the predictability of known disease genes improved when adding the
predicted secondary phenotypes. We chose five gene–disease associations where the
gene improved with respect to its rank for the disease and looked based on which
annotations the match between disease and gene could be made. The information concerning
the matched phenotype annotations of a disease and a gene is contained in Supplementary Material S1.
2.3 Implementation of PhenoDigm extension to provide secondary phenotype predictions
online
To enable access to the predicted secondary phenotypes, we implemented an extension to
our online tool PhenoDigm (Smedley ) that predicts causative genes for human heritable
disorders. The extension is, as well as the original tool, implemented using the Play!
Framework (http://www.playframework.com/)
(version 1.2.5), jQuery (http://jquery.com/) (version 1.6.4) and jQuery UI (http://jqueryui.com/) (version 1.9.1). The
secondary phenotype predictions were imported into PhenoDigm’s underlying MySQL
database (http://www.mysql.com/) by extending the
database schema. However, secondary phenotype predictions are not incorporated into
PhenoDigm’s disease gene candidate predictions available from the web page, unless
experimentally confirmed and integrated into one of the phenotype annotation
databases.
3 RESULTS
Applying association rule mining, we were able to identify 188 rules (provided in Supplementary Material S1) that lead to secondary phenotype hypotheses for 244
Sanger-MGP genes. In total, we could predict 1967 novel gene–phenotype associations
containing 136 unique phenotypes that are not contained in MGD. Out of these 1967
gene–phenotype relationships, 47 were covered by the taxonomy of the ontology, i.e.
the predicted phenotypes were ancestor concepts of annotations already used for one
particular gene. The 136 unique phenotype concepts span 23 of MGD’s 30 top level
phenotypes, such as tumorigenesis (MP:0002006), nervous system
phenotype (MP:0003631) or muscle phenotype (MP:0005369), showing
the diversity of phenotypes that could be added to the annotation of genes. The 136
phenotypes also cover different hierarchy levels in the ontology spanning from the second to
the 11th level, with the highest group falling into level 6 (all measured as shortest
distance from the root node of the MP ontology). For example, adenohypophysis
hypoplasia (MP:0008365) as well as abnormal cranium size
(MP:0010031) are suggested as secondary phenotypes. In general, the deeper an ontology term
is the more specific is the concept it is representing. This means that the predictions not
only span a variety of different high level phenotypes but also add detailed information to
the genes they are associated with which allows for a better characterization of individual
genes. Supplementary Material S1 provides all the predicted secondary phenotype
annotations together with additional information such as the high level phenotype, term name
and frequency of occurrence in the prediction dataset.
3.1 Predicted secondary phenotypes significantly improve the biological coherence of
gene clusters
In recent studies, phenotypes have successfully been applied to determine disease gene
candidates and gene function (Smedley ; van Driel ). In order to asses the validity and quality of the
predicted secondary phenotype annotations, we assessed the biological coherence of gene
clusters, built based on phenotype similarity between genes. Applying the method described
by Oti et al. first to the primary phenotype annotations only, and then
to both primary and predicted secondary phenotype annotations, shows that the biological
coherence of clusters increases when adding the predicted secondary phenotype annotations.
The obtained results are depicted in Figure 3.Adding the predicted secondary
phenotype annotation to the Sanger-MGP genes with reference range annotations and
using these to create gene clusters based on phenotype similarity, improves the
biological coherence of the obtained gene clustersIn addition to calculating the biological coherence for both datasets, we determined the
significance of the fold-increase of the coherence of the clusters. Using a two-sided
Wilcoxon signed-rank test (as implemented in R, α = 0.05),
we obtained a P-value of 2.2 × 10−16,
indicating a significant improvement when adding secondary phenotype annotations to the
previously confirmed in the primary phenotype scans. These results suggest that the
predicted secondary phenotypes possess biological validity but will have to be
experimentally verified in secondary phenotype screens.
3.2 Secondary phenotypes significantly improve the predictability of disease gene
candidates
In addition to assessing the biological coherence of gene clusters, we also verified the
predicted secondary phenotype annotations by applying them in a second application use
case: the prediction of disease gene candidates. One tool that already uses the primary
phenotype data of genes to predict disease gene candidates is PhenoDigm (Smedley ). To
assess whether the secondary phenotypes possess biological validity, we first used only
the primary annotations to predict disease gene candidates and then added the secondary
phenotype annotations. For both predictions, we calculated the ROC curves, using
gene–disease associations contained in MGD as benchmark dataset. Both the obtained
ROC curves are depicted in Figure 4. Applying
a one-tailed Student’s t-test (α =
0.05) to the ROC curves, we obtain a P-value of P
= 0.02.
Fig. 4.
Accumulating the
predicted secondary phenotypes together with reference range annotations for
Sanger-MGP genes improves the predictability of causative disease genes using
PhenoDigm
Accumulating the
predicted secondary phenotypes together with reference range annotations for
Sanger-MGP genes improves the predictability of causative disease genes using
PhenoDigm
3.3 The Coq9 mouse improves as a model for primary coenzyme Q10 deficiency 5
To further assess the value added by the predicted secondary phenotypes, we manually
assessed diseases that show rank improvements for known causative genes. We determined the
number and particular phenotypes that could be matched between models and diseases, where
causative genes improved in the ranking as disease candidates. In the best case, the gene
Coq9 (MGI:1915164) that has been recognized as a being disrupted in
cases of Primary coenzyme Q10 defiency 5 (COQ10D5; MIM:#614654) improves
from rank 181 to rank 2 based on the secondary phenotype annotations. Using the
annotations assigned in the primary screens, only one pair of matching phenotypes can be
determined: Hyperreflexia (HP:0001347) and hyperactivity
(MP:0001399). Applying in addition the predicted secondary phenotypes, other signs and
symptoms of this disease, such as Postnatal microcephaly (HP:0005484) and
Left ventricular hypertrophy (HP:0001712), are detected.Another example for gene rank improvement is that the Cfh (MGI:88385)
gene was ranked in second place after adding the predicted secondary phenotypes (rank 43
when using only primary phenotypes) for Complement factor H deficiency
(MIM:#609814). Including the predicted secondary phenotypes allows for a coverage of the
following additional phenotypes: Thickening of the glomerular basement
membrane (HP:0004722), Progressive renal insufficiency
(HP:0000106) and Hematuria (HP:0000790). Interestingly, the
Cfh gene is not only associated with Complement factor H
deficiency but also with Atypical hemolytic uremic syndrome 1
(MIM:#235400), and adding the secondary phenotype information, the gene also obtained an
improved rank for this disease (rank 177 with primary phenotypes only and rank 41 with
inserting secondary phenotype information). From the improvement in both cases, we
conjecture that the secondary phenotypes cover correct functional aspects of the gene that
could not have been identified with the primary phenotype screens.This information together with additional information for another three diseases and
their respective genes is provided in Supplementary Material S1.
3.4 Browsing the secondary phenotype predictions online
To provide access to the secondary phenotype predictions, we implemented an extension to
our disease gene candidate prediction tool PhenoDigm (Smedley ). The results can be
browsed at http://www.sanger.ac.uk/resources/databases/phenodigm/. The web interface
provides all the genes that possess secondary phenotype candidates as a list and the user
can select individual genes for further investigation. Upon selecting a gene, the user is
provided with all the details available for this gene, i.e. diseases the gene has been
confirmed to be a cause for, phenotype annotations from the primary screens, the
literature-curated annotations from MGD and the suggested phenotypes for secondary
screens. Providing this information together, a biologist or clinician could easily assess
whether the secondary phenotype candidates are worthwhile to be tested in a biological
experiment. Figure 5 provides a snapshot of
the available gene-centric information through PhenoDigm’s web interface.
Fig. 5.
An extension of PhenoDigm’s web
interface holds the secondary phenotype predictions
An extension of PhenoDigm’s web
interface holds the secondary phenotype predictions
4 DISCUSSION
In this study, we applied an association rule mining approach to mine secondary phenotypes
and computationally verify their biological validity using two automated and a limited
manual evaluation. We used the manually assigned phenotype annotations contained in MGD for
learning phenotype co-occurrence patterns and merged the identified patterns with phenotypes
that were experimentally verified in primary screens and are available from the Sanger-MGP
(White ).Using data from particular resources creates dependencies towards those data resources. For
example, annotation guidelines such as those employed by MGD ensure a consistency of human
annotators but can also create artefacts in the predictions generated from the data. In our
particular case, we may find phenotype co-occurrence patterns that are intrinsic to
annotation guidelines and not purely due to their co-occurrence. This may occur when the
annotation guidelines cover rules that enforce a set of phenotypes to be annotated in
particular circumstances instead of only one. However, as these guidelines exist to ensure
biological correctness of the data, we expect that those cases still constitute biologically
interesting, though known connections between individual phenotypes.The implementation of the secondary phenotype prediction pipeline relies solely on an
association rule mining approach. Using the pipeline in conjunction with 7027 unique
phenotypes (see Section 2.1), the obtained result of 188 new rules seems comparatively
small. As the number of rules is directly related to the settings for the apriori software,
the number of potential hypotheses may be increased by changing these parameters. However,
and as with any prediction tool, we applied conservative measures that would reduce the
likelihood of creating false hypotheses. In addition, starting with a small subset of rules
enables better verification possibilities and selection mechanisms for biological experiment
design. Potential areas of extension are the incorporation of additional pattern recognition
methods that could then be used to form a support system and provide provenance for
identified patterns, e.g. only predictions that are made by a number of systems are more
likely to be secondary phenotypes.Using phenotype patterns to generate secondary phenotype predictions implies that
phenotypes that co-occur often with each other, are likely to always co-occur. For some
biological phenomena this assumption has been validated, e.g. the correlation of body weight
with bone density or blood calcium levels (Karp
). Given that the secondary phenotypes perform well
in the evaluation, we feel that the assumption can still be used for forming secondary
phenotype hypotheses. However, in future work we envisage a more complex filtering strategy
for assigning phenotype annotations using not only primary phenotypes, but also gene
function annotations and disease involvement to reduce the number of falsely associated
genes and phenotypes in addition to voting from different prediction algorithms.
4.1 Predicted secondary phenotypes improve the biological coherence of the
clusters
Using automated evaluation procedures that apply predictions in biological use cases may
mask-specific problems that can only be spotted by human curators, e.g. if it is known
that one particular gene does not cause a particular phenotype in particular
circumstances. However, the methods can provide a summarized judgement over all the
results instead of providing details for all cases. As demonstrated by van Driel
et al., clustering genes according to phenotypes leads to clusters
consistent with gene function and protein interaction networks. The same evaluation
mechanism that has been applied here, had been successfully applied to assess the quality
of existing human phenome databases (Oti ). Using gene function annotation to determine biological
coherence is directly influenced by the number of annotations available.
4.2 Secondary phenotypes improve the predictability of causative disease
genes
In addition to the use case of gene characterization, phenotype annotations are applied
to identify the underlying mechanisms of human heritable diseases (Hoehndorf ; Robinson ; Smedley ; Washington ). As the primary
screens only cover 20 SOPs, the outcome of phenotype annotations is limited by the
screens. Using the predicted secondary phenotype annotations may highlight phenotypes of
genes that are currently limiting the predictability of certain diseases or groups of
diseases. Our results show that adding the secondary phenotype annotations improve the
predictability of disease genes and the characterization of genes on a phenotype level can
be improved with the suggested phenotypes. However, experimental verification is necessary
and assays would have to be incorporated to test for the 137 identified phenotypes.
4.3 Secondary phenotypes further characterize genes assessed with primary
phenotyping
As illustrated with a small subset of selected diseases, adding the predicted secondary
phenotypes leads to an increase of matched phenotypes between diseases and models. As
discussed before, these results indicate that the predictions indeed possess biological
validity and constitute good biological hypotheses to guide the design of experimental
setups for secondary screens. The selection of diseases, however, was limited to the cases
where an improvement for the causative gene happened. In future work, we will also have to
extend our analysis to genes where no improvement or rank decrease was experienced.
However, we note here that the rank changes of known causative disease genes are only
indicators for performance changes and need manual investigation. Some of the OMIM
diseases may possess multiple causative genes, some of which have not yet been discovered
or listed in OMIM. As a consequence, those genes will be recognized as false positives
during the evaluation. If one of the causative genes that have not been listed in OMIM
improves tremendously over those genes that are listed, we could obtain a rank decrease
for genes that are listed in OMIM.
4.4 Future development of the PhenoDigm extension
The data have been made available through a web interface that provides a gene-centric
view on the data. All the predictions can be assessed and are provided in the context of
diseases and information about the primary phenotype screens for easy verification and
hypothesis derivation. As more data become available, e.g. through additional automated,
statistical screens such as suggested by Karp et al., further information
can be included such as effect size and additional phenotype annotations.Furthermore, even though the annotations have been validated using the predicted
secondary phenotype annotations in PhenoDigm’s disease prediction algorithm, the
disease gene candidate predictions based on the secondary phenotype data are not yet
available. A possible extension of the web interface in future work could be the inclusion
of these predictions. If the predictions are included, possible new emerging disease
groups relevant for a gene could be easily spotted from the list and guide new
experiments.
5 CONCLUSION
Here, we presented a method to predict secondary phenotype candidates based on existing
large-scale phenotype annotation sources and primary screens for genes. We verified the
secondary phenotype candidates by applying it in two use cases and could demonstrate that
the predictions add value in either use case and, therefore, seem biologically relevant to
the genes they are predicted for. We could show that the phenotype candidates not only
increase the biological coherence of gene clusters, but also improve the candidate
prediction of genes for human heritable diseases. In conclusion, we provide a set of
gene–phenotype associations that can be further assessed in biological experiments and
guide the experimental design to further investigate specific genes or gene groups. All the
data are freely available online from http://www.sanger.ac.uk/resources/databases/phenodigm/.In future work, we aim to further improve the method by determining the best parameter
settings for the association rule learning, but also investigate other phenotype
co-occurrence pattern recognition methods. One possibility is the application of a
hypergeometric distribution and find support for patterns that have been identified with the
association rule mining approach. We further intend to provide update results through the
web interface and improve the integration with other existing resources.
Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock Journal: Nat Genet Date: 2000-05 Impact factor: 38.330
Authors: Natasha A Karp; Anne Segonds-Pichon; Anna-Karin B Gerdin; Ramiro Ramírez-Solis; Jacqueline K White Journal: Lab Anim Date: 2012-07 Impact factor: 2.471
Authors: William Valdar; Leah C Solberg; Dominique Gauguier; William O Cookson; J Nicholas P Rawlins; Richard Mott; Jonathan Flint Journal: Genetics Date: 2006-08-03 Impact factor: 4.562
Authors: Oliver D King; Jeffrey C Lee; Aimée M Dudley; Daniel M Janse; George M Church; Frederick P Roth Journal: Bioinformatics Date: 2003 Impact factor: 6.937
Authors: Peter N Robinson; Sebastian Köhler; Anika Oellrich; Kai Wang; Christopher J Mungall; Suzanna E Lewis; Nicole Washington; Sebastian Bauer; Dominik Seelow; Peter Krawitz; Christian Gilissen; Melissa Haendel; Damian Smedley Journal: Genome Res Date: 2013-10-25 Impact factor: 9.043
Authors: Ann-Marie Mallon; Vivek Iyer; David Melvin; Hugh Morgan; Helen Parkinson; Steve D M Brown; Paul Flicek; William C Skarnes Journal: Mamm Genome Date: 2012-09-19 Impact factor: 2.957