Carolina Perez-Iratxeta, Matthias Wjst, Peer Bork, Miguel A Andrade.
Abstract
BACKGROUND: Human inherited diseases can be associated by genetic linkage with one or more genomic regions. The availability of the complete sequence of the human genome allows examining those locations for an associated gene. We previously developed an algorithm to prioritize genes on a chromosomal region according to their possible relation to an inherited disease using a combination of data mining on biomedical databases and gene sequence analysis.
Year: 2005 PMID: 16115313 PMCID: PMC1208881 DOI: 10.1186/1471-2156-6-45
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Figure 1. The G2D algorithm. The cylinders represent public databases. MEDLINE contains references to scientific literature annotated at the National Library of Medicine with terms from the MeSH ontology. For each disease being studied we take the MeSH C terms ('Diseases Category') from the publications associated with it in OMIM [3] as its keywords. For each gene we take the Gene Ontology (GO) terms [8] associated with its product in the RefSeq protein database [34] as its keywords. MEDLINE does not contain enough clinical literature to allow us to directly relate every symptom, represented by a MeSH C term, to every gene feature, represented by a GO term. Taking into account that genes relate to phenotypes by means of molecules, we can increase the robustness of the gene/phenotype relations using an intermediate association step through the MeSH D category of 'Chemicals & Drugs' (top). Accordingly, we first compute associations between MeSH C terms ('Diseases') and MeSH D terms ('Chemicals & Drugs') by their co-annotation on the same record, more specifically looking for dependences of MeSH D terms on MeSH C terms. For example, we would deduce a relation between "Alzheimer's disease" (MeSH C) and "Amyloid protein" (MeSH D) if the presence of the C term in a MEDLINE entry always implies the presence of the D term. Records in the RefSeq database contain annotations from GO that describe the protein function, and will often include a link to MEDLINE, mostly dealing with the experimental characterization of the protein. We use these links to relate MeSH D terms from the MEDLINE reference to GO terms from the sequence, again looking for GO term dependence on a MeSH D term. In this case we could deduce an association between the MeSH D term "Amyloid Protein" and the GO term "Amyloid Protein". Finally, we combine both sets of relations to obtain associations between MeSH C terms and GO terms (for example, the relation of Alzheimer's disease to the amyloid protein).
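The two-step association described above (MeSH C to MeSH D, then MeSH D to GO) can be illustrated with a small sketch. It is not the G2D implementation: the dependence measure used here (the fraction of records carrying a source term that also carry the target term) and the combination rule (the best product over intermediate MeSH D terms) are assumptions chosen for illustration, and the records are toy data. The function names `dependence` and `compose` are hypothetical.

```python
from collections import defaultdict


def dependence(records):
    """Estimate how strongly each target term depends on each source term.

    records: iterable of (source_terms, target_terms) pairs of sets, one pair
    per annotated record. Returns {(source, target): fraction of the records
    containing `source` that also contain `target`}. A value of 1.0 means the
    source term always implies the target term in this data.
    (Assumed measure; the source does not give the exact formula.)
    """
    source_count = defaultdict(int)
    pair_count = defaultdict(int)
    for sources, targets in records:
        for s in sources:
            source_count[s] += 1
            for t in targets:
                pair_count[(s, t)] += 1
    return {(s, t): n / source_count[s] for (s, t), n in pair_count.items()}


def compose(c_to_d, d_to_go):
    """Combine C->D and D->GO relations into C->GO relations.

    Assumption: the strength of a C->GO relation is the best product of
    strengths over all intermediate MeSH D terms.
    """
    by_d = defaultdict(list)
    for (d, go), weight in d_to_go.items():
        by_d[d].append((go, weight))
    combined = defaultdict(float)
    for (c, d), w1 in c_to_d.items():
        for go, w2 in by_d.get(d, []):
            combined[(c, go)] = max(combined[(c, go)], w1 * w2)
    return dict(combined)


# Toy MEDLINE records: (MeSH C terms, MeSH D terms) annotated on one record.
medline = [
    ({"Alzheimer Disease"}, {"Amyloid"}),
    ({"Alzheimer Disease"}, {"Amyloid", "Tau"}),
    ({"Asthma"}, {"Histamine"}),
]
# Toy RefSeq-to-MEDLINE links: (MeSH D terms of the reference, GO terms of the protein).
refseq_links = [
    ({"Amyloid"}, {"amyloid protein"}),
    ({"Tau"}, {"microtubule binding"}),
]

c_to_d = dependence(medline)
d_to_go = dependence(refseq_links)
c_to_go = compose(c_to_d, d_to_go)
```

In this toy data every "Alzheimer Disease" record also carries "Amyloid", so the disease term ends up fully associated with the GO term "amyloid protein", while the association with "microtubule binding" is weaker because "Tau" appears in only one of the two disease records.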
To evaluate the genes associated with a particular disease we follow two directions. First, we deduce the gene functions (GO terms) related to the disease using the associations from phenotypes (MeSH C terms) describing the disease. For this, we collect the MeSH C terms found in the MEDLINE references from its corresponding OMIM entry (left), score all GO terms according to their relation to the terms in the MeSH C list (top), and finally, score all the proteins in RefSeq with the average of scores of their GO terms (right). For example, the analysis of late-onset familial Alzheimer disease (LOFAD) [9] would start by characterizing the disease with the MeSH C term "Alzheimer's Disease" among others. This would point to a series of GO terms including "Amyloid Protein" as a likely related function. One of the most related sequences in RefSeq (according to its GO annotations) would be the human amyloid beta A4 precursor protein-binding, which is annotated with the GO-term "amyloid protein". The other component of the analysis is a BLAST homology search [35] of the human genome region where the disease is mapped against the sequences stored in the RefSeq database (bottom). All hits in the region (red block) below a cut-off of E-value of 10e-10 are registered and sorted according to the score of the RefSeq protein they hit. Following our example, the analysis of the region where the LOFAD was mapped would show a gene similar to the human amyloid beta A4 precursor protein-binding annotated with the GO-term "amyloid protein": the APBA3 gene, which interacts with the Alzheimer's beta-amyloid precursor protein [12]. The analysis of LOFAD is extensively described in the Results section. Further details of the method are given in [2] and in the G2D web site.
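The scoring and ranking steps described above can be sketched as follows. Only two elements come from the caption: a protein is scored with the average of the scores of its GO terms, and hits below the E-value cut-off are sorted by the score of the RefSeq protein they hit. Everything else is an assumption for illustration: a GO term is scored here by its strongest association to any MeSH C term of the disease, the protein identifiers and coordinates are made up, and the cut-off is passed as a parameter rather than fixed, since the sketch does not try to interpret the value written in the caption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Hit:
    """One homology hit of the genomic region against a RefSeq protein."""
    region_start: int
    region_end: int
    protein_id: str
    evalue: float


def score_go_terms(disease_terms, c_to_go):
    """Score each GO term by its strongest association to any of the
    disease's MeSH C terms (assumed rule, not taken from the source)."""
    scores = {}
    for (c, go), weight in c_to_go.items():
        if c in disease_terms:
            scores[go] = max(scores.get(go, 0.0), weight)
    return scores


def score_protein(go_terms, go_scores):
    """Average of the scores of the protein's GO terms, as in the caption.
    GO terms with no association to the disease count as 0."""
    if not go_terms:
        return 0.0
    return mean(go_scores.get(go, 0.0) for go in go_terms)


def rank_hits(hits, protein_scores, max_evalue):
    """Keep hits with an E-value below the cut-off and sort them by the
    score of the protein they hit, best first."""
    kept = [h for h in hits if h.evalue < max_evalue]
    return sorted(
        kept,
        key=lambda h: protein_scores.get(h.protein_id, 0.0),
        reverse=True,
    )


# Toy MeSH C -> GO associations (hypothetical values).
c_to_go = {
    ("Alzheimer Disease", "amyloid protein"): 1.0,
    ("Alzheimer Disease", "microtubule binding"): 0.5,
}
# Toy RefSeq annotations; the identifiers are placeholders, not real accessions.
proteins = {
    "P_A": ["amyloid protein"],
    "P_B": ["microtubule binding", "transport"],
    "P_C": ["transport"],
}
hits = [
    Hit(100, 400, "P_B", 1e-30),
    Hit(900, 1300, "P_A", 1e-40),
    Hit(2000, 2200, "P_A", 1e-3),  # too weak, dropped by the cut-off
]

go_scores = score_go_terms({"Alzheimer Disease"}, c_to_go)
protein_scores = {pid: score_protein(gos, go_scores) for pid, gos in proteins.items()}
ranked = rank_hits(hits, protein_scores, max_evalue=1e-10)  # example cut-off
```

With this toy input the strong hit to `P_A` ranks first, the hit to `P_B` second, and the weak hit to `P_A` is discarded before ranking.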
Figure 2. Example of analysis of a monogenic disease. (a) The data defining the phenotype of the disease (in this case the OMIM identifier of an equivalent disease) and the region where it was mapped are given in the COMBO box. (b) The results window displays the MeSH C terms derived from the links to MEDLINE found in the OMIM entry, and the resulting scores for the GO terms. The green arrows allow navigating the MeSH C/MeSH D/GO network of connections back and forth. (c) Further down in the results window, the list of candidates displays the position of the BLASTx hits [35] in the chromosomal region (dark green bar over the light green bar) and of the hits in the matching protein sequence (dark red bars over the light red bar). Each hit in the genome is linked to the UCSC Genome Browser ("U" link). (d) The UCSC Genome Browser allows examining the known or predicted genes that overlap the match, and links to related databases and resources.