Computational Methods for Identifying Similar Diseases.

Liang Cheng1, Hengqiang Zhao1, Pingping Wang2, Wenyang Zhou2, Meng Luo2, Tianxin Li2, Junwei Han3, Shulin Liu4, Qinghua Jiang5.   

Abstract

Although our knowledge of human diseases has increased dramatically, the molecular basis, phenotypic traits, and therapeutic targets of most diseases still remain unclear. An increasing number of studies have observed that similar diseases often are caused by similar molecules, can be diagnosed by similar markers or phenotypes, or can be cured by similar drugs. Thus, the identification of diseases similar to known ones has attracted considerable attention worldwide. To this end, the associations between diseases at the molecular, phenotypic, and taxonomic levels were used to measure the pairwise similarity in diseases. The corresponding performance assessment strategies for these methods involving the terms "category-based," "simulated-patient-based," and "benchmark-data-based" were thus further emphasized. Then, frequently used methods were evaluated using a benchmark-data-based strategy. To facilitate the assessment of disease similarity scores, researchers have designed dozens of tools that implement these methods for calculating disease similarity. Currently, disease similarity has been advantageous in predicting noncoding RNA (ncRNA) function and therapeutic drugs for diseases. In this article, we review disease similarity methods, evaluation strategies, tools, and their applications in the biomedical community. We further evaluate the performance of these methods and discuss the current limitations and future trends for calculating disease similarity.
Copyright © 2019 The Author(s). Published by Elsevier Inc. All rights reserved.

Keywords:  disease similarity; molecular basis; ncRNA function; phenotypic traits; therapeutic drugs

Year:  2019        PMID: 31678735      PMCID: PMC6838934          DOI: 10.1016/j.omtn.2019.09.019

Source DB:  PubMed          Journal:  Mol Ther Nucleic Acids        ISSN: 2162-2531            Impact factor:   8.886


Introduction

Human disease is one of the permanent aspects of the human condition, similar to birth, aging, and death, from a philosophical point of view. The search for a novel understanding of disease never stops. Despite the great success of biotechnology, the molecular basis of and therapeutic agents for most diseases remain unclear. Current studies have observed that similar diseases are often caused by similar molecules,1, 2, 3 can be diagnosed by similar markers or phenotypes,4, 5, 6 and can be cured by similar drugs.7, 8, 9, 10, 11 Based on this, novel functional molecules for a disease could, in theory, be revealed using prior knowledge of similar diseases.12, 13, 14, 15, 16, 17, 18 Thus, research on identifying the similarity between diseases has attracted increasing attention. A pair of diseases with a high similarity score can be defined as similar diseases.

To measure disease similarity, prior knowledge of diseases plays a crucial role. The symptoms and signs accompanying diseases, also called phenotypes, are the intuitive characteristics of a disease. As early as 2002, Freudenberg and Propping used phenotypes sourced from the Online Mendelian Inheritance in Man (OMIM) website to calculate the similarity of OMIM diseases. With an ever-increasing number of phenotypes being observed by the biomedical community, abundant algorithms have been developed for measuring disease similarity at a phenotypic level.

Many studies have shown that alterations of molecules can lead to the occurrence of diseases. Thus, the exploration of a common molecular basis is another way to measure disease similarity. With the development of next-generation sequencing technologies, a vast number of protein-coding genes (PCGs) and noncoding RNA (ncRNA) genes associated with diseases have been identified. For example, hemophilia A is an X-linked recessive bleeding disorder caused by a deficiency in the activity of coagulation factor VIII (F8), which can be affected by variations in the F8 gene. MicroRNA (miRNA)-155 is an endogenous ncRNA that regulates several mRNAs to cause B cell lymphomas. Based on the molecular basis of diseases, a large number of methods27, 28, 29, 30, 31, 32, 33 have been designed for calculating disease similarity, using this as a metric.

Recently, disease taxonomy has begun to play an important role in measuring disease similarity. One of the typical taxonomic classifiers for diseases is Disease Ontology (DO). In DO, each disease term represents a disease that may have several names, and two terms can be linked by an inclusion relationship. For example, “Alzheimer’s disease” can be linked to “tauopathy.” All of the disease terms and the set of inclusion relationships form the disease hierarchy and directed acyclic graph (DAG) of DO (Figure 1), where a node represents a disease term, and an edge represents an inclusion relationship between two terms. The common ancestors of two disease terms in the DAG have often been utilized to calculate the similarity of the two terms.
Figure 1

Sub-graph of the DO Hierarchy for Alzheimer’s Disease

Arrows represent an “IS_A” relationship for DO. For example, “Alzheimer’s disease” is linked to “Dementia” by an “IS_A” relationship. All of the terms that can be linked by “IS_A” relationships in the graph from “Alzheimer’s disease” are the ancestors of “Alzheimer’s disease.” All of the terms that can link to “Disease” by “IS_A” relationships are the descendants of “Disease.”

Currently, dozens of methods have been designed for calculating disease similarity based on prior disease knowledge at the phenotypic, molecular, and hierarchical levels. In this article, we review the main topics of investigation in disease similarity, including the selection of proper data, the design and implementation of methods, the evaluation of a method’s performance, and the application of existing methods for predicting molecular factors of diseases.

Data Sources

Three types of data sources, including disease vocabularies, disease annotations, and gene functional annotations, are widely utilized for calculating disease similarity (Table 1). Here, we list and introduce these main data sources.
Table 1

Summary of Data Sources

Category and Name | Creation Date | Initiator | PMID

Disease Vocabulary
OMIM | 1960s | McKusick36 | 17357067
MeSH | 1960s | Winifred Sewell38 | 14119288
UMLS | 1980s | Olivier Bodenreider41 | 14681409
SNOMED CT | 2001 | Wang et al.46 | 11825284
DO | 2003 | Schriml et al.34 | 22080554
MEDIC | 2012 | Davis et al.39 | 22434833

Disease Annotations
GeneRIF | 2007 | | 17990498
CTD | 2003 | | 27651457
GAD | 2004 | Becker et al.48 | 15118671
miR2Disease | 2008 | Jiang et al.54 | 18927107
HPO | 2008 | Robinson et al.5 | 18950739
SpliceDisease | 2011 | | 22139928
lncRNADisease | 2012 | | 23175614
HMDD v2.0 | 2013 | | 24194601
SIDD | 2013 | Cheng et al.62 | 24146757
OAHG | 2016 | Cheng et al.61 | 27703231

Gene Functional Annotations
GOA | 2003 | Camon et al.63 | 12654719
HumanNet | 2011 | Lee et al.66 | 21536720

Disease Vocabularies

Disease vocabularies document disease terms for distinguishing between different diseases. Each disease term in a vocabulary contains a unique identifier, preferred disease name, synonyms, abbreviations, and the definition of a disease. Some of these vocabularies also provide a hierarchy of disease terms based on a set of inclusion relationships.

OMIM

OMIM is a comprehensive, authoritative compendium of genetic diseases that is freely available and updated daily. It was initiated in the early 1960s by Dr. Victor A. McKusick and has been developed for online usage by the NCBI since 1985.

MeSH

The Medical Subject Headings (MeSH) vocabulary provides hierarchically organized terminology for indexing and cataloging biomedical information for PubMed. MeSH divides all biomedical terms into 16 categories, of which C and F03 contain disease names, comprising more than 4,600 disease terms. In addition to the terms in these categories, MeSH also contains supplementary term records, which document thousands of disease terms.

MEDIC

The “merged disease vocabulary” (MEDIC) was established by the Comparative Toxicogenomics Database (CTD) biocurators and is composed of more than 10,000 unique diseases. To take advantage of the familiarity and immediate genetic data offered by OMIM terms, as well as the navigation utility and PubMed indexing feature of MeSH terms, MEDIC integrates OMIM terms with MeSH terms and hierarchical relationships.

UMLS

The Unified Medical Language System (UMLS) is a repository of biomedical vocabularies developed by the U.S. National Library of Medicine (NLM). The UMLS integrates over 2 million names for some 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations between these concepts. Vocabularies integrated in the UMLS Metathesaurus include MeSH, OMIM, Gene Ontology (GO), and so forth.

DO

The Disease Ontology (DO) database was developed to create a single structure for the classification of diseases that unifies the representation of disease between varied vocabularies into a relational ontology. DO terms can be linked in a hierarchy by a type of semantic association called an “IS_A” relationship (Figure 1). The initial builds of DO in 2003 and 2004 used the International Classification of Diseases (ICD-9) as the foundational vocabulary. Recent revisions have improved this by reorganizing DO based on UMLS disease terms in conjunction with term mappings to Systematized Nomenclature of Medicine--Clinical Terms (SNOMED CT) and ICD-9. The current version of DO is organized into eight main classes, which represent cellular proliferation, mental health, anatomical entity, infectious agent, and so on.

Disease Annotations

The molecular basis and phenotypic characterization of a disease are two main aspects of prior knowledge often used for measuring disease similarity. Resources collecting these sources of prior knowledge are called disease annotations.

Disease Annotations of PCGs

Disease-related PCGs are mainly documented in the OMIM, Gene Reference into Function (GeneRIF), Genetic Association Database (GAD), SpliceDisease, and CTD databases. OMIM was intended for use primarily by physicians and other professionals concerned with genetic disorders. GeneRIF provides functional annotations of genes from the NCBI and allows scientists to add a short functional summary of NCBI genes that is limited to 425 characters. The GAD emphasizes genetic association data from complex diseases and disorders. SpliceDisease provides detailed descriptions of the relationships between gene variations, splicing defects, and diseases. The CTD documents the interactions between chemicals and gene products, as well as their relationships to diseases. The relationships between genes and diseases in the CTD often come in the form of information about RNA splicing, SNPs, and so on.

Disease Annotations of miRNAs

miRNAs are a class of endogenous single-stranded small ncRNAs that play a crucial role in various human diseases by negatively regulating the expression of PCGs.50, 51, 52, 53 Two manually curated data sources of disease-miRNA relationships are miR2Disease and the Human miRNA Disease Database (HMDD) v2.0. Both resources document miRNA deregulation in various human diseases.

Disease Annotations of lncRNAs

Long ncRNAs (lncRNAs) are mRNA-like transcripts that are longer than 200 nt and have little or no protein-coding capacity. According to the theory of competing endogenous RNA (ceRNA), they can affect the expression of PCGs by competitively binding with miRNAs. Thus, it becomes important to understand the role of lncRNAs in diseases. The LncRNADisease database contains a manually curated set of relationships between lncRNAs and diseases.

Disease Annotations of Phenotypes

Phenotypes are documented in the Clinical Synopsis section of the textual descriptions of each OMIM disease. Robinson et al. extracted all of the phenotypes from this text and constructed a human phenotype ontology (HPO) to annotate human diseases.

Integrated Resources of Disease Annotations

In previous efforts, we developed two integrated resources for disease annotations. An integrated resource for annotating human genes with multi-level ontologies (OAHG) focuses on the disease annotations of PCGs, miRNAs, and lncRNAs, and a semantically integrated database towards a global view of human disease (SIDD) documents disease-related molecular, phenotypic, and environmental features. The data sources integrated by OAHG include OMIM, HMDD, and LncRNADisease. SIDD integrates up to 18 different data sources, including OMIM, GAD, CTD, LncRNADisease, and HPO.

Gene Functional Annotations

Similar molecular foundations of diseases may be influenced not only by common genes but also by different genes with common functions. Recently, associations between genes from gene functional annotation resources have been introduced for calculating disease similarity. Here, we list resources for the identification of gene functional annotations.

GOA

Disease-related PCGs can possess similar molecular functions (MFs), and may be involved in similar biological processes (BPs). This type of functional association of genes often exposes the similarity of different diseases. The GO annotation (GOA) of PCGs provides assignments of MF and BP terms of GO to gene products, in a project run by the European Bioinformatics Institute (EBI).

HumanNet

In addition to the GOA of PCGs, functional relationships between disease-related genes can also be reflected by protein-protein interactions, mRNA co-expression, and so forth. By integrating all of these data, HumanNet provides a more comprehensive score for pairwise PCG relationships.

Disease Similarity Measures

The similarity between diseases can be reflected by their common phenotypic characteristics, molecular basis, and hierarchical structure. Therefore, we have classified the disease similarity methods into phenotype-based, molecule-based, hierarchy-based, and hybrid methods (Table 2).
Table 2

Summary of Disease Similarity Methods

Author(s) | Molecule Based | Phenotype Based | Hierarchy Based | Vocabulary | PMID (or Reference Number) | Year
Freudenberg and Propping21 | | ✓ | | OMIM | 12385992 | 2002
van Driel et al.67 | | ✓ | | OMIM | 16493445 | 2006
Köhler et al.68 | | ✓ | | OMIM | 19800049 | 2009
Zhang et al.69 | | ✓ | | OMIM | 20659468 | 2010
Zhou et al.72 | | ✓ | | MeSH | 24967666 | 2014
Chen et al.73 | | ✓ | | UMLS | 25277758 | 2015
Hoehndorf et al.119 | | ✓ | | DO | 26051359 | 2015
Deng et al.120 | | ✓ | | OMIM | 25664462 | 2015
Mabotuwana et al.92 | | | ✓ | SNOMED CT | 23850839 | 2013
Mathur et al.99 | ✓ | | | DO | 21347137 | 2010
Suthram et al.78 | ✓ | | | UMLS | 20140234 | 2010
Gottlieb et al.8 | ✓ | | | UMLS | 21654673 | 2011
Hamaneh and Yu82 | ✓ | | | OMIM/MeSH | 25360770 | 2014
Kim et al.83 | ✓ | | | PharmGKB | 26212477 | 2015
Wang et al.35 | | | ✓ | DO/MeSH | 17344234 | 2007
Resnik27 | ✓ | | ✓ | DO | 27 | 1995
Lin126 | ✓ | | ✓ | DO | 28 | 1998
Schlicker et al.98 | ✓ | | ✓ | | 16776819 | 2006
Mathur et al. | ✓ | | ✓ | DO | 22166490 | 2012
Cheng et al.91 | ✓ | | ✓ | DO | 24932637 | 2014

Phenotype-Based Methods

Figure 2 shows the schematic process of phenotype-based methods. First, qualitative associations between phenotypes and diseases are extracted from phenotype data sources. Then, each pair of qualitative associations is quantified as a disease-phenotype score or phenotype-phenotype score. Finally, these scores are utilized for calculating disease similarity.
Figure 2

Schematic of the Process of Phenotype-Based Methods


Freudenberg’s Method

Freudenberg and Propping manually annotated OMIM diseases according to their phenotypic appearance, using the indices “periodicity,” “etiology,” “tissue,” “age of onset,” and “mode of inheritance.” The index “periodicity” is a Boolean variable, indicating an episodic occurrence of a disease in contrast to a linear progression. The index “etiology” is based on clinical signs and laboratory or pathological findings related to a disease. The index “tissue” is compiled as the anatomic location of the phenotype. The index “mode of inheritance” indicates whether a disease is inherited in an autosomal-dominant, autosomal-recessive, X-linked, mitochondrial, or complex manner. The index “age of onset” refers to the age of a patient when symptoms are generally first noticed. Then, the similarity of diseases $d_1$ and $d_2$ is defined as the following:

$$sim(d_1, d_2) = \sum_{i} w_i \cdot sim(d_1.index_i, d_2.index_i)$$

where $w_i$ represents the contribution of a single index to the total similarity score, and $sim(d_1.index_i, d_2.index_i)$ indicates the similarity between the $i$th indexes of $d_1$ and $d_2$.

van Driel’s Method

van Driel et al. calculated the similarity between over 5,000 diseases based on phenotypic features of OMIM records. For each OMIM disease, its phenotypic descriptions were extracted from the “TX” and “CS” fields. Then, the OMIM diseases and phenotypic descriptions were mapped to the anatomy (category A) and the disease (category C) sections of MeSH to establish disease-term associations. Each disease-term association was then defined as a vector with three features (Equations 2, 3, and 4), where $t$ and $d$ represent a phenotype term and a disease, respectively. In Equations 2 and 4, $counted(t,d)$ is the number of occurrences of $t$ in the OMIM records of $d$. In Equation 3, $N$ is the total number of records analyzed, and $n$ is the number of records that contain the term $t$. In Equation 4, $descendant(t)$ is the number of descendant terms of $t$ in the hierarchy of MeSH, and $descendant(t,d)$ is the number of those descendant terms in the OMIM records of $d$. The similarity between diseases $d_1$ and $d_2$ is then defined by Equation 5, where $t_{1i}$ and $t_{2i}$ are the $i$th term vectors of $d_1$ and $d_2$, respectively, and $m$ is the total number of phenotypic terms.

Phenotypic terms of the “CS” field of OMIM records were also manually extracted to construct an HPO by Freudenberg. Then, the similarity of pairwise phenotypic terms was calculated based on Resnik’s method, where $a$ is an ancestor of phenotypes $p_1$ and $p_2$, $N$ is the total number of genes associated with the phenotypes, and $n(a)$ is the number of genes associated with $a$. The similarity of a pair of diseases $d_1$ and $d_2$ is then defined from these term similarities, where $n$ and $m$ represent the number of phenotypes associated with $d_1$ and $d_2$, respectively.

Zhang’s Method

Zhang et al. extracted phenotypic terms from the “TX” and “CS” fields of OMIM’s disease records using a MetaMap transfer tool. As a result, each disease could be represented as a set of phenotypes. Then the weights of phenotypic terms for diseases were calculated based on a term frequency-inverse document frequency (TF-IDF) weighting scheme. Subsequently, each disease was represented as a weighted vector of these phenotypic terms. Finally, the similarity of a pair of diseases was defined as the cosine of their corresponding phenotypic vectors.
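
The TF-IDF weighting and cosine comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy phenotype lists, and the specific TF-IDF variant (raw count times the natural log of the inverse document frequency) are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(disease_terms):
    """Map each disease to a sparse TF-IDF vector over its phenotype terms.

    disease_terms: dict of disease name -> list of phenotype terms
    (repeats allowed). Assumed variant: tf = raw count, idf = ln(N / df).
    """
    n_diseases = len(disease_terms)
    doc_freq = Counter()
    for terms in disease_terms.values():
        doc_freq.update(set(terms))  # count each term once per disease
    vectors = {}
    for disease, terms in disease_terms.items():
        counts = Counter(terms)
        vectors[disease] = {
            term: count * math.log(n_diseases / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy input: phenotype terms per disease (not real annotations).
toy_terms = {
    "disease_A": ["seizure", "ataxia", "ataxia"],
    "disease_B": ["seizure", "ataxia"],
    "disease_C": ["rash"],
}
toy_vectors = tfidf_vectors(toy_terms)
similarity_ab = cosine(toy_vectors["disease_A"], toy_vectors["disease_B"])
```

Sparse dictionaries are used instead of dense arrays because most diseases are annotated with only a small fraction of all phenotype terms.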

Zhou’s Method

Zhou et al. defined a disease as a set of symptoms, which were extracted from PubMed. Each disease was described as a weighted vector of phenotypic terms, with the weights calculated by a TF-IDF weighting scheme. The similarity of a pair of diseases was then defined as the cosine of their vectors.

Chen’s Method

Chen et al. extracted the disease-phenotype relationships from the UMLS file MRREL.RRF, where disease-phenotype relationships were documented based on OMIM, Ultrasound Structured Attribute Reporting, and Minimal Standard Digestive Endoscopy Terminology. This group then used the information content (IC) to weight each phenotype concept as follows:

$$IC(p) = -\log\frac{n_p}{N}$$

where $N$ is the total number of diseases, and $n_p$ is the number of diseases associated with a phenotype $p$. Then they modeled the phenotype similarity of a pair of diseases by the cosine of their feature vectors.

Molecule-Based Methods

The schematic process of molecule-based methods is analogous to that of the phenotype-based methods described above. Here, genes are the main disease-related molecules. Phenotype-based methods rely on the semantic associations between phenotypes. In comparison, genes can be associated in more ways, such as protein-protein interactions (PPIs), co-expression, and so forth.

Mathur’s Method

SwissProt documents proteins that have been manually annotated with diseases, which were mapped to DO terms using MetaMap by Mathur and Dinakarpandian. Then, the similarity of diseases $d_1$ and $d_2$ was calculated based on their corresponding genes, where $G_1$ and $G_2$ are the gene sets of diseases $d_1$ and $d_2$, respectively, $|\cdot|$ is the number of terms in the specified set, and $N$ is the total number of genes.

Suthram’s Method

Suthram et al. compared diseases using an integrated analysis of disease-related mRNA expression data and the human protein interaction network. First, they identified conserved functional modules of genes using PathBLAST based on PPI data from the Human Protein Reference Database (HPRD). Next, they normalized the gene expression data in each microarray sample using a Z-score transformation and computed the activity level of each gene in a disease. Then, the module response score for each module in a disease was assigned to be the mean of the gene activity score of its component genes. Finally, they calculated the partial correlation coefficient between diseases based on the corresponding module response score and defined it as the disease similarity.

Gottlieb’s Method

Gottlieb et al. presented four algorithms for calculating disease similarity using the genetic signatures of diseases from gene expression experiments: signature-based, signature sequence-based, signature PPI-based, and signature GO-based methods. The signature-based method utilized a Jaccard index between every pair of disease signatures to calculate disease similarity as follows:

$$sim(d_1, d_2) = \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$$

where $G_1$ and $G_2$ are the signatures of diseases $d_1$ and $d_2$, respectively, and $|\cdot|$ is the number of terms in the specified set. The signature PPI-based method calculated the distances between each pair of disease signatures based on their corresponding proteins using an all-pairs shortest paths algorithm on the human PPI network. Distances were transformed into similarity values using a formula in which $P_1$ and $P_2$ are the corresponding proteins of diseases $d_1$ and $d_2$, respectively, $D(P_1, P_2)$ is the shortest path between these proteins in the PPI network, and $A$ is a parameter chosen to be 0.9 × e by Perlman et al. The signature sequence-based method calculated the Smith-Waterman sequence alignment score between disease signatures and then divided the score by the geometric mean of the scores from aligning each sequence against itself. In addition, the signature GO-based method calculated the similarity between each pair of disease signatures based on their corresponding GO terms.
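
As a small illustration of the signature-based score, the Jaccard index of two disease signatures can be computed directly from their gene sets. The function name and the placeholder gene identifiers below are hypothetical, and returning 0 for two empty signatures is an assumption made for this sketch rather than something stated in the text.

```python
def jaccard_similarity(signature_1, signature_2):
    """Jaccard index |G1 & G2| / |G1 | G2| of two gene signatures."""
    g1, g2 = set(signature_1), set(signature_2)
    union = g1 | g2
    if not union:
        # Assumed convention: two empty signatures are treated as dissimilar.
        return 0.0
    return len(g1 & g2) / len(union)


# Placeholder gene identifiers, not real disease signatures.
signature_d1 = {"gene_1", "gene_2", "gene_3"}
signature_d2 = {"gene_2", "gene_3", "gene_4"}
score = jaccard_similarity(signature_d1, signature_d2)
```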

Hamaneh’s Method

Hamaneh and Yu devised a network-based measure to calculate disease similarity. First, they assigned weights to all proteins by using information flow from a disease to the human PPI network and back. As a result, each disease was represented as a weighted vector whose dimension is the number of proteins in the network. Then, the similarity of two diseases was defined as the cosine of the angle between their corresponding vectors.

Kim’s Method

Kim et al. extracted disease-gene pairs and disease-drug pairs from the literature and used the frequencies of co-occurrence relationships as features to calculate disease similarity. In this work, disease names, gene symbols, and drug names were from the Pharmacogenomics Knowledgebase (PharmGKB). Let $G_1$ and $G_2$ be the genes that occurred in the same sentence as diseases $d_1$ and $d_2$, respectively, and let $D_1$ and $D_2$ be the drugs that occurred in the same sentence as diseases $d_1$ and $d_2$, respectively. The similarity of $d_1$ and $d_2$ is then defined from these co-occurrence sets, with $N$ and $M$ denoting the total number of genes and drugs, respectively.

Hierarchy-Based Methods

Hierarchy-based approaches are based only on the hierarchical structure of disease-related ontologies. Previous studies have presented multiple methods for calculating the similarity of ontology terms using shared paths and distances in hierarchical structures.85, 86, 87, 88, 89 However, currently only Wang’s method is widely utilized for calculating disease similarity.

Wang’s Method

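Assuming that $D$ is the set including $d$ and all of its ancestor terms in an ontology linked by “IS_A” relationships, the hierarchical contribution of a term $t \in D$ to $d$ is represented as follows:

$$S_d(t) = \begin{cases} 1 & \text{if } t = d \\ \max\{\, w \cdot S_d(t') \mid t' \in \text{children of } t \text{ in } D \,\} & \text{if } t \neq d \end{cases}$$

where $w$ is a hierarchical contribution factor for hierarchical association. According to Wang et al. and Cheng et al., $w$ is defined as 0.5 for an “IS_A” relationship of DO. Then, the summation of all of the hierarchical contributions of the terms in $D$ to $d$ is $SV(d)$, which is defined as follows:

$$SV(d) = \sum_{t \in D} S_d(t)$$

Assuming that $D_1$ and $D_2$ are the sets including $d_1$ and $d_2$, respectively, and all of their ancestor terms, the similarity between $d_1$ and $d_2$ is defined by Wang’s method as follows:

$$sim(d_1, d_2) = \frac{\sum_{t \in D_1 \cap D_2} \left( S_{d_1}(t) + S_{d_2}(t) \right)}{SV(d_1) + SV(d_2)}$$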
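
A compact sketch of Wang's method on a toy “IS_A” hierarchy may make the definitions concrete. It assumes the commonly used formulation in which a term contributes 1 to itself and each ancestor receives the largest value of w times the contribution of any of its children on a path to the term. The `parents` mapping, the term labels, and the function names are hypothetical and do not reflect the real DO structure.

```python
def s_values(term, parents, w=0.5):
    """Hierarchical contribution of `term` and each of its ancestors to `term`.

    parents: dict of term -> list of direct IS_A parents (a DAG).
    The term itself gets 1.0; an ancestor gets the maximum over its
    children (on a path to `term`) of w * child contribution.
    """
    contribution = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            candidate = w * contribution[child]
            if candidate > contribution.get(parent, 0.0):
                contribution[parent] = candidate
                frontier.append(parent)  # re-propagate the improved value
    return contribution


def wang_similarity(d1, d2, parents, w=0.5):
    """Shared contributions divided by the summed semantic values SV(d1) + SV(d2)."""
    s1 = s_values(d1, parents, w)
    s2 = s_values(d2, parents, w)
    shared = s1.keys() & s2.keys()
    numerator = sum(s1[t] + s2[t] for t in shared)
    denominator = sum(s1.values()) + sum(s2.values())
    return numerator / denominator


# Hypothetical toy hierarchy: d1 IS_A a, d2 IS_A a, a IS_A root.
toy_parents = {"d1": ["a"], "d2": ["a"], "a": ["root"]}
toy_similarity = wang_similarity("d1", "d2", toy_parents)
```

In this toy graph each disease has contributions 1, 0.5, and 0.25 for itself, `a`, and `root`, so the shared ancestors account for 1.5 of a total semantic value of 3.5.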

Mabotuwana et al.’s Method

Mabotuwana et al. defined the similarity of a pair of terms as inversely proportional to the distance between them, where the distance $d$ is the number of nodes in the shortest path between two diseases in the DAG of the ontology.

Hybrid Methods

Molecular and hierarchical associations between diseases have been combined as hybrid methods for calculating disease similarity. These methods often utilize disease-related genes to define the IC of diseases93, 94, 95 as follows:

$$IC(d) = -\log\frac{n_d}{N}$$

where $N$ denotes the total number of genes, and $n_d$ represents the number of genes of $d$. Here, disease-related genes are often based on OMIM, CTD, SIDD, OAHG, and so on.

Resnik’s Method

In 1995, Resnik presented a method for calculating the similarity between ontology terms. In 2002, this method was introduced for calculating the similarity between GO terms. In 2011, Li et al. utilized this method for calculating the similarity between DO terms. According to Resnik’s method, the similarity of a pair of diseases $d_1$ and $d_2$ equals the IC of the most informative common ancestor (MICA) of these two diseases as follows:

$$sim_{Resnik}(d_1, d_2) = IC(MICA(d_1, d_2))$$

Lin’s Method

Concerned that the similarity between ontology terms should also be decided by the IC of the two terms, Lin improved Resnik’s method in 1998. According to Lin’s method, the similarity of a pair of diseases $d_1$ and $d_2$ is reflected by both the MICA of the disease pair and the IC of each disease as follows:

$$sim_{Lin}(d_1, d_2) = \frac{2 \cdot IC(MICA(d_1, d_2))}{IC(d_1) + IC(d_2)}$$
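
The Resnik and Lin scores can be sketched together, since both rely on the IC of the most informative common ancestor. The sketch assumes IC(d) = -log(n/N) with natural logarithms, a single-rooted hierarchy, and gene counts that already include genes annotated to descendant terms; the `toy_parents` and `toy_gene_count` values and all function names are hypothetical.

```python
import math


def ancestors_with_self(term, parents):
    """Return `term` together with all of its ancestors via IS_A links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term, gene_count, total_genes):
    """IC(d) = -log(n_d / N); assumes every term has at least one gene."""
    return -math.log(gene_count[term] / total_genes)


def mica_ic(d1, d2, parents, gene_count, total_genes):
    """IC of the most informative common ancestor of d1 and d2."""
    common = ancestors_with_self(d1, parents) & ancestors_with_self(d2, parents)
    return max(information_content(t, gene_count, total_genes) for t in common)


def resnik_similarity(d1, d2, parents, gene_count, total_genes):
    return mica_ic(d1, d2, parents, gene_count, total_genes)


def lin_similarity(d1, d2, parents, gene_count, total_genes):
    denominator = (information_content(d1, gene_count, total_genes)
                   + information_content(d2, gene_count, total_genes))
    if denominator == 0.0:
        return 0.0  # assumed convention when both terms carry no information
    return 2 * mica_ic(d1, d2, parents, gene_count, total_genes) / denominator


# Hypothetical toy hierarchy and gene counts (N = 8 genes in total).
toy_parents = {"d1": ["a"], "d2": ["a"], "a": ["root"]}
toy_gene_count = {"root": 8, "a": 4, "d1": 2, "d2": 1}
resnik_score = resnik_similarity("d1", "d2", toy_parents, toy_gene_count, 8)
lin_score = lin_similarity("d1", "d2", toy_parents, toy_gene_count, 8)
```

Because Lin's score divides by the IC of the two terms themselves, it stays between 0 and 1, whereas Resnik's score is unbounded and grows with the specificity of the shared ancestor.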

Schlicker’s Method

Schlicker et al. improved Resnik’s method from the same perspective as Lin, defining disease similarity over $ancestors(d_1, d_2)$, the set of common ancestors of diseases $d_1$ and $d_2$.

In 2012, Mathur et al. designed a new method named PSB for calculating the similarity between DO terms. According to this method, the significance of related BP terms from GO should be computed for each disease using a hypergeometric test. Assuming that $d_1$ and $d_2$ can be associated with $m$ and $n$ BP terms, respectively, the similarity of $d_1$ and $d_2$ is defined from the similarity between pairs of BPs $p_1$ and $p_2$. Here, $IC_{GO}$ and $IC_{DO}$ represent the IC based on GO and DO, respectively, and $n(p_1 \cap p_2)$ and $n(p_1 \cup p_2)$ denote the number of common genes of $p_1$ and $p_2$ and the number of total genes of $p_1$ and $p_2$, respectively.

Cheng’s Method

In addition to related BPs, genes can be associated by PPI, co-expression, and so forth. Therefore, Cheng et al. presented the SemFunSim method to improve Mathur’s method by incorporating the gene functional network from HumanNet, which reflects the comprehensive gene associations from PPI, co-expression, BP, and so on. Let $G_1$ and $G_2$ represent the related gene sets of $d_1$ and $d_2$, respectively. The similarity between $d_1$ and $d_2$ by Cheng et al.’s method is then described in terms of $|G_{MICA}|$, the number of genes of the MICA of $d_1$ and $d_2$; $m$ and $n$, the number of genes in $G_1$ and $G_2$, respectively; and $Sim(g_1, g_2)$, the functional similarity score between genes $g_1$ and $g_2$ from HumanNet.

Performance Evaluation

The performance of a disease similarity method can be affected by the quality of the prior knowledge it is based on. Most of the methods utilize manually curated datasets, which are highly reliable. Some of the methods mentioned here use data extracted from the literature using text-mining tools. Data obtained in an unsupervised way should always be evaluated. In Mathur’s method, disease-related genes were mined from the literature using MetaMap. The recall and precision were calculated based on a benchmark dataset from Monttaz et al., which contained 200 records that were manually annotated by experts. The similar disease pairs identified by a method should then be evaluated to measure its performance. Three types of classical evaluation strategies are introduced here (Figure 3).
Figure 3

Schematic of the Process of Performance Evaluation

(A) Performance evaluation of a simulated patient-based method. (B) Performance evaluation of a term-category-based method. (C) Performance evaluation of a benchmark-data-based method.


Simulated-Patient-Based Strategy

Given the difficulty of obtaining phenotypic information about a large number of patients, Sebastian et al. presented a simulated-patient-based method to evaluate their phenotype-based disease similarity method. They used 44 complex dysmorphology syndromes for which adequate phenotype frequency data were available and generated 100 virtual patients for each disease on the basis of the frequency of phenotypes among persons diagnosed with that disease. For example, to generate patients with phenotypes A and B, in which A occurs in 40% and B occurs in 60% of patients, a random number generator was utilized to generate two random numbers uniformly distributed between 0 and 100. Subsequently, the similarity of each simulated patient to each of the OMIM diseases was calculated and then ranked. The average rank over all of the patients was returned to assess the performance of the original method.
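
The simulation can be sketched as below. This is an illustrative reading of the strategy, not the published procedure: it assumes that a phenotype is assigned to a virtual patient when its random draw falls below the phenotype frequency, and it uses a simple set-overlap score as a stand-in for the similarity method under evaluation. All names and the toy profiles are hypothetical.

```python
import random


def simulate_patient(phenotype_freq, rng):
    """Draw one virtual patient from phenotype frequencies given in percent.

    Assumption: a phenotype is present when a uniform draw in [0, 1)
    is below its frequency divided by 100.
    """
    return {p for p, freq in phenotype_freq.items() if rng.random() < freq / 100}


def set_overlap(patient, disease_phenotypes):
    """Stand-in similarity (Jaccard index); replace with the method under test."""
    union = patient | disease_phenotypes
    return len(patient & disease_phenotypes) / len(union) if union else 0.0


def average_rank(disease, profiles, similarity, n_patients=100, seed=0):
    """Mean rank of the true disease across simulated patients (1 = best)."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_patients):
        patient = simulate_patient(profiles[disease], rng)
        scores = {name: similarity(patient, set(freqs))
                  for name, freqs in profiles.items()}
        ordered = sorted(scores, key=scores.get, reverse=True)
        ranks.append(ordered.index(disease) + 1)
    return sum(ranks) / len(ranks)


# Hypothetical phenotype frequency profiles (percent of diagnosed patients).
toy_profiles = {
    "disease_X": {"A": 100, "B": 100},
    "disease_Y": {"C": 100},
}
rank_x = average_rank("disease_X", toy_profiles, set_overlap)
```

An average rank close to 1 means that the similarity method usually places the disease that generated the patient at the top of the list.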

Term-Category-Based Strategy

Sun et al. utilized information on disease-related molecules to design a disease similarity measurement method. Their results were evaluated using the disease classification terminologies found in the ICD-9. Their assumption was that two similar diseases should belong to the same categories in the ICD-9. Therefore, the correlation between the similarity of diseases and their classifications can reflect the performance of this method. Since similarity scores are not normally distributed, they used a nonparametric test, the Mann-Whitney U test, to assess the statistical significance of the disease similarity.

Benchmark-Data-Based Strategy

In a previous study, Cheng et al. constructed a benchmark set containing 70 pairs of similar diseases, which were manually integrated from two datasets. One dataset was adapted from the literature (Suthram et al.). The other dataset was curated by medical residents. Here, we evaluated the performance of Wang’s, Resnik’s, and Lin’s methods, PSB, and SemFunSim using these benchmark data. First, the disease pairs of the benchmark dataset were treated as the positive group, and 10-fold more disease pairs were randomly generated as a negative group. Next, the similarity of the disease pairs in these two groups was calculated with each of the listed methods. Then, the area under the receiver operating characteristic (ROC) curve (AUC) was obtained. This process was iterated 100 times using a different negative group each time, and the average AUC reflects the respective performance of these methods. Figure 4A shows the ROC curve for one of the 100 iterations using disease-related genes from GeneRIF, while Figure 4B shows the average AUC of the 100 iterations using disease-related genes from GeneRIF. The average AUCs for Resnik’s, Lin’s, and Wang’s methods, PSB, and SemFunSim were 0.6484, 0.6791, 0.6978, 0.7759, and 0.9008, respectively. Figures 4C and 4D show the results using disease-related genes from SIDD. The average AUCs for Resnik’s, Lin’s, and Wang’s methods, PSB, and SemFunSim were 0.6209, 0.6351, 0.6849, 0.8843, and 0.9849, respectively.
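The evaluation loop described above can be sketched as follows. This is an illustrative sketch only: the function names and the toy similarity function are hypothetical, and excluding benchmark pairs from the negative candidates is an assumption of this sketch rather than something stated in the text. The AUC is computed as the fraction of positive-negative pairs in which the positive pair scores higher, with ties counted as one half.

```python
import random
from itertools import combinations


def auc(pos_scores, neg_scores):
    """Probability that a positive pair outscores a negative pair."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def average_auc(similarity, positives, diseases,
                fold=10, iterations=100, seed=0):
    """Average AUC over repeated draws of a random negative group."""
    rng = random.Random(seed)
    pos_set = {frozenset(p) for p in positives}
    # Assumption: negatives are drawn from pairs outside the benchmark.
    candidates = [pair for pair in combinations(sorted(diseases), 2)
                  if frozenset(pair) not in pos_set]
    pos_scores = [similarity(a, b) for a, b in positives]
    n_neg = min(len(candidates), fold * len(positives))
    aucs = []
    for _ in range(iterations):
        negatives = rng.sample(candidates, n_neg)
        neg_scores = [similarity(a, b) for a, b in negatives]
        aucs.append(auc(pos_scores, neg_scores))
    return sum(aucs) / len(aucs)


def toy_similarity(a, b):
    """Toy measure: diseases whose names share a first letter are similar."""
    return 1.0 if a[0] == b[0] else 0.0


toy_diseases = ["a1", "a2", "b1", "b2"]
toy_positives = [("a1", "a2"), ("b1", "b2")]
toy_result = average_auc(toy_similarity, toy_positives, toy_diseases)
```

With a real method, `similarity` would be replaced by the measure under evaluation and `positives` by the benchmark pairs.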
Figure 4

Performance Evaluation Using a Benchmark-Data-Based Strategy

(A) ROC curve for one of the 100 iterations using disease-related genes from GeneRIF. (B) The average AUC from 100 iterations using disease-related genes from GeneRIF. (C) ROC curve for one of the 100 iterations using disease-related genes from SIDD. (D) The average AUC from 100 iterations using disease-related genes from SIDD.

The performance of these methods is subject to the prior knowledge they use. Wang’s method uses only the structure of the ontology; therefore, its performance is limited by the comprehensiveness of the ontology. Although Resnik’s and Lin’s methods incorporate both the structure of the ontology and ontology annotations, they do not utilize all the “IS_A” relationships of the ontology. Thus, the performance of these three methods is relatively poor. In comparison with Resnik’s and Lin’s methods, PSB introduced GOA for associating disease-related genes, and its performance improved substantially. Since disease-related genes can also be associated in terms of PPIs, co-expression, and so on, the SemFunSim method improves on the performance of PSB even further.

Applications

Disease similarity can be determined at the molecular, phenotypic, and hierarchical levels. Conversely, similar diseases reflect the correlations of their inducing molecules, phenotypes, and classifications. Therefore, disease similarity has been widely applied in the functional prediction of molecules, clinical diagnosis, and the establishment of disease associations.

The Functional Prediction of Molecules

The functional prediction of molecules is based on the observation that genes causing similar diseases tend to lie close to one another in a PPI network. Vanunu et al. constructed a comprehensive network using gene-disease association, disease similarity, and PPI data to predict disease-related PCGs using a random walk method. In comparison with PCGs, it is not easy to determine the function of ncRNAs, because wet lab experiments have provided only limited knowledge of their impact on proteins. Fortunately, disease similarity has been useful for this purpose in previous investigations.107, 108, 109, 110 Based on prior knowledge of the associations between ncRNAs and diseases, the functional similarity of ncRNAs can be calculated from the similarities of their related diseases to construct a network in which each ncRNA is represented as a node and the similarity of each pair of ncRNAs is represented as an edge. Such a network was then utilized for predicting novel ncRNA-disease associations by the random walk with restart (RWR) method. Recently, disease similarity has also been utilized for mining potential therapeutic drugs for diseases. Based on the observation that similar diseases can often be treated with similar drugs, Cheng et al. prioritized potential drugs for a disease based on the drugs known for similar diseases. Gottlieb et al. combined disease similarity and drug similarity to predict novel drug indications.
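A generic random walk with restart on a weighted similarity network can be sketched as below. This is the textbook form of the technique, not the exact procedure of any cited study; the function name, the restart probability of 0.7, and the three-node example network are illustrative assumptions. The network is assumed to be symmetric, so every neighbor also appears as a key of `weights`.

```python
def random_walk_with_restart(weights, seeds, restart=0.7,
                             tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker restarting at seeds.

    weights: {node: {neighbor: edge_weight}}, assumed symmetric.
    seeds:   nodes known to be associated with the disease of interest.
    """
    nodes = sorted(weights)
    out_sum = {u: sum(weights[u].values()) for u in nodes}
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart step: return to the seed nodes with probability `restart`.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                # Isolated node: the walker stays where it is.
                nxt[u] += (1 - restart) * p[u]
                continue
            # Walk step: move to neighbors in proportion to edge weight.
            for v, w in weights[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out_sum[u]
        diff = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if diff < tol:
            break
    return p


# Hypothetical ncRNA similarity network: a path r1 - r2 - r3.
network = {
    "r1": {"r2": 1.0},
    "r2": {"r1": 1.0, "r3": 1.0},
    "r3": {"r2": 1.0},
}
rwr_scores = random_walk_with_restart(network, seeds={"r1"})
```

Candidate nodes can then be ranked by their steady-state probability, with higher values suggesting a closer relationship to the seed set.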

Clinical Diagnosis

Diagnosis can be challenging, given the large number of hereditary disorders and the range of partially overlapping clinical features associated with them. To address this problem, Robinson et al. established the HPO to calculate disease similarity and diagnose diseases according to clinical phenotypes. According to Equations 6, 7, and 8, the similarity of diseases can be calculated based on their phenotype sets. For an individual patient, the similarity between each OMIM disease and the patient’s clinical features can be calculated with the same method. The similarity score in this case reflects the probability of a potential disease in the patient.

Construction of Qualitative Associations of Diseases

In 2006, Goh et al. utilized the common genetic origin of diseases to construct a human disease network (HDN) at the molecular level based on OMIM. This was an early study that established qualitative associations between diseases from a quantitative perspective. However, some diseases arise not from a single genetic defect but from a breakdown in molecular interaction networks, and the associations between such diseases cannot be reflected by this network. Therefore, the network was extended based on PPIs, metabolic networks, and different pathways.113, 114, 115 Recently, Zhou et al. established an HDN at the phenotypic level, in which the link weight between two diseases quantifies their similarity. Here, the symptoms of diseases were extracted from the literature in PubMed. Each disease was described as a vector of phenotypes, and the similarity between two diseases was defined as the cosine similarity of their vectors.
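The cosine similarity of two phenotype vectors can be computed as in the following minimal sketch. The sparse dictionary representation, the function name, and the example symptom weights are assumptions for illustration; the weighting scheme of the original study is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {phenotype: weight}."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        # A disease without any phenotype is treated as dissimilar to all.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical symptom vectors.
disease_a = {"fever": 1.0, "cough": 1.0}
disease_b = {"fever": 1.0}
disease_c = {"rash": 2.0}
sim_ab = cosine_similarity(disease_a, disease_b)
sim_ac = cosine_similarity(disease_a, disease_c)
```

Diseases that share no phenotype receive a similarity of 0, and diseases with proportional vectors receive a similarity of 1.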

Tools for Calculating Disease Similarity

Inspired by the wide recent application of machine learning methods in bioinformatics,116, 117, 118 various algorithms have been implemented for calculating disease similarity as R packages and web-based programs119, 120, 121, 122, 123, 124 (Table 3). These tools play important roles in disease diagnosis, the prediction of drugs, and so forth. Here, we introduce four frequently used tools in detail.
Table 3

Summary of Disease Similarity Tools

Author(s) | Ref. | Name | Type | Web Site | Vocabulary | PMID | Year
van Driel et al. | 67 | MimMiner | webpage |  | OMIM | 16493445 | 2006
Robinson et al. | 5 | Phenomizer | webpage | http://compbio.charite.de/phenomizer/ | OMIM | 19800049 | 2009
Wang et al. | 90 | MISIM | webpage |  | MeSH | 20439255 | 2010
Li et al. | 97 | DOSim | R package |  | DO | 21714896 | 2011
Hoehndorf et al. | 119 | NA | webpage | http://aber-owl.net/aber-owl/diseasephenotypes/ | OMIM | 26051359 | 2015
Hamaneh and Yu | 123 | DeCoaD | webpage | https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/mn/DeCoaD/ | DO | 26047952 | 2015
Deng et al. | 120 | HPOSim | R package | https://sourceforge.net/p/hposim/summary/ | OMIM | 25664462 | 2015
Yu et al. | 121 | DOSE | R package | http://www.bioconductor.org/packages/release/bioc/html/DOSE.html | DO | 25677125 | 2015
Cheng et al. | 111 | DisSim | webpage | http://bio-annotation.cn/DisSim | DO | 27457921 | 2016
Cheng et al. | 122 | DisSetSim | webpage | http://bio-annotation.cn/DisSetSim/ | DO | 29297411 | 2017
Cheng et al. | 124 | DincRNA | webpage | http://bio-annotation.cn:18080/DincRNAClient/#/Home | DO | 29365045 | 2018

MimMiner

van Driel et al. designed a phenotype-based method and implemented it as a tool, MimMiner, for calculating the similarity of OMIM diseases. This tool provides interfaces for querying the diseases similar to an input disease and is widely used in the bioinformatics community. It should be noted that this tool needs to be updated due to the rapid increase in the size of the OMIM disease database.

Phenomizer

Phenomizer is an online tool, based on disease similarity, that supports the diagnosis process. Currently, thousands of genetic disorders characterized by specific combinations of phenotypic features are documented in OMIM, and diagnosis based on phenotypes is difficult without computer-based tools. Phenomizer automatically correlates phenotypic abnormalities with the hereditary disorders found in OMIM. It generates p values to evaluate the statistical significance of the resulting correlation scores. This tool is also useful for suggesting additional possible phenotypic alterations for further evaluation in a patient of interest.

DOSim

DOSim is an R package used for computing the similarity between DO terms based on Wang’s method and nine hybrid methods involving Resnik’s method, Lin’s method, and so forth.93, 94, 95, 125, 126, 127 This tool also implements utilities to calculate the similarity of genes based on the diseases they are associated with and to conduct DO enrichment analysis.

DisSim

DisSim is an online system for exploring similar diseases in DO. It provides both the pairwise similarity of diseases and the significance of each similarity score. In addition, the system integrates therapeutic drugs for known diseases to predict potential drugs for other human diseases based on the observation that similar diseases can be treated with similar drugs.

Discussion

Most disease similarity methods depend on disease vocabularies and their annotations. Phenotype-based methods extract disease annotations of phenotypes from PubMed and OMIM. Disease names in these data sources come from MeSH and OMIM. Hierarchy-based methods utilize the ontology structure of MeSH and DO. Current molecule-based methods mainly use the DO annotations of genes. In summary, DO, MeSH, and OMIM contain the most frequently used vocabularies for calculating disease similarity. However, no single one of these vocabularies contains all disease terms. By comparison, OMIM documents more specific disease terms, such as TYPE III SYNDACTYLY (OMIM: 186100), whereas MeSH and DO involve classifications of diseases, such as cancer (DOID: 162). Figure 5 shows the number of disease terms distributed across the different vocabularies. In total, 958 common disease terms are documented in DO, MeSH, and OMIM, covering 8.8%, 8.5%, and 11.4% of DO, MeSH, and OMIM terms, respectively. Although OMIM and MeSH terms have been integrated into MEDIC, MEDIC lacks many DO terms and disease classifications. Therefore, combining all of the disease terms of DO, MeSH, and OMIM is critical for calculating disease similarity using the same vocabulary. In addition, a unified disease annotation database based on this integrated vocabulary is indispensable for improving the universality of similarity algorithms. In our previous studies, we provided a global view of human diseases by annotating disease-related molecule and phenotype features with DO. However, the absence of some disease terms in DO limits its application.
Figure 5

Distribution of Disease Terms in DO, MeSH, and OMIM

Disease-related ontologies contain only “IS_A” relationships, which limits the performance of hierarchy-based methods. Wang’s method, for example, can be applied to multiple types of term associations in an ontology, such as “IS_A,” “PART_OF,” “LOCATE_IN,” and so on. The performance evaluation results in Figure 4 show that Wang’s method could be improved, which may be achieved once more types of disease associations than the “IS_A” relationship become available.

The quality and quantity of disease annotations of phenotypes and molecules are crucial for the performance of molecule-based, phenotype-based, and hybrid methods. OMIM documents strict but few disease-gene associations. In contrast, GeneRIF and SIDD retain loose but abundant associations. In most cases, all of these datasets were combined without distinction for calculating disease similarity. These methods could be improved by ranking the associations. For example, disease annotations could be improved by adding the evidence for each disease-gene association, as is done in the GOA database.

In general, newer methods tend to consider more types of prior knowledge, leading to better performance. Wang’s method, a hierarchy-based method, was presented in 2007. The SemFunSim method was presented in 2014 and incorporates the hierarchical structure of DO, disease annotations of genes, and gene associations. The evaluation results in Figure 4 show that SemFunSim achieves a higher AUC than Wang’s method. Although hybrid methods integrate more types of prior knowledge of diseases, they ignore the molecular and phenotypic associations of diseases. Therefore, it is possible that the performance of disease similarity methods could be further improved by fusing more types of disease knowledge.

Although comprehensive knowledge benefits the precision of disease similarity calculations, methods based on a single type of prior knowledge can also be very valuable for biological applications. Diseases are often caused by molecular mechanisms and are reflected in diverse phenotypes. Disease phenotypes can be detected through clinical diagnosis, while causal molecules are identified in wet labs. Gaps therefore exist between the phenotypic and molecular levels of understanding diseases. Disease similarity based on different types of knowledge could bridge these gaps.

The purpose of calculating disease similarity is to identify similar diseases. However, it is not easy to determine similar diseases directly from most of the presented methods and tools. One feasible strategy is provided by DisSim, which reports a p value for each similarity score. The pairwise disease similarities obtained from current methods are normalized to Z scores, and one-sided p values are then calculated as a significance score for each similarity score. Another way to provide p values for similarity scores would be a permutation test.

Disease similarity plays important roles in mining the novel molecular features of diseases, clinical diagnosis, and so on. The exploration of the function of ncRNAs is a long-term challenge, as these RNAs do not produce proteins. Currently, disease similarity has been successful in predicting the function of ncRNAs, especially in prioritizing miRNA-disease129, 130, 131, 132, 133 and lncRNA-disease pairs. In the future, these methods can be used for comprehending the function of other types of ncRNAs, such as circular RNAs (circRNAs). In a previous study, disease similarity was utilized for diagnosis based on phenotypes. This may also be helpful for molecular diagnosis. Alterations in metabolite levels are easily determined in the clinic, meaning that metabolite-disease pairs can be prioritized based on disease similarity methods. Therefore, it is theoretically possible to predict potential diseases based on abnormalities in metabolite levels.
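The significance scoring discussed above, in which similarity scores are normalized to Z scores and converted to one-sided p values, can be sketched as follows. This is an illustrative sketch and not the DisSim implementation: the function name and the example scores are hypothetical, and converting a Z score to a p value through the standard normal distribution is an assumption of this sketch.

```python
import math
from statistics import mean, pstdev


def one_sided_p_values(scores):
    """Map {disease_pair: similarity} to {disease_pair: one-sided p value}.

    Each score is standardized against all scores, and the upper-tail
    probability of the standard normal distribution is returned.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all similarity scores are identical")
    p_values = {}
    for pair, s in scores.items():
        z = (s - mu) / sigma
        # Upper-tail probability P(Z >= z) for a standard normal variable.
        p_values[pair] = 0.5 * math.erfc(z / math.sqrt(2))
    return p_values


# Hypothetical pairwise similarity scores.
example_scores = {
    ("d1", "d2"): 0.0,
    ("d1", "d3"): 0.5,
    ("d2", "d3"): 1.0,
}
example_p = one_sided_p_values(example_scores)
```

A permutation test would replace the normal assumption with an empirical null distribution, for example by recomputing similarities after shuffling disease annotations.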

Author Contributions

L.C., J.H., S.L., and Q.J. conceived and designed the experiments. L.C., H.Z., P.W., W.Z., M.L., and T.L. analyzed data. L.C. wrote the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no competing interests.