Computational Methods for Identifying Similar Diseases.

Liang Cheng¹, Hengqiang Zhao¹, Pingping Wang², Wenyang Zhou², Meng Luo², Tianxin Li², Junwei Han³, Shulin Liu⁴, Qinghua Jiang⁵.

Abstract

Although our knowledge of human diseases has increased dramatically, the molecular basis, phenotypic traits, and therapeutic targets of most diseases remain unclear. An increasing number of studies have observed that similar diseases are often caused by similar molecules, can be diagnosed by similar markers or phenotypes, or can be cured by similar drugs. Thus, the identification of diseases similar to known ones has attracted considerable attention worldwide. To this end, the associations between diseases at the molecular, phenotypic, and taxonomic levels have been used to measure the pairwise similarity of diseases. The corresponding performance assessment strategies for these methods, termed "category-based," "simulated-patient-based," and "benchmark-data-based," are also emphasized. Frequently used methods were then evaluated using the benchmark-data-based strategy. To facilitate the assessment of disease similarity scores, researchers have designed dozens of tools that implement these methods for calculating disease similarity. Disease similarity has already proved advantageous in predicting noncoding RNA (ncRNA) function and therapeutic drugs for diseases. In this article, we review disease similarity methods, evaluation strategies, tools, and their applications in the biomedical community. We further evaluate the performance of these methods and discuss the current limitations and future trends for calculating disease similarity.

Keywords: disease similarity; molecular basis; ncRNA function; phenotypic traits; therapeutic drugs

Year: 2019 PMID: 31678735 PMCID: PMC6838934 DOI: 10.1016/j.omtn.2019.09.019

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

Human disease is one of the permanent aspects of the human condition, similar to birth, aging, and death, from a philosophical point of view. The search for novel understanding of disease never stops. Although biotechnology has developed with great success, the molecular basis of and therapeutic agents for most diseases remain unclear. Current studies have observed that similar diseases are often caused by similar molecules,1, 2, 3 can be diagnosed by similar markers or phenotypes,4, 5, 6 and can be cured by similar drugs.7, 8, 9, 10, 11 Based on this, novel functional molecules for a disease could, in theory, be revealed using prior knowledge of similar diseases.12, 13, 14, 15, 16, 17, 18 Thus, research on identifying the similarity between diseases has attracted increasing attention. A pair of diseases with a high similarity score can be defined as similar diseases.

To measure disease similarity, prior knowledge of diseases plays a crucial role. The symptoms and signs accompanying diseases, also called phenotypes, are the intuitive characteristics of a disease. As early as 2002, Freudenberg and Propping used phenotypes sourced from the Online Mendelian Inheritance in Man (OMIM) website to calculate the similarity of OMIM diseases. With an ever-increasing number of phenotypes being observed by the biomedical community, abundant algorithms have been developed for measuring disease similarity at the phenotypic level. Many studies have shown that alterations of molecules can lead to the occurrence of diseases. Thus, the exploration of a common molecular basis is another way to measure disease similarity. With the development of next-generation sequencing technologies, a vast number of protein-coding genes (PCGs) and noncoding RNA (ncRNA) genes associated with diseases have been identified. For example, hemophilia A is an X-linked recessive bleeding disorder caused by a deficiency in the activity of coagulation factor VIII (F8), which can be affected by variations in the F8 gene. MicroRNA (miRNA)-155 is an endogenous ncRNA that regulates several mRNAs to cause B cell lymphomas. Based on the molecular basis of diseases, a large number of methods27, 28, 29, 30, 31, 32, 33 have been designed for calculating disease similarity.

Recently, disease taxonomy has begun to play an important role in measuring disease similarity. One of the typical taxonomic classifiers for diseases is Disease Ontology (DO). In DO, each disease term represents a disease that may have several names, and two terms can be linked on the basis of an inclusion relationship. For example, “Alzheimer’s disease” can be linked to “tauopathy.” All of the disease terms and the set of inclusion relationships form the disease hierarchy, a directed acyclic graph (DAG) of DO (Figure 1), where a node represents a disease term and an edge represents an inclusion relationship between two terms. The common ancestors of two disease terms in the DAG have often been utilized to calculate the similarity of the two terms.

Figure 1

Sub-graph of the DO Hierarchy for Alzheimer’s Disease

Arrows represent an “IS_A” relationship for DO. For example, “Alzheimer’s disease” is linked to “Dementia” by an “IS_A” relationship. All of the terms that can be linked by “IS_A” relationships in the graph from “Alzheimer’s disease” are the ancestors of “Alzheimer’s disease.” All of the terms that can link to “Disease” by “IS_A” relationships are the descendants of “Disease.”

Currently, dozens of methods have been designed for calculating disease similarity based on prior disease knowledge at the phenotypic, molecular, and hierarchical levels. In this article, we review the main topics of investigation in disease similarity, including the selection of proper data, the design and implementation of methods, the evaluation of a method’s performance, and the application of existing methods for predicting molecular factors of diseases.

Data Sources

Three types of data sources, including disease vocabularies, disease annotations, and gene functional annotations, are widely utilized for calculating disease similarity (Table 1). Here, we list and introduce these main data sources.

Table 1

Summary of Data Sources

Category and Name	Creation Date	Initiator	PMID
Disease Vocabulary

OMIM	1960s	McKusick³⁶	17357067
MeSH	1960s	Winifred Sewell³⁸	14119288
UMLS	1980s	Olivier Bodenreider⁴¹	14681409
SNOMED CT	2001	Wang et al.⁴⁶	11825284
DO	2003	Schriml et al.³⁴	22080554
MEDIC	2012	Davis et al.³⁹	22434833

Disease Annotations

GeneRIF	2007		17990498
CTD	2003		27651457
GAD	2004	Becker et al.⁴⁸	15118671
miR2Disease	2008	Jiang et al.⁵⁴	18927107
HPO	2008	Robinson et al.⁵	18950739
SpliceDisease	2011		22139928
lncRNADisease	2012		23175614
HMDD v2.0	2013		24194601
SIDD	2013	Cheng et al.⁶²	24146757
OAHG	2016	Cheng et al.⁶¹	27703231

Gene Functional Annotations

GOA	2003	Camon et al.⁶³	12654719
HumanNet	2011	Lee et al.⁶⁶	21536720

Disease Vocabularies

Disease vocabularies document disease terms for distinguishing between different diseases. Each disease term in a vocabulary contains a unique identifier, a preferred disease name, synonyms, abbreviations, and the definition of the disease. Some of these vocabularies also provide a hierarchy of disease terms based on a set of inclusion relationships.

OMIM

OMIM is a comprehensive, authoritative compendium of genetic diseases, which is freely available and updated daily. It was initiated in the early 1960s by Dr. Victor A. McKusick and has been developed for online usage by the NCBI since 1985.

MeSH

The Medical Subject Headings (MeSH) provides hierarchically organized terminology for indexing and cataloging biomedical information for PubMed. MeSH divides all biomedical terms into 16 categories, of which categories C and F03 contain disease names, covering more than 4,600 disease terms. In addition to the terms in these categories, MeSH also contains supplementary term records, which document thousands of disease terms.

MEDIC

The “merged disease vocabulary” (MEDIC) was established by the Comparative Toxicogenomics Database (CTD) biocurators and is composed of more than 10,000 unique diseases. To take advantage of the familiarity and immediate genetic data offered by OMIM terms, as well as the navigation utility and PubMed indexing feature of MeSH terms, MEDIC integrates OMIM terms with MeSH terms and hierarchical relationships.

UMLS

The Unified Medical Language System (UMLS) is a repository of biomedical vocabularies developed by the U.S. National Library of Medicine (NLM). The UMLS integrates over 2 million names for some 900,000 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations between these concepts. Vocabularies integrated in the UMLS Metathesaurus include MeSH, OMIM, Gene Ontology (GO), and so forth.

DO

The Disease Ontology (DO) database was developed to create a single structure for the classification of diseases that unifies the representation of disease between varied vocabularies into a relational ontology. DO terms can be linked in a hierarchy by a type of semantic association called an “IS_A” relationship (Figure 1). The initial builds of DO in 2003 and 2004 used the International Classification of Diseases (ICD-9) as the foundational vocabulary. Recent revisions have improved this with the reorganization of DO based on UMLS disease terms in conjunction with term mappings to Systematized Nomenclature of Medicine--Clinical Terms (SNOMED CT) and ICD-9. The current version of DO is organized into eight main classes, which represent, for example, diseases of cellular proliferation, of mental health, of anatomical entity, and by infectious agent.

Disease Annotations

The molecular basis and phenotypic characterization of a disease are two main aspects of prior knowledge often used for measuring disease similarity. Resources collecting these sources of prior knowledge are called disease annotations.

Disease Annotations of PCGs

Disease-related PCGs are mainly documented in the OMIM, Gene Reference into Function (GeneRIF), Genetic Association Database (GAD), SpliceDisease, and CTD databases. OMIM was intended for use primarily by physicians and other professionals concerned with genetic disorders. GeneRIF provides functional annotations of genes from the NCBI and allows scientists to add a short functional summary of an NCBI gene that is limited to 425 characters. The GAD emphasizes genetic association data from complex diseases and disorders. SpliceDisease provides detailed descriptions of the relationships between gene variations, splicing defects, and diseases. The CTD documents the interactions between chemicals and gene products, as well as their relationships to diseases. The relationships between genes and diseases in the CTD often come in the form of information about RNA splicing, SNPs, and so on.

Disease Annotations of miRNAs

miRNAs are a class of endogenous single-stranded small ncRNAs that play a crucial role in various human diseases by negatively regulating the expression of PCGs.50, 51, 52, 53 Two manually curated data sources of disease-miRNA relationships are miR2Disease and the Human miRNA Disease Database (HMDD) v2.0. Both resources document miRNA deregulation in various human diseases.

Disease Annotations of lncRNAs

Long ncRNAs (lncRNAs) are mRNA-like transcripts that are longer than 200 nt and have little or no protein-coding capacity. According to the theory of competing endogenous RNA (ceRNA), they can affect the expression of PCGs by competitively binding with miRNAs. Thus, it becomes important to understand the role of lncRNAs in diseases. The LncRNADisease database provides a manually curated set of relationships between lncRNAs and diseases.

Disease Annotations of Phenotypes

Phenotypes are documented in the Clinical Synopsis section of the textual descriptions of each OMIM disease. Robinson et al. extracted all of the phenotypes from this text and constructed a human phenotype ontology (HPO) to annotate human diseases.

Integrated Resources of Disease Annotations

In previous efforts, we developed two integrated resources for disease annotations. An integrated resource for annotating human genes with multi-level ontologies (OAHG) focused on the disease annotations of PCGs, miRNAs, and lncRNAs, and a semantically integrated database towards a global view of human disease (SIDD) documented disease-related molecular, phenotypic, and environmental features. The data sources integrated by OAHG included OMIM, HMDD, and LncRNADisease. SIDD integrated up to 18 different data sources, including OMIM, GAD, CTD, LncRNADisease, and HPO.

Gene Functional Annotations

Similar molecular foundations of diseases may be influenced not only by common genes but also by different genes with common functions. Recently, associations between genes from gene functional annotation resources have been introduced for calculating disease similarity. Here, we list resources for the identification of gene functional annotations.

GOA

Disease-related PCGs can possess similar molecular functions (MFs), and may be involved in similar biological processes (BPs). This type of functional association of genes often exposes the similarity of different diseases. The GO annotation (GOA) of PCGs provides assignments of MF and BP terms of GO to gene products, in a project run by the European Bioinformatics Institute (EBI).

HumanNet

In addition to the GOA of PCGs, functional relationships between disease-related genes can also be reflected by protein-protein interactions, mRNA co-expression, and so forth. By integrating all of these data, HumanNet provides a more comprehensive score for each pairwise PCG relationship.

Disease Similarity Measures

The similarity between diseases can be reflected by their common phenotypic characteristics, molecular basis, and hierarchical structure. Therefore, we have classified the disease similarity methods into phenotype-based, molecule-based, hierarchy-based, and hybrid methods (Table 2).

Table 2

Summary of Disease Similarity Methods

Author(s)	Molecule Based	Phenotype Based	Hierarchy Based	Vocabulary	PMID (or Reference Number)	Year
Freudenberg and Propping²¹		√		OMIM	12385992	2002
van Driel et al.⁶⁷		√		OMIM	16493445	2006
Köhler et al.⁶⁸		√		OMIM	19800049	2009
Zhang et al.⁶⁹		√		OMIM	20659468	2010
Zhou et al.⁷²		√		MeSH	24967666	2014
Chen et al.⁷³		√		UMLS	25277758	2015
Hoehndorf et al.¹¹⁹		√		DO	26051359	2015
Deng et al.¹²⁰		√		OMIM	25664462	2015
Mabotuwana et al.⁹²		√		SNOMED CT	23850839	2013
Mathur et al.⁹⁹	√			DO	21347137	2010
Suthram et al.⁷⁸	√			UMLS	20140234	2010
Gottlieb et al.⁸	√			UMLS	21654673	2011
Hamaneh and Yu⁸²	√			OMIM/MeSH	25360770	2014
Kim et al.⁸³	√			PharmGKB	26212477	2015
Wang et al.³⁵			√	DO/MeSH	17344234	2007
Resnik²⁷	√		√	DO	²⁷	1995
Lin¹²⁶	√		√	DO	²⁸	1998
Schlicker et al.⁹⁸	√		√		16776819	2006
Mathur et al.	√		√	DO	22166490	2012
Cheng et al.⁹¹	√		√	DO	24932637	2014

Phenotype-Based Methods

Figure 2 shows the schematic process of phenotype-based methods. First, qualitative associations between phenotypes and diseases are extracted from phenotype data sources. Then, each pair of qualitative associations is quantified as a disease-phenotype score or phenotype-phenotype score. Finally, these scores are utilized for calculating disease similarity.

Figure 2

Schematic of the Process of Phenotype-Based Methods

Freudenberg’s Method

OMIM diseases were originally characterized manually by Freudenberg and Propping according to their phenotypic appearance, using the indices “periodicity,” “etiology,” “tissue,” “age of onset,” and “mode of inheritance.” The index “periodicity” is a Boolean variable, indicating an episodic occurrence of a disease in contrast to a linear progression. The index “etiology” is based on clinical signs and laboratory or pathological findings related to a disease. The index “tissue” is compiled as the anatomic location of the phenotype. The index “mode of inheritance” indicates whether a disease is inherited in an autosomal-dominant, autosomal-recessive, X-linked, mitochondrial, or complex manner. The index “age of onset” refers to the age of a patient when symptoms are generally first noticed. Then, the similarity of diseases $d_1$ and $d_2$ is defined as follows:

$$\mathrm{sim}(d_1, d_2) = \sum_{i} w_i \cdot \mathrm{sim}(d_1.\mathrm{index}_i,\; d_2.\mathrm{index}_i) \tag{1}$$

where $w_i$ represents the contribution of the $i$th index to the total similarity score, and $\mathrm{sim}(d_1.\mathrm{index}_i, d_2.\mathrm{index}_i)$ indicates the similarity between the $i$th indices of $d_1$ and $d_2$.
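
The weighted combination of index similarities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the per-index similarity values, and the weights are hypothetical, since the text does not list the values Freudenberg and Propping actually used.

```python
def weighted_index_similarity(index_sims, weights):
    """Return the weighted sum of per-index similarities for one disease pair."""
    if set(index_sims) != set(weights):
        raise ValueError("index_sims and weights must cover the same indices")
    return sum(weights[name] * index_sims[name] for name in weights)


# Hypothetical per-index similarities for one disease pair (each in [0, 1]).
index_sims = {
    "periodicity": 1.0,
    "etiology": 0.5,
    "tissue": 0.0,
    "age of onset": 1.0,
    "mode of inheritance": 0.5,
}
# Hypothetical weights (assumed to sum to 1 so the score stays in [0, 1]).
weights = {
    "periodicity": 0.1,
    "etiology": 0.3,
    "tissue": 0.3,
    "age of onset": 0.1,
    "mode of inheritance": 0.2,
}
score = weighted_index_similarity(index_sims, weights)
print(score)
```

Keeping the weights in a separate mapping makes it easy to test how strongly each index drives the final score.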

van Driel’s Method

van Driel et al. calculated the similarity between over 5,000 diseases based on phenotypic features of OMIM records. For each OMIM disease, its phenotypic descriptions were extracted from the “TX” and “CS” fields. Then, the OMIM diseases and phenotypic descriptions were mapped to the anatomy (category A) and disease (category C) sections of MeSH to establish disease-term associations. Each disease-term association was then defined as a vector with three features (Equations 2, 3, and 4), where $t$ and $d$ represent a phenotype term and a disease, respectively. In Equations 2 and 4, counted($t$, $d$) is the number of occurrences of $t$ in the OMIM records of $d$. In Equation 3, $N$ is the total number of records analyzed, and $n$ is the number of records that contain the term $t$. In Equation 4, descendant($t$) is the number of descendant terms of $t$ in the hierarchy of MeSH, and descendant($t$, $d$) is the number of those descendant terms in the OMIM records of $d$. The similarity between diseases $d_1$ and $d_2$ is then defined by Equation 5, in which $t_{i,1}$ and $t_{i,2}$ are the $i$th term vectors of $d_1$ and $d_2$, respectively, and $m$ is the total number of phenotypic terms.

Phenotypic terms of the “CS” field of OMIM records were also manually extracted to construct the HPO by Robinson et al. Then, the similarity of a pair of phenotypic terms $p_1$ and $p_2$ was calculated based on Resnik’s method as follows:

$$\mathrm{sim}(p_1, p_2) = \max_{a} \left[ -\log \frac{n(a)}{N} \right]$$

where $a$ is a common ancestor of phenotypes $p_1$ and $p_2$, $N$ is the total number of genes associated with the phenotypes, and $n(a)$ is the number of genes associated with $a$. The similarity of a pair of diseases $d_1$ and $d_2$ is then defined from these phenotype similarities by two equations in which $n$ and $m$ represent the number of phenotypes associated with $d_1$ and $d_2$, respectively.

Zhang’s Method

Zhang et al. extracted phenotypic terms from the “TX” and “CS” fields of OMIM’s disease records using a MetaMap transfer tool. As a result, each disease could be represented as a set of phenotypes. Then the weights of phenotypic terms for diseases were calculated based on a term frequency-inverse document frequency (TF-IDF) weighting scheme. Subsequently, each disease was represented as a weighted vector of these phenotypic terms. Finally, the similarity of pairwise diseases was defined as the cosine of their corresponding phenotypic vectors.
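
The TF-IDF weighting and cosine comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the disease names and phenotype terms are hypothetical, and the particular TF-IDF variant (raw term count multiplied by the natural log of the inverse document frequency) is an assumption, because the text does not specify one.

```python
import math
from collections import Counter


def tfidf_vectors(disease_terms):
    """Map each disease to a {term: tf * idf} vector (one assumed TF-IDF variant)."""
    n_docs = len(disease_terms)
    # Document frequency: in how many diseases each term occurs at least once.
    doc_freq = Counter(t for terms in disease_terms.values() for t in set(terms))
    vectors = {}
    for disease, terms in disease_terms.items():
        counts = Counter(terms)
        vectors[disease] = {
            t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical phenotype terms extracted for three diseases.
disease_terms = {
    "disease_a": ["seizure", "ataxia", "ataxia"],
    "disease_b": ["seizure", "ataxia"],
    "disease_c": ["anemia", "fatigue"],
}
vectors = tfidf_vectors(disease_terms)
sim_ab = cosine(vectors["disease_a"], vectors["disease_b"])
sim_ac = cosine(vectors["disease_a"], vectors["disease_c"])
print(sim_ab, sim_ac)
```

Diseases that share no phenotype terms receive a similarity of 0, while diseases with the same terms in similar proportions approach 1.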

Zhou’s Method

Zhou et al. defined a disease as a set of symptoms, which were extracted from PubMed. Each disease was described as a weighted vector of phenotypic terms, with the weights calculated by a TF-IDF weighting scheme. The similarity of a pair of diseases was then defined as the cosine of their vectors.

Chen’s Method

Chen et al. extracted the disease-phenotype relationships from the UMLS file MRREL.RRF, where disease-phenotype relationships were documented based on OMIM, Ultrasound Structured Attribute Reporting, and Minimal Standard Digestive Endoscopy Terminology. This group then used the information content (IC) to weight each phenotype concept as follows:

$$IC(p) = -\log \frac{n_p}{N}$$

where $N$ is the total number of diseases, and $n_p$ is the number of diseases associated with a phenotype $p$. Then they modeled the phenotype similarity of a pair of diseases by the cosine of their feature vectors.

Molecule-Based Methods

The schematic process of molecule-based methods is analogous to that of the phenotype-based methods described above. Here, genes are the main disease-related molecules. Phenotype-based methods always utilize the semantic associations between phenotypes. In comparison, genes can be associated in more ways, such as through protein-protein interactions (PPIs), co-expression, and so forth.

Mathur’s Method

SwissProt documents proteins that have been manually annotated with diseases, which were mapped to DO terms using MetaMap by Mathur and Dinakarpandian. Then, the similarity of diseases $d_1$ and $d_2$ was calculated from their corresponding gene sets $G_1$ and $G_2$, using the number of terms in each set (denoted |.|) and the total number of genes $N$.

Suthram’s Method

Suthram et al. compared diseases using an integrated analysis of disease-related mRNA expression data and the human protein interaction network. First, they identified conserved functional modules of genes using PathBLAST based on PPI data from the Human Protein Reference Database (HPRD). Next, they normalized the gene expression data in each microarray sample using a Z-score transformation and computed the activity level of each gene in a disease. Then, the module response score for each module in a disease was assigned to be the mean of the gene activity score of its component genes. Finally, they calculated the partial correlation coefficient between diseases based on the corresponding module response score and defined it as the disease similarity.

Gottlieb’s Method

Gottlieb et al. presented four algorithms for calculating disease similarity using the genetic signatures of diseases from gene expression experiments: signature-based, signature sequence-based, signature PPI-based, and signature GO-based methods. The signature-based method utilized a Jaccard index between every pair of disease signatures to calculate disease similarity as follows:

$$\mathrm{sim}(d_1, d_2) = \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$$

where $G_1$ and $G_2$ are the signatures of diseases $d_1$ and $d_2$, respectively, and |.| is the number of terms in the specified set. The signature PPI-based method calculated the distances between each pair of disease signatures based on their corresponding proteins using an all-pairs shortest paths algorithm on the human PPI network. Distances were transformed into similarity values using the following formula:

$$\mathrm{sim}(P_1, P_2) = A \cdot e^{-D(P_1, P_2)}$$

where $P_1$ and $P_2$ are the corresponding proteins of diseases $d_1$ and $d_2$, respectively, and $D(P_1, P_2)$ is the shortest path between these proteins in the PPI network. $A$ is a parameter chosen to be 0.9 × e by Perlman et al. The signature sequence-based method calculated the Smith-Waterman sequence alignment score between disease signatures and then divided the score by the geometric mean of the scores from aligning each sequence against itself. In addition, the signature GO-based method calculated the similarity between each pair of disease signatures based on their corresponding GO terms.
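
Two of the signature comparisons described above can be sketched briefly. The Jaccard index follows directly from the text. The distance-to-similarity transform is written here as an exponential decay, A × exp(-b × D) with b = 1, which is an assumption about its exact form; only the value A = 0.9 × e is stated in the text. The gene symbols in the example are illustrative and do not come from a real disease signature.

```python
import math


def jaccard(signature_a, signature_b):
    """Jaccard index between two disease signatures given as iterables of genes."""
    a, b = set(signature_a), set(signature_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def distance_to_similarity(distance, a=0.9 * math.e, b=1.0):
    """Turn a PPI shortest-path distance into a similarity value.

    Assumed form: a * exp(-b * distance). With the defaults, proteins at
    distance 1 receive a similarity of 0.9.
    """
    return a * math.exp(-b * distance)


# Illustrative signatures only.
sig_1 = {"F8", "F9", "VWF"}
sig_2 = {"F8", "F9", "TP53"}
print(jaccard(sig_1, sig_2))
print(distance_to_similarity(1), distance_to_similarity(3))
```

Under this assumed transform, similarity falls off quickly with network distance, so only proteins that are close in the PPI network contribute noticeably.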

Hamaneh’s Method

Hamaneh and Yu devised a network-based measure to calculate disease similarity. First, they assigned weights to all proteins by using information flow from a disease to the human PPI network and back. As a result, each disease was represented as a weighted vector whose dimension is the number of proteins in the network. Then, the similarity of two diseases was defined as the cosine of the angle between their corresponding vectors.

Kim’s Method

Kim et al. extracted disease-gene pairs and disease-drug pairs from the literature and used the frequencies of co-occurrence relationships as features to calculate disease similarity. In this work, disease names, gene symbols, and drug names were taken from the Pharmacogenomics Knowledgebase (PharmGKB). Let $G_1$ and $G_2$ be the genes that occurred in the same sentence as diseases $d_1$ and $d_2$, respectively, and let $D_1$ and $D_2$ be the drugs that occurred in the same sentence as diseases $d_1$ and $d_2$, respectively. The similarity of $d_1$ and $d_2$ is then defined by two equations over these gene and drug sets, in which $N$ and $M$ are the total numbers of genes and drugs, respectively.

Hierarchy-Based Methods

Hierarchy-based approaches are based only on the hierarchical structure of disease-related ontologies. In the previously mentioned studies, multiple methods have been presented for calculating the similarity of ontology terms using shared path and distance based on hierarchical structures85, 86, 87, 88, 89. However, currently only Wang’s method is widely utilized for calculating disease similarity.

Wang’s Method

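Assuming that $D_1$ is the set including $d_1$ and all of its ancestor terms in an ontology based on the “IS_A” relationship, the hierarchical contribution of a term $t$ to $d_1$ is represented as follows:

$$S_{d_1}(t) = \begin{cases} 1 & \text{if } t = d_1 \\ \max \{\, w \cdot S_{d_1}(t') \mid t' \in \text{children of } t \,\} & \text{if } t \neq d_1 \end{cases}$$

where $w$ is a hierarchical contribution factor for the hierarchical association. According to Wang et al. and Cheng et al., $w$ is defined as 0.5 for an “IS_A” relationship of DO. Then, the sum of the hierarchical contributions of all terms in $D_1$ to $d_1$ is $SV(d_1)$, which is defined as follows:

$$SV(d_1) = \sum_{t \in D_1} S_{d_1}(t)$$

Assuming that $D_2$ is the set including $d_2$ and all of its ancestor terms, the similarity between $d_1$ and $d_2$ is defined by Wang’s method as follows:

$$\mathrm{sim}(d_1, d_2) = \frac{\sum_{t \in D_1 \cap D_2} \left( S_{d_1}(t) + S_{d_2}(t) \right)}{SV(d_1) + SV(d_2)}$$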
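
A minimal sketch of Wang’s method on a toy hierarchy is given below, using w = 0.5 as stated above. Each term contributes 1 to itself, and each ancestor receives the largest w-discounted contribution among its children. The hierarchy fragment is hypothetical: only the link from “Alzheimer's disease” to “tauopathy” is taken from the text, and the remaining node names are placeholders, not real DO content.

```python
def s_values(term, parents, w=0.5):
    """Contribution of `term` and each of its ancestors to `term`.

    The term itself scores 1.0; every ancestor gets the best value of
    w * (score of one of its children on a path down to `term`).
    """
    values = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            candidate = w * values[child]
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                frontier.append(parent)  # revisit: its own parents may improve
    return values


def wang_similarity(d1, d2, parents, w=0.5):
    """Shared ancestor contributions divided by the total contributions."""
    s1, s2 = s_values(d1, parents, w), s_values(d2, parents, w)
    shared = s1.keys() & s2.keys()
    return sum(s1[t] + s2[t] for t in shared) / (sum(s1.values()) + sum(s2.values()))


# Toy IS_A hierarchy: child -> list of parents (placeholder nodes).
parents = {
    "Alzheimer's disease": ["tauopathy"],
    "disease_x": ["tauopathy"],
    "tauopathy": ["parent_class"],
    "parent_class": ["disease"],
}
sim = wang_similarity("Alzheimer's disease", "disease_x", parents)
print(sim)
```

In this toy case each disease has contributions 1, 0.5, 0.25, and 0.125 along its path to the root, so the two siblings share 1.75 of a total 3.75.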

Mabotuwana et al.’s Method

Mabotuwana et al. defined the similarity of a pair of terms $t_1$ and $t_2$ as inversely proportional to the distance between the terms, as follows:

$$\mathrm{sim}(t_1, t_2) = \frac{1}{d}$$

where $d$ is the number of nodes in the shortest path between the two diseases in the DAG of the ontology.
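
The path-based idea above can be sketched with a breadth-first search. Several details are assumptions made for illustration: the similarity is taken to be exactly 1/d, d counts both endpoint nodes, and the “IS_A” edges are traversed in both directions so that a path can pass through a shared ancestor. The node names are placeholders.

```python
from collections import deque


def nodes_on_shortest_path(start, goal, edges):
    """Number of nodes on the shortest path, or None if no path exists."""
    neighbours = {}
    for child, parent in edges:
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    seen = {start}
    queue = deque([(start, 1)])  # the start node counts as one node
    while queue:
        node, count = queue.popleft()
        if node == goal:
            return count
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, count + 1))
    return None


def path_similarity(term_a, term_b, edges):
    """Assumed form: 1 / (number of nodes on the shortest path)."""
    count = nodes_on_shortest_path(term_a, term_b, edges)
    return 0.0 if count is None else 1.0 / count


# Placeholder (child, parent) IS_A edges.
edges = [("term_a", "term_p"), ("term_b", "term_p"), ("term_p", "root")]
print(path_similarity("term_a", "term_b", edges))
```

With this convention a term compared with itself scores 1, and siblings under a common parent score 1/3.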

Hybrid Methods

Molecular and hierarchical associations between diseases have been combined in hybrid methods for calculating disease similarity. These methods often utilize disease-related genes to define the IC of a disease $d$93, 94, 95 as follows:

$$IC(d) = -\log \frac{n_d}{N}$$

where $N$ denotes the total number of genes, and $n_d$ represents the number of genes of $d$. Here, disease-related genes are often based on OMIM, CTD, SIDD, OAHG, and so on.

Resnik’s Method

In 1995, Resnik presented a method for calculating the similarity between ontology terms. In 2002, this method was introduced for calculating the similarity between GO terms. In 2011, Li et al. utilized this method for calculating the similarity between DO terms. According to Resnik’s method, the similarity of a pair of diseases $d_1$ and $d_2$ equals the IC of the most informative common ancestor (MICA) of these two diseases as follows:

$$\mathrm{sim}_{Resnik}(d_1, d_2) = IC(\mathrm{MICA}(d_1, d_2))$$

Lin’s Method

Concerned that the similarity between ontology terms should also be decided by the IC of the two terms themselves, Lin improved Resnik’s method in 1998. According to Lin’s method, the similarity of a pair of diseases $d_1$ and $d_2$ is reflected by both the MICA of the disease pair and the IC of each disease as follows:

$$\mathrm{sim}_{Lin}(d_1, d_2) = \frac{2 \cdot IC(\mathrm{MICA}(d_1, d_2))}{IC(d_1) + IC(d_2)}$$
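
The IC-based measures of Resnik and Lin can be sketched on a toy ontology as follows. The sketch assumes that the gene set of each term already includes the genes of its descendants, so that ancestors are never more informative than their descendants, and it uses the natural logarithm. The term names, gene sets, and total gene count are hypothetical.

```python
import math


def information_content(term, genes_of, total_genes):
    """IC(term) = -log(n_term / N), using the genes annotated to the term."""
    return -math.log(len(genes_of[term]) / total_genes)


def ancestors(term, parents):
    """The term itself plus every term reachable through IS_A links."""
    found = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def resnik(d1, d2, parents, genes_of, total_genes):
    """IC of the most informative common ancestor (MICA)."""
    common = ancestors(d1, parents) & ancestors(d2, parents)
    return max(
        (information_content(a, genes_of, total_genes) for a in common),
        default=0.0,
    )


def lin(d1, d2, parents, genes_of, total_genes):
    """Resnik's score scaled by the IC of the two diseases themselves."""
    denom = information_content(d1, genes_of, total_genes) + information_content(
        d2, genes_of, total_genes
    )
    if denom == 0.0:
        return 0.0
    return 2 * resnik(d1, d2, parents, genes_of, total_genes) / denom


# Toy ontology and hypothetical gene annotations (integers stand in for genes).
parents = {"d1": ["parent"], "d2": ["parent"], "parent": ["root"]}
genes_of = {
    "root": set(range(100)),
    "parent": set(range(20)),
    "d1": set(range(5)),
    "d2": set(range(5, 15)),
}
resnik_score = resnik("d1", "d2", parents, genes_of, total_genes=100)
lin_score = lin("d1", "d2", parents, genes_of, total_genes=100)
print(resnik_score, lin_score)
```

Here the MICA is `parent`, annotated with 20 of 100 genes, so Resnik’s score is log 5, and Lin’s score rescales that value into the range from 0 to 1.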

Schlicker’s Method

Schlicker et al. improved Resnik’s method from the same perspective as Lin, and they defined disease similarity as follows:

$$\mathrm{sim}_{Schlicker}(d_1, d_2) = \max_{a \in \mathrm{ancestors}(d_1, d_2)} \left[ \frac{2 \cdot IC(a)}{IC(d_1) + IC(d_2)} \cdot \left( 1 - \frac{n_a}{N} \right) \right]$$

In this equation, ancestors($d_1$, $d_2$) represents the set of common ancestors of diseases $d_1$ and $d_2$, $n_a$ is the number of genes of $a$, and $N$ is the total number of genes, as in the definition of IC.

In 2012, Mathur et al. designed a new method named PSB for calculating the similarity between DO terms. According to this method, the significance of related BP terms from GO should be computed for each disease using a hypergeometric test. Assuming that $d_1$ and $d_2$ are associated with $m$ and $n$ BP terms, respectively, the similarity of $d_1$ and $d_2$ is defined from the similarities between pairs of BPs $p_i$ and $p_j$. The BP similarity in turn depends on $IC_{GO}$ and $IC_{DO}$, which represent the IC based on GO and DO, respectively, and on $n(p_i \cap p_j)$ and $n(p_i \cup p_j)$, which denote the number of genes common to $p_i$ and $p_j$ and the total number of genes of $p_i$ and $p_j$, respectively.

Cheng’s Method

In addition to related BPs, genes can be associated by PPI, co-expression, and so forth. Therefore, Cheng et al. presented the SemFunSim method to improve Mathur’s method by incorporating the gene functional network from HumanNet, which reflects the comprehensive gene associations from PPI, co-expression, BP, and so on. Let $G_1$ and $G_2$ represent the related gene sets of $d_1$ and $d_2$, respectively. The similarity between $d_1$ and $d_2$ by Cheng et al.’s method is then computed from $|G_{MICA}|$, the number of genes of the MICA of $d_1$ and $d_2$; from $m$ and $n$, the numbers of genes in $G_1$ and $G_2$, respectively; and from $Sim(g_i, g_j)$, the functional similarity score between genes $g_i$ and $g_j$ from HumanNet.

Performance Evaluation

The performance of a disease similarity method can be affected by the quality of the prior knowledge it is based on. Most of the methods utilize manually curated datasets, which are highly reliable. Some of the methods mentioned here use data extracted from the literature with text-mining tools. Data obtained in such an unsupervised way should always be evaluated. In Mathur’s method, disease-related genes were mined from the literature using MetaMap. The recall and precision were calculated based on a benchmark dataset from Monttaz et al., which contained 200 records that were manually annotated by experts. The identified pairs of similar diseases should then also be evaluated to measure the performance of the method used. Three types of classical evaluation strategies are introduced here (Figure 3).

Figure 3

Schematic of the Process of Performance Evaluation

(A) Performance evaluation of a simulated patient-based method. (B) Performance evaluation of a term-category-based method. (C) Performance evaluation of a benchmark-data-based method.

Simulated-Patient-Based Strategy

In consideration of the difficulty in obtaining phenotypic information about a large number of patients, Sebastian et al. presented a simulated-patient-based method to evaluate their phenotype-based disease similarity method. They used 44 complex dysmorphology syndromes for which adequate phenotype frequency data were available, and then generated 100 virtual patients for each disease on the basis of the frequency of phenotypes among persons diagnosed with that disease. For example, to generate patients with phenotypes A and B, in which A occurs in 40% and B occurs in 60% of patients, a random number generator was utilized to generate two random numbers uniformly distributed between 0 and 100. Subsequently, the similarity of each simulated patient to each of the OMIM diseases was calculated and then ranked. The average rank over all of the patients was returned to assess the performance of the original method.
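
The generation of virtual patients can be sketched as follows. The text states only that one uniform random number is drawn per phenotype. The rule used here, that a phenotype is assigned when its draw falls below the phenotype's frequency, is an assumption, as are the function names and the fixed seed.

```python
import random


def simulate_patient(phenotype_frequency, rng):
    """Draw one virtual patient as a set of phenotypes.

    `phenotype_frequency` maps a phenotype to its frequency in percent.
    Assumed rule: keep the phenotype when a uniform draw on [0, 100)
    is below that frequency.
    """
    return {
        phenotype
        for phenotype, percent in phenotype_frequency.items()
        if 100 * rng.random() < percent
    }


def simulate_cohort(phenotype_frequency, n_patients=100, seed=0):
    """Generate `n_patients` virtual patients for one disease."""
    rng = random.Random(seed)  # seeded so the cohort is reproducible
    return [simulate_patient(phenotype_frequency, rng) for _ in range(n_patients)]


# The example from the text: A occurs in 40% and B in 60% of patients.
cohort = simulate_cohort({"A": 40, "B": 60}, n_patients=100)
share_with_a = sum("A" in patient for patient in cohort) / len(cohort)
print(len(cohort), share_with_a)
```

Each virtual patient would then be scored against every disease, and the rank of the true disease averaged over the cohort.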

Term-Category-Based Strategy

Sun et al. utilized information on disease-related molecules to design a disease similarity measurement method. Their results were evaluated using the disease classification terminologies found in the ICD-9. Their assumption was that two similar diseases should fall into the same category in the ICD-9. Therefore, the correlation between the similarity of diseases and their classifications can reflect the performance of this method. Since similarity scores are not normally distributed, they used a nonparametric test, the Mann-Whitney U test, to assess the statistical significance of the disease similarity.
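
A simplified sketch of this test is shown below: similarity scores of disease pairs in the same category are compared with scores of pairs in different categories. The scores are hypothetical, and the p value uses a plain normal approximation without tie or continuity correction, which is an assumption made to keep the example dependent on the standard library only. A statistics package would normally be preferred, especially for small samples.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs with a > b, ties counted as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in sample_a for b in sample_b)


def one_sided_p_value(sample_a, sample_b):
    """Approximate P(U >= observed) under the null hypothesis.

    Normal approximation with no tie or continuity correction.
    """
    n, m = len(sample_a), len(sample_b)
    u = mann_whitney_u(sample_a, sample_b)
    mean = n * m / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    return 1.0 - NormalDist().cdf((u - mean) / sd)


# Hypothetical similarity scores.
same_category = [0.8, 0.7, 0.9, 0.6]
different_category = [0.2, 0.3, 0.5, 0.1]
u_stat = mann_whitney_u(same_category, different_category)
p_value = one_sided_p_value(same_category, different_category)
print(u_stat, p_value)
```

A small p value indicates that pairs within one category tend to score higher than pairs across categories, which is what a useful similarity measure should show under the stated assumption.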

Benchmark Data-Based Strategy

In a previous study, Cheng et al. constructed a benchmark set containing 70 pairs of similar diseases, which were manually integrated from two datasets. One dataset was adapted from Suthram et al. from the literature. The other dataset was curated by medical residents. Here, we have evaluated the performance of Wang’s, Resnik’s, and Lin’s methods, PSB, and SemFunSim using this benchmark data. First, the disease pairs of our benchmark dataset were taken as the positive group, and 10-fold more disease pairs were randomly generated as a negative group. Next, the similarity of the disease pairs in these two groups was calculated with each of the listed methods. Then, the area under the receiver operating characteristic (ROC) curve (AUC) was obtained. This process was iterated 100 times using a different negative group each time, and the average AUC reflects the respective performance of these methods. Figure 4A shows the ROC curve for one of the 100 iterations using disease-related genes from GeneRIF, while Figure 4B shows the average AUC of the 100 iterations using disease-related genes from GeneRIF. The average AUCs for Resnik’s, Lin’s, and Wang’s methods, PSB, and SemFunSim were 0.6484, 0.6791, 0.6978, 0.7759, and 0.9008, respectively. Figures 4C and 4D show the results using disease-related genes from SIDD. The average AUCs for Resnik’s, Lin’s, and Wang’s methods, PSB, and SemFunSim were 0.6209, 0.6351, 0.6849, 0.8843, and 0.9849, respectively.
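
The evaluation loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the AUC is computed as the fraction of positive-negative pairs ranked correctly, random negative pairs are required to differ from the positive pairs, and the toy similarity function and disease identifiers are hypothetical.

```python
import random


def auc(pos_scores, neg_scores):
    """Probability that a positive pair outscores a negative pair (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def average_auc(similarity, positive_pairs, diseases, iterations=100, ratio=10, seed=0):
    """Average AUC over repeated draws of random negative disease pairs."""
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    needed = ratio * len(positive_pairs)
    n = len(diseases)
    if n * (n - 1) // 2 - len(positives) < needed:
        raise ValueError("not enough distinct disease pairs for the negative group")
    pos_scores = [similarity(a, b) for a, b in positive_pairs]
    total = 0.0
    for _ in range(iterations):
        negatives = set()
        while len(negatives) < needed:
            pair = frozenset(rng.sample(diseases, 2))
            if pair not in positives:  # assumption: negatives exclude benchmark pairs
                negatives.add(pair)
        neg_scores = [similarity(*sorted(pair)) for pair in negatives]
        total += auc(pos_scores, neg_scores)
    return total / iterations


# Toy setup: 12 hypothetical diseases and a similarity that knows the benchmark.
diseases = [f"d{i}" for i in range(12)]
benchmark = [("d0", "d1"), ("d2", "d3")]
known = {frozenset(pair) for pair in benchmark}


def toy_similarity(a, b):
    return 1.0 if frozenset((a, b)) in known else 0.0


mean_auc = average_auc(toy_similarity, benchmark, diseases)
print(auc([0.9, 0.8], [0.1, 0.2, 0.8]), mean_auc)
```

Replacing `toy_similarity` with any of the reviewed measures and `benchmark` with a curated pair list reproduces the structure of the comparison, although not the reported numbers.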

Figure 4

Performance Evaluation Using a Benchmark-Data-Based Strategy

(A) ROC curve for one of the 100 iterations using disease-related genes from GeneRIF. (B) The average AUC from 100 iterations using disease-related genes from GeneRIF. (C) ROC curve for one of the 100 iterations using disease-related genes from SIDD. (D) The average AUC from 100 iterations using disease-related genes from SIDD.

The performance of these methods is subject to the prior knowledge they use. Wang’s method uses only the structure of the ontology; therefore, its performance is limited by the comprehensiveness of the ontology. Although Resnik’s and Lin’s methods incorporate both the structure of the ontology and ontology annotations, they do not utilize all the “IS_A” relationships of the ontology. Thus, these three methods perform worst, with average AUCs below 0.70 on both gene sets. In comparison with Resnik’s and Lin’s methods, PSB introduced GOA for associating disease-related genes, and its performance is correspondingly higher. Since disease-related genes can also be associated in terms of PPIs, co-expression, and so on, SemFunSim improves further on the performance of PSB.

Applications

Disease similarity can be determined at the molecular, phenotypic, and hierarchical levels. Conversely, similar diseases reflect the correlations of their inducing molecules, phenotypes, and classifications. Therefore, disease similarity has been widely applied in the functional prediction of molecules, clinical diagnosis, and the establishment of disease associations.

The Functional Prediction of Molecules

The functional prediction of molecules is based on the observation that genes causing similar diseases tend to lie close to one another in a PPI network. Vanunu et al. constructed a comprehensive network using gene-disease association, disease similarity, and PPI data to predict disease-related PCGs using a random walk method. In comparison with PCGs, the function of ncRNAs is harder to determine because wet-lab knowledge of their impact on proteins is limited. Fortunately, disease similarity has proven useful for this purpose in previous investigations.¹⁰⁷⁻¹¹⁰ Based on prior knowledge of the associations between ncRNAs and diseases, the functional similarity of ncRNAs can be calculated from the similarities of their related diseases to construct a network in which each ncRNA is represented as a node and the similarity of each pair of ncRNAs is represented as an edge. Such a network was then utilized for predicting novel ncRNA-disease associations by the random walk with restart (RWR) method. Recently, disease similarity has also been utilized for mining potential therapeutic drugs for diseases. Based on the observation that similar diseases can often be treated with similar drugs, Cheng et al. prioritized potential drugs for a disease based on the drugs known for similar diseases. Gottlieb et al. combined disease similarity and drug similarity to predict novel drug indications.
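A random walk with restart over an ncRNA similarity network can be sketched as below. This is a generic RWR formulation, not the code of any cited study: the dictionary-based network layout, the restart probability of 0.7, and the convergence threshold are assumptions chosen for illustration.

```python
def random_walk_with_restart(weights, seeds, restart=0.7,
                             tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    weights: dict node -> dict neighbor -> similarity (assumed symmetric,
             and every neighbor is itself a key of `weights`).
    seeds:   nodes already known to be associated with the disease.
    """
    nodes = sorted(weights)
    out_sum = {u: sum(weights[u].values()) for u in nodes}
    # Initial distribution: uniform over the seed nodes.
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                continue  # isolated node: nothing to propagate
            for v, w in weights[u].items():
                # Move from u to v in proportion to the edge weight.
                nxt[v] += (1 - restart) * p[u] * w / out_sum[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical ncRNA network; edge weights are functional similarities.
network = {
    "ncRNA_a": {"ncRNA_b": 0.9},
    "ncRNA_b": {"ncRNA_a": 0.9, "ncRNA_c": 0.5},
    "ncRNA_c": {"ncRNA_b": 0.5, "ncRNA_d": 0.1},
    "ncRNA_d": {"ncRNA_c": 0.1},
}
scores = random_walk_with_restart(network, seeds={"ncRNA_a"})
```

Non-seed ncRNAs are then ranked by score, and the top-ranked ones are treated as candidate associations for the disease of interest.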

Clinical Diagnosis

The diagnosis process can be a challenging undertaking, given the large number of hereditary disorders and the range of partially overlapping clinical features associated with them. To resolve this problem, Robinson et al. established the HPO to calculate disease similarity and to diagnose diseases according to clinical phenotypes. According to Equations 6, 7, and 8, the similarity of two diseases can be calculated based on their phenotype sets. For an individual patient, the similarity between each OMIM disease and the patient's clinical features can be calculated in the same way. The similarity score in this case then reflects the probability of a potential disease in the patient.

Construction of Qualitative Associations of Diseases

In 2006, Goh et al. utilized the common genetic origin of diseases to construct a human disease network (HDN) at the molecular level based on OMIM. This was an early study that established a qualitative association between diseases from a quantitative perspective. However, a portion of diseases stem not from single genetic defects but, rather, from breakdowns in molecular interaction networks. The associations of such diseases cannot be reflected by this network. Therefore, the network was extended based on PPIs, metabolic networks, and different pathways.¹¹³⁻¹¹⁵ Recently, Zhou et al. established an HDN at the phenotypic level, where the link weight between two diseases quantified the disease similarity. Here, the symptoms of diseases were extracted from the literature in PubMed. Each disease was described as a vector of phenotypes. Then, the similarity between diseases was defined as the cosine similarity of their vectors.
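The cosine similarity of two phenotype vectors can be written in a few lines. The sketch below stores each disease as a sparse dictionary of symptom weights; the symptom names and the unit weights are hypothetical, and the weighting scheme used by Zhou et al. is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse symptom vectors.

    u, v: dict mapping symptom name -> non-negative weight.
    Returns 0.0 when either vector has no symptoms.
    """
    dot = sum(w * v.get(symptom, 0.0) for symptom, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical diseases sharing one of two symptoms each.
disease_x = {"fever": 1.0, "cough": 1.0}
disease_y = {"fever": 1.0, "rash": 1.0}
link_weight = cosine_similarity(disease_x, disease_y)
```

In a phenotype-level HDN, a value such as `link_weight` would serve as the weight of the edge between the two diseases.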

Tools for Calculating Disease Similarity

Inspired by the wide recent application of machine learning methods in bioinformatics,¹¹⁶⁻¹¹⁸ various algorithms have been implemented for calculating disease similarity using R and web-based programs¹¹⁹⁻¹²⁴ (Table 3). These tools play important roles in disease diagnosis, the prediction of drugs, and so forth. Here, we introduce four frequently used tools in detail.

Table 3

Summary of Disease Similarity Tools

Author(s)	Name	Type	Web Site	Vocabulary	PMID	Year
van Driel et al.⁶⁷	MimMiner	webpage		OMIM	16493445	2006
Robinson et al.⁵	Phenomizer	webpage	http://compbio.charite.de/phenomizer/	OMIM	19800049	2009
Wang et al.⁹⁰	MISIM	webpage		MeSH	20439255	2010
Li et al.⁹⁷	DOSim	R package		DO	21714896	2011
Hoehndorf et al.¹¹⁹	NA	webpage	http://aber-owl.net/aber-owl/diseasephenotypes/	OMIM	26051359	2015
Hamaneh and Yu¹²³	DeCoaD	webpage	https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/mn/DeCoaD/	DO	26047952	2015
Deng et al.¹²⁰	HPOSim	R package	https://sourceforge.net/p/hposim/summary/	OMIM	25664462	2015
Yu et al.¹²¹	DOSE	R package	http://www.bioconductor.org/packages/release/bioc/html/DOSE.html	DO	25677125	2015
Cheng et al.¹¹¹	DisSim	webpage	http://bio-annotation.cn/DisSim	DO	27457921	2016
Cheng et al.¹²²	DisSetSim	webpage	http://bio-annotation.cn/DisSetSim/	DO	29297411	2017
Cheng et al.¹²⁴	DincRNA	webpage	http://bio-annotation.cn:18080/DincRNAClient/#/Home	DO	29365045	2018

MimMiner

van Driel et al. designed a phenotype-based method and implemented it as a tool, MimMiner, for calculating the similarity of OMIM diseases. This tool provides interfaces to query the diseases similar to an input disease and is widely used in the bioinformatics community. It should be noted that this tool needs to be updated due to the rapid increase in the size of the OMIM disease database.

Phenomizer

Phenomizer is an online tool that can be helpful in the diagnosis processes and is based on disease similarity. Currently, thousands of genetic disorders characterized by specific combinations of phenotypic features are documented in OMIM. The diagnosis process based on phenotypes is difficult without computer-based tools. Phenomizer allows an automatic correlation between phenotypic abnormalities and hereditary disorders found in OMIM. The p values are generated to evaluate the statistical significance of those correlation scores given by Phenomizer. This tool is also useful for suggesting additional possible phenotypic alterations for further evaluation in a patient of interest.

DOSim

DOSim is an R package used for computing the similarity between DO terms based on Wang’s method and nine hybrid methods involving Resnik’s method, Lin’s method, and so forth.⁹³⁻⁹⁵,¹²⁵⁻¹²⁷ This tool also implements utilities to calculate the similarity of genes based on their associated diseases and to conduct DO enrichment analysis.

DisSim

DisSim is an online system for exploring similar diseases in DO. It provides both the similarity of pairwise diseases and the significance of their similarity score. In addition, the system integrates therapeutic drugs for known diseases to predict potential drugs for other human diseases based on the observation that similar diseases can be treated with similar drugs.

Discussion

Most disease similarity methods depend on disease vocabularies and their annotations. Phenotype-based methods extract disease annotations of phenotypes from PubMed and OMIM. Disease names from these data sources are from MeSH and OMIM. Hierarchy-based methods utilize the structure of ontology from MeSH and DO. Current molecule-based methods mainly use the DO annotations of genes. In summary, DO, MeSH, and OMIM contain the most frequently used vocabularies for calculating disease similarity. However, no single one of these vocabularies contains all disease terms. For comparison, OMIM documents more specific disease terms, such as TYPE III SYNDACTYLY (OMIM: 186100). MeSH and DO involve classifications of diseases, such as cancer (DOID: 162). Figure 5 shows the number of disease terms distributed across the different vocabularies. In total, 958 common disease terms are documented in DO, MeSH, and OMIM, which cover 8.8%, 8.5%, and 11.4% of DO, MeSH, and OMIM terms, respectively. Although OMIM and MeSH terms have been integrated into MEDIC, MEDIC lacks many DO terms and disease classifications. Therefore, combining all of the disease terms of DO, MeSH, and OMIM is critical for calculating disease similarity using the same vocabulary. In addition, a unified disease annotation database based on this integrated vocabulary is indispensable for improving the universality of similarity-determining algorithms. In our previous studies, we provided a global view of human diseases by annotating disease-related molecule and phenotype features with DO. However, the absence of disease terms in DO limits its application.

Figure 5

Distribution of Disease Terms in DO, MeSH, and OMIM

Disease-related ontologies only contain “IS_A” relationships, which limits the performance of hierarchy-based methods. For example, Wang’s method could be applied to multiple term associations of an ontology, such as “IS_A,” “PART_OF,” “LOCATE_IN,” and so on. The performance evaluation results in Figure 4 show that Wang et al.’s method could be improved, which may be achieved once more types of disease associations than the “IS_A” relationship become available.

Data quality and the quantity of disease annotations of phenotypes and molecules are crucial for the performance of molecule-based, phenotype-based, and hybrid-based methods. OMIM documents tight but few disease-gene associations. In contrast, GeneRIF and SIDD retain loose but abundant associations. In most cases, all of these datasets were combined without distinction for calculating disease similarity. These methods could be improved by ranking all of the associations. For example, we can improve the disease annotations by adding the evidence for each disease-gene association, such as that found in the GOA database.

In general, newer methods should consider more types of prior knowledge, leading to better performance. Wang’s method, which is a hierarchy-based method, was presented in 2007. The SemFunSim method was presented in 2014, and it incorporates the hierarchical structure of DO, disease annotations of genes, and gene associations. The evaluation results in Figure 4 show that SemFunSim achieves a higher AUC than Wang’s method. Although hybrid methods integrate more types of prior knowledge of diseases, the molecular and phenotypic associations of diseases are still ignored. Therefore, it is possible that the performance of disease similarity methods could be further improved by fusing more disease knowledge types.
Although comprehensive knowledge benefits the precision of disease similarity calculations, methods based on a single type of prior knowledge can also be very valuable for biological applications. Diseases are often caused by molecular mechanisms and are reflected in diverse phenotypes. Disease phenotypes can be detected through clinical diagnosis, while causal molecules are identified in wet-lab experiments. Gaps between the phenotypic and molecular levels exist in our understanding of diseases. Here, disease similarity based on different types of knowledge could bridge the gap.

The purpose of calculating disease similarity is to identify similar diseases. However, it is not easy to determine similar diseases directly from most of the presented methods and tools. One feasible strategy for this purpose is provided by DisSim, which provides a p value for each similarity score. Using current methods, the similarity scores of pairwise diseases are obtained and then normalized to Z scores. Then, one-sided p values are calculated as a significance score for each similarity score. Another way to provide p values for similarity scores would be a permutation test.

Disease similarity plays important roles in mining the novel molecular features of diseases, clinical diagnosis, and so on. The exploration of the function of ncRNAs is a long-term challenge, as these RNAs do not produce proteins. Currently, disease similarity has been successful in predicting the function of ncRNAs, especially in prioritizing miRNA-disease¹²⁹⁻¹³³ and lncRNA-disease pairs. In the future, these methods can be used for comprehending the function of other types of ncRNAs, such as circular RNAs (circRNAs). In a previous study, disease similarity was utilized for diagnosis based on phenotypes. This may also be helpful for molecular diagnosis. Alterations in metabolite levels are easily measured in the clinic, meaning that metabolite-disease pairs can be prioritized based on disease similarity methods. Therefore, it is theoretically possible to predict potential diseases based on abnormalities in metabolite levels.
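The Z-score strategy mentioned in the Discussion for attaching significance to similarity scores can be sketched as follows. This is not the DisSim implementation: it assumes the Z scores are compared against a standard normal distribution and that the population standard deviation of all pairwise scores is used. Because similarity scores are often not normally distributed, as noted for the category-based strategy, the resulting p values should be read as approximate.

```python
import math
import statistics


def one_sided_p_values(scores):
    """Map each disease pair to an upper-tail p value for its score.

    scores: dict mapping a disease pair -> similarity score.
    Assumption: Z scores are referred to a standard normal distribution.
    """
    values = list(scores.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all similarity scores are identical")
    p_values = {}
    for pair, score in scores.items():
        z = (score - mean) / sd
        # P(Z >= z) for a standard normal variable.
        p_values[pair] = 0.5 * math.erfc(z / math.sqrt(2))
    return p_values


# Hypothetical pairwise scores, for illustration only.
toy_scores = {
    ("disease_a", "disease_b"): 0.10,
    ("disease_a", "disease_c"): 0.20,
    ("disease_b", "disease_c"): 0.90,
}
toy_p = one_sided_p_values(toy_scores)
```

Pairs whose p value falls below a chosen threshold would then be reported as similar diseases; a permutation test would replace the normal reference distribution with an empirical one.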

Author Contributions

L.C., J.H., S.L., and Q.J. conceived and designed the experiments. L.C., H.Z., P.W., W.Z., M.L., and T.L. analyzed data. L.C. wrote the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no competing interests.
