
Co-clustering phenome-genome for phenotype classification and disease gene discovery.

TaeHyun Hwang, Gowtham Atluri, MaoQiang Xie, Sanjoy Dey, Changjin Hong, Vipin Kumar, Rui Kuang.

Abstract

Understanding the categorization of human diseases is critical for reliably identifying disease causal genes. Recently, genome-wide studies of abnormal chromosomal locations related to diseases have mapped >2000 phenotype-gene relations, which provide valuable information for classifying diseases and identifying candidate genes as drug targets. In this article, a regularized non-negative matrix tri-factorization (R-NMTF) algorithm is introduced to co-cluster phenotypes and genes, and simultaneously detect associations between the detected phenotype clusters and gene clusters. The R-NMTF algorithm factorizes the phenotype-gene association matrix under the prior knowledge from phenotype similarity network and protein-protein interaction network, supervised by the label information from known disease classes and biological pathways. In the experiments on disease phenotype-gene associations in OMIM and KEGG disease pathways, R-NMTF significantly improved the classification of disease phenotypes and disease pathway genes compared with support vector machines and Label Propagation in cross-validation on the annotated phenotypes and genes. The newly predicted phenotypes in each disease class are highly consistent with human phenotype ontology annotations. The roles of the new member genes in the disease pathways are examined and validated in the protein-protein interaction subnetworks. Extensive literature review also confirmed many new members of the disease classes and pathways as well as the predicted associations between disease phenotype classes and pathways.

Year: 2012 PMID: 22735708 PMCID: PMC3479160 DOI: 10.1093/nar/gks615

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Phenotypes, the observable characteristics (traits) of an organism, are believed to be determined by genetic material (DNA) under environmental influences (1,2). The key to achieving desired phenotypes, such as favorable disease treatment outcomes, lies in understanding the relation between phenotypes and the biological roles of genes (3–5). In the past two decades, promising biotechnologies such as microarray-based profiling (6–9) and second-generation sequencing (10,11) were developed to hunt for potential phenotype–gene associations. Currently, the most comprehensive database of disease phenotype–gene relations, Online Mendelian Inheritance in Man (OMIM) (2), documents nearly 2000 confirmed relations between around 6000 phenotypes and over 12 000 genes. This knowledge base provides a new phenome (the collection of all phenotypes) perspective to study human diseases and their molecular mechanisms. Although most previous studies focused on predicting new disease phenotype–gene relations with OMIM data (12–19), we propose to cluster phenotypes and find gene modules associated with the phenotype clusters by integrating OMIM phenotype–gene relations with the disease phenotype similarity network and the human gene interaction network, as well as existing disease categorization and molecular pathways. To effectively use all the sources of information, we design regularized non-negative matrix tri-factorization (R-NMTF) algorithms to tri-factorize the binary matrix of phenotype–gene relations into phenotype clusters, gene clusters and an association matrix representing the associations between the phenotype clusters and the gene clusters (Figure 1). Since the matrix of known phenotype–gene relations is very sparse, constraints constructed from the prior knowledge and the phenotype/gene labels are introduced to regularize the NMTF models.

Figure 1.

NMTF of disease phenotype–gene associations. The phenotype–gene association matrix X is factorized into the product of three matrices, phenotype cluster membership F, gene cluster membership G and phenotype cluster–gene cluster association S, for supervised co-clustering of phenotypes and genes. Label information for the disease classes and the pathways is available for a small number of phenotypes and genes. Prior knowledge is also introduced from the phenotype similarity network and the gene network. For better visualization, different colors are used to distinguish the phenotypes and the genes in different clusters.

Current classification of human disease is mainly based on observational correlation between pathological analysis and clinical syndromes (20) and, more recently, on text mining of clinical records and synopses (21). An accurate classification of human diseases based on their phenotypic and molecular basis will help to establish syndromic patterns for selecting phenotypes to consider in diagnosis. Existing phenotype clustering approaches cluster phenotypes based only on text descriptions and synopses (22–24) or on shared disease genes (25), which do not fully reflect both the phenotypic and the genetic basis of the disease phenotypes. R-NMTF integrates various sources of phenotypic and genomic data as well as prior knowledge to perform supervised co-clustering of phenotypes and genes simultaneously. R-NMTF is the first of its kind that effectively discovers disease classes based on the molecular underpinnings of the phenotypes and the molecular interactions in a network. This approach implements the philosophy of network-based medicine (26), which is believed to be a promising approach for generating the next generation of disease categorization (20). The R-NMTF-based co-clustering also naturally induces the associations between the phenotype clusters and gene clusters, which provides a global pathway activity view of human disease classes for understanding the unique as well as common underlying molecular mechanisms of diseases.

MATERIALS AND METHODS

In this section, we first describe the notations for the data of disease phenotypes, genes and their associations. We then review NMTF and introduce the framework of R-NMTF for co-clustering phenotypes and genes. We also outline the multiplicative update algorithm for solving the R-NMTF model.

Notations

The notations and definitions used in the article are specified in Table 1. We denote the OMIM phenotype–gene associations by an m × n binary matrix X with 1 for known associations and 0 otherwise. The objective is to derive phenotype clusters (F) and find their association (S) with gene clusters (G) based on X (Figure 1). F and G are non-negative matrices representing the soft memberships of each phenotype in the k1 phenotype clusters and of each gene in the k2 gene clusters, respectively. To perform more reliable phenotype clustering in a supervised setting, we use the partial phenotype annotations by (25), represented by a binary matrix F0 with 1 for the known class memberships. Similarly, KEGG pathways (27) are included in a binary matrix G0 to guide gene clustering. Note that, since training samples are not required for each disease category to classify the phenotypes in the model, we use the word ‘co-clustering’ instead of ‘classification’ or ‘semi-supervised learning’ for the learning problem, although in the experiments we only focused on recovering the 21 disease categories with at least one OMIM disease phenotype. Finally, a phenotype similarity network M (21) and the gene interaction network N are introduced to capture modular relations among phenotypes and genes. M and N contain edges weighted by the degree of similarity between phenotypes and the confidence of interaction between genes, respectively.
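As an illustration of the notation, the binary matrix X can be assembled from a list of known phenotype–gene pairs. The sketch below is a minimal example in plain Python; the function name `build_association_matrix` and the toy identifiers (`P1`, `geneA`, ...) are hypothetical and do not come from OMIM.

```python
def build_association_matrix(pairs, phenotypes, genes):
    """Return an m x n 0/1 matrix (list of rows): 1 marks a known association."""
    p_index = {p: i for i, p in enumerate(phenotypes)}  # phenotype -> row
    g_index = {g: j for j, g in enumerate(genes)}       # gene -> column
    X = [[0] * len(genes) for _ in phenotypes]
    for p, g in pairs:
        X[p_index[p]][g_index[g]] = 1
    return X


# Toy data (hypothetical identifiers): m = 2 phenotypes, n = 3 genes
phenotypes = ["P1", "P2"]
genes = ["geneA", "geneB", "geneC"]
pairs = [("P1", "geneA"), ("P1", "geneC"), ("P2", "geneB")]
X = build_association_matrix(pairs, phenotypes, genes)
```

The label matrices F0 and G0 can be built the same way from (phenotype, class) and (gene, pathway) pairs.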

Table 1.

Notations

Notation	Definition
m	Number of disease phenotypes
n	Number of genes
k₁	Number of phenotype clusters (e.g. classes)
k₂	Number of gene clusters (e.g. pathways)
X	Disease phenotype–gene association matrix (m × n)
F	Phenotype cluster membership (m × k₁)
S	Phenotype cluster–gene cluster association Matrix (k₁ × k₂)
G	Gene cluster membership (n × k₂)
F⁰	Annotated phenotype cluster membership (m × k₁)
G⁰	Annotated gene cluster membership (n × k₂)
M	Disease phenotype similarity network (m × m)
N	Gene interaction network (n × n)


Non-negative matrix tri-factorization

Non-negative matrix factorization (NMF) was proposed by (28,29) as an alternative to principal component analysis and vector quantization for parts-based decomposition of a data matrix. NMF has been applied to solve various bioinformatics problems such as identifying gene clusters (30–32), bi-clustering (33) and identifying cancer tumor categories (34) in gene expression data analysis, and finding modules in the protein–protein interaction (PPI) network (35). By imposing orthogonality on the two factorized matrices, (36) proposed a framework to perform NMTF as $X \approx F S G^{T}$ under the constraints $F^{T}F = I$ and $G^{T}G = I$. This framework has the advantage of simultaneously clustering the columns and rows, and of finding a condensed representation of the data matrix by the row clusters and the column clusters, which can also be considered as associations between row clusters and column clusters. For co-clustering phenotypes and genes, the NMTF approach provides novel insights into the phenotype–gene associations beyond clustering and decomposition.

Regularization by phenotype and gene labels

To cluster phenotypes and genes based on their associations, we adopt the supervised NMTF proposed for finding associations between document clusters and word clusters in text categorization (37,38). We use manually labeled phenotype clusters as the phenotype label F0 and gene clusters from an existing pathway database as the gene label G0, and simultaneously cluster phenotypes and genes with tri-factorization as illustrated in Figure 1. The following optimization framework can be solved to achieve the goal:

$$\min_{F \ge 0,\, S \ge 0,\, G \ge 0} \; \|X - F S G^{T}\|^{2} + \alpha \|F - F_{0}\|^{2} + \beta \|G - G_{0}\|^{2} \qquad (1)$$

In equation (1), the first term is the NMTF of X, and the second and the third terms, weighted by α and β, are the fitting penalties that keep the new cluster assignment consistent with the known phenotype and gene cluster labels. These two terms are introduced as a supervised way of minimizing the squared loss between the predicted phenotype cluster assignment F and the initial phenotype cluster assignment F0, and between the predicted gene cluster assignment G and the initial gene cluster assignment G0. Specifically, the phenotype clusters are taken from the 21 disease classes manually curated by (25), in which 872 disease phenotypes are assigned to 21 classes. The gene clusters are derived from the genes in KEGG pathways (27). The information of the labeled phenotypes and genes provides useful guidance for learning a more accurate co-clustering. A limitation of the approach in equation (1) is the low coverage and the sparsity of the disease–gene association matrix used to cluster phenotypes and genes. The known disease–gene associations cover only a small fraction of phenotypes and genes (one-third of the phenotypes and 5% of the genes), with very few associations between them (less than one association per phenotype/gene). Moreover, the phenotype cluster annotations and KEGG pathways also provide only a low coverage of around 15% of the phenotypes and one-fourth of the genes. These statistics suggest that with this model only a very small fraction of phenotypes and genes could be clustered properly.

Regularization by graph Laplacians

To address the above problem, we design R-NMTF to incorporate the prior knowledge in the phenotype similarity network and the PPI network (Figure 1) to cluster phenotypes and genes with matrix tri-factorization. Given the phenotype similarity network M and the PPI network N, the following optimization problem is formulated for the purpose:

$$\min_{F \ge 0,\, S \ge 0,\, G \ge 0} \; \|X - F S G^{T}\|^{2} + \alpha \|F - F_{0}\|^{2} + \beta \|G - G_{0}\|^{2} + \gamma\, \mathrm{tr}\big(F^{T}(D_{M} - M)F\big) + \lambda\, \mathrm{tr}\big(G^{T}(D_{N} - N)G\big) \qquad (2)$$

where $D_{M}$ is the diagonal matrix with the row sums of M on the diagonal and $D_{N}$ is similarly defined from N. In this equation, the first three terms are identical to those in equation (1). The fourth and fifth terms introduce the phenotype similarity network and the PPI network as prior knowledge to guide the clustering of the phenotypes and the genes. These two terms are called smoothness terms, which encourage the connected nodes (phenotypes/genes) in a graph to be assigned to the same cluster. Specifically, the term $\mathrm{tr}(F^{T}(D_{M} - M)F)$ requires that the phenotype clusters identified by NMTF are also densely connected in the phenotype network, and similarly for $\mathrm{tr}(G^{T}(D_{N} - N)G)$. $D_{M} - M$ and $D_{N} - N$ are known as the Laplacian matrices of the graphs, which are positive semi-definite (39).
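To make the smoothness terms concrete, the following minimal NumPy sketch builds the Laplacian D − W of a small weighted graph and evaluates tr(Fᵀ(D − W)F) for two candidate cluster assignments. The helper names `graph_laplacian` and `smoothness` and the toy 4-node graph are illustrative assumptions, not part of the original method description.

```python
import numpy as np


def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W of a symmetric, non-negative weight matrix W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))  # row sums on the diagonal
    return D - W


def smoothness(F, W):
    """Smoothness penalty tr(F^T (D - W) F) of cluster memberships F on graph W."""
    return float(np.trace(F.T @ graph_laplacian(W) @ F))


# Toy graph: nodes 0-1 are connected, nodes 2-3 are connected
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Connected nodes share a cluster: no penalty
F_aligned = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
# Connected nodes are split across clusters: penalized
F_split = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)

penalty_aligned = smoothness(F_aligned, W)
penalty_split = smoothness(F_split, W)
```

The penalty is zero when every edge joins two nodes with identical membership rows and grows with the weight of the edges that cross clusters, which is why minimizing it pulls connected phenotypes (or genes) into the same cluster.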

Algorithm 1

Regularized Non-negative Matrix Tri-factorization
INPUT: X, F0, G0, L_M, L_N, parameters α, β, γ and λ, maximum iteration T
OUTPUT: F, G, S
while not converged and t ≤ T do
    Update F. Normalize F.
    Update G. Normalize G.
    Compute S.
end while

Multiplicative update algorithms

We extend the optimization algorithms for the original NMTF to handle the four additional penalty terms in equation (2). An alternating iterative scheme, which solves the problem with respect to one variable while fixing the other variables, is described below.

Computation of F

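If we fix the variables S and G, solving equation (2) with respect to F is equivalent to minimizing the part of the objective that depends on F, namely the factorization error, the label-fitting penalty on F and the smoothness term $\mathrm{tr}(F^{T} L_{M} F)$ with $L_{M} = D_{M} - M$, subject to the constraints on F. Differentiating this function with respect to F leads to a multiplicative update rule for F. To satisfy the equality constraint, F is normalized after each update.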
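The F step can be written out as follows. This is a reconstruction under stated assumptions rather than the authors' exact formulas: it assumes that α weights the label-fitting term and γ the phenotype Laplacian term, and it uses the standard ratio form of a multiplicative update obtained by splitting the gradient into its negative and positive parts. The original update may differ in detail (for example in how the equality constraint and the normalization are handled), so the normalization step is not reproduced here.

```latex
% Sub-problem in F with S and G fixed
J(F) = \|X - F S G^{T}\|^{2} + \alpha \|F - F_{0}\|^{2}
       + \gamma\, \mathrm{tr}\!\left(F^{T} L_{M} F\right),
\qquad L_{M} = D_{M} - M

% Gradient (L_M is symmetric)
\frac{\partial J}{\partial F}
  = -2 X G S^{T} + 2 F S G^{T} G S^{T}
    + 2\alpha (F - F_{0}) + 2\gamma (D_{M} - M) F

% Multiplicative step: negative gradient part over positive gradient part
F_{ij} \leftarrow F_{ij}\,
  \frac{\left(X G S^{T} + \alpha F_{0} + \gamma M F\right)_{ij}}
       {\left(F S G^{T} G S^{T} + \alpha F + \gamma D_{M} F\right)_{ij}}
```

Because every matrix in the ratio is non-negative, the step keeps F non-negative. Under the same assumptions the G step is symmetric: replace X by its transpose, swap the roles of F and G, and use β, λ, N and D_N in place of α, γ, M and D_M.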

Computation of G

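If we fix the variables S and F, solving equation (2) with respect to G is equivalent to minimizing the part of the objective that depends on G, namely the factorization error, the label-fitting penalty on G and the smoothness term $\mathrm{tr}(G^{T} L_{N} G)$ with $L_{N} = D_{N} - N$, subject to the constraints on G. Differentiating this function with respect to G leads to a multiplicative update rule for G. To satisfy the equality constraint, G is normalized after each update.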

Computation of S

After F and G are computed, solving equation (2) with respect to S is equivalent to minimizing the factorization error $\|X - F S G^{T}\|^{2}$, which is the only term that depends on S. Differentiating it with respect to S again leads to a multiplicative update rule. The complete R-NMTF algorithm is outlined in Algorithm 1. Since none of the updating steps for F, S and G increases the objective function, the objective decreases until a lower bound is reached. Empirically, the algorithm converged within 100 iterations in the experiments.
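The alternating scheme can be sketched in NumPy as below. This is an illustrative sketch, not the authors' implementation: it assumes the objective ‖X − FSGᵀ‖² + α‖F − F0‖² + β‖G − G0‖² + γ tr(Fᵀ(D_M − M)F) + λ tr(Gᵀ(D_N − N)G), uses standard ratio-form multiplicative updates, and omits the normalization steps of Algorithm 1 because their exact form is not given in the text above. The function names, the random initialization, the stopping tolerance and the toy data are all assumptions.

```python
import numpy as np

EPS = 1e-12  # small constant guarding against division by zero


def objective(X, F, S, G, F0, G0, M, N, alpha, beta, gamma, lam):
    """Value of the assumed regularized tri-factorization objective."""
    L_M = np.diag(M.sum(axis=1)) - M  # phenotype graph Laplacian
    L_N = np.diag(N.sum(axis=1)) - N  # gene graph Laplacian
    return float(np.sum((X - F @ S @ G.T) ** 2)
                 + alpha * np.sum((F - F0) ** 2)
                 + beta * np.sum((G - G0) ** 2)
                 + gamma * np.trace(F.T @ L_M @ F)
                 + lam * np.trace(G.T @ L_N @ G))


def r_nmtf(X, F0, G0, M, N, alpha=1.0, beta=1.0, gamma=1.0, lam=1.0,
           max_iter=100, tol=1e-9, seed=0):
    """Alternate multiplicative updates of F, G and S; return factors and objective history."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    k1, k2 = F0.shape[1], G0.shape[1]
    # Strictly positive random start so that no entry is stuck at zero
    F = rng.random((m, k1)) + 0.1
    G = rng.random((n, k2)) + 0.1
    S = rng.random((k1, k2)) + 0.1
    D_M = np.diag(M.sum(axis=1))
    D_N = np.diag(N.sum(axis=1))
    args = (F0, G0, M, N, alpha, beta, gamma, lam)
    history = [objective(X, F, S, G, *args)]
    for _ in range(max_iter):
        # Update F with S and G fixed
        F *= (X @ G @ S.T + alpha * F0 + gamma * (M @ F)) / (
            F @ S @ G.T @ G @ S.T + alpha * F + gamma * (D_M @ F) + EPS)
        # Update G with S and F fixed
        G *= (X.T @ F @ S + beta * G0 + lam * (N @ G)) / (
            G @ S.T @ F.T @ F @ S + beta * G + lam * (D_N @ G) + EPS)
        # Update S with F and G fixed
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + EPS)
        history.append(objective(X, F, S, G, *args))
        if abs(history[-2] - history[-1]) <= tol * max(history[-2], 1.0):
            break  # objective has stopped changing
    return F, S, G, history


# Toy problem: two phenotype blocks and two gene blocks (hypothetical data)
block = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
X = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]], dtype=float)
F0 = np.array([[1, 0], [0, 0], [0, 1], [0, 0]], dtype=float)  # phenotypes 0 and 2 labeled
G0 = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)  # genes 0 and 3 labeled
F, S, G, history = r_nmtf(X, F0, G0, M=block, N=block)
```

In this sketch the predicted cluster of a phenotype would be read off as the column with the largest entry in its row of F, and S would be inspected for strong phenotype cluster–gene cluster associations.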

EXPERIMENTS

To evaluate the performance of supervised co-clustering of phenotypes and genes, R-NMTF was applied to classifying OMIM human disease phenotypes and KEGG disease pathway genes with leave-one-out cross-validation. R-NMTF was compared with several baseline methods, including support vector machines (SVMs), Label Propagation (LP) and an NMTF model without network regularization, defined in equation (1). R-NMTF was then applied to classify unannotated OMIM disease phenotypes and identify new member genes of KEGG disease pathways. The predictions were verified and analyzed by comparison with the human phenotype ontology (HPO) and a literature survey.

Data preparation

We collected the disease phenotype–gene associations in OMIM, which consist of the associations between 1284 disease phenotypes and 1777 disease genes. We also collected 200 KEGG pathways, which contain 4128 genes in total, from the Molecular Signatures Database (40). We obtained the human protein–protein interaction (PPI) network from HPRD (41). The PPI network contains 76 232 binary undirected interactions between 9667 genes. We obtained the phenotype similarity network from (21). The phenotype similarity network is an undirected graph with 5080 vertices representing OMIM disease phenotypes, and edges weighted by a number in [0,1]. The edge weights measure the similarity between phenotypes by their overlap in the text and the clinical synopsis in OMIM records, calculated by text mining (21). For the leave-one-out cross-validation, after preprocessing (removing the phenotypes classified as multiple and unclassified, removing disease phenotypes not present in both the disease phenotype–gene associations and the phenotype similarity network, and removing genes not present in both the disease phenotype–gene associations and the PPI network), we generated a dataset containing 590 disease phenotypes in 20 disease classes (25) and 7997 genes in 200 gene pathways. This dataset was used in leave-one-out cross-validation on disease phenotype classification and disease pathway gene discovery. To further evaluate R-NMTF with more phenotypes and other independent phenotype annotations, we generated another, larger dataset containing 1325 disease phenotypes with at least one known causal gene in OMIM. Among the 1325 disease phenotypes, 501 intersect with the labeled disease phenotypes in the first dataset and the remaining 824 are unlabeled. Our task in this experiment is to perform a supervised clustering to assign the 824 unannotated disease phenotypes to the 20 disease classes.

Baselines and parameter tuning

Four baselines were introduced for comparison with R-NMTF: SVMs with a linear kernel and with a radial basis function (rbf) kernel, LP, and the NMTF model defined in equation (1) without the prior knowledge from the phenotype network and the PPI network (named NMTF). The SVMs used a binary vector representing the disease genes of each phenotype as the features for classification (25). We also tested SVMs with the similarity scores in the phenotype similarity network as features for classification. Since the results were close to random, we did not report them in the experiments. We also compared R-NMTF with a semi-supervised learning method, LP, which uses the disease similarity network and the PPI network for disease phenotype classification and disease gene discovery, respectively (42). The hyper-parameters (α and β for NMTF; α, β, γ and λ for R-NMTF; and C and σ for SVMs) were chosen by a grid search in {10⁻³, 10⁻², 10⁻¹, 1, 10, 100}. The hyper-parameter α for LP was chosen by a grid search in {0.1, 0.3, 0.5, 0.7, 0.9}. More analysis of parameter tuning is described in Supplementary Tables S1 and S2. In the leave-one-out cross-validation on the 590 labeled phenotypes in disease phenotype classification, we held out one phenotype as the test case to be classified by all the compared methods. The performance is measured by the rank of the true disease class among the 20 target classes, ranked by the corresponding classification scores generated by a classification method. Similarly, in the leave-one-out cross-validation for disease gene discovery on the same data, we held out one gene in a KEGG disease pathway as the test case to be classified by all the compared methods. Since one gene could belong to multiple disease pathways, the performance is measured by the area under the receiver operating characteristic curve (AUC). Since leave-one-out cross-validation usually gives less overfitting bias, we reported the results with the best parameters for all the methods in the experiments on both disease phenotype classification and disease gene discovery.
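The two evaluation measures described above can be sketched in plain Python as follows. The function names are hypothetical, and two details are assumptions because the text does not specify them: ties in the class scores are resolved optimistically when computing the rank, and tied positive/negative pairs count as one half in the AUC.

```python
def rank_of_true_class(scores, true_index):
    """1-based rank of the true class when classes are sorted by decreasing score.

    Assumption: classes tied with the true class do not push its rank down.
    """
    true_score = scores[true_index]
    return 1 + sum(s > true_score for s in scores)


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 for pathways the held-out gene belongs to and 0 otherwise;
    tied pairs contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return correct / (len(pos) * len(neg))


# Toy scores for one held-out phenotype over three classes; the true class is index 2
example_rank = rank_of_true_class([0.1, 0.7, 0.2], true_index=2)

# Toy scores for one held-out gene over four pathways; it belongs to pathways 0 and 2
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The rank measure suits the single-label phenotype task, while the pairwise AUC handles genes that belong to several pathways at once.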

Performance of disease phenotype classification in leave-one-out cross-validation

The average ranking performance of the compared methods is reported in Table 2 and Figure 2. On average, R-NMTF was able to rank the target class at around third out of the 20 classes, while the other methods performed worse. To further assess the statistical significance of the difference in performance between R-NMTF and the baselines, we also report the pairwise comparison of each test case and performed a Wilcoxon test on the difference of the ranks in Table 2. The P-values suggest that R-NMTF performed significantly better than the baselines. Supplementary Figure S1 visualizes the pairwise comparison between R-NMTF and the baselines by scatter plot. Many more cases appeared in the top left triangle, indicating a better ranking by R-NMTF. LP performed worse than R-NMTF but better than SVMs and NMTF. This observation indicates that the global structural information in the phenotype similarity network provides substantial information on phenotype classes. To further understand the classification performance in each disease class, we show in Table 3 the classification performance for the phenotypes by disease classes. R-NMTF outperformed all the baseline methods in 11 disease classes. In some of the small classes such as ‘ear, nose, throat’, ‘nutritional’ and ‘respiratory’, fewer relations among the training points are available for R-NMTF to improve classification.

Table 2.

Performance of phenotype classification in leave-one-out cross-validation

Compared methods	Avg. rank	win/draw/loss (P-value)
R-NMTF versus NMTF	3.124 versus 5.590	300/154/136 (4.617e−13)
versus SVM-linear	versus 6.103	308/154/128 (3.693e−12)
versus SVM-rbf	versus 5.037	268/213/109 (1.497e−4)
versus LP	versus 3.700	161/388/41 (9.145e−05)

This table reports the average rank of the target class out of the 20 classes, and the pairwise ‘win/draw/loss’ comparisons of each leave-one-out case between R-NMTF and the baselines, SVMs with linear and rbf kernels, NMTF and LP. The last column reports the statistical significance of the ranking results using Wilcoxon rank sum test.
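The ‘win/draw/loss’ columns can be reproduced from two lists of per-case ranks, as in the minimal sketch below (the function name and the toy ranks are hypothetical). A lower rank is better, so a win for the first method means it ranked the target class strictly higher on that held-out case. The three counts in each row of Table 2 sum to the 590 held-out phenotypes.

```python
def win_draw_loss(ranks_a, ranks_b):
    """Count cases where method A ranks the target class better, equally, or worse than B."""
    wins = sum(a < b for a, b in zip(ranks_a, ranks_b))    # lower rank is better
    draws = sum(a == b for a, b in zip(ranks_a, ranks_b))
    losses = sum(a > b for a, b in zip(ranks_a, ranks_b))
    return wins, draws, losses


# Toy ranks of the target class for four held-out phenotypes
counts = win_draw_loss([1, 2, 3, 5], [2, 2, 1, 9])
```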

Figure 2.

Performance of phenotype classification in leave-one-out cross-validation. In this plot, the x-axis represents the cutoffs of the rank of the target disease class out of the 20 classes. The y-axis represents the fraction of phenotypes with their target disease class ranked within a certain cutoff. For example, R-NMTF ranked the target class of >60% of the phenotypes within Rank 2, while the other methods only ranked around or <50% within the same rank cutoff.

Table 3.

Disease phenotype classification results by disease classes

Disease classes (No)	Avg. rank
	R-NMTF	NMTF	SVM-linear	SVM-rbf	LP
Bone (23)	3.3	8.5	4.7	7.6	4.7
Cancer (53)	1.6	5.0	4.2	2.0	1.9
Cardiovascular (28)	3.8	10.1	10.0	6.0	4.3
Connective tissue (16)	8.5	8.9	10.6	11.4	11.1
Dermatological (32)	2.0	4.4	3.0	4.0	2.5
Developmental (28)	5.7	2.5	9.6	9.2	6.5
Ear,Nose,Throat (3)	20.0	20.0	14.7	15.0	16.7
Endocrine (30)	4.2	5.4	13.4	5.4	4.9
Gastrointestinal (12)	9.7	7.8	7.8	9.7	11.7
Hematological (30)	3.5	9.5	2.3	6.9	3.8
Immunological (31)	2.6	10.0	8.1	5.2	2.8
Metabolic (84)	1.0	2.2	4.1	2.2	1.0
Muscular (18)	5.7	5.3	12.2	9.1	7.3
Neurological (80)	1.4	6.2	5.8	2.7	1.4
Nutritional (2)	16.0	3.0	19.0	2.0	20
Ophthalmological (35)	1.9	4.2	2.5	2.9	2.5
Psychiatric (9)	7.9	6.1	8.0	11.4	14.8
Renal (23)	4.1	3.5	4.4	6.8	4.9
Respiratory (7)	15.4	10.4	10.4	14.1	15.7
Skeletal (46)	1.5	3.3	4.8	5.2	1.8

This table reports the ranking performance by R-NMTF, SVM with linear and rbf kernels, NMTF and LP in each disease class in the leave-one-out cross-validation. The number of phenotypes in each disease class is reported in the parentheses.

Performance of disease gene discovery in leave-one-out cross-validation

In the experiment of disease gene discovery, we collected the member genes in the 200 pathways from KEGG. In the preprocessed data, there are 590 member genes in 27 KEGG disease pathways such as Alzheimer, diabetes and cancer-related pathways. In the leave-one-out cross-validation, each of the 590 member genes was held out and then classified into the 200 pathways as a multi-label classification problem, since some of the disease genes are members of multiple pathways. The higher the target pathways appear in the ranking of the 200 pathways, the better the performance. We measured the performance by the AUC. LP was applied on the PPI network to predict the disease genes as the baseline. The other 589 member genes were used as the initialization of label propagation to classify the held-out gene. The average AUC across the 590 member genes for all the methods is reported in Table 4 and Figure 3. The results clearly show that, by integrating phenotype similarity, phenotype class annotation and phenotype–gene associations with the PPI network, R-NMTF classified the disease genes more accurately than LP, which only uses the PPI network for disease gene discovery. R-NMTF performed better on >500 cases, with an average AUC of 0.930 compared with 0.730 by LP.

Table 4.

Performance of disease gene discovery in leave-one-out cross-validation

Compared methods	Avg. AUC	win/draw/loss (P-value)
R-NMTF versus LP	0.930 versus 0.730	526/1/63 (5.4482e−113)

This table reports the average AUC for disease gene classification, and the pairwise ‘win/draw/loss’ comparisons of each leave-one-out case between R-NMTF and LP. The last column reports the statistical significance of ranking results using Wilcoxon rank sum test.

Figure 3.

Performance of disease gene discovery in leave-one-out cross-validation. In the plot, the x-axis represents AUC cutoffs. The y-axis represents the fraction of disease genes with an AUC score above the cutoffs. For example, R-NMTF achieved AUCs above 0.9 for >80% of the genes, while LP only achieved the same level of AUC for 20% of the genes.

Analysis of phenotype clusters with HPO

To better characterize the discovered phenotype clusters for the 824 unannotated disease phenotypes, we compared the phenotype clusters with HPO (43). HPO describes human phenomic abnormalities with a controlled hierarchical vocabulary. Since the vocabulary in the HPO was developed independently of the disease classification by (25), it is an external resource for the validation of the phenotype clusters discovered by R-NMTF. Each OMIM phenotype was mapped to the hierarchy of HPO to retrieve the matched HPO terms. Then, a new HPO similarity is calculated for each pair of phenotypes by the Jaccard similarity coefficient

$$J(P_{1}, P_{2}) = \frac{|P_{1} \cap P_{2}|}{|P_{1} \cup P_{2}|}$$

where P1 and P2 are the sets of the matched HPO terms of the two phenotypes, respectively. We arranged the phenotypes into the 20 disease classes (clusters) based on the R-NMTF clustering, and show their HPO similarity by a heat map in Figure 4. There are clear block structures among the predicted 20 clusters. Most of the phenotypes in the same cluster also share strong HPO similarity. The consistency between the predicted disease clusters and HPO similarities suggests that R-NMTF produced a phenotype clustering supported by HPO annotations. Another interesting observation is that there are also strong HPO similarities between different clusters (i.e. different disease classes share HPO similarities). This may imply that some of the disease classes, such as skeletal diseases and developmental diseases, share common molecular mechanisms.
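The HPO similarity between two phenotypes is the standard Jaccard coefficient of their term sets, which can be sketched in a few lines of Python. The function name and the toy term labels below are hypothetical (they are not real HPO identifiers), and returning 0 for two empty term sets is an assumption made to avoid division by zero.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of HPO terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two phenotypes without any matched term are dissimilar
    return len(a & b) / len(union)


# Toy term sets for two phenotypes (hypothetical labels)
p1 = {"term_a", "term_b", "term_c"}
p2 = {"term_b", "term_c", "term_d"}
similarity = jaccard(p1, p2)
```

Applying this function to every pair of phenotypes yields the symmetric similarity matrix that a heat map such as Figure 4 displays.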

Figure 4.

HPO phenotype similarities by clusters. The HPO similarity matrix of the phenotypes is displayed as a heat map. The phenotypes are grouped into 20 clusters with the disease classes annotated below.

Analysis of new phenotypes in disease classes

Table 5 lists the newly predicted disease phenotypes in the 20 disease classes. Our survey identified supporting literature for many of the predictions. One interesting finding is Fanconi anemia (FA) (OMIM:227650), a rare, inherited blood disorder, predicted as a cancer-related disease. Surprisingly, a recent study found that FA could share a common pathogenesis with diseases related to chromosomal instability, including cancers, and suggested a possible use of cancer treatment for patients with FA (48). R-NMTF also predicted Proteus syndrome (OMIM:176920) as a cancer-related disease. PTEN, a well-known tumor suppressor gene, is a known causative gene for Proteus syndrome, which may indicate that the cancer risk accompanying Proteus syndrome could be increased (49–52). Other interesting newly predicted disease phenotypes are amyotrophic lateral sclerosis (ALS) (OMIM:105400), also known as Lou Gehrig's disease, in the neurological disease class, and Gambling, pathologic (OMIM:606349) in the psychiatric disease class. ALS is a disease of the nerve cells in the brain and causes unstable muscle movement. Gambling, pathologic is a disabling disorder characterized by a failure to resist impulses to gamble, and it is known to frequently co-occur with other psychiatric disorders (85,86). R-NMTF also accurately predicted a few disease phenotypes, including juvenile myelomonocytic leukemia and breast cancer, which were previously missed in the annotation of the cancer disease class (25). These findings suggest that R-NMTF could correctly classify complex and rare disease phenotypes into their relevant disease classes, which could be used to guide clinical decisions.

Table 5.

New disease phenotypes in 20 disease classes

Disease classes	New disease phenotypes
Bone	Achondrogenesis, Type III (44)	Canine Teeth (Omim:114600)	Dens Evaginatus (45)	Dental Noneruption (46)	Dentin Dysplasia, Type I(47)
Cancer	Fanconi Anemia (48)	Juvenile Myelomonocytic Leukemia	Breast Cancer	Proteus Syndrome (49,50,51,52)	Bannayan-Riley-Ruvalcaba Syndrome (53,54)
Cardiovascular	Cardiomyopathy (Omim:192600)	Atrial Standstill (55)	Cardiomyopathy, Dilated, 1E	Long Qt Syndrome 3 (56,57)	Sudden Infant Death Syndrome (58)
Connective tissue	Arthritis, Sacroiliac (59)	Spondyloarthropathy (Omim:183840)	Slipped Femoral Capital Epiphyses (60)	Facial Asymmetry (61)	Cervical Rib
Dermatological	Deafness; Dfna3 (62)	Epidermolysis Bullosa (Omim:131800)	Pachyonychia Congenita, Type 1 (63)	Epidermolysis Bullosa Herpetiformis (64)	Epidermolysis Bullosa Simplex, Koebner Type (64)
Developmental	Leucine Transport, High	Uterine Anomalies (65)	Testes, Rudimentary (66)	Oligosynaptic Infertility	Hypospadias, Autosomal (67)
Ear,Nose,Throat	Otosclerosis 3 (68)	Otosclerosis 2 (68)	Otosclerosis 5 (68)	Periodontitis, Aggressive, 2	Red Cell Permeability Defect
Endocrine	Diabetes Mellitus	Hypoglycemia (Omim:601820) (69)	Polycystic Ovary Syndrome 1 (70)	Diabetes Mellitus, Transient Neonatal	Goiter, Multinodular 2 (71)
Gastrointestinal	Cholestasis2 (Omim:605479) (72)	Bile Acid, Synthetic Defect Of	Cholestasis; Pfic2 (Omim:601847) (72)	Cholestasis; Pfic3 (Omim:602347) (72)	Pancreatitis, Hereditary (73)
Hematological	Anemia (74)	Hyperheparinemia	Sideroblastic Anemia, Autosomal (75)	Platelet Groups–ko System	Anemia, Familial Pyridoxine-Responsive (76)
Immunological	Herpesvirus Sensitivity (77)	Interleukin (Omim:243110) (78)	Panbronchiolitis, Diffuse (79)	Immune Deficiency Disease	Allergic Bronchopulmonary Aspergillosis (80)
Metabolic	Immunoglobulin D Level In Plasma	Magnesium, Elevated Red Cell	Flood Factor Deficiency	Citrulline Transport Defect	Amobarbital, Deficient N-Hydroxylation of
Muscular	Palmomental Reflex	Myopathy (Omim:255100)	Muscular Hypoplasia	Pleoconial Myopathy With Salt Craving	Myopathy, Congenital
Neurological	Amyotrophic Lateral Sclerosis 1	Amyotrophic Lateral Sclerosis 2	Alzheimer Disease 2	Prion Disease (Omim:603218)	Frontotemporal Dementia (Omim:607485)
Nutritional	Bulimia Nervosa	Red Cell Permeability Defect	Labia Minora (Omim:149600) (81)	Schizophrenia 9 (82)	Amyotrophic Lateral Sclerosis 6 (83)
Ophthalmological	Cone Dystrophy 3	Cone-Rod Dystrophy 3	Leber Congenital Amaurosis	Cone-Rod Dystrophy 6	Retinitis Pigmentosa 19
Psychiatric	Fg Syndrome 2 (86)	Fg Syndrome 3 (84)	Schizophrenia 5	Cerebral Angiopathy, Dysphoric (85,86)	Gambling, Pathologic
Renal	Nephrotic Syndrome, Type 2 (87,88)	Hypertensive Nephropathy (89)	Enuresis, Nocturnal, 2 (90)	Enuresis, Nocturnal, 1 (90)	Blue Diaper Syndrome
Respiratory	Hemangiomatosis	Respiratory Underresponsiveness	Emphysema (Omim:130700)	Asthma, Short Stature, and Elevated Iga	Asthma-Related Traits, Susceptibility To, 1
Skeletal	Brachydactyly, Mononen Type	Tibial Hemimelia (91)	Acropectoral Syndrome	Syndactyly, Type IV	Spondyloepimetaphyseal Dysplasia, Irapa Type

The 5 most confident predictions of phenotypes in each disease class are reported.


Analysis of new member genes in disease pathways

KEGG provides a list of manually curated disease pathways. However, current knowledge of disease-related biological pathways is still incomplete and inaccurate, and many member genes are missing from these pathways. Table 6 lists the newly predicted member genes in the KEGG disease pathways. Our literature review identified supporting evidence for many of the predictions. Interesting examples include TMED10 and PRND, which are newly predicted member genes in the Alzheimer's disease pathway and the prion disease pathway, respectively. TMED10 inhibits production of amyloid beta peptides, which is a critical feature of Alzheimer's disease, and mutations in PRND (prion protein 2) may lead to neurological disorders. Other examples include EXO1 and ADIPOR1 in the colorectal cancer pathway and FGFR3 and FGFR4 in the melanoma pathway. Single nucleotide polymorphisms in EXO1 increase the risk of colorectal cancer (106,107), and expression of ADIPOR1 is known to be involved in colorectal cancer progression (108,109). Mutations in FGFR3 and FGFR4 were previously described in melanoma (121).

Table 6.

New member genes of KEGG disease pathways

Kegg disease pathways	New member genes
Hsa04930: Type II Diabetes Mellitus	KCNJ8 (92)	EFHC1	ADIPOR2 (93)	ABCC9	LDHA	CDH13 (94)	ENSA	CRYBB1	CASR	KCNJ2
Hsa04940: Type I Diabetes Mellitus	CKAP5	SPTBN4	PTPRT	SNX19	CD74	LILRB1 (95)	LILRB2	GAST	LRRC23	CTLA4 (96)
Hsa04950: Maturity Onset Diabetes of the Young	OLIG2	EN2	PCSK1	PNRC1	PCSK2	GATA5	GATA6	PNRC2	OTX2	RAMP2
Hsa05010: Alzheimers Disease	TMED10 (97)	BRI3	PTX3	APH1B (98)	TFCP2 (99)	HRG	C1R	FKBP2	KHSRP	NEDD8 (100)
Hsa05020: Parkinsons Disease	ARIH1 (101)	AMFR	AGXT	TRIM25	CCNB1IP1	GAN	TMCC2	STUB1	SH2D3C	SLC6A1
Hsa05030: ALS	SSR3	JUB	ALS2CL (102)	APBA1	MTMR2	ABL2	HOXB2	RAB37	PKN1 (103)	CHML
Hsa05040: Huntingtons Disease	HIP1R (104)	SNX5	IFT20	PICALM	RPS10	PQBP1	NECAP1	ARF1	KPNA4	MBTPS1
Hsa05050: Dentatorubropallidoluysian Atrophy	ALG13	TRIM22	CLCN5	ECM1	MYST3	NET1	SYNPO	EFEMP1	CPSF6	NDFIP2
Hsa05060: Prion Disease	PRND (106)	CHD6	LAMA2	RPS21	EIF2AK3	KEAP1	ADAM23	DPP6	MOG	OPCML
Hsa05110: Cholera Infection	SERP1	SEC63	ARFIP2	APOB	ARFIP1	PIP5K1A	FLAD1	TRAM1	ETHE1	AP1B1
Hsa05120: Epithelial Cell Signaling in Helicobacter Pylori Infection	GRLF1	ETHE1	HBA1	EFNA2	TOMM34	DARC	ADD2	SH3D19	PFKM	ANG
Hsa05130: Pathogenic Escherichia Coli Infection Ehec	ARPC4	GRM7	HS1BP3	CGN	PLA2G7	KIAA1543	LAPTM4A	NOX4	ACTR2	SSB
Hsa05210: Colorectal Cancer	EXO1 (106,107)	ADIPOR1 (108,109)	MUTYH (111)	PMS2	CDCA8	ROR2	PMS1	MAZ	WNT5A	WNT7A
Hsa05211: Renal Cell Carcinoma	HIF3A	OS9	EGLN2	ING4	ARNTL2	SIM1	ASB8	LRRC41	SENP6	SIM2
Hsa05212: Pancreatic Cancer	REPS1	REPS2	PLCD1	SHFM1	EXOC1	RAD51AP1	RAD54L	RALGPS1	EXOC5	EXOC3
Hsa05213: Endometrial Cancer	MSR1	BRCA2 (112)	NF1	MXI1 (113)	RNASEL	FH	MSH2	ELAC2	MAD1L1	CHEK2
Hsa05214: Glioma	PDAP1	KIAA1683	RHBDF1	RPS18	ART1	BRD2	NKD2	MYO10	TFDP2	SETD8
Hsa05215: Prostate Cancer	KRT27	MTTP	ATF6 (114)	PTHLH (115)	SEMG1	ATF2 (116)	G6PC	NFIL3 (117)	ASGR1	MALL
Hsa05216: Thyroid Cancer	TSSK2	TMOD2	RNF14	TRIM25	PPP4C	IFI16	CNN1	TMOD1	S100A2	NUP98
Hsa05217: Basal Cell Carcinoma	IHH	DHH	ZIC1	ZIC2	PORCN	SFRP1	ROR2	FRMPD4	GPC3	GAS1
Hsa05218: Melanoma	FGFR4 (118)	FGFR2 (119)	PHEX	FGFR3 (120,121)	SCN8A	EBNA1BP2	RPS2	MAPK8IP2	TFEB	PDAP1
Hsa05219: Bladder Cancer	MLC1	UNC5B (122)	UNC5A	PAWR	AATF	TNXB	CAMK2A	RECK	HIST3H2A	ATF4 (123)
Hsa05220: Chronic Myeloid Leukemia	APBA3	MAP4K5 (124)	BAZ2B	KLF3	TDGF1	MAPK4	FMOD	RAI2	ELF2	SPRY2 (125)
Hsa05221: Acute Myeloid Leukemia	RPL21	NDUFB8 (126)	FBXO18	GATA2 (127)	CEBPD (128)	GFI1 (129)	TAF9B	MYST3 (130)	CBFA2T3	NFATC1
Hsa05222: Small Cell Lung Cancer	CKS2	BCKDK	TBC1D8	TNFRSF19	DUSP1	TNFRSF4	TNFRSF12A	NGFRAP1	LTBR	MAP6
Hsa05223: Non Small Cell Lung Cancer	FDXR (131)	LATS1 (132)	MAP6	NR1H2 (133)	PRKRIR	CSN1S1	NR1H3	CNKSR1	FOXG1 (134)	PNRC1

The 10 most confident predictions of member genes in KEGG disease pathways are reported.

We also provide a network view of three example disease pathways, extended with the newly predicted member genes, in Figure 5. These examples demonstrate that, although KEGG disease pathways are manually curated, member genes are still missing from the pathways. One example is WNT5A, a newly predicted member gene of the colorectal cancer pathway (Figure 5A). A recent study showed that WNT5A is a potential biomarker for colorectal cancer and could act as a tumor suppressor by antagonizing the WNT signaling pathway (135). Another example is FGFR3, a newly predicted member gene of the melanoma pathway (Figure 5C). Mutation and overexpression of FGFR3 have been shown to be associated with survival of melanoma patients (136). However, FGFR3 was not annotated in the melanoma pathway, although it interacts with several members of the pathway. The network views of all 27 expanded KEGG disease pathways with newly predicted member genes are available at the article's Supplementary Web. These results support the conclusion that R-NMTF correctly predicted new member genes in several disease-related pathways, and that these novel disease genes could play important roles in the disease pathways.

Figure 5.

PPI subnetworks of the extended disease pathways. In each pathway, gray nodes are known member genes and red nodes are newly predicted member genes. Edges represent PPIs between two genes. A known or newly predicted member gene that does not interact with any other member gene in the pathway is not shown. (A) Colorectal cancer pathway. The predicted colorectal cancer genes EXO1 and ADIPOR1 interact with many other genes in the colorectal cancer pathway. (B) Alzheimer's disease pathway. Over-expression of C1R is known to be involved in Alzheimer's disease. (C) Melanoma pathway. Mutations and copy number changes in the new member gene FGFR3 were recently discovered in melanoma.
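The construction rule in the Figure 5 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pathway_subnetwork`, the node labels and the toy gene identifiers (`K1`, `P1`, `X9`, ...) are hypothetical and are not taken from the paper or from any PPI database.

```python
def pathway_subnetwork(known, predicted, ppi_edges):
    """Build the PPI subnetwork of an extended pathway.

    known      -- set of known member genes (drawn gray in Figure 5)
    predicted  -- set of newly predicted member genes (drawn red)
    ppi_edges  -- iterable of (gene_a, gene_b) interaction pairs

    Returns (nodes, edges): nodes maps each retained gene to "known" or
    "predicted"; edges is a sorted list of undirected member-member PPIs.
    Genes with no interaction to another member gene are dropped.
    """
    members = set(known) | set(predicted)
    # Keep only interactions whose two endpoints are both pathway members;
    # store each undirected edge once and ignore self-interactions.
    edges = {
        tuple(sorted((a, b)))
        for a, b in ppi_edges
        if a != b and a in members and b in members
    }
    # A gene is shown only if it takes part in at least one retained edge.
    connected = {gene for edge in edges for gene in edge}
    nodes = {
        gene: ("predicted" if gene in predicted else "known")
        for gene in sorted(connected)
    }
    return nodes, sorted(edges)


# Toy input with placeholder identifiers (not real interaction data).
toy_known = {"K1", "K2", "K3"}
toy_predicted = {"P1", "P2"}
toy_ppi = [("K1", "K2"), ("P1", "K1"), ("K3", "X9"), ("P2", "P2")]

toy_nodes, toy_edges = pathway_subnetwork(toy_known, toy_predicted, toy_ppi)
print(toy_nodes)
print(toy_edges)
```

In this toy input, `K3` interacts only with a non-member gene and `P2` has only a self-interaction, so neither appears in the resulting subnetwork.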

Analysis of predicted disease phenotype cluster–gene cluster associations

We evaluated the predicted disease phenotype cluster–gene cluster associations by a literature survey, after performing two-way hierarchical clustering on the predicted associations. Figure 6 shows the predicted associations between 20 disease phenotype clusters and 200 gene clusters (pathways). Interesting examples are the manually curated KEGG disease pathways, which include pathways related to cancers, neurological diseases and psychiatric diseases. R-NMTF accurately predicted associations between many of these disease-related pathways and the related disease classes. For example, many cancer-related pathways, including the colorectal, pancreatic, bladder, non-small cell lung, glioma and prostate cancer pathways, were correctly identified as cancer pathways. We also identified a set of biological pathways, such as apoptosis, p53 signaling, ERBB signaling and hedgehog signaling, which are known to contribute to tumorigenesis and are targets of many anti-cancer drugs (137–141). Other interesting examples are the pathways predicted to be associated with the neurological and psychiatric disease classes. Prion disease is a well-known rare progressive neurodegenerative disorder that affects both humans and animals. R-NMTF accurately predicted the prion disease pathway as one of the pathways associated with the neurological disease class. The MAPK pathway is predicted to be associated with both the neurological and psychiatric disease classes. Recent studies reported that activation of the MAPK pathway could play a role in Alzheimer's disease and in psychiatric disorders such as anxiety, depression and schizophrenia (142,143). R-NMTF also correctly predicted the Huntington's disease pathway to be associated with neurological and psychiatric diseases.
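As a rough illustration of two-way hierarchical clustering on a binary association matrix, the sketch below reorders the rows and columns of a small toy matrix so that similar association profiles sit next to each other. The text does not state which distance or linkage was used, so the Jaccard distance and average linkage here are assumptions. The function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def jaccard_distance(u, v):
    """1 - |intersection| / |union| of two binary vectors (0.0 if both are empty)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0


def leaf_order(vectors):
    """Average-linkage agglomerative clustering; returns item indices in merge order."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(jaccard_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Merge the closest pair of clusters (first pair wins on ties).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0] if clusters else []


def two_way_cluster(matrix):
    """Cluster rows and columns independently and return the reordered matrix."""
    row_order = leaf_order(matrix)
    col_order = leaf_order([list(col) for col in zip(*matrix)])
    reordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return row_order, col_order, reordered


# Toy 4 x 4 association matrix (rows: phenotype clusters, columns: gene clusters).
toy = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
rows, cols, blocks = two_way_cluster(toy)
for line in blocks:
    print(line)
```

After reordering, the two groups of identical rows and columns form two diagonal blocks, which is the kind of structure a heat map such as Figure 6 makes visible. For matrices of realistic size, a dedicated clustering library would be a more practical choice than this quadratic-per-merge loop.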

Figure 6.

Predicted associations between disease classes and pathways. Each red entry represents a predicted association between one of the 20 disease classes and one of the 200 KEGG pathways.

DISCUSSION

The number of documented disease phenotypes and phenotype–gene associations is increasing quickly. Since 2007, the number of OMIM disease–gene associations has nearly doubled. These established associations provide valuable resources not only for predicting novel associations but also for understanding disease phenotypes. Our work in this article explored this possibility and reported promising results. Recently, phenotype databases have been proposed for many species and are becoming comprehensive and systematic. R-NMTF will be a useful model for analyzing these new ‘phenomes’. Moreover, R-NMTF also identifies pathways associated with disease phenotype clusters. Since many drugs are developed to target proteins that act in disease-related pathways, precise identification of the members of disease pathways could accelerate the development of more efficient targeted therapies, as well as improve understanding of the molecular mechanisms underlying complex human diseases. More recently, cross-species phenotype–gene association analysis based on orthologous genes and similar phenotypes has been performed (144). An interesting future direction is to extend R-NMTF to perform cross-species phenome–genome co-clustering. It is also possible to apply other advanced machine learning models to integrate the phenotype similarity network and the PPI network with phenotype–gene association data for co-clustering phenotypes and genes. More refined modeling might lead to further improvement in phenotype classification and disease–gene discovery.

Previously, regularized NMTF models were only proposed for applications in image and document classification. Gu and Zhou (145) introduced dual regularized co-clustering (DRCC), which extended NMTF by incorporating graph Laplacians as additional regularization terms in the objective function. DRCC was applied to classify images, documents and newsgroups. Zhuang et al. (38) introduced a matrix tri-factorization-based classification framework (MTrick) for transfer learning. MTrick first learns an association matrix S from the source domain by performing non-negative tri-factorization and then incorporates the inferred association matrix S into the non-negative tri-factorization for target domain classification. R-NMTF introduces regularization terms for label information from both phenotype and gene clusters; thus R-NMTF is a supervised co-clustering method, whereas DRCC is unsupervised. Compared with MTrick, which only uses label information, R-NMTF also incorporates the prior knowledge in the phenotype similarity network and the PPI network to cluster phenotypes and genes with matrix tri-factorization. To the best of our knowledge, no previous NMF-based model has been applied to clustering phenotypes or analyzing disease phenotype–gene associations. R-NMTF is an advanced model that integrates phenome, genome and interactome information for both problems.
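To make the comparison concrete, a generic objective for a graph-regularized, label-supervised non-negative tri-factorization can be written as below. This is a schematic form assembled from the description above, not the exact objective or notation used in the Methods. All symbols are assumptions introduced here: R is the phenotype–gene association matrix, F and G are the phenotype and gene cluster indicator matrices, S is the cluster association matrix, L_P and L_G are the graph Laplacians of the phenotype similarity network and the PPI network, Y_P and Y_G hold the known class and pathway labels, F_l and G_l are the rows of F and G for labeled items, and α, β, γ, δ are trade-off weights.

```latex
\min_{F \ge 0,\; S \ge 0,\; G \ge 0}\;
  \lVert R - F S G^{\top} \rVert_F^2
  + \alpha \, \mathrm{tr}\!\left( F^{\top} L_P F \right)
  + \beta  \, \mathrm{tr}\!\left( G^{\top} L_G G \right)
  + \gamma \, \lVert F_{l} - Y_P \rVert_F^2
  + \delta \, \lVert G_{l} - Y_G \rVert_F^2
```

In this schematic, setting γ = δ = 0 leaves only the reconstruction and graph Laplacian terms, which corresponds to an unsupervised DRCC-style model, while setting α = β = 0 leaves a model that uses label information but no network priors.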

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1 and 2 and Supplementary Figure 1.

FUNDING

Funding for open access charge: National Science Foundation [III-1117153]. Conflict of interest statement. None declared.

132 in total

1. Single nucleotide polymorphisms in the EXO1 gene and risk of colorectal cancer in a Japanese population.

Authors: Hiromasa Yamamoto; Hiroko Hanafusa; Mamoru Ouchida; Masaaki Yano; Hiromitsu Suzuki; Masakazu Murakami; Motoi Aoe; Nobuyoshi Shimizu; Kei Nakachi; Kenji Shimizu
Journal: Carcinogenesis Date: 2004-11-18 Impact factor: 4.944

2. Ionizing radiation induces prostate cancer neuroendocrine differentiation through interplay of CREB and ATF2: implications for disease progression.

Authors: Xuehong Deng; Han Liu; Jiaoti Huang; Liang Cheng; Evan T Keller; Sarah J Parsons; Chang-Deng Hu
Journal: Cancer Res Date: 2008-12-01 Impact factor: 12.701

Review 3. Failure of fat cell proliferation, mitochondrial function and fat oxidation results in ectopic fat storage, insulin resistance and type II diabetes mellitus.

Authors: L Heilbronn; S R Smith; E Ravussin
Journal: Int J Obes Relat Metab Disord Date: 2004-12

4. NPHS2, encoding the glomerular protein podocin, is mutated in autosomal recessive steroid-resistant nephrotic syndrome.

Authors: N Boute; O Gribouval; S Roselli; F Benessy; H Lee; A Fuchshuber; K Dahan; M C Gubler; P Niaudet; C Antignac
Journal: Nat Genet Date: 2000-04 Impact factor: 38.330

5. Two subclasses of lung squamous cell carcinoma with different gene expression profiles and prognosis identified by hierarchical clustering and non-negative matrix factorization.

Authors: Kentaro Inamura; Takeshi Fujiwara; Yujin Hoshida; Takayuki Isagawa; Michael H Jones; Carl Virtanen; Miyuki Shimane; Yukitoshi Satoh; Sakae Okumura; Ken Nakagawa; Eiju Tsuchiya; Shumpei Ishikawa; Hiroyuki Aburatani; Hitoshi Nomura; Yuichi Ishikawa
Journal: Oncogene Date: 2005-10-27 Impact factor: 9.867

Review 6. Polycystic ovary syndrome.

Authors: S Franks
Journal: N Engl J Med Date: 1995-09-28 Impact factor: 91.245

7. Isolated tibial hemimelia in sibs: an autosomal-recessive disorder?

Authors: M McKay; S K Clarren; R Zorn
Journal: Am J Med Genet Date: 1984-03

8. Ranking candidate genes in rat models of type 2 diabetes.

Authors: Lars Andersson; Greta Petersen; Fredrik Ståhl
Journal: Theor Biol Med Model Date: 2009-07-03 Impact factor: 2.432

Review 9. Nutrition and schizophrenia: beyond omega-3 fatty acids.

Authors: Malcolm Peet
Journal: Prostaglandins Leukot Essent Fatty Acids Date: 2004-04 Impact factor: 4.006

10. Gain-of-function mutation of GATA-2 in acute myeloid transformation of chronic myeloid leukemia.

Authors: Su-Jiang Zhang; Li-Yuan Ma; Qiu-Hua Huang; Guo Li; Bai-Wei Gu; Xiao-Dong Gao; Jing-Yi Shi; Yue-Ying Wang; Li Gao; Xun Cai; Rui-Bao Ren; Jiang Zhu; Zhu Chen; Sai-Juan Chen
Journal: Proc Natl Acad Sci U S A Date: 2008-02-04 Impact factor: 11.205
