
Simultaneous non-negative matrix factorization for multiple large scale gene expression datasets in toxicology.

Clare M Lee, Manikhandan A V Mudaliar, D R Haggart, C Roland Wolf, Gino Miele, J Keith Vass, Desmond J Higham, Daniel Crowther.

Abstract

Non-negative matrix factorization is a useful tool for reducing the dimension of large datasets. This work considers simultaneous non-negative matrix factorization of multiple sources of data. In particular, we perform the first study that involves more than two datasets. We discuss the algorithmic issues required to convert the approach into a practical computational tool and apply the technique to new gene expression data quantifying the molecular changes in four tissue types due to different dosages of an experimental panPPAR agonist in mouse. This study is of interest in toxicology because, whilst PPARs form potential therapeutic targets for diabetes, it is known that they can induce serious side-effects. Our results show that the practical simultaneous non-negative matrix factorization developed here can add value to the data analysis. In particular, we find that factorizing the data as a single object allows us to distinguish between the four tissue types, but does not correctly reproduce the known dosage level groups. Applying our new approach, which treats the four tissue types as providing distinct, but related, datasets, we find that the dosage level groups are respected. The new algorithm then provides separate gene list orderings that can be studied for each tissue type, and compared with the ordering arising from the single factorization. We find that many of our conclusions can be corroborated with known biological behaviour, and others offer new insights into the toxicological effects. Overall, the algorithm shows promise for early detection of toxicity in the drug discovery process.

Year:  2012        PMID: 23272042      PMCID: PMC3522745          DOI: 10.1371/journal.pone.0048238

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The aim of this work is to highlight the usefulness of a recently proposed extension to the technique of non-negative matrix factorization (NMF) by demonstrating its promise for early detection of toxicity in the drug discovery process. In particular, we (a) show that any number of related datasets can be treated simultaneously with this approach, (b) deal with practical issues that arise when the algorithm is applied to real datasets, (c) demonstrate its use with a new large scale microarray dataset, and (d) interpret the results from a biological perspective.

Computational Background

NMF seeks to represent a large complex dataset in terms of smaller factors. The name covers many algorithms. Each approximates a non-negative matrix as the product of two or more smaller non-negative matrices, by attempting to minimise some objective function. Lee and Seung [1] showed that when applying multiplicative non-negative factorization to images of faces, each row/column pair of the factors expresses a recognisable facial feature. These techniques have since been used in many settings to learn parts of the data as well as to factorize and cluster datasets. For example, when applied to text data in [1] the algorithm can differentiate multiple meanings of the same word by context. On microarray data, NMF has been used to find patterns in genes or samples, typically bi-clustering both groups in a similar manner to two-way hierarchical clustering [2]–[7]. The review article [8] shows how NMF has also been successful in other areas of computational biology, including molecular pattern discovery, class comparison and biomedical informatics. The new challenge that we address in this work is to apply the NMF methodology to multiple, related, large scale data sets simultaneously. We use the work of Badea [9], [10], who considered an extension of NMF that deals with two data matrices. Simultaneous NMF is used in [9] to study pancreatic cancer microarray data alongside extra information concerning transcription regulatory factors. In [10] microarray datasets for pancreatic ductal adenocarcinoma and sporadic colon adenocarcinoma are simultaneously factorized in order to discover expression patterns common to both data sets. This simultaneous NMF approach readily extends to the case of an arbitrary number of data matrices and here, for what we believe to be the first time, we implement and evaluate the method on more than two. We also consider various practical issues that must be tackled in order to produce a useful computational tool.
To minimize the number of algorithmic parameters, make the results straightforward to interpret, and exploit the natural sparsity in the algorithm [9, section 3], we focus on hard clustering. The interesting issue of allowing clusters to overlap in this context is therefore left as future work.

Biological Background

We analyse gene expression data describing the molecular changes in four tissue types due to different dosages of an experimental pan-peroxisome proliferator-activated receptor (pan-PPAR) agonist PPM-201, provided by Plexxikon. PPARs have attracted great interest as potential therapeutic targets for diabetes [11], but major concerns have arisen due to clinically observed side-effects [12]. Hence, there are compelling reasons for toxicological studies at the gene expression level. The material is organised as follows. In the Algorithms section we describe the simultaneous NMF algorithm and outline our approach for using the output to order and cluster a dataset. The Mouse data section describes the mouse microarray data, and the NMF results that arise when we treat it as a single dataset are given in the Single dataset section. This is followed in the Multiple datasets section by the analysis of the data split into four datasets corresponding to the known tissue types: liver, kidney, heart and skeletal muscle. In the Comparing Gene clusters section we compare the gene clusters from the Single dataset and Multiple datasets sections. The results are then discussed, followed by conclusions.

Methods

Algorithms

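Given non-negative data matrices $X_i$ of size $m_i \times n$ for $i = 1, \dots, N$, our aim is to simultaneously factorize all $N$ matrices so that

$$X_i \approx W_i H, \qquad i = 1, \dots, N,$$

with the additional constraints that $W_i$ is a non-negative matrix of size $m_i \times k$ for $i = 1, \dots, N$, and $H$ is a non-negative matrix of size $k \times n$. Generalising naturally from the case $N = 2$ in [9], we seek to minimise the objective function

$$f(W_1, \dots, W_N, H) = \sum_{i=1}^{N} \lambda_i \, \| X_i - W_i H \|_F^2, \qquad (1)$$

where the $\lambda_i$ are positive weighting coefficients. Here $\| \cdot \|_F$ denotes the Frobenius norm. As in [9] the coefficients $\lambda_i$ are designed to give equal weight to the different error terms.

Based on the multiplicative update rules developed in [13], an iterative algorithm that attempts to solve the optimisation problem can be derived by applying a gradient descent step to each of the $N + 1$ factors. This gives us the following sequence of approximations, given initial choices of $W_i$ and $H$:

$$W_i \leftarrow W_i + \eta_{W_i} \circ \left( X_i H^T - W_i H H^T \right), \qquad i = 1, \dots, N,$$

$$H \leftarrow H + \eta_H \circ \sum_{i=1}^{N} \lambda_i \left( W_i^T X_i - W_i^T W_i H \right),$$

for some small positive matrices $\eta_{W_1}, \dots, \eta_{W_N}$ and $\eta_H$ (into which constant factors of the gradient are absorbed), with $\circ$ representing element-wise multiplication. The iteration may be motivated through the intuition that when $\eta_{W_i}$ and $\eta_H$ are sufficiently small and positive each of these equations should reduce the objective function. This allows us to set

$$\eta_{W_i} = \frac{W_i}{W_i H H^T}, \qquad \eta_H = \frac{H}{\sum_{i=1}^{N} \lambda_i W_i^T W_i H},$$

again with the division being performed element-wise. Hence the overall iteration has the form

$$W_i \leftarrow W_i \circ \frac{X_i H^T}{W_i H H^T}, \qquad H \leftarrow H \circ \frac{\sum_{i=1}^{N} \lambda_i W_i^T X_i}{\sum_{i=1}^{N} \lambda_i W_i^T W_i H}.$$

The values in $\eta_{W_i}$ and $\eta_H$ are non-negative due to the constraints on the matrices; however, they are not necessarily small. The iteration decreases the objective function (1), so this leads to a locally optimal solution, but we cannot guarantee convergence to a global optimum. In particular, different initial conditions can lead to different factorizations of different quality.

Having iterated up to some stopping criterion and produced the factorizations, we use them to bi-cluster the data. Each sample is assigned to the cluster for which it has the largest value in the gene cluster and vice versa. In reordering the data for easy visualisation we organise the rows and columns by cluster number (assigned arbitrarily) and sort the elements within each cluster from the appropriate sample/gene set, with the largest value at the bottom/right of that cluster.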
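The multiplicative iteration described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the names `simultaneous_nmf`, `objective` and `weights` are hypothetical, the data matrices are written `Xs[i]` with per-dataset factors `Ws[i]` and a shared factor `H`, and the weighting of each error term by the inverse squared Frobenius norm of its data matrix is an assumption chosen so that every dataset contributes a relative error. The small `eps` added to each denominator is likewise an assumption to avoid division by zero.

```python
import numpy as np


def weights(Xs):
    # ASSUMPTION: weight each error term by 1 / ||X_i||_F^2 so that every
    # dataset contributes a relative (scale-free) error.
    return [1.0 / np.linalg.norm(X, "fro") ** 2 for X in Xs]


def objective(Xs, Ws, H):
    # Weighted sum of squared Frobenius reconstruction errors.
    return sum(l * np.linalg.norm(X - W @ H, "fro") ** 2
               for l, X, W in zip(weights(Xs), Xs, Ws))


def simultaneous_nmf(Xs, k, n_iter=200, eps=1e-12, seed=0):
    """Factorise every X_i (m_i x n) as W_i @ H with one shared H (k x n)."""
    rng = np.random.default_rng(seed)
    lams = weights(Xs)
    n = Xs[0].shape[1]
    # Strictly positive random starting point.
    Ws = [rng.random((X.shape[0], k)) + 0.1 for X in Xs]
    H = rng.random((k, n)) + 0.1
    for _ in range(n_iter):
        # Multiplicative update of each W_i with H held fixed.
        HHt = H @ H.T
        Ws = [W * (X @ H.T) / (W @ HHt + eps) for W, X in zip(Ws, Xs)]
        # Multiplicative update of the shared H with every W_i held fixed.
        num = sum(l * (W.T @ X) for l, W, X in zip(lams, Ws, Xs))
        den = sum(l * (W.T @ W @ H) for l, W in zip(lams, Ws))
        H = H * num / (den + eps)
    return Ws, H


# Toy data: three datasets with different row counts and nine shared columns.
rng = np.random.default_rng(1)
H_true = rng.random((3, 9))
Xs = [rng.random((m, 3)) @ H_true for m in (40, 60, 50)]
Ws, H = simultaneous_nmf(Xs, k=3)
print(objective(Xs, Ws, H))
```

Because the result depends on the starting point, a practical run would repeat this with several seeds and keep the factorization with the smallest objective value.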
Given that the second factor is common to all the factorizations, it produces a matching ordering of the columns of the data. Because the result depends on the choice of initial condition, and because the choice of the rank $k$ is not automatic, further information is needed in order to specify a practical algorithm. To deal with the lack of uniqueness, we try several initial conditions and pick a realisation that minimises the objective function (1). We then continue until further runs do not significantly alter the results. The objective function value is also one of the criteria we use in order to decide which rank/clustering is the most “appropriate” for the data. By regarding the objective function as a function of $k$, we identify values of $k$ where the decay in the objective function begins to diminish. In addition we also form a consensus matrix as in [3], [14] for the clustering of the objects. This is the average of the connectivity matrices $C$, where for each initialisation $C_{ij} = 1$ if objects $i$ and $j$ are clustered together and $C_{ij} = 0$ otherwise. So the consensus matrix contains values between $0$ and $1$, with the $(i,j)$ element being the likelihood that objects $i$ and $j$ cluster together. The cumulative density of these values is constructed, by summing the appropriate probabilities, and the area under this curve is the second measure we look at when considering choices for $k$. The third measure is the Pearson correlation of the cophenetic distances, as explained in [3].
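The consensus matrix and the area under its cumulative density can be illustrated as follows. This is a hedged sketch: the function names `connectivity`, `consensus_matrix` and `cdf_area` are hypothetical, the cluster labels of each run are assumed to be given as a list of integers, and the use of the off-diagonal entries on a uniform grid with a trapezoidal rule is an assumption about how the curve is built, not a detail taken from the text.

```python
import numpy as np


def connectivity(labels):
    # Entry (i, j) is 1 if objects i and j share a cluster in this run, else 0.
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def consensus_matrix(label_runs):
    # Average the connectivity matrices over all initialisations.
    return np.mean([connectivity(labels) for labels in label_runs], axis=0)


def cdf_area(consensus, n_bins=100):
    # Empirical cumulative distribution of the off-diagonal consensus values,
    # evaluated on a uniform grid over [0, 1] (ASSUMPTION: grid size and
    # trapezoidal integration are illustrative choices).
    upper = np.triu_indices_from(consensus, k=1)
    values = np.sort(consensus[upper])
    grid = np.linspace(0.0, 1.0, n_bins + 1)
    cdf = np.searchsorted(values, grid, side="right") / values.size
    return float(np.sum((cdf[1:] + cdf[:-1]) / 2.0 * np.diff(grid)))


# Three runs clustering four objects; the last run disagrees with the others.
runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
C = consensus_matrix(runs)
print(C)
print(cdf_area(C))
```

A consensus matrix whose entries are all close to 0 or 1 indicates a clustering that is stable across initialisations.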

Mouse data

We apply these techniques to mouse gene expression data quantifying changes in four different tissue types following administration of different dosages (vehicle, therapeutic and toxic) of an experimental pan-PPAR agonist. The study design and clinical chemistry results are summarised in Table 1. ALT and AST are known markers in rodents for liver toxicity [15] and from this criterion mouse E may be showing a toxic response to PPM201, despite it being administered at a supposedly therapeutic dose level. This conditions our expectation of the gene-expression pattern for mouse E and suggests that it may be similar to the toxic level group III for liver.
Table 1

Blood clinical chemistry analysis for each mouse.

GroupMouseDoseALTASTCreatinineBUNLDHCK
IDID(mg/kg b.wt)(U/L)(U/L)(mol/L)(U/L)(U/L)(U/L)
IAVehicle42188124348484
IBVehicle419296364258
ICVehicle297596278166
IID69544111812184930
IIE66929818721261130
IIF6528398294152
IIIG2031213006831722544
IIIH204629378617601182
IIII2069810906726161592

The mice were randomly divided into three groups and treated with either Vehicle or one of two concentrations of PPM201 (6 or 20 mg/kg body weight). The response to the “therapeutic dose”, 6 mg/kg, was found to vary widely for ALT (alanine aminotransferase), AST (aspartate aminotransferase), LDH (lactate dehydrogenase) and CK (creatine kinase). AST is raised in PPM201 treated animals, with mouse E (6 mg/kg) seeming to be especially raised; AST is known to be variable between animals, but mouse E also shows a higher level of ALT, indicating that there may be a shared mechanism for the two enzymes. Creatinine is decreased in liver and possibly kidney disease; the contrasts observed here are inconclusive. BUN (blood urea nitrogen) is raised in kidney disease; results are again inconclusive. Following cardiac infarction LDH is increased after 12 hours, possibly also caused by liver toxicity; mouse E is markedly lower than the other PPM201 treated animals and it may be that its heart muscle profile is more similar to that of the untreated mice. CK is, like LDH, increased in myocardial infarction and this supports the LDH findings for mouse E.

Nine wild type mice (strain: C57BL/6J) were randomly divided into three groups: Group-I, II and III. PPM-201 in the vehicle base was administered daily for 14 days at a dose rate of 6 mg/kg body weight to each mouse in Group-II and 20 mg/kg body weight to each mouse in Group-III, while the mice in Group-I received only the vehicle base. On the 15th day, the mice were sacrificed to harvest blood, heart, skeletal muscle, liver and kidney tissues for clinical chemistry, microarray and histopathology analysis. In the clinical chemistry analysis, alanine aminotransferase (ALT, U/L), aspartate aminotransferase (AST, U/L), creatine kinase (CK, U/L), blood urea nitrogen (BUN, mmol/L), creatinine (mol/L) and lactate dehydrogenase (LDH, U/L) were measured from the blood of each mouse.
Two sections of liver, two sections of kidney, one or two sections of skeletal muscle, and one section of heart were prepared from each mouse, stained with hematoxylin and eosin (H&E), and examined by a veterinary pathologist. Total RNA was isolated from murine tissues using Qiazol-based homogenization and subsequent column-based purification (Qiagen) with on-column DNAse-treatment. DNAse-free RNA was assessed for quality using Agilent Bioanalyser electrophoresis and acceptance criteria of RNA Integrity Number (RIN) greater than seven. 50 ng of total RNA was subsequently utilized as input to cDNA-based amplification and biotin-labelling using single-primer isothermal amplification according to the manufacturer's instructions (Ovation System, NuGEN Technologies). Unlabelled and biotin-labelled cDNA was qualitatively assessed by Agilent Bioanalyser electrophoresis to ensure identical size distributions of all samples pre- and post-fragmentation. Fragmented, biotin-labelled cDNA were hybridized to MOE430 2.0 GeneChip arrays (Affymetrix) with subsequent scanning and feature extraction according to the manufacturer's instructions. The dataset has been approved by the GEO curators and assigned the accession number GSE31561.

Ethics Statement

The in vivo procedures undertaken during the course of this study (Ref: CXR0631) were subject to the provisions of the United Kingdom Animals (Scientific Procedures) Act 1986. The study was approved by the CXR Biosciences Local Ethics Committee and complied with all applicable sections of the Act and the associated Codes of Practice for the Housing and Care of Animals used in Scientific Procedures and the Humane Killing of Animals under Schedule 1 to the Act, issued under section of the Act.

Results

Single dataset

First, the samples are treated as a single dataset, with thirty-six samples and 45037 genes, hence the data matrix is $45037 \times 36$. This corresponds to the case of a single data matrix in the Algorithms section. The factorizations were performed twenty times for each rank $k$, with a consensus matrix formed from the clustering of the samples. All gene clusters associated with this analysis are labelled with a subscript 1, e.g., H1. Figure 1(a) shows the minimum size of the objective function that we observed for each value of $k$. We see that this value decreases monotonically, with a slower rate starting at around . Figure 1(b) shows the area under the cumulative density curves for the same values of $k$. This subfigure clearly points to $k = 4$, as does subfigure (c) showing the cophenetic correlation.
Figure 1

Three measures of the performance versus specified cluster size, $k$, when the data set is factorised as a single entity.

(a) The value of the objective function for each $k$. (b) The area under consensus cumulative density, [3], [14]. (c) The cophenetic correlation coefficient, [3].

Based on Figure 1, we conclude that when the data is factorized as a single entity, $k = 4$ clusters is the most appropriate choice. Reordering the dataset using the ordering for $k = 4$ in the manner described in the Algorithms section gives the images shown in Figure 2. This figure shows the samples in the columns with cluster one at the top. To aid visualisation, the sample clusters are split by white lines, as are the gene clusters. This reordered data matrix shows a distinctive “ramp” effect in the blocks on the diagonal, placing genes that are most influential in identifying each tissue type at the bottom of the block. This figure also shows some of the differences in expression behaviour between the tissue types, particularly for the most influential genes.
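The reordering that produces the “ramp” effect can be sketched as below. This is an illustrative assumption about the bookkeeping rather than the authors' code: `order_by_cluster` is a hypothetical helper that takes a non-negative factor with one column per object, assigns each object to the row (cluster) in which it has its largest value, and sorts objects by cluster and then by that value in ascending order, so the largest value sits at the end of each cluster.

```python
import numpy as np


def order_by_cluster(F):
    """F has shape (k, n_objects). Returns (order, labels).

    labels[j] is the cluster (row index) where object j has its largest value;
    order lists object indices grouped by cluster, ascending within a cluster.
    """
    labels = np.argmax(F, axis=0)
    strength = F[labels, np.arange(F.shape[1])]
    # np.lexsort uses the LAST key as the primary key: cluster, then strength.
    order = np.lexsort((strength, labels))
    return order, labels


# Toy factor with two clusters and four objects.
H = np.array([[0.9, 0.1, 0.5, 0.2],
              [0.1, 0.8, 0.4, 0.7]])
col_order, col_labels = order_by_cluster(H)
print(col_order, col_labels)
```

For the other factor, which has one row per gene, the same helper can be applied to its transpose; a data matrix `X` would then be displayed as `X[np.ix_(row_order, col_order)]`.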
Figure 2

Factorising as a single dataset; reordering using the NMF for $k = 4$.

The columns show the samples and the rows the gene expression for each of the 45037 genes. Genes and samples are organised by cluster number. Elements within each cluster are ordered, with the largest value at the bottom/right. Each tissue is characterised by a group of highly expressed genes; from the top left to bottom right these are heart, skeletal muscle, liver and kidney. For comparison purposes, the characteristic 100 “best” genes in the four columns are named H1, SM1, L1 and K1.

Because we know the origin of the samples, we can confirm that the algorithm has put the heart samples in cluster one, the skeletal muscle samples in cluster two, the liver samples in cluster three, and the kidney samples in cluster four. The exact ordering of the samples is shown in Table 2. This table also shows the mouse identification information for each sample, and we see that the mice are not ordered in the same way within each cluster. It is the liver and skeletal muscle samples that most closely respect the dosage levels within the clusters. Both these clusters only have one sample mis-ordered.
Table 2

Ordering of the tissue samples after single factorization of rank 4 of the entire dataset.

Cluster  Tissue type      Mouse  Dosage
1        Heart            D      6 mg/kg
1        Heart            B      Vehicle
1        Heart            C      Vehicle
1        Heart            I      20 mg/kg
1        Heart            H      20 mg/kg
1        Heart            A      Vehicle
1        Heart            G      20 mg/kg
1        Heart            E      6 mg/kg
1        Heart            F      6 mg/kg
2        Skeletal Muscle  H      20 mg/kg
2        Skeletal Muscle  D      6 mg/kg
2        Skeletal Muscle  I      20 mg/kg
2        Skeletal Muscle  G      20 mg/kg
2        Skeletal Muscle  F      6 mg/kg
2        Skeletal Muscle  E      6 mg/kg
2        Skeletal Muscle  C      Vehicle
2        Skeletal Muscle  A      Vehicle
2        Skeletal Muscle  B      Vehicle
3        Liver            I      20 mg/kg
3        Liver            H      20 mg/kg
3        Liver            G      20 mg/kg
3        Liver            E      6 mg/kg
3        Liver            A      Vehicle
3        Liver            F      6 mg/kg
3        Liver            D      6 mg/kg
3        Liver            C      Vehicle
3        Liver            B      Vehicle
4        Kidney           G      20 mg/kg
4        Kidney           I      20 mg/kg
4        Kidney           H      20 mg/kg
4        Kidney           E      6 mg/kg
4        Kidney           A      Vehicle
4        Kidney           C      Vehicle
4        Kidney           F      6 mg/kg
4        Kidney           B      Vehicle
4        Kidney           D      6 mg/kg
Given that the factorization has been performed for a range of ranks $k$, we know what the clustering would be from all these rank $k$ factorizations. This information is displayed in Figure 3. Here the rows representing the samples are ordered in tissue then dosage subgroups. For each rank $k$, samples with the same colour are assigned to the same cluster. As we have seen before, for $k = 4$ the samples are split into tissue types. The figure shows that this split persists at with an empty cluster forming. In fact, for this range of $k$ there are at most twelve clusters of samples. We also see from this figure that for no value of $k$ are the twelve tissue/dosage subgroups found.
Figure 3

Factorising as a single dataset.

The clustering of the mouse samples for each rank $k$. Within each column the samples in the same colour are clustered together. No value of $k$ reveals the known tissue/dosage subgroups, or places different tissues in the same cluster.


Multiple datasets

The test in the Single dataset section indicates that the basic NMF factorization approach can deliver biologically meaningful results, separating the thirty-six samples by tissue type. But the failure to order correctly within tissue type according to dosage motivates the use of the multiple dataset generalization introduced in the Algorithms section, where the four tissue types are treated as separate sources of information across a common set of mice. Intuitively, we would expect to add value to the data analysis by building known biology into the algorithm in this way. In this section, we therefore factorize the four new datasets simultaneously. This is similar to the test in the Single dataset section in the sense that it produces a single ordering for the mice, but it has the potential to add extra information by providing four different, tissue-level, gene orderings. We thus have four matrices of size $45037 \times 9$. We again performed 20 factorizations for each rank $k$, and these have been used to generate a consensus for clustering the mice. The objective function and the consensus measurements are shown in Figure 4. The objective function in subfigure (a) does not show much decrease in convergence rate until we get to nine clusters. This is the point where each mouse is put into a cluster on its own. The area under the cumulative density curve in Figure 4(b) suggests using either rank , or factorizations for the clustering. The correlation coefficients shown in subfigure (c) give the same two values as peaks, as well as , though the peak is the highest.
Figure 4

Three measures of the performance versus specified cluster size, $k$, when the four tissue types are factorised separately.

(a) The value of the objective function for each $k$. (b) The area under consensus cumulative density function for each $k$, [3], [14]. (c) The cophenetic correlation coefficient, [3].

Given these measurements we consider the four-way simultaneous factorization for $k = 3$ in Figure 5. The reordered datasets are shown separately with the kidney dataset in the top left, the liver dataset in the top right, the heart dataset in the bottom left and the skeletal muscle in the bottom right. The mouse ordering and mouse clusters that arise are shown in Table 3. The four subfigures in Figure 5 also illustrate that the gene clusters are different for each dataset. The three clusters for each tissue in this 4-way factorization are subsequently referred to in the form “, cluster 1, 2 or 3.” Table 3 shows that the simultaneous NMF approach has recovered the known mouse treatments except for one misplacement. Figure 6 shows the clustering for the four-way simultaneous factorizations for a range of ranks $k$. This indicates that this mouse does not cluster with all those of the same dosage for any rank of factorization greater than two; instead it associates with the higher, more toxic dosage. This is borne out by the known blood chemistry, as summarised in Table 1; the mouse that is mis-classified exhibits a toxic response and is therefore classified with the mice that received the higher dose.
Figure 5

Factorisation of the four separate tissue types using simultaneous NMF with $k = 3$.

Top left, kidney; top right, liver; lower left, heart; lower right, skeletal muscle. The four tissue types are treated as separate sources of information across a common set of mice. Genes are therefore ordered differently in each of the four tissues, but the mice ordering is global. The resulting mouse ordering and mouse clusters are detailed in Table 3.

Table 3

Ordering of the tissue samples after a four-way factorization of rank 3.

Cluster  Mouse  Dosage
1        E      6 mg/kg
1        G      20 mg/kg
1        I      20 mg/kg
1        H      20 mg/kg
2        F      6 mg/kg
2        D      6 mg/kg
3        B      Vehicle
3        A      Vehicle
3        C      Vehicle

The mouse clusters when split by tissue type and reordered using the 4-way simultaneous factorization for $k = 3$.

Figure 6

Factorisation of the four separate tissue types simultaneously.

The clustering of the mice for each rank $k$; colour indicates cluster number. One “misclassification” is found for several values of $k$. This involves the mouse showing a toxic response to the lower (6 mg/kg) dose of PPAR agonist, as discussed in section .


Comparing Gene clusters

Our aim now is to test the results from the novel multi-way NMF algorithm used in the Multiple datasets section in order to see whether they (a) show consistency and (b) add value to the results in the Single dataset section from standard NMF. We know that the four simultaneously factorized datasets correspond to the four clusters of samples that were discovered in an unsupervised manner from the single factorization of the full dataset. It could therefore be conjectured that the most influential genes in the first factorization will appear as influential genes in the four-way simultaneous factorization for that dataset, but less so for the other datasets. Our comparisons involve four reference sets. For illustration, we chose an arbitrary threshold of one hundred; that is, we consider the top one hundred most influential genes from the four clusters in the first factorization shown in Figure 2. For easy reference these sets are referred to using the known tissue type. This means that the genes from cluster one are the H1 genes, those from cluster two are the SM1 genes, those from cluster three are the L1 genes and those from cluster four are the K1 genes. The 4-way factorization shown in Figure 5 identifies differently ordered gene clusters for each tissue, which we will refer to as “, cluster 1, 2 or 3, etc.” Table 4 shows the total number of coincident genes between the top 100 lists arising from the one-way and four-way factorisations. The table also shows the probability of the two lists having that number of genes in common if the second list were randomly selected; hence these values come from the hypergeometric distribution. We see that the important genes for each tissue type appear significantly highly in the clusters from that tissue's data type. In addition, all the tissue type genes also appear significantly within the reordering of the heart dataset. This link is reciprocated, with the heart genes appearing significantly frequently within the skeletal muscle dataset. Surprisingly, the greatest overlap arose between L1 and cluster 2 of the heart dataset. One of these genes, Apolipoprotein A1, is being considered as a marker for cardiac toxicity [16].
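The hypergeometric probabilities mentioned above can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function name `overlap_probability` is hypothetical, the population is taken to be the 45037 genes with two lists of 100 genes each, and the quantity computed is the point probability of exactly the given overlap (an assumption that appears consistent with a value of roughly 0.80 for zero overlap).

```python
from math import comb


def overlap_probability(overlap, population=45037, list_a=100, list_b=100):
    """P(exactly `overlap` shared genes) when list_b is drawn at random.

    Hypergeometric point probability: choose `overlap` genes from list_a and
    the remaining list_b - overlap genes from outside list_a.
    """
    favourable = comb(list_a, overlap) * comb(population - list_a, list_b - overlap)
    return favourable / comb(population, list_b)


for shared in (0, 1, 2, 5):
    print(shared, overlap_probability(shared))
```

Python integers are exact, so the large binomial coefficients are formed without overflow and only the final ratio is converted to a floating-point number.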
Table 4

Gene cluster comparison for individual tissues in the single matrix with the four separate tissue matrices.

4-way dataset    Cluster  H1 No.  H1 Probability  SM1 No.  SM1 Probability  L1 No.  L1 Probability  K1 No.  K1 Probability
Heart            Clust.1  22      2.2188e-38      0        0.8005           0       0.8005          0       0.8005
Heart            Clust.2  1       0.1785          2        0.0195           49      5.444e-108      0       0.8005
Heart            Clust.3  11      4.3469e-16      7        2.8360e-09       1       0.1785          5       3.0037e-06
Heart            total    34      5.4969e-67      9        1.4338e-12       50      6.309e-111      5       3.0037e-06
Skeletal muscle  Clust.1  4       7.3075e-05      15       1.1260e-12       8       6.8371e-11      0       0.8005
Skeletal muscle  Clust.2  0       0.8005          0        0.8005           0       0.8005          0       0.8005
Skeletal muscle  Clust.3  4       7.3075e-05      14       1.0243e-21       0       0.8005          0       0.8005
Skeletal muscle  total    8       6.8371e-11      29       1.3672e-54       8       6.8371e-11      0       0.8005
Liver            Clust.1  1       0.1785          0        0.8005           13      8.4974e-11      1       0.1785
Liver            Clust.2  0       0.8005          0        0.8005           0       0.8005          0       0.8005
Liver            Clust.3  1       0.1785          2        0.0195           16      1.1336e-25      2       0.0195
Liver            total    2       0.0195          2        0.0195           29      1.3672e-54      3       0.0014
Kidney           Clust.1  0       0.8005          0        0.8005           1       0.1785          0       0.8005
Kidney           Clust.2  2       0.0195          1        0.1785           1       0.1785          0       0.8005
Kidney           Clust.3  0       0.8005          0        0.8005           2       0.0195          18      8.9507e-30
Kidney           total    2       0.0195          1        0.1785           4       7.3075e-05      18      8.9507e-30

H1, SM1, L1 and K1 are the gene clusters most characteristic for the heart, skeletal muscle, liver and kidney, respectively, in the single (combined) data set, as in Figure 2. Clust.1, 2, or 3 denotes the 100 genes most securely placed within the clusters of the differently ordered genes in the 4-way factorization shown in Figure 5. The order of the clusters is 1–3, from the top of the figure, for each tissue. We refer to these clusters as “,” etc. The overlap of the from the one-way factorization to is referred to as cluster 1.

We would like to demonstrate the utility of the factorization method by using the gene clusters obtained in our analysis to understand tissue specific effects of the experimental drug, PPM-201. Of course, we are not claiming that this is an exhaustive analysis of the effects of PPM-201. We analysed the gene clusters for pathway enrichment and gene ontology enrichment using the DAVID [17] and Ingenuity Pathways Analysis (IPA) [18] tools. Table 5 shows the comparison of KEGG pathways enriched in the four tissue specific top one hundred most influential probe-sets obtained in the first factorization. Pathways enriched in these clusters differ according to the tissue types and can be considered as the pathways that are most perturbed by PPM-201. For example, arrhythmogenic right ventricular cardiomyopathy, hypertrophic cardiomyopathy and dilated cardiomyopathy are enriched in heart, whereas starch and sucrose metabolism, drug metabolism and the PPAR signalling pathway are enriched in liver. Similarly, Figure 7 shows the enrichment of canonical pathways in the four tissue specific clusters analysed using IPA. It also shows the tissue specific enrichment of pathways: calcium signalling, integrin linked kinase (ILK) signalling and cardiac hypertrophy signalling are enriched in and clusters, whereas fatty acid metabolism and farnesoid X receptor (FXR)/retinoid X receptor (RXR) activation are enriched in the cluster. Analysis of the same sets of genes for enrichment of toxicity functions in the IPA shows, in Figure 8, cardiac hypertrophy in genes, increased level of creatinine and hydronephrosis in genes, and increased levels of lactate dehydrogenase (LDH) and steatosis in genes.
Table 5

Enrichment of KEGG pathways in the four tissue specific gene clusters.

KEGG pathway | Heart | Muscle | Kidney | Liver
1 mmu05412: Arrhythmogenic right ventricular cardiomyopathy
2 mmu04020: Calcium signalling pathway
3 mmu04260: Cardiac muscle contraction
4 mmu05414: Dilated cardiomyopathy
5 mmu05410: Hypertrophic cardiomyopathy (HCM)
6 mmu04530: Tight junction
7 mmu00590: Arachidonic acid metabolism
8 mmu00983: Drug metabolism
9 mmu04610: Complement and coagulation cascades
10 mmu00980: Metabolism of xenobiotics by cytochrome P450
11 mmu03320: PPAR signalling pathway
12 mmu00830: Retinol metabolism
13 mmu00040: Pentose and glucuronate interconversions
14 mmu00591: Linoleic acid metabolism
15 mmu00053: Ascorbate and aldarate metabolism
16 mmu00860: Porphyrin and chlorophyll metabolism
17 mmu00500: Starch and sucrose metabolism
18 mmu00150: Androgen and estrogen metabolism
19 mmu00140: Steroid hormone biosynthesis

The top one hundred most influential probesets in the four tissue specific gene clusters were analysed using DAVID functional annotation tool. This table shows the comparison of KEGG pathways enriched in the four tissue specific gene clusters. The icon indicates a p-value and the a p-value showing the significance of the enrichment.

Figure 7

Enrichment of canonical pathways in the four tissue specific gene clusters.

The top one hundred most influential probe-sets in the four tissue specific gene clusters obtained in the first factorization were subjected to signalling and metabolic pathways analysis in the IPA software. This graph shows the comparison of canonical pathways enriched in the four tissue specific gene clusters, H1, SM1, L1 and K1. The coloured bars show the significance of the enrichment for a particular pathway in the cluster computed by Fisher's exact test.

Figure 8

Enrichment of toxicity functions in the four tissue specific gene clusters.

The top one hundred most influential probe-sets in the four tissue specific gene clusters obtained in the first factorization were subjected to IPA-Tox analysis in the IPA software. This graph shows the comparison of toxicity functions enriched in the four tissue specific gene clusters. The coloured bars show the significance of the enrichment for a particular toxicity function in the cluster computed by Fisher's exact test.
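
The Fisher's exact test p-values behind these enrichment results can be illustrated with a small calculation. The sketch below is not the DAVID or IPA implementation, since both tools use their own gene universes and adjustments; it only shows the one-sided test for a single pathway. The function name and all counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided Fisher's exact test for over-representation.

    Returns the probability of seeing at least n_overlap pathway members
    when n_selected probe-sets are drawn at random from n_universe
    probe-sets, of which n_pathway belong to the pathway.
    """
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)


# Hypothetical counts: a 100 probe-set cluster, 12 of which fall in a
# pathway with 180 members, out of 20000 probe-sets on the array.
p_value = fisher_enrichment_p(
    n_universe=20000, n_pathway=180, n_selected=100, n_overlap=12
)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value here means the cluster contains more members of the pathway than random selection would be expected to give; when many pathways are tested, a multiple-testing correction would normally follow.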

The common genes between the top one hundred most influential probe-sets in the four tissue specific clusters and the top one hundred most influential probe-sets in the clusters formed by 4-way simultaneous factorization of the split dataset were also analysed for enrichment of pathways, gene ontology and toxicity functions using DAVID and IPA. Tables 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 summarise the results of this analysis, which are discussed further in the next section.
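
Finding the common probe-sets amounts to intersecting two ranked lists. A minimal sketch, assuming each cluster is available as a list of probe-set IDs ordered by influence; the function name and the short example lists are illustrative and are not the study's actual top-100 lists.

```python
def common_probesets(one_way_ranked, four_way_ranked, top_n=100):
    """Return IDs found in the top_n of both rankings, in one-way order."""
    four_way_top = set(four_way_ranked[:top_n])
    return [p for p in one_way_ranked[:top_n] if p in four_way_top]


# Illustrative IDs only (hypothetical rankings, truncated for brevity).
one_way = ["1415927_at", "1416551_at", "1422529_s_at", "1448394_at"]
four_way = ["1448394_at", "1444429_at", "1415927_at"]

print(common_probesets(one_way, four_way))
```

The resulting list is what would then be passed to an enrichment tool in place of the full top-100 list.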
Table 6

Enrichment of KEGG pathways in the common genes between the clusters found by the two ways of factorization.

KEGG pathway | Heart1 | Heart1 | Muscle1 | Liver1 | Liver1 | Liver1 | Liver1
Heart4 | Muscle4 | Muscle4 | Liver4 | Liver4 | Heart4 | Muscle4
clust. 1 | clust. 3 | clust. 1 | clust. 1 | clust. 3 | clust. 2 | clust. 1
1 mmu04020: Calcium signalling pathway
2 mmu04260: Cardiac muscle contraction
3 mmu04610: Complement and coagulation cascades
4 mmu05414: Dilated cardiomyopathy
5 mmu00983: Drug metabolism
6 mmu05410: Hypertrophic cardiomyopathy (HCM)
7 mmu03320: PPAR signalling pathway
8 mmu04530: Tight junction

The probesets common to clusters formed by the 4-way simultaneous factorization and the top one hundred most influential probesets in the four tissue specific clusters were analysed for enrichment of KEGG pathways using the DAVID functional annotation tool. Fisher's exact test p-values for pathway enrichment in the clusters are shown graphically in this table. The icon indicates a p-value and the a p-value.

Table 7

Muscle genes present in the calcium signalling pathway.

Sr. | Probeset ID | Gene Symbol | Entrez Gene ID | Entrez Gene Name
1 | 1427735_a_at | ACTA1 | 11459 | Actin, alpha 1, skeletal muscle
2 | 1419312_at | ATP2A1 | 11937 | ATPase, Ca++ transporting, cardiac muscle, fast twitch 1
3 | 1422598_at | CASQ1 | 12372 | Calsequestrin 1 (fast-twitch, skeletal muscle)
4 | 1427520_a_at | MYH1 | 17879 | Myosin, heavy chain 1, skeletal muscle, adult
5 | 1425153_at | MYH2 | 17882 | Myosin, heavy chain 2, skeletal muscle, adult
6 | 1458368_at | MYH4 | 17884 | Myosin, heavy chain 4, skeletal muscle
7 | 1452651_a_at | MYL1 | 17901 | Myosin, light chain 1, alkali; skeletal, fast
8 | 1457347_at | RYR1 | 20190 | Ryanodine receptor 1 (skeletal)
9 | 1440962_at | SLC8A3 | 110893 | Solute carrier family 8, member 3
10 | 1417464_at | TNNC2 | 21925 | Troponin C type 2 (fast)
11 | 1416889_at | TNNI2 | 21953 | Troponin I type 2 (skeletal, fast)
12 | 1450118_a_at | TNNT3 | 21957 | Troponin T type 3 (skeletal, fast)
13 | 1419739_at | TPM2 | 22004 | Tropomyosin 2 (beta)
14 | 1426144_x_at | TRDN | 76757 | Triadin

The table shows the probe-sets enriched for calcium signalling among the top 100 probe-sets from the skeletal muscle (SM1) gene cluster.

Table 8

Heart genes present in the calcium signalling pathway.

Sr. | Probeset ID | Gene Symbol | Entrez Gene ID | Entrez Gene Name
1 | 1415927_at | ACTC1 | 11464 | Actin, alpha, cardiac muscle 1
2 | 1416551_at | ATP2A2 | 11938 | ATPase, Ca++ transporting, cardiac muscle, slow twitch 2
3 | 1422529_s_at | CASQ2 | 12373 | Calsequestrin 2 (cardiac muscle)
4 | 1448827_s_at | MYH6 | 17888 | Myosin, heavy chain 6, cardiac muscle, alpha
5 | 1448394_at | MYL2 | 17906 | Myosin, light chain 2, regulatory, cardiac, slow
6 | 1427769_x_at | MYL3 | 17897 | Myosin, light chain 3, alkali; ventricular, skeletal, slow
7 | 1421126_at | RYR2 | 20191 | Ryanodine receptor 2 (cardiac)
8 | 1418370_at | TNNC1 | 21924 | Troponin C type 1 (slow)
9 | 1422536_at | TNNI3 | 21954 | Troponin I type 3 (cardiac)
10 | 1440424_at | TNNT2 | 21956 | Troponin T type 2 (cardiac)
11 | 1423049_a_at | TPM1 | 22003 | Tropomyosin 1 (alpha)
12 | 1451940_x_at | TRDN | 76757 | Triadin

The table shows the probe-sets enriched for calcium signalling among the top 100 probe-sets from the heart (H1) gene cluster.

Table 9

Liver genes present in the FXR/RXR activation pathway.

Sr. | Probeset ID | Gene Symbol | Entrez Gene ID | Entrez Gene Name
1 | 1449817_at | ABCB11 | 27413 | ATP-binding cassette, sub-family B (MDR/TAP), member 11
2 | 1419393_at | ABCG5 | 27409 | ATP-binding cassette, sub-family G (WHITE), member 5
3 | 1419232_a_at | APOA1 | 11806 | Apolipoprotein A-I
4 | 1418278_at | APOC3 | 11814 | Apolipoprotein C-III
5 | 1449309_at | CYP8B1 | 13124 | Cytochrome P450, family 8, subfamily B, polypeptide 1
6 | 1418190_at | PON1 | 18979 | Paraoxonase 1
7 | 1450261_a_at | SLC10A1 | 20493 | Solute carrier family 10, member 1
8 | 1449112_at | SLC27A5 | 26459 | Solute carrier family 27, member 5
9 | 1449394_at | SLCO1B3 | 28253 | Solute carrier organic anion transporter family, member 1B3
10 | 1424934_at | UGT2B4 | 71773 | UDP glucuronosyltransferase 2 family, polypeptide B4

The table shows the probe-sets enriched for FXR/RXR activation among the top 100 probe-sets from the liver (L1) gene cluster.

Table 10

cluster 1. Common probesets between the top one hundred most influential probesets in the heart (H1) cluster and the 20 mg/kg dosage cluster (cluster 1) of the heart dataset.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1415927_at | ACTC1 | actin, alpha, cardiac muscle 1
2 | 1416551_at | ATP2A2 | ATPase, Ca++ transporting, cardiac muscle, slow twitch 2
3 | 1452363_a_at | ATP2A2 | ATPase, Ca++ transporting, cardiac muscle, slow twitch 2
4 | 1417607_at | COX6A2 | cytochrome c oxidase subunit VIa polypeptide 2
5 | 1460318_at | CSRP3 | cysteine and glycine-rich protein 3 (cardiac LIM protein)
6 | 1416023_at | FABP3 | fatty acid binding protein 3, muscle and heart (mammary-derived growth inhibitor)
7 | 1453628_s_at | LRRC2 | leucine rich repeat containing 2
8 | 1451203_at | MB | myoglobin
9 | 1418551_at | MYBPC3 | myosin binding protein C, cardiac
10 | 1448554_s_at | MYH6 | myosin, heavy chain 6, cardiac muscle, alpha
11 | 1448826_at | MYH6 | myosin, heavy chain 6, cardiac muscle, alpha
12 | 1448394_at | MYL2 | myosin, light chain 2, regulatory, cardiac, slow
13 | 1427768_s_at | MYL3 | myosin, light chain 3, alkali; ventricular, skeletal, slow
14 | 1428266_at | MYL3 | myosin, light chain 3, alkali; ventricular, skeletal, slow
15 | 1418769_at | MYOZ2 | myozenin 2
16 | 1450952_at | PLN | phospholamban
17 | 1423859_a_at | PTGDS | prostaglandin D2 synthase 21 kDa (brain)
18 | 1418370_at | TNNC1 | troponin C type 1 (slow)
19 | 1422536_at | TNNI3 | troponin I type 3 (cardiac)
20 | 1418726_a_at | TNNT2 | troponin T type 2 (cardiac)
21 | 1424967_x_at | TNNT2 | troponin T type 2 (cardiac)
22 | 1423049_a_at | TPM1 | tropomyosin 1 (alpha)
Table 11

cluster 3.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1422529_s_at | CASQ2 | calsequestrin 2 (cardiac muscle)
2 | 1444429_at | LRTM1 | leucine-rich repeats and transmembrane domains 1
3 | 1439101_at | MYLK3 | myosin light chain kinase 3
4 | 1426615_s_at | NDRG4 | NDRG family member 4
5 | 1436188_a_at | NDRG4 | NDRG family member 4
6 | 1438452_at | NEBL | nebulette
7 | 1437442_at | PCDH7 | protocadherin 7
8 | 1436277_at | RNF207 | ring finger protein 207
9 | 1423145_a_at | TCAP | titin-cap (telethonin)
10 | 1436833_x_at | TTLL1 | tubulin tyrosine ligase-like family, member 1
11 | 1444638_at | TTN | titin

Common probesets between the top one hundred most influential probesets in the heart (H1) cluster and the vehicle dose cluster (cluster 3) of the heart dataset.

Table 12

cluster 1.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1427735_a_at | ACTA1 | actin, alpha 1, skeletal muscle
2 | 1418677_at | ACTN3 | actinin, alpha 3
3 | 1419312_at | ATP2A1 | ATPase, Ca++ transporting, cardiac muscle, fast twitch 1
4 | 1417614_at | CKM | creatine kinase, muscle
5 | 1438059_at | CTXN3 (includes EG:629147) | cortexin 3
6 | 1455736_at | MYBPC2 | myosin binding protein C, fast type
7 | 1427868_x_at | MYH1 | myosin, heavy chain 1, skeletal muscle, adult
8 | 1427026_at | MYH4 | myosin, heavy chain 4, skeletal muscle
9 | 1448371_at | MYLPF | myosin light chain, phosphorylatable, fast skeletal muscle
10 | 1418155_at | MYOT | myotilin
11 | 1427306_at | RYR1 | ryanodine receptor 1 (skeletal)
12 | 1417464_at | TNNC2 | troponin C type 2 (fast)
13 | 1416889_at | TNNI2 | troponin I type 2 (skeletal, fast)
14 | 1450118_a_at | TNNT3 | troponin T type 3 (skeletal, fast)
15 | 1426142_a_at | TRDN | triadin

Common probesets between the top one hundred most influential probesets in the skeletal muscle (SM1) cluster and the 20 mg/kg dosage cluster (cluster 1) of the skeletal muscle dataset.

Table 13

cluster 3.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1453657_at | 2310065F04RIK | RIKEN cDNA 2310065F04 gene
2 | 1434722_at | AMPD1 | adenosine monophosphate deaminase 1
3 | 1460256_at | CA3 | carbonic anhydrase III, muscle specific
4 | 1422598_at | CASQ1 | calsequestrin 1 (fast-twitch, skeletal muscle)
5 | 1439332_at | DDIT4L | DNA-damage-inducible transcript 4-like
6 | 1427400_at | LBX1 | ladybird homeobox 1
7 | 1419487_at | MYBPH | myosin binding protein H
8 | 1458368_at | MYH4 | myosin, heavy chain 4, skeletal muscle
9 | 1441111_at | MYLK4 | myosin light chain kinase family, member 4
10 | 1418373_at | PGAM2 | phosphoglycerate mutase 2 (muscle)
11 | 1444480_at | PRKAG3 | protein kinase, AMP-activated, gamma 3 non-catalytic subunit
12 | 1417653_at | PVALB | parvalbumin
13 | 1422644_at | SH3BGR | SH3 domain binding glutamic acid-rich protein
14 | 1449206_at | SYPL2 | synaptophysin-like 2

Common probesets between the top one hundred most influential probesets in the skeletal muscle (SM1) cluster and the vehicle dose cluster (cluster 3) of the skeletal muscle dataset.

Table 14

cluster 2.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1449817_at | ABCB11 | ATP-binding cassette, sub-family B (MDR/TAP), member 11
2 | 1425260_at | ALB | albumin
3 | 1416649_at | AMBP | alpha-1-microglobulin/bikunin precursor
4 | 1419233_x_at | APOA1 | apolipoprotein A-I
5 | 1438840_x_at | APOA1 | apolipoprotein A-I
6 | 1455201_x_at | APOA1 | apolipoprotein A-I
7 | 1419232_a_at | APOA1 | apolipoprotein A-I
8 | 1417950_a_at | APOA2 | apolipoprotein A-II
9 | 1417610_at | APOA5 | apolipoprotein A-V
10 | 1417561_at | APOC1 | apolipoprotein C-I
11 | 1418278_at | APOC3 | apolipoprotein C-III
12 | 1418708_at | APOC4 | apolipoprotein C-IV
13 | 1416677_at | APOH | apolipoprotein H (beta-2-glycoprotein I)
14 | 1424011_at | AQP9 | aquaporin 9
15 | 1419549_at | ARG1 | arginase, liver
16 | 1421944_a_at | ASGR1 | asialoglycoprotein receptor 1
17 | 1450624_at | BHMT | betaine–homocysteine S-methyltransferase
18 | 1451600_s_at | CES3 | carboxylesterase 3
19 | 1455540_at | CPS1 | carbamoyl-phosphate synthase 1, mitochondrial
20 | 1418113_at | CYP2D10 | cytochrome P450, family 2, subfamily d, polypeptide 10
21 | 1416913_at | ES1 (includes EG:13884) | esterase 1
22 | 1418897_at | F2 | coagulation factor II (thrombin)
23 | 1417556_at | FABP1 | fatty acid binding protein 1, liver
24 | 1418438_at | FABP2 | fatty acid binding protein 2, intestinal
25 | 1424279_at | FGA | fibrinogen alpha chain
26 | 1428079_at | FGB | fibrinogen beta chain
27 | 1416025_at | FGG | fibrinogen gamma chain
28 | 1426547_at | GC | group-specific component (vitamin D binding protein)
29 | 1419196_at | HAMP | hepcidin antimicrobial peptide
30 | 1419197_x_at | HAMP | hepcidin antimicrobial peptide
31 | 1436643_x_at | HAMP | hepcidin antimicrobial peptide
32 | 1425137_a_at | HLA-A | major histocompatibility complex, class I, A
33 | 1448881_at | HP | haptoglobin
34 | 1423944_at | HPX | hemopexin
35 | 1434110_x_at | LOC100129193 | major urinary protein pseudogene
36 | 1428005_at | MOSC1 | MOCO sulphurase C-terminal domain containing 1
37 | 1417835_at | MUG1 | murinoglobulin 1
38 | 1451054_at | ORM1 | orosomucoid 1
39 | 1418190_at | PON1 | paraoxonase 1
40 | 1417246_at | PZP | pregnancy-zone protein
41 | 1426225_at | RBP4 | retinol binding protein 4, plasma
42 | 1451513_x_at | SERPINA1 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1
43 | 1418282_x_at | SERPINA1 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1
44 | 1423866_at | SERPINA3K | serine (or cysteine) peptidase inhibitor, clade A, member 3K
45 | 1417909_at | SERPINC1 | serpin peptidase inhibitor, clade C (antithrombin), member 1
46 | 1449112_at | SLC27A5 | solute carrier family 27 (fatty acid transporter), member 5
47 | 1449394_at | SLCO1B3 | solute carrier organic anion transporter family, member 1B3
48 | 1419093_at | TDO2 | tryptophan 2,3-dioxygenase
49 | 1422604_at | UOX | urate oxidase, pseudogene

Common probesets between the top one hundred most influential probesets in the liver (L1) cluster and the 6 mg/kg dosage cluster (cluster 2) of the heart dataset.

Table 15

cluster 1.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1425260_at | ALB | albumin
2 | 1419059_at | APCS | amyloid P component, serum
3 | 1419232_a_at | APOA1 | apolipoprotein A-I
4 | 1419233_x_at | APOA1 | apolipoprotein A-I
5 | 1438840_x_at | APOA1 | apolipoprotein A-I
6 | 1455201_x_at | APOA1 | apolipoprotein A-I
7 | 1417950_a_at | APOA2 | apolipoprotein A-II
8 | 1416677_at | APOH | apolipoprotein H (beta-2-glycoprotein I)
9 | 1419549_at | ARG1 | arginase, liver
10 | 1417556_at | FABP1 | fatty acid binding protein 1, liver
11 | 1428079_at | FGB | fibrinogen beta chain
12 | 1426547_at | GC | group-specific component (vitamin D binding protein)
13 | 1448881_at | HP | haptoglobin

Common probesets between the top one hundred most influential probesets in the cluster and 20 mg/kg dosage cluster (cluster 1) of the dataset.

Table 16

cluster 3.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1428981_at | 2810007J24RIK | RIKEN cDNA 2810007J24 gene
2 | 1449817_at | ABCB11 | ATP-binding cassette, sub-family B (MDR/TAP), member 11
3 | 1417085_at | AKR1C4 | aldo-keto reductase family 1, member C4 (chlordecone reductase; 3-alpha hydroxysteroid dehydrogenase, type I; dihydrodiol dehydrogenase 4)
4 | 1451600_s_at | CES3 | carboxylesterase 3
5 | 1449242_s_at | HRG | histidine-rich glycoprotein
6 | 1431808_a_at | ITIH4 | inter-alpha (globulin) inhibitor H4 (plasma Kallikrein-sensitive glycoprotein)
7 | 1434110_x_at | LOC100129193 | major urinary protein pseudogene
8 | 1420465_s_at | LOC100129193 | major urinary protein pseudogene
9 | 1426154_s_at | LOC100129193 | major urinary protein pseudogene
10 | 1420525_a_at | OTC | ornithine carbamoyltransferase
11 | 1436615_a_at | OTC | ornithine carbamoyltransferase
12 | 1448680_at | SERPINA1 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 1
13 | 1448506_at | SERPINA6 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 6
14 | 1449394_at | SLCO1B3 | solute carrier organic anion transporter family, member 1B3
15 | 1424934_at | UGT2B4 | UDP glucuronosyltransferase 2 family, polypeptide B4
16 | 1422604_at | UOX | urate oxidase, pseudogene

Common probesets between the top one hundred most influential probesets in the cluster and vehicle dose cluster (cluster 3) of the dataset.

Table 17

cluster 3.

Sr. | Probeset ID | Gene Symbol | Entrez Gene Name
1 | 1456190_a_at | ACSM2A | acyl-CoA synthetase medium-chain family member 2A
2 | 1427223_a_at | ACSM2A | acyl-CoA synthetase medium-chain family member 2A
3 | 1425207_at | BC026439 | cDNA sequence BC026439
4 | 1424713_at | CALML4 | calmodulin-like 4
5 | 1424592_a_at | DNASE1 | deoxyribonuclease I
6 | 1448485_at | GGT1 | gamma-glutamyltransferase 1
7 | 1460233_at | GUCA2B | guanylate cyclase activator 2B (uroguanylin)
8 | 1415969_s_at | KAP | kidney androgen regulated protein
9 | 1415968_a_at | KAP | kidney androgen regulated protein
10 | 1435094_at | KCNJ16 | potassium inwardly-rectifying channel, subfamily J, member 16
11 | 1450719_at | MEP1A | meprin A, alpha (PABA peptide hydrolase)
12 | 1418923_at | SLC17A3 | solute carrier family 17 (sodium phosphate), member 3
13 | 1417072_at | SLC22A6 | solute carrier family 22 (organic anion transporter), member 6
14 | 1423279_at | SLC34A1 | solute carrier family 34 (sodium phosphate), member 1
15 | 1425606_at | SLC5A8 | solute carrier family 5 (iodide transporter), member 8
16 | 1449301_at | SLC7A13 | solute carrier family 7, (cationic amino acid transporter, y+ system) member 13
17 | 1435064_a_at | TMEM27 | transmembrane protein 27
18 | 1423397_at | UGT2B17 | UDP glucuronosyltransferase 2 family, polypeptide B17

Common probesets between the top one hundred most influential probesets in the cluster and vehicle dose cluster (cluster 3) of the dataset.


Discussion

The factorization and reordering of the dataset as a whole (Figure 2 and Table 2) successfully clustered samples from the same tissue, and further investigation showed that it simultaneously identified genes with a known relevance to those tissues. It was therefore reasonable to study the genes responsible for this differentiation. In the one-way clustering, the top 100 probe-sets from each of the four tissue specific clusters show remarkable coherence for tissue specific pathways. The calcium signalling pathway is highly enriched in both the H1 and SM1 clusters; these genes are linked to muscle contraction, the prime function of cardiac and skeletal muscles. A deeper look at the probe-sets (Tables 7 and 8) from the heart and skeletal muscle clusters shows a successful identification of differences in the tissue types for this pathway; see Figure 9. MYH1, MYH2, MYH4 and MYL1 of the myosin family, which are specific to skeletal muscle, are found in the SM1 cluster, while the cardiac muscle specific myosin family members MYH6, MYL2 and MYL3 are found in the H1 cluster [19]. This pattern also holds for troponin, calsequestrin, ryanodine receptor and actin family members [20]–[25] (Tables 7 and 8). FXR/RXR activation pathway genes are significantly enriched in the L1 cluster (Figure 7), with most of the enriched genes present in the bile acid synthesis and regulation pathway (Figure 10), which is one of the core functions of liver [26]–[28]. FXR/RXR activation is also found in the K1 cluster, albeit with moderate significance; FBP1 and HNF4A are the two genes present in this pathway and they may be involved in gluconeogenesis in kidney [29].
Figure 9

Heart and muscle genes enriched in calcium signalling – muscle contraction pathway.

IPA analysis of the top 100 probe-sets from heart and muscle gene clusters (Figure 7) showed the enrichment of calcium signalling pathway. In this figure, we have highlighted the genes present in this pathway in orange. Though this pathway is generalised for skeletal muscle contraction and cardiac muscle contraction, they differ in the members of the same gene family. The heart and muscle genes present in this pathway are given in Tables 7 and 8. Pathway diagram was drawn using Path Designer function of IPA [18].

Figure 10

Liver genes enriched in FXR/RXR activation pathway.

IPA analysis of the top 100 probe-sets from the liver cluster (L1) showed the enrichment of FXR/RXR activation pathway. The genes present in this pathway are highlighted in orange. The liver genes present in the pathway are given in Table 9. Pathway diagram was drawn using Path Designer function of IPA [18].

Splitting the dataset into four on the basis of tissue type and factorizing the parts simultaneously gave us the added reassurance of clustering the samples according to the dosage groups (Figure 5 and Table 3). The clustering of one mouse (Mouse E) from the lower dosage group (Group-II) with the higher dosage group (Group-III) can be explained by the higher PPM-201 sensitivity of that mouse, indicated by elevated levels of the toxicology markers ALT, AST, LDH and CK compared with the rest of its group (Table 1). Comparisons of the top probe-sets in the tissue specific clusters with the dosage specific clusters also show very high overlap of tissue specific genes in the four tissue types. In heart, cluster 1 has 22 probe-sets that are common between the top 100 probe-sets of the H1 cluster and the 20 mg/kg dosage cluster of the heart dataset, and these are highly enriched for the cardiac muscle contraction and hypertrophic cardiomyopathy pathways (Table 6). ACTC1, ATP2A2, MYH6, MYL2, MYL3, TNNC1, TNNI3, TNNT2 and TPM1 are the genes enriched for these two pathways and shared between these two clusters. However, cluster 3, with 11 probe-sets in common between the top 100 probe-sets of the H1 cluster and the vehicle dose cluster of the heart dataset, does not show enrichment for the cardiac muscle contraction and hypertrophic cardiomyopathy pathways. From this we may infer that perturbation of these two pathways at the 20 mg/kg dosage may indicate toxic responses. We also see a similar pattern in skeletal muscle.
Between the top 100 probe-sets of the SM1 cluster and the 20 mg/kg dosage cluster of the skeletal muscle dataset, and between the top 100 probe-sets of SM1 and the vehicle dose cluster of the skeletal muscle dataset, 15 and 14 probe-sets were in common; these are named cluster 1 and cluster 3, respectively. The calcium signalling–skeletal muscle contraction pathway is enriched in cluster 1 with the presence of the ACTA1, ATP2A1, MYH1, MYH4, RYR1, TNNC2, TNNI2, TNNT3 and TRDN genes, whereas cluster 3 does not show any significant enrichment for signalling or metabolic pathways. Interestingly, the 49 probe-sets in cluster 2 that are common between the top 100 probe-sets of the L1 cluster and the 6 mg/kg dosage cluster of the heart dataset are highly enriched for the acute phase response signalling, prothrombin activation and FXR/RXR activation pathways, with the presence of the ALB, ABCB11, AMBP, APOA1, APOA2, APOC3, APOH, F2, FGA, FGB, FGG, HAMP, HP, HPX, ORM1, PON1, RBP4, SERPINA1, SERPINC1, SLC27A5 and SLCO1B3 genes (Figure 11). This suggests alterations in lipid metabolism in liver along with tissue injury in heart induced by PPM-201 at 6 mg/kg dosage [30]–[33], which becomes more plausible when we look at the genes in cluster 1 that are common between the top genes and 20 mg/kg dosage cluster of dataset. Enrichment of toxicity functions in cluster 2 using IPA shows an increased level of LDH as one of the toxicity functions (Figure 12), which has been validated by the increased level of LDH in the clinical chemistry results.
Figure 11

Enrichment of canonical pathways in the liver heart gene cluster no. 2.

This gene cluster has 49 common probe-sets between the top one hundred most influential probe-sets in the liver gene cluster and top one hundred probe-sets in cluster number 2 (6 mg/kg dose rate) of the heart dataset reordered by 4-way simultaneous factorization. Canonical pathways enrichment for these 49 probe-sets analysed using the IPA software is shown in this figure. The length of the bars shows the Fisher's exact test p-value for enrichment for a particular pathway in the cluster.

Figure 12

Enrichment of toxicity functions in cluster 2.

This gene cluster has 49 common probe-sets between the top one hundred most influential probe-sets in cluster 2 (6 mg/kg dose rate). Toxicity functions enrichment for these 49 probe-sets analysed using the IPA software is shown in this figure. The length of the bars shows the Fisher's exact test p-value for enrichment for a particular pathway in the cluster.


Conclusions

We have demonstrated that multi-way simultaneous nonnegative matrix factorization can be usefully applied to the case of multiple datasets; here, for what we believe to be the first time, more than two large scale matrices were treated. The results were shown to be consistent with, and to add value to, standard nonnegative matrix factorization of the whole dataset. In summarizing our biological findings, we first note that the roles in metabolism of the three PPAR isoforms, PPAR-α, PPAR-β (also known as PPAR-δ) and PPAR-γ, and their differences in expression across tissues and species, are well known [34]–[36]. In mouse, PPAR-α is highly expressed in liver and to a lesser degree in kidney, heart and skeletal muscle; PPAR-β is expressed in many tissues but peaks in kidney, heart and intestine, whereas PPAR-γ is mostly expressed in adipose tissue [34], [37]. Pan-PPAR agonists activate two or all of the PPAR isoforms and differ in their pharmacological actions. Factorisation of the dataset after splitting it by tissue appears to be beneficial in identifying tissue specific and dosage effects of the experimental pan-PPAR agonist PPM-201 in this study. This approach could be useful in understanding molecular mechanisms and identifying potential tissue specific toxicological effects before they are apparent in histopathology studies. In this study, histopathology examination of heart did not show any defect, though our method of gene expression analysis could identify enrichment of acute phase response signalling genes in heart that may point towards a build-up of toxic responses in heart. Given that many PPAR agonist drugs have been shown to cause cardiac toxicity on prolonged usage, and given the FDA's requirement of a one year toxicity study for PPAR agonist drugs, our results show promise for early detection of toxicity in the drug discovery process.
Overall, our aim here is to establish a proof of principle for the approach of simultaneously analysing multiple, related large datasets. We therefore focused on a dataset where clear-cut validation is possible. However, we note that the technique is very general, and therefore opens up many new opportunities in data-driven computational biology. In particular, it can be applied to heterogeneous sources of data; for example, generated by different laboratories or experimental methodologies. We are currently pursuing this approach in the study of colon cancer.
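
The idea of factorizing several related matrices together can be sketched in a few lines. The code below is a minimal illustration and not the algorithm used in this study: it assumes each tissue matrix A_k (genes × samples) covers the same samples, gives every tissue its own non-negative gene factor W_k, shares one non-negative sample factor H, and uses standard multiplicative updates for the summed squared error. The function names, the rank, the iteration count and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def simultaneous_nmf(datasets, rank, n_iter=200, eps=1e-9, seed=0):
    """Fit A_k ~ W_k @ H for every dataset with one shared H (assumed model)."""
    rng = np.random.default_rng(seed)
    n_samples = datasets[0].shape[1]
    Ws = [rng.random((A.shape[0], rank)) for A in datasets]
    H = rng.random((rank, n_samples))
    for _ in range(n_iter):
        # Update each tissue-specific gene factor with H held fixed.
        for k, A in enumerate(datasets):
            Ws[k] *= (A @ H.T) / (Ws[k] @ (H @ H.T) + eps)
        # Update the shared sample factor using all datasets at once.
        numerator = sum(W.T @ A for W, A in zip(Ws, datasets))
        denominator = sum(W.T @ W for W in Ws) @ H + eps
        H *= numerator / denominator
    return Ws, H


def total_error(datasets, Ws, H):
    """Sum of squared Frobenius errors over all datasets."""
    return sum(np.linalg.norm(A - W @ H) ** 2 for A, W in zip(datasets, Ws))


# Synthetic stand-in for four tissues measured on the same 12 samples.
rng = np.random.default_rng(1)
H_true = rng.random((3, 12))
tissues = [rng.random((30, 3)) @ H_true for _ in range(4)]

Ws, H = simultaneous_nmf(tissues, rank=3)
print("total squared error:", total_error(tissues, Ws, H))
print("sample cluster labels:", H.argmax(axis=0))
```

In this sketch the columns of H give one grouping of the samples that is common to all tissues, while each W_k gives a separate gene ordering per tissue, which mirrors the separation between dosage groups and tissue specific gene lists discussed above.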