
Bioinformatic screening of autoimmune disease genes and protein structure prediction with FAMS for drug discovery.

Shigeharu Ishida, Hideaki Umeyama, Mitsuo Iwadate, Y-H Taguchi.

Abstract

Autoimmune diseases are often intractable because their causes are unknown. Identifying the genes that contribute to these diseases may help us understand their pathogenesis, but determining which genes contribute is difficult. Recently, epigenetic information has been considered to activate or deactivate disease-related genes, so it may be useful to study epigenetic information that differs between healthy controls and patients with autoimmune disease. Among the several types of epigenetic information, promoter methylation is believed to be one of the most important factors. Here, we propose that principal component analysis is useful for identifying gene promoters that are differently methylated between healthy controls and patients with autoimmune disease. Full Automatic Modeling System (FAMS) was used to predict the three-dimensional structures of the selected proteins and inferred relatively confident structures. Several possible applications of the obtained structures to drug discovery are discussed.

Substances: Ligands, Proteins

Year: 2014 PMID: 23855671 PMCID: PMC4141326 DOI: 10.2174/09298665113209990052

Source DB: PubMed Journal: Protein Pept Lett ISSN: 0929-8665 Impact factor: 1.890

INTRODUCTION

Autoimmune disease is defined as "a clinical syndrome caused by the activation of T cells or B cells, or both, in the absence of an ongoing infection or other discernible cause" [1]. As this broad definition suggests, the causes of autoimmune diseases are not generally known. However, the genetic background of patients is generally believed to be important [2]. Furthermore, since the concordance rate between monozygotic (MZ) twins is not always high [3], mechanisms other than simple identity of DNA sequences have been sought. Recently, epigenetics has come to be regarded as a key factor in understanding the regulation of autoimmune diseases [3, 4]. Among epigenetic mechanisms, promoter methylation is regarded as the most critical factor [5]. Although studies have suggested that common genes underlie the pathogenesis of multiple autoimmune disorders [2], there have been no reports of common methylation patterns in more than one autoimmune disease. Recently, Javierre et al. [6] investigated white blood cells from patients with three autoimmune diseases: systemic lupus erythematosus (SLE), rheumatoid arthritis (RA) and dermatomyositis (DM). However, disease-specific promoter methylation was identified only in SLE [6]. Thus, no promoters commonly methylated across these three autoimmune diseases had been found. In this paper, we reanalyzed the methylation data of Javierre et al. [6] and found disease-specific promoter methylation in RA and DM as well. We also identified a common methylation pattern in the three autoimmune diseases. The selected genes were evaluated based on protein structure prediction by FAMS [7, 8] and were compared with previous studies of the individual genes. We also investigated the relationship between the selected genes and the proteins commonly expressed in more than one autoimmune disease [9]. The results presented here support the validity of our findings. The possibility of using the FAMS-based results for drug discovery is also discussed.

MATERIALS AND METHODS

Promoter Methylation Profile and Principal Component Analysis (PCA) Based Selection of Genes

The supplementary file Supp_Data_2.xls, which can be downloaded from the journal website [6], was used as the promoter methylation profile. A tab-delimited file was generated from it and loaded into R [10] with the read.csv function. All 59 samples were employed for further analysis (Table ). PCA is an ordination method: objects placed in a high-dimensional space are mapped onto a newly generated lower-dimensional space, and in PCA this projection is a simple linear transformation. PCA was applied to the promoter methylation profiles so that principal component (PC) scores were attributed to the probes that represent the gene promoters. PCA was also applied separately to the sample sets corresponding to each disease (SLE, RA and DM). Probes (gene promoters) that were outliers along either the second or the third PC were selected as being significantly and differently methylated between the normal control and disease groups. Which PC was used depended upon which axis represented the difference between disease and normal controls. Further details about the thresholds and selection criteria can be found in Supporting Information.
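The selection step can be sketched in a few lines. The block below is a minimal Python illustration, not the authors' R code: it treats probes as rows and samples as columns, obtains PC scores per probe by power iteration on the sample covariance matrix, and flags probes whose score along a chosen PC lies far from the mean. The function names, the z-score rule and the threshold value are all assumptions made for illustration; the thresholds actually used are given only in Supporting Information.

```python
import math
import random
import statistics


def pca_probe_scores(profile, n_components=3, n_iter=500, seed=0):
    """Return (scores, loadings) for a probes-by-samples methylation matrix.

    Power iteration with deflation on the sample-by-sample covariance
    matrix; a stand-in for a library PCA routine.
    """
    n_probes, n_samples = len(profile), len(profile[0])
    # Centre each sample column on its mean over probes.
    means = [sum(row[s] for row in profile) / n_probes for s in range(n_samples)]
    centred = [[row[s] - means[s] for s in range(n_samples)] for row in profile]
    cov = [[sum(r[a] * r[b] for r in centred) / (n_probes - 1)
            for b in range(n_samples)] for a in range(n_samples)]
    rng = random.Random(seed)
    loadings = []
    for _ in range(n_components):
        v = [rng.random() + 0.1 for _ in range(n_samples)]
        for _ in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(n_samples))
                 for a in range(n_samples)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left after deflation
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(n_samples) for b in range(n_samples))
        loadings.append(v)
        # Deflate: subtract the component just found before the next pass.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b]
                for b in range(n_samples)] for a in range(n_samples)]
    scores = [[sum(r[s] * v[s] for s in range(n_samples)) for v in loadings]
              for r in centred]
    return scores, loadings


def select_outlier_probes(scores, pc, z_threshold):
    """Indices of probes whose score on component `pc` (0-based) is extreme.

    The z-score cut-off is a hypothetical selection rule.
    """
    column = [row[pc] for row in scores]
    mu, sd = statistics.mean(column), statistics.stdev(column)
    return [i for i, value in enumerate(column) if abs(value - mu) > z_threshold * sd]
```

The loadings returned alongside the scores play the role of the per-sample contributions to each PC, which is what is inspected to decide whether a given axis separates disease from control or reflects something else, such as gender.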

t Test for Different Methylation of Probes

A t test was used to determine whether the selected probes had promoters that were differently methylated between the diseased and the healthy twin. The procedure was as follows. Let $N$ be the number of probes and $x_{ijk}$ the amount of methylation of the $i$th promoter for the $j$th individual who belongs to the $k$th subgroup ($k$ corresponds to disease, either SLE, RA, or DM). Here, $j$ is either 1 (diseased twin) or 0 (healthy twin). We defined the amount of differential methylation between the diseased twin and the healthy twin as $\Delta x_{ik} = x_{i1k} - x_{i0k}$. With probes $i' \in S'$ selected as being differently methylated and probes $i'' \in S''$ as not, P values were determined by a t test comparing $\Delta x_{i'k}$ for the probes in $S'$ with $\Delta x_{i''k}$ for the probes in $S''$.
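As a rough illustration of this comparison, the following Python sketch computes the per-probe differential methylation (taken here as the diseased-twin value minus the healthy-twin value, which is an assumption about the sign convention) and a two-sample t statistic between selected and non-selected probes. The text does not say which variant of the t test was used; Welch's unequal-variance form is assumed, and the function names are hypothetical. Only the statistic and degrees of freedom are computed; turning them into a P value needs a t-distribution routine from a statistics package.

```python
import math
import statistics


def differential_methylation(diseased, healthy):
    """Per-probe difference: diseased-twin value minus healthy-twin value."""
    return [d - h for d, h in zip(diseased, healthy)]


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard errors of the two sample means.
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


def compare_selected(delta, selected):
    """t statistic for selected probes (S') versus all other probes (S'')."""
    chosen = set(selected)
    in_set = [d for i, d in enumerate(delta) if i in chosen]
    out_set = [d for i, d in enumerate(delta) if i not in chosen]
    return welch_t(in_set, out_set)
```

A negative t statistic from `compare_selected` would indicate that the selected probes are, on average, less methylated in the diseased twin than the remaining probes.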

Protein Structure Prediction and Protein Complex Inference Based on FAMS

The amino acid sequences of the selected genes were downloaded from Swiss-Prot, and their three-dimensional protein structures were inferred by FAMS [7, 8]. Whether a pair of model proteins used for structure prediction could form a protein complex was determined as follows. First, Protein Data Bank (PDB) files that contained at least one model protein as a member of a protein complex were downloaded. Then, the model proteins that appeared together in the same PDB file were identified. Next, inter-atomic distances between pairs of model proteins that belonged to the same PDB file were computed. If at least one pair of atoms was separated by less than 3.5 Å, the pair of model proteins was listed as a candidate to form a protein complex.
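The distance criterion is simple enough to state as code. The sketch below assumes that the atomic coordinates of two chains have already been read from a PDB file into lists of (x, y, z) tuples in ångströms; parsing is left out, and the function names are illustrative rather than part of FAMS.

```python
import math

CONTACT_CUTOFF = 3.5  # ångströms, the threshold quoted in the text


def count_contacts(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """Number of inter-chain atom pairs closer than `cutoff`."""
    return sum(1 for a in atoms_a for b in atoms_b if math.dist(a, b) < cutoff)


def is_complex_candidate(atoms_a, atoms_b, cutoff=CONTACT_CUTOFF):
    """True if at least one inter-chain atom pair is closer than `cutoff`."""
    return any(math.dist(a, b) < cutoff for a in atoms_a for b in atoms_b)
```

The all-pairs loop is quadratic in the number of atoms; for large structures a spatial grid or k-d tree would be the usual replacement, but the criterion itself stays the same.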

“in silico” Ligand Screening

First, amino acid sequences registered in PDB with more than 95 % sequence similarity with a reference protein were listed. Among those PDB structures, those with more than 30 heavy atoms and without any atoms within Van der Waals radius were selected. Then we analyzed the PDB files that had at least one bound ligand. After removing ligands from these selected PDB files, chooseLD [11] was run with alternative ligand candidates. FingerPrint Alignment scores (FPAscores) were then used to score the ligand candidates.

Validation of “in silico” Ligand Screening with ChEMBL Database

To validate the results of “in silico” ligand screening, we used an assay registered in ChEMBL [12] in which ligand binding to the reference protein was investigated, and obtained the Ki values of the ligands from ChEMBL [12]. ChEMBL is an Open Data database containing binding, functional and “Absorption, Distribution, Metabolism, Excretion, Toxicity” (ADMET) information for a large number of drug-like bioactive compounds. Ki is the inhibition constant and represents the ability of a ligand to reduce the function of a protein. The correlation coefficients between Ki values and FPAscores were used as a measure of the accuracy of “in silico” ligand screening.
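The validation measure can be illustrated with a short, dependency-free sketch. The text does not name the correlation coefficients used, so Pearson's and Spearman's coefficients are shown here as two common choices; the helper names are hypothetical, and any numbers passed in would come from the ChEMBL assay and the chooseLD output rather than from this code.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because Ki values often span several orders of magnitude, a rank-based coefficient such as Spearman's is less sensitive to a few extreme values than Pearson's.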

RESULTS

PCA Application for Whole Samples

First, PCA was applied to all 59 samples (Fig. ). It produced a one-dimensional structure and a barb-like structure was observed that extended to the right end of the one-dimensional structure towards the middle-bottom. To understand what this barb-like structure means, we drew the contribution of each sample to PCs (Fig. ). In contrast to the first PC, which had almost constant contributions from each sample (Fig. ()), the second PC had contributions with opposite signs, dependent upon the gender of each sample (Fig. ()). Thus, it is probable that the barb-like structure expresses differences between genders. This is likely, since genes located on the X chromosome are concentrated in this barb-like structure (red dots in Fig. ()). Once genes located on the X chromosome are removed, the barb-like structure disappears (Fig. ()).

PCA Application to SLE Samples

To select genes differently methylated between normal controls and SLE patients, we applied PCA to SLE samples (IDs 34, 20.1, 11, 29 and 20.2 in Table ). Figure () shows two-dimensional embeddings of probes spanned by the second and third PC scores. Since the first PC did not represent a difference among probes but had an almost constant value for all probes, we hereafter excluded the first PC scores from further analysis. Although it is not very clear, a bump-like structure expands from the origin toward the negative direction of the second PC scores. To understand the biological meaning of this bump-like structure, we drew the contribution of each sample to the second PC (Fig. ). SLE samples were subdivided into five subgroups, each of which consisted of twins and age/gender-matched controls. Diseased twins (○) always had the largest second PC scores within each subgroup (Fig. ()). Thus, the second PC scores were expected to express differences in gene promoter methylation between diseased twins and controls. To understand the differences, we selected 58 probes located in the outer region along the second PC (red dots in Fig. ), and applied the t test to determine whether methylation was distinct between probes in the outer region and the others (Fig. ). For all five subgroups, the selected probes (red) were significantly demethylated in the diseased twins relative to the other probes (black). This is consistent with the general belief that genes that cause autoimmune diseases are overexpressed, because methylation is believed to suppress gene expression. Thus, PCA-based probe selection appears to have correctly selected the differently methylated probes.

PCA Application to RA Samples

We repeated the same procedure for 20 RA samples (IDs 4, 12, 80, 85 and 86 in Table ). The results are shown in (Figs. , and ). Although it is less obvious than for the SLE samples, the 53 selected probes form a bump-like structure along the second PC score axis (Fig. ()). These probes were significantly demethylated in the diseased twins (○) compared with the healthy twins (△) as expected (Fig. ), excluding the subgroup with ID: 85. The striking difference between the SLE samples and the RA samples is that the RA diseased twins did not have distinct second PC scores compared with the age/gender matched controls (+ and ×, Fig. ), while the SLE diseased twins did. This suggests that RA diseased twins do not have genes that are methylated differently from normal controls and that the difference between RA diseased twins and the group consisting of healthy twins and normal controls was not significant. This may be the reason why Javierre et al. [6] did not find any probes expressed differently or significantly between controls and RA patients. Again, the PCA-based probe selection successfully selected differently expressed probes.

PCA Application to DM Samples

We repeated the same procedure using 20 DM samples (IDs 16, 33, 103, 127 and 129 in Table ). The results are shown in (Figs. , , and ). Compared with the former two cases using SLE and RA samples, the DM samples were harder to analyze. First, as can be seen in (Fig. ), the second PC did not express differences between the control and disease groups but rather between genders. This suggested that the difference between disease and control groups was not even of secondary importance. The difference between disease and control groups was only observed in the third PC (Fig. ()). Second, along the third PC score, the bump-like structure was less clear, even compared with the RA samples (Fig. ()). This prevented us from selecting a clear threshold for outliers. Thus, we tentatively selected 44 probes (red dots in Fig. ) with negative third PC scores. Close inspection of the distribution of the third PC scores revealed a slight hump (red regions and a red arrow in Fig. ), so we selected the probes that belonged to this small hump as outliers, that is, as the probes that were differently methylated between disease and control groups. Third, the observed differences of the third PC scores between disease and control groups were dependent upon gender (Fig. ()). Although female twins (red) had larger third PC scores for DM twins (○) compared with healthy twins (△), male twins (black) had smaller PC scores for DM twins (○) compared with healthy twins (△). This gender dependence makes it difficult to identify significant differences between disease and control groups, because the opposite effects in males and females can cancel out when both genders are pooled. This possibly also reduced the apparent amount of methylation/demethylation differences between control and disease groups. Differential methylation is reversed between males (ID: 16 and 33 in Fig. ) and females (ID: 103, 127, and 129 in Fig. ). This may explain why Javierre et al. [6] did not find probes that differed significantly between control and DM groups. Thus, the PCA-based method can select promoters that are methylated differently between normal twins and diseased twins from DM samples.

Comparisons with the Previous Feature Selection Methods and Comparisons with other Data Sets

Comparisons with previous feature selection methods and with other (independent or validation) data sets deserve discussion. However, since that discussion is lengthy and is not directly related to the outcome of the present study, we moved it to Supporting Information. In summary, we confirmed that our method outperformed previously proposed methods [13, 14] (none of the previously proposed methods tested provided genes commonly selected for the three diseases), and that our method also works when applied to other samples (SLE [15] and RA [16] only; no DM samples were available).

Protein Structures Predicted by FAMS

To understand the functions and structures of the selected genes, we applied FAMS to the genes selected by PCA. In Table , we listed the genes selected in the present study. These genes were used as reference proteins for FAMS. Together with the reference proteins, we listed the model proteins that were inferred by FAMS to have structures similar to each of the selected genes. First, FAMS successfully listed model proteins with small P-values for most of the reference proteins. Fig. shows a representative example of the combination of a model protein and a reference protein: the model protein 2OQ0_B and the reference protein AIM2. The aligned regions comprised a 192 amino acid sequence of 2OQ0_B (total length = 209 amino acids) and a 191 amino acid sequence of AIM2 (total length = 343 amino acids). Sequence similarity between the two aligned regions was 44 %, and the attributed P-value was 2 × 10⁻⁹⁰. 2OQ0_B is annotated as IFI-16. AIM2 itself is included in the present PDB, and the structure predicted by FAMS using IFI-16 as a model protein was very similar to this true structure, even though the sequence similarity between AIM2 and IFI-16 is less than 50 %. This indicates that structure prediction using FAMS can be accurate at this level of similarity. Although this is only one example of a typical relationship between model and reference proteins, it was generally representative of the quality of structural similarity we observed for other proteins. This suggests that the structural homology between models and references is reliable.

DISCUSSION

First, we will discuss the analysis of our results to determine whether our selection process was correct. Then, the possibilities of the application to drug discovery will be discussed.

Comparison with Previous Studies and between Diseases

We compared our results with those of Javierre et al. [6] (Table ). There are substantial overlaps between our results and those of Javierre et al. [6]. Considering that different methods and/or samples were employed, this agreement is remarkable and suggests that our analysis was sound. Figure shows the overlap between the distinct diseases. Again, the agreement is very high. Thus, to our knowledge, we have identified commonly methylated promoters for more than one autoimmune disease for the first time.

Biological Relevance of Selected Proteins

In this subsection, we examine whether the selected proteins are consistent with what was previously known, which is schematically summarized in (Fig. ). Recently, O’Hanlon et al. [9] investigated plasma proteomic profiles from disease-discordant MZ twins (four pairs discordant for SLE, four pairs discordant for juvenile idiopathic arthritis (juvenile RA) and two pairs discordant for juvenile DM), and found that several proteins were differently expressed between patients and normal controls. Among the proteins found, O’Hanlon et al. [9] identified 11 genes (STX17, MGAM, PON1, C6, SYNE1, PLEKHG5, ZNA2GP, LRG1, PKD1 and APOA2) as important. Many clinical studies suggest that these proteins (genes) are related to the genes selected in the present study.

For example, G-CSF was reported to upregulate LRG1 [17], while LRG1 was upregulated in patients as shown by O’Hanlon et al. [9]. LRG1 and G-CSF were also upregulated in chemia-conditioned mice [18]. Ai et al. [19] also discussed the tight relationship between G-CSF and LRG1, which is located downstream of PU.1 signaling pathways. PU.1 is targeted by RARA [20], one of the promoters we identified as being differently methylated in patients compared with normal controls (Table , see also Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway hsa05200). LRG1 is upregulated during neutrophil activation. Neutrophil granulocytes are the most abundant type of white blood cells in mammals and form an essential part of the innate immune system. It is notable that LCN2 (NGAL: neutrophil gelatinase-associated lipocalin) was also selected by our study. Thus, the selection of CSF1R, CSF3, and CSF3R as critical genes by our analysis is unlikely to be accidental and appears biologically meaningful.

O’Hanlon et al. [9] also emphasized the importance of PON1, a gene that causes IFI16-mediated expression changes in endothelial cells [21]. IFI16 was reported to have a similar structure to AIM2 in our research (see Fig. ). Fischer et al. [22] confirmed that C6 is a toll-like receptor (TLR) 4 agonist. C6 is expressed in autoimmune disease patients and is involved in the key pathway related to autoimmune disease [9], while TLR4 was found to have a similar structure to CARD15 by our study.

Syntaxin 17 (STX17) protein was highlighted by O’Hanlon et al. [9]. Syntaxin forms a SNARE complex with VAMP [23], and VAMP8 was selected by our analysis. VAMP8 and STX17 both belong to the KEGG pathway "SNARE interactions in vesicular transport" (hsa04130). SNARE complexes including VAMP8 were reported to be related to the autoimmune disease Sjögren’s syndrome [24], and were proposed to be a novel target for the development of new therapies to treat allergic and autoimmune disease [25]. Gordon et al. [26] also studied SNARE complex structures including STX17 and VAMP8 using a newly developed assay. They experimentally confirmed the various functions of SNARE complexes, although they could not identify the specific function of either STX17 or VAMP8 in cell secretion processes. Thus, VAMP8 is a potential drug target for the treatment of autoimmune diseases. Mashima et al. [27] reported that irf2 deficiency caused low serum levels of elastase-1 (ELA1) mediated by SNAREs. PI3 binds to ELA1 [28]. Thus, it is likely that PI3 is involved in mediating the functions of SNARE.

O’Hanlon et al. [29] also investigated gene expression profiles. RNA microarray analyses (Agilent Human 1A(V2) 20K oligo arrays) were used to quantify gene expression in peripheral blood cells from 20 MZ twin pairs discordant for systemic autoimmune diseases. The cohort consisted of six affected probands with SLE, six with RA, eight with idiopathic inflammatory myopathies, and their same-gender unaffected twins. O’Hanlon et al. [29] confirmed that the expression of several genes differed between normal controls and patients. The genes that O’Hanlon et al. [29] identified as important were TNFAIP6, TNFSF10, MAP2K6, IL1RN, IFI27, ANXA3, CEACAM6, DEFA4, EIF2AK2, FCERA1, SYTL2, LGR6, KRTCAP2, LTK, and FYN. Among these genes, TNFAIP6 was observed to be co-expressed with AIM2 during the first 10 weeks of therapy with pegylated interferon alfa-2b (PegIntron) and ribavirin (administered by weight) in HCV patients [30].

TIE1 is a receptor tyrosine kinase, and HGF is the ligand of the receptor tyrosine kinase MET. The overexpression of HGF results in the upregulation of TIE1 and PECAM in the 3T3-F442A mouse cell line [31]. PECAM1 functions in the KEGG pathway for leukocyte transendothelial migration (hsa04670). TIE1 upregulates the cell adhesion molecules (CAMs) VCAM-1, E-selectin, and ICAM-1 through a p38-dependent mechanism. VCAM1 mediates adhesion between leukocytes and endothelium prior to leukocyte transendothelial migration. This is important because migrating leukocytes are implicated in RA. However, HGF downregulates VEGF-mediated expression of ICAM-1 and VCAM-1 at the transcriptional level [32]. The association between HGF and CD82 has also been reported [33, 34]. Thus, it is likely that HGF, TIE1, PECAM1 and CD82 work cooperatively and may be potential drug targets for the treatment of autoimmune diseases.

The Nucleotide Oligomerization Domain (NOD)-like receptor (NLR) signaling pathway (hsa04621) includes NOD2 (CARD15) and TRIP6. NLRs are cytoplasmic proteins with a variety of functions in the regulation of inflammatory and apoptotic responses. CARD15 has been identified as a risk factor gene for Crohn’s disease [2]. SYK is a spleen tyrosine kinase [EC:2.7.10.2], which functions in many immune-related KEGG pathways such as natural killer cell mediated cytotoxicity (ko04650), B cell receptor signaling (ko04662), Fc epsilon RI signaling (ko04664), and Fc gamma R-mediated phagocytosis (ko04666). Natural killer cell mediated cytotoxicity is the directed killing of a target cell by a natural killer cell through the release of granules containing cytotoxic mediators or through the engagement of death receptors.

Taken together, a remarkable number of the proteins selected by the present study have strong relationships with proteins previously reported to be involved in autoimmune disease-related biological processes.

Biological Significance of Inference by FAMS

In this subsection, we examine whether the biological features attributed to the model proteins selected by FAMS (Table ) are reasonable. Because of space limitations, we cannot explain all of them individually, so we discuss selected examples. In the following, a reference protein described as "in PDB" has its own structure registered in PDB; otherwise the protein named after it is the model protein selected by FAMS.

TRIP6 is expected to have a similar structure to CRP1, which is involved in immune responses [35]. TM7SF3 was modeled on a cytochrome c oxidase, which was reported to bind to immune gamma-globulins [36]. TIE1, PECAM1 and CSF3R were modeled on Down Syndrome Cell Adhesion Molecule (DSCAM), which is an immunoglobulin (Ig)-superfamily receptor in insects [37]. SYK was modeled on the tyrosine-protein kinase ZAP-70, and both SYK and ZAP-70 display distinct requirements for Src family kinases in immune response receptor signal transduction [38]. STAT5A is in PDB and has a critical role in cytokine responses and normal immune functions [39]. SPI1 was modeled on ETS1, which is expressed in SLE and functions in the immune system [40]. S100-A2 is in PDB and is reported to be an antibody that inhibits the receptor for advanced glycation end products (RAGE) ligands [41]. RARA is structurally similar to RXR-α, which is involved in inflammatory responses [42]. PI3 was modeled on a WAP protein, which is important in innate immunity [43]. PADI4 is in PDB and is important in RA [44]. The structure of MPL was inferred to be similar to IL6RB; IL6R is a key mediator of RA [45]. LCN2, also called NGAL, is in PDB. NGAL has been tested as a marker of inflammatory status for the early diagnosis of inflammatory diseases such as DS [46]. HNF-6 is the model protein for HOXB2 and causes immunologically distinct features [47]. AIM2 is structurally similar to IFI16, and AIM2 and IFI16 have critical roles in immunology [48]. CARD15 was inferred to be similar to TLR4, which has a role in cell antiviral responses together with TLR3:TICAM1-specific signaling pathways [49]. CD82 was modeled on an acetylcholine receptor protein that has functions in the immune system [50].

CSF1R was modeled on TITAN, which is involved in immune responses [51]. SPP1 was modeled on an acid phosphatase, which is related to autoimmune prostatitis [52]. LMO2 (rhombotin-2) is related to ZFAT (a zinc-finger gene in the autoimmune thyroid disease susceptibility region), an immune-related transcriptional regulator containing 18 C2H2-type zinc-finger domains and one AT-hook [53]. DHCR24 was modeled on a cytokinin dehydrogenase; cytokinins are plant hormones distinct from cytokines, so this model protein has no evident immune function. SEPT9 is homologous to septin-2, which is upregulated in cytoskeletal and immune function-related proteome profiles [54]. IFNGR2 was modeled on a four-domain fibronectin fragment, which plays a role in autoimmune diseases [55]. CSF3 is in PDB and is related to immune system functions [56]. GRB7 was modeled on GRB10, which has roles in the immune system and in cancer [57]. HGF is related to plasminogen, which plays critical roles in autoimmune diseases together with matrix metalloproteinase (MMP) 9 [58]. LTB4R was modeled on a substance-P receptor, which mediates immune responses to respiratory syncytial virus infection [59].

This is only a partial list of the immune system-related features attributed to the model proteins selected by FAMS; further examples are omitted because of space restrictions. It is unlikely to be accidental that most of the selected model proteins have some relationship to immune-related molecular functions. This suggests that the three-dimensional structures inferred by FAMS have potential for autoimmune disease-related drug discovery.

Possibility of Drug Discovery

When the selected proteins are regarded as drug targets, it is remarkable that most of the three-dimensional structures predicted by FAMS are based on immune-related model proteins. The next step is to put this structural information to use for drug discovery. In this subsection, we demonstrate the possibilities of drug discovery for the proteins selected in the present study.

Ligand Binding to “Pocket”

The most popular method of identifying drugs is to find a small molecule that binds to a “pocket” of a protein. If FAMS can identify or suggest such a candidate for each gene in Table , it will be very useful. For example, two proteins, MMP8 and MMP14, are shown in Table . They coregulate target genes [60]. Both are members of the MMP family, which is related to inflammation. For MMP8, using 1XUC_A (MMP-13) as a template, FAMS showed that there might be many ligands that bind to MMP8 (Fig. ). Similarly, for MMP14, using 1BQO_B (MMP-3) as a template, FAMS showed that many ligands could also bind to MMP14 (Fig. ). Although this cannot be described as finding a new drug, it does show that the proteins listed in Table may be novel drug targets. To test the feasibility of “in silico” ligand screening, we employed MMP8. For MMP8, we identified three PDB structures as model proteins: 1A86_pld, 1I73_pld, and 1MMB_BAT. Table shows the correlation coefficients between FPAscores and Ki values for 15 ligand candidates used in the assay CHEMBL711322. In this calculation, we took the mean of the FPAscores obtained with the model proteins indicated by a three-digit number, in which each digit marks whether the corresponding model protein is included. For example, 100 represents the consideration of only 1A86_pld and the exclusion of 1I73_pld and 1MMB_BAT. Figure shows the scatter plots of FPAscores and Ki values. Table shows the FPAscores and Ki values for the ligand candidates. Table suggests that the mean over either of two pairs, i.e., 1A86_pld and 1I73_pld, or 1A86_pld and 1MMB_BAT, provides a significant correlation independent of the type of correlation coefficient considered. Based on the results obtained here, “in silico” screening for drug discovery appears feasible.
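The three-digit notation amounts to a mask over the model proteins. The sketch below averages per-ligand FPAscores over the models switched on by such a mask. The text fixes only that the first digit refers to 1A86_pld; the assignment of the second and third digits to 1I73_pld and 1MMB_BAT follows the order in which the models are listed and is an assumption, as are the function name and the input layout.

```python
MODEL_ORDER = ["1A86_pld", "1I73_pld", "1MMB_BAT"]  # assumed digit order


def mean_fpascore(scores_by_model, mask, model_order=MODEL_ORDER):
    """Average per-ligand scores over the models whose mask digit is '1'.

    scores_by_model maps a model name to a list of scores, one per ligand,
    with ligands in the same order for every model.
    """
    if len(mask) != len(model_order) or set(mask) - {"0", "1"}:
        raise ValueError("mask must have one 0/1 digit per model")
    chosen = [name for name, digit in zip(model_order, mask) if digit == "1"]
    if not chosen:
        raise ValueError("mask selects no model")
    n_ligands = len(scores_by_model[chosen[0]])
    return [
        sum(scores_by_model[name][i] for name in chosen) / len(chosen)
        for i in range(n_ligands)
    ]
```

With three models there are seven non-empty masks (100 through 111), so each averaged score vector can be correlated against the Ki values in turn to see which combination of model proteins tracks the assay best.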

Inhibition of Protein Complex Formation

Another new possibility for drug targets is the inhibition of protein complex formation. Many proteins cannot function alone and must form complexes with other proteins. Thus, if we inhibit the formation of a protein complex, we can also inhibit its function. Table lists protein complex candidates inferred by FAMS. Since FAMS uses as a model protein a representative of each cluster of proteins with more than 95 % sequence similarity, there are often more than a thousand model proteins that can bind to other proteins. Nevertheless, the list includes many reasonable outcomes. For example, there are 52 model protein complexes listed when using both CSF3 and CSF3R as reference proteins. From the names alone, they are likely to be a ligand and its receptor. There are also 186 model protein complexes between CSF3R and CSF1R. This suggests that the two monomers can form a protein complex that functions together, possibly as a receptor. In addition, both CSF3R and CSF1R most frequently have a non-zero number of model proteins that bind to other reference proteins. This is reasonable since many proteins can bind as a ligand or can form a receptor. Further analysis of this table should provide useful information for finding drug targets that act by inhibiting the formation of protein complexes. In addition to these known protein complexes, there are many new candidates for protein complex formation. Figure shows one such candidate. In Table , we computed binding energies with the FiberDock software [61]. The binding energy obtained was low enough to be consistent with the formation of a protein complex. In Table , there are 410 possible candidate pairs between CSF1R and PECAM1. Among these, there is one pair with 61 atom pairs in contact with each other. This means there is a structure in PDB (2ZJS) that includes monomers whose protein structures are expected to be similar to CSF1R and PECAM1, respectively. 2ZJS is a SecYE translocon, which functions as a protein-conducting channel [62]. Although this protein complex was found in Thermus thermophilus, the protein is expected to be highly conserved, so it is possible that CSF1R and PECAM1 form a protein complex that is secreted across or integrated into membranes and plays a critical role in human autoimmune diseases. Thus, if we can identify a drug that inhibits the formation of a protein complex between CSF1R and PECAM1, it may be useful for treating autoimmune diseases. Many more candidates for protein complex formation were detected, but we cannot report all of them here because of space limitations; they will be reported in the future.

Why is PCA Useful to Infer Differential Expression between Control and Disease?

To our knowledge, this is the first report to describe commonly methylated promoters between more than one autoimmune disease and healthy controls. In this subsection, we argue why it is difficult to find commonly methylated promoters between more than one autoimmune disease and healthy controls and why our PCA-based method was successful. First, to determine whether individual promoters are differently methylated between control and disease groups by the conventional statistical test, more than one replicate is required. However, as shown by Javierre et al. [6], when twins are used, it is impossible to obtain more than one replicate, because they are twins. Twins consist of two genetically identical humans. Thus, if one has the disease and the other does not, there is no way to obtain additional genetically identical individuals. In this case, an additional normal control must be used to obtain biological replicates. However, since additional subjects will have a different genome from the twins, the addition of non-twin control samples weakens the ability of detection of significance. Another method to apply the statistical test to the set of probes is to divide the probes into two classes: differently expressed probes and others. However, to find suitable classifications with which to divide probes to differently expressed probes and others, multiple trials and errors are required. These trials weaken the sensitivity of the test because of the correction due to the multiple comparisons. Although these are general problems, there are also some autoimmune-specific problems. For example, there is a strong gender-dependence of promoter methylation related to autoimmune diseases. When PCA is applied to the whole sample, the second PC is related to X chromosome specific methylation (Fig. ). While males have only one X chromosome, females have two. Usually, this difference is sustained by methylation of one pair of the female X chromosomes. 
Thus, one has to exclude genes located on the X chromosome. However, differential promoter methylation between genders is also observed for DM (see section “PCA application to DM samples”), which can further weaken the sensitivity of statistical tests. Usually, there is no way to detect such dependencies in advance. PCA, however, detected them correctly and selected differently expressed probes while taking these gender specificities into account. Thus, PCA can separate different dependencies from each other. For example, for DM, the second PC represented gender-specific promoter methylation (Fig. ()), while the third PC reflected differential methylation between healthy twins and DM twins (Fig. ()). It even automatically reflected the reversal of methylation depending on gender. For RA, the subgroup with ID 85 did not show differential methylation, and this was also reflected automatically (Figs. ()). PCA does not assume a pre-defined classification but reflects the classifications with the larger inter-class differences. If some classes are accompanied by differential methylation, PCA is likely to detect them. This is not guaranteed, but it worked in the present study, which explains why differences could be detected by PCA-based selection of promoter methylation.
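The separation of dependencies described above can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration, not the authors' procedure or data: the toy matrix, the effect sizes, and the function names `simulate_methylation`, `pc_aligned_with` and `select_probes` are hypothetical. The sketch centers each probe before the decomposition, so the gender-related and disease-related components appear as the first and second PCs rather than the second and third reported for the real data.

```python
import numpy as np

def simulate_methylation(seed=0):
    """Toy probes x samples matrix (hypothetical data, not from the paper)."""
    rng = np.random.default_rng(seed)
    gender = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = male, 1 = female
    disease = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 0 = healthy twin, 1 = diseased twin
    x = rng.normal(0.0, 0.02, size=(500, 8))      # background noise for 500 probes
    x[:50] += 0.5 * gender                        # probes 0-49 depend on gender
    x[50:70] += 0.3 * disease                     # probes 50-69 depend on disease
    return x, gender, disease

def pc_aligned_with(vt, label):
    """Index of the PC whose sample loadings align best with a sample label."""
    lab = label - label.mean()
    lab = lab / np.linalg.norm(lab)
    return int(np.argmax(np.abs(vt @ lab)))

def select_probes(x, label, n_select):
    """Pick the probes with the largest absolute scores on the label-aligned PC."""
    xc = x - x.mean(axis=1, keepdims=True)        # center each probe across samples
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    k = pc_aligned_with(vt, label)
    scores = u[:, k] * s[k]                       # probe scores on component k
    top = np.argsort(-np.abs(scores))[:n_select]
    return k, np.sort(top)

x, gender, disease = simulate_methylation()
gender_pc, _ = select_probes(x, gender, 50)
disease_pc, selected = select_probes(x, disease, 20)
print("gender PC:", gender_pc, "disease PC:", disease_pc)
print("selected probes:", selected)
```

In this toy setting no class labels are used to compute the components; the labels are only consulted afterwards to decide which component to read, which is the property the text relies on when it says PCA does not assume a pre-defined classification.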

CONCLUSIONS

In this study, we applied PCA-based gene selection to promoter methylation measurements to identify promoters of genes that are differently methylated between healthy twins and twins with autoimmune disease. A significant number of gene promoters were demethylated in common across distinct autoimmune diseases, while differing between healthy and diseased twins. The genes whose promoters were commonly methylated were strongly related to genes previously shown to be associated with autoimmune diseases. Their three-dimensional protein structures were also inferred by FAMS. Most of the model proteins selected to infer the three-dimensional structures were immune-related proteins, which suggests that the inference by FAMS was biologically reasonable. Possible applications of the FAMS-inferred structures to drug discovery were discussed. The inference by FAMS may therefore be useful both for ligand discovery and for the search for compounds that inhibit the formation of protein complexes that may promote autoimmune disease.

Table 1.

Information on the samples used. Numbers in parentheses indicate age. M and F in the gender column denote male and female, respectively. Samples annotated as SLE, RA and DM are twins with disease; those annotated as Healthy are the corresponding healthy twins. Control 1 and Control 2 are age- and gender-matched controls.

ID	Gender	Diseased twin	Healthy twin	Control 1	Control 2
			SLE
34	F	SLE (34)	Healthy (34)	Control1 (33)	Control2 (36)
20.1	F	SLE (20)	Healthy (20)	Control1 (23)	Control2 (22)
11	F	SLE (12)	Healthy (12)	Control1 (10)	—
29	F	SLE (29)	Healthy (29)	Control1 (24)	Control2 (31)
20.2	F	SLE (20)	Healthy (20)	Control1 (22)	Control2 (22)
			RA
4	F	RA (43)	Healthy (43)	Control1 (41)	Control2 (48)
12	M	RA (9)	Healthy (9)	Control1 (9)	Control2 (7)
80	M	RA (12)	Healthy (12)	Control1 (14)	Control2 (12)
85	F	RA (18)	Healthy (18)	Control1 (21)	Control2 (24)
86	F	RA (6)	Healthy (6)	Control1 (6)	Control2 (10)
			DM
16	M	DM (34)	Healthy (34)	Control1 (33)	Control2 (34)
33	M	DM (13)	Healthy (13)	Control1 (11)	Control2 (17)
103	F	DM (3)	Healthy (3)	Control1 (9)	Control2 (9)
127	F	DM (11)	Healthy (11)	Control1 (8)	Control2 (12)
129	F	DM (5)	Healthy (5)	Control1 (9)	Control2 (8)

Table 2.

Selected genes and model proteins used for structure prediction. A bold PDB ID indicates that the reference protein itself was found in the PDB.

Reference gene symbol	Model PDB ID	P-value	Model protein name
AIM2	3RN5_A	4 × 10^-84	INTERFERON-INDUCIBLE PROTEIN AIM2
CARD15	3CIY_B	1 × 10^-58	TOLL-LIKE RECEPTOR 4, VARIABLE LYMPHOCYTE (TLR4)
CD82	2BG9_A	0.48	ACETYLCHOLINE RECEPTOR PROTEIN, ALPHA CHAIN
CSF1R	3B43_A	5 × 10^-58	TITIN
CSF3	1GNC_A	8 × 10^-66	GRANULOCYTE COLONY-STIMULATING FACTOR
CSF3R	3DMK_A	4 × 10^-64	DOWN SYNDROME CELL ADHESION MOLECULE (DSCAM)
DHCR24	2Q4W_A	1 × 10^-114	CYTOKININ DEHYDROGENASE 7 (CKO7)
ERCC3	2W74_D	1 × 10^-151	TYPE I RESTRICTION ENZYME ECOR124II R PROTEIN (HSDR)
GRB7	3HK0_B	3 × 10^-73	GROWTH FACTOR RECEPTOR-BOUND PROTEIN 10 (GRB10)
HGF	4DUU_A	1 × 10^-179	PLASMINOGEN
HOXB2	2D5V_A	1 × 10^-23	HEPATOCYTE NUCLEAR FACTOR 6 (HNF-6)
IFNGR2	3T1W_A	4 × 10^-43	FOUR-DOMAIN FIBRONECTIN FRAGMENT
LCN2	1X71_A	4 × 10^-52	NEUTROPHIL GELATINASE-ASSOCIATED LIPOCALIN (NGAL)
LMO2	2XJY_A	2 × 10^-33	RHOMBOTIN-2
LTB4R	2KS9_A	2 × 10^-77	SUBSTANCE-P RECEPTOR
MMP14	1SU3_B	1 × 10^-160	INTERSTITIAL COLLAGENASE (MMP-1)
MMP8	1SU3_B	1 × 10^-171	INTERSTITIAL COLLAGENASE (MMP-1)
MPL	3L5H_A	8 × 10^-61	INTERLEUKIN-6 RECEPTOR SUBUNIT BETA (IL6RB)
PADI4	2DEW_X	0.0	PROTEIN-ARGININE DEIMINASE TYPE IV
PECAM1	3DMK_A	1 × 10^-104	DOWN SYNDROME CELL ADHESION MOLECULE (DSCAM)
PI3	1TWP_A	2 × 10^-19	WHEY ACIDIC PROTEIN (WAP)
RARA	3DZY_A	4 × 10^-95	RETINOIC ACID RECEPTOR RXR-ALPHA
S100A2	2RGI_A	4 × 10^-19	PROTEIN S100-A2
SEPT9	3FTQ_A	1 × 10^-137	SEPTIN-2
SLC22A18	1PW4_A	1 × 10^-108	GLYCEROL-3-PHOSPHATE TRANSPORTER
SPI1	1GVJ_B	1 × 10^-21	C-ETS-1 PROTEIN (ETS1)
SPP1	1D2T_A	1 × 10^-12	ACID PHOSPHATASE (ACP)
STAT5A	1Y1U_A	0.0	SIGNAL TRANSDUCER AND ACTIVATOR OF TRANSCRIPTION (STAT5A)
SYK	2OZO_A	1 × 10^-170	TYROSINE-PROTEIN KINASE ZAP-70
TIE1	3DMK_A	2 × 10^-88	DOWN SYNDROME CELL ADHESION MOLECULE (DSCAM)
TM7SF3	1AR1_A	6 × 10^-88	CYTOCHROME C OXIDASE
TRIP6	1B8T_A	2 × 10^-32	CYSTEINE-RICH PROTEIN 1 (CRP1)
VAMP8	2KOG_A	1 × 10^-21	VESICLE-ASSOCIATED MEMBRANE PROTEIN 2 (VAMP2)

Table 3.

A comparison between previous studies (*) [6] and the present work. Numbers indicate the number of overlaps between previous studies and the present work. Numbers in parentheses represent the total number of probes that were regarded as being expressed differently between disease and control groups in the present work.

SLE(*)	SLE	RA	DM
54	51(58)	48(53)	37 (44)

Table 4.

A comparison between the FPAscores and Kvalues. Three-digit numbers indicate which of the three model PDB files, i.e., 1A86_pld, 1I73_pld, 1MMB_BAT, were considered for choosing LD computation. Adjusted P-values are P-values adjusted by the method of Benjamini and Hochberg [63]. Bold numbers indicate adjusted P-values less than 0.05.

		Pearson			Spearman
Combination	corr.	P-value	Adjusted P-value	corr.	P-value	Adjusted P-value
111	-0.503	0.056	0.098	-0.540	0.038	0.066
011	0.159	0.572	0.572	0.197	0.482	0.562
101	-0.694	0.004	0.016	-0.705	0.003	0.013
110	-0.688	0.005	0.016	-0.698	0.004	0.013
001	-0.334	0.223	0.261	-0.064	0.820	0.820
010	-0.624	0.013	0.030	-0.572	0.026	0.060
100	-0.444	0.097	0.136	-0.431	0.109	0.152
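The Benjamini and Hochberg adjustment named in the caption can be sketched as follows. This is a generic illustration of the standard step-up procedure, not the authors' code; the function name `bh_adjust` is hypothetical, and the inputs are the rounded Pearson P-values printed in the table, so the results can differ slightly from the printed adjusted values, which were presumably computed from unrounded P-values.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                          # ascending P-values
    ranked = p[order] * n / np.arange(1, n + 1)    # p * n / rank
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

# Rounded Pearson P-values as printed in Table 4, in row order.
combos = ["111", "011", "101", "110", "001", "010", "100"]
pearson_p = [0.056, 0.572, 0.004, 0.005, 0.223, 0.013, 0.097]

adj = bh_adjust(pearson_p)
significant = [c for c, a in zip(combos, adj) if a < 0.05]
for c, a in zip(combos, adj):
    print(c, round(float(a), 4))
print("adjusted P < 0.05:", significant)
```

With these rounded inputs, combinations 101 and 110 both come out at about 0.0175 (the table prints 0.016), and the same three combinations, 101, 110 and 010, fall below 0.05.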

Table 5.

The number of model proteins that can bind to other model proteins

Table 6.

Calculated binding energy by FiberDock [61] for the binary complex shown in Fig. S17.

glob	aVdW	rVdW	ACE	inside	aElec	rElec
-32.93	-40.51	17.63	13.06	11.41	-47.40	61.35
laElec	lrElec	HB	piS	catpiS	aliph	BBdeform
-7.64	11.38	-0.36	-13.00	0.00	-5.00	0.00
