Prediction of candidate primary immunodeficiency disease genes using a support vector machine learning approach.

Shivakumar Keerthikumar¹, Sahely Bhadra, Kumaran Kandasamy, Rajesh Raju, Y L Ramachandra, Chiranjib Bhattacharyya, Kohsuke Imai, Osamu Ohara, Sujatha Mohan, Akhilesh Pandey.

Abstract

Screening and early identification of primary immunodeficiency disease (PID) genes is a major challenge for physicians. Many resources have catalogued molecular alterations in known PID genes along with their associated clinical and immunological phenotypes. However, these resources do not assist in identifying candidate PID genes. We have recently developed a platform designated Resource of Asian Primary Immunodeficiency Diseases (RAPID), which hosts information pertaining to molecular alterations, protein-protein interaction networks, mouse studies and microarray gene expression profiling of all known PID genes. Using this resource as a discovery tool, we describe the development of an algorithm for prediction of candidate PID genes. Using a support vector machine learning approach, we have predicted 1442 candidate PID genes using 69 binary features of 148 known PID genes and 3162 non-PID genes as a training data set. The power of this approach is illustrated by the fact that six of the predicted genes have recently been experimentally confirmed to be PID genes. The remaining genes in this predicted data set represent attractive candidates for testing in patients where the etiology cannot be ascribed to any of the known PID genes.

Year: 2009 PMID: 19801557 PMCID: PMC2780952 DOI: 10.1093/dnares/dsp019

Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458

Introduction

Primary immunodeficiency diseases (PIDs) are a genetically heterogeneous group of disorders that affect distinct components of the innate and adaptive immune system, such as neutrophils, macrophages, dendritic cells, natural killer cells and T and B lymphocytes. The study of these diseases has provided essential insights into the functioning of our immune system. More than 120 distinct genes have been identified, whose abnormalities account for more than 150 distinct forms of PID.[1] PIDs are challenging for both researchers and clinicians because they represent natural models of immunopathology, which can usually be studied effectively only in animal models, and because they manifest with a wide range of clinical symptoms, from susceptibility to infections and allergies to autoimmune and inflammatory diseases. The genetic defects that cause PIDs can affect the expression and function of proteins involved in a range of biological processes, such as immune development, effector-cell functions, signaling cascades and maintenance of immune homeostasis.[2] Because genes and proteins rarely work in isolation, genes that directly or functionally interact with known PID genes could also represent additional PID genes. We have recently developed a database of PID genes designated ‘Resource of Asian Primary Immunodeficiency Diseases (RAPID)’, which contains information pertaining to genes and proteins involved in PIDs along with other relevant information about protein–protein interactions, mouse knockout studies and microarray gene expression profiles in various cells and organs of the immune system. These significant features of PID genes, including their involvement in immune signaling pathways, were used as input binary features for the prediction of additional candidate PID genes using a support vector machine (SVM) learning approach.
SVM is a powerful machine learning technique widely used in computational biology for tasks such as microarray data analysis,[3-8] protein secondary structure prediction,[9] prediction of human signal peptide cleavage sites,[10] translational initiation site recognition in DNA,[11] protein fold recognition,[12,13] prediction of protein–protein interactions,[14] prediction of protein sub-cellular localization,[15-18] and peptide identification from mass-spectrometry derived data.[19] SVM is a learning algorithm that generates a classifier from a set of positively and negatively labeled training examples.[20] It learns the classifier by mapping the input training samples into a possibly high-dimensional feature space and seeking a hyperplane in this space that separates the two types of examples with the largest possible margin, i.e. the largest distance to the nearest points. If the training set is not linearly separable, SVM finds a hyperplane that optimizes a trade-off between good classification and large margin.[20] To learn a classifier separating PID from non-PID genes, we solved the above problem and obtained a linear classifier (Fig. 1). To assess how well the learned classifier generalizes, we report the leave-one-out (LOO) error for the training data set. In this approach, we used all the known PID genes that have been described in the literature as the positive data set. The gene list for the negative data set was selected from the Mouse Genome Informatics (MGI) database based on the criterion that mutations in mice do not result in either immune or hematopoietic system phenotypes. We trained the SVM with 69 features (Supplementary Table S1) for both PID genes (positive data set) and genes that were not reported to be associated with PIDs (negative data set). The trained SVM was then used to predict candidate PID genes by testing all human genes (except those used in the training data sets) as the test data set.

Figure 1

A schematic of SVM training strategy.
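
For reference, the trade-off between good classification and large margin described in the Introduction is usually written as the standard soft-margin linear SVM problem. The symbols below (weight vector w, offset b, slack variables ξ_i, penalty C, labels y_i in {-1, +1}) are textbook notation assumed here for illustration; they are not taken from the paper, which does not state the value of C or the exact formulation used.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,\qquad i=1,\dots,n
```

In this setting each x_i would be the 69-dimensional binary feature vector of a training gene, with y_i = +1 for known PID genes and y_i = -1 for genes in the negative data set.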

Materials and methods

Initial platform

RAPID, which is available as a web resource at http://rapid.rcai.riken.jp/,[21] was used as the source of information about PID genes. RAPID hosts information on sequences and on expression at the mRNA and protein levels of genes reported to be involved in PID patients. The main objective of this database was to provide detailed information pertaining to genes and proteins involved in PIDs along with other relevant information about protein–protein interactions, mouse knockout studies and microarray gene expression profiles in various organs and cells of the immune system.

Features used for training the data sets

PIDs are characterized by essential defects in the functions of the immune system, leading to increased susceptibility to infections. Although rare, these disorders cover a wide spectrum of defects, including antibody deficiencies, cellular immune deficiencies, combined immune deficiencies, phagocytic defects, and complement and other innate immunity defects. On the basis of these observations for all the known PID genes, we selected 69 features (Supplementary Table S1) that are important not only for the development, maintenance and normal functioning of the immune/hematopoietic systems but also for understanding the molecular pathophysiology of PID-causing genes. These features can be broadly classified as signaling pathways from the NetPath and KEGG[22-24] databases, microarray gene expression profiles from the RefDIC[25] database, sites of expression from HPRD[26] and Human Proteinpedia,[27] immune/hematopoietic phenotypes from MGI[28,29] and interaction with known PID genes from HPRD.

Data sets

To train the SVM, two types of data sets were generated: the positive data set consisted of all the known PID genes, whereas the negative data set contained genes for which no immune/hematopoietic system abnormalities were described in mouse knockouts, knockins or spontaneous mutations reported for the mouse orthologs in the MGI database.[30] On the basis of these criteria, 148 PID genes were in the positive data set and 3162 genes were in the negative data set. The test data set contained 36 677 genes encoded by the human genome. For every gene in the training and test data sets, each feature was assigned a binary score of ‘1’ or ‘0’ according to whether the gene was present in or absent from that feature. The trained SVM was used to classify genes as PID or non-PID in an unlabeled test data set consisting of all human genes (Fig. 2).

Figure 2

A schematic of the algorithm for prediction of candidate PID genes.
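
The binary scoring described under Data sets can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `encode_genes`, the placeholder gene names and the three feature names are all hypothetical, and the real feature set has 69 entries (Supplementary Table S1).

```python
from typing import Dict, List, Set


def encode_genes(
    genes: List[str],
    feature_members: Dict[str, Set[str]],
    feature_order: List[str],
) -> Dict[str, List[int]]:
    """Return one 0/1 vector per gene: 1 if the gene is listed under a feature, else 0."""
    return {
        gene: [1 if gene in feature_members.get(name, set()) else 0
               for name in feature_order]
        for gene in genes
    }


# Hypothetical toy annotations: three made-up features and placeholder genes.
feature_members = {
    "signaling_pathway_member": {"GENE_A"},
    "expressed_in_immune_cells": {"GENE_B"},
    "interacts_with_known_pid_gene": {"GENE_A", "GENE_B"},
}
feature_order = [
    "signaling_pathway_member",
    "expressed_in_immune_cells",
    "interacts_with_known_pid_gene",
]

vectors = encode_genes(["GENE_A", "GENE_B", "GENE_C"], feature_members, feature_order)
for gene, vector in vectors.items():
    print(gene, vector)
```

Fixing `feature_order` once and reusing it for training and test genes keeps every vector aligned to the same feature positions, which a linear classifier requires.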

SVM implementation

We used SVMlight (http://svmlight.joachims.org/), an implementation of support vector machines in C, together with customized functions written in MATLAB (http://www.mathtools.net/MATLAB/) for the calculation of a confidence score for each predicted candidate PID gene. The absolute score, also known as the confidence score, of a gene with feature vector $x$ can be defined as $\lvert w^{\top}x + b \rvert$, where $w^{\top}x + b = 0$ represents the separating hyperplane calculated by the SVM. The score indicates how far that particular gene lies from the hyperplane on its positive side. In other words, the higher the score, the more likely it is that a particular gene is a candidate PID gene. Using this approach, 1442 candidate PID genes were predicted, all of which fall on the positive side of the hyperplane.
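
The scoring step can be illustrated with a short sketch. It assumes that the confidence score is the absolute decision value of a linear classifier and that candidates are the genes with a positive decision value; the weights, offset and gene vectors below are made-up numbers, and the helper names `decision_value` and `rank_candidates` are hypothetical rather than part of SVMlight or the authors' MATLAB functions.

```python
from typing import Dict, List, Sequence, Tuple


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Signed value w.x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def rank_candidates(
    w: Sequence[float], b: float, genes: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Keep genes on the positive side and sort them by absolute score, highest first."""
    scored = [(name, decision_value(w, x, b)) for name, x in genes.items()]
    positives = [(name, abs(value)) for name, value in scored if value > 0]
    return sorted(positives, key=lambda item: item[1], reverse=True)


# Made-up linear classifier over three binary features (illustration only).
w = [1.0, -0.5, 2.0]
b = -1.0
genes = {
    "GENE_A": [1, 0, 1],
    "GENE_B": [0, 1, 1],
    "GENE_C": [0, 0, 0],
}

ranked = rank_candidates(w, b, genes)
print(ranked)
```

Sorting by score gives a simple way to prioritize a long candidate list, since genes far from the hyperplane are the ones the classifier is most confident about.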

LOO error

LOO error measurement involves removing one gene from the training set, training the SVM on the remaining genes and then predicting the class label of the gene that was left out. If the gene is classified correctly, the error is recorded as zero; otherwise it is recorded as one. This process is repeated until every gene has been left out exactly once, and the LOO error of the data set is the average of the individual errors.
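
The LOO procedure itself is independent of the classifier, which the sketch below makes explicit by taking the training routine as a parameter. To stay self-contained it plugs in a nearest-centroid rule as a stand-in; that rule is an assumption made for illustration and is not an SVM, and the toy data are invented.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Predictor = Callable[[Vector], int]
Trainer = Callable[[List[Vector], List[int]], Predictor]


def leave_one_out_error(X: List[Vector], y: List[int], train: Trainer) -> float:
    """Average 0/1 error when each example is held out once and predicted from the rest."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = train(X_train, y_train)
        errors += int(predict(X[i]) != y[i])
    return errors / len(X)


def train_nearest_centroid(X: List[Vector], y: List[int]) -> Predictor:
    """Stand-in trainer (NOT an SVM): assign the label of the closer class centroid.

    Assumes both classes (+1 and -1) are present in the training data.
    """
    def centroid(label: int) -> List[float]:
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos, neg = centroid(1), centroid(-1)

    def predict(x: Vector) -> int:
        d_pos = sum((a - c) ** 2 for a, c in zip(x, pos))
        d_neg = sum((a - c) ** 2 for a, c in zip(x, neg))
        return 1 if d_pos < d_neg else -1

    return predict


# Invented toy data: the first feature separates the two classes.
X = [[1, 1], [1, 0], [0, 0], [0, 1]]
y = [1, 1, -1, -1]
loo_error = leave_one_out_error(X, y, train_nearest_centroid)
print(loo_error)
```

Replacing `train_nearest_centroid` with a function that fits a linear SVM and returns its prediction rule would give the measurement described in this section.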

Results and discussion

Over 1500 Mendelian disorders whose molecular basis is unknown are catalogued in the Online Mendelian Inheritance in Man (OMIM) database.[31] Most disease-gene identification efforts involve either linkage analysis or association studies.[32,33] Recently, a number of in silico approaches to identifying candidate disease genes have been developed that use information reported from various studies, such as functional annotation, gene expression profiles, annotated sequence features, protein–protein interactions and pathway information.[34-39] Several machine learning approaches have also been employed to identify important genes for disease classification, and the SVM approach is generally preferred owing to its superior performance.[40] SVM is a powerful tool for dealing with high-dimensional, low-sample-size data sets, and it also performs well in various analyses including text categorization, evaluation of microarray expression data and inference of functional annotation from protein sequence and structure data.[3,4,41,42] In this study, we trained an SVM with 69 features on both positive (all known PID genes) and negative (genes for which MGI reports no immune/hematopoietic system phenotype upon mutation) gene data sets. Because the number of genes in the positive data set is small, the LOO error was calculated to show the generalization of the algorithm; the LOO error is explained in detail in the Materials and methods section. For this, we used a data set containing the 148 PID genes from the positive data set along with 148 genes randomly selected from the negative data set. This sampling was repeated to produce 60 such data sets, and the LOO error was calculated for each. The average LOO error over the 60 data sets was ∼8%. The LOO error obtained by leaving out only the PID (positive) genes one by one (with the same training setting of 296 data points) was ∼15%.
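
The repeated balanced sampling described above (all positives plus an equally sized random draw of negatives, repeated 60 times) can be sketched as below. The function name, the fixed random seed and the `placeholder_error` function are assumptions for illustration; in practice `error_fn` would be a LOO error routine, and the paper does not say how the random draws were generated.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

Vector = Sequence[float]


def average_error_over_balanced_subsets(
    positives: List[Vector],
    negatives: List[Vector],
    error_fn: Callable[[List[Vector], List[int]], float],
    n_repeats: int = 60,
    seed: int = 0,
) -> float:
    """Average error_fn over data sets made of all positives plus as many random negatives."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    errors = []
    for _ in range(n_repeats):
        sampled_negatives = rng.sample(negatives, len(positives))
        X = list(positives) + sampled_negatives
        y = [1] * len(positives) + [-1] * len(sampled_negatives)
        errors.append(error_fn(X, y))
    return mean(errors)


def placeholder_error(X: List[Vector], y: List[int]) -> float:
    """Placeholder standing in for a real LOO error computation; returns a constant."""
    assert len(X) == len(y)
    return 0.25


# Dummy vectors with the data set sizes quoted in the text (148 and 3162).
positives = [[1, 1]] * 148
negatives = [[0, 0]] * 3162
average_error = average_error_over_balanced_subsets(positives, negatives, placeholder_error)
print(average_error)
```

Balancing each subset keeps the large negative class from dominating the error estimate, and averaging over many draws reduces the dependence on any single random selection of negatives.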

Sensitivity and specificity

The sensitivity and specificity obtained on the data sets were 0.85 and 0.98, respectively. On the basis of these results, we conclude that the proportion of non-PID genes falsely predicted to be PID genes by the trained classifier is ∼2%. We believe that the limited availability of comprehensive and accurate biological data restricts the prediction accuracy and performance of this algorithm. As more data accumulate about the human genome and proteome, we expect the performance of this algorithm to improve further. The complete list of predicted candidate genes is provided in Supplementary Table S2 and is also available at the RAPID website http://rapid.rcai.riken.jp/. All 69 features of the predicted candidate PID genes can also be downloaded from the RAPID website.
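
As a reminder of how these quantities relate, the sketch below computes sensitivity and specificity from true and predicted labels and derives the false-positive rate as one minus the specificity, which is how a specificity of 0.98 corresponds to roughly 2% false positives. The label vectors are invented toy data, not results from the paper.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for labels coded as +1 (PID) and -1 (non-PID)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)    # PID genes correctly predicted
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)   # PID genes missed
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)  # non-PID genes correctly rejected
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)   # non-PID genes falsely predicted
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy labels: four positives and four negatives.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, 1]

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
false_positive_rate = 1 - specificity
print(sensitivity, specificity, false_positive_rate)
```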

Evaluation studies

We were able to evaluate our predictions only in a limited fashion, because few studies have been published describing novel PID genes that were not included in our original list of PID genes. These experimental studies have confirmed six of the genes in our predicted list as true PID genes: myeloid differentiation factor-88 (MYD88), catalytic subunit of DNA-dependent serine/threonine protein kinase (PRKDC), glucose-6-phosphatase, catalytic subunit 3 (G6PC3),[43-45] IL2-inducible T-cell kinase (ITK), coronin, actin binding protein 1A (CORO1A) and interleukin 1 receptor antagonist (IL1RN).[46-49] MyD88 is a key downstream adaptor protein in the IL1 receptor complex and Toll-like receptor signaling pathways involved in inflammatory response and host defense. In addition, MyD88 is involved in tumorigenesis in models of hepatocarcinoma and familial associated polyposis, in negative regulation of TLR3 signaling and in PKC epsilon activation.[50] Patients with MyD88 deficiency are reported to be susceptible to pyogenic bacterial infections, including invasive pneumococcal disease.[45] A defect in PRKDC has been reported for the first time in a radiosensitive T-B-SCID patient, in whom it results in inhibition of Artemis activation and non-homologous end-joining.[44] Mutations in the G6PC3 gene have been reported in patients with severe congenital neutropenia syndrome, who were also shown to be susceptible to increased apoptosis that leads to disturbances in cardiac or urogenital development.[43] A novel PID, IL-2-inducible T-cell kinase (ITK) deficiency, has been observed in patients with fatal immune dysregulation associated with EBV infection, in whom a homozygous mutation was identified in the SH2 domain of the ITK gene that resulted in protein destabilization and absence of NKT cells.[47] A patient with T cell-deficient, B cell-sufficient and NK cell-sufficient severe combined immunodeficiency has been identified with a mutation in the CORO1A gene along with reduced T-cell function, a phenotype similar to that demonstrated earlier in Coro1a knock-out mice.[49] Deficiency of the IL1-receptor antagonist, an autosomal recessive autoinflammatory disease, has been reported for the first time in children presenting with clinical phenotypes of multifocal osteomyelitis, periostitis, pustulosis, thrombosis and respiratory insufficiency due to homozygous deletion of the IL1RN gene.[46,48] Further functional analysis of these mutants confirmed diminished or absent mRNA and protein expression, leading to cytokine abnormalities.

There are two recent independent reports[51,52] on the identification and prioritization of candidate disease genes, both in general and specifically for primary immunodeficiencies, by integrating functional annotations from Gene Ontology with protein interaction network data sets compiled from BIND,[53] BioGRID[54] and HPRD.[26] In the latter work, 24 candidate genes likely to be involved in PID were identified using these parameters; over 80% of them are already listed as candidates in our SVM analysis, paving the way for successful implementation of this approach in the future. We have also summarized reports of genome-wide association studies and other related studies for newly identified candidate PID genes and the associated immunological disorders (Table 1). Because the candidate PID gene list is still large, integrating data from high-throughput studies in this way would allow further prioritization of genes for confirmation in patients with PID in whom the exact gene has not yet been identified. We hope that such integrated approaches will help PID physicians and researchers gain insights into the pathophysiology of these diseases at a faster pace, which could be translated into improved diagnosis and/or treatment of PIDs.

Table 1

A list of predicted PID genes whose association with immunological disorders has been reported recently

Gene symbol	Molecule class	Immunological disorder(s)	Reference(s)
ITGAM	Cell surface receptor	Systemic lupus erythematosus	Harley et al., Nat. Genet., 2008 (PubMed ID: 18204446);[55] Nath et al., Nat. Genet., 2008 (PubMed ID: 18204448)[56]
BANK1	Chaperone	Systemic lupus erythematosus	Kozyrev et al., Nat. Genet., 2008 (PubMed ID: 18204447)[57]
MST1	Growth factor	Inflammatory bowel disease	Goyette et al., Mucosal Immunol., 2008 (PubMed ID: 19079170)[58]
CYLD	Ubiquitin–proteasome system protein	Crohn's disease	Johnson and O'Donnell et al., BMC Med. Genet., 2009 (PubMed ID: 19161620)[59]
PTPN2	Tyrosine phosphatase	Crohn's disease	Wellcome Trust Case Control Consortium, 2007;[60] Todd et al., Nat. Genet., 2007 (PubMed ID: 17554260)[61]
PTPN22	Tyrosine phosphatase	Systemic lupus erythematosus	Wellcome Trust Case Control Consortium, 2007;[60] Harley et al., Nat. Genet., 2008 (PubMed ID: 18204446)[55]
TNFAIP3	Transcription regulatory protein	Rheumatoid arthritis	Plenge et al., Nat. Genet., 2007 (PubMed ID: 17982456)[62]
STAT4	Transcription factor	Systemic lupus erythematosus	Remmers et al., N Engl J Med., 2007 (PubMed ID: 17804842)[63]
TNFSF4	Ligand	Systemic lupus erythematosus	Graham et al., Nat. Genet., 2008 (PubMed ID: 18059267)[64]
CTLA4	Adhesion molecule	Autoimmune thyroid diseases	Ueda et al., Nature, 2003 (PubMed ID: 12724780);[65] Ikegami et al., J. Clin. Endocrinol. Metab., 2006 (PubMed ID: 16352685)[66]

Availability

The list of predicted PID genes is available as Supplementary Table S2 and at the RAPID website http://rapid.rcai.riken.jp/.

Supplementary Data

Supplementary data are available at www.dnaresearch.oxfordjournals.org.

Funding

We thank the Department of Biotechnology of the Government of India for research support to the Institute of Bioinformatics, Bangalore.

66 in total

1. KEGG: Kyoto Encyclopedia of Genes and Genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Support vector machine classification and validation of cancer tissue samples using microarray expression data.

Authors: T S Furey; N Cristianini; N Duffy; D W Bednarski; M Schummer; D Haussler
Journal: Bioinformatics Date: 2000-10 Impact factor: 6.937

3. Multi-class protein fold recognition using support vector machines and neural networks.

Authors: C H Ding; I Dubchak
Journal: Bioinformatics Date: 2001-04 Impact factor: 6.937

4. Adaptive encoding neural networks for the recognition of human signal peptide cleavage sites.

Authors: B Jagla; J Schuchhardt
Journal: Bioinformatics Date: 2000-03 Impact factor: 6.937

5. Knowledge-based analysis of microarray gene expression data by using support vector machines.

Authors: M P Brown; W N Grundy; D Lin; N Cristianini; C W Sugnet; T S Furey; M Ares; D Haussler
Journal: Proc Natl Acad Sci U S A Date: 2000-01-04 Impact factor: 11.205

6. Engineering support vector machine kernels that recognize translation initiation sites.

Authors: A Zien; G Rätsch; S Mika; B Schölkopf; T Lengauer; K R Müller
Journal: Bioinformatics Date: 2000-09 Impact factor: 6.937

Review 7. Signalling adaptors used by Toll-like receptors: an update.

Authors: Elaine F Kenny; Luke A J O'Neill
Journal: Cytokine Date: 2008-08-15 Impact factor: 3.861

8. The actin regulator coronin 1A is mutant in a thymic egress-deficient mouse strain and in a patient with severe combined immunodeficiency.

Authors: Lawrence R Shiow; David W Roadcap; Kenneth Paris; Susan R Watson; Irina L Grigorova; Tonya Lebet; Jinping An; Ying Xu; Craig N Jenne; Niko Föger; Ricardo U Sorensen; Christopher C Goodnow; James E Bear; Jennifer M Puck; Jason G Cyster
Journal: Nat Immunol Date: 2008-10-05 Impact factor: 25.606

9. Human Proteinpedia: a unified discovery resource for proteomics research.

Authors: Kumaran Kandasamy; Shivakumar Keerthikumar; Renu Goel; Suresh Mathivanan; Nandini Patankar; Beema Shafreen; Santosh Renuse; Harsh Pawar; Y L Ramachandra; Pradip Kumar Acharya; Prathibha Ranganathan; Raghothama Chaerkady; T S Keshava Prasad; Akhilesh Pandey
Journal: Nucleic Acids Res Date: 2008-10-23 Impact factor: 16.971

10. RAPID: Resource of Asian Primary Immunodeficiency Diseases.

Authors: Shivakumar Keerthikumar; Rajesh Raju; Kumaran Kandasamy; Atsushi Hijikata; Subhashri Ramabadran; Lavanya Balakrishnan; Mukhtar Ahmed; Sandhya Rani; Lakshmi Dhevi N Selvan; Devi S Somanathan; Somak Ray; Mitali Bhattacharjee; Sashikanth Gollapudi; Y L Ramachandra; Sahely Bhadra; Chiranjib Bhattacharyya; Kohsuke Imai; Shigeaki Nonoyama; Hirokazu Kanegane; Toshio Miyawaki; Akhilesh Pandey; Osamu Ohara; Sujatha Mohan
Journal: Nucleic Acids Res Date: 2008-10-08 Impact factor: 16.971
