Inferring Retinal Degeneration-Related Genes Based on Xgboost.

Yujie Xia¹, Xiaojie Li¹, Xinlin Chen², Changjin Lu¹, Xiaoyi Yu¹.

Abstract

Retinal degeneration (RD) is an inherited retinal disease characterized by degeneration of rod and cone photoreceptor cells and of retinal pigment epithelial cells. The age of onset and the progression of RD are related to both genes and environment. Research to date has identified five genes closely related to RD (RHO, PDE6B, MERTK, RLBP1 and RPGR), and researchers have developed corresponding gene therapies. Gene therapy uses vectors to transfer therapeutic genes, genetically modify target cells, and correct or replace disease-causing RD genes. Identifying the pathogenic genes of RD will therefore play an important role in developing treatments for the disease. However, traditional methods of identifying RD-related genes are mostly based on animal experiments, and only a small number of RD-related genes have been identified so far. With the growth of biological data, this article proposes using Xgboost to identify RD-related genes. Xgboost adds a regularization term to control the complexity of the model, which avoids overfitting to some extent and makes it suitable for finding true RD-related genes among a large and complex set of genes. To verify the ability of Xgboost to identify RD-related genes, we performed 10-fold cross-validation and compared it with three traditional methods: Random Forest, Back Propagation network and Support Vector Machine. The accuracy of Xgboost is 99.13%, and its AUC is much higher than that of the other three methods. This article can therefore provide technical support for the efficient identification of RD-related genes and help researchers gain a deeper understanding of the genetic characteristics of RD.


Keywords: Xgboost; amino acids; machine learning; pathogenic gene; retinitis degeneration

Year: 2022 PMID: 35223997 PMCID: PMC8880610 DOI: 10.3389/fmolb.2022.843150

Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X

Introduction

Hereditary eye diseases include syndromic and non-syndromic forms of retinal degeneration, hereditary glaucoma, corneal dystrophy and eye movement disorders. Retinal degeneration (RD) is a group of monogenic hereditary blinding diseases caused by loss of function of photoreceptor cells or the retinal pigment epithelium (RPE). The worldwide incidence of RDs is 1/3,000–1/2,000 (Berger et al., 2010). According to whether they are accompanied by systemic symptoms, RDs are divided into simple and systemic forms (Wennström et al., 2003). The former mainly include retinitis pigmentosa (RP), cone-rod dystrophy (CORD) and Leber congenital amaurosis (LCA); the latter mainly include Usher syndrome and Bardet-Biedl syndrome (Muller et al., 2010). To date, more than 300 pathogenic genes have been reported for RD, which indicates a high degree of clinical and genetic heterogeneity and makes the diagnosis of this type of disease extremely difficult (Benayoun et al., 2009). Research on the pathogenic genes of RDs and the development and application of related molecular diagnostic techniques are prerequisites for the diagnosis, prevention and treatment of RDs. Both monogenic Mendelian and complex hereditary eye diseases require genetic testing to determine the underlying cause. Nearly 1,200 genes related to eye diseases are listed in the Online Mendelian Inheritance in Man database (OMIM) (http://www.omim.org) (Amberger et al., 2015). RD shows marked clinical phenotypic and genetic heterogeneity, and it is the main type of both genetic and rare, difficult-to-treat ophthalmic diseases. At present, the vast majority of RD remains incurable, and research on its diagnosis and treatment has long been a hot spot. Diagnosing RD at the genetic level helps in understanding the disease mechanism in depth (Boycott et al., 2017).
Identifying which gene mutation causes the disease allows a more accurate understanding of its occurrence, development and outcome. This is especially important for RD, given its marked heterogeneity. The genetic heterogeneity of RD calls for a new disease naming and definition system that includes at least two elements: the disease-causing gene and the name of the associated disease. For example, “EYS-related retinitis pigmentosa” is more accurate than “retinitis pigmentosa” alone and makes it easier to explain the condition to the patient. Because retinal degeneration has many pathogenic genes, and the mutated genes and loci differ between families, it is very difficult to screen candidate pathogenic genes selectively. At present, research on the molecular genetics of hereditary eye disease consists mainly of single-gene studies in families, which has led to controversy and gaps in the genetic study of RD genes (Fan et al., 2006). A comprehensive and systematic analysis of known gene variation data may help to address these problems. Some genes and mutations associated with retinal degeneration remain controversial: certain genes were first reported as disease-related, but no further mutations were reported afterwards. A large number of retinal degeneration mutations are concentrated in a few genes, whereas mutations in many other genes explain the disease in only a very small number of patients. It is possible that only very few patients carry mutations in such a gene, but it cannot be ruled out that earlier studies found changes in a single gene and mistakenly took them to be the cause of the disease. Disputed issues reported in single-gene studies, such as mutation penetrance and related risk factors, add further confusion for researchers.
In addition, because no public database previously contained large amounts of variation data with extensive control validation, some high-frequency SNPs found in patients were regarded as pathogenic mutations. These variants are listed as pathogenic in the Human Gene Mutation Database (HGMD) (Stenson et al., 2020), which misleads follow-up molecular genetics research. Recent variant analyses have questioned and corrected the pathogenicity of individual RetNet genes and mutations (Pozo et al., 2015), such as the previously reported retinitis pigmentosa genes FSCN2 (MIM: 607643) and OR2W3 and the macular degeneration gene HMCN1 (MIM: 608548) (Fisher et al., 2007; Zhang et al., 2007; Sharon et al., 2016). Subsequent reports cast doubt on these genes, but because clinical phenotype analyses of patients carrying the same mutations are lacking, their role as pathogenic genes still cannot be completely excluded. Moreover, single-gene research cannot provide a comprehensive and systematic picture of the mutation spectrum of hereditary retinal degeneration in a given ethnic group. Different ethnic groups have different mutation spectra: mutations that are common in European and American populations may be uncommon in Asian populations, and vice versa. For example, CNGA3 (MIM: 600053), a pathogenic gene of cone dystrophy, is the most frequently mutated gene in Chinese patients (Huang et al., 2016), whereas the most commonly mutated recessive gene in reports from other countries is ABCA4 (MIM: 601691) (Maugeri et al., 2000), with CNGA3 explaining only a small proportion of cases (Wissinger et al., 2001). Mutation spectra differ even among Asian populations.
The most frequently mutated gene in Japanese patients with retinitis pigmentosa is EYS (MIM: 612424) (Oishi et al., 2014; Arai et al., 2015), whereas mutations in this gene are very rare in Chinese patients (Xu et al., 2014; Chen et al., 2015). It is therefore important to conduct a comprehensive, multi-gene systematic analysis of all retinal degeneration genes in order to understand the clinical characteristics and mutation frequency spectrum, and to discover the main pathogenic genes, of patients with retinal degeneration in a given population. Such an analysis also provides important evidence for the clinical diagnosis, genetic counseling and prevention of hereditary eye diseases. Although researchers have made great progress in identifying RD-related genes, analyzing large and complex amino acid sequences requires an algorithm with high computational efficiency and high recognition accuracy. Multi-omics data allow different data types from a large number of samples to be combined to explore RD-related genes comprehensively (Zhao et al., 2021a). Integrating multiple omics data to discover biological knowledge on a large scale has become a common approach. Many methods have been developed for different research problems, such as the identification of disease-related genes (Zhao et al., 2020; Antonarakis, 2021), disease-related proteins (Katako et al., 2018; Zhao et al., 2021b), disease-related metabolites (Lei and Tie, 2019; Zhao et al., 2021c) and disease-related drug targets (Agamah et al., 2020; Zhao et al., 2021d). Chen and Guestrin (2016) proposed Extreme Gradient Boosting (Xgboost), an improved boosting algorithm whose multi-threaded parallelism and regularization term improve accuracy while reducing running time. Therefore, Xgboost is a suitable algorithm for identifying RD-related genes.

Methods and Materials

Data Description

We searched DisGeNET (Piñero et al., 2020) for RD-related genes using the keyword “Retinal Degeneration.” The database lists 207 genes known to be related to RD. We downloaded the sequences of the proteins encoded by these genes from UniProt (Consortium, 2019). We also obtained 5,000 genes potentially associated with RD from GeneCards (Safran et al., 2010). Our aim is to identify RD-related genes among these 5,000 genes.

Feature Extraction

Compositional Analysis

Since the composition of proteins encoded by RD-related genes differs considerably from that of proteins encoded by unrelated genes, the frequencies of the 20 amino acids in these proteins can also differ. We calculated the average amino acid composition of the proteins encoded by the 207 RD-related genes. These proteins are richest in “L,” and the proportions of “G,” “A,” “V,” “E” and “S” are also high.
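The compositional feature can be sketched in a few lines. This is a minimal illustration that assumes each protein is available as a plain one-letter amino acid string; the function names, the alphabetical ordering of the 20 residues and the toy sequences are assumptions made for the example, not details from the article.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 amino acid frequencies of one protein sequence."""
    counts = Counter(sequence.upper())
    # Only standard residues are counted; other letters (e.g. X) are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def average_composition(sequences):
    """Average the per-protein composition vectors over a set of proteins."""
    vectors = [amino_acid_composition(s) for s in sequences]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up toy sequences, not real proteins.
toy_sequences = ["MLLGAV", "LLSE"]
toy_average = average_composition(toy_sequences)
```

Each protein yields a 20-dimensional vector that sums to 1, and averaging these vectors over a gene set gives the kind of profile described above.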

Dissociation Constant

Protein structure is closely related to the chemical properties of amino acids, especially hydrophobicity and hydrophilicity (Aftabuddin and Kundu, 2007). Aftabuddin and Kundu divided the 20 amino acids into six groups based on ranges of hydropathy. Whether a gene is related to RD depends largely on the function of the protein it encodes. Therefore, the hydrophilicity and hydrophobicity of the amino acids in a protein are key to judging whether the gene is related to RD. Table 1 shows the six groups of the 20 amino acids.

TABLE 1

The six groups of the 20 amino acids.

Groups	Amino acids
Strongly hydrophilic	R,D,E,N,Q,K,H
Strongly hydrophobic	L,I,V,A,M,F
Weakly hydrophilic or Weakly hydrophobic	S,T,Y,W
Proline	P
Glycine	G
Cysteine	C

Thus, the sequence of every protein can be converted into a six-dimensional vector in which each dimension is the average composition of one of these six groups.
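The six-group encoding of Table 1 can be sketched as follows. The group membership is copied from Table 1; the dictionary keys, the function name and the choice to ignore non-standard residues are assumptions made for illustration.

```python
# Group membership taken from Table 1; the key names are illustrative.
HYDROPATHY_GROUPS = {
    "strongly_hydrophilic": "RDENQKH",
    "strongly_hydrophobic": "LIVAMF",
    "weakly_hydrophilic_or_hydrophobic": "STYW",
    "proline": "P",
    "glycine": "G",
    "cysteine": "C",
}

# Map each residue to the index of its group (dict order is insertion order).
AA_TO_GROUP = {
    aa: index
    for index, members in enumerate(HYDROPATHY_GROUPS.values())
    for aa in members
}


def group_composition(sequence):
    """Return the fraction of residues that fall into each of the six groups."""
    counts = [0] * len(HYDROPATHY_GROUPS)
    for aa in sequence.upper():
        index = AA_TO_GROUP.get(aa)
        if index is not None:  # skip non-standard residues
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

For example, `group_composition("LLLL")` puts all of its weight on the strongly hydrophobic group.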

PEST Regions

In 1986, Rechsteiner and Rogers (Rechsteiner et al., 1996) hypothesized that regions rich in the amino acids “P,” “E,” “S” and “T” can serve as proteolytic signals. A growing number of reports have since verified that sequences containing PEST regions cause rapid degradation of proteins. The epestfind program (http://emboss.bioinformatics.nl/cgi-bin/emboss/epestfind) can be used to identify all poor and potential PEST regions in protein sequences (Espreafico et al., 1992). We included only potential PEST regions as a feature for identifying RD-related genes, counting the number of potential PEST regions in each sequence. In total, we extracted three kinds of features (Figure 1).

FIGURE 1

Flow chart of Feature extraction.

We therefore used these 27 dimensions (20 amino acid frequencies, six hydropathy-group compositions and one PEST-region count) to identify RD-related genes.

Methods and Framework

Extreme Gradient Boosting

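Extreme Gradient Boosting (Xgboost) is an improvement on the traditional Gradient Boosting Decision Tree (GBDT). Xgboost uses both the first- and second-order derivatives of the loss function by applying a second-order Taylor expansion, whereas the traditional GBDT algorithm uses only first-derivative information during optimization. Xgboost also runs significantly faster than GBDT, for two reasons. On the one hand, Xgboost supports automatic multi-core parallel computing through OpenMP. On the other hand, Xgboost introduces a data format, DMatrix, which is preprocessed once before training. This improves the efficiency of each training iteration and reduces the model training time. In addition, sparse matrices can be passed to Xgboost directly.

First, we obtain the training set $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$ and set the number of leaf nodes to $J$. Then we initialize the model:

$$f_0(x)=\arg\min_{c}\sum_{i=1}^{N}L(y_i,c) \tag{1}$$

For each iteration $m=1,\ldots,M$, the negative gradient of the loss at every training sample is obtained by:

$$r_{mi}=-\left[\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f=f_{m-1}} \tag{2}$$

Then a CART regression tree is fitted to the pairs $(x_i,r_{mi})$, where $R_{mj}$ is the $j$th region of the feature space ($j=1,\ldots,J$). Each leaf node's regression value is obtained by:

$$c_{mj}=\arg\min_{c}\sum_{x_i\in R_{mj}}L(y_i,f_{m-1}(x_i)+c) \tag{3}$$

Finally, the final model is:

$$f_M(x)=f_0(x)+\sum_{m=1}^{M}\sum_{j=1}^{J}c_{mj}\,I(x\in R_{mj}) \tag{4}$$

The objective function consists of a loss function and a regularization term and measures the quality of the model:

$$Obj=\sum_{i=1}^{N}L(y_i,\hat{y}_i)+\sum_{t=1}^{M}\Omega(f_t) \tag{5}$$

Here $L$ represents the loss function and $\Omega$ the regularization term. Algorithms such as artificial neural networks use only a loss function to evaluate the quality of training, which easily causes overfitting. Methods such as support vector machines introduce regularization parameters, which can effectively reduce overfitting. However, introducing regularization parameters increases the complexity of the model. CART is the basic unit of Xgboost, and each tree is built on top of the trees constructed before it. Therefore, the objective function in formula (5) can also be written for the $t$th tree as:

$$Obj^{(t)}=\sum_{i=1}^{N}L\left(y_i,\hat{y}_i^{(t-1)}+f_t(x_i)\right)+\Omega(f_t)+\text{constant} \tag{6}$$

Finally, we obtain the first- and second-order derivatives of the loss function, $g_i=\partial_{\hat{y}_i^{(t-1)}}L(y_i,\hat{y}_i^{(t-1)})$ and $h_i=\partial^{2}_{\hat{y}_i^{(t-1)}}L(y_i,\hat{y}_i^{(t-1)})$, so that the second-order Taylor expansion gives:

$$Obj^{(t)}\approx\sum_{i=1}^{N}\left[L(y_i,\hat{y}_i^{(t-1)})+g_i f_t(x_i)+\frac{1}{2}h_i f_t^{2}(x_i)\right]+\Omega(f_t)+\text{constant} \tag{7}$$

The next part is to obtain the regularization term. First, we define the decision tree as:

$$f_t(x)=w_{q(x)},\quad w\in\mathbb{R}^{J},\quad q:\mathbb{R}^{d}\to\{1,\ldots,J\} \tag{8}$$

where $w$ represents the leaf nodes' scores and $q(x)$ determines the position (leaf) of the input sample in the tree. The regularization term can be represented as:

$$\Omega(f_t)=\gamma J+\frac{1}{2}\lambda\sum_{j=1}^{J}w_j^{2} \tag{9}$$

We need to set $\gamma$ and $\lambda$ to balance the complexity of the model. Dropping the constant terms and writing $I_j=\{i: q(x_i)=j\}$ for the samples in leaf $j$, the $t$th tree's objective function is:

$$Obj^{(t)}\approx\sum_{i=1}^{N}\left[g_i w_{q(x_i)}+\frac{1}{2}h_i w_{q(x_i)}^{2}\right]+\gamma J+\frac{1}{2}\lambda\sum_{j=1}^{J}w_j^{2}=\sum_{j=1}^{J}\left[\Big(\sum_{i\in I_j}g_i\Big)w_j+\frac{1}{2}\Big(\sum_{i\in I_j}h_i+\lambda\Big)w_j^{2}\right]+\gamma J \tag{10}$$

Defining $G_j=\sum_{i\in I_j}g_i$ and $H_j=\sum_{i\in I_j}h_i$, we get:

$$Obj^{(t)}\approx\sum_{j=1}^{J}\left[G_j w_j+\frac{1}{2}(H_j+\lambda)w_j^{2}\right]+\gamma J \tag{11}$$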
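The per-leaf objective at the end of this derivation is quadratic in the leaf score, so it can be minimized in closed form. The sketch below illustrates this under stated assumptions: G and H denote the sums of first and second derivatives of the loss over the samples in one leaf, lam is the L2 regularization weight (symbols follow the usual Xgboost notation), and the logistic loss is chosen only as an example because the article does not name the loss it used. All function names are illustrative.

```python
import math


def logistic_grad_hess(y_true, raw_score):
    """First and second derivative of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


def leaf_objective(grads, hessians, lam, w):
    """Per-leaf objective G*w + 0.5*(H + lam)*w^2 (the gamma term is constant in w)."""
    G, H = sum(grads), sum(hessians)
    return G * w + 0.5 * (H + lam) * w * w


def optimal_leaf_weight(grads, hessians, lam):
    """Minimizer of the per-leaf quadratic: w* = -G / (H + lam)."""
    G, H = sum(grads), sum(hessians)
    return -G / (H + lam)


# Toy leaf with three samples, all currently scored 0 (made-up data).
labels = [1, 1, 0]
pairs = [logistic_grad_hess(y, 0.0) for y in labels]
grads = [g for g, _ in pairs]
hessians = [h for _, h in pairs]
best_w = optimal_leaf_weight(grads, hessians, lam=1.0)
```

Setting the derivative of the quadratic to zero gives the leaf score directly, which is why no line search over leaf values is needed once G and H are known.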

Results

Experiment Description

We obtained 207 true RD-related genes and randomly selected 5,000 genes as negative samples. To verify the effectiveness of Xgboost in identifying RD-related genes, we performed ten-fold cross-validation. We randomly divided these 5,207 sequences into ten groups. In each experiment, we chose one group of 520 sequences as the test set and the remaining 4,687 sequences as the training set, giving ten experiments in total. Every sequence was therefore used for both training and testing. The parameters of Xgboost are listed in Table 2.
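The splitting scheme can be sketched with the standard library alone. The function name, the fixed seed and the round-robin assignment are assumptions made for illustration; because 5,207 is not divisible by ten, this particular sketch produces test folds of 520 or 521 sequences rather than exactly 520.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# 207 positive and 5,000 negative sequences, as in the experiment description.
splits = list(k_fold_indices(5207, k=10))
```

Every index appears in exactly one test fold and in the training set of the other nine experiments.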

TABLE 2

The parameters of the Xgboost.

Setting items	The value set
Booster	gbtree
Silent	0
Learning rate	0.3
Maximum depth of a tree	6
Minimum sum of instance weight	1
Subsample ratio	1
Experimental parameter	1

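Table 2 can be written as a parameter dictionary. The key names below are those used by the open-source xgboost Python package; mapping the table rows onto these keys is an assumption, since the article does not give the exact parameter names. The “Silent” row is omitted because recent xgboost releases replace that flag with `verbosity`, and the “Experimental parameter” row is omitted because its meaning is not stated.

```python
# Assumed mapping of Table 2 onto xgboost parameter names.
xgb_params = {
    "booster": "gbtree",      # Booster
    "eta": 0.3,               # Learning rate
    "max_depth": 6,           # Maximum depth of a tree
    "min_child_weight": 1,    # Minimum sum of instance weight
    "subsample": 1,           # Subsample ratio
}
```

A dictionary of this form is what `xgboost.train` accepts as its first argument, together with a `DMatrix` holding the features and labels. The training objective is not listed in Table 2 and would have to be chosen separately.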

Evaluation Criteria

We use four metrics to evaluate the performance of Xgboost in identifying RD-related genes. The results of the ten experiments are given in Table 3; a total of 5,207 sequences were tested. As shown in Table 3, we calculated Accuracy = 99.13%, Precision = 99.04%, Recall = 99.23% and Specificity = 99.04%.

TABLE 3

The results of the ten experiments.

True label	Predicted 1	Predicted 0	Total
1	205 (TP)	2 (FN)	207
0	20 (FP)	4,980 (TN)	5,000
Total	225	4,982	5,207

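The four metrics follow from the confusion-matrix counts by their standard definitions. The sketch below applies those definitions to the pooled counts printed in Table 3; the function name is illustrative. The values obtained from pooled counts need not coincide with the percentages quoted in the text, which may have been aggregated differently (for example, averaged per fold); the article does not say how they were computed.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard accuracy, precision, recall and specificity from raw counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts as printed in Table 3.
table3_metrics = confusion_metrics(tp=205, fn=2, fp=20, tn=4980)
for name, value in table3_metrics.items():
    print(f"{name}: {value:.2%}")
```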

Experiments Result

In this study, randomly selected genes are labeled 0 and RD-related genes are labeled 1. Figure 2 shows the accuracy of each of the ten experiments. Even the experiment with the lowest accuracy exceeds 98%.

FIGURE 2

The accuracy of ten experiments.

To verify the superiority of Xgboost, we also ran ten-fold cross-validation on the same data with three other methods: Back Propagation network (BP), Random Forest (RF) and Support Vector Machine (SVM). The average results of the 10 experiments are shown in Table 4. Overall, Xgboost performs best and BP performs worst. Although RF is better than Xgboost in “Precision” and “Specificity,” Xgboost has the highest accuracy. In addition, Xgboost takes the least time to build the model.

TABLE 4

Comparison of the Xgboost with alternative models.

Algorithm	ACC (%)	Precision (%)	Recall (%)	Specificity (%)
Xgboost	99.13	99.04	99.23	99.04
BP	82.50	78.13	90.25	74.76
Random Forest	97.99	99.64	96.34	99.65
SVM	94.16	94.62	93.64	94.68

Figure 3 shows the ROC curves of the four methods. The red line is the curve of Xgboost, the green line is RF, and the blue and black lines are SVM and BP, respectively. As the figure shows, Xgboost is the best among these four methods. The corresponding AUC values are plotted in Figure 4.

FIGURE 3

ROC curve of four methods.

FIGURE 4

AUC of four methods.

As Figure 4 shows, the AUC of Xgboost is very close to 1, which reflects the high accuracy of Xgboost.
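For reference, the AUC can be computed without plotting the ROC curve, using its interpretation as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is a plain pairwise implementation of that definition; the function name and the toy data are assumptions, and it is not the code used to produce Figure 4.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores: label 1 = RD-related, label 0 = randomly selected gene.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

With 207 positives and 5,000 negatives the double loop covers about a million pairs, which is still fast; rank-based formulas give the same value for larger data sets.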

Conclusion

Typical clinical features of RD include early night blindness, subsequent progressive vision loss and narrowing of the visual field, bone-spicule pigmentation of the fundus, waxy pallor and atrophy of the optic disc, and a decline of cone and rod cell function on the electroretinogram (ERG), with the early decline in rod response amplitude being more severe than that in cone response amplitude. Because the RP phenotype is highly heterogeneous, many retinopathies have symptoms similar to those of RP and are easily confused with it. Therefore, exploring RD from a genetic perspective is very helpful for clinical diagnosis, treatment and research on pathogenic mechanisms. With the popularization of high-throughput sequencing technology, a large amount of genomic and proteomic data has been released. However, no method has been proposed specifically for identifying RD-related genes. In this article, we propose a method based on Xgboost to identify RD-related genes. We extracted three kinds of features from the proteins encoded by 207 genes known to be related to RD. Each gene is represented by 27-dimensional features, which we input into Xgboost for training. Through 10-fold cross-validation, we confirmed the accuracy of our method in identifying RD-related genes, with an AUC of 0.99. In summary, we propose a method for large-scale identification of RD-related genes. Machine learning methods of this type can prioritize genes that are potentially related to RD and thus save researchers the cost of biological experiments.

References (32 in total)

1. The 208delG mutation in FSCN2 does not associate with retinal degeneration in Chinese individuals.

Authors: Qingjiong Zhang; Shiqiang Li; Xueshan Xiao; Xiaoyun Jia; Xiangming Guo
Journal: Invest Ophthalmol Vis Sci Date: 2007-02 Impact factor: 4.799

2. PEST sequences and regulation by proteolysis. (Review)

Authors: M Rechsteiner; S W Rogers
Journal: Trends Biochem Sci Date: 1996-07 Impact factor: 13.807

3. Re-evaluation casts doubt on the pathogenicity of homozygous USH2A p.C759F.

Authors: María González-Del Pozo; Nereida Bravo-Gil; Cristina Méndez-Vidal; Ignacio Montero-de-Espinosa; José M Millán; Joaquín Dopazo; Salud Borrego; Guillermo Antiñolo
Journal: Am J Med Genet A Date: 2015-03-30 Impact factor: 2.802

4. The molecular basis of human retinal and vitreoretinal diseases. (Review)

Authors: Wolfgang Berger; Barbara Kloeckener-Gruissem; John Neidhardt
Journal: Prog Retin Eye Res Date: 2010-03-31 Impact factor: 21.198

5. Computational/in silico methods in drug target and lead prediction.

Authors: Francis E Agamah; Gaston K Mazandu; Radia Hassan; Christian D Bope; Nicholas E Thomford; Anita Ghansah; Emile R Chimusa
Journal: Brief Bioinform Date: 2019-11-10 Impact factor: 11.622

6. Mutations of 60 known causative genes in 157 families with retinitis pigmentosa based on exome sequencing.

Authors: Yan Xu; Liping Guan; Tao Shen; Jianguo Zhang; Xueshan Xiao; Hui Jiang; Shiqiang Li; Jianhua Yang; Xiaoyun Jia; Ye Yin; Xiangming Guo; Jun Wang; Qingjiong Zhang
Journal: Hum Genet Date: 2014-06-18 Impact factor: 4.132

7. Primary structure and cellular localization of chicken brain myosin-V (p190), an unconventional myosin with calmodulin light chains.

Authors: E M Espreafico; R E Cheney; M Matteoli; A A Nascimento; P V De Camilli; R E Larson; M S Mooseker
Journal: J Cell Biol Date: 1992-12 Impact factor: 10.539

8. Retinitis Pigmentosa with EYS Mutations Is the Most Prevalent Inherited Retinal Dystrophy in Japanese Populations.

Authors: Yuuki Arai; Akiko Maeda; Yasuhiko Hirami; Chie Ishigami; Shinji Kosugi; Michiko Mandai; Yasuo Kurimoto; Masayo Takahashi
Journal: J Ophthalmol Date: 2015-06-16 Impact factor: 1.909

9. Targeted next-generation sequencing reveals novel EYS mutations in Chinese families with autosomal recessive retinitis pigmentosa.

Authors: Xue Chen; Xiaoxing Liu; Xunlun Sheng; Xiang Gao; Xiumei Zhang; Zili Li; Huiping Li; Yani Liu; Weining Rong; Kanxing Zhao; Chen Zhao
Journal: Sci Rep Date: 2015-03-10 Impact factor: 4.379

10. UniProt: a worldwide hub of protein knowledge.

Authors: UniProt Consortium
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971