Yujie Xia, Xiaojie Li, Xinlin Chen, Changjin Lu, Xiaoyi Yu.
Abstract
Retinal degeneration (RD) is an inherited retinal disease characterized by degeneration of rod and cone photoreceptor cells and of retinal pigment epithelial cells. The age of onset and the progression of RD depend on both genes and environment. To date, research has identified five genes closely related to RD (RHO, PDE6B, MERTK, RLBP1, and RPGR), and researchers have developed corresponding gene therapy methods. Gene therapy uses vectors to deliver therapeutic genes, genetically modify target cells, and correct or replace disease-causing RD genes. Identifying the pathogenic genes of RD is therefore important for developing treatments for the disease. However, traditional methods for identifying RD-related genes rely mostly on animal experiments, and only a small number of RD-related genes have been identified so far. With the growth of biological data, Xgboost is proposed in this article to identify RD-related genes. Xgboost adds a regularization term to control model complexity, which makes it suitable for finding true RD-related genes among a large and complex set of genes and helps to limit overfitting. To verify the power of Xgboost to identify RD-related genes, we performed 10-fold cross-validation and compared it with three traditional methods: Random Forest, Back Propagation (BP) network, and Support Vector Machine (SVM). The accuracy of Xgboost is 99.13%, and its AUC is much higher than that of the other three methods. This article can therefore provide technical support for the efficient identification of RD-related genes and help researchers gain a deeper understanding of the genetic characteristics of RD.
Keywords: Xgboost; amino acids; machine learning; pathogenic gene; retinitis degeneration
Year: 2022 PMID: 35223997 PMCID: PMC8880610 DOI: 10.3389/fmolb.2022.843150
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
The six groups of the 20 amino acids.
| Groups | Amino acids |
|---|---|
| Strongly hydrophilic | R,D,E,N,Q,K,H |
| Strongly hydrophobic | L,I,V,A,M,F |
| Weakly hydrophilic or weakly hydrophobic | S,T,Y,W |
| Proline | P |
| Glycine | G |
| Cysteine | C |
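The grouping in the table can be turned into a numeric feature vector in several ways. The following is a minimal sketch of one of them, a six-value group composition (the fraction of residues falling in each group). The function name `group_composition` and the choice of composition as the encoding are assumptions made for illustration; the source does not state which encoding the feature-extraction step actually uses.

```python
from collections import Counter

# Six amino-acid groups, copied from the table above.
GROUPS = {
    "strongly_hydrophilic": "RDENQKH",
    "strongly_hydrophobic": "LIVAMF",
    "weakly_hydrophilic_or_hydrophobic": "STYW",
    "proline": "P",
    "glycine": "G",
    "cysteine": "C",
}

# Reverse lookup: one-letter residue code -> group name.
RESIDUE_TO_GROUP = {aa: name for name, letters in GROUPS.items() for aa in letters}


def group_composition(sequence):
    """Return the fraction of residues in each of the six groups.

    Hypothetical encoding for illustration; characters outside the
    20 standard residues are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in RESIDUE_TO_GROUP]
    if not residues:
        return [0.0] * len(GROUPS)
    counts = Counter(RESIDUE_TO_GROUP[aa] for aa in residues)
    return [counts[name] / len(residues) for name in GROUPS]


# Toy sequence, not taken from the source.
features = group_composition("MKRPGC")
```

The resulting vector has one entry per group, in the order of the table rows, and its entries sum to 1 for any non-empty sequence.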
FIGURE 1. Flow chart of feature extraction.
The parameters of Xgboost.
| Parameter | Value |
|---|---|
| Booster | gbtree |
| Silent | 0 |
| Learning rate | 0.3 |
| Maximum depth of a tree | 6 |
| Minimum sum of instance weight | 1 |
| Subsample ratio | 1 |
| Experimental parameter | 1 |
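As a rough sketch, the settings in the table could be expressed as a parameter dictionary for the `xgboost` Python package. The mapping from table rows to native parameter names (`eta`, `max_depth`, `min_child_weight`, `subsample`) is an assumption, as are the helper `train_booster`, the `binary:logistic` objective, and the number of boosting rounds; none of these is stated in the source.

```python
# Values copied from the table; key names are the native XGBoost
# parameter names assumed to correspond to each table row.
XGB_PARAMS = {
    "booster": "gbtree",
    "eta": 0.3,             # learning rate
    "max_depth": 6,         # maximum depth of a tree
    "min_child_weight": 1,  # minimum sum of instance weight in a child
    "subsample": 1,         # subsample ratio of the training instances
}
# "Silent = 0" is left out: recent XGBoost releases use `verbosity` instead.
# The "Experimental parameter" row is left out because its meaning is not stated.


def train_booster(features, labels, num_rounds=100):
    """Train a booster with the table settings.

    Hypothetical helper: the objective and num_rounds are illustrative
    choices, not values taken from the source.
    """
    import xgboost as xgb  # imported lazily so the dict is usable without it

    dtrain = xgb.DMatrix(features, label=labels)
    params = dict(XGB_PARAMS, objective="binary:logistic")
    return xgb.train(params, dtrain, num_boost_round=num_rounds)
```

The listed values coincide with the library defaults for these parameters, so the table effectively describes an untuned model.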
The results of the ten experiments.
| | | Predicted 1 | Predicted 0 | Total |
|---|---|---|---|---|
| True label | 1 | 205 (TP) | 2 (FN) | 207 |
| | 0 | 20 (FP) | 4,980 (TN) | 5,000 |
| | Total | 225 | 4,982 | 5,207 |
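For reference, the standard metrics can be derived from the four counts in the confusion matrix above. This is a minimal sketch using the usual textbook definitions; the function name `confusion_metrics` is an assumption. Values computed from pooled counts like these need not coincide with figures averaged over individual experiments.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and specificity from raw counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,    # correct predictions / all samples
        "precision": tp / (tp + fp),      # true positives / predicted positives
        "recall": tp / (tp + fn),         # true positives / actual positives
        "specificity": tn / (tn + fp),    # true negatives / actual negatives
    }


# Counts taken from the table above.
metrics = confusion_metrics(tp=205, fn=2, fp=20, tn=4980)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```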
FIGURE 2. The accuracy of the ten experiments.
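One way to obtain ten experiments from a single dataset is 10-fold cross-validation. The sketch below splits sample indices into ten folds while keeping the class ratio roughly constant in each fold. Stratification, the fixed seed, and the function name `stratified_k_fold` are assumptions; the class sizes (207 positives, 5,000 negatives) are taken from the row totals of the results table, and the source does not say how the folds were actually drawn.

```python
import random


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices round-robin into folds
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Class sizes taken from the row totals of the results table.
labels = [1] * 207 + [0] * 5000
splits = list(stratified_k_fold(labels))
```

Each sample appears in exactly one test fold, so every sample is predicted once across the ten runs.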
Comparison of Xgboost with alternative models.
| Algorithm | ACC (%) | Precision (%) | Recall (%) | Specificity (%) |
|---|---|---|---|---|
| Xgboost | 99.13 | 99.04 | 99.23 | 99.04 |
| BP | 82.50 | 78.13 | 90.25 | 74.76 |
| Random Forest | 97.99 | 99.64 | 96.34 | 99.65 |
| SVM | 94.16 | 94.62 | 93.64 | 94.68 |
FIGURE 3. ROC curves of the four methods.
FIGURE 4. AUC of the four methods.
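The AUC reported for each method is the area under its ROC curve. A minimal sketch of how it can be computed from labels and predicted scores is shown below, using the rank-based (Mann-Whitney) formulation, in which the AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. The function name `roc_auc` and the toy data are assumptions for illustration and do not come from the source.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum formulation; tied scores receive average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of samples sharing the same score.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank within the tie
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data: three of the four positive/negative pairs are ordered correctly.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(toy_labels, toy_scores)  # 0.75
```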