Chang Lu1,2, Wenjie Jiang1,2, Hang Wang1,2, Jinxiu Jiang1,2, Zhiqiang Ma1,2,3, Han Wang1,2,3.
Abstract
Ubiquinone is an important cofactor that plays vital and diverse roles in many biological processes. Ubiquinone-binding proteins (UBPs) are receptor proteins that dock with ubiquinones. Analyzing and identifying UBPs computationally can provide insights into the pathways associated with ubiquinones. In this work, we propose the first UBP predictor, UBPs-Pred. The optimal feature subset, selected from three categories of sequence-derived features, was fed into an extreme gradient boosting (XGBoost) classifier, and the XGBoost parameters were tuned by multi-objective particle swarm optimization (MOPSO). On independent validation, the predictor achieved considerable performance, with a Matthews correlation coefficient (MCC) of 0.517. We then analyzed the UBPs using bioinformatics methods, including statistics of the binding domain motifs and protein distribution, as well as enrichment analyses of gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.
Keywords: KEGG pathway; XGBoost; binding domain motifs; gene ontology; ubiquinone-binding proteins
Year: 2020 PMID: 32102444 PMCID: PMC7072731 DOI: 10.3390/cells9020520
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Information about the parameters of XGBoost tuning by multi-objective particle swarm optimization (MOPSO) in this work: name, description, default value, threshold while searching, and tuned value.
| Parameter | Description | Default | Threshold (search range) | Tuned |
|---|---|---|---|---|
| learning_rate | Step size shrinkage | 0.10 | [0, 0.5] | 0.08 |
| n_estimators | Number of trees | 100 | [100, 2000] | 162 |
| max_depth | Maximum depth of a tree | 3 | [1, 10] | 8 |
| subsample | Fraction of samples used per tree | 1.00 | [0, 1] | 0.75 |
| colsample_bytree | Fraction of features used per tree | 1.00 | [0, 1] | 0.12 |
| gamma | Minimum loss reduction required to split a node | 0 | [0, 1] | 0.83 |
| reg_alpha | L1 regularization term on weights | 0 | [0, 1] | 0.08 |
| reg_lambda | L2 regularization term on weights | 1.00 | [0, 2] | |
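The table can be transcribed into a small configuration, which makes the search ranges easy to check programmatically. The sketch below is only an illustration: the names `SEARCH_SPACE`, `TUNED`, `known_params`, and `out_of_range` are hypothetical and not from the original work, and the tuned value of `reg_lambda` is left as `None` because the table does not give it.

```python
# Search ranges and tuned values transcribed from the table above.
# reg_lambda has no tuned value in the table, so it stays None (not guessed).
SEARCH_SPACE = {
    "learning_rate": (0.0, 0.5),
    "n_estimators": (100, 2000),
    "max_depth": (1, 10),
    "subsample": (0.0, 1.0),
    "colsample_bytree": (0.0, 1.0),
    "gamma": (0.0, 1.0),
    "reg_alpha": (0.0, 1.0),
    "reg_lambda": (0.0, 2.0),
}

TUNED = {
    "learning_rate": 0.08,
    "n_estimators": 162,
    "max_depth": 8,
    "subsample": 0.75,
    "colsample_bytree": 0.12,
    "gamma": 0.83,
    "reg_alpha": 0.08,
    "reg_lambda": None,
}


def known_params(tuned):
    """Drop parameters whose tuned value is unknown (None)."""
    return {name: value for name, value in tuned.items() if value is not None}


def out_of_range(params, space):
    """Return the names of parameters lying outside their search interval."""
    return [
        name
        for name, value in params.items()
        if not (space[name][0] <= value <= space[name][1])
    ]


params = known_params(TUNED)
assert out_of_range(params, SEARCH_SPACE) == []

# With the xgboost package installed, a classifier could be built as:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**params)  # reg_lambda falls back to the library default
```

Keeping the unknown value out of the dictionary, rather than substituting a number, means the classifier would use the library default for that parameter instead of a guessed one.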
Comparison of the different classifiers.
| Classifier | Sen | Spe | Pre | ACC | F1 | MCC |
|---|---|---|---|---|---|---|
| NB | 0.536 | 0.767 | 0.696 | 0.650 | 0.604 | 0.311 |
| MLP | 0.594 | 0.738 | 0.744 | 0.675 | 0.629 | 0.377 |
| SVM | 0.688 | 0.705 | 0.698 | 0.695 | 0.692 | 0.393 |
| AdaBoost | 0.704 | 0.734 | 0.723 | 0.719 | 0.712 | 0.438 |
| RF | 0.651 | **0.814** | **0.781** | 0.734 | 0.708 | 0.474 |
| XGBoost | **0.754** | 0.759 | 0.756 | **0.755** | **0.753** | **0.511** |

Sen: sensitivity; Spe: specificity; Pre: precision; ACC: accuracy; F1: F1-measure; MCC: Matthews correlation coefficient. The bolded parts represent the highest value of the corresponding evaluation indicator.
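For reference, the six indicators can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` and the example counts are illustrative and not taken from the original work.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Sen, Spe, Pre, ACC, F1 and MCC from confusion-matrix counts."""
    sen = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall)
    spe = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    pre = tp / (tp + fp) if (tp + fp) else 0.0  # precision
    acc = (tp + tn) / (tp + tn + fp + fn)       # accuracy
    f1 = 2 * pre * sen / (pre + sen) if (pre + sen) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Pre": pre, "ACC": acc, "F1": f1, "MCC": mcc}


# Illustrative counts only (not data from the paper).
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it only approaches 1 when both classes are predicted well.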
Figure 1. The Matthews correlation coefficient (MCC) value of the models in the process of incremental feature selection (IFS).
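The idea behind IFS can be sketched as a simple loop: features are added one at a time in ranked order, each prefix is scored, and the best-scoring prefix is kept. This record does not state how the features were ranked or how each subset was evaluated, so in the sketch below `score_subset` is a hypothetical callback (in practice it might train a classifier and return a cross-validated MCC), and the toy scores are made up purely for illustration.

```python
def incremental_feature_selection(ranked_features, score_subset):
    """Score every prefix of a ranked feature list and keep the best one.

    ranked_features: feature names ordered from most to least important.
    score_subset: hypothetical callback mapping a list of features to a score
                  (for example, a cross-validated MCC).
    Returns the best prefix and the score obtained at each step.
    """
    best_score = float("-inf")
    best_k = 0
    scores = []
    for k in range(1, len(ranked_features) + 1):
        score = score_subset(ranked_features[:k])
        scores.append(score)
        if score > best_score:  # strict '>' keeps the smallest subset on ties
            best_score, best_k = score, k
    return ranked_features[:best_k], scores


# Toy scorer with invented values, standing in for model training.
toy_scores = {1: 0.30, 2: 0.45, 3: 0.52, 4: 0.48}
selected, curve = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"], lambda subset: toy_scores[len(subset)]
)
```

The list of per-step scores returned by the function is the kind of curve that a plot such as Figure 1 would display.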
Figure 2. Distribution of each kind of feature in the optimal feature subset. AAC: amino acid composition; DC: dipeptide composition; PSSM: position-specific scoring matrix.
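Two of the three feature types can be computed directly from a protein sequence. The sketch below uses the common definitions of AAC (20 residue frequencies) and DC (400 adjacent-pair frequencies); it is an assumption that the original work used exactly these definitions, and the function names are illustrative. PSSM features are omitted because they require an external alignment search.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: frequency of each standard residue (20 values)."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dc(seq):
    """Dipeptide composition: frequency of each adjacent residue pair (400 values).

    Pairs containing a non-standard residue are skipped rather than bridged.
    """
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    total = len(pairs)
    return [
        pairs.count(a + b) / total if total else 0.0
        for a, b in product(AMINO_ACIDS, repeat=2)
    ]


# Toy sequence for illustration only.
features = aac("AAC") + dc("AAC")  # 20 + 400 = 420 values
```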
Comparison of the prediction performance before and after parameter tuning through cross-validation and independent validation.
| Validation | Model | Sen | Spe | Pre | ACC | F1 | MCC |
|---|---|---|---|---|---|---|---|
| Cross-validation | Default parameters | 0.759 | 0.786 | 0.779 | 0.772 | 0.768 | 0.545 |
| Cross-validation | Tuned parameters | | | | | | |
| Independent validation | Default parameters | 0.649 | 0.760 | 0.727 | 0.705 | 0.686 | 0.411 |
| Independent validation | Tuned parameters | | | | | | |
* The bolded parts represent the highest value of the corresponding evaluation indicator.
Figure 3. Illustration of respiratory complex II of the mitochondrial respiratory chain.
Figure 4. Sequence logos of the motifs within the ubiquinone-binding domains. The E-value threshold is 0.05. “Sites” represents the number of sites contributing to the construction of the motif. “Width” represents the width of the motif. The 3D visualization on the right is an example of the corresponding motif. “Protein” represents the PDB ID_Chain (domain). “Ligand” represents the type of ubiquinone.
Figure 5. The superfamily distribution of the selected ubiquinone-binding proteins (UBPs). The numeric labels on the chart give the number of UBPs in each superfamily. The category names in the legend are clan names from the Pfam database. Every superfamily in the “Others” category contains one protein.
Figure 6. General information on the gene ontology (GO) enrichment analysis of human UBPs: (a) enriched biological processes; (b) enriched cellular components; (c) enriched molecular functions. The description on the left side of each bar is the name of the GO term. “Percent of Genes” is the number of genes involved in a given term as a percentage of the total number of genes in the query proteins. The numeric label on the right side of each bar gives the number of genes involved in the term and the corresponding P-value. “Max Level” is the maximal annotated level of the given term in the GO graph. Different colors indicate different max levels. Terms with the same max level are sorted by P-value.
Figure 7. The top 10 significantly enriched KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways of human UBPs. The description on the left side of each bar is the name of the KEGG pathway. “Percent of Genes” is the number of genes involved in a given pathway as a percentage of the total number of genes in the query proteins. The numeric label on the right side of each bar gives the number of genes involved in the pathway and the corresponding P-value. Different colors indicate different pathway categories. Pathways of the same category are sorted by P-value.
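The two quantities shown next to each bar in Figures 6 and 7 can be illustrated with a short calculation. "Percent of Genes" follows directly from the caption. For the P-value, this record does not say which statistical test was used; the sketch below assumes a one-sided hypergeometric test, a common choice for enrichment analysis. All function names and numbers are illustrative.

```python
from math import comb


def percent_of_genes(genes_in_term, total_query_genes):
    """Share of query genes annotated to a term, as a percentage."""
    return 100.0 * genes_in_term / total_query_genes


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the query, k: query genes annotated to the term.
    Assumed test; the source does not specify how P-values were computed.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy numbers for illustration only (not data from the paper).
pct = percent_of_genes(5, 20)
p = hypergeom_pvalue(k=2, n=4, K=3, N=10)
```

A small P-value indicates that the query contains more genes from the term or pathway than would be expected from a random draw of the same size.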