| Literature DB >> 35891796 |
Yangyang Yuan1,2,3, Liubin Zhang1,2,3, Qihan Long1,2,3, Hui Jiang1,2,3, Miaoxin Li1,2,3,4,5.
Abstract
Increasing evidence shows that genetic interaction across the entire genome may explain a non-trivial fraction of genetic diseases. Digenic interaction is the simplest manifestation of genetic interaction among genes. However, systematic exploration of digenic interactive effects on the whole genome is often discouraged by the high dimension burden. Thus, numerous digenic interactions are yet to be identified for many diseases. Here, we propose a Digenic Interaction Effect Predictor (DIEP), an accurate machine-learning approach to identify the genome-wide pathogenic coding gene pairs with digenic interaction effects. This approach achieved high accuracy and sensitivity in independent testing datasets, outperforming another gene-level digenic predictor (DiGePred). DIEP was also able to discriminate digenic interaction effect from bi-locus effects dual molecular diagnosis (pseudo-digenic). Using DIEP, we provided a valuable resource of genome-wide digenic interactions and demonstrated the enrichment of the digenic interaction effect in Mendelian and Oligogenic diseases. Therefore, DIEP will play a useful role in facilitating the genomic mapping of interactive causal genes for human diseases.Entities:
Keywords: Digenic interaction effect; Enrichment; Genetic interaction; Genomic mapping; Machine learning; Pathogenic gene pairs
Year: 2022 PMID: 35891796 PMCID: PMC9289819 DOI: 10.1016/j.csbj.2022.07.011
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1The diagram of the framework for digenic interaction effect prediction, I. Data Collection and Feature Selection. Positive samples of the training set were collected from DIDA, which contains 140 positive gene pairs. Negative samples were obtained based on different theoretical assumptions, including two parts and 16,156 gene pairs in total, from which 8400 pairs were extracted for training and the remaining were used for testing. II. Model Construction. Down-sampling and bagging-based strategies were adopted to address the imbalanced issue for model construction. The weight of each classifier was assigned by the 10-fold cross-validation F1 score of each RF. III. Validation. Seven independent test sets were collected for model validation. One was self-constructed set under the null hypothesis, and another one was extracted from 7 trios cases with different rare diseases. Five were literature-based datasets, including one positive test set. The DD set was adopted for differentiating dual molecular diagnoses from the digenic interaction effect. IV. Application. Enrichment analysis was conducted in 15 different diseases to investigate the enrichment of the digenic interaction effect between disease-causing genes of the same disease.
Summary of the selected features and databases.
| Category | Database/Method and Reference | Feature name | Fraction of missing (%) | Significance (Wilcox/KS) | Description |
|---|---|---|---|---|---|
| Gene-level | RVIS | RVIS_EVS.add / sub | 59.14 | . | The addition / subtraction of the residual variation intolerance score based on EVS (ESP6500) data. |
| RVIS_ExAC.add / sub | 58.25 | . | The addition / subtraction of the residual variation intolerance score based on ExAC data. | ||
| GDI | GDI.add / sub | 56.68 | . | The addition / subtraction of the gene damage index score of two different genes. | |
| Gene Recessive score | Recs.add / sub | 35.04 | <10E-3 | The addition / subtraction of recessive scores of two different genes. | |
| The Essential State of genes | EssgCom | 11.58 | . | The count of essential genes. | |
| Indispensability | Indispensability.add / sub | 58.57 | . | The addition / subtraction of the gene indispensability score of two different genes. | |
| GOSemSim | GOSemSim_MF | 13.22 | <10E-3 | The Semantic similarity of Gene GO annotations for molecular function (MF), biological process (BP) and cellular component (CC). | |
| GOSemSim_BP | |||||
| GOSemSim_CC | |||||
| GeneMANIA | GeneMANIAGG | 4.9 | <10E-3 | The weight of the gene function relationship between two genes was calculated by integrating multiple functional association networks. | |
| ConsensusPathDB | NumOfCommonInteraction | 60.13 | . | The number of common interactive genes of two different genes. | |
| commonInteractionJacSim | The Jaccard similarity coefficient of two interactive gene sets. | ||||
| Haploinsufficiency | HI.add / sub | 23.83 | <10E-3 | The addition / subtraction of the probability of haploinsufficiency of two different genes. | |
| HGC | BioDis | 20.26 | The biological distance between two genes. | ||
| LOF | LofIn.add / sub | 13.48 | add: 0.94 / 0.02 | The addition / subtraction of the probability of being loss-of-function intolerant of two different genes. | |
| LoFtool | FuncChangeInt.add / sub | 66.02 | . | The addition / subtraction of the probability of being functional change intolerance of two different genes. | |
| Protein-level | BioGRID | BioGRIDPP | 19.9 | <10E-3 | Whether there is protein interaction between the two genes. “0″ indicates no protein interaction and “1” represents there is protein interaction. |
| STRING | STRINGPP | 4.83 | <10E-3 | Protein-protein associations, including direct physical binding and indirect interaction from a functional perspective. | |
| REACTOME | REAC_FI | 33.9 | <10E-3 | Protein functional interaction effects | |
| Structure-level | UniProtKB | PS_2DbJacSim | 5.4 | . | The Jaccard similarity coefficient of two sets of structure domains, the maximum value by different databases for each gene pair was regarded as the feature value. |
| Expression-level | GTEx | HighexpPer | 6.68 | <10E-3 | The Jaccard similarity coefficient of two sets of high expressed tissues. |
| PaxDb | Abundance.add / sub | 11.38 | add: | The addition / subtraction of abundance scores of two different genes. | |
| COXPRESdb | COXPRESdbMRvalue | 9.03 | <10E-3 | Gene co-expressed values. | |
| Phenotype-level | DOSE | DOSemSim | 13.22 | <10E-3 | The Semantic similarity of disease ontology annotations. |
Note: The value range of the features *.add / sub were [0,2], while the remaining were [0,1], except for the classification feature EssgCom and BioGRIDPP;
The fraction of missing values was calculated based on each database independently. If there were several databases, the minimum miss rate was shown.
The significance testing results of each feature (after feature selection) between positive set and negative set, p-values for deleted features were replaced by “.”. The significant test was conducted using two methods, including the Wilcoxon test and the Kolmogorov-Smirnov test.
Features such as RVIS_EVS, GDI and Recessive score are specific to a single gene. Thus, we calculated the addition and the subtraction to indicate the correlation between two genes. The “.add” is the addition of two values, and the “.sub” is the subtraction.
Fig. 2Parameter determination and model performance on the whole training set. (A) The change of the average feature importance of 200 RF classifiers when conducting the feature selection using the Recursive Feature Elimination (RFE) method. The initial number of features for REF was 20, and the average feature importance was calculated each time the least important feature was deleted. The value means the Gini feature importance (FI), the red cross indicates the features with FI < 0.01 (should be deleted), the yellow exclamation mark indicates the FI is in the range of [0.01, 0.1], and the green tick indicates the FI > 0.1. (B) The order of the 10-fold cross-validation F1 scores for the 200 single classifiers trained using sub-samples generated by the down-sampling method. The top 26 RF classifiers had 10x-CV-F1s ≥ 95%. (C) The ROC (blue) and PR (yellow) curves of the final predictor. The peak means the best classification threshold for each curve. The embedded plot showed the change of Recall and Specificity rate in the positive and negative training set with the increased threshold. The recall rate stabilized at 100% when the threshold was ≤0.5. (D) The probability distribution of the final predicted scores. The dark and light triangles represented the misclassified samples in the negative and positive gene pairs separately. The scatter plot showed the distribution of the predicted probability of digenic interaction for the whole training set (16156 pairs). One point represented one gene pair, and the height meant the probability of the digenic interaction effect, those gene pairs with the same probability would stay at the same height. Most of the gene pairs were at the upper and lower ends of the plot, which indicated a clear result by the classification model.
Fig. 3The feature importance of DIEP and multiple predictors comparison. (A) The bar plot of the feature importance of DIEP. (B) Five metrics for the comparison of four predictors trained with different subsets of input features. The mean 10-fold cross-validation F1 scores were calculated based on the training set. Other indicators, including the PR AUCs, AUCs, recall and specificity were calculated based on the Manual set and the Test set (Table S2). “Only_STRINGPP” means that the predictor was trained with only one feature. And “Top3_Features” indicates that the predictor was trained with the top three features. “Without_STRINGPP” suggests that only the feature “STRINGPP” was deleted. The “Final” represents our DIEP.
Fig. 4The boxplot of feature contribution in the Positive and Negative set. In each dataset, the boxplot of predicted digenic gene pairs and non-digenic gene pairs were plotted separately in (A), (B), (C), (D) and (E). The red box indicates that the mean feature contribution of all the gene pairs is positive (≥0), and the blue is negative (<0). (F) The feature contribution pattern of an exceptional gene pair in the different datasets. The red color indicates the positive contribution (+) while blue is the negative (−). The order of the features is the same as (A)-(E).
Fig. 5The distribution of the predicted probability of digenic interaction effect in 6 additional test sets. (A) The distribution of the predicted probability of digenic interaction effect in the hypothesis-based set extracted from the original negative set (Fig. 1). (B) The test set was generated from healthy people of 7 trios with different rare diseases. (C)-(E) The predicted probability of digenic interaction effect in 3 literature-based sets used in VarCoPP. (F)-(G) The only positive test set from which the digenic gene pairs were manually curated from the disease research studies. Samples in (G) were also served as positive test set by DiGePred.
Fig. 6The detailed comparison between DIEP and DiGePred. (A and B) The PR curves for DIEP and DiGePred on two held-out test sets from DiGePred paper (unaffected- and random-no-gene-overlap). The values near the curves indicate the PR AUCs of different curves. (C) The predicted results of DIEP and DiGePred on the Manual test set. (D) The comparison between DIEP and DiGePred on 100 down-sampling sub-test sets. Each sub-test set contains 128 gene pairs (64 positives from the Manual test set and 64 negatives down-sampled from the Test set). The scatter plot (red dots) indicates the McNemar’s p-values of 100 sub-test sets, and the line chart shows the corresponding predicted accuracy of two methods in each set. (E) The distribution of predicted scores of 64 probably digenic gene pairs by DIEP and DiGePred. Each bar represents one gene pair, and the length of each line indicates the value size. The gray area shows the percentage of the wrong predictions of two methods under the best thresholds.
Fig. 7The statistical result and the enrichment ratio plot for 15 same disease gene pair sets and 35 different diseases gene pair sets. (A) The statistic result between 15 same disease gene pair sets and 15 representative different diseases gene pair sets. The red circle indicates the proportion of digenic gene pairs in which both genes are from the same disease, and the blue one is the proportion of digenic gene pairs in which both genes are from different diseases. The size of the circle depended on the proportion. The red line indicates a significant result with P-value ≤ 1E-05, the black line indicates the P-value > 0.05, and the blue line indicates the P-value falls in the middle of two P values. (B) The scatter plot of odds ratios for comparisons between 15 same disease gene pair sets and 35 different diseases gene pair sets. The red dotted line shows the OR threshold of 1. (C) The detailed enrichment values of comparisons between monogenic disease Retinoblastoma Disease and 35 different diseases gene pair sets. Each line indicates one comparison, the length of the line means the 95% confidence interval and the red line link 35 enrichment values. The dotted line shows the Odds ratio = 1. (D) The detailed enrichment values of comparisons between oligogenic disease Bardet-Biedl syndrome and 35 different diseases gene pair sets. (E) The detailed enrichment values of comparisons between polygenic disease Diabetes and 35 different diseases gene pair sets.