Li Zhang1,2, Jingru Shi1, Jian Ouyang1, Riquan Zhang2, Yiran Tao1, Dongsheng Yuan1, Chengkai Lv1, Ruiyuan Wang1, Baitang Ning3, Ruth Roberts4,5, Weida Tong6, Zhichao Liu7, Tieliu Shi8,9,10.
Abstract
BACKGROUND: Gene copy number variations (CNVs) contribute to genetic diversity and disease prevalence across populations. Substantial efforts have been made to decipher the relationship between CNVs and pathogenesis but with limited success.
Keywords: Copy number variation; Machine learning; Next-generation sequencing; Pathogenicity; XGBoost
Year: 2021 PMID: 34407882 PMCID: PMC8375180 DOI: 10.1186/s13073-021-00945-4
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Fig. 1 Workflow of X-CNV model training and validation. The model was trained based on the XGBoost algorithm using 30 predictive features of 5315 pathogenic and 14,260 benign CNVs from dbVar and was validated in 4893 pathogenic and 4073 benign CNVs from ClinGen and DECIPHER. The features were categorized into four types, including universal annotation, genome-wide annotation, coding annotation, and non-coding annotation. The allele frequency (AF) of CNVs was calculated based on the unified CNVs from DGV and dbVar.
Fig. 2 Strategy to unify potentially identical CNVs and the general properties of the unified CNVs in a natural population. A Schematic diagram depicting the use of the maximal clique algorithm to unify CNVs. B Coverage of unified CNVs on the human genome. C Length differences between gain and loss, pathogenic and benign, and intragenic and intergenic CNVs. D Proportions of the samples in the subpopulations from dbVar. E Population allele frequency (PAF) of gain and loss in the subpopulations.
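The caption names a maximal clique algorithm but does not give the rule that links two CNVs. The following minimal sketch assumes, purely for illustration, that two CNVs of the same type on the same chromosome are linked when their reciprocal overlap reaches a threshold (`min_ro`, a hypothetical parameter set to 0.5 here), and then enumerates maximal cliques with the basic Bron–Kerbosch procedure. The function names and the interval representation are assumptions, not the authors' implementation.

```python
from itertools import combinations


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals given as (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))


def build_graph(cnvs, min_ro=0.5):
    """Link CNVs whose reciprocal overlap is at least min_ro (assumed rule)."""
    adj = {i: set() for i in range(len(cnvs))}
    for i, j in combinations(range(len(cnvs)), 2):
        if reciprocal_overlap(cnvs[i], cnvs[j]) >= min_ro:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical CNV calls on one chromosome, all of the same type.
calls = [(100, 200), (110, 210), (105, 195), (1000, 2000)]
groups = maximal_cliques(build_graph(calls))
```

Each clique is a set of calls that all mutually satisfy the overlap rule, so it is a candidate for being reported as one unified CNV. How the boundaries of a unified CNV are chosen from a clique is not stated in the caption and is left out of the sketch.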
Fig. 3 Performance and important features of X-CNV models. A Distribution of AUC values in the models during parameter tuning by 100 repetitions of tenfold cross-validation. B ROC curves for X-CNV and SVScores in the validation set (ClinGen & DECIPHER).
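As a rough illustration of the evaluation scheme named in the caption, the sketch below generates index splits for repeated tenfold cross-validation and computes AUC as the probability that a pathogenic CNV scores above a benign one. It uses only the standard library, does not stratify the folds, and does not train any model; the function names, the seed, and the toy scores are assumptions.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for `repeats` reshuffled k-fold splits."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]  # k nearly equal-sized folds
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield r, f, train, test


# Toy check of the AUC helper: 1 = pathogenic, 0 = benign (made-up scores).
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
splits = list(repeated_kfold(25, k=10, repeats=100))
```

In a tuning loop, a model would be fitted on each `train` index list and scored on the matching `test` list, giving 1000 AUC values per parameter setting whose distribution can then be plotted.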
Model performance of X-CNV, AnnotSV, and ClassifyCNV on the independent validation set
| Metrics | X-CNV | | | AnnotSV | | | ClassifyCNV | | |
|---|---|---|---|---|---|---|---|---|---|
| | All | Gain | Loss | All | Gain | Loss | All | Gain | Loss |
| MCC | 0.27 | 0.06 | 0.46 | −0.08 | −0.24 | 0.01 | | | |
| Accuracy | 0.62 | 0.38 | 0.79 | 0.48 | 0.34 | 0.57 | | | |
| F1 score | 0.73 | 0.49 | 0.88 | 0.56 | 0.32 | 0.68 | | | |
| Fowlkes–Mallows index | 0.76 | 0.56 | 0.87 | 0.56 | 0.34 | 0.68 | | | |
| Sensitivity | 0.85 | 0.52 | 0.96 | 0.62 | 0.47 | 0.68 | | | |
| Specificity | 0.21 | 0.13 | 0.35 | 0.31 | 0.28 | 0.34 | | | |
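The six metrics in the table all derive from a binary confusion matrix. The sketch below computes them using their standard definitions, with pathogenic as the positive class. The function name and the example counts are hypothetical and are not taken from the study; the code also assumes that every marginal count is non-zero.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes tp + fp, tp + fn, tn + fp and tn + fn are all non-zero.
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall, true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fowlkes_mallows": sqrt(precision * sensitivity),
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Made-up counts for illustration only.
example = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

Unlike accuracy or F1, MCC uses all four cells of the confusion matrix and falls to zero or below when predictions are no better than chance, which makes it informative on class-imbalanced sets.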
Fig. 4 The important features of the X-CNV model for CNV pathogenicity prediction. A The contribution of the ten important features for CNV pathogenicity prediction. B AUC values of the X-CNV model for CNVs with four different CNV lengths. C Receiver operating characteristic (ROC) curves for CNV gain (orange) and loss (green), respectively.
Fig. 5 The separating capability of the meta-voting prediction (MVP) score in the pathological categories and its application to rare disease, hereditary tumor, and population genetics. A Distribution of MVP scores in the five pathological categories. The points above the boxes represent the outliers. B AUC values and cutoffs of the MVP scores to separate the five pathological categories. C The distribution of MVP scores in the pathogenic CNVs of 22 rare disease types. D The number of CNVs harboring cancer predisposition genes and being predicted as pathogenic or likely pathogenic (MVP > 0.46). E The allele frequency distribution of the CNVs categorized by the MVP scores. The average and 95% confidence intervals of population allele frequency of the CNVs categorized by the MVP scores within the ethnic groups.