Jakob Hertzberg, Stefan Mundlos, Martin Vingron, Giuseppe Gallone.
Abstract
Few methods have been developed to investigate copy number variants (CNVs) based on their predicted pathogenicity. We introduce TADA, a method to prioritise pathogenic CNVs through assisted manual filtering and automated classification, based on an extensive catalogue of functional annotation supported by rigorous enrichment analysis. We demonstrate that our classifiers accurately predict pathogenic CNVs, outperform current alternative methods, and produce a well-calibrated pathogenicity score. Our results suggest that functional annotation-based prioritisation of pathogenic CNVs is a promising approach to support clinical diagnostics and to further the understanding of mechanisms controlling the disease impact of larger genomic alterations.
Keywords: Copy-number-variants; Functional annotation; Machine learning; Pathogenicity prediction; Structural variants; TADs
Year: 2022 PMID: 35232478 PMCID: PMC8886976 DOI: 10.1186/s13059-022-02631-z
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Enrichment Analysis of non-pathogenic and pathogenic deletions. The figure shows the log2(fold change) between observed and expected variant overlap for each set of genomic annotations, with the expected overlap based on 10,000 simulations. The size of the squares on the right side of the figure is proportional to the difference in overlap fold change (FC) between pathogenic and non-pathogenic deletions. Grey bars and squares indicate a non-significant FC (significance threshold: q value ≤ 0.01).
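The quantity plotted in Fig. 1 can be illustrated with a minimal sketch. The caption does not specify how the simulations are drawn, so the code below assumes a single chromosome, uniform random repositioning of each variant with its length preserved, and a pseudocount of 1 to avoid taking the logarithm of zero. The function names (`overlaps`, `count_overlapping`, `log2_fold_change`) are hypothetical and are not part of TADA.

```python
import math
import random


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def count_overlapping(variants, annotations):
    """Number of variants that overlap at least one annotation interval."""
    return sum(
        any(overlaps(v_start, v_end, a_start, a_end) for a_start, a_end in annotations)
        for v_start, v_end in variants
    )


def log2_fold_change(variants, annotations, chrom_length, n_sim=1000, seed=0):
    """log2 of observed over expected overlap, with a pseudocount of 1.

    The expected overlap is the mean overlap count over n_sim simulations in
    which every variant is moved to a uniformly random position (assumption).
    """
    rng = random.Random(seed)
    observed = count_overlapping(variants, annotations)
    total = 0
    for _ in range(n_sim):
        shuffled = []
        for v_start, v_end in variants:
            length = v_end - v_start
            start = rng.randrange(0, chrom_length - length + 1)
            shuffled.append((start, start + length))
        total += count_overlapping(shuffled, annotations)
    expected = total / n_sim
    return math.log2((observed + 1) / (expected + 1))
```

A positive value indicates that the variants overlap the annotation set more often than the simulated placements do, and a negative value indicates depletion. Assessing significance (the q values in the caption) would additionally require an empirical p value per annotation set and a multiple-testing correction, which this sketch omits.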
Fig. 2 Generalised Workflow of the TADA tool. CNV annotation is based on BED files of TAD boundaries and additional sets of genomic annotations, e.g. gene coordinates. In a first step, the annotation sets are sorted into the corresponding TAD environment based on genomic position. The resulting annotated TAD regions are used as a proxy for the regulatory environment during CNV annotation (“TAD-aware annotation”). The default feature set for the CNV annotation process consists of features describing the distance to genomic elements, such as genes and enhancers, in the same TAD environment, as well as metrics describing functional relevance, e.g. conservation scores of affected coding or regulatory elements. Alternatively, the user can provide a set of BED files containing the coordinates of genomic elements, from which a new feature set, i.e. the distance of CNVs to these annotations, is generated. The user is then able to manually prioritise CNVs based on the distance features. If the default feature set is used, TADA also allows for automated prioritisation using the pathogenicity score computed by our pre-trained random forest model.
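The idea of a TAD-aware distance feature can be sketched as follows. This is an illustration under simplifying assumptions, not TADA's implementation: all intervals are half-open `(start, end)` tuples on one chromosome, an element is assigned to every TAD it overlaps, and the helper names (`annotate_tads`, `tad_aware_distance`) are hypothetical.

```python
def intervals_overlap(a, b):
    """True if the half-open intervals a and b, given as (start, end), overlap."""
    return a[0] < b[1] and b[0] < a[1]


def interval_distance(a, b):
    """Gap in base pairs between two intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def annotate_tads(tads, elements):
    """Sort genomic elements (e.g. genes) into the TADs they overlap."""
    return {tad: [e for e in elements if intervals_overlap(tad, e)] for tad in tads}


def tad_aware_distance(cnv, annotated_tads):
    """Distance from a CNV to the closest element sharing a TAD with it.

    Elements in TADs that the CNV does not overlap are ignored, even when they
    are closer in linear sequence. Returns None if no element shares a TAD.
    """
    candidates = [
        element
        for tad, elements in annotated_tads.items()
        if intervals_overlap(tad, cnv)
        for element in elements
    ]
    if not candidates:
        return None
    return min(interval_distance(cnv, element) for element in candidates)
```

For example, with TADs `(0, 1000)` and `(1000, 2000)`, genes at `(100, 200)` and `(1050, 1100)`, and a CNV at `(800, 900)`, the feature is the 600 bp to the gene in the same TAD, not the 150 bp to the nearer gene across the TAD boundary.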
Fig. 3 Predictive Performance of the Deletion and Duplication Classifiers. A and B show the ranking performance of TADA and SVScore for deletions and duplications, respectively. For each bin, we computed the percentage of variants placed at the corresponding rank or ranks. Black bars indicate the standard deviation based on 30 random sampling runs. C shows the ROC curves and AUC values for the deletion and duplication classifiers, based on the test split and ClinVar variants, for both TADA and SVScore.
Fig. 4 Feature Importance of the Deletion Model. The figure shows the mean loss in accuracy after permutation of highly correlated feature clusters (see Methods for a detailed description of individual features). The standard deviation based on 30 sampling runs with variable random seed is indicated by black lines.
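Permutation importance for a cluster of correlated features can be sketched as below. This is a generic illustration, not the procedure from the Methods: it assumes the model is available as a `predict` callable on a single row, that all columns of a cluster are permuted jointly with the same row order, and that the repeats come from one seeded generator (the caption instead describes 30 runs with varying seeds). The function names are hypothetical.

```python
import random
import statistics


def accuracy(predict, X, y):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def cluster_permutation_importance(predict, X, y, cluster, n_repeats=30, seed=0):
    """Mean and standard deviation of the accuracy loss after permuting a cluster.

    `cluster` is a list of column indices. The columns are shuffled together
    across rows, so correlations within the cluster are preserved while the
    link between the cluster and the labels is broken.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    losses = []
    for _ in range(n_repeats):
        order = list(range(len(X)))
        rng.shuffle(order)
        X_perm = [list(row) for row in X]
        for i, source in enumerate(order):
            for col in cluster:
                X_perm[i][col] = X[source][col]
        losses.append(baseline - accuracy(predict, X_perm, y))
    return statistics.mean(losses), statistics.stdev(losses)
```

Permuting a whole cluster rather than a single feature avoids the situation in which a correlated neighbour compensates for the permuted feature and makes both appear unimportant.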