G Castellani1, S Capellari2,3, M Tarozzi4, A Bartoletti-Stella5,2, D Dall'Olio6, T Matteuzzi6, S Baiardi5,2, P Parchi5,2.
Abstract
BACKGROUND: Targeted Next Generation Sequencing is a common and powerful approach used in both clinical and research settings. However, at present, a large fraction of the acquired genetic information is not used, since pathogenicity cannot be assessed for most variants. Further complicating this scenario is the increasingly frequent description of a poly/oligogenic pattern of inheritance, in which multiple variants contribute to increased disease risk. We present an approach in which the entire genetic information provided by target sequencing is transformed into binary data, on which we performed statistical, machine learning, and network analyses to extract all valuable information from the entire genetic profile. To test this approach and explore without bias the presence of recurrent genetic patterns, we studied a cohort of 112 patients affected either by genetic Creutzfeldt-Jakob disease (CJD) caused by two mutations in the PRNP gene (p.E200K and p.V210I) with different penetrance, or by sporadic Alzheimer disease (sAD).
Keywords: Alzheimer’s Disease; CJD; Complex diseases; Gene panels; Genetic modifiers; Machine learning; NGS; Neurodegeneration; Polygenic score
Year: 2022 PMID: 35144616 PMCID: PMC8830183 DOI: 10.1186/s12920-022-01173-4
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 2D plot of the Principal Component Analysis (PCA) computed on the 1046 × 112 ternary matrix. PCA is a dimensionality reduction technique that computes an orthogonal linear transformation of the data to a new 2D coordinate system, so that the greatest variance lies on the x-axis (PC1) and the second greatest variance on the y-axis (PC2). Each dot represents a patient, plotted in the 2D space according to the genetic profile expressed in the ternary matrix. PC1 and PC2 show the main sources of variance in our data, accounting for 22% of the overall variance, and are represented by variants in the MAPT and NOTCH3 genes, respectively. The PCA plot and hierarchical clustering recognize clusters that correspond to the MAPT haplotypes on the x-axis, as shown by the coloured labels in the picture legend. Similarly, the distribution along the y-axis matches haplotypes in the NOTCH3 gene (not shown)
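As an illustration of the projection described in the Fig. 1 caption, the following is a minimal pure-Python sketch of PCA on a toy matrix. The 0/1/2 coding, the toy data (rows are patients here, columns are variants) and all function names are assumptions for illustration only; the source does not state how the ternary matrix is encoded or which software computed the PCA. Power iteration with deflation is used simply to keep the sketch free of dependencies.

```python
import math

# Toy ternary matrix: rows are patients, columns are variants.
# The 0/1/2 coding is an assumed encoding, not taken from the paper.
GENOTYPES = [
    [2, 2, 0, 0],
    [2, 1, 0, 1],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
    [0, 1, 2, 2],
]

def center_columns(rows):
    """Subtract each column mean so every variant has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix between columns (variants)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def leading_eigenpair(matrix, iterations=1000):
    """Power iteration on a symmetric positive semi-definite matrix."""
    v = [1.0 / (i + 1) for i in range(len(matrix))]
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v

def pca_2d(rows):
    """Return per-patient (PC1, PC2) scores, explained-variance ratios, axes."""
    centered = center_columns(rows)
    cov = covariance(centered)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    components, ratios = [], []
    for _ in range(2):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        ratios.append(value / total_variance)
        # Deflation: remove the variance explained by the axis just found.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, ratios, components

scores, ratios, components = pca_2d(GENOTYPES)
for patient, (pc1, pc2) in enumerate(scores):
    print(f"patient {patient}: PC1={pc1:+.2f} PC2={pc2:+.2f}")
print("explained variance ratios:", [round(r, 3) for r in ratios])
```

In this toy example the first three patients share variants that the last three lack, so the two groups fall on opposite sides of PC1, which is the same kind of separation the caption attributes to haplotype structure.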
Fig. 2 Dataset classification according to decision tree analysis: this supervised method computes a classification on the 1046 × 112 matrix based on the labels provided. The classifier correctly identifies the two disease groups on the basis of the two disease-causing mutations
Classification metrics.
| | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| sAD | 0.71 | 0.8 | 0.71 | 15 |
| gCJD | 0.85 | 0.77 | 0.85 | 22 |
Precision is the ratio of correctly predicted positive observations to all predicted positive observations (TruePositive / (TruePositive + FalsePositive)). Recall is the ratio of correctly predicted positive observations to all observations in the actual class (TruePositive / (TruePositive + FalseNegative)). F1 score is the harmonic mean of precision and recall (F1 = 2 × (Recall × Precision) / (Recall + Precision)). Support indicates the number of samples in each class
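The three definitions above can be sketched in a few lines of Python. The function name and the confusion counts in the usage example are hypothetical and chosen only to show the arithmetic; they are not the counts behind the table.

```python
def classification_metrics(true_positive, false_positive, false_negative):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    predicted_positive = true_positive + false_positive
    actual_positive = true_positive + false_negative
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration: 8 true positives,
# 2 false positives, 4 false negatives.
precision, recall, f1 = classification_metrics(8, 2, 4)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall and is pulled toward the lower of the two.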
Fig. 3 Result of decision tree analysis on the dataset deprived of the information about gCJD-causing mutations. Classification is accomplished with 0.71 accuracy for sAD and 0.85 for gCJD. Classification is based on the reported eight variants harboured in six genes. Four of these are variants of uncertain significance not reported in the GnomAD database: APP c.*1A > C (rs748508166), GRN c.1179 + 100A > T, DCTN1 p.Lys519Glu, and PRKAR1B c.595 + 369 T > C (rs1342588350). Two are rare variants (Minor Allele Frequency < 0.05) in the European population: APP p.Phe435 = (rs148180403, MAF = 0.001) and DCTN1 p.Ala816 = (rs1130484, MAF = 0.007). Two are common benign variants in CHCHD10 (c.261 + 99A > G) and GSN (c.666 + 53 T > C). “Value” indicates the number of samples at the given node that fall into each category. The “Gini” score quantifies the purity of the node/leaf: a score greater than zero implies that the samples within that node belong to different classes, while a score of zero means that the node contains samples of a single class only
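To make the “Gini” score concrete, here is a small sketch using the standard definition of Gini impurity, one minus the sum of squared class proportions at a node. The source does not say which software produced the tree, so the function name and the example counts are assumptions for illustration.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given its per-class sample counts ("Value")."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("node has no samples")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A pure node: all samples belong to one class.
print(gini_impurity([10, 0]))  # 0.0
# A maximally mixed two-class node.
print(gini_impurity([5, 5]))   # 0.5
```

For two classes the score ranges from 0 (pure node) to 0.5 (evenly mixed node), which matches the reading given in the caption.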
Summary of results of statistical analysis on each variant detected in our target sequencing panel
| Disease group | Average number of SNV per patient | Unique SNV per disease group | Unique non-synonymous SNV per disease group | Unique SNV p < 0.05 per disease group |
|---|---|---|---|---|
| AD (46) | 145.05 | 654 | 27 | 72 |
| CJD (66) | 134.87 | 768 | 11 | 33 |
| E200K (26) | 138.73 | 483 | 14 | 52 |
| V210I (40) | 135.73 | 645 | 27 | 75 |
Rows identify pathologic groups, with the number of patients in each group reported in parentheses. The first column shows the average number of variants carried per patient in the different disease groups. The second column shows the overall number of different variants detected in at least one patient of each group. The third column reports the number of variants annotated as missense, splice, or 3’ or 5’ UTR variants in each disease group. The last column contains the number of variants with p < 0.05 after Fisher’s exact test and Benjamini–Hochberg correction, regardless of their annotation
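The per-variant testing described in the last column can be sketched as follows: a two-sided Fisher’s exact test on a 2 × 2 table per variant, followed by Benjamini–Hochberg adjustment across variants. This is a dependency-free illustration, not the authors’ pipeline; the table layout (carriers and non-carriers in two groups), the variant names and the counts are all hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = table_probability(a)
    # Sum all tables at least as unlikely as the observed one.
    p_value = sum(
        table_probability(x)
        for x in range(lowest, highest + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical counts per variant, NOT data from the paper:
# (carriers in group A, non-carriers in A, carriers in group B, non-carriers in B)
tables = {
    "variant_1": (12, 34, 3, 63),
    "variant_2": (5, 41, 8, 58),
    "variant_3": (20, 26, 10, 56),
}
raw = [fisher_exact_two_sided(*counts) for counts in tables.values()]
for name, p, q in zip(tables, raw, benjamini_hochberg(raw)):
    print(f"{name}: p={p:.4f} adjusted={q:.4f} significant={q < 0.05}")
```

Adjusting across all tested variants controls the false discovery rate, which matters when hundreds of variants per group are tested as in the table above.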
Fig. 4 Result of functional enrichment analysis performed on genes harbouring variants with significantly altered allele frequency compared to the European population reported in the GnomAD database. Results of pathway analysis are reported as pathways significantly (p < 0.05) enriched in the first group but not in the second of each pairwise comparison. Since some of the affected pathways are shared among the considered conditions, results are reported as differences between comparisons of two groups. Complete results of the functional analysis with Gene Ontology and of the Protein–Protein Interaction networks are reported in the Supplementary materials