Patrick Deelen, Sipko van Dam, Johanna C Herkert, Juha M Karjalainen, Harm Brugge, Kristin M Abbott, Cleo C van Diemen, Paul A van der Zwaag, Erica H Gerkes, Evelien Zonneveld-Huijssoon, Jelkje J Boer-Bergsma, Pytrik Folkertsma, Tessa Gillett, K Joeri van der Velde, Roan Kanninga, Peter C van den Akker, Sabrina Z Jan, Edgar T Hoorntje, Wouter P Te Rijdt, Yvonne J Vos, Jan D H Jongbloed, Conny M A van Ravenswaaij-Arts, Richard Sinke, Birgit Sikkema-Raddatz, Wilhelmina S Kerstjens-Frederikse, Morris A Swertz, Lude Franke.
Abstract
The diagnostic yield of exome and genome sequencing remains low (8–70%), due to incomplete knowledge of the genes that cause disease. To improve this, we use RNA-seq data from 31,499 samples to predict which genes cause specific disease phenotypes, and develop GeneNetwork Assisted Diagnostic Optimization (GADO). We show that this unbiased method, which does not rely on specific knowledge of individual genes, is effective both in identifying previously unknown disease gene associations and in flagging genes that have previously been incorrectly implicated in disease. GADO can be run on www.genenetwork.nl by supplying HPO terms and a list of genes that contain candidate variants. Finally, applying GADO to a cohort of 61 patients for whom exome-sequencing analysis had not resulted in a genetic diagnosis yields likely causative genes for ten cases.
Year: 2019 PMID: 31253775 PMCID: PMC6599066 DOI: 10.1038/s41467-019-10649-4
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Schematic overview of GADO. a Per patient, GADO requires a set of phenotypic features (encoded using HPO terms) and a list of candidate genes (gene names entered as either HGNC symbols or Ensembl IDs). This gene list should contain genes in which rare variants have been observed for the patient. GADO then ascertains whether any of these genes have been predicted to cause the phenotypic features observed in the patient. These per-gene HPO phenotype predictions are based on observed co-regulation with sets of genes that are already known to be associated with these phenotypes. b Overview of how disease symptoms are predicted using gene expression data from 31,499 human RNA-seq samples. A principal component analysis on the co-expression matrix results in the identification of 1588 significant principal components. For each HPO term we investigate every component: per component we test whether there is a significant difference between the eigenvector coefficients of genes known to cause a specific phenotype and those of a background set of genes. This results in a matrix that indicates which principal components are informative for every HPO term. By correlating this matrix to the eigenvector coefficients of every individual gene, it is possible to infer the likely HPO disease phenotype term that would be the result of a pathogenic variant in that gene
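The prediction step in panel b can be illustrated with a small sketch. It is not the GADO implementation: it assumes Welch's t-statistic as the per-component test and Pearson correlation as the final score, and the gene names (`G1` to `G7`), the three-component coefficients, and the function names are all hypothetical toy values. Known genes are scored with their own label left out, so a gene cannot contribute to its own prediction.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for the difference in means (assumed test)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den if den > 0 else 0.0


def term_profile(coeffs, known_genes):
    """One statistic per component: known genes versus all other genes.

    Needs at least two known and two background genes.
    """
    n_components = len(next(iter(coeffs.values())))
    known = [g for g in coeffs if g in known_genes]
    background = [g for g in coeffs if g not in known_genes]
    return [
        welch_t([coeffs[g][c] for g in known],
                [coeffs[g][c] for g in background])
        for c in range(n_components)
    ]


def score_gene(coeffs, known_genes, gene):
    """Correlate a gene's coefficients with the term profile.

    The gene itself is removed from the known set first (leave-one-out).
    """
    profile = term_profile(coeffs, set(known_genes) - {gene})
    return pearson(coeffs[gene], profile)


# Hypothetical eigenvector coefficients (3 components) for 7 toy genes.
coeffs = {
    "G1": [2.0, 0.1, -0.2],
    "G2": [1.8, -0.1, 0.1],
    "G3": [2.2, 0.0, 0.0],
    "G4": [-1.0, 0.2, 0.1],
    "G5": [-1.2, -0.2, -0.1],
    "G6": [-0.8, 0.0, 0.2],
    "G7": [1.9, 0.05, -0.05],  # candidate that resembles the known genes
}
known = {"G1", "G2", "G3"}  # toy genes "known" to cause the phenotype

for gene in sorted(coeffs, key=lambda g: -score_gene(coeffs, known, g)):
    print(gene, round(score_gene(coeffs, known, gene), 3))
```

In this toy setting the unlabeled gene `G7` scores close to the known genes because its coefficients follow the same pattern across components, which is the intuition behind predicting a phenotype from co-regulation alone.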
Fig. 2 A compendium of gene expression profiles that can be used for gene function prediction. We downloaded 31,499 RNA-seq samples from ENA. These samples come from many different studies. They show coherent clustering after correcting for technical biases. Generally, samples originating from the same tissue, cell type or cell line cluster together. The two axes denote the two t-SNE components. For each tissue or cell type the number of samples is given, followed after the colon by the number of unique studies. This indicates that samples cluster by tissue or cell type, and that the clustering is not an artifact of all samples for a given tissue coming from a single laboratory
Gene function prediction accuracy
| Database | Number of gene sets | Gene sets ≥ 10 genes | Gene sets with significant predictive power | Median AUC |
|---|---|---|---|---|
| Reactome | 2,143 | 1,388 | 1,150 | 0.87 |
| GO molecular function | 4,070 | 726 | 398 | 0.82 |
| GO biological process | 11,753 | 2,576 | 1,115 | 0.82 |
| GO cellular component | 1,609 | 500 | 370 | 0.84 |
| KEGG | 186 | 186 | 168 | 0.84 |
| HPO | 7,920 | 3,281 | 1,887 | 0.73 |
Note: Gene co-expression information of 31,499 samples is used to predict gene functions. We show the prediction accuracy for gene sets from different databases
AUC area under the curve, GO gene ontology, HPO human phenotype ontology
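As a reminder of what the AUC column measures, the sketch below computes the AUC as the probability that a gene belonging to a gene set receives a higher prediction score than a gene outside it (the Mann-Whitney formulation, with ties counted as one half). The scores are hypothetical and the function is an illustration, not the code used to produce the table.

```python
def auc(member_scores, non_member_scores):
    """Fraction of (member, non-member) pairs where the member scores higher.

    Ties count as half a win; 0.5 means no predictive power, 1.0 is perfect.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical prediction scores for one gene set.
members = [3.1, 2.0, 0.4]
non_members = [1.0, 0.2, -0.5, 0.4]
print(auc(members, non_members))
```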
Fig. 3 Performance of disease gene prioritization compared to random permutation. a OMIM disease genes and provisional disease genes have significantly stronger prioritization Z-scores than permuted disease genes (T-test p-values: 2.16 × 10^-532 and 5.38 × 10^-80, respectively). We also observe that the predictions of the provisional OMIM genes are, on average, weaker than those of the other OMIM disease genes (T-test p-value: 1.89 × 10^-7). Because we use a leave-one-out strategy when calculating prioritization Z-scores for genes that have already been associated with an HPO term, there is no prediction bias towards known associations. Therefore, this benchmark is informative of the power to predict novel associations (see Methods). b We observe a significant relation (Spearman p-value: 1.01 × 10^-4) between the burden of evidence that a gene is associated with a disease and the GADO prioritization Z-score. Most genes are scored by ref. [13]; some additional refuted genes, denoted as squares or diamonds, are reported by ref. [8] and ref. [12]. c We observe a clear relation between the prioritization Z-scores and the gene predictability scores (Pearson r = 0.54). We do not observe this relation in the permuted results. d Our gene prioritization Z-scores are significantly correlated (Pearson p-value: 1.67 × 10^-23) with the number of likely pathogenic (LP) and pathogenic (P) variants reported for a gene in ClinVar
Fig. 4 Performance of GeneNetwork on solved cases. a Comparison between using GADO and Exomiser to rank candidate variants. b Our cohort contained a case with two distinct conditions, and clustering showed that the HPO terms of the same disease are closest to each other. Note that the HPO term “Inflammation of the large intestine” did not yield a significant prediction profile and therefore the parent terms “Abnormality of the large intestine”, “Increased inflammatory response” and “Functional abnormality of the gastrointestinal tract” were used for this case
Unsolved cases with new candidate genes
| HPO terms used | Number of genes with candidate variant | Number of genes with Z ≥ 5 | Candidate gene | Variants | CADD scores | GnomAD minor allele frequency | Supporting papers | Expression in relevant tissue |
|---|---|---|---|---|---|---|---|---|
| HP:0001644 | 215 | 5 | | NM_001098623.2:c.[15037C>T];[20963delC] | 24.8; 25.2 | 8.0 × 10^-5; 1.7 × 10^-3 | [ | Yes |
| HP:0001644 | 226 | 3 | | NM_001098623.2:c.[5545C>T];[22384+3_22384+21del] | 14.7; 7.8 | 3.2 × 10^-4; 0 | [ | Yes |
| HP:0008066 HP:0008064 | 359 | 3 | | NM_001014342.2:c.[632C>G];[632C>G] | 35.0 | 1.1 × 10^-5 | [ | Yes |
| HP:0001263 HP:0001249 HP:0000717 HP:0000708 HP:0002167 HP:0002360 HP:0000664 | 206 | 12 | | NM_017553.2:c.[898C>T] | 34 | 0 | [ | Yes |
| HP:0001644 | 120^a | 2 | MB | NM_00203377.1:c.[214G>A] | 22.4 | 3.6 × 10^-5 | [ | Yes |
| HP:0001644 | 120^a | 1 | | NM_001114133.2:c.[473G>A] | 24.1 | 5.4 × 10^-4 | [ | Yes |
| HP:0001638 | 292 | 4 | | NM_001261463.1:c.[4648C>T] | 20.4 | 8.7 × 10^-4 | [ | Yes |
| HP:0004322 HP:0001249 | 381 | 10 | | NM_004701.3:c.25-3_25delCAGG | 24.5 | 0 | [ | Yes |
| HP:0003493 HP:0002583 | 246 | 6 | | NM_002349.2:c.3476C>T(;)23C>G | 22.7; 24.1 | 3.2 × 10^-3; 2.6 × 10^-3 | [ | Yes |
| HP:0012649 HP:0002583 HP:0001890 | 318 | 8 | | NM_001122772.1:c.421delC | 27.2 | 0 | [ | Yes |
Note: In 10 out of 61 unsolved patients we identified likely causative genes that were previously unknown. For these genes we either found literature indicating that they fit the phenotype of these patients or gained functional evidence implicating their disease relevance. HP:0001644 = Dilated cardiomyopathy; HP:0008066 = Abnormal blistering of the skin; HP:0008064 = Ichthyosis; HP:0001263 = Global developmental delay; HP:0001249 = Intellectual disability; HP:0000717 = Autism; HP:0000708 = Behavioral abnormality; HP:0002167 = Neurological speech impairment; HP:0002360 = Sleep disturbance; HP:0000664 = Synophrys; HP:0001638 = Cardiomyopathy; HP:0004322 = Short stature; HP:0003493 = Antinuclear antibody positivity; HP:0002583 = Colitis; HP:0012649 = Increased inflammatory response; HP:0001890 = Autoimmune hemolytic anemia
^a These variants were pre-filtered for family segregation
^b The variants in these genes do not fully explain the phenotype but are likely contributing to the phenotype
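The "Number of genes with Z ≥ 5" column can be read as a filter on a per-patient ranking. The sketch below shows one way such a ranking could be formed when a patient has several HPO terms. It assumes the per-term Z-scores are combined as the sum divided by the square root of the number of terms; this combination rule, the gene names (`GENE_A` to `GENE_C`), and all Z-scores are hypothetical and chosen only for illustration.

```python
import math


def combined_z(z_by_term):
    """Combine per-HPO-term Z-scores into one score (assumed rule)."""
    zs = list(z_by_term.values())
    return sum(zs) / math.sqrt(len(zs))


def prioritize(candidates, threshold=5.0):
    """Rank candidate genes by combined Z and keep those at or above threshold."""
    ranked = sorted(
        ((gene, combined_z(z)) for gene, z in candidates.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(gene, z) for gene, z in ranked if z >= threshold]


# Hypothetical per-term Z-scores for three genes carrying candidate variants.
candidates = {
    "GENE_A": {"HP:0004322": 4.0, "HP:0001249": 5.0},
    "GENE_B": {"HP:0004322": 1.0, "HP:0001249": 0.5},
    "GENE_C": {"HP:0004322": 3.0, "HP:0001249": 3.0},
}

hits = prioritize(candidates)
for gene, z in hits:
    print(gene, round(z, 2))
```

Only genes passing the threshold would then be taken forward for manual review of the variants, literature and tissue expression, as in the table above.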
Fig. 5 a Prioritization results of one of our previously solved cases (www.genenetwork.nl). This patient was diagnosed with Kleefstra syndrome. The patient showed only a few of the phenotypic features associated with Kleefstra syndrome and additionally had a neoplasm of the pituitary (which is not associated with Kleefstra syndrome). Despite this limited overlap in phenotypic features, GADO was able to rank the causative gene (EHMT1) second. Here, we also show the value of the HPO clustering heatmap: the two terms related to the neoplasm cluster separately from the intellectual disability and the facial abnormalities that are associated with Kleefstra syndrome. b Clustering of a set of genes, allowing function/HPO enrichment of all genes or specific enrichment of automatically defined sub-clusters. Here, we loaded all known DCM genes and OBSCN, and we focus on a sub-cluster of genes containing OBSCN (highlighted by the arrow). We see that it is strongly co-regulated with many of the known DCM genes. Pathway enrichment of this sub-cluster reveals that these genes are most strongly enriched for the muscle contraction Reactome pathway. DCM, Dilated Cardiomyopathy