Pilar Cacheiro1, Violeta Muñoz-Fuentes2, Stephen A Murray3, Mary E Dickinson4,5, Maja Bucan6, Lauryl M J Nutter7, Kevin A Peterson3, Hamed Haselimashhadi2, Ann M Flenniken8, Hugh Morgan9, Henrik Westerberg9, Tomasz Konopka1, Chih-Wei Hsu4, Audrey Christiansen4, Denise G Lanza5, Arthur L Beaudet5, Jason D Heaney5, Helmut Fuchs10, Valerie Gailus-Durner10, Tania Sorg11, Jan Prochazka12, Vendula Novosadova12, Christopher J Lelliott13, Hannah Wardle-Jones13, Sara Wells9, Lydia Teboul9, Heather Cater9, Michelle Stewart9, Tertius Hough9, Wolfgang Wurst14,15,16, Radislav Sedlacek12, David J Adams13, John R Seavitt5, Glauco Tocchini-Valentini17, Fabio Mammano17, Robert E Braun3, Colin McKerlie7,18, Yann Herault19, Martin Hrabě de Angelis10,20,21, Ann-Marie Mallon9, K C Kent Lloyd22, Steve D M Brown9, Helen Parkinson2, Terrence F Meehan2, Damian Smedley23.
Abstract
The identification of causal variants in sequencing studies remains a considerable challenge that can be partially addressed by new gene-specific knowledge. Here, we integrate measures of how essential a gene is to supporting life, as inferred from viability and phenotyping screens performed on knockout mice by the International Mouse Phenotyping Consortium and essentiality screens carried out on human cell lines. We propose a cross-species gene classification across the Full Spectrum of Intolerance to Loss-of-function (FUSIL) and demonstrate that genes in five mutually exclusive FUSIL categories have differing biological properties. Most notably, Mendelian disease genes, particularly those associated with developmental disorders, are highly overrepresented among genes non-essential for cell survival but required for organism development. After screening developmental disorder cases from three independent disease sequencing consortia, we identify potentially pathogenic variants in genes not previously associated with rare diseases. We therefore propose FUSIL as an efficient approach for disease gene discovery.
Year: 2020 PMID: 32005800 PMCID: PMC6994715 DOI: 10.1038/s41467-020-14284-2
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Table 1 FUSIL categories.
| Mouse category | Human cell line category | Number of genes | % Overlap | FUSIL category |
|---|---|---|---|---|
| Lethal | Essential | 413 | 35.09% | Cellular lethal (CL) |
| Lethal | Non-essential | 764 | 64.91% | Developmental lethal (DL) |
| Subviable | Essential | 16 | 3.66% | — |
| Subviable | Non-essential | 421 | 96.34% | Subviable (SV) |
| Viable with phenotypic abnormalities | Essential | 18 | 0.95% | — |
| Viable with phenotypic abnormalities | Non-essential | 1867 | 99.05% | Viable with phenotype (VP) |
| Viable with normal phenotype | Essential | 2 | 0.62% | — |
| Viable with normal phenotype | Non-essential | 318 | 99.38% | Viable with no phenotype (VN) |
Integration of data from human cell essentiality screens from the Avana data set and mouse phenotypes from IMPC screens for 4446 protein-coding genes that have data in both resources and a high-quality orthologue. This defined five mutually exclusive categories of intolerance to loss of function and the number of human protein-coding genes is shown for each. For 627 of the viable mouse lines, the number of procedures with QCed data available was <50% and thus they were classified as Viable with insufficient procedures (see “Methods”, Supplementary Table 1) and not incorporated into these FUSIL categories. The Viable with phenotype (VP) category indicates that the phenotypes of the knockout (loss of function) mouse line differ significantly from the wild-type mice in at least one of the many parameters measured as part of the IMPC phenotyping pipeline (average of 163 parameters measured on any given mouse).
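The category assignment in Table 1 can be illustrated with a short Python sketch. This is not the authors' pipeline: the function and variable names are hypothetical, and the only inputs are the category labels and gene counts printed in the table. It maps each (mouse, human cell line) pair to its FUSIL bin and recomputes the "% Overlap" column as the share of genes within each mouse category.

```python
from typing import Optional

# (mouse category, human cell line category) -> FUSIL bin, as listed in Table 1.
# Pairs shown with a dash in the table have no FUSIL bin and are simply absent here.
FUSIL_BY_PAIR = {
    ("Lethal", "Essential"): "CL",
    ("Lethal", "Non-essential"): "DL",
    ("Subviable", "Non-essential"): "SV",
    ("Viable with phenotypic abnormalities", "Non-essential"): "VP",
    ("Viable with normal phenotype", "Non-essential"): "VN",
}

# Gene counts copied from Table 1.
GENE_COUNTS = {
    ("Lethal", "Essential"): 413,
    ("Lethal", "Non-essential"): 764,
    ("Subviable", "Essential"): 16,
    ("Subviable", "Non-essential"): 421,
    ("Viable with phenotypic abnormalities", "Essential"): 18,
    ("Viable with phenotypic abnormalities", "Non-essential"): 1867,
    ("Viable with normal phenotype", "Essential"): 2,
    ("Viable with normal phenotype", "Non-essential"): 318,
}


def fusil_category(mouse: str, cell: str) -> Optional[str]:
    """Return the FUSIL bin for a category pair, or None if the table assigns none."""
    return FUSIL_BY_PAIR.get((mouse, cell))


def percent_overlap(mouse: str, cell: str) -> float:
    """Share (in %) of the genes in one mouse category that fall in one cell category."""
    total = sum(n for (m, _), n in GENE_COUNTS.items() if m == mouse)
    return round(100 * GENE_COUNTS[(mouse, cell)] / total, 2)


if __name__ == "__main__":
    print(fusil_category("Lethal", "Non-essential"))   # DL
    print(percent_overlap("Lethal", "Essential"))      # 35.09
    # The table rows plus the 627 lines with insufficient procedures give 4446 genes.
    print(sum(GENE_COUNTS.values()) + 627)             # 4446
```

The row counts sum to 3819; adding the 627 viable lines classified as having insufficient procedures recovers the 4446 genes mentioned in the table note.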
Fig. 1 Cross-species FUSIL categories of intolerance to LoF.
a Correspondence between primary viability outcomes in mice and human cell line screens. The Sankey diagram shows how human orthologues of mouse genes with IMPC primary viability assessment (lethal, subviable and viable) regroup into essential and non-essential human cell categories; the width of the bands is proportional to the number of genes. b Gene Ontology Biological Process (GO BP) enrichment results. Significantly enriched GO terms at the biological process level were computed using the set of IMPC mouse-to-human orthologues incorporated into the FUSIL categories as a reference (Table 1) and identified after correcting for multiple comparisons. Significant results were only found for the cellular and developmental lethal gene categories. Bubble size is proportional to the frequency of the term in the database and the colour indicates significance level as obtained in the enrichment analysis. The GO terms associated with embryo development are in bold. c Correspondence between mouse embryonic lethality stage and essentiality in human cell lines. Embryonic lethal LoF strains are assessed for viability at selected stages during embryonic development: early (gestation) lethal (prior to E9.5), mid (gestation) lethal (E9.5–E14.5/15.5), late (gestation) lethal (E14.5/E15.5 onwards). E, embryonic day.
Fig. 2 FUSIL categories and human gene features.
a Notched box plots showing the distribution of recombination rates for the different FUSIL bins. Human recombination rates[58] were mapped to the closest gene and average recombination rates per gene were computed. b Distribution of human gene expression values for different tissues. Median logTPM expression values from the GTEx database for selected non-correlated tissues are shown. c Protein–protein interaction network parameters. Notched box plots showing the distribution of degree and topological coefficient computed from human protein–protein interaction data extracted from STRING. Only high-confidence interactions, defined as those with a combined score of >0.7, were kept. d Protein complexes. Bar plots representing the percentage of genes in each FUSIL bin that are part of a protein complex (human protein complexes). e Paralogues. The bar plot shows the percentage of genes without a protein-coding paralogue gene in each FUSIL bin. Paralogues of human genes were obtained from Ensembl Genes 95. A cut-off of 30% amino acid similarity was used. f Probability of mutation. Distribution of gene-specific probabilities of mutation from Samocha et al.[65]. g Transcript length. Maximum transcript lengths among all the associated gene transcripts (Ensembl Genes 95, hsapiens data set). h GIMS Selection Score. Distribution of Gene-level Integrated Metric of negative Selection (GIMS)[66] scores across the different FUSIL bins. i Probability of loss-of-function intolerance (pLI) retrieved from gnomAD 2.1. Notched box plots and density plots showing the bimodal distribution of this score, with higher values indicating more intolerance to variation. j Distribution of gnomAD o/e LoF scores. Upper bound fraction of the confidence interval around the observed versus expected LoF score ratio (gnomAD 2.1). A score <0.35 (dashed line) has been suggested to identify genes intolerant to LoF variation[56].
For a–c and f–j: centre line, median; notch, CI around the median; box edges, interquartile range, 75th and 25th percentile, respectively; whiskers, 1.5 times the interquartile range; outliers not shown. Significance for pairwise comparisons for all features is shown in Supplementary Tables 4 and 5. CL cellular lethal (pink), DL developmental lethal (orange), SV subviable (yellow), VP viable with phenotypic abnormalities (light blue), VN viable with normal phenotype (dark blue).
Fig. 3 Human disease genes and FUSIL bins.
a Enrichment analysis of Mendelian disease genes. Combined OMIM-ORPHANET data was used to compute the number of disease genes in each FUSIL bin. Odds ratios were calculated by unconditional maximum likelihood estimation (Wald) and confidence intervals (CIs) using the normal approximation, with the corresponding adjusted P values for Fisher’s exact test. b Distribution of disease-associated genes according to mode of inheritance. Disease genes with annotations regarding the mode of inheritance according to the Human Phenotype Ontology[8]. c Haploinsufficient genes. Known haploinsufficient genes curated by ClinGen (percentage with respect to the total number of disease genes in each bin). d Age of onset as described in rare diseases epidemiological data from Orphanet (Orphadata). The earliest age of onset associated with each gene was used. Bar plots representing the percentage of disease genes associated with each age of onset for each FUSIL category. e Distribution of the number of physiological systems affected. The phenotypes (HPO) associated with each gene were mapped to the top level of the ontology to compute the number of unique physiological systems affected. f Enrichment analysis of developmental disorder genes. The Developmental Disorders Genotype-Phenotype Database (DDD-DDG2P) set of genes was used to compute the number of developmental disorder genes in each FUSIL bin. These genes were compared against non-disease genes (OMIM, ORPHANET and DDD-DDG2P). Odds ratios were calculated by unconditional maximum likelihood estimation (Wald) and confidence intervals (CIs) using the normal approximation, with the corresponding adjusted P values for Fisher’s exact test. g Distribution of disease genes. Percentage distribution of Mendelian and developmental disorder genes among the different FUSIL categories. h Distribution of disease genes by mode of inheritance. Percentage distribution of Mendelian and developmental disorder genes among the different FUSIL categories according to the mode of inheritance reported in the HPO (set of Mendelian disease genes) and DDD (developmental disease-associated genes). CL cellular lethal (pink), DL developmental lethal (orange), SV subviable (yellow), VP viable with phenotypic abnormalities (light blue), VN viable with normal phenotype (dark blue), DDD/DDD-DDG2P Deciphering Developmental Disorders database of genes that are likely causative of developmental disorders. For e, centre line, median; notch, CI around the median; box edges, interquartile range, 75th and 25th percentile, respectively; whiskers, 1.5 times the interquartile range.
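The odds ratio estimate described for panels a and f of Fig. 3 can be sketched in Python. This is a minimal illustration of the standard unconditional maximum likelihood (Wald) odds ratio with a normal-approximation confidence interval, not the authors' code. The function name is hypothetical and the counts in the example are invented purely for illustration; they are not data from the study. The adjusted Fisher's exact test P values mentioned in the caption are not computed here.

```python
import math


def odds_ratio_wald(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: disease genes inside the FUSIL bin
    b: non-disease genes inside the bin
    c: disease genes outside the bin
    d: non-disease genes outside the bin
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald estimate")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the normal approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    estimate, lower, upper = odds_ratio_wald(a=20, b=10, c=10, d=20)
    print(f"OR = {estimate:.2f}, CI = ({lower:.2f}, {upper:.2f})")
```

Because the interval is built on the log scale, it is symmetric around log(OR) rather than around the odds ratio itself, so the upper limit lies further from the estimate than the lower limit.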
Fig. 4 Developmental disorders gene candidate prioritisation.
a Venn diagram showing the overlap between DL prioritised genes with evidence from three large-scale sequencing programmes. Overlap between the set of 163 developmental genes highly intolerant to LoF variation (pLI > 0.90 or o/e LoF upper bound < 0.35 or HI < 10) and not yet associated with disease and the set of candidate genes from three large rare disease sequencing consortia: 100KGP, CMG, and DDD. b Set of nine candidate genes. The selected genes met the following criteria: (1) evidence from both the 100KGP (with detailed clinical phenotypes and variants) and either DDD (variants and high-level phenotypes available) or CMG (gene and high-level phenotypes available), (2) the associated variants were not present in gnomAD, and (3) intolerance to missense variation; these genes were further prioritised based on the number of unrelated probands and the phenotypic similarity between them and the existence of a mouse knockout line with embryo and adult phenotypes that mimic the clinical phenotypes. c Mouse evidence for VPS4A. IMPC embryonic phenotyping of homozygous mutants at E18.5 showed abnormal/curved spine and abnormal brain among other relevant phenotypes. The phenotypic abnormalities observed in heterozygous knockout mice include lens opacity. Heterozygous mouse phenotypic similarity to known disorders as computed by the PhenoDigm algorithm. d Mouse evidence for TMEM63B. IMPC homozygous mouse embryo lacZ imaging at E14.5 supporting neuronal expression during development. Heterozygous IMPC knockout mice associated phenotypes included abnormal behaviour evaluated through different parameters. The heterozygous mice showed a high phenotypic similarity with several developmental disorder phenotypes. VUS variant of unknown significance.
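The first filtering step in panel a of Fig. 4 can be sketched as follows. The thresholds (pLI > 0.90, o/e LoF upper bound < 0.35, HI < 10) and the restriction to DL genes not yet associated with disease come from the caption. Everything else is an assumption made for illustration: the `GeneScores` record, the function names, the treatment of missing scores as failing a criterion, and the placeholder gene symbols and score values, none of which are data from the study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GeneScores:
    """Hypothetical per-gene record; field names are illustrative."""
    symbol: str
    fusil: str                      # FUSIL bin: "CL", "DL", "SV", "VP" or "VN"
    pli: Optional[float]            # gnomAD pLI
    oe_lof_upper: Optional[float]   # upper bound of the o/e LoF confidence interval
    hi_score: Optional[float]       # HI score as used in the caption (HI < 10)
    known_disease_gene: bool


def is_lof_intolerant(gene: GeneScores) -> bool:
    """True if any one of the three caption thresholds is met.

    Assumption: a missing score (None) counts as not meeting that criterion.
    """
    return (
        (gene.pli is not None and gene.pli > 0.90)
        or (gene.oe_lof_upper is not None and gene.oe_lof_upper < 0.35)
        or (gene.hi_score is not None and gene.hi_score < 10)
    )


def prioritise(genes: List[GeneScores]) -> List[str]:
    """Symbols of DL genes that are LoF intolerant and not yet linked to disease."""
    return [
        g.symbol
        for g in genes
        if g.fusil == "DL" and not g.known_disease_gene and is_lof_intolerant(g)
    ]


if __name__ == "__main__":
    # Placeholder genes with invented scores, for illustration only.
    example = [
        GeneScores("GENE_A", "DL", 0.99, 0.20, 5.0, False),   # passes on all scores
        GeneScores("GENE_B", "DL", 0.10, 0.80, 40.0, False),  # tolerant: excluded
        GeneScores("GENE_C", "DL", None, 0.30, None, False),  # passes on o/e only
        GeneScores("GENE_D", "DL", 0.95, 0.25, 3.0, True),    # known disease gene
        GeneScores("GENE_E", "VP", 0.99, 0.10, 2.0, False),   # not in the DL bin
    ]
    print(prioritise(example))
```

The criteria are combined with a logical OR, following the caption's wording, so a gene needs to satisfy only one of the three intolerance thresholds to be retained.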