Friederike Ehrhart, Egon L Willighagen, Martina Kutmon, Max van Hoften, Leopold M G Curfs, Chris T Evelo.
Abstract
Here, we describe a dataset with information about monogenic, rare diseases with a known genetic background, supplemented with manually extracted provenance for the disease itself and the discovery of the underlying genetic cause. We assembled a collection of 4166 rare monogenic diseases and linked them to 3163 causative genes, annotated with OMIM and Ensembl identifiers and HGNC symbols. The PubMed identifiers of the scientific publications that first described the rare diseases, and of the publications that identified the genes causing them, were added using information from OMIM, PubMed, Wikipedia, whonamedit.com, and Google Scholar. The data are available under a CC0 license as a spreadsheet and as RDF in a semantic model modified from DisGeNET, and were added to Wikidata. This dataset relies on publicly available data and publications with a PubMed identifier, but our effort to make the data interoperable and linked means that they can now be analysed. Our analysis revealed the timeline of rare disease and causative gene discovery and linked it to developments in methods.
Year: 2021 PMID: 33947870 PMCID: PMC8096966 DOI: 10.1038/s41597-021-00905-y
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Workflow of information acquisition, dataset creation, and downstream analysis.
The spreadsheet data content and structure.
| Ensembl gene ID (gene information) | HGNC symbol (gene information) | PMID, gene-disease association (provenance) | OMIM disease ID (disease information) | Disease name (disease information) | PMID, disease (disease information) |
|---|---|---|---|---|---|
| ENSG00000188536 | HBA2 | 1115799 | 604131 | Thalassemia, alpha- | 1115799 |
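A minimal sketch of how rows with this structure could be read, assuming the spreadsheet is exported as CSV. The column names (`ensembl_gene_id`, `pmid_gene_disease`, and so on) and the helper functions are hypothetical labels chosen for illustration; the published file may name its columns differently. The single sample row is the one shown in the table.

```python
import csv
import io

# Column names are assumptions derived from the table header above;
# the published spreadsheet may label them differently.
SAMPLE = (
    "ensembl_gene_id,hgnc_symbol,pmid_gene_disease,"
    "omim_disease_id,disease_name,pmid_disease\n"
    'ENSG00000188536,HBA2,1115799,604131,"Thalassemia, alpha-",1115799\n'
)


def load_rows(text):
    """Parse CSV text into a list of dicts, one per gene-disease association."""
    return list(csv.DictReader(io.StringIO(text)))


def same_paper(row):
    """True when the disease and the gene-disease link share one PMID."""
    return row["pmid_gene_disease"] == row["pmid_disease"]


rows = load_rows(SAMPLE)
for row in rows:
    print(row["hgnc_symbol"], row["disease_name"], same_paper(row))
```

Identifiers are kept as strings here so that they can be compared and joined against other resources without losing formatting.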
Fig. 2 (A) Number of first descriptions of rare diseases per year. (B) Number of publications per year describing a new disease (orange bars) and identifying a new gene-disease link (blue bars). The black dots indicate the (rolling) median number of years the diseases had been known before the causative gene was identified in that year. The data are displayed from 1984 onwards, the year from which more than four genes were consistently discovered per year. (C) Current total citation counts for the gene-disease relationship papers shown as blue bars in panel B, plotted by year of publication. One dot represents one publication from our dataset.
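The lag shown by the black dots in panel B can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: the year pairs are invented toy values, and the trailing window used for the rolling variant is one possible reading of "(rolling)", since the caption does not specify the window.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (year disease first described, year gene identified) pairs,
# invented for illustration only; they are not taken from the dataset.
pairs = [(1950, 1990), (1980, 1990), (1985, 1991), (1991, 1991), (1960, 1992)]


def lags_by_gene_year(pairs):
    """Group the discovery lag (in years) by the year the gene was identified."""
    lags = defaultdict(list)
    for disease_year, gene_year in pairs:
        lags[gene_year].append(gene_year - disease_year)
    return lags


def median_lag(pairs):
    """Median lag for each gene-discovery year."""
    return {y: median(v) for y, v in sorted(lags_by_gene_year(pairs).items())}


def rolling_median_lag(pairs, window=3):
    """Median lag pooled over a trailing window of `window` years (assumed)."""
    lags = lags_by_gene_year(pairs)
    result = {}
    for year in sorted(lags):
        pooled = []
        for y in range(year - window + 1, year + 1):
            pooled.extend(lags.get(y, []))
        result[year] = median(pooled)
    return result


print(median_lag(pairs))
print(rolling_median_lag(pairs))
```

Pooling the individual lags across the window, rather than taking a median of yearly medians, keeps years with many discoveries from being weighted the same as years with few.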
Journals in which newly discovered diseases or the genes causing a rare disease were published.
| journal | count | % |
|---|---|---|
| American Journal of Human Genetics | 457 | 14.9 |
| Nature Genetics | 145 | 4.7 |
| American Journal of Medical Genetics Part A | 137 | 4.5 |
| Journal of Medical Genetics | 136 | 4.4 |
| Human Molecular Genetics | 122 | 4.0 |
| The New England Journal of Medicine | 121 | 3.9 |
| Journal of Clinical Investigation | 78 | 2.5 |
| Neurology | 77 | 2.5 |
| The Journal of Pediatrics | 68 | 2.2 |
| Proceedings of the National Academy of Sciences of the United States of America | 59 | 1.9 |
| American Journal of Human Genetics | 1137 | 26.7 |
| Nature Genetics | 786 | 18.4 |
| Human Molecular Genetics | 266 | 6.2 |
| Journal of Medical Genetics | 175 | 4.1 |
| The New England Journal of Medicine | 148 | 3.5 |
| Journal of Clinical Investigation | 136 | 3.2 |
| Proceedings of the National Academy of Sciences of the United States of America | 129 | 3.0 |
| Science | 121 | 2.8 |
| Nature | 99 | 2.3 |
| Human Mutation | 94 | 2.2 |
Top ten authors with the most gene discoveries in this dataset.
| Author name | count |
|---|---|
| Arnold Munnich | 58 |
| Gudrun Nürnberg | 44 |
| Peter Nürnberg | 38 |
| Thomas Meitinger | 35 |
| Nicholas Katsanis | 32 |
| Friedhelm Hildebrandt | 31 |
| Jean-Laurent Casanova | 31 |
| Alexis Brice | 30 |
| Edgar A Otto | 29 |
| Bruno Dallapiccola | 28 |
Fig. 3 (A) Network of gene-rare disease relationships. Blue nodes are genes (HGNC symbols), orange nodes are diseases (OMIM disease names). (B) The Rett syndrome causing genes pathway from WikiPathways (https://www.wikipathways.org/instance/WP4312) was imported as a network into Cytoscape using the WikiPathways app. Using the CyTargetLinker app, the MECP2 network was extended to predict and visualize the overlap with pathway genes that cause other rare diseases, as provided by the gene-RD-Provenance_V2 linkset. The expression data were taken from Miller et al.[29]; the data were originally produced by Lin et al.[30]. (C) Timeline of rare disease superclass descriptions in blocks of 20 years. The numbers are normalized to percentages of the maximum number of diseases discovered in each disease superclass (Table 4).
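The normalization described for panel C can be sketched as below. The caption leaves the denominator open to two readings, so the sketch supports both: the total number of diseases in a superclass (`"total"`) or the count in its busiest 20-year block (`"peak"`). The block origin of 1900, the function names, and the toy records are assumptions for illustration and do not come from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (MeSH superclass, year of first description) records,
# invented for illustration only.
records = [
    ("eye diseases", 1955),
    ("eye diseases", 1961),
    ("eye diseases", 1975),
    ("eye diseases", 1985),
    ("cardiovascular diseases", 1999),
]


def block_start(year, size=20, origin=1900):
    """First year of the `size`-year block containing `year` (origin assumed)."""
    return origin + ((year - origin) // size) * size


def normalised_timeline(records, denominator="total"):
    """Percentage of each superclass's diseases falling in each block."""
    counts = defaultdict(Counter)
    for superclass, year in records:
        counts[superclass][block_start(year)] += 1
    result = {}
    for superclass, blocks in counts.items():
        if denominator == "total":
            base = sum(blocks.values())
        else:  # "peak": the busiest block of this superclass
            base = max(blocks.values())
        result[superclass] = {b: 100 * n / base for b, n in sorted(blocks.items())}
    return result


print(normalised_timeline(records))
print(normalised_timeline(records, denominator="peak"))
```

Normalizing within each superclass makes the timelines comparable across classes of very different size, at the cost of hiding the absolute counts.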
Top 10 MeSH superclasses for the rare diseases.
| Top 10 MeSH terms | Count |
|---|---|
| congenital, hereditary, and neonatal diseases and abnormalities | 1705 |
| nervous system diseases | 945 |
| musculoskeletal diseases | 595 |
| nutritional and metabolic diseases | 549 |
| pathological conditions, signs and symptoms | 415 |
| eye diseases | 328 |
| skin and connective tissue diseases | 296 |
| cardiovascular diseases | 203 |
| hemic and lymphatic diseases | 182 |
| female genital diseases and pregnancy complications | 174 |
| Field | Value |
|---|---|
| Measurement(s) | Gene_Associated_With_Disease • genetic disorder |
| Technology Type(s) | digital curation |
| Factor Type(s) | disease |
| Sample Characteristic - Organism | Homo sapiens |