Charlotte Nachtegael1,2, Barbara Gravel1,2,3, Arnau Dillen3,4, Guillaume Smits1,5,6, Ann Nowé1,3, Sofia Papadimitriou1,2,3, Tom Lenaerts1,2,3.
Abstract
Improving the understanding of the oligogenic nature of diseases requires access to high-quality, well-curated Findable, Accessible, Interoperable, Reusable (FAIR) data. Although first steps were taken with the development of the Digenic Diseases Database, leading to novel computational advancements to assist the field, these were also linked with a number of limitations, for instance, the ad hoc curation protocol and the inclusion of only digenic cases. The OLIgogenic diseases DAtabase (OLIDA) presents a novel, transparent and rigorous curation protocol, introducing a confidence scoring mechanism for the published oligogenic literature. The application of this protocol on the oligogenic literature generated a new repository containing 916 oligogenic variant combinations linked to 159 distinct diseases. Information extracted from the scientific literature is supplemented with current knowledge support obtained from public databases. Each entry is an oligogenic combination linked to a disease, labelled with a confidence score based on the level of genetic and functional evidence that supports its involvement in this disease. These scores allow users to assess the relevance and proof of pathogenicity of each oligogenic combination in the database, constituting markers for reporting improvements on disease-causing oligogenic variant combinations. OLIDA follows the FAIR principles, providing detailed documentation, easy data access through its application programming interface and website, use of unique identifiers and links to existing ontologies. DATABASE URL: https://olida.ibsquare.be
Year: 2022 PMID: 35411390 PMCID: PMC9216476 DOI: 10.1093/database/baac023
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 4.462
Figure 1. Summary of the curation pipeline for the creation of the (A) Manual scores, (B) Knowledge scores and (C) Metascores for each oligogenic combination. (A) Research articles were selected using the keywords ‘digenic OR oligogenic’ in PubMed (https://pubmed.ncbi.nlm.nih.gov/), reducing a total of 1501 articles to 262 articles after filtering (see Materials and Methods, Supplementary Data, File 1). Two different curators independently extracted information and curated each article, assigning Manual scores. These scores include the FAMmanual, STATmanual, GENEmanual and VARmanual scores. The last two are used in a decision tree to assign the FUNmanual score, while all of the scores are used in another decision tree to create the FINALmanual score (see Materials and Methods, Supplementary Data, File 1). A discussion then took place to reach a consensus for the Manual scores. All oligogenic variant combinations were evaluated as separate entities with their own evidence, regardless of whether they were described in the same or different articles. (B) The data were then processed to formalize the available information (see Materials and Methods, Supplementary Data, File 1). The disease names were formalized using the Orphanet database (https://www.orpha.net/). The gene names were formalized according to the gene nomenclature guidelines from the HGNC database (33). The variants were processed by the software Synvar (http://goldorak.hesge.ch/synvar/), and the databases Varsome (34) and dbSNP (35), to obtain genomic coordinates. To correct for the literature bias, the GENEmanual score for each gene pair was harmonized among all articles by assigning the maximum GENEmanual found for that gene pair, and all affected scores (FUNmanual and FINALmanual) were recalculated.
To compensate for information missing from the articles because current knowledge was not available when they were written, Knowledge scores were assigned per oligogenic combination: the STATknowledge score by checking the presence of the oligogenic combination in the 1000 Genomes project (30) and ClinVar (36), the GENEknowledge score by checking the PPI and KEGG (38) or Reactome (39) pathway links of the involved genes and the VARknowledge score by using variant pathogenicity information from different pathogenicity predictors: SIFT (40), MutationTaster2 (41), CADD (42) and Polyphen2 (43). (C) Finally, the Manual and Knowledge scores are combined to create the confidence Metascores for each type of evidence (the STATmeta, GENEmeta and VARmeta scores) by assigning the maximum of the corresponding Manual and Knowledge scores (see Materials and Methods, Supplementary Data, File 1). There is one exception to this rule: if the STATknowledge is 0 because the combination is found in an individual of the 1000 Genomes project, then it replaces the STATmanual and, therefore, the STATmeta is also 0. As in the manual curation, decision trees are then used to define the FUNmeta and FINALmeta scores.
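The harmonization step in (B) and the Metascore rule in (C) can be sketched in Python. This is only an illustration under stated assumptions, not OLIDA's own code: the function names, the dictionary layout and the placeholder gene symbols are hypothetical, the 1000 Genomes lookup is reduced to a boolean flag, and the FUN and FINAL decision trees (defined in the Supplementary Data) are not reproduced.

```python
from collections import defaultdict


def harmonize_gene_manual(entries):
    """Give every gene pair the maximum GENEmanual found across all articles.

    `entries` is a hypothetical list of dicts with the keys "genes"
    (an iterable of gene symbols) and "GENEmanual" (an int from 0 to 3).
    """
    best = defaultdict(int)
    for entry in entries:
        key = frozenset(entry["genes"])  # gene order does not matter
        best[key] = max(best[key], entry["GENEmanual"])
    return [
        {**entry, "GENEmanual_harmonized": best[frozenset(entry["genes"])]}
        for entry in entries
    ]


def stat_meta(stat_manual, stat_knowledge, found_in_1000_genomes):
    """Maximum of the two STAT scores, with the 1000 Genomes exception.

    If the combination is carried by a 1000 Genomes individual, the
    Metascore is forced to 0 whatever the article reported.
    """
    if found_in_1000_genomes:
        return 0
    return max(stat_manual, stat_knowledge)


def gene_meta(gene_manual_harmonized, gene_knowledge):
    """Maximum of the harmonized manual score and the knowledge score."""
    return max(gene_manual_harmonized, gene_knowledge)


def var_meta(var_manual, var_knowledge):
    """Maximum of the manual and knowledge variant scores."""
    return max(var_manual, var_knowledge)


# Hypothetical example: the same gene pair scored in two different articles.
entries = [
    {"genes": ("GENE_A", "GENE_B"), "GENEmanual": 1},
    {"genes": ("GENE_B", "GENE_A"), "GENEmanual": 2},
]
harmonized = harmonize_gene_manual(entries)
```

In this sketch both entries end up with a harmonized score of 2, and a combination with a STATmanual of 2 that is found in the 1000 Genomes Project still receives a STATmeta of 0.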
Summary descriptions of the curation confidence scores linked to the variant combinations present in OLIDA. For each type of evidence, if the information found for a combination does not fulfil the criteria to provide at least a Weak (1) score, an Absent (0) score is assigned. Decision trees (see Supplementary Data, File 1) are used to define the FUNmanual, FUNmeta, FINALmanual and FINALmeta scores. The GENEmanual_harmonized score is defined as the best GENEmanual score assigned for a gene pair among the research articles. More details on how each confidence level is defined per evidence can be found in the Supplementary Data, File 1.
| FAMmanual: familial evidence based on the article | ||
|---|---|---|
| Weak (1) | Moderate (2) | Strong (3) |
| One of two conditions: a) The genotypic and phenotypic information of only one healthy first-degree relative is described b) Imperfect segregation in a pedigree with information on two (or more) first-degree relatives | One of two conditions: a) Information of two (or more) first-degree relatives, showing a perfect segregation of the variants according to the phenotype b) Imperfect segregation in a pedigree with information on first- and second-degree relatives | Information of healthy first- and second-degree relatives, showing a perfect segregation of the variants according to the phenotype |
| STATmanual: statistical evidence based on the article | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Implicit evidence that healthy individuals do not carry the oligogenic combination based on control cohorts or public databases. Known control phenotypes, sufficient control size and matched ethnicity | Explicit evidence that healthy individuals do not carry the oligogenic combination based on control cohorts and public databases. Known control phenotypes, sufficient control size, matched ethnicity and (preferably) similar sequencing technology | NA |
| STATknowledge: statistical evidence based on databases and cohorts | ||
| Weak (1) | Moderate (2) | Strong (3) |
| The combination is not found in the 1000 Genomes Project and relevance of all involved variants in ClinVar | NA | NA |
| STATmeta: maximum of STATmanual and STATknowledge | ||
| Weak (1) | Moderate (2) | Strong (3) |
| The oligogenic combination is not found in the 1000 Genomes Project. Other implicit evidence of its statistical relevance for the phenotype | The variant combination is not found in the 1000 Genomes Project. Additional explicit evidence that healthy individuals do not carry the oligogenic combination based on control cohorts or public databases of matched ethnicity and sufficient control size | NA |
| GENEmanual: gene functional evidence based on the article | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Relevance of involved pathway(s) or expressed tissues on the studied phenotype | One of two conditions: a) Effect of the gene combination on the observed phenotype using a functional experiment with either only a double knock-out or multiple single-gene knockouts b) Direct gene relationship (e.g. common pathway and direct interaction) and relevance for the studied phenotype. | Synergistic or additive effect of the gene combination on the observed phenotype using a functional experiment with single and multiple gene knockouts |
| GENEknowledge: gene functional evidence based on databases | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Relevancy of Reactome or KEGG pathways linked with the genes for the observed phenotype. | One of two conditions: a) Gene combination forms a connected PPI network and the comPPI score of each link is >0.8 b) Common Reactome or KEGG pathways, relevant for the observed phenotype. | NA |
| GENEmeta: maximum of GENEmanual_harmonized and GENEknowledge | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Relevance of the genes on the studied phenotype using pathway or tissue expression information | Direct gene relationship or effect of the gene combination on the observed phenotype without comparing the individual effects of genes | Synergistic or additive effect of the gene combination on the observed phenotype using a functional experiment with single and multiple gene knockouts |
| VARmanual: variant functional evidence based on the article | ||
| Weak (1) | Moderate (2) | Strong (3) |
| One of three conditions: a) All variants are predicted as pathogenic b) Functional experiments for some variants and predicted pathogenic effects for the rest c) Functional experiments using single-variant mutants for the involved variants with a promising but not conclusive effect on the observed phenotype | One of two conditions: a) Effect of the variant combination on the observed phenotype using a functional experiment with either only a double mutant or multiple single mutants b) Clear pathogenic impact of the variant combination on the observed phenotype in an | Synergistic or additive effect of the variant combination on the observed phenotype using a functional experiment with single and multiple gene mutants |
| VARknowledge: variant functional evidence based on predictors | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Pathogenicity prediction for all variants by at least one predictor among CADD, SIFT, MutationTaster2 and Polyphen2 | NA | NA |
| VARmeta: maximum of VARmanual and VARknowledge | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Pathogenicity predictions for all involved variants or inconclusive effects of functional experiments | One of two conditions: a) Effect of the oligogenic combination on the observed phenotype using a functional experiment with either a double mutant or multiple single mutants b) Clear pathogenic impact of the oligogenic combination on the observed phenotype in an | Synergistic or additive effect of the variant combination on the observed phenotype using a functional experiment with single and multiple gene mutants |
| FUNmanual: functional evidence based on GENEmanual and VARmanual | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Based on a decision tree, not enough evidence to suggest synergy, but relevance of the involved genes and variants | Based on a decision tree and evidence of a relationship, as well as potential functional synergy for the involved genes and variants, but the joint pathogenic effect on the studied phenotype is still not confirmed or clear | Based on a decision tree, strong evidence of the functional synergy of both involved genes and variants on the studied phenotype |
| FINALmanual: overall evidence based only on Manual scores | ||
| Weak (1) | Moderate (2) | Strong (3) |
| Based on a decision tree and evidence of the relevance of the variant combination for the observed phenotype but not enough to show that the involved variants are the only culprits for the studied phenotype or that the cause is indeed oligogenic | Based on a decision tree, good genetic and functional evidence of an effect of the oligogenic variant combination on the observed phenotype, but the described information/mechanism is not clear or strong enough to provide proof of oligogenicity | Based on a decision tree and strong evidence of the synergistic/additive effect of the oligogenic variant combination on the observed phenotype genetically and functionally |
Exception: if STATknowledge = 0 (because the variant combination is found in the 1000 Genomes Project), then it replaces STATmanual and, thus, STATmeta is also 0.
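Condition a) of the Moderate (2) GENEknowledge level, a connected PPI network in which every link scores above 0.8, can be checked with a small graph traversal. The Python sketch below is one possible reading of that criterion, not the curators' implementation: it assumes that links at or below the threshold are discarded before connectivity is tested, and the function name, the input format and the gene symbols are hypothetical.

```python
def forms_confident_ppi_network(genes, links, threshold=0.8):
    """Return True if all genes are connected through high-scoring links.

    `links` is a hypothetical mapping from a pair of gene symbols (a tuple)
    to an interaction score such as the one provided by comPPI.
    """
    genes = set(genes)
    if not genes:
        return False

    # Keep only links between the genes of the combination that pass the cut-off.
    adjacency = {gene: set() for gene in genes}
    for (gene_a, gene_b), score in links.items():
        if gene_a in genes and gene_b in genes and score > threshold:
            adjacency[gene_a].add(gene_b)
            adjacency[gene_b].add(gene_a)

    # Depth-first traversal from an arbitrary gene of the combination.
    start = next(iter(genes))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            stack.append(neighbour)

    # The network is connected only if every gene was reached.
    return seen == genes


# Hypothetical trigenic combination with two confident links.
connected = forms_confident_ppi_network(
    ["GENE_A", "GENE_B", "GENE_C"],
    {("GENE_A", "GENE_B"): 0.9, ("GENE_B", "GENE_C"): 0.85},
)
```

A link scoring exactly 0.8 is rejected here because the table states the criterion as strictly greater than 0.8.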
Figure 2. Histogram of the different gene relationship types found between the genes involved in an oligogenic variant combination. The types of gene relationship were obtained either directly from the articles or from public databases (for the ‘Relevant pathways for phenotype’, ‘Same Pathway’ and ‘Directly Interacting’ relationships). Genes are ‘Involved in the same disease’ if patients with the same phenotype described or referenced in the manuscript carried mutations in those genes, together or independently. Pathway information for each gene was either described in the article or found in the KEGG (38) or Reactome (39) databases and was then manually screened to check whether the genes belong to ‘Relevant pathways for the phenotype’ (e.g. glucose metabolism pathway for a diabetic phenotype) or to the ‘Same pathway’. Similarly, genes ‘Affecting the same tissue’ must also be expressed in the same relevant tissue for the phenotype. The ‘Directly interacting’ relationship denotes a PPI, either described in the article or retrieved from the comPPI database (37). It is distinguished from the ‘Same protein complex’ relationship, where the gene products are considered to fulfil their function only when linked together (e.g. the subunits of a channel). ‘Indirectly interacting’ genes are those whose products indirectly interact with an intermediate protein or are involved in a gene regulation mechanism with other gene products (e.g. transcription factors). ‘Similar function’ indicates that genes have the same function (e.g. motor proteins). ‘Co-localization’ implies a direct overlap of the location of the gene products in the cell (e.g. shown using immunofluorescence), while ‘Same organelle’ implies that the protein products exercise their function in the same organelle (e.g. cilia proteins). The ‘Co-expression’ relationship implies a positive correlation of the mRNA expression of the genes in a temporal fashion shown or referenced in the article. Finally, the ‘Monogenic experiments only’ label indicates that the experimental evidence and the assessment of pathogenicity were obtained for the genes independently (e.g. single knockouts).
Figure 3. Confidence scores and types of evidence present in the OLIDA combinations. (A) Distribution of the FINALmanual and FINALmeta scores. (B) Venn diagram of the number of oligogenic combinations carrying a score of 1 or higher in the different main types of evidence metascores. The 130 oligogenic combinations whose FAMmanual, STATmeta and FUNmeta scores are all 0 are not shown in this diagram. (C) Heatmap of the number of combinations and their confidence functional and genetic scores based on the evidence collected via manual curation (Manual scores) only and (D) when adjusted using the external database information (Meta scores). The genetic score here represents the maximum of the FAMmanual and STAT scores (manual or meta for plots C and D, respectively) and the functional score is the FUN score (manual or meta for plots C and D, respectively), as described in the Materials and Methods.
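The counts behind the two heatmaps and the excluded all-zero group can be derived as in the following Python sketch. It is an illustration only: the record layout and the three example combinations are invented for the purpose, and the function names are hypothetical.

```python
from collections import Counter


def genetic_score(fam_manual, stat_score):
    """Genetic score as defined in the caption: max of FAMmanual and STAT."""
    return max(fam_manual, stat_score)


def heatmap_counts(combinations, stat_key, fun_key):
    """Count combinations per (genetic score, functional score) cell."""
    counts = Counter()
    for combo in combinations:
        genetic = genetic_score(combo["FAMmanual"], combo[stat_key])
        counts[(genetic, combo[fun_key])] += 1
    return counts


# Invented records, one dict per oligogenic combination.
combos = [
    {"FAMmanual": 0, "STATmanual": 1, "STATmeta": 1, "FUNmanual": 0, "FUNmeta": 2},
    {"FAMmanual": 2, "STATmanual": 0, "STATmeta": 0, "FUNmanual": 1, "FUNmeta": 1},
    {"FAMmanual": 0, "STATmanual": 0, "STATmeta": 0, "FUNmanual": 0, "FUNmeta": 0},
]

# Manual scores only, then scores adjusted with database information.
manual_grid = heatmap_counts(combos, "STATmanual", "FUNmanual")
meta_grid = heatmap_counts(combos, "STATmeta", "FUNmeta")

# Combinations left out of the Venn diagram: all three scores are 0.
no_evidence = [
    c for c in combos
    if c["FAMmanual"] == 0 and c["STATmeta"] == 0 and c["FUNmeta"] == 0
]
```

Keying the counter on a (genetic, functional) tuple keeps the grid sparse, so cells with no combinations simply read as 0.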
Figure 4. Screenshot of the Browse page of OLIDA with the Oligogenic Variant Combinations table selected, showing the different possibilities that the database offers. Six different tables can be browsed (A), with the currently selected one shown in blue. (B) The user can then select the columns of interest to be displayed in the table and (C) download the table with the selected columns. (D) In a particular table, data can be sorted in ascending or descending order based on a particular column’s data. (E) A specific term (e.g. a gene name or a disease name) can be used to search all tables. (F) Each row represents a specific instance and (G) clicking on specific terms in blue will bring the user to the detail page for that specific instance.
Figure 5. Screenshot of the detail page for Alport syndrome. This page allows the user to visualize in more detail any instance of the database. It provides (A) links between this instance and the other entities of the database, as well as (B) clickable links towards corresponding pages in external databases where information about this entity was retrieved.