Mercedes Arguello Casteleiro, George Demetriou, Warren Read, Maria Jesus Fernandez Prieto, Nava Maroto, Diego Maseda Fernandez, Goran Nenadic, Julie Klein, John Keane, Robert Stevens.
Abstract
BACKGROUND: Automatically identifying term variants, or acceptable alternative free-text terms, for gene and protein names across millions of biomedical publications is a challenging task. Ontologies, such as the Cardiovascular Disease Ontology (CVDO), capture domain knowledge in a computational form and can provide context for gene/protein names as written in the literature. This study investigates: 1) whether word embeddings from Deep Learning algorithms can provide a list of term variants for a given gene/protein of interest; and 2) whether biological knowledge from the CVDO can improve such a list without modifying the word embeddings created.
Keywords: CBOW; Cardiovascular disease ontology; Deep learning; Ontology; PubMed; Semantic deep learning; Skip-gram
Year: 2018 PMID: 29650041 PMCID: PMC5896136 DOI: 10.1186/s13326-018-0181-1
Source DB: PubMed Journal: J Biomed Semantics
Exemplifying the identification of genes/proteins mentioned within the 25 PubMed titles/abstracts: Terms from PubMed abstract/title from the small-annotated corpus (first column) mapped to UniProtKB ACs (second column) and their corresponding values for skos:altLabel annotation properties of the PxO protein classes (third column)
| Term from PubMed title/abstract | UniProtKB AC | skos:altLabel values of the PxO protein class |
|---|---|---|
| α(1)-antitrypsin | P01009 | SERPINA1 (P01009; A1AT_HUMAN) Alpha-1-antitrypsin |
| annexin 4 | P09525 | ANXA4 (P09525; ANXA4_HUMAN) Annexin A4 |
| superoxide dismutase 3 | P08294 | SOD3 (P08294; SODE_HUMAN) Extracellular superoxide dismutase [Cu-Zn] |
| OLR1 | P78380 | OLR1 (P78380; OLR1_HUMAN) Oxidized low-density lipoprotein receptor 1 |
| glutathione transferase | P30711 | GSTT1 (P30711; GSTT1_HUMAN) Glutathione S-transferase theta-1 |
| FJX1 | Q86VR8 | FJX1 (Q86VR8; FJX1_HUMAN) Four-jointed box protein 1 |
Fig. 1 The SubClassOf axioms for the PxO protein class in OWL Manchester Syntax
Setup for Experiment I: The simple categorisation introduced (see ‘Setup of Experiment I and Experiment II for a gene/protein synonym detection task’) has been applied to the terms from PubMed abstract/title from the small-annotated corpus (first column) as well as to the target terms (second column). Each row of the third column contains the number of target terms for the experiment taking into account the categories that appear in the first and second column
| Terms from PubMed titles/abstracts | Target terms | n |
|---|---|---|
| Gene symbol appears | Gene symbol appears | 5 |
| Gene symbol appears | Only gene symbol | 13 |
| Gene symbol appears | Only protein name | 3 |
| Gene symbol appears | Refer protein name | 2 |
| Gene symbol appears | Terms from protein name | 2 |
| Only gene symbol | Only gene symbol | 21 |
| Refer protein name | Gene symbol appears | 1 |
| Refer protein name | Only protein name | 16 |
| Refer protein name | Refer protein name | 18 |
| Refer protein name | Terms from protein name | 4 |
Setup for Experiment II and contribution of the CVDO: The simple categorisation introduced (see ‘Setup of Experiment I and Experiment II for a gene/protein synonym detection task’) has been applied to the terms from PubMed abstract/title from the small-annotated corpus (first column) as well as to the target terms (second column). Each row of the third column contains the number of target terms for the experiment taking into account the categories that appear in the first and second column
| Terms from PubMed titles/abstracts | Target terms | n | Terms added by CVDO to the target terms |
|---|---|---|---|
| Gene symbol appears | Gene symbol appears | 6 | Terms from protein name |
| Gene symbol appears | Only protein name | 1 | Protein name |
| Gene symbol appears | Refer protein name | 1 | Terms referring to the protein name |
| Gene symbol appears | Terms from protein name | 2 | Terms from protein name |
| Only gene symbol | Gene symbol appears | 20 | Terms from protein name |
| Only gene symbol | Only protein name | 4 | Protein name |
| Refer protein name | Gene symbol appears | 27 | Terms from protein name and gene symbol |
| Refer protein name | Only gene symbol | 2 | Gene symbol |
| Refer protein name | Only protein name | 2 | Protein name |
| Refer protein name | Refer protein name | 1 | Terms referring to the protein name |
| Refer protein name | Terms from protein name | 2 | Terms from protein name |
The fourth column indicates the terms added by the CVDO. Where the symbol (R) appears, the protein class expressions within the CVDO are used to add terms to the target terms
Exemplifying results for Experiment I: Top twelve ranked candidate terms (highest cosine similarity) from the word embeddings created with CBOW and Skip-gram for the target term “KLF7” that appears in the abstract of the PubMed article with ID = 23468932
| Rank | CBOW candidate term | CBOW cosine | Skip-gram candidate term | Skip-gram cosine |
|---|---|---|---|---|
| 1 | MoKA | 0.376371 | Prrx2 | 0.601920 |
| 2 | pluripotency-associated_genes | 0.335113 | Klf7 | 0.592946 |
| 3 | Sp1_regulates | 0.334092 | Klf7(−/−) | 0.590523 |
| 4 | LOC101928923 | 0.333423 | RXRG | 0.589875 |
| 5 | p107_dephosphorylation | 0.331689 | LOC101928923 | 0.585979 |
| 6 | PU_1 | 0.329925 | SOX-17 | 0.585295 |
| 7 | histone_demethylase | 0.323529 | rs820336 | 0.585094 |
| 8 | gene_promoter | 0.321640 | GLI-binding_site | 0.581073 |
| 9 | homeobox_protein | 0.319997 | Tead2 | 0.580012 |
| 10 | histone_arginine | 0.315875 | hHEX | 0.579868 |
| 11 | transfated | 0.314202 | ACY-957 | 0.579542 |
| 12 | are_unable_to_repress | 0.313112 | ETS1 | 0.577272 |
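A ranking like the one in the table above can be produced by scoring every vocabulary term against the target term's vector and keeping the highest cosine values. The following is a minimal sketch under stated assumptions: the function names `cosine` and `top_k_by_cosine` and the tiny three-dimensional vectors are invented for illustration and are not the study's embeddings or code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_by_cosine(target, embeddings, k=12):
    """Return the k (term, cosine) pairs most similar to `target`, best first."""
    query = embeddings[target]
    scored = [
        (term, cosine(query, vec))
        for term, vec in embeddings.items()
        if term != target  # the target itself is not a candidate
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "KLF7": [1.0, 0.2, 0.0],
    "Klf7": [0.9, 0.3, 0.1],
    "ETS1": [0.5, 0.5, 0.5],
    "atherosclerosis": [0.0, 0.1, 1.0],
}
ranked = top_k_by_cosine("KLF7", toy_embeddings, k=3)
for rank, (term, score) in enumerate(ranked, start=1):
    print(rank, term, round(score, 6))
```

With `k=12` the same function would return a list shaped like one model's columns in the table: one candidate term and one cosine value per rank.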
Exemplifying results for Experiment II: Top twelve ranked candidate terms (highest cosine similarity) from the word embeddings created with CBOW and Skip-gram using two terms as target: “OLR1” from the abstract of the PubMed article with ID = 22738689; and “oxidized_low-density_lipoprotein receptor_ 1” that is the CVDO protein class name (rdfs:label) for the CVDO class gene with name (rdfs:label) OLR1. Hence, the target term exploits the protein class expressions within the CVDO
| Rank | CBOW candidate term | CBOW cosine | Skip-gram candidate term | Skip-gram cosine |
|---|---|---|---|---|
| 1 | atherogenesis | 0.469405 | lectin-like_oxidized_low-density_lipoprotein | 0.688603 |
| 2 | atherosclerosis | 0.465861 | (LOX-1)_is | 0.672042 |
| 3 | CD36 | 0.439280 | atherosclerosis_we_investigated | 0.669050 |
| 4 | LOX-1 | 0.424173 | receptor-1 | 0.664891 |
| 5 | atherosclerotic_lesion_formation | 0.416537 | lectin-like_oxidized_LDL_receptor-1 | 0.663988 |
| 6 | vascular_inflammation | 0.414620 | lOX-1_is | 0.660110 |
| 7 | inflammatory_genes | 0.411186 | human_atherosclerotic_lesions | 0.657075 |
| 8 | atherosclerotic_lesions | 0.405906 | oxidized_low-density_lipoprotein_(ox-LDL) | 0.655515 |
| 9 | monocyte_chemoattractant_protein-1 | 0.398739 | oxidized_low-density_lipoprotein_(oxLDL) | 0.654965 |
| 10 | plaque_destabilization | 0.398201 | (LOX-1) | 0.652099 |
| 11 | oxidized_low-density_lipoprotein_(oxLDL) | 0.397967 | proatherosclerotic | 0.651571 |
| 12 | atherosclerosis_atherosclerosis | 0.396677 | receptor-1_(LOX-1)_is | 0.649000 |
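The caption above does not state how the two target terms are combined into a single query. One common approach, assumed here and not confirmed by the source, is to average the unit-normalised vectors of the target terms and rank the remaining vocabulary by cosine similarity to that mean. In the sketch below, the helper names, the two-dimensional vectors, and the dictionary key spelling for the protein name are all hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_query(terms, embeddings):
    """Average the unit vectors of several target terms (assumed combination rule)."""
    units = [unit(embeddings[t]) for t in terms]
    return [sum(component) / len(units) for component in zip(*units)]

def rank_against(query, embeddings, exclude, k=12):
    """Rank all terms not in `exclude` by cosine similarity to `query`."""
    q = unit(query)
    scored = [
        (term, sum(a * b for a, b in zip(q, unit(vec))))
        for term, vec in embeddings.items()
        if term not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy vectors, for illustration only.
toy_embeddings = {
    "OLR1": [1.0, 0.0],
    "oxidized_low-density_lipoprotein_receptor_1": [0.0, 1.0],
    "LOX-1": [0.7, 0.7],
    "CD36": [1.0, 0.1],
}
targets = ["OLR1", "oxidized_low-density_lipoprotein_receptor_1"]
ranked = rank_against(mean_query(targets, toy_embeddings), toy_embeddings, exclude=set(targets))
```

In this toy setup a candidate that lies between the gene symbol and the protein name scores higher than one that is close to the gene symbol alone, which is the intuition behind adding an ontology-derived term to the target.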
Exemplifying human judgements and voting system for Skip-gram: Categories FTV, PTV, or NTV assigned by the four human raters (A, B, C, and D) to the top twelve candidate terms for the target term “oxidized_low-density_lipoprotein receptor_ 1 OLR1” in Experiment II using Skip-gram. The last three columns show the voting system (VS) applied for FTV (full term variant), FTV among the top three, and TV (full and/or partial term variant). The two rows with grey background highlight how two almost identical candidate terms from the word embeddings are marked differently by rater C, which shows that manual annotation by raters is error-prone
Fig. 2 ROC curves for rater A: left-hand side CBOW and right-hand side Skip-gram. Abbreviations: Exp I = Experiment I; Exp II = Experiment II
Fig. 3 ROC curves for rater B: left-hand side CBOW and right-hand side Skip-gram. Abbreviations: Exp I = Experiment I; Exp II = Experiment II
Fig. 4 ROC curves for rater C: left-hand side CBOW and right-hand side Skip-gram. Abbreviations: Exp I = Experiment I; Exp II = Experiment II
Fig. 5 ROC curves for rater D: left-hand side CBOW and right-hand side Skip-gram. Abbreviations: Exp I = Experiment I; Exp II = Experiment II
Median of the rank for CBOW and Skip-gram in Experiments I and II for each of the four raters
| Experiment | Model | Rater A median FTV | Rater A median PTV | Rater B median FTV | Rater B median PTV | Rater C median FTV | Rater C median PTV | Rater D median FTV | Rater D median PTV |
|---|---|---|---|---|---|---|---|---|---|
| I | CBOW | 4 | 5 | 3 | 7 | 4 | 6 | 4 | 6 |
| II | CBOW | 3 | 5 | 4 | 5 | 3 | 5 | 3 | 5 |
| I | Skip-gram | 4 | 6 | 4 | 6 | 4 | 6 | 4 | 6 |
| II | Skip-gram | 3 | 6 | 4 | 6 | 3 | 6 | 3 | 6 |
Overall performance of CBOW and Skip-gram according to the voting system: Number of unique UniProtKB entries and number of term pairs for protein/gene names that are involved in Experiment I, II, and combined (i.e. merging Experiment I and II)
| Experiment | Model | Number of term pairs | Number of UniProtKB entries | Number FTV | Number FTV for top three | Number TV |
|---|---|---|---|---|---|---|
| I | CBOW | 1020 | 64 | 31 | 21 | 43 (67%) |
| II | CBOW | 816 | 63 | 29 | 21 | 49 (78%) |
| I and II | CBOW | 1836 | 79 | 47 | 37 | 64 (81%) |
| I | Skip-gram | 1020 | 64 | 49 | 37 | 57 (89%) |
| II | Skip-gram | 816 | 63 | 56 | 51 | 60 (95%) |
| I and II | Skip-gram | 1836 | 79 | 71 | 63 | 77 (97%) |
According to the voting system, for each model the last three columns show: the number of full term variants among the top twelve ranked candidate terms for the UniProtKB entries (Number FTV column); the number of full term variants among the top three ranked candidate terms for the UniProtKB entries (Number FTV for top three); and the number and % of term variants (i.e. FTV and/or PTV) among the top twelve ranked candidate terms for the UniProtKB entries (Number TV column)
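The percentages in the Number TV column are consistent with dividing the TV count by the number of UniProtKB entries in the same row. The short check below recomputes them from the table values; the variable and function names are illustrative, and the rounding rule (nearest integer) is an assumption.

```python
# (experiment, model) -> (number of UniProtKB entries, Number TV), copied from the table.
rows = {
    ("I", "CBOW"): (64, 43),
    ("II", "CBOW"): (63, 49),
    ("I and II", "CBOW"): (79, 64),
    ("I", "Skip-gram"): (64, 57),
    ("II", "Skip-gram"): (63, 60),
    ("I and II", "Skip-gram"): (79, 77),
}

def tv_percent(entries, tv):
    """Share of UniProtKB entries with at least one term variant, as a whole percent."""
    return round(100 * tv / entries)

percentages = {key: tv_percent(entries, tv) for key, (entries, tv) in rows.items()}
for (experiment, model), pct in percentages.items():
    print(f"Experiment {experiment}, {model}: {pct}%")
```

Read this way, the gap between the models is clearest in Experiment I (67% for CBOW against 89% for Skip-gram), and both models improve when the CVDO terms are added in Experiment II.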
Performance of CBOW and Skip-gram - Experiment I and Experiment II: Number of unique UniProtKB entries mapped to CVDO gene and protein classes that participated in Experiment I or II
The rows with grey background highlight the 48 UniProtKB entries that participate in both Experiment I and II. Each row of the third column contains the number of target terms for the experiment taking into account the number of UniProtKB entries. According to the voting system, for each model and experiment, the last three columns show: the number of full term variants among the top twelve ranked candidate terms for the UniProtKB entries (Number FTV column); the number of full term variants among the top three ranked candidate terms for the UniProtKB entries (Number FTV for top three); and the number of term variants (i.e. FTV and/or PTV) among the top twelve ranked candidate terms for the UniProtKB entries (Number TV column)
Results for Experiment I according to the voting system and the simple categorisation introduced: Results of the voting system according to the simple categorisation introduced (see ‘Setup of Experiment I and Experiment II for a gene/protein synonym detection task’), which has been applied to the terms from PubMed abstract/title from the small-annotated corpus (first column) as well as to the target terms (second column)
Abbreviations: n = number of target terms; nFTV = number of target terms that have a FTV among the top twelve candidate terms; nFTVr3 = number of target terms that have a FTV among the top three candidate terms; nTV = number of target terms that have a TV (i.e. FTV and/or PTV) among the top twelve candidate terms
Results for Experiment II according to the voting system and the simple categorisation introduced: Results of the voting system according to the simple categorisation introduced (see ‘Setup of Experiment I and Experiment II for a gene/protein synonym detection task’), which has been applied to the terms from PubMed abstract/title from the small-annotated corpus (first column) as well as to the target terms (second column)
Abbreviations: n = number of target terms; nFTV = number of target terms that have a FTV among the top twelve candidate terms; nFTVr3 = number of target terms that have a FTV among the top three candidate terms; nTV = number of target terms that have a TV (i.e. FTV and/or PTV) among the top twelve candidate terms