Zhi-Hui Luo1,2, Meng-Wei Shi1,2, Zhuang Yang1,2, Hong-Yu Zhang3, Zhen-Xia Chen4,5.
Abstract
BACKGROUND: Many disease-causing genes have been identified through different methods, but the disease phenotypes of these genes still lack uniform biomedical named entity (bio-NE) annotations. Furthermore, semantic similarity comparison between two bio-NE annotations has become important for data integration or system genetics analysis.
Keywords: Disease; MeSH; Named entity recognition; Semantic similarity; Supplementary concept records; UMLS
Year: 2020 PMID: 32552728 PMCID: PMC7301509 DOI: 10.1186/s12859-020-03583-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The components and workflow of pyMeSHSim. pyMeSHSim consists of three subpackages: metamapWrap, data and Sim. In bio-NE recognition, metamapWrap curates the UMLS concepts from free text. In bio-NE normalization, data translates UMLS concepts to MeSH terms, and maps SCRs to MHs using selected records and relationships between records in MeSH. In bio-NE comparison, Sim uses IC-based and graph-based methods to measure the semantic similarity between two bio-NEs.
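The IC-based comparison step can be illustrated with Lin's measure, which divides twice the information content (IC) of the most informative common ancestor by the summed IC of the two terms. The following is a minimal sketch on a toy hierarchy, not pyMeSHSim's own code: the term names, the `PARENT` tree and the `DIRECT_COUNT` annotation counts are all hypothetical, and real MeSH terms can have several parents, which this single-parent toy ignores.

```python
import math

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "disease": None,
    "eye disease": "disease",
    "retinal disease": "eye disease",
    "leber congenital amaurosis": "retinal disease",
    "retinitis pigmentosa": "retinal disease",
    "kidney disease": "disease",
}

# Hypothetical number of annotations attached directly to each term.
DIRECT_COUNT = {
    "disease": 0,
    "eye disease": 2,
    "retinal disease": 3,
    "leber congenital amaurosis": 5,
    "retinitis pigmentosa": 10,
    "kidney disease": 20,
}


def ancestors(term):
    """Return the term followed by all of its ancestors, nearest first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def information_content(term):
    """IC = -log(p), with p the share of annotations at or below the term."""
    total = sum(DIRECT_COUNT.values())
    below = sum(c for t, c in DIRECT_COUNT.items() if term in ancestors(t))
    return -math.log(below / total)


def lin_similarity(a, b):
    """Lin: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = set(ancestors(a)) & set(ancestors(b))
    ic_mica = max(information_content(t) for t in common)
    denom = information_content(a) + information_content(b)
    # Two root terms carry no information; treat them as identical.
    return 1.0 if denom == 0 else 2 * ic_mica / denom


if __name__ == "__main__":
    print(lin_similarity("leber congenital amaurosis", "retinitis pigmentosa"))
    print(lin_similarity("leber congenital amaurosis", "kidney disease"))
```

In this toy tree, two terms that share only the root score 0, and a term compared with itself scores 1, which is the property the similarity threshold of 1 relies on.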
Fig. 2 OMIM UMLS diseases processing pipeline. MeSH-synonymous UMLS concepts were mapped to MHs or SCRs by pyMeSHSim directly. Meanwhile, non-MeSH-synonymous UMLS concepts were processed as free text into MeSH-synonymous UMLS concepts, and then mapped to MeSH terms. All gene symbols were mapped to Entrez IDs. SCRs were mapped to their broader MHs. MHs with at least 10 genes in at least two groups were retained for further analysis.
Fig. 3 Venn diagram of MH-gene pairs in the MH, SCR and Non-MeSH groups. Yellow, red and blue circles represent the MH, Non-MeSH and SCR groups, respectively. The numbers give the count of MH-gene pairs in each group and in each overlap between groups.
Disease enrichment analysis of the genes assigned to the MHs before and after addition of MH-gene pairs from SCR and non-MeSH groups
| OMIM disease MH ID [1] | OMIM disease MH description | MH-gene pairs (MH group / all) [2] | Enriched UMLS ID (DOSE) | Enriched UMLS description (DOSE) | Enriched MH ID [3] | Enrichment P value (MH group / all) [4] |
|---|---|---|---|---|---|---|
| D057130 | Leber Congenital Amaurosis | 17/22 | C0339527 | Leber Congenital Amaurosis | D057130 | 3.43E-33 / 1.45E-42 |
| D020754 | Spinocerebellar Ataxias | 23/28 | C0087012 | Ataxia, Spinocerebellar | D020754 | 1.93E-30 / 2.84E-38 |
| D052177 | Kidney Diseases, Cystic | 19/25 | C1691228 | Cystic Kidney Diseases | D052177 | 8.05E-19 / 2.37E-20 |
| D010009 | Osteochondrodysplasias | 14/64 | C0029422 | Osteochondrodysplasias | D010009 | 8.87E-19 / 6.57E-35 |
| D002925 | Ciliary Motility Disorders | 26/31 | C0008780 | Ciliary Motility Disorders | D002925 | 1.60E-23 / 3.90E-33 |
| D015419 | Spastic Paraplegia, Hereditary | 28/36 | C0037773 | Spastic Paraplegia, Hereditary | D015419 | 1.22E-37 / 2.71E-45 |
| D007938 | Leukemia | 18/51 | C0085669 | Acute leukemia | D007938 | 3.26E-10 / 6.63E-26 |
1 The OMIM diseases were collected from the disease-connect database (34) and restricted to those with at least five MH-gene pairs outside the MH group.
2 (Number of MH-gene pairs in MH group) / (number of MH-gene pairs in all the three groups including MH, SCR and non-MeSH group).
3 The MH ID was mapped from the UMLS ID by pyMeSHSim.
4 (The enrichment P value of genes in MH group) / (The enrichment P value of genes in all the three groups).
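The paired P values in the table compare enrichment before and after adding MH-gene pairs from the SCR and non-MeSH groups. Over-representation tests of this kind are commonly based on the hypergeometric distribution; the sketch below shows that calculation in plain Python. It is an illustration under stated assumptions, not the DOSE implementation: the function name `hypergeom_enrichment_p` and all numbers in the usage lines are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(universe, annotated, selected, overlap):
    """Upper-tail hypergeometric P value, P(X >= overlap).

    universe:  number of background genes
    annotated: background genes annotated to the disease term
    selected:  size of the query gene list
    overlap:   query genes that carry the annotation
    """
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe, selected)


if __name__ == "__main__":
    # Hypothetical numbers: a larger gene list with proportionally more
    # annotated genes yields a smaller P value.
    print(hypergeom_enrichment_p(20000, 100, 17, 8))
    print(hypergeom_enrichment_p(20000, 100, 22, 12))
```

A larger overlap for a similar list size pushes the P value down, which is the direction of change seen in every row of the table.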
Fig. 4 Recall, precision and F1 of pyMeSHSim, DNorm and TaggerOne. a-d. Performance of pyMeSHSim without SCRs (a), pyMeSHSim with SCRs (b), DNorm (c) and TaggerOne (d). A MeSH term identified by a tool was counted as a true positive or a false positive when its similarity to the term from Nelson’s manual work was higher or lower than the chosen threshold, respectively. When the similarity threshold is set to 1, only perfectly matched terms are considered true positives. The recall ($\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$), precision ($\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$) and F1 ($2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}/(\mathrm{Precision}+\mathrm{Recall})$) of the tools were calculated at each similarity threshold.
Performance of pyMeSHSim, DNorm and TaggerOne compared with Nelson’s manual work at a similarity threshold of 1
| Method | Recall [a] | Precision [b] | F1 [c] |
|---|---|---|---|
| | 0.94 | 0.56 | 0.70 |
| | 0.94 | 0.54 | 0.68 |
| | 0.32 | 0.62 | 0.42 |
| | 0.49 | 0.64 | 0.55 |
a $\mathrm{Recall} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$, where TP (true positive) is the number of phenotypes whose parsing results matched the manual work at the chosen similarity threshold. The similarity between the MeSH terms identified by the two methods was measured with the Lin score, and a result was counted as a TP or an FP when the similarity was higher or lower than the chosen threshold, respectively. FN (false negative) is the number of unrecognized phenotypes.
b $\mathrm{Precision} = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$, where FP is the number of phenotypes whose parsing results mismatched the manual work at the chosen similarity threshold.
c $\mathrm{F1} = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}/(\mathrm{Precision}+\mathrm{Recall})$.
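The threshold-based scoring described in these footnotes can be sketched as follows. This is an illustrative helper, not the authors' evaluation script: the function name `score_at_threshold` and the example similarity values are hypothetical, a `None` entry stands for a phenotype the tool did not recognize (a false negative), and a similarity equal to the threshold is assumed to count as a true positive so that a threshold of 1 accepts perfect matches.

```python
def score_at_threshold(similarities, threshold):
    """Return (recall, precision, f1) for one tool at one threshold.

    similarities: one entry per phenotype; the Lin similarity between the
    tool's MeSH term and the manual term, or None if nothing was recognized.
    """
    # Assumption: similarity equal to the threshold counts as a match.
    tp = sum(1 for s in similarities if s is not None and s >= threshold)
    fp = sum(1 for s in similarities if s is not None and s < threshold)
    fn = sum(1 for s in similarities if s is None)

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


if __name__ == "__main__":
    # Hypothetical similarities for five phenotypes.
    sims = [1.0, 1.0, 0.8, 0.4, None]
    for threshold in (1.0, 0.5):
        print(threshold, score_at_threshold(sims, threshold))
```

Lowering the threshold converts near matches from false positives into true positives, so precision and recall both rise in this example.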
Correlation of calculated semantic similarities between pyMeSHSim and meshes
| Method | Lin’s | Res’ | Jiang’s | Rel’s | Wang’s |
|---|---|---|---|---|---|
| | 0.97 | 0.99 | 0.89 | 0.98 | 0.97 |
| | < 2.2e-16 | < 2.2e-16 | 1.2e-14 | < 2.2e-16 | < 2.2e-16 |