Rey-Long Liu, Chia-Chun Shih.
Abstract
BACKGROUND: Curation of gene-disease associations published in the literature should be based on careful and frequent surveys of the references that are highly related to specific gene-disease associations. Retrieval of these references is thus essential for timely and complete curation.
Year: 2014 PMID: 25155502 PMCID: PMC4162969 DOI: 10.1186/1471-2105-15-286
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Experimental setup for evaluating CRFref
| Item | Setting |
|---|---|
| Experimental data | (1) Gene-disease pairs and candidate references: |
| | (A) |
| | (B) |
| | (C) |
| | (2) Target references of each target gene-disease pair: |
| | For each target gene-disease pair, all candidate references that GHR curators employed to develop a summary for the pair are the target references for the pair. |
| Baselines | (1) Vector Space Model (VSM) and term weighting techniques: Two techniques: |
| | (2) Proximity-based techniques: Two techniques |
| | (3) Position-and-frequency-based technique: A technique |
| | (4) Integrative techniques: Several rankers developed by combining the above techniques with SVMrank. |
| Evaluation criterion | (1) |
| | (2) |
Figure 1. Distribution of the percentage of target references for each gene-disease pair: The percentage varies widely across pairs, indicating that the difficulty of ranking the target references high differs considerably from one gene-disease pair to another.
Given a reference r and a gene-disease pair <g, d>, CRFref estimates and integrates three measures: the degrees of conclusiveness, richness, and focus of r with respect to <g, d>.
| Factors | Definition | Type |
|---|---|---|
| (1) Length( | | |
| (2) GeneTF( | | |
| (3) DiseaseTF( | | |
| (4) Gene@Title( | | |
| (5) Disease@Title( | | |
| (6) Gene@Ending( | | |
| (7) Disease@Ending( | | |
| (8) NotGeneNum( | | |
| (9) NotDiseaseNum( | | |
| (10) NotGene@Title( | | |
| (11) NotDisease@Title( | | |
| (12) NotGene@Ending( | | |
| (13) NotDisease@Ending( | | |
AvgLen: The average length of references.
TF(x,r): Term frequency of x in r.
LastPos(x,r): The last position of x in r.
G: Set of gene names in HUGO Gene Nomenclature Committee (HGNC).
D: Set of terms in MeSH class of C04 to C26, with ‘disease’ and ‘syndrome’ removed.
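The two helper quantities TF(x,r) and LastPos(x,r) defined in the notes above can be illustrated with a minimal Python sketch. It assumes that a reference is a plain string tokenized by lowercasing and splitting on whitespace, that x is a single token, and that positions are 1-based with 0 meaning "absent". The tokenizer, the function names, and the example sentence are hypothetical and are not taken from the paper.

```python
def tokenize(reference: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return reference.lower().split()


def term_frequency(x: str, reference: str) -> int:
    """TF(x, r): number of times token x occurs in reference r."""
    return tokenize(reference).count(x.lower())


def last_position(x: str, reference: str) -> int:
    """LastPos(x, r): 1-based index of the last occurrence of x in r, or 0 if absent."""
    tokens = tokenize(reference)
    target = x.lower()
    # Scan from the end so the first match found is the last occurrence.
    for index in range(len(tokens) - 1, -1, -1):
        if tokens[index] == target:
            return index + 1
    return 0


# Hypothetical reference text, used only for illustration.
r = "BRCA1 mutations and breast cancer risk in carriers of BRCA1"
print(term_frequency("BRCA1", r))
print(last_position("BRCA1", r))
```

A real system would match multi-word gene and disease names from G and D rather than single whitespace-delimited tokens.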
Figure 2. Comparing MAP of CRFref and each individual baseline: When CRFref considered only the degree of conclusiveness, it already performed significantly better than all baselines except PosFreq, whose performance was quite similar; when CRFref considered conclusiveness, richness, and focus, it performed significantly better than all the baselines.
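For readers unfamiliar with MAP, the following Python sketch shows how it is commonly computed. It assumes the standard information-retrieval definition (average precision per gene-disease pair, then the mean over pairs); the exact formula used in the paper is not reproduced in this excerpt, and the function names and toy data are hypothetical.

```python
def average_precision(ranked: list[str], targets: set[str]) -> float:
    """Average of the precision values taken at each rank where a target reference appears."""
    if not targets:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, ref in enumerate(ranked, start=1):
        if ref in targets:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(targets)


def mean_average_precision(
    rankings: dict[str, list[str]], targets: dict[str, set[str]]
) -> float:
    """Mean of average precision over all gene-disease pairs."""
    scores = [average_precision(rankings[pair], targets[pair]) for pair in rankings]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: two gene-disease pairs with made-up reference IDs.
rankings = {"pair1": ["a", "b", "c", "d"], "pair2": ["x", "y"]}
targets = {"pair1": {"a", "c"}, "pair2": {"y"}}
print(mean_average_precision(rankings, targets))
```

In the toy data, pair1 has average precision (1/1 + 2/3) / 2 and pair2 has 1/2, so MAP is their mean.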
Figure 3. Comparing average P@X of CRFref and each individual baseline: CRFref performed significantly better than all the baselines, indicating that it ranked highly related references near the top, so that expert curators can focus on reading only a small number of recommended references.
Percentage of gene-disease pairs for which P@X > 0: When compared with the individual baselines, CRFref ranked highly related references in the top 3 for a higher percentage of gene-disease pairs.
| Ranker | % of pairs for which P@1 > 0 | % of pairs for which P@2 > 0 | % of pairs for which P@3 > 0 |
|---|---|---|---|
| CRFref | 34.44% | 46.29% | 53.55% |
| PosFreq | 28.39% | 40.81% | 47.90% |
| PRE | 25.24% | 36.13% | 44.44% |
| BM25 | 25.81% | 37.50% | 44.76% |
| PLM | 25.56% | 37.02% | 44.60% |
| Lucene | 24.60% | 35.40% | 43.79% |
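The quantity reported in this table can be sketched as follows. The code assumes the usual definition of P@X (the fraction of the top X ranked references that are target references) and counts a pair as a "hit" when at least one target reference appears in the top X. Function names and toy data are hypothetical, not the authors' code.

```python
def precision_at(ranked: list[str], targets: set[str], x: int) -> float:
    """P@X: fraction of the top-x ranked references that are target references."""
    return sum(1 for ref in ranked[:x] if ref in targets) / x


def percent_pairs_with_hit(
    rankings: dict[str, list[str]], targets: dict[str, set[str]], x: int
) -> float:
    """Percentage of gene-disease pairs for which P@X > 0."""
    if not rankings:
        return 0.0
    hits = sum(
        1 for pair in rankings if precision_at(rankings[pair], targets[pair], x) > 0
    )
    return 100.0 * hits / len(rankings)


# Toy data: two gene-disease pairs with made-up reference IDs.
rankings = {"pair1": ["a", "b", "c"], "pair2": ["x", "y", "z"]}
targets = {"pair1": {"c"}, "pair2": {"x"}}
for x in (1, 2, 3):
    print(x, percent_pairs_with_hit(rankings, targets, x))
```

Because P@1 is either 0 or 1 for each pair, the percentage of pairs with P@1 > 0 equals the average P@1 expressed as a percentage; CRFref's 34.44% here is consistent with the average P@1 of 0.34435 reported for it elsewhere in the source.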
Figure 4. Comparing MAP of CRFref and each integrative baseline: Although each baseline was improved by integrating it with PosFreq, CRFref performed significantly better than all the integrative baselines when it considered conclusiveness, richness, and focus.
Figure 5. Comparing average P@X of CRFref and each integrative baseline: Although each baseline was improved by integrating it with PosFreq, CRFref performed significantly better than all the integrative baselines.
Percentage of gene-disease pairs for which P@X > 0: When compared with the integrative baselines, CRFref ranked highly related references in the top 3 for a higher percentage of gene-disease pairs.
| Ranker | % of pairs for which P@1 > 0 | % of pairs for which P@2 > 0 | % of pairs for which P@3 > 0 |
|---|---|---|---|
| CRFref | 34.44% | 46.29% | 53.55% |
| PRE + PosFreq | 28.71% | 41.85% | 50.24% |
| BM25 + PosFreq | 29.35% | 41.77% | 50.32% |
| PLM + PosFreq | 28.23% | 41.94% | 48.47% |
| Lucene + PosFreq | 26.85% | 40.32% | 47.98% |
Comparing MAP of CRFref and the rankers constructed by integrating the baselines with CRFref: All baselines were improved by integrating them with CRFref; however, none of the integrated versions performed significantly better than CRFref.
| Type | Ranker | MAP |
|---|---|---|
| CRFref | CRFref | 0.34718 |
| CRFref & individual baselines | CRFref + PosFreq | 0.35046 |
| | CRFref + PRE | 0.34447 |
| | CRFref + BM25 | 0.34855 |
| | CRFref + PLM | 0.34630 |
| | CRFref + Lucene | 0.34138s |
| CRFref & integrative baselines | CRFref + PRE + PosFreq | 0.35010 |
| | CRFref + BM25 + PosFreq | 0.35231 |
| | CRFref + PLM + PosFreq | 0.34974 |
| | CRFref + Lucene + PosFreq | 0.34639 |
s: Statistically significant differences with p ≤ 0.05
Comparing average P@X of CRFref and the rankers constructed by integrating the baselines with CRFref: All baselines were improved by integrating them with CRFref; however, when compared with CRFref, some of the integrated versions achieved better performance in P@3 but not in P@1.
| Type | Ranker | P@1 | P@2 | P@3 |
|---|---|---|---|---|
| CRFref | CRFref | 0.34435 | 0.30040 | 0.28078 |
| CRFref & individual baselines | CRFref + PosFreq | 0.33065 | 0.30202 | 0.28965 |
| | CRFref + PRE | 0.32581s | 0.30161 | 0.28777 |
| | CRFref + BM25 | 0.33548 | 0.30323 | 0.28938s |
| | CRFref + PLM | 0.33548 | 0.30242 | 0.28589 |
| | CRFref + Lucene | 0.33306 | 0.29516 | 0.28132 |
| CRFref & integrative baselines | CRFref + PRE + PosFreq | 0.32258s | 0.30685 | 0.29073s |
| | CRFref + BM25 + PosFreq | 0.33629 | 0.30887 | 0.29019 |
| | CRFref + PLM + PosFreq | 0.32500s | 0.30524 | 0.29341s |
| | CRFref + Lucene + PosFreq | 0.33468 | 0.30161 | 0.28159 |
s: Statistically significant differences with p ≤ 0.05
Two references ranked by PubMed, CRFref, and the best baselines (BM25 + PosFreq and PRE + PosFreq): CRFref ranked the target reference in the top 3, while PubMed and the baselines preferred the non-target reference.
| Reference | Position (PubMed) | Position (BM25 + PosFreq) | Position (PRE + PosFreq) | Position (CRFref) |
|---|---|---|---|---|
| Target reference (PubMed ID: 10791557) | 34th | 5th | 6th | 3rd |
| Non-target reference (PubMed ID: 12972407) | 13th | 1st | 1st | 6th |
PubMed ranked the target reference at the 34th position, meaning that an expert who used PubMed to identify references for curating the gene-disease association would have to read and check 33 references before reaching it, which is a considerable burden.