Hua Xu, Marianthi Markatou, Rositsa Dimova, Hongfang Liu, Carol Friedman.
Abstract
BACKGROUND: Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems, because ambiguous words hinder accurate access to literature containing biomolecular entities, such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem remains challenging. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the reported results typically incorporate a number of confounding factors. Because these factors interact with each other and affect the final results, it is difficult to judge the effectiveness and generalizability of the methods. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance.
Year: 2006 PMID: 16822321 PMCID: PMC1550263 DOI: 10.1186/1471-2105-7-334
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Information on the abbreviation data sets
| Abbreviation | Sense # | Sense | # of retrieved articles | Sense Distribution |
| BPD | BPD1 | borderline personality disorder | 1584 | 32% |
| | BPD2 | bronchopulmonary dysplasia | 2335 | 47% |
| | BPD3 | biparietal diameter | 1032 | 21% |
| BSA | BSA1 | bovine serum albumin | 13352 | 70% |
| | BSA2 | body surface area | 5815 | 30% |
| PCA | PCA1 | posterior cerebral artery | 1165 | 67% |
| | PCA2 | posterior communicating artery | 584 | 33% |
| RSV | RSV1 | respiratory syncytial virus | 5295 | 60% |
| | RSV2 | rous sarcoma virus | 3520 | 40% |
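The Sense Distribution column is each sense's share of the articles retrieved for that abbreviation. As a quick check of that arithmetic, the sketch below recomputes the shares for the BPD and RSV rows. The function name `sense_distribution` is illustrative and does not come from the paper.

```python
def sense_distribution(counts):
    """Return each sense's share of the retrieved articles, as a fraction."""
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}

# Article counts copied from the BPD and RSV rows of the table.
bpd = {"BPD1": 1584, "BPD2": 2335, "BPD3": 1032}
rsv = {"RSV1": 5295, "RSV2": 3520}

for name, counts in (("BPD", bpd), ("RSV", rsv)):
    shares = sense_distribution(counts)
    # Format as whole percentages, matching the table's precision.
    print(name, {sense: f"{share:.0%}" for sense, share in shares.items()})
```

Rounded to whole percentages, this gives 32%, 47% and 21% for BPD and 60% and 40% for RSV, in line with the table.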
Results for the BSA data set. Abbreviations: Dist, distribution of senses; S. Size, sample size; Err. Rate, error rate; SE, standard error of the error rate; CV, cross-validation.
| Dist | S. Size | Err. Rate | SE | Err. Rate | SE |
| (0.5, 0.5) | 20 | 21.83% | 10.05% | 19.67% | 9.04% |
| | 40 | 11.17% | 5.33% | 11.08% | 5.05% |
| | 80 | 5.08% | 2.60% | 5.04% | 2.44% |
| | 120 | 3.11% | 1.72% | 2.61% | 1.48% |
| (0.6, 0.4) | 20 | 23.50% | 10.21% | 21.00% | 9.21% |
| | 40 | 12.67% | 5.75% | 12.08% | 5.34% |
| | 80 | 5.75% | 2.82% | 5.00% | 2.48% |
| | 120 | 3.58% | 1.85% | 3.28% | 1.67% |
| (0.7, 0.3) | 20 | 24.33% | 10.59% | 23.00% | 9.74% |
| | 40 | 14.67% | 6.11% | 12.75% | 5.39% |
| | 80 | 7.17% | 3.16% | 6.67% | 2.87% |
| | 120 | 4.86% | 2.17% | 4.00% | 1.85% |
| (0.8, 0.2) | 20 | 19.33% | 9.82% | 19.33% | 9.27% |
| | 40 | 15.33% | 6.31% | 14.08% | 5.72% |
| | 80 | 9.13% | 3.58% | 8.00% | 3.16% |
| | 120 | 5.22% | 2.23% | 4.53% | 1.96% |
| (0.9, 0.1) | 20 | 10.17% | 7.55% | 10.00% | 7.07% |
| | 40 | 10.17% | 5.33% | 10.00% | 4.99% |
| | 80 | 8.00% | 3.38% | 7.71% | 3.13% |
| | 120 | 6.42% | 2.48% | 6.03% | 2.26% |
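The Dist and S. Size columns imply that each experiment uses a sample of a given size with a prescribed mix of senses. One way such a sample could be drawn is sketched below. The sampling scheme, the `draw_sample` helper, the rounding rule, and the placeholder documents are all assumptions for illustration, since the text here does not describe the sampling procedure.

```python
import random

def draw_sample(pool_by_sense, dist, size, seed=0):
    """Draw `size` labelled instances whose senses follow `dist`.

    pool_by_sense maps each sense to its available instances; `dist` lists
    the target proportions in the same order as the mapping's keys.
    """
    rng = random.Random(seed)
    quotas = [round(p * size) for p in dist]
    # Assumption: any rounding drift is absorbed by the first sense.
    quotas[0] += size - sum(quotas)
    sample = []
    for sense, quota in zip(pool_by_sense, quotas):
        # Sample without replacement within each sense.
        sample += [(text, sense) for text in rng.sample(pool_by_sense[sense], quota)]
    rng.shuffle(sample)
    return sample

# Hypothetical placeholder pool; real instances would be article text.
pool = {
    "BSA1": [f"bsa1-doc-{i}" for i in range(200)],
    "BSA2": [f"bsa2-doc-{i}" for i in range(200)],
}
sample = draw_sample(pool, dist=(0.8, 0.2), size=40)
print(sum(1 for _, sense in sample if sense == "BSA1"), "of", len(sample), "are BSA1")
```

With a distribution of (0.8, 0.2) and a sample size of 40, the quotas work out to 32 and 8 instances.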
Results for the PCA data set. Abbreviations: Dist, distribution of senses; S. Size, sample size; Err. Rate, error rate; SE, standard error of the error rate; CV, cross-validation.
| Dist | S. Size | Err. Rate | SE | Err. Rate | SE |
| (0.5, 0.5) | 20 | 43.00% | 12.14% | 41.00% | 11.25% |
| | 40 | 34.58% | 8.21% | 34.33% | 7.68% |
| | 80 | 37.17% | 5.44% | 29.46% | 5.14% |
| | 120 | 28.53% | 4.45% | 31.47% | 4.13% |
| (0.6, 0.4) | 20 | 37.83% | 11.62% | 38.50% | 11.04% |
| | 40 | 36.42% | 8.12% | 35.92% | 7.37% |
| | 80 | 25.54% | 5.41% | 24.88% | 5.06% |
| | 120 | 28.22% | 4.25% | 29.25% | 3.96% |
| (0.7, 0.3) | 20 | 33.67% | 11.48% | 33.50% | 10.90% |
| | 40 | 33.08% | 8.06% | 33.08% | 7.62% |
| | 80 | 29.67% | 5.38% | 24.29% | 4.98% |
| | 120 | 26.83% | 4.36% | 27.83% | 4.11% |
| (0.8, 0.2) | 20 | 23.67% | 10.48% | 24.50% | 9.99% |
| | 40 | 21.83% | 7.01% | 20.58% | 6.61% |
| | 80 | 28.00% | 5.09% | 19.25% | 4.61% |
| | 120 | 22.92% | 3.97% | 25.03% | 3.65% |
| (0.9, 0.1) | 20 | 12.33% | 8.14% | 12.00% | 7.59% |
| | 40 | 10.92% | 5.48% | 11.08% | 5.20% |
| | 80 | 14.04% | 4.10% | 12.50% | 3.79% |
| | 120 | 10.42% | 3.11% | 11.14% | 2.98% |
| | 20 | 38.33% | 11.89% | 36.33% | 10.96% |
| | 40 | 30.17% | 7.93% | 29.50% | 7.46% |
| | 80 | 28.25% | 5.43% | 24.83% | 5.00% |
| | 120 | 29.47% | 4.50% | 35.33% | 4.15% |
Results for the RSV data set. Abbreviations: Dist, distribution of senses; S. Size, sample size; Err. Rate, error rate; SE, standard error of the error rate; CV, cross-validation.
| Dist | S. Size | Err. Rate | SE | Err. Rate | SE |
| (0.5, 0.5) | 20 | 26.50% | 10.52% | 27.00% | 9.72% |
| | 40 | 18.83% | 6.83% | 17.83% | 6.29% |
| | 80 | 12.79% | 4.09% | 12.17% | 3.78% |
| | 120 | 10.58% | 3.10% | 10.69% | 2.93% |
| (0.6, 0.4) | 20 | 27.83% | 10.78% | 27.67% | 10.09% |
| | 40 | 20.25% | 7.00% | 19.50% | 6.52% |
| | 80 | 13.67% | 4.25% | 12.83% | 3.91% |
| | 120 | 11.53% | 3.20% | 10.39% | 2.90% |
| (0.7, 0.3) | 20 | 27.33% | 10.84% | 26.33% | 10.18% |
| | 40 | 19.00% | 6.81% | 17.83% | 6.23% |
| | 80 | 13.96% | 4.27% | 13.08% | 3.91% |
| | 120 | 11.56% | 3.23% | 10.86% | 2.97% |
| (0.8, 0.2) | 20 | 21.50% | 10.20% | 19.50% | 9.20% |
| | 40 | 17.08% | 6.60% | 16.75% | 6.17% |
| | 80 | 14.00% | 4.29% | 13.29% | 3.96% |
| | 120 | 11.69% | 3.26% | 10.75% | 2.96% |
| (0.9, 0.1) | 20 | 11.00% | 7.77% | 10.67% | 7.25% |
| | 40 | 10.58% | 5.42% | 10.33% | 5.05% |
| | 80 | 9.54% | 3.66% | 9.33% | 3.41% |
| | 120 | 8.67% | 2.86% | 8.36% | 2.65% |
Results for the BPD data set. Abbreviations: Dist, distribution of senses; S. Size, sample size; Err. Rate, error rate; SE, standard error of the error rate; CV, cross-validation.
| | | mc-svm | | one-vs-rest | | one-vs-one | | mc-svm | | one-vs-rest | | one-vs-one | |
| Dist | S. Size | Err. Rate | SE | Err. Rate | SE | Err. Rate | SE | Err. Rate | SE | Err. Rate | SE | Err. Rate | SE |
| (0.33, 0.33, 0.33) | 30 | 26.56% | 8.77% | 25.78% | 8.68% | 29.22% | 9.06% | 23.89% | 8.05% | 23.44% | 7.96% | 26.22% | 8.31% |
| | 60 | 13.89% | 4.89% | 13.39% | 4.83% | 15.83% | 5.18% | 11.89% | 4.30% | 11.56% | 4.23% | 13.78% | 4.62% |
| | 120 | 7.67% | 2.66% | 7.08% | 2.55% | 8.44% | 2.80% | 7.00% | 2.40% | 6.39% | 2.31% | 8.06% | 2.58% |
| | 180 | 6.06% | 1.96% | 5.70% | 1.90% | 6.69% | 2.05% | 5.70% | 1.79% | 5.20% | 1.72% | 6.24% | 1.88% |
| (0.6, 0.2, 0.2) | 30 | 26.33% | 8.91% | 25.33% | 8.75% | 26.89% | 8.97% | 24.67% | 8.21% | 23.89% | 8.06% | 25.44% | 8.30% |
| | 60 | 16.28% | 5.27% | 15.56% | 5.16% | 17.67% | 5.44% | 15.33% | 4.85% | 14.00% | 4.65% | 16.33% | 4.97% |
| | 120 | 10.11% | 3.05% | 9.22% | 2.93% | 10.50% | 3.10% | 9.36% | 2.78% | 8.50% | 2.66% | 10.00% | 2.87% |
| | 180 | 7.72% | 2.21% | 6.89% | 2.09% | 8.09% | 2.26% | 6.93% | 1.98% | 6.37% | 1.91% | 7.41% | 2.04% |
| (0.8, 0.1, 0.1) | 30 | 18.11% | 7.82% | 18.11% | 7.83% | 19.00% | 7.99% | 18.33% | 7.41% | 18.22% | 7.40% | 19.00% | 7.53% |
| | 60 | 14.78% | 5.10% | 14.28% | 5.03% | 15.39% | 5.18% | 14.67% | 4.79% | 13.83% | 4.67% | 14.78% | 4.81% |
| | 120 | 9.31% | 2.95% | 8.69% | 2.85% | 9.50% | 2.98% | 8.56% | 2.67% | 8.06% | 2.59% | 8.75% | 2.70% |
| | 180 | 6.87% | 2.09% | 6.59% | 2.05% | 7.17% | 2.14% | 6.35% | 1.91% | 5.87% | 1.84% | 6.61% | 1.94% |
| | 30 | 24.22% | 8.58% | 23.33% | 8.44% | 26.67% | 8.84% | 23.00% | 7.93% | 21.67% | 7.71% | 25.22% | 8.17% |
| | 60 | 15.89% | 5.21% | 14.89% | 5.08% | 16.83% | 5.34% | 14.11% | 4.66% | 13.33% | 4.55% | 15.39% | 4.80% |
| | 120 | 9.19% | 2.91% | 7.92% | 2.71% | 10.25% | 3.07% | 8.36% | 2.64% | 7.53% | 2.50% | 9.50% | 2.80% |
| | 180 | 6.07% | 1.95% | 5.48% | 1.85% | 6.78% | 2.07% | 5.39% | 1.73% | 4.61% | 1.61% | 6.07% | 1.85% |
Figure 1. Error rate versus sample size with different sense distributions of the BSA data set. This figure plots error rate against sample size for different sense distributions of the BSA data set (a case where the 2 ambiguous senses are very different), using 5-fold cross-validation.
Figure 2. Error rate versus sample size with different sense distributions of the PCA data set. This figure plots error rate against sample size for different sense distributions of the PCA data set (a case where the 2 ambiguous senses are very similar), using 5-fold cross-validation.
Figure 3. Error rate versus sample size with different sense distributions of the RSV data set. This figure plots error rate against sample size for different sense distributions of the RSV data set (a case where the 2 ambiguous senses both refer to viruses, but of different types), using 5-fold cross-validation.
Figure 4. Error rate versus sample size with different sense distributions of the BPD data set. This figure plots error rate against sample size for different sense distributions of the BPD data set (a case where there are 3 ambiguous senses that are different), using 5-fold cross-validation and the "one-vs-rest" algorithm.
Figure 5. Error rate versus sample size for the BSA, RSV and PCA data sets. This figure plots error rate against sample size for the BSA, RSV and PCA data sets with a fixed distribution of (0.5, 0.5), using 5-fold cross-validation.
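The figures report error rates estimated with 5-fold cross-validation. A minimal sketch of such an estimate follows. It makes several assumptions: the standard error is taken as the fold-to-fold standard deviation divided by the square root of k, which may differ from how the SE values in the tables were computed, and a majority-sense baseline stands in for the SVM classifiers named in the BPD table. The helper names `k_fold_error` and `train_majority` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev

def k_fold_error(examples, k, train, seed=0):
    """Estimate the error rate of `train` by k-fold cross-validation.

    examples: list of (features, sense) pairs.
    train:    function taking training pairs and returning predict(features).
    Returns (mean fold error, standard error across folds).
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal folds
    fold_errors = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train(training)
        wrong = sum(1 for features, sense in held_out if predict(features) != sense)
        fold_errors.append(wrong / len(held_out))
    # Assumption: SE is the fold-to-fold standard deviation over sqrt(k).
    return mean(fold_errors), stdev(fold_errors) / k ** 0.5

def train_majority(training):
    """Placeholder learner: always predict the most frequent training sense."""
    majority = Counter(sense for _, sense in training).most_common(1)[0][0]
    return lambda features: majority

# 120 instances at a (0.9, 0.1) sense distribution; features are unused here.
examples = [(None, "BSA1")] * 108 + [(None, "BSA2")] * 12
error, se = k_fold_error(examples, k=5, train=train_majority)
print(f"error rate = {error:.2%}, SE = {se:.2%}")
```

Because the folds are equal in size and the majority sense never changes, this baseline's mean error equals the minority share of 10%. That gives a reference point for reading the (0.9, 0.1) rows of the tables, where a classifier has to beat 10% to improve on always choosing the dominant sense.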