Zhan Ye, Ahmad P Tafti, Karen Y He, Kai Wang, Max M He.
Abstract
BACKGROUND: Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.
Year: 2016 PMID: 27685652 PMCID: PMC5042555 DOI: 10.1371/journal.pone.0162721
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The datasets: all abstracts and full-text articles were downloaded from PubMed.
The datasets included abstracts and full-text articles related to three types of cancer: breast, lung, and prostate. For each dataset, 80% of the articles were used to train a prediction model and the remaining 20% were used for testing.
| Dataset | Year Range | # Instances | # Breast Cancer | # Lung Cancer | # Prostate Cancer |
|---|---|---|---|---|---|
| Abstracts | 2011–2016 | 19,681 | 6,137 | 6,680 | 6,864 |
| Full-text Articles I | 2011–2016 | 12,902 | 4,319 | 4,281 | 4,302 |
| Full-text Articles II | 2009–2016 | 29,437 | 9,787 | 9,861 | 9,789 |
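The 80/20 split described above can be sketched as follows. This is a minimal illustration and not the authors' code: the source does not say whether the split was random or stratified by cancer type, so a plain seeded shuffle is assumed here, and `train_test_split`, `train_fraction`, and `seed` are hypothetical names.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` and cut it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in IDs for the 19,681 abstracts listed in the table above.
article_ids = list(range(19681))
train, test = train_test_split(article_ids)
print(len(train), len(test))  # 15744 3937
```

A stratified variant would apply the same cut separately within each cancer class so that class proportions match in both parts.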
An example of a bag-of-words representation.
The stemmed terms “biolog”, “biopsi”, “biolab”, “biotin”, and “almost” are unigrams, whereas “cancer-surviv” and “cancer-stage” are bigrams. Using TF/IDF weighting scores, the feature value of the term “almost” equals zero.
| Article ID | biolog | biopsi | biolab | biotin | almost | cancer-surviv | cancer-stage | Article Class |
|---|---|---|---|---|---|---|---|---|
| 00001 | 12 | 1 | 2 | 10 | 0 | 1 | 4 | breast-cancer |
| 00002 | 10 | 1 | 0 | 3 | 0 | 6 | 1 | breast-cancer |
| 00014 | 4 | 1 | 1 | 1 | 0 | 28 | 0 | breast-cancer |
| 00063 | 4 | 0 | 0 | 0 | 0 | 18 | 7 | breast-cancer |
| 00319 | 0 | 1 | 0 | 9 | 0 | 20 | 1 | breast-cancer |
| 00847 | 7 | 2 | 0 | 14 | 0 | 11 | 5 | breast-cancer |
| 03042 | 3 | 1 | 3 | 1 | 0 | 19 | 8 | lung-cancer |
| 05267 | 4 | 4 | 2 | 6 | 0 | 14 | 11 | lung-cancer |
| 05970 | 8 | 0 | 4 | 9 | 0 | 9 | 17 | lung-cancer |
| 30261 | 1 | 0 | 0 | 11 | 0 | 21 | 1 | prostate-cancer |
| 41191 | 9 | 0 | 5 | 14 | 0 | 11 | 1 | prostate-cancer |
| 52038 | 6 | 1 | 1 | 17 | 0 | 19 | 0 | prostate-cancer |
| 73851 | 1 | 1 | 8 | 17 | 0 | 17 | 3 | prostate-cancer |
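One way a common word such as “almost” can end up with a TF/IDF weight of zero is sketched below. The sketch assumes the textbook form weight = count × log(N / df), where N is the number of documents and df is the number of documents containing the term. The source does not state which TF/IDF variant SparkText uses, and the toy counts are made up for illustration.

```python
import math

def tf_idf(counts_by_doc):
    """Map {doc_id: {term: raw_count}} to {doc_id: {term: tf-idf weight}}."""
    n_docs = len(counts_by_doc)
    # Document frequency: in how many documents does each term occur?
    doc_freq = {}
    for counts in counts_by_doc.values():
        for term, count in counts.items():
            if count > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        doc_id: {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
            if count > 0
        }
        for doc_id, counts in counts_by_doc.items()
    }

# Hypothetical raw counts; "almost" occurs in every document.
toy_counts = {
    "doc_a": {"almost": 3, "biotin": 10, "cancer-stage": 4},
    "doc_b": {"almost": 1, "biotin": 3},
    "doc_c": {"almost": 2, "cancer-stage": 8},
}
weights = tf_idf(toy_counts)
print(weights["doc_a"]["almost"])  # 0.0, because log(3 / 3) == 0
```

Terms that occur in only some documents keep a positive weight, so they remain informative features for the classifier.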
The quantitative results for accuracy, precision, and recall of SparkText using three datasets.
For each dataset, 80% was used to train a prediction model and the remaining 20% for testing.
| Dataset | Classifier | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Abstracts | SVM | 94.63% | 93.11% | 94.81% |
| Abstracts | Logistic Regression | 92.19% | 91.07% | 89.49% |
| Abstracts | Naïve Bayes | 89.38% | 89.13% | 90.82% |
| Full-text Articles I | SVM | 94.47% | 92.97% | 93.14% |
| Full-text Articles I | Logistic Regression | 91.05% | 90.77% | 89.19% |
| Full-text Articles I | Naïve Bayes | 88.02% | 89.01% | 90.68% |
| Full-text Articles II | SVM | 93.81% | 91.88% | 92.27% |
| Full-text Articles II | Logistic Regression | 90.57% | 90.28% | 91.59% |
| Full-text Articles II | Naïve Bayes | 86.44% | 87.61% | 89.12% |
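For reference, the three metrics in the table can be computed for a three-class problem as in the sketch below. The source does not say how precision and recall were averaged across the three cancer classes, so unweighted (macro) averaging is assumed, and the labels and predictions are made-up toy data.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)  # times the label was predicted
        actual = sum(t == label for t in y_true)     # times the label is correct
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)

# Hypothetical test labels and classifier output.
y_true = ["breast", "breast", "lung", "lung", "prostate", "prostate"]
y_pred = ["breast", "lung", "lung", "lung", "prostate", "breast"]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(f"{accuracy:.2%} {precision:.2%} {recall:.2%}")  # 66.67% 72.22% 66.67%
```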
The quantitative results for accuracy using different regularization parameters.
For each dataset, 80% was used to train a prediction model and the remaining 20% for testing.
| Classifier | Dataset | Regularization Parameter | Accuracy |
|---|---|---|---|
| SVM | Abstracts | L2 (Default) | 94.63% |
| SVM | Abstracts | L1 | 91.07% |
| SVM | Abstracts | None | 89.72% |
| Logistic Regression | Abstracts | L2 (Default) | 92.19% |
| Logistic Regression | Abstracts | L1 | 90.61% |
| Logistic Regression | Abstracts | None | 88.54% |
| SVM | Full-text Articles I | L2 (Default) | 94.47% |
| SVM | Full-text Articles I | L1 | 90.33% |
| SVM | Full-text Articles I | None | 88.51% |
| Logistic Regression | Full-text Articles I | L2 (Default) | 91.05% |
| Logistic Regression | Full-text Articles I | L1 | 88.19% |
| Logistic Regression | Full-text Articles I | None | 87.04% |
| SVM | Full-text Articles II | L2 (Default) | 93.81% |
| SVM | Full-text Articles II | L1 | 90.16% |
| SVM | Full-text Articles II | None | 87.94% |
| Logistic Regression | Full-text Articles II | L2 (Default) | 90.57% |
| Logistic Regression | Full-text Articles II | L1 | 87.63% |
| Logistic Regression | Full-text Articles II | None | 86.71% |
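The three regularization settings in the table differ in the penalty added to the training loss. The sketch below shows the usual definitions of the L2 and L1 penalties and their (sub)gradients. It is a generic illustration, not the SparkText implementation: the `strength` value is an arbitrary assumption, since the source does not report the regularization strength used.

```python
def penalty(weights, kind, strength):
    """Regularization term added to the training loss."""
    if kind == "l2":
        return 0.5 * strength * sum(w * w for w in weights)
    if kind == "l1":
        return strength * sum(abs(w) for w in weights)
    if kind == "none":
        return 0.0
    raise ValueError(f"unknown regularization kind: {kind}")

def penalty_gradient(weights, kind, strength):
    """(Sub)gradient of the penalty with respect to each weight."""
    if kind == "l2":
        return [strength * w for w in weights]  # shrinks weights proportionally
    if kind == "l1":
        # Constant-size push toward zero; 0 is used as the subgradient at w == 0.
        return [strength * ((w > 0) - (w < 0)) for w in weights]
    if kind == "none":
        return [0.0 for _ in weights]
    raise ValueError(f"unknown regularization kind: {kind}")

example_weights = [0.5, -2.0, 0.0]  # hypothetical model weights
for kind in ("l2", "l1", "none"):
    print(kind, penalty(example_weights, kind, strength=0.1))
```

L2 penalizes large weights most heavily, while L1 pushes small weights to exactly zero and so yields sparser models.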
Comparing the time efficiency results, SparkText outperformed other available text mining tools with speeds up to 132 times faster on the larger dataset that included 29,437 full-text articles.
| Tools | Dataset | ≅ Running Time (minutes) |
|---|---|---|
| Weka Library | Abstracts | 138 |
| Tag Helper Tools | Abstracts | 201 |
| Weka Library | Full-text Articles I | 309 |
| Tag Helper Tools | Full-text Articles I | 571 |
| Weka Library | Full-text Articles II | 697 |
| Tag Helper Tools | Full-text Articles II | 768 |