Malik Yousef, Daniel Voskergian.
Abstract
Medical document classification is an active research problem and among the most challenging in the text classification domain. Medical datasets often contain massive feature sets in which many features are irrelevant or redundant and add noise, thus reducing classification performance. Therefore, to obtain a more accurate classification model, it is crucial to choose the set of features (terms) that best discriminates between the classes of medical documents. This study proposes TextNetTopics, a novel approach that performs feature selection over a bag of topics (BOT) rather than the traditional bag of words (BOW); that is, it selects topics rather than individual words. TextNetTopics is based on the generic G-S-M (Grouping, Scoring, and Modeling) approach developed by Yousef and his colleagues and used mainly with biological data. The approach scores the topics and selects the top ones for training the classifier. In this study, TextNetTopics was applied to textual data in response to the CAMDA challenge. It outperforms various feature selection approaches while also performing well when the model is applied to the validation data provided by CAMDA. Additionally, we have applied our algorithm to different textual datasets.
Keywords: feature reduction; feature selection; grouping; latent dirichlet allocation (LDA); medical documents; ranking; text classification; topics detection
Year: 2022 PMID: 35795215 PMCID: PMC9251539 DOI: 10.3389/fgene.2022.893378
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1 The two commonly utilized LDA representation schemes (topic words as features and topic distributions as features).
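The second scheme (topic distributions as features) can be illustrated with a minimal sketch; the corpus, topic count, and use of scikit-learn are illustrative assumptions, not the paper's actual setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus of stemmed tokens, in the style of the topic table below.
docs = [
    "patient disease treatment clinic risk",
    "cell protein express viru hepat",
    "patient cancer therapi surviv month",
    "infect viru hbv hcv cell",
]

X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Scheme 2: represent each document by its topic-probability distribution.
doc_topic = lda.transform(X)  # shape (n_docs, n_topics); each row sums to 1
```

Each document is thus reduced from a sparse word-count vector to a dense vector of `n_topics` probabilities.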
FIGURE 2 The TextNetTopics general approach is based on four main components: T for creating topics, G for generating topic-based sub-datasets, S for scoring/ranking topics, and M for creating and evaluating the model.
FIGURE 3 Component T working principle: extracting topics from a preprocessed dataset utilizing LDA.
Examples of topics detected by LDA applied to the CAMDA dataset.
| Topic id | List of words |
|---|---|
| topic_0 | “patient, disease, studi, risk, rate, percent, transplant, clinic, treatment, compar” |
| topic_1 | “patient, treatment, therapi, studi, infect, clinic, receiv, week, safeti, efficaci” |
| topic_3 | “patient, treatment, respons, cancer, studi, surviv, receiv, therapi, month, phase” |
| topic_4 | “cell, hepat, infect, viru, hcv, activ, hbv, respons, protein, express” |
FIGURE 4 An example of how a sub-dataset is generated from the terms belonging to a topic and then passed to the Scoring Component S.
FIGURE 5 Performing internal cross-validation on the topic-based sub-dataset to assign a score to the associated topic.
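Components G and S together can be sketched as: slice the document-term matrix down to one topic's words, then score that sub-dataset by internal cross-validation. The random data, topic-to-column assignment, and choice of random forest are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((40, 12))   # toy document-term matrix (40 docs, 12 terms)
y = np.array([0, 1] * 20)  # binary relevance labels

# Hypothetical topics, each given as the column indices of its words.
topics = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

def score_topic(cols):
    """Component G builds the topic-based sub-dataset; Component S scores
    it by internal cross-validation."""
    sub = X[:, cols]
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    return cross_val_score(clf, sub, y, cv=3).mean()

scores = [score_topic(t) for t in topics]
# Rank topics by their cross-validated score (best first).
ranking = sorted(range(len(topics)), key=lambda i: scores[i], reverse=True)
```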
FIGURE 6 The working principle of Component M: finding the combination of topic terms that provides the best performance.
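Component M can be sketched as accumulating the top-ranked topics one level at a time and evaluating a classifier at each level, which is what the "#Accumulated_Topics" tables below report. The data, topic ranking, and logistic-regression classifier here are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((60, 12))   # toy document-term matrix
y = np.array([0, 1] * 30)

# Topics assumed already ranked by Component S (best first);
# each topic is the list of column indices of its words.
ranked_topics = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

results = []
cols = []
for k, topic in enumerate(ranked_topics, start=1):
    cols += topic  # accumulate the terms of the top-k topics
    acc = cross_val_score(
        LogisticRegression(max_iter=1000), X[:, cols], y, cv=3
    ).mean()
    results.append((k, len(cols), acc))  # (#Accumulated_Topics, #Words, Accuracy)
```

The level with the best score then determines the final feature set used to train the model.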
Distribution of classes in the dataset samples.
| | The number of relevant papers | The number of non-relevant papers |
|---|---|---|
| Full dataset | ∼14,000 | ∼14,000 |
| Training DS | 7,097 | 7,026 |
| Validation DS#1 | 14,211 papers (labels withheld for final performance testing) | |
| Validation DS#2 | 2,000 papers (labels withheld for final performance testing) | |
TextNetTopics performance metrics for different values of n_topics and n_topics_words for the CAMDA dataset over ten iterations. The top ∼70 features are considered for comparison.
| #Topics | #Words | #Terms (mean) | Accuracy | Sensitivity | Specificity | F1 score | AUC | Precision |
|---|---|---|---|---|---|---|---|---|
| 10 | 10 | 67 | 0.914 | 0.905 | 0.923 | 0.913 | 0.969 | 0.922 |
| 20 | 10 | 70 | 0.917 | 0.911 | 0.922 | 0.917 | 0.972 | 0.922 |
| 40 | 10 | 65 | 0.921 | 0.917 | 0.925 | 0.921 | 0.971 | 0.925 |
| 60 | 10 | 64.5 | 0.919 | 0.918 | 0.919 | 0.919 | 0.967 | 0.920 |
| 10 | 20 | 69.0 | 0.921 | 0.915 | 0.926 | 0.920 | 0.974 | 0.926 |
| 20 | 20 | 75.2 | 0.924 | 0.920 | 0.927 | 0.924 | 0.970 | 0.928 |
| 40 | 20 | 73.2 | 0.917 | 0.913 | 0.921 | 0.917 | 0.968 | 0.921 |
| 60 | 20 | 73.0 | 0.917 | 0.920 | 0.914 | 0.917 | 0.966 | 0.915 |
| 10 | 30 | 75.3 | 0.920 | 0.917 | 0.924 | 0.920 | 0.973 | 0.924 |
| 20 | 30 | 76.5 | 0.920 | 0.915 | 0.926 | 0.920 | 0.972 | 0.926 |
TextNetTopics performance over the top topics for the CAMDA dataset. The #Accumulated_Topics column gives the number of top-ranked topics accumulated; #Words is the average number of words at each level over the 100 iterations.
| #Accumulated_Topics | #Words | Accuracy | Sensitivity | Specificity | F1 score | AUC | Precision |
|---|---|---|---|---|---|---|---|
| 11 | 104.14 | 0.93 | 0.92 | 0.93 | 0.93 | 0.98 | 0.93 |
| 10 | 94 | 0.93 | 0.92 | 0.93 | 0.93 | 0.98 | 0.93 |
| 8 | 75.3 | 0.92 | 0.92 | 0.92 | 0.92 | 0.97 | 0.93 |
| 6 | 62 | 0.92 | 0.92 | 0.92 | 0.92 | 0.97 | 0.92 |
| 4 | 49 | 0.91 | 0.91 | 0.91 | 0.91 | 0.96 | 0.91 |
| 2 | 31.94 | 0.89 | 0.90 | 0.89 | 0.89 | 0.95 | 0.89 |
| 1 | 20 | 0.87 | 0.88 | 0.87 | 0.88 | 0.93 | 0.87 |
TextNetTopics performance results over two validation datasets provided by CAMDA.
| | Accuracy/stdv | Recall | Precision | F1 score |
|---|---|---|---|---|
| TextNetTopics/V1 | 0.92 | 0.92 | 0.92 | 0.92 |
| TextNetTopics/V2 | 0.87 | 0.94 | 0.82 | 0.88 |
Results for different algorithms with different feature selection methods applied to the CAMDA dataset. The top 100 features are considered for each feature selection algorithm. The standard deviation of the accuracy appears after the slash in the Accuracy/stdv column.
| Classifier | FS | Accuracy/stdv | Recall | Precision | F1 score |
|---|---|---|---|---|---|
| TextNetTopics | | 0.93 | | | |
| Adaboost | XGBOOST | 0.79/0.05 | 0.79 | 0.82 | 0.79 |
| DT | XGBOOST | 0.76/0.05 | 0.78 | 0.78 | 0.76 |
| LogitBoost | XGBOOST | 0.79/0.04 | 0.79 | 0.82 | 0.79 |
| RF | XGBOOST | 0.77/0.05 | 0.81 | 0.78 | 0.78 |
| Adaboost | SKB | 0.91/0.007 | 0.90 | 0.92 | 0.91 |
| DT | SKB | 0.88/0.01 | 0.87 | 0.88 | 0.88 |
| LogitBoost | SKB | 0.91/0.008 | 0.91 | 0.91 | 0.91 |
| RF | SKB | | 0.93 | 0.92 | |
| Adaboost | FCB | 0.70/0.03 | 0.89 | 0.65 | 0.75 |
| DT | FCB | 0.52/0.05 | | 0.52 | 0.67 |
| LogitBoost | FCB | 0.71/0.03 | 0.89 | 0.65 | 0.75 |
| RF | FCB | 0.57/0.08 | 0.90 | 0.56 | 0.68 |
Bold values represent the highest values in each metric column.
TextNetTopics performance over the top topics for the PubMed 20k RCT dataset. The #Accumulated_Topics column gives the number of top-ranked topics accumulated; #Words is the average number of words at each level over the 100 iterations.
| #Accumulated_Topics | #Words | Accuracy | Sensitivity | Specificity | F1 score | AUC | Precision |
|---|---|---|---|---|---|---|---|
| 10 | 74.96 | 0.83 | 0.74 | 0.90 | 0.81 | 0.88 | 0.84 |
| 8 | 65.54 | 0.83 | 0.74 | 0.90 | 0.80 | 0.88 | 0.84 |
| 6 | 54.26 | 0.82 | 0.71 | 0.90 | 0.76 | 0.87 | 0.83 |
| 4 | 41.74 | 0.81 | 0.69 | 0.90 | 0.75 | 0.86 | 0.82 |
| 2 | 27.54 | 0.80 | 0.66 | 0.90 | 0.73 | 0.84 | 0.82 |
| 1 | 17.94 | 0.79 | 0.63 | 0.90 | 0.71 | 0.82 | 0.82 |
Results for different algorithms with different feature selection methods applied to the PubMed 20k RCT dataset. The top 60 features are considered for each feature selection algorithm.
| Classifier | FS | Accuracy | Recall | Specificity | F1 score | AUC | Precision |
|---|---|---|---|---|---|---|---|
| TextNetTopics | | 0.83 | 0.74 | | 0.84 | 0.88 | 0.80 |
| Adaboost | XGBOOST | 0.86 | 0.82 | 0.89 | | 0.92 | 0.83 |
| DT | XGBOOST | 0.80 | 0.73 | 0.84 | 0.76 | 0.79 | 0.74 |
| LogitBoost | XGBOOST | | 0.83 | | | | |
| RF | XGBOOST | 0.85 | 0.81 | 0.88 | 0.83 | 0.91 | 0.82 |
| Adaboost | SKB | 0.86 | 0.81 | 0.89 | 0.84 | 0.92 | 0.83 |
| DT | SKB | 0.79 | 0.72 | 0.84 | 0.76 | 0.79 | 0.74 |
| LogitBoost | SKB | 0.86 | 0.83 | 0.89 | 0.84 | | 0.83 |
| RF | SKB | 0.85 | 0.82 | 0.88 | 0.82 | 0.91 | 0.82 |
| Adaboost | FCB | 0.62 | 0.84 | 0.47 | 0.53 | 0.66 | 0.65 |
| DT | FCB | 0.49 | | 0.22 | 0.46 | 0.60 | 0.59 |
| LogitBoost | FCB | 0.62 | 0.85 | 0.46 | 0.53 | 0.66 | 0.65 |
| RF | FCB | 0.57 | 0.83 | 0.39 | 0.50 | 0.65 | 0.62 |