Maximilian Legnar, Philipp Daumke, Jürgen Hesser, Stefan Porubsky, Zoran Popovic, Jan Niklas Bindzus, Joern-Helge Heinrich Siemoneit, Cleo-Aron Weis.
Abstract
INTRODUCTION: This study investigates whether a final diagnosis can be predicted from a written nephropathological description, used as a surrogate for image analysis, with various NLP methods.
Keywords: BERT; NLP; deep learning; machine learning; nephropathology; text analysis; text classification; topic modelling; transformer encoder
Year: 2022 PMID: 35885630 PMCID: PMC9325286 DOI: 10.3390/diagnostics12071726
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Figure 1. Flowchart describing the general procedure of the project. After splitting each nephropathological report into its diagnosis and description sections (data preparation), we first applied the clustering task (i) to the diagnosis texts in order to group them into fewer than 20 clusters. After labelling each cluster of diagnosis texts with a corresponding diagnostic group, we applied the classification task (ii) to the description texts to find out whether NLP techniques can predict the correct diagnostic group of a given description text.
Metrics of different cluster-sets.
| Cluster Method | s-Score | cls Accuracy | rel Entropy | Clusters | Corpus Size |
|---|---|---|---|---|---|
| HDBSCAN | 0.587 | 0.951 | 0.588 | 16 | 906 |
| German-BERT | 0.576 | 0.856 | 0.618 | 13 | 759 |
| top2vec | 0.545 | 0.372 | 0.780 | 18 | 1026 |
| Patho-BERT | 0.536 | 0.848 | 0.531 | 17 | 757 |
| LDA | 0.517 | 0.581 | 0.611 | 7 | 1107 |
| k-means | 0.038 | 0.905 | 0.612 | 10 | 1107 |
| GSDPMM | 0.033 | 0.805 | 0.675 | 14 | 1107 |
We used the silhouette score (s-score), relative entropy (rel entropy) and the support vector machine (SVM)-based classification performance (cls accuracy) to evaluate and compare the cluster-sets generated with different cluster methods (far left column). The Clusters column indicates how many clusters each method generated. Corpus size indicates how many reports remained after clustering, since several reports were identified as outliers and removed. HDBSCAN has the best silhouette score (0.587) as well as the best cls accuracy (0.951). Although top2vec has an acceptable silhouette score (0.545), it is notable for its very poor predictability (cls accuracy: 0.372). Conversely, k-means and GSDPMM have low silhouette scores (0.038 and 0.033), yet their clusters are well predictable (cls accuracy: 0.905 and 0.805).
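To make the s-score column concrete, the following is a minimal sketch of the silhouette score in plain Python. It assumes Euclidean distance on low-dimensional points and that outliers have already been removed; the embedding, distance metric and library used in the study are not stated here, so the function name `silhouette_score` and the toy points are purely illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (dist(p, p) == 0, so summing over `own` and dividing by n - 1 works)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
tidy = silhouette_score(toy_points, [0, 0, 1, 1])   # matches the geometry
mixed = silhouette_score(toy_points, [0, 1, 0, 1])  # cuts across the groups
```

A score close to 1 indicates compact, well-separated clusters, while values near or below 0 indicate overlapping clusters, which is how the low values for k-means and GSDPMM in the table can be read.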
Figure 2. UMAP and PCA of different cluster-sets. UMAP representations of the cluster-sets generated with (a) LDA, (c) HDBSCAN, (d) top2vec, (e) German-BERT, (f) Patho-BERT, (g) k-means and (h) GSDPMM. The LDA cluster-set is also shown as a PCA projection in (b). Each data point represents the diagnosis section of one report. The data points are coloured according to their clusters. Black points represent outliers that were not assigned to any cluster. The clusters of top2vec and HDBSCAN in particular appear tidy and well separated. The clusters of k-means and GSDPMM appear less well separated, probably in part because these methods do not remove any data points as outliers.
Annotated topic words (translated from German to English), extracted from the HDBSCAN cluster-set using the tf–idf-based extraction method. Cluster names (left column) highlighted in green are strongly indicated by a particularly large number of topic words (strong cluster names). For cluster names marked in orange, only a few topic words indicate the specified cluster name (weak cluster names). The same colour coding applies to the topic words: those that strongly indicate a cluster name are highlighted in green (strong topic words), and those that only weakly indicate a cluster name are highlighted in orange (weak topic words).
| Cluster Index-Cluster Name | Keywords According to tf–idf |
|---|---|
| 0 | scale, chronicity_index, class, activity_index, nih |
| 1 | quantity, glomeruli |
| 2 | approx, concerning, cortex, minor, immunostaining, damage, included, moderate, chronic, supplementary |
| 3 | of_this, glomeruli, total_amount, intact |
| 4 | |
| 5 | |
| 6 | |
| 7 | nephropathy |
| 8 | |
| 9 | renal_parenchyma |
| 10 | cut_level, hardly, noteworthy, chronic, tubulointerstitial_damage, deep, so_far, processing |
| 11 | microscopy, conventional, requirement, result, renal_parenchyma, foresee, chronic, nephrosclerosis, mild, examination |
| 12 | |
| 13 | |
| 14 | |
| 15 | a_mild |
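As an illustration of tf–idf-based keyword extraction per cluster, here is a minimal sketch in plain Python. It assumes a class-based weighting in which all diagnosis texts of a cluster are merged into one pseudo-document before scoring; the exact weighting and preprocessing used in the study are not given here, and the function `top_keywords` and the toy corpus are hypothetical.

```python
from collections import Counter
from math import log


def top_keywords(cluster_docs, k=3):
    """cluster_docs maps cluster id -> list of token lists.

    Returns cluster id -> the k terms with the highest tf-idf score,
    where each cluster is treated as a single merged document.
    """
    counts = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in cluster_docs.items()
    }
    n_clusters = len(counts)

    # Document frequency: in how many clusters does each term occur?
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    result = {}
    for cid, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (n / total) * log(n_clusters / df[term])
            for term, n in term_counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cid] = ranked[:k]
    return result


# Hypothetical toy corpus built from tokens that appear in the tables.
toy = {
    0: [["activity_index", "nih", "class"],
        ["nih", "chronicity_index", "glomeruli"]],
    4: [["oxford_classification", "iga_glomerulonephritis", "glomeruli"],
        ["oxford_classification", "m1"]],
}
keywords = top_keywords(toy, k=2)
```

In this toy corpus the token `glomeruli` occurs in both clusters, so its inverse document frequency is zero and it is never selected, whereas cluster-specific tokens such as `nih` rise to the top.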
Annotated topic words (translated from German to English), extracted from the HDBSCAN cluster-set using the SVM-based extraction method. Cluster names (left column) highlighted in green are strongly indicated by a particularly large number of topic words (strong cluster names). For cluster names marked in orange, only a few topic words indicate the specified cluster name (weak cluster names). The same colour coding applies to the topic words: those that strongly indicate a cluster name are highlighted in green (strong topic words), and those that only weakly indicate a cluster name are highlighted in orange (weak topic words).
| Cluster Index-Cluster Name | Keywords According to SVM |
|---|---|
| 0 | scale, chronicity_index, activity_index, class, -nih, iv |
| 1 | quantity, sclerosing, glomeruli |
| 2 | |
| 3 | of_this, total_amount, intact |
| 4 | oxford_classification, e0, s1, m1, t0, c0, iga_glomerulonephritis, s0, applicable, e1 |
| 5 | |
| 6 | |
| 7 | cast_nephropathy |
| 8 | |
| 9 | renal_parenchyma |
| 10 | hardly, cut_level, noteworthy, deep, so_far, processing, using, congo_red_coloring, to_exclusion, cellularor |
| 11 | microscopy, conventional, requirement, foresee, mild, membranous, early, cell_proliferation, result, g |
| 12 | membranous, proteinuria, as_a_result, glomerulonephritis |
| 13 | |
| 14 | global |
| 15 | |
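One plausible reading of SVM-based keyword extraction is to train a linear SVM per cluster (one-vs-rest) on bag-of-words counts and rank the vocabulary by the learned weights. The sketch below follows that assumption with a small hinge-loss subgradient descent in plain Python; the study's actual procedure and hyperparameters are not specified here, and `train_linear_svm`, `svm_keywords` and the toy documents are hypothetical.

```python
def train_linear_svm(X, y, epochs=50, lr=0.1, reg=0.01):
    """Binary linear SVM (hinge loss + L2 penalty), plain subgradient descent.

    X: list of feature vectors, y: labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularisation shrinks every weight a little on each step.
            w = [wj * (1 - lr * reg) for wj in w]
            if margin < 1:  # sample is inside the margin or misclassified
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def svm_keywords(docs, labels, k=3):
    """Rank terms per cluster by the weight of a one-vs-rest linear SVM."""
    vocab = sorted({tok for doc in docs for tok in doc})
    X = [[float(doc.count(tok)) for tok in vocab] for doc in docs]
    result = {}
    for cid in sorted(set(labels)):
        y = [1 if lab == cid else -1 for lab in labels]
        w, _ = train_linear_svm(X, y)
        ranked = sorted(range(len(vocab)), key=lambda j: (-w[j], vocab[j]))
        result[cid] = [vocab[j] for j in ranked[:k]]
    return result


# Hypothetical toy documents built from tokens that appear in the tables.
toy_docs = [
    ["nih", "activity_index", "glomeruli"],
    ["nih", "chronicity_index", "glomeruli"],
    ["oxford_classification", "m1", "glomeruli"],
    ["oxford_classification", "s1", "glomeruli"],
]
toy_labels = [0, 0, 4, 4]
svm_kw = svm_keywords(toy_docs, toy_labels, k=3)
```

Unlike tf–idf, the ranking here is discriminative: a term receives a large weight only if it helps separate one cluster from all others, which may explain why the two extraction methods yield overlapping but not identical keyword lists.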
Figure A1. UMAP of the cluster-sets generated with (a) LDA, (c) HDBSCAN, (d) top2vec, (e) German-BERT, (f) Patho-BERT, (g) k-means and (h) GSDPMM. The LDA cluster-set is also shown as a PCA projection in (b). Each dot colour represents a different author. The authors of the reports marked in black are unknown (e.g., because multiple authors were involved).
Performance of different classification models, trained with the HDBSCAN cluster-set.
| Classifier | F1-Score | Cohen’s Kappa Coefficient |
|---|---|---|
| Patho-BERT | 0.667 | 0.631 |
| SGD-classifier | 0.644 | 0.598 |
| MLP-classifier | 0.639 | 0.599 |
| German-BERT | 0.610 | 0.572 |
| Logistic Regression | 0.589 | 0.567 |
| CNN + embeddings | 0.523 | 0.450 |
| RNN + embeddings | 0.464 | 0.394 |
| Multinomial NB | 0.442 | 0.370 |
F1-score and Cohen’s kappa coefficient of the tested classification methods, which were trained to predict the HDBSCAN-clustered data set. Each score was determined with ten-fold cross-validation. The transformer-based model Patho-BERT and the SVM-based SGD-classifier performed best.
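For readers unfamiliar with Cohen's kappa, the following plain-Python sketch shows how it corrects the observed agreement between true and predicted labels for the agreement expected by chance. The function name and the toy labels are illustrative only and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Chance agreement from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(
        true_counts[c] * pred_counts[c] for c in true_counts
    ) / (n * n)

    if expected == 1.0:
        # Degenerate case (a single class everywhere): kappa is undefined,
        # this sketch returns 0.0 by convention.
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels (cluster indices).
toy_true = [2, 2, 2, 2, 3, 3, 1, 1, 0, 0]
toy_pred = [2, 2, 2, 3, 3, 3, 1, 2, 2, 0]
kappa = cohens_kappa(toy_true, toy_pred)
```

In this toy example the observed agreement is 0.7 and the chance agreement is 0.3, so kappa is 0.4 / 0.7, roughly 0.571. Because it discounts chance agreement, kappa is typically lower than raw accuracy on imbalanced data sets such as the one described here.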
Figure 3. Confusion matrices of the classification models: (a) German-BERT, (b) Patho-BERT, (c) the SVM-based SGD-classifier and (d) the MLP-classifier. The brightness of a cell indicates how many times the class on the x-axis was predicted by the classifier; the true class is given by the index on the y-axis. Interestingly, some classes were recognized well by all classifiers, including the weaker ones, e.g., class 1 (rapid progressive glomerulonephritis), 2 (tubulo-interstitial nephritis) and 3 (pauci immune glomerulonephritis). Although the transformer-based classifiers (a,b) generally performed better, the bag-of-words (BoW)-based methods (c,d) were better at detecting class 0 (systemic lupus erythematosus) and class 5 (fsgn).
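The layout described in the caption (true class on the y-axis, predicted class on the x-axis) corresponds to a matrix whose rows are indexed by the true class and whose columns are indexed by the predicted class. A minimal sketch, with a hypothetical function name and toy labels:

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows = true class (y-axis), columns = predicted class (x-axis)."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical toy labels for three classes.
cm = confusion_matrix(
    y_true=[0, 0, 1, 2, 2, 2],
    y_pred=[2, 0, 1, 2, 2, 1],
    classes=[0, 1, 2],
)
```

Cells on the diagonal count correct predictions, so a bright diagonal in the figure corresponds to a well-recognized class, and bright off-diagonal cells show which classes are confused with each other.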
Classification performance of the Patho-BERT-classifier, predicting the HDBSCAN cluster-set.
| Cluster/Diagnostic Group | F1-Score | Support |
|---|---|---|
| 3 | 0.892 | 72 |
| 2 | 0.880 | 324 |
| 1 | 0.847 | 51 |
| 8 | 0.728 | 76 |
| 4 | 0.601 | 71 |
| 15 | 0.545 | 78 |
| 13 | 0.529 | 56 |
| 12 | 0.417 | 26 |
| 10 | 0.367 | 23 |
| 9 | 0.364 | 31 |
| 7 | 0.333 | 19 |
| 14 | 0.312 | 19 |
| 11 | 0.160 | 17 |
| 0 | 0.000 | 18 |
| 5 | 0.000 | 14 |
| 6 | 0.000 | 11 |
Cluster predictability of the HDBSCAN cluster-set, using Patho-BERT as classifier. Predictability was measured with the F1-score, and the table is sorted by descending F1-score. Each F1-score is the result of a ten-fold cross-validation (average of ten test measurements). The support specifies how many documents a cluster consists of. Cluster 3 has the highest F1-score (0.892). Cluster 2 has particularly strong support, which means that it is particularly large (324 documents) and was therefore seen often during training. The smaller clusters in particular were recognized only with difficulty or not at all: clusters 0, 5 and 6, each with fewer than 20 documents, have an F1-score of 0.000.
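The per-cluster F1-score and support reported above can be derived from true and predicted labels as in the following sketch. It evaluates a single set of predictions rather than an average over ten folds, and the function name and toy labels are hypothetical.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: (f1, support)} for every class present in y_true."""
    report = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        report[c] = (f1, tp + fn)  # support = number of true members
    return report


# Hypothetical toy labels: one large, one medium and one tiny cluster.
toy_true = [2, 2, 2, 2, 2, 2, 3, 3, 3, 6]
toy_pred = [2, 2, 2, 2, 2, 3, 3, 3, 2, 2]
report = per_class_f1(toy_true, toy_pred)

# Sort by descending F1-score, as in the table.
ranked = sorted(report.items(), key=lambda item: -item[1][0])
```

In the toy data the single member of cluster 6 is absorbed by the large cluster 2, giving an F1-score of 0 with a support of 1, which mirrors the pattern seen for the smallest clusters in the table.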
Annotated German topic words, extracted from the HDBSCAN cluster-set using the tf–idf-based extraction method. Cluster names (left column) highlighted in green are strongly indicated by a particularly large number of topic words (strong cluster names). For cluster names marked in orange, only a few topic words indicate the specified cluster name (weak cluster names). The same colour coding applies to the topic words: those that strongly indicate a cluster name are highlighted in green (strong topic words), and those that only weakly indicate a cluster name are highlighted in orange (weak topic words).
| Cluster Index-Cluster Name | Keywords According to tf–idf |
|---|---|
| 0 | skala, chronizitätsindex, klasse, aktivitätsindex, nih |
| 1 | anzahl, glomeruli |
| 2 | ca, betreffend, cortex, leicht, immunfärbung, schädigung, miterfasst, mäßig, chronisch, ergänzend |
| 3 | hiervon, glomeruli, gesamtzahl, intakt |
| 4 | |
| 5 | |
| 6 | |
| 7 | nephropathie |
| 8 | |
| 9 | nierenparenchym |
| 10 | schnittstufe, kaum, nennenswert, chronisch, tubulointerstitieller_schaden, tief, bislang, aufarbeitung |
| 11 | mikroskopie, konventionell, maßgabe, ergeben, nierenparenchym, absehen, chronisch, nephrosklerose, leichtgradigen, untersuchung |
| 12 | |
| 13 | |
| 14 | |
| 15 | leichtgradiger |
Annotated German topic words, extracted from the HDBSCAN cluster-set using the SVM-based extraction method. Cluster names (left column) highlighted in green are strongly indicated by a particularly large number of topic words (strong cluster names). For cluster names marked in orange, only a few topic words indicate the specified cluster name (weak cluster names). The same colour coding applies to the topic words: those that strongly indicate a cluster name are highlighted in green (strong topic words), and those that only weakly indicate a cluster name are highlighted in orange (weak topic words).
| Cluster Index-Cluster Name | Keywords According to SVM |
|---|---|
| 0 | skala, chronizitätsindex, aktivitätsindex, klasse, -nih, iv |
| 1 | anzahl, sklerosieren, glomeruli |
| 2 | |
| 3 | hiervon, gesamtzahl, intakt |
| 4 | oxford-klassifikation, e0, s1, m1, t0, c0, iga-glomerulonephritis, s0, anwendbar, e1 |
| 5 | |
| 6 | |
| 7 | cast-nephropathie |
| 8 | |
| 9 | nierenparenchym |
| 10 | kaum, schnittstufe, nennenswert, tief, bislang, aufarbeitung, mittels, kongorot-färbung, zumausschluss, zelluläreoder |
| 11 | mikroskopie, konventionell, maßgabe, absehen, leichtgradigen, membranösen, früh, zellvermehrung, ergeben, g |
| 12 | membranöse, proteinurie, infolge, glomerulonephritis |
| 13 | |
| 14 | global |
| 15 | |