Naseem Cassim, Michael Mapundu, Victor Olago, Turgay Celik, Jaya Anna George, Deborah Kim Glencross.
Abstract
BACKGROUND: Prostate cancer (PCa) is the leading male neoplasm in South Africa with an age-standardised incidence rate of 68.0 per 100,000 population in 2018. The Gleason score (GS) is the strongest predictive factor for PCa treatment and is embedded within semi-structured prostate biopsy narrative reports. The manual extraction of the GS is labour-intensive. The objective of our study was to explore the use of text mining techniques to automate the extraction of the GS from irregularly reported text-intensive patient reports.
Keywords: Algorithm; Gleason score; Late presentation; Prostate cancer; Public health; Text mining
Year: 2021 PMID: 34823522 PMCID: PMC8614040 DOI: 10.1186/s12911-021-01697-2
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Example of the semi-structured narrative prostate biopsy report
| Category | Biopsy report |
|---|---|
| Biopsy report | EPISODE NUMBER: ABC1234 |
The narrative biopsy report included the headings clinical history, macroscopy and pathological diagnosis
PSA: prostate specific antigen; MM: millimetre; P63: Protein 63; CK5/6: Cytokeratin 5/6
Fig. 1 Diagram describing the logical processes used to analyse the raw narrative prostate biopsy report to generate the discovered knowledge. The steps were as follows: (i) data acquisition, (ii) pre-processing, (iii) feature extraction, (iv) feature value representation, (v) feature selection, (vi) information extraction, (vii) classification and (viii) discovered knowledge
N-grams feature extraction output for a sample of biopsies
| [‘major 4 minor 5’, ‘4 + 5’] | [‘4 + 4’, ‘4 + 4’] | [‘3 + 3’, ‘3 + 3’] | [‘4 + 4’, ‘4 + 4’] | [‘major 4 + minor 3’] | [‘4 + 5’] |
|---|---|---|---|---|---|
| [‘major 4 minor 5’, ‘4 + 5’] | [‘major 4 minor 4’] | [‘4 + 4’, ‘4 + 4’] | [‘4 + 4’, ‘4 + 4’] | [‘major 5 minor 4’] | [‘major 3 minor 5’] |
| [‘3 + 3’, ‘3 + 3’] | [‘2 + 2’] | [‘3 + 4’, ‘3 + 4’] | [‘3 + 2’, ‘3 + 3’, ‘3 + 3’] | [‘major 4 + minor 5’] | [‘major 3 minor 4’] |
| [‘3 + 5’] | [‘3 + 3’, ‘3 + 3’] | [‘2 + 2’, ‘2 + 2’, ‘2 + 2’] | [‘3 + 5’, ‘3 + 5’] | [‘4 + 3’] | [‘3 + 5’, ‘major 4 + minor 5’] |
| [‘major 4 minor 5’] | [‘4 + 3’, ‘4 + 3’] | [‘3 + 2’, ‘3 + 5’, ‘3 + 5’] | [‘3 + 3’] | [‘major 5 minor 4’] | [‘major 3 minor 5’] |
| [‘4 + 3’] | [‘2 + 3’, ‘2 + 3’] | [‘2 + 2’] | [‘3 + 2’, ‘3 + 2’] | [‘major 5 + minor 4’] | [‘major 5 + minor 4’] |
| [‘major 4 minor 3’] | [‘3 + 3’, ‘3 + 3’] | [‘4 + 4’, ‘4 + 4’] | [‘3 + 4’, ‘3 + 4’] | [‘major 3 minor 4’] | [‘major 4 minor 5’] |
| [‘3 + 2’, ‘3 + 2’] | [‘4 + 3’, ‘4 + 3’] | [‘2 + 3’, ‘3 + 4’, ‘3 + 4’] | [‘4 + 3’, ‘4 + 3’] | [‘major 5 minor 5’] | [‘major 3 minor 4’] |
| [‘major 4 + minor 3’] | [‘3 + 3’, ‘3 + 3’] | [‘4 + 4’, ‘4 + 4’] | [‘3 + 3’, ‘3 + 3’] | [‘major 5 minor 4’] | [‘major 4 minor 5’] |
| [‘5 + 5’, ‘5 + 5’] | [‘3 + 3’, ‘3 + 3’] | [‘5 + 4’, ‘5 + 4’] | [‘4 + 5’] | [‘major 3 minor 3’] | [‘major 5 minor 3’] |
| [‘major 3 minor 4’] | [‘major 5 + minor 5’] | [‘major 5 minor 4’] | [‘major 3 minor 5’] | [‘major 4 minor 5’] |
Fig. 2 Horizontal bar graph depicting the top twenty occurring unigrams (A), bigrams (B), trigrams (C) and quadgrams (D). The number of occurrences is displayed on the x-axis
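The n-gram counts summarised in the figure can be illustrated with a minimal sketch. It assumes naive lower-casing and whitespace tokenisation, which may differ from the pre-processing used in the study; the function names `ngrams` and `top_ngrams` and the sample snippets are hypothetical and not taken from the study data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(reports, n, k=20):
    """Count n-grams across all reports and return the k most frequent."""
    counts = Counter()
    for text in reports:
        # Assumption: simple lower-casing and whitespace tokenisation.
        tokens = text.lower().split()
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Made-up snippets for illustration only.
sample_reports = [
    "gleason score 4 + 3 = 7",
    "gleason score 3 + 3 = 6",
    "major 4 minor 5",
]

for n in (1, 2, 3, 4):  # unigrams, bigrams, trigrams, quadgrams
    print(n, top_ngrams(sample_reports, n, k=3))
```

Calling `top_ngrams` with `n` set to 1 through 4 and `k=20` would give the kind of ranked lists that the four panels plot.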
Performance of the text mining algorithm to automate the extraction of the Gleason score from narrative prostate biopsy reports
| | Manual coding: exact match, yes | Manual coding: exact match, no |
|---|---|---|
| Predicted: exact match, yes | 984 | 0 |
| Predicted: exact match, no | 16 | 0 |
| Precision = 1.00 | | |
| Recall = 0.98 | | |
| F-score = 0.99 | | |
A contingency table was used to compare the manually coded and algorithm-predicted values. Precision, recall and F-score were reported for the first and updated text mining algorithm output as well as for the validation dataset.
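The reported metrics can be recomputed from the contingency table. The sketch below assumes the 984 reports matched by both the algorithm and manual coding are true positives, the 16 reports matched only by manual coding are false negatives, and there are no false positives; the function name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts read from the contingency table (assumed cell mapping, see above).
p, r, f = precision_recall_f1(tp=984, fp=0, fn=16)
print(f"precision={p:.2f} recall={r:.2f} f-score={f:.2f}")
# prints: precision=1.00 recall=0.98 f-score=0.99
```

Under this reading of the cells, the rounded values agree with the precision, recall and F-score given in the table.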
Different Gleason score formats reported for the study
| # | Extracted score | As reported in the biopsy report |
|---|---|---|
| 1 | 5 + 4 = 9 | 5, 4 |
| 2 | 5 + 4 = 9 | 5 PLUS 4 EQUALS 9 |
| 3 | 3 + 3 = 6 | 3 + 3 = 6 OR 3 + 3 |
| 4 | 3 + 5 = 8 | MAJOR PATTERN 3, MINOR PATTERN 5 |
| 5 | 4 + 3 = 7 | MAJOR PATTERN: 4/5 MINOR PATTERN: 3/5 |
| 6 | 4 + 3 = 7 | MAJOR 4 PLUS MINOR 3 EQUALS 7 |
| 7 | 5 + 3 = 8 | SCORE 8 (MAJOR 5; MINOR 3) |
| 8 | 3 + 4 = 7 | 7 (3 + 4) |
| 9 | 4 + 3 = 7 | (4 + 3) = 7 |
| 10 | 3 + 4 = 7 | 3 (MAJOR) + 4 (MINOR) = 7/10 |
The clean extracted score and the original value as written in the prostate biopsy report are shown for the study dataset
Frequency of the five most commonly reported Gleason scores, with the remaining values grouped and reported as “Others”
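One way to map the ten reporting formats in the table onto a single clean form is a small set of ordered regular expressions. This is a sketch only and not the algorithm used in the study: the patterns, their ordering and the name `normalise_gleason` are assumptions, and it relies on Gleason pattern grades being single digits from 1 to 5.

```python
import re

# A single pattern grade (1-5) that is not part of a longer number.
_GRADE = r"(?<!\d)([1-5])(?!\d)"

# Ordered from most to least specific (hypothetical patterns).
_PATTERNS = [
    # e.g. "3 (MAJOR) + 4 (MINOR) = 7/10"
    re.compile(_GRADE + r"\s*\(MAJOR\)\D*?" + _GRADE + r"\s*\(MINOR\)", re.I),
    # e.g. "MAJOR PATTERN: 4/5 MINOR PATTERN: 3/5", "SCORE 8 (MAJOR 5; MINOR 3)"
    re.compile(r"MAJOR\D*" + _GRADE + r".*?MINOR\D*" + _GRADE, re.I),
    # e.g. "5, 4", "5 PLUS 4 EQUALS 9", "7 (3 + 4)", "(4 + 3) = 7"
    re.compile(_GRADE + r"\s*(?:\+|PLUS|,)\s*" + _GRADE, re.I),
]


def normalise_gleason(text):
    """Return 'major + minor = total' for the first recognised format, else None."""
    for pattern in _PATTERNS:
        match = pattern.search(text)
        if match:
            major, minor = int(match.group(1)), int(match.group(2))
            # The total is recomputed rather than read from the report text.
            return f"{major} + {minor} = {major + minor}"
    return None


print(normalise_gleason("SCORE 8 (MAJOR 5; MINOR 3)"))
print(normalise_gleason("MAJOR PATTERN: 4/5 MINOR PATTERN: 3/5"))
```

Recomputing the total from the two grades sidesteps denominators such as "/5" and "/10", at the cost of not flagging reports whose stated total disagrees with the grades.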
| No | Study dataset: Gleason score | Study dataset: n | Study dataset: % | Validation dataset: Gleason score | Validation dataset: n | Validation dataset: % |
|---|---|---|---|---|---|---|
| 1 | 5 + 4 = 9 | 176 | 17.6 | 3 + 3 = 6 | 377 | 37.7 |
| 2 | 3 + 3 = 6 | 175 | 17.5 | 3 + 4 = 7 | 194 | 19.4 |
| 3 | 4 + 3 = 7 | 164 | 16.4 | 4 + 3 = 7 | 149 | 14.9 |
| 4 | 3 + 4 = 7 | 147 | 14.7 | 4 + 4 = 8 | 100 | 10.0 |
| 5 | 4 + 4 = 8 | 142 | 14.2 | 4 + 5 = 9 | 74 | 7.4 |
| 6 | Others | 196 | 19.6 | Others | 106 | 10.6 |
| | Total | 1000 | 100 | Total | 1000 | 100 |
| | High-Risk GS ≥ 8 | 318 | 31.8 | High-Risk GS ≥ 8 | 174 | 17.4 |
Data are reported for the study dataset as well as for the separate validation dataset
GS: Gleason score
Comparison of low, intermediate and high-risk Gleason scores for the predicted and manually coded values
| GS risk category | Predicted | Manually coded | F-score |
|---|---|---|---|
| Low-risk GS (≤ 6) | 199 | 193 | 0.98 |
| Intermediate-risk GS (7) | 311 | 314 | 1.00 |
| High-risk GS (≥ 8) | 490 | 493 | 1.00 |
| 0.9439 | |||
| Macro-average F-score | 0.99 | ||
| Macro recall | 1.00 | ||
| Macro precision | 0.98 |
The macro-average F-score is reported
GS: Gleason score
An alpha value of 0.05 was used to assess significance
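The risk grouping and the macro-average in the table can be sketched as follows. The thresholds (≤ 6, 7, ≥ 8) are taken from the table rows; the function names are illustrative, and the macro-average is assumed to be the unweighted mean of the per-category F-scores, computed here from the rounded values shown in the table.

```python
def risk_category(major, minor):
    """Map a major and minor Gleason pattern to a risk category."""
    total = major + minor
    if total <= 6:
        return "low"
    if total == 7:
        return "intermediate"
    return "high"


def macro_average(scores):
    """Unweighted mean of per-category scores."""
    scores = list(scores)
    return sum(scores) / len(scores)


# Per-category F-scores as rounded in the table.
per_category_f = {"low": 0.98, "intermediate": 1.00, "high": 1.00}

print(risk_category(4, 3))
print(round(macro_average(per_category_f.values()), 2))
```

Because the inputs are already rounded to two decimals, the result is only an approximation of a macro-average computed from the underlying counts.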