Anthony Nguyen, John O'Dwyer, Thanh Vu, Penelope M Webb, Sharon E Johnatty, Amanda B Spurdle.
Abstract
OBJECTIVE: Medical research studies often rely on the manual collection of data from scanned typewritten clinical records, which can be laborious, time-consuming and error-prone because each clinical record must be reviewed individually. We aimed to use text mining to assist with the extraction of clinical features from complex text-based scanned pathology records for medical research studies.
Keywords: health informatics; information technology; oncology; pathology
Year: 2020 PMID: 32532784 PMCID: PMC7295399 DOI: 10.1136/bmjopen-2020-037740
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 2.692
Figure 1. Redacted scanned pathology report.
Figure 2. Extract of abstraction form for standardised capture of pathology features. FIGO, staging system determined by the International Federation of Gynecology and Obstetrics (Fédération Internationale de Gynécologie et d’Obstétrique); N/A, not applicable; N/R, not reported.
Inclusion and exclusion search terms selected based on expert knowledge
| Evidence type | Leiomyoma | Endometriosis | Adenomyosis |
| Inclusion search terms | fibroid | endometriosis | adenomyosis |
| | fibroids | | adenomyotic |
| | leiomyoma | | |
| | leiomyomata | | |
| | leiomyomas | | |
| | smooth muscle neoplasm | | |
| | smooth muscle tumour | | |
| | smooth muscle tumor | | |
| Exclusion search terms | fibrosis | endometritis | |
| | fibrotic | | |
Example search terms, regular expression search patterns and textual context in the portable document format report containing the search term (shown in italics)
| Search term | Regular expression pattern | Textual context |
| leiomyoma | (l|i)e(i|l|!)(o|a|c)(m|rn)y(o|a|c)(m|rn)(a|o) | “myometr!um contains a benign |
| endometriosis | end(o|a|c)(m|rn)etr(i|l|!)(o|a|c)s(i|l|!)s | “right fallopian tube shows a focus of |
| adenomyosis: absent | (a|o)den(o|a|c)(m|rn)y(o|a|c)s(i|l|!)s: (a|o)bsent | “evidence of |
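The patterns in the table above replace each letter that OCR tends to misread with a small alternation, for example `m` with `(m|rn)`. The following is a minimal Python sketch of how such patterns could be generated and applied. The confusion map is inferred only from the three example rows, so it is an assumption rather than the study's full rule set, and the names `OCR_CONFUSIONS`, `ocr_tolerant_pattern` and `find_term` are hypothetical.

```python
import re

# Character confusions inferred from the three example patterns in the table.
# ASSUMPTION: the full substitution set used in the study is not given here.
OCR_CONFUSIONS = {
    "l": "(l|i)",
    "i": "(i|l|!)",
    "o": "(o|a|c)",
    "a": "(a|o)",
    "m": "(m|rn)",
}


def ocr_tolerant_pattern(term: str) -> str:
    """Build a regular expression that tolerates common OCR misreadings."""
    parts = []
    for ch in term:
        if ch in OCR_CONFUSIONS:
            parts.append(OCR_CONFUSIONS[ch])
        elif ch.isalnum() or ch == " ":
            parts.append(ch)  # plain characters are matched literally
        else:
            parts.append(re.escape(ch))  # escape punctuation defensively
    return "".join(parts)


def find_term(text: str, term: str):
    """Return the first OCR-tolerant match of `term` in `text`, or None.

    ASSUMPTION: matching is case-insensitive; the source does not say.
    """
    return re.search(ocr_tolerant_pattern(term), text, flags=re.IGNORECASE)


if __name__ == "__main__":
    print(ocr_tolerant_pattern("leiomyoma"))
    hit = find_term("myometr!um contains a benign ieiornyoma", "leiomyoma")
    print(hit.group(0) if hit else "no match")
```

For the search terms `leiomyoma` and `endometriosis`, this map reproduces the patterns listed in the table character for character. A broader map would catch more OCR errors but would also raise the risk of matching unrelated words.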
Final coded abstraction statistics for leiomyomas, endometriosis and adenomyosis
| Pathology feature | Final abstracted coding (development/evaluation set) | | |
| | Yes | No | Not reported |
| Leiomyomas | 693 (9/684) | 25 (0/25) | 586 (2/584) |
| Endometriosis | 106 (1/105) | 14 (0/14) | 1184 (10/1174) |
| Adenomyosis | 538 (5/533) | 36 (3/33) | 730 (3/727) |
Contingency table for system/abstractor and the final abstracted codes on the evaluation set for (a) leiomyomas, (b) endometriosis and (c) adenomyosis
| Final abstracted codes (n) | System | | | | Abstractor | | |
| | Yes | No | Not reported | Conflict | Yes | No | Not reported |
| (a) Leiomyomas | | | | | | | |
| Yes (684) | **673** | 1 | 9 | 1 | **614** | 22 | 48 |
| No (25) | 8 | **16** | 0 | 1 | 1 | **21** | 3 |
| Not reported (584) | 5 | 0 | **579** | 0 | 12 | 196 | **376** |
| Total (1293) | 686 | 17 | 588 | 2 | 627 | 239 | 427 |
| (b) Endometriosis | | | | | | | |
| Yes (105) | **103** | 0 | 0 | 2 | **90** | 1 | 14 |
| No (14) | 10 | **3** | 0 | 1 | 1 | **3** | 10 |
| Not reported (1174) | 0 | 0 | **1174** | 0 | 2 | 14 | **1158** |
| Total (1293) | 113 | 3 | 1174 | 3 | 93 | 18 | 1182 |
| (c) Adenomyosis | | | | | | | |
| Yes (533) | **515** | 0 | 18 | 0 | **497** | 15 | 21 |
| No (33) | 7 | **24** | 0 | 2 | 3 | **29** | 1 |
| Not reported (727) | 2 | 0 | **725** | 0 | 5 | 252 | **470** |
| Total (1293) | 524 | 24 | 743 | 2 | 505 | 296 | 492 |
Results along the main diagonal (bold font) show feature value concordance, while the off-diagonal results show the feature value discrepancies.
Discrepancies in abstractor coding of ‘No’ and ‘Not reported’ values for leiomyomas and adenomyosis highlight the possible extent of abstractor inference in the coding of ‘No’ values.
System effectiveness results for leiomyomas, endometriosis and adenomyosis classification on the evaluation set
| | Yes | | | Other | | |
| | PPV | Sensitivity | F-measure | PPV | Sensitivity | F-measure |
| Leiomyomas | ||||||
| Abstractor | 97.93% | 89.77% | 93.67% | 89.49% | 97.87% | 93.49% |
| System | 98.11% | 98.39%† | 98.25%† | 98.19%† | 97.87% | 98.03%† |
| Endometriosis | ||||||
| Abstractor | 96.77% | 85.71% | 90.91% | 98.75% | 99.75% | 99.25% |
| System | 91.15% | 98.10%† | 94.50% | 99.83%† | 99.16% | 99.49% |
| Adenomyosis | ||||||
| Abstractor | 98.42% | 93.25% | 95.76% | 95.43% | 98.95% | 97.16% |
| System | 98.28% | 96.62% | 97.45% | 97.66% | 98.82% | 98.23% |
*Performance difference between system and abstractor is significant at alpha = 0.05.
†Performance difference between system and abstractor is very significant at alpha = 0.01.
PPV, positive predictive value.
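The ‘Yes’ figures in this table appear to follow directly from the contingency counts reported for leiomyomas. The sketch below illustrates the arithmetic under two assumptions: that the diagonal (concordant) count equals the row total minus the off-diagonal counts in that row, and that every non-‘Yes’ output, including ‘Conflict’, counts as a miss for a true ‘Yes’ case. The function names are hypothetical and this is not the study's evaluation code.

```python
def yes_class_counts(row_total, missed, false_alarms):
    """Return (tp, fp, fn) for the 'Yes' class.

    row_total    -- number of cases whose final abstracted code is 'Yes'
    missed       -- off-diagonal counts in the 'Yes' row (true 'Yes' coded otherwise)
    false_alarms -- 'Yes' outputs in the 'No' and 'Not reported' rows
    """
    fn = sum(missed)
    tp = row_total - fn  # ASSUMPTION: diagonal = row total minus off-diagonals
    fp = sum(false_alarms)
    return tp, fp, fn


def ppv(tp, fp, fn):
    """Positive predictive value: correct 'Yes' outputs over all 'Yes' outputs."""
    return tp / (tp + fp)


def sensitivity(tp, fp, fn):
    """Sensitivity: correct 'Yes' outputs over all true 'Yes' cases."""
    return tp / (tp + fn)


def f_measure(tp, fp, fn):
    """Harmonic mean of PPV and sensitivity."""
    return 2 * tp / (2 * tp + fp + fn)


# Leiomyoma counts taken from the contingency table, evaluation set.
system = yes_class_counts(684, missed=[1, 9, 1], false_alarms=[8, 5])
abstractor = yes_class_counts(684, missed=[22, 48], false_alarms=[1, 12])

if __name__ == "__main__":
    for name, counts in (("System", system), ("Abstractor", abstractor)):
        print(
            f"{name}: PPV={100 * ppv(*counts):.2f}% "
            f"sensitivity={100 * sensitivity(*counts):.2f}% "
            f"F-measure={100 * f_measure(*counts):.2f}%"
        )
```

Under these assumptions the computed values agree with the reported leiomyoma ‘Yes’ figures to within rounding in the second decimal place.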
Contribution of optical character recognition (OCR) error correction and negated assertions on the performance of the system
| | Yes | | | Other | | |
| | PPV | Sensitivity | F-measure | PPV | Sensitivity | F-measure |
| Leiomyomas | ||||||
| Baseline | 95.86% | 98.25% | 97.04% | 97.97% | 95.24% | 96.59% |
| +Negated assertions | 98.10%* | 97.95% | 98.03% | 97.71% | 97.87%* | 97.79%* |
| +OCR correction | 98.11%* | 98.39% | 98.25%* | 98.19% | 97.87%* | 98.03%* |
| Endometriosis | ||||||
| Baseline | 88.98% | 100.00% | 94.17% | 100.00% | 98.91% | 99.45% |
| +Negated assertions | 91.15% | 98.10% | 94.50% | 99.83% | 99.16% | 99.49% |
| +OCR correction | 91.15% | 98.10% | 94.50% | 99.83% | 99.16% | 99.49% |
| Adenomyosis | ||||||
| Baseline | 93.59% | 95.87% | 94.72% | 97.06% | 95.40% | 96.22% |
| +Negated assertions | 98.27%* | 95.87% | 97.06%* | 97.15%* | 98.82%* | 97.98%* |
| +OCR correction | 98.28%* | 96.62% | 97.45%* | 97.66%* | 98.82%* | 98.23%* |
Baseline configuration refers to the exact match of search terms.
*Performance difference against baseline is very significant at alpha = 0.01.
PPV, positive predictive value.