Frances B Maguire, Cyllene R Morris, Arti Parikh-Patel, Rosemary D Cress, Theresa H M Keegan, Chin-Shang Li, Patrick S Lin, Kenneth W Kizer.
Abstract
BACKGROUND: Population-based cancer registries have treatment information for all patients, making them an excellent resource for population-level monitoring. However, specific treatment details, such as drug names, are contained in a free-text format that is difficult to process and summarize. We assessed the accuracy and efficiency of a text-mining algorithm to identify systemic treatments for lung cancer from free-text fields in the California Cancer Registry.
Year: 2019 PMID: 30794610 PMCID: PMC6386345 DOI: 10.1371/journal.pone.0212454
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Text string search order for SAS-based text mining of non-small cell lung cancer (NSCLC) first-line systemic treatments in 17,310 patients diagnosed with stage IV disease, 2012–2014, California.
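The idea of an ordered text string search can be illustrated with a minimal Python sketch. This is not the authors' SAS algorithm: the keyword lists, the function names `has_any` and `classify`, the rule order, and the `"Unclassified"` fallback are all hypothetical, chosen only to show why more specific combinations have to be checked before broader groups.

```python
import re

# Hypothetical keyword lists (assumptions for illustration only); the search
# strings actually used by the registry algorithm are not given in this text.
PLATINUM = ("CISPLATIN", "CARBOPLATIN")
PEMETREXED = ("PEMETREXED", "ALIMTA")
BEVACIZUMAB = ("BEVACIZUMAB", "AVASTIN")
TKI = ("ERLOTINIB", "TARCEVA", "GEFITINIB", "AFATINIB", "CRIZOTINIB")
PARTNERS = ("PACLITAXEL", "DOCETAXEL", "GEMCITABINE", "VINORELBINE", "ETOPOSIDE")


def has_any(text, terms):
    """Return True if any term occurs as a whole word in the upper-cased text."""
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)


def classify(free_text):
    """Assign one treatment group using an assumed search order."""
    text = (free_text or "").upper()
    pem = has_any(text, PEMETREXED)
    bev = has_any(text, BEVACIZUMAB)
    # Most specific combination first, so it is not absorbed by a broader group.
    if pem and bev:
        return "Pemetrexed and bevacizumab"
    if pem:
        return "Pemetrexed-based"
    if bev:
        return "Bevacizumab-based"
    if has_any(text, TKI):
        return "Tyrosine kinase inhibitors"
    if has_any(text, PLATINUM) and has_any(text, PARTNERS):
        return "Platinum doublets"
    if has_any(text, PARTNERS):
        return "Single agents"
    return "Unclassified"  # hypothetical fallback, not a category from the paper


print(classify("Carboplatin/Alimta x4 cycles"))
print(classify("carbo + pemetrexed + avastin"))
```

Checking the combined pemetrexed and bevacizumab rule first matters because either drug name alone would otherwise match a single-drug group.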
Agreement of treatment between the SAS text mining algorithm and manual review among stage IV non-small cell lung cancer patients (n = 17,310), 2012–2014, California.
| Treatment group | SAS text mining | Manual review: Yes, n (% of total) | Manual review: No, n (% of total) | Total, n (% of total) | Agreement, % | Agreement, 95% CI | Kappa | Kappa, 95% CI |
|---|---|---|---|---|---|---|---|---|
| Platinum doublets | Yes | 2,442 (14.1) | 90 (0.5) | 2,532 (14.6) | 98.1 | 97.8, 98.3 | 0.92 | 0.91, 0.93 |
| | No | 246 (1.4) | 14,532 (84.0) | 14,778 (85.4) | | | | |
| | Total | 2,688 (15.5) | 14,622 (84.5) | 17,310 | | | | |
| Pemetrexed-based | Yes | 1,974 (11.4) | 159 (0.9) | 2,133 (12.3) | 98.3 | 98.1, 98.5 | 0.92 | 0.91, 0.93 |
| | No | 140 (0.8) | 15,037 (86.9) | 15,177 (87.7) | | | | |
| | Total | 2,114 (12.2) | 15,196 (87.8) | 17,310 | | | | |
| Bevacizumab-based | Yes | 467 (2.7) | 35 (0.2) | 502 (2.9) | 99.4 | 99.3, 99.5 | 0.90 | 0.88, 0.92 |
| | No | 63 (0.4) | 16,745 (96.7) | 16,808 (97.1) | | | | |
| | Total | 530 (3.1) | 16,780 (96.9) | 17,310 | | | | |
| Pemetrexed and bevacizumab | Yes | 618 (3.6) | 114 (0.6) | 732 (4.2) | 99.2 | 99.1, 99.4 | 0.90 | 0.88, 0.92 |
| | No | 17 (0.1) | 16,561 (95.7) | 16,578 (95.8) | | | | |
| | Total | 635 (3.7) | 16,675 (96.3) | 17,310 | | | | |
| Single agents | Yes | 288 (1.7) | 189 (1.1) | 477 (2.8) | 98.7 | 98.5, 98.7 | 0.71 | 0.68, 0.75 |
| | No | 37 (0.2) | 16,796 (97.0) | 16,833 (97.2) | | | | |
| | Total | 325 (1.9) | 16,985 (98.1) | 17,310 | | | | |
| Tyrosine kinase inhibitors | Yes | 1,599 (9.2) | 287 (1.7) | 1,886 (10.9) | 97.7 | 97.4, 97.9 | 0.88 | 0.86, 0.89 |
| | No | 117 (0.7) | 15,307 (88.4) | 15,424 (89.1) | | | | |
| | Total | 1,716 (9.9) | 15,594 (90.1) | 17,310 | | | | |
| No systemic treatment | Yes | 4,844 (28.0) | 895 (5.2) | 5,739 (33.2) | 91.1 | 90.7, 91.5 | 0.80 | 0.78, 0.81 |
| | No | 642 (3.7) | 10,929 (63.1) | 11,571 (66.8) | | | | |
| | Total | 5,486 (31.7) | 11,824 (68.3) | 17,310 | | | | |
| Unknown systemic treatment | Yes | 2,836 (16.4) | 473 (2.7) | 3,309 (19.1) | 91.6 | 91.2, 92.0 | 0.74 | 0.73, 0.76 |
| | No | 981 (5.7) | 13,020 (75.2) | 14,001 (80.9) | | | | |
| | Total | 3,817 (22.1) | 13,493 (77.9) | 17,310 | | | | |
Abbreviations: CI, confidence interval
Sensitivity, specificity, PPV, and NPV of treatment identified with SAS-based text mining for stage IV non-small cell lung cancer patients (n = 17,310), 2012–2014, California.
| Treatment Group | Sensitivity, % | Sensitivity, 95% CI | Specificity, % | Specificity, 95% CI | PPV, % | PPV, 95% CI | NPV, % | NPV, 95% CI |
|---|---|---|---|---|---|---|---|---|
| Platinum doublets | 90.8 | 89.7, 91.9 | 99.4 | 99.2, 99.5 | 96.4 | 95.6, 97.1 | 98.3 | 98.1, 98.5 |
| Pemetrexed-based regimens | 93.4 | 92.2, 94.4 | 98.9 | 98.7, 99.1 | 92.5 | 91.4, 93.6 | 99.1 | 98.9, 99.2 |
| Bevacizumab-based regimens | 88.1 | 85.1, 90.7 | 99.8 | 99.7, 99.9 | 93.0 | 90.5, 94.9 | 99.6 | 99.5, 99.7 |
| Pemetrexed and bevacizumab regimens | 97.3 | 95.7, 98.4 | 99.3 | 99.1, 99.4 | 84.4 | 81.8, 86.6 | 99.9 | 99.8, 99.9 |
| Single agents | 88.6 | 84.6, 91.8 | 98.9 | 98.7, 99.0 | 60.4 | 56.8, 63.8 | 99.8 | 99.7, 99.8 |
| Tyrosine kinase inhibitors | 93.2 | 91.9, 94.3 | 98.2 | 97.9, 98.4 | 84.8 | 83.2, 86.2 | 99.2 | 99.1, 99.4 |
| No systemic treatment | 88.3 | 87.4, 89.1 | 92.4 | 91.9, 92.9 | 84.4 | 83.6, 85.2 | 94.5 | 94.1, 94.8 |
| Unknown systemic treatment | 74.3 | 72.9, 75.7 | 96.5 | 96.2, 96.8 | 85.7 | 84.6, 86.8 | 92.9 | 92.6, 93.3 |
Abbreviations: CI, confidence interval; NPV, negative predictive value; PPV, positive predictive value
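The four measures follow directly from the 2×2 agreement counts when manual review is treated as the reference standard, which is consistent with how the tabulated values relate to those counts. The sketch below is illustrative only; the function name `diagnostic_metrics` and the `tp`/`fp`/`fn`/`tn` labels are not from the source. The example uses the tyrosine kinase inhibitor counts.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from 2x2 counts.

    tp : text mining Yes, manual review Yes
    fp : text mining Yes, manual review No
    fn : text mining No,  manual review Yes
    tn : text mining No,  manual review No
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of manual-review Yes that were found
        "specificity": tn / (tn + fp),  # share of manual-review No correctly excluded
        "ppv": tp / (tp + fp),          # share of text-mining Yes that were correct
        "npv": tn / (tn + fn),          # share of text-mining No that were correct
    }


# Tyrosine kinase inhibitors: 1,599 / 287 / 117 / 15,307.
tki = diagnostic_metrics(tp=1599, fp=287, fn=117, tn=15307)
for name, value in tki.items():
    print(f"{name}: {value:.1%}")
```

For this group the results round to the tabulated 93.2, 98.2, 84.8, and 99.2 percent.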
False positives, false negatives, and total errors for treatment identified with SAS text mining algorithm among stage IV non-small cell lung cancer patients (n = 17,310), 2012–2014, California.
| Treatment Group | False Positives, n (%) | False Negatives, n (%) | Total Errors, n (%) |
|---|---|---|---|
| Platinum doublets | 90 (0.5) | 246 (1.4) | 336 (1.9) |
| Pemetrexed-based regimens | 159 (0.9) | 140 (0.8) | 299 (1.7) |
| Bevacizumab-based regimens | 35 (0.2) | 63 (0.4) | 98 (0.6) |
| Pemetrexed and bevacizumab regimens | 114 (0.7) | 17 (0.1) | 131 (0.8) |
| Single agents | 189 (1.1) | 37 (0.2) | 226 (1.3) |
| Tyrosine kinase inhibitors | 287 (1.7) | 117 (0.7) | 404 (2.3) |
| No systemic treatment | 895 (5.2) | 642 (3.7) | 1,537 (8.9) |
| Unknown systemic treatment | 473 (2.7) | 981 (5.7) | 1,454 (8.4) |
Percentages (%) represent percent of total
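As a small arithmetic check, total errors are the sum of false positives and false negatives, expressed as a share of all 17,310 patients, so each percentage is the complement of the corresponding agreement percentage. The helper name `error_summary` below is illustrative, not from the source.

```python
TOTAL_PATIENTS = 17310  # stage IV NSCLC patients in the study


def error_summary(false_pos, false_neg, total=TOTAL_PATIENTS):
    """Return (total errors, percent of all patients rounded to one decimal)."""
    errors = false_pos + false_neg
    return errors, round(100 * errors / total, 1)


print(error_summary(90, 246))   # platinum doublets
print(error_summary(287, 117))  # tyrosine kinase inhibitors
```

For platinum doublets this gives 336 errors, or 1.9% of patients, which matches 100 minus the 98.1% agreement reported for that group.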
Fig 2. Process diagram for developing SAS-based text mining algorithm to summarize treatment information.