
Comparison of Machine-Learning Algorithms for the Prediction of Current Procedural Terminology (CPT) Codes from Pathology Reports.

Joshua Levy1,2,3, Nishitha Vattikonda4, Christian Haudenschild5, Brock Christensen2,6,7, Louis Vaickus1.   

Abstract

BACKGROUND: Pathology reports serve as an auditable trail of a patient's clinical narrative, containing text pertaining to diagnosis, prognosis, and specimen processing. Recent works have utilized natural language processing (NLP) pipelines, which include rule-based or machine-learning analytics, to uncover textual patterns that inform clinical endpoints and biomarker information. Although deep learning methods have come to the forefront of NLP, there have been limited comparisons with the performance of other machine-learning methods in extracting key insights for the prediction of medical procedure information, which is used to inform reimbursement for pathology departments. In addition, the utility of combining and ranking information from multiple report subfields as compared with exclusively using the diagnostic field for the prediction of Current Procedural Terminology (CPT) codes and signing pathologists remains unclear.
METHODS: After preprocessing pathology reports, we utilized advanced topic modeling to identify topics that characterize a cohort of 93,039 pathology reports at the Dartmouth-Hitchcock Department of Pathology and Laboratory Medicine (DPLM). We separately compared XGBoost, SVM, and BERT (Bidirectional Encoder Representation from Transformers) methodologies for the prediction of primary CPT codes (CPT 88302, 88304, 88305, 88307, 88309) as well as 38 ancillary CPT codes, using both the diagnostic text alone and text from all subfields. We performed similar analyses for characterizing text from a group of the 20 pathologists with the most pathology report sign-outs. Finally, we uncovered important report subcomponents by using model explanation techniques.
RESULTS: We identified 20 topics that pertained to diagnostic and procedural information. Operating on diagnostic text alone, BERT outperformed XGBoost for the prediction of primary CPT codes. When utilizing all report subfields, XGBoost outperformed BERT for the prediction of primary CPT codes. Utilizing additional subfields of the pathology report increased prediction accuracy across ancillary CPT codes, and performance gains for using additional report subfields were high for the XGBoost model for primary CPT codes. Misclassifications of CPT codes were between codes of a similar complexity, and misclassifications between pathologists were subspecialty related.
CONCLUSIONS: Our approach generated CPT code predictions with an accuracy that was higher than previously reported. Although diagnostic text is an important source of information, additional insights may be extracted from other report subfields. Although BERT approaches performed comparably to the XGBoost approaches, they may lend valuable information to pipelines that combine image, text, and -omics information. Future resource-saving opportunities exist to help hospitals detect mis-billing, standardize report text, and estimate productivity metrics that pertain to pathologist compensation (RVUs).
© 2021 Journal of Pathology Informatics.

Keywords:  BERT; XGBoost; current procedural terminology; deep learning; machine learning; pathology reports

Year:  2022        PMID: 35127232      PMCID: PMC8802304          DOI: 10.4103/jpi.jpi_52_21

Source DB:  PubMed          Journal:  J Pathol Inform


BACKGROUND AND SIGNIFICANCE

Electronic Health Records (EHR)[1] refer to both the structured and unstructured components of patients’ health records/information (PHI), synthesized from a myriad of data sources and modalities. Such data, particularly clinical text reports, are increasingly relevant to “Big Data” in the biomedical domain. Structured components of EHR, such as clinical procedural and diagnostic codes, are able to effectively store the patient’s history,[2,3,4] whereas unstructured clinical notes reflect an amalgamation of more nuanced clinical narratives. Such documentation may serve to refresh the clinician on the patient’s history, highlight key aspects of the patient’s health, and facilitate patient handoff among providers. Further, analysis of clinical free text may reveal physician bias or inform an audit trail of the patient’s clinical outcomes for purposes of quality improvement. As such, utilizing sophisticated algorithmic techniques to assess text data in pathology reports may improve decision making and hospital processes/efficiency, possibly saving hospital resources while prioritizing patient health. NLP[3,5,6,7,8] is an analytic technique that is used to extract semantic and syntactic information from textual data. Traditionally, rule-based approaches cross-reference and tabulate domain-specific key words or phrases with large biomedical ontologies and standardized vocabularies, such as the Unified Medical Language System (UMLS).[9,10] However, although these approaches provide an accurate means of assessing a narrow range of specified patterns, they are neither flexible nor generalizable since they require extensive annotation and development from a specialist. Machine-learning approaches (e.g., support vector machine (SVM), random forest)[11,12] employ a set of computational heuristics to circumvent manual specification of search criteria to reveal patterns and trends in the data.
Bag-of-words approaches[13,14] study the frequency counts of words (unigrams) and phrases (bigrams, etc.) to compare the content of multiple documents for recurrent themes, whereas deep learning approaches[15,16,17] simultaneously capture syntax and semantics with artificial neural network (ANN) techniques. Recent deep learning NLP approaches have demonstrated the ability to capture meaningful nuances that are lost in frequency-based approaches; for instance, these approaches can effectively contextualize short- and long-range dependencies between words.[18,19] Despite potential advantages conferred from less structured approaches, the analysis of text across any domain usually necessitates balancing domain-specific customization (e.g., a medical term/abbreviation corpus) with generalized NLP techniques. The analysis of pathology reports using NLP has been particularly impactful in recent years, particularly in the areas of information extraction, summarization, and categorization. Noteworthy developments include information extraction pipelines that utilize regular expressions (regex) to highlight key report findings (e.g., extraction of molecular test results),[20,21,22,23] as well as topic modeling approaches that summarize a document corpus by common themes and wording.[24] In addition to extraction methods, machine-learning techniques have been applied to classify pathology reports[25]; notable examples include the prediction of ICD-O morphological diagnostic codes[26,27] and the prediction of CPT codes based only on diagnostic text.[28,29] Widespread misspelling of words and jargon specific to individual physicians have made it difficult to reliably utilize rule-based and even machine-learning approaches for report prediction in a clinical workflow.
In addition, hedging and uncertainty in text reports may further obfuscate findings.[30] CPT codes are assigned to report reimbursable medical procedures for diagnosis, surgery, and the ordering of additional ancillary tests.[31,32] Assignments of CPT codes are informed by guidelines and are typically integrated into the Pathology Information System. As such, the degree to which new technologies and practices are implemented and disseminated is often informed by their impact on CPT coding practices. Reimbursements from CPT codes can represent tens to hundreds of millions of dollars of revenue at mid-sized medical centers; thus, systematic underbilling of codes could lead to lost hospital revenue, whereas overbilling patterns may lead to the identification of areas of redundant or unnecessary testing (e.g., duplication of codes, ordering of unnecessary tests, or assignment of codes representing more complex cases). Ancillary CPT codes represent procedural codes that are automatically assigned when ancillary tests are ordered (e.g., immunohistochemical stains; CPT 88341, 88342, 88313, 88360, etc.). In contrast, primary CPT codes (e.g., CPT 88300, 88302, 88304, 88305, 88307, and 88309) are assigned based on the pathologist's examination of the specimen, where CPT 88300 represents an examination without the use of a microscope (gross examination), whereas CPT 88302–88309 include gross and microscopic examination of the specimen and are ordered by the case's complexity level (as specified by the CPT codebook; an ordinal outcome; e.g., CPT 88305: pathology examination of tissue using a microscope, intermediate complexity), which determines reimbursement. The assignment of such codes is not devoid of controversy.
Although raters are expected not to report a specimen at a higher or lower code level than warranted, some may argue that these levels do not always reflect the degree of difficulty of a particular case, or that no specific language denotes the primary CPT code placement of a phenomenon (i.e., an unlisted specimen, where placement is at the pathologist's discretion). For these codes, case complexity may ultimately be traced back to the clinical narrative reported in the pathology report text.[33] Since the assignment of case complexity is sometimes unclear to the practicing pathologist as guidelines evolve, the prediction of these CPT codes from the diagnostic text using NLP algorithms can be used to inform whether the assigned code matches the case complexity. Recently developed approaches to predict CPT codes demonstrate remarkable performance; however, they rely on only the first 100 words of the report text, do not compare across multiple state-of-the-art NLP prediction algorithms, and do not consider report text outside of the diagnosis section.[28] Further, the report lexicon is hardly standardized: it may be littered with language and jargon specific to the sign-out pathologist and may vary widely in length for the same diagnosis, which can make it difficult to build an objective understanding of the report text. Comparisons of different algorithmic techniques and of the relevant report text to use for the prediction of primary CPT codes are essential to further understand their utility for curbing under- and overbilling. In addition, contextualizing primary code findings with ancillary findings and building a greater understanding of how pathologists differ in their lexical patterns may provide further motivation for the standardization of reporting practices and for how report text can optimize the ordering of ancillary tests.[34]

OBJECTIVE

The primary objective of this study is to compare the capacity to delineate primary CPT procedural codes (CPT 88302, 88304, 88305, 88307, 88309) corresponding to case complexity across state-of-the-art machine-learning models over a corpus of 93,039 pathology reports from the Dartmouth-Hitchcock Department of Pathology and Laboratory Medicine (DPLM), a mid-sized academic medical center. Using XGBoost, SVM, and BERT techniques, we hope to gain a better understanding of which algorithms are useful for predicting primary CPT codes representing case complexity, which will prove helpful for the detection of under/overbilling.

SECONDARY OBJECTIVES

We have formulated several secondary objectives focused on capturing additional components of reporting variation:

Expanded reporting subfields: Explore methods that incorporate document subfields outside of the diagnostic text into the modeling approaches, as these may contain additional information.

Ancillary Testing Codes: Predict the assignment of 38 different CPT procedure codes, largely comprising secondary CPT codes, under the hypothesis that nondiagnostic text provides additional predictive accuracy as compared with primary CPT codes, which may rely more heavily on the diagnostic text. Although the prediction of whether an ancillary test was ordered via secondary CPT codes has limited potential for incorporation into the Pathology Information System, as these codes are automatically assigned after test ordering, prediction of the ancillary tests can provide additional context for the prediction of primary codes.

Pathologist-Specific Language: Investigate whether the sign-out pathologist can be predicted based on word choice. Although the sign-out pathologist can be found through an SQL query in the Pathology Information System, we are interested in translating sign-outs to a unified language that is consistent across sign-outs (i.e., a similar lexicon across pathologists, given diagnosis, code assignments, and subspecialty). As an example, some pathologists may more verbosely describe a phenomenon that could be succinctly summarized to match a colleague’s description, though this could be difficult to disentangle without a quantitative understanding of lexical differences. To do this, we need to identify several components of variation (i.e., within a subspecialty, where reports from pathologists may vary widely); we want to further understand this heterogeneity to standardize communications within our department.
Although the final two objectives (ancillary testing and pathologist prediction) can be resolved by using an SQL query, we emphasize that these secondary objectives were selected to better identify the potential sources of reporting inconsistency with the aim of informing optimal reporting standards rather than imputing information that can be readily queried through the Pathology Information System.

APPROACH AND PROCEDURE

Data acquisition

We obtained Institutional Review Board approval and accessed 96,418 pathology reports from DPLM, collected between June 2015 and June 2020. We removed a total of 3,379 reports that did not contain any diagnostic text associated with CPT codes, retaining 93,039 reports (Supplementary Table 1). Each report was appended with metadata, including corresponding EPIC (Epic Systems, Verona, WI),[35] Charge Description Master (CDM), and CPT procedural codes, the sign-out pathologist, the amount of time to sign out the document, and other details. Fuzzy string matching using the fuzzywuzzy package was used to identify whether any pathologists’ names were misspelled (or to resolve potential last name changes) between documents.[36] First, all unique pathologist names were identified. Then, for each pair of names, the token sort ratio was calculated and thresholded by whether the ratio exceeded 0.7 to establish a unipartite graph of pathologist names connected to their candidate duplicates. Finally, clusters of similar names were identified by using connected component analysis. In most cases, unique names were assigned to each cluster of names, though in select cases, names were kept separate.[37] The documents were deidentified by stripping all PHI-containing fields and numerals from the text and replacing them with placeholder characters (e.g., 87560 becomes #####). As a final check, we used regular expressions (regex) to remove mentions of patient names in the report text. This was accomplished by first compiling and storing several publicly available databases of 552,428 first and last names (Supplementary Materials, section “Additional Information on Deidentification Approach”). Then, using regex, we searched for the presence of each first and last name in the report subsections and replaced names at matched positions with white space. We did not, however, remove mentions of the physicians and consulting pathologists.
The information on the physicians and consulting pathologists was identified in the “ordered by,” “reports to,” and “verified by” fields of the pathology report using known personal identifiers. The deidentification protocol was approved by the Institutional Review Board, Office of Research Operations and Data Governance. A total of 17,744 first and last names were stripped from the in-house data.
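The name-deduplication step described above (pairwise token sort ratio, a similarity threshold, then connected-component clustering) can be sketched with the standard library alone. The paper used the fuzzywuzzy package; this illustration substitutes difflib's similarity ratio (scaled 0-1, whereas fuzzywuzzy reports 0-100, and its Levenshtein-based score differs slightly) and a small union-find in place of a graph library:

```python
from difflib import SequenceMatcher

def token_sort_ratio(a, b):
    """fuzzywuzzy-style token sort ratio: sort tokens, then compare strings (0-1 here)."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sa, sb).ratio()

def name_clusters(names, threshold=0.7):
    """Union-find over name pairs whose similarity exceeds the threshold,
    i.e., connected components of the candidate-duplicate graph."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if token_sort_ratio(a, b) > threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

clusters = name_clusters(["Smith, Jane", "Jane Smith", "Lee, Robert"])
```

Here "Smith, Jane" and "Jane Smith" fall into one cluster, while "Lee, Robert" remains its own component; a canonical name would then be assigned per cluster.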
Supplementary Table 1

Recording the percent missingness of each report subsection before removing reports lacking a diagnostic section. Summary measures (median, 1st quartile, 3rd quartile) for the number of words in each document subsection (where the subfield existed) and the percentage of documents whose length exceeded 512 words

Report subfield | Missingness before removal | Median word count | 1st Q word count | 3rd Q word count | Exceeds BERT max words
ADDENDUM DISCUSSION | 96.2% | 84 | 47 | 121 | 0.000%
ADDITIONAL STUDIES | 86.6% | 78 | 11 | 92 | 0.039%
CLINICAL INFORMATION | 5.3% | 25 | 18 | 61 | 0.000%
DIAGNOSIS | 3.5% | 23 | 13 | 28 | 0.022%
DISCUSSION | 81.7% | 36 | 16 | 68 | 0.017%
FINAL DIAGNOSIS | 99.9% | 235 | 186 | 313 | 0.000%
FROZEN SECTION | 99.4% | 2 | 1 | 11 | 0.000%
FROZEN SECTION DIAGNOSIS | 99.3% | 20 | 12 | 33 | 0.000%
INTERPRETATION | 99.9% | 135 | 91 | 141 | 0.000%
RESULTS | 97.9% | 248 | 216 | 268 | 2.198%
SPECIMEN PROCESSING | 34.4% | 38 | 27 | 64 | 0.389%
Complete Text (All Fields) | 0% | 119 | 68 | 158 | 1.768%

Preprocessing

We used regular expressions (regex) to remove punctuation from the text, and the text was preprocessed by using the Spacy package[38] to tokenize it. We utilized Spacy’s en_core_web_sm processing pipeline (https://spacy.io/models/en#en_core_web_sm) to remove English stop words and words shorter than three characters. Out of concern for removing pathologist lexicon germane to pathologist sign-out, for this preliminary assessment we did not attempt to prune additional words from the corpus outside of the methods used to generate word frequencies for the bag-of-words approaches. We also split each pathology report into its structured sections: Diagnosis, Clinical Information, Specimen Processing, Discussion, Additional Studies, Results, and Interpretation. This allowed for an equal comparison between the machine-learning algorithms. The deep learning algorithm BERT can only operate on 512 words at a time due to computational constraints (see the “Limitations” and Supplementary Materials section “Additional Information on BERT Pretraining”). The pathology reports sometimes exceeded this length when considering the entire document (1.77% exceeded 512 words), and as such these reports were limited to the diagnosis section (0.02% exceeded 512 words) when training a new BERT model (Supplementary Table 1; Supplementary Figure 1). We removed all pathology reports that did not contain a diagnosis section.
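A minimal sketch of the preprocessing steps above (strip punctuation, tokenize, drop stop words and words under three characters). The paper uses Spacy's en_core_web_sm pipeline; here a stdlib regex tokenizer and a tiny illustrative stop-word list stand in for it:

```python
import re

# Illustrative subset only; Spacy's English stop-word list is much larger.
STOP_WORDS = {"the", "of", "and", "with", "was", "for", "is", "in"}

def preprocess(text):
    """Strip punctuation, lowercase, then drop stop words and words shorter than 3 chars."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> whitespace
    return [t for t in text.split() if t not in STOP_WORDS and len(t) >= 3]

tokens = preprocess("Sections of the skin show basal-cell carcinoma, margins negative.")
```

On this sample, stop words ("of", "the") are removed and the hyphenated "basal-cell" splits into two retained tokens.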

Characterization of the text corpus

After preprocessing, we encoded each report by tabulating the occurrence of all contiguous one- to two-word sequences (unigrams and bigrams) to form sparse count matrices, where each column represents a word or phrase, each row represents a document, and each value is the frequency of occurrence in the document. Although the term frequency may be representative of the distribution of words/phrases in a corpus, high-frequency words that are featured across most of the document corpus are less likely to yield an informative lexicon specific to a subset of the documents. To account for such ubiquitous but less informative words, we transformed raw word frequencies to term frequency-inverse document frequency (tf-idf) values, which up-weight the importance of a word based on its occurrence within a specific document (term frequency) but down-weight the importance if the word is featured across the corpus (inverse document frequency) (see the Supplementary Material section “Additional Description of Topic Modeling and Report Characterization Techniques”). We summed the tf-idf value of each word across the documents to capture the word’s overall importance across the reports and utilized a word cloud algorithm to display the relative importance of the top words. After constructing count matrices, we sought to characterize and cluster pathology documents as they relate to each other and to ascribe themes to the clusters. Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)[39] dimensionality reduction was used to project the higher dimensional word frequency data into lower dimensions while preserving important functional relationships.
Each document could then be represented by a 3D point in the Cartesian coordinate system; these points were clustered by using a density-based clustering algorithm called HDBSCAN[40] to simultaneously estimate characteristic groupings of documents while filtering out noisy documents that did not explicitly fit in these larger clusters. To understand which topics were generally present in each cluster, we deployed Latent Dirichlet Allocation (LDA),[13] which identifies topics characterized by a set of words and then derives the distribution of topics over all clusters. This is accomplished via a generative model that attempts to recapitulate the original count matrix, which is outlined in greater detail in the Supplementary Material section “Additional Description of Topic Modeling and Report Characterization Techniques.” The individual topics estimated using LDA may be conceptualized as a Dirichlet/multinomial distribution (a “weight” per word/phrase) over all unigrams and bigrams, where a higher weight indicates membership in the topic. The characteristic words pertaining to each topic were visualized by using a word cloud algorithm. Finally, we correlated the CPT codes with clusters, topics, and select pathologists by using point-biserial and Spearman correlation measures[41] to further characterize the overall cohort.
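The LDA step can be sketched with scikit-learn on a toy corpus (the UMAP projection and HDBSCAN clustering stages are omitted here; document names and contents are invented for illustration). Each row of the transformed output is a per-document topic distribution that can then be aggregated per cluster:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented toy diagnostic snippets standing in for preprocessed reports.
docs = [
    "basal cell carcinoma skin margins negative",
    "squamous cell carcinoma skin excision margins",
    "colon polyp tubular adenoma",
    "gastric biopsy chronic gastritis",
]

# Unigram/bigram counts, as in the paper's count matrices.
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)  # shape (n_docs, n_topics); each row sums to 1
```

Averaging `doc_topics` rows within each HDBSCAN cluster would give the per-cluster topic distributions described above, and the largest entries of `lda.components_` give each topic's characteristic words for the word clouds.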

Machine learning models

We implemented the following three machine-learning algorithms in our study as a basis for our text classification pipeline [Figure 1]:
Figure 1

Model Descriptions: Graphics depicting: (A) SVM, where hyperplane linearly separates pathology reports, which are represented by individual datapoints; (B) XGBoost, which sequentially fits decision trees based on residuals from sum of conditional means of previous trees and outcomes; (C) All-Fields BERT model, where a diagnosis-specific neural network extracts relevant features from the diagnostic field, whereas a neural network trained on a separate clinical corpus extracts features for the remaining subfields; subfields are weighted and summed via the attention mechanism, indicated in red; subfields are combined with diagnostic features and fine-tuned with a multilayer perceptron for the final prediction

SVM

We trained an SVM[4243] to make predictions by using the UMAP embeddings formed from the tf-idf matrix. The SVM operates by learning a hyperplane that obtains maximal distance (margin) to datapoints of a particular class [Figure 1A]. However, because datapoints/texts from different classes may not be separable in the original embedding space, the SVM model projects data to a higher dimensional space where data can be linearly separated. We utilized GPU resources via the ThunderSVM package[44] to train the model in reasonable compute time.
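A minimal sketch of this step: an RBF-kernel SVM fit on low-dimensional embeddings. The paper uses ThunderSVM for GPU acceleration; scikit-learn's `SVC` is shown here as a CPU-side equivalent, and random data stands in for the 6-dimensional UMAP embeddings of the tf-idf matrix:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the 6-dimensional UMAP embeddings of the reports.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# RBF kernel: implicitly projects to a higher-dimensional space where the
# classes become linearly separable, as described in the text.
clf = SVC(kernel="rbf").fit(X, y)
acc = clf.score(X, y)  # training accuracy on the toy data
```

ThunderSVM exposes a scikit-learn-compatible interface, so swapping in its `SVC` for large corpora is largely a drop-in change.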

Bag of words with XGBoost

XGBoost algorithms[45] operate on the entire word-by-report count matrix and ensemble or average predictions across individual Classification and Regression Tree (CART) models.[46] Individual CART models devise splitting rules that partition instances of the pathology notes based on whether the count of a particular word or phrase in a pathology note exceeds an algorithmically derived threshold. Important words and thresholds (i.e., partition rules) are selected from the corpus based on their ability to partition the data, as measured by the purity of a decision leaf through the calculation of an entropy measure. Each successive splitting rule serves to further minimize the entropy, or equivalently to maximize the information gain. Random Forest models[47] bootstrap which subsets of predictors/words and samples are selected for a given splitting rule of individual trees and aggregate the predictions from many such trees. Extreme Gradient Boosting Trees (XGBoost), by contrast, fit trees (the structure and the conditional means of the terminal nodes) sequentially based on the residual between the outcome and the sum of the conditional means of the previous trees (which are fixed) and the conditional means of the current tree (which is optimized); in the binary classification setting, misclassification is estimated using a Bernoulli likelihood. This gradient-based optimization technique prioritizes samples with a large residual/gradient from the previous model fit to account for the previous “weak learners” [Figure 1B]. In both scenarios, random forest (a bagging technique) and XGBoost (a boosting technique), individual trees may exhibit bias but together cover a larger predictor space. Our XGBoost classifier models were trained by using the XGBoost library, which utilizes GPUs to speed up calculation.
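The sequential fit-to-residuals idea above can be demonstrated compactly. Note this sketch uses scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost library (same boosting principle, but without XGBoost's regularization terms and GPU support), with synthetic data in place of the word-by-report count matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the sparse word-by-report count matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Each of the 50 trees is fit to the residual of the ensemble so far,
# prioritizing samples the previous "weak learners" got wrong.
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
acc = model.score(X, y)  # training accuracy on the toy data
```

The XGBoost library's `XGBClassifier` follows the same scikit-learn fit/score interface, so the production swap is mostly a one-line change plus GPU configuration.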

BERT

ANNs[48] are a class of algorithms that use highly interconnected computational nodes to capture relationships between predictors in complex data. The information is passed from the nodes of an input layer to the individual nodes of subsequent layers that capture additional interactions and nonlinearities between predictors while forming abstractions of the data in the form of intermediate embeddings. The BERT[18] model first maps each word in a sentence to its own embedding and positional vectors, which capture key semantic/syntactic and contextual information that is largely absent from the bag-of-words approaches. These word-level embeddings are passed to a series of self-attention layers (the Transformer component of the BERT model), which contextualize the information of a single word in a sentence based on short- and long-term dependencies between all words in the sentence. The individual word embeddings are combined with the positional/contextual information, obtained via the self-attention mechanism, to create embeddings that represent the totality of a sentence. Finally, this information is passed to a series of fully connected layers that produce the final classification. With BERT, we are also able to analyze the relative importance of, and dependency between, words in a document by extracting “attention matrices.” We are also able to retrieve sentence-level embeddings encoded by the network by extracting vectors from the intermediate layers before they pass to the final classification. We trained the BERT models by using the HuggingFace Transformers package,[49] which utilizes GPU resources through the PyTorch framework. We used a collection of models that have already been pretrained on a large medical corpus[50] in order to both improve the predictive accuracy of our model and significantly reduce the computational load compared with training a model from scratch.
Because significant compute resources are still required to train the model, most BERT models limit the document characterization length to 512 words. To address this, we split pathology reports into document subsections when training BERT models. In training a BERT model, we updated the word embeddings by fine-tuning a pretrained model on our diagnostic corpus. This model, which had been trained solely on diagnostic text, could be used to predict the target of interest (Dx Model). We then used this fine-tuned model to extract embeddings specific to the diagnosis subfield to serve as input for a model that could utilize text from other document subfields. We separately utilized the original pretrained model to extract embeddings from the other report subfields, which are less biased by diagnostic codes and thus more likely to provide contextual information (All Fields Model). We developed a global/gating attention mechanism that serves to dynamically prune unimportant, missing, or low-quality document subsections for classification [Figure 1C]. Predictions may be obtained when some or all report subfields are supplied via

ŷ = f(h_Dx ⊕ Σ_i a_i h_i),

where h_i represents the embeddings extracted by the pretrained and fine-tuned BERT models on the respective report subsections, and a_i is an attention score between 0 and 1 that dictates the importance of a particular subsection. These attention scores are determined by using a separate gating neural network, which maps each h_i, a 768-dimensional vector (the dimensionality of the BERT embeddings), to a scalar for each document subsection through two projection matrices: W1, a 768-by-100-dimensional matrix, and W2, a 100-by-1-dimensional matrix that generates the raw attention scores. A softmax transformation is used to normalize the scores to between zero and one across the subsections.
Finally, f is a set of fully connected layers that operates on the concatenation (⊕) of the BERT embeddings fine-tuned on the diagnosis-specific section and those extracted by using the pretrained BERT model on the other document subfields, weighted by the gated attention mechanism (Supplementary Section “Additional Description of Explanation Techniques”). To train this model, we experimented with an ordinal loss function,[51] based on the proportional odds cumulative link model specification, which respects the ordering of the primary CPT codes by case complexity; ultimately, however, we opted for a cross-entropy loss since ordinal loss functions are not currently configured for the other machine-learning methods (e.g., XGBoost).
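The gating attention arithmetic can be sketched in NumPy. The paper trains this mechanism in PyTorch, and the tanh nonlinearity between the two projections is an assumption here (common in gated-attention formulations but not stated in the text); the 768- and 100-dimensional shapes follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_attention(H, W1, w2):
    """H: (n_subfields, 768) subfield embeddings.
    Returns the attention-weighted sum and the attention scores."""
    s = np.tanh(H @ W1) @ w2                 # (n_subfields,) raw gate scores; tanh is assumed
    a = np.exp(s - s.max())
    a /= a.sum()                             # softmax: scores in (0, 1), summing to 1
    return a @ H, a                          # weighted sum over subfields, plus the weights

H = rng.standard_normal((7, 768))            # e.g., 7 non-diagnostic report subfields
W1 = rng.standard_normal((768, 100)) * 0.01  # 768 -> 100 projection
w2 = rng.standard_normal(100)                # 100 -> 1 projection
pooled, attn = gated_attention(H, W1, w2)
```

The pooled vector would then be concatenated with the fine-tuned diagnosis embedding and passed to the fully connected layers; missing or low-quality subfields receive attention weights near zero, which is the dynamic pruning described above.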

Prediction of primary current procedural terminology codes

We developed machine-learning pipelines to delineate primary CPT codes requiring examination with a microscope (CPT 88302, 88304, 88305, 88307, 88309) using BERT, XGBoost, and SVM, with reports selected based on whether they contained exactly one of the five codes (where the primary codes were present in the following proportions: CPT 88302: 0.67%, 88304: 6.59%, 88305: 85.97%, 88307: 6.32%, and 88309: 0.44%). The prevalence of most of the five codes did not change over time (Supplementary Figure 2; Supplementary Table 2). For the aforementioned deep learning framework, we utilized a BERT model that was pretrained first on a large corpus of biomedical research articles from PubMed and then on a medical corpus of free-text notes from an intensive care unit (MIMIC-III database; Bio-ClinicalBERT; Supplementary Materials section “Additional Information on BERT Pretraining”).[50,52,53] Finally, the model was fine-tuned on our DHMC pathology report corpus (to capture institution-specific idiosyncrasies) for the task of classifying particular CPT codes from diagnostic text. XGBoost was trained on the original count matrix, whereas SVM was trained on a 6-dimensional UMAP projection; the UMAP projection was utilized for computational considerations. The models were evaluated by using five-fold cross-validation to compare model performances. Internal to each fold is a validation set used for identifying optimal hyperparameters (Supplementary section “Additional Information on Hyperparameter Scans”) through performance statistics and a held-out test set. For each approach, we separately fit a model considering only the diagnosis text (Dx Models) and all of the text (All Fields Models) to provide additional contextual information.
We calculated the Area Under the Receiver Operating Characteristic Curve (AUC score; summarizes the sensitivity/specificity of the model across probability cutoffs, where anything above 0.5 is better than random) and the F1-score (which balances precision and recall), and macro-averaged these scores across the five CPT codes, which weights each code equally and thus gives relatively greater importance to rare codes. Since the codes are also ordered by complexity (an ordinal variable), we additionally report a confusion matrix, which tabulates real versus predicted codes for each approach, and measure both a Spearman correlation coefficient and a linear-weighted kappa between predicted and real CPT codes to communicate how well each model preserves the relative ordering of codes (i.e., if the model is incorrect, it is better to predict a code of a similar complexity).
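These metrics can be computed directly with scikit-learn and scipy (toy labels shown; 0-4 index the five primary codes in order of complexity):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score, cohen_kappa_score, confusion_matrix

# Toy predictions over the five ordered primary codes (0 = 88302, ..., 4 = 88309).
y_true = np.array([2, 2, 2, 1, 3, 0, 4, 2, 3, 1])
y_pred = np.array([2, 2, 3, 1, 3, 0, 3, 2, 2, 1])

macro_f1 = f1_score(y_true, y_pred, average="macro")          # equal weight per code
kappa = cohen_kappa_score(y_true, y_pred, weights="linear")   # respects ordering
rho, _ = spearmanr(y_true, y_pred)                            # rank correlation
cm = confusion_matrix(y_true, y_pred, labels=list(range(5)))  # true x predicted
```

The linear weighting in the kappa penalizes a prediction one complexity level off less than one three levels off, which is exactly the "similar-complexity misclassification" property discussed in the Results.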
Supplementary Table 2

Changes in primary CPT code assignment over time; model fits for several logistic regression models, modeling time as years since 2017 (continuous) and whether a CPT code was assigned on a specific day as the dichotomous outcome variable

CPT Code   B        SE      P-value   CI [2.5%]   CI [97.5%]
88300      -0.050   0.037   0.172     -0.122       0.022
88302       0.135   0.053   0.012      0.030       0.239
88304       0.096   0.020   <0.001     0.056       0.136
88305       0.004   0.009   0.687     -0.014       0.021
88307      -0.004   0.018   0.818     -0.039       0.031
88309      -0.043   0.041   0.298     -0.123       0.038

Ancillary testing current procedural terminology codes and pathologist prediction tasks

To contextualize the findings for primary codes, the same machine-learning techniques were employed to predict each of 38 ancillary CPT codes (those remaining after removing codes that occurred fewer than 150 times across all sign-outs), asking, for example: if the prediction of primary codes relies on the diagnostic section, do secondary codes rely more on other document sections? The primary-code model predicted a categorical outcome, whereas the ancillary-testing models were configured in the multitarget setting, where each code represents a binary outcome. We compared cross-validated AUC statistics between and across the 38 codes to explore why some codes yielded lower scores than others. We also compared the algorithms via the sensitivity/specificity at Youden’s index (the optimal tradeoff between sensitivity and specificity on the receiver operating characteristic curve), averaged across validation folds. We similarly trained all models to recognize the texts of the 20 pathologists with the most sign-outs, to see whether the models could reveal pathologist-specific text and thereby inform future efforts to standardize the text lexicon. Retaining reports from these 20 pathologists reduced our corpus from 93,039 documents to 64,583 documents, and we utilized all three classification techniques to predict the sign-out pathologist. The selected pathologists represented a variety of specialties; choosing only the most prolific pathologists removed the potential for biased associations from rare outcomes in the multiclass setting.
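Youden's index can be read off the ROC curve as the cutoff maximizing J = sensitivity + specificity - 1 (toy scores shown; the real inputs are per-report predicted probabilities for one ancillary code):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy per-report scores for one ancillary code (1 = code assigned).
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.10, 0.20, 0.15, 0.40, 0.80, 0.70, 0.90, 0.35, 0.60, 0.05])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                                  # Youden's J at every cutoff
best = int(np.argmax(j))
sensitivity, specificity = tpr[best], 1.0 - fpr[best]
```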

Model interpretations

Finally, we used Shapley additive explanations (SHAP; a model-interpretation technique that estimates the contributions of predictors to the prediction through credit allocation)[54] to estimate which words were important for the classification of each of these codes, visualized by using a word cloud. For the BERT model, we utilized the Captum[55] framework to visualize backpropagation from the outcome to the predictors/words via Integrated Gradients[56] and attention matrices. Extraction of attention weights also revealed not only which words and their relationships contributed to the prediction of the CPT code (i.e., self-attention denotes word-to-word relationships), but also which document subfields other than the diagnosis field were important for assignment of the procedure code (i.e., global/gating attention prunes document subfields by learning to ignore irrelevant information; the degree of pruning can be extracted during inference). Further description of these model-interpretability techniques (SHAP, Integrated Gradients, Self-Attention/“word-to-word,” Attention) may be found in the supplementary material (section “Additional Description of Explanation Techniques: SHAP, Integrated Gradients, Self-Attention, Attention Over Pathology Report Subfields”). Pathologist-specific word choice was extracted by using SHAP/Captum from the resulting model fit and visualized by using word clouds and attention matrices.

RESULTS

Corpus preprocessing and Uniform Manifold Approximation and Projection for Dimension Reduction results

After initial filtering, we amassed a total of 93,039 pathology reports, which were broken into the following subsections: Diagnosis, Clinical Information, Specimen Processing, Discussion, Additional Studies, Results, and Interpretation. The median word length per document was 119 words (interquartile range [IQR]=90). Very few reports contained subfields that exceeded the length acceptable by the BERT algorithm (2% of reports containing a Results section exceeded this threshold; Supplementary Table 1; Supplementary Figure 1). Displayed first are word clouds of the top 25 words in only the diagnostic document subsection [Figure 2A] and across all document subsections [Figure 2B], with word size reflecting tf-idf scores [Figure 2A and B]. As expected, the diagnostic-field cloud contains words pertinent to the main diagnosis, whereas the all-field cloud contains words that are more procedural, suggesting that other pathology document subfields yield distinct and specific clinical information that may lend complementary information relative to analysis of the diagnostic fields alone. We clustered and visualized the diagnostic subsection and also all document subsections after running UMAP, which yielded 8 and 15 distinct clusters, respectively [Figure 2C and D]. The number of words per report correlated poorly with the number of total procedural codes assigned (Spearman r=0.066, p<0.01). However, when these correlations were assessed within the HDBSCAN report clusters (subset to reports within a particular cluster for cluster-specific trends), 33% of the all-fields report clusters reported moderate correlations (Supplementary Table 3). Interestingly, one of the eight report clusters from the diagnostic fields experienced a moderate negative correlation with the number of codes assigned.
Figure 2

Pathology report corpus characterization: (A and B) Word clouds depicting words with the highest aggregated tf-idf scores across the corpus of: (A) diagnostic text only and (B) all report subfields (all-fields); important words across the corpus indicated by the relative size of the word in the word cloud; (C and D) UMAP projection of the tf-idf matrix, with clustering and noise removal via HDBSCAN, for: (C) diagnostic texts only and (D) all report subfields (all-fields)

Supplementary Table 3

Correlation between length of the word document and the number of uniquely assigned codes; broken down by a reported cluster using the diagnostic fields and all report fields

Cluster   Diagnostic clusters            All-field clusters
          Correlation   p-value          Correlation   p-value
1         -0.09         1.6E-26          0.39          5.7E-178
2          0.07         1.6E-02         -0.05          7.1E-02
3         -0.30         3.4E-85          0.01          7.1E-01
4         -0.05         2.8E-02          0.02          3.9E-01
5          0.18         6.9E-93          0.00          9.6E-01
6          0.21         9.4E-37          0.01          2.9E-01
7          0.27         2.5E-26          0.08          1.7E-05
8          0.14         6.3E-137         0.57          4.9E-93
9          -            -                0.10          1.0E-28
10         -            -                0.31          5.3E-113
11         -            -                0.32          1.2E-33
12         -            -                0.16          1.0E-23
13         -            -                0.48          5.1E-98
14         -            -                0.09          3.9E-07
15         -            -                0.32          0.0E+00

Topic modeling with Latent Dirichlet Allocation and additional topic associations

From our LDA analyses, we discovered 10 topics for the diagnostic text and 10 topics for all document subsections [Figure 3; Supplementary Table 4]. Correlations between these topics and clusters, pathologists, and CPT codes are displayed in the supplementary material (Supplementary Figures 3-6). We discovered additional associations between CPT codes, clusters, and pathologists (Supplementary Figure 7A), suggesting a specialty bias in document characterization. We clustered pathologists by co-occurrence of procedural code assignments in order to establish “subspecialties” (e.g., a pathologist who signs out cases in multiple specialties), which could be used to help interpret sources of bias in the evaluation of downstream modeling approaches.
Figure 3

LDA Topic Words: Important words found for three select LDA Topics from: (A) diagnostic text only and (B) all report subfields (all-fields); important words across the corpus indicated by relative size of the word in the word cloud

Supplementary Table 4

Top 10 words found for each LDA topic (“topic descriptors”); 10 topics were discovered for the diagnostic text; and 10 additional topics were discovered for all of the report subfields (All Fields)

Diagnostic text
Topic 1: tumor, lymph, carcinoma, grade, nodes, prostatic, left, right, identified, invasive
Topic 2: tissue, right, left, benign, excision, soft, breast, fallopian, inflammation, resection
Topic 3: test, cancer, lesion, cervical, please, results, consensus, management, http://www.asccp.org, guidelines
Topic 4: mucosa, gastric, esophagus, chronic, normal, within, limits, abnormality, diagnostic, seen
Topic 5: cervical, results, cancer, please, guidelines, test, consensus, screening, management, cells
Topic 6: colon, polypectomy, tubular, adenoma, polyp, hyperplastic, ascending, sigmoid, fragments, transverse
Topic 7: nevus, shave, excision, left, right, melanocytic, changes, compound, specimen, back
Topic 8: cells, placenta, cord, umbilical, vessel, acute, seen, three, grams, villous
Topic 9: shave, cell, carcinoma, left, right, specimen, basal, squamous, discussion, peripheral
Topic 10: fragments, benign, endocervical, evidence, cervical, effect, squamous, dysplasia, mucosa, hpv

All-fields text
Topic 1: pap, hpv, test, hist, screening, cervical, clinical, therapy, cancer
Topic 2: skin, specimen, clinical, ’clock, excision, submitted, tissue, left, nevus
Topic 3: tissue, biopsy, formalin, quantity/size, sections/processing, submitted, labeled/fixative, description, soft
Topic 4: tissue, polyp, submitted, colon, formalin, clinical, soft, labeled/fixative, history
Topic 5: pap, hist, hpv, test, screening, cervical, clinical, therapy, cancer
Topic 6: tissue, lymph, margin, specimen, tumor, right, left, node, submitted
Topic 7: biopsy, diagnosis, specimen, clinical, see, case, punch, discussion, submitted
Topic 8: positive, antibody, tissue, clinical, studies, diagnostic, formalin, staining, core
Topic 9: clinical, pertinent, total, received, fluid, source, specimen, preparation, description
Topic 10: skin, shave, biopsy, left, clinical, right, submitted, specimen, tissue

Primary current procedural terminology code classification results

The XGBoost and BERT models significantly outperformed the SVM model for the prediction of primary CPT codes [Table 1; Figure 4A and B; Supplementary Table 5]. The BERT model made more effective use of the diagnostic text (macro-f1=0.825; κ=0.852) as compared with the XGBoost model (macro-f1=0.807; κ=0.835). Incorporating the text from other report subfields provided only a marginal performance gain for BERT (macro-f1=0.829; κ=0.855) and both a large and significant performance gain for XGBoost (macro-f1=0.831; κ=0.863) [Figure 4A and B]. Across the BERT and XGBoost models, codes were likely to be misclassified if they were of a similar complexity [Table 1; Supplementary Table 5]. Plots of low-dimensional text embeddings extracted from the BERT All-Fields model demonstrated clustering by code complexity and relative preservation of the ordering of code complexity (i.e., reports pertaining to codes of lower/higher complexity clustered together) [Figure 4C].
Table 1

Predictive performances for primary CPT code algorithms

Approach   Type         Macro-F1 ± SE     κ ± SE            AUC ± SE          Spearman ± SE
BERT       Diagnosis    0.825 ± 0.0064    0.852 ± 0.0033    0.99 ± 0.0008     0.84 ± 0.0044
           All fields   0.828 ± 0.0062    0.855 ± 0.0032    0.99 ± 0.0006     0.843 ± 0.0044
XGBoost    Diagnosis    0.807 ± 0.0069    0.835 ± 0.0034    0.99 ± 0.0007     0.824 ± 0.0045
           All fields   0.832 ± 0.0069    0.863 ± 0.0032    0.994 ± 0.0004    0.855 ± 0.0042
SVM        Diagnosis    0.497 ± 0.0047    0.644 ± 0.0043    0.554 ± 0.0021    0.637 ± 0.0056
           All fields   0.518 ± 0.0048    0.668 ± 0.0044    0.554 ± 0.0014    0.652 ± 0.0058

Macro-F1 and AUC measures are agnostic to the ordering of CPT code complexity, whereas the linear-weighted kappa (κ) and Spearman correlation coefficients respect the CPT code ordering (88302, 88304, 88305, 88307, and 88309)

Figure 4

Primary CPT Code Model Performance: (A and B) Grouped boxenplots demonstrating the performance of machine-learning models (BERT, XGBoost) for the prediction of primary CPT codes (bootstrapped performance statistics): (A) macro-averaged F1-Score and (B) Linear-Weighted Kappa, which takes into account the ordinal nature of the outcome, reported across the five CPT codes, given analysis of either the diagnostic text (blue) or all report subfields (orange); (C) UMAP projection of All-Fields BERT embedding vectors after applying the attention mechanism across report subfields; each point is reported with information aggregated from all report subfields; individual points represent reports, colored by CPT code; large thick circles represent the report centroids for each CPT code; note how CPT 88302 and CPT 88304 cluster together and, separately, CPT 88307 and CPT 88309 cluster together, whereas CPT 88305 sits between clustered reports of low and high complexity

Supplementary Table 5

Confusion matrices for each of the modeling approaches for primary CPT code prediction; aggregated across test sets of cross-validation folds; note how for the BERT and XGBoost models, misclassifications are mostly by codes of a similar case complexity

Predicted code (rows: true code)

                   Diagnosis                                All Fields
Model    True      88302  88304  88305  88307  88309       88302  88304  88305  88307  88309
BERT     88302       357     25     27      3      0         356     29     24      3      0
         88304        19   3594    322     92      2          18   3568    321    118      4
         88305        21    664  50434    392     20          16    614  50490    388     23
         88307         3    148    257   3247     36           3    131    253   3249     55
         88309         2     10     29     96    123           1      3     27     94    135
XGBoost  88302       338     29     45      0      0         334     33     44      1      0
         88304        10   3322    616     78      3          12   3418    515     82      2
         88305        19    387  50867    250      8           8    376  50921    221      5
         88307         2    149    522   2989     29           0    116    366   3178     31
         88309         1      9     46    100    104           0      5     38     97    120
SVM      88302        45     59    289     19      0          56     84    243     29      0
         88304        20   3384    436    189      0           3   3312    618     96      0
         88305        16   1516  48712   1287      0          13   1068  49382   1068      0
         88307         5    299    774   2613      0          11    373    736   2571      0
         88309         1     18     81    160      0           0     20     98    142      0

Ancillary current procedural terminology code and pathologist classification results

We were able to accurately assign ancillary CPT codes to each document, regardless of which machine-learning algorithm was utilized (Supplementary Figure 8; Supplementary Table 6). Across all ancillary codes, we found that XGBoost (median AUC=0.985) performed comparably to BERT (median AUC=0.990; P = 0.64) when predicting CPT codes based on the diagnostic subfield alone, whereas SVM performed worse (median AUC=0.966) than both approaches, per cross-validated AUC statistics (Supplementary Tables 6, 8-10; Supplementary Figures 9 and 10). In contrast to the results obtained for the primary codes, we discovered that classifying with all of the report subelements (All Fields) performed better than classifying based on the diagnostic subsection alone (P < 0.001 for both the BERT and XGBoost approaches; Supplementary Tables 6, 8-10; Supplementary Figures 9 and 10), suggesting that these other, more procedural/descriptive elements contribute meaningful contextual information for the assignment of ancillary CPT codes (Supplementary Materials section “Supplementary Ancillary CPT Code Prediction Results”). We also report that the sign-out pathologist can be accurately identified from the report text, with comparable performance between the BERT (macro-f1=0.72) and XGBoost (macro-f1=0.71) models, and optimal performance when all report subfields are used (macro-f1=0.77 and 0.78, respectively) (Supplementary Materials section “Supplementary Pathologist Prediction Results”; Supplementary Table 11; Supplementary Figure 11).
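The paired comparisons reported in Supplementary Table 8 correspond to one-sided Wilcoxon signed-rank tests over per-code AUC differences, which can be sketched as (hypothetical AUC values, not those from the study):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired per-code AUCs for two algorithm/subfield combinations.
auc_combo_1 = np.array([0.990, 0.970, 0.980, 0.965, 0.990, 0.950, 0.980, 0.970])
auc_combo_2 = np.array([0.970, 0.960, 0.950, 0.960, 0.970, 0.930, 0.960, 0.950])

# One-sided test: does combination 1 outperform combination 2 across codes?
stat, p = wilcoxon(auc_combo_1, auc_combo_2, alternative="greater")
```

Pairing by code matters here: the test asks whether one combination is consistently better code-by-code, rather than comparing the two pooled AUC distributions.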
Supplementary Table 6

Summary of distribution of AUCs across ancillary CPT codes for BERT, XGBoost, and SVM prediction models for diagnostic and all-fields text

Model      Report subfields   Median   1st Quartile   3rd Quartile
BERT       Diagnosis          0.990    0.973          0.995
           All-Fields         0.995    0.985          0.999
XGBoost    Diagnosis          0.985    0.974          0.994
           All-Fields         0.997    0.994          0.999
SVM        Diagnosis          0.966    0.954          0.984
           All-Fields         0.977    0.957          0.992
Supplementary Table 8

Wilcoxon tests for significance of relative performance gains (distribution of paired AUC differences for codes between two algorithms/report subfield combinations); all Wilcoxon tests were one-sided (algorithm 1 / selected subfields performance greater than algorithm 2 / selected subfields performance) to see which models perform the best for CPT code prediction

Algorithm 1 (Name / Report fields)   Algorithm 2 (Name / Report fields)   P-value
XGBoost / All fields                 BERT / All fields                    2.8E-07
XGBoost / Diagnosis                  BERT / Diagnosis                     6.4E-01
BERT / All fields                    BERT / Diagnosis                     4.2E-05
XGBoost / All fields                 XGBoost / Diagnosis                  4.0E-08
BERT / All fields                    SVM / All fields                     4.0E-08
BERT / Diagnosis                     SVM / Diagnosis                      6.4E-05
SVM / All fields                     SVM / Diagnosis                      6.8E-03
Supplementary Table 10

Additional performance statistics: First three numerical columns: Averaged sensitivity and specificity across the XGBoost and BERT algorithms to denote overall predictive performance for each CPT code; Average Youden calculated from the sensitivity and specificity; Final three numerical columns: Changes in sensitivity, specificity, and Youden when utilizing all report subfields versus the diagnostic text alone

Code    Avg. sensitivity   Avg. specificity   Avg. Youden   ΔSensitivity   ΔSpecificity   ΔYouden
85060   1.00               1.00               0.99          0.00           0.00           0.00
85097   0.99               0.99               0.99          0.01           0.01           0.02
87491   0.99               0.99               0.97          0.02           0.02           0.04
87591   0.99               0.99               0.98          0.00           0.02           0.02
87624   0.98               0.99               0.97          0.00           0.00           0.00
88108   0.95               0.97               0.92          0.08           0.04           0.12
88112   0.99               0.98               0.97          0.01           0.02           0.04
88141   1.00               1.00               1.00          0.00           0.00           0.00
88142   0.98               0.94               0.92          0.01           0.06           0.07
88172   0.95               0.97               0.92          0.08           0.04           0.12
88173   1.00               0.98               0.98          0.00           0.03           0.03
88175   0.98               0.99               0.97          0.00           0.00           0.00
88177   0.95               0.97               0.93          0.09           0.05           0.14
88184   0.93               0.95               0.88          0.07           0.04           0.11
88185   0.90               0.94               0.84          0.14           0.05           0.19
88188   0.93               0.92               0.85          0.06           0.05           0.11
88189   0.89               0.88               0.77          0.12           0.16           0.28
88271   0.95               0.96               0.91          0.04           0.04           0.07
88274   0.97               0.97               0.94          0.02           0.04           0.06
88300   0.99               0.99               0.98          0.00           0.00           0.00
88302   0.94               0.97               0.91          0.02           0.00           0.02
88304   0.96               0.96               0.92          0.00           0.00           0.01
88305   0.94               0.92               0.86          0.02           0.03           0.04
88307   0.97               0.97               0.94          0.01           0.00           0.01
88309   0.97               0.97               0.94          0.00           0.00           0.01
88311   0.98               0.98               0.97          0.02           0.01           0.02
88312   0.92               0.93               0.85          0.07           0.07           0.14
88313   0.92               0.93               0.86          0.06           0.05           0.12
88321   0.98               0.97               0.96          0.03           0.04           0.07
88331   0.96               0.97               0.93          0.05           0.04           0.09
88332   0.95               0.95               0.90          0.04           0.03           0.07
88333   0.98               0.98               0.95          0.03           0.03           0.06
88341   0.89               0.88               0.78          0.09           0.09           0.18
88342   0.92               0.91               0.83          0.11           0.12           0.23
88344   0.95               0.97               0.92          0.02           0.02           0.04
88346   0.98               0.97               0.95          0.03           0.04           0.08
88350   0.95               0.98               0.93          0.10           0.03           0.13
88360   0.94               0.95               0.89          0.04           0.03           0.08
Supplementary Table 11

Classification reports for pathologist prediction models (BERT, XGBoost, SVM) for reported subfields (diagnostic/all fields)

BERT
Pathologist    Diagnosis (Precision / Recall / F1)   All fields (Precision / Recall / F1)
1              0.94 / 0.94 / 0.94                    0.95 / 0.94 / 0.94
2              0.49 / 0.82 / 0.61                    0.61 / 0.84 / 0.70
3              0.94 / 0.86 / 0.89                    0.99 / 0.98 / 0.98
4              0.77 / 0.76 / 0.77                    0.81 / 0.81 / 0.81
5              0.80 / 0.85 / 0.82                    0.88 / 0.88 / 0.88
6              0.93 / 0.98 / 0.95                    0.96 / 0.96 / 0.96
7              0.81 / 0.82 / 0.81                    0.87 / 0.87 / 0.87
8              0.36 / 0.91 / 0.51                    0.41 / 0.80 / 0.55
9              0.86 / 0.78 / 0.82                    0.86 / 0.80 / 0.83
10             0.78 / 0.61 / 0.69                    0.74 / 0.68 / 0.71
11             0.67 / 0.71 / 0.69                    0.71 / 0.73 / 0.72
12             0.84 / 0.77 / 0.80                    0.87 / 0.83 / 0.85
13             0.80 / 0.91 / 0.85                    0.86 / 0.91 / 0.88
14             0.72 / 0.74 / 0.73                    0.83 / 0.85 / 0.84
15             0.83 / 0.74 / 0.78                    0.84 / 0.83 / 0.83
16             0.56 / 0.25 / 0.34                    0.54 / 0.35 / 0.42
17             0.89 / 0.96 / 0.93                    0.93 / 0.96 / 0.94
18             0.58 / 0.14 / 0.22                    0.45 / 0.27 / 0.34
19             0.71 / 0.72 / 0.71                    0.71 / 0.74 / 0.72
20             0.84 / 0.39 / 0.53                    0.74 / 0.43 / 0.54
Accuracy       0.74                                  0.79
Macro Avg      0.76 / 0.73 / 0.72                    0.78 / 0.77 / 0.77
Weighted Avg   0.77 / 0.74 / 0.74                    0.80 / 0.79 / 0.79
XGBoost
Pathologist    Diagnosis (Precision / Recall / F1)   All fields (Precision / Recall / F1)
1              0.92 / 0.88 / 0.90                    0.94 / 0.89 / 0.91
2              0.67 / 0.66 / 0.67                    0.68 / 0.76 / 0.72
3              0.90 / 0.85 / 0.88                    1.00 / 1.00 / 1.00
4              0.81 / 0.76 / 0.78                    0.80 / 0.83 / 0.81
5              0.74 / 0.89 / 0.81                    0.86 / 0.91 / 0.88
6              0.94 / 0.98 / 0.96                    0.97 / 0.98 / 0.97
7              0.88 / 0.77 / 0.82                    0.92 / 0.88 / 0.90
8              0.36 / 0.87 / 0.51                    0.51 / 0.73 / 0.60
9              0.72 / 0.88 / 0.79                    0.79 / 0.86 / 0.82
10             0.80 / 0.62 / 0.70                    0.76 / 0.67 / 0.72
11             0.75 / 0.77 / 0.76                    0.78 / 0.76 / 0.77
12             0.73 / 0.81 / 0.77                    0.79 / 0.87 / 0.83
13             0.83 / 0.76 / 0.79                    0.92 / 0.87 / 0.90
14             0.78 / 0.68 / 0.73                    0.91 / 0.83 / 0.87
15             0.75 / 0.73 / 0.74                    0.83 / 0.82 / 0.82
16             0.50 / 0.32 / 0.39                    0.56 / 0.47 / 0.51
17             0.69 / 0.52 / 0.59                    0.88 / 0.76 / 0.81
18             0.69 / 0.21 / 0.32                    0.58 / 0.42 / 0.48
19             0.71 / 0.74 / 0.72                    0.71 / 0.75 / 0.73
20             0.83 / 0.38 / 0.53                    0.70 / 0.51 / 0.59
Accuracy       0.73                                  0.80
Macro Avg      0.75 / 0.70 / 0.71                    0.79 / 0.78 / 0.78
Weighted Avg   0.75 / 0.73 / 0.72                    0.80 / 0.80 / 0.80
SVM
Pathologist    Diagnosis (Precision / Recall / F1)   All fields (Precision / Recall / F1)
1              0.59 / 0.62 / 0.60                    0.45 / 0.50 / 0.47
2              0.38 / 0.36 / 0.37                    0.10 / 0.00 / 0.00
3              0.56 / 0.57 / 0.56                    0.86 / 0.84 / 0.85
4              0.33 / 0.16 / 0.22                    0.20 / 0.18 / 0.19
5              0.39 / 0.52 / 0.44                    0.24 / 0.73 / 0.36
6              0.36 / 0.56 / 0.44                    0.34 / 0.65 / 0.45
7              0.09 / 0.04 / 0.05                    0.00 / 0.00 / 0.00
8              0.36 / 0.80 / 0.49                    0.34 / 0.92 / 0.49
9              0.49 / 0.67 / 0.57                    0.28 / 0.79 / 0.41
10             0.34 / 0.09 / 0.14                    0.18 / 0.05 / 0.07
11             0.44 / 0.32 / 0.38                    0.23 / 0.32 / 0.26
12             0.24 / 0.36 / 0.29                    0.21 / 0.24 / 0.23
13             0.00 / 0.00 / 0.00                    0.00 / 0.00 / 0.00
14             0.00 / 0.00 / 0.00                    0.00 / 0.00 / 0.00
15             0.23 / 0.42 / 0.30                    0.26 / 0.11 / 0.16
16             0.30 / 0.18 / 0.23                    0.28 / 0.04 / 0.07
17             0.00 / 0.00 / 0.00                    0.00 / 0.00 / 0.00
18             0.06 / 0.02 / 0.02                    0.18 / 0.03 / 0.05
19             0.32 / 0.49 / 0.38                    0.00 / 0.00 / 0.00
20             0.07 / 0.02 / 0.03                    0.11 / 0.00 / 0.00
Accuracy       0.35                                  0.32
Macro Avg      0.28 / 0.31 / 0.28                    0.21 / 0.27 / 0.20
Weighted Avg   0.29 / 0.35 / 0.30                    0.24 / 0.32 / 0.24
Confidence intervals of 1000-sample nonparametric bootstrap of the area under the receiver operating characteristic curve for each algorithm (BERT, XGBoost, and SVM) and for each report type (Diagnosis and All-Fields); each AUC was averaged across the 5 cross-validation folds with the same random seed set for sampling values within each CV fold for each code/group of pathologists; ancillary CPT codes and descriptions of codes listed on the left, in addition to the weighted AUC across 20 pathologists

Sensitivity/specificity for each algorithm/report subfield(s), averaged across cross-validation folds for each CPT code after optimization of Youden’s index to select the sensitivity/specificity

Model interpretation results

We also visualized which words were found to be important for a subsample of primary and ancillary procedural codes by using the XGBoost algorithm [Figure 5; Supplementary Figure 12]. In the Supplementary Materials, we have also included a table that denotes the relevance of the top 30 words for the XGBoost All Fields model for the prediction of specific primary CPT codes, as assessed through SHAP (Supplementary Table 12). Reports that were assigned the same ancillary CPT code clustered together in select low-dimensional representations learned by some of the All Fields BERT models [Figure 6A, C, and E]. Model-based interpretations of a few sample sentences for CPT codes using the Diagnosis BERT approach revealed important phrases that aligned with assignment of the respective CPT code [Figure 6C, D, and F]. Finally, we included a few examples of the attention mechanism used in the BERT approach, which highlights some of the many semantic/syntactic dependencies that the model finds within text subsections [Figure 7]. These attention matrices were plotted along with importance assigned to subsections of pathology reports using the All-Fields model [Figure 8], all with their respective textual content. Additional interpretation of reports for pathologists may be found in the Supplementary Materials (Supplementary Figures 13 and 14).
Figure 5

SHAP interpretation of XGBoost predictions: Word clouds demonstrating words found to be important using the XGBoost algorithm (All-Fields) for the prediction of primary CPT codes, found via Shapley attribution; important words pertinent to each CPT code indicated by the relative size of the word in the word cloud; word clouds visualized for word importance (A) across all five primary CPT codes and (B-F) for the following CPT codes: (B) CPT code 88302; (C) CPT code 88304; (D) CPT code 88305; (E) CPT code 88307; and (F) CPT code 88309; note that the size of the word considers strength but not directionality of the relationship with the code, which may be negatively associated in some cases

Supplementary Table 12

SHAP coefficients depicting relationships between the top 30 words that distinguish the primary CPT codes and their related CPT code: Positive value indicates positive association, whereas negative value indicates negative association between word and code; top codes determined by summing absolute SHAP value across CPT codes and test cohort

8830288304883058830788309
Myocyte2.337
Excision pilomatricoma-1.5370.006
Endocervical-1.5150.0010.0
Ureter fresh1.313
Left ankle0.45-0.813
Products conception0.029-1.161
Biopsy-0.3760.0540.4660.2350.045
Specimen cm-1.0590.108
Mesh0.201-0.001-0.929
Spleen0.0-1.085
Diagnosis skin0.0250.0060.168-0.836
Reduction1.081
Termination0.234-0.846
Toe clinical-1.044
Mucocele1.013-0.001
Hemorrhoid0.818-0.177
Fixative pilonidal0.9580.0
Valve-0.9450.004
Irregular0.2170.684-0.048-0.003
Representative0.0320.2590.4840.0150.148
Metatarsal resection0.937
Submitted skin-0.690.064-0.044-0.118
Angioleiomyoma-0.3580.54
Ovary serous0.897
Foreskin clinical0.879
Capsule excision0.878
Diagnosis fibroma0.874
Transected0.6580.046-0.159
Mass provided-0.7560.083
Excision suggestive-0.819
Figure 6

Embedding and Interpretation of BERT Predictions: (A, C, and E) UMAP projection of All-Fields BERT embedding vectors after applying the attention mechanism across report subfields; each point is reported with information aggregated from all report subfields; (B, D, and F) Select diagnostic text from individual reports interpreted by Integrated Gradients to elucidate words positively and negatively associated with calling the CPT code; Integrated Gradients was performed on the diagnostic text BERT models; Utilized CPT codes: (A and B) CPT code 88307, (C and D) CPT code 88342, and (E and F) CPT code 88360

Figure 7

BERT Diagnostic Model Self-Attention: Output of self-attention maps for select self-attention heads/layers from the BERT diagnostic text model visualizes various layers of complex word-to-word relationships for the assessment of a select pathology report that was found to report CPT code 88307

Figure 8

BERT All-Fields Model Interpretation: Visualization of importance scores assigned to pathology report subfields outside of the diagnostic section for three separate pathology reports (A–C) that were assigned by raters CPT code 88360; information from report subfields that appear more red was utilized more by the model for the final prediction of the code; attention scores listed below the text from the subfields and title of each subfield supplied


DISCUSSION

In this study, we characterized a large corpus of almost 100,000 pathology reports at a mid-sized academic medical center. Our studies indicate that the XGBoost and BERT methodologies produce highly accurate predictions of both primary and ancillary CPT codes, which has the potential to save operating costs by first suggesting codes prior to manual inspection and by flagging potential manual coding errors for review. Further, both the BERT and XGBoost models preserved the ordering of code/case complexity, with most misclassifications made between codes of similar complexity. The model interpretations via SHAP suggest terminology that is consistent with code complexity. For instance, “vulva,” “uterus,” and “adenocarcinoma” were associated with CPT code 88309. We noted associations of “endometrium diagnosis” and “esophagus” with CPT code 88305. “Biopsy” was associated with CPT codes 88305 and 88307, while “myocyte” was associated with CPT code 88307 (myocardium). In addition, we noticed a positive association between “products of conception” and lower complexity codes (CPT code 88304) and a negative association with higher complexity codes. The aforementioned associations uncovered using SHAP are consistent with reporting standards for histological examination.[31,32,57] Previous studies predicting CPT codes have largely been unable to characterize the importance of different subsections of a pathology report. Using the BERT and XGBoost methods, we were also able to show that significant diagnostic/coding information is contained in nondiagnostic subsections of the pathology report, particularly the Clinical Information and Specimen Processing sections. Such information was more pertinent when predicting ancillary CPT codes, as nondiagnostic subfields are more likely to contain test ordering information, though performance gains were also observed for primary codes when employing the XGBoost model over an entire pathology report.
This is expected, as many of the CPT codes are based on procedure type/specimen complexity, and ancillary CPT codes are expected to contain more informative text in the nondiagnostic sections. Potentially, the variable presence/absence of different reporting subfields made predicting primary codes using the BERT model more difficult, as the extraction of information from different subsections was not explicitly optimized, aside from learning how much weight to apply to each section. Although our prediction accuracy is comparable to previous reports of CPT prediction using machine-learning methods, our work covers a wider range of codes than previously reported, compares the different algorithms through rigorous cross-validation, reports a significantly higher sensitivity and specificity, and demonstrates the importance of utilizing other parts of the pathology report for procedural code prediction. Further, previous works had only considered the first 100 words of the diagnostic section and had failed to properly account for class balancing, potentially leading to inflated performance statistics; in contrast, our study carefully considers the ordinality of the response and reports macro-averaged measures that take into account infrequently assigned codes. We also demonstrated that the pathology report subfields contained pertinent diagnostic and procedural information that could adequately separate our text corpus based on ancillary CPT codes and the signing pathologist.
With regard to ancillary testing, it was interesting to note that some of the clinical codes for acquisition and quantification of markers on specialized stains (CPT 88341, 88342, 88344, 88360) performed the worst overall, which may suggest inconsistent reporting patterns for the ordering of specialized stains.[34] The revision of CPT codes 88342 and 88360 and the addition of CPT codes 88341 and 88344 in 2015 lie just outside the data collection period, which ran from June 2015 to June 2020.[58] Evolving coding/billing guidelines will always present challenges when developing NLP tools for clinical tests, though our models’ strong performance and the fact that major coding changes occurred outside of the data collection period suggest that temporal changes in coding patterns likely did not impact the ability to predict CPT codes. We did not find significant changes in the assignment of most of the primary codes over the study period. Since major improvements for these codes were obtained by incorporating the other report subfields, nondiagnostic text may be especially important for records of specialized stain processing and should be utilized as such.

Limitations

There are a few limitations to our work. For instance, due to computational constraints, most BERT models can only take as input 512 words at a time (Supplementary Section “Additional Information on BERT Pretraining”). We utilized a pretrained BERT model that inherited knowledge from large existing biomedical data repositories at the expense of flexibility in sequence length (i.e., we could not modify the word limit while utilizing this pretrained model). We noticed that in our text corpus, fewer than 2% of reports were longer than this limit and thus had to be truncated when input into the deep learning model, which may impact results. Potentially, longer pathology reports describe more complicated cases, which may utilize additional procedures. From our cluster analysis, we demonstrated that this appeared to be the case for a subset of report clusters, though for one cluster, the opposite was true. However, the vast majority of pathology reports fell within the BERT word limit, so we considered any word length-based association with CPT code complexity to have negligible impact on the model results. The XGBoost model, alternatively, is able to operate on the entire report text. Thus, XGBoost may more directly capture interactions between words spanning across document subsections pertaining to complex cases, which may serve as one plausible explanation of its apparent performance increase with respect to the BERT approaches. Although we attempted to take into account the ordinality of case complexity for the assignment of primary CPT codes, such work should be revisited as ordinal loss functions for both deep learning and tree-based models become more readily available.
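One simple way to realize such an ordinal loss, shown here for illustration only (the Ordinal Penalized Cross-Entropy used in this study may be formulated differently), is to add an expected-ordinal-distance penalty to standard cross-entropy, so that confusing adjacent complexity levels costs less than confusing distant ones. A minimal numpy sketch:

```python
import numpy as np

def ordinal_penalized_ce(logits, target, alpha=1.0):
    """One plausible ordinal loss (not necessarily the paper's exact form):
        loss = -log p[target] + alpha * sum_j |j - target| * p[j]
    The second term is the expected ordinal distance of the predicted
    distribution from the true class, so mass on adjacent complexity
    levels is penalized less than mass on distant ones.
    """
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    p = exp / exp.sum()
    distance = np.abs(np.arange(len(logits)) - target)
    return float(-np.log(p[target]) + alpha * np.sum(distance * p))

# Two predictions with identical cross-entropy on true class 1: one puts
# its residual mass on an adjacent class, the other on a distant class.
near = np.array([0.0, 4.0, 3.0, 0.0, 0.0])
far = np.array([0.0, 4.0, 0.0, 0.0, 3.0])
print(ordinal_penalized_ce(near, target=1) < ordinal_penalized_ce(far, target=1))  # True
```

Because both inputs assign the same probability to the true class, a plain cross-entropy could not distinguish them; only the ordinal penalty separates the two cases.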
There were also cases where multiple primary codes were assigned; the ancillary codes were predicted by using a multitarget objective, and primary code prediction could be configured similarly, though this was outside the scope of the study.[32] Although we conducted only coarse hyperparameter scans, such methods are generally deemed both practical and acceptable. Although other advanced hyperparameter search techniques exist (e.g., Bayesian optimization or genetic algorithms), in many cases these methods obtain performance similar to randomized hyperparameter searches while being far more resource intensive.[59]

Future directions

Given the secondary objectives of our study (e.g., prediction of ancillary codes and studying sources of variation in text, i.e., the pathologist), we were able to identify additional areas for follow-up. First, we were able to assess nuanced pathologist-specific language, which was largely determined by specialty (e.g., subspecialties such as cytology use highly regimented language, making it more difficult to separate practitioners). There is also potentially useful information to be gained by identifying text that can distinguish pathologists within subspecialties (found as a flag in the Pathology Information System), conditional on code assignment, rather than across subspecialties. This information can be useful in helping to create more standardized lexicons/diagnostic rubrics (for instance, The Paris System for Urine Cytopathology[60]). Research into creating a standard lexicon for particular specialties or converting raw free text into a standardized report could be very fruitful, especially for the positive impact it would have in allowing nonpathologist physicians to more easily interpret pathology reports and make clinical decisions. As an example of how a nonstandardized lexicon can impact reporting, it has long been suspected that outlier text can serve as a marker of uncertainty or ambiguity about the diagnosis. For instance, if there is a text content outlier in a body of reports with the same CPT code, then we can hypothesize that such text may be more prone to ambiguous phrases or hedging, through which pathologists may articulate their uncertainty about a definitive diagnosis. As such, we would also like to assess the impact of hedging on the assignment of procedural codes and, further, its subsequent impact on patient care. As another example, excessive ordering of different specialized stains and pathology consults may suggest indecisiveness, as reflected in the pathology report.
To ameliorate these differences in reporting patterns, generative deep learning methods can be employed to summarize the text through the generation of a standard lexicon. Other promising applications of BERT-based text models include the prediction of relative value units (RVUs) via report complexity for pathologist compensation calculations (which is related to primary code assignment) and the detection of cases that may have been mis-billed (e.g., a code of lower complexity was assigned), which can potentially save hospital resources.[61] We are currently developing a web application that will both interface with the Pathology Information System and estimate the fiscal impact of underbilling by auditing reports with false positive findings. Tools such as Inspirata can also provide additional structuring for our pathology reports outside of existing schemas.[62] Although much of the patient’s narrative may be told separately through text, imaging, and omics modalities,[63] there is tremendous potential to integrate semantic information contained in pathologist notes with imaging and omics modalities to capture a more holistic perspective of the patient’s health and to incorporate potentially useful information that could otherwise be overlooked. For instance, the semantic information contained in a report may highlight specific morphological and macro-architectural features in the corresponding biopsy specimen that an image-based deep learning model might struggle to identify without additional information. Although XGBoost demonstrated equivalent performance with the deep learning methods used for CPT prediction, its usefulness in a multimodal model is limited because such machine-learning approaches rely on a fixed feature extraction step, whereas deep learning feature generators can be tuned during optimization to complement the other modalities.
Alternatively, the semantic information contained within the word embedding layers of the BERT model can be fine-tuned when used in conjunction with, or directly predicting on, imaging data, allowing for more seamless integration of multimodal information. Integrating such information, in addition to structured text extraction systems (e.g., named entity recognition) that can recognize and correct the mention of such information in the text, may provide a unique search functionality that can benefit experiment planning.[34] Although comparisons between different machine-learning models may inform the optimal selection of tools that integrate with the Pathology Information System, we acknowledge that such comparisons can benefit from updating as new machine-learning architectures are developed. As such, we plan to incorporate newer deep learning architectures, such as the Reformer or ALBERT, which do not suffer from the word length limitations of BERT, though training all possible language models was outside of the scope of our study since pretrained medical word embeddings were not readily available for these architectures at the time of modeling.

CONCLUSION

In this study, we compared three cutting-edge machine-learning techniques for the prediction of CPT codes from pathology text. Our results provide additional evidence for the utility of machine-learning models to predict CPT codes in a large corpus of pathology reports acquired from a mid-sized academic medical center. Further, we demonstrated that utilizing text from parts of the document other than the diagnostic section aids in the extraction of procedural information. Although the XGBoost and BERT methodologies yielded comparable results, and either method can be used to improve the speed and accuracy of coding by suggesting relevant CPT codes to coders, deep learning approaches present the most viable methodology for incorporating text data with other pathology modalities.

Financial support and sponsorship

This work was supported by NIH grants R01CA216265, R01CA253976, and P20GM104416 to BC, Dartmouth College Neukom Institute for Computational Science CompX awards to BC and LV, and Norris Cotton Cancer Center, DPLM Clinical Genomics and Advanced Technologies EDIT program. JL is supported through the Burroughs Wellcome Fund Big Data in the Life Sciences at Dartmouth. The funding bodies above did not have any role in the study design, data collection, analysis and interpretation, or writing of the manuscript.

Authors’ contributions

The conception and design of the study were contributed by JL and LV. Initial analyses were conducted by JL and NV. All authors contributed to writing and editing of the manuscript and all authors read and approved the final manuscript.

Conflicts of interest

There are no conflicts of interest.

SUPPLEMENTARY MATERIALS

Name Databases Used for Deidentification

In this section, we have compiled a list of all publicly available datasets used to remove identifiable patient names from the report text.

First names:
https://github.com/ankane/age/blob/master/names/
https://github.com/smashew/NameDatabases/blob/master/NamesDatabases/first%20names/all.txt
https://hackage.haskell.org/package/gender
https://raw.githubusercontent.com/solvenium/names-dataset/master/dataset/Male_given_names.txt
https://raw.githubusercontent.com/solvenium/names-dataset/master/dataset/Female_given_names.txt
https://github.com/philipperemy/name-dataset/tree/master/names_dataset/v1

Last names:
https://github.com/smashew/NameDatabases/blob/master/NamesDatabases/surnames/all.txt
https://raw.githubusercontent.com/solvenium/names-dataset/master/dataset/Surnames.txt
https://github.com/philipperemy/name-dataset/tree/master/names_dataset/v1

Overview of Topic Modeling and Clustering Techniques

Here, we briefly provide an overview of the modeling techniques that, when utilized in conjunction, characterized the pathology report corpus through the establishment of important words that were not ubiquitous across the corpus (TF-IDF), the removal of noise and discovery of clusters (UMAP and HDBSCAN), and the generation of topics that describe recurrent themes (LDA).

TF-IDF (term frequency-inverse document frequency) takes as input a sparse count matrix, which contains the reports as rows and individual words/n-grams as columns, where each element is a count of the n-gram in the document. TF-IDF reweights the count matrix based on an algorithm that modifies word importance on the basis of whether the word is ubiquitous across all documents and/or enriched in its own document. The formula for TF-IDF is:

tfidf(t, d) = tf(t, d) × log(N / df(t))

where t and d refer to the specific term and document, respectively, and N is the number of documents in the corpus. The term frequency, tf, is the reported count of the n-gram in the particular document, whereas the document frequency, df, is the number of reports that contain the term (i.e., how ubiquitous the word is across the corpus).
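The weighting above can be computed directly; a minimal pure-Python sketch of the plain tf × log(N/df) formula (library implementations such as scikit-learn add smoothing terms and norm scaling on top of this):

```python
import math

def tf_idf(corpus):
    """Compute TF-IDF weights as tf(t, d) * log(N / df(t)).

    corpus: list of documents, each a list of tokens.
    Returns one {term: weight} dict per document. Library
    implementations (e.g., scikit-learn) add smoothing and
    Euclidean-norm scaling; this sketch uses the plain formula.
    """
    n_docs = len(corpus)
    # df(t): number of documents containing term t.
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in corpus:
        weights = {}
        for term in set(doc):
            tf = doc.count(term)  # raw count of the term in this document
            weights[term] = tf * math.log(n_docs / df[term])
        weighted.append(weights)
    return weighted

# Toy corpus of tokenized "reports" (illustrative only).
docs = [["biopsy", "skin", "biopsy"], ["biopsy", "colon"], ["stain", "colon"]]
w = tf_idf(docs)
# "biopsy" appears in 2 of 3 documents, twice in the first one:
print(round(w[0]["biopsy"], 3))  # 2 * ln(3/2) ≈ 0.811
```

Note how "biopsy," which occurs in most documents, receives a lower per-occurrence weight than the rarer "skin," matching the intent of downweighting ubiquitous terms.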
Usually these values are normalized via the Euclidean norm to downweight longer documents. Such information may replace the count matrix for downstream analysis, though this is not necessary.

UMAP operates on the count/TF-IDF matrix to reduce the dimensionality of the reports while preserving the key relationships between them. This is unlike PCA, which selects principal components to maximize variance, and t-SNE (t-distributed Stochastic Neighbor Embedding), which learns a lower dimensional manifold that preserves local distances between reports. UMAP forms fuzzy simplicial sets that represent the higher dimensional manifold at multiple distances. Computationally, this amounts to constructing a weighted nearest-neighbors graph and running an optimization routine that preserves a similar structure in the low-dimensional manifold while optimizing a force-directed graph layout.

HDBSCAN is a clustering algorithm that operates on the lower dimensional manifold to find natural groupings of the data. HDBSCAN combines hierarchical clustering techniques, which iteratively merge similar clusters, with density-based clustering, which estimates clusters of a similar density. HDBSCAN estimates the density of points based on whether a certain number of points exist within a small, well-defined neighborhood and whether two points share a common neighbor, both beyond what would be expected under noise. HDBSCAN varies the size of this neighborhood to consider/integrate density on multiple scales to form a hierarchy, which may be further processed to yield the final clusters along with points labeled as noise. Since the algorithm considers distance and connectedness on multiple scales/neighborhoods, it often pairs well with UMAP due to similarities in formulation.

Latent Dirichlet allocation (LDA) is a three-level probabilistic/Bayesian generative model that is used for inferring a distribution of topics across a document corpus.
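As a concrete illustration of LDA's generative story (walked through in detail below), the process can be simulated in a few lines. This numpy sketch uses a made-up five-word vocabulary and two illustrative topic-word distributions, not parameters estimated from the study corpus:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and two hypothetical topics (illustrative only).
vocab = ["biopsy", "carcinoma", "stain", "flow", "marker"]
topic_word = np.array([
    [0.40, 0.40, 0.10, 0.05, 0.05],  # topic 0: surgical pathology terms
    [0.05, 0.05, 0.30, 0.30, 0.30],  # topic 1: ancillary testing terms
])

def generate_document(topic_mixture, mean_length=20):
    """Simulate LDA's generative process for one document:
    draw N ~ Poisson, then for each word draw a topic from the
    document's topic mixture and a word from that topic."""
    n_words = rng.poisson(mean_length)           # N ~ Poisson
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=topic_mixture)       # topic for this word
        w = rng.choice(5, p=topic_word[z])       # word from that topic
        words.append(vocab[w])
    return words

# A document dominated by the "surgical pathology" topic.
doc = generate_document(np.array([0.8, 0.2]))
print(len(doc), doc[:5])
```

Fitting LDA is the inverse problem: given only the documents, recover plausible topic mixtures and topic-word distributions.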
Ultimately, the goal of the model is to provide a mechanistic account of how the count matrix arises (that is, estimating the frequency of words in each of the reports). The simplified conceptual framework is as follows:

1. A document is selected.
2. A number of words, N, is selected from a Poisson distribution (steps 3-4 are then iterated N times).
3. For each word position, a topic is selected, with some probability, from a set of latent topics (the topic mixture) that characterize the document.
4. A word is selected from the set of words ascribed to that topic.

The distribution of words selected via the generative approach in steps 1-4 is compared with the true distribution of words after marginalizing over the topics and documents. The generative model initially places two separate Dirichlet priors over the selection of topics (the topic mixture) and of words from topics (a k-topics by V-words matrix). Variational Bayes and expectation-maximization techniques are applied to estimate the posterior distribution of the topic-mixture and topic-word parameters by assuming that the conditional posterior follows a known family of distributions. Ultimately, sampling the predictive posterior allows for an inference of the distribution of topics across documents.

Hyperparameter Selection

We performed coarse hyperparameter searches for ideal model specifications. We registered optimal hyperparameters based on the loss over each validation set (alternatively, by either an F1-score or an AUC metric), depending on the modeling approach. Model convergence was monitored by using the validation set; the test set was completely held out from the updating of parameters and the tuning of hyperparameters. Here, we list the hyperparameters scanned for each model through coarse inspection of validation set statistics. Selection of this grid was based on a mixture of sensible recommendations and experimentation.
Selected hyperparameters are marked in bold, and unlisted hyperparameters were set to package defaults.

Support Vector Machine:
- Kernel: Radial Basis Function (RBF), Linear
- Gamma (scales RBF distance): Automatic (set to (number of features)^-1, where the number of features is 6 based on the UMAP embeddings), 1, or 5

XGBoost:
- Max depth: 2, 5, 8, None (runs until split criteria are satisfied, e.g., minimum samples to split on)
- Number of trees: 100, 300, 600, 800
- Number of GPU histogram bins (for optimal run time using GPU): 500, 800, 2000

BERT-Dx (fine-tuning the pretrained BERT model; AdamW optimizer):
- Batch size: 16, 64
- Number of epochs: 1, 2, 3, 5 (typical for fine-tuning a BERT model)

BERT All-Fields (Adam optimizer):
- Learning rate: 1e-2, 1e-3, 5e-4, 1e-4, 1e-5
- Batch size: 4, 8, 16, 64
- Number of epochs: 25, 100
- Loss function: Cross-Entropy, Ordinal Penalized Cross-Entropy

We note here that the BERT All-Fields model was trained by using a cosine annealing learning rate scheduler, which oscillates repetitively between the selected rate and an η value of 1e-5 over the course of many epochs. This serves to scan a range of potential learning rates for optimal validation loss, from which to terminate training. Similarly, the BERT-Dx model was trained with an initial learning rate of 5e-5 for fine-tuning, with a linear decay scheduler, under which the learning rate asymptotically decreased toward zero. The BERT-Dx model was fine-tuned to predict specific code(s)/pathologist(s) and to update pretrained word embeddings for input to the BERT All-Fields model. We tested an ordinal loss function that penalized misclassifications between adjacent categories/code complexities less than those between more distant codes. Weight decay was employed for both the BERT-Dx and All-Fields models as additional regularization.

Additional Information on BERT Pretraining

The BERT-Dx and BERT All-Fields models were pretrained by using the Bio-ClinicalBERT model, for which pretraining details can be found here: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT.
These word embeddings were downloaded a total of 1,290,981 times in the month of November 2021 alone, which demonstrates the widespread adoption of this embedding system despite the fixed-length nature of BERT inputs (512 words per input sequence). This embedding system was adopted for the purposes of this study because, at the time of adoption, these pretrained embeddings were the most widely adopted embedding system and were only available for BERT. Further, leveraging a publicly available set of embeddings provides added value and generalization when fine-tuning on our in-house dataset versus training from scratch. As additional evidence, the nondiagnostic report subfields were not fine-tuned on our corpus yet performed well in the All-Fields models using the public embeddings alone, and only a small fraction of our corpus featured reports whose length exceeded 512 words.

Model Interpretation Techniques

Shapley additive explanations (SHAP) is a technique that explains the results of any machine-learning model, which may have a complex decision surface. SHAP approximates this surface on a sample-by-sample basis by fitting one local additive model per sample; the coefficients of this model represent the importance of a feature or word. Local additive models, when summed, directly estimate the prediction of the machine-learning model. That is, if f is the machine-learning model and g is the local additive model for pathology report i with term frequencies x^(i), then the approximation is as follows:

f(x^(i)) ≈ g(x^(i)) = φ_{i,0} + Σ_k φ_{i,k} x^(i)_k

Here, φ_{i,k} represents the Shapley coefficient for term k of report i. The fitting procedure decides how to distribute the remainder between the mean value of the learned model over the dataset and the prediction among each of the predictors, while considering the importance of each individual predictor over the permutations/ensemble of possible orderings of predictors when assigning reward (the remainder).
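The additive decomposition above can be verified exactly on a toy model by enumerating all feature orderings, mirroring the permutation view of Shapley values just described. This sketch uses a hypothetical three-feature model, not any model from the study (libraries such as shap approximate this efficiently for real models):

```python
import itertools
import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    f: model mapping a feature vector to a scalar prediction;
    x: instance to explain; baseline: reference input.
    Each feature's value is its marginal contribution averaged over
    every ordering in which features are switched from baseline to x.
    Toy-scale only (n! orderings).
    """
    n = len(x)
    phi = np.zeros(n)
    perms = list(itertools.permutations(range(n)))
    for order in perms:
        current = np.array(baseline, dtype=float)
        prev = f(current)
        for k in order:
            current[k] = x[k]          # switch feature k on
            val = f(current)
            phi[k] += val - prev       # marginal contribution of k
            prev = val
    return phi / len(perms)

# Hypothetical model with an interaction term between features 1 and 2.
f = lambda v: 2 * v[0] + v[1] * v[2]
x, base = np.array([1.0, 2.0, 3.0]), np.zeros(3)
phi = exact_shapley(f, x, base)
# Additivity: baseline prediction + sum of Shapley values == f(x).
print(np.isclose(f(base) + phi.sum(), f(x)))  # True
```

The interaction term's credit (v1 * v2 = 6) is split evenly between features 1 and 2 across orderings, while feature 0's independent contribution of 2 is recovered exactly.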
The predictor importance derived for individual CPT codes or pathologists was estimated by averaging these term/word-level importances (Shapley coefficients) across the report corpus for a given model (we subsampled with a random seed for more efficient computation). This analysis was conducted for the XGBoost modeling approaches.

We utilized integrated gradients, applied to the diagnostic text, to interpret which words were found to be important in individual sentences when utilizing the BERT model. Integrated gradients is a backpropagation-based method that is used for identifying salient features. Many traditional methods for ascertaining important predictors take the gradient of the model prediction f with respect to a defined input x, which serves as a linear approximation to the complex functional approximation, and multiply by the original input to yield the predictor-specific importance:

importance_k = x_k × ∂f(x)/∂x_k

However, this is less than ideal when x lies in a region where the gradient saturates, and it provides no baseline for comparison. Integrated gradients, which are related to Shapley values, overcome these two issues by first establishing a noninformative baseline/counterfactual x′ and then summing successively more informative gradients along the path from the baseline to the observation to yield the overall importance of the predictors:

IG_k(x) = (x_k − x′_k) × ∫_0^1 ∂f(x′ + α(x − x′))/∂x_k dα

Much of the success of the BERT methodology can be attributed to a neural network architecture known as the Transformer. As input to the model, each word is mapped to a semantic vector that captures the word's meaning, which is updated throughout the training process. The Transformer contextualizes the set of word vectors in a report through its encoder and decoder layers, which are further decomposed into self-attention and feed-forward neural networks.
Self-attention mechanisms capture dependencies between words within the sentence by forming a weight between each word and, individually, all of the words of the sentence; that is, identifying the most relevant words for the understanding of the current word. This is accomplished by estimating a weight between two words of a sentence, and matrix operations may be employed to speed up the calculation of the self-attention. Suppose the word embeddings of the sentence are encapsulated in a matrix X, where rows index words and columns index the latent dimensions. Parameterized query, key, and value matrices are generated via the following operations:

Q = X W_Q,  K = X W_K,  V = X W_V

The query and key matrices are utilized as follows to construct the paired attention weights across a sentence, which could be thought of as learning/estimating a weighted unipartite graph, the attention matrix A (the key dimensionality d_k is used for normalization):

A = softmax(Q K^T / √d_k)

A is the estimate of the word-to-word dependencies in the sentence for this particular operation. The embeddings of the sentence are updated/contextualized by multiplying this self-attention matrix with the value matrix:

X′ = A V

Usually, each of these self-attention matrices represents a particular kind of dependency within the sentence. However, many complex dependencies may be needed to build a global understanding of the sentence/paragraph/report. As such, multiple self-attention “heads” are generated by allowing many query, key, and value matrices per encoding layer. We visualized the output of the estimated self-attention matrices in our article to demonstrate some of the learned dependencies. We have omitted from this discussion nuanced specifics pertaining to the decoder (e.g., retaining the query and key matrices from the encoder layers), residual connections, and positional embeddings, as they do not necessarily pertain to methods for interpreting the output of the BERT model for a pathology report.
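The scaled dot-product self-attention operations above can be sketched in a few lines of numpy; this single-head example uses random toy embeddings and projection weights, not BERT's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.
    X: (n_words, d) word embeddings; W_q/W_k/W_v: (d, d_k) projections.
    Returns the contextualized embeddings A @ V and the attention matrix A.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax: each word's weights over all words sum to 1.
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)
    return A @ V, A

n_words, d, d_k = 5, 8, 4
X = rng.normal(size=(n_words, d))               # toy sentence of 5 "words"
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out, A = self_attention(X, W_q, W_k, W_v)
print(out.shape, np.allclose(A.sum(axis=1), 1.0))  # (5, 4) True
```

Each row of A is exactly the kind of word-to-word weight map visualized in the self-attention figures: row i shows how much word i attends to every word in the sentence.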
Attention across document subfields is entirely separate from BERT's self-attention mechanisms. As mentioned in the main text, attention weights are utilized to decide how much information from each report subsection to incorporate into the final global representation of the report. A weight matrix W, an nz (number of latent dimensions) by one matrix (alternatively substituted by a gating neural network, as detailed in the main text), serves as a filter/gate that scores how important a subsection is:

s_j = z_j W

where z_j is the embedding of subfield j. The scores for the report subfields are softmaxed to assign each subfield a probability of incorporation:

α = softmax(s)

The gate is learned via model parameter updates during backpropagation. Importantly, we report the attention weights α to communicate the importance of specific report subsections.

Supplementary Ancillary CPT Code Prediction Results

XGBoost (median AUC=0.997) statistically outperformed BERT (median AUC=0.995) when utilizing all of the report subfields (p<0.001), but given the high predictive performance, these differences were not meaningful. Plots and tabulated statistics of the Youden index derived from the sensitivity/specificity of these algorithms across all of the validation folds confirm that utilizing information from all report subfields is better than utilizing information from the diagnostic text alone for the ancillary codes (Supplementary Table 10; Supplementary Figures 8–10). Averaging Youden's J statistic across all XGBoost and deep learning models, codes for immunohistochemistry/cytochemistry (CPT 88341, 88342, 88344, 88360), surgical pathology (CPT 88305), and flow cytometry (CPT 88188, 88189) performed worse than other ancillary procedural codes; however, performance improved considerably when including all report subfields for these codes (Supplementary Tables 6–9).
Interestingly, the code for cytogenetic testing (CPT 88271) also experienced large improvements in sensitivity and specificity by incorporating other report subfields (Supplementary Table 10).
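The subfield attention gate described in the interpretation methods above can likewise be sketched. In this minimal numpy illustration, the subfield embeddings and scoring vector are random stand-ins (the study's model may use a small gating network instead of a single weight vector):

```python
import numpy as np

rng = np.random.default_rng(1)

def subfield_attention(Z, W):
    """Pool per-subfield embeddings into one report representation.
    Z: (n_subfields, nz) embedding per report subfield;
    W: (nz,) scoring vector (the paper optionally replaces this with
    a gating neural network). Returns the attention-weighted global
    embedding and the attention weights alpha.
    """
    scores = Z @ W                      # one importance score per subfield
    scores -= scores.max()              # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over subfields
    return alpha @ Z, alpha             # weighted sum of subfield embeddings

n_subfields, nz = 4, 6                  # e.g., Diagnosis, Clinical Info, ...
Z = rng.normal(size=(n_subfields, nz))
W = rng.normal(size=nz)
report_vec, alpha = subfield_attention(Z, W)
print(report_vec.shape, np.isclose(alpha.sum(), 1.0))  # (6,) True
```

The weights alpha are exactly what the subfield-importance visualizations report: subfields with larger alpha contribute more to the global representation used for the final code prediction.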
Supplementary Table 9

Sensitivity/specificity for each algorithm/report subfield(s), averaged across cross-validation folds for each CPT code after optimization of Youden’s index to select the sensitivity/specificity

BERTXGBoostSVM



DiagnosisAll fieldsDiagnosisAll fieldsDiagnosisAll fields






Code | BERT Diagnosis Sens/Spec | BERT All-Fields Sens/Spec | XGBoost Diagnosis Sens/Spec | XGBoost All-Fields Sens/Spec | SVM Diagnosis Sens/Spec | SVM All-Fields Sens/Spec
85060 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00
85097 | 0.98/0.98 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 0.99/1.00
87491 | 0.96/0.98 | 0.99/1.00 | 0.99/0.98 | 1.00/1.00 | 0.99/0.98 | 0.99/0.97
87591 | 0.99/0.98 | 0.99/1.00 | 0.99/0.98 | 1.00/1.00 | 0.99/0.98 | 0.99/0.97
87624 | 0.98/0.99 | 0.98/0.99 | 0.98/0.99 | 0.98/0.99 | 0.98/0.97 | 0.97/0.98
88108 | 0.84/0.95 | 0.99/0.99 | 0.99/0.95 | 0.99/1.00 | 0.99/0.95 | 1.00/0.99
88112 | 0.97/0.96 | 0.99/0.99 | 0.99/0.97 | 1.00/0.99 | 0.99/0.97 | 0.99/0.99
88141 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 1.00/1.00 | 0.99/0.99
88142 | 0.95/0.89 | 0.99/0.97 | 0.99/0.93 | 0.97/0.97 | 0.99/0.93 | 0.94/0.95
88172 | 0.81/0.96 | 0.99/0.99 | 1.00/0.95 | 0.99/0.99 | 1.00/0.95 | 0.99/0.98
88173 | 1.00/0.97 | 1.00/0.99 | 1.00/0.97 | 1.00/0.99 | 0.99/0.97 | 0.99/0.99
88175 | 0.98/0.99 | 0.98/0.99 | 0.98/0.99 | 0.98/0.99 | 0.98/1.00 | 0.97/0.98
88177 | 0.83/0.95 | 1.00/0.99 | 0.99/0.95 | 1.00/1.00 | 0.98/0.95 | 1.00/0.99
88184 | 0.85/0.92 | 0.95/0.96 | 0.94/0.93 | 0.98/0.98 | 0.96/0.93 | 0.96/0.94
88185 | 0.71/0.90 | 0.95/0.95 | 0.94/0.93 | 0.98/0.97 | 0.96/0.92 | 0.95/0.95
88188 | 0.87/0.88 | 0.95/0.94 | 0.93/0.91 | 0.96/0.96 | 0.96/0.93 | 0.92/0.93
88189 | 0.77/0.73 | 0.94/0.95 | 0.88/0.88 | 0.95/0.97 | 0.95/0.95 | 0.92/0.95
88271 | 0.92/0.94 | 0.95/0.97 | 0.94/0.95 | 0.98/0.99 | 0.96/0.96 | 0.97/0.98
88274 | 0.96/0.94 | 0.97/0.99 | 0.95/0.95 | 0.99/0.99 | 0.97/0.97 | 0.98/0.99
88300 | 0.99/0.99 | 0.98/1.00 | 0.99/0.99 | 0.99/0.99 | 0.97/0.98 | 0.96/0.98
88302 | 0.90/0.98 | 0.94/0.96 | 0.96/0.97 | 0.96/0.97 | 0.95/0.93 | 0.93/0.92
88304 | 0.95/0.96 | 0.96/0.96 | 0.96/0.95 | 0.96/0.96 | 0.93/0.93 | 0.92/0.92
88305 | 0.94/0.90 | 0.96/0.92 | 0.93/0.91 | 0.95/0.95 | 0.20/0.68 | 0.18/0.70
88307 | 0.97/0.97 | 0.97/0.96 | 0.97/0.96 | 0.98/0.97 | 0.94/0.94 | 0.95/0.93
88309 | 0.96/0.97 | 0.96/0.97 | 0.97/0.97 | 0.98/0.98 | 0.94/0.95 | 0.96/0.95
88311 | 0.97/0.98 | 0.98/0.98 | 0.98/0.98 | 0.99/0.99 | 0.89/0.95 | 0.97/0.95
88312 | 0.84/0.88 | 0.93/0.94 | 0.93/0.92 | 0.98/0.98 | 0.85/0.86 | 0.84/0.84
88313 | 0.89/0.90 | 0.94/0.95 | 0.89/0.91 | 0.97/0.97 | 0.87/0.89 | 0.87/0.90
88321 | 0.98/0.95 | 0.99/0.99 | 0.96/0.95 | 1.00/1.00 | 0.91/0.91 | 0.99/0.99
88331 | 0.93/0.94 | 0.98/0.98 | 0.94/0.96 | 1.00/1.00 | 0.92/0.93 | 0.94/0.92
88332 | 0.92/0.93 | 0.96/0.95 | 0.94/0.94 | 0.99/0.99 | 0.87/0.95 | 0.95/0.96
88333 | 0.95/0.95 | 0.98/0.99 | 0.97/0.96 | 1.00/1.00 | 0.98/0.96 | 0.97/0.99
88341 | 0.84/0.83 | 0.91/0.91 | 0.85/0.85 | 0.96/0.95 | 0.80/0.80 | 0.90/0.89
88342 | 0.86/0.85 | 0.97/0.96 | 0.86/0.84 | 0.98/0.97 | 0.80/0.77 | 0.90/0.93
88344 | 0.95/0.97 | 0.96/0.97 | 0.94/0.94 | 0.97/0.98 | 0.95/0.97 | 0.94/0.98
88346 | 0.92/0.91 | 0.99/0.99 | 0.99/1.00 | 1.00/1.00 | 0.98/0.97 | 0.98/1.00
88350 | 0.80/0.94 | 1.00/1.00 | 0.99/1.00 | 1.00/1.00 | 0.97/0.98 | 0.99/1.00
88360 | 0.91/0.93 | 0.95/0.96 | 0.93/0.93 | 0.97/0.97 | 0.92/0.94 | 0.94/0.94
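The per-code sensitivities and specificities tabulated above are standard confusion-matrix quantities computed from binarized predictions for each CPT code. A minimal sketch with hypothetical labels (not the study's data):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true-positive rate) and specificity (true-negative rate)
    for one binary label, e.g. the presence of a single CPT code."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

# Toy example: 4 of 5 positives caught, 9 of 10 negatives rejected.
y_true = [1] * 5 + [0] * 10
y_pred = [1, 1, 1, 1, 0] + [0] * 9 + [1]
sens, spec = sensitivity_specificity(y_true, y_pred)
# sens = 4/5 = 0.8, spec = 9/10 = 0.9
```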
Supplementary Pathologist Prediction Results

After subsetting to 64,583 documents that correspond to the 20 pathologists with the most sign-outs, the pathologist who had written each pathology report was predicted with reasonably high accuracy by the XGBoost and BERT approaches. BERT (macro-F1 = 0.72) performed comparably to XGBoost (macro-F1 = 0.71) for the prediction of pathologists from the diagnostic text; BERT (macro-F1 = 0.77) and XGBoost (macro-F1 = 0.78) also performed comparably when considering all report subfields (all-fields) (Supplementary Figure 11). Model performance improved when incorporating all report subsections. Interestingly, these pathologist-specific subtleties could not be distinguished via the SVM approach (Supplementary Tables 4 and 8). Comparing UMAP projections of the embeddings formed by the BERT all-fields model with those of a bag-of-words representation (Supplementary Figure 13A-B) shows that the BERT methodology extracts features that are more pathologist specific. Comparing which pathologists were misclassified via the confusion matrix (Supplementary Figure 7B), corroborated by cross-tabulations with procedural codes (Supplementary Figure 7A), demonstrates that pathologists with similar subspecialties were less distinguishable; however, individual patterns persist. We visualized some of the patterns that BERT found in sample sentences via Integrated Gradients, and important words for select pathologists via SHAP applied to XGBoost (Supplementary Figure 14).
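The macro-F1 quoted above averages per-pathologist F1 scores with equal weight, so pathologists with few sign-outs count as much as frequent signers. A minimal pure-Python sketch with toy labels (not the study's data):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy 3-class example ("pathologists" A, B, C).
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]
score = macro_f1(y_true, y_pred)
# Per-class F1: A = 0.667, B = 0.8, C = 1.0; macro-F1 ≈ 0.822
```

Note that, unlike accuracy, this metric drops sharply if any single (possibly rare) class is predicted poorly, which is why it is a common choice for imbalanced multi-class problems such as this one.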
Boxenplots of the number of words for each subfield across the pathology report corpus; BERT cutoff word count of 512 words represented by a horizontal dashed line

CPT Code Statistics: A) Bar chart representing the breakdown of the corpus by assigned codes (proportion); B) Changes in primary CPT codes over time, from 2017-2020, aggregated counts by week

Strength of correlation between topics and CPT codes denoted by the size and color of each circle; large blue circles indicate strong positive associations, whereas large red circles indicate strong negative associations; associations for: A) diagnostic text; B) all-fields text

Strength of correlation between topics and HDBSCAN report clusters denoted by the size and color of each circle; large blue circles indicate strong positive associations, whereas large red circles indicate strong negative associations; associations for: A) diagnostic text; B) all-fields text

Strength of correlation between topics and individual pathologists denoted by the size and color of each circle; large blue circles indicate strong positive associations, whereas large red circles indicate strong negative associations; associations for: A) diagnostic text; B) all-fields text

Strength of correlation between CPT codes and HDBSCAN report clusters denoted by the size and color of each circle; large blue circles indicate strong positive associations, whereas large red circles indicate strong negative associations; associations for: A) diagnostic text; B) all-fields text

Pathologist Associations: A) Clustered heatmap of associations/co-occurrence between pathologists and CPT codes establishes "subspecialties," where pathologists who order similar CPT codes are likely of similar subspecialty/subspecialties; left color track is colored by established subspecialty clusters; B) Clustered confusion matrix for the pathologist prediction task (BERT diagnostic-text model); rows indicate true pathologists, whereas columns indicate predicted pathologists; row and column color bars utilize established "subspecialty" clusters; since the clustering of rows and columns places pathologists of a similar subspecialty together, this indicates that misclassification occurred mostly within subspecialties

Ancillary CPT Code Model Performance: Grouped boxenplots demonstrating the performance of machine-learning models (BERT, XGBoost, SVM) across CPT codes (distribution of AUCs reported for each CPT code), given the analysis of either the diagnostic text (blue) or all report subfields (orange)

Histogram of pairwise comparisons (subtraction) of AUC statistics (averaged across cross-validation folds) between sets of algorithms / utilized document subfields; the histogram tabulates AUC differences for individual codes, of which there are 38 values to be distributed among the histogram bins; reported relative performance gain (comparison/subtraction) of: A) XGBoost using all report subfields versus BERT using all report subfields; B) XGBoost using the diagnostic subfield versus BERT using the diagnostic subfield; C) BERT using all report subfields versus BERT using the diagnostic subfield; D) XGBoost using all report subfields versus XGBoost using the diagnostic subfield

Scatterplot of sensitivities and specificities for each CPT code, after averaging across cross-validation folds; each point is a CPT code, colored by whether it was predicted from the diagnostic text or all report subfields; histograms at the plot margins indicate the marginal distributions of code sensitivity/specificity

Averaged weighted AUC statistics across pathologists/cross-validation folds for the prediction of the top 20 pathologists with the most sign-outs; reported for BERT and XGBoost for the diagnosis and all-fields models

SHAP interpretation of XGBoost predictions: Word clouds demonstrating words found to be important by the XGBoost algorithm for the prediction of specific ancillary CPT codes, found via Shapley attribution; important words pertinent to each CPT code are indicated by the relative size of the word in the word cloud; word clouds visualized for three example CPT codes: A-B) CPT code 88189; C-D) CPT code 88313; E-F) CPT code 88360; visualizations performed for A,C,E) diagnostic text only; B,D,F) all report subfields (all-fields)

Pathology reports colored by practicing pathologist: UMAP embeddings of pathology reports, colored by the pathologist who had written the report; each point indicates a pathology report, projected from either: A) a bag-of-words / tf-idf count matrix; B) embeddings after integrating information from all report subsections via the BERT all-fields model

Interpretation of BERT and XGBoost models for pathologist prediction: Word cloud output of top words (size of word indicates importance; importance determined using SHAP) for XGBoost model prediction of the specific pathologist, and Integrated Gradients highlighting of text via the BERT diagnostic model for select pathologists: A) Pathologist 5; B) Pathologist 20
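The pairwise AUC comparison described in the histogram legend reduces, for each of the 38 ancillary codes, to subtracting one configuration's AUC from another's; the resulting differences feed the histogram bins. A minimal sketch using three codes, with AUC values taken from Supplementary Table 7 (XGBoost all-fields minus BERT all-fields):

```python
# Per-code AUCs for two configurations (values from Supplementary Table 7).
auc_xgb_all = {"88108": 0.9975, "88305": 0.9889, "88313": 0.9953}
auc_bert_all = {"88108": 0.9990, "88305": 0.9775, "88313": 0.9854}

# One difference per CPT code; positive values favor XGBoost all-fields.
diffs = {code: round(auc_xgb_all[code] - auc_bert_all[code], 4)
         for code in auc_xgb_all}
# diffs == {"88108": -0.0015, "88305": 0.0114, "88313": 0.0099}
```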
Supplementary Table 7

Confidence intervals from a 1,000-sample nonparametric bootstrap of the area under the receiver operating characteristic curve (AUC) for each algorithm (BERT, XGBoost, and SVM) and each report type (Diagnosis and All-Fields); each AUC was averaged across the 5 cross-validation folds, with the same random seed set for sampling values within each CV fold for each code/group of pathologists; ancillary CPT codes and their descriptions are listed on the left, in addition to the weighted AUC across the 20 pathologists

AUCs (± SE)

Code | Description | BERT Diagnosis | BERT All-Fields | XGBoost Diagnosis | XGBoost All-Fields | SVM Diagnosis | SVM All-Fields
85060 | Blood smear interpretation by physician with written report | 0.998 ± 0.0002 | 0.9994 ± 0.0001 | 0.9989 ± 0.0002 | 0.9996 ± 0.0001 | 0.9983 ± 0.0002 | 0.9968 ± 0.0012
85097 | Bone marrow, smear interpretation | 0.9996 ± 0.0001 | 0.9994 ± 0.0001 | 0.9989 ± 0.0005 | 0.9997 ± 0.0 | 0.9985 ± 0.0001 | 0.9941 ± 0.0014
87491 | Detection test for chlamydia | 0.9905 ± 0.0008 | 0.9984 ± 0.0008 | 0.9898 ± 0.001 | 0.9996 ± 0.0002 | 0.9872 ± 0.0013 | 0.9819 ± 0.0042
87591 | Detection test for Neisseria gonorrhoeae (gonorrhoeae bacteria) | 0.9905 ± 0.0008 | 0.9994 ± 0.0001 | 0.9898 ± 0.001 | 0.9996 ± 0.0002 | 0.9872 ± 0.0013 | 0.9819 ± 0.0042
87624 | Detection test for human papillomavirus (HPV) | 0.9968 ± 0.0006 | 0.9973 ± 0.0003 | 0.9958 ± 0.0004 | 0.9984 ± 0.0002 | 0.9778 ± 0.0017 | 0.988 ± 0.0016
88108 | Cell examination of specimen | 0.9802 ± 0.0017 | 0.999 ± 0.0003 | 0.9808 ± 0.0008 | 0.9975 ± 0.0015 | 0.9717 ± 0.0026 | 0.9989 ± 0.0001
88112 | Cell examination of specimen | 0.9934 ± 0.0005 | 0.9991 ± 0.0001 | 0.9935 ± 0.0002 | 0.9995 ± 0.0 | 0.9887 ± 0.0004 | 0.9959 ± 0.0008
88141 | Cytopathology, cervical or vaginal (any reporting system), requiring interpretation by physician | 1.0 ± 0.0 | 0.9998 ± 0.0001 | 0.9996 ± 0.0001 | 0.9999 ± 0.0 | 0.9988 ± 0.0004 | 0.9923 ± 0.0014
88142 | Pap test (Pap smear) | 0.9886 ± 0.0017 | 0.9938 ± 0.0016 | 0.9826 ± 0.0017 | 0.9951 ± 0.0018 | 0.9663 ± 0.0018 | 0.9501 ± 0.0131
88172 | Evaluation of fine needle aspirate | 0.9825 ± 0.0011 | 0.999 ± 0.0002 | 0.9837 ± 0.0011 | 0.999 ± 0.0006 | 0.9749 ± 0.0015 | 0.9903 ± 0.001
88173 | Evaluation of fine needle aspirate with interpretation and report | 0.9867 ± 0.0024 | 0.9988 ± 0.0002 | 0.9899 ± 0.0005 | 0.9996 ± 0.0 | 0.9818 ± 0.001 | 0.997 ± 0.0006
88175 | Pap test | 0.998 ± 0.0005 | 0.9976 ± 0.0003 | 0.9972 ± 0.0003 | 0.9981 ± 0.0003 | 0.9932 ± 0.0009 | 0.9847 ± 0.002
88177 | Pap test | 0.9774 ± 0.0023 | 0.9993 ± 0.0001 | 0.9783 ± 0.0031 | 0.9998 ± 0.0 | 0.9624 ± 0.0044 | 0.9955 ± 0.0003
88184 | Flow cytometry technique for DNA or cell analysis | 0.9731 ± 0.0082 | 0.9848 ± 0.0022 | 0.9738 ± 0.0025 | 0.9942 ± 0.0012 | 0.9699 ± 0.0029 | 0.9708 ± 0.0033
88185 | Flow cytometry, cell surface, cytoplasmic, or nuclear marker, technical component only | 0.9629 ± 0.0075 | 0.9841 ± 0.0022 | 0.9711 ± 0.0027 | 0.994 ± 0.0008 | 0.9594 ± 0.003 | 0.9692 ± 0.0034
88188 | Cytopathology procedures | 0.9428 ± 0.0121 | 0.9773 ± 0.0029 | 0.9589 ± 0.0041 | 0.9875 ± 0.0024 | 0.9593 ± 0.0041 | 0.9486 ± 0.0029
88189 | Flow cytometry technique for DNA or cell analysis | 0.9043 ± 0.0295 | 0.9753 ± 0.0052 | 0.9199 ± 0.0101 | 0.9785 ± 0.0073 | 0.9611 ± 0.0074 | 0.9471 ± 0.0118
88271 | FISH DNA probe, each | 0.9943 ± 0.002 | 0.9906 ± 0.0025 | 0.9735 ± 0.0055 | 0.995 ± 0.0024 | 0.9717 ± 0.0062 | 0.9768 ± 0.0061
88274 | Genetic testing | 0.9951 ± 0.0011 | 0.9943 ± 0.003 | 0.9755 ± 0.0059 | 0.9941 ± 0.0036 | 0.9775 ± 0.0058 | 0.9922 ± 0.0029
88300 | Pathology examination of tissue using a microscope, limited examination | 0.9983 ± 0.0011 | 0.9969 ± 0.0008 | 0.9967 ± 0.0012 | 0.9978 ± 0.0009 | 0.9846 ± 0.0025 | 0.9868 ± 0.0023
88302 | Pathology examination of tissue using a microscope | 0.9768 ± 0.0083 | 0.9824 ± 0.0036 | 0.9887 ± 0.0028 | 0.9934 ± 0.0019 | 0.9581 ± 0.0047 | 0.9643 ± 0.0042
88304 | Pathology examination of tissue using a microscope, moderately low complexity | 0.991 ± 0.0011 | 0.9877 ± 0.0007 | 0.987 ± 0.0009 | 0.9907 ± 0.0006 | 0.9534 ± 0.0019 | 0.9509 ± 0.0021
88305 | Pathology examination of tissue using a microscope, intermediate complexity | 0.9726 ± 0.0012 | 0.9775 ± 0.0005 | 0.97 ± 0.0006 | 0.9889 ± 0.0003 | 0.1087 ± 0.0012 | 0.0807 ± 0.001
88307 | Pathology examination of tissue using a microscope, moderately high complexity | 0.9942 ± 0.0006 | 0.9928 ± 0.0004 | 0.9925 ± 0.0004 | 0.995 ± 0.0003 | 0.9614 ± 0.0015 | 0.968 ± 0.0013
88309 | Pathology examination of tissue using a microscope, high complexity | 0.9966 ± 0.0009 | 0.9885 ± 0.0021 | 0.9949 ± 0.0008 | 0.9967 ± 0.0007 | 0.9608 ± 0.0034 | 0.9777 ± 0.0022
88311 | Preparation of tissue for examination by removing any calcium present | 0.9906 ± 0.0033 | 0.9972 ± 0.0003 | 0.9943 ± 0.0009 | 0.9991 ± 0.0002 | 0.9316 ± 0.0035 | 0.9741 ± 0.0019
88312 | Special stained specimen slides to identify organisms, including interpretation and report | 0.9766 ± 0.0025 | 0.9792 ± 0.0012 | 0.9692 ± 0.0017 | 0.9972 ± 0.0004 | 0.8974 ± 0.0038 | 0.9063 ± 0.0031
88313 | Special stained specimen slides to examine tissue, including interpretation and report | 0.9577 ± 0.0065 | 0.9854 ± 0.0013 | 0.9619 ± 0.0023 | 0.9953 ± 0.0006 | 0.9163 ± 0.0039 | 0.9234 ± 0.0036
88321 | Surgical pathology consultation and report | 0.9945 ± 0.0007 | 0.998 ± 0.0007 | 0.9889 ± 0.001 | 0.9994 ± 0.0001 | 0.9483 ± 0.0033 | 0.9931 ± 0.0013
88331 | Pathology examination of tissue during surgery | 0.949 ± 0.0135 | 0.9958 ± 0.0012 | 0.9834 ± 0.0019 | 0.9996 ± 0.0002 | 0.9465 ± 0.0044 | 0.9592 ± 0.0024
88332 | Pathology examination of specimen during surgery | 0.8971 ± 0.0485 | 0.9821 ± 0.0063 | 0.974 ± 0.0059 | 0.9972 ± 0.0008 | 0.9077 ± 0.0186 | 0.9666 ± 0.0084
88333 | Pathology examination of tissue specimen during surgery | 0.9924 ± 0.0011 | 0.9963 ± 0.0018 | 0.9883 ± 0.0027 | 0.999 ± 0.0008 | 0.9827 ± 0.0021 | 0.979 ± 0.0076
88341 | Immunohistochemistry or immunocytochemistry, per specimen | 0.9353 ± 0.0034 | 0.96 ± 0.0012 | 0.9273 ± 0.0017 | 0.9901 ± 0.0004 | 0.8514 ± 0.0031 | 0.9262 ± 0.0022
88342 | Immunohistochemistry or immunocytochemistry, per specimen; initial single antibody stain procedure | 0.9384 ± 0.0024 | 0.9925 ± 0.0003 | 0.9319 ± 0.0011 | 0.9955 ± 0.0002 | 0.8404 ± 0.0021 | 0.9471 ± 0.0015
88344 | Special stained specimen slides to examine tissue | 0.9833 ± 0.0117 | 0.9824 ± 0.0075 | 0.9747 ± 0.0061 | 0.9942 ± 0.0028 | 0.9664 ± 0.0091 | 0.9627 ± 0.0091
88346 | Antibody evaluation | 0.9971 ± 0.0028 | 0.9972 ± 0.0018 | 0.9966 ± 0.0026 | 0.9977 ± 0.0023 | 0.987 ± 0.0045 | 0.989 ± 0.005
88350 | Antibody evaluation | 0.9999 ± 0.0001 | 0.9998 ± 0.0 | 0.9993 ± 0.0004 | 0.9999 ± 0.0 | 0.9852 ± 0.0048 | 0.9933 ± 0.0037
88360 | Microscopic genetic analysis of tumor; morphometric analysis, tumor immunohistochemistry | 0.7182 ± 0.0282 | 0.9853 ± 0.0022 | 0.9761 ± 0.0027 | 0.9944 ± 0.0013 | 0.9578 ± 0.0042 | 0.9564 ± 0.0048
Top 20 pathologists | Weighted AUC across pathologists | 0.984 ± 0.0002 | 0.9877 ± 0.0002 | 0.9823 ± 0.0002 | 0.99 ± 0.0001 | 0.3778 ± 0.0007 | 0.3726 ± 0.0007
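The bootstrap procedure described in the table caption can be sketched as follows; the AUC here is the rank-based (Mann-Whitney) estimate, and the labels/scores are illustrative, not the study's data:

```python
import random

def auc(labels, scores):
    """Rank-based AUC: probability a positive case outscores a negative one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=42):
    """Percentile confidence interval from a nonparametric bootstrap:
    resample cases with replacement, recompute the AUC each time."""
    rng = random.Random(seed)  # fixed seed, as in the table caption
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < len(lab):  # resample must contain both classes
            aucs.append(auc(lab, [scores[i] for i in idx]))
    aucs.sort()
    lo = aucs[int(alpha / 2 * n_boot)]
    hi = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative, perfectly separated scores: every resampled AUC is 1.0.
labels = [0] * 20 + [1] * 20
scores = list(range(40))
lo, hi = bootstrap_auc_ci(labels, scores)
```

With real, imperfect scores the resampled AUCs vary, and (lo, hi) quantifies that sampling uncertainty without any distributional assumption.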