
Artificial Intelligence-Driven Structurization of Diagnostic Information in Free-Text Pathology Reports.

Pericles S Giannaris^1,2, Zainab Al-Taie^1,3, Mikhail Kovalenko^1,2, Nattapon Thanintorn², Olha Kholod^1,2, Yulia Innokenteva¹, Emily Coberly², Shellaine Frazier², Katsiarina Laziuk², Mihail Popescu^4,1,5, Chi-Ren Shyu^1,5, Dong Xu^5,1, Richard D Hammer^2,1, Dmitriy Shin^2,1,5.

Abstract

BACKGROUND: Free-text sections of pathology reports contain the most important information from a diagnostic standpoint. However, this information is largely underutilized for computer-based analytics. The vast majority of NLP-based methods lack a capacity to accurately extract complex diagnostic entities and relationships among them as well as to provide an adequate knowledge representation for downstream data-mining applications.
METHODS: In this paper, we introduce a novel informatics pipeline that extends open information extraction (openIE) techniques with artificial intelligence (AI) based modeling to extract and transform complex diagnostic entities and relationships among them into Knowledge Graphs (KGs) of relational triples (RTs).
RESULTS: Evaluation studies have demonstrated that the pipeline's output significantly differs from a random process. The semantic similarity with original reports is high (Mean Weighted Overlap of 0.83). The precision and recall of extracted RTs based on experts' assessment were 0.925 and 0.841 respectively (P <0.0001). Inter-rater agreement was significant at 93.6% and inter-rater reliability was 81.8%.
CONCLUSION: The results demonstrated important properties of the pipeline such as high accuracy, minimality and adequate knowledge representation. Therefore, we conclude that the pipeline can be used in various downstream data-mining applications to assist diagnostic medicine.

Entities: Chemical

Keywords: Free-text pathology reports; information extraction; n-ary modeling; structurization

Year: 2020 PMID: 32166042 PMCID: PMC7045509 DOI: 10.4103/jpi.jpi_30_19

Source DB: PubMed Journal: J Pathol Inform

INTRODUCTION

Pathologists document the diagnostic process of complex cancer diagnosis in unstructured and semistructured free-text pathology reports. These documents provide diagnostic information including clinical history, immunophenotypes, complex morphological features, and various molecular and genomic tests such as fluorescent in situ hybridization, cytogenetics, and next-generation sequencing.[1,2] A thorough analysis of diagnostic information is critical to make a correct diagnosis. For instance, the diagnosis of classical Hodgkin lymphoma (cHL) is based, to a great extent, on the combination of morphology and immunophenotypic biomarkers (e.g., CD20 and CD30).[2,3] To computationally analyze diagnostic information and improve the process of diagnosis, the information must be in a structured format. The demand for structured representation of diagnostic data is exemplified by pathologists’ increasing use of computerized College of American Pathologists (CAP) checklists, synoptic reporting (summarizations), semistructured final diagnosis, Current Procedural Terminology (CPT) coding systems, Tumor Node Metastasis (TNM) cancer staging systems, Systematized Nomenclature of Medicine-Clinical Terms (SNOMED), cancer registry data and validation systems, or patient stratification techniques. However, the bulk of biomedically significant information is stored in free-text format without any predefined structure. This makes computational analysis of data and their relationships across pathology reports a daunting task.[4] In addition, the current attempts to present diagnostic findings in structured format (e.g., TNM cancer staging, synoptic reporting, or checklists) are unable to fully express the biological complexity of the diagnostic entities (DEs) involved. It is therefore important to extract information from unstructured free-text pathology reports.
Structured representation of diagnostic data could facilitate the development of knowledge bases (KBs), knowledge graphs (KGs), summarization applications, as well as question-answering systems. Published research illustrates that structured data can enable applications that could potentially enhance services and research related to patient cohort identification, discovery of predictive biomarkers for precision medicine, the study of mechanisms underlying cancer genesis for treating individual patients, or cancer surveillance, to name a few.[1,5,6,7,8,9,10,11,12,13,14,15,16,17,18] Currently, methods from the natural language processing (NLP) and information extraction (IE) fields are used to convert information to a computable form. These NLP-IE techniques analyze natural language text and attempt to output data in structured form (e.g., vectors and subject–predicate–object triples). The contribution of NLP-IE to extracting complete sets of information from biomedical text has mostly been demonstrated by machine learning (automatic model building for data analysis) and rule-based (if-then algorithm statements) approaches.[7,19,20,21,22,23,24,25,26,27,28] However, in the vast majority of cases, the extraction of information is limited to relatively simple DEs, including various cancer characteristics such as tumor site, stage, and diagnosis. To increase the chance of extracting all important information contained in free text (high recall), recent NLP research has been extended to introduce the open IE (openIE) paradigm. This is a self-supervised learning task, which aims at the extraction of all possible relations between data in a text.[29,30] As such, this approach has the potential to be more suitable for the extraction of diagnostic information.
In openIE methods, the relations between DEs are usually expressed as subject-predicate-object relational triples (RTs), for instance, [PAX5]-(shows bright positivity in)-[B-cells], or [heterogeneous cell population]-(are composed of)-[small to medium sized round lymphocytes].[30,31,32,33,34,35,36] Such triples can generate KGs. However, the output of state-of-the-art openIE approaches often includes “uninformative extractions” (i.e., extractions that omit critical information), “incoherent extractions” (i.e., extractions where the relational phrase has no meaningful interpretation), and “overly-specific relations that convey too much information to be useful in further downstream semantic tasks”.[30] For example, from the following excerpt of a pathology report, “the large neoplastic cells show bi- and multi-nucleation with large nuclei, pale chromatin, prominent cherry red nucleoli, and abundant cytoplasm consistent with Reed-Sternberg cell variants [. . .]. Neoplastic cells are also negative for ALK1, EBV, CD57, EMA, and CD7”, openIE applications would generate the following RTs: (large neoplastic cells)-(show)-(multi-nucleation with large nuclei, pale chromatin, prominent cherry red nucleoli), (Neoplastic cells)-(are negative for)-(ALK1, EBV, CD57, EMA), (cells)-(show)-(large). Note that the above output contains triples that suffer from compoundness, i.e., the presence of several DEs in the subject and/or the object of an RT, which is also referred to as a lack of minimality in the openIE literature.[30] Yet another drawback is incoherent extraction, e.g., (large)-(show)-(neoplastic). Such compound RTs make it impossible to generate KGs and mine them for implicit disease patterns. To enable such computational analyses, RTs should express atomic information,[30] for example, (Neoplastic cells)-(are negative for)-(ALK1) or (neoplastic cells)-(show)-(multinucleation with large nuclei).
Here, we introduce a novel informatics pipeline to extract information from free-text pathology reports as sets of atomic RTs. To accomplish that, we extend a state-of-the-art openIE method by adding two critical steps of (i) atomization of compound RTs and (ii) their knowledge representation using n-ary relational modeling. The next section provides a detailed description of the methodology.

METHODS

Overview

Our structurization pipeline consists of two main processes: Foreground process (FP) and background process (BP) [Figure 1]. In a semiautomated and recurring BP, pathology reports are searched for simple DEs (e.g., Reed-Sternberg cells), phrases that represent complex DEs (e.g., perivascular fibrosis with “onion skinning”) as well as other terms needed for structurization (e.g., Mayo Clinic-Rochester Main Campus). These terms are then added to a diagnostic practice vocabulary (DPV) and linked to appropriate terms in a diagnostic practice ontology (DPO). In addition, reports are searched for relations that can serve as predicates in RTs.

Figure 1

Architecture of extraction and structurization of diagnostic information pipeline

Thereafter, in the first two steps [P1 and P2 in Figure 1] of the FP, a cohort of pathology reports for a specific study is retrieved and preprocessed. To extract information in RT format, we employ the state-of-the-art Stanford OpenIE[37] application, for which, at scheduled times in the BP, the Stanford Named Entity Recognition[38] classifier is retrained to recognize named entities in pathology reports [P3, P6 in Figure 1]. In the final step of the FP, the subject and/or object of Stanford OpenIE-generated compound RTs are split into atomic terms, which are identified using the DPV. Then, for each Stanford OpenIE-generated RT, an n-ary model utilizing atomic terms is created. Such an n-ary model reflects the same semantics as the original RT. A set of n-ary models derived from a free-text pathology report represents a fully structurized version of the report.

Data acquisition and preprocessing (foreground process)

We queried the pathology department's medical record systems to generate a cohort of free-text pathology reports for our analysis (see section S2 in Supplementary Material [SM]). Since natural language in pathology reports is often characterized by long sentences (e.g., “Reed-Sternberg cells are negative for CD20, CD3, CD43, BCL6, CD79a, EMA, Alk-1, and LCA [CD45]”), phrases that use multiple verbs (e.g., “Flow cytometric analysis reveals B-cells show no evidence of a monotypic population or aberrant antigen expression”), incomplete expressions (e.g., “no clusters are present,” “no blasts”), variable punctuation and symbols (e.g., “\”, “;”, “:”, “+/-”), abbreviations (e.g., RS-cells instead of Reed-Sternberg cells), etc., it is challenging for openIE applications to efficiently extract information. To address this issue, we preprocess reports with a combination of regular expression scripts to (i) remove patient, physician, and document identifiers, (ii) split reports into sentences, (iii) standardize variations of proper names, etc. (e.g., “CJC, PSF” replaced by “chris j chris psf,” “Dr. Smith” replaced by “john smith md”), (iv) collapse variable whitespace to a single space, (v) convert punctuation marks to “,” or “.”, and (vi) convert all letter character strings to lower case[39,40,41] [S3 in Supplementary Material].
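Steps (ii) and (iv) to (vi) can be illustrated with a minimal Python sketch. The regular expressions, the punctuation mapping, and the function name `preprocess_report` are assumptions made for illustration; the actual scripts are described in the Supplementary Material and are not reproduced here. De-identification and name standardization (steps i and iii) depend on site-specific lists and are omitted.

```python
import re
from typing import List


def preprocess_report(text: str) -> List[str]:
    """Normalize a free-text report and split it into sentences (illustrative only)."""
    # (v) assumed mapping: ';' and ':' become ',', while '!' and '?' become '.'
    text = re.sub(r"[;:]", ",", text)
    text = re.sub(r"[!?]", ".", text)
    # (iv) collapse any run of whitespace (including newlines) into one space
    text = re.sub(r"\s+", " ", text).strip()
    # (vi) lower-case all letters
    text = text.lower()
    # (ii) naive sentence split: break after a period that is followed by a space
    return [s for s in re.split(r"(?<=\.)\s", text) if s]


sentences = preprocess_report("Neoplastic cells are   negative for CD45; CD20.\nNo blasts!")
```

A split this naive would also break after abbreviations such as "Dr.", which is one reason proper names would need to be standardized before sentence splitting.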

Diagnostic practice ontology and vocabulary (background process)

Our method utilizes the DPO, a schema to represent concepts from pathology practice, and a corresponding DPV, which includes instantiations of concepts from the DPO. The DPV consists of various terms from pathology reports detected and identified by the BP. To refer to specific diseases and diagnostic tests, concepts from standard ontologies and controlled vocabularies such as SNOMED (e.g., ICD codes) and the Cluster of Differentiation system (e.g., names of antibody tests like CD30) are included in the DPO. Names of biomedical providers and healthcare system departments, as well as descriptions such as “onion skinning,” “soccer ball-like,” “popcorn cells,” “hyperlobate cells in a 'shotgun distribution’,” used by pathologists in a specific pathology department, are encoded in the DPV. Concepts and terms in the DPO and the DPV are logically organized as follows. First, the DPO consists of concepts that represent general terms in a pathology practice. For instance, persons and organizations are represented as a super class Healthcare_Actor along with corresponding sub-classes such as Pathologist, Cytologist, and Health_Organization. Second, the DPO contains classes corresponding to concepts such as specimens and biomarkers, which are represented by a super class Healthcare_Object and its specific sub-classes such as Specimen and Biomarker. The DPV and DPO are updated manually or with computer scripts as new diagnostic specialists are hired, new biomedical techniques are used, or new terms are found in reports. For example, for a newly hired pathology resident “Dr. Smith, MD,” an instance of the class Pathology_Resident will be added to the DPV. However, if the job status of “Dr. Smith” changes to Attending_Pathologist, the corresponding instance will be updated. The ontology was created through a reverse engineering process during which pathologists, technologists, and staff in a pathology practice were interviewed.
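The relationship between the DPO (classes) and the DPV (terms that instantiate those classes) can be pictured with a small Python sketch. The dictionary layout and the helper `ancestors` are hypothetical, and placing Pathology_Resident and Attending_Pathologist under Pathologist is an assumption; the text names these classes but does not state their parents.

```python
from typing import Dict, List

# DPO: child class -> parent class (class names taken from the text above)
DPO_SUBCLASS_OF: Dict[str, str] = {
    "Pathologist": "Healthcare_Actor",
    "Cytologist": "Healthcare_Actor",
    "Health_Organization": "Healthcare_Actor",
    "Specimen": "Healthcare_Object",
    "Biomarker": "Healthcare_Object",
    # Assumption: both job roles are modeled as kinds of Pathologist
    "Pathology_Resident": "Pathologist",
    "Attending_Pathologist": "Pathologist",
}

# DPV: normalized term -> DPO class it instantiates
dpv: Dict[str, str] = {
    "cd30": "Biomarker",
    "john smith md": "Pathology_Resident",
}


def ancestors(dpo_class: str) -> List[str]:
    """Return the chain of super classes of a DPO class, nearest first."""
    chain = []
    while dpo_class in DPO_SUBCLASS_OF:
        dpo_class = DPO_SUBCLASS_OF[dpo_class]
        chain.append(dpo_class)
    return chain


# A change in job status updates the existing DPV instance in place
dpv["john smith md"] = "Attending_Pathologist"
```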

Named Entity Recognition and open information extraction technologies (background process)

The purpose of modules P3 and P6 in Figure 1 is to extract RTs from free-text diagnostic reports using openIE technologies. First, to achieve this, the Stanford NER classifier is employed [P6 in Figure 1] to label sequences of words in free-text diagnostic reports that are names of things and to assist openIE.[42,43,44] The classifier is trained with an expanded vocabulary of named entities from 135 diagnosis comments on various cancer types and 35 microscopic descriptions on cHL. These documents contain a relatively large number of named entities such as names of medical providers, organizations, as well as names of various DEs. In our pipeline, sequences of words are labeled based on the following predefined categories: (i) person (e.g., [Charles J Chris PSF]person, [Miranda D Crown TRANSCRIBER]person), (ii) organization (e.g., [Department of Pathology]organization, [University of Missouri Healthcare System]organization), (iii) location (e.g., [1 Hospital Dr, Columbia, MO 65201, Suite N224]location), and (iv) immunophenotypic (e.g., [germinal center b-cell phenotype]immunophenotypic, [follicular dendritic cells]immunophenotypic). Then, a set of 136 diagnostic reports is used to train the classifier. Another 34 diagnostic reports are used as a testing set to evaluate the predictive performance of the classifier on the labeled named entities. Second, the Stanford OpenIE software is used to extract RTs. Specifically, this tool extracts all possible RTs discovered in a text without relying on predefined text patterns or a pre-specified relation schema.[29,43,45] The extracted RTs are in subject-predicate-object form as generally described in the literature.[36,45,46]

Structurization process (foreground process)

The structurization process starts with the detection, identification and atomization of compound Stanford OpenIE-generated RTs. For example, an RT extracted from example (a) in Table 1, is compound because the object represents multiple terms describing different properties of a cell population, namely, “large folded multilobulated nucleus,” “inconspicuous nucleolus,” “scant cytoplasm,” “popcorn cell variant,” “lymphocyte-predominant-lp cell variant.” Similarly, in the RT extracted from example (b) in Table 1, the object represents two immunohistochemical (IHC) antibody tests: “cd30” and “cd15.” The general workflow of the structurization process is depicted in Figure 2.

Table 1

Examples of compound relational triples

(a) “…neoplastic cells have a single large folded multilobulated nucleus with an inconspicuous nucleolus and scant cytoplasm consistent with a popcorn or lymphocyte-predominant-lp cell variant. …”

Extracted relational triples
Subject	Predicate	Object
neoplastic cells	have	large folded multilobulated nucleus with inconspicuous nucleolus consistent with lymphocyte-predominant-lp cell variant

(b) “… neoplastic cells are positive for cd30 and cd15 …”

Extracted relational triples
Subject	Predicate	Object
neoplastic cells	are positive for	cd30 cd15

Figure 2

Workflow of the structurization process

To detect compound triples and identify minimal tokens that represent DEs or named entities, the pipeline utilizes a word-alignment method with the Sliding Window technique to match words or phrases in RTs to terms in the DPV [see algorithm for Vocabulary Matching procedure in Figure 2 in S4 of Supplementary Material]. The Sliding Window scans words in the subject and/or object of an RT. Each time the window slides onto a word, the program checks whether the phrase in the window corresponds to a term in the DPV. The procedure continues until the Sliding Window has covered all words in the RT. In this manner, a set of tokens is generated. If both the subject and object are minimal, the RT is considered to be atomic and is added to the KG representation of the report. Otherwise, the RT is marked as compound and passed into the atomization and n-ary modeling procedure [see algorithm in Figure 3 in S4 of Supplementary Material]. Here, ontology patterns are leveraged in order to develop the N-ary Relation Modeling and to link an entity to multiple other entities.[47] The atomization and n-ary modeling procedure takes a set of subject and object tokens generated by the Vocabulary Matching and initiates an N-ary Relational Modeling. For this, an appropriate N-ary Anchor as well as a set of N-ary Predicates are retrieved from the DPO. The N-ary Predicates link each token of the subject and object to the N-ary Anchor. The N-ary Anchor is selected according to the relational predicate of the RT [line 4 in Figure 3 in S4 of Supplementary Material]. Each predicate is encoded as an ontological relation in the DPO. In Supplementary Material, the algorithm provides steps to select N-ary Predicates to link tokens to the N-ary Anchor [lines 7, 8 in Figure 2 in S3 of Supplementary Material]. N-ary Anchors and N-ary Predicates are encoded manually in the BP.
The set of all n-ary modeling representations of all RTs from a pathology report constitutes a KG of the report. We use a Resource Description Framework (RDF) store for storage and retrieval of the integrated KG generated for a set of pathology reports of a specific study. To illustrate the structurization process, consider an example of a free-text pathology report: “Neoplastic cells are negative for CD45, CD20, BCL-6, CD10, CD23, and ALK. MUM-1 and CD79a also highlight plasma cells.” First, we preprocess the free-text report with our regular expression scripts (see S3 in SM) to convert all characters to lower case and words to singular form, and to remove punctuation and stop words. Thereafter, we segment the text into sentences to get the following text: neoplastic cell are negative for cd45 cd20 bcl6 cd10 cd23 alk mum1 cd79a highlight plasma cell. Next, Stanford OpenIE is utilized to extract RTs, which are presented in Table 2.
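The Vocabulary Matching step can be sketched in Python as a sliding window that prefers the longest DPV term starting at each word. This is a minimal illustration, not the algorithm from the Supplementary Material: the function names, the `max_window` limit, and the example vocabulary `DPV_EXAMPLE` are assumptions.

```python
from typing import List, Set, Tuple

# Hypothetical DPV excerpt used only for this illustration
DPV_EXAMPLE: Set[str] = {
    "neoplastic cell", "neoplastic cell proliferation",
    "cd45", "cd20", "bcl-6", "cd10", "cd23", "alk",
}


def match_vocabulary(phrase: str, dpv: Set[str], max_window: int = 6) -> Tuple[List[str], List[str]]:
    """Split a subject or object phrase into DPV tokens; return (tokens, unmatched words)."""
    words = phrase.split()
    tokens: List[str] = []
    unmatched: List[str] = []
    i = 0
    while i < len(words):
        best_end = None
        # Grow the window one word at a time and remember the longest DPV hit
        for end in range(i + 1, min(len(words), i + max_window) + 1):
            if " ".join(words[i:end]) in dpv:
                best_end = end
        if best_end is None:
            unmatched.append(words[i])  # no DPV term starts at this word
            i += 1
        else:
            tokens.append(" ".join(words[i:best_end]))
            i = best_end
    return tokens, unmatched


def is_atomic(subject_tokens: List[str], object_tokens: List[str]) -> bool:
    """An RT is treated as atomic when subject and object each reduce to one token."""
    return len(subject_tokens) == 1 and len(object_tokens) == 1


obj_tokens, _ = match_vocabulary("cd45 cd20 bcl-6 cd10 cd23 alk", DPV_EXAMPLE)
subj_tokens, _ = match_vocabulary("neoplastic cell", DPV_EXAMPLE)
compound = not is_atomic(subj_tokens, obj_tokens)
```

In this sketch, words that match no DPV term are collected separately rather than dropped, so that a caller could route them to vocabulary curation.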

Table 2

Relational triples extracted from the example in the case illustration

RT	Subject	Predicate	Object
1	neoplastic cell	are negative for	cd45 cd20 bcl-6 cd10 cd23 alk
2	mum-1 cd79a	highlight	plasma cells

RT: Relational triple

In the third step, the subject and object of each triple are tokenized using the Vocabulary Matching procedure [Figure 2 in Supplementary Material], with each token having a matching term in the DPV. For instance, the following two sets of tokens are generated for the subject and object of the first RT: SUBJECT: neoplastic cell; OBJECT: cd45, cd20, bcl-6, cd10, cd23, alk. We emphasize that the DPV is produced in the BP. It contains terms that strictly represent named entities, such as CD20 or germinal center B-cell phenotype. However, in some cases, there could be two DEs where one is an extension of the other. For instance, there could be two terms, neoplastic cell and neoplastic cell proliferation. The Vocabulary Matching procedure is biased towards finding the most complete term in the DPV to match adjacent words in an RT. To this end, the Sliding Window does not stop when it finds a matching term for neoplastic cell in the DPV but continues to process the next word. If it finds the term neoplastic cell proliferation in the DPV, it generates a token for this phrase. Otherwise, a token for neoplastic cell is generated. In the next step, since both RTs are compound, with the first having a compound object and the second a compound subject, they are passed to the atomization and n-ary modeling procedure [Figure 3 in S5 of Supplementary Material]. The first step in this procedure is to determine an N-ary Anchor for the atomization of the RT. For the relational predicates “are negative for” and “highlight”, the N-ary Anchor “IHC_Study” is retrieved from the DPO, where it was encoded before this step in the BP. During the second step, the N-ary Predicates “is_done_for” and “is_negative_for” are retrieved for the subject and the object of the first RT, and “is_positive_for” and “is_done_for” for the subject and the object of the second RT, respectively.
The resulting n-ary modeling representation of the original pathology report constitutes a KG [Figure 3].

Figure 3

N-ary modeling representation of the free-text pathology report in the case illustration
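The atomization and n-ary modeling of the two RTs in the case illustration can be sketched as follows. The anchor class and N-ary Predicates are those named in the text; the lookup table `NARY_PATTERNS`, the function `to_nary`, the anchor instance naming scheme, and the use of an `rdf:type` triple are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Relational predicate -> (N-ary Anchor class, predicate for subject tokens, predicate for object tokens)
NARY_PATTERNS: Dict[str, Tuple[str, str, str]] = {
    "are negative for": ("IHC_Study", "is_done_for", "is_negative_for"),
    "highlight": ("IHC_Study", "is_positive_for", "is_done_for"),
}


def to_nary(rt_id: int, subject_tokens: List[str], predicate: str, object_tokens: List[str]) -> List[Triple]:
    """Rewrite one (possibly compound) RT as atomic triples around an anchor node."""
    anchor_class, subject_pred, object_pred = NARY_PATTERNS[predicate]
    anchor = f"{anchor_class}_{rt_id}"  # hypothetical naming of the anchor instance
    triples: List[Triple] = [(anchor, "rdf:type", anchor_class)]
    triples += [(anchor, subject_pred, token) for token in subject_tokens]
    triples += [(anchor, object_pred, token) for token in object_tokens]
    return triples


kg = (
    to_nary(1, ["neoplastic cell"], "are negative for",
            ["cd45", "cd20", "bcl-6", "cd10", "cd23", "alk"])
    + to_nary(2, ["mum-1", "cd79a"], "highlight", ["plasma cells"])
)
```

Each resulting triple carries exactly one DE, so the list `kg` could be loaded into an RDF store without any compound subject or object.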

Evaluation metrics

The effectiveness of the structurization pipeline is evaluated by the performance measures of precision and recall. Specifically, these measures correspond to the pipeline's capacity to generate all correct output (recall/sensitivity) “and only the correct output (precision/positive predictive value)”.[43] These measures are adapted from the information retrieval and openIE fields.[48,49,50] In addition, comparisons with performances of other openIE applications extend this evaluation process.

RESULTS AND DISCUSSION

Evaluation studies

We have conducted several studies to evaluate our structurization pipeline with a sample of 35 microscopic description pathology reports that describe cases of Hodgkin's Lymphoma from 2014 to 2017. These reports included bone marrow and lymph node biopsies and a variety of laboratory tests (e.g., immunohistochemistry, flow cytometry, and molecular genetics). The reports were written in a narrative style and describe specimens from adult patients of different age, sex, and race. For additional information, see S2 in SM. The next sections provide detailed descriptions and results of these studies along with a discussion of the properties and limitations of the pipeline.

Comparison with a random process

First, we compared RTs generated by our pipeline with a set of randomly generated RTs to determine whether the pipeline follows a behavior that is distinct from a uniform random behavior. To conduct this evaluation, we randomly selected a data sample of 115 typical RTs produced by our algorithm from a set of 592 RTs. Then, we uniformly randomly sampled terms from the DPV and predicates from the DPO to construct a set of 136 random RTs. The random RTs were added to the 115 RTs from the data sample to construct a combined set of 251 RTs. Thereafter, a discrete probability mass function (PMF) was derived by dividing the number of instances of each RT by the size of the combined set of RTs. This PMF reflected the data generation logic of the structurization algorithm. The reference PMF was generated as a uniform PMF for the combined set of 251 RTs. We used the Kullback–Leibler Divergence (KLD) to measure the difference between the two distributions.[40] The KLD value was 1.236. KLD values approximating zero denote significant similarity between two distributions.[51,52] Since the resulting KLD value was not close to zero, we concluded that the pipeline's output differed from the random process. The two distributions were also found to be statistically different with a two-sample Kolmogorov–Smirnov test D-statistic of 0.916 and P < 0.0001. D-statistic values below 0.1 identify matching distributions.[53] The results indicated that the RTs generated by the structurization process are statistically different from RTs generated by a uniform random process.[54,55,56]
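The KLD comparison against a uniform reference can be sketched in a few lines of Python. The toy data, the function names, and the choice of the natural logarithm are assumptions; the text does not state the logarithm base, and the real computation was run over the combined set of 251 RTs.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def empirical_pmf(items: Sequence[Hashable], support: Sequence[Hashable]) -> List[float]:
    """Relative frequency of each element of `support` among `items`."""
    counts = Counter(items)
    return [counts[s] / len(items) for s in support]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy example: four possible RTs, only two of which the pipeline ever produces
support = ["rt_a", "rt_b", "rt_c", "rt_d"]
observed = ["rt_a", "rt_a", "rt_a", "rt_b"]

p = empirical_pmf(observed, support)          # pipeline PMF
q = [1.0 / len(support)] * len(support)       # uniform reference PMF
kld = kl_divergence(p, q)
```

A value of zero is obtained only when the two PMFs coincide, which is why a KLD well above zero is read as a departure from the uniform process.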

Semantic similarity assessment

In the second study, we computed the semantic similarity between the original 35 free-text pathology reports and their corresponding KGs generated by the structurization pipeline. To achieve this, we utilized the Align Disambiguate and Walk (ADW) tool that computes the similarity between two lexical items such as words, sentences, and documents.[57,58] ADW is a knowledge-based system that leverages the Topic-Sensitive PageRank algorithm over a graph of word senses generated by the WordNet ontology.[59,60] We calculated the semantic similarity between the two datasets with the ADW adaptations of two measures: Cosine, a measure of the angle between two documents represented as vectors with numerical values, and Weighted Overlap (WO), a similarity measure between “diverse and dissimilar inputs in order to create an integrated analysis”. We report the mean values of these computations. Specifically, the mean Cosine similarity is 0.77 with a standard deviation of 0.063. Here, as values approach 1, the angle becomes smaller and the similarity greater. The mean WO is 0.83 with a standard deviation of 0.051. Here, as values approach 1, the overlap between the inputs becomes greater and so does the similarity. The Cosine method measures the semantic similarity between two documents at the word level. On the other hand, the WO method calculates the similarity between two documents by considering larger text elements such as sentences, paragraphs, or entire documents. Since we analyzed biomedical information that depends on the context of an entire pathology report, we accepted the WO score as a measure of semantic similarity. The mean WO score of 0.83 indicated that 17% of information was not found to be semantically similar. Further analysis of the information that was not included in the resulting KGs revealed that it consisted mostly of diagnostically irrelevant information such as case numbers and dates.
After removal of such information from the original reports, we recalculated the WO similarity score. The mean WO score increased by an average of 6%. Therefore, we deemed that the resulting KGs are semantically similar to the original free-text pathology reports in terms of diagnostic information. Note that although semantic similarity increased, 11% of information was still not found to be semantically similar. This is attributed to invalid, uninformative, and incoherent RTs generated by the underlying openIE application. Consequently, our framework does not analyze that information.
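For readers unfamiliar with the Cosine measure, the following Python sketch computes it over plain word-count vectors. This is a generic illustration and not the ADW implementation: ADW compares semantic signatures derived from WordNet, whereas the bag-of-words representation and the function name here are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    # Dot product over the words that occur in both texts
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


same = cosine_similarity("cd20 positive", "cd20 positive")
half = cosine_similarity("cd20 positive", "cd20 negative")
none = cosine_similarity("cd20 positive", "plasma cells")
```

Identical texts score close to 1, texts sharing half of their words score close to 0.5, and texts with no words in common score 0.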

Performance measures

Comparison of current open information extraction applications

Our preliminary experiments with existing openIE methods showed that they cannot be directly used to extract diagnostic information in the pathology domain. Table 3 shows comparative results of extraction of diagnostic information from free-text microscopic description sections of pathology reports. To compute precision and recall, we manually counted the number of distinct diagnostic informational points (DIPs) (e.g., “CD20 is positive”) in the free-text reports [Figure 4] and the number of distinct DIPs expressed by the subject-predicate-object triples in the output of the openIE pipelines. As we expected, the openIE pipelines generated large numbers of RTs, with Stanford OpenIE and ClausIE being ahead of the other methods. However, the number of coherent RTs in the extractions was significantly lower than the overall number of extracted RTs, which resulted in low recall values, even in the best cases. This was due to the redundancy resulting from the presence of uninformative and incoherent RTs, which was in turn caused by the pipelines’ inability to accurately extract complex DEs (low precision). Even when DEs were correctly extracted, the resulting RTs suffered from compoundness, i.e., the presence of several DEs in the subject and/or the object of an RT, which is also referred to as a lack of minimality in the openIE literature.[30] Yet another drawback of existing openIE methods is unsatisfactory retention of the context in extracted RTs. Only some of the above-mentioned methods offer extraction of the contextual information, and even in such cases the context often lacks adequate knowledge representation for effective and efficient implementation of downstream data-mining tasks (e.g., KGs). For an example of an extraction by Stanford OpenIE that illustrates the discussed shortcomings, see S1 in SM.

Table 3

Comparative results of extraction of diagnostic information from free-text microscopic description section of pathology reports by different open information extraction methods

	Stanford OpenIE	OLLIE	ClausIE	CSD	ReVerb
Precision	9.80%	18.18%	23.81%	14.29%	16.67%
Recall	18.52%	7.41%	18.52%	11.11%	3.70%

Figure 4

Example of a pathology report demonstrating (i) complex diagnostic entities, (ii) complex relations among these diagnostic entities, and (iii) context in which these complex relations take place

Precision, recall, and inter-rater reliability

We assessed the effectiveness of the structurization pipeline with statistical measures of performance from the information retrieval[61] and openIE fields.[37,43,50] To do that, we recruited six domain experts (pathologists and bioinformaticians) from the University of Missouri Department of Pathology to evaluate RTs generated by the structurization procedure. All experts were familiar with the concept of RTs and were asked to provide evaluations on a 3-point Likert-like scale:[62] Score 1 - the triple does not state a fact from the pathology report, Score 2 - the triple “somewhat” states a fact from the pathology report, and Score 3 - the triple accurately states a fact from the pathology report. Specifically, we measured precision and recall based on the number of informational points corresponding to a set of 3,836 RTs generated by the structurization pipeline and the number of informational points determined by a panel of diagnostic experts through analysis of the original pathology reports. Here, an informational point in a text is a single fact or a diagnostic finding, such as the presence or absence of a specific cell type or the result of a test, reflected in a pathology report. Therefore, we define an RT as relevant to a diagnostic report if it corresponds to an informational point in the report. As such, precision is the ratio of relevant RTs that are returned by the pipeline over all RTs returned by the pipeline [Equation 1].
Recall is the ratio of relevant RTs that are returned by the pipeline over all informational points in the pathology report [Equation 2]. Here, a relevant RT is an RT that has received the top score 3 from all six domain experts. According to our experts, this type of RT correctly corresponded to an informational point in the corresponding pathology report. RTs evaluated with scores 2 or 1 were considered ambiguous and incomprehensive, respectively. The total number of RTs refers to all the RTs evaluated with scores 3, 2, and 1. For recall, we divided the number of relevant RTs by a composite denominator, which equals the sum of relevant RTs plus the number of informational points in the reports that were not extracted by the pipeline as RTs and were marked as “missed.”

\[ \text{Precision} = \frac{\text{number of relevant RTs returned by the pipeline}}{\text{number of all RTs returned by the pipeline}} \quad \text{(Equation 1)} \]

\[ \text{Recall} = \frac{\text{number of relevant RTs returned by the pipeline}}{\text{number of relevant RTs returned by the pipeline} + \text{number of missed informational points}} \quad \text{(Equation 2)} \]

We achieved precision of 0.925, which means that 92.5% of the RTs returned by the pipeline correspond to informational points in the pathology reports. Recall was 0.841, which means that 84.1% of the informational points in the pathology reports have been extracted by the structurization pipeline and expressed as RTs in the corresponding KGs. We used Fisher's exact test to test the hypothesis that the proportion of RTs with score 3 was statistically different from the proportion of RTs with scores 1 and 2, for which the following contingency table was constructed [Table 4].

Table 4

Contingency table for the Fisher’s exact test

RTs	RTs with score 3	RTs with score <3
Returned	TP: 3,551	FP: 69
Missed	FN: 485	TN: 186

TP: True positives, FN: False negatives, TN: True negatives, FP: False positives, RTs: Relational triples
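The counts in Table 4 can be checked with a short script. The sketch below is an illustrative standard-library implementation of a two-sided Fisher's exact test (summing the hypergeometric probabilities of all tables that are no more likely than the observed one) together with the sample odds ratio; it is not the authors' code, and the function names are hypothetical.

```python
from math import exp, lgamma


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def sample_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a*d)/(b*c) of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def log_pmf(x: int) -> float:
        # Hypergeometric log-probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    log_observed = log_pmf(a)
    # Sum over every table that is at most as probable as the observed one
    # (a small tolerance absorbs floating-point noise).
    p_value = sum(
        exp(lp)
        for lp in (log_pmf(x) for x in range(lowest, highest + 1))
        if lp <= log_observed + 1e-7
    )
    return min(p_value, 1.0)


# Counts from Table 4: TP, FP, FN, TN.
TABLE_4 = (3551, 69, 485, 186)

if __name__ == "__main__":
    print(f"odds ratio = {sample_odds_ratio(*TABLE_4):.1f}")
    print(f"P < 0.0001: {fisher_exact_two_sided(*TABLE_4) < 1e-4}")
```

Where SciPy is available, `scipy.stats.fisher_exact` provides the same test without hand-written arithmetic.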

Fisher's exact test yielded an odds ratio of 19.7 and P < 0.0001. Therefore, we rejected the null hypothesis and concluded that the pipeline generates RTs that have a high probability of being evaluated with the score 3. Our experiment specifically converts pathology reports rich in diagnostic information to machine-readable RTs. Since the RTs express stand-alone facts from the data, they are independent of each other. In such a case, we are interested in the exact probability of whether RTs are associated or not, regardless of the sample size.[63,64,65,66]

To assess the agreement among experts in this study, we computed the inter-rater reliability (IRR) score according to a two-way random effects model based on a fully crossed design, as described previously.[55] To do that, we computed the intraclass correlation coefficient (ICC) statistic, which reflects the level of correlation and magnitude of agreement between domain experts,[67,68,69] based on recommendations by McGraw and Wong.[70] Accordingly, we selected a single score intraclass correlation, absolute agreement, two-way random effects model with six raters across 3,836 triples, with average-measure ICC as our IRR measure. The ICC score was 0.818, with a P value of 0.99. We, therefore, did not reject the null hypothesis and concluded that the differences in the assessment were statistically insignificant. According to Cicchetti's study, we consider ICC values between 0.80 and 0.90 as good and anything above as excellent.[67,71,72] Additionally, the calculated percentage of agreement was 93.6%. This statistic expresses the percentage of evaluations in which the domain experts are in absolute agreement.[69,71,73] ICC and percent agreement were computed in R using the “irr” package.[74]

These results indicate a high level of agreement among all experts in the study. Therefore, we accepted the computed values of precision and recall as reliable measures of the structurization pipeline's performance. High values of precision mean that the majority of RTs returned by the structurization process accurately reflect informational points of the original pathology report. High values of recall indicate that the majority of the informational points in the report have been carried over to the corresponding KGs in the form of RTs. These results demonstrate that the pipeline effectively extracts and structurizes complex diagnostic information in free-text pathology reports.
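The agreement statistics described above were computed by the authors in R with the “irr” package. As a rough Python stand-in, the sketch below computes the two-way random effects, absolute agreement ICC in its single-measure and average-measure forms, following the standard mean-square formulas in the McGraw and Wong convention, plus the percentage of items on which all raters give the same score. The function names and the small ratings matrix are illustrative assumptions, not the study's data.

```python
def icc_two_way_random(ratings):
    """Return (single-measure, average-measure) absolute-agreement ICC.

    `ratings` is a list of rows, one row per rated item and one column per
    rater (fully crossed design: every rater scores every item).
    """
    n, k = len(ratings), len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with >= 2 items and raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-item mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


def percent_agreement(ratings):
    """Percentage of items on which every rater gave the same score."""
    agreed = sum(1 for row in ratings if len(set(row)) == 1)
    return 100.0 * agreed / len(ratings)


# Illustrative toy matrix (6 items x 4 raters); NOT data from the study.
TOY_RATINGS = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

if __name__ == "__main__":
    single, average = icc_two_way_random(TOY_RATINGS)
    print(f"single-measure ICC  = {single:.2f}")
    print(f"average-measure ICC = {average:.2f}")
    print(f"percent agreement   = {percent_agreement(TOY_RATINGS):.1f}%")
```

The average-measure value is never lower than the single-measure value when agreement is positive, so it matters which of the two is reported when an ICC is compared against thresholds such as 0.80.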

Emerging properties, limitations, and future work

The transformation of diagnostic information from free-text pathology reports into KGs allows linkage of multiple informational points. Because of that, the retention of contextual information occurs naturally as a part of the reification procedure (“statement about statement”). Moreover, comprehensive ontological modeling of the DEs allows for complex and inexact semantic queries. For instance, a query can be constructed to retrieve reports where an IHC study was performed to detect the presence of B-cells. Since multiple IHC antibodies can be used for this purpose, and their functions are encoded in the DPO, there is no need to run a separate query for each antibody. The system is “smart enough” to recognize the semantics of the query.

As discussed in the previous sections, the performance of our structurization pipeline was considered sufficient for the task of extracting complex DEs and their relations and of structurizing free-text pathology reports for downstream data-mining applications. However, some applications might require higher recall values. For instance, in some studies related to Quality Assurance and Quality Control, it could be critical to be able to extract all DEs. Missing one or more DEs that represent an important diagnostic clue can lead to inconclusive or erroneous results. Since this property depends on the recall values of the underlying information extraction technology, we are planning to explore other openIE and non-openIE methods of information extraction.

We have to note here that, from a purely technical perspective, our structurization pipeline can be viewed as an ad hoc solution, since an extensive vocabulary of terms (DPV) needs to be developed for each pathology practice. However, as we emphasized in the introduction, we believe that this is the only way to handle the variability with which different diagnostic professionals express complex DEs in narrative text. Hence, our hypothesis was that only by using an extensive DPV, tuned to a specific pathology practice, together with AI-based frameworks that enable computers to act as intelligently as humans, can we develop a method to reliably extract complex entities in free-text pathology reports. Furthermore, AI-based representation of reports enables description logic inference, which can help identify and study implicit relationships among various diagnostic factors. This is the primary goal of our future work.
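The B-cell query described above can be illustrated with a toy example. In the sketch below, a small dictionary stands in for the antibody functions encoded in the DPO, and each report is reduced to a handful of relational triples. All identifiers, predicate names, and report contents are hypothetical; the actual pipeline relies on an ontology and description logic rather than Python dictionaries.

```python
# Hypothetical stand-in for ontology knowledge: the cell lineage that each
# IHC antibody is used to detect.
ANTIBODY_DETECTS = {
    "CD20": "B-cell",
    "CD79a": "B-cell",
    "PAX5": "B-cell",
    "CD3": "T-cell",
}

# Hypothetical relational triples (subject, predicate, object) per report.
REPORT_TRIPLES = {
    "report-1": [
        ("IHC study", "uses antibody", "CD20"),
        ("CD20", "has result", "positive"),
    ],
    "report-2": [("IHC study", "uses antibody", "CD3")],
    "report-3": [("IHC study", "uses antibody", "PAX5")],
    "report-4": [("flow cytometry", "has result", "negative")],
}


def reports_testing_for(cell_type, ontology=None, reports=None):
    """Return IDs of reports whose IHC study used any antibody that the
    ontology links to `cell_type`."""
    ontology = ANTIBODY_DETECTS if ontology is None else ontology
    reports = REPORT_TRIPLES if reports is None else reports
    # Expand the semantic query into the concrete antibodies it covers.
    targets = {ab for ab, detects in ontology.items() if detects == cell_type}
    hits = []
    for report_id, triples in reports.items():
        if any(
            subj == "IHC study" and pred == "uses antibody" and obj in targets
            for subj, pred, obj in triples
        ):
            hits.append(report_id)
    return sorted(hits)


if __name__ == "__main__":
    print(reports_testing_for("B-cell"))
```

A single call for "B-cell" retrieves the reports that used CD20 or PAX5 without a separate query per antibody, which is the behavior the text attributes to the ontological modeling.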

CONCLUSION

In this article, we have introduced a novel informatics pipeline to transform free-text diagnostic reports into a structured format. Our work extends openIE techniques with AI-based semantic modeling to extract complex DEs and relationships among them. Evaluation studies have demonstrated that the structurization pipeline possesses important properties such as accuracy, minimality, and accurate knowledge representation. Therefore, we conclude that our pipeline can be used in various downstream data-mining applications in diagnostic medicine such as quality assurance, patient cohort identification, and cancer surveillance.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.
