| Literature DB >> 31218554 |
Jackson M Steinkamp1,2, Charles Chambers3, Darco Lalevic3, Hanna M Zafar3, Tessa S Cook3.
Abstract
Unstructured and semi-structured radiology reports represent an underutilized trove of information for machine learning (ML)-based clinical informatics applications, including abnormality tracking systems, research cohort identification, point-of-care summarization, semi-automated report writing, and as a source of weak data labels for training image processing systems. Clinical ML systems must be interpretable to ensure user trust. To create interpretable models applicable to all of these tasks, we can build general-purpose systems which extract all relevant human-level assertions or "facts" documented in reports; identifying these facts is an information extraction (IE) task. Previous IE work in radiology has focused on a limited set of information, and extracts isolated entities (i.e., single words such as "lesion" or "cyst") rather than complete facts, which require the linking of multiple entities and modifiers. Here, we develop a prototype system to extract all useful information in abdominopelvic radiology reports (findings, recommendations, clinical history, procedures, imaging indications and limitations, etc.), in the form of complete, contextualized facts. We construct an information schema to capture the bulk of information in reports, develop real-time ML models to extract this information, and demonstrate the feasibility and performance of the system.
Keywords: Machine learning; Natural language processing; Radiology reports; Structured reporting
Year: 2019 PMID: 31218554 PMCID: PMC6646440 DOI: 10.1007/s10278-019-00234-y
Source DB: PubMed Journal: J Digit Imaging ISSN: 0897-1889 Impact factor: 4.056
Some of the most common information extraction subtasks
| Task formulation | Description | Example | Notes and limitations |
|---|---|---|---|
| Named entity recognition | System provides labels for spans of text within the document, using a pre-specified set of labels. Commonly used for nouns (e.g., radiologic findings and anatomic organs). | Input: “Patient has cyst in the kidney.” Output: Cyst → radiologic finding; Kidney → anatomic location | -Does not inherently handle “implied” entities which are not directly mentioned in the text. -By itself, does not link entities to each other, limiting the scope of answerable questions to “which entities appear in this document.” -Traditionally, output is limited to pre-specified entity types. |
| Relation extraction | In addition to identifying entities, the system identifies relations or predicates between entities. For instance, a “Finding Is Located In” relation may connect a radiologic finding entity with an anatomic location entity. | Input: “Patient has cyst in kidney.” Valid outputs: “Finding Is Located In (cyst, kidney)” / cyst; is in; kidney. | -Can be either “open” (able to identify arbitrary relations between entities) or “closed” (limited to a set of pre-specified relations). -Systems may be limited to binary relations (between two entities) or may handle relations with arbitrary numbers of entities. |
| Natural language question answering | System provides an arbitrary natural language answer to an arbitrary natural language question about a document. System may generate new natural language and/or “point” to spans of text within the document. | Input: “Does the patient have any kidney findings?” Valid output: “A cyst.” Input: “In what organ is the cyst?” Valid output: “In the kidney.” | -Natural language questions can be seen as a generalization of other natural language tasks (e.g., named entity recognition and relation extraction can both be framed as question-answering tasks). -Natural language answers allow for maximum flexibility compared with a set of pre-specified labels and can ideally generalize to questions outside of the training set. -Requires a very large amount of training data, as the task is inherently more complicated. -Providing a complete set of labels for a given training example is difficult, as there may be many correct ways to answer a question. |
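The three task formulations above can be contrasted by the shape of their outputs for the table's example sentence. The following sketch encodes each output as a plain Python structure; the span offsets, label names, and relation name (`FindingIsLocatedIn`) are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative outputs for the three IE task formulations on one sentence.
sentence = "Patient has cyst in the kidney."

# Named entity recognition: label character spans with pre-specified types.
ner_output = [
    {"span": (12, 16), "text": "cyst", "label": "radiologic_finding"},
    {"span": (24, 30), "text": "kidney", "label": "anatomic_location"},
]

# Relation extraction: link already-identified entities with a relation
# (here a "closed" relation with a hypothetical name).
relation_output = [("FindingIsLocatedIn", "cyst", "kidney")]

# Question answering: free-form natural language question/answer pairs.
qa_output = {"In what organ is the cyst?": "In the kidney."}

# Spans index back into the source text, so labels stay grounded in it.
for ent in ner_output:
    start, end = ent["span"]
    assert sentence[start:end] == ent["text"]
```

Note how each formulation strictly generalizes the previous one: relation extraction presupposes the entities NER finds, and both can be reframed as question answering.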
Two major approaches for handling information extraction tasks
| Methodology | Description | Pros and cons |
|---|---|---|
| Rule-based string matching | System uses a human-provided list of rules with particular text strings (or regular expressions) to identify entities and relations within the text. | -Explainable and interpretable. -Simple to implement; may be sufficient for certain limited tasks. -Many natural language words have multiple meanings or classifications based on surrounding context, so it may be impossible to create a true one-to-one mapping between text and labels. -For more complex tasks such as relation extraction and question answering, it is very difficult or impossible to anticipate all rules necessary for the task. -No ability to generalize beyond provided rules. |
| Machine learning systems | Many available models, e.g., support vector machines and neural networks. System learns from a training data set of input/output pairs and can generalize to unseen examples which are similar. Deep learning models utilizing neural networks are currently state-of-the-art for the majority of complex tasks. | -Many models incorporate information from the entire surrounding sequence of text to produce an answer. -Able to generalize beyond provided training examples. -May be difficult to understand why model produces its output. -Require large amounts of labeled training data. |
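The rule-based approach in the table can be made concrete with a small sketch: a hand-written lexicon of terms mapped to labels, applied via regular expressions. The rule set and label names below are hypothetical examples, and the sketch deliberately exhibits the table's stated limitation that such rules cannot generalize beyond the patterns provided.

```python
import re

# Hypothetical rule set: each regex maps matched text to an entity label.
# Word-boundary anchors and an optional plural "s" are the only
# generalization these rules provide.
RULES = {
    r"\bcysts?\b": "radiologic_finding",
    r"\blesions?\b": "radiologic_finding",
    r"\bkidneys?\b": "anatomic_location",
}

def rule_based_ner(text):
    """Return (matched_text, label) pairs for every rule that fires."""
    hits = []
    for pattern, label in RULES.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.group(0), label))
    return hits

print(rule_based_ner("Patient has cyst in the kidney."))
# A synonym absent from the rules (e.g., "renal mass") would be missed,
# and an ambiguous word would receive the same label in every context.
```

An ML system trained on labeled spans could, in principle, recover both cases from context, at the cost of the training data and interpretability trade-offs the table lists.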
Fig. 1 Example fact with anchor and modifier text spans
Fig. 2 Architecture for (a) first and (b) second neural network
Complete information schema. Examples are color-coded using the colors in the “anchor entity” and “modifier spans” columns. Modifier spans which are in standard color do not appear in the parsed example, while example texts in standard color are not part of any information spans. Note that multiple modifier and/or anchor spans may overlap, but for ease of visualization, no examples with overlapping spans were selected
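A contextualized "fact" as described here ties an anchor entity span to its modifier spans, all pointing back into the report text. The following is a minimal sketch of such a structure; the class and field names are illustrative assumptions, not the paper's actual schema, and real facts may carry many modifier types (location, size, temporality, etc.) with overlapping spans.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """A character-offset span into the report text."""
    start: int
    end: int
    text: str

@dataclass
class Fact:
    """Hypothetical fact: one anchor entity plus linked modifier spans."""
    anchor: Span                                    # e.g., the finding term
    modifiers: list = field(default_factory=list)   # e.g., anatomic location

report = "Patient has cyst in the kidney."
fact = Fact(
    anchor=Span(12, 16, "cyst"),
    modifiers=[Span(24, 30, "kidney")],
)

# Every span remains grounded in the source report, which supports
# interpretability: each extracted fact can be traced to its evidence.
assert report[fact.anchor.start:fact.anchor.end] == fact.anchor.text
assert report[fact.modifiers[0].start:fact.modifiers[0].end] == "kidney"
```

Grouping linked spans into one record, rather than emitting isolated entities, is what distinguishes complete-fact extraction from plain named entity recognition.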