INTRODUCTION: In Panama, a large amount of patient information is currently stored as free text, which cannot be readily processed to manage the knowledge it contains. Multiple resources have been created to represent knowledge, including specialized glossaries and ontologies. Ontologies play an important role in information retrieval and organization and in the semantic web, and recent work also uses them as knowledge bases in natural language processing (NLP) applications. AIM: This research aimed to create a methodology that takes a text written in natural language (NL), extracts the necessary elements using NLP tools, uses them to build a knowledge base represented by a domain ontology, and extracts knowledge to help medical specialists. MATERIAL AND METHODS: We developed a methodology that extracts knowledge from patient clinical records in general medicine and palliative care in order to present relevant knowledge elements to specialists. The methodology was validated with a corpus of approximately 200 patient records. CONCLUSION: We have created a knowledge representation methodology that combines NLP techniques and tools with the automatic instantiation of an ontology; it can serve as a software agent for other applications or be used to visualize a patient's clinical information. The study was validated using the traditional information retrieval metrics of precision, recall and F-measure (95.5%, 88.5% and 91.9%), and the methodology can be used as a software agent or as a basis for developing information extraction software systems in the medical domain.
Keywords:
Natural language processing; information extraction; information retrieval; knowledge base
The convergence of different areas of knowledge has led to the design and implementation of computer systems that support the integration of innovative tools. Medical knowledge, as captured in medical information, is mostly held in textual documents written in NL, that is, in unstructured data sources; NLP techniques are used to manage it (1).
The first step in the computer processing of linguistic knowledge is its formal representation. Multiple resources have been created to represent linguistic information, including specialized glossaries, taxonomies, thesauri and ontologies (2). The formal semantics underlying an ontology allows information to be processed automatically and allows semantic reasoners to infer new knowledge. In this work we consider an ontology to be “a formal and explicit specification of a shared conceptualization” (3).
Ontologies provide a formal and structured representation of knowledge and have the advantage of being reusable and shareable. They also provide a vocabulary for a specific domain and define, with different levels of formality, the meaning of terms and the relationships between them. Knowledge in ontologies is formalized mainly through five types of components: classes, relationships, functions, axioms and instances (4).
Ontologies are also an important part of information retrieval and organization and of the semantic web, and they are becoming increasingly important within NLP (5). Ontologies can be created manually, but this is costly and time-consuming (6). An alternative is ontology learning from textual documents, whose objective is to identify the ontological elements automatically or semi-automatically. This approach attempts to reduce time and resources, using techniques and methods from fields such as artificial intelligence (AI), machine learning (ML), information retrieval (IR) and NLP.
This work proposes a methodology that takes a text written in NL, extracts the necessary elements using NLP tools, automatically instantiates an ontology and extracts knowledge, which stakeholders can then view in a friendly way through a conceptual map.
The rest of the document is structured as follows: section 2 states the aim of the study; section 3 describes the materials and methods; section 4 presents the results and discusses the software architecture; and section 5 presents the conclusions.
AIM
The objective of the present study is to provide a methodology that extracts information from clinical text (7) written in NL, namely a corpus of patients' medical records (8), identifies the named entities and the relevant knowledge elements, and creates an instance of a domain ontology, which then serves as a software agent for other processes.
The methodology consists of four stages: (i) NLP and corpus processing, (ii) extraction of annotations, (iii) population of the ontology and (iv) visualization of clinical information. In addition, the methodology must comply with semantic interoperability restrictions; more specifically, it should be possible to extract all patient information in the format of the HL7 (9) electronic clinical records standard. Figure 1 shows the phases of this architecture.
Figure 1.
Phases of the proposed software architecture implemented within the primary care clinic of the Technological University of Panama.
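The four stages can be sketched in miniature as follows. This is only an illustrative Python outline, not the authors' implementation: the sample record is invented, the regular expressions are a hypothetical stand-in for the JAPE rules used in GATE, and the relation names `hasDiagnosis` and `takesMedication` are assumptions rather than terms from the actual ontology.

```python
import re
from collections import defaultdict

# Invented sample record; the real corpus is in Spanish and is not reproduced here.
RECORD = (
    "Nombre:   Ana Ruiz\n"
    "Diagnostico: hipertension arterial\n"
    "Medicacion: enalapril 10 mg\n"
)

# Hypothetical rules standing in for JAPE grammars: annotation type -> pattern.
RULES = {
    "Patient": re.compile(r"^Nombre:\s*(.+)$", re.MULTILINE),
    "Diagnosis": re.compile(r"^Diagnostico:\s*(.+)$", re.MULTILINE),
    "Medication": re.compile(r"^Medicacion:\s*(.+)$", re.MULTILINE),
}


def preprocess(text):
    """Stage (i): normalise whitespace line by line."""
    return "\n".join(" ".join(line.split()) for line in text.splitlines())


def extract_annotations(text):
    """Stage (ii): apply each rule and collect (type, value) annotations."""
    annotations = []
    for ann_type, pattern in RULES.items():
        for match in pattern.finditer(text):
            annotations.append((ann_type, match.group(1).strip()))
    return annotations


def populate_ontology(annotations):
    """Stage (iii): turn annotations into (subject, relation, object) triples."""
    values = defaultdict(list)
    for ann_type, value in annotations:
        values[ann_type].append(value)
    triples = []
    for patient in values["Patient"]:
        for diagnosis in values["Diagnosis"]:
            triples.append((patient, "hasDiagnosis", diagnosis))
        for medication in values["Medication"]:
            triples.append((patient, "takesMedication", medication))
    return triples


def render_concept_map(triples):
    """Stage (iv): text view of the triples as labelled edges."""
    return "\n".join(f"{s} --{r}--> {o}" for s, r, o in triples)


def run_pipeline(text):
    """Chain stages (i) to (iii) and return the resulting triples."""
    return populate_ontology(extract_annotations(preprocess(text)))


if __name__ == "__main__":
    print(render_concept_map(run_pipeline(RECORD)))
```

Keeping each stage as a separate function mirrors the staged design described above: the rule set or the output view can be swapped without touching the other stages.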
MATERIALS AND METHODS
To validate the methodology, we carried out a descriptive study in two hospitals of the Republic of Panama. In 2017, we obtained a corpus of patients' clinical information from the specialties of palliative care and general medicine. The corpus comprises 200 documents, 100 from the palliative care domain (10) and 100 from general medicine. These documents, written in Spanish, contain general information about the patient, their attendants, clinical diagnoses, medication, history of laboratory services and other problems. This is the information that the specialist assesses and records when a patient is seen in consultation. The clinical information used in the corpus was obtained from the primary care clinic of the Technological University of Panama.
The corpus contains approximately 98,000 words distributed across these 200 medical records. Finally, the methodology was built as a software architecture, which was used and evaluated by the experts. It was evaluated using the precision, recall and F-measure validation metrics (11).
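For reference, the three metrics are conventionally computed from counts of correct, spurious and missed annotations. The Python sketch below uses the standard definitions, with F-measure taken as the balanced harmonic mean of precision and recall; the function names and the counts in the example are hypothetical and are not taken from the study.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F-measure from annotation counts."""
    predicted = true_positives + false_positives  # annotations the system produced
    expected = true_positives + false_negatives   # annotations in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    p, r, f = precision_recall_f(true_positives=90, false_positives=10, false_negatives=30)
    print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

As a consistency check, `f_measure(0.955, 0.885)` evaluates to about 0.919, which agrees with the precision, recall and F-measure values reported in the results.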
RESULTS AND DISCUSSION
We decided to use the GATE tool, which offers ML techniques and methods (12), and we achieve IR (13) by programming a set of JAPE rules that extract and label patient information and create the annotations defined in the rules. GATE is developed in Java, which gives it the advantage of running on different operating systems. It also implements a broad set of knowledge modeling structures and actions that allow ontologies to be created, visualized and manipulated in various representation formats.
The implemented architecture follows functional requirements, and the project was developed according to the standards required by the software industry. The technological foundations presented are those considered necessary to create a software system that performs NLP on text written in NL. We decided to use free software tools available to the research community for the analysis, design and implementation of this application. Good software development practices were followed throughout, and design patterns were used, which facilitates adaptability to change and maintenance of the tool. Figure 1 shows the phases of this architecture.
To validate the architecture we used the domain of patients' clinical information. The corpus comprises 200 documents. To make the validation clearer, each corpus was segmented into sections: general data, diagnosis and medication.
The annotation extraction process was evaluated using the precision, recall and F-measure validation metrics. The process yielded 2,764 annotations for general data, 505 for diagnosis and 590 for medication. Table 1 below shows the total experimental results for the extraction of document annotations.
Table 1.
Total experimental results for the extraction of document annotations.
| Type of information | Quantity |
| --- | --- |
| General data | 2,764 |
| Diagnosis | 505 |
| Medication | 590 |
For the general data, diagnosis and medication sections, we applied the established evaluation criteria and obtained 95.5% precision, 88.5% recall and 91.9% F-measure. According to other studies and to comments from experts and professionals, the metrics implemented in the architecture (precision, recall and F-measure) (14, 15) include specific and general criteria for evaluating this type of information extraction methodology. Comparing these metrics as specified in the methodology is objective and feasible, so the result of the evaluation is accurate.
Most medical information is currently in unstructured form, in textual documents (16, 17). Some computer technologies help with this type of problem, and in recent years different approaches have been applied to extract knowledge from textual documents. Overall, the architecture in this study takes a medical text written in NL, extracts the necessary elements using NLP tools, automatically instantiates an ontology and extracts knowledge.
It should be noted that only free software technologies were used to develop the components of the architecture, so no license fees were needed for the development of the project. At present there are very few ontology learning systems oriented to the patient information domain for the construction of ontologies, so research in this field is increasingly important.
The contributions of this architecture support progress in building automatic NLP tools that can carry out the construction and population of ontologies. The architecture also produces an intermediate file in OWL (18, 19) format that can be opened in an ontology editor such as Protégé to view the results. A file is also generated in XML format that implements the HL7-CDA clinical information and message interoperability standard (20). The interface of this software architecture has been designed so that it can be used by both NLP experts and domain experts.
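To give a sense of what an instantiated ontology entry in an OWL file can look like, the following Python sketch serializes one individual as an RDF/XML fragment using only the standard library. It is a simplified illustration under stated assumptions: the `http://example.org/clinic#` namespace, the `Patient` class and the `hasDiagnosis` property are hypothetical, and the fragment omits the ontology header and the class and property declarations that a complete OWL file would contain. It does not attempt to reproduce the HL7-CDA output.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
EX = "http://example.org/clinic#"  # hypothetical namespace, not from the study


def build_individual(patient_id, diagnosis_id):
    """Return an RDF/XML string describing one patient individual."""
    # Register readable prefixes so the output uses rdf:, owl: and ex:.
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("owl", OWL)
    ET.register_namespace("ex", EX)

    root = ET.Element(f"{{{RDF}}}RDF")
    individual = ET.SubElement(
        root,
        f"{{{OWL}}}NamedIndividual",
        {f"{{{RDF}}}about": EX + patient_id},
    )
    # Assert that the individual belongs to the (assumed) Patient class.
    ET.SubElement(individual, f"{{{RDF}}}type", {f"{{{RDF}}}resource": EX + "Patient"})
    # Link the patient to a diagnosis through an assumed object property.
    ET.SubElement(
        individual,
        f"{{{EX}}}hasDiagnosis",
        {f"{{{RDF}}}resource": EX + diagnosis_id},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_individual("patient_001", "hypertension"))
```

In practice a dedicated ontology library would normally handle serialization; the standard-library version is used here only to keep the sketch self-contained.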
CONCLUSION
From the results obtained, we can conclude that the methodology presented in this study has been developed using a large number of resources and patient medical records from the Republic of Panama. Its objective is to support the collection and extraction of knowledge in the medical domain, in areas such as palliative care and general medicine.
One advantage of our methodology is that it was validated and approved by the experts. Software programmers can therefore use it as a basis to design, implement and produce software in other domains. The proposed methodology relies on free software, is simple and can be used in other domains, such as finance and education.