Lucas Emanuel Silva E Oliveira1, Ana Carolina Peters2, Adalniza Moura Pucca da Silva2, Caroline Pilatti Gebeluca2, Yohan Bonescki Gumiel2, Lilian Mie Mukai Cintho2, Deborah Ribeiro Carvalho2, Sadid Al Hasan3, Claudia Maria Cabral Moro2.
Abstract
BACKGROUND: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.
Keywords: Clinical narratives; Corpora; Gold standard; Natural language processing; Semantic annotation
Year: 2022 PMID: 35527259 PMCID: PMC9080187 DOI: 10.1186/s13326-022-00269-1
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 A broad view of SemClinBr corpus development. The diagram gives an overview of the SemClinBr corpus development, starting with the selection of thousands of clinical notes from multiple hospitals and medical specialties. A multidisciplinary team developed the elements in orange, representing (i) the fine-grained annotation schema following the UMLS semantic types and (ii) the web-based annotation tool featuring the UMLS REST API. These resources supported the generation of the ground truth (i.e., gold standard), which was evaluated intrinsically (i.e., inter-annotator agreement) and extrinsically in two different NLP tasks (i.e., named entity recognition and negation detection)
Samples of different types of clinical narratives from our corpus
| Type/Specialty | Original narrative | Translated narrative |
|---|---|---|
| Discharge summary / Cardiology | PACIENTE DIABÉTICA, HIPERTENSA, CARDIOPATIA ISQ. COM IMPLANTE DE STENT EM DAE EM JUL/03 INTERNOU COM QUADRO DE ANGINA INSTÁVEL. TRANSFERIDA PARA O SERV DE HEMODINÂMICA, REALIZOU CAT SENDO SUBMETIDA A ACTP EM LESÃO DE ÓSTIO DA SEGUNDA DIAGONAL. PROCEDIMENTO REALIZADO COM SUCESSO ANGIOGRÁFICO. RECEBE ALTA ASSINTOMÁTICA. Paciente ex-tabagista, vem à emergência com quadro de dispnéia progressiva, ortopnéia, dispnéia paroxística noturna, edema de membros inferiores, turgência jugular. Diagnóstico de insuficiência cardíaca, com classe funcional IV (NYHA) na chegada. Sem história de dor torácica. ECG da chegada sem alterações. Marcadores de necrose miocárdica normais. Manejado para insuficiência cardíaca com boa resposta clínica. Ecocardiograma demonstrando dilatação de cavidades (AE = 5,3 cm, DDVE = 7,0, DSVE = 5,8), disfunção sistólica (FEVE = 35%) por hipocinesia difusa, septo e parede posterior de 0,9 cm, insuficiência mitral e tricúspide leves e PSAP = 52 mmHg. Realizado investigação etiológica com sorologia negativa para Chagas, cintilografia demonstrando necrose apical, sem condições de discriminar isquemia. Optado então pela realização de cateterismo cardíaco, que revelou artéria circunflexa dominante e livre de lesões significativas; artéria coronária direita livre com sinais de aterosclerose, mas sem lesões significativas; artéria descendente anterior de pequeno calibre, com lesão de cerca de 60% no terço proximal e lesão crítica no terço médio. Após revisão do filme, observou-se tratar de lesão de difícil manejo percutâneo, devido à sua extensão e ao pequeno calibre da artéria descrita. Após discussão do caso, optou-se por manejo clínico devido ao fato do paciente não apresentar angina, ter respondido com sucesso à terapêutica instituída e não apresentar evidência clara de benefício atual com procedimento de revascularização. Impressão de que a lesão em DAE não explicaria a hipocinesia difusa apresentada pelo paciente, devendo ser portanto doença aterosclerótica coexistindo em um coração com miocardiopatia dilatada. Realizado ainda espirometria que evidenciou distúrbio obstrutivo moderado. DCE estimada em 57 ml/min. Paciente recebe alta em bom estado geral, afebril, eupnéico, em otimização do tratamento para ICC (já em uso de betabloqueador, IECA e espironolactona), com plano de ajustes de doses a nível ambulatorial. OBS: peso na alta: 76 Kg. | DIABETIC PATIENT, HYPERTENSE, ISCHEMICAL CARDIOPATHY. WITH STENT IMPLANT IN LAD IN JUL / 03 HOSPITALIZED WITH SYMPTOMS OF UNSTABLE ANGINA. TRANSFERRED TO THE SERVICE OF HEMODYNAMIC, PERFORMED CATHETERISM, SUBMITTED TO PCTA IN THE SECOND DIAGONAL INJURY. PROCEDURE PERFORMED WITH ANGIOGRAPHIC SUCCESS. ASYMPTOMATIC HOSPITAL DISCHARGE. Ex-smoker patient comes to the emergency room with progressive dyspnea, orthopnea, paroxysmal nocturnal dyspnea, lower limb edema, and jugular turgence. Heart failure diagnosis, with functional class IV (NYHA) upon arrival. No history of chest pain. ECG on arrival without change. Normal myocardial necrosis markers. Managed for heart failure with good clinical response. Echocardiogram showing cavity dilatation (LA = 5.3 cm, LVDD = 7.0, LVSD = 5.8), systolic dysfunction (LVEF = 35%) due to diffuse hypokinesia, a 0.9 cm septum and posterior wall, mild mitral and tricuspid regurgitation, and APSP = 52 mmHg. Etiological investigation with Chagas negative serology, scintigraphy showing apical necrosis, unable to discriminate ischemia. Then opted for catheterization which revealed a dominant circumflex artery free of significant lesions; free right coronary artery with signs of atherosclerosis but no significant lesions; small anterior descending artery with a lesion of about 60% in the proximal third and critical injury in the middle third. After review of the film, it was observed that it was a difficult percutaneous management injury owing to its extension and the small caliber of the described artery. After discussion of the case, we opted for clinical management because the patient did not have angina, successfully responded to the therapy instituted, and did not present clear evidence of being benefitted by the revascularization procedure. The impression that the lesion in LAD would not explain the diffuse hypokinesia presented by the patient; therefore, atherosclerotic disease coexisting in a heart with dilated cardiomyopathy. Accomplished yet spirometry that showed moderate obstructive disorder. DCE estimated at 57 ml / min. Patient is discharged in good general condition, afebrile, eupneic condition, optimizing treatment for CHF (already using beta-blocker, ACEI and spironolactone), with outpatient dose adjustment plan. OBS: weight in the high: 76 Kg |
| Ambulatory note / Nephrology | NEFROPATIA DIABETICA EM TTO CONSERVADOR CANDIDATA A TX RENAL PREEMPTIVO LIBERADA PELA URO E ANESTESIO CANDIDATA A TX RENAL PREEMPTIVO ASSINTOMÁTICA, EXCETO PELOS SINAIS E SINTOMAS ASSOCIADOS A NEUROPATIA PERIFERICA (DIABETICA / UREMIA) SEM SINTOMAS URINARIOS AO EXAME PA 150/100 P 108 T 36 DIURESE FRR NORMAL HIPOCORADA + CPP LIVRES PC RITMO REGULAR, TAQUICARDICO ABD RHA+, PLANO, FLÁCIDO, CIC CX CST MMII PULSOS PRESENTES E SIMETRICOS | DIABETIC NEPHROPATHY IN CONSERVATIVE TREATMENT PREEMPTIVE KIDNEY TRANSPLANT CANDIDATE RELEASED BY UROLOGY AND ANESTHESIOLOGY PREEMPTIVE KIDNEY TRANSPLANT CANDIDATE ASYMPTOMATIC, EXCEPT FOR SIGNS AND SYMPTOMS ASSOCIATED WITH PERIPHERAL NEUROPATHY (DIABETIC / UREMIA) NO URINARY SYMPTOMS ON EXAMINATION BP 150/100 HR 108 T 36 DIURESE RR NORMAL PALLOR + FREE LF CS REGULAR RHYTHM, TACHYCARDIC ABDOMEN RHA +, FLAT, FLACCID, CIC CX CST LLLL PRESENT AND SYMMETRICAL PULSES |
| Nursing note / Not defined | Pcte com RNM de crânio agendada para hoje às 23:00 h. Por volta das 21:00 h pcte apresentou quadro de confusão mental, seguida de crise convulsiva generalizada, prontamente atendido na sala de poli, com MCC + oximetria digital de pulso + PNI contínuos. Instalado O2, medicado CPM e mantido em observação no leito. Hidantalizado pela R1 Vital Brasil da neurocirurgia, procedimento realizado sem intercorrências. Pcte bastante sonolento, mantido em sala de poli e suspenso RNM por hora. Diurese espontânea, com controle através de uropen. SSVV às 05:45 h PA = 133/74mmhg, FC = 114 bpm, SpO2 = 93%. Conforme orientação da neurocirurgia, mantém observação na sala de poli sob cuidados intensivos de enfermagem. CHOQUE NAO ESPECIFICADO | Patient Skull MRI scheduled today at 23:00. At around 21:00, the patient presented with mental confusion, followed by generalized seizure, promptly treated in the multiple trauma room, with MCC + digital pulse oximetry + continuous NIBP. Installed O2, medicated as prescribed and kept under observation in bed. Hidantalized by R1 Vital Brasil of neurosurgery, procedure performed without complications. Very sleepy patient kept in emergency room and suspended MRI for hour. Spontaneous diuresis, with uropen control. VVSS at 05:45 h BP = 133 / 74 mmhg, HR = 114 bpm, SpO2 = 93%. As directed by neurosurgery, maintains observation in the emergency room under intensive nursing care. SHOCK NOT SPECIFIED |
The clinical narratives in our corpus span different types (e.g., discharge summaries, ambulatory notes, nursing notes) and medical specialties. The second column shows the original pt-br text and the third column shows the translated version (some acronym translations may not make sense in English)
The medical specialties frequency table
| Specialty | Number |
|---|---|
| Cardiology | 260 |
| Nephrology | 157 |
| Orthopedics | 126 |
| | 122 |
| Surgery (general) | 61 |
| Neurology | 45 |
| Neurosurgery | 32 |
| Dermatology | 23 |
| Ophthalmology | 22 |
| Endocrinology | 19 |
| Gastroenterology | 16 |
| Otolaryngology | 14 |
| Pneumology | 11 |
| Others | 92 |
The medical specialties of the selected clinical narratives are ordered according to their frequency in the corpus. Medical specialties with fewer than ten occurrences were grouped into the “Others” category
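The grouping rule in the caption above (specialties with fewer than ten occurrences merged into one bucket, the rest ordered by frequency) can be sketched as follows. This is a minimal illustration only: the function name `group_rare` and the sample labels are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def group_rare(counts, min_count=10, other_label="Others"):
    """Merge categories with fewer than `min_count` occurrences into one bucket."""
    grouped = Counter()
    for label, n in counts.items():
        key = label if n >= min_count else other_label
        grouped[key] += n
    # Sort by descending frequency, keeping the catch-all bucket last.
    return sorted(
        grouped.items(),
        key=lambda kv: (kv[0] == other_label, -kv[1]),
    )


# Hypothetical per-note specialty labels (illustrative, not corpus data).
notes = (
    ["Cardiology"] * 12
    + ["Nephrology"] * 10
    + ["Urology"] * 3
    + ["Oncology"] * 2
)
table = group_rare(Counter(notes))
```

Placing the catch-all bucket last, regardless of its size, mirrors the layout of the frequency table.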
Database entry data configuration
| Field | Data type |
|---|---|
| Occurrence-id | Number |
| Patient-id | Number |
| Gender | Text |
| Birth date | Date |
| Inclusion date | Date |
| Discharge date | Date |
| Discharge type | Text |
| Discharge reason | Text |
| ICD-10 | Text |
| Medical specialty | Text |
| Care reason | Text |
| Main complaint | Free-Text |
| History of disease | Free-Text |
| Past history | Free-Text |
| Family history | Free-Text |
| Physical examination | Free-Text |
| Main diagnosis hypothesis | Free-Text |
| Initial plan | Free-Text |
| Observations | Free-Text |
Data fields for each EHR entry in our main data source. The fields have different data types: numerical, date, text (one-line small text), and free-text (multi-line and large text)
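As a rough sketch, the entry layout in the table above could be mirrored by a typed record. The class name, the snake_case field names, and the mapping of "Number", "Text", and "Free-Text" to `int` and `str` are assumptions made for illustration; the source does not describe the actual storage schema. The `narrative` helper is likewise hypothetical and simply joins the free-text fields.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class EhrEntry:
    # Structured fields (Number, Text, Date in the table).
    occurrence_id: int
    patient_id: int
    gender: str
    birth_date: date
    inclusion_date: date
    discharge_date: date
    discharge_type: str
    discharge_reason: str
    icd_10: str
    medical_specialty: str
    care_reason: str
    # Free-text fields (multi-line, large text in the table).
    main_complaint: str = ""
    history_of_disease: str = ""
    past_history: str = ""
    family_history: str = ""
    physical_examination: str = ""
    main_diagnosis_hypothesis: str = ""
    initial_plan: str = ""
    observations: str = ""


# Names of the free-text fields, in table order.
FREE_TEXT_FIELDS = tuple(f.name for f in fields(EhrEntry))[11:]


def narrative(entry: EhrEntry) -> str:
    """Join the non-empty free-text fields into one text (hypothetical helper)."""
    parts = (getattr(entry, name) for name in FREE_TEXT_FIELDS)
    return "\n".join(p for p in parts if p)
```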
Text samples containing the most used STYs
| SGR | STY | Original examples | Translated examples |
|---|---|---|---|
| Anatomy | Body Location or Region | MEIA TALA GESSADA EM apresenta edema em | Half-length plaster cast in presents edema in the FLAT AND FLACID |
| Anatomy | Body Part, Organ, or Organ Component | acesso venoso central em ACESSO VENOSO PERIFERICO EM | |
| Chemicals & Drugs | Organic Chemical | Fez uso de cefaléia em regiao parietal bilateral que melhora com | used headache in bilateral parietal region improved with |
| Chemicals & Drugs | Pharmacologic Substance | asmatica em uso de | asthmatic person using |
| Concepts & Ideas | Temporal Concept | Paciente em | WASHING |
| Devices | Drug Delivery Device | cloreto de potassio a 42 ml/h em | potassium chloride at 42 ml/h in |
| Devices | Medical Device | ||
| Disorders | Disease or Syndrome | REFERE | REFERS |
| Disorders | Finding | RETORNOU DO CC | RETURNED |
| Disorders | Injury or Poisoning | ||
| Disorders | Sign or Symptom | relata SINAIS VITAIS ESTAVÉIS, REFERE | reports STABLE VITAL SIGNS, REFERS |
| Living Beings | Patient or Disabled Group | ||
| Living Beings | Professional or Occupational Group | Orientada a segundo a | Advised according to the |
| Organizations | Health Care Related Organization | CONFORME ROTINA DA RETORNOU DO | AS RETURNED FROM |
| Phenomena | Laboratory or Test Result | ||
| Physiology | Clinical Attribute | ||
| Procedures | Diagnostic Procedure | ||
| Procedures | Health Care Activity | EM | |
| Procedures | Therapeutic or Preventive Procedure | IRC EM | CRF IN |
| N/A | Abbreviation | CONFORME ROTINA DA MEIA TALA GESSADA EM | AS Half-length plaster cast in |
| N/A | Negation | Paciente eupnéico e Paciente | Eupneic and Patient |
Text samples containing the most used semantic types and their corresponding semantic groups. The third column shows the original examples and the fourth column shows the translated versions. The underlined passages indicate the annotated concepts
Fig. 2 Revision and quality verification process of the annotation guidelines. The iterative process started with the first guideline draft; then, a small number of documents were double-annotated, and their inter-annotator agreement was calculated. If the agreement remained stable, then the guideline was considered good enough to proceed with the gold standard production. Otherwise, the annotation differences were discussed; the guidelines were updated; and the process was reinitiated
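The iterative loop in this caption can be sketched as a small driver function. All names here (`refine_guidelines`, `annotate_pair`, `agreement`, `revise`) are hypothetical callables, and the stability criterion (agreement changing by at most `tolerance` between consecutive rounds) is an assumption; the caption does not say how stability was judged.

```python
def refine_guidelines(guideline, annotate_pair, agreement, revise,
                      tolerance=0.02, max_rounds=10):
    """Repeat double annotation until agreement stops changing (assumed criterion)."""
    previous = None
    history = []
    for _ in range(max_rounds):
        # Two annotators label the same small document sample.
        ann_a, ann_b = annotate_pair(guideline)
        score = agreement(ann_a, ann_b)
        history.append(score)
        # Stable agreement: the guideline is considered good enough.
        if previous is not None and abs(score - previous) <= tolerance:
            return guideline, history
        # Otherwise discuss the differences and update the guideline.
        guideline = revise(guideline, ann_a, ann_b)
        previous = score
    return guideline, history


# Stub run: the guideline is a version counter and the scores are made up.
_scores = iter([0.50, 0.65, 0.70, 0.71])
final_version, score_history = refine_guidelines(
    guideline=0,
    annotate_pair=lambda g: (g, g),
    agreement=lambda a, b: next(_scores),
    revise=lambda g, a, b: g + 1,
)
```

In the stub run the loop stops in the fourth round, once the score moves by only 0.01.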
Fig. 3 Annotation process overview. The annotation process was divided into ground-truth phases 1 and 2, which are located above and below the dashed line, respectively. The elements in green represent the annotators and those in orange represent the adjudicators
Corpus size considering gold and silver divisions
| Segment | Documents | Entities | Relations |
|---|---|---|---|
| Gold | 613 | 41,588 | 7344 |
| Silver | 387 | 23,541 | 3919 |
Number of documents, entities, and relations for each corpus division (i.e., gold and silver)
Number of annotations per STY
| SGR | STY | Entities |
|---|---|---|
| Anatomy | Body Location or Region | 1452 |
| Anatomy | Body Part, Organ, or Organ Component | 1373 |
| Chemicals & Drugs | Organic Chemical | 2000 |
| Chemicals & Drugs | Pharmacologic Substance | 3013 |
| Concepts & Ideas | Quantitative Concept | 3953 |
| Concepts & Ideas | Qualitative Concept | 500 |
| Concepts & Ideas | Temporal Concept | 1663 |
| Devices | Medical Device | 1617 |
| Disorders | Disease or Syndrome | 2650 |
| Disorders | Finding | 6867 |
| Disorders | Injury or Poisoning | 521 |
| Disorders | Sign or Symptom | 4707 |
| Living Beings | Patient or Disabled Group | 844 |
| Living Beings | Professional or Occupational Group | 720 |
| Organizations | Health Care Related Organization | 639 |
| Phenomena | Laboratory or Test Result | 3079 |
| Physiology | Clinical Attribute | 1128 |
| Procedures | Diagnostic Procedure | 2012 |
| Procedures | Health Care Activity | 2763 |
| Procedures | Therapeutic or Preventive Procedure | 4791 |
| N/A | Abbreviation | 12,629 |
| N/A | Negation | 2676 |
The number of entities annotated per semantic type and the corresponding semantic groups for the entire corpus, considering the most frequent ones
Number of annotations per RTY
| RTY | Relations |
|---|---|
| associated_with | 9693 |
| negation_of | 1570 |
The number of relations per RTY for the entire corpus
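The totals reported in the tables above can be cross-checked with a few sums. The numbers are copied from the tables; the variable names are illustrative.

```python
# Values copied from the corpus-size, specialty, and RTY tables.
relations_by_division = {"gold": 7344, "silver": 3919}
relations_by_type = {"associated_with": 9693, "negation_of": 1570}
documents_by_division = {"gold": 613, "silver": 387}
documents_by_specialty = [260, 157, 126, 122, 61, 45, 32, 23, 22, 19, 16, 14, 11, 92]

total_relations = sum(relations_by_division.values())
total_documents = sum(documents_by_division.values())

# Both views of the corpus should add up to the same totals.
relations_consistent = total_relations == sum(relations_by_type.values())
documents_consistent = total_documents == sum(documents_by_specialty)
```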
Average IAA values for the entire corpus
| IAA type | IAA |
|---|---|
| Strict (full span + STY match) | 0.708 |
| Lenient (partial span + STY match) | 0.834 |
| Flexible (full span + SGR match) | 0.774 |
| Relaxed (partial span + SGR match) | 0.921 |
Average IAA values considering the four different IAA types for the entire corpus
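The four matching criteria can be sketched as a predicate over pairs of annotations. This is only an illustration under stated assumptions: "partial span" is taken to mean any character overlap, annotations are modeled as `(start, end, sty)` tuples with an exclusive end offset, and the `agreement` function uses a simple symmetric matched-fraction score, which is not necessarily the measure used by the authors. The STY-to-SGR pairs are taken from the tables in this document.

```python
from typing import NamedTuple


class Ann(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    sty: str    # UMLS semantic type


# Subset of the STY -> SGR mapping listed in the tables.
STY_TO_SGR = {
    "Disease or Syndrome": "Disorders",
    "Sign or Symptom": "Disorders",
    "Finding": "Disorders",
    "Pharmacologic Substance": "Chemicals & Drugs",
}

# mode -> (require full span match?, label level compared)
MODES = {
    "strict": (True, "sty"),
    "lenient": (False, "sty"),
    "flexible": (True, "sgr"),
    "relaxed": (False, "sgr"),
}


def matches(a: Ann, b: Ann, mode: str) -> bool:
    full_span, level = MODES[mode]
    if full_span:
        span_ok = (a.start, a.end) == (b.start, b.end)
    else:
        span_ok = a.start < b.end and b.start < a.end  # any overlap (assumption)
    if level == "sty":
        label_ok = a.sty == b.sty
    else:
        label_ok = STY_TO_SGR[a.sty] == STY_TO_SGR[b.sty]
    return span_ok and label_ok


def agreement(ann_a, ann_b, mode):
    """Fraction of annotations with a counterpart on the other side (assumed measure)."""
    if not ann_a and not ann_b:
        return 1.0
    hit_a = sum(any(matches(a, b, mode) for b in ann_b) for a in ann_a)
    hit_b = sum(any(matches(a, b, mode) for a in ann_a) for b in ann_b)
    return (hit_a + hit_b) / (len(ann_a) + len(ann_b))


# Made-up annotations from two annotators over the same text.
annotator_a = [
    Ann(0, 10, "Disease or Syndrome"),
    Ann(15, 22, "Sign or Symptom"),
    Ann(30, 35, "Pharmacologic Substance"),
    Ann(40, 45, "Sign or Symptom"),
]
annotator_b = [
    Ann(0, 10, "Disease or Syndrome"),
    Ann(15, 25, "Finding"),
    Ann(30, 38, "Pharmacologic Substance"),
    Ann(40, 45, "Finding"),
]
scores = {mode: agreement(annotator_a, annotator_b, mode) for mode in MODES}
```

On this toy input the strict score is the lowest and the relaxed score the highest, the same ordering seen in the table, because each relaxation can only add matches.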
Fig. 4 Average IAA values for the most frequent STYs. The average IAA scores for the most frequent semantic types and their corresponding semantic groups (in parentheses). The heat map indicates the highest values in blue and the lowest values in red
Average IAA values per RTY
| RTY | IAA |
|---|---|
| associated_with | 0.823 |
| negation_of | 0.914 |
The average IAA scores per RTY for the entire corpus
Comparison between similar clinical annotation projects
| Corpus | Type | Strict | Lenient | Flexible | Relaxed |
|---|---|---|---|---|---|
| | | 0.77 (8.5%) | 0.80 (−3.6%) | 0.77 (0%) | 0.80 (−13.0%) |
| | | – | 0.75 (−12.7%) | – | – |
| | | 0.84 (18.3%) | 0.90 (8.4%) | 0.84 (9.1%) | 0.90 (−2.2%) |
| | | – | 0.82 (−4.6%) | – | – |
| | | 0.79 (11.2%) | – | 0.79 (2.6%) | – |
| | | – | 0.78 (−9.3%) | – | – |
| | | 0.80 (12.6%) | – | 0.80 (3.9%) | – |
| | | – | 0.66 (−23.2%) | – | – |
| | | 0.69 (−2.8%) | 0.75 (−9.6%) | 0.69 (−10.4%) | 0.75 (−18.5%) |
| | | – | – | – | – |
The percentage difference in performance between the proposed corpus and other clinical annotation projects is shown in parentheses. Note that, for projects whose authors did not calculate the Flexible and Relaxed metrics specifically, the Flexible and Relaxed values are copies of the Strict and Lenient scores, respectively, so that the percentage difference from our values can still be reported
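The values in parentheses appear to be relative differences with respect to the proposed corpus. A minimal sketch follows; the formula and the use of a baseline rounded to two decimals (e.g., 0.71 for Strict and 0.83 for Lenient) are assumptions that happen to reproduce several table entries, not a procedure stated in the source.

```python
def percent_difference(other: float, ours: float) -> float:
    """Relative difference of another project's IAA with respect to ours, in percent."""
    return (other - ours) / ours * 100.0


# Assumed two-decimal baselines for the proposed corpus.
OURS_STRICT = 0.71
OURS_LENIENT = 0.83

strict_diff = round(percent_difference(0.84, OURS_STRICT), 1)
lenient_diff = round(percent_difference(0.80, OURS_LENIENT), 1)
```

A positive value means the other project reported higher agreement than the proposed corpus under that criterion.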