Paul Thompson1, Riza Theresa Batista-Navarro1, Georgios Kontonatsios1, Jacob Carter1, Elizabeth Toon2, John McNaught1, Carsten Timmermann2, Michael Worboys2, Sophia Ananiadou1.
Abstract
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly accessible as a result of large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data efficiently. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestion of synonyms of user-entered query terms, exploration of the different concepts mentioned within search results, or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system.
The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
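As a rough illustration of the synonym-suggestion functionality mentioned in the abstract, the following minimal Python sketch expands a user-entered query term with variant forms. The `VARIANTS` table and the `expand_query` function are hypothetical and hand-written for illustration only; they are not the resources or the search system described in the article, which derives variants from the historical documents themselves.

```python
# Hypothetical, hand-written variant table (an assumption for illustration);
# a real system would draw variants from a time-sensitive terminology resource.
VARIANTS = {
    "tuberculosis": ["phthisis", "consumption"],
    "whooping cough": ["pertussis"],
}


def expand_query(term):
    """Return the normalised query term followed by any known variants."""
    key = term.strip().lower()
    return [key] + VARIANTS.get(key, [])


print(expand_query("Tuberculosis"))
```

A search front end could present the extra terms as suggestions rather than adding them silently, so that the user keeps control over how far the query is broadened.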
Year: 2016 PMID: 26734936 PMCID: PMC4703377 DOI: 10.1371/journal.pone.0144717
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of work.
Annotated NE types.
| Entity Type | Description | Examples |
|---|---|---|
| Condition | Medical condition/ailment | phthisis, bronchitis, typhus |
| Sign_or_Symptom | Altered physical appearance/behaviour as probable result of injury/condition | cough, pain, rise in temperature, swollen |
| Anatomical | Entity forming part of human body, including substances and abnormal alterations to bodily structures | lung, lobe, sputum, fibroid |
| Subject | Individual or group under discussion | children, asthma patients, those with negative reactions to tuberculin |
| Therapeutic_or_Investigational | Treatment/intervention administered to combat condition (including diet/foodstuffs), or substance, medium or procedure used in investigational medical or public health context | atrophine sulphate, generous diet, change of air, lobectomy |
| Biological_Entity | Living entity not part of human body, including microorganisms, animals and insects | tubercle bacilli, mould, guinea-pig, flea |
| Environmental | Environmental factor relevant to incidence/prevention/control/treatment of condition. Includes climatic conditions, foodstuffs, infrastructure, household items or occupations whose environmental factors are mentioned | humidity, high mountain climates, infected milk, linen, drains, sewers, dusty occupations |
Annotated event types.
| Event Type | Description | Possible Participants |
|---|---|---|
| Affect | A (previously existing) entity or event is affected, infected, undergoes change or is transformed, possibly by another entity or event. | |
| Causality | An entity or event results in the manifestation of a (previously non-existing) entity or event. | |
Fig 2. Examples of annotated Affect events.
Event triggers are shown in blue and entity annotations are shown in green. Event participants are linked to the corresponding event trigger with arrows. The labels on the arrows represent the semantic label assigned to the participant.
Fig 3. Examples of annotated Causality events.
Event triggers are shown in blue and entity annotations are shown in green. Event participants are linked to the corresponding event trigger with arrows. The labels on the arrows represent the semantic label assigned to the participant.
MetaMap to HIMERA category mappings.
| MetaMap Categories | HIMERA Category | Filtering Rules |
|---|---|---|
| Anatomical Abnormality, Body Substance, Body Part, Organ, Organ Component, Body Location or Region, Body Space or Junction, Tissue | Anatomical | |
| Animal, Mammal, Cell, Bacterium, Organism | Biological_Entity | |
| Food, Chemical Viewed Structurally, Element, Ion, or Isotope, Hazardous or Poisonous Substance, Substance, Natural Phenomenon or Process | Environmental | |
| Disease or Syndrome, Pathologic Function | Condition | Must be a noun phrase |
| Clinical Drug, Amino Acid, Peptide, or Protein, Immunologic Factor, Organic Chemical, Pharmacologic Substance, Biologically Active Substance, Lipid | Therapeutic_or_Investigational | Must be a noun phrase; Length must be greater than 1 |
| Sign or Symptom, Finding | Sign_or_Symptom | Tagged as noun |
| Group, Patient or Disabled Group | Subject | |
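The mapping table above can be read as a lookup followed by a per-category filter. The sketch below shows one way this might be encoded, using only a subset of the listed semantic types. The function name `map_mention` and its parameters are hypothetical, and the unit of "Length" (tokens or characters) is not stated in the table, so the caller supplies whichever measure is intended.

```python
# Subset of the MetaMap-to-HIMERA mappings listed in the table above.
METAMAP_TO_HIMERA = {
    "Body Substance": "Anatomical",
    "Tissue": "Anatomical",
    "Bacterium": "Biological_Entity",
    "Food": "Environmental",
    "Disease or Syndrome": "Condition",
    "Pathologic Function": "Condition",
    "Clinical Drug": "Therapeutic_or_Investigational",
    "Pharmacologic Substance": "Therapeutic_or_Investigational",
    "Sign or Symptom": "Sign_or_Symptom",
    "Finding": "Sign_or_Symptom",
    "Group": "Subject",
}


def map_mention(semantic_type, is_noun_phrase, length, is_noun):
    """Return the HIMERA category for a MetaMap match, or None if filtered out.

    `length` is in whatever unit the filtering rule intends (an assumption:
    the table does not say whether this is tokens or characters).
    """
    category = METAMAP_TO_HIMERA.get(semantic_type)
    if category == "Condition" and not is_noun_phrase:
        return None  # rule: must be a noun phrase
    if category == "Therapeutic_or_Investigational":
        if not is_noun_phrase or length <= 1:
            return None  # rules: noun phrase, length greater than 1
    if category == "Sign_or_Symptom" and not is_noun:
        return None  # rule: tagged as noun
    return category


print(map_mention("Disease or Syndrome", True, 1, True))
print(map_mention("Clinical Drug", True, 1, True))
```

Categories without a filtering rule (for example Anatomical or Subject) pass straight through the lookup.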
Entity annotation agreement rates.
| Type | Exact Match (F-score) | Relaxed Match (F-score) |
|---|---|---|
| | 0.81 | 0.85 |
| | 0.99 | 0.99 |
| | 0.92 | 0.95 |
| | 0.63 | 0.79 |
| | 0.84 | 0.88 |
| | 0.70 | 0.81 |
| | 0.73 | 0.78 |
| | 0.80 | 0.86 |
Annotated Entity counts in HIMERA.
| Type | Count |
|---|---|
| | 2002 |
| | 295 |
| | 1499 |
| | 1268 |
| | 1171 |
| | 1062 |
| | 1046 |
| Total | 8343 |
5-fold cross-validation NE results.
| Category | BL-Ex | SM-Ex | FM-Ex | UL-Ex | BL-Rel | SM-Rel | FM-Rel | UL-Rel |
|---|---|---|---|---|---|---|---|---|
| 0.63 | 0.64 | 0.65 | 0.76 | 0.77 | ||||
| 0.39 | 0.40 | 0.46 | 0.49 | 0.48 | ||||
| 0.49 | 0.49 | 0.58 | ||||||
| 0.78 | 0.80 | 0.88 | 0.89 | 0.89 | ||||
| 0.68 | 0.74 | 0.74 | 0.76 | 0.82 | 0.82 | |||
| 0.73 | 0.77 | 0.82 | 0.85 | |||||
| 0.70 | 0.79 | |||||||
| 0.70 | 0.70 | 0.70 | 0.76 | 0.76 | 0.76 | |||
| 0.74 | 0.74 | 0.81 | 0.81 | 0.81 | ||||
| 0.79 | 0.79 | 0.86 | 0.86 | |||||
| 0.56 | 0.61 | 0.60 | 0.61 | 0.66 | ||||
| 0.66 | 0.69 | 0.71 | 0.74 | |||||
| 0.80 | 0.80 | 0.80 | 0.88 | 0.88 | 0.88 | |||
| 0.66 | 0.70 | 0.70 | 0.72 | 0.77 | 0.76 | |||
| 0.73 | 0.75 | 0.79 | 0.82 | 0.82 | ||||
| 0.71 | 0.73 | 0.72 | 0.79 | 0.79 | 0.80 | |||
| 0.48 | 0.53 | |||||||
| 0.57 | 0.60 | 0.64 | 0.66 | |||||
| 0.93 | 0.93 | 0.93 | ||||||
| 0.57 | 0.57 | 0.61 | 0.58 | 0.62 | 0.63 | |||
| 0.70 | 0.70 | 0.73 | 0.72 | 0.74 | 0.75 | |||
| 0.77 | 0.77 | 0.78 | 0.86 | 0.85 | ||||
| 0.58 | 0.65 | 0.69 | ||||||
| 0.67 | 0.74 | |||||||
BL = Baseline; SM = Selective MetaMap; FM = Full MetaMap; UL = UMLS Lookup; Ex = Exact span matching; Rel = Relaxed span matching; P = Precision; R = Recall; F = F-Score. The best Precision, Recall and F-Score results for each category are shown in bold type.
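The difference between exact and relaxed span matching can be sketched as follows. This is a minimal illustration under an assumption: relaxed matching is taken here to mean that two spans of the same category overlap by at least one character, which is a common convention but is not spelled out in this record. The function names and the `(start, end, category)` tuple layout are hypothetical.

```python
def exact_match(gold, pred):
    """Spans are (start, end, category) with an exclusive end offset."""
    return gold == pred


def relaxed_match(gold, pred):
    """Assumed criterion: same category and overlapping character offsets."""
    g_start, g_end, g_cat = gold
    p_start, p_end, p_cat = pred
    return g_cat == p_cat and g_start < p_end and p_start < g_end


gold = (10, 26, "Condition")   # e.g. a two-word condition mention
pred = (18, 26, "Condition")   # the system found only the second word

print(exact_match(gold, pred), relaxed_match(gold, pred))
```

Under this reading, a partially recovered mention counts as correct for the relaxed figures but not for the exact ones, which is why the relaxed scores in the table are never lower than their exact counterparts.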
Temporal-based model results.
| Test data | Training data | Env. | Cond. | Subj. | SS | Anat. | TI | Biol. | ALL |
|---|---|---|---|---|---|---|---|---|---|
| N/A | UL (5-fold) | ||||||||
| 1890s | 0.19 | 0.80 | 0.74 | 0.65 | 0.28 | 0.41 | 0.65 | ||
| 1920s | 0.34 | 0.73 | 0.70 | 0.54 | 0.55 | 0.37 | 0.33 | 0.55 | |
| 1960s | 0.33 | 0.70 | 0.61 | 0.61 | 0.42 | 0.08 | 0.00 | 0.49 | |
| 1890s/1920s | 0.33 | 0.67 | 0.77 | 0.67 | |||||
| 1920s/1960s | 0.72 | 0.73 | 0.56 | 0.61 | 0.34 | 0.34 | 0.58 | ||
| All other dec. | 0.37 | 0.43 | |||||||
| 1850s | 0.32 | 0.79 | 0.79 | 0.68 | 0.78 | 0.35 | 0.41 | 0.70 | |
| 1920s | 0.26 | 0.77 | 0.73 | 0.56 | 0.66 | 0.39 | 0.87 | 0.64 | |
| 1960s | 0.23 | 0.72 | 0.68 | 0.41 | 0.54 | 0.19 | 0.13 | 0.51 | |
| 1850s/1920s | 0.38 | 0.83 | 0.47 | 0.88 | 0.76 | ||||
| 1920s/1960s | 0.29 | 0.67 | 0.72 | 0.54 | 0.66 | 0.33 | 0.61 | ||
| All other dec. | 0.78 | 0.91 | |||||||
| 1850s | 0.37 | 0.77 | 0.80 | 0.61 | 0.48 | 0.32 | 0.22 | 0.53 | |
| 1890s | 0.17 | 0.76 | 0.77 | 0.64 | 0.49 | 0.35 | 0.66 | 0.52 | |
| 1960s | 0.15 | 0.67 | 0.67 | 0.28 | 0.41 | 0.25 | 0.10 | 0.42 | |
| 1850s/1890s | 0.78 | 0.67 | 0.54 | 0.44 | 0.68 | 0.59 | |||
| 1890s/1960s | 0.28 | 0.78 | 0.65 | 0.55 | 0.45 | 0.57 | |||
| All other dec. | 0.74 | ||||||||
| 1850s | 0.32 | 0.77 | 0.48 | 0.45 | 0.47 | 0.18 | 0.03 | 0.47 | |
| 1890s | 0.07 | 0.74 | 0.51 | 0.47 | 0.57 | 0.27 | 0.08 | 0.46 | |
| 1920s | 0.18 | 0.44 | 0.43 | 0.39 | 0.37 | 0.09 | 0.46 | ||
| 1850s/1890s | 0.30 | 0.78 | 0.55 | 0.32 | 0.07 | 0.50 | |||
| 1890s/1920s | 0.19 | 0.78 | 0.51 | 0.50 | 0.53 | ||||
| All other dec. | 0.51 | 0.61 | 0.46 |
Results are shown in terms of F-Score (relaxed span matching). Relaxed match F-Score UL results from the 5-fold cross validation experiments are shown on the first line, for comparison purposes. For each decade of test data, the bold figures indicate the best performing model(s) for each category of NEs. Env. = Environmental, Cond. = Condition, Subj. = Subject, SS = Sign_or_Symptom, Anat. = Anatomical, TI = Therapeutic_or_Investigational, Biol. = Biological_Entity.
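The pattern of training sets in the table can be sketched as below. The assumption, inferred from the training-data labels rather than stated here, is that each of the four decades serves in turn as test data, and that models are trained on each other decade alone, on each pair of consecutive remaining decades, and on all remaining decades. The function name `training_configs` is hypothetical.

```python
DECADES = ["1850s", "1890s", "1920s", "1960s"]


def training_configs(test_decade):
    """List the training sets used when `test_decade` is held out."""
    others = [d for d in DECADES if d != test_decade]
    singles = [[d] for d in others]
    # Consecutive pairs of the remaining decades, e.g. 1890s/1920s.
    pairs = [others[i:i + 2] for i in range(len(others) - 1)]
    return singles + pairs + [others]  # last entry: "All other dec."


for test in DECADES:
    for train in training_configs(test):
        print(test, "<-", "/".join(train))
```

This yields six training configurations per test decade, matching the six rows in each block of the table.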
Event trigger recognition results.
| Event Type | Exact Match | Relaxed Match |
|---|---|---|
| | 0.41 | 0.46 |
| | 0.39 | 0.44 |
| | 0.40 | 0.45 |
| | 0.36 | 0.52 |
| | 0.13 | 0.18 |
| | 0.19 | 0.27 |
| | 0.40 | 0.47 |
| | 0.32 | 0.37 |
| | 0.35 | 0.42 |
P = Precision; R = Recall; F = F-Score
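The F-Score is the harmonic mean of precision and recall. The short sketch below computes it and checks it against the table, assuming that the first three rows form a P, R, F triple (an assumption about the row order). Because the published figures are rounded, recomputing F from rounded P and R may differ from a reported value in the last digit.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First triple of the table, exact match: P = 0.41, R = 0.39
print(round(f_score(0.41, 0.39), 2))  # 0.4
# First triple of the table, relaxed match: P = 0.46, R = 0.44
print(round(f_score(0.46, 0.44), 2))  # 0.45
```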
Fig 4. DSM precision for diseases in BMJ.
Fig 7. DSM recall for diseases in MOH.
Expert semantic categorisation of DSM output.
| Relation Type | Count (% proportion of total pairs) |
|---|---|
| | 83 (24%) |
| | 21 (6%) |
| | 35 (10%) |
| | 10 (3%) |
| | 11 (3%) |
| | 19 (5%) |
| | 35 (10%) |
| | 216 (62%) |
| | 106 (30%) |
| | 26 (7%) |
Fig 8. Term-based search in HOM.
Fig 9. Individual document display in HOM.