George Karystianis, Iain Buchan, Goran Nenadic.
Abstract
BACKGROUND: The health sciences literature incorporates a relatively large subset of epidemiological studies that focus on population-level findings, including various determinants, outcomes and correlations. Extracting structured information about those characteristics would be useful for a more complete understanding of diseases and for meta-analyses and systematic reviews.
Keywords: Epidemiology; Key characteristics; Rule-based methodology; Text mining
Year: 2014 PMID: 24949194 PMCID: PMC4062908 DOI: 10.1186/2041-1480-5-22
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. The four steps of the approach applied to epidemiological abstracts in order to recognise key characteristics. Linnaeus is used to filter out abstracts not related to humans; dictionary look-up and automatic term recognition (ATR) are applied to identify major medical concepts in the text; MinorThird is used as the environment for rule application and mention identification of epidemiological characteristics.
Examples of rules for recognition of study design, population, exposure, outcome, covariate and effect size in epidemiological abstracts
| [@st | a(types)] | ||||||
| Methods: This was a | |||||||
| a(totals) | re(‘(of|on|in)’) | [@stats | a(clusters)] | ||||
| Sibling study in a prospective cohort of | cohort | of | |||||
| @multiple | re(‘with|in|on’)? | [a(clusters) | re(‘with|without’) | @multiple] | |||
| bone mineral density in | bone mineral density | in | |||||
| a(relations) | eq(‘between’) | [@multiple] | eq(‘and’) | @multiple | |||
| … and analyze the association between | association | Between | and | blood pressure | |||
| [@multiple] | a(be) | a(related) | a(with) | eq(‘onset’)? | eq(‘of’)? | ||
| is | associated | with | onset | of | |||
| @factors | eq(‘of’) | [@multiple] | |||||
| Cardiovascular and disease related predictors of | predictors | of | |||||
| @multiple | a(be) | a(adverbs) | a(related) | a(with) | [@multiple] | ||
| Conclusions coffee intake is inversely associated with | coffee intake | is | inversely | associated | with | ||
| a(adj) | eq(‘for’) | [@multiple] | |||||
| … after adjusting | adjusting | for | |||||
| eq(‘including’) | [@multiple] | eq(‘as’) | @synonyms | ||||
| … including | including | as | covariates | ||||
| @multiple | [a(preva) | a(be) | @perce] | ||||
| Hernia | Hernia | ||||||
| @multiple | @or | | @ci | ||||
| … more likely to have | elevated blood pressure | ||||||
The rule components in square brackets are the extracted spans that denote the key characteristic; the rest of the rule (if any) specifies the context. The rules use explicit matching of spans (e.g. eq(‘onset’)), regular expressions (re) for matching specific verbs or prepositions (e.g. re(‘(of|on|in)’)), and vocabularies that contain either single words (e.g. a(types), which matches words indicating that a study was conducted, such as study, analysis, review) or multi-word terms (e.g. @st, a vocabulary of epidemiological study designs such as case control). totals contains words that suggest the participant population; stats is a dictionary of numbers and words that express numeric values (e.g., one hundred); clusters includes the different ways in which a population sample can be described (e.g., men, patients, individuals); multiple contains single or multi-word biomedical concepts (e.g., depression, type 2 diabetes); relations is a dictionary of single words that describe an association between concepts (e.g., relationship, link, association); factors contains single or multi-word terms that describe risk factors (e.g., risk factors, predictors); or is a dictionary of noun phrases in which the effect size “odds ratio” can be expressed, including the ways in which its numeric value is presented (e.g., odds ratio = 1.34, or = 2.56); ci follows a similar pattern for confidence intervals with their assigned numeric values (e.g., 95% ci = 0.91, 95% ci: 4.36, 5.48).
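To make the rule notation more concrete, the sketch below mimics one rule of the shape `a(adj) eq('for') [@multiple]` in plain Python. It is only an illustrative analogue: the original rules run inside MinorThird, and the vocabularies `ADJ` and `MULTIPLE` as well as the function `extract_covariates` are hypothetical stand-ins invented for this example, not the authors' dictionaries or code.

```python
import re

# Hypothetical stand-in for the single-word vocabulary a(adj).
ADJ = {"adjusting", "adjusted", "adjustment"}

# Hypothetical stand-in for @multiple (single or multi-word biomedical concepts).
MULTIPLE = ["body mass index", "smoking", "age", "sex"]


def extract_covariates(sentence):
    """Apply a rule shaped like: a(adj) eq('for') [@multiple].

    The first two components are context; only the concept that follows
    them (the bracketed part of the rule) is returned.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    # Try longer terms first so multi-word concepts win over their parts.
    terms = sorted((term.split() for term in MULTIPLE), key=len, reverse=True)
    found = []
    for i in range(len(tokens) - 2):
        if tokens[i] in ADJ and tokens[i + 1] == "for":
            start = i + 2
            for term in terms:
                if tokens[start:start + len(term)] == term:
                    found.append(" ".join(term))
                    break
    return found


if __name__ == "__main__":
    text = "Results were similar after adjusting for body mass index and smoking."
    print(extract_covariates(text))
```

The sketch extracts only the first concept after the context; handling coordinated lists of concepts would need further rules, as in the rule table above.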
Results, including true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F-score, on the evaluation set
| Characteristic | TP | FP | FN | P (%) | R (%) | F-score (%) |
|---|---|---|---|---|---|---|
| | 12 | 0 | 1 | 100.0 | 92.3 | 95.9 |
| | 35 | 1 | 4 | 97.2 | 89.7 | 93.3 |
| | 45 | 8 | 11 | 84.9 | 80.3 | 82.5 |
| | 73 | 19 | 13 | 79.3 | 84.8 | 82.4 |
| | 17 | 2 | 0 | 89.4 | 100.0 | 94.4 |
| | 65 | 2 | 10 | 97.0 | 86.6 | 91.5 |
| Micro-average | 247 | 32 | 39 | 88.5 | 86.3 | 87.4 |
| Macro-average | | | | 91.3 | 88.9 | 90.0 |
Micro-averages are calculated across all document-level mentions; macro-averages are calculated across the different characteristics.
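The difference between the two averages can be checked directly from the TP, FP and FN counts of the evaluation set. The sketch below is a minimal illustration, not the authors' evaluation script: the names `Counts`, `prf`, `micro_average` and `macro_average` are hypothetical, and the macro F-score is assumed to be the mean of the per-characteristic F-scores. The reported figures appear to be truncated to one decimal, so the computed micro- and macro-averages agree with them only to within about 0.1.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int
    fp: int
    fn: int


def prf(c):
    """Return (precision, recall, F-score) as fractions for one set of counts."""
    p = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    r = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# TP, FP, FN for the six characteristics, copied from the evaluation-set table.
EVAL_ROWS = [
    Counts(12, 0, 1),
    Counts(35, 1, 4),
    Counts(45, 8, 11),
    Counts(73, 19, 13),
    Counts(17, 2, 0),
    Counts(65, 2, 10),
]


def micro_average(rows):
    """Pool all counts first, then compute P, R and F once."""
    pooled = Counts(
        sum(c.tp for c in rows),
        sum(c.fp for c in rows),
        sum(c.fn for c in rows),
    )
    return prf(pooled)


def macro_average(rows):
    """Compute P, R and F per characteristic, then take the unweighted mean."""
    scores = [prf(c) for c in rows]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


if __name__ == "__main__":
    for name, avg in (("micro", micro_average(EVAL_ROWS)),
                      ("macro", macro_average(EVAL_ROWS))):
        print(name, [round(100 * x, 1) for x in avg])
```

Micro-averaging weights every mention equally, so characteristics with many mentions dominate; macro-averaging weights every characteristic equally regardless of how often it occurs.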
Results, including true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F-score, on the training set
| Characteristic | TP | FP | FN | P (%) | R (%) | F-score (%) |
|---|---|---|---|---|---|---|
| | 37 | 5 | 1 | 88.0 | 97.3 | 92.5 |
| | 94 | 10 | 5 | 90.3 | 94.9 | 92.6 |
| | 104 | 21 | 14 | 83.2 | 88.1 | 85.5 |
| | 125 | 26 | 8 | 82.7 | 93.9 | 88.0 |
| | 13 | 4 | 0 | 76.4 | 100.0 | 86.6 |
| | 41 | 5 | 9 | 89.1 | 82.0 | 85.4 |
| Micro-average | 414 | 71 | 37 | 85.3 | 91.7 | 88.4 |
| Macro-average | | | | 84.9 | 92.7 | 88.4 |
Micro-averages are calculated across all document-level mentions; macro-averages are calculated across the different characteristics.
Results, including true positives (TP), false positives (FP), false negatives (FN), precision (P), recall (R) and F-score, on the development set
| Characteristic | TP | FP | FN | P (%) | R (%) | F-score (%) |
|---|---|---|---|---|---|---|
| | 11 | 1 | 2 | 91.6 | 84.6 | 88.0 |
| | 36 | 4 | 4 | 90.0 | 90.0 | 90.0 |
| | 59 | 4 | 0 | 93.6 | 100.0 | 96.7 |
| | 65 | 13 | 1 | 83.3 | 98.4 | 90.2 |
| | 13 | 3 | 0 | 81.2 | 100.0 | 89.6 |
| | 50 | 17 | 5 | 74.6 | 90.9 | 81.9 |
| Micro-average | 234 | 42 | 12 | 84.7 | 95.1 | 89.6 |
| Macro-average | | | | 85.7 | 93.8 | 89.5 |
Micro-averages are calculated across all document-level mentions; macro-averages are calculated across the different characteristics.
The most frequent study designs extracted from the obesity epidemiological literature
| Study design | Frequency | % |
|---|---|---|
| | 1,940 | 32.0 |
| | 1,876 | 30.9 |
| | 678 | 11.1 |
| | 521 | 8.5 |
| | 341 | 5.6 |
| | 191 | 3.1 |
| | 109 | 1.7 |
| | 109 | 1.7 |
| | 95 | 1.5 |
| | 49 | 0.8 |
Frequency is the number of documents, and the last column presents the share within the entire set.
The most frequent exposures extracted from the obesity epidemiological literature
| Exposure | Frequency | % |
|---|---|---|
| | 2,450 | 10.4 |
| | 1,351 | 5.7 |
| | 531 | 2.2 |
| | 394 | 1.6 |
| | 291 | 1.2 |
| | 289 | 1.2 |
| | 256 | 1.0 |
| | 240 | 1.0 |
| | 218 | 0.9 |
| | 206 | 0.8 |
| | 193 | 0.8 |
| | 186 | 0.7 |
| | 135 | 0.5 |
| | 128 | 0.5 |
| | 117 | 0.4 |
| | 116 | 0.4 |
| | 108 | 0.4 |
| | 98 | 0.4 |
| | 92 | 0.3 |
| | 89 | 0.3 |
| | 82 | 0.3 |
| | 79 | 0.3 |
| | 79 | 0.3 |
| | 75 | 0.3 |
| | 75 | 0.3 |
| | 70 | 0.2 |
| | 69 | 0.2 |
| | 67 | 0.2 |
| | 66 | 0.2 |
| | 66 | 0.2 |
| | 59 | 0.2 |
| | 59 | 0.2 |
| | 55 | 0.2 |
| | 54 | 0.2 |
| | 52 | 0.2 |
| | 49 | 0.1 |
| | 49 | 0.1 |
| | 48 | 0.1 |
| | 47 | 0.1 |
| | 45 | 0.1 |
Frequency is the number of documents, and the last column presents the share within the entire set.
Distribution of UMLS semantic groups assigned to exposures
| Semantic group | Frequency | % |
|---|---|---|
| Disorders | 8,700 | 36.9 |
| Concepts/ideas | 4,635 | 19.7 |
| Physiology | 3,969 | 16.8 |
| Procedures | 1,611 | 6.8 |
| Activities/behaviors | 1,285 | 5.4 |
| Living beings | 1,030 | 4.3 |
| Chemicals/drugs | 857 | 3.6 |
| Objects | 368 | 1.5 |
| Genes/molecular | 344 | 1.4 |
| Anatomy | 252 | 1.0 |
| Phenomena | 180 | 0.7 |
| Geographic areas | 145 | 0.6 |
| Occupations | 73 | 0.3 |
| Devices | 30 | 0.1 |
| Organizations | 21 | 0.0 |
| Other | 16 | 0.0 |
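The percentage column can be recomputed from the counts. The sketch below rests on two assumptions that the text does not state explicitly: the denominator is the sum of the listed counts, and shares are truncated (not rounded) to one decimal place. Under those assumptions it reproduces, for example, 36.9 for Disorders and 16.8 for Physiology. The names `EXPOSURE_GROUPS` and `truncated_shares` are hypothetical.

```python
# Counts copied from the table of UMLS semantic groups assigned to exposures.
EXPOSURE_GROUPS = {
    "Disorders": 8700,
    "Concepts/ideas": 4635,
    "Physiology": 3969,
    "Procedures": 1611,
    "Activities/behaviors": 1285,
    "Living beings": 1030,
    "Chemicals/drugs": 857,
    "Objects": 368,
    "Genes/molecular": 344,
    "Anatomy": 252,
    "Phenomena": 180,
    "Geographic areas": 145,
    "Occupations": 73,
    "Devices": 30,
    "Organizations": 21,
    "Other": 16,
}


def truncated_shares(counts):
    """Return each group's percentage share, truncated to one decimal place.

    Assumes the denominator is the sum of all listed counts.
    """
    total = sum(counts.values())
    # Integer arithmetic (tenths of a percent) avoids floating-point
    # surprises when truncating.
    return {name: (n * 1000 // total) / 10 for name, n in counts.items()}


if __name__ == "__main__":
    for name, share in truncated_shares(EXPOSURE_GROUPS).items():
        print(f"{name}: {share}")
```

The same function can be applied to the outcome and covariate distributions by swapping in their counts.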
The most frequent outcomes extracted from the obesity epidemiological literature
| Outcome | Frequency | % |
|---|---|---|
| | 5,220 | 12.9 |
| | 2,058 | 5.1 |
| | 1,379 | 3.4 |
| | 1,084 | 2.6 |
| | 728 | 1.8 |
| | 712 | 1.7 |
| | 659 | 1.6 |
| | 460 | 1.1 |
| | 297 | 0.7 |
| | 289 | 0.7 |
| | 260 | 0.6 |
| | 250 | 0.6 |
| | 225 | 0.5 |
| | 211 | 0.5 |
| | 209 | 0.5 |
| | 194 | 0.4 |
| | 193 | 0.4 |
| | 181 | 0.4 |
| | 180 | 0.4 |
| | 175 | 0.4 |
| | 162 | 0.4 |
| | 161 | 0.3 |
| | 155 | 0.3 |
| | 127 | 0.3 |
| | 122 | 0.3 |
| | 116 | 0.2 |
| | 110 | 0.2 |
| | 101 | 0.2 |
| | 98 | 0.2 |
| | 95 | 0.2 |
| | 94 | 0.2 |
| | 91 | 0.2 |
| | 91 | 0.2 |
| | 88 | 0.2 |
| | 86 | 0.2 |
| | 85 | 0.2 |
| | 85 | 0.2 |
| | 81 | 0.2 |
| | 78 | 0.1 |
| | 68 | 0.1 |
Frequency is the number of documents, and the last column presents the share within the entire set.
Distribution of UMLS semantic groups assigned to outcomes
| Semantic group | Frequency | % |
|---|---|---|
| Disorders | 21,809 | 54.0 |
| Concepts/ideas | 7,277 | 18.0 |
| Physiology | 3,810 | 9.4 |
| Procedures | 1,697 | 4.2 |
| Living beings | 1,616 | 4.0 |
| Activities/behaviors | 1,413 | 3.5 |
| Chemicals/drugs | 990 | 2.4 |
| Anatomy | 577 | 1.4 |
| Objects | 314 | 0.7 |
| Genes/molecular | 265 | 0.6 |
| Phenomena | 250 | 0.6 |
| Geographic areas | 137 | 0.3 |
| Occupations | 102 | 0.2 |
| Organizations | 36 | 0.0 |
| Devices | 28 | 0.0 |
| Other | 16 | 0.0 |
The most frequent covariates extracted from the obesity epidemiological literature
| Covariate | Frequency | % |
|---|---|---|
| | 1,066 | 19.3 |
| | 631 | 11.4 |
| | 346 | 6.2 |
| | 260 | 4.7 |
| | 160 | 2.9 |
| | 117 | 2.1 |
| | 108 | 1.9 |
| | 83 | 1.5 |
| | 70 | 1.2 |
| | 67 | 1.2 |
| | 60 | 1.0 |
| | 58 | 1.0 |
| | 53 | 0.9 |
| | 43 | 0.7 |
| | 42 | 0.7 |
| | 39 | 0.7 |
| | 36 | 0.6 |
| | 33 | 0.6 |
| | 32 | 0.5 |
| | 27 | 0.5 |
| | 25 | 0.5 |
| | 25 | 0.5 |
| | 22 | 0.4 |
| | 20 | 0.3 |
| | 20 | 0.3 |
| | 17 | 0.3 |
| | 17 | 0.3 |
| | 17 | 0.3 |
| | 16 | 0.2 |
| | 15 | 0.2 |
| | 14 | 0.2 |
| | 13 | 0.2 |
| | 13 | 0.2 |
| | 12 | 0.2 |
| | 12 | 0.2 |
| | 12 | 0.2 |
| | 11 | 0.2 |
| | 10 | 0.1 |
| | 10 | 0.1 |
| | 10 | 0.1 |
Frequency is the number of documents, and the last column presents the share within the entire set.
Distribution of UMLS semantic groups assigned to covariates
| Semantic group | Frequency | % |
|---|---|---|
| Physiology | 2,381 | 43.2 |
| Concepts/ideas | 1,044 | 18.9 |
| Disorders | 783 | 14.2 |
| Activities/behaviors | 591 | 10.7 |
| Living beings | 232 | 4.2 |
| Procedures | 184 | 3.3 |
| Chemicals/drugs | 112 | 2.0 |
| Geographic areas | 41 | 0.7 |
| Occupations | 34 | 0.6 |
| Objects | 29 | 0.5 |
| Phenomena | 26 | 0.4 |
| Genes/molecular | 17 | 0.3 |
| Anatomy | 17 | 0.3 |
| Other | 4 | 0.0 |
| Organizations | 4 | 0.0 |
| Devices | 1 | 0.0 |