Kristin Larsson1, Simon Baker2, Ilona Silins1, Yufan Guo2, Ulla Stenius1, Anna Korhonen2,3, Marika Berglund1.
Abstract
Chemical exposure assessments are based on information collected via different methods, such as biomonitoring, personal monitoring, environmental monitoring and questionnaires. The vast amount of chemical-specific exposure information available from web-based databases, such as PubMed, is undoubtedly a great asset to the scientific community. However, manual retrieval of relevant published information is an extremely time-consuming task, and gaining an overview of the data is nearly impossible. Here, we present the development of an automatic classifier for chemical exposure information. First, nearly 3700 abstracts were manually annotated by an expert in exposure sciences according to a taxonomy created exclusively for exposure information. Natural Language Processing (NLP) techniques were used to extract semantic and syntactic features relevant to chemical exposure text. Using these features, we trained a supervised machine learning algorithm to automatically classify PubMed abstracts according to the exposure taxonomy. The resulting classifier demonstrates good performance in the intrinsic evaluation. We also show that the classifier improves information retrieval of chemical exposure data compared to keyword-based PubMed searches. Case studies demonstrate that the classifier can assist researchers by facilitating information retrieval and classification, enabling recognition of data gaps, and providing an overview of the available scientific literature through chemical-specific publication profiles. Finally, we identify challenges to be addressed in future development of the system.
Year: 2017 PMID: 28257498 PMCID: PMC5336247 DOI: 10.1371/journal.pone.0173132
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Chemical risk assessment.
The process of a chemical risk assessment includes exposure assessment, hazard identification, hazard characterization and risk characterization [1, 2].
Examples of information considered relevant for different nodes in the exposure taxonomy.
| Node | Relevant information |
|---|---|
| | Measurements of exposure biomarkers (chemicals or metabolites) in corresponding human matrix (blood, urine, etc). |
| | Measurements of effect biomarkers in human matrices. |
| | Measurements of physiological markers of effect, such as blood pressure, lung function, birth weight, etc. |
| | Intake calculations derived from biomonitoring data. Exposure modelling (e.g. PBPK) of multiple exposure routes simultaneously. Job exposure matrix. |
| | Tape strip samples, hand wipes, hand washing samples, dermal wipes, dermal exposure modelling. |
| | Data from ambient air monitoring stations used in exposure assessments or epidemiological studies. |
| | Air in indoor microenvironments (homes, workplaces, schools, cars, etc). Environmental tobacco smoke. Inhalation from showers, cooking fuel, etc. |
| | Personal air monitoring, breathing zone measurements. |
| | Exposure estimates from drinking water, bottled water, well water, etc. |
| | Dust in indoor microenvironments (homes, workplaces, schools, cars, etc). |
| | Exposure estimates from food (e.g. intake assessments based on food concentration data and ingested amount of food, total diet studies, double portions, etc). |
| | Exposure estimates from toys, cosmetics, personal care products, dental fillings, drugs and vaccines, household pesticides, etc. |
| | Exposure estimates from playground soil or residential garden soil, etc. |
Number of annotated abstracts for each node in the taxonomy.
| Node | # abstracts |
|---|---|
| | 8 |
| | 106 |
| Adipose tissue | 88 |
| Blood | 744 |
| Hair/nail | 418 |
| Mother’s milk | 177 |
| Other tissue | 143 |
| Placenta | 60 |
| Urine | 784 |
| | 78 |
| Biomarker | 27 |
| | 141 |
| | 52 |
| | 94 |
| | 168 |
| | 300 |
| | 65 |
| | 62 |
| Physiological parameter | 777 |
| | 168 |
| | 165 |
| | 153 |
| | 356 |
| Outdoor air | 247 |
| Indoor air | 254 |
| Personal air | 174 |
| | 63 |
| Drinking water | 424 |
| Dust | 256 |
| Food | 647 |
| Products | 164 |
| Soil | 131 |
Fig 2. The NLP pipeline for automatic classification of document abstracts.
Chem: Chemical lists, MeSH: Medical Subject Headings, GR: Grammatical Relations, LBOW: Lemmatized Bag of Words, N.Bigrams: Noun Bigrams, VC: Verb Clusters, NE: Named Entities.
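As a rough illustration of the simplest feature types named above, the sketch below counts words and adjacent word pairs in one abstract. It is a simplification under stated assumptions, not the pipeline's implementation: it does no lemmatization and no part-of-speech filtering (so it yields plain word bigrams, not noun bigrams), and the function names `tokenize` and `bow_and_bigrams` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def bow_and_bigrams(text):
    """Return (word counts, adjacent-word-pair counts) for one abstract.

    Assumption: no lemmatization and no noun filtering, unlike LBOW and
    N.Bigrams in the pipeline described in the figure.
    """
    tokens = tokenize(text)
    bow = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return bow, bigrams

# Hypothetical example sentence, not taken from the corpus.
bow, bigrams = bow_and_bigrams("Lead was measured in blood. Blood lead levels were high.")
print(bow["blood"], bigrams[("blood", "lead")])
```

Each abstract's counts would then be turned into a feature vector for a supervised classifier, one classifier decision per taxonomy node.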
Feature selection.
The number of features for each node in the taxonomy after the feature selection step.
| Node | LBOW | GR | NE | VC | N Bigram | MeSH | Chem | Total |
|---|---|---|---|---|---|---|---|---|
| | 4785 | 3544 | 352 | 128 | 1395 | 754 | 253 | |
| | 4244 | 2965 | 312 | 127 | 1177 | 626 | 216 | |
| Adipose tissue | 361 | 66 | 23 | 67 | 38 | 56 | 19 | |
| Blood | 2453 | 1300 | 141 | 122 | 521 | 355 | 111 | |
| Hair/nail | 1390 | 532 | 47 | 109 | 266 | 186 | 34 | |
| Mother’s milk | 668 | 197 | 43 | 91 | 85 | 86 | 35 | |
| Other tissue | 612 | 106 | 14 | 92 | 66 | 82 | 21 | |
| Placenta | 307 | 52 | 13 | 80 | 32 | 37 | 13 | |
| Urine | 2310 | 1164 | 124 | 120 | 482 | 314 | 120 | |
| | 2889 | 1678 | 190 | 122 | 715 | 508 | 164 | |
| Biomarker | 1743 | 733 | 146 | 116 | 363 | 313 | 129 | |
| | 637 | 141 | 39 | 96 | 79 | 96 | 25 | |
| | 1461 | 555 | 121 | 115 | 271 | 248 | 103 | |
| | 447 | 84 | 19 | 88 | 54 | 69 | 30 | |
| | 724 | 181 | 37 | 97 | 99 | 106 | 40 | |
| | 1167 | 352 | 99 | 112 | 200 | 179 | 74 | |
| | 357 | 44 | 16 | 73 | 35 | 38 | 11 | |
| | 316 | 48 | 16 | 80 | 30 | 39 | 14 | |
| Physiological parameter | 2100 | 1018 | 78 | 121 | 434 | 347 | 77 | |
| | 4574 | 3356 | 248 | 130 | 1297 | 694 | 211 | |
| | 715 | 156 | 18 | 98 | 91 | 98 | 31 | |
| | 773 | 142 | 15 | 99 | 94 | 79 | 29 | |
| | 2607 | 1349 | 89 | 123 | 525 | 379 | 111 | |
| Outdoor air | 1064 | 407 | 30 | 109 | 159 | 136 | 14 | |
| Indoor air | 1308 | 369 | 35 | 107 | 191 | 178 | 58 | |
| Personal air | 974 | 221 | 25 | 99 | 124 | 104 | 32 | |
| | 3216 | 2067 | 171 | 126 | 830 | 461 | 127 | |
| Drinking water | 1338 | 501 | 44 | 117 | 250 | 191 | 45 | |
| Dust | 1057 | 293 | 43 | 97 | 163 | 145 | 53 | |
| Food | 1932 | 997 | 87 | 117 | 411 | 270 | 78 | |
| Products | 803 | 193 | 18 | 97 | 117 | 99 | 29 | |
| Soil | 652 | 124 | 5 | 84 | 86 | 80 | 24 | |
| | 1328 | 624 | 67 | 100 | 271 | 193 | 61 | |
LBOW: Lemmatized Bag of Words, GR: Grammatical Relations, NE: Named Entities, VC: Verb Clusters, N.Bigrams: Noun Bigrams, MeSH: Medical Subject Headings, Chem: Chemical lists.
The corpus of annotated PubMed abstracts and the software for classification are available at: https://figshare.com/articles/Corpus_and_Software/4668229
Results of intrinsic evaluation using 3-fold cross validation.
All scores are percentages.
| Node | Precision | Recall | Accuracy | F-score |
|---|---|---|---|---|
| | 94.9 | 95.5 | 93.1 | 95.2 |
| | 93.8 | 95.0 | 93.2 | 94.4 |
| Adipose tissue | 93.9 | 87.5 | 99.6 | 90.6 |
| Blood | 87.2 | 82.4 | 92.1 | 84.7 |
| Hair/nail | 97.7 | 89.9 | 98.4 | 93.6 |
| Mother’s milk | 91.6 | 86.4 | 99.0 | 89.0 |
| Other tissue | 86.0 | 25.9 | 96.9 | 39.8 |
| Placenta | 93.3 | 70.0 | 99.4 | 80.0 |
| Urine | 95.7 | 91.9 | 97.1 | 93.8 |
| | 89.0 | 81.8 | 90.7 | 85.3 |
| Biomarker | 89.4 | 69.2 | 94.2 | 78.0 |
| | 92.6 | 61.7 | 98.3 | 74.0 |
| | 85.4 | 63.1 | 94.5 | 72.6 |
| | 87.5 | 37.2 | 98.3 | 52.2 |
| | 80.0 | 42.9 | 96.9 | 55.8 |
| | 84.8 | 56.0 | 95.6 | 67.5 |
| | 91.7 | 33.8 | 98.8 | 49.4 |
| | 82.1 | 51.6 | 99.0 | 63.4 |
| Physiological parameter | 84.0 | 70.1 | 90.8 | 76.4 |
| | 89.0 | 92.3 | 87.4 | 90.6 |
| | 80.9 | 43.9 | 97.0 | 56.9 |
| | 83.9 | 72.6 | 98.0 | 77.8 |
| | 92.4 | 81.5 | 93.3 | 86.6 |
| Outdoor air | 91.1 | 83.1 | 98.0 | 86.9 |
| Indoor air | 78.6 | 30.4 | 92.3 | 43.8 |
| Personal air | 90.5 | 74.8 | 97.9 | 81.9 |
| | 87.7 | 82.1 | 88.1 | 84.8 |
| Drinking water | 83.4 | 81.3 | 95.7 | 82.3 |
| Dust | 90.4 | 75.1 | 97.4 | 82.0 |
| Food | 88.4 | 75.8 | 93.2 | 81.6 |
| Products | 80.4 | 42.3 | 96.4 | 55.4 |
| Soil | 78.5 | 62.5 | 97.7 | 69.6 |
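The F-scores in this table appear consistent with the balanced F-score, the harmonic mean of precision and recall. That reading is an assumption, since the record does not state the formula. The sketch below recomputes a few rows on that assumption; the function name `f_score` is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean); inputs and output are percentages."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above.
rows = {
    "Adipose tissue": (93.9, 87.5),
    "Other tissue": (86.0, 25.9),
    "Placenta": (93.3, 70.0),
}

for node, (p, r) in rows.items():
    print(node, round(f_score(p, r), 1))
```

The "Other tissue" row shows why high precision alone is not enough: a recall of 25.9% pulls the F-score below 40% despite 86.0% precision.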
Fig 3. Results of the intrinsic evaluation.
The color coding is based on F-scores (green: >75%, yellow: 50–75%, red: <50%).
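The color bands in this figure can be expressed as a small lookup. This is a minimal sketch; the function name is hypothetical, and treating exactly 50% and exactly 75% as yellow is an assumption read off the stated ranges.

```python
def f_score_color(f_score_pct):
    """Map an F-score in percent to the figure's color band."""
    if f_score_pct > 75:
        return "green"
    if f_score_pct >= 50:
        return "yellow"
    return "red"

# F-scores taken from the intrinsic evaluation table.
for node, f in [("Blood", 84.7), ("Soil", 69.6), ("Other tissue", 39.8)]:
    print(node, f_score_color(f))
```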
Analysis of the influence of each feature type on the classification accuracy.
The classification accuracy is reported as the F-score for each node after removal of the respective feature type. The column “All” gives the F-scores when all feature types are used. F-scores that decreased after removal of the respective feature type are presented in bold.
| Node | All | LBOW | GR | NE | VC | N Bigram | MeSH | Chem | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 95.2 | ||||||||||||
| 94.4 | ||||||||||||
| Adipose tissue | 90.6 | |||||||||||
| Blood | 84.7 | 85.2 | ||||||||||
| Hair/nail | 93.6 | 95.3 | 97.0 | |||||||||
| Mother’s milk | 89.0 | |||||||||||
| Other tissue | 39.8 | |||||||||||
| Placenta | 80.0 | |||||||||||
| Urine | 93.8 | 94.5 | 96.0 | 95.4 | ||||||||
| 85.3 | ||||||||||||
| Biomarker | 78.0 | 80.4 | ||||||||||
| 74.0 | ||||||||||||
| 72.6 | 76.7 | |||||||||||
| 52.2 | 53.2 | 53.4 | 53.3 | |||||||||
| 55.8 | ||||||||||||
| 67.5 | ||||||||||||
| 49.4 | 51.5 | 54.8 | 54.7 | 54.7 | 53.3 | |||||||
| 63.4 | 72.7 | 69.4 | 66.5 | 66.1 | ||||||||
| Physiological parameter | 76.4 | 80.4 | ||||||||||
| 90.6 | ||||||||||||
| 56.9 | 57.2 | |||||||||||
| 77.8 | 83.0 | |||||||||||
| 86.6 | 88.5 | |||||||||||
| Outdoor air | 86.9 | 87.3 | ||||||||||
| Indoor air | 43.8 | 47.0 | 45.6 | 44.4 | 45.1 | |||||||
| Personal air | 81.9 | 87.8 | 85.6 | 82.1 | ||||||||
| 84.8 | 87.2 | |||||||||||
| Drinking water | 82.3 | 85.0 | ||||||||||
| Dust | 82.0 | 82.4 | ||||||||||
| Food | 81.6 | |||||||||||
| Products | 55.4 | 56.0 | ||||||||||
| Soil | 69.6 | 70.6 | ||||||||||
| 75.5 | ||||||||||||
LBOW: Lemmatized Bag of Words, GR: Grammatical Relations, NE: Named Entities, VC: Verb Clusters, N.Bigrams: Noun Bigrams, MeSH: Medical Subject Headings, Chem: Chemical lists.
Comparison between manual and automatic classification of articles describing measurements of nine chemicals/chemical groups in human blood and milk.
| Compound | Abstracts found with manual PubMed search¹ | Blood: abstracts automatically classified² | Blood: automatic/manual³ | Mother’s milk: abstracts automatically classified² | Mother’s milk: automatic/manual³ |
|---|---|---|---|---|---|
| DDT/DDE | 2050 | 604 | 28/28 | 137 | 5/5 |
| α-, β- & γ-HCH | 699 | 203 | | 70 | 3/3 |
| Mirex | 86 | 46 | 2/2 | 11 | 0/0 |
| PCB | 3331 | 1152 | 30/30 | 232 | 10/10 |
| PCDD/F | 2886 | 480 | 5/5 | 162 | 5/5 |
| PFOS | 561 | 285 | 16/16 | 25 | |
| PBDE | 905 | 251 | 23/23 | 130 | |
| Aldrin/dieldrin | 282 | 59 | 2/2 | 28 | 2/2 |
| Endosulfan | 257 | 37 | 2/2 | 13 | 1/1 |
¹Number of abstracts found with a PubMed search using the chemical name as the search term, restricted to 1 Jan 2000–1 July 2014 and to abstracts indexed with the MeSH term “humans”.
²Number of abstracts automatically classified under the respective node (blood or milk), regardless of whether they met the criteria for inclusion in the report.
³Number of manually selected abstracts that met the criteria for inclusion in the report (manual), and the number of these that were also found among the abstracts automatically classified under each relevant node (automatic).
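The automatic/manual cells can be read as a coverage ratio: of the abstracts a human selected for the report, how many the classifier also placed under the node. The sketch below parses such a cell; the function name `coverage` is hypothetical, and returning `None` for an empty manual set (such as `0/0`) is a design choice made here, not something the source specifies.

```python
def coverage(cell):
    """Parse an 'automatic/manual' cell and return automatic / manual.

    Returns None when no abstracts were manually selected, because the
    ratio is undefined in that case.
    """
    automatic, manual = (int(part) for part in cell.split("/"))
    if manual == 0:
        return None
    return automatic / manual

# Cells copied from the table above (DDT/DDE in blood, Mirex in milk).
print(coverage("28/28"))
print(coverage("0/0"))
```

A ratio of 1.0 means every manually selected abstract was also retrieved by the classifier. It says nothing about how many additional, non-selected abstracts were placed under the same node.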
Fig 4. Publication profiles of exposure information about 4-NP, HCB and lead.
The percentages of the total number of abstracts retrieved from PubMed and considered relevant for the full taxonomy are presented. The total number of abstracts was 130 for 4-NP, 722 for HCB and 7753 for lead.
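A publication profile of this kind reduces to per-node counts divided by the total number of relevant abstracts for the chemical. In the sketch below, the total of 130 is the 4-NP figure quoted above, while the per-node counts and the function name `publication_profile` are hypothetical and purely illustrative.

```python
def publication_profile(node_counts, total_abstracts):
    """Return the percentage of abstracts classified under each node."""
    return {
        node: round(100 * count / total_abstracts, 1)
        for node, count in node_counts.items()
    }

# Hypothetical node counts; only the total (130 for 4-NP) comes from the text.
example_counts = {"Blood": 26, "Urine": 13}
profile = publication_profile(example_counts, 130)
print(profile)
```

Because one abstract can be classified under several nodes, the percentages in a profile need not sum to 100.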
Fig 5. Publication profiles for exposure biomarkers and exposure routes for different phthalate esters.
Fig 6. Publication profiles for effect biomarkers related to exposure to different phthalate esters.
The number of abstracts retrieved by PubMed using a search query versus the number of abstracts classified into the corresponding node in our system.
| PubMed search query | Node in the taxonomy | # of abstracts retrieved by PubMed | # of abstracts classified by our system |
|---|---|---|---|
| | Exposure routes → Inhalation | 149 | 337 |
| | Biomonitoring → Effect biomarker → Biomarker → Gene | 65 | 120 |
| | Biomonitoring → Effect biomarker → Biomarker → Molecule → Protein | 149 | 357 |
| | Biomonitoring → Exposure biomarker → Blood | 407 | 3784 |
| | Biomonitoring → Exposure biomarker → Hair/nail | 24 | 257 |
1These numbers include both true and false positive abstracts.
Performance comparison according to top returned results sample.
Manual evaluation of 20 abstracts retrieved from PubMed using a search query versus automatic classification into the corresponding node in our system.
| PubMed search query | PubMed keyword search: precision | PubMed keyword search: false positive | Our system: precision | Our system: false positive | Sample size |
|---|---|---|---|---|---|
| 7439-92-1 AND inhalation | 50% | 50% | 100% | 0% | 20 |
| 7439-92-1 AND (DNA OR gene) AND biomarker | 35% | 65% | 100% | 0% | 20 |
| 7439-92-1 AND protein AND biomarker | 35% | 65% | 100% | 0% | 20 |
| 7439-92-1 AND blood AND biomarker | 85% | 15% | 95% | 5% | 20 |
| 7439-92-1 AND (hair OR nail) AND biomarker | 60% | 40% | 75% | 20% | 20 |
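With a fixed sample of 20 abstracts per query, each precision value corresponds to a whole number of relevant abstracts. The sketch below shows that arithmetic. The function name is hypothetical, and the relevant-abstract counts are inferred from the reported percentages, not stated in the table.

```python
def precision_pct(relevant_in_sample, sample_size):
    """Precision as a percentage: relevant abstracts / abstracts inspected."""
    return 100 * relevant_in_sample / sample_size

# Counts inferred from the PubMed keyword search column (assumption):
# 10 of 20 gives 50%, 7 of 20 gives 35%, 17 of 20 gives 85%.
for relevant in (10, 7, 17):
    print(relevant, precision_pct(relevant, 20))
```

With only 20 abstracts per query, each abstract moves precision by 5 percentage points, so small differences between rows carry little weight.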