Ferran Gonzalez Hernandez, Simon J Carter, Juha Iso-Sipilä, Paul Goldsmith, Ahmed A Almousa, Silke Gastine, Watjana Lilaonitkul, Frank Kloprogge, Joseph F Standing.
Abstract
Pharmacokinetic (PK) predictions of new chemical entities are aided by prior knowledge from other compounds. The development of robust algorithms that improve preclinical and clinical phases of drug development remains constrained by the need to search, curate and standardise PK information across the constantly growing scientific literature. The lack of centralised, up-to-date and comprehensive repositories of PK data represents a significant limitation in the drug development pipeline.
In this work, we propose a machine learning approach to automatically identify and characterise scientific publications reporting PK parameters from in vivo data, providing a centralised repository of PK literature. A dataset of 4,792 PubMed publications was labelled by field experts depending on whether in vivo PK parameters were estimated in the study. Different classification pipelines were compared using a bootstrap approach, and the best-performing architecture was used to develop a comprehensive and automatically updated repository of PK publications. The best-performing architecture encoded documents using unigram features and mean pooling of BioBERT embeddings, obtaining an F1 score of 83.8% on the test set. The pipeline retrieved over 121K PubMed publications in which in vivo PK parameters were estimated, and it was scheduled to perform weekly updates on newly published articles. All the relevant documents were released through a publicly available web interface (https://app.pkpdai.com) and characterised by the drugs, species and conditions mentioned in the abstract, to facilitate the subsequent search of relevant PK data. This automated, open-access repository can be used to accelerate the search and comparison of PK results, curate ADME datasets, and facilitate subsequent text mining tasks in the PK domain.
Keywords: Bioinformatics; Information extraction; Machine Learning; Natural Language Processing; Pharmacokinetics; Pharmacometrics; Text mining
Year: 2021; PMID: 34381873; PMCID: PMC8343403; DOI: 10.12688/wellcomeopenres.16718.1
Source DB: PubMed; Journal: Wellcome Open Res; ISSN: 2398-502X
Summary statistics reporting the percentage of documents in which a particular field was available, and the proportion of papers labelled as Relevant and Not Relevant.
The statistics are reported for both training and final test sets.
| Field | Training (%) | Final test (%) |
|---|---|---|
| Title | 100 | 100 |
| Abstract | 87.17 | 87.67 |
| Authors | 99.44 | 99.63 |
| Journal | 100 | 100 |
| Publication Type | 100 | 100 |
| Keywords | 15.41 | 16.125 |
| MeSH terms | 97.67 | 98.25 |
| Chemicals | 93.86 | 94.13 |
| Affiliations | 80.94 | 79.125 |
| Label | | |
| Relevant | 19.81 | 20.25 |
| Not Relevant | 80.19 | 79.75 |
Figure 1. A) Bootstrap procedure to compare the effect of different features during field selection, n-grams and distributed representations analyses. B) The best-performing features from previous analyses were selected to compare different hyperparameter combinations with 5-fold cross validation. Finally, the best-performing features and hyperparameters were used to apply the pipeline to the final test set.
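The way bootstrap iterations turn into a median and a 95% interval can be illustrated with a small sketch. The source does not state how each resample was drawn or whether models were retrained per iteration, so the version below makes a simplifying assumption: it resamples a fixed set of predictions with replacement and summarises the resulting F1 scores with percentiles. The function names and the toy labels are hypothetical.

```python
import numpy as np


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when undefined."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def bootstrap_f1(y_true, y_pred, n_iterations=200, seed=0):
    """Median and 95% percentile interval of F1 over bootstrap resamples.

    Assumption: documents are resampled with replacement from a fixed
    evaluation set; the original study may have resampled differently.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    scores = np.empty(n_iterations)
    for i in range(n_iterations):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        scores[i] = f1_score(y_true[idx], y_pred[idx])
    lower, median, upper = np.percentile(scores, [2.5, 50, 97.5])
    return float(median), (float(lower), float(upper))


# Toy labels (hypothetical): 1 = Relevant, 0 = Not Relevant.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
median, (lower, upper) = bootstrap_f1(toy_true, toy_pred)
print(f"F1 median={median:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

Comparing pipelines on the same resamples (same seed) keeps the comparison paired, which is one reason to fix the random generator.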
Figure 2. Example of the approaches used to generate distributed representations for an input title after BioBERT encoding.
The same procedure was applied for tokens in the abstract.
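The pooling strategies named in the results (mean pooling, and mean combined with min and max pooling) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: random numbers stand in for BioBERT token embeddings, a hidden size of 768 is assumed, and joining the mean, min and max vectors by concatenation is an assumption about how "mean + min&max pooling" was built.

```python
import numpy as np


def mean_pooling(token_embeddings):
    """Average the token vectors into one document vector."""
    return token_embeddings.mean(axis=0)


def mean_min_max_pooling(token_embeddings):
    """Concatenate element-wise mean, min and max over tokens.

    Assumption: the three pooled vectors are concatenated, so the
    result is three times the hidden size.
    """
    return np.concatenate([
        token_embeddings.mean(axis=0),
        token_embeddings.min(axis=0),
        token_embeddings.max(axis=0),
    ])


# Stand-in for encoder output: 12 title tokens x 768 dimensions.
rng = np.random.default_rng(0)
title_tokens = rng.normal(size=(12, 768))

print(mean_pooling(title_tokens).shape)          # (768,)
print(mean_min_max_pooling(title_tokens).shape)  # (2304,)
```

Both functions return a fixed-length vector regardless of the number of tokens, which is what lets titles and abstracts of different lengths feed the same classifier.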
Hyperparameters tuned during cross-validation and their default values.
The range represents the different values tested for each hyperparameter in the grid-search procedure. The step describes how each value is obtained from the previous one between the start and stop values: ×2 denotes doubling and +1/3 denotes an additive increment of 1/3.
| Parameter | Range (start, stop, step) | Default value |
|---|---|---|
| | (2, 512, ×2) | 20 |
| | (2, 64, ×2) | 4 |
| | (1/3, 1, +1/3) | 1 |
| | Early stopping | - |
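To make the (start, stop, step) notation concrete, the sketch below expands the three numeric ranges from the table into explicit value lists and counts the resulting grid. The parameter names are not recoverable from the table, so `param_a`, `param_b` and `param_c` are hypothetical placeholders, and treating the grid as a full Cartesian product is an assumption.

```python
from fractions import Fraction
from itertools import product


def expand_range(start, stop, step, multiplicative):
    """List the values from start to stop (inclusive).

    multiplicative=True multiplies by step each time (the 'x2' ranges);
    multiplicative=False adds step each time (the '+1/3' range).
    """
    if multiplicative and step <= 1:
        raise ValueError("multiplicative step must be greater than 1")
    if not multiplicative and step <= 0:
        raise ValueError("additive step must be positive")
    values = []
    value = start
    while value <= stop:
        values.append(value)
        value = value * step if multiplicative else value + step
    return values


# Placeholder names: the real parameter names are missing from the table.
search_space = {
    "param_a": expand_range(2, 512, 2, multiplicative=True),
    "param_b": expand_range(2, 64, 2, multiplicative=True),
    # Fractions avoid floating-point drift when adding 1/3 repeatedly.
    "param_c": expand_range(Fraction(1, 3), 1, Fraction(1, 3), multiplicative=False),
}

grid = list(product(*search_space.values()))
for name, values in search_space.items():
    print(name, [str(v) for v in values])
print("combinations:", len(grid))
```

Under these assumptions the three ranges give 9, 6 and 3 values respectively, so an exhaustive grid would evaluate 162 combinations per cross-validation fold.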
Summary table with performance metrics reported as median (95% CI) and F1 interquartile variance (IQV) after 200 bootstrap iterations.
The performance metrics are compared across pipelines using different fields from PubMed entries.
| Pipeline | Precision (%) | Recall (%) | F1 score (%) | F1 IQV |
|---|---|---|---|---|
| Title | 65.3 (59.0,72.7) | 65.8 (55.0,72.8) | 65.0 (59.5,71.0) | 11.5 |
| Abstract | 77.0 (69.8,82.6) | 79.8 (73.4,86.1) | 78.2 (73.6,82.6) | 9.0 |
| Authors* | 76.4 (69.4,82.6) | 80.4 (72.8,86.1) | 78.2 (73.3,82.6) | 9.3 |
| Journal* | 76.4 (70.2,82.2) | 79.8 (72.8,85.4) | 78.0 (73.6,82.0) | 8.4 |
| Publication Type* | 78.0 (71.6,84.3) | 81.6 (74.7,87.4) | 79.6 (75.5,84.2) | 8.7 |
| Keywords* | 76.6 (70.2,83.0) | 80.4 (72.8,85.5) | 78.2 (73.8,82.2) | 8.4 |
| MeSH terms* | 79.2 (72.1,85.2) | 79.8 (72.8,86.1) | 79.5 (74.3,83.3) | 9.0 |
| Chemicals* | 76.0 (69.5,81.9) | 80.4 (73.4,86.1) | 77.8 (73.0,82.0) | 9.0 |
| Affiliations* | 76.6 (69.8,82.1) | 80.4 (72.8,86.7) | 78.3 (73.2,81.9) | 8.7 |
| All fields | | 81.6 (74.1,87.4) | 80.5 (75.7,84.9) | 9.2 |
| Optimal Fields** | | | | 9.4 |
*Tokens from the title and abstract were also included when encoding this field.
**The optimal fields were the title, abstract, MeSH terms and publication type.
Figure 3. Distribution of F1 scores for the different features used in the field selection analysis after 200 bootstrap iterations.
The fields Chemicals, Journal, Authors, Keywords, Affiliations, MeSH terms and Publication Type were encoded together with the title and abstract tokens. The Optimal Fields include the title, abstract, MeSH terms and Publication Type.
Summary table with performance metrics reported as median (95% CI) and F1 interquartile variance (IQV) after 200 bootstrap iterations.
The performance metrics are compared across pipelines using different n-grams from the optimal fields.
| Pipeline | Precision (%) | Recall (%) | F1 score (%) | F1 IQV |
|---|---|---|---|---|
| Unigrams | 80.1 (73.9,86.0) | | | 9.4 |
| Bigrams | 79.9 (72.2,86.9) | 81.6 (74.1,88.0) | | 8.6 |
| Trigrams | | 81.0 (73.4,88.0) | | 7.9 |
Figure 4. Distribution of F1 scores for the n-grams analysis after 200 bootstrap iterations.
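For readers unfamiliar with n-gram bag-of-words features, a minimal sketch follows. The tokenisation used in the study is not described here, so lowercasing and whitespace splitting are simplifications, and the choice to let a bigram or trigram pipeline also include the shorter n-grams is an assumption. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=1):
    """Count every n-gram of length 1..max_n in the text.

    Assumptions: whitespace tokenisation, and cumulative n-grams
    (max_n=2 counts unigrams and bigrams together).
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


title = "Population pharmacokinetics of vancomycin in neonates"
print(bag_of_ngrams(title, max_n=1))
print(bag_of_ngrams(title, max_n=2))
```

Longer n-grams capture phrases such as "population pharmacokinetics" at the cost of a much larger and sparser feature space, a trade-off that can explain why they yield little or no gain over unigrams.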
Summary table with performance metrics reported as median (95% CI) and F1 interquartile variance (IQV) after 200 bootstrap iterations.
The performance metrics are compared across pipelines using different distributed document representations.
| Pipeline | Precision (%) | Recall (%) | F1 score (%) | F1 IQV |
|---|---|---|---|---|
| SPECTER | 74.1 (66.5,80.9) | 69.0 (62.0,76.6) | 71.2 (66.2,75.8) | 9.6 |
| BioBERT mean pooling | 78.1 (69.0,85.4) | 75.3 (68.3,82.9) | 76.6 (71.6,81.3) | 9.7 |
| BioBERT mean + min&max pooling | | | | 8.7 |
Figure 5. Distribution of F1 scores for the distributed analysis after 200 bootstrap iterations.
Summary table with performance metrics reported as median (95% CI) and F1 interquartile variance (IQV) after 200 bootstrap iterations.
The performance metrics are compared across pipelines using BoW together with distributed representations.
| Pipeline | Precision (%) | Recall (%) | F1 score (%) | F1 IQV |
|---|---|---|---|---|
| Unigrams | 80.1 (73.9,86.0) | | 80.6 (75.8,85.2) | 9.4 |
| Unigrams + BioBERT mean pooling | 83.7 (76.7,89.1) | 80.4 (74.1,87.3) | | 8.2 |
| Unigrams + BioBERT mean + min&max pooling | | 79.1 (73.4,85.4) | 81.0 (77.2,85.4) | 8.2 |
Figure 6. F1 score distributions for the pipelines using unigrams together with BioBERT embeddings.
Performance metrics of the final pipeline on the test set.
| Precision (%) | Recall (%) | F1 score (%) | Accuracy (%) |
|---|---|---|---|
| 84.8 | 82.8 | 83.8 | 93.2 |
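As a quick consistency check on the test-set figures, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The helper name below is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported test-set precision and recall, in percent.
f1 = f1_from_precision_recall(84.8, 82.8)
print(round(f1, 1))  # 83.8
```

The recomputed value agrees with the reported 83.8%. Accuracy is considerably higher (93.2%) because roughly 80% of the documents are Not Relevant, so correct negatives raise accuracy without affecting F1.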