Joël Legrand, Romain Gogdemir, Cédric Bousquet, Kevin Dalleau, Marie-Dominique Devignes, William Digan, Chia-Ju Lee, Ndeye-Coumba Ndiaye, Nadine Petitpain, Patrice Ringot, Malika Smaïl-Tabbone, Yannick Toussaint, Adrien Coulet.
Abstract
Pharmacogenomics (PGx) studies how individual gene variations impact drug response phenotypes, which makes PGx-related knowledge a key component of precision medicine. A significant part of the state-of-the-art knowledge in PGx is accumulated in scientific publications, where it is hardly reusable by humans or software. Natural language processing techniques have been developed to guide the experts who curate this body of knowledge, but existing work is limited by the absence of a high-quality annotated corpus focusing on the PGx domain. In particular, this absence restricts the use of supervised machine learning. This article introduces PGxCorpus, a manually annotated corpus designed to fill this gap and to enable the automatic extraction of PGx relationships from text. It comprises 945 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs and phenotypes) and the relationships between them. In this article, we present the corpus itself, its construction, and a baseline experiment that illustrates how it may be leveraged to synthesize and summarize PGx knowledge.
Year: 2020 PMID: 31896797 PMCID: PMC6940385 DOI: 10.1038/s41597-019-0342-9
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Main characteristics of PGxCorpus in comparison with related corpora.
| Corpus name | Subcorpus | Size (sent.) | Size (rel.) | Sent. w/o rel. (%) | Sent. w/o rel. (nb.) | Dru. | Gen. | Phe. | #Ent. types | #Rel. types | #Mod. | Nested entities | Discont. entities |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SNPPhenA | — | 483 | 1,300 | 0 | 0 | | ✓ | ✓ | 2 | 1 | 5 | | |
| EU-ADR | drug-disease | 244 | 176 | 0 | 0 | ✓ | | ✓ | 2 | 1 | 3 | | |
| EU-ADR | drug-target | 247 | 310 | 0 | 0 | ✓ | ✓ | | 3 | 1 | 3 | | |
| EU-ADR | target-disease | 355 | 262 | 0 | 0 | | ✓ | ✓ | 3 | 1 | 3 | | |
| SemEval DDI | DrugBank | 5,675 | 3,805 | 65.9 | 3,739 | ✓ | | | 4 | 4 | 1 | | |
| SemEval DDI | MEDLINE | 1,301 | 232 | 87.1 | 1,133 | ✓ | | | 4 | 4 | 1 | | |
| ADE-EXT | — | 5,939 | 6,701 | 28.9 | 1,719 | ✓ | | ✓ | 2 | 1 | 1 | | |
| PGxCorpus | — | 945 | 2,871 | 2.7 | 26 | ✓ | ✓ | ✓ | 10 | 7 | 4 | ✓ | ✓ |
Sizes of corpora are reported in terms of number of sentences (sent.) and annotated relationships (rel.). The number of sentences without any annotated relation (Sent. w/o rel.) is reported both as a percentage (%) and as an absolute number of sentences (nb.). The presence of PGx key entities, i.e. drugs (Dru.), genetic factors (Gen.) and phenotypes (Phe.), is reported under the Key entities columns. The overall numbers of entity and relation types used in annotations are reported as #Ent. types and #Rel. types, respectively. #Mod. refers to the number of modalities for the annotation of relations (e.g. positive, hypothetical, negative).
Numbers of entities annotated in PGxCorpus, by type.
| PGxCorpus entity | Simple | Nested | Discont. | Both N&D | Total |
|---|---|---|---|---|---|
| Chemical | 1,512 | 192 | 2 | 12 | 1,718 |
| Genomic_factor | 21 | 68 | 7 | 3 | 99 |
| ↳Gene_or_protein | 1,685 | 20 | 3 | 0 | 1,708 |
| ↳Genomic_variation | 14 | 37 | 3 | 0 | 54 |
| ↳Limited_variation | 237 | 537 | 98 | 47 | 919 |
| ↳Haplotype | 15 | 112 | 4 | 6 | 137 |
| Phenotype | 282 | 330 | 60 | 27 | 699 |
| ↳Disease | 460 | 143 | 14 | 18 | 635 |
| ↳Pharmacodynamic_phenotype | 157 | 390 | 60 | 25 | 632 |
| ↳Pharmacokinetic_phenotype | 31 | 109 | 14 | 6 | 160 |
| Total | 4,414 | 1,938 | 265 | 144 | 6,761 |
Because nested and discontiguous (Discont.) entities are handled differently in our baseline experiments, we report the number of “simple” annotations, i.e. those that are neither nested nor discontiguous. The Nested and Discont. columns count annotations that are only nested or only discontiguous, respectively, and “Both N&D” counts entities that are both nested and discontiguous, so the four columns sum to the Total. Every entity is counted only within its most specific type. An entity that appears several times is counted as many times as it appears.
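The four categories of the table can be illustrated with a minimal Python sketch. It is not the authors' counting script: the `Entity` class, the character offsets and the definition of "nested" (an entity whose overall span contains, or is contained in, the span of another entity of the same sentence) are assumptions made for illustration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical annotation: a type plus one or more text fragments."""
    etype: str
    fragments: tuple  # ((start, end), ...) character offsets, end exclusive

    @property
    def start(self):
        return min(s for s, _ in self.fragments)

    @property
    def end(self):
        return max(e for _, e in self.fragments)


def is_discontiguous(entity):
    # An entity made of several fragments is discontiguous.
    return len(entity.fragments) > 1


def is_nested(entity, sentence_entities):
    # Assumption: "nested" means the overall span contains, or is contained in,
    # the overall span of another entity (gaps between fragments are ignored).
    for other in sentence_entities:
        if other is entity:
            continue
        inside = other.start <= entity.start and entity.end <= other.end
        around = entity.start <= other.start and other.end <= entity.end
        if inside or around:
            return True
    return False


def category(entity, sentence_entities):
    # The four categories are mutually exclusive, as in the table.
    nested = is_nested(entity, sentence_entities)
    discont = is_discontiguous(entity)
    if nested and discont:
        return "both"
    if nested:
        return "nested"
    if discont:
        return "discontiguous"
    return "simple"


# Made-up offsets for one hypothetical sentence.
entities = [
    Entity("Gene_or_protein", ((0, 6),)),
    Entity("Limited_variation", ((0, 8),)),
    Entity("Chemical", ((20, 29),)),
    Entity("Pharmacodynamic_phenotype", ((35, 43), (50, 58))),
]
counts = Counter(category(e, entities) for e in entities)
print(counts)
```

In this toy sentence the gene mention lies inside the variation mention, so both count as nested; the chemical is simple and the two-fragment phenotype is discontiguous.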
Numbers of relations annotated in PGxCorpus, by type.
| Relation type | Number |
|---|---|
| isAssociatedWith | 733 |
| ↳influences | 937 |
| ↳causes | 168 |
| ↳decreases | 263 |
| ↳increases | 243 |
| ↳treats | 238 |
| isEquivalentTo | 293 |
| Total | 2,875 |
Every relationship is only counted once within its most specific type.
Fig. 1 Example of sentence annotated with PGx key and composite entities. The key entities, in red, correspond to entities retrieved by PubTator. Composite entities, in green, were obtained using the PHARE ontology. The syntactic dependency analysis is presented at the bottom of the figure and the entities at the top.
Fig. 2 Two annotated sentences of PGxCorpus. Sentence (a) encompasses a relationship of type influences and of modality hypothetical, denoted by the blue color. Sentence (b) is a title, with two annotated relationships. The first is a relationship of type influences and of modality hypothetical. It is hypothetical because the title states that the paper studies the relation, but not that it is valid. The second relationship is of type causes and annotates a nominal group.
Fig. 3 Overview of the construction of PGxCorpus.
Performances reported for PubTator.
| Entity type | Tool | Evaluated on | P | R | F1 |
|---|---|---|---|---|---|
| Chemicals | Dictionary-based | n/a | n/a | n/a | 53.82 |
| Disease | DNorm | NCBI Disease Corpus | 82.8 | 81.9 | 80.9 |
| Gene | GeneTUKit | n/a | n/a | n/a | 82.97 |
| | | GNAT-100 | 43.0 | 56.7 | 48.9 |
| Mutation | tmVar | MutationFinder Corpus | 98.80 | 89.62 | 93.98 |
PubTator is the NER tool used during the pre-annotation step of PGxCorpus. P, R and F1 stand for Precision, Recall and F1-score, respectively. n/a indicates that we could not find the information needed to fill the cell.
Fig. 4 Types of entities annotated in PGxCorpus and their hierarchy.
Mapping between PubTator entities types, PHARE classes and PGxCorpus entity types.
| Origin | Initial type | Type in PGxCorpus |
|---|---|---|
| PubTator | Chemical | Chemical |
| PubTator | Disease | Disease |
| PubTator | Gene | Gene_or_protein |
| PubTator | Mutation | Limited_variation |
| PHARE | Drug | Chemical |
| PHARE | DrugMetabolite | Chemical |
| PHARE | Gene | Gene_or_protein |
| PHARE | GenomicRegion | Genomic_factor |
| PHARE | GenomicVariation | Genomic_variation |
| PHARE | GeneProduct | Gene_or_protein |
| PHARE | Mutation | Limited_variation |
| PHARE | Phenotype | Phenotype |
Fig. 5 Types of relationships annotated in PGxCorpus and their hierarchy. Types directly related to PGx are marked with Δ, whereas isEquivalentTo and treats have a broader scope.
Type and number of entities recognized by PubTator in the pre-annotation.
| PubTator entity | Number recognized |
|---|---|
| Chemical | 90,816 |
| Disease | 125,487 |
| Gene | 196,460 |
| Mutation | 25,417 |
Number of entities pre-annotated after extending PubTator annotation with the PHARE ontology.
| PHARE entity | Discontiguous | All |
|---|---|---|
| Chemical | 430 | 87,764 |
| Disease | 0 | 29,589 |
| Gene_or_protein | 4,690 | 101,326 |
| Genomic_variation | 8,698 | 13,601 |
| Phenotype | 10,935 | 16,770 |
Because discontiguous entities are excluded from our baseline experiments (see Section Technical Validation), their number is reported separately. No disease entity is discontiguously annotated, for two reasons: PubTator does not generate discontiguous annotations, and the extension of annotations with PHARE (which may be discontiguous) produces phenotype annotations rather than disease annotations.
Inter-annotator agreement (F1-score) for entity annotations.
| Entity matching: (exact or partial) | exact | exact | partial | partial |
|---|---|---|---|---|
| Considering hierarchy: (yes or no) | no | yes | no | yes |
| Chemical | 76.8 | 76.8 | 82.1 | 82.1 |
| Genomic_factor | 38.6 | 72.6 | 38.8 | 85.7 |
| ↳Gene_or_protein | 85.3 | 85.3 | 90.0 | 89.4 |
| ↳Genomic_variation | 32.9 | 49.3 | 53.0 | 76.8 |
| ↳Limited_variation | 50.8 | 50.8 | 69.0 | 66.2 |
| ↳Haplotype | 76.2 | 76.2 | 77.2 | 76.1 |
| Phenotype | 30.5 | 51.0 | 53.9 | 72.6 |
| ↳Disease | 71.3 | 71.0 | 80.9 | 79.1 |
| ↳Pharmacokinetic_phenotype | 48.2 | 48.2 | 57.0 | 57.0 |
| ↳Pharmacodynamic_phenotype | 31.7 | 31.7 | 47.0 | 47.0 |
| | 57.4 | 63.8 | 68.7 | 76.1 |
Four different settings enabling more or less flexibility are presented. The agreement score is computed after the first phase of manual annotation.
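The four settings can be made concrete with a small Python sketch of an agreement F1 between two annotators. This is an illustration under stated assumptions, not the authors' evaluation code: annotations are hypothetical `(start, end, type)` tuples, "partial" matching is taken to mean overlapping spans, "considering hierarchy" is taken to mean that a type also matches any of its ancestors or descendants, and the `PARENT` map is an assumed reading of the entity hierarchy.

```python
# Assumed entity hierarchy (child -> parent); an illustrative reading only.
PARENT = {
    "Gene_or_protein": "Genomic_factor",
    "Genomic_variation": "Genomic_factor",
    "Limited_variation": "Genomic_variation",
    "Haplotype": "Genomic_variation",
    "Disease": "Phenotype",
    "Pharmacodynamic_phenotype": "Phenotype",
    "Pharmacokinetic_phenotype": "Phenotype",
}


def ancestors(etype):
    """Return the type itself followed by all of its ancestors."""
    chain = [etype]
    while etype in PARENT:
        etype = PARENT[etype]
        chain.append(etype)
    return chain


def types_match(a, b, use_hierarchy):
    if a == b:
        return True
    return use_hierarchy and (a in ancestors(b) or b in ancestors(a))


def spans_match(a, b, partial):
    if partial:
        return a[0] < b[1] and b[0] < a[1]  # any character overlap
    return a == b                            # identical boundaries


def agreement_f1(ann_a, ann_b, partial=False, use_hierarchy=False):
    """F1 between two annotators, treating ann_a as the reference."""
    def hit(x, pool):
        return any(
            spans_match(x[:2], y[:2], partial)
            and types_match(x[2], y[2], use_hierarchy)
            for y in pool
        )

    if not ann_a or not ann_b:
        return 0.0
    precision = sum(hit(b, ann_a) for b in ann_b) / len(ann_b)
    recall = sum(hit(a, ann_b) for a in ann_a) / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same sentence by two annotators.
annotator_a = [(0, 6, "Gene_or_protein"), (10, 25, "Disease"), (30, 40, "Chemical")]
annotator_b = [(0, 6, "Gene_or_protein"), (12, 25, "Phenotype"), (30, 40, "Chemical")]

strict = agreement_f1(annotator_a, annotator_b)
relaxed = agreement_f1(annotator_a, annotator_b, partial=True, use_hierarchy=True)
print(strict, relaxed)
```

In this toy case the two annotators disagree on both the boundaries and the specificity of the disease mention, so it only counts as an agreement in the most flexible setting (partial matching with hierarchy).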
Inter-annotator agreement (F1-score) for the annotation of relations. Five different settings are presented.
| Entity matching: | exact | exact | partial | partial | partial |
|---|---|---|---|---|---|
| Considering hierarchies: | none | both | none | both | both |
| Considering direction: | yes | yes | yes | yes | no |
| isAssociatedWith | 12.6 | 14.3 | 13.2 | 33.3 | 33.3 |
| ↳influences | 12.8 | 12.8 | 17.7 | 29.3 | 29.8 |
| ↳causes | 35.8 | 35.2 | 37.6 | 37.2 | 39.6 |
| ↳decreases | 25.8 | 26.8 | 33.6 | 36.7 | 36.7 |
| ↳increases | 14.5 | 15.6 | 27.4 | 30.2 | 30.2 |
| ↳metabolizes | 59.0 | 59.0 | 61.5 | 61.5 | 61.5 |
| ↳transports | 83.1 | 83.1 | 83.1 | 83.1 | 83.1 |
| ↳treats | 33.2 | 34.7 | 36.3 | 37.3 | 37.3 |
| isEquivalentTo | 39.6 | 40.2 | 40.7 | 41.3 | 62.5 |
| | 47.3 | 47.1 | 50.3 | 53.8 | 57.0 |
Agreement (F1-score) between pre- and final annotations.
| Entity matching: | Exact | Exact | Partial | Partial |
|---|---|---|---|---|
| Type matching: | same type | potentially different | same type | potentially different |
| F1-Score (P;R) | 54.2 (66.9;45.5) | 59.0 (72.9;49.6) | 64.4 (79.6;54.1) | 76.2 (94.1;64.0) |
This agreement quantifies the amount of manual correction that was required after the automatic pre-annotation phase. Precision (P) and recall (R) are given in parentheses.
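As a quick consistency check, each F1-score in this table can be recomputed as the harmonic mean of the precision and recall reported next to it. The short Python sketch below does so; the dictionary keys are descriptive labels chosen here, not identifiers from the paper.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F1, P, R) copied from the table, in percent.
reported = {
    "exact, same type": (54.2, 66.9, 45.5),
    "exact, potentially different type": (59.0, 72.9, 49.6),
    "partial, same type": (64.4, 79.6, 54.1),
    "partial, potentially different type": (76.2, 94.1, 64.0),
}

for setting, (f, p, r) in reported.items():
    print(f"{setting}: recomputed {f1(p, r):.1f}, reported {f}")
```

In every setting precision is well above recall, which suggests that annotators more often had to add annotations missed by the pre-annotation than remove wrong ones.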
Performance on the named entity recognition task in terms of F1-score (with its standard deviation in parentheses for the last setting).
| Entity matching: (exact or partial) | exact | exact | partial | partial |
|---|---|---|---|---|
| Considering hierarchy: (yes or no) | no | yes | no | yes |
| Chemical | 76.07 | 76.07 | 82.67 | 82.67 (7.24) |
| Genomic_factor | 22.86 | 71.41 | 27.68 | 83.19 (5.90) |
| ↳Gene_or_protein | 85.72 | 85.72 | 90.58 | 90.05 (3.89) |
| ↳Genomic_variation | 2.67 | 49.13 | 3.83 | 71.18 (9.55) |
| ↳Limited_variation | 47.08 | 47.02 | 72.71 | 71.57 (9.50) |
| ↳Haplotype | 66.97 | 66.97 | 72.47 | 72.47 (19.34) |
| Phenotype | 31.76 | 50.80 | 48.48 | 69.57 (5.40) |
| ↳Disease | 66.90 | 66.88 | 75.68 | 72.59 (7.30) |
| ↳Pharmacokinetic_phenotype | 29.30 | 29.30 | 36.47 | 36.27 (19.40) |
| ↳Pharmacodynamic_phenotype | 38.54 | 38.50 | 58.84 | 58.18 (10.11) |
| | 49.15 | 59.11 | 59.76 | 71.93 (5.64) |
Balance between precision and recall, as well as details on standard deviations are provided in Supplementary Table S1.
Performances of the task of relation extraction in terms of F1-score (and standard deviation).
| Considering hierarchies: (yes or no) | no | yes |
|---|---|---|
| isAssociatedWithΔ | 30.89 | 51.71 (4.02) |
| ↳influencesΔ | 36.55 | 46.45 (5.17) |
| ↳causesΔ | 41.91 | 41.91 (13.35) |
| ↳decreasesΔ | 29.47 | 29.47 (9.85) |
| ↳increasesΔ | 17.94 | 17.94 (15.20) |
| ↳treats | 39.97 | 39.97 (12.60) |
| isEquivalentTo | 79.76 | 79.76 (7.69) |
| | 45.67 | 49.56 (4.51) |
The last line provides results of an experiment for which only one category is considered, merging all the types specific to PGx (marked with Δ). For leaves, performances are unchanged when considering the hierarchy. Balance between precision and recall, as well as details on standard deviations are provided in Supplementary Table S2.
| Measurement(s) | gene_variant • response to drug • textual entity • chemical entity • haplotype • gene • Pharmacogenomics • Pharmacogenetics • abbreviation textual entity • Pharmacokinetics • Pharmacodynamics • phenotype |
| Technology Type(s) | digital curation |
| Sample Characteristic - Organism | Homo sapiens |