Nastassja A Lewinski, Ivan Jimenez, Bridget T McInnes.
Abstract
A vast amount of data on nanomedicines is being generated and published, and natural language processing (NLP) approaches can automate the extraction of unstructured text-based data. Annotated corpora are a key resource for NLP and information extraction methods which employ machine learning. Although corpora are available for pharmaceuticals, resources for nanomedicines and nanotechnology are still limited. To foster nanotechnology text mining (NanoNLP) efforts, we have constructed a corpus of annotated drug product inserts taken from the US Food and Drug Administration's Drugs@FDA online database. In this work, we present the development of the Engineered Nanomedicine Database corpus to support the evaluation of nanomedicine entity extraction. The data were manually annotated for 21 entity mentions consisting of nanomedicine physicochemical characterization, exposure, and biologic response information of 41 Food and Drug Administration-approved nanomedicines. We evaluate the reliability of the manual annotations and demonstrate the use of the corpus by evaluating two state-of-the-art named entity extraction systems, OpenNLP and Stanford NER. The annotated corpus is available open source and, based on these results, guidelines and suggestions for future development of additional nanomedicine corpora are provided.
Keywords: corpora; informatics; nanotechnology; natural language processing; text mining
Year: 2017 PMID: 29066897 PMCID: PMC5644562 DOI: 10.2147/IJN.S137117
Source DB: PubMed Journal: Int J Nanomedicine ISSN: 1176-9114
US FDA-approved nanomedicines from the year 1975 to 2013
| Platform | Drug |
|---|---|
| Conjugate | |
| Antibody–drug | Adcetris®, Bexxar®, Kadcyla®, Zevalin® |
| Polymer–aptamer | Macugen® |
| Polymer–protein | Adagen®, Cimzia®, Krystexxa®, Mircera®, Neulasta®, Oncaspar®, Pegasys®, PEG-Intron®, Somavert® |
| Protein–drug | Abraxane®, Ontak® |
| Lipid | |
| Liposome | Abelcet®, AmBisome®, Amphotec®, DaunoXome®, DepoCyt®, DepoDur®, Diprivan®, Doxil®, Marqibo®, Visudyne® |
| Micelle | Estrasorb™, Taxotere® |
| Nanocrystal | Emend®, Megace ES®, Rapamune®, TriCor®, TriGlide® |
| Nanoparticle | |
| Iron | Feraheme®, Ferrlecit®, Venofer® |
| Polymer | Copaxone®, Eligard®, Renagel®, Welchol® |
Abbreviation: FDA, US Food and Drug Administration.
Extracted nanomedicine entities
| Class | Entity |
|---|---|
| Nanomedicine description | Company |
| | FDA approval date |
| | Trade name |
| | US patent |
| Nanoparticle physicochemical characterization | Active ingredient |
| | Core composition |
| | Molecular weight |
| | Nanoparticle |
| | Particle diameter |
| | Surface coating |
| Exposure | Dose |
| | Route of administration |
| Pharmacokinetics | AUC |
| | Clearance |
| | Cmax |
| | Elimination half-life |
| | Plasma half-life |
| | Tmax |
| | Volume of distribution |
| Biologic response | Adverse reaction |
| | Indication |
Abbreviations: AUC, area under the curve; Cmax, maximum concentration measured in blood; FDA, US Food and Drug Administration; Tmax, time to reach Cmax.
Figure 1. Annotated ferumoxytol drug product label using GATE.
Abbreviation: GATE, General Architecture for Text Engineering.
Entity definitions contained in the annotation guidelines
| Class | Entity | Description |
|---|---|---|
| Nanomedicine description | Company | Company names, including the drug manufacturer and distributor. When annotating, include any of the following abbreviations (eg, co., corp., inc., LLC) |
| | FDA approval date | The year the nanomedicine was approved for clinical use by the US FDA |
| | Trade name | The trademark name of the nanomedicine. When annotating, do not include the registered trademark symbol |
| | US patents | The US patent number(s) associated with the nanomedicine |
| Nanoparticle physicochemical characterization | Active ingredient | The chemical composition of the agent that is providing the pharmacologic effect |
| | Core composition (NPO_1808) | The chemical composition of the nanoparticle |
| | Molecular weight (NPO_1171) | The size of the nanomedicine or components in kilodaltons or other units based on daltons |
| | Nanoparticle (NPO_707) | The generic name of the nanomedicine (eg, ferumoxytol), the type of nanomedicine (eg, antibody–drug conjugate, liposome, lipid complex), or the written description of the nanomedicine (eg, paclitaxel formulated as albumin-bound nanoparticles). Part of speech variants (eg, liposomal vs liposome) should also be annotated |
| | Particle diameter (NPO_1539) | The size of the nanomedicine or components in nanometers or other units based on meters |
| | Surface coating (NPO_1962) | The chemical composition (eg, polyethylene glycol [PEG]) of the surface coating of the nanomedicine. When annotating, include the abbreviations |
| Pharmacokinetics | AUC (NPO_1523) | Area under the curve. The total drug concentration over time |
| | Clearance (NPO_1525) | The volume of blood from which a drug is irreversibly cleared |
| | Cmax (NPO_1527) | The maximum concentration measured in the blood |
| | Elimination half-life (NPO_1522) | The time at which half of the administered dose remains in the body |
| | Plasma half-life (NPO_1589) | The time at which half of the maximum concentration of the drug (systemically available) remains in the plasma. Also referred to as terminal half-life |
| | Tmax (NPO_1528) | The time to reach Cmax |
| | Volume of distribution (NPO_1524) | The theoretical volume of the compartment the drug appears to fill as related to the concentration measured in the blood. Vd = dose/Cmax |
| Exposure | Dose | The administered mass, volume, and/or concentration of the nanomedicine or other described drugs. Annotations should include units (eg, 5 mg) |
| | Route of administration | The method by which the nanomedicine is administered to patients. Possible routes of administration include: dermal (skin), SC, oral (by mouth), IM, IT, IV, intravitreal |
| Biologic response | Adverse reaction | Nontherapeutic/off-target/side effects or toxic injury due to taking the nanomedicine |
| | Indication | The disease(s) that the nanomedicine is used to detect, treat, or prevent |
Abbreviations: Cmax, maximum concentration measured in blood; FDA, US Food and Drug Administration; IM, intramuscular; IT, intrathecal; IV, intravenous; PEG, polyethylene glycol; SC, subcutaneous; Vd, volume of distribution; NPO, NanoParticle Ontology.
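The guideline entry for volume of distribution gives the relation Vd = dose/Cmax. A minimal Python sketch of that arithmetic follows; the function name, the units (mg and mg/L), and the example numbers are illustrative assumptions and are not taken from any drug label in the corpus.

```python
def volume_of_distribution(dose_mg: float, cmax_mg_per_l: float) -> float:
    """Apparent volume of distribution in litres, using Vd = dose / Cmax.

    Units are an assumption for this sketch: dose in mg, Cmax in mg/L.
    """
    if cmax_mg_per_l <= 0:
        raise ValueError("Cmax must be positive")
    return dose_mg / cmax_mg_per_l


# Hypothetical values chosen only to show the calculation.
vd_litres = volume_of_distribution(dose_mg=500.0, cmax_mg_per_l=200.0)
print(f"Vd = {vd_litres} L")  # prints "Vd = 2.5 L"
```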
Summary of corpus text structure
| Metric | Corpus | Liposomes |
|---|---|---|
| Number of inserts | 41 | 10 |
| Number of annotations | 22,033 | 4,520 |
| Average number of annotations per insert | 537 | 468 |
| Average number of sentences | 690 | 542 |
| Average number of words | 11,363 | 8,728 |
| Time span, year | 1975–2013 | 1989–2012 |
Statistics on the 21 annotated entities
| Class | Entity | No. of mentions | No. of unique mentions | No. of labels included |
|---|---|---|---|---|
| Nanomedicine description | Company | 197 | 69 | 41 |
| | FDA approval date | 34 | 19 | 41 |
| | Trade name | 6,716 | 41 | 41 |
| | US patent | 31 | 31 | 8 |
| Physicochemical characterization | Active ingredient | 2,161 | 61 | 41 |
| | Core composition | 89 | 26 | 16 |
| | Molecular weight | 50 | 40 | 34 |
| | Nanoparticle | 854 | 42 | 41 |
| | Particle diameter | 7 | 6 | 6 |
| | Surface coating | 62 | 11 | 15 |
| Pharmacokinetic parameters | AUC | 47 | 46 | 19 |
| | Clearance | 49 | 46 | 24 |
| | Cmax | 45 | 42 | 20 |
| | Elimination half-life | 16 | 15 | 11 |
| | Plasma half-life | 56 | 53 | 13 |
| | Tmax | 30 | 18 | 14 |
| | Volume of distribution | 29 | 27 | 19 |
| Exposure | Dose | 2,283 | 542 | 41 |
| | Route of administration | 1,192 | 20 | 41 |
| Biologic response | Adverse reaction | 6,689 | 1,773 | 41 |
| | Indication | 1,396 | 162 | 41 |
Abbreviations: AUC, area under the curve; Cmax, maximum concentration measured in blood; FDA, US Food and Drug Administration; Tmax, time to reach Cmax.
Annotation agreement between student and expert annotator
| Class | Entity | Precision | Recall | F-measure |
|---|---|---|---|---|
| Nanomedicine description | Company | 0.96 | 0.46 | 0.62 |
| | FDA approval date | 0.97 | 1 | 0.98 |
| | Trade name | 0.99 | 1 | 1 |
| | US patent | 1 | 1 | 1 |
| Physicochemical characterization | Active ingredient | 0.89 | 0.79 | 0.84 |
| | Molecular weight | 0.89 | 0.69 | 0.78 |
| | Nanoparticle | 0.64 | 0.42 | 0.51 |
| | Particle diameter | 1 | 0.71 | 0.83 |
| | Surface coating | 0.48 | 0.27 | 0.34 |
| Pharmacokinetic parameters | AUC | 0.82 | 0.47 | 0.60 |
| | Clearance | 0.74 | 0.65 | 0.69 |
| | Cmax | 0.91 | 0.63 | 0.74 |
| | Elimination half-life | 0.80 | 0.73 | 0.76 |
| | Plasma half-life | 1 | 0.75 | 0.86 |
| | Tmax | 0.91 | 0.56 | 0.69 |
| | Volume of distribution | 0.91 | 0.87 | 0.89 |
| Exposure | Dose | 0.86 | 0.32 | 0.46 |
| | Route of administration | 0.95 | 0.49 | 0.65 |
| Biologic response | Adverse reaction | 0.96 | 0.06 | 0.11 |
| | Indication | 0.98 | 0.53 | 0.69 |
| Total | | | | |
Abbreviations: AUC, area under the curve; Cmax, maximum concentration measured in blood; FDA, US Food and Drug Administration; Tmax, time to reach Cmax.
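The table reports an F-measure alongside precision and recall but does not state the formula. The sketch below assumes the balanced F-measure (harmonic mean of precision and recall) and checks it against three rows copied from the table. Because the published precision and recall are already rounded to two decimals, agreement is only expected to within about 0.01.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the table above.
rows = {
    "Company": (0.96, 0.46, 0.62),
    "Volume of distribution": (0.91, 0.87, 0.89),
    "Adverse reaction": (0.96, 0.06, 0.11),
}

for entity, (p, r, reported) in rows.items():
    computed = f_measure(p, r)
    print(f"{entity}: computed {computed:.2f}, reported {reported:.2f}")
```

The adverse reaction row shows how the harmonic mean behaves: a precision of 0.96 cannot compensate for a recall of 0.06, so the F-measure stays close to the lower of the two values.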
Annotation agreement between student annotators
| Class | Entity | Precision | Recall | F-measure |
|---|---|---|---|---|
| Nanomedicine description | FDA approval date | 1 | 0.97 | 0.99 |
| | Trade name | 1 | 1 | 1 |
| | US patent | 1 | 1 | 1 |
| Physicochemical characterization | Active ingredient | 0.80 | 0.89 | 0.84 |
| | Molecular weight | 0.93 | 0.65 | 0.77 |
| | Nanoparticle | 0.63 | 0.68 | 0.65 |
| | Particle diameter | 1 | 0.43 | 0.60 |
| Exposure | Dose | 0.99 | 0.56 | 0.72 |
| | Route of administration | 1 | 0.67 | 0.80 |
| Biologic response | Indication | 0.77 | 0.94 | 0.85 |
| Total | | | | |
Abbreviation: FDA, US Food and Drug Administration.
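Agreement figures of this kind are commonly obtained by treating one annotator's spans as the reference and scoring the other annotator against them. The sketch below assumes exact matching on character offsets and entity label; the matching criterion actually used for these tables is not stated here, and the `Span` type, the `agreement` function, and the example offsets are hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    """One annotation: character offsets plus entity label (hypothetical layout)."""
    start: int
    end: int
    label: str


def agreement(reference: set[Span], candidate: set[Span]) -> tuple[float, float, float]:
    """Precision, recall, and F-measure of candidate spans against reference spans.

    A candidate span counts as correct only if an identical span
    (same offsets, same label) exists in the reference set.
    """
    matched = len(reference & candidate)
    precision = matched / len(candidate) if candidate else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Made-up offsets: the reference annotator marked three spans, the second found two of them.
reference = {
    Span(0, 8, "Trade name"),
    Span(20, 26, "Dose"),
    Span(40, 51, "Route of administration"),
}
candidate = {Span(0, 8, "Trade name"), Span(20, 26, "Dose")}

p, r, f = agreement(reference, candidate)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Under this scheme an annotator who marks few spans, all of them correct, gets high precision and low recall, which is the pattern seen for several entities in the agreement tables.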
F-measure of state-of-the-art NER systems
| Class | Entity | No. of mentions | OpenNLP | Stanford NER |
|---|---|---|---|---|
| Nanomedicine description | Company | 197 | 0.65 | 0.74 |
| | Trade name | 6,716 | 0.72 | 0.81 |
| Physicochemical characterization | Active ingredient | 2,161 | 0.59 | 0.77 |
| | Core composition | 89 | 0.23 | 0.27 |
| | Molecular weight | 50 | 0.58 | 0.84 |
| | Nanoparticle | 854 | 0.65 | 0.82 |
| | Surface coating | 62 | 0.43 | 0.59 |
| Pharmacokinetic parameters | AUC | 47 | 0.26 | 0.41 |
| | Clearance | 49 | 0.35 | 0.31 |
| | Cmax | 45 | 0.45 | 0.50 |
| | Plasma half-life | 56 | 0.47 | 0.67 |
| Exposure | Dose | 2,283 | 0.54 | 0.68 |
| | Route of administration | 1,192 | 0.67 | 0.78 |
| Biologic response | Adverse reaction | 6,689 | 0.10 | 0.12 |
| | Indication | 1,396 | 0.51 | 0.64 |
Abbreviations: AUC, area under the curve; Cmax, maximum concentration measured in blood; NER, named entity recognition; NLP, natural language processing.