Roopal Bhatnagar, Sakshi Sardar, Maedeh Beheshti, Jagdeep T Podichetty.
Abstract
Objective: To summarize applications of natural language processing (NLP) in model-informed drug development (MIDD) and identify potential areas of improvement.
Keywords: NLP; deep learning; drug development; machine learning
Year: 2022 PMID: 35702625 PMCID: PMC9188322 DOI: 10.1093/jamiaopen/ooac043
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. Entire review process workflow. The review process was divided into 2 parts: (1) a review of applications of NLP in the MIDD space and (2) a technical review of state-of-the-art methods for implementing the NLP functionalities most used in the MIDD space.
Figure 2. Process flow for the NLP libraries inventory. The figure describes the review process followed for developing the “NLP libraries inventory for drug discovery and development.” A total of 47 libraries were identified from Google Scholar resources. Of these, 7 speech-processing libraries were excluded from further screening. Of the remaining 40 libraries, 20 were found to be used in biomedical or biochemical applications. The websites, GitHub repositories, and publications for these libraries were reviewed, and the libraries were analyzed for the presence or absence of 14 features. These features were selected based on the most used NLP functionalities in the drug discovery and development space.
NLP libraries for MIDD
| Library name | Programming language | Pretrained neural network models | Word embeddings | Multi-language support | Tokenization | Part-of-speech tagging | Stemming/lemmatization | Named entity recognition | Entity resolution | Sentiment analysis | Relation extraction | Assertion status detection | Topic modeling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Spacy | Python | x | x | x | x | x | x | x | x | x | |||
| Gensim | Python | x | x | x | x | x | x | x | |||||
| NLTK | Python | x | x | x | x | x | x | x | x | ||||
| CoreNLP | Java | x | x | x | x | x | x | x | x | x | |||
| Scispacy | Python | x | x | x | x | x | x | x | x | x | |||
| SparkNLP | Python, Java, Scala, R | x | x | x | x | x | x | x | x | x | x | x | x |
| SparkNLP for healthcare | Python, Java, Scala, R | x | x | x | x | x | x | x | x | x | x | ||
| Torchtext | Python | x | x | x | x | ||||||||
| KoRpus | R | x | x | x | |||||||||
| Tensorflow | Python | x | x | x | x | x | x | x | x | x | x | ||
| Scikit learn | Python | x | x | x | |||||||||
| Textblob | Python | x | x | x | x | ||||||||
| Pattern | Python, R | x | x | x | x | x | |||||||
| Hugging face | Python | x | x | x | x | x | |||||||
| Allen NLP | Python | x | x | x | x | x | x | x | x | x | |||
| Fasttext | Python | x | x | x | x | x | x | ||||||
| Stanza | Python | x | x | x | x | x | x | x | |||||
| Flair | Python | x | x | x | x | x | x | ||||||
| Fastai | Python | x | x | x | x | ||||||||
| Spacyr | R | x | x | x | x | x | x | ||||||
NLP models for MIDD
| Model | Full form | Pretrained on | Architecture | Built on | Performance | Release date |
|---|---|---|---|---|---|---|
| BioBERT | Bio-Bidirectional Encoder Representations from Transformers | PubMed and PMC | Transformer | BERT | Outperforms state-of-the-art (SOTA) for named entity recognition, relation extraction, and question answering | September 2019 |
| SciBERT | Science-Bidirectional Encoder Representations from Transformers | Semantic Scholar | Transformer | BERT | Outperforms SOTA for named entity recognition, relation extraction, and patient enrollment task | November 2019 |
| ClinicalBERT | Clinical Bidirectional Encoder Representations from Transformers | MIMIC III | Transformer | BERT | Outperforms deep language model for clinical prediction | November 2020 |
| BioClinicalBERT | Bio-Clinical Bidirectional Encoder Representations from Transformers | MIMIC III | Transformer | BioBERT | Outperforms BERT and BioBERT on named entity recognition and natural language inference | June 2019 |
| BioMed-RoBERTa | BioMedical Robustly optimized Bidirectional Encoder Representations from Transformers | Semantic Scholar | Transformer | RoBERTa | Outperforms RoBERTa on text classification, relation extraction, and named entity recognition | May 2020 |
| Bio Discharge Summary BERT | Bio Discharge Summary Bidirectional Encoder Representations from Transformers | MIMIC III discharge summaries | Transformer | BioBERT | Outperforms BERT and BioBERT on named entity recognition and natural language inference | June 2019 |
| BioALBERT | Bio-A Lite Bidirectional Encoder Representations from Transformers | PubMed, PMC, MIMIC III | Transformer | ALBERT | Outperforms SOTA for named entity recognition, relation extraction, question answering, sentence similarity, and document classification | July 2021 |
| ChemBERTa | Chem-Bidirectional Encoder Representations from Transformers | PubChem | Transformer | RoBERTa | Outperforms baseline on one task of molecular property prediction | October 2020 |
Relevant NLP key concepts
| NLP concept | Definition | Methodology | Biomedical or biochemical applications | MIDD-specific open-source resources |
|---|---|---|---|---|
| Word embedding | A class of techniques in which individual words are represented as real-valued vectors, often with tens or hundreds of dimensions, in a predefined vector space. | It uses language models and feature extraction methods to map words to vectors capturing their context and meaning. Generic pre-trained models include GloVe, among others. | In biomedical NLP, word embeddings serve as feature input to downstream ML or DL models. Textual resources such as EHRs, clinical notes, biomedical publications, Wikipedia, and news are used to train these embeddings. | BioWordVec and BioSentVec |
| Named Entity Recognition (NER) | A sequence-labeling task that locates and categorizes the important nouns and proper nouns that carry key information in a sentence. | It uses 1 or a combination of 2 underlying methods: (1) a rule-based method, which uses handcrafted grammatical and syntactic rules and dictionaries to extract the named entities; (2) a machine learning (ML) or deep learning (DL) based method, which uses a feature-based representation of the observed data. | It is used in the clinical domain to extract names of drugs, proteins, diseases, and genes from radiology reports, discharge summaries, problem lists, nursing documentation, medical education documents, and scientific literature. | MedLEE |
| Assertion status detection | Classification of a medical assertion as “present,” “absent,” “conditional,” or “associated with someone else.” | Given an entity in a medical text, it classifies the entity from its context as being present, absent, or possible in the patient. | In bio-clinical NLP, it is primarily used in disease modeling. The meaning of clinical entities is heavily affected by assertion modifiers such as negation, uncertainty, hypothetical statements, and experiencer. | MITRE system |
| Entity resolution | The practice of linking data records that represent the same entity in the absence of a join key. | The process comprises the following steps: (1) Blocking: categorizing entities into blocks based on their descriptions. (2) Block processing: removing redundancies within blocks. (3) Matching: matching within a block based on entity descriptions. (4) Clustering: grouping the identified matches together. | In biomedical applications, it is used in record linkage, where domain-specific knowledge is taken into account to avoid domain-general assumptions that do not hold in this domain (eg, overlap in names of chemical compounds). | DeepER |
| Relation extraction | The task of extracting structured information and semantic relations from natural language text between 2 or more entities of a certain type, such as person, organization, or location. | It uses co-occurrence, pattern matching, machine learning, deep learning, and knowledge-driven methods. | In the drug discovery and development domain, it is relevant to the extraction of drug–disease, gene–disease, drug–target, and drug–drug relationships. | BioRel |
| Topic modeling | An unsupervised approach for finding and classifying the topics embedded within a document or a piece of text. | It is based on the idea that a document is a mixture of topics, each of which is a probability distribution over words. Methods include term frequency-inverse document frequency, non-negative matrix factorization, Latent Dirichlet Allocation, and Latent Semantic Analysis. | In the biomedical domain, topic modeling has been applied to use cases beyond documents and words, eg, to classify genomic sequences, to classify drugs according to safety and therapeutic use, and to find links between genes and diseases. | Gensim, Stanford Topic Modeling Toolbox, and MALLET |
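
The four entity resolution steps listed in the table (blocking, block processing, matching, clustering) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method used by DeepER or any other listed system: the function names `normalize` and `resolve`, the first-letter blocking key, the `difflib` string-similarity matcher, and the 0.8 threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return " ".join(name.lower().split())


def _find(parent, x):
    # Follow parent links to the representative of x's cluster.
    while parent[x] != x:
        x = parent[x]
    return x


def resolve(records, threshold=0.8):
    # (1) Blocking: group records by first letter (an assumed, toy blocking key).
    blocks = defaultdict(list)
    for rec in records:
        norm = normalize(rec)
        if norm:
            blocks[norm[0]].append(norm)

    clusters = []
    for key in sorted(blocks):
        # (2) Block processing: drop exact duplicates within the block.
        unique = sorted(set(blocks[key]))

        # (3) Matching: compare every pair inside the block by string similarity.
        parent = {name: name for name in unique}
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    # (4) Clustering: merge the clusters of matched records.
                    parent[_find(parent, b)] = _find(parent, a)

        groups = defaultdict(list)
        for name in unique:
            groups[_find(parent, name)].append(name)
        clusters.extend(sorted(groups.values()))
    return clusters


records = ["Acetaminophen", "acetaminophen ", "Acetaminophene",
           "Aspirin", "Ibuprofen", "ibuprofen"]
clusters = resolve(records)
```

Plain string similarity is exactly the kind of domain-general assumption the table warns about: distinct chemical compounds can have nearly identical names, so a real biomedical pipeline would need domain-specific matching rather than this heuristic.
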
Figure 3. NLP in stages of drug development. The figure shows NLP functionalities used for applications in 3 stages of the drug development process: (1) drug discovery, (2) clinical trials, and (3) pharmacovigilance. The data sources utilized for NLP implementation in these applications are also listed. We also provide some examples of open-source systems for these applications along with links to training datasets.
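
To make two of the listed methodologies concrete, the sketch below combines a rule-based (dictionary) NER step with sentence-level co-occurrence relation extraction for drug–disease pairs. It is only an illustration under assumptions: the lexicons `DRUGS` and `DISEASES`, the functions `find_entities` and `cooccurring_pairs`, and the regex sentence splitter are hypothetical and are not taken from any system named in this review.

```python
import re
from itertools import product

# Hypothetical toy dictionaries; real systems rely on curated vocabularies.
DRUGS = {"metformin", "aspirin"}
DISEASES = {"type 2 diabetes", "headache"}


def find_entities(sentence, lexicon, label):
    # Rule-based NER: case-insensitive, whole-word dictionary lookup.
    found = []
    for term in sorted(lexicon):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            found.append((m.start(), m.end(), label, term))
    return sorted(found)


def cooccurring_pairs(text):
    # Co-occurrence relation extraction: pair every drug with every
    # disease mentioned in the same sentence.
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        drugs = find_entities(sentence, DRUGS, "DRUG")
        diseases = find_entities(sentence, DISEASES, "DISEASE")
        for drug, disease in product(drugs, diseases):
            pairs.add((drug[3], disease[3]))
    return sorted(pairs)


text = ("Metformin is a first-line treatment for type 2 diabetes. "
        "Aspirin can relieve headache. The patient felt well.")
pairs = cooccurring_pairs(text)
```

Co-occurrence only signals that two entities appear together; it cannot tell whether a drug treats or causes a condition, which is one reason the pattern-matching and learning-based methods are used alongside it.
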