How can natural language processing help model informed drug development?: a review.

Roopal Bhatnagar¹, Sakshi Sardar², Maedeh Beheshti², Jagdeep T Podichetty².

Abstract

Objective: To summarize applications of natural language processing (NLP) in model informed drug development (MIDD) and identify potential areas of improvement.
Materials and Methods: Publications found on PubMed and Google Scholar, websites and GitHub repositories for NLP libraries and models. Publications describing applications of NLP in MIDD were reviewed. The applications were stratified into 3 stages: drug discovery, clinical trials, and pharmacovigilance. Key NLP functionalities used for these applications were assessed. Programming libraries and open-source resources for the implementation of NLP functionalities in MIDD were identified.
Results: NLP has been utilized to aid various processes in the drug development lifecycle, such as gene-disease mapping, biomarker discovery, patient-trial matching, and adverse drug event detection. These applications commonly use the NLP functionalities of named entity recognition, word embeddings, entity resolution, assertion status detection, relation extraction, and topic modeling. The current state-of-the-art for implementing these functionalities in MIDD applications is transformer models that utilize transfer learning for enhanced performance. Various libraries in Python, R, and Java like huggingface, sparkNLP, and KoRpus as well as open-source platforms such as DisGeNet, DeepEnroll, and Transmol have enabled convenient implementation of NLP models in MIDD applications.
Discussion: Challenges such as reproducibility, explainability, fairness, limited data, limited language support, and security need to be overcome to ensure wider adoption of NLP in the MIDD landscape. There are opportunities to improve the performance of existing models and expand the use of NLP in newer areas of MIDD.
Conclusions: This review provides an overview of the potential and pitfalls of current NLP approaches in MIDD.

Entities: Chemical

Keywords: NLP; deep learning; drug development; machine learning

Year: 2022 PMID: 35702625 PMCID: PMC9188322 DOI: 10.1093/jamiaopen/ooac043

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

INTRODUCTION

Natural language processing (NLP) is an artificial intelligence (AI) technique for processing and analyzing human-generated spoken or written data using syntactic and semantic analysis. NLP has evolved over the last decade to the point where it has become an integral part of daily life: it is used in email filters, voice assistants, language translation, digital phone calls, and text analytics. The rise of big data in the healthcare industry is setting the stage for AI tools such as NLP to assist with improving the delivery of care. A major problem in healthcare is that about 80% of medical data remains unstructured (eg, text, image, signal) and untapped after it is created. NLP has shown high potential in healthcare and model informed drug development (MIDD) to overcome the challenges of using and generating natural language data. It has enabled a shift from time-consuming, manual, and siloed curation of natural language data to automated, large-scale, and standardized processes for analyzing text and speech data. MIDD involves leveraging quantitative models to inform decision-making in drug development. In MIDD, NLP can be leveraged to extract information from structured (eg, electronic health records [EHRs]) and unstructured (eg, research documents) data to optimize and/or accelerate various processes in the drug development lifecycle, eg, determining drug–target and drug–drug interactions, biomarker discovery, drug repurposing, patient-trial matching, model-based meta-analysis, and disease progression modeling. NLP platforms assess potential associations between chemical/drug entities, their target proteins, and novel disease-related pathways through extensive analysis of scientific literature.
NLP can also accelerate repurposing of approved drugs for new diseases, which enables pharmacologists to address a new market at a fraction of the cost and time. In drug safety, text mining automation can unveil valuable information hidden in large volumes of unstructured data. Matching participants to clinical trials is another crucial application, where NLP and AI can save time. Previous papers have identified the potential of AI in drug discovery and development, and some literature has focused on drug discovery processes. This review focuses on current NLP applications in drug discovery and development and provides a comprehensive overview of NLP in MIDD. We highlight the technical aspects of various tools used to develop the existing language models. We also provide information on easily accessible resources that can be deployed to develop an NLP model for MIDD applications. Lastly, this article gives insights into opportunities that currently exist to expand and carry NLP in MIDD forward.

METHODS

The review process was divided into 2 parts: review of the applications of NLP algorithms in different stages of drug development lifecycle and review of technical aspects of various NLP algorithms (Figure 1).

Figure 1.

Entire review process workflow. The review process was divided into 2 parts: (1) review of applications of NLP in the MIDD space and (2) technical review of state-of-the-art methods for implementing the NLP functionalities most used in the MIDD space. First, all papers identified on PubMed and Google Scholar that used NLP techniques in different stages of drug discovery and development were reviewed against the inclusion and exclusion criteria. The drug development process was stratified into 3 stages: (1) Discovery, (2) Clinical Trials, and (3) Pharmacovigilance. Papers highlighting the use of NLP were classified into 1 of the 3 stages. For each application, key NLP functionalities in the workflow were identified. In the next step, a technical review of all the identified NLP functionalities was carried out. For each functionality, the implementation pipeline was analyzed and the current state-of-the-art was identified. Biomedical application-specific AI-based models and libraries for implementing those functionalities were reviewed from sources that include publications and the GitHub repositories or websites of the specific models (Figure 2). This information was used to populate the 2 inventories presented in this article (Tables 2 and 3). NLP-task-based features of the libraries, such as the ability to perform text preprocessing, named entity recognition (NER), relation extraction, and sentiment analysis, were included in the library inventory (Table 2). Additional features in that inventory include the availability of pre-trained neural models for direct implementation using transfer learning and support for multiple languages. The state-of-the-art model inventory (Table 3) covers recently developed transformer-based models that can be used for various NLP tasks in the MIDD space. These models have been pretrained on biomedical literature and are known to produce state-of-the-art results on various tasks.

Figure 2.

Table 2.

NLP libraries for MIDD

Library name	Programming language	Pretrained neural network models	Word embeddings	Multi-language support	Tokenization	Part-of-speech tagging	Stemming/lemmatization	Named entity recognition	Entity resolution	Sentiment analysis	Relation extraction	Assertion status detection	Topic modeling
Spacy⁴³	Python	x	x	x	x	x	x	x	x		x
Gensim⁴⁴	Python	x	x	x	x		x		x				x
NLTK⁴⁵	Python	x		x	x	x	x	x		x	x
CoreNLP⁴⁶	Java	x		x	x	x	x	x	x	x	x
Scispacy⁴⁷	Python	x	x	x	x	x	x	x	x		x
SparkNLP⁴⁸	Python, Java, Scala, R	x	x	x	x	x	x	x	x	x	x	x	x
SparkNLP for healthcare⁴⁹	Python, Java, Scala, R	x	x		x	x	x	x	x	x	x	x
Torchtext⁵⁰	Python	x	x		x	x
KoRpus⁵¹	R			x	x	x
Tensorflow⁵²	Python	x	x	x	x	x	x	x		x	x		x
Scikit learn⁵³	Python				x					x			x
Textblob⁵⁴	Python				x	x	x			x
Pattern⁵⁵	Python, R		x		x	x	x			x
Hugging face⁵⁶	Python	x		x				x		x			x
Allen NLP⁵⁷	Python	x	x		x	x	x	x	x	x	x
Fasttext²¹	Python	x	x	x	x					x			x
Stanza⁵⁸	Python	x		x	x	x	x	x			x
Flair⁵⁹	Python		x	x		x		x		x			x
Fastai⁶⁰	Python	x	x		x					x
Spacyr⁶¹	R	x		x	x	x		x			x

Table 3.

NLP models for MIDD

Model	Full form	Pretrained on	Architecture	Built on	Performance	Date
BioBERT⁶²,⁶³	Bio-Bidirectional Encoder Representations from Transformers	PubMed and PMC	Transformer	BERT	Outperforms state-of-the-art (SOTA) for named entity recognition, relation extraction, question answering	September 2019
SciBERT⁶⁴,⁶⁵	Science-Bidirectional Encoder Representations from Transformers	Semantic Scholar	Transformer	BERT	Outperforms SOTA for named entity recognition, relation extraction, patient enrollment task	November 2019
ClinicalBERT⁶⁶,⁶⁷	Clinical Bidirectional Encoder Representations from Transformers	MIMIC III	Transformer	BERT	Outperforms deep language model for clinical prediction	November 2020
BioClinicalBERT⁶⁸,⁶⁹	Bio-Clinical Bidirectional Encoder Representations from Transformers	MIMIC III	Transformer	BioBERT	Outperforms BERT and BioBERT on named entity recognition and natural language inference	June 2019
BioMed-RoBERTa⁷⁰,⁷¹	BioMedical Robustly optimized Bidirectional Encoder Representations from Transformers	Semantic Scholar	Transformer	RoBERTa	Outperforms RoBERTa on text classification, relation extraction and named entity recognition	May 2020
Bio Discharge Summary BERT⁶⁹,⁷²	Bio Discharge Summary Bidirectional Encoder Representations from Transformers	MIMIC III discharge summaries	Transformer	BioBERT	Outperforms BERT and BioBERT on named entity recognition and natural language inference	June 2019
BioALBERT⁷³	Bio-A Lite Bidirectional Encoder Representations from Transformers	PubMed, PMC, MIMIC III	Transformer	ALBERT	Outperforms SOTA for named entity recognition, relation extraction, question answering, sentence similarity, document classification	July 2021
ChemBERTa⁷⁴,⁷⁵	Chem-Bidirectional Encoder Representations from Transformers	PubChem	Transformer	RoBERTa	Outperforms baseline on one task of molecular property prediction	October 2020

Process flow for NLP libraries inventory. The figure describes the review process followed for developing the “NLP libraries inventory for drug discovery and development.” A total of 47 libraries were identified from Google Scholar resources. Of these, 7 libraries for speech processing were excluded from further screening. Of the remaining 40 libraries, 20 were found to be used in different biomedical or biochemical applications. The websites, GitHub repositories, and publications on the libraries were reviewed and the libraries were analyzed for the presence or absence of 14 features. These features were selected based on the most used NLP functionalities in the drug discovery and development space.

Inclusion criteria

The included articles must be published between 2010 and February 2022.
Articles which discussed the most recent development (until February 2022) or current state-of-the-art algorithm that outperforms baseline for various NLP functionalities.
Articles which highlighted applications of NLP algorithms in various stages of drug development including drug discovery, clinical trial, and pharmacovigilance.
Recently launched transformer-based current state-of-the-art models for biomedical applications were included for the model inventory.
Articles which highlighted NLP algorithm implementation in drug discovery and development areas using any open-source pre-trained models.
Libraries used for biomedical NLP applications in Python, Java, R, and Scala were included for the library inventory.

Exclusion criteria

Any NLP implementation libraries in languages other than Python, Java, R, Scala, and C++ were excluded from the review.
NLP transformer models that were trained on datasets other than biomedical datasets such as PubMed, ChemProt, NCBI-diseases etc.
NLP systems involving speech analysis or generation.

RESULTS

NLP aims to transform text information into structured data with the purpose of enhancing the usability of the data and quality of decisions made based on that data. Looking into the specific field of drug discovery and development, a plethora of NLP approaches have been utilized in the previous few years to make use of the huge amount of unstructured data that has been generated and is available in the domain. NLP offers several functionalities that enable analysis of unstructured text data for drug discovery and development applications. Some of the most used NLP functionalities for drug development are listed and explained in the next section (Table 1) and Supplementary Material.

Table 1.

Relevant NLP key concepts

NLP concept	Definition	Methodology	Biomedical or biochemical applications	MIDD-specific open-source resources
Word embedding	A class of techniques where individual words are represented as real-valued vectors, often tens or hundreds of dimensions in a predefined vector space.	It uses language models and feature extraction methods to map words to vectors capturing their context and meaning. Generic pre-trained models such as GloVe,¹⁹ word2vec,²⁰ and fastText²¹ have become prevalent.	Biomedical NLP encompasses use of word embeddings as feature input to downstream ML or DL models. Different textual resources like EHR, clinical notes, biomedical publications, Wikipedia, news etc. are utilized to train these word embeddings.	BioWordVec and BioSentVec²²
Named Entity Recognition (NER)	A sequence-labeling task that encompasses locating and categorizing important nouns and proper nouns in text which carry key information in a sentence.	It utilizes either 1 or a combination of the 2 underlying methods: (1) Rule-based method which uses a set of handcrafted grammatical and syntactic rules, and dictionaries to extract the named entities. (2) Machine learning (ML) or deep learning (DL) based method that utilizes a feature-based representation of the observed data.²³	It is used in the clinical domain to extract names of drugs, protein, disease, and genes from radiology reports, discharge summaries, problem lists, nursing documentation, medical education documents, and scientific literature.	MedLEE,²⁴ MetaMap,²⁵ KnowledgeMap,²⁶ cTAKES,²⁷ HiTEX,²⁸ MedTagger,²⁹ and ChemSpot³⁰
Assertion status detection	Classification of the status of medical assertions as “present,” “absent,” “conditional,” or “associated with someone else.”	Given an entity in a medical text, it classifies its asserted class from the context as being present, absent, or possible in the patient.³¹ In recent years, assertion detection models have been developed using convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and attention techniques.³²	In bio-clinical NLP, it is primarily used for disease modeling. The meaning of clinical entities is heavily affected by assertion modifiers such as negation, uncertainty, hypothetical status, and experiencer.	MITRE system³³
Entity resolution	It is the practice of linking data records that represent the same entity in the absence of a join key.	The process is comprised of the following steps: (1) Blocking—categorizing entities into blocks based on their descriptions. (2) Block processing—removing redundancies within blocks. (3) Matching—matching within a block based on entity descriptions. (4) Clustering—grouping of identified matches together.	In biomedical applications, it is used in record linkage by taking domain-specific knowledge into consideration to avoid domain-general assumptions that do not hold in this domain (eg, overlap in names of chemical compounds).³⁴	DeepER³⁵ and Bell et al.’s rule-based sieve architecture³⁴
Relation extraction	It is the task of extracting structured information and semantic relations from natural language text between 2 or more entities of a certain type like person, organization, or location.	It uses co-occurrence, pattern matching, machine learning, deep learning, knowledge-driven methods,³⁶ or transfer learning.	In the drug discovery and development domain, it is relevant in extraction of drug–disease, gene–disease, drug–target, and drug–drug relationships.	BioRel³⁷ and DocRBERT³⁸
Topic modeling	It is an unsupervised approach used for finding and classifying various topics embedded within a document or a piece of text.	It is based on the idea that a document is a mixture of topics, each of which is a probability distribution over words. Term frequency-inverse document frequency, non-negative matrix factorization, Latent Dirichlet Allocation, Latent Semantic Analysis,³⁹ attention,³⁹ and generative adversarial networks⁴⁰ are some of the methods used for implementing it.	In the biomedical domain, topic modeling has been applied to use-cases beyond documents and words, eg, to classify genomic sequences, to classify drugs according to safety and therapeutic use, and to find links between genes and diseases.⁴¹	Gensim, Stanford Topic Modeling Toolbox, and MALLET⁴²

One or more of these functionalities can be utilized to build a text processing pipeline to accomplish various drug discovery and development application objectives such as mining EHR data to detect adverse drug reactions or extracting information for potential drug targets from scientific literature. A typical NLP pipeline in drug development can also include text pre-processing methodologies such as tokenization, stemming, lemmatization, and part-of-speech tagging followed by a combination of various NLP functionalities.
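
As a rough illustration of the pre-processing stage of such a pipeline, the sketch below tokenizes a sentence and applies a deliberately naive suffix-stripping stemmer. It uses only the Python standard library; the suffix list and the function names (`tokenize`, `naive_stem`, `preprocess`) are assumptions made for this example and do not correspond to any library in Table 2, whose tokenizers and stemmers are far more sophisticated.

```python
import re

# Toy suffix list for illustration only; real stemmers (e.g., Porter) use richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into word/number tokens, keeping hyphenated terms."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize the text, then stem each token."""
    return [naive_stem(token) for token in tokenize(text)]


if __name__ == "__main__":
    print(preprocess("Patients reported headaches after dosing."))
```

The output of a step like this would typically feed the downstream functionalities (NER, relation extraction, and so on) rather than be used on its own.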

NLP model inventory

The NLP landscape is promising, as several techniques have been developed to optimize the performance of various NLP functionalities. With the emergence of transfer learning and transformers in NLP, the efficiency of the process has increased while the dependence on large amounts of training data has decreased. Several libraries in commonly used languages such as Python, R, and Java have made implementation easier by providing pre-trained state-of-the-art neural network models. With the increasing use of NLP algorithms in the healthcare sector, healthcare-specific transformer models such as BioBERT and SciBERT have been developed by training the base BERT model on biomedical corpora such as PubMed and MIMIC-III. Numerous libraries have made such models accessible for implementation. We summarize our findings on NLP libraries and models for MIDD in Tables 2 and 3, respectively. Table 2 lists the features of state-of-the-art libraries in Python, R, and Java for biomedical applications. The features in the inventory are NLP functionalities that are crucial in different phases of drug discovery and development; they were extracted from our literature search and are highlighted in the section above. Table 3 provides an overview of state-of-the-art NLP models useful in MIDD. These inventories aim to help researchers choose resources for implementing pre-trained neural models or NLP techniques for their respective applications.

NLP in the lifecycle of drug development

There are several facets of the lifecycle of drug development (DD) in which Real World Data (RWD) and NLP algorithms have been implemented with the aim of improving outcomes. To guide the structure of the model inventory, we reviewed literature and use-cases in the following areas: (1) drug discovery, (2) clinical trials, and (3) pharmacovigilance. In each of the cases, different forms of NLP were applied to textual data to derive novel insights in all stages of drug development that would previously have been difficult or even impossible to capture.

Drug discovery

In the drug discovery process, understanding gene–disease associations, pathways, and systems is critical. Much of the data that can aid in extracting this information is in unstructured text. Furthermore, most new targets are derived from novel biological discoveries first appearing in scientific literature from academic sources. NLP-based text mining has provided a solution which has been widely utilized for applications in gene–disease mapping, target identification, biomarker discovery, and drug repurposing efforts. NLP has also been utilized to analyze text-based representations of molecular structures for discovery and design of novel drugs. Instead of relying on disparate manually curated sources, an NLP system can mine and extract relevant and valuable knowledge from all these sources at once. In the sections below, the uses of NLP in various drug discovery areas are highlighted.

Gene–disease mapping

Analyzing gene–disease associations is a crucial step for target identification and biomarker discovery in the drug discovery pipeline. Experimental methods for identifying gene–disease associations, such as genome-wide association studies and linkage analysis, can be expensive and time-consuming. Hence, researchers have turned in the past few years to various in silico methods that utilize text-mining, crowdsourcing, and network- and semantic-similarity-based algorithms. Mining of biomedical literature is key to extracting actionable information present in free-text data. NLP enables automated text-mining with techniques such as NER and relation extraction. Examples of such systems include DisGeNET, BeFREE, a co-occurrence interaction network presented in Al-Aamri et al, and a BioBERT-based model introduced in Deng et al.
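
To make the co-occurrence idea concrete, the following minimal sketch counts sentence-level co-mentions of genes and diseases using dictionary lookup as a stand-in for NER. It is not the method of any system named above: the two vocabularies, the function names, and the example sentences are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-dictionaries; real systems draw on curated vocabularies.
GENES = {"brca1", "tp53", "apoe"}
DISEASES = {"breast cancer", "alzheimer's disease"}


def find_terms(sentence, vocabulary):
    """Return the vocabulary terms that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return {
        term
        for term in vocabulary
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered)
    }


def cooccurrence_counts(sentences):
    """Count how often each (gene, disease) pair is mentioned in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        genes = find_terms(sentence, GENES)
        diseases = find_terms(sentence, DISEASES)
        counts.update(product(sorted(genes), sorted(diseases)))
    return counts


if __name__ == "__main__":
    # Illustrative sentences written for this example, not taken from a corpus.
    corpus = [
        "BRCA1 mutations increase the risk of breast cancer.",
        "TP53 and BRCA1 are frequently altered in breast cancer.",
        "APOE is a risk factor for Alzheimer's disease.",
    ]
    for pair, n in cooccurrence_counts(corpus).most_common():
        print(pair, n)
```

Raw co-mention counts say nothing about the type or direction of a relation, which is why the systems described in the text add relation extraction on top of entity recognition.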

Drug–target interaction prediction

Drug–target interaction prediction aims to identify binding of new drug candidate compounds to protein targets. A few approaches have recently been developed to address this problem using NLP techniques. These techniques use word embeddings to represent the chemical structures of the drug molecule and the binding protein from unlabeled biomedical literature. Raw data such as simplified molecular-input line-entry system (SMILES) strings for molecules and protein sequences are vectorized in this feature representation step. One common approach for feature representation is to use CNN-based models. However, this approach fails to take into account the relationships between different atoms in the molecule. Hence, self-attention and transformer-based embedding models are used to overcome this challenge. Further steps in the process involve machine learning or deep learning models to predict the affinity between the drug molecule and the target protein. Features such as the biological, topological, and physicochemical properties of the drug and target are considered for making these predictions.
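
The vectorization step can be sketched as follows: a SMILES string is split into chemically meaningful tokens, and each token is mapped to an integer id that an embedding layer could consume. This is a simplified illustration under stated assumptions, not the tokenizer of any model cited here; the regular expression covers only common SMILES symbols, and the names `tokenize_smiles`, `build_vocab`, and `encode` are made up for this example.

```python
import re

# Bracket atoms first, then two-letter halogens and two-digit ring closures,
# then single-letter atoms, ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#+\-/\\.@:~*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing loudly on unsupported characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unsupported characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Assign an integer id to every distinct token; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, length):
    """Map tokens to ids, then truncate or right-pad to a fixed length."""
    ids = [vocab[token] for token in tokens][:length]
    return ids + [vocab["<pad>"]] * (length - len(ids))


if __name__ == "__main__":
    aspirin = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
    vocab = build_vocab([aspirin])
    print(aspirin)
    print(encode(aspirin, vocab, 24))
```

Protein sequences can be treated the same way with amino-acid letters (or k-mers) as tokens; the learned embedding and attention layers then operate on these integer sequences.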

Biomarker discovery

With the rise of technologies to extract valuable information from biomedical big data (electronic medical records [EMRs] and biomedical literature), there is increased hope of discovering novel biomarkers that can be used to diagnose, predict, and monitor important aspects of a disease. Biomarkers also serve as surrogate endpoints in early-phase trials. Biomarker and disease names are identified in free-text data using NER and the frequency of their co-occurrences. The relationship between diseases and biomarkers can be understood using word embedding and similarity approaches. Singh et al present a big data mining approach for EHR data that uses NER and assertion status detection techniques along with machine learning to facilitate biomarker discovery. Holmes et al introduced a method to extract high-quality, contextual biomarker information from pathology reports using MetaMap.
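
The similarity approach mentioned above usually comes down to comparing embedding vectors with cosine similarity. The sketch below shows that computation on hand-made 3-dimensional vectors; the vectors and term names are hypothetical placeholders, whereas real embeddings (for example, from BioWordVec in Table 1) have hundreds of dimensions and are learned from text.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-made for illustration only.
EMBEDDINGS = {
    "troponin": [0.9, 0.1, 0.0],
    "myocardial_infarction": [0.8, 0.2, 0.1],
    "psa": [0.1, 0.9, 0.2],
    "prostate_cancer": [0.0, 0.8, 0.3],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(term, candidates):
    """Return the candidate whose embedding is closest to the embedding of `term`."""
    return max(
        candidates,
        key=lambda c: cosine_similarity(EMBEDDINGS[term], EMBEDDINGS[c]),
    )


if __name__ == "__main__":
    diseases = ["myocardial_infarction", "prostate_cancer"]
    for biomarker in ("troponin", "psa"):
        print(biomarker, "->", most_similar(biomarker, diseases))
```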

Drug repurposing

Drug repurposing is the discovery of new therapeutic opportunities for existing drugs. It can ensure a faster drug development and approval process, safer treatment, and reduced healthcare cost. Computational approaches such as virtual screening, molecular docking, deep learning, and NLP play a vital role in many drug repurposing studies. The drug–disease treatment pairs that NLP extracts from literature, EHRs, clinical notes, and real-world sources can be used for drug repurposing in 2 ways: the extracted pairs can be used directly, or the similarity of a drug or disease to a candidate drug or disease, respectively, can be used to hypothesize a new therapeutic indication for a given drug. Subramanian et al used SciBERT for drug-cancer association classification. In another study, researchers carried out a drug-wide association study for COVID-19 drug repurposing using the MedXN NLP platform for drug information extraction. Relation extraction and entity linking are other key NLP techniques that help capture complex relationships in unstructured text.

Drug design

Drug design in the initial stages of drug discovery is framed as an optimization problem: searching for the combination of building blocks that yields the most stable structure under the given conditions. De novo design of molecules has recently benefited from deep generative models, various NLP techniques, and transfer learning. Generating new SMILES strings from an input string is viewed as a language modeling task. To this end, Transmol was developed as a vanilla transformer language model for SMILES sequence generation. In another study, the ULMFiT model was used to leverage transfer learning to generate new molecular sequences. A recent study introduced Seq2Mol, a method conditioned on the protein target sequence that generates de novo SMILES strings of molecules relevant to the target using the deep bidirectional language model ELMo.

Clinical trials

Approximately 5.6% of clinical trials in the clinicaltrials.gov database have been terminated prematurely (2021). A failed trial sinks not only the investment in the trial itself but also the preclinical development costs, putting the loss per failed clinical trial at 800 million to 1.4 billion USD. Failure to optimize clinical trial design, inefficient enrollment processes, and poor retention rates are among the main reasons for premature trial closure. NLP has been utilized to overcome these issues and help improve the clinical trial process.

Patient-trial matching

Identifying suitable patients can be resource-intensive, often relying on manual review of clinical notes in which the relevant information may be split across different systems. Researchers have utilized various NLP techniques to automate clinical trial eligibility pre-screening, increasing the efficiency of patient selection and recruitment. NER, assertion status detection, relation extraction, and entity linking have primarily been used to extract relevant fields from clinical trial eligibility criteria. These fields are then mapped against relevant fields extracted from unstructured patient EHRs using the same techniques for efficient patient-cohort matching. Recently, advanced models have provided significant improvement in the patient-trial matching process: Criteria2Query, which uses an information extraction pipeline integrated with a natural language interface; DeepEnroll, which uses hierarchical embeddings; and COMPOSE, which uses word embeddings on clinical trial eligibility criteria along with a pseudo-Siamese network.
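
The interplay of entity extraction, assertion status, and criteria matching can be illustrated with a small rule-based sketch. It is intentionally simplistic and is not how Criteria2Query, DeepEnroll, or COMPOSE work: entities are found by substring lookup, assertion status is decided by a short, assumed list of negation cues that precede the entity, and the function names and example notes are hypothetical.

```python
import re

# Assumed negation cues for illustration; clinical systems use much larger lexicons.
NEGATION_CUE = re.compile(r"\b(?:no|denies|without|negative for)\b")


def assertion_status(sentence, entity):
    """Return 'absent' if a negation cue precedes the entity, 'present' if the
    entity is mentioned without one, and None if it is not mentioned at all."""
    lowered = sentence.lower()
    position = lowered.find(entity.lower())
    if position == -1:
        return None
    return "absent" if NEGATION_CUE.search(lowered[:position]) else "present"


def patient_findings(note_sentences, entities):
    """Collect the assertion status of each entity mentioned in the note."""
    findings = {}
    for sentence in note_sentences:
        for entity in entities:
            status = assertion_status(sentence, entity)
            if status is not None:
                findings[entity] = status
    return findings


def is_eligible(findings, inclusion, exclusion):
    """Eligible if every inclusion condition is present and no exclusion condition is."""
    has_required = all(findings.get(e) == "present" for e in inclusion)
    has_excluded = any(findings.get(e) == "present" for e in exclusion)
    return has_required and not has_excluded


if __name__ == "__main__":
    note = ["Patient has type 2 diabetes.", "Denies heart failure."]
    conditions = ["type 2 diabetes", "heart failure"]
    findings = patient_findings(note, conditions)
    print(findings)
    print(is_eligible(findings, ["type 2 diabetes"], ["heart failure"]))
```

The example shows why assertion status matters for matching: without it, the negated mention of heart failure would wrongly trigger the exclusion criterion.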

Pharmacokinetic/Pharmacodynamic (PK/PD) studies

PK/PD studies are crucial for determining the dosing and schedule during a clinical trial. Post-marketing PK/PD analyses are used to evaluate drug response in patients in real-world settings. These studies require longitudinal information on dose, outcomes, and potential covariates. Mining EHRs for this data is a potential solution. Researchers have leveraged NLP to automate real-world data extraction from EHRs, using existing NLP data mining tools such as MedEx, MedXN, and medExtractR.
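
A minimal sketch of what dose extraction from free text involves is given below. It does not reproduce MedEx, MedXN, or medExtractR; the regular expression, the supported units and frequency phrases, and the example note are all assumptions chosen to keep the illustration short.

```python
import re

# Assumed pattern: "<drug> <number> <unit> [<frequency>]"; real tools handle far more variation.
DOSE_PATTERN = re.compile(
    r"(?P<drug>[A-Za-z]+)\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g|ml)\b"
    r"(?:\s+(?P<frequency>once daily|twice daily|daily|bid|tid|qid))?",
    re.IGNORECASE,
)


def extract_doses(note):
    """Return one record per drug-dose mention found in the note."""
    records = []
    for match in DOSE_PATTERN.finditer(note):
        frequency = match.group("frequency")
        records.append(
            {
                "drug": match.group("drug").lower(),
                "dose": float(match.group("dose")),
                "unit": match.group("unit").lower(),
                "frequency": frequency.lower() if frequency else None,
            }
        )
    return records


if __name__ == "__main__":
    # Illustrative note text written for this example.
    note = "Started tacrolimus 2 mg twice daily; later changed to tacrolimus 1.5 mg bid."
    for record in extract_doses(note):
        print(record)
```

Linking each extracted dose to a date and to outcome measurements is what turns such records into the longitudinal data that PK/PD models need.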

Document preparation for regulatory submissions

NLP is being used to accelerate document preparation with tools that can perform parallel search, document creation, and data integrity review, and that can rapidly assemble briefing documents for regulatory submissions.

Pharmacovigilance

According to the Centers for Disease Control and Prevention (CDC), adverse drug events (ADEs) cause approximately 1.3 million emergency department visits each year. It is vital to assess the safety of a drug to avoid potential adverse events resulting from it. Additionally, active post-marketing surveillance is crucial to account for all side effects that can result from the drug in a larger population over the duration of its usage. EHRs and NLP have enabled more accurate detection of such adverse events than conventional manual methods.

Adverse drug event detection

ADEs are unexpected medical occurrences resulting from drug-related intervention. The current method for ADE detection involves manual retrospective review of medical data stored within EMRs in structured and free-text form. Over the past few years, researchers have utilized various NLP techniques to automate this process by including free-text data from EHRs. The workflow includes identifying and extracting the relationship between a drug and an ADE from unstructured EHR data, incident reporting systems, or social media. NER identifies medications and their attributes (dosage, route, duration, and frequency), indications, ADEs, and severity. Word sense disambiguation is used to further filter the identified entities and confirm their contextual sense. The relation extraction task identifies relations between the named entities: medication-indication and medication-ADE. Word embeddings are used to vectorize the input for training an ML or DL model to identify and classify ADEs. Numerous publicly available NLP systems have been extended to perform ADE detection tasks, including MedLEE, MetaMap, cTAKES, MedEx, and GATE. Wu et al introduced an NLP system with multi-head self-attention to detect adverse drug reactions from tweets using pre-trained word embeddings, text pre-processing, part-of-speech embeddings, and sentiment embeddings.

Drug–drug interaction prediction

When 2 or more drugs are co-administered, drug–drug interaction detection becomes a critical part of post-marketing surveillance. Interactions between 2 drugs may lead to side effects, an increased or decreased effect, or an adverse reaction. Because numerous drug combinations are available, it is difficult and time-consuming to manually collect all the drug–drug interaction events of patients from reports and scientific literature. To overcome this, several efforts have been made to automate the process using text-mining approaches that incorporate NLP techniques such as NER, relation extraction, and word embeddings.

EHR data de-identification

Multiple NLP methods are being explored to facilitate the use of EHR data without compromising patient privacy. These methods include rule-based extraction, feature-based ML, and neural methods. The goal is to be both effective in detecting protected health information (PHI) and efficient in processing the data.
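
The rule-based end of this spectrum can be sketched with a few regular expressions that replace matched PHI with category placeholders. The three patterns below (dates, phone numbers, and a medical record number prefixed with "MRN") are assumptions for illustration and cover only a small fraction of the identifiers a real de-identification system must handle, such as names and addresses.

```python
import re

# Assumed PHI patterns for illustration; each match is replaced by its label in brackets.
PHI_PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
]


def deidentify(text):
    """Replace every match of each PHI pattern with a bracketed category label."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    # Fictional note text written for this example.
    print(deidentify("Seen on 03/14/2021, MRN: 123456, call 555-123-4567."))
```

Rules like these are fast and transparent but miss unanticipated formats, which is one reason feature-based ML and neural methods are also being explored.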

DISCUSSION

With the rapid advances in NLP in the past few years, it has found applications in several industries to automate time-consuming manual processing of human-generated natural language. Drug discovery and development is one such field that can leverage NLP to its advantage. The literature review results presented in this article highlight some of the most promising avenues of NLP applications in the journey of a drug from molecule to market. We found that researchers have utilized NLP techniques such as NER, relation extraction, word embeddings, assertion status detection, topic modeling, natural language generation, and entity resolution for drug discovery and development applications. Table 2 lists current state-of-the-art library resources in Python, Java, R, and Scala that can be used to develop models for one or more of these tasks. The table also includes bio- and clinical-specific libraries that can be utilized to achieve better performance in drug discovery and development applications. The state-of-the-art performance is attributed to the availability within these libraries of pre-trained neural network models that have been trained on biomedical literature. Neural language model-based approaches have been shown to achieve better performance. With further improvements in deep learning, NLP models have moved from recurrent neural networks (RNNs) and LSTMs to attention-based models and transformers. The addition of transfer learning to transformers has led to even higher accuracies. Table 3 captures the trends in the evolution of state-of-the-art transformer models pretrained on biochemical and biomedical literature. Many of the libraries listed in Table 2 utilize these models for enhanced performance. Both tables provide insights into the technical aspects of various NLP algorithms and the tools available to easily access those algorithms for the drug development applications highlighted in Figure 3.

Figure 3.

NLP in stages of drug development. The figure shows the NLP functionalities used for applications in 3 stages of the drug development process: (1) drug discovery, (2) clinical trials, and (3) pharmacovigilance. The data sources used for NLP implementation in these applications are also listed. We also provide some examples of open-source systems for these applications along with links to training datasets.

Figure 3 provides an overview of the applicability of various NLP tasks to drug discovery and development use cases in MIDD, with examples for drug discovery, clinical trials, and pharmacovigilance. It ties together the findings of the 2 parts of the literature review, namely the applications and the technical aspects of various NLP techniques, by depicting the techniques used for each application of NLP in the drug development domain.

As we saw in our results, there has been a shift from rule-based approaches to increasingly complex neural language models, which has led to state-of-the-art results. However, this shift has come with a performance-explainability trade-off. Traditional NLP techniques using rule-based or statistical methods are inherently explainable, but the prevalence of deep learning models and word embedding techniques has given rise to the need to incorporate explainability as a feature of the models. Some work has been done in recent years to expand the field of explainable and interpretable NLP models. Even so, the biological and chemical interpretability and explainability of NLP models remain a challenge in drug discovery and development. Further exploration and application of explainability and interpretability for neural NLP models is crucial to improving the acceptance of NLP by researchers as well as regulators.
Further issues, such as bias in NLP models stemming from bias in data and algorithm design, security issues surrounding PHI, and reproducibility of results, are among the limitations hindering wider adoption of these advanced techniques for drug development applications. To support better adoption of NLP in MIDD, we identified the following opportunities in the field as a result of our research. The drug discovery and development fields present several opportunities for researchers to apply NLP to further improve the performance of existing models. To enable wider adoption of NLP in MIDD, additional work is required to make the models more explainable, interpretable, fair, and reproducible, and to overcome issues of security (discussed further in the Supplementary File). Within drug discovery and development, applications that improve clinical trials and pharmacovigilance can be critical for cost savings. It is evident from our review that opportunities exist to explore document preparation for regulatory submissions, PK/PD modeling, and EHR deidentification, as not much work has been done on these applications. Current NLP approaches are limited to a few languages, such as English or Dutch. Research can be expanded to include other languages to make the best use of the plethora of data available in regional languages. This can help expand the reach of NLP systems and improve the performance of current state-of-the-art algorithms. Another avenue of interest is the use of few-shot learning in NLP to overcome the challenge of limited data, eg, in drug discovery for rare diseases.

CONCLUSION

Our review focuses on how the use of NLP is evolving in the drug development space. It highlights several NLP functionalities that help automate MIDD processes to increase efficiency. The article also identifies resources that can be useful in developing an NLP pipeline using current state-of-the-art methods for MIDD applications. Lastly, it provides insights into how the field can be taken forward by addressing some of its unmet needs.

FUNDING

Critical Path Institute is supported by the Food and Drug Administration (FDA) of the U.S. Department of Health and Human Services (HHS) and is 54.2% funded by the FDA/HHS, totaling $13 239 950, and 45.8% funded by non-government source(s), totaling $11 196 634. The contents are those of the author(s) and do not necessarily represent the official views of, nor an endorsement by, FDA/HHS or the U.S. Government. For more information, please visit FDA.gov.

AUTHOR CONTRIBUTIONS

RB was involved in the conception and design of the work and acquisition, analysis and interpretation of the data. JTP, SS, and MB were involved in the conception and design of the work.

SUPPLEMENTARY MATERIAL

Supplementary material is available at JAMIA Open online.
