Zhichao Liu, Ruth A Roberts, Madhu Lal-Nag, Xi Chen, Ruili Huang, Weida Tong.
Abstract
The discovery and development of new medicines is expensive, time-consuming, and often inefficient, with many failures along the way. Powered by artificial intelligence (AI), language models (LMs) have changed the landscape of natural language processing (NLP) and offer opportunities to make treatment development more effective. Here, we summarize advances in AI-powered LMs and their potential to aid drug discovery and development. We highlight opportunities for AI-powered LMs in target identification, clinical trial design, regulatory decision-making, and pharmacovigilance. We specifically emphasize the potential role of AI-powered LMs in developing treatment strategies for coronavirus disease 2019 (COVID-19), including drug repurposing, which can be extrapolated to other infectious diseases with pandemic potential. Finally, we set out the remaining challenges and propose possible solutions for improvement.
Keywords: Artificial intelligence; COVID-19; Drug development; Drug discovery; Language models; Natural language processing
Year: 2021 PMID: 34216835 PMCID: PMC8604259 DOI: 10.1016/j.drudis.2021.06.009
Source DB: PubMed Journal: Drug Discov Today ISSN: 1359-6446 Impact factor: 7.851
Figure 1Artificial intelligence (AI)-powered language models in the context of drug discovery and development. The overall stages of the development process are illustrated in the top layer (green), and the objectives from this process are captured in the layer below (blue). The text documents related to each stage are listed, and the opportunities of AI-powered language models are summarized in the following two layers (yellow and pink). Abbreviations: PD, pharmacodynamics; PK, pharmacokinetics.
Figure 2. Comparison of artificial intelligence (AI)-powered language models and human intelligence: (1) transfer learning (green); (2) apply knowledge (blue); and (3) summarize knowledge (yellow).
Selected examples of transformer-based language models.
| Architectures | BERT | OpenAI GPT | XLNet | ALBERT | RoBERTa | ELECTRA | DistilBERT |
|---|---|---|---|---|---|---|---|
| Pre-training corpus | BooksCorpus and English Wikipedia | 8 million web pages from Common Crawl | BooksCorpus and English Wikipedia | BooksCorpus and English Wikipedia | BooksCorpus + English Wikipedia + Common Crawl news dataset + Web text corpus + stories from Common Crawl | BooksCorpus and English Wikipedia | |
| Model parameters | | GPT-2: 1.5 billion parameters | | | | | |
| Training strategies | Masked Language Model (MLM) and Next Sentence Prediction (NSP) | Processes the input text left-to-right, predicting the next word given the previous context | Permutation-based language modeling | BERT with reduced parameters and a sentence-order prediction task | BERT with a dynamic masking strategy and without Next Sentence Prediction (NSP) | Trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network | BERT base model with a distillation loss function |
| Training time | Unknown | | Large: 2.5 days with 512 TPUs | Large: 1.7× faster than BERT | 1 day with 1024 Nvidia Tesla V100 GPUs | | 3.5 days with 8 Nvidia Tesla V100 GPUs |
| Feature dimension | | GPT-2: up to 1024 | | Up to 4096 | | | |
| Performance | Outperformed the state of the art in 11 NLP tasks | GPT-2 achieves state-of-the-art results on 7 out of 8 tested language modeling datasets | 2%–15% improvement over BERT | | Outperforms both BERT and XLNet by 2%–20% on GLUE benchmark results | | Retains 97% of the BERT base model's performance on GLUE benchmark results |
| Weblink | | GPT-2: | | | | | |
| References | | | | | | | |
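The training-strategy row above can be made concrete: the masked language model (MLM) objective used by BERT corrupts a fraction of the input tokens and trains the model to recover them. Below is a minimal, self-contained sketch of that corruption step (the standard ~15% selection with the 80/10/10 replacement rule). The example sentence and the small placeholder vocabulary are illustrative assumptions, not material from the paper.

```python
import random

MASK = "[MASK]"
VOCAB = ["drug", "target", "protein", "cell", "dose"]  # illustrative vocabulary

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged."""
    rng = rng or random.Random(0)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                   # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked[i] = MASK               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked[i] = rng.choice(VOCAB)  # 10%: replace with a random token
            # else 10%: leave the original token in place
    return masked, targets

sentence = "aspirin inhibits cyclooxygenase reducing prostaglandin synthesis".split()
corrupted, targets = mask_tokens(sentence, rng=random.Random(42))
```

Only the positions recorded in `targets` contribute to the MLM loss; keeping 10% of selected tokens unchanged forces the model to build a contextual representation of every input position, not just the masked ones.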
Selected examples of AI-based NLP applied in drug discovery.
| Application | Data source | Model | Description | | |
|---|---|---|---|---|---|
| Biomedical named entity recognition | BC2GM; BC5CDR; NCBI-Disease; JNLPBA | BioBERT | A multi-task (MT)-BioNER proposed for biomedical named entity recognition using BioBERT as shared layers and different data sets in task-specific layers | | |
| Gene–disease relationship extraction | DisGeNET: database of gene–disease associations | Convolutional neural network (CNN) and attention-based BiLSTM | Proposed Deep-GDAE integrates the specificities of a CNN and an attention-based BiLSTM to classify gene–disease associations | | |
| Biomedical text summarization | PubMed | BERT and hierarchical clustering algorithm | Biomedical text summarizer (BERT-based-Summ) proposed by interrogating BERT and a hierarchical clustering algorithm to extract biomedical content summarization | | |
| Drug properties prediction | 1 million SMILES codes of compounds in the ChEMBL database | BiLSTM-based transfer learning | Transfer learning framework, MolPMoFiT, to predict physical and biological endpoints, such as lipophilicity and blood–brain barrier penetration, for given compounds | | |
| Virtual screening | SMILES | BERT | MOLBERT model proposed by applying the BERT model to SMILES for virtual screening | | |
| Patient–trial matching | Patient EHR data | ClinicalBERT | Proposed DeepEnroll based on ClinicalBERT jointly encodes enrollment criteria and patient EHRs into a shared latent space for patient–trial matching | | |
| Trial eligibility criteria | Patient EHR data | CrOss-Modal PseudO-SiamEse network (COMPOSE) | COMPOSE aims to address challenges for patient–trial matching; one path of the network encodes eligibility criteria (EC) using a convolutional highway network | | |
| Biomedical entity normalization | Clinical notes; PubMed abstracts; drug labeling | BERT; BioBERT; ClinicalBERT | Authors proposed an entity normalization architecture by fine-tuning pretrained BERT/BioBERT/ClinicalBERT models and applying them to SNOMED-CT coding, MedDRA coding, and Medical Subject Headings (MeSH) coding | – | |
| Disease coding | Clinical notes and ICD-10 | BERT | Authors proposed an ML model, BERT-XML, for large-scale automated ICD coding from EHR notes | | |
| Biomedical mention disambiguation | CTDbase and gene2pubmed | CNN with LSTM | Authors developed a biomedical corpus for curating biomedical terms ambiguous between one or more concept types; the model interrogates an LSTM and a CNN | – | |
| Adverse drug reaction (ADR) detection | Twitter posts | Bidirectional LSTM (BiLSTM) | Authors proposed an RNN model using pretrained word embeddings created from a large, non-domain-specific Twitter data set for ADR extraction | | |
| ADR detection | Social media data | Ensemble of BERT, BioBERT, and ClinicalBERT | Authors proposed an ensemble model integrating BERT, BioBERT, and ClinicalBERT for ADR detection from social media data | – | |
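Several of the rows above (MolPMoFiT for drug property prediction, MOLBERT for virtual screening) feed SMILES strings to sequence models, which first requires splitting a SMILES string into chemically meaningful tokens rather than raw characters. The sketch below shows a common regex-based SMILES tokenizer; the regex and the example molecule (aspirin) are illustrative assumptions, not code from the cited works.

```python
import re

# Regex that splits a SMILES string into chemically meaningful tokens:
# bracket atoms (e.g. [NH4+]), two-letter elements (Cl, Br), ring-closure
# digits, bond symbols, and branch parentheses.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Tokenize a SMILES string; raise if any character is unrecognized."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Lossless check: the tokens must reassemble into the original string.
    assert "".join(tokens) == smiles, "untokenizable characters in SMILES"
    return tokens

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenize_smiles(aspirin))
```

Treating multi-character units such as `Cl` or `[NH4+]` as single tokens keeps the vocabulary chemically meaningful, so a BERT-style model sees one symbol per atom or bond instead of arbitrary character fragments.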
Publicly available FDA documents for promoting AI-powered LMs in regulatory applications.
| Resource | Description | Opportunities for AI-powered LMs | |
|---|---|---|---|
| Drug labeling | Drug labeling comprises a summary of the information needed for safe and effective use of a drug, proposed by the manufacturer and approved by the FDA | Drug labeling could be a useful resource (>120 000 product labelings) for developing biomedical named-entity recognition/normalization and for extracting relations such as drug–adverse event (AE) associations and drug–drug interactions | |
| FAERS | FAERS is a database that contains information on AE and medication error reports submitted to the FDA | FAERS is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. With more than 19 million case reports in FAERS, AI-powered LMs could be applied to carry out AE detection, causal relationship extraction, etc. | |
| Orange Book | The Orange Book identifies drug products approved on the basis of safety and effectiveness by the FDA under the Federal Food, Drug, and Cosmetic Act, together with related patent and exclusivity information | The Orange Book provides crucial regulatory information, such as biological equivalence, reference listed drug (RLD), reference standard (RS), and patent status. This information could be included in AI-powered LMs to compare drug product information with the RLD and RS and so facilitate abbreviated new drug application (ANDA) submissions | |
| Drugs@FDA | Drugs@FDA includes most drug products approved since 1939. Most patient information, labels, approval letters, reviews, and other information are available for drug products approved since 1998 | Drugs@FDA provides rich information on drug approval history, which could be used by AI-powered LMs to explore the underlying reasons for labeling changes and increase business success | |
| FDA Guidance Documents | Guidance documents describe the FDA's interpretation of policy on a regulatory issue (21 CFR 10.115(b)). These documents usually discuss specific products or issues relating to the design, production, labeling, promotion, manufacturing, and testing of regulated products | FDA Guidance Documents could be useful for implementing AI-powered LMs to standardize and monitor crucial steps in drug discovery and development in terms of their consistency and alignment with regulatory requirements | |
| FDA Acronyms and Abbreviations | The FDA Acronyms and Abbreviations database provides a quick reference to acronyms and abbreviations related to FDA activities | The emphasis of FDA Acronyms and Abbreviations is on scientific, regulatory, government agency, and computer application terms. The database also includes some FDA organizational and program acronyms. It is a useful resource for defining vocabularies in AI-powered LMs and increasing model generalization | |
Figure 3. Artificial intelligence (AI)-powered language models for accelerating coronavirus disease 2019 (COVID-19) treatment development. Potential opportunities, data resources, and key questions are illustrated. Abbreviation: CDC, Centers for Disease Control and Prevention.