Wei Wang1, Shuo Feng2, Zhuyifan Ye1, Hanlu Gao1, Jinzhong Lin2, Defang Ouyang1. 1. State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau 999078, China. 2. State Key Laboratory of Genetic Engineering, School of Life Sciences, Zhongshan Hospital, Fudan University, Shanghai 200438, China.
Abstract
Lipid nanoparticle (LNP) is commonly used to deliver mRNA vaccines. Currently, LNP optimization primarily relies on screening ionizable lipids by traditional experiments which consumes intensive cost and time. Current study attempts to apply computational methods to accelerate the LNP development for mRNA vaccines. Firstly, 325 data samples of mRNA vaccine LNP formulations with IgG titer were collected. The machine learning algorithm, lightGBM, was used to build a prediction model with good performance (R 2 > 0.87). More importantly, the critical substructures of ionizable lipids in LNPs were identified by the algorithm, which well agreed with published results. The animal experimental results showed that LNP using DLin-MC3-DMA (MC3) as ionizable lipid with an N/P ratio at 6:1 induced higher efficiency in mice than LNP with SM-102, which was consistent with the model prediction. Molecular dynamic modeling further investigated the molecular mechanism of LNPs used in the experiment. The result showed that the lipid molecules aggregated to form LNPs, and mRNA molecules twined around the LNPs. In summary, the machine learning predictive model for LNP-based mRNA vaccines was first developed, validated by experiments, and further integrated with molecular modeling. The prediction model can be used for virtual screening of LNP formulations in the future.
Lipid nanoparticle (LNP) is commonly used to deliver mRNA vaccines. Currently, LNP optimization primarily relies on screening ionizable lipids by traditional experiments which consumes intensive cost and time. Current study attempts to apply computational methods to accelerate the LNP development for mRNA vaccines. Firstly, 325 data samples of mRNA vaccine LNP formulations with IgG titer were collected. The machine learning algorithm, lightGBM, was used to build a prediction model with good performance (R 2 > 0.87). More importantly, the critical substructures of ionizable lipids in LNPs were identified by the algorithm, which well agreed with published results. The animal experimental results showed that LNP using DLin-MC3-DMA (MC3) as ionizable lipid with an N/P ratio at 6:1 induced higher efficiency in mice than LNP with SM-102, which was consistent with the model prediction. Molecular dynamic modeling further investigated the molecular mechanism of LNPs used in the experiment. The result showed that the lipid molecules aggregated to form LNPs, and mRNA molecules twined around the LNPs. In summary, the machine learning predictive model for LNP-based mRNA vaccines was first developed, validated by experiments, and further integrated with molecular modeling. The prediction model can be used for virtual screening of LNP formulations in the future.
The global pandemic of coronavirus disease 2019 (COVID-19) has caused nearly 220 million confirmed cases and more than four million deaths worldwide, according to the updated record by the World Health Organization (WHO). To suppress the prevalence of COVID-19, many pharmaceutical industries in multiple countries have developed vaccines with an unprecedented speed and are promoting their usage globally. The BNT162b2 from BioNTech and Pfizer and mRNA-1273 from Moderna were the first two vaccines approved by the US Food and Drug Administration (FDA) in November 2020 with the development period of less than one year, and impressively high preventing efficacy, 95% for BNT162b2 and 94.1% for mRNA-1273. Rapid development, high efficacy, and risk-free of insertional mutagenesis or infection induced by vaccine, show a promising prospect for this vaccine platform. mRNA takes effect by first being delivered into cells and then translated to immunogenic antigens. There are many aspects that mRNA sequence can be engineered to influence its efficacy,, such as self-amplifying,, choice of untranslated region,, modification of nucleoside,, codon optimization,, and combination of encoded antigens,. Besides, administration routes also affect the immune effect,. However, a successful mRNA vaccine further requires a proper delivery system, such as the lipid nanoparticle (LNP). Both vaccines against COVID-19 adopt LNP as the delivery system.LNP-based mRNA vaccines usually consist of four types of lipids, cholesterol, distearoylphosphatidylcholine (DSPC), polyethylene glycol (PEG) -lipid, and ionizable lipid. Cholesterol adjusts the flexibility and fusogenicity of lipids during mixing, facilitating the LNP formation. DSPC, the helper lipid, is related to LNP structure,, interfacial tension, and helps mRNA release. PEG-lipid influences the LNP stability, size of LNP, and further impacts the potency. The ionizable lipid, due to its cationic head group, should be the most critical ingredient. It dominates the binding to mRNA, interacting with the endosomal membrane and mRNA release,. Besides, a desired ionizable lipid should also show high biodegradability to ameliorate the adverse effect induced by lipid accumulation. Traditionally, ionizable lipids are screened by synthesizing numerous lipids and testing their in vivo efficacy,. However, current experimental screening needs a large amount of cost, time, and materials.Machine learning (ML) is a branch of artificial intelligence, which is the science of enabling computers to learn knowledge without being explicitly programmed. After succeeding in areas such as machine translation and computer vision, ML has been increasingly applied by pharmaceutical companies in recent years. ML can explore the existing dataset and determine the relationship between the input and output parameters, wherein the former could present the formulation information and experimental conditions while the latter may indicate the formulation behaviors of interest. This approach is helpful in formulation prediction. Previous studies have successfully applied ML to predict the drug delivery systems, such as nanocrystals, solid dispersion, cyclodextrin complex, and self-emulsifying drug delivery systems (SEDDS). In the case of SEDDS, the trained ML model predicted the molar composition of oils, surfactants, and cosurfactant where they can form self-emulsion, based on their physicochemical properties input, which helped to choose proper excipients for SEDDS formulation.Molecular dynamic (MD) simulation is another computational tool that can visualize and investigate the interaction among ingredient molecules and the environment from a physicochemical view. MD method has become a helpful tool for pharmaceutical scientists to obtain a mechanistic understanding of formulation behaviors,. Previous studies applied the MD method to investigate topics such as the aggregation of polymer–siRNA complex and the dissolution of solid dispersion. In the case of the polymer–siRNA complex, the aggregation was simulated to be driven by the interaction between cationic groups on polymers and the negative backbone of siRNA. This aggregation stabilized the siRNA in the aqueous solution, reflected by the less altering major groove width of siRNA. Besides, the MD result also revealed the saturation molar ratio of polymers to siRNA. It resulted from mutually counteracting forces of electrostatic effect and steric crowding, which were influenced by siRNA length, cationic charge sites, and the shape of polymers.This work aimed to build an ML model to predict LNP formulations for mRNA vaccines against viruses. Data from publications were collected of build the model. A typical such study, includes the information of mRNA sequence synthesis, LNP preparation, treatment of subjects, and detection of the time course of binding IgG titer. The binding IgG titer is a surrogate of antibody concentration produced by the immune system after stimulation of antigen encoded by mRNA injected. The IgG titer is influenced by many factors dependent or independent of LNP formulation, as mentioned. Therefore, all information was needed to train an ML model, and the influence of LNP, mainly the ionizable structures, on IgG titer could be specifically untangled by the algorithm. Thus, the ML model was able to predict the LNP formulation. The prediction result was further validated by an in vivo experiment. Then, the MD simulation was used to investigate the interaction between mRNA molecules and lipid components in the microscope. This developed model will benefit the development of mRNA vaccines.
Materials and methods
ML modeling methods
Data collecting and cleaning
The data collecting and cleaning method are shown in Fig. 1. First, keywords of ‘mRNA’, ‘vaccine’, ‘virus’ were used to retrieve literature from Web of Science and Scopus. After the initial screen, 65 studies using lipid-based formulation were maintained.
Figure 1
Data collecting and cleaning process for machine learning (ML) work. (A) Data collecting and cleaning process. (B) The eventual dataset contained lipid nanoparticle (LNP) with seven kinds of ionizable lipids, including DLin-MC3-DMA (MC3), DLinDMA, L319, Lipid M, N, and Q, and SM-102,.
Data collecting and cleaning process for machine learning (ML) work. (A) Data collecting and cleaning process. (B) The eventual dataset contained lipid nanoparticle (LNP) with seven kinds of ionizable lipids, including DLin-MC3-DMA (MC3), DLinDMA, L319, Lipid M, N, and Q, and SM-102,.After analyzing the first extracted data from 22 studies, the data intended for ML work were simplified, considering the numerical balance between input and predicted parameters. Thus, the eventual dataset only contained mRNAs encoding a single antigen45, 46, 47, 48, 49, 50 and LNPs comprising DSPC, cholesterol, 1,2-dimyristoyl-rac-glycero-3-methoxypolyethylene glycol (PEG-DMG), and various ionizable lipids,,, (Fig. 1B). The included experiments all lasted for no longer than one year, with no more than two doses of vaccination and without virus challenge test.The input parameters or features include antigen protein type, self-amplifying,, cap type, pseudouridine modification, codon optimization, the molar ratio between nitrogen on the ionizable lipid to phosphate on RNA (N/P ratio), structure of ionizable lipid, ionizable lipid fraction, DSPC fraction, PEG–lipid fraction, cholesterol fraction, subject type, population or strain, injection route, log10 dose, the second vaccination time, and IgG titer test time. Parameterization of ionizable lipid structure can be seen in the next section. Whether or not the mRNA sequences functioning as self-amplifying, containing pseudouridine, and undergoing codon optimization were assigned ‘1’ or ‘0’. The antigen protein type, cap type, subject type (human, primate, mice, etc.), population or strain (adult or elderly human, C57BL/6 or BALB/c mice, etc.), and injection route were deemed as multi-categorical variables. The other parameters were numerical variables.The output or predicted parameter was decided to be the binding IgG titer because this index reflected antibody concentration induced by vaccines, and the related data was the richest. Binding IgG titer is usually tested by enzyme-linked immunosorbent assay (ELISA), and it is the dilution-fold of tested serum-containing antibodies that can still neutralize the antigen-coated at the bottom of a 96-well plate. A high titer means the serum still can neutralize the virus even if it has been diluted to a high-fold. The collected IgG titers only contained the assays with coated antigens corresponding to mRNAs used in vaccines. For studies against influenza,,, hemagglutination inhibition (HAI) titer is also often tested. Hemagglutination happens when red blood cells contact the influenza virus, and the addition of tested serum containing antibodies neutralizing the influenza virus would inhibit hemagglutination. Thus, HAI titer, similar to IgG titer, also reflects the antibodies' concentration. Analysis of our data found a linear relationship (Supporting Information Fig. S1) between two titers [log10(IgG titer) = 1.0286 × log10(HAI titer) + 1.4103, R = 0.7986, when HAI titer ≥1]. Thus, IgG titers transformed from HAI tests were also included if only HAI titers were available. The dose and the titers were transformed via log10 function to shape the distribution closer to normal. Eventually, there were a total of 325 samples in the dataset.
Structural representation of ionizable lipids
In this study, the extended connectivity fingerprints (ECFP) were introduced to represent the ionizable lipid structural characteristics. ECFP is a bit string constituted by ‘1’ or ‘0’. Each bit of ECFP corresponds to a set of chemical substructures, and the ‘1’ or ‘0’ indicates whether or not the compound contains it. ECFP shows good modeling fitting in cheminformatics and bioinformatics. The ionizable lipid ECFP (IL-ECFP) was generated by the RDKit package version 2020.09.1.0 in Python. Ionizable lipids used in mRNA vaccine formulation have long chains. Thus, the ECFP radius was set to 3 to cover a chain segment with up to 7 atoms, larger than the regular ECFP4 structure (ECFP with a radius of 2). The ECFP sequence length was set to 1024.
Data splitting strategy
The whole dataset was split into two sets of training (260 data points) and validation (65 data points). Stratified random sampling was adopted to keep the same proportion of data points in each mRNA vaccine formulation. The training set was used for training models, and the validation set was for turning hyperparameters to find the best model. Additionally, 10-fold cross-validation (CV) was used to evaluate the final generalization of machine learning models.
Evaluation criteria
Mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and determination coefficient (R) are the metrics for evaluating regression model performance. MAE measures the mean absolute error between real labels and predictions. MSE indicates the mean squared error between real labels and predictions. RMSE indicates the root mean squared error between real labels and predictions. R shows the correlation between real labels and predictions. They are defined as the following Eqs. (1), (2), (3), (4):where n is the number of data, is the ith real label, and is the ith prediction.
Hyper-parameters of lightGBM
In recent years, the techniques and applications of machine learning have been driven by algorithmic advances and data accumulation. Diverse machine learning algorithms and structures have been tested to fit different types of correlations,. The gradient boosting decision tree (GBDT) framework-based ensemble learning algorithms are shown to have superior accuracies in both classification and regression problems on tabular data with pre-engineered features. A typical algorithm is the lightGBM. In the present study, the model was constructed to predict the titer concentration of mRNA vaccine immunological performance. The model was established by the lightGBM package version 2.2.3 in Python. We searched 1000 hyperparameter combinations in the hyperparameter space. The hyperparameter configuration of lightGBM is that the learning rate is 0.018, the number of trees is 930, the subsample ratio is 0.783, the subsample ratio of columns is 0.394. The regularization terms (maximum of eight leaves for base learners and minimum of 12 samples in a leaf) were used to prevent the overfitting issue. The machine learning hyperparameters decide model selection and have impact on model generalization. A random search method was used for lightGBM. It has been shown that random search is more efficient than grid search and manual search.
Experimental methodology
mRNA synthesis
Our laboratory has established a system to conveniently test the LNP delivery efficiency using mRNAs encoding the extracellular segment of human angiotensin-converting enzyme 2 [ACE2 (18-615)], human IgG Fc fragment, and an HiBit tag. The ectodomain of ACE2 fused with IgG Fc fragment makes it a long-lasting secreted protein, which can be directly detected in the blood. A HiBit tag added to the C-terminus of the protein facilitates later protein detection.Our mRNA was synthesized in vitro by a T7 RNA polymerase mediated transcription system (IVT). The DNA template incorporates the 5ʹ and 3ʹ untranslated regions (UTRs) and a poly(A) tail. Pseudo UTP instead of natural UTP was used during IVT to reduce the immunogenicity of the mRNA. Cap1 is added co-transcriptionally to ensure the normal translation of mRNA. The pseudo UTP and cap1 were purchased from APExBIO Technology LLC. (Houston, TX, USA). We purified mRNA by oligo dT column, then diluted the mRNA in sodium citrate buffer to desired concentrations. The purity of mRNA was confirmed by gel electrophoresis.
LNP formulation and characterization
DLin-MC3-DMA (MC3) and SM-102 were purchased from APExBIO. Cholesterol was purchased from AVT (Shanghai) Pharmaceutical Tech Co., Ltd. (Shanghai, China). DSPC and PEG2000-DMG were purchased from Avanti Polar Lipids Inc. (Alabaster, AL, USA). Lipids were dissolved in ethanol at molar ratios of 50:10:38.5:1.5 (ionizable lipid/DSPC/cholesterol/PEG2000-DMG). The mRNA was diluted in sodium citrate buffer (pH 3.0) to desired concentrations for final N/P ratios 3:1 and 6:1, respectively. We gave high-pressure to mix mRNA solution and lipid solution rapidly through a T mixer. Formulations were dialyzed against PBS (pH 7.4) in a dialysis cassette for 20 h. After dialysis, LNPs were passed through a 0.22 μm filter, concentrated to a suitable concentration, stored at 4 °C, and used within a week. RiboGreen Assay from Invitrogen Corp. (Carlsbad, CA, USA) was used to quantify the mRNA in LNPs, particle size was determined by dynamic light scattering. The encapsulation efficiency of our LNP was around 90%, and the particle size was around 100 nm.
Animal studies
All animal experiments were performed under the ethical guidelines of Fudan University. Sixteen C57BL/6JGPt mice (eight weeks old, mixed-sex) were randomly divided into four groups, corresponding to four LNP formulations administrated (MC3-3:1, MC3-6:1, SM102-3:1, and SM102-6:1). LNPs diluted in PBS were injected into mice via the tail vein using a disposable syringe (15 μg mRNA/dose). Tail vein blood was taken at 0, 4, 8, 24, 48, 72, 96, 168 h after injection with capillaries. The serum was separated by centrifugation at 6000 rpm for 10 min. The ACE2 level in mice serum was measured using Nano-Glo® HiBit Lytic Detection System from Promega Corp. (Madison, WI, USA) following the manufacturer's recommendations. Then the luminescence signal was detected on the microplate reader.
Statistical analysis
Means were compared using the unpaired t-test, and the area under the curve (AUC) was calculated after 168 h of administration for all tests. Two-tailed P values < 0.05 were considered statistically significant and are shown in the figures as ∗P ≤ 0.05, ∗∗P ≤ 0.005, ∗∗∗P ≤ 0.001. Prism 8 (GraphPad Software, San Diego, CA, USA) was used.
MD modeling methods
Model building of the formulation systems
The all-atom dynamic simulation method, was performed to investigate the formation mechanism of LNPs. Molecular structures of two ionizable lipids (MC3 and SM-102), cholesterol, DSPC, and PEG2000-DMG, were manually built by Discovery Studio 2016 Client, as shown in Fig. 2. The mRNA nucleotide sequence consisted of the 32-mer poly(A) tail generated by the NAB package in AMBER (University of California, San Francisco, CA, USA). Poly(A) tail was chosen because it is generally added to all manufactured mRNA sequences. The length of mRNA sequence for simulation was decided because of the sizes of eventual simulated systems limited to computer capacity. These molecules constituted five different simulated systems in total (Table 1). At first, mRNA in an aqueous solution in the absence of lipids (referred to as mRNA system) was observed. Then, MD simulation simulated the other four systems that consisted of a single mRNA, ionizable lipids, cholesterol, DSPC, PEG2000-DMG, and water. Lipids were added at the molar ratio of ionizable lipid/DSPC/cholesterol/PEG2000-DMG = 50:10:38.5:1.5, and the N/P ratio was 3:1 or 6:1. Both composition ratio and N/P ratio were generally seen in our collected data. In the mRNA system, sodium counterions were used to ensure electrical neutrality, while in the other LNP systems, chlorine counterions were added to the system.
Figure 2
Three-dimensional structure of the MC3, SM-102, DSPC, cholesterol, and PEG2000-DMG.
Table 1
Molecules' number of the five simulated LNP systems.
Parameter
mRNA
MC3-3:1
MC3-6:1
SM102-3:1
SM102-6:1
N/P ratioa
NA
3:1
6:1
3:1
6:1
mRNAb
1
1
1
1
1
MC3
0
96
192
0
0
SM-102
0
0
0
96
192
Cholesterol
0
74
148
74
148
DSPC
0
19
38
19
38
PEG2000-DMG
0
3
6
3
6
Na+
31
0
0
0
0
Cl‒
0
65
162
65
161
Water
6190
69,747
98,183
68,936
97,877
The number ratio between the nitrogen groups of ionizable lipids to phosphate groups of mRNA sequence.
The mRNA consists of 32-mer of poly (A).
Three-dimensional structure of the MC3, SM-102, DSPC, cholesterol, and PEG2000-DMG.Molecules' number of the five simulated LNP systems.The number ratio between the nitrogen groups of ionizable lipids to phosphate groups of mRNA sequence.The mRNA consists of 32-mer of poly (A).
Simulation method
The detailed simulation method was similar to the previous study. All the simulations were carried out using AMBER 18 and AMBER Tools 18 software package. The FF14SB force field was used to model mRNA, and the GAFF force field was applied to model lipid molecules. For lipid molecules, the atom type and charge were described by the antechamber package. The conformation of initially constructed systems may be far from their equilibrated state, and molecules may arrange too close, inducing unreasonably high energy in systems. Thus, minimization of systems was needed before the simulation to relax the structure and remove unreasonable contacts. First, the solute molecules were constrained, and only water molecules were minimized for a short time. Then the whole system underwent 20,000 steps of minimization. After minimization, systems were heated, and the Langevin thermostat was used to maintain the temperature at 300 K, while the Berendsen barostat was used to keep the pressure at 1 atm. All the systems were equilibrated at least 100 ns with a time step of 2 fs to produce the simulated results.
Results
Data distribution and model performance of ML work
The data collected included LNP and mRNA information as input features and IgG titer induced by vaccines at corresponding time points as output parameters for prediction. The data distributions of these parameters are overall uniform (Figure 3, Figure 4).
Figure 3
Data distribution of 325 formulation datasets. Numerical counts of the eventual data dependent on disease and protein (A), subject type (B), population or strain (C), injection route (D), ionizable lipid type (E). In (A), H1N1 Cal and PR8 referred to strains A/California/07/2009 and A/Puerto Rico/8/1934, respectively; SARS-CoV-2 S-2P and RBD referred to the S protein with two substitutions of proline at 986 and 987 amino acid positions and receptor binding domain (RBD), respectively; and RSV mDS-Cav-1 referred to the full-length F protein respiratory syncytial virus (RSV) with four-point mutations,.
Figure 4
Data distribution of 325 formulation datasets. Numerical counts of the eventual data dependent on N/P ratio (A), log10(dose) (B), the second vaccination time (C), IgG titer test time (D), and log10(IgG titer) results (E) were given.
Data distribution of 325 formulation datasets. Numerical counts of the eventual data dependent on disease and protein (A), subject type (B), population or strain (C), injection route (D), ionizable lipid type (E). In (A), H1N1 Cal and PR8 referred to strains A/California/07/2009 and A/Puerto Rico/8/1934, respectively; SARS-CoV-2 S-2P and RBD referred to the S protein with two substitutions of proline at 986 and 987 amino acid positions and receptor binding domain (RBD), respectively; and RSV mDS-Cav-1 referred to the full-length F protein respiratory syncytial virus (RSV) with four-point mutations,.Data distribution of 325 formulation datasets. Numerical counts of the eventual data dependent on N/P ratio (A), log10(dose) (B), the second vaccination time (C), IgG titer test time (D), and log10(IgG titer) results (E) were given.Table 2 shows that the ML model using the LightGBM algorithm presents a good performance. The dataset was first divided into the training set and validation set, with samples of 260 and 65, respectively. After training and tuning hyperparameters, the model shows impressive predictivity. For the training and validation set, the MAE and RMSE are around 0.2 and 0.3 log10 units, respectively, corresponding to the error commonly seen in the experiments. The R is above 0.9, showing this model has covered major factors resulting in the variation in the IgG titer. Moreover, additional 10-fold cross-validation was performed. The whole dataset was divided into ten folds. One fold served as the validation set and the rest as the training set for each iteration, and this process was repeated ten times. The average results of 10 iterations are also presented in Table 2. Although the MAE, MSE, and RMSE are slightly larger than those from the first training, they also show an accurate predictivity on experimental value.
Table 2
Model performance of LightGBM.
Parameter
Training set
Validation set
10-fold cross-validation (mean ± SD)
Mean absolute error
0.220
0.278
0.303 ± 0.053
Mean squared error
0.092
0.139
0.178 ± 0.078
Root mean squared error
0.303
0.373
0.412 ± 0.086
R2
0.935
0.904
0.871 ± 0.061
Model performance of LightGBM.The next analysis determines the important input parameters or features that hugely influence the model. The top 7 important parameters are biological factors, including protein type, log10 dose, titer test time, population or strain, the second vaccination time, subject type, and injection route. The following parameters are formulation-related features, as shown in Fig. 5A. The codon optimization, self-amplifying, and uridine modification show the important role of mRNA sequence modification. Then, the formulation features of the N/P ratio and some IL-ECFPs present the LNP importance. The top 18 important positions among 1024 IL-ECFP and their corresponding specific substructure of ionizable lipid are shown in Fig. 5B. For the important 18 IL-ECFPs, 5 of them are contained in DLinDMA, while 7, 8, and 11 are contained in selective ionizable lipids MC3, L319, and SM-102. Compared to DLinDMA, MC3 contains a secondary ester linker (IL-ECFP 69 and 77). Compared to MC3, L319 contains a primary ester linker (IL-ECFP 147) in tails, replacing one double bond (IL-ECFP 12), and the chain after the double bond comprises six carbon atoms (IL-ECFP 46). SM-102, compared to MC3, has a hydroxy group (IL-ECFP 132) in the head and a primary (IL-ECFP 10) as well as a secondary ester (IL-ECFP 935) in tails. The distance from the nitrogen to the ester is five carbons (IL-ECFP 795) in one tail of SM-102. These features distinguish ionizable lipids from each other and are deemed essential and ranked in the model.
Figure 5
Features ranking and important substructure of ionizable lipids. (A) The top 25 important features related to the formulation. Importance times were recognized using the information gain (IG) values as a criterion from the lightGBM model. (B) The top 18 important IL-ECFP and their corresponding specific substructure of ionizable lipid. The center atom, recognizing length, and environmental information of each ECFP are indicated by the stressed blue area, black bonds, and grey bonds, respectively.
Features ranking and important substructure of ionizable lipids. (A) The top 25 important features related to the formulation. Importance times were recognized using the information gain (IG) values as a criterion from the lightGBM model. (B) The top 18 important IL-ECFP and their corresponding specific substructure of ionizable lipid. The center atom, recognizing length, and environmental information of each ECFP are indicated by the stressed blue area, black bonds, and grey bonds, respectively.
Experimental validation of the ML model
The ML model was validated by the animal test. LNPs of various formations (ionizable lipid as MC3 or SM-102, N/P ratio at 3:1 or 6:1) were used to intravenously deliver mRNAs encoding human ACE2 to mice. The ACE2 expression level was measured as the relative light unit (RLU) of the nanoluciferase enabled by the fused HiBit tag. Table 3 shows the characteristics of these LNPs. Four LNPs have around 90% encapsulation efficiency and a similar particle size of around 100 nm. Fig. 6 compares the prediction results and in vivo test. Fig. 6A shows that the MC3-based LNP is predicted to induce a higher titer value than that based on SM-102, but the N/P ratio does not influence the predicted titer. In the animal results of Fig. 6B, MC3-based LNP with N/P ratio at 6:1 resulted in an overall higher RLU than that based on SM-102, though there is no significant difference between them (Fig. 6C and D). LNPs based on two ionizable lipids with an N/P ratio at 3:1 show similar RLU.
Table 3
Encapsulation efficiency and particle size of LNPs.
Formulationa
MC3-3:1
MC3-6:1
SM102-3:1
SM102-6:1
Encapsulation efficiency
89.5%
91.3%
89.6%
91.2%
Particle size (nm)
97.8
101.0
101.2
106.3
LNP was formulated from ionizable lipid, DSPC, cholesterol, and PEG-lipid at a molar ratio of 50:10:38.5:1.5. The N/P ratio was 6:1 or 3:1.
Figure 6
Comparison between ML prediction and in vivo expression level. (A) Predicted log10(IgG titer) versus time profile of BALB/c mice induced by mRNA-LNP encoding S-2P protein of SARS-CoV2 at the dose of 20 μg by i.m. administration on Days 0 and 21. LNP consists of ionizable lipid, DSPC, cholesterol, and PEG-lipid at a molar ratio of 50:10:38.5:1.5. Ionizable lipids included MC3 and SM-102. The N/P ratio is 6:1 or 3:1. (B) Relative light unit (RLU) of HiBit tag versus time profiles in C57BL/6JGPt mice induced by mRNA-LNP encoding angiotensin-converting enzyme 2 (ACE2) following i.v. administration. The LNP formulations were the same as the prediction task. The difference in the maximum RLU at 8 h (C) and the AUC at 168 h (D) after administration were tested. Data are presented as mean ± SD (n = 4). ∗∗P ≤ 0.005. ns, not significant.
Encapsulation efficiency and particle size of LNPs.LNP was formulated from ionizable lipid, DSPC, cholesterol, and PEG-lipid at a molar ratio of 50:10:38.5:1.5. The N/P ratio was 6:1 or 3:1.Comparison between ML prediction and in vivo expression level. (A) Predicted log10(IgG titer) versus time profile of BALB/c mice induced by mRNA-LNP encoding S-2P protein of SARS-CoV2 at the dose of 20 μg by i.m. administration on Days 0 and 21. LNP consists of ionizable lipid, DSPC, cholesterol, and PEG-lipid at a molar ratio of 50:10:38.5:1.5. Ionizable lipids included MC3 and SM-102. The N/P ratio is 6:1 or 3:1. (B) Relative light unit (RLU) of HiBit tag versus time profiles in C57BL/6JGPt mice induced by mRNA-LNP encoding angiotensin-converting enzyme 2 (ACE2) following i.v. administration. The LNP formulations were the same as the prediction task. The difference in the maximum RLU at 8 h (C) and the AUC at 168 h (D) after administration were tested. Data are presented as mean ± SD (n = 4). ∗∗P ≤ 0.005. ns, not significant.
Investigating the molecular structure of mRNA LNPs by MD simulations
MD modeling was performed to investigate the interaction between lipids and mRNA in LNP formation. Fig. 7 shows the initial and final structure of a single mRNA sequence for 100 ns MD simulation, which shows that the mRNA sequence is folded in the water solution. Fig. 8 displays the final structure of four lipid systems (ionizable lipid as SM-102 or MC3, and N/P ratio of 3:1 or 6:1) for 200 ns MD simulation. The four systems self-assemble rapidly and form aggregates in the water solution, but the degree of aggregation is different. In the SM102-6:1 system with a ratio of 6:1, all molecules aggregate together. However, the other three systems form several clusters of different sizes. All LNPs formed show dense core structure. As for mRNA encapsulation, the SM102-6:1 system entraps a part of the mRNA sequence. However, in the rest of the lipid systems, the whole mRNA sequence is almost exposed to the aqueous solution. Besides, long mRNA in system MC3-6:1 attaches to multiple LNPs, while mRNA sequences in the other three systems only bind to a single LNP.
Figure 7
The snapshots of mRNA structure at the initial time (A) and 100 ns of simulation (B).
Figure 8
The snapshots of four lipid systems for 200 ns MD simulation: (A) SM102-3:1; (B) SM102-6:1; (C) MC3-3:1; (D) MC3-6:1; water molecules were not displayed in the figure. Red represents mRNA; purple represents SM-102 ionizable lipid; blue represents MC3 ionizable lipid; yellow represents cholesterol; cyan represents DSPC; green represents PEG2000-DMG.
The snapshots of mRNA structure at the initial time (A) and 100 ns of simulation (B).The snapshots of four lipid systems for 200 ns MD simulation: (A) SM102-3:1; (B) SM102-6:1; (C) MC3-3:1; (D) MC3-6:1; water molecules were not displayed in the figure. Red represents mRNA; purple represents SM-102 ionizable lipid; blue represents MC3 ionizable lipid; yellow represents cholesterol; cyan represents DSPC; green represents PEG2000-DMG.To show how mRNA interacts with LNP, the simulation results were re-colored with nitrogen atoms on the ionizable lipid highlighted in Fig. 9. It shows that all lipid molecules aggregate together to form the LNPs, and nitrogen groups of cationic lipids preferentially locate at the surface of LNP. The mRNA molecule twines around LNP by two possible mechanisms. On the one hand, the nucleosides of mRNA are direct to or lie on the LNP by the hydrophobic interaction. On the other hand, the phosphate groups of the mRNA backbone are close to the nitrogen atoms of LNP due to the electrostatic effect.
Figure 9
The snapshots of four lipid systems for 200 ns MD simulation: (A) SM102-3:1; (B) SM102-6:1; (C) MC3-3:1; (D) MC3-6:1; water molecules were not displayed in the figure. Yellow: mRNA sequence. Blue: nitrogen on the ionizable lipids.
The snapshots of four lipid systems for 200 ns MD simulation: (A) SM102-3:1; (B) SM102-6:1; (C) MC3-3:1; (D) MC3-6:1; water molecules were not displayed in the figure. Yellow: mRNA sequence. Blue: nitrogen on the ionizable lipids.Fig. 10 shows the quantitative analysis of four lipid systems during 200 ns MD simulation. The RMSD profile indicates that four lipid systems reach a stable state after about 50 ns. The surface areas of mRNA exposed to water in the systems decrease with time, which indicates the encapsulation of mRNA molecule to LNP. The mass-weighted radius of gyration (Rg) vs. time of the whole system represents that aggregation of SM102-3:1 and SM102-6:1 is more obvious than those of MC3-3:1 and MC3-6:1. The density profile of the SM102-6:1 system shows a high-intensity peak at about 100 Å. In contrast, more than one peak is observed in the other three systems, and the MC3-6:1 system shows a wide distribution. These results imply that SM102-6:1 is compact, while MC3-6:1 is relatively loose.
Figure 10
Quantitative analysis of four lipid systems during 100 ns MD simulation. (A) Root mean square displacement (RMSD) vs. time. (B) Solvent accessible surface area of the mRNA sequence vs. time. (C) Mass-weighted radius of gyration (Rg) vs. time. (D) Density profile of a system as a function of the distance from the geometric center of the system.
Quantitative analysis of four lipid systems during 100 ns MD simulation. (A) Root mean square displacement (RMSD) vs. time. (B) Solvent accessible surface area of the mRNA sequence vs. time. (C) Mass-weighted radius of gyration (Rg) vs. time. (D) Density profile of a system as a function of the distance from the geometric center of the system.
Discussion
Currently, the selection of the ionizable lipid has attracted significant attention for optimizing the LNP formulation for mRNA delivery. Since traditional screening tests often consume a lot of time and materials, computational tools that can accelerate the development should be valuable. The present work builds an ML model with good prediction performance, which correlates the critical substructures of ionizable lipids to the in vivo potency (IgG titer) of mRNA vaccines to help the choice of ionizable lipids.More importantly, the importance of features is ranked. The 18 critical IL-ECFPs among 1024 are identified in Fig. 5B, representing the cationic head group, ester linker, and tail of ionizable lipids. A small head group, such as IL-ECFP 160, combined with two relatively large dilinoleyloxy tails (IL-ECFP 162 and 171) may behave in a “cone” shape and facilitate the formation of hexagonal HII phase when contacting with endosomal membrane, disrupting the bilayer structure and release the RNA therapeutics. These substructures are the symbols of DLinDMA, which is one ionizable lipid that turned out to be highly effective in the early stage. DLinDMA was then optimized to DLin-KC2-DMA29 and further MC328 by substituting with a second ester linker (IL-ECFP 69) distant from the nitrogen at three carbons length (IL-ECFP 77), which are deemed more important than IL-ECFP 160 and 162. The original research also shows an improvement in potency by this optimization. However, the adverse effect is also commonly seen when administrated with MC3-containing LNP because of its low biodegradability. Maier et al. developed biodegradable L319 by substituting one double bond with an ester linker (IL-ECFP 147 and 12), assuming this compound could be metabolized by hydrolysis and β-oxidation. Sabnis et al. developed SM-102 from a similar compound to MC3 by obtaining the balance between the lipid pKa,, potency, and metabolism behavior. SM-102 has a head group as IL-ECFP 132 and two ester linkers, IL-ECFP 10 and 935. The distribution of esters in two tails maintains the pKa within the desired range, the side chain resulted from the secondary ester may facilitate the “cone” shape, and the distance from the nitrogen to the ester (IL-ECFP 795 in SM-102) also impacts the metabolism. Both L319 and SM-102 also show high in vivo potency. The present AI model recognizes these IL-ECFPs above as important substructures, though the mechanism is not reflected in the collected data. Besides, notice that the ECFPs presented here are just the top 18 important ones, but there are 1024 ECFPs in total. The efficiency of ionizable lipids should be more dependent on the sum importance of all ECFPs.The validation of the ML model against the in vivo test result also proves our model with some suggestive ability about ionizable lipids (Fig. 6). We modified the mRNA of human ACE2 to encode a secreted protein, and therefore the ACE2 expression can be directly detected from the blood samples, which is a straightforward and undisturbed way to assess the efficiency of LNP carriers. The animal test shows that ACE2 expression level induced by LNP at N/P ratio of 6:1 is higher than that at the ratio of 3:1, consistent with the previous finding that the higher N/P ratio induces more potency,,. The result also shows that the MC3-based LNP induces more expression than that based on SM-102 at the N/P ratio of 6:1. A reasonable assumption is that expression level is positively correlated to IgG titer, which conforms to the prediction by the ML model that MC3 induces higher IgG titer than SM-102 at an N/P ratio of 6:1. However, low biodegradability correlates to side effects, making the choice of ionizable lipid complicated. In fact, it is SM-102 that is formulated in mRNA-1273 vaccine. Besides, the model predicts IgG titers for two N/P ratios are similar. It is due to that the N/P ratio is little varied for one kind of ionizable lipid, and the ML model is difficult to discriminate the impact of the N/P ratio from ionizable lipid. Inputting more diverse data can address this issue easily.The structure of LNP is another important topic concerned in the pharmaceutical field. The cryo-TEM images have shown that LNPs are overall in dense core, or lamellar, structure. However, visualization of LNP structure at a molecular level often relies on modeling method70, 71, 72. In the present study, we performed an MD simulation of LNP entrapping mRNA (Fig. 8). During the simulation, the firstly dispersed lipids aggregate to form small dense core particles. The mRNA can twine around or be partly entrapped in these lipid particles. Besides, mRNA twining around multiple particles is also possible. Analysis of the aggregation behavior (Fig. 10) shows that system SM102-6:1 (SM-102-based, N/P ratio 6:1) converges most rapidly and forms the most compact structure, while system MC3-6:1 forms a relatively loose structure. Interestingly, increasing the lipid content (N/P ratio from 3:1 to 6:1) makes the SM-102 system more compressed but loosens the MC3 system. These results indicate that both ionizable lipid type and N/P ratio influence the LNP aggregation behavior. Our simulated LNP structure agrees with another all-atom modeling result by Rissanou et al., who have simulated the aggregation behavior of mRNA and so-called DML lipid molecules, wherein mRNA also twines around the LNPs.The interaction between the phosphate on mRNA and the nitrogen on ionizable lipids is of research interest. The electrostatic effect between the two kinds of molecules is presumed to promote the mRNA binding to LNP. Our modeling result suggests that during the LNP formation, lipids aggregate first, and mRNA twines around LNP with its phosphate groups getting close to nitrogen atoms. This binding is also helped by the hydrophobic effect implemented on the nucleosides of mRNA, resulting in those nucleosides generally directing to or lying on the LNP. Besides, hydrogen bonds is reported to be another potential factor facilitating this binding.The all-atom dynamic simulation provides rich insights into the LNP formation mechanism. However, limited to computational capacity, this method can only handle small simulated systems. Considering scaling up the system based on the mechanism of particle formation deduces theoretical structures of LNP as Fig. 11. At the first stage of mixing, lipids in the vicinity should aggregate to form small clusters and attach to mRNA in the line. This step is supported by the MD modeling (Fig. 8D and Ref.). The lipid clusters tend to fuse by their nature to reduce the surface energy, but the cluster volume cannot grow unlimitedly since the main stuff, the ionizable lipid, is amphipathic. Thus, the cluster either grow along the mRNA (Fig. 11C) or enlarge like a liposome particle (Fig. 11D), and this deduces two theoretical LNP structure (Fig. 11E and F), respectively. The tunnel organization of the LNP core proposed in Fig. 11E is supported by the transmission electron microscopy (TEM) image, which shows the LNP core texture as many arranged channels. These channels correspond to hexagonal phase rods, as reported. The liposome-like volume proposed in Fig. 11F is also recorded by TEM and is called “blebs”. The occurrence of this structure seems to depend on PEG content, DSPC content, and the type of nucleic acids entrapped. It seems that long nucleic acids, such as DNA and mRNA, are likely to induce the blebs. It is reasonable since longer nucleic acids form a larger obstacle in the system, which may induce clusters fusing less randomly in the space and result in a heterogeneous particle. As for the other three lipids in LNP, PEG-lipid, and DSPC, mainly located at the exterior, while the cholesterol helping to constitute the core, are also evidenced,.
Figure 11
The evolution of lipids fusion and theoretical structure of mRNA LNP system. (A) At the initial mixing stage, lipids form many small clusters and attach along the mRNA sequence by electrostatic effect. (B) The clusters getting close tend to fuse into a bigger cluster to decrease the surface energy. The tails of lipid in these clusters are reduced for clarity. Then, more clusters participate in the fusion to form a long lipid particle (C) or liposome-like particle (D). If the fusion primarily results in long lipid particles, they should form tube structures in the core of LNP (E). Otherwise, lipid fusion leading to liposome-like particles produces LNP containing a large chamber filled with aqueous phase (F). The DSPC and PEG should locate at the exterior of LNP while cholesterols insert in the interval between lipids.
The evolution of lipids fusion and theoretical structure of mRNA LNP system. (A) At the initial mixing stage, lipids form many small clusters and attach along the mRNA sequence by electrostatic effect. (B) The clusters getting close tend to fuse into a bigger cluster to decrease the surface energy. The tails of lipid in these clusters are reduced for clarity. Then, more clusters participate in the fusion to form a long lipid particle (C) or liposome-like particle (D). If the fusion primarily results in long lipid particles, they should form tube structures in the core of LNP (E). Otherwise, lipid fusion leading to liposome-like particles produces LNP containing a large chamber filled with aqueous phase (F). The DSPC and PEG should locate at the exterior of LNP while cholesterols insert in the interval between lipids.In this study, the ML model was built to predict the formulation of mRNA vaccines, and the MD method was used to investigate the LNP formation. The application domain is a critical issue for an ML model. Our model was trained on data from ionizable lipids across a long history. These lipids contain important substructures such as the tertiary amine, hydroxy group, ester bond, secondary ester bond, and dienyloxy chain, which have been represented as ECFPs and can be combined to build other new and typical lipids. Thus, the coverage of ionizable lipids in our model is extended, benefiting the formulation selection, which is the primary target of our study. On the other hand, though we have collected data as much as possible, the resulted data size is still relatively small, and the searched antigens cover just several diseases, which narrows the range that uses this model to predict IgG titer for specific diseases. Besides, small data size impedes the further analysis about the dependence of LNP formulation on specific diseases. More data of diverse diseases and formulations are desired in the future to expand its application domain and refine the design of LNP for specific diseases. As for MD simulation, the current simulated systems are relatively small in scale. To our best knowledge, modeling on LNP at a typical size with good stability (60 nm) has not been published. Thus, the real internal structure of an LNP can only be deduced. The recently reported ML-based MD modeling method has dramatically increased the scale of simulated systems, which may become a powerful tool in LNP modeling in the future. Last, the current study has not revealed the relationship between MD results and LNPs' pharmacological effect. More MD modeling associated with data science technology is promising to deal with this issue.
Conclusions
The first ML model has been successfully developed to predict the LNP formulations with the IgG titer of the mRNA vaccine, which is validated by in vivo test on the ACE2 expression. The ML model also recognizes important substructures of ionizable lipids. The MD model is used to investigate the aggregation behavior and the molecular structure of LNP. The integrated computational methodology is able to design better ionizable lipid, which serves a constructive role in the formulation development of nucleic acids therapeutics.
Authors: Gina Cannarozzi; Gina Cannarrozzi; Nicol N Schraudolph; Mahamadou Faty; Peter von Rohr; Markus T Friberg; Alexander C Roth; Pedro Gonnet; Gaston Gonnet; Yves Barral Journal: Cell Date: 2010-04-16 Impact factor: 41.582
Authors: Robert A Feldman; Rainard Fuhr; Igor Smolenov; Amilcar Mick Ribeiro; Lori Panther; Mike Watson; Joseph J Senn; Mike Smith; Ӧrn Almarsson; Hari S Pujar; Michael E Laska; James Thompson; Tal Zaks; Giuseppe Ciaramella Journal: Vaccine Date: 2019-05-10 Impact factor: 3.641
Authors: Akin Akinc; Michael Goldberg; June Qin; J Robert Dorkin; Christina Gamba-Vitalo; Martin Maier; K Narayanannair Jayaprakash; Muthusamy Jayaraman; Kallanthottathil G Rajeev; Muthiah Manoharan; Victor Koteliansky; Ingo Röhl; Elizaveta S Leshchiner; Robert Langer; Daniel G Anderson Journal: Mol Ther Date: 2009-03-03 Impact factor: 11.454
Authors: Kizzmekia S Corbett; Barbara Flynn; Kathryn E Foulds; Joseph R Francica; Seyhan Boyoglu-Barnum; Anne P Werner; Britta Flach; Sarah O'Connell; Kevin W Bock; Mahnaz Minai; Bianca M Nagata; Hanne Andersen; David R Martinez; Amy T Noe; Naomi Douek; Mitzi M Donaldson; Nadesh N Nji; Gabriela S Alvarado; Darin K Edwards; Dillon R Flebbe; Evan Lamb; Nicole A Doria-Rose; Bob C Lin; Mark K Louder; Sijy O'Dell; Stephen D Schmidt; Emily Phung; Lauren A Chang; Christina Yap; John-Paul M Todd; Laurent Pessaint; Alex Van Ry; Shanai Browne; Jack Greenhouse; Tammy Putman-Taylor; Amanda Strasbaugh; Tracey-Ann Campbell; Anthony Cook; Alan Dodson; Katelyn Steingrebe; Wei Shi; Yi Zhang; Olubukola M Abiona; Lingshu Wang; Amarendra Pegu; Eun Sung Yang; Kwanyee Leung; Tongqing Zhou; I-Ting Teng; Alicia Widge; Ingelise Gordon; Laura Novik; Rebecca A Gillespie; Rebecca J Loomis; Juan I Moliva; Guillaume Stewart-Jones; Sunny Himansu; Wing-Pui Kong; Martha C Nason; Kaitlyn M Morabito; Tracy J Ruckwardt; Julie E Ledgerwood; Martin R Gaudinski; Peter D Kwong; John R Mascola; Andrea Carfi; Mark G Lewis; Ralph S Baric; Adrian McDermott; Ian N Moore; Nancy J Sullivan; Mario Roederer; Robert A Seder; Barney S Graham Journal: N Engl J Med Date: 2020-07-28 Impact factor: 91.245
Authors: Kimberly J Hassett; Kerry E Benenato; Eric Jacquinet; Aisha Lee; Angela Woods; Olga Yuzhakov; Sunny Himansu; Jessica Deterling; Benjamin M Geilich; Tatiana Ketova; Cosmin Mihai; Andy Lynn; Iain McFadyen; Melissa J Moore; Joseph J Senn; Matthew G Stanton; Örn Almarsson; Giuseppe Ciaramella; Luis A Brito Journal: Mol Ther Nucleic Acids Date: 2019-02-07
Authors: Lindsey R Baden; Hana M El Sahly; Brandon Essink; Karen Kotloff; Sharon Frey; Rick Novak; David Diemert; Stephen A Spector; Nadine Rouphael; C Buddy Creech; John McGettigan; Shishir Khetan; Nathan Segall; Joel Solis; Adam Brosz; Carlos Fierro; Howard Schwartz; Kathleen Neuzil; Larry Corey; Peter Gilbert; Holly Janes; Dean Follmann; Mary Marovich; John Mascola; Laura Polakowski; Julie Ledgerwood; Barney S Graham; Hamilton Bennett; Rolando Pajon; Conor Knightly; Brett Leav; Weiping Deng; Honghong Zhou; Shu Han; Melanie Ivarsson; Jacqueline Miller; Tal Zaks Journal: N Engl J Med Date: 2020-12-30 Impact factor: 91.245
Authors: Kapil Bahl; Joe J Senn; Olga Yuzhakov; Alex Bulychev; Luis A Brito; Kimberly J Hassett; Michael E Laska; Mike Smith; Örn Almarsson; James Thompson; Amilcar Mick Ribeiro; Mike Watson; Tal Zaks; Giuseppe Ciaramella Journal: Mol Ther Date: 2017-04-27 Impact factor: 12.910