
Large-Scale G Protein-Coupled Olfactory Receptor-Ligand Pairing.

Xiaojing Cong¹, Wenwen Ren², Jody Pacalon¹, Rui Xu³, Lun Xu⁴, Xuewen Li³, Claire A de March⁵, Hiroaki Matsunami⁵, Hongmeng Yu^4,6,7, Yiqun Yu^4,6, Jérôme Golebiowski^1,8.

Abstract

G protein-coupled receptors (GPCRs) conserve common structural folds and activation mechanisms, yet their ligand spectra and functions are highly diverse. This work investigated how the amino-acid sequences of olfactory receptors (ORs), the largest GPCR family, encode diversified responses to various ligands. We established a proteochemometric (PCM) model based on OR sequence similarities and ligand physicochemical features to predict OR responses to odorants using supervised machine learning. The PCM model was constructed with the aid of site-directed mutagenesis, in vitro functional assays, and molecular simulations. We found that the ligand selectivity of the ORs is mostly encoded in the residues up to 8 Å around the orthosteric pocket. Subsequent predictions using Random Forest (RF) showed a hit rate of up to 58%, as assessed by in vitro functional assays of 111 ORs and 7 odorants of distinct scaffolds. Sixty-four new OR–odorant pairs were discovered, and 25 ORs were deorphanized here. The best model demonstrated a 56% deorphanization rate. The PCM-RF approach will accelerate OR–odorant mapping and OR deorphanization.

Entities: Chemical

Year: 2022 PMID: 35350604 PMCID: PMC8949627 DOI: 10.1021/acscentsci.1c01495

Source DB: PubMed Journal: ACS Cent Sci ISSN: 2374-7943 Impact factor: 14.553

Introduction

Decoding the sequence–function relationship of proteins is extremely challenging. Slight changes in the sequence may significantly affect the function, whereas proteins with low sequence identity may exhibit similar functions. G protein-coupled receptors (GPCRs) are among the most remarkable examples of this phenomenon. They are the largest membrane protein family and the targets for about 40% of marketed drugs.[1] The human genome contains over 800 genes coding for GPCRs,[2] which exert differentiated and specific functions in the complex cellular signaling network. Half of these genes encode olfactory receptors (ORs) that endow us with fascinating capacities of odor discrimination.[3] Mammalian GPCRs conserve a typical structure of seven transmembrane helices (7TM) that house an orthosteric ligand-binding pocket.[4] They show a conserved signaling mechanism that involves large-scale conformational changes to accommodate their cognate G proteins. The mechanism is encoded in conserved motifs throughout the 7TM, which form a network of inter-TM contacts converging at the cytoplasmic side.[5] Specifically, the “D(E)RY”, “CWLP”, and “NPxxY” motifs in TM3, TM6, and TM7, respectively, are the most conserved hubs of the allosteric communication between the orthosteric pocket and the cytoplasmic side of class A GPCRs.[4] The orthosteric pocket, by contrast, has diversified extensively, resulting in huge variations in the receptors’ function. This study focuses on the functional heterogeneity of ORs and how it is encoded in the OR sequences. ORs discriminate a vast spectrum of volatile molecules (odorants) and encode innumerable odors perceived in the brain. 
The many-to-many relationships between ORs and odorants are key to understanding odor perception.[6] Although odorant-binding proteins (OBPs) also contribute to odor detection, they are abundant extracellular proteins that participate in perireceptor events by selecting/carrying odorants.[7,8] Currently, OR–odorant interactions are mostly measured in heterologous cells, especially for human ORs, which neglects the effect of OBPs. ORs are also expressed ectopically, and some have emerged as appealing drug targets.[9−12] We sought to predict OR responses to various odorants using OR sequence alignment, proteochemometrics (PCM),[13] and machine learning. The PCM model was based on the OR sequence similarities and the chemical features of the odorants. Sequence-based approaches can handle large protein families and circumvent the difficulties in obtaining high-resolution structures, as is the case for ORs. Machine learning models using protein sequences and ligand chemical similarities have shown great success in predicting drug–target interactions, as reviewed in refs (14−16). Attempts to predict OR responses to odorants have also achieved encouraging results.[17−20] However, data scarcity in the immense odor space is a major bottleneck for good predictivity. To date, fewer than 50% of human ORs (hORs) and 20% of mouse ORs (mORs) have been deorphanized, with fewer than 250 odorants (Table S1). One effective way to handle data scarcity is dimension reduction, such as by selecting relevant residues in the OR sequences (so-called feature selection). A recent study on insect and mammalian ORs demonstrated that selecting subsets of 20 residues could indeed increase the model predictivity.[20] However, if one assumes that a given function is mostly encoded by 20 residues out of a GPCR sequence of ∼300 residues, the binomial coefficient 300!/[20!(300 – 20)!] gives more than 10^30 possible combinations. 
Therefore, selecting relevant residues is key to constructing an effective model. Like other GPCRs, ORs respond to their ligands via allosteric mechanisms, which involve distinct intertwined factors: ligand affinity, intrinsic stability of different receptor states, as well as long-range allosteric coupling between the ligand-binding pocket and the cytoplasmic side.[21] Ligand affinity is thought to be dictated by the residues outlining the binding pocket.[22,23] ORs that respond to the same odorants share higher sequence homology around the pocket than in the rest of the receptor sequence.[18] However, the OR response to odorants can be drastically altered by mutations that are distant from the pocket.[24] Selecting the relevant residues is therefore nontrivial. Here, we combined molecular modeling, site-directed mutagenesis with in vitro functional assays, and machine learning to identify the most relevant residues. PCM modeling and random forest (RF) were employed to predict OR responses to prototypical odorants using the relevant residues. Finally, in vitro functional assays were performed to assess the selection of relevant residues as well as the predictivity of the PCM-RF model. This approach (outlined in Figure 1A) largely outperformed existing models by enabling knowledge-based residue selection. It illustrated how the functional heterogeneity of G protein-coupled ORs is encoded in the sequence.
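
The combinatorial estimate given in the Introduction (choosing 20 residues out of about 300) can be checked directly. The snippet below is only an illustrative calculation; the variable names are ours and the sequence length of 300 is the approximate value quoted in the text.

```python
import math

# Approximate GPCR sequence length and residue-subset size quoted in the text.
n_residues = 300
subset_size = 20

# Number of unordered 20-residue subsets: 300! / (20! * (300 - 20)!)
n_subsets = math.comb(n_residues, subset_size)

print(f"C({n_residues}, {subset_size}) is about {n_subsets:.2e}")
```

The result lies between 10^30 and 10^31, which is why an exhaustive search over residue subsets is not practical and a knowledge-based selection is used instead.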

Figure 1

Machine learning protocol and residue selection. (A) Machine learning workflow, in which different residue subsets were extracted from the sequence alignment for the training of different models. The PCM approach combined the OR sequence features, the ligand physicochemical features, and the response data (if available) of each OR–ligand pair. (B) Available site-directed mutagenesis data (including literature data, summarized in ref (24)) projected on the 3D model of mOR256-31. Residues in dark red and red belong to poc17 and poc20, respectively. (C) Matthews correlation coefficient (MCC)[28] and hit rate of the RF classifiers on the in vitro test set.

Results

Database of OR–Odorant Pairs for Model Training

We compiled all of the literature data on in vitro dose-dependent responses of hORs and mORs to diverse odorants. These include 1293 responsive OR–odorant pairs consisting of 390 ORs and 244 odorants. In addition, we included more than 14 400 OR–odorant pairs which have been reported to be nonresponsive in vitro. The database (Data File S1) contains 720 distinct ORs (including 318 orphan ORs) and 244 odorants. Four odorants were considered here as test cases: acetophenone, coumarin, R-carvone, and 4-chromanone. They have been associated with many ORs (dozens to hundreds) in previous studies (Table 1). To enlarge the training set, we also included the data of 6 additional odorants that have similar chemical structures to the 4 target odorants.

Table 1

Chemical Structure, PubChem CID, and Training Data^a of the Query Odorants (in Bold) and Their Analogues

^a P: number of responsive (positive) ORs. N: number of nonresponsive (negative) ORs. See Data File S1 for the lists of ORs.

Selection of Relevant Residues

Molecular Modeling

Given the existing knowledge of GPCR structures, we first searched for odorant-binding residues within the orthosteric ligand-binding pocket. The mouse OR mOR256-31 (gene name Olfr263) was chosen as a prototype, since it is a broadly tuned receptor which responds to three of the four odorants (coumarin, R-carvone, and acetophenone).[25,26] We built a 3D homology model of mOR256-31 bound to the odorants using our previously established approaches and molecular dynamics simulations.[24,25,27] The 3D model was built under the constraints of conserved amino-acid motifs and site-directed mutagenesis data covering nearly 50% (95 residues) of the TM domain.[24] Seventeen residues were identified within a 5 Å distance of the bound odorants (Table S2). Fourteen of these residues had been shown to be important for OR responses to odorants by site-directed mutagenesis (Table S2). These 17 residues were assumed to be in direct contact with the odorants (named poc17 hereafter, Figure 1B). However, the relevant residues likely extend well beyond the binding pocket alone.

Site-Directed Mutagenesis

Twenty-four point mutations were generated within and around poc17 of mOR256-31. Their impact on the receptor’s response to five ligands was measured by in vitro dose-dependent responses (Figure S1). We projected the mutational effect onto the 3D model of mOR256-31, together with all of the OR mutations reported in the literature (Figure 1B). Twenty residues, including poc17 and 3 peripheral residues (Figure 1B), delineated a larger orthosteric pocket (poc20). Mutations within poc20 consistently affected the response to most of the odorants. Beyond the region of poc20, the mutational effect was less systematic (Figure 1B). To determine the best subset of residues for predicting OR responses to odorants, we proceeded empirically. Namely, we selected 5 small-to-large residue subsets as heuristics, based on the above results: poc17, poc20, poc27, poc60, and TM191. poc27 and poc60 are extensions of the pocket to 6 and 8 Å from the bound odorant, containing 27 and 60 residues, respectively (Figure 1C and Table S3). TM191 contains the whole 7TM region made up of 191 residues. Machine learning models were then built with these residue subsets to compare their predictive power.
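
The distance-based pocket definitions described above (5, 6, and 8 Å around the bound odorant) can be illustrated with a small sketch. This is not the authors' pipeline: the function name `residues_within`, the residue numbers, and all coordinates are invented for illustration, and a real analysis would read atom positions from the simulated receptor–odorant complex.

```python
import math

def residues_within(residue_atoms, ligand_atoms, cutoff):
    """Return sorted residue ids with any atom within `cutoff` angstroms of any ligand atom."""
    selected = set()
    for res_id, atoms in residue_atoms.items():
        if any(math.dist(atom, lig) <= cutoff for atom in atoms for lig in ligand_atoms):
            selected.add(res_id)
    return sorted(selected)

# Toy coordinates in angstroms (hypothetical, for illustration only).
ligand = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)]
residues = {
    104: [(3.0, 0.0, 0.0)],   # 1.6 A from the nearest ligand atom
    108: [(0.0, 5.5, 0.0)],   # 5.5 A
    203: [(0.0, 0.0, 7.5)],   # 7.5 A
    255: [(12.0, 0.0, 0.0)],  # 10.6 A, outside every shell
}

# Nested shells analogous to the 5, 6, and 8 A pocket definitions.
shells = {cutoff: residues_within(residues, ligand, cutoff) for cutoff in (5.0, 6.0, 8.0)}
for cutoff, members in shells.items():
    print(f"within {cutoff} A: {members}")
```

Each larger cutoff yields a superset of the smaller one, which mirrors how the 6 and 8 Å subsets extend the 5 Å contact residues.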

PCM and Machine Learning

From the sequence alignment of hORs and mORs, each of the 5 heuristic residue subsets was extracted. PCM models were constructed using the data in Table 1 and physicochemical features of the odorants (see the Materials and Methods section). Each OR–odorant pair was labeled with the in vitro response (responsive or nonresponsive). We trained and assessed supervised support vector machine (SVM) and RF classifiers using 5-fold cross validation. The response probability of each OR–odorant pair was predicted, and a probability >0.5 was classified as responsive. The predictivity was measured by the Matthews correlation coefficient (MCC).[28] RF performed better than SVM. The predictivities of the five RF classifiers were not significantly different from one another. However, they were clearly superior to a naive statistical inference (Figure S2A; see the Supplementary Methods section for the calculation of the statistical inference). The poc60 classifier performed the best on average (Figure S2A, Data File S2A,B). Control models built with 60 randomized residues, as expected, showed no predictivity (Figure S2A). To determine the best residue subset, we constructed five final RF classifiers (poc17, poc20, poc27, poc60, and TM191) using 100% of the data in Table 1. Each classifier was then used to screen for new ORs for acetophenone, R-carvone, coumarin, and 4-chromanone. The in silico screening was performed on 360 ORs (223 hORs and 138 mORs), including 346 orphan ORs. Each classifier predicted and ranked the probabilities of the ORs to respond to each of the 4 odorants (Data File S2C).
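
To make the PCM representation concrete, the sketch below builds one feature vector per OR–odorant pair by concatenating descriptors of the selected receptor residues with descriptors of the odorant, and applies the 0.5 probability threshold described above. It is a minimal illustration under stated assumptions: the descriptor table `AA_FEATURES`, the receptor and odorant names, and all numerical values are hypothetical, and no classifier is trained here (the study used RF and SVM models in R).

```python
from itertools import product

# Hypothetical two-number descriptor per amino acid (illustrative values only).
AA_FEATURES = {"A": (1.8, 88.6), "F": (2.8, 189.9), "S": (-0.8, 89.0), "D": (-3.5, 111.1)}

def pcm_vector(pocket_residues, odorant_features):
    """Concatenate descriptors of the aligned pocket residues with the odorant descriptors."""
    vector = []
    for amino_acid in pocket_residues:
        vector.extend(AA_FEATURES[amino_acid])
    vector.extend(odorant_features)
    return vector

def classify(probability, threshold=0.5):
    """Label a pair as responsive when the predicted probability exceeds the threshold."""
    return "responsive" if probability > threshold else "nonresponsive"

# Toy inputs: residues at three aligned pocket positions, and two odorant descriptors each.
ors = {"OR_a": "AFS", "OR_b": "DFS"}
odorants = {"odorant_x": (120.15, 1.6), "odorant_y": (146.14, 1.4)}

# One vector for every OR-odorant combination (the proteochemometric cross product).
pairs = {
    (or_name, odorant_name): pcm_vector(pocket, odorants[odorant_name])
    for (or_name, pocket), odorant_name in product(ors.items(), odorants)
}

for key, vector in pairs.items():
    print(key, vector)
print(classify(0.73), classify(0.31))
```

Because the receptor part of each vector comes only from the aligned pocket positions, changing the residue subset (for example from 17 to 60 positions) changes the vector length but not the overall scheme.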

In Vitro Assessment of Relevant Residues

We tested the predictions of all five classifiers in cell functional assays. For each model, we tested all ORs in the responsive class (predicted response probability >0.5 for any odorant) as well as 60 negative control ORs (response probability <0.5 for all odorants). These ORs were tested against all 4 odorants. For instance, in the case of poc60, we tested all 20 ORs in the responsive class and 60 randomly picked negative controls from the nonresponsive class (Figure 2A). Similar tests were performed on the other four models (Figure S3 and Table S4, Data File S2C,D). When significant responses were observed at 300 μM, dose-dependent responses were measured. Otherwise, the OR–odorant pair was considered nonresponsive. The poc60 classifier performed the best on the in vitro test set (Figure 1C). It showed 0.39–0.60 hit rates and 0.43–0.48 predictivity (MCC) for the 4 odorants (Table 2). Therefore, in vitro data confirmed that poc60 is the most relevant residue subset to decode the receptor’s response to odorants. These residues show very low conservation in hORs and mORs (Figure S2B), suggesting that they have diversified to adapt to various ligands.[22,23] This implies that amino acid conservation in the OR sequences contains essential information on their functionality. Thus, we tested an additional model using the amino acid conservation in the TM region. This model turned out to be nearly as predictive as the one using the amino acid physicochemical features (Figure 1C). This indicates that the type of features used to describe the amino acids is not critical, as long as the features sufficiently convey the sequence differences to the machine learning algorithm.

Figure 2

In vitro evaluation of machine learning predictions of OR responses to odorants. (A) All of the OR–odorant pairs were ranked by the predicted probability to be responsive. The initial model assessments focused on four odorants. 20 responsive and 60 nonresponsive ORs (negative controls) predicted by the poc60 model were selected for functional assays. Heatmaps show the in vitro EC50 values, in which the false predictions are labeled with ×. Assessments of the other models are provided in Figure S3. (B) In vitro assessment of the poc60 model predictivity for acyclic odorants. (C) Dose-dependent response curves of all of the responsive OR–odorant pairs identified in this study. Error bars indicate SEM (n = 3–6).

Table 2

Performance of the poc60 Model in Predicting New OR–Odorant Pairs^a

	initial test odorants				additional test odorants
metrics^b	acetophenone	R-carvone	coumarin	4-chromanone	citral	nonanal	nonanoic acid
MCC	0.47	0.45	0.43	0.48	0.24	0.48	0.40
hit rate (precision)	0.39	0.60	0.58	0.60	0.50	0.50	0.25
recall (sensitivity)	0.78	0.46	0.47	0.50	0.25	0.67	1.00
F1 score	0.52	0.52	0.52	0.55	0.33	0.57	0.40
specificity	0.85	0.94	0.92	0.94	0.93	0.88	0.65
AUC	0.84	0.72	0.72	0.74	0.58	0.66	0.74
true positives	7	6	7	6	1	2	2
true negatives	60	63	60	64	14	14	11
false positives	11	4	5	4	1	2	6
false negatives	2	7	8	6	3	1	0

^a See Data File S2C for the raw data.

^b See the Methods section in the SI for the definitions.
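
The threshold-based metrics in Table 2 follow from the four confusion-matrix counts in its last rows. The sketch below assumes the standard textbook definitions (the paper's own definitions are in the SI, which is not reproduced here); the function name `confusion_metrics` is ours. AUC is omitted because it requires the ranked probabilities rather than the counts.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)            # reported as "hit rate" in Table 2
    recall = tp / (tp + fn)               # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "MCC": mcc,
        "hit rate": precision,
        "recall": recall,
        "F1": f1,
        "specificity": specificity,
    }

# Counts taken from the acetophenone column of Table 2.
acetophenone = confusion_metrics(tp=7, tn=60, fp=11, fn=2)
for name, value in acetophenone.items():
    print(f"{name}: {value:.2f}")
```

With the acetophenone counts, these definitions give values that round to the entries in the corresponding column of the table, which suggests the assumed definitions match those used for the table.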

Assessment of Model Utility

Applicability to Other Odorants

While 50% of hORs and 20% of mORs have been deorphanized at the time of this study, only a tiny fraction of the odorant chemical space (<250 odorants) has been tested. The lack of data on odorants is a major constraint on the model utility. To explore this limitation, we generated a learning curve of the poc60 model predictivity on the external test set versus the amount of training data used (Figure S4A). The learning curve suggested that a meaningful prediction could be obtained for an odorant with ∼15 known ORs. In the current database containing 244 odorants, only 17 (7%) met this criterion, 11 of which contained aromatic or cyclic structures. We tested three more odorants that contain alkyl chains: citral, nonanal, and nonanoic acid. Following the same procedure, we tested in vitro all 11 ORs that were predicted to respond to any of the three odorants as well as 8 negative control ORs (Figure 2B). Because the training data lacked responsive ORs for these odorants, the model predicted fewer responsive pairs than for the 4 cyclic odorants. In vitro assays showed that the model performed well on nonanal and nonanoic acid but not on citral (Table 2). The poor predictivity on citral was likely due to the lack of analogues (thus the lack of data) in the training set (Table 1) and the fact that citral is a mixture of two isomers, which adds ambiguity to the available data. The results demonstrate that the model is generalizable to odorants of different chemical groups, provided that enough training data are available for the odorants in question or their close analogues.

General Model Performance

We evaluated the general performance of the poc60 model on all of the external test set data, including those tested for the other models and for citral. The test set data were shuffled and split into 5 folds, as in a cross validation. The model predictivity was consistent across the 5 folds of the data set, which gave 0.39–0.46 hit rates and 0.32–0.34 MCC (Table S6). Blind OR–odorant screening hit rates in Hana3A cells are expected to be lower than 0.1, as in a pioneering study on 245 hORs and 219 mORs against 93 odorants.[19] Note that the odorants tested here might be more promiscuous than average, since the model requires training data for the query odorants or their analogues. Our test set was also richer in responsive ORs (26%) than the natural pool of ORs (e.g., 13% in ref (19)), despite the large number of negative-control ORs included. Since many ORs fail to express on the membrane of heterologous cells, it is difficult to estimate the general response rate of ORs to various odorants. The total external test sets in this work contained 111 ORs and 438 OR–odorant pairs. We identified 64 new OR–odorant pairs with EC50 values in the micromolar to millimolar range, corresponding to 29 ORs (Figure 2C, Figure S3 and Table S5). Twenty-five ORs were deorphanized in this study, including 9 from the negative control groups. Nevertheless, the deorphanization rate is significantly higher in the predicted positive groups than in the negative control groups (Figure S4B): 56% and 15%, respectively, for the poc60 model.

Utility for New ORs and Odorants

One important aspect of the model utility is its predictivity on new ORs and odorants that are not part of the training set. Since 56 out of the 95 ORs in the external test set are “new”, we recalculated the model performance metrics for this part of the test set. The model still showed good predictivity compared to the full test set (Table S7). The model predictivity on new odorants was evaluated by the following test: we excluded the 7 odorants one by one from the training set, retrained the model, and calculated the performance metrics on the test set containing only the excluded odorant. In this case, the model only showed predictivity for the cyclic odorants acetophenone, R-carvone, and 4-chromanone (Table S8). Therefore, the application to new odorants is currently limited by the lack of training data, as already discussed above. New data will gradually enable the application to more odorants. Currently, the model is readily applicable to new ORs for which there are no training data.
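
The leave-one-odorant-out test described above can be sketched as a simple data split. This is an illustrative outline only: the record layout `(OR, odorant, label)`, the function name, and the toy records are assumptions, and the retraining and scoring steps are left out.

```python
def leave_one_odorant_out(training, external_test, query_odorants):
    """For each query odorant, drop it from the training data and keep only it in the test data."""
    for held_out in query_odorants:
        train = [record for record in training if record[1] != held_out]
        test = [record for record in external_test if record[1] == held_out]
        yield held_out, train, test

# Toy records: (OR name, odorant name, responsive?). All values are hypothetical.
training = [
    ("OR_a", "acetophenone", True),
    ("OR_b", "acetophenone", False),
    ("OR_a", "coumarin", True),
    ("OR_c", "nonanal", False),
]
external_test = [
    ("OR_x", "acetophenone", True),
    ("OR_y", "coumarin", False),
    ("OR_z", "coumarin", True),
]

splits = list(leave_one_odorant_out(training, external_test, ["acetophenone", "coumarin"]))
for held_out, train, test in splits:
    # A model would be retrained on `train` and scored on `test` at this point.
    print(held_out, len(train), len(test))
```

In each split the held-out odorant never appears in the training records, so any predictivity must come from its analogues in the training set.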

Discussion

This work illustrates how the G protein-coupled ORs’ response to ligands can be decoded from their sequence. Sixty residues around the odorant-binding pocket contain the highest signal-to-noise ratio and dictate the variation in the ORs’ response to the odorants (Figure 3). The ligand-binding pocket of GPCRs has diversified extensively during evolution to discriminate various stimuli. It is not surprising that the ORs’ response to the odorants could be predicted by using less than 20% of the sequence, made up of highly variable residues. The results validate previous predictions of pocket residues based on OR sequence analysis[22,23] and numerous site-directed mutagenesis data,[23,24] which are located in the upper portion of TM3 and TM5–TM7. Here, we highlight 4 residues in TM2 near a conserved allosteric site (centered at D2.50). The allosteric site in nonolfactory class A GPCRs (typically composed of D2.50, N3.35, and S3.39) is known to bind the Na+ ion, which modulates the receptors’ activation and affinity/response to ligands (reviewed in ref (29)). Most ORs contain a second acidic residue (E3.39) at this site, which might also accommodate divalent cations.[29] While copper ions play important roles in the recognition of sulfur odorants,[30,31] it remains unclear whether this conserved site in the ORs is involved. The machine learning model established here outperformed existing models using full sequences.[17,19] The pocket residues are essential for understanding how chemically similar odorants are differentiated by the OR family with such high specificity/selectivity.

Figure 3

Location of the residues that best encode OR responses to ligands, illustrated with mOR256-31. Conserved motifs in ORs are squared. The N- and C-termini are truncated for clarity.

So far, research focusing on specific OR–ligand recognition has mostly employed molecular modeling (e.g., homology modeling, docking, and molecular simulations) verified by site-directed mutagenesis and functional assays of individual ORs, such as the studies reviewed in ref (32), as well as the more recent work on hOR1A1 for R-/S-carvone enantiomers,[33] hOR5AN1 and mOR215-1 for musk odorants,[34] zebrafish ORs for bile acids/salts,[35] and a virtual screening for new mOR-EG ligands.[36] This approach provides valuable insights into OR–ligand recognition and will continue to generate data for new ORs and ligands. Since it relies on experimental data to generate predictive molecular models, this approach is not suitable for large-scale OR–ligand pairing. The molecular modeling process can be automated to enable large-scale studies;[37] however, the performance has yet to be tested. Ligand QSAR/SAR models using machine learning have also been adopted to predict new OR ligands.[38,39] This approach allows a rapid virtual screening of large compound databases and is widely used in drug design and drug toxicity prediction.[40] It is limited to the target receptor and the chemical scaffolds of the known ligands. Nevertheless, its application to ORs will gradually enrich ligand data and ease the data bottleneck of our PCM model. The machine learning PCM approach established here is readily applicable to the entire mammalian OR family. It will significantly accelerate OR–ligand mapping and OR deorphanization. It is an open loop process where newly identified OR–odorant pairs can be added to continuously improve the model. Because we optimized the model to maximize the hit rate (to reduce the cost of in vitro assays), this came at the cost of false negatives (Figure S4C). 
Therefore, repeating the prediction–test loop is necessary to rescue the false negatives by injecting new training data. Note that the lack of response of many orphan ORs might be due to impaired functions in heterologous cells, e.g., lack of cell surface expression.[41] For instance, ∼30% of the mORs responding to acetophenone in vivo did not show significant responses in heterologous cells.[18] Such cases may be present among the nonresponsive ORs in the in vitro test set, in a proportion that is difficult to estimate. This approach is mostly applicable to large protein families like GPCRs or promiscuous proteins, such as functionally related enzymes,[34] odorant/pheromone-binding proteins in insects,[35] intrinsically disordered protein regions,[36] as well as GPCR-G protein binding partners.[37] The approach focuses on the sequence of the binding region, which overcomes the difficulties in obtaining high-resolution structures or full sequence alignments. It may find applications in, for example, predicting off-target activities in drug design, targeting insect pheromone receptors for pest control, or studies of protein–protein interactions and protein evolution. It requires sequence alignment and a number of known ligands as input data. The selection of relevant residues is an important step, as it enables knowledge-based human intervention to reduce the dimensionality and enhance machine learning on scarce data. Combining in vitro functional assays, site-directed mutagenesis, knowledge of GPCR structures and sequences, as well as molecular modeling, we could generate heuristics to decipher how nature has encoded the specific functions of ORs into their varied sequences. The model is currently limited to the transmembrane domain where the sequence alignment has been established. The loop regions may be addressed for OR subfamilies for which good sequence alignments can be obtained. 
The discovery of residue subsets associated with given functions could indicate evolutionary hotspots and complement existing tools such as phylogenetic analysis based on full sequences.

Materials and Methods

Chemicals and OR Constructs

Odorants were purchased from Sigma-Aldrich. They were dissolved in DMSO to make stock solutions at 1 mM and then freshly diluted in optimal MEM (ThermoFisher) to prepare the odorant stimuli. The OR constructs were kindly provided by Dr. Hanyi Zhuang (Shanghai Jiaotong University, China). Site-directed mutants were constructed using the QuikChange site-directed mutagenesis kit (Agilent Technologies). The sequences of all plasmid constructs were verified by both forward and reverse sequencing (Sangon Biotech, Shanghai, China). The primers used in this study are listed in Table S9.

Cell Culture and Transfection

We used Hana3A cells, a HEK293T-derived cell line that stably expresses receptor-transporting proteins (RTP1L and RTP2), receptor expression-enhancing protein 1 (REEP1), and olfactory G protein (Gαolf).[42] The cells were grown in MEM (Corning) supplemented with 10% (v/v) fetal bovine serum (FBS; ThermoFisher), 100 μg/mL penicillin–streptomycin (ThermoFisher), 1.25 μg/mL amphotericin (Sigma-Aldrich), and 1 μg/mL puromycin (Sigma-Aldrich). All constructs were transfected into the cells using Lipofectamine 2000 (ThermoFisher). Before the transfection, the cells were plated on 96-well plates (NEST) and incubated overnight in MEM with 10% FBS at 37 °C and 5% CO2. For each 96-well plate, 2.4 μg of pRL-SV40, 2.4 μg of CRE-Luc, 2.4 μg of mouse RTP1S, and 12 μg of receptor plasmid DNA were transfected. The cells were subjected to a luciferase assay 24 h after transfection.

Luciferase Assay

The luciferase assay was performed with the Dual-Glo luciferase assay kit (Promega) following the protocol in ref (42). OR activation triggers the Gαolf-driven AC-cAMP-PKA signaling cascade, which leads to phosphorylation of CREB. Activated CREB induces luciferase gene expression, which can be quantified luminometrically [measured here with a bioluminescence plate reader (MD SPECTRAMAX L)]. Cells were cotransfected with firefly and Renilla luciferases, where firefly luciferase served as the cAMP reporter. Renilla luciferase is driven by a constitutively active simian virus 40 (SV40) promoter (pRL-SV40; Promega) and served as a control for cell viability and transfection efficiency. The ratio of firefly to Renilla luciferase luminescence was measured. Normalized OR activity was calculated as (LN – Lmin)/(Lmax – Lmin), where LN is the luminescence in response to the odorant, and Lmin and Lmax are the minimum and maximum luminescence values on a plate, respectively. The assay was carried out as follows: 24 h after transfection, the medium was replaced with 100 μL of odorant solution (at different doses) diluted in optimal MEM (ThermoFisher), and cells were further incubated for 4 h at 37 °C and 5% CO2. After incubation in lysis buffer for 15 min, 20 μL of Dual-Glo luciferase reagent was added to each well of a 96-well plate, and firefly luciferase luminescence was measured. Next, 20 μL of Stop-Glo luciferase reagent was added to each well, and Renilla luciferase luminescence was measured. The data analysis followed the published procedure in ref (42). Three-parameter dose–response curves were fitted with GraphPad Prism 8.
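
The normalization formula above, and the kind of curve that is then fitted, can be written out as a short sketch. The function names and example readings are hypothetical. The three-parameter form shown (bottom, top, EC50, with the Hill slope fixed at 1) is an assumption about the equation behind the fits, since the text states only that three-parameter curves were fitted.

```python
def luciferase_ratio(firefly, renilla):
    """Firefly signal (cAMP reporter) divided by Renilla signal (transfection control)."""
    return firefly / renilla

def normalized_activity(l_n, l_min, l_max):
    """(LN - Lmin) / (Lmax - Lmin), with Lmin and Lmax taken from the same plate."""
    return (l_n - l_min) / (l_max - l_min)

def three_parameter_response(concentration, bottom, top, ec50):
    """Assumed three-parameter dose-response curve (Hill slope fixed at 1)."""
    return bottom + (top - bottom) * concentration / (ec50 + concentration)

# Hypothetical readings for one well and the plate-wide extremes of the ratio.
ratio = luciferase_ratio(firefly=500.0, renilla=100.0)
activity = normalized_activity(ratio, l_min=1.0, l_max=9.0)
print(f"ratio = {ratio}, normalized activity = {activity}")

# At a concentration equal to the EC50, the response is halfway between bottom and top.
half_max = three_parameter_response(100.0, bottom=0.0, top=1.0, ec50=100.0)
print(f"response at EC50 = {half_max}")
```

Normalizing against the plate minimum and maximum puts all wells on a 0 to 1 scale, which makes curves from different plates comparable before fitting.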

Molecular Modeling

Homology models of mOR256-3, mOR256-8, and mOR256-31 were built using the approach in our previous work.[24,27] Four experimental structures of class A GPCRs were used as templates, namely, rhodopsin (1U19), CXCR4 (3ODU), A2aR (2YDV), and CXCR1 (2LNL), to build 100 models with Modeller v9.15.[43] For docking, we chose the model with the lowest DOPE score. Autodock Vina[44] and the Haddock 2.2 Web server[45] were used to identify a common top-ranked binding pose for each odorant. Residues in the putative ligand-binding pocket were treated as flexible during docking. Enhanced-sampling all-atom molecular dynamics simulations were performed in an explicit POPC membrane bilayer (see the Methods section in the SI for details). A cluster analysis of the ligand-binding pose was carried out on the simulation trajectories using the Gromacs Cluster tool. The middle structure of the most populated cluster was selected as the final binding pose.

Proteochemometric Machine Learning Model

We assembled the response data of 720 ORs and 244 odorants from the literature to construct the training set (Data File S1). Ambiguous data records (i.e., OR responses without clear dose-dependent data) were discarded. The full training set contained 1293 responsive OR–odorant pairs (composed of 392 ORs and 244 odorants) and 14 459 OR–odorant pairs that have been reported to be nonresponsive in vitro (composed of 550 ORs and 127 odorants, including 318 orphan ORs). Each OR–odorant pair was represented by a vector composed of physicochemical descriptors (features) of the OR sequence and the odorant (see the Methods section in the SI for details). The OR–odorant pairs in the training set were labeled “positive” or “negative” according to the response data for supervised machine learning. The test set was constructed in the same manner without labels. The test set contained 360 ORs (including 346 orphan ORs) available in our laboratory, paired with the 7 odorants tested in this study. RF and SVM classification models were built with the caret package in R.[46] RF performed better than SVM and was chosen for the final model. The R code generated during this study is available as a Jupyter notebook, along with the input and output data, at https://github.com/chemosim-lab/OlfactoryReceptors under the GNU General Public License v3.0. The Jupyter notebook illustrates, step by step, the model building, training, and in vitro assessment. The process is illustrated in Figure S2A. More details can be found in the Methods section in the SI.

Safety Statement

No unexpected or unusually high safety hazards were encountered.

45 in total

1. Combinatorial receptor codes for odors.

Authors: B Malnic; J Hirono; T Sato; L B Buck
Journal: Cell Date: 1999-03-05 Impact factor: 41.582

2. Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

Authors: B W Matthews
Journal: Biochim Biophys Acta Date: 1975-10-20

3. Evaluating cell-surface expression and measuring activation of mammalian odorant receptors in heterologous cells.

Authors: Hanyi Zhuang; Hiroaki Matsunami
Journal: Nat Protoc Date: 2008 Impact factor: 13.491

Review 4. Class A GPCRs: Structure, Function, Modeling and Structure-based Ligand Design.

Authors: Xiaojing Cong; Jeremie Topin; Jerome Golebiowski
Journal: Curr Pharm Des Date: 2017-11-16 Impact factor: 3.116

5. Odorant Receptor 7D4 Activation Dynamics.

Authors: Claire A de March; Jérémie Topin; Elise Bruguera; Gleb Novikov; Kentaro Ikegami; Hiroaki Matsunami; Jérôme Golebiowski
Journal: Angew Chem Int Ed Engl Date: 2018-03-24 Impact factor: 15.336

6. Structural determinants of a conserved enantiomer-selective carvone binding pocket in the human odorant receptor OR1A1.

Authors: Christiane Geithe; Jonas Protze; Franziska Kreuchwig; Gerd Krause; Dietmar Krautwurst
Journal: Cell Mol Life Sci Date: 2017-06-27 Impact factor: 9.261

7. Crucial role of copper in detection of metal-coordinating odorants.

Authors: Xufang Duan; Eric Block; Zhen Li; Timothy Connelly; Jian Zhang; Zhimin Huang; Xubo Su; Yi Pan; Lifang Wu; Qiuyi Chi; Siji Thomas; Shaozhong Zhang; Minghong Ma; Hiroaki Matsunami; Guo-Qiang Chen; Hanyi Zhuang
Journal: Proc Natl Acad Sci U S A Date: 2012-02-10 Impact factor: 11.205

8. The G-protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints.

Authors: Robert Fredriksson; Malin C Lagerström; Lars-Gustav Lundin; Helgi B Schiöth
Journal: Mol Pharmacol Date: 2003-06 Impact factor: 4.436

9. Computational modeling of the olfactory receptor Olfr73 suggests a molecular basis for low potency of olfactory receptor-activating compounds.

Authors: Shuguang Yuan; Thamani Dahoun; Marc Brugarolas; Horst Pick; Slawomir Filipek; Horst Vogel
Journal: Commun Biol Date: 2019-04-23

10. Humans can discriminate more than 1 trillion olfactory stimuli.

Authors: C Bushdid; M O Magnasco; L B Vosshall; A Keller
Journal: Science Date: 2014-03-21 Impact factor: 47.728
