The United Nations Office on Drugs and Crime has designated several "legal highs" as "plants of concern" because of the dangers associated with their increasing recreational abuse. Routine identification of these products is hampered by the difficulty in distinguishing them from innocuous plant materials such as foods, herbs, and spices. It is demonstrated here that several of these products have unique but consistent headspace chemical profiles and that multivariate statistical analysis processing of their chemical signatures can be used to accurately identify the species of plants from which the materials are derived. For this study, the headspace volatiles of several species were analyzed by direct analysis in real-time high-resolution mass spectrometry (DART-HRMS). These species include Althaea officinalis, Calea zacatechichi, Cannabis indica, Cannabis sativa, Echinopsis pachanoi, Lactuca virosa, Leonotis leonurus, Mimosa hostilis, Mitragyna speciosa, Ocimum basilicum, Origanum vulgare, Piper methysticum, Salvia divinorum, Turnera diffusa, and Voacanga africana. The results of the DART-HRMS analysis revealed intraspecies similarities and interspecies differences. Exploratory statistical analysis of the data using principal component analysis and global t-distributed stochastic neighbor embedding showed clustering of like species and separation of different species. This led to the use of supervised random forest (RF), which resulted in a model with 99% accuracy. A conformal predictor based on the RF classifier was created and proved to be valid for a significance level of 8% with an efficiency of 0.1, an observed fuzziness of 0, and an error rate of 0. The variables used for the statistical analysis processing were ranked in terms of the ability to enable clustering and discrimination between species using principal component analysis-variable importance of projection scores and RF variable importance indices. 
The variables that ranked the highest were then identified as m/z values consistent with molecules previously identified in plant material. This technique therefore shows proof-of-concept for the creation of a database for the detection and identification of plant-based legal highs through headspace analysis.
While significant attention
has been given in recent years to the
surge of the opioid epidemic, the dramatic increase in the abuse of
unregulated psychoactive plants remains troublesome. The rising concern
is such that the United Nations Office on Drugs and Crime (UNODC)
has designated 20 species as plants of concern.[1] These plants are perceived by users to be a safer and
more natural alternative for achieving altered states of consciousness than
synthetic drugs. Products derived from these materials are readily
available through Internet commerce and are difficult to regulate
in large part because of the challenge of distinguishing them from
innocuous plant materials such as food, spices, and medicinal herbs.
Examples of such species include Salvia divinorum and Turnera diffusa, both endemic
to Central and South America, and Mitragyna speciosa, native to Southeast Asia. These drugs are bulk-shipped into
the United States in large containers
and are often purposefully mislabeled. Because of the difficulty in
identifying them, it is impossible for border protection agents to
assess the veracity of the species identity listed on the product
labels. In principle, a technique that could be exploited for the
identification of these materials is headspace analysis. This approach
would be successful if the plant materials exhibit headspace volatiles
profiles that are consistent for a given plant material but distinct
from the headspace of others. The number of studies that have explored
this hypothesis is limited. A few reports have shown that a handful
of psychoactive materials can be detected and identified through the
use of headspace analysis, including cocaine and 3,4-methylenedioxymethamphetamine.[2−4] Additional studies have shown that cannabis can also be detected
and identified through the use of headspace analysis by targeting
specific compounds.[2,5−7] This technique
has also been applied to innocuous plant materials including basil
and oregano,[8−10] but the exploration of this approach for the identification
of psychoactive plant-based legal highs has not been reported.

Should it be demonstrated that plant materials exhibit fingerprint
profiles that are diagnostic for a given species, it should be possible
to create a database of these against which the headspace of unknown
materials can be screened to make an identification. The feasibility
of creating such a database hinges on being able to generate hundreds
of replicates of the requisite data and the development of an appropriate
statistical analysis approach for classification. In this regard,
the utilization of ambient ionization mass spectral techniques such
as direct analysis in real-time high-resolution mass spectrometry
(DART-HRMS) shows significant promise in promoting the rapid analysis
of samples to generate the data required to create a robust database.
For example, previous research shows that DART-HRMS can be used for
the identification of different forensically relevant samples including
entomological specimens and condom residue evidence, based on the
ability to rapidly generate large replicate datasets.[11,12] DART-MS analysis facilitated by the concentration of analytes on
solid supports (e.g., sorbents) has previously been reported.[13] Furthermore, the blending of these techniques
with headspace collection specifically has been used to detect reaction
intermediates induced by plant defense mechanisms in Mimosa pudica roots,[14] as well as for the study of the volatile profiles of beers.[15]

Herein, we describe a proof-of-concept
for the identification of
plant-based legal highs through the use of sorbent-facilitated DART-HRMS
analysis and multivariate statistical analysis processing of the generated
data.
Results and Discussion
The overall approach that was
devised to accomplish the identification
of plant-based materials from headspace analysis is presented in Scheme 1. To assess whether
the headspace of psychoactive legal highs exhibits consistent and
diagnostic chemical signatures, the headspace volatiles of 11 plant-based
legal highs identified by the UNODC as plants of concern, as well
as two nonpsychoactive controls (Ocimum basilicum and Origanum vulgare), were sampled
using solid-phase microextraction (SPME) fibers, which were subsequently
analyzed by DART-HRMS in positive-ion mode. Bulk materials derived
from plant parts that have historically been used for their psychoactive
effects were analyzed (e.g., Mimosa hostilis (ground leaves), Voacanga africana (ground root bark), T. diffusa (ground
leaves), and Piper methysticum (ground
leaves)). The headspace of each sample was concentrated on poly(dimethylsiloxane)
(PDMS) SPME fibers for 30 min, and this was followed by the analysis
of the adsorbed compounds by DART-HRMS (Scheme 1, step 1). Direct DART-HRMS
analyses of the bulk plant material were also performed to
enable comparison with the headspace results. A representative example
of the mass spectra generated from these DART-HRMS experiments is
shown in Figure 1,
while the DART-HRMS spectra of the headspace and plant-based legal
high material are presented in Figure S1. The spectra of the headspace of cannabis species are presented
in Figure S2.
Scheme 1
Steps in the Workflow for the Species Identification of Psychoactive
Plants Based on Chemometric Processing of DART-HRMS Data Acquired
from Headspace Analysis
Figure 1
Representative DART mass spectra for the headspace (left panel) and plant material (right panel) analysis of Calea zacatechichi.

In Figures 1 and S1, the left panels
show the spectra of the headspace profiles, while those on the right
are of the spectra obtained from direct analysis of the bulk material.
The results of these analyses revealed two trends. First, multiple
replicates of the material of the same species, even when acquired
from different sources, exhibited similar headspace small-molecule
profiles, and this was also true for the direct analysis of the bulk
plant material. For example, the headspace spectra of Calea zacatechichi (Figure 1) all contained the m/z values (± 0.005) of 120.0970, 170.1527, and 219.1011,
and the bulk material spectra all contained the m/z values (± 0.005) of 137.1318, 203.1768,
and 219.1011. Not only did each of the respective spectra have similar m/z values, but they also had similar mass
spectral patterns that were unique to that species. The trends seen
for C. zacatechichi were also observed
for the other species analyzed in this study (Figures S1 and S2). This indicated that the replicates representing
different samples of the same species showed diagnostic intraspecies
similarities and differentiating interspecies distinctions. Second,
while there was some duplication of compounds between the plant material
and the headspace constituents, the spectra of the two were markedly
different. For example, both the mass spectra of the headspace and
plant material of C. zacatechichi (Figure 1) contain the high-resolution m/z value 219.1011 (± 0.005), which
has been identified as representing protonated euparone [(C12H10O4) + H+] based on its fragmentation
pattern.[16] Similar findings were observed
with the other plant species analyzed (Figure S1). This observation was anticipated, since by and large,
the headspace signatures would be composed of the subset of compounds
contained within the plant materials that are volatilized under ambient
conditions. Interestingly, it was also observed that the headspace
profiles of two different strains of cannabis could be distinguished
visually (Figure S2). This aligns with
previously reported observations.[5,7]

In the
spectra of each species (both plant material and headspace),
several of the observed high-resolution masses could be correlated
to formulas that were consistent with compounds well known to be present
in the plant. For example, m/z 137.1330
(± 0.005) is consistent with terpene compounds known to be present
in T. diffusa,[17] S. divinorum,[18] and other plant species.[19] The m/z value 149.0966 (± 0.005) found
in O. basilicum corresponds to C10H13O, which is consistent with the presence of
protonated estragole, a compound previously shown to be present
in the plant material.[20]

The observations
of consistent intraspecies similarities and interspecies
differences set the stage for the successful development of a database
and a corresponding statistical analysis model that could serve as
a screening device against which the chemical fingerprints of unknowns
could be compared for species identification purposes. To study the
possibility of utilizing plant material headspace for differentiation
between species, a mass data matrix that aligned plant-derived DART-HRMS
spectra according to common m/z values
was created and subjected to statistical analysis processing methods.
Thus, as indicated in Scheme 1 (step 1), the mass spectral data from 15 species (in replicates
of 10 each, resulting in a total of 150 spectra) were first binned
and normalized, yielding a 150 × 355 data matrix (355 represents
the total number of m/z values).
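The binning and normalization step can be sketched as follows. This is a minimal illustration only: the bin width of 0.01 and the normalization of each spectrum to its base peak are assumptions, since the text does not state the parameters that were used, and the function names and the two short demo spectra are hypothetical.

```python
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.01):
    """Sum the intensities of peaks whose m/z values fall in the same bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity  # integer bin index
    return dict(binned)


def build_matrix(spectra, bin_width=0.01):
    """Align spectra on a shared m/z axis; one row per spectrum."""
    binned = [bin_spectrum(s, bin_width) for s in spectra]
    columns = sorted({idx for b in binned for idx in b})
    matrix = []
    for b in binned:
        row = [b.get(idx, 0.0) for idx in columns]
        top = max(row)
        # Assumed normalization: scale each spectrum to its base peak.
        matrix.append([v / top if top > 0 else 0.0 for v in row])
    mz_axis = [idx * bin_width for idx in columns]
    return mz_axis, matrix


# Hypothetical two-spectrum demo (the study used 150 spectra).
demo_spectra = [
    [(137.1330, 50.0), (219.1011, 100.0)],
    [(137.1318, 20.0), (149.0966, 80.0)],
]
mz_axis, matrix = build_matrix(demo_spectra)
print(len(matrix), "spectra x", len(mz_axis), "m/z bins")
```

With the full dataset, the same procedure would produce a matrix with one row per spectrum and one column per retained m/z bin, analogous to the 150 × 355 matrix described above.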
Then, as indicated in Scheme 1 (step 2), principal component analysis (PCA) and global t-distributed stochastic neighbor embedding (g-SNE), as unsupervised methods, were applied to explore and visualize
the structure inherent in the data and to reveal clustering of species
within a lower-dimensional space. With PCA, the data were resolved
into scores and loadings. Figure 2 illustrates the three-dimensional (3-D) score plot (along
principal components (PCs) 1–3). These three PCs explained
∼46% of the variance of the data. Each point in the plot corresponds
to a sample, and the distances between points reveal the relative
level of similarity and dissimilarity between samples. For ease of
visualization, each species is represented by a color. From the plot,
a clear separation of the species O. basilicum, O. vulgare, Leonotis
leonurus, Cannabis sativa, Cannabis indica, T. diffusa, P. methysticum, and Althaea officinalis is readily
apparent. The rectangular panel embedded within the plot is a magnification
of the upper-left quadrant and shows that the species M. hostilis, M. speciosa, S. divinorum, Lactuca
virosa, V. africana, C. zacatechichi, and Echinopsis pachanoi are clustered together.
Figure 2
3-D scores plot featuring principal components (PCs) 1–3 derived from principal component analysis (PCA) of DART-HRMS data generated by analysis of the headspace of each of the indicated species. The score plot displays clear separation for species O. basilicum, O. vulgare, L. leonurus, C. sativa, C. indica, T. diffusa, P. methysticum, and A. officinalis. The inset, which is enclosed in the smaller rectangle, is expanded for ease of visualization to further illustrate the relationships between the clustered species M. hostilis, M. speciosa, S. divinorum, L. virosa, V. africana, C. zacatechichi, and E. pachanoi. The percentage variance accounted for by each of the indicated PCs is shown in parentheses.

The plot displayed in Figure 3 shows the results of the application of the g-SNE technique in two dimensions. Similar to a PCA score
plot, the points define the positions of observations based on the
relative g-SNE similarities, and each species is
defined by a color. The plot illustrates the clustering of the samples
of each species and shows a clear separation between them that corresponds
closely to the true labels. Of note is the fact that the local similarity
relations between species are comparable with the PCA results. However,
one sample belonging to the C. indica class was observed to be an extreme outlier and was thus removed
prior to further analysis.
Figure 3
Clustering results observed from the application of global t-distributed stochastic neighbor embedding (g-SNE) to DART-HRMS data generated from plant headspace analysis. This 2-D rendering shows points that appear in clusters that are color-coded to species. The clustering is based on the relative similarities of the data points that correspond closely to the true labels and illustrates a clear separation of species.

The results of these exploratory analyses unmasked the hidden discrimination
structure between species. Subsequent application of the supervised
random forest (RF) technique (using the “randomForest”
package in R) (Scheme 1, step 2, center panel) was performed on the 149 × 355 data
matrix to define the discrimination model for the classification of
plant species using DART-HRMS data and class labels. The RF method
hyperparameters, the minimum number of nodes and the number of variables
(m/z) randomly sampled as candidates
at each split, were optimized based on a random search of their values
within a range. The minimum number of nodes was explored within the
range of 1–5, and the number of sampled variables was set to
between 20 and 350 variables. Cross-validation (10-fold) of the created
RF classifiers was repeated 100 times to find the optimum parameter
values that enabled the building of an accurate model. The optimum
values were observed to coincide with 1 node for the minimum number
of nodes and 55 randomly sampled variables for each split. The RF
technique set with these optimized parameters was then performed with
different numbers of trees, and in this case, a forest with 1000 trees
was found to provide a model with an improved error rate in prediction.
For each tree, the RF algorithm sets aside approximately a third of the dataset
as “out-of-bag” (OOB) samples (for validation purposes)
and performs training with the remaining two-thirds. Thus, the votes
for a given OOB sample are aggregated only over those decision trees whose
training set did not include that sample. The OOB samples were used to
calculate error rates and variable importance values. Figure S3 illustrates the estimated error rate
for the OOB classifier on the training set for the grown trees in
the RF model. The error converged to a plateau at a value of 0.007
after growing around 382 trees. Table S1 shows the performance results of the discrimination model for each
species (i.e., classification precision, sensitivity, and specificity),
and it displayed an accuracy of 99% for the OOB sample predictions.
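The reported figures can be cross-checked with a short calculation. The sketch below assumes 10 replicates per species with one C. indica spectrum removed (149 samples in total) and a single C. indica sample predicted as E. pachanoi, as stated in the text; the function names are hypothetical, and the per-class values it prints are derived from those assumptions rather than copied from Table S1.

```python
def expected_oob_fraction(n):
    """Chance that a given sample is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


def per_class_metrics(true_labels, predicted_labels, target):
    """One-vs-rest precision, sensitivity, and specificity for one species."""
    tp = fp = fn = tn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == target and p == target:
            tp += 1
        elif t != target and p == target:
            fp += 1
        elif t == target and p != target:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


species = [
    "A. officinalis", "C. zacatechichi", "C. indica", "C. sativa",
    "E. pachanoi", "L. virosa", "L. leonurus", "M. hostilis",
    "M. speciosa", "O. basilicum", "O. vulgare", "P. methysticum",
    "S. divinorum", "T. diffusa", "V. africana",
]
true_labels = []
for name in species:
    # Assumption: 10 replicates each, minus the removed C. indica outlier.
    true_labels += [name] * (9 if name == "C. indica" else 10)

predicted = list(true_labels)
# The single reported error: one C. indica sample predicted as E. pachanoi.
predicted[true_labels.index("C. indica")] = "E. pachanoi"

accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)
indica = per_class_metrics(true_labels, predicted, "C. indica")
pachanoi = per_class_metrics(true_labels, predicted, "E. pachanoi")

print(round(expected_oob_fraction(len(true_labels)), 3))  # roughly a third
print(round(accuracy, 3), round(1 - accuracy, 3))
print(indica, pachanoi)
```

Under these assumptions, one error in 149 predictions gives an error rate of about 0.007 and an accuracy of about 99%, in line with the values quoted above.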
The sensitivity and specificity illustrate the true positive and true
negative rates for species identification, respectively. The results
show that a single sample of C. indica was incorrectly predicted to be E. pachanoi but that all other observations were identified correctly. This
indicates that DART-HRMS analysis of plant-derived headspace in combination
with the RF model is a satisfactory approach for identifying plant
species.

One of the important properties of RF is the added
possibility
of computing a “proximity matrix” as a descriptive measure.
The proximity matrix quantifies the similarity between samples and
is calculated from how often two samples are placed in the
same terminal node. The results of the application of multidimensional
scaling to the corresponding distance matrix (1 − proximity) to obtain the two principal
coordinate components are shown in Figure S4 (with each species assigned a color). Like points were observed
to cluster correctly, but the plot also revealed the close similarities
between O. vulgare, O. basilicum, L. leonurus, T. diffusa, and P.
methysticum on the one hand and between E. pachanoi, V. africana, L. virosa, A. officinalis, and M. hostilis on the other. In
comparing these results with those obtained by PCA and g-SNE, it was deduced that the three methods provide complementary
information in presenting the similarities between species, as is
described below.

To determine the accuracy of the method for
predicting the identity
of unknowns, 14 samples were analyzed blindly by DART-HRMS. Their
mass spectrometric data were then aligned and binned with the training
samples. A conformal predictor based on the RF classifier was created
to determine the prediction with an assigned confidence level for
each test sample. All training samples were considered as members
of the bag of calibration samples, and an off-line experiment using
the leave-one-out (LOO) approach was applied. The conformity measure
and p-values (from the equation given in the Experimental Section)
were then calculated for LOO sample prediction. Of the 149 LOO samples,
15 were assigned to multiple classes (at the ε = 8% significance
level), but all of the other samples were assigned a single label,
which was correct in each case. Thus, the designed conformal predictor
proved to be valid for a significance level of 8% with an efficiency
of 0.1, an observed fuzziness of 0, and an error rate of 0. The efficiency
is the fraction of tested samples that received multiple predictions (here, 15 of 149), and
the observed fuzziness is defined as the sum of all p-values for the incorrect class labels. A predictor makes an error
when the predicted region does not contain the true label, and the
error rate refers to the fraction of observations predicted incorrectly.

Table 1 presents
the performance outcomes for the prediction of the identities of these
unknowns, as well as the prediction credibility and confidence level
using the RF model. The results show that the true class labels fall
within the correct prediction region (with a significance level of
8%) for all unknown samples. The confidence level of 0.91 for the unknown
samples representing M. hostilis, O. vulgare, E. pachanoi, and C. sativa indicates
that the p-value for some other class was
0.09, which exceeds the 8% significance level. The calculated p-value for each species
is displayed for each sample in Table S2. The table illustrates that the four aforementioned samples can
each be classified as members of two species.
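The way a prediction region, credibility, and confidence level follow from a set of per-class p-values can be sketched as follows. The p-values in the example are illustrative placeholders shaped like the O. vulgare row of Table 1 (the actual values are in Table S2), the label "species_X" is hypothetical, and the rule of keeping every class whose p-value exceeds ε is the usual conformal prediction convention, assumed here rather than taken from the text.

```python
def conformal_summary(p_values, epsilon=0.08):
    """Summarize one test sample from its per-class conformal p-values."""
    ranked = sorted(p_values.items(), key=lambda kv: kv[1], reverse=True)
    # Prediction region: every class that cannot be rejected at level epsilon.
    region = sorted(label for label, p in p_values.items() if p > epsilon)
    return {
        "point_prediction": ranked[0][0],
        "region": region,
        "credibility": ranked[0][1],     # highest p-value
        "confidence": 1 - ranked[1][1],  # 1 minus the second-highest p-value
    }


# Illustrative p-values only; "species_X" stands in for an unnamed second class.
example = {"O. vulgare": 0.18, "species_X": 0.09, "T. diffusa": 0.0}
summary = conformal_summary(example)
print(summary)

# Efficiency as defined in the text: share of samples with multiple labels.
efficiency = 15 / 149
print(round(efficiency, 1))
```

In this example, the second-highest p-value (0.09) exceeds ε = 0.08, so the region contains two labels and the confidence level is 0.91, which mirrors the behavior described for the four samples above.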
Table 1
Prediction Results for the Indicated 14 Test Samples Representing Each Species (a)

| species | prediction | credibility (b) | confidence level (c) |
| A. officinalis | true | 0.09 | 1 |
| C. indica | true | 0.27 | 1 |
| C. sativa | true | 0.09 | 0.91 |
| C. zacatechichi | true | 0.9 | 1 |
| E. pachanoi | true | 0.09 | 0.91 |
| L. leonurus | true | 0.09 | 1 |
| L. virosa | true | 1 | 1 |
| M. hostilis | true | 0.09 | 0.91 |
| M. speciosa | true | 0.45 | 1 |
| O. basilicum | true | 0.09 | 1 |
| O. vulgare | true | 0.18 | 0.91 |
| P. methysticum | true | 0.45 | 1 |
| S. divinorum | true | 0.36 | 1 |
| T. diffusa | true | 0.55 | 1 |

(a) The credibility and confidence levels are reported for each.
(b) Credibility corresponds to the highest computed p-value.
(c) Confidence level refers to 1 minus the second-highest p-value.

Aiming to rank the variables in terms of their ability
to facilitate
clustering and discrimination between species, the importance of the
variables was quantified using principal component analysis–variable
importance of projection (PCA–VIP) scores and RF variable importance
indices (Scheme 1,
step 2, last panel). The importance of the primary variables identified
by PCA as contributing to the maximum variance is defined by the equation given in the Experimental Section. The average relative importance of
the variables (m/z values) in the
bootstrap analysis PCA–VIP for the three principal components
(which accounted for ∼46% of the variance of the data) is illustrated
in Figure 4a, in which
the 30 most important m/z values
with PCA–VIPs are labeled. These include monoterpenoids (β-myrcene,
camphene, β-pinene, β-phellandrene, γ-terpinene,
and α-pinene at m/z 137.1096
in O. basilicum, C.
zacatechichi, C. zacatechichi, T. diffusa, T. diffusa, and L. leonurus, respectively),
sesquiterpenoids (α-curcumene at m/z 203.1789 in C. zacatechichi and T. diffusa; trans-α-bergamotene, caryophyllene, and β-sesquiphellandrene
at m/z 205.1889 in O. basilicum, T. diffusa, and T. diffusa, respectively), and
estragole (at m/z 149.0895 in O. basilicum). In addition, the permutation-based
importance of predictive variables in the 10 repeats of the RF modeling
was applied to show which m/z values
were useful for discrimination between plant species. All variables
(m/z values) were considered for
all of the trees in all 10 RF classifiers. Each variable’s
importance is the average of the importance values derived from the
classifiers. The bar plot in Figure 4b displays the rankings for the 30 most important variables
computed by this method.
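The idea behind permutation-based importance can be illustrated with a toy example: shuffle one variable at a time and record how much the prediction accuracy drops. This sketch is not the study's procedure. It uses a fixed threshold rule as a stand-in for a trained RF, averages over 10 shuffles rather than over 10 separately trained classifiers, and all names and data are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: feature 0 separates the classes, feature 1 is ignored by the model.
rows = [[1.0, i * 0.01] for i in range(10)] + [[0.0, i * 0.01] for i in range(10)]
labels = ["A"] * 10 + ["B"] * 10


def toy_model(row):
    """Stand-in classifier (assumption): a threshold on feature 0 only."""
    return "A" if row[0] > 0.5 else "B"


importances = permutation_importance(toy_model, rows, labels)
ranking = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
print(importances, ranking)
```

Shuffling the informative feature lowers the accuracy, whereas shuffling the ignored feature leaves it unchanged, so the informative feature ranks first. Ranking the m/z variables by their averaged importance follows the same logic.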
Figure 4
Values (30 m/z) observed to be most important for enabling clustering and species discrimination, calculated using PCA and RF modeling of DART-HRMS-derived data from the analysis of plant headspace. (a) Variables (m/z values) of importance in discrimination, revealed through bootstrap PCA–VIP analysis based on the three principal components, which explained ∼46% of the variance of the data, and their corresponding average scores. (b) The m/z values important for discrimination were extracted using permutation-based importance of predictive variables in RF. In both panels, the m/z values are listed in the order of decreasing PCA–VIP scores and variable importance RF values.

In comparison to the
PCA–VIP results, it is noteworthy that
40% of the m/z values detected by
PCA–VIP aligned with those that emerged by RF modeling. For
visualization of this correspondence, Figure 5 illustrates the 3-D loading plot created
using the first three PCs, along with the marked loadings for the
important m/z values detected by
PCA and RF analyses. The solid navy points in the figure show the
loadings for 355 variables, while the magenta stars and red circles
are markers for m/z values and loadings
that were derived from PCA and RF, respectively. This rendering makes
apparent that both methods furnish similar results and that about
40% of the m/z values that emerged
in RF analysis as important were also essential in explaining the
maximum variance of the data. Table S3a–f
reports the average relative intensities for m/z values that were ranked by both methods to be important.
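The degree of agreement between two ranked lists of m/z values can be computed by matching entries within a mass tolerance. In the sketch below, the two short lists are hypothetical: the m/z values appear in this article, but their assignment to an "RF" or "PCA–VIP" list is invented for illustration and was chosen so that the overlap comes out at 40%. The tolerance of 0.005 mirrors the ± 0.005 window quoted for the spectra, and the function name is an assumption.

```python
def overlap_fraction(list_a, list_b, tol=0.005):
    """Return the entries of list_a matched in list_b within tol, and their share."""
    matched = [a for a in list_a if any(abs(a - b) <= tol for b in list_b)]
    return matched, len(matched) / len(list_a)


# Hypothetical top-5 lists (the study compared the top 30 from each method).
pca_vip_top = [137.1096, 203.1789, 205.1889, 149.0895, 81.0500]
rf_top = [137.1096, 149.0895, 152.1294, 104.0698, 180.1592]

matched, fraction = overlap_fraction(pca_vip_top, rf_top)
print(matched, fraction)
```

For lists of 30 values each, a 40% overlap corresponds to 12 shared m/z values.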
Figure 5
Equivalent semantic relationships between PCA–VIP and RF variable importance methods from within the set of important predictors (m/z values), rendered as a 3-D loading plot. The navy points display the loadings of 355 m/z values. The loadings of the m/z values representing the top-ranking variables obtained from the RF and PCA–VIP analyses are indicated with red circles and magenta stars, respectively. The observed overlap of circles and stars illustrates alignment in the predictions of the two methods regarding the m/z variables that were the most important contributors to the ability to differentiate between species.

From the point of view
of the local variable importance for clustering
of species based on PCA score and loading plots (Figures 2 and 5, respectively), m/z values 152.1294,
104.0698, 180.1592, and 85.0299 were important in the clustering of M. hostilis, M. speciosa, S. divinorum, L.
virosa, V. africana, C. zacatechichi, and E. pachanoi. The m/z values 81.0500, 137.1096, 99.0399, 93.0699, 173.0992, and 175.1191
were important for the detection of the similarities between L. leonurus, C. sativa, C. indica, T. diffusa, P. methysticum, and A. officinalis, respectively.

Table S4a–c lists information
on the characteristics of the important variables and shows the 20
most important discriminating features for each species that were
revealed by the RF approach and which represent the mean of the importance
of each variable in the samples belonging to each species. These m/z values illustrate the features that
were significant for enabling the discrimination of a specific species
from all other species. However, it should be noted that these variables
do not necessarily match with those indicated in Figure 4 and Table S3 that enabled the creation of the classification model. This
is because there were two types of investigations accomplished using
the RF results. One was differentiation of a given species from the
14 others that were the subject of the investigation. The m/z values that enabled the accomplishment
of this were described as being of “local” importance
and are listed in Table S4. The second
enabled discrimination between all species simultaneously such that
the discrimination between species could be readily visualized through
the clustering observed in Figure S4 (two-dimensional
(2-D) plot of the proximity matrix analysis). The m/z values associated with this type of classification
are described here as “global” and appear in Figure 4. As the two types
of exploration accomplish different tasks, the variables that are
most heavily weighted in achieving the two types of classification
are not necessarily the same.

The results of this study reveal
that the headspace volatiles of
the legal high plant materials analyzed exhibit consistent
and unique chemical profiles, the constituents of which can be concentrated
using solid-phase microextraction fibers. The results are highly accurate
despite the SPME-facilitated volatiles collection having been performed
at ambient (as opposed to elevated) temperature and the data variability
inherent in the manual DART-MS analysis process. The mass spectra
observed were remarkably consistent for samples of the same class.
Their chemical signatures, rapidly acquired by DART-HRMS analysis,
can then be subjected to multivariate statistical analysis using a
conformal predictor based on a random forest model, to predict the
species identities of plant material unknowns at a significance level
of 8%, an efficiency of 0.1, an observed fuzziness of 0, and an error
rate of 0. This is important, in that it shows proof-of-concept for
the creation of a headspace chemical profile database, which can be
used to rapidly screen headspace mass spectra of unknowns, to identify
plant-based legal highs.
Experimental Section
Plant Material
Dried samples of A. officinalis leaves, C. zacatechichi leaves, L. virosa leaves, L. leonurus flowering material,
and V. africana root bark were purchased
from World Seed Supply (Mastic Beach, NY).
Dried M. hostilis root bark was purchased
from Mr. Botanicals (MrBotanicals.com). Dried P. methysticum root powder and T. diffusa leaves
were purchased from Bouncing Bear Botanicals (Lawrence, KS). Dried M. speciosa leaves were purchased from Kratom King
(Reno, NV). Dried O. basilicum leaves
and O. vulgare leaves were purchased
from Hannaford Bros. Co. (Scarborough, ME). A fresh E. pachanoi plant was purchased from World Seed Supply
(Mastic Beach, NY) and then cut and dried. A fresh S. divinorum plant was purchased from Undergroundroots.net
(La Conner, WA) and then cut and dried. Cannabis samples (i.e., C. sativa and C. indica) were analyzed at the U.S. Fish and Wildlife Forensics Laboratory
(Ashland, OR).
Solid-Phase Microextraction Fibers
Divinylbenzene/carboxen/poly(dimethylsiloxane)-coated
24 ga 50/30 μm solid-phase microextraction fibers and solid-phase
microextraction fiber holders for use with manual sampling were purchased
from Supelco Inc. (Bellefonte, PA). Fibers were conditioned for 30
min at 250 °C under a stream of helium gas before each headspace
sampling.
Headspace Sampling
Roughly 10 g of each plant species
was placed in separate 25 mL Erlenmeyer flasks. The mouth of the flask
was covered with aluminum foil. A conditioned solid-phase microextraction
fiber was then exposed to the headspace of the sample for 30 min at
room temperature (Figure 6). This concentration step was performed under ambient conditions
(rather than at elevated temperature) to detect volatile components
that are more likely to be observed under the ambient conditions present
in the vessels containing the samples or within the general vicinity
of the samples (in the field). Each of the plant samples was analyzed
in replicates of 10. Spectra of C. sativa and C. indica headspace were acquired
by transferring the samples to a 20 mL scintillation vial and placing
it uncapped between the ion source and the mass spectrometer inlet.
Figure 6. Headspace volatile collection using an SPME fiber.
DART-HRMS Analysis
Exposed SPME fibers were analyzed
using a direct analysis in real-time (DART)-SVP ion source (IonSense,
Saugus, MA) interfaced with a JEOL AccuTOF mass spectrometer (JEOL
USA, Peabody, MA). Each fiber, while extended from the holder assembly,
was manually “waved” back and forth in the DART gas
stream until there was no longer an MS signal that was registered
(which signified that the content of the fiber had been fully desorbed
and which took ∼1 min) (Figure 7). The fibers were analyzed in positive-ion mode with
the gas heater temperature in the DART software set to 250 °C,
over a mass range of m/z 40–800.
The DART ion source helium flow rate was 2.0 L/min. The mass spectrometer
settings were as follows: the orifice 1 voltage was 20 V, the orifice
2 voltage was 5 V to minimize fragmentation, and the peak voltage
was 400 V to allow for the detection of ions over m/z 40. The mass spectrometer has a resolving power
of 6000 full width at half maximum. Poly(ethylene glycol) (PEG 600)
was used to calibrate the mass spectra following the analysis of each
individual fiber. Plant material for each species was also analyzed
directly using the same DART parameters as the SPME fibers for comparison.
Figure 7. SPME fiber introduction to the DART gas stream.
Spectral Analysis
Calibration, background subtraction,
and peak centroiding were conducted using TSSPro3 software (Schrader
Analytical Laboratories, Detroit, MI). Mass spectral analysis was
performed using Mass Mountaineer (Mass-spec-software.com, RBC Software,
Portsmouth, NH). The DART mass spectrum of a conditioned SPME fiber
that was not exposed to the headspace of any samples was used as a
blank for the SPME samples.
Statistical Analysis
To model discrimination between plant species and to discover which features (m/z values) are most important for distinguishing between
them, multivariate statistical analysis methods were applied to the
DART-HRMS data acquired from the analysis of plant samples. The workflow
outlined in Scheme illustrates the approach.
In step 1, SPME fiber-facilitated
DART-HRMS was used to generate a mass spectrum for each sample, with
the analysis performed using multiple species and 10 replicates. In
all, the mass spectra of 150 samples representing 15 different species
were imported into MATLAB 9.3.0, R2017b Software (The MathWorks, Inc.,
Natick, MA), in text format (composed of m/z values and their corresponding intensities) for further
analysis in MATLAB and R 3.5.1 (http://cran.r-project.org/). A data matrix with the dimensions
150 × 355 was created from binning of mass spectra, with the
optimal bin width and the relative abundance threshold being ±10
mmu and 0.2%, respectively. In step 2, the data matrix was subjected
to descriptive and predictive methods to reveal information on species
in terms of discriminative markers. This step consisted of three parts:
exploration, classification, and determination of variable (m/z) importance, detailed below.
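The binning in step 1 can be illustrated with a minimal Python sketch. It is not the authors' MATLAB/R code: the fixed m/z grid, the summing of peaks that share a bin, the function names, and the toy peak lists are all assumptions made for illustration. Only the ±10 mmu bin width and the 0.2% relative abundance threshold come from the text.

```python
from collections import defaultdict

BIN_HALF_WIDTH = 0.010  # +/-10 mmu, as stated in the text
REL_THRESHOLD = 0.2     # percent of the base peak, as stated in the text


def bin_spectrum(peaks, half_width=BIN_HALF_WIDTH, threshold=REL_THRESHOLD):
    """Return {bin index: relative abundance (%)} for one centroided spectrum."""
    base = max(intensity for _, intensity in peaks)
    binned = defaultdict(float)
    for mz, intensity in peaks:
        rel = 100.0 * intensity / base
        if rel < threshold:
            continue  # drop peaks below the relative abundance threshold
        # Assumed fixed grid: each bin spans 2 * half_width in m/z.
        binned[round(mz / (2 * half_width))] += rel
    return dict(binned)


def build_matrix(spectra, half_width=BIN_HALF_WIDTH, threshold=REL_THRESHOLD):
    """Return (bin centre m/z values, samples x bins matrix as nested lists)."""
    binned = [bin_spectrum(s, half_width, threshold) for s in spectra]
    columns = sorted({b for spectrum in binned for b in spectrum})
    matrix = [[spectrum.get(b, 0.0) for b in columns] for spectrum in binned]
    centres = [round(b * 2 * half_width, 4) for b in columns]
    return centres, matrix


# Hypothetical toy spectra as (m/z, intensity) pairs, not data from the study.
toy_spectra = [
    [(81.071, 500.0), (137.132, 1000.0), (205.195, 1.0)],
    [(81.074, 300.0), (137.135, 600.0)],
]
centres, matrix = build_matrix(toy_spectra)
```

With 150 spectra as input, the same routine would return one row per sample and one column per occupied bin, which is the shape of the 150 × 355 matrix described above. A fixed grid can split two nearby peaks that fall on opposite sides of a bin edge, so a tolerance-based grouping may behave differently.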
Exploration
An extended form of t-distributed
stochastic neighbor embedding, termed “g-SNE”,
was used to visualize the data structure in a 2-D scatter plot. This
neighbor-embedding technique preserves the pairwise similarities of
probable neighbors by minimizing the divergence of similarity distributions
between neighboring data points and embedding the points in a lower-dimensional
space. The dataset was subjected to principal component analysis (PCA)
to explore its similarity structure and to reveal the m/z values which were the primary indicators of similarities
and dissimilarities between like and unlike groups, respectively.
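As a rough illustration of how PCA loadings point to the m/z values that drive separation, the sketch below extracts the first principal component by power iteration on the covariance matrix. It is a simplified stand-in written in plain Python, not the PCA routine used in the study. The function name, the iteration count, and the toy matrix are assumptions.

```python
def first_principal_component(matrix, n_iter=200):
    """Return (loadings, scores) of the first principal component."""
    n_samples, n_vars = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_samples for j in range(n_vars)]
    centred = [[row[j] - means[j] for j in range(n_vars)] for row in matrix]
    # Sample covariance matrix of the mean-centred variables.
    cov = [[sum(r[a] * r[b] for r in centred) / (n_samples - 1)
            for b in range(n_vars)] for a in range(n_vars)]
    loadings = [1.0] * n_vars
    for _ in range(n_iter):
        # Power iteration: repeated multiplication converges to the
        # eigenvector with the largest eigenvalue.
        product = [sum(cov[a][b] * loadings[b] for b in range(n_vars))
                   for a in range(n_vars)]
        norm = sum(x * x for x in product) ** 0.5
        loadings = [x / norm for x in product]
    scores = [sum(r[j] * loadings[j] for j in range(n_vars)) for r in centred]
    return loadings, scores


# Hypothetical toy matrix: rows are samples, columns are binned m/z intensities.
# The first two rows mimic one species and the last two rows another.
toy = [[10.0, 1.0, 5.0], [11.0, 1.2, 5.0], [1.0, 1.1, 5.0], [2.0, 0.9, 5.0]]
loadings, scores = first_principal_component(toy)
# Variables ranked by absolute loading, largest first.
ranked = sorted(range(len(loadings)), key=lambda j: abs(loadings[j]), reverse=True)
```

In this toy case, the first column separates the two groups and therefore carries the largest absolute loading, while the constant third column contributes nothing.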
Classification
The random forest (RF) technique proposed
by Breiman was investigated as a plant species discrimination model.[21] Random forest is a classifier which aggregates
a large number of “trees” to reduce overfitting and
preserve reliable predictions. Every tree in the forest is “grown”
on an independently drawn bootstrap replica of the data matrix and
assigned a vote for each class (i.e., the estimated probability of
the observation originating from the given class) at each input sample.
The samples not included in the replica for a given tree are considered
to be “out-of-bag” (OOB) for that tree. The overall
accuracy and the performance characteristics of the model are computed
based on the predictions of OOB observations. For the prediction of
new samples, a conformity measure was used to yield a confidence level
prediction based on a random forest classifier.[22] Conformal prediction provides the opportunity to have output
region predictions (i.e., a set of predicted labels) with a guaranteed
error rate based on the calculated p-value. The conformity
score for a given observation i in the bag (i.e.,
the calibration set in the conformal prediction context) for a specific
class k (designated as α_i^k) is the proportion
of votes of all of the trees for a given class k.
The result is a matrix of conformity scores with one row per observation
and one column per class.
The parameters m and n_c indicate
the number of samples in the bag and classes, respectively. The resulting
scores were then used to calculate the p-value for
the labeling of an unknown sample representing a given species, according
to eq 1. To calculate
the p-value for observation “m + 1” for a specific class k (represented
as p(α_{m+1}^k)), the conformity
score of the observation for class k (α_{m+1}^k) was computed and compared with the observations’
in-bag scores for class k under the following conditions:
the scores of observations belonging to class k (α_i^k, i ∈ 1,...,m | y_i = k), and the maximum conformity measure of the observations not
belonging to class k (max_j(α_i^j), i ∈ 1,...,m | y_i ≠ k). In the
case of single-label predictions, the confidence of the prediction
is one minus the second-largest p-value, and the
credibility is the largest p-value.
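The quantities described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names and the example votes and p-values are hypothetical, and the p-values are taken as given because their exact formula is not reproduced in this text. The sketch covers the vote-proportion conformity score, the region prediction at a chosen significance level, and the confidence and credibility of a single-label prediction.

```python
def vote_proportions(tree_votes, classes):
    """Conformity scores: the fraction of trees voting for each class."""
    n_trees = len(tree_votes)
    return {c: tree_votes.count(c) / n_trees for c in classes}


def prediction_region(p_values, significance):
    """Labels whose p-value exceeds the chosen significance level."""
    return sorted(label for label, p in p_values.items() if p > significance)


def confidence_and_credibility(p_values):
    """Single-label summary: (best label, confidence, credibility)."""
    ranked = sorted(p_values.items(), key=lambda item: item[1], reverse=True)
    (best_label, largest), (_, second_largest) = ranked[0], ranked[1]
    # Confidence is one minus the second-largest p-value;
    # credibility is the largest p-value.
    return best_label, 1.0 - second_largest, largest


# Hypothetical votes from a four-tree forest for one observation.
alpha = vote_proportions(
    ["M. speciosa", "M. speciosa", "S. divinorum", "M. speciosa"],
    classes=["M. speciosa", "S. divinorum"],
)

# Hypothetical p-values for one unknown sample (not data from the study).
p_unknown = {"M. speciosa": 0.62, "S. divinorum": 0.03, "T. diffusa": 0.01}
region = prediction_region(p_unknown, significance=0.08)
label, confidence, credibility = confidence_and_credibility(p_unknown)
```

At a significance level of 0.08, only the label with a p-value above 0.08 remains in the region, so this example yields a single-label prediction. A region containing several labels would indicate that the sample cannot be assigned to one species at that level.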
Variable Importance
PCA and RF results were explored
to deduce the relative importance of the various m/z values in enabling the clustering of and discrimination
between plant species. This was accomplished by generating variable
importance of projection (VIP) scores, as proposed by Ginsburg et
al.[23] VIPs enable the consideration of
the structure of the reduced dimensional PCA space and the class labels
according to eqs 2 and 3, where T, P, y, and b (in eq 2) are the scores, loadings, class labels, and regression coefficients
between class labels and scores, respectively. Equation 2 represents the decomposition of the mass
data matrix into scores (T) and loadings (P) matrices, and the regression between scores (T) and class
labels (y). Equation 3 gives the computation of the VIP scores.
The terms n_pc, m, and n_v define the number of principal
components, samples, and m/z values,
respectively.
PCA–VIP scores were calculated by randomized
bootstrapping (1000 repetitions), with 80% of the samples used to
create a PCA model in each repeat. Determination of the m/z values that were most important in enabling discrimination
between sample types was accomplished by defining an importance measure
(permutation-based variable importance) that was embedded in the OOB
observations in the RF model. The score of a given variable was computed
as the average decrease in model accuracy of the OOB samples when
the values of the corresponding variable were randomly permuted across
the OOB observations. Therefore, for each variable in every tree grown,
the difference between two vote percentages for the correct class
of the OOB observations was measured: one for the untouched OOB
data and another for the variable-permuted OOB data. The average
of this measure for all of the trees in the ensemble represented the
importance score for each variable (i.e., m/z value).[24]
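The permutation idea can be shown with a small Python sketch. It simplifies the procedure described above in two ways that should be read as assumptions: a single hand-written rule stands in for a trained tree, and one held-out set stands in for the per-tree OOB observations. The function names and the toy data are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=50, seed=0):
    """Mean drop in accuracy when each variable is shuffled across samples."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between variable j and the labels
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a trained tree: it looks only at variable 0.
def toy_model(row):
    return "A" if row[0] > 0.5 else "B"


rows = [[0.9, 0.1], [0.8, 0.7], [0.7, 0.3], [0.95, 0.6], [0.6, 0.2],
        [0.1, 0.4], [0.2, 0.9], [0.3, 0.5], [0.05, 0.8], [0.4, 0.35]]
labels = ["A"] * 5 + ["B"] * 5
scores = permutation_importance(toy_model, rows, labels)
```

Shuffling the variable that the rule ignores leaves the accuracy unchanged, so its score is zero, while shuffling the variable the rule depends on lowers the accuracy and gives a positive score. In a random forest, the same comparison is made tree by tree on the OOB samples and then averaged over the ensemble.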