Gelio Alves1, Aleksey Ogurtsov1, Roger Karlsson2,3,4,5, Daniel Jaén-Luchoro2,4,6, Beatriz Piñeiro-Iglesias3,4, Francisco Salvà-Serra2,3,4,6,7, Björn Andersson8, Edward R B Moore2,3,4,6, Yi-Kuo Yu1. 1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States. 2. Department of Infectious Diseases, Sahlgrenska Academy, University of Gothenburg, 40530 Gothenburg, Sweden. 3. Department of Clinical Microbiology, Sahlgrenska University Hospital, 40234 Gothenburg, Sweden. 4. Center for Antibiotic Resistance Research (CARe), University of Gothenburg, 40016 Gothenburg, Sweden. 5. Nanoxis Consulting AB, 40234 Gothenburg, Sweden. 6. Culture Collection University of Gothenburg (CCUG), Sahlgrenska Academy of the University of Gothenburg, 40234 Gothenburg, Sweden. 7. Microbiology, Department of Biology, University of the Balearic Islands, 07122 Palma de Mallorca, Spain. 8. Bioinformatics Core Facility at Sahlgrenska Academy, University of Gothenburg, Box 413, 40530 Gothenburg, Sweden.
Abstract
Fast and accurate identifications of pathogenic bacteria along with their associated antibiotic resistance proteins are of paramount importance for patient treatments and public health. To meet this goal from the mass spectrometry aspect, we have augmented the previously published Microorganism Classification and Identification (MiCId) workflow for this capability. To evaluate the performance of this augmented workflow, we have used MS/MS datafiles from samples of 10 antibiotic resistance bacterial strains belonging to three different species: Escherichia coli, Klebsiella pneumoniae, and Pseudomonas aeruginosa. The evaluation shows that MiCId's workflow has a sensitivity value around 85% (with a lower bound at about 72%) and a precision greater than 95% in identifying antibiotic resistance proteins. In addition to having high sensitivity and precision, MiCId's workflow is fast and portable, making it a valuable tool for rapid identifications of bacteria as well as detection of their antibiotic resistance proteins. It performs microorganismal identifications, protein identifications, sample biomass estimates, and antibiotic resistance protein identifications in 6-17 min per MS/MS sample using computing resources that are available in most desktop and laptop computers. We have also demonstrated other use of MiCId's workflow. Using MS/MS data sets from samples of two bacterial clonal isolates, one being antibiotic-sensitive while the other being multidrug-resistant, we applied MiCId's workflow to investigate possible mechanisms of antibiotic resistance in these pathogenic bacteria; the results showed that MiCId's conclusions agree with the published study. The new version of MiCId (v.07.01.2021) is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.
Fast and accurate identifications of pathogenic bacteria along with their associated antibiotic resistance proteins are of paramount importance for patient treatments and public health. To meet this goal from the mass spectrometry aspect, we have augmented the previously published Microorganism Classification and Identification (MiCId) workflow for this capability. To evaluate the performance of this augmented workflow, we have used MS/MS datafiles from samples of 10 antibiotic resistance bacterial strains belonging to three different species: Escherichia coli, Klebsiella pneumoniae, and Pseudomonas aeruginosa. The evaluation shows that MiCId's workflow has a sensitivity value around 85% (with a lower bound at about 72%) and a precision greater than 95% in identifying antibiotic resistance proteins. In addition to having high sensitivity and precision, MiCId's workflow is fast and portable, making it a valuable tool for rapid identifications of bacteria as well as detection of their antibiotic resistance proteins. It performs microorganismal identifications, protein identifications, sample biomass estimates, and antibiotic resistance protein identifications in 6-17 min per MS/MS sample using computing resources that are available in most desktop and laptop computers. We have also demonstrated other use of MiCId's workflow. Using MS/MS data sets from samples of two bacterial clonal isolates, one being antibiotic-sensitive while the other being multidrug-resistant, we applied MiCId's workflow to investigate possible mechanisms of antibiotic resistance in these pathogenic bacteria; the results showed that MiCId's conclusions agree with the published study. The new version of MiCId (v.07.01.2021) is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.
Entities:
Keywords:
identification of antibiotic resistance proteins; mass spectrometry; microorganism identification/classification workflow
Fast and accurate identification of pathogenic bacteria along with
the identification of antibiotic resistance (AR) proteins is of paramount
importance for patient treatments and public health.[1−5] Once the pathogenic bacteria causing the infections are identified
swiftly along with their AR proteins (if present), proper treatment
can be administered, which can increase patients’ survival
rate and minimize improper use of antibiotics.[6,7]Currently, molecular methods such as next-generation sequencing
(NGS) and mass spectrometry (MS) are used and are being developed
to speed up identifications of pathogenic bacteria.[8−25] While several computational workflows/pipelines for analyzing NGS
data have been developed to identify pathogenic bacteria and AR genes,[26−28] a mass spectrometry workflow with this capability is still lacking.[24] This has motivated us to augment the workflow
of our pathogen identification tool, Microorganism Classification and Identification (MiCId),[21,29,30] to enable the identification
of AR proteins using MS data from a high-performance liquid chromatography
system coupled to a high-resolution tandem MS (HPLC–MS/MS).
Another motivation for augmenting MiCId’s workflow is that,
even though NGS workflows can provide information about the presence
of AR genes, they do not provide information about protein expression,
which is extremely important for treating infections and for understanding
the mechanism of antibiotic resistance in bacteria.[31−34]For a summary of some of
the existing workflows employed for the
identification of bacteria using HPLC–MS/MS experiments, we
refer readers to previous publications.[24,29,30] Overall, there has been significant progress made
in the identification of bacteria using HPLC–MS/MS experiments,
although there is plenty of room for improvement in sample preparation
protocols and data analysis workflows.[35−37] Developers of HPLC–MS/MS
data analysis workflows often use the sensitivity (true positive rate)
and specificity (true negative rate) as the only criteria to assess
the usability of the developed workflow. Although sensitivity and
specificity are acceptable criteria to measure the performance of
a workflow, these criteria alone are not enough to justify the usability
of a workflow. For example, an important criterion that is often not
mentioned in performance evaluations is the execution time. Identification
of bacteria is a computationally demanding task for a workflow, as
it has to query tens of thousands of MS/MS spectra in a microorganismal
database containing thousands to tens of thousands of bacteria. In
order to scale with the number of HPLC–MS/MS experiments, a
workflow with appropriate amount of computer resources must have execution
time less than the time it takes to conduct the HPLC–MS/MS
experiment, which is approximately 1–2 h. This remained an
unattainable goal for most workflows.[38] Other criteria to consider include whether or not a workflow provides
for identified bacterial biomass estimation,[39,30] protein identification[40,29] with protein quantification,[41,42] and AR protein identification.[24] Data
on the relative biomasses of identified bacteria identified are essential
for studying microbial communities[39] and
are valuable when determining treatment options for patients suffering
from coinfections.[43−45] Knowledge of proteins and protein expression levels
are essential for analyzing gene expression and function[46,47] and for investigating possible mechanisms of antibiotic resistance
in bacteria.[31−34] Information about AR proteins is crucial for proper treatments for
AR-resistant bacterial infections.[6,7] The criteria
above cover most of the data analysis features needed for a workflow
to be useful. In order to ensure a workflow to be user-friendly, intuitive,
and customizable, we propose additional criteria. A useful workflow
should: (1) automate and customize microorganismal protein sequences
for download and database construction; (2) automate and customize
AR protein sequences for download and database construction; (3) be
computationally efficient and scalable to handle large microorganismal
databases, large numbers of MS/MS spectra, and large number of MS/MS
experiments; (4) be available to execute in different computer operating
systems; (5) offer a user-friendly graphical interface. Meeting these
latter criteria allows a workflow to eliminate elaborate intermediate
steps and reaching a broader group of users in addition to experts
in the field.In previous studies, we have demonstrated that
MiCId’s workflow
meets most of the criteria listed above.[21,29,30,48] We have shown
that MiCId’s workflow:Offers automated microorganismal database construction
by automatically downloading from the NCBI database protein sequences
of organisms specified by the user.Offers
customized microorganismal database construction
using a list of protein sequence Fasta files of organisms specified
by the user that are stored in the local computer.Is able to identify bacteria in samples containing single
and multiple bacteria with high sensitivity and high specificity by
computing, for each identified taxon, an E-value
which can be used to control the proportion of false discoveries (PFD)
without the need of a decoy database.[21,29] When a list
of candidate taxa are ranked by a quality score S, the E-value E(S ≥ S0) is defined as the expected
number of random taxa with scores the same as or better than S0.Is able to estimate
taxonomic biomass by computing a
quantity called the prior using a modified expectation-maximization
(EM) method. The prior is defined as the probability
for a taxon to emit any evidence peptide and can be regarded as the
taxon’s relative protein biomass within the sample analyzed.[30]Provides protein
identifications via combining peptides’ E-values,
using theoretically derived mathematical formulas.[40,49]Is computationally efficient and scalable,
taking 6–17
min to process tens of thousands of MS/MS spectra in a large database,
using resources available in most desktop/laptop computers.Is a self-contained workflow available with
a friendly
graphical user interface (GUI) with many features available for data
analysis and visualization.However,
the previous versions of MiCId’s workflow do not
provide protein quantification or AR protein identification and are
only available for the Linux operating system. In this study, we have
augmented the MiCId’s workflow to meet the criterion for the
identification of AR proteins, and we intend to address the other
two unmet criteria in the near future. MiCId’s workflow can,
however, be used in the Windows operating system via a virtual machine.
Details of how to run MiCId’s workflow in the Windows operating
system are described in MiCId’s user manual.The AR protein
identification task for an MS/MS workflow can be
formulated as follows. First, using data from an MS/MS experiment,
a workflow needs to identify the species/strains present in the biological
sample. Second, it needs to construct, on the fly, a target protein
database to be used for AR protein identifications. Even if a workflow
has high sensitivity and high specificity for the identification of
microorganisms and proteins, a remaining difficulty to be dealt with
in identification of AR proteins is deciding what protein sequences
to include in the target protein database. In principle, the ideal
target protein database to use would include all of the protein sequences
obtained directly from the strains present in the biological sample
and with AR proteins unambiguously annotated. However, such a database
is unobtainable from an MS/MS based proteomics approach, even if strain
level identification is attained. It is standard practice for workflows
to use databases such as those hosted by the National Center for Biotechnology
Information (NCBI) to obtain protein sequences for as-yet-to-be-identified
strains to build a target protein database. A target protein database
constructed by using this procedure is an approximation to the ideal
target protein database because the strains present in the biological
sample could have gained new proteins via horizontal gene transfer
and mutations through rapid multiplication and environmental pressure.[50,51] To mitigate this issue, MiCId constructs on the fly a target protein
database made of proteins from the reference/representative proteomes
of confidently identified species and AR proteins from a high-quality
AR database.[27,52,53] This strategy is employed because the proteomes of reference/representative
strains are proteome assemblies of higher quality; hence, they are
to be used as anchors for the analysis of closely related proteomes
within the same taxonomic group.[54] By including
a comprehensive AR protein database in a target protein database,
MiCId’s workflow can potentially deal with the horizontal AR
gene transfer, and the presence of a few mutations in an AR protein
does not prevent it from being identified provided that there are
sufficient identified peptides containing no mutations. This can potentially
allow the presence of few mutations occurring in the AR proteins to
be identified. Overall, the target protein database used in MiCId’s
strategy is not too far off from the ideal target database because
the proteomes of most strains under a given species share a significant
number of highly homologous proteins[55,56] and the inclusion
of AR proteins in a general manner takes care of the possible gain,
via horizontal gene transfer, of known AR proteins.We have
used five MS/MS data sets, consisting in total of 126 HPLC–MS/MS
datafiles (each containing about 20000–30000 spectra), covering
10 antibiotic-resistant bacterial strains, to evaluate the newly augmented
MiCId workflow in terms of AR protein identifications. In our evaluation,
AR proteins are identified at the AR protein family level, following
the AR protein family classification used by the CARD database.[52,57] Identification of AR proteins is performed at the family level because
of the large number of highly homologous AR proteins within most AR
protein families. (Many AR proteins within the same AR family differ
from each other by only one to few amino acid residues.) The high
degree of protein sequence similarity makes the task of distinguishing
among individual proteins beyond the AR protein family level not always
possible, especially when a data-dependent acquisition mode is used
in MS/MS experiments. Although identification of the exact AR protein
is not always possible, obtaining identifications at the AR protein
family level are enough to improve antibiotic treatments for patients
suffering from bacterial infections since AR proteins within the same
AR protein family are largely resistant to the same antibiotics.In our evaluation, we have shown that MiCId’s workflow has
a sensitivity of approximately 85% (with an estimated lower bound
of 72%) and a precision greater than 95% in the identification of
AR protein families. We have demonstrated, using an MS/MS data set
from samples of two human pathogens, that MiCId’s workflow
can be employed to investigate possible mechanisms of antibiotic resistance
in bacteria. We have also shown that MiCId’s workflow can provide
microorganismal identification, protein identification, sample biomass
estimation, and AR protein identification in 6–17 min using
computer resources that are available in most desktop and laptop computers.
The new MiCId version v.07.01.2021, designed to run in a Linux environment
and tested under (i) CentOS Linux release 7.9.2009, (ii) Red Hat Enterprise
Linux Server release 7.9, (iii) Ubuntu release 18.04.3, and (iv) Windows
10 using Oracle VirtualBox 6.1.22 running Ubuntu release 18.04.3,
is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.
Materials and Methods
MiCId’s AR Protein
Identification Algorithm
MiCId’s workflow is augmented
to allow for AR protein identifications.
MiCId’s workflow contains procedures for taxonomic identifications,
biomass estimations, and protein identifications. In the workflow,
it is the protein identification part that gets augmented for the
purpose of AR protein identifications. Below, we summarize MiCId’s
workflow and highlight the augmentations required. MiCId begins by
querying a sample’s MS/MS spectra in the microorganismal database,
containing protein sequences from reference and representative genomes,
for the identifications of microorganismal peptides; these identified
microorganismal peptides are then used for taxonomic identifications
via an iterative approach at each taxonomic level, and for relative
taxa biomasses estimates within the sample.[29,30] The proteins from the reference/representative proteomes of species
identified with E-value ≤0.01 and prior ≥ 0.01 are then assembled on-the-fly for protein
identification.In the augmented MiCId, we add to the aforementioned
protein database AR proteins from an AR database. Namely, in the protein
identification procedure, MiCId now queries the updated protein database
(combining the protein database constructed on-the-fly and the AR
database) with MS/MS spectra to identify peptides for protein and
AR protein identifications. (This should not be confused with the
peptide identifications needed for taxonomic identifications and biomass
estimates.) MiCId uses the scoring function and statistics from the
database search tool RAId_DbS[49] to score
peptides and for assigning statistical confidences, E-values, to identified peptides. Identified peptides are then used
as evidence for protein identifications. See Figure for an overview of MiCId’s workflow.
Figure 1
MiCId’s
workflow overview. To execute MiCId, users must
provide the following input: a list of taxonomic identifiers taken
from the NCBI, an experimental datafile (containing MS/MS spectra)
of a microorganism sample, an antibiotic resistance (AR) protein file,
and the parameters for database search. The list of taxonomic identifiers
is used by MiCId to download from the NCBI the Fasta files of the
protein sequences for all the taxa specified along with their taxonomic
information. The downloaded protein Fasta files and the taxonomic
file are used to create the microorganismal database. In step 1, the
MS/MS spectra are queried in the microorganismal database in order
to determine the taxonomic composition (via an iterative approach
that propagate only taxa identified at one level to identifications
at the next level) and the relative biomasses of microorganisms in
the sample.[29,30] In step 2, the newly augmented
step, MiCId generates a protein database that includes protein sequences
from reference/representative strain of species identified with E-value ≤ 0.01 and prior ≥
0.01 and from the user-specified AR protein file. The MS/MS spectra
are then used to query this database to perform protein identifications,
AR proteins included.
MiCId’s
workflow overview. To execute MiCId, users must
provide the following input: a list of taxonomic identifiers taken
from the NCBI, an experimental datafile (containing MS/MS spectra)
of a microorganism sample, an antibiotic resistance (AR) protein file,
and the parameters for database search. The list of taxonomic identifiers
is used by MiCId to download from the NCBI the Fasta files of the
protein sequences for all the taxa specified along with their taxonomic
information. The downloaded protein Fasta files and the taxonomic
file are used to create the microorganismal database. In step 1, the
MS/MS spectra are queried in the microorganismal database in order
to determine the taxonomic composition (via an iterative approach
that propagate only taxa identified at one level to identifications
at the next level) and the relative biomasses of microorganisms in
the sample.[29,30] In step 2, the newly augmented
step, MiCId generates a protein database that includes protein sequences
from reference/representative strain of species identified with E-value ≤ 0.01 and prior ≥
0.01 and from the user-specified AR protein file. The MS/MS spectra
are then used to query this database to perform protein identifications,
AR proteins included.MiCId aims to identify
AR protein candidates that are globally
homologous to the AR proteins already validated (e.g., proteins in
an AR database). When performing protein identifications, proteins
that share a large number of identified peptides are grouped as a
cluster. To control the number of identified proteins, several existing
methods[58] report those similar proteins
as one. Adopting the same idea, we implemented this approach via two
clustering procedures: (1) a peptide-centric clustering procedure
and (2) a protein-similarity clustering procedure. Details regarding
the clustering procedures are provided in the first section of Supplementary File S1.
MS/MS
Data Sets
A total of five MS/MS
data sets were used for this study. One data set, generated in-house
PXD026634, is composed of 21 experimental MS/MS datafiles from samples
of five bacteria strains. The other four data sets, downloaded from
the ProteomeXchange Database (PD),[59] contain
105 experimental MS/MS datafiles from samples of five other bacterial
strains. For seven bacterial strains used in this study, one may download
their complete genomic sequences[24,60−62] and protein sequences from the National Center for Biotechnology
Information (NCBI) databases.[63] In Table S1, we provide the pertinent information
for each MS/MS data set.
In-House MS/MS Data Set
Two carbapenem-resistant P. aeruginosa strains
were included in the study. Strain
CCUG 51971 (= PA 66) was isolated from a human urine sample, at the
Karolinska Hospital (Stockholm, Sweden), carrying OXA-35, OXA-488,
PDC-35, and VIM-4.[60] The VIM-4 metallo-β-lactamase
is responsible for the high carbapenem resistance levels (minimum
inhibitory concentration (MIC) of imipenem and meropenem greater than
256 μg/mL; MIC of imipenem + ethylenediaminetetraacetic acid
[EDTA] = 6 μg/mL).[60] Strain CCUG
70744 was isolated from a human sputum sample, at the Sahlgrenska
University Hospital (Gothenburg, Sweden), carrying OXA-905 and PDC-8.[62,64,65]Furthermore, one E. coli and two K. pneumoniae strains,
isolated from various clinical samples at the Sahlgrenska University
Hospital, carrying different β-lactamases (including extended
spectrum β-lactamases, ESBL, and carbapenem resistance genes)
were included in the study. E. coli CCUG 70745 isolated
from human feces, carrying CMY-6, CTX-M-15, NDM-7,and OXA-1; K. pneumoniae CCUG 70742 isolated from human urine, carrying
CTX-M-15, OXA-1, OXA-48, and TEM-1; and K. pneumoniae CCUG 70747 isolated from human wound, carrying KPC-2, SHV-200, TEM-1,
and VIM-1.[62] Lyophiles of all strains were
obtained from the Culture Collection of University of Gothenburg (CCUG,
Gothenburg, Sweden; www.ccug.se). The strains were reconstituted on Müller-Hilton agar (Substrate
Unit, Department of Clinical Microbiology, Sahlgrenska University
Hospital), at 37 °C, for 24 h.Further details regarding
cultivation conditions, sample preparation,
and LC–MS/MS acquisition are provided in the second section
of the Supplementary File S1.
Downloaded MS/MS Data Sets
Four
publicly available data sets, previously used in two different studies
on the identification of AR proteins in bacteria, were downloaded
from the ProteomeXchange Database (PXD) (http://www.proteomexchange.org/). Data set PXD004321 was taken from the study of the computational
method TCUP on the identification of AR proteins.[24] This data set contains six experimental MS/MS datafiles
from samples of a ESBL E. coli strain CCUG 62462,
carrying CTX-M-15 and TEM-1;[61] the CCUG
62462 strain was grown in pure cultures without and with cefotaxime
at 1000 μg/mL. Data set PXD011105, containing 35 experimental
MS/MS datafiles, was taken from the study on the mechanism of antibiotic
resistance of two clonal isolates (the P. aeruginosa strain CLJ1 antibiotic-sensitive isolate and the P. aeruginosa strain CLJ3 multidrug-resistant isolate obtained from the same patient
at different times) grown in pure cultures with carbenicillin at 200
μg/mL.[66] Data set PXD005587, containing
24 experimental MS/MS datafiles, was taken from the investigation
on proteomics changes due to antibiotic-dependent perturbations in
ESBL K. pneumoniae strain 34618, grown in pure cultures
without and with doxycycline or streptomycin.[31] Data set PXD010244, containing 40 MS/MS datafiles, was taken from
the research on the mechanism of antibiotic resistance in ESBL K. pneumoniae strain KpV513 grown in pure cultures without
and with doxycycline or streptomycin or doxycycline and streptomycin.[32]
MS/MS Data Analysis Using
MiCId Workflow
All data sets were analyzed using the MiCId
(v.07.01.2021) workflow.[21,29,30] The peptide-centric microorganismal
database used for analysis includes all reference and representative
proteomes of bacteria (12,703 strains in total) that are available
in the National Center for Biotechnology Information (NCBI) database
as of Feb 4, 2021. The reference proteome of one Homo sapiens is also included for two reasons. First, human proteins are a major
component of microorganism samples when obtained directly from human
hosts. Second, human proteins (mostly keratin) are also frequently
identified, albeit at lower abundances, even in microorganism samples
from laboratory cultures. The reference proteomes of Bos taurus and Saccharomyces cerevisiae are also included
because they are present, respectively, in the Mueller Hinton Broth
and in the Luria–Bertani Broth; both broth media are routinely
used to grow bacterial cultures. The protein sequence Fasta files
for the 12703 organisms along with the file containing taxonomic information
were downloaded from the NCBI database at https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/ and at https://www.ncbi.nlm.nih.gov/Taxonomy on Feb 4, 2021. In total, 60176722 protein sequences were downloaded.
As previously described,[29] to speed up
MS/MS spectrum analysis, MiCId processes the protein sequences and
the taxonomic file into a peptide-centric microorganismal database.
The final size of the peptide-centric microorganismal database is
100 GB.To allow for AR protein identifications, in addition
to the database mentioned above, MiCId included in its search scope
AR proteins from one of the following databases: ResFinder,[27] CARD,[52] or NDARO.[53,67] Users also have the option to provide for MiCId their own assembled
AR protein database in a Fasta file. Table S2 lists the protein identifiers for the AR proteins along with the
taxonomy identifiers and scientific names of the organisms included
in MiCId’s databases.While querying the database with
PXD004321, PXD026634, PXD011105,
PXD005587, and PXD010244 data sets, the following search parameters
were employed. The digestion rules of trypsin and lys-c were assumed
with up to two missed cleavage sites per peptide. The mass error tolerance
of 5 ppm was set for the precursor ions and 20 ppm for the product
ions except when analyzing PXD011105, PXD005587, and PXD010244 (10
ppm for the precursor ions). For PXD004321 and PXD026634, cysteines
were unmodified, and for PXD011105, PXD005587, and PXD010244 iodoacetamide
was used as the reduction agent, changing the molecular mass of every
cysteine from 103.00919 to 160.030647 Da. RAId’s Rscore, using
b- and y-ions as evidence, was used to score peptides. The statistical
significance of each peptide was assigned via RAId_DbS’s theoretically
derived peptide score distribution.[49] The
largest (cutoff) E-value for a peptide to be reported
was set to 1. For taxa identifications at the genus level and lower,
all microorganisms under the genus Shigella were
excluded from consideration to avoid classification ambiguity because
some researchers have argued that taxonomically Shigella should be classified under Escherichia coli.[68]
Results and Discussion
In this section, we present the evaluation of MiCId’s workflow
in identifying AR proteins. First, we use 126 experimental MS/MS datafiles
to assess MiCId’s AR protein identification strategy. Second,
we estimate the sensitivity of AR protein identifications via MiCId
using 27 experimental MS/MS datafiles from samples of six antibiotic
resistant bacteria strains (from three species included in the pathogen
priority list of the World Health Organization[69] for antibiotics research and development), cultured with
and without antibiotics. Third, using 35 experimental MS/MS datafiles
from samples of two human pathogens, we employ MiCId’s workflow
to investigate possible mechanisms of antibiotic resistance.Following the AR protein classification used by the CARD database,[52,57] one finds that there are large numbers of highly homologous AR proteins
within most AR protein families and expects this family level crowdedness
to remain as AR protein databases continue to grow. As an illustration,
we note that each AR protein within the β-lactamase family has
very homologous sequences within the family: if one takes an AR protein
as the query to align with each of the rest of the AR proteins in
the β-lactamase family, for the best pairwise alignment, there
are many high score alignments and the best of which has an average
length normalized BLAST bit-score ≥2. This is shown in Table S3. The length normalized BLAST bit-score
is defined as the BLAST bit score divided by the length of the longer
of the two sequences aligned. As illustrated in Figures S1 and S2, a good cutoff for length normalized BLAST
bit-score is 1.6.The high degree of similarity for AR proteins
in the same family
makes distinguishing among AR proteins at finer-than-family level
not always possible, particularly when data acquisition in the MS/MS
experiment is untargeted. For these reasons, during our evaluation,
identified AR proteins are counted as true positives and false positives
at the AR protein family level. For example, assume a bacteria strain
contains OXA-1 and OXA-48 proteins, leading to two proteins in the
OXA family; if during the analysis MiCId identifies three OXA proteins,
then the two best ranking OXA proteins identified are counted as true
positives and the remaining OXA protein is counted as a false positive.
For some AR protein families that are not yet overly represented/crowded
in the database, correct identification can be achieved at finer-than-family
when closely related homologous proteins are present in the database.
For example, for AR proteins from aminoglycoside families, families
that are not overly represented in the database, we observed correct
identifications for these AR proteins not only at the family level
but also at the isoenzyme level,[70,71] which is a
finer level than the family level. Of course, if the target family
(to which the query protein belongs) is too much under-represented
in the database, either no identification is made or misidentifications
of the AR protein families occur.
Evaluation of MiCId’s
AR Protein Identification
Strategy
MiCId’s strategy for AR protein identifications
is to first identify species in a microorganismal database and then
identify AR proteins in a target protein database composed of proteins
from the reference/representative proteomes of confidently identified
species and AR proteins from a high-quality AR database.[27,52,53] MiCId’s strategy capitalizes
on microorganismal identifications at the species level because high
confidence microorganism identifications at taxonomic levels lower
than species become challenging because of the lack of discriminative
peptides among the ones identified[30,72] when using
the routinely employed high-resolution data-dependent acquisition
mode in MS/MS experiments.[73] In principle,
more advanced MS/MS experiments such as targeted MS/MS using selected
reaction monitoring (SRM) or parallel reaction monitoring (PRM) can
be used for taxonomic identification below the species level by targeting
peptides that are unique to taxa at lower taxonomic levels.[72,74−77] However, a limitation of such approaches is that they can only be
employed for the identifications of a microorganism within a small,
predetermined set of microorganisms. Another reason for employing
MiCId’s strategy has to do with the trustworthiness in annotation
of the taxonomic database for taxonomic levels below species.[78−81] It is important to mention that although for this study we only
included the proteins from strains that are labeled as reference and
representative in the microorganismal database, as these are proteins
from higher quality genomes,[54,82,83] MiCId’s workflow is not limited to microorganismal databases
composed of only reference and representative genomes, and it can
perform microorganismal identifications beyond the species level.When, for the purpose of protein identifications, selecting a proteome
as the representative for a confidently identified species, MiCId
relies on a heuristic because under a given species there could be
many strains and priority for each has to be established. The heuristic
gives strains that are labeled as reference first
priority and representative second priority. Information
about reference and representative strains is taken from the RefSeq
and GenBank assembly summary files downloaded from the NCBI. If there
is more than one reference strain or representative strain for a given
species, the strain with the larger number of proteins is selected.
When a species has neither reference strain nor representative strain,
the proteome from the strain, under that species, with the larger
number of proteins is selected. The rule of assigning high priority
to the proteomes from reference strains and representative strains
is applied because these are proteome assemblies of higher quality
and importance that have been curated by the NCBI staff and are to
be used as anchors for the analysis of closely related proteomes within
the same taxonomic group.[54]For identifications
of AR proteins, the ideal target protein database
would include all of the protein sequences obtained directly from
the strains present in the biological sample and with AR proteins
unambiguously annotated. From an MS/MS-based proteomics viewpoint,
such a database is unattainable even if strain level identification
is achieved. MS/MS-based proteomics approaches rely on databases such
as the ones at the NCBI to obtain protein sequences for yet-to-be-identified
strains. A target protein database constructed using this procedure
would still be an approximation to the ideal target protein database
because the strains present in the biological sample could have acquired
new proteins via horizontal gene transfer and mutations through rapid
multiplication and environmental pressure.[50,51] By including a comprehensive AR protein database in the target database,
MiCId can potentially deal with the horizontal AR gene transfer; with
the clustering procedure, a few mutations of an AR protein do not
prevent it from being identified provided that there are sufficient
identified peptides containing no mutations. However, lacking a complete
AR protein database encoding[84] all existent
mutations, MiCId cannot pinpoint the mutation sites and their amino
acid polymorphisms. MiCId can potentially allow the presence of a
few mutations occurring in the AR proteins to be identified. Overall,
the target protein database used in MiCId’s strategy is not
a bad approximation to the ideal target database because the proteomes
for most strains under a given species shared a significant number
of homologous proteins[55,56] and include AR proteins in a
global manner, covering the possible acquisition, via horizontal gene
transfer, of known AR proteins.To evaluate MiCId’s strategy
for identifying AR proteins,
we prepared two sample-specific target protein databases and queried
them with the same MS/MS datafiles from specific samples. The first
target protein database is composed of proteins from the reference
proteome of the species present in the sample and AR proteins from
the ResFinder database, referred to here as reference strains plus
AR database (RA). The other target protein database is composed of
proteins from the proteome of the true strain present in the sample
and AR proteins from the ResFinder database, referred to here as correct
strain plus AR database (CA). Table S4 contains
the protein identifiers for each protein used to generate both versions
of the sample-specific target protein databases. Plotted in panels
A and B of Figure are the PFD curves from querying the RA and CA databases with 62
experimental MS/MS datafiles from samples of seven strains. There
are in total 14 target protein databases, two for each of the seven
strains that have complete genome sequence available, used in generating
panels A and B of Figure . The PFD curves in panel A of Figure show that using the target protein databases
RA and CA produced comparable PFD curves. Furthermore, the curves
in panel B of Figure show that using RA and CA databases yields 131 common AR protein
identifications. What was not shown is that there are 12 AR protein
identifications, covering five AR protein families, not shared: there
are eight PDC protein family identifications and the one ant(3”)
family identification present in the list identified using the CA
database but absent from that using the RA database; on the other
hand, only one TEM family identification, one OXA family identification,
and one ARR family identification are found using the RA database.
Multiple PDC family identifications are found using both databases:
23 identifications using the CA database and 15 identifications using
the RA database. The identification rate of PDC protein family in
the CA database is higher because it contains the correct PDC protein
PTC38756.1, which belongs to the CLJ1 strain, even though this protein
is not yet included in the ResFinder database. In addition, in the
ResFinder database the PDC family—containing only four PDC
proteins: AAM08942.1, ACQ82815.1, ACQ82807.1, and AAM08945.1—is
under-represented, making it difficult to identify the correct PDC
using the ResFinder database since even the most homologous PDC protein
(AAM08942.1) and the correct PDC protein (PTC38756.1) differ by more
than 50 amino acid residues. The discrepancy in true positives in
the other four AR protein families is also mainly caused by composition
difference of the two target databases. Table S5 contains pertinent information on all the identified proteins/families
in both databases.
Figure 2
MiCId workflow evaluation. Let us mention here again that
the abbreviations
CA and RA refer, respectively, to target databases composed of proteins
from the correct strain plus the chosen AR database and from reference
strains plus the chosen AR database. Panels A and B display PFD curves
when querying the CA and RA with 62 experimental MS/MS datafiles.
Panel A shows that the PFD curves from searching in CA and RA are
comparable. Panel B shows that there are 131 true positive antibiotic
resistance (AR) proteins identified in common in CA and RA. The PFD
values in panels C and D were obtained from querying RA with 126 experimental
MS/MS datafiles. Panel C also indicates that 180 AR proteins are identified
at the 5% PFD level. Panel D shows that using an E-value cutoff of 0.01 the identification of AR proteins can be controlled
at the PFD level smaller than 5%. The abbreviations TP, FP, and PFD
refer, respectively, to true positive, false positive, and proportion
of false discoveries.
MiCId workflow evaluation. Let us mention here again that
the abbreviations
CA and RA refer, respectively, to target databases composed of proteins
from the correct strain plus the chosen AR database and from reference
strains plus the chosen AR database. Panels A and B display PFD curves
when querying the CA and RA with 62 experimental MS/MS datafiles.
Panel A shows that the PFD curves from searching in CA and RA are
comparable. Panel B shows that there are 131 true positive antibiotic
resistance (AR) proteins identified in common in CA and RA. The PFD
values in panels C and D were obtained from querying RA with 126 experimental
MS/MS datafiles. Panel C also indicates that 180 AR proteins are identified
at the 5% PFD level. Panel D shows that using an E-value cutoff of 0.01 the identification of AR proteins can be controlled
at the PFD level smaller than 5%. The abbreviations TP, FP, and PFD
refer, respectively, to true positive, false positive, and proportion
of false discoveries.Panel C of Figure shows that there are 180 AR
protein family identifications at the
5% PFD level when all 126 MS/MS data files are analyzed. Panel D of Figure shows that when
an E-value cutoff of 0.01 is used the identification
of AR proteins can be controlled at the 5% PFD level. On the basis
of this result, in order to control the false positives at around
the 5% PFD level, only AR protein family identifications with E-values below 0.01 are deemed true positives with high
confidence by MiCId. Table S6 has the list
of all identified proteins for all 126 MS/MS data files. Because the
126 datafiles used have been annotated with true positives, we are
able to display the “theoretical” PFD values in Figure to show the retrieval
effectiveness. However, in real experimental data analyses, the PFD
has to be estimated as the true positives and false positives are
not known beforehand. In the third section of Supplementary File S1, we show how to estimate the PFD values
via E-values. The closeness between the “theoretical”
PFD and estimated PFD is also shown in Figure S3. Even though the PFD can be estimated, control of PFD does
not prioritize the proteins that meet the PFD cutoff. For this reason,
we find that E-values, when assigned accurately,
provide more useful information. Not only it can be used to infer
the expected number of false positives, hence, type-I error control,
it can also be used to prioritize the proteins meet a PFD cutoff.
We also like to stress that MiCId searches the database in a single
pass with the PFD computed via the accurate E-values
reported;[40,49,85] it does not
use multipass target-decoy heuristics. The latter was designed with
the intent to amplify the identification rates but, unfortunately,
violates the statistical foundations, of the target-decoy approach.[86−89]As mentioned above, a requirement for MiCId’s strategy
to
work is that it must have accurate species-level identification. MiCId
achieves accurate microorganism identification with trustworthy confidence
assignments by properly computing for every identified taxon an E-value and a prior probability.[29,30] For a quality score S, the E-value
reflects the expected number of random taxa with scores the same as
or better than S.[40] A
taxon’s prior probability is the probability
for an identified taxon to emit any evidence peptide which can also
be viewed as that taxon’s protein biomass up to an overall
proportionality constant as described earlier.[30] Therefore, identified taxa with small E-values and large priors are more likely to be present in the sample.
As we have demonstrated, MiCId can control the PFD below 5% by calling
true positives only identified taxa with E-values
≤ 0.01 and prior ≥ 0.01.[29,30] In addition, MiCId employs an iterative approach for taxa identification
at each taxonomic level; only taxa identified at the upper taxonomic
level are considered for the next level identifications.[29,30] As shown in Figure , when considering all identifications with E-values
≤ 1, MiCId identifies for each of the 126 samples the correct
species with only 13 false positives overall. Interestingly, MiCId
also identifies H. sapiens in 103 samples, B. taurus in 18 samples, and S. cerevisiae in 12 samples. H. sapiens are identified using
evidence peptides from keratin proteins detected in the samples. Keratin
proteins are a common contaminant to mass spectrometry experiments,
usually originating from skin and hair as well as dust, clothing,
and latex gloves. The identification of B. taurus and S. cerevisiae in some of the samples is expected
as they are present in the broth medium used to grow the bacterial
cultures.
Figure 3
Species level composition for samples 1–126. Plotted in
panel A are the cumulative frequency of the number of true positives
species, H. sapiens, B. taurus, S. cerevisiae, and false positives species versus the natural
logarithm of the E-value of all species identified
with E-value ≤ 1. Plotted in panel B are the
histograms of priors for the true positive species, H. sapiens, B. taurus, S. cerevisiae, and
false positive species. Among the true positives: E. coli is identified 10 times, K. pneumoniae 72 times,
and P. aeruginosa 44 times. Among the 13 false positives: Algisphaera agarilytica is identified 1 time, Cerasicoccus
arenae 4 times, Chlorobium tepidum 2 times, Desulfosporosinus orientis 1 time, Fervidobacterium
thailandense 1 time, Ktedonosporobacter rubrisoli 1 time, Streptococcus thermophilus 1 time, and Thiofilum flexile 2 times. H. sapiens is
identified 103 times, B. taurus 18 times, and S. cerevisiae 12 times. To control the proportion of false
discoveries below 5% only species identified with an E-value ≤ 0.01 (ln(E-value) = −4.6)
and prior ≥ 0.01 are considered true positives
with high confidence. When employing the recommended cutoffs of E-value ≤ 0.01 and prior ≥
0.01 MiCId still identifies all true positives with no false positives.
Species level composition for samples 1–126. Plotted in
panel A are the cumulative frequency of the number of true positives
species, H. sapiens, B. taurus, S. cerevisiae, and false positives species versus the natural
logarithm of the E-value of all species identified
with E-value ≤ 1. Plotted in panel B are the
histograms of priors for the true positive species, H. sapiens, B. taurus, S. cerevisiae, and
false positive species. Among the true positives: E. coli is identified 10 times, K. pneumoniae 72 times,
and P. aeruginosa 44 times. Among the 13 false positives: Algisphaera agarilytica is identified 1 time, Cerasicoccus
arenae 4 times, Chlorobium tepidum 2 times, Desulfosporosinus orientis 1 time, Fervidobacterium
thailandense 1 time, Ktedonosporobacter rubrisoli 1 time, Streptococcus thermophilus 1 time, and Thiofilum flexile 2 times. H. sapiens is
identified 103 times, B. taurus 18 times, and S. cerevisiae 12 times. To control the proportion of false
discoveries below 5% only species identified with an E-value ≤ 0.01 (ln(E-value) = −4.6)
and prior ≥ 0.01 are considered true positives
with high confidence. When employing the recommended cutoffs of E-value ≤ 0.01 and prior ≥
0.01 MiCId still identifies all true positives with no false positives.When imposing the recommended cutoffs, E-values
≤ 0.01 and prior ≥ 0.01, to control
the PFD below 5%, MiCId still identifies correctly the true positive
species out of each of the 126 samples. This is because, as shown
in Figure , all of
the true positives are identified with a much lower E-value than 0.01 and much larger prior than 0.01.
However, with the recommended cutoffs, H. sapiens is identified now in 41 samples, B. taurus in 11
samples, S. cerevisiae in 11 samples, and no false
positives. In terms of the prior, reflecting the
taxon’s relative protein biomass, one would expect it to be
very close to 1 for true positive species identified, given that the
samples are each assumed to contain a single microorganism. The main
reason that it deviates from 1 is because out of the 126 samples,
when one only imposes E-values ≤ 1 for reporting
identification, 105 samples have, in addition to the underlying microbe,
identifications matching some of the following three organisms: H. sapiens, B. taurus, and S. cerevisiae. For these 105 samples, as shown in panel B of Figure , H. sapiens contributes to the overall protein biomasses with prior values ranging from 0.00076 to 0.085 with an average value of 0.016; B. taurus has prior values ranging from
0.0019 to 0.25 with an average value of 0.1; S. cerevisiae has prior values ranging from 0.0011 to 0.048 with
an average value of 0.026. These non-zero prior values
for H. sapiens, B. taurus, and S. cerevisiae cause the observed deviation of the prior value from 1 for the TP. Table S7 contains pertinent information on the identified species
for each sample.It is important to mention that the taxa identification
results
reported by MiCId are not filtered by using the recommended cutoff
to avoid incidental false negatives. MiCId reports the complete list
of identified taxa using a color-coded scheme. Identified taxa passing
the recommended cutoffs, E-values ≤0.01 and prior ≥ 0.01, are highlighted in green for high-confidence
in being a true positive; taxa identified with an E-value ≤ 1 and prior ≥ 0.001 are highlighted
in yellow for low confidence in being a true positive, and taxa identified
with an E-value > 1 or prior <
0.001 are highlighted in red for no-confidence in being a true positive.
Estimate for Sensitivity of AR Protein Identifications
via MiCId’s Workflow
Having computational methods
that can correctly identify bacteria and also their AR proteins is
among the most important research fronts for fighting infections.
We demonstrate the usefulness of MiCId’s workflow in serving
as such a computational method in this subsection and next. We use
datafiles from some bacteria containing β-lactamase proteins
as examples for the reasons listed below. First, β-lactam antibiotics
are the most prescribed class of antibiotic to fight infections globally;[90] second, in the United States, about 65% of the
antibiotics prescribed are β-lactam antibiotics.[91] Of special importance in this class of antibiotic
are carbapenems. Carbapenems have a broad spectrum of activity and
are usually used as the last-line of the defense for seriously ill
patients suspected of harboring resistant bacteria.[92] Evidently, correct identifications of carbapenem resistance
can help significantly in fighting infections. In addition, β-lactamase
proteins can be harbored by plasmid, and when this occurs they can
be easily transmitted into different bacteria cells, introducing resistance
to the bacteria.[90,93,94]To estimate the sensitivity of MiCId’s workflow on
the identification of AR proteins, we used 27 MS/MS experimental datafiles
from six antibiotic-resistant bacterial strains (from three species
included in the pathogen priority list of the World Health Organization[69] for antibiotics research and development), cultured
with and without β-lactam antibiotics. The three β-lactam
antibiotics used belong to two classes of antibiotics: belonging to
the cephalosporin class is cefotaxime and belonging to the carbapenem
class are ertapenem and meropenem. Each of the bacterial strains carries
between two and four predicted β-lactamase proteins and shows
resistance to a variety of antibiotics.[60−62,65,95] β-Lactamase proteins for
these strains were computationally predicted using ResFinder.[27]Table S8 provides
a protein-centric view. This table lists for each predicted β-lactamase
the strains containing it and the names of the β-lactam drug
classes it resists. For the purpose of estimating the sensitivity
value, we view each possible β-lactamase identification per
experiment as a different event. Summing the numbers of possibly identifiable
β-lactamase proteins from each of the 27 experiments, one obtains
a total of 88 potential true positives. This may be viewed as the
maximum set of the true positives. An avid reader may ask what happens
if some AR proteins, in this case β-lactamase proteins, are
missed from the database. When that happens, because these proteins
will never be identified, they do not contribute counts to either
the numerator or the denominator while the sensitivity value is computed.
Hence, for the purpose of assessing the sensitivity, one does not
need to worry about AR proteins that are not included in the database.
On the other hand, a predicted AR protein may never be observed because
it is usually expressed in low abundance or it is not even a true
protein in the corresponding microorganism’s proteome. When
this is the case, it becomes inappropriate to use the maximum TP set
as the TP set for the purpose of estimating the sensitivity value.When using all 88 possible identifications as the TP set, one obtains
a sensitivity value of 72.7% (64/88). This may be viewed as the lower
bound of the sensitivity of MiCId’s workflow. If one excludes
from the TP set the β-lactamases—OXA-1 in K.
pneumoniae CCUG 70742, OXA-488 in P. aeruginosa CCUG 51971, and OXA-905 in P. aeruginosa CCUG 70744—that
were never confidently observed in any of the corresponding experiments
considered, one obtains a sensitivity value of 85.3% (64/75). This
sensitivity value may be viewed as the typical sensitivity value while
employing MiCId’s workflow.Table shows the
identification results of β-lactamase protein families for all
the 27 MS/MS experiments. Displayed in Table are 64 identifications with E-value ≤ 0.01 highlighted in green and marked with a checkmark,
six identifications with 0.01 < E-value ≤
1 highlighted in yellow and marked with a checkmark, five cases of
missed identification (while identified in other samples) marked with
an X, and 12 cases of no identification with no marks.
Table 1
Identification Results of β-Lactamase
Proteins from Culture Samples of Six Antibiotic-Resistant Strains
Cultivated with and without β-Lactam Antibioticsa
Cells in green color and marked
with a checkmark indicates that the protein was identified with an E-value ≤ 0.01, indicating high confidence in the
identification. Cells in yellow and marked with a checkmark indicate
that the protein was identified with an 0.01< E-value ≤ 1, indicating low confidence in the identification.
Cells with X indicate that the protein was not identified for that
sample number (SN); cells with no mark indicate that the protein was
not identified in that data set; CTX, cefotaxime; ETP, ertapenem;
MEM, meropenem; NA, no antibiotic. Cases of no identification have
no mark.
Cells in green color and marked
with a checkmark indicates that the protein was identified with an E-value ≤ 0.01, indicating high confidence in the
identification. Cells in yellow and marked with a checkmark indicate
that the protein was identified with an 0.01< E-value ≤ 1, indicating low confidence in the identification.
Cells with X indicate that the protein was not identified for that
sample number (SN); cells with no mark indicate that the protein was
not identified in that data set; CTX, cefotaxime; ETP, ertapenem;
MEM, meropenem; NA, no antibiotic. Cases of no identification have
no mark.For bacterial cultures
exposed to an antibiotic, one would expect
the bacteria to express some of its AR proteins at high levels.[96,97] MiCId’s workflow does identify, except for OXA-1, OXA-488,
and OXA-905, all of the predicted β-lactamase proteins. The
AR protein OXA-1 is copresent with OXA-48, CTX-15, and TEM-1 in the
genome of K. pneumoniae CCUG 70742; OXA-488 is present
along with OXA-35, VIM-4, and PDC-35 in the genome of P. aeruginosa CCUG 51971; and OXA-905 is copresent with PDC-8 in the genome of P. aeruginosa CCUG 70744. For K. pneumoniae CCUG 70742, MiCId’s workflow identified OXA-48 in three samples
via OXA-232, OXA-199, and OXA-548 as these three proteins are highly
homologous to OXA-48 and have length-normalized BLAST bit-scores of
2.04, 2.05, and 1.69, respectively. For P. aeruginosa CCUG 51971, MiCId’s workflow identified OXA-35 in five samples
with high confidence via OXA-19, OXA-101, OXA-35, OXA-147, and OXA-240
as these five proteins are highly homologous to OXA-35 and have length-normalized
BLAST bit-scores of 2.04, 2.03, 2.05, 2.03, and 1.98, respectively.
The complete list of identified AR proteins can be found in Table S6.MiCId’s identification
results for P. aeruginosa CCUG 51971 and P. aeruginosa CCUG 70744 correlate
well with a previous study showing that, in the model strain P. aeruginosa—PAO1 the gene of the OXA-50-like oxacillinase—is
expressed at relatively low levels and is not inducible by β-lactams,
while the gene of blaPDC, also expressed
at relatively low levels usually, is strongly induced by β-lactams.[98] This could be the reason why MiCId did not detect
OXA-488 in P. aeruginosa CCUG 51971 and OXA-905 in P. aeruginosa CCUG 70744 but detected PDC-35 in P. aeruginosa CCUG 51971, albeit only at the highest concentrations
of meropenem. There are also several experimental reasons ranging
from digestion enzyme, data-dependent acquisition mode selection,
protein expression level, as well as nonoptimal liquid-chromatography
separation that can be used to explain why some of the β-lactamase
proteins were not identified or were not confidently identified. To
further validate that the missed identification of β-lactamase
proteins was not due to MICId’s inability, we analyzed all
27 MS/MS experimental datafiles using the Proteome Discoverer software
(version 2.4), and the results obtained, displayed in Table S9, are in agreement with the MiCId results.This assessment shows that MiCId’s workflow has a typical
sensitivity value around 85% (and with a lower bound at about 72.7%),
suggesting that it is a useful tool for the detection of AR proteins.
Using the MiCId Workflow to Investigate Possible
Mechanisms of Antibiotic Resistance
We demonstrate here how
the MiCId workflow may aid in the investigation of the possible mechanisms
of antibiotic resistance of a human pathogen, Pseudomonas
aeruginosa strain CLJ3, and compare the mechanism suggested
by using MiCId with published results.[66]P. aeruginosa strains were obtained from a patient
having hemorrhagic pneumonia but treated unsuccessfully with antibiotics.
Strain CLJ1, sensitive to antibiotics, was isolated before antibiotic
therapy started; 12 days after antibiotic therapy started, as the
patient conditions worsened, strain CLJ3 was isolated. A multiomics
approach was used to understand the process of antibiotic resistance
development in CLJ3. Genomics data shows that the genome of CLJ3,
when compared to the genome of CLJ1, has acquired several genetic
modifications that could have contributed to phenotypic changes. Genomics
data shows that antibiotic resistance of CLJ3 is probably linked to
interruption-causing insertions detected in genes oprD and ampD.[66]For
each strain, proteomics samples comprising proteins contained in the
whole-cell (W), inner and outer membranes (M), and secretome (S) were
collected and used for MS/MS analysis.[66] The CARD database was used as the input AR protein database in MiCId’s
workflow as it contains proteins belonging to multidrug efflux pumps.[99] The suggested mechanisms of antibiotic resistance
for CLJ3 by using MiCId’s workflow agrees with the published
results.[66] Comparing the AR proteins identified
in membrane samples from CLJ3 and CLJ1, one notes that CLJ3 does not
express the outer membrane protein oprD and is overexpressing
the β-lactamase PDC. Lack of the outer membrane protein oprD, caused by interruption of oprD gene,
makes the cell impermeable to most antibiotics in the β-lactam
class.[66,100,101] Interruption
in the ampD gene brings about the overexpression
of blaPDC as the ampD gene is responsible for the regulation of blaPDC.[102,103]Table S10 contains the identifiers of
AR protein families for each strain. Table S10 also shows that in agreement with the previous study on the mechanism
of antibiotic resistance for CLJ3 was only obtained for the membrane
samples. One obvious reason is that one expects to find higher concentration
of oprD proteins in the membrane extract and of β-lactamase
proteins in the periplasm, which is the cellular component between
the inner- and outer-membrane of Gram-negative bacteria, and thus
can often be a component contaminant for the membrane samples.[104,105] This accentuates the necessity of sample extraction selection and
sample fractionation when investigating possible mechanisms of antibiotic
resistance.[105−107]We further used principal component
analysis (PCA) to demonstrate
the reproducibility of MiCId’s workflow. The vector component
for each sample was set to be ln(1/E-value) of the
identified AR protein family. AR proteins not identified in a given
sample was assigned the E-value 100, yielding a vector
component value of −4.605. For each sample, the components
are further scaled to have norm 1. Figure shows tight clusters for samples derived
from whole-cell and membrane for each strain; for the secretome samples,
data points are not as close, indicating that the secretome might
not be suitable for studying AR proteins. The results from principal
component analysis validate the reproducibility of MiCId’s
workflow in AR protein identifications as shown by the tight sample
clusters in Figure .
Figure 4
Principal component analysis (PCA) for antibiotic resistance (AR)
protein families identified by MiCId’s workflow. Included in
the PCA are 35 identification results, each from an experiment whose
sample contains either P. aeruginosa strain CLJ1
or P. aeruginosa strain CLJ3 with proteins collected
from whole-cell (W), membrane (M), or secretome (S). Also revealed
in the plot, there are only five experimental replicates for the combination
CLJ3-S, the other five combinations each has six experimental replicates.
This brings the total number of experiments included to 35.
Principal component analysis (PCA) for antibiotic resistance (AR)
protein families identified by MiCId’s workflow. Included in
the PCA are 35 identification results, each from an experiment whose
sample contains either P. aeruginosa strain CLJ1
or P. aeruginosa strain CLJ3 with proteins collected
from whole-cell (W), membrane (M), or secretome (S). Also revealed
in the plot, there are only five experimental replicates for the combination
CLJ3-S, the other five combinations each has six experimental replicates.
This brings the total number of experiments included to 35.
Execution Time of MiCId’s
Workflow
With speed a main consideration, MiCId was written
in C++ and its
routines for organism and protein identifications were implemented
using parallel programming. Hence, MiCId allows users to specify the
desired number of cores for each job. Using 28150 MS/MS spectra to
query two databases of sizes 100 GB (12703 organisms) and 20 GB (3868
organisms), we measured the execution time of MiCId’s workflow
in performing organism identification, biomass estimation, and protein
identifications. Figure shows that in the 20 GB database takes about 13 min with four cores
and reduces to around 6 min with 16 cores. On the other hand, the
execution time in the 100GB database ranges from 17 min (with four
cores) down to 7 min (with 16 cores). Our results indicate that when
the database size increases by a factor of 5.0, the execution time
increases only by a factor of about 1.2 (using 16 cores). This reflects
the scalability of MiCId in handling large databases. Figure also shows that the execution
time reduction reaches a plateau at around 16 cores. This is because
the C++ routine used to compute statistical significance for identified
organisms and proteins is not yet parallelized, incurring a constant
time cost. Table S2 contains the taxonomic
identifiers for all the organisms in the 100 and the 20 GB databases
as well as the identifiers for proteins taken from the ResFinder database.
Figure 5
Average
execution time, in minutes, of MiCId’s workflow
in performing organism identification, biomass estimation, and protein
identifications in a 100GB (containing 12703 organisms) database and
a 20GB (containing 3868 organisms) database. There are 28150 MS/MS
spectra used as the queries. Results from using various number of
cores are displayed. MiCId’s workflow execution time performance
was carried-out in a computer running the operating system CentOS
Linux release 7.9.2009 and containing 32 Intel(R) Xeon(R) central
processing units (CPUs) with a clock speed of 2.60 GHz.
Average
execution time, in minutes, of MiCId’s workflow
in performing organism identification, biomass estimation, and protein
identifications in a 100GB (containing 12703 organisms) database and
a 20GB (containing 3868 organisms) database. There are 28150 MS/MS
spectra used as the queries. Results from using various number of
cores are displayed. MiCId’s workflow execution time performance
was carried-out in a computer running the operating system CentOS
Linux release 7.9.2009 and containing 32 Intel(R) Xeon(R) central
processing units (CPUs) with a clock speed of 2.60 GHz.MiCId’s workflow execution time performance was measured
using a computer running the operating system CentOS Linux release
7.9.2009 and containing 32 Intel(R) Xeon(R) central processing units
(CPUs) with a clock speed of 2.60 GHz. More information about the
operating system and CPUs used is provided in Table S11.
Conclusion
Fast
and accurate identification of pathogenic bacteria along with
the identification of AR proteins is of paramount importance for patient
treatments and public health. The newly augmented MiCId workflow was
designed to achieve this important goal by identifying AR proteins
when processing MS/MS data acquired in high-resolution mass spectrometers.
The augmented workflow of MiCId also fills the need for having mass
spectrometry-based workflow for identifying bacteria along with AR
proteins. We have shown in section that the strategy employed by the MiCId workflow for
identifying AR protein yields sensible results. The MiCId workflow
identifies 93.5% (131/140) of the AR proteins that are also identified
if the target protein database used is composed of protein sequences
from the correct strain. Results from our AR protein identification
assessment show that MiCId’s workflow has a sensitivity of
85% (with a lower bound at about 72.7%) and a precision of 95% when
the E-value cutoff of 0.01 is used to control the
number of false positives. Being fast, yielding sensible results,
and having high sensitivity and high precision, MiCId is shown to
be a valuable tool for identification of bacteria and their AR proteins.
However, limitations to the current MiCId workflow remain. Even though
the relative biomasses among multiple microbes present in a sample
can be provided, MiCId does not yet provide quantification of individual
proteins; although MiCId’s AR protein identification allows
few mutations, the impossibility of having a complete AR protein database prevents MiCId from pinpointing the mutation
sites and types. Nevertheless, while the latter limitation cannot
be circumvented with proteomics workflow alone, we do plan to address
the former limitation in the near future.The augmented workflow
of MiCId is a self-contained tool capable
of performing microorganism identification, protein identification,
biomass estimation, and AR protein identification in minutes using
limited amount of computer resources available in most desktop and
laptop computers. MiCid’s workflow was tested under (i) CentOS
Linux release 7.9.2009, (ii) Red Hat Enterprise Linux Server release
7.9, (iii) Ubuntu release 18.04.3, and (iv) Windows 10 using Oracle
VirtualBox 6.1.22 running Ubuntu release 18.04.3. Having a user-friendly
graphical user interface, the new MiCId version (v.07.01.2021) for
Linux environment is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.
Authors: Manuel Kleiner; Erin Thorson; Christine E Sharp; Xiaoli Dong; Dan Liu; Carmen Li; Marc Strous Journal: Nat Commun Date: 2017-11-16 Impact factor: 14.919