Literature DB >> 35430813

DEIMoS: An Open-Source Tool for Processing High-Dimensional Mass Spectrometry Data.

Sean M Colby¹, Christine H Chang¹, Jessica L Bade¹, Jamie R Nunez¹, Madison R Blumer¹, Daniel J Orton¹, Kent J Bloodsworth¹, Ernesto S Nakayasu¹, Richard D Smith¹, Yehia M Ibrahim¹, Ryan S Renslow¹, Thomas O Metz¹.

Abstract

We present DEIMoS: Data Extraction for Integrated Multidimensional Spectrometry, a Python application programming interface (API) and command-line tool for high-dimensional mass spectrometry data analysis workflows that offers ease of development and access to efficient algorithmic implementations. Functionality includes feature detection, feature alignment, collision cross section (CCS) calibration, isotope detection, and MS/MS spectral deconvolution, with the output comprising detected features aligned across study samples and characterized by mass, CCS, tandem mass spectra, and isotopic signature. Notably, DEIMoS operates on N-dimensional data, largely agnostic to acquisition instrumentation; algorithm implementations simultaneously utilize all dimensions to (i) offer greater separation between features, thus improving detection sensitivity, (ii) increase alignment/feature matching confidence among data sets, and (iii) mitigate convolution artifacts in tandem mass spectra. We demonstrate DEIMoS with LC-IMS-MS/MS metabolomics data to illustrate the advantages of a multidimensional approach in each data processing step.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35430813 PMCID： PMC9047447 DOI： 10.1021/acs.analchem.1c05017

Source DB: PubMed Journal: Anal Chem ISSN： 0003-2700 Impact factor: 8.008

The ability to process raw instrument data reliably and accurately is critical to any molecular profiling assay. Though useful, commercial software solutions provided by vendors of mass spectrometry (MS) instrumentation lack flexibility required to rapidly adapt to evolving community needs. Demand for open-source, community-driven development has motivated researchers to pursue alternatives across instrument platforms, for example liquid or gas chromatography (LC or GC) and ion mobility spectrometry (IMS) coupled to mass spectrometry (MS), including tandem mass spectrometry (MS/MS). Software implementations also differ in their offered functionality: data input/output, multidimensional feature detection, alignment across samples, isotope detection, and deconvolution of MS/MS spectra. However, few available open-source, platform-agnostic solutions provide such core functionality for data of high-dimensionality, hindering the development and application of new instrumentation and analysis paradigms. These limitations are predominantly tied to the specificity—and thus, relative inflexibility—of existing software and algorithm implementations. For example, LC or GC coupled to MS or MS/MS results in two primary feature dimensions, the retention time/index and MS mass-to-charge ratio (m/z), which is reflected in community software algorithms.[1−13] Existing feature detection algorithms are tailored to the underlying data: features are detected in one or two dimensions, and dimensions are often inflexibly constrained based on respective assumptions. In the short term, instrumentation advances[14−19] force platform-specific software to either ignore additional dimensions—for instance, summing across the least-distinguishing dimensions—or iteratively apply one- or two-dimensional algorithms.[20] Over time, wider instrument adoption engenders development to extend or modify existing algorithms to take full advantage of additional separation dimensions.[20,21] The problem is thus cyclical in nature: for software to adapt to technology, the technology must be mature and widely used, but for technology to mature and achieve widespread adoption, instrument data must be robustly analyzed by user-friendly software. To overcome this paradox, instrument vendors have historically developed and supplied the software required to process the data, for example, Agilent MassHunter, Bruker MetaboScape, and Waters Progenesis QI. However, vendor offerings have their own limitations.[22] Because the underlying software is proprietary, details of underlying algorithm implementations are neither available to the public via open-source codebases nor sufficiently documented in publications. Moreover, vendor software is often tailored to a specific instrument and involves proprietary data formats, limiting one-to-one comparison of data across different instrument or vendor types. In some cases, existing software can also lack customizability; for instance, algorithm selection tends to be fixed to specific peak detection, alignment, and deconvolution implementations, unless additional options are explicitly implemented by the vendor. Thus, users are subject to the functionality provided by the vendor software, or must assemble multiple software solutions into one workflow.[3,23] Finally, many vendor solutions are automated only to a small degree, thus impeding reproducibility, and are not amenable to high-performance (HPC) or cloud computing. As a result, the metabolomics community has worked to develop open-source solutions.[8,10,20,24−27] Each cover either some or all of the steps in a typical metabolomics workflow and are positioned to analyze GC-MS or LC-MS data (MS-DIAL[20] additionally handles some aspects of LC-IMS-MS data) and tandem MS. These software tools offer insight into best practices and algorithm implementations and serve as foundational references for our work. However, challenges remain in supporting data of arbitrary dimensionality, generalizing algorithmic implementations to operate in native dimensionality, offering flexibility and control over the analysis workflow, and scaling efficiently to computational resources. To this end, we present the design and implementation of DEIMoS, or Data Extraction for Multidimensional Spectrometry, and include an initial evaluation on LC-IMS-MS/MS metabolomics data from analysis of blood plasma samples. DEIMoS’s functionality is generalized through use of N-dimensional signal processing algorithms from the open-source, efficient, and widely used Python-based scientific computing packages NumPy[28] and SciPy.[29] Additionally, DEIMoS’s design makes minimal assumptions about each underlying dimension. As a result, researchers may analyze GC-MS, LC-MS, IMS-MS, or LC-IMS-MS data, or another hypothetical MS-based platform, with or without MS/MS, using the same software with minimal reconfiguration. The underlying source code has also been written to account for hypothetical additional separation or analytical dimensions that may be introduced as instrumentation continues to advance (e.g., solid phase extraction[30] and associated chemical class-based separation, cryogenic infrared spectroscopy,[31] or multiplexed higher resolution ion mobility separations such as provided by structures for lossless ion manipulations, SLIM[32]). That is, calls to the application programming interface (API) and logic of the analysis may change, but the underlying source code can remain intact. This paradigm facilitates rapid advancement in metabolomics and introduces the potential to unify community efforts in informatics software development. Furthermore, DEIMoS benefits from Python’s rich existing ecosystem for scientific programming and offers even greater flexibility beyond the core API. DEIMoS’s functionality is organized into several modules, each addressing one or more key data processing steps, including file input and output, peak detection, alignment, isotope detection, MS2 spectra extraction and deconvolution, and data subsetting operations. We describe each module relative to LC-IMS-MS/MS data, which represents higher dimensionality—and, by extension, complexity—among most current metabolomics analysis techniques. Data acquired on other platforms—for example, LC-MS(/MS), GC-MS(/MS)—require similar processing but in a lower dimensional space. Future algorithms and additional dimensions of data may be slotted in easily. We architected DEIMoS to adhere to software development best practices,[33] including installation through Anaconda[34] or PyPI,[35] in-line documentation via docstrings and aggregation via Sphinx,[36] unit test implementations with pytest(37) coupled with continuous integration and static code coverage analysis, and version control with Git.[38] DEIMoS is open-source and freely available online at github.com/pnnl/deimos, and community contributions via pull request are welcome. Documentation, including a user guide, API reference, examples, and contribution instructions are available at deimos.readthedocs.io.

Methods

Experimental Methods

To demonstrate DEIMoS, we examined LC-IMS-MS/MS data from a large study of human plasma samples consisting of 40 quality control (QC) samples from the NIST Standard Reference Material 1950[39] and 112 study samples. An internal standard mixture was added prior to extraction, its composition listed in the SI. Each sample was spiked with 50 μL of a solution of the internal standards at 0.166 mg/mL in water. Metabolites and lipids were extracted with concomitant protein precipitation using the Matyash protocol[40] described previously.[41] The metabolite layer was removed and dried in vacuo. Lipid and protein layers were not analyzed. An Agilent 1260 Infinity II high flow liquid chromatography system (San Jose, CA) equipped with a Vial Sampler and Binary Pump was used to inject and chromatographically separate samples prior to introduction to the ion mobility spectrometry-mass spectrometry instrument. A steady flow rate of 0.300 mL/min was delivered through a Millipore-Sigma (Burlington, MA) SeQuant Zic-pHILIC column (15 cm length, 2.1 mm inner diameter, packed with 5 μm particles). A corresponding guard column of the same packing material was also used. Mobile phases consisted of (A) 20 μM ammonium acetate in water and (B) 100% acetonitrile with the following gradient profile (min, %B): 0, 90; 4, 90; 12, 20; 13, 10; 15, 10; 17, 90. An Agilent 1100 Column Heater was used with a static temperature of 45 °C. Ion mobility spectrometry-tandem mass spectrometry analysis was performed using an Agilent 6560 Ion Mobility LC/Q-TOF system. Spectra were acquired separately in both positive and negative ionization modes. Data were collected in the mass range of 50–1700 m/z. Ionization was accomplished using a Dual AJS ESI source, with gas temperature set to 325 °C, drying gas set to 5 L/min, nebulizer set to 30 psi, sheath gas temperature set to 275 °C, sheath gas flow of 11 L/min, VCap set to 2500 V, nozzle voltage set to 2000 V, and the fragmentor set to 400 V. For the ion mobility separations, the trap fill time was 30 000 μs and released for 300 μs. Frame rate was 1 frame/s, 19 IM transitions/frame, and max drift time set to 50 ms. Fixed collision energies were employed at 10, 20, and 40 eV on alternating frames. Data were collected for 22 min immediately following the injection of the sample. To communicate generalizability to other instrument platforms and/or data types, we acquired data from a Bruker timsTOF Pro and a Waters UPLC i-Class coupled to a Synapt G2Si. The former was obtained from the Mass Spectrometry Interactive Virtual Environment[42] (MassIVE, accession MSV000088020), the latter from Metabolights[43] (accession MTBLS812). The Bruker instrument implements trapped ion mobility, resulting in measurements of inverse reduced mobility as opposed to drift time; the Waters instrument implements traveling wave ion mobility.

Overview

DEIMoS was developed as an instrument-independent, high-dimensional metabolomics data analysis tool, and design choices reflect this philosophy. DEIMoS is written in Python, prioritizing user productivity, and ease of development and use. While high-level interpreted languages, such as Python, often suffer from reduced computational efficiency, many popular scientific Python libraries, such as NumPy[28] and SciPy,[29] wrap C or Fortran code and are, thus, highly optimized. In addition, Python is ubiquitous across the sciences and in industry, user-friendly, and largely agnostic to client platform (Windows, macOS, Linux). As a result, Python boasts a large, active, and continually growing community in the sciences, positioning Python-based software for wide adoption both by users and collaborative developers.[44] DEIMoS is one of few Python-based offerings for metabolomics data processing uniquely offering support for data of any dimension, algorithmic implementations that operate in native dimensionality, flexibility and control over the analysis workflow, and efficient scaling to computational resources. An overview of functionality is depicted in Figure . All computation was performed on AMD EPYC 7502 CPUs with 4 GB of memory per core.

Figure 1

Functionality overview. High level overview of available DEIMoS functionality, with operations delineated as “per sample” versus “per study”. That is, the former operations are performed for each instrument acquisition, whereas the latter is performed among all data from comparable samples acquired. Functionality and underlying methods are easily extended or modified.

File Input/Output

To accommodate disparate instrument types and manufacturers (e.g., Bruker, Waters, Thermo, Agilent), DEIMoS operates under the assumption that input data are in an open, standard format. As of this publication, the accepted file format for DEIMoS is mzML,[45] which contains metadata, separation, and spectrometry data that reproduce the contents of vendor formats. Conversion to mzML from several other formats can be performed using the free and open-source ProteoWizard msconvert utility.[24] By default, DEIMoS exports a lightweight, data frame-based representation in Hierarchical Data Format version 5 (HDF5) file format.[46] Additionally, adaptors are included to support exporting to delimited text (e.g., CSV), Mascot Generic Format (MGF), and mzML for downstream use with other tools (e.g., MAME,[47] LIQUID,[48] GNPS[49]).

Feature Detection

Feature detection, also referred to as peak detection, is the process by which local maxima that fulfill certain criteria (such as sufficient signal-to-noise ratio) are located in the signal acquired by a given analytical instrument. This process results in “features” associated with the analysis of molecular analytes from the sample under study or from chemical, instrument, or random noise. Typically, feature detection involves a mass dimension (m/z), as well as one or more separation dimensions, the latter offering distinction among isobaric/isotopic features. DEIMoS implements an N-dimensional maximum filter from scipy.ndimage that convolves the instrument signal with a structuring element, also known as a kernel, and compares the result against the input array to identify local maxima as candidate features or peaks. We discuss what qualifies as a dimension in the SI. Additional filters, including integral and average intensity, kurtosis, skew, etc., can be applied to yield statistics for later downselection. To provide additional confidence in detected features, we required that a given feature be observed across analytical triplicates. Key to this process is the selection of kernel size, which can vary by instrument, data set, and even compound. For example, in LC-IMS-MS/MS data, peak width increases with increasing m/z and drift time, and also varies in retention time. Ideally, the kernel would be the same size as the N-dimensional peak (i.e., wavelets[1,5,12,50]), though computational efficiency considerations for high-dimensional data currently limit the ability to dynamically adjust kernel size. Thus, the selected kernel size should be representative of likely features of interest. In some scenarios, dynamic kernel size may be appropriate, per the kernel selection discussion in the SI. While we recommend processing the data in its native dimensionality, DEIMoS’s algorithms are flexible and can detect features in iterative subspaces, for example 2D followed by 1D, 1D followed by 2D, or successive 1D. We used the same parameters per dimension to evaluate feature detection in all dimensional permutations for LC-IMS-MS data. Features were only kept if they appeared across all three analytical replicates. To compare methods, we (i) compared feature coordinates directly and (ii) used tolerances of ±20 ppm, ± 1.5%, and ±0.3 min for m/z, drift time, and retention time, respectively, based on peak dimensions determined during kernel size selection (Bruker and Waters values differ here, as reported in the SI). We used relative tolerances, such as parts-per-million and percent for m/z and drift time, respectively, because unlike in the retention time dimension, peak widths in m/z and drift time varied with mass (Figure S1). This analysis was performed for all samples from the study, averaged per ionization mode.

Alignment

Alignment is the process by which feature coordinates across samples are adjusted to account for instrument variation (drift, calibration, etc.) such that matching features are aligned to adjust for small differences in coordinates. To perform alignment, we first constructed a model for each dimension of a sample by putatively matching detected features against an in-study reference sample, minimizing the residual, and subsequently applying the fit transform. Next, we matched corresponding features across data sets within a user defined tolerance. We refer to the former as “reference-based alignment” and the latter as “cross-sample alignment”. For reference-based alignment, we defined corresponding features between two samples based on minimum distance in the dimension of interest and selected tolerances to accommodate potentially complex nonlinear relationships. We suggest visualizing putative matches with multiple tolerance selections. Once features were matched, we modeled the relationship between samples using support vector regression (SVR) as implemented in scikit-learn.[51] While SVR was selected here for its broad applicability to both linear and nonlinear alignment, many approaches have been successfully developed in this space,[2,13,22,52−55] and SVR is not necessarily superior. Many existing algorithms and implementations can perform cross-sample alignment.[8,13,55] We initially explored use of a modified version of the “join align” method from MZmine[8] but ultimately arrived at an agglomerative clustering-based approach. Though similar with respect to resulting alignment, the agglomerative clustering-based approach was more amenable to processing many samples simultaneously. Additional details on the agglomerative clustering implementation can be found in the SI. By default, alignment considers all features detected among data sets, though users may design more complicated and restrictive workflows using the DEIMoS API. For example, users may choose to only align features that appear across some number of replicates or exclude features that appear in blank samples. To demonstrate alignment functionality, we analyzed all acquired data files (N = 912, 54, and 21 for Agilent, Bruker, and Waters data, respectively). First, we performed alignment across analytical replicates and only kept features appearing in triplicate. Next, we performed alignment across samples in both positive and negative electrospray ionization (ESI) modes, as available. Samples were aligned by agglomerative clustering method with maximum linkage distance tolerances in each dimension of ±20 ppm, ± 1.5%, and ±0.3 min for m/z, drift time, and retention time, respectively. Tolerances for Bruker and Waters values differ, as reported in the SI.

MS2 Extraction

With MS1 features of interest determined by peak detection, corresponding tandem mass spectra, if available, must be extracted and assigned to the MS1 parent ion feature. For data independent acquisition, we use non-m/z dimensions to assign fragments; for instance, drift time and retention time are used to match fragments in LC-IMS-MS/MS. These additional separations enable better attribution of MS2 ions to parent ions, a form of deconvolution inherent in the acquisition, but convolution artifacts can still occur. Explicit, algorithmic deconvolution[20,25,56−58] has been implemented in DEIMoS such that MS1 and MS2 features overlapping in non-m/z separation dimensions are disambiguated to minimize false assignments. In this form of deconvolution, similar to the approach in Yin et al.,[58] the profiles of non-m/z separation dimensions are used to identify only those ions in the MS2 with distributions that correspond to the precursor ion distribution. This technique simultaneously excludes MS2 ions arising from noise or chemical background, while also attributing MS2 ions only to precursor ions with similar separation distributions. Correspondence is determined by cosine similarity, producing a value between 0 and 100 for each separation dimension for all MS1: MS2 pairings. The user may then filter putative matches by this value, for example considering only those above some tolerance in one or more of the separation dimensions.

Collision Cross Section Calibration

To yield collision cross section (CCS) from IMS arrival time, a calibration must be performed using a standard tune mix containing compounds of known CCS. Drift times, or analogous measurement, such as inverse reduced mobility in TIMS, are reported by the instrument and calibrated against the known CCS values to yield calibration coefficients beta and tfix. For drift tube and trapped IMS, the single-field calibration equation detailed in Stow et al. and as implemented in Lee et al. was used.[59,60] For traveling wave IMS, the relationship between measurement and CCS was first linearized by the natural logarithm, then fit by linear regression. DEIMoS performs this calibration given arrival times (or analogous measurement), known CCS values, m/z, and nominal charge of each calibrant. Correlation coefficient and sum of residuals are reported to characterize goodness of fit. Users may also supply beta and tfix directly.

Extracted Ion Approach

DEIMoS can locate features based on extracted ion chromatograms (XIC), mobilograms (XIM), or multidimensional analogs. Here, a specified m/z of interest is supplied and the feature of maximal intensity in the remaining dimensions is returned. This technique is useful, for example, when detecting an internal standard that has been spiked into a sample or when single or mixtures of pure compounds are analyzed. Adduct m/z were calculated using the mass spectrometry adduct calculator (MSAC).[61] We recommend multidimensional representations for targeted feature detection. Further details are included in the SI.

Isotope Detection

Isotopologues, or molecules that differ only in their isotopic composition, are common in mass spectrometry analyses. In many analysis workflows, isotopologues are used to down select the total feature list to include only the most abundant feature, as well as to glean ion charge state and provide further evidence for identification by way of a detected isotopic signature. Details of the DEIMoS implementation, as well as examples of isotopic signatures for a singly charged feature, a multiply charged feature, and overlapping features, are available in the SI.

Automation

While DEIMoS functionality is implemented as a Python API, a typical workflow has been implemented using the Snakemake[62] workflow management system[62] and made accessible via command line interface (CLI). Users need only modify a configuration file and interact with a CLI to process input mzML files into output feature coordinates and extracted MS2. Moreover, Snakemake can automatically handle scaling to HPC and cloud resources, enabling the high-throughput processing of numerous samples. A graphical user interface (GUI) is currently in development to facilitate accessibility of DEIMoS to those without programming experience and will be reported in a subsequent manuscript.

Results and Discussion

In total, sample acquisition resulted in 912 data files for 112 study samples, 40 quality control samples, and 10 blanks, each in positive and negative ionization modes and collected at 3 collision energies (10, 20, and 40 eV): cumulatively 1.1 terabytes. These data were processed by DEIMoS using feature detection, alignment, CCS calibration, and MS2 extraction by deconvolution. Acquired Bruker data was comprised of 54 data files for 9 study samples in triplicate, spanning 2 ionization modes: cumulatively 83 gigabytes. The Waters data consisted of 21 data files for 6 study samples in triplicate and 3 blanks, each in positive ionization mode: cumulatively 16 gigabytes. These data were processed similarly, though no CCS calibration nor MS2 deconvolution was performed for the Bruker data, in that neither CCS calibration data nor MS2 was available.

Feature Detection

Feature detection for a subset of a single sample has been visualized in Figure . The selected region illustrates the inherent convolution of MS1 features leads to overlap in both drift and retention time, resulting in several putative precursor ions for MS2 assignment. The requirement of explicit deconvolution to appropriately attribute ions in the MS2 spectra becomes apparent, as features are not sufficiently resolved by drift and retention time coordinates. Representative features extracted from the Bruker and Waters data are also included in Figures S2 and S3.

Figure 2

Multidimensional peak detection. Peak detection involves convolving the input signal in N dimensions (here, in LC-IMS-MS, 3D) with a maximum filter. The input and maximum filtered arrays are then compared point-by-point and, where equal, a local maximum is indicated. While the data is collected in 3D, this approach is best visualized in 2D and 1D projections, capturing all lower dimensional representations of the underlying 3D data. Note a well-defined peak in a given 2D view may or may not correspond to a true 3D apex, or can be the product of multiple underlying features. It is, thus, important to interpret the 1D projections carefully. For this subset of the data, the top 10 most intense local maxima are shown, colored by m/z, with similar m/z (i.e., isopologues) sharing hues. For all acquired samples, we compared permutations of the possible feature detection modes (3D, 2D followed by 1D, 1D followed by 2D, and iterative 1D). Notably, LC-IMS-MS/MS data exist in a 3D space; thus, the underlying features are also represented in 3D. Iterative feature detection in lower-dimensional projections simplifies the resulting data structure by summing along nonprojection axes, potentially introducing artifacts. In practice, some projections, such as m/z versus drift time, are affected less significantly than others, such as drift time versus retention time, the latter suffering significant information loss when summing along the m/z axis. We anticipated that processing the data in all 3 dimensions simultaneously would result in the greatest separation among features to better isolate the local maxima. That is, given the same feature detection tolerances across methods, 3D feature detection would theoretically afford the least overlap and, by extension, greatest number of features. Per Figure S4, a contrary result was, thus, surprising. However, the coordinates of the features detected by lower dimensional projections are not always congruent with the 3D approach (Figure S5). This signals that the projections along the various data axes, whether 1D or 2D, aggregate signal to the point of, in some cases, losing the underlying feature defined in 3D – the sum operation along a given axis “merges” previously separated features, skewing the coordinate in that dimension. However, this phenomenon is pronounced to varying degree among methods: the most comparable technique, as implemented by MS-DIAL—m/z versus RT followed by DT—results in poor agreement when considering strict tolerances (only ∼13% intersection in both positive and negative mode), but intersection increases substantially (to ∼90%) when imposing the same tolerances that would be used in cross-sample alignment. That is, tolerances that would result in the combining of those features anyway. In this case, the lower dimensional projections result in slight differences in feature coordinates, but in practical application would be treated as “same”. The difference resulted in slightly less accurate characterization of feature coordinates, for example exact m/z, drift time (and by extension CCS), and retention time, where different, deviated by an average of 5.8 ppm, 0.2%, and 0.05 min for m/z, drift time, and retention time, respectively, when comparing 3D processing to m/z versus RT followed by DT. Additional comparisons are depicted in Figure S6. Critically, the order of peak picking operations had a large impact on the number and composition of the features detected (e.g., comparing 1D–2D to 2D–1D operations, as well as the consecutive 1D operations). A key advantage of feature detection in native dimensionality is that computation time does not scale with the number of features (Figure S4). Peak detection in native dimensionality required 0.54 ± 0.11 core-hours and 0.57 ± 0.11 core-hours per data file for positive and negative ionization modes, respectively (mean ± standard deviation). Cumulatively, 373.74 core-hours for all 912 data files. For the Bruker data, because the data were collected at higher resolution (1 728 842, 937, and 4769 unique values for m/z, inverse reduced mobility, and retention time, respectively; collectively 165 times the Agilent resolution), processing time for peak detection was much longer at 12.20 ± 1.35 core-hours and 35.83 ± 15.23 core-hours per data file for positive and negative ionization modes, respectively (mean ± standard deviation). The Waters data, which was collected at lower resolution (79 863, 200, and 707 unique values for m/z, drift time, and retention time, respectively; collectively 0.25 times the Agilent resolution), required less processing time for peak detection: 0.33 ± 0.03 core-hours for positive ionization mode (negative mode not collected). The longer processing times for the negative ionization mode data were due to signal being observed over a larger range of unique m/z, drift (or analogous measurement), and retention times. The array partitioning leveraged for multicore computation split along the m/z dimension. For each partition, computation scaled with the number of observed measurements spanning remaining dimensions. Computation also scaled with number of partitions, in that empty partitions were not processed. To ameliorate, a nominal intensity threshold may be applied to decrease signal span of the 3D space. Said threshold must be selected at or below the noise floor in the data. We examined results from various SVR kernels and found that the linear kernel achieved satisfactory results when the instrument misalignment could be corrected by linear regression, whereas the radial basis function (RBF) kernel was able to account for nonlinear relationships. For instance, the RBF SVR model could match features when samples were run consecutively on a degrading LC column. We selected example data sets to highlight both S-shaped (Figure ) and linear (Figure S7) relationships in prealignment retention time and illustrate the flexibility of the SVR-based approach. In these data, m/z and drift time were already sufficiently aligned; as such, plots for m/z and drift time were omitted. The potential difference realized by kernel selection motivates visual confirmation of the alignment relationship between samples: an RBF kernel applied to linearly related samples would be considered overfit, whereas a linear kernel applied to a nonlinear case would achieve poor alignment.

Figure 3

Nonlinear alignment by support vector regression. Support vector regression (SVR) was evaluated here on the retention time dimension between 2 illustrative samples described by a nonlinear, “S-shaped” relationship in retention time. To model this relationship, a radial basis function (RBF) kernel was selected. Measurements between samples varied negligibly in drift time and m/z, and thus alignment was only necessary in retention time. An example involving samples with a linear relationship in retention time is included in Figure S7. Subsequent cross-sample alignment by agglomerative clustering resulted in 11 698 and 14 784 features for positive and negative mode appearing across each of 3 analytical replicates, excluding features that appeared in all blank samples. Though tolerances were selected according to expected variance across samples, agglomerative clustering resulted in tighter linkages (Figure S8), indicating intersample variance is lower than typical peak width. This variance reaches near-maximal values at 15 ppm in m/z, 1% in drift time, and 0.2 min in retention. Thus, exploratory analysis of the data enables users to characterize interfeature and intersample variance to set appropriate tolerances relative to the source of maximum variance, though agglomerative clustering is relatively forgiving so long as tolerances are not too small. With respect to computation time, reference-based alignment required on the order of milliseconds per file. For cross-sample alignment, each set of 3 analytical replicates (N = 304) required 39.89 ± 6.59 and 44.90 ± 5.53 core-seconds for positive and negative mode, respectively (mean ± standard deviation); cumulatively 3.58 core-hours. Per-file processing times did not significantly differ for the Bruker and Waters data. Aligning those features that appeared in each analytical replicate required 0.66 and 0.94 core-hours for positive and negative mode, respectively; cumulatively 1.60 core-hours. Because of the relatively smaller number of samples, this step only required seconds for the Bruker and Waters data.

MS2 extraction

For the MS1 features shown in Figure , we performed deconvolution to putatively assign ions in the MS2 spectra to corresponding ions in the MS1 spectra. The utility of deconvolution is highlighted in Figure . A window-based approach yields MS2 spectra that, in the worst case, erroneously include all ions for all MS1 precursors to give identical MS2 spectra or, in the best case, results in only two distinct spectra among precursors. The limitation here is visualized by the plot of drift time versus retention time, where only two overlapping peaks emerge. Naïve assignment would thus yield convolved spectra, the degree of convolution depending on window selection.

Figure 4

MS2 deconvolution. The MS2 spectra belonging to the MS1 features highlighted in Figure were algorithmically deconvolved. The profiles of the MS1 features are indicated by respective colors, plotted along the positive y-axis for drift time (A) and retention time (B). These are accompanied by corresponding MS2 profiles, plotted along the negative y-axis, colored according to the closest matching MS1 profile by cosine similarity. Panels C and D show the pairwise cosine similarity of MS1 and MS2 profiles for drift and retention time, respectively. In panel E, ions in the MS2 spectra are colored according to the closest matching MS1 drift time profile. As in Figure , only the 10 most intense MS1 features were explicitly colored; the ions in the MS2 spectra corresponding to remaining MS1 features, or not sufficiently similar to an MS1 precursor, are indicated in gray. Note that disambiguating among isotopologues is not possible here; thus, the difference in color among isotopic groupings is largely superficial. Notably, deconvolution results may differ substantially if employing the cosine similarity scores of drift versus retention time, as false positives can occur if using retention time alone. For example, the masses between 77 and 83 Da would be assigned to the precursor with m/z 212 using retention time, whereas they are sufficiently separated in drift time (Figure ). Computation time required to deconvolve MS2 spectra was a function of number of distinct non-m/z separation dimension populations, as determined by agglomerative clustering, and the size of each such population. Figure shows one such population. In this analysis, each data file required 1.18 ± 0.28 core-hours and 2.27 ± 0.31 (mean ± standard deviation) for positive and negative mode, respectively; cumulatively 1579.07 core-hours. Per-file processing times did not significantly differ for the Waters data (MS2 not collected for the Bruker data).

Conclusion

Metabolomics and exposomics data processing tools offer immense value for diagnosis of disease, evaluation of environmental exposures, and discovery of novel molecules. However, few open-source solutions are currently positioned to fully leverage the latest instrumentation. Importantly, though demonstrated for LC-IMS-MS/MS data, DEIMoS’s architecture supports extension to other measurement modalities, such as cryogenic infrared spectroscopy, minimizing development barriers as instrumentation evolves. Further, all development has been accomplished using design principles necessary for the long-term success for metabolomics data: format interoperability, workflow flexibility, open-source software implementation, community development, and reproducibility.

52 in total

1. A dynamic wavelet-based algorithm for pre-processing tandem mass spectrometry data.

Authors: Penghao Wang; Pengyi Yang; Jonathan Arthur; Jean Yee Hwa Yang
Journal: Bioinformatics Date: 2010-07-13 Impact factor: 6.937

2. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification.

Authors: Colin A Smith; Elizabeth J Want; Grace O'Maille; Ruben Abagyan; Gary Siuzdak
Journal: Anal Chem Date: 2006-02-01 Impact factor: 6.986

3. VIPER: an advanced software package to support high-throughput LC-MS peptide identification.

Authors: Matthew E Monroe; Nikola Tolić; Navdeep Jaitly; Jason L Shaw; Joshua N Adkins; Richard D Smith
Journal: Bioinformatics Date: 2007-06-01 Impact factor: 6.937

Review 4. The power of ion mobility-mass spectrometry for structural characterization and the study of conformational dynamics.

Authors: Francesco Lanucara; Stephen W Holman; Christopher J Gray; Claire E Eyers
Journal: Nat Chem Date: 2014-04 Impact factor: 24.427

5. Ultra-High Resolution Ion Mobility Separations Utilizing Traveling Waves in a 13 m Serpentine Path Length Structures for Lossless Ion Manipulations Module.

Authors: Liulin Deng; Yehia M Ibrahim; Ahmed M Hamid; Sandilya V B Garimella; Ian K Webb; Xueyun Zheng; Spencer A Prost; Jeremy A Sandoval; Randolph V Norheim; Gordon A Anderson; Aleksey V Tolmachev; Erin S Baker; Richard D Smith
Journal: Anal Chem Date: 2016-08-26 Impact factor: 6.986

6. ADAP-GC 4.0: Application of Clustering-Assisted Multivariate Curve Resolution to Spectral Deconvolution of Gas Chromatography-Mass Spectrometry Metabolomics Data.

Authors: Aleksandr Smirnov; Yunping Qiu; Wei Jia; Douglas I Walker; Dean P Jones; Xiuxia Du
Journal: Anal Chem Date: 2019-07-05 Impact factor: 6.986

7. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization.

Authors: H P Benton; D M Wong; S A Trauger; G Siuzdak
Journal: Anal Chem Date: 2008-07-16 Impact factor: 6.986

8. Ion Mobility Spectrometry: Fundamental Concepts, Instrumentation, Applications, and the Road Ahead.

Authors: James N Dodds; Erin S Baker
Journal: J Am Soc Mass Spectrom Date: 2019-09-06 Impact factor: 3.109

9. A Case Report of Switching from Specific Vendor-Based to R-Based Pipelines for Untargeted LC-MS Metabolomics.

Authors: Álvaro Fernández-Ochoa; Rosa Quirantes-Piné; Isabel Borrás-Linares; María de la Luz Cádiz-Gurrea; Marta E Alarcón Riquelme; Carl Brunius; Antonio Segura-Carretero
Journal: Metabolites Date: 2020-01-08

10. A Practical Guide to Metabolomics Software Development.

Authors: Hui-Yin Chang; Sean M Colby; Xiuxia Du; Javier D Gomez; Maximilian J Helf; Katerina Kechris; Christine R Kirkpatrick; Shuzhao Li; Gary J Patti; Ryan S Renslow; Shankar Subramaniam; Mukesh Verma; Jianguo Xia; Jamey D Young
Journal: Anal Chem Date: 2021-01-19 Impact factor: 6.986

1 in total

1. An Open-Source Pipeline for Processing Direct Infusion Mass Spectrometry Data of the Human Plasma Metabolome.

Authors: Anna Kozlova; Timur Shkrigunov; Semyon Gusev; Maria Guseva; Elena Ponomarenko; Andrey Lisitsa
Journal: Metabolites Date: 2022-08-21

1 in total