
Machine-learning-enhanced time-of-flight mass spectrometry analysis.

Ye Wei¹, Rama Srinivas Varanasi¹, Torsten Schwarz¹, Leonie Gomell¹, Huan Zhao¹, David J Larson², Binhan Sun¹, Geng Liu³, Hao Chen³, Dierk Raabe¹, Baptiste Gault^1,4.

Abstract

Mass spectrometry is a widespread approach used to work out what the constituents of a material are. Atoms and molecules are removed from the material and collected, and subsequently, a critical step is to infer their correct identities based on patterns formed in their mass-to-charge ratios and relative isotopic abundances. However, this identification step still mainly relies on individual users' expertise, making its standardization challenging and hindering efficient data processing. Here, we introduce an approach that leverages modern machine learning techniques to identify peak patterns in time-of-flight mass spectra within microseconds, outperforming human users in speed without loss of accuracy. Our approach is cross-validated on mass spectra generated from different time-of-flight mass spectrometry (ToF-MS) techniques, offering the ToF-MS community an open-source, intelligent tool for mass spectra analysis.


Keywords: atom probe tomography; machine learning; pattern recognition; secondary ion mass spectrometry; time-of-flight mass spectrometry

Year: 2021 PMID: 33659909 PMCID: PMC7892357 DOI: 10.1016/j.patter.2020.100192

Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899

Introduction

Mass spectrometry is a widespread approach for revealing what constitutes a solution or a material. An array of techniques is used in the life sciences, geology, and materials science. Among this arsenal, time-of-flight mass spectrometry (ToF-MS) is one of the mainstream techniques, in which an ion's mass-to-charge (m/z) ratio is determined via a ToF measurement. It can provide a quantitative analysis of the composition of the sampled material with high precision and for a wide range of atomic and molecular masses. The principles of ToF-MS are common to techniques such as matrix-assisted laser desorption/ionization (MALDI), secondary ion mass spectrometry (SIMS), or atom probe tomography (APT). Each of these techniques relies on a different concept to emit the ions from the sample, and this versatility means that their common underlying analysis approach, viz. ToF-MS, has found use in chemical reaction studies, large-molecule characterization, and the quantification of dopants in semiconductors or the atomic-scale distribution of impurities at grain boundaries in metallic alloys, for instance [3-8].

The ToF-MS data are essentially a plot of the counts as a function of the m/z ratio: typically a peak appears for each isotope of each element present, and its amplitude is proportional to the relative amount of each species within the sampled volume. Fast and accurate identification and interpretation of the rich patterns and correlations in the spectral data are of great importance and can lead to discoveries. Yet the interpretation and identification rely on the user's expertise, making them slow and prone to error and hindering reproducibility.

Challenges in the development of automatic ToF-MS data analysis are two-fold. First, in ToF-MS, ions of the same species typically show a distribution in their velocity or in their instant of departure from the specimen. These lead to a distribution in flight times. As a result, depending on the experimental conditions, ToF-MS peak patterns can take various shapes and are not always simple to recognize (Figure 1). Second, molecular patterns are commonly encountered in ToF-MS spectra, i.e., not only signals from atomic ions are detected [11-15]. Combining individual atoms into a molecular ion usually leads to a new pattern comprising the distribution of the combination of isotopes from each individual element. Building a database for all possible molecular formulae is practically impossible.

Figure 1

Examples of peak patterns under various experimental conditions

(A) Perfect peak pattern.

(B) Peak pattern with broadened peak width due to primary spatial distribution of ions.

(C) Peak patterns with long thermal tails.

(D) Peak patterns with short thermal tails.

Machine learning (ML) is well known for its powerful ability to recognize patterns and signals. Recently, the mass spectrometry community has embraced ML techniques for large-scale data analysis. The data-analyzing speed of ion-trap-based mass spectrometry has been dramatically accelerated, whereas ToF-MS data analysis still largely relies on database searching. Some pioneering works demonstrated the potential of applying statistical/ML techniques to ToF-MS spectra analysis. For example, unsupervised ML has been used in exploratory data analysis for ToF-SIMS and ToF-MALDI [21-24]. Lately, a Bayesian approach has been adopted for peak identification in APT. The Bayesian approach implemented by A. Mikhalychev et al. is able to identify and deconvolute many different types of ToF-APT mass spectra simultaneously. With reasonable prior information, this method can lead to robust results. However, prior knowledge is often provided by users. If a bad prior is assumed, the computation can become very expensive.

Here, we introduce an ML-based approach that automates the process of assigning elemental and molecular identities to peaks and series of peaks within ToF-MS spectra. Moreover, uncertainties are attached to these identities, indicating to what extent the peak patterns are affected by the noise level and shape features. We name this approach “ML-ToF.” It is shown that ML-ToF can handle various ToF-MS spectra, from a variety of materials systems and techniques, without prior knowledge of the composition. Indeed, we cross-validate ML-ToF on ToF-APT and ToF-SIMS spectra.

The materials investigated include a high-strength Al alloy developed for aerospace applications, medium-Mn steel found in automotive applications, Cu-In-based materials used in solar cell absorbers, and SmCo-based permanent magnets. Furthermore, we benchmark the results by comparing ML-ToF-assigned labels with those yielded by field experts. ML-ToF drastically reduces the duration of the peak recognition process. In general, it takes ML-ToF microseconds to obtain a labeled spectrum, whereas human users could take minutes or even hours. An overview of our approach is shown in Figure 2.

Figure 2

Flowchart of ML-assisted time-of-flight mass spectrum identification (ML-ToF)

An atomic pattern recognizer takes a mass spectrum as input and identifies all atomic patterns (mainly pure metal elements). A molecular database is then constructed by combining the identified atomic patterns with non-metal elements (e.g., hydrogen, oxygen, nitrogen). Trained on this on-the-fly database, a machine-learning-based molecular pattern recognizer assigns molecular identities to non-atomic patterns. In this way, ML-ToF recognizes both the elemental and the molecular fingerprints in mass spectra.


Results

Peak pattern detection

Mass spectra can be regarded as a one-dimensional array whose values are always positive. We focus on patterns with sufficient signal-to-background level to demonstrate that our approach can work properly with discernible patterns. We import the peak detection algorithm from a Python library (SciPy package, de facto standard package for signal processing in Python) that finds the peak positions and the corresponding intensity values. The peak detection algorithm takes the mass spectra as input and searches for local maxima by a simple comparison of intensity. A subset of these peaks can be further chosen by specifying conditions of peak properties. There are three major peak properties: peak height, interpeak distance, and peak prominence. The prominence is defined as the intensity difference between the peak's height and its adjacent local minima, as indicated by Figure 3A. In Figure 3B, one can find the definition of peak height (the absolute count value in log scale). Throughout ToF-APT examples we used the same parameters for the detection (see Figure 3): peak height = 4 (log count); interpeak distance = 0.25 Da; prominence = 0.5 (log count). By visual inspection, the peak detection algorithm with this set of parameters can capture the vast majority of peaks.
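The detection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the SciPy routine in question is `scipy.signal.find_peaks`, that "log count" means base-10 logarithm, and that the spectrum is sampled on a uniform m/z grid. The helper name `detect_peaks` and the synthetic spectrum are hypothetical. Note that `find_peaks` expects the interpeak distance in samples, so the 0.25 Da value is converted using the bin width.

```python
import numpy as np
from scipy.signal import find_peaks


def detect_peaks(mz, counts, height=4.0, distance_da=0.25, prominence=0.5):
    """Return (m/z, log10 count) of peaks meeting the three criteria in the text."""
    # Work in log scale (base 10 is an assumption); clip avoids log(0).
    log_counts = np.log10(np.clip(counts, 1, None))
    # find_peaks takes the minimum distance in samples, not in Da.
    bin_width = float(np.median(np.diff(mz)))
    distance_bins = max(1, int(round(distance_da / bin_width)))
    idx, _ = find_peaks(
        log_counts,
        height=height,
        distance=distance_bins,
        prominence=prominence,
    )
    return mz[idx], log_counts[idx]


# Synthetic spectrum: flat background plus two Gaussian peaks (illustrative only).
mz = np.arange(0.0, 40.0, 0.01)
counts = np.full(mz.shape, 10.0)
for center, amplitude in [(13.5, 2e5), (27.0, 1e6)]:
    counts += amplitude * np.exp(-0.5 * ((mz - center) / 0.03) ** 2)

peak_mz, peak_log = detect_peaks(mz, counts)
print(peak_mz, peak_log)
```

The returned positions and intensities correspond to the (m/z, intensity) pairs that serve as input to the later recognition steps.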

Figure 3

Examples of peak detection parameters in the SciPy Python package

(A) Schematic diagram showing the definition of peak prominence. Peak prominence is defined as the vertical distance between the peak and the lowest contour line (the dashed lines).

(B) Peak detection example from the ToF-APT dataset, showing the interpeak distance between detected peaks and peak height. Blue and dark green regions represent the range of peaks assigned by human users.

In the manual procedure, users need to select a start and end position for each peak, as shown in Figure 3B. This procedure is often referred to as “ranging,” and it can lead to errors because peak shapes differ, depending in part on the instrument used and the experimental conditions. For instance, the laser pulse energy or the base temperature was shown to have an influence [29-31]. Here, we confine the task of ML-ToF to the identification of elemental or molecular patterns and assume the intensity represents the peak intensity at the detected position instead of the entire peak range. This assumption works well in practice: ML-ToF can recognize the peaks even when they exhibit long tails. Tails originate from either energy deficits or uncertainty on the instant at which the ion left the specimen's surface [32-35] (see Discussion). The detected m/z ratios and the corresponding intensities serve as the input of ML-ToF.

ToF-MS pattern recognition

In general terms, patterns existing in the mass spectra can be categorized into two types: (1) atomic pattern, exhibiting the natural abundance ratio of one particular element, and (2) molecular pattern, formed by two or more elements with mixed abundance ratio distribution. In this section, we introduce a systematic approach that identifies both types simultaneously. Two main aspects are addressed, i.e., the strategy to construct a reasonable database and the search and identification of the most probable patterns.

Atomic pattern recognizer

First, we introduce the atomic pattern recognizer designed to identify all the atomic patterns. The general protocol is demonstrated in Figure 4.

Figure 4

Protocol of atomic pattern recognizer

Patterns to be recognized are peaks with interpeak distance ratio and their respective abundance ratio. After the probable labels are obtained, a database search based on mass to charge is performed to identify the exact composition.


Database

ML can produce optimal results only if it is trained on a good database. In our case, the atomic pattern database consists of three parts: the number of isotope peaks, the natural abundance ratio, and the interpeak distance ratio (IDR). The IDR is defined as the distance between two neighboring peaks divided by the smallest neighboring distance within a group of peaks. For example, Fe⁺ has four peaks at 54, 56, 57, and 58 Da, so the distance ratio is (56 − 54)/(58 − 57):(57 − 56)/(58 − 57):(58 − 57)/(58 − 57) = 2:1:1. Even if Fe is in charge state 2+, with four peaks at 27, 28, 28.5, and 29 Da, the IDR is still 2:1:1. We therefore do not have to impose any constraints on the specific charge state of the elements. This is important, as the charge-state ratio can vary significantly (e.g., Fe can occur in the 1+, 2+, or 3+ charge state) depending on the experimental parameters, and even within a single dataset. The database contains the most commonly encountered elements (excluding the inert gases) and some lanthanides. Currently, it contains 37 elements and 3 compounds, such as S₂ and C₂. These compounds are included because some elements have a strong tendency to form molecular ions, as frequently observed experimentally. Further information regarding the database can be found under Supplemental experimental procedure 1.1.
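The IDR definition above can be checked with a few lines of Python. This is a minimal sketch; the function name `interpeak_distance_ratio` is hypothetical, and exact rational arithmetic is used only to avoid floating-point noise in the ratios.

```python
from fractions import Fraction


def interpeak_distance_ratio(peaks_da):
    """Neighboring peak distances divided by the smallest neighboring distance."""
    # Fraction(str(p)) keeps values such as 28.5 exact.
    peaks = sorted(Fraction(str(p)) for p in peaks_da)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    smallest = min(gaps)
    return tuple(gap / smallest for gap in gaps)


# Fe in charge states 1+ and 2+ gives the same IDR of 2:1:1.
print(interpeak_distance_ratio([54, 56, 57, 58]))
print(interpeak_distance_ratio([27, 28, 28.5, 29]))
```

Because the ratio is invariant under division of all peak positions by the charge, a single database entry per element covers every charge state.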

Interpeak distance ratio filter

As can be seen in Figure 4, matching the IDR is the first step toward a full pattern recognition. For a given peak pattern, the IDR filter searches for all possible candidates with a matching IDR. Subsequently, the algorithm examines the abundance ratio of these candidates. In practice, ToF mass spectra often contain calibration errors. Therefore, ML-ToF rounds the fractional part of each m/z ratio to 0, 1/4, 1/3, 2/3, 3/4, or 1 Da so that the IDR can be correctly calculated.
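One way to picture the rounding step is to snap the fractional part of each measured m/z value to the nearest entry of a fixed grid. The sketch below assumes nearest-value snapping and takes the grid values from the text as a default parameter; the names `snap_mz` and `TEXT_GRID` are hypothetical and the authors' actual rounding rule may differ.

```python
from fractions import Fraction

# Grid of allowed fractional parts, as listed in the text (a parameter, not a fixed rule).
TEXT_GRID = (
    Fraction(0),
    Fraction(1, 4),
    Fraction(1, 3),
    Fraction(2, 3),
    Fraction(3, 4),
    Fraction(1),
)


def snap_mz(mz, grid=TEXT_GRID):
    """Snap the fractional part of a positive m/z value to the nearest grid entry."""
    whole = int(mz // 1)
    frac = mz - whole
    nearest = min(grid, key=lambda g: abs(float(g) - frac))
    return whole + nearest


print(snap_mz(55.98))  # slightly miscalibrated peak near 56 Da
print(snap_mz(33.31))  # peak near 33 1/3 Da
```

After snapping, neighboring distances become exact rationals, so the IDR comparison no longer depends on small calibration offsets.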

Learning the abundance ratio

The next step is concerned with pattern recognition of the isotopic abundance ratio. Classification of the abundance ratio is not a trivial task: different patterns sometimes aggregate at similar m/z ratios, and it is often very difficult to deconvolute them. ML is naturally suited to data-driven classification tasks, thanks to its ability to learn and improve from experience automatically, without human intervention. Unlike the conventional yes/no answer, ML algorithms produce a list of possible answers with corresponding likelihoods. Even if an exact match from the given input to the theoretical database cannot be found, the ML-based algorithm can still provide a ranking of likely labels. In other words, ML looks for partially retained patterns and thus assigns them a higher matching probability.

For elements with two isotopes, ML-ToF calculates the measured intensity ratio between the peaks ($r_m$) and compares it with the ratio expected from the natural abundances ($r_t$). If the absolute value of the relative deviation, $|r_m - r_t|/r_t$, exceeds a certain threshold (here chosen empirically as 0.3), the peaks are classified as unidentified. For example, Cu has a natural abundance ratio of 69.17:30.83, so the theoretical ratio is $r_t$ = 69.17/30.83 = 2.24. ML-ToF will not assign Cu to a pattern whose measured ratio falls outside the range [1.56, 2.91]. For monoisotopic elements (e.g., Al, As, Co), there is no abundance ratio, so ML-ToF searches for their different charge states and assigns the element if two or more of its charge states are found (e.g., Al⁺ at 27 Da and Al²⁺ at 13.5 Da).

In the present study, we selected Light Gradient Boosting Machine (LightGBM) as our learning model. LightGBM belongs to the framework of Gradient Boosting Decision Tree (GBDT). GBDT is an ensemble model of weak learners that are trained in sequence. In each training iteration, a decision tree learns from the errors up to the current iteration. Via a gradient descent approach, every subsequent tree minimizes the loss function between the actual output and the weighted sum of predictions from previous iterations. The final model is the weighted average of all weak learners. GBDT has achieved state-of-the-art performance in many ML tasks, such as multiclass classification and ranking tasks. Our label-predicting task is essentially a multilabel classification task. In such a setting, the algorithm tries to minimize the objective function L:

$$L = -\sum_{i=1}^{N} y_i \log(s_i)$$

L represents the cross-entropy. Here, this ML-specific entropy formulation serves as a measure of the difference between two probability distributions and is used as a loss function for classification models; N represents the number of labels, y is the ground truth, and s denotes the predictions of the ML model. This objective function measures how far the machine's prediction is from the truth. The smaller the loss, the closer the prediction of the machine is to the ground truth. Zero loss would imply that the model has achieved 100% accuracy. In general, using the cross-entropy function instead of the sum of mean square errors for a classification problem leads to faster training as well as improved generalization. In contrast to other black-box ML models such as neural networks, the decision tree enjoys a unique advantage: it is an explainable ML model, which provides not only the predictions but also methods to interpret them. A specific example can be found in Figure S1. Other parameters of the current LightGBM model and the corresponding explanations can be found in Supplemental experimental procedure 1.2.

We generate 5,000 data points for each element. During training, the total dataset is further split into a first subset used for training (around 4,000 data points) and a second (around 1,000 data points) used to validate the trained model. More details of the database construction can be found in Supplemental experimental procedure 1.1.

Figures 5A–5D illustrate the training histories of the LightGBM model for three-, four-, five-, and seven-peak patterns. The model for three-peak classification achieves near-zero loss after about 200 iterations and then plateaus at zero. Loss histories of four-peak, five-peak, and seven-peak patterns show similar trends. Notably, the four-peak pattern model converges to zero at a much faster rate, reaching near-zero loss at 100 iterations. Thus this model stops early at 500 iterations. Training and validation losses are almost identical in all four cases, resulting in two completely overlapping curves.
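To make the loss values in these training histories concrete, the following sketch evaluates a cross-entropy objective for a single sample, assuming it takes the standard form L = −Σ y_i log(s_i) over N labels with a one-hot ground truth y and predicted probabilities s. The function name `cross_entropy` and the clipping constant are illustrative choices, not part of the published model.

```python
import numpy as np


def cross_entropy(y_true, s_pred, eps=1e-12):
    """Cross-entropy between a one-hot ground truth and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip predictions away from zero so that log() stays finite.
    s = np.clip(np.asarray(s_pred, dtype=float), eps, 1.0)
    return float(-np.sum(y * np.log(s)))


# Three candidate labels; the second one is the true label.
truth = [0, 1, 0]
print(cross_entropy(truth, [0.05, 0.90, 0.05]))  # confident and correct: small loss
print(cross_entropy(truth, [0.60, 0.30, 0.10]))  # wrong label favored: larger loss
print(cross_entropy(truth, [0.00, 1.00, 0.00]))  # perfect prediction: zero loss
```

A loss that plateaus at zero therefore means the model places essentially all probability on the correct label for every sample.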

Figure 5

Training history

(A–D) Training histories of the LightGBM model for three- (A), four- (B), five- (C), and seven- (D) peak patterns are shown. In the training histories of objective function L, we have training and validation curves (indicated by training and valid_1, correspondingly). In all four cases, training and validating loss histories are almost the same. Hence, the two curves overlap completely.

A confusion matrix is a useful tool for visualizing the performance of a model. It enables a direct comparison between the ML prediction and the ground truth on the test dataset. These confusion matrices (shown in Figures 6A–6D) indicate that the LightGBM models can perfectly predict the element given its abundance ratio. In addition, the training dataset introduces “redundancy” to deal with partial or overlapped patterns. For instance, three patterns are assigned to Fe: (1) atomic mass 54, 56, 57, 58 Da; abundance ratio 5.8:91.8:2.1:0.3; (2) atomic mass 54, 56, 57 Da; abundance ratio 5.8:91.8:2.1; and (3) atomic mass 56, 57, 58 Da; abundance ratio 91.8:2.1:0.3. This is because some peaks are sometimes too weak relative to the noise to be detected, or because a strong Ni presence (major peak at 58 Da) destroys the first pattern of Fe. In these cases, ML-ToF is still able to recognize the presence of Fe. Such a redundancy scheme gives ML-ToF a certain degree of robustness against various noise sources.
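The redundancy scheme can be illustrated by generating the three Fe sub-patterns listed above and drawing noisy, renormalized samples from each. This is only a sketch: the multiplicative Gaussian noise model and its 5% level are assumptions made for illustration (the actual data generation is described in the supplemental procedure), and the names `redundant_patterns` and `sample_pattern` are hypothetical.

```python
import numpy as np

# Fe isotope abundances (percent) as quoted in the text.
FE_ABUNDANCE = {54: 5.8, 56: 91.8, 57: 2.1, 58: 0.3}


def redundant_patterns(abundance):
    """Full pattern plus the two partial patterns missing the last or first peak."""
    full = tuple(sorted(abundance))
    return [full, full[:-1], full[1:]]


def sample_pattern(abundance, masses, rng, rel_noise=0.05):
    """One noisy abundance vector, renormalized so that it sums to 100."""
    ideal = np.array([abundance[m] for m in masses], dtype=float)
    # Assumed noise model: independent relative Gaussian perturbation per peak.
    noisy = ideal * (1.0 + rel_noise * rng.standard_normal(ideal.size))
    noisy = np.clip(noisy, 1e-9, None)
    return 100.0 * noisy / noisy.sum()


rng = np.random.default_rng(0)
for masses in redundant_patterns(FE_ABUNDANCE):
    # Every variant carries the same label, "Fe".
    print(masses, np.round(sample_pattern(FE_ABUNDANCE, masses, rng), 2))
```

All three variants share one label, so a classifier trained on such data can still return Fe when the weakest peak is missing or when the 54 Da peak is absent.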

Figure 6

Confusion matrix

(A–D) Confusion matrices for three- (A), four- (B), five- (C), and seven- (D) peak patterns. The confusion matrix indicates that the models achieve 100% accuracy on the abundance ratio classification task. Small randomness is introduced in the training/testing splitting. Therefore, the size of the test dataset is not always 1,000 but quite close to it.


Matching the mass-to-charge ratio

A “probable label” is defined as a peak pattern with more than 90% certainty (assigned by the LightGBM model). However, the probable label is not yet the final identified label. For example, if a pattern satisfies both the IDR and the abundance ratio of element Fe, it is still possible that this pattern belongs to a different ion. Therefore, as the last step, the probable label is confirmed if its m/z ratio can be matched to an m/z ratio database, i.e., a pattern with the same IDR and abundance ratio of an element. In the case of Fe, for instance, if its m/z ratio were 54, 56, 57, 58 Da, then ML-ToF would predict Fe⁺, but if its m/z ratio were 70, 72, 73, 74 Da (each Fe isotope shifted by the 16 Da of oxygen), ML-ToF would indicate FeO⁺.
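The confirmation step can be sketched as a lookup that shifts and scales the isotope masses of the probable element until they reproduce the observed m/z values. The example below is illustrative only: it uses nominal integer masses, a small hypothetical adduct list (`ADDUCT_MASS`), and a hypothetical function `confirm_label`; it is not the database search used by ML-ToF.

```python
# Hypothetical adduct list with nominal masses in Da ("" means the bare element).
ADDUCT_MASS = {"": 0, "O": 16, "H": 1}


def confirm_label(element, isotope_masses, observed_mz, max_charge=3, tol=0.05):
    """Return (formula, charge) whose expected m/z values match the observation."""
    observed = sorted(observed_mz)
    if len(observed) != len(isotope_masses):
        return None
    for adduct, extra in ADDUCT_MASS.items():
        for charge in range(1, max_charge + 1):
            expected = [(m + extra) / charge for m in sorted(isotope_masses)]
            if all(abs(e - o) <= tol for e, o in zip(expected, observed)):
                return f"{element}{adduct}", charge
    return None


fe = [54, 56, 57, 58]
print(confirm_label("Fe", fe, [54, 56, 57, 58]))    # bare Fe, charge 1
print(confirm_label("Fe", fe, [27, 28, 28.5, 29]))  # bare Fe, charge 2
print(confirm_label("Fe", fe, [70, 72, 73, 74]))    # Fe isotopes shifted by 16 Da
```

All three inputs share the IDR and abundance ratio of Fe; only the absolute m/z positions distinguish the bare ion, its higher charge state, and the oxide.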

Molecular pattern recognizer

When two or more elements with a different natural abundance ratio combine, the resulting molecule forms a new fingerprint. As we mentioned in the introduction, the new fingerprint differs not only in the atomic number but also in the abundance ratio. This type of combination is often found between the non-metal elements (e.g., carbon, oxygen, nitrogen, sulfur) and sometimes metallic elements too. This poses a significant challenge to the database's construction, since it is impossible to search for all combinations by brute force. To identify the molecular fingerprint, we introduce a molecular pattern recognizer, which adopts a different workflow compared with the atomic pattern recognizer, as outlined in Figure 7.

Figure 7

Molecular pattern recognizer

For any undetermined patterns, a molecular pattern recognizer first performs a heuristic search (Figure 7) by matching their m/z ratios to an on-the-fly molecular label database and assigns a molecular label to the pattern if a match is found. This on-the-fly database contains all possible recombinations between the identified atomic patterns and the non-metal elements. The range of this new molecular database depends on the maximum detected m/z ratio. If there are multiple possible candidates, an abundance-ratio-based LightGBM model is trained to find the most probable labels. This part is similar to the atomic pattern recognizer.
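The construction of such an on-the-fly database can be pictured as a bounded enumeration of formulas. The sketch below is a deliberate simplification: it uses only the nominal mass of one isotope per element, so it yields one m/z value per formula rather than a full isotopic pattern, and it will not reproduce database sizes obtained with complete isotope distributions. The function name `build_molecular_database`, the count ranges, and the 75 Da cutoff in the example call are illustrative assumptions.

```python
from itertools import product

# Nominal (integer) masses of the most abundant isotopes; a deliberate simplification.
NOMINAL_MASS = {"H": 1, "C": 12, "N": 14, "O": 16}


def build_molecular_database(metal, metal_mass, max_mz,
                             metal_counts=(1, 2), h_counts=(0, 1, 2),
                             cno_counts=(0, 1, 2, 3, 4), charges=(1, 2)):
    """Map each reachable m/z value to the (formula, charge) pairs that produce it."""
    database = {}
    for x, a, b, c, d, q in product(metal_counts, h_counts, cno_counts,
                                    cno_counts, cno_counts, charges):
        if a + b + c + d == 0:
            continue  # bare metal ions belong to the atomic recognizer
        counts = [(metal, x), ("H", a), ("C", b), ("N", c), ("O", d)]
        mass = x * metal_mass + sum(NOMINAL_MASS[s] * n for s, n in counts[1:])
        mz = mass / q
        if mz > max_mz:
            continue  # restrict the search to the detected m/z range
        formula = "".join(s + (str(n) if n > 1 else "") for s, n in counts if n)
        database.setdefault(mz, []).append((formula, q))
    return database


# Fe-containing ions and metal-free H/C/N/O ions up to 75 Da.
fe_db = build_molecular_database("Fe", 56, max_mz=75)
hcno_db = build_molecular_database("", 0, max_mz=75, metal_counts=(0,))
print(fe_db[40.0])
print([entry for entry in hcno_db[40.0] if entry[1] == 1])
```

When several formulas land on the same m/z value, as at 40 Da here, the lookup alone cannot decide between them, which is where an abundance-ratio classifier or additional composition information is needed.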

Discussion

Atom probe tomography

APT is a microscopy and microanalysis technique that provides three-dimensional compositional mapping of materials at the near-atomic scale. Accurate analysis of atom probe data typically involves assigning an elemental nature to each ion based on its m/z ratio in the ToF-APT mass spectrum. In this section, we evaluate the performance of our approach on ToF-APT spectra from different alloy systems.

Aerospace high-strength Al alloy

Al-Zn-Mg-Cu-(Zr) alloys are widely employed in aerospace and automobile applications due to their low mass density and high strength. These alloys are strengthened by a high volume fraction of nanoscale precipitates. ToF-APT spectra of this alloy system generally have clear peak patterns and involve only a few molecular ions (demonstrated in Figure 8). This first example shows three possible categories for the detected peaks: identified peaks, unidentified peaks, and uncertain peaks. Overall, the patterns identified by ML-ToF are consistent with the expert's indexing, and the ML-ToF-identified peaks account for 99.9% of the total intensity of detected patterns.

Figure 8

ML-ToF identification of a simple alloy system

Ion mass spectrum of a simple alloy system. The color of the circle markers indicates the state of the peaks. Red, green, and blue markers indicate atomic (identified), unidentified, and uncertain peaks, respectively; the majority of the ML-ToF assigned labels are consistent with APT operators.

The peaks are grouped into five clusters to facilitate visualization, and they are separately described in Table 1, which compares expert-assigned elements with those assigned by ML-ToF. For clusters 1, 3, 4, and 5, theoretical and measured normalized intensities (all involved normalized intensities sum to 100) are also presented. More specifically, one can observe that for clusters 1, 3, and 5, ML-ToF and the expert are in complete agreement; ML-ToF assigns 100% certainty to its selected candidates (shown in parentheses after the assigned element). However, in cluster 4 (m/z ratio: 45, 45.5, 46, 47 Da), the ML algorithm is undecided between a random pattern (51%) and a Zr pattern (45%). Two main reasons lead to this result. The first relates to the detection criteria: the intensity of the fifth peak is too low, such that the peak at 48 Da is not detected. The second relates to the abundance ratio: the measured abundance ratio differs significantly from Zr's natural abundance ratio. The normalized intensity of the second peak (11.22% in theory, but measured to be 15.18%) deviates 36% from theory. This deviation likely originates from the detection of Zr-H peaks. Despite the uncertainty, ML-ToF still ranks Zr as the second most likely candidate, with 45% certainty.

Table 1

Peak pattern identity analysis for Al-Zn-Mg-Cu-(Zr) alloys

Cluster number	1	1	2	2	2	3	4	5	5
m/z	12 12.5 13	13.5	27	28	29	32 33 33.5 34 35	45 45.5 46 47	64 66 67 68 70	63 65
Expert	Mg²⁺	Al²⁺	Al⁺	AlH⁺	AlH₂⁺	Zn²⁺	Zr²⁺	Zn⁺	Cu⁺
ML-ToF	Mg²⁺ (100%)	Al²⁺	Al⁺	AlH⁺	AlH₂⁺	Zn²⁺ (100%)	Random (51%); Zr²⁺ (45%)	Zn⁺ (100%)	None
Theory	78.99:10.00:11.01	None	None	None	None	48.27:27.98:4.10:19.20:0.63	51.45:11.22:17.15:17.38:2.80	45.85:25.02:12.26:16.17:0.71	69.15:30.85
Measure	78.00:10.13:11.87	None	None	None	None	48.63:27.73:4.26:18.50:0.87	45.42:15.18:21.37:18.03	48.63:27.73:4.26:18.50:0.87	42.41:57.59

Five rows can be found for each individual cluster: mass-to-charge ratio, expert-assigned element, ML-ToF-assigned element, theoretical normalized intensity (theory), and measured normalized intensity (measurement).

Moreover, in the case of the green-colored peaks within cluster 5, ML-ToF is not able to assign any identity to the peak pattern at m/z ratios of 63 and 65 Da, whereas the expert assigns it to Cu⁺. This is because ML-ToF makes predictions for two-peak patterns based on a simple threshold method. In this case, the measured intensity ratio between the two peaks is 0.73, whereas for stand-alone Cu this ratio would be 2.24. Hence ML-ToF observes a large deviation (67.1%) and rejects the candidate Cu, contrary to the expert assignment. Cu in its 1+ charge state is also prone to be detected as CuH₂⁺, which overlaps with the Zn peak at 67 Da. This overlap explains, in part, the discrepancy between the measured and the theoretical ratios for Zn, although it did not affect ML-ToF's capacity to identify Zn correctly. In general, when ML-ToF associates a peak with two or more atomic/molecular labels, one can apply the element deconvolution technique to differentiate the labels in the same peak in terms of their spatial distribution.
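The rejection of Cu follows directly from the two-isotope threshold rule, and the arithmetic can be reproduced with the normalized intensities from Table 1. This is a minimal sketch; the function name `matches_two_isotope_pattern` is hypothetical, while the 0.3 threshold and the abundance values are those quoted in the text.

```python
def matches_two_isotope_pattern(intensity_a, intensity_b,
                                abundance_a, abundance_b, threshold=0.3):
    """Accept the pattern if the relative deviation of the ratio is within threshold."""
    r_measured = intensity_a / intensity_b
    r_theory = abundance_a / abundance_b
    deviation = abs(r_measured - r_theory) / r_theory
    return deviation <= threshold, deviation


# Measured 63 Da : 65 Da intensities (42.41 : 57.59) against Cu's 69.17 : 30.83.
accepted, deviation = matches_two_isotope_pattern(42.41, 57.59, 69.17, 30.83)
print(accepted, round(deviation, 3))
```

The deviation of roughly 0.67 is more than twice the 0.3 threshold, so the pattern is left unlabeled even though an expert, knowing the alloy contains Cu, would assign it.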

Medium-Mn steel

Medium-manganese steels are promising candidates for the automotive industry owing to their excellent mechanical properties. Atom probe studies help us understand the local chemistry, particularly at crystal defects such as dislocations and grain boundaries [51-54], thereby providing insights into the atomic-scale mechanisms at play in this class of steels. Figure 9 illustrates a mass spectrum for the more complex Fe-Mn-C-Al alloy system. More than 99% of the ions are within detected peaks assigned an identity that is consistent with that given by the field expert.

Figure 9

Identification of Fe-Mn-C-Al alloy system

Ion mass spectrum of Fe-Mn-C-Al alloy. Markers are colored based on the indicated state of the peaks (red for identified and green unidentified, yellow suggests molecular ions); dashed lines are used to separate clusters; peaks identified by the atomic pattern recognizer are indicated. ML-ToF identifies the majority of the peaks, among which atomic patterns constitute 98% intensity of the detected peaks, and about 1% are of possible molecule origins.

ML-ToF successfully identified the existence of the elements Fe, Mn, and Al. Non-metal elemental patterns of O, N, and C are identified too. Therefore, a new database is proposed, which contains four different types of molecular patterns: FeHCNO, AlHCNO, MnHCNO, and HCNO. The number of metals (x) is set to 1, 2, 3, 4; H (a) to 0, 1, 2; and C (b), N (c), and O (d) to 0, 1, 2, 3, 4 and charge state to 1, 2. These ranges include almost all the common types of molecular patterns. In addition, the search for molecular patterns is restricted to values below 70 Da since no peaks occur beyond this value. Combining all the above-mentioned conditions, we construct the molecular pattern database shown in Table 3.

Table 3

Molecular pattern database

Molecular ion	Fe_xH_aC_bN_cO_d	Al_xH_aC_bN_cO_d	Mn_xH_aC_bN_cO_d	H_aC_bN_cO_d
Database size	329	455	222	750

x = 1, 2; a = 0, 1, 2; b = 0, 1, 2, 3, 4; c = 0, 1, 2, 3, 4; d = 0, 1, 2, 3, 4; charge state = 1, 2. The mass-to-charge ratio is restricted to below 75 Da, since no peaks are detected beyond that value. The search for molecular patterns is performed within this database.

Table 2 shows both the expert's and ML-ToF's assignment of peaks. In cluster 2, both Al⁺ and Fe²⁺ were assigned to the peak at 27 Da, a known overlap that makes the quantification by APT of Al in Fe or Fe in Al challenging. Even in the presence of Al, the atomic pattern recognizer is still able to recognize the Fe isotope pattern with 100% certainty. At 40 Da, the algorithm offers multiple candidates (FeC₂²⁺, CN₂⁺, and C₂O⁺, with the same number of atoms), compared with the expert's choice of FeC₂²⁺. In such a case, the algorithm would also choose FeC₂²⁺ since Fe is the most abundant element (80% of the intensity is assigned to Fe).

Table 2

Peak pattern identity analysis for Fe-Mn-C-Al alloy

Cluster number	1	1	1	1	1	1	1	1	1	2	2	2	2	2	2	2	2	3	3
m/z	2	6	9	12	13.5	14	16	17	18	24	26	27.5	27 28 28.5 29	32	36	40	44	54 56 57 58	55
Expert	H₂⁺	C²⁺	Al³⁺	C⁺	Al²⁺	N⁺	O⁺	HO⁺	C₃²⁺	C₂⁺	CN⁺	Mn²⁺	Fe²⁺	O₂⁺	FeO²⁺	FeC₂²⁺	None	Fe⁺ and FeH⁺	Mn⁺
ML-ToF	H₂⁺	C²⁺	Al³⁺	C⁺	Al²⁺	N⁺	O⁺	HO⁺	C₃²⁺	C₂⁺	CN⁺	Mn²⁺	Al⁺ and Fe²⁺	O₂⁺	FeO²⁺	FeC₂²⁺ and C₂O⁺ and CN₂⁺	AlOH⁺	Fe⁺ and FeH⁺	Mn⁺


Sm-Co-based hard magnet

Sm-Co-based materials are known for their outstanding magnetic properties, which are related to their complex microstructure. The coercivity of the alloy Sm₂(Co, Fe, Cu, Zr)₁₇ can be controlled by substituting Fe for Co, which changes the pinning mechanisms and pinning strength. In this example (Figure 10), ML-ToF shows its robustness against peaks broadened by the relatively high laser power used for this analysis.

Figure 10

Identification of Sm-Co-based hard magnet

Ion mass spectrum of an Sm-Co-based hard magnet. The color of the circle markers indicates the state of the peaks.

As shown in Table 4, in cluster 1, ML-ToF identified aluminum from the peaks of Al²⁺ (13.5 Da) and Al⁺ (27 Da). ML-ToF also identifies Zr³⁺, albeit with reduced certainty (85%), likely because of the long thermal tails of the peaks. In cluster 2, ML-ToF identified Zr²⁺ with 48.3% certainty at 45, 45.5, 46, 47, and 48 Da. This relatively low probability (still considerably higher than that of the second-ranked pattern, random [30%]) indicates the presence of other types of ions, which the expert identified as ZrH²⁺. ML-ToF fails to assign any labels to the peaks at 48.3, 49.6, 50, 50.6, and 51.3 Da, largely because their signal-to-background ratio is too low to meet our detection criteria. In cluster 3, the peaks at 56, 57, and 58 Da are not detected because of their low signal-to-noise ratio, but the expert still labeled them as Fe⁺. Finally, in cluster 4, at 72, 73, 74, 74.5, 75, 76, and 77 Da, the element Sm is identified.

Table 4

Peak pattern identity analysis for Sm-Co-based hard magnet

Cluster number	1	1	1	1	2	2	3	3	4
m/z	27 28 28.5 29	29.5	31.5 32.5	30 30.7 31 31.3 32	45 45.5 46 46.5 47 47.5 48	48.3 49.7 50 50.7 51.3	56 57 58	59 60	72 73 74 74.5 75 76 77
Expert	Fe²⁺	Co²⁺	Cu²⁺	Zr³⁺	Zr²⁺ and ZrH²⁺	Mn⁺	Fe⁺	Co⁺ and CoH⁺	Sm²⁺
ML-ToF	Al⁺ and Fe²⁺	Co²⁺	Cu²⁺	Zr³⁺ (85%)	Zr²⁺ (48.3%)	Peak not detected	Peak not detected	Co⁺ and CoH⁺	Sm²⁺

Elemental signatures such as N⁺ (peak at 14 Da), As⁺ (peak at 75 Da), Sc⁺ (peak at 45 Da), and Ca²⁺ are also matched. However, because no other charge states of these one- or two-peak elements were detected, ML-ToF rejects these candidates. This can be considered an inherent limit of the instrument rather than of ML-ToF.

Solar cell absorber

Here, we showcase ML-ToF's application to a much more complex mass spectrum. Cu(In,Ga)S₂ is a compound semiconductor with a direct band gap, which can be tuned between 1.55 eV for pure CuInS₂ and 2.4 eV for pure CuGaS₂. It is therefore suitable as an absorber material in solar cells, especially as the top junction in tandem solar cells, to overcome the Shockley-Queisser limit. However, the microstructure of this material, especially the composition-structure relationships of its grain boundaries, is not well known. For clarity, we present only the mass spectrum of the Cu-In-S system (without Ga). Indexing the complex mass spectrum shown in Figure 11 is more difficult than in the previous two cases. ML-ToF identifies the atomic fingerprints of In, Cu, S, and O. As these elements tend to recombine with one another, the newly formed molecular patterns differ from the atomic ones in both mass-to-charge ratio and abundance ratio. An example is shown in Table 6: Cu and S form a molecular ion (CuS) with peaks at 95, 97, and 99 Da and a new abundance ratio of 65.7:32.2:1.3. Table 7 shows that the new molecular database is also considerably larger than in the case of the medium-Mn steel. Nevertheless, as the peak identity analysis in Table 5 shows, ML-ToF provides a result almost identical to that of the field expert without any prior knowledge.

Figure 11

Identification of Cu-In-S system

Ion mass spectrum of a solar cell absorber system. Because most of the peaks are molecular patterns, circular markers with different colors are used to separate the clusters for better visualization. The atomic pattern recognizer has identified In, Cu, S, and O as the atomic elements.

Table 6

An example of a new molecular pattern formation (Cu and S form CuS)

m/z ratio	63 65 (Cu)	32 33 34 (S)	95 97 99 (CuS)
Abundance ratio	69.1:30.9	95:0.8:4.2	65.7:32.2:1.3

Molecule CuS shows a new pattern.
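The CuS pattern in Table 6 can be reproduced by combining the isotopic distributions of Cu and S: every pairing of a Cu isotope with an S isotope contributes the product of their abundances at the sum of their masses. The snippet below is a minimal sketch of that calculation using the rounded abundances from Table 6; the function name `combine` is hypothetical and nominal masses are assumed.

```python
from collections import defaultdict
from itertools import product

# Isotopic abundances as fractions, taken from the rounded values in Table 6.
CU = {63: 0.691, 65: 0.309}
S = {32: 0.950, 33: 0.008, 34: 0.042}


def combine(pattern_a, pattern_b):
    """Isotopic pattern of a two-atom ion: sum the masses, multiply the abundances."""
    combined = defaultdict(float)
    for (mass_a, p_a), (mass_b, p_b) in product(pattern_a.items(), pattern_b.items()):
        combined[mass_a + mass_b] += p_a * p_b
    return dict(sorted(combined.items()))


cus = combine(CU, S)
for mass, prob in cus.items():
    print(f"{mass} Da: {100 * prob:.1f}%")
```

With these rounded inputs the peaks at 95, 97, and 99 Da come out within about 0.1 percentage points of the 65.7:32.2:1.3 ratio, while the weak 96 and 98 Da contributions from the minor S isotope at 33 Da stay below 1% each. Applying `combine` repeatedly would give the patterns of larger ions such as CuS₂.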

Table 7

New database

Molecular ion	Cu_xS_yO_aH_b	In_xS_yO_aH_b	S_yO_aH_b
Database size	2,059	1,602	450

x = 1, 2, 3, 4; y = 1, 2, 3; a = 0, 1, 2, 3; b = 0, 1, 2; charge state = 1, 2, and mass-to-charge ratio is restricted to below 300 Da, since no peaks are detected beyond that. The search of molecular patterns is performed within this dataset.

Table 5

Peak pattern identity analysis for Cu-In-S alloy system

Cluster number	1	2	2	3	4	4	4	5	5	6	6	7	8	9	10	12	13
m/z	16	32 33 34	31.5 32.5	48	57.5	63 65	64 65 66 67 68	79	80 81 82 83	91 93	95 96 97 98 99	113 115	127 128 129 130 131	142 143 144 145	158 159 160 161 162 163	190 191 192 193 194 195	221 222 223 224 225 226
Expert	O⁺	S⁺	Cu²⁺	Ti⁺	In²⁺	Cu⁺	S₂⁺	CuO⁺	CuOH₂⁺	CuN₂⁺	CuS⁺ and S₃⁺	In⁺	CuS₂⁺	None	CuS₃⁺ and Cu₂S⁺	Cu₂S₂⁺ and CuS₄⁺	Cu₃S⁺
ML-ToF	O⁺	S⁺	Cu²⁺	None	In²⁺	Cu⁺	S₂⁺	CuO⁺	CuOH⁺ and CuOH₂⁺	None	CuS⁺ and S₃⁺	In⁺	CuS₂⁺ and InO⁺	Cu₄S²⁺	CuS₃⁺ and Cu₂S⁺	Cu₂S₂⁺ and CuS₄⁺	Cu₃S⁺ and Cu₂S₃⁺

As can be seen from Table 5, for clusters 1, 2, 4, 7, 8, 10, and 11, ML-ToF's choice of element identity is identical to the expert's. For cluster 3, ML-ToF fails to assign any label to the peak at 48 Da, whereas the expert assigns Ti⁺. This is because the background signal is high relative to the side peaks of Ti; therefore, only one peak is detected, whereas, in theory, Ti should show five peaks. Regarding cluster 5 (81–83 Da), the expert chose CuOH₂⁺, while ML-ToF chose CuOH⁺ and CuOH₂⁺. Two other interesting cases are worth mentioning. The first is CuN₂⁺, which the expert identified at 91–93 Da (cluster 6) but which ML-ToF did not confirm. A closer look reveals that ML-ToF did not identify the pattern associated with nitrogen at 7 or 14 Da, i.e., N²⁺ and N⁺; therefore, the new molecular pattern database contains no N-containing compounds. In the second case, ML-ToF predicts an identity (Cu₄S²⁺ and InS⁺) at 142–145 Da (cluster 9), whereas the user did not assign any identity to these peaks. Overall, ML-ToF has shown high fidelity in handling complicated cases, even identifying some peaks to which humans did not assign any label. More importantly, it takes ML-ToF only half a second to complete the task. Experts would take 15 min on average, and sometimes even longer when they have no prior experience with the material system.

Secondary ion mass spectrometry

ToF-SIMS is another analytical imaging mass spectrometry technique, which provides unique insights into surface chemistry.62, 63, 64 The large-scale, high-dimensional data generated by contemporary ToF-SIMS instruments consist of x-y-z spatial information and a mass spectrum associated with each pixel. In comparison to APT, the strength of SIMS is its sensitivity, which is associated with the larger probe volumes; the associated drawback is lower spatial resolution. A single ToF-SIMS dataset contains hundreds to thousands of mass spectra. In comparison to APT mass spectra, peak patterns of ToF-SIMS generally have a high signal-to-noise ratio. Although many peak patterns have very low intensity, these peaks are still of great importance and need to be identified. Hence, the detection criteria also differ from those for ToF-APT: peak height = 0.0001 (log count); interpeak distance = 0.25 Da; prominence = 0.0001 (log count). In the following examples, we demonstrate the efficiency of ML-ToF on ToF-SIMS mass spectra of different complexities. Here we omit the tabular peak analysis and directly show the ML-ToF-assigned labels, as expert-assigned labels are available for only a few peaks.
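To make the three detection criteria concrete, the following pure-Python sketch applies a minimum height, a minimum prominence, and a minimum interpeak distance to a spectrum. It is a simplified stand-in and not the detector used in ML-ToF: the function names `prominence` and `detect_peaks` are hypothetical, the order in which the filters are applied is an assumption, and the short spectrum is synthetic data made up for illustration. The parameter names mirror the `height`, `distance`, and `prominence` arguments of `scipy.signal.find_peaks`, which could be used instead when SciPy is available.

```python
def prominence(y, i):
    """Height of y[i] above the higher of the two minima that separate it
    from the nearest taller sample (or the spectrum edge) on each side."""
    left_min = y[i]
    j = i - 1
    while j >= 0 and y[j] <= y[i]:
        left_min = min(left_min, y[j])
        j -= 1
    right_min = y[i]
    j = i + 1
    while j < len(y) and y[j] <= y[i]:
        right_min = min(right_min, y[j])
        j += 1
    return y[i] - max(left_min, right_min)


def detect_peaks(mz, log_counts, min_height, min_distance_da, min_prominence):
    """Return indices of local maxima that satisfy all three criteria."""
    candidates = [
        i
        for i in range(1, len(log_counts) - 1)
        if log_counts[i - 1] < log_counts[i] >= log_counts[i + 1]
        and log_counts[i] >= min_height
        and prominence(log_counts, i) >= min_prominence
    ]
    # Enforce the interpeak distance, giving priority to the tallest peaks.
    kept = []
    for i in sorted(candidates, key=lambda k: log_counts[k], reverse=True):
        if all(abs(mz[i] - mz[k]) >= min_distance_da for k in kept):
            kept.append(i)
    return sorted(kept)


# Synthetic spectrum (illustrative values only), sampled every 0.1 Da.
mz = [26.8, 26.9, 27.0, 27.1, 27.2, 27.3, 27.4, 27.5,
      27.6, 27.7, 27.8, 27.9, 28.0, 28.1, 28.2]
log_counts = [0.0, 0.1, 0.9, 0.2, 0.3, 0.0, 0.1, 0.4,
              0.1, 0.0, 0.0, 0.3, 1.0, 0.2, 0.0]

peaks = detect_peaks(mz, log_counts, min_height=0.0001,
                     min_distance_da=0.25, min_prominence=0.0001)
print([mz[i] for i in peaks])
```

In this toy spectrum the small shoulder at 27.2 Da passes the height and prominence thresholds but is discarded because it lies within 0.25 Da of the taller peak at 27.0 Da.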

Corrosion- and wear-resistant Co-based alloy

The chemical composition (wt%) of this alloy, which was characterized by nanoscopic SIMS, is Ni 0.32, Cr 0.20, Al 0.08, and Y 0.4, balanced with Co; it is designed as a corrosion- and wear-resistant alloy for turbine blades. The mass spectrum shown in Figure 12 was constructed with the TOF-SIMS Explorer 1.3.1.0 software from the total ion information of the scanned region. In this spectrum, ML-ToF identifies Al⁺, Cr⁺, Co⁺, Ni⁺, Ca⁺, and Ti⁺. This composition is relatively simple. However, ML-ToF also identifies abundant complex molecular fingerprints, as shown in Figure 12.

Figure 12

Identification of spectral patterns from secondary ion mass spectrometry using ML-ToF

The region of interest of mass-to-charge ratio ranges from 20 to 90 Da. ML-ToF also identifies complex molecular patterns. This can be seen in the zoom-in region (65–90 Da); note that the count [log] value is very small, because the spectrum was already normalized once by the nanoscopic SIMS software.


Unknown alloy from mine dump

Finally, ML-ToF was tested on an unknown alloy sample from a mine dump in Erzgebirge, Germany (Figure 13). No nominal composition is specified. The spectrum, produced by dynamic SIMS, shows complex peak patterns. ML-ToF identifies a variety of elements and compounds: Na⁺, Al⁺, Fe⁺, Co⁺, Cu⁺, Ni⁺, As⁺, Mo⁺, Bi⁺, NaO⁺, MnO⁺, and CuO⁺. ML-ToF is thus able to extract rich information even with no prior knowledge of the material.

Figure 13

ML-ToF successfully assigns labels to the vast majority of peaks from mass spectra of an unknown alloy sample from a mine dump in Erzgebirge, Germany

Conclusions

We have developed a gradient-boosting-decision-tree-based approach that converts raw ToF mass spectra into their elementally or molecularly identified form. The training dataset is generated from natural abundance ratios and does not require any human labeling. The workflow is validated on experimental datasets from APT and SIMS, and its outputs are compared with identifications provided by different operators. The main bottleneck of our approach lies in the detection limits: a higher signal-to-noise ratio of the spectrum leads to more identified patterns. The maximum peak intensity can be very sensitive to various noise sources (e.g., shot noise). To further increase the robustness of ML-ToF, one can use the integral intensity as the input. A suggested criterion for such integration is to start from the maximum peak position and continue to the position at which the intensity is 3σ above the surrounding noise level, where σ is the standard deviation of the surrounding noise, assumed to be Gaussian distributed; 3σ above means there is a greater than 99% chance that the signal at this level is not noise. Another limitation is that the atomic dataset does not include all elements in the periodic table, because sufficient testing and validation must be performed when new elements are added to the training data, and mass spectra containing these elements were typically not available when the method was being developed. The next step is to collect more data and extend ML-ToF to more element types, thus making ML-ToF a universal technique for ToF spectral data analysis. Currently, ML-ToF still relies on a brute-force search of molecular ion combinations. To accelerate this search, one could integrate a heuristic search algorithm (e.g., beam search) into ML-ToF to rule out impossible combinations of ions. The identification of monoisotopic species is another bottleneck of ML-ToF, which could be improved by using mass defects (i.e., actual atomic masses) as an indicator of the existence of monoisotopic species, provided that the ToF mass spectra are accurately calibrated. In addition, implementing real-time ML-ToF for mass spectra pattern recognition during the atom probe experiment has the potential to avoid peak overlap problems, thus further boosting the accuracy of APT. Finally, our method is open source, easy to implement, and capable of making instant, accurate, and consistent predictions. A wide range of ToF-based techniques can benefit from this approach, e.g., hunting for patterns of biomarkers in high-throughput ToF-MALDI data or for contamination on solid surfaces in SIMS data. ML-ToF enables significant acceleration of the identification process and paves the way for more reliable and more reproducible data analysis.
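The integration criterion suggested above can be sketched in a few lines. This is an illustration under stated assumptions, not part of the published code: the function name `integrate_peak` is hypothetical, the noise level is estimated from a user-chosen peak-free region, the window is extended on both sides of the maximum (the text does not specify a direction), no background is subtracted, and the counts are synthetic values made up for the example.

```python
from statistics import mean, stdev


def integrate_peak(counts, peak_index, noise_region):
    """Sum the counts around a peak maximum for as long as they stay
    more than 3 sigma above the mean of a peak-free noise region."""
    noise = [counts[i] for i in noise_region]
    threshold = mean(noise) + 3 * stdev(noise)

    # Walk outward from the maximum until the signal drops to the threshold.
    left = peak_index
    while left > 0 and counts[left - 1] > threshold:
        left -= 1
    right = peak_index
    while right < len(counts) - 1 and counts[right + 1] > threshold:
        right += 1

    area = sum(counts[left:right + 1])
    return area, (left, right)


# Synthetic counts: flat noise, then a peak with a tail toward higher m/z.
counts = [2, 3, 2, 3, 2, 40, 100, 60, 20, 8, 3, 2, 3, 2]
area, bounds = integrate_peak(counts, peak_index=6, noise_region=range(0, 5))
print(area, bounds)
```

The integrated value draws on every bin of the peak rather than on the single maximum bin, which is why it is expected to fluctuate less under shot noise.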

Experimental procedures

Resource availability

Lead contact

The lead contact for this article is Ye Wei, y.wei@mpie.de.

Materials availability

This study did not generate any physical material.

Data and code availability

To ensure transparency and reusability, the entire program is written in Python and is available on GitHub: https://github.com/DeepHeisenberg/Time-of-flight-Mass-spectra-analysis.
