Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics.

Marie Chion1,2,3, Christine Carapito2,4, Frédéric Bertrand1,5.   

Abstract

Imputing missing values is common practice in label-free quantitative proteomics. Imputation aims at replacing a missing value with a user-defined one. However, the imputation itself may not be optimally considered downstream of the imputation process, as imputed datasets are often considered as if they had always been complete. Hence, the uncertainty due to the imputation is not adequately taken into account. We provide a rigorous multiple imputation strategy, leading to a less biased estimation of the parameters' variability thanks to Rubin's rules. The imputation-based peptide's intensities' variance estimator is then moderated using Bayesian hierarchical models. This estimator is finally included in moderated t-test statistics to provide differential analyses results. This workflow can be used both at peptide and protein-level in quantification datasets. Indeed, an aggregation step is included for protein-level results based on peptide-level quantification data. Our methodology, named mi4p, was compared to the state-of-the-art limma workflow implemented in the DAPAR R package, both on simulated and real datasets. We observed a trade-off between sensitivity and specificity, while the overall performance of mi4p outperforms DAPAR in terms of F-Score.

Year:  2022        PMID: 36037245      PMCID: PMC9462777          DOI: 10.1371/journal.pcbi.1010420

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.779


This is a PLOS Computational Biology Methods paper.

Introduction

Dealing with incomplete data is one of the main challenges as far as statistical analysis is concerned. Different strategies can be used to tackle this issue. The simplest way consists of deleting from the dataset the observations for which there are too many missing values, leading to a complete-case dataset. However, it causes information loss, might create bias, and ultimately could result in poorly informative datasets. Some methods combining qualitative and quantitative statistical tools can also be considered [1]. Another way to cope with missing data is to use methods that account for the missing information. For the last decades, researchers advocated the use of a single technique called imputation. Imputing missing values consists of replacing a missing value with a value derived using a user-defined formula (such as the mean, the median or a value provided by an expert, thus considering the user’s knowledge). Hence it makes it possible to perform the analysis as if the data were complete. More particularly, the vector of parameters of interest can be then estimated. Single imputation means completing the dataset once and considering the imputed dataset as if it was never incomplete, see Fig 1. However, single imputation has the major disadvantage of discarding the variability from the missing data and the imputation process. It may also lead to a biased estimator of the vector of parameters of interest.
Fig 1

Single imputation strategy.

(1) Initial dataset with missing values. It is supposed to be made of N observations that are split into K groups. (2) Single imputation provides an imputed dataset. (3) The vector of parameters of interest is estimated based on the single imputed dataset.

Multiple imputation [2] closes this loophole by generating several imputed datasets. These datasets are then used to build a combined estimator of the vector of parameters of interest, usually the mean of the estimates over all the imputed datasets, see Fig 2. This combined estimator is known as the first Rubin’s rule. The second Rubin’s rule states a formula to estimate the variance of the combined estimator, decomposing it as the sum of the within-imputation variance component and the between-imputation variance component. A rule of thumb takes the number of imputed datasets as the percentage of missing values in the original dataset [3]. Recent work focused on better estimating the Fraction of Missing Information [4] or improving that rule [5]. Note that Rubin’s rules cannot be used to get a combined imputed dataset; instead, they provide an estimator of the vector of parameters of interest and an estimator of its covariance matrix, both based on the multiple imputation process, see Fig 2.
Fig 2

Multiple imputation strategy.

(1) Initial dataset with missing values. It is supposed to have N observations that are split into K groups. (2) Multiple imputation provides D estimators for the vector of parameters of interest. (3a) The D estimators are combined using the first Rubin’s rule to get the combined estimator. (3b) The estimator of the variance-covariance matrix of the combined estimator is provided by the second Rubin’s rule.

Dealing with missing values is also one of the main struggles in label-free quantitative proteomics. Intensities of thousands of peptides are obtained by liquid chromatography-tandem mass spectrometry, using extracted ion chromatograms. Missing peptide intensities arise for various reasons (biological, analytical, bioinformatical) and obey different missing values mechanisms. For example, the considered peptide may be absent from the given biological sample, in which case the intensity is missing not at random (MNAR), or it may not have been accurately identified (unsearched biochemical modification, peptide co-elution, …), in which case the intensity is missing at random (MAR). In state-of-the-art software for statistical analysis in label-free quantitative proteomics, single imputation is the most commonly used method to deal with missing values. The MSstats R package (available on Bioconductor) [6] distinguishes missing completely at random values from missing values due to low intensities. The user can then choose to impute the censored values using either a threshold value or an Accelerated Failure Time model. The Perseus software [7] offers three methods for single imputation: imputing by “NaN” (hence ignoring missing values in downstream analysis), imputing by a user-defined constant, or imputing according to a Gaussian distribution in order to simulate intensities that are lower than the limit of detection. As far as machine learning is concerned, a method for imputing missing values in label-free mass spectrometry-based proteomics datasets was suggested [8].
Note that the authors of the MSqRob R package recently proposed to bypass imputation with a hurdle model that combines count-based differential detection with intensity-based differential abundance [9]. The ProStaR software, based on the DAPAR R package, splits missing values into two categories, Missing in an Entire Condition (MEC) and Partially Observed Values (POV), and allows them to be imputed using different methods [10, 11]. The software allows single imputation using either a small quantile from the distribution of the considered biological sample, the k-Nearest Neighbours (kNN) algorithm, the Structured Least Squares Adaptative algorithm, or a fixed value. The PANDA-view software [12] also enables the use of the kNN algorithm or a fixed value. Moreover, both software programs make it possible to impute the dataset several times before combining the imputed datasets to get a final dataset without any missing values. PANDA-view relies on the mice R package [13], whereas ProStaR accounts for the nature of missing values and imputes them with the imp4p R package [14, 15]. However, both software programs consider the final dataset as if it had always been complete. The uncertainty due to multiple imputation is not properly taken into account downstream of the imputation step. In the following, we will conduct the multiple imputation process to its end and use the imputed datasets to provide a combined estimator of the vector of parameters of interest as well as a combined estimator of its variance-covariance matrix. We will then project this matrix to get a unidimensional variance estimator before moderating it using the empirical Bayes procedure [16, 17]. It is well known that such a moderating step greatly improves the subsequent statistical analyses, such as significance testing or confidence interval estimation, both at the peptide level [18, 19] and at the protein level [19, 20].

Methods

Multiple imputation algorithms

Several methods for imputing missing values in mass spectrometry-based proteomics datasets were developed in the last decade. However, the recent benchmarks of imputation algorithms do not reach a consensus (as shown in S1 Table). This is mainly due to the complex nature of the underlying missing values mechanism. This work focuses on some of the most commonly used methods, which are described in Table 1. The k-nearest neighbours (kNN) method [21-23] imputes missing values by averaging the k-nearest observations of the given missing value in terms of Euclidean distance. The Maximum Likelihood Estimation (MLE) method imputes missing values using the EM algorithm [15, 24]. The Bayesian linear regression (norm) method imputes missing values using the normal model, following the method described and implemented in the mice R package [25, 26]. Some methods implemented in the imp4p R package [15] were also considered, namely principal component analysis (PCA) [27] and the random forests (RF) method [28]. We repeated the imputation process D times to obtain D imputed datasets for each method considered. We set the number of draws D equal to the ceiling of the percentage of missing values in the dataset [3]. Note that if the proportion of missing values is less than 1%, the number of draws is set to 2.
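The rule for choosing the number of draws can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the mi4p implementation; the function names `missing_percentage` and `number_of_draws` are hypothetical, and missing values are assumed to be encoded as `None`.

```python
import math


def missing_percentage(dataset):
    """Percentage (0-100) of missing entries in a list of rows.

    Assumption: missing values are encoded as None.
    """
    total = sum(len(row) for row in dataset)
    missing = sum(1 for row in dataset for value in row if value is None)
    return 100 * missing / total


def number_of_draws(percentage):
    """Number of imputed datasets D for a given percentage of missing values.

    Follows the rule stated in the text: D is the ceiling of the
    percentage of missing values, and D = 2 when that percentage
    is below 1%.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percentage < 1:
        return 2
    return math.ceil(percentage)


# Toy dataset: 3 missing entries out of 20, i.e. 15% missing values.
toy = [
    [1.0, None, 3.0, 4.0, 5.0],
    [1.1, 2.1, None, 4.1, 5.1],
    [1.2, 2.2, 3.2, 4.2, None],
    [1.3, 2.3, 3.3, 4.3, 5.3],
]
D = number_of_draws(missing_percentage(toy))
```

With 10% missing values this rule gives D = 10, which matches the setting used later for the comparison of imputation methods.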
Table 1

Overview of the imputation methods considered in this work.

Method | Implementation | References
k-nearest neighbours | impute.knn (impute R package) | [21-23]
Maximum likelihood estimation | impute.mle (imp4p R package) | [15, 24, 26]
Bayesian linear regression | mice (mice R package) | [24, 25]
Principal component analysis | impute.pca (imp4p R package) | [15, 27]
Random forests | impute.RF (imp4p R package) | [15, 28]

Estimation of the parameters of interest

The objective of multiple imputation is to estimate from D drawn datasets the vector of parameters of interest $\beta_p = (\beta_{p1}, \ldots, \beta_{pK})$ (e.g. $\beta_p$ being the vector of coefficients of the linear model for peptide p) and its variance-covariance matrix $\Sigma_p$. Notably, accounting for multiple-imputation-based variability is possible thanks to Rubin’s rules, which provide an accurate estimation of these parameters [25]. Hence, the first Rubin’s rule provides the combined estimator of $\beta_p$:

$$\hat{\beta}_p = \frac{1}{D} \sum_{d=1}^{D} \hat{\beta}_{p,d}, \tag{1}$$

where $\hat{\beta}_{p,d}$ is the estimator of $\beta_p$ in the d-th imputed dataset. The second Rubin’s rule gives the combined estimator of the variance-covariance matrix for each estimated vector of parameters of interest for peptide p through the D imputed datasets:

$$\hat{\Sigma}_p = \frac{1}{D} \sum_{d=1}^{D} W_d + \frac{D+1}{D(D-1)} \sum_{d=1}^{D} (\hat{\beta}_{p,d} - \hat{\beta}_p)(\hat{\beta}_{p,d} - \hat{\beta}_p)^T, \tag{2}$$

where $W_d$ denotes the variance-covariance matrix of $\hat{\beta}_{p,d}$, i.e. the variability of the vector of parameters of interest as estimated in the d-th imputed dataset.
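As an illustration, Rubin’s rules in their usual form (mean of the D estimates; within-imputation covariance plus (1 + 1/D) times the between-imputation covariance) can be sketched as follows. This is a minimal pure-Python sketch for one peptide, not the mi4p code; the function name `rubin_combine` and the nested-list data layout are assumptions made for the example.

```python
def rubin_combine(estimates, covariances):
    """Combine D estimates of a K-vector with Rubin's rules.

    estimates   : list of D lists of length K (one estimate per imputed dataset)
    covariances : list of D K x K nested lists (covariance of each estimate)
    Returns (combined estimate, combined variance-covariance matrix).
    """
    D = len(estimates)
    if D < 2:
        raise ValueError("at least two imputed datasets are required")
    K = len(estimates[0])

    # First rule: average the D estimates.
    beta_bar = [sum(e[k] for e in estimates) / D for k in range(K)]

    # Within-imputation component: average of the D covariance matrices.
    within = [[sum(W[i][j] for W in covariances) / D for j in range(K)]
              for i in range(K)]

    # Between-imputation component: empirical covariance of the estimates.
    between = [[sum((e[i] - beta_bar[i]) * (e[j] - beta_bar[j])
                    for e in estimates) / (D - 1)
                for j in range(K)]
               for i in range(K)]

    # Second rule: total = within + (1 + 1/D) * between.
    total = [[within[i][j] + (1 + 1 / D) * between[i][j] for j in range(K)]
             for i in range(K)]
    return beta_bar, total


# Two imputed datasets, two coefficients each (toy numbers).
beta_hat, sigma_hat = rubin_combine(
    estimates=[[1.0, 2.0], [3.0, 4.0]],
    covariances=[[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]],
)
```

When all D estimates coincide, for instance because no value was imputed for the peptide, the between-imputation term vanishes and the combined matrix reduces to the average within-imputation covariance.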

Projection of the covariance matrix

State-of-the-art tests, including Student’s t-test, Welch’s t-test and moderated t-test, rely on the variance estimation. Here, the variability induced by multiple imputation is described by a variance-covariance matrix, given by Eq 2. Therefore, a projection step is required to get a univariate variance parameter. Rubin’s second rule decomposes the variability of the combined dataset as the sum of the within-imputation variability and the between-imputation variability. Thus, analytes whose values have been imputed should have a greater variance estimation than if the multiple imputation-induced variability was not accounted for. This amounts to “penalising” analytes for which intensity values were not observed and subsequently imputed. Hence, the projection method needs to be wisely chosen. In our work, we chose to perform projection using the following formula: where is the k-th diagonal element of the matrix and X is the design matrix. Nevertheless, it is to be noted that this choice for the projection method is not without consequences. Indeed, this method greatly penalises imputed analytes. However, analytes that show high variance estimations might be wrongly considered non differentially expressed, as their distributions in each condition to be compared can overlap.

Hypotheses testing

In our work, we focus our methodology on the moderated t-test [16], which relies on the empirical Bayes procedure commonly used in microarray data analysis and, more recently, for differential analysis in quantitative proteomics [10]. Hence, we consider the following Bayesian hierarchical model:

$$\hat{\sigma}_p^2 \mid \sigma_p^2 \sim \frac{\sigma_p^2}{d_p} \chi^2_{d_p}, \qquad \frac{1}{\sigma_p^2} \sim \frac{1}{d_0 s_0^2} \chi^2_{d_0}, \tag{4}$$

where $\sigma_p^2$ is the peptide-wise variance, $\hat{\sigma}_p^2$ is its estimator, $d_p$ is the residual degrees of freedom for the linear model for peptide p, and $d_0$ and $s_0$ are hyperparameters to be estimated [17]. This leads to the following posterior distribution of $\sigma_p^2$ conditional on $\hat{\sigma}_p^2$:

$$\frac{1}{\sigma_p^2} \mid \hat{\sigma}_p^2 \sim \frac{1}{d_0 s_0^2 + d_p \hat{\sigma}_p^2} \chi^2_{d_0 + d_p}. \tag{5}$$

From there, a so-called moderated estimator $\tilde{s}_p^2$ of the variance $\sigma_p^2$ is derived from the posterior mean:

$$\tilde{s}_p^2 = \frac{d_0 s_0^2 + d_p \hat{\sigma}_p^2}{d_0 + d_p}. \tag{6}$$

This estimator is then used in the test statistic associated with the null hypothesis $H_0: \beta_{pj} = 0$, by replacing the usual sample variance by $\tilde{s}_p^2$ in the classical t-statistic (see Eq 7). Therefore, the results of this testing procedure account both for the specific structure of the data and for the uncertainty caused by the multiple imputation step.

$$T_{pj} = \frac{\hat{\beta}_{pj}}{\tilde{s}_p \sqrt{v_j}}, \tag{7}$$

with $v_j$ the j-th diagonal element of the matrix $(X^T X)^{-1}$ and $\hat{\beta}_{pj}$ the j-th estimated coefficient of the linear model for peptide p. Under the null hypothesis $H_0$, $T_{pj}$ is assumed to follow a Student distribution with $d_p + d_0$ degrees of freedom. As there are as many tests performed as the number of peptides considered, the proportion of falsely rejected hypotheses has to be controlled. Here, the Benjamini-Hochberg False Discovery Rate control procedure was performed using the cp4p R package [29, 30]. Note that the implementation of the aforementioned testing framework strongly relies on the limma R package. Hence, this work can be generalised to any experimental design.
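The testing step can be illustrated with a short sketch of the standard empirical Bayes shrinkage used in limma-style moderated t-tests, followed by a plain Benjamini-Hochberg adjustment. It is a hedged illustration rather than the mi4p, limma or cp4p code: the hyperparameters `d0` and `s0_sq` are taken as given (their estimation is not shown), p-values are assumed to be computed elsewhere, and all function names are hypothetical.

```python
import math


def moderated_variance(s2, d, s0_sq, d0):
    """Shrink a peptide-wise variance s2 (d degrees of freedom)
    towards the prior value s0_sq (d0 prior degrees of freedom)."""
    return (d0 * s0_sq + d * s2) / (d0 + d)


def moderated_t(beta_hat, s2, d, s0_sq, d0, v_jj):
    """Moderated t-statistic for one coefficient.

    v_jj is the relevant diagonal element of the unscaled covariance
    matrix of the coefficients, i.e. of (X^T X)^{-1}.
    The statistic is referred to a Student distribution with d + d0
    degrees of freedom.
    """
    return beta_hat / math.sqrt(moderated_variance(s2, d, s0_sq, d0) * v_jj)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


s2_tilde = moderated_variance(s2=2.0, d=4, s0_sq=1.0, d0=4)
t_stat = moderated_t(beta_hat=3.0, s2=2.0, d=4, s0_sq=1.0, d0=4, v_jj=0.5)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In the workflow described here, the variance passed as `s2` would be the projected, imputation-aware variance estimator rather than the usual residual variance, which is what lets the imputation uncertainty propagate to the test.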

Aggregation

The methodology implemented in the mi4p R package can be applied to peptide-level quantification data as well as protein-level quantification data. However, common practice in proteomics consists in inferring results at the protein level from peptide-level data. In particular, imputation should be performed at the peptide level, before aggregating peptides into proteins [31]. Therefore, we adjusted our pipeline as follows:

1. Out-filtration of non-unique peptides from the peptide-level quantification dataset.
2. Normalisation of the log2-transformed peptide intensities.
3. Multiple imputation of log2-transformed peptide intensities.
4. Aggregation by summing all peptide intensities (non-log2-transformed) from a given protein in each imputed dataset.
5. log2-transformation of protein intensities.
6. Estimation of the variance-covariance matrix.
7. Projection of the estimated variance-covariance matrix.
8. Moderated t-testing on the combined protein-level dataset.
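The aggregation and back-transformation steps of this pipeline can be sketched as follows for a single imputed dataset. This is a minimal illustration under stated assumptions, not the mi4p implementation: the function name `aggregate_to_proteins`, the dictionary-based layout and the peptide-to-protein mapping are all hypothetical, and the input is assumed to contain no missing values (i.e. it has already been imputed).

```python
import math


def aggregate_to_proteins(log2_peptides, peptide_to_protein):
    """Sum peptide intensities per protein on the raw scale, then log2.

    log2_peptides      : dict peptide -> list of log2 intensities (one per sample)
    peptide_to_protein : dict peptide -> protein (unique peptides only)
    Returns a dict protein -> list of log2 protein intensities.
    """
    raw_sums = {}
    for peptide, values in log2_peptides.items():
        protein = peptide_to_protein[peptide]
        raw = [2.0 ** v for v in values]  # back to the non-log2 scale
        if protein in raw_sums:
            raw_sums[protein] = [a + b for a, b in zip(raw_sums[protein], raw)]
        else:
            raw_sums[protein] = raw
    # log2-transform the summed protein intensities.
    return {protein: [math.log2(x) for x in totals]
            for protein, totals in raw_sums.items()}


# Toy example: two samples, three peptides mapping to two proteins.
proteins = aggregate_to_proteins(
    {"pepA": [3.0, 4.0], "pepB": [3.0, 4.0], "pepC": [5.0, 5.0]},
    {"pepA": "P1", "pepB": "P1", "pepC": "P2"},
)
```

Applying this function to each of the D imputed peptide-level datasets yields D protein-level datasets, on which the variance-covariance estimation and projection steps can then be carried out.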

Indicators of performance

We compared our methodology to the limma testing pipeline implemented in the state-of-the-art ProStaR software, through the DAPAR R package, as described in Fig 3. To assess the performances of both methods, we used the following measures: sensitivity (also known as true positive rate or recall), specificity (also known as true negative rate), precision (also known as positive predictive value), F-score and Matthews correlation coefficient. In our work, we define a true positive (respectively negative) as a peptide/protein that is correctly considered as (not) differentially expressed by the testing procedure. Similarly, we define a false positive (respectively negative) as a peptide/protein that is falsely considered as (not) differentially expressed by the testing procedure.
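The indicators listed above can be computed from the confusion counts as in the following sketch. It assumes that the F-score is the usual F1 score (harmonic mean of precision and sensitivity) and returns NaN when an indicator is undefined, for instance precision when no analyte is called differentially expressed; the function name `performance_indicators` is hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Division that returns NaN instead of failing on a zero denominator."""
    return numerator / denominator if denominator else float("nan")


def performance_indicators(truth, called):
    """Compute performance indicators from two boolean sequences.

    truth[i]  : True if analyte i is truly differentially expressed
    called[i] : True if the testing procedure declares it differentially expressed
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)

    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    precision = _ratio(tp, tp + fp)
    f_score = _ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = _ratio(tp * tn - fp * fn,
                 math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
        "mcc": mcc,
    }


# Toy example: 3 truly differential analytes out of 8.
result = performance_indicators(
    truth=[True, True, True, False, False, False, False, False],
    called=[True, True, False, True, False, False, False, False],
)
```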
Fig 3

Workflow conducted for performance evaluation of the mi4p methodology and comparison to the one implemented in the DAPAR R package.

Results and discussion

Simulated datasets under missing at random assumption

Simulation designs

We evaluated our methodology on three types of simulated datasets. First, we considered an experimental design where the distributions of the two groups to be compared scarcely overlap. This design led to a fixed effect one-way analysis of variance (ANOVA) model, which can be written as:

$$y_{pnk} = \mu + \delta_{pk} + \epsilon_{pnk},$$

with μ = 100, $\delta_{pk}$ = 100 if 1 ≤ p ≤ 10 and k = 2, $\delta_{pk}$ = 0 otherwise, and $\epsilon_{pnk} \sim \mathcal{N}(0, 1)$. Here, $y_{pnk}$ represents the log-transformed abundance of peptide p in the n-th sample (belonging to group k). Thus, we generated 100 datasets by considering 200 individuals and 10 variables, divided into 2 groups of 5 variables, using the following steps:

1. For the first 10 rows of the data frame, set as differentially expressed, draw the first 5 observations (first group) from a Gaussian distribution with a mean of 100 and a standard deviation of 1. Then draw the remaining 5 observations (second group) from a Gaussian distribution with a mean of 200 and a standard deviation of 1.
2. For the remaining 190 rows, set as non-differentially expressed, draw the first 5 observations as well as the last 5 observations from a Gaussian distribution with a mean of 100 and a standard deviation of 1.

Secondly, we considered an experimental design where the distributions of the two groups to be compared might highly overlap. Hence, we based it on the random hierarchical ANOVA model [31, 32]. The simulation design followed the model:

$$y_{pnk} = P_p + G_{pk} + \epsilon_{pnk},$$

where $y_{pnk}$ is the log-transformed abundance of peptide p in the n-th sample, $P_p$ is the mean value of peptide p, $G_{pk}$ is the mean difference between the condition groups, and $\epsilon_{pnk}$ is the random error term, which stands for the peptide-wise variance. We generated 100 datasets by considering 1000 individuals and 20 variables, divided into 2 groups of 10 variables, using the following steps:

1. Generate the peptide-wise effect P by drawing 1000 observations from a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5.
2. Generate the group effect G by drawing 200 observations (for the 200 individuals set as differentially expressed) from a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5, and fixing the 800 remaining observations to 0.
3. Build the first group dataset by replicating 10 times the sum of P and the random error term, drawn from a Gaussian distribution of mean 0 and standard deviation 0.5.
4. Build the second group dataset by replicating 10 times the sum of P, G and the random error term, drawn from a Gaussian distribution of mean 0 and standard deviation 0.5.
5. Bind both datasets to get the complete dataset.

Finally, we considered an experimental design similar to the second one, but with random effects P and G. The 100 datasets were generated as follows:

1. For the first group, replicate 10 times (for the 10 variables in this group) a draw from a mixture of 2 Gaussian distributions. The first one has a mean of 1.5 and a standard deviation of 0.5 (corresponds to P). The second one has a mean of 0 and a standard deviation of 0.5 (corresponds to ϵ).
2. For the second group, replicate 10 times (for the 10 variables in this group) a draw from a mixture of the following 3 distributions. The first one is a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5 (corresponds to P). The second one is the mixture of a Gaussian distribution with a mean of 1.5 and a standard deviation of 0.5 for the first 200 rows (set as differentially expressed) and a zero vector for the remaining 800 rows (set as not differentially expressed); this mixture illustrates the G term in the previous model. The third distribution has a mean of 0 and a standard deviation of 0.5 (corresponds to ϵ).

All simulated datasets were then amputed to produce MAR missing values in the following proportions: 1%, 5%, 10%, 15%, 20% and 25%.
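The first two simulation designs can be sketched with Python’s standard library as below. This is an illustrative sketch of the data-generating steps only (the MAR amputation step is not reproduced here); the function names and the row-per-peptide list layout are assumptions, and the original study was carried out in R.

```python
import random


def simulate_first_design(rng, n_rows=200, n_de=10, n_per_group=5):
    """Fixed effect design: groups scarcely overlap for the first n_de rows."""
    rows = []
    for p in range(n_rows):
        shift = 100.0 if p < n_de else 0.0  # group effect for the second group
        group1 = [rng.gauss(100.0, 1.0) for _ in range(n_per_group)]
        group2 = [rng.gauss(100.0 + shift, 1.0) for _ in range(n_per_group)]
        rows.append(group1 + group2)
    return rows


def simulate_second_design(rng, n_rows=1000, n_de=200, n_per_group=10):
    """Hierarchical design: peptide effect P, group effect G, error term."""
    rows = []
    for p in range(n_rows):
        peptide_effect = rng.gauss(1.5, 0.5)                    # P, drawn once per row
        group_effect = rng.gauss(1.5, 0.5) if p < n_de else 0.0  # G, zero if not DE
        group1 = [peptide_effect + rng.gauss(0.0, 0.5)
                  for _ in range(n_per_group)]
        group2 = [peptide_effect + group_effect + rng.gauss(0.0, 0.5)
                  for _ in range(n_per_group)]
        rows.append(group1 + group2)
    return rows


rng = random.Random(2022)  # fixed seed for reproducibility of the sketch
first = simulate_first_design(rng)
second = simulate_second_design(rng)
```

In the second design the peptide and group effects are drawn once per row and shared by all replicates, which is what distinguishes it from the third design, where they are redrawn for every replicate.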

Comparison of imputation methodologies

To compare the imputation methods considered in Table 1, we used the synthetic data from the aforementioned second set of MAR simulations. Let us highlight that reviews on imputation methods evaluation often base their study on real datasets by subsetting them to complete data and amputating them afterwards (S1 Table). However, such approaches remain limited, as the parameters of the data cannot be controlled. Recall that we simulated 100 datasets, which were amputated afterwards. Hence both imputed and real values can be accessed. In this section, we aim at evaluating the potential bias that can arise from the imputation process. We based our comparison on the amputated datasets with a proportion of missing values of 10%, so we imputed each dataset D = 10 times. Consider then the set of all missing values coming from the Q = 100 datasets. Let $n_q$ denote the number of missing values in the q-th dataset, with 1 ≤ q ≤ Q. The set of all missing data then consists of $N = \sum_{q=1}^{Q} n_q$ elements. In our work, we take the number of draws for multiple imputation as the percentage of missing values. Therefore, multiple imputation produces ten vectors of size N corresponding to the ten draws of the considered vector.

Imputation error for each draw

To evaluate the performance of the imputation methodologies considered, we first consider the error on each draw. Let $y_i$ denote the i-th value in the previously defined set and $\hat{y}_{i,d}$ the d-th draw for $y_i$. Hence, we define the error for each imputed value as:

$$e_{i,d} = \hat{y}_{i,d} - y_i.$$

The D × N errors are calculated for all imputation methods considered, namely kNN, MLE, norm, PCA and RF (detailed in Table 1). To compare the performances of these methods, Fig 4 summarises the distributions of $e_{i,d}$ for the five imputation methods considered. First, it is comforting to observe that the errors are all centred on zero. Moreover, let us also point out that the MLE and norm methods provide a slightly increased variability compared to other methods. The kNN, PCA and RF methods show equivalent performance as far as single imputation is concerned.
Fig 4

Distribution of empirical errors for the five imputation methods considered on the second set of MAR simulations.

Imputation error for the mean of draws

Following the first Rubin’s rule (Eq 1), the D drawn datasets are combined using the mean. In order to provide additional insights about the empirical errors of the different multiple-imputation procedures, let us compute the differences between the averaged imputed values used in practice and the actual values. For each imputation method, the errors are averaged over the D draws (corresponding to the D different imputations), which we expect to stabilise the error values. In contrast to the previous approach, the associated formula becomes:

$$\bar{e}_i = \frac{1}{D} \sum_{d=1}^{D} \hat{y}_{i,d} - y_i,$$

where $\hat{y}_{i,d}$ is the d-th draw for the actual value $y_i$. Fig 5 suggests equivalent performance for all five methods as far as the mean of all imputed datasets is concerned. In terms of variability, we can still observe a slightly increased interquartile range for the MLE imputation method.
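Both kinds of imputation error can be computed with a few lines of code once the true values of the amputated entries are known. The sketch below assumes the sign convention "imputed value minus actual value" and a layout where `draws[d][i]` holds the d-th imputed value of the i-th missing entry; the function names are hypothetical.

```python
def errors_per_draw(true_values, draws):
    """Error of every imputed value: one list of N errors per draw."""
    return [[imputed - actual for imputed, actual in zip(draw, true_values)]
            for draw in draws]


def errors_of_mean(true_values, draws):
    """Error of the imputed values averaged over the D draws (N errors)."""
    n_draws = len(draws)
    return [sum(draw[i] for draw in draws) / n_draws - actual
            for i, actual in enumerate(true_values)]


# Toy example: N = 2 amputated values, D = 2 draws.
actual = [1.0, 2.0]
imputed_draws = [[1.5, 2.0], [0.5, 3.0]]
per_draw = errors_per_draw(actual, imputed_draws)
averaged = errors_of_mean(actual, imputed_draws)
```

In the toy example, the two draws for the first value err in opposite directions and cancel out once averaged, which illustrates why averaging over draws is expected to stabilise the errors.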
Fig 5

Distribution of errors of the averaged imputed values for the five imputation methods considered on the second set of MAR simulations.

Computation time

To complement the assessment of each approach, we compared the running time of all imputation processes. We considered the total time needed for imputing each simulated dataset D times. The boxplots in Fig 6 highlight the MLE and kNN methods as the fastest. Compared to the MLE imputation method, the PCA method is on average 3.5 times slower, and the norm and RF methods are on average 7.4 times and 8.1 times slower, respectively. At this stage of the comparison, as all imputation methods exhibit comparable performances in terms of imputation bias, a preference can be drawn for the kNN and MLE methods.
Fig 6

Distributions of duration of the imputation process for the five imputation methods considered on the second set of MAR simulations.

Influence on testing results

The evaluation of performance for our mi4p methodology relies on the results produced by the testing procedure. For the MAR simulation designs, testing results were provided for all imputation methods considered. However, we could observe that no positives were produced for some datasets. As a summary, Table 2 describes under which conditions such pathological datasets arise in the second set of MAR simulations. The mi4p workflow dramatically underperforms at detecting positives when using the norm imputation method. The high number of pathological datasets can be explained by this method being a global one (i.e. applied to the full dataset), whereas other methods considered are local in that they are applied experimental condition-wise. Therefore, the norm method might lead to an increased between-imputation variability. Otherwise, no pathological cases occur while using the mi4p method on this particular set of simulated datasets. However, a few pathological datasets can be consistently observed when using the DAPAR workflow, regardless of the chosen imputation method. Overall, the MLE imputation offers a slight advantage over other methods.
Table 2

Number of pathological cases for each missing value proportion in the second set of MAR simulations.

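Imputation method | Testing workflow | 1% | 5% | 10% | 15% | 20% | 25%
kNN | DAPAR | 0 | 0 | 2 | 2 | 2 | 1
kNN | MI4P | 0 | 0 | 0 | 0 | 0 | 0
MLE | DAPAR | 0 | 0 | 2 | 1 | 1 | 0
MLE | MI4P | 0 | 0 | 0 | 0 | 0 | 0
norm | DAPAR | 0 | 0 | 2 | 2 | 1 | 0
norm | MI4P | 0 | 0 | 0 | 7 | 26 | 57
PCA | DAPAR | 0 | 0 | 2 | 2 | 3 | 0
PCA | MI4P | 0 | 0 | 0 | 0 | 0 | 0
RF | DAPAR | 0 | 0 | 3 | 2 | 3 | 0
RF | MI4P | 0 | 0 | 0 | 0 | 0 | 0

Columns 1% to 25% give the missing value proportion.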

A glimpse of real datasets imputation

To conclude this analysis of synthetic data, let us draw some perspectives for the subsequent study of real datasets. At this stage, the kNN and MLE imputation methods might equivalently be considered. However, in quantitative proteomics datasets, rows sometimes present more than 50% missing values. When this threshold is exceeded, current kNN method implementations only use mean imputation for these rows. Mean imputation results in identical imputed values across draws, so no between-imputation variability arises, which prevents taking advantage of our mi4p methodology. In contrast, the MLE imputation method still provides reliable imputations at a reduced computational cost in all situations. Moreover, the MLE method offers a more principled and interpretable approach compared to alternatives, which also motivated our choice to retain this method for further analysis of both MNAR + MCAR simulated datasets and real datasets.

Experiments

The distributions of the differences in sensitivity, specificity, precision, F-score and Matthews correlation coefficient between mi4p and DAPAR for all missing value proportions are summarised in the boxplots of Fig 7. Detailed results can additionally be found in S2 Table for MLE imputation and in S3, S4, S5 and S6 Tables for kNN, norm, PCA and RF imputations respectively. Both methods showed equivalent performance for a small proportion of missing values (1%), where the imputation process induces little variability. However, above 5% missing values, precision, F-Score and Matthews correlation coefficient were increasingly improved with the mi4p workflow compared to the DAPAR one. Moreover, sensitivity remained at 100% while specificity slightly improved, regardless of the missing value proportion. Note that the distributions of intensity values within each experimental condition for differentially expressed analytes are separate for the first set of MAR simulations. Indeed, intensity values for those analytes were drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 1 for the first condition and from a Gaussian distribution with a mean of 200 and a standard deviation of 1 for the second one.
Fig 7

Distributions of differences in sensitivity, specificity, precision, F-score and Matthews correlation coefficient for the first MAR set of simulations.

Missing values were imputed using the maximum likelihood estimation method.

Compared to the first one, the second and the third sets of MAR simulations illustrate a case where the distributions of intensity values within each experimental condition for differentially expressed analytes are closer. Indeed, intensity values for these analytes were approximately drawn from a distribution centred on 1.5 for the first condition and on 3 for the second one. Fig 8 summarises the evolution of the distribution of differences in sensitivity, specificity, precision, F-score and Matthews correlation coefficient between mi4p and DAPAR depending on the proportion of missing values in the second set of MAR simulations. Detailed results can additionally be found in S7 Table for MLE imputation and in S8, S9, S10 and S11 Tables for kNN, norm, PCA and RF imputations respectively. A trade-off between sensitivity and specificity was observed for all proportions of missing values. Indeed, a slight loss in specificity (yet remaining above 99%) provided a greater gain in terms of sensitivity. However, precision performance remained equivalent in both methods. Furthermore, the means of F-scores and Matthews correlation coefficients across the 100 datasets were also increased with the mi4p workflow compared to the DAPAR one, suggesting a global improvement of the testing procedure’s accuracy.
Fig 8

Distributions of differences in sensitivity, specificity, precision, F-score and Matthews correlation coefficient for the second MAR set of simulations.

Missing values were imputed using the maximum likelihood estimation method.

The third set of MAR simulations extended the second one from fixed to random effects. The differences in performance indicators represented in Fig 9 remained equivalent to those observed in the previous set of simulations. However, the detailed results described in S12 Table suggested that both the mi4p and DAPAR methods underperformed on data simulated with random effects compared to the fixed effect simulation design. Detailed results can additionally be found in S12, S14, S15 and S16 Tables for kNN, norm, PCA and RF imputations respectively. Furthermore, the linear model on which both methods rely was not designed to account for random effects and thus struggles to capture such a source of variability. Therefore, an overall underperformance of both mi4p and DAPAR methods could be noticed in the third set of MAR simulations (S12 Table) compared to the second one (S7 Table).
Fig 9

Distributions of differences in sensitivity, specificity, precision, F-score and Matthews correlation coefficient for the third MAR set of simulations.

Missing values were imputed using the maximum likelihood estimation method.


Simulated datasets under missing completely at random and missing not at random assumptions

The previous results were obtained using only missing at random data. This section extends the simulation study to a mixture of missing completely at random (MCAR) and missing not at random (MNAR) data. The data were simulated following an experimental design implemented in the imp4p R package through the sim.data function [14, 15]. The first set of simulations was based on the following experimental design. Two experimental conditions with ten biological samples each were considered, for which the log-intensities of 1000 analytes were simulated. Among them, 200 were set to be differentially expressed. Hence, the 200 differentially expressed analytes have log-intensities drawn from a Gaussian distribution with a mean of 12.5 in the first condition and 25 in the second one. The remaining simulated log-intensities of non-differentially expressed analytes are drawn for both conditions from a Gaussian distribution with a mean of 12.5. The standard deviation in each condition for all analytes is set to 2. The other arguments of the sim.data function were left at their default values. The second set of simulations extends the first one by increasing the number of simulated analytes to 10,000, among which 500 are differentially expressed. Note that in this design, the proportion of differentially expressed analytes is decreased from 20% to 5%. For both simulation studies, six datasets were built with 1%, 5%, 10%, 15%, 20% and 25% missing values. The distributions of the differences in the previously described performance indicators between the mi4p and the DAPAR workflows for the first set of simulations are shown in Fig 10. A trade-off between sensitivity and specificity could be observed: sensitivity was increased by 15% on average while specificity was decreased by 15% on average for the mi4p workflow compared to the DAPAR one. Furthermore, precision was equivalent for both methods. As far as global performance is concerned, the F-score was slightly increased, by 2% on average, and the MCC was quite stable, with a slight decrease observed for the data with the highest proportion of missing values.
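The first simulation design described above can be sketched as follows. This is not the imp4p sim.data function: it is a simplified, hypothetical Python illustration that reproduces only the Gaussian log-intensity design (two conditions, ten samples each, 1000 analytes, 200 of them differential, standard deviation 2) and then removes values completely at random. The MNAR mechanism and the other sim.data defaults are deliberately not reproduced, and the function names are assumptions.

```python
import random


def simulate(n_analytes=1000, n_de=200, n_samples=10,
             mean_ref=12.5, mean_de=25.0, sd=2.0, seed=0):
    """Simulate log-intensities: one row per analyte, 2 * n_samples columns."""
    rng = random.Random(seed)
    data = []
    for i in range(n_analytes):
        # The first n_de analytes are differentially expressed in condition 2.
        mean_second = mean_de if i < n_de else mean_ref
        first = [rng.gauss(mean_ref, sd) for _ in range(n_samples)]
        second = [rng.gauss(mean_second, sd) for _ in range(n_samples)]
        data.append(first + second)
    return data


def add_mcar(data, proportion, seed=0):
    """Return a copy of data with a given proportion of cells set to None (MCAR only)."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    masked = [row[:] for row in data]
    for i, j in rng.sample(cells, round(proportion * len(cells))):
        masked[i][j] = None
    return masked


data = simulate()
with_missing = add_mcar(data, 0.10)
n_missing = sum(value is None for row in with_missing for value in row)
print(n_missing, "missing values out of", len(data) * len(data[0]))
```

Using a fixed seed keeps the sketch reproducible; a faithful replication of the study would rely on sim.data itself.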
Fig 10

Distributions of differences in sensitivity, specificity, precision, F-score and Matthews correlation coefficient for the first MCAR + MNAR set of simulations.

Missing values were imputed using the maximum likelihood estimation method.

Fig 11 depicts the distributions of the differences in the previously described performance indicators between the mi4p and the DAPAR workflows for the second set of simulations. The dispersions of the distributions are globally reduced, but the same trends as in the first set of simulations can be observed. Detailed results for both sets of simulations can be found in S17 and S18 Tables. Overall performance in terms of sensitivity, specificity and precision is quite low for both the mi4p and DAPAR methods, mainly due to a large number of false positives. In particular, precision drops when the number of analytes considered is increased. Moreover, the poor performance in terms of MCC suggests that both methods behave almost as random-guess classifiers. Hence, the relevance of the chosen imputation method should be questioned in this framework.
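One factor that may contribute to the drop in precision between the two designs is the change in the proportion of differentially expressed analytes (20% versus 5%): for a fixed sensitivity and specificity, a lower proportion of true positives mechanically lowers precision. The short sketch below illustrates this arithmetic. The sensitivity and specificity values are hypothetical and are not taken from the study, and this is offered as an illustration rather than as the authors' explanation.

```python
def expected_precision(n_analytes, n_de, sensitivity, specificity):
    """Expected precision for a test with given sensitivity and specificity."""
    true_positives = sensitivity * n_de
    false_positives = (1 - specificity) * (n_analytes - n_de)
    return true_positives / (true_positives + false_positives)


# Hypothetical operating point, identical for both designs.
precision_first = expected_precision(1000, 200, sensitivity=0.9, specificity=0.9)
precision_second = expected_precision(10000, 500, sensitivity=0.9, specificity=0.9)

print(f"first design (20% differential):  {precision_first:.3f}")
print(f"second design (5% differential):  {precision_second:.3f}")
```

With these assumed values, the same test yields a markedly lower precision in the second design, even though its sensitivity and specificity are unchanged.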
Fig 11

Distributions of differences in sensitivity, specificity, precision, F-score and Matthews correlation coefficient for the second MCAR + MNAR set of simulations.

Missing values were imputed using the maximum likelihood estimation method.

Simulation studies showed more false positives in datasets with MNAR and MCAR values than in datasets with MAR values. While the considered datasets were simulated differently, this observation requires further investigation, particularly regarding the imputation method used. Recent works suggested that a combination of MCAR-devoted and MNAR-devoted imputation algorithms performs most accurately and reproducibly on bottom-up proteomics data regardless of the missing value type (except for high MNAR proportions) [14, 33].

Real quantitative proteomics datasets

Complex total cell lysates spiked with UPS1 standard protein mixtures

We consider a first real dataset based on the following experiment. Six peptide mixtures were composed of a constant yeast (Saccharomyces cerevisiae) background into which increasing amounts of UPS1 standard proteins (48 recombinant human proteins, Merck) were spiked at 0.5, 1, 2.5, 5, 10 and 25 fmol, respectively [34]. In a second well-calibrated dataset, yeast was replaced by a more complex total lysate of Arabidopsis thaliana in which UPS1 was spiked in 7 different amounts, namely 0.05, 0.25, 0.5, 1.25, 2.5, 5 and 10 fmol. For each mixture, technical triplicates were constituted. The Saccharomyces cerevisiae dataset was acquired on a nanoLC-MS/MS coupling composed of a nanoAcquity UPLC device (Waters) coupled to a Q-Exactive Plus mass spectrometer (Thermo Scientific, Bremen, Germany) [34]. The Arabidopsis thaliana dataset was acquired on a nanoLC-MS/MS coupling composed of a nanoAcquity UPLC device (Waters) coupled to a Q-Exactive HF-X mass spectrometer (Thermo Scientific, Bremen, Germany) as described hereafter.

Data preprocessing

For the Saccharomyces cerevisiae and Arabidopsis thaliana datasets, MaxQuant software was used to identify peptides and derive extracted ion chromatograms. Peaks were assigned with the Andromeda search engine with full trypsin specificity. The database used for the searches was concatenated in-house with the Saccharomyces cerevisiae entries extracted from the UniProtKB-SwissProt database (16 April 2015, 7806 entries) or the Arabidopsis thaliana entries (09 April 2019, 15 818 entries) and those of the UPS1 proteins (48 entries). The minimum peptide length required was seven amino acids and a maximum of one missed cleavage was allowed. Default mass tolerance parameters were used. The maximum false discovery rate was 1% at peptide and protein levels with the use of a decoy strategy. For the Arabidopsis thaliana + UPS1 experiment, data were extracted both with and without Match Between Runs, and two pre-filtering criteria were applied prior to statistical analysis: keeping only peptides with at least 1 out of 3 quantified values in each condition on the one hand, and with at least 2 out of 3 on the other hand. Thus, 4 datasets derived from the Arabidopsis thaliana + UPS1 experiment were considered. For the Saccharomyces cerevisiae + UPS1 experiment, the same filtering criteria were applied, but only on data extracted with Match Between Runs, leading to 2 datasets.

An additional normalisation step was performed on each dataset considered. Normalising peptides’ or proteins’ intensities aims at reducing batch effects and sample-level variations, and therefore at better comparing intensities across the studied biological samples [35]. In this work, we chose to perform quantile normalisation [36], using the normalize.quantiles function from the preprocessCore R package [37].
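To make the quantile normalisation step concrete, here is a minimal Python sketch of the basic algorithm: each sample (column) is sorted, the mean of the values sharing the same rank across samples defines a reference distribution, and each value is replaced by the reference value of its rank. This is an illustration only, not the preprocessCore implementation: it assumes a complete matrix without ties, and the function name and toy data are hypothetical.

```python
def quantile_normalise(matrix):
    """Quantile-normalise a matrix given as a list of rows (columns = samples).

    Simplifying assumptions: no missing values and no ties within a column.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # order[j][r] is the row index of the r-th smallest value in column j.
    order = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Reference distribution: mean over samples of the r-th smallest values.
    reference = [
        sum(columns[j][order[j][r]] for j in range(n_cols)) / n_cols
        for r in range(n_rows)
    ]
    normalised = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        for rank, i in enumerate(order[j]):
            normalised[i][j] = reference[rank]
    return normalised


# Toy intensity matrix: 4 peptides (rows) measured in 3 samples (columns).
raw = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.5, 6.0],
    [4.0, 2.0, 8.0],
]
normalised = quantile_normalise(raw)
for row in normalised:
    print([round(value, 3) for value in row])
```

After normalisation, every sample has exactly the same set of values, while the ranking of peptides within each sample is preserved.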

Supplemental methods for Arabidopsis thaliana dataset

Peptide separation was performed on an ACQUITY UPLC BEH130 C18 column (250 mm × 75 μm with 1.7 μm diameter particles) and a Symmetry C18 precolumn (20 mm × 180 μm with 5 μm diameter particles; Waters). The solvent system consisted of 0.1% FA in water (solvent A) and 0.1% FA in ACN (solvent B). The samples were loaded into the enrichment column over 3 min at 5 μL/min with 99% of solvent A and 1% of solvent B. The peptides were eluted at 400 nL/min with the following gradient of solvent B: from 3 to 20% over 63 min, 20 to 40% over 19 min, and 40 to 90% over 1 min. The MS capillary voltage was set to 2 kV at 250 °C. The system was operated in a data-dependent acquisition mode with automatic switching between MS (mass range 375–1500 m/z with R = 120 000, automatic gain control fixed at 3 × 10⁶ ions, and a maximum injection time set at 60 ms) and MS/MS (mass range 200–2000 m/z with R = 15 000, automatic gain control fixed at 1 × 10⁵, and the maximal injection time set to 60 ms) modes. The twenty most abundant peptides were selected on each MS spectrum for further isolation and higher energy collision dissociation fragmentation, excluding unassigned and monocharged ions. The dynamic exclusion time was set to 40 s.

The trade-off suggested by the simulation study was confirmed by the results obtained on the real datasets. In the Saccharomyces cerevisiae + UPS1 experiment, a decrease of 70% in the number of false positives was observed, improving the specificity and precision (see S25 Table), at the cost of a reduced number of true positives (Table 3) and thus of a lower sensitivity.
Table 3

Performance of the mi4p methodology expressed in percentage with respect to DAPAR workflow, on Saccharomyces cerevisiae + UPS1 experiment, with Match Between Runs and at least 1 out of 3 quantified values in each condition.

Missing values (6%) were imputed using the maximum likelihood estimation method.

Condition (vs. 25 fmol) | True positives | False positives | Sensitivity | Specificity | F-Score
0.5 fmol                | -2.7%          | -67.2%          | -2.7%       | +1.6%       | +53.6%
1 fmol                  | -1.6%          | -71.1%          | -0.5%       | +0.9%       | +37.8%
2.5 fmol                | -3.2%          | -75.8%          | -3.3%       | +0.7%       | +26.9%
5 fmol                  | -14.3%         | -78.7%          | -14.3%      | +0.5%       | +11.4%
10 fmol                 | -41.9%         | -75.2%          | -41.9%      | +0.5%       | -14.4%

The same trend is observed in the Arabidopsis thaliana + UPS1 experiment: the number of false positives is decreased by 50% (see Table 4 and S19 Table), thus improving specificity and precision at the cost of sensitivity. The loss in sensitivity is larger in the highest points of the range in both experiments. The structure of the calibrated datasets used here can explain these observations. Indeed, the quantitative dataset considered takes into account all samples from all conditions, while the testing procedure focuses on one-vs-one comparisons. Two issues can be raised:

1. The data preprocessing step can lead to more data filtering than necessary. For instance, we chose to use the filtering criterion such that rows with at least one quantified value in each condition were kept. The more conditions are considered, the more stringent the rule is, possibly leading to a poorer dataset (with fewer observations) for the conditions of interest.
2. The imputation process is done on the whole dataset, as is the estimation step. Then, when projecting the variance-covariance matrix, the estimated variance (later used in the test statistic) is the same for all comparisons. Thus, if one is interested in comparing conditions with fewer missing values, the variance estimator will be penalised by the presence of conditions with more missing values in the initial dataset.

Table 4

Performance of the mi4p methodology expressed in percentage with respect to DAPAR workflow, on Arabidopsis thaliana + UPS1 experiment, with at least 1 out of 3 quantified values in each condition.

Missing values (6%) were imputed using the maximum likelihood estimation method.

Condition (vs. 10 fmol) | True positives | False positives | Sensitivity | Specificity | F-Score
0.05 fmol               | -2.3%          | -43%            | -2.3%       | +15%        | +62.7%
0.25 fmol               | -1.5%          | -43%            | -1.4%       | +13.9%      | +65.3%
0.5 fmol                | -1.5%          | -50.6%          | -1.4%       | +10.8%      | +81.4%
1.25 fmol               | -2.3%          | -62.6%          | -2.3%       | +10.9%      | +119.8%
2.5 fmol                | -25.6%         | -69.3%          | -25.5%      | +2.4%       | +45.9%
5 fmol                  | -30.3%         | -65.2%          | -30.4%      | +5.5%       | +56.1%

This phenomenon was illustrated in S20 Table, where solely the two highest points of the range were compared, using only the quantitative data from those two conditions. Hence, more peptides were taken into account in the statistical analysis. This strategy led to overall better scores for precision, F-score and Matthews correlation coefficient compared to the previous framework.

As far as data extracted without the Match Between Runs algorithm are concerned, the results were equivalent for both methods in the Arabidopsis thaliana + UPS1 experiment (as illustrated in S22 and S23 Tables). Furthermore, the same observations could be drawn from datasets filtered with the criterion of a minimum of 2 out of 3 observed values in each group for the Arabidopsis thaliana + UPS1 experiment (S21 and S23 Tables) as well as for the Saccharomyces cerevisiae + UPS1 experiment (S26 Table). These observations reflected a loss of global information in the dataset, as the filtering criteria led to fewer peptides being considered, with fewer missing values per peptide.

The mi4p methodology also provided better results at the protein level (after aggregation) in terms of specificity, precision, F-score and Matthews correlation coefficient, with a minor loss in sensitivity (S27 Table). In particular, a decrease of 63.2% to 80% in the number of false positives was observed, with a smaller loss in the number of true positives and in sensitivity (up to 2.6%), for the Saccharomyces cerevisiae + UPS1 experiment, as illustrated in Table 5. As far as the Arabidopsis thaliana + UPS1 experiment is concerned, the same trend was observed (S24 Table). Indeed, the number of false positives was decreased by 31% to 66.8%, with a maximum loss in the number of true positives of 9.8%, as illustrated in Table 6.
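The effect of the filtering criterion discussed above (a peptide is kept only if it has enough quantified values in every condition considered) can be illustrated with a small, hypothetical example. The peptide names, intensities and function names below are made up; the sketch only shows why restricting the dataset to the two conditions being compared can retain more peptides than filtering across all conditions.

```python
def keep_peptide(values_by_condition, min_quantified=1):
    """Keep a peptide if every condition has at least min_quantified observed values."""
    return all(
        sum(value is not None for value in replicates) >= min_quantified
        for replicates in values_by_condition.values()
    )


def filter_peptides(peptides, conditions, min_quantified=1):
    """Return the names of peptides passing the filter on the given conditions."""
    kept = []
    for name, data in peptides.items():
        subset = {condition: data[condition] for condition in conditions}
        if keep_peptide(subset, min_quantified):
            kept.append(name)
    return kept


# Hypothetical log-intensities, three technical replicates per condition.
peptides = {
    "pep1": {"0.05fmol": [None, None, None], "5fmol": [20.1, 20.3, None], "10fmol": [21.0, 21.2, 21.1]},
    "pep2": {"0.05fmol": [14.9, 15.2, 15.0], "5fmol": [20.0, 19.8, 20.2], "10fmol": [21.1, 20.9, 21.3]},
    "pep3": {"0.05fmol": [None, 15.0, None], "5fmol": [None, None, None], "10fmol": [21.4, None, 21.0]},
}

kept_all = filter_peptides(peptides, ["0.05fmol", "5fmol", "10fmol"])
kept_pair = filter_peptides(peptides, ["5fmol", "10fmol"])
kept_pair_strict = filter_peptides(peptides, ["5fmol", "10fmol"], min_quantified=3)

print("all conditions:", kept_all)
print("5 fmol vs 10 fmol only:", kept_pair)
print("5 fmol vs 10 fmol, 3 out of 3:", kept_pair_strict)
```

Here pep1 is discarded when all conditions are considered, because it was never quantified at 0.05 fmol, but it is retained when only the 5 fmol and 10 fmol conditions are used.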
Table 5

Performance of the mi4p methodology (with the aggregation step) expressed in percentage with respect to DAPAR workflow, on Saccharomyces cerevisiae + UPS1 experiment, with at least 1 out of 3 quantified values in each condition.

Missing values were imputed using the Maximum Likelihood Estimation method.

Condition (vs. 25 fmol) | True positives | False positives | Sensitivity | Specificity | F-Score
0.5 fmol                | 0%             | -73.3%          | 0%          | +2.9%       | +61.1%
1 fmol                  | -2.4%          | -80%            | -2.4%       | +2.3%       | +51.4%
2.5 fmol                | 0%             | -70.4%          | 0%          | +0.8%       | +20.9%
5 fmol                  | -2.4%          | -63.2%          | -2.4%       | +0.5%       | +11.6%
10 fmol                 | -2.6%          | -69.6%          | -2.6%       | +0.7%       | +16.5%

Table 6

Performance of the mi4p methodology (with the aggregation step) expressed in percentage with respect to DAPAR workflow, on Arabidopsis thaliana + UPS1 experiment, with at least 1 out of 3 quantified values in each condition.

Missing values were imputed using the Maximum Likelihood Estimation method.

Condition (vs. 10 fmol) | True positives | False positives | Sensitivity | Specificity | F-Score
0.05 fmol               | 0%             | -27.6%          | 0%          | +18.3%      | +34.2%
0.25 fmol               | 0%             | -25.7%          | 0%          | +18.1%      | +31%
0.5 fmol                | 0%             | -31%            | 0%          | +15.2%      | +39.5%
1.25 fmol               | 0%             | -65.3%          | 0%          | +12.1%      | +119.2%
2.5 fmol                | -2.4%          | -66.8%          | -2.4%       | +5.8%       | +88.3%
5 fmol                  | -9.8%          | -57.3%          | -9.8%       | +12.9%      | +78.9%

Conclusion

In this work, we presented a rigorous multiple imputation method as a key step of a workflow, in which the imputed datasets are combined using Rubin’s rules. We thus obtained for each analyte, on the one hand, a combined estimator of the vector of parameters of interest and, on the other hand, an estimator of its corresponding variance-covariance matrix. Hence, both within- and between-imputation variabilities are accounted for. The variance-covariance matrix was projected to get a univariate variance parameter for each analyte. We then accounted for this variability downstream in the statistical analysis by including it in the well-known moderated t-test statistic. In addition, we provided insights on the comparison of imputation methods. Our methodology was implemented in a publicly available R package named mi4p. Its performance was compared on both simulated and real datasets to the DAPAR state-of-the-art methodology, using confusion matrix-based indicators. The results showed a trade-off between those indicators. In real datasets, the methodology reduces the number of false positives in exchange for a minor reduction in the number of true positives. The results are similar among all imputation methods considered, especially when the proportion of missing values is small. Our methodology with an additional aggregation step provides better results with a minor loss in sensitivity and can be of interest for proteomicists who will benefit from results at the protein level while using peptide-level quantification data.
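As an illustration of how Rubin's rules account for both sources of variability, the sketch below pools a single scalar parameter estimated on D imputed datasets: the combined estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/D) times the between-imputation variance. This is a simplified scalar case with made-up numbers and a hypothetical function name; the workflow described in this paper operates on a vector of parameters and its variance-covariance matrix.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Pool D per-imputation estimates of one scalar parameter with Rubin's rules."""
    d = len(estimates)
    pooled = mean(estimates)        # combined point estimate
    within = mean(variances)        # mean within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor D - 1)
    total = within + (1 + 1 / d) * between
    return pooled, within, between, total


# Hypothetical estimates and variances from D = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.06, 0.05]

pooled, within, between, total = rubin_pool(estimates, variances)
print(f"pooled estimate: {pooled:.3f}")
print(f"within: {within:.3f}, between: {between:.3f}, total: {total:.3f}")
```

A single imputation would report only the within-imputation variance; the between-imputation term is what carries the uncertainty due to the imputation itself.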

S1 Table. State of the art on imputation in quantitative proteomics.

This table gives an overview of the recent literature on imputation methods in quantitative proteomics. (PDF)

S2 Table. Performance evaluation on the first set of MAR simulations imputed using maximum likelihood estimation.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S3 Table. Performance evaluation on the first set of MAR simulations imputed using k-nearest neighbours.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S4 Table. Performance evaluation on the first set of MAR simulations imputed using Bayesian linear regression.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S5 Table. Performance evaluation on the first set of MAR simulations imputed using principal component analysis.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S6 Table. Performance evaluation on the first set of MAR simulations imputed using random forests.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S7 Table. Performance evaluation on the second set of MAR simulations imputed using maximum likelihood estimation.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S8 Table. Performance evaluation on the second set of MAR simulations imputed using k-nearest neighbours.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S9 Table. Performance evaluation on the second set of MAR simulations imputed using Bayesian linear regression.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S10 Table. Performance evaluation on the second set of MAR simulations imputed using principal component analysis.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S11 Table. Performance evaluation on the second set of MAR simulations imputed using random forests.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S12 Table. Performance evaluation on the third set of MAR simulations imputed using maximum likelihood estimation.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S13 Table. Performance evaluation on the third set of MAR simulations imputed using k-nearest neighbours.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S14 Table. Performance evaluation on the third set of MAR simulations imputed using Bayesian linear regression.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S15 Table. Performance evaluation on the third set of MAR simulations imputed using principal component analysis.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S16 Table. Performance evaluation on the third set of MAR simulations imputed using random forests.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S17 Table. Performance evaluation on the first set of MCAR + MNAR simulations imputed using maximum likelihood estimation.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S18 Table. Performance evaluation on the second set of MCAR + MNAR simulations imputed using maximum likelihood estimation.

Results are provided as mean ± standard deviation over the 100 simulated datasets for each indicator of performance. (PDF)

S19 Table. Performance evaluation on the Arabidopsis thaliana + UPS1 dataset, filtered with at least 1 quantified value in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S20 Table. Performance evaluation on the Arabidopsis thaliana + UPS1 dataset, filtered with at least 1 quantified value in each condition and focusing only on the comparison 5 fmol vs. 10 fmol.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S21 Table. Performance evaluation on the Arabidopsis thaliana + UPS1 dataset, filtered with at least 2 quantified values in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S22 Table. Performance evaluation on the Arabidopsis thaliana + UPS1 dataset, extracted without Match Between Runs and filtered with at least 1 quantified value in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S23 Table. Performance evaluation on the Arabidopsis thaliana + UPS1 dataset, extracted without Match Between Runs and filtered with at least 2 quantified values in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S24 Table. Performance evaluation on the Arabidopsis thaliana + UPS1 dataset at the protein level, filtered with at least 1 quantified value in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S25 Table. Performance evaluation on the Saccharomyces cerevisiae + UPS1 dataset, filtered with at least 1 quantified value in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S26 Table. Performance evaluation on the Saccharomyces cerevisiae + UPS1 dataset, filtered with at least 2 quantified values in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF)

S27 Table. Performance evaluation on the Saccharomyces cerevisiae + UPS1 dataset, at the protein level and filtered with at least 1 quantified value in each condition.

Missing values were imputed using the maximum likelihood estimation method. (PDF) Click here for additional data file. 11 Apr 2022 Dear Dr. Chion, Thank you very much for submitting your manuscript "Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. As recommended by the reviewers, it would be useful to contextualize mi4p better within the state of the art for proteomics imputation. Additionally, the reviewers raise important concerns related to the simulated datasets that should be addressed. Finally, several changes are recommended to improve the usability of the mi4p software, also for non-statistical users, and provide full code details associated with the manuscript. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. 
[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Wout Bittremieux Guest Editor PLOS Computational Biology William Noble Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: I have uploaded my review as a pdf file. Reviewer #2: The authors developed a multiple imputation strategy to get a better performance than the single imputation strategy. In the results section, they simulated datasets with missing at random and missing not at random assumptions to test the performance of their method. The results show that their method performed better than DAPAR. They also used real dataset to confirm the conclusion. Overall, this manuscript was well written and organized. The performance of the tool seems promising. Reviewer #3: Comments attached ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: No: No code for data simulations or data analysis were provided ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Ludger Goeminne Reviewer #2: No Reviewer #3: No Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. 
Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols Submitted filename: review_Plos_Comp_biol.pdf Click here for additional data file. Submitted filename: PLOSCB_Chion_Review.docx Click here for additional data file. 18 Jun 2022 Submitted filename: response-to-reviewers.pdf Click here for additional data file. 6 Jul 2022 Dear Dr. Chion, Thank you very much for submitting your manuscript "Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics" for consideration at PLOS Computational Biology. The reviewers appreciated your thorough efforts in submitting your revised manuscript. As you will see, there are still a few minor recommendations for clarifications. As soon as these have been provided, we will accept this manuscript for publication. Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 
When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Wout Bittremieux Guest Editor PLOS Computational Biology William Noble Deputy Editor PLOS Computational Biology *********************** A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately: [LINK] Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Dear editor I would like to congratulate the authors on their thorough revision. My comments have been accurately addressed. I only have some minor comments on the newly inserted paragraph, which I think could be addressed relatively smoothly. Comment 1: I am not sure what is meant by "pathological cases" in table 2 and the text preceding it. Do the authors mean the number of "false positives"? If so, they should use this terminology. If this interpretation is correct, I wonder why the authors picked this metric and not e.g. 
sensitivity, specificity or F-score for example? (this could be addressed by adding a sentence explaining why the authors compare these methods in terms of the numbers of false positives)

Comment 2: Line 275: the authors state "The mi4p workflow dramatically underperforms at detecting positives when using the norm imputation method." I assume the term "positives" refers to "true positives"? If this interpretation is correct, is there somewhere a table with the number of true positives to back up this claim? Alternatively, if this sentence refers to the high number of false positives, I would rephrase this sentence accordingly, e.g. by saying that the mi4p workflow detects a lot of false positives when using the norm imputation method.

Comment 3: I would also suggest to explicitly add a field to table 2 to make it clear from the table itself that the given percentages are the amputation percentages.

Comment 4: I further have some small grammar suggestions:

Line 105: "coefficient of the linear model pour peptide p" -> "coefficient of the linear model for peptide p"
Line 264: "a slightly increased variability than other methods" -> "a slightly increased variability compared to other methods"
Line 270: remove "are compared as well"
Line 271: "for imputing DQ times each simulated dataset" -> "for imputing each simulated dataset DQ times"
Line 272: remove "According to"

Reviewer #3: The revised manuscript is greatly improved, and the authors have satisfactorily addressed all of my concerns.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file).
The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Ludger Goeminne

Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.
Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

20 Jul 2022

Submitted filename: response-to-reviewers_v2.pdf

21 Jul 2022

Dear Dr. Chion,

We are pleased to inform you that your manuscript 'Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards.
Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Wout Bittremieux
Guest Editor
PLOS Computational Biology

William Noble
Deputy Editor
PLOS Computational Biology

***********************************************************

25 Aug 2022

PCOMPBIOL-D-22-00385R2

Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics

Dear Dr Chion,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.
Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsuzsanna Gémesi
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
