
A parametric framework for multidimensional linear measurement error regression.

Stanley Luck1.   

Abstract

The ordinary linear regression method is limited to bivariate data because it is based on the Cartesian representation y = f(x). Using the chain rule, we transform the method to the parametric representation (x(t), y(t)) and obtain a linear regression framework in which the weighted average is used as a parameter for a multivariate linear relation for a set of linearly related variable vectors (LRVVs). We confirm the proposed approach by a Monte Carlo simulation, where the minimum coefficient of variation for error (CVE) provides the optimal weights when forming a weighted average of LRVVs. Then, we describe a parametric linear regression (PLR) algorithm in which the Moore-Penrose pseudoinverse is used to estimate measurement error regression (MER) parameters individually for the given variable vectors. We demonstrate that MER parameters from the PLR and nonlinear ODRPACK methods are quite similar for a wide range of reliability ratios, but ODRPACK is formulated only for bivariate data. We identify scale invariant quantities for the PLR and weighted orthogonal regression (WOR) methods and their correspondences with the partitioned residual effects between the variable vectors. Thus, the specification of an error model for the data is essential for MER and we discuss the use of Monte Carlo methods for estimating the distributions and confidence intervals for MER slope and correlation coefficient. We distinguish between elementary covariance for the y = f(x) representation and covariance vector for the (x(t), y(t)) representation. We also discuss the multivariate generalization of the Pearson correlation as the contraction between Cartesian polyad alignment tensors for the LRVVs and weighted average. Finally, we demonstrate the use of multidimensional PLR in estimating the MER parameters for replicate RNA-Seq data and quadratic regression for estimating the parameters of the conical dispersion of read count data about the MER line.


Year:  2022        PMID: 35061791      PMCID: PMC8782541          DOI: 10.1371/journal.pone.0262148

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


1 Introduction

In this work, we consider the problem of fitting a multidimensional line for data that are subject to stochastic error. The motivation for this work comes from a collaborative R & D effort involving the application of genome-wide association study (GWAS) [1] and eQTL [2] methods to identify beneficial agronomic variation in maize. This led to our applied algebraic investigation of the merits of various effect size measures and their associated statistical methodologies as described in our two recent publications in this journal. In the first, we discussed the importance of factoring a 2 × 2 contingency table for obtaining marginal scale invariant effect size measures for proportional variation [3]. We also used projective geometric concepts for discussing connections between the phi coefficient, odds ratio, relative risk and proportion with respect to the representation of effect sizes. In the second, we discussed the formulation of a nonparametric measure for nonoverlap proportion and the importance of specifying a complete set of parameters for point-biserial variation in the formulation of an effect size measure [4]. We also discussed the importance of providing an error model for data and the use of Monte Carlo (MC) methods for estimating distributions and confidence intervals for effect size, as required for best practices [5]. However, in RNA-Seq studies, error estimation is complicated by the low number of replications usually employed. Our RNA-Seq experimental design is consistent with the recommended minimum of three replicates [6], but this is less than the minimum of 13 replicates required for reliably estimating variance [7]. Furthermore, RNA-Seq read counts can range over more than three orders of magnitude with heteroscedastic error. Thus, weighted least squares (WLS) methods [8, 9] are needed for properly partitioning residual effects in data analyses.
The importance of accounting for measurement errors in omics studies has been discussed in several publications [7, 10–13]. These problems provide the motivation for our investigation of linear measurement error regression (MER) methods for multivariate data. The concepts discussed in this paper significantly extend our previous gene expression data analysis work that is briefly described in [14]. We use the term ‘measurement error regression’ in accordance with the recommendation in [15] instead of the other commonly used terms, such as ‘error-in-variables’ or ‘major axis regression’ [16]. There are many studies on the MER problem [17–19], but [9, 20–22] were particularly useful in this work. There are two forms of WLS optimization in measurement error regression for (x, y) data. The elementary form involves the application of the normal equations [9] algorithm to obtain WLS-based linear regression (WR) estimates for the necessary parameters. However, WR is subject to the limitation that the design matrix is constructed from error-free independent data, and the residual effects are assigned exclusively to the dependent data. Thus, WR requires the specification of ‘dependent’ and ‘independent’ quantities. Then, when both x and y are subject to error, the WR estimate is biased, and there is extensive literature describing various approaches for solving this problem [18–20]. The second form of WLS involves ‘weighted orthogonal regression’ (WOR) or ‘orthogonal distance regression’, where the weighted sum of squares of the combined residual effects for x and y is minimized [23]. The ‘dependent’ versus ‘independent’ distinction for y and x does not hold in WOR. Then, WOR can be regarded as a generalization because under the condition where only the y data are subject to error, the WOR and WR estimates are equivalent. However, there is no analytical solution for WOR [23], and the implementation requires a numerical iterative search method.
In this work, we use the implementation of WOR in open source statistics software [24, 25] based on the nonlinear ODRPACK algorithm [26]. However, ODRPACK is limited to the analysis of bivariate data because it is based on the Cartesian representation y = f(x). Our objective is to describe a multivariate generalization of the WR algorithm that provides unbiased estimates for MER parameters for a set of linearly related variable vectors (LRVVs). The main novel contributions of this work are as follows:
1) We associate a set of LRVVs with a convex set of weighted averages and use Monte Carlo simulation to demonstrate that the minimum coefficient of variation for error (CVE) provides the optimal weights for forming the weighted average (Eqs 13, 16 and 17).
2) We formulate a parametric linear regression (PLR) algorithm where the weighted average of the LRVVs serves as the independent parameter, and the MER parameters are estimated using the Moore-Penrose pseudoinverse (Eqs 22 and 23; Proposition 1).
3) Convex mappings from the data points in the MER graph to the predicted values along the MER line are provided (Eqs 6 and 7).
4) We propose that in the parametric representation for the linear relation between variable vectors, the corresponding covariance is a parametric vector quantity (Eq 25; Proposition 2).
5) We identify scale invariant quantities for the partitioned residual effects between the LRVVs for both PLR (Eq 27) and WOR (Eq 37).
6) We propose a multivariate generalization of the Pearson correlation coefficient based on the m-fold contraction of Cartesian polyad alignment tensors for the LRVVs and weighted average (Eqs 32, 33 and 35).
7) A quadratic regression method is used to estimate the conical dispersion parameters for determining the read count error in replicated RNA-Seq data (Eqs 40 and 43).

2 Methods

Section 2.1 introduces our notation. In sections 2.2–2.4, we examine the WLS methods for partitioning residual effects in linear regression for (x, y) data. In section 2.5, we discuss the minimum CVE criterion for obtaining the optimal weighted average of a set of LRVVs. In section 2.6, we describe the PLR framework where parametric equations are used to obtain a novel multidimensional generalization of the normal WLS algorithm for linear regression. In section 3.2, we describe the application of the PLR algorithm for the analysis of the conical dispersion of replicated RNA-Seq data about the MER line. It has already been established that the specification of an error model for the given data is essential in MER analysis. An error model serves as a realistic assessment of the performance of a data acquisition system, and error parameters are estimated from replicated measurements [27]. The parameters are usually summarized as an ‘expected variance’ for error with respect to each observation. When referring to experimental error, we use the term ‘expected variance’ because error parameters are empirical. We provide a detailed discussion of the role of the expected variance in partitioning residual effects for the WLS optimization of MER. Furthermore, proper statistical practice dictates that effect size measures such as the regression slope, the correlation coefficient and Cohen’s d must be qualified by a distribution [5, 28], and an assessment of substantive significance [29] must account for a distributed system response. Consequently, we also discuss the use of Monte Carlo methods for estimating the distributions and confidence intervals of MER parameters.

2.1 Notation

Our notation and terminology are partly taken from [30], but comprehensive discussions of vector spaces and linear algebra for applied statistics are found in many textbooks, such as [8, 9, 31]. Scalars are denoted by italics y, and vectors are written in bold lowercase letters y ≡ (y_i) ≡ (y_1, y_2, y_3, …, y_dim(y)). Matrices are written in bold uppercase letters Y ≡ [y_ij], and the transpose is denoted as Y^T. Then, n joint observations for m experimental quantities produce a dataset that corresponds to the Cartesian product of the data vectors y_1, …, y_m, where dim(y_j) = n. We use a convenient terminology for the subsets of the dataset, where each axis of a regression graph is assigned to a variable vector y_j, the linear combinations ∑_j k_j y_j are elements of the variable space, and each point in the graph corresponds to an observation vector y(i) [30, 32]. In this work, we describe a novel parametric linear regression algorithm for estimating MER parameters for a set of LRVVs. We also use the shortened term ‘variable vectors’ synonymously when referring to LRVVs, except in a few clearly identified cases. The inner product with the one-vector 1 = (1, 1, 1, …) produces the mean value ȳ = ⟨1, y⟩/n. The subtraction of the mean vector produces a centered vector y_c = y − ȳ 1. The hat symbol ‘^’ denotes a unit-length vector, i.e., ŷ = y/‖y‖. Then, the Pearson correlation coefficient r(x, y) is defined as the cosine of the angle between unit-length, centered vectors, r(x, y) = ⟨x̂_c, ŷ_c⟩ (Eq 1) [30, 32]. The variance for y is denoted as Var(y). The expected variance for normally distributed errors is denoted as s², and the vector s²_y denotes the expected variance of the error in y. For bivariate data (y_j, y_k), the covariance is denoted as Cov(y_j, y_k); the notation for a variance-covariance matrix is not needed in this work. In our Cov(y_j, y_k) notation, y_j and y_k are independent and dependent quantities, respectively, and the roles are reversed for Cov(y_k, y_j). The connection between correlation and covariance as measures of linear dependence in data is discussed in [21, 32].
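The cosine definition of the Pearson correlation (Eq 1) can be sketched directly from the notation above; this is a minimal illustration in plain Python, with `pearson_r` a hypothetical helper name, not part of the paper:

```python
import math

def pearson_r(x, y):
    """Pearson correlation as the cosine of the angle between the
    centered, unit-length vectors x_c and y_c (Eq 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]              # centered vectors
    yc = [v - my for v in y]
    nx = math.sqrt(sum(v * v for v in xc))  # vector norms
    ny = math.sqrt(sum(v * v for v in yc))
    return sum(a * b for a, b in zip(xc, yc)) / (nx * ny)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x
print(pearson_r(x, y))          # close to 1 for nearly collinear data
```

For exactly collinear vectors the cosine is ±1, matching the geometric interpretation in the text.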
A general discussion of convex sets, linear fractional transformations and perspective functions is found in [33]. The set containing all convex combinations of a set of variable vectors {y_j} is denoted as conv({y_j}), whose elements are τ = ∑_j w_j y_j, for ∑_j w_j = 1 and 0 ≤ w_j ∀ j. A perspective function has the form P(y, t) = y/t. In recent publications, we provide an elementary discussion about projective geometry for fractional variation [3], and identify the homogeneous coordinates of the Pearson correlation coefficient as points on the line passing through the point [4]. Equivalent homogeneous coordinates are indicated by a two-sided arrow, i.e., y ↔ x ⇒ y = x t.

2.2 The WLS linear regression algorithm

Consider a set of n linearly related observations {(x_i, y_i) | 1 ≤ i ≤ n}, which are represented as the variable vectors (x, y). The observations are subject to error, and the differences between the true (x*, y*) and observed values are denoted by (e_x, e_y) = (x − x*, y − y*). The fact that (x*, y*) are unknown implies that (e_x, e_y) are also indeterminate. The specification of an error model for the data is essential for estimating the distributions and corresponding confidence intervals of the statistical parameters, including the slope; see section 2.6. The default approach is to reduce to a stochastic form or some approximation thereof, where the errors (e_x, e_y) are independent and associated with normal distributions with zero means (0, 0) and expected variances (s²_x, s²_y). In the standard MER framework [20, 21], the linear relation between x and y is expressed using the Cartesian equation y = α_xy + β_xy x, where (α_xy, β_xy) are the model parameters. We use the subscript ‘xy’ to indicate the correspondence with the derivative dy/dx, because it will be necessary to distinguish between different parameterizations in Eq 21. In linear regression analysis, the objective is to obtain sample estimates (a_xy, b_xy) for (α_xy, β_xy). After introducing the design matrix X = [1, x] and the diagonal variance matrix V = diag(s²_y), the WLS linear regression estimate is obtained by solving the normal equations to obtain [9] u_xy = (X^T V^{-1} X)^{-1} X^T V^{-1} y, where u_xy = (a_xy, b_xy). V^{-1} is required for the WLS scaling of both X and y, and (X^T X)^{-1} X^T is the Moore-Penrose pseudoinverse of X; in practice, the pseudoinverse is estimated by singular value decomposition. The ordinary linear regression (OLR) estimate is obtained when V ∝ I, where I is the identity matrix. However, the b_xy estimate is biased because of the correlation of x with e_x in Eq 3 [20]. The reliability ratio [17] for x is defined as κ_x = Var(x*)/Var(x), where Var(x) = Var(x*) + Var(e_x). As illustrated in Fig 1A and 1B, b_xy is attenuated as κ_x decreases from 1.
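The normal-equations step above reduces, for the straight-line case with diagonal V, to a small closed-form system. The following sketch (with the hypothetical helper name `wls_line` and per-observation weights w_i = 1/s²_y,i assumed) is an illustration, not the paper's implementation:

```python
def wls_line(x, y, var_y):
    """Weighted least squares fit of y = a + b*x, solving the 2x2
    normal equations with weights w_i = 1 / Var(e_y,i)."""
    w = [1.0 / v for v in var_y]
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx * sx          # determinant of X^T V^-1 X
    a = (sxx * sy - sx * sxy) / det   # intercept estimate
    b = (sw * sxy - sx * sy) / det    # slope estimate
    return a, b

# Points lying exactly on y = 1 + 2x are recovered exactly.
a, b = wls_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0], [1.0, 1.0, 1.0])
print(a, b)
```

For error-free data on a line, the fit is exact for any positive weights; the weights matter only in how residuals are partitioned when the data scatter.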
This bias in b_xy arises from the algebraic requirement that X must serve as an error-free quantity in the normal equations, and the residuals are assigned exclusively to y. The solution u_xy holds under the condition where Var(e_x) is small enough such that x ≈ x*, with κ_x ≈ 1. A key objective of this work is to describe a novel parametric generalization in which the WR algorithm itself serves as the basis for measurement error regression for a set of LRVVs (section 2.6). We also compare the performance of PLR with that of existing MER methods and provide realistic MER examples using publicly available data. In particular, we discuss the WLS orthogonal regression and attenuation-corrected slope estimation methods.
Fig 1

Comparison of various measurement error regression methods.

A, B, C: Monte Carlo simulations of MER for 25 evenly-spaced values. The statistical parameters are estimated from 5000 Monte Carlo datasets (x, y) = (x* + e_x, y* + e_y), with normally distributed errors (e_x, e_y). The parametric linear regression and orthogonal distance regression methods both produce consistent estimates for the slope (b_MER) for a wide range of reliability ratios κ. The weighted average (b_WA) of the attenuation-corrected direct and inverse WR estimates also produces comparable results. In contrast, the standard (b_WR) estimate is attenuated. A: Homoscedastic error: (s_y = 0.3, s_x = α), and α = {0.01, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35}. B: Heteroscedastic error, with α = {0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6}. C: Using the results from (B), the Pearson correlation coefficient r is attenuated compared to the parametric correlation coefficient for the weighted average τ = w x + (1 − w)y. D, E, F: MER results for the data (x, y) from Table 11.3.1 in [21]. However, error estimates were not provided for the data. For demonstration purposes, we used a rudimentary error model: κ_x = 0.93, κ_y = 0.91, r_xy = 0.49, and r_PLR = 0.74. D: Each MER line is labeled with its slope. The ordinary least squares (b_OLR) and orthogonal regression (b_OR) estimates are consistent with [21]. The WLS orthogonal regression (b_ORw) estimate is also shown. The predicted values for (x, y) are plotted for PLR and WOR. E: The b_WA estimate is obtained from the separate regressions b_xy and b_yx. The inset shows the distribution of b_WA estimated from 100 histograms with 5000 MC samples per histogram (see Fig 4); the error bars indicate the degree of convergence for the MC simulation. F: The weighted average prediction residuals for b_WA = 1.38 and the WOR residuals are quite similar. The PLR residuals from the τ WR results in (E) are comparable but do not account for heteroscedastic error. A-E: The error bars correspond to ±2σ.

Fig 4

Distribution of the statistical parameters in parametric linear regression.

A two-dimensional histogram (B) is calculated from 500000 Monte Carlo datasets for the data in Fig 1D. The distributions of rPLR (A) and bPLR (C) are averages obtained from 100 MC histograms, with 5000 MC datasets (x, y) per histogram. The 2σ error bars are estimates of the positive and negative deviations from the median and indicate the degree of convergence of the MC estimate. The (rPLR, bPLR) = (0.74, 1.44) estimates are indicated in red.

2.3 WLS orthogonal regression

The WOR algorithm minimizes the weighted sum of squares for the combined residual errors in x and y [26] and does not require the identification of dependent and independent variables. However, this minimization problem does not possess an analytical solution [23], and a numerical method is needed for the implementation of WOR. For our demonstration, the Python scipy.odr package [25] provides a convenient WOR implementation based on the ODRPACK optimization library [26]. We are interested in WOR for a linear function y = a + bx, but ODRPACK is more general and uses the Levenberg-Marquardt nonlinear least squares algorithm to support parameter estimation for a user-specified bivariate function y = f(x). As demonstrated in Fig 1, the ODRPACK algorithm performs well for homoscedastic (Fig 1A) and heteroscedastic (Fig 1B) error models over a wide range of reliability ratios. When only the y data are subject to error (κ_x approaches 1), the WOR estimate is equivalent to the WR estimate. Thus, the WOR algorithm is regarded as a generalization of WR. Furthermore, the ODRPACK algorithm provides predicted and residual values for (x, y), as demonstrated in Fig 1F. For comparison, we describe a novel algebraic method to estimate predicted and residual values in MER. First, we associate each observed data point (x_i, y_i) with a convex mapping to an interval on an MER line with (intercept, slope) ≡ (a, b). The interval is bounded by the two points (x_i, a + b x_i) and ((y_i − a)/b, y_i) (Eqs 6 and 7), which correspond to the special cases where the residual is assigned entirely to y or entirely to x, respectively. Then, the weighted average v_i of these two points is contained in the interval for 0 ≤ w ≤ 1. Using the signal-to-noise ratios (see Eq 17), the optimal weight is obtained, which produces the weighted average prediction (WP) value v_i. In Fig 1F, we demonstrate that the ODRPACK and WP estimates for the residuals are quite similar. The generalization of Eqs 6 and 7 for multidimensional MER is straightforward.
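Although general WOR requires an iterative search, the equal-error-variance special case (ordinary orthogonal regression, i.e., Deming regression with variance ratio 1) does have a well-known closed form; the sketch below illustrates that special case only, with the hypothetical helper name `orthogonal_slope`, and is not the ODRPACK algorithm used in the paper:

```python
import math

def orthogonal_slope(x, y):
    """Closed-form orthogonal (total least squares) line fit for the
    special case of equal error variances in x and y.
    Assumes the centered cross-product sxy is nonzero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    a = my - b * mx
    return a, b

# Points on y = 1 + 2x are recovered exactly.
print(orthogonal_slope([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))
```

Unlike the WR slope, this estimate is symmetric: swapping x and y yields the reciprocal slope, consistent with the absence of a dependent/independent distinction in orthogonal regression.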
However, the WR and WOR algorithms are both subject to the algebraic limitation that the Cartesian representation y = f(x) only allows for the representation of lines (or curves) in two dimensions. Alternatively, the specification of a straight line in higher-dimensional space requires the parametric representation p(t) = p₀ + t u, with a point p₀ on the line and a direction vector u [34]. Consequently, our objective is to describe a parametric framework where the WR algorithm serves as the basis for measurement error regression for LRVVs.

2.4 Covariance in ordinary linear regression

The OLR algorithm corresponds to the special case where V is replaced with the identity matrix I in Eq 4. Then, the expression for the slope takes a simple form and provides the intuition for our novel parametric algorithm in section 2.6. From Eq 3, the sample covariance [22] is expressed in terms of the true values and the errors, where Cov(e_x, e_y) = 0 and Cov(x, e_x) = Var(e_x). Then, we obtain the direct estimator for the slope, b_xy = Cov(x, y)/Var(x), where b_xy → β as κ_x → 1. Even though b_xy/κ_x is referred to as the ‘regression coefficient corrected for attenuation’ [17], we note that it is also biased. Reversing the roles of x and y yields the inverse estimator 1/b_yx for β, with b_yx = Cov(x, y)/Var(y) and κ_y = 1 − Var(e_y)/Var(y). The relations in Eqs 9 and 11 indicate that the OLR estimates serve as bounding values because, for any least squares estimate b_MER, it can be shown [20] that |b_xy| ≤ |b_MER| ≤ 1/|b_yx| (Eq 12). In section 2.6, we provide a statistical explanation (see Eq 26) and a graphical illustration (Fig 2) of the bounded range of |b_MER|. To compensate for the bias in b_xy and 1/b_yx, various ad-hoc combinations have been proposed, including the arithmetic mean [16, 35]:
Fig 2

Bounded ranges of the regression slopes for the errors in both x and y.

Monte Carlo simulations of parametric linear regression are conducted for 25 evenly spaced values, with the error model spanning α = 0.01, 0.2, 0.4. The estimates for the slope b_PLR (A) and correlation r_PLR = r_xτ r_yτ (B) are averages over 5000 datasets (x, y) = (x* + e_x, y* + e_y) with normally distributed errors (e_x, e_y) and a weighted average τ = w x + (1 − w)y. The optimal (b_PLR, r_PLR) estimates corresponding to the minimum CVE in τ are also plotted (red). The 2σ error bars are estimates for the positive and negative deviations from the median value.

Unfortunately, the limiting values of the arithmetic mean are not consistent with the OLR estimate for either κ_x = 1 or κ_y = 1. Instead, we introduce the weighted average b_WA, where the weights are determined from signal-to-noise ratios (see Eq 17). In Fig 1A and 1B, we observe that the b_WA estimates compare well with the expected slope, but with much larger confidence intervals compared to those of WOR.
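The attenuation of the direct estimate and the bounding property of Eq 12 can be seen in a small simulation. This sketch uses an invented error model (true slope 2, equal error SDs on x and y) and the hypothetical helper name `cov`; it is an illustration of the bounds, not the paper's simulation code:

```python
import random

def cov(u, v):
    """Sample covariance; cov(u, u) is the sample variance."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

random.seed(7)
beta, n, s = 2.0, 20000, 3.0               # true slope, sample size, error SD
x_star = [20.0 * i / n for i in range(n)]  # true x* values
x = [xs + random.gauss(0, s) for xs in x_star]
y = [beta * xs + random.gauss(0, s) for xs in x_star]

b_xy = cov(x, y) / cov(x, x)   # direct (y-on-x) OLR slope, attenuated
b_yx = cov(x, y) / cov(y, y)   # inverse (x-on-y) OLR slope
kappa_x = cov(x_star, x_star) / (cov(x_star, x_star) + s * s)

print(b_xy, 1.0 / b_yx)        # these bracket the true slope beta = 2
print(b_xy / kappa_x)          # attenuation-corrected estimate, near 2
```

The direct estimate sits below the true slope by roughly the factor κ_x, the inverse estimate's reciprocal sits above it, and dividing by the known reliability ratio recovers the slope, in line with Eqs 9-12.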

2.5 Noise reduction for a weighted average of linearly related variable vectors

Consider the weighted average of the LRVVs (x, y): τ = w x + (1 − w) y, with 0 ≤ w ≤ 1. Then, τ ∈ conv(x, y) and corresponds to points on the line segment connecting x and y. The average of x and y is associated with the partial cancellation of error and an improved signal-to-noise ratio. Let Var(e_x) and Var(e_y) denote the expected variances of the errors in x and y, respectively. Then, the propagation of independent error for the weighted sum of two variables [34] gives Var(e_τ) = w² Var(e_x) + (1 − w)² Var(e_y). By introducing the mean value τ̄ = w x̄ + (1 − w) ȳ and the change in coordinates ω = w/(1 − w), the coefficient of variation for error in τ is CVE(ω) = √(ω² Var(e_x) + Var(e_y)) / (ω x̄ + ȳ). Then, setting the derivative dCVE/dω = 0 yields ω_min = x̄ Var(e_y)/(ȳ Var(e_x)) (Eq 16), and we obtain the criterion for the minimum CVE. In Fig 3, we use Monte Carlo simulations to demonstrate that the minimum is consistent with Eq 16. By partitioning the weighted average τ = ∑_j w_j y_j into a sequence of pairwise averages for the data (y_j), we obtain the general min(CVE) expression w_j ∝ ȳ_j/Var(e_j) (Eq 17).
Fig 3

Noise reduction for a weighted average of variable vectors.

Monte Carlo simulations of weighted averaging are conducted for the data (x, y) consisting of 25 evenly spaced values for y* = 2x*. The statistical parameters are estimated from 5000 datasets (x, y) = (x* + e_x, y* + e_y) with normally distributed errors (e_x, e_y), and the error model is s_y = 0.3 y*, s_x = α x* for α = 0.01, 0.3, 0.5. The minimum coefficient of variation for error for the weighted average τ = w x + (1 − w)y is consistent with Eq 17, as shown by the optimal w_min curve. Deming regression is associated with a different choice of weights, and the corresponding CVE curve (grey) intersects the w_min curve.

Thus, the noise reduction for the average is optimal when the signal-to-noise ratios are balanced across the LRVVs. To the best of our knowledge, this expression for optimizing the CVE of a weighted average has not been previously reported. We note that these weights correspond to signal-to-noise ratios [36]. However, for data that are not linearly related, averaging results in the uncontrolled mixing of noise and signal and a loss of information. This implies that the treatment of measurement error in more general problems, including multiple and polynomial regression, requires nonlinear optimization methods. That explains why this work is limited to the treatment of MER for LRVVs. In the next section, we discuss the use of τ as a parameter for measurement error regression.
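The min-CVE criterion can be checked numerically: under the stated error propagation, a grid search over w should land on a weight proportional to mean/variance. The means and variances below are invented for illustration, and the weight formula w_j ∝ ȳ_j/Var(e_j) reflects our reading of Eqs 16 and 17:

```python
import math

# Hypothetical error model: means and expected error variances for x and y.
mx, my = 1.0, 2.0
vx, vy = 0.09, 0.36

def cve(w):
    """Coefficient of variation for error of tau = w*x + (1-w)*y."""
    return math.sqrt(w * w * vx + (1 - w) ** 2 * vy) / (w * mx + (1 - w) * my)

# Brute-force grid search for the minimizing weight.
grid = [i / 10000 for i in range(1, 10000)]
w_grid = min(grid, key=cve)

# Closed-form candidate: weight proportional to mean / error variance.
w_min = (mx / vx) / (mx / vx + my / vy)

print(w_grid, w_min)   # both near 2/3 for this error model
```

The agreement of the grid minimum with the closed-form weight illustrates the balancing of signal-to-noise ratios described above.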

2.6 The chain rule in linear regression

By applying the chain rule [34] to Eq 2, we obtain the factorization β_xy = dy/dx = (dy/dt)/(dx/dt). The fractional form of Eq 19 is associated with scale invariance and corresponds to homogeneous coordinates in the parameter t. The mapping β_xy ↦ (dy/dt, dx/dt) is equivalent to invoking the parametric representation (x(t), y(t)) for the linear relation between the variables (Eq 8). Therefore, the chain rule permits a generalization in which the slope in linear regression is parameterized by τ*, where β corresponds to the point (β, 1) on the projective line and a perspective function of the slope vector [4, 33]. Consequently, Eq 21 serves as the basis for our generalization of the linear regression algorithm, where the weighted average serves to parameterize the relation between variable vectors. Analogously to Eq 9, the PLR slope vector b = (b_xτ, b_yτ) is obtained, where the components of b are estimated individually using the WR algorithm (Eq 4) with τ in the design matrix. Thus, x and y both serve as dependent variables. Without loss of generality, the scale factor can be ignored because of the homogeneous coordinates equivalence. Therefore, we can regard τ as corresponding to a set of fixed values. Then, the parametric linear regression estimator for β is b_PLR = b_yτ/b_xτ. In the special case where w = 1, τ = x and b_PLR → b_xy, so x serves as both a dependent variable and an independent variable. Conversely, for w = 0, τ = y and y serves as an independent variable. Therefore, the fact that there is a bounded range of values for the MER slope (Eq 12) comes from the dependence of b_PLR on τ, which ranges over conv(x, y). The monotonic decrease of b_PLR with w is shown in Fig 2A, with the lower and upper bounds given by (b_xy, 1) and (1, b_yx), respectively; this is consistent with the OLR constraints in Eq 12. As we discuss in section 3.2, the extension of the PLR algorithm for estimating the MER parameters for a set of LRVVs is straightforward. Therefore, we make the parametric linear regression proposition. Proposition 1.
For a set of linearly related variable vectors {y_j}, the weighted average τ = ∑_j w_j y_j corresponding to the minimum coefficient of variation for error (Eq 17) provides the optimal measurement error regression estimate for the slope vector b. In Fig 1A and 1B, we show that the performance of the b_PLR estimate is comparable to that of the WOR algorithm for a wide range of reliability ratios. The PLR algorithm provides predicted values for the observations (Fig 1E), but predicted values can also be estimated using the WP method (Eq 7). In the special case where the LRVVs are not subject to error and the points fall exactly on a straight line, the CVE vanishes, and any y_j can serve as the parametric variable vector in PLR. Now, we consider the pedagogically important special case where V_x and V_y are both proportional to I. We obtain b_PLR = Cov(τ, y)/Cov(τ, x), which is a perspective function of b (Eq 25). In Eq 25, we introduce the parametric covariance vector notation Cov(y, x ∣ τ) to indicate the homogeneous coordinates equivalence with b_PLR. Thus, we note that covariance serves as a measure of linear dependence between variable vectors [21, 32] and is therefore subject to the chain rule. Consequently, we make the parametric covariance proposition. Proposition 2. In the analysis of linear dependence for a set of LRVVs {y_j}, the chain rule implies that the covariance is associated with a conditional vector form Cov(y_j, y_k ∣ τ), which is parameterized by the weighted average τ. Therefore, covariance is a vector quantity in the parametric representation for the linear relation between variable vectors. Then, there is a range for Cov(y, x ∣ τ) in correspondence with b_PLR and the bounded range for b_MER in Eq 12. For (w_x, w_y) given by Eq 17 and τ = w_x x + w_y y, we obtain the expression in Eq 26; observe that the right side is invariant to scaling of the data, (x, y) → (k_x x, k_y y), and note that there is a similar factorization for Cov(τ, x).
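The two-step structure of the PLR estimator (Eqs 22 and 23) can be sketched in a few lines: regress x and y separately on the weighted average τ, then take the ratio of the two slopes. The error model, the helper name `ols_slope`, and the weight formula w_j ∝ mean/variance are illustrative assumptions, not the paper's code:

```python
import random

def ols_slope(t, z):
    """OLS slope of z on t (the V proportional to I special case of the WR step)."""
    n = len(t)
    mt, mz = sum(t) / n, sum(z) / n
    num = sum((a - mt) * (b - mz) for a, b in zip(t, z))
    den = sum((a - mt) ** 2 for a in t)
    return num / den

random.seed(1)
beta, n, s2 = 2.0, 5000, 4.0              # true slope, size, error variance
x_star = [i / 100 for i in range(n)]
x = [xs + random.gauss(0, s2 ** 0.5) for xs in x_star]
y = [beta * xs + random.gauss(0, s2 ** 0.5) for xs in x_star]

# Weights from the min-CVE criterion: w_j proportional to mean / variance.
kx, ky = (sum(x) / n) / s2, (sum(y) / n) / s2
w = kx / (kx + ky)

tau = [w * a + (1 - w) * b for a, b in zip(x, y)]  # parametric variable vector
b_vec = (ols_slope(tau, x), ols_slope(tau, y))     # PLR slope vector
b_plr = b_vec[1] / b_vec[0]                        # ratio of components
print(b_plr)   # near the true slope 2 despite errors in both x and y
```

Note that both x and y are treated as dependent variables here; only τ plays the role of the error-free regressor, which is the essence of the parametric framework.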
Conversely, we conclude that the scaling invariance property for these quantities implies that w_x/w_y ∝ k_y/k_x, which is a necessary but not sufficient condition for specifying the optimal weights for τ; in section 3.1, we discuss Deming weighting, which also satisfies the scaling invariance property. Then, we obtain the homogeneous coordinates equivalence and observe the corresponding scaling relation: b_PLR varies proportionally with the scaling (k_x x, k_y y), so the b_PLR estimate is consistent with respect to the scaling of the data. These covariance factorizations provide another demonstration of the importance of WLS optimization for partitioning residual effects in measurement error regression. The generalization of these expressions for a (y_j) dataset is straightforward. However, we note that Cov(τ, y) assumes that V ∝ I and does not account for the heteroscedastic error in y. Thus, in practice, the PLR estimate (Eq 22) is preferred. Finally, we note that the two special cases, τ = x and τ = y, correspond to the bounding values in Eq 12. Therefore, Cov(x, y) and Cov(y, x) are numerically equal but serve as components of different parametric covariance vectors and correspond to different estimates for the MER slope, b_xy and 1/b_yx, respectively. Now, we consider the multivariate generalization of the Pearson correlation coefficient r (Eq 1). Using the well-known correspondence between covariance and Pearson correlation [32], we transform Eq 25 and define a parametric correlation vector (Eq 32). Then, we observe that the Pearson correlation coefficient r(x, y) can be written in Cartesian tensor form as the double contraction of dyads (Eq 33). The dyad is a bilinear measure of the joint variation in x and y. In the special cases where τ = x and τ = y, the parametric correlation reduces to the standard Pearson correlation coefficient r (Eq 1). Geometrically, r_PLR serves as a measure of the degree to which x and y are mutually aligned in the τ̂ direction.
Then, the generalization of Eq 33 as the m-fold contraction of polyads for the variable vectors serves as the basis for our definition of the parametric correlation coefficient, with m ≥ 2. The polyads are referred to as 2nd, 3rd, … order Pearson correlation tensors, respectively. Therefore, we associate a set of m LRVVs with a set of two-way, three-way, …, m-way weighted averages, PLRs, covariance vectors and PLR correlation coefficients. However, a detailed discussion of Cartesian tensors [37, 38] is beyond the scope of this paper. In Fig 2B, the concave-down variation of rPLR = r(t, x) r(t, y) results from the product of the monotonically increasing r(t, x) function and the decreasing r(t, y) function, with the endpoints equal to the Pearson correlation coefficient r. Thus, rPLR varies with the expected error, but Pearson’s r serves as the lower bound. Then, r is attenuated compared to rPLR when the data (x, y) are subject to measurement errors [13], as shown in Fig 1C. When the data (x, y) are highly correlated and Pearson’s r approaches 1, the rPLR curve becomes a nearly horizontal line at rPLR = 1. We conclude that the parametric covariance (Eq 25), correlation (Eq 32) and PLR slope (Proposition 1) correspond to different perspective functions and coordinate systems for representing the linear relation between variable vectors, i.e., points dispersed about the MER line. Then, a condition for the interpretation of covariance and correlation is that the deviation of the points from the MER line must be consistent with the expected error. In section 3.1, we discuss the ambiguity that arises when the MER line is perturbed by outliers. Analytical methods for the propagation of errors and the estimation of distributions for ratios [39, 40], proportions [41, 42], and correlation coefficients [43] are complicated by fractional transformations, bounded ranges, and discreteness. The propagation of errors for convoluted quantities such as bt and rPLR is prohibitively complicated. 
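The endpoint and lower-bound behavior of the product form can be checked numerically. In this sketch we sweep a weight w in t = w x + (1 − w) y and evaluate the product r(t, x) r(t, y); the data, weights, and sweep grid are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
s = rng.uniform(0, 10, n)
x = s + rng.normal(0, 1.0, n)   # symmetric toy data about a common signal
y = s + rng.normal(0, 1.0, n)

def r(a, b):
    # Ordinary Pearson correlation coefficient.
    return np.corrcoef(a, b)[0, 1]

r_xy = r(x, y)

# Sweep the weight in t = w*x + (1-w)*y; the product r(t,x)*r(t,y) equals
# the ordinary Pearson r at the endpoints and lies above it in between.
ws = np.linspace(0, 1, 21)
r_plr = np.array([r(w * x + (1 - w) * y, x) * r(w * x + (1 - w) * y, y)
                  for w in ws])
print(r_xy, r_plr.min(), r_plr.max())
```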
Alternatively, Monte Carlo methods are practical approaches for estimating the distributions and confidence intervals of fractional statistical parameters, and they allow for the detailed simulation of stochastic effects in the data acquisition process, as described previously [4]. The error bars shown in various figures in this paper are estimated using MC methods. Fig 4 shows an example of a two-dimensional MC histogram for the joint (rPLR, bPLR) distribution. We note that the distributions for MER parameters are often skewed, so confidence intervals are estimated separately for the +/− deviations from the median.
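The separate +/− interval construction can be sketched generically. The ratio statistic below is a hypothetical stand-in for a skewed MER parameter, not the paper's actual estimator:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo confidence interval for a skewed statistic, with the + and -
# deviations estimated separately from the median, as described in the text.
def mc_interval(stat, n_draws=20000, level=0.954):   # ~2-sigma coverage
    draws = np.array([stat(rng) for _ in range(n_draws)])
    med = np.median(draws)
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2])
    return med, med - lo, hi - med   # median, -deviation, +deviation

def ratio_stat(rng):
    # Hypothetical slope-like ratio of noisy means (true value 2).
    x = 5.0 + rng.normal(0, 1.0, 10)
    y = 10.0 + rng.normal(0, 1.0, 10)
    return y.mean() / x.mean()

med, dn, up = mc_interval(ratio_stat)
print(med)        # near the true ratio 2
print(up > dn)    # right skew: the + deviation exceeds the - deviation
```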

Distribution of the statistical parameters in parametric linear regression.

A two-dimensional histogram (B) is calculated from 500000 Monte Carlo datasets for the data in Fig 1D. The distributions of rPLR (A) and bPLR (C) are averages obtained from 100 MC histograms, with 5000 MC datasets (x, y) per histogram. The 2σ error bars are estimates of the positive and negative deviations from the median and indicate the degree of convergence of the MC estimate. The (rPLR, bPLR) = (0.74, 1.44) estimates are indicated in red.

3 Data analysis and results

Using publicly available data, we provide practical examples that illustrate both the ambiguity that arises when the variable vectors are not linearly related and the application of multidimensional parametric linear regression to the dispersion of RNA-Seq read count data.

3.1 Example 1: Measurement error regression with outliers

The data for this example come from Table 11.3.1 in [21] and originate from consulting work (R. Berger, personal communication). However, error estimates are not available for these data. For our demonstration, we provide a rudimentary error model, as shown by the 2σ xy-error bars in Fig 1D. The ordinary linear regression estimate for the slope is attenuated as expected (Eq 9). However, the orthogonal regression (OR) estimate reported in [21] is also biased because of the equal-weight assumption for the residual effects (ex, ey). Using the defined notations for sums of squares, and making the substitution (x, y) ↦ (wx x, wy y), we obtain a WLS OR estimate, where a weight-ratio factor is needed for the transformation back to the original coordinates for (x, y). The weights are determined from the signal-to-noise ratios with the constraints wx + wy = 1 and wx, wy ≥ 0 (Eq 17). Then, we note that this estimate is invariant to the scaling of the data, (x, y) → (kx x, ky y); therefore, the WLS OR estimate is consistent with respect to the scaling of the data. The OR estimate is recovered when wx = wy [20], and because of this constraint, the OR slope is not a consistent estimate, since it does not vary proportionally with (kx x, ky y) scaling. There is better agreement between the WLS OR and PLR estimates (Fig 1D), but the WLS OR estimate does not account for the heteroscedastic errors. The PLR estimate is obtained from the separate regressions (bx, by), as shown in Fig 1E. The differences between these estimates are small compared to the experimental uncertainty. However, we observe that there are many positive and negative outliers in the data (Fig 1D), and only a subset of the points is close to any one of the MER lines. This implies that the data (x, y) are not linearly related, and the weighted average t ∈ conv(x, y) is then subject to confounding effects. Then, WLS optimization produces misleading results, and the MER line does not possess a functional or operational interpretation. 
Criteria can be applied to select an LRVV subset for measurement error regression, but this works best when the outlier subset is small. Otherwise, linear regression is not recommended for the analysis of data that are not linearly related. In [4], we discuss the use of regression tree association graphs in the analysis of weakly correlated data. Finally, we note that the Deming orthogonal regression estimate [22] is obtained by substituting the alternative weighting into Eq 36. Our simulations confirm that Deming weighting has the scale invariance property but it is not optimal for MER (Fig 3), and the corresponding parameter estimates are biased for both PLR and ORw (not shown).
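For reference, the textbook Deming estimator [22] mentioned above can be sketched as follows; the data are hypothetical, and delta denotes the assumed ratio of the y-error variance to the x-error variance:

```python
import numpy as np

# Textbook Deming orthogonal-regression slope; delta = var(e_y) / var(e_x).
# With delta = 1 this reduces to the equal-weight OR estimate.
def deming_slope(x, y, delta=1.0):
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)

rng = np.random.default_rng(4)
s = rng.uniform(0, 10, 2000)
x = s + rng.normal(0, 1.0, 2000)          # x-error SD 1
y = 2.0 * s + rng.normal(0, 2.0, 2000)    # y-error SD 2, true slope 2

b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b_dem = deming_slope(x, y, delta=4.0)     # delta = (2.0 / 1.0)**2
print(b_ols < b_dem)   # OLS is attenuated toward zero; Deming corrects it
```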

3.2 Example 2: Conical dispersion for replicated RNA-Seq data

In this section, our objective is to demonstrate that multidimensional linear MER has practical applications. We expect that there are many applications for PLR in the ‘big data’ world, but the application to RNA-Seq data analysis is convenient because of our previous collaborations in gene expression research. In particular, we discuss the use of the PLR method in the analysis of read counting errors in a replicated RNA-Seq dataset. The estimation of the experimental errors and confidence intervals for effect size measures are important problems in RNA-Seq analysis because experiments are usually performed with very few replicates [6]. The replicated data correspond to a Cartesian product of LRVVs: yij is the read count for the ith RNA tag of the jth replicate, n is the number of RNA tags, m is the number of replicates, and m ≪ n because RNA-Seq assays are highly multiplexed and n is large. The observation vectors for the RNA tags are denoted gi. Our error analysis algorithm is iterative and requires an initial guess for the expected error in the data; the read counting errors are assumed to be independent or approximately so. In our implementation, the initial error estimate takes the quadratic form of Eq 41. Let min(gi) correspond to a set of RNA tags with the lowest read counts; i.e., min(gi) is a set of points close to the origin of the regression graph. Then, q0 is initially estimated from the variance for min(gi); typically 1 ≤ q0 ≤ 50. An initial value for q2 is obtained by visual inspection of the data (Fig 5); typically 0.002 ≤ q2 ≤ 0.05 (Fig 6 legend). Then, the optimal weighted average for the minimum CVE is estimated, and the PLR method provides parameter estimates for the MER line. Then, the data are transformed so that the MER line passes through the origin with unit slope (Fig 5B and 5D); i.e., bPLR = 1 and amin ≊ 0. This adjustment is important because afterwards, the gi are conically dispersed about the 1:1 line. 
Then, the gi are partitioned into orthogonal components for deviation and mean.
Fig 5

Conical quadratic error parameters for replicated RNA-Seq data.

Replicate Arabidopsis thaliana RNA-Seq data from [46] for the (μg, 4 reps) sample in (A, B) and the (1g, 3 reps) sample in (C, D) are scaled so that the parametric linear regression line corresponds to the unit-slope line and passes through the origin. Then, the conical dispersion of points about the 1:1 line in the MER graph serves as a measure of the replication error, as shown in the projections (B, D). An iterative WLS regression of the mean squared deviation versus the mean value produces parameter estimates for the quadratic error model (Eq 41), as shown in (A, C). The corresponding 2σ threshold curves are shown in (B, D). The linear component of the error is shown as ‘convergent’ dashed curves. The larger overdispersion q2 in (C, D) compared to that in (A, B) indicates that there is significant variation in data quality between the two experiments. The histogram (D inset) shows that the read counts range over four decades.

Fig 6

Overdispersion of the residual effects in replicated RNA-Seq data.

Empirical survival functions for the root mean square scaled deviation from the conical dispersion analysis (see Fig 5) of the RNA-Seq data are plotted for six A. thaliana samples under fractional gravity [46]. Reference curves from the Monte Carlo simulation for the standard normal distribution are also shown. The slower decay in the observed survival functions indicates that RNA-Seq data are subject to heterogeneous dispersive effects. The inset shows the overdispersion for a 1000-point subset around the median for the μg sample. Thus, overdispersive effects are observed across the entire range of read counts. In the legend, each sample is labeled with its q2 coefficient for the quadratic error model. The variation in q2 indicates that there are significant data quality variations between samples.

Empirical survival functions for the root mean square scaled deviation () from the conical dispersion analysis (see Fig 5) of the RNA-Seq data are plotted for six A. thaliana samples under fractional gravity [46]. The and curves from the Monte Carlo simulation for the standard normal distribution are also shown. The slower decay in the observed indicates that RNA-Seq data are subject to heterogeneous dispersive effects. The inset shows the overdispersion for a 1000 point subset around the median for the μg sample. Thus, overdispersive effects are observed across the entire range of read counts. In the legend, each sample is labeled with its coefficient for the quadratic error model. The variation in indicates that there are significant data quality variations between samples. Statistically, only the magnitude of ( is important because the orientation is random. Next, we introduce the mean squared deviation (MSD) , and we obtain the two parameter representation for the dispersion , as shown in Fig 5A and 5C. We adopt a heuristic quadratic model for the MSD [14, 44] Then, a WLS quadratic regression of the MSD produces sample estimates for the q (Fig 5A and 5C), which then serve as the error model for the input data: as shown by the 2σ threshold curves in Fig 5B and 5D. The q0 and q1 terms correspond to the background noise and Poisson-like components of the dispersion, respectively [45]. The overdispersive component q2 suggests that there are uncontrolled experimental effects that cannot be averaged in a random manner. We note that for a well-designed data acquisition process, the q2 component should be minimized. The initial guess for is updated, and the algorithm iterates several times to reach convergence for both ℓ and {q}. The data for this demonstration come from a study of the response of Arabidopsis thaliana to fractional gravity [47] and can be retrieved from the publicly accessible NASA GeneLab Omics database [46]. 
The dataset contains read count data for six samples with varying gravities (g) {μ, 0.09, 0.18, 0.36, 0.57, 1}. For each sample, the replicates, with m = 3 or 4, correspond to a set of LRVVs with approximately identical noise properties, with the possible exception of a small subset of irreproducible outliers, which must be flagged. The read counts are distributed over more than three decades (Fig 5D inset), and the point estimates are skewed due to heavy-tail effects at large values [6]. Consequently, an upper limit for the mean value of the observation vectors is imposed, and 4% of the data are removed from the MER analysis; however, all of the data are included in the graphs (Figs 5 and 6). In Fig 5, we observe that the dispersion varies between the μg (A, B) and 1g (C, D) samples. As discussed above, the data are transformed such that the PLR slope is 1 and the curvature is negligible near the origin in log-log plots (B, D). We note that log-log scales are convenient for visualizing dispersion, but the data are not log-transformed in actual calculations; a log or ratio transformation results in nonlinear confounding effects for weighted averages. For duplicate read counts gi = (x, y), we define the scaled difference d, with the expected variance for the error in x − y given by the sum of the individual variances [21]. Then, we apply the condition |d| = 2 to obtain a quadratic equation, and the threshold curves in Fig 5B and 5D are calculated from the roots of this expression. The q2 component dominates at high read counts, as indicated by the convergence of the dashed curves for the linear components (q0, q1). Then, q2 serves as an indicator of the data quality and reproducibility, and we observe significant variations between the samples (Fig 6 legend). We define the root mean square (rms) scaled deviation over the m replicates. A comparison of the empirical survival distributions shows that there are heavy tails at large deviations for the six samples (Fig 6). 
We use Monte Carlo methods to simulate the read count data for normal distributions and confirm that the simulated distributions for the rms scaled deviation are identical to chi-squared distributions (Fig 6). The much slower decay in the experimental distributions is an indication of uncontrolled processes in the RNA-Seq assay that result in overdispersion in the gi. The inset graph in Fig 6 shows the overdispersion for a small subset around the median for the μg sample. Thus, the overdispersive effects are observed for intervals over the entire range of read counts. Statistical methods that assume normal distributions and do not account for this overdispersion will overestimate the effect size. In biological studies, the variable vectors for RNA-Seq are not linearly related. Then, our parametric LRVV framework serves an elementary role in searching for multiway dependencies in the variation of gene expression. A detailed discussion of weighted least squares optimization problems in gene expression data analysis, including normalization and the estimation of effect size, is a topic for a separate paper.
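The heavy-tail diagnostic above can be sketched generically. Here a Student-t distribution is a hypothetical stand-in for the overdispersed RNA-Seq deviations, compared against a standard-normal Monte Carlo reference:

```python
import numpy as np

rng = np.random.default_rng(6)

# Empirical survival function evaluated at a threshold: the fraction of
# |deviations| exceeding it.  Overdispersed data decay more slowly in the
# tail than the normal reference, the signature seen in Fig 6.
def survival_at(samples, threshold):
    return np.mean(np.abs(samples) > threshold)

normal = rng.normal(0.0, 1.0, 100000)            # MC reference
overdispersed = rng.standard_t(df=3, size=100000)  # heavy-tailed stand-in

for thr in (2.0, 3.0):
    print(thr, survival_at(normal, thr), survival_at(overdispersed, thr))
```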

4 Software

Numerical computations were performed using the Python (v3.8.4) language with the NumPy (v1.19.0) [48] and SciPy (v1.5.0) [49] packages. Figures were prepared using the Matplotlib (v3.4.1) package [50]. The manuscript was prepared using the MiKTeX (v2.9) implementation of TeX/LaTeX [51].

5 Discussion

In this work, we discuss the fact that statistical measures of linear dependence, including the covariance, correlation coefficient and regression slope, are subject to the chain rule. We develop a novel multidimensional linear regression algorithm in which the relation between variable vectors is parameterized by a weighted average, and the weights are determined from an error model for the input data. Then, the implementation of PLR involves the use of weighted least squares normal equations to estimate the parameters of the best-fit line for a set of LRVVs. Standard statistical concepts, including covariance and the Moore-Penrose pseudoinverse, provide the necessary intuition for the formulation of our PLR method. This contrasts with the ODRPACK algorithm, which requires iterative numerical optimization because the WOR least-squares equation cannot be solved analytically. The PLR and ODRPACK methods yield very similar results, but the latter is formulated only for bivariate data. We find that in the parametric representation of a linear relation, covariance and correlation are associated with vector quantities. We also identify scale invariant quantities for partitioned residual effects between the LRVVs for both the PLR and weighted orthogonal regression methods. Therefore, when the error model is undefined, MER is not a well-posed problem [52]; i.e., there is no unique solution. This supports previous suggestions that the specification of an error model is a requirement for solving the MER problem [27]. The error model is also essential for determining the distributions and corresponding confidence intervals of PLR parameter estimates and effect size. Constructing a realistic error model for a data acquisition process can be daunting, especially in omics studies, because it is necessary to account for all relevant sources of experimental error and system variability. 
However, an approximate trial and error approach can be informative, and Monte Carlo methods are convenient because they allow for the detailed simulation of stochastic effects in the data acquisition process. Finally, the averaging of variable vectors that are not linearly related results in a loss of information because of the destructive interference associated with the mixing of signal and noise. Consequently, accounting for measurement error in polynomial and multiple regression requires nonlinear optimization methods, which is a topic for a separate paper. We conclude that the parametric representation for the relation between variable vectors provides a more general framework for linear regression than the standard Cartesian representation.
PONE-D-21-22997
The chain rule, measurement error regression and RNA-Seq analysis
PLOS ONE Dear Dr. Luck, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please incorporate the suggestions/comments given by both the reviewers and revise the paper accordingly. Please submit your revised manuscript by Nov 06 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Sriparna Saha, PhD Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. 
Thank you for providing the following Funding Statement: The author, Stanley Luck, is a member of Vector Analytics LLC, which is a science consulting company. Vector Analytics LLC did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of the author are articulated in the `author contributions' section. The author received no specific funding for this work. We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study in the Author Contributions section of the online submission form. Please make any necessary amendments directly within this section of the online submission form.  Please also update your Funding Statement to include the following statement: “The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.” If the funding organization did have an additional role, please state and explain that role within your Funding Statement. 
"Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc. Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. 3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information. Additional Editor Comments: As commented by both the reviewers, the paper should be revised based on the suggestions provided. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: No ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: No ********** 3. 
Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: In this paper, the author proposed “The chain rule, measurement error regression and RNA-Seq analysis”. The strengths of the paper are that it is well structured, the description of the related work is well done and that results are extensively compared to results of the similar research. 1) Draw a graphical abstract of this work. 2) Justify the novelty of the proposed approach? 3) Proofread the article once again? 4) The experimental validation of the proposed approach is quite confusing justify it. 
5) Give some experimental comparison of the proposed approach with existing approach? Reviewer #2: The manuscript reading looks more like a technical report (heavy on the methodological description). To attract strong interest, it is highly recommended to revise the manuscript to focus on the major objects and findings in a concise manner. I have also some comments which are given below. 1. Title: Consider modifying it to be more specific. The wording of this current title ("The chain rule, measurement error regression and RNA-Seq analysis") is somewhat imprecise. 2. In page no 3 and 4 author mentioned seven novel contributions of his work but they are not clearly established in the whole paper. 3. Recommendation to revise the “data analysis and results” part in details. 4. The discussion part (page no. 23 and 24) does not clearly state the point of view of this paper. The revision is needed for this part. 5. I recommend that the author should revise the Fig. 1, Fig. 5 and Fig. 6 to illustrate the workflow. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. 
PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 16 Oct 2021 I have provided responses to each point raised by the reviewers in the attached `Response to Reviewers' document. Submitted filename: ResponseToReviewers.pdf Click here for additional data file. 19 Dec 2021 A parametric framework for multidimensional linear measurement error regression PONE-D-21-22997R1 Dear Dr. Luck, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. 
Kind regards,
Sriparna Saha, PhD
Academic Editor
PLOS ONE

Reviewers' Responses to Questions

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here.
Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #1: Yes
Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #1: Yes
Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #1: No
Reviewer #2: Yes

6. Review Comments to the Author
Reviewer #1: All my comments have been addressed clearly, and the explanations given by the author are impressive.
Reviewer #2: The author has made changes according to the instructions. The manuscript is better than the previous version, though the introduction is still not up to the mark.

7. Do you want your identity to be made public with this peer review? Reviewer #1: No. Reviewer #2: No.

22 Dec 2021

PONE-D-21-22997R1
A parametric framework for multidimensional linear measurement error regression

Dear Dr. Luck:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.
Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Sriparna Saha, Academic Editor, PLOS ONE
Related articles: 20 in total

1.  Dealing with discreteness: making 'exact' confidence intervals for proportions, differences of proportions, and odds ratios more exact.

Authors:  A Agresti
Journal:  Stat Methods Med Res       Date:  2003-01       Impact factor: 3.021

2.  On effect size. (Review)

Authors:  Ken Kelley; Kristopher J Preacher
Journal:  Psychol Methods       Date:  2012-04-30

3.  Whole genome scan detects an allelic variant of fad2 associated with increased oleic acid levels in maize.

Authors:  André Beló; Peizhong Zheng; Stanley Luck; Bo Shen; David J Meyer; Bailin Li; Scott Tingey; Antoni Rafalski
Journal:  Mol Genet Genomics       Date:  2007-10-13       Impact factor: 3.291

4.  Genome-wide expression quantitative trait loci (eQTL) analysis in maize.

Authors:  Beth Holloway; Stanley Luck; Mary Beatty; J-Antoni Rafalski; Bailin Li
Journal:  BMC Genomics       Date:  2011-06-30       Impact factor: 3.969

5.  Testing for differentially-expressed microRNAs with errors-in-variables nonparametric regression.

Authors:  Bin Wang; Shu-Guang Zhang; Xiao-Feng Wang; Ming Tan; Yaguang Xi
Journal:  PLoS One       Date:  2012-05-24       Impact factor: 3.240

6.  Erratum to: A survey of best practices for RNA-seq data analysis.

Authors:  Ana Conesa; Pedro Madrigal; Sonia Tarazona; David Gomez-Cabrero; Alejandra Cervera; Andrew McPherson; Michal Wojciech Szcześniak; Daniel J Gaffney; Laura L Elo; Xuegong Zhang; Ali Mortazavi
Journal:  Genome Biol       Date:  2016-08-26       Impact factor: 13.583

7.  RNAseq Analysis of the Response of Arabidopsis thaliana to Fractional Gravity Under Blue-Light Stimulation During Spaceflight.

Authors:  Raúl Herranz; Joshua P Vandenbrink; Alicia Villacampa; Aránzazu Manzano; William L Poehlman; Frank Alex Feltus; John Z Kiss; Francisco Javier Medina
Journal:  Front Plant Sci       Date:  2019-11-26       Impact factor: 5.753

8.  Error Analysis and Propagation in Metabolomics Data Analysis.

Authors:  Hunter N B Moseley
Journal:  Comput Struct Biotechnol J       Date:  2013-01-01       Impact factor: 7.271

9.  Impact of measurement error on testing genetic association with quantitative traits.

Authors:  Jiemin Liao; Xiang Li; Tien-Yin Wong; Jie Jin Wang; Chiea Chuen Khor; E Shyong Tai; Tin Aung; Yik-Ying Teo; Ching-Yu Cheng
Journal:  PLoS One       Date:  2014-01-24       Impact factor: 3.240

10.  Corruption of the Pearson correlation coefficient by measurement error and its estimation, bias, and correction under different error models.

Authors:  Edoardo Saccenti; Age K Smilde; Margriet H W B Hendriks
Journal:  Sci Rep       Date:  2020-01-16       Impact factor: 4.379

