Todd K. Moon and Jacob H. Gunther, Electrical and Computer Engineering Department, Utah State University, Logan, UT 84332, USA.
Abstract
Estimating the parameters of an autoregressive (AR) random process is a well-studied problem. In many applications, only noisy measurements of the AR process are available. The effect of the additive noise is that the system can be modeled as an AR model with colored noise, even when the measurement noise is white, where the correlation matrix depends on the AR parameters. Because of the correlation, it is expedient to compute using multiple stacked observations. Performing a weighted least-squares estimation of the AR parameters using an inverse covariance weighting can provide significantly better parameter estimates, with improvement increasing with the stack depth. The estimation algorithm is essentially a vector RLS adaptive filter with a time-varying covariance matrix. Different ways of estimating the unknown covariance are presented, as well as a method to estimate the variances of the AR and observation noise. The notation is extended to vector autoregressive (VAR) processes. Simulation results demonstrate performance improvements in coefficient error and in spectrum estimation.
Keywords: RLS algorithm; autoregressive model estimation; spectrum estimation; vector AR model
1. Introduction

The problem of estimating the parameters of an autoregressive (AR) process

x[t] = \sum_{\ell=1}^{p} b_\ell\, x[t-\ell] + f[t],
where the input f[t] is a white noise process, is important in many aspects of signal processing. It plays a role in a variety of applications, such as speech coding and analysis (see References [1,2,3,4,5,6,7]). Autoregressive modeling is instrumental in many spectrum estimation algorithms, and algorithms for noise-free measurements have been developed from that perspective [8,9,10,11,12,13,14]; see also Reference [5] for a survey and perspective. Vector autoregressive modeling is also important in econometric modeling [15]. Among many other applications, AR models have also been used in biomedical signal processing (see, e.g., Reference [16]); communication (see, e.g., Reference [17]); and financial modeling (see, e.g., Reference [15]).

Because of its importance, the AR parameter estimation problem has been well-studied for realizations of the noise-free process. When autocorrelations are known (or estimated), the Yule-Walker equations describe the solutions [18,19]. This problem may also be considered as the problem of finding optimal predictor coefficients, and solutions invoking adaptive filters are well known [20]. The parameters can also be estimated using a Kalman filter applied to a state-space formulation ([21], Section 3.3), in which the unknown parameters are taken to be the state, and the observation matrix is formed from the measurements. The Kalman filter method bears resemblance to the approach presented here, but as will be shown there are differences between the two. First, in the case of noisy observations, the previous AR values are not known, so the elements of the observation matrix are only approximately known in the Kalman filter approach. No such approximation is used in the method here. Second, the approach presented here does not require the introduction of a system noise, with the attendant question of what its variance should be. Third, instead of a scalar observation equation, vector observations are used, allowing the covariance structure to be exploited with a time-varying covariance.

It is well known ([5], Section 6.7), [22,23] that additive noise broadens spectral peaks in autoregressive spectral estimation and can result in loss of resolution of closely-spaced sinusoids. Dealing with noise added to AR processes has been addressed many times in the literature, as we summarize below. None of these previous methods use the covariance structure employed here. Some of these techniques are related to finding rank p representations of a data matrix (using, for example, an SVD or total least squares) [24,25,26,27,28,29,30]. From another perspective, it is known that white noise added to an AR(p) process produces an ARMA(p,p) process. This has resulted in algorithms for dealing with the noisy AR process using an ARMA estimator. For example, modified Yule-Walker (YW) equations [31] can be used to find the ARMA model. However, this does not take advantage of the relationship between the AR and the MA parameters in this noisy AR case. Another approach is to estimate the ARMA process using a large-order AR model. Another approach to handling noisy AR observations is total least squares (TLS) [32,33,34,35]. None of these approaches is equivalent to the approach described here. Noisy AR estimation has been studied over many years by Zheng [36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55], who estimates the variance of the observation noise to reduce bias.
An approach that is somewhat similar to ours is bias-corrected RLS (BCRLS) [56,57,58,59,60,61], but BCRLS does not employ the covariance matrix-weighted least squares used here. Other approaches to noisy AR parameter estimation have been investigated in References [4,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92].

The approach described here uses the fact that the observation noise produces an AR-like process with correlated noise, where the noise correlation depends on the AR coefficients. The correlation is exploited by dealing with vectors of stacked observations, rather than scalar observations. A maximum likelihood approach results in a weighted least-squares problem, in which previous estimates are used to update the covariance matrix in an iterative fashion. We refer to this general algorithm as iterative covariance-weighted autoregressive estimation (ICWARE). While maximum likelihood has been used for AR parameter estimation [18] ([93], Chapter 5), [94], in previous estimators the noise is assumed white, so the autocorrelation structure derived here is not exploited. Iterative parameter estimation has been used for system parameter estimation. An early iterative method of fitting a given impulse response to a transfer function, without an explicit noise model, is in Reference [95]. Iterative quadratic ML (IQML) [96] is another iterative technique. IQML is similar to the algorithm of Reference [97], which requires knowledge of the input signal. IQML is also similar to the method of Reference [98]. An IQML approach [99] has been used for noisy AR estimation. None of these make use of the covariance weighting of ICWARE.

There is some similarity of this approach to the measurement error models studied in the statistical literature [100], in which parameters of regression models are estimated when only noisy observations of the regressors are available. These noisy observations induce changes to conventional minimum mean-square error (MMSE) estimators which are somewhat related to errors-in-variables approaches to AR estimation [101,102,103].

Another approach to this problem makes use of coupled Kalman filters [104] or cross-coupled filters [105,106]. In essence these move beyond the Kalman filter approach referenced above ([21], Section 3.3) and operate as follows. One Kalman filter estimates the true autoregressive signal, assuming knowledge of the AR parameters, while the other Kalman filter estimates the AR parameters using the signal estimates as if they were the true values. These two filters operate in conjunction to jointly converge to the AR parameter estimate. To evaluate our method in comparison with the dual Kalman filter method, the algorithm here is compared to models also seen in Reference [104].

Recently, Monte Carlo methods, such as particle filtering, have been used for AR and ARMA estimation [107,108,109,110]. Particle filtering methods are quite general and can track dynamical variables and jointly estimate static parameters. These methods offer the possibility of convergence to good estimates in situations when the noise is not Gaussian (which has been assumed in this paper), but carry the usual drawback of potentially high computational complexity if many particles are used.
These represent an alternative to the methods of this paper, and a thorough comparison, while valuable, lies beyond the scope of this paper.

This topic is relevant to its journal of publication (Entropy), since spectral analysis under a maximum entropy criterion has (famously) been shown to be equivalent to AR modeling [11]. As this work shows, however, when there are noisy observations, particular attention must be devoted to the information present in the autocorrelation matrix of the observations, beyond the first-order equations that result from the historical maximum entropy approach.

We present the method first for the scalar AR model. For generality, the noise is assumed to be circular complex Gaussian; the modifications for real signals are straightforward. In addition to the coefficient estimation problem, we also address the problem of variance estimation. Following the development for scalar AR processes, the notation is developed for vector AR (VAR) processes. Several simulation results demonstrate that the ICWARE method provides significant improvements over classical YW-type methods.
2. Scalar Parameter Estimation in Noise
The AR model and the noisy observation equation, represented by the system in Figure 1, are

x[t] = \sum_{\ell=1}^{p} b_\ell\, x[t-\ell] + f[t],   (1)

y[t] = x[t] + w[t],   (2)

where b = [b_1, b_2, \ldots, b_p]^T and each b_\ell \in \mathbb{C}, and where f[t] is a white noise process. The process x[t] is said to be the noise-free AR process, and the process y[t] is said to be the noisy AR process.
Figure 1
Noisy Autoregressive Model.
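As a concrete illustration of the model (1)-(2), the following Python sketch generates a noisy complex AR process. (This code is our own illustration, not from the paper; the function name make_noisy_ar and the variance values are hypothetical.)

import numpy as np

def make_noisy_ar(b, sf2, sw2, N, rng):
    """Generate y[t] = x[t] + w[t] (Eqs. (1)-(2)), where x[t] is AR(p)
    with coefficients b, driven by complex circular Gaussian noise f."""
    p = len(b)

    def ccg(var, n):  # complex circular Gaussian samples, variance var
        return np.sqrt(var / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

    f, w = ccg(sf2, N), ccg(sw2, N)
    x = np.zeros(N, dtype=complex)
    for t in range(N):
        x[t] = sum(b[l - 1] * x[t - l] for l in range(1, p + 1) if t - l >= 0) + f[t]  # Eq. (1)
    return x + w  # Eq. (2)

# Illustrative use: p = 3 with poles 0.95 e^{j 2 pi theta} (Case 1 of Table 1 below)
rng = np.random.default_rng(0)
poles = 0.95 * np.exp(2j * np.pi * np.array([0.65, 0.70, 0.75]))
b = -np.poly(poles)[1:]  # coefficients such that 1 - b_1 z^{-1} - ... - b_p z^{-p} has these roots
y = make_noisy_ar(b, sf2=1.0, sw2=0.5, N=2000, rng=rng)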
The input and observation noises are assumed to be complex circular Gaussian, so that

f[t] \sim \mathcal{CN}(0, \sigma_f^2), \qquad w[t] \sim \mathcal{CN}(0, \sigma_w^2),

where the real and imaginary parts are uncorrelated, and draws at different times are independent. Furthermore, these noises are assumed to be uncorrelated with each other, so E[f[t]\bar{w}[s]] = 0 for all t and s.

Let q[t] = [y[t-1], y[t-2], \ldots, y[t-p]]^T (this vector is re-defined below to have a different length). The conventional least-squares approach to estimating b (in the forward prediction error sense) would be to compute

\hat{b} = \arg\min_b \sum_t \big| y[t] - q[t]^T b \big|^2,
which results in the conventional estimate

\hat{b} = \Big( \sum_t \bar{q}[t]\, q[t]^T \Big)^{-1} \sum_t \bar{q}[t]\, y[t].   (3)

This is essentially the covariance method [19]. The estimate (3) can be computed using a recursive formulation, resulting in a recursive least-squares (RLS) adaptive filter. This approach essentially neglects the noise w[t]. It is known that the noise will bias the estimate [22].

Note: The error could also be computed in a backward prediction error sense, or in a combined forward-backward sense as is done in some spectrum estimation algorithms such as the modified covariance method [14], ([5], Section 7.5). The ICWARE method we describe below can be extended to these generalizations as well, but for brevity only the forward prediction error is used.

Substituting the AR signal in terms of the observation into the AR model (1) and using the observation model (2), we obtain

y[t] = \sum_{\ell=1}^{p} b_\ell\, y[t-\ell] + u[t],   (4)

where

u[t] = f[t] + w[t] - \sum_{\ell=1}^{p} b_\ell\, w[t-\ell]   (5)

denotes the noise. The expression (4), with the noise expanded as in (5), is an ARMA(p,p) process and is the basis for some ARMA-based approaches to AR estimation in noise. The expression (4) has the appearance of an AR(p) process, except that the noise u[t] is not white, having correlation

r_u[k] = E[u[t]\bar{u}[t-k]] = \sigma_f^2\, \delta[k] + \sigma_w^2 \sum_{\ell} c_\ell\, \bar{c}_{\ell-k},   (6)

where c_0 = -1, c_\ell = b_\ell for 1 \le \ell \le p, and c_\ell = 0 otherwise. If the noise were uncorrelated, the solution of (3) would be least-squares optimal. However, since u[t] is correlated, the sample-by-sample approach of (3) is suboptimal since it does not take the correlation into account. As we now show, there is information in the covariance structure of the signal that can be used to improve the estimate.

To take the correlation into account, stack the observations of (4) into vectors of length d (that is, the depth), as

\begin{bmatrix} y[t] \\ y[t+1] \\ \vdots \\ y[t+d-1] \end{bmatrix} = \begin{bmatrix} q[t]^T \\ q[t+1]^T \\ \vdots \\ q[t+d-1]^T \end{bmatrix} b + \begin{bmatrix} u[t] \\ u[t+1] \\ \vdots \\ u[t+d-1] \end{bmatrix}.   (7)

Write this as

\mathbf{y}[t] = Q[t]\, b + \mathbf{u}[t].

The correlation matrix of the vector noise is

R = E[\mathbf{u}[t]\, \mathbf{u}[t]^H], \qquad [R]_{i,j} = r_u[i-j].   (8)

In moving through a sequence of data, the data can be advanced by a skip s to form a sequence of vectors and matrices, which we write for convenience as \mathbf{y}_m, Q_m, and \mathbf{u}_m, respectively. The starting index is chosen to make it possible to use only causal data.

Assuming independence (such as when s \ge d + p, so that distinct stacked noise vectors share no correlated samples) and assuming that the correlation function is known, the likelihood function can be written as a circular complex Gaussian as

L(b) = \prod_{m} \frac{1}{\pi^d \det R} \exp\!\left( -(\mathbf{y}_m - Q_m b)^H R^{-1} (\mathbf{y}_m - Q_m b) \right).   (9)

Finding the true maximum likelihood solution by directly maximizing (9) with respect to b is difficult, since b enters nonlinearly in R. Instead, an iterative approach is employed. An estimated value R̂_m is used at the mth step. Assuming still that each R̂_m is known, a maximum likelihood solution may be obtained from

\hat{b}_k = \arg\min_b \sum_{m=1}^{k} \lambda^{k-m} (\mathbf{y}_m - Q_m b)^H \hat{R}_m^{-1} (\mathbf{y}_m - Q_m b).   (10)

Here, a forgetting factor \lambda, with 0 < \lambda \le 1, has been introduced to allow for tracking a time-varying b. Significantly, the norm here is weighted by the inverse covariance \hat{R}_m^{-1}. If we take \hat{R}_m = I, we obtain the regular least-squares problem, and the additional correlation structure does not appear. Taking the gradient of the cost functional (10) results in the normal equation

\Big( \sum_{m=1}^{k} \lambda^{k-m} Q_m^H \hat{R}_m^{-1} Q_m \Big) \hat{b}_k = \sum_{m=1}^{k} \lambda^{k-m} Q_m^H \hat{R}_m^{-1} \mathbf{y}_m.   (11)

Write this as A_k \hat{b}_k = \rho_k. This can be recursively updated using RLS recursions (see, e.g., References ([111], Section 8.7), [112]), starting from an initial A_0 = \epsilon I for a small scalar \epsilon and propagating P_k = A_k^{-1}.

Computing (11) requires knowledge of R, which is not available since b is to be found. In combination with a recursively updated solution to (11), R is estimated using previously computed values of b̂ and used as a time-varying covariance matrix. That is, we write R̂_m = R(b̂_{m-1}). The result is the framework shown in Algorithm 1, similar to a vector RLS adaptive filter.

Algorithm 1 One step of ICWARE
Input: \mathbf{y}_m, Q_m, \hat{R}_m^{-1}
Previous conditions: \hat{b}_{m-1}, P_{m-1}
Compute \hat{b}_m and P_m using the RLS recursions for (11).
Update the covariance to obtain \hat{R}_{m+1} and compute \hat{R}_{m+1}^{-1}.
Return \hat{b}_m, P_m, \hat{R}_{m+1}^{-1}.

The "Update the covariance" step, detailed below, is how the structure of R can be used to improve the parameter estimates.

We have explored several different approaches to computing and using R̂:

1. Ignore R: Neglect the correlation structure and simply assume that R̂ = I. This gives the equivalent of taking scalar measurements and is used as a sort of worst-case basis for comparison among the different algorithms.

2. Use the correct value of R: That is, assume that b, σ_f², and σ_w² are known and compute R according to (6).
This provides a limit on best-case performance against which other methods can be compared.

3. Use the estimate of b: Using the correct values of σ_f² and σ_w², compute the autocorrelation matrix using b̂ in (6).

4. Estimate b, fix σ_f² and σ_w²: With assumed values of σ_f² and σ_w², compute the autocorrelation matrix using b̂ in (6).

5. Estimate everything: Estimate the values of σ_f² and σ_w², then use them with b̂ in (6).

In the early stages, while b̂ is poorly converged, it is best to use option 1 (assuming an identity covariance matrix) until b̂ has settled near its final value, then switch to option 4 using the estimated b̂ until the accumulated moments have converged sufficiently well that reliable estimates of the variances can be obtained and option 5 can be used. As described in Section 3, the information necessary to estimate the variances can be accumulated without yet having a decent estimate for b̂.

The covariance update/inverse does not necessarily need to be done at every step. Particularly as b̂ settles towards its final value, there is little to be gained by updating R̂ at every step.
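To make the covariance-weighted iteration concrete, the following batch sketch in Python alternates the weighted least-squares solve of (11) (with λ = 1) with the covariance update of (6). (This is our own minimal sketch, not the paper's implementation; the names noise_cov and icware_batch are ours, and the variances are assumed known, as in options 2-4.)

import numpy as np

def noise_cov(b, sf2, sw2, d):
    """Toeplitz covariance R of the colored noise u[t] per Eq. (6),
    with c_0 = -1 and c_ell = b_ell for 1 <= ell <= p."""
    p = len(b)
    c = np.concatenate(([-1.0 + 0j], np.asarray(b, dtype=complex)))
    r = np.zeros(d, dtype=complex)
    for k in range(d):
        r[k] = (sf2 if k == 0 else 0.0) + sw2 * sum(c[l] * np.conj(c[l - k]) for l in range(k, p + 1))
    R = np.empty((d, d), dtype=complex)
    for i in range(d):
        for j in range(d):
            R[i, j] = r[i - j] if i >= j else np.conj(r[j - i])
    return R

def icware_batch(y, p, d, s, sf2, sw2, n_iter=10):
    """Iterative covariance-weighted AR estimation (batch form): stack d
    prediction equations, weight by inv(R), then re-estimate R from b."""
    stacks = []
    for t0 in range(p, len(y) - d + 1, s):
        Q = np.array([[y[t - l] for l in range(1, p + 1)] for t in range(t0, t0 + d)])
        stacks.append((y[t0:t0 + d], Q))
    R_inv = np.eye(d)  # option 1 (identity) to initialize
    for _ in range(n_iter):
        A = np.zeros((p, p), dtype=complex)
        rho = np.zeros(p, dtype=complex)
        for ym, Qm in stacks:  # accumulate the normal equations (11), lambda = 1
            A += Qm.conj().T @ R_inv @ Qm
            rho += Qm.conj().T @ R_inv @ ym
        b_hat = np.linalg.solve(A, rho)
        R_inv = np.linalg.inv(noise_cov(b_hat, sf2, sw2, d))  # covariance update, Eq. (6)
    return b_hat

Continuing the earlier generator example, b_hat = icware_batch(y, p=3, d=4, s=2, sf2=1.0, sw2=0.5) recovers the coefficients; the online Algorithm 1 replaces the batch sums with RLS recursions and need not refresh R̂ at every step.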
3. Estimating the Variances
As the results below will indicate, fixed values of σ_f² and σ_w² can be used in the estimate of R instead of estimated values, so for the purpose of estimating b it is not strictly necessary to estimate these variances. But for other purposes it may be necessary to have estimates of σ_f² and σ_w². A maximum likelihood estimation approach is thus described in this section.

For a given estimate of the coefficients b̂, write the stacked noise vector as

\mathbf{u}_j = \mathbf{y}_j - Q_j \hat{b}.   (12)

For a sequence of k vectors \mathbf{u}_1, \ldots, \mathbf{u}_k, assumed to be nonoverlapping, encompassing N samples, the joint log likelihood function is a complex (circularly symmetric) Gaussian,

\ln L(\sigma_f^2, \sigma_w^2) = -k d \ln \pi - k \ln \det R - \sum_{j=1}^{k} \mathbf{u}_j^H R^{-1} \mathbf{u}_j.   (13)

Let

S = \frac{1}{k} \sum_{j=1}^{k} \mathbf{u}_j \mathbf{u}_j^H
denote the sample covariance. It can be shown that (see Appendix A)

\frac{\partial \ln L}{\partial \sigma_f^2} = -k\, \mathrm{tr}\!\left( R^{-1} (R - S) R^{-1} \right)   (14)
and that

\frac{\partial \ln L}{\partial \sigma_w^2} = -k\, \mathrm{tr}\!\left( R^{-1} (R - S) R^{-1} M \right),   (15)

where R = \sigma_f^2 I + \sigma_w^2 M and M = M(b) is the b-dependent Toeplitz factor implied by (6). The gradients (14) and (15) are equal to zero when

R = S.   (16)

The variance estimates are chosen to satisfy the condition (16) using an estimate of R formed from previous coefficient estimates, that is,

\sigma_f^2 I + \sigma_w^2 M(\hat{b}) = S.   (17)

We define the offset trace as

\mathrm{tr}_\ell(A) = \sum_{i} [A]_{i, i+\ell},
where the usual trace is obtained when ℓ = 0, and for ℓ > 0, the sum is taken on the ℓth superdiagonal. Let t_ℓ = tr_ℓ(S) for ℓ = 0, 1, …, d − 1, that is, the trace on the ℓth superdiagonal of the sample covariance. Taking the offset trace of (17) gives

\mathrm{tr}_\ell(S) = \sigma_f^2\, \mathrm{tr}_\ell(I) + \sigma_w^2\, \mathrm{tr}_\ell(M(\hat{b})), \qquad \ell = 0, 1, \ldots, d-1.   (18)

The ML estimates of the variances are the solutions to this overdetermined set of linear equations in σ_f² and σ_w².

A significant number of terms of S must be accumulated in order to estimate the variances well. Initially, before reasonably accurate estimates of b̂ are available, using inaccurate b̂s can result in a highly inaccurate S. As a result, it is desirable to accumulate moments of the data without using b̂. The trace of the sample covariance is

\mathrm{tr}(S) = \frac{1}{k} \sum_{j=1}^{k} \left( \mathbf{y}_j^H \mathbf{y}_j - \hat{b}^H Q_j^H \mathbf{y}_j - \mathbf{y}_j^H Q_j \hat{b} + \hat{b}^H Q_j^H Q_j \hat{b} \right).

The different terms can be written as running sums of data moments such as \sum_j \mathbf{y}_j^H \mathbf{y}_j, \sum_j Q_j^H \mathbf{y}_j, and \sum_j Q_j^H Q_j, together with their offset-diagonal counterparts.
(It is noted that, except when ℓ = 0, the superdiagonal and subdiagonal moment terms are not conjugates of each other, so that both of these terms must be retained.)

These terms can be computed by recursively propagating the corresponding running sums for ℓ = 0, 1, …, d − 1, accumulated directly from the data. Then tr_ℓ(S) can be formed for any current estimate b̂ without revisiting the data.

Figure 2 shows the result of this estimation in an example. The mean and standard deviation over 50 independent runs are shown, for up to 2000 samples averaged to produce S. The true value of b is used. The least-squares solution to (18) can produce variance estimates which are negative, which is nonphysical. Solutions constrained so that the variances lie in the range [0, ∞) were also computed using CVX [113]. From these results, about 500 samples are needed before the variance estimates converge to values somewhat close to the true values.
Figure 2
Example of estimating the variances σ_f² (top plot) and σ_w² (bottom plot), using the true b. k indicates the number of points used in the estimates. Shading indicates the standard deviation of the estimates over 50 iterations. The bounded solution is computed using CVX to avoid negative estimates.
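As an illustration of the offset-trace fit, the following sketch (our own code, reusing noise_cov from the earlier sketch) forms the equations (18) and solves them in the least-squares sense; for simplicity it works with real parts and clips negative solutions, whereas the paper uses a constrained solve via CVX:

import numpy as np

def offset_trace(A, ell):
    # tr_ell(A): sum along the ell-th superdiagonal (ell = 0 is the usual trace)
    return np.trace(A, offset=ell)

def estimate_variances(S, b_hat, d):
    """Solve Eq. (18), tr_ell(S) = sf2 tr_ell(I) + sw2 tr_ell(M(b)),
    for (sf2, sw2) in the least-squares sense."""
    M = noise_cov(b_hat, sf2=0.0, sw2=1.0, d=d)  # b-dependent part of R
    A = np.zeros((d, 2))
    rhs = np.zeros(d)
    for ell in range(d):
        A[ell, 0] = float(d) if ell == 0 else 0.0  # tr_ell(I)
        A[ell, 1] = offset_trace(M, ell).real      # simplification: real parts only
        rhs[ell] = offset_trace(S, ell).real
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return max(sol[0], 0.0), max(sol[1], 0.0)      # clip nonphysical negative estimates

Here S is the accumulated sample covariance (1/k) Σ_j u_j u_j^H described above.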
4. Vector Autoregressive Formulation
In this section we extend the results of the previous section to vector autoregressive random processes in noise, where at each time a vector of length K is produced, and the coefficients are K × K matrices. In the interest of generality, a constant offset is also included. The noisy observations can be written as

x[t] = c + \sum_{\ell=1}^{p} A_\ell\, x[t-\ell] + f[t], \qquad y[t] = x[t] + w[t].

As in the scalar case, the noisy observations can be written as

y[t] = c + \sum_{\ell=1}^{p} A_\ell\, y[t-\ell] + u[t],
where

u[t] = f[t] + w[t] - \sum_{\ell=1}^{p} A_\ell\, w[t-\ell].

Following the usual practice [15], this is vectorized to obtain a vector of unknown parameters as follows. Collect the coefficients and regressors as

\mathcal{A} = [c, A_1, \ldots, A_p], \qquad q[t] = [1, y[t-1]^T, \ldots, y[t-p]^T]^T,

so that y[t] = \mathcal{A}\, q[t] + u[t]. Applying the vec operator ([114], Chapter 9) we obtain

y[t] = (q[t]^T \otimes I_K)\, \mathrm{vec}(\mathcal{A}) + u[t].

Let
\theta = \mathrm{vec}(\mathcal{A})

and

G[t] = q[t]^T \otimes I_K.

Then

y[t] = G[t]\, \theta + u[t].

The noise structure for a single vector can be written as

E[u[t]\, u[t]^H] = \Sigma_f + \Sigma_w + \sum_{\ell=1}^{p} A_\ell \Sigma_w A_\ell^H,

where \Sigma_f and \Sigma_w are the covariances of f[t] and w[t]. We also find, for 1 ≤ k ≤ p,

E[u[t]\, u[t-k]^H] = -A_k \Sigma_w + \sum_{j=1}^{p-k} A_{k+j} \Sigma_w A_j^H,

and E[u[t] u[t−k]^H] = 0 for k > p. As in the scalar case, stack up multiple observations so that the correlation structure may be exploited:

\begin{bmatrix} y[t] \\ \vdots \\ y[t+d-1] \end{bmatrix} = \begin{bmatrix} G[t] \\ \vdots \\ G[t+d-1] \end{bmatrix} \theta + \begin{bmatrix} u[t] \\ \vdots \\ u[t+d-1] \end{bmatrix}.

Write this as

\mathbf{y}[t] = \mathcal{G}[t]\, \theta + \mathbf{u}[t].

The correlation structure of the stacked noise vector is

\mathcal{R} = E[\mathbf{u}[t]\, \mathbf{u}[t]^H],

with K × K blocks given by the lag correlations above. Let \mathbf{y}_m and \mathcal{G}_m be a sequence of vectors and matrices where the vector samples are skipped by s samples at each step. Following the method of Section 2, the estimate \hat{\theta}_k
is determined by the solution to the normal equations

\Big( \sum_{m=1}^{k} \lambda^{k-m} \mathcal{G}_m^H \hat{\mathcal{R}}_m^{-1} \mathcal{G}_m \Big) \hat{\theta}_k = \sum_{m=1}^{k} \lambda^{k-m} \mathcal{G}_m^H \hat{\mathcal{R}}_m^{-1} \mathbf{y}_m.

With the appropriate use of vectorized data, Algorithm 1 can be used for the VAR case as well.
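The vectorization step is easy to verify numerically. The following sketch (ours, with hypothetical dimensions) checks that the VAR prediction \mathcal{A} q[t] matches (q[t]^T ⊗ I_K) vec(\mathcal{A}), per the identity vec(ABC) = (C^T ⊗ A) vec(B):

import numpy as np

K, p = 2, 3
rng = np.random.default_rng(1)
A_cal = rng.standard_normal((K, 1 + K * p))              # [c, A_1, ..., A_p]
q = np.concatenate(([1.0], rng.standard_normal(K * p)))  # [1, x[t-1]^T, ..., x[t-p]^T]

lhs = A_cal @ q                                # direct VAR prediction
G = np.kron(q.reshape(1, -1), np.eye(K))       # q^T kron I_K, shape K x K(1 + Kp)
theta = A_cal.reshape(-1, order="F")           # vec(A_cal), column-major stacking
assert np.allclose(lhs, G @ theta)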
5. Some Results
Several test cases were examined to determine the performance of the ICWARE approach; these are summarized in Table 1, where the pole locations are given in polar form r e^{jθ}.
Table 1
Parameters for test cases.

Case | Order | Pole Locations
1    | 3     | 0.95 e^{j2π(0.65)}, 0.95 e^{j2π(0.7)}, 0.95 e^{j2π(0.75)}
2    | 3     | 0.9 e^{j2π(0.65)}, 0.9 e^{j2π(0.7)}, 0.9 e^{j2π(0.75)}
3    | 3     | e^{j2π(0.65)}, e^{j2π(0.7)}, e^{j2π(0.75)}
4    | 6     | 0.75 e^{±j2π(0.1)}, 0.8 e^{±j2π(0.2)}, 0.85 e^{±j2π(0.35)}
5    | 6     | 0.98 e^{±j2π(0.05)}, 0.97 e^{±j2π(0.15)}, 0.8 e^{±j2π(0.35)}
6    | 6     | 0.98 e^{±j2π(0.1)}, 0.97 e^{±j2π(0.1)}, 0.98 e^{±j2π(0.15)}
The first example, designated Case 1, is a third-order system having poles at 0.95 e^{j2π(0.65)}, 0.95 e^{j2π(0.7)}, and 0.95 e^{j2π(0.75)}, resulting in a fairly narrowband complex AR signal. The noise variances σ_f² and σ_w² are held fixed. This case is used to explore some of the variations in performance as parameters of the algorithm are varied. Figure 3 shows the coefficient error as a function of iteration using various forms of R̂ and values of d (the stack height) from 1 to 7, with the skip s fixed. The results are obtained by averaging the results of 50 independent runs. In these plots, "Iteration number" refers to the number of blocks of data used in an online scheme, each iteration of the algorithm corresponding to one data block. The results are compared with the error for YW with noisy measurements, YW with noise-free measurements, and scalar AR estimation. The autocorrelation values for the YW method are computed using 30,000 points. The Burg method was also used, but with this many points there is little difference between Burg and conventional YW. The scalar AR estimation (black dotted line) converges to the YW performance, as does ICWARE with d = 1, as expected.
Figure 3
Case 1: Error performance for different values of d with the skip s fixed. (a) Solid: true R; (b) Dashed: R estimated from b̂ and correct variances; (c) Dotted: R estimated from b̂ and fixed incorrect variances. Comparison: black dotted: conventional least-squares; solid black: Yule-Walker (YW) with noisy observations; dashed black: YW with noise-free observations.
In Figure 3, three different ways of determining R̂ are used. The solid lines (subplot (a)) use the true R. This is, of course, unavailable in practice, but these plots serve as a basis for comparison against realizable algorithms. The dashed lines (subplot (b)) use an identity matrix for R̂ for the first 30 iterations (to establish an estimate of b̂), following which R̂ is computed using the estimated b̂ and the correct variances σ_f² and σ_w². The dotted lines (subplot (c)) are similar, except that fixed (but wrong) values of σ_f² and σ_w² are used in the computation of R̂. It is clear that as d increases, the performance significantly improves, especially for the first few values of d. Interestingly, for the largest values of d, the error performance is on the order of the error of the noise-free YW.

The improvements of the new method after 1000 iterations (the estimate denoted by b̂_k) compared with the YW result are tabulated in Table 2 for s = 2, 3, and 4. There is little difference between s = 2 and s = 3, but when s = 4, that is, when s exceeds the model order p, the performance declines.
Table 2
Comparison of the error of the new method after 1000 iterations with the YW error, 10 log₁₀(‖b − b̂_k‖₂² / ‖b − b̂_YW‖₂²) (dB).

d | s = 2 | s = 3 | s = 4
2 | −5.9  | −5.8  | 11.1
3 | −12.4 | −12.5 | 4.5
4 | −17.5 | −17.5 | −0.6
5 | −20.6 | −20.6 | −3.9
6 | −22.6 | −22.7 | −5.9
7 | −23.9 | −24.0 | −7.2
Figure 4 shows the influence of the skip s on the performance, with the same methods of estimating R̂ as in Figure 3. For a fixed value of d, the error curves for different values of s are shown. Larger s does improve the performance, but only slightly.
Figure 4
Case 1: Error performance for different values of s with d fixed. (a) Solid: true R; (b) Dashed: R estimated from b̂ and correct variances; (c) Dotted: R estimated from b̂ and fixed incorrect variances. Comparison: black dotted: conventional least-squares; solid black: Yule-Walker (YW) with noisy observations; dashed black: YW with noise-free observations.

Figure 5 shows estimates of the variances σ_f² and σ_w² for different values of d. Obviously we need d ≥ 2 (in order to have two equations). However, for the smallest values of d the estimates have a very large variance. For larger d, the variance estimates are very close to the true variances.
Figure 5
Case 1: Estimated variances for different values of d. Top: estimated σ_f²; Bottom: estimated σ_w².
Figure 6 shows the magnitude frequency response for the true spectrum, the YW spectrum, and the spectrum computed from the ICWARE estimate b̂ after convergence. (The spectrum is not an average, but only a single outcome.) The ICWARE estimated spectrum is very close to the true spectrum, while the YW spectrum has considerable bias, exhibiting strong peaks not present in the true spectrum.
Figure 6
Case 1: Comparison of ICWARE estimated spectrum with true spectrum, and the YW estimated spectrum.
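For reference, the plotted spectra follow directly from the coefficient estimates through the AR transfer function. A minimal sketch (ours, assuming the AR model (1) with driving variance σ_f²):

import numpy as np

def ar_spectrum(b, sf2, n_freq=512):
    """AR power spectrum S(w) = sf2 / |1 - sum_l b_l e^{-j w l}|^2."""
    w = np.linspace(0.0, 2.0 * np.pi, n_freq, endpoint=False)
    denom = np.ones(n_freq, dtype=complex)
    for l, bl in enumerate(b, start=1):
        denom -= bl * np.exp(-1j * w * l)
    return sf2 / np.abs(denom) ** 2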
As a point of comparison, TLS is used to estimate the AR parameters. In TLS, the system of equations

Q b \approx \mathbf{y}

(built from M rows of lagged data starting at some sample m) is perturbed in both the matrix on the left and the vector of observations on the right. TLS can be computed using the singular value decomposition (SVD), making it computationally complex. It also makes it more difficult to create algorithms which track changing parameters, and, as shown below, it gives inferior parameter estimates compared with the method presented here. Figure 7 shows the error for b computed using TLS. The parameter M is the height of the linear system that is solved using the TLS method described in Reference [114]. The values M = 20, 60, and 140 are examined. At each iteration, a different set of M points is used in the TLS equations, each of which is solved using a (relatively complicated) singular value decomposition. Even with 20 rows in the TLS equations, performance only roughly comparable to the YW equations is obtained. For 60 or more equations, the improvement of the performance quickly saturates, so that 60 and 140 perform rather comparably. The computational complexity is rather high: for each solution (at each iteration), the SVD of a matrix with M rows is computed. Despite the computational complexity of TLS, the ICWARE method performs better than TLS. (The method of estimating R̂ is the same as used for Figure 3b.)
Figure 7
Case 1: Comparison of new method with total least squares (TLS) solutions for different TLS sizes. Dashed: ICWARE for different values of d; Solid: TLS for different matrix sizes.
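For comparison, a compact SVD-based TLS fit of the lagged equations can be sketched as follows (our own code; the helper name tls_ar and the indexing convention are ours). Each call computes the SVD of the M × (p+1) augmented matrix, which is the per-solve cost referred to above:

import numpy as np

def tls_ar(y, p, M, t0):
    """Total least squares AR fit: perturb both the lag matrix and the
    observation vector; the solution comes from the right singular
    vector of [Q | y] associated with the smallest singular value."""
    Q = np.array([[y[t - l] for l in range(1, p + 1)] for t in range(t0, t0 + M)])
    C = np.column_stack([Q, y[t0:t0 + M]])  # M x (p+1) augmented matrix
    _, _, Vh = np.linalg.svd(C)
    v = Vh[-1].conj()                       # right singular vector for smallest sigma
    return -v[:p] / v[p]                    # scale so the null direction is [b; -1]

For example, b_tls = tls_ar(y, p=3, M=60, t0=100) fits one window of the data.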
The results presented to this point use 1000 blocks of data. A consideration is whether it is possible to iteratively re-use data so that less data is required while still obtaining the reduced error of the ICWARE algorithm. In Figure 8, blocks of 100 samples are employed, and the processing is repeated 10 times on each block. Figure 9 shows the error as a result of each iteration. Each line in the plot corresponds to one pass through the data. The top plot uses R̂ estimated using b̂ and the true variances; the bottom plot uses R̂ estimated using b̂ and fixed variances. These figures show that iterating over the data provides essentially the same performance as a longer data record.
Figure 8
Case 1: Error performance of 10 repetitions on blocks of length 100. (a) Solid: true R; (b) Dashed: R estimated from b̂ and correct variances; (c) Dotted: R estimated from b̂ and fixed incorrect variances. Comparison: black dotted: conventional least-squares; solid black: Yule-Walker (YW) with noisy observations; dashed black: YW with noise-free observations.
Figure 9
Performance of 10 repetitions on blocks of length 100, with the results folded for each iteration. Top: R̂ estimated using b̂ and true variances; Bottom: R̂ estimated using b̂ and fixed variances.
The next example, designated Case 2, is also a third-order system with poles at the same angles as Case 1, but with radius 0.9. The following figures show examples of the same sort of results shown for Case 1. In Figure 10 the various methods are compared. In this case, however, the minimum error does not reach as low as the noise-free YW case. Figure 11 shows the resulting spectrum, again showing that the ICWARE spectrum is closer to the true spectrum than the YW spectrum.
Figure 10
Case 2: Error performance for different values of d with the skip s fixed. (a) Solid: true R; (b) Dashed: R estimated from b̂ and correct variances; (c) Dotted: R estimated from b̂ and fixed incorrect variances. Comparison: black dotted: conventional least-squares; solid black: Yule-Walker (YW) with noisy observations; dashed black: YW with noise-free observations.
Figure 11
Case 2: Comparison of ICWARE estimated spectrum with true spectrum, and the YW estimated spectrum.
Case 3 involves a system having poles on the unit circle (that is, pure sinusoidal signals) at angles 2π(0.65), 2π(0.7), and 2π(0.75). The estimated spectra are shown in Figure 12. The spectral peaks are clearly evident in the ICWARE estimate.
Figure 12
Case 3: Comparison of ICWARE estimated spectrum with true spectrum, and the YW estimated spectrum.
As another comparison, ICWARE spectral estimates are compared to the spectra from dual Kalman filters using the examples from Reference [104], shown as Cases 4, 5, and 6 in Table 1. Figure 13 shows the results. Spectra for twenty realizations are plotted, along with the true spectrum and the mean. The values of d and s were selected based on the simulations for Case 1 (above), which suggest these are reasonable values. Comparison of these results with Figures 3, 4, and 6 of Reference [104] shows fairly comparable performance. When the poles are not near the unit circle (Case 4), the ICWARE estimates tend to have smoother spectra (estimated poles with smaller magnitude). When the poles are nearer the unit circle (Case 5, Case 6), ICWARE seems to do a comparable job at capturing the peaks, and does somewhat better at representing the high-frequency dropoff of the spectrum. In all of these simulations, an estimated R̂ was employed.
Figure 13
Spectral estimation results for examples from Reference [104]. (a) Case 4; (b) Case 5; (c) Case 6.
6. Summary and Conclusions
In this paper, we have shown that accounting for observation noise added to a pure AR(p) process results in noise which is correlated across lags. This makes it expedient to employ stacked observation vectors and to estimate the parameters in a vector sense. The result is an algorithm that is essentially a vector RLS adaptive filter. Several different methods were described to obtain information about the correlation matrix R. Also, estimation of the variances of the AR and observation noise was described.

It was shown by simulation that the method can be applied repetitively to the same block of data, providing accurate results with data of moderate length.

The improvement of the technique compared with Yule-Walker is a function of the depth d to which the observations are stacked. Values of d even greater than p continue to yield improvement. The improvement relative to YW is also system dependent. Based on simulations, a depth d greater than the order p of the system seems a reasonable choice.

Comparisons were made against a dual Kalman filter approach, with comparable or slightly superior results.